EmbeddedRelated.com

Wishbone comments

Started by Martin Schoeberl November 23, 2005
After implementing the Wishbone interface for main memory access
from JOP, I see several issues with the Wishbone specification that
make it not the best choice for SoC interconnect.

The Wishbone interface specification is still in the tradition of
microcomputer or backplane busses. However, for a SoC interconnect,
which is usually point-to-point, this is not the best approach.

The master is required to hold the address and data valid through
the whole read or write cycle. This complicates the connection to a
master that has the data valid for only one cycle. In this case the
address and data have to be registered *before* the Wishbone connection,
or an expensive (in time and resources) MUX has to be used. A register
results in one additional cycle of latency. A better approach would be
to register the address and data in the slave. Then there is also
time to perform address decoding in the slave (before the address
register).
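To make the handshake concrete, here is a minimal cycle-by-cycle sketch in Python (pseudo-RTL, my own illustration; only the signal names cyc/stb/ack follow the Wishbone convention) of the classic read cycle described above, where the master must keep the address and strobe stable until the slave answers:

```python
# Hypothetical sketch of a classic Wishbone read cycle:
# the master must hold cyc/stb/adr stable until the slave asserts ack.
def classic_wb_read(addr, slave_wait_states):
    cycles = 0
    ack = False
    while not ack:
        # Master drives cyc=1, stb=1, adr=addr every cycle (held stable).
        cycles += 1
        # Slave asserts ack only after its wait states have elapsed.
        ack = cycles > slave_wait_states
    return cycles  # total cycles the master had to hold the address

# With one wait state, the master holds the address for two cycles.
print(classic_wb_read(0x100, slave_wait_states=1))  # -> 2
```

A single-cycle master would have to register addr before this loop, which is exactly the extra latency the post complains about.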

There is a similar issue for the output data from the slave: as it
is only valid for a single cycle, it has to be registered by the
master when the processor is not reading it immediately. Therefore,
the slave should keep the last valid data at its output even when
wb.stb is no longer asserted (which adds no hardware
complexity).

The Wishbone connection for JOP resulted in an unregistered Wishbone
memory interface and registers for the address and data in the
Wishbone master. However, for a fast address and control output time
(tco) and a short setup time (tsu) we want the registers in the IO
pads of the FPGA. With the registers buried in the WB master it takes
some effort to set the right constraints for the synthesizer to
implement such IO registers.

The same issue applies to the control signals. The translation from
the wb.cyc, wb.stb and wb.we signals to ncs, noe and nwe for the
SRAM is on the critical path.

The ack signal comes too late for a pipelined master. We would need to
know *earlier* when the next data will be available, and this
is possible, as the slave knows when the data from the SRAM
will arrive. A workaround is a non-WB-conforming early ack
signal.

Because the data registers are not inside the WB interface,
we need an extra WB interface for the Flash/NAND interface (on the
Cyclone board). We cannot afford the address decoding and a MUX in
the data read path without registers; this would add an extra
cycle to the memory read due to the combinational delay.

In the WB specification (AFAIK) there is no way to perform pipelined
reads or writes. However, for blocked memory transfers (e.g. a cache
load) pipelining is the usual way to get good performance.
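The cost is easy to quantify. A sketch in Python (illustrative numbers only, not taken from the post): for a blocked transfer of n words from a memory with w wait states per access, non-pipelined reads pay the full latency for every word, while a pipelined slave streams one word per cycle after the initial latency:

```python
# Illustrative cycle counts for a blocked transfer (e.g. a cache line
# fill) of n words from a memory with w wait states per access.
def non_pipelined_cycles(n, w):
    # Each read pays the full latency: command plus w wait states per word.
    return n * (1 + w)

def pipelined_cycles(n, w):
    # After the first access's latency, one word arrives per cycle.
    return (1 + w) + (n - 1)

# An 8-word cache line from a memory with one wait state:
print(non_pipelined_cycles(8, 1))  # -> 16
print(pipelined_cycles(8, 1))      # -> 9
```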

Conclusion -- I would prefer:

* Address and data (in/out) registers in the slave
* A way to know earlier when data will be available (or
a write has finished)
* Pipelining in the slave

As a result of this experience I'm working on a new SoC
interconnect definition (working name SimpCon) that should avoid the
mentioned issues and should still be easy to implement for both
master and slave.

As there are so many projects available that implement the WB
interface, I will provide bridges between SimpCon and WB. For IO
devices the arguments above do not apply to the same extent, as the
pressure for low-latency access and pipelining is not as high.
Therefore, a bridge to WB IO devices can be a practical solution for
design reuse.

Martin



You are probably right for high clock rate interconnects or high
latency accesses (DRAM, etc.). However, WB works very well for the
single cycle accesses you usually get in very simple SoCs with only
primitive peripherals. Especially the early ACKs can get in the way
of single cycle accesses. Holding the last output valid is only easy
for the slave if it registers the addresses.

Anyway, I am a big fan of pipelined busses (ever seen the SCI link
controller interface?), so I would like to get a draft of your spec.

Kolja Sulimma

> You are probably right for high clock rate interconnects or high
> latency accesses (DRAM, etc.). However, WB works very well for the
> single cycle accesses you usually get in very simple SoCs with only
> primitive peripherals. Especially the early ACKs can get in the way
> of single cycle accesses. Holding the last output valid is only easy
> for the slave if it registers the addresses.

The idea is that the address and data registers should reside inside
the slave and not the master.

> Anyway, I am a big fan of pipelined busses (ever seen the SCI link
> controller interface?) so I would like

No, I have not seen it. Do you have a link to it handy?

At the moment I'm also trying to collect different interconnect
standards to avoid reinventing the wheel.

> to get a draft of your spec.

The idea for (some) pipeline support is twofold:

1.) The slave will provide more information than a single ack
or wait states. It will (if it is capable of doing so) signal the
number of clock cycles remaining until the read data is available
(or the write has finished) to the master. This feature allows
a pipelined master to prepare for the upcoming read.

2.) If the slave can provide pipelining, the master can use
overlapped wr or rd requests. The slave has a static output
port that tells how many pipeline stages are available.
I call this the 'pipeline level':
  0 means non-overlapping,
  1 means a new rd/wr request can be issued in the same cycle
    in which the former data is read,
  2 means one cycle earlier, and
  3 is the maximum level, where you get full pipelining
    on the basic read cycle with one wait state
    (command - read - read - result).

The draft of the spec is at the moment a few sketches on real
paper; it takes some time to draw all the diagrams for a document.
(BTW, does anybody know a tool for quick drawing of timing
diagrams?)
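The rdy_cnt idea in point 1 can be sketched as a simple down-counter in the slave. A Python pseudo-RTL sketch (the rdy_cnt name and the command-read-result latency are from the discussion; the class structure and latency value are my own illustration):

```python
# Sketch of a SimpCon-style slave for an SRAM with fixed read latency.
# rdy_cnt tells the master how many cycles remain until the data is
# valid; 0 means the result is available (or the slave is idle).
class SimpConSlave:
    LATENCY = 2  # e.g. command - read - result for an SRAM read

    def __init__(self):
        self.rdy_cnt = 0
        self.addr = None

    def rd(self, addr):
        # Address/command are valid for a single cycle; the slave
        # latches them and starts counting down toward data-valid.
        self.addr = addr
        self.rdy_cnt = self.LATENCY

    def clock(self):
        # One clock tick: rdy_cnt decrements toward 0.
        if self.rdy_cnt > 0:
            self.rdy_cnt -= 1

slave = SimpConSlave()
slave.rd(0x100)
print(slave.rdy_cnt)  # -> 2: data valid in two cycles
slave.clock()
print(slave.rdy_cnt)  # -> 1: a pipelined master can prepare the next access
```

Unlike a plain ack, the counter lets the master schedule the next request ahead of time instead of reacting in the same cycle.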

I have a first implementation of SimpCon on JOP to test the
ideas: a master in JOP and a slave for SRAM access.

If you are interested in early access, I can upload the
VHDL files to the opencores CVS server.

Martin



Martin Schoeberl schrieb:

>>Anyway, I am a big fan of pipelined busses (ever seen the SCI link
>>controller interface?) so I would like
>
>No, have not seen it. Do you have a link to it handy?
No. Only the SCI spec, not the link controller.

>The idea for (some) pipeline support is twofold:
>
>1.) The slave will provide more information than a single ack
>or wait states. It will (if it is capable of doing so) signal the
>number of clock cycles remaining until the read data is available
>(or the write has finished) to the master. This feature allows
>a pipelined master to prepare for the upcoming read.
>
>2.) If the slave can provide pipelining, the master can use
>overlapped wr or rd requests. The slave has a static output
>port that tells how many pipeline stages are available.
>I call this the 'pipeline level':
>  0 means non-overlapping,
>  1 means a new rd/wr request can be issued in the same cycle
>    in which the former data is read,
>  2 means one cycle earlier, and
>  3 is the maximum level, where you get full pipelining
>    on the basic read cycle with one wait state
>    (command - read - read - result).
I do not like the concept of telling the master at the beginning of
each cycle what the latency will be. But I believe that you get what
you want simply by using split transactions: the slave acknowledges
that it has latched the address and control information, and the
master is free to supply the next address to the next or the same
slave.

Pipelining within a single slave can have any number of levels. The
slave just keeps acknowledging addresses and after a while starts
acknowledging data.

Having multiple outstanding read transactions to different slaves is
tricky and probably not worth the effort. I would suggest limiting
the bus to at most one outstanding transaction to other slaves.
Otherwise it would be necessary to keep a queue of outstanding slaves
and select the right data source at the right time.

Unfortunately I am not available at the moment to implement any
hardware. Too many outstanding transactions ;-)

Kolja


>>>Anyway, I am a big fan of pipelined busses (ever seen the SCI link
>>>controller interface?) so I would like
>>
>>No, have not seen it. Do you have a link to it handy?
>
> No. Only the SCI spec, not the link controller.

This was a misunderstanding - I meant: do you have a hyperlink
to the specification handy... ;-)

> I do not like the concept of telling the master at the beginning of each
> cycle what the latency will be.

The numbers from the slave mean how many cycles it can pipeline, not
the actual value of the latency.

The latency is a different thing: a rdy_cnt signal will tell the
master when the access (e.g. a read) will finish - this is dynamic.

> But I believe that you get what you want simply by using split
> transactions.
> The slave acknowledges that it latched the address and control
> information and the master is free to
> supply the next address to the next or the same slave.

In my opinion split transactions are a waste of cycles. When
rdy_cnt is 0 (or the slave accepts pipelining), a new address and
data will be accepted. No acknowledge is necessary.

Martin


Martin Schoeberl schrieb:

>>But I believe that you get what you want simply by using split
>>transactions.
>>The slave acknowledges that it latched the address and control
>>information and the master is free to
>>supply the next address to the next or the same slave.
>>
>>
>
>In my opinion split transactions are a waste of cycles. When
>rdy_cnt is 0 (or the slave accepts pipelining), a new address and
>data will be accepted. No acknowledge is necessary.
No waste of cycles. If you have separate address and data busses
(like Wishbone) you can acknowledge data and address at the same
time. There is no scenario where split transactions take more cycles.
However, especially for read accesses, a target can usually
acknowledge the address earlier than the data. This allows the
master to drive a new address, saving cycles.
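The saving can be sketched numerically (Python, my own illustration with made-up parameters, not from the thread): with split transactions the address for access i+1 can be driven as soon as address i is acknowledged, so only the last access's data latency stays exposed.

```python
# Illustrative timelines for n reads where the address is acknowledged
# after one cycle and data returns d cycles after address acceptance.
def split_transaction_cycles(n, d):
    # Addresses stream at one per cycle; data phases overlap them,
    # so only the final access's data latency is fully exposed.
    return n + d

def non_split_cycles(n, d):
    # Without an address acknowledge, each access occupies the bus
    # until its data arrives.
    return n * (1 + d)

print(split_transaction_cycles(4, 2))  # -> 6
print(non_split_cycles(4, 2))          # -> 12
```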

Kolja


> No waste of cycles. If you have separate address and data busses
> (like Wishbone) you can acknowledge data and address at the same
> time. There is no scenario where split transactions take more
> cycles. However, especially for read accesses, a target can usually
> acknowledge the address earlier than the data. This allows the
> master to drive a new address, saving cycles.
Actually, I do not really understand why the address should be
acknowledged by the slave. The idea is that the address and the
command (rd or wr) are active for only a single cycle and the slave
*has* to accept them. Therefore there is no need to acknowledge them.
The next address/cmd can be issued depending on rdy_cnt and
the pipeline level of the slave.
I believe this is simpler than an acknowledge, both for the master,
which does not have to wait for the ack, and for the slave.

Martin



I've put a first draft of the SimpCon specification on
my web site: http://www.jopdesign.com/docu.jsp

More implementation examples and Wishbone bridges will
follow.

Comments are very welcome,

Martin



