EmbeddedRelated.com

Wishbone comments

Started by Martin Schoeberl November 23, 2005
After implementing the Wishbone interface for main memory access
from JOP, I see several issues with the Wishbone specification that
make it not the best choice for SoC interconnect.

The Wishbone interface specification is still in the tradition of
microcomputer or backplane busses. However, for a SoC interconnect,
which is usually point-to-point, this is not the best approach.

The master is required to hold the address and data valid through
the whole read or write cycle. This complicates the connection to a
master that has the data valid for only one cycle. In this case the
address and data have to be registered *before* the Wishbone connection,
or an expensive (in time and resources) MUX has to be used. A register
results in one additional cycle of latency. A better approach would be
to register the address and data in the slave. Then there is also
time to perform address decoding in the slave (before the address
register).
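To make the handshake concrete, here is a minimal cycle-by-cycle sketch in Python (pseudo-RTL, my own illustration; only the signal names cyc/stb/ack follow the Wishbone convention) of the classic read cycle described above, where the master must keep the address and strobe stable until the slave answers:

```python
# Hypothetical sketch of a classic Wishbone read cycle:
# the master must hold cyc/stb/adr stable until the slave asserts ack.
def classic_wb_read(addr, slave_wait_states):
    cycles = 0
    ack = False
    while not ack:
        # Master drives cyc=1, stb=1, adr=addr every cycle (held stable).
        cycles += 1
        # Slave asserts ack only after its wait states have elapsed.
        ack = cycles > slave_wait_states
    return cycles  # total cycles the master had to hold the address

# With one wait state, the master holds the address for two cycles.
print(classic_wb_read(0x100, slave_wait_states=1))  # -> 2
```

A single-cycle master would have to register addr before this loop, which is exactly the extra latency the post complains about.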

There is a similar issue for the output data from the slave: as it
is only valid for a single cycle, it has to be registered by the
master when the processor is not reading it immediately. Therefore,
the slave should keep the last valid data at its output even when
wb.stb is no longer asserted (which adds no hardware
complexity).

The Wishbone connection for JOP resulted in an unregistered Wishbone
memory interface and registers for the address and data in the
Wishbone master. However, for a fast address and control output time
(tco) and a short setup time (tsu) we want the registers in the IO
pads of the FPGA. With the registers buried in the WB master it takes
some effort to set the right constraints for the synthesizer to
implement such IO registers.

The same issue applies to the control signals. The translation from
the wb.cyc, wb.stb and wb.we signals to ncs, noe and nwe for the
SRAM is on the critical path.

The ack signal comes too late for a pipelined master. We would need to
know *earlier* when the next data will be available, and this
is possible, as the slave knows when the data from the SRAM
will arrive. A workaround is a non-WB-conforming early ack
signal.

Because the data registers are not inside the WB interface,
we need an extra WB interface for the Flash/NAND interface (on the
Cyclone board). We cannot afford the address decoding and a MUX in
the data read path without registers; this would add an extra
cycle to the memory read due to the combinational delay.

In the WB specification (AFAIK) there is no way to perform pipelined
reads or writes. However, for blocked memory transfers (e.g. a cache
load) pipelining is the usual way to get good performance.
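The cost is easy to quantify. A sketch in Python (illustrative numbers only, not taken from the post): for a blocked transfer of n words from a memory with w wait states per access, non-pipelined reads pay the full latency for every word, while a pipelined slave streams one word per cycle after the initial latency:

```python
# Illustrative cycle counts for a blocked transfer (e.g. a cache line
# fill) of n words from a memory with w wait states per access.
def non_pipelined_cycles(n, w):
    # Each read pays the full latency: command plus w wait states per word.
    return n * (1 + w)

def pipelined_cycles(n, w):
    # After the first access's latency, one word arrives per cycle.
    return (1 + w) + (n - 1)

# An 8-word cache line from a memory with one wait state:
print(non_pipelined_cycles(8, 1))  # -> 16
print(pipelined_cycles(8, 1))      # -> 9
```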

Conclusion -- I would prefer:

* Address and data (in/out) registers in the slave
* A way to know earlier when data will be available (or
a write has finished)
* Pipelining in the slave

As a result of this experience I'm working on a new SoC
interconnect definition (working name SimpCon) that should avoid the
mentioned issues and should still be easy to implement for both
master and slave.

As there are so many projects available that implement the WB
interface, I will provide bridges between SimpCon and WB. For IO
devices the arguments above do not apply to the same extent, as the
pressure for low-latency access and pipelining is not as high.
Therefore, a bridge to WB IO devices can be a practical solution for
design reuse.

Martin



You are probably right for high clock rate interconnects or high
latency accesses (DRAM, etc.). However, WB works very well for the
single cycle accesses you usually get in very simple SoCs with only
primitive peripherals. Especially the early ACKs can get in the way
of single cycle accesses. Holding the last output valid is only easy
for the slave if it registers the addresses.

Anyway, I am a big fan of pipelined busses (ever seen the SCI link
controller interface?), so I would like to get a draft of your spec.

Kolja Sulimma

> You are probably right for high clock rate interconnects or high
> latency accesses (DRAM, etc.). However, WB works very well for the
> single cycle accesses you usually get in very simple SoCs with only
> primitive peripherals. Especially the early ACKs can get in the way
> of single cycle accesses. Holding the last output valid is only easy
> for the slave if it registers the addresses.

The idea is that the address and data registers should reside inside
the slave and not the master.

> Anyway, I am a big fan of pipelined busses (ever seen the SCI link
> controller interface?) so I would like

No, I have not seen it. Do you have a link to it handy?

At the moment I'm also trying to collect different interconnect
standards to avoid reinventing the wheel.

> to get a draft of your spec.

The idea for (some) pipeline support is twofold:

1.) The slave will provide more information than a single ack
or wait states. It will (if it is capable of doing so) signal the
number of clock cycles remaining until the read data is available
(or the write has finished) to the master. This feature allows
a pipelined master to prepare for the upcoming read.

2.) If the slave can provide pipelining, the master can use
overlapped wr or rd requests. The slave has a static output
port that tells how many pipeline stages are available.
I call this the 'pipeline level':
  0 means non-overlapping,
  1 means a new rd/wr request can be issued in the same cycle
    in which the former data is read,
  2 means one cycle earlier, and
  3 is the maximum level, where you get full pipelining
    on the basic read cycle with one wait state
    (command - read - read - result).

The draft of the spec is at the moment a few sketches on real
paper; it takes some time to draw all the diagrams for a document.
(BTW, does anybody know a tool for quick drawing of timing
diagrams?)
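The rdy_cnt idea in point 1 can be sketched as a simple down-counter in the slave. A Python pseudo-RTL sketch (the rdy_cnt name and the command-read-result latency are from the discussion; the class structure and latency value are my own illustration):

```python
# Sketch of a SimpCon-style slave for an SRAM with fixed read latency.
# rdy_cnt tells the master how many cycles remain until the data is
# valid; 0 means the result is available (or the slave is idle).
class SimpConSlave:
    LATENCY = 2  # e.g. command - read - result for an SRAM read

    def __init__(self):
        self.rdy_cnt = 0
        self.addr = None

    def rd(self, addr):
        # Address/command are valid for a single cycle; the slave
        # latches them and starts counting down toward data-valid.
        self.addr = addr
        self.rdy_cnt = self.LATENCY

    def clock(self):
        # One clock tick: rdy_cnt decrements toward 0.
        if self.rdy_cnt > 0:
            self.rdy_cnt -= 1

slave = SimpConSlave()
slave.rd(0x100)
print(slave.rdy_cnt)  # -> 2: data valid in two cycles
slave.clock()
print(slave.rdy_cnt)  # -> 1: a pipelined master can prepare the next access
```

Unlike a plain ack, the counter lets the master schedule the next request ahead of time instead of reacting in the same cycle.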

I have a first implementation of SimpCon on JOP to test the
ideas: a master in JOP and a slave for SRAM access.

If you are interested in early access, I can upload the
VHDL files to the opencores CVS server.

Martin



Martin Schoeberl schrieb:

>>Anyway, I am a big fan of pipelined busses (ever seen the SCI link
>>controller interface?) so I would like
>
>No, have not seen it. Do you have a link to it handy?
No. Only the SCI spec, not the link controller.

>The idea for (some) pipeline support is twofold:
>
>1.) The slave will provide more information than a single ack
>or wait states. It will (if it is capable of doing so) signal the
>number of clock cycles remaining until the read data is available
>(or the write has finished) to the master. This feature allows
>a pipelined master to prepare for the upcoming read.
>
>2.) If the slave can provide pipelining, the master can use
>overlapped wr or rd requests. The slave has a static output
>port that tells how many pipeline stages are available.
>I call this the 'pipeline level':
>  0 means non-overlapping,
>  1 means a new rd/wr request can be issued in the same cycle
>    in which the former data is read,
>  2 means one cycle earlier, and
>  3 is the maximum level, where you get full pipelining
>    on the basic read cycle with one wait state
>    (command - read - read - result).
I do not like the concept of telling the master at the beginning of
each cycle what the latency will be. But I believe that you get what
you want simply by using split transactions: the slave acknowledges
that it has latched the address and control information, and the
master is free to supply the next address to the next or the same
slave.

Pipelining within a single slave can have any number of levels. The
slave just keeps acknowledging addresses and after a while starts
acknowledging data.

Having multiple outstanding read transactions to different slaves is
tricky and probably not worth the effort. I would suggest limiting
the bus to at most one outstanding transaction to other slaves.
Otherwise it would be necessary to keep a queue of outstanding slaves
and select the right data source at the right time.

Unfortunately I am not available at the moment to implement any
hardware. Too many outstanding transactions ;-)

Kolja


>>>Anyway, I am a big fan of pipelined busses (ever seen the SCI link
>>>controller interface?) so I would like
>>
>>No, have not seen it. Do you have a link to it handy?
>
> No. Only the SCI spec, not the link controller.

This was a misunderstanding - I meant: do you have a hyperlink
to the specification handy... ;-)

> I do not like the concept of telling the master at the beginning of each
> cycle what the latency will be.

The numbers from the slave mean how many cycles it can pipeline, not
the actual value of the latency.

The latency is a different thing: a rdy_cnt signal will tell the
master when the access (e.g. a read) will finish - this is dynamic.

> But I believe that you get what you want simply by using split
> transactions.
> The slave acknowledges that it latched the address and control
> information and the master is free to
> supply the next address to the next or the same slave.

In my opinion split transactions are a waste of cycles. When
rdy_cnt is 0 (or the slave accepts pipelining), a new address and
data will be accepted. No acknowledge is necessary.

Martin


Martin Schoeberl schrieb:

>>But I believe that you get what you want simply by using split
>>transactions.
>>The slave acknowledges that it latched the address and control
>>information and the master is free to
>>supply the next address to the next or the same slave.
>>
>>
>
>In my opinion split transactions are a waste of cycles. When
>rdy_cnt is 0 (or the slave accepts pipelining), a new address and
>data will be accepted. No acknowledge is necessary.
No waste of cycles. If you have separate address and data busses
(like Wishbone) you can acknowledge data and address at the same
time. There is no scenario where split transactions take more cycles.
However, especially for read accesses, a target can usually
acknowledge the address earlier than the data. This allows the
master to drive a new address, saving cycles.
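The saving can be sketched numerically (Python, my own illustration with made-up parameters, not from the thread): with split transactions the address for access i+1 can be driven as soon as address i is acknowledged, so only the last access's data latency stays exposed.

```python
# Illustrative timelines for n reads where the address is acknowledged
# after one cycle and data returns d cycles after address acceptance.
def split_transaction_cycles(n, d):
    # Addresses stream at one per cycle; data phases overlap them,
    # so only the final access's data latency is fully exposed.
    return n + d

def non_split_cycles(n, d):
    # Without an address acknowledge, each access occupies the bus
    # until its data arrives.
    return n * (1 + d)

print(split_transaction_cycles(4, 2))  # -> 6
print(non_split_cycles(4, 2))          # -> 12
```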

Kolja


> No waste of cycles. If you have separate address and data busses
> (like Wishbone) you can acknowledge data and address at the same
> time. There is no scenario where split transactions take more
> cycles. However, especially for read accesses, a target can usually
> acknowledge the address earlier than the data. This allows the
> master to drive a new address, saving cycles.
Actually, I do not really understand why the address should be
acknowledged by the slave. The idea is that the address and the
command (rd or wr) are active for only a single cycle and the slave
*has* to accept them. Therefore there is no need to acknowledge them.
The next address/cmd can be issued depending on rdy_cnt and
the pipeline level of the slave.
I believe this is simpler than an acknowledge, both for the master,
which does not have to wait for the ack, and for the slave.

Martin



I've put a first draft of the SimpCon specification on
my web site: http://www.jopdesign.com/docu.jsp

More implementation examples and Wishbone bridges will
follow.

Comments are very welcome,

Martin



