This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
Hi all.

I'm currently thinking about starting my own CPU-on-FPGA project using Xilinx Spartan-II FPGAs. I had a look at the xr16 and similar designs. What I'm not sure about is how to implement the register file on the FPGA. Most CPUs need a multi-ported register file with 2 read ports and 1 write port. How could this be implemented most conveniently and efficiently using Spartan-II elements?

At first I thought about using internal BlockRAM, but I'm not sure whether I understand the Xilinx BlockRAM documentation correctly. Conceptually, the writeback of the current instruction's result to the register file should happen simultaneously with the read-out of the 2 operands for the next instruction. This does not seem possible with a single BlockRAM: the dual-ported BlockRAMs have only 2 address buses, so 2 reads and 1 write cannot be performed at the same time.

An alternative would be to replicate the register file, i.e. to use 2 BlockRAMs. This would allow 2 reads and 1 write in parallel. Is this the way to go? Or is there a better way to implement the register file? Maybe not using BlockRAM but LUT RAM?

Regards,
Christian

--
Christian Plessl
|
|
|
I use two dual-ported CLB RAMs (i.e. RAM16X1D), one for each read port. For big register files, BlockRAMs work as well. The difference is that the BlockRAM has a registered (synchronous) read port, while the CLB RAM read is direct (asynchronous).

--Mike
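The replicated-RAM arrangement described above can be sketched as a small behavioural model. This is only an illustration in Python, not HDL from any poster in this thread: the class names `DualPortLutRam` and `RegFile2R1W`, the 16-bit width, and the method names are all hypothetical. Each copy stands in for a bank of RAM16X1D-style primitives (one synchronous write port, one asynchronous read port). Every write goes to both copies, so the two read ports always see the same contents.

```python
class DualPortLutRam:
    """Hypothetical stand-in for a bank of dual-port LUT RAMs:
    one synchronous write port, one asynchronous read port."""

    def __init__(self, depth=16):
        self.mem = [0] * depth

    def read(self, addr):
        # Asynchronous read: the data follows the address, no clock needed.
        return self.mem[addr]

    def clock_write(self, addr, data, we):
        # Synchronous write: takes effect on the clock edge when we is set.
        if we:
            self.mem[addr] = data


class RegFile2R1W:
    """Two read ports and one write port built from two RAM copies."""

    def __init__(self, depth=16, width=16):
        self.mask = (1 << width) - 1
        self.copy_a = DualPortLutRam(depth)  # serves read port A
        self.copy_b = DualPortLutRam(depth)  # serves read port B

    def read(self, ra, rb):
        # Each read port has its own copy, so both reads happen in parallel.
        return self.copy_a.read(ra), self.copy_b.read(rb)

    def clock(self, wr_addr, wr_data, we=True):
        # The single write port drives both copies to keep them identical.
        data = wr_data & self.mask
        self.copy_a.clock_write(wr_addr, data, we)
        self.copy_b.clock_write(wr_addr, data, we)


rf = RegFile2R1W()
rf.clock(3, 0x1234)
rf.clock(5, 0xBEEF)
print(rf.read(3, 5))
```

Since RAM16X1D is 16 words deep and 1 bit wide, a real 16-entry, 16-bit register file built this way would need 16 such primitives per copy; the model above hides that detail behind a Python list.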
|
|
|
> I use two dual-ported CLB RAMs (i.e. RAM16X1D), one
> for each read port.

Could you send me the corresponding snippet of VHDL/Verilog?

> For big register files, BlockRAMs
> work as well. The difference is the BlockRAM has a
> registered read port, CLB RAM is direct.

What does this mean in practice for your design?

Regards,
Christian

--
Christian Plessl
|
I use two copies of single-port LUT RAMs to get two read ports, then double-cycle the RAMs: I write to them during the first half of the clock cycle and read from them during the second half. (You can get rid of one layer of result-forwarding multiplexers this way.) I found that this was faster than using the dual-port RAM feature for the number of registers I wanted (32).

Rob
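To make the write-then-read ordering concrete, here is a minimal behavioural sketch in Python. It is an assumption-laden illustration, not the poster's actual design: the class name `HalfCycleRegFile` and the `cycle` method are hypothetical, and the two half cycles are modelled simply as two sequential steps. Because the write lands before the read within the same cycle, an instruction reading a register that is being written back in that cycle already sees the new value, which is why one level of forwarding is no longer needed.

```python
class HalfCycleRegFile:
    """Hypothetical model: two single-port RAM copies, written in the
    first half of the clock cycle and read in the second half."""

    def __init__(self, depth=32):
        self.copy_a = [0] * depth  # supplies operand A
        self.copy_b = [0] * depth  # supplies operand B

    def cycle(self, wr_addr, wr_data, we, ra, rb):
        # First half cycle: the single address port of each copy carries
        # the write address, and both copies store the same result.
        if we:
            self.copy_a[wr_addr] = wr_data
            self.copy_b[wr_addr] = wr_data
        # Second half cycle: each copy's address port now carries its own
        # read address, so two different registers are read in parallel.
        return self.copy_a[ra], self.copy_b[rb]


rf = HalfCycleRegFile()
rf.cycle(wr_addr=1, wr_data=10, we=True, ra=0, rb=0)
# Write r7 and read r7 in the same cycle: the read sees the new value.
operands = rf.cycle(wr_addr=7, wr_data=42, we=True, ra=7, rb=1)
print(operands)
```

The price, as the model makes plain, is that the value to be written must already be valid at the start of the cycle's first half, and the read has only the second half to complete.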
|
|
|
> I use two copies of single port LUT-rams to get two read ports then
> double cycle the rams by writing to the rams during the first half
> clock and reading from them during the second half of the clock.

I'm not sure whether I understand this correctly: this way, the result that has to be written to the register file must be stable within one of the doubled cycles, i.e. after half of the clock cycle that the rest of the circuit operates on, right? Don't you lose half of the clock cycle time for computation this way?

Best regards,
Christian

--
Christian Plessl
|
|
|
> Don't you lose half of the clockcycle time for
> computation this way?

Yep. You hit it right on the nose. However, most ALU operations can complete within one half clock cycle (because they involve just one or two logic levels). It's other parts of the design that limit the operating frequency. Right now a big one in my design is the path from the cache RAM to the tag comparators to a valid cache address signal to a ready signal to a pipeline clock enable signal. Basically control signals. Any signals that pass through more than about three (routed) logic levels within a full clock period are going to be slower than ALU operations.

For instance, a 32-bit adder by itself might work at 190 MHz in the SpartanII-5. Add in logic to control the adder, and suddenly you're down to 50 MHz.

Try experimenting with simple circuits. This is how I found out it would be faster to double-cycle the RAM and ALU than to use dual-port RAM. (Not really double cycling, just allowing only a half cycle to complete an operation.) Several newer commercial processors double-cycle the register array and ALU. I'm assuming it's for a similar reason.

Note that to get above 60 MHz, I have had to trim the last logic level off my barrel shifter, so I can only shift up to 15 bits at a time, not 31. So I lost a fraction of a percent in performance.

Note: a big time consumer I've found is signals with a high fanout. Routing a high-fanout signal really adds nanoseconds. I've had to constrain some signals to a lower fanout using the max_fanout attribute.

Rob
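The barrel shifter trade-off mentioned above can be illustrated with a short Python sketch. It assumes a conventional logarithmic shifter with one multiplexer stage per bit of the shift amount (shift by 1, 2, 4, 8, 16); whether that matches the poster's actual implementation, or how the stages map onto FPGA logic levels, is not stated in the thread. The function names and the multi-pass fallback for larger shifts are hypothetical.

```python
def barrel_shift_left(value, amount, stages=5, width=32):
    """Logarithmic left shifter: stage i shifts by 2**i when bit i of
    the shift amount is set. With `stages` stages the maximum shift
    is 2**stages - 1."""
    max_shift = (1 << stages) - 1
    if not 0 <= amount <= max_shift:
        raise ValueError(f"shift amount must be in 0..{max_shift}")
    mask = (1 << width) - 1
    for i in range(stages):
        if (amount >> i) & 1:
            # One multiplexer stage: pass through or shift by 2**i.
            value = (value << (1 << i)) & mask
    return value


def shift_in_passes(value, amount, width=32):
    """Assumed workaround: reach shifts above 15 by issuing several
    passes through the trimmed 4-stage shifter."""
    while amount > 0:
        step = min(amount, 15)
        value = barrel_shift_left(value, step, stages=4, width=width)
        amount -= step
    return value


print(hex(barrel_shift_left(1, 15, stages=4)))
print(hex(shift_in_passes(1, 31)))
```

Dropping the shift-by-16 stage removes one multiplexer level from the critical path, at the cost of extra passes for the (presumably rare) shifts of 16 bits or more.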