fpga-cpu | Floating Point Arithmetic| page 2

Reply by Arius - Rick Collins ● October 21, 2004

At 12:14 PM 10/21/2004, you wrote:
>For the fun of it... I started wire-wrapping RTL back in the
>late '60s and I couldn't even imagine what it would take to build a
>real computer. I remember some of the early minicomputers that used
>several boards full of TTL and magnetic memory.
>
>I had a LOT of experience with the IBM 1130 and a ton of
>documentation. So, I took some 74xxx, some fusible proms and built
>an emulation. Memory, like the 2102, wasn't available in my price
>range for several more years and by that time the Altair 8800 was
>available so I gave up on the 1130.

I was just a couple of years behind you. I did not know much about computers when the Altair came out. A few years later when the Heathkit version of the LSI-11 was available, I bought one. The design used a CPU chip with several microcode prom chips. That made me think of writing my own microcode to plug into the empty microcode chip socket. But I never found the documentation I needed.

>Today, a single chip can have an 1130 and a Z80 in the same package -
> absolutely amazing.
>
>I plan to get back to that 1130 and have started accumulating the
>documentation. Right after the P machine.
>
>I was sort of planning on a lot more speed than 10 MHz. Heck, the
>T80 core runs reliably at 14 MHz and could probably do more if I
>used the high slew rate on the external RAM. I was thinking about
>at least 20 MHz for the stack machine. The SRAM on the board is
>rated at 10 nS and the Xtal is 50 MHz.

I am sure you can get much higher than 20 MHz even. The trick is to think in terms of levels of 4 input LUTs and keep the number of levels down. In my design I found the multiplexors to be real hogs, both in the number of levels and in the number of LUTs in general. You may need to make some tradeoffs between reducing the number of cycles for a given instruction and the speed of all cycles. My suggestion is to minimize your complexity first (giving you speed) and try to optimize individual instructions later (by adding paths and special hardware).

>I will look at the other processors. True, I want to roll my own
>but there are too many good ideas out there to just ignore them.

I learned a lot from others' implementations. But it is often hard to understand exactly what they did and why. It can be a lot of work just to learn what the "state of the art" is in FPGA CPUs. You might do very well just to learn about either the NIOS-II or the MicroBlaze. Understanding either one of these will likely be a real education in FPGA optimization.

Rick Collins

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

Reply by rtstofer ● October 21, 2004

> I was just a couple of years behind you. I did not know much about
> computers when the Altair came out. A few years later when the Heathkit
> version of the LSI-11 was available, I bought one. The design used a CPU
> chip with several microcode prom chips. That made me think of writing my
> own microcode to plug into the empty microcode chip socket. But I never
> found the documentation I needed.

I really wanted one of the LSI-11s but never could get myself in a position to buy one. Western Digital made the chip and made a 'similar' (read identical) chip for the UCSD Pascal system. I think the system was called a Terak(?)

> I am sure you can get much higher than 20 MHz even. The trick is to think
> in terms of levels of 4 input LUTs and keep the number of levels down. In
> my design I found the multiplexors to be real hogs, both the number of
> levels and the number of LUTs in general. You may need to make some
> tradeoffs between reducing the number of cycles for a given instruction and
> the speed of all cycles. My suggestion is to minimize your complexity
> first (giving you speed) and try to optimize individual instructions later
> (by adding paths and special hardware).

I am really new at the FPGA stuff and, while I can read the timing information, I don't know what it means. According to the timing reports the T80 has a maximum delay of 18+ nS:

Timing summary:
---------------

Timing errors: 0  Score: 0

Constraints cover 851300 paths, 0 nets, and 11402 connections

Design statistics:
   Minimum period:  18.647ns (Maximum frequency: 53.628MHz)
   Minimum input required time before clock:   1.333ns
   Maximum output delay after clock:   9.892ns

I don't know exactly what to do with this number. I know it says I can run 50+ MHz but I just have to believe there are a bunch of 'gotchas'. I am running the core at 12.5 MHz and, if I really thought I could kick it to 50, I would certainly like to do it.

Any guidance here will be appreciated. I really don't have any idea how to figure the timing for FPGAs.

> >I will look at the other processors. True, I want to roll my own
> >but there are too many good ideas out there to just ignore them.
>
> I learned a lot from others implementations. But it is often hard to
> understand exactly what they did and why. It can be a lot of work just to
> learn what the "state of the art" is in FPGA CPUs. You might do very well
> just to learn about either the NIOS-II or the microBlaze. Understanding
> either one of these will likely be a real education in FPGA optimization.

I looked at the XSOC core and didn't so much 'give up' as just decide my simple project wasn't worth going to that level of caching, pipelining, register interlocking, etc. For a toy the good old fetch-decode-execute will be just fine. And, for an initial implementation, it will be difficult enough. Maybe later, when I know more about what I am doing...

Reply by Arius - Rick Collins ● October 21, 2004

At 03:40 PM 10/21/2004, you wrote:
>I am really new at the FPGA stuff and, while I can read the timing
>information, I don't know what it means. According to the timing
>reports the T80 has a maximum delay of 18+ nS:
>
>Timing summary:
>---------------
>
>Timing errors: 0  Score: 0
>
>Constraints cover 851300 paths, 0 nets, and 11402 connections
>
>Design statistics:
>   Minimum period:  18.647ns (Maximum frequency: 53.628MHz)
>   Minimum input required time before clock:   1.333ns
>   Maximum output delay after clock:   9.892ns
>
>I don't know exactly what to do with this number. I know it says I
>can run 50+ MHz but I just have to believe there are a bunch
>of 'gotchas'. I am running the core at 12.5 MHz and, if I really
>thought I could kick it to 50, I would certainly like to do it.
>
>Any guidance here will be appreciated. I really don't have any idea
>how to figure the timing for FPGAs.

If the report says it will run at 53 MHz, then it will. Of course that is only considering your internal FF to FF delays. Unless you have added timing constraints, the software does not know what your external timing is like. Also, the tool does not try to optimize timing unless you give it a constraint. So you might get a lot better than 53 MHz if you ask for something better.

>I looked at the XSOC core and didn't so much 'give up' as just
>decide my simple project wasn't worth going to that level of
>caching, pipelining, register interlocking, etc. For a toy the good
>old fetch-decode-execute will be just fine. And, for an initial
>implementation, it will be difficult enough. Maybe later, when I
>know more about what I am doing...

I know what you mean. The XSOC RISC is also not very small. The small, simple, highly optimized CPUs are called MISC and many are stack oriented. Using the stack should make a CPU simpler and faster, but it appears that in an FPGA, registers are basically free (speed wise) if you use LUT RAM or block RAM, due to their inherent speed. So a 16 register CPU can run as fast as a stack machine in an FPGA (depending on your instruction encoding).

Rick Collins

Reply by rtstofer ● October 21, 2004

One problem I had with the T80 core that will come up again is bus turnaround or contention on the external RAM data bus. Basically, as long as the FPGA is driving the data bus toward the RAM, no problem. But when I turn it around I need to shut off the tristate buffers in the FPGA before I turn on output enable at the RAM.

I couldn't come up with a neat way to do that so I did a huge no-no and gated the RAM output enable signal with the last half of the clock cycle. This way the address bus was stable for most of the cycle, the FPGA buffers were turned off at the beginning of the cycle and the RAM turned on during the last half.

Given 15 nS RAM, in this case, I just about doubled the access time to 30 nS, thus slowing the machine to about 33 MHz, all else being equal.

I searched around for info on asynchronous SRAM interfaces and the best I could find was a deep, closely held secret (meaning I had to spend $) at Xilinx. As far as I can tell, they used a very high speed FSA (like 200 MHz, perhaps) to accomplish the same thing.

I also found a rough calculation of the result of not worrying about contention that indicated I could see a 1.1 degree C rise in temperature if I just ignored the issue. Still, it doesn't seem right to allow this to occur.

Maybe this is a place where current limiting resistors in series would be a quick fix. I'll have to think about that.
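The 33 MHz estimate is simple arithmetic, and a rough sketch may make it easier to try other numbers. This assumes the RAM only gets a fixed fraction of each clock period in which to respond (half, with the gated output enable described above); the function name and the `window_fraction` parameter are made up for illustration and ignore FPGA pin delays and setup time.

```python
def max_clock_mhz(access_ns, window_fraction=0.5):
    """Rough upper bound on clock rate when the RAM access must fit
    inside `window_fraction` of one clock period.

    Hypothetical helper: ignores FPGA I/O delay, setup time and margin.
    """
    period_ns = access_ns / window_fraction  # e.g. 15 ns / 0.5 = 30 ns
    return 1000.0 / period_ns                # period in ns -> frequency in MHz


if __name__ == "__main__":
    # 15 ns RAM enabled only during the last half of the cycle
    print(round(max_clock_mhz(15.0, 0.5), 1))
    # Same RAM if the whole cycle were available for the access
    print(round(max_clock_mhz(15.0, 1.0), 1))
```

The second call just shows how much headroom the half-cycle gating gives away, which is the cost being weighed against avoiding contention.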

Reply by Jeff Brower ● October 21, 2004

RT-

> One problem I had with the T80 core that will come up again is bus
> turnaround or contention on the external ram data bus. Basically,
> as long as the FPGA is driving the data bus toward the ram, no
> problem. But, when I turn it around I need to shut off the tristate
> buffers in the FPGA before I turn on output enable at the ram.
>
> I couldn't come up with a neat way to do that so I did a huge no-no
> and gated the last half of the clock cycle with the ram output
> enable signal. This way the address bus was stable for most of the
> cycle, the fpga buffers were turned off at the beginning of the
> cycle and the ram turned on during the last half.
>
> Given 15 nS ram, in this case, I just about doubled the access time
> to 30 nS thus slowing the machine to about 33 MHz, all else being
> equal.
>
> I searched around for info on asynchronous SRAM interfaces and the
> best I could find was a deep, closely held secret (meaning I had to
> spend $) at Xilinx. As far as I can tell, they used a very high
> speed FSA (like 200 MHz, perhaps) to accomplish the same thing.
>
> I also found a rough calculation of the result of not worrying about
> contention that indicated I could see a 1.1 degree C rise in
> temperature if I just ignored the issue. Still, it doesn't seem
> right to allow this to occur.
>
> Maybe this is a place where current limiting resistors in series
> would be a quick fix. I'll have to think about that.

Exactly -- try the zero Rs.

-Jeff

Reply by Arius - Rick Collins ● October 22, 2004

At 06:39 PM 10/21/2004, you wrote:
>One problem I had with the T80 core that will come up again is bus
>turnaround or contention on the external ram data bus. Basically,
>as long as the FPGA is driving the data bus toward the ram, no
>problem. But, when I turn it around I need to shut off the tristate
>buffers in the FPGA before I turn on output enable at the ram.
>
>I couldn't come up with a neat way to do that so I did a huge no-no
>and gated the last half of the clock cycle with the ram output
>enable signal. This way the address bus was stable for most of the
>cycle, the fpga buffers were turned off at the beginning of the
>cycle and the ram turned on during the last half.
>
>Given 15 nS ram, in this case, I just about doubled the access time
>to 30 nS thus slowing the machine to about 33 MHz, all else being
>equal.

Some CPUs use the opposite edge of the clock to control the WE signal while ANDing the clock with the OE, which keeps reads to one clock cycle while writes require two clocks each. Async SRAMs typically need a write pulse about the same width as the read address access time, but the output enable can be faster. I'll draw the timing.

        |  READ |     WRITE     |  READ |
CLK   __----____----____----____----____----__
A     ==x=======x===============x=======x=====
CS-   ---________________________________-----
OE-   -------____--------------------____-----
WE-   ---------------________-----------------
D     -------<===>--<===========>----<===>---
                  ^               ^
                  Turn around times

If you want to get fancy, back to back reads don't need to toggle the OE signal, but it will be more work on your part to do that and I don't know that it has much advantage, perhaps some power savings.

>I searched around for info on asynchronous SRAM interfaces and the
>best I could find was a deep, closely held secret (meaning I had to
>spend $) at Xilinx. As far as I can tell, they used a very high
>speed FSA (like 200 MHz, perhaps) to accomplish the same thing.

I don't know what FSA means. The real problem is that async rams are *async* while most logic in an FPGA is synchronous. That makes it hard to set up the timing without using a lot of margin.

>I also found a rough calculation of the result of not worrying about
>contention that indicated I could see a 1.1 degree C rise in
>temperature if I just ignored the issue. Still, it doesn't seem
>right to allow this to occur.
>
>Maybe this is a place where current limiting resistors in series
>would be a quick fix. I'll have to think about that.

Brief contention on the bus is not likely to be a reliability issue, but it is not hard to avoid. You still need to control the timing of the write enable to assure that the address and data busses are stable until the end of the write enable. I think preventing contention is not much more difficult. Remember, your rams are not sync and the outputs will have different delays and settling times. The write enable is your clock in this case.

Rick Collins
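One way to convince yourself that a schedule like the one in the timing drawing avoids contention is to model it in half-clock phases and check it mechanically. The following is only a sketch under my own reading of the drawing: reads take two phases with the RAM driving in the second, writes take four phases with the FPGA driving in the last three and WE- low in the middle two. The table names and the `contention_free` function are hypothetical, not taken from any real design.

```python
# Each phase is (ram_drives, fpga_drives, we_active) for one half clock.
# Assumed from the timing drawing: 1-clock reads, 2-clock writes.
READ = [
    (False, False, False),  # address settles, data bus idle (turnaround)
    (True,  False, False),  # OE- low: RAM drives the data bus
]
WRITE = [
    (False, False, False),  # address settles, data bus idle (turnaround)
    (False, True,  True),   # FPGA drives data, WE- low
    (False, True,  True),   # FPGA still driving, WE- low
    (False, True,  False),  # WE- back high, data held to the end of the cycle
]


def timeline(ops):
    """Expand a string such as "RWR" into a flat list of half-clock phases."""
    phases = []
    for op in ops:
        phases.extend(READ if op == "R" else WRITE)
    return phases


def contention_free(phases):
    """True if the RAM and FPGA never drive in the same or adjacent phases,
    and WE- is only active while the FPGA is driving data."""
    for i, (ram, fpga, we) in enumerate(phases):
        if ram and fpga:
            return False          # both drivers on at once
        if we and not fpga:
            return False          # write strobe without valid data
        if i + 1 < len(phases):
            next_ram, next_fpga, _ = phases[i + 1]
            if (ram and next_fpga) or (fpga and next_ram):
                return False      # no idle phase between drivers
    return True


if __name__ == "__main__":
    print(contention_free(timeline("RWR")))
```

Requiring an idle phase between drivers is stricter than strictly necessary, but with async parts whose turn-off times vary it seems like a reasonable margin to insist on in a model.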

Reply by Jeffery, Robert ● October 22, 2004

Hi rtstofer.

Thought I might be able to help a little on the FPGA timing side.

The timing numbers mean that the internal logic of the FPGA can run up to a maximum of 53.628MHz for this design, which is better than you wanted. The other two numbers are looking at the data coming into and going out of the device at the pins. So the Minimum input required time before clock: 1.333ns means that data going into the FPGA must get there at least 1.333 ns before the clock rises for at least one pin, i.e. the worst case pin. That means you have (18.647 - 1.333) ns of slack to play with from other parts driving across your pcb and into the FPGA. Similarly the Maximum output delay after clock: 9.892ns means that valid data is available on all output pins 9.892 ns after the clock has risen. So that means you have (18.647 - 9.892) ns to get across your pcb and into any parts connected to the FPGA. Bearing in mind that ICs will probably have a setup time, it means you have about 7 ns of slack to play with for the worst case output pin of the FPGA.

To get information on all the pins you need to look at the twr file from the timing tool. From that you will probably see that some output pins are valid well before the 9.892 ns figure and some inputs might be able to arrive later than 1.333 ns before the rising edge of the clock, although 1.333 ns isn't very long so I suspect that all your inputs must be registered, which is one way to ensure a short input setup time!

Here is an example piece of a twr file for the worst case input pin, instruction(9), on the Xilinx CPLD version of the PicoBlaze. My clock is set to 10MHz or 100 ns. The bottom line here is that my signal must arrive at the FPGA pin no later than 12.793 ns before the rising edge of the clock.

================================================================================
Timing constraint: TIMEGRP "ARVG0" OFFSET = IN 100 nS BEFORE COMP "clk" ;

 5729 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
 Minimum allowable offset is 12.793ns.
--------------------------------------------------------------------------------
Slack:                  87.207ns (requirement - (data path - clock path - clock arrival + uncertainty))
  Source:               instruction(9) (PAD)
  Destination:          prog_count_reg_count_value(6)_repl2 (FF)
  Destination Clock:    clk_int rising at 0.000ns
  Requirement:          100.000ns
  Data Path Delay:      14.261ns (Levels of Logic = 10)
  Clock Path Delay:     1.468ns (Levels of Logic = 2)
  Clock Uncertainty:    0.000ns

  Data Path: instruction(9) to prog_count_reg_count_value(6)_repl2
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    T8.I                 Tiopi                 0.965   instruction(9)
                                                       instruction(9)
                                                       instruction_ibuf(9)
    SLICE_X26Y15.G3      net (fanout=9)        1.995   instruction_int(9)
    SLICE_X26Y15.Y       Tilo                  0.652   stack_control_valid_to_move
                                                       ix34635z1412
    SLICE_X26Y15.F3      net (fanout=2)        0.019   nx34635z15
    SLICE_X26Y15.X       Tilo                  0.744   stack_control_valid_to_move
                                                       ix57496z1530
    SLICE_X30Y17.F1      net (fanout)          1.802   stack_control_valid_to_move
    SLICE_X30Y17.X       Tilo                  0.744   nx34635z21
                                                       ix34635z1575
    SLICE_X33Y18.G2      net (fanout=8)        1.010   nx34635z21
    SLICE_X33Y18.Y       Tilo                  0.631   nx34635z25
                                                       ix34635z4588
    SLICE_X33Y18.F3      net (fanout=1)        0.007   nx34635z26
    SLICE_X33Y18.X       Tilo                  0.723   nx34635z25
                                                       ix34635z1261
    SLICE_X32Y20.G2      net (fanout=1)        0.698   nx34635z25
    SLICE_X32Y20.COUT    Topcyg                0.860   address_dup0(0)
                                                       ix34635z1510
                                                       ix34635z63346
    SLICE_X32Y21.CIN     net (fanout=1)        0.000   ix34635z63346/O
    SLICE_X32Y21.COUT    Tbyp                  0.170   address_dup0(2)
                                                       ix34635z63345
                                                       ix34635z63344
    SLICE_X32Y22.CIN     net (fanout=1)        0.000   ix34635z63344/O
    SLICE_X32Y22.COUT    Tbyp                  0.170   address_dup0(4)
                                                       ix34635z63343
                                                       ix34635z63342
    SLICE_X32Y23.CIN     net (fanout=1)        0.000   ix34635z63342/O
    SLICE_X32Y23.X       Tcinx                 0.917   address_dup0(6)
                                                       ix37700z19564
    F15.O1               net (fanout=1)        1.763   nx37700z1
    F15.OTCLK1           Tioock                0.391   address(6)
                                                       prog_count_reg_count_value(6)_repl2
    -------------------------------------------------  ---------------------------
    Total                                     14.261ns (6.967ns logic, 7.294ns route)
                                                       (48.9% logic, 51.1% route)

  Clock Path: clk to prog_count_reg_count_value(6)_repl2
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    P8.I                 Tiopi                 0.772   clk
                                                       clk
                                                       clk_ibuf/IBUFG
    BUFGMUX3.I0          net (fanout=1)        0.001   clk_ibuf/IBUFG
    BUFGMUX3.O           Tgi0o                 0.160   clk_ibuf/BUFG
                                                       clk_ibuf/BUFG
    F15.OTCLK1           net (fanouta)         0.535   clk_int
    -------------------------------------------------  ---------------------------
    Total                                      1.468ns (0.932ns logic, 0.536ns route)
                                                       (63.5% logic, 36.5% route)

Actually it might be worth pointing out to people that someone in Xilinx has written up the PicoBlaze for CPLD. This is a complete VHDL model and comes with source code for the compiler too! The reason is that the author wanted to allow users to add their own instructions. It's described in Xilinx Appnote 387 and the design files can be downloaded. I have simulated it and it works fine. I intend to try it out on the Spartan3 starter board as soon as I get some time!

Couple of things to note.

1. The RTL is written for CPLD not FPGA. The main issue here is that the arithmetic component uses normal and/or logic to implement it. This could be rewritten so that the synthesized logic gets mapped to the LUT and carry resources of the FPGA.

2. The RTL simulation gives lots of X's. This is not a problem; it's just that until all the registers have a value written to them they won't contain a value.

Hope that helps.

Cheers.

Robert.
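The arithmetic behind those report numbers can be written out as a small Python sketch. It assumes the clock runs at the reported minimum period (18.647 ns), as the slack figures in the explanation do; at a slower clock the board-level budgets grow accordingly. The function names are made up for illustration, and the slack formula is the one printed in the twr excerpt.

```python
def board_budget(period_ns, fpga_setup_ns, fpga_clk_to_out_ns):
    """Time left per clock for everything outside the FPGA.

    Returns (input_budget, output_budget) in ns:
      input_budget  - external clock-to-out plus pcb delay into the FPGA
      output_budget - pcb delay plus the external part's setup time
    """
    input_budget = period_ns - fpga_setup_ns
    output_budget = period_ns - fpga_clk_to_out_ns
    return input_budget, output_budget


def offset_in_slack(requirement_ns, data_path_ns, clock_path_ns,
                    clock_arrival_ns=0.0, uncertainty_ns=0.0):
    """Slack as printed in the report:
    requirement - (data path - clock path - clock arrival + uncertainty).
    Returns (slack, minimum_allowable_offset) in ns."""
    min_offset = (data_path_ns - clock_path_ns
                  - clock_arrival_ns + uncertainty_ns)
    return requirement_ns - min_offset, min_offset


if __name__ == "__main__":
    # T80 summary figures quoted earlier in the thread
    inp, out = board_budget(18.647, 1.333, 9.892)
    print(round(inp, 3), round(out, 3))

    # PicoBlaze worst case input pin from the twr excerpt
    slack, offset = offset_in_slack(100.0, 14.261, 1.468)
    print(round(slack, 3), round(offset, 3))
```

The output budget comes to a bit under 9 ns before subtracting the setup time of whatever part is on the other end, which is where the "about 7 ns" estimate would come from for a part needing a couple of nanoseconds of setup.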

Reply by rtstofer ● October 22, 2004

Robert, Thanks for the data. I have really been lazy about getting into the timing issue. Your explanation helps as it reinforces what I already suspected about setup and hold times. In terms of the T80 project, I need to get back and look at the SRAM. I must have been asleep not to notice the two cycle timing mentioned in Rick's post. I am not certain it applies but this is the second time I have seen reference to asymetric timing for SRAM. The other part of timing that I haven't thought about is the constraints. I need to understand how I tell the software that SRAM setup time is xx and SRAM access time is yy, etc. Then the timing analyzer could give better answers. Then the issue of slew rate on the outputs. Right now I am not using the fast slew rate. No particular reason, I just haven't seen the need. I guess if I really want to get some speed out of my project I am going to have to do a little more work. I also have to deal with the port timing spec of the IDE interface. At the speed I am running I don't have to deal with wait states. I'll have to look carefully at the timing spec if I try to get up around 50 MHz. One small step at a time... --- In , "Jeffery, Robert" <robert_jeffery@m...> wrote: > Hi rtstofer. > > Thought I might be able to help a liitle on the FPGA timing side. > > The timing numbers mean that the internal logic to the FPGA can run up > to a maximum of 53.628MHz for this design which is better than you > wanted. The other two numbers are looking at the data coming into and > going out of the device at the pins. So the Minimum input required time > before clock: 1.333ns means that data going into the FPGA must get > there at least 1.333 ns before the clock rises for at least one pin, > i.e. the worst case pin. That means you have (18.647 - 1.333) ns of > slack to play with from other parts driving across your pcb and into the > FPGA. 
Similarly the Maximum output delay after clock: 9.892ns means > that valid data is available on all output pins 9.892 ns after the clock > has risen. So that menas you have (18.647 - 9.892) ns to get across your > pcb and into any parts connected to the FPGA. Bearing in mind that IC's > will probably have a setup time it means you have about 7 ns of slack to > play with for the worst case output pin of the FPGA. > > To get information on all the pins you need to look at the twr file from > the timing tool. From that you will probably see that some output pins > are valid well before the 9.892 ns figure and some inputs might be able > to arrive later than 1.333 ns before the rising edge of the clock > although 1.333 ns isn't very long so I suspect that all your inputs must > be registered which is one way to ensure a short input setup time! > > Here is an example piece of a twr file for the worst case input pin, > instruction(9) on the Xilinx CPLD version of the Picoblaze. My clock is > set to 10MHz or 100 ns. The bottom line here is that my signal must > arrive at the FPGA pin no later than 12.793 ns before the rising edge of > the clock. ===================================================================== === > ======== > Timing constraint: TIMEGRP "ARVG0" OFFSET = IN 100 nS BEFORE COMP "clk" > ; > > 5729 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold > errors) > Minimum allowable offset is 12.793ns. 
> ------------------------------- ----- > -------- > Slack: 87.207ns (requirement - (data path - clock path > - clock arrival + uncertainty)) > Source: instruction(9) (PAD) > Destination: prog_count_reg_count_value(6)_repl2 (FF) > Destination Clock: clk_int rising at 0.000ns > Requirement: 100.000ns > Data Path Delay: 14.261ns (Levels of Logic = 10) > Clock Path Delay: 1.468ns (Levels of Logic = 2) > Clock Uncertainty: 0.000ns > > Data Path: instruction(9) to prog_count_reg_count_value(6)_repl2 > Location Delay type Delay(ns) Physical Resource > Logical > Resource(s) > ------------- > ------------------- > T8.I Tiopi 0.965 instruction (9) > instruction (9) > > instruction_ibuf(9) > SLICE_X26Y15.G3 net (fanout=9) 1.995 > instruction_int(9) > SLICE_X26Y15.Y Tilo 0.652 > stack_control_valid_to_move > ix34635z1412 > SLICE_X26Y15.F3 net (fanout=2) 0.019 nx34635z15 > SLICE_X26Y15.X Tilo 0.744 > stack_control_valid_to_move > ix57496z1530 > SLICE_X30Y17.F1 net (fanout) 1.802 > stack_control_valid_to_move > SLICE_X30Y17.X Tilo 0.744 nx34635z21 > ix34635z1575 > SLICE_X33Y18.G2 net (fanout=8) 1.010 nx34635z21 > SLICE_X33Y18.Y Tilo 0.631 nx34635z25 > ix34635z4588 > SLICE_X33Y18.F3 net (fanout=1) 0.007 nx34635z26 > SLICE_X33Y18.X Tilo 0.723 nx34635z25 > ix34635z1261 > SLICE_X32Y20.G2 net (fanout=1) 0.698 nx34635z25 > SLICE_X32Y20.COUT Topcyg 0.860 address_dup0 (0) > ix34635z1510 > ix34635z63346 > SLICE_X32Y21.CIN net (fanout=1) 0.000 ix34635z63346/O > SLICE_X32Y21.COUT Tbyp 0.170 address_dup0 (2) > ix34635z63345 > ix34635z63344 > SLICE_X32Y22.CIN net (fanout=1) 0.000 ix34635z63344/O > SLICE_X32Y22.COUT Tbyp 0.170 address_dup0 (4) > ix34635z63343 > ix34635z63342 > SLICE_X32Y23.CIN net (fanout=1) 0.000 ix34635z63342/O > SLICE_X32Y23.X Tcinx 0.917 address_dup0 (6) > ix37700z19564 > F15.O1 net (fanout=1) 1.763 nx37700z1 > F15.OTCLK1 Tioock 0.391 address(6) > > prog_count_reg_count_value(6)_repl2 > ------------- > --------------------------- > Total 14.261ns (6.967ns logic, > 
7.294ns route)
>                                              (48.9% logic, 51.1% route)
>
> Clock Path: clk to prog_count_reg_count_value(6)_repl2
> Location       Delay type         Delay(ns)  Physical Resource
>                                              Logical Resource(s)
> -------------  -------------------
> P8.I           Tiopi                 0.772   clk
>                                              clk
>                                              clk_ibuf/IBUFG
> BUFGMUX3.I0    net (fanout=1)        0.001   clk_ibuf/IBUFG
> BUFGMUX3.O     Tgi0o                 0.160   clk_ibuf/BUFG
>                                              clk_ibuf/BUFG
> F15.OTCLK1     net (fanouta)         0.535   clk_int
> -------------  ---------------------------
> Total                                1.468ns (0.932ns logic, 0.536ns route)
>                                              (63.5% logic, 36.5% route)
>
> Actually it might be worth pointing out to people that someone in Xilinx
> has written up the PicoBlaze for CPLD. This is a complete VHDL model and
> comes with source code for the compiler too! The reason is that the
> author wanted to allow users to add their own instructions. It's
> described in Xilinx Appnote 387 and the design files can be downloaded.
> I have simulated it and it works fine. I intend to try it out on the
> Spartan3 starter board as soon as I get some time!
>
> Couple of things to note.
>
> 1. The RTL is written for CPLD not FPGA. The main issue here is that the
> arithmetic component uses normal and/or logic to implement it. This
> could be written so that the logic synthesized gets mapped to the LUT
> and carry resources of the FPGA.
>
> 2. The RTL simulation gives lots of X's. This is not a problem; it's just
> that until all the registers have a value written to them they won't
> contain a value.
>
> Hope that helps.
>
> Cheers.
>
> Robert.
>
> -----Original Message-----
> From: rtstofer [mailto:rstofer@p...]
> Sent: 21 October 2004 20:40
> To:
> Subject: [fpga-cpu] Re: Floating Point Arithmetic
>
> > I was just a couple of years behind you. I did not know much about
> > computers when the Altair came out. A few years later when the Heathkit
> > version of the LSI-11 was available, I bought one. The design used a CPU
> > chip with several microcode prom chips. That made me think of writing my
> > own microcode to plug into the empty microcode chip socket. But I never
> > found the documentation I needed.
>
> I really wanted one of the LSI-11s but never could get myself in
> position to buy one. Western Digital made the chip and made a 'similar'
> (read identical) chip for the UCSD Pascal system. I think the system
> was called a Terak(?)
>
> > I am sure you can get much higher than 20 MHz even. The trick is to think
> > in terms of levels of 4 input LUTs and keep the number of levels down. In
> > my design I found the multiplexors to be real hogs, both the number of
> > levels and the number of LUTs in general. You may need to make some
> > tradeoffs between reducing the number of cycles for a given instruction and
> > the speed of all cycles. My suggestion is to minimize your complexity
> > first (giving you speed) and try to optimize individual instructions later
> > (by adding paths and special hardware).
>
> I am really new at the FPGA stuff and, while I can read the timing
> information, I don't know what it means. According to the timing
> reports the T80 has a maximum delay of 18+ nS:
>
> Timing summary:
> ---------------
>
> Timing errors: 0  Score: 0
>
> Constraints cover 851300 paths, 0 nets, and 11402 connections
>
> Design statistics:
>    Minimum period: 18.647ns (Maximum frequency: 53.628MHz)
>    Minimum input required time before clock: 1.333ns
>    Maximum output delay after clock: 9.892ns
>
> I don't know exactly what to do with this number. I know it says I can
> run 50+ MHz but I just have to believe there are a bunch of 'gotchas'.
> I am running the core at 12.5 MHz and, if I really thought I could kick
> it to 50, I would certainly like to do it.
>
> Any guidance here will be appreciated. I really don't have any idea how
> to figure the timing for FPGAs.
>
> > >I will look at the other processors. True, I want to roll my own but
> > >there are too many good ideas out there to just ignore them.
> >
> > I learned a lot from others' implementations. But it is often hard to
> > understand exactly what they did and why. It can be a lot of work just to
> > learn what the "state of the art" is in FPGA CPUs. You might do very well
> > just to learn about either the NIOS-II or the microBlaze. Understanding
> > either one of these will likely be a real education in FPGA optimization.
>
> I looked at the XSOC core and didn't so much 'give up' as just decide my
> simple project wasn't worth going to that level of caching, pipelining,
> register interlocking, etc. For a toy the good old fetch-decode-execute
> will be just fine. And, for an initial implementation, it will be
> difficult enough. Maybe later, when I know more about what I am
> doing...
>
> > Rick Collins
> >
> > rick.collins@a...
> >
> > Arius - A Signal Processing Solutions Company
> > Specializing in DSP and FPGA design http://www.arius.com
> > 4 King Ave 301-682-7772 Voice
> > Frederick, MD 21701-3110 301-682-7666 FAX
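One way to read the timing summary quoted above is as a simple budget. The sketch below is only a rough illustration, not a substitute for the tool's own analysis: it converts the reported minimum period to a frequency, computes the internal slack at 12.5 MHz and 50 MHz, and adds up a hypothetical single-cycle read from the external SRAM. The 10 ns SRAM access time is the figure mentioned earlier in the thread; the function names and the `board_delay_ns` parameter are assumptions made for this example, and treating the worst-case output and input figures as belonging to the SRAM pins is also an assumption.

```python
# Figures copied from the timing summary quoted in the thread.
MIN_PERIOD_NS = 18.647     # "Minimum period"
INPUT_SETUP_NS = 1.333     # "Minimum input required time before clock"
CLOCK_TO_OUT_NS = 9.892    # "Maximum output delay after clock"
SRAM_ACCESS_NS = 10.0      # SRAM speed grade mentioned earlier in the thread


def max_frequency_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def internal_slack_ns(clock_mhz: float) -> float:
    """Margin left on the slowest register-to-register path at clock_mhz."""
    return 1000.0 / clock_mhz - MIN_PERIOD_NS


def sram_read_round_trip_ns(board_delay_ns: float = 0.0) -> float:
    """Address out of the FPGA, through the SRAM, data back into the FPGA.

    board_delay_ns is a hypothetical allowance for trace delay (assumption).
    """
    return CLOCK_TO_OUT_NS + SRAM_ACCESS_NS + INPUT_SETUP_NS + board_delay_ns


if __name__ == "__main__":
    print(f"max internal clock: {max_frequency_mhz(MIN_PERIOD_NS):.3f} MHz")
    for mhz in (12.5, 50.0):
        period = 1000.0 / mhz
        fits = sram_read_round_trip_ns() <= period
        print(f"{mhz:5.1f} MHz: period {period:.3f} ns, "
              f"internal slack {internal_slack_ns(mhz):.3f} ns, "
              f"single-cycle SRAM read fits: {fits}")
```

Under these assumptions the internal logic has a little margin at 50 MHz, but the external round trip (output delay plus SRAM access plus input setup) is longer than one 20 ns period, which is one of the likely 'gotchas': the off-chip memory cycle, not the core, would set the limit unless wait states or a multi-clock access are used.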

Reply by rtstofer ●October 22, 2004

> ANDing the clock with the OE to keep reads to one clock cycle while writes

The software whines about gating with the clock. It's just a warning but, having no experience in such things, it keeps me aware that there is a potential issue.

> require two clocks each. Async SRAMS typically need a write pulse about
> the same width as the read address access time, but the output enable can
> be faster. I'll draw the timing.
>
>       | READ  |     WRITE     | READ  |
> CLK __----____----____----____----____----__
> A   ==x=======x===============x=======x=====
> CS- ---________________________________-----
> OE- -------____--------------------____-----
> WE- ---------------________-----------------
> D   -------<===>--<===========>----<===>---
>                 ^               ^
>                 Turn around times
>
> If you want to get fancy, back to back reads don't need to toggle the OE
> signal, but it will be more work on your part to do that and I don't know
> that it has much advantage, perhaps some power savings.

I must have missed the part in the datasheet dealing with asymmetric timing of read versus write. I have to get back into this as it may turn out that my system is working, but only because it is slow. But this is the second time I have seen references to multi-clock timing of writes. I have been treating it like a plain, vanilla, static ram, 2102 style. OOPS!

> >I searched around for info on asynchronous SRAM interfaces and the
> >best I could find was a deep, closely held secret (meaning I had to
> >spend $) at Xilinx. As far as I can tell, they used a very high
> >speed FSA (like 200 MHz, perhaps) to accomplish the same thing.
>
> I don't know what FSA means. The real problem is that async rams are
> *async* while most logic in an FPGA is synchronous. That makes it hard to
> set up the timing without using a lot of margin.

I tend to use the terms FSA and FSM interchangeably although it appears that FSM is more common around here.

> >I also found a rough calculation of the result of not worrying about
> >contention that indicated I could see a 1.1 degree C rise in
> >temperature if I just ignored the issue. Still, it doesn't seem
> >right to allow this to occur.
> >
> >Maybe this is a place where current limiting resistors in series
> >would be a quick fix. I'll have to think about that.
>
> Brief contention on the bus is not likely to be a reliability issue, but it
> is not hard to avoid. You still need to control the timing of the write
> enable to assure that the address and data busses are stable until the end
> of the write enable. I think preventing contention is not much more
> difficult. Remember, your rams are not sync and the outputs will have
> different delays and settling times. The write enable is your clock in
> this case.

I have to take a hard look at this. I know I didn't do it correctly although it works. It may turn out that, when handled properly, I can get some serious speed out of the T80 core.

I also need to look at the timing on the IDE interface. Right now I am just using a single machine cycle, no wait state. If I speed things up I will need to insert a wait but that is a cheap price to pay for the potential gain in speed.

A 50 MHz Z80? Now that would be interesting!

> Rick Collins
>
> rick.collins@a...
>
> Arius - A Signal Processing Solutions Company
> Specializing in DSP and FPGA design http://www.arius.com
> 4 King Ave 301-682-7772 Voice
> Frederick, MD 21701-3110 301-682-7666 FAX
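The two rules in the quoted advice (never let the SRAM and the FPGA drive the data bus at once, and keep the address stable until the write enable ends) can be checked mechanically against the ASCII timing diagram. The following is a minimal sketch under stated assumptions: the waveform strings are transcribed from the diagram with one character per time slot, `_` is taken to mean an active-low signal is asserted, `x` marks an address change, and all function names are made up for this example. It is only as precise as the drawing's one-character resolution.

```python
# Waveforms transcribed from the ASCII diagram in the post (one char per slot).
WAVES = {
    "A":   "==x=======x===============x=======x=====",
    "OE-": "-------____--------------------____-----",
    "WE-": "---------------________-----------------",
}


def low_spans(wave):
    """Return (start, end) index pairs where an active-low signal is '_'."""
    spans, start = [], None
    for i, ch in enumerate(wave + "-"):      # sentinel closes a trailing span
        if ch == "_" and start is None:
            start = i
        elif ch != "_" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def bus_contention(oe_wave, we_wave):
    """True if OE- and WE- are asserted in the same time slot."""
    return any(o[0] < w[1] and w[0] < o[1]
               for o in low_spans(oe_wave) for w in low_spans(we_wave))


def address_stable_during_write(addr_wave, we_wave, margin=1):
    """True if no address change falls within `margin` slots of a WE- pulse."""
    changes = [i for i, ch in enumerate(addr_wave) if ch == "x"]
    return all(not (start - margin <= c < end + margin)
               for start, end in low_spans(we_wave) for c in changes)


if __name__ == "__main__":
    print("contention:", bus_contention(WAVES["OE-"], WAVES["WE-"]))
    print("address stable during write:",
          address_stable_during_write(WAVES["A"], WAVES["WE-"]))
```

In the diagram as drawn, the write pulse sits in the middle of the two-clock write cycle, well away from both address changes and from either output-enable pulse, which is the point of spending the extra clock on writes.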
