EmbeddedRelated.com
Forums

Floating Point Arithmetic

Started by rtstofer October 19, 2004
At 12:14 PM 10/21/2004, you wrote:

>For the fun of it... I started wire-wrapping RTL back in the
>late '60s and I couldn't even imagine what it would take to build a
>real computer. I remember some of the early minicomputers that used
>several boards full of TTL and magnetic memory.
>
>I had a LOT of experience with the IBM 1130 and a ton of
>documentation. So, I took some 74xxx, some fusible proms and built
>an emulation. Memory, like the 2102, wasn't available in my price
>range for several more years and by that time the Altair 8800 was
>available so I gave up on the 1130.

I was just a couple of years behind you. I did not know much about
computers when the Altair came out. A few years later when the Heathkit
version of the LSI-11 was available, I bought one. The design used a CPU
chip with several microcode prom chips. That made me think of writing my
own microcode to plug into the empty microcode chip socket. But I never
found the documentation I needed.

>Today, a single chip can have an 1130 and a Z80 in the same package -
> absolutely amazing.
>
>I plan to get back to that 1130 and have started accumulating the
>documentation. Right after the P machine.
>
>I was sort of planning on a lot more speed than 10 MHz. Heck, the
>T80 core runs reliably at 14 MHz and could probably do more if I
>used the high slew rate on the external RAM. I was thinking about
>at least 20 MHz for the stack machine. The SRAM on the board is
>rated at 10 nS and the Xtal is 50 MHz.

I am sure you can get much higher than 20 MHz even. The trick is to think
in terms of levels of 4 input LUTs and keep the number of levels down. In
my design I found the multiplexors to be real hogs, both the number of
levels and the number of LUTs in general. You may need to make some
tradeoffs between reducing the number of cycles for a given instruction and
the speed of all cycles. My suggestion is to minimize your complexity
first (giving you speed) and try to optimize individual instructions later
(by adding paths and special hardware).

>I will look at the other processors. True, I want to roll my own
>but there are too many good ideas out there to just ignore them.

I learned a lot from others' implementations. But it is often hard to
understand exactly what they did and why. It can be a lot of work just to
learn what the "state of the art" is in FPGA CPUs. You might do very well
just to learn about either the NIOS-II or the microBlaze. Understanding
either one of these will likely be a real education in FPGA optimization.

Rick Collins
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX




> I was just a couple of years behind you. I did not know much about
> computers when the Altair came out. A few years later when the Heathkit
> version of the LSI-11 was available, I bought one. The design used a CPU
> chip with several microcode prom chips. That made me think of writing my
> own microcode to plug into the empty microcode chip socket. But I never
> found the documentation I needed.
>

I really wanted one of the LSI-11s but never could get myself in
position to buy one. Western Digital made the chip and made
a 'similar' (read identical) chip for the UCSD Pascal system. I
think the system was called a Terak(?)

> I am sure you can get much higher than 20 MHz even. The trick is to think
> in terms of levels of 4 input LUTs and keep the number of levels down. In
> my design I found the multiplexors to be real hogs, both the number of
> levels and the number of LUTs in general. You may need to make some
> tradeoffs between reducing the number of cycles for a given instruction and
> the speed of all cycles. My suggestion is to minimize your complexity
> first (giving you speed) and try to optimize individual instructions later
> (by adding paths and special hardware).
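
A rough way to see why wide multiplexors cost so much in LUT terms is sketched below in Python. It assumes the mux is built purely as a tree of 2:1 muxes, one 4-input LUT per 2:1 mux per data bit. The function name and that construction are illustrative assumptions, not anything from the design discussed here, and real tools can often do better by using dedicated mux resources, so treat the result as a pessimistic estimate.

```python
def mux_tree_cost(inputs: int, width: int = 1) -> tuple[int, int]:
    """Return (luts, levels) for an inputs-to-1 mux that is `width` bits wide.

    Assumption: only 2:1 muxes are used, each taking one 4-input LUT per bit.
    """
    if inputs < 1:
        raise ValueError("need at least one input")
    luts_per_bit = inputs - 1            # a binary tree with N leaves has N-1 nodes
    levels = (inputs - 1).bit_length()   # ceil(log2(inputs)), without floats
    return luts_per_bit * width, levels


for n in (2, 4, 8, 16):
    luts, levels = mux_tree_cost(n, width=16)
    print(f"{n:2d}:1 mux, 16 bits wide: {luts:3d} LUTs, {levels} levels")
```

Under these assumptions every doubling of the number of mux inputs adds one more LUT level to the path and roughly doubles the LUT count.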

I am really new at the FPGA stuff and, while I can read the timing
information, I don't know what it means. According to the timing
reports the T80 has a maximum delay of 18+ nS:

Timing summary:
---------------

Timing errors: 0 Score: 0

Constraints cover 851300 paths, 0 nets, and 11402 connections

Design statistics:
Minimum period: 18.647ns (Maximum frequency: 53.628MHz)
Minimum input required time before clock: 1.333ns
Maximum output delay after clock: 9.892ns

I don't know exactly what to do with this number. I know it says I
can run 50+ MHz but I just have to believe there are a bunch
of 'gotchas'. I am running the core at 12.5 MHz and, if I really
thought I could kick it to 50, I would certainly like to do it.

Any guidance here will be appreciated. I really don't have any idea
how to figure the timing for FPGAs.

>
>
> >I will look at the other processors. True, I want to roll my own
> >but there are too many good ideas out there to just ignore them.
>
> I learned a lot from others' implementations. But it is often hard to
> understand exactly what they did and why. It can be a lot of work just to
> learn what the "state of the art" is in FPGA CPUs. You might do very well
> just to learn about either the NIOS-II or the microBlaze. Understanding
> either one of these will likely be a real education in FPGA optimization.
>

I looked at the XSOC core and didn't so much 'give up' as just
decide my simple project wasn't worth going to that level of
caching, pipelining, register interlocking, etc. For a toy the good
old fetch-decode-execute will be just fine. And, for an initial
implementation, it will be difficult enough. Maybe later, when I
know more about what I am doing...





At 03:40 PM 10/21/2004, you wrote:

>I am really new at the FPGA stuff and, while I can read the timing
>information, I don't know what it means. According to the timing
>reports the T80 has a maximum delay of 18+ nS:
>
>Timing summary:
>---------------
>
>Timing errors: 0 Score: 0
>
>Constraints cover 851300 paths, 0 nets, and 11402 connections
>
>Design statistics:
> Minimum period: 18.647ns (Maximum frequency: 53.628MHz)
> Minimum input required time before clock: 1.333ns
> Maximum output delay after clock: 9.892ns
>
>I don't know exactly what to do with this number. I know it says I
>can run 50+ MHz but I just have to believe there are a bunch
>of 'gotchas'. I am running the core at 12.5 MHz and, if I really
>thought I could kick it to 50, I would certainly like to do it.
>
>Any guidance here will be appreciated. I really don't have any idea
>how to figure the timing for FPGAs.


If the report says it will run at 53 MHz, then it will. Of course that is
only considering your internal FF to FF delays. Unless you have added
timing constraints, the software does not know what your external timing is
like. Also, the tool does not try to optimize timing unless you give it a
constraint. So you might get a lot better than 53 MHz if you ask for
something better.

>I looked at the XSOC core and didn't so much 'give up' as just
>decide my simple project wasn't worth going to that level of
>caching, pipelining, register interlocking, etc. For a toy the good
>old fetch-decode-execute will be just fine. And, for an initial
>implementation, it will be difficult enough. Maybe later, when I
>know more about what I am doing...

I know what you mean. The XSOC RISC is also not very small. The small,
simple, highly optimized CPUs are called MISC and many are stack
oriented. Using the stack should make a CPU simpler and faster,
but it appears that in an FPGA, registers are basically free (speed wise)
if you use LUT ram or block ram due to their inherent speed. So a 16
register CPU can run as fast as a stack machine in an FPGA (depending on
your instruction encoding).

Rick Collins




One problem I had with the T80 core that will come up again is bus
turnaround or contention on the external ram data bus. Basically,
as long as the FPGA is driving the data bus toward the ram, no
problem. But, when I turn it around I need to shut off the tristate
buffers in the FPGA before I turn on output enable at the ram.

I couldn't come up with a neat way to do that so I did a huge no-no
and gated the last half of the clock cycle with the ram output
enable signal. This way the address bus was stable for most of the
cycle, the fpga buffers were turned off at the beginning of the
cycle and the ram turned on during the last half.

Given 15 nS ram, in this case, I just about doubled the access time
to 30 nS thus slowing the machine to about 33 MHz, all else being
equal.
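
The arithmetic behind that 33 MHz figure, as a small Python sketch. It only models the RAM access time from this post; FPGA output delay, board delay and input setup time are ignored, and the function name is just for illustration.

```python
def max_clock_mhz(cycle_time_ns: float) -> float:
    """Highest clock rate (in MHz) whose period is at least cycle_time_ns."""
    return 1000.0 / cycle_time_ns


ram_access_ns = 15.0               # RAM speed grade quoted above
cycle_ns = 2 * ram_access_ns       # RAM is only enabled for the last half of the cycle
print(f"cycle = {cycle_ns:.0f} ns, max clock = {max_clock_mhz(cycle_ns):.1f} MHz")
```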

I searched around for info on asynchronous SRAM interfaces and the
best I could find was a deep, closely held secret (meaning I had to
spend $) at Xilinx. As far as I can tell, they used a very high
speed FSA (like 200 MHz, perhaps) to accomplish the same thing.

I also found a rough calculation of the result of not worrying about
contention that indicated I could see a 1.1 degree C rise in
temperature if I just ignored the issue. Still, it doesn't seem
right to allow this to occur.

Maybe this is a place where current limiting resistors in series
would be a quick fix. I'll have to think about that.



RT-

> Maybe this is a place where current limiting resistors in series
> would be a quick fix. I'll have to think about that.

Exactly -- try the zero Rs.

-Jeff



At 06:39 PM 10/21/2004, you wrote:

>One problem I had with the T80 core that will come up again is bus
>turnaround or contention on the external ram data bus. Basically,
>as long as the FPGA is driving the data bus toward the ram, no
>problem. But, when I turn it around I need to shut off the tristate
>buffers in the FPGA before I turn on output enable at the ram.
>
>I couldn't come up with a neat way to do that so I did a huge no-no
>and gated the last half of the clock cycle with the ram output
>enable signal. This way the address bus was stable for most of the
>cycle, the fpga buffers were turned off at the beginning of the
>cycle and the ram turned on during the last half.
>
>Given 15 nS ram, in this case, I just about doubled the access time
>to 30 nS thus slowing the machine to about 33 MHz, all else being
>equal.

Some CPUs use the opposite edge of the clock to control the WE signal while
ANDing the clock with the OE to keep reads to one clock cycle while writes
require two clocks each. Async SRAMs typically need a write pulse about
the same width as the read address access time, but the output enable can
be faster. I'll draw the timing.

       | READ  |     WRITE     | READ  |
CLK  __----____----____----____----____----__
A    ==x=======x===============x=======x=====
CS-  ---________________________________-----
OE-  -------____--------------------____-----
WE-  ---------------________-----------------
D    -------<===>--<===========>----<===>---
                 ^               ^
                 Turn around times
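
To put numbers on the scheme in the diagram (one clock per read, two per write), here is a small Python sketch. The function names and the 50 MHz clock are illustrative assumptions; only the clock counts per operation come from the description above.

```python
READ_CLOCKS = 1    # reads complete in one clock
WRITE_CLOCKS = 2   # writes take two clocks


def bus_clocks(ops: str) -> int:
    """Total clocks for a sequence such as "RWR" (R = read, W = write)."""
    cost = {"R": READ_CLOCKS, "W": WRITE_CLOCKS}
    return sum(cost[op] for op in ops)


def bus_time_ns(ops: str, clock_mhz: float) -> float:
    """Elapsed bus time in ns for the sequence at the given clock rate."""
    return bus_clocks(ops) * 1000.0 / clock_mhz


# The read, write, read sequence drawn in the diagram, at an assumed 50 MHz.
print(bus_clocks("RWR"), "clocks,", bus_time_ns("RWR", 50.0), "ns")
```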

If you want to get fancy, back to back reads don't need to toggle the OE
signal, but it will be more work on your part to do that and I don't know
that it has much advantage, perhaps some power savings.

>I searched around for info on asynchronous SRAM interfaces and the
>best I could find was a deep, closely held secret (meaning I had to
>spend $) at Xilinx. As far as I can tell, they used a very high
>speed FSA (like 200 MHz, perhaps) to accomplish the same thing.

I don't know what FSA means. The real problem is that async rams are
*async* while most logic in an FPGA is synchronous. That makes it hard to
set up the timing without using a lot of margin.

>I also found a rough calculation of the result of not worrying about
>contention that indicated I could see a 1.1 degree C rise in
>temperature if I just ignored the issue. Still, it doesn't seem
>right to allow this to occur.
>
>Maybe this is a place where current limiting resistors in series
>would be a quick fix. I'll have to think about that.

Brief contention on the bus is not likely to be a reliability issue, but it
is not hard to avoid. You still need to control the timing of the write
enable to assure that the address and data busses are stable until the end
of the write enable. I think preventing contention is not much more
difficult. Remember, your rams are not sync and the outputs will have
different delays and settling times. The write enable is your clock in
this case.

Rick Collins



Hi rtstofer.

Thought I might be able to help a little on the FPGA timing side.

The timing numbers mean that the internal logic to the FPGA can run up
to a maximum of 53.628MHz for this design which is better than you
wanted. The other two numbers are looking at the data coming into and
going out of the device at the pins. So the Minimum input required time
before clock: 1.333ns means that data going into the FPGA must get
there at least 1.333 ns before the clock rises for at least one pin,
i.e. the worst case pin. That means you have (18.647 - 1.333) ns of
slack to play with from other parts driving across your pcb and into the
FPGA. Similarly the Maximum output delay after clock: 9.892ns means
that valid data is available on all output pins 9.892 ns after the clock
has risen. So that means you have (18.647 - 9.892) ns to get across your
pcb and into any parts connected to the FPGA. Bearing in mind that ICs
will probably have a setup time it means you have about 7 ns of slack to
play with for the worst case output pin of the FPGA.
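
The same arithmetic as a short Python sketch, using the three numbers from the timing summary. The function name is made up for illustration, and the second case (an 80 ns period, i.e. the 12.5 MHz clock currently in use) is just an extra data point.

```python
def board_budgets(clock_period_ns, setup_ns=1.333, clock_to_out_ns=9.892):
    """Return (input_budget_ns, output_budget_ns) for one clock period.

    input budget:  time an external part has to deliver data to the FPGA pin
    output budget: time left after FPGA data is valid, before the next edge
    """
    return clock_period_ns - setup_ns, clock_period_ns - clock_to_out_ns


min_period_ns = 18.647   # "Minimum period" from the report
print(f"fmax = {1000.0 / min_period_ns:.3f} MHz")

for period_ns in (min_period_ns, 80.0):
    inp, out = board_budgets(period_ns)
    print(f"period {period_ns:6.3f} ns: input budget {inp:6.3f} ns, "
          f"output budget {out:6.3f} ns")
```

The output budget still has to cover board delay plus the setup time of whatever part is receiving the data.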

To get information on all the pins you need to look at the twr file from
the timing tool. From that you will probably see that some output pins
are valid well before the 9.892 ns figure and some inputs might be able
to arrive later than 1.333 ns before the rising edge of the clock
although 1.333 ns isn't very long so I suspect that all your inputs must
be registered which is one way to ensure a short input setup time!

Here is an example piece of a twr file for the worst case input pin,
instruction(9) on the Xilinx CPLD version of the PicoBlaze. My clock is
set to 10MHz or 100 ns. The bottom line here is that my signal must
arrive at the FPGA pin no later than 12.793 ns before the rising edge of
the clock.

================================================================================
Timing constraint: TIMEGRP "ARVG0" OFFSET = IN 100 nS BEFORE COMP "clk" ;

 5729 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
 Minimum allowable offset is 12.793ns.
--------------------------------------------------------------------------------
Slack:              87.207ns (requirement - (data path - clock path
                              - clock arrival + uncertainty))
Source:             instruction(9) (PAD)
Destination:        prog_count_reg_count_value(6)_repl2 (FF)
Destination Clock:  clk_int rising at 0.000ns
Requirement:        100.000ns
Data Path Delay:    14.261ns (Levels of Logic = 10)
Clock Path Delay:   1.468ns (Levels of Logic = 2)
Clock Uncertainty:  0.000ns

Data Path: instruction(9) to prog_count_reg_count_value(6)_repl2
  Location          Delay type     Delay(ns)  Physical Resource
                                              Logical Resource(s)
  ------------------------------------------  ---------------------------
  T8.I              Tiopi              0.965  instruction(9)
                                              instruction(9)
                                              instruction_ibuf(9)
  SLICE_X26Y15.G3   net (fanout=9)     1.995  instruction_int(9)
  SLICE_X26Y15.Y    Tilo               0.652  stack_control_valid_to_move
                                              ix34635z1412
  SLICE_X26Y15.F3   net (fanout=2)     0.019  nx34635z15
  SLICE_X26Y15.X    Tilo               0.744  stack_control_valid_to_move
                                              ix57496z1530
  SLICE_X30Y17.F1   net (fanout)       1.802  stack_control_valid_to_move
  SLICE_X30Y17.X    Tilo               0.744  nx34635z21
                                              ix34635z1575
  SLICE_X33Y18.G2   net (fanout=8)     1.010  nx34635z21
  SLICE_X33Y18.Y    Tilo               0.631  nx34635z25
                                              ix34635z4588
  SLICE_X33Y18.F3   net (fanout=1)     0.007  nx34635z26
  SLICE_X33Y18.X    Tilo               0.723  nx34635z25
                                              ix34635z1261
  SLICE_X32Y20.G2   net (fanout=1)     0.698  nx34635z25
  SLICE_X32Y20.COUT Topcyg             0.860  address_dup0(0)
                                              ix34635z1510
                                              ix34635z63346
  SLICE_X32Y21.CIN  net (fanout=1)     0.000  ix34635z63346/O
  SLICE_X32Y21.COUT Tbyp               0.170  address_dup0(2)
                                              ix34635z63345
                                              ix34635z63344
  SLICE_X32Y22.CIN  net (fanout=1)     0.000  ix34635z63344/O
  SLICE_X32Y22.COUT Tbyp               0.170  address_dup0(4)
                                              ix34635z63343
                                              ix34635z63342
  SLICE_X32Y23.CIN  net (fanout=1)     0.000  ix34635z63342/O
  SLICE_X32Y23.X    Tcinx              0.917  address_dup0(6)
                                              ix37700z19564
  F15.O1            net (fanout=1)     1.763  nx37700z1
  F15.OTCLK1        Tioock             0.391  address(6)
                                              prog_count_reg_count_value(6)_repl2
  ------------------------------------------  ---------------------------
  Total                             14.261ns (6.967ns logic, 7.294ns route)
                                             (48.9% logic, 51.1% route)

Clock Path: clk to prog_count_reg_count_value(6)_repl2
  Location          Delay type     Delay(ns)  Physical Resource
                                              Logical Resource(s)
  ------------------------------------------  ---------------------------
  P8.I              Tiopi              0.772  clk
                                              clk
                                              clk_ibuf/IBUFG
  BUFGMUX3.I0       net (fanout=1)     0.001  clk_ibuf/IBUFG
  BUFGMUX3.O        Tgi0o              0.160  clk_ibuf/BUFG
                                              clk_ibuf/BUFG
  F15.OTCLK1        net (fanouta)      0.535  clk_int
  ------------------------------------------  ---------------------------
  Total                              1.468ns (0.932ns logic, 0.536ns route)
                                             (63.5% logic, 36.5% route)
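
As a cross-check, the headline numbers in this report can be recomputed from its own entries. The Python sketch below copies the delays from the data path listing and applies the slack formula printed in the report; the variable names are just for illustration.

```python
# Delays copied from the data path listing above, in ns.
logic_ns = [0.965, 0.652, 0.744, 0.744, 0.631, 0.723,
            0.860, 0.170, 0.170, 0.917, 0.391]
route_ns = [1.995, 0.019, 1.802, 1.010, 0.007, 0.698,
            0.000, 0.000, 0.000, 1.763]

data_path_ns = round(sum(logic_ns) + sum(route_ns), 3)
clock_path_ns = 1.468
clock_arrival_ns = 0.000
uncertainty_ns = 0.000
requirement_ns = 100.000

# requirement - (data path - clock path - clock arrival + uncertainty)
min_offset_ns = round(
    data_path_ns - clock_path_ns - clock_arrival_ns + uncertainty_ns, 3)
slack_ns = round(requirement_ns - min_offset_ns, 3)

print(f"logic {sum(logic_ns):.3f} ns, route {sum(route_ns):.3f} ns")
print(f"data path {data_path_ns:.3f} ns, min offset {min_offset_ns:.3f} ns, "
      f"slack {slack_ns:.3f} ns")
```

The minimum allowable offset is simply the data path delay minus the clock path delay, which is why a long combinatorial path from an input pad shows up as a large setup requirement at the pin.
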

Actually it might be worth pointing out to people that someone in Xilinx
has written up the PicoBlaze for CPLD. This is a complete VHDL model and
comes with source code for the compiler too! The reason is that the
author wanted to allow users to add their own instructions. It's
described in Xilinx Appnote 387 and the design files can be downloaded.
I have simulated it and it works fine. I intend to try it out on the
Spartan3 starter board as soon as I get some time!

Couple of things to note.

1. The RTL is written for CPLD not FPGA. The main issue here is that the
arithmetic component is implemented with plain AND/OR logic. This
could be rewritten so that the synthesized logic gets mapped to the LUT
and carry resources of the FPGA.

2. The RTL simulation gives lots of X's. This is not a problem; it's just
that the registers hold no defined value until something has been written
to them.

Hope that helps.

Cheers.

Robert.




Robert,

Thanks for the data. I have really been lazy about getting into the
timing issue. Your explanation helps as it reinforces what I
already suspected about setup and hold times.

In terms of the T80 project, I need to get back and look at the
SRAM. I must have been asleep not to notice the two cycle timing
mentioned in Rick's post. I am not certain it applies but this is
the second time I have seen reference to asymmetric timing for SRAM.

The other part of timing that I haven't thought about is the
constraints. I need to understand how I tell the software that SRAM
setup time is xx and SRAM access time is yy, etc. Then the timing
analyzer could give better answers.

Then the issue of slew rate on the outputs. Right now I am not
using the fast slew rate. No particular reason, I just haven't seen
the need.

I guess if I really want to get some speed out of my project I am
going to have to do a little more work. I also have to deal with
the port timing spec of the IDE interface. At the speed I am
running I don't have to deal with wait states. I'll have to look
carefully at the timing spec if I try to get up around 50 MHz.

One small step at a time...

--- In , "Jeffery, Robert"
<robert_jeffery@m...> wrote:
> Hi rtstofer.
>
> Thought I might be able to help a little on the FPGA timing side.
>
> The timing numbers mean that the internal logic to the FPGA can
run up
> to a maximum of 53.628MHz for this design which is better than you
> wanted. The other two numbers are looking at the data coming into
and
> going out of the device at the pins. So the Minimum input required
time
> before clock: 1.333ns means that data going into the FPGA must
get
> there at least 1.333 ns before the clock rises for at least one
pin,
> i.e. the worst case pin. That means you have (18.647 - 1.333) ns of
> slack to play with from other parts driving across your pcb and
into the
> FPGA. Similarly the Maximum output delay after clock: 9.892ns
means
> that valid data is available on all output pins 9.892 ns after the
clock
> has risen. So that menas you have (18.647 - 9.892) ns to get
across your
> pcb and into any parts connected to the FPGA. Bearing in mind that
IC's
> will probably have a setup time it means you have about 7 ns of
slack to
> play with for the worst case output pin of the FPGA.
>
> To get information on all the pins you need to look at the twr
file from
> the timing tool. From that you will probably see that some output
pins
> are valid well before the 9.892 ns figure and some inputs might be
able
> to arrive later than 1.333 ns before the rising edge of the clock
> although 1.333 ns isn't very long so I suspect that all your
inputs must
> be registered which is one way to ensure a short input setup time!
>
> Here is an example piece of a twr file for the worst case input
pin,
> instruction(9) on the Xilinx CPLD version of the Picoblaze. My
clock is
> set to 10MHz or 100 ns. The bottom line here is that my signal must
> arrive at the FPGA pin no later than 12.793 ns before the rising
edge of
> the clock. =====================================================================
===
> ========
> Timing constraint: TIMEGRP "ARVG0" OFFSET = IN 100 nS BEFORE
COMP "clk"
> ;
>
> 5729 items analyzed, 0 timing errors detected. (0 setup errors, 0
hold
> errors)
> Minimum allowable offset is 12.793ns.
> -------------------------------
-----
> --------
> Slack: 87.207ns (requirement - (data path - clock
path
> - clock arrival + uncertainty))
> Source: instruction(9) (PAD)
> Destination: prog_count_reg_count_value(6)_repl2 (FF)
> Destination Clock: clk_int rising at 0.000ns
> Requirement: 100.000ns
> Data Path Delay: 14.261ns (Levels of Logic = 10)
> Clock Path Delay: 1.468ns (Levels of Logic = 2)
> Clock Uncertainty: 0.000ns
>
> Data Path: instruction(9) to prog_count_reg_count_value(6)_repl2
> Location Delay type Delay(ns) Physical
Resource
> Logical
> Resource(s)
> -------------
> -------------------
> T8.I Tiopi 0.965 instruction
(9)
> instruction
(9)
>
> instruction_ibuf(9)
> SLICE_X26Y15.G3 net (fanout=9) 1.995
> instruction_int(9)
> SLICE_X26Y15.Y Tilo 0.652
> stack_control_valid_to_move
> ix34635z1412
> SLICE_X26Y15.F3 net (fanout=2) 0.019 nx34635z15
> SLICE_X26Y15.X Tilo 0.744
> stack_control_valid_to_move
> ix57496z1530
> SLICE_X30Y17.F1 net (fanout) 1.802
> stack_control_valid_to_move
> SLICE_X30Y17.X Tilo 0.744 nx34635z21
> ix34635z1575
> SLICE_X33Y18.G2 net (fanout=8) 1.010 nx34635z21
> SLICE_X33Y18.Y Tilo 0.631 nx34635z25
> ix34635z4588
> SLICE_X33Y18.F3 net (fanout=1) 0.007 nx34635z26
> SLICE_X33Y18.X Tilo 0.723 nx34635z25
> ix34635z1261
> SLICE_X32Y20.G2 net (fanout=1) 0.698 nx34635z25
> SLICE_X32Y20.COUT Topcyg 0.860 address_dup0
(0)
> ix34635z1510
>
ix34635z63346
> SLICE_X32Y21.CIN net (fanout=1) 0.000
ix34635z63346/O
> SLICE_X32Y21.COUT Tbyp 0.170 address_dup0
(2)
>
ix34635z63345
>
ix34635z63344
> SLICE_X32Y22.CIN net (fanout=1) 0.000
ix34635z63344/O
> SLICE_X32Y22.COUT Tbyp 0.170 address_dup0
(4)
>
ix34635z63343
>
ix34635z63342
> SLICE_X32Y23.CIN net (fanout=1) 0.000
ix34635z63342/O
> SLICE_X32Y23.X Tcinx 0.917 address_dup0
(6)
>
ix37700z19564
> F15.O1 net (fanout=1) 1.763 nx37700z1
> F15.OTCLK1 Tioock 0.391 address(6)
>
> prog_count_reg_count_value(6)_repl2
> -------------
> ---------------------------
> Total 14.261ns (6.967ns
logic,
> 7.294ns route)
> (48.9%
logic,
> 51.1% route)
>
> Clock Path: clk to prog_count_reg_count_value(6)_repl2
> Location Delay type Delay(ns) Physical
Resource
> Logical
> Resource(s)
> -------------
> -------------------
> P8.I Tiopi 0.772 clk
> clk
>
clk_ibuf/IBUFG
> BUFGMUX3.I0 net (fanout=1) 0.001
clk_ibuf/IBUFG
> BUFGMUX3.O Tgi0o 0.160
clk_ibuf/BUFG
>
clk_ibuf/BUFG
> F15.OTCLK1 net (fanouta) 0.535 clk_int
> -------------
> ---------------------------
> Total 1.468ns (0.932ns
logic,
> 0.536ns route)
> (63.5%
logic,
> 36.5% route) >
> Actually it might be worth pointing out to people that someone in
Xilinx
> has written up the PicoBlaze for CPLD. This is a complete VHDL
model and
> comes with source code for the compiler too! The reason is that the
> author wanted to allow users to add their own instructions. It's
> described in Xilinx Appnote 387 and the design files can be
downloaded.
> I have simulated it and it works fine. I intend to try it out on
the
> Spartan3 starter board aas soon as I get some time!
>
> Couple of things to note.
>
> 1. The RTL is written for CPLD not FPGA. The main issue here is
that the
> arithmetic component uses normal and/or logic to implement it. This
> could be written so that the logic synthesized gets mapped to the
LUT
> and carry resources of the FPGA.
>
> 2. The RTL simulation gives lots of X's. This is not a problem
it's just
> that until all the registers have a value written to them they
won't
> contain a value.
>
> Hope that helps.
>
> Cheers.
>
> Robert. > -----Original Message-----
> From: rtstofer [mailto:rstofer@p...]
> Sent: 21 October 2004 20:40
> To:
> Subject: [fpga-cpu] Re: Floating Point Arithmetic >
> > I was just a couple of years behind you. I did not know much
> about
> > computers when the Altair came out. A few years later when the
> Heathkit
> > version of the LSI-11 was available, I bought one. The design
> used a CPU
> > chip with several microcode prom chips. That made me think of
> writing my
> > own microcode to plug into the empty microcode chip socket. But
I
> never
> > found the documentation I needed.
> >
>
> I really wanted one of the LSI-11s but never could get myself in
> position to buy one. Western Digital made the chip and made
a 'similar'
> (read identical) chip for the UCSD Pascal system. I think the
system
> was called a Terak(?) > > I am sure you can get much higher than 20 MHz even. The trick is
> to think
> > in terms of levels of 4 input LUTs and keep the number of levels
> down. In
> > my design I found the multiplexors to be real hogs, both the
> number of
> > levels and the number of LUTs in general. You may need to make
> some
> > tradeoffs between reducing the number of cycles for a given
> instruction and
> > the speed of all cycles. My suggestion is to minimize your
> complexity
> > first (giving you speed) and try to optimize individual
> instructions later
> > (by adding paths and special hardware).
>
> I really new at the FPGA stuff and, while I can read the timing
> information, I don't know what it means. According to the timing
> reports the T80 has a maximum delay of 18+ nS:
>
> Timing summary:
> ---------------
>
> Timing errors: 0 Score: 0
>
> Constraints cover 851300 paths, 0 nets, and 11402 connections
>
> Design statistics:
> Minimum period: 18.647ns (Maximum frequency: 53.628MHz)
> Minimum input required time before clock: 1.333ns
> Maximum output delay after clock: 9.892ns
>
> I don't know exactly what to do with this number. I know it says I can
> run 50+ MHz but I just have to believe there are a bunch of 'gotchas'.
> I am running the core at 12.5 MHz and, if I really thought I could kick
> it to 50, I would certainly like to do it.
>
> Any guidance here will be appreciated. I really don't have any idea how
> to figure the timing for FPGAs.
>
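
As a rough aid to reading the timing report quoted above, the sketch below turns its figures into slack at a chosen clock rate. The three report values and the 10 ns SRAM rating come from the thread. The idea that a one-clock SRAM read has to fit the worst-case output delay, the SRAM access time and the input setup into a single period is an assumption made for illustration, and board trace delay is ignored.

```python
# Numbers copied from the quoted timing report and the 10 ns SRAM rating
# mentioned earlier in the thread.
MIN_PERIOD_NS = 18.647   # "Minimum period"
CLK_TO_OUT_NS = 9.892    # "Maximum output delay after clock"
INPUT_SETUP_NS = 1.333   # "Minimum input required time before clock"
SRAM_ACCESS_NS = 10.0    # SRAM rating; board trace delay is ignored


def max_frequency_mhz(min_period_ns):
    """Clock rate at which the slowest internal path just meets timing."""
    return 1000.0 / min_period_ns


def slack_ns(clock_mhz, path_ns):
    """Time left over in one clock period; negative means the path is too slow."""
    return 1000.0 / clock_mhz - path_ns


# Assumption (not from the report): a one-clock SRAM read must fit address
# out, SRAM access and data setup into a single clock period.
external_read_ns = CLK_TO_OUT_NS + SRAM_ACCESS_NS + INPUT_SETUP_NS

if __name__ == "__main__":
    print(f"fmax = {max_frequency_mhz(MIN_PERIOD_NS):.3f} MHz")
    for mhz in (12.5, 50.0):
        print(f"{mhz} MHz: internal slack {slack_ns(mhz, MIN_PERIOD_NS):.3f} ns, "
              f"one-clock SRAM read slack {slack_ns(mhz, external_read_ns):.3f} ns")
```

With these figures the internal logic has about 1.35 ns to spare at 50 MHz, while the assumed one-clock external read comes out about 1.2 ns short. That would be one candidate for the 'gotchas': the minimum period only describes paths inside the FPGA.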
> >
> >
> > >I will look at the other processors. True, I want to roll my own but
> > >there are too many good ideas out there to just ignore them.
> >
> > I learned a lot from others' implementations. But it is often hard to
> > understand exactly what they did and why. It can be a lot of work just to
> > learn what the "state of the art" is in FPGA CPUs. You might do very well
> > just to learn about either the NIOS-II or the microBlaze. Understanding
> > either one of these will likely be a real education in FPGA optimization.
> >
>
> I looked at the XSOC core and didn't so much 'give up' as just decide my
> simple project wasn't worth going to that level of caching, pipelining,
> register interlocking, etc. For a toy the good old fetch-decode-execute
> will be just fine. And, for an initial implementation, it will be
> difficult enough. Maybe later, when I know more about what I am
> doing...
>
> >
> >
> > Rick Collins
> >
> > rick.collins@a...
> >
> > Arius - A Signal Processing Solutions Company
> > Specializing in DSP and FPGA design http://www.arius.com
> > 4 King Ave 301-682-7772 Voice
> > Frederick, MD 21701-3110 301-682-7666 FAX




> ANDing the clock with the OE to keep reads to one clock cycle while writes

The software whines about gating with the clock. It's just a
warning but, having no experience in such things, it keeps me aware
that there is a potential issue.

> require two clocks each. Async SRAMS typically need a write pulse about
> the same width as the read address access time, but the output enable can
> be faster. I'll draw the timing.
>
>       | READ  |     WRITE     | READ  |
> CLK __----____----____----____----____----__
> A   ==x=======x===============x=======x=====
> CS- ---________________________________-----
> OE- -------____--------------------____-----
> WE- ---------------________-----------------
> D   -------<===>--<===========>----<===>---
>                 ^               ^
>                 Turn around times
>
> If you want to get fancy, back to back reads don't need to toggle the OE
> signal, but it will be more work on your part to do that and I don't know
> that it has much advantage, perhaps some power savings.
>

I must have missed the part in the datasheet dealing with asymmetric
timing of read versus write. I have to get back into this as it may
turn out that my system is working, but just because it is slow.
But this is the second time I have seen references to multi-clock
timing of writes. I have been treating it like a plain, vanilla,
static ram, 2102 style. OOPS!
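
The read/write asymmetry in the quoted diagram can be written down as a table of half-clock phases, which makes it easier to see where OE- and WE- sit and where the bus turns around. This is only a sketch read off the drawing: the small edge delays are ignored, and the names `PHASES`, `schedule`, `clocks_used` and `bus_is_clean` are made up for the example.

```python
# Half-clock phases per access, read off the quoted diagram.
# Each entry is (CLK, OE-, WE-, data_driver); control lines are active low.
PHASES = {
    # Read: one clock. OE- is low only while CLK is low (clock gated with OE).
    "READ": [(1, 1, 1, None), (0, 0, 1, "sram")],
    # Write: two clocks. WE- is low for the middle two phases and the FPGA
    # keeps driving data for one more phase after WE- goes high.
    "WRITE": [(1, 1, 1, None), (0, 1, 0, "fpga"),
              (1, 1, 0, "fpga"), (0, 1, 1, "fpga")],
}


def schedule(ops):
    """Expand a list of "READ"/"WRITE" accesses into half-clock phases."""
    return [(op,) + phase for op in ops for phase in PHASES[op]]


def clocks_used(ops):
    """Whole clock periods consumed: one per read, two per write."""
    return sum(len(PHASES[op]) for op in ops) // 2


def bus_is_clean(phases):
    """Check that OE- and WE- are never low together and that an undriven
    phase separates SRAM-driven and FPGA-driven phases (turn-around time)."""
    previous = None
    for _op, _clk, oe_n, we_n, driver in phases:
        if oe_n == 0 and we_n == 0:
            return False
        if driver and previous and driver != previous:
            return False
        previous = driver
    return True


if __name__ == "__main__":
    ops = ["READ", "WRITE", "READ"]
    for op, clk, oe_n, we_n, driver in schedule(ops):
        print(f"{op:5s} CLK={clk} OE-={oe_n} WE-={we_n} D={driver}")
    print("clocks:", clocks_used(ops), "clean:", bus_is_clean(schedule(ops)))
```

The READ, WRITE, READ sequence from the drawing takes four clocks in this model, and each change of bus driver is separated by a phase in which nobody drives the data lines.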

>
> >I searched around for info on asynchronous SRAM interfaces and the
> >best I could find was a deep, closely held secret (meaning I had to
> >spend $) at Xilinx. As far as I can tell, they used a very high
> >speed FSA (like 200 MHz, perhaps) to accomplish the same thing.
>
> I don't know what FSA means. The real problem is that async rams are
> *async* while most logic in an FPGA is synchronous. That makes it hard to
> set up the timing without using a lot of margin.

I tend to use the terms FSA and FSM interchangeably although it
appears that FSM is more common around here.

>
>
> >I also found a rough calculation of the result of not worrying about
> >contention that indicated I could see a 1.1 degree C rise in
> >temperature if I just ignored the issue. Still, it doesn't seem
> >right to allow this to occur.
> >
> >Maybe this is a place where current limiting resistors in series
> >would be a quick fix. I'll have to think about that.
>
> Brief contention on the bus is not likely to be a reliability issue, but it
> is not hard to avoid. You still need to control the timing of the write
> enable to assure that the address and data busses are stable until the end
> of the write enable. I think preventing contention is not much more
> difficult. Remember, your rams are not sync and the outputs will have
> different delays and settling times. The write enable is your clock in
> this case.

I have to take a hard look at this. I know I didn't do it correctly
although it works. It may turn out that, when handled properly, I
can get some serious speed out of the T80 core. I also need to look
at the timing on the IDE interface. Right now I am just using a
single machine cycle, no wait state. If I speed things up I will
need to insert a wait but that is a cheap price to pay for the
potential gain in speed.
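
For the wait-state question, the arithmetic is simple enough to sketch. The values below are placeholders only: `REQUIRED_STROBE_NS` and `BASE_STROBE_CLOCKS` are not taken from the IDE specification or from the T80 core, so the real numbers have to come from the drive and core documentation.

```python
import math

# Placeholder values for illustration only (assumptions, not datasheet figures).
REQUIRED_STROBE_NS = 150.0   # hypothetical minimum read/write strobe width
BASE_STROBE_CLOCKS = 2       # hypothetical strobe length with no wait states


def wait_states(clock_mhz, required_ns=REQUIRED_STROBE_NS,
                base_clocks=BASE_STROBE_CLOCKS):
    """Extra clocks needed so the strobe lasts at least required_ns."""
    period_ns = 1000.0 / clock_mhz
    needed_clocks = math.ceil(required_ns / period_ns)
    return max(0, needed_clocks - base_clocks)


if __name__ == "__main__":
    for mhz in (12.5, 25.0, 50.0):
        print(f"{mhz} MHz -> {wait_states(mhz)} wait state(s)")
```

With these placeholder numbers no wait is needed at 12.5 MHz, but the count grows as the clock period shrinks, so the real figure at 50 MHz depends entirely on the actual strobe requirement.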

A 50 MHz Z80? Now that would be interesting!
