Small CPUs in FPGAs

Started by Rick Collins July 31, 2008
--- In f..., "Jan Gray" wrote:
>
> > How ever I differ with your view point of just what a LUT is.
>
> Well, these are Virtex-era LUTs, not XC4000-era LUTs. No H-LUTs are assumed
> or harmed. We do use all the carry-logic, MULT_ANDs, etc. that we can, esp.
> to build functions like
>
> MUX(A,B+C)
> or
> MUX4(A+B,A-B,A&B,A^B)
>
> in one column of LUTs (1 LUT per bit).
>
> No TBUFs (alas).
>
> I'm sorry I'm about to be away from the computer for two days but have fun
> with the discussion thread.

Yes, today's FPGAs no longer include T-bufs, so multiplexers have to be
used instead, which increases the LUT count.
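
For readers unfamiliar with the one-LUT-per-bit functions quoted above,
here is a minimal bit-level software model of MUX(A,B+C). It is only a
sketch: the function name mux_add_column and the exact mapping onto the
LUT, MULT_AND and carry-chain multiplexer/XOR are assumptions made for
illustration, not a description of Jan's actual design.

```python
def mux_add_column(sel, a, b, c, width=16):
    """Model one column of 4-input LUTs plus the carry chain.

    sel = 0 -> result is a
    sel = 1 -> result is (b + c) modulo 2**width
    """
    carry = 0          # carry-in at the bottom of the column
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        ci = (c >> i) & 1
        # The 4-input LUT sees (sel, a, b, c) for this bit.
        lut = (bi ^ ci) if sel else ai
        # Assumed MULT_AND use: gate the carry-generate input with sel,
        # so no carry is ever generated while a is being passed through.
        di = bi & sel
        # Carry-chain XOR produces the output bit from the old carry.
        result |= (lut ^ carry) << i
        # Carry-chain mux: propagate the carry if lut is 1, else take di.
        carry = carry if lut else di
    return result


if __name__ == "__main__":
    print(hex(mux_add_column(0, 0x1234, 0x00FF, 0x0001)))
    print(hex(mux_add_column(1, 0x1234, 0x00FF, 0x0001)))
```

With sel = 0 the carry stays at zero all the way up the column, so each
output bit is simply the corresponding bit of A. That is why the gating
of the carry-generate input matters in this model.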

Jan, are you willing to release these designs for use in commercial
apps? I have tried to contact you here and I did not get a reply.
Except for the use of T-bufs, your designs appear to be the best
available in terms of performance for a small size. I would like to
know more, but my application is commercial. Should I stop
considering your designs or will I be able to get permission to use them?



To post a message, send it to: f...
To unsubscribe, send a blank message to: f...
Rick,

This is the device utilization summary for Webpack ISE 7.1 for different
size address and data bus widths of my CPU. It only mentions slices and
4 input LUTs. It's probably not that different from what you have rolled
yourself already.

8 Bit Address and Data:

Device utilization summary:
---------------------------

Selected Device : 3s200ft256-4

Number of Slices: 178 out of 1920 9%
Number of Slice Flip Flops: 74 out of 3840 1%
Number of 4 input LUTs: 328 out of 3840 8%
Number of bonded IOBs: 33 out of 173 19%
Number of MULT18X18s: 1 out of 12 8%
Number of GCLKs: 1 out of 8 12%

12 Bit Address and Data

Device utilization summary:
---------------------------

Selected Device : 3s200ft256-4

Number of Slices: 217 out of 1920 11%
Number of Slice Flip Flops: 99 out of 3840 2%
Number of 4 input LUTs: 402 out of 3840 10%
Number of bonded IOBs: 45 out of 173 26%
Number of MULT18X18s: 1 out of 12 8%
Number of GCLKs: 1 out of 8 12%

16 Bit Address and Data bus

Device utilization summary:
---------------------------

Selected Device : 3s200ft256-4

Number of Slices: 255 out of 1920 13%
Number of Slice Flip Flops: 124 out of 3840 3%
Number of 4 input LUTs: 475 out of 3840 12%
Number of bonded IOBs: 57 out of 173 32%
Number of MULT18X18s: 1 out of 12 8%
Number of GCLKs: 1 out of 8 12%

24 Bit Address and Data

Device utilization summary:
---------------------------

Selected Device : 3s200ft256-4

Number of Slices: 368 out of 1920 19%
Number of Slice Flip Flops: 160 out of 3840 4%
Number of 4 input LUTs: 672 out of 3840 17%
Number of bonded IOBs: 81 out of 173 46%
Number of MULT18X18s: 1 out of 12 8%
Number of GCLKs: 1 out of 8 12%

32 Bit Address and Data

Device utilization summary:
---------------------------

Selected Device : 3s200ft256-4

Number of Slices: 484 out of 1920 25%
Number of Slice Flip Flops: 208 out of 3840 5%
Number of 4 input LUTs: 887 out of 3840 23%
Number of bonded IOBs: 105 out of 173 60%
Number of MULT18X18s: 1 out of 12 8%
Number of GCLKs: 1 out of 8 12%

Rick Collins wrote:
> --- In f..., John Kent wrote:
>
> Earlier I mentioned designs that used only 250 or so LUTs. I also
> said that a design with 1000's of LUTs is too large. I have rolled my
> own 16 bit CPU that uses around 600 LUTs. So far the really small
> CPUs are *very* limited and are only supported with an assembler. To
> get HLL support the designs get over 1000 LUTs.
>
>
I guess that is the trade-off you have to make then. What I have done is
use Douglas Jones' Smal32 macro assembler and simply generate the
instructions in 32 bit words shifted up by the appropriate data bus
width. I still need to write a Block RAM generator that can split the
32 bit Smal32 file into nibbles or bytes. Obviously there is no high
level language support. The easiest approach to software development
would be to write a virtual machine for Forth or P-Code. I remember
implementing FIG Forth on the 6800 back in the dim dark ages. The
virtual machine was only 1 KByte in size, so a VM approach lets you
leverage a lot of existing development work.

> I have no understanding of how you measured gates or how that would be
> used to evaluate a design. The only common denominator for use in
> FPGAs is the 4 input LUT. Register usage is typically much less than
> LUTs so for the most part it is not important to consider registers.
>
>
It was just taken from the Synthesis report that ISE spat out. The
report also gave an estimate of gate count.

>
>> When I looked at the LUTs and Slice Flip Flops I got an over head of
>> 138.5 (4 input) LUTS and 32.5 Slice Flip flops and per bit I got 21.125
>> LUTs and 5.625 Slice FFs. This implies only a 20% overhead for a 32 bit
>> CPU, so I'm not too sure what to make of the ISE report figures, but on
>> average it's probably not far from what bfranchuk (woodelf?) was saying
>> about 2/3 data path and 1/3 control.
>>
>
> It sounds like you are using Xilinx devices. When you count LUTs, are
> you counting *all* LUTs or just the ones reported by Xilinx tools as 4
> input LUTs? For whatever reason, they break LUTs into separate counts
> of 4 input LUTs, 3 input LUTs and several types of LUT based RAM. The
> true count of LUTs is the sum of all these.
>
>
OK ... I'm not sure how Xilinx reports the use of 3 input LUTs and RAM
LUTs. I've just included a summary of the utilization above, though that
might not be enough.

> If I understand your numbers, a 16 bit design uses 168.5 LUTs and a 32
> bit design uses 1,178.5 LUTS. I assume the fractional LUT counts were
> obtained by generating two designs and projecting a straight line
> through them. Have you tried this a third number of bits, say 8, 16
> and 32 bits? I would be interested in seeing how the size ranges and
> if it still fits the projection.
>
>
I'm not sure how you get those figures:
138.5 LUTs + (21.125 * No. of Bits)

16 bits => 476.5 LUTs
32 bits => 814.5 LUTs

The fractional counts resulted from averaging over the 5 versions of the
design: 8 bit, 12 bit, 16 bit, 24 bit and 32 bit.
I worked out the difference in the number of slices and divided that by
the difference in the number of bits to get a slices-per-bit count.
Extrapolating that back to zero bits gives the fixed overhead of the
design.
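
As a cross-check, the overhead-plus-per-bit model can be applied to the
4 input LUT counts in the five utilization summaries at the top of this
post. The sketch below evaluates the quoted formula and, separately,
fits a least-squares straight line to the five data points. The
least-squares fit is a substitute technique chosen for illustration, not
the averaging of differences described above, so its coefficients need
not match the quoted 138.5 and 21.125. The names quoted_model and
fit_line are hypothetical.

```python
# Bus widths and "Number of 4 input LUTs" from the five ISE summaries.
BITS = [8, 12, 16, 24, 32]
LUTS = [328, 402, 475, 672, 887]


def quoted_model(bits):
    """The formula from the post: fixed overhead plus LUTs per bit."""
    return 138.5 + 21.125 * bits


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


overhead, per_bit = fit_line(BITS, LUTS)

if __name__ == "__main__":
    print("least-squares overhead: %.1f LUTs" % overhead)
    print("least-squares per bit:  %.2f LUTs" % per_bit)
    for bits, luts in zip(BITS, LUTS):
        print(bits, luts, quoted_model(bits),
              round(overhead + per_bit * bits, 1))
```

Printing the reported count next to both predictions for each width
makes it easy to see how well a straight line describes the design.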

Anyway ... I've spent enough time on this project ... I'll leave you to
sort out your problems. My design is not really a commercial design in
that it has not undergone rigorous testing. I was essentially verifying
Woodelf's assertion of 2/3 utilization for data path and 1/3 for
control, but it depends very much on the design and the number of bits.

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent




Rick,

Do you really think there is a market for a cross platform processor
core ? I have not looked at the JOP, Java Oriented Processor, but if
that can support C as well as Java, that might be the way to go. I'm not
sure how scalable that design is or what software has been developed for
it. I was looking at a stack based design. It's probably not as
efficient in speed as a pipelined register based general RISC
architecture, but it might be more efficient from an FPGA resource point
of view.

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


> Do you really think there is a market for a cross platform processor
> core ? I have not looked at the JOP, Java Oriented Processor, but if
> that can support C as well as Java, that might be the way to go. I'm not

JOP executes Java bytecodes only, not C. There have been projects
that try to compile C to Java bytecode, but I don't know how
effective this is.

> sure how scalable that design is or what software has been developed for

What do you mean by scalable? With respect to SW: there are a few
industrial applications written for JOP and in use. Nothing big,
very control oriented.

> it. I was looking at a stack based design. It's probably not as
> efficient in speed as a pipelined register based general RISC
> architecture, but it might be more efficient from an FPGA resource point
> of view.

The drawbacks of stack-based designs are well known. The main benefit of
a stack with two explicit TOS and NOS registers is that you can merge EX
and WB into a single pipeline stage. You need no forwarding at all :-)
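
The point about TOS and NOS can be illustrated with a small software
model. This is a sketch only, assuming a generic stack core rather than
JOP itself; the class StackCore and its methods are hypothetical names.
Both ALU operands always come from the two top-of-stack registers and
the result goes straight back into TOS in the same step, so there is no
later write-back for a following instruction to wait on.

```python
import operator


class StackCore:
    """Toy stack machine with explicit TOS and NOS registers."""

    def __init__(self):
        self.tos = 0      # top of stack register
        self.nos = 0      # next of stack register
        self.mem = []     # rest of the stack, e.g. on-chip RAM

    def push(self, value):
        # Spill NOS to stack memory and shift the registers down.
        self.mem.append(self.nos)
        self.nos = self.tos
        self.tos = value

    def alu(self, op):
        # Execute and write back in a single step: the result lands in
        # TOS, and NOS is refilled from stack memory.
        self.tos = op(self.nos, self.tos)
        self.nos = self.mem.pop() if self.mem else 0


if __name__ == "__main__":
    core = StackCore()
    core.push(2)
    core.push(3)
    core.alu(operator.add)   # 2 + 3
    core.push(4)
    core.alu(operator.mul)   # (2 + 3) * 4
    print(core.tos)
```

In a register machine the result of the add would have to be forwarded
to the multiply if the two were back to back in the pipeline. Here the
multiply simply reads TOS.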

Martin



Rick Collins wrote
>Earlier I mentioned designs that used only 250 or so LUTs. I also
>said that a design with 1000's of LUTs is too large. I have rolled my
>own 16 bit CPU that uses around 600 LUTs. So far the really small
>CPUs are *very* limited and are only supported with an assembler. To
>get HLL support the designs get over 1000 LUTs.

Thanks for the gauntlet :-)

Well, I went for a dig in my deep archive and found the following.
Not available for commercial use (sorry Rick), but another example
of what can be done.

The Risc4005 is now 17 years old. It was designed prior to the first
commercial release of the Xilinx XC4000 family, and fitted into the
first product, the XC4005: a 14 by 14 array of CLBs, each with 2 4-LUTs,
2 FFs, 2 TBUFs, and 2 bits of carry chain.

Here is the PPR report:

Partitioned Design Utilization Using Part 4005PG156-5

                               No. Used   Max Available   % Used
----------------------------   --------   -------------   ------
Occupied CLBs                       155             196      79%
Packed CLBs                         130             196      66%
----------------------------   --------   -------------   ------
Bonded I/O Pins:                     81             112      72%
F and G Function Generators:        261             392      66%
H Function Generators:               27             196      13%
CLB Flip Flops:                     196             392      50%
IOB Input Flip Flops:                 4             112       3%
IOB Output Flip Flops:               48             112      42%
Memory Write Controls:               16             196       8%
3-State Buffers:                    144             448      32%
3-State Half Longlines:              32              56      57%
Edge Decode Inputs:                  12             168       7%
Edge Decode Half Longlines:           2              32       6%

Software for the Risc4005 included a cross assembler with very fancy
macro capability, and a C compiler, which was a port of LCC. The
compiler port took me about 2 or 3 days. What was missing was a
relocatable binary format, and a linker/librarian. That didn't stop me
from compiling and running C programs. Linking was done at compile time
with #includes of source files. Not pretty, but it worked.

From 2001, I wrote the following (with a few edits):

==================
For those of you who don't know,
the RISC4005 was designed in 1991, and fitted into about 140
CLBs in the original 4005 (no dual port RAM or Sync RAM write
in those days). So, about 260 LUTs/FFs. Ran at 20MHz in 1991
vintage technology. I would expect that it would run at 200+ MHz
in current technology (16 bit ALU, 4 deep pipeline, reg forwarding,
2 delay slots on branch, 1 on mem read).

Jan and I have become very good friends, and it started when
he worked on the precursor to the XR16. The similarities between
the architecture that he developed and the RISC4005 were
extensive, and allowed for significant transfer of info in both
directions.

But, Jan took his design far further than I did. He fleshed out the
C compiler, and added a nice simulator. His articles in Circuit Cellar
INK, and the forming of this news group, together with his web site
demonstrate an energy and enthusiasm far beyond my burnt out
condition on this subject. While I might have the stripes for the
first CPU implemented in an FPGA, Jan's work is I believe far more
significant. Thanks for keeping the flame burning Jan.

I was going to write a paragraph or two on microblaze, but I came
to my senses.
==================

And to my surprise, my design got mentioned in the August 2008
issue of Circuit Cellar magazine, in an article by Tom Cantrell :-)
(I bet I now owe him a dinner and a beer)

Cheers,
Philip
================
Philip Freidin
p...@fliptronics.com



--- In f..., Philip Freidin wrote:
> Thanks for the gauntlet :-)
>
> Well, I went for a dig in my deep archive and found the following.
> Not available for commercial use (sorry Rick), but another example
> of what can be done.

Philip,

I wandered around your web site this morning looking for the RISC4005
project but didn't see it. Is there a link?

Richard



Hi Martin,

Martin Schoeberl wrote:
> What do you mean by scalable? With respect to SW: there are a few
> industrial applications written for JOP and in use. Nothing big,
> very control oriented.
>
>
I was wondering how a stack oriented machine would handle multitasking,
that is, saving the context of the machine and restoring it.
I'm inclined to think you'd need a supervisory state like the 68000 to
support the swapping of TOS, NOS, FP, SP and so on.

> The stack based drawbacks are well known. The main benefit of a stack
> with two explicite TOS and NOS registers is that you can merge EX and
> WB into a single pipeline stage. You need no forwarding at all :-)
>
> Martin

I was thinking of using the argument and local variable stack for
working space, but using the same stack for work space would preclude
the allocation of new local variables, which may be needed in languages
like C++. Pascal scoping rules mean that all variables are effectively
created on the stack (there is the concept of a heap in Pascal too,
isn't there?). I assume Java is the same. C and C++ have global or
static variable space as well as arguments and local variables, which
are normally referenced above or below a frame pointer.

HP and TI calculators had a 4 level stack. Is it possible to entirely
evaluate an expression with only 4 stack entries ? I would suspect not,
but if it was possible, then you could possibly implement a hardware
working stack as well as have a common pointer and stack and frame
pointer for global and local variables.

I might be showing my ignorance .... sorry.

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


Hi Philip,

We, well my boss, at the CSIRO used a couple of XC4005s and XC4008s in
his CLP board back in the mid 90's.
I'm not sure how much the chips were, but I recall them being around
the $1,000 mark in Australia.

John.

Philip Freidin wrote:
>
> Thanks for the gauntlet :-)
>
> Well, I went for a dig in my deep archive and found the following.
> Not available for commercial use (sorry Rick), but another example
> of what can be done.
>
> The Risc4005 is now 17 years old. It was designed prior to the first
> commercial release of the Xilinx XC4000 family, and fitted into the
> first product the XC4005. A 14 by 14 array of CLBs, each with 2 4-luts,
> 2 FFs, 2 TBUFs, and 2 bits of carry chain.
>

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


> I was wondering how a stack oriented machine would handle multitasking ...
> Saving the context of the machine and restoring it.
> I'm inclined to think you'd need a supervisory state like the 68000 to
> support the swapping of TOS, NOS, FP, SP and so on.

The ARM processors have multiple stacks. One drawback is that they
are of fixed size and statically allocated at startup. An RTOS like
FreeRTOS uses fixed size task stacks.

I don't see why a memory management unit couldn't solve the whole
problem. Memory for stacks could be dynamically allocated in
relatively small chunks. There would need to be some kind of page
fault detection. Presumably the system level stacks would be large
enough to prevent faults during hardware interrupts.
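
The idea of growing a stack in small chunks on a fault can be modelled
in a few lines. This is a toy sketch under stated assumptions:
PAGE_WORDS, the class DemandStack and the max_pages limit are all
hypothetical, and the "fault" is just a method call rather than a real
MMU exception.

```python
PAGE_WORDS = 256   # assumed page size in words, for illustration only


class DemandStack:
    """Stack whose backing pages are allocated only when first touched."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.pages = []    # each page is a list of PAGE_WORDS words
        self.sp = 0        # number of words currently on the stack
        self.faults = 0    # how many times a new page was mapped

    def _page_fault(self):
        # Stand-in for the fault handler: map one more page or give up.
        if len(self.pages) >= self.max_pages:
            raise MemoryError("stack limit reached")
        self.pages.append([0] * PAGE_WORDS)
        self.faults += 1

    def push(self, word):
        if self.sp == len(self.pages) * PAGE_WORDS:
            self._page_fault()
        page, offset = divmod(self.sp, PAGE_WORDS)
        self.pages[page][offset] = word
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("pop from empty stack")
        self.sp -= 1
        page, offset = divmod(self.sp, PAGE_WORDS)
        return self.pages[page][offset]


if __name__ == "__main__":
    stack = DemandStack(max_pages=2)
    for value in range(300):
        stack.push(value)
    print(stack.faults, stack.pop())
```

The model never returns pages when the stack shrinks, and it says
nothing about which stack the fault handler itself runs on. Both would
need an answer in a real system.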

How did Burroughs handle stacks in their Algol machines such as the
B5000? http://en.wikipedia.org/wiki/Burroughs_large_systems

> I was thinking of using the argument and local variable stack for
> working space, but using the same stack for work space would preclude
> the allocation of new local variables, which may be needed in languages
> like C++. Pascal scoping rules mean that all variables are effectively
> created on the stack, (there is the concept of a heap in Pacal too isn't
> there ?)

Yes, there is a heap with all the usual complications of allocation,
release and cleanup. For security, some portion of the system needs
to clear the heap chunks as they are allocated.

> I assume Java is the same. C and C++ have global or static
> variable space as well as arguments and local variables, which are
> normally referenced above or below a frame pointer.
>
> HP and TI calculators had a 4 level stack. Is it possible to entirely
> evaluate an expression with only 4 stack entries ? I would suspect not,

Given that the expression could be rewritten in postfix notation, I
think it could. I believe the stack on the HP48GX is not limited
other than by the size of memory.
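
The 4 level question can be tested with a short sketch that computes
how many stack entries an expression tree needs. The representation
(nested tuples) and the function names are assumptions for illustration.
Two cases are modelled: strict left-to-right postfix, and postfix where
the deeper operand of each operator may be evaluated first, which for
non-commutative operators assumes an operand swap or a reversed
operation is available.

```python
def depth_in_order(expr):
    """Stack entries needed when operands are evaluated left to right."""
    if not isinstance(expr, tuple):
        return 1                      # a leaf occupies one entry
    _, left, right = expr
    # The left result sits on the stack while the right is evaluated.
    return max(depth_in_order(left), 1 + depth_in_order(right))


def depth_best_order(expr):
    """Stack entries needed if the deeper operand may go first."""
    if not isinstance(expr, tuple):
        return 1
    _, left, right = expr
    l, r = depth_best_order(left), depth_best_order(right)
    return max(l, r) if l != r else l + 1


def right_chain(names):
    """Build a+(b+(c+...)) as nested ("+", left, right) tuples."""
    expr = names[-1]
    for name in reversed(names[:-1]):
        expr = ("+", name, expr)
    return expr


def balanced(names):
    """Build a balanced tree of additions over the given leaves."""
    if len(names) == 1:
        return names[0]
    mid = len(names) // 2
    return ("+", balanced(names[:mid]), balanced(names[mid:]))


if __name__ == "__main__":
    chain = right_chain(list("abcdef"))
    tree = balanced([str(i) for i in range(16)])
    print(depth_in_order(chain), depth_best_order(chain))
    print(depth_best_order(tree))
```

Under this model the six-operand right-nested chain needs 6 entries in
strict order but only 2 when reordered, while a balanced tree of 16
operands needs 5 entries even with reordering. So 4 levels cover many
practical expressions but not every one, unless intermediate results
are stored somewhere else.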

> but if it was possible, then you could possibly implement a hardware
> working stack as well as have a common pointer and stack and frame
> pointer for global and local variables.
>
> I might be showing my ignorance .... sorry.
>
> John.

I don't believe I have spent enough time admiring the B5000. This was
a very clever design!

Richard



Richard

> The ARM processors have multiple stacks.
>

The ARMs have several operating modes, each with its own set of
registers. The software usually assigns a stack to each mode.

> One drawback is that they are of fixed size and statically allocated
> at startup.
>

The ARM does not imply anything about the size and allocation of a
stack. An OS might though.

> I don't see why a memory management unit couldn't solve the whole
> problem. Memory for stacks could be dynamically allocated in
> relatively small chunks. There would need to be some kind of page
> fault detection. Presumably the system level stacks would be large
> enough to prevent faults during hardware interrupts.
>

The ARM memory manager works in 1K, 4K or 64K (IIRC) chunks depending
on how it is set up and how much space one wants to use for paging
tables. Some OSes allocate an initial page for a task stack and use
page faulting as you suggest. It works well enough for regular
tasks but not so well for realtime tasks and, as you point out, the
supervisor stack should be fixed. One side issue is what happens
if you use page faults to resize a stack and the exception handler
itself uses that stack.

Veronica


