EmbeddedRelated.com

Small CPUs in FPGAs

Started by Rick Collins July 31, 2008
> I have looked at the B16. In fact, I learned quite a bit
> from that which I used when designing my own CPU. But 600 LUTs is
> not all that small. That is what my CPU was without optimization. The
> smaller CPUs from the FPGA vendors push down to 250 or so LUTs. I
> don't know how well they will run Forth, but I will be taking a harder
> look.

Again, I don't share this obsession, but note for the record that b16-small is half the size.

IMO life is too short for a core that my existing body of C code can't be trivially recompiled for. I need a 32-bit core without random limits.

Tommy



To post a message, send it to: f...
To unsubscribe, send a blank message to: f...
Tommy Thorn wrote:
>> I have looked at the B16. In fact, I learned quite a bit
>> from that which I used when designing my own CPU. But 600 LUTs is
>> not all that small. That is what my CPU was without optimization. The
>> smaller CPUs from the FPGA vendors push down to 250 or so LUTs. I
>> don't know how well they will run Forth, but I will be taking a harder
>> look.
>>
>
> Again, I don't share this obsession, but note for the record that b16-small is half the size.
>
> IMO life is too short for a core that my existing body of C code can't be trivially recompiled for. I need a 32-bit core without random limits.
>
>
Well, it is too bad that you seem to need 32+ megabytes of memory nowadays
just to compile C code. You get what you pay for: I expect two thirds of
the LUTs in any CPU will be data path and the rest control. If you want a
32 bit CPU, expect a good chunk of LUTs.
> Tommy
>
>
>
>


Ben Franchuk wrote:
>> You want 32 bit cpu expect a good chunk of LUTS.

Using the GR0040/GR1040 (same basic design but w/ instruction-register
pipeline register after ifetch), the 32-bits versions (GR0050/1050) were
<500 LUTs, if I recall correctly, and were designed to be a C compiler
target.

Cheers.
Jan Gray



--- In f..., "Jan Gray" wrote:
>
> Ben Franchuk wrote:
> >> You want 32 bit cpu expect a good chunk of LUTS.
>
> Using the GR0040/GR1040 (same basic design but w/ instruction-register
> pipeline register after ifetch), the 32-bits versions (GR0050/1050) were
> <500 LUTs, if I recall correctly, and were designed to be a C compiler
> target.

Hi Jan,

I am interested in the gr0040 and the others, but there are two
issues. One is that the license does not allow commercial use. The
other is that except for the gr0040 I can't find info on them. The
gr0040 doesn't have the assembler available. I think your web site
says you developed one, but it isn't posted.

I would be interested in testing out a 32 bit processor that only uses
500 LUTs. But is this measured while using T-bufs which are no longer
supported in FPGAs? I think the info on the xr16 indicates the use of
T-bufs, but I don't recall details on the gr0040 and no info on the
others.

Rick


Jan Gray wrote:
> Ben Franchuk wrote:
>
>>> You want 32 bit cpu expect a good chunk of LUTS.
>>>
>
> Using the GR0040/GR1040 (same basic design but w/ instruction-register
> pipeline register after ifetch), the 32-bits versions (GR0050/1050) were
> <500 LUTs, if I recall correctly, and were designed to be a C compiler
> target.
>
>
However, I differ with your viewpoint of just what a LUT is.
Without the 16x1 block RAM (and fast ripple carry), just how
large would a RISC machine be? For using the hardware,
your designs can't be beat, but I always tend to be with the
OTHER guy's FPGAs. Right now I tend to favor CPLDs,
as they seem to be hobbyist friendly, rather than everything
being in unmanageable packaging or requiring $$$ for Verilog or VHDL
compilers.
> Cheers.
> Jan Gray
>
>
PS. I am also not a fan of using internal block ram
as main memory ... It kind of defeats the purpose of
having a general purpose CPU any more.


> Right now I tend to favor CPLDs,
> as they seem to be hobbyist friendly, rather than everything
> being in unmanageable packaging or requiring $$$ for Verilog or VHDL
> compilers.

Both Altera and Xilinx provide FREE Verilog and VHDL development
environments. Sure, they lack some of the high end features of the
retail versions but, for a fact, the Xilinx package works well.

> PS. I am also not a fan of using internal block ram
> as main memory ... It kind of defeats the purpose of
> having a general purpose CPU any more.
>

If the BlockRAM is large enough, why not use it? I have a 16 bit
minicomputer (retro) that uses 16k words of BlockRAM for central
memory as well as VGA memory for a 80x25 text display.

Richard



I'm not sure what Rick's requirements are, but the discussion prompted
me to revisit my minimal Micro16 design.

The design resembles in many ways the PDP-8, with a 3 bit op code and an
indirect bit, but I have extended the design to include a multiply, a
skip on condition with 16 conditional tests and a jump instruction
rather than two conditional jumps. I also extended the ALU to include 12
shift and complement operators on the Accumulator in addition to the 4
existing NOR, ADD, MUL and STA operations and added push and pull
instructions as well as software interrupts.

I also made the data width of the design scalable, from 8 bits to 32
bits. The design is essentially square with the address bus width being
the same as the data bus width although you can shorten the address
bus by a couple of bits if desired.

I ran a few syntheses of the CPU design with Xilinx ISE 7.1 for an
XC3S200 with different address and data widths to see what the
resource utilization was like. I tested the design for 8 bit, 12 bit, 16
bit, 24 bit and 32 bit address and data buses. I got an average overhead
of about 5,000 gates with 200 gates / bit. This implies about 44% fixed
overhead for a 32 bit CPU.

When I looked at the LUTs and Slice Flip Flops I got an overhead of
138.5 (4 input) LUTs and 32.5 Slice Flip Flops, and per bit I got 21.125
LUTs and 5.625 Slice FFs. This implies only a 20% overhead for a 32 bit
CPU, so I'm not too sure what to make of the ISE report figures, but on
average it's probably not far from what bfranchuk (woodelf?) was saying
about 2/3 data path and 1/3 control.
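
The overhead percentages above can be reproduced with a few lines of Python. This is only a sketch: it assumes the quoted figures combine linearly as fixed + per_bit * width, and the names fixed_share and MODELS are made up for illustration.

```python
def fixed_share(fixed, per_bit, width):
    """Fraction of the total that does not scale with the data width."""
    total = fixed + per_bit * width
    return fixed / total


# (fixed overhead, cost per bit) as quoted in the post above.
MODELS = {
    "gates": (5000.0, 200.0),
    "LUTs": (138.5, 21.125),
    "slice FFs": (32.5, 5.625),
}

if __name__ == "__main__":
    for name, (fixed, per_bit) in MODELS.items():
        share = fixed_share(fixed, per_bit, 32)
        print(f"{name}: {share:.1%} fixed overhead at 32 bits")
```

Under that assumption the gate figure comes out at about 44%, while the LUT figure comes out nearer 17% than 20% and the flip-flop figure at about 15%.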

John.

--
http://www.johnkent.com.au
http://members.optushome.com.au/jekent


> How ever I differ with your view point of just what a LUT is.

Well, these are Virtex-era LUTs, not XC4000-era LUTs. No H-LUTs are assumed
or harmed. We do use all the carry-logic, MULT_ANDs, etc. that we can, esp.
to build functions like

MUX(A,B+C)
or
MUX4(A+B,A-B,A&B,A^B)

in one column of LUTs (1 LUT per bit).

No TBUFs (alas).

I'm sorry I'm about to be away from the computer for two days but have fun
with the discussion thread.

Jan.

-----Original Message-----
From: f... [mailto:f...] On Behalf
Of b...@jetnet.ab.ca
Sent: Sunday, August 03, 2008 11:52 PM
To: f...
Subject: Re: [fpga-cpu] Re: Small CPUs in FPGAs

Jan Gray wrote:
> Ben Franchuk wrote:
>
>>> You want 32 bit cpu expect a good chunk of LUTS.
>>>
>
> Using the GR0040/GR1040 (same basic design but w/ instruction-register
> pipeline register after ifetch), the 32-bits versions (GR0050/1050) were
> <500 LUTs, if I recall correctly, and were designed to be a C compiler
> target.
>
>
Without the 16x1 block RAM (and fast ripple carry), just how
large would a RISC machine be? For using the hardware,
your designs can't be beat, but I always tend to be with the
OTHER guy's FPGAs. Right now I tend to favor CPLDs,
as they seem to be hobbyist friendly, rather than everything
being in unmanageable packaging or requiring $$$ for Verilog or VHDL
compilers.
> Cheers.
> Jan Gray
>
>
PS. I am also not a fan of using internal block ram
as main memory ... It kind of defeats the purpose of
having a general purpose CPU any more.


--- In f..., "rtstofer" wrote:
> > Right now I tend to favor CPLDs,
> > as they seem to be hobbyist friendly, rather than everything
> > being in unmanageable packaging or requiring $$$ for Verilog or VHDL
> > compilers.
>
> Both Altera and Xilinx provide FREE Verilog and VHDL development
> environments. Sure, they lack some of the high end features of the
> retail versions but, for a fact, the Xilinx package works well.

The synthesis may work ok, but the Xilinx simulator is not very good
these days. It was a long time ago that Xilinx decided to roll their
own synthesis and they have had a chance to tune it pretty well for
their chips. But the simulator is still rather new and full of
"issues", not to mention rather slow. Interesting that the speed
crippled versions of other tools run about the same speed or faster
than the Xilinx in-house tool.

> > PS. I am also not a fan of using internal block ram
> > as main memory ... It kind of defeats the purpose of
> > having a general purpose CPU any more.
>
> If the BlockRAM is large enough, why not use it? I have a 16 bit
> minicomputer (retro) that uses 16k words of BlockRAM for central
> memory as well as VGA memory for a 80x25 text display.

I think bfranchuk was saying that he wants a full CPU with plenty of
external memory. That does not fit my needs at all since I am trying
to fit this into existing designs using only internal resources.

Rick


--- In f..., John Kent wrote:
>
> I'm not sure what Rick's requirements are, but the discussion prompted
> me to revisit my minimal Micro16 design.

Earlier I mentioned designs that used only 250 or so LUTs. I also
said that a design with 1000's of LUTs is too large. I have rolled my
own 16 bit CPU that uses around 600 LUTs. So far the really small
CPUs are *very* limited and are only supported with an assembler. To
get HLL support the designs get over 1000 LUTs.

> The design resembles in many ways the PDP-8, with a 3 bit op code and an
> indirect bit, but I have extended the design to include a multiply, a
> skip on condition with 16 conditional tests and a jump instruction
> rather than two conditional jumps. I also extended the ALU to include 12
> shift and complement operators on the Accumulator in addition to the 4
> existing NOR, ADD, MUL and STA operations and added push and pull
> instructions as well as software interrupts.
>
> I also made the data width of the design scalable, from 8 bits to 32
> bits. The design is essentially square with the address bus width being
> the same as the data bus width although you can shorten the address
> bus by a couple of bits if desired.
>
> I ran a few syntheses of the CPU design with Xilinx ISE 7.1 for an
> XC3S200 with different address and data widths to see what the
> resource utilization was like. I tested the design for 8 bit, 12 bit, 16
> bit, 24 bit and 32 bit address and data buses. I got an average overhead
> of about 5,000 gates with 200 gates / bit. This implies about 44% fixed
> overhead for a 32 bit CPU.

I have no understanding of how you measured gates or how that would be
used to evaluate a design. The only common denominator for use in
FPGAs is the 4 input LUT. Register usage is typically much less than
LUTs so for the most part it is not important to consider registers.

> When I looked at the LUTs and Slice Flip Flops I got an overhead of
> 138.5 (4 input) LUTs and 32.5 Slice Flip Flops, and per bit I got 21.125
> LUTs and 5.625 Slice FFs. This implies only a 20% overhead for a 32 bit
> CPU, so I'm not too sure what to make of the ISE report figures, but on
> average it's probably not far from what bfranchuk (woodelf?) was saying
> about 2/3 data path and 1/3 control.

It sounds like you are using Xilinx devices. When you count LUTs, are
you counting *all* LUTs or just the ones reported by Xilinx tools as 4
input LUTs? For whatever reason, they break LUTs into separate counts
of 4 input LUTs, 3 input LUTs and several types of LUT based RAM. The
true count of LUTs is the sum of all these.

If I understand your numbers, a 16 bit design uses 168.5 LUTs and a 32
bit design uses 1,178.5 LUTS. I assume the fractional LUT counts were
obtained by generating two designs and projecting a straight line
through them. Have you tried this with a third width, say 8, 16
and 32 bits? I would be interested in seeing how the size varies and
if it still fits the projection.
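
One way to check whether measured sizes still fit the projection is an ordinary least-squares line through (width, LUT count) pairs. The sketch below assumes John's quoted figures mean fixed + per_bit * width. The function names are made up, and the points fed to the fit are placeholders generated from that model, not real synthesis results.

```python
FIXED_LUTS = 138.5     # fixed overhead quoted in the thread
LUTS_PER_BIT = 21.125  # per-bit cost quoted in the thread
WIDTHS = (8, 12, 16, 24, 32)


def predicted_luts(width):
    """LUT count under the assumed linear model."""
    return FIXED_LUTS + LUTS_PER_BIT * width


def fit_line(points):
    """Least-squares (slope, intercept) through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Placeholder points; substitute the real ISE map report numbers.
    points = [(w, predicted_luts(w)) for w in WIDTHS]
    slope, intercept = fit_line(points)
    for w, luts in points:
        residual = luts - (intercept + slope * w)
        print(f"{w:2d} bits: {luts:7.1f} LUTs, residual {residual:+.2f}")
```

Under this reading the model gives 476.5 LUTs at 16 bits and 814.5 at 32 bits. With real measurements in place of the placeholder points, large residuals would show that a straight line is a poor fit.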

Rick


