
This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

Just what is small and is it the best? - Ben Franchuk - Oct 6 16:36:00 2000


Just reading the latest News:
"8-Bit Micro controller for Virtex Devices. If I may be permitted to quote
so extensively, I'll let this superb app note speak for itself:
"..."What do I consider small? Not 950 or 1100 or 1700 logic cells. Certainly
not 3000. By small, I mean cores like this excellent assembler-programmable
KCPSM (35 CLBs => ~140 logic cells) or the integer-C-programmable xr16(~300
logic cells)"

Small is often good but is the smallest the BEST?
Looking at the common designs:

RISC machines and stack machines (FORTH) have the smallest size, being just an
ALU with some jump logic on the PC. This is the racing engine of computing -
fast but not powerful.

CISCs, on the other hand, are overburdened with opcode decoding.
A Turing machine has the smallest data path but a very large control section.
This is the diesel engine of computing - powerful but mostly just
idling away.

4 and 8 bit micros are the standby in embedded items. A washing machine
does not need the latest 1 GHz CPU. These are the 2-cycle engine that powers your
lawnmower.

It's, umm... strange that we don't have a CPU that fits in this category, the
just-right CPU, that is, the automobile of computing. The PDP-11 is close,
but memory addressing is only 64 KB.

Right now RISC machines are the fastest, needing only
a few CLBs for the ALU. The ALU could be limited in function,
say add, nor, shift left, shift right. Anything we can't do in one cycle
we can do in two or three... Memory is fast.
Decoding is quick, and the ALU is small. But is memory fast?
With external buffering, cache lookups, and bus setup and hold times,
the access time of main memory is compromised to, say, 2-3x the access
time of a single memory element.

Perhaps rather than looking at the number of CLBs, one needs to look again
at the overall picture. As a crude example:
A full-featured ALU with limited shifting could be about 50% more than the
minimum needed of 3 or 4 CLBs per bit, say 8 CLBs. Registers like
MAR, Input, Output, plus byte logic and memory, could take say 8 more CLBs
per bit. For a 16-bit CPU this is 256 CLBs.
A RISC computer would take 25% of that for control, giving about 320 CLBs
for a 16-bit computer. A simple CISC computer would use, say, 100%, with about 500
CLBs for the same 16-bit CPU.
FPGAs are getting faster as dies become smaller, while external memory stays the
same speed. This could push designs that require fast main memory to a more
CISC style of design. Adding features could push a 16-bit design to, say, 200%,
giving 768 CLBs. Now this is getting to be a BIG complex design, yuck!
The RISC/Forth design requires too fast a memory. Could not a streamlined
CISC be designed to use only about 50% of the CLBs used for the data path
and yet give a processor that has a good bit of power?
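
The crude arithmetic above can be written out as a short Python sketch. It assumes the reading that the 8 + 8 CLB figures are per bit of data path and that the percentages are control logic measured against the data path; the function and constant names are made up for illustration, and the numbers are only the guesses from this post.

```python
# Rough CLB estimate following the post's arithmetic. All figures are the
# post's own guesses; the function and constant names are hypothetical.

ALU_CLBS_PER_BIT = 8        # "full featured ALU with limited shifting"
REGISTER_CLBS_PER_BIT = 8   # MAR, input, output, byte logic and memory


def estimate_clbs(bits, control_percent):
    """Return (datapath_clbs, total_clbs) for a CPU of the given width.

    control_percent is the control logic size as a percentage of the
    data path (25 for the RISC guess, 100 for the simple CISC guess).
    """
    datapath = bits * (ALU_CLBS_PER_BIT + REGISTER_CLBS_PER_BIT)
    total = datapath * (100 + control_percent) // 100
    return datapath, total


if __name__ == "__main__":
    for label, pct in [("RISC", 25), ("streamlined CISC?", 50),
                       ("simple CISC", 100), ("feature-rich CISC", 200)]:
        datapath, total = estimate_clbs(16, pct)
        print(f"{label:18s} control={pct:3d}%  datapath={datapath}  total={total}")
```

Under these assumptions the 50% case lands at 384 CLBs for 16 bits, between the 320 and 512 (the "about 500") figures.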
Ben.
--
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Just what is small and is it the best? - Rob Finch - Oct 6 18:18:00 2000

--- In , Ben Franchuk <bfranchuk@j...> wrote:
> Just reading the latest News:
> The Risc/Forth design requires too fast a memory. Could not a
> streamlined
> CSIC be designed use only about 50% of CLB's used for the data path
> and yet give a processor that has a good bit of power?
> Ben.
> --

I've had similar thoughts, and I started designing a streamlined CISC
but dumped it. The tough part is arguing the requirements, and
justifying how a CISC design would fill those requirements. The
primary motivation for a CISC design is its conservative use of
opcode space, i.e. embedded memory resources. I'm not sure what the
RISC/CISC ratio for opcode usage is (has anyone done any research on
this?). I suspect CISC is not as good as one might think, because a lot
can be done with registers in a RISC. It's possible to design a
fairly conservative RISC processor. The control logic for a CISC
takes more room than a simple RISC design. That extra room used by
the CISC's control logic can be traded for extra memory in a RISC
design. A simple RISC design might be easier to debug, and use less
man hours in development than a CISC design (although I'm just
guessing here).
If you really want to get the most bang for the byte, you can always
use a simple bytecode interpreter with a RISC design, perhaps having
the whole interpreter in cache (dedicated ROM), like a microcode
store :)
I used to be a big fan of CISC designs, then I started studying
architectures and have since become a big fan of RISC designs.
I might go back and finish that CISC design just for comparison
purposes.

PS. Isn't a simple CISC design a RISC processor by definition ?





Re: Re: Just what is small and is it the best? - Ben Franchuk - Oct 6 18:32:00 2000

Rob Finch wrote:
>
> I've had similar thoughts, and I started designing a streamlined CISC
> but dumped it. The tough part is arguing the requirements, and
> justifying how a CISC design would fill those requirements. The
> primary motivation for a CISC design is its conservative use of
> opcode space, i.e. embedded memory resources. I'm not sure what the
> RISC/CISC ratio for opcode usage is (has anyone done any research on
> this?). I suspect CISC is not as good as one might think, because a lot
> can be done with registers in a RISC. It's possible to design a
> fairly conservative RISC processor. The control logic for a CISC
> takes more room than a simple RISC design. That extra room used by
> the CISC's control logic can be traded for extra memory in a RISC
> design. A simple RISC design might be easier to debug, and use less
> man hours in development than a CISC design (although I'm just
> guessing here).

I view Risc machines as Micro-coded hardware that uses all
of main memory as micro-code.

> If you really want to get the most bang for the byte, you can always
> use a simple bytecode interpreter with a RISC design, perhaps having
> the whole interpreter in cache (dedicated ROM), like a microcode
> store :)

This is the kind of thought that made CISC complex. Very good at byte operations,
very bad at everything else. The 8086 instruction set comes to mind here.
A dedicated fast memory segment is nice, but rarely can you get a small OS
nowadays.

> I used to be a big fan of CISC designs, then I started studying
> architectures and have since become a big fan of RISC designs.
> I might go back and finish that CISC design just for comparison
> purposes.
>
> PS. Isn't a simple CISC design a RISC processor by definition ?

Nope, it is a load/store design. The cleanest designs I have seen
are still the machines from the early 60's like the PDP-8 or the
PDP-4. I guess I am a fan of the OLD IRON...

Ben.
--
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk






Re: Re: Just what is small and is it the best? - Tim Böscke - Oct 6 18:49:00 2000

>
> PS. Isn't a simple CISC design a RISC processor by definition ?

http://www.cs.uiowa.edu/~jones/arch/cisc/

Do you consider this as RISC ? (Just an example)

But nevertheless I am of the opinion that there are architectures
where the RISC/CISC distinction is quite difficult.

For example: Is the good old subtract-and-branch one instruction
machine RISC or CISC ?

pro RISC:
- fixed instruction length.
- orthogonal "register" set.. (yes, there is no difference between memory and registers)
- no complex addressing modes.
- fixed data size

pro CISC:
- Multicycle operation. (Though the one instruction could always be broken up
into one subtract and one branch instruction)
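
For readers who have not met the one-instruction machine, here is a minimal Python sketch. It assumes the common "subtract and branch if less than or equal to zero" (SUBLEQ) formulation, which the message above does not spell out; the halting convention (a negative branch target) and all names are choices made for this illustration only.

```python
# Minimal SUBLEQ interpreter (assumed variant: subtract, then branch if the
# result is <= 0). Each instruction is three memory words: a, b, c.
#   mem[b] -= mem[a]; if mem[b] <= 0 then pc = c else pc += 3
# A negative pc halts the machine (a convention chosen for this sketch).

def run_subleq(mem, pc=0, max_steps=10_000):
    """Run a SUBLEQ program in place and return the memory list."""
    steps = 0
    while pc >= 0:
        if steps >= max_steps:
            raise RuntimeError("step limit reached")
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


def add_via_subleq(x, y):
    """Compute x + y using three SUBLEQ instructions and a zeroed temp Z."""
    X, Y, Z = 9, 10, 11          # data addresses, placed after the code
    program = [
        X, Z, 3,                 # Z = Z - X      (Z becomes -x)
        Z, Y, 6,                 # Y = Y - Z      (Y becomes y + x)
        Z, Z, -1,                # Z = 0, branch to -1: halt
        x, y, 0,                 # X, Y, Z
    ]
    return run_subleq(program)[Y]


if __name__ == "__main__":
    print(add_via_subleq(7, 5))
```

Note that code and data share one flat memory and every instruction has the same three-word shape, which is the point of the "pro RISC" list above, while each instruction still performs a read-modify-write plus a conditional branch.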





Re: Re: Just what is small and is it the best? - Ben Franchuk - Oct 6 19:04:00 2000

Tim Böscke wrote:
>
> >
> > PS. Isn't a simple CISC design a RISC processor by definition ?
> >
>
> http://www.cs.uiowa.edu/~jones/arch/cisc/
>
> Do you consider this as RISC ? (Just an example)

I consider it to be a stack machine... But then I don't teach
computer architecture. I consider a CISC machine to be single
address machine.

> But nevertheless I am of the opinion that there are architectures
> where the RISC/CISC distinction is quite difficult.
>
> For example: Is the good old subtract-and-branch one instruction
> machine RISC or CISC ?
RISC machine - very reduced instruction set :).

> pro RISC:
> - fixed instruction length.
> - orthogonal "register" set.. (yes, there is no difference between memory and registers)
> - no complex addressing modes.
> - fixed data size

Look at the classic computer designs like the PDP-8. The big difference is a
load/store design and a multitude of internal registers compared to other designs.
>
> pro CISC:
> - Multicycle operation. (Though the one instruction could always be broken up
> into one subtract and one branch instruction)

The multicycle operation is only because IBM pushed for the 8 bit byte in
the 360 computers. This brought the smallest word size down from 12 bits to
8 bits, making any computer using the new format have to be multi-cycle.
The 360 being a large machine could afford to process data in 16 or 32 bit
chunks and not be slowed down.

Ben.
--
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk






RE: Just what is small and is it the best? - Jan Gray - Oct 6 19:56:00 2000

> Small is often good but is the smallest the BEST?

If "smallest" delivers on requirements (e.g. fast enough, C programmable,
has interrupt handling, or what have you), probably yes.

"A small cat is better than a large cat because it eats less, poops less,
and sheds less." "So it follows that the ideal cat is a cat of zero
length?"

As with so many things, the first few resource units provide the essentials.
The rest are luxuries. As you climb the luxury curve, each resource spent
provides less and less additional value. Sometimes supposed luxuries (like
deeper pipelines) make things worse.

If you add up the number of 4-LUTs in a minimal "bare necessities" n-bit
processor datapath, for example,

Cost  What
n     1 port 16-entry register file
n     adder/subtractor
n     logic unit
0     TBUF-based immediate mux
0     TBUF-based operand mux
---
3n

you can build a simple streamlined RISC datapath in only 3n logic cells.
Maybe even 2n if your ALU operation is "add/nand". If you're willing to
multi-cycle it (take k cycles per word) then it's 3n/k or 2n/k.
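
To see why an "add/nand" ALU can still be enough, here is a small Python sketch that rebuilds the usual logic operations and subtraction from just those two primitives. It assumes a 16-bit word; the function names are invented for the example and each derived operation would of course cost extra cycles on real hardware.

```python
# Sketch: everything below is built from only two primitives, add and nand,
# on an assumed 16-bit word. Function names are hypothetical.

MASK = 0xFFFF


def nand(a, b):
    return ~(a & b) & MASK


def add(a, b):
    return (a + b) & MASK


# Derived operations: each call to nand/add stands for one ALU pass.
def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


def sub(a, b):
    # Two's complement: a - b = a + ~b + 1
    return add(add(a, not_(b)), 1)


if __name__ == "__main__":
    print(hex(and_(0x0F0F, 0x00FF)), hex(or_(0x0F0F, 0x00FF)),
          hex(xor(0x0F0F, 0x00FF)), hex(sub(5, 7)))
```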

But it takes a few cycles to execute even one "RISC instruction" like add
r3,r1,r2:

(assume r[0]=0, rPC=1, r[2]=2, bus is 3-state bus, t is temp reg, ir is
instruction register)
; increment PC and fetch insn
t = bus <- r[2]
r[rPC] = mar = bus <- r[rPC] + t
ir = mem[mar]
; add instruction
t = bus <- r[ir.ra]
t = bus <- r[ir.rb] + t
r[ir.rd] = t
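
The six register transfers above can be mimicked in a few lines of Python to count cycles. This is only a sketch under assumptions: the decoded instruction is a made-up named tuple rather than the real xr16 encoding, memory is a dict keyed by address, words are 16 bits, and the example uses r4 and r5 as sources so that the PC (r1) and the constant register (r2) are not also operands.

```python
from collections import namedtuple

# Hypothetical decoded instruction form; not the real xr16 encoding.
Insn = namedtuple("Insn", "rd ra rb")

R_PC, R_TWO = 1, 2   # per the post: the PC lives in r[1], r[2] holds 2
MASK = 0xFFFF        # assumed 16-bit datapath


def fetch_and_add(r, mem):
    """Run the six micro-steps for one add; return the cycle count."""
    cycles = 0
    # increment PC and fetch insn
    t = r[R_TWO];                          cycles += 1
    r[R_PC] = mar = (r[R_PC] + t) & MASK;  cycles += 1
    ir = mem[mar];                         cycles += 1
    # add instruction
    t = r[ir.ra];                          cycles += 1
    t = (r[ir.rb] + t) & MASK;             cycles += 1
    r[ir.rd] = t;                          cycles += 1
    return cycles


def demo():
    r = [0] * 16
    r[R_TWO] = 2
    r[4], r[5] = 10, 32
    mem = {2: Insn(rd=3, ra=4, rb=5)}      # "add r3,r4,r5" at address 2
    cycles = fetch_and_add(r, mem)
    return r[3], r[R_PC], cycles


if __name__ == "__main__":
    print(demo())
```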

If you're only building a toaster SoC, or a toaster channel processor, where
100 kHz frequency would be quite adequate, you might as well build the 3n or
3n/k datapath.

But if that's not fast enough, if you need closer to one instruction per
cycle, you must add resources. The first thing you add is a dedicated PC
register, PC adder/incrementor, and PC mux. Next you add a second read port
to the register file, and perhaps a concurrent write port too. And you add
a result multiplexor to select among the various results (add, logic,
shifts, load-data-in, return address, etc.):

Cost    What
2n-4n   2r1w 16-entry register file
n       adder/subtractor
n       logic unit
0-6n    result multiplexer
n       PC
n       PC incrementer
n       PC mux
---
7n-15n

This is a lot more costly, but is now approximately one instruction per
cycle.
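
As a quick check of the two tallies above, this Python sketch sums the per-row coefficients for a concrete width. The coefficients are copied from the tables; the dictionary layout and function name are made up here, and the k-cycle division is applied the simple way, assuming k divides the total evenly.

```python
# Coefficients of n (datapath width in bits), in 4-LUT logic cells,
# copied from the two cost tables above as (low, high) ranges.
MINIMAL = {
    "1-port 16-entry register file": (1, 1),
    "adder/subtractor": (1, 1),
    "logic unit": (1, 1),
    "TBUF-based immediate mux": (0, 0),
    "TBUF-based operand mux": (0, 0),
}

FAST = {
    "2r1w 16-entry register file": (2, 4),
    "adder/subtractor": (1, 1),
    "logic unit": (1, 1),
    "result multiplexer": (0, 6),
    "PC": (1, 1),
    "PC incrementer": (1, 1),
    "PC mux": (1, 1),
}


def datapath_cells(table, n, k=1):
    """Return (low, high) logic cell counts for an n-bit datapath.

    k is the number of cycles taken per word (k=1 means full width).
    """
    low = sum(lo for lo, _ in table.values()) * n // k
    high = sum(hi for _, hi in table.values()) * n // k
    return low, high


if __name__ == "__main__":
    print("minimal, n=16:     ", datapath_cells(MINIMAL, 16))
    print("minimal, n=16, k=2:", datapath_cells(MINIMAL, 16, k=2))
    print("fast,    n=16:     ", datapath_cells(FAST, 16))
```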

If you still need more speed, you'll add pipelining to reduce the cycle
time. (But add 2n (or more) for result forwarding muxes for each stage.)
Each new pipeline stage you add will reduce the cycle time until the
diminishing returns set in, possibly due to the extra interconnect delay
incurred by signalling across many result forwarding multiplexers.

If you still need more speed, you'll think about multiple issue,
out-of-order, LIW, custom function units, or perhaps multiple processors on
chip.

Including control unit overhead, etc., xr16 is about 300 logic cells / 16
bits = ~20n overall, xr32 about ~14n overall.

Jan Gray
Gray Research LLC






Re: Just what is small and is it the best? - Ben Franchuk - Oct 6 20:25:00 2000

Jan Gray wrote:

> If "smallest" delivers on requirements (e.g. fast enough, C programmable,
> has interrupt handling, or what have you), probably yes.

True except maybe for some unnamed OS's and sales people.

> As with so many things, the first few resource units provide the essentials.
> The rest are luxuries. As you climb the luxury curve, each resource spent
> provides less and less additional value. Sometimes supposed luxuries (like
> deeper pipelines) make things worse.

True, but sometimes cutting corners has a big impact on things. While not
hardware, I am thinking of how the serial port on the PC is not interrupt
driven under DOS.

> If you add up the number of 4-LUTs in a minimal "bare necessities" n-bit
> processor datapath, for example,
>
> Cost What
> n 1 port 16-entry register file
> n adder/subtractor
> n logic unit
> 0 TBUF-based immediate mux
> 0 TBUF-based operand mux
> ---
> 3n

Other FPGA's could have slightly different layouts but still a low value
for N.

> you can build a simple streamlined RISC datapath in only 3n logic cells.
> Maybe even 2n if your ALU operation is "add/nand". If you're willing to
> multi-cycle it (take k cycles per word) then it's 3n/k or 2n/k.
>
> But it takes a few cycles to execute even one "RISC instruction" like add
> r3,r1,r2:
>
> (assume r[0]=0, rPC=1, r[2]=2, bus is 3-state bus, t is temp reg, ir is
> instruction register)
> ; increment PC and fetch insn
> t = bus <- r[2]
> r[rPC] = mar = bus <- r[rPC] + t
> ir = mem[mar]
> ; add instruction
> t = bus <- r[ir.ra]
> t = bus <- r[ir.rb] + t
> r[ir.rd] = t

A different ALU design like the 2901's could reduce this to:
t = ir.rb
mar = rPC, rPC <- rPC + #2
ir = mem[mar], ir.ra = ir.ra + t

> But if that's not fast enough, if you need closer to one instruction per
> cycle, you must add resources. The first thing you add is a dedicated PC
> register, PC adder/incrementor, and PC mux. Next you add a second read port
> to the register file, and perhaps a concurrent write port too. And you add
> a result multiplexor to select among the various results (add, logic,
> shifts, load-data-in, return address, etc.):

Hey I thought Risc was simple? Is this not what Complex computers do now?

> Cost What
> 2n-4n 2r1w 16-entry register file
> n adder/subtractor
> n logic unit
> 0-6n result multiplexer
> n PC
> n PC incrementer
> n PC mux
> ---
> 7n-15n
>
> This is a lot more costly, but is now approximately one instruction per
> cycle.

True, but that is because we now have a Harvard style machine.
One data memory (on the CPU only) and one program memory (main memory).

> If you still need more speed, you'll add pipelining to reduce the cycle
> time. (But add 2n (or more) for result forwarding muxes for each stage.)
> Each new pipeline stage you add will reduce the cycle time until the
> diminishing returns set in, possibly due to the extra interconnect delay
> incurred by signalling across many result forwarding multiplexers.

I agree fully here. Also, the limiting factor in any case is the adder delay time,
as that is the biggest delay in the system.

> If you still need more speed, you'll think about multiple issue,
> out-of-order, LIW, custom function units, or perhaps multiple processors on
> chip.

And more gray hair unless you go bald.

> Including control unit overhead, etc., xr16 is about 300 logic cells / 16
> bits = ~20n overall, xr32 about ~14n overall.
>
The numbers seem in the right ballpark.
Ben.
--
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk






Re: Just what is small and is it the best? - Ben Franchuk - Oct 7 1:05:00 2000

Jan Gray wrote:

> The License Agreement speaks for itself and takes precedence over anything I
> might write here. That said, one could interpret it to not permit any use of
> the work or any derivative work for any commercial purpose, and further to
> not permit any distribution of a derivative work (except for a modification
> to an excerpt).

Why copyright the CPU? Copyright the BUGS in the CPU or specific workarounds
for hardware limitations. It seems to me the bug fixes and workarounds stay
in the code forever and thus could give the longest revenue for a product. :)

I like the idea of a split license, but figuring out just what you can copyright
takes a bit of thinking.
Ben.
--
"We do not inherit our time on this planet from our parents...
We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk






RE: Just what is small and is it the best? - Gary Watson - Oct 7 4:54:00 2000


Jan,

How small do you think a xr16-opcode-compatible cpu could be if one didn't
care about speed? Do you think it could get below 150 logic cells, maybe?
And, if I were to do one of these in VHDL would that violate the spirit of
your no-commercial-use license?

Best Regards,

Gary Watson
Technical Director
Nexsan Technologies, Ltd.
Imperial House
East Service Road
Raynesway
Derby DE21 7BF ENGLAND
+44 (0) 1332 5 444 33
http://www.nexsan.com







RE: Just what is small and is it the best? - Jan Gray - Oct 7 13:00:00 2000

> How small do you think a xr16-opcode-compatible cpu could be if one didn't
> care about speed? Do you think it could get below 150 logic cells, maybe?

Let me take you on a tour of less through more drastic changes to xr16 to
save area. At some point it ceases to be xr16, but retains its character.
This is no problem because we are now adept at porting to similar
instruction sets, through small changes to our lcc .md file, or to the
assembler.

A non-pipelined, no-DMA xr[n] would save (at least) these resources:
Savings   What
n LUTs    A forwarding mux
2n FFs    A, B operand registers
n LUTs    PC register file
(n FFs)   PC register
n FFs     RETAD register
n FFs     DOUT register
n FFs     EXIR
-----
2n LUTs 4n FFs

Changing the memory interface to Harvard style would eliminate the need to
save the next instruction in NEXTIR in the event of a load/store instruction
(see 2nd Circuit Cellar article):
Savings   What
n LUTs    NEXTIR
-----
3n LUTs 4n FFs

Changing the way interrupts work, or cutting them entirely, would eliminate
any further need for IRMUX:
Savings   What
n LUTs    IRMUX
-----
4n LUTs 4n FFs

Changing the instruction set to 2-operand (no r3=r1 op r2, only r1=r1 op r2)
could save:
Savings   What
n LUTs    2nd copy of register file
n FFs     "
-----
5n LUTs 5n FFs

Move PC into the register file (say r13), so that each instruction needs an
ifetch sub-cycle and an execute sub-cycle. Here branch displacements would
be added via the IMM mux. All addresses would be available at the ADDSUB
output. Savings:
Savings   What
n FFs     PC
n LUTs    PCINCR
8 LUTs    PCDISP
n LUTs    ADDRMUX
-----
7n+8 LUTs 6n FFs

Using the output register of a block RAM as the instruction register would
save:
Savings   What
n FFs     IR
-----
7n+8 LUTs 7n FFs

Halve the datapath into an 8-bit tall datapath and take 2 sub-sub-cycles per
sub-cycle.
Savings   What
n/2 LUTs  addsub
n/2 LUTs  logic
n/2 FFs   reg file output register
-----
8n+8 LUTs 7 1/2 n FFs
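
The running totals in the savings tables above can be tallied for a concrete width with a short Python sketch. The per-step increments are taken as the differences between successive running totals as stated; the step labels, tuple layout and function name are invented for this illustration, and the 300-cell starting point is the approximate xr16 figure quoted earlier in the thread.

```python
from fractions import Fraction

# Per-step savings as (label, LUT coefficient of n, constant LUTs,
# FF coefficient of n), derived from successive running totals above.
STEPS = [
    ("non-pipelined, no DMA",        2, 0, 4),
    ("Harvard memory interface",     1, 0, 0),
    ("no IRMUX",                     1, 0, 0),
    ("2-operand instruction set",    1, 0, 1),
    ("PC moved into register file",  2, 8, 1),
    ("block RAM output used as IR",  0, 0, 1),
    ("half-height datapath",         1, 0, Fraction(1, 2)),
]


def total_savings(n):
    """Return (luts_saved, ffs_saved) for an n-bit design (n even)."""
    luts = sum(coef * n + const for _, coef, const, _ in STEPS)
    ffs = sum(coef * n for _, _, _, coef in STEPS)
    return int(luts), int(ffs)


if __name__ == "__main__":
    luts, ffs = total_savings(16)
    print(f"saved: {luts} LUTs, {ffs} FFs")
    print(f"remaining from ~300 logic cells: ~{300 - luts}")
```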

For n=16, we save 136 LUTs and 120 FFs, so we have a 4-cycle per instruction
RISC in about 165 LUTS => 165 logic cells. So 150 is not out of the
question, but you'd need to change the ISA further. Of course, this
exercise is a little strained because we are taking things away from
something substantial instead of building something simpler up from nothing.

> And, if I were to do one of these in VHDL would that violate the spirit of
> your no-commercial-use license?

The License Agreement speaks for itself and takes precedence over anything I
might write here. That said, one could interpret it to not permit any use of
the work or any derivative work for any commercial purpose, and further to
not permit any distribution of a derivative work (except for a modification
to an excerpt).

So, are third party implementations of xr16 considered derivative works?
That depends upon how they are prepared. If one such implementation
contains any part of, or is a mere translation from, the XSOC Kit sources,
it's probably a derivative work.

I am contemplating cleaving the XSOC/xr16 project in two. Here's the
concept: The first part, the instruction set specifications, tests, and
tools, (except for the code covered by the lcc license), GR LLC would
relicense under some open source license that preserves the integrity of the
xr16/xr32 name. The second part, the implementation -- the schematics, HDL
code, and documentation related to that, GR LLC would continue to license
under the XSOC License Agreement. This action, *if taken*, might help
clarify the status of third party implementations based upon the xr specs
and tests and would permit non-derivative clean-room implementations to be
used without contamination with any XSOC-licensed works. It would also make
it easier for you "third parties" to enhance and redistribute changes to the
xr tools suite. If you strongly favor this, please let me know through
private email.

Jan Gray
Gray Research LLC






RE: Just what is small and is it the best? - Gary Watson - Oct 8 8:07:00 2000


Jan, following the suggestion on your web site, I looked at the Xilinx app
note Xapp213 which describes their KCPSM microcontroller. It's pretty cool
that they fit it in 35 CLB's and made a snazzy assembler for it. The only
thing that I'm uneasy about is the fact that it has a 256 instruction (16
bit wide) limit. I'm pretty sure I need a few k for what I want to do. I'm
going to send him feedback to ask how much work it would be to make an
upscale version of KCPSM...

Best Regards,

Gary Watson
Technical Director
Nexsan Technologies, Ltd.
Imperial House
East Service Road
Raynesway
Derby DE21 7BF ENGLAND
+44 (0) 1332 5 444 33
http://www.nexsan.com

-----Original Message-----
From: Jan Gray [mailto:]
Sent: Saturday, October 07, 2000 7:00 PM
To: fpga-cpu
Subject: RE: [fpga-cpu] Just what is small and is it the best?

> How small do you think a xr16-opcode-compatible cpu could be if one didn't
> care about speed? Do you think it could get below 150 logic cells, maybe?

Let me take you on a tour of less through more drastic changes to xr16 to
save area. At some point it ceases to be xr16, but retains its character.
This is no problem because we are now adept at porting to similar
instruction sets, through small changes to our lcc .md file, or to the
assembler.
[ excellent discussion condensed ]



