Discussion Groups | FPGA-CPU | NANO instruction set architecture

This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

Re: NANO instruction set architecture - Ben Franchuk - Jan 16 17:27:00 2002


Reinoud wrote:
>
> Hi all,
>
> I've made an attempt at designing an efficient architecture for very
> small implementations. The point of the architecture is the
> combination of small cores and small code size. From the spec:
>
> > The NANO (NANO Architecture Negates Overhead) architecture negates
> > overhead: it requires very little resources. Code is very compact,
> > and exceptionally small cores can deliver good performance.
> >
> > The NANO ISA (Instruction Set Architecture) combines features from
> > various architecture concepts, which makes it difficult to classify.
> > It is a simple load/store architecture with implicit operands and
> > variable-size immediates.
>
> For the full spec, go to:
>
> http://ce.et.tudelft.nl/~reinoud/nano/
>
> BTW, I'd be happy to post the full spec here, it's a plain text file
> anyway, but at 13k it may be somewhat large for a mailing list.
> Executive summary: 3 registers, 4-bit instruction set, no stack :-).
>
> I haven't implemented it yet (software simulation only so far); of
> course, the first implementation will target FPGA. A serial
> implementation will probably map very well to Virtex, a handful of
> CLBs should do it...
>
> Comments would be much appreciated!

This design could spend too much of its time unpacking #operands
for memory access if you want really high speed. Where speed is not
important, this looks promising in the minimal computer category.
As with One-Instruction computers, you spend a lot of effort in creating
indexed addressing modes. Instruction decoding is a pain for FPGA
logic and could be a problem for you if you have a lot of machine
states.
Good luck with the project.
--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

NANO instruction set architecture - Reinoud - Jan 16 17:31:00 2002


Hi all,

I've made an attempt at designing an efficient architecture for very
small implementations. The point of the architecture is the
combination of small cores and small code size. From the spec:

> The NANO (NANO Architecture Negates Overhead) architecture negates
> overhead: it requires very little resources. Code is very compact,
> and exceptionally small cores can deliver good performance.
>
> The NANO ISA (Instruction Set Architecture) combines features from
> various architecture concepts, which makes it difficult to classify.
> It is a simple load/store architecture with implicit operands and
> variable-size immediates.

For the full spec, go to:

http://ce.et.tudelft.nl/~reinoud/nano/

BTW, I'd be happy to post the full spec here, it's a plain text file
anyway, but at 13k it may be somewhat large for a mailing list.
Executive summary: 3 registers, 4-bit instruction set, no stack :-).

I haven't implemented it yet (software simulation only so far); of
course, the first implementation will target FPGA. A serial
implementation will probably map very well to Virtex, a handful of
CLBs should do it...

Comments would be much appreciated!

Regards,

- Reinoud







Re: NANO instruction set architecture - Reinoud - Jan 16 19:45:00 2002


Ben Franchuk wrote:
> This design could spend too much of its time unpacking #operands
> for memory access for really fast speed.

Good point! But:

- Speed is a secondary concern; size comes first.

- Who said you have to unpack immediates sequentially? You
might even want to execute an immediate instruction and the
instruction that uses the immediate in parallel, if you want
speed (and don't mind the cost). As a matter of fact, this is
one of the reasons why I don't specify instruction packing in
memory... There are some nice opportunities for alignment and
fast decode, if you care to spend the extra decoding logic and
code size for speed. Not that you'd usually want to.

> Like One-Instruction computers, you spend a lot of effort in creating
> indexed addressing modes.

Indexing is explicit, yes, but doesn't seem particularly expensive?
(For code size, or were you thinking of performance again?) If you
think there's a code size problem, would you care to elaborate?

> Instruction decoding is a pain for FPGA logic and could be a problem
> for you , if you have lot of machine states.

Very true. There aren't many states needed (e.g. immediate decoding
itself is designed to be stateless - for that reason). However, I
opted for a few extra states, which allowed for significant code
size savings... :-/

Thanks for the insightful comments!

- Reinoud






Re: NANO instruction set architecture - pagercam - Jan 18 3:13:00 2002

This reminds me a lot of the Transputer instruction set,
which used a 4-bit operation with a 4-bit data field. Take
a look at the architecture; it might give you some ideas
and suggestions. The company that designed it was
Inmos. There is a three-element stack, PC and workspace
pointer (IIRC). The instruction set is the same for the
16-bit and 32-bit versions. They had similar goals of
minimalistic hardware and were willing to concentrate on the
real idea of RISC: a small instruction set that does the
most common operations quickly and the more complex ones using a
few cycles. The gate count was low and the complexity
was factors less than the 386 of the time. Great architecture;
too bad it didn't survive. The small instructions really
pay off in a big way by keeping the program memory small.
I keep on hearing people say that code size doesn't matter,
but every cycle spent on loading code is one cycle
you can't load data.

http://www.geocities.com/SiliconValley/Heights/1190/specs.htm
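
For readers unfamiliar with that format, here is a minimal sketch of how a Transputer-style decoder builds operands of any size from 4-bit data fields. It relies on the prefix function codes from the Inmos instruction set (pfix = 2, nfix = 6). The Python names, the 32-bit default width and the unsigned return convention are assumptions made for illustration. This is the Transputer scheme, not NANO's immediate encoding.

```python
# Sketch of Transputer-style operand building (not NANO's own encoding).
# Each instruction byte holds a 4-bit function code and a 4-bit data field.
PFIX = 0x2  # prefix: shift the operand register left by 4 bits
NFIX = 0x6  # negative prefix: complement the operand register, then shift


def decode(code, word_bits=32):
    """Return (function_code, operand) pairs for a sequence of bytes.

    word_bits is an assumption of this sketch; operands are returned as
    unsigned values masked to that width.
    """
    mask = (1 << word_bits) - 1
    oreg = 0  # operand register, built up 4 bits at a time
    decoded = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        oreg = (oreg | data) & mask
        if function == PFIX:
            oreg = (oreg << 4) & mask
        elif function == NFIX:
            oreg = (~oreg << 4) & mask
        else:
            decoded.append((function, oreg))
            oreg = 0  # the operand is consumed by the instruction
    return decoded


def to_signed(value, word_bits=32):
    """Interpret an unsigned word as a two's complement number."""
    return value - (1 << word_bits) if value >> (word_bits - 1) else value
```

For example, `decode(bytes([0x27, 0x25, 0x44]))` yields one instruction with function code 4 (load constant) and operand 0x754: two prefix bytes plus the instruction byte itself. Small operands in the range 0..15 need no prefix at all, which is where the code size saving comes from.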

--- In fpga-cpu@y..., Reinoud <dus@w...> wrote:
>
> Hi all,
>
> I've made an attempt at designing an efficient architecture for very
> small implementations. The point of the architecture is the
> combination of small cores and small code size. From the spec:
>
> > The NANO (NANO Architecture Negates Overhead) architecture negates
> > overhead: it requires very little resources. Code is very compact,
> > and exceptionally small cores can deliver good performance.
> >
> > The NANO ISA (Instruction Set Architecture) combines features from
> > various architecture concepts, which makes it difficult to classify.
> > It is a simple load/store architecture with implicit operands and
> > variable-size immediates.
>
> For the full spec, go to:
>
> http://ce.et.tudelft.nl/~reinoud/nano/





Re: NANO instruction set architecture - Tim Boescke - Jan 18 3:42:00 2002

> Hi all,
>
> I've made an attempt at designing an efficient architecture for very
> small implementations. The point of the architecture is the
> combination of small cores and small code size. From the spec:

> Comments would be much appreciated!

Well, in my opinion the encoding of the immediates is quite
a bottleneck. A big problem is of course the very complicated
immediate decoding, which is either slow or eats a lot of logic.
Since you don't have an address register, loading of constants will
be quite frequent in a real program. (~40% ?)

A sane way to program with this instruction set is probably to
use the memory locations -4..3 as local registers. Basically
this means that each register read/write costs 12 bits of program
code... An instruction set using 12 bit encoding would be probably
far more efficient. My guess is that the code density of this
instruction set is not very good..

Other things which came to my mind:

- It's missing a negating logic instruction. Thus not all boolean
operations are possible.
- Why does the park instruction waste memory address zero, which
is also very easy to access otherwise?
- Why does the load instruction load to B? This makes copying etc.
very inefficient. (address deleted, additional swap to A)
- Is rotate/shift right really used often enough to justify an extra
instruction/extra hardware? (shift left is ADD)

Well, just for general amusement, here is an old unfinished attempt
of mine on a very similar instruction set. Basically it is the Steamer
design without a stack.

---------------------------------------------------------------------

Registers:
(all are 16 bit)

PC
D (Address, memory reference)
Akku

Instructions:

Three-bit encoding, comes in bundles of five.

0  SWP  D<=A, A<=D       Swap D and Akku
1  LDI  D<=(PC), PC++    Move data at PC to D register, increment PC
2  LDA  A<=(D)
3  STO  (D)<=A           Store Akku at D
4  ADD  A = A + (D)      Add (D) to Akku
5  AND  A = A AND (D)
6  XOR  A = A XOR (D)
7  ZGO  PC<=D, if A=0    Jump to D if Akku equals zero (might make
                         sense to use a carry instead)

The nop is replaced with pairs of SWP, while the last nop is encoded by
using the remaining bit of a bundle. Trailing nops in a bundle don't make
sense, and thus the assembler should reject them. The format of one bundle is:

1111110000000000
5432109876543210

AAABBBCCCDDDEEEN

Starting with instruction A

If N=1 skip E
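
A minimal sketch of how the bundle format above could be packed and unpacked, assuming slot A sits in bits 15..13 and N in bit 0, exactly as in the diagram. The function names are hypothetical. Shorter instruction sequences, which the post says are padded with SWP pairs, are not handled here.

```python
# Mnemonics in opcode order 0..7, as listed in the instruction table.
MNEMONICS = ["SWP", "LDI", "LDA", "STO", "ADD", "AND", "XOR", "ZGO"]


def decode_bundle(word):
    """Split a 16-bit bundle AAABBBCCCDDDEEEN into its mnemonics.

    Slot A is in bits 15..13 and comes first; bit 0 is N.
    When N is 1, slot E is skipped.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("bundle must fit in 16 bits")
    slots = [(word >> shift) & 0b111 for shift in (13, 10, 7, 4, 1)]
    if word & 1:
        slots = slots[:4]  # N=1: drop slot E
    return [MNEMONICS[op] for op in slots]


def encode_bundle(mnemonics):
    """Pack four or five mnemonics; four sets N=1 and leaves slot E as 0."""
    if len(mnemonics) not in (4, 5):
        raise ValueError("this sketch packs four or five instructions")
    ops = [MNEMONICS.index(m) for m in mnemonics]
    n_bit = 1 if len(ops) == 4 else 0
    ops += [0] * (5 - len(ops))
    word = 0
    for op in ops:
        word = (word << 3) | op
    return (word << 1) | n_bit
```

With this layout, `encode_bundle(["LDI", "LDA", "ADD", "STO"])` gives 0x2A31, and `decode_bundle(0x2A31)` returns the same four mnemonics.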





Re: NANO instruction set architecture - Ben Franchuk - Jan 21 16:42:00 2002

Reinoud wrote:

<snip>
>
> PS. Waited with sending this until an update based on Tim's comments
> was finished:
>
> 1.1-0: [2002.01.21] Removed the ZER instruction; added the LOD
> instruction; changed the LOA instruction to load to A.
> Removed opcodes and short mnemonics from the instruction
> table (and rearranged the table). Added acknowledgement of
> Tim Boescke's input.
>
> See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec.

One thought while reading: is byte size really that important?
Other than small constants, character data is generally the most
byte-sized data around, and characters are heading towards a 16+ bit encoding.

--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: NANO instruction set architecture - Reinoud - Jan 21 17:31:00 2002


Tim,

Thanks for the excellent criticism; I certainly agree with you on
several points.

Tim Boescke wrote:
> Well in my opinion the encoding of the immediates is quite
> a bottleneck. A big problem is of course the very complicated
> immediate decoding which is either slow or eats a lot of logic.

Allow me to disagree here... Yes, the format is slightly geared
towards sequential implementation, but not too much.

For bit- or nybble-serial cores, the immediate encoding is quite
natural (cheap and no performance problem). Simple parallel
implementations will indeed be relatively slow, but very economical.
Anyway, most immediates are small so performance doesn't actually
suffer much (while a lot of memory is saved). Note that the branch
and trap instructions usually take small immediates.

To obtain higher performance for large immediates, without spending
much on decode logic, support fast decoding only for aligned (and
possibly fixed size) immediates. Code with aligned immediates will
still be binary compatible (the code can be padded with ZER or ONE
instructions to get the immediate to the right place; the fast
decoder should recognize this as a special case). The assembler, or
even the loader, might do the proper padding for a particular target
(immediate sizes have to be determined anyway). The cost of aligned
immediates is mostly in code size; but as alignment is optional, its
use can be restricted to where performance is needed (e.g. inner
loops). Best of both worlds... :-)
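
A rough sketch of the padding calculation an assembler or loader might do for aligned immediates. It assumes that instructions are 4-bit nibbles counted from offset 0 and that each padding instruction occupies one nibble. The spec deliberately leaves packing in memory open, so the function name and the nibble-offset model are illustrative assumptions, not part of NANO.

```python
def padding_nibbles(position, alignment):
    """Number of one-nibble padding instructions needed before an immediate.

    position  -- nibble offset at which the immediate would otherwise start
    alignment -- required alignment in nibbles (e.g. 4 for a 16-bit boundary)
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    # Distance from position up to the next multiple of alignment.
    return (-position) % alignment
```

An immediate that would start at nibble 5 with 4-nibble alignment needs `padding_nibbles(5, 4)` = 3 padding instructions; one that already starts at nibble 8 needs none. Since the padding only costs code size, it can be limited to the inner loops where decode speed matters.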

> Since you dont have an adress register, loading of constants will
> be quite frequent in a real program. (~40% ?)

Sure, constants are often needed - but they aren't that expensive
because of the variable-size immediates (and if they are big, the
park instruction may help). BTW, I don't think just having an
address register instead of an operand register improves things. The
address register will have to be loaded or adjusted with immediates
all the time anyway, and immediate operands for ALU ops will be a
problem (i.e. require extra opcodes). Another issue is that the
variable-size immediates need a register for sequential construction,
which makes adjusting (adding to) an address register costly (can't
simply use the address register for building the immediate).

All in all, I think the approach with two operand registers wins...

> A sane way to program with this instruction set is probably to
> use the memory locations -4..3 as local registers.

Yes, that's a reasonable approach.

> Basically this means that each register read/write costs 12 bits of
> program code...

Well, locations 0 and 1 can be reached with 8 bits (thanks to the
handy ZER and ONE instructions).

> An instruction set using 12 bit encoding would be probably
> far more efficient. My guess is that the code density of this
> instruction set is not very good..

Okay, I agree this is a problem. However, I don't think it's as bad
as you describe; besides the 0 and 1 locations, the trap instruction
can come to the rescue. With just one byte (i.e. using one 4-bit
immediate specifier), 8 different traps can be specified. When using
this for a few stack operations (push, pop, stack adjust), and using
the 0 and 1 locations as 'registers', you have some fairly low cost
memory access.

Some improvement here would clearly be nice; I'll reconsider load and
store 'direct' instructions.

> Other things which came to my mind:
>
> - Its missing a negating logic instruction. Thus not all boolean
> operations are possible.

Boolean negate: 1-x; bitwise negate: -1-x. So to do a boolean negate
of the value in A: ONE SWA SUB.
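
The two identities are easy to check in a few lines. This is a sketch only: the 16-bit default word width is an assumption (the width is not stated in this thread), and the function names are made up for the example.

```python
def boolean_not(x):
    """Boolean negate for x in {0, 1}, computed as 1 - x."""
    if x not in (0, 1):
        raise ValueError("boolean_not expects 0 or 1")
    return 1 - x


def bitwise_not(x, word_bits=16):
    """Bitwise negate computed as -1 - x, wrapped to word_bits.

    In two's complement, -1 is all ones, so subtracting x from it
    flips every bit of x without ever producing a borrow.
    """
    mask = (1 << word_bits) - 1
    return (-1 - x) & mask
```

So subtraction plus a cheap constant covers negation, and with AND and XOR available the remaining boolean operations can be composed from there.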

> - Why does the park instruction waste memory address zero which
> is also very easy to access otherwhise ?

Good point. It's just a choice, for several small reasons:

1) The park instruction sometimes allows for shorter code sequences,
but for such use it sure helps when the address is cheap to load
from...

2) I think it makes sense to use at least one of these cheap
locations as an evaluation or scratch 'register', so you'd expect
(and not mind) to lose its contents when doing a call etc. Now, an
important use for the park instruction is to get the contents of B
'out of the way' at the start of a subroutine (called with jump and
link) or a trap handler, so that a value passed in A can be used or
stored. You can either pass A and overwrite address 0 (with park),
or keep address 0 intact and overwrite A (to save the return address
in B).

3) Even on the smallest possible systems (with code in ROM and very
little RAM), you can count on this location to be available in RAM.

Do you have a better idea?

> - Why does the load instruction load to B ? This makes copying etc.
> very inefficient. (adress deleted, addition swap to A)

Loading to A would overwrite both A and B (B is needed for the
address). This would make the instruction practically useless. It
would make sense in combination with a load direct (to B).

> - Is rotate/shift right really used often enough to justify an extra
> instruction/extra hardware ? (shift left is ADD)

Yes, that's arguable, though I expect it's generally useful for a
small controller (doing a lot of bit twiddling). The real reason for
including it, though, is to make packed data structures cheap (so you
can save memory again).

> PC
> D (Adress, memory reference)
> Akku
>
> Instructions:
>
> three bit encoding, comes in bundles of five.
>
> 0  SWP  D<=A, A<=D       Swap D and Akku
> 1  LDI  D<=(PC), PC++    move data at PC to D-Register, increment PC
> 2  LDA  A<=(D)
> 3  STO  (D)<=A           Store akku at D
> 4  ADD  A = A + (D)      Add (d) to Akku
> 5  AND  A = A AND (D)
> 6  XOR  A = A XOR (D)
> 7  ZGO  PC<=D, if A=0    jump to D, if Akku equals zero (might make
>                          sense to use a carry instead)

Nice! BTW, wouldn't this architecture need quite a bit more code
space for address constants than NANO?

Oops, long post. Sorry :-).

- Reinoud

PS. Waited with sending this until an update based on Tim's comments
was finished:

1.1-0: [2002.01.21] Removed the ZER instruction; added the LOD
instruction; changed the LOA instruction to load to A.
Removed opcodes and short mnemonics from the instruction
table (and rearranged the table). Added acknowledgement of
Tim Boescke's input.

See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec.






Re: Re: NANO instruction set architecture - Reinoud - Jan 21 17:48:00 2002

pagercam wrote:
> This reminds me a lot of the Transputer instruction set
> which used a 4 bit operation with 4 bit data field take
> a look at the architecture it might give you some ideas
> and suggestions.

Yeah, I remember it (having actually designed Transputer-based
systems back in those days;).

> Great architecure too bad it didn't survive.

Well, they priced their chips way out there (esp. the T800 series
with floating point), so people weren't too eager to buy them :-/.
Also, the stack architecture didn't scale well to higher performance
(and they tried to compete on performance).

Too bad indeed, it was a nice design...

- Reinoud






Re: NANO instruction set architecture - rtfinch35 - Jan 22 3:35:00 2002

--- In fpga-cpu@y..., Reinoud <dus@w...> wrote:
>
> Hi all,
>
> I've made an attempt at designing an efficient architecture for very
> small implementations. The point of the architecture is the
> combination of small cores and small code size. From the spec:

A true test of the architecture would be to implement a "real" app on
the processor and compare that to an equivalent app on another
processor. You can make a pretty decent small RISC in about 150 LUTs,
less if things are done serially. I think this would be hard to beat.

Rob





Re: NANO instruction set architecture - Reinoud - Jan 22 5:10:00 2002

Ben Franchuk wrote:
> > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec.
>
> One thought while reading , is byte size really that important?
> Character data generally is the most byte sized data around than
> small constants, and characters are heading towards a 16+ bit encoding.

Where does it say that byte size is important?

- Reinoud






Re: Re: NANO instruction set architecture - Reinoud - Jan 22 6:44:00 2002


rtfinch35 wrote:
> A true test of the architecture would be to implement a "real" app

Of course.

> on the processor and compare that to an equivalent app on another
> processor.

I think you should compare that to an equivalent app on another
processor *with the same core size*, or even better, compare the
combined cost of both core and memory for the application.

> You can make a pretty decent small RISC an about 150 LUTs,
> less if things are done serially. I think this would be hard to beat.

Well, that depends on the application size and the efficiency of the
ISA. I'm trying to optimise total cost (core + memory) for small
applications (maybe I should have stated this more clearly). For
larger applications, larger cores with better code density often make
more sense. Code size can practically always be improved by spending
more on instruction decode (e.g. use compression). Sometimes, a
plain RISC will make most sense, but that's usually when performance
matters more than code size.

BTW, 150 LUTs (LCs?) are a lot. You may be able to fit an entire
serial (NANO) core and all the memory for a small control application
in there (with Xilinx LUTs that is).
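
A back-of-the-envelope sketch of that combined cost, assuming all memory lives in LUT RAM at 16 bits per LUT (a 4-input Xilinx LUT used as a 16x1 RAM). The function names are hypothetical, and any core sizes fed in are placeholders rather than measurements.

```python
def total_cost_luts(core_luts, memory_bits, bits_per_lut=16):
    """Core plus memory cost in LUTs, with memory held in LUT RAM."""
    if core_luts < 0 or memory_bits < 0 or bits_per_lut <= 0:
        raise ValueError("sizes must be non-negative, bits_per_lut positive")
    memory_luts = -(-memory_bits // bits_per_lut)  # ceiling division
    return core_luts + memory_luts


def memory_budget_bits(budget_luts, core_luts, bits_per_lut=16):
    """Bits of LUT RAM left over once the core is placed."""
    if core_luts > budget_luts:
        raise ValueError("core does not fit in the budget")
    return (budget_luts - core_luts) * bits_per_lut
```

As an illustration with a made-up core size: within a 150-LUT budget, a 30-LUT core would leave `memory_budget_bits(150, 30)` = 1920 bits, or 480 four-bit instructions if all of it were used for code. A 150-LUT core leaves nothing.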

- Reinoud






Re: NANO instruction set architecture - Ben Franchuk - Jan 22 13:50:00 2002

Reinoud wrote:
>
> Ben Franchuk wrote:
> > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec.
> >
> > One thought while reading , is byte size really that important?
> > Character data generally is the most byte sized data around than
> > small constants, and characters are heading towards a 16+ bit encoding.
>
> Where does it say that byte size is important?
>
> - Reinoud

You are right in this case for the NANO architecture. However, 4-bit
wide memory is hard to find. The big test is indeed writing programs
for the CPU. How many Turing machine programs have you written?
I find I spend a lot of time revising my CPU because I want to make
it easy to write programs and still not have too complex hardware.
50 pages of schematics is more than ample for my CPU, and my FPGA is
98% full.

I expect that there are thresholds of logical mass that define
the power of a computer system. Memory width, addressing range,
instruction architecture and speed are all tightly bound. You need
to look at all aspects of a computer system, and it is the weakest
link that slows the system down. Historically you had:

1) Large word length serial and parallel processors, >= 28 bits, with
about 8 instructions and tiny memory.
2) Smaller word sized computers, >= 12 bits, with paged and indirect
memory and 4/8K words of memory.
3) 4 & 8 bit controllers with up to 64k of memory.
4) 16/8 bit machines with 64kb data and 64kb code space - PDP-11, 8086.
5) Big machines with few registers.
6) Load/store architectures whose internal design is looking a lot
like #1.

Is the wheel of computers going around again?
The NANO instruction set seems to place it around 2 and 3 for design, thus
hinting that most programs would be under 64K words long.

--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html







Re: Re: NANO instruction set architecture - Ben Franchuk - Jan 22 14:42:00 2002

Reinoud wrote:
> Well, that depends on the application size and the efficiency of the
> ISA. I'm trying to optimise total cost (core + memory) for small
> applications (maybe I should have stated this more clearly). For
> larger applications, larger cores with better code density often make
> more sense. Code size can practically always be improved by spending
> more on instruction decode (e.g. use compression). Sometimes, a
> plain RISC will make most sense, but that's usually when performance
> matters more than code size.
>
> BTW, 150 LUTs (LCs?) are a lot. You may be able to fit an entire
> serial (NANO) core and all the memory for a small control application
> in there (with Xilinx LUTs that is).

The real test would be on several different FPGA architectures.
A lot of tiny RISC FPGA designs would not be so tiny if they did
not use the two-port RAM and 3-state lines. Give me good old real
gate count ... ??? 2-input ANDs ... ??? 3-input NORs and so forth.
Btw, is NANO-NANO a computer from the planet Ork? :)
--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: NANO instruction set architecture - Reinoud - Jan 22 15:08:00 2002

Ben Franchuk wrote:
> > Ben Franchuk wrote:
> > > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec.
> > >
> > > One thought while reading , is byte size really that important?
> > > Character data generally is the most byte sized data around than
> > > small constants, and characters are heading towards a 16+ bit encoding.
> >
> > Where does it say that byte size is important?
> >
> > - Reinoud
>
> You are right in this case for the NANO architecture. How ever 4 bit
> wide memory is hard to find.

Where does it say that 4-bit memory is used?

- Reinoud






Re: NANO instruction set architecture - Ben Franchuk - Jan 22 15:14:00 2002

Tommy Thorn wrote:
>
> --- Ben Franchuk <> wrote:
> > The big test is indeed writing programs
> > for the cpu. How many Turning machine programs
> > have you written? I find I spend a lot of time
> > revising my cpu because I want to make
> > it easy to write programs and still not have too
> > complex hardware.
>
> Have you considered writing a simulator first
> and experimenting with it? It is much easier to
> write. Experimenting before getting too deep
> into the FPGA code could save you some nasty
> surprises.
>
> /Tommy

Thanks, but I have all the nasty surprises with Intel
and Microsoft that I need.
I have had a simulator and assembler and
a self-compiling C compiler for a long time now. The problem
is that the 'small C compiler' does not have structures or good
code generation. The later versions of 'small C' do not
compile under Microsoft's C compiler, thus I can't port them.

Most of the upgrades over the last few weeks have been
what last-minute features I can add at low cost, or stuff
that has to do with physical design like control signals
and timing. I chose to debug the FPGA with the traditional
lights and switches rather than a software gate simulation
and did find several problems like open wires or stupid
logic bugs like the polarity of the RS232 start bit being wrong.
I debugged the UART module, then the bootstrap setup, then
the CPU. It is just final testing that needs to be done now.

--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html




