This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
Reinoud wrote: > > Hi all, > > I've made an attempt at designing an efficient architecture for very > small implementations. The point of the architecture is the > combination of small cores and small code size. From the spec: > > > The NANO (NANO Architecture Negates Overhead) architecture negates > > overhead: it requires very little resources. Code is very compact, > > and exceptionally small cores can deliver good performance. > > > > The NANO ISA (Instruction Set Architecture) combines features from > > various architecture concepts, which makes it difficult to classify. > > It is a simple load/store architecture with implicit operands and > > variable-size immediates. > > For the full spec, go to: > > http://ce.et.tudelft.nl/~reinoud/nano/ > > BTW, I'd be happy to post the full spec here, it's a plain text file > anyway, but at 13k it may be somewhat large for a mailing list. > Executive summary: 3 registers, 4-bit instruction set, no stack :-). > > I haven't implemented it yet (software simulation only so far); of > course, the first implementation will target FPGA. A serial > implementation will probably map very well to Virtex, a handful of > CLBs should do it... > > Comments would be much appreciated! This design could spend too much of its time unpacking #operands for memory access to reach really high speed. Where speed is not important, this looks promising in the minimal computer category. Like one-instruction computers, you spend a lot of effort creating indexed addressing modes. Instruction decoding is a pain in FPGA logic and could be a problem for you if you have a lot of machine states. Good luck with the project. -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |
|
Hi all, I've made an attempt at designing an efficient architecture for very small implementations. The point of the architecture is the combination of small cores and small code size. From the spec: > The NANO (NANO Architecture Negates Overhead) architecture negates > overhead: it requires very little resources. Code is very compact, > and exceptionally small cores can deliver good performance. > > The NANO ISA (Instruction Set Architecture) combines features from > various architecture concepts, which makes it difficult to classify. > It is a simple load/store architecture with implicit operands and > variable-size immediates. For the full spec, go to: http://ce.et.tudelft.nl/~reinoud/nano/ BTW, I'd be happy to post the full spec here, it's a plain text file anyway, but at 13k it may be somewhat large for a mailing list. Executive summary: 3 registers, 4-bit instruction set, no stack :-). I haven't implemented it yet (software simulation only so far); of course, the first implementation will target FPGA. A serial implementation will probably map very well to Virtex, a handful of CLBs should do it... Comments would be much appreciated! Regards, - Reinoud |
|
|
|
Ben Franchuk wrote: > This design could spend too much of its time unpacking #operands > for memory access for really fast speed. Good point! But: - Speed is a secondary concern; size comes first. - Who said you have to unpack immediates sequentially? You might even want to execute an immediate instruction and the instruction that uses the immediate in parallel, if you want speed (and don't mind the cost). As a matter of fact, this is one of the reasons why I don't specify instruction packing in memory... There are some nice opportunities for alignment and fast decode, if you care to spend the extra decoding logic and code size for speed. Not that you'd usually want to. > Like One-Instruction computers, you spend a lot of effort in creating > indexed addressing modes. Indexing is explicit, yes, but doesn't seem particularly expensive? (For code size, or were you thinking of performance again?) If you think there's a code size problem, would you care to elaborate? > Instruction decoding is a pain for FPGA logic and could be a problem > for you , if you have lot of machine states. Very true. There aren't many states needed (e.g. immediate decoding itself is designed to be stateless - for that reason). However, I did opt for a few extra states which allowed for significant code size savings... :-/ Thanks for the insightful comments! - Reinoud |
|
This reminds me a lot of the Transputer instruction set, which used a 4-bit operation with a 4-bit data field. Take a look at the architecture; it might give you some ideas and suggestions. The company that designed it was Inmos. There is a three-element stack, a PC and a workspace pointer (IIRC). The instruction set is the same for the 16-bit and 32-bit versions. They had similar goals of minimalistic hardware and were willing to concentrate on the real idea of RISC: a small instruction set that does the most common operations quickly and the more complex ones in a few cycles. The gate count was low, and the complexity was factors less than that of the 386 of the time. Great architecture; too bad it didn't survive. The small instructions really pay off in a big way by keeping the program memory small. I keep hearing people say that code size doesn't matter, but every cycle spent loading code is a cycle you can't use to load data. http://www.geocities.com/SiliconValley/Heights/1190/specs.htm --- In fpga-cpu@y..., Reinoud <dus@w...> wrote: > > Hi all, > > I've made an attempt at designing an efficient architecture for very > small implementations. The point of the architecture is the > combination of small cores and small code size. From the spec: > > > The NANO (NANO Architecture Negates Overhead) architecture negates > > overhead: it requires very little resources. Code is very compact, > > and exceptionally small cores can deliver good performance. > > > > The NANO ISA (Instruction Set Architecture) combines features from > > various architecture concepts, which makes it difficult to classify. > > It is a simple load/store architecture with implicit operands and > > variable-size immediates. > > For the full spec, go to: > > http://ce.et.tudelft.nl/~reinoud/nano/ |
|
> Hi all,
>
> I've made an attempt at designing an efficient architecture for very
> small implementations. The point of the architecture is the
> combination of small cores and small code size. From the spec:

> Comments would be much appreciated!

Well, in my opinion the encoding of the immediates is quite a bottleneck. A big problem is of course the very complicated immediate decoding, which is either slow or eats a lot of logic.

Since you don't have an address register, loading of constants will be quite frequent in a real program (~40%?).

A sane way to program with this instruction set is probably to use the memory locations -4..3 as local registers. Basically this means that each register read/write costs 12 bits of program code... An instruction set using a 12-bit encoding would probably be far more efficient. My guess is that the code density of this instruction set is not very good.

Other things which came to my mind:

- It's missing a negating logic instruction. Thus not all boolean operations are possible.
- Why does the park instruction waste memory address zero, which is also very easy to access otherwise?
- Why does the load instruction load to B? This makes copying etc. very inefficient (address deleted, additional swap to A).
- Is rotate/shift right really used often enough to justify an extra instruction/extra hardware? (Shift left is ADD.)

Well, just for general amusement, here is an old unfinished attempt of mine at a very similar instruction set. Basically it is the Steamer design without a stack.

---------------------------------------------------------------------

Registers (all are 16 bit):

PC
D     (address, memory reference)
Akku

Instructions: three-bit encoding, comes in bundles of five.
0 SWP  D<=A, A<=D         Swap D and Akku
1 LDI  D<=(PC), PC++      Move data at PC to D register, increment PC
2 LDA  A<=(D)
3 STO  (D)<=A             Store Akku at D
4 ADD  A = A + (D)        Add (D) to Akku
5 AND  A = A AND (D)
6 XOR  A = A XOR (D)
7 ZGO  PC<=D, if A=0      Jump to D if Akku equals zero (might make sense to use a carry instead)

The nop is replaced with pairs of SWP, while the last nop is encoded by using the remaining bit of a bundle. Trailing nops in a bundle don't make sense and thus the assembler should reject them.

The format of one bundle is:

1111110000000000
5432109876543210
AAABBBCCCDDDEEEN

Starting with instruction A. If N=1, skip E. |
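The bundle layout above can be sketched in a few lines of Python. This is only an illustration of the bit packing as described in the post (slot A in bits 15..13 down to slot E in bits 3..1, N in bit 0); the function names `encode_bundle` and `decode_bundle` are made up for this sketch, and padding of bundles shorter than four instructions is deliberately left out because the post does not pin it down.

```python
# Opcode order follows the table in the post (0 = SWP ... 7 = ZGO).
MNEMONICS = ["SWP", "LDI", "LDA", "STO", "ADD", "AND", "XOR", "ZGO"]

# Low bit position of each 3-bit slot: A = 15..13, B = 12..10, ..., E = 3..1.
SLOT_SHIFTS = (13, 10, 7, 4, 1)


def encode_bundle(names):
    """Pack four or five mnemonics into one 16-bit bundle.

    With four instructions, slot E is left as zero and N (bit 0) is set
    so that a decoder skips it. Other lengths are rejected here; how an
    assembler should pad shorter bundles is not modelled (assumption).
    """
    if len(names) not in (4, 5):
        raise ValueError("a bundle holds four or five instructions")
    word = 0
    for name, shift in zip(names, SLOT_SHIFTS):
        word |= MNEMONICS.index(name) << shift
    if len(names) == 4:
        word |= 1  # N = 1: skip slot E
    return word


def decode_bundle(word):
    """Return the mnemonics held in a 16-bit bundle, in execution order."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("bundle must fit in 16 bits")
    ops = [(word >> shift) & 0b111 for shift in SLOT_SHIFTS]
    if word & 1:  # N = 1: slot E is skipped
        ops.pop()
    return [MNEMONICS[op] for op in ops]


if __name__ == "__main__":
    bundle = encode_bundle(["LDI", "LDA", "SWP", "LDI", "ADD"])
    print(f"{bundle:016b}", decode_bundle(bundle))
```

Note that an LDI inside a bundle fetches its data from the word at PC, so a real assembler would also have to place those data words after the bundle; the sketch only covers the instruction word itself.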
|
Reinoud wrote: <snip> > > PS. Waited with sending this until an update based on Tim's comments > was finished: > > 1.1-0: [2002.01.21] Removed the ZER instruction; added the LOD > instruction; changed the LOA instruction to load to A. > Removed opcodes and short mnemonics from the instruction > table (and rearranged the table). Added acknowledgement of > Tim Boescke's input. > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec. One thought while reading: is byte size really that important? Other than small constants, character data is generally the most common byte-sized data around, and characters are heading towards a 16+ bit encoding. -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |
|
Tim, Thanks for the excellent criticism; I certainly agree with you on several points. Tim Boescke wrote: > Well in my opinion the encoding of the immediates is quite > a bottleneck. A big problem is of course the very complicated > immediate decoding which is either slow or eats a lot of logic. Allow me to disagree here... Yes, the format is slightly geared towards sequential implementation, but not too much. For bit- or nybble-serial cores, the immediate encoding is quite natural (cheap and no performance problem). Simple parallel implementations will indeed be relatively slow, but very economical. Anyway, most immediates are small so performance doesn't actually suffer much (while a lot of memory is saved). Note that the branch and trap instructions usually take small immediates. To obtain higher performance for large immediates, without spending much on decode logic, support fast decoding only for aligned (and possibly fixed size) immediates. Code with aligned immediates will still be binary compatible (the code can be padded with ZER or ONE instructions to get the immediate to the right place; the fast decoder should recognize this as a special case). The assembler, or even the loader, might do the proper padding for a particular target (immediate sizes have to be determined anyway). The cost of aligned immediates is mostly in code size; but as alignment is optional, its use can be restricted to where performance is needed (e.g. inner loops). Best of both worlds... :-) > Since you dont have an adress register, loading of constants will > be quite frequent in a real program. (~40% ?) Sure, constants are often needed - but they aren't that expensive because of the variable-size immediates (and if they are big, the park instruction may help). BTW, I don't think just having an address register instead of an operand register improves things. 
The address register will have to be loaded or adjusted with immediates all the time anyway, and immediate operands for ALU ops will be a problem (i.e. require extra opcodes). Another issue is that the variable-size immediates need a register for sequential construction, which makes adjusting (adding to) an address register costly (can't simply use the address register for building the immediate). All in all, I think the approach with two operand registers wins... > A sane way to program with this instruction set is probably to > use the memory locations -4..3 as local registers. Yes, that's a reasonable approach. > Basically this means that each register read/write costs 12 bits of > program code... Well, locations 0 and 1 can be reached with 8 bits (thanks to the handy ZER and ONE instructions). > An instruction set using 12 bit encoding would be probably > far more efficient. My guess is that the code density of this > instruction set is not very good.. Okay, I agree this is a problem. However, I don't think it's as bad as you describe; besides the 0 and 1 locations, the trap instruction can come to the rescue. With just one byte (i.e. using one 4-bit immediate specifier), 8 different traps can be specified. When using this for a few stack operations (push, pop, stack adjust), and using the 0 and 1 locations as 'registers', you have some fairly low cost memory access. Some improvement here would clearly be nice; I'll reconsider load and store 'direct' instructions. > Other things which came to my mind: > > - Its missing a negating logic instruction. Thus not all boolean > operations are possible. Boolean negate: 1-x; bitwise negate: -1-x. So to do a boolean negate of the value in A: ONE SWA SUB. > - Why does the park instruction waste memory address zero which > is also very easy to access otherwhise ? Good point. 
It's just a choice, for several small reasons: 1) The park instruction sometimes allows for shorter code sequences, but for such use it sure helps when the address is cheap to load from... 2) I think it makes sense to use at least one of these cheap locations as an evaluation or scratch 'register', so you'd expect (and not mind) to lose its contents when doing a call etc. Now, an important use for the park instruction is to get the contents of B 'out of the way' at the start of a subroutine (called with jump and link) or a trap handler, so that a value passed in A can be used or stored. You can either pass A and overwrite address 0 (with park), or keep address 0 intact and overwrite A (to save the return address in B). 3) Even on the smallest possible systems (with code in ROM and very little RAM), you can count on this location to be available in RAM. Do you have a better idea? > - Why does the load instruction load to B ? This makes copying etc. > very inefficient. (adress deleted, addition swap to A) Loading to A would overwrite both A and B (B is needed for the address). This would make the instruction practically useless. It would make sense in combination with a load direct (to B). > - Is rotate/shift right really used often enough to justify an extra > instruction/extra hardware ? (shift left is ADD) Yes, that's arguable, though I expect it's generally useful for a small controller (doing a lot of bit twiddling). The real reason for including it, though, is to make packed data structures cheap (so you can save memory again). > PC > D (Adress, memory reference) > Akku > > Instructions: > > three bit encoding, comes in bundles of five. 
> > 0 SWP D<=A, A<=D Swap D and Akku > 1 LDI D<=(PC), PC++ move data at PC to D-Register, > increment PC > 2 LDA A<=(D) > 3 STO (D)<= A Store akku at D > 4 ADD A = A + (D) Add (d) to Akku > 5 AND A = A AND (D) > 6 XOR A = A XOR (D) > 7 ZGO PC <= D, if A=0 jump to D, if Akku equals zero (might > make sense to > use a carry instead) Nice! BTW, wouldn't this architecture need quite a bit more code space for address constants than NANO? Oops, long post. Sorry :-). - Reinoud PS. Waited with sending this until an update based on Tim's comments was finished: 1.1-0: [2002.01.21] Removed the ZER instruction; added the LOD instruction; changed the LOA instruction to load to A. Removed opcodes and short mnemonics from the instruction table (and rearranged the table). Added acknowledgement of Tim Boescke's input. See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec. |
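The arithmetic identities used in the reply above (boolean negate as 1-x, bitwise negate as -1-x, and shift left as ADD) can be checked with a short Python sketch. The 16-bit word size and the helper names are assumptions made for this illustration only; they are not taken from the NANO spec.

```python
WIDTH = 16                 # assumed word size; not taken from the NANO spec
MASK = (1 << WIDTH) - 1    # all ones, i.e. -1 in two's complement


def add(a, b):
    """Wrap-around addition on WIDTH-bit words."""
    return (a + b) & MASK


def sub(a, b):
    """Wrap-around subtraction on WIDTH-bit words."""
    return (a - b) & MASK


def boolean_not(x):
    """1 - x: maps 0 to 1 and 1 to 0."""
    return sub(1, x)


def bitwise_not(x):
    """-1 - x: flips every bit of the word."""
    return sub(MASK, x)


def shift_left(x):
    """x + x: shift left by one, the top bit is dropped."""
    return add(x, x)


if __name__ == "__main__":
    for x in (0, 1, 0x1234, 0x8000, MASK):
        assert bitwise_not(x) == x ^ MASK
        assert shift_left(x) == (x << 1) & MASK
    assert boolean_not(0) == 1 and boolean_not(1) == 0
    print("identities hold for the sampled values")
```

With a subtract and a cheap way to load 1 (or -1), a separate negating logic instruction is therefore not strictly required, at the cost of a few extra instructions per negate.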
|
pagercam wrote: > This reminds me a lot of the Transputer instruction set > which used a 4 bit operation with 4 bit data field take > a look at the architecture it might give you some ideas > and suggestions. Yeah, I remember it (having actually designed Transputer-based systems back in those days;). > Great architecure too bad it didn't survive. Well, they priced their chips way out there (esp. the T800 series with floating point), so people weren't too eager to buy them :-/. Also, the stack architecture didn't scale well to higher performance (and they tried to compete on performance). Too bad indeed, it was a nice design... - Reinoud |
|
--- In fpga-cpu@y..., Reinoud <dus@w...> wrote: > > Hi all, > > I've made an attempt at designing an efficient architecture for very > small implementations. The point of the architecture is the > combination of small cores and small code size. From the spec: A true test of the architecture would be to implement a "real" app on the processor and compare that to an equivalent app on another processor. You can make a pretty decent small RISC in about 150 LUTs, less if things are done serially. I think this would be hard to beat. Rob |
|
Ben Franchuk wrote: > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec. > > One thought while reading , is byte size really that important? > Character data generally is the most byte sized data around than > small constants, and characters are heading towards a 16+ bit encoding. Where does it say that byte size is important? - Reinoud |
|
rtfinch35 wrote: > A true test of the architecture would be to implement a "real" app Of course. > on the processor and compare that to an equivalent app on another > processor. I think you should compare that to an equivalent app on another processor *with the same core size*, or even better, compare the combined cost of both core and memory for the application. > You can make a pretty decent small RISC an about 150 LUTs, > less if things are done serially. I think this would be hard to beat. Well, that depends on the application size and the efficiency of the ISA. I'm trying to optimise total cost (core + memory) for small applications (maybe I should have stated this more clearly). For larger applications, larger cores with better code density often make more sense. Code size can practically always be improved by spending more on instruction decode (e.g. use compression). Sometimes, a plain RISC will make most sense, but that's usually when performance matters more than code size. BTW, 150 LUTs (LCs?) are a lot. You may be able to fit an entire serial (NANO) core and all the memory for a small control application in there (with Xilinx LUTs that is). - Reinoud |
|
Reinoud wrote: > > Ben Franchuk wrote: > > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec. > > > > One thought while reading , is byte size really that important? > > Character data generally is the most byte sized data around than > > small constants, and characters are heading towards a 16+ bit encoding. > > Where does it say that byte size is important? > > - Reinoud You are right in this case for the NANO architecture. However, 4-bit wide memory is hard to find. The big test is indeed writing programs for the CPU. How many Turing machine programs have you written? I find I spend a lot of time revising my CPU because I want to make it easy to write programs and still not have too complex hardware. 50 pages of schematics is more than ample for my CPU, and my FPGA is 98% full. I expect that there are thresholds of logical mass that define the power of a computer system. Memory width, addressing range, instruction architecture and speed are all tightly bound. A computer system design needs to look at all aspects of the system, and it is the weakest link that slows the system down. Historically you had 1) large word length serial, parallel processors >= 28 bits with about 8 instructions and tiny memory. 2) Smaller word sized computers >= 12 bits with paged and indirect memory with 4/8K word memory. 3) 4 & 8 bit controllers with up to 64k of memory. 4) 16/8 bit machines with 64kb data and 64kb code space - PDP-11 - 8086. 5) Big machines with few registers. 6) Load/store architecture whose internal design is looking a lot like #1. Is the wheel of computers going around again? The NANO instruction set looks to place it around 2 and 3 for design, thus hinting that most programs would be under 64K words long. -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |
|
|
|
Reinoud wrote: > Well, that depends on the application size and the efficiency of the > ISA. I'm trying to optimise total cost (core + memory) for small > applications (maybe I should have stated this more clearly). For > larger applications, larger cores with better code density often make > more sense. Code size can practically always be improved by spending > more on instruction decode (e.g. use compression). Sometimes, a > plain RISC will make most sense, but that's usually when performance > matters more than code size. > > BTW, 150 LUTs (LCs?) are a lot. You may be able to fit an entire > serial (NANO) core and all the memory for a small control application > in there (with Xilinx LUTs that is). The real test would be on several different FPGA architectures. A lot of tiny RISC FPGA designs would not be so tiny if they did not use the two-port RAM and 3-state lines. Give me good old real gate count ... ??? 2 input ands ... ??? 3 input nors and so forth. Btw, is NANO-NANO a computer from the planet Ork? :) -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |
|
Ben Franchuk wrote: > > Ben Franchuk wrote: > > > > See http://ce.et.tudelft.nl/~reinoud/nano/ for the full new spec. > > > > > > One thought while reading , is byte size really that important? > > > Character data generally is the most byte sized data around than > > > small constants, and characters are heading towards a 16+ bit encoding. > > > > Where does it say that byte size is important? > > > > - Reinoud > > You are right in this case for the NANO architecture. How ever 4 bit > wide memory is hard to find. Where does it say that 4-bit memory is used? - Reinoud |
|
Tommy Thorn wrote: > > --- Ben Franchuk <> wrote: > > The big test is indeed writing programs > > for the cpu. How many Turning machine programs > > have you written? I find I spend a lot of time > > revising my cpu because I want to make > > it easy to write programs and still not have too > > complex hardware. > > Have you considered writing a simulator first > and experimenting with it? It is much easier to > write. Experimenting before getting too deep > into the FPGA code could save you some nasty > surprises. > > /Tommy Thanks, but I have all the nasty surprises with Intel and Microsoft that I need. I have had a simulator, an assembler and a self-compiling C compiler for a long time now. The problem is that the 'small C compiler' does not have structures or good code generation. The later versions of 'small C' do not compile under Microsoft's C compiler, thus I can't port them. Most of the upgrades over the last few weeks have been about what last-minute features I can add at low cost, or stuff that has to do with physical design, like control signals and timing. I chose to debug the FPGA with the traditional lights and switches rather than a software gate simulation, and did find several problems like open wires or stupid logic bugs like the polarity of the RS232 start bit being wrong. I debugged the UART module, then the bootstrap setup, then the CPU. It is just final testing that needs to be done now. -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |