This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
Hi, I'm working on a 24-bit RISC processor that fits in an XC4010: 32 regs, 32-word I-cache, paging MMU. If anyone's interested I'll be posting my progress at: www3.sympatico.ca/robfinch/Sparrow/SparrowTOC.html
Rob, thank you for writing your web site and sharing your work-in-progress with us. I added a link to the Sparrow web pages from www.fpgacpu.org/links.html.

I enjoyed reading about your design, particularly your MMU and I-cache designs, which both seem solid and well thought out. I concur that an I-cache is required to avoid being bottlenecked on memory as processors get much above 33 MHz or so. As for the MMU, I agree that a simple address mapping table suffices for small address space systems. What do you plan to do on a read or write fault? Will you save the offending access away in a special register, jump to an exception handler, and recover, or will you terminate the offending process?

I also admired the spirit of the work: searching for a feasible implementation within the confines of limited hardware resources. Such tangible constraints inspire one to get right to the heart of the matter, and leave out the unnecessary frills.

By the way, most MMUs combine address translation and access checking (page protection). These are separable facilities. Consider a system without address mapping, but with page protection (e.g. one bit per page). This can make good sense for an embedded system where there is no secondary backing store, and hence no after-the-fact reorganization/reassignment of physical address space.

I have also been designing a new 24-bit instruction word, 16- or 32-bit data word processor, which will complement xr16/xr32. The general approach is the fruit of a discussion with Mike Butts some time ago, which I hint at early on in www.fpgacpu.org/xsoc2/log.html.

To take advantage of the mass of available "open source"/"free software", not to mention pre-existing C runtime libraries, you need a GCC tool chain. One approach is to implement an existing instruction set architecture. Another is to design a new FPGA-optimized ISA, then port all of GCC, binutils, gdb, C runtime libraries, etc. That's a lot of work.
A third approach is to adopt some pre-existing instruction set architecture's GAS assembly format, then use a special GAS port to cross-assemble to your FPGA-optimized ISA. This allows you to reuse existing runtime library source code, including assembler assist code, and to finesse much of the GCC porting (and maintenance) work.

This approach needs a generic FPGA RISC with >= 32 registers. This is a tight fit for a 16-bit instruction word machine. For example, you certainly need at least a 4-bit opcode. For the load/store instructions lw lh lb sw sh sb you will want a 5-bit dest reg, a 5-bit base reg, and at least 4 bits of offset (particularly when addressing locals in a stack frame). That's 18 bits! For a 32-bit register machine, you also require a simple mechanism to build 32-bit constants.

In a 16-bit instruction word, there are some promising ways to shoehorn 18 bits into 16, for example, using a 2-operand ISA, or using a 2- or 3-bit field to refer to one of 4 or 8 special-use or most-recently-stored (or most-recently-referenced) registers, but it's still a tight squeeze. In some cases, you cannot avoid a 2-instruction sequence. This was fine for the space-constrained xr16, but now that our aspirations turn to high performance implementations, it is important that each instruction word encode a full and natural (pipelineable) amount of work, including a larger immediate constant. 16-bit instructions don't suffice.

Therefore, I have been designing a 24-bit instruction word, 16- or 32-bit data word machine, and also a 24-bit instruction fetch unit that reads and aligns instructions obtained from a 32-bit-wide Virtex block RAM I-cache.
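To make the bit budget above concrete, here is a minimal sketch in C that simply adds up the field widths quoted in the text. The function names `load_store_bits` and `spare_bits` are hypothetical, introduced only for illustration.

```c
/* Hypothetical helper: bits consumed by a load/store encoding that has
   an opcode, a destination register, a base register, and an offset. */
static unsigned load_store_bits(unsigned opcode_bits,
                                unsigned reg_bits,
                                unsigned offset_bits)
{
    /* Two register fields: destination and base. */
    return opcode_bits + 2u * reg_bits + offset_bits;
}

/* Hypothetical helper: bits left over (negative means it does not fit). */
static int spare_bits(unsigned word_bits, unsigned needed_bits)
{
    return (int)word_bits - (int)needed_bits;
}
```

With a 4-bit opcode, 5-bit register fields, and a 4-bit offset, `load_store_bits(4, 5, 4)` is 18, so `spare_bits(16, 18)` is -2 (two bits short in a 16-bit word), while `spare_bits(24, 18)` is 6 in a 24-bit word.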
Here's a quick overview of the current draft instruction set architecture:

/* GR2000 instruction set architecture

   GR204x - 32 16-bit registers
   GR205x - 32 32-bit registers

   union I {
     struct RRfR { op:5, ra:5, rb:5, fn:4, rd:5 }   rrfr;
     struct RRfI { op:5, ra:5, rb:5, fn:4, imm5:5 } rrfi;
     struct RfI  { op:5, ra:5, fn:4, imm10:10 }     rfi;
     struct RRI  { op:5, ra:5, rb:5, imm9:9 }       rri;
     struct I19  { op:5, imm19:19 }                 i19;
     struct I23  { i:1, imm23:23 }                  i23;
   };

   Instructions

   op fn fmt  usage             semantics
   00 0  RRfR add rd,ra,rb      rd = ra + rb
   00 1  RRfR sub rd,ra,rb      rd = ra - rb
   00 2  RRfR lt rd,ra,rb       rd = ra < rb
   00 3  RRfR lts rd,ra,rb      rd = (signed)ra < (signed)rb
   00 4  RRfR and rd,ra,rb      rd = ra & rb
   00 5  RRfR or rd,ra,rb       rd = ra | rb
   00 6  RRfR xor rd,ra,rb      rd = ra ^ rb
   00 7  RRfR nor rd,ra,rb      rd = ra ~| rb
   00 8  RRfR sl rd,ra,rb       rd = ra << rb
   00 9  RRfR sr rd,ra,rb       rd = ra >> rb
   00 A  RRfR srs rd,ra,rb      rd = (signed)ra >> rb
   00 B  RRfR sxb rd,ra         rd = sext(ra[7:0])
   00 C  RRfR sxw rd,ra         rd = sext(ra[15:0])

   01 2  RRfI lti rb,ra,imm5    rb = ra < imm5
   01 3  RRfI ltsi rb,ra,imm5   rb = (signed)ra < imm5
   01 4  RRfI andi rb,ra,imm5   rb = ra & imm5
   01 5  RRfI ori rb,ra,imm5    rb = ra | imm5
   01 6  RRfI xori rb,ra,imm5   rb = ra ^ imm5
   01 7  RRfI nori rb,ra,imm5   rb = ra ~| imm5
   01 8  RRfI sli rb,ra,imm5    rb = ra << imm5
   01 9  RRfI sri rb,ra,imm5    rb = ra >> imm5
   01 A  RRfI srsi rb,ra,imm5   rb = (signed)ra >> imm5

   02 2  RfI  ltj ra,imm10      ra = ra < imm10
   02 3  RfI  ltsj ra,imm10     ra = (signed)ra < imm10
   02 4  RfI  andj ra,imm10     ra = ra & imm10
   02 5  RfI  orj ra,imm10      ra = ra | imm10
   02 6  RfI  xorj ra,imm10     ra = ra ^ imm10
   02 7  RfI  norj ra,imm10     ra = ra ~| imm10
   02 8  RfI  slj ra,imm10      ra = ra << imm10
   02 9  RfI  srj ra,imm10      ra = ra >> imm10
   02 A  RfI  srsj ra,imm10     ra = (signed)ra >> imm10

   03 -  RRI  addi rb,ra,imm9   rb = ra + sext(imm9)
   04 -  RRI  lw rb,imm9(ra)    rb = mem.word[ra+sext(imm9)] ; GR205x only
   05 -  RRI  lh rb,imm9(ra)    rb = mem.half[ra+sext(imm9)] ; GR205x only
   06 -  RRI  lb rb,imm9(ra)    rb = mem.byte[ra+sext(imm9)]
   07 -  RRI  sw rb,imm9(ra)    mem.word[ra+sext(imm9)] = rb
   08 -  RRI  sh rb,imm9(ra)    mem.half[ra+sext(imm9)] = rb[15:0]
   09 -  RRI  sb rb,imm9(ra)    mem.byte[ra+sext(imm9)] = rb[7:0]
   0A -  I19  call imm19        r31 = pc, pc = pc[31:19] || imm19
   0B -  RRI  jal rb,imm9(ra)   rb = pc, pc = ra + sext(imm9)
   0C -  RRI  be rb,ra,L        if (ra == rb) pc += L
   0D -  RRI  bne rb,ra,L       if (ra != rb) pc += L
   0E                           reserved
   0F                           reserved
   1x -  I23  imm imm23         imm_next[31:9] = imm23
*/

As with xr processors, there is an immediate prefix instruction, which in this case establishes the upper 23 bits of the immediate constant in the RRI, RRfI, or RfI format instruction that immediately follows. (Given the RfI format instructions, the RRfI instructions may be removed. This awaits some cross-assembler data gathering.) Unlike xr processors, this machine has neither condition codes nor conditional-branch-prefix interlocks.

Implementation-wise, the downside of fetching a 24-bit instruction word from a 32-bit-wide I-cache is that branch targets at addresses 2 and 3 (mod 4) will require two cycles to fetch and tag check the parts of the instruction that span cache lines. This can be avoided by fetching two separate 16-bit parcels from PC&~1 and (PC&~1)+2, but that adds another adder delay, and complexity.

Alternatively, you can build a 24-bit-wide cache and then handle 32->24 bit cache refills outside the processor core. In that case, instruction addresses are encoded as instruction numbers (0, 1, 2, ...) to make I-cache lookups easy, but outside the core, the I-cache refill unit must translate instruction addresses to byte addresses (e.g. IA + (IA<<1), that is, 3*IA) and do partial word fetch-and-align.

I expect to refine the ISA as the cross-assembler strategy is refined (which may never occur, of course, since this work competes for time and attention with other projects). If anyone has any comments or criticisms of the above, please share them with the list.

Jan Gray
Gray Research LLC
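As an illustration of the refill-side address translation and fetch-and-align step described above, here is a minimal software sketch in C. It is a model only, not the hardware design: the function names `insn_byte_addr`, `spans_two_words`, and `fetch_insn` are hypothetical, and it assumes little-endian byte order within each 32-bit memory word, which the post does not specify.

```c
#include <stdint.h>

/* Byte address of instruction number ia: each instruction is 3 bytes,
   so the byte address is 3*ia, computed as ia + (ia << 1). */
static uint32_t insn_byte_addr(uint32_t ia)
{
    return ia + (ia << 1);
}

/* An instruction starting at byte offset 2 or 3 (mod 4) straddles
   two 32-bit words. */
static int spans_two_words(uint32_t byte_addr)
{
    return (byte_addr & 3u) >= 2u;
}

/* Fetch-and-align: gather the 3 bytes of instruction ia from a
   32-bit-wide memory. ASSUMPTION: little-endian byte order in words.
   The caller must ensure mem[word + 1] exists when the instruction
   straddles two words. */
static uint32_t fetch_insn(const uint32_t *mem, uint32_t ia)
{
    uint32_t ba    = insn_byte_addr(ia);
    uint32_t word  = ba >> 2;            /* index of the first word     */
    unsigned shift = (ba & 3u) * 8u;     /* bit offset within that word */
    uint32_t lo    = mem[word] >> shift;
    uint32_t hi    = 0;

    if (spans_two_words(ba))             /* shift is 16 or 24 here      */
        hi = mem[word + 1] << (32u - shift);

    return (lo | hi) & 0xFFFFFFu;        /* keep 24 bits                */
}
```

For example, if memory holds the bytes 0x00, 0x01, ..., 0x0B in order, instruction 1 starts at byte address 3, straddles the first two words, and is assembled from bytes 3, 4, and 5.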