This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

24 bit RISC - Rob Finch - Sep 15 8:39:00 2000


Hi, I'm working on a 24-bit RISC processor that fits in an XC4010, with 32
regs, a 32-word I-cache, and a paging MMU. If anyone's interested I'll be
posting my progress at:
www3.sympatico.ca/robfinch/Sparrow/SparrowTOC.html






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

24-bit RISCs - Jan Gray - Sep 15 17:48:00 2000

Rob, thank you for writing your web site and sharing your work-in-progress
with us. I added a link to the Sparrow web pages from
www.fpgacpu.org/links.html.

I enjoyed reading about your design, particularly your MMU and I-cache
designs, which both seem solid and well thought out. I concur that an
I-cache is required to avoid being bottlenecked on memory as processors get
much above 33 MHz or so.

As for the MMU, I agree that a simple address mapping table suffices for
small address space systems. What do you plan to do on a read or write
fault? Will you save the offending access away in a special register and
jump to an exception handler and recover, or terminate the offending
process?

I also admired the spirit of the work -- of searching for a feasible
implementation within the confines of limited hardware resources. Such
tangible constraints inspire one to get right to the heart of the matter,
and leave out the unnecessary frills.

By the way, most MMUs combine address translation and access checking (page
protection). These are separable facilities. Consider a system w/o address
mapping, but with page protection (e.g. one bit per page). This can make
good sense for an embedded system where there is no secondary backing store,
and hence no after-the-fact reorganization/reassignment of physical address
space.

I have also been designing a new processor with a 24-bit instruction word
and a 16- or 32-bit data word, which will complement xr16/xr32. The general
approach is the fruit of a discussion with Mike Butts some time ago, which I
hint at early on in www.fpgacpu.org/xsoc2/log.html.
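
Returning to the page-protection point above: here is a minimal C sketch of access checking without address translation, using one write-enable bit per page. The page size, the page count, and the names `set_writable` and `write_allowed` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u    /* hypothetical: 4 KB pages */
#define NUM_PAGES  4096u  /* hypothetical: 24-bit physical address space */

/* One write-enable bit per page, packed 32 pages per word. */
static uint32_t writable[NUM_PAGES / 32u];

/* Mark a page writable (on != 0) or read-only (on == 0). */
void set_writable(uint32_t page, int on) {
    assert(page < NUM_PAGES);
    if (on) writable[page / 32u] |=  (1u << (page % 32u));
    else    writable[page / 32u] &= ~(1u << (page % 32u));
}

/* Returns nonzero if a write to addr is permitted. The address is used
   untranslated; only the access is checked. */
int write_allowed(uint32_t addr) {
    uint32_t page = addr >> PAGE_SHIFT;
    assert(page < NUM_PAGES);
    return (int)((writable[page / 32u] >> (page % 32u)) & 1u);
}
```

In hardware this would amount to a small RAM indexed by the upper address bits, looked up in parallel with the access.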

To take advantage of the mass of available "open source"/"free software",
not to mention pre-existing C runtime libraries, you need a GCC tool chain.
One approach is to implement an existing instruction set architecture.
Another is to design a new FPGA-optimized ISA, then port all of GCC,
binutils, gdb, C runtime libraries, etc. That's a lot of work. A third
approach is to adopt some pre-existing instruction set architecture's GAS
assembly format, then use a special GAS port to cross-assemble to your
FPGA-optimized ISA. This allows you to reuse existing runtime library
source code, including assembler assists, and to finesse much of the GCC
porting (and maintenance) work.

This approach needs a generic FPGA RISC with >= 32 registers. This is a
tight fit for a 16-bit instruction word machine. For example, you certainly
need at least a 4-bit opcode. For the load/store instructions lw lh lb sw
sh sb you will want a 5-bit dest reg, 5-bit base reg, and at least 4 bits of
offset (particularly when addressing locals in a stack frame). That's 18
bits! For a 32-bit register machine, you also require a simple mechanism to
build 32-bit constants.
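
The bit budget above can be written out as a small C sketch. The field widths are the minimums quoted in the paragraph; the function names are hypothetical.

```c
#include <assert.h>

/* Minimum field widths for a load/store, as estimated above. */
enum {
    OPCODE_BITS = 4,  /* "at least a 4-bit opcode" */
    DEST_BITS   = 5,  /* one of 32 registers */
    BASE_BITS   = 5,  /* one of 32 registers */
    OFFSET_BITS = 4   /* minimum useful stack-frame offset */
};

/* Total bits a load/store instruction needs. */
int load_store_bits(void) {
    return OPCODE_BITS + DEST_BITS + BASE_BITS + OFFSET_BITS;
}

/* Bits by which the format overflows an instruction word of the given
   width; zero or negative means it fits. */
int overflow_bits(int word_bits) {
    assert(word_bits > 0);
    return load_store_bits() - word_bits;
}
```

For a 16-bit word the overflow is 2 bits; for a 24-bit word there are 6 bits to spare, which is where a larger offset field can go.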

In a 16-bit instruction word, there are some promising ways to shoehorn 18
bits into 16, for example, using a 2-operand ISA or using a 2- or 3-bit
field to refer to one of 4 or 8 special-use- or most-recently-stored
(or -referenced) registers, but it's still a tight squeeze. In some cases,
you cannot avoid a 2 instruction sequence. This was fine for the
space-constrained xr16 but now that our aspirations turn to high performance
implementations, it is important that the instruction word encode a full and
natural (pipelineable) amount of work in each instruction, including a
larger immediate constant. 16-bit instructions don't suffice.

Therefore, I have been designing a 24-bit instruction word, 16- or 32-bit
data word machine, and also a 24-bit instruction fetch unit that reads and
aligns instructions obtained from a 32-bit-wide Virtex block RAM I-cache.
Here's a quick overview of the current draft instruction set architecture:

/*
GR2000 instruction set architecture
GR204x - 32 16-bit registers
GR205x - 32 32-bit registers

union I {
    struct RRfR { op:5, ra:5, rb:5, fn:4, rd:5 } rrfr;
    struct RRfI { op:5, ra:5, rb:5, fn:4, imm5:5 } rrfi;
    struct RfI { op:5, ra:5, fn:4, imm10:10 } rfi;
    struct RRI { op:5, ra:5, rb:5, imm9:9 } rri;
    struct I19 { op:5, imm19:19 } i19;
    struct I23 { i:1, imm23:23 } i23;
};

Instructions

op fn fmt   usage              semantics

00 0  RRfR  add  rd,ra,rb      rd = ra + rb
00 1  RRfR  sub  rd,ra,rb      rd = ra - rb
00 2  RRfR  lt   rd,ra,rb      rd = ra < rb
00 3  RRfR  lts  rd,ra,rb      rd = (signed)ra < (signed)rb
00 4  RRfR  and  rd,ra,rb      rd = ra & rb
00 5  RRfR  or   rd,ra,rb      rd = ra | rb
00 6  RRfR  xor  rd,ra,rb      rd = ra ^ rb
00 7  RRfR  nor  rd,ra,rb      rd = ra ~| rb
00 8  RRfR  sl   rd,ra,rb      rd = ra << rb
00 9  RRfR  sr   rd,ra,rb      rd = ra >> rb
00 A  RRfR  srs  rd,ra,rb      rd = (signed)ra >> rb
00 B  RRfR  sxb  rd,ra         rd = sext(ra[7:0])
00 C  RRfR  sxw  rd,ra         rd = sext(ra[15:0])

01 2  RRfI  lti  rb,ra,imm5    rb = ra < imm5
01 3  RRfI  ltsi rb,ra,imm5    rb = (signed)ra < imm5
01 4  RRfI  andi rb,ra,imm5    rb = ra & imm5
01 5  RRfI  ori  rb,ra,imm5    rb = ra | imm5
01 6  RRfI  xori rb,ra,imm5    rb = ra ^ imm5
01 7  RRfI  nori rb,ra,imm5    rb = ra ~| imm5
01 8  RRfI  sli  rb,ra,imm5    rb = ra << imm5
01 9  RRfI  sri  rb,ra,imm5    rb = ra >> imm5
01 A  RRfI  srsi rb,ra,imm5    rb = (signed)ra >> imm5

02 2  RfI   ltj  ra,imm10      ra = ra < imm10
02 3  RfI   ltsj ra,imm10      ra = (signed)ra < imm10
02 4  RfI   andj ra,imm10      ra = ra & imm10
02 5  RfI   orj  ra,imm10      ra = ra | imm10
02 6  RfI   xorj ra,imm10      ra = ra ^ imm10
02 7  RfI   norj ra,imm10      ra = ra ~| imm10
02 8  RfI   slj  ra,imm10      ra = ra << imm10
02 9  RfI   srj  ra,imm10      ra = ra >> imm10
02 A  RfI   srsj ra,imm10      ra = (signed)ra >> imm10

03 -  RRI   addi rb,ra,imm9    rb = ra + sext(imm9)

04 -  RRI   lw   rb,imm9(ra)   rb = mem.word[ra+sext(imm9)] ; GR205x only
05 -  RRI   lh   rb,imm9(ra)   rb = mem.half[ra+sext(imm9)] ; GR205x only
06 -  RRI   lb   rb,imm9(ra)   rb = mem.byte[ra+sext(imm9)]
07 -  RRI   sw   rb,imm9(ra)   mem.word[ra+sext(imm9)] = rb
08 -  RRI   sh   rb,imm9(ra)   mem.half[ra+sext(imm9)] = rb[15:0]
09 -  RRI   sb   rb,imm9(ra)   mem.byte[ra+sext(imm9)] = rb[7:0]

0A -  I19   call imm19         r31 = pc, pc = pc[31:19] || imm19
0B -  RRI   jal  rb,imm9(ra)   rb = pc, pc = ra + sext(imm9)
0C -  RRI   be   rb,ra,L       if (ra == rb) pc += L
0D -  RRI   bne  rb,ra,L       if (ra != rb) pc += L
0E          reserved
0F          reserved

1x -  I23   imm  imm23         imm_next[31:9] = imm23
*/
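
To make the formats concrete, here is a hedged C sketch that packs and unpacks the RRfR format. The draft above gives field widths but not bit positions, so this assumes the fields are packed most-significant-first in the order listed (op in bits [23:19], rd in bits [4:0]). The names `field`, `decode_rrfr`, and `encode_rrfr` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of an RRfR-format instruction. */
typedef struct {
    uint32_t op, ra, rb, fn, rd;
} rrfr_t;

/* Extract `width` bits starting at bit `lsb` of a 24-bit instruction word. */
uint32_t field(uint32_t insn, unsigned lsb, unsigned width) {
    assert(lsb + width <= 24);
    return (insn >> lsb) & ((1u << width) - 1u);
}

/* ASSUMPTION: op:5 | ra:5 | rb:5 | fn:4 | rd:5, most significant first.
   The draft ISA does not state the bit order. */
rrfr_t decode_rrfr(uint32_t insn) {
    rrfr_t d;
    d.op = field(insn, 19, 5);
    d.ra = field(insn, 14, 5);
    d.rb = field(insn, 9, 5);
    d.fn = field(insn, 5, 4);
    d.rd = field(insn, 0, 5);
    return d;
}

/* Inverse of decode_rrfr under the same assumed layout. */
uint32_t encode_rrfr(rrfr_t d) {
    assert(d.op < 32 && d.ra < 32 && d.rb < 32 && d.fn < 16 && d.rd < 32);
    return (d.op << 19) | (d.ra << 14) | (d.rb << 9) | (d.fn << 5) | d.rd;
}
```

Under this assumed layout, `sub r3,r1,r2` (op 00, fn 1) encodes as 0x004423.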

As with xr processors, there is an immediate prefix instruction which in
this case establishes the upper 23 bits of the immediate constant in the
RRI, RRfI, or RfI format instruction that immediately follows. (Given the
RfI format instructions, the RRfI instructions may be removed. This awaits
some cross-assembler data gathering.)
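
A minimal C sketch of how the prefix might combine with an RRI instruction's imm9 field follows. It assumes that, with a prefix, imm23 supplies bits [31:9] and the 9-bit field supplies bits [8:0] unmodified, and that without a prefix imm9 is sign-extended as in the table. The names `sext9` and `rri_immediate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 9 bits of v to 32 bits. */
int32_t sext9(uint32_t v) {
    v &= 0x1FFu;
    return (int32_t)(v ^ 0x100u) - 0x100;
}

/* Form the 32-bit immediate operand of an RRI instruction.
   has_prefix: nonzero if the preceding instruction was `imm imm23`.
   ASSUMPTION: the prefix replaces sign extension; it does not add to it. */
uint32_t rri_immediate(int has_prefix, uint32_t imm23, uint32_t imm9) {
    assert(imm23 < (1u << 23) && imm9 < (1u << 9));
    if (has_prefix)
        return (imm23 << 9) | imm9;   /* imm_next[31:9] = imm23 */
    return (uint32_t)sext9(imm9);
}
```

For example, the constant 0x12345678 would be built from the prefix `imm 0x091A2B` followed by an instruction whose imm9 field is 0x078.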

Unlike xr processors, this machine has neither condition codes nor
conditional-branch-prefix interlocks.

Implementation-wise, the downside of fetching a 24-bit instruction word from
a 32-bit-wide I-cache is that branch targets at addresses 2 and 3 (mod 4)
will require two cycles to fetch and tag check the parts of the instruction
that span cache lines. This can be avoided by fetching two separate parcels
of 16 bits from PC&~1 and (PC&~1)+2, but that adds another adder delay and
complexity. Alternatively, you can build a 24-bit-wide cache and handle
32->24 bit cache refills outside the processor core. In that case,
instruction addresses are encoded as instruction numbers (0,1,2,...) to make
I-cache lookups easy, but outside the core the I-cache refill unit must
translate instruction addresses to byte addresses (e.g. IA + IA<<1) and do
partial-word fetch-and-align.
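
The address arithmetic above can be sketched in C as follows. This is an illustration under stated assumptions, not the actual fetch unit: it assumes little-endian byte order within 32-bit memory words, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of instruction number ia: 3*ia, computed as ia + (ia << 1). */
uint32_t insn_byte_addr(uint32_t ia) {
    assert(ia <= UINT32_MAX / 3u);
    return ia + (ia << 1);
}

/* Nonzero if the 3-byte instruction at byte_addr straddles two 32-bit
   words, i.e. its byte offset within the word is 2 or 3. */
int spans_two_words(uint32_t byte_addr) {
    return (byte_addr & 3u) >= 2u;
}

/* Align a 24-bit instruction out of the word holding its first byte
   (lo_word) and the following word (hi_word).
   ASSUMPTION: little-endian byte order within words. */
uint32_t extract_insn(uint32_t lo_word, uint32_t hi_word, uint32_t byte_addr) {
    unsigned shift = (byte_addr & 3u) * 8u;
    uint64_t both = ((uint64_t)hi_word << 32) | lo_word;
    return (uint32_t)(both >> shift) & 0xFFFFFFu;
}
```

Instructions at byte offsets 0 and 1 (mod 4) fit in a single word; those at offsets 2 and 3 need both words, which is the two-cycle case described above.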

I expect to refine the ISA as the cross-assembler strategy is refined --
which may never occur, of course -- this work competes for time and
attention with other projects.

If anyone has any comments or criticisms of the above, please share them
with the list.

Jan Gray
Gray Research LLC



