This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
YARD-1A processor:

I've been working on my own 32 bit RISC processor intermittently for the past year or so; it's not quite soup yet, but it's getting close. The current implementation is a very simple two operand, 2 stage pipeline, set up to use internal Xilinx FPGA RAM resources.

I've put a draft description of the processor at:

  ftp://members.aol.com/fpgastuff/yard-1a.zip

Its first "blink the LEDs" program booted in hardware last December, running in an XC4010E on an old Xilinx "FPGA Eval Board" ( 32 bit core, 64x16 ROM, 32x32 RAM, parallel I/O ).

The current target is an XC2S100; I'd ordered an Insight "Spartan II Development Board" at the beginning of July, and it arrived a few weeks ago; see:

  http://www.insight-electronics.com/xcellence/scalable/kit/spartan/index.html
  DS-XC2S100-BRD $125 USD

Other Stuff:

  Hill, Jouppi, Sohi, "Readings in Computer Architecture"
  ISBN 1-55860-539-8
  Nice collection of papers from 1964 to the present
  http://www.mkp.com/architecture-readings

  CGEN CPU tools generator
  http://sources.redhat.com/cgen

  RISC-8 PERL cross assembler
  http://www.geocities.com/microprocessors

Brian Davis
|
|
|
Brian, congratulations, nice work. Looks like you've been having lots of fun.

I read your yard-1a materials. Some comments and questions -- (apologies in advance, this was written quickly)

I like your use of the single bank of dual port RAM for the register file, which does constrain one to an instruction set with two register operands, and with no pipeline stage between the operand fetch register file read access and the execute stage result write-back. As you have found with your quite respectable "push-button" cycle time of 25 ns or so, this can still provide good performance in a small area.

I apologize in advance for yet another "I did one of those too", but I did one of those too, back in August. I wrote a quick and dirty non-pipelined RISC in about 170 lines of Verilog, slightly incomplete and no-doubt buggy, with the intention of polishing it up and turning it into a small, annotated design example on the web site, to demonstrate just how simple these things can be. Not coincidentally, I'm also seeing performance of about 35-40 MHz in a Virtex-4, but with load/store instructions (as usual) needing at least a second clock cycle.

Unlike your design, this one uses an on-chip I-cache and so doesn't have an ifetch stage (on a cache hit) -- so no branch delay issues. It's pretty simple. Here is the relevant I-cache code excerpt:

  wire [15:0] itag;   // a net, since it is driven by the block RAM's DOB output
  wire [N:0] pc_nxt = pipe_ce ? (`JL ? sum
                                 : (`Bx&branch) ? (pc + {brdisp,1'b0})
                                 : (pc + 2))
                              : pc;

  `ifdef GR14 /* 16-bits */
  assign pipe_ce = !(rst_sync || (itag != pc[15:8]));
  RAMB4_S16_S16 icache(
    .WEA(icache_we), .ENA(1'b1), .RSTA(rst), .CLKA(clk),
    .ADDRA({1'b0,pc_nxt[7:1]}), .DIA(di[15:0]), .DOA(ir),
    .WEB(icache_we), .ENB(1'b1), .RSTB(rst), .CLKB(clk),
    .ADDRB({1'b1,pc_nxt[7:1]}), .DIB({8'b0,pc[15:8]}), .DOB(itag));
  `else /* 32-bits */
  assign pipe_ce = !(rst_sync || (itag != pc[23:8]));
  RAMB4_S16_S16 icache(
    .WEA(icache_we), .ENA(1'b1), .RSTA(rst), .CLKA(clk),
    .ADDRA({1'b0,pc_nxt[7:1]}), .DIA(di[15:0]), .DOA(ir),
    .WEB(icache_we), .ENB(1'b1), .RSTB(rst), .CLKB(clk),
    .ADDRB({1'b1,pc_nxt[7:1]}), .DIB(pc[23:8]), .DOB(itag));
  `endif

Here we use one dual-port block RAM to implement a 128x16 I-cache instruction memory with a 128x16 I-cache tag memory. Then pipe_ce is false if itag != pc[15:8] (16-bits) or itag != pc[23:8] (32-bits). The processor will spin fetching the same instruction until an external agent writes the new instruction data+tag into the I-cache at address pc_nxt[7:1].

More on that work sooner or later.

You are using the Insight 2S100 board. I am currently using an XESS XSV-300 for my Virtex work, not inexpensive, and I know of at least one other designer on this list who is using the same inexpensive board that you are. Perhaps it would make a more accessible platform for future projects in the Virtex space. I'm ordering one in the morning.

One downside of this board is that the more budget-constrained among us will not be able to target it, since even the rumored-to-be-forthcoming new Student Edition, which allegedly targets the V50 (and hence 2S50), probably won't be able to target the 2S100. I'll write the Xilinx University Program folks and see what they are up to.

The second downside of this board is it has no built-in RAM.
It would be nice to design a simple anybody-can-solder-it expansion board to plug into the 2S100 board's prototyping area, to provide RAM, VGA port, and a few other niceties.

But back to your work.

I like your nullable branch delay slots.

When you state "full implementation of SHIFT" -- are you planning a multicycle shifter or a full barrel shifter or something in-between? A full barrel shifter is quite area intensive. Oops, never mind, I see, 1,2,4,8,16.

For external memory, you state
" - extend the pipeline from 2 to 3 stages ( would need register forwarding HW, control logic rework )"
But if you insert a pipeline register between register file read and write accesses you may have problems with your single bank of dual-port RAM, right? It may be simpler to stall the pipeline during the memory access than to build a MEM stage.

I like your simulation framework a lot, and it's good to know there's an adequate, free VHDL simulator out there, too.

I like your immediate operand encoding to get those bit-masks, etc. Reminiscent of (but different than) ARM (IIRC). A while back I looked at some frequently used wide constants (e.g. 0x000000FF, 0x0000FFFF, 0x00010000, 0x10000000, and so forth), and I used to kick around the idea of a loadable "immediate constant register file" for compact and quick access to your favorite 16 or 32 immediate constants. It could be loaded once at system init time, or once per dynamic library, or even once per function (if restored). Of course, this is hardly an improvement over a larger regular register file! The funny thing is, in FPGAs, an n-bit 2-1 mux is often just as expensive as a 16xn register file.

I note your hardware call stack, which will certainly improve call overhead. But in my experience more time is spent saving and reloading live value registers across calls than the return address. What happens on overflow? :-)

For your decision to allow base+offset addressing on load/store instructions, I debated the exact same point with myself when I was doing this quick&dirty non-pipelined RISC last month. Since the Virtex block RAM is synchronous for read and write, you have to have the address prepared before the data block RAM clock edge. If you want loads to occur in one cycle, you have no choice but to present the load address to the block RAM on the clock falling edge (if the rest of the design is clocked on the rising edge). If you do that, the 16- or 32-bit adder delay to compute base register+offset must occur in one half cycle, and the min cycle time will be loooooong. If, on the other hand, you stall the processor for one cycle (so that loads take two cycles) then the register+offset add should not affect the cycle time (because it is basically identical to the add instruction critical path). That's what I did.

Re: your immediate instruction approach, I believe Philip Freidin's RISC4005 had a similar instruction to write an immediate (literal) value into a register. (In his case, a general purpose register, right Philip?) I like what you did there, because now you can have a store instruction which sources two registers (the data-to-store register and the base-register) plus your literal in the SR register.

Re: Sign/zero-extension on loads. j32 and early xr16's had sign- and zero-extension, but the extra delay needed to drive the load-data-byte's MSB onto other data bus lines was proving to hurt the xr16 cycle time, so out went LBS!

Re: no CCs: The aspect of xr16 I am most unhappy with is the condition codes and the associated interlocks, which I used against my better judgement. I won't be fooled again.

Re: skip instructions. IIRC that's almost exactly what RISC4005 did.

FF0, FF1, CNT0, CNT1, great! Did you put them in for show or do you have an application that will use them? :-)

About ten years ago (if I remember correctly) there was a 40 MHz CMOS RISC processor called the GE? RPM-40 designed by Dennis O'Connor. It was a 16-bit instruction word RISC, tight on opcode space, that also had RSUB, etc. It would be fun to look that up and see if there are any lessons for us opcode-space challenged. Alas, most FPGA CPU design issues were faced by regular full-custom CPU designers ten years ago.

I have the Readings in Computer Architecture book, it's excellent. This inspires me to put up a web page of recommended books, etc.

I saw some announcements on CGEN. It seems very promising. Have you read its docs and/or used it? Does it make porting binutils a snap?

Thank you for taking the time to share your interesting work with us.

Jan Gray
Gray Research LLC
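[Editor's sketch] The I-cache hit test described in the message above (index with pc[7:1], compare the stored tag against pc[23:8], spin until an external agent fills the line) can be modelled in a few lines of Python. This is only a behavioral sketch of the 32-bit variant: the class and method names (ICache, lookup, fill) are hypothetical, the "never filled" state is modelled with None as an assumption, and the pc_nxt/pc read timing of the real block RAM is ignored.

```python
class ICache:
    """Hypothetical behavioral model of a 128-entry direct-mapped I-cache."""

    LINES = 128  # one 16-bit instruction per entry, indexed by pc[7:1]

    def __init__(self):
        self.insn = [0] * self.LINES
        self.tag = [None] * self.LINES   # None = never filled (modelling assumption)

    @staticmethod
    def index(pc):
        return (pc >> 1) & 0x7F          # pc[7:1]

    @staticmethod
    def tag_of(pc):
        return (pc >> 8) & 0xFFFF        # pc[23:8]

    def lookup(self, pc):
        """Return (hit, insn); 'hit' plays the role of pipe_ce."""
        i = self.index(pc)
        hit = self.tag[i] == self.tag_of(pc)
        return hit, (self.insn[i] if hit else None)

    def fill(self, pc, word):
        """The external agent writes instruction data and tag for this pc."""
        i = self.index(pc)
        self.insn[i] = word & 0xFFFF
        self.tag[i] = self.tag_of(pc)


cache = ICache()
print(cache.lookup(0x001234))   # miss: the processor would spin here
cache.fill(0x001234, 0xBEEF)
print(cache.lookup(0x001234))   # hit after the fill
```

Two addresses that share pc[7:1] but differ in pc[23:8] evict each other, which is the usual cost of a direct-mapped organization.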
|
|
|
On Wed, 20 Sep 2000 19:56:23 -0700, Jan Gray wrote:

>Brian, congratulations, nice work. Looks like you've been having lots of
>fun.

Yep lots of it :-)

> .....
>But back to your work.

>Re: your immediate instruction approach, I believe Philip Freidin's RISC4005
>had a similar instruction to write an immediate (literal) value into a
>register. (In his case, a general purpose register, right Philip?) I like
>what you did there, because now you can have a store instruction which
>sources two registers (the data-to-store register and the base-register)
>plus your literal in the SR register.

So I had to go dig up the archive for the RISC4005 (1991/1992 vintage).

What I had back then, and I am still happy with it, was two instructions, each with 4 bits opcode, 4 bits dest reg, and 8 bits constant. They were

        constlo   Rn,0xAA
and
        swapconst Rn,0xBB

constlo always zeroed the upper byte of Rn. swapconst copied the low half of Rn to the high half, and loaded the low half with the new constant.

From my macro definition file (the assembler I wrote had a very powerful macro facility) comes the following (slightly trimmed). The RISC4005 had condition codes, and 2 delay slots after branches, due to it being a 4 stage piped processor.

Look at the macro "CONST", which loads a 16 bit constant.

;
;       STDMACS.S       Standard MACROS
;
;
;       Last edit       What
;       04-Jan-92       Initial Creation.
;
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
nop     .macro
        skip_never
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
push    .macro  reg             ; push register onto stack
        dec     r15,r15         ; predecrement
        st      reg,r15
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
pop     .macro  reg             ; pop register from stack
        ld      reg,r15
        inc     r15,r15
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
const   .macro  reg,val
        constlo reg,( val ) >> 8
        swapconst reg,val
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
set_c   .macro  temp_reg
        constlo temp_reg,0x01
        srl     temp_reg,temp_reg
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
clear_c .macro  temp_reg
        constlo temp_reg,0x00
        srl     temp_reg,temp_reg
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mov     .macro  rdest,rsrc
        or      rdest,rsrc,rsrc
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
far_call .macro dest
        const   r0,dest
        call    r0,r0
        .endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

>Re: no CCs: The aspect of xr16 I am most unhappy with is the condition codes
>and the associated interlocks, which I used despite my better judgement. I
>won't be fooled again.

As an old grey haired architect I could go on for way too long about the fights we had over this for the 29000, 15 years ago. We resolved, and I still believe we were right, that CC's suck. The 29K let you do conditional tests, and they stored the true/false result in the MSB of the dest reg. We then just had a simple jump true/false, that looked at a register's value. Made compound conditionals very easy: calc all the primitive conditions, then just do AND/OR/XOR/NOT ops on the registers that held the conditional results.

And then, because we were realists, we added CC's to the architecture, because we expected to run instruction set emulations of 68K and i86 code, and the emulators would benefit from a real CC register.

>Re: skip instructions. IIRC that's almost exactly what RISC4005 did.

So given the lessons of the 29K, why didn't the RISC4005 do this too? Not enough opcode bits to specify a dest reg for the condition to be stored in. The RISC4005 used a CC register (I had no plans for a superscalar version, which is where CC's really bite you in the a*s). Then you can have loads of skip instructions, because there are no reg fields. I think RISC4005 had 48 different SKIP instructions, that tested all true and false cases of every interesting combination of bits in the CC.

A really neat capability of RISC4005 (that I should have patented, because no-one before or after me has thought of it) was the stunning additional instruction group: SKIP2, of which I had 48 as well. It skipped 2 instructions. Which is great for double precision arith, because you can skip an ADD, and an ADDC (add with carry) with one skip instruction. This really helps in multiply and divide routines.

Keep having fun.

Philip Freidin

========================
Philip Freidin

Mindspring that acquired Earthlink that acquired Netcom has decided to kill off all Netcom Shell accounts, including mine.
My new primary email address is
Please update your address book, sorry for the inconvenience
=================
Philip Freidin
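[Editor's sketch] The constlo/swapconst pair and the CONST macro described above can be checked with a small Python model. It assumes 16-bit registers (implied by "upper byte" and "16 bit constant") and that the assembler truncates each operand to the 8-bit constant field; the function names simply mirror the mnemonics and are not taken from any RISC4005 tool.

```python
def constlo(value):
    """constlo Rn,value: load the low byte, zero the upper byte of Rn."""
    return value & 0xFF


def swapconst(reg, value):
    """swapconst Rn,value: old low byte moves to the high byte,
    the new 8-bit constant becomes the low byte."""
    return ((reg & 0xFF) << 8) | (value & 0xFF)


def const(val):
    """The CONST macro: constlo with the high byte, then swapconst with the low byte."""
    r = constlo(val >> 8)        # assumed: operand truncated to 8 bits
    return swapconst(r, val)     # assumed: operand truncated to 8 bits


print(hex(const(0x1234)))  # two 16-bit instructions build one 16-bit constant
```

The appeal of the pairing is that a constant that fits in 8 bits costs one instruction, and any 16-bit constant costs exactly two, with no dedicated "load high" datapath.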
|
|
|
Philip,

Thanks for the info.

> So I had to go dig up the archive for the RISC4005 (1991/1992 vintage).

Is any of the RISC4005 stuff online?

( I built a 40 bit bit-slice machine, sorta like a lobotomized '2901, in a 4010 when they first came out in the 92-93 timeframe. Had 16 general registers, 16 constant registers, external '448 microsequencer; it ran at 12.5 MHz, with a 25 MHz clock to generate the then-required asynchronous CLB write signal )

>A really neat capability of RISC4005 (that I should have patented,
>because no-one before or after me has thought of it) was the stunning
>additional instruction group: SKIP2, of which I had 48 of these as well.
>It skipped 2 instructions. Which is great for double precision arith,
>because you can skip an ADD, and an ADDC (add with carry) with
>one skip instruction. This really helps in multiply and divide routines.

From my copy of "User Manual for the CDP1802 COSMAC Microprocessor", RCA publication MPM-201B, copyright 1977, pages 37-38:

 "The SHORT SKIP is unconditional and skips the byte following the
  operation code. The LONG SKIP is also unconditional but skips two
  bytes following the operation code. The other instructions are long
  skips if test conditions for D, DF, or Q are satisfied."

   SKP   SHORT SKIP
   LSKP  LONG SKIP

   LSZ   LONG SKIP IF D=0
   LSNZ  LONG SKIP IF D NOT 0

   LSDF  LONG SKIP IF DF=1
   LSNF  LONG SKIP IF DF=0

   LSQ   LONG SKIP IF Q=1
   LSNQ  LONG SKIP IF Q=0

   LSIE  LONG SKIP IF IE=1

I also had my own "skip extension" plans for the 5 opcode bits that are now in use for the bit number of the "skip on bit" mode:

   SMODE : selects AND or XOR of skip condition with enable bits

   E1    : enable for first instruction following skip
   E2    : enable for second
   E3    : enable for third
   E4    : enable for fourth

 In the AND mode, the instructions with enable bits set are skipped if the condition was true, executed if the condition was false; those with enable bits cleared are executed normally.

 In the XOR mode, the instructions with enable bits set are skipped if the condition was true, executed if the condition was false; instructions with enable bits cleared suffer the opposite fate.

 ( If I don't implement short conditional branches, I may bring this back as an "eskip" instruction using what's now the "br.cc" opcode )

 This won't work on the 16 bit datapath processor, as there aren't enough bits in the status register to hold the four pending bits of skip state.

still having fun,
Brian Davis
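[Editor's sketch] The AND and XOR skip modes described above reduce to a one-line rule per mode. The Python below is a minimal model of that rule only; the function name eskip_executes and its argument layout (a flag for the mode plus a list of E1..E4) are hypothetical and do not reflect the actual YARD-1A encoding.

```python
def eskip_executes(xor_mode, enables, cond):
    """Return, for each of the four slots after the skip, whether it executes.

    xor_mode -- False for AND mode, True for XOR mode (the SMODE bit)
    enables  -- four truthy/falsy values for E1..E4
    cond     -- the skip condition
    """
    cond = bool(cond)
    executes = []
    for e in enables:
        e = bool(e)
        if xor_mode:
            # enabled slots run when cond is false, disabled slots when it is true
            executes.append(cond != e)
        else:
            # AND mode: a slot is skipped only when it is enabled and cond is true
            executes.append(not (cond and e))
    return executes


# XOR mode, first two slots hold the "then" code, last two the "else" code
print(eskip_executes(True, [0, 0, 1, 1], True))
print(eskip_executes(True, [0, 0, 1, 1], False))
```

In XOR mode exactly one of the two groups runs for either outcome of the condition, which is what makes a branch-free if/then/else possible.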
|
|
|
Jan,

Thanks for the comments.

> It would be nice to design a simple anybody-can-solder-it
> expansion board to plug into the 2S100 board's prototyping
> area, to provide RAM, VGA port, and a few other niceties.

When I first looked at the Insight board, I'd hoped to stick an external synchronous SRAM on a daughtercard above the FPGA; alas, in all 160 pins of header there's nary a ground in sight - I don't think I'll be running any 100 MHz+ bus cycles there. The prototype area should be OK for slower external interfacing.

My <tentative> plan for the external RAM interface version of the core (YARD-1B) is to double cycle an SSRAM at 2x the core rate, providing one instruction fetch and one data memory cycle per processor cycle.

> I like your nullable branch delay slots.

I hope to have them enabled for all of bra/bsr/jmp/jsr/rts/rti in the final version; this should allow for two-cycle call/return overhead by pulling the first instruction of the target into the call delay slot, and moving the instruction before the return into the return delay slot.

The return address stacking mechanism on a call needs to accommodate the change in pushed return address depending upon whether the delay slot was executed; if I can't do that cleanly with the existing address hardware, I'll probably leave the delay slot enable in only for bra/jmp/rts/rti.

> When you state "full implementation of SHIFT" -- are you
> planning a multicycle shifter or a full barrel shifter or
> something in-between? A full barrel shifter is quite area
> intensive. Oops, never mind, I see, 1,2,4,8,16.

I picked those values so you can do any constant shift in at most five instruction cycles, or a variable shift with code like:

 ;
 ; r0 = data to shift
 ; r1 = shift count
 ;
   skip.bs  r1,#0
   lsr      r0,#1

   skip.bs  r1,#1
   lsr      r0,#2

   skip.bs  r1,#2
   lsr      r0,#4

   skip.bs  r1,#3
   lsr      r0,#8

   skip.bs  r1,#4
   lsr      r0,#16

I may add some more constant shift type operations in the holes left around LSL/LSR/ASR ( e.g. shift by 24, byte swap/extract ).

The unused opcode after FF0/FF1/FFD/CNT0/CNT1 is there for a variable shift-by-register instruction, which would require building a barrel shifter.

> But if you insert a pipeline register between register file
> read and write accesses you may have problems with your single
> bank of dual-port RAM,right?

Right, anything that would move the register writeback to another clock cycle requires the use of an independent read/read/write register file.

I thought I'd put a note about that in the "register file" section, but I don't see anything there; in any event, the source code for the register file synthesizes to one bank of dual ports if the address lines are common on a read and write port, two banks if they are not.

> I note your hardware call stack, which will certainly improve
> call overhead. But in my experience more time is spent saving
> and reloading live value registers across calls than the return
> address. What happens on overflow? :-)

Once the trap mechanism is working, a "stack almost full" trap will occur.

Many embedded processors/DSPs get by with a 16-32 deep ( or smaller ) hardware return stack; IIRC, the DSP compilers typically have a flag that controls whether function entry code manually pops the top entry of the return stack to a software stack for all but leaf functions.

> FF0, FF1, CNT0, CNT1, great! Did you put them in for show
> or do you have an application that will use them? :-)

FF1 and FFD were planned for fast normalization of unsigned/signed fractional binary numbers for floating point and signed fractional block floating point code; with proper bit count encoding and a variable shift instruction, you get two cycle normalizations ( which should be useful in another 30-40 years when I retire and have the time to write a floating point package ).

The others came along for the ride; I may be able to do them all with almost the same hardware. The XC4000 carry chains let you build this sort of stuff, but I haven't tried it yet with the Virtex/Spartan-II carry chains.

> Re: Sign/zero-extension on loads. j32 and early xr16's had
> sign- and zero- extension, but the extra delay needed to drive
> the load-data-byte's MSB onto other data bus lines was proving
> to hurt the xr16 cycle time, so out went LBS!

I have compile-time flags to turn the sign extension stuff off and allow only word stores. In general, many of the 'frills' will be enabled/disabled with a configuration file once I tidy up the code.

> I saw some announcements on CGEN. It seems very promising.
> Have you read its docs and/or used it? Does it make porting
> binutils a snap?

I'm just at the 'file under things that look interesting' stage.

I stumbled across CGEN while looking for the NIOS stuff at Redhat; let's see, if they built a compiler toolchain around GCC, that means the source code is available, right?

Brian
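[Editor's sketch] The reason shift amounts of 1, 2, 4, 8 and 16 suffice is that any count from 0 to 31 is a sum of distinct powers of two, so five conditional fixed shifts cover every case. The Python below models that decomposition for a 32-bit logical shift right. It assumes each fixed shift is applied exactly when the matching bit of the count is set; the function name variable_lsr is hypothetical, and the sketch makes no claim about the precise semantics of the skip.bs mnemonic.

```python
def variable_lsr(data, count, width=32):
    """Logical shift right by 'count' using only shifts by 1, 2, 4, 8 and 16."""
    mask = (1 << width) - 1
    data &= mask
    for bit in range(5):                 # bits 0..4 of the count
        if (count >> bit) & 1:           # apply the 2**bit shift only if this bit is set
            data = (data >> (1 << bit)) & mask
    return data


print(hex(variable_lsr(0x80000000, 31)))
```

The same five-step structure is what a hardware barrel shifter implements as five mux layers; the software version trades that area for up to ten instruction slots.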
|
What I wish I could find is a small 8 bit cpu core for Xilinx, in VHDL, which:

a) is free and available for unrestricted commercial use;
b) can access at least 8 kb of internal Xilinx block ram for firmware/data;
c) can have its I/O expanded to hundreds of pins befitting a Spartan II;
d) has had a credible test bench run against it;
e) has an assembler and hex-to-vhdl converter program;
f) runs at 1 MHz or more (speed is not very important); and
g) (optional) has a C compiler available.

Everything I've found in books and on the web fails in at least one of the above categories. The closest I've come is one of the PIC emulators, but the licensing of it is unclear, so I hesitate to use it.

Best Regards,

Gary Watson
Technical Director
Nexsan Technologies, Ltd.
Imperial House
East Service Road
Raynesway
Derby DE21 7BF ENGLAND
+44 (0) 1332 5 444 33
http://www.nexsan.com
|
On Fri, 22 Sep 2000 02:39:20 -0000, you (Brian Davis) wrote:

>Philip,
> Is any of the RISC4005 stuff online?

No. My design and Jan's XR16 are very similar, only he has done a far better job of documenting it, and supporting it with software. Jan's work, while independent of mine, had many striking similarities to the RISC4005, which we realized when we started trading email and phone calls a few years ago. We agreed that this was probably due to both of us having the same basic goals of efficient implementation, and realizing that an efficient CPU would be far better if the CPU architecture was adjusted to the FPGA resources, rather than a standard CPU, with the FPGA resources applied to meet an existing architecture.

> ( I built a 40 bit bit-slice machine, sorta like a lobotomized
>'2901, in a 4010 when they first came out in the 92-93 timeframe.
>Had 16 general registers, 16 constant registers, external '448
>microsequencer; it ran at 12.5 MHz, with a 25 MHz clock to generate
>the then-required asynchronous CLB write signal )

I have in my garage a variable width data path microcoded CPU, built in 1980-1982 with 8 x 2903s, and a very modified 2910. It covers about twenty 6U by 220mm wire wrap cards. It includes 128 bit wide microword, with up to 1 MW of control store, all built with 4Kbit SRAMs (8KW implemented, and VM (yes, VM microcode) for the rest). I/O channel is dual 16 bit Multibus 1 cardcages. The boot processor was an 8080 CPM system that booted a custom Z8000 system (I designed this too, and the OS on it), and the Z8000 then loaded the WCS, and controlled the clocks for the system.

>>A really neat capability of RISC4005 (that I should have patented,
>>because no-one before or after me has thought of it) was the stunning
>>additional instruction group: SKIP2, of which I had 48 of these as well.
>>It skipped 2 instructions. Which is great for double precision arith,
>>because you can skip an ADD, and an ADDC (add with carry) with
>>one skip instruction. This really helps in multiply and divide
>>routines.
>
> From my copy of "User Manual for the CDP1802 COSMAC Microprocessor",
>RCA publication MPM-201B, copyright 1977, pages 37-38:

The fastest way to do research in internet time is to post an assertion, and sit back :-) :-) :-)

> "The SHORT SKIP is unconditional and skips the byte following the
>  operation code. The LONG SKIP is also unconditional but skips two
>  bytes following the operation code. The other instructions are long
>  skips if test conditions for D, DF, or Q are satisfied."
>
>   SKP   SHORT SKIP
>   LSKP  LONG SKIP
>
>   LSZ   LONG SKIP IF D=0
>   LSNZ  LONG SKIP IF D NOT 0
>
>   LSDF  LONG SKIP IF DF=1
>   LSNF  LONG SKIP IF DF=0
>
>   LSQ   LONG SKIP IF Q=1
>   LSNQ  LONG SKIP IF Q=0
>
>   LSIE  LONG SKIP IF IE=1

I of course stand corrected, and humbled :-)

> I also had my own "skip extension" plans for the 5 opcode bits
>that are now in use for the bit number of the "skip on bit" mode:
>
>   SMODE : selects AND or XOR of skip condition with enable bits
>
>   E1    : enable for first instruction following skip
>   E2    : enable for second
>   E3    : enable for third
>   E4    : enable for fourth

This sounds pretty neat. Have you thought how you will get a compiler to make use of this?

> In the AND mode, the instructions with enable bits set are skipped
> if the condition was true, executed if the condition was false;
> those with enable bits cleared are executed normally.
>
> In the XOR mode, the instructions with enable bits set are skipped
> if the condition was true, executed if the condition was false;
> instructions with enable bits cleared suffer the opposite fate.
>
> ( If I don't implement short conditional branches, I may bring this
> back as an "eskip" instruction using what's now the "br.cc" opcode )
>
> This won't work on the 16 bit datapath processor, as there aren't
> enough bits in the status register to hold the four pending bits of
> skip state.

>still having fun,
>Brian Davis

Me too.

Philip Freidin

=================
Philip Freidin
|
|
|
--- In , Philip Freidin <philip@f...> wrote:
>
> I have in my garage a variable width data path microcoded CPU, built
> in 1980-1982 with 8 x 2903s, and a very modified 2910. It covers about
> twenty 6U by 220mm wire wrap cards. It includes 128 bit wide microword,
> with up to 1 MW of control store, all built with 4Kbit SRAMs (8KW
> implemented, and VM (yes, VM microcode) for the rest). I/O channel is
> dual 16 bit Multibus 1 cardcages. The boot processor was an 8080 CPM
> system that booted a custom Z8000 system (I designed this too, and the
> OS on it), and the Z8000 then loaded the WCS, and controlled the clocks
> for the system.
>

And I thought I had a problem with home projects... can't top that, won't even try. :-)

( Although I do have a Multibus I chassis and two bed-of-nails Augat wirewrap panels for it... need any spares? )

The bit-slice machine I'd mentioned wasn't a home processor project; it handled header and error processing for a big data formatting/DMA engine that took up the rest of the 4010. However, it did make me realize that the FPGAs were getting big enough to stuff a processor into.

> > I also had my own "skip extension" plans for the 5 opcode bits
> >that are now in use for the bit number of the "skip on bit" mode:
> >
> >   SMODE : selects AND or XOR of skip condition with enable bits
> >
> >   E1    : enable for first instruction following skip
> >   E2    : enable for second
> >   E3    : enable for third
> >   E4    : enable for fourth
> >
> > In the AND mode, the instructions with enable bits set are skipped
> > if the condition was true, executed if the condition was false;
> > those with enable bits cleared are executed normally.
> >
> > In the XOR mode, the instructions with enable bits set are skipped
> > if the condition was true, executed if the condition was false;
> > instructions with enable bits cleared suffer the opposite fate.
> >
> > ( If I don't implement short conditional branches, I may bring this
> > back as an "eskip" instruction using what's now the "br.cc" opcode )
> >
> > This won't work on the 16 bit datapath processor, as there aren't
> > enough bits in the status register to hold the four pending bits of
> > skip state.
> >
>
> This sounds pretty neat. Have you thought how you will get a compiler
> to make use of this?
>

For a human compiler, it's pretty easy...

The XOR mode gives you small if..then..else's without branches; execution time wise, you trade branches and ( maybe unused ) branch delay slots for the overhead of always executing all four skip slots in XOR mode.

My experience to date with code generators has been limited to writing Small-C and Micro-C back ends about 7-8 years ago, so take the following with a grain of salt.

I think it could be done with a peephole optimizer by looking for the 'if' sequence emitted by the compiler:

 ;
 ; if cond
 ;   then <then_code>
 ;   else <else_code>
 ;
       skip.cc
       bra     else_code

 then_code:
       <some_then_code>
       bra     end_if

 else_code:
       <some_else_code>

 end_if:

Check to see if the <then_code> and <else_code> add up to <= 4 instructions, then pack the <then_code> and <else_code> into a 4 instruction sequence preceded by an eskip with the appropriate enable bits set ( would need NOP padding if < 4 instructions ).

ESKIP allows 1/3, 2/2, 3/1 partitioning of the "then" and "else" code, which should handle simple conditional assignments ( like set one variable / update a pointer ) without needing branches.

So code like:

 ;
 ; if r1 = 0
 ;   then r1 = 1023 , r2 = r2 + 1;
 ;   else r1-- ;
 ;
       skip.z  r1
       bra     else_code

 then_code:
       move    r1,#1023
       add     r2,#1
       bra     end_if

 else_code:
       sub     r1,#1

 end_if:

becomes:

 ; if r1 = 0
       eskip.z r1 #%1_0011

 ; then r1 = 1023 , r2 = r2 + 1;
       move    r1,#1023
       add     r2,#1

 ; else r1-- ;
       sub     r1,#1
       nop

Brian
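[Editor's sketch] The peephole step outlined above (check that the then and else code fit in four slots, pad with NOPs, emit the enable bits) might look roughly like the Python below. Everything here is hypothetical: the function name pack_eskip, the representation of instructions as strings, and the mask layout "SMODE_E1E2E3E4", which is inferred only from the single "%1_0011" example in the message. Padding NOPs are placed in the else group, as in that example.

```python
def pack_eskip(then_code, else_code):
    """Pack a small if/then/else into four XOR-mode eskip slots.

    Returns (mask, slots), or None when the code does not fit.
    The mask string is SMODE followed by E1..E4 (assumed layout).
    """
    n_then, n_else = len(then_code), len(else_code)
    if n_then < 1 or n_else < 1 or n_then + n_else > 4:
        return None                       # leave the branch sequence alone
    padding = ["nop"] * (4 - n_then - n_else)
    slots = list(then_code) + list(else_code) + padding
    # then slots: enable clear (run when the condition is true)
    # else slots and padding: enable set (run when the condition is false)
    enables = "0" * n_then + "1" * (4 - n_then)
    return "1_" + enables, slots


mask, slots = pack_eskip(["move r1,#1023", "add r2,#1"], ["sub r1,#1"])
print("eskip.z r1 #%" + mask)
for insn in slots:
    print("    " + insn)
```

A real optimizer would also have to confirm that nothing branches to the then_code or else_code labels from elsewhere before deleting them, a check this sketch leaves out.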