This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

Yet Another RISC Design (YARD-1A) - Brian Davis - Sep 19 22:28:00 2000


YARD-1A processor:

I've been working on my own 32 bit RISC processor
intermittently for the past year or so; it's not
quite soup yet, but it's getting close.

The current implementation is a very simple two
operand, 2 stage pipeline, set up to use internal
Xilinx FPGA RAM resources.

I've put a draft description of the processor at:

ftp://members.aol.com/fpgastuff/yard-1a.zip

Its first "blink the LEDs" program booted in
hardware last December, running in an XC4010E on
an old Xilinx "FPGA Eval Board" ( 32 bit core,
64x16 ROM, 32x32 RAM, parallel I/O ).

The current target is an XC2S100; I'd ordered an
Insight "Spartan II Development Board" at the beginning
of July, and it arrived a few weeks ago; see:

http://www.insight-electronics.com/xcellence/scalable/kit/spartan/index.html

DS-XC2S100-BRD $125 USD

Other Stuff:

Hill, Jouppi, Sohi, "Readings in Computer Architecture"
ISBN 1-55860-539-8
Nice collection of papers from 1964 to the present
http://www.mkp.com/architecture-readings

CGEN CPU tools generator
http://sources.redhat.com/cgen

RISC-8 PERL cross assembler
http://www.geocities.com/microprocessors

Brian Davis





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )


RE: Yet Another RISC Design (YARD-1A) (long) - Jan Gray - Sep 20 21:56:00 2000

Brian, congratulations, nice work. Looks like you've been having lots of
fun.

I read your yard-1a materials. Some comments and questions -- (apologies in
advance, this was written quickly)

I like your use of the single bank of dual port RAM for the register file,
which does constrain one to an instruction set with two register operands,
and with no pipeline stage between the operand fetch register file read
access, and the execute stage result write-back. As you have found with
your quite respectable "push-button" cycle time of 25 ns or so, this can
still provide good performance in a small area.

I apologize in advance for yet another "I did one of those too", but I did
one of those too, back in August. I wrote a quick and dirty non-pipelined
RISC in about 170 lines of Verilog, slightly incomplete and no-doubt buggy,
with the intention of polishing it up and turning it into a small, annotated
design example on the web site, to demonstrate just how simple these things
can be. Not coincidentally, I'm also seeing performance of about 35-40 MHz
in a -4 speed grade Virtex, but with load/store instructions (as usual) needing at least
a second clock cycle. Unlike your design, this one uses an on-chip I-cache
and so doesn't have an ifetch stage (on a cache hit) -- so no branch delay
issues. It's pretty simple. Here is the relevant I-cache code excerpt:

// itag is driven by the block RAM's DOB port, so declare it as a net
wire [15:0] itag;
// next pc: jump target, taken-branch target, or pc+2; hold pc while stalled
wire [N:0] pc_nxt = pipe_ce ? (`JL ? sum : (`Bx&branch) ? (pc +
    {brdisp,1'b0}) : (pc + 2)) : pc;
`ifdef GR14 /* 16-bits */
// stall during reset or when the cached tag does not match pc
assign pipe_ce = !(rst_sync || (itag != pc[15:8]));
RAMB4_S16_S16 icache(
    // port A: instruction words (lower half of the RAM)
    .WEA(icache_we), .ENA(1'b1), .RSTA(rst), .CLKA(clk),
    .ADDRA({1'b0,pc_nxt[7:1]}), .DIA(di[15:0]), .DOA(ir),
    // port B: tags (upper half of the RAM)
    .WEB(icache_we), .ENB(1'b1), .RSTB(rst), .CLKB(clk),
    .ADDRB({1'b1,pc_nxt[7:1]}), .DIB({8'b0,pc[15:8]}), .DOB(itag));
`else /* 32-bits */
assign pipe_ce = !(rst_sync || (itag != pc[23:8]));
RAMB4_S16_S16 icache(
    .WEA(icache_we), .ENA(1'b1), .RSTA(rst), .CLKA(clk),
    .ADDRA({1'b0,pc_nxt[7:1]}), .DIA(di[15:0]), .DOA(ir),
    .WEB(icache_we), .ENB(1'b1), .RSTB(rst), .CLKB(clk),
    .ADDRB({1'b1,pc_nxt[7:1]}), .DIB(pc[23:8]), .DOB(itag));
`endif

Here we use one dual-port block ram to implement a 128x16 I-cache
instruction memory with a 128x16 I-cache tag memory. Then pipe_ce is false
if itag != pc[15:8] (16-bits) or itag != pc[23:8] (32-bits). The
processor will spin fetching the same instruction until an external agent
writes the new instruction data+tag into the I-cache at address pc_nxt[7:1].
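
For readers who do not read Verilog, the hit/stall behaviour described above can be sketched as a small behavioural model. This is only an illustration, not the processor's actual code: the `ICache` class, the `fetch` helper, the `memory` dictionary standing in for the "external agent", and the use of `None` for a never-written tag are all assumptions made for the sketch, and only the 16-bit configuration is modelled.

```python
# Behavioural sketch of the direct-mapped I-cache described above.
# Assumptions (not from the original post): a 16-bit pc, a dict standing in
# for external memory, and refill() playing the role of the external agent.

LINES = 128  # 128 instruction words + 128 tags share one dual-port block RAM


class ICache:
    def __init__(self):
        self.data = [0] * LINES      # port A half: instruction words
        self.tag = [None] * LINES    # port B half: pc[15:8] of the cached word

    @staticmethod
    def index(pc):
        return (pc >> 1) & 0x7F      # pc[7:1] selects the cache entry

    @staticmethod
    def tag_of(pc):
        return (pc >> 8) & 0xFF      # pc[15:8] in the 16-bit configuration

    def hit(self, pc):
        """Models pipe_ce (ignoring reset): true when the stored tag matches."""
        return self.tag[self.index(pc)] == self.tag_of(pc)

    def refill(self, pc, word):
        """External agent writes instruction data and tag at the same index."""
        i = self.index(pc)
        self.data[i] = word
        self.tag[i] = self.tag_of(pc)


def fetch(cache, memory, pc):
    """Return (instruction, stall_cycles) for one fetch at pc."""
    stalls = 0
    while not cache.hit(pc):         # the processor spins on the same pc
        cache.refill(pc, memory[pc])
        stalls += 1
    return cache.data[cache.index(pc)], stalls
```

Two addresses that share pc[7:1] but differ in pc[15:8] evict each other, which is the usual cost of a direct-mapped organisation.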

More on that work sooner or later.

You are using the Insight 2S100 board. I am currently using an XESS XSV-300
for my Virtex work, not inexpensive, and I know of at least one other
designer on this list who is using the same inexpensive board that you are.
Perhaps it would make a more accessible platform for future projects in the
Virtex space. I'm ordering one in the morning. One downside of this board
is that the more budget-constrained among us will not be able to target this
board, since even using the rumored-to-be-forthcoming new Student Edition,
which allegedly targets the V50 (and hence 2S50), probably won't be able to
target the 2S100. I'll write the Xilinx University Program folks and see
what they are up to. The second downside of this board is it has no
built-in RAM. It would be nice to design a simple anybody-can-solder-it
expansion board to plug into the 2S100 board's prototyping area, to provide
RAM, VGA port, and a few other niceties.

But back to your work.

I like your nullable branch delay slots.

When you state "full implementation of SHIFT" -- are you planning a
multicycle shifter or a full barrel shifter or something in-between? A full
barrel shifter is quite area intensive. Oops, never mind, I see,
1,2,4,8,16.

For external memory, you state
" - extend the pipeline from 2 to 3 stages
( would need register forwarding HW, control logic rework )"

But if you insert a pipeline register between register file read and write
accesses you may have problems with your single bank of dual-port RAM,
right? It may be simpler to stall the pipeline during the memory access
than to build a MEM stage.

I like your simulation framework a lot, and it's good to know there's an
adequate, free VHDL simulator out there, too.

I like your immediate operand encoding to get those bit-masks, etc.
Reminiscent of (but different than) ARM (IIRC). A while back I looked at
some frequently used wide constants (e.g. 0x000000FF, 0x0000FFFF,
0x00010000, 0x10000000, and so forth), and I used to kick around the idea of a
loadable "immediate constant register file" for compact and quick access to
your favorite 16 or 32 immediate constants. It could be loaded once at
system init time, or once per dynamic library, or even once per function (if
restored). Of course, this is hardly an improvement over a larger regular
register file! The funny thing is, in FPGAs, an n-bit 2-1 mux is often just
as expensive as a 16xn register file.

I note your hardware call stack, which will certainly improve call overhead.
But in my experience more time is spent saving and reloading live value
registers across calls than the return address. What happens on overflow?
:-)

For your decision to allow base+offset addressing on load/store
instructions, I debated the exact same point with myself when I was doing
this quick&dirty non-pipelined RISC last month. Since the Virtex block RAM
is synchronous for read and write, you have to have the address prepared
before the data block RAM clock edge. If you want loads to occur in one
cycle, you have no choice but to present the load address to the block RAM
on the clock falling edge (if the rest of the design is clocked on the
rising edge). If you do that, the 16- or 32-bit adder delay to compute base
register+offset must occur in one half cycle, and the min cycle time will be
loooooong. If, on the other hand, you stall the processor for one cycle (so
that loads take two cycles) then the register+offset add should not affect
the cycle time (because it is basically identical to the add instruction
critical path). That's what I did.
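
The trade-off in the paragraph above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the 20 ns address-path delay and the 20% load fraction are hypothetical values chosen only for illustration, the function names are invented here, and block RAM setup time and routing are ignored.

```python
def min_cycle_one_cycle_load(t_addr_ns):
    """Address (register read + base+offset add) must be ready by the
    falling edge, so it gets only half a cycle."""
    return 2.0 * t_addr_ns


def min_cycle_two_cycle_load(t_addr_ns, t_alu_ns):
    """With a one-cycle stall the address add gets a whole cycle, like an ADD."""
    return max(t_addr_ns, t_alu_ns)


def avg_ns_per_instruction(cycle_ns, load_fraction, cycles_per_load):
    """Average time per instruction when only loads take extra cycles."""
    cpi = 1.0 + load_fraction * (cycles_per_load - 1)
    return cpi * cycle_ns


# Hypothetical numbers, for illustration only.
T_ADD = 20.0   # ns, register read + add path
LOADS = 0.2    # fraction of instructions that are loads

fast_load = avg_ns_per_instruction(min_cycle_one_cycle_load(T_ADD), LOADS, 1)
stalled = avg_ns_per_instruction(min_cycle_two_cycle_load(T_ADD, T_ADD), LOADS, 2)
```

Under these assumed numbers the stalled design averages about 24 ns per instruction against 40 ns for the single-cycle load, because every instruction pays for the doubled cycle time while only the loads pay for the stall.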

Re: your immediate instruction approach, I believe Philip Freidin's RISC4005
had a similar instruction to write an immediate (literal) value into a
register. (In his case, a general purpose register, right Philip?) I like
what you did there, because now you can have a store instruction which
sources two registers (the data-to-store register and the base-register)
plus your literal in the SR register.

Re: Sign/zero-extension on loads. j32 and early xr16's had sign- and zero-
extension, but the extra delay needed to drive the load-data-byte's MSB onto
other data bus lines was proving to hurt the xr16 cycle time, so out went
LBS!

Re: no CCs: The aspect of xr16 I am most unhappy with is the condition codes
and the associated interlocks, which I used despite my better judgement. I
won't be fooled again.

Re: skip instructions. IIRC that's almost exactly what RISC4005 did.

FF0, FF1, CNT0, CNT1, great! Did you put them in for show or do you have an
application that will use them? :-)

About ten years ago (if I remember correctly) there was a 40 MHz CMOS RISC
processor called the GE? RPM-40 designed by Dennis O'Connor. It was a
16-bit instruction word RISC, tight on opcode space, that also had RSUB,
etc. It would be fun to look that up and see if there are any
lessons for us opcode-space challenged. Alas most FPGA CPU design issues
were faced by regular full-custom CPU designers ten years ago.

I have the Readings in Computer Architecture book, it's excellent. This
inspires me to put up a web page of recommended books, etc.

I saw some announcements on CGEN. It seems very promising. Have you read
its docs and/or used it? Does it make porting binutils a snap?

Thank you for taking the time to share your interesting work with us.

Jan Gray
Gray Research LLC






Re: YARD-1A Consts, CCs, Skips - Philip Freidin - Sep 21 13:16:00 2000

On Wed, 20 Sep 2000 19:56:23 -0700, Jan Gray wrote:
>Brian, congratulations, nice work. Looks like you've been having lots of
>fun.

Yep lots of it :-)

> .....

>But back to your work.

>Re: your immediate instruction approach, I believe Philip Freidin's RISC4005
>had a similar instruction to write an immediate (literal) value into a
>register. (In his case, a general purpose register, right Philip?) I like
>what you did there, because now you can have a store instruction which
>sources two registers (the data-to-store register and the base-register)
>plus your literal in the SR register.

So I had to go dig up the archive for the RISC4005 (1991/1992 vintage).
What I had back then, and I am still happy with it, was two instructions, each
with a 4-bit opcode, a 4-bit dest reg, and an 8-bit constant. They were

constlo Rn,0xAA
and
swapconst Rn,0xBB

constlo always zeroed the upper byte of Rn.
swapconst copied the low half of Rn to the high half, and loaded the low
half with the new constant. From my macro definition file (the assembler
I wrote had a very powerful macro facility) comes the
following (slightly trimmed).

The RISC4005 had condition codes, and 2 delay slots after branches,
due to it being a 4 stage pipelined processor.

Look at the macro "CONST", which loads a 16 bit constant

;
; STDMACS.S Standard MACROS
;
;
; Last edit What
; 04-Jan-92 Initial Creation.
;
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
nop .macro
skip_never
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
push .macro reg ; push register onto stack
dec r15,r15 ; predecrement
st reg,r15
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
pop .macro reg ; pop register from stack
ld reg,r15
inc r15,r15
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
const .macro reg,val
constlo reg,( val ) >> 8
swapconst reg,val
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
set_c .macro temp_reg
constlo temp_reg,0x01
srl temp_reg,temp_reg
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
clear_c .macro temp_reg
constlo temp_reg,0x00
srl temp_reg,temp_reg
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mov .macro rdest,rsrc
or rdest,rsrc,rsrc
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
far_call .macro dest
const r0,dest
call r0,r0
.endm
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
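
The effect of the CONST macro can be checked with a short model of the two instructions as described above. This is a sketch under one assumption that the post does not state: that the assembler keeps only the low 8 bits of the operand passed to `swapconst` (the macro passes the full 16-bit `val`). The Python function names simply mirror the mnemonics.

```python
MASK16 = 0xFFFF


def constlo(imm8):
    """constlo Rn,imm: low byte = imm, upper byte zeroed."""
    return imm8 & 0xFF


def swapconst(rn, imm8):
    """swapconst Rn,imm: old low byte moves to the high byte, low byte = imm."""
    return (((rn & 0xFF) << 8) | (imm8 & 0xFF)) & MASK16


def const(val):
    """CONST macro: constlo with the high byte, then swapconst with the low byte."""
    rn = constlo(val >> 8)
    return swapconst(rn, val)
```

For example, `const(0x1234)` first leaves 0x0012 in the register and then swaps in 0x34, giving 0x1234 in two 16-bit instructions.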

>Re: no CCs: The aspect of xr16 I am most unhappy with is the condition codes
>and the associated interlocks, which I used despite my better judgement. I
>won't be fooled again.

As an old grey haired architect I could go on for way too long about the
fights we had over this for the 29000, 15 years ago. We resolved, and I
still believe we were right, that CC's suck. The 29K let you do conditional
tests, and they stored the true/false result in the MSB of the dest reg.
We then just had a simple jump true/false, that looked at a register's value.
Made compound conditionals very easy: calc all the primitive conditions,
then just do AND/OR/XOR/NOT ops on the registers that held the
conditional results. And then, because we were realists, we added CC's
to the architecture, because we expected to run instruction set emulations
of 68K and i86 code, and the emulators would benefit from a real CC
register.

>Re: skip instructions. IIRC that's almost exactly what RISC4005 did.

So given the lessons of the 29K, why didn't the RISC4005 do this too?
Not enough opcode bits to specify a dest reg for the condition to be
stored in. The RISC4005 used a CC register (I had no plans for a
superscalar version, which is where CC's really bite you in the a*s).
Then you can have loads of skip instructions, because there are no
reg fields. I think RISC4005 had 48 different SKIP instructions, that
tested all true and false cases of every interesting combination
of bits in the CC.

A really neat capability of RISC4005 (that I should have patented,
because no-one before or after me has thought of it) was the stunning
additional instruction group: SKIP2, of which I had 48 of these as well.
It skipped 2 instructions. Which is great for double precision arith,
because you can skip an ADD, and an ADDC (add with carry) with
one skip instruction. This really helps in multiply and divide routines.
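
To make the multiply case concrete, here is a sketch of a 16x16 shift-and-add multiply built from 16-bit operations, where the conditional step is exactly an ADD followed by an ADDC. It is not the RISC4005 routine (which is not shown in this thread); the register names and loop structure are assumptions for illustration. The `if` body corresponds to the two instructions that a single SKIP2 would jump over.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """16-bit add returning (result, carry_out), like ADD / ADDC."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def mul16x16(multiplicand, multiplier):
    """Unsigned 16x16 -> 32-bit multiply using only 16-bit adds and shifts."""
    acc_lo = acc_hi = 0
    m_lo, m_hi = multiplicand & MASK16, 0
    for _ in range(16):
        if multiplier & 1:   # when the bit is clear, one SKIP2 skips both adds
            acc_lo, carry = add16(acc_lo, m_lo)       # ADD  (low halves)
            acc_hi, _ = add16(acc_hi, m_hi, carry)    # ADDC (high halves)
        # shift the 32-bit multiplicand left, the multiplier right
        m_hi = ((m_hi << 1) | (m_lo >> 15)) & MASK16
        m_lo = (m_lo << 1) & MASK16
        multiplier >>= 1
    return (acc_hi << 16) | acc_lo
```

With single-instruction skips only, the same loop would need either two skips or a branch around the pair.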

Keep having fun.

Philip Freidin

========================
Philip Freidin
Mindspring that acquired Earthlink that acquired Netcom has
decided to kill off all Netcom Shell accounts, including mine.
My new primary email address is
Please update your address book, sorry for the inconvenience
=================
Philip Freidin






Re: YARD-1A Consts, CCs, Skips - Brian Davis - Sep 21 21:39:00 2000

Philip,

Thanks for the info.

> So I had to go dig up the archive for the RISC4005 (1991/1992 vintage).

Is any of the RISC4005 stuff online?

( I built a 40 bit bit-slice machine, sorta like a lobotomized
'2901, in a 4010 when they first came out in the 92-93 timeframe.
Had 16 general registers, 16 constant registers, external '448
microsequencer; it ran at 12.5 MHz, with a 25 MHz clock to generate
the then-required asynchronous CLB write signal )

>A really neat capability of RISC4005 (that I should have patented,
>because no-one before or after me has thought of it) was the stunning
>additional instruction group: SKIP2, of which I had 48 of these as well.
>It skipped 2 instructions. Which is great for double precision arith,
>because you can skip an ADD, and an ADDC (add with carry) with
>one skip instruction. This really helps in multiply and divide routines.

From my copy of "User Manual for the CDP1802 COSMAC Microprocessor",
RCA publication MPM-201B, copyright 1977, pages 37-38:

"The SHORT SKIP is unconditional and skips the byte following the
operation code. The LONG SKIP is also unconditional but skips two
bytes following the operation code. The other instructions are long
skips if test conditions for D, DF, or Q are satisfied."

SKP SHORT SKIP
LSKP LONG SKIP

LSZ LONG SKIP IF D=0
LSNZ LONG SKIP IF D NOT 0

LSDF LONG SKIP IF DF=1
LSNF LONG SKIP IF DF=0

LSQ LONG SKIP IF Q=1
LSNQ LONG SKIP IF Q=0

LSIE LONG SKIP IF IE=1

I also had my own "skip extension" plans for the 5 opcode bits
that are now in use for the bit number of the "skip on bit" mode:

SMODE : selects AND or XOR of skip condition with enable bits

E1 : enable for first instruction following skip
E2 : enable for second
E3 : enable for third
E4 : enable for fourth

In the AND mode, the instructions with enable bits set are skipped
if the condition was true, executed if the condition was false;
those with enable bits cleared are executed normally.

In the XOR mode, the instructions with enable bits set are skipped
if the condition was true, executed if the condition was false;
instructions with enable bits cleared suffer the opposite fate.
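
The two modes can be written down as a small truth function. This is a sketch of the rules exactly as stated in the two paragraphs above, not of any implemented hardware; the function name and the list representation of E1..E4 are assumptions.

```python
def eskip_executes(cond, xor_mode, enables):
    """Return one boolean per slot: True if that instruction executes.

    cond     -- result of the skip condition test
    xor_mode -- False for AND mode, True for XOR mode
    enables  -- [E1, E2, E3, E4] for the four instructions after the skip
    """
    result = []
    for e in enables:
        e = bool(e)
        if xor_mode:
            # enabled slots run when cond is false, the others when it is true
            executes = (e != bool(cond))
        else:
            # AND mode: a slot is skipped only if enabled AND cond is true
            executes = not (e and cond)
        result.append(executes)
    return result
```

In XOR mode exactly one of the two groups of slots runs, which is what makes it usable as a branch-free if..then..else.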

( If I don't implement short conditional branches, I may bring this
back as an "eskip" instruction using what's now the "br.cc" opcode )

This won't work on the 16 bit datapath processor, as there aren't
enough bits in the status register to hold the four pending bits of
skip state.

still having fun,
Brian Davis






Re: Yet Another RISC Design (YARD-1A) (long) - Brian Davis - Sep 21 21:42:00 2000

Jan,
Thanks for the comments.

> It would be nice to design a simple anybody-can-solder-it
> expansion board to plug into the 2S100 board's prototyping
> area, to provide RAM, VGA port, and a few other niceties.

When I first looked at the Insight board, I'd hoped to stick
an external synchronous SRAM on a daughtercard above the FPGA;
alas, in all 160 pins of header there's nary a ground in sight-
I don't think I'll be running any 100 MHz+ bus cycles there.

The prototype area should be OK for slower external interfacing.

My <tentative> plan for the external RAM interface version of
the core (YARD-1B) is to double cycle an SSRAM at 2x the core
rate, providing one instruction fetch and one data memory cycle
per processor cycle.

> I like your nullable branch delay slots.

I hope to have them enabled for all of bra/bsr/jmp/jsr/rts/rti
in the final version; this should allow for two-cycle call/return
overhead by pulling the first instruction of the target into the
call delay slot, and moving the instruction before the return
into the return delay slot.

The return address stacking mechanism on a call needs to
accommodate the change in pushed return address depending upon
whether the delay slot was executed; if I can't do that cleanly
with the existing address hardware, I'll probably leave the
delay slot enable in only for bra/jmp/rts/rti.

> When you state "full implementation of SHIFT" -- are you
> planning a multicycle shifter or a full barrel shifter or
> something in-between? A full barrel shifter is quite area
> intensive. Oops, never mind, I see, 1,2,4,8,16.

I picked those values so you can do any constant shift in at
most five instruction cycles, or a variable shift with code like:
;
; r0 = data to shift
; r1 = shift count
;
skip.bs r1,#0
lsr r0,#1

skip.bs r1,#1
lsr r0,#2

skip.bs r1,#2
lsr r0,#4

skip.bs r1,#3
lsr r0,#8

skip.bs r1,#4
lsr r0,#16
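
The idea behind the sequence above is that any shift count from 0 to 31 is a sum of the fixed amounts 1, 2, 4, 8 and 16, selected by the bits of the count. A minimal model follows, assuming the intent is that each `lsr` executes only when the corresponding bit of the count is set; the exact sense of the `skip.bs` condition is not defined in this thread, so that reading is an assumption.

```python
def variable_lsr(value, count):
    """Logical right shift of a 32-bit value by count (0..31) in five steps."""
    value &= 0xFFFFFFFF
    for bit, amount in enumerate((1, 2, 4, 8, 16)):
        if (count >> bit) & 1:   # this step's shift runs only if the bit is set
            value >>= amount
    return value
```

The cost is a fixed ten instruction slots regardless of the count, which is the price of doing without a barrel shifter.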

I may add some more constant shift type operations in the holes
left around LSL/LSR/ASR ( e.g. shift by 24, byte swap/extract ).

The unused opcode after FF0/FF1/FFD/CNT0/CNT1 is there for a
variable shift-by-register instruction, which would require
building a barrel shifter.

> But if you insert a pipeline register between register file
> read and write accesses you may have problems with your single
> bank of dual-port RAM, right?

Right, anything that would move the register writeback to
another clock cycle requires the use of an independent
read/read/write register file.

I thought I'd put a note about that in the "register file"
section, but I don't see anything there; in any event, the
source code for the register file synthesizes to one bank
of dual ports if the address lines are common on a read and
write port, two banks if they are not.

> I note your hardware call stack, which will certainly improve
> call overhead. But in my experience more time is spent saving
> and reloading live value registers across calls than the return
> address. What happens on overflow? :-)

Once the trap mechanism is working, a "stack almost full"
trap will occur.

Many embedded processors/DSPs get by with a 16-32 deep
( or smaller ) hardware return stack; IIRC, the DSP compilers
typically have a flag that controls whether function entry code
manually pops the top entry of the return stack to a software
stack for all but leaf functions.
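
One way a "stack almost full" trap could keep a small hardware return stack usable for deep call chains is sketched below. Everything here is hypothetical: the depth of 16, the margin of 2, the policy of spilling the oldest half of the entries to a software stack, and the refill on underflow are assumptions for illustration, not a description of YARD-1A or of any particular DSP.

```python
class ReturnStack:
    """Hypothetical hardware return stack with an almost-full trap."""

    def __init__(self, depth=16, margin=2):
        self.depth = depth
        self.margin = margin
        self.hw = []        # hardware entries, newest last
        self.spilled = []   # software stack in memory, newest last
        self.traps = 0

    def call(self, return_addr):
        self.hw.append(return_addr)
        if len(self.hw) >= self.depth - self.margin:
            self._almost_full_trap()

    def _almost_full_trap(self):
        """Trap handler: move the oldest half of the entries to memory."""
        self.traps += 1
        half = len(self.hw) // 2
        self.spilled.extend(self.hw[:half])
        del self.hw[:half]

    def ret(self):
        if not self.hw:     # underflow: refill from the software stack
            self.hw.append(self.spilled.pop())
        return self.hw.pop()
```

Shallow call chains never trap, so the common case keeps the fast hardware path.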

> FF0, FF1, CNT0, CNT1, great! Did you put them in for show
> or do you have an application that will use them? :-)

FF1 and FFD were planned for fast normalization of
unsigned/signed fractional binary numbers for floating point
and signed fractional block floating point code; with proper
bit count encoding and a variable shift instruction, you get
two cycle normalizations ( which should be useful in another
30-40 years when I retire and have the time to write a floating
point package ).
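
The two-step normalization mentioned above (find the first one, then shift by that amount) can be sketched as follows. The encoding is an assumption: here the find-first-one result is taken to be the number of leading zeros, so it can be used directly as the left-shift count, and a zero input is arbitrarily given a count equal to the word width. The function names are invented for the sketch.

```python
def ff1_shift_count(value, width=32):
    """Find the first one from the MSB, encoded as the left shift that
    brings it to the top bit (assumed encoding)."""
    if value == 0:
        return width            # assumption: no one bit found
    return width - value.bit_length()


def normalize(value, width=32):
    """Step 1: FF1 gives the shift count. Step 2: variable left shift."""
    count = ff1_shift_count(value, width)
    mantissa = (value << count) & ((1 << width) - 1)
    return mantissa, count
```

The returned count is what a floating point routine would subtract from the exponent.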

The others came along for the ride; I may be able to do them
all with almost the same hardware. The XC4000 carry chains let
you build this sort of stuff, but I haven't tried it yet with the
Virtex/Spartan-II carry chains.

> Re: Sign/zero-extension on loads. j32 and early xr16's had
> sign- and zero- extension, but the extra delay needed to drive
> the load-data-byte's MSB onto other data bus lines was proving
> to hurt the xr16 cycle time, so out went LBS!

I have compile-time flags to turn the sign extension stuff
off and allow only word stores.

In general, many of the 'frills' will be enabled/disabled
with a configuration file once I tidy up the code.

> I saw some announcements on CGEN. It seems very promising.
> Have you read its docs and/or used it? Does it make porting
> binutils a snap?

I'm just at the 'file under things that look interesting' stage.

I stumbled across CGEN while looking for the NIOS stuff at
Redhat; let's see, if they built a compiler toolchain around
GCC, that means the source code is available, right?

Brian





My Wish List... - Gary Watson - Sep 22 12:02:00 2000


What I wish I could find is a small 8 bit cpu core for Xilinx, in VHDL,
which:

a) is free and available for unrestricted commercial use;
b) can access at least 8 kb of internal Xilinx block ram for firmware/data;
c) can have its I/O expanded to hundreds of pins befitting a Spartan II;
d) has had a credible test bench run against it;
e) has an assembler and hex-to-vhdl converter program; and
f) runs at 1 MHz or more. (speed is not very important)
g) (optional) has a c compiler available.

Everything I've found in books and on the web fails in at least one of the
above categories. The closest I've come is one of the PIC emulators, but
the licensing of it is unclear, so I hesitate to use it.

Best Regards,

Gary Watson
Technical Director
Nexsan Technologies, Ltd.
Imperial House
East Service Road
Raynesway
Derby DE21 7BF ENGLAND
+44 (0) 1332 5 444 33
http://www.nexsan.com






Re: YARD-1A Consts, CCs, Skips - Philip Freidin - Sep 29 2:50:00 2000

On Fri, 22 Sep 2000 02:39:20 -0000, you (Brian Davis) wrote:
>Philip,
> Is any of the RISC4005 stuff online?

No. My design and Jan's XR16 are very similar, only he has done a
far better job of documenting it, and supporting it with software.
Jan's work, while independent of mine, had many striking similarities
to the RISC4005, which we realized when we started trading email
and phone calls a few years ago.

We agreed that this was probably due to both of us having the same
basic goals of efficient implementation, and realizing that an efficient
CPU would be far better if the CPU architecture was adjusted to the
FPGA resources, rather than a standard CPU, with the FPGA resources
applied to meet an existing architecture.

> ( I built a 40 bit bit-slice machine, sorta like a lobotomized
>'2901, in a 4010 when they first came out in the 92-93 timeframe.
>Had 16 general registers, 16 constant registers, external '448
>microsequencer; it ran at 12.5 MHz, with a 25 MHz clock to generate
>the then-required asynchronous CLB write signal )

I have in my garage a variable width data path microcoded CPU, built
in 1980-1982 with 8 x 2903s, and a very modified 2910. It covers about
twenty 6U by 220mm wire wrap cards. It includes 128 bit wide microword,
with up to 1 MW of control store, all built with 4Kbit SRAMs (8KW
implemented, and VM (yes, VM microcode) for the rest). I/O channel is
dual 16 bit Multibus 1 cardcages. The boot processor was an 8080 CPM
system that booted a custom Z8000 system (I designed this too, and the
OS on it), and the Z8000 then loaded the WCS, and controlled the clocks
for the system.

>>A really neat capability of RISC4005 (that I should have patented,
>>because no-one before or after me has thought of it) was the stunning
>>additional instruction group: SKIP2, of which I had 48 of these as well.
>>It skipped 2 instructions. Which is great for double precision arith,
>>because you can skip an ADD, and an ADDC (add with carry) with
>>one skip instruction. This really helps in multiply and divide
>routines.
>
> From my copy of "User Manual for the CDP1802 COSMAC Microprocessor",
>RCA publication MPM-201B, copyright 1977, pages 37-38:

The fastest way to do research in internet time is to post an assertion,
and sit back :-) :-) :-)

> "The SHORT SKIP is unconditional and skips the byte following the
> operation code. The LONG SKIP is also unconditional but skips two
> bytes following the operation code. The other instructions are long
> skips if test conditions for D, DF, or Q are satisfied."
>
> SKP SHORT SKIP
> LSKP LONG SKIP
>
> LSZ LONG SKIP IF D=0
> LSNZ LONG SKIP IF D NOT 0
>
> LSDF LONG SKIP IF DF=1
> LSNF LONG SKIP IF DF=0
>
> LSQ LONG SKIP IF Q=1
> LSNQ LONG SKIP IF Q=0
>
> LSIE LONG SKIP IF IE=1

I of course stand corrected, and humbled :-)

> I also had my own "skip extension" plans for the 5 opcode bits
>that are now in use for the bit number of the "skip on bit" mode:
>
> SMODE : selects AND or XOR of skip condition with enable bits
>
> E1 : enable for first instruction following skip
> E2 : enable for second
> E3 : enable for third
> E4 : enable for fourth

This sounds pretty neat. Have you thought how you will get a compiler
to make use of this?

> In the AND mode, the instructions with enable bits set are skipped
> if the condition was true, executed if the condition was false;
> those with enable bits cleared are executed normally.
>
> In the XOR mode, the instructions with enable bits set are skipped
> if the condition was true, executed if the condition was false;
> instructions with enable bits cleared suffer the opposite fate.
>
> ( If I don't implement short conditional branches, I may bring this
> back as an "eskip" instruction using what's now the "br.cc" opcode )
>
> This won't work on the 16 bit datapath processor, as there aren't
> enough bits in the status register to hold the four pending bits of
> skip state.

>still having fun,
>Brian Davis

Me too.
Philip Freidin

=================
Philip Freidin






Re: YARD-1A Consts, CCs, Skips - Brian Davis - Sep 29 21:11:00 2000

--- In , Philip Freidin <philip@f...> wrote:
>
> I have in my garage a variable width data path microcoded CPU, built
> in 1980-1982 with 8 x 2903s, and a very modified 2910. It covers about
> twenty 6U by 220mm wire wrap cards. It includes 128 bit wide microword,
> with up to 1 MW of control store, all built with 4Kbit SRAMs (8KW
> implemented, and VM (yes, VM microcode) for the rest). I/O channel is
> dual 16 bit Multibus 1 cardcages. The boot processor was an 8080 CPM
> system that booted a custom Z8000 system (I designed this too, and the
> OS on it), and the Z8000 then loaded the WCS, and controlled the clocks
> for the system.
>
And I thought I had a problem with home projects...
can't top that, won't even try. :-)

( Although I do have a Multibus I chassis and two
bed-of-nails Augat wirewrap panels for it... need any
spares? )

The bit-slice machine I'd mentioned wasn't a home
processor project; it handled header and error
processing for a big data formatting/DMA engine that
took up the rest of the 4010. However, it did make me
realize that the FPGA's were getting big enough to
stuff a processor into.

> > I also had my own "skip extension" plans for the 5 opcode bits
> >that are now in use for the bit number of the "skip on bit" mode:
> >
> > SMODE : selects AND or XOR of skip condition with enable bits
> >
> > E1 : enable for first instruction following skip
> > E2 : enable for second
> > E3 : enable for third
> > E4 : enable for fourth
> >
> > In the AND mode, the instructions with enable bits set are skipped
> > if the condition was true, executed if the condition was false;
> > those with enable bits cleared are executed normally.
> >
> > In the XOR mode, the instructions with enable bits set are skipped
> > if the condition was true, executed if the condition was false;
> > instructions with enable bits cleared suffer the opposite fate.
> >
> > ( If I don't implement short conditional branches, I may bring this
> > back as an "eskip" instruction using what's now the "br.cc" opcode )
> >
> > This won't work on the 16 bit datapath processor, as there aren't
> > enough bits in the status register to hold the four pending bits of
> > skip state.
> >
>
> This sounds pretty neat. Have you thought how you will get a compiler
> to make use of this?
>
For a human compiler, it's pretty easy...

The XOR mode gives you small if..then..else's
without branches; execution time wise, you trade
branches and ( maybe unused ) branch delay slots
for the overhead of always executing all four skip
slots in XOR mode.

My experience to date with code generators has been
limited to writing Small-C and Micro-C back ends about
7-8 years ago, so take the following with a grain of salt.

I think it could be done with a peephole optimizer by
looking for the 'if' sequence emitted by the compiler:

;
; if cond
; then <then_code>
; else <else_code>
;
skip.cc
bra else_code

then_code:
<some_then_code>
bra end_if

else_code:
<some_else_code>

end_if:

Check to see if the <then_code> and <else_code> add up
to <= 4 instructions, then pack the <then_code> and
<else_code> into a 4 instruction sequence preceded by
an eskip with the appropriate enable bits set. ( would
need NOP padding if < 4 instructions ).

ESKIP allows 1/3, 2/2, 3/1 partitioning of the "then"
and "else" code, which should handle simple conditional
assignments ( like set one variable/ update a pointer )
without needing branches.
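
The packing check described above can be sketched as a small function over instruction strings. This is an illustration, not a real peephole pass: `pack_eskip` and its arguments are invented here, and the mask layout (a leading 1 selecting XOR mode, followed by E1..E4 left to right, with the "else" and padding slots enabled) is an assumption inferred from the `eskip.z r1 #%1_0011` example later in this message.

```python
def pack_eskip(cc, operand, then_code, else_code, slots=4):
    """Pack a short if/else into one eskip plus `slots` instructions.

    Returns the packed instruction list, or None when the pattern does not
    fit (either side empty, or more than `slots` instructions in total).
    """
    if not then_code or not else_code:
        return None
    if len(then_code) + len(else_code) > slots:
        return None
    padding = slots - len(then_code) - len(else_code)
    # XOR mode: cleared slots run when the condition is true (the "then"
    # code); enabled slots run when it is false ("else" code and NOP padding).
    enables = "0" * len(then_code) + "1" * (len(else_code) + padding)
    head = f"eskip.{cc} {operand} #%1_{enables}"
    return [head] + list(then_code) + list(else_code) + ["nop"] * padding
```

A pass built on this would fall back to the ordinary skip-and-branch sequence whenever the function returns None.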

so code like:
;
; if r1 = 0
; then r1 = 1023 , r2 = r2 + 1;
; else r1-- ;
;
skip.z r1
bra else_code

then_code:
move r1,#1023
add r2,#1
bra end_if

else_code:
sub r1,#1

end_if:

becomes:

; if r1 = 0
eskip.z r1 #%1_0011

; then r1 = 1023 , r2 = r2 + 1;
move r1,#1023
add r2,#1

; else r1-- ;
sub r1,#1
nop

Brian



