
This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

CISCifying RISC calls and returns - Author Unknown - Nov 16 1:58:00 2001


(Rob babbling away again...)

In the perpetual analysis for, and in striving for the perfect isa,
it seems to me that cisc style calls and returns would be better
than the risc style of using a link register. But I'm wondering *why*
it hasn't been done.

For instance, a 'call' instruction could be viewed as nothing more
than a specialized store instruction that stores the pc to
the stack. Similarly, a 'ret' just retrieves (loads) the pc from the
stack. Of course the stack pointer has to be adjusted, but that is
*very* simple to do.

It seems to me that using cisc style calls and returns could
eliminate the two instructions from every non-leaf subroutine that
are otherwise required to save and restore the link register. This
should result in a modest performance gain of two per cent across the
non-leaf routines.
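
A minimal Python sketch of this view of call and ret, assuming a word-addressed memory and a stack that grows downward with pre-decrement. Neither detail is given in the post, and the class and method names are hypothetical.

```python
class StackCallMachine:
    """Toy model: 'call' is a specialised store, 'ret' a specialised load."""

    def __init__(self, sp):
        self.pc = 0
        self.sp = sp
        self.mem = {}  # sparse word-addressed data memory

    def call(self, target):
        # Store the return address (next instruction) to the stack, then jump.
        self.sp -= 1
        self.mem[self.sp] = self.pc + 1
        self.pc = target

    def ret(self):
        # Load the pc back from the stack and adjust the stack pointer.
        self.pc = self.mem[self.sp]
        self.sp += 1


machine = StackCallMachine(sp=100)
machine.pc = 10
machine.call(50)   # pc is now 50, return address 11 sits at mem[99]
machine.ret()      # pc is back at 11, sp is back at 100
```

In this model the callee needs no separate instructions to save or restore a link register, which is where the two saved instructions per non-leaf routine would come from.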

Before I go ahead and modify my processor, I'd like to know why this
is a bad idea ?

If implemented, this would mean my processor is no longer strictly a
load / store architecture. Maybe I'll call it a HyRC - HYbrid Risc -
Cisc ("Herc") processor ? Or better yet a CRHy ("Cry") processor.

PS. I'm thinking of code naming my 240Mips EPIC processor
the "Puffin".

Thanks
Rob http://www.birdcomputer.ca/






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: CISCifying RISC calls and returns - Johan Klockars - Nov 16 4:35:00 2001

> In the perpetual analysis for, and in striving for the perfect isa,
> it seems to me that cisc style calls and returns would be better
> than the risc style of using a link register. But I'm wondering *why*
> it hasn't been done.
>
> For instance, a 'call' instruction could be viewed as nothing more
> than a specialized store instruction that stores the pc to
> the stack. Similarly, a 'ret' just retrieves (loads) the pc from the
> stack. Of course the stack pointer has to be adjusted, but that is

But on a RISC CPU that doesn't have return address storing calls, there
is not necessarily an explicit stack register. Such calls would require
one.

> It seems to me that using cisc style calls and returns could
> eliminate the two instructions from every non-leaf subroutine that
> are otherwise required to save and restore the link register. This
> should result in a modest performance gain of two per cent across the
> non-leaf routines.

The same work needs to be done in either case, so there's not necessarily
any difference in performance. It all depends on the implementation, of
course.

More importantly, with the link register style calls, the leaf routines
can save two memory accesses by not storing the return address, assuming
there are enough registers available.

> If implemented, this would mean my processor is no longer strictly a
> load / store architecture. Maybe I'll call it a HyRC - HYbrid Risc -
> Cisc ("Herc") processor ? Or better yet a CRHy ("Cry") processor.

I don't really see a problem with having call/ret in a RISC architecture
without giving it a new name. After all, as you say, those instructions
_are_ load/store, although specialised ones with side effects.

Personally, for a simple FPGA RISC, I'd also consider an internal call
stack. Naturally, that would be of limited size, but you could implement
explicit or implicit over-flow handling.
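
One way to picture an internal call stack with implicit overflow handling is the Python sketch below. It assumes the oldest entry is spilled to external memory when the on-chip stack is full and is refilled on underflow; that policy, and all names here, are assumptions for illustration rather than anything proposed in the thread.

```python
class InternalCallStack:
    """Fixed-depth on-chip return stack that spills to memory when full."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = []    # on-chip entries, newest last
        self.spilled = []  # stands in for external memory, newest last

    def push(self, return_address):
        if len(self.slots) == self.depth:
            # Implicit overflow handling: evict the oldest on-chip entry.
            self.spilled.append(self.slots.pop(0))
        self.slots.append(return_address)

    def pop(self):
        if not self.slots:
            # Underflow: refill from the most recently spilled entry.
            self.slots.append(self.spilled.pop())
        return self.slots.pop()


stack = InternalCallStack(depth=2)
for address in (1, 2, 3):
    stack.push(address)          # the third push spills address 1
returns = [stack.pop() for _ in range(3)]
```

Explicit overflow handling would instead raise a trap on the third push and leave the spill to software.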

> PS. I'm thinking of code naming my 240Mips EPIC processor

MIPS is usually used with some kind of definition other than maximum
number of instructions executed per second, since that is usually a very
useless number. Dhrystone MIPS seems to be the most common.

Also, don't you mean VLIW rather than EPIC? Of course, Intel seems to mean
much the same thing with EPIC, but a major difference in IA64 architecture
is that the length of the instruction groups is independent of the number
of execution units, unlike classic VLIW.

--
Chalmers University | Why are these | e-mail:
of Technology | .signatures |
| so hard to do | WWW: rand.thn.htu.se
Gothenburg, Sweden | well? | (fVDI, MGIFv5, QLem)







Re: CISCifying RISC calls and returns - Author Unknown - Nov 16 20:30:00 2001

First, thanks for the reply. It's very good to have someone to bounce
ideas off of.

Cisc style calls / returns are not a very significant performance
benefit (~1%-2%), but the implementation cost is also very low. So
this may be much ado about nothing. Still I'd like not to get
clobbered by something I should've thought of earlier in the design.

> > stack. Of course the stack pointer has to be adjusted, but that is
>
> But on a RISC CPU that doesn't have return address storing calls, there
> is not necessarily an explicit stack register. Such calls would require
> one.
>
Yes, there would have to be an explicit stack pointer. This is not so
worrisome because I'm trading the link register for the stack
pointer, so there is no difference in the number of registers used.
Also normal usage designates a register as the stack pointer, and due
to its usage by the compiler it's effectively dedicated anyway. So
really one of the benefits of a cisc style call / return is that it
effectively frees up a register. This might be very useful in say a
16 register cpu.

> > It seems to me that using cisc style calls and returns could
> > eliminate the two instructions from every non-leaf subroutine that
> > are otherwise required to save and restore the link register. This
> > should result in a modest performance gain of two per cent across the
> > non-leaf routines.
>
> The same work needs to be done in either case, so there's not necessarily
> any difference in performance. It all depends on the implementation, of
> course.

Yes, it's basically the same work; however, the cisc implementation
takes two fewer instructions and two fewer clock cycles to execute
(or maybe more, e.g. look at the PowerPC arch.), because the explicit
"store link register to stack" and "load link register from stack"
instructions are "wrapped up" into the call and return instructions,
which are also required in the risc machine.
This also has the effect of increasing code density, thus making
better use of any high speed memory that's available, like a cache.
>
> More importantly, with the link register style calls, the leaf routines
> can save two memory accesses by not storing the return address. Assuming
> there are enough register available.

Yes, but it does not change the performance because avoiding the
memory access does not avoid the two instructions which still have to
be executed, risc or cisc style (depending on the design). Using a
Harvard architecture the loads and stores are done in parallel with
the instruction access. So cycle wise (assuming single cycle access
to data memory) there should be no difference.

> Personally, for a simple FPGA RISC, I'd also consider an internal call
> stack. Naturally, that would be of limited size, but you could implement
> explicit or implicit over-flow handling.
>
I think so too. An internal stack would work very well for
a simple FPGA RISC. I was thinking of hacking the gr0040 to use an
internal stack rather than a link register. But alas, I haven't the
time... I'm too busy working on a not-so-simple cpu. You can get the
benefit of an internal stack, without having to worry about overflow,
by using a return address predictor.

Sorry for the long post. I think I'm going to go ahead and use cisc
style call / return. The return instruction I've got now executes the
equivalent of three risc instructions in a single cycle. (What's the
emoticon for bragging ?) The other impact of cisc-ifying is that I
can't have a jump and link (jal) instruction anymore, at least not
one that works with the stack. I guess it'll have to be a jump-and-
<blank> (ja_) instruction.

> > PS. I'm thinking of code naming my 240Mips EPIC processor
>
For anyone who hasn't figured it out already, I wasn't really serious
about the 240Mips EPIC processor. Go back and look at the post. Name
= "Puffin" = non-existent bird.

Rob http://www.birdcomputer.ca/






Re: CISCifying RISC calls and returns - Author Unknown - Nov 17 21:45:00 2001

> > But on a RISC CPU that doesn't have return address storing calls, there
> > is not necessarily an explicit stack register. Such calls would require
> > one.
> >
> Yes, there would have to be an explicit stack pointer. This is not so

Hey, I found a cheesy way around having an explicit stack pointer
register. Instead of the jump-and-link instruction specifying the
link register, it just specifies the stack pointer register to use
instead. So it's really a "jal [Rt],disp[Rn]" instruction, where Rt
indicates the stack pointer reg.
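
A rough Python model of how such a "jal [Rt],disp[Rn]" instruction might behave. The pre-decrement, the word addressing, the use of pc + 1 as the return address, and the function name are all assumptions; the post gives only the operand format.

```python
def jal_via_stack(regs, mem, pc, rt, rn, disp):
    """Push the return address through register rt, return the new pc."""
    target = regs[rn] + disp   # compute the target first, in case rn == rt
    regs[rt] -= 1              # assumed pre-decrement of the chosen stack pointer
    mem[regs[rt]] = pc + 1     # assumed return address: the next instruction
    return target


regs = [0] * 16
regs[14] = 200    # r14 plays the stack pointer for this call
regs[3] = 1000    # r3 holds the base address of the target
mem = {}
new_pc = jal_via_stack(regs, mem, pc=40, rt=14, rn=3, disp=8)
```

The Rt field that would have named the link register now names the register used as stack pointer, so no register is hard-wired to that role.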

My design has so many cheesy features I'm thinking of packaging it
up and offering it as junk food. Maybe I could call it the "Cheetos"
processor.

Rob http://www.birdcomputer.ca/





Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 18 0:21:00 2001

wrote:

> Hey, I found a cheesy way around having an explicit stack pointer
> register. Instead of the jump-and-link instruction specifying the
> link register, it just specifies the stack pointer register to use
> instead. So it's really a "jal [Rt],disp[Rn]" instruction, where Rt
> indicates the stack pointer reg.

And what is wrong with an explicit stack pointer anyhow?
In my current cpu design I have a memory accessed relative to a
global base register for static variables -G. I reserve a small amount
of static space relative to the G register for traps and interrupts.

memory G = n

n..n-?? { irq data }
n+?..n { static variables }
??..n+? { program/data }

This has two advantages, provided irq/swi service routines don't
need much free space. 1) You don't have stack overflows due to irqs.
2) You can have cleaner software emulation of instructions that
deal with the stack.

> My design has so many cheesy features I'm thinking of packaging it
> up and offering it as junk food. Maybe I could call it the "Cheetos"
> processor.

Oat flavored core memory rings come to mind. :)
Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html







Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 18 3:43:00 2001

wrote:

> I've been wondering how that global register works. Is it like an x86
> segment register ? I'm a little unclear as to how the memory space is
> divided. Are you saying irq data is placed in memory beneath the G
> register, then comes static variables and finally program / data ?
> Could you show an example with (relative) addresses ?
Yep just like that.
In this version I have no memory segments like the x86 processor.
This is a flat 16 meg address space with ram/rom at the bottom of memory
and a memory mapped I/O page at the top of memory. Note: due to the pin
limitations of the FPGA emulating a 40 pin dip, memory is limited
externally to 1 meg.

0   { interrupt service vectors }
    { O/S kernel RAM/ROM }
8k  { program(s) and free space }
?   { free space - NO RAM }
-4k { I/O page }
    { FF:FFFn <console uart> }

A program is loaded into a free memory block and remains there for the
life of the program. Address constants are fixed up by the loader for this
location. It is up to the loader routine to allot static space for
interrupt service data and variables like 'the end of free memory'.
Negative offsets from the base register G could be used for system
variables or system function vectors.

> Thanks
> Rob
>
Ben Franchuk
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 18 3:55:00 2001

wrote:

> There is absolutely nothing wrong with an explicit stack pointer. I
> think it's a good idea. I've always wondered why being able to
> specify any register as the stack pointer in a risc machine is a big
> deal. It's advertised as a feature sometimes. Why would you want to
> be able to do this anyway ? I can't think of a good reason. In fact I
> think it's a bad idea. Nothing like encouraging software development
> that uses different conventions for register usage. The only reason I
> allow it is because the register field is available in the
> instruction; I was thinking of forcing the register to the sp but it
> is more hardware, still I might do it anyway. My next task is to see
> if I can implement push / pop instructions.

Providing you have push/pop for general registers even stack intensive
programs like FORTH should work fine.
Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html







Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 18 11:20:00 2001

Eric Smith wrote:
>
> Ben wrote:
> > Providing you have push/pop for general registers even stack intensive
> > programs like FORTH should work fine.

> If you mean being able to use an arbitrary register as a stack pointer
> for push/pop operations (or at least having two registers to choose from),
> that will support FORTH just fine.

I meant the above. I assumed that a general register could be used for
auto-increment/decrement as this is a modern design.
I forgot about x86 style 'wannabe' cpu's.
Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: CISCifying RISC calls and returns - Author Unknown - Nov 18 15:10:00 2001

>
> And what is wrong with a explicit stack pointer anyhow?
There is absolutely nothing wrong with an explicit stack pointer. I
think it's a good idea. I've always wondered why being able to
specify any register as the stack pointer in a risc machine is a big
deal. It's advertised as a feature sometimes. Why would you want to
be able to do this anyway ? I can't think of a good reason. In fact I
think it's a bad idea. Nothing like encouraging software development
that uses different conventions for register usage. The only reason I
allow it is because the register field is available in the
instruction; I was thinking of forcing the register to the sp but it
is more hardware, still I might do it anyway. My next task is to see
if I can implement push / pop instructions.

> In my current cpu design I have a memory accessed relative to a
> global base register for static variables -G. I reserve a small amount
> of static space relative to the G register for traps and interrupts.
>
> memory G = n
>
> n..n-?? { irq data }
> n+?..n { static variables }
> ??..n+? { program/data }
>
I've been wondering how that global register works. Is it like an x86
segment register ? I'm a little unclear as to how the memory space is
divided. Are you saying irq data is placed in memory beneath the G
register, then comes static variables and finally program / data ?
Could you show an example with (relative) addresses ?

Thanks
Rob






RE: Re: CISCifying RISC calls and returns - Jan Gray - Nov 19 10:44:00 2001

> I've always wondered why being able to
> specify any register as the stack pointer in a risc machine is a big
> deal. It's advertised as a feature sometimes. Why would you want to
> be able to do this anyway ? I can't think of a good reason.

This delivers flexibility, and more importantly, simplicity, in the
instruction set and in the implementation. Using a general purpose
register instead of a dedicated register can save you instructions to
move to/from the DR, and control logic and *multiplexers* in the
datapath.

Muxes are not such an issue in a full custom design but are relatively
rather expensive in FPGAs -- a single 2-1 mux can be the same size (1
LUT/bit) as a simple ALU or a register file. And by avoiding these
extra muxes in the datapath you may be able to clock the critical paths
faster.

This is not to say dedicated registers doing dedicated or autonomous
tasks are always a bad idea, but you have to be clear on what the
additional costs are before focusing solely on benefits.

If you start with a simple (or at any rate, a working) design and then
measure/simulate the impact of each proposed enhancement on
performance (== 1/(instructions * cycles/instruction * cycle time) )
your design may improve in a quantitative, methodical fashion. (But
beware local maxima!)
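
The performance relation above can be turned into a small calculator. The Python sketch below uses made-up numbers purely for illustration: an enhancement that removes 2% of the instructions but lengthens the cycle time by 5%. None of these figures come from the thread, and the function names are hypothetical.

```python
def performance(instructions, cycles_per_instruction, cycle_time_ns):
    """performance == 1 / (instructions * cycles/instruction * cycle time)."""
    return 1.0 / (instructions * cycles_per_instruction * cycle_time_ns)


def speedup(baseline, enhanced):
    """Ratio of enhanced to baseline performance; above 1.0 is a win."""
    return performance(*enhanced) / performance(*baseline)


# Hypothetical workload: (instruction count, CPI, cycle time in ns).
baseline = (1000, 1.5, 20.0)
enhanced = (980, 1.5, 21.0)   # 2% fewer instructions, 5% slower clock

ratio = speedup(baseline, enhanced)
print(f"speedup = {ratio:.3f}")
```

With these assumed numbers the ratio comes out below 1.0: the instruction-count saving is outweighed by the slower clock, which is exactly the kind of cost the formula is meant to expose.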

One more comment. In instruction sets where there is "opcode pressure",
but some time slack in the instruction decoder, it may be helpful to
have a family of "stack pointer relative" load/store instructions where
the "by convention stack pointer" general purpose register number is
encoded implicitly, just as in the xr16 CALL instruction the return
address linkage register (r15) is encoded implicitly.

Simple is beautiful: http://www.fpgacpu.org/log/sep00.html#000919

Jan Gray
Gray Research LLC






Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 19 20:25:00 2001

wrote:

> I have to admit, stacking the pc register instead of using a link
> register does lead to a little more hardware.<snip>

The link register idea is faster for short subroutine calls rather
than inline code. getchar() comes to mind here.

> I don't know if I'll implement push and pop instructions because the
> pop instruction requires a register file with two write ports in
> order to execute in a single cycle.

Why don't you define them as macros under whatever assembler you
use? This way code will not change regardless of the hardware.

> The processor I'm working on now (the BlueBird) is getting to be
> quite large (over 40% of SpartanII 2S200). It has loads of different
> instruction formats, complex decoding, a four stage instruction
> pipeline and other features. I've found I've been able to add complex
> features to the processor without affecting the performance because
> the thing that limits performance right now is access to the block
> ram. My instincts tell me that the block ram access is what is going
> to limit performance and I've not worried so much about the
> complexity of the decoder. If anybody wants a good laugh, the current
> verilog source for the bluebird (not working yet) is available on my
> website.

Nah ... I got the same good laugh on my site with schematics. :)

The other thing that is bothering me at this late stage of design is the
exact structure of the memory timing and memory bus as well as the I/O
devices. Right now I am running at the rather slow rate of 1 MHz external
with a 68xx style clock (4 MHz in), but I may still have to slow the
system down. Yet looking at the I/O chips, I can find very few that are
high speed ones - uarts, floppy disc controllers, PIAs.
I wish a few more I/O devices were open source in an FPGA so one does
not have to wait forever for I/O devices.
Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html







Re: CISCifying RISC calls and returns - Author Unknown - Nov 20 1:29:00 2001

> This delivers flexibility, and more importantly, simplicity, in the
> instruction set and in the implementation.

I was thinking in terms of from a programming standpoint. AFAIK
programming wise, there is no advantage. In terms of hardware, I
won't argue there is no advantage. It depends on your goals. Keeping
things simple and general purpose leads to hardware that has a small
footprint and is fast.

I have to admit, stacking the pc register instead of using a link
register does lead to a little more hardware. The overwhelming reason
to do things that way is that I find it aesthetically pleasing. If you
are looking at the benefit versus cost from a mathematical
perspective there is probably little or no difference, in which case
it would logically be better to go with the simpler hardware. But, I
was brought up on the likes of the 6502, Z80, 68k, x86 processors and
the risc paradigm of using a link register just seems plain alien,
especially considering what you really want to do with the pc is
stack it on a subroutine call. The human factor.

I don't know if I'll implement push and pop instructions because the
pop instruction requires a register file with two write ports in
order to execute in a single cycle. There's also no real reason to
implement them as subroutine parameters should be passed via
registers, not on the stack.

The processor I'm working on now (the BlueBird) is getting to be
quite large (over 40% of SpartanII 2S200). It has loads of different
instruction formats, complex decoding, a four stage instruction
pipeline and other features. I've found I've been able to add complex
features to the processor without affecting the performance because
the thing that limits performance right now is access to the block
ram. My instincts tell me that the block ram access is what is going
to limit performance and I've not worried so much about the
complexity of the decoder. If anybody wants a good laugh, the current
verilog source for the bluebird (not working yet) is available on my
website.

I've been wondering what Xilinx is going to do a couple of years from
now when their customers start asking for memory management for the
MicroBlaze soft cpu. My guess is extra pipeline stalls.

Sorry for the long post.
Rob http://www.birdcomputer.ca

PS. Does AFAIK = as far as I know ? (I've been guessing what all the
internet acronyms are.)






Re: Re: CISCifying RISC calls and returns - Tommy Thorn - Nov 20 1:58:00 2001

--- wrote:
> I was thinking in terms of from a programming
> standpoint. AFAIK programming wise, there is no
> advantage.

Not so. Leaf routines need not save the link register
to the stack and thus can avoid an expensive load and
an expensive store. Studies show that the majority
of processing time is spent in the "bottom of the
call tree," that is, leaf calls, so this could be a
significant saving.
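
To make the leaf-call argument concrete, here is a small Python sketch that counts return-address memory accesses over a call tree under both conventions. The call tree and function names are invented for illustration; it assumes every stack-style call costs one store and one load, and that with a link register only non-leaf callees spill it.

```python
def linkage_accesses(tree, root, style):
    """Count return-address loads and stores for one execution of root."""
    total = 0
    for callee in tree[root]:
        is_leaf = not tree[callee]
        if style == "stack" or not is_leaf:
            total += 2  # one store on entry, one load on return
        total += linkage_accesses(tree, callee, style)
    return total


# Hypothetical program: each name maps to the calls it makes, in order.
call_tree = {
    "main": ["parse", "emit"],
    "parse": ["getchar", "getchar"],
    "emit": ["putchar"],
    "getchar": [],
    "putchar": [],
}

stack_style = linkage_accesses(call_tree, "main", "stack")
link_style = linkage_accesses(call_tree, "main", "link")
```

The count says nothing about whether those accesses cost cycles on a given implementation; it only shows how many of them the leaf calls account for.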

The advantage of multiple link registers is harder to
exploit, but in the case of a whole-program globally
optimizing compiler (whole-program optimization makes
a lot of sense for embedded devices) a sufficiently
clever register allocator can generalize this principle
to the n lowest levels in the tree. Granted, I only
know of a few compilers doing this.

These days new CPUs are most often co-designed with the
compiler, as no CPU is better than the code it runs.
However, if you intend to program your CPU directly in
assembler, you have a lot more freedom.

/Tommy







Re: CISCifying RISC calls and returns - Author Unknown - Nov 20 6:29:00 2001

<Rob being offensive> First, *read the post*, I was talking about the
stack pointer not the link register.

<Rob trying to be nice because he wants to be enlightened>
Please note I can be pretty bull-headed about things until I am
really convinced otherwise. It's just that I'm a digital thinker and
I've got built-in Schmitt triggers with lots of hysteresis. :)
Perhaps my perspective will help. I am looking at the cpu from the
viewpoint that it's used in a reasonably powerful system: something
with more than 64k memory and a multi-threaded OS, not a simple embedded system.

Anyway...

> > I was thinking in terms of from a programming
>
> Not so. Leaf routines need not save the link register
> to the stack and thus can avoid an expensive load and
> an expensive store.

The fallacy is assuming the load and store are expensive. In a modern
(Harvard) processor the load and store to the data cache (memory)
would be done the same cycle the call and return instructions are
executing and therefore are effectively done for free.

>Studies show that the majority
>of processing time is spent in the "bottom of the
>call tree," that is, leaf calls, so this could be a
>significant saving.

Unfortunately, probably not a savings. In most cases the compiler
doesn't know whether or not it has to save and restore the link
register so it saves and restores the link register just to be safe.
For instance it is quite common to use third party libraries for all
those "bottom of the call tree" functions, and the compiler doesn't
know whether or not those routines will call other routines. Also, any
short function (not using third party libraries) will probably be
inlined by the compiler, eliminating the function call overhead
completely; so all that's left is the larger functions which
invariably are not leaf functions. So... using a link register
represents two extra instructions and clock cycles in every routine
where it cannot be determined to be safe to not save / restore.
At least I have not seen a language compiler that will allow you to
specify which external routines are leaf routines and which are not.
If I was a compiler writer, I'd just always save and restore the link
register when dealing with external routines just to be safe, because
the idiot that specified a routine was a leaf routine might be wrong.
Sure my code would run 1% slower in that case, but at least it would
be guaranteed to work.
>
> The advantage of multiple link registers is harder to
> exploit but in the case of a whole-program globally
> optimizing compiler (whole-program optimization make
> a lot of sense for embedded devices) a sufficiently
> clever register allocator can generalize this principle

That's the beauty of eliminating the link register. It makes this
type of optimization unnecessary, because the return address is
simply stored on the stack, which is equally fast as storing it in a
link register.
>
> These days new CPUs are most often co-designed with the
> compiler as no CPU is better than the code it runs.

Yes, exactly. If the compiler can't make use of the link register,
then why have it ? The link register makes hardware simpler at the
expense of making software (compiler) more complicated (and slower).

Rob http://www.birdcomputer.ca/





Re: Re: CISCifying RISC calls and returns - Goran Bilski - Nov 20 11:58:00 2001

Hi,

Since I designed the MicroBlaze, I can say why MicroBlaze ended up the way it
did.
There is a tradeoff between area and performance and I tried to find a good
compromise.
MicroBlaze has good enough performance and can still be implemented even in
smaller devices.
I could have made MicroBlaze smaller but with less performance or vice versa.

In MicroBlaze, this is what I did for the link register.
In PowerPC the link register is a special register, and in order to save it
you first have to move it to a general purpose register; the same goes for
restoring it.
So it actually takes three instructions: one for the branch, one for moving
the link register to a general purpose register, and one for saving to memory.
Since MicroBlaze has 32 registers, I can use one of them as the link register
(r15) and thus save one instruction compared to PPC.
A branch and save onto stack instruction has to do
1. PC <- PC + offset
2. mem(r1) <- PC
3. r1 <- r1 - 1
This would introduce one adder, one mux and more decoding logic and would
increase the size of MicroBlaze by 10%.
The gain would be 1 clock cycle for every branch to a subroutine that isn't a
leaf subroutine.
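
The three transfers listed above can be sketched in Python as one simultaneous update. The sketch assumes all three right-hand sides read the values from before the instruction, as register transfers on a single clock edge usually do; the function name and the word-addressed memory are assumptions, not MicroBlaze details.

```python
def branch_and_save(pc, r1, mem, offset):
    """Apply the three transfers at once, using pre-instruction values."""
    new_mem = dict(mem)
    new_mem[r1] = pc        # 2. mem(r1) <- PC       (old PC, old r1)
    new_pc = pc + offset    # 1. PC <- PC + offset   (the existing branch adder)
    new_r1 = r1 - 1         # 3. r1 <- r1 - 1        (needs a second adder)
    return new_pc, new_r1, new_mem


pc, r1, mem = branch_and_save(pc=100, r1=50, mem={}, offset=24)
```

Written this way, the decrement in step 3 happens alongside the branch target addition in step 1, which is presumably where the extra adder in the estimate comes from.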

Which one is better depends on the application, but I took the decision to
have a link register because it is cleaner and wouldn't force
one register to be the stack pointer. Adding the logic may or may not
decrease the clock frequency; even if it's not on the critical path,
the design will be bigger and thus move things further apart from each other,
which can in itself introduce larger routing delays.

For the coming MMU on MicroBlaze, yes, an MMU will introduce a 1 clock cycle
penalty for looking up an MMU table.
In PPC that's normal.

Göran Bilski

wrote:

> > This delivers flexibility, and more importantly, simplicity, in the
> > instruction set and in the implementation.
>
> I was thinking in terms of from a programming standpoint. AFAIK
> programming wise, there is no advantage. In terms of hardware, I
> won't argue there is no advantage. It depends on your goals. Keeping
> things simple and general purpose leads to hardware that has a small
> footprint and is fast.
>
> I have to admit, stacking the pc register instead of using a link
> register does lead to a little more hardware. The overwhelming reason
> to do things that way is that I find it aestically pleasing. If you
> are looking at the benefit versus cost from a mathematical
> perspective there is probably little or no difference, in which case
> it would logically be better to go with the simpler hardware. But, I
> was brought up on the likes of the 6502,Z80,68k,x86 processors and
> the risc paradigm of using a link register just seems plain alien,
> especially considering what you really want to do with the pc is
> stack it on a subroutine call. The human factor.
>
> I don't know if I'll implement push and pop instructions because the
> pop instruction requires a register file with two write ports in
> order to execute in a single cycle. There's also no real reason to
> implement them as subroutine parameters should be passed via
> registers, not on the stack.
>
> The processor I'm working on now (the BlueBird) is getting to be
> quite large (over 40% of SpartanII 2S200). It has loads of different
> instruction formats, complex decoding, a four stage instruction
> pipeline and other features. I've found I've been able to add complex
> features to the processor without affecting the performance because
> the thing that limits performance right now is access to the block
> ram. My instincts tell me that the block ram access is what is going
> to limit performance and I've not worried so much about the
> complexity of the decoder. If anybody wants a good laugh, the current
> verilog source for the bluebird (not working yet) is available on my
> website.
>
> I've been wondering what Xilinx is going to do a couple of years from
> now when their customers start asking for memory management for the
> MicroBlaze soft cpu. My guess is extra pipeline stalls.
>
> Sorry for the long post.
> Rob http://www.birdcomputer.ca
>
> PS. Does AFAIK = as far as I know ? (I've been guessing what all the
> internet acronyms are.
>





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: Re: CISCifying RISC calls and returns - Goran Bilski - Nov 20 12:04:00 2001

Hi,

I forgot to mention that the MicroBlaze branch-and-link instruction can use any
of the 32 registers as the link register.
The compiler, however, needs one fixed register, so r15 is the link register
according to the ABI.
But an assembly programmer can use any of the 32 registers as the link register.
I use this in the ROM monitor, where I don't want a stack but need to call
subroutines, so I use 3 registers as different link registers.
So with the MicroBlaze scheme you can have some calling depth without needing to
save the link register to the stack.
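
As a rough illustration of the idea (a toy model, not MicroBlaze code), the sketch below nests three calls using three different link registers and no stack. The instruction tuples, the addresses, and the choice of r15/r16/r17 are assumptions made up for the example.

```python
# Toy model of branch-and-link with a caller-chosen link register.
# The op format, addresses and register numbers are illustrative assumptions.

def run(program, entry):
    """Run a tiny program and return the list of addresses executed."""
    regs = [0] * 32              # 32 general-purpose registers
    pc = entry
    trace = []
    while True:
        op = program[pc]
        trace.append(pc)
        if op[0] == "call":
            _, target, link = op
            regs[link] = pc + 1  # return address goes into the chosen register
            pc = target
        elif op[0] == "ret":
            pc = regs[op[1]]     # return through the same register
        elif op[0] == "nop":
            pc += 1
        else:                    # "halt"
            return trace

program = {
    0:  ("call", 10, 15),   # main calls A, linking through r15
    1:  ("halt",),
    10: ("call", 20, 16),   # A calls B, linking through r16
    11: ("ret", 15),
    20: ("call", 30, 17),   # B calls C, linking through r17
    21: ("ret", 16),
    30: ("nop",),           # C is a leaf routine
    31: ("ret", 17),
}

trace = run(program, 0)
```

Each level of nesting needs its own link register, so the usable call depth is bounded by how many registers you are willing to set aside.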

Göran Bilski

Goran Bilski wrote:

> Hi,
>
> Since I designed the MicroBlaze, I can say why MicroBlaze ended up the way it
> did.
> There is a tradeoff between area and performance and I tried to find a good
> compromise.
> MicroBlaze has good enough performance and can still be implemented even in
> smaller devices.
> I could have made MicroBlaze smaller but with less performance, or vice versa.
>
> In MicroBlaze, this is what I did for the link register.
> In PowerPC the link register is a special register, and in order to save it
> you first have to move it to a general purpose register; the same goes for
> restoring it.
> So it actually takes three instructions: one for the branch, one for moving
> the link register to a general purpose register and one for saving it to memory.
> Since MicroBlaze has 32 registers, I can use one of them as the link register
> (r15) and thus save one instruction compared to PPC.
> A branch-and-save-onto-stack instruction has to do:
> 1. PC <- PC + offset
> 2. mem(r1) <- PC
> 3. r1 <- r1 - 1
> This would introduce one adder, one mux and more decoding logic and would
> increase the size of MicroBlaze by 10%.
> The gain would be 1 clock cycle for every branch to a subroutine that isn't a
> leaf subroutine.
>
> Which one is better depends on the application, but I took the decision to
> have a link register because it is cleaner and wouldn't force
> one register to be the stack pointer. Adding the logic may or may not
> decrease the clock frequency; even if it's not on the critical path,
> the design will be bigger and thus move things further apart from each other,
> which can in itself introduce larger routing delays.
>
> For the coming MMU on MicroBlaze, yes, an MMU will introduce a 1 clock cycle
> penalty for looking up an MMU table.
> In PPC that's normal.
>
> Göran Bilski
>
> wrote:
>
> > > This delivers flexibility, and more importantly, simplicity, in the
> > > instruction set and in the implementation.
> >
> > I was thinking in terms of from a programming standpoint. AFAIK
> > programming wise, there is no advantage. In terms of hardware, I
> > won't argue there is no advantage. It depends on your goals. Keeping
> > things simple and general purpose leads to hardware that has a small
> > footprint and is fast.
> >
> > I have to admit, stacking the pc register instead of using a link
> > register does lead to a little more hardware. The overwhelming reason
> > to do things that way is that I find it aesthetically pleasing. If you
> > are looking at the benefit versus cost from a mathematical
> > perspective there is probably little or no difference, in which case
> > it would logically be better to go with the simpler hardware. But, I
> > was brought up on the likes of the 6502,Z80,68k,x86 processors and
> > the risc paradigm of using a link register just seems plain alien,
> > especially considering what you really want to do with the pc is
> > stack it on a subroutine call. The human factor.
> >
> > I don't know if I'll implement push and pop instructions because the
> > pop instruction requires a register file with two write ports in
> > order to execute in a single cycle. There's also no real reason to
> > implement them as subroutine parameters should be passed via
> > registers, not on the stack.
> >
> > The processor I'm working on now (the BlueBird) is getting to be
> > quite large (over 40% of SpartanII 2S200). It has loads of different
> > instruction formats, complex decoding, a four stage instruction
> > pipeline and other features. I've found I've been able to add complex
> > features to the processor without affecting the performance because
> > the thing that limits performance right now is access to the block
> > ram. My instincts tell me that the block ram access is what is going
> > to limit performance and I've not worried so much about the
> > complexity of the decoder. If anybody wants a good laugh, the current
> > verilog source for the bluebird (not working yet) is available on my
> > website.
> >
> > I've been wondering what Xilinx is going to do a couple of years from
> > now when their customers start asking for memory management for the
> > MicroBlaze soft cpu. My guess is extra pipeline stalls.
> >
> > Sorry for the long post.
> > Rob http://www.birdcomputer.ca
> >
> > PS. Does AFAIK = as far as I know? (I've been guessing what all the
> > internet acronyms are.)
> >

Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 20 12:20:00 2001

Goran Bilski wrote:
>
> Hi,
>
> I forgot to mention that the MicroBlaze branch-and-link instruction can use any
> of the 32 registers as the link register.
> The compiler, however, needs one fixed register, so r15 is the link register
> according to the ABI.
> But an assembly programmer can use any of the 32 registers as the link register.
> I use this in the ROM monitor, where I don't want a stack but need to call
> subroutines, so I use 3 registers as different link registers.
> So with the MicroBlaze scheme you can have some calling depth without needing to
> save the link register to the stack.
>
> Göran Bilski

Years back I remember in Dr Dobb's (early 1980's?) somebody had an idea
for marking C functions' calling depth so that leaf functions could use
static variables rather than stack variables, for machines like the Z80
that had a hard time with stack offsets. I suspect the same idea would
work for the link register too.

Ben Franchuk
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 20 12:26:00 2001

"" wrote:
>
> >In most cases the compiler doesn't know whether or not it has to save and
> >restore the link register so it saves and restores the link register just
> >to be safe.
> Some compilers have a mechanism to allow the programmer to tell the compiler
> a routine is a leaf function. Some versions of GCC have this option. GCC for ARM that I use at work does.
> If you are using LCC, you could build it into the compiler to pass down to
> the back end in some way.
>
> Just a thought.
>
> Veronica

Other than Small C, are GCC and LCC the only reasonably easy-to-port C
compilers out there?
I still figure a C compiler (source & basic library) need not take up
several meg! to compile and run.

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html






RE: Re: CISCifying RISC calls and returns - Author Unknown - Nov 20 12:31:00 2001

>In most cases the compiler doesn't know whether or not it has to save and
>restore the link register so it saves and restores the link register just
>to be safe.
Some compilers have a mechanism to allow the programmer to tell the compiler
a routine is a leaf function. Some versions of GCC have this option. GCC for ARM that I use at work does.
If you are using LCC, you could build it into the compiler to pass down to
the back end in some way.

Just a thought.

Veronica

RE: Re: Re: CISCifying RISC calls and returns - Author Unknown - Nov 20 12:56:00 2001

Ben

>Other than Small C, are GCC and LCC the only reasonably easy-to-port C
>compilers out there?
LCC isn't big. It isn't difficult to port. LCC is ANSI C only. The library
support is poor.
GCC is big. It is difficult to port. It is C and C++. The library support is good.

>I still figure a C compiler (source & basic library) need not take up
>several meg! to compile and run.
Exactly.

I am working with LCC and have been pleased with it. I have ported it to
DOS, compiling it with DJGPP (GCC ported to DOS) - it needs a 32 bit
compiler.

I have built a MIPS version and the NULL back end version and have been using
them to compile my target code so I can get an idea of what optimisations I
can make in the back end and in my final FPGA target.

Veronica

Re: Re: CISCifying RISC calls and returns - Goran Bilski - Nov 20 13:02:00 2001

Hi,

LCC takes one week to port, but the optimization isn't that great. No good register
allocation, etc.
No libraries.
GCC takes 1-3 months to port, but the result is much better.

Göran Bilski

"" wrote:

> Ben
>
> >Other than Small C, are GCC and LCC the only reasonably easy-to-port C
> >compilers out there?
> LCC isn't big. It isn't difficult to port. LCC is ANSI C only. The library
> support is poor.
> GCC is big. It is difficult to port. It is C and C++. The library support is good.
>
> >I still figure a C compiler (source & basic library) need not take up
> >several meg! to compile and run.
> Exactly.
>
> I am working with LCC and have been pleased with it. I have ported it to
> DOS, compiling it with DJGPP (GCC ported to DOS) - it needs a 32 bit
> compiler.
>
> I have built a MIPS version and the NULL back end version and have been using
> them to compile my target code so I can get an idea of what optimisations I
> can make in the back end and in my final FPGA target.
>
> Veronica

Re: Re: CISCifying RISC calls and returns - Veronica Merryfield - Nov 21 17:46:00 2001

Rob

I meant to pick up on this when I saw it, so apologies for the delay.

>Perhaps my perspective will help. I am looking at the cpu from the
>viewpoint that it's used in a reasonably powerful system. Something
>64k memory with a multi-threaded OS, not a simple embedded system.

This was said in a broader discussion about link registers versus the stack,
and cache was also mentioned.

For a multi-threaded system, link register versus stack is not going to figure
highly in the things that take time. Issues such as cache flushing/invalidation,
context saving and restoration, MMU loading (if you have one), schemes
for process (memory space) management, and inter-process communication
are all likely, in some form or other, to be of more concern to you.

I would start by looking at these issues and what you can do in the core to help.
There are a number of things you can do.

Context saving is first on the list. Come up with a scheme that will enable
you to save and restore contexts quickly. ARM cores have load and store
multiple instructions that will save all the registers in one instruction.
This means that the cycle cost over and above saving one register is one
write cycle per additional register.
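
To put rough numbers on that, here is a toy cycle model. It reads the point above as "each extra register costs one more write cycle", and it assumes one cycle to issue an instruction and one cycle per memory write; both costs are illustrative assumptions, not figures for any real ARM core.

```python
# Toy cycle model for saving a register context.
# ISSUE_CYCLES and WRITE_CYCLES are assumed values for illustration only.
ISSUE_CYCLES = 1   # assumed cost to fetch/issue one store instruction
WRITE_CYCLES = 1   # assumed cost of one memory write

def separate_stores(n_regs):
    """One store instruction per register: pay issue + write every time."""
    return n_regs * (ISSUE_CYCLES + WRITE_CYCLES)

def store_multiple(n_regs):
    """One store-multiple instruction: pay issue once, then one write per register."""
    return ISSUE_CYCLES + n_regs * WRITE_CYCLES

# Cycles saved when a 16-register context is written with one store-multiple.
saved = separate_stores(16) - store_multiple(16)
```

Under these assumptions the saving grows with the number of registers in the context, which is why it matters more for a full context switch than for a single call.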

Caches in a small system aren't likely to be a problem, but adding an MMU
could be. Depending on your process model, the choice between physical and
virtual tagging can make a big difference.

What about priority and privilege handling? Using supervisor/kernel services
and the need to move between privileges?

The core implementation can have dramatic implications on these.

For the record, I manage a software team that supports customers porting the
kernel and HAL to new targets in a company that supplies an OS. I work with
our main CPU core supplier on ways to make things better both in the kernel
implementation and in the CPU.

In short, think about the kernel software and the core features together, not
in isolation.

Veronica






Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 21 20:52:00 2001

Veronica Merryfield wrote:
>
> Rob

> Context saving is first on the list. Come up with a scheme that will enable
> you to save and restore contexts quickly. ARM cores have load and store
> multiple instructions that will save all the registers in one instruction.
> This means that the cycle cost over and above saving one register is one
> write cycle per additional register.

Don't forget the old 'exchange' register instructions like on the Z80.
They too could be adapted easily with the larger FPGA ram nowadays.

> In short, think about the kernel software and the core features together, not
> in isolation.

Let's not forget peripheral devices too. With an FPGA for each device you can
have smart dumb devices. Example: a hard reset would load sector 0 into
memory. Handle DMA and timing with common fixed formats. The more you can
offload from the main CPU, the better.

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html






Re: Re: CISCifying RISC calls and returns - Ben Franchuk - Nov 22 13:15:00 2001

wrote:
>
> Hi Veronica,
>
> > highly in the things that take time. Issues such as cache
> > flushing/invalidation,
>
> Cache flushing is a problem right now as there is no good way to
> invalidate the cache all at once. I've thought of a couple of
> different hardware schemes, but they're ugly, so I'm going to rely on
> software for now. This does add to the context switch overhead, but
> the caches implemented with block ram are fairly small.

Does all the software need to have the same cache size? Can't you have
small and medium fixed-size caches for the core kernel/IRQ service and
the standard cache for everything else? The core kernel is the
task switching/MMU handling type stuff? What about going back to
the old idea of fast fixed-area memory for the important stuff?

> At the moment I have a simple mmu that uses a page allocation table
> to track which process owns a memory page. All addresses are physical
> addresses (no virtual memory management). Thus the mmu does not need
> to be reloaded on a context switch. Initially I did look at
> implementing a virtual memory system, but it was really overkill for
> the hardware I have available, and I wanted to concentrate on the cpu
> for now. Never-the-less I have anticipated going to a more complex
> memory management system in the future. It's one of the reasons why I
> chose not to use delayed branches for instance. (I've assumed that I
> might need a longer processor pipeline).

You might want to have an option bit in software to say "this can
be virtual" so you can keep a real-time OS for important things.
Not having used a real-time OS or read much on them, I would expect
time-based instructions would be useful. Test I/O and sleep for 10 us
would be handy for handling, say, a floppy chip.

> Ooops! I started looking at these issues quite a bit before I started
> working on this project. I've spent some time investigating how
> operating systems are put together. A few years ago, when DOS was
> popular (ugh) I put together a simple RTOS executive. Before I
> started working on this project, I was working on a message based
> multi-tasking operating system. Actually I want to build the
> processor so I can put my OS on it ;)

I got my processor -- now I need I/O and an OS. :)

> In any case, this type of instruction might be difficult to implement
> in a superscalar processor. Example: in a superscalar architecture
> with dual port access to the data cache, it would be possible to
> load / store two registers per clock cycle using independent
> instructions.

But main memory is still serial :(.

> Hopefully I've done this. For instance, one of the things I've
> chosen to do is provide only a single interrupt vector location
> through which all external interrupts are processed. The actual
> interrupt occurring is encoded in the status register. If you look at
> modern OS code, basically all it does for an interrupt is save the
> context, select a mailbox number to send to, then jump to a common
> routine to send an interrupt message. By vectoring to a single
> interrupt routine to begin with, it makes the code a little bit
> smaller and faster. IE. if OS's aren't really making good use of
> having multiple int vectors, then why supply them ?

This is where hardware next-task processing would be useful. Take the
PC, for example, with a high-speed UART: interrupt, massive IRQ
service, massive task switch. Not the best idea. IRQ/SWI are best done as
slow things.

> I've thought of trying to port linux for my processor. However,
> looking at the linux kernel I was not that impressed. It seems to be
> based on 1960's (unix) technology (although I've not looked at it
> recently). There are things that a modern OS paradigm does better.

True -- I like multi-tasking but I favor ONE user per computer.
It has some good features, but I just formatted my Linux HD so I could
have room for Windows PCB design software, as I need a prototype board
developed. Real memory / real I/O / real FLASH / real neat.

> Rob http://www.birdcomputer.ca/
Ben Franchuk.







Re: Re: CISCifying RISC calls and returns - Veronica Merryfield - Nov 22 16:10:00 2001

>Does all the software need to have the same cache size? Can't you have
>small and medium fixed-size caches for the core kernel/IRQ service and
>the standard cache for everything else? The core kernel is the
>task switching/MMU handling type stuff? What about going back to
>the old idea of fast fixed-area memory for the important stuff?

As I was driving home tonight I was thinking about the issues this brings
up. There are certain known processes that could have fixed "things"
set aside for them, thus speeding up context switching. Cache and
operating registers are one example, hence some CPUs have register
banks. The ARM has several for different modes - user, supervisor,
interrupt, etc. The Z80 had two banks with exchange instructions and
I'm sure you know of other examples. Intel were trying to solve some
of these issues with the segment scheme. Caches in general, though, are not
tied to a process, although the new architecture 6 from ARM has gone
this route (because we asked them) along with the MMU. This means
that cache line invalidation and loading can be tied to a process, but this
is complex behaviour.

On another note, interrupt handling. It is important to be able to establish
priority when more than one source has interrupted. Some architectures
make this a lengthy process, especially when status registers are behind
complex logic, as it adds to the read time significantly. Most OSes have
one interrupt handling routine that dispatches or "messages" a thread
depending on the structure. The whole latency is important - start ISR,
establish which device has interrupted, get to execute handler code. There
are many issues. Some OSes have to be open and flexible and don't know
anything about the target. Others, like the ones we are talking about, are
smaller and the target is well known. These factors make a big difference
to the architecture, and both the OS software and core/peripherals can be
dealt with together.
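
A minimal sketch of that "one ISR that messages a thread" structure follows, assuming a status word with one pending bit per source. The bit assignments, the priority order, and the mailbox names are hypothetical, chosen only for the example.

```python
# Sketch: a single interrupt entry point that establishes priority from a
# status word and posts a message to the handler thread's mailbox.
# Bit numbers, priority order and mailbox names are hypothetical.
from collections import deque

PRIORITY_ORDER = [3, 0, 2, 1]   # status bit numbers, highest priority first
MAILBOX_FOR_BIT = {0: "timer", 1: "uart", 2: "disk", 3: "power_fail"}

mailboxes = {name: deque() for name in MAILBOX_FOR_BIT.values()}

def common_isr(status):
    """Service the highest-priority pending source; return remaining status bits."""
    for bit in PRIORITY_ORDER:
        if status & (1 << bit):
            # "Message" the thread that owns this device.
            mailboxes[MAILBOX_FOR_BIT[bit]].append(("irq", bit))
            # Acknowledge only the source that was just serviced.
            return status & ~(1 << bit)
    return status               # nothing pending: spurious interrupt

# Two sources pending at once: UART (bit 1) and disk (bit 2).
remaining = common_isr(0b0110)
```

The priority scan is the part that costs time when the status register is slow to read, so a real design would want that read to be as cheap as possible.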

>time-based instructions would be useful. Test I/O and sleep for 10 us
>would be handy for handling, say, a floppy chip.

You really want to be able to deschedule for short times, or construct
your I/O to tolerate a descheduling period before being polled. Whilst
this makes for, say, a slow disk transfer, it makes for a system that is more
responsive.

Use DMA. In my design (32 bit) I have a DMA mode in the core, so the
core will do the transfers but a DMA engine will inject the DMA instruction
into the queue (a bit like xr16 interrupt processing). I have an on-board I/O
bus (that also falls in the memory map) such that the DMA instructions can
do a transfer but using 2 buses - the main memory and the I/O. Anyway, DMA
can shift some of the burden from the OS and, whilst stealing the odd cycle,
the OS can be left to get on with things, only needing software intervention
when complete.

Anyway, a few more thoughts...

Veronica






RE: Re: CISCifying RISC calls and returns - Campbell, John - Nov 26 12:12:00 2001

Hi

Jan Gray wrote:
> * the Inmos Transputer [who?]: fast task switch -- stack architecture
> with limited state to switch);

I'm told that credit goes to David May.

-jc



