This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
(Rob babbling away again...)

In the perpetual analysis for, and in striving for, the perfect isa, it seems to me that cisc style calls and returns would be better than the risc style of using a link register. But I'm wondering *why* it hasn't been done.

For instance, a 'call' instruction could be viewed as nothing more than a specialized store instruction that stores the pc to the stack. Similarly, a 'ret' just retrieves (loads) the pc from the stack. Of course the stack pointer has to be adjusted, but that is *very* simple to do.

It seems to me that using cisc style calls and returns could eliminate the two instructions from every non-leaf subroutine that are otherwise required to save and restore the link register. This should result in a modest performance gain of two per cent across the non-leaf routines.

Before I go ahead and modify my processor, I'd like to know why this is a bad idea?

If implemented, this would mean my processor is no longer strictly a load / store architecture. Maybe I'll call it a HyRC - HYbrid Risc - Cisc ("Herc") processor? Or better yet a CRHy ("Cry") processor.

PS. I'm thinking of code naming my 240Mips EPIC processor the "Puffin".

Thanks
Rob
http://www.birdcomputer.ca/
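A minimal sketch of the instruction accounting described above, in Python rather than hardware. It assumes a word-addressed stack that grows downward and a routine whose frame is otherwise already set up; the operation names (`call`, `ret`, `jal`, `push_lr`, `pop_lr`, `jr_lr`) are illustrative and not taken from any particular instruction set.

```python
# Toy model: linkage instructions for one non-leaf subroutine call.
# All mnemonics and the one-word stack slot are assumptions for illustration.

def make_state(pc=100, sp=1000):
    return {"pc": pc, "sp": sp, "lr": None, "mem": {}}

# cisc style: call is a specialised store of the pc, ret a specialised load
def call(state, target):
    state["sp"] -= 1
    state["mem"][state["sp"]] = state["pc"] + 1   # return address
    state["pc"] = target

def ret(state):
    state["pc"] = state["mem"][state["sp"]]
    state["sp"] += 1

# risc style: jal writes a link register; a non-leaf routine must spill it
def jal(state, target):
    state["lr"] = state["pc"] + 1
    state["pc"] = target

def push_lr(state):
    state["sp"] -= 1
    state["mem"][state["sp"]] = state["lr"]
    state["pc"] += 1

def pop_lr(state):
    state["lr"] = state["mem"][state["sp"]]
    state["sp"] += 1
    state["pc"] += 1

def jr_lr(state):
    state["pc"] = state["lr"]

def run(sequence):
    """Execute a list of (operation, *args) tuples; return state and count."""
    state = make_state()
    for op, *args in sequence:
        op(state, *args)
    return state, len(sequence)

cisc_state, cisc_count = run([(call, 500), (ret,)])
risc_state, risc_count = run([(jal, 500), (push_lr,), (pop_lr,), (jr_lr,)])
saved = risc_count - cisc_count   # the two instructions discussed above
```

Both sequences end at the same return address with the stack pointer restored; the only difference in this model is the two extra instructions in the link-register version. A leaf routine would skip `push_lr` and `pop_lr`, which is where the link-register style can win back the difference.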
|
|
|
> In the perpetual analysis for, and in striving for the perfect isa,
> it seems to me that cisc style calls and returns would be better
> than the risc style of using a link register. But I'm wondering *why*
> it hasn't been done.
>
> For instance, a 'call' instruction could be viewed as nothing more
> than a specialized store instruction that stores the pc to
> the stack. Similarly, a 'ret' just retrieves (loads) the pc from the
> stack. Of course the stack pointer has to be adjusted, but that is

But on a RISC CPU that doesn't have return address storing calls, there is not necessarily an explicit stack register. Such calls would require one.

> It seems to me that using cisc style calls and returns could
> eliminate the two instructions from every non-leaf subroutine that
> are otherwise required to save and restore the link register. This
> should result in a modest performance gain of two per cent across the
> non-leaf routines.

The same work needs to be done in either case, so there's not necessarily any difference in performance. It all depends on the implementation, of course.

More importantly, with the link register style calls, the leaf routines can save two memory accesses by not storing the return address, assuming there are enough registers available.

> If implemented, this would mean my processor is no longer strictly a
> load / store architecture. Maybe I'll call it a HyRC - HYbrid Risc -
> Cisc ("Herc") processor ? Or better yet a CRHy ("Cry") processor.

I don't really see a problem with having call/ret in a RISC architecture without giving it a new name. After all, as you say, those instructions _are_ load/store, although specialised ones with side effects.

Personally, for a simple FPGA RISC, I'd also consider an internal call stack. Naturally, that would be of limited size, but you could implement explicit or implicit overflow handling.

> PS. I'm thinking of code naming my 240Mips EPIC processor

MIPS is usually used with some kind of definition other than maximum number of instructions executed per second, since that is usually a very useless number. Dhrystone MIPS seems to be the most common.

Also, don't you mean VLIW rather than EPIC? Of course, Intel seems to mean much the same thing with EPIC, but a major difference in IA64 architecture is that the length of the instruction groups is independent of the number of execution units, unlike classic VLIW.

--
  Chalmers University | Why are these | e-mail:
     of Technology    |  .signatures  |
                      | so hard to do | WWW: rand.thn.htu.se
  Gothenburg, Sweden  |     well?     | (fVDI, MGIFv5, QLem)
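The internal call stack with overflow handling suggested above could be sketched roughly as follows. This is only a behavioural model under assumptions: the class name `InternalCallStack`, the spill-oldest-entry policy, and the use of a Python list to stand in for external memory are all hypothetical choices, not something described in the thread.

```python
from collections import deque

class InternalCallStack:
    """Small on-chip return-address stack that spills to memory when full."""

    def __init__(self, depth):
        self.depth = depth          # number of on-chip slots (assumed small)
        self.slots = deque()        # newest entry on the right
        self.spilled = []           # stands in for external memory

    def push(self, return_address):
        # Implicit overflow handling: move the oldest entry out to memory.
        if len(self.slots) == self.depth:
            self.spilled.append(self.slots.popleft())
        self.slots.append(return_address)

    def pop(self):
        # On underflow, refill one entry from memory before returning it.
        if not self.slots:
            if not self.spilled:
                raise IndexError("return with empty call stack")
            self.slots.append(self.spilled.pop())
        return self.slots.pop()

stack = InternalCallStack(depth=2)
for address in (0x101, 0x202, 0x303):
    stack.push(address)
returns = [stack.pop(), stack.pop(), stack.pop()]
```

With a depth of two, the third call pushes the oldest return address out to the spill area, and the third return brings it back. An explicit scheme would instead raise a trap on overflow and let software do the spilling.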
|
|
|
First, thanks for the reply. It's very good to have someone to bounce ideas off of. Cisc style calls / returns are not a very significant performance benefit (~1%-2%), but the implementation cost is also very low. So this may be much ado about nothing. Still I'd like not to get clobbered by something I should've thought of earlier in the design.

> > stack. Of course the stack pointer has to be adjusted, but that is
>
> But on a RISC CPU that doesn't have return address storing calls, there
> is not necessarily an explicit stack register. Such calls would require
> one.

Yes, there would have to be an explicit stack pointer. This is not so worrisome because I'm trading the link register for the stack pointer, so there is no difference in the number of registers used. Also normal usage designates a register as the stack pointer, and due to its usage by the compiler it's effectively dedicated anyway. So really one of the benefits of a cisc style call / return is that it effectively frees up a register. This might be very useful in say a 16 register cpu.

> > It seems to me that using cisc style calls and returns could
> > eliminate the two instructions from every non-leaf subroutine that
> > are otherwise required to save and restore the link register. This
> > should result in a modest performance gain of two per cent across the
> > non-leaf routines.
>
> The same work needs to be done in either case, so there's not necessarily
> any difference in performance. It all depends on the implementation, of
> course.

Yes, it's basically the same work; however the cisc implementation takes two fewer instructions and two fewer clock cycles to execute (or maybe more - e.g. look at the PowerPC arch.), because the explicit store link register to stack and load link register from stack instructions are "wrapped up" into the call and return instructions, which are also required in the risc machine. This also has the effect of increasing code density, thus making better use of any high speed memory that's available, like a cache.

> More importantly, with the link register style calls, the leaf routines
> can save two memory accesses by not storing the return address. Assuming
> there are enough register available.

Yes, but it does not change the performance because avoiding the memory access does not avoid the two instructions which still have to be executed, risc or cisc style (depending on the design). Using a Harvard architecture the loads and stores are done in parallel with the instruction access. So cycle wise (assuming single cycle access to data memory) there should be no difference.

> Personally, for a simple FPGA RISC, I'd also consider an internal call
> stack. Naturally, that would be of limited size, but you could implement
> explicit or implicit over-flow handling.

I think that too. An internal stack would work very well I think for a simple FPGA RISC. I was thinking of hacking the gr0040 to use an internal stack rather than a link register. But alas, I haven't the time... I'm too busy working on a not-so-simple cpu. You can get the benefit of an internal stack, without having to worry about overflow, by using a return address predictor.

Sorry for the long post. I think I'm going to go ahead and use cisc style call / return. The return instruction I've got now executes the equivalent of three risc instructions in a single cycle. (What's the emoticon for bragging?) The other impact of cisc-ifying is that I can't have a jump and link (jal) instruction anymore, at least not one that works with the stack. I guess it'll have to be a jump-and-<blank> (ja_) instruction.

> > PS. I'm thinking of code naming my 240Mips EPIC processor

For anyone who hasn't figured it out already, I wasn't really serious about the 240Mips EPIC processor. Go back and look at the post. Name = "Puffin" = non-existent bird.

Rob
http://www.birdcomputer.ca/
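The return address predictor mentioned above can be pictured as a small circular buffer that is allowed to forget. The sketch below is a guess at the general technique, not Rob's design: the class name, the wrap-around policy, and the use of `None` for "no prediction" are assumptions. The real return address still comes from the stack in memory, so a lost entry only costs a missed prediction, never a wrong return.

```python
class ReturnAddressPredictor:
    """Circular buffer of recent return addresses; overflow just overwrites."""

    def __init__(self, entries):
        self.buf = [None] * entries
        self.top = 0                      # number of outstanding calls seen

    def on_call(self, return_address):
        self.buf[self.top % len(self.buf)] = return_address
        self.top += 1

    def on_return(self):
        """Return the predicted address, or None when there is no guess."""
        if self.top == 0:
            return None
        self.top -= 1
        index = self.top % len(self.buf)
        predicted = self.buf[index]
        self.buf[index] = None            # entry is consumed
        return predicted

predictor = ReturnAddressPredictor(entries=2)
actual = [0x30, 0x20, 0x10]               # what the memory stack would supply
for address in reversed(actual):          # calls happen in the order 0x10, 0x20, 0x30
    predictor.on_call(address)
predictions = [predictor.on_return() for _ in actual]
hits = sum(p == a for p, a in zip(predictions, actual))
```

With two entries and three nested calls, the two innermost returns are predicted and the outermost one is not, which is the "no overflow handling needed" property: the buffer never has to be spilled.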
|
|
|
> > But on a RISC CPU that doesn't have return address storing calls, there
> > is not necessarily an explicit stack register. Such calls would require
> > one.
>
> Yes, there would have to be an explicit stack pointer. This is not so

Hey, I found a cheesy way around having an explicit stack pointer register. Instead of the jump-and-link instruction specifying the link register, it just specifies the stack pointer register to use instead. So it's really a "jal [Rt],disp[Rn]" instruction, where Rt indicates the stack pointer reg.

My design has so many cheesy features I'm thinking of packaging it up and offering it as junk food. Maybe I could call it the "Cheetos" processor.

Rob
http://www.birdcomputer.ca/
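One possible reading of the "jal [Rt],disp[Rn]" idea, sketched in Python. The post does not say whether Rt is decremented before or after the store, or how wide a stack slot is; the pre-decrement by one word and the function name `jal_via_stack` below are assumptions made only to have something concrete to look at.

```python
def jal_via_stack(regs, mem, pc, rt, disp, rn):
    """Model of 'jal [Rt],disp[Rn]': Rt names the stack pointer register.

    Assumptions (not from the post): word-addressed memory, a stack that
    grows downward, and a pre-decrement of Rt before the store.
    """
    target = regs[rn] + disp      # compute the target first, in case rt == rn
    regs[rt] -= 1
    mem[regs[rt]] = pc + 1        # return address goes where Rt points
    return target

regs = {1: 1000, 2: 400, 7: 2000}
mem = {}

# Conventional use: r1 plays the role of the stack pointer.
pc = jal_via_stack(regs, mem, pc=100, rt=1, disp=20, rn=2)

# Because Rt is just a register field, a second stack is equally possible.
pc2 = jal_via_stack(regs, mem, pc=300, rt=7, disp=0, rn=2)
```

The second call shows both the flexibility and the hazard raised later in the thread: nothing stops code from using different registers as the stack pointer.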
|
wrote:
> Hey, I found a cheesy way around having an explicit stack pointer
> register. Instead of the jump-and-link instruction specifying the
> link register, it just specifies the stack pointer register to use
> instead. So it's really a "jal [Rt],disp[Rn]" instruction, where Rt
> indicates the stack pointer reg.

And what is wrong with an explicit stack pointer anyhow?

In my current cpu design I have memory accessed relative to a global base register, G, for static variables. I reserve a small amount of static space relative to the G register for traps and interrupts.

    memory G = n

    n..n-??   { irq data }
    n+?..n    { static variables }
    ??..n+?   { program/data }

This has two advantages, provided the irq/swi service routines don't need much free space.
1) You don't have stack overflows due to irq's.
2) You can have cleaner software emulation of instructions that deal with the stack.

> My design has so many cheesy features I'm thinking of packageing it
> up and offering it as junk food. Maybe I could call it the "Cheetos"
> processor.

Oat flavored core memory rings comes to mind. :)

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
|
|
|
wrote:
> I've been wondering how that global register works. Is it like an x86
> segment register ? I'm a little unclear as to how the memory space is
> divided. Are you saying irq data is placed in memory beneath the G
> register, then comes static variables and finally program / data ?
> Could you show an example with (relative) addresses ?

Yep, just like that. In this version I have no memory segments like the x86 processor. This is a flat 16 meg address space with ram/rom at the bottom of memory and a memory mapped I/O page at the top of memory. Note: due to pin limitations of the FPGA emulating a 40 pin DIP, memory is limited externally to 1 meg.

    0    { interrupt service vectors }
         { O/S kernel RAM/ROM }
    8k   { program(s) and free space }
    ?    { free - space NO RAM }
    -4k  { I/O page { FF:FFFn <console uart> } }

A program is loaded into a free memory block and remains there for the life of the program. Address constants are fixed up by the loader for this location. It is up to the loader routine to allot static space for interrupt service data and variables like 'the end of free memory'. Negative offsets from the base register G could be used for system variables or system function vectors.

> Thanks
> Rob

Ben Franchuk
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
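To make the memory map above a little more concrete, here is a small Python sketch that classifies an address in the flat 16 meg space. The 8k and top-4k boundaries come from the post; treating the 1 meg external limit as the end of usable RAM is an assumption, since the post leaves that boundary as "?". The function and constant names are made up for illustration.

```python
ADDRESS_BITS = 24                      # flat 16 meg address space
TOP = 1 << ADDRESS_BITS
KERNEL_END = 8 * 1024                  # vectors and O/S kernel below 8k
EXTERNAL_RAM_END = 1 << 20             # assumption: 1 meg external memory limit
IO_PAGE_START = TOP - 4 * 1024         # "-4k": memory mapped I/O page

def classify(address):
    """Return the region of the (assumed) memory map an address falls in."""
    if not 0 <= address < TOP:
        raise ValueError("address outside the 24-bit space")
    if address < KERNEL_END:
        return "vectors/kernel"
    if address < EXTERNAL_RAM_END:
        return "program"
    if address < IO_PAGE_START:
        return "no ram"
    return "io page"

regions = [classify(a) for a in (0x000000, 0x002000, 0x100000, 0xFFFFF0)]
```

An address such as 0xFFFFF0 lands in the I/O page, which is consistent with the console uart being shown at FF:FFFn.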
|
wrote:
> There is absolutely nothing wrong with an explicit stack pointer. I
> think it's a good idea. I've always wondered why being able to
> specify any register as the stack pointer in a risc machine is a big
> deal. It's advertised as a feature sometimes. Why would you want to
> be able to do this anyway ? I can't think of a good reason. In fact I
> think it's a bad idea. Nothing like encouraging software development
> that uses different conventions for register usage. The only reason I
> allow it is because the register field is available in the
> instruction; I was thinking of forcing the register to the sp but it
> is more hardware, still I might do it anyway. My next task is to see
> if I can implement push / pop instructions.

Providing you have push/pop for general registers even stack intensive programs like FORTH should work fine.

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
|
|
|
Eric Smith wrote:
>
> Ben wrote:
> > Providing you have push/pop for general registers even stack intensive
> > programs like FORTH should work fine.
>
> If you mean being able to use an arbitrary register as a stack pointer
> for push/pop operations (or at least having two registers to choose from),
> that will support FORTH just fine.

I meant the above. I assumed that a general register could be used for auto-increment/decrement as this is a modern design. I forgot about x86 style 'wannabe' cpu's.

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
|
> And what is wrong with a explicit stack pointer anyhow?

There is absolutely nothing wrong with an explicit stack pointer. I think it's a good idea. I've always wondered why being able to specify any register as the stack pointer in a risc machine is a big deal. It's advertised as a feature sometimes. Why would you want to be able to do this anyway? I can't think of a good reason. In fact I think it's a bad idea. Nothing like encouraging software development that uses different conventions for register usage. The only reason I allow it is because the register field is available in the instruction; I was thinking of forcing the register to the sp but it is more hardware, still I might do it anyway. My next task is to see if I can implement push / pop instructions.

> In my current cpu design I have a memory accessed relative to a
> global base register for static variables -G. I reserve a small amount
> of static space relative to the G register for traps and interrupts.
>
>     memory G = n
>
>     n..n-??   { irq data }
>     n+?..n    { static variables }
>     ??..n+?   { program/data }

I've been wondering how that global register works. Is it like an x86 segment register? I'm a little unclear as to how the memory space is divided. Are you saying irq data is placed in memory beneath the G register, then comes static variables and finally program / data? Could you show an example with (relative) addresses?

Thanks
Rob
|
|
|
> I've always wondered why being able to
> specify any register as the stack pointer in a risc machine is a big
> deal. It's advertised as a feature sometimes. Why would you want to
> be able to do this anyway ? I can't think of a good reason.

This delivers flexibility, and more importantly, simplicity, in the instruction set and in the implementation.

Using a general purpose register instead of a dedicated register can save you instructions to move to/from the DR, and control logic and *multiplexers* in the datapath. Muxes are not such an issue in a full custom design but are relatively rather expensive in FPGAs -- a single 2-1 mux can be the same size (1 LUT/bit) as a simple ALU or a register file. And by avoiding these extra muxes in the datapath you may be able to clock the critical paths faster.

This is not to say dedicated registers doing dedicated or autonomous tasks are always a bad idea, but you have to be clear on what the additional costs are before focusing solely on benefits. If you start with a simple (or at any rate, a working) design and then measure/simulate the impact of a proposed enhancement on performance (== 1/(instructions * cycles/instruction * cycle time)) your design may improve in a quantitative, methodical fashion. (But beware local maxima!)

One more comment. In instruction sets where there is "opcode pressure", but some time slack in the instruction decoder, it may be helpful to have a family of "stack pointer relative" load/store instructions where the "by convention stack pointer" general purpose register number is encoded implicitly, just as in the xr16 CALL instruction the return address linkage register (r15) is encoded implicitly.

Simple is beautiful: http://www.fpgacpu.org/log/sep00.html#000919

Jan Gray
Gray Research LLC
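The performance expression quoted above, 1/(instructions * cycles/instruction * cycle time), is easy to turn into a quick what-if calculation. The numbers below are invented purely for illustration: a hypothetical enhancement that removes 2% of the executed instructions but stretches the cycle time by 5% because of extra datapath muxes.

```python
def performance(instructions, cycles_per_instruction, cycle_time):
    """Performance == 1 / (instructions * cycles/instruction * cycle time)."""
    return 1.0 / (instructions * cycles_per_instruction * cycle_time)

# Hypothetical baseline: 1000 instructions, CPI 1.5, 20 ns cycle.
baseline = performance(1000, 1.5, 20e-9)

# Hypothetical enhancement: 2% fewer instructions, 5% longer cycle time.
candidate = performance(980, 1.5, 21e-9)

speedup = candidate / baseline   # below 1.0 means the "enhancement" loses
```

With these made-up figures the instruction savings are more than cancelled by the slower clock, which is the kind of result that only shows up when all three factors are measured together.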
|
|
|
wrote:
> I have to admit, stacking the pc register instead of using a link
> register does lead to a little more hardware.<snip>

The link register idea is faster for short subroutine calls rather than inline code. getchar() comes to mind here.

> I don't know if I'll implement push and pop instructions because the
> pop instruction requires a register file with two write ports in
> order to execute in a single cycle.

Why don't you define them as macros under whatever assembler you use? This way code will not change regardless of the hardware.

> The processor I'm working on now (the BlueBird) is getting to be
> quite large (over 40% of SpartanII 2S200). It has loads of different
> instruction formats, complex decoding, a four stage instruction
> pipeline and other features. I've found I've been able to add complex
> features to the processor without affecting the performance because
> the thing that limits performance right now is access to the block
> ram. My instincts tell me that the block ram access is what is going
> to limit performance and I've not worried so much about the
> complexity of the decoder. If anybody wants a good laugh, the current
> verilog source for the bluebird (not working yet) is available on my
> website.

Nah ... I got the same good laugh on my site with schematics. :)

The other thing that is bothering me in this late stage of design is the exact structure of the memory timing and memory bus, as well as the I/O devices. Right now I am running at the rather slow rate of 1 MHz external with a 68xx style clock (4 MHz in), but I may still have to slow the system down. Yet looking at the I/O chips I can find, very few are high speed ones - uarts - floppy disc controllers - PIA's. I wish a few more I/O devices were open source in an FPGA so one does not have to wait forever for I/O devices.

Ben Franchuk.
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
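The suggestion to define push and pop as assembler macros can be sketched as a tiny source-to-source expansion. The mnemonics (`sub`, `add`, `sw`, `lw`), the register name `sp`, and the one-word slot size are hypothetical; a real assembler's macro facility would do the same job with its own syntax.

```python
# Hypothetical two-instruction expansions for push and pop.
MACROS = {
    "push": ["sub sp, sp, 1", "sw {r}, [sp]"],
    "pop":  ["lw {r}, [sp]", "add sp, sp, 1"],
}

def expand(line):
    """Expand one source line; non-macro lines pass through unchanged."""
    parts = line.split()
    if len(parts) == 2 and parts[0] in MACROS:
        return [template.format(r=parts[1]) for template in MACROS[parts[0]]]
    return [line]

def assemble(source):
    """Expand every macro in a list of source lines."""
    return [out for line in source for out in expand(line)]

program = assemble(["push r3", "add r1, r2, r3", "pop r3"])
```

If the hardware later grows real push / pop instructions, only the `MACROS` table changes and the source code stays the same, which is the point being made above.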
|
|
|
> This delivers flexibility, and more importantly, simplicity, in the
> instruction set and in the implementation.

I was thinking in terms of a programming standpoint. AFAIK programming wise, there is no advantage. In terms of hardware, I won't argue there is no advantage. It depends on your goals. Keeping things simple and general purpose leads to hardware that has a small footprint and is fast.

I have to admit, stacking the pc register instead of using a link register does lead to a little more hardware. The overwhelming reason to do things that way is that I find it aesthetically pleasing. If you are looking at the benefit versus cost from a mathematical perspective there is probably little or no difference, in which case it would logically be better to go with the simpler hardware. But, I was brought up on the likes of the 6502, Z80, 68k, x86 processors and the risc paradigm of using a link register just seems plain alien, especially considering what you really want to do with the pc is stack it on a subroutine call. The human factor.

I don't know if I'll implement push and pop instructions because the pop instruction requires a register file with two write ports in order to execute in a single cycle. There's also no real reason to implement them as subroutine parameters should be passed via registers, not on the stack.

The processor I'm working on now (the BlueBird) is getting to be quite large (over 40% of SpartanII 2S200). It has loads of different instruction formats, complex decoding, a four stage instruction pipeline and other features. I've found I've been able to add complex features to the processor without affecting the performance because the thing that limits performance right now is access to the block ram. My instincts tell me that the block ram access is what is going to limit performance and I've not worried so much about the complexity of the decoder. If anybody wants a good laugh, the current verilog source for the bluebird (not working yet) is available on my website.

I've been wondering what Xilinx is going to do a couple of years from now when their customers start asking for memory management for the MicroBlaze soft cpu. My guess is extra pipeline stalls.

Sorry for the long post.
Rob
http://www.birdcomputer.ca

PS. Does AFAIK = as far as I know? (I've been guessing what all the internet acronyms are.)
|
|
|
--- wrote:
> I was thinking in terms of from a programming
> standpoint. AFAIK programming wise, there is no
> advantage.

Not so. Leaf routines need not save the link register to the stack and thus can avoid an expensive load and an expensive store. Studies show that the majority of processing time is spent in the "bottom of the call tree," that is, leaf calls, so this could be a significant saving.

The advantage of multiple link registers is harder to exploit, but in the case of a whole-program globally optimizing compiler (whole-program optimization makes a lot of sense for embedded devices) a sufficiently clever register allocator can generalize this principle to the n lowest levels in the tree. Granted, I only know of a few compilers doing this.

These days new CPUs are most often co-designed with the compiler, as no CPU is better than the code it runs. However, if you intend to program your CPU directly in assembler, you have a lot more freedom.

/Tommy
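The idea of generalizing the leaf-routine saving to the n lowest levels of the call tree can be illustrated with a small whole-program analysis. This is only a sketch of the principle, not how any particular compiler does it: it assumes the whole call graph is known and free of recursion, and the function names in the example graph are invented.

```python
def heights(call_graph):
    """Height of each function above the leaves (leaf == 0).

    Assumes an acyclic call graph; recursion would need separate handling.
    """
    memo = {}

    def height(fn):
        if fn not in memo:
            callees = call_graph.get(fn, [])
            memo[fn] = 0 if not callees else 1 + max(height(c) for c in callees)
        return memo[fn]

    return {fn: height(fn) for fn in call_graph}

def must_save_to_stack(call_graph, link_registers):
    """Functions whose return address cannot live in a dedicated link register.

    With n link registers, a function at height h < n can use register h,
    because nothing it calls (directly or indirectly) sits at that height.
    """
    return {fn for fn, h in heights(call_graph).items() if h >= link_registers}

graph = {
    "main": ["parse", "report"],
    "parse": ["getchar"],
    "report": ["putchar"],
    "getchar": [],
    "putchar": [],
}
one_register = must_save_to_stack(graph, link_registers=1)
two_registers = must_save_to_stack(graph, link_registers=2)
```

With one link register only the leaves avoid the save; with two, only `main` still needs the stack in this example.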
|
|
|
<Rob being offensive>
First, *read the post*, I was talking about the stack pointer not the link register.

<Rob trying to be nice because he wants to be enlightened>
Please note I can be pretty bull-headed about things until I am really convinced otherwise. It's just that I'm a digital thinker and I've got built-in Schmitt triggers with lots of hysteresis. :)

Perhaps my perspective will help. I am looking at the cpu from the viewpoint that it's used in a reasonably powerful system. Something >64k memory with a multi-threaded OS, not a simple embedded system.

Anyway...

> > I was thinking in terms of from a programming
>
> Not so. Leaf routines need not save the link register
> to the stack and thus can avoid an expensive load and
> an expensive store.

The fallacy is assuming the load and store are expensive. In a modern (Harvard) processor the load and store to the data cache (memory) would be done the same cycle the call and return instructions are executing and therefore are effectively done for free.

> Studies show that the majority
> of processing time is spent in the "bottom of the
> call tree," that is, leaf calls, so this could be a
> significant saving.

Unfortunately, probably not a savings. In most cases the compiler doesn't know whether or not it has to save and restore the link register so it saves and restores the link register just to be safe. For instance it is quite common to use third party libraries for all those "bottom of the call tree" functions, and the compiler doesn't know whether or not those routines will call other routines. Also, any short function (not using third party libraries) will probably be inlined by the compiler, eliminating the function call overhead completely; so all that's left is the larger functions which invariably are not leaf functions.

So... using a link register represents two extra instructions and clock cycles in every routine where it can not be determined to be safe to not save / restore. At least I have not seen a language compiler that will allow you to specify which external routines are leaf routines and which are not. If I was a compiler writer, I'd just always save and restore the link register when dealing with external routines just to be safe, because the idiot that specified a routine was a leaf routine might be wrong. Sure my code would run 1% slower in that case, but at least it would be guaranteed to work.

> The advantage of multiple link registers is harder to
> exploit but in the case of a whole-program globally
> optimizing compiler (whole-program optimization make
> a lot of sense for embedded devices) a sufficiently
> clever register allocator can generalize this
> principle

That's the beauty of eliminating the link register. It makes this type of optimization unnecessary, because the return address is simply stored on the stack, which is equally fast as storing it in a link register.

> These days new CPUs are most often co-designed with the
> compiler as no CPU is better than the code it runs.

Yes, exactly. If the compiler can't make use of the link register, then why have it? The link register makes hardware simpler at the expense of making software (compiler) more complicated (and slower).

Rob
http://www.birdcomputer.ca/
|
Hi,

Since I designed the MicroBlaze, I can say why MicroBlaze ended up the way it did. There is a tradeoff between area and performance and I tried to find a good compromise. MicroBlaze has good enough performance and can still be implemented even in smaller devices. I could have made MicroBlaze smaller but with less performance, or vice versa.

In MicroBlaze, this is what I did for the link register. In PowerPC the link register is a special register and in order to save it you first have to move it to a general purpose register, the same for restoring it. So it actually takes three instructions: one for the branch, one for moving the link register to a general purpose register and one for saving to memory. Since MicroBlaze has 32 registers, I can use one of them as the link register (r15) and thus save one instruction compared to PPC.

A branch and save onto stack instruction would have to do
1. PC <- PC + offset
2. mem(r1) <- PC
3. r1 <- r1 - 1

This would introduce one adder, one mux and more decoding logic and would increase the size of MicroBlaze by 10%. The gain would be 1 clock cycle for every branch to a subroutine that isn't a leaf subroutine.

Which one is better depends on the application, but I took the decision to have a link register because it is cleaner and wouldn't force one register to be the stack pointer. Adding the logic may or may not decrease the clock frequency; even if it's not on the critical path the design will be bigger and thus move things further apart from each other, and this can in itself introduce larger routing delay.

For the coming MMU on MicroBlaze, yes, an MMU will introduce a 1 clock cycle penalty for looking up an MMU table. In PPC that's normal.

Göran Bilski
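The three steps of the branch and save onto stack instruction listed above can be written out as a small Python model. Two points are assumptions on my part: step 2 is taken to store the return address (the instruction after the branch) rather than the already-updated PC, and memory is treated as word addressed. The instruction counts in the table simply restate the comparison made in the message.

```python
def branch_and_save(pc, r1, mem, offset):
    """One-instruction call: store the return address at r1, then adjust r1.

    Assumption: the value stored in step 2 is the return address (pc + 1),
    captured before the PC is replaced by the branch target.
    """
    mem[r1] = pc + 1            # 2. mem(r1) <- PC (return address)
    new_pc = pc + offset        # 1. PC <- PC + offset
    new_r1 = r1 - 1             # 3. r1 <- r1 - 1
    return new_pc, new_r1

# Instructions needed to call a non-leaf routine and get the return
# address into memory, as described in the message above.
CALL_AND_SAVE_INSTRUCTIONS = {
    "PowerPC": 3,            # branch, move link register to a GPR, store
    "MicroBlaze": 2,         # branch and link into r15, store
    "branch and save": 1,    # the hypothetical combined instruction
}

mem = {}
pc, r1 = branch_and_save(pc=100, r1=1000, mem=mem, offset=40)
gain = (CALL_AND_SAVE_INSTRUCTIONS["MicroBlaze"]
        - CALL_AND_SAVE_INSTRUCTIONS["branch and save"])
```

The one-instruction gain matches the "1 clock cycle" figure quoted above; the model says nothing about the 10% area cost, which can only come from synthesis results.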
|
Hi, I forgot to mention that MicroBlaze instruction for branch and link can have any of the 32 registers as the link register. The compiler needs however to have one register so r15 is the link register according to the ABI. But a assembler programmer can use any of the 32 register as the link register. I use this for the rom monitor where I don't want a stack but need to call subroutines so I use 3 register as different link registers. So with the MicroBlaze scheme you can have a calling depth without the need of saving the link register to the stack. Göran Bilski Goran Bilski wrote: > Hi, > > Since I designed the MicroBlaze, I can say why MicroBlaze ended up to way it > did. > There is a tradeoff between area and performance and I tried to find a good > compromise. > MicroBlaze has good enough performance and can still be implemented even to > smaller devices. > I could have made MicroBlaze smaller but with less performance or vice verse. > > In MicroBlaze, this is what I did for the link register > In PowerPC the link register is a special register and in order to save it > you first have to move to a > general purpose register, the same for restoring it. > So it actually takes three instructions, one for the branch, one for moving > the link register to a general purpose and > one for saving to memory. > Since MicroBlaze has 32 register, I can use one of them as the link register > (r15) and thus save one instruction compared to PPC. > A branch and save onto stack instruction have to do > 1. PC <- PC + offset > 2. mem(r1) <- PC > 3 r1 <- r1 -1 > This would introduce one adder , one mux and more decoding logic and would > increase the size of MicroBlaze with 10%. > The gain would be 1 clock cycle for all branch to a subroutines that isn't a > leaf subroutine. > > Which one is better, it depends on the application but I took the decision to > have a link register because it cleaner and wouldn't force > one register to be the stack pointer. 
Adding the logic may or may not > decrease the clock frequency, even if it's not on the critical path > the design will be bigger and thus move thing further apart from each other, > this can in itself introduce larger routing delay. > > For the coming MMU on MicroBlaze, yes, a MMU will introduce 1 clock cycle > penalty for looking up a MMU table. > In PPC that's normal. > > Göran Bilski > > wrote: > > > > This delivers flexibility, and more importantly, simplicity, in the > > > instruction set and in the implementation. > > > > I was thinking in terms of from a programming standpoint. AFAIK > > programming wise, there is no advantage. In terms of hardware, I > > won't argue there is no advantage. It depends on your goals. Keeping > > things simple and general purpose leads to hardware that has a small > > footprint and is fast. > > > > I have to admit, stacking the pc register instead of using a link > > register does lead to a little more hardware. The overwhelming reason > > to do things that way is that I find it aestically pleasing. If you > > are looking at the benefit versus cost from a mathematical > > perspective there is probably little or no difference, in which case > > it would logically be better to go with the simpler hardware. But, I > > was brought up on the likes of the 6502,Z80,68k,x86 processors and > > the risc paradigm of using a link register just seems plain alien, > > especially considering what you really want to do with the pc is > > stack it on a subroutine call. The human factor. > > > > I don't know if I'll implement push and pop instructions because the > > pop instruction requires a register file with two write ports in > > order to execute in a single cycle. There's also no real reason to > > implement them as subroutine parameters should be passed via > > registers, not on the stack. > > > > The processor I'm working on now (the BlueBird) is getting to be > > quite large (over 40% of SpartanII 2S200). 
> > It has loads of different instruction formats, complex decoding, a four stage instruction pipeline and other features. I've found I've been able to add complex features to the processor without affecting the performance, because the thing that limits performance right now is access to the block ram. My instincts tell me that the block ram access is what is going to limit performance, and I've not worried so much about the complexity of the decoder. If anybody wants a good laugh, the current verilog source for the bluebird (not working yet) is available on my website.
> >
> > I've been wondering what Xilinx is going to do a couple of years from now when their customers start asking for memory management for the MicroBlaze soft cpu. My guess is extra pipeline stalls.
> >
> > Sorry for the long post.
> > Rob http://www.birdcomputer.ca
> >
> > PS. Does AFAIK = as far as I know ? (I've been guessing what all the internet acronyms are.)
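The difference between the two calling schemes discussed in this post can be sketched with a toy model. The Python below is only an illustration under stated assumptions: `ToyCpu`, its method names, the addresses, and the choice of r15, r16 and r17 as link registers are all hypothetical, and the model counts memory accesses rather than real MicroBlaze cycles.

```python
class ToyCpu:
    """Toy model (hypothetical): tracks only the PC, a register file,
    a stack, and a count of memory accesses."""

    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32
        self.stack = []          # stands in for memory addressed by the stack pointer
        self.mem_accesses = 0

    def brl(self, link_reg, target):
        # Branch and link: the return address goes into any chosen register.
        self.regs[link_reg] = self.pc + 1
        self.pc = target

    def ret_reg(self, link_reg):
        # Return through the register that was used as the link register.
        self.pc = self.regs[link_reg]

    def call(self, target):
        # CISC-style call: the return address is stored to the stack (one memory write).
        self.stack.append(self.pc + 1)
        self.mem_accesses += 1
        self.pc = target

    def ret(self):
        # CISC-style return: the return address is loaded from the stack (one memory read).
        self.pc = self.stack.pop()
        self.mem_accesses += 1


def nested_with_link_registers(link_regs=(15, 16, 17)):
    """Three nested calls, each using a different link register."""
    cpu = ToyCpu()
    cpu.pc = 100
    for reg, target in zip(link_regs, (200, 300, 400)):
        cpu.brl(reg, target)
    for reg in reversed(link_regs):
        cpu.ret_reg(reg)
    return cpu.pc, cpu.mem_accesses


def nested_with_stack():
    """The same three nested calls using call/ret through the stack."""
    cpu = ToyCpu()
    cpu.pc = 100
    for target in (200, 300, 400):
        cpu.call(target)
    for _ in range(3):
        cpu.ret()
    return cpu.pc, cpu.mem_accesses
```

In this model both versions return to the same address, but the link-register version touches memory zero times while the stack version makes six accesses. The trade is that each nesting level ties up one register.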
|
Goran Bilski wrote:
>
> Hi,
>
> I forgot to mention that the MicroBlaze branch-and-link instruction can use any of the 32 registers as the link register.
> The compiler, however, needs one fixed register, so r15 is the link register according to the ABI.
> But an assembler programmer can use any of the 32 registers as the link register.
> I use this in the ROM monitor, where I don't want a stack but need to call subroutines, so I use 3 registers as different link registers.
> So with the MicroBlaze scheme you can have some calling depth without needing to save the link register to the stack.
>
> Göran Bilski

Years back, I remember, in Dr Dobb's (early 1980s?) somebody had an idea for marking C functions' calling depth so that leaf functions could use static variables rather than stack variables, for machines like the Z80 that had a hard time with stack offsets. I suspect the same idea would work for the link register too.

Ben Franchuk
--
Live "Pre-historic Cpu's" -- and you thought they were extinct.
www.jetnet.ab.ca/users/bfranchuk/index.html
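One way the calling-depth idea might carry over to link registers is sketched below in Python. This is not the scheme from the Dr Dobb's article, which is not reproduced here; it is a hypothetical illustration. It assumes a call graph with no recursion, takes a function's "height" to be its longest chain of calls down to a leaf, and gives each height its own link register. The function names and the registers r15 to r17 are made up for the example.

```python
def call_heights(call_graph):
    """Return {function: height}, where a leaf has height 0 and any other
    function is one more than its tallest callee. Raises on recursion."""
    heights = {}
    visiting = set()

    def height(fn):
        if fn in heights:
            return heights[fn]
        if fn in visiting:
            raise ValueError(f"recursion through {fn!r}: needs a real stack")
        visiting.add(fn)
        callees = call_graph.get(fn, ())
        h = 0 if not callees else 1 + max(height(c) for c in callees)
        visiting.discard(fn)
        heights[fn] = h
        return h

    for fn in call_graph:
        height(fn)
    return heights


def assign_link_registers(call_graph, link_regs=(15, 16, 17)):
    """Map each function to the register that holds its return address.
    A callee always has a smaller height than its caller, so it never
    overwrites the caller's link register."""
    heights = call_heights(call_graph)
    if max(heights.values(), default=0) >= len(link_regs):
        raise ValueError("call chain deeper than the available link registers")
    return {fn: link_regs[h] for fn, h in heights.items()}


# Hypothetical ROM-monitor call graph.
monitor_graph = {
    "monitor": ["print_hex", "getc"],
    "print_hex": ["putc"],
    "putc": [],
    "getc": [],
}
assignment = assign_link_registers(monitor_graph)
```

With this graph the two leaves share r15, `print_hex` gets r16 and `monitor` gets r17, so no return address ever needs to be written to memory.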
|
"" wrote:
>
> >In most cases the compiler doesn't know whether or not it has to save and
> >restore the link register so it saves and restores the link register just
> >to be safe.
> Some compilers have a mechanism to allow the programmer to tell the compiler a routine is a leaf function. Some versions of GCC have this option. GCC for ARM that I use at work does.
> If you are using LCC, you could build it into the compiler to pass down to the back end in some way.
>
> Just a thought.
>
> Veronica

Other than Small C, are GCC and LCC the only reasonably easy-to-port C compilers out there? I still figure a C compiler (source & basic library) need not take up several meg! to compile and run.

Ben Franchuk.
|
>In most cases the compiler doesn't know whether or not it has to save and
>restore the link register so it saves and restores the link register just
>to be safe.

Some compilers have a mechanism to allow the programmer to tell the compiler a routine is a leaf function. Some versions of GCC have this option. GCC for ARM that I use at work does.

If you are using LCC, you could build it into the compiler to pass down to the back end in some way.

Just a thought.

Veronica
|
Ben,

>Other than small C , is GCC and LCC the only reasonably easily to port C
>compilers out there?

LCC isn't big. It isn't difficult to port. LCC is ANSI C only. The library support is poor.
GCC is big. It is difficult to port. It is C and C++. The library support is good.

>I still figure a C compiler (source & basic library) need not take up
>several meg! to compile and run.

Exactly.

I am working with LCC and have been pleased with it. I have ported it to DOS, compiling it with DJGPP (GCC ported to DOS); it needs a 32 bit compiler.

I have built a MIPS version and the NULL back end version, and have been using them to compile my target code so I can get an idea of what optimisations I can make in the back end and in my final FPGA target.

Veronica
|
Hi,

LCC takes one week to port, but the optimization isn't that great. No good register allocation, etc. No libraries.
GCC takes 1-3 months to port, but the result is much better.

Göran Bilski

"" wrote:
> Ben,
>
> >Other than small C , is GCC and LCC the only reasonably easily to port C
> >compilers out there?
> LCC isn't big. It isn't difficult to port. LCC is ANSI C only. The library support is poor.
> GCC is big. It is difficult to port. It is C and C++. The library support is good.
>
> >I still figure a C compiler (source & basic library) need not take up
> >several meg! to compile and run.
> Exactly.
>
> I am working with LCC and have been pleased with it. I have ported it to DOS, compiling it with DJGPP (GCC ported to DOS); it needs a 32 bit compiler.
>
> I have built a MIPS version and the NULL back end version, and have been using them to compile my target code so I can get an idea of what optimisations I can make in the back end and in my final FPGA target.
>
> Veronica
|
Rob,

I meant to pick up on this when I saw it, so apologies for the delay.

>Perhaps my perspective will help. I am looking at the cpu from the
>viewpoint that it's used in a reasonably powerful system. Something
>64k memory with a multi-threaded OS, not a simple embedded system.

This was said in a broader discussion about link registers versus stack, and cache was also mentioned.

For a multi-threaded system, link register versus stack is not going to figure highly in the things that take time. Issues such as cache flushing/invalidation, context saving and restoration, MMU loading (if you have one), schemes for process (memory space) management, and inter-process communications are all likely, in some form or other, to be of more concern to you. I would start looking at these issues and what you can do in the core to help.

There are a number of things you can do.

Context saving is the first on the list. Come up with a scheme that will enable you to save and restore contexts quickly. ARM cores have load and store multiple instructions that will save all the registers in one instruction. This means that the cycle cost over and above saving one register is one write cycle.

Caches in a small system aren't likely to be a problem, but adding an MMU could be. Depending on your process model, the selection of physical or virtual tagging can make a big difference.

What about priority and privilege handling? Using supervisor/kernel services and the need to move between privilege levels? The core implementation can have dramatic implications for these.

For the record, I manage a software team that supports customers porting the kernel and HAL to new targets, in a company that supplies an OS. I work with our main CPU core supplier on ways to make things better both in the kernel implementation and in the CPU.

In short, think about the kernel software and the core features together, not in isolation.

Veronica
|
|
|
Veronica Merryfield wrote:
>
> Rob
> Context saving is the first on the list. Come up with a scheme that will enable you to save and restore contexts quickly. ARM cores have load and store multiple instructions that will save all the registers in one instruction. This means that the cycle cost over and above saving one register is one write cycle.

Don't forget the old 'exchange' register instructions like on the Z80. They too could be adapted easily with the larger FPGA RAM nowadays.

> In short, think about the kernel software and the core features together, not in isolation.

Let's not forget peripheral devices too. With an FPGA for each device you can have smart dumb devices. Example: a hard reset would load sector 0 into memory. Handle DMA and timing with common fixed formats. The more you can offload from the main CPU, the better.

Ben Franchuk.
|
wrote:
>
> Hi Veronica,
>
> > highly in the things that take time. Issues such a cache
> > flushing/invalidation,
>
> Cache flushing is a problem right now as there is no good way to invalidate the cache all at once. I've thought of a couple of different hardware schemes, but they're ugly, so I'm going to rely on software for now. This does add to the context switch overhead, but the caches implemented with block ram are fairly small.

Does all the software need to have the same cache size? Can't you have small and medium fixed-size caches for core kernel/IRQ service and the standard cache for everything else? The core kernel is the task switching/MMU handling type of stuff? What about going back to the old idea of fast fixed-area memory for the important stuff?

> At the moment I have a simple mmu that uses a page allocation table to track which process owns a memory page. All addresses are physical addresses (no virtual memory management). Thus the mmu does not need to be reloaded on a context switch. Initially I did look at implementing a virtual memory system, but it was really overkill for the hardware I have available, and I wanted to concentrate on the cpu for now. Nevertheless I have anticipated going to a more complex memory management system in the future. It's one of the reasons why I chose not to use delayed branches, for instance. (I've assumed that I might need a longer processor pipeline.)

You might want to have an option bit in software to say "this can be virtual" so you can keep a real-time OS for important things. Not having used a real-time OS or read much on them, I would expect time-based instructions would be useful. "Test I/O and sleep for 10 us" would be handy for handling, say, a floppy chip.

> Ooops! I started looking at these issues quite a bit before I started working on this project. I've spent some time investigating how operating systems are put together.
> A few years ago, when DOS was popular (ugh), I put together a simple RTOS executive. Before I started working on this project, I was working on a message based multi-tasking operating system. Actually I want to build the processor so I can put my OS on it ;)

I got my processor -- now I need I/O and an OS. :)

> In any case, this type of instruction might be difficult to implement in a superscalar processor. Example: in a superscalar architecture with dual port access to the data cache, it would be possible to load / store two registers per clock cycle using independent instructions.

But main memory is still serial :(.

> Hopefully I've done this. For instance, one of the things I've chosen to do is provide only a single interrupt vector location through which all external interrupts are processed. The actual interrupt occurring is encoded in the status register. If you look at modern OS code, basically all it does for an interrupt is save the context, select a mailbox number to send to, then jump to a common routine to send an interrupt message. By vectoring to a single interrupt routine to begin with, it makes the code a little bit smaller and faster. IE. if OS's aren't really making good use of having multiple int vectors, then why supply them ?

This is where hardware next-task processing would be useful. Take the PC, for example, with a high speed UART: interrupt, massive IRQ service, massive task switch. Not the best idea. IRQ/SWI are best done as slow things.

> I've thought of trying to port linux for my processor. However, looking at the linux kernel I was not that impressed. It seems to be based on 1960's (unix) technology (although I've not looked at it recently). There are things that a modern OS paradigm does better.

True -- I like multi-tasking but I favor ONE user per computer.
It has some good features, but I just formatted my Linux HD so I could have room for Windows PCB design software, as I need a prototype board developed. Real memory / real I/O / real FLASH / real neat.

> Rob http://www.birdcomputer.ca/

Ben Franchuk.
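The page allocation table Rob describes in the quoted text can be sketched roughly as follows. This Python model is only a guess at the general shape, not his hardware: the class name, the 4096-byte page size and the first-free allocation policy are all assumptions made for the example. The point it illustrates is that, with physical addresses only, a context switch changes the current process id and nothing in the table.

```python
PAGE_SIZE = 4096  # assumed page size; the post does not state one


class PageAllocationTable:
    """Tracks which process owns each physical page (None means free)."""

    def __init__(self, num_pages):
        self.owner = [None] * num_pages

    def allocate(self, pid):
        """Give the first free page to pid and return its base address."""
        for page, owner in enumerate(self.owner):
            if owner is None:
                self.owner[page] = pid
                return page * PAGE_SIZE
        raise MemoryError("no free pages")

    def access_allowed(self, pid, address):
        """Check a physical address against the owner recorded for its page."""
        page = address // PAGE_SIZE
        return 0 <= page < len(self.owner) and self.owner[page] == pid


pat = PageAllocationTable(num_pages=4)
base_a = pat.allocate(pid=1)   # first free page
base_b = pat.allocate(pid=2)   # next free page

current_pid = 1
ok_own = pat.access_allowed(current_pid, base_a + 8)    # process 1 touching its own page
ok_other = pat.access_allowed(current_pid, base_b)      # process 1 touching process 2's page

current_pid = 2   # a "context switch": only the current pid changes, the table is untouched
ok_after_switch = pat.access_allowed(current_pid, base_b)
```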
|
|
|
>Need all the software have the same cache size. Can't you have
>small and medium fixed size cache for core kernel/irq service and
>the standard cache for every thing else. The core kernel is the
>task switching/MMU handling type stuff? What about going back to
>the old idea of fast fixed area memory for the important stuff.

As I was driving home tonight I was thinking about the issues this brings up. There are certain known processes that could have fixed "things" set aside for them, thus speeding up context switching. Cache and operating registers are one example, hence some CPUs have register banks. The ARM has several for different modes: user, supervisor, interrupt, etc. The Z80 had two banks with exchange instructions, and I'm sure you know of other examples. Intel were trying to solve some of these issues with the segment scheme. Caches in general, though, are not tied to a process, although the new architecture 6 from ARM has gone this route (because we asked them), along with the MMU. This means that cache line invalidation and loading can be tied to a process, but this is complex behaviour.

On another note, interrupt handling. It is important to be able to establish priority when more than one source has interrupted. Some architectures make this a lengthy process, especially when status registers are behind complex logic, as that adds to the read time significantly. Most OSes have one interrupt handling routine that dispatches to or "messages" a thread, depending on the structure. The whole latency is important: start the ISR, establish which device has interrupted, get to execute the handler code.

There are many issues. Some OSes have to be open and flexible and don't know anything about the target. Others, like the ones we are talking about, are smaller and the target is well known. These factors make a big difference to the architecture, and both the OS software and the core/peripherals can be dealt with together.

>time based instructions would be useful. Test i/o and sleep for 10 us,
>would be handy for a handling say a floppy chip.

You really want to be able to deschedule for short times, or construct your IO to tolerate a descheduling period before being polled. Whilst this makes for, say, a slow disk transfer, it makes for a system that is more responsive. Use DMA. In my design (32 bit) I have a DMA mode in the core, so the core will do the transfers, but a DMA engine will inject the DMA instruction into the queue (a bit like xr16 interrupt processing). I have an on-board IO bus (that also falls in the memory map) such that the DMA instructions can do a transfer using 2 buses - the main memory and the IO. Anyway, DMA can shift some of the burden from the OS, and whilst it steals the odd cycle, the OS can be left to get on with things, only needing software intervention when the transfer is complete.

Anyway, a few more thoughts...

Veronica
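The single-entry interrupt path discussed in this thread (one vector, the source encoded in a status register, priority resolved, then a message sent to a thread) might look roughly like the Python sketch below. Everything specific in it is an assumption made for illustration: the pending sources are modelled as bits in an integer, the lowest-numbered bit is treated as the highest priority, and the mailboxes are plain queues keyed by source number.

```python
from collections import deque


def highest_priority_source(pending):
    """Return the lowest set bit number in the pending mask, or None if
    nothing is pending. Lowest bit = highest priority is an assumption."""
    if pending == 0:
        return None
    return (pending & -pending).bit_length() - 1


def common_isr(pending, mailboxes):
    """Single interrupt entry point: pick one source, message its mailbox,
    and return the pending mask with that source cleared."""
    source = highest_priority_source(pending)
    if source is None:
        return pending
    mailboxes[source].append(("irq", source))
    return pending & ~(1 << source)


# Hypothetical setup: 8 interrupt sources, one mailbox each.
mailboxes = {n: deque() for n in range(8)}
status = 0b10100                 # sources 2 and 4 have both interrupted
status = common_isr(status, mailboxes)   # services source 2 first
status = common_isr(status, mailboxes)   # then source 4
```

In hardware the priority step would normally be a priority encoder in front of the status register, so the software reads a source number directly instead of scanning bits.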
|
|