
PIC vs ARM assembler (no flamewar please)

Started by Unknown February 14, 2007
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:gicnt214v581okaluheha0sietn16fsvn5@4ax.com...
> On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville
> <no.spam@designtools.maps.co.nz> wrote:
>
>> Wilco Dijkstra wrote:
>>>
>>> While SRAM is faster than flash, it wouldn't be fast enough to be
>>> used like a register in a simple MCU. On ARM7 for example,
>>> register read, ALU operation and register write all happen within
>>> one clock cycle. With SRAM the cycle time would become 3-4
>>> times as long (not to mention power consumption).
>>
>> To get a handle on what on-chip, small RAM speeds can achieve, in real
>> silicon, look at the FPGA block sync RAMs - those are smallish blocks,
>> dual ported, and plenty fast enough to keep up with the cycle times of a
>> CPU.
>> I don't see FPGA CPUs being held back by their 'slow SRAM',
>> as you claim?
>> RAM based DSPs are now pushing 1GHz, and that's larger chunks
>> of RAM than are needed for register-mapped memory.
Jim, you earlier wrote "I think you missed my uC = microcontroller." -
I don't think 1GHz DSPs/FPGAs are microcontrollers. Yes, high-end
SRAMs on advanced processes easily reach 1GHz, but my point
(and I think Rick's) is that registers are much faster still.
> I just dumped my message in progress on this -- you said what I wanted
> to say very clearly. I use such DSPs. I think Wilco must be stuck
> thinking in terms of external bus drivers where what is connected is
> unknown and the bus interface designer must work to worst cases. Too
> much ARM, perhaps?
No, not at all. I'm talking about needing to access the SRAM several
times per cycle to read/write the registers (as explained in the first
paragraph). Therefore a CPU using SRAM rather than registers can only
run at a fraction of the SRAM's speed.

A register file is a small dedicated structure designed for very high
random access bandwidth. SRAM simply can't achieve that.

Wilco
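To make the port-count point concrete, here is a minimal C sketch (mine, not from the thread); the 16-entry file and the function names are illustrative assumptions:

```c
/* Sketch: one RISC-style ALU instruction, e.g. ADD Rd, Rn, Rm, touches
 * the register file three times in a single cycle - two operand reads
 * and one result write.  A dedicated register file provides those ports
 * in parallel; a single-ported SRAM would have to serialise the three
 * accesses, stretching the cycle time accordingly. */
#include <stdint.h>

typedef struct {
    uint32_t r[16];              /* illustrative 16-entry register file */
} regfile_t;

static void alu_add(regfile_t *rf, unsigned rd, unsigned rn, unsigned rm)
{
    uint32_t a = rf->r[rn];      /* read port 1  } all three happen      */
    uint32_t b = rf->r[rm];      /* read port 2  } within one CPU cycle  */
    rf->r[rd] = a + b;           /* write port   } on a real core        */
}
```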
Wilco Dijkstra wrote:
> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
> news:45db1975$0$31521$8404b019@news.wineasy.se...
>
>> However, there is no fixed distinction between RISC and CISC. The two
>> terms refer to a range of characteristics commonly associated with RISC
>> cpus and CISC cpus. Some chips clearly fall into one camp or the other,
>> but most have at least slightly mixed characteristics.
>
> RISC and CISC are about instruction set architecture, not implementation
> (although it does have an effect on the implementation).
>
The whole point of RISC is to be able to make a more efficient implementation - it is an architectural design philosophy aimed at making small and fast (clock speed) implementations.
>> The ColdFire core is very much such a mixed chip - in terms of the ISA, it
>> is noticeably more RISCy than the 68k (especially the later cores with
>> their more complex addressing modes), and in terms of its implementation,
>> it is even more so. Even the original 68k, with its multiple registers and
>> (mostly) orthogonal instruction set is pretty RISCy.
>
> Well, let's look at 10 features that are typical for most RISCs today:
>
> * large uniform register file: no (8 data + 8 address registers)
Typical CISC is 4 to 8 registers, each with specialised uses. Thus the 68k is far from typical CISC, and is much more in the middle.
> * load/store architecture: no
The 68k can handle both operands of an ALU instruction in memory, which is CISC. The ColdFire can have one in memory, one in a register, which is again half-way.
> * naturally aligned load/store: no
That is purely an implementation issue for the memory interface. It is common that RISC cpus, in keeping with the aim of a small, neat and fast implementation, insist on aligned access. But it is not a requirement - IIRC, some PPC implementations can access non-aligned data in big-endian mode. The ColdFire is certainly more efficient with aligned accesses, but they are not a requirement.
> * simple addressing modes: no (9 variants, yes for ColdFire?)
The addressing modes for a ColdFire "move" instruction are:

  Rx, (Ax), (Ax)+, -(Ax), (d16 + Ax), (d8 + Ax + Ri*SF), xxx.w, xxx.l, #xxx

The source and destination addressing modes can be mixed as long as only
one of them needs an extension word. The 68k had several other modes in
its later generations, and they could be freely mixed for the source and
destination.

I am not familiar enough with the ARM (it's 17 years since I programmed
one), but if we look at the PPC, it has addressing modes roughly
equivalent to:

  Rx, (Rx), (d16 + Rx), (Rx + Ry), xxx.w

Using update versions of the instructions, you get something much like
the (Ax)+ and -(Ax) modes as well as more complex modes. All in all, the
CF modes are only marginally more complex than the PPC modes. The big
difference, however, is that the CF can use these modes on ALU
instructions and not just for loads and stores - but that has already
been counted above.
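For readers who don't think in 68k syntax, here is a rough C sketch of what those source addressing modes compute; the names (ax, d8, d16, ri) and the scale factor SF are illustrative assumptions, and the two absolute forms are both shown as a plain pointer dereference:

```c
#include <stdint.h>

/* Each statement mirrors one addressing mode from the list above. */
uint32_t addressing_examples(uint32_t rx, uint32_t *ax, int32_t d16,
                             int32_t d8, uint32_t ri, uint32_t *abs_addr,
                             uint32_t imm)
{
    enum { SF = 4 };        /* index scale factor: 1, 2 or 4 */
    uint32_t v = 0;

    v += rx;                                              /* Rx                */
    v += *ax;                                             /* (Ax)              */
    v += *ax++;                                           /* (Ax)+             */
    v += *--ax;                                           /* -(Ax)             */
    v += *(uint32_t *)((uint8_t *)ax + d16);              /* (d16 + Ax)        */
    v += *(uint32_t *)((uint8_t *)ax + d8 + ri * SF);     /* (d8 + Ax + Ri*SF) */
    v += *abs_addr;                                       /* xxx.w / xxx.l     */
    v += imm;                                             /* #xxx              */
    return v;
}
```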
> * fixed instruction sizes: no
> * simple instructions: no (yes for ColdFire)
The instruction set for the PPC contains much more complicated instructions than the CF. The 68k has things like division instructions, which the CF has dropped. A far more useful (and precise) distinction would be to look at the implementation - does the architecture use microcoded instructions? RISC cpus, in general, do not - that is one of the guiding principles of using RISC in the first place. Traditional CISC use microcode extensively. The 68k used microcode for many instructions - the CF does not.
> * calls place return address in a register: no
More generally speaking, CISCs have special-purpose registers, while RISCs have mostly general-purpose registers. Yes, the CF has extra functionality on A7 to make it a stack pointer. Putting the return address in a register, as done in RISC cpus, is not an advantage - it is a consequence of not having a dedicated stack.
> * 3 operand ALU instructions: no
> * ALU instructions do not corrupt flags: no
> * delayed branch: no
>
> So that is 0 for 68K, 2 for ColdFire. ARM scores 8, Thumb scores 6,
> Thumb-2 7. MIPS scores 10 (very pure). This clearly shows 68K and
> ColdFire are CISCs, while the rest are RISCs.
>
If we add in some other features that are a little more implementation
dependent (and therefore entirely relevant, since that is the reason for
RISC in the first place), things are a bit different:

* Single-cycle register-only instructions: yes
* Short execution pipeline: yes
* (Mostly) microcode-free core: yes
* Short and fast instruction decode: half point
* Low overhead branches: yes
* Stall-free for typical instruction streams: yes

Suddenly the scores are looking a bit different. Perhaps we could compare
the CF to traditional CISC features:

* Specialised accumulator: no
* Specialised frame pointer: no
* Specialised index registers: no
* Microcoded instructions: no
* Looped instructions: no
* Direct memory-to-memory operations: no
* Bottlenecks due to register or flag conflicts: not often
* Long pipelines: no
* Register renaming needed for fast implementation: no
* Unaligned code: no
* Highly variable instruction length: half (only 1, 2, or 3 16-bit words)
* Instruction prefix codes: no

I could go on - and I expect you could too.
>> So the ARM is moving from a fairly pure RISC architecture, through the
>> Thumb (with its more CISCy smaller register set and more specialised
>> register usage) and now Thumb-2 (with variable length instructions). It's
>> gaining CISC attributes in a move to improve code density at the expense
>> of more complex instruction decoding.
>
> Yes, RISCs have become more complex. However that doesn't make
> them CISC! Although ARM is not a pure RISC to start with, Thumb-1
> and Thumb-2 are only slightly more complex and still have most of
> the RISC characteristics.
>
As I said, with the Thumb-2, the ARM is gaining the CISC feature of variable length instructions - I did not say it is changing into a CISC architecture. The real world is grey - there is no dividing line between CISC and RISC, merely a collection of characteristics that some chips have and others don't. Adding these variable length instructions is a good thing, if it doesn't cost too much at the decoder. It increases both code density and instruction speed, since it opens the path for 32-bit immediate data (or addresses) to be included directly in a single instruction.
>> The ColdFire, on the other hand, has moved from the original 68k to a more
>> RISCy core, with a much greater emphasis on single-cycle
>> register-to-register instructions and a simpler and more efficient core,
>> in order to improve performance and lead to a smaller implementation.
>
> Indeed, it has gained 2 points by removing some of the complex
> microcoded instructions and addressing modes, thus allowing a simpler,
> more pipelined implementation. But it clearly doesn't make it a RISC
> like the marketing people want us to believe...
>
My point is not that the CF is a RISC core - I never claimed it was. But neither is it a CISC core in comparison to, say, the x86 architecture. If there were such a thing as a scale running from pure RISC to pure CISC, then the CF lies near the middle. It is not as RISCy as the ARM, but is somewhat RISCier than the original 68k.
>> There are still plenty of differences between the architectures, but there
>> is no doubt that there are a lot more similarities between the ARM Thumb-2
>> and the ColdFire than between the original ARM and the original 68k.
>
> I'd say that any similarities only exist on a superficial level. For example
My original comment was pretty superficial.
> the variable length instructions in Thumb-2 are easier to decode than 68K
> or ColdFire.
>
The hard ones from the 68k were dropped in the ColdFire, precisely to allow a faster, more RISC-style decoder.
>>> There are few RISCs with variable length instructions.
>>>
>> The AVR? I can't think of any others.
>
> Hitachi SH and ARC for example.
I haven't looked at them, so I'm happy to take your word for it.
>
> Wilco
>
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:hoant2phqt3m4agivtnqa1nc1t6odhv9ha@4ax.com...
> On Wed, 21 Feb 2007 00:02:27 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>
>> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
>> news:45db1975$0$31521$8404b019@news.wineasy.se...
>>
>>> However, there is no fixed distinction between RISC and CISC. The two
>>> terms refer to a range of characteristics commonly associated with RISC
>>> cpus and CISC cpus. Some chips clearly fall into one camp or the other,
>>> but most have at least slightly mixed characteristics.
>>
>> RISC and CISC are about instruction set architecture, not implementation
>> (although it does have an effect on the implementation).
>> <snip>
>
> I respect your knowledge and skill, Wilco, but I cannot agree with
> this as I understand you writing it here based upon my experiences.
You're free to disagree but there is consensus about what RISC and CISC
are. It's unfortunate that many confuse ISA and implementation...

Please read this excellent article by John Mashey:
http://yarchive.net/comp/risc_definition.html

Hennessy & Patterson's "Computer Architecture: A Quantitative Approach" is
well worth reading too.
> I spent 1-on-1 time with Hennessy and listened to the reasoning he
> used. RISC was all about thinking in detailed terms of practical
> implementation. They were faced with access to lower-technology FABs
> (larger feature sizes, fewer transmission gates and inverters, etc.)
> and wanted to achieve more with less. Doing that was everything about
> implementation and the instruction set architecture was allowed to go
> where it must. That this worked out to being a 'reduced instruction
> set' was something that came out of achieving competing performance
> out of lower-tech FAB capability than folks like Intel or Motorola had
> available to their flagship lines of the day.
It is true that in those early days they wanted to cram a complete CPU on a single die (including caches to speed up memory access) and the only way to achieve that was to throw out everything unnecessary. Those days are long gone, transistor budgets are much larger now. Today all CPUs, whether RISC or CISC, use the same implementation techniques to achieve high performance.
> There was a design philosophy based upon theory -- that was simply the
> realization that many of the things that slowed down a CISC was also a
> matter of perceived convenience for programmers, so the policy was
> then to get rid of anything and everything that slowed down the clock
> rate without paying _well_ for that delay. A focus on throughput. The
> fact that removing barriers to speed also happened to reduce the need
> for more transistor equivalents was the happy coincidence that fueled
> the initiative. The instructions were a result of the application of
> focusing on implementation details -- not some instruction set theory
> under which the implementation then followed. If higher level features
> were cheap to implement and paid for themselves in performance, they
> were simply kept. Very practical, hard nosed approach.
Again, it is true that in the early days the focus was on getting performance without much regard for anything else. However saying that the instruction set design followed from the implementation is incorrect. RISC started as a reaction against the CISC goal of "closing the semantic gap" after IBM studies showed only a few simple instructions were used 90% of the time. It's about taking a quantitative approach to instruction set design. RISC takes the interaction between the various components of a complete system into account (compiler, ISA, implementation). The result of this is a particular set of features in the ISA, not in the implementation. A microcoded RISC is still a RISC, a pipelined CISC is still a CISC!
> If you ever listened to such a lecture by those actually doing the
> work, you'd see this narrow focus. The register flags that signalled
> whether or not a register was in-use as a destination were tossed as
> too expensive -- they required infrastructure in order to delay the
> processor and the combinatorial worst-case path of the whole of that
> meant additional __delay__ in each clock cycle, whether or not this
> interlock was useful instruction to instruction. You paid for it on
> every cycle, need it or not. So out it went. No interlocks. Sorry.
> Similar thinking was involved in the Alpha's refusal to do 'lane
> changes,' for example.
Those were mistakes indeed that were corrected in later RISCs. Some of the early ideologies were taken too far, and concentrated too much on a single implementation rather than on the ISA (which lives for many implementations). Going for all-out clock speed without thinking about power consumption, codesize, ease of compiler design etc is a bad idea. Many early RISCs ended up with features that were found to have a negative impact in the end (either in software or in later CPUs). Alpha byte access is a great example of this; delayed branches are another. MIPS quickly realised the silliness of omitting interlocks. :-)
> Some of the difficulties were higher memory bandwidths required, once
> you started tossing out stuff like register interlocks, microstore and
> its associated sequencing overhead, lane changing, etc. But if that
> could be satisfied, and that was kind of possible at the time with
> some static ram from performance semi, it would perform like a bat out
> of hell. So to speak.
>
> But the focus was on implementation on lower-tech FABs and, while
> doing that, still competing with CISC and beating it.
At the time, yes. Nowadays it is accepted that while RISC still has some advantages over CISC (eg. area, power consumption, design effort), CISC CPUs can be made as fast as RISC CPUs as long as you put enough effort into it. Of course CISCs can compete on one out of power, area or speed, not on all at the same time!
> But for those making cheap embedded controllers, I suspect that die
> size and effectively using somewhat lower FAB technology remains
> useful. So the low-transistor count approaches once the much lauded
> domain of RISC remain important.
Correct. It's no surprise most 32-bit embedded CPUs are RISC.

I think the real lesson was not to adhere to the early dogmas too
strongly. RISC has evolved over time, and so has CISC. RISCs have fixed
their early mistake of thinking too much about the first implementation
rather than the ISA. As we discussed before, RISCs have taken on more
complex features as transistor budgets grew. CISCs have moved from mostly
microcode to mostly pipelined single-cycle instructions. The key features
that differentiate RISC from CISC, both then and now, are all about
instruction set architecture, not implementation.

Wilco
On Feb 20, 9:53 pm, Jonathan Kirwan <jkir...@easystreet.com> wrote:
[snip interesting comments]
> But for those making cheap embedded controllers, I suspect that die
> size and effectively using somewhat lower FAB technology remains
> useful. So the low-transistor count approaches once the much lauded
> domain of RISC remain important.
With the note that code density is also a factor. If the area saved by
simpler decoding comes at the cost of more area in Icache (for the same
performance) or FLASH, then simpler decode is a net loss. Simpler decoding
can also save power, but the reading of larger instructions consumes more
power.

RISC also reduces the design effort required and testing complexity. At
higher volumes design cost becomes less significant; so the balance point
in the trade-offs between code density and implementation complexity
changes (e.g., per chip design cost savings can be translated into larger
chip area). Per chip design cost savings can also be translated into a
better (faster and/or more power-efficient and/or smaller) process
technology.

(Greater design effort [whether from ISA factors or greater effort to
optimize the design for power, performance, and/or area] also increases
scheduling risks; so a sub-optimal ISA or implementation might be safer.
Safer probably means easier access to start-up capital [a double-whammy
because a simpler design also requires less start-up capital]. Of course,
one also cannot trade costs [number of designers] for time to completion
at a fixed rate.)

It might also be noted that a move to multiple cores per chip multiplies
the decode area savings (but does not reduce Icache costs) while shared
FLASH cost remains constant.

As you implied, the trade-offs for a real product are much more complex.

Paul A. Clayton
just a technophile and babbler
On Feb 20, 5:01 pm, Jim Granville <no.s...@designtools.maps.co.nz>
wrote:
> rickman wrote:
>
> > But for RAM to be as efficient as a register file it has to be triple
> > ported so you can read two operands and write back another... or you
> > have to go to an accumulator based design. Once you have triple
> > ported RAM, you have just added a register file! A rose by any other
> > name still smells as sweet...
>
> Correct, that's the hardware level detail.
>
> The really important point, is at the SW level, you now access any small
> clusters of Register-Mappable-RAM variables VERY efficiently indeed,
> using register opcodes.
> - Such clusters of variables are very common in code
> - eg a real-time clock subroutine could be fully coded using register
>   opcodes, with a single RAM-locate operation on entry.
My point is that there is no difference between registers in triple ported RAM and a large register file. If I have 1kB of triple ported RAM, I can play the same game and statically allocate memory to interrupt routines for zero overhead context switching.
> Fast context switching is also now built in. Stack usage drops.
> Lots of benefits, but you DO have to design the chip more as a system,
> and not simply buy and paste-in an IP core.
Stack usage drops because the TMS9900 ISA did not support stacks very well. There was no stack usage on subroutine calls because a link register was used (R12 or R13, IIRC). Before another routine was called the register had to be saved or a full register context change had to be done. This was over 20 years ago, so I may not remember the details correctly. But I remember distinctly that I was initially impressed with the 9900, but eventually realized that this was outdated technology as CPU speeds and RAM densities increased.
> It's also backward compatible. If you are uncomfortable with the
> overlay, or the tools are catching up, just leave the register pointer
> alone, and you have plain-old-vanilla-RISC.
With more limited capabilities due to the register linking for subroutine calls.
> See the XC166, and IIRC the Sun CPUs used to allow
> a partial page overlap, so you could pass params in Ram.Registers, and
> allow locals as well, with very low pointer thrashing.
Passing params in registers is still used without a special register block pointer. With a significant number of registers being used for housekeeping, there is limited utility to using a block pointer. Say you allocate the top 8 registers as "important" registers (I'm sure there is a term for this, but I can't recall it) that must be saved when a routine is called. The lower 8 are considered volatile and can be reused as required or used to pass data. To use the lower 8 and save the upper 8 you need to adjust the register pointer down by 8 cells. You still need to copy some of this data since some of the new upper registers (the old lower 8) are hardware dedicated and will clobber data otherwise. Yes, it can be made to work, but I never saw a big savings in speed or memory usage. Perhaps you saw different applications.
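A minimal C model of the register-pointer scheme being debated here (my sketch, not from the thread); the RAM size, window size and names are assumptions:

```c
#include <stdint.h>

#define REGRAM_WORDS 256          /* register-mappable on-chip RAM (words) */
#define WINDOW_SIZE   16          /* registers visible at any time         */
#define CALL_SHIFT     8          /* how far the window slides on a call   */

static uint32_t regram[REGRAM_WORDS];
static unsigned rp = REGRAM_WORDS - WINDOW_SIZE;   /* register pointer */

/* Rn of the current window - what a "register opcode" would operate on. */
static uint32_t *reg(unsigned n)
{
    return &regram[rp + (n % WINDOW_SIZE)];
}

/* On a call the window slides down by 8: the caller's R0-R7 become the
 * callee's R8-R15, and the callee gets 8 fresh low registers.  No spill
 * or underflow handling in this sketch. */
static void call_enter(void) { rp -= CALL_SHIFT; }
static void call_leave(void) { rp += CALL_SHIFT; }
```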
Wilco Dijkstra wrote:
>
... snip ...
> At the time, yes. Nowadays it is accepted that while RISC still
> has some advantages over CISC (eg. area, power consumption, design
> effort), CISC CPUs can be made as fast as RISC CPUs as long as you
> put enough effort into it. Of course CISCs can compete on one out
> of power, area or speed, not on all at the same time!
Since the fundamental limitation today is propagation time, premium
performance is found in small devices. A chip is obviously a smaller
device than a PCB. Once you shuffle everything, including the dog, into
one chip you can gain performance by adding features. So I predict that
evolution will tend in the CISC direction. We see this in spades in the
embedded arena. Also remember that NOT having to drive off-chip lines
produces a heavy reduction in power and area, and an increase in speed,
all at the same time!

Imagine a chip with 2G memory, a PPC instruction set, and a USB external
interface. It needs 6 pins, possibly 8 to allow for a clock. The memory
would be ECC, since the cells would be rather small and highly subject to
bit drops. All in a 1/4 inch square package! External HD access would
suffer. Probably a simple stack oriented instruction set would be better.
Memory access on such a machine would not be significantly slower than
register access.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
"A man who is right every time is not likely to do very much."
  -- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
  -- Thomas Matthews
"David Brown" <david@westcontrol.removethisbit.com> wrote in message 
news:45dc2bf5$0$31548$8404b019@news.wineasy.se...
> Wilco Dijkstra wrote:
>> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
>> news:45db1975$0$31521$8404b019@news.wineasy.se...
>> RISC and CISC are about instruction set architecture, not implementation
>> (although it does have an effect on the implementation).
>>
> The whole point of RISC is to be able to make a more efficient
> implementation - it is an architectural design philosophy aimed at making
> small and fast (clock speed) implementations.
That's a good summary.
>>> The ColdFire core is very much such a mixed chip - in terms of the ISA,
>>> it is noticeably more RISCy than the 68k (especially the later cores
>>> with their more complex addressing modes), and in terms of its
>>> implementation, it is even more so. Even the original 68k, with its
>>> multiple registers and (mostly) orthogonal instruction set is pretty
>>> RISCy.
>>
>> Well, let's look at 10 features that are typical for most RISCs today:
>>
>> * large uniform register file: no (8 data + 8 address registers)
>
> Typical CISC is 4 to 8 registers, each with specialised uses. Thus the
> 68k is far from typical CISC, and is much more in the middle.
There are various CISCs (eg VAX, MSP430) that have 16 registers, while most RISCs have 32 or more.
>> * load/store architecture: no
>
> The 68k can handle both operands of an ALU instruction in memory, which is
> CISC. The ColdFire can have one in memory, one in a register, which is
> again half-way.
The ColdFire is no different from 68K in this aspect. Most ALU operations can do read/modify/write to memory and the move instruction can access two memory operands.
>> * naturally aligned load/store: no
>
> That is purely an implementation issue for the memory interface. It is
> common that RISC cpus, in keeping with the aim of a small, neat and fast
> implementation, insist on aligned access. But it is not a requirement -
> IIRC, some PPC implementations can access non-aligned data in
> big-endian mode. The ColdFire is certainly more efficient with aligned
> accesses, but they are not a requirement.
Unaligned accesses are non-trivial so most RISCs left it out. However modern CPUs nowadays have much of the required logic (due to hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER) have added this. Hardware designers still hate its complexity, but it is often a software requirement. Quite surprisingly it gives huge speedups in programs that use memcpy a lot.
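A small C illustration of the memcpy point: with hardware unaligned load/store the inner loop can move a word per iteration even when the pointers are misaligned, instead of falling back to a byte loop. This is a sketch, not any particular library's memcpy:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes a word at a time regardless of alignment.  On cores with
 * unaligned load/store support, the two small memcpy() calls compile to
 * a single (possibly unaligned) load and store; without that support the
 * safe fallback is the byte loop at the end. */
void copy_words_unaligned(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (n >= 4) {                 /* one word per pass */
        uint32_t w;
        memcpy(&w, s, 4);            /* unaligned load    */
        memcpy(d, &w, 4);            /* unaligned store   */
        s += 4; d += 4; n -= 4;
    }
    while (n--)                      /* byte tail         */
        *d++ = *s++;
}
```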
>> * simple addressing modes: no (9 variants, yes for ColdFire?)
...
> All in all, the CF modes are only marginally more complex than the PPC
> modes.
It's the (d8 + Ax + Ri*SF) mode that places it in the complex camp. This mode not only uses a separate extension word that needs decoding, but must also perform a shift and 2 additions...
> The instruction set for the PPC contains much more complicated
> instructions than the CF. The 68k has things like division instructions,
> which the CF has dropped.
What PPC instructions are complex? PPC is a subset of POWER just like CF is a subset of 68K, so most of the complex instructions were left out.
> A far more useful (and precise) distinction would be to look at the
> implementation - does the architecture use microcoded instructions? RISC
> cpus, in general, do not - that is one of the guiding principles of using
> RISC in the first place. Traditional CISC use microcode extensively. The
> 68k used microcode for many instructions - the CF does not.
This is misguided. RISC *enables* simple non-microcoded implementations. One can make a microcoded implementation of a RISC, but that doesn't make it any less RISC.
>> * calls place return address in a register: no
>
> More generally speaking, CISCs have special-purpose registers, while
> RISCs have mostly general-purpose registers. Yes, the CF has extra
> functionality on A7 to make it a stack pointer. Putting the return
> address in a register, as done in RISC cpus, is not an advantage - it is a
> consequence of not having a dedicated stack.
It is an advantage as it avoids unnecessary memory traffic - a key goal of RISC.
> If we add in some other features that are a little more implementation
> dependent (and therefore entirely relevant, since that is the reason for
> RISC in the first place), things are a bit different:
>
> * Single-cycle register-only instructions: yes
> * Short execution pipeline: yes
> * (Mostly) microcode-free core: yes
> * Short and fast instruction decode: half point
> * Low overhead branches: yes
> * Stall-free for typical instruction streams: yes
>
> Suddenly the scores are looking a bit different.
I don't see how the scores change at all. Most of the features you mention are "yes" for 68K implementations (except for the original 68000 which scores 4 out of 6), ColdFire and ARM.
> Perhaps we could compare the CF to traditional CISC features:
>
> * Specialised accumulator: no
Many famous CISCs are not accumulator based, eg PDP, VAX, 68K, System/360 etc. Accumulators are typically used in 8-bitters where most instructions are 1 or 2 bytes for good codesize.
> * Microcoded instructions: no
Implementation detail. CF is still complex enough that microcoded implementations might be a good choice.
> * Looped instructions: no
Loop mode is just an implementation optimization that could be done on any architecture.
> * Direct memory-to-memory operations: no
Eh, what does move.l (a0),(a1) do? It's valid on CF.
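In C terms that one ColdFire instruction is a load plus a store in a single instruction; a trivial sketch:

```c
#include <stdint.h>

/* move.l (a0),(a1): read a 32-bit word via a0 and write it via a1 -
 * a direct memory-to-memory operation in one instruction. */
static void move_l(const uint32_t *a0, uint32_t *a1)
{
    *a1 = *a0;
}
```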
> * Bottlenecks due to register or flag conflicts: not often
> * Long pipelines: no
Longer than an equivalent RISC (mainly due to needing 2 memory accesses per instruction and more complex decoding). And likely longer than a simpler microcoded implementation.
> As I said, with the Thumb-2, the ARM is gaining the CISC feature of
> variable length instructions - I did not say it is changing into a CISC
> architecture. The real world is grey - there is no dividing line between
> CISC and RISC, merely a collection of characteristics that some chips have
> and others don't.
Sure, there is always a grey area in the middle, but most ISAs clearly fall in either camp. If you use my rules, can you mention one that scores 4 or 5?
> Adding these variable length instructions is a good thing, if it doesn't
> cost too much at the decoder. It increases both code density and
> instruction speed, since it opens the path for 32-bit immediate data (or
> addresses) to be included directly in a single instruction.
Actually, embedding large immediates in the instruction stream is bad for codesize because they cannot be shared. For Thumb-2 the main goal was to allow access to 32-bit ARM instructions for cases where a single 16-bit instruction was not enough. Thumb-2 doesn't have full 32-bit immediates like 68K/CF.
> My point is not that the CF is a RISC core - I never claimed it was. But
> neither is it a CISC core in comparison to, say, the x86 architecture. If
> there were such a thing as a scale running from pure RISC to pure CISC,
> then the CF lies near the middle. It is not as RISCy as the ARM, but is
> somewhat RISCier than the original 68k.
I agree CF is less CISCy than 68K but it is still more CISCy than x86. If
it dropped 2 memory operands, removed ALU+memory operations, 32-bit
immediates and the absolute and (d8 + Ax + Ri*SF) addressing modes then I
would agree it is a RISC...

Wilco
Wilco Dijkstra wrote:
> "Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message
> news:gicnt214v581okaluheha0sietn16fsvn5@4ax.com...
>
>> On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville
>> <no.spam@designtools.maps.co.nz> wrote:
>>
>>> Wilco Dijkstra wrote:
>>>
>>>> While SRAM is faster than flash, it wouldn't be fast enough to be
>>>> used like a register in a simple MCU. On ARM7 for example,
>>>> register read, ALU operation and register write all happen within
>>>> one clock cycle. With SRAM the cycle time would become 3-4
>>>> times as long (not to mention power consumption).
>>>
>>> To get a handle on what on-chip, small RAM speeds can achieve, in real
>>> silicon, look at the FPGA block sync RAMs - those are smallish blocks,
>>> dual ported, and plenty fast enough to keep up with the cycle times of a
>>> CPU.
>>> I don't see FPGA CPUs being held back by their 'slow SRAM',
>>> as you claim?
>>> RAM based DSPs are now pushing 1GHz, and that's larger chunks
>>> of RAM than are needed for register-mapped memory.
>
> Jim, you earlier wrote "I think you missed my uC = microcontroller." -
> I don't think 1GHz DSPs/FPGAs are microcontrollers. Yes, high-end
> SRAMs on advanced processes easily reach 1GHz, but my point
> (and I think Rick's) is that registers are much faster still.
FPGAs are certainly used as microcontrollers, and in increasing volumes. CPU designers had better be aware of just what an FPGA soft CPU can do these days, as they are replacing uCs in some designs.
>
>> I just dumped my message in progress on this -- you said what I wanted
>> to say very clearly. I use such DSPs. I think Wilco must be stuck
>> thinking in terms of external bus drivers where what is connected is
>> unknown and the bus interface designer must work to worst cases. Too
>> much ARM, perhaps?
>
> No, not at all. I'm talking about needing to access the SRAM several
> times per cycle to read/write the registers (as explained in the first
> paragraph). Therefore a CPU using SRAM rather than registers can only
> run at a fraction of the SRAM's speed.
?
> A register file is a small dedicated structure designed for very high
> random access bandwidth. SRAM simply can't achieve that.
OK, I'll try one more time. You seem to be stuck on a restrictive use of
SRAM, so I'll use different words.

Let's take a sufficiently skilled chip designer, who knows various RAM
structures, and who will not use vanilla SRAM (as Rick has already
mentioned) but something more like the dual port sync RAM of the FPGAs
I gave as an example. Yes, this RAM is more complex than the simplest RAM
(which is why Infineon keep the size to 1-2K), but it buys you many
benefits on a uC, and the die size impact of such RAM is still tiny.

Q: What percentage of a RISC (eg ARM) die is taken by the registers
themselves?
A: A minuscule fraction, << 0.1%

Q: Why not apply more of the die to fix what is a real code/performance
bottleneck?
Let's increase the size of these 'blazingly fast' registers, and
local-RAM-overlay them, to reap the benefits, until we hit a point where
their time-impact matches the code access, or get to 1-2K (or a small %
of die), whichever comes first.

Taking other devices as a yardstick, it's likely to hit the size corner
before it hits the speed corner, but a designer will watch both effects.

[The AVR RAM-overlays its registers, they just forgot to allow that
overlay to move - partly because the very first AVR had no RAM, and it
was a later bolt-on.]

There is NO design brick wall that says when you go over 16 or 32
registers, suddenly it is impossible to get fast access: FPGA designers
are doing that now, in real silicon.

Larger CPU designers put their efforts into cache RAM (which is also
specialised SRAM memory) - but that has its own trade-offs, and better
solutions exist for microcontroller-focused, real time designs.

-jg
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message 
news:45dc98c5@clear.net.nz...
> Wilco Dijkstra wrote:
> FPGAs are certainly used as microcontrollers, and in increasing volumes.
> CPU designers had better be aware of just what an FPGA soft CPU can do
> these days, as they are replacing uCs in some designs.
They can indeed, but FPGA prices need to come down a lot before it becomes a good idea in a high volume design. I've worked with big FPGA stacks for CPU emulation, and large/fast FPGAs can cost well into five figures apiece. Even the smallest ARM uses a big chunk of a large FPGA, so you can only use very simple CPUs in a small FPGA.
>> A register file is a small dedicated structure designed for very high
>> random access bandwidth. SRAM simply can't achieve that.
>
> OK, I'll try one more time.
> You seem to be stuck on a restrictive use of SRAM, so I'll use different
> words.
> Let's take a sufficiently skilled chip designer, who knows various RAM
> structures, and who will not use vanilla SRAM (as Rick has already
> mentioned) but something more like the dual port sync RAM of the FPGAs
> I gave as an example.
> Yes, this RAM is more complex than the simplest RAM (which is why
> Infineon keep the size to 1-2K), but it buys you many benefits on a uC,
> and the die size impact of such RAM is still tiny.
I'm with you. Inventing a new kind of SRAM with 3 read and 2 write ports would do the job indeed. But it is going to be big compared to using standard single ported SRAM, so there needs to be a major advantage.
> Q: What percentage of a RISC (eg ARM) die is taken by the registers
> themselves?
> A: A minuscule fraction, << 0.1%
On ARM7tdmi it is around 5% for 32 registers with 2 read and 1 write port. An ARM7tdmi is as large as 5KB of SRAM. Assuming flash is 4 times as dense, a typical MCU with 128KB flash and 16KB of RAM still has about 0.5% of the area devoted to registers (128KB of flash is roughly 32KB of SRAM-equivalent area; add the 16KB RAM and the 5KB-equivalent core and you get about 53KB, of which the 0.25KB-equivalent register file is roughly 0.5%).
> Q: Why not apply more of the die to fix what is a real code/performance
> bottleneck?
What bottleneck? You lost me here... Adding more registers doesn't automatically improve performance.
> Let's increase the size of these 'blazingly fast' registers, and
> local-RAM-overlay them, to reap the benefits, until we hit a point where
> their time-impact matches the code access, or get to 1-2K (or a small %
> of die), whichever comes first.
>
> Taking other devices as a yardstick, it's likely to hit the size corner
> before it hits the speed corner, but a designer will watch both effects.
At 2KB it would double the size of an ARM7tdmi and slow it down a lot without a redesigned pipeline (it needs to support 2 accesses in less than half a cycle at up to 120MHz, so the SRAM would need to run at 500MHz). I think you're assuming register read/write is not already critical in existing CPUs - it often is. However what use do you have for 256/512 fast registers? You can only access 16 at any time...
> There is NO design brick wall that says when you go over 16 or 32
> registers, suddenly it is impossible to get fast access: FPGA designers
> are doing that now, in real silicon.
Sure, it just gets progressively slower with size and number of ports.

Wilco
rickman wrote:
> On Feb 20, 5:01 pm, Jim Granville <no.s...@designtools.maps.co.nz>
> wrote:
>
>> rickman wrote:
>>
>>> But for RAM to be as efficient as a register file it has to be triple
>>> ported so you can read two operands and write back another... or you
>>> have to go to an accumulator based design. Once you have triple
>>> ported RAM, you have just added a register file! A rose by any other
>>> name still smells as sweet...
>>
>> Correct, that's the hardware level detail.
>>
>> The really important point, is at the SW level, you now access any small
>> clusters of Register-Mappable-RAM variables VERY efficiently indeed,
>> using register opcodes.
>> - Such clusters of variables are very common in code
>> - eg a real-time clock subroutine could be fully coded using register
>>   opcodes, with a single RAM-locate operation on entry.
>
> My point is that there is no difference between registers in triple
> ported RAM and a large register file. If I have 1kB of triple ported
> RAM, I can play the same game and statically allocate memory to interrupt
> routines for zero overhead context switching.
Yes, and to do that zero-overhead context switch, you need a register frame pointer (or similar). You do not want to use this as a mere smart-stack, but to allow all the Reg opcodes to work on any window into that larger memory. Triple ported or dual port depends more on the core in question.

Even the lowly 80c51 has a register frame pointer (all of 2 bits in size), and it does overlay the registers with RAM. The Z8 expands this to 8 bits, and I think the XC166 uses a 16 bit one.

<snip>
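A minimal sketch (assumptions mine) of the 80C51-style overlay Jim describes: R0-R7 live in the first 32 bytes of internal RAM, and a 2-bit bank-select field (RS1:RS0 in the real PSW) picks which 8-byte slice they map to, so an interrupt handler can switch banks instead of pushing registers:

```c
#include <stdint.h>

static uint8_t iram[128];      /* internal RAM; register banks overlay 0x00-0x1F */
static unsigned bank = 0;      /* 2-bit bank select (RS1:RS0 in the real PSW)    */

/* Effective internal-RAM address of register Rn in the current bank. */
static uint8_t *reg(unsigned n)
{
    return &iram[(bank & 3) * 8 + (n & 7)];
}

/* Zero-overhead context switch: an ISR just selects another bank and
 * gets a fresh R0-R7, with nothing pushed to the stack. */
static void isr_enter(void) { bank = 1; }
static void isr_leave(void) { bank = 0; }
```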
>> It's also backward compatible. If you are uncomfortable with the
>> overlay, or the tools are catching up, just leave the register pointer
>> alone, and you have plain-old-vanilla-RISC.
>
> With more limited capabilities due to the register linking for
> subroutine calls.
? - you've lost me here. In subset mode, you simply ignore the register
frame pointer, and it is _exactly_ the same as your un-enhanced core.

-jg
