
PIC vs ARM assembler (no flamewar please)

Started by Unknown February 14, 2007
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:gicnt214v581okaluheha0sietn16fsvn5@4ax.com...
> On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville
> <no.spam@designtools.maps.co.nz> wrote:
>
>> Wilco Dijkstra wrote:
>>>
>>> While SRAM is faster than flash, it wouldn't be fast enough to be
>>> used like a register in a simple MCU. On ARM7 for example,
>>> register read, ALU operation and register write all happen within
>>> one clock cycle. With SRAM the cycle time would become 3-4
>>> times as long (not to mention power consumption).
>>
>> To get a handle on what on-chip, small RAM speeds can achieve, in real
>> silicon, look at the FPGA block sync RAMs - those are smallish blocks,
>> dual ported, and plenty fast enough to keep up with the cycle times of a
>> CPU.
>> I don't see FPGA CPUs being held back by their 'slow SRAM',
>> as you claim?
>> RAM based DSPs are now pushing 1GHz, and that's larger chunks
>> of RAM than are needed for register-mapped memory.
Jim, you earlier wrote "I think you missed my uC = microcontroller." -
I don't think 1GHz DSPs/FPGAs are microcontrollers. Yes, high-end
SRAMs on advanced processes easily reach 1GHz, but my point
(and I think Rick's) is that registers are much faster still.
> I just dumped my message in progress on this -- you said what I wanted
> to say very clearly. I use such DSPs. I think Wilco must be stuck
> thinking in terms of external bus drivers where what is connected is
> unknown and the bus interface designer must work to worst cases. Too
> much ARM, perhaps?
No, not at all. I'm talking about needing to access the SRAM several
times per cycle to read/write the registers (as explained in the first
paragraph). Therefore a CPU using SRAM rather than registers can only
run at a fraction of the SRAM's speed.

A register file is a small dedicated structure designed for very high
random access bandwidth. SRAM simply can't achieve that.

Wilco
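To make the port-count point concrete, here is a minimal C sketch (mine, not from the thread); the 16-entry file and the function names are illustrative assumptions:

```c
/* Sketch: one RISC-style ALU instruction, e.g. ADD Rd, Rn, Rm, touches
 * the register file three times in a single cycle - two operand reads
 * and one result write.  A dedicated register file provides those ports
 * in parallel; a single-ported SRAM would have to serialise the three
 * accesses, stretching the cycle time accordingly. */
#include <stdint.h>

typedef struct {
    uint32_t r[16];              /* illustrative 16-entry register file */
} regfile_t;

static void alu_add(regfile_t *rf, unsigned rd, unsigned rn, unsigned rm)
{
    uint32_t a = rf->r[rn];      /* read port 1  } all three happen      */
    uint32_t b = rf->r[rm];      /* read port 2  } within one CPU cycle  */
    rf->r[rd] = a + b;           /* write port   } on a real core        */
}
```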
Wilco Dijkstra wrote:
> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
> news:45db1975$0$31521$8404b019@news.wineasy.se...
>
>> However, there is no fixed distinction between RISC and CISC. The two
>> terms refer to a range of characteristics commonly associated with RISC
>> cpus and CISC cpus. Some chips clearly fall into one camp or the other,
>> but most have at least slightly mixed characteristics.
>
> RISC and CISC are about instruction set architecture, not implementation
> (although it does have an effect on the implementation).
>
The whole point of RISC is to be able to make a more efficient implementation - it is an architectural design philosophy aimed at making small and fast (clock speed) implementations.
>> The ColdFire core is very much such a mixed chip - in terms of the ISA, it
>> is noticeably more RISCy than the 68k (especially the later cores with
>> their more complex addressing modes), and in terms of its implementation,
>> it is even more so. Even the original 68k, with its multiple registers and
>> (mostly) orthogonal instruction set is pretty RISCy.
>
> Well, let's look at 10 features that are typical for most RISCs today:
>
> * large uniform register file: no (8 data + 8 address registers)
Typical CISC is 4 to 8 registers, each with specialised uses. Thus the 68k is far from typical CISC, and is much more in the middle.
> * load/store architecture: no
The 68k can handle both operands of an ALU instruction in memory, which is CISC. The ColdFire can have one in memory, one in a register, which is again half-way.
> * naturally aligned load/store: no
That is purely an implementation issue for the memory interface. It is common that RISC cpus, in keeping with the aim of a small, neat and fast implementation, insist on aligned access. But it is not a requirement - IIRC, some PPC implementations can access non-aligned data in big-endian mode. The ColdFire is certainly more efficient with aligned accesses, but they are not a requirement.
> * simple addressing modes: no (9 variants, yes for ColdFire?)
The addressing modes for a ColdFire "move" instruction are:

  Rx, (Ax), (Ax)+, -(Ax), (d16 + Ax), (d8 + Ax + Ri*SF), xxx.w, xxx.l, #xxx

The source and destination addressing modes can be mixed as long as only
one of them needs an extension word. The 68k had several other modes in
its later generations, and they could be freely mixed for the source and
destination.

I am not familiar enough with the ARM (it's 17 years since I programmed
one), but if we look at the PPC, it has addressing modes roughly
equivalent to:

  Rx, (Rx), (d16 + Rx), (Rx + Ry), xxx.w

Using update versions of the instructions, you get something much like
the (Ax)+ and -(Ax) modes as well as more complex modes. All in all, the
CF modes are only marginally more complex than the PPC modes. The big
difference, however, is that the CF can use these modes on ALU
instructions and not just for loads and stores - but that has already
been counted above.
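For readers who don't think in 68k syntax, here is a rough C sketch of what those source addressing modes compute; the names (ax, d8, d16, ri) and the scale factor SF are illustrative assumptions, and the two absolute forms are both shown as a plain pointer dereference:

```c
#include <stdint.h>

/* Each statement mirrors one addressing mode from the list above. */
uint32_t addressing_examples(uint32_t rx, uint32_t *ax, int32_t d16,
                             int32_t d8, uint32_t ri, uint32_t *abs_addr,
                             uint32_t imm)
{
    enum { SF = 4 };        /* index scale factor: 1, 2 or 4 */
    uint32_t v = 0;

    v += rx;                                              /* Rx                */
    v += *ax;                                             /* (Ax)              */
    v += *ax++;                                           /* (Ax)+             */
    v += *--ax;                                           /* -(Ax)             */
    v += *(uint32_t *)((uint8_t *)ax + d16);              /* (d16 + Ax)        */
    v += *(uint32_t *)((uint8_t *)ax + d8 + ri * SF);     /* (d8 + Ax + Ri*SF) */
    v += *abs_addr;                                       /* xxx.w / xxx.l     */
    v += imm;                                             /* #xxx              */
    return v;
}
```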
> * fixed instruction sizes: no
> * simple instructions: no (yes for ColdFire)
The instruction set for the PPC contains much more complicated instructions than the CF. The 68k has things like division instructions, which the CF has dropped. A far more useful (and precise) distinction would be to look at the implementation - does the architecture use microcoded instructions? RISC cpus, in general, do not - that is one of the guiding principles of using RISC in the first place. Traditional CISC use microcode extensively. The 68k used microcode for many instructions - the CF does not.
> * calls place return address in a register: no
More generally speaking, CISCs have special-purpose registers, while RISCs have mostly general-purpose registers. Yes, the CF has extra functionality on A7 to make it a stack pointer. Putting the return address in a register, as done in RISC cpus, is not an advantage - it is a consequence of not having a dedicated stack.
> * 3 operand ALU instructions: no
> * ALU instructions do not corrupt flags: no
> * delayed branch: no
>
> So that is 0 for 68K, 2 for ColdFire. ARM scores 8, Thumb scores 6,
> Thumb-2 7. MIPS scores 10 (very pure). This clearly shows 68K and
> ColdFire are CISCs, while the rest are RISCs.
>
If we add in some other features that are a little more implementation
dependent (and therefore entirely relevant, since that is the reason for
RISC in the first place), things are a bit different:

* Single-cycle register-only instructions: yes
* Short execution pipeline: yes
* (Mostly) microcode-free core: yes
* Short and fast instruction decode: half point
* Low overhead branches: yes
* Stall-free for typical instruction streams: yes

Suddenly the scores are looking a bit different. Perhaps we could compare
the CF to traditional CISC features:

* Specialised accumulator: no
* Specialised frame pointer: no
* Specialised index registers: no
* Microcoded instructions: no
* Looped instructions: no
* Direct memory-to-memory operations: no
* Bottlenecks due to register or flag conflicts: not often
* Long pipelines: no
* Register renaming needed for fast implementation: no
* Unaligned code: no
* Highly variable instruction length: half (only 1, 2, or 3 16-bit words)
* Instruction prefix codes: no

I could go on - and I expect you could too.
>> So the ARM is moving from a fairly pure RISC architecture, through the
>> Thumb (with its more CISCy smaller register set and more specialised
>> register usage) and now Thumb-2 (with variable length instructions). It's
>> gaining CISC attributes in a move to improve code density at the expense
>> of more complex instruction decoding.
>
> Yes, RISCs have become more complex. However that doesn't make
> them CISC! Although ARM is not a pure RISC to start with, Thumb-1
> and Thumb-2 are only slightly more complex and still have most of
> the RISC characteristics.
>
As I said, with the Thumb-2, the ARM is gaining the CISC feature of variable length instructions - I did not say it is changing into a CISC architecture. The real world is grey - there is no dividing line between CISC and RISC, merely a collection of characteristics that some chips have and others don't. Adding these variable length instructions is a good thing, if it doesn't cost too much at the decoder. It increases both code density and instruction speed, since it opens the path for 32-bit immediate data (or addresses) to be included directly in a single instruction.
>> The ColdFire, on the other hand, has moved from the original 68k to a more
>> RISCy core, with a much greater emphasis on single-cycle
>> register-to-register instructions and a simpler and more efficient core,
>> in order to improve performance and lead to a smaller implementation.
>
> Indeed, it has gained 2 points by removing some of the complex
> microcoded instructions and addressing modes, thus allowing a simpler,
> more pipelined implementation. But it clearly doesn't make it a RISC
> like the marketing people want us to believe...
>
My point is not that the CF is a RISC core - I never claimed it was. But neither is it a CISC core in comparison to, say, the x86 architecture. If there were such a thing as a scale running from pure RISC to pure CISC, then the CF lies near the middle. It is not as RISCy as the ARM, but is somewhat RISCier than the original 68k.
>> There are still plenty of differences between the architectures, but there
>> is no doubt that there are a lot more similarities between the ARM Thumb-2
>> and the ColdFire than between the original ARM and the original 68k.
>
> I'd say that any similarities only exist on a superficial level. For example
My original comment was pretty superficial.
> the variable length instructions in Thumb-2 are easier to decode than 68K
> or ColdFire.
>
The hard ones from the 68k were dropped in the ColdFire, precisely to allow a faster, more RISC-style decoder.
>>> There are few RISCs with variable length instructions.
>>>
>> The AVR? I can't think of any others.
>
> Hitachi SH and ARC for example.
I haven't looked at them, so I'm happy to take your word for it.
>
> Wilco
>
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:hoant2phqt3m4agivtnqa1nc1t6odhv9ha@4ax.com...
> On Wed, 21 Feb 2007 00:02:27 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>
>> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
>> news:45db1975$0$31521$8404b019@news.wineasy.se...
>>
>>> However, there is no fixed distinction between RISC and CISC. The two
>>> terms refer to a range of characteristics commonly associated with RISC
>>> cpus and CISC cpus. Some chips clearly fall into one camp or the other,
>>> but most have at least slightly mixed characteristics.
>>
>> RISC and CISC are about instruction set architecture, not implementation
>> (although it does have an effect on the implementation).
>> <snip>
>
> I respect your knowledge and skill, Wilco, but I cannot agree with
> this as I understand you writing it here based upon my experiences.
You're free to disagree but there is consensus about what RISC and CISC
are. It's unfortunate that many confuse ISA and implementation...

Please read this excellent article by John Mashey:
http://yarchive.net/comp/risc_definition.html

Hennessy & Patterson's "Computer Architecture: A Quantitative Approach" is
well worth reading too.
> I spent 1-on-1 time with Hennessy and listened to the reasoning he
> used. RISC was all about thinking in detailed terms of practical
> implementation. They were faced with access to lower-technology FABs
> (larger feature sizes, fewer transmission gates and inverters, etc.)
> and wanted to achieve more with less. Doing that was everything about
> implementation and the instruction set architecture was allowed to go
> where it must. That this worked out to being a 'reduced instruction
> set' was something that came out of achieving competing performance
> out of lower-tech FAB capability than folks like Intel or Motorola had
> available to their flagship lines of the day.
It is true that in those early days they wanted to cram a complete CPU on a single die (including caches to speed up memory access) and the only way to achieve that was to throw out everything unnecessary. Those days are long gone, transistor budgets are much larger now. Today all CPUs, whether RISC or CISC, use the same implementation techniques to achieve high performance.
> There was a design philosophy based upon theory -- that was simply the
> realization that many of the things that slowed down a CISC was also a
> matter of perceived convenience for programmers, so the policy was
> then to get rid of anything and everything that slowed down the clock
> rate without paying _well_ for that delay. A focus on throughput. The
> fact that removing barriers to speed also happened to reduce the need
> for more transistor equivalents was the happy coincidence that fueled
> the initiative. The instructions were a result of the application of
> focusing on implementation details -- not some instruction set theory
> under which the implementation then followed. If higher level features
> were cheap to implement and paid for themselves in performance, they
> were simply kept. Very practical, hard nosed approach.
Again, it is true that in the early days the focus was on getting performance without much regard for anything else. However saying that the instruction set design followed from the implementation is incorrect. RISC started as a reaction against the CISC goal of "closing the semantic gap" after IBM studies showed only a few simple instructions were used 90% of the time. It's about taking a quantitative approach to instruction set design. RISC takes the interaction between the various components of a complete system into account (compiler, ISA, implementation). The result of this is a particular set of features in the ISA, not in the implementation. A microcoded RISC is still a RISC, a pipelined CISC is still a CISC!
> If you ever listened to such a lecture by those actually doing the
> work, you'd see this narrow focus. The register flags that signalled
> whether or not a register was in-use as a destination were tossed as
> too expensive -- they required infrastructure in order to delay the
> processor and the combinatorial worst-case path of the whole of that
> meant additional __delay__ in each clock cycle, whether or not this
> interlock was useful instruction to instruction. You paid for it on
> every cycle, need it or not. So out it went. No interlocks. Sorry.
> Similar thinking was involved in the Alpha's refusal to do 'lane
> changes,' for example.
Those were mistakes indeed that were corrected in later RISCs. Some of the early ideologies were taken too far, and concentrated too much on a single implementation rather than on the ISA (which lives for many implementations). Going for all-out clock speed without thinking about power consumption, codesize, ease of compiler design etc is a bad idea. Many early RISCs ended up with features that were found to have a negative impact in the end (either in software or in later CPUs). Alpha byte access is a great example of this; delayed branches are another. MIPS quickly realised the silliness of omitting interlocks. :-)
> Some of the difficulties were higher memory bandwidths required, once
> you started tossing out stuff like register interlocks, microstore and
> its associated sequencing overhead, lane changing, etc. But if that
> could be satisfied, and that was kind of possible at the time with
> some static ram from performance semi, it would perform like a bat out
> of hell. So to speak.
>
> But the focus was on implementation on lower-tech FABs and, while
> doing that, still competing with CISC and beating it.
At the time, yes. Nowadays it is accepted that while RISC still has some advantages over CISC (eg. area, power consumption, design effort), CISC CPUs can be made as fast as RISC CPUs as long as you put enough effort into it. Of course CISCs can compete on one out of power, area or speed, not on all at the same time!
> But for those making cheap embedded controllers, I suspect that die
> size and effectively using somewhat lower FAB technology remains
> useful. So the low-transistor count approaches once the much lauded
> domain of RISC remain important.
Correct. It's no surprise most 32-bit embedded CPUs are RISC.

I think the real lesson was not to adhere to the early dogmas too
strongly. RISC has evolved over time, and so has CISC. RISCs have fixed
their early mistake of thinking too much about the first implementation
rather than the ISA. As we discussed before, RISCs have taken on more
complex features as transistor budgets grew. CISCs have moved from mostly
microcode to mostly pipelined single-cycle instructions. The key features
that differentiate RISC from CISC, both then and now, are all about
instruction set architecture, not implementation.

Wilco
On Feb 20, 9:53 pm, Jonathan Kirwan <jkir...@easystreet.com> wrote:
[snip interesting comments]
> But for those making cheap embedded controllers, I suspect that die
> size and effectively using somewhat lower FAB technology remains
> useful. So the low-transistor count approaches once the much lauded
> domain of RISC remain important.
With the note that code density is also a factor. If the area saved by
simpler decoding comes at the cost of more area in Icache (for the same
performance) or FLASH, then simpler decode is a net loss. Simpler decoding
can also save power, but the reading of larger instructions consumes more
power.

RISC also reduces the design effort required and testing complexity. At
higher volumes design cost becomes less significant; so the balance point
in the trade-offs between code density and implementation complexity
changes (e.g., per chip design cost savings can be translated into larger
chip area). Per chip design cost savings can also be translated into a
better (faster and/or more power-efficient and/or smaller) process
technology.

(Greater design effort [whether from ISA factors or greater effort to
optimize the design for power, performance, and/or area] also increases
scheduling risks; so a sub-optimal ISA or implementation might be safer.
Safer probably means easier access to start-up capital [a double-whammy
because a simpler design also requires less start-up capital]. Of course,
one also cannot trade costs [number of designers] for time to completion
at a fixed rate.)

It might also be noted that a move to multiple cores per chip multiplies
the decode area savings (but does not reduce Icache costs) while shared
FLASH cost remains constant.

As you implied, the trade-offs for a real product are much more complex.

Paul A. Clayton
just a technophile and babbler
On Feb 20, 5:01 pm, Jim Granville <no.s...@designtools.maps.co.nz>
wrote:
> rickman wrote:
>
> > But for RAM to be as efficient as a register file it has to be triple
> > ported so you can read two operands and write back another... or you
> > have to go to an accumulator based design. Once you have triple
> > ported RAM, you have just added a register file! A rose by any other
> > name still smells as sweet...
>
> Correct, that's the hardware level detail.
>
> The really important point, is at the SW level, you now access any small
> clusters of Register-Mappable-RAM variables VERY efficiently indeed,
> using register opcodes.
> - Such clusters of variables are very common in code
> - eg a real-time clock subroutine could be fully coded using register
>   opcodes, with a single RAM-locate operation on entry.
My point is that there is no difference between registers in triple ported RAM and a large register file. If I have 1kB of triple ported RAM, I can play the same game and statically allocate memory to interrupt routines for zero overhead context switching.
> Fast context switching is also now built in. Stack usage drops.
> Lots of benefits, but you DO have to design the chip more as a system,
> and not simply buy and paste-in an IP core.
Stack usage drops because the TMS9900 ISA did not support stacks very well. There was no stack usage on subroutine calls because a link register was used (R12 or R13, IIRC). Before another routine was called the register had to be saved or a full register context change had to be done. This was over 20 years ago, so I may not remember the details correctly. But I remember distinctly that I was initially impressed with the 9900, but eventually realized that this was outdated technology as CPU speeds and RAM densities increased.
> It's also backward compatible. If you are uncomfortable with the
> overlay, or the tools are catching up, just leave the register pointer
> alone, and you have plain-old-vanilla-RISC.
With more limited capabilities due to the register linking for subroutine calls.
> See the XC166, and IIRC the Sun CPUs used to allow
> a partial page overlap, so you could pass params in Ram.Registers, and
> allow locals as well, with very low pointer thrashing.
Passing params in registers is still used without a special register block pointer. With a significant number of registers being used for housekeeping, there is limited utility to using a block pointer. Say you allocate the top 8 registers as "important" registers (I'm sure there is a term for this, but I can't recall it) that must be saved when a routine is called. The lower 8 are considered volatile and can be reused as required or used to pass data. To use the lower 8 and save the upper 8 you need to adjust the register pointer down by 8 cells. You still need to copy some of this data since some of the new upper registers (the old lower 8) are hardware dedicated and will clobber data otherwise. Yes, it can be made to work, but I never saw a big savings in speed or memory usage. Perhaps you saw different applications.
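A minimal C model of the register-pointer scheme being debated here (my sketch, not from the thread); the RAM size, window size and names are assumptions:

```c
#include <stdint.h>

#define REGRAM_WORDS 256          /* register-mappable on-chip RAM (words) */
#define WINDOW_SIZE   16          /* registers visible at any time         */
#define CALL_SHIFT     8          /* how far the window slides on a call   */

static uint32_t regram[REGRAM_WORDS];
static unsigned rp = REGRAM_WORDS - WINDOW_SIZE;   /* register pointer */

/* Rn of the current window - what a "register opcode" would operate on. */
static uint32_t *reg(unsigned n)
{
    return &regram[rp + (n % WINDOW_SIZE)];
}

/* On a call the window slides down by 8: the caller's R0-R7 become the
 * callee's R8-R15, and the callee gets 8 fresh low registers.  No spill
 * or underflow handling in this sketch. */
static void call_enter(void) { rp -= CALL_SHIFT; }
static void call_leave(void) { rp += CALL_SHIFT; }
```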
Wilco Dijkstra wrote:
>
... snip ...
> At the time, yes. Nowadays it is accepted that while RISC still
> has some advantages over CISC (eg. area, power consumption, design
> effort), CISC CPUs can be made as fast as RISC CPUs as long as you
> put enough effort into it. Of course CISCs can compete on one out
> of power, area or speed, not on all at the same time!
Since the fundamental limitation today is propagation time, premium
performance is found in small devices. A chip is obviously a smaller
device than a PCB. Once you shuffle everything, including the dog, into
one chip you can gain performance by adding features. So I predict that
evolution will tend in the CISC direction. We see this in spades in the
embedded arena. Also remember that NOT having to drive off-chip lines
produces a heavy reduction in power and area, and an increase in speed,
all at the same time!

Imagine a chip with 2G memory, a PPC instruction set, and a USB external
interface. It needs 6 pins, possibly 8 to allow for a clock. The memory
would be ECC, since the cells would be rather small and highly subject to
bit drops. All in a 1/4 inch square package! External HD access would
suffer. Probably a simple stack oriented instruction set would be better.
Memory access on such a machine would not be significantly slower than
register access.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
"A man who is right every time is not likely to do very much."
  -- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
  -- Thomas Matthews
"David Brown" <david@westcontrol.removethisbit.com> wrote in message 
news:45dc2bf5$0$31548$8404b019@news.wineasy.se...
> Wilco Dijkstra wrote:
>> "David Brown" <david@westcontrol.removethisbit.com> wrote in message
>> news:45db1975$0$31521$8404b019@news.wineasy.se...
>> RISC and CISC are about instruction set architecture, not implementation
>> (although it does have an effect on the implementation).
>>
> The whole point of RISC is to be able to make a more efficient
> implementation - it is an architectural design philosophy aimed at making
> small and fast (clock speed) implementations.
That's a good summary.
>>> The ColdFire core is very much such a mixed chip - in terms of the ISA,
>>> it is noticeably more RISCy than the 68k (especially the later cores
>>> with their more complex addressing modes), and in terms of its
>>> implementation, it is even more so. Even the original 68k, with its
>>> multiple registers and (mostly) orthogonal instruction set is pretty
>>> RISCy.
>>
>> Well, let's look at 10 features that are typical for most RISCs today:
>>
>> * large uniform register file: no (8 data + 8 address registers)
>
> Typical CISC is 4 to 8 registers, each with specialised uses. Thus the
> 68k is far from typical CISC, and is much more in the middle.
There are various CISCs (eg VAX, MSP430) that have 16 registers, while most RISCs have 32 or more.
>> * load/store architecture: no
>
> The 68k can handle both operands of an ALU instruction in memory, which is
> CISC. The ColdFire can have one in memory, one in a register, which is
> again half-way.
The ColdFire is no different from 68K in this aspect. Most ALU operations can do read/modify/write to memory and the move instruction can access two memory operands.
>> * naturally aligned load/store: no
>
> That is purely an implementation issue for the memory interface. It is
> common that RISC cpus, in keeping with the aim of a small, neat and fast
> implementation, insist on aligned access. But it is not a requirement -
> IIRC, some PPC implementations can access non-aligned data in
> big-endian mode. The ColdFire is certainly more efficient with aligned
> accesses, but they are not a requirement.
Unaligned accesses are non-trivial so most RISCs left it out. However modern CPUs nowadays have much of the required logic (due to hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER) have added this. Hardware designers still hate its complexity, but it is often a software requirement. Quite surprisingly it gives huge speedups in programs that use memcpy a lot.
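A small C illustration of the memcpy point: with hardware unaligned load/store the inner loop can move a word per iteration even when the pointers are misaligned, instead of falling back to a byte loop. This is a sketch, not any particular library's memcpy:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes a word at a time regardless of alignment.  On cores with
 * unaligned load/store support, the two small memcpy() calls compile to
 * a single (possibly unaligned) load and store; without that support the
 * safe fallback is the byte loop at the end. */
void copy_words_unaligned(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (n >= 4) {                 /* one word per pass */
        uint32_t w;
        memcpy(&w, s, 4);            /* unaligned load    */
        memcpy(d, &w, 4);            /* unaligned store   */
        s += 4; d += 4; n -= 4;
    }
    while (n--)                      /* byte tail         */
        *d++ = *s++;
}
```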
>> * simple addressing modes: no (9 variants, yes for ColdFire?)
...
> All in all, the CF modes are only marginally more complex than the PPC
> modes.
It's the (d8 + Ax + Ri*SF) mode that places it in the complex camp. This mode not only uses a separate extension word that needs decoding, but must also perform a shift and 2 additions...
> The instruction set for the PPC contains much more complicated
> instructions than the CF. The 68k has things like division instructions,
> which the CF has dropped.
What PPC instructions are complex? PPC is a subset of POWER just like CF is a subset of 68K, so most of the complex instructions were left out.
> A far more useful (and precise) distinction would be to look at the
> implementation - does the architecture use microcoded instructions? RISC
> cpus, in general, do not - that is one of the guiding principles of using
> RISC in the first place. Traditional CISC use microcode extensively. The
> 68k used microcode for many instructions - the CF does not.
This is misguided. RISC *enables* simple non-microcoded implementations. One can make a microcoded implementation of a RISC, but that doesn't make it any less RISC.
>> * calls place return address in a register: no
>
> More generally speaking, CISCs have special-purpose registers, while
> RISCs have mostly general-purpose registers. Yes, the CF has extra
> functionality on A7 to make it a stack pointer. Putting the return
> address in a register, as done in RISC cpus, is not an advantage - it is a
> consequence of not having a dedicated stack.
It is an advantage as it avoids unnecessary memory traffic - a key goal of RISC.
> If we add in some other features that are a little more implementation
> dependent (and therefore entirely relevant, since that is the reason for
> RISC in the first place), things are a bit different:
>
> * Single-cycle register-only instructions: yes
> * Short execution pipeline: yes
> * (Mostly) microcode-free core: yes
> * Short and fast instruction decode: half point
> * Low overhead branches: yes
> * Stall-free for typical instruction streams: yes
>
> Suddenly the scores are looking a bit different.
I don't see how the scores change at all. Most of the features you mention are "yes" for 68K implementations (except for the original 68000 which scores 4 out of 6), ColdFire and ARM.
> Perhaps we could compare the CF to traditional CISC features:
>
> * Specialised accumulator: no
Many famous CISCs are not accumulator based, eg PDP, VAX, 68K, System/360 etc. Accumulators are typically used in 8-bitters where most instructions are 1 or 2 bytes for good codesize.
> * Microcoded instructions: no
Implementation detail. CF is still complex enough that microcoded implementations might be a good choice.
> * Looped instructions: no
Loop mode is just an implementation optimization that could be done on any architecture.
> * Direct memory-to-memory operations: no
Eh, what does move.l (a0),(a1) do? It's valid on CF.
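In C terms that one ColdFire instruction is a load plus a store in a single instruction; a trivial sketch:

```c
#include <stdint.h>

/* move.l (a0),(a1): read a 32-bit word via a0 and write it via a1 -
 * a direct memory-to-memory operation in one instruction. */
static void move_l(const uint32_t *a0, uint32_t *a1)
{
    *a1 = *a0;
}
```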
> * Bottlenecks due to register or flag conflicts: not often
> * Long pipelines: no
Longer than an equivalent RISC (mainly due to needing 2 memory accesses per instruction and more complex decoding). And likely longer than a simpler microcoded implementation.
> As I said, with the Thumb-2, the ARM is gaining the CISC feature of
> variable length instructions - I did not say it is changing into a CISC
> architecture. The real world is grey - there is no dividing line between
> CISC and RISC, merely a collection of characteristics that some chips have
> and others don't.
Sure, there is always a grey area in the middle, but most ISAs clearly fall in either camp. If you use my rules, can you mention one that scores 4 or 5?
> Adding these variable length instructions is a good thing, if it doesn't
> cost too much at the decoder. It increases both code density and
> instruction speed, since it opens the path for 32-bit immediate data (or
> addresses) to be included directly in a single instruction.
Actually, embedding large immediates in the instruction stream is bad for codesize because they cannot be shared. For Thumb-2 the main goal was to allow access to 32-bit ARM instructions for cases where a single 16-bit instruction was not enough. Thumb-2 doesn't have full 32-bit immediates like 68K/CF.
> My point is not that the CF is a RISC core - I never claimed it was. But
> neither is it a CISC core in comparison to, say, the x86 architecture. If
> there were such a thing as a scale running from pure RISC to pure CISC,
> then the CF lies near the middle. It is not as RISCy as the ARM, but is
> somewhat RISCier than the original 68k.
I agree CF is less CISCy than 68K but it is still more CISCy than x86. If
it dropped 2 memory operands, removed ALU+memory operations, 32-bit
immediates and the absolute and (d8 + Ax + Ri*SF) addressing modes then I
would agree it is a RISC...

Wilco
Wilco Dijkstra wrote:
> "Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message
> news:gicnt214v581okaluheha0sietn16fsvn5@4ax.com...
>
>> On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville
>> <no.spam@designtools.maps.co.nz> wrote:
>>
>>> Wilco Dijkstra wrote:
>>>
>>>> While SRAM is faster than flash, it wouldn't be fast enough to be
>>>> used like a register in a simple MCU. On ARM7 for example,
>>>> register read, ALU operation and register write all happen within
>>>> one clock cycle. With SRAM the cycle time would become 3-4
>>>> times as long (not to mention power consumption).
>>>
>>> To get a handle on what on-chip, small RAM speeds can achieve, in real
>>> silicon, look at the FPGA block sync RAMs - those are smallish blocks,
>>> dual ported, and plenty fast enough to keep up with the cycle times of a
>>> CPU.
>>> I don't see FPGA CPUs being held back by their 'slow SRAM',
>>> as you claim?
>>> RAM based DSPs are now pushing 1GHz, and that's larger chunks
>>> of RAM than are needed for register-mapped memory.
>
> Jim, you earlier wrote "I think you missed my uC = microcontroller." -
> I don't think 1GHz DSPs/FPGAs are microcontrollers. Yes, high-end
> SRAMs on advanced processes easily reach 1GHz, but my point
> (and I think Rick's) is that registers are much faster still.
FPGAs are certainly used as microcontrollers, and in increasing volumes. CPU designers had better be aware of just what an FPGA soft CPU can do these days, as they are replacing uCs in some designs.
>
>> I just dumped my message in progress on this -- you said what I wanted
>> to say very clearly. I use such DSPs. I think Wilco must be stuck
>> thinking in terms of external bus drivers where what is connected is
>> unknown and the bus interface designer must work to worst cases. Too
>> much ARM, perhaps?
>
> No, not at all. I'm talking about needing to access the SRAM several
> times per cycle to read/write the registers (as explained in the first
> paragraph). Therefore a CPU using SRAM rather than registers can only
> run at a fraction of the SRAM's speed.
?
> A register file is a small dedicated structure designed for very high
> random access bandwidth. SRAM simply can't achieve that.
OK, I'll try one more time. You seem to be stuck on a restrictive use of
SRAM, so I'll use different words.

Let's take a sufficiently skilled chip designer, who knows various RAM
structures, and who will not use vanilla SRAM (as Rick has already
mentioned) but something more like the dual port sync RAM of the FPGAs
I gave as an example. Yes, this RAM is more complex than the simplest RAM
(which is why Infineon keep the size to 1-2K), but it buys you many
benefits on a uC, and the die size impact of such RAM is still tiny.

Q: What percentage of a RISC (eg ARM) die is taken by the registers
themselves?
A: A minuscule fraction, << 0.1%

Q: Why not apply more of the die to fix what is a real code/performance
bottleneck?
Let's increase the size of these 'blazingly fast' registers, and
local-RAM-overlay them, to reap the benefits, until we hit a point where
their time-impact matches the code access, or get to 1-2K (or a small %
of die), whichever comes first.

Taking other devices as a yardstick, it's likely to hit the size corner
before it hits the speed corner, but a designer will watch both effects.

[The AVR RAM-overlays its registers, they just forgot to allow that
overlay to move - partly because the very first AVR had no RAM, and it
was a later bolt-on.]

There is NO design brick wall that says when you go over 16 or 32
registers, suddenly it is impossible to get fast access: FPGA designers
are doing that now, in real silicon.

Larger CPU designers put their efforts into cache RAM (which is also
specialised SRAM memory) - but that has its own trade-offs, and better
solutions exist for microcontroller-focused, real time designs.

-jg
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message 
news:45dc98c5@clear.net.nz...
> Wilco Dijkstra wrote:
> FPGAs are certainly used as microcontrollers, and in increasing volumes.
> CPU designers had better be aware of just what an FPGA soft CPU can do
> these days, as they are replacing uCs in some designs.
They can indeed, but FPGA prices need to come down a lot before it becomes a good idea in a high volume design. I've worked with big FPGA stacks for CPU emulation, and large/fast FPGAs can cost well into five figures apiece. Even the smallest ARM uses a big chunk of a large FPGA, so you can only use very simple CPUs in a small FPGA.
>> A register file is a small dedicated structure designed for very high
>> random access bandwidth. SRAM simply can't achieve that.
>
> OK, I'll try one more time.
> You seem to be stuck on a restrictive use of SRAM, so I'll use different
> words.
> Let's take a sufficiently skilled chip designer, who knows various RAM
> structures, and who will not use vanilla SRAM (as Rick has already
> mentioned) but something more like the dual port sync RAM of the FPGAs
> I gave as an example.
> Yes, this RAM is more complex than the simplest RAM (which is why
> Infineon keep the size to 1-2K), but it buys you many benefits on a uC,
> and the die size impact of such RAM is still tiny.
I'm with you. Inventing a new kind of SRAM with 3 read and 2 write ports would do the job indeed. But it is going to be big compared to using standard single ported SRAM, so there needs to be a major advantage.
> Q: What percentage of a RISC (eg ARM) die is taken by the registers
> themselves?
> A: A minuscule fraction, << 0.1%
On ARM7tdmi it is around 5% for 32 registers with 2 read and 1 write port. An ARM7tdmi is as large as 5KB of SRAM. Assuming flash is 4 times as dense, a typical MCU with 128KB flash and 16KB of RAM still has about 0.5% of the area devoted to registers (128KB of flash is roughly 32KB of SRAM-equivalent area; add the 16KB RAM and the 5KB-equivalent core and you get about 53KB, of which the 0.25KB-equivalent register file is roughly 0.5%).
> Q: Why not apply more of the die to fix what is a real code/performance
> bottleneck?
What bottleneck? You lost me here... Adding more registers doesn't automatically improve performance.
> Let's increase the size of these 'blazingly fast' registers, and
> local-RAM-overlay them, to reap the benefits, until we hit a point where
> their time-impact matches the code access, or get to 1-2K (or a small %
> of die), whichever comes first.
>
> Taking other devices as a yardstick, it's likely to hit the size corner
> before it hits the speed corner, but a designer will watch both effects.
At 2KB it would double the size of an ARM7tdmi and slow it down a lot without a redesigned pipeline (it needs to support 2 accesses in less than half a cycle at up to 120MHz, so the SRAM would need to run at 500MHz). I think you're assuming register read/write is not already critical in existing CPUs - it often is. However what use do you have for 256/512 fast registers? You can only access 16 at any time...
> There is NO design brick wall that says when you go over 16 or 32
> registers, suddenly it is impossible to get fast access: FPGA designers
> are doing that now, in real silicon.
Sure, it just gets progressively slower with size and number of ports.

Wilco
rickman wrote:
> On Feb 20, 5:01 pm, Jim Granville <no.s...@designtools.maps.co.nz>
> wrote:
>
>> rickman wrote:
>>
>>> But for RAM to be as efficient as a register file it has to be triple
>>> ported so you can read two operands and write back another... or you
>>> have to go to an accumulator based design. Once you have triple
>>> ported RAM, you have just added a register file! A rose by any other
>>> name still smells as sweet...
>>
>> Correct, that's the hardware level detail.
>>
>> The really important point, is at the SW level, you now access any small
>> clusters of Register-Mappable-RAM variables VERY efficiently indeed,
>> using register opcodes.
>> - Such clusters of variables are very common in code
>> - eg a real-time clock subroutine could be fully coded using register
>>   opcodes, with a single RAM-locate operation on entry.
>
> My point is that there is no difference between registers in triple
> ported RAM and a large register file. If I have 1kB of triple ported
> RAM, I can play the same game and statically allocate memory to interrupt
> routines for zero overhead context switching.
Yes, and to do that zero-overhead context switch, you need a register frame pointer (or similar). You do not want to use this as a mere smart-stack, but to allow all the Reg opcodes to work on any window into that larger memory. Triple ported or dual port depends more on the core in question.

Even the lowly 80c51 has a register frame pointer (all of 2 bits in size), and it does overlay the registers with RAM. The Z8 expands this to 8 bits, and I think the XC166 uses a 16 bit one.

<snip>
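A minimal sketch (assumptions mine) of the 80C51-style overlay Jim describes: R0-R7 live in the first 32 bytes of internal RAM, and a 2-bit bank-select field (RS1:RS0 in the real PSW) picks which 8-byte slice they map to, so an interrupt handler can switch banks instead of pushing registers:

```c
#include <stdint.h>

static uint8_t iram[128];      /* internal RAM; register banks overlay 0x00-0x1F */
static unsigned bank = 0;      /* 2-bit bank select (RS1:RS0 in the real PSW)    */

/* Effective internal-RAM address of register Rn in the current bank. */
static uint8_t *reg(unsigned n)
{
    return &iram[(bank & 3) * 8 + (n & 7)];
}

/* Zero-overhead context switch: an ISR just selects another bank and
 * gets a fresh R0-R7, with nothing pushed to the stack. */
static void isr_enter(void) { bank = 1; }
static void isr_leave(void) { bank = 0; }
```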
>> It's also backward compatible. If you are uncomfortable with the
>> overlay, or the tools are catching up, just leave the register pointer
>> alone, and you have plain-old-vanilla-RISC.
>
> With more limited capabilities due to the register linking for
> subroutine calls.
? - you've lost me here. In subset mode, you simply ignore the register
frame pointer, and it is _exactly_ the same as your un-enhanced core.

-jg
