EmbeddedRelated.com
Forums

PIC vs ARM assembler (no flamewar please)

Started by Unknown February 14, 2007
Wilco Dijkstra wrote:
> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message > news:45db4806$1@clear.net.nz... > >>rickman wrote: > > >>>The TMS9900 used a pointer register (that's right, registers did not >>>go away) to point to the first register in memory. An ADD would then >>>take three memory accesses to complete rather than one clock cycle. >>>Even if you put the memory on chip, you either have to limit the >>>location of the registers to a special bank of fast, multiport memory >>>(register bank) or you have to accept multiple memory cycles for a >>>single instruction, even when working in registers. >> >>Sounds like a poor example of how anyone would do this today. >> >>Look at the XC166, and eZ8, for examples of how you can do >>very efficent memory overlays. >> >>In a uC, you are talking of a few K's of memory, so speed should >>not be an issue at all. > > > These are not examples of a RAM mapped register file, just of a > hardware assisted context switch. So the contents of the RAM are > copied to/from the register file but are not kept in sync until the next > context switch.
Which are not? Perhaps you are talking about the TMS9900?

If you meant the eZ8, then reading up on the Register Pointer operation
would help. In the eZ8, the Register Pointer is combined with the 4-bit
register operand, mapping/overlaying those 16 working registers anywhere
in a 12-bit RAM address space.
> > Even a few KB of SRAM is much slower than a register file.
Slower, yes. 'Much slower' is moot - given that the bottleneck in most
CPUs/uCs is code access from flash, and that on-chip SRAM is MUCH FASTER
than flash, the SRAM is not the speed-determining path. There seems to be
no practical speed impact from this when you look at the MHz ratings of
real devices like the ST10/XC166 cores?

-jg
Jim Granville wrote:
> Wilco Dijkstra wrote:
>> These are not examples of a RAM mapped register file, just of a
>> hardware assisted context switch. So the contents of the RAM are
>> copied to/from the register file but are not kept in sync until the next
>> context switch.
>
> <snip>
>
> If you meant the eZ8, then perhaps reading up on the Register Pointer
> operation would assist. In the eZ8, the register pointer adds to the
> 4 bit register operand, to map/overlay those 16 registers, into up to 12
> bits of RAM
Indeed. The Register Pointer is made up of two separate parts, and those
parts are combined with the register operand, as JG said. That gives you
4-bit addressing (a group of 16 working registers, with the full RP
supplying the rest of the address), 8-bit addressing (a page, with only
half of the RP being used), or the absolute 12-bit address. Throw in
compatibility with older code from when the Z8s could only address 2^8
bytes of RAM (or register file), and you've got a pretty good blend of
power and low code size. It's very logical and intuitive once you think
about it.

>> Even a few KB of SRAM is much slower than a register file.
>
> Slower, yes.

<snip>

Regards,
D.
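A minimal C sketch of the Register Pointer mapping described above. The
exact bit layout is an assumption inferred from this thread (including
which half of RP acts as the page for 8-bit operands), not a transcription
of the Zilog eZ8 documentation, so check the manual before relying on it.

    /* Illustrative only: models an eZ8-style Register Pointer (RP)
     * overlay.  Bit positions are assumptions, not datasheet values. */
    #include <stdint.h>
    #include <stdio.h>

    /* 4-bit working-register operand: the full 8-bit RP supplies the
     * upper 8 bits of the 12-bit register-file address. */
    static uint16_t ez8_addr_r4(uint8_t rp, uint8_t r4)
    {
        return ((uint16_t)rp << 4) | (r4 & 0x0Fu);
    }

    /* 8-bit operand: only half of RP (assumed here to be the low
     * nibble, acting as a page number) supplies the upper 4 bits. */
    static uint16_t ez8_addr_r8(uint8_t rp, uint8_t r8)
    {
        return ((uint16_t)(rp & 0x0Fu) << 8) | r8;
    }

    /* 12-bit operand: bypasses RP and addresses the file directly. */
    static uint16_t ez8_addr_r12(uint16_t r12)
    {
        return r12 & 0x0FFFu;
    }

    int main(void)
    {
        uint8_t rp = 0x70;                 /* hypothetical RP value */
        printf("R3   -> 0x%03X\n", (unsigned)ez8_addr_r4(rp, 3));
        printf("E8h  -> 0x%03X\n", (unsigned)ez8_addr_r8(rp, 0xE8));
        printf("7FFh -> 0x%03X\n", (unsigned)ez8_addr_r12(0x7FF));
        return 0;
    }

Under this model, writing a new value to RP re-points all 16 working
registers at a different block of RAM in one move, which is the overlay
behaviour Jim and David describe.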
"David Brown" <david@westcontrol.removethisbit.com> wrote in message 
news:45db1975$0$31521$8404b019@news.wineasy.se...

> However, there is no fixed distinction between RISC and CISC. The two
> terms refer to a range of characteristics commonly associated with RISC
> cpus and CISC cpus. Some chips clearly fall into one camp or the other,
> but most have at least slightly mixed characteristics.
RISC and CISC are about instruction set architecture, not implementation (although it does have an effect on the implementation).
> The ColdFire core is very much such a mixed chip - in terms of the ISA, it
> is noticeably more RISCy than the 68k (especially the later cores with
> their more complex addressing modes), and in terms of its implementation,
> it is even more so. Even the original 68k, with its multiple registers and
> (mostly) orthogonal instruction set is pretty RISCy.
Well, let's look at 10 features that are typical for most RISCs today:

* large uniform register file: no (8 data + 8 address registers)
* load/store architecture: no
* naturally aligned load/store: no
* simple addressing modes: no (9 variants, yes for ColdFire?)
* fixed instruction sizes: no
* simple instructions: no (yes for ColdFire)
* calls place return address in a register: no
* 3 operand ALU instructions: no
* ALU instructions do not corrupt flags: no
* delayed branch: no

So that is 0 for 68K, 2 for ColdFire. ARM scores 8, Thumb scores 6,
Thumb-2 7. MIPS scores 10 (very pure). This clearly shows 68K and
ColdFire are CISCs, while the rest are RISCs.
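The checklist lends itself to a mechanical tally. The short C sketch below
encodes only the two profiles spelled out explicitly in this post (68K and
ColdFire) and reproduces their scores of 0 and 2; the other figures quoted
above are not re-derived here.

    /* Tally of the 10-point RISC checklist above, for the two ISAs
     * whose answers are given explicitly in the post.  Illustrative
     * only; the feature answers are copied from the text. */
    #include <stdio.h>

    #define NUM_FEATURES 10

    static const char *features[NUM_FEATURES] = {
        "large uniform register file",  "load/store architecture",
        "naturally aligned load/store", "simple addressing modes",
        "fixed instruction sizes",      "simple instructions",
        "return address in a register", "3 operand ALU instructions",
        "ALU ops do not corrupt flags", "delayed branch",
    };

    /* 1 = has the RISC trait, 0 = does not (per the post above). */
    static const int m68k[NUM_FEATURES]     = {0,0,0,0,0,0,0,0,0,0};
    static const int coldfire[NUM_FEATURES] = {0,0,0,1,0,1,0,0,0,0};

    int main(void)
    {
        int s68k = 0, scf = 0;
        for (int i = 0; i < NUM_FEATURES; i++) {
            printf("%-30s 68K:%d ColdFire:%d\n",
                   features[i], m68k[i], coldfire[i]);
            s68k += m68k[i];
            scf  += coldfire[i];
        }
        printf("totals: 68K %d/10, ColdFire %d/10\n", s68k, scf);
        return 0;
    }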
> So the ARM is moving from a fairly pure RISC architecture, through the
> Thumb (with its more CISCy smaller register set and more specialised
> register usage) and now Thumb-2 (with variable length instructions). It's
> gaining CISC attributes in a move to improve code density at the expense
> of more complex instruction decoding.
Yes, RISCs have become more complex. However, that doesn't make them CISC!
Although ARM is not a pure RISC to start with, Thumb-1 and Thumb-2 are
only slightly more complex and still have most of the RISC characteristics.
> The ColdFire, on the other hand, has moved from the original 68k to a more
> RISCy core, with a much greater emphasis on single-cycle
> register-to-register instructions and a simpler and more efficient core,
> in order to improve performance and lead to a smaller implementation.
Indeed, it has gained 2 points by removing some of the complex microcoded
instructions and addressing modes, thus allowing a simpler, more pipelined
implementation. But that clearly doesn't make it a RISC, as the marketing
people would like us to believe...
> There are still plenty of differences between the architectures, but there
> is no doubt that there are a lot more similarities between the ARM Thumb-2
> and the ColdFire than between the original ARM and the original 68k.
I'd say any similarities exist only on a superficial level. For example,
the variable-length instructions in Thumb-2 are easier to decode than
those of the 68K or ColdFire.
>> There are few RISCs with variable length instructions.
>
> The AVR? I can't think of any others.
Hitachi SH and ARC for example. Wilco
>> Even a few KB of SRAM is much slower than a register file.
>
> Slower, yes. ....
Generally true, but there are exceptions. TI's 54xx DSPs have some
registers memory-addressable (not all - not the accumulators, for example,
just the so-called "auxiliary registers"). Whether they are really memory
addresses or not I don't know; the RAM is on-chip at address 0 (about
where these registers are), and this RAM allows 2 accesses per cycle, so
there is no slowdown from that. But given that this architecture allows 3
RAM accesses per cycle (or was it 4?), this is hardly surprising - it is
designed not to have a memory bottleneck.

Dimiter
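As a rough illustration of the idea being debated here - CPU registers
that are also reachable at ordinary data-memory addresses - a toy C model
follows. The base address, word width and register count are invented for
the example and are not taken from any TI (or other) datasheet.

    /* Toy model of a RAM-mapped register file.  Everything here is
     * illustrative: "registers" are simply a fixed window in the same
     * array that models on-chip data RAM, so a register write and a
     * memory write to the aliased address hit the same storage. */
    #include <stdint.h>
    #include <stdio.h>

    #define RAM_WORDS 4096
    #define REG_BASE  0x0010u   /* hypothetical register window base */
    #define NUM_REGS  8

    static uint16_t ram[RAM_WORDS];          /* unified on-chip data RAM */

    /* "Register" access is just RAM access at a fixed offset. */
    static uint16_t reg_read(unsigned r)           { return ram[REG_BASE + r]; }
    static void     reg_write(unsigned r, uint16_t v) { ram[REG_BASE + r] = v; }

    int main(void)
    {
        reg_write(3, 0x1234);                 /* write via the register view */
        ram[REG_BASE + 3] += 1;               /* ...and via the memory view  */
        printf("AR3 = 0x%04X\n", (unsigned)reg_read(3)); /* same cell        */
        return 0;
    }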
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message 
news:45db7d1e$1@clear.net.nz...
> Wilco Dijkstra wrote:
>> These are not examples of a RAM mapped register file, just of a
>> hardware assisted context switch. So the contents of the RAM are
>> copied to/from the register file but are not kept in sync until the next
>> context switch.
>
> Which are not ? - perhaps you are talking about the TMS9900 ?
No, I meant the XC166 (SPARC, AMD29K, etc.) register windows.
> If you meant the eZ8, then perhaps reading up on the Register Pointer
> operation would assist. In the eZ8, the register pointer adds to the
> 4 bit register operand, to map/overlay those 16 registers, into up to 12
> bits of RAM
The eZ8 is really weird indeed: you can either call it a CPU with a large
register file or a CPU with no registers and direct memory addressing.
The instruction cycle timings are pretty slow, so it's either fetch speed
or the register access that is holding it back.
>> Even a few KB of SRAM is much slower than a register file.
>
> Slower, yes. 'Much slower' is moot - given that the bottle neck in
> most CPUs/uC is code access from FLASH, and that on-chip SRAM speeds
> are MUCH FASTER than Flash speeds, so it's not looking like the
> determining-speed path.
While SRAM is faster than flash, it wouldn't be fast enough to be used like a register in a simple MCU. On ARM7 for example, register read, ALU operation and register write all happen within one clock cycle. With SRAM the cycle time would become 3-4 times as long (not to mention power consumption).
> There seems to be no practical speed impact from this, when you
> look at the Mhz speeds of real devices like the St10/XC166 cores ?
That's because the XC166 uses registers and not RAM.

Wilco
Wilco Dijkstra wrote:
> While SRAM is faster than flash, it wouldn't be fast enough to be
> used like a register in a simple MCU. On ARM7 for example,
> register read, ALU operation and register write all happen within
> one clock cycle. With SRAM the cycle time would become 3-4
> times as long (not to mention power consumption).
To get a handle on what small on-chip RAMs can achieve in real silicon,
look at the FPGA block sync RAMs - those are smallish blocks, dual-ported,
and plenty fast enough to keep up with the cycle times of a CPU. I don't
see FPGA CPUs being held back by their 'slow SRAM', as you claim?
RAM-based DSPs are now pushing 1 GHz, and that's with larger chunks of RAM
than are needed for register-mapped memory.

-jg
On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

> Wilco Dijkstra wrote:
>> While SRAM is faster than flash, it wouldn't be fast enough to be
>> used like a register in a simple MCU.
>
> <snip>
>
> I don't see FPGA CPUs being held back by their 'slow SRAM', as you claim?
> RAM-based DSPs are now pushing 1 GHz, and that's with larger chunks of
> RAM than are needed for register-mapped memory.
I just dumped my message in progress on this -- you said what I wanted to
say very clearly. I use such DSPs.

I think Wilco must be stuck thinking in terms of external bus drivers,
where what is connected is unknown and the bus interface designer must
work to worst cases. Too much ARM, perhaps?

Jon
On Wed, 21 Feb 2007 00:02:27 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"David Brown" <david@westcontrol.removethisbit.com> wrote in message >news:45db1975$0$31521$8404b019@news.wineasy.se... > >> However, there is no fixed distinction between RISC and CISC. The two >> terms refer to a range of characteristics commonly associated with RISC >> cpus and CISC cpus. Some chips clearly fall into one camp or the other, >> but most have at least slightly mixed characteristics. > >RISC and CISC are about instruction set architecture, not implementation >(although it does have an effect on the implementation). ><snip>
I respect your knowledge and skill, Wilco, but based on my experience I
cannot agree with this as I understand you to be writing it here.

I spent 1-on-1 time with Hennessy and listened to the reasoning he used.
RISC was all about thinking in detailed terms of practical implementation.
They were faced with access to lower-technology FABs (larger feature
sizes, fewer transmission gates and inverters, etc.) and wanted to achieve
more with less. Doing that was everything about implementation, and the
instruction set architecture was allowed to go where it must. That this
worked out to being a 'reduced instruction set' was something that came
out of achieving competitive performance from lower-tech FAB capability
than folks like Intel or Motorola had available for their flagship lines
of the day.

There was a design philosophy based upon theory -- simply the realization
that many of the things that slowed down a CISC were also a matter of
perceived convenience for programmers, so the policy was to get rid of
anything and everything that slowed down the clock rate without paying
_well_ for that delay. A focus on throughput. The fact that removing
barriers to speed also happened to reduce the need for transistor
equivalents was the happy coincidence that fueled the initiative. The
instructions were a result of focusing on implementation details -- not
some instruction set theory under which the implementation then followed.
If higher level features were cheap to implement and paid for themselves
in performance, they were simply kept. A very practical, hard-nosed
approach.

If you ever listened to such a lecture by those actually doing the work,
you'd see this narrow focus. The register flags that signalled whether or
not a register was in use as a destination were tossed as too expensive --
they required infrastructure in order to delay the processor, and the
combinatorial worst-case path of all that meant additional __delay__ in
each clock cycle, whether or not the interlock was useful from instruction
to instruction. You paid for it on every cycle, need it or not. So out it
went. No interlocks. Sorry. Similar thinking was involved in the Alpha's
refusal to do 'lane changes,' for example.

Hennessy had a huge blow-up of the 68020 CPU in one room at MIPS (which
was quite near Weitek at the time), when I visited. He would go through
each and every detail of the implementation there and talk about it, at
length, and explain why it was worthwhile... or not... and what the exact
quantitative cost was in each cycle's timing and over the broader arc of
an application.

Some of the difficulties were the higher memory bandwidths required once
you started tossing out stuff like register interlocks, microstore and its
associated sequencing overhead, lane changing, etc. But if that could be
satisfied -- and that was kind of possible at the time with some static
RAM from Performance Semi -- it would perform like a bat out of hell. So
to speak.

But the focus was on implementation on lower-tech FABs and, while doing
that, still competing with CISC and beating it.

Of course, FABs got a lot better, access to high-tech FAB resources became
increasingly brokered to keep them running 24/7, and the driving need for
lower-tech feature sizes relaxed. Also, CISC-looking external designs
could now be built around internal RISC cores, with built-in TLBs,
re-order buffers, reservation stations with multiple functional units to
share, jump prediction... so much so that Intel started putting L1 cache
memory on-die. There was so much excess available that they ran out of
nifty ideas, and the best they knew to do with it was soak up die space
with cache memory.

So the RISC drive relaxed. At least in the consumer market.

But for those making cheap embedded controllers, I suspect that die size
and making effective use of somewhat lower FAB technology remain useful.
So the low-transistor-count approaches that were once the much-lauded
domain of RISC remain important.

Jon
Jonathan Kirwan wrote:
> On Wed, 21 Feb 2007 00:02:27 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>
>> RISC and CISC are about instruction set architecture, not implementation
>> (although it does have an effect on the implementation).
>
> <snip>
>
> But for those making cheap embedded controllers, I suspect that die size
> and making effective use of somewhat lower FAB technology remain useful.
> So the low-transistor-count approaches that were once the much-lauded
> domain of RISC remain important.
All one can really derive in meaning from RISC is 'Reduced Instruction Set
Computer' - any other assertions are in the eye of the beholder or, worse,
spin doctoring - so there is little point in slicing and dicing the
details of what is, or is not, RISC.

-jg
On Wed, 21 Feb 2007 16:39:37 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

> Jonathan Kirwan wrote:
>
> <snip>
>
> All one can really derive in meaning from RISC is 'Reduced Instruction
> Set Computer' - any other assertions are in the eye of the beholder or,
> worse, spin doctoring - so there is little point in slicing and dicing
> the details of what is, or is not, RISC.
Real meaning is found in the details of how things work, not in some
banner or ideology. Which is, I suppose, about what I said.

Thanks,
Jon