
PIC vs ARM assembler (no flamewar please)

Started by Unknown February 14, 2007
On Wed, 21 Feb 2007 10:41:23 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message >news:gicnt214v581okaluheha0sietn16fsvn5@4ax.com... >> On Wed, 21 Feb 2007 14:56:59 +1300, Jim Granville >> <no.spam@designtools.maps.co.nz> wrote: >> >>>Wilco Dijkstra wrote: >>>> >>>> While SRAM is faster than flash, it wouldn't be fast enough to be >>>> used like a register in a simple MCU. On ARM7 for example, >>>> register read, ALU operation and register write all happen within >>>> one clock cycle. With SRAM the cycle time would become 3-4 >>>> times as long (not to mention power consumption). >>> >>> To get a handle on what On-Chip, small RAM speeds can achieve, in real >>>silicon, look at the FPGA Block Sync RAMS - those are smallish block, >>>Dual ported, and plenty fast enough to keep up with the cycle times of a >>>CPU. >>> I don't see FPGA CPUs being held back by their 'slow sram', >>>as you claim ?. >>> RAM based DSPs are now pushing 1GHz, and that's larger chunks >>>of RAM than are needed for register-maped-memory. > >Jim, you earlier wrote "I think you missed my uC = microcontroller." - >I don't think 1GHz DSPs/FPGAs are micro controllers. Yes high end >SRAMs on advanced processes easily reach 1GHz, but my point >(and I think Rick's) is that registers are much faster still. > >> I just dumped my message in progress on this -- you said what I wanted >> to say very clearly. I use such DSPs. I think Wilco must be stuck >> thinking in terms of external bus drivers where what is connected is >> unknown and the bus interface designer must work to worst cases. Too >> much ARM, perhaps? > >No, not at all. I'm talking about needing to access the SRAM several >times per cycle to read/write the registers (as explained in the first >paragraph). Therefore the speed of a CPU using SRAM rather than >registers becomes a fraction of the cycle time of the SRAM. > >A register file is a small dedicated structure designed for very high >random access bandwidth. SRAM simply can't achieve that.
I think of SRAM as "static RAM."  Nothing more than that.  This means it can be single-ported, or multi-ported.  The only discerning issue is whether or not it is static and can retain its contents down to a DC clock.  This includes latches.  The actual cell of an SRAM can be implemented in a variety of ways and with a variety of surrounding control logic.

So again I think you are considering __external__ SRAM packages commonly found and a bus upon which it operates or are otherwise locked into some mental viewpoint you aren't escaping just yet and one that doesn't reflect actual cpu design practice.

Registers are, in fact, almost always implemented as SRAM in an ALU design today, whether as multiported or not.  (They used to be nmos dram in some processes, but I don't know of any of those now.)  Saying "SRAM simply can't achieve that" sounds silly to me.  It does, because that's what registers actually happen to be.

Jon
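
For concreteness, the arithmetic behind Wilco's "3-4 times as long" claim can be sketched in C. The 2 ns SRAM cycle below is an assumed, illustrative figure, not one from the thread; the point is only that a single-ported SRAM must serve the two operand reads and one result write one after another, while a three-ported register file overlaps them in one cycle:

    #include <stdio.h>

    int main(void)
    {
        double sram_cycle_ns = 2.0;  /* illustrative single-ported SRAM cycle */
        int accesses = 3;            /* two operand reads + one result write  */

        printf("SRAM-as-registers core: ~%.0f MHz max\n",
               1000.0 / (accesses * sram_cycle_ns));   /* ~167 MHz */
        printf("3-ported register file: ~%.0f MHz max\n",
               1000.0 / sram_cycle_ns);                /* ~500 MHz */
        return 0;
    }
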
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:ea7pt21omlbn70fujpjtd5b91ror4gc9l3@4ax.com...
> On Wed, 21 Feb 2007 10:41:23 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>>A register file is a small dedicated structure designed for very high
>>random access bandwidth. SRAM simply can't achieve that.
>
> I think of SRAM as "static RAM."  Nothing more than that.  This means
> it can be single-ported, or multi-ported.  The only discerning issue
> is whether or not it is static and can retain its contents down to a
> DC clock.  This includes latches.  The actual cell of an SRAM can be
> implemented in a variety of ways and with a variety of surrounding
> control logic.
>
> So again I think you are considering __external__ SRAM packages
> commonly found and a bus upon which it operates or are otherwise
> locked into some mental viewpoint you aren't escaping just yet and one
> that doesn't reflect actual cpu design practice.
Well, with SRAM I'm thinking of standard single ported SRAM like 99.9% of the SRAM that exists, whether as external packages, on-chip RAM, part of a cache, cell libraries etc. Dual ported SRAM is pretty rare (note dual ported caches don't actually use dual ported SRAM). Anything else you'll likely have to design yourself at the transistor level.
>Registers are, in
> fact, almost always implemented as SRAM in an ALU design today,
> whether as multiported or not. (They used to be nmos dram in some
> processes, but I don't know of any of those now.) Saying "SRAM simply
> can't achieve that" sounds silly to me. It does, because that's what
> registers actually happen to be.
Not for synthesized CPUs. Standard cell libraries don't provide the right number of ports or the right pitch, so ARMs typically have register files created from flops and muxes.

CPUs that are largely handcrafted can obviously create specially designed RAMs with enough ports to achieve the required bandwidth. Even then they use various techniques to reduce the number of ports.

Wilco
On Wed, 21 Feb 2007 22:27:27 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

> >"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message >news:ea7pt21omlbn70fujpjtd5b91ror4gc9l3@4ax.com... >> On Wed, 21 Feb 2007 10:41:23 GMT, "Wilco Dijkstra" >> <Wilco_dot_Dijkstra@ntlworld.com> wrote: > >>>A register file is a small dedicated structure designed for very high >>>random access bandwidth. SRAM simply can't achieve that. >> >> I think of SRAM as "static RAM." Nothing more than that. This means >> it can be single-ported, or multi-ported. The only discerning issue >> is whether or not it is static and can retain its contents down to a >> DC clock. This includes latches. The actual cell of an SRAM can be >> implemented in a variety of ways and with a variety of surrounding >> control logic. >> >> So again I think you are considering __external__ SRAM packages >> commonly found and a bus upon which it operates or are otherwise >> locked into some mental viewpoint you aren't escaping just yet and one >> that doesn't reflect actual cpu design practice. > >Well, with SRAM I'm thinking of standard single ported SRAM like 99.9% >of the SRAM that exists, whether as external packages, on-chip RAM, >part of a cache, cell libraries etc. Dual ported SRAM is pretty rare (note >dual ported caches don't actually use dual ported SRAM). Anything else >you'll likely have to design yourself at the transistor level.
My first reaction to the above is that ASIC cpu designs have control over all this and they use that flexibility as a matter of course, too. And none of this addresses itself to the fact that registers are, in fact, SRAM. So your differentiation is without a difference.
>>Registers are, in
>> fact, almost always implemented as SRAM in an ALU design today,
>> whether as multiported or not. (They used to be nmos dram in some
>> processes, but I don't know of any of those now.) Saying "SRAM simply
>> can't achieve that" sounds silly to me. It does, because that's what
>> registers actually happen to be.
>
>Not for synthesized CPUs.
So now you bring this in? I thought you were simply saying SRAM and registers are different, which I don't agree with because registers are sram. And now you talk about synthesized cpus to see if that might help your case?
>Standard cell libraries don't provide the right
>number of ports or the right pitch, so ARMs typically have register files
>created from flops and muxes.
What exactly is the difference in your mind between a flipflop and an sram bit cell?  I'm curious.
>CPUs that are largely handcrafted can
>obviously create specially designed RAMs with enough ports to achieve
>the required bandwidth. Even then they use various techniques to reduce
>the number of ports.
Not sure what to say to all that.

Jon
Jonathan Kirwan wrote:
> On Wed, 21 Feb 2007 22:27:27 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>>Standard cell libraries don't provide the right
>>number of ports or the right pitch, so ARMs typically have register files
>>created from flops and muxes.
>
>
> What exactly is the difference in your mind between a flipflop and an
> sram bit cell?  I'm curious.
If it is an async vanilla SRAM cell there is a slight difference - with a register you can read and write on the same clock edge.

If it is a Sync SRAM cell, there is no difference.  Both have a Clock, and a Tsu and Th.

Dual port Sync SRAMs are quite common fare, across most FPGA vendors.

A one-generation-back FPGA (so as to not be too unfair) specs Tsu of 0.52ns, Th of 0ns, Tco of 0.6ns, and Tclk of 572MHz, on a 2K byte dual port RAM.  These devices deliver 100-200MHz soft CPU speeds, and the SyncSRAM speed does not look like the main bottleneck.

-jg
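
Jim's figures are easy to sanity-check: subtract Tco and Tsu from the clock period and what remains is the budget for routing and logic in each cycle. A quick C sketch (the 150 MHz soft-CPU clock is an assumed, representative figure):

    #include <stdio.h>

    int main(void)
    {
        double tco = 0.6, tsu = 0.52;        /* ns, from the post above     */
        double t_rated = 1000.0 / 572.0;     /* period at the rated 572MHz  */
        double t_soft  = 1000.0 / 150.0;     /* period at a 150MHz soft CPU */

        printf("logic budget at 572MHz: %.2f ns\n", t_rated - tco - tsu);
        printf("logic budget at 150MHz: %.2f ns\n", t_soft - tco - tsu);
        return 0;   /* prints ~0.63 ns and ~5.55 ns respectively */
    }
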
On Thu, 22 Feb 2007 13:55:22 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote:
>> On Wed, 21 Feb 2007 22:27:27 GMT, "Wilco Dijkstra"
>> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>>>Standard cell libraries don't provide the right
>>>number of ports or the right pitch, so ARMs typically have register files
>>>created from flops and muxes.
>>
>>
>> What exactly is the difference in your mind between a flipflop and an
>> sram bit cell?  I'm curious.
>
>If it is an async vanilla SRAM cell there is a slight difference
>- with a register you can read and write on the same clock edge.
>
>If it is a Sync SRAM cell, there is no difference.  Both have
>a Clock, and a Tsu and Th.
>
>Dual port Sync SRAMs are quite common fare, across most FPGA vendors.
>
>A one-generation-back FPGA (so as to not be too unfair) specs
>Tsu of 0.52ns, Th of 0ns, Tco of 0.6ns, and Tclk of 572MHz,
>on a 2K byte dual port RAM.
>These devices deliver 100-200MHz soft CPU speeds, and the
>SyncSRAM speed does not look like the main bottleneck.
I'm aware.  I was curious what Wilco was thinking about.

Jon
Wilco Dijkstra wrote:
> "David Brown" <david@westcontrol.removethisbit.com> wrote in message > news:45dc2bf5$0$31548$8404b019@news.wineasy.se... >> Wilco Dijkstra wrote: >>> "David Brown" <david@westcontrol.removethisbit.com> wrote in message >>> news:45db1975$0$31521$8404b019@news.wineasy.se... > >>> RISC and CISC are about instruction set architecture, not implementation >>> (although it does have an effect on the implementation). >>> >> The whole point of RISC is to be able to make a more efficient >> implementation - it is an architectural design philosophy aimed at making >> small and fast (clock speed) implementations. > > That's a good summary. >
It's nice that we don't entirely disagree!
>>>> The ColdFire core is very much such a mixed chip - in terms of the ISA,
>>>> it is noticeably more RISCy than the 68k (especially the later cores
>>>> with their more complex addressing modes), and in terms of its
>>>> implementation, it is even more so. Even the original 68k, with its
>>>> multiple registers and (mostly) orthogonal instruction set is pretty
>>>> RISCy.
>>> Well, let's look at 10 features that are typical for most RISCs today:
>>>
>>> * large uniform register file: no (8 data + 8 address registers)
>> Typical CISC is 4 to 8 registers, each with specialised uses. Thus the
>> 68k is far from typical CISC, and is much more in the middle.
>
> There are various CISCs (eg VAX, MSP430) that have 16 registers,
> while most RISCs have 32 or more.
>
I still don't see the point of trying to make black-and-white classifications of cpus as *either* CISC, *or* RISC. You could divide them into load/store and non-load/store architectures, which is perhaps the most important difference (although there are no doubt hybrids there too). Using that definition, the msp430 is CISC - but it has plenty of RISC features (such as 16 registers - a lot for its size).
>>> * load/store architecture: no
>> The 68k can handle both operands of an ALU instruction in memory, which is
>> CISC. The ColdFire can have one in memory, one in a register, which is
>> again half-way.
>
> The ColdFire is no different from 68K in this aspect. Most ALU operations
> can do read/modify/write to memory and the move instruction can access
> two memory operands.
>
IIRC, the 68k could do some ALU operations with both operands in memory (such as ADDX), and MOVE operations can use any addressing mode for both operands.  The CF is more limited to simplify decoding and operand fetch.  Another example of the simplifications is that the CF no longer supports byte or (16-bit) word sizes for most operations - about the only instructions that support sizes other than the native 32 bits are MOVEs.  So for other data sizes, you effectively have a load/store architecture.

I've worked for years with the 68332, and in recent times I've worked with the ColdFire.  I've studied generated assembly code, often made with the same compiler, from the same source code.  There is no doubt whatsoever - the generated CF code makes much heavier use of register-to-register instructions, with code strategies more reminiscent of compiler-generated RISC code.  This is partly because some of the more costly memory operation capabilities were dropped from the 68k, and partly because the CF is more heavily optimised for such RISC style instructions.  If you were to think of the CF as a RISC core with a bit too few registers, but some added direct memory modes to compensate, you'd program fairly optimal code - the same is not true for the 68k.
>>> * naturally aligned load/store: no
>> That is purely an implementation issue for the memory interface. It is
>> common that RISC cpus, in keeping with the aim of a small, neat and fast
>> implementation, insist on aligned access. But it is not a requirement -
>> IIRC, some PPC implementations can access non-aligned data in
>> big-endian mode. The ColdFire is certainly more efficient with aligned
>> accesses, but they are not a requirement.
>
> Unaligned accesses are non-trivial so most RISCs left it out. However
> modern CPUs nowadays have much of the required logic (due to
> hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER)
> have added this. Hardware designers still hate its complexity, but it is
> often a software requirement. Quite surprisingly it gives huge speedups
> in programs that use memcpy a lot.
>
That *is* surprising - the memcpy() implementations I have seen either use byte for byte copying, or use larger accesses if the pointers are (or can be) properly aligned.
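
The common pattern looks something like the following minimal sketch (not any particular library's code): copy a word at a time only when both pointers happen to be mutually aligned, and fall back to bytes otherwise. On a core with hardware unaligned accesses, as Wilco describes, the alignment test can simply be dropped, which is where the memcpy speedup would come from:

    #include <stddef.h>
    #include <stdint.h>

    void *copy(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* Use 32-bit accesses only when both pointers are word aligned. */
        if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0)
            for (; n >= 4; n -= 4, d += 4, s += 4)
                *(uint32_t *)d = *(const uint32_t *)s;

        while (n--)            /* leftover (or unaligned) bytes */
            *d++ = *s++;
        return dst;
    }
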
>>> * simple addressing modes: no (9 variants, yes for ColdFire?)
> ...
>> All in all, the CF modes are only marginally more complex than the PPC
>> modes.
>
> It's the (d8 + Ax + Ri*SF) mode that places it in the complex camp.
> It not only uses a separate extension word that needs
> decoding but also must perform a shift and 2 additions...
>
Yes, that's a complex one, and it's slightly surprising that it survived the jump from 68k to CF. I think it was included as it is the only mode that can get its address from the sum of two registers, which is a common requirement (the PPC has such an addressing mode). Since an extension word is needed, the 68k architecture put the extra bits to good use - a scale factor of 1, 2, 4 or 8, and the remaining bits giving an offset which is probably seldom used.
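
To see why the mode earns its keep, consider indexing an array of 4-byte records. A hypothetical C fragment like this is exactly the case a 68k/CF compiler can fold into a single (d8 + Ax + Ri*SF) operand, with the scale covering the element size and the displacement covering the member offset:

    struct point { short x, y; };      /* 4 bytes per element */

    short get_y(struct point *a, long i)
    {
        /* One operand on 68k/ColdFire: Ax = a, Ri = i, SF = 4, d8 = 2 */
        return a[i].y;
    }
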
>> The instruction set for the PPC contains much more complicated
>> instructions than the CF. The 68k has things like division instructions,
>> which the CF has dropped.
>
> What PPC instructions are complex? PPC is a subset of POWER
> just like CF is a subset of 68K, so most of the complex instructions
> were left out.
>
The mask and rotation instructions are examples of complex ALU instructions, and there are several multi-cycle data movement instructions (such as the load multiple word, and the load string word).
>> A far more useful (and precise) distinction would be to look at the
>> implementation - does the architecture use microcoded instructions? RISC
>> cpus, in general, do not - that is one of the guiding principles of using
>> RISC in the first place. Traditional CISC use microcode extensively. The
>> 68k used microcode for many instructions - the CF does not.
>
> This is misguided. RISC *enables* simple non-microcoded
> implementations. One can make a micro code implementation of a
> RISC, but that doesn't make it any less RISC.
>
Again, I don't see RISC vs. CISC as a black and white division, but as a set of characteristics. Microcoding is a CISC characteristic - it is perfectly possible to have a mostly RISC core with CISCy microcode.
>>> * calls place return address in a register: no
>> More generally speaking, CISC has specific purpose registers, while RISC
>> have mostly general purpose registers. Yes, the CF has extra
>> functionality on A7 to make it a stack pointer. Putting the return
>> address in a register, as done in RISC cpus, is not an advantage - it is a
>> consequence of not having a dedicated stack.
>
> It is an advantage as it avoids unnecessary memory traffic - a key
> goal of RISC.
>
It avoids an extra memory write (and subsequent read) in leaf functions, at the cost of extra instruction fetches for the code to save and restore the link register for non-leaf functions. I can't give you a detailed analysis of the costs and benefits here, but I'd be surprised if it is a distinct advantage.
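
The trade-off shows up in a hypothetical pair of C functions like these; the comments describe the typical generated code, not any specific compiler's output:

    /* Leaf: on a link-register machine the return address stays in the
       register for the whole call - no memory traffic at all. */
    int leaf(int x)
    {
        return x + 1;
    }

    /* Non-leaf: the inner call would clobber the link register, so the
       prologue spills it to the stack and the epilogue reloads it - the
       same write and read a stack-based CALL performs, plus the extra
       instructions that do the saving and restoring. */
    int nonleaf(int x)
    {
        return leaf(x) * 2;
    }
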
>> If we add in some other features that are a little more implementation
>> dependent (and therefore entirely relevant, since that is the reason for
>> RISC in the first place), things are a bit different:
>>
>> * Single-cycle register-only instructions: yes
>> * Short execution pipeline: yes
>> * (Mostly) microcode-free core: yes
>> * Short and fast instruction decode: half point
>> * Low overhead branches: yes
>> * Stall-free for typical instruction streams: yes
>>
>> Suddenly the scores are looking a bit different.
>
> I don't see how the scores change at all. Most of the features you
> mention are "yes" for 68K implementations (except for the original
> 68000 which scores 4 out of 6), ColdFire and ARM.
>
Exactly the point - when you include these typical RISC features as well as your chosen features, the CF scores much more like the ARM. I'm not claiming in any way that the CF is RISCier than the ARM, or even *as* RISCy - just that it has far more typical RISC features than you give it credit for.
>> Perhaps we could compare the CF to traditional CISC features:
>>
>> * Specialised accumulator: no
>
> Many famous CISCs are not accumulator based, eg PDP, VAX, 68K,
> System/360 etc. Accumulators are typically used in 8-bitters where
> most instructions are 1 or 2 bytes for good codesize.
>
Specialised accumulators are a typical CISC feature, even though they are by no means universal.
>> * Microcoded instructions: no
>
> Implementation detail. CF is still complex enough that micro
> coded implementations might be a good choice.
>
>> * Looped instructions: no
>
> Loop mode is just an implementation optimization that could be done
> on any architecture.
>
>> * Direct memory-to-memory operations: no
>
> Eh, what does move.l (a0),(a1) do? It's valid on CF.
>
I intended to refer to ALU operations, sorry.
>> * Bottlenecks due to register or flag conflicts: not often
>> * Long pipelines: no
>
> Longer than an equivalent RISC (mainly due to needing 2 memory
> accesses per instruction and more complex decoding). And likely
> longer than a simpler microcoded implementation.
>
Are you making this up out of thin air?  I don't have any details of the CF pipeline, but a mispredicted branch that hits the instruction prefetch cache (thus avoiding instruction fetches) executes in 3 cycles.  That's definitely a short pipeline.

A fair proportion of CF instructions are single-word, and a single memory access reads two such instructions.  I'd estimate that you'd have slightly less than one memory access per instruction on average, but of course that's highly code dependent.  Instructions are aligned with their extension words as they are loaded into the prefetch cache, so decoding is not any more complicated or time-consuming than for a RISC instruction set - the coding format is nice and regular.
>> As I said, with the Thumb-2, the ARM is gaining the CISC feature of
>> variable length instructions - I did not say it is changing into a CISC
>> architecture. The real world is grey - there is no dividing line between
>> CISC and RISC, merely a collection of characteristics that some chips have
>> and others don't.
>
> Sure, there is always a grey area in the middle, but most ISAs
> clearly fall in either camp. If you use my rules, can you mention one
> that scores 4 or 5?
>
I wouldn't use your rules - they are picked specifically to match your argument (and even then, you placed the ARM Thumb at 6).  Add in the six I picked, and the ColdFire is at 8 out of 16.  Of course, my rules, like yours, are arbitrary and unweighted, so they hardly count as an objective or quantitative analysis.

Most ISAs can certainly be classified as roughly RISC or roughly CISC - I'll not deny that, and given a choice of merely RISC or CISC, I'd classify the CF as CISC without hesitation.  All I am trying to say is that there are characteristics that are typical for each camp, and that architectures frequently use characteristics from the "opposing" camp to make a better chip.  The CF has a lot more RISC features than most CISC devices, and the ARM is picking up a few more CISC features with its newer developments.  My original statement, that the inclusion of variable-length instructions in Thumb-2 makes the ARM more like the CF, is true.
>> Adding these variable length instructions is a good thing, if it doesn't
>> cost too much at the decoder. It increases both code density and
>> instruction speed, since it opens the path for 32-bit immediate data (or
>> addresses) to be included directly in a single instruction.
>
> Actually, embedding large immediates in the instruction stream is
> bad for codesize because they cannot be shared. For Thumb-2 the
> main goal was to allow access to 32-bit ARM instructions for cases
> where a single 16-bit instruction was not enough. Thumb-2 doesn't
> have immediates like 68K/CF.
>
Embedding large immediates in the instruction stream is good for code size if there is no need to share them. If they are shared, then the typical RISC arrangement of reading the values from code memory using a pointer register and 16-bit displacement is more code efficient (for 3 or more uses of the 32-bit data), but less bandwidth efficient (taking a 32-bit instruction and a 32-bit read, compared to a single 48-bit instruction). Of course, that would require support for 48-bit instructions rather than just 32-bit, which might not be worth the cost.
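
The break-even point falls out of simple arithmetic: n inline uses cost n 48-bit instructions, while a literal pool costs n 32-bit loads plus one shared 32-bit pool entry. A quick sketch using those idealised sizes:

    #include <stdio.h>

    int main(void)
    {
        for (int n = 1; n <= 4; n++)
            printf("%d use(s): inline %3d bits, pool %3d bits\n",
                   n, n * 48, n * 32 + 32);
        /* 1 use:   48 vs  64 - inline wins
           2 uses:  96 vs  96 - break even
           3 uses: 144 vs 128 - the pool wins, matching "3 or more uses" */
        return 0;
    }
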
>> My point is not that the CF is a RISC core - I never claimed it was. But
>> neither is it a CISC core in comparison to, say, the x86 architecture. If
>> there were such a thing as a scale running from pure RISC to pure CISC,
>> then the CF lies near the middle. It is not as RISCy as the ARM, but is
>> somewhat RISCier than the original 68k.
>
> I agree CF is less CISCy than 68K but it is still more CISCy than x86.
I must have misread that - are you saying the CF (and 68k) is more CISCy than the x86 ??
> If it dropped 2 memory operands, removed ALU+memory operations,
> 32-bit immediates and the absolute and (d8 + Ax + Ri*SF) addressing
> modes then I would agree it is a RISC...
>
That's true - but then it would not be nearly as good a core.  Just because there are some truly horrible CISC architectures does not mean that all things RISC are better!

mvh.,

David
> Wilco
>
Wilco Dijkstra wrote:
> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message > news:45dc98c5@clear.net.nz... >> Wilco Dijkstra wrote: > >> FPGAs are certainly used a microcontrollers, and in increasing volumes. >> CPU designers had better be aware of just what a FPGA soft CPU can do >> these days, as they are replacing uC in some designs. > > They can indeed, but FPGA prices need to come down a lot before > it becomes a good idea in a high volume design. I've worked with big > FPGA stacks for CPU emulation and large/fast FPGAs can cost > well in the 5 figures a piece. Even the smallest ARM uses a big chunk > of a large FPGA. So you can only use very simple CPUs in a small > FPGA. >
The ARM is a very poor choice for a CPU in an FPGA, and it most certainly does not follow that only simple CPUs can be used in small FPGAs.  The most common soft processors are the Nios II (Altera) and the MicroBlaze (Xilinx) - both are designed specifically for FPGAs, and will give much more processing power for the same area in the FPGA than a "standard" CPU core.  The ARM, like the ColdFire, is designed to be efficient and easily synthesizable on ASICs and other fine-grained architectures - FPGA-optimised designs are significantly different.
>>> A register file is a small dedicated structure designed for very high
>>> random access bandwidth. SRAM simply can't achieve that.
>>
>> OK, I'll try one more time.
>> You seem to be stuck on a restrictive use of SRAM, so I'll use different
>> words.
>> Let's take a sufficiently skilled chip designer, that he knows various RAM
>> structures, and that he will not use vanilla SRAM (as Rick has already
>> mentioned) but will use something more like the dual port sync ram of the
>> FPGAs I gave as an example.
>> Yes, this ram is more complex than the simplest RAM, (which is why Infineon
>> keep the size to 1-2K), but it buys you many benefits on a uC, and the
>> die size impact of such RAM is still tiny.
>
> I'm with you. Inventing a new kind of SRAM with 3 read and 2 write
> ports would do the job indeed. But it is going to be big compared to
> using standard single ported SRAM, so there needs to be a major
> advantage.
>
>> Q: What percentage of a RISC (eg ARM) die is taken by the registers
>> themselves?
>> A: A minuscule fraction << 0.1%
>
> On ARM7tdmi it is around 5% for 32 registers with 2 read and 1 write port.
> An ARM7tdmi is as large as 5KB of SRAM. Assuming flash is 4 times as
> dense, a typical MCU with 128KB flash and 16KB of RAM still has 0.5%
> of the area devoted to registers.
>
On the ColdFire, the register set is a very significant part of the die area - so much so that in designing the ColdFire v1 core, Freescale considered halving the number of registers.
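
Wilco's 0.5% estimate above is easy to reproduce; a quick sketch using his figures, with all areas expressed in "KB of SRAM" equivalents:

    #include <stdio.h>

    int main(void)
    {
        double core  = 5.0;           /* ARM7tdmi ~ area of 5KB of SRAM   */
        double regs  = 0.05 * core;   /* register file ~5% of the core    */
        double flash = 128.0 / 4.0;   /* 128KB flash, assumed 4x as dense */
        double ram   = 16.0;

        printf("registers: %.2f%% of the die\n",
               100.0 * regs / (core + flash + ram));   /* ~0.47% */
        return 0;
    }
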
>> Q: Why not apply more of the die, to fix what is a real code/performance
>> bottleneck?
>
> What bottleneck? You lost me here... Adding more registers doesn't
> automatically improve performance.
>
That's true - look at the rather mediocre real-world performance of the Itanium, for an example.
>> Let's increase the size of these 'blazingly fast' registers, and
>> local-ram-overlay them, to reap the benefits, until we hit a point where
>> their time-impact matches the code access, or get to 1-2K, (or a small % of
>> die), whichever comes first.
>>
>> Taking other devices as a yardstick, it's likely to hit the size corner
>> before it hits the speed corner, but a designer will watch both effects.
>
> At 2KB it would double the size of an ARM7tdmi and slow it down
> a lot without a redesigned pipeline (it needs to support 2 accesses in
> less than half a cycle at up to 120MHz, so the SRAM would need to run
> at 500MHz). I think you're assuming register read/write is not already
> critical in existing CPUs - it often is.
>
> However what use do you have for 256/512 fast registers? You can
> only access 16 at any time...
>
>> There is NO design-brick-wall that says when you go over 16 or 32
>> registers, suddenly it is impossible to get fast access: FPGA designers
>> are doing that now, in real silicon.
>
> Sure, it just gets progressively slower with size and number of ports.
>
> Wilco
>
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:6nkpt2he5ovvonio95dk1op99nm42q6kfp@4ax.com...
> On Wed, 21 Feb 2007 22:27:27 GMT, "Wilco Dijkstra"
> <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>
>>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message
>>news:ea7pt21omlbn70fujpjtd5b91ror4gc9l3@4ax.com...
>>> I think of SRAM as "static RAM." Nothing more than that.
> My first reaction to the above is that ASIC cpu designs have control
> over all this and they use that flexibility as a matter of course,
> too. And none of this addresses itself to the fact that registers
> are, in fact, SRAM. So your differentiation is without a difference.
You're using a different definition of SRAM. Wikipedia defines SRAM as a regular single ported cell structure with word and bitlines which is laid out in a 2 dimensional array and typically uses sense amps. It mentions dual ported SRAM and calls it DPRAM.
> What exactly is the difference in your mind between a flipflop and an
> sram bit cell?  I'm curious.
An SRAM bit cell is designed to be laid out in a 2-dimensional structure sharing bit and word lines, thus taking minimal area.  A flop is completely different.  There are lots of variants, but they typically have a clock, may contain a scan chain for debug and sometimes have special features to save power.  Note that flops are typically used in synthesized logic rather than latches and precharged logic.  They are irregular and much larger than an SRAM cell, but they have good fanout and can drive logic directly, unlike SRAM.

So while logically they both store 1 bit, they have different interfaces, characteristics, layout and uses.  I hope that clears things up...

Wilco
On Feb 21, 4:14 pm, Jim Granville <no.s...@designtools.maps.co.nz>
wrote:
> rickman wrote:
> > On Feb 20, 5:01 pm, Jim Granville <no.s...@designtools.maps.co.nz>
> > wrote:
>
> >>rickman wrote:
>
> >>>But for RAM to be as efficient as a register file it has to be triple
> >>>ported so you can read two operands and write back another... or you
> >>>have to go to an accumulator based design. Once you have triple
> >>>ported RAM, you have just added a register file! A rose by any other
> >>>name still smells as sweet...
>
> >>Correct, that's the hardware level detail.
>
> >>The really important point, is at the SW level, you now access any small
> >>clusters of Register-Mappable-RAM variables VERY efficiently indeed,
> >>using register opcodes.
> >>- Such clusters of variables are very common in code
> >>- eg a Real time clock subroutine, could be fully coded using register
> >>opcodes, with a single Ram-locate operation on entry.
>
> > My point is that there is no difference between registers in triple
> > ported RAM and a large register file. If I have 1kB of triple ported
> > RAM, I can play the same game and static allocate memory to interrupt
> > routines for zero overhead context switching.
>
> Yes, and to do that zero context switch, you need a register frame
> pointer (or similar). You do not want to use this as a mere smart-stack,
> but to allow all the Reg opcodes to work, on any window into that larger
> memory.
> Triple ported, or dual port, depends more on the core in question.
>
> Even the lowly 80c51 has a register frame pointer (all of 2 bits in
> size), and it does overlay the registers with RAM.
> The z8 expands this to 8 bits, and I think the XC166 uses a 16 bit one.
>
> <snip>
>
> >>It's also backward compatible. If you are uncomfortable with the
> >>overlay, or the tools are catching up, just leave the register pointer
> >>alone, and you have plain-old-vanilla-RISC.
>
> > With more limited capabilities due to the register linking for
> > subroutine calls.
>
> ? - you've lost me here. In subset mode, you simply ignore the register
> frame pointer, and it is _exactly_ the same as your un-enhanced core.
I am referring to the way the TMS9900 links subroutines.  They save the return address in a register.  This is in part because the use of the register pointer partially negates the need for a stack.  But you still have to save the return address before you link to another subroutine.  If you are changing the register pointer, you either have to save the old one on a stack or in a register, or you have to hard code the register pointer restore.

I seem to recall TI having set a convention that used an extra location in memory at the start of a routine.  I believe this was to load the register pointer, but I'm not certain.  I just recall that the overall effect was not really any better than using a stack with internal registers.

If you are using internal, multiport RAM for memory mapped registers, then you are really just using a large register file.  Don't some of the RISC CPUs do that?
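
The TI convention half-remembered here sounds like the TMS9900's BLWP ("branch and load workspace pointer"): a two-word vector ahead of the routine supplies a new workspace pointer and entry address, and the old context is saved into the new workspace's R13-R15. A toy C model, reconstructed from memory and simplified to word-indexed memory, so treat the details as approximate:

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t mem[0x8000];                 /* toy word-indexed memory */
    typedef struct { uint16_t wp, pc, st; } cpu_t;

    /* Registers live in RAM: Rn of the current workspace is mem[wp + n]. */
    static void blwp(cpu_t *c, uint16_t vec)
    {
        uint16_t new_wp = mem[vec];              /* word 1: new workspace  */
        uint16_t new_pc = mem[vec + 1];          /* word 2: entry point    */
        mem[new_wp + 13] = c->wp;                /* old WP -> new R13      */
        mem[new_wp + 14] = c->pc;                /* old PC -> new R14      */
        mem[new_wp + 15] = c->st;                /* old ST -> new R15      */
        c->wp = new_wp;                          /* register bank and PC   */
        c->pc = new_pc;                          /* switch in one step     */
    }

    static void rtwp(cpu_t *c)                   /* return: undo all three */
    {
        c->st = mem[c->wp + 15];
        c->pc = mem[c->wp + 14];
        c->wp = mem[c->wp + 13];
    }

    int main(void)
    {
        cpu_t c = { 0x0100, 0x0200, 0 };
        mem[0x0040] = 0x0300;                    /* vector: workspace      */
        mem[0x0041] = 0x0400;                    /* vector: entry address  */

        blwp(&c, 0x0040);
        printf("in routine: wp=%04x pc=%04x\n", c.wp, c.pc);
        rtwp(&c);
        printf("back again: wp=%04x pc=%04x\n", c.wp, c.pc);
        return 0;
    }
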
On 22 Feb 2007 08:18:39 -0800, "rickman" <gnuarm@gmail.com> wrote:

><snip>
>I am referring to the way the TMS9900 links subroutines. They save
>the return address in a register.
The PDP-11's JSR allowed something like that, as well.  The instruction was/is:

   JSR reg, destination

It would:

   scratch <-- destination
   --(SP)  <-- reg       ; push reg
   reg     <-- PC        ; put return addr in reg
   PC      <-- scratch   ; go to subr

(The use of 'scratch' above makes it clear that side effects of the destination address [such as auto increment or decrement] would take place first.)

A common use was to set PC as 'reg', resulting in the usual call that you find on most micros these days.  But the PDP-11 version allowed you greater flexibility -- particularly useful in the case of supporting paired coroutines.  Sad that the common use has caused the less common usage to be largely lost to us, today.
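
Those microsteps translate directly into a toy C model. The interesting case is the conventional call, JSR PC,dest, where pushing 'reg' pushes the old PC itself. The sketch assumes a word-indexed toy memory for brevity:

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t mem[0x8000];        /* toy word-indexed memory  */
    static uint16_t reg[8];             /* R0..R5, SP = R6, PC = R7 */

    /* The JSR microsteps exactly as listed above. */
    static void jsr(int r, uint16_t destination)
    {
        uint16_t scratch = destination; /* dest side effects happen first */
        mem[--reg[6]] = reg[r];         /* --(SP) <-- reg   ; push reg    */
        reg[r] = reg[7];                /* reg <-- PC   ; return address  */
        reg[7] = scratch;               /* PC  <-- scratch ; go to subr   */
    }

    int main(void)
    {
        reg[6] = 0x7ff0;                /* stack near the top of memory  */
        reg[7] = 0x0100;                /* pretend this is where we are  */

        jsr(7, 0x0200);                 /* JSR PC,dest: an ordinary call */
        printf("PC=%04x, return address on stack=%04x\n",
               reg[7], mem[reg[6]]);

        /* With r != 7 the return address lands in a register instead,
           which is what enables the paired-coroutine trick. */
        return 0;
    }
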
>This is in part because the use of
>the register pointer partially negates the need for a stack. But you
>still have to save the return address before you link to another
>subroutine.
Or, if it had supported the more flexible PDP11 JSR, you could choose and that would not be needed unless you used some non-PC register as your linkage.
>If you are changing the register pointer, you either have
>to save the old one on a stack or in a register, or you have to hard code
>the register pointer restore.
>
>I seem to recall TI having set a convention that used an extra
>location in memory at the start of a routine.
Yikes!  If I gather you correctly, that's what we got away from with the concepts of using a stack!  Storing the linkage at a fixed location in the routine means program-lifetime storage, and it's not recursive or re-entrant that way!
>I believe this was to
>load the register pointer, but I'm not certain. I just recall that
>the overall effect was not really any better than using a stack with
>internal registers.
Jon