EmbeddedRelated.com
Forums

Compare ARM MCU Vendors

Started by Dave Graffio September 1, 2010
Ulf Samuelsson <nospam.ulf@atmel.com> writes:

>> an 8 or 16 byte buffer is fine when your cpu clock speed is not too much
>> higher than the flash access speed.
>
> It is not implemented on the ARM, and I do not think that it
> exists in the Cortex-M3 either.
Cortex-M3 includes a three-word prefetch buffer, which in the best case
can hold six instructions. The macrocell also provides signals that
export certain internal pipeline states for controlling "flash
accelerators".

Best regards
Marcus

--
note that "property" can also be used as syntactic sugar to reference
a property, breaking the clean design of verilog; [...]
(seen on http://www.veripool.com/verilog-mode_news.html)
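The buffered-flash idea debated further down this thread (a wide flash line cached behind a single "tag") can be sketched abstractly. This is a toy model with made-up sizes, not a description of any vendor's actual flash accelerator:

```python
# Toy model of a single-tag flash buffer ("one-line cache"): a miss
# refills the whole line; sequential fetches then hit until the next
# line boundary. The 16-byte line size is illustrative only.
class FlashBuffer:
    def __init__(self, line_bytes=16):
        self.line_bytes = line_bytes
        self.tag = None              # which flash line is buffered now
        self.hits = self.misses = 0

    def fetch(self, addr):
        line = addr // self.line_bytes
        if line == self.tag:
            self.hits += 1           # served from the buffer, no stall
        else:
            self.misses += 1         # stall while the line refills
            self.tag = line

buf = FlashBuffer()
for pc in range(0, 64, 4):           # straight-line 32-bit fetches
    buf.fetch(pc)
# Sequential code hits 12 of these 16 fetches; branchy code would miss
# far more often, which is why a branch-target cache also comes up below.
```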
On Fri, 24 Sep 2010 09:28:01 +0200, David Brown
<david@westcontrol.removethisbit.com> wrote:

>I think it is interesting to look at the history of instruction sets.
>Long ago, there were two competing ideas - there were CISC instruction
>sets with very varied instruction sets (typically in 8-bit parts), and
>RISC which were all consistent and wide (typically 32-bit). It turns
>out that both extremes were "wrong", and the most efficient modern
>instruction sets for small devices are 16-bit wide for most
>instructions, with some 32-bit (or 48-bit) for flexibility.
A significant point missing in the above comment you made,
David, is that the _environment_ was different then as now.

For example, when the MIPS R2000 RISC processor was being
designed and then instanced, they didn't have access to the
highest density FABs... which were only found at Intel and
Motorola, at the time. For obvious commercial reasons.
Instead, they had to live with much larger feature sizes and
lesser capabilities and _still_ field a competitive product.

Also, in and around that period of time, the number of
inverters and transmission gates (aka 'transistors')
available was much, much less than now. Even if you were
Intel or Motorola. As time passed, that capability grew to
the point where folks weren't at all strapped and began
wondering what else they could do with all those extras
sitting around. Which opened the door for making design
decisions that were impossible, earlier. Such as the PPro/P2
choice of decoding CISC instructions into RISC instructions,
executed out of order and re-assembled later on. That simply
wasn't possible, earlier.

I wouldn't characterize decisions made at that time as
"wrong." The options available to a designer back then were
very little like what is available now.

Jon
On Fri, 24 Sep 2010 07:34:50 -0400, Walter Banks
<walter@bytecraft.com> wrote:

>There is a lot I like about the Thumb 2 ISA. I have worked on ISA
>design on several commercial processors. M68K (that you mention
>and I clipped) patterned after the PDP11 is the classical orthogonal
>instruction set.
Since I'm intimately familiar with both, I'd be interested in discussing some of those decisions made in the 68k case if you are open to the idea in a public space.
>It takes a lot more than that to make an efficient
>processor. The TI9900, a contemporary of the 68K development with
>similar roots, was less effective at executing applications. The
>difference between 68k and 9900 was essentially data flow inside
>the processor. The 9900 was easier to program in many ways BUT
>it relied on more indirect accesses to data and was
>significantly less efficient.
The 9900 was very interesting, though I never had the chance to actually program one of them. If I recall correctly, it supported the visibility requirements of Pascal's local variables within nested functions. I'm not entirely sure I appreciate your comment about data flows because the Intel x86 did add some function prologue/epilogue instructions later on to also support that feature... but you are probably referring to some other aspect I know zero about.
>Clean data flow between executing instructions is as important as
>the instructions. The classic example of how to kill a processor
>is to need to process memory management through primary
>accumulator(s). This killed several processors in the 90's.
>
>RISC can be very efficient but requires a different approach to
>code generation. The xgate is a simple 16 bit RISC that, driven
>with a well designed code generator, will compete with well designed
>CISC processors. Our application based benchmarks showed that
>the difference was about 10%. There is a whole area of instruction
>design that trades compile time complexity for processor
>simplicity or timing.
Good books on this subject, too.

I remember talking with one of the founders of MIPS (Hennessy) about
his analysis of the 68020 (and this is part of why I'd very much enjoy
a discussion on that topic, because of what he shared with me that
long day) and discussing even such 'insignificant' details as why they
chose not to flag registers as 'busy' in the R2000. There was a
cycle-length cost to it, because it added delay to a combinatorial
chain, which reduced the clock rate possible. Instead, the next
instruction would NOT wait until a write completed. They left that to
the compiler to worry over.

I was blessed to hear about many other such interesting design
decisions they made on the R2000, that day. (It was a personal 1:1
meeting.)
>Many of the most successful ISAs make very good use of redundant
>instructions. This has been done four ways.
>
>1) Conceptually have a page 0 space where some RAM areas are
>   more valuable but the access is quicker and requires less
>   generated code.
PIC and 6502 being examples?
>2) Memory to memory operations that don't require intervening
>   register involvement.
As in the Intel REP MOV or the DEC PDP-11's MOV (R5)+, (R6)+ as two very different examples?
>3) Instructions with implied arguments.
>   For example inc, dec, complement.
8051 being a classic here?
>4) Mapping registers (real and virtual) on RAM space reduces
>   register specific instructions - an extreme example is the
>   move machines with one instruction.
I think I remember some comments you made about this, earlier. Thanks, Walter. I enjoyed reading this. Jon
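Jon's point above about the R2000 not flagging registers as 'busy' can be illustrated with a toy pipeline model (purely illustrative; this is not the real R2000 microarchitecture): without a hardware interlock, the instruction immediately after a load sees the register's old value, so the compiler must schedule an independent instruction (or a nop) into the slot.

```python
# Tiny model of a load whose result arrives one cycle late, with no
# hardware interlock: the very next instruction reads the OLD value.
def run(program, memory):
    regs = {}
    pending = None                  # (reg, value) committed one cycle later
    for op, *args in program:
        commit, pending = pending, None
        if op == "lw":              # lw rd, addr
            rd, addr = args
            pending = (rd, memory[addr])
        elif op == "add":           # add rd, rs, rt
            rd, rs, rt = args
            regs[rd] = regs.get(rs, 0) + regs.get(rt, 0)
        elif op == "nop":
            pass
        if commit:                  # the load's writeback lands here
            regs[commit[0]] = commit[1]
    if pending:
        regs[pending[0]] = pending[1]
    return regs

# Back-to-back: the add reads r1 before the load's writeback.
hazard = run([("lw", "r1", 0), ("add", "r2", "r1", "r1")], {0: 21})
# Compiler-scheduled: one filler instruction hides the latency.
scheduled = run([("lw", "r1", 0), ("nop",),
                 ("add", "r2", "r1", "r1")], {0: 21})
```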
On 24/09/2010 15:31, Jon Kirwan wrote:
> On Fri, 24 Sep 2010 09:28:01 +0200, David Brown
> <david@westcontrol.removethisbit.com> wrote:
>
>> I think it is interesting to look at the history of instruction sets.
>> Long ago, there were two competing ideas - there were CISC instruction
>> sets with very varied instruction sets (typically in 8-bit parts), and
>> RISC which were all consistent and wide (typically 32-bit). It turns
>> out that both extremes were "wrong", and the most efficient modern
>> instruction sets for small devices are 16-bit wide for most
>> instructions, with some 32-bit (or 48-bit) for flexibility.
>
> A significant point missing in the above comment you made,
> David, is that the _environment_ was different then as now.
>
> [...]
>
> I wouldn't characterize decisions made at that time as
> "wrong." The options available to a designer back then were
> very little like what is available now.
>
That is precisely why I put the inverted commas around "wrong". ISAs
like Thumb2 or ColdFire are more efficient than the x86 or ARM
instruction sets using modern design and fabrication techniques - I
fully agree that the situation was different 30 years ago (well, 21
years ago for ARM, IIRC). There is also the minor matter of
trial-and-error - we know better now, based on experience from past
designs.

However, it is still valid to compare the x86 and the m68k architectures
because they were from a similar time. The x86 was considered poor and
old-fashioned when it was first made, and not suitable for powerful
computing. The m68k was considered an elegant and modern design, with
forward-looking design choices (such as support for 32-bit in the ISA,
even though it only had a 16-bit ALU in the implementation). It was not
without reason that the IBM engineers wanted the m68k for their new "PC".
David Brown wrote:

> However, it is still valid to compare the x86 and the m68k architectures
> because they were from a similar time. The x86 was considered poor and
> old-fashioned when it was first made, and not suitable for powerful
> computing. The m68k was considered an elegant and modern design, with
> forward-looking design choices (such as support for 32-bit in the ISA,
> even though it only had a 16-bit ALU in the implementation). It was not
> without reason that the IBM engineers wanted the m68k for their new "PC".
In my mind I'd made the cut between specialized-register machines
(i8080 etc., Z80, TI32000 DSPs, DEC 8, GE-635) and general-register
machines (m68k, mc6800 -- sort of, m88000 DSPs, DEC 11, IBM360).
Specialized were easier to code for because after you'd figured out
which registers your data had to be in, the low-level code was pretty
much determined; general registers needed a conscious
resource-allocation step.

In code performance, the only head-to-head comparison I've done was to
rewrite identical functions for Z180 and MC68HC11. The functions always
took less code memory and less execution time on the HC11. Partly because
there was less data-shuffling (and partly because HC11 conditional-branch
operations were handier.)

Mel.
On Sep 24, 10:28 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> ....
> I think it is interesting to look at the history of instruction sets.
> Long ago, there were two competing ideas - there were CISC instruction
> sets with very varied instruction sets (typically in 8-bit parts), and
> RISC which were all consistent and wide (typically 32-bit). It turns
> out that both extremes were "wrong", and the most efficient modern
> instruction sets for small devices are 16-bit wide for most
> instructions, with some 32-bit (or 48-bit) for flexibility. Consistency
> and orthogonality of the architecture is important, but should not be
> taken to extremes. There is a lot to like about the Thumb2 set - I
> think it's a big improvement on the original ARM ISA.
>
> Of course, the 68000 designers at Motorola figured this out about 30
> years ago...
The 68k guys certainly did that - and they created the 68k assembly
language, which has been the most efficient native CPU assembly
language I have seen since. But the power architecture designers did a
great job, too - apart from the barely if at all usable native
assembly mnemonics.

I can hardly think of something I miss from the 68k on the PPC; and
you will be surprised how little code size difference there is if you
write with the PPC in mind (on VPA, which is what I call the language
I built over the 68k assembly to produce PPC code).

Some examples: 68k byte or word operands affect only the byte or word
in the destination register. This usually costs either a preceding
extra clr.l or a subsequent ext.l; it is very rare that one can take
advantage of the preserved upper part data. Power takes advantage of
its 32 bit opcode size and lets us leave the condition codes
unaffected if needed; not so on the 68k, and this also costs code size
(I am not talking speed at all here, which is clearly in favour of
power).

What I do miss from the 68k is the movem instruction with a bit per
register which will be moved; VPA does this on power by simply doing a
single move per register. I suppose this is one of the major code
space eaters I have in my code. They do have a native move multiple
opcode, but only for sequential registers, and I do not use it much.

But the rlwinm, rlwnm and rlwimi PPC instructions are a major help.
While the 68020 did have bitfield instructions, they did not make it
to the cpu32 and were probably too bulky to implement; these 3 opcodes
with a little help from others can really do a lot. I would not give
them up, not without a fight :-) .

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
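For readers who have not met rlwinm, its rotate-and-mask semantics can be modelled in a few lines (a sketch of the architectural definition, using IBM bit numbering where bit 0 is the most significant bit - the "opposite way" numbering complained about in the next post):

```python
def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def mask(mb, me):
    """Ones from bit mb through bit me, IBM numbering (bit 0 = MSB);
    mb > me yields a wrapped mask, as the architecture allows."""
    if mb > me:
        return mask(0, me) | mask(mb, 31)
    m = 0
    for i in range(mb, me + 1):
        m |= 1 << (31 - i)
    return m

def rlwinm(rs, sh, mb, me):
    """rlwinm: rotate left word immediate, then AND with mask."""
    return rotl32(rs, sh) & mask(mb, me)

# Extracting the top byte of a register in a single instruction:
top = rlwinm(0xDEADBEEF, 8, 24, 31)
```

rlwimi additionally merges the rotated value into the destination under the same mask, which is what makes bitfield insertion cheap.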
Didi <dp@tgi-sci.com> wrote:
> But the power architecture designers did a great job, too - apart from
> the barely if at all usable native assembly mnemonics.
That and insisting on numbering bits the opposite way to everyone else. -a

Mel wrote:
>
> In code performance, the only head-to-head comparison I've done was to
> rewrite identical functions for Z180 and MC68HC11. The functions always
> took less code memory and less execution time on the HC11. Partly because
> there was less data-shuffling (and partly because HC11 conditional-branch
> operations were handier.)
HC11 and HC08 both have reasonably good code density.

There are many arguments for and against conditional branch vs
conditional skip. In benchmarks conditional branch generally wins,
but on some very time critical code conditional skip wins.
Motorola's conditional branch set is complete but has limited
range, resulting in "5" byte two instruction long branches in
some code.

I am surprised that the 6809 has yet to show up in this discussion.

Regards,

w..
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com
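The "5" byte two-instruction long branch above arises when a conditional branch target falls outside the signed 8-bit displacement: the assembler inverts the condition and branches around an absolute jump. A sketch of the size calculation, with byte counts patterned on 6800-family-style encodings for illustration only:

```python
def cond_branch_size(offset):
    """Code bytes needed to conditionally branch by a given offset."""
    if -128 <= offset <= 127:
        return 2        # short Bcc: opcode + signed 8-bit displacement
    # Out of range: inverted 2-byte Bcc skipping a 3-byte extended JMP,
    # i.e. the "5-byte two-instruction long branch".
    return 2 + 3

near = cond_branch_size(100)
far = cond_branch_size(1000)
```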
On Sep 24, 9:51 pm, Walter Banks <wal...@bytecraft.com> wrote:
> ...
> Motorola's conditional branch set is complete but has limited
> range resulting in "5" byte two instruction long branches in
> some code.
>
> I am surprised that the 6809 has yet to show up in this discussion
:D Indeed. But even I - having grown up on it - have long since put it
out of use :-).

6800/11 code can be very dense indeed; some 25+ years ago I was used
to counting every single byte of code, used cpx # to skip 2 bytes and
cmpa # (or did I use cmpb #) to skip 1 byte, and plenty of other
things, most applicable also to the 09.

But since you asked for the 6809, here it is - 3 instances of it
emulated in DPS windows, the tiniest being as large as the graphics
board I had built back then... only here in colours; that first
graphics board of mine had 2 bits/pixel which I used to view on a mono
monitor: http://tgi-sci.com/misc/sc09em.gif

A 400 MHz PPC does the emulation, something like 80+ times faster than
the original 2 MHz 6809 system which is emulated; never cared to
really measure it precisely.

Oh no, I can't believe all these years have passed. Walter, you really
should not have asked that :D :D .

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
2010-09-24 09:28, David Brown skrev:
> On 24/09/2010 01:48, Ulf Samuelsson wrote:
>> 2010-09-23 23:17, rickman skrev:
>>> On Sep 23, 12:52 pm, Ulf Samuelsson<nospam....@atmel.com> wrote:
>>>> 2010-09-23 09:52, David Brown skrev:
>>>>
>>>>> On 23/09/2010 08:30, Ulf Samuelsson wrote:
>>>>>> 2010-09-21 13:16, rickman skrev:
>>>>>>> On Sep 20, 4:30 am, Ulf Samuelsson<u...@a-t-m-e-l.com> wrote:
>>>>>>>> David Brown skrev:
>>>>>>>>> A better solution for micros like that is a wider flash design
>>>>>>>>> with an sram buffer in the flash module - that is certainly how
>>>>>>>>> some manufacturers handle the problem. It is a simpler solution
>>>>>>>>> than a full instruction cache because you have only a single
>>>>>>>>> "tag" (or perhaps two, if you have two such buffers), and there
>>>>>>>>> are no issues with coherence or anything else. The buffer of
>>>>>>>>> perhaps 256 bytes gets filled whenever you access a new "page"
>>>>>>>>> in the flash, so that the processor then reads from the buffer
>>>>>>>>> rather than directly from the flash. And if space/economics
>>>>>>>>> allow, you can have a wider flash-to-buffer bus to keep up a
>>>>>>>>> high bandwidth even with slow flash and a fast processor.
>>>>
>>>>>>>> The disadvantage of having a 256 byte wide memory is power
>>>>>>>> consumption. You will have 2048 active sense amplifiers.
>>>>>>>> I don't see that coming soon.
>>>>
>>>>>>> I hope you aren't involved in architecting new MCU designs. I don't
>>>>>>> think anyone said they wanted 2048 sense amplifiers. I would either
>>>>>>> interpret the above to be "256 bits" or I would consider an
>>>>>>> implementation that used a 256 byte cache of some sort. What
>>>>>>> would be the utility of a 256 byte wide interface to the Flash?
>>>>>>> Even the fastest CM3 CPUs can't run at nearly that speed.
>>>>
>>>>>>> Rick
>>>>
>>>>>> I am certainly involved in the definition of new MCU designs,
>>>>>> although mostly by providing ideas.
>>>>
>>>>>> He said that he wanted a 256 byte buffer, and I really doubt
>>>>>> that this should be interpreted as bits.
>>>>
>>>>>> He only said that the buffer will be filled when you accessed a
>>>>>> new page, and did not state how many cycles it would take.
>>>>>> From a performance point of view, it makes more sense to load it
>>>>>> in one cycle. If you start loading using sequential accesses to
>>>>>> the flash, you will probably waste both cycles and power.
>>>>
>>>>> From the performance viewpoint, loading in a single cycle would be
>>>>> ideal - but from the space and power viewpoint that would be a bad
>>>>> idea. So loading sequentially with a medium-width bus (I suggested
>>>>> 64 bit) is likely to be the best compromise.
>>>>
>>>>>> The proposal is already implemented in page mode DRAMs,
>>>>>> so it may make sense at first, unless you know more about flash
>>>>>> internals.
>>>>
>>>>> I know enough about flash internals to know it is a useful idea, and
>>>>> could be a cheap, simple and low-power method to improve flash access
>>>>> speeds. I know enough about chip design and logic design to know that
>>>>> de-coupling the flash access and control logic from the processor's
>>>>> memory bus will simplify some of the logic, and reduce the levels of
>>>>> combination logic that must be completed within a clock cycle. It
>>>>> also allows the processor and the flash module to run at independent
>>>>> speeds.
>>>>
>>>>> I also know that it would complicate other parts of the design, and
>>>>> the extra unnecessary flash reads may outweigh the flash reads spared.
>>>>
>>>>> In effect, my suggestion is a cache front-end to the flash with just
>>>>> one line, but a large line width and perhaps two-way associativity.
>>>>> The ideal balance may be different - half the line width and
>>>>> four-way associativity might be better. It's all a balancing act.
>>>>
>>>>> I also know that I don't know nearly enough detail to judge whether
>>>>> the sums will add up to making this a good idea in practice. It
>>>>> depends on so many factors such as flash design (some incur extra
>>>>> delays when switching pages), access times, power requirements of
>>>>> the different parts, access patterns on the instruction bus, area
>>>>> costs, design times and design costs, etc., and I don't know
>>>>> anything about these.
>>>>
>>>>> I am also fairly sure that the designers who /are/ capable of
>>>>> calculating and balancing these tradeoffs will have thought of
>>>>> doing something like this. There are certainly similar solutions
>>>>> used on many high-speed flash microcontrollers, though they may be
>>>>> much smaller. It could well be that my suggested 256 byte buffer
>>>>> is far too big, and that an 8 or 16 byte buffer is fine when your
>>>>> cpu clock speed is not too much higher than the flash access speed.
>>>>
>>>> I think that the way this is implemented is through an instruction
>>>> queue. This was implemented in early 32 bit chips, like the NS32016
>>>> and the MC68010. The MC68010 even allowed you to loop in the queue.
>>>>
>>>> It is not implemented on the ARM, and I do not think that it
>>>> exists in the Cortex-M3 either. The AVR32 does have a queue
>>>> and will fetch instructions faster than it will execute,
>>>> and this is one reason why the AVR32 can handle waitstates
>>>> better than the Cortex-M3.
>>>>
>>>> On the AVR32 you lose about 7% due to the waitstate on the first
>>>> access, and you only need one waitstate at 66 MHz, the top speed
>>>> of current production parts.
>>>>
>>>> You will not get 100% hitrate, so your boost will be less than 7%.
>>>> If you do add SRAM, you might be better off adding a branch-target
>>>> cache to get rid of the initial waitstates.
>>>> Once you start running sequential fetch the wide memory will
>>>> give you a benefit, but even a 128 bit flash can be a hog on power.
>>>>
>>>> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit
>>>> flash at the same frequency when running Thumb mode,
>>>> and it draws much less current.
>>>> The faster flash makes all the difference.
>>>> The LPC2xxx can offset this with a slightly higher clock rate,
>>>> but that will not make power consumption better.
>>>
>>> So many IFs, so little time. Benchmarking is an art, not a science.
>>> Best to run your app and see what is faster for your app.
>>>
>>> Rick
>>
>> If fast is the parameter you are looking for!
>> Many applications need a certain speed, but once it is there,
>> it will not use additional performance.
>>
>> You have a basic selection between speed and code size on the ARM7,
>> but with waitstates the lower memory use of the Thumb instruction set
>> can make it faster than the ARM instruction set.
>
> I think it is interesting to look at the history of instruction sets.
> Long ago, there were two competing ideas - there were CISC instruction
> sets with very varied instruction sets (typically in 8-bit parts), and
> RISC which were all consistent and wide (typically 32-bit). It turns
> out that both extremes were "wrong", and the most efficient modern
> instruction sets for small devices are 16-bit wide for most
> instructions, with some 32-bit (or 48-bit) for flexibility. Consistency
> and orthogonality of the architecture is important, but should not be
> taken to extremes. There is a lot to like about the Thumb2 set - I
> think it's a big improvement on the original ARM ISA.
>
> Of course, the 68000 designers at Motorola figured this out about 30
> years ago...
>
I consider the 68000 to be a nightmare compared to a real orthogonal
machine like the Series 32000... This did not fix the location of the
fields, though, so you would get a messy instruction decoder. The
immediates were 7, 14 or 30 bits wide, with the size encoded in either
the top bit or the two top bits.

I think variable size, fixed location of fields seems to be most
efficient. The National CompactRISC was one of the first
implementations of this idea: 16/32/48 bit instructions, with "quick"
5 bit immediates having reserved values that indicated that an
extension word or two followed the instruction.

Internally, the decoder would directly decode the 16/32/48 bit
instruction for a simple pipeline. For multiple clock functions like
interrupts & exceptions there were 7-8 state machines that could
override the instruction decoder. Since the datapath was controlled by
~90 signals, this turned out to be a significant part of the chip:
90 x (9->1 mux) + state logic...

It turned out that this could be simplified further. The reason for
the state machines was that you need to do operations which are not
supported by the instruction set. When you enter an interrupt, you
need to clear the interrupt flag. This is really an (unsupported)
instruction

    AND $IRQMASK, PSR

i.e.

    AND 0xFF7F, PSR

The problem we found was that you need to do operations on registers
which are not accessible, and you need a few more instructions. The
problem was solved by extending each register address from 4 to 6
bits, allowing all registers (including the PSR) in the CPU to be
directly addressable by all instructions operating on registers. The
opcode was extended by two bits, allowing more instructions to be
directly handled. The normal instruction was extended from 16 bits to
22 bits, but user code would only use the normal 16 bits. If a
multiclock function was needed, the instruction decoder was fed from a
22 bit wide ROM which ran for a few clock cycles.
8 x 90 bit state machines + a 90 x 9->1 multiplexer were replaced by a
32-64 x 22 bit ROM and 22 x 2->1 muxes.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
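The "quick" immediates with reserved escape values described above can be sketched as a decoder. The 5-bit field width matches the description, but the specific escape encodings below are hypothetical, not CompactRISC's actual ones:

```python
def decode_quick(field5, ext_words):
    """Decode a 5-bit quick-immediate field.

    Hypothetical encoding for illustration: values 0..29 are literal
    immediates; the reserved values 30 and 31 signal that one or two
    16-bit extension words follow the instruction.
    Returns (value, extension_words_consumed).
    """
    if field5 <= 29:
        return field5, 0            # common small constants cost nothing
    if field5 == 30:
        return ext_words[0], 1      # one extension word
    return (ext_words[0] << 16) | ext_words[1], 2   # two extension words

# Most immediates fit the base instruction; large ones grow it.
small = decode_quick(7, [])
wide = decode_quick(30, [0x1234])
huge = decode_quick(31, [0x0001, 0x0000])
```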