2010-09-24 09:28, David Brown skrev:
> On 24/09/2010 01:48, Ulf Samuelsson wrote:
>> 2010-09-23 23:17, rickman skrev:
>>> On Sep 23, 12:52 pm, Ulf Samuelsson<nospam....@atmel.com> wrote:
>>>> 2010-09-23 09:52, David Brown skrev:
>>>>
>>>>
>>>>
>>>>> On 23/09/2010 08:30, Ulf Samuelsson wrote:
>>>>>> 2010-09-21 13:16, rickman skrev:
>>>>>>> On Sep 20, 4:30 am, Ulf Samuelsson<u...@a-t-m-e-l.com> wrote:
>>>>>>>> David Brown skrev:
>>>>>>>>> A better solution for micros like that is a wider flash design
>>>>>>>>> with an
>>>>>>>>> sram buffer in the flash module - that is certainly how some
>>>>>>>>> manufacturers handle the problem. It is a simpler solution than
>>>>>>>>> a full
>>>>>>>>> instruction cache because you have only a single "tag" (or perhaps
>>>>>>>>> two,
>>>>>>>>> if you have two such buffers), and there are no issues with
>>>>>>>>> coherence or
>>>>>>>>> anything else. The buffer of perhaps 256 bytes gets filled
>>>>>>>>> whenever
>>>>>>>>> you
>>>>>>>>> access a new "page" in the flash, so that the processor then reads
>>>>>>>>> from
>>>>>>>>> the buffer rather than directly from the flash. And if
>>>>>>>>> space/economics
>>>>>>>>> allow, you can have a wider flash-to-buffer bus to keep up a high
>>>>>>>>> bandwidth even with slow flash and a fast processor.
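The buffered-flash idea described above can be sketched in C as a one-line cache with a single tag. This is a minimal illustration, not any vendor's actual design: the 256-byte page size comes from the text, but the structure names and the hit/fill counters are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* One SRAM buffer holding one flash "page", plus a single tag.
 * A read either hits the buffer (fast) or triggers a page fill
 * (one slow, wide flash-array access). */
#define PAGE_SIZE 256u

typedef struct {
    uint32_t tag;        /* page number currently held in the buffer */
    bool     valid;
    unsigned fills;      /* slow flash-array page fills              */
    unsigned hits;       /* fast reads served from the SRAM buffer   */
} flash_buf;

static void fb_read(flash_buf *fb, uint32_t addr)
{
    uint32_t page = addr / PAGE_SIZE;
    if (fb->valid && fb->tag == page) {
        fb->hits++;              /* served from the buffer, no wait state */
    } else {
        fb->tag = page;          /* refill the buffer from the flash array */
        fb->valid = true;
        fb->fills++;
    }
}
```

For a purely sequential instruction stream of word-sized fetches, only one fill per 256-byte page is needed; branches across page boundaries are what cost extra fills.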
>>>>
>>>>>>>> The disadvantage of having a 256 byte wide memory is power
>>>>>>>> consumption.
>>>>>>>> You will have 2048 active sense amplifiers.
>>>>>>>> I don't see that coming soon.
>>>>
>>>>>>> I hope you aren't involved in architecting new MCU designs. I don't
>>>>>>> think anyone said they wanted 2048 sense amplifiers. I would either
>>>>>>> interpret the above to be "256 bits" or I would consider an
>>>>>>> implementation that used a 256 byte cache of some sort. What
>>>>>>> would be
>>>>>>> the utility of a 256 byte wide interface to the Flash? Even the
>>>>>>> fastest CM3 CPUs can't run at nearly that speed.
>>>>
>>>>>>> Rick
>>>>
>>>>>> I am certainly involved in the definition of new MCU designs,
>>>>>> although mostly by providing ideas.
>>>>
>>>>>> He said that he wanted a 256 byte buffer, and I really doubt
>>>>>> that this should be interpreted as bits.
>>>>
>>>>>> He only said that the buffer will be filled when you
>>>>>> accessed a new page, and did not state how many cycles it would take.
>>>>>> From performance point of view, it makes more sense to load it in one
>>>>>> cycle. If you start loading using sequential accesses to the flash,
>>>>>> you will probably waste both cycles and power.
>>>>
>>>>> From the performance viewpoint, loading in a single cycle would be
>>>>> ideal - but from the space and power viewpoint that would be a bad
>>>>> idea.
>>>>> So loading sequentially with a medium-width bus (I suggested 64
>>>>> bit) is
>>>>> likely to be the best compromise.
>>>>
>>>>>> The proposal is already implemented in page mode DRAMs,
>>>>>> so it may make sense at first, unless you know more about flash
>>>>>> internals.
>>>>
>>>>> I know enough about flash internals to know it is a useful idea, and
>>>>> could be a cheap, simple and low-power method to improve flash access
>>>>> speeds. I know enough about chip design and logic design to know that
>>>>> de-coupling the flash access and control logic from the processor's
>>>>> memory bus will simplify some of the logic, and reduce the levels of
>>>>> combination logic that must be completed within a clock cycle. It also
>>>>> allows the processor and the flash module to run at independent
>>>>> speeds.
>>>>
>>>>> I also know that it would complicate other parts of the design, and
>>>>> the
>>>>> extra unnecessary flash reads may outweigh the flash reads spared.
>>>>
>>>>> In effect, my suggestion is a cache front-end to the flash with just
>>>>> one
>>>>> line, but a large line width and perhaps two-way associativity. The
>>>>> ideal balance may be different - half the line width and four-way
>>>>> associativity might be better. It's all a balancing act.
>>>>
>>>>> I also know that I don't know nearly enough detail to judge whether
>>>>> the
>>>>> sums will add up to making this a good idea in practice. It depends on
>>>>> so many factors such as flash design (some incur extra delays when
>>>>> switching pages), access times, power requirements of the different
>>>>> parts, access patterns on the instruction bus, area costs, design
>>>>> times
>>>>> and design costs, etc., and I don't know anything about these.
>>>>
>>>>> I am also fairly sure that the designers who /are/ capable of
>>>>> calculating and balancing these tradeoffs will have thought of doing
>>>>> something like this. There are certainly similar solutions used on
>>>>> many
>>>>> high-speed flash microcontrollers, though they may be much smaller. It
>>>>> could well be that my suggested 256 byte buffer is far too big, and
>>>>> that
>>>>> an 8 or 16 byte buffer is fine when your cpu clock speed is not too
>>>>> much
>>>>> higher than the flash access speed.
>>>>
>>>> I think that the way this is implemented is through an instruction
>>>> queue. This was implemented in early 32 bit chips, like the NS32016
>>>> and the MC68010. The MC68010 even allowed you to loop in the queue.
>>>>
>>>> It is not implemented on the ARM, and I do not think that it
>>>> exists in the Cortex-M3 either. The AVR32 does have a queue
>>>> and will fetch instructions faster than it will execute them,
>>>> and this is one reason why the AVR32 can handle waitstates
>>>> better than the Cortex-M3.
>>>>
>>>> On the AVR32 you lose about 7% due to the waitstate on the first
>>>> access,
>>>> and you only need one waitstate at 66 MHz, the top speed of current
>>>> production parts.
>>>>
>>>> You will not get a 100% hit rate, so your boost will be less than 7%.
>>>> If you do add SRAM, you might be better off adding a branch-target
>>>> cache to get rid of the initial waitstates.
>>>> Once you start running sequential fetches, the wide memory will
>>>> give you a benefit, but even a 128 bit flash can be a hog on power.
>>>>
>>>> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit
>>>> flash at the same frequency when running Thumb mode,
>>>> and it draws much less current.
>>>> The faster flash makes all the difference.
>>>> The LPC2xxx can offset this with a slightly higher clock rate,
>>>> but that will not make power consumption better.
>>>
>>> So many IFs, so little time. Benchmarking is an art, not a science.
>>> Best to run your app and see what is faster for your app.
>>>
>>> Rick
>>
>> If fast is the parameter you are looking for!
>> Many applications need a certain speed, but once it is there,
>> it will not use additional performance.
>>
>> You have a basic selection between speed and code size on the ARM7,
>> but with waitstates the lower memory use of the Thumb instruction set
>> can make it faster than the ARM instruction set.
>>
>
> I think it is interesting to look at the history of instruction sets.
> Long ago, there were two competing ideas - there were CISC designs with
> very varied instruction sets (typically in 8-bit parts), and RISC
> designs that were all consistent and wide (typically 32-bit). It turns out
> that both extremes were "wrong", and the most efficient modern
> instruction sets for small devices are 16-bit wide for most
> instructions, with some 32-bit (or 48-bit) for flexibility. Consistency
> and orthogonality of the architecture is important, but should not be
> taken to extremes. There is a lot to like about the Thumb2 set - I think
> it's a big improvement on the original ARM ISA.
>
> Of course, the 68000 designers at Motorola figured this out about 30
> years ago...
>
I consider the 68000 to be a nightmare compared to a truly orthogonal
machine like the Series 32000...
The Series 32000 did not fix the location of the instruction fields,
though, so you would get a messy instruction decoder.
The immediates were 7, 14 or 30 bits wide, with the size
encoded in either the top bit or the two top bits.
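That width-in-the-top-bits scheme can be sketched as a small decoder in C. The byte ordering and sign handling are simplified assumptions for illustration (values are treated as unsigned here); only the 0/10/11 top-bit selection of 7, 14 or 30 bits follows the text.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one variable-width immediate: the top bit(s) of the first
 * byte select the width.
 *   0xxxxxxx            -> 7-bit value,  1 byte
 *   10xxxxxx xxxxxxxx   -> 14-bit value, 2 bytes
 *   11xxxxxx + 3 bytes  -> 30-bit value, 4 bytes
 * Returns the value; *len receives the number of bytes consumed. */
static uint32_t decode_imm(const uint8_t *p, unsigned *len)
{
    if ((p[0] & 0x80) == 0) {
        *len = 1;
        return p[0] & 0x7F;
    }
    if ((p[0] & 0x40) == 0) {
        *len = 2;
        return ((uint32_t)(p[0] & 0x3F) << 8) | p[1];
    }
    *len = 4;
    return ((uint32_t)(p[0] & 0x3F) << 24) |
           ((uint32_t)p[1] << 16) | ((uint32_t)p[2] << 8) | p[3];
}
```

The cost the text complains about is visible here: the decoder cannot know where the next field starts until it has examined the first byte of this one.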
I think variable size with a fixed location of fields seems to be the
most efficient approach. The National CompactRISC was one of the first
implementations of this idea:
16/32/48 bit instructions, with "quick" 5 bit immediates
and reserved values that indicated that one or two extension words
followed the instruction.
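A sketch of that "quick" immediate scheme in C: most values fit in the 5-bit field, and reserved encodings say "the real value follows in extension words". The particular reserved codes chosen here are assumptions for illustration, not the actual CompactRISC encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed reserved codes in the 5-bit immediate field. */
#define EXT_ONE_WORD  0x1E   /* real value in one 16-bit extension word  */
#define EXT_TWO_WORDS 0x1F   /* real value in two 16-bit extension words */

/* Decode a quick immediate.  ext points at any extension words that
 * follow the 16-bit base instruction; *words receives how many were
 * consumed (0, 1 or 2, matching 16/32/48 bit total instruction size). */
static uint32_t decode_quick(uint8_t field5, const uint16_t *ext,
                             unsigned *words)
{
    if (field5 == EXT_ONE_WORD)  { *words = 1; return ext[0]; }
    if (field5 == EXT_TWO_WORDS) { *words = 2;
                                   return ((uint32_t)ext[0] << 16) | ext[1]; }
    *words = 0;
    return field5;   /* common case: the immediate is the field itself */
}
```

Because the immediate field sits at a fixed position, the decoder can start working on the rest of the instruction before it knows whether extension words follow.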
Internally, the decoder would decode the 16/32/48 bit
instruction directly for a simple pipeline.
For multi-clock functions like interrupts & exceptions
there were 7-8 state machines that could override the
instruction decoder.
Since the datapath was controlled by ~90 signals,
this turned out to be a significant part of the chip:
90 x (9->1 mux) + state logic...
It turned out that this could be simplified further.
The reason for the state machines was that you need to do
operations which are not supported by the instruction set.
When you enter an interrupt, you need to clear the interrupt flag.
This is really an (unsupported) instruction:
AND $IRQMASK, PSR, i.e. AND 0xFF7F, PSR
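Expressed in C, that internal AND 0xFF7F, PSR operation is just a mask of the status register on interrupt entry. The bit position follows from the 0xFF7F mask in the text; the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* The 0xFF7F mask clears bit 7 of the PSR, taken here to be the
 * interrupt-enable flag. */
#define PSR_I_BIT  (1u << 7)
#define IRQMASK    ((uint16_t)~PSR_I_BIT)     /* == 0xFF7F */

/* Interrupt entry: disable further interrupts by clearing the flag. */
static uint16_t enter_interrupt(uint16_t psr)
{
    return psr & IRQMASK;
}
```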
The problem we found was that you need to do operations
on registers which are not normally accessible, and you need
a few more instructions.
The problem was solved by extending each register address
from 4 to 6 bits, allowing all registers (including the PSR) in the CPU
to be directly addressable by all instructions operating on registers.
The opcode was extended by two bits, allowing more instructions
to be handled directly.
The normal instructions were extended from 16 bits to 22 bits,
but user code would only use the normal 16 bits.
If a multi-clock function was needed, the instruction decoder
was fed from a 22 bit wide ROM which ran for a few clock cycles.
The 8 x 90 bit state machines + the 90 x 9->1 multiplexers were replaced by
a 32-64 x 22 bit ROM and 22 x 2->1 muxes.
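The structure of that replacement can be sketched in C: a multi-clock function becomes a short sequence of 22-bit internal instructions read from a small ROM and fed to the ordinary decoder, one per clock. The micro-op encodings below are invented placeholders; only the shape - a small ROM of 22-bit entries replacing per-function state machines - follows the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t uinsn;              /* only the low 22 bits are used */

/* Invented internal sequence for interrupt entry: save state, then
 * the (unsupported) AND 0xFF7F, PSR to clear the interrupt flag. */
static const uinsn irq_entry_rom[] = {
    0x200123,                        /* push PC  (placeholder encoding) */
    0x200456,                        /* push PSR (placeholder encoding) */
    0x2FF7F9,                        /* AND 0xFF7F, PSR (placeholder)   */
};

/* Override the instruction decoder for n clocks, feeding it one 22-bit
 * micro-op per cycle from the ROM.  Returns the cycle count. */
static unsigned run_sequence(const uinsn *rom, size_t n)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++) {
        (void)(rom[i] & 0x3FFFFF);   /* 22-bit micro-op into the decoder */
        cycles++;
    }
    return cycles;
}
```

The win is that the sequencing logic is now data (ROM contents) rather than hand-built state machines, so adding or fixing a multi-clock function means editing a table.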
--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB