
New ARM Cortex Microcontroller Product Family from STMicroelectronics

Started by Bill Giovino June 18, 2007
<wilco.dijkstra@ntlworld.com> wrote in message
news:1182838751.331770.190430@p77g2000hsh.googlegroups.com...
> On 25 Jun, 17:12, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
>> > Put simply, the question is "Can you read data from code memory at a
>> > random address?".
>> >
>> > Wilco
>>
>> Any Harvard architecture can read data into registers using the immediate
>> addressing mode.
>>
>> getdata: execute label[r0:d]
>>          ret
>>
>> label:
>> label0: ld r0,#0
>> label1: ld r0,#1
>> label2: ld r0,#2
>> label3: ld r0,#3
>> label4: ld r0,#4
>> label5: ld r0,#5
>> label6: ld r0,#6
>>
>> Should work on a Harvard architecture without going to
>> a "modified Harvard" so the question is irrelevant to determine
>> Harvard/No Harvard.
>
> So how does this read the *data* contained in the code memory? I'd
> like to read the bitpatterns of the instructions, not execute them.
>
> Wilco
The data is embedded inside the instruction, and gets loaded
into R0, using immediate addressing mode.

I bet there is no Harvard architecture that does not allow the use of
constants embedded in instructions; so if you accept that, then you also
accept that there is a path from the instruction bus to the internal
registers without having to connect the data bus.

What you "like" is really irrelevant for the discussion of whether moving
data from the instruction bus to the ALU deviates from original Harvard
or not.

If the microarchitecture connects the instruction memory to the ALU using
an internal mux in the CPU core, it is original Harvard. If it connects
the data bus to the instruction memory to fetch data, it isn't.

-- 
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
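(In C terms, the trick described above looks something like the sketch
below: each "data" byte exists only as an immediate operand inside an
instruction, so even a pure Harvard machine with no code-to-data bus can
hand it back in a register. The function name and table values here are
illustrative, not from either poster.)

/* Minimal sketch: "data" encoded as instruction immediates. */
unsigned char getdata(unsigned index)
{
    switch (index) {          /* computed dispatch into code memory    */
    case 0:  return 0x00;     /* each constant travels as an immediate */
    case 1:  return 0x01;
    case 2:  return 0x02;
    case 3:  return 0x03;
    default: return 0xFF;     /* out-of-range index */
    }
}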
On 26 Jun, 11:14, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> <wilco.dijks...@ntlworld.com> wrote in message news:1182838751.331770.190430@p77g2000hsh.googlegroups.com...
> ...snip...
> > So how does this read the *data* contained in the code memory? I'd
> > like to read the bitpatterns of the instructions, not execute them.
>
> The data is embedded inside the instruction, and gets loaded
> into R0, using immediate addressing mode.
That still doesn't read the data. If I have 1KB of data, can you encode it
in 1KB of code? If the answer is no, then they are not equivalent.
> I bet there is no Harvard architecture that does not allow the use of
> constants embedded in instructions; so if you accept that, then you also
> accept that there is a path from the instruction bus to the internal
> registers without having to connect the data bus.
Sure, this path always exists for immediates. But it can't read an arbitrary
byte from the instruction memory; it can only deliver immediates. You need
special instructions like MOVC to read actual data from a random address.
The key "feature" of the original Harvards was that they could not treat
instructions as data, unlike a von Neumann machine.
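(For concreteness, this is how the MOVC path usually surfaces from C; a
sketch in the Keil C51 dialect, with made-up table contents.)

/* The "code" qualifier places the table in code space, so reading
   table[i] makes the compiler emit a MOVC-based access - i.e. the
   code -> accumulator path discussed here. */
unsigned char code table[4] = { 0x12, 0x34, 0x56, 0x78 };

unsigned char getbyte(unsigned char i)
{
    return table[i];          /* MOVC A,@A+DPTR under the hood */
}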
> What you "like" is really irrelevant for the discussion > on whether moving from instruction bus to ALU is > deviating from original harvard or not. > > If the microarchitecture connects the instruction memory > to the ALU using an internal mux in the CPU core it is original harvard. > If it connect the databus to the instruction memory to > fetch data, it isn't.
I don't see why you'd want to separate them based on how the buses are
implemented; this is a microarchitecture detail. There are various options,
including a mux inside the core, two separate buses with a mux to unify
them, or an independent bus used only for code->data. Each of these options
behaves the same from a programmer's perspective, so there is no obvious
way to differentiate them.

Wilco
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message news:467a3dcd$1@clear.net.nz...
> Wilco Dijkstra wrote:
>
>> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message news:4679cb90$1@clear.net.nz...
>>
>>> Wilco Dijkstra wrote:
>>>
>>>>> Yes, MOVC is only for table lookup - In 80C51 there is no write to
>>>>> code memory, and no read code from ram, without special HW bridge steps.
>>>>
>>>> So you agree table lookup can read code memory. That means there is a
>>>> connection (at least a uni-directional bus) between code and data memories,
>>>> so it is not at all like the initial Harvards (which did not have such a connection).
>>>
>>> Not quite, all it means is a connection between code and the accumulator.
>>> The std 80C51 has no Code-Data opcodes, and only one Code-ACC opcode.
>>
>> There are various ways of implementing such a bus, including reusing the
>> fetch bus if you only need read-only access. That doesn't make it any less of
>> a connection.
>
> You've lost me. You have claimed a "connection between code and data
> memories", and unless you somehow claim the accumulator is DATA memory,
> there is no such link. The 80C51 does not have a MOVC Data,@A+PC opcode;
> it only has the single MOVC A,@A+PC opcode. Code -> ACC. From there,
> you have to use other opcodes to move ACC to various Data memory choices.
> There are many more Acc -> Data Space opcodes.
Copying data between different memories always involves some kind of temporary
buffer, usually a register. Whether there is an instruction that can do it in one go is
irrelevant; the implementation would sequence it into a load and a store anyway.

Put simply, the question is: "can I read data from a random address in code memory?".
If the answer is yes, there is a connection between the code and data memories, even
if the path goes via the accumulator. If so, you can implement general pointers in C
(albeit not very efficiently). If not, you cannot do general C pointers at all and you need
to transform all constant data into executable code (very inefficient).

Wilco
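(Phrased as C, that litmus test looks like the sketch below; the table and
function names are illustrative.)

/* Can an ordinary pointer read bytes that happen to live in code
   memory? On a single-address-space machine this just works; on a
   strict Harvard part the compiler needs a special code-read
   instruction (or memory-qualified pointers) to make it work. */
const unsigned char table[4] = { 1, 2, 3, 4 };  /* typically in flash */

unsigned sum(const unsigned char *p, unsigned n)
{
    unsigned s = 0;
    while (n--)
        s += *p++;            /* random-address reads into code memory */
    return s;
}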
On Wed, 04 Jul 2007 01:38:20 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

> >"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message news:467a3dcd$1@clear.net.nz... >> Wilco Dijkstra wrote: >> >>> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message news:4679cb90$1@clear.net.nz... >>> >>>>Wilco Dijkstra wrote: >>>> >>>>>>Yes, MOVC is only for table lookup - In 80C51 there is no write to code memory, and no read code from ram, without >>>>>>special HW bridge steps. >>>>> >>>>> >>>>>So you agree table lookup can read code memory. That means there is a >>>>>connection (at least an uni-directional bus) between code and data memories, >>>>>so it is not at all like the initial Harvards (which did not have such a connection). >>>> >>>>Not quite, all it means is a connection between code and the accumulator. The std 80C51 has no Code-Data opcodes, and >>>>only one Code-ACC opcode. >>> >>> >>> There are various ways of implementing such a bus, including reusing the >>> fetch bus if you only need read-only access. That doesn't make it any less of >>> a connection. >> >> You've lost me. You have claimed a "conection between code and data memories", and unless you somehow claim the >> accumulator is DATA memory, there is no such link. The 80C51 does not have a MOVC Data,@A+PC opcode >> it only has the single MOVC A,@A+PC opcode. Code -> ACC. From there, >> you have to use other opcodes to move ACC to various Data memory choices. There are many more Acc-> Data Space >> opcodes. > >Copying data between different memories always involves some kind of temporary >buffer, usually a register. Whether there is an instruction that can do it in one go is >irrelevant, the implementation would sequence it into a load and a store anyway. > >Put simply, the question is: "can I read data from a random address in code memory?". >If the answer is yes, there is a connection between the code and data memories, even >if the path goes via the accumulator. If so, you can implement general pointers in C >(albeit not very efficiently). If not, you cannot do general C pointers at all and you need >to transform all constant data into executable code (very inefficient).
I seem to recall you saying that "pure Harvard" is the case where this is
not allowed (code space cannot be read as data, at all) and also saying
that such beasts aren't found today. Doesn't this suggest to you that you
are defending a distinction without a difference?

When I think of a processor as being "Harvard," several meaningful modern
distinctions come to mind. One is more a matter of software, where I am
mentally aware that (1) I probably cannot run code from RAM, should I want
to do so -- for example, because I'd like to program the code flash and
there is only one memory controller for the code flash (no separate blocks,
for example) -- and (2) the depth and breadth of the data access methods
coded in the instruction space should not be construed to indicate anything
about the code access methods.

Two is more a matter of hardware, where I'm mentally aware that there may
be (often are) separate data buses and address lines; and if neither of
these, then control lines indicating separate spaces, to be considered
together with any intended software -- for example, I may _want_ to subvert
the architecture and overlay the memories. And if there is no external bus,
I may still simply have to take that into account in considering data
structure design.

A distinction matters, in other words. And I don't find much value in your
conflation of architectures which have a single memory viewpoint, where
code and data share a common address space and all the memory addressing
mechanisms can be used for reading either code or data, with architectures
which have several memories with material differences in their access from
an instruction point of view. It seems to me that you think there is little
meaningful difference, so you just lump them together. I find that kind of
conflation to be without useful discerning value.

Jon
If I can add a few cents on this: the term "Harvard" architecture did
originally refer to two completely separate buses (with overlapping address
spaces), but like the term RISC, it has been overloaded and adapted over the
years. With big (32-bit) registers, the need to reuse addresses just is
not there (unlike the 12-bit addresses of old, or the 16-bit addresses of
DSPs and 8-bit processors).

The true Harvard devices have to add special move instructions which
re-interpret the address in the register, since 0x100 (say) will point to
two different locations: one in instruction memory and one in data memory.
In some DSPs, this also means different-sized buses. The advantage was that
b 0x100 clearly meant the instruction bus and load R0,[0x100] clearly meant
data. The bad thing was when you needed to read data from the code bus
(constants and literals) or run from RAM (such as for external boot).
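(In compiler terms, the overlap means a pointer must carry its target
space. A sketch in the Keil C51 dialect, continuing the 0x100 example; the
qualifiers are real C51 keywords, the addresses illustrative.)

/* The same numeric address names two physically different cells. */
unsigned char code  *pc = (unsigned char code  *)0x100;  /* code space */
unsigned char xdata *pd = (unsigned char xdata *)0x100;  /* data space */
/* *pc and *pd read different memories even though both "point to" 0x100. */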

For Cortex-M3, the term Harvard is still in reference to two buses (that
can operate in parallel), but for convenience, the address mapping is such
that they do not overlap. So, addresses 0 to 1/2GB are mapped to one bus,
and addresses 1/2GB on up to one or more other buses (system and
RAM/peripheral). The processor can take advantage of this by pre-fetching
instructions while also reading/writing memory through the LSU. If you run
code from RAM or you read/write memory from the code space, an arbiter
simply prioritizes contending operations (favoring data over pre-fetch,
since it is just pre-fetch). Running code from RAM does not run at half
speed since LSU operations usually make up less than 1/5th of the
instruction stream. Further, Thumb-2 has many 16-bit instructions, so one
pre-fetched word can serve two instructions.
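(The split is visible in the address alone, as in this sketch; the
0x20000000 boundary is the ARMv7-M code/SRAM region border, while the
helper function itself is illustrative.)

#include <stdint.h>

/* Low 0.5GB (0x00000000-0x1FFFFFFF) is the code region, fetched over
   the code bus; everything from 0x20000000 up goes over the system bus. */
static int on_code_bus(uint32_t addr)
{
    return addr < 0x20000000u;
}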

The value over ARM7 and the like is seen simply because code and data are
not normally competing for the same bus, so LSU operations are fast. Further,
the LSU can pipeline back-to-back loads and stores (as well as load-multiple
and store-multiple). Finally, a store buffer hides the wait time for a store
to complete (unless another load or store comes along too soon).

The other big advantage is for interrupt processing. On an interrupt, the
pre-fetcher is loading the instructions of the ISR while the LSU is
handling stacking (saving return_PC, xPSR, LR, R0-R3, and R12). Because
they are running in parallel, the user's first instruction is executed that
much faster. And note, the handler is a normal C function, since the 5
scratch registers have been saved and the LR properly set up. On return
from exception, if not tail-chaining, it can use the same trick of loading
code while popping the stack. This gets back to the interrupted code that
much faster.
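(In practice this means an M3 handler is just an ordinary C function whose
address sits in the vector table, roughly as below; the handler and
variable names are illustrative, not from any vendor header.)

volatile unsigned tick;        /* state shared with the handler */

/* No assembly wrapper or special keyword needed: the hardware already
   stacked R0-R3, R12, LR, PC and xPSR on entry, as described above. */
void timer_irq_handler(void)
{
    tick++;
}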

The original term was "Harvard style" architecture. I think it is fair to
say that it avoids the nastiness of true Harvard, since you can address
any location (0 to 2^32-1) for any purpose from any instruction (other
than range limitations).

Hope this clarifies. Paul


On Jun 24, 11:45 am, wilco.dijks...@ntlworld.com wrote:
> On 23 Jun, 03:10, rickman <gnu...@gmail.com> wrote:
> > I don't follow what you are saying at all. Branch prediction relates
> > to pipelining. I don't see how it relates to wait states.
>
> Adding a wait state is the same as increasing the pipeline depth, and branch
> prediction coupled with prefetching can hide some of that latency.
I don't see how that is true at all. When you add a waitstate you freeze all stages of the pipeline while you wait for the Flash to finish the access.
> You have to increase the fetch width if you add waitstates, that's a given.
> The M3 TRM recommends 64-bit flash fetch. While this allows straightline
> code to run at full speed, branches are still slow. What I meant is that
> M3 has branch optimizations that reduce this slowdown.
I don't have the option of increasing the width of the Flash. Besides, your
statement is just plain wrong. If you add waitstates, the Flash width is not
relevant. If you increase the Flash width, you can use fewer wait states,
but that is entirely different from what you are saying. What did you mean
to say?
> power consumption at lower speeds. The specs showed various settings that
> use significantly less than 9mA below 8MHz. So I don't think it really
> burns 9mA at 0MHz (which is what I think you mean with Y intercept, right?).
Sure, if you want to turn off the flash entirely you can get the power down,
or if you want to stop the clocks you can get the power down. My point is
that under normal operating conditions, the part has a hefty static power
consumption, so running at half speed does not get you near half current.
Still, it is a lot better than the Luminary parts, but nowhere near the
Atmel ARM7 parts.
> The ST specs list some numbers with peripherals off, and that is less than
> half the normal current, 21mA from flash at 72MHz IIRC - pretty good.
I don't see that figure anywhere. On what page did you read this?
Regardless, if they are doing things to reduce power consumption, like
running from RAM, then the comparison is still not apples to apples. It
just depends on what you want to compare.
> > How do you support the claim that the M3 runs twice as fast as the
> > SAM7 at the same frequency??? Maybe I don't want to know...
>
> Because I've benchmarked it myself?
Ok, has anyone else on the planet published similar results? I have not even heard anyone else make this claim, much less be able to support it.
> You seem to forget that the M3 was designed from the ground up to be an
> efficient MCU, while ARM7 wasn't:
I'm not even considering that. I am reporting what I have read as measured results. But then I don't know of any published benchmarks. I guess that is what is required.
> > Yes, but that is a small delta compared to adding waitstates with a 2x
> > or 3x reduction in performance and therefore the same effect on power
> > efficiency.
>
> Not at all. Adding waitstates doesn't slow you down by that much as you
> make the memory wider (not doing that makes no sense at all, so I do not
> consider that a valid option). But instruction set and microarchitecture
> differences can easily make a factor of 2 difference, as shown above.
How do you make the memory wider? I would love to be able to do that with a lot of the MCU chips I have used in the past. Can you use this Flash stretcher tool on Atmel parts as well? Then they could run at 55 MHz with no wait states!!!
"rickman" <gnuarm@gmail.com> skrev i meddelandet 
news:1183651600.912254.284760@o61g2000hsh.googlegroups.com...
> On Jun 24, 11:45 am, wilco.dijks...@ntlworld.com wrote:
>> On 23 Jun, 03:10, rickman <gnu...@gmail.com> wrote:
>> > I don't follow what you are saying at all. Branch prediction relates
>> > to pipelining. I don't see how it relates to wait states.
>>
>> Adding a wait state is the same as increasing the pipeline depth, and branch
>> prediction coupled with prefetching can hide some of that latency.
>
> I don't see how that is true at all. When you add a waitstate you
> freeze all stages of the pipeline while you wait for the Flash to
> finish the access.
I don't know exactly how the Cortex works, but I worked on the internals
of another 32-bit RISC core.

This core had a 16-byte FIFO in the first pipeline stage.
The prefetch mechanism loaded 32 bits into this FIFO on each access.
The memory controller could add waitstates to this access if necessary.

The first pipeline stage did a simple decoding of the top halfword of the
FIFO to determine the length of the instruction (1-3 halfwords), and if the
FIFO had enough valid content, the full instruction was made available to
the second decoding stage; otherwise a "not valid" signal was asserted.
The second stage would either execute the instruction, reading 1-3
halfwords from the FIFO, or, if "not valid" was asserted, execute a NOP
instruction.

Since most instructions are 16 bits, and you read 32 bits at a time,
zero-waitstate operation allows fetching almost two instructions per cycle.
The FIFO will quite soon be filled, and if the odd 32/48-bit instruction
pops up, it won't hurt your performance.

If you have one waitstate, you will see that the bandwidth is still high
enough that 1 MIPS/MHz can be maintained as long as you only
execute 16-bit instructions. You will be hurt by fetching a 32-bit
instruction, since that takes 2 clocks.

I have run the SAM7 at 48 MHz, zero waitstate. It does not work over the full
temp range though.
The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz
due to its 33 MHz 2-way interleaved flash memory.
(The 1st access after a jump is two clocks; subsequent accesses are 1 clock.)

-- 
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
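(The 1 MIPS/MHz figure is easy to sanity-check from the numbers given; a
back-of-envelope sketch, assuming 32-bit fetches, one wait state, and a
purely 16-bit instruction stream.)

#include <stdio.h>

int main(void)
{
    const double fetch_bits       = 32.0;  /* bits per flash access   */
    const double clocks_per_fetch = 2.0;   /* 1 access + 1 wait state */
    const double insn_bits        = 16.0;  /* 16-bit instructions     */

    /* 16 bits/clock = one 16-bit instruction per clock = ~1 MIPS/MHz */
    printf("sustained instructions per clock: %.2f\n",
           (fetch_bits / clocks_per_fetch) / insn_bits);
    return 0;                              /* prints 1.00 */
}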
On Jul 5, 4:30 pm, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> "rickman" <gnu...@gmail.com> skrev i meddelandetnews:1183651600.912254.284760@o61g2000hsh.googlegroups.com... > > > On Jun 24, 11:45 am, wilco.dijks...@ntlworld.com wrote: > >> On 23 Jun, 03:10, rickman <gnu...@gmail.com> wrote: > >> > I don't follow what you are saying at all. Branch prediction relates > >> > to pipelining. I don't see how it relates to wait states. > > >> Adding a wait state is the same as increasing the pipeline depth, and > >> branch > >> prediction coupled with prefetching can hide some of that latency. > > > I don't see how that is true at all. When you add a waitstate you > > freeze all stages of the pipeline while you wait for the Flash to > > finish the access. > > I don't know exactly how the Cortex work, but I worked on the internals > of another 32 bit RISC core.
Thanks for the rundown on this alternative CPU. Sounds a bit like the National 32 bit CPU with variable length instructions. That was supposed to be a fast CPU, but not a commercial success. If there had been a longer term commitment, it may have grown in popularity. But the realities of the commercial CPU market allowed it to pass on to the CPU boneyard.
> This core had a 16-byte FIFO in the first pipeline stage.
> The prefetch mechanism loaded 32 bits into this FIFO on each access.
> The memory controller could add waitstates to this access if necessary.
...snip...
>
> Since most instructions are 16 bits, and you read 32 bits at a time,
> zero-waitstate operation allows fetching almost two instructions per cycle.
> The FIFO will quite soon be filled, and if the odd 32/48-bit instruction
> pops up, it won't hurt your performance.
No, the "odd" 48 bit instruction won't hurt performance, but the FIFO already has had a negative influence anytime the instruction sequence is not linear. It is, in terms of the negative effect, like adding pipeline stages. The entire FIFO has to be flushed anytime you branch.
> If you have one waitstate, you will see that the bandwidth is still high
> enough that 1 MIPS/MHz can be maintained as long as you only
> execute 16-bit instructions. You will be hurt by fetching a 32-bit
> instruction, since that takes 2 clocks.
Even executing 16-bit instructions takes a one-clock-cycle hit on a
branch. Instead of having the next instruction in the FIFO, you have to
wait 2 clock cycles before you can start decoding it.
> I have run the SAM7 at 48 MHz, zero waitstate. It does not work over the full
> temp range though.
> The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz
> due to its 33 MHz 2-way interleaved flash memory.
> (The 1st access after a jump is two clocks; subsequent accesses are 1 clock.)
How does that compare to the Cortex M3 running at 50 MHz with no waitstates and no branch penalty?

> ...snip...
>
> Thanks for the rundown on this alternative CPU. Sounds a bit like the
> National 32 bit CPU with variable length instructions. That was
> supposed to be a fast CPU, but not a commercial success. If there had
> been a longer term commitment, it may have grown in popularity. But
> the realities of the commercial CPU market allowed it to pass on to
> the CPU boneyard.
It was not a Series 32000 CPU. The Series 32000 had (IIRC) instruction
sizes varying between 2 and 21 bytes, e.g.

movzbd x(y(sp))[r0:d], z(w(sb))[r1:d]

with all displacements being 30 bits.
> ...snip...
>
> No, the "odd" 48 bit instruction won't hurt performance, but the FIFO
> already has had a negative influence anytime the instruction sequence
> is not linear. It is, in terms of the negative effect, like adding
> pipeline stages. The entire FIFO has to be flushed anytime you branch.
The FIFO is implemented using flip-flops, and you had a
simple three-stage pipeline (fetch, decode, execute), so
your latency was not dramatic.
> ...snip...
>
> Even executing 16-bit instructions takes a one-clock-cycle hit on a
> branch. Instead of having the next instruction in the FIFO, you have to
> wait 2 clock cycles before you can start decoding it.
Yes, but jumps are probably only 10-20% of all instructions,
so you lose only 10-20% of the performance instead of 50%.
The AVR32 loses less than 10% on average.
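(A quick check of that figure, assuming one extra clock per branch and one
clock per instruction otherwise, so CPI = 1 + b and the loss is b/(1+b).)

#include <stdio.h>

int main(void)
{
    const double b[2] = { 0.10, 0.20 };    /* assumed branch fractions */
    int i;

    for (i = 0; i < 2; i++)
        printf("branch fraction %.2f -> %.1f%% lost\n",
               b[i], 100.0 * b[i] / (1.0 + b[i]));
    return 0;                              /* 9.1% and 16.7% */
}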
> ...snip...
>
> How does that compare to the Cortex M3 running at 50 MHz with no
> waitstates and no branch penalty?
The UC3000 is claimed at 80 MIPS at 66 MHz.
For the Cortex M3 to reach 80 MIPS at 50 MHz,
it would have to achieve 80/50 = 1.6 MIPS per MHz.
I think that ARM does not claim that the Cortex is close to 1.6 MIPS per MHz.

The AVR32 is decidedly better on DSP algorithms due to its
single-cycle MAC, and it also has faster access to SRAM.
Reading internal SRAM is a one-clock-cycle operation on the AVR32.
Bit banging will be one of the strengths of the UC3000.

-- 
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
On Jul 6, 10:51 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> The FIFO is implemented using flip-flops, and you had a
> simple three-stage pipeline (fetch, decode, execute), so
> your latency was not dramatic.
That is not the point. By prefetching the instructions, you are setting up for a bigger dump and subsequent loss of instruction memory bandwidth when you branch. FIFOs or instruction prefetching are not a perfect solution. It is much better to just have single cycle memory.
> >> If you have one waitstate, you will see that the bandwidth is still high
>
> Yes, but jumps are probably only 10-20% of all instructions,
> so you lose only 10-20% of the performance instead of 50%.
> The AVR32 loses less than 10% on average.
But you are comparing apples and oranges. A processor that has no wait states doesn't have to deal with this no matter what the instruction mix is. It is just much simpler to not have to consider memory latencies.
> >> I have run the SAM7 at 48 MHz, zero waitstate. It does not work over the full
> >> temp range though.
> >> The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz
> >> due to its 33 MHz 2-way interleaved flash memory.
> >> (The 1st access after a jump is two clocks; subsequent accesses are 1 clock.)
>
> > How does that compare to the Cortex M3 running at 50 MHz with no
> > waitstates and no branch penalty?
>
> The UC3000 is claimed at 80 MIPS at 66 MHz.
> For the Cortex M3 to reach 80 MIPS at 50 MHz,
> it would have to achieve 80/50 = 1.6 MIPS per MHz.
> I think that ARM does not claim that the Cortex is close to 1.6 MIPS per MHz.
Oh, this is marketing stuff. I thought you might have run some real benchmarks or someone else at Atmel might have. Certainly they have looked hard at the Cortex. But if it competes too well against the AVR32, I can see why it would not be pushed at Atmel. Certainly there will be a lot of sockets that will be won by an ARM device over a sole source part like the AVR32. At this point I don't think anyone can say whether the AVR32 has legs and will be around in 5 years. It has been out for what, a year or so?
> The AVR32 is decidedly better on DSP algorithms due to its
> single-cycle MAC, and it also has faster access to SRAM.
> Reading internal SRAM is a one-clock-cycle operation on the AVR32.
> Bit banging will be one of the strengths of the UC3000.
Isn't reading internal SRAM a single cycle on *all* processors? I can't think of any that require wait states. In fact, most processors try to cram as much SRAM onto the chip as possible because it is so fast. Did you say what you meant to say?
