
New ARM Cortex Microcontroller Product Family from STMicroelectronics

Started by Bill Giovino June 18, 2007
"rickman" <gnuarm@gmail.com> skrev i meddelandet 
news:1183843932.123792.195400@n60g2000hse.googlegroups.com...
> On Jul 6, 10:51 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
>> The FIFO is implemented using Flip-Flops and you had a
>> simple three stage pipeline (fetch, decode, execute) so
>> your latency was not dramatic.
>
> That is not the point. By prefetching the instructions, you are
> setting up for a bigger dump and subsequent loss of instruction memory
> bandwidth when you branch. FIFOs or instruction prefetching are not a
> perfect solution. It is much better to just have single cycle
> memory.
Actually it is not, because if you try to decode your instruction in the same stage as the fetch, your clock frequency will go down significantly. The prefetching will work with single cycle memory and with memory having waitstates. Prefetching, decoding and execution will all take one clock.

If you execute at 66 MHz with a three stage pipeline, then you will probably execute at around ~40 MHz with a two stage pipeline (just a guess). If you execute blocks of 5 instructions including one jump, each block will use 7 cycles (3 + 1 + 1 + 1 + 1) @ 66 MHz in a three stage pipeline, for ~10 blocks / us. In a two stage pipeline, you could use 2 clocks for a jump, so you execute (2 + 1 + 1 + 1 + 1) @ 40 MHz, which is 6,5 blocks / us, clearly slower.
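[As a back-of-the-envelope check of the block arithmetic above - a sketch only, in C, using the figures quoted in the post (5-instruction blocks, a 3- or 2-cycle branch cost, 66 MHz vs 40 MHz), not a model of any real core:]

    #include <stdio.h>

    /* A "block" is 5 instructions, one of which is a taken jump.
     * The jump costs the branch/refill cycles, the other four
     * instructions cost one cycle each. */
    static double blocks_per_us(double clk_mhz, int branch_cycles)
    {
        int cycles_per_block = branch_cycles + 4;  /* e.g. 3+1+1+1+1 = 7 */
        return clk_mhz / cycles_per_block;         /* MHz = cycles per us */
    }

    int main(void)
    {
        printf("3-stage @ 66 MHz: %.1f blocks/us\n", blocks_per_us(66.0, 3)); /* ~9.4  */
        printf("2-stage @ 40 MHz: %.1f blocks/us\n", blocks_per_us(40.0, 2)); /* ~6.7  */
        return 0;
    }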
> > >> >> If you have one waitstate, you will see that the bandwidth is still >> >> high >> Yes, but if the jumps are probably only 10-20% of all instructions >> so you lose only between 10-20% of the performance instead of 50%. >> The AVR32 loses less than 10% in average. > > But you are comparing apples and oranges. A processor that has no > wait states doesn't have to deal with this no matter what the > instruction mix is. It is just much simpler to not have to consider > memory latencies. >
A processor running from flash without waitstates will be limited in performance by the memory. A processor which reads multiple instructions per (wait-stated) access will be able to execute faster, due to its higher bandwidth to memory.
> >> >> I have run the SAM7 at 48 MHz, zero waitstate. Does not work over the >> >> full >> >> temp range though. >> >> The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz >> >> due to its 33 MHz 2 way interleaved flash memory. >> >> (1st access after jump is two clocks, subsucquent accesses are 1 >> >> clock) >> >> > How does that compare to the Cortex M3 running at 50 MHz with no >> > waitstates and no branch penalty? >> >> The UC3000 is claimed as 80 MIPS at 66 MHz. >> For the Cortex M3 to reach 80 MIPS at 50 MHz, >> you have to have 80/50 = 1,6 MIPS per MHz. >> I think that ARM does not claim that the Cortex is close to 1,6 MIPS per >> MHz. > > Oh, this is marketing stuff. I thought you might have run some real > benchmarks or someone else at Atmel might have.
They have run benchmarks on the AVR32, but I think people are relying on official figures for the Cortex.
> Certainly they have > looked hard at the Cortex. But if it competes too well against the > AVR32, I can see why it would not be pushed at Atmel. > Certainly there > will be a lot of sockets that will be won by an ARM device over a sole > source part like the AVR32.
And hopefully ARM device from Atmel :-)
> At this point I don't think anyone can > say whether the AVR32 has legs and will be around in 5 years. It has > been out for what, a year or so? >
Fortunately there are plenty of sockets around, and some will go AVR32.
> >> The AVR32 is decidedly better on DSP algorithms due to its >> single cycle MAC and also it has faster access to SRAM. >> Reading internal SRAM is a one clock cycle operation on the AVR32. >> Bit banging will be one of the strengths of the UC3000. > > Isn't reading internal SRAM a single cycle on *all* processors? I > can't think of any that require wait states. In fact, most processors > try to cram as much SRAM onto the chip as possible because it is so > fast. Did you say what you meant to say? >
On the UC3000 family, loading from internal SRAM will take one clock in the execution stage. Using single cycle SRAM does not mean that the load instruction is 1 clock.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not be shared by my employer Atmel Nordic AB
rickman wrote:
> On Jul 6, 10:51 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote: > >>The FIFO is implemented using Flip-Flops and you had a >>simple three stage pipeline (fetch, decode,execute) so >>your latency was not dramatic. > > > That is not the point. By prefetching the instructions, you are > setting up for a bigger dump and subsequent loss of instruction memory > bandwidth when you branch. FIFOs or instruction prefetching are not a > perfect solution. It is much better to just have single cycle > memory. > > > >>>>If you have one waitstate, you will see that the bandwidth is still high >> >>Yes, but if the jumps are probably only 10-20% of all instructions >>so you lose only between 10-20% of the performance instead of 50%. >>The AVR32 loses less than 10% in average. > > > But you are comparing apples and oranges. A processor that has no > wait states doesn't have to deal with this no matter what the > instruction mix is. It is just much simpler to not have to consider > memory latencies.
Of course, yes, it "is much better to just have single cycle memory" - but in the real world, chip designers have to settle for what they can get, and right now FLASH access speed is a real bottleneck on uC performance. The width of the FLASH access (or interleave) can have MORE impact on final speed than any subtlety in the core itself.
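[To put rough, purely generic numbers on the width point - these are illustrative figures, not taken from any particular datasheet - fetch bandwidth scales linearly with flash access width for a fixed flash access rate:]

    #include <stdio.h>

    /* Instruction fetch bandwidth vs flash access width, assuming a
     * fixed raw flash access rate and 16-bit (Thumb-style) opcodes.
     * Numbers are generic and for illustration only. */
    int main(void)
    {
        double flash_mhz = 33.0;            /* raw flash access rate   */
        int widths_bits[] = { 32, 64, 128 };
        int insn_bits = 16;                 /* 16-bit instructions     */

        for (int i = 0; i < 3; i++) {
            double insn_per_s = flash_mhz * 1e6 * widths_bits[i] / insn_bits;
            printf("%3d-bit flash @ %.0f MHz: %.0f M instructions/s of fetch bandwidth\n",
                   widths_bits[i], flash_mhz, insn_per_s / 1e6);
        }
        return 0;
    }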
> >>>>I have run the SAM7 at 48 MHz, zero waitstate. Does not work over the >>>>full >>>>temp range though. >>>>The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz >>>>due to its 33 MHz 2 way interleaved flash memory. >>>>(1st access after jump is two clocks, subsucquent accesses are 1 clock) >> >>>How does that compare to the Cortex M3 running at 50 MHz with no >>>waitstates and no branch penalty? >> >>The UC3000 is claimed as 80 MIPS at 66 MHz. >>For the Cortex M3 to reach 80 MIPS at 50 MHz, >>you have to have 80/50 = 1,6 MIPS per MHz. >>I think that ARM does not claim that the Cortex is close to 1,6 MIPS per >>MHz. > > > Oh, this is marketing stuff. I thought you might have run some real > benchmarks or someone else at Atmel might have. Certainly they have > looked hard at the Cortex. But if it competes too well against the > AVR32, I can see why it would not be pushed at Atmel. Certainly there > will be a lot of sockets that will be won by an ARM device over a sole > source part like the AVR32. At this point I don't think anyone can > say whether the AVR32 has legs and will be around in 5 years. It has > been out for what, a year or so?
You can say (almost) the same for the Cortex M3? It too is quite new, and I've not seen any multi-sourced (pin/peripheral compatible) offerings. Will it hit 'critical mass'?

From a porting viewpoint, an Atmel ARM7 user could find it less of a jump to go to AVR32 (or the coming Atmel Flash ARM9's) than to Cortex M3, as the Atmel peripherals are very similar. The AVR32 I see as having a long life; it seems to have low cost tool flows and good debug support. (Don't underestimate the importance of good debug support.)

The actual uC cores matter less and less: package and peripherals have determined our shortlists in the latest projects - and the ST Cortex even made it onto the list, on that basis, until we found their serious oops, that CAN and USB were mutually exclusive ?!?

Then, there is the new Coldfire V1 core from Freescale. Choices, choices....

-jg
Ulf Samuelsson wrote:
> "rickman" <gnuarm@gmail.com> skrev i meddelandet > > That is not the point. By prefetching the instructions, you are > > setting up for a bigger dump and subsequent loss of instruction memory > > bandwidth when you branch. FIFOs or instruction prefetching are not a > > perfect solution. It is much better to just have single cycle > > memory. > > Actually it is not, because if you try to decode your instruction > in the same stage as the decoding, your clock frequency will > go down significantly. > The prefetching will work with single cycle memory and with > memory having waitstates.
What are you talking about??? How is slow memory faster than fast memory???
> Prefetching, decoding and execution, all will take one clock. > If you execute at 66 MHz with a three stage pipeline > then you probably will execute around ~40 MHz with > a two stage pipeline (Just a guess). > > If you execute blocks of 5 instruction including one jump, > each block will use 7 cycles (3 + 1 + 1 + 1 + 1) @ 66 Mhz > in a three stage pipeline for ~ 10 blocks / us. > > In a two stage pipeline, you could use 2 clocks for a jump > so you execute (2 + 1 + 1 + 1 + 1) @ 40 MHz > which is 6,5 blocks / us, clearly slower.
Since when do I get to design my own processor??? Everything you have just written is based on your own assumptions. This is a pointless discussion since everything you say is based on *your* assumptions! In addition, you only consider the parts of the issue that you choose to include. You did a timing analysis on paper that does not include the effect of branches. Clearly not accurate regardless of your assumptions!
> > But you are comparing apples and oranges. A processor that has no > > wait states doesn't have to deal with this no matter what the > > instruction mix is. It is just much simpler to not have to consider > > memory latencies. > > > > A processor running from flash without wait states will be limited > in performance by the memory. > A processor which reads multiple instructions with wait state > will be able to execute faster due to its higher bandwidth to memory.
Again you are assuming facts that are not in evidence. Where do you get the higher bandwidth from memory if it is running with wait states? Oh, right, you are *assuming* that there is something different in the design that will make that one faster. Something that is not part of a slower Flash that requires wait states.
> >> The UC3000 is claimed as 80 MIPS at 66 MHz. > >> For the Cortex M3 to reach 80 MIPS at 50 MHz, > >> you have to have 80/50 = 1,6 MIPS per MHz. > >> I think that ARM does not claim that the Cortex is close to 1,6 MIPS per > >> MHz. > > > > Oh, this is marketing stuff. I thought you might have run some real > > benchmarks or someone else at Atmel might have. > > They have run benchmarks on the AVR32, but I think people are relying > on official figures for the Cortex.
"People" being "you"?
> > Certainly they have > > looked hard at the Cortex. But if it competes too well against the > > AVR32, I can see why it would not be pushed at Atmel. > > Certainly there > > will be a lot of sockets that will be won by an ARM device over a sole > > source part like the AVR32. > > And hopefully ARM device from Atmel :-)
There are a number of sockets that Atmel won't win if they don't have a CM3 device. There are two companies with the new core in production and a third on their heels. I am sure sales of the ARM7 devices won't drop off a cliff. But this business is all about design wins and I stand by my earlier post in another thread that the CM3 will start to steal significant numbers of design wins by the end of this year and by the end of next year they will overshadow the ARM7 design wins in the off the shelf MCU market.
> > At this point I don't think anyone can > > say whether the AVR32 has legs and will be around in 5 years. It has > > been out for what, a year or so? > > > > Fortunately there are plenty of sockets around, and some will go AVR32.
Is that the plan for the AVR32, to take *some* sockets? You know as well as I do that if the AVR32 does not get significant market penetration within two years from now, it will be put on the back burner and eventually discontinued. Atmel has no reason to keep making a part that consumes significant resources and does not make significant profit. Look at what happened to Atmel programmable logic. When was the last time they added a new FPGA to the product line? How many FPSLICs have been designed into new sockets?
> >> The AVR32 is decidedly better on DSP algorithms due to its > >> single cycle MAC and also it has faster access to SRAM. > >> Reading internal SRAM is a one clock cycle operation on the AVR32. > >> Bit banging will be one of the strengths of the UC3000. > > > > Isn't reading internal SRAM a single cycle on *all* processors? I > > can't think of any that require wait states. In fact, most processors > > try to cram as much SRAM onto the chip as possible because it is so > > fast. Did you say what you meant to say? > > > > On the UC3000 family, loading from internal SRAM will take one clock > in the execution stage. > Using single cycle SRAM does not mean that the load instruction is 1 clock.
Like I said, aren't all internal SRAMs in all processors single cycle???
"rickman" <gnuarm@gmail.com> skrev i meddelandet
news:1183995592.678499.34860@n2g2000hse.googlegroups.com...
> Ulf Samuelsson wrote: >> "rickman" <gnuarm@gmail.com> skrev i meddelandet >> > That is not the point. By prefetching the instructions, you are >> > setting up for a bigger dump and subsequent loss of instruction memory >> > bandwidth when you branch. FIFOs or instruction prefetching are not a >> > perfect solution. It is much better to just have single cycle >> > memory. >> >> Actually it is not, because if you try to decode your instruction >> in the same stage as the decoding, your clock frequency will >> go down significantly. >> The prefetching will work with single cycle memory and with >> memory having waitstates. > > What are you talking about??? How is slow memory faster than fast > memory??? >
If you have a memory capable of running at 50 MHz and you put it in a CPU capable of running at 25 MHz, then you will run slower.

In a two stage pipeline, you do "fetch-decode" and "execute". If memory access, decoding and execution each take 20 ns, then it will take 20 + 20 = 40 ns to handle the "fetch-decode" stage, so the CPU can only run at 25 MHz.

In a three stage pipeline, you do "fetch", "decode", "execute". If all three stages take 20 ns, then you will be able to run at 50 MHz.
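[A small sketch of the stage-timing argument, using the 20 ns figures from the post (illustrative only): the clock period is set by the slowest pipeline stage, so folding fetch and decode into one stage halves the achievable clock.]

    #include <stdio.h>

    int main(void)
    {
        double t_fetch = 20e-9, t_decode = 20e-9, t_execute = 20e-9;

        /* Two-stage pipeline: fetch and decode share one stage. */
        double stage_fd = t_fetch + t_decode;
        double period2  = stage_fd > t_execute ? stage_fd : t_execute;

        /* Three-stage pipeline: each step gets its own stage. */
        double period3 = t_fetch;
        if (t_decode  > period3) period3 = t_decode;
        if (t_execute > period3) period3 = t_execute;

        printf("2-stage fmax: %.0f MHz\n", 1.0 / period2 / 1e6);  /* 25 MHz */
        printf("3-stage fmax: %.0f MHz\n", 1.0 / period3 / 1e6);  /* 50 MHz */
        return 0;
    }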
> >> Prefetching, decoding and execution, all will take one clock. >> If you execute at 66 MHz with a three stage pipeline >> then you probably will execute around ~40 MHz with >> a two stage pipeline (Just a guess). >> >> If you execute blocks of 5 instruction including one jump, >> each block will use 7 cycles (3 + 1 + 1 + 1 + 1) @ 66 Mhz >> in a three stage pipeline for ~ 10 blocks / us. >> >> In a two stage pipeline, you could use 2 clocks for a jump >> so you execute (2 + 1 + 1 + 1 + 1) @ 40 MHz >> which is 6,5 blocks / us, clearly slower. > > Since when do I get to design my own processor??? Everything you have > just written is based on your own assumptions. This is a pointless > discussion since everything you say is based on *your* assumptions! > In addition, you only consider the parts of the issue that you choose > to include. You did a timing analysis on paper that does not include > the effect of branches. Clearly not accurate regardless of your > assumptions!
Statistics are likely to show that branches are normally not frequent enough for you to gain speed by having a shorter pipeline.
> > >> > But you are comparing apples and oranges. A processor that has no >> > wait states doesn't have to deal with this no matter what the >> > instruction mix is. It is just much simpler to not have to consider >> > memory latencies. >> > >> >> A processor running from flash without wait states will be limited >> in performance by the memory. >> A processor which reads multiple instructions with wait state >> will be able to execute faster due to its higher bandwidth to memory. > > Again you are assuming facts that are not in evidence. Where do you > get the higher bandwidth from memory if it is running with wait > states? Oh, right, you are *assuming* that there is something > different in the design that will make that one faster. Something > that is not part of a slower Flash that requires wait states. >
By making it wider.
> >> >> The UC3000 is claimed as 80 MIPS at 66 MHz. >> >> For the Cortex M3 to reach 80 MIPS at 50 MHz, >> >> you have to have 80/50 = 1,6 MIPS per MHz. >> >> I think that ARM does not claim that the Cortex is close to 1,6 MIPS >> >> per >> >> MHz. >> > >> > Oh, this is marketing stuff. I thought you might have run some real >> > benchmarks or someone else at Atmel might have. >> >> They have run benchmarks on the AVR32, but I think people are relying >> on official figures for the Cortex. > > "People" being "you"?
No, Atmel marketing.
> > >> > Certainly they have >> > looked hard at the Cortex. But if it competes too well against the >> > AVR32, I can see why it would not be pushed at Atmel. >> > Certainly there >> > will be a lot of sockets that will be won by an ARM device over a sole >> > source part like the AVR32. >> >> And hopefully ARM device from Atmel :-) > > There are a number of sockets that Atmel won't win if they don't have > a CM3 device. There are two companies with the new core in production > and a third on their heels. I am sure sales of the ARM7 devices won't > drop off a cliff. But this business is all about design wins and I > stand by my earlier post in another thread that the CM3 will start to > steal significant numbers of design wins by the end of this year and > by the end of next year they will overshadow the ARM7 design wins in > the off the shelf MCU market.
And maybe the ARM9 designs overshadow the ARM7 and CM3 as well. I see most high volume designs nowadays requiring 200 MHz+ operation. The large customers (1M+) requiring low power seem to focus on 1,8V SAM7s or AVR32s. This is of course only 5% of the total MCU market normally, so things could be different in your region. A company selecting a binary compatible family will still be better off with ARM than with Cortex, due to the larger performance span.
>> > At this point I don't think anyone can >> > say whether the AVR32 has legs and will be around in 5 years. It has >> > been out for what, a year or so? >> > >> >> Fortunately there are plenty of sockets around, and some will go AVR32. > > Is that the plan for the AVR32, to take *some* sockets? You know as > well as I do that if the AVR32 does not get significant market > penetration within a two years from now, it will be put on the back > burner and eventually discontinued. Atmel has no reason to keep making > a part that consumes significant resources and does not make > significant profit. Look at what happened to Atmel programmable > logic. When was the last time they added a new FPGA to the product > line? How many FPSLICs have been designed into new sockets? >
> >> >> The AVR32 is decidedly better on DSP algorithms due to its >> >> single cycle MAC and also it has faster access to SRAM. >> >> Reading internal SRAM is a one clock cycle operation on the AVR32. >> >> Bit banging will be one of the strengths of the UC3000. >> > >> > Isn't reading internal SRAM a single cycle on *all* processors? I >> > can't think of any that require wait states. In fact, most processors >> > try to cram as much SRAM onto the chip as possible because it is so >> > fast. Did you say what you meant to say? >> > >> >> On the UC3000 family, loading from internal SRAM will take one clock >> in the execution stage. >> Using single cycle SRAM does not mean that the load instruction is 1 >> clock. > > Like I said, aren't all internal SRAMs in all processors single > cycle??? >
Maybe so, but from a performance point of view you are more interested in how many cycles it takes to load from SRAM into a register. If this takes 1 clock cycle due to a 1 clock load instruction, or 3 clock cycles due to a 3 clock load instruction (from a 1 clock cycle SRAM), then you do see a performance difference.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not be shared by my employer Atmel Nordic AB
On Sat, 14 Jul 2007 10:04:25 +0200, "Ulf Samuelsson"
<ulf@a-t-m-e-l.com> wrote:

>Statistics is likely to show that branches are normally not that frequent
>that you gain speed by having a shorter pipeline.
Branch frequency is highly dependent on the application domain and coding style. However, it has been reported that in control-type applications branch instructions can be 20% to 30% of the total.

Stephen
--
Stephen Pelc, stephenXXX@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
On Jul 14, 4:04 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> "rickman" <gnu...@gmail.com> skrev i meddelandetnews:1183995592.678499.34860@n2g2000hse.googlegroups.com... > > > > > Ulf Samuelsson wrote: > >> "rickman" <gnu...@gmail.com> skrev i meddelandet > >> > That is not the point. By prefetching the instructions, you are > >> > setting up for a bigger dump and subsequent loss of instruction memory > >> > bandwidth when you branch. FIFOs or instruction prefetching are not a > >> > perfect solution. It is much better to just have single cycle > >> > memory. > > >> Actually it is not, because if you try to decode your instruction > >> in the same stage as the decoding, your clock frequency will > >> go down significantly. > >> The prefetching will work with single cycle memory and with > >> memory having waitstates. > > > What are you talking about??? How is slow memory faster than fast > > memory??? > > If you have a memory capable of running at 50 MHz and you > put that in a CPU capable of running at 25 MHz, then you > will run slower. > > In a two stage pipeline, you do "fetch-decode" and "execute". > If memory access, decoding and execution takes 20 ns, > then it will take 20 + 20 = 40 ns to handle the "fetch-decode" stage, > so the CPU can run at 25 MHz. > > In a three stage pipeline, you do "fetch", "decode", "execute". > If all three stages take 20 ns, then you will be able to run at 50 MHz.
This conversation has become pointless. It started discussing the loss of performance in processors that use slow Flash memory, and you have turned it into a discussion of processor design. You are way off topic and your comments are irrelevant to the original point.

The bottom line is that if all other things are equal, a processor with faster Flash memory will run faster. The Stellaris CM3 running at 50 MHz with no wait states from Flash will be faster for most apps than a processor running at 70 MHz with one or two wait states like the STM parts we were discussing. It may also be faster in many apps than a processor running at 70 MHz using a wide flash bus interface to overcome the wait states, because the lookahead fetch is often wasted when the instruction flow changes.

You can dance around that, but those are the facts.
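[As a rough illustration of this argument - deliberately pessimistic, assuming every fetch pays the stated wait states with no wide bus or prefetch to hide them - effective fetch rate drops quickly with wait states:]

    #include <stdio.h>

    /* Effective instruction fetch rate if every fetch pays the stated
     * wait states; wide buses and prefetching are ignored on purpose. */
    static double eff_mips(double clk_mhz, double ws_per_fetch)
    {
        return clk_mhz / (1.0 + ws_per_fetch);
    }

    int main(void)
    {
        printf("50 MHz, 0 WS : %.0f MIPS\n", eff_mips(50.0, 0.0)); /* 50   */
        printf("70 MHz, 1 WS : %.0f MIPS\n", eff_mips(70.0, 1.0)); /* 35   */
        printf("70 MHz, 2 WS : %.1f MIPS\n", eff_mips(70.0, 2.0)); /* 23.3 */
        return 0;
    }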
> >> Prefetching, decoding and execution, all will take one clock. > >> If you execute at 66 MHz with a three stage pipeline > >> then you probably will execute around ~40 MHz with > >> a two stage pipeline (Just a guess). > > >> If you execute blocks of 5 instruction including one jump, > >> each block will use 7 cycles (3 + 1 + 1 + 1 + 1) @ 66 Mhz > >> in a three stage pipeline for ~ 10 blocks / us. > > >> In a two stage pipeline, you could use 2 clocks for a jump > >> so you execute (2 + 1 + 1 + 1 + 1) @ 40 MHz > >> which is 6,5 blocks / us, clearly slower. > > > Since when do I get to design my own processor??? Everything you have > > just written is based on your own assumptions. This is a pointless > > discussion since everything you say is based on *your* assumptions! > > In addition, you only consider the parts of the issue that you choose > > to include. You did a timing analysis on paper that does not include > > the effect of branches. Clearly not accurate regardless of your > > assumptions! > > Statistics is likely to show that branches are normally not that frequent > that you > gain speed by having a shorter pipeline.
Funny, you are bringing in both statistics *and* probability. That is the type of language I hear all the time in commercials where they want you to think they have just told you a fact when in fact they have said pretty close to nothing.
> >> > But you are comparing apples and oranges. A processor that has no > >> > wait states doesn't have to deal with this no matter what the > >> > instruction mix is. It is just much simpler to not have to consider > >> > memory latencies. > > >> A processor running from flash without wait states will be limited > >> in performance by the memory. > >> A processor which reads multiple instructions with wait state > >> will be able to execute faster due to its higher bandwidth to memory. > > > Again you are assuming facts that are not in evidence. Where do you > > get the higher bandwidth from memory if it is running with wait > > states? Oh, right, you are *assuming* that there is something > > different in the design that will make that one faster. Something > > that is not part of a slower Flash that requires wait states. > > By making it wider. > > > > >> >> The UC3000 is claimed as 80 MIPS at 66 MHz. > >> >> For the Cortex M3 to reach 80 MIPS at 50 MHz, > >> >> you have to have 80/50 = 1,6 MIPS per MHz. > >> >> I think that ARM does not claim that the Cortex is close to 1,6 MIPS > >> >> per > >> >> MHz. > > >> > Oh, this is marketing stuff. I thought you might have run some real > >> > benchmarks or someone else at Atmel might have. > > >> They have run benchmarks on the AVR32, but I think people are relying > >> on official figures for the Cortex. > > > "People" being "you"? > > No, Atmel marketing.
Ahhh, *marketing*! That makes it very clear now. We can all have complete trust in benchmark figures from *marketing*!
> >> > Certainly they have > >> > looked hard at the Cortex. But if it competes too well against the > >> > AVR32, I can see why it would not be pushed at Atmel. > >> > Certainly there > >> > will be a lot of sockets that will be won by an ARM device over a sole > >> > source part like the AVR32. > > >> And hopefully ARM device from Atmel :-) > > > There are a number of sockets that Atmel won't win if they don't have > > a CM3 device. There are two companies with the new core in production > > and a third on their heels. I am sure sales of the ARM7 devices won't > > drop off a cliff. But this business is all about design wins and I > > stand by my earlier post in another thread that the CM3 will start to > > steal significant numbers of design wins by the end of this year and > > by the end of next year they will overshadow the ARM7 design wins in > > the off the shelf MCU market. > > And maybe the ARM9 designs overshadows the ARM7 and CM3 as well. > I see most high volume designs nowadays require 200 MHz + operation. > The large customers (1M+) requiring low power, seems to focus > on 1,8V SAM7s or AVR32s. > This is of course only 5% of the total MCU market normally > so things could be different in your region.
Yes, the swan song of the truly desperate. If anyone connected to the ARM7 feels threatened by the CM3, they simply bring in the ARM9, which is a totally unsuited processor for most of the apps that the ARM7 and CM3 target. The ARM9 will never fit the sockets that the ARM7 and CM3 fill. However, the CM3 fills most of those sockets much better than the ARM7, and that is my point.
> A company selecting a binary compatible family, will still be better off > with ARM > than with Cortex, due to larger performance span.
If they can shoe horn it onto their board! An ARM9 may be the right choice for a router, but not for a controller. The CM3 is targeted to the lower end bumping up against the 8 bit devices and eating into their market segment. The ARM9 will never compete in that area. It is too large of a chip and will always be uncompetitive at the low end.
> >> > At this point I don't think anyone can > >> > say whether the AVR32 has legs and will be around in 5 years. It has > >> > been out for what, a year or so? > > >> Fortunately there are plenty of sockets around, and some will go AVR32. > > > Is that the plan for the AVR32, to take *some* sockets? You know as > > well as I do that if the AVR32 does not get significant market > > penetration within a two years from now, it will be put on the back > > burner and eventually discontinued. Atmel has no reason to keep making > > a part that consumes significant resources and does not make > > significant profit. Look at what happened to Atmel programmable > > logic. When was the last time they added a new FPGA to the product > > line? How many FPSLICs have been designed into new sockets?
I see you ignored this comment. There are any number of "good ideas" that have totally failed in the market place. It is very possible that the ARM32 will be one of them.
> >> >> The AVR32 is decidedly better on DSP algorithms due to its > >> >> single cycle MAC and also it has faster access to SRAM. > >> >> Reading internal SRAM is a one clock cycle operation on the AVR32. > >> >> Bit banging will be one of the strengths of the UC3000. > > >> > Isn't reading internal SRAM a single cycle on *all* processors? I > >> > can't think of any that require wait states. In fact, most processors > >> > try to cram as much SRAM onto the chip as possible because it is so > >> > fast. Did you say what you meant to say? > > >> On the UC3000 family, loading from internal SRAM will take one clock > >> in the execution stage. > >> Using single cycle SRAM does not mean that the load instruction is 1 > >> clock. > > > Like I said, aren't all internal SRAMs in all processors single > > cycle??? > > Maybe so, but from a performance point of view, you are more > interested in how many cycles it takes to load from SRAM into a > register, and if this takes 1 clock cycle due to a 1 clock load > instruction, or 3 clock cycles due to a 3 clock load instruction > (from a 1 clock cycle SRAM), then you do see a performance differnence.
What processor only uses 3 clock instructions to access 1 clock memory? My understanding is that many processors not only use faster instructions to load, but can use memory in other instructions which allow single cycle back to back memory accesses. Besides, no one feature ever makes or breaks a processor chip. There are literally dozens of distinguishing points between different processors and only marketing and salesmen try to narrow an engineer's focus to a small number of features. I care about the overall utility of a processor and one of the big selling points to me is the ubiquitousness of the ARM chips. Very soon that will include the CM3 devices which will take over the low end squeezing the ARM7 between the CM3 and the ARM9.
rickman wrote:
 > On Jul 14, 4:04 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
>>And maybe the ARM9 designs overshadows the ARM7 and CM3 as well. >>I see most high volume designs nowadays require 200 MHz + operation. >>The large customers (1M+) requiring low power, seems to focus >>on 1,8V SAM7s or AVR32s. >>This is of course only 5% of the total MCU market normally >>so things could be different in your region. > > > Yes, the swan song of the truly desperate. If anyone connected to the > ARM7 feels threatened by the CM3, they simply bring in the ARM9 which > is a totally unsuited processor for most of the apps that the ARM7 and > CM3 target. The ARM9 will never fit the sockets that the ARM7 and CM3 > fill. However, the CM3 fill most of those sockets much better than > the ARM7 and that is my point.
Couple of teensy weeny problems with that sweeping statement: for something to hope to "fill most of those sockets", it needs to be pin and code compatible. Alas, the M3 is neither.

I note that NXP has licensed the Cortex A8, but simply not bothered with the M3. [Likely their 128 bit fetch ARM7 makes the M3 too small a change.] Many designers will think the same. I don't see many taking an ARM7 out of a released product, just for the fun of dropping in a M3.

So, the M3 competes for new designs, and Ulf is right that the leading edge will want a bigger new-design jump than ARM7->M3, so that leaves the M3 chasing a narrow aperture of design wins. There, it competes against all the other 32bit offerings, and it competes on peripherals, package and power, as much as core.

We looked at the new ST M3's: Great, I thought, a small MCU with USB and CAN (notice the actual core is not even on this selection list!) - oops, it seems ST have designed a part that is USB _or_ CAN. Even a good 8 bit core would run USB & CAN, so we don't actually care about a 25% performance window.
>>> Look at what happened to Atmel programmable >>>logic. When was the last time they added a new FPGA to the product >>>line?
Atmel are adding new CPLDs (but their FPGAs are in stable design mode). They have the new CAP series, with ARM7 and ARM9. The new family looks well placed to pick up 'cost down design passes' on products that started commercial life in FPGAs but, as volume (and competition) ramps, need more efficient silicon.
>>> How many FPSLICs have been designed into new sockets? > > I see you ignored this comment. There are any number of "good ideas" > that have totally failed in the market place. It is very possible > that the ARM32 will be one of them.
I'm guessing you actually meant to say AVR32 here ;)

I see AVR32 and FpSLIC as very different animals.

FpSLIC: a "good idea"? Hmm... It was clear (to me, at least) even from release that the FpSLIC had problems, namely that it LOOKED very flexible to someone in marketing, but to a designer was actually very constraining: you had to KNOW your code was NEVER going to go above the (16K?) chip limit, and you had to have an application too big for a CPLD, yet small enough to use the FPGA portion (but never exceed it). Then you notice that an application small enough to fit in 16K, but that ALSO needs a small-to-moderate FPGA, is becoming a tiny segment indeed.

AVR32: This is a much simpler design choice. High end uC design choice is based mainly on the 4 P's: Peripherals, Power, Package & Price. Success is helped a lot by low cost tools, and good on chip debug will be important, as will a strong eco-system. Atmel's road map on this is looking pretty good. [So do Freescale's, and Infineon's, and none of these use M3...]

-jg

"rickman" <gnuarm@gmail.com> skrev i meddelandet
news:1184594668.666542.195070@57g2000hsv.googlegroups.com...
> On Jul 14, 4:04 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote: >> "rickman" <gnu...@gmail.com> skrev i >> meddelandetnews:1183995592.678499.34860@n2g2000hse.googlegroups.com... >> >> >> >> > Ulf Samuelsson wrote: >> >> "rickman" <gnu...@gmail.com> skrev i meddelandet >> >> > That is not the point. By prefetching the instructions, you are >> >> > setting up for a bigger dump and subsequent loss of instruction >> >> > memory >> >> > bandwidth when you branch. FIFOs or instruction prefetching are not >> >> > a >> >> > perfect solution. It is much better to just have single cycle >> >> > memory. >> >> >> Actually it is not, because if you try to decode your instruction >> >> in the same stage as the decoding, your clock frequency will >> >> go down significantly. >> >> The prefetching will work with single cycle memory and with >> >> memory having waitstates. >> >> > What are you talking about??? How is slow memory faster than fast >> > memory??? >> >> If you have a memory capable of running at 50 MHz and you >> put that in a CPU capable of running at 25 MHz, then you >> will run slower. >> >> In a two stage pipeline, you do "fetch-decode" and "execute". >> If memory access, decoding and execution takes 20 ns, >> then it will take 20 + 20 = 40 ns to handle the "fetch-decode" stage, >> so the CPU can run at 25 MHz. >> >> In a three stage pipeline, you do "fetch", "decode", "execute". >> If all three stages take 20 ns, then you will be able to run at 50 MHz. > > This conversation has become pointless. It started discussing the > loss of performance in processors that use slow Flash memory and you > have turned it into a discussion of processor design. You are way off > topic and your comments are irrelevant to the original point. The > bottom line is that if all other things are equal, a processor with > faster Flash memory will run faster. The Stellaris CM3 running at 50 > MHz with no wait states from Flash will be faster for most apps than a > processor running at 70 MHz with 1 or two wait states like the STM > parts we were discussing. It may also be faster in many apps than a > processor running at 70 MHz using a wide flash bus interface to > overcome the wait states required because the lookahead fetch is often > wasted when the instruction flow changes. > > You can dance around that, but those are the facts. >
Nope it isn't; the AVR32 running at 66 MHz will run mostly at zero waitstates due to its interleaved flash controller design. Each flash access done by the memory controller will have 1 waitstate, but since the memory controller can do two accesses in parallel, the CPU will only see waitstates during jumps, and no waitstates during non-jump instructions. If you do jumps 20% of the time, then the average number of waitstates is 0,2. On top of that, you will be able to perform data accesses to the flash while eating from the instruction queue without any performance penalty.
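[A quick sketch of the arithmetic behind this claim, under the stated assumptions (sequential fetches fully hidden by the interleave, one exposed waitstate per taken jump, 20% jumps); the figures are the ones quoted in the post, not from a datasheet:]

    #include <stdio.h>

    /* Average exposed waitstates per instruction when only taken jumps
     * see the flash waitstate. */
    static double avg_waitstates(double jump_fraction, int ws_per_jump)
    {
        return jump_fraction * ws_per_jump;
    }

    int main(void)
    {
        double ws = avg_waitstates(0.20, 1);        /* 0.2 waitstates/instruction */
        printf("avg waitstates: %.1f\n", ws);

        double cpi = 1.0 + ws;                      /* cycles per instruction     */
        printf("effective throughput: %.0f MIPS at 66 MHz\n", 66.0 / cpi); /* ~55 */
        return 0;
    }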
>> And maybe the ARM9 designs overshadows the ARM7 and CM3 as well. >> I see most high volume designs nowadays require 200 MHz + operation. >> The large customers (1M+) requiring low power, seems to focus >> on 1,8V SAM7s or AVR32s. >> This is of course only 5% of the total MCU market normally >> so things could be different in your region. > > Yes, the swan song of the truly desperate. If anyone connected to the > ARM7 feels threatened by the CM3, they simply bring in the ARM9 which > is a totally unsuited processor for most of the apps that the ARM7 and > CM3 target. The ARM9 will never fit the sockets that the ARM7 and CM3 > fill. However, the CM3 fill most of those sockets much better than > the ARM7 and that is my point.
The ARM9 will fit almost any socket where the user requires an external bus.
> > >> A company selecting a binary compatible family, will still be better off >> with ARM >> than with Cortex, due to larger performance span. > > If they can shoe horn it onto their board! An ARM9 may be the right > choice for a router, but not for a controller. The CM3 is targeted to > the lower end bumping up against the 8 bit devices and eating into > their market segment. The ARM9 will never compete in that area. It > is too large of a chip and will always be uncompetitive at the low > end.
You'd be surprised how often ARM9 fits the bill.
>> >> > At this point I don't think anyone can >> >> > say whether the AVR32 has legs and will be around in 5 years. It >> >> > has >> >> > been out for what, a year or so? >> >> >> Fortunately there are plenty of sockets around, and some will go >> >> AVR32. >> >> > Is that the plan for the AVR32, to take *some* sockets? You know as >> > well as I do that if the AVR32 does not get significant market >> > penetration within a two years from now, it will be put on the back >> > burner and eventually discontinued. Atmel has no reason to keep making >> > a part that consumes significant resources and does not make >> > significant profit. Look at what happened to Atmel programmable >> > logic. When was the last time they added a new FPGA to the product >> > line? How many FPSLICs have been designed into new sockets? > > I see you ignored this comment. There are any number of "good ideas" > that have totally failed in the market place. It is very possible > that the ARM32 will be one of them. > > >> >> >> The AVR32 is decidedly better on DSP algorithms due to its >> >> >> single cycle MAC and also it has faster access to SRAM. >> >> >> Reading internal SRAM is a one clock cycle operation on the AVR32. >> >> >> Bit banging will be one of the strengths of the UC3000. >> >> >> > Isn't reading internal SRAM a single cycle on *all* processors? I >> >> > can't think of any that require wait states. In fact, most >> >> > processors >> >> > try to cram as much SRAM onto the chip as possible because it is so >> >> > fast. Did you say what you meant to say? >> >> >> On the UC3000 family, loading from internal SRAM will take one clock >> >> in the execution stage. >> >> Using single cycle SRAM does not mean that the load instruction is 1 >> >> clock. >> >> > Like I said, aren't all internal SRAMs in all processors single >> > cycle??? >> >> Maybe so, but from a performance point of view, you are more >> interested in how many cycles it takes to load from SRAM into a >> register, and if this takes 1 clock cycle due to a 1 clock load >> instruction, or 3 clock cycles due to a 3 clock load instruction >> (from a 1 clock cycle SRAM), then you do see a performance differnence. > > What processor only uses 3 clock instructions to access 1 clock > memory? My understanding is that many processors not only use faster > instructions to load, but can use memory in other instructions which > allow single cycle back to back memory accesses.
The simple three stage pipeline processors (and the CM3) normally use a few clocks in the execution stage to load data, but the uC3 family does not.
> Besides, no one feature ever makes or breaks a processor chip. There > are literally dozens of distinguishing points between different > processors and only marketing and salesmen try to narrow an engineer's > focus to a small number of features. I care about the overall utility > of a processor and one of the big selling points to me is the > ubiquitousness of the ARM chips. Very soon that will include the CM3 > devices which will take over the low end squeezing the ARM7 between > the CM3 and the ARM9.
--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not be shared by my employer Atmel Nordic AB
On Jul 16, 5:30 pm, Jim Granville <no.s...@designtools.maps.co.nz>
wrote:
> rickman wrote: > > > On Jul 14, 4:04 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote: > > >>And maybe the ARM9 designs overshadows the ARM7 and CM3 as well. > >>I see most high volume designs nowadays require 200 MHz + operation. > >>The large customers (1M+) requiring low power, seems to focus > >>on 1,8V SAM7s or AVR32s. > >>This is of course only 5% of the total MCU market normally > >>so things could be different in your region. > > > Yes, the swan song of the truly desperate. If anyone connected to the > > ARM7 feels threatened by the CM3, they simply bring in the ARM9 which > > is a totally unsuited processor for most of the apps that the ARM7 and > > CM3 target. The ARM9 will never fit the sockets that the ARM7 and CM3 > > fill. However, the CM3 fill most of those sockets much better than > > the ARM7 and that is my point. > > Couple of teensy weeny problems to that sweeping statement: > For something to hope to "fill most of those sockets", it > needs to be Pin and code compatible, Alas, the M3 is neither.
No, when I say "fill the sockets" I am not talking about new chips being used in old designs, I am talking about the new chips being used in new designs that would otherwise make use of the other MCUs. So when new designs are started, a designer who considers the CM3 will see that it is a better choice for most designs where he would otherwise use an ARM7. Likewise, designs that would otherwise use an ARM9 will mostly continue to use the ARM9. I see the ARM7/CM3 as fitting different sockets than the ARM9, with little overlap.

So please try to read my words carefully. I know you can figure out what I mean since we have discussed this before and I am saying the same things I have said before. I guess I should reconsider my purpose in continuing to discuss this with you, since you don't seem to pick up on what I am saying and the meaning seems to get twisted a lot.
> I note that NXP has licensed the Cortex A8, but simply not bothered > with the M3. > [Likely their 128 bit fetch ARM7, makes the M3 too small a change] > > Many designers will think the same. > I don't see many taking an ARM7 out of a released product, just > for the fun of dropping in a M3.
I agree, it would be silly to pull back a released product just to change the MCU when it is working just fine.
> So, the M3 competes for new designs, and Ulf is right that the leading > edge will want a bigger new-design jump than ARM7->M3, so that leaves > the M3 chasing a narrow aperture of design wins. > There, it competes against all the other 32bit offerings, and > it competes on Peripherals package and power, as much as Core.
I don't know what you mean by "leading edge". New designs cover a wide range of requirements for the MCU, from tiny 8 bit devices that give the lowest cost to huge 32 bit processors that nearly keep up with x86 CPUs. The application range of the ARM7/CM3 has little overlap with the ARM9. The most significant separator is cost. Most ARM9s do not include program storage, requiring external Flash. The one ARM9 family that includes Flash runs much slower than the other ARM9s and is only a slight speed (or any other) improvement over the ARM7 or CM3.

The CM3 has several advantages over both the ARM7 and ARM9, which you seem to want to dismiss while focusing on how the ARM9 is a very different processor with more advanced capabilities targeting a different market. Using an ARM9 in many applications is like using a mortar to hunt rabbits. There may be more features in the ARM9s than the CM3, but if you don't need them, why pay for them? Why do you continue to try to compare the ARM9 to the CM3? They address different markets and there is very little overlap.
> We looked at the new ST M3's : Great I thought, a Small MCU, with USB > and CAN (notice the actual core is not even on this selection list! ) > -Oops, seems ST have designed a part that is USB _or_ CAN. > Even a good 8 bit core would run USB & CAN, so we don't actually care > about a 25% performance window. > > >>> Look at what happened to Atmel programmable > >>>logic. When was the last time they added a new FPGA to the product > >>>line? > > Atmel are adding new CPLDs, (but their FPGAs are in stable design mode). > They have the new CAP series, with ARM7 and ARM9. > The new family looks well placed, to pick up 'Cost Down Design Passes' > on products that started commercial life in FPGAs, but as volume > (and competition) ramps, they need more efficent silicon.
Now you are going off into left field. My point was to compare the single source AVR32 to other single source products such as the FPSLIC which has failed in the market and will leave someone high and dry when it is discontinued. You bring an ASIC into the discussion as if it were somehow relevant. What was your point???
> >>> How many FPSLICs have been designed into new sockets? > > > I see you ignored this comment. There are any number of "good ideas" > > that have totally failed in the market place. It is very possible > > that the ARM32 will be one of them. > > I'm guessing you actually meant to say AVR32 here ;)
Yes, my slip...
> I see AVR32 and FpSLIC as very different animals.
Yes, they are different, but they have a significant common point: they are both single source with very stiff competition. It will be very easy for the AVR32 to slowly die just like the FPSLIC, the Transcend processors and many other products that just could not compete in the market.

It is especially interesting that Atmel continues to introduce new ARM processors alongside the AVR32. I seem to recall Intel doing that with various processors like the 860, 960 and others, all of which died off and left users high and dry. I believe the 860 was a popular product in the military camp and was designed into a number of systems with 10 to 20 year lifespans. Then 3 years in, the family was discontinued, so customers didn't even have similar chips to upgrade to. I can see the AVR32 going this same route.
> FpSLIC: - a "good idea" ? Hmm... > It was clear (to me, at least) even from release, the FpSLIC had > problems, which was that it LOOKED to be very flexible to someone in > marketing, but to a designer was actually very constraining: > > You had to KNOW you code was NEVER going to go above the (16K?) > chip limit, and you had to have an application too big for a CPLD, > and small enough to use the FPGA portion (but never exceed it) > Then you notice that an application small enough to fit in 16K, > but that ALSO needs a Small-Moderate FPGA, is becomming a tiny segment > indeed.
This sounds like a specious argument. *EVERY* CPU has limitations which you have to accept when you use it. At the time the FPSLIC was introduced, some 10 years or more ago, 16kB was a generous amount of RAM for an 8 bit MCU. This memory is RAM, not Flash; the Flash was stored off chip on the FPSLIC. Regardless, it does not matter what flaws the product had; the point is that this type of product was sole sourced, which had a lot to do with the product failure. It is not just a matter of pin compatibility; there was no one else making devices remotely like FPSLICs. That was actually the reason I did not use it in a design it was perfectly suited to. Likewise, switching from an AVR32 to another processor will require a lot more work than just switching between ARMs.
> AVR32: This is a much simpler design choice. high end uC design choice > is based mainly on the 4 P's : Peripherals, Power, Package & Price. > Success is helped a lot by low cost tools, and good on chip debug > will be important, as will a strong eco-system.
That rolls off the tongue well, but there are significant differences between CPUs. You seem to point that out in spades when you compare the ARM7 to its sibling the CM3, but completely dismiss it when you compare the AVR32 to all the other 32 bit processors. Staying within a family saves a lot of work. The ARM family has a great deal of commonality between all of its members, with a wide target range, while the AVR32 has a limited target range and requires switching families to go outside it. The bottom line is that the ARM chips have legs that other, proprietary products don't. Even ignoring the technical issues, the ARM has momentum which will capture a lot of design wins in close races.
> Atmel's road map on this is looking pretty good. > [So do Freescale's, and Infineon's, and none of these use M3...]
I seem to recall that the ARMs are a big part of Atmel's road map. That is my point: the CM3 is a better ARM than the ARM7 is. Everything the ARM7 does, the CM3 does better. The designs they target are not a good match to the ARM9 because of higher power consumption, larger physical size or higher cost. The CM3 out-competes the ARM7 in every area except for the number of implementations, which I am saying will be changing over the next few years.

Finally, I don't see the AVR32 having any real advantages over the ARM processors unless there is an app which just happens to fit the AVR32 details better than any of the ARMs. The number of apps for which this is true will be very small indeed. So with more makers announcing new CM3 chips, I see the crossover point (more design wins of off the shelf MCUs) for the CM3 vs the ARM7 coming within the next year, and maybe by the end of this year.
On Jul 20, 6:37 pm, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> "rickman" <gnu...@gmail.com> skrev i meddelandetnews:1184594668.666542.195070@57g2000hsv.googlegroups.com... > > > > > On Jul 14, 4:04 am, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote: > >> "rickman" <gnu...@gmail.com> skrev i > >> meddelandetnews:1183995592.678499.34860@n2g2000hse.googlegroups.com... > > >> > Ulf Samuelsson wrote: > >> >> "rickman" <gnu...@gmail.com> skrev i meddelandet > >> >> > That is not the point. By prefetching the instructions, you are > >> >> > setting up for a bigger dump and subsequent loss of instruction > >> >> > memory > >> >> > bandwidth when you branch. FIFOs or instruction prefetching are not > >> >> > a > >> >> > perfect solution. It is much better to just have single cycle > >> >> > memory. > > >> >> Actually it is not, because if you try to decode your instruction > >> >> in the same stage as the decoding, your clock frequency will > >> >> go down significantly. > >> >> The prefetching will work with single cycle memory and with > >> >> memory having waitstates. > > >> > What are you talking about??? How is slow memory faster than fast > >> > memory??? > > >> If you have a memory capable of running at 50 MHz and you > >> put that in a CPU capable of running at 25 MHz, then you > >> will run slower. > > >> In a two stage pipeline, you do "fetch-decode" and "execute". > >> If memory access, decoding and execution takes 20 ns, > >> then it will take 20 + 20 = 40 ns to handle the "fetch-decode" stage, > >> so the CPU can run at 25 MHz. > > >> In a three stage pipeline, you do "fetch", "decode", "execute". > >> If all three stages take 20 ns, then you will be able to run at 50 MHz. > > > This conversation has become pointless. It started discussing the > > loss of performance in processors that use slow Flash memory and you > > have turned it into a discussion of processor design. You are way off > > topic and your comments are irrelevant to the original point. The > > bottom line is that if all other things are equal, a processor with > > faster Flash memory will run faster. The Stellaris CM3 running at 50 > > MHz with no wait states from Flash will be faster for most apps than a > > processor running at 70 MHz with 1 or two wait states like the STM > > parts we were discussing. It may also be faster in many apps than a > > processor running at 70 MHz using a wide flash bus interface to > > overcome the wait states required because the lookahead fetch is often > > wasted when the instruction flow changes. > > > You can dance around that, but those are the facts. > > Nope it isn't, the AVR32 running at 66 MHz will run mostly > at zero waitstates due to its interleaved flash controller design. > Each flash access done by the memory controller > will have 1 waitstate, but since the memory controller can do > two accesses in parallel, the CPU will only see waitstates > during jumps, and no waitstates during non jump instructions. > If you do jumps 20% of the time, then the average number of waitstates is > 0,2. > On top of that you will be able to perform dataaccesses to the flash > while eating from the instruction queue wihout any performance penalty.
That is pointless. It does not matter how large the FIFO is: if you are pulling data out at a given rate and you can only put data in at that same rate, then as soon as you have to stop instruction reads to do a data read, you will not be filling the FIFO as fast as it is being emptied, and performance will suffer. Run through a simulation and see if that is not true. Based on the info you provided, this is the result.
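[A toy simulation of this point - a sketch only, with made-up rates (refill one instruction per cycle when not stolen, consume one per cycle, 20% of cycles doing a data access): once the FIFO drains, the stalls track the stolen fetch slots regardless of FIFO depth.]

    #include <stdio.h>

    int main(void)
    {
        int depth = 4, fifo = depth;   /* start with a full 4-entry FIFO     */
        int stalls = 0;

        for (int cycle = 0; cycle < 1000; cycle++) {
            int data_access = (cycle % 5 == 4);  /* 1 in 5 cycles is a load/store */

            if (!data_access && fifo < depth)
                fifo++;                /* fetch side refills one slot        */

            if (fifo > 0)
                fifo--;                /* core consumes one instruction      */
            else
                stalls++;              /* FIFO empty: the core stalls        */
        }
        printf("stall cycles per 1000: %d\n", stalls); /* ~1 in 5 after warm-up */
        return 0;
    }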
> >> And maybe the ARM9 designs overshadows the ARM7 and CM3 as well. > >> I see most high volume designs nowadays require 200 MHz + operation. > >> The large customers (1M+) requiring low power, seems to focus > >> on 1,8V SAM7s or AVR32s. > >> This is of course only 5% of the total MCU market normally > >> so things could be different in your region. > > > Yes, the swan song of the truly desperate. If anyone connected to the > > ARM7 feels threatened by the CM3, they simply bring in the ARM9 which > > is a totally unsuited processor for most of the apps that the ARM7 and > > CM3 target. The ARM9 will never fit the sockets that the ARM7 and CM3 > > fill. However, the CM3 fill most of those sockets much better than > > the ARM7 and that is my point. > > The ARM9 will fit almost any sockets where the user require an external bus.
So you are agreeing with me that the ARM9 is not a good match for most ARM7 or CM3 designs? The ARM9 may "fit" the design, but it will not be as good a fit if the ARM7 or CM3 can do the job. If nothing else, the cost and power consumption will be higher with the ARM9. In most cases the package size will be larger for the ARM9. Why use a shotgun when a slingshot will do the job?
> >> A company selecting a binary compatible family, will still be better off > >> with ARM > >> than with Cortex, due to larger performance span. > > > If they can shoe horn it onto their board! An ARM9 may be the right > > choice for a router, but not for a controller. The CM3 is targeted to > > the lower end bumping up against the 8 bit devices and eating into > > their market segment. The ARM9 will never compete in that area. It > > is too large of a chip and will always be uncompetitive at the low > > end. > > You'd be surprised how often ARM9 fits the bill.
No, I think I have a pretty good handle on the differences between Atmel's ARM9 processors and the CM3 product line. They are similar CPUs with very different interfaces to the outside world for two very different target ranges. Anyone who thinks there is much overlap is kidding themselves.
> >> >> > At this point I don't think anyone can > >> >> > say whether the AVR32 has legs and will be around in 5 years. It > >> >> > has > >> >> > been out for what, a year or so? > > >> >> Fortunately there are plenty of sockets around, and some will go > >> >> AVR32. > > >> > Is that the plan for the AVR32, to take *some* sockets? You know as > >> > well as I do that if the AVR32 does not get significant market > >> > penetration within a two years from now, it will be put on the back > >> > burner and eventually discontinued. Atmel has no reason to keep making > >> > a part that consumes significant resources and does not make > >> > significant profit. Look at what happened to Atmel programmable > >> > logic. When was the last time they added a new FPGA to the product > >> > line? How many FPSLICs have been designed into new sockets? > > > I see you ignored this comment. There are any number of "good ideas" > > that have totally failed in the market place. It is very possible > > that the ARM32 will be one of them. > > >> >> >> The AVR32 is decidedly better on DSP algorithms due to its > >> >> >> single cycle MAC and also it has faster access to SRAM. > >> >> >> Reading internal SRAM is a one clock cycle operation on the AVR32. > >> >> >> Bit banging will be one of the strengths of the UC3000. > > >> >> > Isn't reading internal SRAM a single cycle on *all* processors? I > >> >> > can't think of any that require wait states. In fact, most > >> >> > processors > >> >> > try to cram as much SRAM onto the chip as possible because it is so > >> >> > fast. Did you say what you meant to say? > > >> >> On the UC3000 family, loading from internal SRAM will take one clock > >> >> in the execution stage. > >> >> Using single cycle SRAM does not mean that the load instruction is 1 > >> >> clock. > > >> > Like I said, aren't all internal SRAMs in all processors single > >> > cycle??? > > >> Maybe so, but from a performance point of view, you are more > >> interested in how many cycles it takes to load from SRAM into a > >> register, and if this takes 1 clock cycle due to a 1 clock load > >> instruction, or 3 clock cycles due to a 3 clock load instruction > >> (from a 1 clock cycle SRAM), then you do see a performance differnence. > > > What processor only uses 3 clock instructions to access 1 clock > > memory? My understanding is that many processors not only use faster > > instructions to load, but can use memory in other instructions which > > allow single cycle back to back memory accesses. > > The simple three stage pipeline processors (and the CM3) normally use a few > clocks > in the execution stage to load data, but the uC3 family does not.
Ok, I have to assume that you don't have any examples. Regardless, this seems like a red herring in this discussion anyway.
> > Besides, no one feature ever makes or breaks a processor chip. There > > are literally dozens of distinguishing points between different > > processors and only marketing and salesmen try to narrow an engineer's > > focus to a small number of features. I care about the overall utility > > of a processor and one of the big selling points to me is the > > ubiquitousness of the ARM chips. Very soon that will include the CM3 > > devices which will take over the low end squeezing the ARM7 between > > the CM3 and the ARM9.
I stand by my analysis of the competitiveness of the CM3.