
What is happening to Atmel EEPROMs?

Started by Peter March 25, 2010
On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
<nospam.ulf@atmel.com> wrote:

> <snip of LPC2xxx vs SAM7S discussion>
> The 128 bit memory is overkill for thumb mode and just
> wastes power.
> <snip>
Ulf, let me remind you of something you wrote about the SAM7:

"In thumb mode, the 32 bit access gives you two instructions per cycle so in average this gives you 1 instruction per clock on the SAM7."

I gather this is regarding the case where there is 1 wait state reading the 32-bit flash line -- so 2 clocks per line and thus 1 clock per 16-bit instruction (assuming it executes in 1 clock.)

Nico's comment about the NXP ARM, with its 128-bit wide flash line, would (I imagine) work about the same except that it reads at full clock rate, no wait states. So I gather, if it works similarly, that there are eight thumb instructions per line (roughly.) I take it your point is that since each instruction (things being equal) cannot execute faster than 1 clock each, it takes 8 clocks to execute those thumb instructions.

The discussion could move between discussing instruction streams to discussing constant data tables and the like, but staying on the subject of instructions for the following....

So the effect is that it takes the same number of clocks to execute 1-clock thumb instructions on either system? (Ignoring frequency, for now.) Or do I get that wrong?

You then discussed power consumption issues. Wouldn't it be the case that since the NXP ARM is accessing its flash at 1/8th the clock rate and the SAM7 is constantly operating its flash, the _average_ power consumption might very well be better with the NXP ARM, despite somewhat higher current when it is being accessed? Isn't the fact that the access cycle takes place far less frequently observed as a lower average? Perhaps the peak divided by 8, or so? (Again, keep the clock rates identical [downgraded to SAM7 rates in the NXP ARM case.]) Have you computed figures for both?

Jon
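As a back-of-the-envelope check on the fetch arithmetic in the post above, here is a small sketch in C. The line widths and wait-state counts are the ones assumed in this thread, not datasheet values, and branches and data fetches are ignored:

#include <stdio.h>

/* Average CPU clocks per 16-bit Thumb instruction fetched sequentially
 * from flash: width_bits per line, 'waits' wait states per line read,
 * and the core never retiring more than one instruction per clock.
 */
static double clocks_per_thumb(int width_bits, int waits)
{
    int    instr_per_line = width_bits / 16;
    double line_cost      = 1.0 + waits;
    double fetch_cost     = line_cost / instr_per_line;
    return fetch_cost > 1.0 ? fetch_cost : 1.0;
}

int main(void)
{
    /* SAM7-style: 32-bit line, 1 wait state -> 2 clocks per 2 instructions */
    printf("32-bit,  1 ws: %.2f clocks/instruction\n", clocks_per_thumb(32, 1));
    /* LPC-style: 128-bit line, 0 wait states -> 1 clock per 8 instructions */
    printf("128-bit, 0 ws: %.2f clocks/instruction\n", clocks_per_thumb(128, 0));
    return 0;
}

Both print 1.00, which is the "same number of clocks" reading for straight-line Thumb code; the difference only shows up at branches and non-sequential accesses.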
On Mar 25, 12:57 pm, Leon <leon...@btinternet.com> wrote:
> On 25 Mar, 11:20, Peter <nos...@nospam9876.com> wrote:
> > They have doubled their prices and the lead times are 18 weeks.
> > Yet, others are making them OK.
> > Are Atmel trying to get out of the business?
> > x----------x
>
> They got rid of their fabs, and are now having to join the queue at
> TSMC or wherever they get their chips made. They are probably having
> to pay a lot more for them, because of demand for the manufacturing
> facilities. Microchip have their own fabs, and seem able to keep up
> with demand.
Some ARM cored micros are on 40 week lead times; good thing Atmel manufacture the AVRs themselves. For related reading see here: http://www.electronicsweekly.com/blogs/david-manners-semiconductor-blog/2010/03/when-youve-got-your-customer-b.html
Ulf Samuelsson <ulf@a-t-m-e-l.com> wrote:

>TheM skrev:
>> "Nico Coesel" <nico@puntnl.niks> wrote in message news:4bacf169.1721173156@news.planet.nl...
>>> "TheM" <DontNeedSpam@test.com> wrote:
>>>
>>>> "Spehro Pefhany" <speffSNIP@interlogDOTyou.knowwhat> wrote in message news:5elnq5d2ncjvs91v1cu5dmt5tbntuhefg3@4ax.com...
>>>>> On Thu, 25 Mar 2010 13:19:46 -0800, "Bob Eld" <nsmontassoc@yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> "Peter" <nospam@nospam9876.com> wrote in message
>>>>>> news:9lhmq5plg1gr3sduo9n52mdi5g6iiqucqc@4ax.com...
>>>>>>> They have doubled their prices and the lead times are 18 weeks.
>>>> Is this limited to EEPROM/Memory only or uCPU as well?
>>>>
>>>> Definitely worth considering getting out of AVR.
>>>> Do NXP ARM come with on-chip FLASH?
>>> Yes, all of them have 128 bit wide flash that allows zero waitstate
>>> execution at the maximum CPU clock.
>>
>> Not bad, I ordered a couple books on ARM off Amazon, may get into it finally.
>> From what I see they are same price as AVR mega, low power and much faster.
>> And NXP is very generous with samples.
>>
>> M
>>
>
>The typical 32 bitters of today are implemented using advanced
>flash technologies which allows high density memories in small chip
>areas, but they are not low power.
>
>The inherent properties of the process makes for high leakage.
>When you see power consumption in sleep of around 1-2 uA,
>this is when the chip is turned OFF.
>Only a small part of the chip is powered, RTC and a few other things.
>
>When you implement in a 0.25u process or higher, you can have the chip
>fully initialized and ready to react on input while using
>1-2 uA in sleep.
>
>That is a big difference.
>
>While the NXP devices gets zero waitstate from 128 bit bus,
>this also makes them extremely power hungry.
>An LPC ARM7 uses about 2 x the current of a SAM7.
>It gets higher performance in ARM mode.
>
>The ARM mode has a price in code size, so if you want more features,
>then you better run in Thumb mode. The SAM7 with 32 bit flash is
>actually faster than the LPC when running in Thumb mode,
>(at the same frequency) since the SAM7 uses a 33 MHz flash,
>while the LPC uses a 24 MHz flash.
>In thumb mode, the 32 bit access gives you two instructions
>per cycle so in average this gives you 1 instruction per clock on the SAM7.
I think this depends a lot on what method you use to measure this. Thumb code is expected to be slower than ARM code. You should test with Dhrystone and make sure the same C library is used, since Dhrystone results also depend on the C library!
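If you want cycle counts rather than Dhrystone scores, the Cortex-M3 parts that come up later in this thread (SAM3, STM32, LPC1700) have a DWT cycle counter that makes this easy; the ARM7 parts do not. A minimal sketch, with the routine under test left as a placeholder:

#include <stdint.h>

/* ARMv7-M (Cortex-M3) debug registers used for cycle counting. */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu) /* Debug Exception/Monitor Ctrl */
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

void cycles_init(void)
{
    DEMCR      |= (1u << 24); /* TRCENA: enable the DWT block */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;         /* CYCCNTENA: start counting    */
}

extern void routine_under_test(void); /* placeholder for the code being measured */

uint32_t cycles_for_routine(void)
{
    uint32_t start = DWT_CYCCNT;
    routine_under_test();
    return DWT_CYCCNT - start; /* elapsed core cycles, modulo 2^32 */
}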
>Less waitstates means higher performance.
>By copying a few 32 bit ARM routines to SRAM,
>you can overcome that limitation.
>You can get slightly higher top frequency out of the LPC,
>but that again increases the power consumption.
>
>For Cortex-M3 I did some test on the new SAM3, which can be
>configured to use both 64 bit or 128 bit memories.
>With a 128 bit memory, you can wring about 5% extra performance
>out of the chip compared to 64 bit operation.
>From a power consumption point of view it is probably better
>to increase the clock frequency by 5% than to enable the 128 bit mode.
>It is therefore only the most demanding applications that have
>any use for the 128 bit memory.
>
>Testing on other Cortex-M3 chips indicate similar results.
>
>Someone told me that they tried executing out of SRAM on an STM32
>and this was actually slower than executing out of flash.
>Executing out of external memory also appears to be a problem,
>since there is no cache/burst and bandwidth seems to be lower
>than equivalent ARM7 devices.
That doesn't surprise me. From my experience with STR7 and the STM32 datasheets it seems ST does a sloppy job putting controllers together. They are cheap but you don't get maximum performance.
>Current guess is that the AHB bus has some delays due to
>synchronization. Also if you execute out of SRAM
>you are going to have conflicts with data access.
>Something which is avoided when you execute out of flash.
NXP has some sort of cache between the CPU and the flash on the M3 devices. According to the documentation NXP's LPC1700 M3 devices use a Harvard architecture with 3 busses so multiple data transfers (CPU-flash, CPU-memory and DMA) can occur simultaneously. Executing from RAM would occupy one bus so you'll have less memory bandwidth to work with.

--
Failure does not prove something is impossible, failure simply indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
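On the quoted point above about copying a few 32-bit ARM routines to SRAM: with GCC this is typically done by tagging the function with a section attribute and letting the startup code copy that section to RAM. A rough sketch; the ".ramfunc" section name, the startup copy, and building this file in ARM mode (-marm) are all assumptions about the particular toolchain and BSP:

#include <stdint.h>

/* Assumed: the linker script places ".ramfunc" in SRAM (load address in
 * flash) and the startup code copies it over before main() runs.
 */
__attribute__((section(".ramfunc"), long_call, noinline))
int32_t fir_inner_loop(const int16_t *coef, const int16_t *sample, int taps)
{
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)coef[i] * sample[i];   /* runs from zero-waitstate SRAM */
    return acc;
}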
In article <4BAD45B3.2000507@a-t-m-e-l.com>, ulf@a-t-m-e-l.com says...
> Leon skrev:
> > On 25 Mar, 11:20, Peter <nos...@nospam9876.com> wrote:
> >> They have doubled their prices and the lead times are 18 weeks.
> >>
> >> Yet, others are making them OK.
> >>
> >> Are Atmel trying to get out of the business?
> >> x----------x
> >
> > They got rid of their fabs, and are now having to join the queue at
> > TSMC or wherever they get their chips made. They are probably having
> > to pay a lot more for them, because of demand for the manufacturing
> > facilities. Microchip have their own fabs, and seem able to keep up
> > with demand.
Different manufacturers have different levels of outsourcing, from everything outsourced to everything in-house. Sometimes only part of the flow is outsourced, for example because most of the in-house machinery is now set up for smaller geometries, so wafers for certain products are made elsewhere on a larger-geometry process.
> While at least some memory chips are outsourced, the AVRs are still > manufactured inside Atmel.
....
> If there is no stock, then it normally takes 16 weeks to
> produce new things for any semiconductor manufacturer.
>
> Quite often, the fab capacity is not the problem, but testing is.
> If you can't buy new testers, then capacity cannot increase.
> Companies doing test equipment can't deliver, because they
> have long lead times on components.

Hmmm...
Reminds me of an ASIC company whose customer in purchasing wanted to bring forward the next 6 months of production to that week, and asked "can't you just put more people on it?". At the time that would have been impossible even with stocks of wafers, as this was an avionics ASIC.

The testing procedure for this avionics ASIC was:

- Wafer test electronically at room temperature
- Package good parts
- Package test electronically at room temperature
- Place a large batch in an oven, power all devices with clocks attached, and leave all parts running for a week at 125 deg C
- After a week, slowly drop the temperature, then test electronically at room temperature
- Lower the temperature to -55 deg C and test electronically
- Parts needed a second packaging process and then a retest at room temperature

All with the full serial number of each device and batch testing logged. If new wafers are needed you can add 12 weeks in front of that.

Environmental chambers and testing over the full temperature range is a long job, and about every 12 to 18 months you have to strip down and replace ALL the internal wiring, connectors and boards. Imagine the setups required for testing up to 120 off 100 pin devices in environmental chambers, and how many you require. Designing the PCBs is also fun...

--
Paul Carpenter | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate
Ulf Samuelsson <ulf@a-t-m-e-l.com> wrote

>ATmega128A is probably a better choice.
We have a development kit for the 128 (bought ~ 2 years ago) so we will get a new one of those. What kind of price is the 128A, 1k+, these days?
Peter skrev:
> Ulf Samuelsson <ulf@a-t-m-e-l.com> wrote
>
>> ATmega128A is probably a better choice.
>
> We have a development kit for the 128 (bought ~ 2 years ago) so we
> will get a new one of those.
>
> What kind of price is the 128A, 1k+, these days?
No clue, but it should be priced lower than the ATmega128. It should have lower power consumption as well.

BR
Ulf Samuelsson
Jon Kirwan skrev:
> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
> <nospam.ulf@atmel.com> wrote:
>
>> <snip of LPC2xxx vs SAM7S discussion>
>> The 128 bit memory is overkill for thumb mode and just
>> wastes power.
>> <snip>
>
> Ulf, let me remind you of something you wrote about the SAM7:
>
> "In thumb mode, the 32 bit access gives you two
> instructions per cycle so in average this gives
> you 1 instruction per clock on the SAM7."
>
> I gather this is regarding the case where there is 1 wait
> state reading the 32-bit flash line -- so 2 clocks per line
> and thus the 1 clock per 16-bit instruction (assuming it
> executes in 1 clock.)
>
> Nico's comment about the NXP ARM, about the 128-bit wide
> flash line-width, would (I imagine) work about the same
> except that it reads at full clock rate speeds, no wait
> states. So I gather, if it works similarly, that there are
> eight thumb instructions per line (roughly.) I take it your
> point is that since each instruction (things being equal)
> cannot execute faster than 1 clock per, that it takes 8
> clocks to execute those thumb instructions.
>
Yes, the SAM7 is very nicely tuned to thumb mode. The LPC2 provides much more bandwidth than is needed when you run in thumb mode.

Due to the higher latency of the LPC's slower flash, the SAM7 will be better at certain frequencies, but the LPC will have a higher max clock frequency.

The real point is that you are not necessarily faster because you have a wide memory. The speed of the memory counts as well. There are a lot of parameters to take into account if you want to find the best part. People with different requirements will find different parts to be the best.

If you start to use high speed communications, then the PDC of the SAM7 serial ports tends to even out any difference in performance vs the LPC very quickly.
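For readers who have not used it, the PDC is Atmel's Peripheral DMA Controller: you hand it a buffer and a count and the peripheral moves data without CPU involvement. A rough sketch for a SAM7S USART receive; the base address, register offsets and bit name are quoted from memory of the AT91SAM7S memory map and should be checked against the datasheet:

#include <stdint.h>

#define US0_BASE   0xFFFC0000u                                 /* USART0 (assumed)  */
#define US0_RPR    (*(volatile uint32_t *)(US0_BASE + 0x100))  /* Receive Pointer   */
#define US0_RCR    (*(volatile uint32_t *)(US0_BASE + 0x104))  /* Receive Counter   */
#define US0_PTCR   (*(volatile uint32_t *)(US0_BASE + 0x120))  /* Transfer Control  */
#define PDC_RXTEN  (1u << 0)

static uint8_t rx_buf[256];

void start_pdc_receive(void)
{
    US0_RPR  = (uint32_t)rx_buf;   /* where the PDC should store incoming bytes */
    US0_RCR  = sizeof rx_buf;      /* how many bytes before the channel stops   */
    US0_PTCR = PDC_RXTEN;          /* enable the receive channel                */
    /* The CPU is now free; RCR reaching 0 (ENDRX) means the buffer is full. */
}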
> The discussion could move between discussing instruction > streams to discussing constant data tables and the like, but > staying on the subject of instructions for the following....
Yes, this will have an effect. Accessing a random word should be faster on the SAM7, while for copying a large area sequentially the 128 bit memory will be beneficial.
> > So the effect is that it takes the same number of clocks to > execute 1-clock thumb instructions on either system? > (Ignoring frequency, for now.) Or do I get that wrong?
Yes, the LPC will at certain frequencies have longer latency, so it will be marginally slower in thumb mode.
>
> You then discussed power consumption issues. Wouldn't it be
> the case that since the NXP ARM is accessing its flash at a
> 1/8th clock rate and the SAM7 is constantly operating its
> flash that the _average_ power consumption might very well be
> better with the NXP ARM, despite somewhat higher current when
> it is being accessed? Isn't the fact that the access cycle
> takes place far less frequently observed as a lower average?
As far as I understand the chip select for the internal flash is always active when you run at higher frequencies so there is a lot of wasted power.
> Perhaps the peak divided by 8, or so? (Again, keep the clock > rates identical [downgraded to SAM7 rates in the NXP ARM > case.]) Have you computed figures for both?
Best is to check the datasheet.

The CPU core used is another important parameter. The SAM7S uses the ARM7TDMI while most others use the ARM7TDMI-S (S = synthesizable), which inherently has 33% higher power consumption.
> > Jon
Nico Coesel skrev:
> Ulf Samuelsson <ulf@a-t-m-e-l.com> wrote:
>
>> TheM skrev:
>>> "Nico Coesel" <nico@puntnl.niks> wrote in message news:4bacf169.1721173156@news.planet.nl...
>>>> "TheM" <DontNeedSpam@test.com> wrote:
>>>>
>>>>> "Spehro Pefhany" <speffSNIP@interlogDOTyou.knowwhat> wrote in message news:5elnq5d2ncjvs91v1cu5dmt5tbntuhefg3@4ax.com...
>>>>>> On Thu, 25 Mar 2010 13:19:46 -0800, "Bob Eld" <nsmontassoc@yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> "Peter" <nospam@nospam9876.com> wrote in message
>>>>>>> news:9lhmq5plg1gr3sduo9n52mdi5g6iiqucqc@4ax.com...
>>>>>>>> They have doubled their prices and the lead times are 18 weeks.
>>>>> Is this limited to EEPROM/Memory only or uCPU as well?
>>>>>
>>>>> Definitely worth considering getting out of AVR.
>>>>> Do NXP ARM come with on-chip FLASH?
>>>> Yes, all of them have 128 bit wide flash that allows zero waitstate
>>>> execution at the maximum CPU clock.
>>> Not bad, I ordered a couple books on ARM off Amazon, may get into it finally.
>>> From what I see they are same price as AVR mega, low power and much faster.
>>> And NXP is very generous with samples.
>>>
>>> M
>>>
>> The typical 32 bitters of today are implemented using advanced
>> flash technologies which allows high density memories in small chip
>> areas, but they are not low power.
>>
>> The inherent properties of the process makes for high leakage.
>> When you see power consumption in sleep of around 1-2 uA,
>> this is when the chip is turned OFF.
>> Only a small part of the chip is powered, RTC and a few other things.
>>
>> When you implement in a 0.25u process or higher, you can have the chip
>> fully initialized and ready to react on input while using
>> 1-2 uA in sleep.
>>
>> That is a big difference.
>>
>> While the NXP devices gets zero waitstate from 128 bit bus,
>> this also makes them extremely power hungry.
>> An LPC ARM7 uses about 2 x the current of a SAM7.
>> It gets higher performance in ARM mode.
>>
>> The ARM mode has a price in code size, so if you want more features,
>> then you better run in Thumb mode. The SAM7 with 32 bit flash is
>> actually faster than the LPC when running in Thumb mode,
>> (at the same frequency) since the SAM7 uses a 33 MHz flash,
>> while the LPC uses a 24 MHz flash.
>> In thumb mode, the 32 bit access gives you two
>> instructions per cycle so in average this gives you 1 instruction per clock on the SAM7.
>
> I think this depends a lot on what method you use to measure this.
> Thumb code is expected to be slower than ARM code. You should test
> with Dhrystone and make sure the same C library is used since Dhrystone
> results also depend on the C library!
It is pretty clear that if you

* execute out of flash in thumb mode
* do not access flash for data transfers
* run the chips at equivalent frequencies
* run sequential fetch at zero waitstates

the difference will be the number of waitstates in non-sequential fetch.
>
>> Less waitstates means higher performance.
>> By copying a few 32 bit ARM routines to SRAM,
>> you can overcome that limitation.
>> You can get slightly higher top frequency out of the LPC,
>> but that again increases the power consumption.
>>
>> For Cortex-M3 I did some test on the new SAM3, which can be
>> configured to use both 64 bit or 128 bit memories.
>> With a 128 bit memory, you can wring about 5% extra performance
>> out of the chip compared to 64 bit operation.
>> From a power consumption point of view it is probably better
>> to increase the clock frequency by 5% than to enable the 128 bit mode.
>> It is therefore only the most demanding applications that have
>> any use for the 128 bit memory.
>>
>> Testing on other Cortex-M3 chips indicate similar results.
>>
>> Someone told me that they tried executing out of SRAM on an STM32
>> and this was actually slower than executing out of flash.
>> Executing out of external memory also appears to be a problem,
>> since there is no cache/burst and bandwidth seems to be lower
>> than equivalent ARM7 devices.
>
> That doesn't surprise me. From my experience with STR7 and the STM32
> datasheets it seems ST does a sloppy job putting controllers together.
> They are cheap but you don't get maximum performance.
>
>> Current guess is that the AHB bus has some delays due to
>> synchronization. Also if you execute out of SRAM
>> you are going to have conflicts with data access.
>> Something which is avoided when you execute out of flash.
>
> NXP has some sort of cache between the CPU and the flash on the M3
> devices. According to the documentation NXP's LPC1700 M3 devices use a
> Harvard architecture with 3 busses so multiple data transfers
> (CPU-flash, CPU-memory and DMA) can occur simultaneously. Executing
> from RAM would occupy one bus so you'll have less memory bandwidth to
> work with.
>
The SAM3 uses the same AHB bus as the ARM9. The "bus" is actually a series of multiplexers where each target has a multiplexer with an input for each bus master. As long as no one else wants to access the same target, a bus master will get unrestricted access.

If you execute from flash, you will get full access for the instruction bus (with the exception of the few constants). If you execute out of a single SRAM, you have to share access with the data transfers, which will slow you down.

BR
Ulf Samuelsson
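To make the bus-matrix point concrete: one common way to keep instruction fetches (from flash) and data traffic from fighting over the same AHB target is to place busy buffers in a different memory than the one you execute from. A sketch with GCC; the ".dma_buffers" section name and the matching linker-script output region are assumptions about the build setup:

#include <stdint.h>

/* Assumed: the linker script maps ".dma_buffers" to a separate SRAM bank,
 * so data traffic to this buffer and instruction fetches from flash hit
 * different targets on the AHB matrix.
 */
__attribute__((section(".dma_buffers"), aligned(4)))
static uint8_t adc_samples[1024];

/* Code stays in flash (the default), so fetching these instructions never
 * competes with reads and writes of adc_samples.
 */
uint32_t average_samples(void)
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < sizeof adc_samples; i++)
        sum += adc_samples[i];
    return sum / (uint32_t)sizeof adc_samples;
}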
On Sat, 27 Mar 2010 14:14:58 +0100, Ulf Samuelsson
<nospam.ulf@atmel.com> wrote:

>Jon Kirwan skrev:
>> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
>> <nospam.ulf@atmel.com> wrote:
>>
>>> <snip of LPC2xxx vs SAM7S discussion>
>>> The 128 bit memory is overkill for thumb mode and just
>>> wastes power.
>>> <snip>
>>
>> Ulf, let me remind you of something you wrote about the SAM7:
>>
>> "In thumb mode, the 32 bit access gives you two
>> instructions per cycle so in average this gives
>> you 1 instruction per clock on the SAM7."
>>
>> I gather this is regarding the case where there is 1 wait
>> state reading the 32-bit flash line -- so 2 clocks per line
>> and thus the 1 clock per 16-bit instruction (assuming it
>> executes in 1 clock.)
>>
>> Nico's comment about the NXP ARM, about the 128-bit wide
>> flash line-width, would (I imagine) work about the same
>> except that it reads at full clock rate speeds, no wait
>> states. So I gather, if it works similarly, that there are
>> eight thumb instructions per line (roughly.) I take it your
>> point is that since each instruction (things being equal)
>> cannot execute faster than 1 clock per, that it takes 8
>> clocks to execute those thumb instructions.
>
>Yes, the SAM7 is very nicely tuned to thumb mode.
>The LPC2 provides much more bandwidth than is needed
>when you run in thumb mode.
I think I gathered that much and didn't disagree, just wondered.
>Due to the higher latency for the LPC, due to slower >flash, the SAM7 will be better at certain frequencies, >but the LPC will have a higher max clock frequency.
I remember you writing that "SAM7 uses a 33 MHz flash, while the LPC uses a 24 MHz flash." It seems hard to imagine, though, except perhaps for data fetch situations or branching, it being actually slower. If it fetches something like 8 thumb instructions at a time, anyway. As another poster pointed out, the effective rate is much higher for sequential reads no matter how you look at it. So it would take branching or non-sequential data fetches to highlight the difference.

One would have to do an exhaustive, stochastic analysis of application spaces to get a good bead on all this. But ignorant of the details as I truly am right now, not having a particular application in mind and just guessing where I'd put my money if betting one way or another, I'd put it on 384 MB/sec memory over 132 MB/sec memory for net throughput.
>The real point is that you are not neccessarily >faster
Yes, but the key here is the careful "not necessarily" wording. "Not necessarily" is true enough, as one could construct specific circumstances where you'd be right. But it seems to me they'd be more your 'corner cases' than 'run of the mill.'
>because you have a wide memory. >The speed of the memory counts as well.
Of course. So people who seem to care about the final speed and little else should indeed do some analysis before deciding. But if they don't know their application well enough to make that comparison... hmm.
>There are a lot of parameters to take into account >if you want to get find the best part.
Yes. That seems to ever be true!
>People with different requirements will find different >parts to be the best.
Yes, no argument. I was merely curious about something else which you mostly didn't answer, so I suppose if I care enough I will have to go find out on my own.... see below.
>If you start to use high speed communications, then the >PDC of the SAM7 serial ports tend to even out any >difference in performance vs the LPC very quickly.
Some parts have such wonderfully sophisticated peripherals. Some of these are almost ancient (68332, for example.) So it's not only a feature of new parts, either. Which goes back to your point that there are a lot of parameters to take into account, I suppose.
>> The discussion could move between discussing instruction
>> streams to discussing constant data tables and the like, but
>> staying on the subject of instructions for the following....
>
>Yes, this will have an effect.
>Accessing a random word should be faster on the SAM7
>and, assuming you copy sequentially a large area
>having 128 bit memory will be beneficial.
The 'random' part being important here. In some cases, that may be important where the structures are 'const' and can be stored in flash and are accessed in a way that cannot take advantage of the 128-bit wide lines. A binary search on a calibration table with small table entry sizes, perhaps, might be a reasonable example that actually occurs often enough and may show off your point well. Other examples, such as larger element sizes (such as doubles or pairs of doubles) for that binary search or a FIR filter table used sequentially, might point the other way.
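As a made-up illustration of that access pattern: a binary search over a small 'const' calibration table in flash touches one more or less arbitrary entry per probe, so a wide fetch line buys little here. The table contents are invented for the example:

#include <stdint.h>

/* Hypothetical calibration table held in flash via 'const'. */
typedef struct {
    uint16_t raw;        /* ADC reading       */
    int16_t  value;      /* calibrated result */
} cal_entry_t;

static const cal_entry_t cal_table[] = {
    { 100, -40 }, { 900, 0 }, { 2100, 25 }, { 3300, 85 }, { 4000, 125 },
};

/* Return the calibrated value for the first entry with raw >= input.
 * Each probe is a near-random 4-byte read from flash.
 */
int16_t cal_lookup(uint16_t raw)
{
    unsigned lo = 0, hi = sizeof cal_table / sizeof cal_table[0] - 1;

    while (lo < hi) {
        unsigned mid = (lo + hi) / 2;
        if (cal_table[mid].raw < raw)
            lo = mid + 1;
        else
            hi = mid;
    }
    return cal_table[lo].value;
}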
>> So the effect is that it takes the same number of clocks to
>> execute 1-clock thumb instructions on either system?
>> (Ignoring frequency, for now.) Or do I get that wrong?
>
>Yes, the LPC will at certain frequencies have longer latency
>so it will be marginally slower in thumb mode.
I find this tough to stomach, when talking about instruction streams, unless there are lots of branches salted in the mix. I know I must have read somewhere someone's analysis of many programs and the upshot of this, but I think it was for the x86 system and a product of Intel's research department some years ago, and I've no idea how well that applies to the ARM core. I'm sure someone (perhaps you?) has access to such analyses and might share it here?
>> You then discussed power consumption issues. Wouldn't it be
>> the case that since the NXP ARM is accessing its flash at a
>> 1/8th clock rate and the SAM7 is constantly operating its
>> flash that the _average_ power consumption might very well be
>> better with the NXP ARM, despite somewhat higher current when
>> it is being accessed? Isn't the fact that the access cycle
>> takes place far less frequently observed as a lower average?
>
>As far as I understand the chip select for the internal flash
>is always active when you run at higher frequencies
>so there is a lot of wasted power.
By "at higher frequencies" do you have a particular number above which your comment applies and below which it does not? In any case, this is the answer I was looking for and you don't appear to answer now. Why would anyone "run the flash" when the bus isn't active? It seems.... well, bone-headed. And I can't recall any chip design being that poor. I've seen cases where an external board design (not done by chip designers, but more your hobbyist designer type) that did things like that. But it is hard for me to imagine a chip designer being that stupid. It's almost zero work to be smarter than that. So this suggests you want me to go study the situation. Maybe someone already knows, though, and can post it. I can hope.
>> Perhaps the peak divided by 8, or so? (Again, keep the clock >> rates identical [downgraded to SAM7 rates in the NXP ARM >> case.]) Have you computed figures for both? > >Best is to check the datasheet.
I wondered if you already knew the answer. I suppose not, now.
>The CPU core used is another important parameter. >The SAM7S uses the ARM7TDMI while most other uses the ARM7TDMI-S >(S = synthesizable) which inherently has 33 % higher power consumption.
I'm aware of the general issue. Your use of "most other" does NOT address itself to the subject at hand, though. It leaves open either possibility for the LPC2. But it's a point worth keeping in mind if you make these chips, I suppose. For the rest of us, it's just a matter of deciding which works better by examining the data sheet. We don't have the option to move a -S design to a crafted ASIC.

So this leaves some more or less interesting questions.

(1) Where is a quality report or two on the subject of instruction mix for ARM applications, broken down by application spaces that differ substantially from each other, and what are the results of these studies?

(2) Does the LPC2 device really operate the flash all the time? Or not?

(3) Is the LPC2 a -S (which doesn't matter that much, but since the topic is brought up it might be nice to put that to bed?)

I don't know.

Jon
Jon Kirwan skrev:
> On Sat, 27 Mar 2010 14:14:58 +0100, Ulf Samuelsson
> <nospam.ulf@atmel.com> wrote:
>
>> Jon Kirwan skrev:
>>> On Sat, 27 Mar 2010 08:15:03 +0100, Ulf Samuelsson
>>> <nospam.ulf@atmel.com> wrote:
>>>
>>>> <snip of LPC2xxx vs SAM7S discussion>
>>>> The 128 bit memory is overkill for thumb mode and just
>>>> wastes power.
>>>> <snip>
>>> Ulf, let me remind you of something you wrote about the SAM7:
>>>
>>> "In thumb mode, the 32 bit access gives you two
>>> instructions per cycle so in average this gives
>>> you 1 instruction per clock on the SAM7."
>>>
>>> I gather this is regarding the case where there is 1 wait
>>> state reading the 32-bit flash line -- so 2 clocks per line
>>> and thus the 1 clock per 16-bit instruction (assuming it
>>> executes in 1 clock.)
>>>
>>> Nico's comment about the NXP ARM, about the 128-bit wide
>>> flash line-width, would (I imagine) work about the same
>>> except that it reads at full clock rate speeds, no wait
>>> states. So I gather, if it works similarly, that there are
>>> eight thumb instructions per line (roughly.) I take it your
>>> point is that since each instruction (things being equal)
>>> cannot execute faster than 1 clock per, that it takes 8
>>> clocks to execute those thumb instructions.
>> Yes, the SAM7 is very nicely tuned to thumb mode.
>> The LPC2 provides much more bandwidth than is needed
>> when you run in thumb mode.
>
> I think I gathered that much and didn't disagree, just
> wondered.
>
>> Due to the higher latency for the LPC, due to slower
>> flash, the SAM7 will be better at certain frequencies,
>> but the LPC will have a higher max clock frequency.
>
> I remember you writing that "SAM7 uses a 33 MHz flash, while
> the LPC uses a 24 MHz flash." It seems hard to imagine,
> though, except perhaps for data fetch situations or
> branching, it being actually slower. If it fetches something
> like 8 thumb instructions at a time, anyway. As another
> poster pointed out, the effective rate is much higher for
> sequential reads no matter how you look at it. So it would
> take branching or non-sequential data fetches to highlight
> the difference.
>
> One would have to do an exhaustive, stochastic analysis of
> application spaces to get a good bead on all this. But
> ignorant of the details as I truly am right now, not having a
> particular application in mind and just guessing where I'd
> put my money if betting one way or another, I'd put it on 384
> MB/sec memory over 132 MB/sec memory for net throughput.
That is because you ignore the congestion caused by the fact that the ARM7 core only fetches 16 bits per access in thumb mode.

At 33 MHz, the CPU can only use 66 MB / second.
At 66 MHz, the CPU can only use 132 MB / second.

Since you can sustain 132 MB / second with a 33 MHz 32 bit memory, you do not need it to be wider to keep the pipeline running at zero waitstates for sequential fetch.

For non-sequential fetch, the width is not important, only the number of waitstates, and the SAM7 has the same or fewer waitstates than the LPC.

The 128 bit memory is really only useful for ARM mode. For thumb mode it is more or less a waste.
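To put numbers on that, a small sketch using the figures from this thread (32-bit flash at 33 MHz vs 128-bit flash at 24 MHz, and a Thumb-mode ARM7 core consuming 2 bytes per clock):

#include <stdio.h>

/* Sequential fetch bandwidth a flash interface can supply, in MB/s. */
static double supply_mb_s(double flash_mhz, int width_bits)
{
    return flash_mhz * (width_bits / 8);
}

/* Fetch bandwidth a Thumb-mode ARM7 core can consume: 2 bytes per clock. */
static double demand_mb_s(double core_mhz)
{
    return core_mhz * 2;
}

int main(void)
{
    printf("32-bit flash  @ 33 MHz supplies %.0f MB/s\n", supply_mb_s(33, 32));
    printf("128-bit flash @ 24 MHz supplies %.0f MB/s\n", supply_mb_s(24, 128));
    printf("Thumb core    @ 55 MHz demands  %.0f MB/s\n", demand_mb_s(55));
    printf("Thumb core    @ 66 MHz demands  %.0f MB/s\n", demand_mb_s(66));
    return 0;
}

The 132 MB/s from the 32-bit interface already covers Thumb demand up to 66 MHz, which is the point being made above: the 128-bit interface's 384 MB/s is spare capacity in Thumb mode.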
>
>> The real point is that you are not necessarily
>> faster
>
> Yes, but the key here is the careful "not necessarily"
> wording. Not necessarily, is true enough, as one could form
> specific circumstances where you'd be right. But it seems to
> me they'd be more your 'corner cases' than 'run of the mill.'
I don't think running in Thumb mode is a corner case.
>
>> because you have a wide memory.
>> The speed of the memory counts as well.
>
> Of course. So people who seem to care about the final speed
> and little else should indeed do some analysis before
> deciding. But if they don't know their application well
> enough to make that comparison... hmm.
>
>> There are a lot of parameters to take into account
>> if you want to find the best part.
>
> Yes. That seems to ever be true!
>
>> People with different requirements will find different
>> parts to be the best.
>
> Yes, no argument. I was merely curious about something else
> which you mostly didn't answer, so I suppose if I care enough
> I will have to go find out on my own.... see below.
>
>> If you start to use high speed communications, then the
>> PDC of the SAM7 serial ports tend to even out any
>> difference in performance vs the LPC very quickly.
>
> Some parts have such wonderfully sophisticated peripherals.
> Some of these are almost ancient (68332, for example.) So
> it's not only a feature of new parts, either. Which goes
> back to your point that there are a lot of parameters to take
> into account, I suppose.
>
>>> The discussion could move between discussing instruction
>>> streams to discussing constant data tables and the like, but
>>> staying on the subject of instructions for the following....
>> Yes, this will have an effect.
>> Accessing a random word should be faster on the SAM7
>> and, assuming you copy sequentially a large area
>> having 128 bit memory will be beneficial.
>
> The 'random' part being important here. In some cases, that
> may be important where the structures are 'const' and can be
> stored in flash and are accessed in a way that cannot take
> advantage of the 128-bit wide lines. A binary search on a
> calibration table with small table entry sizes, perhaps,
> might be a reasonable example that actually occurs often
> enough and may show off your point well. Other examples,
> such as larger element sizes (such as doubles or pairs of
> doubles) for that binary search or a FIR filter table used
> sequentially, might point the other way.
>
>>> So the effect is that it takes the same number of clocks to
>>> execute 1-clock thumb instructions on either system?
>>> (Ignoring frequency, for now.) Or do I get that wrong?
>> Yes, the LPC will at certain frequencies have longer latency
>> so it will be marginally slower in thumb mode.
>
> I find this tough to stomach, when talking about instruction
> streams, unless there are lots of branches salted in the mix.
> I know I must have read somewhere someone's analysis of many
> programs and the upshot of this, but I think it was for the
> x86 system and a product of Intel's research department some
> years ago and I've no idea how well that applies to the ARM
> core. I'm sure someone (perhaps you?) has access to such
> analyses and might share it here?
LPC with 1 waitstate at 33 MHz:

NOP  2  (fetches 8 instructions)
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
.........
Sum = 9

Same code with SAM7, 0 waitstates at 33 MHz:

NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
NOP  1  (fetches 1 instruction)
.........
Sum = 8

It should not be too hard to grasp.
>
>>> You then discussed power consumption issues. Wouldn't it be
>>> the case that since the NXP ARM is accessing its flash at a
>>> 1/8th clock rate and the SAM7 is constantly operating its
>>> flash that the _average_ power consumption might very well be
>>> better with the NXP ARM, despite somewhat higher current when
>>> it is being accessed? Isn't the fact that the access cycle
>>> takes place far less frequently observed as a lower average?
>> As far as I understand the chip select for the internal flash
>> is always active when you run at higher frequencies
>> so there is a lot of wasted power.
>
> By "at higher frequencies" do you have a particular number
> above which your comment applies and below which it does not?
Each chip designer makes their own choices. I know of some chips that only start strobing the flash chip select when running below 1 - 4 MHz.
> > In any case, this is the answer I was looking for and you > don't appear to answer now. Why would anyone "run the flash" > when the bus isn't active? It seems.... well, bone-headed. > And I can't recall any chip design being that poor. I've > seen cases where an external board design (not done by chip > designers, but more your hobbyist designer type) that did > things like that. But it is hard for me to imagine a chip > designer being that stupid. It's almost zero work to be > smarter than that.
This is an automatic thing which measures the clock frequency vs another clock frequency, and the "other" clock frequency is often not that quick.
> > So this suggests you want me to go study the situation. Maybe > someone already knows, though, and can post it. I can hope. > >>> Perhaps the peak divided by 8, or so? (Again, keep the clock >>> rates identical [downgraded to SAM7 rates in the NXP ARM >>> case.]) Have you computed figures for both? >> Best is to check the datasheet. > > I wondered if you already knew the answer. I suppose not, > now.
Looking at the LPC2141 datasheet, which seems to be the part closest to the SAM7S256, you get 57 mA @ 3.3V = 188 mW @ 60 MHz = 3.135 mW/MHz. The SAM7S datasheet gives 33 mA @ 3.3 V @ 55 MHz = 1.98 mW/MHz. You can, on the SAM7S, choose to feed VDDCORE from 1.8V. The SAM7S is specified with USB enabled, so this has to be used for the LPC as well for a fair comparison.
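Working those datasheet figures through (the numbers are the ones quoted just above):

#include <stdio.h>

/* mW per MHz = (mA * V) / MHz -- the figure of merit used above. */
static double mw_per_mhz(double ma, double volts, double mhz)
{
    return ma * volts / mhz;
}

int main(void)
{
    printf("LPC2141: %.3f mW/MHz\n", mw_per_mhz(57.0, 3.3, 60.0)); /* ~3.135 */
    printf("SAM7S  : %.3f mW/MHz\n", mw_per_mhz(33.0, 3.3, 55.0)); /* ~1.980 */
    return 0;
}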
>> The CPU core used is another important parameter.
>> The SAM7S uses the ARM7TDMI while most others use the ARM7TDMI-S
>> (S = synthesizable) which inherently has 33 % higher power consumption.
>
> I'm aware of the general issue. Your use of "most other"
> does NOT address itself to the subject at hand, though. It
> leaves open either possibility for the LPC2. But it's a
> point worth keeping in mind if you make these chips, I
> suppose. For the rest of us, it's just a matter of deciding
> which works better by examining the data sheet. We don't
> have the option to move a -S design to a crafted ASIC.
>
> So this leaves some more or less interesting questions.
>
> (1) Where is a quality report or two on the subject of
> instruction mix for ARM applications, broken down by
> application spaces that differ substantially from each other,
> and what are the results of these studies?
>
> (2) Does the LPC2 device really operate the flash all the
> time? Or not?
>
You do not find any figures in the datasheet indicating such a low power mode.
> (3) Is the LPC2 a -S (which doesn't matter that much, but > since the topic is brought up it might be nice to put that to > bed?)
Yes it is. It should be enough to look in the datasheet.
> I don't know. > > Jon
Ulf
