Atmel releasing FLASH AVR32 ?

Started by -jg March 19, 2007
On Mar 20, 4:16 am, Jim Granville <no.s...@designtools.maps.co.nz>
wrote:

> Compare with ARM7 line ? : "Pricing for the 512K Flash variants of the
> SAM7S, SAM7X and SAM7XC devices start at US$6 in quantities of 10,000
> units."
> - so there is a slight premium for the higher performance AVR32 core.
I'm guessing it makes more sense to compare AVR32 with ARM9? Once the Cortex A8 comes out, I think that will be a better comparison. It's my guess the AVR32 is targeting the same applications as the A8?

Eric
On Mar 20, 12:27 pm, "Eric" <englere_...@yahoo.com> wrote:
> On Mar 20, 4:16 am, Jim Granville <no.s...@designtools.maps.co.nz>
> wrote:
>
>> Compare with ARM7 line ? : "Pricing for the 512K Flash variants of the
>> SAM7S, SAM7X and SAM7XC devices start at US$6 in quantities of 10,000
>> units."
>> - so there is a slight premium for the higher performance AVR32 core.
>
> I'm guessing it makes more sense to compare AVR32 with ARM9? Once the
> Cortex A8 comes out, I think that will be a better comparison. It's my
> guess the AVR32 is targeting the same applications as the A8?
>
> Eric
Yes, the A8 has a 13-stage pipeline vs. 3 for the M3. I am sure you pay for it in price and power consumption.
"Eric" <englere_geo@yahoo.com> wrote in message 
news:1174422443.273008.115250@n59g2000hsh.googlegroups.com...
> On Mar 20, 4:16 am, Jim Granville <no.s...@designtools.maps.co.nz>
> wrote:
>
>> Compare with ARM7 line ? : "Pricing for the 512K Flash variants of the
>> SAM7S, SAM7X and SAM7XC devices start at US$6 in quantities of 10,000
>> units."
>> - so there is a slight premium for the higher performance AVR32 core.
>
> I'm guessing it makes more sense to compare AVR32 with ARM9?
The UC3 is similar to ARM7 and Cortex-M3 in terms of target market, performance, etc. ARM9 is significantly faster, but uses caches and external memory. Note AVR32 is the architecture, not one of the two implementations.
> Once the
> Cortex A8 comes out, I think that will be a better comparison. It's my
> guess the AVR32 is targeting the same applications as the A8?
Cortex-A8 is in a completely different league. The fastest AVR32 does 150MHz, I believe, and has 32-bit wide SIMD; Cortex-A8 runs at 1GHz and does 128-bit wide SIMD...

Wilco
Ulf Samuelsson wrote:

>> There is no point if you branch every few instructions as most programs
>> do...
>
> Unless you branch to another part of the page.
> I think most branches are pretty short, even though I do not have hard data.
>
>>>> Branch operations (unpredictable PC changes depending on user input
>>>> etc) will still run at 50 MHz but it is still a good gain..
>>
>> Not at all. If you run a CPU at 500MHz but branches take 10 cycles then
>> you're lucky if you get the performance of a 150MHz CPU with 5 times
>> the power consumption...
>>
>> The solution is to use a cache and branch prediction.
>
> Really, I think a large page memory and H/W multithreading is a much better
> solution.
> Cache and branch prediction is a waste of energy and gates.
> H/W multithreading simplifies the CPU,
> No need for nasty feedback muxes in the datapath allowing higher
> frequencies.
> No need for branch prediction, since you can execute computable threads
> while you are waiting for the flash access to complete.
>
>>> Maybe the sense amplifiers for the flash are large, or draw a lot of
>>> current.
>>> I still remember page mode DRAM memories with 4096 bits per page,
>>> and no one has been able to tell me why this is not possible with flash.
>>
>> Even if it were feasible, a cache with 1 line of 512 bytes is totally
>> useless.
>> A fully associative cache with 32 lines of 16 bytes would be better, but
>> likely still too small to be useful (about 4KB is the absolute minimum).
>> Combining prefetch with a branch target instruction cache would make
>> even better use of such a small cache.
>
> I think you need to think about worst case and best case behaviour.
> Adding a cache leads to more unpredictability.
> A cache can even reduce worst case performance since it
> can introduce delays in the critical path.
>
> I would not be surprised if a 512 byte page could fit an entire interrupt
> routine.
> If you can read the flash in one cycle in page mode,
> then I do not see how the cache/branch prediction brings a lot of benefit.
You are right, and wide reads will work well, until you hit the fishhook that page reads are absolute, whilst code relocates. This adds a compile-dependent variance on code execution. Imagine if your routine that fits well into one page has some minor changes elsewhere, and now it moves to sit across two pages...

Of course, the tools could be made smarter, so they page-snap code blocks if told to...

Fundamental memory structure has faster access on some pins than others (as some have to go through the cells, and some just de-mux the array out), but that is rarely spec'd into modern data sheets. I've seen memories that issue a pause/busy flag when they cross such boundaries, but can be faster on sequential read, and there was one memory (even Atmel's, IIRC) that did interleaved sequential reads.

I see the press release in the links above says "128-bit wide bus with 40ns Tacc" for the AT91SAM9XE512.

-jg
Eric wrote:

> On Mar 20, 4:16 am, Jim Granville <no.s...@designtools.maps.co.nz>
> wrote:
>
>> Compare with ARM7 line ? : "Pricing for the 512K Flash variants of the
>> SAM7S, SAM7X and SAM7XC devices start at US$6 in quantities of 10,000
>> units."
>> - so there is a slight premium for the higher performance AVR32 core.
>
> I'm guessing it makes more sense to compare AVR32 with ARM9? Once the
> Cortex A8 comes out, I think that will be a better comparison. It's my
> guess the AVR32 is targeting the same applications as the A8?
Well, seems Atmel want to target everything :)
Their press release says "Atmel's Cost-optimized AVR32 UC3 Core Targets ARM7/9 and Cortex-M3 Sockets", but this on-chip FLASH advance does move AVR32 from microprocessor usage to microcontroller usage.
ARM9s with flash are also appearing.

-jg
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etp769$te9$1@aioe.org...
>> There is no point if you branch every few instructions as most programs do...
>
> Unless you branch to another part of the page.
> I think most branches are pretty short, even though I do not have hard data.
That's true, but function calls are common too and they would typically branch between pages. And then you have the nasty case of a function or a loop split between 2 pages...
>> The solution is to use a cache and branch prediction.
>
> Really, I think a large page memory and H/W multithreading is a much better solution.
> Cache and branch prediction is a waste of energy and gates.
On the contrary, caches typically reduce power consumption as code runs faster and you avoid having to use the bus. A small and fast local memory always wins. Similarly, branch prediction makes a CPU go faster and so it burns less power to do a given task. Cortex-M3 has a special branch prediction scheme to improve performance when running from flash with wait states, so it makes sense even in low-end CPUs.
> H/W multithreading simplifies the CPU,
> No need for nasty feedback muxes in the datapath allowing higher frequencies.
> No need for branch prediction, since you can execute computable threads
> while you are waiting for the flash access to complete.
Unless the other threads are also branching... Multithreading is not relevant in the embedded space; it would add a lot of complexity and die area for hardly any gain. It really only makes sense on high-end CPUs, but even there the gains are not that impressive.
>>> Maybe the sense amplifiers for the flash are large, or draw a lot of
>>> current.
>>> I still remember page mode DRAM memories with 4096 bits per page,
>>> and no one has been able to tell me why this is not possible with flash.
>>
>> Even if it were feasible, a cache with 1 line of 512 bytes is totally useless.
>> A fully associative cache with 32 lines of 16 bytes would be better, but
>> likely still too small to be useful (about 4KB is the absolute minimum).
>> Combining prefetch with a branch target instruction cache would make
>> even better use of such a small cache.
>
> I think you need to think about worst case and best case behaviour.
> Adding a cache leads to more unpredictability.
A single page cache is still a cache. Performance of otherwise identical code is completely unpredictable due to code layout. Adding more cachelines evens this effect out, making performance more predictable.
> A cache can even reduce worst case performance since it
> can introduce delays in the critical path.
So would a page cache. That is the price you have to pay when improving performance: the best case is better but the worst case is typically worse. Overall it is a huge win.
> If you can read the flash in one cycle in page mode,
> then I do not see how the cache/branch prediction brings a lot of benefit.
If you can read the flash in one cycle then you don't need page mode! If reading takes several cycles then it makes sense to fetch more than one instruction. Page mode is bad because you fetch a whole page even if you only call a small function: fetch a 512-byte page to execute a 16-byte helper and over 95% of the bits read are wasted. So you burn power for fetching a lot of data you didn't need. A proper cache has smaller lines, thus reducing wastage.

Wilco
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet 
news:2y_Lh.16902$NK3.2627@newsfe6-win.ntli.net...
> > "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message > news:etp769$te9$1@aioe.org... >>> There is no point if you branch every few instructions as most programs >>> do... >> >> Unless you branch to another part of the page. >> I think most branches are pretty short, even though I do not have hard >> data. > > That's true, but function calls are common too and they would typically > branch between pages. And then you have the nasty case of a function > or a loop split between 2 pages...
Fixed by compiler pragma...
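For concreteness, here is a minimal sketch of what such page-snapping could look like at the source level, assuming a GCC-style toolchain and a hypothetical 512-byte flash page (both are assumptions for illustration, not anything from this thread):

    /* Hypothetical page-snap: force a small routine to start on a
     * 512-byte flash page boundary so unrelated code changes elsewhere
     * cannot push it across two pages. GCC's aligned attribute on a
     * function requests a minimum start alignment from the linker. */
    __attribute__((aligned(512)))
    void uart_poll(void)
    {
        /* ...small routine, intended to fit entirely in one page... */
    }

The same effect could come from the linker script instead of source attributes; either way the tools, not the programmer, keep the layout stable.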
>>> The solution is to use a cache and branch prediction.
>>
>> Really, I think a large page memory and H/W multithreading is a much
>> better solution.
>> Cache and branch prediction is a waste of energy and gates.
>
> On the contrary, caches typically reduce power consumption as code
> runs faster and you avoid having to use the bus. A small and fast local
> memory always wins.
On an ARM7, adding a cache also adds one wait state to all non-cache accesses.
> Similarly, branch prediction makes a CPU go faster and so it burns less
> power to do a given task. Cortex-M3 has a special branch prediction
> scheme to improve performance when running from flash with wait
> states, so it makes sense even in low-end CPUs.
Branch prediction cost is chasing an ever-eluding target. With multithreading you can swap in a computable process and use EVERY cycle.
>> H/W multithreading simplifies the CPU,
>> No need for nasty feedback muxes in the datapath allowing higher
>> frequencies.
>> No need for branch prediction, since you can execute computable threads
>> while you are waiting for the flash access to complete.
>
> Unless the other threads are also branching...
> Multithreading is not relevant in the embedded space, it would add a lot
> of complexity and die area for hardly any gain.
Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling Bluetooth, WLAN, GPS etc., just because no one has designed proper multithreading for embedded.
> It really only makes sense on high-end
> CPUs, but even there the gains are not that impressive.
If you believe that, you don't understand multithreading for embedded. The purpose is not to increase performance, it is to improve real-time response so you do not have to have multiple CPUs.
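As a toy illustration of that real-time argument, here is a small C model of barrel-style hardware multithreading: each cycle the core issues from the next thread that is not waiting on a flash fetch. All the numbers (thread count, wait states, miss pattern) are made up for the sketch and model no real AVR32 or ARM part:

    /* Toy model: N hardware threads share one pipeline. Each cycle the
     * core issues from the next ready thread, so flash wait states are
     * hidden by other threads' work. Compile with: cc -std=c99 barrel.c */
    #include <stdio.h>

    #define NTHREADS 4
    #define CYCLES   20

    int main(void)
    {
        int stall[NTHREADS] = {0};  /* remaining wait states per thread */
        int next = 0;               /* round-robin starting point */

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            /* one wait state elapses for every stalled thread */
            for (int i = 0; i < NTHREADS; i++)
                if (stall[i] > 0)
                    stall[i]--;

            /* issue from the first ready thread, round-robin */
            int issued = -1;
            for (int k = 0; k < NTHREADS; k++) {
                int t = (next + k) % NTHREADS;
                if (stall[t] == 0) {
                    issued = t;
                    break;
                }
            }
            if (issued < 0) {
                printf("cycle %2d: bubble (all threads waiting)\n", cycle);
                continue;
            }

            /* pretend a slow flash fetch starts every 4th cycle and
             * costs the issuing thread two wait states (made up) */
            if (cycle % 4 == 3)
                stall[issued] = 2;

            printf("cycle %2d: thread %d issues\n", cycle, issued);
            next = (issued + 1) % NTHREADS;
        }
        return 0;
    }

With enough ready threads the issue slot is filled almost every cycle, and each thread's worst-case timing depends only on its own fetch pattern, not on what a shared cache happens to contain.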
>>>> Maybe the sense amplifiers for the flash are large, or draw a lot of
>>>> current.
>>>> I still remember page mode DRAM memories with 4096 bits per page,
>>>> and no one has been able to tell me why this is not possible with flash.
>>>
>>> Even if it were feasible, a cache with 1 line of 512 bytes is totally
>>> useless.
>>> A fully associative cache with 32 lines of 16 bytes would be better, but
>>> likely still too small to be useful (about 4KB is the absolute minimum).
>>> Combining prefetch with a branch target instruction cache would make
>>> even better use of such a small cache.
>>
>> I think you need to think about worst case and best case behaviour.
>> Adding a cache leads to more unpredictability.
>
> A single page cache is still a cache. Performance of otherwise identical
> code is completely unpredictable due to code layout.
Can, and should, be handled by tools.
> Adding more
> cachelines evens this effect out, making performance more predictable.
No, the unpredictability comes from jumping to a place and, instead of accessing memory to fetch the page, getting a cache hit, and then your timing is screwed.
>> A cache can even reduce worst case performance since it
>> can introduce delays in the critical path.
>
> So would a page cache. That is the price you have to pay when
> improving performance: the best case is better but the worst
> case is typically worse. Overall it is a huge win.
No, it is not a win if you have to guarantee that a job completes in a certain time.
>> If you can read the flash in one cycle in page mode,
>> then I do not see how the cache/branch prediction brings a lot of
>> benefit.
>
> If you can read the flash in one cycle then you don't need page mode!
> If reading takes several cycles then it makes sense to fetch more than
> one instruction. Page mode is bad because you fetch a whole page
> even if you only call a small function.
> So you burn power for fetching a
> lot of data you didn't need. A proper cache has smaller lines, thus
> reducing wastage.
The cache in itself draws power, and you cannot compare accesses to the cache with accesses to flash memory. Totally different technology.

I have never been able to get any hard data on this, but I suspect that flash sense amplifiers do not exhibit a linear curve for access speed vs current.

You have to run the cached CPU at a higher clock frequency to compensate for the loss of worst case performance.
> Wilco
--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not be shared by my employer Atmel Nordic AB
Ulf Samuelsson wrote:
> Branch prediction cost is chasing an ever-eluding target.
> With multithreading you can swap in a computable process and use EVERY
> cycle.
I'm with you up to this point, but the challenge with hard-real-time multithreading is that the code fetches feeding that "computable process" still have to come from somewhere.

I can see multithreading doing good things for removing SW task switches and lowering interrupt latencies, but unless you do fancy things with the code pathway, you are actually thrashing the memory about even more.

One solution is what some call a locked cache, where small critical code is fetched from fast local RAM; others really do have separate cores and separate memory pathways.
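A minimal sketch of that locked-cache/local-RAM approach, assuming a GCC toolchain and an invented .fastcode section that the linker script would place in on-chip SRAM (with startup code copying it there from flash):

    /* Hypothetical placement of a critical handler into on-chip SRAM so
     * its instruction fetches never depend on flash wait states. The
     * ".fastcode" section name is made up; the linker script must
     * locate it in RAM and the startup code must copy it from flash. */
    __attribute__((section(".fastcode")))
    void timer_isr(void)
    {
        /* time-critical work: fetched from single-cycle local RAM */
    }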
> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
> Bluetooth, WLAN, GPS etc., just because no one has designed
> proper multithreading for embedded.
but this is also a memory-bandwidth problem. If you had your magic CPU that could do 5 x 20 MIPS, how do you keep that fed with today's memory technologies?

IP core vendors do not design memory, so they get fancier and fancier with the cache handling, and of course they pitch peak MIPS, not real-world MIPS.

-jg
Jim Granville wrote:
> You are right, and wide reads will work well, until you hit the fishhook
> that page reads are absolute, whilst code relocates.
>
> This adds a compile-dependent variance on code execution.
>
> Imagine if your routine that fits well into one page has some
> minor changes elsewhere, and now it moves to sit across two pages...
>
> Of course, the tools could be made smarter, so they page-snap code
> blocks if told to...
>
> Fundamental memory structure has faster access on some pins than
> others (as some have to go through the cells, and some just de-mux the
> array out), but that is rarely spec'd into modern data sheets.
>
> I've seen memories that issue a pause/busy flag when they cross such
> boundaries, but can be faster on sequential read, and there was one
> memory (even Atmel's, IIRC) that did interleaved sequential reads.
I've found the device, an AT27LV1026, and it dates from 1999, when it offered 35ns (double speed) sequential access, _without_ page boundary gotchas.

It was a clever way to get more bandwidth from memory busses, and even reduced the pin count needed via the ALE, but because CPUs are designed for very 'dumb' memory interfaces, the idea has never hit critical mass.

-jg
"Jim Granville" <no.spam@designtools.maps.co.nz> skrev i meddelandet 
news:4600f533@clear.net.nz...
> Ulf Samuelsson wrote:
>> Branch prediction cost is chasing an ever-eluding target.
>> With multithreading you can swap in a computable process and use EVERY
>> cycle.
>
> I'm with you up to this point, but the challenge with hard-real-time
> multithreading is that the code fetches feeding that "computable process"
> still have to come from somewhere.
If you fetch a large chunk of code in each fetch to a prefetch buffer, this is not a problem.
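A rough model of what such a wide prefetch buffer buys, assuming a hypothetical 512-byte page held in front of the flash (a sketch only, not any real flash controller):

    /* Toy model of a one-page prefetch buffer: a fetch hits if the PC
     * stays inside the currently buffered page; crossing into a new
     * page triggers one wide, multi-cycle flash read. The 512-byte
     * page (9 address bits) is a made-up figure. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 9

    static uint32_t buffered_page = UINT32_MAX;
    static unsigned page_reads;          /* count of slow flash accesses */

    static int fetch_hits(uint32_t pc)
    {
        uint32_t page = pc >> PAGE_BITS;
        if (page == buffered_page)
            return 1;                    /* served from the buffer */
        buffered_page = page;
        page_reads++;                    /* wide flash read, wait states */
        return 0;
    }

    int main(void)
    {
        /* straight-line code: one page read serves many fetches */
        for (uint32_t pc = 0; pc < 0x400; pc += 4)
            fetch_hits(pc);
        fetch_hits(0x8000);              /* a branch to another page */
        printf("page reads: %u\n", page_reads);  /* prints 3 */
        return 0;
    }

Straight-line code costs one wide read per page; the worry raised earlier in the thread is exactly the branch case, where every cross-page jump pays the full page fetch again.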
> I can see multithreading doing good things for removing SW task switches
> and lowering interrupt latencies, but unless you do fancy things with
> the code pathway, you are actually thrashing the memory about even more.
No, I see embedded multithreading as one thread accessing external memory while all the other threads (mostly) access internal high-bandwidth memory, without any nasty cache in between.
> One solution is what some call a locked cache, where small critical
> code is fetched from fast local RAM; others really do have separate
> cores and separate memory pathways.
Single core, single data bus, and plenty of register banks and program counters. Any cache will be used by the generic thread; all other threads run on tightly coupled memory.
>> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
>> Bluetooth, WLAN, GPS etc., just because no one has designed
>> proper multithreading for embedded.
>
> but this is also a memory-bandwidth problem.
> If you had your magic CPU that could do 5 x 20 MIPS, how do you
> keep that fed with today's memory technologies?
You can do a WLAN MAC in tens of kB with a 20 MIPS CPU. No need to have external memory. Same for Bluetooth, implementing up to HCI. The Bluetooth stack will run on the "generic" thread.
> IP core vendors do not design memory, so they get fancier and fancier
> with the cache handling, and of course they pitch peak MIPS, not
> real-world MIPS.
>
> -jg
--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not be shared by my employer Atmel Nordic AB