
Atmel releasing FLASH AVR32 ?

Started by -jg March 19, 2007
Ulf Samuelsson wrote:

> "Jim Granville" <no.spam@designtools.maps.co.nz> skrev i meddelandet > news:4600f533@clear.net.nz... > >>Ulf Samuelsson wrote: >> >>>Branch prediction cost is chasing an ever eluding target. >>>With multithreading you can swap in a computable process and use EVERY >>>cycle. >> >>I'm with you up to this point, but the challenge with hard-real-time >>multithread, is the code-fetches feeding that "computable process" still >>has to come from somewhere ? > > > If you fetch a large chunk of code in each fetch to a prefetch buffer > this is not a problem. > > >> I can see Multithread doing good things for removing SW taskswitch, >>and lowering interrupt latencies, but unless you do fancy things with >>the code pathway, you are actually thrashing the memory about even more. >> > > No, I see embedded multithreading as one threa accessing external memory > while all the other threads (mostly) accesses internal high bandwidth memory > without any nasty cache in between.
I think you have mentally added quite a bit of hardware. If you do what
you describe, then you need a wide buffer per thread? - so you have
dictated a quite special memory architecture, and that has to be on-chip.
[It is still simple, and deterministic to a point, but it is special.]

If you extend that wide buffer to be interleaved (see the AT27LV1026),
then you can cross a boundary (sequentially) and not have that affect
things - so the tools can be simpler.

Someone like Atmel could do this, but the IP suppliers who sell
microprocessors as microcontrollers are pushing in a different direction.
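To make the interleaving concrete, here is a toy C model (sizes and names
invented for the sketch) of two interleaved prefetch banks: even-numbered
lines live in one bank and odd-numbered lines in the other, so a
sequential fetch that crosses a line boundary reads from the other bank
while the first one refills.

#include <stdbool.h>
#include <stdint.h>

/* Toy model of two interleaved prefetch banks, 16 bytes each. Even-
 * numbered 16-byte lines map to bank 0 and odd lines to bank 1, so
 * sequential execution that crosses a line boundary hits the other
 * bank while this one can be refilled in the background. */
#define LINE_BYTES 16u

struct bank {
    uint32_t line;              /* number of the line currently held */
    bool     valid;
    uint8_t  data[LINE_BYTES];
};

static struct bank banks[2];

/* Returns true on a hit; on a miss the hardware would start a wide
 * flash fetch to refill the selected bank. */
static bool prefetch_hit(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    struct bank *b = &banks[line & 1u];   /* interleave on line number */
    return b->valid && b->line == line;
}

-jg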
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etqnn5$v66$1@aioe.org...
> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet > news:2y_Lh.16902$NK3.2627@newsfe6-win.ntli.net... >> >> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etp769$te9$1@aioe.org...
>> That's true, but function calls are common too and they would typically
>> branch between pages. And then you have the nasty case of a function
>> or a loop split between 2 pages...
>
> Fixed by compiler pragma...
Easy to say, a bit harder in reality. If you don't care about code size
you could align big functions to 512-byte boundaries and pack small
functions into the gaps. But even that is hardly a solution, as every
minor change in the code results in a different memory layout, making
performance unpredictable. Basically it is an unsolvable problem.
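For what it's worth, the alignment half of that workaround is easy to
express with GCC-style toolchains; a minimal sketch (the 512-byte page
size and the function name are just this thread's example):

/* Pin a hot function to a 512-byte flash page boundary so it never
 * straddles two pages. GCC/Clang attribute syntax; the 512-byte page
 * size is the example used in this thread, not a fixed constant. */
__attribute__((aligned(512)))
void hot_loop(void)
{
    /* ... time-critical code ... */
}

The hard part, as noted above, is that the packing of everything else
still shifts on every rebuild.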
> On an ARM7, adding a cache also adds one waitstate to all non-cache
> accesses.
No, a cache doesn't impact other accesses to non-cacheable memory areas. A local flash cache is something you could just drop into an existing design without even worrying about needing to turn it on or flush it. It's completely transparent.
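To illustrate "transparent": a toy C model of a small direct-mapped
flash line cache (all sizes, and the flash_read helper, are invented for
the sketch). A fetch either hits or silently refills a line; nothing
outside the fetch path needs to know the cache exists.

#include <stdbool.h>
#include <stdint.h>

/* Toy model of a tiny direct-mapped flash line cache: 8 lines of
 * 32 bytes. All sizes are invented for the sketch. */
#define NLINES    8u
#define LINE_SIZE 32u

struct line { uint32_t tag; bool valid; uint8_t data[LINE_SIZE]; };
static struct line cache[NLINES];

/* Assumed low-level flash read; not a real driver API. */
extern void flash_read(uint32_t addr, uint8_t *dst, uint32_t len);

uint8_t icache_fetch(uint32_t addr)
{
    uint32_t line_no = addr / LINE_SIZE;
    struct line *l = &cache[line_no % NLINES];
    if (!l->valid || l->tag != line_no) {       /* miss: refill line */
        flash_read(line_no * LINE_SIZE, l->data, LINE_SIZE);
        l->tag = line_no;
        l->valid = true;
    }
    return l->data[addr % LINE_SIZE];           /* hit path */
}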
>> Similarly, branch prediction makes a CPU go faster and so it burns less
>> power to do a given task. Cortex-M3 has a special branch prediction
>> scheme to improve performance when running from flash with wait
>> states, so it makes sense even in low-end CPUs.
>
> Branch prediction cost is chasing an ever-eluding target.
Branch prediction is pretty trivial, as branches are very predictable. A
small global branch predictor (for example, as used in the ARM1156) gives
amazingly good prediction at a negligible hardware cost.
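For a feel for how little hardware such a predictor needs, here is a
sketch of a generic gshare-style global predictor with 2-bit saturating
counters (a textbook scheme, not a description of the ARM1156's actual
implementation):

#include <stdbool.h>
#include <stdint.h>

/* Gshare-style global branch predictor: the global history register
 * XORed with the branch PC indexes a table of 2-bit saturating
 * counters. */
#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t  counters[TABLE_SIZE]; /* 0..3: strong/weak not-taken/taken */
static uint32_t history;              /* recent branch outcomes, 1 bit each */

static uint32_t predictor_index(uint32_t pc)
{
    return ((pc >> 2) ^ history) & (TABLE_SIZE - 1u);
}

bool predict_taken(uint32_t pc)
{
    return counters[predictor_index(pc)] >= 2;
}

void predictor_update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[predictor_index(pc)];
    if (taken && *c < 3)
        (*c)++;                       /* saturate towards taken */
    else if (!taken && *c > 0)
        (*c)--;                       /* saturate towards not-taken */
    history = (history << 1) | (taken ? 1u : 0u);
}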
> With multithreading you can swap in a computable process and use EVERY cycle.
So what? There are few wasted cycles on modern embedded CPUs. Only very high-end CPUs are waiting a lot for slow memory.
>> Multithreading is not relevant in the embedded space, it would add a lot
>> of complexity and die area for hardly any gain.
>
> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
> Bluetooth, WLAN, GPS etc., just because no one has designed
> proper multithreading for embedded.
No, phones are extremely integrated and usually have only one CPU, one
DSP and perhaps a microcontroller in the flash card. Hardware
multithreading doesn't give much performance on a high-end CPU, and it
gives almost no benefit on a low-end one. Less than 10% of the memory
bandwidth is unused in an ARM7, so running a second thread either means
it runs at 10% of the maximum speed or it slows down the main thread.
>> It really only makes sense on high-end
>> CPUs, but even there the gains are not that impressive.
>
> If you believe that, you don't understand multithreading for embedded.
> The purpose is not to increase performance, it is to improve real-time
> response so you do not have to have multiple CPUs.
You don't understand multithreading at all. Interrupt latency is completely unaffected by multithreading. Whether you run 2 interrupts in parallel at half the speed or one after the other at full speed is irrelevant. You confuse multiprocessing with multithreading. A 2-core CPU can indeed deal with 2 interrupts in parallel at full speed.
>> Adding more
>> cachelines evens this effect out, making performance more predictable.
>
> No, your unpredictability comes from jumping to a place and, instead of
> accessing memory to fetch the page, you get a cache hit - and then your
> timing is screwed.
It is impossible to run code at a predictable speed, so you're screwed no matter whether you use a cache or not.
>>> A cache can even reduce worst case performance since it
>>> can introduce delays in the critical path.
>>
>> So would a page cache. That is the price you have to pay when
>> improving performance: the best case is better but the worst
>> case is typically worse. Overall it is a huge win.
>
> No, it is not a win if you have to guarantee that a job completes
> in a certain time.
Wrong. Code is highly repetitive, so even if you assume the cache is invalidated at the start of a task, using a cache results in much faster execution.
> The cache in itself draws power, and you cannot compare
> accesses to the cache with accesses to flash memory.
Of course the cache burns power, but then you're not using the flash.
Which uses less power is highly dependent on their size and
implementation. From what I've heard, caches are extremely efficient for
sequential accesses - i.e. code accesses.
> You have to run the cached CPU at a higher clock frequency to compensate
> for the loss of worst-case performance.
No, it would be virtually impossible to find code that actually can't
meet its deadline with a cache.

Wilco
>> Fixed by compiler pragma...
>
> Easy to say, a bit harder in reality. If you don't care about code size
> you could align big functions to 512-byte boundaries and pack small
> functions into the gaps. But even that is hardly a solution, as every
> minor change in the code results in a different memory layout, making
> performance unpredictable. Basically it is an unsolvable problem.
I assume that there are certain critical paths which need this
determinism. Those can be handled by pragmas. It is also entirely
possible that most threads only execute out of zero-waitstate SRAM.
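The usual way to get those deterministic paths into zero-waitstate SRAM
with a GCC-style toolchain is a section attribute plus matching
linker-script plumbing; a hedged sketch (the ".ramfunc" section name and
the function are invented for the example):

/* Place a time-critical handler in on-chip zero-waitstate SRAM
 * instead of flash. GCC/Clang attribute syntax; the ".ramfunc"
 * section name and the linker-script/startup code that copies it
 * from flash to SRAM are assumed, not shown. */
__attribute__((section(".ramfunc"), noinline))
void critical_path(void)
{
    /* deterministic, single-cycle-fetch code */
}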
>> On an ARM7, adding a cache also adds one waitstate to all non-cache
>> accesses.
>
> No, a cache doesn't impact other accesses to non-cacheable
> memory areas. A local flash cache is something you could
> just drop into an existing design without even worrying about
> needing to turn it on or flush it. It's completely transparent.
Adding a cache to the ARM7 CPU (not to the flash) will add a waitstate
to ALL non-hit accesses, according to chip designers. It also adds
waitstates to ARM9s if you put the memory on the AMBA bus. The only way
to allow no-waitstate operation is to put the SRAM in TCM.
>> With multithreading you can swap in a computable process and use EVERY
>> cycle.
>
> So what? There are few wasted cycles on modern embedded CPUs.
> Only very high-end CPUs are waiting a lot for slow memory.
If you make a jump to a location outside the cache, then you are dead in
the water:
- synch with the AMBA bus
- synch with the bus interface
- SDRAM precharge cycle
- 70 ns access...

With a multithreaded CPU you can use those cycles for something good.
>>> Multithreading is not relevant in the embedded space, it would add a lot
>>> of complexity and die area for hardly any gain.
>>
>> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
>> Bluetooth, WLAN, GPS etc., just because no one has designed
>> proper multithreading for embedded.
>
> No, phones are extremely integrated and usually have only one CPU,
> one DSP and perhaps a microcontroller in the flash card.
Then you do not know what a modern phone looks like.
Each Bluetooth chip normally has an ARM.
Each WLAN chip normally has one or more ARMs.
On smart phones, you have a GSM/WCDMA controller and an application CPU.
GPS functions will add one more ARM.
Then you have a micro doing the charging algorithm.

It quickly adds up.
> Hardware multithreading doesn't give much performance on a high-end
> CPU, and it gives almost no benefit on a low-end one. Less than
> 10% of the memory bandwidth is unused in an ARM7, so running a
> second thread either means it runs at 10% of the maximum speed
> or it slows down the main thread.
Multithreading for embedded systems is not about increasing performance.
It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
MIPS, with a single CPU which can run 2 x 20 MIPS threads. I.e., it is
trying to fix the real-time response problem.

It is cheaper to have one CPU doing the job of two CPUs than having two
CPUs each doing the job of half a CPU. Someone is going to get very rich
once they understand this. I am too lazy...
>>> It really only makes sense on high-end
>>> CPUs, but even there the gains are not that impressive.
You are locked into conventional thoughts on multithreading.
>> If you believe that, you don't understand multithreading for embedded.
>> The purpose is not to increase performance, it is to improve real-time
>> response so you do not have to have multiple CPUs.
>
> You don't understand multithreading at all. Interrupt latency is
> completely unaffected by multithreading. Whether you run 2 interrupts
> in parallel at half the speed or one after the other at full speed is
> irrelevant.
Not if you need both interrupts to respond within 200 ns. With
multithreading, you do not even need interrupts; you can schedule a
thread.
> You confuse multiprocessing with multithreading. A 2-core CPU can
> indeed deal with 2 interrupts in parallel at full speed.
No, I don't. A multithreaded CPU running at 400 MHz can do the task of
40 CPUs running at 10 MHz.
> It is impossible to run code at a predictable speed, so you're
> screwed no matter whether you use a cache or not.
No, you can measure how many cycles each thread is using within a
certain time quantum, and ensure that each thread gets its fair share.
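A rough C model of that accounting (all structures and numbers
hypothetical): each hardware thread carries a cycle budget per quantum,
the issue logic only picks threads still under budget, and the budgets
refill when the quantum ends.

#include <stdint.h>

/* Hypothetical per-thread cycle budgeting within a time quantum:
 * each cycle the issue logic picks the first ready thread that
 * still has budget left; budgets refill when the quantum ends. */
#define NTHREADS 4u

struct hwthread {
    uint32_t budget;   /* cycles allowed per quantum */
    uint32_t used;     /* cycles consumed so far     */
    int      ready;    /* 1 if not stalled on memory etc. */
};

static struct hwthread th[NTHREADS];

int pick_next_thread(void)        /* called once per cycle */
{
    for (unsigned i = 0; i < NTHREADS; i++) {
        if (th[i].ready && th[i].used < th[i].budget) {
            th[i].used++;         /* charge this cycle to the winner */
            return (int)i;
        }
    }
    return -1;                    /* all stalled or over budget */
}

void end_of_quantum(void)
{
    for (unsigned i = 0; i < NTHREADS; i++)
        th[i].used = 0;           /* every thread gets its share again */
}

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB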
Wilco Dijkstra wrote:
> It is impossible to run code at a predictable speed, so you're
> screwed no matter whether you use a cache or not.
?! - what?
Or are you talking only within the ARM subset of the CPU universe here?

-jg
Ulf Samuelsson wrote:
>> Hardware multithreading doesn't give much performance on a high-end
>> CPU, and it gives almost no benefit on a low-end one. Less than
>> 10% of the memory bandwidth is unused in an ARM7, so running a
>> second thread either means it runs at 10% of the maximum speed
>> or it slows down the main thread.
>
> Multithreading for embedded systems is not about increasing performance.
> It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
> MIPS, with a single CPU which can run 2 x 20 MIPS threads.
> I.e., it is trying to fix the real-time response problem.
Plus, once you have this, you can often drop the MAX clock speed, which
may have been hiked in the first place to try and reduce the SW
latencies to a tolerable level....
> It is cheaper to have one CPU doing the job of two CPUs
> than having two CPUs each doing the job of half a CPU.
> Someone is going to get very rich, once they understand this.
> I am too lazy...
For an example of someone already doing this, look at Ubicom's devices.
It's what I'd call hard multithreading, where they have timeslices and
can allocate them to tasks: if you want, you can map 29/64 to a
high-priority task, 4/64 to lower-priority ones, and 1/64 to a
background watchdog-type task, and get full independence. (etc)

Then there is the Parallax Propeller: multiple cores, with small code
storage per core.
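That hard-timeslice scheme can be modeled as a fixed slot table walked
one entry per cycle; a toy C sketch using the 29/64, 4/64 and 1/64 split
from above (the table layout and thread ids are invented):

#include <stdint.h>

/* Toy model of hard timeslicing: a fixed 64-slot table is walked one
 * slot per cycle, and each slot names the hardware thread that owns
 * that cycle. A thread with 29 slots gets exactly 29/64 of the
 * machine, no matter what the other threads do. */
#define SLOTS 64u

static uint8_t  slot_table[SLOTS];   /* thread id per slot   */
static uint32_t slot;                /* current walk position */

void timeslice_init(void)
{
    for (unsigned i = 0; i < SLOTS; i++)
        slot_table[i] = 3;           /* default: best-effort thread 3 */
    for (unsigned i = 0; i < 29; i++)
        slot_table[i] = 0;           /* 29/64: high-priority thread 0 */
    for (unsigned i = 29; i < 33; i++)
        slot_table[i] = 1;           /* 4/64: lower-priority thread 1 */
    slot_table[63] = 2;              /* 1/64: watchdog-type thread 2  */
}

/* Called once per CPU cycle: returns the thread that issues now. */
uint8_t next_thread(void)
{
    uint8_t t = slot_table[slot];
    slot = (slot + 1u) % SLOTS;
    return t;
}

-jg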
On Thu, 22 Mar 2007 10:37:31 +1200, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

> Wilco Dijkstra wrote:
>>
>> It is impossible to run code at a predictable speed, so you're
>> screwed no matter whether you use a cache or not.
>
> ?! - what?
> Or are you talking only within the ARM subset of the CPU universe here?
That sounds like the explanation. There are CPUs with exact, predictable
execution times, where the only place for unavoidable variability is in
recognizing and synchronizing interrupt code execution to an
asynchronous external event (the variability here can be kept to a
cycle). And interrupts generated from internal timers do NOT have this
unavoidable variability, since their generation is synchronous with the
CPU, and they are exactly predictable in terms of their latency.

Jon
>> It is cheaper to have one CPU doing the job of two CPUs
>> than having two CPUs each doing the job of half a CPU.
>> Someone is going to get very rich, once they understand this.
>> I am too lazy...
>
> For an example of someone already doing this, look at Ubicom's devices.
> It's what I'd call hard multithreading, where they have timeslices and
> can allocate them to tasks: if you want, you can map 29/64 to a
> high-priority task, 4/64 to lower-priority ones, and 1/64 to a
> background watchdog-type task, and get full independence. (etc)
> Then there is the Parallax Propeller: multiple cores, with small code
> storage per core.
I know, I wrote a white paper on multithreading for embedded control
when I worked in the National Semiconductor research labs, and presented
it to the microcontroller division. Bulent Celebi, the head of the NSC
microcontroller division, became the CEO of Ubicom, and Gideon Intrater,
the head of the architecture group, became VP of the MIPS architecture
group (MIPS has also introduced a multithreaded MIPS core). As I said, I
am too lazy...
> -jg

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ets9le$tf1$1@aioe.org...

> Adding a cache to the ARM7 CPU (not to the flash) will add a waitstate
> to ALL non-hit accesses, according to chip designers.
It doesn't have to if you divide the memory map. Accesses to
non-cacheable memory simply bypass the cache; cacheable accesses try the
cache first. It does take some extra logic, as the ARM7 isn't built for
caches, so you get a slightly lower maximum frequency. That is why doing
it on the flash is a better solution.
> It also adds waitstates to ARM9s if you put the memory on the AMBA bus.
> The only way to allow no-waitstate operation is to put the SRAM in TCM.
Correct.
> If you make a jump to a location outside the cache, then you are dead
> in the water:
> - synch with the AMBA bus
> - synch with the bus interface
> - SDRAM precharge cycle
> - 70 ns access...
Sure, cache misses are bad. But caches work extremely well; we're using
3 GHz CPUs with 10 MHz memory, after all...
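The arithmetic behind that is just the average memory access time
formula, AMAT = hit time + miss rate x miss penalty; a tiny worked
example with illustrative numbers (none of them measured):

#include <stdio.h>

/* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
 * Numbers are illustrative only: a 1-cycle cache hit, a 2% miss rate
 * and a 100-cycle miss penalty. */
int main(void)
{
    double hit_time = 1.0;        /* cycles */
    double miss_rate = 0.02;
    double miss_penalty = 100.0;  /* cycles to go out to slow memory */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* prints: AMAT = 3.0 cycles */
    return 0;
}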
> With a multithreaded CPU you can use those cycles for something good.
The other thread also needs to use part of the cache for its code and data, so the cache becomes less effective. It is a difficult tradeoff, not as simple as you claim.
>> No, phones are extremely integrated and usually have only one CPU,
>> one DSP and perhaps a microcontroller in the flash card.
>
> Then you do not know what a modern phone looks like.
> Each Bluetooth chip normally has an ARM.
> Each WLAN chip normally has one or more ARMs.
> On smart phones, you have a GSM/WCDMA controller and an application CPU.
> GPS functions will add one more ARM.
> Then you have a micro doing the charging algorithm.
>
> It quickly adds up.
You haven't seen an average phone then. Yes, the most complex smart phones use 5-6 chips with several ARMs. Most phones are far more integrated and use 2-3 chips containing just one ARM and a DSP.
> Multithreading for embedded systems is not about increasing performance.
> It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
> MIPS, with a single CPU which can run 2 x 20 MIPS threads.
> I.e., it is trying to fix the real-time response problem.
What real-time response problem? Interrupt latency of a modern CPU is
only a few cycles. Cortex-R4 has a 20-cycle latency even though it has
caches, branch prediction, and runs at 500 MHz...
> It is cheaper to have one CPU doing the job of two CPUs
> than having two CPUs each doing the job of half a CPU.
> Someone is going to get very rich, once they understand this.
> I am too lazy...
This is already happening, but you don't need multithreading.
>> You don't understand multithreading at all. Interrupt latency is
>> completely unaffected by multithreading. Whether you run 2 interrupts
>> in parallel at half the speed or one after the other at full speed is
>> irrelevant.
>
> Not if you need both interrupts to respond within 200 ns.
Sorry, it's simple maths. If we have two 20 MHz CPUs that have a 200 ns
interrupt deadline, then a 40 MHz CPU takes 100 ns for the deadlines (as
it is twice as fast), so it meets the 200 ns deadline.
> A multithreaded CPU running at 400 MHz can do the task of 40 CPUs
> running at 10 MHz.
And a non-multithreaded CPU running at 400 MHz can do the task of 40
CPUs running at 10 MHz. Multithreading doesn't enter the picture at
all...

Wilco
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message 
news:4601b38b$1@clear.net.nz...
> Wilco Dijkstra wrote:
>>
>> It is impossible to run code at a predictable speed, so you're
>> screwed no matter whether you use a cache or not.
>
> ?! - what?
> Or are you talking only within the ARM subset of the CPU universe here?
I guess you haven't heard about interrupts, wait states, cycle-stealing
DMA and other niceties then. Some of us live in the real world...

Wilco
Wilco Dijkstra wrote:
> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message > news:4601b38b$1@clear.net.nz... > >>Wilco Dijkstra wrote: >> >>>It is impossible to run code at a predictable speed, so you're >>>screwed no matter whether you use a cache or not. >> >>?! - what ? >>Or are you talking only within the ARM subset of the CPU universe here ? > > > I guess you haven't heard about interrupts, wait states, cycle > stealing DMA and other niceties then. Some of use live in the > real world...
Which has nothing to do with the false, sweeping claim you made above.
Not only is it possible to run code at predictable speeds, a large
number of designs out there are doing this on a daily basis....

I'm glad the systems I ship do not have to conform to your idea of the
'real world', or they would fail.... :)

-jg
