
Atmel releasing FLASH AVR32 ?

Started by -jg March 19, 2007
Ulf Samuelsson wrote:

> "Jim Granville" <no.spam@designtools.maps.co.nz> skrev i meddelandet > news:4600f533@clear.net.nz... > >>Ulf Samuelsson wrote: >> >>>Branch prediction cost is chasing an ever eluding target. >>>With multithreading you can swap in a computable process and use EVERY >>>cycle. >> >>I'm with you up to this point, but the challenge with hard-real-time >>multithread, is the code-fetches feeding that "computable process" still >>has to come from somewhere ? > > > If you fetch a large chunk of code in each fetch to a prefetch buffer > this is not a problem. > > >> I can see Multithread doing good things for removing SW taskswitch, >>and lowering interrupt latencies, but unless you do fancy things with >>the code pathway, you are actually thrashing the memory about even more. >> > > No, I see embedded multithreading as one threa accessing external memory > while all the other threads (mostly) accesses internal high bandwidth memory > without any nasty cache in between.
I think you have mentally added quite a bit of hardware. If you do what
you describe, then you need a wide buffer per thread? - so you have
dictated a quite special memory architecture, and that has to be on-chip.
[It is still simple, and deterministic to a point, but it is special.]

If you extend that wide buffer to be interleaved (see the AT27LV1026),
then you can cross a boundary (sequentially) and not have that affect
things - so the tools can be simpler.

Someone like Atmel could do this, but the IP suppliers who sell
microprocessors as microcontrollers are pushing in a different direction.
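To make the interleaving concrete, here is a toy C model (sizes and names
invented for the sketch) of two interleaved prefetch banks: even-numbered
lines live in one bank and odd-numbered lines in the other, so a
sequential fetch that crosses a line boundary reads from the other bank
while the first one refills.

#include <stdbool.h>
#include <stdint.h>

/* Toy model of two interleaved prefetch banks, 16 bytes each. Even-
 * numbered 16-byte lines map to bank 0 and odd lines to bank 1, so
 * sequential execution that crosses a line boundary hits the other
 * bank while this one can be refilled in the background. */
#define LINE_BYTES 16u

struct bank {
    uint32_t line;              /* number of the line currently held */
    bool     valid;
    uint8_t  data[LINE_BYTES];
};

static struct bank banks[2];

/* Returns true on a hit; on a miss the hardware would start a wide
 * flash fetch to refill the selected bank. */
static bool prefetch_hit(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    struct bank *b = &banks[line & 1u];   /* interleave on line number */
    return b->valid && b->line == line;
}

-jg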
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etqnn5$v66$1@aioe.org...
> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet > news:2y_Lh.16902$NK3.2627@newsfe6-win.ntli.net... >> >> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etp769$te9$1@aioe.org...
>> That's true, but function calls are common too and they would typically
>> branch between pages. And then you have the nasty case of a function
>> or a loop split between 2 pages...
>
> Fixed by compiler pragma...
Easy to say, a bit harder in reality. If you don't care about code size
you could align big functions to 512-byte boundaries and pack small
functions into the gaps. But even that is hardly a solution, as every
minor change in the code results in a different memory layout, making
performance unpredictable. Basically it is an unsolvable problem.
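For what it's worth, the alignment half of that workaround is easy to
express with GCC-style toolchains; a minimal sketch (the 512-byte page
size and the function name are just this thread's example):

/* Pin a hot function to a 512-byte flash page boundary so it never
 * straddles two pages. GCC/Clang attribute syntax; the 512-byte page
 * size is the example used in this thread, not a fixed constant. */
__attribute__((aligned(512)))
void hot_loop(void)
{
    /* ... time-critical code ... */
}

The hard part, as noted above, is that the packing of everything else
still shifts on every rebuild.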
> On an ARM7, adding a cache also adds one waitstate to all non-cache
> accesses.
No, a cache doesn't impact other accesses to non-cacheable memory areas. A local flash cache is something you could just drop into an existing design without even worrying about needing to turn it on or flush it. It's completely transparent.
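To illustrate "transparent": a toy C model of a small direct-mapped
flash line cache (all sizes, and the flash_read helper, are invented for
the sketch). A fetch either hits or silently refills a line; nothing
outside the fetch path needs to know the cache exists.

#include <stdbool.h>
#include <stdint.h>

/* Toy model of a tiny direct-mapped flash line cache: 8 lines of
 * 32 bytes. All sizes are invented for the sketch. */
#define NLINES    8u
#define LINE_SIZE 32u

struct line { uint32_t tag; bool valid; uint8_t data[LINE_SIZE]; };
static struct line cache[NLINES];

/* Assumed low-level flash read; not a real driver API. */
extern void flash_read(uint32_t addr, uint8_t *dst, uint32_t len);

uint8_t icache_fetch(uint32_t addr)
{
    uint32_t line_no = addr / LINE_SIZE;
    struct line *l = &cache[line_no % NLINES];
    if (!l->valid || l->tag != line_no) {       /* miss: refill line */
        flash_read(line_no * LINE_SIZE, l->data, LINE_SIZE);
        l->tag = line_no;
        l->valid = true;
    }
    return l->data[addr % LINE_SIZE];           /* hit path */
}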
>> Similarly, branch prediction makes a CPU go faster and so it burns less
>> power to do a given task. Cortex-M3 has a special branch prediction
>> scheme to improve performance when running from flash with wait
>> states, so it makes sense even in low-end CPUs.
>
> Branch prediction cost is chasing an ever-eluding target.
Branch prediction is pretty trivial, as branches are very predictable. A
small global branch predictor (for example, as used in the ARM1156) gives
amazingly good prediction at a negligible hardware cost.
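For a feel for how little hardware such a predictor needs, here is a
sketch of a generic gshare-style global predictor with 2-bit saturating
counters (a textbook scheme, not a description of the ARM1156's actual
implementation):

#include <stdbool.h>
#include <stdint.h>

/* Gshare-style global branch predictor: the global history register
 * XORed with the branch PC indexes a table of 2-bit saturating
 * counters. */
#define TABLE_BITS 10u
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t  counters[TABLE_SIZE]; /* 0..3: strong/weak not-taken/taken */
static uint32_t history;              /* recent branch outcomes, 1 bit each */

static uint32_t predictor_index(uint32_t pc)
{
    return ((pc >> 2) ^ history) & (TABLE_SIZE - 1u);
}

bool predict_taken(uint32_t pc)
{
    return counters[predictor_index(pc)] >= 2;
}

void predictor_update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[predictor_index(pc)];
    if (taken && *c < 3)
        (*c)++;                       /* saturate towards taken */
    else if (!taken && *c > 0)
        (*c)--;                       /* saturate towards not-taken */
    history = (history << 1) | (taken ? 1u : 0u);
}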
> With multithreading you can swap in a computable process and use EVERY cycle.
So what? There are few wasted cycles on modern embedded CPUs. Only very high-end CPUs are waiting a lot for slow memory.
>> Multithreading is not relevant in the embedded space, it would add a lot
>> of complexity and die area for hardly any gain.
>
> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
> Bluetooth, WLAN, GPS etc., just because no one has designed
> proper multithreading for embedded.
No, phones are extremely integrated and usually have only one CPU, one
DSP and perhaps a microcontroller in the flash card. Hardware
multithreading doesn't give much performance on a high-end CPU, and it
gives almost no benefit on a low-end one. Less than 10% of the memory
bandwidth is unused in an ARM7, so running a second thread either means
it runs at 10% of the maximum speed or it slows down the main thread.
>> It really only makes sense on high-end
>> CPUs, but even there the gains are not that impressive.
>
> If you believe that, you don't understand multithreading for embedded.
> The purpose is not to increase performance, it is to improve real-time
> response so you do not have to have multiple CPUs.
You don't understand multithreading at all. Interrupt latency is completely unaffected by multithreading. Whether you run 2 interrupts in parallel at half the speed or one after the other at full speed is irrelevant. You confuse multiprocessing with multithreading. A 2-core CPU can indeed deal with 2 interrupts in parallel at full speed.
>> Adding more
>> cachelines evens this effect out, making performance more predictable.
>
> No, your unpredictability comes from jumping to a place and, instead of
> accessing memory to fetch the page, you get a cache hit - and then your
> timing is screwed.
It is impossible to run code at a predictable speed, so you're screwed no matter whether you use a cache or not.
>>> A cache can even reduce worst case performance since it
>>> can introduce delays in the critical path.
>>
>> So would a page cache. That is the price you have to pay when
>> improving performance: the best case is better but the worst
>> case is typically worse. Overall it is a huge win.
>
> No, it is not a win if you have to guarantee that a job completes
> in a certain time.
Wrong. Code is highly repetitive, so even if you assume the cache is invalidated at the start of a task, using a cache results in much faster execution.
> The cache in itself draws power, and you cannot compare
> accesses to the cache with accesses to flash memory.
Of course the cache burns power, but then you're not using the flash.
Which uses less power is highly dependent on their size and
implementation. From what I've heard, caches are extremely efficient for
sequential accesses - i.e. code accesses.
> You have to run the cached CPU at a higher clock frequency to compensate
> for the loss of worst-case performance.
No, it would be virtually impossible to find code that actually can't
meet its deadline with a cache.

Wilco
>> Fixed by compiler pragma...
>
> Easy to say, a bit harder in reality. If you don't care about code size
> you could align big functions to 512-byte boundaries and pack small
> functions into the gaps. But even that is hardly a solution, as every
> minor change in the code results in a different memory layout, making
> performance unpredictable. Basically it is an unsolvable problem.
I assume that there are certain critical paths which need this
determinism. Those can be handled by pragmas. It is also entirely
possible that most threads only execute out of zero-waitstate SRAM.
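The usual way to get those deterministic paths into zero-waitstate SRAM
with a GCC-style toolchain is a section attribute plus matching
linker-script plumbing; a hedged sketch (the ".ramfunc" section name and
the function are invented for the example):

/* Place a time-critical handler in on-chip zero-waitstate SRAM
 * instead of flash. GCC/Clang attribute syntax; the ".ramfunc"
 * section name and the linker-script/startup code that copies it
 * from flash to SRAM are assumed, not shown. */
__attribute__((section(".ramfunc"), noinline))
void critical_path(void)
{
    /* deterministic, single-cycle-fetch code */
}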
>> On an ARM7, adding a cache also adds one waitstate to all non-cache
>> accesses.
>
> No, a cache doesn't impact other accesses to non-cacheable
> memory areas. A local flash cache is something you could
> just drop into an existing design without even worrying about
> needing to turn it on or flush it. It's completely transparent.
Adding a cache to the ARM7 CPU (not to the flash) will add a waitstate
to ALL non-hit accesses, according to chip designers. It also adds
waitstates to ARM9s if you put the memory on the AMBA bus. The only way
to allow no-waitstate operation is to put the SRAM in TCM.
>> With multithreading you can swap in a computable process and use EVERY
>> cycle.
>
> So what? There are few wasted cycles on modern embedded CPUs.
> Only very high-end CPUs are waiting a lot for slow memory.
If you make a jump to a location outside the cache, then you are dead in
the water:
- synch with the AMBA bus
- synch with the bus interface
- SDRAM precharge cycle
- 70 ns access...

With a multithreaded CPU you can use those cycles for something good.
>>> Multithreading is not relevant in the embedded space, it would add a lot
>>> of complexity and die area for hardly any gain.
>>
>> Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
>> Bluetooth, WLAN, GPS etc., just because no one has designed
>> proper multithreading for embedded.
>
> No, phones are extremely integrated and usually have only one CPU,
> one DSP and perhaps a microcontroller in the flash card.
Then you do not know what a modern phone looks like.
Each Bluetooth chip normally has an ARM.
Each WLAN chip normally has one or more ARMs.
On smart phones, you have a GSM/WCDMA controller and an application CPU.
GPS functions will add one more ARM.
Then you have a micro doing the charging algorithm.

It quickly adds up.
> Hardware multithreading doesn't give much performance on a high-end
> CPU, and it gives almost no benefit on a low-end one. Less than
> 10% of the memory bandwidth is unused in an ARM7, so running a
> second thread either means it runs at 10% of the maximum speed
> or it slows down the main thread.
Multithreading for embedded systems is not about increasing performance.
It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
MIPS, with a single CPU which can run 2 x 20 MIPS threads. I.e., it is
trying to fix the real-time response problem.

It is cheaper to have one CPU doing the job of two CPUs than having two
CPUs each doing the job of half a CPU. Someone is going to get very rich
once they understand this. I am too lazy...
>>> It really only makes sense on high-end
>>> CPUs, but even there the gains are not that impressive.
You are locked into conventional thoughts on multithreading.
>> If you believe that, you don't understand multithreading for embedded.
>> The purpose is not to increase performance, it is to improve real-time
>> response so you do not have to have multiple CPUs.
>
> You don't understand multithreading at all. Interrupt latency is
> completely unaffected by multithreading. Whether you run 2 interrupts
> in parallel at half the speed or one after the other at full speed is
> irrelevant.
Not if you need both interrupts to respond within 200 ns. With
multithreading, you do not even need interrupts; you can schedule a
thread.
> You confuse multiprocessing with multithreading. A 2-core CPU can
> indeed deal with 2 interrupts in parallel at full speed.
No, I don't. A multithreaded CPU running at 400 MHz can do the task of
40 CPUs running at 10 MHz.
> It is impossible to run code at a predictable speed, so you're
> screwed no matter whether you use a cache or not.
No, you can measure how many cycles each thread is using within a
certain time quantum, and ensure that each thread gets its fair share.
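A rough C model of that accounting (all structures and numbers
hypothetical): each hardware thread carries a cycle budget per quantum,
the issue logic only picks threads still under budget, and the budgets
refill when the quantum ends.

#include <stdint.h>

/* Hypothetical per-thread cycle budgeting within a time quantum:
 * each cycle the issue logic picks the first ready thread that
 * still has budget left; budgets refill when the quantum ends. */
#define NTHREADS 4u

struct hwthread {
    uint32_t budget;   /* cycles allowed per quantum */
    uint32_t used;     /* cycles consumed so far     */
    int      ready;    /* 1 if not stalled on memory etc. */
};

static struct hwthread th[NTHREADS];

int pick_next_thread(void)        /* called once per cycle */
{
    for (unsigned i = 0; i < NTHREADS; i++) {
        if (th[i].ready && th[i].used < th[i].budget) {
            th[i].used++;         /* charge this cycle to the winner */
            return (int)i;
        }
    }
    return -1;                    /* all stalled or over budget */
}

void end_of_quantum(void)
{
    for (unsigned i = 0; i < NTHREADS; i++)
        th[i].used = 0;           /* every thread gets its share again */
}

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB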
Wilco Dijkstra wrote:
> It is impossible to run code at a predictable speed, so you're
> screwed no matter whether you use a cache or not.
?! - what?
Or are you talking only within the ARM subset of the CPU universe here?

-jg
Ulf Samuelsson wrote:
>> Hardware multithreading doesn't give much performance on a high-end
>> CPU, and it gives almost no benefit on a low-end one. Less than
>> 10% of the memory bandwidth is unused in an ARM7, so running a
>> second thread either means it runs at 10% of the maximum speed
>> or it slows down the main thread.
>
> Multithreading for embedded systems is not about increasing performance.
> It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
> MIPS, with a single CPU which can run 2 x 20 MIPS threads.
> I.e., it is trying to fix the real-time response problem.
Plus, once you have this, you can often drop the MAX clock speed, which
may have been hiked in the first place to try and reduce the SW
latencies to a tolerable level....
> It is cheaper to have one CPU doing the job of two CPUs
> than having two CPUs each doing the job of half a CPU.
> Someone is going to get very rich, once they understand this.
> I am too lazy...
For an example of someone already doing this, look at Ubicom's devices.
It's what I'd call hard multithreading, where they have timeslices and
can allocate them to tasks: if you want, you can map 29/64 to a
high-priority task, 4/64 to lower-priority ones, and 1/64 to a
background watchdog-type task, and get full independence. (etc)

Then there is the Parallax Propeller: multiple cores, with small code
storage per core.
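That hard-timeslice scheme can be modeled as a fixed slot table walked
one entry per cycle; a toy C sketch using the 29/64, 4/64 and 1/64 split
from above (the table layout and thread ids are invented):

#include <stdint.h>

/* Toy model of hard timeslicing: a fixed 64-slot table is walked one
 * slot per cycle, and each slot names the hardware thread that owns
 * that cycle. A thread with 29 slots gets exactly 29/64 of the
 * machine, no matter what the other threads do. */
#define SLOTS 64u

static uint8_t  slot_table[SLOTS];   /* thread id per slot   */
static uint32_t slot;                /* current walk position */

void timeslice_init(void)
{
    for (unsigned i = 0; i < SLOTS; i++)
        slot_table[i] = 3;           /* default: best-effort thread 3 */
    for (unsigned i = 0; i < 29; i++)
        slot_table[i] = 0;           /* 29/64: high-priority thread 0 */
    for (unsigned i = 29; i < 33; i++)
        slot_table[i] = 1;           /* 4/64: lower-priority thread 1 */
    slot_table[63] = 2;              /* 1/64: watchdog-type thread 2  */
}

/* Called once per CPU cycle: returns the thread that issues now. */
uint8_t next_thread(void)
{
    uint8_t t = slot_table[slot];
    slot = (slot + 1u) % SLOTS;
    return t;
}

-jg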
On Thu, 22 Mar 2007 10:37:31 +1200, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

> Wilco Dijkstra wrote:
>>
>> It is impossible to run code at a predictable speed, so you're
>> screwed no matter whether you use a cache or not.
>
> ?! - what?
> Or are you talking only within the ARM subset of the CPU universe here?
That sounds like the explanation. There are CPUs with exact, predictable
execution times, where the only place for unavoidable variability is in
recognizing and synchronizing interrupt code execution to an
asynchronous external event (the variability here can be kept to a
cycle). And interrupts generated from internal timers do NOT have this
unavoidable variability, since their generation is synchronous with the
CPU, and they are exactly predictable in terms of their latency.

Jon
>> It is cheaper to have one CPU doing the job of two CPUs
>> than having two CPUs each doing the job of half a CPU.
>> Someone is going to get very rich, once they understand this.
>> I am too lazy...
>
> For an example of someone already doing this, look at Ubicom's devices.
> It's what I'd call hard multithreading, where they have timeslices and
> can allocate them to tasks: if you want, you can map 29/64 to a
> high-priority task, 4/64 to lower-priority ones, and 1/64 to a
> background watchdog-type task, and get full independence. (etc)
> Then there is the Parallax Propeller: multiple cores, with small code
> storage per core.
I know, I wrote a white paper on multithreading for embedded control
when I worked in the National Semiconductor research labs, and presented
it to the microcontroller division. Bulent Celebi, the head of the NSC
microcontroller division, became the CEO of Ubicom, and Gideon Intrater,
the head of the architecture group, became VP of the MIPS architecture
group (MIPS has also introduced a multithreaded MIPS core). As I said, I
am too lazy...
> -jg

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ets9le$tf1$1@aioe.org...

> Adding a cache to the ARM7 CPU (not to the flash) will add a waitstate
> to ALL non-hit accesses, according to chip designers.
It doesn't have to if you divide the memory map. Accesses to
non-cacheable memory simply bypass the cache; cacheable accesses try the
cache first. It does take some extra logic, as the ARM7 isn't built for
caches, so you get a slightly lower maximum frequency. That is why doing
it on the flash is a better solution.
> It also adds waitstates to ARM9s if you put the memory on the AMBA bus.
> The only way to allow no-waitstate operation is to put the SRAM in TCM.
Correct.
> If you make a jump to a location outside the cache, then you are dead
> in the water:
> - synch with the AMBA bus
> - synch with the bus interface
> - SDRAM precharge cycle
> - 70 ns access...
Sure, cache misses are bad. But caches work extremely well; we're using
3 GHz CPUs with 10 MHz memory, after all...
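The arithmetic behind that is just the average memory access time
formula, AMAT = hit time + miss rate x miss penalty; a tiny worked
example with illustrative numbers (none of them measured):

#include <stdio.h>

/* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
 * Numbers are illustrative only: a 1-cycle cache hit, a 2% miss rate
 * and a 100-cycle miss penalty. */
int main(void)
{
    double hit_time = 1.0;        /* cycles */
    double miss_rate = 0.02;
    double miss_penalty = 100.0;  /* cycles to go out to slow memory */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* prints: AMAT = 3.0 cycles */
    return 0;
}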
> With a multithreaded CPU you can use those cycles for something good.
The other thread also needs to use part of the cache for its code and data, so the cache becomes less effective. It is a difficult tradeoff, not as simple as you claim.
>> No, phones are extremely integrated and usually have only one CPU,
>> one DSP and perhaps a microcontroller in the flash card.
>
> Then you do not know what a modern phone looks like.
> Each Bluetooth chip normally has an ARM.
> Each WLAN chip normally has one or more ARMs.
> On smart phones, you have a GSM/WCDMA controller and an application CPU.
> GPS functions will add one more ARM.
> Then you have a micro doing the charging algorithm.
>
> It quickly adds up.
You haven't seen an average phone then. Yes, the most complex smart phones use 5-6 chips with several ARMs. Most phones are far more integrated and use 2-3 chips containing just one ARM and a DSP.
> Multithreading for embedded systems is not about increasing performance.
> It is about replacing 2 CPUs capable of 50 MIPS, which only run at 20
> MIPS, with a single CPU which can run 2 x 20 MIPS threads.
> I.e., it is trying to fix the real-time response problem.
What real-time response problem? Interrupt latency of a modern CPU is
only a few cycles. Cortex-R4 has a 20-cycle latency even though it has
caches, branch prediction, and runs at 500 MHz...
> It is cheaper to have one CPU doing the job of two CPUs
> than having two CPUs each doing the job of half a CPU.
> Someone is going to get very rich, once they understand this.
> I am too lazy...
This is already happening, but you don't need multithreading.
>> You don't understand multithreading at all. Interrupt latency is
>> completely unaffected by multithreading. Whether you run 2 interrupts
>> in parallel at half the speed or one after the other at full speed is
>> irrelevant.
>
> Not if you need both interrupts to respond within 200 ns.
Sorry, it's simple maths. If we have two 20 MHz CPUs that have a 200 ns
interrupt deadline, then a 40 MHz CPU takes 100 ns for the deadlines (as
it is twice as fast), so it meets the 200 ns deadline.
> A multithreaded CPU running at 400 MHz can do the task of 40 CPUs
> running at 10 MHz.
And a non-multithreaded CPU running at 400 MHz can do the task of 40
CPUs running at 10 MHz. Multithreading doesn't enter the picture at
all...

Wilco
"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message 
news:4601b38b$1@clear.net.nz...
> Wilco Dijkstra wrote:
>>
>> It is impossible to run code at a predictable speed, so you're
>> screwed no matter whether you use a cache or not.
>
> ?! - what?
> Or are you talking only within the ARM subset of the CPU universe here?
I guess you haven't heard about interrupts, wait states, cycle-stealing
DMA and other niceties then. Some of us live in the real world...

Wilco
Wilco Dijkstra wrote:
> "Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message > news:4601b38b$1@clear.net.nz... > >>Wilco Dijkstra wrote: >> >>>It is impossible to run code at a predictable speed, so you're >>>screwed no matter whether you use a cache or not. >> >>?! - what ? >>Or are you talking only within the ARM subset of the CPU universe here ? > > > I guess you haven't heard about interrupts, wait states, cycle > stealing DMA and other niceties then. Some of use live in the > real world...
Which has nothing to do with the false, sweeping claim you made above.
Not only is it possible to run code at predictable speeds, a large
number of designs out there are doing this on a daily basis....

I'm glad the systems I ship do not have to conform to your idea of the
'real world', or they would fail.... :)

-jg
