EmbeddedRelated.com

What application requires 500MHz for embedded processors

Started by jade March 5, 2006
"jade" <jade.emily@msa.hinet.net> wrote in message 
news:1141608758.252062.209820@v46g2000cwv.googlegroups.com...
> Use processor to do DSP seems not proper.
> The general tasks run on processor are random but the DSP task is quite
> uniform in DSP algorithm and implementation.
>
> Thus I think it's justified to dispatch heavy loading of DSP to another
> co-processor instead of run by pure software in processor.
That used to be the case, but nowadays most general purpose CPUs have added DSP features which make them reasonable DSPs. Perhaps not as good as a dedicated DSP; however, the bulk of the code is still general purpose, so it makes more sense to improve the DSP capabilities of a general purpose CPU than the other way around.

One of the driving factors for adding DSP capabilities to general purpose CPUs is to remove the need for a separate DSP, which adds a lot of extra cost to a project (think of needing 2 teams to do development, 2 sets of development tools, more complex hardware interconnect, higher cost of product, higher power consumption etc).

A general purpose CPU can be relatively easily modified to add DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and load/store 4 16-bit values per cycle.
> Do you have examples in current design that use processor to do DSP?
Most hard discs use ARM9E rather than DSPs nowadays (the head flying code is definitely hard real-time).

Wilco
Wilco Dijkstra wrote:

> "jade" <jade.emily@msa.hinet.net> wrote in message
> news:1141608758.252062.209820@v46g2000cwv.googlegroups.com...
>
>>Use processor to do DSP seems not proper.
>>The general tasks run on processor are random but the DSP task is quite
>>uniform
>>in DSP algorithm and implementation.
>>
>>Thus I think it's justified to dispatch heavy loading of DSP to another
>>co-processor
>>instead of run by pure software in processor.
>
>
> That used to be the case, but nowadays most general purpose CPUs
> have added DSP features which makes them reasonable DSPs.
> Perhaps not as good as a dedicated DSP, however the bulk of the
> code is still general purpose, so it makes more sense to improve the
> DSP capabilities of a general purpose CPU than the other way around.
>
> One of the driving factors of adding DSP capabilities to general purpose
> CPUs is to remove the need for a separate DSP, which adds a lot of
> extra cost to a project (think of needing 2 teams to do development,
> 2 sets of development tools, more complex hardware interconnect,
> higher cost of product, higher power consumption etc).
<snip>

Another issue with a separate DSP is deciding how much resource to give it. The code RAM is likely to have a larger die cost than the DSP core, and it sets a rather hard ceiling, so it's hard to spec a generic device this way.

However, if the market is large enough and the task well defined, you WILL find very specific co-processors doing the DSP stuff. This will always be lower power than spinning all that external memory and bus lines.

For examples, look at the new MP3 and MPEG chips, often with ARMs alongside.

-jg
> A general purpose CPU can be relatively easily modified to add
> DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and
> load/store 4 16-bit values per cycle.
Perhaps not so easy. Today's DSPs do multiple memory accesses per cycle to do MMAC; e.g. the TI 54xx can fetch a coefficient with address increment, fetch data with address increment, and write back with address increment in a single cycle...

A 500 MHz general purpose CPU (even if PPC) has no chance to match a 100 MHz DSP when it comes to brute force MMAC, which is what DSPs are all about.

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
"Didi" <dp@tgi-sci.com> wrote in message 
news:1141671988.983106.6020@p10g2000cwp.googlegroups.com...
>> A general purpose CPU can be relatively easily modified to add
>> DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and
>> load/store 4 16-bit values per cycle.
>
> Perhaps not so easy. Todays DSPs do multiple memory accesses
> per cycle to do MMAC; e.g. the 54xx of TI can fetch a coefficient
> with address increment, data with address increment, and write back
> with address increment in a single cycle...
That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do 2 MACs and read/write 4 16-bit values per cycle including address increment. Which is faster?

Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12% faster than a C54xx running at the same frequency.
> A 500 MHz general
> purpose CPU (even if PPC) has no chance to match a 100 MHz
> DSP when it comes to brute force MMAC which is what DSPs
> are all about.
That's clearly wrong. A 100 MHz DSP like the C54xx is about 5.6 times as _slow_ as a 500 MHz ARM11... The magical brute force of DSPs is greatly exaggerated - general purpose CPUs are currently beating all but the very high-end DSPs. Even that is under threat; I wonder how a 1 GHz Cortex-A8 stacks up against a 1 GHz C64x+?

Wilco
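The 5.6 figure follows from two numbers already given in the thread: the 12% BDTI lead at equal clock, and the 5:1 clock ratio. A minimal sketch of that arithmetic in Python, assuming performance scales linearly with clock frequency (which is what the comparison implies, and ignores memory effects); the function name is made up for illustration.

```python
def relative_speed(clock_a_mhz, clock_b_mhz, per_clock_lead_a):
    """Speed of A relative to B, assuming performance scales linearly with clock.

    per_clock_lead_a is A's benchmark lead over B at equal clock
    (0.12 means A is 12% faster per MHz).
    """
    return (clock_a_mhz / clock_b_mhz) * (1.0 + per_clock_lead_a)


# Figures quoted in the thread: 500 MHz ARM11, 100 MHz C54xx, 12% per-clock lead.
ratio = relative_speed(500, 100, 0.12)
print(f"ARM11 @ 500 MHz vs C54xx @ 100 MHz: {ratio:.1f}x")
```

The linear-scaling assumption is the weak point: it only holds while code and data stay in memory that keeps up with the core clock.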
Wilco Dijkstra wrote:

> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
> faster than a C54xx running at the same frequency.
What are the MIPS/mW figures like?
"David Brown" <david@westcontrol.removethisbit.com> wrote in message 
news:440bf946$1@news.wineasy.se...
> Everett M. Greene wrote:
>> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> writes:
>>
>>> Software is also becoming
>>> more complex as a result, with Windows CE being used in many phones.
>>> Java never runs fast enough. All this requires a lot more performance...
>>
>> Which demonstrates that if you throw enough software
>> inefficiencies at a processor, you can kill its
>> performance no matter how fast it is.
>
> Wirth's law:
> Software gets slower faster than hardware gets faster.
Yes, and unfortunately it is becoming a problem in the embedded world too... I guess desktop software is moving down with its wasteful attitude. I also suspect few graduates nowadays start on a tiny 8-bit system where every byte and every single cycle matters. Equally few understand the details of programming in high level languages well enough to accurately predict resource usage.

One of the fallacies that proponents of wasteful programming repeat is that nothing is a bottleneck until proven so - however that's the wrong way around... Even the least expected part of a program can become a serious bottleneck. I've seen the command-line parser of a compiler slow down by 3 orders of magnitude due to STL strings. Parsing the command line was slower than running the back-end at full optimization...

Wilco
Wilco,

> That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
> 2 MACs and read/write 4 16-bit values per cycle including address
> increment. Which is faster?
I am not intimately familiar with ARM so you might be right after all. However, I need to ask: how many address and data buses does the ARM in question have, so that it can read/write 3 independent address areas simultaneously? What memory interface does it use so it can read/write to memory in a single cycle (on all 3 buses)?

But to make the comparison fairer, let us compare a practical case - something I have done on a 5420. The sampled signal is 14 bits; 9.2 MSPS run continuously and are DMA buffered, without missing a single sample, in a circular queue in the on-chip memory. In parallel with the sampling, the DSP does some filtering to recognize events using about 90% of its theoretical MAC bandwidth; the remaining bandwidth goes on qualifying found events and programming yet another of the many DMACs to pass enough of the samples surrounding the event to the other DSP core. This takes exactly a 54xx core clocked at 100 MHz, all memory being internal; in fact you can safely claim there is no external hardware of interest related to the comparison.

Can you do that with a 500 MHz ARM? I know you cannot do it with a 500 MHz PPC, which is what I am familiar with (memory latencies will kill you). Remember, all this is done in real time, no missed samples, no missed events. When you add the second core's consumption (also almost 100% busy, but only under the toughest conditions), you get about 300 mW...

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
"larwe" <zwsdotcom@gmail.com> wrote in message 
news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
> Wilco Dijkstra wrote:
>
>> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
>> faster than a C54xx running at the same frequency.
>
> What are the MIPS/mW figures like?
According to TI numbers the lowest power C54xx uses around 0.6 mW/MHz (160 MHz max, not sure what process). It's measured as 50% NOPs, 50% MACs (I'm not convinced that is typical). ARM11 uses 0.8 mW/MHz (500 MHz at 130 nm; with IEM it becomes 0.5 mW/MHz). So the C54 has a ~16% advantage in energy per task.

So general purpose processors are competitive but not quite there yet. A superscalar CPU with a lower maximum frequency will be more power efficient than an ARM11 while achieving similar performance. It will be interesting to see how good Cortex-R4 turns out...

Wilco
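The ~16% figure can be reproduced from the numbers in this post plus the 12% per-clock BDTI lead mentioned earlier in the thread. A rough sketch in Python, assuming energy per task is simply power per MHz divided by work done per MHz; the function name is hypothetical and the IEM figure is left out because the post does not say how IEM affects throughput.

```python
def energy_per_task(mw_per_mhz, work_per_mhz):
    """Relative energy per unit of work: power per MHz divided by work per MHz."""
    return mw_per_mhz / work_per_mhz


# Work per MHz is normalised to the C54xx; the ARM11 does 12% more per MHz.
c54_energy = energy_per_task(0.6, 1.00)
arm11_energy = energy_per_task(0.8, 1.12)

# Fraction of energy the C54xx saves relative to the ARM11 for the same task.
c54_advantage = 1.0 - c54_energy / arm11_energy
print(f"C54xx energy advantage per task: {c54_advantage:.0%}")
```

Both inputs are vendor figures measured under different conditions, so the result is only as good as the assumption that they are comparable.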
Wilco Dijkstra wrote:
> "larwe" <zwsdotcom@gmail.com> wrote in message
> news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
>
>>Wilco Dijkstra wrote:
>>
>>
>>>Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
>>>faster than a C54xx running at the same frequency.
>>
>>What are the MIPS/mW figures like?
>
>
> According to TI numbers the lowest power C54xx uses around 0.6mW/Mhz
> (160Mhz max, not sure what process). It's measured as 50% NOPs, 50%
> MACs (I'm not convinced that is typical). ARM11 uses 0.8mW/Mhz
> (500Mhz at 130nm, with IEM it becomes 0.5mW/Mhz). So the C54 has
> ~16% advantage in energy per task.
>
> So general purpose processors are competitive but not quite there yet.
> A superscalar CPU with a lower maximum frequency will be more power
> efficient than an ARM11 while achieving similar performance. It will be
> interesting to see how good Cortex-R4 turns out...
Another thing to watch is whether the values for the C54xx include memory (probably, as it is on-chip?) and whether the ARM11 ones do (probably not, as that is off-chip?). So the cores themselves might be in the same ballpark, but what about the system figures?

-jg
"Didi" <dp@tgi-sci.com> wrote in message 
news:1141689973.771281.79780@i39g2000cwa.googlegroups.com...
> Wilco,
>
>> That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
>> 2 MACs and read/write 4 16-bit values per cycle including address
>> increment. Which is faster?
>
> I am not intimately familiar with ARM so you might be right after all.
> However, I need to ask: how many address and data busses has
> the ARM in question so it can read/write 3 independent address
> areas simultaneously? What memory interface does it use so it
> can read/write to memory in a single cycle (to all 3 busses)?
Internally the ARM11 has an L2 subsystem with 4 independent 64-bit buses which can run at full core speed, and a 32-bit peripheral bus. The L1 system is Harvard, with separate 64-bit buses connecting to the core and the L2 system. All of these can work in parallel.

However, that wasn't what I was talking about. An ARM11 can't execute 3 independent loads/stores per cycle, every cycle. But it can read or write 64 bits per cycle. In 4 cycles it can read 4x4 16-bit values from 4 independent addresses. So on streaming data it effectively achieves the same throughput as 4 independent 16-bit memory accesses without using XYZ memories. ARM11 can also issue a load multiple instruction and continue to do computations while the memory system fetches the data in the background - even on a cache miss.
> But to make the comparison fairer, let us compare a practical
> case - something I have done on a 5420. The sampled signal is 14 bits,
> there are 9.2 MSPS continuously running and DMA buffered without
> missing a single sample on a circular queue in the on-chip memory.
> Parallel with the sampling, the DSP does some filtering to recognize
> events using about 90% of its theoretical MAC bandwidth;
> the remaining bandwidth goes on qualifying found events and
> programming yet another of the many DMACs to pass enough of
> the samples surrounding the event to the other DSP core.
> This takes exactly a 100 MHz clocked 54xx core, all memory
> being internal, in fact you can safely claim there is no external
> hardware of interest related to the comparison.
>
> Can you do that with a 500 MHz ARM? I know you cannot do it
> with a 500 MHz PPC which is what I am familiar with (memory
> latencies will kill you). Remember, all this is
> done in real time, no missed samples, no missed events.
> When you add the second core's consumption (also almost 100%
> busy, but this one so only under the toughest conditions),
> you get about 300 mW...
Yes, from what you say it could do it at 100 MHz. If the ADC has a FIFO you could soft-DMA the samples directly into local memory or cache; something like 256 bytes would give an interrupt rate of 100K, taking about 10 MHz. Using the built-in DMA allows a smaller buffer and has virtually no overhead. If all else fails, the samples could be written to main memory in 32-byte blocks, as the required bandwidth of 20 MBytes/s is tiny.

Doing 100M MACs would take less than 100 MHz if all data is in the L1 cache (which can be locked) or local memory. Using SDRAM or L2 costs a bit more, but the data can be streamed into local memory using the DMA or using software prefetch. If everything fits in the caches/local memory, doing it at 100 MHz is achievable.

I can imagine a PPC without Altivec has a problem doing MACs. Also it may not have good enough cache lockdown/local memory facilities, so worst case latencies may just be too high. I don't know which PPC you meant, but ARM11 was designed for precisely this kind of real-time on-chip processing.

Wilco
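The budget in this reply can be checked from the figures in the thread. A back-of-the-envelope sketch in Python, assuming each 14-bit sample is stored in a 16-bit word (the thread does not say so explicitly), that the C54xx sustains 1 MAC per cycle and the ARM11 peaks at 2, as stated earlier; all names are illustrative.

```python
SAMPLE_RATE_HZ = 9.2e6     # from the thread: 9.2 MSPS
BYTES_PER_SAMPLE = 2       # assumption: 14-bit samples padded to 16 bits
BUFFER_BYTES = 256         # FIFO/buffer size suggested in the reply
DSP_CLOCK_HZ = 100e6       # 54xx core clock from the thread
DSP_MAC_UTILISATION = 0.9  # "about 90% of its theoretical MAC bandwidth"
ARM11_MACS_PER_CYCLE = 2   # peak figure quoted earlier in the thread

# Sustained input bandwidth and the interrupt rate for one interrupt per buffer.
bandwidth_bytes_per_s = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
interrupts_per_s = bandwidth_bytes_per_s / BUFFER_BYTES

# MAC workload implied by the DSP figures, and the ARM11 clock needed at peak rate.
macs_per_s = DSP_MAC_UTILISATION * DSP_CLOCK_HZ
macs_per_sample = macs_per_s / SAMPLE_RATE_HZ
arm11_peak_clock_hz = macs_per_s / ARM11_MACS_PER_CYCLE

print(f"input bandwidth : {bandwidth_bytes_per_s / 1e6:.1f} MB/s")
print(f"interrupt rate  : {interrupts_per_s / 1e3:.1f} k/s")
print(f"MACs per sample : {macs_per_sample:.2f}")
print(f"ARM11 MAC clock : {arm11_peak_clock_hz / 1e6:.0f} MHz at peak rate")
```

Under these assumptions the bandwidth and interrupt rate come out somewhat below the rounded 20 MBytes/s and 100K in the reply, so those figures read as conservative. The peak MAC clock is a lower bound, since loop overhead and memory stalls are not modelled.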