What application requires 500MHz for embedded processors

Reply by Wilco Dijkstra ●March 6, 2006

"jade" <jade.emily@msa.hinet.net> wrote in message 
news:1141608758.252062.209820@v46g2000cwv.googlegroups.com...
> Use processor to do DSP seems not proper.
> The general tasks run on processor are random but the DSP task is quite
> uniform
> in DSP algorithm and implementation.
>
> Thus I think it's justified to dispatch heavy loading of DSP to another
> co-processor
> instead of run by pure software in processor.

That used to be the case, but nowadays most general purpose CPUs
have added DSP features which makes them reasonable DSPs.
Perhaps not as good as a dedicated DSP, however the bulk of the
code is still general purpose, so it makes more sense to improve the
DSP capabilities of a general purpose CPU than the other way around.

One of the driving factors of adding DSP capabilities to general purpose
CPUs is to remove the need for a separate DSP, which adds a lot of
extra cost to a project (think of needing 2 teams to do development,
2 sets of development tools, more complex hardware interconnect,
higher cost of product, higher power consumption etc).

A general purpose CPU can be relatively easily modified to add
DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and
load/store 4 16-bit values per cycle.

> Do you have examples in current design that use processor to do DSP?

Most harddiscs use ARM9E rather than DSPs nowadays (the head flying
code is definitely hard realtime).

Wilco

Reply by Jim Granville ●March 6, 2006

Wilco Dijkstra wrote:

> "jade" <jade.emily@msa.hinet.net> wrote in message 
> news:1141608758.252062.209820@v46g2000cwv.googlegroups.com...
> 
>>Use processor to do DSP seems not proper.
>>The general tasks run on processor are random but the DSP task is quite
>>uniform
>>in DSP algorithm and implementation.
>>
>>Thus I think it's justified to dispatch heavy loading of DSP to another
>>co-processor
>>instead of run by pure software in processor.
> 
> 
> That used to be the case, but nowadays most general purpose CPUs
> have added DSP features which makes them reasonable DSPs.
> Perhaps not as good as a dedicated DSP, however the bulk of the
> code is still general purpose, so it makes more sense to improve the
> DSP capabilities of a general purpose CPU than the other way around.
> 
> One of the driving factors of adding DSP capabilities to general purpose
> CPUs is to remove the need for a separate DSP, which adds a lot of
> extra cost to a project (think of needing 2 teams to do development,
> 2 sets of development tools, more complex hardware interconnect,
> higher cost of product, higher power consumption etc).
<snip>

Another issue with a separate DSP is deciding how much resource to
give it. The CODE RAM is likely to cost more die area than the DSP core,
and it sets a rather hard ceiling, so it's hard to spec a generic
device this way.

  However, if the market is large enough and the task well defined, you
WILL find very specific co-processors doing the DSP stuff - this will
always be lower power than spinning all that external memory & BUS lines.
  For examples, look at the new MP3 and MPEG chips - often with ARMs
alongside.

-jg

Reply by Didi ●March 6, 2006

> A general purpose CPU can be relatively easily modified to add
> DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and
> load/store 4 16-bit values per cycle.

Perhaps not so easy. Today's DSPs do multiple memory accesses
per cycle to do MMAC; e.g. TI's 54xx can fetch a coefficient
with address increment, data with address increment, and write back
with address increment in a single cycle... A 500 MHz general
purpose CPU (even a PPC) has no chance of matching a 100 MHz
DSP when it comes to brute force MMAC, which is what DSPs
are all about.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------



Reply by Wilco Dijkstra ●March 6, 2006

"Didi" <dp@tgi-sci.com> wrote in message 
news:1141671988.983106.6020@p10g2000cwp.googlegroups.com...
>
>> A general purpose CPU can be relatively easily modified to add
>> DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and
>> load/store 4 16-bit values per cycle.
>
> Perhaps not so easy. Todays DSPs do multiple memory accesses
> per cycle to do MMAC; e.g. the 54xx of TI can fetch a coefficient
> with address increment, data with address increment, and write back
> with address increment in a single cycle...

That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
2 MACs and read/write 4 16-bit values per cycle including address
increment. Which is faster?

Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
faster than a C54xx running at the same frequency.

>A 500 MHz general
> purpose CPU (even if  PPC) has no chance to match a 100 MHz
> DSP when it comes to brute force MMAC which is what DSPs
> are all about.

That's clearly wrong. A 100MHz DSP like the C54xx is about 5.6
times as _slow_ as a 500MHz ARM11... The magical brute force
of DSPs is greatly exaggerated - general purpose CPUs are currently
beating all but the very high-end DSPs. Even that is under threat, I
wonder how a 1GHz Cortex-A8 stacks up against a 1GHz C64x+?
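
The 5.6 figure follows from the two numbers already given: the 12% per-clock advantage cited from BDTI and the 5:1 clock ratio. A minimal sketch of that arithmetic, with function and variable names made up purely for illustration:

```python
# Back-of-envelope check of the "5.6 times" figure.
# Inputs are the numbers quoted in this post; all names are illustrative.

def relative_speed(per_clock_ratio, clock_a_mhz, clock_b_mhz):
    """Speed of CPU A relative to CPU B, given A's per-clock advantage over B."""
    return per_clock_ratio * clock_a_mhz / clock_b_mhz

# ARM11 is 12% faster than a C54xx at the same frequency (BDTI score, as cited)
arm11_per_clock = 1.12

speedup = relative_speed(arm11_per_clock, clock_a_mhz=500, clock_b_mhz=100)
print(f"500 MHz ARM11 vs 100 MHz C54xx: about {speedup:.1f}x")
```

This only scales a benchmark score by clock frequency; it assumes the per-clock ratio holds at both frequencies and says nothing about memory latency.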

Wilco

Reply by larwe ●March 6, 2006

Wilco Dijkstra wrote:

> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
> faster than a C54xx running at the same frequency.

What are the MIPS/mW figures like?

Reply by Wilco Dijkstra ●March 6, 2006

"David Brown" <david@westcontrol.removethisbit.com> wrote in message 
news:440bf946$1@news.wineasy.se...
> Everett M. Greene wrote:
>> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> writes:
>>
>>> Software is also becoming
>>> more complex as a result, with Windows CE being used in many phones.
>>> Java never runs fast enough. All this requires a lot more performance...
>>
>> Which demonstrates that if you throw enough software
>> inefficiencies at a processor, you can kill its
>> performance no matter how fast it is.
>
> Wirth's law:
> Software gets slower faster than hardware gets faster.

Yes, and it is becoming a problem in the embedded world too
unfortunately... I guess desktop software is moving down with
its wasteful attitude. I also suspect few graduates nowadays start
on a tiny 8-bit system where every byte and every single cycle
matters. Equally few understand the details of programming in
high level languages well enough to accurately predict resource
usage.

One of the fallacies that proponents of wasteful programming
repeat is that nothing is a bottleneck until proven so - however
that's the wrong way around... Even the least expected part of a
program can become a serious bottleneck. I've seen a
command-line parser of a compiler slowing down by 3 orders
of magnitude due to STL strings. Parsing the command-line
was slower than running the back-end at full optimization...

Wilco

Reply by Didi ●March 6, 2006

Wilco,

> That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
> 2 MACs and read/write 4 16-bit values per cycle including address
> increment. Which is faster?

I am not intimately familiar with ARM so you might be right after all.
However, I need to ask: how many address and data busses has
the ARM in question so it can read/write 3 independent address
areas simultaneously? What memory interface does it use so it
can read/write to memory in a single cycle (to all 3 busses)?

But to make the comparison fairer, let us compare a practical
case - something I have done on a 5420. The sampled signal is 14 bits,
there are 9.2 MSPS continuously running and DMA buffered without
missing a single sample on a circular queue in the on-chip memory.
Parallel with the sampling, the DSP does some filtering to recognize
events using about 90% of its theoretical MAC bandwidth;
the remaining bandwidth goes on qualifying found events and
programming yet another of the many DMACs to pass enough of
the samples surrounding the event to the other DSP core.
This takes exactly a 100 MHz clocked 54xx core, all memory
being internal, in fact you can safely claim there is no external
hardware of interest related to the comparison.

Can you do that with a 500 MHz ARM? I know you cannot do it
with a 500 MHz PPC which is what I am familiar with (memory
latencies will kill you). Remember, all this is
done in real time, no missed samples, no missed events.
When you add the second core's consumption (also almost 100%
busy, but this one so only under the toughest conditions),
you get about 300 mW...

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------


Reply by Wilco Dijkstra ●March 7, 2006

"larwe" <zwsdotcom@gmail.com> wrote in message 
news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
>
> Wilco Dijkstra wrote:
>
>> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
>> faster than a C54xx running at the same frequency.
>
> What are the MIPS/mW figures like?

According to TI numbers the lowest power C54xx uses around 0.6mW/MHz
(160MHz max, not sure what process). It's measured as 50% NOPs, 50%
MACs (I'm not convinced that is typical). ARM11 uses 0.8mW/MHz
(500MHz at 130nm, with IEM it becomes 0.5mW/MHz). So, allowing for the
ARM11's 12% per-clock speed advantage, the C54 has a ~16% advantage
in energy per task.
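
The ~16% figure can be reproduced from the quoted power numbers plus the 12% per-clock BDTI advantage mentioned earlier in the thread. The sketch below is one way to get there, assuming that "energy per task" means power per MHz multiplied by the relative number of cycles the task takes; the names are illustrative only.

```python
# Energy per task from mW/MHz figures.
# 1 mW/MHz is 1 nJ per clock cycle, so energy scales with the cycle count.

def energy_per_task(mw_per_mhz, relative_cycles):
    """Relative energy for one task: energy per cycle times cycles needed."""
    return mw_per_mhz * relative_cycles

C54XX_MW_PER_MHZ = 0.6   # quoted TI figure
ARM11_MW_PER_MHZ = 0.8   # quoted ARM11 figure at 130nm, without IEM
ARM11_PER_CLOCK = 1.12   # ARM11 needs 1/1.12 of the C54xx's cycles (BDTI, as cited)

c54_energy = energy_per_task(C54XX_MW_PER_MHZ, 1.0)
arm11_energy = energy_per_task(ARM11_MW_PER_MHZ, 1.0 / ARM11_PER_CLOCK)

# Fraction of energy the C54xx saves relative to the ARM11 on the same task
advantage = 1.0 - c54_energy / arm11_energy
print(f"C54xx energy advantage per task: {advantage:.0%}")
```

Under these assumptions the ratio works out to 0.6 x 1.12 / 0.8 = 0.84, i.e. the C54xx uses about 16% less energy for the same work.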

So general purpose processors are competitive but not quite there yet.
A superscalar CPU with a lower maximum frequency will be more power
efficient than an ARM11 while achieving similar performance. It will be
interesting to see how good Cortex-R4 turns out...

Wilco

Reply by Jim Granville ●March 7, 2006

Wilco Dijkstra wrote:
> "larwe" <zwsdotcom@gmail.com> wrote in message 
> news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
> 
>>Wilco Dijkstra wrote:
>>
>>
>>>Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
>>>faster than a C54xx running at the same frequency.
>>
>>What are the MIPS/mW figures like?
> 
> 
> According to TI numbers the lowest power C54xx uses around 0.6mW/Mhz
> (160Mhz max, not sure what process). It's measured as 50% NOPs, 50%
> MACs (I'm not convinced that is typical). ARM11 uses 0.8mW/Mhz
> (500Mhz at 130nm, with IEM it becomes 0.5mW/Mhz). So the C54 has
> ~16% advantage in energy per task.
> 
> So general purpose processors are competitive but not quite there yet.
> A superscalar CPU with a lower maximum frequency will be more power
> efficient than an ARM11 while achieving similar performance. It will be
> interesting to see how good Cortex-R4 turns out...

Another thing to watch is whether the values for the C54xx include memory
(probably, as it is on chip?) and whether the ARM11 ones do
(probably not, as that is off-chip?) - so the cores themselves
might be in the same ballpark, but what about the system figures?

-jg

Reply by Wilco Dijkstra ●March 7, 2006

"Didi" <dp@tgi-sci.com> wrote in message 
news:1141689973.771281.79780@i39g2000cwa.googlegroups.com...
> Wilco,
>
>> That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
>> 2 MACs and read/write 4 16-bit values per cycle including address
>> increment. Which is faster?
>
> I am not intimately familiar with ARM so you might be right after all.
> However, I need to ask: how many address and data busses has
> the ARM in question so it can read/write 3 independent address
> areas simultaneously? What memory interface does it use so it
> can read/write to memory in a single cycle (to all 3 busses)?

Internally the ARM11 has an L2 subsystem with 4 independent
64-bit buses which can run at full core speed and a 32-bit
peripheral bus. The L1 system is Harvard with separate 64-bit
buses connecting to the core and L2 system. All of these can
work in parallel. However that wasn't what I was talking about.

An ARM11 can't execute 3 independent loads/stores per cycle,
every cycle. But it can read or write 64 bits per cycle. In 4 cycles
it can read 4x4 16-bit values from 4 independent addresses.
So on streaming data it effectively achieves the same
throughput as 4 independent 16-bit memory accesses per cycle without
using XYZ memories. ARM11 can also issue a load multiple
instruction and continue to do computations while the memory
system fetches the data in the background - even on a cache miss.
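
The streaming argument is just bus-width arithmetic. A minimal sketch, assuming one 64-bit access per cycle, rotated across four streams, with no stalls (names are illustrative):

```python
# Average 16-bit values moved per cycle when one 64-bit access is issued
# per cycle and the accesses rotate across several independent streams.

BUS_BITS = 64      # one 64-bit read or write per cycle (from the post)
VALUE_BITS = 16    # sample/coefficient width
STREAMS = 4        # independent address streams

values_per_access = BUS_BITS // VALUE_BITS   # 16-bit values per 64-bit access
cycles = STREAMS                             # one access per stream
values_moved = values_per_access * STREAMS   # values fetched over those cycles
values_per_cycle = values_moved / cycles

print(f"{values_moved} values in {cycles} cycles "
      f"= {values_per_cycle:.0f} per cycle on average")
```

The average matches four independent 16-bit accesses per cycle, but only for sequential data: each stream is served once every four cycles, so the access pattern has to be blockwise rather than one element from every stream in the same cycle.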

> But to make the comparison fairer, let us compare a practical
> case - something I have done on a 5420. The sampled signal is 14 bits,
> there are 9.2 MSPS continuously running and DMA buffered without
> missing a single sample on a circular queue in the on-chip memory.
> Parallel with the sampling, the DSP does some filtering to recognize
> events using about 90% of its theoretical MAC bandwidth;
> the remaining bandwidth goes on qualifying found events and
> programming yet another of the many DMACs to pass enough of
> the samples surrounding the event to the other DSP core.
> This takes exactly a 100 MHz clocked 54xx core, all memory
> being internal, in fact you can safely claim there is no external
> hardware of interest related to the comparison.
>
> Can you do that with a 500 MHz ARM? I know you cannot do it
> with a 500 MHz PPC which is what I am familiar with (memory
> latencies will kill you). Remember, all this is
> done in real time, no missed samples, no missed events.
> When you add the second core's consumption (also almost 100%
> busy, but this one so only under the toughest conditions),
> you get about 300 mW...

Yes, from what you say it could do it in 100MHz. If the ADC has a
FIFO you could soft-DMA the samples directly into local memory
or cache, something like 256 bytes would give an interrupt rate of
100K, taking about 10MHz. Using the built-in DMA allows a smaller
buffer and has virtually no overhead. If all else fails, the samples
could be written to main memory in 32-byte blocks, as the required
bandwidth of 20MBytes/s is tiny.
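
The rates in this paragraph can be roughed out as follows. Two inputs are assumptions rather than anything stated in the thread: that each 14-bit sample is stored in a 16-bit word, and a hypothetical cost of 100 cycles per interrupt.

```python
# Rough sample-stream budget for the 9.2 MSPS case.

SAMPLE_RATE = 9_200_000   # samples per second (from the quoted description)
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 14-bit samples stored as 16-bit words
BUFFER_BYTES = 256        # buffer size suggested in the post
CYCLES_PER_IRQ = 100      # HYPOTHETICAL interrupt cost, for illustration only

bandwidth = SAMPLE_RATE * BYTES_PER_SAMPLE       # bytes per second
irq_rate = bandwidth / BUFFER_BYTES              # interrupts per second
irq_load_mhz = irq_rate * CYCLES_PER_IRQ / 1e6   # CPU MHz spent in interrupts

print(f"bandwidth: {bandwidth / 1e6:.1f} MB/s")
print(f"interrupt rate: {irq_rate:.0f} per second")
print(f"interrupt load: {irq_load_mhz:.2f} MHz")
```

With these inputs the stream is 18.4 MB/s and the interrupt rate is 71,875 per second, which is the same order as the rounded 20 MB/s and 100K figures above. The MHz cost depends entirely on the assumed cycles per interrupt.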

Doing 100M MACs would take less than 100MHz if all data is in
the L1 cache (which can be locked) or local memory. Using SDRAM
or L2 costs a bit more, but the data can be streamed into local
memory using the DMA or using software prefetch. If everything
fits in the caches/local memory doing it in 100MHz is achievable.
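
As a sanity check on the MAC budget, here is a sketch using peak rates only: 1 MAC per cycle for the C54xx and 2 per cycle for the ARM11, as stated earlier in the thread. It assumes all data is in L1 or local memory and ignores loop overhead and stalls, so the results are lower bounds on the clock needed, not measurements.

```python
# Clock frequency needed to sustain a MAC rate at a given peak MACs/cycle.

def required_mhz(macs_per_second, macs_per_cycle):
    """Minimum clock in MHz, assuming the peak MAC rate is sustained."""
    return macs_per_second / macs_per_cycle / 1e6

C54XX_CLOCK_HZ = 100_000_000
C54XX_MACS_PER_CYCLE = 1   # per the thread
ARM11_MACS_PER_CYCLE = 2   # per the thread (2 16-bit MACs per cycle)

# The filtering described uses about 90% of the C54xx's peak MAC bandwidth
workload_macs = 0.9 * C54XX_CLOCK_HZ * C54XX_MACS_PER_CYCLE

arm11_for_workload = required_mhz(workload_macs, ARM11_MACS_PER_CYCLE)
arm11_for_100m = required_mhz(100_000_000, ARM11_MACS_PER_CYCLE)

print(f"90M MAC/s on ARM11: at least {arm11_for_workload:.0f} MHz")
print(f"100M MAC/s on ARM11: at least {arm11_for_100m:.0f} MHz")
```

On these peak numbers the MACs alone need roughly 45 to 50 MHz, which leaves headroom under 100MHz for data movement and the event qualification code.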

I can imagine a PPC without Altivec has a problem doing MACs.
Also it may not have good enough cache lockdown/local memory
facilities so worst case latencies may just be too high. I don't know
which PPC you meant, but ARM11 was designed for precisely this
kind of realtime on-chip processing.

Wilco