EmbeddedRelated.com
Forums
The 2024 Embedded Online Conference

What application requires 500MHz for embedded processors

Started by jade March 5, 2006
> The usual way to handle this on a general purpose
> processor is to unroll and pack the loads.
>
> Can you explain why this is not applicable in your
> case?
Are you sure you read my postings? General purpose processors are
applicable, just at a different performance cost. I would estimate that a
decent 500 MHz RISC of today's generation could perhaps do what a
100 MHz 54xx DSP can do in terms of real time signal processing.

If you would explain (perhaps by an example) how you want to unroll and
pack the trivial filtering example I had given, I might be able to
explain more.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------

Stephen Clarke wrote:
> "Didi" <dp@tgi-sci.com> wrote in message
> news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>
> >> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
> >> This is clearly faster than reading 16-bits from 3 independent address
> >> per cycle, right?
> >
> > No. Every 16 bit value has a separate address, which is - in the case
> > of the 5420 - another 16 bits. I will not go into explanation why this
> > is so, I guess there are sufficient books on digital signal processing
> > around.
>
> The usual way to handle this on a general purpose
> processor is to unroll and pack the loads.
>
> Can you explain why this is not applicable in your
> case?
>
> Steve.
"Didi" <dp@tgi-sci.com> wrote in message 
news:1142297032.012651.187070@i39g2000cwa.googlegroups.com...
>> The usual way to handle this on a general purpose
>> processor is to unroll and pack the loads.
>>
>> Can you explain why this is not applicable in your
>> case?
>
> Are you sure you read my postings? General purpose processors
> are applicable, just at a different performance cost.
I did not intend to ask, "why are general purpose processors not applicable". Rather, I was trying to understand your assertion that you need to do a 16-bit load from three independent addresses every cycle. The non-DSP orthodoxy is that this is not necessary, because you can unroll the loop by 4, merge the loads, and load up to twelve 16-bit objects from three independent addresses in three cycles. i.e. even though you can only do one 64-bit load per cycle, over three cycles, the effect is equivalent.
> If you would explain (perhaps by an example) how you
> want to unroll and pack the trivial filtering example I had given,
> I might be able to explain more.
I cannot see that you have provided any example code. However, Wilco
has already offered to provide an explanation if you do supply some
code:

Wilco Dijkstra wrote:
> Maybe you could show a C snippet of what you do, then I'll show you
> an equivalent one that doesn't need 3 memory accesses per cycle.
ARM expertise is not often free: someone should take him up on that offer!

Steve.
> ARM expertise is not often free: someone should take
> him up on that offer!
So is DSP and PPC expertise (and much more, for that matter), some of
which I have already given for free in this thread.

Enjoy.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------

Stephen Clarke wrote:
"Didi" <dp@tgi-sci.com> wrote in message
news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>> There is no need to do 3 independent accesses per cycle. This is a
>> very inefficient way of increasing bandwidth and that is why modern
>> CPUs increase the width of buses instead.
>
> This tells me you have never actually done any DSP programming.
> Please correct me if I am wrong (I certainly mean no offence).
You're wrong. For example I've written a highly optimised JPEG (de)compressor on ARM using software SIMD techniques.
>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>> This is clearly faster than reading 16-bits from 3 independent address
>> per cycle, right?
>
> No. Every 16 bit value has a separate address, which is - in the case
> of the 5420 - another 16 bits. I will not go into explanation why this
> is so, I guess there are sufficient books on digital signal processing
> around.
I know why low and mid-end DSPs do this, however there are major limitations with this approach. Alternatives exist which do not have these limitations, and general purpose CPUs use these to improve DSP performance without needing the traditional features of a DSP. My point is that these alternatives allow modern general purpose CPUs to easily beat traditional DSPs.
>> There is no need to do several independent accesses per cycle as
>> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
>> is only 800MBytes/s. Just the data bandwidth between the core and L1
>> is 4GBytes/s on a 500Mhz ARM11 for example.
>
> Here we go again, you don't want to believe DSPs have been
> designed as they are because of necessity.
It's not necessity, more a particular design approach (like RISC/CISC). It works fine at the low end, but it is simply not scalable. If you use it like a dogma then you'll crash and burn, just like CPUs that were too CISCy or RISCy...
> OK, I'll try a
> general example. There is an area - say, 4 kilobytes - with
> the coefficients, there is a circular queue - say, 64 k - with
> the incoming data, and there is another circular queue - say,
> again 64 k - with the filtered results. All are 16 bits wide,
> you do 4k MACs per sample each time starting one address
> further in the input queue and write the result to the output
> queue.
> Can you tell me how you do this without separate addresses
> (especially on the ARM where the registers are so scarce)?
The standard way of doing FIR filters is to block them. This
reduces the memory bandwidth requirements by the blocking
factor. Here is an example of what a 4x4 filter looks like on ARM11 -
since you don't like C, this is ARM assembly language :-)

fir_loop
    LDM   x!,{x45,x67}      ; load 4 16-bit input values and post inc
    SMLAD a0,x01,c01,a0     ; do 2 16-bit MACs
    SMLAD a1,x01,d01,a1
    LDM   c!,{c23,d23}      ; load 4 16-bit coefficients
    SMLAD a2,x23,c01,a2
    SMLAD a3,x23,d01,a3
    LDM   c!,{c01,d01}      ; load 4 16-bit coefficients
    SMLAD a0,x23,c23,a0
    SMLAD a1,x23,d23,a1
    SMLAD a2,x45,c23,a2
    SMLAD a3,x45,d23,a3

    ... repeat another time with x45<->x01 and x67<->x23 swapped

    TST   c,#mask           ; test for end of loop
    BNE   fir_loop          ; branch back - 24 instructions total

This code uses 4 accumulators a0-a3, 8 coefficients c0-c3 and d0-d3,
8 input values x0-x7, a coefficient address and an input pointer - total
14 registers (2 16-bit values fit in a 32-bit register).
The coefficient array is duplicated to avoid alignment issues and
interleaved to avoid the need of a second pointer. There is no need
for a loop counter as we can use the coefficient pointer.
The instructions are scheduled to avoid any interlocks.

On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
(0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
easily outperforms the 5420 at the same frequency.

FIR filters are clearly MAC rather than bandwidth bound. If we could
do 4 MACs per cycle, the loop would go faster. Now why do you insist
that you need at least 3 loads per MAC?

Wilco
On Tuesday, in article
     <4jwRf.10311$ZJ2.4094@newsfe6-gui.ntli.net>
     Wilco_dot_Dijkstra@ntlworld.com "Wilco Dijkstra" wrote:

>"Didi" <dp@tgi-sci.com> wrote in message
>news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>>> There is no need to do 3 independent accesses per cycle. This is a
>>> very inefficient way of increasing bandwidth and that is why modern
>>> CPUs increase the width of buses instead.
>>
>> This tells me you have never actually done any DSP programming.
>> Please correct me if I am wrong (I certainly mean no offence).
>
>You're wrong. For example I've written a highly optimised JPEG
>(de)compressor on ARM using software SIMD techniques.
Depends on application constraints.
>>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>>> This is clearly faster than reading 16-bits from 3 independent address
>>> per cycle, right?
>>
>> No. Every 16 bit value has a separate address, which is - in the case
>> of the 5420 - another 16 bits. I will not go into explanation why this
>> is so, I guess there are sufficient books on digital signal processing
>> around.
>
>I know why low and mid-end DSPs do this, however there are major
>limitations with this approach. Alternatives exist which do not have these
>limitations, and general purpose CPUs use these to improve DSP
>performance without needing the traditional features of a DSP.
>
>My point is that these alternatives allow modern general purpose
>CPUs to easily beat traditional DSPs.
Not for some applications.
>>> There is no need to do several independent accesses per cycle as
>>> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
>>> is only 800MBytes/s. Just the data bandwidth between the core and L1
>>> is 4GBytes/s on a 500Mhz ARM11 for example.
>>
>> Here we go again, you don't want to believe DSPs have been
>> designed as they are because of necessity.
>
>It's not necessity, more a particular design approach (like RISC/CISC).
>It works fine at the low end, but it is simply not scalable. If you use it
>like a dogma then you'll crash and burn, just like CPUs that were too
>CISCy or RISCy...
Always forcing all data through a processor can, for some applications,
cause problems.
>On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
>in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
>(0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
>easily outperforms the 5420 at the same frequency.
>
>FIR filters are clearly MAC rather than bandwidth bound. If we could
>do 4 MACs per cycle, the loop would go faster. Now why do you insist
>that you need at least 3 loads per MAC?
I have done various work with real time video, where the video must have
minimal delay and NO non-deterministic delays or stops (i.e. continuous
operation), often because of other limitations of the system (broadcast
effects, mixing, scaling, or equipment in loops with eye/hand
co-ordination). There are times where you have to have dedicated
hardware, as every pixel on multiple video streams is simultaneously
undergoing 24 multiplies and 9 adds at pixel rate.

I have done standards conversion and rescaling from input to output in
less than 15 input TV lines of delay; most of that delay was changing
the start times for active video due to blanking differences. Often in
these types of applications, the blockiness and latency of frame-based
processing can screw things up, as all the delays add up.

There are times when the delay does not matter - still images, or open
loop methodology (e.g. set-top boxes, DVD players, audio players) - but
others where the closed loop nature of the WHOLE system means a DSP or
fast processor will not cut it. Horses for courses, and various other
reasons (often internal politics).

-- 
Paul Carpenter          | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>    PC Services
<http://www.gnuh8.org.uk/>   GNU H8 & mailing list info
<http://www.badweb.org.uk/>  For those web sites you hate
> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
> (0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
> easily outperforms the 5420 at the same frequency.
Not at all. I had a look at the ARM11 architecture, and the first
thing I saw was that there are no 40 bit accumulators. You are
going to need them if you want to compete with the 54xx series
for my (and many other) DSP applications. Also, the 54xx have a
FIRS instruction for symmetric filters which does two MACs per cycle.

Then there are details like memory bandwidth - you can have all the
coefficients cached but you cannot - generally you always have a
miss - on the incoming data, and then you probably have all the snoop
issues with the DMA pushing the data to memory etc. etc.

At the end of the day, you will find out that your 500 MHz ARM will
likely be about the same as a 100 MHz 54xx when it comes to the
complete application - if you have one which would tolerate only
32 bit accumulator width. The 54xx operates on every memory address
as if it were a register, and a (great) number of on-chip DMACs can
access all that space without incurring any delay to the program flow
at all; you cannot just neglect all that overhead.
> Now why do you insist
> that you need at least 3 loads per MAC?
I did not insist - but here you go: two accesses for the data and one
for the opcode, as it is in your case (the 54xx can stop fetching in
loop mode; it is indeed highly specialized, and I also prefer to
program normal processors). And yes, there are opcodes which make 3
data accesses per cycle on the 54xx.

Finally, the 54xx is almost 10 years of age now, no wonder there are
newer candidates for its job. The ARM architecture is not bad - they
have been learning from the right sources (68k and PPC) - and it does
have the potential to compete for some DSP applications. If only it
had 32 registers and could evolve into 64 bits ...

And more finally, may I suggest that we include some information
about ourselves whenever this is relevant. Had I known Wilco was
directly associated with ARM, I would have been a lot less willing to
support his agenda by contributing to a discussion. (I have no
interest in either TI or Freescale or other PPC manufacturers.)

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------

Wilco Dijkstra wrote:
"Didi" <dp@tgi-sci.com> wrote in message
news:1142346935.115166.318960@i39g2000cwa.googlegroups.com...
>> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
>> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
>> (0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
>> easily outperforms the 5420 at the same frequency.
>
> Not at all. I had a look at the ARM11 architecture, and the first
> thing I saw was that there are no 40 bit accumulators. You are
> going to need them if you want to compete with the 54xx series
> for my (and many other) DSP applications.
The architecture supports 32-bit and 64-bit accumulators. For many
purposes (graphics for example), 32 bit is more than enough. 64-bit
accumulators need more registers and are slower in some cases.
A common trick is to use 32-bit accumulators for several iterations,
then do a 64-bit accumulate. This allows the inner loop to run at
optimal speed without overflow (you can precompute how many iterations
are possible without overflow).

> Also, the 54xx have a FIRS instruction for symmetric filters which
> does two MACs per cycle.
I'd say acc += (A + B) * C does 1 MAC and 1 ADD, not 2 MACs... But yes, it would run twice as fast effectively. The ARM11 version would run faster too of course, my guess is that the 54xx would be around 30% faster in this case.
> Then there are details like memory bandwidth - you can have all the
> coefficients cached but you cannot - generally you always have a
> miss - on the incoming data, and then you probably have all the snoop
> issues with the DMA pushing the data to memory etc. etc.
The DMA stores the data in DTCM (fast local memory), which doesn't have any issues associated with caches, such as misses and consistency etc. So there is really no cost in accessing the incoming data - that's the point of DMA!
> At the end of the day, you will find out that your 500 MHz ARM will
> likely be about the same as a 100 MHz 54xx when it comes
> to the complete application - if you have one which would tolerate
> only 32 bit accumulator width.
This is mind boggling. I showed you actual code that uses 75Mhz on the ARM11 doing the same work as a 54xx at 100Mhz, and then you do some handwaving and suddenly it needs 500Mhz? Can you explain where the other 425Mhz is going? (Not on cache misses)
> The 54xx operates on every memory address as if it were a register,
> and a (great) number of on-chip DMACs can access all that space
> without incurring any delay to the program flow at all, you cannot
> just neglect all that overhead.
Caches allow you to do the same; it's why they exist. DSP programs
generally exhibit ideal cache behaviour compared to general purpose
programs. So they behave like fast high bandwidth memory without any
overhead. But if you dislike caches, there are TCMs anyway.
>> Now why do you insist
>> that you need at least 3 loads per MAC?
>
> I did not insist - but here you go, two accesses for the data and one
> for the opcode as it is in your case (the 54xx can stop fetching in
> loop mode, it is indeed highly specialized, I also prefer to program
> normal processors). And yes there are opcodes which make 3
> data accesses per cycle on the 54xx.
I agree the 54xx can do 3 memory accesses per cycle, however what I was asking is why you think that is the only way another CPU could achieve the same performance? My code example proves you don't.
> Finally, the 54xx is almost 10 years of age now, no wonder there are
> newer candidates for its job. The ARM architecture is not bad,
> they have been learning from the right sources (68k and PPC),
> and it does have the potential to compete for some DSP applications.
> If it only had 32 registers and could evolve into 64 bits ...
32 registers would have been nice indeed, but it's not a big problem in integer code. SIMD always wants more though, so the next generation uses 32 64-bit registers.
> And more finally, may I suggest that we include some information
> about ourselves whenever this is relevant. Had I known Wilco
> was directly associated with ARM, I would have been a lot
> less willing to support his agenda by contributing to a discussion.
> (I have no interest in either TI or Freescale or other PPC
> manufacturers.)
In what sense is what I do or who I work for relevant to this
discussion? Would it make what I wrote any less true? In my spare time
I post about subjects I'm interested in, that's all. I could argue you
have a hidden agenda by repeatedly posting false statements about how
much faster DSPs are compared to general purpose processors with DSP
extensions.

Wilco
Wilco Dijkstra wrote:

> In what sense is what I do or who I work for relevant to this discussion?
> Would it make what I wrote any less true? In my spare time I post about
> subjects I'm interested in, that's all. I could argue you have a hidden
> agenda by repeatedly posting false statements about how much faster
> DSPs are compared to general purpose processors with DSP extensions.
I agree. Your postings on this topic have been very professional and well informed. It doesn't matter whether you're speaking for yourself or your employer. Keep up the good work.
> 64-bit accumulators need more registers and are slower in some cases.
> A common trick is to use 32-bit accumulators for several iterations,
> then do a 64-bit accumulate.
So much for your code example. Which registers do you use for the
64 bit accumulate? Depending on the incoming signal and the
coefficients, you may need to do a 64 bit accumulate every few loops,
so accumulating in memory will not do any good.
> I'd say acc += (A + B) * C does 1 MAC and 1 ADD, not 2 MACs...
> But yes, it would run twice as fast effectively.
Because it does 3 data accesses per cycle, yes.
> The DMA stores the data in DTCM (fast local memory), which doesn't
> have any issues associated with caches
Do you have more than 64k of that memory (something like 80 would be
tight but might be enough) to dedicate to that alone? If not, you are
out of business. Remember, 10 MSPS is not sound sampling at a few kSPS,
where some of your examples might be applicable. Pointing to some real
world working product which does 10 MSPS/16 bit data sampling based on
your ARM would help a lot - can you do that?
> Caches allow you to do the same, it's why they exist.
I know what caches are. I have about 20 MB of source text, written over
the past 10 years and running on a PPC - a full-blown OS included, with
MMU, VM and all. Now tell me what caches are about.
> 32 registers would have been nice indeed, but it's not a big problem
> in integer code.
I agree, but it is the major problem in the example you are trying to sell me. They are just too few for the 40 bit accumulate to fit in at the speeds you claim.
> so the next generation uses 32 64-bit registers.
Well, good luck with the next architectural generation. I am sure it
will be better - and the current one, like I said before, is not bad at
all. Perhaps the new architecture will be as successful.
>> The 54xx operates on every memory address as if it were a register,
>> and a (great) number of on-chip DMACs can access all that space
>> without incurring any delay to the program flow at all, you cannot
>> just neglect all that overhead.
>
> Caches allow you to do the same ...
This is wrong. To operate on a cached value, you need at least two
cycles - one to load and one to execute. The wider bus combined with
multiple operands per register minimises and can even beat this, of
course - wherever applicable.

Notice that I am not at all advocating some weird DSP architecture
against a "normal" one. The PPC architecture is by far the most
advanced I know of, and most likely the development will continue in
this direction. DSPs are highly specialised things which, like other
specialised logic circuits in the past, will probably disappear. But,
like I said before, there is still a way to go. Even if we assume the
ARM you are talking about is as good at 130 MHz as a 100 MHz 54xx
(which it is not), matching the 5420 (two 100 MHz cores) will take
260 MHz. How much does it consume at that speed? The 5420 needs
something like 300 mW, with a lot (times) more on-chip memory than
the ARM.
> I could argue you have a hidden agenda ...
So which is my hidden agenda? Yours is the fact that you "forgot" to
mention you were an ARM employee. Using your email address @arm.com
would have been enough for me, but you did not do so. Now tell me you
did not know this was unethical.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------

Wilco Dijkstra wrote:
