> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
> (0.2 loads per MAC or 0.25 loads per cycle). So a 100 MHz ARM11
> easily outperforms the 5420 at the same frequency.
Not at all. I had a look at the ARM11 architecture, and the first
thing I saw was that there are no 40 bit accumulators. You are
going to need them if you want to compete with the 54xx series
for my (and many other) DSP applications. Also, the 54xx have a
FIRS instruction for symmetric filters which does two MACs per cycle.
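For readers unfamiliar with the symmetric-filter trick mentioned above, here is a minimal C sketch of the folding it relies on. It is only an illustration, not TI code: the function names are hypothetical, the tap count is assumed even, and the coefficients are assumed to satisfy h[i] == h[ntaps-1-i]. Two taps that share a coefficient are added first, so each multiply covers two taps.

```c
#include <stdint.h>
#include <stddef.h>

/* Direct form: one multiply-accumulate per tap. */
int64_t fir_direct(const int16_t *x, const int16_t *h, size_t ntaps)
{
    int64_t acc = 0;
    for (size_t i = 0; i < ntaps; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Folded form, assuming h[i] == h[ntaps-1-i] and ntaps even:
 * the two samples that share a coefficient are summed first,
 * so only ntaps/2 multiplies are needed and only the first
 * half of h is ever read. */
int64_t fir_symmetric(const int16_t *x, const int16_t *h, size_t ntaps)
{
    int64_t acc = 0;
    for (size_t i = 0; i < ntaps / 2; i++) {
        int32_t pair = (int32_t)x[i] + (int32_t)x[ntaps - 1 - i];
        acc += (int64_t)pair * h[i];
    }
    return acc;
}
```

Both functions return the same sum for a symmetric coefficient set; the folded one does it with half the multiplies.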
Then there are details like memory bandwidth. You can have all the
coefficients cached, but not the incoming data, where you will
generally always have a miss, and then you probably have all the snoop
issues with the DMA pushing the data to memory, etc.
In the end, you will find that your 500 MHz ARM will
likely be about the same as a 100 MHz 54xx when it comes
to the complete application - if you have one which can tolerate
an accumulator only 32 bits wide.
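To make the accumulator-width point concrete, here is a small C sketch. It is a model under stated assumptions, not a description of either chip: wrap_bits and mac_wrapped are hypothetical helpers, and the accumulator is modelled as wrapping around on overflow (real hardware may saturate or set a flag instead).

```c
#include <stdint.h>
#include <stddef.h>

/* Keep only the low `bits` bits of v and sign-extend them,
 * modelling a two's-complement accumulator of that width.
 * Assumes 1 < bits < 63. */
static int64_t wrap_bits(int64_t v, unsigned bits)
{
    uint64_t mask = (1ULL << bits) - 1;
    uint64_t u = (uint64_t)v & mask;
    int64_t r = (int64_t)u;
    if (u >> (bits - 1))          /* sign bit of the narrow value set */
        r -= (int64_t)1 << bits;
    return r;
}

/* Sum of 16x16-bit products in an accumulator `bits` wide. */
int64_t mac_wrapped(const int16_t *x, const int16_t *h,
                    size_t n, unsigned bits)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = wrap_bits(acc + (int32_t)x[i] * h[i], bits);
    return acc;
}
```

With full-scale inputs (all 32767), the third product already pushes the sum past the 32-bit signed range, so mac_wrapped(..., 32) goes wrong, while the 8 extra guard bits of a 40-bit accumulator leave room for a few hundred such worst-case products.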
The 54xx operates on every memory address as if it were a register,
and a (great) number of on-chip DMACs can access all that space
without incurring any delay to the program flow at all. You cannot
just neglect all that overhead.
> Now why do you insist
> that you need at least 3 loads per MAC?
I did not insist - but here you go, two accesses for the data and one
for the opcode as it is in your case (the 54xx can stop fetching in
loop mode, it is indeed highly specialized, I also prefer to program
normal processors). And yes there are opcodes which make 3
data accesses per cycle on the 54xx.
Finally, the 54xx is almost 10 years of age now, no wonder there are
newer candidates for its job. The ARM architecture is not bad,
they have been learning from the right sources (68k and PPC),
and it does have the potential to compete for some DSP applications.
If it only had 32 registers and could evolve into 64 bits ...
And more finally, may I suggest that we include some information
about ourselves whenever this is relevant. Had I known Wilco
was directly associated with ARM, I would have been a lot
less willing to support his agenda by contributing to a discussion.
(I have no interest in TI, Freescale, or other PPC
manufacturers.)
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
Wilco Dijkstra wrote:
> "Didi" <dp@tgi-sci.com> wrote in message
> news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
> >> There is no need to do 3 independent accesses per cycle. This is a
> >> very inefficient way of increasing bandwidth and that is why modern
> >> CPUs increase the width of buses instead.
> >
> > This tells me you have never actually done any DSP programming.
> > Please correct me if I am wrong (I certainly mean no offence).
>
> You're wrong. For example I've written a highly optimised JPEG
> (de)compressor on ARM using software SIMD techniques.
>
> >> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
> >> This is clearly faster than reading 16-bits from 3 independent address
> >> per cycle, right?
> >
> > No. Every 16 bit value has a separate address, which is - in the case
> > of the 5420 - another 16 bits. I will not go into explanation why this
> > is so, I guess there are sufficient books on digital signal processing
> > around.
>
> I know why low and mid-end DSPs do this, however there are major
> limitations with this approach. Alternatives exist which do not have these
> limitations, and general purpose CPUs use these to improve DSP
> performance without needing the traditional features of a DSP.
>
> My point is that these alternatives allow modern general purpose
> CPUs to easily beat traditional DSPs.
>
> >> There is no need to do several independent accesses per cycle as
> >> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
> >> is only 800MBytes/s. Just the data bandwidth between the core and L1
> >> is 4GBytes/s on a 500 MHz ARM11 for example.
> >
> > Here we go again, you don't want to believe DSPs have been
> > designed as they are because of necessity.
>
> It's not necessity, more a particular design approach (like RISC/CISC).
> It works fine at the low end, but it is simply not scalable. If you use it
> like a dogma then you'll crash and burn, just like CPUs that were too
> CISCy or RISCy...
>
> > OK, I'll try a
> > general example. There is an area - say, 4 kilobytes - with
> > the coefficients, there is a circular queue - say, 64 k - with
> > the incoming data, and there is another circular queue - say,
> > again 64 k - with the filtered results. All are 16 bits wide,
> > you do 4k MACs per sample each time starting one address
> > further in the input queue and write the result to the output
> > queue.
> > Can you tell me how you do this without separate addresses
> > (especially on the ARM where the registers are so scarce)?
>
> The standard way of doing FIR filters is to block them. This
> reduces the memory bandwidth requirements by the blocking
> factor. Here is what a 4x4 filter looks like on ARM11 -
> since you don't like C, this is ARM assembly language :-)
>
> fir_loop
> LDM x!,{x45,x67} ; load 4 16-bit input values and post inc
> SMLAD a0,x01,c01,a0 ; do 2 16-bit MACs
> SMLAD a1,x01,d01,a1
> LDM c!,{c23,d23} ; load 4 16-bit coefficients
> SMLAD a2,x23,c01,a2
> SMLAD a3,x23,d01,a3
> LDM c!,{c01,d01} ; load 4 16-bit coefficients
> SMLAD a0,x23,c23,a0
> SMLAD a1,x23,d23,a1
> SMLAD a2,x45,c23,a2
> SMLAD a3,x45,d23,a3
>
> ... repeat another time with x45<->x01 and x67<->x23 swapped
>
> TST c,#mask ; test for end of loop
> BNE fir_loop ; branch back - 24 instructions total
>
> This code uses 4 accumulators a0-a3, 8 coefficients c0-c3 and d0-d3,
> 8 input values x0-x7, a coefficient address and an input pointer - total
> 14 registers (2 16-bit values fit in a 32-bit register).
> The coefficient array is duplicated to avoid alignment issues and
> interleaved to avoid the need of a second pointer. There is no need
> for a loop counter as we can use the coefficient pointer.
> The instructions are scheduled to avoid any interlocks.
>
> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
> (0.2 loads per MAC or 0.25 loads per cycle). So a 100 MHz ARM11
> easily outperforms the 5420 at the same frequency.
>
> FIR filters are clearly MAC rather than bandwidth bound. If we could
> do 4 MACs per cycle, the loop would go faster. Now why do you insist
> that you need at least 3 loads per MAC?
>
> Wilco
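
For anyone following the blocking argument in the quoted message without an ARM manual to hand, here is a rough C sketch of the idea. It is not a translation of the assembly above: fir_naive, fir_block4 and the loads counter are hypothetical, the output convention y[n] = sum of x[n+k]*c[k] is an assumption, nout is assumed to be a multiple of 4, and x must hold nout + ntaps - 1 samples. Four outputs are computed together so that each coefficient and each new input sample is loaded once and reused four times.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference: one output at a time, two loads per MAC. */
void fir_naive(const int16_t *x, const int16_t *c, size_t ntaps,
               int64_t *y, size_t nout, size_t *loads)
{
    for (size_t n = 0; n < nout; n++) {
        int64_t acc = 0;
        for (size_t k = 0; k < ntaps; k++) {
            acc += (int32_t)x[n + k] * c[k];
            *loads += 2;              /* one sample, one coefficient */
        }
        y[n] = acc;
    }
}

/* Blocked by 4: a sliding window of inputs is kept in locals,
 * so each inner step loads one coefficient and one new sample
 * and performs four MACs. */
void fir_block4(const int16_t *x, const int16_t *c, size_t ntaps,
                int64_t *y, size_t nout, size_t *loads)
{
    for (size_t n = 0; n + 4 <= nout; n += 4) {
        int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        int32_t x0 = x[n], x1 = x[n + 1], x2 = x[n + 2];
        *loads += 3;                  /* prime the window */
        for (size_t k = 0; k < ntaps; k++) {
            int32_t ck = c[k];
            int32_t x3 = x[n + k + 3];
            *loads += 2;
            a0 += (int64_t)x0 * ck;
            a1 += (int64_t)x1 * ck;
            a2 += (int64_t)x2 * ck;
            a3 += (int64_t)x3 * ck;
            x0 = x1; x1 = x2; x2 = x3;  /* slide the window */
        }
        y[n] = a0; y[n + 1] = a1; y[n + 2] = a2; y[n + 3] = a3;
    }
}
```

In this sketch the blocked loop needs roughly 0.5 loads per MAC instead of 2. The quoted assembly gets lower still because it additionally packs two 16-bit values into each register and fetches two registers per LDM.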