> 64-bit accumulators need more registers and are slower in some cases.
> A common trick is to use 32-bit accumulators for several iterations,
> then do a 64-bit accumulate

So much for your code example. Which registers do you use for
the 64-bit accumulate? Depending on the incoming signal and the
coefficients, you may need to do a 64-bit accumulate every few loops,
so accumulating in memory will not do any good.

> I'd say acc += (A + B) * C does 1 MAC and 1 ADD, not 2 MACs...
> But yes, it would run twice as fast effectively.

Because it does 3 data accesses per cycle, yes.

> The DMA stores the data in DTCM (fast local memory), which doesn't
> have any issues associated with caches

Do you have > 64k of that memory (something like 80 would be tight but
might be enough) to dedicate to that alone?
 If not, you are out of business. Remember,
10 MSPS is not sound sampling at some KSPS where some of
your examples might be applicable. Pointing to some
real world working product which does use 10 MSPS/16 bit data
sampling based on your ARM will help a lot - can you do that?

> Caches allow you do the same, it's why they exist.

I know what caches are. I have about 20 MB of source text,
running on a PPC, which I have written over the past 10 years,
a full-blown OS included, with MMU, VM and all. Now tell me what
caches are about.

> 32 registers would have been nice indeed, but it's not a big problem
> in integer code.

I agree, but it is the major problem in the example you are trying
to sell me. They are just too few for the 40 bit accumulate to fit
in at the speeds you claim.

> so the next generation uses 32 64-bit registers.

Well, good luck with the next architectural generation. I am sure
it will be better - and the current one, like I said before, is not
bad at all. Perhaps the new architecture will be as successful.

> > The 54xx operates on every memory address as if it were a register,
> > and a (great) number of on-chip DMACs can access all that space
> > without incurring any delay to the program flow at all, you cannot
> > just neglect all that overhead.
>
> Caches allow you do the same ...

This is wrong. To operate on a cached value, you need at least
two cycles - to load and to execute. The wider bus
combined with multiple operands per register minimises and
can even beat this, of course - wherever applicable.
 Notice that I am not at all advocating some weird DSP architecture
against a "normal" one. The PPC architecture is by far the most
advanced I know of and most likely the development will continue
in this direction.
 DSPs are highly specialised things which, like other specialised
logic circuits did in the past, will probably disappear. But, like I
said before, there is still a way to go. Even if we assume the ARM
you are talking about is as good at 130 MHz as a 100 MHz 54xx
(which it is not), matching the 5420 (two 100 MHz cores)
will take 260 MHz. How much does it consume at that speed?
The 5420 needs something like 300 mW, with a lot (times) more
on chip memory than the ARM.

>  I could argue you have a hidden agenda ...

So which is my hidden agenda? Yours is the fact that you
"forgot" to mention you were an ARM employee. Using
your email address @arm.com would have been enough for me,
but you did not do so. Now tell me you did not know this
was unethical.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------




Wilco Dijkstra wrote:

> In what sense is what I do or who I work for relevant to this discussion?
> Would it make what I wrote any less true? In my spare time I post about
> subjects I'm interested in, that's all. I could argue you have a hidden
> agenda by repeatedly posting false statements about how much faster
> DSPs are compared to general purpose processors with DSP extensions.

I agree. Your postings on this topic have been very professional and
well informed. It doesn't matter whether you're speaking for yourself
or your employer. Keep up the good work.

"Didi" <dp@tgi-sci.com> wrote in message
news:1142346935.115166.318960@i39g2000cwa.googlegroups.com...
>> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
>> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
>> (0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
>> easily outperforms the 5420 at the same frequency.
>
> Not at all. I had a look at the ARM11 architecture, and the first
> thing I saw was that there are no 40 bit accumulators. You are
> going to need them if you want to compete with the 54xx series
> for my (and many other) DSP applications.

The architecture supports 32-bit and 64-bit accumulators. For many
purposes (graphics for example), 32 bit is more than enough. 64-bit
accumulators need more registers and are slower in some cases.
A common trick is to use 32-bit accumulators for several iterations,
then do a 64-bit accumulate. This allows the inner loop to run at optimal
speed without overflow (you can precompute how many iterations are
possible without overflow).
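
To make the split-accumulator idea concrete, here is a minimal C sketch.
It is not the ARM11 code under discussion; the function names
safe_block_len and dot_split_acc are made up for illustration, and the
bounds passed in are assumptions the caller has to supply. Note that
with full-scale 16-bit data and full-scale 16-bit coefficients the safe
block length works out to 1, so how much this trick buys depends
entirely on the headroom in the signal and the coefficients.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: how many products can be summed in a signed 32-bit
   accumulator without overflow, given bounds on |x| and |c|. */
static size_t safe_block_len(int32_t max_abs_x, int32_t max_abs_c)
{
    int64_t worst = (int64_t)max_abs_x * max_abs_c;  /* largest |product| */
    if (worst == 0)
        return SIZE_MAX;                             /* nothing can overflow */
    return (size_t)(INT32_MAX / worst);
}

/* Dot product using a 32-bit accumulator inside each block and a 64-bit
   accumulate once per block. */
static int64_t dot_split_acc(const int16_t *x, const int16_t *c,
                             size_t n, size_t block)
{
    int64_t total = 0;
    size_t i = 0;

    if (block == 0)
        block = 1;                                   /* avoid an endless loop */
    while (i < n) {
        size_t end = (n - i > block) ? i + block : n;
        int32_t acc = 0;                             /* fast inner accumulator */
        for (; i < end; i++)
            acc += (int32_t)x[i] * c[i];
        total += acc;                                /* 64-bit accumulate */
    }
    return total;
}
```

For example, 12-bit samples (|x| <= 2048) with full-scale coefficients
(|c| <= 32768) allow 31 products per block before the 64-bit accumulate
is needed.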

> Also, the 54xx have a
> FIRS instruction for symmetric filters which does two MACs per cycle.

I'd say acc += (A + B) * C does 1 MAC and 1 ADD, not 2 MACs...
But yes, it would run twice as fast effectively. The ARM11 version
would run faster too of course, my guess is that the 54xx would
be around 30% faster in this case.
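
The acc += (A + B) * C form can be sketched in C as below. This only
illustrates the arithmetic of a symmetric filter, not the 54xx FIRS
instruction itself; the name fir_symmetric, the even tap count and the
layout of half_h are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Symmetric FIR with an even number of taps, h[k] == h[ntaps-1-k].
   half_h holds only the first ntaps/2 coefficients (assumed layout);
   x points at the oldest of the ntaps samples in the window. */
static int64_t fir_symmetric(const int16_t *x, const int16_t *half_h,
                             size_t ntaps)
{
    int64_t acc = 0;
    for (size_t k = 0; k < ntaps / 2; k++) {
        /* the ADD: the two samples that share one coefficient */
        int32_t pair = (int32_t)x[k] + x[ntaps - 1 - k];
        /* the MAC: one multiply covers two taps */
        acc += (int64_t)pair * half_h[k];
    }
    return acc;
}
```

Each pass through the loop covers two taps with one multiply, which is
where the "twice as fast effectively" comes from.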

> Then there are details like memory bandwidth - you can have all the
> coefficients cached but you cannot - generally you always have a
> miss - on the incoming data, and then you probably have all the snoop
> issues with the DMA pushing the data to memory etc. etc.

The DMA stores the data in DTCM (fast local memory), which doesn't
have any issues associated with caches, such as misses and
consistency etc. So there is really no cost in accessing the incoming
data - that's the point of DMA!

> Under the score, you will find out that your 500 MHz ARM will
> likely be about the same as a 100 MHz 54xx when it comes
> to the complete application - if you have one which would tolerate
> only 32 bit accumulator width.

This is mind-boggling. I showed you actual code that uses 75 MHz on
the ARM11 doing the same work as a 54xx at 100 MHz, and then you
do some handwaving and suddenly it needs 500 MHz? Can you explain
where the other 425 MHz is going? (Not on cache misses.)

> The 54xx operates on every memory address as if it were a register,
> and a (great) number of on-chip DMACs can access all that space
> without incurring any delay to the program flow at all, you cannot
> just neglect all that overhead.

Caches allow you do the same, it's why they exist. DSP programs
generally exhibit ideal cache behaviour compared to general purpose
programs. So they behave like fast high bandwidth memory without
any overhead. But if you dislike caches there are TCMs anyway.

>> Now why do you insist
>> that you need at least 3 loads per MAC?
>
> I did not insist - but here you go, two accesses for the data and one
> for the opcode as it is in your case (the 54xx can stop fetching in
> loop mode, it is indeed highly specialized, I also prefer to program
> normal processors). And yes there are opcodes which make 3
> data accesses per cycle on the 54xx.

I agree the 54xx can do 3 memory accesses per cycle, however what
I was asking is why you think that is the only way another CPU could
achieve the same performance? My code example proves it is not.

> Finally, the 54xx is almost 10 years of age now, no wonder there are
> newer candidates for its job. The ARM architecture is not bad,
> they have been learning from the right sources (68k and PPC),
> and it does have the potential to compete for some DSP applications.
> If it only had 32 registers and could evolve into 64 bits ...

32 registers would have been nice indeed, but it's not a big problem
in integer code. SIMD always wants more though, so the next generation
uses 32 64-bit registers.

> And more finally, may I suggest that we include some information
> about ourselves whenever this is relevant. Had I known Wilco
> was directly associated with ARM I would have been a lot
> less willing to support his agenda by contributing to a discussion.
> (I have no interest in TI, Freescale or other PPC
> manufacturers.)

In what sense is what I do or who I work for relevant to this discussion?
Would it make what I wrote any less true? In my spare time I post about
subjects I'm interested in, that's all. I could argue you have a hidden
agenda by repeatedly posting false statements about how much faster
DSPs are compared to general purpose processors with DSP extensions.

Wilco

> On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
> in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
> (0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
> easily outperforms the 5420 at the same frequency.

Not at all. I had a look at the ARM11 architecture, and the first
thing I saw was that there are no 40 bit accumulators. You are
going to need them if you want to compete with the 54xx series
for my (and many other) DSP applications. Also, the 54xx have a
FIRS instruction for symmetric filters which does two MACs per cycle.
Then there are details like memory bandwidth - you can have all the
coefficients cached but you cannot - generally you always have a
miss - on the incoming data, and then you probably have all the snoop
issues with the DMA pushing the data to memory etc. etc.
Under the score, you will find out that your 500 MHz ARM will
likely be about the same as a 100 MHz 54xx when it comes
to the complete application - if you have one which would tolerate
only 32 bit accumulator width.
 The 54xx operates on every memory address as if it were a register,
and a (great) number of on-chip DMACs can access all that space
without incurring any delay to the program flow at all, you cannot
just neglect all that overhead.

> Now why do you insist
> that you need at least 3 loads per MAC?

I did not insist - but here you go, two accesses for the data and one
for the opcode as it is in your case (the 54xx can stop fetching in
loop mode, it is indeed highly specialized, I also prefer to program
normal processors). And yes there are opcodes which make 3
data accesses per cycle on the 54xx.

Finally, the 54xx is almost 10 years of age now, no wonder there are
newer candidates for its job. The ARM architecture is not bad,
they have been learning from the right sources (68k and PPC),
and it does have the potential to compete for some DSP applications.
If it only had 32 registers and could evolve into 64 bits ...

And more finally, may I suggest that we include some information
about ourselves whenever this is relevant. Had I known Wilco
was directly associated with ARM I would have been a lot
less willing to support his agenda by contributing to a discussion.
(I have no interest in TI, Freescale or other PPC
manufacturers.)

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------


On Tuesday, in article
     <4jwRf.10311$ZJ2.4094@newsfe6-gui.ntli.net>
     Wilco_dot_Dijkstra@ntlworld.com "Wilco Dijkstra" wrote:

>"Didi" <dp@tgi-sci.com> wrote in message
>news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>>> There is no need to do 3 independent accesses per cycle. This is a
>>> very inefficient way of increasing bandwidth and that is why modern
>>> CPUs increase the width of buses instead.
>>
>> This tells me you have never actually done any DSP programming.
>> Please correct me if I am wrong (I certainly mean no offence).
>
>You're wrong. For example I've written a highly optimised JPEG
>(de)compressor on ARM using software SIMD techniques.

Depends on application constraints.

>>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>>> This is clearly faster than reading 16-bits from 3 independent address
>>> per cycle, right?
>>
>> No. Every 16 bit value has a separate address, which is - in the case
>> of the 5420 - another 16 bits. I will not go into explanation why this
>> is so, I guess there are sufficient books on digital signal processing
>> around.
>
>I know why low and mid-end DSPs do this, however there are major
>limitations with this approach. Alternatives exist which do not have these
>limitations, and general purpose CPUs use these to improve DSP
>performance without needing the traditional features of a DSP.
>
>My point is that these alternatives allow modern general purpose
>CPUs to easily beat traditional DSPs.

Not for some applications.

>>> There is no need to do several independent accesses per cycle as
>>> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
>>> is only 800MBytes/s. Just the data bandwidth between the core and L1
>>> is 4GBytes/s on a 500Mhz ARM11 for example.
>>
>> Here we go again, you don't want to believe DSPs have been
>> designed as they are because of necessity.
>
>It's not necessity, more a particular design approach (like RISC/CISC).
>It works fine at the low end, but it is simply not scalable. If you use it
>like a dogma then you'll crash and burn, just like CPUs that were too
>CISCy or RISCy...

Always forcing all data through a processor can for some applications cause
problems.

.......

>On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
>in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
>(0.2 loads per MAC or 0.25 loads per cycle). So a 100Mhz ARM11
>easily outperforms the 5420 at the same frequency.
>
>FIR filters are clearly MAC rather than bandwidth bound. If we could
>do 4 MACs per cycle, the loop would go faster. Now why do you insist
>that you need at least 3 loads per MAC?

I have done various work with real-time video, where the video must have
minimal delay and NO non-deterministic delays or stops (i.e. continuous
operation), often because of other limitations of the system (broadcast
effects, mixing, scaling, or equipment in loops with eye/hand co-ordination).
There are times when you have to have dedicated hardware, as every pixel on
multiple video streams at the same time is undergoing 24 multiplies and 9
adds at pixel rate. I have done standards conversion and rescaling from
input to output in less than 15 input TV lines of delay; most of the delay
was changing the start times for active video due to blanking differences.

Often in these types of applications, the blockiness and latency of frame
delays can screw things up, as all the delays add up.

There are times when the delay does not matter: still images, or open-loop
methodology (e.g. set-top boxes, DVD players, audio players). But there are
others where the closed-loop nature of the WHOLE system means a DSP or fast
processor will not cut it.

Horses for courses, and various other reasons (often internal politics).

-- 
Paul Carpenter          | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>    PC Services
<http://www.gnuh8.org.uk/>              GNU H8 & mailing list info
<http://www.badweb.org.uk/>             For those web sites you hate

"Didi" <dp@tgi-sci.com> wrote in message
news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>> There is no need to do 3 independent accesses per cycle. This is a
>> very inefficient way of increasing bandwidth and that is why modern
>> CPUs increase the width of buses instead.
>
> This tells me you have never actually done any DSP programming.
> Please correct me if I am wrong (I certainly mean no offence).

You're wrong. For example I've written a highly optimised JPEG
(de)compressor on ARM using software SIMD techniques.

>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>> This is clearly faster than reading 16-bits from 3 independent address
>> per cycle, right?
>
> No. Every 16 bit value has a separate address, which is - in the case
> of the 5420 - another 16 bits. I will not go into explanation why this
> is so, I guess there are sufficient books on digital signal processing
> around.

I know why low and mid-end DSPs do this, however there are major
limitations with this approach. Alternatives exist which do not have these
limitations, and general purpose CPUs use these to improve DSP
performance without needing the traditional features of a DSP.

My point is that these alternatives allow modern general purpose
CPUs to easily beat traditional DSPs.

>> There is no need to do several independent accesses per cycle as
>> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
>> is only 800MBytes/s. Just the data bandwidth between the core and L1
>> is 4GBytes/s on a 500Mhz ARM11 for example.
>
> Here we go again, you don't want to believe DSPs have been
> designed as they are because of necessity.

It's not necessity, more a particular design approach (like RISC/CISC).
It works fine at the low end, but it is simply not scalable. If you use it
like a dogma then you'll crash and burn, just like CPUs that were too
CISCy or RISCy...

> OK, I'll try a
> general example. There is an area - say, 4 kilobytes - with
> the coefficients, there is a circular queue - say, 64 k - with
> the incoming data, and there is another circular queue - say,
> again 64 k - with the filtered results. All are 16 bits wide,
> you do 4k MACs per sample each time starting one address
> further in the input queue and write the result to the output
> queue.
> Can you tell me how you do this without separate addresses
> (especially on the ARM where the registers are so scarce)?

The standard way of doing FIR filters is to block them. This
reduces the memory bandwidth requirements by the blocking
factor. An example of what a 4x4 filter looks like on ARM11 -
since you don't like C, this is ARM assembly language :-)

fir_loop
    LDM   x!,{x45,x67}     ; load 4 16-bit input values and post inc
    SMLAD a0,x01,c01,a0    ; do 2 16-bit MACs
    SMLAD a1,x01,d01,a1
    LDM   c!,{c23,d23}     ; load 4 16-bit coefficients
    SMLAD a2,x23,c01,a2
    SMLAD a3,x23,d01,a3
    LDM   c!,{c01,d01}     ; load 4 16-bit coefficients
    SMLAD a0,x23,c23,a0
    SMLAD a1,x23,d23,a1
    SMLAD a2,x45,c23,a2
    SMLAD a3,x45,d23,a3

    ... repeat another time with x45<->x01 and x67<->x23 swapped

    TST   c,#mask          ; test for end of loop
    BNE   fir_loop         ; branch back - 24 instructions total

This code uses 4 accumulators a0-a3, 8 coefficients c0-c3 and d0-d3,
8 input values x0-x7, a coefficient address and an input pointer - total
14 registers (2 16-bit values fit in a 32-bit register).
The coefficient array is duplicated to avoid alignment issues and
interleaved to avoid the need of a second pointer. There is no need
for a loop counter as we can use the coefficient pointer.
The instructions are scheduled to avoid any interlocks.

On ARM11 this computes 8 taps per iteration of 4 outputs (32 MACs)
in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs
(about 0.2 loads per MAC, or 0.25 loads per cycle). So a 100 MHz ARM11
easily outperforms the 5420 at the same frequency.
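
For readers who prefer C, the blocking idea can be sketched as below.
This is not a translation of the assembly above: it leaves out the
packing of two 16-bit values per register, the duplicated and
interleaved coefficient array, and the instruction scheduling. The name
fir_block4 and the convention y[j] = sum over k of h[k] * x[j + k] are
assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Compute four consecutive FIR outputs in one pass over the coefficients:
       y[j] = sum_{k=0}^{ntaps-1} h[k] * x[j + k],   j = 0..3
   x must hold at least ntaps + 3 samples. Each tap costs one coefficient
   load and one new sample load, shared by four MACs. */
static void fir_block4(const int16_t *x, const int16_t *h, size_t ntaps,
                       int32_t y[4])
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;   /* four accumulators */
    int16_t x0 = x[0], x1 = x[1], x2 = x[2];  /* sliding sample window */

    for (size_t k = 0; k < ntaps; k++) {
        int16_t c  = h[k];                    /* one coefficient load */
        int16_t x3 = x[k + 3];                /* one new sample load */
        a0 += (int32_t)x0 * c;
        a1 += (int32_t)x1 * c;
        a2 += (int32_t)x2 * c;
        a3 += (int32_t)x3 * c;
        x0 = x1; x1 = x2; x2 = x3;            /* slide the window in registers */
    }
    y[0] = a0; y[1] = a1; y[2] = a2; y[3] = a3;
}
```

A one-output-at-a-time loop needs two loads per MAC; here it is two
loads per four MACs. The assembly version goes further by fetching
several packed values with each LDM.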

FIR filters are clearly MAC rather than bandwidth bound. If we could
do 4 MACs per cycle, the loop would go faster. Now why do you insist
that you need at least 3 loads per MAC?

Wilco

> ARM expertise is not often free: someone should take
> him up on that offer!

So is DSP and PPC (and much more, for that matter) expertise,
some of which I already gave for free in this thread.
Enjoy.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------


"Didi" <dp@tgi-sci.com> wrote in message 
news:1142297032.012651.187070@i39g2000cwa.googlegroups.com...
>> The usual way to handle this on a general purpose
>> processor is to unroll and pack the loads.
>>
>> Can you explain why this is not applicable in your
>> case?
>
> Are you sure you read my postings. General purpose processors
> are applicable, just at a different performance cost.

I did not intend to ask, "why are general purpose processors
not applicable".  Rather, I was trying to understand your
assertion that you need to do a 16-bit load from three
independent addresses every cycle.

The non-DSP orthodoxy is that this is not necessary, because
you can unroll the loop by 4, merge the loads, and load up to
twelve 16-bit objects from three independent addresses in three
cycles. i.e. even though you can only do one 64-bit load per cycle,
over three cycles, the effect is equivalent.
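
A rough C sketch of "unroll by 4 and merge the loads" follows. It shows
two input streams rather than three, assumes a little-endian target for
the lane extraction, and uses made-up names (load4, lane, dot_packed).
Whether the compiler turns load4 into a single 64-bit load depends on
the target and on alignment, so treat it as an illustration of the data
layout rather than a performance claim.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fetch four adjacent 16-bit values as one 64-bit word. */
static uint64_t load4(const int16_t *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);      /* portable way to express a wide load */
    return w;
}

/* Extract 16-bit lane i (0..3). ASSUMPTION: little-endian byte order,
   so lane 0 is the lowest-addressed value. */
static int16_t lane(uint64_t w, unsigned i)
{
    return (int16_t)(uint16_t)(w >> (16 * i));
}

/* Dot product unrolled by 4; n must be a multiple of 4. Two wide loads
   feed four MACs, i.e. 0.5 loads per MAC instead of 2. */
static int64_t dot_packed(const int16_t *x, const int16_t *c, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint64_t xw = load4(x + i);
        uint64_t cw = load4(c + i);
        for (unsigned j = 0; j < 4; j++)
            acc += (int32_t)lane(xw, j) * lane(cw, j);
    }
    return acc;
}
```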

> If you would explain (perhaps by an example) how you
> want to unroll and pack the trivial filtering example I had given,
> I might be able to explain more.

I cannot see that you have provided any example code.
However, Wilco has already offered to provide an
explanation if you do supply some code:

Wilco Dijkstra wrote:
> Maybe you could show a C snippet of what you do, then I'll show you
> an equivalent one that doesn't need 3 memory accesses per cycle.

ARM expertise is not often free: someone should take
him up on that offer!

Steve.

> The usual way to handle this on a general purpose
> processor is to unroll and pack the loads.
>
> Can you explain why this is not applicable in your
> case?

Are you sure you read my postings? General purpose processors
are applicable, just at a different performance cost. I would estimate
a decent 500 MHz RISC of today's generation could perhaps do what a
100 MHz 54xx DSP can do in terms of real-time signal processing.
 If you would explain (perhaps by an example) how you
want to unroll and pack the trivial filtering example I had given,
I might be able to explain more.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------



"Didi" <dp@tgi-sci.com> wrote in message 
news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>
>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>> This is clearly faster than reading 16-bits from 3 independent address
>> per cycle, right?
>
> No. Every 16 bit value has a separate address, which is - in the case
> of the 5420 - another 16 bits. I will not go into explanation why this
> is so, I guess there are sufficient books on digital signal processing
> around.

The usual way to handle this on a general purpose
processor is to unroll and pack the loads.

Can you explain why this is not applicable in your
case?

Steve.