Reply by Wilco Dijkstra November 14, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfkp5n$pjh$1@aioe.org...
> "Wilco Dijkstra" <Wilco.removethisDijkstra@ntlworld.com> wrote in message
> news:7RkTk.45327$nA3.22941@newsfe03.ams2...
>>
>> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfkcrn$7jm$1@aioe.org...
>>>>>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>>>>>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>>>>>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>>>>>> themselves.
>
> So the AVR32 inner loop is only 2-3 x faster than the Cortex-M3.
> Yes, no one in their right mind would switch for such
> a meagre performance increase ;-)

Actually the worst case is 2.5x, but as steve said earlier, actual measurements taking flash speed etc. into account are closer to 1.5x. Either way, that's not close at all to the claimed 11x difference. For better DSP performance and more MHz most people would use ARM9E instead (it's used in many hard drives).

Wilco
Reply by Ulf Samuelsson November 14, 2008
"Wilco Dijkstra" <Wilco.removethisDijkstra@ntlworld.com> wrote in message
news:7RkTk.45327$nA3.22941@newsfe03.ams2...
>
> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message
> news:gfkcrn$7jm$1@aioe.org...
>>>>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>>>>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>>>>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>>>>> themselves.
>

So the AVR32 inner loop is only 2-3 x faster than the Cortex-M3.
Yes, no one in their right mind would switch for such
a meagre performance increase ;-)

>>>> And you are comparing 3 MACs with 6 MACs.
>>>>
>>>> 6 MACs from memory using AVR32 = 13 clocks.
>>>> 6 MACs from memory using CM3 = 52 clocks or 4 x difference.
>>>
>>> No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.
>
>> OK,
>> I see that now, where do you check for saturation?
>
> There is usually no need to check for saturation unless you have 16-bit
> ADCs (rare). With saturation it would be 32 cycles.
>
> Wilco

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Wilco Dijkstra November 14, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfkcrn$7jm$1@aioe.org...
>>>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>>>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>>>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>>>> themselves.

>>> And you are comparing 3 MACs with 6 MACs.
>>>
>>> 6 MACs from memory using AVR32 = 13 clocks.
>>> 6 MACs from memory using CM3 = 52 clocks or 4 x difference.
>>
>> No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.

> OK,
> I see that now, where do you check for saturation?

There is usually no need to check for saturation unless you have 16-bit ADCs (rare). With saturation it would be 32 cycles.

Wilco
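Saturation in this context means clamping the accumulated sum to the range of the output word instead of letting it wrap around. A minimal Python sketch of that clamp, assuming a signed 16-bit output; the function name `saturate_signed` and the sample values are illustrative only and do not come from the thread, and the sketch says nothing about how many cycles the check costs on either core.

```python
def saturate_signed(value, bits=16):
    """Clamp value to the signed range representable in `bits` bits."""
    lo = -(1 << (bits - 1))        # -32768 for 16 bits
    hi = (1 << (bits - 1)) - 1     #  32767 for 16 bits
    return max(lo, min(hi, value))


# Hypothetical accumulator values after a run of MACs.
in_range = saturate_signed(123)
too_big = saturate_signed(40000)
too_small = saturate_signed(-40000)

print(in_range, too_big, too_small)
```

Without the clamp, a sum that overflows the output width would wrap and change sign, which is why a DSP kernel checks for it when the input data can actually reach full scale.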
Reply by Ulf Samuelsson November 14, 2008
>>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>>> themselves.
>>
>> And you are comparing 3 MACs with 6 MACs.
>>
>> 6 MACs from memory using AVR32 = 13 clocks.
>> 6 MACs from memory using CM3 = 52 clocks or 4 x difference.
>
> No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.
>
> Wilco

OK,
I see that now, where do you check for saturation?

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Wilco Dijkstra November 14, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfhl33$h59$1@aioe.org...

> And you are comparing 3 MACs with 6 MACs.

No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.

Wilco
Reply by Wilco Dijkstra November 13, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfhl33$h59$1@aioe.org...

>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>> themselves.
>
> And you are comparing 3 MACs with 6 MACs.
>
> 6 MACs from memory using AVR32 = 13 clocks.
> 6 MACs from memory using CM3 = 52 clocks or 4 x difference.

No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.

Wilco
Reply by Wilco Dijkstra November 13, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfhl33$h59$1@aioe.org...
>>> ==> 66 MHz (with a 100% unrolled loop)
>>>     I.E: n = 6 =>
>>>
>>>     LOAD    1 clock
>>>     LOAD    1 clock
>>>     MAC     1 clock
>>>     MAC     1 clock
>>>     LOAD    1 clock
>>>     LOAD    1 clock
>>>     MAC     1 clock
>>>     MAC     1 clock
>>>     LOAD    1 clock
>>>     LOAD    1 clock
>>>     MAC     1 clock
>>>     MAC     1 clock
>>>     ; Hidden writeback: 1 clock
>>
>> On Cortex-M3 this would take the following sequence:
>>
>>     LDRH r2, [r0,#0]
>>     LDRH r3, [r0,#2]
>>     LDRH r4, [r0,#4]
>>     LDRH r5, [r1,#0]
>>     LDRH r6, [r1,#2]
>>     LDRH r7, [r1,#4]
>>     MLA r8,r2,r5,r8
>>     MLA r8,r3,r6,r8
>>     MLA r8,r4,r7,r8
>>
>> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
>> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
>> times slower is a total lie. Those Atmel marketeers should be ashamed of
>> themselves.
>
> And you are comparing 3 MACs with 6 MACs.
>
> 6 MACs from memory using AVR32 = 13 clocks.
> 6 MACs from memory using CM3 = 52 clocks or 4 x difference.

No, read again. It's 13 cycles to do 3 MACs, so 26 to do 6 MACS.

Wilco
Reply by Ulf Samuelsson November 13, 2008
>> ==> 66 MHz (with a 100% unrolled loop)
>>     I.E: n = 6 =>
>>
>>     LOAD    1 clock
>>     LOAD    1 clock
>>     MAC     1 clock
>>     MAC     1 clock
>>     LOAD    1 clock
>>     LOAD    1 clock
>>     MAC     1 clock
>>     MAC     1 clock
>>     LOAD    1 clock
>>     LOAD    1 clock
>>     MAC     1 clock
>>     MAC     1 clock
>>     ; Hidden writeback: 1 clock
>
> On Cortex-M3 this would take the following sequence:
>
>     LDRH r2, [r0,#0]
>     LDRH r3, [r0,#2]
>     LDRH r4, [r0,#4]
>     LDRH r5, [r1,#0]
>     LDRH r6, [r1,#2]
>     LDRH r7, [r1,#4]
>     MLA r8,r2,r5,r8
>     MLA r8,r3,r6,r8
>     MLA r8,r4,r7,r8
>
> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
> times slower is a total lie. Those Atmel marketeers should be ashamed of
> themselves.

And you are comparing 3 MACs with 6 MACs.

6 MACs from memory using AVR32 = 13 clocks.
6 MACs from memory using CM3 = 52 clocks or 4 x difference.

> Wilco

--
Best Regards,
Ulf Samuelsson
ulf@a-t-m-e-l.com
This message is intended to be my own personal view and it
may or may not be shared by my employer Atmel Nordic AB
Reply by steve November 12, 2008
On Nov 12, 6:15 am, "Wilco Dijkstra"
<Wilco.removethisDijks...@ntlworld.com> wrote:
> "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote in message news:gfe7ts$cu6$1@aioe.org...
> > "steve" <bungalow_st...@yahoo.com> wrote in message
> > I've noticed in the Atmel slides packages they say FIR filter is 11
> > times faster than on a CortexM3. That is hard to believe, not sure
> > why, Cortex is 2 cycle MAC, AVR32 is single cycle, maybe with the 2
> > wait states on Cortex FLASH they came up with that number?
>
> > ==> Not only that.
> >     I am not sure about 11 times though.
>
> Indeed, people are still spreading lies about Cortex-M3 as usual.
>
> >     You win by having
> >     * 1 clock cycle load instructions.
> >       Cortex-M3 implementations are at least 2, maybe more
>
> Cortex-M3 loads are 2 cycles unless the next instruction is a load or
> store, in which case it is 1 cycle. So a sequence of N loads takes
> N+1 cycles.
>
> >     * The ability to use the upper part of the 32 bit register
> >       for MAC instructions, so you load TWO samples/coefficients
> >       in a single clock cycle.
>
> >       The unrolled loop then becomes:
>
> >       LOAD    1 clock
> >       LOAD    1 clock
> >       MAC     1 clock
> >       MAC     1 clock
>
> This is the same trick as the ARM9E introduced a long time ago.
>
> >     * The AVR32 runs with 1 waitstate, while the STM32 runs with 2.
>
> The Luminary Cortex-M3 cores run with 0 wait states. But even with a
> wait state you don't necessarily see a slowdown if the fetch width is at
> least 64 bits (3-4 Thumb-2 instructions). Waitstates primarily slow down
> branches.
>
> > * Sustained 33 DSP MIPS when doing vector sums
> >     for (sum = 0, i = 0; i < n; i++) sum = sum + C[i] * X[i];
>
> > * The last feature is instructions which handle saturation
> >     the way a DSP should, and this has to be handled
> >     manually in other RISCs like CM3
>
> Actually Cortex-M3 has a saturate instruction.
>
> > the 33 MIPS is at what clock speed?
>
> > ==> 66 MHz (with a 100% unrolled loop)
> >     I.E: n = 6 =>
>
> >     LOAD    1 clock
> >     LOAD    1 clock
> >     MAC     1 clock
> >     MAC     1 clock
> >     LOAD    1 clock
> >     LOAD    1 clock
> >     MAC     1 clock
> >     MAC     1 clock
> >     LOAD    1 clock
> >     LOAD    1 clock
> >     MAC     1 clock
> >     MAC     1 clock
> >     ; Hidden writeback: 1 clock
>
> On Cortex-M3 this would take the following sequence:
>
>     LDRH r2, [r0,#0]
>     LDRH r3, [r0,#2]
>     LDRH r4, [r0,#4]
>     LDRH r5, [r1,#0]
>     LDRH r6, [r1,#2]
>     LDRH r7, [r1,#4]
>     MLA r8,r2,r5,r8
>     MLA r8,r3,r6,r8
>     MLA r8,r4,r7,r8
>
> The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles.
> That is exactly twice as slow as AVR32 on the above code. So the claim of 11
> times slower is a total lie. Those Atmel marketeers should be ashamed of
> themselves.
>
> Wilco
Ok, I took the Atmel published FIR filter cycle count and the STM FIR filter cycle count, both from their websites (using their optimized in-house DSP packages):

http://www.atmel.com/dyn/resources/prod_documents/doc32076.pdf
http://www.st.com/stonline/products/literature/um/14988.pdf

Of course the two don't give data on the same size FIR filter, so I have to normalize...

For Atmel, a 64 point, 24 tap, 41 output FIR takes 2,439 cycles, which is 41*24 = 984 MACs, for a ratio of 2.478 cycles/MAC.

For STM Cortex at full speed (2 wait states), a 63 point, 32 tap, 32 output FIR takes 3,929 cycles, which is 32*32 = 1024 MACs, for a ratio of 3.83 cycles/MAC, a difference of 1.54x.

At zero wait states (below 24 MHz) STM reports 3,478 cycles, so 3.396 cycles/MAC, a difference of 1.37 times.
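The normalization above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the cycle counts and filter sizes are the ones quoted in this post, the helper names are made up for illustration, and the assumption that the number of outputs equals points - taps + 1 is mine (it happens to match both quoted output counts).

```python
def valid_outputs(points, taps):
    """Assumed relation: outputs where the filter fully overlaps the input."""
    return points - taps + 1


def cycles_per_mac(cycles, taps, outputs):
    """Each output sample costs one multiply-accumulate per tap."""
    return cycles / (taps * outputs)


# Figures as quoted in the post.
avr32 = cycles_per_mac(2439, taps=24, outputs=valid_outputs(64, 24))
stm32_2ws = cycles_per_mac(3929, taps=32, outputs=valid_outputs(63, 32))
stm32_0ws = cycles_per_mac(3478, taps=32, outputs=valid_outputs(63, 32))

print(f"AVR32:                 {avr32:.3f} cycles/MAC")
print(f"STM32, 2 wait states:  {stm32_2ws:.3f} cycles/MAC "
      f"({stm32_2ws / avr32:.2f}x AVR32)")
print(f"STM32, 0 wait states:  {stm32_0ws:.3f} cycles/MAC "
      f"({stm32_0ws / avr32:.2f}x AVR32)")
```

The ratios printed this way may differ from the figures in the post in the last digit, depending on whether intermediate values are rounded or truncated.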
Reply by Wilco Dijkstra November 12, 2008
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:gfe7ts$cu6$1@aioe.org...
> "steve" <bungalow_steve@yahoo.com> wrote in message
> I've noticed in the Atmel slides packages they say FIR filter is 11
> times faster than on a CortexM3. That is hard to believe, not sure
> why, Cortex is 2 cycle MAC, AVR32 is single cycle, maybe with the 2
> wait states on Cortex FLASH they came up with that number?
>
> ==> Not only that.
>     I am not sure about 11 times though.
Indeed, people are still spreading lies about Cortex-M3 as usual.
> You win by having
> * 1 clock cycle load instructions.
>   Cortex-M3 implementations are at least 2, maybe more
Cortex-M3 loads are 2 cycles unless the next instruction is a load or store, in which case it is 1 cycle. So a sequence of N loads takes N+1 cycles.
> * The ability to use the upper part of the 32 bit register
>   for MAC instructions, so you load TWO samples/coefficients
>   in a single clock cycle.
>
>   The unrolled loop then becomes:
>
>   LOAD    1 clock
>   LOAD    1 clock
>   MAC     1 clock
>   MAC     1 clock
This is the same trick as the ARM9E introduced a long time ago.
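The trick under discussion is to fetch two 16-bit values with one 32-bit load and then multiply-accumulate the low and high halves separately. A rough Python model of the arithmetic, with made-up helper names (`pack16x2`, `dot_packed`) and an unbounded accumulator; it shows the data flow only and makes no claim about the instruction set or timing of any particular core.

```python
def pack16x2(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def sext16(v):
    """Sign-extend the low 16 bits of v."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def dot_packed(coeff_words, sample_words):
    """Dot product where every 32-bit word carries two 16-bit operands."""
    acc = 0
    for c, x in zip(coeff_words, sample_words):
        acc += sext16(c) * sext16(x)              # MAC on the low halves
        acc += sext16(c >> 16) * sext16(x >> 16)  # MAC on the high halves
    return acc


# Four coefficients and four samples travel as two words each.
C = [pack16x2(1, -2), pack16x2(3, 4)]
X = [pack16x2(5, 6), pack16x2(-7, 8)]
result = dot_packed(C, X)   # 1*5 + (-2)*6 + 3*(-7) + 4*8
print(result)
```

Per pair of MACs this needs two word loads instead of four halfword loads, which is where the LOAD, LOAD, MAC, MAC pattern quoted above comes from.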
> * The AVR32 runs with 1 waitstate, while the STM32 runs with 2.
The Luminary Cortex-M3 cores run with 0 wait states. But even with a wait state you don't necessarily see a slowdown if the fetch width is at least 64 bits (3-4 Thumb-2 instructions). Waitstates primarily slow down branches.
> * Sustained 33 DSP MIPS when doing vector sums
>     for (sum = 0, i = 0; i < n; i++) sum = sum + C[i] * X[i];
>
> * The last feature is instructions which handle saturation
>     the way a DSP should, and this has to be handled
>     manually in other RISCs like CM3
Actually Cortex-M3 has a saturate instruction.
> the 33 MIPS is at what clock speed?
>
> ==> 66 MHz (with a 100% unrolled loop)
>     I.E: n = 6 =>
>
>     LOAD    1 clock
>     LOAD    1 clock
>     MAC     1 clock
>     MAC     1 clock
>     LOAD    1 clock
>     LOAD    1 clock
>     MAC     1 clock
>     MAC     1 clock
>     LOAD    1 clock
>     LOAD    1 clock
>     MAC     1 clock
>     MAC     1 clock
>     ; Hidden writeback: 1 clock

On Cortex-M3 this would take the following sequence:

    LDRH r2, [r0,#0]
    LDRH r3, [r0,#2]
    LDRH r4, [r0,#4]
    LDRH r5, [r1,#0]
    LDRH r6, [r1,#2]
    LDRH r7, [r1,#4]
    MLA r8,r2,r5,r8
    MLA r8,r3,r6,r8
    MLA r8,r4,r7,r8

The LDRHs take 7 cycles (6 + 1), the MLAs take 6 cycles, or in total 26 cycles. That is exactly twice as slow as AVR32 on the above code. So the claim of 11 times slower is a total lie. Those Atmel marketeers should be ashamed of themselves.

Wilco
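The cycle count in this post can be checked with a small Python model. It is a sketch under the assumptions stated in the thread (a run of N back-to-back loads costs N + 1 cycles, each MLA costs 2 cycles, zero wait states, no loop overhead); the function names are illustrative, and the AVR32 figure of 13 clocks for 6 MACs is taken from the quoted unrolled loop rather than measured.

```python
MLA_CYCLES = 2  # per the thread: the Cortex-M3 MAC takes 2 cycles


def load_run_cycles(n_loads):
    """A run of N consecutive loads takes N + 1 cycles (rule from the post)."""
    return n_loads + 1 if n_loads > 0 else 0


def cm3_block_cycles(n_macs):
    """One unrolled block: two halfword loads per MAC, then the MLAs."""
    return load_run_cycles(2 * n_macs) + MLA_CYCLES * n_macs


block = cm3_block_cycles(3)   # the 9-instruction sequence shown: 7 + 6
cm3_six_macs = 2 * block      # two such blocks give 6 MACs
avr32_six_macs = 13           # 12 one-clock instructions + 1 hidden writeback
ratio = cm3_six_macs / avr32_six_macs

print(f"Cortex-M3, 3 MACs: {block} cycles")
print(f"Cortex-M3, 6 MACs: {cm3_six_macs} cycles")
print(f"Ratio to AVR32:    {ratio:.1f}x")
```

Read this way, the 26 cycles in the post refer to two passes through the 3-MAC sequence, which is what makes it comparable to the 6-MAC AVR32 listing.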