Math computing time statistics for ARM7TDMI and MSP430

Reply by Wilco Dijkstra ●November 17, 2006

"Tilmann Reh" <tilmannreh@despammed.com> wrote in message 
news:455d6bd4$0$30328$9b4e6d93@newsspool1.arcor-online.net...
> Hello,
>
> for an estimation of required computing time I would like to roughly
> know the time that current controllers need for math operations
> (addition/subtraction, multiplication, division, and also logarithm) in
> single and/or double precision floating point format (assuming common
> compilers).
>
> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> SAM7), and Texas MSP430.
>
> Can anyone provide a link to some statistics?

None of these support floating point in hardware, so it depends on
the libraries you use. Highly optimised FP libraries exist for ARM;
the one I wrote takes about 25 cycles for fadd/fsub, 40 for fmul and
70 for fdiv. Double precision takes almost twice as long.
You would get 500 KFlops quite easily on a 50 MHz ARM7TDMI.
Of course this is highly compiler/library specific. Many are much
slower than this; 5-6x slower for an unoptimised implementation
is fairly typical.
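
As a rough check on these figures: 500 KFlops at 50 MHz corresponds to a
budget of about 100 cycles per operation, which leaves headroom over the
25-70 cycle library calls for loads, stores and call overhead (that split
is an interpretation, not something stated in the post). A minimal C
sketch of the arithmetic; the function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Operations per second for a routine costing `cycles` CPU cycles at a
 * core clock of `clock_hz`. Integer division, so the result rounds down. */
uint32_t ops_per_second(uint32_t clock_hz, uint32_t cycles)
{
    assert(cycles > 0);
    return clock_hz / cycles;
}

/* Time for one operation in nanoseconds (rounds down). */
uint32_t ns_per_op(uint32_t clock_hz, uint32_t cycles)
{
    assert(clock_hz > 0);
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / clock_hz);
}
```

With a 50 MHz clock, a 100-cycle budget gives 500,000 operations per
second, and a single 70-cycle fdiv takes 1400 ns.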

Doing floating point on the MSP, especially double precision,
seems like a bad idea...

Wilco

Reply by Wilco Dijkstra ●November 17, 2006

"Stef" <stef33d@yahooI-N-V-A-L-I-D.com.invalid> wrote in message 
news:4ac2d$455d8a61$54f63171$16633@publishnet.news-service.com...
> In comp.arch.embedded,
> Jim Granville <no.spam@designtools.maps.co.nz> wrote:
>> eg recently we needed extended scaling, and we found the
>> Zilog ZNEO has 64/32=32 divide, and 32*32=64 multiply.
>> To access that, we had to use in-line ASM, but once we did
>> that, the result was maybe 1000x faster than a library call to
>> shift/subtract SW division in a uC without divide opcodes.

Sounds like a badly written library. If the instruction was available
the library should have used it in the first place. Even so, making
the shift&subtract variant more than 10x slower requires you to
really work hard to make it as slow as possible...

> I've encountered something similar on an arm7tdmi. We needed the
> 32*32=64 multiply, but could not find a way to let the compiler
> emit the smlal (IIRC) instruction. So we also ended up doing
> this in asm. Has anybody found a way to let the compiler do this
> (ADS or GCC)?

Later versions of ADS supported inlined S/UMULL; U/SMLAL
was added in RVCT, IIRC.
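
For reference, the usual portable way to ask a C compiler for a 32*32=64
multiply is to widen one operand before multiplying. Compilers targeting
ARM mode (not Thumb, which lacks the long multiplies on ARM7TDMI) can
typically map this to SMULL, and the accumulate form to SMLAL, but that
depends on compiler version and optimisation level, so check the
generated assembly. The function names below are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* 32 x 32 -> 64-bit signed multiply. The cast widens `a` first, so the
 * product is computed in 64 bits instead of being truncated to 32. */
int64_t mul32x32_64(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* 64-bit accumulate of a 32 x 32 product (the SMLAL pattern). */
int64_t mac32x32_64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* Small self-check showing results that do not fit in 32 bits. */
void widening_multiply_selftest(void)
{
    assert(mul32x32_64(65536, 65536) == 4294967296LL);
    assert(mac32x32_64(10, 100000, 100000) == 10000000010LL);
}
```

Writing `a * b` without the cast multiplies in 32 bits and overflows, which
is a common reason the long-multiply instruction never shows up.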

Wilco

Reply by Tilmann Reh ●November 17, 2006

Wilco Dijkstra wrote:

> [math on ARM7, MSP430]
> None of these support floating point in hardware, so it depends on
> the libraries you use.

I was (maybe erroneously) assuming that the RTLs of common compiler
packages have about equal performance...

> Highly optimised FP libraries exist for ARM;
> the one I wrote takes about 25 cycles for fadd/fsub, 40 for fmul and
> 70 for fdiv. Double precision takes almost twice as long.
> You would get 500 KFlops quite easily on a 50 MHz ARM7TDMI.
> Of course this is highly compiler/library specific. Many are much
> slower than this; 5-6x slower for an unoptimised implementation
> is fairly typical.

Thanks, this is at least a rough figure I can use to start with.

> Doing floating point on the MSP, especially double precision,
> seems like a bad idea...

Not all the math needs double precision - and hey, we've done DP
floating point math with a Z80 as well. :-)
I know that it will be much more work for the MSP than for an ARM. But
given the overall application, it seems reasonable to me to also take
the MSP into consideration.

Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.

Reply by Stef ●November 17, 2006

In comp.arch.embedded,
Peter Dickerson <firstname.lastname@REMOVE.tesco.net> wrote:
> "Stef" <stef33d@yahooI-N-V-A-L-I-D.com.invalid> wrote in message
> news:12dcb$455db079$54f63171$7933@publishnet.news-service.com...
>>
>> memcpy(sram_loc, try_smlal, try_smlal_length);
>
> Well let me see what I do...

[something far fancier :-)]

Hey, that does all the work, we did it all by hand (memcpy, function
pointer..). I have saved your article and will refer to it next time
I need something like this, thanks.


--
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)

Reply by Tilmann Reh ●November 17, 2006

rickman wrote:

> If you are considering the MSP430 based on power consumption, be aware
> that the ARM parts are not hugely different once the clock rate is
> taken into account.

Of course this is true. Even when a given set of calculations has to be
done, the consumed /energy/ may be roughly the same (ARM faster with more
current, MSP with lower current but taking longer) - however, it's not
/only/ math that has to be done here. The overall current consumption,
especially at those times when there's no math to do, is also relevant.
To me it seems that these aspects are easier to take care of when
using the MSP, so that's why I am also considering it.
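
The trade-off described here can be put into numbers with a simple
duty-cycle average. This is only a sketch: the function name is made up,
and any currents fed into it are placeholders to be replaced with
datasheet or measured values for the actual parts.

```c
#include <assert.h>
#include <stdint.h>

/* Average current in microamps for a CPU that draws `active_ua` while
 * computing and `idle_ua` otherwise. `active_permille` is the share of
 * time spent computing, in thousandths (0..1000). Rounds down. */
uint32_t average_current_ua(uint32_t active_ua, uint32_t idle_ua,
                            uint32_t active_permille)
{
    assert(active_permille <= 1000u);
    uint64_t weighted = (uint64_t)active_ua * active_permille
                      + (uint64_t)idle_ua * (1000u - active_permille);
    return (uint32_t)(weighted / 1000u);
}
```

A faster core lowers `active_permille` for the same math, while the idle
term is what dominates when there is little math to do.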

> I have several eval boards from Atmel and Philips and would like to run
> some benchmarks to see how the power and speed compare.  If anyone
> would like to provide test code, I would be willing to run it in the
> next few weeks and make the results public.

That sounds interesting. But as Wilco mentioned, math performance can be
expected to depend on the libraries used - so you'd have to pay attention
to them. Also, I can't provide test code yet. For the time
being, I will look at the benchmarks that Jim pointed to, and consider
the numbers given by Wilco (though I am really interested in how long a
logarithm takes in a "good" [tm] library... :-) ).

Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.

Reply by steve ●November 17, 2006

Tilmann Reh wrote:
> Hello,
>
> for an estimation of required computing time I would like to roughly
> know the time that current controllers need for math operations
> (addition/subtraction, multiplication, division, and also logarithm) in
> single and/or double precision floating point format (assuming common
> compilers).
>
> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> SAM7), and Texas MSP430.

Cycles

MSP430, ImageCraft compiler, 32-bit floats, typical cycles
add    158
sub    184
mul    332
div    620

ARM, Keil compiler, 32-bit floats, typical cycles
add     53
sub     53
mul     48
div    224
sqrt   439
log    435

ARM, GNU compiler, 32-bit floats, typical cycles
add    472
sub    478
mul    439
div    652
sqrt  2387
log  13523

8051, Keil compiler, 32-bit floats, typical cycles
add    199
sub    201
mul    219
div    895
sqrt  1117
log   2006

Max cycles are up to 2x typical.
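
To turn these tables into a budget for a particular calculation, multiply
each operation count by its typical cycle cost and sum; the "up to 2x"
note gives a simple worst-case bound. A minimal C sketch using the two
ARM rows above (struct and function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Typical cycle cost per 32-bit float operation, as listed in the tables. */
struct fp_costs { uint32_t add, sub, mul, div, sqrt_op, log_op; };

/* How many of each operation a calculation performs. */
struct fp_mix { uint32_t add, sub, mul, div, sqrt_op, log_op; };

static const struct fp_costs ARM_KEIL = { 53, 53, 48, 224, 439, 435 };
static const struct fp_costs ARM_GNU  = { 472, 478, 439, 652, 2387, 13523 };

/* Sum of count * typical cost over all operation types. */
uint32_t typical_cycles(const struct fp_costs *c, const struct fp_mix *m)
{
    assert(c != 0 && m != 0);
    return m->add * c->add + m->sub * c->sub + m->mul * c->mul
         + m->div * c->div + m->sqrt_op * c->sqrt_op
         + m->log_op * c->log_op;
}

/* Worst case, using the "max cycles up to 2x typical" rule of thumb. */
uint32_t worst_case_cycles(const struct fp_costs *c, const struct fp_mix *m)
{
    return 2u * typical_cycles(c, m);
}
```

For example, two adds, one multiply and one log come to 589 typical cycles
with the Keil figures and 14906 with the GNU figures listed here.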

Reply by Karl Olsen ●November 17, 2006

steve <bungalow_steve@yahoo.com> wrote:
> Tilmann Reh wrote:
>> Hello,
>>
>> for an estimation of required computing time I would like to roughly
>> know the time that current controllers need for math operations
>> (addition/subtraction, multiplication, division, and also logarithm)
>> in single and/or double precision floating point format (assuming
>> common compilers).
>>
>> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
>> SAM7), and Texas MSP430.
>
> Cycles
> [...]
> ARM, GNU compiler, 32-bit floats, typical cycles
> add 472
> sub 478
> mul 439
> div  652
> sqrt 2387
> log 13,523
> [...]

This must be with an old GCC.  In GCC 3.4, the generic floating-point code
was rewritten in ARM assembler.

http://groups.google.com/group/comp.arch.embedded/browse_thread/thread/2de3e337a67b557e/f1ee7a09f78f6bc2?lnk=st&q=&rnum=1&hl=en#f1ee7a09f78f6bc2

Clocks for gcc-3.3.1, clocks for gcc-3.4.3, speedup (32-bit float):

__addsf3:  514   73  7.0x
__subsf3:  511   74  6.9x
__mulsf3:  428   49  8.7x
__divsf3:  634  142  4.5x

Some further speedup should have happened in GCC 4.0.
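
The speedup column is simply old clocks divided by new clocks, rounded to
one decimal. A small integer-only C sketch of that calculation (the
function name is hypothetical), handy on targets where pulling in
floating point just to print a ratio is unwelcome:

```c
#include <assert.h>

/* Speedup in tenths, rounded to nearest: 70 means 7.0x. */
unsigned speedup_tenths(unsigned old_cycles, unsigned new_cycles)
{
    assert(new_cycles > 0);
    return (old_cycles * 10u + new_cycles / 2u) / new_cycles;
}
```

Applied to the four rows above it yields 70, 69, 87 and 45, matching the
7.0x, 6.9x, 8.7x and 4.5x in the table.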

Karl Olsen

Reply by rickman ●November 17, 2006

Karl Olsen wrote:
> steve <bungalow_steve@yahoo.com> wrote:
> > Tilmann Reh wrote:
> >> Hello,
> >>
> >> for an estimation of required computing time I would like to roughly
> >> know the time that current controllers need for math operations
> >> (addition/subtraction, multiplication, division, and also logarithm)
> >> in single and/or double precision floating point format (assuming
> >> common compilers).
> >>
> >> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> >> SAM7), and Texas MSP430.
> >
> > Cycles
> > [...]
> > ARM, GNU compiler, 32-bit floats, typical cycles
> > add 472
> > sub 478
> > mul 439
> > div  652
> > sqrt 2387
> > log 13,523
> > [...]
>
> This must be with an old GCC.  In GCC 3.4, the generic floating-point code
> was rewritten in ARM assembler.
>
> http://groups.google.com/group/comp.arch.embedded/browse_thread/thread/2de3e337a67b557e/f1ee7a09f78f6bc2?lnk=st&q=&rnum=1&hl=en#f1ee7a09f78f6bc2
>
> Clocks for gcc-3.3.1, clocks for gcc-3.4.3, speedup (32-bit float):
>
> __addsf3: 514  73  7.0x
> __subsf3:  511  74  6.9x
> __mulsf3:  428  49  8.7x
> __divsf3:  634  142  4.5x
>
> Some further speedup should have happened in GCC 4.0.

GNUARM is up to 4.1.1.

Reply by steve ●November 18, 2006

Karl Olsen wrote:
> steve <bungalow_steve@yahoo.com> wrote:
> > Tilmann Reh wrote:
> >> Hello,
> >>
> >> for an estimation of required computing time I would like to roughly
> >> know the time that current controllers need for math operations
> >> (addition/subtraction, multiplication, division, and also logarithm)
> >> in single and/or double precision floating point format (assuming
> >> common compilers).
> >>
> >> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> >> SAM7), and Texas MSP430.
> >
> > Cycles
> > [...]
> > ARM, GNU compiler, 32-bit floats, typical cycles
> > add 472
> > sub 478
> > mul 439
> > div  652
> > sqrt 2387
> > log 13,523
> > [...]
>
> This must be with an old GCC.  In GCC 3.4, the generic floating-point code
> was rewritten in ARM assembler.
>
> http://groups.google.com/group/comp.arch.embedded/browse_thread/thread/2de3e337a67b557e/f1ee7a09f78f6bc2?lnk=st&q=&rnum=1&hl=en#f1ee7a09f78f6bc2
>
> Clocks for gcc-3.3.1, clocks for gcc-3.4.3, speedup (32-bit float):
>
> __addsf3: 514  73  7.0x
> __subsf3:  511  74  6.9x
> __mulsf3:  428  49  8.7x
> __divsf3:  634  142  4.5x
>
> Some further speedup should have happened in GCC 4.0.
> 
> Karl Olsen


Yes, it was version 3.3.1. Nice speedup in 3.4!

Reply by Tilmann Reh ●November 18, 2006

steve wrote:

[some cycle data]

Thanks very much - that's perfect.
(Including thanks to Karl for the update on GCC cycles.)

Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.