Math computing time statistics for ARM7TDMI and MSP430

Hello,

for an estimation of required computing time I would like to roughly
know the time that current controllers need for math operations
(addition/subtraction, multiplication, division, and also logarithm) in
single and/or double precision floating point format (assuming common
compilers).

The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
SAM7), and Texas MSP430.

Can anyone provide a link to some statistics?

Thanks,
Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.

Reply by Jim Granville ● November 17, 2006

Tilmann Reh wrote:
> Hello,
> 
> for an estimation of required computing time I would like to roughly
> know the time that current controllers need for math operations
> (addition/subtraction, multiplication, division, and also logarithm) in
> single and/or double precision floating point format (assuming common
> compilers).
> 
> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> SAM7), and Texas MSP430.
> 
> Can anyone provide a link to some statistics?

Might be hard to find...

Look at http://www.eembc.org/; recently Philips/NXP made some
noise about their core being 37 to 51 percent better than other
ARM7 cores, because of their wider memory paths.

Generally, the ASM opcodes will give some indication. Some
uC lack division, others have it in HW, and that will make a huge
difference to that corner of the benchmark.

E.g., recently we needed extended scaling, and we found the
Zilog ZNEO has 64/32=32 divide, and 32*32=64 multiply.
To access that, we had to use in-line ASM, but once we did
that, the result was maybe 1000x faster than a library call to
shift/subtract SW division in a uC without divide opcodes.
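To illustrate why shift/subtract software division is slow, here is a minimal sketch of the textbook restoring algorithm in C. It is not the code of any particular runtime library, and the name udiv32_shift_subtract is made up for this example. It runs one loop iteration per quotient bit, so a 32-bit divide costs 32 compare/subtract steps where a hardware divide needs a single opcode.

```c
#include <stdint.h>

/* Restoring (shift/subtract) unsigned division, one iteration per bit.
   Hypothetical helper for illustration only; the divisor must be non-zero.
   The remainder pointer may be null if the remainder is not needed. */
uint32_t udiv32_shift_subtract(uint32_t dividend, uint32_t divisor,
                               uint32_t *remainder)
{
    uint32_t quotient = 0;
    uint64_t rem = 0;   /* 64 bits wide so the shift below cannot overflow */

    for (int bit = 31; bit >= 0; bit--) {
        /* Shift the next dividend bit into the partial remainder. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;                  /* subtract succeeds */
            quotient |= (uint32_t)1 << bit;  /* so this quotient bit is 1 */
        }
    }
    if (remainder)
        *remainder = (uint32_t)rem;
    return quotient;
}
```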

Also likely to be well-maths-resourced are DSP uC like the TMS320F2802

-jg

Reply by Stef ● November 17, 2006

In comp.arch.embedded,
Jim Granville <no.spam@designtools.maps.co.nz> wrote:
> E.g., recently we needed extended scaling, and we found the
> Zilog ZNEO has 64/32=32 divide, and 32*32=64 multiply.
> To access that, we had to use in-line ASM, but once we did
> that, the result was maybe 1000x faster than a library call to
> shift/subtract SW division in a uC without divide opcodes.

I've encountered something similar on an ARM7TDMI. We needed the
32*32=64 multiply, but could not find a way to get the compiler
to emit the smlal (IIRC) instruction, so we also ended up doing
this in asm. Has anybody found a way to make the compiler do this
(ADS or GCC)?

-- 
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)

Reply by Dominic ● November 17, 2006

Stef wrote:
> I've encountered something similar on an arm7tdmi. We needed the
> 32*32=64 multiply, but could not find a way to let the compiler
> emit the smlal (IIRC) instruction. So we also ended up doing
> this in asm. Has anybody found a way to let the compiler do this
> (ADS or GCC)?

Just tried it, and this works for me (gcc 4.1.1):

long long c = (long long) a * (long long) b;

long long d = (long long) c + (long long) a * (long long) b;
emits a smlal

Regards,

Dominic

Reply by larwe ● November 17, 2006

Tilmann Reh wrote:

> single and/or double precision floating point format (assuming common
> compilers).
>
> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> SAM7), and Texas MSP430.

Just as a general point: If you're considering software DSP
applications, unless they're _INHERENTLY_ constrained and will never
need to be scalable, ARM is strongly suggested IMHO. MSP430's address
space is architecturally limited. Targeting ARM from the get-go will
leave the door open for more complex algorithms, larger sample buffers,
etc.

Reply by Stef ● November 17, 2006

In comp.arch.embedded,
Dominic <Dominic.at.usenet@gmx.com> wrote:
> Stef wrote:
>> I've encountered something similar on an arm7tdmi. We needed the
>> 32*32=64 multiply, but could not find a way to let the compiler
>> emit the smlal (IIRC) instruction. So we also ended up doing
>> this in asm. Has anybody found a way to let the compiler do this
>> (ADS or GCC)?
>
> Just tried it, and this works for me (gcc 4.1.1):
>
> long long c = (long long) a * (long long) b;
>
> long long d = (long long) c + (long long) a * (long long) b;
> emits a smlal

Hey, that works, thanks! I tried this small FIR filter example:

long long try_smlal(long *a, long *b)
{
  long long rv = 0;

  rv += (long long) *(a++) * (long long) *(b++);
  rv += (long long) *(a++) * (long long) *(b++);
  rv += (long long) *(a++) * (long long) *(b++);
  rv += (long long) *(a++) * (long long) *(b++);

  return rv;
}

The result with GCC 3.2.1 is:

020002b8 <try_smlal>:
 20002b8:	e92d4030 	stmdb	sp!, {r4, r5, lr}
 20002bc:	e1a03000 	mov	r3, r0
 20002c0:	e1a02001 	mov	r2, r1
 20002c4:	e493e004 	ldr	lr, [r3], #4
 20002c8:	e492c004 	ldr	ip, [r2], #4
 20002cc:	e4930004 	ldr	r0, [r3], #4
 20002d0:	e4921004 	ldr	r1, [r2], #4
 20002d4:	e0c54190 	smull	r4, r5, r0, r1
 20002d8:	e1a01005 	mov	r1, r5
 20002dc:	e1a00004 	mov	r0, r4
 20002e0:	e0e10e9c 	smlal	r0, r1, ip, lr
 20002e4:	e493e004 	ldr	lr, [r3], #4
 20002e8:	e492c004 	ldr	ip, [r2], #4
 20002ec:	e0e10e9c 	smlal	r0, r1, ip, lr
 20002f0:	e593c000 	ldr	ip, [r3]
 20002f4:	e5923000 	ldr	r3, [r2]
 20002f8:	e0e10c93 	smlal	r0, r1, r3, ip
 20002fc:	e8bd8030 	ldmia	sp!, {r4, r5, pc}

Looks almost optimal. I just don't see why the smull result is placed
in r4/r5 and then moved to r0/r1, but on a 20-tap filter it wouldn't
really be significant.

The last time I tried this was years ago with an ADS compiler, and I
couldn't get it to work. It may have been the wrong casts or just the
old compiler.
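For a filter with more taps, the same widening-cast pattern can be put in a loop. The following is only a sketch: fir_mac64 is a made-up name, and whether a given compiler and optimization level turns the loop body into smlal is not guaranteed, so it is worth checking the disassembly as was done above. Fixed-width types are used so that the 32*32=64 widths do not depend on the size of long.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of n coefficient/sample pairs with a 64-bit accumulator.
   Both operands are widened to 64 bits before multiplying, which is the
   pattern reported above to make GCC emit SMULL/SMLAL on ARM.
   fir_mac64 is a hypothetical name used for illustration. */
int64_t fir_mac64(const int32_t *coeff, const int32_t *sample, size_t n)
{
    int64_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += (int64_t)coeff[i] * (int64_t)sample[i];

    return acc;
}
```

Called as fir_mac64(a, b, 4), this computes the same sum as the unrolled try_smlal above when long is 32 bits wide.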

The other optimization we did with this was to run the function out
of the AT91's internal SRAM by first copying it from flash on
startup and pointing a function pointer at the SRAM copy. We got the
function address OK, but the length was (IIRC) fixed in code. Any tips
on copying an entire function at run time using GCC? Or how to
get the length argument in this call:

 memcpy(sram_loc, try_smlal, try_smlal_length);

-- 
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)

Reply by Tilmann Reh ● November 17, 2006

larwe wrote:

>> [math power of MCUs]
>> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
>> SAM7), and Texas MSP430.
> 
> Just as a general point: If you're considering software DSP
> applications, unless they're _INHERENTLY_ constrained and will never
> need to be scalable, ARM is strongly suggested IMHO. MSP430's address
> space is architecturally limited. Targeting ARM from the get-go will
> leave the door open for more complex algorithms, larger sample buffers,
> etc.

Thanks for the note - I already know. However, in this application there
is neither much data nor much code. It's just a task that needs some
amount of math operations, and I will have to trade power consumption
against calculation time... I also lean towards using ARM, but I would
also like to see some figures.

Thanks,
Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.

Reply by Tilmann Reh ● November 17, 2006

Jim Granville wrote:

>> [math power of MCUs]
>> Can anyone provide a link to some statistics?
> 
> Might be hard to find...
> 
> Look at http://www.eembc.org/; recently Philips/NXP made some
> noise about their core being 37 to 51 percent better than other
> ARM7 cores, because of their wider memory paths.

Thanks for the link - it will probably provide at least some general
figures (will have a closer look soon).

> Generally, the ASM opcodes will give some indication. Some
> uC lack division, others have it in HW, and that will make a huge
> difference to that corner of the benchmark.

I fear that I will need double precision floating point math, for which
assembler won't be much better than the RTL of a common compiler, I
assume. (I'm well used to programming assembler, so that won't hurt me
if it really makes sense.)

> Also likely to be well-maths-resourced are DSP uC like the TMS320F2802

Too much power consumption for this application, I think.

Tilmann

-- 
http://www.autometer.de - Elektronik nach Maß.

Reply by Peter Dickerson ● November 17, 2006

"Stef" <stef33d@yahooI-N-V-A-L-I-D.com.invalid> wrote in message 
news:12dcb$455db079$54f63171$7933@publishnet.news-service.com...
> In comp.arch.embedded,
[snip]
>
> The other optimization we did with this is to run the function out
> of the AT91's internal SRAM by first copying it from flash on
> startup and point a function pointer at the sram. We got the function
> address OK, but the length was (IIRC) fixed in code. Any tips on
> copying an entire function during run-time using GCC? Or how to
> get the length argument in this call:
>
> memcpy(sram_loc, try_smlal, try_smlal_length);

Well let me see what I do...

I define this in a header
#define IRAM_CODE __attribute__((long_call,section(".icode")))

then

IRAM_CODE void foo(void)
{
...
}

The linker script puts the section .icode into flash just like initialized
data, something like this:
    __icode_rom__ = ADDR(.gcc_except_table ) + SIZEOF(.gcc_except_table);
    .icode : AT(__icode_rom__)
    {
    __icode_start__ = . ;
    *(.icode);
    *(.idata);
    . = ALIGN(4);
    } > iram
    __data_rom__ = __icode_rom__ + SIZEOF(.icode);

Then the crt0.s init code copies it out, something like this:
/* Copy data from ICODE to IRAM */
    ldr r2,=__icode_start__
    ldr r3,=__icode_rom__
    ldr r4,=__data_rom__
    b 2f
1:
    ldmia r3!,{r0,r1}
    stmia r2!,{r0,r1}
2:
    cmp r3,r4
    blt 1b

Note that this assumes that the code will stay permanently in RAM rather
than being overlaid and loaded dynamically. A more dynamic version could be
done by having multiple sections and then memcpy()ing the one you're
interested in.

Note also that GCC has a problem if you call an IRAM_CODE function from a
non-IRAM_CODE function *in* *the* *same* *file* (it seems to lose the
long_call attribute and uses a relative call that is typically out of
range). So the best idea is to put the IRAM_CODE functions in a separate
file.
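As a rough C counterpart to the crt0.s loop, the copy can be written as a small word-copy helper. This is a sketch under assumptions: copy_words is a made-up name, and declaring the linker symbols as extern arrays (shown in the comment) is the usual GCC idiom rather than something taken from the script above. The same symbols also give the byte length asked about for memcpy(), namely (char *)__data_rom__ - (char *)__icode_rom__.

```c
#include <stdint.h>

/* Copy 32-bit words from [src, src_end) to dst.
   copy_words is a hypothetical helper for illustration.

   On the target it would be driven by the linker symbols, e.g.:

     extern uint32_t __icode_start__[];   // run address in internal RAM
     extern uint32_t __icode_rom__[];     // load address in flash
     extern uint32_t __data_rom__[];      // end of the .icode image in flash

     copy_words(__icode_start__, __icode_rom__, __data_rom__);
*/
void copy_words(uint32_t *dst, const uint32_t *src, const uint32_t *src_end)
{
    while (src < src_end)
        *dst++ = *src++;
}
```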


hope that helps.
Peter

Reply by rickman ● November 17, 2006

Tilmann Reh wrote:
> larwe wrote:
>
> >> [math power of MCUs]
> >> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or
> >> SAM7), and Texas MSP430.
> >
> > Just as a general point: If you're considering software DSP
> > applications, unless they're _INHERENTLY_ constrained and will never
> > need to be scalable, ARM is strongly suggested IMHO. MSP430's address
> > space is architecturally limited. Targeting ARM from the get-go will
> > leave the door open for more complex algorithms, larger sample buffers,
> > etc.
>
> Thanks for the note - I already know. However in this application, there
> is neither much data nor much code. It's just a task that needs some
> amount of math operations, and I will have to trade power consumption
> against calculation time... I also lean towards using ARM, but I would
> also like to see some figures.

If you are considering the MSP430 based on power consumption, be aware
that the ARM parts are not hugely different once the clock rate is
taken into account.  I don't have good numbers for the MSP430, but they
appear to be around 350 uA at 1 MHz.  I don't know exactly how that
varies with clock rate, but I'll assume the y-intercept is 0 and the
slope is linear.  The Atmel SAM7S parts are pretty much linear with
nearly no offset other than the bias for the internal LDO.  The slope
is about 650 uA per MHz.  So between the MSP430 and the SAM7S it is
about a 2 to 1 power difference.  I can't say if the processing power
of the 32 bit device makes up for any of this or not.
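Under the linear model assumed above (current proportional to clock rate, zero intercept), the clock frequency cancels out of the charge needed for a fixed task: Q = I * t = (k * f) * (N / f) = k * N, where k is the slope in uA/MHz and N the number of cycles. The sketch below just encodes that arithmetic. The function names are made up, and it ignores supply voltage differences and sleep current, which are assumptions on top of the figures quoted above.

```c
/* Charge per task under the linear model I = k * f (zero intercept).
   k in uA/MHz times N cycles gives picocoulombs, because
   1 uA/MHz = 1e-6 A / 1e6 Hz = 1e-12 coulomb per cycle.
   Hypothetical helper names; supply voltage and sleep current are ignored. */
double charge_per_task_pC(double uA_per_MHz, double cycles)
{
    return uA_per_MHz * cycles;
}

/* Factor by which part B must need fewer cycles than part A
   to draw the same charge for the same task. */
double break_even_cycle_ratio(double uA_per_MHz_a, double uA_per_MHz_b)
{
    return uA_per_MHz_b / uA_per_MHz_a;
}
```

With the rough figures above, break_even_cycle_ratio(350.0, 650.0) is about 1.86, so the SAM7S would draw less charge than the MSP430 only if it finishes the same calculation in fewer than roughly 54 percent of the cycles.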

I have several eval boards from Atmel and Philips and would like to run
some benchmarks to see how the power and speed compare. If anyone
would like to provide test code, I would be willing to run it in the
next few weeks and make the results public.
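As a starting point for such test code, here is a minimal timing sketch for double-precision add, multiply and divide. It is only an illustration: bench_double_ops is a made-up name, and clock() stands in for a hardware timer, which would be the better choice on an MCU (clock() may not even be implemented in an embedded C library). A log() call from math.h could be timed the same way.

```c
#include <time.h>

/* Run `iterations` rounds of double-precision multiply, add and divide
   and return the elapsed CPU time in seconds per round.
   iterations must be greater than zero.
   bench_double_ops is a hypothetical name used for illustration. */
double bench_double_ops(unsigned long iterations, double *sink)
{
    /* volatile operands keep the compiler from folding the arithmetic
       into constants at compile time */
    volatile double x = 1.000001;
    volatile double y = 0.999999;
    double acc = 0.0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        acc += x * y;     /* one multiply and one add */
        acc = acc / x;    /* one divide */
    }
    clock_t stop = clock();

    *sink = acc;          /* publish the result so the loop is not dead code */
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

The measured time includes loop overhead and the volatile loads, so it is an upper bound on the cost of the three operations rather than an exact figure.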
