Any Large Integer Libraries Out There for the ARM?| page 3

Reply by Boudewijn Dijkstra ●April 25, 20122012-04-25

Op Mon, 23 Apr 2012 20:36:48 +0200 schreef David T. Ashley  
<dashley@gmail.com>:
> On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
> <sp4mtr4p.boudewijn@indes.com> wrote:
>
>> Op Wed, 18 Apr 2012 16:22:10 +0200 schreef David T. Ashley
>> <dashley@gmail.com>:
>>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>>> a hack job in C.  It is logically correct, but it probably takes 3-10
>>> times as long as assembly-language.
>>
>> Unlikely.  With Cortex-M3 and the IAR compiler I came to the following
>> results:
>> * C-only: 63 cycles
>> * C-with-asm: 50 cycles
>> * asm-only: 39 cycles
>>
>> In how many cycles does your version run?
>
> Can you post your C code that led to 63-cycle execution?

Note that I did not perform exhaustive testing on the below code.

#include <inttypes.h>

typedef struct uint96_s {
   uint32_t w0;
   uint32_t w1;
   uint32_t w2;
} uint96_t;

typedef struct uint128_s {
   uint32_t w0;
   uint32_t w1;
   uint32_t w2;
   uint32_t w3;
} uint128_t;

void umul_96x32c(uint96_t *a, uint32_t b, uint128_t *r)
{
   uint64_t p0, p1, p2;
   uint32_t r0, r1a, r1b, r2a, r2b, r3;
   uint64_t r1, r2;

   p0 = (uint64_t)a->w0 * (uint64_t)b;
   r0 = p0;
   r1a = p0 >> 32;
   r->w0 = r0;
   p1 = (uint64_t)a->w1 * (uint64_t)b;
   r1b = p1;
   r2a = p1 >> 32;
   r1 = (uint64_t)r1a + (uint64_t)r1b;
   r->w1 = r1;
   p2 = (uint64_t)a->w2 * (uint64_t)b;
   r2b = p2;
   r3 = p2 >> 32;
   r2 = (r1 >> 32) + (uint64_t)r2a + (uint64_t)r2b;
   r->w2 = r2;
   r3 += r2 >> 32;
   r->w3 = r3;
}

> And your C with ASM?

The obvious thing to do is to add manual ADC instructions.

#pragma inline=never
void umul_96x32i(uint96_t *a, uint32_t b, uint128_t *r)
{
   uint64_t p0, p1, p2;
   uint32_t r0, r1a, r1b, r2a, r2b, r3;
   uint32_t r1, r2;

   p0 = (uint64_t)a->w0 * (uint64_t)b;
   r0 = p0;
   r1a = p0 >> 32;
   r->w0 = r0;
   p1 = (uint64_t)a->w1 * (uint64_t)b;
   r1b = p1;
   r2a = p1 >> 32;
   r1 = r1a + r1b;
   r->w1 = r1;
   p2 = (uint64_t)a->w2 * (uint64_t)b;
   r2b = p2;
   r3 = p2 >> 32;
   __asm("adcs %[Rd],%[Rn],%[Rm]"
       : [Rd]"=r"(r2)
       : [Rn]"r" (r2a), [Rm]"r" (r2b)
       : "cc");
   r->w2 = r2;
   __asm("adc %[Rd],%[Rn],#0"
       : [Rd]"=r"(r3)
       : [Rn]"r" (r3));
   r->w3 = r3;
}

> And your ASM-only?

The only further possible optimization was register allocation.

umul_96x32:
         PUSH     {R4}
         LDR      R3,[R0, #+0]
         UMULL    R3,R12,R1,R3
         STR      R3,[R2, #+0]
         LDR      R3,[R0, #+4]
         UMULL    R3,R4,R1,R3
         ADDS     R12,R3,R12
         STR      R12,[R2, #+4]
         LDR      R0,[R0, #+8]
         UMULL    R0,R1,R1,R0
         ADCS     R3,R4,R0
         STR      R3,[R2, #+8]
         ADC      R0,R1,#0
         STR      R0,[R2, #+12]
         POP      {R4}
         BX       LR               ;; return



-- 
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

Reply by Noob ●April 25, 20122012-04-25

Boudewijn Dijkstra wrote:

> The obvious thing to do is to add manual ADC instructions.

The somewhat less obvious optimization is to use umlal,
which handles the 64-bit add.

> The only further possible optimization was register allocation.

I'm not sure what the pipeline looks like, but spacing producers
and consumers, by using more temporary registers /might/ slightly
improve performance.

Regards.

Reply by Noob ●April 25, 20122012-04-25

David T. Ashley wrote:

> By changing to one temporary variable (rather than 3) the compiler
> seems to have shaved out two instructions (one at the start to
> allocate stack space and one at the end to deallocate it).

In the parameters, you should provide the value of in_u32,
instead of a pointer to the value, because the compiler is
reloading the value every time it is needed.

You didn't specify. Which CPU(s) are you targeting?

Regards.

Reply by Boudewijn Dijkstra ●April 25, 20122012-04-25

Op Wed, 25 Apr 2012 10:19:23 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote:
>
>> The obvious thing to do is to add manual ADC instructions.
>
> The somewhat less obvious optimization is to use umlal,
> which handles the 64-bit add.

Indeed.  But I'm not sure which is faster.

>> The only further possible optimization was register allocation.
>
> I'm not sure what the pipeline looks like, but spacing producers
> and consumers, by using more temporary registers /might/ slightly
> improve performance.

No, it wouldn't.  Cortex-M3 only stalls on memory operations.


-- 
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

Reply by David T. Ashley ●April 25, 20122012-04-25

On Wed, 25 Apr 2012 09:23:16 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:
>
>> And your ASM-only?
>
>The only further possible optimization was register allocation.
>
>umul_96x32:
>         PUSH     {R4}
>         LDR      R3,[R0, #+0]
>         UMULL    R3,R12,R1,R3
>         STR      R3,[R2, #+0]
>         LDR      R3,[R0, #+4]
>         UMULL    R3,R4,R1,R3
>         ADDS     R12,R3,R12
>         STR      R12,[R2, #+4]
>         LDR      R0,[R0, #+8]
>         UMULL    R0,R1,R1,R0
>         ADCS     R3,R4,R0
>         STR      R3,[R2, #+8]
>         ADC      R0,R1,#0
>         STR      R0,[R2, #+12]
>         POP      {R4}
>         BX       LR               ;; return

Thanks for all the help.  The code posted allowed me to optimize my
multiply function to almost as efficient as hand-written
assembly-language.

I did post my original code elsewhere on this thread.  I handled
carries badly.

Thanks and best regards, Dave Ashley

Reply by Noob ●April 25, 20122012-04-25

Boudewijn Dijkstra wrote:

> Noob wrote:
>
>> Boudewijn Dijkstra wrote:
>>
>>> The obvious thing to do is to add manual ADC instructions.
>>
>> The somewhat less obvious optimization is to use umlal,
>> which handles the 64-bit add.
> 
> Indeed.  But I'm not sure which is faster.

According to ARM's "Cortex-M3 Technical Reference Manual",

  UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size
  of source values. These are interruptible (abandoned/restarted), with
  worst case latency of one cycle. MLAL versions take four to seven
  cycles and MULL versions take three to five cycles. For MLAL, the
  signed version is one cycle longer than the unsigned.

and 

  Cycle count based on input sizes. That is, ABS(inputs) < 64K
  terminates early.

Therefore, we are comparing
(1 umull + 2 umlal) vs (3 umull + 3 add) = 11-19 vs 12-18
Hard to give a definitive answer...

>>> The only further possible optimization was register allocation.
>>
>> I'm not sure what the pipeline looks like, but spacing producers
>> and consumers, by using more temporary registers /might/ slightly
>> improve performance.
> 
> No, it wouldn't.  Cortex-M3 only stalls on memory operations.

I don't understand. Are multi-cycle instructions (such as umull)
pipelined or not?

If I write

  umull r4, r5, r0, r3
  add r6, r7, r8

add can be executed one cycle after umull (mult and ALU are
two different hardware blocks).

If I write

  umull r4, r5, r0, r3
  add r6, r5, r4

the pipeline must stall to wait for the result of umull to
become available. What am I missing?

Regards.

Reply by Boudewijn Dijkstra ●April 26, 20122012-04-26

Op Wed, 25 Apr 2012 17:56:06 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote:
>> Noob wrote:
>>> Boudewijn Dijkstra wrote:
>>>
>>>> The only further possible optimization was register allocation.
>>>
>>> I'm not sure what the pipeline looks like, but spacing producers
>>> and consumers, by using more temporary registers /might/ slightly
>>> improve performance.
>>
>> No, it wouldn't.  Cortex-M3 only stalls on memory operations.
>
> I don't understand. Are multi-cycle instructions (such as umull)
> pipelined or not?
>
> If I write
>
>   umull r4, r5, r0, r3
>   add r6, r7, r8
>
> add can be executed one cycle after umull (mult and ALU are
> two different hardware blocks).
>
> If I write
>
>   umull r4, r5, r0, r3
>   add r6, r5, r4
>
> the pipeline must stall to wait for the result of umull to
> become available. What am I missing?

AFAIK Cortex-M does not feature parallel execution paths.  Cortex-R does.


-- 
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

Reply by David T. Ashley ●April 26, 20122012-04-26

On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote:

>David T. Ashley wrote:
>
>> By changing to one temporary variable (rather than 3) the compiler
>> seems to have shaved out two instructions (one at the start to
>> allocate stack space and one at the end to deallocate it).
>
>In the parameters, you should provide the value of in_u32,
>instead of a pointer to the value, because the compiler is
>reloading the value every time it is needed.
>
>You didn't specify. Which CPU(s) are you targeting?

The compiler is set for Thumb.  (I believe the Cortex-M3 handles only
Thumb-2 instructions, which are a superset of Thumb.)

The parameter observation is interesting.  I'd have to look at the
code in more detail (which I won't do now).

But you'd think when you have a CPU with 16 registers, maybe 12 of
which you are free to use assuming you push a few, the compiler could
allocate 4 registers just to buffer the pass-by-reference inputs.

DTA

Reply by Paul ●April 26, 20122012-04-26

In article <1hkip7tdo729s18cokiued1uku0frp3rqd@4ax.com>, 
dashley@gmail.com says...
> 
> On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote:
> 
> >David T. Ashley wrote:
> >
> >> By changing to one temporary variable (rather than 3) the compiler
> >> seems to have shaved out two instructions (one at the start to
> >> allocate stack space and one at the end to deallocate it).
> >
> >In the parameters, you should provide the value of in_u32,
> >instead of a pointer to the value, because the compiler is
> >reloading the value every time it is needed.
> >
> >You didn't specify. Which CPU(s) are you targeting?
> 
> The compiler is set for Thumb.  (I believe the Cortex-M3 handles only
> Thumb-2 instructions, which are a superset of Thumb.)
> 
> The parameter observation is interesting.  I'd have to look at the
> code in more detail (which I won't do now).
> 
> But you'd think when you have a CPU with 16 registers, maybe 12 of
> which you are free to use assuming you push a few, the compiler could
> allocate 4 registers just to buffer the pass-by-reference inputs.
>

Why all compiler manufacturers assume you are making a 
tablet/phone/laptop therefor just put in more RAM to blaot around the 
problem. Just like the chip manufacturers, nothing is made in quantities
less than 10,000 and always ahs TerraBytes of RAM.

Thats what everybody does of course..



[Hint just a tad of sarcasm in the above statements]


-- 
Paul Carpenter          | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>    PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/>  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate

Reply by Tauno Voipio ●April 26, 20122012-04-26

On 26.4.12 4:52 , David T. Ashley wrote:
> On Wed, 25 Apr 2012 10:23:50 +0200, Noob<root@127.0.0.1>  wrote:
>
>> David T. Ashley wrote:
>>
>>> By changing to one temporary variable (rather than 3) the compiler
>>> seems to have shaved out two instructions (one at the start to
>>> allocate stack space and one at the end to deallocate it).
>>
>> In the parameters, you should provide the value of in_u32,
>> instead of a pointer to the value, because the compiler is
>> reloading the value every time it is needed.
>>
>> You didn't specify. Which CPU(s) are you targeting?
>
> The compiler is set for Thumb.  (I believe the Cortex-M3 handles only
> Thumb-2 instructions, which are a superset of Thumb.)
>
> The parameter observation is interesting.  I'd have to look at the
> code in more detail (which I won't do now).
>
> But you'd think when you have a CPU with 16 registers, maybe 12 of
> which you are free to use assuming you push a few, the compiler could
> allocate 4 registers just to buffer the pass-by-reference inputs.
>
> DTA

At least GCC does it - R0 to R3.

-- 

Tauno Voipio