EmbeddedRelated.com
Forums

Any Large Integer Libraries Out There for the ARM?

Started by David T. Ashley April 18, 2012
Op Mon, 23 Apr 2012 20:36:48 +0200 schreef David T. Ashley  
<dashley@gmail.com>:
> On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra" > <sp4mtr4p.boudewijn@indes.com> wrote: > >> Op Wed, 18 Apr 2012 16:22:10 +0200 schreef David T. Ashley >> <dashley@gmail.com>: >>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did >>> a hack job in C. It is logically correct, but it probably takes 3-10 >>> times as long as assembly-language. >> >> Unlikely. With Cortex-M3 and the IAR compiler I came to the following >> results: >> * C-only: 63 cycles >> * C-with-asm: 50 cycles >> * asm-only: 39 cycles >> >> In how many cycles does your version run? > > Can you post your C code that led to 63-cycle execution?
Note that I did not perform exhaustive testing on the below code. #include <inttypes.h> typedef struct uint96_s { uint32_t w0; uint32_t w1; uint32_t w2; } uint96_t; typedef struct uint128_s { uint32_t w0; uint32_t w1; uint32_t w2; uint32_t w3; } uint128_t; void umul_96x32c(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint64_t r1, r2; p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = (uint64_t)r1a + (uint64_t)r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; r2 = (r1 >> 32) + (uint64_t)r2a + (uint64_t)r2b; r->w2 = r2; r3 += r2 >> 32; r->w3 = r3; }
> And your C with ASM?
The obvious thing to do is to add manual ADC instructions. #pragma inline=never void umul_96x32i(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint32_t r1, r2; p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = r1a + r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; __asm("adcs %[Rd],%[Rn],%[Rm]" : [Rd]"=r"(r2) : [Rn]"r" (r2a), [Rm]"r" (r2b) : "cc"); r->w2 = r2; __asm("adc %[Rd],%[Rn],#0" : [Rd]"=r"(r3) : [Rn]"r" (r3)); r->w3 = r3; }
> And your ASM-only?
The only further possible optimization was register allocation. umul_96x32: PUSH {R4} LDR R3,[R0, #+0] UMULL R3,R12,R1,R3 STR R3,[R2, #+0] LDR R3,[R0, #+4] UMULL R3,R4,R1,R3 ADDS R12,R3,R12 STR R12,[R2, #+4] LDR R0,[R0, #+8] UMULL R0,R1,R1,R0 ADCS R3,R4,R0 STR R3,[R2, #+8] ADC R0,R1,#0 STR R0,[R2, #+12] POP {R4} BX LR ;; return -- Gemaakt met Opera's revolutionaire e-mailprogramma: http://www.opera.com/mail/ (Remove the obvious prefix to reply privately.)
Boudewijn Dijkstra wrote:

> The obvious thing to do is to add manual ADC instructions.
The somewhat less obvious optimization is to use umlal, which handles the 64-bit add.
> The only further possible optimization was register allocation.
I'm not sure what the pipeline looks like, but spacing producers and consumers, by using more temporary registers /might/ slightly improve performance. Regards.
David T. Ashley wrote:

> By changing to one temporary variable (rather than 3) the compiler > seems to have shaved out two instructions (one at the start to > allocate stack space and one at the end to deallocate it).
In the parameters, you should provide the value of in_u32, instead of a pointer to the value, because the compiler is reloading the value every time it is needed. You didn't specify. Which CPU(s) are you targeting? Regards.
Op Wed, 25 Apr 2012 10:19:23 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote: > >> The obvious thing to do is to add manual ADC instructions. > > The somewhat less obvious optimization is to use umlal, > which handles the 64-bit add.
Indeed. But I'm not sure which is faster.
>> The only further possible optimization was register allocation. > > I'm not sure what the pipeline looks like, but spacing producers > and consumers, by using more temporary registers /might/ slightly > improve performance.
No, it wouldn't. Cortex-M3 only stalls on memory operations. -- Gemaakt met Opera's revolutionaire e-mailprogramma: http://www.opera.com/mail/ (Remove the obvious prefix to reply privately.)
On Wed, 25 Apr 2012 09:23:16 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:
> >> And your ASM-only? > >The only further possible optimization was register allocation. > >umul_96x32: > PUSH {R4} > LDR R3,[R0, #+0] > UMULL R3,R12,R1,R3 > STR R3,[R2, #+0] > LDR R3,[R0, #+4] > UMULL R3,R4,R1,R3 > ADDS R12,R3,R12 > STR R12,[R2, #+4] > LDR R0,[R0, #+8] > UMULL R0,R1,R1,R0 > ADCS R3,R4,R0 > STR R3,[R2, #+8] > ADC R0,R1,#0 > STR R0,[R2, #+12] > POP {R4} > BX LR ;; return
Thanks for all the help. The code posted allowed me to optimize my multiply function to almost as efficient as hand-written assembly-language. I did post my original code elsewhere on this thread. I handled carries badly. Thanks and best regards, Dave Ashley
Boudewijn Dijkstra wrote:

> Noob wrote: > >> Boudewijn Dijkstra wrote: >> >>> The obvious thing to do is to add manual ADC instructions. >> >> The somewhat less obvious optimization is to use umlal, >> which handles the 64-bit add. > > Indeed. But I'm not sure which is faster.
According to ARM's "Cortex-M3 Technical Reference Manual", UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size of source values. These are interruptible (abandoned/restarted), with worst case latency of one cycle. MLAL versions take four to seven cycles and MULL versions take three to five cycles. For MLAL, the signed version is one cycle longer than the unsigned. and Cycle count based on input sizes. That is, ABS(inputs) < 64K terminates early. Therefore, we are comparing (1 umull + 2 umlal) vs (3 umull + 3 add) = 11-19 vs 12-18 Hard to give a definitive answer...
>>> The only further possible optimization was register allocation. >> >> I'm not sure what the pipeline looks like, but spacing producers >> and consumers, by using more temporary registers /might/ slightly >> improve performance. > > No, it wouldn't. Cortex-M3 only stalls on memory operations.
I don't understand. Are multi-cycle instructions (such as umull) pipelined or not? If I write umull r4, r5, r0, r3 add r6, r7, r8 add can be executed one cycle after umull (mult and ALU are two different hardware blocks). If I write umull r4, r5, r0, r3 add r6, r5, r4 the pipeline must stall to wait for the result of umull to become available. What am I missing? Regards.
Op Wed, 25 Apr 2012 17:56:06 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote: >> Noob wrote: >>> Boudewijn Dijkstra wrote: >>> >>>> The only further possible optimization was register allocation. >>> >>> I'm not sure what the pipeline looks like, but spacing producers >>> and consumers, by using more temporary registers /might/ slightly >>> improve performance. >> >> No, it wouldn't. Cortex-M3 only stalls on memory operations. > > I don't understand. Are multi-cycle instructions (such as umull) > pipelined or not? > > If I write > > umull r4, r5, r0, r3 > add r6, r7, r8 > > add can be executed one cycle after umull (mult and ALU are > two different hardware blocks). > > If I write > > umull r4, r5, r0, r3 > add r6, r5, r4 > > the pipeline must stall to wait for the result of umull to > become available. What am I missing?
AFAIK Cortex-M does not feature parallel execution paths. Cortex-R does. -- Gemaakt met Opera's revolutionaire e-mailprogramma: http://www.opera.com/mail/ (Remove the obvious prefix to reply privately.)
On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote:

>David T. Ashley wrote: > >> By changing to one temporary variable (rather than 3) the compiler >> seems to have shaved out two instructions (one at the start to >> allocate stack space and one at the end to deallocate it). > >In the parameters, you should provide the value of in_u32, >instead of a pointer to the value, because the compiler is >reloading the value every time it is needed. > >You didn't specify. Which CPU(s) are you targeting?
The compiler is set for Thumb. (I believe the Cortex-M3 handles only Thumb-2 instructions, which are a superset of Thumb.) The parameter observation is interesting. I'd have to look at the code in more detail (which I won't do now). But you'd think when you have a CPU with 16 registers, maybe 12 of which you are free to use assuming you push a few, the compiler could allocate 4 registers just to buffer the pass-by-reference inputs. DTA
In article <1hkip7tdo729s18cokiued1uku0frp3rqd@4ax.com>, 
dashley@gmail.com says...
> > On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote: > > >David T. Ashley wrote: > > > >> By changing to one temporary variable (rather than 3) the compiler > >> seems to have shaved out two instructions (one at the start to > >> allocate stack space and one at the end to deallocate it). > > > >In the parameters, you should provide the value of in_u32, > >instead of a pointer to the value, because the compiler is > >reloading the value every time it is needed. > > > >You didn't specify. Which CPU(s) are you targeting? > > The compiler is set for Thumb. (I believe the Cortex-M3 handles only > Thumb-2 instructions, which are a superset of Thumb.) > > The parameter observation is interesting. I'd have to look at the > code in more detail (which I won't do now). > > But you'd think when you have a CPU with 16 registers, maybe 12 of > which you are free to use assuming you push a few, the compiler could > allocate 4 registers just to buffer the pass-by-reference inputs. >
Why all compiler manufacturers assume you are making a tablet/phone/laptop therefor just put in more RAM to blaot around the problem. Just like the chip manufacturers, nothing is made in quantities less than 10,000 and always ahs TerraBytes of RAM. Thats what everybody does of course.. [Hint just a tad of sarcasm in the above statements] -- Paul Carpenter | paul@pcserviceselectronics.co.uk <http://www.pcserviceselectronics.co.uk/> PC Services <http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font <http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny <http://www.badweb.org.uk/> For those web sites you hate
On 26.4.12 4:52 , David T. Ashley wrote:
> On Wed, 25 Apr 2012 10:23:50 +0200, Noob<root@127.0.0.1> wrote: > >> David T. Ashley wrote: >> >>> By changing to one temporary variable (rather than 3) the compiler >>> seems to have shaved out two instructions (one at the start to >>> allocate stack space and one at the end to deallocate it). >> >> In the parameters, you should provide the value of in_u32, >> instead of a pointer to the value, because the compiler is >> reloading the value every time it is needed. >> >> You didn't specify. Which CPU(s) are you targeting? > > The compiler is set for Thumb. (I believe the Cortex-M3 handles only > Thumb-2 instructions, which are a superset of Thumb.) > > The parameter observation is interesting. I'd have to look at the > code in more detail (which I won't do now). > > But you'd think when you have a CPU with 16 registers, maybe 12 of > which you are free to use assuming you push a few, the compiler could > allocate 4 registers just to buffer the pass-by-reference inputs. > > DTA
At least GCC does it - R0 to R3. -- Tauno Voipio