EmbeddedRelated.com
Forums

Any Large Integer Libraries Out There for the ARM?

Started by David T. Ashley April 18, 2012
On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley
<dashley@gmail.com> wrote:

> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
> a hack job in C. It is logically correct, but it probably takes 3-10
> times as long as assembly-language.

Unlikely. With Cortex-M3 and the IAR compiler I came to the following
results:
* C-only: 63 cycles
* C-with-asm: 50 cycles
* asm-only: 39 cycles

In how many cycles does your version run?

-- 
Made with Opera's revolutionary e-mail program: http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:

> On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley <dashley@gmail.com> wrote:
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> Unlikely. With Cortex-M3 and the IAR compiler I came to the following
> results:
> * C-only: 63 cycles
> * C-with-asm: 50 cycles
> * asm-only: 39 cycles
>
> In how many cycles does your version run?

Was that with the GNU ARM compiler? If so, that's why I use IAR also.

Looks like I'd better see if I can optimize some stuff like Boudewijn
has done.

boB
David T. Ashley wrote:

> On Wed, 18 Apr 2012 11:59:46 -0400, mwilson@the-wire.com wrote:
>>
>> I'm still working to get my head around the physical requirements for
>> packaging a release. It seems as though a source->.a process would suit my
>> needs best. That way, ARM/THUMB can be decided project-by-project.
>>
>> I've just started some due diligence and registered with github.
>
> Hi Mel,
>
> I have the same thoughts. Assuming that ".a" is the standard suffix
> for an ARM library, a library would be built from the smallest
> compilable modules (typically a single function each), and the linker
> will automatically bring in only what it needs to satisfy references.

[ ... ]

I've taken a highly opportunistic cargo-cult-style swipe at/of the
OpenSSL bignum code. The results can be obtained via

git clone https://github.com/melwilson/bignum-embedded.git

The results don't look bad for a couple of days' work, but they clearly
have to be massively cleaned up and made consistent before they
constitute a real library.

Mel.
On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:

> On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley <dashley@gmail.com> wrote:
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> Unlikely. With Cortex-M3 and the IAR compiler I came to the following
> results:
> * C-only: 63 cycles
> * C-with-asm: 50 cycles
> * asm-only: 39 cycles
>
> In how many cycles does your version run?
Can you post the C code that led to the 63-cycle execution? And your
C-with-asm? And your asm-only version?

If you've posted it and I missed it, I apologize.

Thanks,
Dave Ashley
David T. Ashley wrote:

> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
> a hack job in C. It is logically correct, but it probably takes 3-10
> times as long as assembly-language.
gcc "understands" the intent of the following code, and produces
fairly decent assembly code for SH-4:

#include <stdint.h>

typedef uint32_t U32;
typedef U32 U96[3];
typedef U32 U128[4];

void mul(U32 u, U96 v, U128 res)
{
  uint64_t m0 = (uint64_t)u * v[0];
  res[0] = m0;

  uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
  res[1] = m1;

  uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
  res[2] = m2;

  res[3] = (m2 >> 32);
}

Do you get acceptable results with this function?

Regards.
On Tue, 24 Apr 2012 13:55:32 +0200, Noob <root@127.0.0.1> wrote:

> David T. Ashley wrote:
>
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> gcc "understands" the intent of the following code, and produces
> fairly decent assembly code for SH-4:
>
> #include <stdint.h>
> typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];
> void mul(U32 u, U96 v, U128 res)
> {
>   uint64_t m0 = (uint64_t)u * v[0];
>   res[0] = m0;
>
>   uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
>   res[1] = m1;
>
>   uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
>   res[2] = m2;
>
>   res[3] = (m2 >> 32);
> }
>
> Do you get acceptable results with this function?
I'll give that a try.

What I used was this ... similar but different.

Your handling of carries looks more efficient.

----------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_T term_96w0_32w0;
   APP_BIGINT_U64_T term_96w1_32w0;
   APP_BIGINT_U64_T term_96w2_32w0;
   APP_BIGINT_U32_NATIVE_T carry;
   APP_BIGINT_U64_NATIVE_T temp_u64;

   //Calculate all of the intermediate terms.  It was discovered that the
   //compiler does a good job of using the UMULL instruction.
   //
   //96w0 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w0_32w0.w[1] = temp_u64 >> 32;
   term_96w0_32w0.w[0] = temp_u64;
   //
   //96w1 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w1_32w0.w[1] = temp_u64 >> 32;
   term_96w1_32w0.w[0] = temp_u64;
   //
   //96w2 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w2_32w0.w[1] = temp_u64 >> 32;
   term_96w2_32w0.w[0] = temp_u64;

   //Add together the terms.  The best way to envision this is the way that
   //children are sometimes taught to multiply with a rectangle and then
   //adding diagonally.
   //
   //Word [0] of the result.  No addition is required, because this word
   //comes from the least significant term alone.
   out_u128->w[0] = term_96w0_32w0.w[0];
   //
   //Word [1] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[1] = temp_u64;
   //
   //Word [2] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[2] = temp_u64;
   //
   //Word [3] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
   out_u128->w[3] = temp_u64;
}
----------

Thanks, DTA
David T. Ashley wrote:

> Noob wrote:
>
>> David T. Ashley wrote:
>>
>>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>>> a hack job in C. It is logically correct, but it probably takes 3-10
>>> times as long as assembly-language.
>>
>> gcc "understands" the intent of the following code, and produces
>> fairly decent assembly code for SH-4:
>>
>> [ ... ]
>>
>> Do you get acceptable results with this function?
>
> I'll give that a try.
>
> What I used was this ... similar but different.
>
> Your handling of carries looks more efficient.
IMHO, the biggest economy, in terms of code density, comes from the
umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
all done "by hand" using 4-6 instructions.)

FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
What CPU are you targeting? ARM or Thumb mode?

mul:
        push    {r4, r5, r6, r7, r8, r9}
        ldr     r6, [r1, #0]
        movs    r5, #0
        umull   r6, r7, r6, r0
        str     r6, [r2, #0]
        ldr     r3, [r1, #4]
        mov     r8, r7
        mov     r9, r5
        umlal   r8, r9, r0, r3
        str     r8, [r2, #4]
        ldr     r3, [r1, #8]
        mov     r6, r9
        mov     r7, r5
        umlal   r6, r7, r0, r3
        str     r6, [r2, #8]
        str     r7, [r2, #12]
        pop     {r4, r5, r6, r7, r8, r9}
        bx      lr

AFAICT, the register copies before the 2 umlal are unnecessary,
which means we could shave 3 instructions.

Regards.
On Tue, 24 Apr 2012 15:55:59 +0200, Noob <root@127.0.0.1> wrote:

> David T. Ashley wrote:
>
>> Noob wrote:
>>
>>> gcc "understands" the intent of the following code, and produces
>>> fairly decent assembly code for SH-4:
>>>
>>> [ ... ]
>>>
>>> Do you get acceptable results with this function?
>>
>> I'll give that a try.
>>
>> What I used was this ... similar but different.
>>
>> Your handling of carries looks more efficient.
>
> IMHO, the biggest economy, in terms of code density, comes from the
> umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
> all done "by hand" using 4-6 instructions.)
>
> FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
> What CPU are you targeting? ARM or Thumb mode?
>
> [ ... ]
>
> AFAICT, the register copies before the 2 umlal are unnecessary,
> which means we could shave 3 instructions.
Thanks.

Examining your code ... the possibility of a carry makes me nervous.
I'd first like to verify that the sum
>>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
can't exceed 64 bits and generate a carry. It doesn't need to be verified that
>>> uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
can't generate a carry, because there is a simple proof of that relying on
the operand sizes and the fact that this is the most significant product.

Squaring the maximum 32-bit integer yields:

   (2^32-1) * (2^32-1) = 2^64 - 2^33 + 1

If the second term is also a maximum square, adjusting for the placement,
its maximum value is:

   2^96 - 2^65 + 2^32

The question of whether there can be a carry is equivalent to the question
of whether

   2^96 - 2^65 + 2^32 + 2^64 - 2^33 + 1 < 2^96

Looking at the terms, this is definitely the case.

OK, I believe that
>>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
cannot generate a carry. This seems plausible because one of the
arguments is 32 bits ... I'm just not sure how larger multiplications
(more terms to add) would work out.

I believe your implementation is correct.

Below is the source code and assembly-language for my original plus
your improved version. Quite a size and speed difference!

Thanks!
DTA

--------
void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128,
                                   const APP_BIGINT_U96_T *in_u96,
                                   const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_T term_96w0_32w0;
   APP_BIGINT_U64_T term_96w1_32w0;
   APP_BIGINT_U64_T term_96w2_32w0;
   APP_BIGINT_U32_NATIVE_T carry;
   APP_BIGINT_U64_NATIVE_T temp_u64;

   //Calculate all of the intermediate terms.  It was discovered that the
   //compiler does a good job of using the UMULL instruction.
   //
   //96w0 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w0_32w0.w[1] = temp_u64 >> 32;
   term_96w0_32w0.w[0] = temp_u64;
   //
   //96w1 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w1_32w0.w[1] = temp_u64 >> 32;
   term_96w1_32w0.w[0] = temp_u64;
   //
   //96w2 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w2_32w0.w[1] = temp_u64 >> 32;
   term_96w2_32w0.w[0] = temp_u64;

   //Add together the terms.  The best way to envision this is the way that
   //children are sometimes taught to multiply with a rectangle and then
   //adding diagonally.
   //
   //Word [0] of the result.  No addition is required, because this word
   //comes from the least significant term alone.
   out_u128->w[0] = term_96w0_32w0.w[0];
   //
   //Word [1] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[1] = temp_u64;
   //
   //Word [2] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[2] = temp_u64;
   //
   //Word [3] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
   out_u128->w[3] = temp_u64;
}

APP_BIGINT_UmulU128U96U32Old PROC
;;;622  void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128,
0001e0  e92d47f0   PUSH     {r4-r10,lr}
;;;625  {
0001e4  b090       SUB      sp,sp,#0x40
0001e6  4604       MOV      r4,r0
0001e8  460d       MOV      r5,r1
0001ea  4616       MOV      r6,r2
;;;639  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
0001ec  6828       LDR      r0,[r5,#0]
0001ee  6831       LDR      r1,[r6,#0]
0001f0  fba00101   UMULL    r0,r1,r0,r1
0001f4  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;640  term_96w0_32w0.w[1] = temp_u64 >> 32;
0001f8  2100       MOVS     r1,#0
0001fa  9809       LDR      r0,[sp,#0x24]
0001fc  e9cd0106   STRD     r0,r1,[sp,#0x18]
000200  900f       STR      r0,[sp,#0x3c]
;;;641  term_96w0_32w0.w[0] = temp_u64;
000202  9808       LDR      r0,[sp,#0x20]
000204  900e       STR      r0,[sp,#0x38]
;;;644  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000206  6868       LDR      r0,[r5,#4]
000208  6831       LDR      r1,[r6,#0]
00020a  fba00101   UMULL    r0,r1,r0,r1
00020e  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;645  term_96w1_32w0.w[1] = temp_u64 >> 32;
000212  2100       MOVS     r1,#0
000214  9809       LDR      r0,[sp,#0x24]
000216  e9cd0106   STRD     r0,r1,[sp,#0x18]
00021a  900d       STR      r0,[sp,#0x34]
;;;646  term_96w1_32w0.w[0] = temp_u64;
00021c  9808       LDR      r0,[sp,#0x20]
00021e  900c       STR      r0,[sp,#0x30]
;;;649  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000220  68a8       LDR      r0,[r5,#8]
000222  6831       LDR      r1,[r6,#0]
000224  fba00101   UMULL    r0,r1,r0,r1
000228  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;650  term_96w2_32w0.w[1] = temp_u64 >> 32;
00022c  2100       MOVS     r1,#0
00022e  9809       LDR      r0,[sp,#0x24]
000230  e9cd0106   STRD     r0,r1,[sp,#0x18]
000234  900b       STR      r0,[sp,#0x2c]
;;;651  term_96w2_32w0.w[0] = temp_u64;
000236  9808       LDR      r0,[sp,#0x20]
000238  900a       STR      r0,[sp,#0x28]
;;;661  out_u128->w[0] = term_96w0_32w0.w[0];
00023a  980e       LDR      r0,[sp,#0x38]
00023c  6020       STR      r0,[r4,#0]
;;;664  temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] + ...
00023e  2000       MOVS     r0,#0
000240  f8dd9030   LDR      r9,[sp,#0x30]
000244  e9cd9004   STRD     r9,r0,[sp,#0x10]
000248  f8dd803c   LDR      r8,[sp,#0x3c]
00024c  4603       MOV      r3,r0
00024e  4601       MOV      r1,r0
000250  e9cd8006   STRD     r8,r0,[sp,#0x18]
000254  eb190008   ADDS     r0,r9,r8
000258  4159       ADCS     r1,r1,r3
00025a  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;667  carry = temp_u64 >> 32;
00025e  2100       MOVS     r1,#0
000260  9809       LDR      r0,[sp,#0x24]
000262  4607       MOV      r7,r0
000264  e9cd0106   STRD     r0,r1,[sp,#0x18]
;;;668  out_u128->w[1] = temp_u64;
000268  9808       LDR      r0,[sp,#0x20]
00026a  6060       STR      r0,[r4,#4]
;;;671  temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + ...
00026c  46ba       MOV      r10,r7
00026e  2000       MOVS     r0,#0
000270  e9cd7002   STRD     r7,r0,[sp,#8]
000274  f8dd9028   LDR      r9,[sp,#0x28]
000278  4603       MOV      r3,r0
00027a  4601       MOV      r1,r0
00027c  e9cd9004   STRD     r9,r0,[sp,#0x10]
000280  eb170009   ADDS     r0,r7,r9
000284  4159       ADCS     r1,r1,r3
000286  e9cd0100   STRD     r0,r1,[sp,#0]
00028a  2000       MOVS     r0,#0
00028c  f8dd8034   LDR      r8,[sp,#0x34]
000290  4603       MOV      r3,r0
000292  e9cd8006   STRD     r8,r0,[sp,#0x18]
000296  9800       LDR      r0,[sp,#0]
000298  eb100008   ADDS     r0,r0,r8
00029c  4159       ADCS     r1,r1,r3
00029e  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;676  carry = temp_u64 >> 32;
0002a2  2100       MOVS     r1,#0
0002a4  9809       LDR      r0,[sp,#0x24]
0002a6  4607       MOV      r7,r0
0002a8  e9cd0106   STRD     r0,r1,[sp,#0x18]
;;;677  out_u128->w[2] = temp_u64;
0002ac  9808       LDR      r0,[sp,#0x20]
0002ae  60a0       STR      r0,[r4,#8]
;;;680  temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + ...
0002b0  46b8       MOV      r8,r7
0002b2  2000       MOVS     r0,#0
0002b4  e9cd7004   STRD     r7,r0,[sp,#0x10]
0002b8  f8dd902c   LDR      r9,[sp,#0x2c]
0002bc  4603       MOV      r3,r0
0002be  4601       MOV      r1,r0
0002c0  e9cd9006   STRD     r9,r0,[sp,#0x18]
0002c4  eb170009   ADDS     r0,r7,r9
0002c8  4159       ADCS     r1,r1,r3
0002ca  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;683  out_u128->w[3] = temp_u64;
0002ce  9808       LDR      r0,[sp,#0x20]
0002d0  60e0       STR      r0,[r4,#0xc]
;;;684  }
0002d2  b010       ADD      sp,sp,#0x40
0002d4  e8bd87f0   POP      {r4-r10,pc}
        ENDP

---------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_NATIVE_T m2, m1, m0;

   m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   out_u128->w[0] = m0;

   m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
   out_u128->w[1] = m1;

   m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
   out_u128->w[2] = m2;

   out_u128->w[3] = (m2 >> 32);
}

APP_BIGINT_UmulU128U96U32 PROC
;;;603  void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
00018e  b570       PUSH     {r4-r6,lr}
;;;606  {
000190  b088       SUB      sp,sp,#0x20
;;;609  m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000192  680b       LDR      r3,[r1,#0]
000194  6814       LDR      r4,[r2,#0]
000196  fba33404   UMULL    r3,r4,r3,r4
00019a  e9cd3402   STRD     r3,r4,[sp,#8]
;;;610  out_u128->w[0] = m0;
00019e  9b02       LDR      r3,[sp,#8]
0001a0  6003       STR      r3,[r0,#0]
;;;612  m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
0001a2  684c       LDR      r4,[r1,#4]
0001a4  6815       LDR      r5,[r2,#0]
0001a6  2600       MOVS     r6,#0
0001a8  9b03       LDR      r3,[sp,#0xc]
0001aa  e9cd3600   STRD     r3,r6,[sp,#0]
0001ae  fbe43605   UMLAL    r3,r6,r4,r5
0001b2  e9cd3604   STRD     r3,r6,[sp,#0x10]
;;;613  out_u128->w[1] = m1;
0001b6  9b04       LDR      r3,[sp,#0x10]
0001b8  6043       STR      r3,[r0,#4]
;;;615  m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
0001ba  688c       LDR      r4,[r1,#8]
0001bc  6815       LDR      r5,[r2,#0]
0001be  2600       MOVS     r6,#0
0001c0  9b05       LDR      r3,[sp,#0x14]
0001c2  e9cd3600   STRD     r3,r6,[sp,#0]
0001c6  fbe43605   UMLAL    r3,r6,r4,r5
0001ca  e9cd3606   STRD     r3,r6,[sp,#0x18]
;;;616  out_u128->w[2] = m2;
0001ce  9b06       LDR      r3,[sp,#0x18]
0001d0  6083       STR      r3,[r0,#8]
;;;618  out_u128->w[3] = (m2 >> 32);
0001d2  2400       MOVS     r4,#0
0001d4  9b07       LDR      r3,[sp,#0x1c]
0001d6  e9cd3400   STRD     r3,r4,[sp,#0]
0001da  60c3       STR      r3,[r0,#0xc]
;;;619  }
0001dc  b008       ADD      sp,sp,#0x20
0001de  bd70       POP      {r4-r6,pc}
        ENDP
Noob wrote:

> IMHO, the biggest economy, in terms of code density, comes from the
> umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
> all done "by hand" using 4-6 instructions.)
>
> FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
> What CPU are you targeting? ARM or Thumb mode?
>
> mul:
>         push    {r4, r5, r6, r7, r8, r9}
>         ldr     r6, [r1, #0]
>         movs    r5, #0
>         umull   r6, r7, r6, r0
>         str     r6, [r2, #0]
>         ldr     r3, [r1, #4]
>         mov     r8, r7
>         mov     r9, r5
>         umlal   r8, r9, r0, r3
>         str     r8, [r2, #4]
>         ldr     r3, [r1, #8]
>         mov     r6, r9
>         mov     r7, r5
>         umlal   r6, r7, r0, r3
>         str     r6, [r2, #8]
>         str     r7, [r2, #12]
>         pop     {r4, r5, r6, r7, r8, r9}
>         bx      lr
>
> AFAICT, the register copies before the 2 umlal are unnecessary,
> which means we could shave 3 instructions.
Hand-written version (direct translation of the C source):

mul:
        push    {r4, r5}
        ldr     r3, [r1, #0]
        umull   r4, r5, r0, r3
        str     r4, [r2, #0]
        ldr     r3, [r1, #4]
        mov     r4, #0
        umlal   r5, r4, r0, r3
        str     r5, [r2, #4]
        ldr     r3, [r1, #8]
        mov     r5, #0
        umlal   r4, r5, r0, r3
        str     r4, [r2, #8]
        str     r5, [r2, #12]
        pop     {r4, r5}
        bx      lr

Depending on the specific CPU being targeted, it might be possible to
use more registers to improve performance.
On Tue, 24 Apr 2012 11:32:54 -0400, David T. Ashley
<dashley@gmail.com> wrote:
>
> Below is the source code and assembly-language for my original plus
> your improved version.
One more note ...

By changing to one temporary variable (rather than 3), the compiler
seems to have shaved out two instructions (one at the start to allocate
stack space and one at the end to deallocate it).

I believe that the definition of C guarantees that

   temp = expression + temp;

will use the "old" value of temp on the right-hand side before storing
the new one, so that should be OK.

Here is the new code and assembly-language (with two fewer
instructions).

I really appreciate the help. The C version you provided is
SUBSTANTIALLY more efficient.

DTA

---------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_NATIVE_T temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   out_u128->w[0] = temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
   out_u128->w[1] = temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
   out_u128->w[2] = temp64;

   out_u128->w[3] = (temp64 >> 32);
}

APP_BIGINT_UmulU128U96U32 PROC
;;;603  void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
00018e  b57f       PUSH     {r0-r6,lr}
;;;609  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000190  680b       LDR      r3,[r1,#0]
000192  6814       LDR      r4,[r2,#0]
000194  fba33404   UMULL    r3,r4,r3,r4
000198  e9cd3402   STRD     r3,r4,[sp,#8]
;;;610  out_u128->w[0] = temp64;
00019c  9b02       LDR      r3,[sp,#8]
00019e  6003       STR      r3,[r0,#0]
;;;612  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
0001a0  684c       LDR      r4,[r1,#4]
0001a2  6815       LDR      r5,[r2,#0]
0001a4  2600       MOVS     r6,#0
0001a6  9b03       LDR      r3,[sp,#0xc]
0001a8  e9cd3600   STRD     r3,r6,[sp,#0]
0001ac  fbe43605   UMLAL    r3,r6,r4,r5
0001b0  e9cd3602   STRD     r3,r6,[sp,#8]
;;;613  out_u128->w[1] = temp64;
0001b4  9b02       LDR      r3,[sp,#8]
0001b6  6043       STR      r3,[r0,#4]
;;;615  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
0001b8  688c       LDR      r4,[r1,#8]
0001ba  6815       LDR      r5,[r2,#0]
0001bc  2600       MOVS     r6,#0
0001be  9b03       LDR      r3,[sp,#0xc]
0001c0  e9cd3600   STRD     r3,r6,[sp,#0]
0001c4  fbe43605   UMLAL    r3,r6,r4,r5
0001c8  e9cd3602   STRD     r3,r6,[sp,#8]
;;;616  out_u128->w[2] = temp64;
0001cc  9b02       LDR      r3,[sp,#8]
0001ce  6083       STR      r3,[r0,#8]
;;;618  out_u128->w[3] = (temp64 >> 32);
0001d0  2400       MOVS     r4,#0
0001d2  9b03       LDR      r3,[sp,#0xc]
0001d4  e9cd3400   STRD     r3,r4,[sp,#0]
0001d8  60c3       STR      r3,[r0,#0xc]
;;;619  }
0001da  bd7f       POP      {r0-r6,pc}
        ENDP