EmbeddedRelated.com
Forums

Any Large Integer Libraries Out There for the ARM?

Started by David T. Ashley April 18, 2012
On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley
<dashley@gmail.com> wrote:

> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
> a hack job in C. It is logically correct, but it probably takes 3-10
> times as long as assembly-language.

Unlikely. With Cortex-M3 and the IAR compiler I came to the following
results:
* C-only: 63 cycles
* C-with-asm: 50 cycles
* asm-only: 39 cycles

In how many cycles does your version run?

-- 
Made with Opera's revolutionary e-mail program: http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:

> On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley <dashley@gmail.com> wrote:
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> Unlikely. With Cortex-M3 and the IAR compiler I came to the following
> results:
> * C-only: 63 cycles
> * C-with-asm: 50 cycles
> * asm-only: 39 cycles
>
> In how many cycles does your version run?

Was that with the GNU ARM compiler? If so, that's why I use IAR also.

Looks like I'd better see if I can optimize some stuff like Boudewijn
has done.

boB
David T. Ashley wrote:

> On Wed, 18 Apr 2012 11:59:46 -0400, mwilson@the-wire.com wrote:
>>
>> I'm still working to get my head around the physical requirements for
>> packaging a release. It seems as though a source->.a process would suit my
>> needs best. That way, ARM/THUMB can be decided project-by-project.
>>
>> I've just started some due diligence and registered with github.
>
> Hi Mel,
>
> I have the same thoughts. Assuming that ".a" is the standard suffix
> for an ARM library, a library would be built from the smallest
> compilable modules (typically a single function each), and the linker
> will automatically bring in only what it needs to satisfy references.

[ ... ]

I've taken a highly opportunistic cargo-cult-style swipe at/of the
OpenSSL bignum code. The results can be obtained via

git clone https://github.com/melwilson/bignum-embedded.git

The results don't look bad for a couple of days' work, but they clearly
have to be massively cleaned up and made consistent before they
constitute a real library.

Mel.
On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
<sp4mtr4p.boudewijn@indes.com> wrote:

> On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley <dashley@gmail.com> wrote:
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> Unlikely. With Cortex-M3 and the IAR compiler I came to the following
> results:
> * C-only: 63 cycles
> * C-with-asm: 50 cycles
> * asm-only: 39 cycles
>
> In how many cycles does your version run?
Can you post the C code that led to the 63-cycle execution? And your
C-with-asm? And your asm-only version?

If you've posted it and I missed it, I apologize.

Thanks,
Dave Ashley
David T. Ashley wrote:

> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
> a hack job in C. It is logically correct, but it probably takes 3-10
> times as long as assembly-language.
gcc "understands" the intent of the following code, and produces
fairly decent assembly code for SH-4:

#include <stdint.h>

typedef uint32_t U32;
typedef U32 U96[3];
typedef U32 U128[4];

void mul(U32 u, U96 v, U128 res)
{
  uint64_t m0 = (uint64_t)u * v[0];
  res[0] = m0;

  uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
  res[1] = m1;

  uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
  res[2] = m2;

  res[3] = (m2 >> 32);
}

Do you get acceptable results with this function?

Regards.
On Tue, 24 Apr 2012 13:55:32 +0200, Noob <root@127.0.0.1> wrote:

> David T. Ashley wrote:
>
>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>> a hack job in C. It is logically correct, but it probably takes 3-10
>> times as long as assembly-language.
>
> gcc "understands" the intent of the following code, and produces
> fairly decent assembly code for SH-4:
>
> #include <stdint.h>
> typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];
> void mul(U32 u, U96 v, U128 res)
> {
>   uint64_t m0 = (uint64_t)u * v[0];
>   res[0] = m0;
>
>   uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
>   res[1] = m1;
>
>   uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
>   res[2] = m2;
>
>   res[3] = (m2 >> 32);
> }
>
> Do you get acceptable results with this function?
I'll give that a try.

What I used was this ... similar but different.

Your handling of carries looks more efficient.

----------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_T term_96w0_32w0;
   APP_BIGINT_U64_T term_96w1_32w0;
   APP_BIGINT_U64_T term_96w2_32w0;
   APP_BIGINT_U32_NATIVE_T carry;
   APP_BIGINT_U64_NATIVE_T temp_u64;

   //Calculate all of the intermediate terms.  It was discovered that the
   //compiler does a good job of using the UMULL instruction.
   //
   //96w0 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w0_32w0.w[1] = temp_u64 >> 32;
   term_96w0_32w0.w[0] = temp_u64;
   //
   //96w1 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w1_32w0.w[1] = temp_u64 >> 32;
   term_96w1_32w0.w[0] = temp_u64;
   //
   //96w2 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w2_32w0.w[1] = temp_u64 >> 32;
   term_96w2_32w0.w[0] = temp_u64;

   //Add together the terms.  The best way to envision this is the way that
   //children are sometimes taught to multiply with a rectangle and then
   //adding diagonally.
   //
   //Word [0] of the result.  No addition is required, because this word
   //comes from the least significant term alone.
   out_u128->w[0] = term_96w0_32w0.w[0];
   //
   //Word [1] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[1] = temp_u64;
   //
   //Word [2] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[2] = temp_u64;
   //
   //Word [3] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
   out_u128->w[3] = temp_u64;
}
----------

Thanks, DTA
David T. Ashley wrote:

> Noob wrote:
>
>> David T. Ashley wrote:
>>
>>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>>> a hack job in C. It is logically correct, but it probably takes 3-10
>>> times as long as assembly-language.
>>
>> gcc "understands" the intent of the following code, and produces
>> fairly decent assembly code for SH-4:
>>
>> [ ... ]
>>
>> Do you get acceptable results with this function?
>
> I'll give that a try.
>
> What I used was this ... similar but different.
>
> Your handling of carries looks more efficient.
IMHO, the biggest economy, in terms of code density, comes from the
umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
all done "by hand" using 4-6 instructions.)

FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
What CPU are you targeting? ARM or Thumb mode?

mul:
        push    {r4, r5, r6, r7, r8, r9}
        ldr     r6, [r1, #0]
        movs    r5, #0
        umull   r6, r7, r6, r0
        str     r6, [r2, #0]
        ldr     r3, [r1, #4]
        mov     r8, r7
        mov     r9, r5
        umlal   r8, r9, r0, r3
        str     r8, [r2, #4]
        ldr     r3, [r1, #8]
        mov     r6, r9
        mov     r7, r5
        umlal   r6, r7, r0, r3
        str     r6, [r2, #8]
        str     r7, [r2, #12]
        pop     {r4, r5, r6, r7, r8, r9}
        bx      lr

AFAICT, the register copies before the 2 umlal are unnecessary,
which means we could shave 3 instructions.

Regards.
On Tue, 24 Apr 2012 15:55:59 +0200, Noob <root@127.0.0.1> wrote:

> David T. Ashley wrote:
>
>> Noob wrote:
>>
>>> gcc "understands" the intent of the following code, and produces
>>> fairly decent assembly code for SH-4:
>>>
>>> [ ... ]
>>>
>>> Do you get acceptable results with this function?
>>
>> I'll give that a try.
>>
>> What I used was this ... similar but different.
>>
>> Your handling of carries looks more efficient.
>
> IMHO, the biggest economy, in terms of code density, comes from the
> umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
> all done "by hand" using 4-6 instructions.)
>
> FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
> What CPU are you targeting? ARM or Thumb mode?
>
> [ ... ]
>
> AFAICT, the register copies before the 2 umlal are unnecessary,
> which means we could shave 3 instructions.
Thanks.

Examining your code ... the possibility of a carry makes me nervous.
I'd first like to verify that the sum
>>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
can't exceed 64 bits and generate a carry. It doesn't need to be verified that
>>> uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
can't generate a carry, because there is a simple proof of that relying on
the operand sizes and the fact that this is the most significant product.

Squaring the maximum 32-bit integer yields:

   (2^32-1) * (2^32-1) = 2^64 - 2^33 + 1

If the second term is also a maximum square, adjusting for the placement,
its maximum value is:

   2^96 - 2^65 + 2^32

The question of whether there can be a carry is equivalent to the question
of whether

   2^96 - 2^65 + 2^32 + 2^64 - 2^33 + 1 < 2^96

Looking at the terms, this is definitely the case.

OK, I believe that
>>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
cannot generate a carry. This seems plausible because one of the
arguments is 32 bits ... I'm just not sure how larger multiplications
(more terms to add) would work out.

I believe your implementation is correct.

Below is the source code and assembly-language for my original plus
your improved version. Quite a size and speed difference!

Thanks!
DTA

--------
void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128,
                                   const APP_BIGINT_U96_T *in_u96,
                                   const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_T term_96w0_32w0;
   APP_BIGINT_U64_T term_96w1_32w0;
   APP_BIGINT_U64_T term_96w2_32w0;
   APP_BIGINT_U32_NATIVE_T carry;
   APP_BIGINT_U64_NATIVE_T temp_u64;

   //Calculate all of the intermediate terms.  It was discovered that the
   //compiler does a good job of using the UMULL instruction.
   //
   //96w0 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w0_32w0.w[1] = temp_u64 >> 32;
   term_96w0_32w0.w[0] = temp_u64;
   //
   //96w1 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w1_32w0.w[1] = temp_u64 >> 32;
   term_96w1_32w0.w[0] = temp_u64;
   //
   //96w2 * 32w0
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   term_96w2_32w0.w[1] = temp_u64 >> 32;
   term_96w2_32w0.w[0] = temp_u64;

   //Add together the terms.  The best way to envision this is the way that
   //children are sometimes taught to multiply with a rectangle and then
   //adding diagonally.
   //
   //Word [0] of the result.  No addition is required, because this word
   //comes from the least significant term alone.
   out_u128->w[0] = term_96w0_32w0.w[0];
   //
   //Word [1] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[1] = temp_u64;
   //
   //Word [2] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0]
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
   carry = temp_u64 >> 32;
   out_u128->w[2] = temp_u64;
   //
   //Word [3] of the result.
   temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
              +
              (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
   out_u128->w[3] = temp_u64;
}

APP_BIGINT_UmulU128U96U32Old PROC
;;;622  void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128,
0001e0  e92d47f0   PUSH     {r4-r10,lr}
;;;625  {
0001e4  b090       SUB      sp,sp,#0x40
0001e6  4604       MOV      r4,r0
0001e8  460d       MOV      r5,r1
0001ea  4616       MOV      r6,r2
;;;639  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
0001ec  6828       LDR      r0,[r5,#0]
0001ee  6831       LDR      r1,[r6,#0]
0001f0  fba00101   UMULL    r0,r1,r0,r1
0001f4  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;640  term_96w0_32w0.w[1] = temp_u64 >> 32;
0001f8  2100       MOVS     r1,#0
0001fa  9809       LDR      r0,[sp,#0x24]
0001fc  e9cd0106   STRD     r0,r1,[sp,#0x18]
000200  900f       STR      r0,[sp,#0x3c]
;;;641  term_96w0_32w0.w[0] = temp_u64;
000202  9808       LDR      r0,[sp,#0x20]
000204  900e       STR      r0,[sp,#0x38]
;;;644  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000206  6868       LDR      r0,[r5,#4]
000208  6831       LDR      r1,[r6,#0]
00020a  fba00101   UMULL    r0,r1,r0,r1
00020e  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;645  term_96w1_32w0.w[1] = temp_u64 >> 32;
000212  2100       MOVS     r1,#0
000214  9809       LDR      r0,[sp,#0x24]
000216  e9cd0106   STRD     r0,r1,[sp,#0x18]
00021a  900d       STR      r0,[sp,#0x34]
;;;646  term_96w1_32w0.w[0] = temp_u64;
00021c  9808       LDR      r0,[sp,#0x20]
00021e  900c       STR      r0,[sp,#0x30]
;;;649  temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000220  68a8       LDR      r0,[r5,#8]
000222  6831       LDR      r1,[r6,#0]
000224  fba00101   UMULL    r0,r1,r0,r1
000228  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;650  term_96w2_32w0.w[1] = temp_u64 >> 32;
00022c  2100       MOVS     r1,#0
00022e  9809       LDR      r0,[sp,#0x24]
000230  e9cd0106   STRD     r0,r1,[sp,#0x18]
000234  900b       STR      r0,[sp,#0x2c]
;;;651  term_96w2_32w0.w[0] = temp_u64;
000236  9808       LDR      r0,[sp,#0x20]
000238  900a       STR      r0,[sp,#0x28]
;;;661  out_u128->w[0] = term_96w0_32w0.w[0];
00023a  980e       LDR      r0,[sp,#0x38]
00023c  6020       STR      r0,[r4,#0]
;;;664  temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] + ...
00023e  2000       MOVS     r0,#0
000240  f8dd9030   LDR      r9,[sp,#0x30]
000244  e9cd9004   STRD     r9,r0,[sp,#0x10]
000248  f8dd803c   LDR      r8,[sp,#0x3c]
00024c  4603       MOV      r3,r0
00024e  4601       MOV      r1,r0
000250  e9cd8006   STRD     r8,r0,[sp,#0x18]
000254  eb190008   ADDS     r0,r9,r8
000258  4159       ADCS     r1,r1,r3
00025a  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;667  carry = temp_u64 >> 32;
00025e  2100       MOVS     r1,#0
000260  9809       LDR      r0,[sp,#0x24]
000262  4607       MOV      r7,r0
000264  e9cd0106   STRD     r0,r1,[sp,#0x18]
;;;668  out_u128->w[1] = temp_u64;
000268  9808       LDR      r0,[sp,#0x20]
00026a  6060       STR      r0,[r4,#4]
;;;671  temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + ...
00026c  46ba       MOV      r10,r7
00026e  2000       MOVS     r0,#0
000270  e9cd7002   STRD     r7,r0,[sp,#8]
000274  f8dd9028   LDR      r9,[sp,#0x28]
000278  4603       MOV      r3,r0
00027a  4601       MOV      r1,r0
00027c  e9cd9004   STRD     r9,r0,[sp,#0x10]
000280  eb170009   ADDS     r0,r7,r9
000284  4159       ADCS     r1,r1,r3
000286  e9cd0100   STRD     r0,r1,[sp,#0]
00028a  2000       MOVS     r0,#0
00028c  f8dd8034   LDR      r8,[sp,#0x34]
000290  4603       MOV      r3,r0
000292  e9cd8006   STRD     r8,r0,[sp,#0x18]
000296  9800       LDR      r0,[sp,#0]
000298  eb100008   ADDS     r0,r0,r8
00029c  4159       ADCS     r1,r1,r3
00029e  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;676  carry = temp_u64 >> 32;
0002a2  2100       MOVS     r1,#0
0002a4  9809       LDR      r0,[sp,#0x24]
0002a6  4607       MOV      r7,r0
0002a8  e9cd0106   STRD     r0,r1,[sp,#0x18]
;;;677  out_u128->w[2] = temp_u64;
0002ac  9808       LDR      r0,[sp,#0x20]
0002ae  60a0       STR      r0,[r4,#8]
;;;680  temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + ...
0002b0  46b8       MOV      r8,r7
0002b2  2000       MOVS     r0,#0
0002b4  e9cd7004   STRD     r7,r0,[sp,#0x10]
0002b8  f8dd902c   LDR      r9,[sp,#0x2c]
0002bc  4603       MOV      r3,r0
0002be  4601       MOV      r1,r0
0002c0  e9cd9006   STRD     r9,r0,[sp,#0x18]
0002c4  eb170009   ADDS     r0,r7,r9
0002c8  4159       ADCS     r1,r1,r3
0002ca  e9cd0108   STRD     r0,r1,[sp,#0x20]
;;;683  out_u128->w[3] = temp_u64;
0002ce  9808       LDR      r0,[sp,#0x20]
0002d0  60e0       STR      r0,[r4,#0xc]
;;;684  }
0002d2  b010       ADD      sp,sp,#0x40
0002d4  e8bd87f0   POP      {r4-r10,pc}
        ENDP

---------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_NATIVE_T m2, m1, m0;

   m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   out_u128->w[0] = m0;

   m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
   out_u128->w[1] = m1;

   m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
   out_u128->w[2] = m2;

   out_u128->w[3] = (m2 >> 32);
}

APP_BIGINT_UmulU128U96U32 PROC
;;;603  void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
00018e  b570       PUSH     {r4-r6,lr}
;;;606  {
000190  b088       SUB      sp,sp,#0x20
;;;609  m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000192  680b       LDR      r3,[r1,#0]
000194  6814       LDR      r4,[r2,#0]
000196  fba33404   UMULL    r3,r4,r3,r4
00019a  e9cd3402   STRD     r3,r4,[sp,#8]
;;;610  out_u128->w[0] = m0;
00019e  9b02       LDR      r3,[sp,#8]
0001a0  6003       STR      r3,[r0,#0]
;;;612  m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
0001a2  684c       LDR      r4,[r1,#4]
0001a4  6815       LDR      r5,[r2,#0]
0001a6  2600       MOVS     r6,#0
0001a8  9b03       LDR      r3,[sp,#0xc]
0001aa  e9cd3600   STRD     r3,r6,[sp,#0]
0001ae  fbe43605   UMLAL    r3,r6,r4,r5
0001b2  e9cd3604   STRD     r3,r6,[sp,#0x10]
;;;613  out_u128->w[1] = m1;
0001b6  9b04       LDR      r3,[sp,#0x10]
0001b8  6043       STR      r3,[r0,#4]
;;;615  m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
0001ba  688c       LDR      r4,[r1,#8]
0001bc  6815       LDR      r5,[r2,#0]
0001be  2600       MOVS     r6,#0
0001c0  9b05       LDR      r3,[sp,#0x14]
0001c2  e9cd3600   STRD     r3,r6,[sp,#0]
0001c6  fbe43605   UMLAL    r3,r6,r4,r5
0001ca  e9cd3606   STRD     r3,r6,[sp,#0x18]
;;;616  out_u128->w[2] = m2;
0001ce  9b06       LDR      r3,[sp,#0x18]
0001d0  6083       STR      r3,[r0,#8]
;;;618  out_u128->w[3] = (m2 >> 32);
0001d2  2400       MOVS     r4,#0
0001d4  9b07       LDR      r3,[sp,#0x1c]
0001d6  e9cd3400   STRD     r3,r4,[sp,#0]
0001da  60c3       STR      r3,[r0,#0xc]
;;;619  }
0001dc  b008       ADD      sp,sp,#0x20
0001de  bd70       POP      {r4-r6,pc}
        ENDP
Noob wrote:

> IMHO, the biggest economy, in terms of code density, comes from the
> umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
> all done "by hand" using 4-6 instructions.)
>
> FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
> What CPU are you targeting? ARM or Thumb mode?
>
> mul:
>         push    {r4, r5, r6, r7, r8, r9}
>         ldr     r6, [r1, #0]
>         movs    r5, #0
>         umull   r6, r7, r6, r0
>         str     r6, [r2, #0]
>         ldr     r3, [r1, #4]
>         mov     r8, r7
>         mov     r9, r5
>         umlal   r8, r9, r0, r3
>         str     r8, [r2, #4]
>         ldr     r3, [r1, #8]
>         mov     r6, r9
>         mov     r7, r5
>         umlal   r6, r7, r0, r3
>         str     r6, [r2, #8]
>         str     r7, [r2, #12]
>         pop     {r4, r5, r6, r7, r8, r9}
>         bx      lr
>
> AFAICT, the register copies before the 2 umlal are unnecessary,
> which means we could shave 3 instructions.
Hand-written version (direct translation of the C source):

mul:
        push    {r4, r5}
        ldr     r3, [r1, #0]
        umull   r4, r5, r0, r3
        str     r4, [r2, #0]
        ldr     r3, [r1, #4]
        mov     r4, #0
        umlal   r5, r4, r0, r3
        str     r5, [r2, #4]
        ldr     r3, [r1, #8]
        mov     r5, #0
        umlal   r4, r5, r0, r3
        str     r4, [r2, #8]
        str     r5, [r2, #12]
        pop     {r4, r5}
        bx      lr

Depending on the specific CPU being targeted, it might be possible to
use more registers to improve performance.
On Tue, 24 Apr 2012 11:32:54 -0400, David T. Ashley
<dashley@gmail.com> wrote:
>
> Below is the source code and assembly-language for my original plus
> your improved version.
One more note ...

By changing to one temporary variable (rather than 3), the compiler
seems to have shaved out two instructions (one at the start to allocate
stack space and one at the end to deallocate it).

I believe that the definition of C guarantees that

   temp = expression + temp;

will use the "old" value of temp on the right-hand side before storing
the new one, so that should be OK.

Here is the new code and assembly-language (with two fewer
instructions).

I really appreciate the help. The C version you provided is
SUBSTANTIALLY more efficient.

DTA

---------
void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                const APP_BIGINT_U96_T *in_u96,
                                const APP_BIGINT_U32_T *in_u32)
{
   APP_BIGINT_U64_NATIVE_T temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
   out_u128->w[0] = temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
   out_u128->w[1] = temp64;

   temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
   out_u128->w[2] = temp64;

   out_u128->w[3] = (temp64 >> 32);
}

APP_BIGINT_UmulU128U96U32 PROC
;;;603  void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
00018e  b57f       PUSH     {r0-r6,lr}
;;;609  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
000190  680b       LDR      r3,[r1,#0]
000192  6814       LDR      r4,[r2,#0]
000194  fba33404   UMULL    r3,r4,r3,r4
000198  e9cd3402   STRD     r3,r4,[sp,#8]
;;;610  out_u128->w[0] = temp64;
00019c  9b02       LDR      r3,[sp,#8]
00019e  6003       STR      r3,[r0,#0]
;;;612  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
0001a0  684c       LDR      r4,[r1,#4]
0001a2  6815       LDR      r5,[r2,#0]
0001a4  2600       MOVS     r6,#0
0001a6  9b03       LDR      r3,[sp,#0xc]
0001a8  e9cd3600   STRD     r3,r6,[sp,#0]
0001ac  fbe43605   UMLAL    r3,r6,r4,r5
0001b0  e9cd3602   STRD     r3,r6,[sp,#8]
;;;613  out_u128->w[1] = temp64;
0001b4  9b02       LDR      r3,[sp,#8]
0001b6  6043       STR      r3,[r0,#4]
;;;615  temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
0001b8  688c       LDR      r4,[r1,#8]
0001ba  6815       LDR      r5,[r2,#0]
0001bc  2600       MOVS     r6,#0
0001be  9b03       LDR      r3,[sp,#0xc]
0001c0  e9cd3600   STRD     r3,r6,[sp,#0]
0001c4  fbe43605   UMLAL    r3,r6,r4,r5
0001c8  e9cd3602   STRD     r3,r6,[sp,#8]
;;;616  out_u128->w[2] = temp64;
0001cc  9b02       LDR      r3,[sp,#8]
0001ce  6083       STR      r3,[r0,#8]
;;;618  out_u128->w[3] = (temp64 >> 32);
0001d0  2400       MOVS     r4,#0
0001d2  9b03       LDR      r3,[sp,#0xc]
0001d4  e9cd3400   STRD     r3,r4,[sp,#0]
0001d8  60c3       STR      r3,[r0,#0xc]
;;;619  }
0001da  bd7f       POP      {r0-r6,pc}
        ENDP