EmbeddedRelated.com
Forums

Signed 64 bit 32.32 fixed point multiplication algorithm problem

Started by alastair2010 May 17, 2008
Hi does anyone know how to do signed 64 bit multiplication on an
arm7-tdmi core (LCP2364)

I only require the centre 32 bits of the result as the values
shouldn't roll into the upper 127 95 bit and the lower 0
31 precision bits can be discarded.

I'm programming a digital filter using 32.32 fixed point arithmetic
and trying to do the processing efficiently.

So far I have the following:

//----------------------------------\
----
void Filter(unsigned int ADCSample)
{
static signed long long xv[5], yv[5];
static signed long long Coeff1 = 0xFFFFFFFFAAAAAA80; // -0.3333333333
static signed long long Coeff2 = 0xFFFFFFFF0EA41100; //
-0.9428090416
static signed long long Gain = 0x0000000018FE59A0; //
0.09763107291035901797
long long InpuVal;

InpuVal = (long long)ADCSample << 32; // Format the 10 bit ADC
sample to 32.32 fixed point

xv[0] = xv[1];
xv[1] = xv[2];
xv[2] = xv[3];
xv[3] = xv[4];
xv[4] = FixedPoint64Mul(&InpuVal, &Gain); // Multiply the ADC
value by a gain value

yv[0] = yv[1];
yv[1] = yv[2];
yv[2] = yv[3];
yv[3] = yv[4];

yv[4] = (xv[0] + xv[4]) - (xv[2] << 1) + FixedPoint64Mul(&Coeff1,
&yv[0]) + FixedPoint64Mul(&Coeff2, &yv[2]); // Filter
}
//----------------------------------\
----
__asm long long FixedPoint64Mul(long long *Num1, long long *Num2)
{
LDR R2, [R0], #4 // Load R2 with lower 32bits (fractional
part) of Num1 and inc pointer to upper 32bits (integer part)
LDR R3, [R1], #4 // Load R3 with lower 32bits (fractional
part) of Num2 and inc pointer to upper 32bits (integer part)
UMULL R4, R5, R2, R3 // Multiply 32 bit values R2*R3 and store 64
bit result on R4 R5 (R4 = lower 32bits)
LDR R2, [R0] // Load R2 with upper 32bits (integer
part) of Num1
LDR R3, [R1], #-4 // Load R3 with upper 32bits (integer
part) of Num2
UMULL R6, R7, R2, R3 // Multiply 32 bit values R2*R3 and store 64
bit result on R6 R7 (R6 = lower 32bits)
LDR R2, [R0], #-4 // Load R2 with upper 32bits (integer
part) of Num1 and dec pointer to lower 32bits of Num1
LDR R3, [R1], #4 // Load R3 with lower 32bits (fractional
part) of Num2 and inc pointer to upper 32bits of Num2
UMLAL R5, R6, R2, R3 // Multiply 32 bit values R2*R3 and add to 64
bit value on R5 R6 then store on R5 R6
LDR R2, [R0] // Load R2 with upper 32bits
(fractional part) of Num1
LDR R3, [R1] // Load R3 with lower 32bits (integer
part) of Num2
UMLAL R5, R6, R2, R3 // Multiply 32 bit values R2*R3 and add to 64
bit value on R5 R6 then store on R5 R6
MOV R0, R5 // Move fractional result to R0
MOV R1, R6 // Move integer result to R1
BX lr // Return
}
//----------------------------------\
----

The FixedPoint64Mul function works perfect with unsigned values but not
signed. I have tried the signed version of multiply instructions SMULL
and SMLAL, the result changes but it's still wrong. The problem
seems to be based around the decimal point as -0.5
(0xffff,ffff.8000,0000) isn't seen as one number but two negative
numbers. I'm quite confused on what the exact problem is.

Can anyone see how to fix this or know of another method?

Alastair


An Engineer's Guide to the LPC2100 Series

----- Original Message -----
From: "alastair2010"
To:
Sent: Sunday, May 18, 2008 1:25 AM
Subject: [lpc2000] Signed 64 bit 32.32 fixed point multiplication algorithm
problem

Hi does anyone know how to do signed 64 bit multiplication on an
arm7-tdmi core (LCP2364)

I only require the centre 32 bits of the result as the values
shouldn't roll into the upper 127 - 95 bit and the lower 0 -
31 precision bits can be discarded.

I'm programming a digital filter using 32.32 fixed point arithmetic
and trying to do the processing efficiently.

Why do you need that much precision for 10-bit samples?

Leon
Hi Alasdair,

you can imagine signed 64b integer as signed 32 bit
high part plus unsigned
32 bit low part. So you want to [signed_high1,
unsigned_low1] x
[signed_high2, unsigned_low2], say [sh1, ul1], [sh1,
ul2]. The result is
128b and you want to take the middle 64b of it. You
need to sum:

high 32b of unsigned multiplication of the ul1 x ul2
low 32b of signed multiplication of the sh1 x sh2
mixed multiplication of sh1 x ul2
mixed multiplication of sh2 x ul1

The mixed multiplication of S and U is signed
multiplication where the
highest bit of U is set to 0. If it was 1, you need to
add S*2^31 to the
result. This is S << 31. If you have 64b integer in
R5, R6 (R5 low, R6 high)
and want to add S << 31 to it, you can store S << 31
to R2 and S >> 1 to R3
and do ADDS R5, R5, R2 and ADC R6, R6, R3. The S >> 1
must be signed shift.

There can be better ways to do this mixed
multiplication, but I only know
one other which is worse.

I tested this on 2 16b numbers, and it seems to work.
I marked to code in
the mul() function with corresponding instructions.

With regards,
Jan

>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>

typedef unsigned short u16;
typedef short i16;
typedef unsigned int u32;
typedef int i32;
typedef __int64 i64;
typedef unsigned __int64 u64;

#pragma pack (push, 1)

struct L2
{
u16 low;
u16 high;
};

#pragma pack (pop)

i32 mul(const L2& left, const L2& right)
{
u32 low = left.low * right.low; // UMULL R4, R5,
R2, R3

i32 high = (i16)left.high * (i16)right.high; //
SMULL R6, R7, R2, R3

i32 mid1 = (i16)left.high * (i16)(right.low &
0x7FFF); // SMLAL R5, R6,
R2, R3
if (right.low & 0x8000)
mid1 = mid1 + ((i16)left.high << 15); // ADDS
R5, R5, R2 and ADC R6,
R6, R3

i32 mid2 = (i16)right.high * (i16)(left.low &
0x7FFF); // SMLAL R5, R6,
R2, R3
if (left.low & 0x8000)
mid2 = mid2 + ((i16)right.high << 15); // ADDS
R5, R5, R2 and ADC
R6, R6, R3

i32 res = (u32)(low >> 16) + mid1 + mid2 +
(u32)(high << 16); // already
accumulated
return res;
}

int _tmain(int argc, _TCHAR* argv[])
{
L2 left = { 0x8001, 0x8001 };
L2 right = { 0x8001, 0x8001 };
i32 leftValue = *(i32*)&left;
i32 rightValue = *(i32*)&right;

i32 r1 = (i32)(((i64)leftValue * (i64)rightValue)
>> 16);
u32 r1unsigned = (u32)(((u64)(u32)leftValue *
(u64)(u32)rightValue) >>
16);
i32 r2 = mul(left, right);
if (r1 != r2)
; // wrong

return 0;
}

----- Original Message -----
From: "alastair2010"
To:
Sent: Sunday, May 18, 2008 2:25 AM
Subject: [lpc2000] Signed 64 bit 32.32 fixed point
multiplication algorithm
problem

Hi does anyone know how to do signed 64 bit
multiplication on an
arm7-tdmi core (LCP2364)

I only require the centre 32 bits of the result as the
values
shouldn't roll into the upper 127 - 95 bit and the
lower 0 -
31 precision bits can be discarded.

I'm programming a digital filter using 32.32 fixed
point arithmetic
and trying to do the processing efficiently.

So far I have the following:

//----------------------------------\
----
void Filter(unsigned int ADCSample)
{
static signed long long xv[5], yv[5];
static signed long long Coeff1 0xFFFFFFFFAAAAAA80; // -0.3333333333
static signed long long Coeff2 0xFFFFFFFF0EA41100; //
-0.9428090416
static signed long long Gain 0x0000000018FE59A0; //
0.09763107291035901797
long long InpuVal;

InpuVal = (long long)ADCSample << 32; //
Format the 10 bit ADC
sample to 32.32 fixed point

xv[0] = xv[1];
xv[1] = xv[2];
xv[2] = xv[3];
xv[3] = xv[4];
xv[4] = FixedPoint64Mul(&InpuVal, &Gain); //
Multiply the ADC
value by a gain value

yv[0] = yv[1];
yv[1] = yv[2];
yv[2] = yv[3];
yv[3] = yv[4];

yv[4] = (xv[0] + xv[4]) - (xv[2] << 1) +
FixedPoint64Mul(&Coeff1,
&yv[0]) + FixedPoint64Mul(&Coeff2, &yv[2]); // Filter
}
//----------------------------------\
----
__asm long long FixedPoint64Mul(long long *Num1, long
long *Num2)
{
LDR R2, [R0], #4 // Load R2 with lower
32bits (fractional
part) of Num1 and inc pointer to upper 32bits (integer
part)
LDR R3, [R1], #4 // Load R3 with lower
32bits (fractional
part) of Num2 and inc pointer to upper 32bits (integer
part)
UMULL R4, R5, R2, R3 // Multiply 32 bit values
R2*R3 and store 64
bit result on R4 R5 (R4 = lower 32bits)
LDR R2, [R0] // Load R2 with
upper 32bits (integer
part) of Num1
LDR R3, [R1], #-4 // Load R3 with upper
32bits (integer
part) of Num2
UMULL R6, R7, R2, R3 // Multiply 32 bit values
R2*R3 and store 64
bit result on R6 R7 (R6 = lower 32bits)
LDR R2, [R0], #-4 // Load R2 with upper
32bits (integer
part) of Num1 and dec pointer to lower 32bits of Num1
LDR R3, [R1], #4 // Load R3 with lower
32bits (fractional
part) of Num2 and inc pointer to upper 32bits of Num2
UMLAL R5, R6, R2, R3 // Multiply 32 bit values
R2*R3 and add to 64
bit value on R5 R6 then store on R5 R6
LDR R2, [R0] // Load R2 with
upper 32bits
(fractional part) of Num1
LDR R3, [R1] // Load R3 with
lower 32bits (integer
part) of Num2
UMLAL R5, R6, R2, R3 // Multiply 32 bit values
R2*R3 and add to 64
bit value on R5 R6 then store on R5 R6
MOV R0, R5 // Move fractional
result to R0
MOV R1, R6 // Move integer
result to R1
BX lr // Return
}
//----------------------------------\
----

The FixedPoint64Mul function works perfect with
unsigned values but not
signed. I have tried the signed version of multiply
instructions SMULL
and SMLAL, the result changes but it's still wrong.
The problem
seems to be based around the decimal point as -0.5
(0xffff,ffff.8000,0000) isn't seen as one number but
two negative
numbers. I'm quite confused on what the exact problem
is.

Can anyone see how to fix this or know of another
method?

Alastair



Wow thanks for the help. I understand now what the problem was and how
to fix it.

I not sure how much precision is needed so I went for an over kill with
32.32

I now have the whole filter in 40 assembly instructions working with
16.16 precision with sample rate of 400000. The centre frequency in
practise is 71KHz (outputting the filter's output results to the DAC)
where in theory it should be 100KHz. I think this is because of a lack
in fractional precision so I will try 12.20 precision next.

Thanks for the help

Alastair

I'd just like to warn y'all against second-guessing your compiler here. Many
times I've started down the path of implementing certain functionality in
assembly, only to find that even the free GCC compiler is much, much smarter
than I. Instruction re-ordering is very important with pipelined CPUs like
the ARM, and I find it to be non-intuitive - at least to the point that I'd
much rather have my compiler do it for me - though I do often take a look at
what the compiler is doing anyway, in critical points.

My preferred method of late is to only use assembly to wrap instructions
that my compiler won't emit otherwise, like SWP, the multiply-accumulate
instructions, and specific versions of multiple load-store. I wrap them in
brief static inline functions in my architecture-level headers, and then
continue writing C. I question whether the vast majority of programmers
could do significantly better, without a lot more time invested. Of course
there will always be exceptions to every rule...

On Mon, May 19, 2008 at 3:19 AM, alastair2010
wrote:

> Wow thanks for the help. I understand now what the problem was and how
> to fix it.
>
> I not sure how much precision is needed so I went for an over kill with
> 32.32
>
> I now have the whole filter in 40 assembly instructions working with
> 16.16 precision with sample rate of 400000. The centre frequency in
> practise is 71KHz (outputting the filter's output results to the DAC)
> where in theory it should be 100KHz. I think this is because of a lack
> in fractional precision so I will try 12.20 precision next.
>
> Thanks for the help
>
> Alastair
>
>