
Fixed Point Calculations

Started by Rick C January 1, 2021
I'm working with custom-designed hardware for fixed-point arithmetic.  I expect it won't be too difficult once I get my head wrapped around it, but I seem to be spinning my wheels getting started.

The arithmetic hardware has an 18x18 multiplier producing a 36-bit result feeding a three-input, 55-bit adder.  The other two inputs are a 54-bit input that is subtracted and a loopback of the output for accumulation operations (using an enable).

The algorithms to be implemented are not very complex.  Analog sensors are read as 17-bit integers from 0 to 167772 (so a bit more than 2^17).  An offset is subtracted from all of them, with different scale factors for each.  Some are simple and used directly.  Others must be processed a bit more by factoring in other inputs for scale correction.  One is used directly but also integrated over a period of time to derive another quantity.  Some values need to be compared to limits.

To preserve the most significant bits I am thinking of working in Q1.17 format.  Any inputs that are not already close to a value that would be treated as ~1.0 can be multiplied to "normalize" the working value.  At the end of the calculations the appropriate scale factor can be used to retrieve the actual value of each quantity.
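
As a concrete sketch of that normalization (a C model; the 167772 full-scale figure is from above, but mapping full scale to ~1.0 in Q1.17 is just an assumed choice for illustration):

    #include <stdint.h>

    #define Q17_ONE    131072   /* 1.0 in Q1.17 */
    #define FULL_SCALE 167772   /* maximum raw sensor reading */

    /* Map a raw reading 0..167772 into Q1.17 so full scale sits at ~1.0. */
    static uint32_t normalize_q17(uint32_t raw)
    {
        return (uint32_t)(((uint64_t)raw * Q17_ONE) / FULL_SCALE);
    }

    /* Undo the scale factor at the end of the calculation chain. */
    static double denormalize_units(uint32_t q17)
    {
        return (double)q17 * FULL_SCALE / Q17_ONE;
    }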

Some of the correction equations require quadratic calculations.  These will require adjustment of the coefficients.  Others are more straightforward, with simple linear expressions.

I guess I just need to map it all out and start plugging away to see if there are any pitfalls.  I haven't done much with fixed point arithmetic.  In computing the quadratic correction it seems like a lot of data will potentially be lost because the coefficient is small.  I can square the variable first, which will lose significant bits if it is small, or multiply the variable by the coefficient first, which will also lose significant bits.  I expect I may need to add a shifter to select just the right bits on each multiply rather than always picking off the msbs.

I guess I should do the math to see what the new coefficient will be.  If the variable is normalized from around 100 down to around 1.0, for example, the x^2 coefficient should be multiplied by 10,000 and the x coefficient should be multiplied by 100.  Hmmm, or maybe the x^2 coefficient should be multiplied by 100 and the x coefficient not multiplied at all.  I guess I should push some values around in a spreadsheet to see what gives the same answer as floating point.
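
Here is that spreadsheet exercise as a small C program instead (the coefficients and the scale factor s = 100 are made up for illustration).  It confirms the first guess: with x' = x/s, the x^2 coefficient picks up a factor of s^2 = 10,000 and the x coefficient a factor of s = 100:

    #include <stdio.h>

    int main(void)
    {
        double a = 3.2e-6, b = 0.004, c = 1.25; /* made-up quadratic correction */
        double s = 100.0;                       /* normalization: x' = x / s    */
        double x = 87.0;                        /* a sample input value         */

        double y_direct = a * x * x + b * x + c;

        double xn = x / s;                      /* normalized variable          */
        double y_norm = (a * s * s) * xn * xn + (b * s) * xn + c;

        printf("%.9f vs %.9f\n", y_direct, y_norm); /* equal up to FP rounding  */
        return 0;
    }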

-- 

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
On Fri, 1 Jan 2021 02:18:42 -0800 (PST), Rick C
<gnuarm.deletethisbit@gmail.com> wrote:

>The arithmetic hardware has an 18x18 multiplier producing a 36-bit result feeding a three-input, 55-bit adder. The other two inputs are a 54-bit input that is subtracted and a loopback of the output for accumulation operations (using an enable).
Is the multiplier signed or unsigned?

If the multiplier is unsigned and the result of the preceding subtraction is negative, this will cause havoc in the multiply result. If the multiplier input is negative, set it to zero.

If the multiplier is signed and the result of the preceding addition/subtraction is larger than the signed maximum (check the carry and overflow flags), set the value to the largest positive or negative value. That is, use saturated addition/subtraction.
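
A minimal C model of the saturation being suggested (a sketch, not the actual hardware; the 54-bit width matches the adder described above):

    #include <stdint.h>

    #define ACC_BITS 54
    #define ACC_MAX  ((INT64_C(1) << (ACC_BITS - 1)) - 1)
    #define ACC_MIN  (-(INT64_C(1) << (ACC_BITS - 1)))

    /* Saturating signed add: compute in a wider type, then clamp. */
    static int64_t sat_add54(int64_t a, int64_t b)
    {
        int64_t s = a + b;               /* |a|,|b| < 2^53, so no int64 overflow */
        if (s > ACC_MAX) return ACC_MAX; /* clamp to largest positive value */
        if (s < ACC_MIN) return ACC_MIN; /* clamp to most negative value    */
        return s;
    }
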
>Analog sensors are read as 17-bit integers from 0 to 167772 (so a bit more than 2^17).
Do you really have sensors producing 17 significant bits?
On Friday, January 1, 2021 at 6:33:44 AM UTC-5, upsid...@downunder.com wrote:
> Is the multiplier signed or unsigned?
>
> If the multiplier is unsigned and the result of the preceding subtraction is negative, this will cause havoc in the multiply result. If the multiplier input is negative, set it to zero.
>
> If the multiplier is signed and the result of the preceding addition/subtraction is larger than the signed maximum (check the carry and overflow flags), set the value to the largest positive or negative value. That is, use saturated addition/subtraction.
I won't be feeding any negative numbers into the multiplier, so it doesn't matter, or does it??? If I multiply 8000h by 8000h with sign extension to the 32-bit register, the only difference in the result is the msb, which is a bit that would be ignored in the result. In this design the inputs are Q1.17, so the result will be Q2.34. The interesting bits are Q1.17 working from the next-to-msb down. The msb and 17 lsbs are dropped to produce the Q1.17 output. But looking at an example of multiplying -1 x -1, I can see that would turn out differently depending on whether it is considered signed or unsigned.

I remember coding a shift-and-add multiplier once, and that had issues with a signed multiplier. So for negative multipliers I just complemented the multiplier to be a positive number and subtracted all the partial products, or complemented the result, or something similar.

I dug into the code for simulating the DSP units and it appears they provide controls to indicate whether each input is signed or unsigned. In the simulation they double the word width going into the multiplier, sign extending if signed, then pick the 36 lsbs for the result. This seems to work for signed values using the same hardware as for unsigned.
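
A quick C check of that last point (a sketch, not the vendor's simulation code): with the operands sign-extended, the same unsigned multiplier gives the correct signed product in the low 36 bits, because the product of two 18-bit signed values always fits there:

    #include <stdint.h>
    #include <assert.h>

    #define MASK36 ((UINT64_C(1) << 36) - 1)

    /* What the unsigned hardware computes: product of the extended bit
       patterns, truncated to 36 bits (wraparound above is harmless). */
    static uint64_t mul_low36_unsigned(uint64_t a36, uint64_t b36)
    {
        return (a36 * b36) & MASK36;
    }

    int main(void)
    {
        int32_t a = -131072, b = -131072;             /* -1.0 x -1.0 in Q1.17   */
        uint64_t a36 = (uint64_t)(int64_t)a & MASK36; /* sign-extend to 36 bits */
        uint64_t b36 = (uint64_t)(int64_t)b & MASK36;
        uint64_t p = mul_low36_unsigned(a36, b36);
        assert(p == ((uint64_t)((int64_t)a * b) & MASK36)); /* 2^34 = +1.0 in Q2.34 */
        return 0;
    }
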
> > Analog sensors are read as 17-bit integers from 0 to 167772 (so a bit more than 2^17).
>
> Do you really have sensors producing 17 significant bits?
You need to define the question better. What do you consider to be "significant"? We are using a sigma-delta converter that will average the input over 167772 samples, so that is the max value. How many bits are "significant" I can't say yet. We haven't tested the circuit. But the process is doing a lot of averaging, and the "noise" that people talk about is mostly... in the noise when you filter over such a long period. I think issues with nonidealities in the digital output driving the analog element will be the limiting factor rather than noise. We actually don't need a lot of bits for these measurements, but with a 33 MHz clock rate and a 200 Hz sample rate you get 17.3 bits of resolution (33 MHz / 200 Hz = 165,000 samples, and log2(165,000) ≈ 17.3), even if not the accuracy. I have no reason to toss any accuracy in the math.

Some of the calculations will require division, and at one point we were looking at a sensor that required a square root. Turns out a square root in the math messes up the resolution at the low end, kinda like the inverse of u-law compression. It took me forever to convince the people who have spent months designing that flow sensor that it would not work well. Rather than them proving it would work, I had to prove it wouldn't. Sometimes people are idiots.

I might do the division in a lookup table by multiplying by the inverse. The trouble with that is the poor accuracy when the value is large and the inverse is small. But then that is simply the natural result of the division, no? Not sure I will actually have that problem. I need to plan out the calculations better.
-- 

Rick C.

+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
On 3.1.21 11.45, Rick C wrote:

> I remember coding a shift-and-add multiplier once, and that had issues with a signed multiplier. So for negative multipliers I just complemented the multiplier to be a positive number and subtracted all the partial products, or complemented the result, or something similar.
The classical solution to signed multiply in 2's complement arithmetic is Booth's algorithm (Google for it).

-- 

-TV
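
For reference, a small software model of radix-2 Booth recoding (an illustrative sketch; the 18-bit width matches the multiplier in this thread, and real hardware evaluates the recoded partial products in parallel rather than in a loop):

    #include <stdint.h>
    #include <assert.h>

    /* Signed 18x18 -> 36-bit product; a, b must lie in [-2^17, 2^17). */
    static int64_t booth_mul18(int32_t a, int32_t b)
    {
        uint32_t bits = (uint32_t)b & 0x3FFFF; /* 18-bit two's complement pattern */
        int64_t acc = 0;
        int prev = 0;                          /* implicit bit b[-1] = 0 */
        for (int i = 0; i < 18; i++) {
            int cur = (bits >> i) & 1;
            if (cur == 0 && prev == 1) acc += (int64_t)a << i; /* pair 01: run of 1s ends   */
            if (cur == 1 && prev == 0) acc -= (int64_t)a << i; /* pair 10: run of 1s starts */
            prev = cur;
        }
        return acc; /* the sign bit repeats above bit 17, contributing nothing */
    }

    int main(void)
    {
        assert(booth_mul18(-131072, -131072) == (int64_t)131072 * 131072);
        assert(booth_mul18(12345, -678) == (int64_t)12345 * -678);
        assert(booth_mul18(131071, 131071) == (int64_t)131071 * 131071);
        return 0;
    }
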
On Sun, 3 Jan 2021 01:45:49 -0800 (PST), Rick C
<gnuarm.deletethisbit@gmail.com> wrote:

>> If the multiplier is unsigned and the result of the preceding subtraction is negative, this will cause havoc in the multiply result. If the multiplier input is negative, set it to zero.
>
>I won't be feeding any negative numbers into the multiplier, so it doesn't matter, or does it???
With an 18 x 18 bit multiplier, you can multiply, say, 130000 by 130000 and get a correct result on both a signed and an unsigned multiplier. If you try to multiply 140000 by 140000, you will get the correct result with an unsigned multiplier; however, a signed multiplier would give an incorrect result. A signed multiplier would interpret 140000 as a small negative number and produce a small positive result.

Anyway, since you have a 54 bit add/sub, why not do all arithmetic in a 32 or 54 bit internal representation? That way you do not have to worry so much about overflows. With an 18 x 18 bit multiply, you can easily make a 36x36=72 bit multiply with four multiplies, adds and shifts using an unsigned multiply instruction. If the multiplier is signed, you could easily make a 34x34 bit multiply, still sufficient for 32 bit arithmetic.
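
A sketch of that widening trick in C for the unsigned case (the schoolbook split a = ah*2^18 + al; a real build would map the four products onto DSP blocks and merge them in the wide adder):

    #include <stdint.h>
    #include <assert.h>

    /* 36x36 -> 72-bit unsigned multiply from four 18x18 multiplies. */
    static void mul36x36(uint64_t a, uint64_t b, uint64_t *hi36, uint64_t *lo36)
    {
        const uint64_t M18 = 0x3FFFF;         /* 18-bit mask */
        uint64_t al = a & M18, ah = a >> 18;
        uint64_t bl = b & M18, bh = b >> 18;
        uint64_t p0 = al * bl;                /* weight 2^0,  < 2^36 */
        uint64_t p1 = al * bh + ah * bl;      /* weight 2^18, < 2^37 */
        uint64_t p2 = ah * bh;                /* weight 2^36, < 2^36 */
        uint64_t mid = p1 + (p0 >> 18);       /* fold in carry from low product */
        *lo36 = ((mid & M18) << 18) | (p0 & M18);
        *hi36 = p2 + (mid >> 18);
    }

    int main(void)
    {
        uint64_t hi, lo;
        mul36x36(0xFFFFFFFFFULL, 0xFFFFFFFFFULL, &hi, &lo); /* (2^36-1)^2 */
        assert(hi == 0xFFFFFFFFEULL && lo == 1);
        return 0;
    }
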
>Some of the calculations will require division, and at one point we were looking at a sensor that required a square root. Turns out a square root in the math messes up the resolution at the low end, kinda like the inverse of u-law compression.
>
>I might do the division in a lookup table by multiplying by the inverse. The trouble with that is the poor accuracy when the value is large and the inverse is small.
Many RISC processors (e.g. Alpha) don't even have a division instruction, just an inverse instruction. You may have to "normalize" the argument and then "denormalize" the result.

Square root is not bad. You may have to norm/denorm and use a 4th-degree polynomial for 32 bits or a 6th-degree polynomial for 54 bits of accuracy.
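
A sketch of the norm/denorm idea for square root (in C; a plain bit-serial integer root stands in for the polynomial approximation, just to show where the normalization fits):

    #include <stdint.h>
    #include <assert.h>

    /* Bit-serial integer square root: floor(sqrt(x)). */
    static uint32_t isqrt64(uint64_t x)
    {
        uint64_t r = 0, bit = UINT64_C(1) << 62;
        while (bit > x) bit >>= 2;
        while (bit) {
            if (x >= r + bit) { x -= r + bit; r = (r >> 1) + bit; }
            else              r >>= 1;
            bit >>= 2;
        }
        return (uint32_t)r;
    }

    /* Normalize by even shifts into [2^34, 2^36), take the root there,
       and halve the shift on the way out: sqrt(x * 4^k) = sqrt(x) * 2^k.
       A polynomial over the narrow normalized interval could replace
       isqrt64 here. */
    static uint32_t sqrt_normalized(uint64_t x)
    {
        if (x == 0) return 0;
        int k = 0;
        while (x < (UINT64_C(1) << 34)) { x <<= 2; k++; }
        return isqrt64(x) >> k;
    }

    int main(void)
    {
        assert(sqrt_normalized(167772) == 409); /* floor(sqrt(167772)) */
        return 0;
    }
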
On Sunday, January 3, 2021 at 5:18:19 AM UTC-5, Tauno Voipio wrote:
> The classical solution to signed multiply in 2's complement arithmetic is Booth's algorithm (Google for it).
Yes, I am aware of Booth's, but that is not responsive to the issue. I'm not crafting a multiplier out of gates. I'm using one already built into the chip, along with an accumulator, etc.

I think I've got a handle on it now. There are controls ASIGN and BSIGN which I had only seen as controlling a sign extension, which I thought was at the point of the multiplier output feeding the accumulator. But this is at the input to the multiplier, where it doubles the width of the inputs so the resulting product has the correct answer in the 36 lsbs regardless of whether the inputs are signed or not. You just have to set the control inputs to match your data. So I think I'm ok. I just need to verify this in simulation.

-- 

Rick C.

-- Get 1,000 miles of free Supercharging
-- Tesla referral code - https://ts.la/richard11209
On 03/01/2021 10:45, Rick C wrote:

> I might do the division in a lookup table by multiplying by the inverse. The trouble with that is the poor accuracy when the value is large and the inverse is small. But then that is simply the natural result of the division, no? Not sure I will actually have that problem. I need to plan out the calculations better.
"Multiply by reciprocal" is a very common way of doing division. It is particularly popular if you have several numbers to be divided by the same divisor, or if the divisor is a fixed known value. You need some care to get the scaling right, and to get the LSB's precisely correct, but it is entirely doable. A lookup table for reciprocals is going to get impractical quite quickly for large ranges. Certainly it would take up a lot more space in an FPGA than putting in a Cortex M1 or RISC-V soft processor and doing the sane thing - implement your maths in software.
On Sunday, January 3, 2021 at 9:38:11 AM UTC-5, David Brown wrote:
> A lookup table for reciprocals is going to get impractical quite quickly for large ranges. Certainly it would take up a lot more space in an FPGA than putting in a Cortex-M1 or RISC-V soft processor and doing the sane thing - implementing your maths in software.
If you would like to participate, please let me know and I will connect you with the project lead. Then you can discuss the use of the MCU with him.

Not sure why you think the reciprocal would be so cumbersome. The reality is the values required are all relatively close to 1.0, so I think it may be fairly effective. This system is not doing general arbitrary math; it is doing a well defined set of calculations. So restrictions on range are entirely realistic.

-- 

Rick C.

-+ Get 1,000 miles of free Supercharging
-+ Tesla referral code - https://ts.la/richard11209
On 04/01/2021 03:55, Rick C wrote:
> If you would like to participate, please let me know and I will connect you with the project lead. Then you can discuss the use of the MCU with him.
No, thank you. But if you, or anyone else in the project, want to talk about MCUs (or soft processors, which was what I suggested) here, then I'm happy to join in.

From what I have heard of this project (from your posts here and in other groups), it sounds like you may have spent an extraordinary amount of time trying to make things work with an FPGA alone where other devices would have made the job far easier. Obviously I am not privy to the details - I don't know if you have spent months working on things or merely a few hours spread out over months. And I don't know the budget balance between development costs and production costs. Nor do I know how much influence you have in the decision process, nor how fixed the designs are at this stage. All I can say is how /I/ would handle things, or recommend other engineers to handle things, in a similar situation, as well as I can surmise from your posts.

Different technologies have their strengths and weaknesses. There are times when an FPGA is clearly better, times when an MCU is clearly better, times when either will do just as well, and times when one can be made to work even if it is not ideal for the task. It is important for you to know where you stand here.
> Not sure why you think the reciprocal would be so cumbersome. The reality is the values required are all relatively close to 1.0, so I think it may be fairly effective. This system is not doing general arbitrary math; it is doing a well defined set of calculations. So restrictions on range are entirely realistic.
I did not describe the reciprocal method as "cumbersome". You simply need to be careful with the details to get things accurate - it's very easy to be a bit off in the lowest bits of the result, and it's very easy to have bad behaviour in corner cases. And a lookup table might be fine if there are only a small number of possible divisors, but if it is needed for a whole 17-bit range, that's 128K entries of 34 bits each.

A restricted range, as you describe, certainly helps here. It means you don't need the whole 128K, and you can have fewer than 34 bits in each entry while covering the accuracy you need. Only you can figure out how big the table must be, and whether it is affordable or not.

The alternative is to calculate the reciprocal using an iterative process - it can be done with each cycle doubling the number of bits, giving perhaps 5 cycles. For 17-bit numbers, this is probably not worth the bother. After all, a straightforward long division in base 2 will give you the answer in 17 cycles.
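
For completeness, the straightforward base-2 long division looks like this in C (a sketch: one quotient bit per iteration, mapping naturally onto a shift register and the existing subtractor):

    #include <stdint.h>
    #include <assert.h>

    /* Restoring division: n / d with a 'bits'-wide quotient, n < 2^bits. */
    static uint32_t div_restoring(uint32_t n, uint32_t d, int bits)
    {
        uint64_t rem = 0, q = 0;
        for (int i = bits - 1; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* shift in next dividend bit  */
            q <<= 1;
            if (rem >= d) { rem -= d; q |= 1; } /* trial subtraction succeeded */
        }
        return (uint32_t)q;
    }

    int main(void)
    {
        assert(div_restoring(167772, 100, 18) == 1677);
        return 0;
    }
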
I solved the problem of fixed point math being a PITA by changing to floating point, which is easier in some ways even if a bit of a bother in others.  Multiplies and divides only need to be handled between numbers in the range of 1.0 to 1.999...  Turns out the divide ends up being very easy.  A table lookup gets around 10 bits of accuracy, and one iteration of the Newton-Raphson algorithm gives the full 18 bits the basic hardware is capable of, which is more than is needed for the calculations.  Adds/subtracts are a bit more work, requiring denormalization before the sum and renormalization after.  To avoid having to deal with negative numbers, the two addends are ordered so a subtraction does not result in a negative mantissa.
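
A rough C model of that divide scheme (the table size, Q1.17 widths, and names are all illustrative; a real build would precompute the table and take more care with the final rounding):

    #include <stdint.h>
    #include <stdio.h>

    #define ONE 131072  /* 1.0 in Q1.17; mantissas m lie in [ONE, 2*ONE) */

    static uint32_t recip_table[1024];

    static void init_table(void)
    {
        for (int i = 0; i < 1024; i++) {
            double m = 1.0 + (i + 0.5) / 1024.0;     /* interval midpoint */
            recip_table[i] = (uint32_t)(ONE / m + 0.5);
        }
    }

    /* ~10-bit seed from the table, then one Newton-Raphson step
       r' = r*(2 - m*r), which roughly doubles the good bits. */
    static uint32_t recip_q17(uint32_t m)
    {
        uint32_t r = recip_table[(m >> 7) & 0x3FF];
        uint64_t t = ((uint64_t)m * r) >> 17;        /* m*r, close to 1.0 */
        r = (uint32_t)(((uint64_t)r * (2 * ONE - t)) >> 17);
        return r;
    }

    int main(void)
    {
        init_table();
        printf("1/1.5 ~= %f\n", recip_q17(196608) / (double)ONE); /* ~0.666664 */
        return 0;
    }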

Once the details are worked out, floating point is not hard at all and lends itself to an easy solution to the divide problem as well as to the tracking of scale factors in fixed point math.  Someone had suggested that a 32.32 fixed point format would do the job without floating point, but it would require a lot more hardware resources and still not work for every calculation that might be required.  One of the calcs involved squaring a value and then applying a coefficient in the 10^-6 range.  I suppose that could be done by taking the square root of the coefficient, but it's just easier not to have to worry about how many significant bits are left at the end.  With floating point the main worry is the small result from subtracting large numbers, and I will be able to identify those cases ahead of time.

-- 

Rick C.

++ Get 1,000 miles of free Supercharging
++ Tesla referral code - https://ts.la/richard11209
