Reply by Anssi Saari January 20, 2021
Rick C <gnuarm.deletethisbit@gmail.com> writes:

> What sort of hardware was produced for integer operations? When you
> say integer, do you mean literally the integer data type as in VHDL,
> or do you mean an integer interpretation of an array of binary signals
> like signed/unsigned or the equivalent in Verilog?
The input and output data types were std_logic_vector, so unsigned division of bits interpreted as integers. Unfortunately I don't have access to the code any more. This was part of converting some sensor manufacturer's awful reference C code into VHDL, turning raw sensor values into real ones. The C code used 32-bit signed and unsigned integers and while it was long, it was straightforward to do in VHDL.
> If I don't use floating point, I would need to use fixed point. I've
> not worked with the fixed point library in VHDL and didn't want to
> climb the learning curve. So I'm expecting to plan it all manually.
I didn't experience much in the way of learning curve with the sfixed types even though I hadn't used them before. A few minutes browsing through Ashenden. But then that was only for a PID controller which is after all a pretty simple thing and I didn't have to be overly concerned about accuracy.
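[For anyone curious what the sfixed usage looks like in practice, here is a minimal sketch of a single gain term, assuming VHDL-2008's ieee.fixed_pkg (older tools ship it as ieee_proposed.fixed_pkg) and a hypothetical gain format; it is not the poster's actual code.]

library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;  -- VHDL-2008; ieee_proposed.fixed_pkg on older tools

entity p_term is
  port (
    clk   : in  std_logic;
    err   : in  sfixed(23 downto -7);  -- the range mentioned above
    kp    : in  sfixed(3 downto -11);  -- hypothetical gain format
    p_out : out sfixed(23 downto -7)
  );
end entity;

architecture rtl of p_term is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- the multiply widens the result to sfixed(27 downto -18);
      -- resize trims it back to the output format
      p_out <= resize(err * kp, p_out'high, p_out'low);
    end if;
  end process;
end architecture;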
> Float to fixed is pretty straightforward. It's just a shifter to
> align the mantissa according to the value in the exponent and forming
> the 2's complement according to the sign bit. That is one clock cycle
> using the DSP multiplier. Think of it as a denormalize operation in
> prep for an add in floating point. The int to float is similar with
> the setting of the exponent. That's essentially a normalize operation
> and uses the same hardware.
Now that would mean a learning curve for me, although I remember I did a floating point multiplier as a student over 20 years ago. Pity I don't think I have the VHDL code from back then any more.
> Since I require similar operations on multiple data it makes sense to
> have an ALU that can be controlled to process data sequentially rather
> than dedicating logic to various data flows. Was your application
> similar or did you have high data rates that needed dedicated hardware
> for a given data path?
It was similar: there were three sensors which all shared the single divider and other calculations. I was pretty relaxed about the use of HW multipliers since I had plenty. Just not so many I could've had 3x the number used. Data rate was slow, I only read the sensors at something like 2 Hz. No rush to get from the raw data to final data either, but that took maybe 20-30 cycles of 50 ns each per sensor.
Reply by Rick C January 19, 2021
On Tuesday, January 19, 2021 at 3:44:32 AM UTC-5, upsid...@downunder.com wrote:
> On Mon, 18 Jan 2021 13:46:36 -0800 (PST), Rick C
> <gnuarm.del...@gmail.com> wrote:
>
> >Float to fixed is pretty straightforward. It's just a shifter to align
> >the mantissa according to the value in the exponent and forming the 2's
> >complement according to the sign bit. That is one clock cycle using the
> >DSP multiplier.
>
> Yes, the required shift count can be obtained from the exponent.
>
> >Think of it as a denormalize operation in prep for an add in floating
> >point.
>
> The shift count is the difference in exponents.
>
> >The int to float is similar with the setting of the exponent. That's
> >essentially a normalize operation and uses the same hardware.
>
> How do you determine the required shift count?
In this case it's not hard. The shifting is being done by a multiplier, so the logic is a bit less complex than a priority encoder, but similar. The msb of the integer is used to set the multiplier value. For normalizing the integer there is a reverse correspondence in bit position. Once an output bit is selected, the other lsbs are ignored. An older Altera part had a chain of AND gates used for carry that was perfect for this sort of operation with the proper polarity. This logic is required for renormalization after every addition, not just for integer to float conversion.
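[As a rough sketch of the multiplier-as-shifter idea (not Rick's actual design; the 18-bit width and the shift input are assumptions), the normalize step amounts to multiplying by a one-hot power of two, which a DSP block handles in one cycle:]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul_shift is
  port (
    mant  : in  unsigned(17 downto 0);  -- 18-bit mantissa, as in the post
    shift : in  natural range 0 to 17;  -- from the leading-one detector
    normd : out unsigned(17 downto 0)
  );
end entity;

architecture rtl of mul_shift is
  signal pow2 : unsigned(17 downto 0);
  signal prod : unsigned(35 downto 0);
begin
  pow2  <= shift_left(to_unsigned(1, 18), shift);  -- one-hot 2**shift
  prod  <= mant * pow2;                            -- the multiplier does the shift
  normd <= prod(17 downto 0);                      -- low half = mant shifted left
end architecture;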
> One way is to use some dedicated hardware that determines in which bit
> position is the most significant "1" bit and then shift left by that
> amount.
>
> The other method is to shift the integer one position to the left at a
> time into carry. If the carry bit is 1, you have found the (hidden
> bit) normalized value and the process completes. However, if the
> carry bit was "0", start over in a loop by shifting one more bit
> position. The loop must terminate if the value was all zeroes.
>
> Shifting one bit at a time in a loop is time consuming for large
> integers such as 64 bit integers (up to 63 cycles). This can be sped
> up by examining e.g. a byte (8 bits) at a time starting from the left.
> As long as the byte is 0x00, continue looping. When the first non-zero
> byte is detected, find out which is the first bit set by shifting into
> carry. This reduces the cycle count to 7+7=14 cycles.
>
> On an 8 bit processor, large (multiple of 8 bit) shifts can be
> implemented with byte moves and only the last 1-7 positions need to be
> implemented with actual shifts.
>
> If both the integer as well as the float mantissa use 2's complement
> then you need to change the logic slightly for negative values, i.e.
> continue looping as long as the carry bit is "1" or while the byte is
> 0xFF. If the float uses sign + magnitude representation, better to do
> the conversion to unsigned prior to the shifts.
Neither uses 2's complement. The integer values are all positive numbers, and I don't recall seeing a floating point format that uses 2's complement. If you use sign magnitude with a biased exponent between the sign bit and the mantissa, a standard arithmetic subtract on the full word can be used to compare floats. That's too useful to toss out the window by using 2's complement.

-- 

Rick C.

-+- Get 1,000 miles of free Supercharging
-+- Tesla referral code - https://ts.la/richard11209
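[The comparison point is easy to see in code. A sketch, assuming a hypothetical [sign | 6-bit biased exponent | 18-bit mantissa] packing: for non-negative values the packed words order exactly like unsigned integers, so no float-aware comparator is needed.]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity float_gt is
  port (
    a, b : in  std_logic_vector(24 downto 0);  -- sign & exp(6) & mant(18)
    gt   : out std_logic
  );
end entity;

architecture rtl of float_gt is
begin
  -- valid whenever both operands are non-negative, as in the post;
  -- a bigger exponent or, at equal exponents, a bigger mantissa
  -- automatically wins the unsigned compare
  gt <= '1' when unsigned(a) > unsigned(b) else '0';
end architecture;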
Reply by January 19, 2021
On Mon, 18 Jan 2021 13:46:36 -0800 (PST), Rick C
<gnuarm.deletethisbit@gmail.com> wrote:

>Float to fixed is pretty straightforward. It's just a shifter to align the mantissa according to the value in the exponent and forming the 2's complement according to the sign bit. That is one clock cycle using the DSP multiplier.
Yes, the required shift count can be obtained from the exponent.
>Think of it as a denormalize operation in prep for an add in floating point.
The shift count is the difference in exponents.
>The int to float is similar with the setting of the exponent. That's essentially a normalize operation and uses the same hardware.
How do you determine the required shift count?

One way is to use some dedicated hardware that determines in which bit position is the most significant "1" bit and then shift left by that amount.

The other method is to shift the integer one position to the left at a time into carry. If the carry bit is 1, you have found the (hidden bit) normalized value and the process completes. However, if the carry bit was "0", start over in a loop by shifting one more bit position. The loop must terminate if the value was all zeroes.

Shifting one bit at a time in a loop is time consuming for large integers such as 64 bit integers (up to 63 cycles). This can be sped up by examining e.g. a byte (8 bits) at a time starting from the left. As long as the byte is 0x00, continue looping. When the first non-zero byte is detected, find out which is the first bit set by shifting into carry. This reduces the cycle count to 7+7=14 cycles.

On an 8 bit processor, large (multiple of 8 bit) shifts can be implemented with byte moves and only the last 1-7 positions need to be implemented with actual shifts.

If both the integer as well as the float mantissa use 2's complement then you need to change the logic slightly for negative values, i.e. continue looping as long as the carry bit is "1" or while the byte is 0xFF. If the float uses sign + magnitude representation, better to do the conversion to unsigned prior to the shifts.
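[In an HDL the "dedicated hardware" option above is just a priority encoder. A sketch for 32-bit values (the width is an assumption); the loop synthesizes to combinational logic, the hardware counterpart of the byte-then-bit software search:]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package norm_pkg is
  function norm_shift (v : unsigned(31 downto 0)) return natural;
end package;

package body norm_pkg is
  -- returns how far to shift left so the msb becomes '1';
  -- returns 31 for an all-zero input (caller must flag zero separately)
  function norm_shift (v : unsigned(31 downto 0)) return natural is
  begin
    for i in 31 downto 0 loop
      if v(i) = '1' then
        return 31 - i;
      end if;
    end loop;
    return 31;
  end function;
end package body;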
Reply by Rick C January 18, 2021
On Monday, January 18, 2021 at 3:01:03 PM UTC-5, Anssi Saari wrote:
> Rick C <gnuarm.del...@gmail.com> writes:
>
> > Dividing by coding A <= B/C will produce exactly what in hardware using VHDL???
>
> I've only done that with Intel FPGAs, that produced an integer divider
> from Intel's IP library for my integer division needs. Intel provided
> both a pipelined and non-pipelined option for that in their free IP
> library.
>
> For floats I'd guess the produced HW might be large. I did try to make a
> PID controller with floats but that was a little large so I used
> standard fixed point stuff instead, sfixed 23 downto -7 as I recall.
> I only used the float to fixed point converter from the standard
> floating point library and that was already largish but not too bad.
What sort of hardware was produced for integer operations? When you say integer, do you mean literally the integer data type as in VHDL, or do you mean an integer interpretation of an array of binary signals like signed/unsigned or the equivalent in Verilog?

If I don't use floating point, I would need to use fixed point. I've not worked with the fixed point library in VHDL and didn't want to climb the learning curve. So I'm expecting to plan it all manually.

Float to fixed is pretty straightforward. It's just a shifter to align the mantissa according to the value in the exponent and forming the 2's complement according to the sign bit. That is one clock cycle using the DSP multiplier. Think of it as a denormalize operation in prep for an add in floating point. The int to float is similar with the setting of the exponent. That's essentially a normalize operation and uses the same hardware.

Since I require similar operations on multiple data it makes sense to have an ALU that can be controlled to process data sequentially rather than dedicating logic to various data flows. Was your application similar or did you have high data rates that needed dedicated hardware for a given data path?

-- 

Rick C.

--+ Get 1,000 miles of free Supercharging
--+ Tesla referral code - https://ts.la/richard11209
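[A sketch of the float-to-fixed step described above, assuming a hypothetical 1+6+18 format and a shift amount already derived from the biased exponent; on an FPGA the shift_left would map onto the DSP multiplier as described. Widths and ranges are assumptions, not the actual design.]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity float_to_fixed is
  port (
    sign  : in  std_logic;
    shift : in  natural range 0 to 13;  -- exponent minus bias, limited so
                                        -- the magnitude stays below the sign bit
    mant  : in  unsigned(17 downto 0);  -- normalized, msb = 1
    fixd  : out signed(31 downto 0)
  );
end entity;

architecture rtl of float_to_fixed is
  signal mag : unsigned(31 downto 0);
begin
  -- align the mantissa according to the exponent
  mag <= shift_left(resize(mant, 32), shift);
  -- form the 2's complement according to the sign bit
  fixd <= -signed(mag) when sign = '1' else signed(mag);
end architecture;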
Reply by Anssi Saari January 18, 2021
Rick C <gnuarm.deletethisbit@gmail.com> writes:

> Dividing by coding A <= B/C will produce exactly what in hardware using VHDL???
I've only done that with Intel FPGAs, that produced an integer divider from Intel's IP library for my integer division needs. Intel provided both a pipelined and non-pipelined option for that in their free IP library.

For floats I'd guess the produced HW might be large. I did try to make a PID controller with floats but that was a little large so I used standard fixed point stuff instead, sfixed 23 downto -7 as I recall. I only used the float to fixed point converter from the standard floating point library and that was already largish but not too bad.
Reply by Rick C January 18, 2021
On Monday, January 18, 2021 at 2:53:25 AM UTC-5, David Brown wrote:
> On 18/01/2021 07:11, Rick C wrote:
>
> > I solved the problem of fixed point math being a PITA by changing to
> > floating point which is easier in some ways even if a bit of a bother
> > in others.
>
> Floating point makes multiplies and divides easier (ignoring NaNs and
> other complications), but addition and subtraction harder.
But not appreciably so. The basic architecture is a MAC where the multiplier serves dual duty as a multiplier and as a shifter, so normalization and denormalization are trivial. First a test is done to see which operand is to be denormalized, and the denormalization happens in the same cycle as the add/subtract. So three clock cycles compared to two for the multiply. No mistakes in tracking the scaling factors!
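[The denormalize-then-add step looks roughly like this, with hypothetical 6-bit exponents and 18-bit mantissas; renormalizing the sum, which the multiplier handles per the posts above, is not shown. This is a sketch, not the actual datapath.]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fp_align_add is
  port (
    exp_a, exp_b   : in  unsigned(5 downto 0);
    mant_a, mant_b : in  unsigned(17 downto 0);
    exp_s          : out unsigned(5 downto 0);
    mant_s         : out unsigned(18 downto 0)  -- one guard bit for carry out
  );
end entity;

architecture rtl of fp_align_add is
begin
  process (all)  -- VHDL-2008 sensitivity list
    variable d : natural;
  begin
    -- test which operand has the smaller exponent, denormalize it by
    -- the exponent difference, and add, all in the same cycle
    if exp_a >= exp_b then
      d      := to_integer(exp_a - exp_b);
      exp_s  <= exp_a;
      mant_s <= resize(mant_a, 19) + shift_right(resize(mant_b, 19), d);
    else
      d      := to_integer(exp_b - exp_a);
      exp_s  <= exp_b;
      mant_s <= resize(mant_b, 19) + shift_right(resize(mant_a, 19), d);
    end if;
  end process;
end architecture;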
> I still can't help feeling you are over-complicating things by your
> approach, and that this should all be doable by a few lines of code
> (whether in an HDL or in a software HLL), or by ready-made components.
> You are not the first person to do a division in an FPGA.
You seem to be contradicting yourself. Is it just a few lines of code or is it a big hassle? Dividing by coding A <= B/C will produce exactly what in hardware using VHDL???
> But I get the impression that you are constrained by certain
> requirements, whether or not they are logical (I've seen projects
> that ban "software" for "reliability reasons" but are quite happy with
> FPGA code).
Not your problem really. So why dwell on it?
> Anyway, it's nice to hear you've got it under control.
>
> > Multiplies and divides only need to be handled between numbers in
> > the range of 1.0 to 1.999... Turns out the divide ends up being very
> > easy. A table lookup gets around 10 bits of accuracy and one
> > iteration of the Newton-Raphson algorithm gives the full 18 bits the
> > basic hardware is capable of and more than is needed for the
> > calculations. Adds/subtractions are a bit more work, requiring
> > denormalization before the sum and renormalization after. In order
> > to prevent having to deal with negative numbers, the two addends are
> > ordered so a subtraction does not result in a negative mantissa.
> >
> > Once the details are worked out, floating point is not hard at all
> > and lends itself to an easy solution to the divide problem as well
> > as the tracking of scale factor in fixed point math. Someone had
> > suggested that a 32.32 format fixed point would do the job without
> > floating point, but it would require a lot more hardware resources
> > and still not work for every calculation that might be required.
> > One of the calcs involved squaring a value then applying a
> > coefficient with a 10^-6 range. I suppose that could be done by
> > taking the square root of the coefficient, but it's just easier not
> > to have to worry about how many significant bits are left at the
> > end. With floating point the main worry is the small result from
> > subtracting large numbers, and I will be able to identify those
> > ahead of time.
Indeed! I appreciate the support.

-- 

Rick C.

--- Get 1,000 miles of free Supercharging
--- Tesla referral code - https://ts.la/richard11209
Reply by David Brown January 18, 2021
On 18/01/2021 07:11, Rick C wrote:
> I solved the problem of fixed point math being a PITA by changing to
> floating point which is easier in some ways even if a bit of a bother
> in others.
Floating point makes multiplies and divides easier (ignoring NaNs and other complications), but addition and subtraction harder.

I still can't help feeling you are over-complicating things by your approach, and that this should all be doable by a few lines of code (whether in an HDL or in a software HLL), or by ready-made components. You are not the first person to do a division in an FPGA.

But I get the impression that you are constrained by certain requirements, whether or not they are logical (I've seen projects that ban "software" for "reliability reasons" but are quite happy with FPGA code).

Anyway, it's nice to hear you've got it under control.
> Multiplies and divides only need to be handled between numbers in the
> range of 1.0 to 1.999... Turns out the divide ends up being very easy.
> A table lookup gets around 10 bits of accuracy and one iteration of
> the Newton-Raphson algorithm gives the full 18 bits the basic hardware
> is capable of and more than is needed for the calculations.
> Adds/subtractions are a bit more work, requiring denormalization
> before the sum and renormalization after. In order to prevent having
> to deal with negative numbers, the two addends are ordered so a
> subtraction does not result in a negative mantissa.
>
> Once the details are worked out, floating point is not hard at all and
> lends itself to an easy solution to the divide problem as well as the
> tracking of scale factor in fixed point math. Someone had suggested
> that a 32.32 format fixed point would do the job without floating
> point, but it would require a lot more hardware resources and still
> not work for every calculation that might be required. One of the
> calcs involved squaring a value then applying a coefficient with a
> 10^-6 range. I suppose that could be done by taking the square root
> of the coefficient, but it's just easier not to have to worry about
> how many significant bits are left at the end. With floating point
> the main worry is the small result from subtracting large numbers,
> and I will be able to identify those ahead of time.
Reply by Rick C January 18, 2021
I solved the problem of fixed point math being a PITA by changing to floating point, which is easier in some ways even if a bit of a bother in others. Multiplies and divides only need to be handled between numbers in the range of 1.0 to 1.999... Turns out the divide ends up being very easy. A table lookup gets around 10 bits of accuracy and one iteration of the Newton-Raphson algorithm gives the full 18 bits the basic hardware is capable of and more than is needed for the calculations. Adds/subtractions are a bit more work, requiring denormalization before the sum and renormalization after. In order to prevent having to deal with negative numbers, the two addends are ordered so a subtraction does not result in a negative mantissa.
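[For reference, the refinement step being described is x1 = x0*(2 - d*x0), which roughly doubles the seed's accuracy per iteration. A sketch in 2.16 unsigned fixed point, with the ~10-bit seed assumed to come from the lookup table; widths, formats, and the single-cycle structure are assumptions, not the actual design.]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity nr_recip is
  port (
    clk : in  std_logic;
    d   : in  unsigned(17 downto 0);  -- 2.16 format, 1.0 <= d < 2.0
    x0  : in  unsigned(17 downto 0);  -- ~10-bit seed from the table, 2.16
    x1  : out unsigned(17 downto 0)   -- refined reciprocal, 2.16
  );
end entity;

architecture rtl of nr_recip is
begin
  process (clk)
    constant TWO     : unsigned(17 downto 0) := to_unsigned(2 * 2**16, 18);
    variable p1, p2  : unsigned(35 downto 0);
    variable dx0     : unsigned(17 downto 0);
  begin
    if rising_edge(clk) then
      -- d * x0 in 2.16: take bits 33..16 of the 4.32 product
      p1  := d * x0;
      dx0 := p1(33 downto 16);
      -- x0 * (2 - d*x0); a real design would pipeline the two multiplies
      p2 := x0 * (TWO - dx0);
      x1 <= p2(33 downto 16);
    end if;
  end process;
end architecture;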

Once the details are worked out, floating point is not hard at all and lends itself to an easy solution to the divide problem as well as the tracking of scale factor in fixed point math. Someone had suggested that a 32.32 format fixed point would do the job without floating point, but it would require a lot more hardware resources and still not work for every calculation that might be required. One of the calcs involved squaring a value then applying a coefficient with a 10^-6 range. I suppose that could be done by taking the square root of the coefficient, but it's just easier not to have to worry about how many significant bits are left at the end. With floating point the main worry is the small result from subtracting large numbers, and I will be able to identify those ahead of time.

-- 

Rick C.

++ Get 1,000 miles of free Supercharging
++ Tesla referral code - https://ts.la/richard11209
Reply by David Brown January 4, 2021
On 04/01/2021 03:55, Rick C wrote:
> On Sunday, January 3, 2021 at 9:38:11 AM UTC-5, David Brown wrote:
>> On 03/01/2021 10:45, Rick C wrote:
>>
>>> I might do the division in a lookup table by multiplying by the
>>> inverse. The trouble with that is the poor accuracy when the
>>> value is large and the inverse is small. But then that is simply
>>> the natural result of the division, no? Not sure I will actually
>>> have that problem. I need to plan out the calculations better.
>>>
>> "Multiply by reciprocal" is a very common way of doing division. It
>> is particularly popular if you have several numbers to be divided
>> by the same divisor, or if the divisor is a fixed known value. You
>> need some care to get the scaling right, and to get the LSB's
>> precisely correct, but it is entirely doable.
>>
>> A lookup table for reciprocals is going to get impractical quite
>> quickly for large ranges. Certainly it would take up a lot more
>> space in an FPGA than putting in a Cortex M1 or RISC-V soft
>> processor and doing the sane thing - implement your maths in
>> software.
>
> If you would like to participate, please let me know and I will
> connect you with the project lead. Then you can discuss the use of
> the MCU with him.
No, thank you. But if you, or anyone else in the project, want to talk about MCU's (or soft processors, which was what I suggested) here, then I'm happy to join in.

From what I have heard of this project (from your posts here and in other groups), it sounds like you may have spent an extraordinary amount of time trying to make things work with an FPGA alone where other devices would have made the job far easier. Obviously I am not privy to the details - I don't know if you have spent months working on things or merely a few hours spread out over months. And I don't know the budget balances between development costs and production costs. Nor do I know how much influence you have in the decision processes, nor how fixed the designs are at this stage. All I can say is how /I/ would handle things, or recommend other engineers to handle things, given a similar situation as well as I can surmise from your posts.

Different technologies have their strengths and weaknesses. There are times when an FPGA is clearly better, times when an MCU is clearly better, times when either will do just as well, and times where one can be made to work even if it is not ideal for the task. It is important for you to know where you stand here.
> Not sure why you think the reciprocal would be so cumbersome. The
> reality is the values required are all relatively close to 1.0, so I
> think it may be fairly effective. This system is not doing general
> arbitrary math, it is doing a well defined set of calculations. So
> restrictions on range are entirely realistic.
I did not describe the reciprocal method as "cumbersome". You simply need to be careful with the details to get things accurate - it's very easy to be a bit off in the lowest bits of the result, and it's very easy to have bad behaviour in corner cases. And a lookup table might be fine if there are only a small number of possible divisors, but if it is needed for a whole 17-bit range, that's 128K entries of 34 bits each.

A restricted range, as you describe, certainly helps here. It means you don't need the whole 128K, and you can have less than 34 bits in each entry while covering the accuracy you need. Only you can figure out how big the table must be, and whether it is affordable or not.

The alternative is to calculate the reciprocal using an iterative process - it can be done with each cycle doubling the number of bits, giving perhaps 5 cycles. For 17 bit numbers, this is probably not worth the bother. After all, a straight-forward long-division in base 2 will give you the answer in 17 cycles.
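[For completeness, the base-2 long division mentioned here is only a handful of lines of VHDL - one quotient bit per clock, so 17 cycles for 17-bit operands. A minimal restoring-divider sketch (generic width, no start/done handshake, divide-by-zero unchecked):]

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity serial_div is
  generic (W : positive := 17);
  port (
    clk  : in  std_logic;
    load : in  std_logic;
    num  : in  unsigned(W-1 downto 0);
    den  : in  unsigned(W-1 downto 0);
    quo  : out unsigned(W-1 downto 0)
  );
end entity;

architecture rtl of serial_div is
  signal r   : unsigned(W downto 0);    -- partial remainder, one spare bit
  signal q   : unsigned(W-1 downto 0);  -- quotient, filled msb first
  signal n   : unsigned(W-1 downto 0);  -- numerator bits left to consume
  signal d_r : unsigned(W-1 downto 0);  -- registered divisor
begin
  process (clk)
    variable t : unsigned(W downto 0);
  begin
    if rising_edge(clk) then
      if load = '1' then
        r   <= (others => '0');
        n   <= num;
        d_r <= den;
      else
        -- bring the next numerator bit into the partial remainder
        t := r(W-1 downto 0) & n(W-1);
        n <= n(W-2 downto 0) & '0';
        -- the conditional subtract decides the quotient bit
        if t >= resize(d_r, W+1) then
          t := t - resize(d_r, W+1);
          q <= q(W-2 downto 0) & '1';
        else
          q <= q(W-2 downto 0) & '0';
        end if;
        r <= t;
      end if;
    end if;
  end process;
  quo <= q;  -- valid W clocks after load
end architecture;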
Reply by Rick C January 3, 2021
On Sunday, January 3, 2021 at 9:38:11 AM UTC-5, David Brown wrote:
> On 03/01/2021 10:45, Rick C wrote:
>
> > I might do the division in a lookup table by multiplying by the
> > inverse. The trouble with that is the poor accuracy when the value
> > is large and the inverse is small. But then that is simply the
> > natural result of the division, no? Not sure I will actually have
> > that problem. I need to plan out the calculations better.
>
> "Multiply by reciprocal" is a very common way of doing division. It is
> particularly popular if you have several numbers to be divided by the
> same divisor, or if the divisor is a fixed known value. You need some
> care to get the scaling right, and to get the LSB's precisely correct,
> but it is entirely doable.
>
> A lookup table for reciprocals is going to get impractical quite quickly
> for large ranges. Certainly it would take up a lot more space in an
> FPGA than putting in a Cortex M1 or RISC-V soft processor and doing the
> sane thing - implement your maths in software.
If you would like to participate, please let me know and I will connect you with the project lead. Then you can discuss the use of the MCU with him.

Not sure why you think the reciprocal would be so cumbersome. The reality is the values required are all relatively close to 1.0, so I think it may be fairly effective. This system is not doing general arbitrary math, it is doing a well defined set of calculations. So restrictions on range are entirely realistic.

-- 

Rick C.

-+ Get 1,000 miles of free Supercharging
-+ Tesla referral code - https://ts.la/richard11209