EmbeddedRelated.com

Integer/Fixedpoint to 32 bit float

Started by Vincent vB April 6, 2016
On 6-4-2016 at 19:33, Robert Wessel wrote:
> Assuming you're starting with an IEEE float containing an integer
> -32768..32767, you can just subtract 14 from the exponent, *but* you
> have to handle zero as a special case.
>
> OTOH, that sounds like extra work, just convert it correctly in the
> first place. Starting with a 16 bit signed integer:
>
> 1. handle zero as a special case
> 2. consider handling -32768 as a special case
> 3. remove (and save) sign (IOW, take the absolute value)
> 4. count the number of leading zeros (lz)
> 4a. the result would be 1-15, 0-15 if you didn't special case -32768
> 5. put the result in a 32 bit unsigned integer
> 6. shift left (lz+8) places, so the leading 1 lands in bit 23
> 7. set the 8 exponent bits (30-23) to (127+(15-lz))
> 7a. note that this overlays one bit from step 6
> 7b. approximately:
>       ui32 &= 0x007fffff;
>       ui32 |= (127+(15-lz)) << 23;
> 8. set the sign bit (31) as appropriate
>
> Modulo bugs (I did the above from memory, but it should be close),
> that should then have a single precision float in the 32 bit unsigned
> integer, cast as needed. If you want to convert to a double, it's
> basically the same procedure, but some of the constants change.
I'll start with a 16 bit signed integer. Thanks for your step-by-step explanation!
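[Editor's note] To make the recipe concrete, here is a minimal C sketch of the steps above. The function name `i16_to_float` is chosen for illustration, not from the post; note that -32768 needs no special case in this formulation, because the magnitude is computed in 32 bits.

```c
#include <stdint.h>
#include <string.h>

/* Hand-convert a 16-bit signed integer to an IEEE-754 single-precision
   float, following the step-by-step recipe above. */
static float i16_to_float(int16_t val)
{
    if (val == 0)                       /* step 1: zero has no leading 1 bit */
        return 0.0f;

    uint32_t sign = (val < 0) ? 0x80000000u : 0u;
    /* steps 2-3: take the magnitude in 32 bits, so -32768 is no problem */
    uint32_t mag  = (val < 0) ? (uint32_t)(-(int32_t)val) : (uint32_t)val;

    int lz = 0;                         /* step 4: leading zeros in 16 bits */
    while (!(mag & 0x8000u)) {
        mag <<= 1;
        lz++;
    }
    /* the original top bit was at position (15 - lz) */

    uint32_t bits = mag << 8;           /* step 6: leading 1 now at bit 23 */
    bits &= 0x007fffffu;                /* step 7a: drop the implicit 1 */
    bits |= (uint32_t)(127 + (15 - lz)) << 23;  /* step 7: exponent, bits 30-23 */
    bits |= sign;                       /* step 8 */

    float f;
    memcpy(&f, &bits, sizeof f);        /* reinterpret the bit pattern */
    return f;
}
```

Any value in -32768..32767 is exactly representable in a float (the mantissa has 24 bits), so the result matches a plain `(float)val` cast bit-for-bit.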
On 7-4-2016 at 1:23, Clifford Heath wrote:
> gcc has "int __builtin_clz (unsigned int x)", also long and long-long
> versions. These map to whatever is most efficient for your hardware.
>
> It's a pity that there's no integer equivalent of ldexp; maybe called
> ldiexp.
>
> To the OP: If your endian-ness and compiler bit-fields work out, you
> can use this (works for me on x64 with gcc) for building and breaking
> float values.
>
>     typedef union {
>         float f;
>         struct {
>             uint32_t mantissa:23;
>             uint32_t exponent:8;
>             uint32_t sign:1;
>         };
>     } FloatU;
>
> Note that building a floating point value like this is likely to
> be slower than just saying "(float)l" - with any decent compiler.
> But it will help you understand what's going on.
>
> Clifford Heath.
Well, it's not really a microcontroller. It is a LatticeMico32 on a Xilinx FPGA. I think it's a big-endian processor, so this may work out. I've tried doing the (float)l, but then the LM32 compiler really attempts to convert the integer into a float. Horrible tricks like this would work (except for the compiler screaming 'murder and fire', as we say in Dutch):

    uint32_t l = ...;
    float f;
    f = *(float *)&l;
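[Editor's note] The standards-blessed way to reinterpret bits without the compiler "screaming murder and fire" is memcpy, which gcc and clang compile down to a plain register move. This is a generic sketch, not code from the thread:

```c
#include <stdint.h>
#include <string.h>

/* Aliasing-safe type pun: memcpy between same-sized objects is defined
   behaviour, unlike dereferencing a cast pointer. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

This assumes only that `float` is the 32-bit IEEE-754 single format, which holds on LM32 and on practically every current platform.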
On 06/04/16 15:17, Vincent vB wrote:
> Currently I'm in the process of replacing a custom compass /
> accelerometer with an ST LSM303D. The 'old' custom device produced
> single precision floats. Without parsing, the values were just
> passed inside a UDP packet.
>
> Unfortunately the LSM303D produces 16 bit signed integers. So, the
> embedded system needs to convert these values to floats. The scaling
> itself is quite simple: values -32768..32767 need to be scaled to [-2,2).
>
> Now, my hardware has no floating point support. However doing the
> following:
>
>     float output = ( (float)input ) / 16384.0f;
>
> will require quite a bit of FP magic. I would imagine that this:
>
>     const float scale = 1.0f / 16384.0f;
>     float output = ( (float)input ) * scale;
>
> may be faster, but still requires FP multiply support.
>
> Is there a simple and fast way which I can use to convert these integers
> to floats without the aid of an FP library? I have not found much code
> in this respect.
>
> Vincent
You handle this by writing:

    float output = ((float) input) * 2.0f / 32768;

Then you let the /compiler/ generate code that works. Ignore everyone here who has suggested "count leading ones", "handle 0 as a special case", "subtract 14 from the exponent", etc. That is not your job - other people have already figured out this stuff long ago, and the bugs have been ironed out.

The LatticeMico32 compiler is gcc. Use the "-ffast-math" option to tell it that you are happy with a bit of obvious code re-arrangement rather than insisting on perfect IEEE operation - this lets you write "* 2.0f / 32768" to clearly express your intent in the code, while the /compiler/ turns it into "* (1.0f/32768.0f)".

Write your code clearly and correctly, and let the tools do the work. Then all you need to do is make sure that you give the tools the best chance to generate fast code (such as -O2 -ffast-math, and whatever LM32 flags such as -mbarrel-shift-enabled and -mmultiply-enabled are appropriate for your particular cpu).
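[Editor's note] Wrapped as a function (the name `scale_sample` is illustrative), the suggestion looks like this; with -ffast-math the compiler is free to fold the two constants into a single multiply by 1/16384:

```c
#include <stdint.h>

/* Convert a raw 16-bit accelerometer sample to the [-2, 2) range,
   written for clarity and left to the compiler to optimise. */
static float scale_sample(int16_t input)
{
    return ((float)input) * 2.0f / 32768;
}
```

All inputs are exact in single precision, so the endpoints come out exactly: -32768 maps to -2.0f, 16384 to 1.0f, and 32767 stays strictly below 2.0f.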
On 07.4.2016 г. 11:21, David Brown wrote:
> On 06/04/16 15:17, Vincent vB wrote:
>> [...]
>> Is there a simple and fast way which I can use to convert these integers
>> to floats without the aid of an FP library? I have not found much code
>> in this respect.
>
> You handle this by writing:
>
>     float output = ((float) input) * 2.0f / 32768;
>
> Then you let the /compiler/ generate code that works.
> [...]
> Write your code clearly and correctly, and let the tools do the work.
Are you saying this will work without the compiler bringing in an FP library?

Dimiter
On 07/04/16 10:28, Dimiter_Popoff wrote:
> On 07.4.2016 г. 11:21, David Brown wrote:
>> [...]
>> Write your code clearly and correctly, and let the tools do the work.
>>
>> Then all you need to do is make sure that you give the tools the best
>> chance to generate fast code (such as -O2 -ffast-math, and whatever LM32
>> flags such as -mbarrel-shift-enabled and -mmultiply-enabled are
>> appropriate for your particular cpu).
>
> Are you saying this will work without the compiler bringing in
> an FP library?
Almost certainly the compiler will bring in parts of its FP library. But as long as you have /any/ floating point in the code, that is usually the case anyway. And assuming your library is constructed reasonably, you will only get the required functions linked in, not the entire library.

Of course it would be possible to write a dedicated and optimised function to handle this conversion from integer to floating point, combined with scaling. But you would be doing an enormous amount of work in order to save a few KB of code space and/or a few microseconds of run time - not to mention the significant effort in testing, the risk of the code having bugs or portability issues, and the maintenance effort when the scale factors change.

Therefore my advice is to write the code simply, cleanly and in an obviously correct manner. Understand your tools and how to help them generate optimal code. And then get on with the rest of the project, having handled this task in a few minutes rather than days.
On 07.4.2016 г. 11:41, David Brown wrote:
> On 07/04/16 10:28, Dimiter_Popoff wrote:
>> [...]
>> Are you saying this will work without the compiler bringing in
>> an FP library?
>
> Almost certainly the compiler will bring in parts of its FP library.
The question you replied to was how to do the conversion _without_ bringing in an FP library.

Dimiter
On 6-4-2016 at 19:33, Robert Wessel wrote:
> On Wed, 6 Apr 2016 14:40:49 +0000 (UTC), Grant Edwards
> <invalid@invalid.invalid> wrote:
>
> [step-by-step conversion procedure quoted earlier in the thread]
I finally came to this code:

===
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

int clz(uint32_t val)
{
    int t = 0;
    if ((val & 0xFFFF0000) == 0) t += 16; else val >>= 16;
    if ((val & 0x0000FF00) == 0) t += 8;  else val >>= 8;
    if ((val & 0x000000F0) == 0) t += 4;  else val >>= 4;
    if ((val & 0x0000000C) == 0) t += 2;  else val >>= 2;
    if ((val & 0x00000002) == 0) t += 1;
    return t;
}

static inline float castU32ToFloat(uint32_t f)
{
    void *v = &f;
    return *((float *)v);
}

float fltFromI16(int16_t val, int fBits)
{
    bool sign;
    uint32_t ival;
    int zeros;

    if (val == 0)
        return 0.0f;

    if (val < 0) {
        ival = -val;
        sign = true;
    } else {
        ival = val;
        sign = false;
    }

    zeros = clz(ival) - 16;
    ival = ival << (zeros + 8) & 0x007fffff;
    ival |= (142 - (zeros + fBits)) << 23;
    if (sign)
        ival |= 0x80000000;

    return castU32ToFloat(ival);
}

int main(int argc, char **argv)
{
    for (int i = -32768; i < 32767; i += 1) {
        float b = fltFromI16(i, 14);
        float f = ((float)i) / 16384;
        if (b != f)
            printf("Value : %f != %f\n", f, b);
    }
}
===

It may not be optimal, but it seems to produce the correct results.

Vincent
There is an error in this line:

> ival = ival << (zeros + 8) & 0x007fffff;

It should be:

> ival = (ival << (zeros + 8)) & 0x007fffff;

Sorry


On 07/04/16 10:47, Dimiter_Popoff wrote:

> The question you replied to was how to do the conversion _without_
> bringing in an FP library.

The OP said that his cpu had no hardware floating point, and then he said he wanted to do the conversion "without the aid of an FP library". And yes, my recommendation uses floating point library code (technically it is part of the compiler language support library, rather than being part of the standard C library or other library, but it is still library code).

I should really have first asked the OP exactly why he requires the code without using the compiler's library. Usually when people say they don't want an FP library, it is because they have a fixed idea that software FP is always big and slow - but they have not properly considered or tested whether it is /too/ big or /too/ slow for the job, nor thought enough about the complications (size, time, development effort and risk) of alternatives. Of course, it may be that the OP /has/ worked through this and concluded that even the small amount of library code needed for the conversion is too large.
On 7-4-2016 at 10:47, Dimiter_Popoff wrote:
> On 07.4.2016 г. 11:41, David Brown wrote:
>> [...]
>> Almost certainly the compiler will bring in parts of its FP library.
>
> The question you replied to was how to do the conversion _without_
> bringing in an FP library.
>
> Dimiter
I wrote a test, creating the fltFromI16 using floats, as suggested by expert David Brown. I think I wrote my code clearly and correctly and let the /compiler/ do the work, with the following additional objects as result:

    /libgcc.a(_mul_sf.o)
    /libgcc.a(_div_sf.o)
    /libgcc.a(_si_to_sf.o)
    /libgcc.a(_thenan_sf.o)
    /libgcc.a(_muldi3.o)
    /libgcc.a(_lshrdi3.o)
    /libgcc.a(_clzsi2.o)
    /libgcc.a(_pack_sf.o)
    /libgcc.a(_unpack_sf.o)
    /libgcc.a(_mulsi3.o)
    /libgcc.a(_udivmodsi4.o)
    /libgcc.a(_clz.o)

Used GCC flags that matter: -mbarrel-shift-enabled -mmultiply-enabled -msign-extend-enabled -Os -ffast-math

'The tools' also required 2736 more bytes for the same task than my highly flawed and inferior code.

Vincent