EmbeddedRelated.com
Forums
The 2024 Embedded Online Conference

How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
Hello everyone,

I am working with a Nordic nRF52 SoC.  The CPU subsystem of the nRF52 is an ARM Cortex-M4.  My development system is a Linux box running GCC.  My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by Nordic.

I have written a fair amount of code for this device, but now I have to explore a bit deeper.  I have an algorithm that may require me to calculate a lot of square roots, and quickly.  However, the numbers in question are all small integers.

Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation.  Optimized algorithms for various math functions exist.  A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:

http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots

I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, and then call sqrt() with an int32 argument.  It stands to reason that optimized code, like that shown in the blog above, is what should be used.  How can I determine this?  

I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.  The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h.  I am looking for source code.

Thanks for any advice you can share.

On 05/01/17 10:27, jladasky@itu.edu wrote:
> I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, > and then call sqrt() with an int32 argument.
sqrt() takes a double-precision floating point number, so your int32 will be promoted to double before sqrt is called, and you'll get a double back.
On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote:
> On 05/01/17 10:27, j...@itu.edu wrote: > > I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, > > and then call sqrt() with an int32 argument. > > sqrt() takes a double-precision floating point number, so > your int32 will be promoted to double before sqrt is called, > and you'll get a double back.
Thanks for your reply, Clifford. That's actually not an encouraging answer though! If that is the case, I think I will write the square root function that was shown in Listing 4 of Jack Crenshaw's blog. The floating-point function will spend time calculating digits of precision that I do not need for my application.
On 05/01/17 00:27, jladasky@itu.edu wrote:
> Hello everyone, > > I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 > is an ARM Cortex-M4. My development system is a Linux box running > GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, > as recommended by Nordic. > > I have written a fair amount of code for this device, but now I have > to explore a bit deeper. I have an algorithm that may require me to > calculate a lot of square roots, and quickly. However, the numbers > in question are all small integers. > > Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get > involved in this square root calculation. Optimized algorithms for > various math functions exist. A nice discussion of the optimization > of the integer square root algorithm, written by Dr. Jack Crenshaw, > is found here: > > http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots > > I am interested in understanding exactly what code will be compiled > if I #include <math.h> in my program, and then call sqrt() with an > int32 argument. It stands to reason that optimized code, like that > shown in the blog above, is what should be used. How can I determine > this?
No, it does not "stand to reason" at all. The C standard library header <math.h> goes along with an implementation of the C standard library function for sqrt() - this will do the calculations in double-precision floating point, as required by the C standards, and giving the answer to full IEEE accuracy. There are a lot of ways you can calculate square roots, with various algorithms, lookup tables, etc., providing different balances between accuracy, run-time, and code space. There is no one "perfect" answer. And don't dismiss the floating point unit in the M4 too quickly - for some sorts of calculations, it may be faster than using integer arithmetic, even though you are using integer operands at the start.
> > I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for > clues. The only math-related file that I can find there is called > tgmath.h, and it just #defines aliases which point back to math.h. I > am looking for source code. > > Thanks for any advice you can share. >
Den torsdag den 5. januar 2017 kl. 01.17.19 UTC+1 skrev jlad...@itu.edu:
> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote: > > On 05/01/17 10:27, j...@itu.edu wrote: > > > I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, > > > and then call sqrt() with an int32 argument. > > > > sqrt() takes a double-precision floating point number, so > > your int32 will be promoted to double before sqrt is called, > > and you'll get a double back. > > Thanks for your reply, Clifford. That's actually not an encouraging answer though! If that is the case, I think I will write the square root function that was shown in Listing 4 of Jack Crenshaw's blog. The floating-point function will spend time calculating digits of precision that I do not need for my application.
the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that is probably hard to beat
jladasky@itu.edu wrote:
> Hello everyone, > > I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by Nordic. > > I have written a fair amount of code for this device, but now I have to explore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers. > > Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math functions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here: > > http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
Actually, this method is good if you do not have hardware multiplication, or possibly if code size is more important than speed. However, otherwise method based on table lookup for initial approximation and Newton iteration is likely to be faster.
> I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog above, is what should be used. How can I determine this?
Use something like: /mnt/m1/pom/usr/bin/arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -march=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c (replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc by path to your gcc and use your comiler options). In file 'your_file.s' you will find resulting assembly code. Using gcc-5.3 I get: bl __aeabi_i2d vmov d0, r0, r1 bl sqrt vmov r0, r1, d0 bl __aeabi_d2iz AFAICS this is: convertion from integer to double, moving argument to flating point register, calling sqrt from C library, moving argument back to integer registers, convertion from double to integer. What sqrt from library will do depends on C library, I do not use any...
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.
'sqrt' is in C library. Some frequently used library functions are buit into gcc, but apparenty 'sqrt' is not one of them. So do not look inside gcc. Instad look for C library. It seems that newlib uses implementation which converts back to integers and is doing most calculations on integers. Seem highly suboptimal. -- Waldek Hebisch
AT Thursday 05 January 2017 08:17, jladasky@itu.edu wrote:

> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote: >> On 05/01/17 10:27, j...@itu.edu wrote: >> > I am interested in understanding exactly what code will be compiled if >> > I #include <math.h> in my program, and then call sqrt() with an int32 >> > argument. >> >> sqrt() takes a double-precision floating point number, so >> your int32 will be promoted to double before sqrt is called, >> and you'll get a double back. > > Thanks for your reply, Clifford. That's actually not an encouraging > answer though! If that is the case, I think I will write the square root > function that was shown in Listing 4 of Jack Crenshaw's blog. The > floating-point function will spend time calculating digits of precision > that I do not need for my application.
One of the rules of optimization is: Only do it after you have numbers to compare. You are jumping to conclusions without any data. First benchmark and measure, then optimize, measure again and compare. -- Reinhardt
On 1/4/2017 8:22 PM, Reinhardt Behm wrote:
> AT Thursday 05 January 2017 08:17, jladasky@itu.edu wrote: > >> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote: >>> On 05/01/17 10:27, j...@itu.edu wrote: >>>> I am interested in understanding exactly what code will be compiled if >>>> I #include <math.h> in my program, and then call sqrt() with an int32 >>>> argument. >>> >>> sqrt() takes a double-precision floating point number, so >>> your int32 will be promoted to double before sqrt is called, >>> and you'll get a double back. >> >> Thanks for your reply, Clifford. That's actually not an encouraging >> answer though! If that is the case, I think I will write the square root >> function that was shown in Listing 4 of Jack Crenshaw's blog. The >> floating-point function will spend time calculating digits of precision >> that I do not need for my application. > > One of the rules of optimization is: Only do it after you have numbers to > compare. You are jumping to conclusions without any data. First benchmark > and measure, then optimize, measure again and compare.
The first rule of optimizing is, "don't" until you find you clearly need it. -- Rick C
On Wed, 04 Jan 2017 15:27:10 -0800, jladasky wrote:

> Hello everyone, > > I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is > an ARM Cortex-M4. My development system is a Linux box running GCC. My > embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as > recommended by Nordic. > > I have written a fair amount of code for this device, but now I have to > explore a bit deeper. I have an algorithm that may require me to > calculate a lot of square roots, and quickly. However, the numbers in > question are all small integers. > > Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get > involved in this square root calculation. Optimized algorithms for > various math functions exist. A nice discussion of the optimization of > the integer square root algorithm, written by Dr. Jack Crenshaw, is > found here: > > http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/
Integer-Square-Roots
> > I am interested in understanding exactly what code will be compiled if I > #include <math.h> in my program, and then call sqrt() with an int32 > argument. It stands to reason that optimized code, like that shown in > the blog above, is what should be used. How can I determine this? > > I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. > The only math-related file that I can find there is called tgmath.h, > and it just #defines aliases which point back to math.h. I am looking > for source code. > > Thanks for any advice you can share.
The FPU hardware is almost certainly faster than anything you could do in software. Unless the largest-value integer you need to take the square root of is small enough that you can just implement a look-up table. Benchmark, and see. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
On 05/01/17 01:28, lasselangwadtchristensen@gmail.com wrote:
> Den torsdag den 5. januar 2017 kl. 01.17.19 UTC+1 skrev > jlad...@itu.edu: >> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath >> wrote: >>> On 05/01/17 10:27, j...@itu.edu wrote: >>>> I am interested in understanding exactly what code will be >>>> compiled if I #include <math.h> in my program, and then call >>>> sqrt() with an int32 argument. >>> >>> sqrt() takes a double-precision floating point number, so your >>> int32 will be promoted to double before sqrt is called, and >>> you'll get a double back. >> >> Thanks for your reply, Clifford. That's actually not an >> encouraging answer though! If that is the case, I think I will >> write the square root function that was shown in Listing 4 of Jack >> Crenshaw's blog. The floating-point function will spend time >> calculating digits of precision that I do not need for my >> application. > > > the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough > that is probably hard to beat >
As long as "-ffast-math" and at least -O1 are enabled, as well as appropriate cpu flags, the "sqrtf" function should generate a VSQRT.F32 automatically. Note that "sqrtf" is not the same as "sqrt".

The 2024 Embedded Online Conference