
How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
On 05/01/17 02:22, Reinhardt Behm wrote:
> AT Thursday 05 January 2017 08:17, jladasky@itu.edu wrote:
>
>> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote:
>>> On 05/01/17 10:27, j...@itu.edu wrote:
>>>> I am interested in understanding exactly what code will be compiled if
>>>> I #include <math.h> in my program, and then call sqrt() with an int32
>>>> argument.
>>>
>>> sqrt() takes a double-precision floating point number, so
>>> your int32 will be promoted to double before sqrt is called,
>>> and you'll get a double back.
>>
>> Thanks for your reply, Clifford. That's actually not an encouraging
>> answer though! If that is the case, I think I will write the square root
>> function that was shown in Listing 4 of Jack Crenshaw's blog. The
>> floating-point function will spend time calculating digits of precision
>> that I do not need for my application.
>
> One of the rules of optimization is: Only do it after you have numbers to
> compare. You are jumping to conclusions without any data. First benchmark
> and measure, then optimize, measure again and compare.
Rules about optimisation are often misunderstood - perhaps because "optimisation" is such a general term, used for everything from enabling a compiler switch, to algorithmic changes, to fine-tuning hand-written assembly routines.

The first rule of optimising is from Knuth - "Premature optimisation is the root of all evil." The emphasis is on /premature/. That encompasses not putting effort into optimisation that is not needed - but equally, it allows for optimisation when you know that it /is/ needed.

Measurement is useful - invaluable, even - when you are comparing different solutions, finding bottlenecks, or identifying which changes had which effects. But it is not necessary when you can get a good enough estimate of the scales involved. If you want to plan a route across a country, you might want to calculate carefully the times for driving, trains, or taking a plane - but you can already rule out walking from the start as a poor "algorithm", and also rule out rockets and submarines for various reasons. You don't need any sort of measurements to start with.

The OP here says specifically that he needs to do lots of square roots, and needs them fast, with small integers. Until I know otherwise, I assume he has a reasonable idea of what he is doing, and can have a reasonable estimate that a software-only double-precision IEEE accuracy full range square root function is going to be too slow. It makes sense to think about this from the start, not after benchmarking and measurement - it could easily be a make-or-break issue for the algorithm and his whole project. I assume that the OP /does/ have useful data, such as the number of square roots per second and the clock speed of the device, even if he does not have /all/ the relevant data.

There are also plenty of optimisations that are very "cheap" to do - and in general, should /always/ be in place. Basic compiler optimisation - at least "-O1", but usually "-O2" or "-Os", should be standard. It costs nothing, gives better code (assembly code with "-O1" is usually a lot easier to understand than code from "-O0"), and enables better static warning and error checking. If you are using floating point, "-ffast-math" should also be standard for most uses - unless you /really/ need full IEEE support. You don't need to measure anything before using these - just as you don't need to time your car journeys before going out of first gear.

And there is also the common misconception that "optimised code" is difficult to understand, difficult to write, and difficult to debug. Sometimes that is the case - often it is not. The OP needs the square roots of small integers - the first idea that comes to my mind is a lookup table, which is pretty simple.
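For the smallest ranges, the table can even be indexed directly - a sketch (the names and the 0..255 range are just for illustration, since we don't yet know the OP's actual range):

    #include <stdint.h>

    /* Direct-indexed lookup: isqrt_tab[x] == floor(sqrt(x)) for
       0 <= x < 256.  Filled once at startup; costs 256 bytes of RAM
       and a single array read per square root. */
    static uint8_t isqrt_tab[256];

    void isqrt_tab_init(void)
    {
        uint8_t r = 0;
        for (uint32_t x = 0; x < 256; x++) {
            if ((uint32_t)(r + 1) * (r + 1) <= x)
                r++;                    /* next integer root reached */
            isqrt_tab[x] = r;
        }
    }

    static inline uint8_t isqrt_small(uint8_t x)
    {
        return isqrt_tab[x];
    }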
On 05.01.2017 01:27, jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is
> an ARM Cortex-M4. My development system is a Linux box running GCC. My
> embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as
> recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to
> explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization of
> the integer square root algorithm, written by Dr. Jack Crenshaw, is
> found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled if I
> #include <math.h> in my program, and then call sqrt() with an int32
> argument. It stands to reason that optimized code, like that shown in
> the blog above, is what should be used. How can I determine this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.
> The only math-related file that I can find there is called tgmath.h, and
> it just #defines aliases which point back to math.h. I am looking for
> source code.
>
> Thanks for any advice you can share.
Here are my two replies to a similar query some 9 years ago:
> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/jz_myBDmvvUJ
> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/CzUxHUdKYA0J
Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8, lasselangwad...@gmail.com wrote:

> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
> that is probably hard to beat
My information about the performance of FPUs is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.
On Wednesday, January 4, 2017 at 4:54:50 PM UTC-8, anti...@math.uni.wroc.pl wrote:

> Use something like:
>
> /mnt/m1/pom/usr/bin/arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -march=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c
>
> (replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc by the path to your gcc
> and use your compiler options). In the file 'your_file.s' you will find
> the resulting assembly code. Using gcc-5.3 I get:
>
>     bl      __aeabi_i2d
>     vmov    d0, r0, r1
>     bl      sqrt
>     vmov    r0, r1, d0
>     bl      __aeabi_d2iz
>
> AFAICS this is: conversion from integer to double, moving the argument
> to a floating point register, calling sqrt from the C library, moving
> the result back to integer registers, and conversion from double to
> integer.
>
> What sqrt from the library will do depends on the C library; I do not
> use any...
>
>> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
>> clues. The only math-related file that I can find there is called
>> tgmath.h, and it just #defines aliases which point back to math.h. I
>> am looking for source code.
>
> 'sqrt' is in the C library. Some frequently used library functions are
> built into gcc, but apparently 'sqrt' is not one of them. So do not
> look inside gcc. Instead look for the C library. It seems that newlib
> uses an implementation which converts back to integers and does most
> calculations on integers. Seems highly suboptimal.
>
> --
> Waldek Hebisch
Thanks Waldek, I haven't looked at the .s files and that's definitely helpful. If I compile using the "fast math" flag that was suggested by David Brown, that "sqrt" MIGHT compile to a single FPU instruction, VSQRT.F32, which apparently only requires 14 clock cycles. While the FPU result may not achieve IEEE precision, I need speed.
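One quick way to check is to compile a trial function and inspect the assembly - a sketch (flags adapted from Waldek's command, plus -ffast-math; note sqrtf rather than sqrt, to stay in single precision):

    /* sqrt_test.c - compile with:
         arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -mfloat-abi=hard \
             -mfpu=fpv4-sp-d16 -O2 -ffast-math -S sqrt_test.c
       then look for a vsqrt.f32 instruction in sqrt_test.s. */
    #include <math.h>

    float sqrt_test(float x)
    {
        return sqrtf(x);    /* sqrtf, not sqrt: no promotion to double */
    }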
On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
> On 05/01/17 02:22, Reinhardt Behm wrote:
>
> The OP here says specifically that he needs to do lots of square roots,
> and needs them fast, with small integers. Until I know otherwise, I
> assume he has a reasonable idea of what he is doing, and can have a
> reasonable estimate that a software-only double-precision IEEE accuracy
> full range square root function is going to be too slow.
My nRF52 has a 32 MHz CPU clock. The system will need to calculate about 1,000 square roots per second. The largest number I expect to see is about 10,000. If this task was all that concerned me, I wouldn't sweat. However: my system also has several other real-time tasks to perform. It's also battery-powered. I am thinking about optimization because I need to conserve both time and power.
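For a back-of-the-envelope estimate: even at a pessimistic 50 cycles per square root, 1,000 of them per second is 50,000 cycles, or roughly 0.16% of the 32 MHz clock - the question is what that leaves for the other tasks and the power budget.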
> There are also plenty of optimisations that are very "cheap" to do - and
> in general, should /always/ be in place. Basic compiler optimisation -
> at least "-O1", but usually "-O2" or "-Os", should be standard. It
> costs nothing, gives better code (assembly code with "-O1" is usually a
> lot easier to understand than code from "-O0"), and enables better
> static warning and error checking.
>
> If you are using floating point, "-ffast-math" should also be standard
> for most uses - unless you /really/ need full IEEE support.
That's helpful. Now you have me wondering whether there are compiler flags for optimizing integer math as well.
> And there is also the common misconception that "optimised code" is
> difficult to understand, difficult to write, and difficult to debug.
> Sometimes that is the case - often it is not. The OP needs the square
> roots of small integers - the first idea that comes to my mind is a
> lookup table, which is pretty simple.
As you can see, a 100-number lookup table should address my needs. A bit long, perhaps. A binary search through the table could require 7 iterations.
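Something like this sketch, perhaps (the names are mine, and it assumes inputs of at most 10,000):

    #include <stdint.h>

    /* floor(sqrt(x)) for 0 <= x <= 10000, by binary search over a
       table of squares: 101 entries, at most 7 halvings of [0, 100]. */
    static uint16_t squares[101];       /* squares[i] == i * i */

    void squares_init(void)
    {
        for (uint32_t i = 0; i <= 100; i++)
            squares[i] = (uint16_t)(i * i);
    }

    uint32_t isqrt_search(uint32_t x)
    {
        uint32_t lo = 0, hi = 100;
        while (lo < hi) {               /* invariant: squares[lo] <= x */
            uint32_t mid = (lo + hi + 1) / 2;
            if (squares[mid] <= x)
                lo = mid;
            else
                hi = mid - 1;
        }
        return lo;
    }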
On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:

> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
> lasselangwad...@gmail.com wrote:
>
>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>> that is probably hard to beat
>
> My information about the performance of FPUs is clearly a few years
> behind the times. That's actually a very impressive result, and you're
> right, it would be quite hard to beat.
Keep in mind, as David pointed out, that you want to make sure it's single-precision floating point: the FPU on the M4 doesn't do double-precision math, so that would still go the horridly slow software route (although a smart, FPU-aware algorithm might get an initial guess from the FPU).

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
On Thu, 5 Jan 2017 15:38:10 -0800 (PST), jladasky@itu.edu wrote:

> On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
>> On 05/01/17 02:22, Reinhardt Behm wrote:
>>
>> The OP here says specifically that he needs to do lots of square roots,
>> and needs them fast, with small integers. Until I know otherwise, I
>> assume he has a reasonable idea of what he is doing, and can have a
>> reasonable estimate that a software-only double-precision IEEE accuracy
>> full range square root function is going to be too slow.
>
> My nRF52 has a 32 MHz CPU clock. The system will need to calculate
> about 1,000 square roots per second. The largest number I expect to see
> is about 10,000. If this task was all that concerned me, I wouldn't
> sweat. However: my system also has several other real-time tasks to
> perform. It's also battery-powered. I am thinking about optimization
> because I need to conserve both time and power.
>
>> There are also plenty of optimisations that are very "cheap" to do -
>> and in general, should /always/ be in place. Basic compiler
>> optimisation - at least "-O1", but usually "-O2" or "-Os", should be
>> standard. It costs nothing, gives better code (assembly code with
>> "-O1" is usually a lot easier to understand than code from "-O0"), and
>> enables better static warning and error checking.
>>
>> If you are using floating point, "-ffast-math" should also be standard
>> for most uses - unless you /really/ need full IEEE support.
>
> That's helpful. Now you have me wondering whether there are compiler
> flags for optimizing integer math as well.
>
>> And there is also the common misconception that "optimised code" is
>> difficult to understand, difficult to write, and difficult to debug.
>> Sometimes that is the case - often it is not. The OP needs the square
>> roots of small integers - the first idea that comes to my mind is a
>> lookup table, which is pretty simple.
>
> As you can see, a 100-number lookup table should address my needs. A
> bit long, perhaps. A binary search through the table could require 7
> iterations.
The basic binary digit-by-digit algorithm for numbers of that scale ought to be at least a bit faster than a binary search of a table that size, and be less code (and no data).

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29

The inner loop of that is about a dozen instructions. And if you're really limited to 10,000, you can initialize "bit" to 1<<12. If you have a count-leading-zeros instruction available, you can improve the initial computation of "bit" a fair, *ahem*, bit.

Obviously you can do that faster, but almost always with considerably more complexity and/or memory usage.
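For concreteness, a sketch of that loop in C (the function name is mine; "bit" starts at 1<<12 as suggested, so inputs of at most 10,000 are assumed):

    #include <stdint.h>

    /* Binary digit-by-digit square root (floor), per the Wikipedia
       page above.  "bit" begins at the largest power of four that can
       appear in the input; 1 << 12 = 4096 covers inputs up to 10,000. */
    uint32_t isqrt_binary(uint32_t x)
    {
        uint32_t bit = 1u << 12;
        uint32_t res = 0;

        while (bit > x)                 /* skip leading zero digit pairs */
            bit >>= 2;

        while (bit != 0) {
            if (x >= res + bit) {
                x  -= res + bit;
                res = (res >> 1) + bit;
            } else {
                res >>= 1;
            }
            bit >>= 2;
        }
        return res;
    }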
On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
<seemywebsite@myfooter.really> wrote:

> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>> that is probably hard to beat
>>
>> My information about the performance of FPUs is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
>
> Keep in mind, as David pointed out, that you want to make sure it's
> single-precision floating point: the FPU on the M4 doesn't do double-
> precision math, so that would still go the horridly slow software route
> (although a smart, FPU-aware algorithm might get an initial guess from
> the FPU).
If I understand correctly, the OP wants a square root algorithm accepting an integer and returning an integer (not a float).

That algorithm will give precise values for the square roots of 1, 4, 9, 16, 25 and so on; for other values the result contains a more or less large error. The discussion about single vs. double precision floats is not relevant in this environment.

The OP said that the argument is a small integer, and as long as the argument is 6 decimal digits or less, it can be represented exactly in a single precision float. Thus the result will have up to 4 decimal digits in the integer part and a few decimal digits in the fractional part. Just convert the single precision value to an integer to discard the fractional part, if truly an integer result is required.

It should be noted that conversion from float to integer can take a long time on some architectures (if no barrel shifter is available), so it may be a good idea to keep integer and float calculations separate and avoid mixing them in the same calculation.
On 06/01/17 00:44, Tim Wescott wrote:
> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>> that is probably hard to beat
>>
>> My information about the performance of FPUs is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
Not all FPUs give accurate square root results from their hardware instructions. The ARM documentation does not give details, so (unless someone else knows better) we can assume that it /does/ give accurate values, but not necessarily following the IEEE specifications precisely. I have seen other FPUs, equivalent to the M4F's, where the square root instruction is explicitly an approximation. It would be good enough for many uses, but for a highly accurate result you would use it as the estimate and then run a couple of rounds of Newton-Raphson.
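One such refinement round is a one-liner - a sketch, assuming x > 0 and a starting estimate y from the hardware instruction:

    /* One Newton-Raphson step for y ~ sqrt(x): y' = (y + x/y) / 2.
       Each step roughly doubles the number of correct bits. */
    static inline float nr_step(float x, float y)
    {
        return 0.5f * (y + x / y);
    }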
> Keep in mind, as David pointed out, that you want to make sure it's
> single-precision floating point: the FPU on the M4 doesn't do double-
> precision math, so that would still go the horridly slow software route
> (although a smart, FPU-aware algorithm might get an initial guess from
> the FPU).
A smart FPU-aware algorithm /might/ do that, but it is very unlikely in practice. It would not actually save many cycles - you can already get a good initial estimate by simply taking the exponent from the floating point number and halving it. By the time you have handled the conversions between double and single precision and back, and considering the cost of the rest of the double-precision software floating point operations needed in the square root, I can't see you saving more than a couple of percent off the total time. It is not worth building a special version of the library just for this one case - other double-precision maths cannot, in general, get any benefit from using the single-precision hardware.
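That exponent-halving estimate might be sketched with the standard frexp/ldexp functions like so (illustration only; assumes x > 0):

    #include <math.h>

    /* Crude first guess for sqrt(x): if x = m * 2^e with m in [0.5, 1),
       then sqrt(x) = sqrt(m) * 2^(e/2), so 2^(e/2) is within a factor
       of about two of the true root. */
    static double sqrt_guess(double x)
    {
        int e;
        frexp(x, &e);           /* only the exponent is needed */
        return ldexp(1.0, e / 2);
    }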
On 06/01/17 06:54, upsidedown@downunder.com wrote:
> On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
> <seemywebsite@myfooter.really> wrote:
>
>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>
>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>> lasselangwad...@gmail.com wrote:
>>>
>>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>>> that is probably hard to beat
>>>
>>> My information about the performance of FPUs is clearly a few years
>>> behind the times. That's actually a very impressive result, and
>>> you're right, it would be quite hard to beat.
>>
>> Keep in mind, as David pointed out, that you want to make sure it's
>> single-precision floating point: the FPU on the M4 doesn't do double-
>> precision math, so that would still go the horridly slow software
>> route (although a smart, FPU-aware algorithm might get an initial
>> guess from the FPU).
>
> If I understand correctly, the OP wants a square root algorithm
> accepting an integer and returning an integer (not a float).
Yes.
> That algorithm will give precise values for the square roots of 1, 4,
> 9, 16, 25 and so on; for other values the result contains a more or
> less large error.
Yes. The OP has not said whether he wants round-down, round-to-nearest, or does not care.
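For example, both variants are one-liners on top of the single-precision route (a sketch, ignoring range checks):

    #include <math.h>

    /* Round-down: C's float-to-int conversion truncates toward zero,
       which is floor() for non-negative values. */
    static inline int isqrt_floor(int x)
    {
        return (int)sqrtf((float)x);
    }

    /* Round-to-nearest: bias by half before truncating. */
    static inline int isqrt_nearest(int x)
    {
        return (int)(sqrtf((float)x) + 0.5f);
    }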
> The discussion about single vs. double precision floats is not
> relevant in this environment.
It is very relevant. An easy way to implement his requirements (ignoring range checks) is:

    int sqrt_int(int x)
    {
        return sqrtf(x);
    }

But that requires the compiler flags that have been discussed, and "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction and a couple of conversion instructions.
> The OP said that the argument is a small integer, and as long as the
> argument is 6 decimal digits or less, it can be represented exactly in
> a single precision float. Thus the result will have up to 4 decimal
> digits in the integer part and a few decimal digits in the fractional
> part. Just convert the single precision value to an integer to discard
> the fractional part, if truly an integer result is required.
It should work fine - as long as doubles are avoided (to avoid the slowdown).
> It should be noted that conversion from float to integer can take a
> long time on some architectures (if no barrel shifter is available),
> so it may be a good idea to keep integer and float calculations
> separate and avoid mixing them in the same calculation.
The M4F has float-to-integer conversion instructions (as do all CPUs with floating point hardware that I have ever seen). Again, flags like -ffast-math and -O1 are required to be sure that these are used without library calls to handle the possibility of "weird" floats. Your point is important if doubles are involved, or on an architecture without hardware floating point.

Conversions back and forth between integer and floating point formats are relevant to the cost of the operations. However, they are not /that/ expensive - they are typically no worse than general arithmetic like addition and multiplication, and cheaper than division. Integer calculations may also give the compiler more optimisation opportunities than floating point ones (especially if you have forgotten -ffast-math).

(And the ARM architecture has a barrel shifter.)