
How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
On 05/01/17 02:22, Reinhardt Behm wrote:
> AT Thursday 05 January 2017 08:17, jladasky@itu.edu wrote:
>
>> On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote:
>>> On 05/01/17 10:27, j...@itu.edu wrote:
>>>> I am interested in understanding exactly what code will be compiled if
>>>> I #include <math.h> in my program, and then call sqrt() with an int32
>>>> argument.
>>>
>>> sqrt() takes a double-precision floating point number, so
>>> your int32 will be promoted to double before sqrt is called,
>>> and you'll get a double back.
>>
>> Thanks for your reply, Clifford. That's actually not an encouraging
>> answer though! If that is the case, I think I will write the square root
>> function that was shown in Listing 4 of Jack Crenshaw's blog. The
>> floating-point function will spend time calculating digits of precision
>> that I do not need for my application.
>
> One of the rules of optimization is: Only do it after you have numbers to
> compare. You are jumping to conclusions without any data. First benchmark
> and measure, then optimize, measure again and compare.
Rules about optimisation are often misunderstood - perhaps because "optimisation" is such a general term, used for everything from enabling a compiler switch, to algorithmic changes, to fine-tuning hand-written assembly routines.

The first rule of optimising is from Knuth - "Premature optimisation is the root of all evil." The emphasis is on /premature/. That encompasses not putting effort into optimisation that is not needed - but equally, it allows for optimisation when you know that it /is/ needed.

Measurement is useful - invaluable, even - when you are comparing different solutions, finding bottlenecks, or identifying which changes had which effects. But it is not necessary when you can get a good enough estimate of the scales involved. If you want to plan a route across a country, you might want to calculate carefully the times for driving, trains, or taking a plane - but you can already rule out walking from the start as a poor "algorithm", and also rule out rockets and submarines for various reasons. You don't need any sort of measurements to start with.

The OP here says specifically that he needs to do lots of square roots, and needs them fast, with small integers. Until I know otherwise, I assume he has a reasonable idea of what he is doing, and can have a reasonable estimate that a software-only double-precision IEEE accuracy full range square root function is going to be too slow. It makes sense to think about this from the start, not after benchmarking and measurement - it could easily be a make-or-break issue for the algorithm and his whole project. I assume that the OP /does/ have useful data, such as the number of square roots per second and the clock speed of the device, even if he does not have /all/ the relevant data.

There are also plenty of optimisations that are very "cheap" to do - and in general, should /always/ be in place. Basic compiler optimisation - at least "-O1", but usually "-O2" or "-Os", should be standard. It costs nothing, gives better code (assembly code with "-O1" is usually a lot easier to understand than code from "-O0"), and enables better static warning and error checking. If you are using floating point, "-ffast-math" should also be standard for most uses - unless you /really/ need full IEEE support. You don't need to measure anything before using these - just as you don't need to time your car journeys before going out of first gear.

And there is also the common misconception that "optimised code" is difficult to understand, difficult to write, and difficult to debug. Sometimes that is the case - often it is not. The OP needs the square roots of small integers - the first idea that comes to my mind is a lookup table, which is pretty simple.
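For the smallest ranges, the table can even be indexed directly - a sketch (the names and the 0..255 range are just for illustration, since we don't yet know the OP's actual range):

    #include <stdint.h>

    /* Direct-indexed lookup: isqrt_tab[x] == floor(sqrt(x)) for
       0 <= x < 256.  Filled once at startup; costs 256 bytes of RAM
       and a single array read per square root. */
    static uint8_t isqrt_tab[256];

    void isqrt_tab_init(void)
    {
        uint8_t r = 0;
        for (uint32_t x = 0; x < 256; x++) {
            if ((uint32_t)(r + 1) * (r + 1) <= x)
                r++;                    /* next integer root reached */
            isqrt_tab[x] = r;
        }
    }

    static inline uint8_t isqrt_small(uint8_t x)
    {
        return isqrt_tab[x];
    }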
On 05.01.2017 01:27, jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is
> an ARM Cortex-M4. My development system is a Linux box running GCC. My
> embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as
> recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to
> explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization of
> the integer square root algorithm, written by Dr. Jack Crenshaw, is
> found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled if I
> #include <math.h> in my program, and then call sqrt() with an int32
> argument. It stands to reason that optimized code, like that shown in
> the blog above, is what should be used. How can I determine this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.
> The only math-related file that I can find there is called tgmath.h, and
> it just #defines aliases which point back to math.h. I am looking for
> source code.
>
> Thanks for any advice you can share.
Here are my two replies to a similar query some 9 years ago:
> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/jz_myBDmvvUJ
> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/CzUxHUdKYA0J
Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8, lasselangwad...@gmail.com wrote:

> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
> that is probably hard to beat
My information about the performance of FPUs is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.
On Wednesday, January 4, 2017 at 4:54:50 PM UTC-8, anti...@math.uni.wroc.pl wrote:

> Use something like:
>
> /mnt/m1/pom/usr/bin/arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -march=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c
>
> (replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc by the path to your gcc
> and use your compiler options). In the file 'your_file.s' you will find
> the resulting assembly code. Using gcc-5.3 I get:
>
>     bl      __aeabi_i2d
>     vmov    d0, r0, r1
>     bl      sqrt
>     vmov    r0, r1, d0
>     bl      __aeabi_d2iz
>
> AFAICS this is: conversion from integer to double, moving the argument
> to a floating point register, calling sqrt from the C library, moving
> the result back to integer registers, and conversion from double to
> integer.
>
> What sqrt from the library will do depends on the C library; I do not
> use any...
>
>> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
>> clues. The only math-related file that I can find there is called
>> tgmath.h, and it just #defines aliases which point back to math.h. I
>> am looking for source code.
>
> 'sqrt' is in the C library. Some frequently used library functions are
> built into gcc, but apparently 'sqrt' is not one of them. So do not
> look inside gcc. Instead look for the C library. It seems that newlib
> uses an implementation which converts back to integers and does most
> calculations on integers. Seems highly suboptimal.
>
> --
> Waldek Hebisch
Thanks Waldek, I haven't looked at the .s files and that's definitely helpful. If I compile using the "fast math" flag that was suggested by David Brown, that "sqrt" MIGHT compile to a single FPU instruction, VSQRT.F32, which apparently only requires 14 clock cycles. While the FPU result may not achieve IEEE precision, I need speed.
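One quick way to check is to compile a trial function and inspect the assembly - a sketch (flags adapted from Waldek's command, plus -ffast-math; note sqrtf rather than sqrt, to stay in single precision):

    /* sqrt_test.c - compile with:
         arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -mfloat-abi=hard \
             -mfpu=fpv4-sp-d16 -O2 -ffast-math -S sqrt_test.c
       then look for a vsqrt.f32 instruction in sqrt_test.s. */
    #include <math.h>

    float sqrt_test(float x)
    {
        return sqrtf(x);    /* sqrtf, not sqrt: no promotion to double */
    }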
On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
> On 05/01/17 02:22, Reinhardt Behm wrote:
>
> The OP here says specifically that he needs to do lots of square roots,
> and needs them fast, with small integers. Until I know otherwise, I
> assume he has a reasonable idea of what he is doing, and can have a
> reasonable estimate that a software-only double-precision IEEE accuracy
> full range square root function is going to be too slow.
My nRF52 has a 32 MHz CPU clock. The system will need to calculate about 1,000 square roots per second. The largest number I expect to see is about 10,000. If this task was all that concerned me, I wouldn't sweat. However: my system also has several other real-time tasks to perform. It's also battery-powered. I am thinking about optimization because I need to conserve both time and power.
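For a back-of-the-envelope estimate: even at a pessimistic 50 cycles per square root, 1,000 of them per second is 50,000 cycles, or roughly 0.16% of the 32 MHz clock - the question is what that leaves for the other tasks and the power budget.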
> There are also plenty of optimisations that are very "cheap" to do - and
> in general, should /always/ be in place. Basic compiler optimisation -
> at least "-O1", but usually "-O2" or "-Os", should be standard. It
> costs nothing, gives better code (assembly code with "-O1" is usually a
> lot easier to understand than code from "-O0"), and enables better
> static warning and error checking.
>
> If you are using floating point, "-ffast-math" should also be standard
> for most uses - unless you /really/ need full IEEE support.
That's helpful. Now you have me wondering whether there are compiler flags for optimizing integer math as well.
> And there is also the common misconception that "optimised code" is
> difficult to understand, difficult to write, and difficult to debug.
> Sometimes that is the case - often it is not. The OP needs the square
> roots of small integers - the first idea that comes to my mind is a
> lookup table, which is pretty simple.
As you can see, a 100-number lookup table should address my needs. A bit long, perhaps. A binary search through the table could require 7 iterations.
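Something like this sketch, perhaps (the names are mine, and it assumes inputs of at most 10,000):

    #include <stdint.h>

    /* floor(sqrt(x)) for 0 <= x <= 10000, by binary search over a
       table of squares: 101 entries, at most 7 halvings of [0, 100]. */
    static uint16_t squares[101];       /* squares[i] == i * i */

    void squares_init(void)
    {
        for (uint32_t i = 0; i <= 100; i++)
            squares[i] = (uint16_t)(i * i);
    }

    uint32_t isqrt_search(uint32_t x)
    {
        uint32_t lo = 0, hi = 100;
        while (lo < hi) {               /* invariant: squares[lo] <= x */
            uint32_t mid = (lo + hi + 1) / 2;
            if (squares[mid] <= x)
                lo = mid;
            else
                hi = mid - 1;
        }
        return lo;
    }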
On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:

> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
> lasselangwad...@gmail.com wrote:
>
>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>> that is probably hard to beat
>
> My information about the performance of FPUs is clearly a few years
> behind the times. That's actually a very impressive result, and you're
> right, it would be quite hard to beat.
Keep in mind, as David pointed out, that you want to make sure it's single-precision floating point: the FPU on the M4 doesn't do double-precision math, so that would still go the horridly slow software route (although a smart, FPU-aware algorithm might get an initial guess from the FPU).

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
On Thu, 5 Jan 2017 15:38:10 -0800 (PST), jladasky@itu.edu wrote:

> On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
>> On 05/01/17 02:22, Reinhardt Behm wrote:
>>
>> The OP here says specifically that he needs to do lots of square roots,
>> and needs them fast, with small integers. Until I know otherwise, I
>> assume he has a reasonable idea of what he is doing, and can have a
>> reasonable estimate that a software-only double-precision IEEE accuracy
>> full range square root function is going to be too slow.
>
> My nRF52 has a 32 MHz CPU clock. The system will need to calculate
> about 1,000 square roots per second. The largest number I expect to see
> is about 10,000. If this task was all that concerned me, I wouldn't
> sweat. However: my system also has several other real-time tasks to
> perform. It's also battery-powered. I am thinking about optimization
> because I need to conserve both time and power.
>
>> There are also plenty of optimisations that are very "cheap" to do -
>> and in general, should /always/ be in place. Basic compiler
>> optimisation - at least "-O1", but usually "-O2" or "-Os", should be
>> standard. It costs nothing, gives better code (assembly code with
>> "-O1" is usually a lot easier to understand than code from "-O0"), and
>> enables better static warning and error checking.
>>
>> If you are using floating point, "-ffast-math" should also be standard
>> for most uses - unless you /really/ need full IEEE support.
>
> That's helpful. Now you have me wondering whether there are compiler
> flags for optimizing integer math as well.
>
>> And there is also the common misconception that "optimised code" is
>> difficult to understand, difficult to write, and difficult to debug.
>> Sometimes that is the case - often it is not. The OP needs the square
>> roots of small integers - the first idea that comes to my mind is a
>> lookup table, which is pretty simple.
>
> As you can see, a 100-number lookup table should address my needs. A
> bit long, perhaps. A binary search through the table could require 7
> iterations.
The basic binary digit-by-digit algorithm for numbers of that scale ought to be at least a bit faster than a binary search of a table that size, and be less code (and no data).

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29

The inner loop of that is about a dozen instructions. And if you're really limited to 10,000, you can initialize "bit" to 1<<12. If you have a count-leading-zeros instruction available, you can improve the initial computation of "bit" a fair, *ahem*, bit.

Obviously you can do that faster, but almost always with considerably more complexity and/or memory usage.
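For concreteness, a sketch of that loop in C (the function name is mine; "bit" starts at 1<<12 as suggested, so inputs of at most 10,000 are assumed):

    #include <stdint.h>

    /* Binary digit-by-digit square root (floor), per the Wikipedia
       page above.  "bit" begins at the largest power of four that can
       appear in the input; 1 << 12 = 4096 covers inputs up to 10,000. */
    uint32_t isqrt_binary(uint32_t x)
    {
        uint32_t bit = 1u << 12;
        uint32_t res = 0;

        while (bit > x)                 /* skip leading zero digit pairs */
            bit >>= 2;

        while (bit != 0) {
            if (x >= res + bit) {
                x  -= res + bit;
                res = (res >> 1) + bit;
            } else {
                res >>= 1;
            }
            bit >>= 2;
        }
        return res;
    }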
On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
<seemywebsite@myfooter.really> wrote:

> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>> that is probably hard to beat
>>
>> My information about the performance of FPUs is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
>
> Keep in mind, as David pointed out, that you want to make sure it's
> single-precision floating point: the FPU on the M4 doesn't do double-
> precision math, so that would still go the horridly slow software route
> (although a smart, FPU-aware algorithm might get an initial guess from
> the FPU).
If I understand correctly, the OP wants a square root algorithm accepting an integer and returning an integer (not a float).

That algorithm will give precise values for the square roots of 1, 4, 9, 16, 25 and so on; for other values the result contains a more or less large error. The discussion about single vs. double precision floats is not relevant in this environment.

The OP said that the argument is a small integer, and as long as the argument is 6 decimal digits or less, it can be represented exactly in a single precision float. Thus the result will have up to 4 decimal digits in the integer part and a few decimal digits in the fractional part. Just convert the single precision value to an integer to discard the fractional part, if truly an integer result is required.

It should be noted that conversion from float to integer can take a long time on some architectures (if no barrel shifter is available), so it may be a good idea to keep integer and float calculations separate and avoid mixing them in the same calculation.
On 06/01/17 00:44, Tim Wescott wrote:
> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>> that is probably hard to beat
>>
>> My information about the performance of FPUs is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
Not all FPUs give accurate square root results from their hardware instructions. The ARM documentation does not give details, so (unless someone else knows better) we can assume that it /does/ give accurate values, but not necessarily following the IEEE specifications precisely. I have seen other FPUs, equivalent to the M4F's, where the square root instruction is explicitly an approximation. It would be good enough for many uses, but for a highly accurate result you would use it as the estimate and then run a couple of rounds of Newton-Raphson.
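One such refinement round is a one-liner - a sketch, assuming x > 0 and a starting estimate y from the hardware instruction:

    /* One Newton-Raphson step for y ~ sqrt(x): y' = (y + x/y) / 2.
       Each step roughly doubles the number of correct bits. */
    static inline float nr_step(float x, float y)
    {
        return 0.5f * (y + x / y);
    }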
> Keep in mind, as David pointed out, that you want to make sure it's
> single-precision floating point: the FPU on the M4 doesn't do double-
> precision math, so that would still go the horridly slow software route
> (although a smart, FPU-aware algorithm might get an initial guess from
> the FPU).
A smart FPU-aware algorithm /might/ do that, but it is very unlikely in practice. It would not actually save many cycles - you can already get a good initial estimate by simply taking the exponent from the floating point number and halving it. By the time you have handled the conversions between double and single precision and back, and considering the cost of the rest of the double-precision software floating point operations needed in the square root, I can't see you saving more than a couple of percent off the total time. It is not worth building a special version of the library just for this one case - other double-precision maths cannot, in general, get any benefit from using the single-precision hardware.
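That exponent-halving estimate might be sketched with the standard frexp/ldexp functions like so (illustration only; assumes x > 0):

    #include <math.h>

    /* Crude first guess for sqrt(x): if x = m * 2^e with m in [0.5, 1),
       then sqrt(x) = sqrt(m) * 2^(e/2), so 2^(e/2) is within a factor
       of about two of the true root. */
    static double sqrt_guess(double x)
    {
        int e;
        frexp(x, &e);           /* only the exponent is needed */
        return ldexp(1.0, e / 2);
    }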
On 06/01/17 06:54, upsidedown@downunder.com wrote:
> On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
> <seemywebsite@myfooter.really> wrote:
>
>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>
>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>> lasselangwad...@gmail.com wrote:
>>>
>>>> the fpu instruction VSQRT.F32 takes 14 cycles; if floats are enough,
>>>> that is probably hard to beat
>>>
>>> My information about the performance of FPUs is clearly a few years
>>> behind the times. That's actually a very impressive result, and
>>> you're right, it would be quite hard to beat.
>>
>> Keep in mind, as David pointed out, that you want to make sure it's
>> single-precision floating point: the FPU on the M4 doesn't do double-
>> precision math, so that would still go the horridly slow software
>> route (although a smart, FPU-aware algorithm might get an initial
>> guess from the FPU).
>
> If I understand correctly, the OP wants a square root algorithm
> accepting an integer and returning an integer (not a float).
Yes.
> That algorithm will give precise values for the square roots of 1, 4,
> 9, 16, 25 and so on; for other values the result contains a more or
> less large error.
Yes. The OP has not said whether he wants round-down, round-to-nearest, or does not care.
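For example, both variants are one-liners on top of the single-precision route (a sketch, ignoring range checks):

    #include <math.h>

    /* Round-down: C's float-to-int conversion truncates toward zero,
       which is floor() for non-negative values. */
    static inline int isqrt_floor(int x)
    {
        return (int)sqrtf((float)x);
    }

    /* Round-to-nearest: bias by half before truncating. */
    static inline int isqrt_nearest(int x)
    {
        return (int)(sqrtf((float)x) + 0.5f);
    }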
> The discussion about single vs. double precision floats is not
> relevant in this environment.
It is very relevant. An easy way to implement his requirements (ignoring range checks) is:

    int sqrt_int(int x)
    {
        return sqrtf(x);
    }

But that requires the compiler flags that have been discussed, and "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction and a couple of conversion instructions.
> The OP said that the argument is a small integer, and as long as the
> argument is 6 decimal digits or less, it can be represented exactly in
> a single precision float. Thus the result will have up to 4 decimal
> digits in the integer part and a few decimal digits in the fractional
> part. Just convert the single precision value to an integer to discard
> the fractional part, if truly an integer result is required.
It should work fine - as long as doubles are avoided (to avoid the slowdown).
> It should be noted that conversion from float to integer can take a
> long time on some architectures (if no barrel shifter is available),
> so it may be a good idea to keep integer and float calculations
> separate and avoid mixing them in the same calculation.
The M4F has float-to-integer conversion instructions (as do all CPUs with floating point hardware that I have ever seen). Again, flags like -ffast-math and -O1 are required to be sure that these are used without library calls to handle the possibility of "weird" floats. Your point is important if doubles are involved, or on an architecture without hardware floating point.

Conversions back and forth between integer and floating point formats are relevant to the cost of the operations. However, they are not /that/ expensive - they are typically no worse than general arithmetic like addition and multiplication, and cheaper than division. Integer calculations may also give the compiler more optimisation opportunities than floating point ones (especially if you have forgotten -ffast-math).

(And the ARM architecture has a barrel shifter.)