
How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
On 06/01/17 00:38, jladasky@itu.edu wrote:
> On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown
> wrote:
>> On 05/01/17 02:22, Reinhardt Behm wrote:
>>
>> The OP here says specifically that he needs to do lots of square
>> roots, and needs them fast, with small integers. Until I know
>> otherwise, I assume he has a reasonable idea of what he is doing,
>> and can have a reasonable estimate that a software-only
>> double-precision IEEE accuracy full range square root function is
>> going to be too slow.
>
> My nRF52 has a 32 MHz CPU clock. The system will need to calculate
> about 1,000 square roots per second. The largest number I expect to
> see is about 10,000. If this task was all that concerned me, I
> wouldn't sweat. However: my system also has several other real-time
> tasks to perform. It's also battery-powered. I am thinking about
> optimization because I need to conserve both time and power.
>
>> There are also plenty of optimisations that are very "cheap" to do
>> - and in general, should /always/ be in place. Basic compiler
>> optimisation - at least "-O1", but usually "-O2" or "-Os", should
>> be standard. It costs nothing, gives better code (assembly code
>> with "-O1" is usually a lot easier to understand than code from
>> "-O0"), and enables better static warning and error checking.
>>
>> If you are using floating point, "-ffast-math" should also be
>> standard for most uses - unless you /really/ need full IEEE
>> support.
>
> That's helpful. Now you have me wondering whether there are compiler
> flags for optimizing integer math as well.
Integer maths is optimised with "-O1" or higher. It does not need separate flags. (There are other flags that can affect the way integer overflow is handled, but the default is to follow the C standards, and that is usually what gives the fastest results anyway.)
>
>> And there is also the common misconception that "optimised code"
>> is difficult to understand, difficult to write, and difficult to
>> debug. Sometimes that is the case - often it is not. The OP needs
>> the square roots of small integers - the first idea that comes to
>> my mind is a lookup table, which is pretty simple.
>
> As you can see, a 100-number lookup table should address my needs. A
> bit long, perhaps. A binary search through the table could require 7
> iterations.
>
A 10000-number lookup table will give you the result in a single operation. Do you have 20K of flash (or ram) space to spare? I don't know your particular chip, but many Cortex devices actually have quite a lot of flash space, and while 20K for a square root table might seem excessive, it could work. And it would be the fastest possible method.

You can also do smart things with a smaller table - say, a 157-entry table for the square roots of (x >> 6), shifted left by 3 bits. Your algorithm is then to take x, shift right 6 bits, and look it up in the table as y. The integer square root of x is then between y and y + 7. A simple algorithm to check these would take 8 checks, each of which is simpler than a round of binary search. (And the table takes less flash.) A smarter algorithm could probably jump to the correct y faster, based on (x >> 6) and (x & 63). That would still be slower than using VSQRT, but is a possibility if you want to avoid floating point.
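A minimal C sketch of that smaller-table scheme (my reading of it, not David's code; the table is filled once at startup here, though it could equally be generated offline and placed in flash):

    #include <stdint.h>

    static uint8_t sqrt_tab[157];      /* sqrt_tab[i] = isqrt(i) << 3 */

    static void sqrt_tab_init(void)
    {
        uint32_t r = 0;
        for (uint32_t i = 0; i < 157; i++) {
            while ((r + 1) * (r + 1) <= i)   /* advance r to isqrt(i) */
                r++;
            sqrt_tab[i] = (uint8_t)(r << 3);
        }
    }

    uint32_t isqrt_tab(uint32_t x)     /* valid for x <= 10047 */
    {
        uint32_t y = sqrt_tab[x >> 6];       /* lower bound on the root */
        while ((y + 1) * (y + 1) <= x)       /* at most 7 steps up */
            y++;
        return y;
    }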
On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
<david.brown@hesbynett.no> wrote:

>
>> The discussion about single vs double precision floats is not relevant
>> in this environment.
>
>It is very relevant. An easy way to implement his requirements
>(ignoring range checks) is:
>
>int sqrt_int(int x) {
>    return sqrtf(x);
>}
>
>But that requires the compiler flags that have been discussed, and
>"sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>and a couple of conversion instructions.
The time used to discuss what compiler flags to use could have been better used coding such a simple function in assembler :-).
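For reference, a sketch of the sqrtf() approach quoted above, with the compiler invocation spelled out; the exact flags are an assumption for a Cortex-M4F target such as the nRF52, so check the generated listing on your own toolchain:

    #include <math.h>

    /* Compiled with e.g.
     *   arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16
     *                     -mfloat-abi=hard -O2 -ffast-math -S sqrt_int.c
     * the call should reduce to VCVT / VSQRT.F32 / VCVT with no library
     * call (inspect the .s output to confirm).
     */
    int sqrt_int(int x)
    {
        return (int)sqrtf((float)x);   /* truncates; no range checks */
    }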
On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
<david.brown@hesbynett.no> wrote:

>On 06/01/17 00:44, Tim Wescott wrote:
>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>
>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>> lasselangwad...@gmail.com wrote:
>>>
>>>> the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that
>>>> is probably hard to beat
If those cycles are the same as the actual clock oscillator cycles, the 32 MHz oscillator would allow more than 2 million VSQRT instructions each second. The OP estimated a thousand square roots each second, which would consume less than 0.1 % of the clock cycles. No big deal and not a reason for optimization.
>>> My information about the performance of FPU's is clearly a few years
>>> behind the times. That's actually a very impressive result, and you're
>>> right, it would be quite hard to beat.
>>
>
>Not all FPU's give accurate square root results from their hardware
>instructions. The ARM documentation does not give details, so (unless
>someone else knows better) we can assume that it /does/ give accurate
>values, but not necessarily following IEEE specifications precisely.
Not following the IEEE usually means it doesn't handle (extremely small) denormalized values, might not generate proper +/- infinity values (for huge values), or might not handle NaN (Not a Number) cases properly.
> I
>have seen other FPU's equivalent to the M4F's where the square root
>instruction is explicitly an approximation. It would be good enough for
>many uses, but for a highly accurate result you would use it as the
>estimate and then run a couple of rounds of Newton-Raphson.
Floating point values are nearly always approximations anyway. Only some special cases (small integers) are exact; you can't exactly represent values like 0.1 or 0.01.

To calculate the sqrt on hardware with basic floating point instructions but no hardware SQRT instruction, I have used the following approach:
* normalize the value between 1..4 so that the exponent is even
* divide the exponent by two to get the result exponent
* use a polynomial in the mantissa bits to calculate the result mantissa
* 4th order polynomials give sufficient results for float and 8th order for double
* I have seen implementations (such as VAX) with 3rd order for single precision and 6th order for double precision
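A rough C sketch of that normalization scheme (not the poster's code: the crude quadratic fit below is only ~1% accurate, so two Newton steps polish it where his 4th-order polynomial would reach float accuracy directly):

    #include <math.h>   /* frexpf, ldexpf */

    float sqrt_poly(float x)
    {
        if (x <= 0.0f)
            return 0.0f;             /* domain guard for this sketch */

        int e;
        float m = frexpf(x, &e);     /* x = m * 2^e, with m in [0.5, 1) */

        if (e & 1) { m *= 2.0f; e -= 1; }   /* m in [1, 2), e even */
        else       { m *= 4.0f; e -= 2; }   /* m in [2, 4), e even */

        /* crude quadratic fit to sqrt(m) on [1, 4) (nodes 1, 2.25, 4) */
        float y = 0.5142857f + m * (0.5238095f - m * 0.0380952f);

        y = 0.5f * (y + m / y);      /* Newton step 1: ~1e-4 relative */
        y = 0.5f * (y + m / y);      /* Newton step 2: ~float accuracy */

        return ldexpf(y, e / 2);     /* reattach the halved exponent */
    }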
On 06/01/17 11:21, upsidedown@downunder.com wrote:
> On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>>
>>> The discussion about single vs double precision floats is not relevant
>>> in this environment.
>>
>> It is very relevant. An easy way to implement his requirements
>> (ignoring range checks) is:
>>
>> int sqrt_int(int x) {
>>     return sqrtf(x);
>> }
>>
>> But that requires the compiler flags that have been discussed, and
>> "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>> and a couple of conversion instructions.
>
> The time used to discuss what compiler flags to use could have been
> better used coding such a simple function in assembler :-).
>
No, it could not - it is better to learn how to use your tools (such as the compiler) properly, so that you can re-use the information to improve more code later, rather than to learn a single inconvenient work-around.

Understanding the point of -ffast-math and optimisation flags, and about float vs. double, will improve /all/ floating point code written by /everyone/ who reads this thread and learned from it. A simple inline assembly function (which is, in fact, slightly fiddly if you want to get the best from it) would merely help one person in one single situation. Give a man a fish, and all that.
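For the curious, the "slightly fiddly" inline-assembly version being referred to might look like this under GCC (a sketch, assuming a Cortex-M4F with FPU code generation enabled, e.g. -mfpu=fpv4-sp-d16 -mfloat-abi=hard):

    /* The "t" constraint requests a single-precision VFP register. */
    static inline float vsqrt_f32(float x)
    {
        float r;
        __asm__ ("vsqrt.f32 %0, %1" : "=t"(r) : "t"(x));
        return r;
    }

    int sqrt_int_vsqrt(int x)
    {
        return (int)vsqrt_f32((float)x);   /* int<->float become VCVTs */
    }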
On 06/01/17 11:50, upsidedown@downunder.com wrote:
> On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> On 06/01/17 00:44, Tim Wescott wrote:
>>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>>
>>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>>> lasselangwad...@gmail.com wrote:
>>>>
>>>>> the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that
>>>>> is probably hard to beat
>
> If those cycles are the same as the actual clock oscillator cycles,
> the 32 MHz oscillator would allow more than 2 million VSQRT
> instructions each second.
>
> The OP estimated a thousand square roots each second, which would
> consume less than 0.1 % of the clock cycles. No big deal and not a
> reason for optimization.
He also made it clear that he wants to do other things than just square roots during that time, as well as wanting to minimise the time in order to reduce power consumption.
>
>>>> My information about the performance of FPU's is clearly a few years
>>>> behind the times. That's actually a very impressive result, and you're
>>>> right, it would be quite hard to beat.
>>>
>>
>> Not all FPU's give accurate square root results from their hardware
>> instructions. The ARM documentation does not give details, so (unless
>> someone else knows better) we can assume that it /does/ give accurate
>> values, but not necessarily following IEEE specifications precisely.
>
> Not following the IEEE usually means it doesn't handle (extremely
> small) denormalized values, might not generate proper +/- infinity
> values (for huge values), or might not handle NaN (Not a Number)
> cases properly.
Mostly true. It is also not uncommon for rounding to be done in a different way, or to lose a LSB here or there compared to IEEE references. Usually these details don't matter. But if you compile without -ffast-math (or the relevant individual flags), the compiler will do its best to get IEEE conformance /exactly/ right - even if that means a library subroutine to handle the final details.
>
>> I
>> have seen other FPU's equivalent to the M4F's where the square root
>> instruction is explicitly an approximation. It would be good enough for
>> many uses, but for a highly accurate result you would use it as the
>> estimate and then run a couple of rounds of Newton-Raphson.
>
> Floating point values are nearly always approximations anyway. Only
> some special cases (small integers) are exact; you can't exactly
> represent values like 0.1 or 0.01.
>
> To calculate the sqrt on hardware with basic floating point
> instructions but no hardware SQRT instruction, I have used the
> following approach:
> * normalize the value between 1..4 so that the exponent is even
> * divide the exponent by two to get the result exponent
> * use a polynomial in the mantissa bits to calculate the result
>   mantissa
> * 4th order polynomials give sufficient results for float and 8th
>   order for double
> * I have seen implementations (such as VAX) with 3rd order for single
>   precision and 6th order for double precision
>
Polynomials like this are sometimes used. Usually the "best" choice of algorithm here depends on the relative speeds of division and multiplication. Early floating point units (such as the VAX) were notoriously slow at division, so an algorithm that uses only polynomials (additions and multiplications) was preferred.
On 06.1.2017 г. 01:38, jladasky@itu.edu wrote:
> On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
>> On 05/01/17 02:22, Reinhardt Behm wrote:
>>
>> The OP here says specifically that he needs to do lots of square roots,
>> and needs them fast, with small integers. Until I know otherwise, I
>> assume he has a reasonable idea of what he is doing, and can have a
>> reasonable estimate that a software-only double-precision IEEE accuracy
>> full range square root function is going to be too slow.
>
>My nRF52 has a 32 MHz CPU clock. The system will need to calculate about
>1,000 square roots per second. The largest number I expect to see is
>about 10,000. If this task was all that concerned me, I wouldn't sweat.
>However: my system also has several other real-time tasks to perform.
>It's also battery-powered. I am thinking about optimization because I
>need to conserve both time and power.
>
So calculating the square root of a 14 bit number is fine for you; you will get a 7 bit result. Using my algorithm this will cost you 7 multiply instructions, 7 compares, 7 bset/bclr or some equivalent (e.g. shift + or sort of thing), 7 conditional branches and 7 decrement and branch opcodes. That makes 35; say I forgot something, make it 70 cycles per square root.

70000 cycles per second out of a total of 32M cycles per second makes.... 0.22% of the computing power you have. Does not sound like something worth all that head banging.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
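Dimiter's actual code is not shown; a classic bit-at-a-time integer square root matching the operation counts he describes would look roughly like this in C:

    #include <stdint.h>

    static uint32_t isqrt14(uint32_t x)        /* valid for x <= 16383 */
    {
        uint32_t r = 0;
        for (int bit = 6; bit >= 0; bit--) {   /* 7 result bits */
            uint32_t t = r | (1u << bit);      /* tentatively set the bit */
            if (t * t <= x)                    /* one multiply, one compare */
                r = t;                         /* keep the bit */
        }
        return r;
    }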
On Fri, 6 Jan 2017 16:50:02 +0100, David Brown
<david.brown@hesbynett.no> wrote:

>On 06/01/17 11:21, upsidedown@downunder.com wrote:
>> On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
>> <david.brown@hesbynett.no> wrote:
>>
>>>
>>>> The discussion about single vs double precision floats is not relevant
>>>> in this environment.
>>>
>>> It is very relevant. An easy way to implement his requirements
>>> (ignoring range checks) is:
>>>
>>> int sqrt_int(int x) {
>>>     return sqrtf(x);
>>> }
>>>
>>> But that requires the compiler flags that have been discussed, and
>>> "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>>> and a couple of conversion instructions.
>>
>> The time used to discuss what compiler flags to use could have been
>> better used coding such a simple function in assembler :-).
>>
>
>No, it could not - it is better to learn how to use your tools (such as
>the compiler) properly, so that you can re-use the information to
>improve more code later, rather than to learn a single inconvenient
>work-around.
>
>Understanding the point of -ffast-math and optimisation flags, and about
>float vs. double, will improve /all/ floating point code written by
>/everyone/ who reads this thread and learned from it. A simple inline
>assembly function (which is, in fact, slightly fiddly if you want to get
>the best from it) would merely help one person in one single situation.
>
I did not refer to in-line assembly in particular, but also to separately compiled assembler code. Of course, you need to know the parameter passing mechanism between separately compiled/assembled units and, for functions, how results are returned. In most languages, parameters are passed on the stack; some implementations push the first parameter first, while others push the last parameter first.
>Give a man a fish, and all that.
Understanding the parameter passing mechanism in general is a useful skill.
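For the ARM EABI target in this thread the convention is register-based rather than stack-based; a sketch of the C-side view of a separately assembled routine (the function name is hypothetical):

    /* AAPCS (the ARM EABI calling convention used by arm-none-eabi-gcc):
     * the first four integer arguments arrive in r0-r3 and an integer
     * result returns in r0; only further arguments go on the stack.
     * With -mfloat-abi=hard, float arguments and results use s0-s15.
     */
    extern int sqrt_int_asm(int x);   /* x arrives in r0; result in r0 */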
On Fri, 6 Jan 2017 16:55:18 +0100, David Brown
<david.brown@hesbynett.no> wrote:

>On 06/01/17 11:50, upsidedown@downunder.com wrote:
>> On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
>> <david.brown@hesbynett.no> wrote:
>>
>>> On 06/01/17 00:44, Tim Wescott wrote:
>>>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>>>
>>>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>>>> lasselangwad...@gmail.com wrote:
>>>>>
>>>>>> the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that
>>>>>> is probably hard to beat
>>
>> If those cycles are the same as the actual clock oscillator cycles,
>> the 32 MHz oscillator would allow more than 2 million VSQRT
>> instructions each second.
>>
>> The OP estimated a thousand square roots each second, which would
>> consume less than 0.1 % of the clock cycles. No big deal and not a
>> reason for optimization.
>
>He also made it clear that he wants to do other things than just square
>roots during that time, as well as wanting to minimise the time in order
>to reduce power consumption.
He just has 99.9 % of the total CPU cycles available for the "other" tasks.
upsidedown@downunder.com writes:

> On Fri, 6 Jan 2017 16:55:18 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>>On 06/01/17 11:50, upsidedown@downunder.com wrote:
>>> On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
>>> <david.brown@hesbynett.no> wrote:
>>>
>>>> On 06/01/17 00:44, Tim Wescott wrote:
>>>>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>>>>
>>>>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>>>>> lasselangwad...@gmail.com wrote:
>>>>>>
>>>>>>> the fpu instruction VSQRT.F32 takes 14 cycles if floats are enough that
>>>>>>> is probably hard to beat
>>>
>>> If those cycles are the same as the actual clock oscillator cycles,
>>> the 32 MHz oscillator would allow more than 2 million VSQRT
>>> instructions each second.
>>>
>>> The OP estimated a thousand square roots each second, which would
>>> consume less than 0.1 % of the clock cycles. No big deal and not a
>>> reason for optimization.
>>
>>He also made it clear that he wants to do other things than just square
>>roots during that time, as well as wanting to minimise the time in order
>>to reduce power consumption.
>
> He just has 99.9 % of the total CPU cycles available for the "other"
> tasks.
It looks like you did not read to the end of his sentence. It will consume 100% of the cycles if it only wakes up to do that one thing.

Modern MCUs can achieve very low power operation by sleeping most of the time. Power consumption can then be nearly proportional to the wake duty cycle. So going from "0.1%" to "0.2%" could be the difference between a 2 year and a 4 year life when powered by a coin cell, say.

--
John Devereux
On Wed, 04 Jan 2017 15:27:10 -0800, jladasky wrote:

> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is
> an ARM Cortex-M4. My development system is a Linux box running GCC. My
> embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as
> recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to
> explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization of
> the integer square root algorithm, written by Dr. Jack Crenshaw, is
> found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled if I
> #include <math.h> in my program, and then call sqrt() with an int32
> argument. It stands to reason that optimized code, like that shown in
> the blog above, is what should be used. How can I determine this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.
> The only math-related file that I can find there is called tgmath.h,
> and it just #defines aliases which point back to math.h. I am looking
> for source code.
>
> Thanks for any advice you can share.
I notice, in our arguing about what you said, that some folks are thinking you want to get an integer _out_ of your algorithm -- is this true?

I'm not sure if it has much of a bearing on things, because I suspect that using the processor's floating point math is going to be the quickest, even with conversion to and from float. But I'm curious.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!