EmbeddedRelated.com
Forums

How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
On Friday, January 6, 2017 at 1:40:19 PM UTC-8, Tim Wescott wrote:

> I notice, in our arguing about what you said, that some folks are
> thinking you want to get an integer _out_ of your algorithm -- is this
> true?
That's correct. Rounding or even truncating is OK. These square roots are being fed to an algorithm which needs the rank order of the results to be preserved, but it's OK if the exact distribution is a little distorted.
> I'm not sure if it has much of a bearing on things, because I suspect
> that using the processor's floating point math is going to be the
> quickest, even with conversion to and from float. But I'm curious.
After reading everyone's input, I'm starting to think that the code which would be produced by -ffast-math will be a pretty good start. I am giving serious consideration to David Brown's suggestion of a 10,000 entry lookup table. I have 512K of flash, and I only need to store int16's.
jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
> is an ARM Cortex-M4. My development system is a Linux box running
> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
> as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have
> to explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization
> of the integer square root algorithm, written by Dr. Jack Crenshaw,
> is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled
> if I #include <math.h> in my program, and then call sqrt() with an
> int32 argument. It stands to reason that optimized code, like that
> shown in the blog above, is what should be used. How can I determine
> this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
> clues. The only math-related file that I can find there is called
> tgmath.h, and it just #defines aliases which point back to math.h. I
> am looking for source code.
>
> Thanks for any advice you can share.
I'd measure the floating point version before piling off into a bespoke (custom) square root routine. -- Les Cargill
Op 07-Jan-17 om 01:52 schreef jladasky@itu.edu:
> On Friday, January 6, 2017 at 1:40:19 PM UTC-8, Tim Wescott wrote:
>
>> I notice, in our arguing about what you said, that some folks are
>> thinking you want to get an integer _out_ of your algorithm -- is this
>> true?
>
> That's correct. Rounding or even truncating is OK. These square roots
> are being fed to an algorithm which needs the rank order of the results
> to be preserved, but it's OK if the exact distribution is a little
> distorted.
>
>> I'm not sure if it has much of a bearing on things, because I suspect
>> that using the processor's floating point math is going to be the
>> quickest, even with conversion to and from float. But I'm curious.
>
> After reading everyone's input, I'm starting to think that the code
> which would be produced by -ffast-math will be a pretty good start.
>
> I am giving serious consideration to David Brown's suggestion of a
> 10,000 entry lookup table. I have 512K of flash, and I only need to
> store int16's.
The sqrt of 10000 is only 100, so you'd need to store only unsigned chars. Wouter "Objects? No Thanks!" van Ooijen
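The byte-sized table suggested above can be sketched as follows. In a real build the table would be a `const` array generated offline and placed in flash; the init loop here is only to keep the example self-contained, and all names are illustrative.

```c
/* Sketch: 10,000-entry lookup table for isqrt(n), n < 10000.
 * Every result is <= 99, so one byte per entry is enough (10 KB total). */
#include <stdint.h>

#define ISQRT_TABLE_SIZE 10000u

static uint8_t isqrt_table[ISQRT_TABLE_SIZE]; /* const, in flash, in practice */

void isqrt_table_init(void)
{
    uint32_t r = 0;
    for (uint32_t n = 0; n < ISQRT_TABLE_SIZE; n++) {
        if ((r + 1) * (r + 1) <= n)  /* advance the root when we pass a square */
            r++;
        isqrt_table[n] = (uint8_t)r;
    }
}

static inline uint8_t isqrt_lut(uint16_t n) /* caller must ensure n < 10000 */
{
    return isqrt_table[n];
}
```

Lookup is then a single indexed load, which is as fast as this can get.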
On 07.01.2017 02:52, jladasky@itu.edu wrote:
> .....
>
> After reading everyone's input, I'm starting to think that the code
> which would be produced by -ffast-math will be a pretty good start.
>
> I am giving serious consideration to David Brown's suggestion of a
> 10,000 entry lookup table. I have 512K of flash, and I only need to
> store int16's.
At 512k flash this is a no brainer of course. Dimiter
On 07/01/2017 02:29, Les Cargill wrote:
> jladasky@itu.edu wrote:
>> Hello everyone,
>>
>> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
>> is an ARM Cortex-M4. My development system is a Linux box running
>> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
>> as recommended by Nordic.
>>
>> I have written a fair amount of code for this device, but now I have
>> to explore a bit deeper. I have an algorithm that may require me to
>> calculate a lot of square roots, and quickly. However, the numbers in
>> question are all small integers.
>>
>> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
>> involved in this square root calculation. Optimized algorithms for
>> various math functions exist. A nice discussion of the optimization
>> of the integer square root algorithm, written by Dr. Jack Crenshaw, is
>> found here:
>>
>> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>>
>> I am interested in understanding exactly what code will be compiled if
>> I #include <math.h> in my program, and then call sqrt() with an int32
>> argument. It stands to reason that optimized code, like that shown in
>> the blog above, is what should be used. How can I determine this?
>>
>> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
>> clues. The only math-related file that I can find there is called
>> tgmath.h, and it just #defines aliases which point back to math.h. I
>> am looking for source code.
>>
>> Thanks for any advice you can share.
>
> I'd measure the floating point version before piling off into a
> bespoke ( custom ) square root routine.
ARM documentation suggests this takes 14 or 15 clock cycles. There is then the conversion from integer to floating point.
--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
On 04/01/2017 23:27, jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
> is an ARM Cortex-M4. My development system is a Linux box running
> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
> as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have
> to explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers
> in question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization
> of the integer square root algorithm, written by Dr. Jack Crenshaw,
> is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled
> if I #include <math.h> in my program, and then call sqrt() with an
> int32 argument. It stands to reason that optimized code, like that
> shown in the blog above, is what should be used. How can I determine
> this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
> clues. The only math-related file that I can find there is called
> tgmath.h, and it just #defines aliases which point back to math.h. I
> am looking for source code.
>
> Thanks for any advice you can share.
For me the lookup table is the best and simplest, especially on a device with 512kB of ROM. Is your program that big?

Given the answer is always going to be of byte length, for the square root of a number up to 10,000 you would only need 10kB of table.

An alternative is a variation of:

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Decimal_.28base_10.29

except you would work to base 256 and have a lookup table of just 256 bytes to find the sqrt of 0 to 255.
--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
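The digit-by-digit method suggested above also has a table-free base-2 special case, described on the same Wikipedia page. A minimal sketch (the classic bit-by-bit variant, not the base-256 one proposed in the post):

```c
/* Sketch: bit-by-bit (base-2) digit-by-digit integer square root.
 * No table, no FPU: one conditional subtract per result bit. */
#include <stdint.h>

uint16_t isqrt_bits(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30; /* 4^15, the largest power of four in 32 bits */

    while (bit > x)          /* shrink to the largest power of four <= x */
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {            /* this bit belongs in the root */
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint16_t)res;
}
```

On a Cortex-M4 this runs in a fixed, data-independent number of iterations, which can matter more than raw speed in some loops.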
On Fri, 06 Jan 2017 09:25:32 +0100, David Brown
<david.brown@hesbynett.no> wrote:

> A 10000 number lookup table will give you the result in a single
> operation. Do you have 20K of flash (or ram) space to spare? I don't
> know your particular chip, but many Cortex devices actually have quite a
> lot of flash space, and while 20K for a square root table might seem
> excessive, it could work. And it would be the fastest possible method.
Of course he'd only need a 10000 byte table, all the results of isqrt(n) for n<10000 fit in a byte.
On 06/01/17 18:08, upsidedown@downunder.com wrote:
> On Fri, 6 Jan 2017 16:50:02 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> On 06/01/17 11:21, upsidedown@downunder.com wrote:
>>> On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
>>> <david.brown@hesbynett.no> wrote:
>>>
>>>>> The discussion about single vs double precision floats is not relevant
>>>>> in this environment.
>>>>
>>>> It is very relevant. An easy way to implement his requirements
>>>> (ignoring range checks) is:
>>>>
>>>> int sqrt_int(int x) {
>>>>     return sqrtf(x);
>>>> }
>>>>
>>>> But that requires the compiler flags that have been discussed, and
>>>> "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>>>> and a couple of conversion instructions.
>>>
>>> The time used to discuss what compiler flags to use could have better
>>> used coding such simple function in assembler :-).
>>
>> No, it could not - it is better to learn how to use your tools (such as
>> the compiler) properly, so that you can re-use the information to
>> improve more code later, rather than to learn a single inconvenient
>> work-around.
>>
>> Understanding the point of -ffast-math and optimisation flags, and about
>> float vs. double, will improve /all/ floating point code written by
>> /everyone/ who reads this thread and learned from it. A simple inline
>> assembly function (which is, in fact, slightly fiddly if you want to get
>> the best from it) would merely help one person in one single situation.
>
> I did not refer to in-line assembly in particular, but also to
> separately compiled assembler code. Of course, you need to know the
> parameter passing mechanism to separately compiled/assembled units and,
> for functions, how to return results.
Separately compiled assembly usually involves quite a bit more messing around than inline assembly. You have to handle a variety of assembler directives to get it to work. You need to understand more, in order to follow the calling convention properly (including preserving registers if and only if necessary). And the result is a lot less efficient for short sections of code, because of the function call overhead. Inline assembly is shorter, simpler, and can work together with the optimiser - though you still need to know what you are doing.
> In most languages, parameters are
> passed on the stack, some push the first parameter first, while some
> push the last parameter last.
The calling convention varies a lot between platforms. On targets with plenty of registers, parameters are mostly passed in registers, with the stack used when there are too many parameters, or for things like structs.
>> Give a man a fish, and all that.
>
> Understanding the parameter passing mechanism in general is a useful
> skill.
It /can/ be a useful skill, yes. But there is no such thing as "parameter passing mechanism in general", as it varies between platforms. And the skill is not nearly as helpful as the skill of understanding how to use your compiler.

Learning some of the basics of how compiler flags work and how compiler optimisation works is /vastly/ more useful than trying to write your own assembly functions. It is also much easier, much safer (if you get it wrong you are likely to get slower code, but still equally correct code), much more portable, and leads to writing clearer and more maintainable code instead of cryptic assembly.

(I strongly advocate learning to /read/ assembly on the platforms you work with, in order to understand the results of your compilation. But for most developers, if you are writing more assembly than a very occasional single-line inline assembly function, you should probably re-consider your development strategy.)
On 08/01/17 06:03, Robert Wessel wrote:
> On Fri, 06 Jan 2017 09:25:32 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> A 10000 number lookup table will give you the result in a single
>> operation. Do you have 20K of flash (or ram) space to spare? I don't
>> know your particular chip, but many Cortex devices actually have quite a
>> lot of flash space, and while 20K for a square root table might seem
>> excessive, it could work. And it would be the fastest possible method.
>
> Of course he'd only need a 10000 byte table, all the results of
> isqrt(n) for n<10000 fit in a byte.
True.
On 07/01/17 11:29, Dimiter_Popoff wrote:
> On 07.01.2017 02:52, jladasky@itu.edu wrote:
>> .....
>>
>> After reading everyone's input, I'm starting to think that the code
>> which would be produced by -ffast-math will be a pretty good start.
>>
>> I am giving serious consideration to David Brown's suggestion of a
>> 10,000 entry lookup table. I have 512K of flash, and I only need to
>> store int16's.
>
> At 512k flash this is a no brainer of course.
It is /almost/ a no-brainer. There might be other considerations, such as a slow over-the-air firmware update. And even if this sounds like a simple answer, it's good that he is giving the different ideas "serious consideration". Some of the ideas and points raised in this thread could be a useful learning experience for the future.