EmbeddedRelated.com
Forums

How would my SoC calculate an integer square root?

Started by Unknown January 4, 2017
On Friday, January 6, 2017 at 1:40:19 PM UTC-8, Tim Wescott wrote:

> I notice, in our arguing about what you said, that some folks are
> thinking you want to get an integer _out_ of your algorithm -- is this
> true?
That's correct. Rounding or even truncating is OK. These square roots are being fed to an algorithm which needs the rank order of the results to be preserved, but it's OK if the exact distribution is a little distorted.
> I'm not sure if it has much of a bearing on things, because I suspect
> that using the processor's floating point math is going to be the
> quickest, even with conversion to and from float. But I'm curious.
After reading everyone's input, I'm starting to think that the code which would be produced by -ffast-math will be a pretty good start. I am giving serious consideration to David Brown's suggestion of a 10,000 entry lookup table. I have 512K of flash, and I only need to store int16's.
jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
> is an ARM Cortex-M4. My development system is a Linux box running
> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
> as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have
> to explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization
> of the integer square root algorithm, written by Dr. Jack Crenshaw,
> is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled
> if I #include <math.h> in my program, and then call sqrt() with an
> int32 argument. It stands to reason that optimized code, like that
> shown in the blog above, is what should be used. How can I determine
> this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
> clues. The only math-related file that I can find there is called
> tgmath.h, and it just #defines aliases which point back to math.h. I
> am looking for source code.
>
> Thanks for any advice you can share.
I'd measure the floating point version before piling off into a bespoke (custom) square root routine. -- Les Cargill
Op 07-Jan-17 om 01:52 schreef jladasky@itu.edu:
> On Friday, January 6, 2017 at 1:40:19 PM UTC-8, Tim Wescott wrote:
>
>> I notice, in our arguing about what you said, that some folks are
>> thinking you want to get an integer _out_ of your algorithm -- is this
>> true?
>
> That's correct. Rounding or even truncating is OK. These square roots
> are being fed to an algorithm which needs the rank order of the results
> to be preserved, but it's OK if the exact distribution is a little
> distorted.
>
>> I'm not sure if it has much of a bearing on things, because I suspect
>> that using the processor's floating point math is going to be the
>> quickest, even with conversion to and from float. But I'm curious.
>
> After reading everyone's input, I'm starting to think that the code
> which would be produced by -ffast-math will be a pretty good start.
>
> I am giving serious consideration to David Brown's suggestion of a
> 10,000 entry lookup table. I have 512K of flash, and I only need to
> store int16's.
The sqrt of 10000 is only 100, so you'd need to store only unsigned chars. Wouter "Objects? No Thanks!" van Ooijen
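The byte-sized table suggested above can be sketched as follows. In a real build the table would be a `const` array generated offline and placed in flash; the init loop here is only to keep the example self-contained, and all names are illustrative.

```c
/* Sketch: 10,000-entry lookup table for isqrt(n), n < 10000.
 * Every result is <= 99, so one byte per entry is enough (10 KB total). */
#include <stdint.h>

#define ISQRT_TABLE_SIZE 10000u

static uint8_t isqrt_table[ISQRT_TABLE_SIZE]; /* const, in flash, in practice */

void isqrt_table_init(void)
{
    uint32_t r = 0;
    for (uint32_t n = 0; n < ISQRT_TABLE_SIZE; n++) {
        if ((r + 1) * (r + 1) <= n)  /* advance the root when we pass a square */
            r++;
        isqrt_table[n] = (uint8_t)r;
    }
}

static inline uint8_t isqrt_lut(uint16_t n) /* caller must ensure n < 10000 */
{
    return isqrt_table[n];
}
```

Lookup is then a single indexed load, which is as fast as this can get.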
On 07.01.2017 02:52, jladasky@itu.edu wrote:
> .....
>
> After reading everyone's input, I'm starting to think that the code
> which would be produced by -ffast-math will be a pretty good start.
>
> I am giving serious consideration to David Brown's suggestion of a
> 10,000 entry lookup table. I have 512K of flash, and I only need to
> store int16's.
At 512k flash this is a no brainer of course. Dimiter
On 07/01/2017 02:29, Les Cargill wrote:
> jladasky@itu.edu wrote:
>> Hello everyone,
>>
>> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
>> is an ARM Cortex-M4. My development system is a Linux box running
>> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
>> as recommended by Nordic.
>>
>> I have written a fair amount of code for this device, but now I have
>> to explore a bit deeper. I have an algorithm that may require me to
>> calculate a lot of square roots, and quickly. However, the numbers in
>> question are all small integers.
>>
>> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
>> involved in this square root calculation. Optimized algorithms for
>> various math functions exist. A nice discussion of the optimization
>> of the integer square root algorithm, written by Dr. Jack Crenshaw, is
>> found here:
>>
>> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>>
>> I am interested in understanding exactly what code will be compiled if
>> I #include <math.h> in my program, and then call sqrt() with an int32
>> argument. It stands to reason that optimized code, like that shown in
>> the blog above, is what should be used. How can I determine this?
>>
>> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
>> clues. The only math-related file that I can find there is called
>> tgmath.h, and it just #defines aliases which point back to math.h. I
>> am looking for source code.
>>
>> Thanks for any advice you can share.
>
> I'd measure the floating point version before piling off into a
> bespoke ( custom ) square root routine.
ARM documentation suggests this takes 14 or 15 clock cycles. There is then the conversion from integer to floating point.
--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
On 04/01/2017 23:27, jladasky@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
> is an ARM Cortex-M4. My development system is a Linux box running
> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
> as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have
> to explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers
> in question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization
> of the integer square root algorithm, written by Dr. Jack Crenshaw,
> is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled
> if I #include <math.h> in my program, and then call sqrt() with an
> int32 argument. It stands to reason that optimized code, like that
> shown in the blog above, is what should be used. How can I determine
> this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for
> clues. The only math-related file that I can find there is called
> tgmath.h, and it just #defines aliases which point back to math.h. I
> am looking for source code.
>
> Thanks for any advice you can share.
For me the lookup table is the best and simplest, especially on a device with 512kB of ROM. Is your program that big?

Given the answer is always going to be of byte length, for the square root of a number up to 10,000 you would only need 10kB of table.

An alternative is a variation of:

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Decimal_.28base_10.29

except you would work to base 256 and have a lookup table of just 256 bytes to find the sqrt of 0 to 255.
--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
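The digit-by-digit method suggested above also has a table-free base-2 special case, described on the same Wikipedia page. A minimal sketch (the classic bit-by-bit variant, not the base-256 one proposed in the post):

```c
/* Sketch: bit-by-bit (base-2) digit-by-digit integer square root.
 * No table, no FPU: one conditional subtract per result bit. */
#include <stdint.h>

uint16_t isqrt_bits(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30; /* 4^15, the largest power of four in 32 bits */

    while (bit > x)          /* shrink to the largest power of four <= x */
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {            /* this bit belongs in the root */
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint16_t)res;
}
```

On a Cortex-M4 this runs in a fixed, data-independent number of iterations, which can matter more than raw speed in some loops.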
On Fri, 06 Jan 2017 09:25:32 +0100, David Brown
<david.brown@hesbynett.no> wrote:

> A 10000 number lookup table will give you the result in a single
> operation. Do you have 20K of flash (or ram) space to spare? I don't
> know your particular chip, but many Cortex devices actually have quite a
> lot of flash space, and while 20K for a square root table might seem
> excessive, it could work. And it would be the fastest possible method.
Of course he'd only need a 10000 byte table, all the results of isqrt(n) for n<10000 fit in a byte.
On 06/01/17 18:08, upsidedown@downunder.com wrote:
> On Fri, 6 Jan 2017 16:50:02 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> On 06/01/17 11:21, upsidedown@downunder.com wrote:
>>> On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
>>> <david.brown@hesbynett.no> wrote:
>>>
>>>>> The discussion about single vs double precision floats is not relevant
>>>>> in this environment.
>>>>
>>>> It is very relevant. An easy way to implement his requirements
>>>> (ignoring range checks) is:
>>>>
>>>> int sqrt_int(int x) {
>>>>     return sqrtf(x);
>>>> }
>>>>
>>>> But that requires the compiler flags that have been discussed, and
>>>> "sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>>>> and a couple of conversion instructions.
>>>
>>> The time used to discuss what compiler flags to use could have better
>>> used coding such simple function in assembler :-).
>>
>> No, it could not - it is better to learn how to use your tools (such as
>> the compiler) properly, so that you can re-use the information to
>> improve more code later, rather than to learn a single inconvenient
>> work-around.
>>
>> Understanding the point of -ffast-math and optimisation flags, and about
>> float vs. double, will improve /all/ floating point code written by
>> /everyone/ who reads this thread and learned from it. A simple inline
>> assembly function (which is, in fact, slightly fiddly if you want to get
>> the best from it) would merely help one person in one single situation.
>
> I did not refer to in-line assembly in particular, but also to
> separately compiled assembler code. Of course, you need to know the
> parameter passing mechanism to separately compiled/assembled units and,
> for functions, how to return results.
Separately compiled assembly usually involves quite a bit more messing around than inline assembly. You have to handle a variety of assembler directives to get it to work. You need to understand more, in order to follow the calling convention properly (including preserving registers if and only if necessary). And the result is a lot less efficient for short sections of code, because of the function call overhead. Inline assembly is shorter, simpler, and can work together with the optimiser - though you still need to know what you are doing.
> In most languages, parameters are
> passed on the stack, some push the first parameter first, while some
> push the last parameter last.
The calling convention varies a lot between platforms. On targets with plenty of registers, parameters are mostly passed in registers, with the stack used when there are too many parameters, or for things like structs.
>> Give a man a fish, and all that.
>
> Understanding the parameter passing mechanism in general is a useful
> skill.
It /can/ be a useful skill, yes. But there is no such thing as "parameter passing mechanism in general", as it varies between platforms. And the skill is not nearly as helpful as the skill of understanding how to use your compiler.

Learning some of the basics of how compiler flags work and how compiler optimisation works is /vastly/ more useful than trying to write your own assembly functions. It is also much easier, much safer (if you get it wrong you are likely to get slower code, but still equally correct code), much more portable, and leads to writing clearer and more maintainable code instead of cryptic assembly.

(I strongly advocate learning to /read/ assembly on the platforms you work with, in order to understand the results of your compilation. But for most developers, if you are writing more assembly than a very occasional single-line inline assembly function, you should probably re-consider your development strategy.)
On 08/01/17 06:03, Robert Wessel wrote:
> On Fri, 06 Jan 2017 09:25:32 +0100, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> A 10000 number lookup table will give you the result in a single
>> operation. Do you have 20K of flash (or ram) space to spare? I don't
>> know your particular chip, but many Cortex devices actually have quite a
>> lot of flash space, and while 20K for a square root table might seem
>> excessive, it could work. And it would be the fastest possible method.
>
> Of course he'd only need a 10000 byte table, all the results of
> isqrt(n) for n<10000 fit in a byte.
True.
On 07/01/17 11:29, Dimiter_Popoff wrote:
> On 07.01.2017 02:52, jladasky@itu.edu wrote:
>> .....
>>
>> After reading everyone's input, I'm starting to think that the code
>> which would be produced by -ffast-math will be a pretty good start.
>>
>> I am giving serious consideration to David Brown's suggestion of a
>> 10,000 entry lookup table. I have 512K of flash, and I only need to
>> store int16's.
>
> At 512k flash this is a no brainer of course.
It is /almost/ a no-brainer. There might be other considerations, such as a slow over-the-air firmware update. And even if this sounds like a simple answer, it's good that he is giving the different ideas "serious consideration". Some of the ideas and points raised in this thread could be a useful learning experience for the future.