On 18 Nov 2006 19:02:08 -0800, "steve" <bungalow_steve@yahoo.com> wrote:

>Jonathan Kirwan wrote:
>
>> Steve, do you _know for certain_ that the library tested above from
>> Imagecraft does support all of them? It's been my own experience that
>> the libraries for floating point don't completely support all types
>> and exceptions. Are you sure this is the case here?
>>
>No I am not certain, imagecraft claims IEEE floating point, which means
>it should be compatible with IEEE 754 so that it runs identical to
>IEEE 754 compatible FPU's.

Well, this is the crux of your earlier point to me, isn't it? Can you find out what exactly it _does_ do? This could completely break your point.

>Maybe your MSP430 had the HW multiply?

It wouldn't matter if it did. I wrote the assembly code myself and I didn't use HW multiplies to aid the generalized input/output floating point division routine -- and I'm not entirely sure just now how I might. Can you suggest a reason why this question may be germane?

Jon
Math computing time statistics for ARM7TDMI and MSP430
Started by ●November 17, 2006
Reply by ●November 19, 2006
Reply by ●November 19, 2006
Jonathan Kirwan wrote:

> Steve, do you _know for certain_ that the library tested above from
> Imagecraft does support all of them? It's been my own experience that
> the libraries for floating point don't completely support all types
> and exceptions. Are you sure this is the case here?
>
> In the example I was testing out, I was examining just one compiler
> library routine to mimic its behavior. I think I captured all the
> elements there, but it's probable that the compiler itself has
> advanced in the two intervening years and it wasn't Imagecraft's
> anyway, so your point may remain a good one to keep in mind.
>
> I believe I wouldn't need another 200 cycles, though, to achieve what
> extra is done in compiler libraries. I'd be very interested in
> finishing it up, though, so as to exactly match the features of the
> Imagecraft routine you tested with, if provided with a complete
> implementation of their 32-bit fp divide for the MSP430 so that I
> could personally guarantee that I've met the goal. Not that this
> would prove anything much, except that more time given to informed
> effort is better than less time. Still, I'd do it for the fun of
> trying.

Jonathan, the guy who wrote the code says it takes ~550 cycles on average. Does your stuff do the guard bit etc.? The code is MSP430 specific and not stale code from other CPUs...

> Jon
Reply by ●November 19, 2006
On Sun, 19 Nov 2006 04:26:21 -0800, Richard <richard@imagecraft.com> wrote:

>Jonathan Kirwan wrote:
>>
>> Steve, do you _know for certain_ that the library tested above from
>> Imagecraft does support all of them? It's been my own experience that
>> the libraries for floating point don't completely support all types
>> and exceptions. Are you sure this is the case here?
>>
>> In the example I was testing out, I was examining just one compiler
>> library routine to mimic its behavior. I think I captured all the
>> elements there, but it's probable that the compiler itself has
>> advanced in the two intervening years and it wasn't Imagecraft's
>> anyway, so your point may remain a good one to keep in mind.
>>
>> I believe I wouldn't need another 200 cycles, though, to achieve what
>> extra is done in compiler libraries. I'd be very interested in
>> finishing it up, though, so as to exactly match the features of the
>> Imagecraft routine you tested with, if provided with a complete
>> implementation of their 32-bit fp divide for the MSP430 so that I
>> could personally guarantee that I've met the goal. Not that this
>> would prove anything much, except that more time given to informed
>> effort is better than less time. Still, I'd do it for the fun of
>> trying.
>>
>
>Jonathan, the guy who wrote the code says it takes ~550 cycles on
>average. Does your stuff do the guard bit etc.? The code is MSP430
>specific and not stale code from other CPUs...

Yes. I keep more bits for rounding, if that's what you mean. I suppose I can send the code to you, if you are interested in playing with it -- I've no proprietary interest in it. It handles the standard floating point codes found on the IBM PC. Sign, signed exponent, and hidden bit notation. It does NOT handle denormals (not hard to add) or special codes such as infinities or not-a-numbers. Rounding is only in the usual method; I didn't access a static status bit of any kind to control rounding.
When I last played with this in April 2004, I shaved about 100 cycles off of a compiler's version mostly, I think, because the compiler's library code used a division loop that used a count of 32 when the loop only really needed 24 for a 48/24 divide to produce a 24r24 result (the remainder is used for the rounding.) The central division part of the code, that part that actually takes the unpacked values and divides them, takes from 264 to 312 cycles, with 302 being typical.

Jon
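To make the 24-step loop concrete, here is a minimal sketch in Python rather than MSP430 assembly. It is not Jon's code: the names `divide_mantissas` and `round_nearest_even` are hypothetical, the mantissas are assumed to be 24-bit values with the hidden bit already restored, and "the usual method" of rounding is assumed to mean round-to-nearest-even.

```python
def divide_mantissas(a, b):
    """48/24 restoring divide of two normalized 24-bit mantissas.

    Returns (quotient, remainder, exponent_adjust). Only 24 loop
    iterations are needed because the quotient has exactly 24 bits.
    """
    assert (1 << 23) <= a < (1 << 24) and (1 << 23) <= b < (1 << 24)
    if a >= b:
        dividend, exp_adjust = a << 23, 0    # quotient lands in [2^23, 2^24)
    else:
        dividend, exp_adjust = a << 24, -1   # pre-shift so the top bit is set
    rem = dividend >> 24                     # high half, always < b here
    low = dividend & 0xFFFFFF                # low half, fed in bit by bit
    quo = 0
    for _ in range(24):
        rem = (rem << 1) | (low >> 23)       # bring down the next dividend bit
        low = (low << 1) & 0xFFFFFF
        quo <<= 1
        if rem >= b:                         # trial subtract succeeds
            rem -= b
            quo |= 1
    return quo, rem, exp_adjust


def round_nearest_even(quo, rem, b):
    """Round the quotient using the exact remainder (assumed rounding mode)."""
    twice = rem << 1
    if twice > b or (twice == b and (quo & 1)):
        quo += 1
    return quo


# 1.0 / 1.5: mantissas 0x800000 and 0xC00000
q, r, adj = divide_mantissas(0x800000, 0xC00000)
print(hex(round_nearest_even(q, r, 0xC00000)), adj)   # 0xaaaaab -1
```

A loop count of 32 would compute the same quotient after extra shifting; the point of the sketch is only that 24 iterations already yield every bit the result needs, plus a remainder that decides the rounding.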
Reply by ●November 20, 2006
On Sun, 19 Nov 2006 16:32:57 GMT, Jonathan Kirwan <jkirwan@easystreet.com> wrote:

>On Sun, 19 Nov 2006 04:26:21 -0800, Richard <richard@imagecraft.com>
>wrote:
>
>>Jonathan Kirwan wrote:
>>>
>>> Steve, do you _know for certain_ that the library tested above from
>>> Imagecraft does support all of them? It's been my own experience that
>>> the libraries for floating point don't completely support all types
>>> and exceptions. Are you sure this is the case here?
>>>
>>> In the example I was testing out, I was examining just one compiler
>>> library routine to mimic its behavior. I think I captured all the
>>> elements there, but it's probable that the compiler itself has
>>> advanced in the two intervening years and it wasn't Imagecraft's
>>> anyway, so your point may remain a good one to keep in mind.
>>>
>>> I believe I wouldn't need another 200 cycles, though, to achieve what
>>> extra is done in compiler libraries. I'd be very interested in
>>> finishing it up, though, so as to exactly match the features of the
>>> Imagecraft routine you tested with, if provided with a complete
>>> implementation of their 32-bit fp divide for the MSP430 so that I
>>> could personally guarantee that I've met the goal. Not that this
>>> would prove anything much, except that more time given to informed
>>> effort is better than less time. Still, I'd do it for the fun of
>>> trying.
>>>
>>
>>Jonathan, the guy who wrote the code says it takes ~550 cycles on
>>average. Does your stuff do the guard bit etc.? The code is MSP430
>>specific and not stale code from other CPUs...
>
>Yes. I keep more bits for rounding, if that's what you mean. I
>suppose I can send the code to you, if you are interested in playing
>with it -- I've no proprietary interest in it. It handles the
>standard floating point codes found on the IBM PC. Sign, signed
>exponent, and hidden bit notation. It does NOT handle denormals (not
>hard to add) or special codes such as infinities or not-a-numbers.
>Rounding is only in the usual method; I didn't access a static status
>bit of any kind to control rounding.
>
>When I last played with this in April 2004, I shaved about 100 cycles
>off of a compiler's version mostly, I think, because the compiler's
>library code used a division loop that used a count of 32 when the
>loop only really needed 24 for a 48/24 divide to produce a 24r24
>result (the remainder is used for the rounding.) The central division
>part of the code, that part that actually takes the unpacked values
>and divides them, takes from 264 to 312 cycles, with 302 being
>typical.

I should add that I can shave still more time off of this. There are at least two reasons that come to mind:

(1) I had just been playing with a fast division method I've devised, and was mostly playing with that idea back in 2004 without really thinking about how this would play into a floating point routine designed entirely for speed. In my general purpose divide routine, I produce way more bits than I actually need for the result. I produce a perfect remainder, which allows an exact determination of rounding. However, an exact remainder isn't needed as it contains more information than is strictly needed for rounding. If I modified the division routine so as not to have to produce an accurate remainder, the loops can take fewer cycles -- multiplied by 24, this can account for some real time.

(2) I did NOT use some of my non-restoring methods (what amounts to loop-unrolling of sorts) for division, which would yield about 18% savings on the code I did apply back then. The reason I didn't is that I was focused on just 'getting it right' and not so much on adding in longer stretches of unwound code -- easier to debug that way, if need be.

So even the number I originally posted is not by any stretch the best I can do here, on the MSP430 -- if seriously bent to the task. I'd bet I could get perilously close to 300 cycles for the entire thing, even including denormals.
Not that anyone would care that much. But it's probably doable.

Jon
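For readers unfamiliar with the term, the textbook form of non-restoring division can be sketched as follows. This is an assumption about what is meant, not Jon's unrolled MSP430 routine; `nonrestoring_divide` is a hypothetical name and Python is used only for illustration. Each step does exactly one add or one subtract, chosen by the sign of the running remainder, so there is no separate "restore" operation inside the loop.

```python
def nonrestoring_divide(dividend, divisor, bits=24):
    """Textbook non-restoring division (illustrative sketch).

    Requires (dividend >> bits) < divisor so the quotient fits in `bits` bits.
    Returns (quotient, remainder).
    """
    assert 0 <= (dividend >> bits) < divisor
    rem = dividend >> bits
    quo = 0
    for i in range(bits):
        next_bit = (dividend >> (bits - 1 - i)) & 1
        if rem >= 0:
            rem = 2 * rem + next_bit - divisor   # remainder non-negative: subtract
        else:
            rem = 2 * rem + next_bit + divisor   # remainder negative: add back instead
        quo = (quo << 1) | (1 if rem >= 0 else 0)
    if rem < 0:
        rem += divisor                           # single correction after the loop
    return quo, rem


print(nonrestoring_divide(0x800000 << 24, 0xC00000))   # (11184810, 8388608)
```

The correction after the loop is only there to deliver an exact remainder; the quotient bits are already final when the loop ends.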
Reply by ●November 21, 2006
On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote:

>Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
>simulator
>Obviously, you can't compare exactly without using same source
>Figures measured in a subroutine after values have been loaded into
>registers.
>Only tested one set of data, so I can't say if it is typical or not.
>
>add 173
>sub 176
>mul 175
>div 694
>sqrt 2586
>log 3255

Ulf, is that figure for div on the IAR compiler right??? On the ARM7/Thumb I've looked at before, there was no direct integer DIV instruction, unlike the case with multiply, so it makes sense to take a while longer than multiply does. But this long?? Since Steve was writing about 32-bit floats, I assume that is what you were doing too, but is it possible that they were promoted to doubles?

On the MSP430, I was stuck doing pairs of registers/instructions to handle shifts (a triplet in one step of the loop) and this has _got_ to be better on a 32-bit register chip. Also, the MSP430 can't get much of anything done in a single cycle. Luckily, the core division can be done in registers which are single cycle, but there are conditional jumps in there.

Something sounds wrong to get cycle counts like that. I believe you, it just bugs me. The 32-bit register advantage would have pulled another 48 cycles (24*2) off of the computation in the MSP430, for sure, and in cases where a restore was needed, another 24 cycles -- so the mean (average) would be above 48 and below 72 cycles pulled off -- probably right in the center of that at 60 cycles. That puts it at a projected 340 cycles on the ARM7 right now and I already have ways coded up and fully tested to improve that another 60 cycles, anyway. Which would bring such a thing to the area of 280 cycles on the ARM, as a rough guess -- including overheads.

Jon

P.S. By the way, I just retested a new floating point divide routine on the MSP430, that handles division by zero by returning #INF and deals with #INF on either input parameter, and it runs in an average of 350 cycles right now. This includes the call overhead (5 cycles for the call, 3 cycles for the return.) It correctly rounds without glitch, as it produces a complete fractional remainder that is exactly known.
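The special-case handling mentioned in the P.S. might look roughly like the sketch below, which works on raw 32-bit float bit patterns. Everything here is an assumption for illustration: `fdiv_bits`, `f2b` and `b2f` are hypothetical names, 0/0 and INF/INF are given a quiet NaN as IEEE 754 specifies (the post does not say what Jon's routine returns), denormals are flushed to zero, and Python's `divmod` stands in for the hand-written division loop.

```python
import struct

INF_BITS = 0x7F800000
NAN_BITS = 0x7FC00000          # quiet NaN (assumed result for 0/0 and INF/INF)


def f2b(x):
    """Float -> 32-bit pattern (helper for experiments)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def b2f(bits):
    """32-bit pattern -> float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fdiv_bits(x, y):
    """Divide two single-precision bit patterns, handling zero and INF."""
    sign = (x ^ y) & 0x80000000
    ex, ey = (x >> 23) & 0xFF, (y >> 23) & 0xFF
    mx, my = x & 0x7FFFFF, y & 0x7FFFFF
    if (ex == 0xFF and mx) or (ey == 0xFF and my):
        return NAN_BITS                       # NaN in, NaN out
    x_inf, y_inf = ex == 0xFF, ey == 0xFF
    x_zero, y_zero = ex == 0, ey == 0         # denormals treated as zero (assumption)
    if (x_inf and y_inf) or (x_zero and y_zero):
        return NAN_BITS
    if x_inf or y_zero:
        return sign | INF_BITS                # includes division by zero
    if y_inf or x_zero:
        return sign                           # signed zero
    a, b = mx | 0x800000, my | 0x800000       # restore hidden bits
    exp = ex - ey + 127
    if a < b:                                 # keep the quotient normalized
        a <<= 1
        exp -= 1
    quo, rem = divmod(a << 23, b)             # stand-in for the 24-step loop
    if 2 * rem > b or (2 * rem == b and quo & 1):
        quo += 1                              # round to nearest even
        if quo == 1 << 24:
            quo >>= 1
            exp += 1
    if exp >= 0xFF:
        return sign | INF_BITS                # overflow
    if exp <= 0:
        return sign                           # underflow flushed to zero
    return sign | (exp << 23) | (quo & 0x7FFFFF)


print(hex(fdiv_bits(f2b(1.0), f2b(0.0))))     # 0x7f800000
```

Checking the special cases first keeps them out of the normal path, which is presumably why adding them costs only a modest number of cycles.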
Reply by ●November 21, 2006
Jonathan Kirwan wrote:

> On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson"
> <ulf@a-t-m-e-l.com> wrote:
>
> >Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
> >simulator
> >Obviously, you can't compare exactly without using same source
> >Figures measured in a subroutine after values have been loaded into
> >registers.
> >Only tested one set of data, so I can't say if it is typical or not.
> >
> >add 173
> >sub 176
> >mul 175
> >div 694
> >sqrt 2586
> >log 3255
>
> Ulf, is that figure for div on the IAR compiler right??? On the
> ARM7/Thumb I've looked at before, there was no direct integer DIV
> instruction, unlike the case with multiply, so it makes sense to take
> a while longer than multiply does. But this long?? Since Steve was
> writing about 32-bit floats, I assume that is what you were doing too,
> but is it possible that they were promoted to doubles?
>
> On the MSP430, I was stuck doing pairs of registers/instructions to
> handle shifts (a triplet in one step of the loop) and this has _got_
> to be better on a 32-bit register chip.

Ulf posted AVR 32 bit floating point cycles, AVR has 8 bit registers...
Reply by ●November 21, 2006
On 21 Nov 2006 15:57:36 -0800, "steve" <bungalow_steve@yahoo.com> wrote:

>Jonathan Kirwan wrote:
>> On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson"
>> <ulf@a-t-m-e-l.com> wrote:
>>
>> >Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
>> >simulator
>> >Obviously, you can't compare exactly without using same source
>> >Figures measured in a subroutine after values have been loaded into
>> >registers.
>> >Only tested one set of data, so I can't say if it is typical or not.
>> >
>> >add 173
>> >sub 176
>> >mul 175
>> >div 694
>> >sqrt 2586
>> >log 3255
>>
>> Ulf, is that figure for div on the IAR compiler right??? On the
>> ARM7/Thumb I've looked at before, there was no direct integer DIV
>> instruction, unlike the case with multiply, so it makes sense to take
>> a while longer than multiply does. But this long?? Since Steve was
>> writing about 32-bit floats, I assume that is what you were doing too,
>> but is it possible that they were promoted to doubles?
>>
>> On the MSP430, I was stuck doing pairs of registers/instructions to
>> handle shifts (a triplet in one step of the loop) and this has _got_
>> to be better on a 32-bit register chip.
>
>Ulf posted AVR 32 bit floating point cycles, AVR has 8 bit registers...

Boy! Was my mind out of touch!! I read "AVR" and thought "ARM". Thanks for that, Steve. Sometimes, I need a kick in the head.

Jon
Reply by ●November 21, 2006
>> On the MSP430, I was stuck doing pairs of registers/instructions to
>> handle shifts (a triplet in one step of the loop) and this has _got_
>> to be better on a 32-bit register chip.
>
> Ulf posted AVR 32 bit floating point cycles, AVR has 8 bit
> registers...

It is a nice compliment that people find the AVR so fast that their spine believes it is a 32 bitter.

--
Best Regards,
Ulf Samuelsson
ulf@a-t-m-e-l.com
This message is intended to be my own personal view and it may or may not be shared by my employer Atmel Nordic AB
Reply by ●November 21, 2006
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:3ia4m219cuj2p7p33vsn8grf76b6d3uvqf@4ax.com...

> On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson"
> <ulf@a-t-m-e-l.com> wrote:
>
>>Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
...
> Ulf, is that figure for div on the IAR compiler right???

The figures were for 8-bit AVR, not ARM. Note also that IAR's FP libs are unoptimized C (at least on ARM, I assume the same code is used on AVR). Optimized FP libraries are usually handcrafted assembler.

> Something sounds wrong to get cycle counts like that. I believe you,
> it just bugs me. The 32-bit register advantage would have pulled
> another 48 cycles (24*2) off of the computation in the MSP430, for
> sure, and in cases where a restore was needed, another 24 cycles -- so
> the mean (average) would be above 48 and below 72 cycles pulled off --
> probably right in the center of that at 60 cycles. That puts it at a
> projected 340 cycles on the ARM7 right now and I already have ways
> coded up and fully tested to improve that another 60 cycles, anyway.
> Which would bring such a thing to the area of 280 cycles on the ARM,
> as a rough guess -- including overheads.

280??? I do it in about 70. It uses a tiny lookup table to create an 8-bit reciprocal estimate which is used in 3 long division steps. This turned out to be simpler and faster than Newton-Raphson as it only uses 32-bit multiplies (which are faster than 64-bit muls on most ARMs). The result is that FP emulation on ARM is extremely fast - in fact hardware FP (e.g. ARM11) is only around 5-6 times faster in FP benchmarks even though it can issue 1 FP operation per cycle...

Note that integer division and floating point division are completely different things - a standard 3-cycle per bit integer division takes about 120 cycles in the worst case when unrolled (although it takes just 30 on average). When you have to produce a certain minimum number of result bits, methods that produce many result bits in a single step become faster. Proving it is correct is a little more involved due to the many approximation steps though :-)

Wilco
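One way to read the reciprocal-table idea is sketched below. It is a guess at the general shape, not Wilco's ARM code: the table layout, the name `divide_by_reciprocal`, the choice of three radix-256 steps, and the correction loop are all assumptions made for illustration. The table entry is deliberately rounded down so that each estimated digit is never too large, which lets a short fix-up loop finish the digit.

```python
# 128-entry table indexed by the top 8 bits of a normalized 24-bit divisor.
# Each entry is an 8-bit value (128..254) that never exceeds 2^31 / divisor.
RECIP_TABLE = [(1 << 15) // (top + 1) for top in range(128, 256)]


def divide_by_reciprocal(dividend, b):
    """Long division in three 8-bit steps using a reciprocal estimate.

    Requires 2^23 <= b < 2^24 and dividend < (b << 24).
    Returns (quotient, remainder).
    """
    assert (1 << 23) <= b < (1 << 24) and 0 <= dividend < (b << 24)
    recip = RECIP_TABLE[(b >> 16) - 128]
    quo, rem = 0, dividend
    for shift in (16, 8, 0):
        x = rem >> shift                       # x < 256 * b, fits in 32 bits
        digit = ((x >> 8) * recip) >> 23       # 24-bit x 8-bit multiply, underestimates
        x -= digit * b                         # 8-bit x 24-bit multiply
        while x >= b:                          # fix up the underestimate
            x -= b
            digit += 1
        quo = (quo << 8) | digit
        rem = (x << shift) | (rem & ((1 << shift) - 1))
    return quo, rem


print(divide_by_reciprocal(0x800000 << 24, 0xC00000))   # (11184810, 8388608)
```

Every multiply in the loop has a product that fits in 32 bits, which matches the remark about avoiding 64-bit multiplies. In this sketch the estimate can fall a few units short of the true digit, so the fix-up loop runs a handful of times at most; a production routine would presumably bound or eliminate that step.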
Reply by ●November 21, 2006
On Wed, 22 Nov 2006 01:20:59 +0100, "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote:

>>> On the MSP430, I was stuck doing pairs of registers/instructions to
>>> handle shifts (a triplet in one step of the loop) and this has _got_
>>> to be better on a 32-bit register chip.
>>
>> Ulf posted AVR 32 bit floating point cycles, AVR has 8 bit
>> registers...
>
>It is a nice compliment that people find the AVR so fast that
>their spine believes it is a 32 bitter.

hehe. No, it was entirely my own idiocy. I wouldn't draw too much else from that. ;)

Jon