EmbeddedRelated.com
Forums

Math computing time statistics for ARM7TDMI and MSP430

Started by Tilmann Reh November 17, 2006
On Thu, 23 Nov 2006 00:02:32 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message >news:eid9m2hofhf8q3msqunvs548skqr25gb4s@4ax.com... >> On Wed, 22 Nov 2006 20:21:27 GMT, "Wilco Dijkstra" >> <Wilco_dot_Dijkstra@ntlworld.com> wrote: > >>>You can get rid of the extra overflow check if you do the check >>>in a similar way as I do. >> >> I'd like to see how that works. Perhaps, if you look over my example >> code you can point out where/how this might be done. > >I'll do that. There are various techniques to do this.
Thanks. It is on its way to you.
>>>>>How many cycles does your version take per bit? If it is a lot >>>>>then multiplicative methods typically win. >>>> >>>> On which CPU? It really does depend. >>> >>>On the CPU you were talking about, MSP430 I thought. >> >> Ah. Execution time of the entire code is 340-350 cycles, or so. The >> core division routine, the one that does the central work here, takes >> about 240-250 cycles of that. So the division itself is about 10 >> MSP430 cycles per bit. I'd very much enjoy seeing a better version to >> learn from. > >I'll have a look at the inner loop and see whether it can be improved. >At 10 cycles per bit multiplicative methods should be faster. On ARM >I went from 2 cycles per bit for fixed point division to 1 cycle per bit >for long division. If you can get a decent 8x24-bit multiply on the MSP >it should be possible to get it down to 150 cycles.
Only _some_ of the MSP430's include a multiplier. So you can't count on it. And there is a penalty. You write to locations and the sequence must not be interrupted. There is no way to observe and restore the state of it. So if you use it, you need to protect against pre-emption to another process or routine that may also attempt to use it. Not exactly a good thing.
>>>However >>>an older version of my division was published in "ARM System >>>Developer's Guide - Designing and Optimizing System Software". They >>>even did mention me. You can get an evaluation version (RealView or >>>Keil) from http://www.arm.com/products/DevTools/RVDSEvalCD.html >>>These tools include the compiler I worked on for 10 years and all >>>my highly optimised library routines, including memcpy/memmove, >>>strcmp, FP libraries etc. >> >> Could you please simplify how I might directly access the routine? I'm >> not interested in some long investigation/installation and trapsing >> around through obscurity just to get the code itself. Can you help >> make this any simpler? If it is published, would you be willing to >> just copy it into an email and send it to me?? I'd appreciate that. > >I've sent you some disassembly. The source code even if published >is still under copyright. Luckily you can disassemble anything you like >to learn from it. I've done that a lot with competitor's compilers and >libaries when I was doing compiler stuff.
Thanks, I'll see about looking it over when I've a moment to do so. I've two things to do, then. One is perusing the ARM docs at the same time as reading the code. Will be fun. But anything for a chance to learn something new. ;) Jon
Wilco Dijkstra wrote:

> 280??? I do it in about 70. It uses a tiny lookup table to create an 8-bit > reciprocal estimate which is used in 3 long division steps. This turned > out to be simpler and faster than Newton-Rhapson as it only uses > 32-bit multiplies (which are faster than 64-bit muls on most ARMs). > > The result is that FP emulation on ARM is extremely fast - in fact > hardware FP (eg. ARM11) is only around 5-6 times faster in FP > benchmarks eventhough it can issue 1 FP operation per cycle... >
???? Doesn't make any sense, 1 cycle HW FP operation is 5-6 times faster then a 70-280 cycle emulated FP? I suspect those benchmarks had a few math operations surrounded by tons of logic. Run a ARM11 aor ARM9(with hardware FP) FFT and a ARM7 FFT, both floating point, and look at the timing differences, probably 100 to 1 difference. steve
"steve" <bungalow_steve@yahoo.com> wrote in message 
news:1164676578.913240.27830@j72g2000cwa.googlegroups.com...
> > Wilco Dijkstra wrote: > >> 280??? I do it in about 70. It uses a tiny lookup table to create an >> 8-bit >> reciprocal estimate which is used in 3 long division steps. This turned >> out to be simpler and faster than Newton-Rhapson as it only uses >> 32-bit multiplies (which are faster than 64-bit muls on most ARMs). >> >> The result is that FP emulation on ARM is extremely fast - in fact >> hardware FP (eg. ARM11) is only around 5-6 times faster in FP >> benchmarks eventhough it can issue 1 FP operation per cycle... >> > ???? Doesn't make any sense, 1 cycle HW FP operation is 5-6 times > faster then a 70-280 cycle emulated FP?
It's max 70 cycle for division. Operations like addition and multiplication are much faster. Float addition can finish in as little as 21 cycles, special cases such as x+0 are even quicker.
>I suspect those benchmarks had > a few math operations surrounded by tons of logic. Run a ARM11 aor > ARM9(with hardware FP) FFT and a ARM7 FFT, both floating point, and > look at the timing differences, probably 100 to 1 difference.
If there is a lot of integer logic, there is no speedup at all. I'm talking about pure floating point code, including FFTs. Such code has a large number of load and store operations. Additionally there are dependencies between FP operations resulting in interlocks (FP ops have high latencies). With no loads/stores and no interlocks, the maximum theoretical speedup is 25x for fadd/fsub, 50x for fmac and 5x for division. Actual speed ups are 5-6x on ARM11. Amdahl's law applies... Wilco
On Sat, 18 Nov 2006 21:55:14 GMT, I wrote:

>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com> >wrote: > >>MPS430, 32 bit floats, imagecraft complier, typical cycles >>add 158 >>sub 184 >>mul 332 >>div 620 > >Back in 2004, I wanted to play with writing a floating point routine >for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by >32-bit with 32-bit result floating point divide takes roughly 400-435 >cycles on the MSP430. This is substantially less than the 620 cycles >mentioned above. ><snip>
Thanks to Wilco's kicking my butt about a couple of issues in my code, the code I've written up is now faster on the MSP430 -- a mean of 280 cycles (with no material variation) for 32-bit float division. It supports INF for divide by zero and INF propagation but lacks denormal support. However, I just did up a short project in TI/IAR's kickstart and the _Div32f function called appears to produce its results in a mean of about 362 cycles, not the 620 mentioned by Steve. To compute that, I did 81 float divisions (reaches the 4k limit) and removed the register load overhead so that only the call itself is considered. This made me curious and it turns out that the library there uses a restoring division version, but one that also uses a tight 'inner' loop that doesn't do a trial subtraction unless there is a reason to do so (leading bit in the remainder [divisor is normalized so it's a given that it's leading bit is 1]) and doesn't use a loop counter, looking instead for a shift of a '1' out of the quotient (which must occur in almost every case, and the rare other cases are tested for.) This brings their cycles per bit in the low level division part of the code, when streams of quotient '0's are being produced, to 8 cycles. More, though, when '1's are being produced. Which is what brings up their average. The code I use in the low level division uses almost exactly 7 cycles per bit (just under) and is consistent regardless of '1' or '0' being produced in the quotient. There are some interesting approaches to know about. Jon
Jonathan Kirwan wrote:
> On Sat, 18 Nov 2006 21:55:14 GMT, I wrote: > > >>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com> >>wrote: >> >> >>>MPS430, 32 bit floats, imagecraft complier, typical cycles >>>add 158 >>>sub 184 >>>mul 332 >>>div 620 >> >>Back in 2004, I wanted to play with writing a floating point routine >>for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by >>32-bit with 32-bit result floating point divide takes roughly 400-435 >>cycles on the MSP430. This is substantially less than the 620 cycles >>mentioned above. >><snip> > > > Thanks to Wilco's kicking my butt about a couple of issues in my code, > the code I've written up is now faster on the MSP430 -- a mean of 280 > cycles (with no material variation) for 32-bit float division. It > supports INF for divide by zero and INF propagation but lacks denormal > support. > > However, I just did up a short project in TI/IAR's kickstart and the > _Div32f function called appears to produce its results in a mean of > about 362 cycles, not the 620 mentioned by Steve. To compute that, I > did 81 float divisions (reaches the 4k limit) and removed the register > load overhead so that only the call itself is considered. > > This made me curious and it turns out that the library there uses a > restoring division version, but one that also uses a tight 'inner' > loop that doesn't do a trial subtraction unless there is a reason to > do so (leading bit in the remainder [divisor is normalized so it's a > given that it's leading bit is 1]) and doesn't use a loop counter, > looking instead for a shift of a '1' out of the quotient (which must > occur in almost every case, and the rare other cases are tested for.) > This brings their cycles per bit in the low level division part of the > code, when streams of quotient '0's are being produced, to 8 cycles. > More, though, when '1's are being produced. Which is what brings up > their average. The code I use in the low level division uses almost > exactly 7 cycles per bit (just under) and is consistent regardless of > '1' or '0' being produced in the quotient. > > There are some interesting approaches to know about.
Sounds clever. Did you see the notes here ? : http://www.maxim-ic.com/appnotes.cfm/appnote_number/3593 http://pdfserv.maxim-ic.com/en/ds/MAXQ7653-56_Prelim.pdf These are for the MAXQ, which claims 16x16 Mult but the maths numbers seem broadly similar to MSP430, AVR ( within a speed-grade-iteration ) Maxim's MHz and mA/MHz numbers also are all over the place, almost like different teams are working on these devices ? I think they give sources, so you could compare your Div32f with the ones at Maxim's ? -jg
On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote: >> On Sat, 18 Nov 2006 21:55:14 GMT, I wrote: >> >>>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com> >>>wrote: >>> >>> >>>>MPS430, 32 bit floats, imagecraft complier, typical cycles >>>>add 158 >>>>sub 184 >>>>mul 332 >>>>div 620 >>> >>>Back in 2004, I wanted to play with writing a floating point routine >>>for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by >>>32-bit with 32-bit result floating point divide takes roughly 400-435 >>>cycles on the MSP430. This is substantially less than the 620 cycles >>>mentioned above. >>><snip> >> >> >> Thanks to Wilco's kicking my butt about a couple of issues in my code, >> the code I've written up is now faster on the MSP430 -- a mean of 280 >> cycles (with no material variation) for 32-bit float division. It >> supports INF for divide by zero and INF propagation but lacks denormal >> support. >> >> However, I just did up a short project in TI/IAR's kickstart and the >> _Div32f function called appears to produce its results in a mean of >> about 362 cycles, not the 620 mentioned by Steve. To compute that, I >> did 81 float divisions (reaches the 4k limit) and removed the register >> load overhead so that only the call itself is considered. >> >> This made me curious and it turns out that the library there uses a >> restoring division version, but one that also uses a tight 'inner' >> loop that doesn't do a trial subtraction unless there is a reason to >> do so (leading bit in the remainder [divisor is normalized so it's a >> given that it's leading bit is 1]) and doesn't use a loop counter, >> looking instead for a shift of a '1' out of the quotient (which must >> occur in almost every case, and the rare other cases are tested for.) >> This brings their cycles per bit in the low level division part of the >> code, when streams of quotient '0's are being produced, to 8 cycles. >> More, though, when '1's are being produced. Which is what brings up >> their average. The code I use in the low level division uses almost >> exactly 7 cycles per bit (just under) and is consistent regardless of >> '1' or '0' being produced in the quotient. >> >> There are some interesting approaches to know about. > >Sounds clever.
I enjoy these things for reasons I'm not entirely certain of. ;)
>Did you see the notes here ? : > >http://www.maxim-ic.com/appnotes.cfm/appnote_number/3593 >http://pdfserv.maxim-ic.com/en/ds/MAXQ7653-56_Prelim.pdf
Perhaps some earlier versions. I read so much and I remember going through a few things similar to this on the MAXQ a year ago or so?
>These are for the MAXQ, which claims 16x16 Mult but the >maths numbers seem broadly similar to MSP430, AVR >( within a speed-grade-iteration ) > >Maxim's MHz and mA/MHz numbers also are all over >the place, almost like different teams are working >on these devices ? > >I think they give sources, so you could compare your >Div32f with the ones at Maxim's ?
I did a _quick_ look to see if it was easy to find the sources and didn't. Didn't see them easily. But the first page you mention above describes TI's approach of trying to focus on chip performance and to ignore C compiler issues as a "flawed" approach. Frankly, that's what I'm interested in because I usually mix assembly with C in applications I write and therefore it _IS_ valuable to me to know what the chip can do regardless of the C compiler. When I need to drop into assembly, that's what I want to know about. And if a C compiler isn't entirely all that great for a chip, well that only means I may use more assembly than I otherwise might. But probably not that much more. So I really find that kind of argument ... not quite disingenuous, because I can see that it has a valid facet to it ... but not useful for my case, at least. So I'm wondering if the benchmarks are going to be in C and if they don't disclose the actual assembly, anyway. It may be the case that they didn't care to, because of their argument against assembly, anyway. If you know different, let me know and I'll go look. In any case, the MAXQ isn't getting designed into anything I care about right now. Jon
Jonathan Kirwan wrote:
> On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville > > I did a _quick_ look to see if it was easy to find the sources and > didn't. Didn't see them easily.
Bottom of this page, there are 2 Compiler MSP430 examples + Sources. http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm -jg
On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote: >> On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville >> >> I did a _quick_ look to see if it was easy to find the sources and >> didn't. Didn't see them easily. > >Bottom of this page, there are 2 Compiler MSP430 examples + Sources. >http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm
Yeah. I think it's just C code. They depend on the libraries. I've already looked at three of them from three different C compiler vendors for the MSP430. So nothing new to be found, there. For the MAXQ, I'd just have to go to the compilers and see if I could dig out their source code for the library routines (by debugger, if not in source freely available.) But I'm not going to take the time for that. I also note that they wrap their floating point with function calls to functions that just do the one operation and return values. So this adds another layer to the benchmark. On the MSP430, function calls alone (without inlining) cost you 8 cycles (5 in, 3 out.) Did you imagine I'd learn any new detailed techniques from this? /******************************************************************************* * * Name : Floating-point Math * Purpose : Benchmark floating-point math functions. * *******************************************************************************/ float add(float a, float b) { return (a + b); } float mul(float a, float b) { return (a * b); } float div(float a, float b) { return (a / b); } void main(void) { volatile float result[4]; result[0] = 54.567; result[1] = 14346.67; result[2] = add(result[0], result[1]); result[1] = mul(result[0], result[2]); result[3] = div(result[1], result[2]); return; } Jon
Jonathan Kirwan wrote:
> On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville > <no.spam@designtools.maps.co.nz> wrote: > > >>Jonathan Kirwan wrote: >> >>>On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville >>> >>>I did a _quick_ look to see if it was easy to find the sources and >>>didn't. Didn't see them easily. >> >>Bottom of this page, there are 2 Compiler MSP430 examples + Sources. >>http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm > > > Yeah. I think it's just C code. They depend on the libraries. I've > already looked at three of them from three different C compiler > vendors for the MSP430. So nothing new to be found, there. For the > MAXQ, I'd just have to go to the compilers and see if I could dig out > their source code for the library routines (by debugger, if not in > source freely available.) But I'm not going to take the time for > that. I also note that they wrap their floating point with function > calls to functions that just do the one operation and return values. > So this adds another layer to the benchmark. On the MSP430, function > calls alone (without inlining) cost you 8 cycles (5 in, 3 out.) > > Did you imagine I'd learn any new detailed techniques from this?
:) - No, but I did glance into the MAP files, and they seem to identify the sizes of the ADD,MUL,DIV,SUB blocks, and those numbers I figured you would be interested in. Plus you can clone the code below to call your Libs and then compare the cycles and bytes. -jg
> > /******************************************************************************* > * > * Name : Floating-point Math > * Purpose : Benchmark floating-point math functions. > * > *******************************************************************************/ > float add(float a, float b) > { > return (a + b); > } > float mul(float a, float b) > { > return (a * b); > } > float div(float a, float b) > { > return (a / b); > } > void main(void) > { > volatile float result[4]; > result[0] = 54.567; > result[1] = 14346.67; > result[2] = add(result[0], result[1]); > result[1] = mul(result[0], result[2]); > result[3] = div(result[1], result[2]); > return; > } > > > Jon
On Fri, 08 Dec 2006 11:43:04 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote: >> On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville >> <no.spam@designtools.maps.co.nz> wrote: >> >> >>>Jonathan Kirwan wrote: >>> >>>>On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville >>>> >>>>I did a _quick_ look to see if it was easy to find the sources and >>>>didn't. Didn't see them easily. >>> >>>Bottom of this page, there are 2 Compiler MSP430 examples + Sources. >>>http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm >> >> >> Yeah. I think it's just C code. They depend on the libraries. I've >> already looked at three of them from three different C compiler >> vendors for the MSP430. So nothing new to be found, there. For the >> MAXQ, I'd just have to go to the compilers and see if I could dig out >> their source code for the library routines (by debugger, if not in >> source freely available.) But I'm not going to take the time for >> that. I also note that they wrap their floating point with function >> calls to functions that just do the one operation and return values. >> So this adds another layer to the benchmark. On the MSP430, function >> calls alone (without inlining) cost you 8 cycles (5 in, 3 out.) >> >> Did you imagine I'd learn any new detailed techniques from this? > >:) - No, but I did glance into the MAP files, and they seem to identify >the sizes of the ADD,MUL,DIV,SUB blocks, and those numbers I figured >you would be interested in.
Nope. Not really. All I care about is learning. And I can't do much of that from reading byte counts.
> Plus you can clone the code below to call your Libs and then compare >the cycles and bytes.
Yeah, like that would be useful -- not. ;) I can write that kind of stuff stone dead, let alone asleep. And the whole idea of loading down a series of toolsets for a processor I won't be using isn't something I want to engage. If the code is there, I'd look at it. Otherwise, I need to wait for motivation. Jon