Reply by Wilco Dijkstra December 12, 2006
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message 
news:6evgn2p580pf8s74bvd0s09le6ig487aou@4ax.com...

> But the first page you mention above describes TI's approach of trying
> to focus on chip performance and to ignore C compiler issues as a
> "flawed" approach. Frankly, that's what I'm interested in because I
> usually mix assembly with C in applications I write and therefore it
> _IS_ valuable to me to know what the chip can do regardless of the C
> compiler.
Yes, but they still used the C compiler, just with all optimization
switched off. That's the same as comparing the performance of cars by
driving them around in first gear. If they wanted to avoid comparing the
compiler, they should have written everything in assembler.

Wilco
Reply by Richard December 7, 2006
>>Jonathan Kirwan wrote:
>
> Did you imagine I'd learn any new detailed techniques from this?
>
> /*******************************************************************************
> *
> * Name : Floating-point Math
> * Purpose : Benchmark floating-point math functions.
> *
> *******************************************************************************/
> float add(float a, float b)
> {
>     return (a + b);
> }
... This code is.... pure genius.

They should further improve it thusly:

/* INPUT
 * a - a float
 * b - b float
 * RETURN
 * a float based on some operations on a and b
 *
 * GLOBAL VARIABLES USED
 * None
 * LOCAL VARIABLES USED
 * None
 * FUNCTIONS USED
 * None
 *
 * REVISIONS:
 * V1    - Dec 12, 2001, initial code
 * V1.01 - Dec 15, 2001, changing stuff around after code review
 * V2    - June 9, 2003, user reported bug, fixed
 * V3    - May 17, 2004, increased speed by 10x
 */
float add(float a, float b)
{
    return (a + b);
}
Reply by Jonathan Kirwan December 7, 2006
On Fri, 08 Dec 2006 11:43:04 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote:
>> On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville
>> <no.spam@designtools.maps.co.nz> wrote:
>>
>>
>>>Jonathan Kirwan wrote:
>>>
>>>>On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
>>>>
>>>>I did a _quick_ look to see if it was easy to find the sources and
>>>>didn't. Didn't see them easily.
>>>
>>>Bottom of this page, there are 2 Compiler MSP430 examples + Sources.
>>>http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm
>>
>>
>> Yeah. I think it's just C code. They depend on the libraries. I've
>> already looked at three of them from three different C compiler
>> vendors for the MSP430. So nothing new to be found, there. For the
>> MAXQ, I'd just have to go to the compilers and see if I could dig out
>> their source code for the library routines (by debugger, if not in
>> source freely available.) But I'm not going to take the time for
>> that. I also note that they wrap their floating point with function
>> calls to functions that just do the one operation and return values.
>> So this adds another layer to the benchmark. On the MSP430, function
>> calls alone (without inlining) cost you 8 cycles (5 in, 3 out.)
>>
>> Did you imagine I'd learn any new detailed techniques from this?
>
>:) - No, but I did glance into the MAP files, and they seem to identify
>the sizes of the ADD,MUL,DIV,SUB blocks, and those numbers I figured
>you would be interested in.
Nope. Not really. All I care about is learning. And I can't do much of that from reading byte counts.
> Plus you can clone the code below to call your Libs and then compare
>the cycles and bytes.
Yeah, like that would be useful -- not. ;) I can write that kind of
stuff stone dead, let alone asleep. And the whole idea of loading down a
series of toolsets for a processor I won't be using isn't something I
want to engage. If the code is there, I'd look at it. Otherwise, I need
to wait for motivation.

Jon
Reply by Jim Granville December 7, 2006
Jonathan Kirwan wrote:
> On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville
> <no.spam@designtools.maps.co.nz> wrote:
>
>
>>Jonathan Kirwan wrote:
>>
>>>On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
>>>
>>>I did a _quick_ look to see if it was easy to find the sources and
>>>didn't. Didn't see them easily.
>>
>>Bottom of this page, there are 2 Compiler MSP430 examples + Sources.
>>http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm
>
>
> Yeah. I think it's just C code. They depend on the libraries. I've
> already looked at three of them from three different C compiler
> vendors for the MSP430. So nothing new to be found, there. For the
> MAXQ, I'd just have to go to the compilers and see if I could dig out
> their source code for the library routines (by debugger, if not in
> source freely available.) But I'm not going to take the time for
> that. I also note that they wrap their floating point with function
> calls to functions that just do the one operation and return values.
> So this adds another layer to the benchmark. On the MSP430, function
> calls alone (without inlining) cost you 8 cycles (5 in, 3 out.)
>
> Did you imagine I'd learn any new detailed techniques from this?
:) - No, but I did glance into the MAP files, and they seem to identify
the sizes of the ADD,MUL,DIV,SUB blocks, and those numbers I figured
you would be interested in.

Plus you can clone the code below to call your Libs and then compare
the cycles and bytes.

-jg
>
> /*******************************************************************************
> *
> * Name : Floating-point Math
> * Purpose : Benchmark floating-point math functions.
> *
> *******************************************************************************/
> float add(float a, float b)
> {
>     return (a + b);
> }
> float mul(float a, float b)
> {
>     return (a * b);
> }
> float div(float a, float b)
> {
>     return (a / b);
> }
> void main(void)
> {
>     volatile float result[4];
>     result[0] = 54.567;
>     result[1] = 14346.67;
>     result[2] = add(result[0], result[1]);
>     result[1] = mul(result[0], result[2]);
>     result[3] = div(result[1], result[2]);
>     return;
> }
>
>
> Jon
Reply by Jonathan Kirwan December 7, 2006
On Fri, 08 Dec 2006 10:44:02 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote:
>> On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
>>
>> I did a _quick_ look to see if it was easy to find the sources and
>> didn't. Didn't see them easily.
>
>Bottom of this page, there are 2 Compiler MSP430 examples + Sources.
>http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm
Yeah. I think it's just C code. They depend on the libraries. I've
already looked at three of them from three different C compiler vendors
for the MSP430. So nothing new to be found, there. For the MAXQ, I'd
just have to go to the compilers and see if I could dig out their source
code for the library routines (by debugger, if not in source freely
available.) But I'm not going to take the time for that.

I also note that they wrap their floating point with function calls to
functions that just do the one operation and return values. So this adds
another layer to the benchmark. On the MSP430, function calls alone
(without inlining) cost you 8 cycles (5 in, 3 out.)

Did you imagine I'd learn any new detailed techniques from this?

/*******************************************************************************
*
* Name : Floating-point Math
* Purpose : Benchmark floating-point math functions.
*
*******************************************************************************/
float add(float a, float b)
{
    return (a + b);
}
float mul(float a, float b)
{
    return (a * b);
}
float div(float a, float b)
{
    return (a / b);
}
void main(void)
{
    volatile float result[4];
    result[0] = 54.567;
    result[1] = 14346.67;
    result[2] = add(result[0], result[1]);
    result[1] = mul(result[0], result[2]);
    result[3] = div(result[1], result[2]);
    return;
}

Jon
Reply by Jim Granville December 7, 2006
Jonathan Kirwan wrote:
> On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
>
> I did a _quick_ look to see if it was easy to find the sources and
> didn't. Didn't see them easily.
Bottom of this page, there are 2 Compiler MSP430 examples + Sources.
http://www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm

-jg
Reply by Jonathan Kirwan December 7, 2006
On Fri, 08 Dec 2006 09:35:45 +1300, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Jonathan Kirwan wrote:
>> On Sat, 18 Nov 2006 21:55:14 GMT, I wrote:
>>
>>>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com>
>>>wrote:
>>>
>>>
>>>>MPS430, 32 bit floats, imagecraft complier, typical cycles
>>>>add 158
>>>>sub 184
>>>>mul 332
>>>>div 620
>>>
>>>Back in 2004, I wanted to play with writing a floating point routine
>>>for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by
>>>32-bit with 32-bit result floating point divide takes roughly 400-435
>>>cycles on the MSP430. This is substantially less than the 620 cycles
>>>mentioned above.
>>><snip>
>>
>>
>> Thanks to Wilco's kicking my butt about a couple of issues in my code,
>> the code I've written up is now faster on the MSP430 -- a mean of 280
>> cycles (with no material variation) for 32-bit float division. It
>> supports INF for divide by zero and INF propagation but lacks denormal
>> support.
>>
>> However, I just did up a short project in TI/IAR's kickstart and the
>> _Div32f function called appears to produce its results in a mean of
>> about 362 cycles, not the 620 mentioned by Steve. To compute that, I
>> did 81 float divisions (reaches the 4k limit) and removed the register
>> load overhead so that only the call itself is considered.
>>
>> This made me curious and it turns out that the library there uses a
>> restoring division version, but one that also uses a tight 'inner'
>> loop that doesn't do a trial subtraction unless there is a reason to
>> do so (leading bit in the remainder [divisor is normalized so it's a
>> given that it's leading bit is 1]) and doesn't use a loop counter,
>> looking instead for a shift of a '1' out of the quotient (which must
>> occur in almost every case, and the rare other cases are tested for.)
>> This brings their cycles per bit in the low level division part of the
>> code, when streams of quotient '0's are being produced, to 8 cycles.
>> More, though, when '1's are being produced. Which is what brings up
>> their average. The code I use in the low level division uses almost
>> exactly 7 cycles per bit (just under) and is consistent regardless of
>> '1' or '0' being produced in the quotient.
>>
>> There are some interesting approaches to know about.
>
>Sounds clever.
I enjoy these things for reasons I'm not entirely certain of. ;)
>Did you see the notes here ? :
>
>http://www.maxim-ic.com/appnotes.cfm/appnote_number/3593
>http://pdfserv.maxim-ic.com/en/ds/MAXQ7653-56_Prelim.pdf
Perhaps some earlier versions. I read so much and I remember going through a few things similar to this on the MAXQ a year ago or so?
>These are for the MAXQ, which claims 16x16 Mult but the
>maths numbers seem broadly similar to MSP430, AVR
>( within a speed-grade-iteration )
>
>Maxim's MHz and mA/MHz numbers also are all over
>the place, almost like different teams are working
>on these devices ?
>
>I think they give sources, so you could compare your
>Div32f with the ones at Maxim's ?
I did a _quick_ look to see if it was easy to find the sources and
didn't. Didn't see them easily.

But the first page you mention above describes TI's approach of trying
to focus on chip performance and to ignore C compiler issues as a
"flawed" approach. Frankly, that's what I'm interested in because I
usually mix assembly with C in applications I write and therefore it
_IS_ valuable to me to know what the chip can do regardless of the C
compiler. When I need to drop into assembly, that's what I want to know
about. And if a C compiler isn't entirely all that great for a chip,
well that only means I may use more assembly than I otherwise might.
But probably not that much more.

So I really find that kind of argument ... not quite disingenuous,
because I can see that it has a valid facet to it ... but not useful for
my case, at least. So I'm wondering if the benchmarks are going to be in
C and if they don't disclose the actual assembly, anyway. It may be the
case that they didn't care to, because of their argument against
assembly, anyway. If you know different, let me know and I'll go look.

In any case, the MAXQ isn't getting designed into anything I care about
right now.

Jon
Reply by Jim Granville December 7, 2006
Jonathan Kirwan wrote:
> On Sat, 18 Nov 2006 21:55:14 GMT, I wrote:
>
>
>>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com>
>>wrote:
>>
>>
>>>MPS430, 32 bit floats, imagecraft complier, typical cycles
>>>add 158
>>>sub 184
>>>mul 332
>>>div 620
>>
>>Back in 2004, I wanted to play with writing a floating point routine
>>for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by
>>32-bit with 32-bit result floating point divide takes roughly 400-435
>>cycles on the MSP430. This is substantially less than the 620 cycles
>>mentioned above.
>><snip>
>
>
> Thanks to Wilco's kicking my butt about a couple of issues in my code,
> the code I've written up is now faster on the MSP430 -- a mean of 280
> cycles (with no material variation) for 32-bit float division. It
> supports INF for divide by zero and INF propagation but lacks denormal
> support.
>
> However, I just did up a short project in TI/IAR's kickstart and the
> _Div32f function called appears to produce its results in a mean of
> about 362 cycles, not the 620 mentioned by Steve. To compute that, I
> did 81 float divisions (reaches the 4k limit) and removed the register
> load overhead so that only the call itself is considered.
>
> This made me curious and it turns out that the library there uses a
> restoring division version, but one that also uses a tight 'inner'
> loop that doesn't do a trial subtraction unless there is a reason to
> do so (leading bit in the remainder [divisor is normalized so it's a
> given that it's leading bit is 1]) and doesn't use a loop counter,
> looking instead for a shift of a '1' out of the quotient (which must
> occur in almost every case, and the rare other cases are tested for.)
> This brings their cycles per bit in the low level division part of the
> code, when streams of quotient '0's are being produced, to 8 cycles.
> More, though, when '1's are being produced. Which is what brings up
> their average. The code I use in the low level division uses almost
> exactly 7 cycles per bit (just under) and is consistent regardless of
> '1' or '0' being produced in the quotient.
>
> There are some interesting approaches to know about.
Sounds clever.

Did you see the notes here ? :

http://www.maxim-ic.com/appnotes.cfm/appnote_number/3593
http://pdfserv.maxim-ic.com/en/ds/MAXQ7653-56_Prelim.pdf

These are for the MAXQ, which claims 16x16 Mult but the
maths numbers seem broadly similar to MSP430, AVR
( within a speed-grade-iteration )

Maxim's MHz and mA/MHz numbers also are all over
the place, almost like different teams are working
on these devices ?

I think they give sources, so you could compare your
Div32f with the ones at Maxim's ?

-jg
Reply by Jonathan Kirwan December 7, 2006
On Sat, 18 Nov 2006 21:55:14 GMT, I wrote:

>On 17 Nov 2006 14:53:58 -0800, "steve" <bungalow_steve@yahoo.com>
>wrote:
>
>>MPS430, 32 bit floats, imagecraft complier, typical cycles
>>add 158
>>sub 184
>>mul 332
>>div 620
>
>Back in 2004, I wanted to play with writing a floating point routine
>for the MSP430. It accepts IEEE format 32-bit floats. The 32-bit by
>32-bit with 32-bit result floating point divide takes roughly 400-435
>cycles on the MSP430. This is substantially less than the 620 cycles
>mentioned above.
><snip>
Thanks to Wilco's kicking my butt about a couple of issues in my code,
the code I've written up is now faster on the MSP430 -- a mean of 280
cycles (with no material variation) for 32-bit float division. It
supports INF for divide by zero and INF propagation but lacks denormal
support.

However, I just did up a short project in TI/IAR's kickstart and the
_Div32f function called appears to produce its results in a mean of
about 362 cycles, not the 620 mentioned by Steve. To compute that, I
did 81 float divisions (reaches the 4k limit) and removed the register
load overhead so that only the call itself is considered.

This made me curious and it turns out that the library there uses a
restoring division version, but one that also uses a tight 'inner'
loop that doesn't do a trial subtraction unless there is a reason to
do so (leading bit in the remainder [divisor is normalized so it's a
given that its leading bit is 1]) and doesn't use a loop counter,
looking instead for a shift of a '1' out of the quotient (which must
occur in almost every case, and the rare other cases are tested for.)
This brings their cycles per bit in the low level division part of the
code, when streams of quotient '0's are being produced, to 8 cycles.
More, though, when '1's are being produced. Which is what brings up
their average. The code I use in the low level division uses almost
exactly 7 cycles per bit (just under) and is consistent regardless of
'1' or '0' being produced in the quotient.

There are some interesting approaches to know about.

Jon
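[For readers who want to see the shape of the trick described above, here
is a rough C sketch of a counter-free restoring-division loop. It is only
an illustration: the routine being discussed is MSP430 assembly built
around the carry flag, and the function name, operand width and
marker-bit placement below are choices made for this sketch, not taken
from that library.]

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: restoring division without a loop counter.
 * The quotient register is preloaded with a single marker bit, and the
 * loop ends when that bit is shifted out, i.e. after exactly 16
 * quotient bits. The trial compare against the divisor is skipped
 * whenever a 1 is shifted out of the top of the remainder, because the
 * remainder is then certainly larger than any 16-bit divisor; that
 * branch only pays off when the divisor has its top bit set, as a
 * normalized floating-point mantissa does. d must be nonzero. */
static uint16_t div16(uint16_t n, uint16_t d, uint16_t *rem_out)
{
    uint32_t rem  = 0;   /* running remainder, one guard bit above 16 */
    uint32_t num  = n;   /* numerator bits, consumed MSB first        */
    uint16_t quot = 1;   /* marker bit; falls out after 16 shifts     */

    for (;;) {
        /* Shift the next numerator bit into the remainder. */
        rem = (rem << 1) | (num >> 15);
        num = (num << 1) & 0xFFFFu;

        unsigned bit = 0;
        if (rem & 0x10000u) {      /* 1 shifted past bit 15: rem > d  */
            rem -= d;
            bit = 1;
        } else if (rem >= d) {     /* otherwise do the trial compare  */
            rem -= d;
            bit = 1;
        }

        /* Shift the quotient; stop once the marker bit drops out. */
        unsigned done = quot >> 15;
        quot = (uint16_t)((quot << 1) | bit);
        if (done)
            break;
    }

    if (rem_out)
        *rem_out = (uint16_t)rem;
    return quot;
}

int main(void)
{
    uint16_t q, r;

    q = div16(50000u, 123u, &r);
    printf("50000/123   = %u rem %u\n", (unsigned)q, (unsigned)r);  /* 406 rem 62   */

    q = div16(65535u, 0x8001u, &r);
    printf("65535/32769 = %u rem %u\n", (unsigned)q, (unsigned)r);  /* 1 rem 32766  */
    return 0;
}

[The marker bit stands in for a loop counter: when it finally drops out
of the quotient register, exactly 16 bits have been produced, which
saves a register and a decrement on a small machine like the MSP430.]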
Reply by Wilco Dijkstra November 28, 2006
"steve" <bungalow_steve@yahoo.com> wrote in message 
news:1164676578.913240.27830@j72g2000cwa.googlegroups.com...
>
> Wilco Dijkstra wrote:
>
>> 280??? I do it in about 70. It uses a tiny lookup table to create an
>> 8-bit
>> reciprocal estimate which is used in 3 long division steps. This turned
>> out to be simpler and faster than Newton-Rhapson as it only uses
>> 32-bit multiplies (which are faster than 64-bit muls on most ARMs).
>>
>> The result is that FP emulation on ARM is extremely fast - in fact
>> hardware FP (eg. ARM11) is only around 5-6 times faster in FP
>> benchmarks eventhough it can issue 1 FP operation per cycle...
>>
> ???? Doesn't make any sense, 1 cycle HW FP operation is 5-6 times
> faster then a 70-280 cycle emulated FP?
It's at most 70 cycles for division. Operations like addition and
multiplication are much faster. Float addition can finish in as little
as 21 cycles; special cases such as x+0 are even quicker.
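[For a sense of what the lookup-table-plus-long-division scheme in the
quoted text amounts to, here is a rough C sketch. It is not the ARM
library code: the reciprocal estimate below is computed inline (and is
wider than 8 bits) instead of being read from a small ROM table, and
exponent, sign and rounding handling are left out. Only the overall
shape matches the description: a cheap reciprocal estimate of the
divisor drives three radix-256 long-division steps, each followed by a
small correction.]

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: divide two normalized 24-bit mantissas
 * (bit 23 set) and return floor(n * 2^24 / d) using only 32-bit
 * arithmetic. Each step's digit guess is deliberately a slight
 * underestimate, so the correction loop only ever adds. A real
 * routine would read the estimate from a small table indexed by the
 * top bits of d rather than doing the division done inline here. */
static uint32_t mant_div(uint32_t n, uint32_t d)
{
    uint32_t r = (1u << 22) / ((d >> 16) + 1);  /* ~ 2^38 / d, rounded down */
    uint32_t q, rem;
    int step;

    q   = (n >= d) ? 1u : 0u;                   /* leading quotient bit     */
    rem = n - q * d;

    for (step = 0; step < 3; step++) {          /* 3 steps x 8 bits = 24    */
        uint32_t digit = ((rem >> 8) * r) >> 22;  /* guess of rem*256/d     */
        rem = (rem << 8) - digit * d;
        while (rem >= d) {                      /* fix the underestimate    */
            rem -= d;
            digit++;
        }
        q = (q << 8) | digit;
    }
    return q;
}

int main(void)
{
    uint32_t n = 0x900000u, d = 0xC00000u;      /* 1.125 / 1.5 = 0.75       */
    uint32_t q   = mant_div(n, d);
    uint32_t ref = (uint32_t)(((uint64_t)n << 24) / d);
    printf("q = 0x%lX, reference = 0x%lX\n",
           (unsigned long)q, (unsigned long)ref);  /* both print 0xC00000   */
    return 0;
}

[Because the digit guesses never overshoot, no restoring step is needed;
on a real target the table lookup replaces the runtime division entirely.]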
>I suspect those benchmarks had
> a few math operations surrounded by tons of logic. Run a ARM11 aor
> ARM9(with hardware FP) FFT and a ARM7 FFT, both floating point, and
> look at the timing differences, probably 100 to 1 difference.
If there is a lot of integer logic, there is no speedup at all. I'm
talking about pure floating point code, including FFTs. Such code has a
large number of load and store operations. Additionally, there are
dependencies between FP operations resulting in interlocks (FP ops have
high latencies).

With no loads/stores and no interlocks, the maximum theoretical speedup
is 25x for fadd/fsub, 50x for fmac and 5x for division. Actual speedups
are 5-6x on ARM11. Amdahl's law applies...

Wilco
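[A back-of-the-envelope Amdahl's-law calculation shows why the overall
figure lands around 5x even when individual operations get 25-50x
faster. The fractions below are invented for illustration; they are not
measurements from any benchmark mentioned in this thread.]

#include <stdio.h>

/* Amdahl's law: if only a fraction f of the run time is sped up by a
 * factor s, the overall speedup is 1 / ((1 - f) + f / s).
 * The 80/20 split below is a made-up example, not a measurement. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* Suppose 80% of the emulated-FP cycles are in the arithmetic
     * itself and the remaining 20% are loads, stores and interlocks
     * that the FP hardware does not remove. */
    printf("fadd/fsub at 25x per op -> %.1fx overall\n", amdahl(0.80, 25.0));
    printf("fmac at 50x per op      -> %.1fx overall\n", amdahl(0.80, 50.0));
    printf("division at 5x per op   -> %.1fx overall\n", amdahl(0.80, 5.0));
    return 0;
}

[With the non-FP share of the cycles untouched, even a 50x per-operation
gain cannot lift the overall number much past the reciprocal of that
share.]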