On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:3ia4m219cuj2p7p33vsn8grf76b6d3uvqf@4ax.com...
>> On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote:
>>
>>>Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
>...
>> Ulf, is that figure for div on the IAR compiler right???
>
>The figures were for 8-bit AVR, not ARM. Note also that IAR's FP libs are unoptimized C (at least on ARM, I assume the same code is used on AVR). Optimized FP libraries are usually handcrafted assembler.

Yes, Steve was quick to nail me on this! Good thing, too.

>> Something sounds wrong to get cycle counts like that. I believe you, it just bugs me. The 32-bit register advantage would have pulled another 48 cycles (24*2) off of the computation in the MSP430, for sure, and in cases where a restore was needed, another 24 cycles -- so the mean (average) would be above 48 and below 72 cycles pulled off -- probably right in the center of that at 60 cycles. That puts it at a projected 340 cycles on the ARM7 right now and I already have ways coded up and fully tested to improve that another 60 cycles, anyway. Which would bring such a thing to the area of 280 cycles on the ARM, as a rough guess -- including overheads.
>
>280??? I do it in about 70.

That's good to hear. In fact, it's a bit of information that may help me someday.

>It uses a tiny lookup table to create an 8-bit reciprocal estimate which is used in 3 long division steps. This turned out to be simpler and faster than Newton-Raphson as it only uses 32-bit multiplies (which are faster than 64-bit muls on most ARMs).

When I get an ARM board here, I really must get a chance to play with this myself. I love simple challenges.

>The result is that FP emulation on ARM is extremely fast - in fact hardware FP (e.g. ARM11) is only around 5-6 times faster in FP benchmarks even though it can issue 1 FP operation per cycle...

Interesting. Thanks.

>Note that integer division and floating point division are completely different things

Like I wouldn't know this... But at the core of the code I wrote is, in fact, an integer divide process. In the case of 32-bit floats, this is a 48 bit / 24 bit integer divide that produces a 24 bit quotient and a 24 bit remainder.

> - a standard 3-cycle per bit integer division takes about 120 cycles in the worst case when unrolled (although it takes just 30 on average).

Do you have a code snippet to see?

>When you have to produce a certain minimum number of result bits methods that produce many result bits in a single step become faster. Proving it is correct is a little more involved due to the many approximation steps though :-)

As always.

Thanks,
Jon
Math computing time statistics for ARM7TDMI and MSP430
Started by ●November 17, 2006
Reply by ●November 21, 2006
Reply by ●November 22, 2006
On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:3ia4m219cuj2p7p33vsn8grf76b6d3uvqf@4ax.com...
>> On Sat, 18 Nov 2006 10:54:11 +0100, "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote:
>>
>>>Here are some AVR figures, IAR Full opt for speed, run in AVR Studio
>...
>> Ulf, is that figure for div on the IAR compiler right???
>
>The figures were for 8-bit AVR, not ARM. Note also that IAR's FP libs are unoptimized C (at least on ARM, I assume the same code is used on AVR). Optimized FP libraries are usually handcrafted assembler.

That would explain the strange ratio between mul and log.

Both in log and sqrt the exponent and mantissa are usually handled separately (sqrt may need a single bit mantissa shift to make the exponent even). To get single precision results a 3rd or 4th order polynomial is typically sufficient. Since the mantissa is (almost) normalised, the polynomial could be calculated using 32 bit integer (or perhaps even 24 bit) arithmetic. log will require one additional multiplication to convert from log2 to log10.

With optimised assembler code, the log/FP-mul ratio should not be that big.

Paul
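For readers who want to see the exponent/mantissa split Paul describes, here is a minimal C sketch. It is not taken from any of the libraries discussed in this thread. The function name `log10_sketch` is hypothetical, it only handles positive, normal, finite inputs, and it swaps in a truncated series in t = (m-1)/(m+1) for the mantissa instead of a fitted 3rd or 4th order polynomial. It also uses float arithmetic for readability, where a real soft-float library would more likely evaluate the mantissa part in fixed point.

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: assumes x is a positive, normal, finite IEEE 754 single. */
float log10_sketch(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);               /* reinterpret the float's bits */

    int e = (int)((bits >> 23) & 0xFFu) - 127;    /* unbiased exponent */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;    /* keep fraction, force exponent 0 */
    float m;
    memcpy(&m, &bits, sizeof m);                  /* mantissa m in [1, 2) */

    if (m > 1.41421356f) {                        /* recentre m around 1 */
        m *= 0.5f;
        e += 1;
    }

    /* ln(m) = 2*(t + t^3/3 + t^5/5 + t^7/7 + ...) with t = (m-1)/(m+1) */
    float t  = (m - 1.0f) / (m + 1.0f);
    float t2 = t * t;
    float ln_m = 2.0f * t * (1.0f + t2 * (1.0f / 3.0f
                           + t2 * (1.0f / 5.0f + t2 * (1.0f / 7.0f))));

    float log2_x = (float)e + ln_m * 1.44269504f; /* ln(m) / ln(2), plus exponent */
    return log2_x * 0.30102999566f;               /* the extra multiply: log2 -> log10 */
}
```

The last line is the one additional multiplication mentioned above. The division used to form t is a convenience of this sketch and would be the first thing to replace on a CPU without a fast divide.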
Reply by ●November 22, 2006
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote in message news:aGo7h.38501$163.1888@newsfe6-gui.ntli.net...> > "Tilmann Reh" <tilmannreh@despammed.com> wrote in message > news:455d6bd4$0$30328$9b4e6d93@newsspool1.arcor-online.net... >> Hello, >> >> for an estimation of required computing time I would like to roughly >> know the time that current controllers need for math operations >> (addition/subtraction, multiplication, division, and also logarithm) in >> single and/or double precision floating point format (assuming common >> compilers). >> >> The MCUs in question are ARM7TDMI of NXP/Atmel flavour (LPC2000 or >> SAM7), and Texas MSP430. >> >> Can anyone provide a link to some statistics? > > None of these support floating point in hardware, so it depends on > the libraries you use. On ARM there exist highly optimised FP > libraries, the one I wrote takes about 25 cycles for fadd/fsub, 40 for > fmul and 70 for fdiv. Double precision takes almost twice as long.That's pretty impressive. I would have expected that checking for NaNs, Infs and denormals to take sort of time.> You would get 500KFlops quite easily on a 50MHz ARM7tdmi.I'm using a 51.6 MHz ARM7TDMI and getting very much slower than that, but I am using Thumb from a very slow external bus. It's not a problem for me for now but its nice to know what I can expect if the math workload becomes an issue.> Of course this is highly compiler/library specific, many are much > slower than this, 5-6x slower for an unoptimised implementation > is fairly typical. > > Doing floating point on the MSP, especially double precision, > seems like a bad idea... >
Reply by ●November 22, 2006
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:ss97m29cq5d8ns1rr220caan0i4rnv5nlm@4ax.com...> On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" > <Wilco_dot_Dijkstra@ntlworld.com> wrote:>>Note that integer division and floating point division are completely >>different things > > Like I wouldn't know this... But at the core of the code I wrote is, > in fact, an integer divide process. In the case of 32-bit floats, > this is a 48 bit / 24 bit integer divide that produces a 24 bit > quotient and a 24 bit remainder.You don't need to do it as a wide integer division. It is basically a fixed point division, so you can use 24 / 24 bit, shifting in zeroes in the numerator (it can never become wider than 24 bits). It also doesn't need any preshifting as both values are normalized (assuming denormals are handled specially). How many cycles does your version take per bit? If it is a lot then multiplicative methods typically win.>> - a standard 3-cycle per bit integer division takes about >>120 cycles in the worst case when unrolled (although it takes just 30 >>on average). > > Do you have a code snippet to see?RSBS tmp, den, num, LSR #N ; trial subtract, sets carry if OK SUBCS num, num, den, LSL #N ; do actual subtract if OK ADC div,div,div ; update divident This does a trial subtract to see whether the denominator * 2^N could fit in the numerator. To avoid overflow you need to shift the numerator right rather than the denominator left. Then conditionally do the subtract and add the carry flag in the division result. The overflow avoidance feature means you don't need to precalculate exactly how many bits need to be done, but make a quick guess instead. My state of the art implementation checks whether there are 4 or fewer result bits, then jumps to code to do 4 bits, otherwise if 8 or fewer bits, jump to code to do 8 bits etc. This way common divisions are really fast, 8 bits are typically done in around 30 cycles. 
This kind of optimization means it takes serious hardware effort to beat a good software implementation. The Cortex-M3 hardware divider does 4 bits per cycle just to be sure :-) Wilco
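The three-instruction step and the "quick guess" dispatch described above can be modelled in C roughly as follows. This is only a sketch of the idea as explained in the post, not Wilco's library code: the names `div_step` and `udiv32_sketch` are made up for illustration, a plain loop stands in for the unrolled code, and `den` is assumed to be nonzero.

```c
#include <stdint.h>

/* One division step for bit position n, mirroring RSBS / SUBCS / ADC. */
static inline uint32_t div_step(uint32_t *num, uint32_t den,
                                unsigned n, uint32_t quot)
{
    /* Trial compare with the numerator shifted RIGHT, so den << n is only
       formed when it is known to fit in 32 bits. */
    uint32_t fits = (*num >> n) >= den;
    if (fits)
        *num -= den << n;             /* the conditional subtract */
    return (quot << 1) | fits;        /* shift the new bit into the quotient */
}

/* Unsigned 32/32 divide; den must be nonzero (assumption of this sketch). */
uint32_t udiv32_sketch(uint32_t num, uint32_t den, uint32_t *rem)
{
    unsigned bits;

    /* Quick guess at how many result bits are needed, in coarse groups. */
    if ((num >> 4) < den)        bits = 4;
    else if ((num >> 8) < den)   bits = 8;
    else if ((num >> 16) < den)  bits = 16;
    else                         bits = 32;

    uint32_t quot = 0;
    for (int n = (int)bits - 1; n >= 0; n--)
        quot = div_step(&num, den, (unsigned)n, quot);

    *rem = num;                       /* what is left over is the remainder */
    return quot;
}
```

Because the trial compare never overflows, overestimating the bit count only costs a few wasted steps that produce leading zero bits, which is why a rough guess is enough.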
Reply by ●November 22, 2006
"Peter Dickerson" <firstname.lastname@REMOVE.tesco.net> wrote in message news:rJV8h.25471$hK2.6998@newsfe3-win.ntli.net...> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote in message > news:aGo7h.38501$163.1888@newsfe6-gui.ntli.net...>> None of these support floating point in hardware, so it depends on >> the libraries you use. On ARM there exist highly optimised FP >> libraries, the one I wrote takes about 25 cycles for fadd/fsub, 40 for >> fmul and 70 for fdiv. Double precision takes almost twice as long. > > That's pretty impressive. I would have expected that checking for NaNs, > Infs and denormals to take sort of time.It doesn't take much time as it is pretty trivial to check for them on ARM due to the powerful instructions. However in many cases you don't need any checks. I aggressively optimise for the common path (at the cost of making uncommon cases slower) and deal with the special cases only when absolutely required. For example there is no need to deal with NaN/Inf in addition, just do the add as if they didn't exist, and if the result overflowed simply return NaN if the input was a NaN, otherwise Inf. Similarly you don't need any denormal handling.>> You would get 500KFlops quite easily on a 50MHz ARM7tdmi. > > I'm using a 51.6 MHz ARM7TDMI and getting very much slower than that, but > I am using Thumb from a very slow external bus. It's not a problem for me > for now but its nice to know what I can expect if the math workload > becomes an issue.My numbers assume ARM code running from single cycle memory, so if you're using a 16-bit bus with waitstates things will be slow... Unless you're doing many multicycle operations (multiplies, load multiple) it may be better to run at a lower frequency with no wait states. RISC pretty much assumes the memory system can provide 1 instruction per cycle. Wilco
Reply by ●November 22, 2006
On Wed, 22 Nov 2006 18:59:15 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:ss97m29cq5d8ns1rr220caan0i4rnv5nlm@4ax.com...
>> On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>
>>>Note that integer division and floating point division are completely different things
>>
>> Like I wouldn't know this... But at the core of the code I wrote is, in fact, an integer divide process. In the case of 32-bit floats, this is a 48 bit / 24 bit integer divide that produces a 24 bit quotient and a 24 bit remainder.
>
>You don't need to do it as a wide integer division. It is basically a fixed point division, so you can use 24 / 24 bit, shifting in zeroes in the numerator (it can never become wider than 24 bits).

Yes. Which is exactly what I do. But it is exactly the same code as does a 48/24. This is territory I know like the back of my own hand. And my 48/24 code is usually _faster_ than most other people's code in doing this task. Period.

>It also doesn't need any preshifting as both values are normalized (assuming denormals are handled specially).

I know just what you are saying. But there are such things as denormals and float unpacking. And in addition, there are minor single shifts required to precheck for overflow of the integer divide. I can show you what I mean with explicit code, if you like. I'd _love_ to hear your opinion on the details.

>How many cycles does your version take per bit? If it is a lot then multiplicative methods typically win.

On which CPU? It really does depend.

>>> - a standard 3-cycle per bit integer division takes about 120 cycles in the worst case when unrolled (although it takes just 30 on average).
>>
>> Do you have a code snippet to see?
>
>RSBS tmp, den, num, LSR #N ; trial subtract, sets carry if OK
>SUBCS num, num, den, LSL #N ; do actual subtract if OK
>ADC div,div,div ; update quotient

I was thinking more about the overall code. It's from that I can get context. But I'll pop out some ARM reference and look at this, anyway. I think I'd like to apprehend it well.

>This does a trial subtract to see whether the denominator * 2^N could fit in the numerator. To avoid overflow you need to shift the numerator right rather than the denominator left. Then conditionally do the subtract and add the carry flag in the division result. The overflow avoidance feature means you don't need to precalculate exactly how many bits need to be done, but make a quick guess instead.

Your explanation here is why I'd like to see the larger context. I'm perfectly willing to completely expose my own hand. Are you?

>My state of the art implementation checks whether there are 4 or fewer result bits, then jumps to code to do 4 bits, otherwise if 8 or fewer bits, jump to code to do 8 bits etc.

I think this makes some sense. I did NOT add this to my own code, but would in cases where I could demonstrate the value. In cases where I unroll the loop once (for example, using a loop count of 12 instead of 24 in the 48/24 case), this may pose some interesting alternatives to consider. But it's a good point to make.

>This way common divisions are really fast, 8 bits are typically done in around 30 cycles. This kind of optimization means it takes serious hardware effort to beat a good software implementation. The Cortex-M3 hardware divider does 4 bits per cycle just to be sure :-)

I'm interested. Let me know if you are willing to expose your code. I will return that offer in kind.

Jon
Reply by ●November 22, 2006
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:4k89m25o48t6e7ulq6tf652dfb1tsesj78@4ax.com...> On Wed, 22 Nov 2006 18:59:15 GMT, "Wilco Dijkstra" > <Wilco_dot_Dijkstra@ntlworld.com> wrote: > >>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message >>news:ss97m29cq5d8ns1rr220caan0i4rnv5nlm@4ax.com... >>> On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" >>> <Wilco_dot_Dijkstra@ntlworld.com> wrote: >> >>>>Note that integer division and floating point division are completely >>>>different things >>> >>> Like I wouldn't know this... But at the core of the code I wrote is, >>> in fact, an integer divide process. In the case of 32-bit floats, >>> this is a 48 bit / 24 bit integer divide that produces a 24 bit >>> quotient and a 24 bit remainder. >> >>You don't need to do it as a wide integer division. It is basically a >>fixed point division, so you can use 24 / 24 bit, shifting in zeroes in >>the numerator (it can never become wider than 24 bits). > > Yes. Which is exactly what I do. But it is exactly the same code as > does a 48/24. This is territory I know like the back of my own hand. > And my 48/24 code is usually _faster_ than most other peoples' code in > doing this task. Period.It should be less code than 48/24 as 24 of those 48 bits are zeroes. There is a trick where you can reuse the divident to hold extra numerator bits and shift them into the numerator at the same time as you do the divident update. You were talking about a 16-bit (or was it 8-bit?) CPU, so I believe the extra 24-bits are not for free. But maybe you could show me what you mean.>>It also >>doesn't need any preshifting as both values are normalized >>(assuming denormals are handled specially). > > I know just what you are saying. But there are such things as > denormals and float unpacking. And in addition, there are minor > single shifts required to precheck for overflow of the integer divide. > I can show you what I mean with explicit code, if you like. 
I'd > _love_ to hear your opinion on the details.If you can, that would be great. Denormals would be treated specially for division. In most embedded code denormals are flushed to zero, but for full IEEE754 you can normalise them and then jump into the main code. That works really well for div and mul, as a result there are far fewer special cases to worry about. You can get rid of the extra overflow check if you do the check in a similar way as I do.>>How many cycles does your version take per bit? If it is a lot >>then multiplicative methods typically win. > > On which CPU? It really does depend.On the CPU you were talking about, MSP430 I thought.>>This does a trial subtract to see whether the denominator * 2^N >>could fit in the numerator. To avoid overflow you need to shift >>the numerator right rather than the denominator left. Then >>conditionally do the subtract and add the carry flag in the >>division result. The overflow avoidance feature means you >>don't need to precalculate exactly how many bits need to be >>done, but make a quick guess instead. > > Your explanation here is why I'd like to see the larger context. I'm > perfectly willing to completely expose my own hand. Are you?I don't have the source code - see below. I can explain the tricks but nothing more than that. There isn't any point in recreating the code just for this discussion.>>My state of the art implementation checks whether there are 4 or >>fewer result bits, then jumps to code to do 4 bits, otherwise if 8 or >>fewer bits, jump to code to do 8 bits etc. > > I think this makes some sense. I did NOT add this to my own code, but > would in cases where I could demonstrate the value. In cases where I > unroll the loop once (for example, using a loop count of 12 instead of > 24 in the 48/24 case), this may pose some interesting alternatives to > consider. But it's a good point to make.The principle works as well if you use a loop. 
The published version described below uses an 8-way unrolled loop and obvious avoids doing unnecessary iterations. It also jumps into the middle of the loop to avoid doing extra work.>>This way common >>divisions are really fast, 8 bits are typically done in around 30 cycles. >>This kind of optimization means it takes serious hardware effort >>to beat a good software implementation. The Cortex-M3 hardware >>divider does 4 bits per cycle just to be sure :-) > > I'm interested. Let me know if you are willing to expose your code. I > will return that offer in kind.I'm afraid copyright of the code is with my previous employer. However an older version of my division was published in "ARM System Developer's Guide - Designing and Optimizing System Software". They even did mention me. You can get an evaluation version (RealView or Keil) from http://www.arm.com/products/DevTools/RVDSEvalCD.html These tools include the compiler I worked on for 10 years and all my highly optimised library routines, including memcpy/memmove, strcmp, FP libraries etc. Wilco
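The "normalise them and then jump into the main code" approach for denormals mentioned in this post can be sketched in C as below. This is an illustration under stated assumptions, not code from the library being discussed: the struct `unpacked_t` and the function `normalise_denormal` are hypothetical names, the input is the 23-bit fraction field of a single-precision denormal, and it must be nonzero (zero would be handled separately).

```c
#include <stdint.h>

/* Hypothetical unpacked form: value = (mant / 2^23) * 2^exp,
   with bit 23 of mant set once normalised. */
typedef struct {
    uint32_t mant;
    int      exp;
} unpacked_t;

/* frac: fraction field of a single-precision denormal, must be nonzero. */
unpacked_t normalise_denormal(uint32_t frac)
{
    /* A denormal encodes 0.frac * 2^-126, so start from that exponent. */
    unpacked_t u = { frac, -126 };

    /* Shift left until the leading one reaches bit 23, lowering the
       exponent by one per shift so the represented value is unchanged. */
    while ((u.mant & 0x00800000u) == 0) {
        u.mant <<= 1;
        u.exp  -= 1;
    }
    return u;
}
```

After this step the operand looks like any other unpacked normal number, only with an exponent below the usual minimum, so the main multiply or divide path needs no further special casing until the result is packed again.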
Reply by ●November 22, 2006
On Wed, 22 Nov 2006 20:21:27 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:

>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:4k89m25o48t6e7ulq6tf652dfb1tsesj78@4ax.com...
>> On Wed, 22 Nov 2006 18:59:15 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>>
>>>"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:ss97m29cq5d8ns1rr220caan0i4rnv5nlm@4ax.com...
>>>> On Wed, 22 Nov 2006 00:34:17 GMT, "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>>>
>>>>>Note that integer division and floating point division are completely different things
>>>>
>>>> Like I wouldn't know this... But at the core of the code I wrote is, in fact, an integer divide process. In the case of 32-bit floats, this is a 48 bit / 24 bit integer divide that produces a 24 bit quotient and a 24 bit remainder.
>>>
>>>You don't need to do it as a wide integer division. It is basically a fixed point division, so you can use 24 / 24 bit, shifting in zeroes in the numerator (it can never become wider than 24 bits).
>>
>> Yes. Which is exactly what I do. But it is exactly the same code as does a 48/24. This is territory I know like the back of my own hand. And my 48/24 code is usually _faster_ than most other peoples' code in doing this task. Period.
>
>It should be less code than 48/24 as 24 of those 48 bits are zeroes. There is a trick where you can reuse the quotient register to hold extra numerator bits and shift them into the numerator at the same time as you do the quotient update. You were talking about a 16-bit (or was it 8-bit?) CPU, so I believe the extra 24-bits are not for free. But maybe you could show me what you mean.
>
>>>It also doesn't need any preshifting as both values are normalized (assuming denormals are handled specially).
>>
>> I know just what you are saying. But there are such things as denormals and float unpacking. And in addition, there are minor single shifts required to precheck for overflow of the integer divide. I can show you what I mean with explicit code, if you like. I'd _love_ to hear your opinion on the details.
>
>If you can, that would be great. Denormals would be treated specially for division. In most embedded code denormals are flushed to zero, but for full IEEE754 you can normalise them and then jump into the main code. That works really well for div and mul, as a result there are far fewer special cases to worry about.

I will send a version to your email address.

>You can get rid of the extra overflow check if you do the check in a similar way as I do.

I'd like to see how that works. Perhaps, if you look over my example code you can point out where/how this might be done. I'd be very interested. (No, not because of a professional interest. My last full FP library writing took place 30 years ago and it's not my business. But I do have a hobbyist interest, of course.)

>>>How many cycles does your version take per bit? If it is a lot then multiplicative methods typically win.
>>
>> On which CPU? It really does depend.
>
>On the CPU you were talking about, MSP430 I thought.

Ah. Execution time of the entire code is 340-350 cycles, or so. The core division routine, the one that does the central work here, takes about 240-250 cycles of that. So the division itself is about 10 MSP430 cycles per bit. I'd very much enjoy seeing a better version to learn from.

>>>This does a trial subtract to see whether the denominator * 2^N could fit in the numerator. To avoid overflow you need to shift the numerator right rather than the denominator left. Then conditionally do the subtract and add the carry flag in the division result. The overflow avoidance feature means you don't need to precalculate exactly how many bits need to be done, but make a quick guess instead.
>>
>> Your explanation here is why I'd like to see the larger context. I'm perfectly willing to completely expose my own hand. Are you?
>
>I don't have the source code - see below. I can explain the tricks but nothing more than that. There isn't any point in recreating the code just for this discussion.

Well, then I probably won't spend a lot of time thinking about it right now.

>>>My state of the art implementation checks whether there are 4 or fewer result bits, then jumps to code to do 4 bits, otherwise if 8 or fewer bits, jump to code to do 8 bits etc.
>>
>> I think this makes some sense. I did NOT add this to my own code, but would in cases where I could demonstrate the value. In cases where I unroll the loop once (for example, using a loop count of 12 instead of 24 in the 48/24 case), this may pose some interesting alternatives to consider. But it's a good point to make.
>
>The principle works as well if you use a loop. The published version described below uses an 8-way unrolled loop and obviously avoids doing unnecessary iterations. It also jumps into the middle of the loop to avoid doing extra work.

Thanks. I'll see if I can get a look. Is it easy to gain access to? Or ... well, let me comment later when I see what you wrote there.

>>>This way common divisions are really fast, 8 bits are typically done in around 30 cycles. This kind of optimization means it takes serious hardware effort to beat a good software implementation. The Cortex-M3 hardware divider does 4 bits per cycle just to be sure :-)
>>
>> I'm interested. Let me know if you are willing to expose your code. I will return that offer in kind.
>
>I'm afraid copyright of the code is with my previous employer.

Things are as they are. Okay.

>However an older version of my division was published in "ARM System Developer's Guide - Designing and Optimizing System Software". They even did mention me. You can get an evaluation version (RealView or Keil) from http://www.arm.com/products/DevTools/RVDSEvalCD.html
>These tools include the compiler I worked on for 10 years and all my highly optimised library routines, including memcpy/memmove, strcmp, FP libraries etc.

Could you please simplify how I might directly access the routine? I'm not interested in some long investigation/installation and traipsing around through obscurity just to get the code itself. Can you help make this any simpler? If it is published, would you be willing to just copy it into an email and send it to me?? I'd appreciate that.

Jon
Reply by ●November 22, 2006
On Wed, 22 Nov 2006 20:59:04 GMT, Jonathan Kirwan <jkirwan@easystreet.com> wrote:

><snip>
>Ah. Execution time of the entire code is 340-350 cycles, or so. The core division routine, the one that does the central work here, takes about 240-250 cycles of that. So the division itself is about 10 MSP430 cycles per bit. I'd very much enjoy seeing a better version to learn from.
><snip>

I should point out that this produces both a 24-bit quotient and a 24-bit remainder. Just in case that wasn't clear. I use the remainder to perform full-knowledge rounding as the fraction is exactly known.

Jon
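As an illustration of why keeping the remainder makes rounding exact, here is a small C sketch. It is not Jon's code: the function name `round_nearest_even` is made up, and it assumes the IEEE 754 default mode (round to nearest, ties to even), which the post does not state explicitly. The inputs are assumed to satisfy rem < den.

```c
#include <stdint.h>

/* Round quot to nearest, ties to even, given the exact remainder.
   Assumes 0 <= rem < den. */
uint32_t round_nearest_even(uint32_t quot, uint32_t rem, uint32_t den)
{
    /* Comparing rem with (den - rem) is the same as comparing 2*rem with
       den, but cannot overflow. */
    uint32_t other = den - rem;

    if (rem > other)                      /* discarded part is above one half */
        return quot + 1;
    if (rem == other && (quot & 1u))      /* exactly one half: make quot even */
        return quot + 1;
    return quot;                          /* below one half, or tie and even */
}
```

In a floating point divide the increment can carry out of the 24-bit mantissa, in which case the result has to be renormalised and the exponent bumped. That step is left out of this sketch.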
Reply by ●November 22, 2006
"Jonathan Kirwan" <jkirwan@easystreet.com> wrote in message news:eid9m2hofhf8q3msqunvs548skqr25gb4s@4ax.com...> On Wed, 22 Nov 2006 20:21:27 GMT, "Wilco Dijkstra" > <Wilco_dot_Dijkstra@ntlworld.com> wrote:>>You can get rid of the extra overflow check if you do the check >>in a similar way as I do. > > I'd like to see how that works. Perhaps, if you look over my example > code you can point out where/how this might be done.I'll do that. There are various techniques to do this.>>>>How many cycles does your version take per bit? If it is a lot >>>>then multiplicative methods typically win. >>> >>> On which CPU? It really does depend. >> >>On the CPU you were talking about, MSP430 I thought. > > Ah. Execution time of the entire code is 340-350 cycles, or so. The > core division routine, the one that does the central work here, takes > about 240-250 cycles of that. So the division itself is about 10 > MSP430 cycles per bit. I'd very much enjoy seeing a better version to > learn from.I'll have a look at the inner loop and see whether it can be improved. At 10 cycles per bit multiplicative methods should be faster. On ARM I went from 2 cycles per bit for fixed point division to 1 cycle per bit for long division. If you can get a decent 8x24-bit multiply on the MSP it should be possible to get it down to 150 cycles.>>However >>an older version of my division was published in "ARM System >>Developer's Guide - Designing and Optimizing System Software". They >>even did mention me. You can get an evaluation version (RealView or >>Keil) from http://www.arm.com/products/DevTools/RVDSEvalCD.html >>These tools include the compiler I worked on for 10 years and all >>my highly optimised library routines, including memcpy/memmove, >>strcmp, FP libraries etc. > > Could you please simplify how I might directly access the routine? I'm > not interested in some long investigation/installation and trapsing > around through obscurity just to get the code itself. 
Can you help > make this any simpler? If it is published, would you be willing to > just copy it into an email and send it to me?? I'd appreciate that.I've sent you some disassembly. The source code even if published is still under copyright. Luckily you can disassemble anything you like to learn from it. I've done that a lot with competitor's compilers and libaries when I was doing compiler stuff. Wilco