In an upcoming hardware design I'm thinking about using a CPU without a floating point unit. The application uses floating point numbers, so I'll have to do software emulation. However, I can't seem to find any information on how long these operations might take in software. I'm trying to figure out how much processing power I need & choose an appropriate CPU.

I have plenty of info on MIPS ratings for the CPUs, and I figured out how many MFLOPS my application needs, but how do I figure out how many MIPS it takes to do so many MFLOPS?

Does anyone know of any info resources or methods?

Thanks for any help!
Chris
estimating CPU load/MFLOPS for software emulation of floating point
Started by ●December 18, 2003
Reply by ●December 18, 2003
In article <dd951bd2.0312181114.201e632d@posting.google.com>, Christopher Holmes <a_team_of_scientists@yahoo.com> wrote:

> I have plenty of info on MIPS ratings for the CPUs, and I figured
> out how many MFLOPS my application needs, but how do I figure out how
> many MIPS it takes to do so many MFLOPS?
>
> Does anyone know of any info resources or methods?

Lots of the latter, but the former are mostly in people's heads or on paper. Old paper.

If you want to emulate a hardware floating-point format, you are talking hundreds of instructions or more, depending on how clever you are and the interface you use. If you merely want to implement floating-point in software, then you can get it down to tens of instructions. For example, holding floating-point numbers in a structure designed for software, like:

    struct { unsigned long mantissa; int exponent; unsigned char sign; };

is VASTLY easier than emulating IEEE. It's still thoroughly messy.

Regards,
Nick Maclaren.
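[To make the struct-based approach concrete, here is a minimal sketch of a multiply in that style. This is not Maclaren's actual code; the format convention (mantissa normalized with its top bit set, value = mantissa/2^31 × 2^exponent) and all names are illustrative assumptions.]

```c
#include <stdint.h>

/* Hypothetical software float in the separate-fields style described
 * above: no IEEE bit packing, no special-case encodings. */
typedef struct {
    uint32_t mantissa;  /* normalized: top bit set when nonzero */
    int      exponent;  /* value = (mantissa / 2^31) * 2^exponent */
    uint8_t  sign;      /* 0 = positive, 1 = negative */
} sfloat;

/* Multiply: one wide integer multiply, an exponent add, a sign XOR,
 * and a single conditional shift to renormalize -- this is why the
 * cost is comparable to an integer multiply of the same width. */
static sfloat sf_mul(sfloat a, sfloat b)
{
    sfloat r;
    uint64_t p = (uint64_t)a.mantissa * b.mantissa;  /* 32x32 -> 64 */
    /* Product of two values in [1,2) lies in [1,4): top bit of p is
     * either bit 63 or bit 62, so shift by 32 or 31 accordingly. */
    int shift = (p >> 63) ? 32 : 31;
    r.mantissa = (uint32_t)(p >> shift);
    r.exponent = a.exponent + b.exponent + (shift - 31);
    r.sign     = a.sign ^ b.sign;
    return r;
}
```

Note there is no handling of zero, overflow, or rounding here; that bookkeeping is where the "thoroughly messy" part comes in.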
Reply by ●December 18, 2003
"Christopher Holmes" <a_team_of_scientists@yahoo.com> wrote in message news:dd951bd2.0312181114.201e632d@posting.google.com...

> I have plenty of info on MIPS ratings for the CPUs, and I figured
> out how many MFLOPS my application needs, but how do I figure out how
> many MIPS it takes to do so many MFLOPS?

If you absolutely must use normalized FP (a la IEEE) it could be hundreds or even thousands of instructions, depending on the CPU resources and the cleverness of the code. Look at un-normalized FP or even integer. Normalization results in non-deterministic timing. Of course, if your CPU doesn't have hardware multiply, then all your math timing is non-deterministic ;-)

Very few things really need FP - the algorithm designers are just lazy. A 32 bit integer has better than 1 ppb (1 part per billion) resolution. Most things in the real world (like ADCs and DACs) aren't anywhere near that.

Bob
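[A sketch of the integer alternative Bob suggests, using a Q16.16 fixed-point format (16 integer bits, 16 fractional bits, ~15 microunit resolution, far finer than a typical 12-bit ADC). The type and function names are illustrative, not from any particular library.]

```c
#include <stdint.h>

/* Q16.16 fixed point: value = raw / 65536.0 */
typedef int32_t q16_16;

#define Q16_ONE (1 << 16)

/* Fixed-point multiply: widen to 64 bits so the intermediate product
 * cannot overflow, then shift back down to Q16.16. Deterministic
 * timing: one multiply and one shift, no normalization loop. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}
```

This is the usual trade: you give up dynamic range (the programmer must know the magnitudes in advance) in exchange for speed and deterministic timing.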
Reply by ●December 18, 2003
On Thu, 18 Dec 2003 20:38:54 +0000, Nick Maclaren wrote:

> If you merely want to implement floating-point in software, then you
> can get it down to tens of instructions. For example, holding
> floating-point numbers in a structure designed for software, like:
>
>     struct { unsigned long mantissa; int exponent; unsigned char sign; }
>
> is VASTLY easier than emulating IEEE. It's still thoroughly messy.

Why would you muck about with a separate sign, rather than just using a signed mantissa, for a non-standard software implementation? Does it buy you something in terms of speed? Precision, I guess, given that long is only 32 bits on many systems, and few have 64x64->128 integer multipliers anyway. The OP didn't say what the application was, so it's hard to say whether more than 32 bits of mantissa would be needed.

Frankly, he's almost certainly going to be able to translate to fixed-point or block-floating-point anyway, and not bother with the per-value exponent field. That's what all of the "multi-media" applications that run on integer-only ARM, MIPS, SH-RISC etc. do. Modern versions of these chips all have strong (low latency, pipelined) integer multipliers, so performance can be quite good.

Cheers,

-- Andrew
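[The block-floating-point idea mentioned above amounts to sharing one exponent across a whole buffer of samples. A minimal sketch, assuming 16-bit samples and an illustrative function name, not code from any particular codec:]

```c
#include <stdint.h>
#include <stdlib.h>

/* Normalize a block of samples to a shared exponent: shift every
 * sample left until the largest magnitude uses the full 16-bit range,
 * and return the number of shifts applied (the block exponent).
 * After the call, the true value of x[i] is x[i] * 2^-shift. */
static int bfp_normalize(int16_t *x, size_t n)
{
    int peak = 0;
    for (size_t i = 0; i < n; i++) {
        int m = abs((int)x[i]);       /* widen first: abs(INT16_MIN) is safe */
        if (m > peak) peak = m;
    }
    if (peak == 0) return 0;          /* all-zero block: nothing to scale */

    int shift = 0;
    while (peak < 0x4000) { peak <<= 1; shift++; }

    for (size_t i = 0; i < n; i++)
        x[i] = (int16_t)(x[i] << shift);
    return shift;
}
```

One exponent per block instead of per value is what lets the inner loops stay pure integer arithmetic on those multiply-capable integer cores.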
Reply by ●December 18, 2003
On Thu, 18 Dec 2003 11:53:03 -0800, "Bob" <SkiBoyBob@excite.com> wrote:

> If you absolutely must use normalized FP (a la IEEE) it could be
> hundreds or even thousands depending on the CPU resources and the
> cleverness of the code. Look at un-normalized FP or even integer.
> Normalization results in non-deterministic timing.

Floating point multiplication and division are not much worse than doing integer multiplication or division with operands of similar sizes. Only an extra addition/subtraction of the exponents is involved.

However, floating point additions and subtractions are nasty, since you first have to denormalize the smaller value and then perform the addition/subtraction in the normal way. Especially after subtraction, you often have to find the most significant bit set and do the renormalisation, which can be quite time consuming.

However, even if you had to normalize a 64 bit mantissa on an 8 bit processor, you could first test in which byte the first "1" bit is located and by byte copying (or preferably pointer arithmetic) move that byte to the beginning of the result. After that you have to perform 1-7 full sized (64 bit) left shift operations (or 1-4 bit left/right shifts) to get into the correct position. Rounding requires up to 8 adds with carry.

Even so, I very much doubt that you would require more than 100 instructions in addition to the actual integer multiply/add/sub operation with the same operand sizes. An 8 by 8 bit multiply instruction would reduce the computational load considerably.

Paul
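[The byte-then-bit normalization trick above can be sketched as follows. Written here with C shifts for clarity; on an 8-bit CPU the byte-granular loop would be a byte copy or pointer adjustment, replacing up to 56 full-width single-bit shifts. The function name is illustrative.]

```c
#include <stdint.h>

/* Normalize a 64-bit mantissa in two stages: whole-byte shifts until
 * the top byte is nonzero, then at most 7 single-bit shifts until the
 * top bit is set. Returns the total shift count, which the caller
 * subtracts from the exponent. */
static int normalize64(uint64_t *m)
{
    if (*m == 0) return 0;            /* zero has no MSB to find */
    int shift = 0;

    /* Byte-granular step: cheap on an 8-bit CPU (byte copy / pointer). */
    while ((*m >> 56) == 0) { *m <<= 8; shift += 8; }

    /* Bit-granular step: at most 7 more single-bit shifts. */
    while ((*m >> 63) == 0) { *m <<= 1; shift += 1; }

    return shift;
}
```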
Reply by ●December 18, 2003
Hi,

Such an open ended question is impossible to answer. It takes forever on a typical 8 bit micro; 16 bit is much quicker but still slow.

It's possible to remove FP operations from most applications, so try that first. You can easily measure performance on the hardware with some test routines, so try that second.
Reply by ●December 18, 2003
In article <pan.2003.12.18.21.34.59.500214@gurney.reilly.home>, Andrew Reilly <andrew@gurney.reilly.home> wrote:

> Why would you muck about with a separate sign, rather than just using
> a signed mantissa, for a non-standard software implementation? Does it
> buy you something in terms of speed?

It buys some convenience, and probably a couple of instructions fewer for some operations. Not a big deal.

> Frankly, he's almost certainly going to be able to translate to
> fixed-point or block-floating-point anyway, and not bother with the
> per-value exponent field.

See "scaling" in any good 1930s book on numerical analysis :-)

Regards,
Nick Maclaren.
Reply by ●December 18, 2003
Paul Keinanen wrote (snip regarding software floating point):

> Even so, I very much doubt that you would require more than 100
> instructions in addition to the actual integer multiply/add/sub
> operation with the same operand sizes. An 8 by 8 bit multiply
> instruction would reduce the computational load considerably.

The 6809 has an 8 by 8 multiply, but the floating point implementations I knew on it didn't use it. I looked at it once, and I don't think it was all that much faster to use it.

-- glen
Reply by ●December 19, 2003
Christopher Holmes wrote:

> In an upcoming hardware design I'm thinking about using a CPU without
> a floating point unit. The application uses floating point numbers,
> so I'll have to do software emulation. However, I can't seem to find
> any information on how long these operations might take in software.
> I'm trying to figure out how much processing power I need & choose an
> appropriate CPU.

There was a time when you had no choice. You should also decide on the precision levels needed in the FP system. Many years ago I decided that my applications could be adequately handled with a 16 bit significand, and the result was the FP system for the 8080 published in DDJ about 20 years ago. The actual code is probably of little use today, but the breakdown may well be. That was fairly efficient and speedy because the 8080 was capable of 16 bit arithmetic, and it was not hard to extend it to 24 and 32 bits where needed.

-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Reply by ●December 19, 2003
Christopher Holmes wrote:

> I have plenty of info on MIPS ratings for the CPUs, and I figured
> out how many MFLOPS my application needs, but how do I figure out how
> many MIPS it takes to do so many MFLOPS?
>
> Does anyone know of any info resources or methods?

Check out John Hauser's SoftFloat package, at:

http://www.jhauser.us/arithmetic/SoftFloat.html

He quotes some timings on that page, and/or you could measure the calculations you are interested in for yourself. Turning his timings for doubles into number of clock cycles per operation, one gets roughly:

    Add: 305
    Mul: 285
    Div: 605

On a Pentium, Add and Multiply take 1-3 cycles and Divide takes 39, so for add or multiply you're looking at two orders of magnitude slowdown; for divide, nearer to one.

(As others have pointed out, with a non-standard floating-point format and arithmetic one can go faster than that.)

Mike Cowlishaw