
FPU vs soft library vs. fixed point

Started by Don Y May 25, 2014
On 5/26/2014 12:41 AM, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
>> I have to assume you mean many of the parts with FP they also have a
>> larger pin count, meaning 100 and up. But unless I have a set of
>> requirements, that is getting into some very hard to compare features.
>
> http://www.ti.com/product/tm4c123gh6pm used in the TI Tiva Launchpad is
> a 64LQFP, still not exactly tiny. There is supposedly a new comparable
> Freescale part (MK22FN1M0VLH12, also 64LQFP) that is pin compatible with
> the part in the Teensy 3.1 (pjrc.com), and that has floating point (the
> Teensy cpu is integer-only). I wonder if the pjrc guy will make a
> Teensy 3.2 with the new part, which also has more memory. It's a cute
> little board. The FP on all these parts is unfortunately single
> precision.
I really don't get your point. What are you comparing this to?

--
Rick
Hi Don,

On Sun, 25 May 2014 13:25:40 -0700, Don Y <this@is.not.me.com> wrote:

>OToOH, a non-generic library approach *could*, possibly, eke out
>a win by eliminating unnecessary operations that an FPU (or a
>generic library approach) would naively undertake.
Only 3 hands?
>So, for a specific question: anyone have any *real* metrics
>regarding how efficient (power, cost) hardware FPU (or not!)
>is in FP-intensive applications?
This is being kicked around in comp.arch right now in a wandering thread called "RISC versus the Pentium Pro". Haven't seen the numbers you're asking for, but you can likely get them if you ask nicely. They are discussing a closely related question involving the tradeoffs between providing an all-up FPU (e.g., IEEE-754) vs. providing sub-units and allowing software to drive them. I haven't followed the whole thread [it's wandering a lot (even for c.a.)] but there have been some mentions of break-even points of HW vs. SW for general code.
>(by "FP-intensive", assume 20% of the operations performed by the
>processor fall into that category).
A number of comp.arch participants are present/past CPU designers. Quite a few others come from HPC ... when they are talking about FP intensive code, they mean 70+%.
>Thx,
>--don
George
Hi Rick,

On 5/25/2014 9:06 PM, rickman wrote:
> On 5/25/2014 6:56 PM, Don Y wrote:
>> On 5/25/2014 3:40 PM, Tim Wescott wrote:
>>
>>>> I also don't have lots of experience with floating point, but I would
>>>> expect if you are doing a lot of floating point the hardware would use
>>>> less power than a software emulation. I can't imagine the cost would be
>>>> very significant unless you are building a million of them.
>>>
>>> The cost is often more in that the pool of available processors shrinks
>>> dramatically, and it's hard to get physically small parts.
>
> Can't say since "small" is not really anything I can measure. There are
> very small packages available (some much smaller than I want to work
> with). I have to assume you mean many of the parts with FP they also
> have a larger pin count, meaning 100 and up. But unless I have a set of
> requirements, that is getting into some very hard to compare features.
Things like FPU, MMU, GPU, etc. *tend* to find their way into more expensive/capable/larger devices. The thinking seems to be that -- once you've "graduated" to that level of complexity -- you aren't pinching pennies/watts/mm^3, etc.
>> Exactly. Especially if the "other" functions that the processor
>> is performing do not benefit from the extra (hardware) complexity.
>
> What? If one part of a design needs an I2C interface you think you
> should not use a hardware I2C interface because the entire project
> doesn't need it??? That makes no sense to me.
If the interface carried other baggage with it (size, cost, power) and COULDN'T BE IMPLEMENTED SOME OTHER WAY (e.g., FP library can produce the exact same results as FPU), then why would you take on that extra baggage? Why not put EVERYTHING into EVERY DESIGN? And, just increase package dimensions, power requirements, cost, etc. accordingly...
>> "advanced RISC machine"
>>
>> Would you use a processor with a GPU (G, not F) to control a
>> CNC lathe? Even if it had a GUI? Or, would you "tough it out"
>> and write your own blit'er and take the performance knock on
>> the (largely static!) display tasks?
>
> This is a total non-sequitur. Reminds me of a supposed true story where
> one employee said you get floor area by dividing length by width rather
> than multiplying. When that was questioned he reasoned with, "How many
> quarters in a dollar? How many quarters in two dollars?... SEE!" lol
You've missed the point. You can provide the *functionality* of a GPU (FPU) in software at the expense of some execution speed. If you don't *need* that speed for the operation of the CNC lathe (i.e., the GPU *might* speed up some of the routines for moving TEXT and LINE DRAWINGS around the LARGELY STATIC display screen), then why take on that cost?

You might grumble that updating the screen takes a full 1/10th of a second and COULD BE *SO* MUCH FASTER (with the GPU) but do you think the marketing guys are going to brag about a *faster* update rate than that? Especially if there are other consequences to this choice?
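For the curious, a "blit'er" for a largely static screen really is about this simple. A minimal sketch in C -- the 16bpp row-major framebuffer layout and all of the names are illustrative assumptions, not any particular part's API:

#include <stdint.h>
#include <string.h>

/* Copy a w x h pixel rectangle between 16bpp row-major framebuffers.
 * One memcpy per scanline; no clipping, no alpha, no ROPs -- plenty
 * for text and line drawings on a mostly static display. */
void blit(uint16_t *dst, int dst_stride,        /* pixels per dest row   */
          const uint16_t *src, int src_stride,  /* pixels per source row */
          int w, int h)
{
    for (int y = 0; y < h; y++) {
        memcpy(dst + (size_t)y * dst_stride,
               src + (size_t)y * src_stride,
               (size_t)w * sizeof(uint16_t));
    }
}

You take the performance knock in that loop instead of in silicon -- which is exactly the tradeoff in question.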
rickman <gnuarm@gmail.com> writes:
>>> parts with FP they also have a larger pin count, meaning 100 and up. ...
>> http://www.ti.com/product/tm4c123gh6pm used in the TI Tiva Launchpad is
>> a 64LQFP, ... Freescale part (MK22FN1M0VLH12, also 64LQFP)
> I really don't get your point. What are you comparing this to?
Just passing information along. You expressed disappointment that parts with FP tend to be large, i.e., 100 pins and up. I mentioned a couple that I knew of with 64 pins.
On 5/26/2014 1:13 AM, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
>>>> parts with FP they also have a larger pin count, meaning 100 and up. ...
>>> http://www.ti.com/product/tm4c123gh6pm used in the TI Tiva Launchpad is
>>> a 64LQFP, ... Freescale part (MK22FN1M0VLH12, also 64LQFP)
>> I really don't get your point. What are you comparing this to?
>
> Just passing information along. You expressed disappointment that parts
> with FP tend to be large, i.e., 100 pins and up. I mentioned a couple
> that I knew of with 64 pins.
Ok, but I think you have me confused with another poster. I'm agnostic on that particular issue. My cross is the lack of reasonable packages for FPGAs.

--
Rick
Hi Rick,

On 5/25/2014 9:46 PM, rickman wrote:

[attrs elided]

>>> I also don't have lots of experience with floating point, but I would
>>> expect if you are doing a lot of floating point the hardware would use
>>> less power than a software emulation. I can't imagine the cost would be
>>> very significant unless you are building a million of them.
>>
>> I'm pricing in 100K quantities -- which *tends* to make cost
>> differences diminish.
>
> I'm not sure what you are saying about cost differences diminishing.
> High volume makes cost differences jump out and be noticed! Or are you
> saying everyone quotes you great prices at those volumes?
The differences amount to a lot FOR THE LOT. And, when reflected to retail pricing. But, the differences in piece part prices drop dramatically. At some point, you're just "buying plastic" (regardless of what sort of silicon is inside).
>> But, regardless of quantity, physics dictates the volume of a
>> battery/cell required to power the things! Increased quantity
>> doesn't make it draw less power, etc. :<
>
> I may have something completely different for you to consider.
>
>>> I think finding general "metrics" on FP approaches will be a lot harder
>>> than defining your requirements and looking for a solution that suits.
>>> Do you have requirements at this point?
>>
>> I can't discuss two of the applications. But, to "earn my training
>> wheels", I set out to redesign another app with similar constraints.
>> It's a (formant) speech synthesizer that runs *in* a BT earpiece.
>> (i.e., the size of the device is severely constrained -- which has
>> repercussions on power available, etc.)
>>
>> A shirt-cuff analysis of the basic processing loop shows ~60 FMUL,
>> ~30 FADD and a couple of trig/transcend operations per iteration.
>> That's running at ~20KHz (lower sample rates make it hard to
>> synthesize female and child voices with any quality). Not a tough
>> requirement to meet WITHOUT the power+size constraints. But, throw
>> it in a box with a ~0.5WHr power source and see how long it lasts
>> before you're back on the charger! :-/
>> I think this would be a good "get my feet wet" application because
>> all of the math is constrained a priori. While I can't *know* what
>> the synthesizer will be called upon to speak, I *do* know what all
>> of the CONSTANTS are that drive the algorithms.
>> At the same time, it has fixed (hard) processing limits -- I can't
>> "preprocess" speech and play it back out of a large (RAM) buffer...
>> there's no RAM lying around to exploit in that wasteful a manner.
>
> I highly recommend that you not use pejoratives like "wasteful" when it
> comes to engineering. One man's waste is another man's efficiency. It
If you don't have a resource available, then any use of that resource that can be accomplished by other means is wasteful. E.g., the speech synthesizer does most of its work out of ROM to avoid using RAM.
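To make the shirt-cuff numbers above concrete: a formant synthesizer is, at its core, a cascade of two-pole resonators. A minimal sketch in C -- the six-formant cascade and all names here are illustrative assumptions, not the actual codebase:

#define NFORMANT 6   /* illustrative; real synthesizers vary */

typedef struct {
    float a, b, c;    /* resonator coefficients (from formant freq/BW) */
    float y1, y2;     /* previous two outputs */
} resonator_t;

/* Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
 * -- 3 FMUL + 2 FADD per sample. */
static float resonate(resonator_t *r, float x)
{
    float y = r->a * x + r->b * r->y1 + r->c * r->y2;
    r->y2 = r->y1;
    r->y1 = y;
    return y;
}

/* One output sample: push the excitation through the formant cascade. */
float synth_sample(resonator_t f[NFORMANT], float excitation)
{
    float s = excitation;     /* glottal pulse or noise source */
    for (int i = 0; i < NFORMANT; i++)
        s = resonate(&f[i], s);
    return s;
}

Each resonator costs 3 FMUL + 2 FADD per sample; add the source model, amplitude tracks, etc. on top and the ~60 FMUL/~30 FADD figure is plausible -- roughly (60+30) x 20,000 = ~1.8M floating point operations per second, sustained.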
> only depends on the numbers. I don't know what your requirements are so
> I can't say having RAM is wasteful or not.
>
>> It also highlights the potential problem of including FPU hardware
>> in a design if it isn't *always* in use -- does having an *idle*
>> FPU carry any other recurring (operational) costs? (in theory,
>> CMOS should have primarily dynamic currents... can I be sure the
>> FPU is truly "idle" when I'm not executing FP opcodes?)
>
> I think this is a red herring. If you are worried about power
> consumption, worry about power consumption. Don't start worrying about
> what is idle and what is not before you even get started. Do you really
> think the FP instructions are going to be hammering away at the power
> draw when they are not being executed? Do you worry about the return
> from interrupt instruction when you aren't using that?
An FPU represents a lot of gates! Depending on the processor it's attached to, often 20-30% of the gates *in* the processor. That's a lot of silicon to "ignore" on the assumption that it doesn't cost anything while not being used. I'd much rather have assurances that it doesn't than to assume it doesn't and learn that it has dynamic structures within.
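FWIW, on the Cortex-M4F parts mentioned up-thread you can at least gate the FPU off architecturally and *measure* the difference rather than assume. A sketch against the ARMv7-M coprocessor access register (whether denying access actually cuts static draw is implementation-defined -- which is precisely the assurance to go get from the vendor):

#include <stdint.h>

/* ARMv7-M Coprocessor Access Control Register (CP10/CP11 = the FPU). */
#define CPACR (*(volatile uint32_t *)0xE000ED88u)

void fpu_enable(void)
{
    CPACR |= (3u << 20) | (3u << 22);    /* CP10/CP11: full access */
    __asm volatile ("dsb\n\tisb");       /* ensure it takes effect */
}

void fpu_disable(void)
{
    CPACR &= ~((3u << 20) | (3u << 22)); /* FP opcodes now fault */
    __asm volatile ("dsb\n\tisb");
}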
>> And, how much "assist" do the various FPA's require? Where is the
>> break-even point for a more "tailored" approach?
>
> What is an FPA?
Floating Point Accelerator.
>> Note that hardware FPU and software *generic* libraries have to
>> accommodate all sorts of use/abuse. They can't know anything about
>> the data they are being called upon to process so always have to
>> "play it safe". (imagine how much extra work they do when summing
>> a bunch of numbers of similar magnitudes!)
>>
>> I'm hoping someone has either measured hardware vs. software
>> implementations (if not, that's the route I'll be pursuing)
>> *or* looked at the power requirements of each approach...
>
> If you are designing a device for 100k production run, it would seem
> reasonable to do some basic testing and get real answers to your
> questions rather than to ask others for their opinions and biases.
It sure seems *most* efficient to poll others who *might* have similar experiences (the hardware vs. software tradeoff re: floating point -- I suspect *most* of us have made that decision at least a few times in our careers!). I didn't ask for "opinion" or "bias" as both of those suggest an arbitrariness not born of fact.

To be clear, my question was:
---------------------------------------------------VVVVVVVVVVVVV
So, for a specific question: anyone have any *real* metrics
regarding how efficient (power, cost) hardware FPU (or not!)
is in FP-intensive applications?

(by "FP-intensive", assume 20% of the operations performed by the
processor fall into that category).
> Ok, I'm still not clear on your application requirements, but if you
> need some significant amount of computation ability with analog I/O and
> power is a constraint, I know of a device you might want to look at.
>
> The GA144 from Green Arrays is an array of 144 async processors, each of
> which can run instructions at up to 700 MIPS. Floating point would need
> to be software, but that should not be a significant issue in this case.
> The features that could be great for your app are...
>
> 1) Low standby power consumption of 8 uA, active power of 5 mW/processor
> 2) Instant start up on trigger
> 3) Integrated ADC and DAC (5 each) with variable resolution/sample rate
>    (can do 20 kHz at ~15 bits)
> 4) Small device in 1 cm sq 88 pin QFP
> 5) Small processors use little power and suspend in a single instruction
>    time reducing power to 55 nA each with instant wake up.
>
> This device has its drawbacks too. It is programmed in a Forth like
> language which many are not familiar with. The I/Os are 1.8 volts which
> should not be a problem in your app. Each processor is 18 bits with only
> 64 words of memory, not sure what your requirements might be. You can
> hang an external memory on the device. It needs a separate SPI flash to
> boot and for program storage.
>
> The price is in the $10 ball park in lower volumes, not sure what it is
> at 100k units.
>
> One of the claimed apps that has been prototyped on this processor is a
> hearing aid app which requires a pair of TMS320C6xxx processors using a
> watt of power (or was it a watt each?). Sounds a bit like your app. :)
>
> Using this device will require you to forget everything you think you
> know about embedded processors and letting yourself be guided by the
> force. But your app might just be a good one for the GA144.
Try <http://www.hpcwire.com/hpcwire/2012-08-22/adapteva_unveils_64-core_chip.html> -- 100GFLOPS/2W (FLOPS, not IPS).

Of course, the speech synthesizer example is in the sub-40mW (avg) budget (including audio and radio) so ain't gonna work for me!  :>
On 5/26/2014 1:10 AM, Don Y wrote:
> Hi Rick,
>
> On 5/25/2014 9:06 PM, rickman wrote:
>> On 5/25/2014 6:56 PM, Don Y wrote:
>>> On 5/25/2014 3:40 PM, Tim Wescott wrote:
>>>
>>>>> I also don't have lots of experience with floating point, but I would
>>>>> expect if you are doing a lot of floating point the hardware would use
>>>>> less power than a software emulation. I can't imagine the cost
>>>>> would be
>>>>> very significant unless you are building a million of them.
>>>>
>>>> The cost is often more in that the pool of available processors shrinks
>>>> dramatically, and it's hard to get physically small parts.
>>
>> Can't say since "small" is not really anything I can measure. There are
>> very small packages available (some much smaller than I want to work
>> with). I have to assume you mean many of the parts with FP they also
>> have a larger pin count, meaning 100 and up. But unless I have a set of
>> requirements, that is getting into some very hard to compare features.
>
> Things like FPU, MMU, GPU, etc. *tend* to find their way into
> more expensive/capable/larger devices. The thinking seems to be
> that -- once you've "graduated" to that level of complexity -- you
> aren't pinching pennies/watts/mm^3, etc.
Ok, we have left the realm of an engineering discussion. The point is that if FP is useful, use it. If it is not useful, don't use it. But don't assume, before you have actually looked, that you won't be able to find a device with the feature set you want or can use.
>>> Exactly. Especially if the "other" functions that the processor
>>> is performing do not benefit from the extra (hardware) complexity.
>>
>> What? If one part of a design needs an I2C interface you think you
>> should not use a hardware I2C interface because the entire project
>> doesn't need it??? That makes no sense to me.
>
> If the interface carried other baggage with it (size, cost, power)
> and COULDN'T BE IMPLEMENTED SOME OTHER WAY (e.g., FP library can
> produce the exact same results as FPU), then why would you take on
> that extra baggage?
>
> Why not put EVERYTHING into EVERY DESIGN? And, just increase package
> dimensions, power requirements, cost, etc. accordingly...
Yes, but what exactly are you saying...
>>> "advanced RISC machine"
>>>
>>> Would you use a processor with a GPU (G, not F) to control a
>>> CNC lathe? Even if it had a GUI? Or, would you "tough it out"
>>> and write your own blit'er and take the performance knock on
>>> the (largely static!) display tasks?
>>
>> This is a total non-sequitur. Reminds me of a supposed true story where
>> one employee said you get floor area by dividing length by width rather
>> than multiplying. When that was questioned he reasoned with, "How many
>> quarters in a dollar? How many quarters in two dollars?... SEE!" lol
>
> You've missed the point.
>
> You can provide the *functionality* of a GPU (FPU) in software at
> the expense of some execution speed. If you don't *need* that speed
> for the operation of the CNC lathe (i.e., the GPU *might* speed up
> some of the routines for moving TEXT and LINE DRAWINGS around the
> LARGELY STATIC display screen), then why take on that cost?
Well, yeah. Wonderful analogy. Now can we get back to discussing the issue?
> You might grumble that updating the screen takes a full 1/10th of a
> second and COULD BE *SO* MUCH FASTER (with the GPU) but do you think
> the marketing guys are going to brag about a *faster* update rate than
> that? Especially if there are other consequences to this choice?
Ok, you are still in analogy land. I'm happy to discuss the original issue if that is what you want. We have deviated far from my original statement that I would expect floating point instructions to use less power than floating point in software. Of course the devil is in the details and this is just one feature of your design. The chip you choose to use will depend on many factors.

--
Rick
Hey George!

Finally warming up (and drying out) there??  :>  Broke 100F last week... :<
July's gonna be a bitch!

On 5/25/2014 10:06 PM, George Neuner wrote:
> On Sun, 25 May 2014 13:25:40 -0700, Don Y<this@is.not.me.com> wrote:
>
>> OToOH, a non-generic library approach *could*, possibly, eke out
>> a win by eliminating unnecessary operations that an FPU (or a
>> generic library approach) would naively undertake.
>
> Only 3 hands?
The others were busy at the time... (but I reserve the right NOT to disclose what they were doing! :> )
>> So, for a specific question: anyone have any *real* metrics
>> regarding how efficient (power, cost) hardware FPU (or not!)
>> is in FP-intensive applications?
>
> This is being kicked around in comp.arch right now in a wandering
> thread called "RISC versus the Pentium Pro". Haven't seen the numbers
> you're asking for but likely you can get them if you ask nicely.
Thanks, I will look at the thread!
> They are discussing a closely related question involving the tradeoffs
> between providing an all-up FPU (e.g., IEEE-754) vs. providing
> sub-units and allowing software to drive them. I haven't followed the
> whole thread [it's wandering a lot (even for c.a.)] but there have been
> some mentions of break-even points of HW vs. SW for general code.
I think you can get even finer in choosing how little you implement based on application domain (of course, I haven't yet read their claims but still assume they are operating within some "rational" sense of partitioning... e.g., not willing to allow the actual number format to be rendered arbitrary, etc.)

E.g., in the speech synthesizer example (elsewhere), I can point to any operator/operation/argument and *know* what sorts of values it will take on at any time during the life of the algorithm. And, the consequences of trimming precision or range, etc. I'm not sure how easily a generalized solution could be tweaked to shed unnecessary capability in those situations.
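As a concrete illustration (the Q2.14 choice below is an assumption for the example, not the synthesizer's actual format): once analysis has *proven* an operand's range, a fixed-point representation lets plain integer hardware do the work, and the "play it safe" checks a generic library must make simply vanish:

#include <stdint.h>

/* Q2.14 fixed point: value = raw / 2^14, range [-2, 2).
 * Usable only because the value ranges were proven a priori. */
typedef int16_t q2_14;

#define Q14_ONE      (1 << 14)
#define F_TO_Q14(f)  ((q2_14)((f) * Q14_ONE))

static inline q2_14 q14_mul(q2_14 a, q2_14 b)
{
    /* 16x16 -> 32-bit multiply, renormalize; no overflow/NaN/denormal
     * handling -- by construction it can't be needed. */
    return (q2_14)(((int32_t)a * b) >> 14);
}

static inline q2_14 q14_add(q2_14 a, q2_14 b)
{
    return (q2_14)(a + b);   /* no saturation, again by construction */
}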
>> (by "FP-intensive", assume 20% of the operations performed by the
>> processor fall into that category).
>
> A number of comp.arch participants are present/past CPU designers.
> Quite a few others come from HPC ... when they are talking about FP
> intensive code, they mean 70+%.
Well, I *do* have other things to do besides crunch numbers!  :>

Hope you are well. Really busy, here!  :-/

--don
On Sun, 25 May 2014 13:25:40 -0700, Don Y <this@is.not.me.com> wrote:


>I'm exploring tradeoffs in implementation of some computationally
>expensive routines.
>
>The easy (coding) solution is to just use doubles everywhere and
>*assume* the noise floor is sufficiently far down that the ulp's
>don't impact the results in any meaningful way.
>
>But, that requires either hardware support (FPU) or a library
>implementation or some "trickery" on my part.
>
>Hardware FPU adds cost and increases average power consumption
>(for a generic workload). It also limits the choices I have
>(severe cost and space constraints).
Also verify that the FPU supports 64-bit doubles in hardware, not just 32-bit floats.
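On ARM toolchains, one hedge is to catch this at compile time; a sketch using the ACLE __ARM_FP feature macro (treat the details as toolchain-dependent and verify against your own documentation):

/* __ARM_FP is a bit mask per ARM's ACLE spec: 0x4 = single precision
 * in hardware, 0x8 = double precision in hardware. */
#if defined(__ARM_FP)
#  if (__ARM_FP & 0x8) == 0
#    warning "FPU is single-precision only: doubles will be emulated"
#  endif
#else
#  warning "No hardware FP: all floating point will be emulated"
#endif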
>OTOH, a straight-forward library implementation burns more CPU
>cycles to achieve the same result. Eventually, I will have to
>instrument a design to see where the tipping point lies -- how
>many transistors are switching in each case, etc.
If you do not need strict IEEE float/double conformance and can live without denormals, infinity, and NaN handling, those libraries can be simplified somewhat.
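To give a feel for how much drops out, here is a sketch of a binary32 multiply that assumes normal (non-zero, non-denormal) inputs, flushes underflow to zero, saturates instead of producing infinity, and truncates instead of rounding -- an illustration of the simplification, not a production routine:

#include <stdint.h>

uint32_t fmul32_fast(uint32_t a, uint32_t b)  /* raw binary32 bits in/out */
{
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)
                  + (int32_t)((b >> 23) & 0xFF) - 127;

    /* restore the implicit leading 1 on each 24-bit significand */
    uint64_t p = (uint64_t)((a & 0x7FFFFFu) | 0x800000u)
               * (uint64_t)((b & 0x7FFFFFu) | 0x800000u);

    if (p & (1ull << 47)) { p >>= 24; exp++; }  /* product in [2,4) */
    else                  { p >>= 23; }         /* product in [1,2) */

    if (exp <= 0)   return sign;                /* flush to zero    */
    if (exp >= 255) return sign | 0x7F7FFFFFu;  /* saturate, no Inf */

    return sign | ((uint32_t)exp << 23) | ((uint32_t)p & 0x7FFFFFu);
}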
>Fixed point solutions mean a lot more up-front work verifying
>no loss of precision throughout the calculations. Do-able but
>a nightmare for anyone having to maintain the codebase.
Perhaps some other FP format would be suitable for emulation, like the 48 bit (6 byte) Borland Turbo Pascal Real data type, which uses integer arithmetic more efficiently.

One needs to look carefully at the integer instruction set of the processor. FMUL is easy: it just needs a fast NxN integer multiplication and some auxiliary instructions. FADD/FSUB are more complicated, requiring a fast right shift by a variable number of bits for denormalization and a fast find-first-bit-set instruction for normalization. Without these instructions, you may have to do up to 64 iteration cycles in a loop with a shift right/left instruction and some conditional instructions, which can take a lot of time.
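A sketch of the normalization step in question (illustrative; assumes a nonzero significand). With a count-leading-zeros primitive it is effectively one instruction; without it, you pay for the loop described above:

#include <stdint.h>

/* Shift a nonzero significand up until its MSB is in bit 31 and
 * return the shift count (to be subtracted from the exponent). */
static int normalize(uint32_t *sig)
{
    int shift;
#if defined(__GNUC__)
    shift = __builtin_clz(*sig);         /* maps to CLZ where available */
#else
    shift = 0;
    for (uint32_t s = *sig; !(s & 0x80000000u); s <<= 1)
        shift++;                         /* up to 31 iterations without CLZ */
#endif
    *sig <<= shift;
    return shift;
}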
On 5/26/2014 2:09 AM, Don Y wrote:
> Hi Rick,
>
> On 5/25/2014 9:46 PM, rickman wrote:
>
>> The GA144 from Green Arrays is an array of 144 async processors, each of
>> which can run instructions at up to 700 MIPS. Floating point would need
>> to be software, but that should not be a significant issue in this case.
>> The features that could be great for your app are...
>>
>> 1) Low standby power consumption of 8 uA, active power of 5 mW/processor
>> 2) Instant start up on trigger
>> 3) Integrated ADC and DAC (5 each) with variable resolution/sample rate
>>    (can do 20 kHz at ~15 bits)
>> 4) Small device in 1 cm sq 88 pin QFP
>> 5) Small processors use little power and suspend in a single instruction
>>    time reducing power to 55 nA each with instant wake up.
>>
>> This device has its drawbacks too. It is programmed in a Forth like
>> language which many are not familiar with. The I/Os are 1.8 volts which
>> should not be a problem in your app. Each processor is 18 bits with only
>> 64 words of memory, not sure what your requirements might be. You can
>> hang an external memory on the device. It needs a separate SPI flash to
>> boot and for program storage.
>>
>> The price is in the $10 ball park in lower volumes, not sure what it is
>> at 100k units.
>>
>> One of the claimed apps that has been prototyped on this processor is a
>> hearing aid app which requires a pair of TMS320C6xxx processors using a
>> watt of power (or was it a watt each?). Sounds a bit like your app. :)
>>
>> Using this device will require you to forget everything you think you
>> know about embedded processors and letting yourself be guided by the
>> force. But your app might just be a good one for the GA144.
>
> Try
> <http://www.hpcwire.com/hpcwire/2012-08-22/adapteva_unveils_64-core_chip.html>
> -- 100GFLOPS/2W (FLOPS, not IPS)
>
> Of course, the speech synthesizer example is in the sub-40mW (avg)
> budget (including audio and radio) so ain't gonna work for me! :>
What won't work for you, the GA144 or the 100GFLOPS unit? As I mentioned, the GA144 has already been evaluated by someone for a hearing aid app which is very low power. 40 mW is not at all unreasonable for an audio app on this device.

--
Rick