Reply by dp January 7, 2013
On Jan 7, 5:50 pm, Mark Borgerson <mborger...@comcast.net> wrote:
> ....
> When I get time, I'll clean up my test code and do a couple
> of variants that pare things down to the minimum set of operations.
> I'll do variants that concentrate on floating point multiply and
> divide. Since divide is multi-cycle, it should cause more pipeline
> stalls and may show a different power result than multiply.
Basically this is work to be done in assembly. What counts is not just
the multiply itself but also the data dependencies; these can affect the
multiply/add performance several times over, basically as many times as
there are pipeline stages involved in the opcode under test. I would
expect this to influence the power consumption (though I have never
measured it, as opposed to the data dependencies, which I did have to
eliminate on a Power core to get all of its performance out :-) ).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
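A concrete way to see the dependency effect in C (a hypothetical sketch,
not something measured in this thread; the function names and the
four-way split are arbitrary choices). The dependent chain serializes on
the result latency of each multiply, while the independent accumulators
let successive multiplies overlap in the FPU pipeline:

    #include <stdio.h>

    /* Dependent chain: each multiply needs the previous result, so the
       pipeline waits out the full latency of every operation. */
    static float mul_dependent(float x, int n)
    {
        float acc = 1.0001f;
        for (int i = 0; i < n; i++)
            acc = acc * x;          /* result feeds the next multiply */
        return acc;
    }

    /* Independent chains: four accumulators with no dependency between
       them, so new multiplies can issue while earlier ones complete. */
    static float mul_independent(float x, int n)
    {
        float a0 = 1.0001f, a1 = 1.0002f, a2 = 1.0003f, a3 = 1.0004f;
        for (int i = 0; i < n; i += 4) {
            a0 *= x; a1 *= x; a2 *= x; a3 *= x;
        }
        return a0 + a1 + a2 + a3;
    }

    int main(void)
    {
        volatile float sink;   /* keep the optimizer from deleting the loops */
        sink = mul_dependent(1.000001f, 1000000);
        sink = mul_independent(1.000001f, 1000000);
        (void)sink;
        return 0;
    }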
Reply by Mark Borgerson January 7, 2013
In article <50EAD116.510D7A0E@bytecraft.com>, walter@bytecraft.com 
says...
>
> Mark Borgerson wrote:
>
> > In a recent thread Jon Kirwan and I were discussing FPUs and power
> > consumption. I decided to try some real world tests on an
> > STM32F4 Discovery board. After a few tests in the ChiBios
> > RTOS, where I discovered that you can save a lot of power by
> > doing floating point math with the FPU and shutting off the
> > CPU clock in the idle process, I decided to try to measure
> > the power using software and hardware floating point without
> > the RTOS. I initialized the CPU clock to 168MHz and ran this code:
> >
> > // ChiBios calls commented out to run without OS
> > static msg_t ThreadMath(void *arg) {
> >   float sinetable[360], fval;
> >   int i,j;
> >   systime_t start, end;
> >   msg_t mathmsg;
> >   long mathloop = 0;
> >   (void)arg;
> >   // chRegSetThreadName("Math");
> >   while (TRUE) {
> >     // mathmsg = chBSemWait(&MathSemaphore);
> >     // start = chTimeNow();
> >     for(j= 0; j<200; j++){
> >       fval = 0.00001*j;
> >       for(i= 0; i<100; i++){
> >         // sinf function skips casting to double
> >         sinetable[i] = sinf(fval);
> >         fval += 3.141529/360.0;
> >       }
> >     }
> >     // end = chTimeNow();
> >     mathloop++;
> >     // printf("Math loop took %lu ticks\n", end-start);
> >     // chThdSleepMilliseconds(2);
> >   }
> > }
> >
> > Code profiling does show that the processor is spending all its
> > time in the math loop computing and storing sine values.
> >
> > Here's the puzzling part:
> >
> > Using FPU for floating point: 49.9mA
> > Using software floating point: 55.1mA
> >
> > Why does the CPU use LESS power doing floating point math
> > in the FPU???
> >
> > Mark Borgerson
>
> I have been following this thread with interest because it
> is similar in approach to power experiments we have done
> characterizing instructions.
>
> Your results are puzzling and you might want to contact
> ST privately and see what they have to say. The Discovery
> board has been used quite a bit recently in the ST promotional
> seminars and one of the demos is a time and size difference
> between FPU compiled code and the same source using floating
> point libraries.
When I get time, I'll clean up my test code and do a couple of variants
that pare things down to the minimum set of operations. I'll do variants
that concentrate on floating point multiply and divide. Since divide is
multi-cycle, it should cause more pipeline stalls and may show a
different power result than multiply. During the paring-down process,
I'll make sure that compiler optimizations don't eliminate the math
functions! ;-)
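A sketch of the sort of pared-down loop described above (hypothetical
and untested, not the actual test code; the volatile globals are one way
to keep the optimizer from folding or deleting the math):

    #include <stdint.h>

    /* Pared-down kernels: nothing but FP multiplies or divides in the
       loop.  Volatile operands stop constant folding and dead-code
       elimination of the arithmetic. */
    volatile float vf_in = 1.000001f;
    volatile float vf_out;

    static void burn_multiply(uint32_t iterations)
    {
        float x = vf_in;
        float acc = 1.0f;
        while (iterations--)
            acc *= x;           /* single-cycle VMUL.F32 on the M4 FPU */
        vf_out = acc;
    }

    static void burn_divide(uint32_t iterations)
    {
        float x = vf_in;
        float acc = 1.0e30f;
        while (iterations--)
            acc /= x;           /* multi-cycle VDIV.F32, roughly a dozen cycles */
        vf_out = acc;
    }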
>
> As others have suggested the only explanation for the
> difference that I can see is wait states for FPU. When
> total power for a given task is factored in (volts*current*time)
> then a different picture will likely emerge.
>
When I used the RTOS with an idle task that shut off the CPU clock,
power was definitely lower when using the FPU.

Mark Borgerson
Reply by Walter Banks January 7, 2013

Mark Borgerson wrote:

> In a recent thread Jon Kirwan and I were discussing FPUs and power
> consumption. I decided to try some real world tests on an
> STM32F4 Discovery board. After a few tests in the ChiBios
> RTOS, where I discovered that you can save a lot of power by
> doing floating point math with the FPU and shutting off the
> CPU clock in the idle process, I decided to try to measure
> the power using software and hardware floating point without
> the RTOS. I initialized the CPU clock to 168MHz and ran this code:
>
> // ChiBios calls commented out to run without OS
> static msg_t ThreadMath(void *arg) {
>   float sinetable[360], fval;
>   int i,j;
>   systime_t start, end;
>   msg_t mathmsg;
>   long mathloop = 0;
>   (void)arg;
>   // chRegSetThreadName("Math");
>   while (TRUE) {
>     // mathmsg = chBSemWait(&MathSemaphore);
>     // start = chTimeNow();
>     for(j= 0; j<200; j++){
>       fval = 0.00001*j;
>       for(i= 0; i<100; i++){
>         // sinf function skips casting to double
>         sinetable[i] = sinf(fval);
>         fval += 3.141529/360.0;
>       }
>     }
>     // end = chTimeNow();
>     mathloop++;
>     // printf("Math loop took %lu ticks\n", end-start);
>     // chThdSleepMilliseconds(2);
>   }
> }
>
> Code profiling does show that the processor is spending all its
> time in the math loop computing and storing sine values.
>
> Here's the puzzling part:
>
> Using FPU for floating point: 49.9mA
> Using software floating point: 55.1mA
>
> Why does the CPU use LESS power doing floating point math
> in the FPU???
>
> Mark Borgerson
I have been following this thread with interest because it is similar
in approach to power experiments we have done characterizing
instructions.

Your results are puzzling and you might want to contact ST privately
and see what they have to say. The Discovery board has been used quite
a bit recently in the ST promotional seminars and one of the demos is a
time and size difference between FPU compiled code and the same source
using floating point libraries.

As others have suggested, the only explanation for the difference that
I can see is wait states for the FPU. When total power for a given task
is factored in (volts*current*time) then a different picture will
likely emerge.

Walter Banks
Reply by January 6, 2013
Mark Borgerson <mborgerson@comcast.net> wrote:
> In article <kcag62$buq$1@speranza.aioe.org>,
> Anders.Montonen@kapsi.spam.stop.fi.invalid says...
>> Mark Borgerson <mborgerson@comcast.net> wrote:
>>
>> > Why does the CPU use LESS power doing floating point math
>> > in the FPU???
>>
>> Pure speculation, but since many of the floating-point instructions take
>> multiple cycles to complete, the CPU pipeline may spend more time
>> stalled, which in turn means the flash interface is activated less
>> often.
> I guess that's a possibility. While a FP multiply is just one cycle,
> an FP divide is 12. The sine function and the loop code do use
> divide instructions.
I had a look in the data sheet for the STM32F405xx/407xx, and the
current consumption characteristics on pages 77-78 give the following
figures for running at 168MHz with all peripherals disabled:

* With flash accelerator OFF: 46mA typ, 61mA max
* With flash accelerator ON: 40mA typ, 54mA max

This would support the idea that flash memory accesses at least play a
part in the overall power consumption.

The ARM embedded trace macrocell has a performance counter specifically
for measuring multi-cycle instruction and instruction fetch stalls
which you could use to test whether the hardware FP code actually
stalls significantly more than the emulated code.

-a
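For what it's worth, on the Cortex-M4 those stall counters are exposed
through the DWT unit and are reachable via the CMSIS definitions. A
rough, untested sketch of enabling and reading them follows; note the
CPI and LSU counters are only 8 bits wide, so in practice you sample
them around a short, fixed-size burst of the code under test:

    #include "stm32f4xx.h"   /* device header; pulls in the CMSIS core
                                definitions for DWT and CoreDebug      */

    /* Enable counters for cycles lost to multi-cycle instructions (CPI)
       and to load/store waits (LSU), plus the free-running cycle count. */
    static void stall_counters_start(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable trace block */
        DWT->CYCCNT = 0;
        DWT->CPICNT = 0;
        DWT->LSUCNT = 0;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk |
                      DWT_CTRL_CPIEVTENA_Msk |
                      DWT_CTRL_LSUEVTENA_Msk;
    }

    /* Read the counters; CPICNT and LSUCNT wrap at 256, so keep the
       measured burst short enough that they cannot overflow. */
    static void stall_counters_read(uint32_t *cycles, uint32_t *cpi,
                                    uint32_t *lsu)
    {
        *cycles = DWT->CYCCNT;
        *cpi    = DWT->CPICNT & 0xFF;
        *lsu    = DWT->LSUCNT & 0xFF;
    }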
Reply by January 6, 2013
Mark Borgerson <mborgerson@comcast.net> wrote:
> In article <kcag62$buq$1@speranza.aioe.org>,
> Anders.Montonen@kapsi.spam.stop.fi.invalid says...
>> Pure speculation, but since many of the floating-point instructions take
>> multiple cycles to complete, the CPU pipeline may spend more time
>> stalled, which in turn means the flash interface is activated less
>> often.
> I guess that's a possibility. While a FP multiply is just one cycle,
> an FP divide is 12. The sine function and the loop code do use
> divide instructions.
If the chip you're using allows it, you could try rearranging the test
to run entirely from RAM.

-a
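One rough way to do that with GCC, assuming the linker script provides
a ".ramfunc" output section that the startup code copies from flash
into SRAM (the section name and attributes vary by toolchain, and the
function name here is just a placeholder). Note that library calls such
as sinf() will still fetch from flash unless the library code is
relocated as well:

    #include <math.h>

    /* Place the math loop in SRAM so instruction fetches for the loop
       itself never touch flash. */
    __attribute__((section(".ramfunc"), long_call, noinline))
    void math_loop_from_ram(float *sinetable)
    {
        float fval;
        for (int j = 0; j < 200; j++) {
            fval = 0.00001f * j;
            for (int i = 0; i < 100; i++) {
                sinetable[i] = sinf(fval);      /* still a flash call */
                fval += 3.141593f / 360.0f;
            }
        }
    }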
Reply by Mark Borgerson January 6, 2013
In article <kccea2$mr$1@dont-email.me>, news.x.richarddamon@xoxy.net 
says...
>
> On 1/5/13 11:25 PM, Mark Borgerson wrote:
> > In article <kcatlq$po5$1@dont-email.me>, news.x.richarddamon@xoxy.net
> > says...
> >>
> >> On 1/5/13 12:59 PM, Mark Borgerson wrote:
> >>
> >>> Here's the puzzling part:
> >>>
> >>> Using FPU for floating point: 49.9mA
> >>> Using software floating point: 55.1mA
> >>>
> >>> Why does the CPU use LESS power doing floating point math
> >>> in the FPU???
> >>>
> >>> Mark Borgerson
> >>
> >> My guess would be due to a couple of factors:
> >>
> >> 1) The FPU is quicker, so the CPU will be spending more time in the IDLE
> >> state, where the power consumption is a lot less.
> > I'm not sure this matches the test conditions. Both with and without
> > the fpu, the software was in a continuous loop that computed and stored
> > sine values. There was no idle state.
> >>
> >> 2) The FPU is probably a lot more efficient in the number of electrons
> >> needed to do the operation than the software emulation. On a per
> >> microsecond basis, the FPU may use more power when it is running than
> >> the integer ALU, but it may well need less energy to do the full
> >> computation.
> >
> > But both computations were running continuously---but the FPU loop
> > does cycle more times per second.
> >
> > Mark Borgerson
>
> In the initial post, the OP said
>
> > After a few tests in the ChiBios
> > RTOS, where I discovered that you can save a lot of power by
> > doing floating point math with the FPU and shutting off the
> > CPU clock in the idle process,
>
That was the initial test. In the later test, with the code shown in the post, there was no RTOS active, just a continuous loop computing and storing the sine values.
> So there appears to be a fixed amount of computations to be done per
> unit time and the processor being put to sleep in between. Under this
> condition, it makes sense that the FPU will save power, as it is more
> efficient in doing the calculation, being designed for it.
>
> If the choice is doing more calculations per unit time with the FPU
> versus not, then the power per unit time likely goes up, but the energy
> used per unit of calculation should still be lower. Since normally there
> IS a fixed amount of processing to do in an embedded system, using the
> FPU can be a power savings (as long as you can use it enough that its
> "idle" power doesn't eat up the savings when you are not using it).
I agree with this---and it was demonstrated in the initial test using
the RTOS, where the power went down by about 50% with the CPU idle
between calculation loops. The mystery is why the power is lower using
the FPU when the calculations are in an infinite loop with no idle
state between loops.

Mark Borgerson
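For reference, the idle-state saving in that RTOS test boils down to
executing a wait-for-interrupt whenever there is nothing left to do. A
bare-metal sketch of the same idea, paced by SysTick instead of the
ChibiOS idle hook (the do_math_burst() function and the 10 ms tick are
purely illustrative placeholders):

    #include "stm32f4xx.h"   /* device header, also provides __WFI() */

    extern void do_math_burst(void);   /* placeholder: fixed FP workload */

    volatile uint32_t tick_flag;

    void SysTick_Handler(void)
    {
        tick_flag = 1;                 /* wake the main loop once per tick */
    }

    int main(void)
    {
        SysTick_Config(SystemCoreClock / 100);   /* 10 ms tick at 168 MHz */

        for (;;) {
            do_math_burst();           /* fixed amount of FP work per tick */
            while (!tick_flag)
                __WFI();               /* stop the core clock until an interrupt */
            tick_flag = 0;
        }
    }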
Reply by Richard Damon January 6, 2013
On 1/5/13 11:25 PM, Mark Borgerson wrote:
> In article <kcatlq$po5$1@dont-email.me>, news.x.richarddamon@xoxy.net
> says...
>>
>> On 1/5/13 12:59 PM, Mark Borgerson wrote:
>>
>>> Here's the puzzling part:
>>>
>>> Using FPU for floating point: 49.9mA
>>> Using software floating point: 55.1mA
>>>
>>> Why does the CPU use LESS power doing floating point math
>>> in the FPU???
>>>
>>> Mark Borgerson
>>
>> My guess would be due to a couple of factors:
>>
>> 1) The FPU is quicker, so the CPU will be spending more time in the IDLE
>> state, where the power consumption is a lot less.
> I'm not sure this matches the test conditions. Both with and without
> the fpu, the software was in a continuous loop that computed and stored
> sine values. There was no idle state.
>>
>> 2) The FPU is probably a lot more efficient in the number of electrons
>> needed to do the operation than the software emulation. On a per
>> microsecond basis, the FPU may use more power when it is running than
>> the integer ALU, but it may well need less energy to do the full
>> computation.
>
> But both computations were running continuously---but the FPU loop
> does cycle more times per second.
>
> Mark Borgerson
>
In the initial post, the OP said
> After a few tests in the ChiBios
> RTOS, where I discovered that you can save a lot of power by
> doing floating point math with the FPU and shutting off the
> CPU clock in the idle process,
So there appears to be a fixed amount of computations to be done per
unit time and the processor being put to sleep in between. Under this
condition, it makes sense that the FPU will save power, as it is more
efficient in doing the calculation, being designed for it.

If the choice is doing more calculations per unit time with the FPU
versus not, then the power per unit time likely goes up, but the energy
used per unit of calculation should still be lower. Since normally
there IS a fixed amount of processing to do in an embedded system,
using the FPU can be a power savings (as long as you can use it enough
that its "idle" power doesn't eat up the savings when you are not
using it).
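A quick back-of-the-envelope illustration of that energy view, using
Mark's two current readings together with an assumed 3.0 V supply and a
purely illustrative 5x speedup (neither of those last two figures was
measured in this thread):

    #include <stdio.h>

    /* Energy per batch of work = V * I * t.  Only the two currents come
       from the measurements; voltage and speedup are assumptions. */
    int main(void)
    {
        const double volts  = 3.0;            /* nominal VDD, assumed        */
        const double i_fpu  = 0.0499;         /* A, measured with the FPU    */
        const double i_soft = 0.0551;         /* A, measured with soft float */
        const double t_soft = 1.0;            /* s per batch, normalized     */
        const double t_fpu  = t_soft / 5.0;   /* assumed 5x speedup          */

        printf("soft float: %.1f mJ per batch\n", volts * i_soft * t_soft * 1e3);
        printf("FPU:        %.1f mJ per batch\n", volts * i_fpu  * t_fpu  * 1e3);
        /* -> roughly 165 mJ vs 30 mJ: the energy gap is far larger than
           the 5.2 mA difference in instantaneous current suggests. */
        return 0;
    }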
Reply by Boo January 6, 2013
On 05/01/2013 17:59, Mark Borgerson wrote:
> Here's the puzzling part:
>
> Using FPU for floating point: 49.9mA
> Using software floating point: 55.1mA
>
> Why does the CPU use LESS power doing floating point math
> in the FPU???
Because the FPU is specifically designed for floating point, and it
performs those operations more efficiently in terms of fetched
instructions, changed register bits, etc., than a software
implementation can.

Boo2
Reply by Mark Borgerson January 6, 2013
In article <kcatlq$po5$1@dont-email.me>, news.x.richarddamon@xoxy.net 
says...
>
> On 1/5/13 12:59 PM, Mark Borgerson wrote:
>
> > Here's the puzzling part:
> >
> > Using FPU for floating point: 49.9mA
> > Using software floating point: 55.1mA
> >
> > Why does the CPU use LESS power doing floating point math
> > in the FPU???
> >
> > Mark Borgerson
>
> My guess would be due to a couple of factors:
>
> 1) The FPU is quicker, so the CPU will be spending more time in the IDLE
> state, where the power consumption is a lot less.
I'm not sure this matches the test conditions. Both with and without the fpu, the software was in a continuous loop that computed and stored sine values. There was no idle state.
>
> 2) The FPU is probably a lot more efficient in the number of electrons
> needed to do the operation than the software emulation. On a per
> microsecond basis, the FPU may use more power when it is running than
> the integer ALU, but it may well need less energy to do the full
> computation.
But both computations were running continuously---the FPU loop just
cycles more times per second.

Mark Borgerson
Reply by Richard Damon January 6, 2013
On 1/5/13 12:59 PM, Mark Borgerson wrote:

> Here's the puzzling part:
>
> Using FPU for floating point: 49.9mA
> Using software floating point: 55.1mA
>
> Why does the CPU use LESS power doing floating point math
> in the FPU???
>
> Mark Borgerson
>
My guess would be due to a couple of factors:

1) The FPU is quicker, so the CPU will be spending more time in the IDLE
state, where the power consumption is a lot less.

2) The FPU is probably a lot more efficient in the number of electrons
needed to do the operation than the software emulation. On a per
microsecond basis, the FPU may use more power when it is running than
the integer ALU, but it may well need less energy to do the full
computation.