Puzzling power results STM32F4 FPU test

In a recent thread Jon Kirwan and I were discussing FPUs and power
consumption. I decided to try some real-world tests on an
STM32F4 Discovery board. After a few tests in the ChibiOS
RTOS, where I discovered that you can save a lot of power by
doing floating point math with the FPU and shutting off the
CPU clock in the idle process, I decided to measure
the power using software and hardware floating point without
the RTOS. I initialized the CPU clock to 168MHz and ran this code:

//  ChibiOS calls commented out to run without OS
static msg_t ThreadMath(void *arg) {
	float sinetable[360], fval;
	int i,j;
	systime_t start, end;

	msg_t mathmsg;
	long mathloop = 0;
	(void)arg;
//	chRegSetThreadName("Math");
	while (TRUE) {
//		mathmsg = 	chBSemWait(&MathSemaphore);	
//		start = chTimeNow();		
		
		for(j= 0; j<200; j++){
			fval = 0.00001*j;	
			for(i= 0; i<100; i++){
				// sinf function skips casting to double
				sinetable[i] = sinf(fval);
				fval += 3.141529/360.0;	

			}
		}
//		end = chTimeNow();
		mathloop++;

//		printf("Math loop took %lu ticks\n", end-start);
//		chThdSleepMilliseconds(2);			
	}
}

Code profiling does show that the processor is spending all its
time in the math loop computing and storing sine values.

Here's the puzzling part:

Using FPU for floating point:  49.9mA
Using software floating point: 55.1mA

Why does the CPU use LESS power doing floating point math
in the FPU???


Mark Borgerson

Reply by Frank Miles ● January 5, 2013

On Sat, 05 Jan 2013 09:59:47 -0800, Mark Borgerson wrote:

> In a recent thread Jon Kirwan and I were discussing FPUs and power
> consumption.  I decided to try some real world tests on an STM32F4
> Discovery board.  After a few tests in the ChibiOS RTOS, where I
> discovered that you can save a lot of power by doing floating point math
> with the FPU and shutting off the CPU clock in the idle process,  I
> decided to try to measure the power using software and hardware floating
> point without the RTOS.  I initialized the CPU clock to 168MHz and ran
> this code:
> 
> //  ChibiOS calls commented out to run without OS
> static msg_t ThreadMath(void *arg) {
> 	float sinetable[360], fval;
> 	int i,j;
> 	systime_t start, end;
> 
> 	msg_t mathmsg;
> 	long mathloop = 0;
> 	(void)arg;
> //	chRegSetThreadName("Math");
> 	while (TRUE) {
> //		mathmsg = 	chBSemWait(&MathSemaphore);
> //		start = chTimeNow();
> 		
> 		for(j= 0; j<200; j++){
> 			fval = 0.00001*j;
> 			for(i= 0; i<100; i++){
> 				// sinf function skips casting to double
> 				sinetable[i] = sinf(fval);
> 				fval += 3.141529/360.0;
> 
> 			}
> 		}
> //		end = chTimeNow();
> 		mathloop++;
> 
> //		printf("Math loop took %lu ticks\n", end-start);
> //		chThdSleepMilliseconds(2);
> 	}
> }
> 
> Code profiling does show that the processor is spending all its time in
> the math loop computing and storing sine values.
> 
> Here's the puzzling part:
> 
> Using FPU for floating point:  49.9mA
> Using software floating point: 55.1mA
> 
> Why does the CPU use LESS power doing floating point math in the FPU???
> 
> 
> Mark Borgerson

Did you check the current as a function of time (i.e. with a current 
probe and 'scope)?  The obvious reason is that the FPU does the job 
faster, so you spend more than enough time in sleep to make up for the 
higher current consumed by the FPU.  BTW - does the FPU get completely 
turned off when you are not going to use it?

If you don't have a current probe, you could set up your system to 
exercise floating point calculations continuously {either using FPU or 
CPU/software}.  That should get you the results you expect.

Reply by Mark Borgerson ● January 5, 2013

In article <kca14k$p94$1@dont-email.me>, fpm@u.washington.edu says...
> 
> On Sat, 05 Jan 2013 09:59:47 -0800, Mark Borgerson wrote:
> 
> > In a recent thread Jon Kirwan and I were discussing FPUs and power
> > consumption.  I decided to try some real world tests on an STM32F4
> > Discovery board.  After a few tests in the ChibiOS RTOS, where I
> > discovered that you can save a lot of power by doing floating point math
> > with the FPU and shutting off the CPU clock in the idle process,  I
> > decided to try to measure the power using software and hardware floating
> > point without the RTOS.  I initialized the CPU clock to 168MHz and ran
> > this code:
> > 
> > //  ChibiOS calls commented out to run without OS
> > static msg_t ThreadMath(void *arg) {
> > 	float sinetable[360], fval;
> > 	int i,j;
> > 	systime_t start, end;
> > 
> > 	msg_t mathmsg;
> > 	long mathloop = 0;
> > 	(void)arg;
> > //	chRegSetThreadName("Math");
> > 	while (TRUE) {
> > //		mathmsg = 	chBSemWait(&MathSemaphore);
> > //		start = chTimeNow();
> > 		
> > 		for(j= 0; j<200; j++){
> > 			fval = 0.00001*j;
> > 			for(i= 0; i<100; i++){
> > 				// sinf function skips casting to double
> > 				sinetable[i] = sinf(fval);
> > 				fval += 3.141529/360.0;
> > 
> > 			}
> > 		}
> > //		end = chTimeNow();
> > 		mathloop++;
> > 
> > //		printf("Math loop took %lu ticks\n", end-start);
> > //		chThdSleepMilliseconds(2);
> > 	}
> > }
> > 
> > Code profiling does show that the processor is spending all its time in
> > the math loop computing and storing sine values.
> > 
> > Here's the puzzling part:
> > 
> > Using FPU for floating point:  49.9mA
> > Using software floating point: 55.1mA
> > 
> > Why does the CPU use LESS power doing floating point math in the FPU???
> > 
> > 
> > Mark Borgerson
> 
> Did you check the current as a function of time (i.e. with a current 
> probe and 'scope)?  The obvious reason is that the FPU does the job 
> faster, so you spend more than enough time in sleep to make up for the 
> higher current consumed by the FPU.  BTW - does the FPU get completely 
> turned off when you are not going to use it?

The function that fills in the table of sine values runs continuously--
the CPU should never go to sleep.

I get the expected reduction in power when using an RTOS where the
sine function is intermittent and the CPU sleeps between activations.
> 
> If you don't have a current probe, you could set up your system to 
> exercise floating point calculations continuously {either using FPU or 
> CPU/software}.  That should get you the results you expect.

That's what I did with the code above.

Mark Borgerson

Reply by Waldek Hebisch ● January 5, 2013

Mark Borgerson <mborgerson@comcast.net> wrote:
> 
> In a recent thread Jon Kirwan and I were discussing FPUs and power
> consumption.  I decided to try some real world tests on an
> STM32F4 Discovery board.  After a few tests in the ChibiOS
> RTOS, where I discovered that you can save a lot of power by
> doing floating point math with the FPU and shutting off the
> CPU clock in the idle process,  I decided to try to measure
> the power using software and hardware floating point without
> the RTOS.  I initialized the CPU clock to 168MHz and ran this code:
> 
<snip>

> Code profiling does show that the processor is spending all its
> time in the math loop computing and storing sine values.
> 
> Here's the puzzling part:
> 
> Using FPU for floating point:  49.9mA
> Using software floating point: 55.1mA
> 
> Why does the CPU use LESS power doing floating point math
> in the FPU???

Wild guess: FPU instructions take more time to execute, so the CPU
is probably executing fewer instructions when using the FPU.
In other words, the CPU may be spending a lot of cycles stalled
waiting on the FPU.  Less work in the integer part of the CPU may
give a power saving.
 
-- 
                              Waldek Hebisch
hebisch@math.uni.wroc.pl

Reply by Anders Montonen ● January 5, 2013

Mark Borgerson <mborgerson@comcast.net> wrote:

> Why does the CPU use LESS power doing floating point math
> in the FPU???

Pure speculation, but since many of the floating-point instructions take 
multiple cycles to complete, the CPU pipeline may spend more time 
stalled, which in turn means the flash interface is activated less 
often.

Also, just to avoid mistakes I've made myself, I trust you have verified 
that the compiler emits floating-point instructions, and that the 
hardware-float version of the math library is linked?

-a

Reply by Frank Miles ● January 5, 2013

On Sat, 05 Jan 2013 15:35:15 -0800, Mark Borgerson wrote:

>> > Why does the CPU use LESS power doing floating point math in the
>> > FPU???
>> > 
>> > 
>> > Mark Borgerson
>> 
>> Did you check the current as a function of time (i.e. with a current
>> probe and 'scope)?  The obvious reason is that the FPU does the job
>> faster, so you spend more than enough time in sleep to make up for the
>> higher current consumed by the FPU.  BTW - does the FPU get completely
>> turned off when you are not going to use it?
> 
> The function that fills in the table of sine values runs continuously--
> the CPU should never go to sleep.
> 
> I get the expected reduction in power when using an RTOS where the sine
> function is intermittent and the CPU sleeps between activations.
>> 
>> If you don't have a current probe, you could set up your system to
>> exercise floating point calculations continuously {either using FPU or
>> CPU/software}.  That should get you the results you expect.
> 
> That's what I did with the code above.
> 
> Mark Borgerson

Ah, could you hold on a moment while I find some hole to crawl into, 
preferably one with a remedial reading class?

Sorry, guess I'm clueless today.

To repeat one point - are you sure that the FPU is completely turned off
when you're not going to be using it?  Hopefully there's some way to be
sure this is happening.

Reply by Mark Borgerson ● January 5, 2013

In article <kcag62$buq$1@speranza.aioe.org>, 
Anders.Montonen@kapsi.spam.stop.fi.invalid says...
> 
> Mark Borgerson <mborgerson@comcast.net> wrote:
> 
> > Why does the CPU use LESS power doing floating point math
> > in the FPU???
> 
> Pure speculation, but since many of the floating-point instructions take 
> multiple cycles to complete, the CPU pipeline may spend more time 
> stalled, which in turn means the flash interface is activated less 
> often.
I guess that's a possibility. While an FP multiply is just one cycle,
an FP divide is 14.  The sine function and the loop code do use
divide instructions.
> 
> Also, just to avoid mistakes I've made myself, I trust you have verified 
> that the compiler emits floating-point instructions, and that the 
> hardware-float version of the math library is linked?
> 
Yes, I checked the instruction codes in the assembly display of
the C-SPY debugger.  It does use the FPU, and the 8X faster
performance on other test code supports that.

One other hypothesis that I've come up with is that the
software FP pushes and pops more stuff on and off the stack doing
the same work that the hardware FP does with a single transfer
to the FPU registers.  Perhaps moving all those registers to and
from RAM uses more energy.

Mark Borgerson

Reply by Mark Borgerson ● January 5, 2013

In article <kcag71$t97$1@dont-email.me>, fpm@u.washington.edu says...
> 
> On Sat, 05 Jan 2013 15:35:15 -0800, Mark Borgerson wrote:
> 
> >> > Why does the CPU use LESS power doing floating point math in the
> >> > FPU???
> >> > 
> >> > 
> >> > Mark Borgerson
> >> 
> >> Did you check the current as a function of time (i.e. with a current
> >> probe and 'scope)?  The obvious reason is that the FPU does the job
> >> faster, so you spend more than enough time in sleep to make up for the
> >> higher current consumed by the FPU.  BTW - does the FPU get completely
> >> turned off when you are not going to use it?
> > 
> > The function that fills in the table of sine values runs continuously--
> > the CPU should never go to sleep.
> > 
> > I get the expected reduction in power when using an RTOS where the sine
> > function is intermittent and the CPU sleeps between activations.
> >> 
> >> If you don't have a current probe, you could set up your system to
> >> exercise floating point calculations continuously {either using FPU or
> >> CPU/software}.  That should get you the results you expect.
> > 
> > That's what I did with the code above.
> > 
> > Mark Borgerson
> 
> Ah, could you hold on a moment while I find some hole to crawl into, 
> preferably one with a remedial reading class?
> 
> Sorry, guess I'm clueless today.
> 
> To repeat one point - are you sure that the FPU is completely turned off
> when you're not going to be using it?  Hopefully there's some way to be
> sure this is happening.

I'm not sure of the status of the FPU when I compile the code for 
software FP.  There are a couple of FPU enable bits that aren't set
when using software FP, but I'm not sure if they turn off the FPU
clock or if they just cause a fault on writes to the FPU registers.
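
For reference, on the Cortex-M4 the enable bits in question are the CP10 and CP11 fields of the Coprocessor Access Control Register (CPACR, bits 20 to 23). The ARM documentation describes them as access controls: with access denied, an FPU instruction raises a UsageFault. Whether clearing them also reduces FPU power is not something this sketch answers. The helper names below are made up, and the functions only transform a register value so they can be tried on a host.

```c
#include <assert.h>
#include <stdint.h>

/* CPACR fields for the FPU: CP10 is bits 21:20, CP11 is bits 23:22. */
#define CPACR_FPU_MASK  ((uint32_t)0xFu << 20)
/* 0b11 in both fields grants full access. */
#define CPACR_FPU_FULL  ((uint32_t)0xFu << 20)

/* Returns nonzero if both CP10 and CP11 are set to full access. */
int cpacr_fpu_full_access(uint32_t cpacr)
{
    return (cpacr & CPACR_FPU_MASK) == CPACR_FPU_FULL;
}

/* Value to write to CPACR to grant full FPU access; other bits unchanged. */
uint32_t cpacr_with_fpu_access(uint32_t cpacr)
{
    uint32_t result = cpacr | CPACR_FPU_FULL;
    assert(cpacr_fpu_full_access(result));
    return result;
}

/* Value to write to CPACR to deny FPU access; other bits unchanged. */
uint32_t cpacr_without_fpu_access(uint32_t cpacr)
{
    return cpacr & ~CPACR_FPU_MASK;
}
```

On the target, with the CMSIS headers, the usual pattern would be something like `SCB->CPACR = cpacr_with_fpu_access(SCB->CPACR);` followed by `__DSB(); __ISB();`. Reading `SCB->CPACR` in the debugger and passing it to `cpacr_fpu_full_access()` shows which state a given build leaves the register in.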


Mark Borgerson

Reply by Richard Damon ● January 6, 2013

On 1/5/13 12:59 PM, Mark Borgerson wrote:

> Here's the puzzling part:
> 
> Using FPU for floating point:  49.9mA
> Using software floating point: 55.1mA
> 
> Why does the CPU use LESS power doing floating point math
> in the FPU???
> 
> 
> Mark Borgerson
> 

My guess would be due to a couple of factors:

1) The FPU is quicker, so the CPU will be spending more time in the IDLE
state, where the power consumption is a lot less.

2) The FPU is probably a lot more efficient in the number of electrons
needed to do the operation than the software emulation. On a per
microsecond basis, the FPU may use more power when it is running than
the integer ALU, but it may well need less energy to do the full
computation.
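
The difference between current and energy in point 2 can be put in numbers. Energy per pass through the loop is supply voltage times average current times the time one pass takes. The sketch below uses the two currents measured in the first post. The 3.0 V supply and the pass times (10 ms for software floating point, 8 times shorter with the FPU) are placeholder assumptions, not measurements, and `energy_per_pass_mj` is a made-up helper name.

```c
#include <assert.h>

/* Energy for one pass in millijoules.
 * volts * milliamps * milliseconds gives microjoules, so divide by 1000. */
double energy_per_pass_mj(double supply_v, double current_ma, double pass_ms)
{
    assert(supply_v >= 0.0 && current_ma >= 0.0 && pass_ms >= 0.0);
    return supply_v * current_ma * pass_ms / 1000.0;
}
```

With those placeholder times, `energy_per_pass_mj(3.0, 55.1, 10.0)` comes to about 1.65 mJ for software floating point and `energy_per_pass_mj(3.0, 49.9, 1.25)` to about 0.19 mJ for the FPU, so the FPU build would spend roughly a ninth of the energy per pass. This says nothing about why the average current itself is lower when both loops run continuously.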

Reply by Mark Borgerson ● January 6, 2013

In article <kcatlq$po5$1@dont-email.me>, news.x.richarddamon@xoxy.net 
says...
> 
> On 1/5/13 12:59 PM, Mark Borgerson wrote:
> 
> > Here's the puzzling part:
> > 
> > Using FPU for floating point:  49.9mA
> > Using software floating point: 55.1mA
> > 
> > Why does the CPU use LESS power doing floating point math
> > in the FPU???
> > 
> > 
> > Mark Borgerson
> > 
> 
> My guess would be due to a couple of factors:
> 
> 1) The FPU is quicker, so the CPU will be spending more time in the IDLE
> state, where the power consumption is a lot less.
I'm not sure this matches the test conditions.  Both with and without
the FPU, the software was in a continuous loop that computed and stored
sine values.  There was no idle state.
> 
> 2) The FPU is probably a lot more efficient in the number of electrons
> needed to do the operation than the software emulation. On a per
> microsecond basis, the FPU may use more power when it is running than
> the integer ALU, but it may well need less energy to do the full
> computation.

Both computations were running continuously, though the FPU loop
does cycle more times per second.


Mark Borgerson
