EmbeddedRelated.com
Blogs
The 2024 Embedded Online Conference

Trust, but Verify: Examining the Output of an Embedded Compiler

Jason SachsSeptember 27, 2015

I work with motor control firmware on the Microchip dsPIC33 series of microcontrollers. The vast majority of that firmware is written in C, with only a few percent in assembly. And I got to thinking recently: I programmed in C and C++ on an Intel PC from roughly 1991 to 2009. But I don’t remember ever working with x86 assembly code. Not once. Not even reading it. Which seems odd. I do that all the time with embedded firmware. And I think you should too. Before I say why, here are reasons not to work in assembly:

  • It’s easier to make errors
  • It’s more difficult and takes longer to write
  • It’s more difficult and takes longer to read
  • It’s more difficult and takes longer to debug
  • It’s more difficult and takes longer to maintain
  • It’s more difficult and takes longer to document
  • It’s not portable between architectures
  • The compiler can produce assembly code for you, which runs almost as fast as your hand-optimized assembly code, sometimes faster
  • When was the last time you wrote code that was CPU-bound, anyway?

This last note needs some clarification. By CPU-bound, I mean the limiting factor on how fast your program can complete a task is the CPU, rather than disk I/O or network I/O or scheduled timer delays. If your program runs calculations one after the other, never returning control to the operating system via some yielding call to a blocking function, then it’s CPU-bound. That’s getting rarer these days, unless you’re writing serious number-crunching code. And that’s why desktop and embedded software are different. On a desktop PC, I’ve got a multi-core multi-GHz processor screaming along running my code along with the 97 other processes that are active. My goodness, I can even write in high-level interpreted languages and it’s still fast. If it seems a little slower than I’d like, then I measure it with a profiler, find the code that seems to take the most processing power, and see if there’s a better algorithm for doing what I want — because I don’t know x86 assembly, and even if I did, I’d probably be lucky to speed up a large task by more than maybe 10-15%.... versus an algorithmic speedup that cuts down the runtime by 95% because I went from \( O(n^2) \) to \( O(n \log n) \). Or sometimes it’s just a matter of which library or OS functions you use: if things are too slow, try a few different methods and see if any are significantly faster. That doesn’t need any assembly code work, just some experimentation and careful measurement.

On an embedded system, things are different. Embedded processors have only a fraction of the computing power of a desktop PC, and there’s economic incentive to use less CPU processing, because it means you can buy a cheaper processor or use less energy (which is crucial in the case of battery-powered systems). When I’m working on motor control software, I have a single-core processor running at 70 million instruction cycles per second, and it has a hard real-time requirement to do a certain set of motor control tasks every 50 microseconds. That means the motor control code needs to finish in less than 3500 instruction cycles. This isn’t a lot of time; if I can shorten things by 35 instruction cycles, that frees up 1% of the CPU. It matters.

I said only a few percent of the motor control firmware is written in assembly. My team uses it only when necessary; in fact, in some cases we are migrating from assembly to C for maintainability. But 35 instruction cycles uses up 1% of the CPU — execution time matters — so while I don’t write assembly code very often, I do read assembly code fairly frequently. I look behind the curtain and see what the compiler is doing. There’s an old Russian proverb, Доверяй, но проверяй (Trust, but verify), which was co-opted and made famous by U.S. President Reagan in the 1985-1986 nuclear disarmament talks with the USSR’s General Secretary Gorbachev. I let the compiler do its job rather than try to write assembly code myself… but I do look at the output code. A compiler has only so much freedom to optimize — it must produce output code that correctly implements the specified source code according to the rules of the programming language I’m using. For example, which of the following functions for summing up an array of 16-bit integers will have a shorter execution time?

int32_t sum1(const int16_t *items, int count)
{
    int32_t S = 0;
    int i;
    for (i = 0; i < count; ++i)
    {
       S += items[i];
    }
    return S;
}

int32_t sum2(const int16_t *items, int count)
{
    int32_t S = 0;
    while (count-- > 0)
    {
       S += *items++;
    }
    return S;
}

We know these functions are equivalent, but the compiler may not be able to deduce that, so one may be faster than the other. (Don’t believe me? Keep on reading!) And looking at the compiler’s output lets you learn what matters and what doesn’t matter. Sometimes a compiler has been programmed with optimizations to handle certain cases; if you are aware of which cases do better, you can help it along.

Yes, this represents an investment of time to learn the quirks of a particular compiler running on a particular architecture — but it’s a lot less than trying to write everything in assembly by hand.

Ready to give it a shot?

The Basics of Looking Behind the Curtain

This is going to be fun! We’re going to look what the compiler does with a really simple function:

#include <stdint.h>

int16_t add(int16_t a, int16_t b)
{
  return a+b;
}

That’s so simple, you might think it’s not even worth doing… but we can learn from it.

Okay, there are several things you need if you want to look at assembly code generated by the compiler:

  1. A compiler. In this case we’ll be using the Microchip XC16 compiler. The features I’m going to show are in the free version; if you want to use the higher optimization levels you’ll need to purchase an appropriate license, but you can do a lot with just the basic features in the free version.

  2. The manual for the compiler. ‘Cause we want to know how to use it properly.

  3. The instruction set reference manual for the architecture we’re working with. In this case it’s Microchip’s 16-bit MCU and DSC architecture. If you’ve never looked at assembly code before, it’s going to look like gobbledygook, and even if you have, one core architecture’s assembly may be very different from another.

The C code I’m going to look at is architecture-independent, so if you’re working with another type of microcontroller, you can do exactly the same exploration; you just need the appropriate compiler and reference documents to try it out. Of course, the resulting assembly output will be different on other architectures.

Now we need to compile some code. There are a few options here.

Option 1: The IDE

We can use an integrated development environment (IDE), which usually makes it easy to write and debug a computer program. At work I use the MPLAB X IDE based on NetBeans, and it’s helpful, especially for things like debugging and syntax highlighting and refactoring and all that stuff. But for looking at little snippets of code, it’s kind of a pain; there’s not really an easy way to look at the compiler output unless you compile a whole program, and with an embedded microcontroller, that means putting together a bunch of stuff like a linker file and a project file, which gets in the way. The IDE definitely makes it easier. But here we just want to look at one function, so it’s too much baggage, at least for this article.

Option 2: The command line

Okay, here all the UNIX fanboys are going to come out of the woodwork cheering. Hey, I like the command line too, at times… it’s just that life is short, and I just don’t have enough free brain cells in my working set to remember all of those cryptic arguments like du -sk or chmod a+x or whatever. I’m already upset at how much of my memory is occupied by that junk. It’s a necessary evil.

But it’s simple. It may be cryptic and hard to remember, but it’s really simple to follow instructions.

Here’s what we have to do to use the compiler.

  1. We write our function into a file, let’s call it add.c, using our favorite editor. I’m not going to tell you mine. Them’s the stuff of holy wars, and I stay neutral.

  2. We open up a command-line shell in the directory where we put add.c, and type this:

    xc16-gcc -S -c add.c
    

    What does this do? The -c option just runs the compiler, usually converting add.c to an object file (either add.obj or add.o depending on whether you’re running on Windows or a UNIX-based OS like a Mac or Linux); if you leave it out, the compiler also tries to link it into a full program, and that will fail because we haven’t included a main() function, and we really don’t care about making a complete program. We just want to see what our add() function turns into. But if we try to look at the object file add.o / add.obj it’s just incomprehensible binary junk, easily readable to a linker or debugger, but not meant for human consumption. The -S option changes things, and produces an assembly file, add.s, instead of the object file.

  3. We open add.s in our favorite editor and look at it. Here it is:

    .file "/Users/jmsachs/link/embedded-blog/code-samples/add.c"
    .section    .text,code
    .align  2
    .global _add    ; export
    .type   _add,@function
_add:
    .set ___PA___,1
    lnk #4
    mov w0,[w14]
    mov w1,[w14+2]
    mov [w14+2],w0
    add w0,[w14],w0
    ulnk    
    return  
    .set ___PA___,0
    .size   _add, .-_add

    .section __c30_signature, info, data
    .word 0x0001
    .word 0x0000
    .word 0x0000

; MCHP configuration words

    .set ___PA___,0
    .end

Okay, now which are the important bits? Assembly language for Microchip processors is fairly similar to most others, and contains four elements:

  • Comments (which in this case start with a semicolon), which the assembler ignores
  • Directives (which start with a ., like .word 0x0000), which tell the assembler some auxiliary information
  • Labels (which start at the beginning of a line and end in a colon :, like _add:), which represent a reference to a point in the program
  • Instructions (everything else)

If we ignore all the comments and directives we see this:

_add:
    lnk #4
    mov w0,[w14]
    mov w1,[w14+2]
    mov [w14+2],w0
    add w0,[w14],w0
    ulnk    
    return

And that’s our add() function in 16-bit dsPIC assembly! 7 instructions!

What does it mean? Well, before we get into that, let’s look at another option.

Option 3: Instant gratification with pyxc16

I used to use options 1 or 2 when I needed to look at compiler output. And it was tiring, and error-prone, because often I’d edit some C code, go to compile it, get distracted by something, then forget whether I actually ran the compiler, then compile it again, and flip back and forth between the source and the assembly. Oh, and actually finding the right portion of assembly was sometimes a challenge, because of all that other junk that shows up in an assembly file. We just saw 7 instructions of code, and a label. That’s 8 lines. But the add.s file contains 26 lines. The directives obscure the contents. Larger C code inputs produce a lot more output.

So I got fed up, and wrote a quick Python module, pyxc16, which I have put up on my Bitbucket account for anyone to use. Essentially it takes care of running the compiler, fetching the output, and filtering out the stuff in the assembler file that’s not important. Here’s how to run it. You just have to make sure there’s an XC16_DIR environment variable that points to the root (not the bin directory) of the compiler installation. On my Mac that’s in /Applications/microchip/xc16/v1.25/. Then you give it a string (or a filename, if you must) and whatever command-line arguments you want to include:

import pyxc16

pyxc16.compile('''
#include <stdint.h>

int16_t add(int16_t a, int16_t b)
{
    return a+b;
}
''', '-c')
_add:
	lnk	#4
	mov	w0,[w14]
	mov	w1,[w14+2]
	mov	[w14+2],w0
	add	w0,[w14],w0
	ulnk
	return

Tada! Instant gratification! The pyxc16.compile() function saves your input as sample.c and compiles it with the -S option.

Now I can use IPython Notebook — yeah, I know, it’s Jupyter Notebook now, but the name just doesn’t roll off the tongue — and write my blog on my Chromebook laptop while sitting on my comfy couch, compiling C snippets with XC16. Whereas my Mac which actually runs IPython and XC16 is relegated to an inconvenient spot with a marginally-acceptable desk and chair from IKEA that I don’t really like sitting at. The chair is a modern-but-cheap birchwood thing called JULES, with swivel casters that fall off every once in a while, and I can’t remember the desk’s IKEA name, and yes, I should make sitting at my computer at home more comfortable, but maybe that’s the point — if it were more comfortable, I would spend more time there.

Anyway, we were talking about the assembly code produced by XC16, given our add() function.

Dissecting assembly code, perhaps with the help of Uncle Eddie

What is going on here? If you look at the programmer’s reference manual, you’ll see that the LNK instruction — when I refer to an instruction mnemonic without operands, I’ll be showing it in upper case; when it’s in a program used with specific operands, it will be in lower case — allocates a stack frame of a certain number of bytes, used for temporary variables. MOV copies data from its first operand to its second operand, ADD adds its operands, ULNK deallocates a stack frame (the inverse of LNK), and RETURN jumps back to the calling program. I’m not going to dwell on the details of dsPIC assembly instructions (you’ll have to read the programmer’s reference manual), but here’s an annotated version:

_add:
    lnk #4            ; allocate a stack frame of 4 bytes
    mov w0,[w14]      ; copy what's in register W0 to offset 0 relative to W14 
                      ; (the stack frame pointer), aka FP[0]
    mov w1,[w14+2]    ; copy what's in W1 to FP[2]
    mov [w14+2],w0    ; copy what's in FP[2] to W0
    add w0,[w14],w0   ; add W0 to FP[0], and put the result into W0
    ulnk              ; deallocate the stack frame
    return            ; return to caller

That’s kind of stupid… why does the compiler do all that shuffling-around of data? The default mode of the compiler is to operate in its -O0 mode, eschewing essentially all attempts at optimization. Here’s what the XC16 manual says about -O0:

Do not optimize. (This is the default.) Without -O, the compiler’s goal is to reduce the cost of compilation and to make debugging produce the expected results. Statements are independent: if you stop the program with a breakpoint between statements, you can then assign a new value to any variable or change the program counter to any other statement in the function and get exactly the results you would expect from the source code. The compiler only allocates variables declared register in registers.

It’s essentially there to make the debug tool’s job very easy, and to allow breakpoints on all lines of C code with all variables available on the stack. As far as “reducing the cost of compilation” goes, I’ve never noticed any serious slowdown at other optimization levels, but who knows. If you care about efficiency and need to speed up the execution, you’ll want to use one of the other optimization options. The -O1 option is the next step up, and does a pretty decent job; the other optimization levels are higher performance but aren’t present in the free version of the compiler.

Here’s what our add() function looks like with -O1:

pyxc16.compile('''
#include <stdint.h>

int16_t add(int16_t a, int16_t b)
{
    return a+b;
}
''', '-c', '-O1')
_add:
	add	w1,w0,w0
	return

And now we’re down to two instructions, an ADD and a RETURN. Yay! Much better. In fact, this is optimal for a function call.

Something else that’s useful to know is the -fverbose-asm option, which adds a kind of color commentary to the assembly code. It’s kind of like having your paranoid schizophrenic uncle Eddie tell you about the wolves and drug dealers watching him behind the grocery store shelves. He includes things that aren’t there… but… well, you just don’t see what he’s seeing, and maybe that’s your loss.

pyxc16.compile('''
#include <stdint.h>

int16_t add(int16_t a, int16_t b)
{
    return a+b;
}
''', '-c', '-O1', '-fverbose-asm')
_add:
	add	w1,w0,w0	; b, a, tmp46
	return

Uncle Eddie is telling you… I mean XC16 is telling you that it implements add(a, b) with b in the W1 register and a in the W0 register, and then the result of the ADD instruction, that gets put into W0 — well, that’s easy, that’s just tmp46.

Wait, what’s tmp46?

Let’s see another instance of -fverbose-asm:

pyxc16.compile('''
#include <stdint.h>

int16_t f1(int16_t x, int16_t y)
{
    return x*y;
}

int16_t cross_product(const int16_t *pa, const int16_t *pb)
{
    return f1(pa[0], pb[1]) - f1(pb[0], pa[1]);
}
''', '-c', '-O1', '-fverbose-asm')
_f1:
	mul.ss	w1,w0,w0	; y, x, tmp47
	return
_cross_product:
	mov.d	w8,[w15++]	;
	mov	w10,[w15++]	;,
	mov	w0,w9	; pa, pa
	mov	w1,w8	; pb, pb
	mov	[w8+2],w1	;, tmp54
	mov	[w9],w0	;* pa,
	rcall	_f1	;
	mov	w0,w10	;, D.2032
	mov	[w9+2],w1	;, tmp55
	mov	[w8],w0	;* pb,
	rcall	_f1	;
	sub	w10,w0,w0	; D.2032, D.2036, tmp56
	mov	[--w15],w10	;,
	mov.d	[--w15],w8	;,
	return

Now we’re looking at tmp47 and D.2032 and other imaginary friends. Huh. Presumably these are internal names that the compiler uses for intermediate results, but it’s too bad they aren’t annotated a little better; in this example, except for D.2032, all of them appear only once and it doesn’t seem to shed much light on what’s going on. I won’t be using -fverbose-asm in the rest of this article.

A Quick Tour of Compiled Snippets

Okay, here are some common features of C, along with the compiler’s generated output. You can learn a lot about the kinds of tricks the compiler uses behind the scenes to make them work.

Branches

pyxc16.compile('''
#include <stdint.h>
#include <stdbool.h>

/* if-statement */
int16_t add(bool a, int16_t x, int16_t y)
{
    if (a)
    {
        return x+y;
    }
    else
    {
        return x-y;
    }
}

/* equivalent result using the conditional operator */
int16_t add2(bool a, int16_t x, int16_t y)
{
    return (a) ? x+y : x-y;
}

int16_t add3(bool a, int16_t x, int16_t y)
{
    return x + ((a) ? y : -y);
}

int16_t add4(bool a, int16_t x, int16_t y)
{
    return x - ((a) ? -y : y);
}

''', '-c', '-O1')
_add:
	cp0.b	w0
	bra	z,.L2
	add	w2,w1,w0
	bra	.L3
.L2:
	sub	w1,w2,w0
.L3:
	return
_add2:
	cp0.b	w0
	bra	z,.L5
	add	w2,w1,w0
	bra	.L6
.L5:
	sub	w1,w2,w0
.L6:
	return
_add3:
	cp0.b	w0
	bra	nz,.L8
	neg	w2,w2
.L8:
	add	w2,w1,w0
	return
_add4:
	cp0.b	w0
	bra	z,.L10
	neg	w2,w2
.L10:
	sub	w1,w2,w0
	return

The CP0 instruction compares a register with 0; the BRA instruction does a conditional or unconditional branch. Calls, jumps, and branches on the dsPIC architecture, if taken, require more time to execute; part or all of the instruction pipeline must be flushed because of a branch hazard. On the dsPIC30F/33F and PIC24F/24H devices, the branch takes 1 extra cycle; on the dsPIC33E and PIC24E devices, the branch takes 3 extra cycles.

The compiler generates its own labels — here we have .L5, .L6, .L8, and .L10 — for use as branch and jump targets; these don’t conflict with the names used from C functions. A C function’s start label gets an underscore added at the beginning; the autogenerated labels all start with a .L followed by a number. If you use debug mode -g when compiling, there are a whole bunch of other labels starting with .LSM and .LFB, but pyxc16 strips those out since they’re never actually used as branch and jump targets.

Note that there is no difference between the use of an if statement and an equivalent behavior using the ? : conditional operator. The third and fourth functions add3() and add4() are interesting… all four of these functions are equivalent, but we’ve rewritten the implementation in C, and in this case the compiler is able to implement it more efficiently, taking fewer instructions for one of the branches. The compiler can’t figure this out on its own — it’s not omniscient. We also can’t tell the compiler directly which branch we want to make shorter. On the 33F devices, both branches take the same amount of time; on the 33E series of devices, the add3() function takes longer if a is true (6 cycles not including the RETURN function if the branch is taken, 4 cycles not including the RETURN function if the branch is not taken), and the add4() function has the opposite behavior for execution time.

Looking at the compiler’s output can give you information to decide how to coax the compiler into doing things more efficiently… but there’s no solid guarantee of this. Upgrade to a new version of the compiler? The behavior might change. Use different compiler flags? The behavior might change. Tweak the C code? The behavior might change. It’s not specified as part of the C standard. If you really want a specific assembly implementation, you either have to write it yourself in assembly, or double-check the result of the compiler each time you compile a different input file. Don’t be a control freak! Learning how the compiler behaves in various conditions can help you work better with it, but don’t micromanage, except in cases where you have critical execution time requirements, like a function that is called thousands or tens of thousands of times per second.

Loops

pyxc16.compile('''
#include <stdint.h>

/* Compute the nth triangular number, namely
 *   0 + 1 + 2 + 3 + ... (n-1) + n. 
 */
int16_t trisum1(int16_t n)
{
    int16_t result = 0;
    while (n > 0)
    {
        result += n;
        --n;
    }
    return result;
}

int16_t trisum2(int16_t n)
{
    int16_t result = 0;
    int16_t i;
    for (i = 1; i <= n; ++i)
    {
        result += i;
    }
    return result;
}

void wait_forever()
{
    while (1)
        ;
}

''', '-c', '-O1')
_trisum1:
	clr	w1
	cp0	w0
	bra	le,.L2
.L3:
	add	w1,w0,w1
	dec	w0,w0
	bra	nz,.L3
.L2:
	mov	w1,w0
	return
_trisum2:
	mov	w0,w2
	clr	w0
	cp0	w2
	bra	le,.L7
	mov	#1,w1
.L8:
	add	w0,w1,w0
	inc	w1,w1
	sub	w2,w1,[w15]
	bra	ge,.L8
.L7:
	return
_wait_forever:
.L12:
	bra	.L12

Not much to add here; a loop is really just a branch backwards to repeat something.

Function calls

pyxc16.compile('''
#include <stdint.h>

int16_t foo(int16_t n)
{
    return n+1;
}

int16_t bar(int16_t a)
{
    return foo(a+1)-1;
}
''', '-c', '-O1')
_foo:
	inc	w0,w0
	return
_bar:
	inc	w0,w0
	rcall	_foo
	dec	w0,w0
	return

The RCALL instruction does a relative call (destination address within -32768 and +32767 words of the location of the RCALL instruction), which takes the same amount of execution time as an unlimited CALL instruction, but the code size of an RCALL is 1 word and the code size of a CALL is 2 words. The use of RCALL puts a constraint on where the linker will locate the foo and bar functions.

Switch statements

pyxc16.compile('''
#include <stdint.h>

int16_t switch_example1(int16_t n)
{
    switch (n)
    {
        case 0:
            return 3;
        case 1:
            return n;
        case 2:
            return -n;
        case 3:
            return 44;
        default:
            return 78;
    }
}

int16_t switch_example2(int16_t n, int16_t x)
{
    switch (n)
    {
        case 0:
            return 3;
        case 1:
            return x;
        case 2:
            return -x;
        case 5:
            return 12;
        case 6:
            return 9;
        case 7:
            return 45;
        case 8:
            return 40;
        case 9:
            return 4;
        case 14:
            return x+1;
        case 23:
            return 12345;
        default:
            return x-1;
    }
}
''', '-c', '-O1')
_switch_example1:
	sub	w0,#1,[w15]
	bra	z,.L4
	bra	gt,.L7
	cp0	w0
	bra	z,.L3
	bra	.L2
.L7:
	sub	w0,#2,[w15]
	bra	z,.L5
	sub	w0,#3,[w15]
	bra	nz,.L2
	bra	.L8
.L3:
	mov	#3,w0
	bra	.L4
.L5:
	mov	#-2,w0
	bra	.L4
.L8:
	mov	#44,w0
	bra	.L4
.L2:
	mov	#78,w0
.L4:
	return
_switch_example2:
	mul.su	w0,#1,w2
	sub	w2,#23,[w15]
	subb	w3,#0,[w15]
	bra	gtu,.L10
	bra	w2
	bra	.L11
	bra	.L12
	bra	.L13
	bra	.L10
	bra	.L10
	bra	.L14
	bra	.L15
	bra	.L16
	bra	.L17
	bra	.L18
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L19
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L10
	bra	.L20
.L11:
	mov	#3,w1
	bra	.L12
.L13:
	neg	w1,w1
	bra	.L12
.L14:
	mov	#12,w1
	bra	.L12
.L15:
	mov	#9,w1
	bra	.L12
.L16:
	mov	#45,w1
	bra	.L12
.L17:
	mov	#40,w1
	bra	.L12
.L18:
	mov	#4,w1
	bra	.L12
.L19:
	inc	w1,w1
	bra	.L12
.L20:
	mov	#12345,w1
	bra	.L12
.L10:
	dec	w1,w1
.L12:
	mov	w1,w0
	return

The compiler has two strategies for handling switch statements. One of them (as in switch_example1) is a decision tree, the other (as in switch_example2) is a jump table. We’ve already seen the case of plain branches, but the jump table is new. The bra w2 does a computed relative branch, and the series of BRA instructions that follow, to computed labels .L10 through .L20, causes the program to springboard off to the appropriate spot in the code for each case.

Which choice gets used? The compiler has a heuristic that tries to optimize code size and execution time. If the number of case statements is small enough, the compiler uses a decision tree. If the number of case statements is large enough and the jump table implementation is small enough, the compiler uses a jump table. If a jump table would be too large, the compiler goes back to a decision tree.

Miscellany

for opt_level in ['-O0','-O1']:
    print '*****',opt_level,'*****'
    pyxc16.compile('''
#include <stdint.h>
#include <stdbool.h>

bool check_range(int16_t n)
{
    return n > 33 && n < 48;
}
''', '-c', opt_level)
***** -O0 *****
_check_range:
	lnk	#2
	mov	w0,[w14]
	mov	#33,w0
	mov	[w14],w1
	sub	w1,w0,[w15]
	bra	le,.L2
	mov	#47,w0
	mov	[w14],w1
	sub	w1,w0,[w15]
	bra	gt,.L2
	mov	#1,w0
	bra	.L3
.L2:
	clr	w0
.L3:
	mov.b	w0,w0
	ulnk
	return
***** -O1 *****
_check_range:
	mov	#-34,w1
	add	w1,w0,w1
	mov.b	#1,w0
	sub	w1,#13,[w15]
	bra	leu,.L2
	clr.b	w0
.L2:
	return

Now, that’s interesting. The -O0 version has two conditional branches, one for each of the two arithmetic comparisons (n > 33 and n < 48). The -O1 version manages to do this in one conditional branch, first by subtracting 34 (shifting the 34-47 range down to the 0-13 range), then by subtracting 13, and using the BRA LEU conditional branch, which branches if the previous arithmetic result produces a positive less-than-or-equal comparison in an unsigned manner (more precisely, the carry flag is false or the zero flag is true).

The compiler is full of little tricks like this. Unfortunately, it’s really difficult to know which cases it’s been trained to optimize, and which ones it hasn’t. (This is one reason to look at your compiler’s output every once in a while!)

Now let’s look at the teaser code I posted at the beginning of this article to sum up elements of an array:

pyxc16.compile('''
#include <stdint.h>

int32_t sum1(const int16_t *items, int count)
{
    int32_t S = 0;
    int i;
    for (i = 0; i < count; ++i)
    {
       S += items[i];
    }
    return S;
}

int32_t sum2(const int16_t *items, int count)
{
    int32_t S = 0;
    while (count-- > 0)
    {
       S += *items++;
    }
    return S;
}
''', '-c', '-O1')
_sum1:
	mov	w0,w2
	mov	w1,w5
	mul.uu	w0,#0,w0
	cp0	w5
	bra	le,.L2
	clr	w3
.L3:
	mov	[w2++],w4
	asr	w4,#15,w6
	add	w4,w0,w0
	addc	w6,w1,w1
	inc	w3,w3
	sub	w3,w5,[w15]
	bra	nz,.L3
.L2:
	return
_sum2:
	mov	w0,w2
	mul.uu	w4,#0,w4
	cp0	w1
	bra	le,.L7
	dec	w1,w1
.L8:
	mov	[w2++],w0
	asr	w0,#15,w3
	add	w0,w4,w4
	addc	w3,w5,w5
	dec	w1,w1
	add	w1,#1,[w15]
	bra	nz,.L8
.L7:
	mov.d	w4,w0
	return

The answer is that they take almost exactly the same time! The loops are the same except that one uses a DEC and ADD and the other uses and INC and SUB. There are the same number of loop iterations. In sum1() we have a 6-cycle prologue and no epilogue; in sum2() we have a 5-cycle prologue and a 2-cycle epilogue (MOV.D takes two cycles) not including the RETURN. So sum1() is one instruction cycle faster. The compiler does something pretty fancy in sum1(): it recognizes an array index items[i] with an incrementing index in a loop, and turns it into the equivalent *items++ access, using the mov [w2++],w4 instruction. Wow!

There are three other little things to notice here in both functions.

  • mul.uu w4,#0,w4: This is an easy 1-cycle way to clear a pair of registers used as a 32-bit value; we’re writing 0 to W4 and W5 in one cycle
  • asr w0,#15,w3: This is a sign-extension operation. If W0 is negative, we put 0xFFFF into W3, otherwise we put 0x0000 into W3, and then the pair W3:W0 forms a 32-bit value equal to W0 cast to a 32-bit signed integer.
  • add w1,#1,[w15]: This is an odd little line of assembly that confused me at first… what’s in [W15]? It’s not used anywhere else in the code. Well — that’s actually the answer; we execute an ADD instruction and throw the result away by writing it to the top of the stack (W15 is the stack pointer), which will be overwritten the next time any part of the program executes a PUSH or a CALL or a LNK instruction. It’s an easy way to throw a result somewhere without having to overwrite one of the registers that we care about. Why would we do this? Because all we need are the arithmetic status flags, which are used in the conditional branch instruction on the next line.

Let’s do a similar operation, this time adding four fixed values in an array:

pyxc16.compile('''
#include <stdint.h>

/* sum up 4 numbers in an array */
int32_t foursum1(const int16_t *items)
{
    int32_t S = 0;
    S += items[0];
    S += items[1];
    S += items[2];
    S += items[3];
    return S;
}

int32_t foursum2(const int16_t *items)
{
    int32_t S = 0;
    S += *items++;
    S += *items++;
    S += *items++;
    S += *items;
    return S;
}

int32_t foursum3(const int16_t *items)
{
    int32_t S = 0;
    S += *items;
    S += *++items;
    S += *++items;
    S += *++items;
    return S;
}

''', '-c', '-O1')
_foursum1:
	mov	[w0+2],w1
	mov	[w0+4],w2
	asr	w2,#15,w3
	asr	w1,#15,w4
	add	w1,w2,w2
	addc	w4,w3,w3
	mov	[w0],w1
	asr	w1,#15,w4
	add	w1,w2,w2
	addc	w4,w3,w3
	mov	[w0+6],w0
	asr	w0,#15,w4
	add	w0,w2,w0
	addc	w4,w3,w1
	return
_foursum2:
	mov	w0,w1
	mov	w1,w0
	mov	[++w0],w4
	mov	[w1],w2
	asr	w2,#15,w3
	asr	w4,#15,w1
	add	w4,w2,w2
	addc	w1,w3,w3
	mov	[++w0],w1
	asr	w1,#15,w4
	add	w1,w2,w2
	addc	w4,w3,w3
	mov	[w0+2],w0
	asr	w0,#15,w4
	add	w0,w2,w0
	addc	w4,w3,w1
	return
_foursum3:
	mov	w0,w1
	mov	w1,w0
	mov	[++w0],w4
	mov	[w1],w2
	asr	w2,#15,w3
	asr	w4,#15,w1
	add	w4,w2,w2
	addc	w1,w3,w3
	mov	[++w0],w1
	asr	w1,#15,w4
	add	w1,w2,w2
	addc	w4,w3,w3
	mov	[w0+2],w0
	asr	w0,#15,w4
	add	w0,w2,w0
	addc	w4,w3,w1
	return

This time the compiler doesn’t change the array index access into a postincremented pointer — the value of W0 stays the same through most of the foursum1() function. In foursum1(), the compiler uses array indices and takes 14 cycles plus a RETURN. In the second function foursum2(), oddly enough, the compiler seems to get a little confused and transforms the postincrementing pointers into two preincrementing pointers, a plain indirect access, and an indirect-with-offset access (mov [w0+2],w0). Here we take 16 cycles plus a RETURN, but only because the compiler does this odd dance at the beginning where it writes W0 to W1 and then back to W0. Sometimes things like that happen; Uncle Eddie comes by and makes some unnecessary motions just to be sure those wolves at the grocery store can’t hear him.

The third implementation, foursum3(), is exactly the same as foursum2(), instruction-for-instruction, even though we switched our use of pointers in C from postincrement to preincrement. The compiler can recognize what’s being computed as equivalent, and chooses an assembly implementation according to its heuristics. Go figure.

Summary

We’ve looked at a handful of simple C functions, and examined the Microchip XC16 compiler’s output. The basic C control structures are fairly simple, and result in various conditional or unconditional branches or subroutine call instructions.

You can learn a lot from checking the C compiler’s output, no matter what CPU architecture or compiler you’re using. This is a strategy that can be used, if necessary, to determine which types of C program implementations have more efficient implementations in assembly. Be careful: the exact mechanisms for optimization of assembly code generation are internal undocumented details of the compiler, and may have brittle dependencies on compiler flags, C program input, the compiler version, the day of the week, or the phase of the moon. There’s no iron-clad guarantee that the compiler will optimize your program the same way in multiple compilation runs. From a practical standpoint, however, the compiler’s behavior is fairly consistent, and can help guide you towards using it to achieve good program performance.

I use this technique often on critical execution tasks that run thousands of times per second on a dsPIC microcontroller. I don’t use it on code that runs infrequently; it’s not worth it to optimize what doesn’t make much difference in execution time.

I’ve also posted a link to my open-source pyxc16 repository, which is a simple Python module for making it easy in Python to compile C code and view the results from the Microchip XC16 compiler.

Happy compiling!



The 2024 Embedded Online Conference

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Please login (on the right) if you already have an account on this platform.

Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers: