C Programming Techniques: Function Call Inlining
Abstraction is key to managing software systems as they grow in size and complexity. As shown in a previous post, abstraction requires a developer to clearly define a software interface for both data and functions, and possibly to hide the underlying implementation.
When using the C language, the interface is often exposed in a header '.h' file, while the implementation is put in one or more corresponding '.c' files.
First, separating an interface from its implementation is a good way to 'self document' the code.
More importantly, while the compiler does not need to know a function's definition when processing a call, it must know the function prototype to generate correct code. This is especially true for the return value and the parameter types. The calling convention matters too, when it differs from the default one. Note that the source may still compile (possibly with warnings) when function prototypes are missing, but the generated code may not be correct.
The purpose of this post is to show that hiding implementation, while a good software engineering practice, requires the compiler to generate extra instructions, not part of the useful computation. By using function call inlining, part of the induced overhead can be avoided without damaging the software interface. This practice is especially useful for computations that are a few cycles long.
Overhead associated with function calls
When a function is called from a different compilation unit (the usual case), the compiler generates a call instruction. Be it relative, indirect or absolute, the instruction operand (the branch destination address) is resolved by the static linker when merging the compilation units ('.o' files) together. For simplicity, we omit the case of dynamic libraries, and assume that the address is fully resolved by the static linker. The called function finally returns to the call site using a branch or a specialized return instruction.
Depending on its internally maintained state (registers currently in use, calling convention ...) and configuration (optimization level ...), the compiler generates extra instructions to save and restore context, and to pass arguments. This applies to both the caller and the called function. In the called function, context saving and restoring are known as prolog and epilog, respectively.
All these added instructions are not part of the useful computation and are considered as overhead. This is pictured by the following diagram:
For a developer looking for performance, this overhead may not be acceptable, especially for functions that are only a few cycles long. One way to avoid it while keeping the software interface clean is to use function call inlining.
Generally, a compiler is said to inline a function when it substitutes the definition in place, removing the need for branching instructions and the associated context management. This is illustrated by the following diagram:
Knowing the function definition, the compiler can even perform further optimizations, such as constant substitution or smarter register usage.
For inlining to occur, the compiler has to know the function definition at the time of use. This means that the definition must be in the same compilation unit as the caller. For obvious reasons, copy-pasting a function body across source files is not good practice. Thus, file inclusion is used to make the definition known to the compiler.
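As a minimal sketch of constant substitution (the names scale and scale_by_8 are hypothetical, not taken from the example later in this post), a small helper whose definition is visible at the call site can have a constant argument propagated into its body. Whether the compiler then turns the multiplication into a shift depends on the target and on the optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: its definition is visible to the caller below. */
static inline uint32_t scale(uint32_t x, uint32_t factor)
{
    return x * factor;
}

/* The factor is a compile-time constant at this call site, so an
 * optimizing compiler is free to specialize the inlined body for it. */
uint32_t scale_by_8(uint32_t x)
{
    return scale(x, 8);
}
```

Calling scale_by_8(3) yields 24 whether or not the call is inlined; only the generated instructions differ.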
One way to inline a function is to define it using a preprocessor macro. You are sure the code is substituted in place by the preprocessor, with no intervention from the compiler. One problem is that macro parameters are not typed. Also, macros tend to require tricks that make them difficult to read and maintain.
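The sketch below illustrates one well-known macro pitfall, under the assumption of a hypothetical MAX_MACRO and a hypothetical max_inline function: since macro arguments are pasted textually, an argument expression with side effects can be evaluated twice, while the function version evaluates it exactly once.

```c
/* Counts how many times next_sample() has been evaluated. */
static unsigned int sample_calls = 0;

static int next_sample(void)
{
    sample_calls++;
    return 42;
}

/* Macro version: arguments are untyped and substituted textually. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Function version: arguments are typed and evaluated exactly once. */
static inline int max_inline(int a, int b)
{
    return a > b ? a : b;
}

/* Number of next_sample() evaluations caused by the macro version. */
unsigned int calls_with_macro(void)
{
    sample_calls = 0;
    (void)MAX_MACRO(next_sample(), 0);
    return sample_calls;
}

/* Number of next_sample() evaluations caused by the function version. */
unsigned int calls_with_inline(void)
{
    sample_calls = 0;
    (void)max_inline(next_sample(), 0);
    return sample_calls;
}
```

Here the macro expands to an expression that calls next_sample() once in the comparison and once more in the selected branch, which is the kind of surprise a typed inline function avoids.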
A preferred way is to define the functions inside the included header file. This way, the compiler knows their definitions at call time, and can substitute them in place. However, putting function definitions in the header file somewhat breaks the separation between interface and implementation. A possible convention is to put the inlined code in a separate '.c' file and include it in the header file. The header file must be properly guarded, or multiple definitions will occur. Also, note that the Makefile (or equivalent) must list these files as prerequisites of the object generation rules.
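A minimal single-file sketch of a guarded header carrying a static inline definition follows. The names sat.h, SAT_H_INCLUDED, sat_add_u8 and add_levels are hypothetical, and the header body is pasted twice only to simulate the header being included twice in the same compilation unit.

```c
#include <stdint.h>

/* ---- contents of a hypothetical header, sat.h ---- */
#ifndef SAT_H_INCLUDED
#define SAT_H_INCLUDED

/* static: each compilation unit gets its own private copy, so the
 * linker never sees two external definitions of the same symbol. */
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int)a + (unsigned int)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

#endif /* SAT_H_INCLUDED */

/* ---- same header reached a second time, e.g. through another header:
 * SAT_H_INCLUDED is already defined, so the body below is skipped ---- */
#ifndef SAT_H_INCLUDED
#define SAT_H_INCLUDED
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned int sum = (unsigned int)a + (unsigned int)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}
#endif /* SAT_H_INCLUDED */

/* Caller located in the same compilation unit. */
uint8_t add_levels(uint8_t a, uint8_t b)
{
    return sat_add_u8(a, b);
}
```

Without the guard, the second copy would redefine sat_add_u8 in the same compilation unit and the compiler would reject it.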
Whatever the chosen approach, there are several ways to tell your compiler a function call should be inlined. It takes place when declaring the function, using the inline keyword. Details differ among compilers. The documentation for GCC is available here:
I routinely use the static inline approach, with optimizations enabled on the command line (-O2 flag). If enabling optimization is a problem, the always_inline attribute can be used, as described in the documentation.
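As a hedged sketch of the attribute-based variant, the code below wraps GCC's always_inline function attribute in a macro. The macro name FORCE_INLINE and the functions bit_clear and clear_bit_3 are hypothetical; on compilers that do not define __GNUC__ the macro falls back to plain static inline, which is only a hint.

```c
#include <stdint.h>

/* GCC-specific attribute, hidden behind a hypothetical macro name. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Clears bit number 'off' in x. */
FORCE_INLINE uint32_t bit_clear(uint32_t x, uint32_t off)
{
    return x & ~((uint32_t)1 << off);
}

/* With always_inline, GCC substitutes bit_clear here even when
 * optimizations are disabled. */
uint32_t clear_bit_3(uint32_t x)
{
    return bit_clear(x, 3);
}
```

Keeping the attribute behind a macro makes it easy to remove or adapt when the code is built with another compiler.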
A very simple example shows how to implement what we have seen so far. Both the default and the inlined versions of a bit testing function are compiled. The generated assembly code is dumped and analyzed. Note that an ARM toolchain is used.

---- file: bit.c
#include <stdint.h>
#include "bit.h"

#if (CONFIG_INLINED == 1)
static inline
#endif
bit_word_t bit_is_set(bit_word_t x, bit_off_t off)
{
  return x & ((bit_word_t)1 << off);
}

---- file: bit.h
#ifndef BIT_H_INCLUDED
#define BIT_H_INCLUDED

#include <stdint.h>

typedef uint32_t bit_word_t;
typedef uint32_t bit_off_t;

#if (CONFIG_INLINED == 1)
/* inlined build: pull the definition into the including file */
#include "bit.c"
#else
bit_word_t bit_is_set(bit_word_t x, bit_off_t off);
#endif

#endif /* BIT_H_INCLUDED */

---- file: main.c
#include "bit.h"

int main(int ac, char** av)
{
  volatile bit_word_t* x = (bit_word_t*)0xdeadbeef;
  const bit_word_t s = 2;

  /* compiler memory barriers around the loop under study */
  __asm__ __volatile__ ("nop":::"memory");
  while (bit_is_set(*x, s)) ;
  __asm__ __volatile__ ("nop":::"memory");

  return 0;
}

---- file: do_build.sh
# set CROSS_COMPILE to your own toolchain prefix
${CROSS_COMPILE}gcc -DCONFIG_INLINED=0 -Wall -O2 main.c bit.c
${CROSS_COMPILE}objdump -d a.out > not_inlined.s
${CROSS_COMPILE}gcc -DCONFIG_INLINED=1 -Wall -O2 main.c
${CROSS_COMPILE}objdump -d a.out > inlined.s
The default, non inlined version, generates the following code:
8444: e3a03001 mov r3, #1
8448: e0000113 and r0, r0, r3, lsl r1
844c: e12fff1e bx lr
82ec: e59f4018 ldr r4, [pc, #24] ; 830c
82f0: e5140110 ldr r0, [r4, #-272] ; 0x110
82f4: e3a01002 mov r1, #2
82f8: eb000051 bl 8444
82fc: e3500000 cmp r0, #0
8300: 1afffffa bne 82f0
The inlined version results in the following code:
82e8: e59f3010 ldr r3, [pc, #16] ; 8300
82ec: e5130110 ldr r0, [r3, #-272] ; 0x110
82f0: e2100004 ands r0, r0, #4
82f4: 1afffffc bne 82ec
First, note that the branching instructions are no longer present (default version, offsets 0x82f8 and 0x844c).
Also, since it knows the function body, GCC can make smarter decisions. It uses the ands instruction (inlined version, offset 0x82f0), removing the need for the cmp instruction (default version, offset 0x82fc).
Finally, note that the mask (1 << 2, that is 4) is computed at compile time and encoded as an immediate operand (inlined version, offset 0x82f0). In the particular case of the ARM instruction set, it does not result in a smaller code size (see default version, 0x8448).
While very simple, this example illustrates the advantages of inlining function calls. If you are interested in a larger code base, I suggest you have a look at the Linux kernel source tree. For instance, the linked list library makes heavy use of it: http://lxr.free-electrons.com/source/include/linux/list.h
Note that it actually uses preprocessor macros along with inlined functions. This is done to implement type based code generation (similar to C++ templates), not to reduce overhead.
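To give an idea of what type based code generation can look like, here is a minimal sketch. It is not the kernel's code: the DEFINE_MAX macro and the generated function names are hypothetical, chosen only to show a macro that expands into one typed static inline function per invocation.

```c
#include <stdint.h>

/* Hypothetical generator macro: each invocation expands to one typed
 * static inline function, a rough analogue of a C++ template. */
#define DEFINE_MAX(type, name)                  \
    static inline type name(type a, type b)     \
    {                                           \
        return a > b ? a : b;                   \
    }

DEFINE_MAX(int32_t, max_i32)
DEFINE_MAX(uint8_t, max_u8)
DEFINE_MAX(double, max_f64)

/* The generated functions are used like any other inline function. */
int32_t largest_of_three(int32_t a, int32_t b, int32_t c)
{
    return max_i32(max_i32(a, b), c);
}
```

The macro only generates the code; type checking and inlining are still done by the compiler on the resulting functions.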
If you are interested in a real world embedded application where inlining played an important role, I suggest you look at an open source project I recently worked on:
In software engineering, portability and well designed interfaces are especially important. Still, some parts of the source code require fine tuning to reduce overhead. The C language offers mechanisms to meet these goals, function call inlining being one of them.
However, one must keep in mind that inlining code often results in executables that are larger in size. It can be an issue on microcontrollers with limited memory resources. Whether or not to use function inlining is, as often, a trade-off.
Next post by Fabien Le Mentec:
Interfacing LINUX with microcontrollers