
Scorchers, Part 3: Bare-Metal Concurrency With Double-Buffering and the Revolving Fireplace

Jason Sachs, July 25, 2020

This is a short article about one technique for communicating between asynchronous processes on bare-metal embedded systems.

Q: Why did the multithreaded chicken cross the road?

A: to To other side. get the

— Jason Whittington

There are many reasons why concurrency is hard to manage — at least if you’re using an imperative programming language with shared mutable state. Using message-passing between isolated processes (exemplified by languages such as Go and Erlang) is one way to deal with concurrency… but then you need a system that supports message-passing as a primitive operation, along with the resources to support it.

Run the concept of message-passing by anyone working on a resource-constrained embedded system, and they will probably look at you as though you have two heads. You’ll need some kind of a hardware or software queue with enough room to handle pending messages. This approach works better on a high-end embedded system or a general-purpose PC, where you can make use of the message-passing features of operating systems, and the required overhead is small. The typical 8- or 16-bit system running bare-metal and with kilobytes (or less!) of RAM rather than megabytes or gigabytes doesn’t have the luxury of using queues everywhere — at least not without careful design.

So let’s set the stage for one alternative, a bare-metal system that has two processes that need to communicate.

Speedy and Poky

Let’s say we’re using a microcontroller with these features:

  • some nominal amount of RAM (say, 4096 bytes)
  • a single-core processor which accesses the RAM in a way that can be modeled by sequential consistency
  • an analog-to-digital converter (ADC) that can digitize a few signals on some of its pins
  • some kind of timing mechanism that can trigger the ADC on a regular basis, say at 10kHz
  • an interrupt that can be triggered when the ADC finishes digitizing those signals

These are all pretty common; except for the really low-end 8-bit processors without ADCs, you’ll have no trouble finding such a microcontroller. (And if autotriggering of the ADC isn’t available, a slightly-less-desirable fallback is to manually trigger it in software upon entering a timer interrupt service routine.)

The only aspect that might be harder to understand and verify is the sequential consistency. This is one simple model of computing for a processor core that executes a series of instructions, one after the other, where each instruction that reads memory can see the results of any previous instruction that writes memory. This is not the only computational memory model that exists, but it’s the rule rather than the exception in simple embedded systems. On the other end are today’s desktop and server processors, which have not only multiple cores accessing shared memory through multilevel memory caches, but also out-of-order execution. In those kinds of systems, the ordering guarantees of memory access are relaxed, and you only get stricter guarantees when you use barriers or fences to ensure that a write from one computation happens before a read from another.

The firmware for our microcontroller will be programmed in C, and does not use an operating system. We have two processes (I use the term “processes” in the abstract sense, not in the sense of an OS like Linux that schedules some number of processes) that we’ll call Speedy and Poky:

  • Poky starts executing from main(), and after some initialization steps, its only job is to repeat the following tasks in order:
    • accept commands through some kind of serial communication port (CAN or UART)
    • process them (which may involve coordination with Speedy)
    • send the results back over the serial communication port
  • Speedy executes in an interrupt service routine that is called at 10kHz whenever the ADC samples are available, performing these tasks:
    • getting the results of the ADC samples
    • performing just enough signal processing to relieve Poky from the burden of these computations
    • exchanging information with Poky
    • in some cases, altering the microcontroller output via PWM signals or a DAC

This sort of Speedy/Poky system is not contrived: it’s something I’ve worked with for the last 25 years, and is found in many motor drives, digital power converters, and other industrial systems.

Because this is a single-core processor, when an interrupt occurs:

  • Poky’s execution stops (this might not occur instantly; it typically takes a few instruction cycles for things like pipelines to clear)
  • the program counter (PC) is saved in hardware
  • the ISR starts running, and saves any CPU registers it may need to change
  • Speedy executes
  • the ISR restores any CPU registers to their saved state
  • the CPU jumps back to the saved PC and continues Poky’s execution.

From the standpoint of Poky, Speedy executes instantly in a flash, between the instructions of Poky’s firmware. With one exception, Poky must assume that Speedy’s execution can occur anytime, usually in the spot between instructions that is least convenient. The one exception is that Poky can temporarily disable the ADC interrupt during a critical section of code, and while ADC interrupts are disabled, Speedy’s execution is deferred. This means Poky doesn’t have to worry about Speedy changing RAM during the critical section. But it means that Speedy may not run exactly at 10kHz, especially if Poky disables interrupts for a long period of time.
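As a concrete illustration, here is a minimal sketch of such a critical section. The DISABLE_ADC_INTERRUPT/ENABLE_ADC_INTERRUPT macros are hypothetical; on a real part they would map to the device’s ADC interrupt-enable bit.

#include <stdint.h>

/* Hypothetical macros: on a real device these would set or clear
 * the ADC interrupt-enable bit in the appropriate register. */
#define DISABLE_ADC_INTERRUPT()   /* device-specific */
#define ENABLE_ADC_INTERRUPT()    /* device-specific */

volatile int32_t shared_total;

void poky_critical_update(int32_t delta)
{
    DISABLE_ADC_INTERRUPT();   /* Speedy is now deferred */
    shared_total += delta;     /* multi-instruction read-modify-write is safe */
    ENABLE_ADC_INTERRUPT();    /* a pending ADC interrupt fires here */
}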

The challenge here is how to get Speedy and Poky to communicate with each other. We can do it if we define a mechanism that includes shared state along with some rules on how to use it.

Blocking and Non-Blocking Concurrency

Now, the safest way to manage concurrency on machines that use shared memory is by using a well-designed library primitive like a mutex. A mutex is an example of blocking concurrency, where two processes contend for a resource by waiting to acquire a mutex. Processes only access the shared memory when they have acquired the mutex, and they release it as soon as they are done so that another process can gain access. This technique safely serializes access to the shared memory.

In the case of Speedy and Poky, though, using a mutex may not work: Speedy is an example of a hard real-time process, where something MUST execute every so often with some maximum bounded delay. In Speedy’s case, we haven’t specified how strictly the execution timing of the interrupt handler has to be managed so it can run and complete its processing, but we’d like it to be every 100 microseconds, and maybe that means we can tolerate 10 or 20μs of latency before the interrupt kicks off. This is something you’d have to look at on a case-by-case basis, and see what the consequences are of being late. (In a motor drive, for example, you have some amount of time between digitizing ADC inputs and updating PWM outputs, and being late can cause disturbances to a control loop, leading to audible noise, vibration, or even loss of control.) What Speedy cannot tolerate, however, is an indefinite delay. Blocking on a mutex is not something that should be done in an ISR; even if you can guarantee the maximum latency somehow, it’s a red flag. ISRs should contain only non-blocking code.

Furthermore, although the use of non-blocking code in an ISR is necessary, it is not sufficient. Speedy not only can’t get stuck; it must complete its required tasks. If Speedy detects that it cannot access shared memory because Poky is in the middle of writing to something, then Speedy’s task fails, and all it can do is set some kind of a flag to note that it was unable to complete its processing, potentially putting in jeopardy the correctness of all future computation.

There are a number of wait-free data structures (linked lists, queues, hash maps) that could be used where concurrent processes need to access shared memory. You might see them in multithreaded software on general-purpose computers, but they are overkill here.

We’re going to take advantage of a few properties of the Speedy/Poky system that aren’t present in general-purpose computers, and describe something simpler, a variant of double-buffering that I call the Revolving Fireplace.

The Revolving Fireplace

Picture this:

Shaggy and Scooby-Doo are in a haunted mansion. They are trying to solve a mystery with their friends, but in the meantime they are hungry and a bit cold, so they plunder sandwich fixings from the pantry and bring them to another room where there are some regal burgundy velvet chairs in front of a fireplace. There is a warm, crackling fire going, and a small table next to the fireplace, so they each begin fixing a sandwich on the table. These are soon piled high with all the food they have found.

Scooby hears a faint howling noise, and jumps up into Shaggy’s arms. They go out into the hall to investigate, but don’t see anything. There is a noise of stone scraping stone, like that of a heavy crypt lid sliding open, and when they return, the table is bare: both the sandwiches and all the food are gone.

Shaggy and Scooby search the room, looking for their sandwiches, and while their backs are turned, there is that scraping noise again, and this time, out of the corner of their eye, they could swear they see the table move just for a moment. This time a large stuffed black bear is standing next to the table, a taxidermist’s masterpiece, with arms and claws raised, glaring at them. Shaggy laughs nervously. He waves his hand in front of the bear’s face. “It’s only a dummy,” he says, knocking on the bear’s skull. Eyes move within the bear’s head, and there is a growling noise. “Zoinks!” yells Shaggy. He and Scooby run out of the room and down the hall, arms and legs flailing erratically as they go.

Scooby jumps onto the stairway bannister and slides down, with Shaggy following close behind, whereupon they land in a heap at the bottom of the stairs. Fred, Velma, and Daphne look at them disapprovingly. “Where have you two been?” says Fred. “We’re looking for clues to this mystery,” says Velma.

“We found a h-h-h… a h-haunted fireplace!” says Shaggy.

“Raunted rireplace!” says Scooby.

Fred, Velma, and Daphne are skeptical, but Shaggy and Scooby lead the way back upstairs and into the room with the fireplace, where their sandwiches are sitting quietly on the table, just the way they’d left them.

“But… but there was a bear here!” says Shaggy. Scooby eats both of the sandwiches.

Fred, Velma, and Daphne are not convinced.

Later in the episode, Daphne falls down a laundry chute and into another room with regal blue velvet chairs in front of an identical fireplace and table. The rest of the gang manages to find Daphne, along with a secret lever that makes the fireplaces rotate in tandem, from one room to the other. It turns out that there were a pair of counterfeiters just messing with Shaggy and Scooby as a practical joke, and though all their counterfeiting machinery is safely hidden on the other side of town in an abandoned candy factory, the gang captures them and finds a counterfeit \$20 bill, and that gives the District Attorney probable cause for a warrant to search the candy factory. “We would have gotten away with it, too, if it weren’t for you nosy meddling kids!” one of the counterfeiters says as the cops put handcuffs on them.

Hypothetically speaking, that is. I could have sworn there was an episode of Scooby-Doo with a pair of rotating fireplaces (with the Harlem Globetrotters?). Unfortunately, there does not seem to be such a thing. I seem to have gotten this mixed up with that episode of Captain Caveman with the twin ski lodges, or maybe the fire scene from Indiana Jones and the Last Crusade.

The revolving what?!

As a concurrency protocol, the revolving fireplace is a poor-man’s messaging system.

A true message-passing architecture, with isolation between transmit and receive processes, requires these elements:

  • temporary storage on the transmit end, to construct a message
  • temporary storage on the receive end, to receive a message
  • some entity acting as an intermediary between transmit and receive processes (for example, an operating system, or in cases of bare-metal systems, a module containing functions and data structures for facilitating message-passing)
  • some method of allowing the transmit end to know whether a message can be sent
  • some method of allowing the transmit end to request the intermediary transmit a message from the transmitter’s temporary storage, and know when the transmission is complete
  • some method of allowing the receive end to know whether a message is available to be received
  • some method of allowing the receive end to request the intermediary deliver the next available message into the receiver’s temporary storage, and know when the reception is complete
  • storage capacity for the intermediary to hold messages between their transmission and reception

This puts the burden of any concurrency on the intermediary. The transmitter and receiver themselves are decoupled and there’s no shared state that needs a mutex or for them to cooperate.

In the case of Speedy and Poky, we don’t need all that.

Speedy and Poky each need some kind of a working space to prepare the information they’re going to share with each other. With shared memory, rather than messaging, Poky needs the freedom to write information to an appropriate place, and only when it is finished should it signal for Speedy to read the information. Our “fireplaces” are two sections of shared memory that can be swapped between Speedy and Poky; we also need some synchronization mechanism to signal an event (“ready to publish” or “request to change ownership”), and some rules for proper cooperation.

The synchronization mechanism can be implemented using a single shared variable and actions to manipulate it, as long as those actions have three important properties:

  • access to the shared variable is atomic
  • the shared variable is considered volatile
  • access to the shared variable is prohibited from being reordered with respect to other key operations, either by the processor or by the compiler.
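In C, this shared variable can be as humble as a single flag; here is a minimal sketch (the same idea as the switch_request flag used in Example 2 below):

#include <stdbool.h>

/* Set by Poky, cleared by Speedy. The volatile qualifier covers the
 * second property; atomicity and ordering, the first and third, are
 * what the rest of this section is about. */
static volatile bool switch_request = false;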

Here’s where the trickiness lies, because unfortunately the C language standard doesn’t give us much help in a form that the average reasonable programmer can easily understand.

Atomic and volatile: Another Visit to the Basement

volatile

The volatile property just requires use of the volatile keyword in C, which tells the compiler not to make any optimization based on an assumption that the compiler knows what is contained in a variable. For example:

int16_t ultimate_answer(void)
{
    volatile int16_t x = 6*9;
    x = 42;
    return x;
}

Without the volatile qualifier on the local variable x, the compiler could just have ultimate_answer() return 42 to the caller. Instead, the compiler is forced to store 54 into x, then store 42 into x, and then read the content of x before returning that value to the caller.

Furthermore, access to different volatile variables cannot be reordered (although the compiler can reorder non-volatile variables with respect to volatile variables); the C standard’s way of stating this is (in draft N1256 section 5.1.2.3)

Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment. Evaluation of an expression may produce side effects. At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have taken place. (A summary of the sequence points is given in annex C.)

(More recent drafts like N1570 have muddied the waters and expressed this in a way that is less clear to the average reader.)

I’ve mentioned volatile in a previous article; if you want to understand it in more depth, go read that article.

Ordering

As long as the critical ordering of computation is determined by C’s rules on volatile variables, there’s no additional synchronization needed. Memory barriers (e.g. asm volatile ("" : : : "memory"); as outlined in John Regehr’s post) may be needed to constrain non-volatile variables from being reordered.
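For instance, here is a sketch of that GCC-style compiler barrier in use, keeping non-volatile buffer writes from being sunk below a volatile flag store:

#include <stdint.h>
#include <stdbool.h>

uint16_t buffer[8];          /* non-volatile shared data */
volatile bool data_ready;

void publish(const uint16_t *src)
{
    for (int i = 0; i < 8; ++i)
    {
        buffer[i] = src[i];
    }
    /* Compiler barrier: the compiler may not reorder the buffer writes
     * past this point. On a single-core, sequentially consistent CPU
     * this is enough; a multicore part would also need a hardware fence. */
    asm volatile ("" : : : "memory");
    data_ready = true;
}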

Atomicity

For atomicity, we just need to ensure that reading or writing the shared variable happens in one indivisible operation; if it takes Poky two steps to do something, and Speedy executes in between those two steps, then Speedy might see an invalid value in its shared state. (For example, setting 16-bit halves of a 32-bit pointer.)
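Here is a sketch of that hazard on a hypothetical 16-bit machine, where a 32-bit store compiles to two 16-bit stores:

#include <stdint.h>

volatile uint32_t shared;    /* volatile alone does NOT make this atomic */

void poky_writes(void)
{
    /* Suppose shared currently holds 0x0000FFFF. On a 16-bit CPU this
     * assignment is typically two stores, one for each 16-bit half. */
    shared = 0x00010000uL;
}

/* If Speedy's ISR runs between the two stores, it can read a torn
 * value such as 0x00000000 or 0x0001FFFF, values that were never
 * written by anyone. */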

Most (if not all) processors can guarantee that a memory load or store of the machine word size is atomic: on a 16-bit system, for instance, you can load or store a 16-bit value in memory in one instruction, and an interrupt can’t pop up in the middle of a load or store. Similarly, sometimes there are atomic instructions for setting or clearing bits, or even for loading/storing two words at once. (Read the fine print, though; on a dsPIC33 device, there is a MOV.D instruction that takes 2 cycles and is not interruptible, but because the device has a 16-bit bus, it is possible for DMA to sneak in between the cycles and ruin the atomicity, at least with respect to DMA.)

If the compiler doesn’t have builtins or intrinsics that are guaranteed to map into one of these atomic instructions, then you may need to rely on inline assembly to use them.

There’s a weaker method, which is to trust the compiler, but in this case the burden is on you to trust but verify that the compiler is doing what you think it is doing. This is dangerous.

If you’re lucky enough to have a modern C or C++ compiler — where “modern” means C11/C++11 or later — there are features provided by the std::atomic library in C++, or <stdatomic.h> in C, that can guarantee all three of these behaviors from the compiler:

  • values will be read or updated in an indivisible operation
  • access to values will occur in sequentially consistent ordering
  • the compiler is prohibited from making optimizations based on assumptions about the values

Unfortunately, I’m not familiar enough with the C11/C++11 atomics to give good advice on their proper use; you can look at the subsection below called Speculative Advice. If you don’t have a modern C or C++ compiler, you’ll have to see if your C compiler has builtins or intrinsics that can guarantee atomic operation.

Speculative Advice

This section contains my limited and somewhat questionable advice on atomic support in C. (WARNING! I have never used these before due to availability in my working environment, so take this with a grain of salt and do your own due diligence.)

If your C compiler supports C11’s <stdatomic.h> (compilers are allowed to define __STDC_NO_ATOMICS__ to say that they can’t be bothered with this stuff and you’re on your own), then you should be able to use any of the base types such as atomic_int or atomic_bool, or qualify any type with _Atomic() such as _Atomic(uint32_t). Here you just use the variables the way you want and the compiler will guarantee there aren’t data races, although it may need to use a lock in the underlying implementation in order to make this guarantee, even if your C code does not specify use of a lock. For example:

#include <stdint.h>
#include <stdatomic.h>

_Atomic(uint32_t) shared_var1;
atomic_uint shared_count = 0;

// thread 1
void store_var1(uint32_t val)
{
    shared_var1 = val;
    ++shared_count;
}

// thread 2
uint32_t replace_old_with_44_maybe(unsigned int maxcount)
{
    uint32_t oldval = shared_var1;
    /* Warning: the code below doesn't perform the if-statement and body 
     * in an atomic manner; for that, you would need to use 
     * a critical section or mutex.
     */
    if (shared_count <= maxcount) 
    {
        shared_var1 = 44;
    }
    return oldval;
}

The accesses to shared_var1 might turn into a function call that allows only the currently-running thread to read or write it.

Alternatively you can use functions like atomic_load and atomic_store to work with “regular” variables in an atomic manner:

#include <stdint.h>
#include <stdatomic.h>

uint32_t shared_var1;
unsigned int shared_count = 0;

// thread 1
void store_var1(uint32_t val)
{
    atomic_store(&shared_var1, val);
    atomic_fetch_add(&shared_count, 1);  
    // atomic version of ++shared_count
}

// thread 2
uint32_t replace_old_with_44_maybe(unsigned int maxcount)
{
    uint32_t oldval = atomic_load(&shared_var1);
    /* Warning: the code below doesn't perform the if-statement and body 
     * in an atomic manner; for that, you would need to use 
     * a critical section or mutex.
     */
    if (atomic_load(&shared_count) <= maxcount) 
    {
        atomic_store(&shared_var1, 44);
    }
    return oldval;
}

Sequential consistency is only guaranteed between accesses to the shared variables through these functions; if one thread reads or writes from one of these variables directly, all bets are off.

If your compiler does not support C11 or does not provide <stdatomic.h> support, then some compilers provide intrinsic substitutes. For example, GCC 4.7.0 (and later) has __atomic_load_n and __atomic_store_n:

#include <stdint.h>

uint32_t shared_var1;
unsigned int shared_count = 0;

// thread 1
void store_var1(uint32_t val)
{
    __atomic_store_n(&shared_var1, val, __ATOMIC_SEQ_CST);
    __atomic_fetch_add(&shared_count, 1, __ATOMIC_SEQ_CST);  
    // atomic version of ++shared_count
}

// thread 2
uint32_t replace_old_with_44_maybe(unsigned int maxcount)
{
    uint32_t oldval = __atomic_load_n(&shared_var1, __ATOMIC_SEQ_CST);
    /* Warning: the code below doesn't perform the if-statement and body 
     * in an atomic manner; for that, you would need to use
     * a critical section or mutex.
     */
    if (__atomic_load_n(&shared_count, __ATOMIC_SEQ_CST) <= maxcount) 
    {
        __atomic_store_n(&shared_var1, 44, __ATOMIC_SEQ_CST);
    }
    return oldval;
}

Clang appears to provide builtins like __c11_atomic_store and __c11_atomic_load but it’s unclear if there’s any situation where you should use this instead of the standard C11 features.

As to what kind of code actually runs under the hood on a typical low-end single-core processor with sequentially-consistent memory access, we can take a look at Compiler Explorer on the MSP430 where <stdatomic.h> is supported. I wrote a handful of short functions with _a_ (atomic) and _v_ (volatile) variants to perform loads and stores of bool, uint16_t, and uint32_t values, along with a 16-bit increment:

#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

typedef _Atomic(uint16_t) atomic_uint16_t; 
typedef _Atomic(uint32_t) atomic_uint32_t;

void store_v_b(volatile bool *pshared, bool b) {
    *pshared = b;
}

void store_v_16(volatile uint16_t *pshared, uint16_t x) {
    *pshared = x;
}

void store_v_32(volatile uint32_t *pshared, uint32_t x) {
    *pshared = x;
}

void store_a_b(atomic_bool *pshared, bool b) {
    *pshared = b;
}

void store_a_16(atomic_uint16_t *pshared, uint16_t x) {
    *pshared = x;
}

void store_a_32(atomic_uint32_t *pshared, uint32_t x) {
    *pshared = x;
}

bool load_v_b(volatile bool *pshared) {
    return *pshared;
} 

uint16_t load_v_16(volatile uint16_t *pshared) {
    return *pshared;
}

uint32_t load_v_32(volatile uint32_t *pshared)
{
    return *pshared;
}

bool load_a_b(atomic_bool *pshared)
{
    return *pshared;
}

uint16_t load_a_16(atomic_uint16_t *pshared)
{
    return *pshared;
}

uint32_t load_a_32(atomic_uint32_t *pshared)
{
    return *pshared;
}

void inc_a_16(atomic_uint16_t *pshared)
{
    atomic_fetch_add(pshared, 1);
}

void inc_v_16(volatile uint16_t *pshared)
{
    ++*pshared;
}

With MSP430 gcc 6.2.1 -O2, this compiles to

store_v_b:
        MOV.B   R13, @R12
        RET
store_v_16:
        MOV.W   R13, @R12
        RET
store_v_32:
        MOV.W   R13, @R12
        MOV.W   R14, 2(R12)
        RET
store_a_b:
        AND     #0xff, R13
        MOV.B   R13, @R12
        RET
store_a_16:
        MOV.W   R13, @R12
        RET
store_a_32:
        MOV.B   #5, R15
        CALL    #__atomic_store_4
        RET
load_v_b:
        MOV.B   @R12, R12
        RET
load_v_16:
        MOV.W   @R12, R12
        RET
load_v_32:
        MOV.W   2(R12), R13
        MOV.W   @R12, R12
        RET
load_a_b:
        MOV.B   @R12, R12
        RET
load_a_16:
        MOV.W   @R12, R12
        RET
load_a_32:
        MOV.B   #5, R13
        CALL    #__atomic_load_4
        RET
inc_a_16:
        MOV.B   #5, R14
        MOV.B   #1, R13
        CALL    #__atomic_fetch_add_2
        RET
inc_v_16:
        ADD.W   #1, @R12
        RET

You’ll note that in most cases these are trivial, lightweight implementations, even for the _Atomic variants. The exceptions are:

  • 32-bit atomic load
  • 32-bit atomic store
  • 16-bit atomic increment

In these cases, the compiler calls a library function, which presumably disables interrupts just long enough to run a few critical instructions. (Anyone out there familiar enough with the MSP430 compiler that could say what’s inside these?) The 32-bit load and store make sense, but I’m intrigued by the 16-bit increment — you can see there is a single instruction ADD.W #1, @R12 that works in the volatile case. Either there is something about the CPU execution or memory models on the MSP430 that makes this insufficient for the guarantees needed by C11 atomics, or the compiler writers haven’t finished tweaking the compiler to reduce the 16-bit fetch_add case to the ADD.W instruction.
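I don’t know what’s actually inside those library functions, but a plausible bare-metal implementation would look something like the following sketch. The __get_interrupt_state/__set_interrupt_state/__disable_interrupt intrinsics are an assumption on my part, based on the ones MSP430 toolchains typically provide in <intrinsics.h>.

#include <stdint.h>
#include <intrinsics.h>   /* assumed MSP430 toolchain header */

/* Speculative sketch only; NOT the actual MSP430 libatomic source. */
uint16_t my_atomic_fetch_add_2(volatile uint16_t *p, uint16_t v)
{
    unsigned int state = __get_interrupt_state();
    __disable_interrupt();
    uint16_t old = *p;            /* read...                              */
    *p = old + v;                 /* ...modify-write, with interrupts off */
    __set_interrupt_state(state); /* restore rather than blindly re-enable */
    return old;
}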

DIY atomics with extended assembly

If you don’t have access to <stdatomic.h> or an appropriate builtin, and you REALLY want to try implementing and testing your own atomic functions, you might try using extended assembly. Here’s an example, illustrated with pyxc16:

import pyxc16

src = r'''
#include <stdint.h>
#include <stdbool.h>

#define DECLARE_ATOMIC_READWRITE(T, M) \
inline static void _util_atomic_##T##_write(volatile T *location, T value) \
{ \
    /* atomic equivalent of *location = value; */                \
    asm volatile(";! BEGIN _util_atomic_" #T "_write\n"          \
                 "\t" M " %[val], [%[loc]]\n"                    \
                 "\t;! END   _util_atomic_" #T "_write"          \
                 : /* no outputs */                              \
                 : [val] "r" (value), [loc] "r" (location));     \
} \
\
inline static T _util_atomic_##T##_read(volatile T *location) \
{ \
    /* atomic equivalent of return *location; */                 \
    T result;                                                    \
    asm volatile(";! BEGIN _util_atomic_" #T "_read\n"           \
                 "\t" M " [%[loc]], %[result]\n"                 \
                 "\t;! END   _util_atomic_" #T "_read"           \
                 : [result] "=r" (result)                        \
                 : [loc] "r" (location));                        \
    return result;                                               \
}

typedef volatile void *voidptr;
DECLARE_ATOMIC_READWRITE(uint16_t, "mov")
DECLARE_ATOMIC_READWRITE(voidptr, "mov")
DECLARE_ATOMIC_READWRITE(bool, "mov.b")

// just using volatile
uint16_t test0(volatile uint16_t *px, uint16_t v, 
               volatile voidptr *pv, 
               volatile bool *pb)
{
    *px = v;
    *px = ++v;
    *++px = v+37;
    *pv = px;
    uint16_t y = *px;
    *pb = y > 100;
    
    return y;
}

uint16_t test1(uint16_t *px, uint16_t v, 
               voidptr *pv, 
               bool *pb)
{
    _util_atomic_uint16_t_write(px, v);
    _util_atomic_uint16_t_write(px, ++v);
    _util_atomic_uint16_t_write(++px, v+37);
    _util_atomic_voidptr_write(pv, px);
    uint16_t y = _util_atomic_uint16_t_read(px);
    _util_atomic_bool_write(pb, y > 100);
    
    return y;
}
'''
pyxc16.compile(src, '-O2', comment_filter=';!')
_test0:
	mov	w1,[w0]
	inc	w1,[w0]
	mov	#38,w4
	add	w1,w4,[++w0]
	mov	w0,[w2]
	mov	[w0],w0
	mov.b	#1,w1
	mov	#100,w2
	sub	w0,w2,[w15]
	bra	gtu,.L2
	clr.b	w1
.L2:
	mov.b	w1,[w3]
	return
_test1:
	;!	BEGIN _util_atomic_uint16_t_write
	mov	w1, [w0]
	;!	END   _util_atomic_uint16_t_write
	inc	w1,w1
	;!	BEGIN _util_atomic_uint16_t_write
	mov	w1, [w0]
	;!	END   _util_atomic_uint16_t_write
	inc2	w0,w0
	add	#37,w1
	;!	BEGIN _util_atomic_uint16_t_write
	mov	w1, [w0]
	;!	END   _util_atomic_uint16_t_write
	;!	BEGIN _util_atomic_voidptr_write
	mov	w0, [w2]
	;!	END   _util_atomic_voidptr_write
	;!	BEGIN _util_atomic_uint16_t_read
	mov	[w0], w0
	;!	END   _util_atomic_uint16_t_read
	mov.b	#1,w1
	mov	#100,w2
	sub	w0,w2,[w15]
	bra	gtu,.L6
	clr.b	w1
.L6:
	;!	BEGIN _util_atomic_bool_write
	mov.b	w1, [w3]
	;!	END   _util_atomic_bool_write
	return

The DECLARE_ATOMIC_READWRITE macro takes in a typename T and a move instruction M, and generates functions that utilize inline assembly with a single line of implementation (surrounded by BEGIN and END comments to tell where it comes from). The single line is the key — if you can implement a load or store on a dsPIC with a single MOV.B or MOV or MOV.D then it is uninterruptible, and does not suffer from any read-modify-write problems. (Although, again: MOV.D is a 2-cycle instruction due to the use of a 16-bit data bus; it’s not interruptible, but a DMA transaction could sneak in between the bus accesses.)

(Here’s why read-modify-write is a tricky issue. The dsPIC33 architecture has pipelined execution where each instruction is handled with a fetch and execute phase. We have to be sure that an interrupt cannot sneak in between those fetch and execute stages for something like an INC (increment) instruction in indirect mode where contents of memory, specified by a pointer, are incremented, since it accesses memory both on the fetch and execute stages of the pipeline. The manuals on the dsPIC33 cores are mostly clear on interrupt processing, namely that all instructions are allowed to complete before an interrupt occurs. The read-modify-write behavior can cause a stall in the instruction pipeline in order to prevent the next instruction from fetching old data, but it is uninterruptible. If you are using a different architecture, you need to understand very carefully how it executes code.)

At any rate, you’ll note that the assembly generated for test0() and test1() is almost the same — in test0() the compiler saves 2 instructions because it can optimize some of the generated code, whereas in test1() it is restricted to plunking down the inline assembly of the _util_atomic_* functions as black boxes it cannot change. If I weren’t picky, I’d just use test0(), because the compiler should know better and treat 16-bit assignments as atomic operations. But the compiler has no obligation to do so, at least as far as the C standard is concerned.

Just to reiterate one point: C11 and C++11 atomic<> are supposed to help you write correct programs. The DIY approach is not recommended if you can use something that gives you better, safer guarantees from your compiler.

If you really want to delve into the dark corners of memory models and the C11/C++11 atomics, check out Herb Sutter’s talks on “atomic<> Weapons” which get into behaviors of multicore processors and acquire/release semantics, and how you’re supposed to write C code using C11 or C++11 atomics.

Anyway, that’s the word from the basement. Yeccch, I feel kind of icky. If you need advice on these topics, please consult a professional who is familiar with the C standard.

Revolving Fireplace, Example 1

OK, now to some actual code! Here’s one example of a revolving fireplace implementation. We’re going to make Speedy accumulate a sum of ADC samples for current and voltage. Poky will read the accumulated sums and utilize them, transmitting them over a communications port when requested.

In this example, the protocol is as follows:

  • At any given time outside of a switch operation, Poky and Speedy each have access to their own fireplace, and are not allowed to access the contents of the other fireplace.
  • Only Poky is allowed to switch the fireplaces.
  • Speedy must leave its fireplace contents in a valid state after each ISR completion.
/* fireplace.h */

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int32_t voltage_sum;
    int32_t current_sum;
    uint16_t count;
} FIREPLACE;

typedef struct {
    FIREPLACE fireplaces[2];
    struct {
        volatile FIREPLACE *speedy;
        volatile FIREPLACE *poky;
    } access;
} RAW_SHARED_MEMORY; // w/o volatile -- do not use directly

typedef volatile RAW_SHARED_MEMORY SHARED_MEMORY;

inline static void fireplace_init(volatile FIREPLACE *fp)
{
    fp->voltage_sum = 0;
    fp->current_sum = 0;
    fp->count = 0;
}

inline static void fireplace_switch(SHARED_MEMORY *shmem)
{
    volatile FIREPLACE *tmp = shmem->access.poky;
    shmem->access.poky = shmem->access.speedy;
    shmem->access.speedy = tmp;
}
/* poky.h */
typedef struct {
    int64_t voltage_sum;
    int64_t current_sum;
    uint32_t count;
} POKY_STATE;

inline static void poky_sum_init(POKY_STATE *pstate)
{
    pstate->voltage_sum = 0;
    pstate->current_sum = 0;
    pstate->count = 0;
}
/* poky.c */

#include "fireplace.h"
#include "poky.h"

void poky_init(POKY_STATE *pstate, SHARED_MEMORY *shmem)
{
    poky_sum_init(pstate);
    fireplace_init(&shmem->fireplaces[0]);
    fireplace_init(&shmem->fireplaces[1]);
    shmem->access.poky   = &shmem->fireplaces[0];
    shmem->access.speedy = &shmem->fireplaces[1];
}

void poky_step(POKY_STATE *pstate, SHARED_MEMORY *shmem)
{
    /* called during the main loop */
    fireplace_switch(shmem);
    volatile FIREPLACE *mine = shmem->access.poky;
    if (mine->count > 0)
    {
        // Speedy has accumulated samples! 
        // Let's accumulate that into an overall sum.
        pstate->voltage_sum += mine->voltage_sum;
        pstate->current_sum += mine->current_sum;
        pstate->count += mine->count;

        // Now zero out the accumulators 
        // so we can switch fireplaces next time.
        fireplace_init(mine);
    }

    // If we get a message asking for the sums, send them and zero them out.
    if (should_we_transmit_sums())
    {
        transmit_sums(pstate->voltage_sum, pstate->current_sum, pstate->count);
        poky_sum_init(pstate);
    }
}
/* speedy.c */

#include "fireplace.h"

void speedy_step(SHARED_MEMORY *shmem)
{
    // read ADC
    int16_t current = read_adc(ADC_CURRENT);
    int16_t voltage = read_adc(ADC_VOLTAGE);

    volatile FIREPLACE *mine = shmem->access.speedy;
    mine->voltage_sum += voltage;
    mine->current_sum += current;
    ++mine->count;
}

There! We have our fireplace protocol. Speedy doesn’t care, he just dumps ADC samples into accumulators in his fireplace, because Speedy can never be interrupted by Poky.

Poky can be interrupted by Speedy, but because Speedy isn’t allowed to switch the fireplaces, it is safe for Poky to read or write from Poky’s own fireplace.

Note that the fireplace_switch function isn’t atomic — Speedy can interrupt between the read and writes to shmem->access — but this doesn’t matter; Speedy always gets a consistent fireplace to use, and Poky only accesses the fireplaces before or after the switch.

The only other requirement is that Poky must switch the fireplaces before the variables overflow, in other words, at least once every 65536 samples. (The uint16_t count wraps after 2^16 increments, and an int32_t sum of full-scale 16-bit samples can overflow on the same timescale, since 2^31 / 2^15 = 2^16; at the 10kHz sample rate that works out to roughly every 6.5 seconds.)

Aside from that, Poky can now be assured it has aggregate statistics on the total of every single ADC reading, without missing any samples or double-counting, even though

  • it doesn’t run at the ADC rate
  • it doesn’t run periodically
  • it can be interrupted anywhere by Speedy
  • it might run several times between executions of Speedy (sometimes the main loop has little to do and repeats fast!)

Revolving Fireplace, Example 2

This time, we’ll have Speedy switch the fireplaces on a request from Poky. The differences here are:

  • Speedy no longer has to keep its fireplace content ready for immediate use by Poky; instead, it can provide its data into the fireplace on demand when it’s time to switch
  • After requesting a switch, Poky cannot access the fireplaces until Speedy completes the switch. (Once the request is made, Speedy effectively owns both fireplaces.)
/* fireplace.h */

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int32_t voltage_sum;
    int32_t current_sum;
    uint16_t count;
} FIREPLACE;

typedef struct {
    FIREPLACE fireplaces[2];
    struct {
        volatile FIREPLACE *speedy;
        volatile FIREPLACE *poky;
    } access;

    bool switch_request;  
    // only Poky is allowed to set, 
    // only Speedy is allowed to clear
} RAW_SHARED_MEMORY; // w/o volatile -- do not use directly

typedef volatile RAW_SHARED_MEMORY SHARED_MEMORY;

inline static void fireplace_init(volatile FIREPLACE *fp)
{
    fp->voltage_sum = 0;
    fp->current_sum = 0;
    fp->count = 0;
}

inline static void fireplace_switch(SHARED_MEMORY *shmem)
{
    volatile FIREPLACE *tmp = shmem->access.poky;
    shmem->access.poky = shmem->access.speedy;
    shmem->access.speedy = tmp;
}
/* poky.h */
typedef struct {
    int64_t voltage_sum;
    int64_t current_sum;
    uint32_t count;
} POKY_STATE;

inline static void poky_sum_init(POKY_STATE *pstate)
{
    pstate->voltage_sum = 0;
    pstate->current_sum = 0;
    pstate->count = 0;
}
/* poky.c */

#include "fireplace.h"
#include "poky.h"

void poky_init(POKY_STATE *pstate, SHARED_MEMORY *shmem)
{
    poky_sum_init(pstate);
    fireplace_init(&shmem->fireplaces[0]);
    fireplace_init(&shmem->fireplaces[1]);
    shmem->access.poky   = &shmem->fireplaces[0];
    shmem->access.speedy = &shmem->fireplaces[1];
    shmem->switch_request = false;
}

void poky_step(POKY_STATE *pstate, SHARED_MEMORY *shmem)
{
    /* called during the main loop */

    // Skip if Speedy hasn't processed the switch request
    if (!shmem->switch_request)
    {
        volatile FIREPLACE *mine = shmem->access.poky;
        if (mine->count > 0)
        {
            // Speedy has accumulated samples! 
            // Let's accumulate that into an overall sum.
            pstate->voltage_sum += mine->voltage_sum;
            pstate->current_sum += mine->current_sum;
            pstate->count += mine->count;

            // Now zero out the accumulators 
            // so we can switch fireplaces next time.
            fireplace_init(mine);
        }
        shmem->switch_request = true;
    }

    // If we get a message asking for the sums, send them and zero them out.
    if (should_we_transmit_sums())
    {
        transmit_sums(pstate->voltage_sum, pstate->current_sum, pstate->count);
        poky_sum_init(pstate);
    }
}
/* speedy.c */

#include "fireplace.h"

typedef struct {
    FIREPLACE private_fireplace;
} SPEEDY_STATE;

void speedy_init(SPEEDY_STATE *pstate)
{
    fireplace_init(&pstate->private_fireplace);
}

void speedy_step(SPEEDY_STATE *pstate, SHARED_MEMORY *shmem)
{
    // read ADC
    int16_t current = read_adc(ADC_CURRENT);
    int16_t voltage = read_adc(ADC_VOLTAGE);

    FIREPLACE *my_own = &pstate->private_fireplace;
    my_own->voltage_sum += voltage;
    my_own->current_sum += current;
    ++my_own->count;

    if (shmem->switch_request)
    {
        // Time to switch fireplaces! Put latest stats in the fireplace
        volatile FIREPLACE *mine = shmem->access.speedy;
        mine->voltage_sum = my_own->voltage_sum;
        mine->current_sum = my_own->current_sum;
        mine->count = my_own->count;
        fireplace_switch(shmem);
        shmem->switch_request = false;

        fireplace_init(my_own);
    }
}

With this method, Speedy can do the accumulations in its own private fireplace that isn’t declared volatile. When it’s time to switch fireplaces, that’s when Speedy copies its accumulated statistics from its private fireplace into the shared fireplace, and triggers the switch, clearing the switch_request flag.

In terms of execution time, Example 2 may allow the compiler to achieve a slightly quicker ISR outside of fireplace switching (because it can optimize access to Speedy’s private fireplace, which isn’t volatile), but a slower worst case when switching is necessary (because it has to copy the data). Example 1 has consistent ISR timing, but it may be slower on average (when switching is not necessary).

Complete Examples

I have posted both examples in the form of MPLAB X projects on my Github account. These also include a primitive testing facility where Speedy executes from a timer interrupt and Poky executes from the main loop, running some number of iterations, iter_count_end, which defaults to 500. Instead of using the real ADC, I provided a source of mock ADC readings from a linear congruential generator for a reproducible series of samples that I could repeat sequentially to check the correctness of the results. I ran both examples in the MPLAB X simulator with iter_count_end = 16384 and got correct sums in both cases. Does this prove correctness? Absolutely not, but if I had found a bug it would have given me a chance to troubleshoot.

A Few Other Notes

Don’t overuse shared memory!

Whether you’re going to use one of these approaches or something else, try to minimize the amount of program state that is part of shared memory and accessed by more than one thread. It’s much easier to write correct programs when they don’t have to worry about shared variables. This not only applies to you, it applies to the compiler, which has much more freedom to create an optimized implementation of your program when it can assume only one thread accesses an area of memory. So if you’re working with shared memory among threads, even in the simple Speedy/Poky system, and you’re using volatile or _Atomic or some kind of synchronization mechanism, don’t just throw all your program’s state into the concurrency bucket and use volatile or _Atomic everywhere. That’s overkill and you’ll likely suffer an unnecessary performance hit.

Unidirectional vs. Bidirectional

These examples show unidirectional data flow: Speedy is providing information to Poky. There’s no reason it can’t be bidirectional — for example, Poky sending Speedy a voltage command (in case of a variable DC/DC converter) and Speedy sending Poky voltage and current feedback.

In the bidirectional case, if you are really stingy about memory use, the fireplace can contain a union of the unidirectional data used, for example:

typedef struct {
    union {
        struct {
            int16_t voltage_command;
        } to_speedy;
        struct {
            int16_t voltage_feedback;
            int16_t current_feedback;
        } to_poky;
    } u;

    /* things that aren't unidirectional go here */
} FIREPLACE;

so that Poky only reads from mine->u.to_poky and only writes to mine->u.to_speedy, whereas Speedy only reads from mine->u.to_speedy and only writes to mine->u.to_poky. This adds a couple of requirements, so it’s probably not a good solution in most cases, especially if exchanging just a couple of bytes of data:

  • neither process can rely on the fireplace as a storage location for a persistent state variable — instead, it must be only used for communication
  • before switching fireplaces, one of the following conditions has to be true (the second is sketched after this list):
    • the contents of both fireplaces must be valid (which means each process has to update the content)
    • each fireplace contains a “valid” flag that indicates whether the data in the fireplace is valid or not, and both processes must set or clear the valid flag
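The second condition might be expressed like this; a sketch, with the valid flag as a hypothetical field:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    union {
        struct {
            int16_t voltage_command;
        } to_speedy;
        struct {
            int16_t voltage_feedback;
            int16_t current_feedback;
        } to_poky;
    } u;

    bool valid;  /* set by the process that filled in u before the switch,
                    cleared by the other process once it has consumed u */
} FIREPLACE;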

Atomicity requirements

I mentioned atomicity earlier. The atomicity requirements in these two examples are fairly minimal:

  • Speedy’s code has no atomicity requirements, since it can never be interrupted by Poky.
  • Poky’s code has atomicity requirements only on the data that it uses for synchronization (a sketch follows this list):
    • in example 1, this is the assignment to the pointer shmem->access.speedy in fireplace_switch()
    • in example 2, this is reading and writing of the flag shmem->switch_request (it’s hard to imagine a bool read or write being non-atomic, but I try to make no assumptions here)
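For example 1, if you wanted to make that pointer assignment explicitly atomic rather than relying on the compiler emitting a single-word store, a sketch using the GCC builtins mentioned earlier might look like this:

#include "fireplace.h"

/* Sketch: like fireplace_switch(), but the one store that Speedy can
 * observe is made explicitly atomic. On a 16-bit part where pointers
 * are a single machine word, a plain store is likely already atomic;
 * this mostly documents the requirement. Assumes GCC 4.7+ builtins. */
inline static void fireplace_switch_atomic(SHARED_MEMORY *shmem)
{
    volatile FIREPLACE *tmp = shmem->access.poky;
    shmem->access.poky = shmem->access.speedy;  /* Speedy never reads .poky */
    __atomic_store_n(&shmem->access.speedy, tmp, __ATOMIC_SEQ_CST);
}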

Testing (Ay, There’s the Rub)

Concurrency mechanisms are extremely hard to test because of their nondeterministic behavior. There might be one tiny error that is only encountered when processes A and B are executing their instructions in a very specific order relative to each other.

I will offer one idea as a way of facilitating testing.

The fireplace could contain a guard variable, for example (a code sketch follows the list):

  • Poky sets the guard to 0xDEAD just before it starts writing to data in its fireplace, and sets it back to 0x0000 just before it switches fireplaces (or signals a fireplace switch request). This sets up an invariant that Speedy should always see 0x0000 in its fireplace and never anything else, and that the guard is set to 0xDEAD when Poky is writing to Poky’s side of the fireplace.
  • Speedy should check the guard; if it ever sees a nonzero value then it means there is an error, and appropriate action should be taken. (in the paranoid case, trigger a system-wide fault; for investigative purposes, increment a counter that can be reported to developers for further investigation)
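A minimal sketch of this guard scheme, with hypothetical names:

#include <stdint.h>
#include "fireplace.h"

/* In fireplace.h: add to FIREPLACE a guard field,
 *     uint16_t guard;    // 0x0000 = quiescent, 0xDEAD = Poky writing
 * and set it to 0 in fireplace_init(). */

/* Poky, wrapped around its fireplace writes: */
void poky_guarded_update(volatile FIREPLACE *mine)
{
    mine->guard = 0xDEAD;
    /* ... read and write mine's contents ... */
    mine->guard = 0x0000;   /* must be zero again before any switch */
}

/* Speedy, at the top of its ISR: */
volatile uint16_t guard_violations = 0;

void speedy_check_guard(volatile FIREPLACE *mine)
{
    if (mine->guard != 0x0000)
    {
        ++guard_violations;  /* or, in the paranoid case, trigger a fault */
    }
}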

Isn’t this just the same as double-buffering?

Double-buffering is a classic technique in computer graphics where two processes share a pair of buffers to update a display. The buffers represent a grid of display pixels. One process is responsible for drawing shapes onto one of the buffers; the other process displays the contents of the other buffer. This lets the drawing process take its time until the buffer is complete and ready to hand off to the display process; to update the display, the drawing process switches the buffers. As a result you can reduce the chance of display flickering — especially if using something like a page swap.

Another use of double-buffering is for DMA, sometimes called ping-pong buffers, where some hardware process like an ADC is used to gradually fill one of two buffers with a series of samples. When the buffer is filled, the buffers are swapped and the ADC continues filling the other buffer while other software processes the contents of the first buffer.

Structurally, the revolving fireplace technique is a form of double-buffering, but it is explicitly aimed at concurrency and the fireplaces are not really “buffers” consisting of a homogeneous array of elements like the two cases I just outlined for double-buffering.

Why is this better than shared memory protected with a synchronizing flag?

The revolving fireplace allows each side to work on its data without worry of corruption or invalid data transfer; switching fireplaces is where the exchange occurs.

Suppose you wanted to use shared memory and a non-blocking volatile event flag READY for transmitting voltage commands from Poky to Speedy:

  • For Poky to update the command:
    • Poky sets READY to FALSE
    • Poky writes a new voltage command to shared memory
    • Poky sets READY to TRUE
  • For Speedy to act on the command:
    • Speedy checks READY
    • if READY is TRUE, Speedy copies the command to Speedy’s own internal state
    • if READY is FALSE, Speedy ignores the command, relying on the last copy in its internal state

This could work in theory, but in the worst-case for timing, Speedy might repeatedly see READY = FALSE, and be forced to deal with stale commands.
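Here is a sketch of that READY-flag protocol for the command direction, with hypothetical names, to make the comparison concrete:

#include <stdint.h>
#include <stdbool.h>

volatile bool READY = false;
volatile int16_t voltage_command;

/* Poky: update the command. */
void poky_update_command(int16_t v)
{
    READY = false;           /* claim the shared area */
    voltage_command = v;
    READY = true;            /* publish */
}

/* Speedy, in the ISR: act on the command. */
static int16_t latest_command;

void speedy_poll_command(void)
{
    if (READY)
    {
        latest_command = voltage_command;
    }
    /* else: fall back on the stale copy in latest_command */
}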

Something similar is necessary for the other direction:

  • For Poky to read feedback from shared memory:
    • Poky sets READY to FALSE
    • Poky reads what it needs from shared memory
    • Poky sets READY to TRUE
  • For Speedy to update its feedback:
    • Speedy checks READY
    • if READY is TRUE, Speedy updates the shared memory from its own internal state
    • if READY is FALSE, Speedy silently sheds a tear, because it can’t do anything, and must wait until the next ISR to try again

Again, in the worst case for timing, Speedy might repeatedly see READY = FALSE, leaving Poky with very stale data.

The revolving fireplace isolates the sender and receiver so that they are not forced to share a single area of storage.

Disclaimer

Finally, an important disclaimer:

Concurrency is hard. You are free to use these mechanisms, but I have not presented iron-clad proof that they are safe from race conditions or other concurrency hazards. I also do not warrant that the sample code here is bug-free; I have used this type of mechanism on real systems, but they are proprietary and I have had to paraphrase the ideas here rather than posting that proprietary code.

My representation of the guarantees and challenges of both the C standard and microcontrollers, concerning memory model and concurrency issues, is not an authoritative one. I have made a best attempt at investigating these subtleties before reporting them here. If I have made any errors or misleading statements, please bring them to my attention – I apologize and will correct them as best as possible.

Please perform your own due diligence before incorporating new concurrency mechanisms into your own system. (Or don’t, and rely on the mechanisms of an RTOS to meet your needs.)

Wrapup

We looked at a common pattern in embedded control systems, exemplified by a regularly-executing interrupt service routine — aka Speedy — and a main loop — aka Poky — which need to exchange data. Speedy can interrupt Poky but Poky can never interrupt Speedy.

We explored a variant of double buffering called the Revolving Fireplace, which facilitates communication between Speedy and Poky by allowing each to read or write from a separate section of memory (a “fireplace”), with the exchange of data occurring when access to the memory sections are switched, typically by swapping pointers. This lets us focus our concurrency efforts on the switching mechanism; either Speedy can switch the fireplaces directly, or can raise a flag to signal that Poky can switch the fireplaces.

We touched upon some of the concurrency requirements needed to support the Revolving Fireplace:

  • microcontrollers that can be modeled by sequential consistency (no out-of-order execution)
  • the use of volatile to restrict the compiler from optimizing out reads or writes, and from reordering with respect to other reads and writes of volatile variables
  • the need for atomic updates in the fireplace-switching mechanism

We delved a little bit into the basement of C11’s <stdatomic.h> and talked about alternatives for those of us who don’t have access to a C11 compiler. (Note again: if you can use the <stdatomic.h> or C++11’s std::atomic<>, it’s probably worth doing so — the compiler can make certain guarantees of correctness which are stronger and simpler than the manual alternatives.)

I’d love to hear about your experiences with this or other concurrency mechanisms in resource-limited embedded systems!

Acknowledgements

Thanks to Matthew Eshleman and John Payson for their comments and suggestions.


© 2020 Jason M. Sachs, all rights reserved.



Comment by dlaw, June 24, 2023

Thanks for the excellent article (as always). I've usually heard the revolving fireplace referred to as a "ping-pong buffer" (perhaps with some slight contextual differences, but it's the same concept).

Lately, I've been playing around with the Rust programming language for embedded systems. As an exercise, I implemented the revolving fireplace as a Rust library. It works out surprisingly nicely: the fireplace is generic over the type of its contents, fireplace flipping (using built-in atomics) is transparent to users of the library, and Rust's memory safety guarantees are preserved.

It's uploaded to the Rust "crates" ecosystem if you want to take a look:
https://crates.io/crates/atomic_pingpong
https://docs.rs/atomic_pingpong/

Cheers,

David Lawrence
Cambridge, MA
