EmbeddedRelated.com
Forums

How to write a simple driver in bare metal systems: volatile, memory barrier, critical sections and so on

Started by pozz October 22, 2021
Even though I have been writing software for embedded systems for more than 
10 years, there's a topic that from time to time makes me think for hours 
and leaves me with many doubts.

Consider a simple embedded system based on an MCU (AVR8 or Cortex-Mx). 
The software is bare metal, without any OS. The main pattern is the well 
known mainloop (background code) that is interrupted by ISRs.

Interrupts are used mainly for timing and for low-level drivers. For 
example, the UART reception ISR moves the last received char into a FIFO 
buffer, while the mainloop code pops new data from the FIFO.


static struct {
   unsigned char buf[RXBUF_SIZE];
   uint8_t in;
   uint8_t out;
} rxfifo;

/* ISR */
void uart_rx_isr(void) {
   unsigned char c = UART->DATA;
   rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
   rxfifo.in++;
   // Reset interrupt flag
}

/* Called regularly from mainloop code */
int uart_task(void) {
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   return -1;
}


 From a 20-year-old article[1] by Nigel Jones, this seems a situation 
where volatile must be used for rxfifo.in, because it is modified by an ISR 
and used in the mainloop code.

I don't think so: rxfifo.in is read from memory only once in 
uart_task(), so there is no risk of the compiler optimizing it badly. 
Even if the ISR fires immediately after the if statement, this doesn't 
lead to a dangerous state: the just-received data will simply be processed 
at the next call to uart_task().

So IMHO volatile isn't necessary here. And critical sections (i.e. 
disabling interrupts) aren't needed either.

However I'm thinking about memory barriers. Suppose the compiler reorders 
the instructions in uart_task() as follows:


   c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
   if (rxfifo.out != rxfifo.in) {
     rxfifo.out++;
     return c;
   } else {
     return -1;
   }


Here there's a big problem, because the compiler decided to read 
rxfifo.buf[] first and only then test in and out for equality. If the ISR 
fires immediately after the data has been moved to c (most probably an 
internal register), the condition in the if statement will be true and the 
register value is returned. However the register value isn't correct: it 
was read before the new character was stored.

I don't think any modern C compiler reorders uart_task() in this way, but 
we can't be sure. As far as the compiler can tell, the observable result 
doesn't change, so it is allowed to do this kind of thing.

How can I fix this issue if I want to be extremely sure the compiler will 
not reorder this way? Applying volatile to rxfifo.in alone shouldn't help, 
because the compiler is still allowed to reorder accesses to non-volatile 
variables around volatile accesses[2].

One solution is adding a memory barrier in this way:


int uart_task(void) {
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     memory_barrier();
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   return -1;
}
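
(memory_barrier() is not defined here; with GCC or clang it would typically 
be just a compiler barrier, something like this sketch:)

/* Compiler barrier only: the compiler may not move or cache memory
 * accesses across this point; no CPU instructions are emitted. */
#define memory_barrier()  __asm__ __volatile__("" ::: "memory")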


However this approach appears dangerous to me. You have to check and 
double-check if, when and where memory barriers are necessary, and it's 
easy to skip a barrier where it's needed or to add a barrier where it 
isn't needed.

So I'm thinking that a sub-optimal (regarding efficiency) but reliable 
(regarding the risk of skipping a barrier where it is needed) approach 
could be to enter a critical section (disabling interrupts) anyway, even 
where it isn't strictly needed.


int uart_task(void) {
   ENTER_CRITICAL_SECTION();
   int c = -1;
   if (rxfifo.out != rxfifo.in) {
     c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
     rxfifo.out++;
   }
   EXIT_CRITICAL_SECTION();
   return -1;
}
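
(ENTER_CRITICAL_SECTION()/EXIT_CRITICAL_SECTION() are also left undefined; 
one possible sketch for a Cortex-M target, using the CMSIS PRIMASK 
intrinsics, is:)

/* Save the current interrupt mask, then disable interrupts; the saved
 * state is restored on exit, so this also works if interrupts were
 * already disabled by the caller. */
#define ENTER_CRITICAL_SECTION()  uint32_t primask_save = __get_PRIMASK(); __disable_irq()
#define EXIT_CRITICAL_SECTION()   __set_PRIMASK(primask_save)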


Another solution could be to apply the volatile keyword to rxfifo.in *AND* 
rxfifo.buf, so the compiler can't reorder the accesses to them.
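
In code that would look like this (only the members touched by the ISR need 
the qualifier; two volatile accesses can't be reordered with respect to each 
other, so the read of buf can no longer be hoisted above the read of in):

static struct {
   volatile unsigned char buf[RXBUF_SIZE];
   volatile uint8_t in;   /* written by the ISR, read by the mainloop */
   uint8_t out;           /* touched only by the mainloop */
} rxfifo;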

Do you have other suggestions?



[1] https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword
[2] https://blog.regehr.org/archives/28
On 23/10/21 9:07 am, pozz wrote:
> [full quote of the original post snipped]
This is a good introduction to how Linux makes this possible for its horde of device-driver authors:
<https://www.kernel.org/doc/Documentation/memory-barriers.txt>

Clifford Heath
On 10/22/2021 3:07 PM, pozz wrote:
> static struct {
>    unsigned char buf[RXBUF_SIZE];
>    uint8_t in;
>    uint8_t out;
> } rxfifo;
>
> /* ISR */
> void uart_rx_isr(void) {
>    unsigned char c = UART->DATA;
>    rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
>    rxfifo.in++;
>    // Reset interrupt flag
> }
>
> /* Called regularly from mainloop code */
> int uart_task(void) {
>    int c = -1;
Why? And why a retval from uart_task -- if it is always "-1"?
>    if (rxfifo.out != rxfifo.in) {
>      c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
>      rxfifo.out++;
>    }
>    return -1;
> }
This is a bug(s) waiting to happen.

How is RXBUF_SIZE defined? How does it reflect the data rate (and, thus, interrupt rate) as well as the maximum latency between "main loop" accesses? I.e., what happens when the buffer is *full* -- and, thus, appears EMPTY? What stops the "in" member from growing to the maximum size of a uint8 -- and then wrapping? How do you convey this to the upper level code ("Hey, we just lost a whole RXBUF_SIZE of characters so if the character stream doesn't make sense, that might be a cause...")? What if RXBUF_SIZE is relatively prime wrt uint8max?

When writing UART handlers, I fetch the received datum along with the uart's flags and stuff *both* of those things in the FIFO. If the FIFO would be full, I, instead, modify the flags of the preceding datum to reflect this fact ("Some number of characters have been lost AFTER this one...") and discard the current character.

I then signal an event and let a task waiting for that specific event wake up and retrieve the contents of the FIFO (which may include more than one character at that time, as characters can arrive after the initial event has been signaled). This lets me move the line discipline out of the ISR and still keep the system "responsive".

Figure out everything that you need to do before you start sorting out how the compiler can "shaft" you...
On 23/10/2021 00:07, pozz wrote:
> Even though I have been writing software for embedded systems for more than
> 10 years, there's a topic that from time to time makes me think for hours
> and leaves me with many doubts.
It's nice to see a thread like this here - the group needs such discussions!
> [...]
>
> static struct {
>    unsigned char buf[RXBUF_SIZE];
>    uint8_t in;
>    uint8_t out;
> } rxfifo;
>
> /* ISR */
> void uart_rx_isr(void) {
>    unsigned char c = UART->DATA;
>    rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
>    rxfifo.in++;
Unless you are sure that RXBUF_SIZE is a power of two, this is going to be quite slow on an AVR. Modulo means division, and while division by a constant is usually optimised to a multiplication by the compiler, you still have a multiply, a shift, and some compensation for it all being done as signed integer arithmetic.

It's also wrong, for non-power-of-two sizes, since the wrapping of your increment and your modulo RXBUF_SIZE get out of sync.

The usual choice is to track "head" and "tail", and use something like:

void uart_rx_isr(void) {
    unsigned char c = UART->DATA;
    // Reset interrupt flag
    uint8_t next = rxfifo.tail;
    rxfifo.buf[next] = c;
    next++;
    if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
    rxfifo.tail = next;
}
>    // Reset interrupt flag
> }
>
> /* Called regularly from mainloop code */
> int uart_task(void) {
>    int c = -1;
>    if (rxfifo.out != rxfifo.in) {
>      c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
>      rxfifo.out++;
>    }
>    return -1;
> }
int uart_task(void) {
    int c = -1;
    uint8_t next = rxfifo.head;
    if (next != rxfifo.tail) {
        c = rxfifo.buf[next];
        next++;
        if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
        rxfifo.head = next;
    }
    return c;
}

These don't track buffer overflow at all - you need to call uart_task() often enough to avoid that. (I'm skipping volatiles so we don't get ahead of your point.)
> From a 20-year-old article[1] by Nigel Jones, this seems a situation
> where volatile must be used for rxfifo.in, because it is modified by an ISR
> and used in the mainloop code.
Certainly whenever data is shared between ISRs and mainloop code, or different threads, you need to think about how to make sure data is synchronised and exchanged safely. "volatile" is one method, atomics are another, and memory barriers can be used.
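
(As a side note, not from the original posts: on a C11 toolchain with <stdatomic.h>, the atomics route could look roughly like the sketch below. The index written by the ISR becomes an atomic object, with a release store in the ISR and an acquire load in the mainloop, which also pins down the ordering of the buffer accesses. RXBUF_SIZE and the UART register are reused from the original code.)

#include <stdatomic.h>
#include <stdint.h>

static struct {
    unsigned char buf[RXBUF_SIZE];
    atomic_uchar in;          /* written by the ISR, read by the mainloop */
    uint8_t out;              /* touched only by the mainloop */
} rxfifo;

void uart_rx_isr(void) {
    unsigned char c = UART->DATA;              /* UART as in the original post */
    uint8_t in = atomic_load_explicit(&rxfifo.in, memory_order_relaxed);
    rxfifo.buf[in % RXBUF_SIZE] = c;
    /* Release: the buffer write above cannot be moved after this store. */
    atomic_store_explicit(&rxfifo.in, (unsigned char)(in + 1), memory_order_release);
    // Reset interrupt flag
}

int uart_task(void) {
    /* Acquire: the buffer read below cannot be moved before this load. */
    uint8_t in = atomic_load_explicit(&rxfifo.in, memory_order_acquire);
    int c = -1;
    if (rxfifo.out != in) {
        c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
        rxfifo.out++;
    }
    return c;
}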
> I don't think so: rxfifo.in is read from memory only once in
> uart_task(), so there is no risk of the compiler optimizing it badly.
That is incorrect in two ways.

One - barring compiler bugs (which do occur, but they are very rare compared to user bugs), there is no such thing as "optimising badly". If optimising changes the behaviour of the code, other than its size and speed, the code is wrong.

Two - it is a very bad idea to imagine that having code inside a function somehow "protects" it from re-ordering or other optimisation. Functions can be inlined, outlined, cloned, and shuffled about. Link-time optimisation, code in headers, C++ modules, and other cross-unit optimisations are becoming more and more common. So while it might be true /today/ that the compiler has no alternative but to read rxfifo.in once per call to uart_task(), you cannot assume that will be the case with later compilers or with more advanced optimisation techniques enabled. It is safer, more portable, and more future-proof to avoid such assumptions.
> Even if the ISR fires immediately after the if statement, this doesn't
> lead to a dangerous state: the just-received data will simply be processed
> at the next call to uart_task().
>
> So IMHO volatile isn't necessary here. And critical sections (i.e.
> disabling interrupts) aren't needed either.
>
> However I'm thinking about memory barriers. Suppose the compiler reorders
> the instructions in uart_task() as follows:
>
>    c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
>    if (rxfifo.out != rxfifo.in) {
>      rxfifo.out++;
>      return c;
>    } else {
>      return -1;
>    }
>
> Here there's a big problem, because the compiler decided to read
> rxfifo.buf[] first and only then test in and out for equality. If the ISR
> fires immediately after the data has been moved to c (most probably an
> internal register), the condition in the if statement will be true and the
> register value is returned. However the register value isn't correct.
You are absolutely correct.
> I don't think any modern C compiler reorders uart_task() in this way, but
> we can't be sure. As far as the compiler can tell, the observable result
> doesn't change, so it is allowed to do this kind of thing.
It is not an unreasonable re-arrangement. On processors with out-of-order execution (which does not apply to the AVR or Cortex-M), compilers will often push loads as early as they can in the instruction stream so that they start the cache loading process as quickly as possible. (But note that on such "big" processors, much of this discussion on volatile and memory barriers is not sufficient, especially if there is more than one core. You need atomics and fences, but that's a story for another day.)
> How can I fix this issue if I want to be extremely sure the compiler will
> not reorder this way? Applying volatile to rxfifo.in alone shouldn't help,
> because the compiler is still allowed to reorder accesses to non-volatile
> variables around volatile accesses[2].
The important thing about "volatile" is that it is /accesses/ that are volatile, not objects. A volatile object is nothing more than an object for which all accesses are volatile by default. But you can use volatile accesses on non-volatile objects. This macro is your friend:

#define volatileAccess(v) *((volatile typeof((v)) *) &(v))

(Linux has the same macro, called ACCESS_ONCE. It uses a gcc extension - if you are using other compilers then you can make an uglier equivalent using _Generic. However, if you are using a C compiler that supports C11, it is probably gcc or clang, and you can use the "typeof" extension.)

That macro will let you make a volatile read or write to an object without requiring that /all/ accesses to it are volatile.
> One solution is adding a memory barrier in this way:
>
> int uart_task(void) {
>    int c = -1;
>    if (rxfifo.out != rxfifo.in) {
>      memory_barrier();
>      c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
>      rxfifo.out++;
>    }
>    return -1;
> }
Note that you are forcing the compiler to read "out" twice here, as it can't keep the value of "out" in a register across the memory barrier. (And as I mentioned before, the compiler might be able to do larger scale optimisation across compilation units or functions, and in that way keep values across multiple calls to uart_task.)
> However this approach appears dangerous to me. You have to check and
> double-check if, when and where memory barriers are necessary, and it's
> easy to skip a barrier where it's needed or to add a barrier where it
> isn't needed.
Memory barriers are certainly useful, but they are a shotgun approach - they affect /everything/ involving reads and writes to memory. (But remember they don't affect ordering of calculations.)
> So I'm thinking that a sub-optimal (regarding efficiency) but reliable
> (regarding the risk of skipping a barrier where it is needed) approach
> could be to enter a critical section (disabling interrupts) anyway, even
> where it isn't strictly needed.
>
> int uart_task(void) {
>    ENTER_CRITICAL_SECTION();
>    int c = -1;
>    if (rxfifo.out != rxfifo.in) {
>      c = rxfifo.buf[rxfifo.out % RXBUF_SIZE];
>      rxfifo.out++;
>    }
>    EXIT_CRITICAL_SECTION();
>    return -1;
> }
Critical sections for something like this are /way/ overkill. And a critical section with a division in the middle? Not a good idea.
> Another solution could be to apply the volatile keyword to rxfifo.in *AND*
> rxfifo.buf, so the compiler can't reorder the accesses to them.
>
> Do you have other suggestions?
Marking "in" and "buf" as volatile is /far/ better than using a critical section, and likely to be more efficient than a memory barrier. You can also use volatileAccess rather than making buf volatile, and it is often slightly more efficient to cache volatile variables in a local variable while working with them.
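
For instance, a sketch of what that could look like (using the volatileAccess macro above, and returning the popped character rather than the original post's constant -1):

int uart_task(void) {
    int c = -1;
    /* One volatile read of the index shared with the ISR, cached locally. */
    uint8_t in = volatileAccess(rxfifo.in);
    if (rxfifo.out != in) {
        /* Also a volatile access, so it cannot be hoisted above the read of
         * rxfifo.in: volatile accesses keep their order relative to each other. */
        c = volatileAccess(rxfifo.buf[rxfifo.out % RXBUF_SIZE]);
        rxfifo.out++;
    }
    return c;
}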
> > > [1] https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword > [2] https://blog.regehr.org/archives/28
On 23/10/2021 07:09, Don Y wrote:
> [...]
>
>> /* Called regularly from mainloop code */
>> int uart_task(void) {
>>    int c = -1;
>
> Why?  And why a retval from uart_task -- if it is always "-1"?
It was my mistake: the last instruction of uart_task() should be "return c;". And maybe the name uart_task() is not so good; it should be uart_rx().
> [...]
>
> This is a bug(s) waiting to happen.
>
> How is RXBUF_SIZE defined?
Power of two.
> How does it reflect the data rate (and, > thus, interrupt rate) as well as the maximum latency between "main > loop" accesses?
The rx FIFO filled by the interrupt is needed to face a burst (a packet?) of incoming characters.

If the baudrate is 9600 bps 8n1, an interrupt fires roughly every 10/9600 ≈ 1 ms. If the maximum interval between two successive uart_task() calls is 10 ms, a buffer of 10 bytes is sufficient, so RXBUF_SIZE could be 16 or 32.
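
Written as compile-time arithmetic, that sizing would be something like this (just a sketch; the macro names are invented for illustration):

#define BAUDRATE            9600u
#define BITS_PER_CHAR       10u    /* 8n1: 1 start + 8 data + 1 stop bit */
#define MAX_POLL_PERIOD_MS  10u    /* worst-case gap between uart_task() calls */

/* Characters that can arrive between two polls, rounded up: here 10. */
#define MIN_RXBUF_CHARS \
    ((BAUDRATE * MAX_POLL_PERIOD_MS + BITS_PER_CHAR * 1000u - 1u) / (BITS_PER_CHAR * 1000u))

#define RXBUF_SIZE 16u             /* next power of two >= MIN_RXBUF_CHARS */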
> I.e., what happens when the buffer is *full* -- and, > thus, appears EMPTY?
These are good questions, but I didn't want to discuss them here. Of course the ISR is not complete; before pushing a new byte, we must check whether the FIFO is full. For example:

/* The difference in-out is the number of bytes currently in the FIFO and
 * stays correct even after a wrap-around of in, thanks to the cast back
 * to uint8_t (unsigned arithmetic). */
#define RXFIFO_IS_FULL() ((uint8_t)(rxfifo.in - rxfifo.out) >= RXBUF_SIZE)

void uart_rx_isr(void) {
   unsigned char c = UART->DATA;
   if (!RXFIFO_IS_FULL()) {
      rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
      rxfifo.in++;
   } else {
      // FIFO is full, ignore the char
   }
   // Reset interrupt flag
}
> What stops the "in" member from growing to the > maximum size of a uint8 -- and then wrapping?
As I wrote, this should work even after a wrap-around.
> How do you convey this > to the upper level code ("Hey, we just lost a whole RXBUF_SIZE of > characters so if the character stream doesn't make sense, that might > be a cause...")?
A FIFO-full event is extremely rare if I'm able to size the rx FIFO correctly, i.e. for the worst case. Anyway, I usually ignore incoming chars when the FIFO is full. The higher-level protocols are usually defined in such a way that missing chars are detected, mostly thanks to CRC.
> What if RXBUF_SIZE is relatively prime wrt uint8max? > > When writing UART handlers, I fetch the received datum along with > the uart's flags and stuff *both* of those things in the FIFO. > If the FIFO would be full, I, instead, modify the flags of the > preceeding datum to reflect this fact ("Some number of characters > have been lost AFTER this one...") and discard the current character. > > I then signal an event and let a task waiting for that specific event > wake up and retrieve the contents of the FIFO (which may include more > than one character, at that time as characters can arrive after the > initial event has been signaled).
Signal an event? Task waiting for a specific event? Maybe you are thinking of a full RTOS. I was thinking of bare metal systems.
> This lets me move the line discipline out of the ISR and still keep > the system "responsive". > > Figure out everything that you need to do before you start sorting out > how the compiler can "shaft" you...
On 23/10/2021 18:09, David Brown wrote:
> [...]
>
>> /* ISR */
>> void uart_rx_isr(void) {
>>    unsigned char c = UART->DATA;
>>    rxfifo.buf[rxfifo.in % RXBUF_SIZE] = c;
>>    rxfifo.in++;
>
> Unless you are sure that RXBUF_SIZE is a power of two, this is going to
> be quite slow on an AVR. Modulo means division, and while division by a
> constant is usually optimised to a multiplication by the compiler, you
> still have a multiply, a shift, and some compensation for it all being
> done as signed integer arithmetic.
>
> It's also wrong, for non-power-of-two sizes, since the wrapping of your
> increment and your modulo RXBUF_SIZE get out of sync.
Yes, RXBUF_SIZE is a power of two.
> The usual choice is to track "head" and "tail", and use something like:
>
> void uart_rx_isr(void) {
>     unsigned char c = UART->DATA;
>     // Reset interrupt flag
>     uint8_t next = rxfifo.tail;
>     rxfifo.buf[next] = c;
>     next++;
>     if (next >= RXBUF_SIZE) next -= RXBUF_SIZE;
>     rxfifo.tail = next;
> }
This isn't the point of this thread, anyway... You insist that tail is always in the range [0...RXBUF_SIZE - 1]. My approach is different.

RXBUF_SIZE is a power of two, usually <= 256. head and tail are uint8_t and *can* reach the maximum value of 255, even if RXBUF_SIZE is 128. It all works.

Suppose rxfifo.in = rxfifo.out = 127, so the FIFO is empty. When a new char is received, it is saved into rxfifo.buf[127 % 128 = 127] and rxfifo.in is increased to 128. Now the mainloop detects the new char (in != out), reads it at rxfifo.buf[127 % 128 = 127] and increases out, which becomes 128.

The next byte will be saved into rxfifo.buf[rxfifo.in % 128 = 128 % 128 = 0] and rxfifo.in will be 129. Again, the next byte will be saved to rxfifo.buf[rxfifo.in % 128 = 129 % 128 = 1] and rxfifo.in will be 130.

When the mainloop tries to pop data from the FIFO, it tests rxfifo.in (130) != rxfifo.out (128). The test is true, so the code extracts chars from rxfifo.buf[out % 128], that is rxfifo.buf[0]... and so on.

I hope that explanation is clear.
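
To make the scheme concrete, these are the helpers it relies on (a sketch; it assumes RXBUF_SIZE is a power of two no larger than 128, so that a full buffer is still distinguishable from an empty one with 8-bit indices):

/* Free-running 8-bit indices: the cast makes the difference wrap correctly
 * even after in or out overflows past 255. */
#define RXFIFO_COUNT()    ((uint8_t)(rxfifo.in - rxfifo.out))
#define RXFIFO_IS_EMPTY() (RXFIFO_COUNT() == 0u)
#define RXFIFO_IS_FULL()  (RXFIFO_COUNT() >= RXBUF_SIZE)
#define RXFIFO_INDEX(i)   ((uint8_t)(i) % RXBUF_SIZE)  /* cheap: power-of-two modulo */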
> [...]
>
> These don't track buffer overflow at all - you need to call uart_task()
> often enough to avoid that.
Sure, with a good number for RXBUF_SIZE, buffer overflow shouldn't happen ever. Anyway, if it happens, the higher level layers (protocol) should detect a corrupted packet.
> [...]
>
>> I don't think so: rxfifo.in is read from memory only once in
>> uart_task(), so there is no risk of the compiler optimizing it badly.
>
> That is incorrect in two ways. One - barring compiler bugs (which do
> occur, but they are very rare compared to user bugs), there is no such
> thing as "optimising badly". If optimising changes the behaviour of the
> code, other than its size and speed, the code is wrong.
Yes, of course, but I don't think the absence of volatile for rxfifo.in, even if it can change in the ISR, could be a *real* problem with *modern and current* compilers.

The volatile attribute is needed to prevent compiler optimizations (which would be a bad thing, because of the volatile nature of the variable), but in that code it's difficult to think of an optimization, caused by the absence of volatile, that changes the behaviour erroneously... except reordering.
> Two - it is a > very bad idea to imagine that having code inside a function somehow > "protects" it from re-ordering or other optimisation.
I didn't say that; on the contrary, I was thinking exactly of reordering issues.
> Functions can be inlined, outlined, cloned, and shuffled about. > Link-time optimisation, code in headers, C++ modules, and other > cross-unit optimisations are becoming more and more common. So while it > might be true /today/ that the compiler has no alternative but to read > rxfifo.in once per call to uart_task(), you cannot assume that will be > the case with later compilers or with more advanced optimisation > techniques enabled. It is safer, more portable, and more future-proof > to avoid such assumptions.
OK, you are talking about future scenarios. I don't think this is actually a real problem today, but your observation makes sense.
> [...]
>
> The important thing about "volatile" is that it is /accesses/ that are
> volatile, not objects. A volatile object is nothing more than an object
> for which all accesses are volatile by default. But you can use
> volatile accesses on non-volatile objects.
>
> [...]
>
> That macro will let you make a volatile read or write to an object
> without requiring that /all/ accesses to it are volatile.
This is a good point. The code in the ISR can't be interrupted, so there's no need for volatile accesses inside the ISR.
> [...]
>
> Note that you are forcing the compiler to read "out" twice here, as it
> can't keep the value of "out" in a register across the memory barrier.
Yes, you're right. A small penalty to avoid the problem of reordering.
> [...]
>
>> Another solution could be to apply the volatile keyword to rxfifo.in *AND*
>> rxfifo.buf, so the compiler can't reorder the accesses to them.
>>
>> Do you have other suggestions?
>
> Marking "in" and "buf" as volatile is /far/ better than using a critical
> section, and likely to be more efficient than a memory barrier. You can
> also use volatileAccess rather than making buf volatile, and it is often
> slightly more efficient to cache volatile variables in a local variable
> while working with them.
Yes, I think so too. Lately I have read many experts saying volatile is often a bad thing, so I'm re-thinking its use compared with other approaches.
>> [1] https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword >> [2] https://blog.regehr.org/archives/28 >
On 10/23/2021 1:12 PM, pozz wrote:
>> This is a bug(s) waiting to happen. >> >> How is RXBUF_SIZE defined? > > Power of two.
The point was its relationship to the actual code.
>> How does it reflect the data rate (and, >> thus, interrupt rate) as well as the maximum latency between "main >> loop" accesses? > > Rx FIFO filled by interrupt is needed to face a burst (a packet?) of incoming > characters. > > If the baudrate is 9600bps 8n1, interrupt would be fired every 10/9600=1ms. If > maximum interval between two successive uart_task() calls is 10ms, it is > sufficient a buffer of 10 bytes, so RXBUF_SIZE could be 16 or 32.
What GUARANTEES this in your system? Folks often see things that "can't happen" -- yet DID (THEY SAW IT!). Your code/design should ensure that "can't happen" REALLY /can't happen/.

It costs very little to explain (commentary) WHY you don't have to check for X, Y or Z in your code. [If the user's actions (or any outside agency) can affect operation, then how can you guarantee that THEY "behave"?]

And, give that a very high degree of visibility so that when someone decides they can increase the baudrate or add some sluggish task to your "main loop", this ASSUMPTION isn't silently violated.
>> I.e., what happens when the buffer is *full* -- and, thus, appears EMPTY?
>
> These are good questions, but I didn't want to discuss them here. Of course
> the ISR is not complete; before pushing a new byte, we must check whether
> the FIFO is full. For example:
My point is that you should flesh out your code before you start thinking about what can go wrong.

E.g., if the ISR is the *only* entity to modify ".in" and always does so with interrupts off, then it can do so without worrying about conflict with something else -- provided those other things always ensure they read it atomically (if they read it just before or just after it has been modified by the ISR, the value will still "work" -- they just may not realize, yet, that there is an extra character in the buffer that they haven't yet seen).

Likewise, if the "task" is the only entity modifying ".out", then ensuring that those modifications are atomic means the ISR can safely use any *single* reference to it.
>> How do you convey this to the upper level code ("Hey, we just lost a whole
>> RXBUF_SIZE of characters so if the character stream doesn't make sense,
>> that might be a cause...")?
>
> A FIFO-full event is extremely rare if I'm able to size the rx FIFO
> correctly, i.e. for the worst case.

"Rare" and "impossible" are two entirely different scenarios. It is extremely rare for a specific individual to win the lottery. But, any individual *can* win it!
> Anyway, I usually ignore incoming chars when the FIFO is full. The higher-level
> protocols are usually defined in such a way that missing chars are detected,
> mostly thanks to CRC.
What if the CRC characters disappear? Are you sure the front of one message can't appear to match the ass end of another? "Pozz is here." "Don is not here." "Pozz is not here." And that there is no value in knowing that one or more messages may have been dropped?
>> What if RXBUF_SIZE is relatively prime wrt uint8max? >> >> When writing UART handlers, I fetch the received datum along with >> the uart's flags and stuff *both* of those things in the FIFO. >> If the FIFO would be full, I, instead, modify the flags of the >> preceeding datum to reflect this fact ("Some number of characters >> have been lost AFTER this one...") and discard the current character. >> >> I then signal an event and let a task waiting for that specific event >> wake up and retrieve the contents of the FIFO (which may include more >> than one character, at that time as characters can arrive after the >> initial event has been signaled). > > Signal an event? Task waiting for a specific event? Maybe you are thinking of a > full RTOS. I was thinking of bare metal systems.
You can implement as much or as little of an OS as you choose; you're not stuck with "all or nothing".
On 10/23/2021 12:07 AM, pozz wrote:
> [full quote of the original post snipped]
Disable interrupts while accessing the FIFO. You really have to. Alternatively, you'll often get away without using a FIFO at all, unless you're blocking for a long while in some part of the code.
On 24/10/2021 13:02, David Brown wrote:
> [...]
>
>> Yes, RXBUF_SIZE is a power of two.
>
> If your code relies on that, make sure the code will fail to compile if
> it is not the case. Documentation is good, compile-time check is better:
>
> static_assert((RXBUF_SIZE & (RXBUF_SIZE - 1)) == 0, "Needs power of 2");
Good point.
> [...]
>
> Yes, your approach will work - /if/ you have a power-of-two buffer size.
> It has no noticeable efficiency advantages, merely an extra
> inconvenient restriction and the possible confusion caused by doing
> things in a different way from common idioms.
>
> However, this is not the point of the thread - so I am happy to leave
> that for now.
If you want, we can start a small [OT].

<OT>
I know my ring-buffer implementation has the restriction of requiring a buffer with a power-of-two size. However I like it, because I can avoid introducing a new variable (the actual number of elements in the buffer) or wasting an element to resolve the ambiguity between a full and an empty buffer.
</OT>
> [...]
>
>> Yes of course, but I don't think the absence of volatile for rxfifo.in,
>> even if it can change in the ISR, could be a *real* problem with *modern
>> and current* compilers.
>
> Personally, I am not satisfied with "it's unlikely to be a problem in
> practice" - I prefer "The language guarantees it is not a problem".
> Remember, when you know the data needs to be read at this point, then
> using a volatile read is free. Volatile does not make code less
> efficient unless you use it incorrectly and force more accesses than are
> necessary. So using volatile accesses for "rxfifo.in" here turns
> "probably safe" into "certainly safe" without cost. What's not to like?
>
> [...]
>
>> This is a good point. The code in the ISR can't be interrupted, so there's
>> no need for volatile accesses inside the ISR.
>
> Correct. (Well, /almost/ correct - bigger microcontrollers have
> multiple interrupt priorities. But it should be correct in this case,
> as no other interrupt would be messing with the same variables anyway.)
Yes, static variables defined in a uart driver aren't accessed from other interrupts.
> [...]
>
>> Yes, I think so too. Lately I have read many experts saying volatile is
>> often a bad thing, so I'm re-thinking its use compared with other approaches.
>
> People who say "volatile is a bad thing" are often wrong. Remember, all
> generalisations are false :-)
Ok, I wrote "volatile is **often** a bad thing".
> "volatile" is a tool. It doesn't do everything that some people think
> it does, but it is a very useful tool nonetheless. It has little place
> in big systems - Linus Torvalds wrote a rant against it as being both
> too much and too little, and in the context of writing Linux code, he
> was correct. For Linux programming, you should be using OS-specific
> features (which rely on "volatile" for their implementation) or atomics,
> rather than using "volatile" directly.
>
> But for small-systems embedded programming, it is very handy. Used
> well, it is free - used excessively it has a cost, but an extra volatile
> will not make an otherwise correct program fail.
>
> Memory barriers are great for utility functions such as interrupt
> enable/disable inline functions, but are usually sub-optimal compared to
> specific and targeted volatile accesses.
Just to say what I read:

https://blog.regehr.org/archives/28
https://mcuoneclipse.com/2021/10/12/spilling-the-beans-volatile-qualifier/
On 24/10/2021 17:39, pozz wrote:
> On 24/10/2021 13:02, David Brown wrote:
<snipping to save on electrons>
>> People who say "volatile is a bad thing" are often wrong. Remember, all
>> generalisations are false :-)
>
> Ok, I wrote "volatile is **often** a bad thing".
:-)
> [...]
>
> Just to say what I read:
>
> https://blog.regehr.org/archives/28
Yes, you had that in your footnotes - and that is not a bad article. (It's better than a lot of that guy's stuff - he also writes a lot of crap about undefined behaviour and the evils of optimisation.)
> https://mcuoneclipse.com/2021/10/12/spilling-the-beans-volatile-qualifier/
That article is mostly wrong - or at best, inappropriate for what you are doing. Critical sections are a massive over-kill for a ring buffer in most cases. If you intend to call your uart_task() from multiple places in a re-entrant manner (e.g., multiple RTOS threads), then you have a lot more challenges to deal with than just the ordering of the accesses to "in", "out" and "buf" - you can't just throw in a critical section and hope it's all okay. And if you are /not/ doing such a daft thing, then critical sections are completely unnecessary.