On 19/05/2020 07:34, Kent Dickey wrote:
> In article <IdweG.1169345$Gh7.768707@fx45.iad>,
> Clifford Heath <no.spam@please.net> wrote:
>> On 24/3/20 12:21 am, David Brown wrote:
>>> Instructions like "dmb" force an order on the cpu operations, not the
>>> compiler - while "volatile" enforces a partial order on the compiler,
>>> but not the hardware.
>>>
>>> The standard solution would be asm("dmb" ::: "memory"), where the memory
>>> clobber forces an ordering on the compiler - and thus you can (usually)
>>> omit the "volatile".
>> David, if you have time, many of us would appreciate a quick summary of
>> the situations where the different barriers should be used.
>>
>> "dmb" can be a full read/write barrier, or just a write barrier. When to
>> use each?
>>
>> "dsb" is different from "dmb" because it limits instruction ordering,
>> but when is that useful? When to use it as a write-only barrier?
>>
>> "isb" is intended to precede a context switch, i.e. task switching in an
>> RTOS. Is it sufficient for that, and is that the only time to use it?
>>
>> If you know a good article that gives practical guidelines, post a link,
>> otherwise I'd really like to hear your thoughts.
>>
>> Clifford Heath.
>
> Barriers are very difficult to get right for a programmer. In my opinion,
> it is also a CPU architectural mistake to need barriers for user-level code.
> (It's OK for system software to need some barriers like ISB). Many people
> disagree with me on this point, and we don't need to argue it here.
>
I agree with the principle, and it can usually be achieved in practice
too, but it can come at a cost. For most embedded systems, the way to avoid
needing barrier instructions is to set up memory areas with different
characteristics such as cacheable, bufferable, etc. Typically memory
mapped peripherals will be in an area where all accesses are strictly
ordered and uncacheable, and then no barrier instructions are needed.
For small microcontroller cores, this has no cost since you don't have
caches or write buffers anyway, but on bigger processors it can be
significant when you have larger blocks of data to transfer. This can
be a measurable hit on things like Ethernet performance or data in DMA
buffers.
The most important thing is always that the code should be correct. It
is better to be slower and correct than faster and incorrect!
Thus you usually have such memory setups to cover the normal cases,
and put any required cache or barrier instructions in system code. If
you are going to need some cache flush and data ordering instruction
before starting a DMA transfer, then those should be in the
"start_dma_transfer" function - written by a programmer who /does/ know
how these things work.
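As a sketch of that idea - with a hypothetical DMA register layout and a stubbed cache-clean helper, since the real ones are chip-specific (on a Cortex-M7 you would use CMSIS's SCB_CleanDCache_by_Addr() and __DMB()):

```c
#include <stdint.h>
#include <stddef.h>

/* On non-ARM hosts "dmb" degenerates to a compiler barrier so the
   sketch compiles anywhere; on ARM it is the real instruction. */
#if defined(__arm__) || defined(__aarch64__)
#define dmb() __asm__ volatile("dmb sy" ::: "memory")
#else
#define dmb() __asm__ volatile("" ::: "memory")
#endif

/* Platform-specific data cache clean would go here (stubbed). */
static void dcache_clean(const void *addr, size_t len)
{
    (void)addr; (void)len;
}

/* Hypothetical memory-mapped DMA controller registers. */
static volatile uint32_t dma_src, dma_len, dma_ctrl;

void start_dma_transfer(const void *buf, size_t len)
{
    dcache_clean(buf, len);  /* push the buffer from cache to memory */
    dmb();                   /* writes must land before the kick     */
    dma_src  = (uint32_t)(uintptr_t)buf;
    dma_len  = (uint32_t)len;
    dma_ctrl = 1;            /* start the transfer */
}
```

The caller just calls start_dma_transfer() and never sees the barrier.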
Another kind of barrier is the compiler memory barrier. Again, it can
be hard for users to get these right - and they should be put in system
code for things like interrupt disable functions so that users don't
have to worry about them.
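For example, an interrupt lock for a single-core microcontroller might look like this (a sketch - on a real Cortex-M you would normally use __disable_irq()/__enable_irq() from CMSIS; the key point is the "memory" clobber acting as the compiler barrier):

```c
/* Interrupt-disable critical section with built-in compiler barriers.
   On non-ARM hosts the asm reduces to a pure compiler barrier so the
   sketch still compiles. */
static inline void irq_lock(void)
{
#if defined(__arm__)
    __asm__ volatile("cpsid i" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

static inline void irq_unlock(void)
{
#if defined(__arm__)
    __asm__ volatile("cpsie i" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}
```

The "memory" clobber stops the compiler hoisting or sinking accesses to shared data across the lock and unlock, so the user code in between needs no volatiles or barriers of its own.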
> The basic problem is there's a mismatch between what the programmer is
> thinking about, and what the compiler/CPU are requiring. This is one reason
> why multithreading is more difficult than it should be.
>
Agreed. C11 and C++11 can help a bit with atomics and fences, but
relatively few people understand these well. I am a fan of message
passing and queues as a way of inter-thread communication, as it is a
lot easier to understand and get right than using locks or critical
sections. It is also much easier to scale with SMP or AMP. You don't
need to worry about whether data is written to memory before the lock is
taken, or whether you want a compiler memory barrier, a processor
barrier instruction, volatile accesses - just put the message you want
on the queue and off it goes. (Just don't pass pointers to data on the
local stack...)
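To illustrate, here is a minimal single-producer/single-consumer queue using C11 atomics (a sketch with made-up names and sizes - a real RTOS queue would also handle blocking and wake-up):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define QSIZE 8u   /* must be a power of two */

typedef struct {
    uint32_t         buf[QSIZE];
    _Atomic unsigned head;   /* written only by the producer */
    _Atomic unsigned tail;   /* written only by the consumer */
} spsc_queue;

bool queue_put(spsc_queue *q, uint32_t msg)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                          /* full */
    q->buf[h % QSIZE] = msg;
    /* release: the message is visible before the new head is seen */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

bool queue_get(spsc_queue *q, uint32_t *msg)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t)
        return false;                          /* empty */
    *msg = q->buf[t % QSIZE];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

The release store on the index publishes the message, and the matching acquire load on the other side consumes it - the compiler and CPU barriers are all hidden inside the atomics.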
> ARM lets you choose whether the barrier is a read or write or both barrier.
(Write or read/write barrier - there is AFAIK no read barrier.)
> I suggest you always do "both", which ARM calls SYSTEM, and is the
> default if you just say "DMB" or "DSB". I suspect there's almost no
> performance difference, and it's one less thing you have to worry about.
For smaller microcontrollers, there will be no noticeable difference.
By the time you have external dynamic memory connected via a quad SPI
bus, the latency on reads can be much more dramatic. Writes can be
buffered further down the chain (such as in the QSPI or SDRAM
controller), but you don't want to wait for reads if you don't have to.
Still, it is always better to be safe than fast, and use "both" if you
are not sure.
> It makes sense for macros for Linux to try to be more aggressive, since they
> need to show off, and they care more about performance and have the time to
> test and debug their logic across a variety of systems. This stuff is easy to
> make a mistake on, and very hard to debug.
>
Agreed.
> For application programming, you should generally only need DMB for
> ordering of volatile accesses.
Generally you don't need that either. The volatile accesses will be
ordered by the compiler (as long as the programmer doesn't make the
mistake of thinking that volatile accesses also order with non-volatile
accesses). If the memory setup is done right, then when writing to
peripherals the cpu will enforce the order without the need for DMB. And
you don't need DMB for purely cpu-related actions, such as interaction
between interrupt routines or threads on the same processor (volatile
and compiler barriers are sufficient).
The point where you typically need DMB is for data in main memory that
is shared between bus masters - other processors, DMA controllers,
Ethernet controllers, and so on. Then you might need a DMB before
informing the other masters that the data is ready, and possibly cache
control instructions as well. (You need this sort of thing for reads as
well as writes.)
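On the read side it might look something like this (again a sketch with a hypothetical descriptor layout; the descriptor itself is assumed to sit in non-cacheable, strongly-ordered memory, with only the frame buffer cached):

```c
#include <stdint.h>

#if defined(__arm__) || defined(__aarch64__)
#define dmb() __asm__ volatile("dmb sy" ::: "memory")
#else
#define dmb() __asm__ volatile("" ::: "memory")  /* host stand-in */
#endif

/* Platform-specific data cache invalidate would go here (stubbed). */
static void dcache_invalidate(const void *addr, uint32_t len)
{
    (void)addr; (void)len;
}

/* Hypothetical Ethernet receive descriptor, filled in by the MAC,
   assumed to be in non-cacheable memory. */
typedef struct {
    volatile uint32_t status;      /* bit 0: frame complete */
    volatile uint32_t length;
    uint8_t           data[1536];  /* frame buffer (cacheable) */
} rx_desc;

/* Returns the frame length, or 0 if nothing has arrived yet. */
uint32_t rx_poll(rx_desc *d)
{
    if (!(d->status & 1))
        return 0;
    dmb();                                 /* status before length/data */
    dcache_invalidate(d->data, d->length); /* drop stale cached copy    */
    return d->length;
}
```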
>
> DSB is for ordering other traffic with data accesses--things like cache
> invalidates, icache fetches, or TLB shoot downs, etc.
Yes, and also changes to the MPU mappings are a common case.
> If you have anything
> like that which needs to be ordered, you want DSB. Again, ordering accesses
> to variables in normal memory don't need DSB, but you can use it if you want
> much lower performance (DSB is a super-set of DMB, so you can use DSB anyplace
> you would use DMB). Generally, user code doesn't need DSB unless you're
> doing self-modifying code.
Just say "no" to self-modifying code! Firmware updates are an
exception, of course. And your DSB is likely to be combined with data
cache flushes (to make sure the changes are written to memory),
instruction cache flushes (to make sure you don't have stale instructions
there)
and ISB.
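A typical sequence for the firmware-update case, sketched with CMSIS-style names (SCB_CleanDCache_by_Addr(), SCB_InvalidateICache(), __DSB(), __ISB() - stubbed here so the sketch builds on any host; on a real Cortex-M7 you would use the CMSIS versions):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#if !defined(__arm__)
/* Host stubs standing in for the CMSIS cache/barrier helpers. */
static void SCB_CleanDCache_by_Addr(uint32_t *addr, int32_t len)
{
    (void)addr; (void)len;
}
static void SCB_InvalidateICache(void) {}
#define __DSB() __asm__ volatile("" ::: "memory")
#define __ISB() __asm__ volatile("" ::: "memory")
#endif

/* Copy new code into place and make sure the CPU will actually
   fetch the new instructions rather than stale cached ones. */
void install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    SCB_CleanDCache_by_Addr(dst, (int32_t)len); /* write back to memory   */
    __DSB();                                    /* wait until it lands    */
    SCB_InvalidateICache();                     /* drop stale instructions */
    __DSB();
    __ISB();                                    /* refetch from here on   */
}
```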
>
> ISB is for ordering special system register accesses, or ordering data accesses
> with other CPU actions. If you want to read the TLB using an AT instruction,
> you must do an ISB before reading the PAR register. I cannot think of a
> user-level code sequence case that needs ISB off the top of my head now.
> ISB does nothing to order data operations, and it's not what you want.
> There's hidden magic in the combination: "DSB; ISB".
> This waits for most (but not quite all) previous bus traffic to complete
> before executing any new instructions, including instruction fetches, data
> fetches, etc. User code generally never needs to do this, but OS code
> sometimes does.
Sometimes this sort of thing can be recommended for entering "sleep"
modes - often in combination with chip errata on early versions of devices.
>
> In terms of "heaviness", DMB is the lightest--it will slow the pipeline
> a few clocks to get the data accesses right. DSB is actually the slowest--it
> generally causes a bus transaction, which is sent to other agents, and
> a response is sent back (CPU optimizations can avoid this traffic sometimes,
> especially if another DSB was recently done). You don't want to use DSB
> unless you really need it. And ISB is in the middle--also a few cycles, but
> likely a little more than DMB. And "DSB; ISB" basically brings the CPU
> to temporary halt--waits for (almost) every current fetch to finish, then
> restarts.
>
Note that the cost of these instructions varies significantly from
system to system. On an M0, all three barrier instructions will likely
be no more expensive than a NOP. On an M7 with cache and outstanding
transactions to external memory, they can cost a lot.
> To make things worse, the way to actually insert the barrier is another
> level of complexity, and which sadly seems to be compiler dependent.
>
That is partly true - ARM has made a reasonable attempt at headers that
can be used with a variety of compilers for at least some of this stuff.
But there are always complications when you are dealing with features
that simply cannot be described in languages like C.
> It's almost as if this whole area is a big giant mess.
>
Well, it's all a big compromise. You can design a processor system that
doesn't need barriers of any kind, but it won't scale for higher speeds
and certainly won't work with multiple processors. (And once you get to
multiple processors, you have another layer with the memory models - you
can have a programmer-friendly "strong" model like the x86's, or the far
simpler and more efficient "weak" models of most RISC processors, which
require more effort from the programmer.)