
atomic test_and_set on ADSP21020

Started by alb July 12, 2012
On 7/14/2012 12:07 AM, Niklas Holsti wrote:
>> In order to 'serve' multiple processes with the need to transmit data we
>> thought about implementing a 'request queue' (which can eventually be
>> even prioritized maybe according to the link budget you suggested in an
>> earlier thread).
>>
>> When the process put the request in the queue it also 'attempts' to
>> start the transmission but if the resource is locked it will not succeed.
>> At the beginning the lock is free and the first process which attempts
>> to start the transmission will lock it, while all the others which
>> follow will only manage to set their request in the queue.
>>
>> Given this logic, the interrupt service routine, once it completed
>> transmitting the first packet, should also check the queue and lock the
>> resource in case there is a pending packet.
>
> Aha, so the "lock" is not actually protecting access to some independent
> data structure, and making processes wait for access, but is more of a
> flag that shows whether or not a buffer is being transmitted. This
> should be an easier problem to solve.
Indeed, the 'lock' is to prevent writing to the serial port while another transmission is going on, since that would result in lost characters.
> typedef struct buffer_t {
>    data_t data[MAX_BUFFER_SIZE];
>    int length;
>    int data_index;
>    bool transmitted;
>    struct buffer_t *next;
> } buffer_t;
> // The data in the buffer are data[0 .. length - 1].
> // The "data_index" is the index of the next data[]
> // to transmit.
> // The "transmitted" flag is just what its name implies.
> // The "next" field points to the next buffer in the
> // request queue, or is NULL if this buffer is the
> // last one.
>
> volatile buffer_t * q_head = NULL;
> // Not NULL when and only when a buffer is being
> // transmitted, and then points to the first (i.e.
> // currently transmitting) buffer in the request queue.
> //
> // If a non-interrupt process reads q_head as NULL,
> // an output interrupt will NOT happen again, until
> // transmission is again started by some process.
> //
> // If a non-interrupt process reads q_head as not NULL,
> // an output interrupt WILL happen after that read of
> // q_head (and perhaps immediately after the read).
>
> volatile buffer_t * volatile q_tail = NULL;
why do you need the pointer to be volatile? Isn't it sufficient to have the pointed-to structure be volatile? AFAIK, the volatile qualifier on a structure makes each member of the structure volatile, but in this case you have a recursive structure and I'm not sure I fully understand how the volatile qualifier is interpreted here.
> // NULL when and only when no buffer has ever been
> // transmitted or put in the queue, and otherwise
> // points to the last buffer in the queue (when
> // transmission is going on) or to the last
> // buffer transmitted (otherwise).
[...]
>
> void transmit (buffer_t * volatile buf)
> // Starts transmitting the buf, or enqueues it for
> // later transmission. Assume that this can be
> // interrupted at any point by an interrupt from an
> // ongoing transmission, but cannot be preempted by
> // another non-interrupt process.
> {
>    // Assume that buf->data and buf->length are set
>    // by caller and that buf->length > 0.
>
>    // Prepare the buffer for enqueing or transmitting:
>    buf->transmitted = FALSE;
>    buf->next = NULL;
>    buf->data_index = 0;
>
>    // Enqueue the buffer, if there is something
>    // ahead of it in the queue:
>    if (q_head != NULL) {
>       // Transmission is going on (or was going on
>       // when we read q_head).
>       q_tail->next = buf;
>       // If transmission was still going on when
>       // the above assignment to q_tail was executed,
>       // the interrupt handler will eventually transmit
>       // this buf. Otherwise transmission stopped
>       // before reaching this buf.
This is a good hint that I missed in my implementation. Assigning q_tail->next allows the ISR to send this one as soon as the previous is complete. Since I did not have a 'next' pointer in my structure, I had to call the transmit() function from the ISR in order to pull data from the request queue. This approach seems more elegant than mine.
>       q_tail = buf;
>       // This update of q_tail is for the benefit of
>       // the next caller of transmit().
In a two-element request queue it seems to me that q_head->next should always be equal to q_tail. Correct?
>    }
>
>    // If transmission is now going on, either it is
>    // still transmitting some earlier buffer in the
>    // queue, or has already started on transmitting
>    // this buf. In either case, we have nothing more
>    // to do.
>
>    // If transmission has now stopped, either it had
>    // already stopped before we enqueued (or did not
>    // enqueue) this buf, or it has had time to transmit
>    // this buf, and then stop.
>
>    // Start transmission, if it stopped before
>    // transmitting this buf:
>    if (q_head == NULL) {
it seems to me this is your 'locking' variable. You do not allow any other transmission if the head of the queue is != NULL.
>       // No transmission going on now.
>       if (!(buf->transmitted)) {
>          // All of the earlier buffers in the queue
>          // have been transmitted, but not this buf.
>          // This buf is now the first in the queue.
>          q_head = buf;
>          q_tail = buf;
>          buf->data_index = 1;
>          // Start transmission:
>          write_to_output_port (buf->data[0]);
>       }
>    }
> }
>
> void output_isr (void)
> // Interrupt handler for the interrupt that reports
> // "transmission complete" of the last data_t written
> // to the output port.
> // That data_t was q_head->data[q_head->data_index - 1],
> // so the data_t at q_head->data_index is next to be
> // transmitted, unless the whole q_head has been
> // transmitted already.
> // Assume that this can interrupt any non-interrupt
> // process at any point, but then itself runs atomically
> // (i.e. cannot be interrupted by anything else that
> // uses the same variables).
> {
>    buffer_t * buf = q_head;
>    // This initialization may trigger a warning about
>    // qualifiers being discarded (because "q_head" is
>    // qualified as volatile, but "buf" is not), but
>    // this warning can be ignored. Using "buf" instead of
>    // q_head is an optimization (perhaps premature)
>    // to avoid unnecessary repeated reads of q_head
>    // and its fields. In principle, buf should be
>    // declared as "buffer_t * volatile buf", but this
>    // is unnecessary since this function will not be
>    // in-lined, and instruction reordering within this
>    // function does not matter, because the function is
>    // executed atomically with respect to the non-
>    // interrupt processes.
>
>    if (buf->data_index >= buf->length) {
>       // This buf was completely transmitted.
[pasted from your next post]
>       buf->transmitted = TRUE;
>       // Go on to the next buffer in the queue,
>       // if there is one:
>       buf = buf->next;
>       q_head = buf;
>    }
>
>    if (buf != NULL) {
>       // Some data are left to transmit from this
>       // buf (= q_head).
>       write_to_output_port (buf->data[buf->data_index]);
>       buf->data_index ++;
>    }
> }
>
> I have not tested the above code, only thought about it.
I appreciate your effort to sketch out the implementation. It indeed gave me a valuable hint on how to write the transmit() function. I think I still have to pin down a nasty bug in moving pointers around, since the program always seems to crash at the same point.
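In case it helps to spot the problem, here is roughly the calling side I have in mind - just a sketch with placeholder names, where each process owns one buffer_t and must not refill it before the ISR marks it transmitted (the initializer is C99 style):

#include <string.h>

static buffer_t my_buf = { .transmitted = TRUE };   // free at start

void send_message (const data_t * msg, int len)
{
   // Wait until the previous contents have left the queue.
   // The read must be a volatile access, since the ISR
   // writes the flag:
   while (!((volatile buffer_t *) &my_buf)->transmitted)
      ;

   memcpy (my_buf.data, msg, len * sizeof (data_t));
   my_buf.length = len;
   transmit (&my_buf);
}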
On 12-07-16 11:19 , alb wrote:
> On 7/14/2012 12:07 AM, Niklas Holsti wrote:
> [...]
>
>> volatile buffer_t * volatile q_tail = NULL;
>
> why do you need the pointer to be volatile?
My mistake (a remnant from an early version of the code). Well spotted. Since q_tail is accessed only in the non-interrupt processes, which do not preempt each other, it does not have to be volatile.
> Isn't it sufficient to have the pointed-to structure be volatile?
Yes, I agree.
> AFAIK, the volatile qualifier on a structure makes each
> member of the structure volatile,
That is my understanding too. But note that there is a paper by Eric Eide and John Regehr saying that compilers often (well, relatively often) have bugs in their treatment of volatiles (google for "volatiles are miscompiled"). I would definitely check the compiler-generated assembly code, at least once, to verify that my compiler is doing the right thing.
> but in this case you have a recursive structure
> and I'm not sure I fully understand how the volatile
> qualifier is interpreted here.
I assume that by recursion you mean that the buffer_t field "next" points to another buffer_t. Since q_head is defined as a pointer to a volatile object, accesses to the "next" field of the buffer at q_head are treated as volatile accesses. Since the "next" field is not defined as a pointer *to* a volatile object, accesses to fields of the buffer that q_head->next points to are not treated as volatile accesses, when the code uses the form q_head->next->Xxx. As I understand it.
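A small example of the distinction, as I read the standard (and remember that compilers may get this wrong, per the miscompilation paper mentioned above):

volatile buffer_t * p = q_head;

int a = p->length;        // volatile access: p points to volatile
buffer_t * n = p->next;   // volatile read of the "next" field
int b = n->length;        // NOT a volatile access: "next" points
                          // to a plain (non-volatile) buffer_t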
> In a two-element request queue it seems to me that q_head->next should
> always be equal to q_tail. Correct?
Yes, if you observe the state when neither output_isr() nor transmit() is running.
>> // Start transmission, if it stopped before
>> // transmitting this buf:
>> if (q_head == NULL) {
>
> it seems to me this is your 'locking' variable. You do not allow any
> other transmission if the head of the queue is != NULL.
Yes, q_head is certainly the main controlling variable. q_head and the field q_tail->next are the only variables that are accessed concurrently by the interrupt handler and the non-interrupt processes.

The code assumes (and I should have remarked on it) that reads and writes (loads and stores) of these variables are atomic, in the sense that when a read and a write are concurrent, the read returns either the old or the new value of the variable, and not some half-updated value. This assumption holds for the ADSP21020, where memory reads and writes are wide enough for a whole pointer, but it would not hold, for example, on a processor with 8-bit reads/writes but 16-bit or wider pointers.

(In C, unfortunately, there is no "atomic" qualifier similar to the "volatile" qualifier that could ensure at compile time that a given variable is atomically read and written. There is only the standard type sig_atomic_t, which is guaranteed to have atomic reads/writes, but this type may not be large enough to hold a pointer; it is only required to be 8 bits or wider.)

Note that while my code has concurrent (racing) reads and writes of the same variables (q_head and q_tail->next), it avoids concurrent writes, which could lose (overwrite) data. The transmit() function only writes to q_head when it knows that transmission is not going on, and it only writes to the "transmitted" field of a buffer before the buffer is enqueued. Therefore, output_isr() cannot try to write to those variables concurrently.
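As an aside, the brand-new C11 standard does add an _Atomic qualifier and atomic operations. Where a C11 compiler is available (rare so far, and certainly not on this target), the controlling pointer could be declared roughly like this - a sketch, not tested:

#include <stdatomic.h>

_Atomic(buffer_t *) q_head = NULL;   // loads and stores are atomic

// In transmit(), for example:
buffer_t * head = atomic_load (&q_head);
if (head == NULL) {
   // ... prepare buf and start transmission ...
   atomic_store (&q_head, buf);
}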
>> void output_isr (void)
>> // Interrupt handler for the interrupt that reports
>> // "transmission complete" of the last data_t written
>> // to the output port. ...
>> {
>>    buffer_t * buf = q_head;
>>    // This initialization may trigger a warning about
>>    // qualifiers being discarded (because "q_head" is
>>    // qualified as volatile, but "buf" is not), but
>>    // this warning can be ignored.
A warning about ignoring this warning :-) : The C99 standard says, in paragraph 5 of the section "6.7.3 Type qualifiers", that "If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined", which might apply to this initialization of the "buf" variable. However, the same paragraph has a foot-note saying that this applies to objects that were "never actually defined as objects in the program (such as an object at a memory-mapped input/output address)", which is not the case here.
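If one prefers to silence the warning while documenting the intent, an explicit cast does it (same semantics, just quieter):

buffer_t * buf = (buffer_t *) q_head;   // cast away volatile; safe
                                        // here, per the foot-note
                                        // discussed above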
> I appreciate your effort to sketch out the implementation. It indeed
> gave me a valuable hint on how to write the transmit() function.
You are very welcome. But note how simple the transmit() code becomes if interrupts can be disabled or masked:

void transmit (buffer_t * volatile buf)
// Starts transmitting the buf, or enqueues it for
// later transmission.
{
   // Assume that buf->data and buf->length are set
   // by caller and that buf->length > 0.

   // Prepare the buffer for enqueing or transmitting:
   buf->transmitted = FALSE;
   buf->next = NULL;

   prevent_output_interrupts();

   if (q_head != NULL) {
      // Transmission is going on. We can just enqueue
      // the buffer, and it will be transmitted.
      buf->data_index = 0;
      q_tail->next = buf;
   }
   else {
      // No transmission going on now.
      // This buf becomes the first in the queue, and
      // we start its transmission.
      q_head = buf;
      buf->data_index = 1;
      // Start transmission:
      write_to_output_port (buf->data[0]);
   }

   q_tail = buf;

   allow_output_interrupts();
}

The interrupt handler code is the same in this case.
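The prevent/allow helpers are of course target- and toolchain-specific; on the ADSP21020 they would mask and unmask just the transmit interrupt. With hypothetical intrinsics (the real names depend on your compiler) they might be no more than:

static void prevent_output_interrupts (void)
{
   disable_interrupt (OUTPUT_IRQ);   // mask only the Tx interrupt,
}                                    // leaving the others enabled

static void allow_output_interrupts (void)
{
   enable_interrupt (OUTPUT_IRQ);    // unmask it again
}

Masking only this one interrupt leaves the latency of all the other interrupts unaffected.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
      .      @       .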
On Fri, 13 Jul 2012 22:00:38 +0300, upsidedown@downunder.com wrote:

>On Fri, 13 Jul 2012 12:21:27 -0400, George Neuner
><gneuner2@comcast.net> wrote:
>
>>>I tend to believe now that I only need mutual exclusion and considering
>>>that my processes are not running in a multithreading environment I
>>>should be safe without an atomic instruction.
>>
>>The software algorithms also will work under pre-emption as long as a
>>single memory access is not interruptible. That sounds strange
>>because it's true for all modern chips ... even with VMM an access
>>fault cancels and reschedules the instruction, the resulting interrupt
>>then is taken between instructions. However, this wasn't necessarily
>>true for early processors, some of which had interruptible load and
>>store operations. The early software exclusion algorithms were
>>designed with this fact in mind.
>
>Just out of curiosity, what was the dividing line between early and
>modern architectures ? Of the early architectures, VAX/VMS could in
>the worst case generate about a dozen page fault interrupts for a
>single instruction (instruction crossing page boundary, with 1-6
>operands and each causing a page fault) and of these any indirection
>causing one or two (unaligned across page boundary) access interrupts.
You can trace the roots of load/store back to the CDC 6600 (1964), but IMO "modern" really began in the 80's with the interlocked load/store RISC designs: MIPS2, PA-RISC, etc. All modern chips effectively are a load/store architecture internally, even if externally their programmer-visible ISA presents complex addressing modes. But pinning down when the mass market adopted load/store is difficult, because it happened in stages.

The i486 and the 68040 could be called "mostly load/store", as each had only a couple of (internally) complex modes. The Pentium Pro was a true load/store design, but probably wasn't popular enough to count ... so call the Pentium II the first one for Intel. The 88K was a market failure, so call POWER the first for Motorola.

I think a case could be made that ARM really was the first widely used load/store design, but until VMM became common in designs, the issue of interruptible vs restartable instructions didn't much matter.
>The string and decimal instructions could cause multiple additional
>page fault interrupts and were definitely restartable, rather than
>interruptible instructions, but of course irrelevant to the mutual
>exclusion discussion.
String instructions are a special case. They really are a loop of microinstructions which can be interrupted between iterations. [I know not all chips technically execute microcode ... what I'm referring to as "microinstructions" may simply be states in the execution sequencer.]

This is where terminology becomes troublesome 8( ... what exactly does it mean to be "restartable" vs "interruptible"? Everyone has a slightly different understanding based on which vendors' documentation they have been reading.

FWIW: I consider that an instruction is "interruptible" if a programmer-visible context switch can occur during its execution. Excepting some special cases like string operations, most ISAs specify that an interrupt logically can occur only between ISA-level instructions.

"Restartable" is a different matter. I consider that an instruction is restartable if it can be halted and resumed in-progress, or can be aborted and reexecuted without noticeable side effects. At the ISA level, instructions may be either resumable or reexecutable. At the microinstruction level, instructions almost always are aborted and reexecuted.

On almost all chips today, all but the very simplest ISA-level instructions decompose into a sequence of microinstructions. When a microinstruction sequence representing a memory-accessing instruction takes a fault, the entire sequence is rolled back to the beginning and the VMM interrupt is generated. Once the fault is cleared and the interrupt returns, the microinstruction sequence is reexecuted. Whether this is a full reexecution from the beginning or a resume from the previous point of failure is implementation dependent.

WRT software mutual exclusion algorithms, realizing that any memory access may be arbitrarily delayed is where the difference between in-order and out-of-order execution becomes important. These algorithms all depend on specific ordering of memory accesses to work properly. As long as those ordering requirements are respected, mutual exclusion will be guaranteed even in the presence of arbitrary thread delays. Special care has to be taken on OoO chips to guarantee that the memory accesses do occur in the required order.

Lamport's algorithm is unique because it purports to work with multiple CPUs; however, it still requires in-order accesses from all participating CPUs.
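For concreteness, here is a minimal sketch of Lamport's bakery algorithm for N tasks - the names are mine, and it assumes that int reads and writes are atomic and occur in the stated order:

#define N_TASKS 4                    /* number of competing tasks */

volatile int choosing[N_TASKS];      /* task is picking a ticket  */
volatile int ticket[N_TASKS];        /* 0 = not interested        */

void bakery_lock (int i)
{
   int j, max = 0;

   choosing[i] = 1;
   for (j = 0; j < N_TASKS; j++)     /* take a ticket one higher  */
      if (ticket[j] > max)           /* than any ticket seen      */
         max = ticket[j];
   ticket[i] = max + 1;
   choosing[i] = 0;

   for (j = 0; j < N_TASKS; j++) {
      while (choosing[j])
         ;                           /* let task j finish choosing */
      while (ticket[j] != 0 &&
             (ticket[j] < ticket[i] ||
              (ticket[j] == ticket[i] && j < i)))
         ;                           /* task j goes first: lower   */
   }                                 /* ticket, or same ticket and */
}                                    /* lower task id              */

void bakery_unlock (int i)
{
   ticket[i] = 0;
}

Hope this ... doesn't confuse the issue more 8-)

George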
On Fri, 13 Jul 2012 10:24:53 -0700 (PDT), dp <dp@tgi-sci.com> wrote:

>On Jul 13, 7:29 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
>> ...
>> Assuming you can guarantee the visible ordering of certain memory
>> accesses, Dekker's Algorithm can be used to do mutual exclusion for a
>> two processor system, without requiring any atomic update instruction.
>> It's very old. There are more complicated algorithms that can handle
>> more than two processors (Eisenberg and McGuire's, for example),
>> again, requiring nothing more than some control over memory ordering.
>
>It's been 13 years since I last considered something of that sort,
>in fact IIRC what I reinvented - and discarded as unusable - was what
>seems to be called "Dekker's algorithm".
>It only lowers the probability of a conflict.
>The thing is, to read and update a flag takes two separate accesses;
>if one of the requestors interrupts the other between these two
>accesses, completes the transaction, then it still tests the
>address and succeeds for a second time - after which gets interrupted
>by the first one which writes its ID (not knowing an entire cycle
>has been up etc.) we have a conflict.
No. If you examine them carefully, each of the algorithms relies on 2 separate flags being set and tested in a particular order. The use of 2 flags guarantees that the interruption of any operation does not compromise the integrity of the algorithm.
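For reference, here is a minimal sketch of Dekker's algorithm for two tasks - the names are mine, and it assumes atomic loads and stores of the flags, performed in program order (exactly the ordering requirement under discussion):

volatile int wants[2] = { 0, 0 };    /* one flag per task */
volatile int turn = 0;               /* tie breaker       */

void dekker_lock (int self)
{
   int other = 1 - self;

   wants[self] = 1;
   while (wants[other]) {            /* contention?              */
      if (turn != self) {
         wants[self] = 0;            /* back off ...             */
         while (turn != self)
            ;                        /* ... until it is our turn */
         wants[self] = 1;            /* and try again            */
      }
   }
}

void dekker_unlock (int self)
{
   turn = 1 - self;                  /* give the other task      */
   wants[self] = 0;                  /* priority next time       */
}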
>It can be practical in some cases, perhaps the timing of some systems
>might guarantee success, but this is by no means obvious.
I agree that their operation is not particularly obvious, but these algorithms have been very well proven. They are standard material in introductory CS operating-systems courses, and now that many (most?) programmers are writing threaded code, IMO every programmer should be aware of them and how they work.
>Basically, no indivisible read/write means no means to exclude
>the mess.
No, it only means the way is more complicated.
>Dimiter
George
On Mon, 16 Jul 2012 17:07:30 -0400, George Neuner
<gneuner2@comcast.net> wrote:

>On Fri, 13 Jul 2012 22:00:38 +0300, upsidedown@downunder.com wrote:
>>Just out of curiosity, what was the dividing line between early and
>>modern architectures ? Of the early architectures, VAX/VMS could in
>>the worst case generate about a dozen page fault interrupts for a
>>single instruction (instruction crossing page boundary, with 1-6
>>operands and each causing a page fault) and of these any indirection
>>causing one or two (unaligned across page boundary) access interrupts.
>
>You can trace the roots of load/store back to the CDC 6600 (1964), but
>IMO "modern" really began in the 80's with the interlocked load/store
>RISC designs: MIPS2, PA-RISC, etc. All modern chips effectively are a
>load/store architecture internally even if externally their programmer
>visible ISA presents complex addressing modes. But pinning down when
>the mass market adopted load/store is difficult because it happened in
>stages.
<clip>

Thank you very much for this description; I haven't followed the detailed HW development since the 1980's.
>WRT software mutual exclusion algorithms, realizing that any memory
>access may be arbitrarily delayed is where the difference between
>in-order and out-of-order execution becomes important. These
>algorithms all depend on specific ordering of memory accesses to work
>properly. As long as those ordering requirements are respected,
>mutual exclusion will be guaranteed even in the presence of arbitrary
>thread delays. Special care has to be taken on OoO chips to guarantee
>that the memory accesses do occur in the required order.
In the x86 family, the CPUID instruction seems to be quite a popular method of draining the pipeline.
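With GCC-style inline assembly (syntax assumed; x86 only), such a serializing barrier might look like this sketch:

static inline void serialize (void)
{
   unsigned eax = 0, ebx, ecx, edx;

   /* CPUID is a serializing instruction: nothing is reordered
      across it, and the "memory" clobber also stops the compiler
      from moving memory accesses past it. */
   __asm__ volatile ("cpuid"
                     : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     :
                     : "memory");
}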
On Jul 17, 1:45 am, George Neuner <gneun...@comcast.net> wrote:
> On Fri, 13 Jul 2012 10:24:53 -0700 (PDT), dp <d...@tgi-sci.com> wrote:
> ...
> >It's been 13 years since I last considered something of that sort,
> >in fact IIRC what I reinvented - and discarded as unusable - was what
> >seems to be called "Dekker's algorithm".
> >It only lowers the probability of a conflict.
> >The thing is, to read and update a flag takes two separate accesses;
> >if one of the requestors interrupts the other between these two
> >accesses, completes the transaction, then it still tests the
> >address and succeeds for a second time - after which gets interrupted
> >by the first one which writes its ID (not knowing an entire cycle
> >has been up etc.) we have a conflict.
>
> No. If you examine them carefully, each of the algorithms relies on 2
> separate flags being set and tested in a particular order. The use of
> 2 flags guarantees that the interruption of any operation does not
> compromise the integrity of the algorithm.
Indeed, I was wrong. Taking advantage of "in order" memory access (which can be enforced on the "out of order" machines I know), it can be made to work. I did not really remember what exactly I considered 13 years ago... :-). I suppose the fact that expanding this to multiple tasks (say, 64) would require each task to examine all 63 other flags etc. made it less appealing to me, or something. Or the fact that I just had an easier option. But it can be made to work indeed - as far as I can see at a glance, there is no fundamental flaw in it.

Dimiter
On 17/07/12 10:39, dp wrote:
> On Jul 17, 1:45 am, George Neuner<gneun...@comcast.net> wrote:
>> On Fri, 13 Jul 2012 10:24:53 -0700 (PDT), dp<d...@tgi-sci.com> wrote:
>> ...
>>> It's been 13 years since I last considered something of that sort,
>>> in fact IIRC what I reinvented - and discarded as unusable - was what
>>> seems to be called "Dekker's algorithm".
>>> [...]
>>
>> No. If you examine them carefully, each of the algorithms relies on 2
>> separate flags being set and tested in a particular order. The use of
>> 2 flags guarantees that the interruption of any operation does not
>> compromise the integrity of the algorithm.
>
> Indeed, I was wrong. Taking advantage of "in order" memory access
> (which can be enforced on the "out of order" machines I know), it
> can be made to work.
> I did not really remember what exactly I considered 13 years
> ago... :-).
> I suppose the fact that expanding this to multiple tasks
> (say, 64) would require each task to examine all 63 other flags
> etc. made it less appealing to me, or something. Or the fact that I
> just had an easier option.
> But it can be made to work indeed - as far as I can see at
> a glance, there is no fundamental flaw in it.
>
> Dimiter
There are other algorithms that are more efficient for multiple tasks - as you note, Dekker's is not very scalable. But in most cases, there should only be a few tasks competing for any given mutex/critical section/semaphore - you don't really want 64 different tasks all fighting for the same shared resource!

Note also that you don't actually need proper in-order memory access here (as enforced by things like the eieio or msync instructions on a PPC, or by memory area policies) - you only need the /appearance/ of in-order memory access. So if you are talking about multiple tasks on the same processor, you don't need anything beyond "volatile" accesses - even if the reads and writes are cached or buffered in some way, the processor will still view them in the same order.
On 16/07/12 12:06, Niklas Holsti wrote:
> On 12-07-16 11:19 , alb wrote:
>> AFAIK, the volatile qualifier on a structure makes each
>> member of the structure volatile,
>
> That is my understanding too. But note that there is a paper by Eric
> Eide and John Regehr saying that compilers often (well, relatively
> often) have bugs in their treatment of volatiles (google for "volatiles
> are miscompiled"). I would definitely check the compiler-generated
> assembly code, at least once, to verify that my compiler is doing the
> right thing.
(I haven't studied the code here - this is just general comments.)

The paper looks mainly at complex uses of "volatile", especially in connection with "sequence points" and with poorly specified areas of C. An example would be the expression "(v + v)", where "v" is declared volatile - should the compiler read "v" twice and add the results, can it simply read "v" once and double the result, or perhaps read "v" twice and double one of the results? The C specifications are very unclear on such matters, and compiler writers differ in their interpretations - as do programmers writing the code. I am not convinced that E&R are correct in all /their/ interpretations either. All we can be sure of is that with complex expressions using volatiles, we can't all be right - so such code should be avoided.

When using "volatile", the key to correct coding is to realise there is no such thing as volatile data, or volatile variables. There are only volatile /accesses/. A "volatile variable" is nothing more than an ordinary variable with a note to the compiler that all normal accesses should be "volatile". Think about when you want your /reads/ and /writes/ to be volatile, and you will get it right. I often don't even bother declaring the variables to be volatile - I use macros to enforce volatile reads or writes as needed, as sketched below.

And keep all volatile-related code simple - then the compiler will get it right too. Rather than writing "v = (v + v)", write "int v1 = v; int v2 = v; v = v1 + v2" to make explicit exactly the functionality you want.
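Such macros can be as simple as the following sketch (using the GCC __typeof__ extension; plain standard C would need one macro per type):

/* Force exactly one volatile read or write of an ordinary object. */
#define V_READ(obj)        (*(volatile __typeof__(obj) *) &(obj))
#define V_WRITE(obj, val)  (*(volatile __typeof__(obj) *) &(obj) = (val))

int shared;   /* an ordinary, non-volatile variable */

void example (void)
{
   int x = V_READ (shared);      /* exactly one read  */
   V_WRITE (shared, x + 1);      /* exactly one write */
}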
On Wed, 18 Jul 2012 11:05:46 +0200, David Brown
<david.brown@removethis.hesbynett.no> wrote:

>On 17/07/12 10:39, dp wrote:
>
>> I suppose the fact that expanding [software mutex] for multiple
>> tasks (say, 64) would require each task to examine all 63 other
>> flags etc. made it less appealing to me, or something. Or the fact
>> that I just had an easier option.
>> But it can be made to work indeed - as far as I can see at
>> a glance, there is no fundamental flaw in it.
>
>There are other algorithms that are more efficient for multiple tasks -
>as you note, Dekker's is not very scalable. But in most cases, there
>should only be a few tasks competing for any given mutex/critical
>section/semaphore - you don't really want 64 different tasks all
>fighting for the same shared resource!
In fact, none of the software only algorithms scale particularly well to large numbers of tasks.
>Note also that you don't actually need proper in-order memory access
>here (as enforced by things like the eieio or msync instructions on a
>PPC, or by memory area policies) - you only need the /appearance/ of
>in-order memory access. So if you are talking about multiple tasks on
>the same processor, you don't need anything beyond "volatile" accesses -
>even if the reads and writes are cached or buffered in some way, the
>processor will still view them in the same order.
Yes, with a caveat which may make no difference to anyone here but is mentioned for completeness.

For SMT ("hyperthreaded") chips, each physical core behaves as 2 or more logical cores executing separate instruction streams. But unlike software multitasking, in SMT the logical cores really do execute *simultaneously* (assuming sufficient chip resources are available). Exactly when and in what order writes become visible to all logical cores is implementation dependent. Code running on such a chip may have to obey the same restrictions as for a true multiprocessor even when there is only one physical core present.
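Where portability matters, the new C11 fences and atomics (assuming a C11 toolchain, which is still uncommon) make the required ordering explicit; "shared_data" and "ready" below are made-up names for illustration:

#include <stdatomic.h>

int shared_data;
atomic_int ready = ATOMIC_VAR_INIT(0);

void producer (int value)
{
   shared_data = value;
   /* release: the write above becomes visible to any core
      before the flag does */
   atomic_store_explicit (&ready, 1, memory_order_release);
}

int consumer (void)
{
   while (!atomic_load_explicit (&ready, memory_order_acquire))
      ;   /* spin until published */
   return shared_data;   /* guaranteed to see producer's write */
}

George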
On 19/07/12 22:12, George Neuner wrote:
> On Wed, 18 Jul 2012 11:05:46 +0200, David Brown
> <david.brown@removethis.hesbynett.no> wrote:
>
> [...]
>
>> Note also that you don't actually need proper in-order memory access
>> here (as enforced by things like the eieio or msync instructions on a
>> PPC, or by memory area policies) - you only need the /appearance/ of
>> in-order memory access. So if you are talking about multiple tasks on
>> the same processor, you don't need anything beyond "volatile" accesses -
>> even if the reads and writes are cached or buffered in some way, the
>> processor will still view them in the same order.
>
> Yes, with a caveat which may make no difference to anyone here but is
> mentioned for completeness.
>
> For SMT ("hyperthreaded") chips, each physical core behaves as 2 or
> more logical cores executing separate instruction streams. But unlike
> software multitasking, in SMT the logical cores really do execute
> *simultaneously* (assuming sufficient chip resources are available).
> Exactly when and in what order writes become visible to all logical
> cores is implementation dependent. Code running on such a chip may
> have to obey the same restrictions as for a true multiprocessor even
> when there is only one physical core present.
>
> George
Indeed - SMT has to be treated like true SMP as far as these things are concerned.