Reply by rickman March 26, 2017
On 3/26/2017 11:28 AM, David Brown wrote:
> On 25/03/17 20:27, rickman wrote: >> On 3/25/2017 10:25 AM, David Brown wrote: > >>> Well, you are the one implementing this - so you have to figure out what >>> solution makes most sense for you. Here's another idea you could >>> consider. >>> >>> If you are dealing with just one CPU here, I have always thought a >>> "disable interrupts for the next X instructions then restore interrupt >>> status" instruction would be handy - with X being something like 4. That >>> would let you do atomic reads, writes or read-modify-write instructions >>> covering at least two memory addresses, without need for special memory >>> or read-write-modify opcodes. >> >> How is that different from instructions to enable and disable >> interrupts? This doesn't really help me as there are N logical CPUs. >> They just share the same hardware. But they all run concurrently in >> nearly every sense. They just use different clock cycles so that memory >> accesses are not literally concurrent. So interrupts aren't the (only) >> issue. >> > > An instruction like the one I propose would be safer than the normal > "disable all interrupts" instruction, because it places a limit on the > latency of interrupts. Code that simply disables interrupts could do so > for an arbitrary length of time - here it is specifically limited. > > It is harder to make this work well for your SMT cpu. Here you might > change things to say that for these next 4 clock cycles, the current > logical cpu runs on /every/ clock cycle - all other SMT threads are > paused. Depending on how you have organised things, that might be > simple or it might be nearly impossible. (On the XMOS, if only one > thread is running it gets a maximum of 1 cycle out of every 5, with the > rest wasted - this is due to the 5 stage pipeline of its cpu and that > each logical cpu can only have one instruction in action at a time.)
The other CPUs can be halted (hard perhaps, but not impossible), but the point of the multi-CPU idea is to use a pipeline to make the clock faster while avoiding a single pipelined CPU with all its warts - instead make it N CPUs. No one CPU can hog all the clock cycles because of the pipeline, no different from the XMOS design. But just as important is not hurting interrupt latency. Preventing execution on any other CPU would impact interrupt latency adversely, and low latency is a primary design goal. This is intended for hard real-time use, the sort of thing where a CPU may well be counting cycles for short delays or need to respond to an event on the next cycle. The instruction architecture of the last CPU I built allowed literally 1 clock of interrupt latency, as it only took one cycle to push all the needed info to the stacks.

The clock speed won't increase linearly with the pipeline length, but otherwise the cost of adding CPUs up to 16 is trivial. So I have thought of using some of the CPUs for interrupt handling. Can't get much faster than zero cycles. :)

-- Rick C
Reply by David Brown March 26, 2017
On 25/03/17 20:27, rickman wrote:
> On 3/25/2017 10:25 AM, David Brown wrote:
>> Well, you are the one implementing this - so you have to figure out what >> solution makes most sense for you. Here's another idea you could >> consider. >> >> If you are dealing with just one CPU here, I have always thought a >> "disable interrupts for the next X instructions then restore interrupt >> status" instruction would be handy - with X being something like 4. That >> would let you do atomic reads, writes or read-modify-write instructions >> covering at least two memory addresses, without need for special memory >> or read-write-modify opcodes. > > How is that different from instructions to enable and disable > interrupts? This doesn't really help me as there are N logical CPUs. > They just share the same hardware. But they all run concurrently in > nearly every sense. They just use different clock cycles so that memory > accesses are not literally concurrent. So interrupts aren't the (only) > issue. >
An instruction like the one I propose would be safer than the normal "disable all interrupts" instruction, because it places a limit on the latency of interrupts. Code that simply disables interrupts could do so for an arbitrary length of time - here it is specifically limited. It is harder to make this work well for your SMT cpu. Here you might change things to say that for these next 4 clock cycles, the current logical cpu runs on /every/ clock cycle - all other SMT threads are paused. Depending on how you have organised things, that might be simple or it might be nearly impossible. (On the XMOS, if only one thread is running it gets a maximum of 1 cycle out of every 5, with the rest wasted - this is due to the 5 stage pipeline of its cpu and that each logical cpu can only have one instruction in action at a time.)
Reply by Robert Wessel March 26, 2017
On Sat, 25 Mar 2017 12:40:45 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>On 25.3.2017 ?. 05:47, Robert Wessel wrote: >> On Sat, 25 Mar 2017 00:42:27 +0200, Dimiter_Popoff <dp@tgi-sci.com> >> wrote: >> >>> Test And Set, a classic 68k RMW opcoode. >> >> >> I'm pretty sure it predates S/360. >> > >I would not know, my first encounter with that sort of thing was >on the 68k. The first processor I designed a board with - which >was the first computer I owned - was the 6809... here are the remnants >of this board (early 80-s): http://tgi-sci.com/misc/grany09.gif >Had yet to be exposed to the TAS concept when I was making this one :).
My point was merely that TAS is far, far older than you had indicated. It was a standard instruction on S/360s when they were introduced to the world in 1964, and I'm pretty sure it was not new then.
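For anyone who hasn't met it: TAS/TS atomically reads a memory byte and sets it in one indivisible bus operation, reporting (via the condition codes) whether it was already set - exactly the primitive a spin lock needs. A minimal sketch in portable C11 using atomic_flag, the modern equivalent, rather than any particular instruction set:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void lock_acquire(void)
{
    /* Atomically set the flag and get its previous value; keep
     * spinning until we are the caller that found it clear. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire)) {
        /* busy-wait; a real system might pause, yield or back off here */
    }
}

void lock_release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}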
Reply by rickman March 25, 2017
On 3/25/2017 4:43 PM, upsidedown@downunder.com wrote:
> On Sat, 25 Mar 2017 15:27:55 -0400, rickman <gnuarm@gmail.com> wrote: > >> On 3/25/2017 10:25 AM, David Brown wrote: >>> On 25/03/17 03:10, rickman wrote: >>>> On 3/24/2017 8:48 PM, David Brown wrote: >>>>> On 24/03/17 19:12, rickman wrote: >>>>>> On 3/24/2017 2:06 PM, David Brown wrote: >>>>>>> On 24/03/17 18:36, rickman wrote: >>>>>>>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>>>>>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>>>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>> >>>>>>> <snip> >>>>>>> >>>>>>>> >>>>>>>>>> There certainly is no reason for memory to be multiple cycle. I >>>>>>>>>> think >>>>>>>>>> you are picturing various implementations where memory is slow >>>>>>>>>> compared >>>>>>>>>> to the CPU. That's not a given. >>>>>>>>> >>>>>>>>> >>>>>>>>> No, but such a situation is sufficiently uncommon that you'd really >>>>>>>>> want to spec it up front. Most cores these days don't even have >>>>>>>>> single cycle L1 caches. >>>>>>>> >>>>>>>> Again, assuming an architecture and requirements. This >>>>>>>> conversation is >>>>>>>> not about designing the next generation of ARM CPU. >>>>>>> >>>>>>> It would be making architecture assumptions to require single-cycle >>>>>>> memory - saying that memory may be multiple cycle is the general case. >>>>>>> >>>>>>> (You might /want/ to make architecture assumptions here - I gave some >>>>>>> other suggestions of specific hardware for locking in another post. >>>>>>> The >>>>>>> solutions discussed here are for general memory.) >>>>>> >>>>>> We are having two different conversations here. I am designing a CPU, >>>>>> you are talking about the theory of CPU design. >>>>> >>>>> Aha! That is some useful information to bring to the table. You told >>>>> us earlier about some things that you are /not/ doing, but not what you >>>>> /are/ doing. (Or if you did, I missed it.) >>>>> >>>>> In that case, I would recommend making some dedicated hardware for your >>>>> synchronisation primitives. A simple method is one I described earlier >>>>> - have a set of memory locations where you the upper half of each 32-bit >>>>> entry is for a "thread id". You can only write to the entry if the >>>>> current upper half is 0, or it matches the thread id you are writing to >>>>> it. It should be straightforward to implement in an FPGA. >>>>> >>>>> The disadvantage of this sort of solution is scaling - if your hardware >>>>> supports 64 such semaphores, then that's all you've got. A solution >>>>> utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as >>>>> many semaphores as it wants. But you can always use the hardware ones >>>>> to implement more software semaphores indirectly. >>>> >>>> I guess I didn't explain the full context. But as I mentioned, it is >>>> simpler I think to include the swap memory instruction that will allow a >>>> test and set operation to be implemented atomically without disrupting >>>> any other functions. It will use up an opcode, but it would be a simple >>>> one with the only difference from a normal memory write being the use of >>>> the read path. >>>> >>> >>> Well, you are the one implementing this - so you have to figure out what >>> solution makes most sense for you. 
Here's another idea you could consider. >>> >>> If you are dealing with just one CPU here, I have always thought a >>> "disable interrupts for the next X instructions then restore interrupt >>> status" instruction would be handy - with X being something like 4. That >>> would let you do atomic reads, writes or read-modify-write instructions >>> covering at least two memory addresses, without need for special memory >>> or read-write-modify opcodes. >> >> How is that different from instructions to enable and disable >> interrupts? This doesn't really help me as there are N logical CPUs. >> They just share the same hardware. But they all run concurrently in >> nearly every sense. They just use different clock cycles so that memory >> accesses are not literally concurrent. So interrupts aren't the (only) >> issue. > > I thought that SMP stands for _symmetric_ multiprocessing. > > Disabling interrupts works well when dealing with one CPU and > peripherals or the case with one master supervisor CPU and a lot of > slave CPUs (AMP).
I don't follow how disabling interrupts helps when multiple CPUs access the same memory location. Disabling interrupts only stops other processes on the same CPU from touching that location. How would it prevent processes on other CPUs from accessing it in the middle of the multi-instruction sequence of the process in question?

-- Rick C
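To make the lost-update race concrete, a minimal C sketch; disable_irq()/enable_irq() are placeholder names for whatever the target uses to mask local interrupts, not a real API:

extern void disable_irq(void);   /* placeholder: mask interrupts on this CPU only */
extern void enable_irq(void);    /* placeholder: unmask them again */

volatile int shared_count;       /* shared between the logical CPUs */

void bump(void)
{
    disable_irq();               /* keeps local interrupt handlers out...        */
    int tmp = shared_count;      /* CPU A reads 5; CPU B can read 5 on its cycle */
    shared_count = tmp + 1;      /* both write back 6 - one increment is lost    */
    enable_irq();                /* ...but never stopped the other CPU at all    */
}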
Reply by March 25, 2017
On Sat, 25 Mar 2017 15:27:55 -0400, rickman <gnuarm@gmail.com> wrote:

>On 3/25/2017 10:25 AM, David Brown wrote: >> On 25/03/17 03:10, rickman wrote: >>> On 3/24/2017 8:48 PM, David Brown wrote: >>>> On 24/03/17 19:12, rickman wrote: >>>>> On 3/24/2017 2:06 PM, David Brown wrote: >>>>>> On 24/03/17 18:36, rickman wrote: >>>>>>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>>>>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>>>> wrote: >>>>>> >>>>>> <snip> >>>>>> >>>>>>> >>>>>>>>> There certainly is no reason for memory to be multiple cycle. I >>>>>>>>> think >>>>>>>>> you are picturing various implementations where memory is slow >>>>>>>>> compared >>>>>>>>> to the CPU. That's not a given. >>>>>>>> >>>>>>>> >>>>>>>> No, but such a situation is sufficiently uncommon that you'd really >>>>>>>> want to spec it up front. Most cores these days don't even have >>>>>>>> single cycle L1 caches. >>>>>>> >>>>>>> Again, assuming an architecture and requirements. This >>>>>>> conversation is >>>>>>> not about designing the next generation of ARM CPU. >>>>>> >>>>>> It would be making architecture assumptions to require single-cycle >>>>>> memory - saying that memory may be multiple cycle is the general case. >>>>>> >>>>>> (You might /want/ to make architecture assumptions here - I gave some >>>>>> other suggestions of specific hardware for locking in another post. >>>>>> The >>>>>> solutions discussed here are for general memory.) >>>>> >>>>> We are having two different conversations here. I am designing a CPU, >>>>> you are talking about the theory of CPU design. >>>> >>>> Aha! That is some useful information to bring to the table. You told >>>> us earlier about some things that you are /not/ doing, but not what you >>>> /are/ doing. (Or if you did, I missed it.) >>>> >>>> In that case, I would recommend making some dedicated hardware for your >>>> synchronisation primitives. A simple method is one I described earlier >>>> - have a set of memory locations where you the upper half of each 32-bit >>>> entry is for a "thread id". You can only write to the entry if the >>>> current upper half is 0, or it matches the thread id you are writing to >>>> it. It should be straightforward to implement in an FPGA. >>>> >>>> The disadvantage of this sort of solution is scaling - if your hardware >>>> supports 64 such semaphores, then that's all you've got. A solution >>>> utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as >>>> many semaphores as it wants. But you can always use the hardware ones >>>> to implement more software semaphores indirectly. >>> >>> I guess I didn't explain the full context. But as I mentioned, it is >>> simpler I think to include the swap memory instruction that will allow a >>> test and set operation to be implemented atomically without disrupting >>> any other functions. It will use up an opcode, but it would be a simple >>> one with the only difference from a normal memory write being the use of >>> the read path. >>> >> >> Well, you are the one implementing this - so you have to figure out what >> solution makes most sense for you. Here's another idea you could consider. 
>> >> If you are dealing with just one CPU here, I have always thought a >> "disable interrupts for the next X instructions then restore interrupt >> status" instruction would be handy - with X being something like 4. That >> would let you do atomic reads, writes or read-modify-write instructions >> covering at least two memory addresses, without need for special memory >> or read-write-modify opcodes. > >How is that different from instructions to enable and disable >interrupts? This doesn't really help me as there are N logical CPUs. >They just share the same hardware. But they all run concurrently in >nearly every sense. They just use different clock cycles so that memory >accesses are not literally concurrent. So interrupts aren't the (only) >issue.
I thought that SMP stands for _symmetric_ multiprocessing. Disabling interrupts works well when dealing with one CPU and peripherals, or with one master supervisor CPU and a lot of slave CPUs (AMP).
Reply by rickman March 25, 2017
On 3/25/2017 10:25 AM, David Brown wrote:
> On 25/03/17 03:10, rickman wrote: >> On 3/24/2017 8:48 PM, David Brown wrote: >>> On 24/03/17 19:12, rickman wrote: >>>> On 3/24/2017 2:06 PM, David Brown wrote: >>>>> On 24/03/17 18:36, rickman wrote: >>>>>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>>>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>>> wrote: >>>>> >>>>> <snip> >>>>> >>>>>> >>>>>>>> There certainly is no reason for memory to be multiple cycle. I >>>>>>>> think >>>>>>>> you are picturing various implementations where memory is slow >>>>>>>> compared >>>>>>>> to the CPU. That's not a given. >>>>>>> >>>>>>> >>>>>>> No, but such a situation is sufficiently uncommon that you'd really >>>>>>> want to spec it up front. Most cores these days don't even have >>>>>>> single cycle L1 caches. >>>>>> >>>>>> Again, assuming an architecture and requirements. This >>>>>> conversation is >>>>>> not about designing the next generation of ARM CPU. >>>>> >>>>> It would be making architecture assumptions to require single-cycle >>>>> memory - saying that memory may be multiple cycle is the general case. >>>>> >>>>> (You might /want/ to make architecture assumptions here - I gave some >>>>> other suggestions of specific hardware for locking in another post. >>>>> The >>>>> solutions discussed here are for general memory.) >>>> >>>> We are having two different conversations here. I am designing a CPU, >>>> you are talking about the theory of CPU design. >>> >>> Aha! That is some useful information to bring to the table. You told >>> us earlier about some things that you are /not/ doing, but not what you >>> /are/ doing. (Or if you did, I missed it.) >>> >>> In that case, I would recommend making some dedicated hardware for your >>> synchronisation primitives. A simple method is one I described earlier >>> - have a set of memory locations where you the upper half of each 32-bit >>> entry is for a "thread id". You can only write to the entry if the >>> current upper half is 0, or it matches the thread id you are writing to >>> it. It should be straightforward to implement in an FPGA. >>> >>> The disadvantage of this sort of solution is scaling - if your hardware >>> supports 64 such semaphores, then that's all you've got. A solution >>> utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as >>> many semaphores as it wants. But you can always use the hardware ones >>> to implement more software semaphores indirectly. >> >> I guess I didn't explain the full context. But as I mentioned, it is >> simpler I think to include the swap memory instruction that will allow a >> test and set operation to be implemented atomically without disrupting >> any other functions. It will use up an opcode, but it would be a simple >> one with the only difference from a normal memory write being the use of >> the read path. >> > > Well, you are the one implementing this - so you have to figure out what > solution makes most sense for you. Here's another idea you could consider. > > If you are dealing with just one CPU here, I have always thought a > "disable interrupts for the next X instructions then restore interrupt > status" instruction would be handy - with X being something like 4. 
> That would let you do atomic reads, writes or read-modify-write instructions covering at least two memory addresses, without need for special memory or read-write-modify opcodes.
How is that different from instructions to enable and disable interrupts? This doesn't really help me as there are N logical CPUs. They just share the same hardware. But they all run concurrently in nearly every sense. They just use different clock cycles so that memory accesses are not literally concurrent. So interrupts aren't the (only) issue. -- Rick C
Reply by David Brown March 25, 2017
On 25/03/17 03:10, rickman wrote:
> On 3/24/2017 8:48 PM, David Brown wrote: >> On 24/03/17 19:12, rickman wrote: >>> On 3/24/2017 2:06 PM, David Brown wrote: >>>> On 24/03/17 18:36, rickman wrote: >>>>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: >>>>>> >>>>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>>>> wrote: >>>> >>>> <snip> >>>> >>>>> >>>>>>> There certainly is no reason for memory to be multiple cycle. I >>>>>>> think >>>>>>> you are picturing various implementations where memory is slow >>>>>>> compared >>>>>>> to the CPU. That's not a given. >>>>>> >>>>>> >>>>>> No, but such a situation is sufficiently uncommon that you'd really >>>>>> want to spec it up front. Most cores these days don't even have >>>>>> single cycle L1 caches. >>>>> >>>>> Again, assuming an architecture and requirements. This >>>>> conversation is >>>>> not about designing the next generation of ARM CPU. >>>> >>>> It would be making architecture assumptions to require single-cycle >>>> memory - saying that memory may be multiple cycle is the general case. >>>> >>>> (You might /want/ to make architecture assumptions here - I gave some >>>> other suggestions of specific hardware for locking in another post. >>>> The >>>> solutions discussed here are for general memory.) >>> >>> We are having two different conversations here. I am designing a CPU, >>> you are talking about the theory of CPU design. >> >> Aha! That is some useful information to bring to the table. You told >> us earlier about some things that you are /not/ doing, but not what you >> /are/ doing. (Or if you did, I missed it.) >> >> In that case, I would recommend making some dedicated hardware for your >> synchronisation primitives. A simple method is one I described earlier >> - have a set of memory locations where you the upper half of each 32-bit >> entry is for a "thread id". You can only write to the entry if the >> current upper half is 0, or it matches the thread id you are writing to >> it. It should be straightforward to implement in an FPGA. >> >> The disadvantage of this sort of solution is scaling - if your hardware >> supports 64 such semaphores, then that's all you've got. A solution >> utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as >> many semaphores as it wants. But you can always use the hardware ones >> to implement more software semaphores indirectly. > > I guess I didn't explain the full context. But as I mentioned, it is > simpler I think to include the swap memory instruction that will allow a > test and set operation to be implemented atomically without disrupting > any other functions. It will use up an opcode, but it would be a simple > one with the only difference from a normal memory write being the use of > the read path. >
Well, you are the one implementing this - so you have to figure out what solution makes most sense for you. Here's another idea you could consider. If you are dealing with just one CPU here, I have always thought a "disable interrupts for the next X instructions then restore interrupt status" instruction would be handy - with X being something like 4. That would let you do atomic reads, writes or read-modify-write instructions covering at least two memory addresses, without need for special memory or read-write-modify opcodes.
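A sketch of how such an instruction might be used; __irq_hold() is an invented intrinsic for illustration only (no real compiler or ISA is being described), and the two shared counters are equally hypothetical:

extern volatile int head;        /* hypothetical shared queue state */
extern volatile int count;

void producer_update(void)
{
    __irq_hold(4);               /* invented: interrupts deferred for the next 4 instructions */
    head  = head + 1;            /* an interrupt handler can now never see  */
    count = count + 1;           /* one of these updated without the other  */
    /* interrupt status restores itself - nothing to re-enable, so the
     * worst-case interrupt latency is bounded by the instruction itself */
}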
Reply by Dimiter_Popoff March 25, 2017
On 25.3.2017 г. 05:47, Robert Wessel wrote:
> On Sat, 25 Mar 2017 00:42:27 +0200, Dimiter_Popoff <dp@tgi-sci.com> > wrote: > >> Test And Set, a classic 68k RMW opcoode. > > > I'm pretty sure it predates S/360. >
I would not know; my first encounter with that sort of thing was on the 68k. The first processor I designed a board with - which was the first computer I owned - was the 6809... here are the remnants of this board (early '80s): http://tgi-sci.com/misc/grany09.gif I had yet to be exposed to the TAS concept when I was making this one :).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by Clifford Heath March 25, 2017
On 25/03/17 14:01, Robert Wessel wrote:
> On Fri, 24 Mar 2017 21:04:25 +1100, Clifford Heath > <no.spam@please.net> wrote: > >> On 24/03/17 10:47, Tim Wescott wrote: >>> On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote: >>> >>>> On 3/23/2017 4:19 PM, Tim Wescott wrote: >>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote: >>>>> >>>>>> I recall a discussion about the design of an instruction set >>>>>> architecture where someone was saying an instruction was required to >>>>>> test and set a bit or word as an atomic operation if it was desired to >>>>>> support multiple processors. Is this really true? Is this a function >>>>>> that can't be emulated with other operations including the disabling >>>>>> of interrupts? >>>>> >>>>> AFAIK as long as you surround your "test and set" with an interrupt >>>>> disable and an interrupt enable then you're OK. At least, you're OK >>>>> unless you have a processor that treats interrupts really strangely. >>>> >>>> Rethink that for the case of SMP... (coincidentally, "Support Multiple >>>> Processors" :> ) >>> >>> D'oh. Atomic to the common memory, not to each individual processor, yes. >>> >>> Although it wouldn't have to be an instruction per se: you could have it >>> be an "instruction" to whatever hardware is controlling the common >>> memory, to hold off the other processors while it does a read/modify/ >>> write cycle. >> >> Yes. But when you have multi-level caching, perhaps some with >> write-back semantics, it needs to force write-through, and be >> bus-locked all the way to the common memory. >> >> X86 has a LOCK prefix which acts on certain following instructions >> to make this happen, and SMP and multi-CPU architectures honor it. > > > Well, if the memory is write-back, then yes, the write will go all the > way to memory. > > OTOH, that's certainly *not* the case for the vast majority of atomic > or "locked" operations, on x86, or any other high end CPU. This > almost always happens in cache, but the line in question needs to be > held exclusively by the CPU issuing the atomic operation. If it's > already in that state, then it happens very fast. If it's not in that > state, the line may need to be fetched from main memory (if no other > core has it), from another core (if another core has that line and > it's modified), or the state may need to be changed to exclusive, by > invalidating the shared copy in any other cores. The exact details > vary, but "lock" does not force a bus lock all the way to main memory, > except, possibly, in the case of write through and/or non-cached > memory.
Thanks for the clarification. I was aware of the result, but not of the recent implementations. It certainly seems to work in the software I've shipped anyhow. Clifford Heath.
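As a concrete illustration of the point quoted above, a portable C11 read-modify-write like the following is normally compiled by GCC or Clang on x86-64 into a single LOCK-prefixed instruction, and that "lock" is resolved by holding the cache line exclusively rather than by locking the bus; the reference-count example itself is just an invented placeholder:

#include <stdatomic.h>

static atomic_int refcount = 1;   /* hypothetical shared reference count */

void ref_take(void)
{
    /* Typically becomes something like "lock addl $1, refcount(%rip)":
     * atomic with respect to every other core, with no bus lock needed
     * for ordinary cacheable memory. */
    atomic_fetch_add_explicit(&refcount, 1, memory_order_relaxed);
}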
Reply by Robert Wessel March 25, 2017
On Sat, 25 Mar 2017 00:42:27 +0200, Dimiter_Popoff <dp@tgi-sci.com>
wrote:

>Test And Set, a classic 68k RMW opcoode.
I'm pretty sure it predates S/360.