I recall a discussion about the design of an instruction set architecture where someone was saying an instruction was required to test and set a bit or word as an atomic operation if it was desired to support multiple processors. Is this really true? Is this a function that can't be emulated with other operations, including the disabling of interrupts?

--
Rick C
Speaking of Multiprocessing...
Started by ●March 23, 2017
Reply by ●March 23, 2017
On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote:

> I recall a discussion about the design of an instruction set
> architecture where someone was saying an instruction was required to
> test and set a bit or word as an atomic operation if it was desired to
> support multiple processors. Is this really true? Is this a function
> that can't be emulated with other operations including the disabling of
> interrupts?

AFAIK as long as you surround your "test and set" with an interrupt disable and an interrupt enable then you're OK. At least, you're OK unless you have a processor that treats interrupts really strangely.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by ●March 23, 2017
On 3/23/2017 4:19 PM, Tim Wescott wrote:

> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote:
>
>> I recall a discussion about the design of an instruction set
>> architecture where someone was saying an instruction was required to
>> test and set a bit or word as an atomic operation if it was desired to
>> support multiple processors. Is this really true? Is this a function
>> that can't be emulated with other operations including the disabling of
>> interrupts?
>
> AFAIK as long as you surround your "test and set" with an interrupt
> disable and an interrupt enable then you're OK. At least, you're OK
> unless you have a processor that treats interrupts really strangely.

Rethink that for the case of SMP... (coincidentally, "Support Multiple Processors" :> )
Reply by ●March 23, 2017
On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote:

> On 3/23/2017 4:19 PM, Tim Wescott wrote:
>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote:
>>> [snip]
>>
>> AFAIK as long as you surround your "test and set" with an interrupt
>> disable and an interrupt enable then you're OK. At least, you're OK
>> unless you have a processor that treats interrupts really strangely.
>
> Rethink that for the case of SMP... (coincidentally, "Support Multiple
> Processors" :> )

D'oh. Atomic to the common memory, not to each individual processor, yes.

Although it wouldn't have to be an instruction per se: you could have it be an "instruction" to whatever hardware is controlling the common memory, to hold off the other processors while it does a read/modify/write cycle.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
Reply by ●March 23, 2017
On 3/23/2017 4:47 PM, Tim Wescott wrote:

> On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote:
>> On 3/23/2017 4:19 PM, Tim Wescott wrote:
>>> [snip]
>>
>> Rethink that for the case of SMP... (coincidentally, "Support Multiple
>> Processors" :> )
>
> D'oh. Atomic to the common memory, not to each individual processor, yes.

Yes. If the processor supports a RMW memory cycle AND the memory arbiter honors that contract, then any competing processors would explicitly be held off from accessing the location in question until the RMW cycle terminated.

> Although it wouldn't have to be an instruction per se: you could have it
> be an "instruction" to whatever hardware is controlling the common
> memory, to hold off the other processors while it does a read/modify/
> write cycle.

Yes. The point is that you have to be able to view state in specific memory locations as functionally possible to "examine and alter" indivisibly.

You can also introduce specific hardware mutexes, but then you are limited (in practical terms) by their number. [I designed a processor many years ago that used such a mechanism to synchronize with its peers]

You can also have the memory arbiter notice the cycle and "remember" it even while allowing it to be "interrupted" (poor choice of words) in the hope that the interruption does not COMPETE for that particular memory location. *If* the arbiter notices that the interruption actually *does* compete for that location, then it can explicitly act to stall that competing request until the original "remembered" request has a chance to finish (presumably "in short order").

It boils down to memory topology and the amount of resources you want to throw at supporting multiple concurrent accesses. As RMW's *tend* to be infrequent, you can just opt for the easy way out...

By far, the easiest solution is a RMW memory cycle, as you can then design an arbiter to honor *it* while, conceivably, allowing OTHER accesses to proceed in parallel (you just have to ensure that the RMW-accessed location is treated atomically; all others are free to be accessed at will, concurrent with that RMW!)
Reply by ●March 24, 2017
On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> wrote:

> I recall a discussion about the design of an instruction set
> architecture where someone was saying an instruction was required to
> test and set a bit or word as an atomic operation if it was desired to
> support multiple processors. Is this really true? Is this a function
> that can't be emulated with other operations including the disabling of
> interrupts?

Dekker's algorithm (circa 1962) can be used to implement a mutex without test-and-set, or even masking interrupts, and really only requires sequential consistency (and if that's an issue for your hardware, it should have appropriate memory barriers to force that).

https://en.wikipedia.org/wiki/Dekker%27s_algorithm

A limitation is that Dekker's only works with two threads/processors. Several related algorithms exist, and generalizations of some of those (for example, Peterson's) can handle more than two threads. And once you have a mutex, you can implement everything else.

OTOH, proper atomic instructions are a huge boon (even if only in LL/SC form), so they're provided by everyone who's serious about multiprocessor support.
Reply by ●March 24, 2017
On 3/23/2017 11:43 PM, Robert Wessel wrote:

> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> wrote:
>> [snip]
>
> Dekker's algorithm (circa 1962) can be used to implement a mutex
> without test-and-set, or even masking interrupts, and really only
> requires sequential consistency (and if that's an issue for your
> hardware, it should have appropriate memory barriers to force that).
>
> https://en.wikipedia.org/wiki/Dekker%27s_algorithm
>
> A limitation is that Dekker's only works with two threads/processors.

Interesting. I'm not sure it's limited to two threads per processor rather than just two threads. Even if they are on separate processors it's the same problem and the same limit, no?

The situation I am looking at resolving is a technique where, rather than using a pipeline of length N to allow N-fold improvements in speed (other than when the pipeline has to be flushed), the pipeline is used to get N-fold improvements in clock speed by treating the CPU as N processors. They are time-sliced, if you will. It also relieves a lot of the complexity of detecting problems and contingencies in the pipeline, streamlining it that much more. But...

When someone was proposing this design some time back, another poster was insistent the design required an instruction-level mutex, if I am using the term right. I don't recall the details of what he wrote, but I've always wondered how important this is.

One of the things that can further streamline a processor is to make all instructions single cycle, truly single cycle. So a read-modify-write instruction may be hard to implement depending on the memory used. But in the context of the above multi-processor design, if the RMW instruction takes multiple clock cycles, it would not be truly uninterrupted since the other N-1 processors would all get a clock (and an instruction) between the multiple clocks of the RMW instruction.

I need to bat this around in my head a bit more I think.

> Several related algorithms exist, and generalizations of some of those
> (for example, Peterson's) can handle more than two threads.
>
> And once you have a mutex, you can implement everything else.
>
> OTOH, proper atomic instructions are a huge boon (even if only in
> LL/SC form), so they're provided by everyone who's serious about
> multiprocessor support.

LL/SC??? Low level, s.... c...?

--
Rick C
Reply by ●March 24, 2017
On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote:

> On 3/23/2017 11:43 PM, Robert Wessel wrote:
>> [snip]
>>
>> A limitation is that Dekker's only works with two threads/processors.
>
> Interesting. I'm not sure it's limited to two threads per processor
> rather than just two threads. Even if they are on separate processors
> it's the same problem and the same limit, no?

It's two threads, on one processor or two. Many discussions of Dekker's don't really consider software threads (as we know them today), as it predates common support for that in OSs. Call it two competitors for the lock.

> The situation I am looking at resolving is a technique where rather than
> using a pipeline of length N to allow N-fold improvements in speed
> (other than when the pipeline has to be flushed) it can be used to get
> N-fold improvements in clock speed, but treating the CPU as N
> processors. They are time-sliced if you will. It also relieves a lot
> of the complexity of detecting problems and contingencies in the
> pipeline streamlining it that much more. But...

This is one of the oldest, if not the very oldest, multi-threading techniques for processors, dating back to the PPUs on the CDC-6600 (circa 1965). The PPUs were 10 time-sliced "threads" on the I/O processor.

> When someone was proposing this design some time back another poster was
> insistent the design required an instruction level mutex, if I am using
> the term right. I don't recall the details of what he wrote, but I've
> always wondered how important this is.
>
> One of the things that can further streamline a processor is to make all
> instructions single cycle, truly single cycle. So a read-modify-write
> instruction may be hard to implement depending on the memory used. But
> in the context of the above multi-processor design, if the RMW
> instruction takes multiple clock cycles, it would not be truly
> uninterrupted since the other N-1 processors would all get a clock (and
> an instruction) between the multiple clocks of the RMW instruction.

As soon as you have memory or FP, not all instructions are single cycle any more. But as described below, LL/SC is the standard way of avoiding having a RMW instruction.

It is important to understand how much synchronization you need between the threads. If the individual tasks are very independent, and need little or no synchronization with tasks running on other threads, then you need only weak support for such in the hardware (and your code would probably run well on a GPU).

> I need to bat this around in my head a bit more I think.
>
> LL/SC??? Low level, s.... c...?

Load-linked/store-conditional. That's what most RISCs use to implement atomic operations. Basically the load-linked establishes that you want to atomically update a location (as well as loading the existing value there); the store-conditional attempts to update that value (it's a store), but if something else had hit that location*, the SC will fail, and you have to retry the update. In that sense it's similar to a compare-and-swap, rather than a true atomic update like test-and-set.

The desire to avoid multi-cycle instructions is the prime driver for LL/SC, as it lets the operation be split into individual simple operations. So let's say you wanted to atomically add R2 to a memory location, you'd code something like:

    retry:
        ll  r1,mem
        add r1,r2
        sc  r1,mem
        branch-if-fail retry

The success flag is typically returned in a register. Alpha, for example, returned it in the register operand of the SC (STL_C or STQ_C), so a simple BEQ (to test if the returned value was zero) was all that was needed to trigger the retry.

*FSVO "location" - usually the granularity is on something like a cache line or larger. There are often other restrictions as well - no interrupts or exceptions between the LL and SC, a maximum number of instructions between the pair, no other LLs, loads or stores in general, branches, etc. The exact details varied considerably. Commonly this is described as having a flag set by the LL, and various conditions can cause the flag to be cleared before the SC executes - and if that flag is clear the SC fails.
Reply by ●March 24, 2017
On 3/24/2017 1:37 AM, Robert Wessel wrote:

> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote:
>> [snip]
>
> As soon as you have memory or FP, not all instructions are single
> cycle any more. But as described below, LL/SC is the standard way of
> avoiding having a RMW instruction.

I don't know what you mean when you use abbreviations that aren't clear. Is FP floating point? There is nothing inherent in floating point that makes it multiple cycles in a custom processor design. There certainly is no reason for memory to be multiple cycle. I think you are picturing various implementations where memory is slow compared to the CPU. That's not a given.

> It is important to understand how much synchronization you need
> between the threads. If the individual tasks are very independent,
> and need little or no synchronization with tasks running on other
> threads, then you need only weak support for such in the hardware (and
> your code would probably run well on a GPU).

How dependent the tasks are will vary with the problem being solved. I'm just looking at how best to utilize the facilities available in FPGAs. The guy who was talking about the time slicing was looking into that when he was told about the need for a mutex.

> Load-linked/store-conditional. That's what most RISCs use to
> implement atomic operations. Basically the load-linked establishes
> that you want to atomically update a location (as well as loading the
> existing value there), the store-conditional attempts to update that
> value (it's a store), but if something else had hit that location*,
> the SC will fail, and you have to retry the update. In that sense
> it's similar to a compare-and-swap, rather than a true atomic update
> like test-and-set.

I need to think how that could be implemented. One way would be to have an extra bit per memory location. Another would be to make the memory interface "smart" where it tracks the location you read as part of this function and handles the logic of monitoring other writes to it. That could be pretty simple really, as long as it is one location per processor.

Would an instruction that reads a location while also writing to it do the job? If you are trying to lock a resource by setting a location to a 1, say, you set the 1 but also read it. If it comes back already a 1, you know it was already locked. There doesn't need to be the same difficulty of setting it to a 0 if you hold the lock.

Doing a read and a write at the same time in most FPGAs is not so hard. The block memory in an FPGA has simultaneous read/write functionality. In fact it looks like a large addressable register set requiring a clock edge to do the read. By using the opposite edge of the clock to clock the memory (as the rest of the CPU), the write and read can happen at the same time, with the read result coming out a half clock before it has to be written to its destination. Win-win.

--
Rick C
Reply by ●March 24, 2017
On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote:

> On 3/24/2017 1:37 AM, Robert Wessel wrote:
>> [snip]
>
> I don't know what you mean when you use abbreviations that aren't clear.
> Is FP floating point?

Yes.

> There is nothing inherent in floating point
> that makes it multiple cycles in a custom processor design.

While you can imagine some FP formats that you might reasonably implement in a single cycle, they'd be pretty far out of the mainstream. The width, normalization, and rounding requirements make all but the simplest FP operations implausible to implement in what's usually made the time for a single cycle. OTOH you *could* slow down the rest of your processor so that you can accomplish the implemented FP operations in a single cycle. While trivial, why would you accept a three-fold reduction in clock rate for the integer part of your core just so you could have single-cycle FP instructions?

> There certainly is no reason for memory to be multiple cycle. I think
> you are picturing various implementations where memory is slow compared
> to the CPU. That's not a given.

No, but such a situation is sufficiently uncommon that you'd really want to spec it up front. Most cores these days don't even have single-cycle L1 caches.

> I need to think how that could be implemented. One way would be to have
> an extra bit per memory location. Another would be to make the memory
> interface "smart" where it tracks the location you read as part of this
> function and handles the logic of monitoring other writes to it. That
> could be pretty simple really as long as it is one location per processor.

The granularity can be pretty coarse, commonly a cache line or bigger, but you don't want to add a bit per memory word. With a cache, you'd keep track of any changes of ownership of the cache line. With direct memory, you can just track the memory area (perhaps make it blocks of 64B for simplicity) in each "thread", and then just watch all writes from other threads/time-slices for matching addresses.

> Would an instruction that reads a location while also writing to it do
> the job? If you are trying to lock a resource by setting a location to
> a 1 say, you set the 1 but also read it. If it comes back already a 1
> you know it was already locked. There doesn't need to be the same
> difficulty of setting it to a 0 if you hold the lock.

No, that's not enough in typical multi-processor implementations - if two cores/threads/whatever hit the same location, you can only return 0 to one of them, or both might think they acquired the lock. What you've described *is* a TAS, but not an atomic one. OTOH, since you have only a single core, and you appear to be proposing that your TAS *will* be atomic (since it's going to execute in a single clock, and presumably not interleave with any other TAS in another timeslice), you're good to go.

Of course, there are alternatives. If you need only moderate locking for your one time-sliced CPU, you can implement a single global lock, and then just stall (prevent them from doing anything in their timeslice) any threads trying to acquire that lock if it's not available, or just spin. IOW this is a simplified form of TAS, and you'd use it to implement bigger atomic updates. You could also have several global flags ("lock0..lock15"), so you could have separate, non-conflicting things happening on different locks without contention.

IOW (with a single global flag), you might end up with something like:

    retry_lock:
        try_global_lock_bit    ;acquire global lock, set carry if success
        jnc retry_lock         ;just a spin lock
        ...update shared item(s)...
        release_global_lock

The try_global_lock and release_global_lock would be the two instructions you'd implement.

> Doing a read and a write at the same time in most FPGAs is not so hard.
> The block memory in an FPGA has simultaneous read/write functionality.
> In fact it looks like a large addressable register set requiring a clock
> edge to do the read. By using the opposite edge of the clock to clock
> the memory (as the rest of the CPU) the write and read can happen at the
> same time with the read result coming out a half clock before it has to
> be written to its destination. win-win.

Even on FPGAs, where the hard memory blocks are absurdly fast in comparison to logic (as compared to most of the real world), you're likely to see memory access take multiple clocks unless your CPU logic is really slow.