
Speaking of Multiprocessing...

Started by rickman March 23, 2017
On 24/03/17 17:09, Tom Gardner wrote:
> On 24/03/17 10:19, David Brown wrote: >> On 24/03/17 10:28, Tom Gardner wrote: >>> On 24/03/17 08:17, David Brown wrote: >>>> But if you can use dedicated hardware, there are many other methods. >>>> The XMOS devices have hardware support for pipelines and message >>>> passing. On a dual-core PPC device I used, there is a hardware >>>> block of >>>> semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that >>>> you can only access as a 32-bit read or write. You can write to it if >>>> the current ID is 0, or if the ID you are writing matches that of the >>>> semaphore. There is plenty of scope for variation based on that theme. >>> >>> I received my first XMOS board from Digi-Key a couple of days >>> ago, and I'm looking forward to using it for some simple >>> experiments. I /feel/ that many low-level things will be >>> much simpler and with fewer potential nasties lurking in >>> the undergrowth. (I felt the same with the Transputer, for >>> obvious reasons, but never had a suitable problem at that >>> time) >>> >>> With your experience, did you find any undocumented gotchas >>> and any pleasant or unpleasant surprises? >>> >> >> Before saying anything else, I would first note that my work with XMOS >> systems was about four years ago, when they first started getting >> popular. I believe many things that bugged me most have been improved >> since then, both in the hardware and software, but some may remain. >> >> I think the devices themselves are a really neat idea. You have very >> fast execution, very efficient hardware multi-threading, very >> predictable timings, and a variety of inter-thread and inter-process >> communication methods. >> >> Their "XC" programming language was also a neat idea, based on C with >> additional primitives to support the hardware features and >> multi-threading stuff, and an attempt to make some aspects of C safer >> (real arrays, control of when you can access variables, etc.). > > Yes, those are precisely the aspects that interest me. I'm > particularly interested in easy-to-implement hard realtime > systems. > > As far as I am concerned, caches and interrupts make it > difficult to guarantee hard realtime performance, and C's > explicit avoidance of multiprocessor biasses C away from > "easy-to-implement".
The XMOS devices have a lot more control over timing than you get on a normal processor (especially a superscalar one with caches). The development tools include a lot of timing analysis stuff too.
> > Yes, I know about libraries and modern compilers that > may or may not compile your code in the way you expect! > > I'd far rather build on a solid foundation than have to > employ (language) lawyers to sort out the mess ;) >
XC is still based on C !
> > Besides, I want to use Occam++ :)
It is a /long/ time since I used Occam (except for shaving away unlikely explanations). It was fun, as long as you were careful with the spacing in your source code. XMOS programming is more inspired by Hoare's CSP than Occam. They should probably implement Go for the devices.
> > >> However, IMHO the whole thing suffered from a number of serious flaws >> that limit the possibilities for the chips. Sure, they would work well >> in some circumstances - but I was left with the feeling that "if only >> they had done /this/, the devices would be so much better and could be >> used for so many more purposes". It is a little unfair to concentrate >> on the shortcomings rather than the innovations and features, but that >> is how I felt when using them. And again, I know that at least some >> issues here have been greatly improved since I last used them. >> >> >> A obvious flaw with the chips is lack of memory. The basic device with >> one cpu and 8 threads had 64K ram that was for program memory and >> run-time data. There was no flash - you had to use an external SPI >> flash which used valuable pins (messing up the use of blocks of 8, 16 or >> 32 pins), and used up a thread if you wanted to be able to access the >> flash at run-time. And while you could implement an Ethernet MAC or a >> 480 Mbps USB 2.0 interface on the chip, there was nowhere near enough >> ram for buffering or to do anything useful with the interface. Adding >> external memory was ridiculously expensive in terms of pins, threads, >> and run-time inefficiency. > > Yes, those did strike me as limitations to the extent I'm > skeptical about networking connectivity. But maybe an XMOS > device plus an ESP8266 would be worth considering for some > purposes.
There are XMOS devices with Ethernet hardware, I believe - that seems the best idea.
> > >> The hardware threading is great, and provides a really easy model for >> all sorts of things. To make a UART transmitter, you have a thread that >> waits for data coming in on a pipe. To transmit a bit, you set a pin, >> wait for a bit time (using hardware timers), then move on to the next >> bit. The code is simple and elegant. A UART receiver is not much >> harder. There is lots of example code in this style. >> >> Then you realise that to implement a UART, you have used a quarter of >> the chip's resources. Your elegant flashing light is another thread, as >> is your PWM output. Suddenly you find you are using a 500 MIPS chip to >> do the work of a $0.50 microcontroller, and you only have a thread or >> two left for the actual application. > > Yes. However I don't care about wasting some resources if > it makes the design easier/faster, so long as it isn't too > expensive in terms of power and money. > > >> And you end up trying to run FreeRTOS on one of your threads, or make >> your own scheduler to multiplex several PWM channels in one thread. >> Much of the elegance quickly disappears for real-world applications. > > Needing to run a separate RTOS would be a code smell. > That's where the CSP+multicore approach /ought/ to be > sufficient. For practicality, I exclude peripheral > libraries and networking code from that statement. >
The chip has 8 hardware threads (per core). If you want more than 8 threads, you either have a bigger and more expensive chip, or you have to do something in software.
> > >> Then there is the software. The XC language lets you write code that >> starts tasks in parallel, automatically allocates channels for >> communication, lets you declare timers and wait on them. That's all >> great in theory - but it quickly gets confusing when you try to figure >> out the details of when you can pass these around, when they get >> allocated and deallocated, or when you can have a thread create new >> threads. XC carefully tracks threads and data accesses, spotting and >> blocking all sorts of possible race conditions. If a variable is >> written by one thread, then it can't be accessed from another. You can >> work with arrays safely, but you can't take addresses. Data gets passed >> between threads using communication channels that are safe from race >> conditions and nicely synchronised. > > That's the kind of thing I'm interested in exploring. > > >> And then you realise that to actually make the thing work, you would >> need far more channels than there are on the device, and they would need >> to be far faster - all you really wanted was for two threads to share a >> circular buffer, and you know in your application code when it is safe >> to use it. But you can't do that in XC - the language and the tools >> won't let you. So you have to write that code in C, with calls back and >> forth with the XC code that handles the multi-threading stuff. > > That's the kind of thing I'm interested in exploring. > > >> And then you realise that from within the C, you need to access some >> hardware resources like timers, that can't be expressed properly in C, >> and you can't get back to the XC code at the time. So you end up with >> inline assembly. > > At which point many advantages would have been lost. >
Yes. Hopefully, newer versions of the tools make this sort of hack unnecessary.
> >> Then there are the libraries and examples. These were written in such a >> wide variety of styles that it was impossible to figure out what was >> going on. A typical example project would involve a USB interface and, >> for example, SPDIF channels for an USB audio interface. The >> Eclipse-based IDE was fine, but the example did not come as a project - >> it came as a collection of interdependent projects. Some bits referred >> to files in different projects. Some bits merely required other >> projects to be compiled. Some bits of the code in one project would use >> assembly for hardware resources, others would use XC, others would use C >> intrinsic functions, and others would use a sort of XML file that >> defines the setup for your chip resources. If you change values in one >> file in one project (say, the USB vendor ID), you have to figure out >> which sub-projects need to be manually forced to re-build in order for >> it to take effect consistently throughout the project. Some parts use a >> fairly obvious configuration file - a header with defines that let you >> control things like IDs, number of channels, pins, etc. Except they >> don't - only /some/ of the sub-projects read and use the configuration >> file, other parts are hard-coded or use values from elsewhere. It was a >> complete mess. > > Irritating, but not fundamental, and as you point out, they could > be fixed with application of time and money. > > >> Now, I know that newer XMOS devices have more resources, built-in flash, >> proper hardware peripherals for the devices that are most demanding or >> popular, and so on. And I can only hope that the language and tools >> have been improved to the point where inline assembly is not required, >> and that the examples and libraries have matured to the point that the >> libraries are usable as-is, and the examples show practical ways to >> develop code. >> >> I really hope XMOS does well here - it is so good to see a company that >> thinks in a very different way and brings in these new ideas. So if >> your experience with modern XMOS devices and tools is good, I would love >> to hear about it. > > I think we are largely in violent agreement. > > I suspect my "hello world" program will be for their �10 > StartKIT to echo anything it receives on the USB line, > with a second task flipping uppercase to lowercase. That > should give me a feel for a low-end resource usage. > > Then I'd like to make a reciprocal frequency counter to > see how far I can push individual threads and their > SERDES-like IO primitives. > > And I'll probably do some bitbashing to create analogue > outputs that draw pretty XY pictures on an analogue scope. >
You'll have fun, anyway - unless you get issues with dependency messes from sub-projects under Eclipse. If that happens, I recommend re-arranging things so that you have one project with multiple directories - it will save you a lot of hair-pulling.
On 3/24/2017 3:07 AM, Robert Wessel wrote:
> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: > >> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote: >>> >>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> wrote: >>>>> >>>>>> I recall a discussion about the design of an instruction set >>>>>> architecture where someone was saying an instruction was required to >>>>>> test and set a bit or word as an atomic operation if it was desired to >>>>>> support multiple processors. Is this really true? Is this a function >>>>>> that can't be emulated with other operations including the disabling of >>>>>> interrupts? >>>>> >>>>> >>>>> Dekker's algorithm (circa 1962) can be used to implement a mutex >>>>> without test-and-set, or even masking interrupts, and really only >>>>> requires sequential consistency (and if that's a issue for your >>>>> hardware, it should have appropriate memory barriers to force that). >>>>> >>>>> https://en.wikipedia.org/wiki/Dekker%27s_algorithm >>>>> >>>>> A limitation is that Dekker's only works with two threads/processors. >>>> >>>> Interesting. I'm not sure it's limited to two threads per processor >>>> rather than just two threads. Even if they are on separate processors >>>> it's the same problem and the same limit, no? >>> >>> >>> It's two threads, on one processor or two. Many discussion of >>> Dekker's don't really consider software threads (as we know them >>> today), as it predates common support for that in OSs. Call it two >>> competitors for the lock. >>> >>> >>>> The situation I am looking at resolving is a technique where rather than >>>> using a pipeline of length N to allow N-fold improvements in speed >>>> (other than when the pipeline has to be flushed) it can be used to get >>>> N-fold improvements in clock speed, but treating the CPU as N >>>> processors. They are time-sliced if you will. It also relieves a lot >>>> of the complexity of detecting problems and contingencies in the >>>> pipeline streamlining it that much more. But... >>> >>> >>> This is one of the oldest, if not the very oldest, multi-threading >>> technique for processors, dating back to the PPUs on the CDC-6600 >>> (circa 1965). The PPUs were 10 time-sliced "threads" on the I/O >>> processor. >>> >>> >>>> When someone was proposing this design some time back another poster was >>>> insistent the design required an instruction level mutex, if I am using >>>> the term right. I don't recall the details of what he wrote, but I've >>>> always wondered how important this is. >>>> >>>> One of the things that can further streamline a processor is to make all >>>> instructions single cycle, truly single cycle. So a read-modify-write >>>> instruction may be hard to implement depending on the memory used. But >>>> in the context of the above multi-processor design, if the RMW >>>> instruction takes multiple clock cycles, it would not be truly >>>> uninterrupted since the other N-1 processors would all get a clock (and >>>> an instruction) between the multiple clocks of the RWM instruction. >>> >>> >>> As soon as you have memory or FP, not all instructions are single >>> cycle any more. But as described below, LL/SC is the standard way of >>> avoiding having a RMW instruction. >> >> I don't know what you mean when you use abbreviations that aren't clear. >> Is FP floating point? > > > Yes. 
> > >> There is nothing inherent in floating point >> that makes it multiple cycles in a custom processor design. > > > While you can imagine some FP formats that you might reasonably > implement in a single cycle, they'd be pretty far out of the > mainstream. The width, normalization, and rounding requirements make > all but the simplest FP operations implausible to implement in what's > usually made the time for a single cycle. OTOH you *could* slow down > the rest of your processor so that you can accomplish the implemented > FP operations in a single cycle. While trivial, why would you accept > a three fold reduction in clock rate for the integer part of your core > just so you could have single cycle FP instructions?
There is nothing implausible about single cycle floating point. It is all a matter of context and application. Not every design has to scream with the highest possible clock rate at the expense of pipeline complications. Don't impose an architecture until you know the requirements.
>> There certainly is no reason for memory to be multiple cycle. I think >> you are picturing various implementations where memory is slow compared >> to the CPU. That's not a given. > > > No, but such a situation is sufficiently uncommon that you'd really > want to spec it up front. Most cores these days don't even have > single cycle L1 caches.
Again, assuming an architecture and requirements. This conversation is not about designing the next generation of ARM CPU.
>>> It is important to understand how much synchronization you need >>> between the threads. If the individual tasks are very independent, >>> and need little or no synchronization with tasks running on other >>> threads, then you need only weak support for such in the hardware (and >>> your code would probably run well on a GPU). >> >> How dependent the tasks are will vary with the problem being solved. >> I'm just looking at how best to utilize the facility available in FPGAs. >> The guy who was talking about the time slicing was looking into that >> when he was told about the need for a mutex. >> >> >>>> I need to bat this around in my head a bit more I think. >>>> >>>> >>>>> Several related algorithms exist, and generalizations of some of those >>>>> (for example, Peterson's) can handle more than two threads. >>>>> >>>>> And once you have a mutex, you can implement everything else. >>>>> >>>>> OTOH, proper atomic instructions are a huge boon (even if only in >>>>> LL/SC form), so they're provided by everyone who's serious about >>>>> multiprocessor support. >>>> >>>> LL/SC??? Low level, s.... c...? >>> >>> >>> Load-linked/store-conditional. That's what most RISCs use to >>> implement atomic operations. Basically the load-linked establishes >>> that you want to atomically update a location (as well as loading the >>> existing value there), the store-conditional attempts to update that >>> value (it's a store), but if something else had hit that location*, >>> the SC will fail, and you have to retry the update. In that sense >>> it's similar to a compare-and-swap, rather than a true atomic update >>> like test-and-set. >> >> I need to think how that could be implemented. One way would be to have >> an extra bit per memory location. Another would be to make the memory >> interface "smart" where it tracks the location you read as part of this >> function and handles the logic of monitoring other writes to it. That >> could be pretty simple really as long as it is one location per processor. > > > The granularity can be pretty coarse. Commonly a cache-line or > bigger. But you don't want to add a bit per memory word. With a > cache, you'd keep track of any changes of ownership of the cache line. > With direct memory, you can just track the memory area (perhaps make > it blocks of 64B for simplicity) in each "thread", and then just watch > all other writes from other threads/time-slices for matching > addresses.
Why don't I want to add a bit per memory location? If that bit happens to be free, I'd be silly to let it go to waste.
>> Would an instruction that reads a location while also writing to it do >> the job? If you are trying to lock a resource by setting a location to >> a 1 say, you set the 1 but also read it. If it comes back already a 1 >> you know it was already locked. There doesn't need to be the same >> difficulty of setting it to a 0 if you hold the lock. > > > No, that's not enough in typical multi-processor implementations - if > two cores/threads/whatever hit the same location, you can only return > 0 to one of them, or both might think they acquires the lock. What > you've described *is* a TAS, but not an atomic one.
You are ignoring that the instruction also writes a 1 to the location. So two processors *can't* both read a zero. Why is it not atomic? The write and read happen on the same clock cycle. Much like Dekker's algorithm it sets the flag to exclude other processors as a priority. Then it can at leisure check if the flag was already set or if it "captured" the flag. Again, an abbreviation I don't know. What's a TAS?
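(Dekker's algorithm, for reference - a minimal C sketch for two competitors, assuming sequentially consistent memory or explicit barriers, as noted earlier in the thread:)

    /* Dekker's algorithm for competitors 0 and 1.  Assumes sequential
       consistency; real hardware would need memory barriers. */
    #include <stdbool.h>

    static volatile bool wants[2] = { false, false };
    static volatile int  turn = 0;

    void dekker_lock(int me)
    {
        int other = 1 - me;
        wants[me] = true;
        while (wants[other]) {
            if (turn != me) {
                wants[me] = false;      /* back off while it isn't our turn */
                while (turn != me)
                    ;                   /* wait until the other side hands over */
                wants[me] = true;       /* then try again */
            }
        }
    }

    void dekker_unlock(int me)
    {
        turn = 1 - me;                  /* give the other side the turn */
        wants[me] = false;
    }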
> OTOH, since you have only a single core, and you appear to be > proposing that your TAS *will* be atomic (since it's going to execute > in a single clock, and presumably not interleave with any other TAS in > another timeslice), you're good to go.
Not sure what you are assuming here. There is one "core", but it is time shared N ways giving N functional processors. Each one operates on a different clock cycle however, so any single instruction which takes one clock cycle is truly atomic and the other processors can't interfere.
> Of course, there are alternatives. If you need only moderate locking > for your one time-sliced CPU, you can implement a single global lock, > and then just stall (prevent them from doing anything in their > timeslice) any threads trying to acquire that lock if it's not > available, or just spin. IOW this is a simplified form of TAS, and > you'd us it to implement bigger atomic updates. You could also have > several global flags ("lock0..lock15"), so you could have separate, > non-conflicting, things happening on different locks without > contention.
Stalling the other processors is a bad idea. The primary apps for my designs are hard real-time and if one processor is dealing with something that needs 100% of its attention at that moment and it is stalled because another processor wants to lock some resource... not good. Limiting the stall to processors that are accessing the same location in memory would be very messy in terms of logic. The multiple channel approach was suggested by the previous designer and was shot down by the antagonist. I don't recall any of the detail though. If there is a resource that needs to be shared, it requires a mutex. I'm not sure how to get around that.
> IOW, (with a single global flag), you might end up with something > like: > > retry_lock: > try_global_lock_bit ;acquire global lock, set carry if success > jnc retry_lock ;just a spin lock > ...update shared item(s)... > release_global_lock > > The try_global_lock and release_global_lock would be the two > instructions you'd implement.
Isn't try_global_lock what my read/write (essentially a swap) instruction does? A one is written to the memory location no matter what. The previous value is returned. The previous value tells you if you acquired it or not (return of one means it was already locked, return of zero means you locked it). To perform the release you just write a zero, but only if you have the lock.
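In C11 terms the proposed swap behaves like atomic_exchange, so the lock built on it is only a few lines. A sketch (the names and the 0/1 convention are just what is described above, not an existing API):

    #include <stdatomic.h>

    static atomic_int lock = 0;        /* 0 = free, 1 = held */

    void acquire(void)
    {
        /* One swap per attempt: write 1 unconditionally, get the old
           value back.  An old value of 0 means we took the lock. */
        while (atomic_exchange(&lock, 1) != 0)
            ;                          /* spin until we saw a 0 */
    }

    void release(void)
    {
        atomic_store(&lock, 0);        /* only the lock holder writes the 0 */
    }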
>> Doing a read and a write at the same time in most FPGAs is not so hard. >> The block memory in an FPGA has simultaneous read/write functionality. >> In fact it looks like a large addressable register set requiring a clock >> edge to do the read. By using the opposite edge of the clock to clock >> the memory (as the rest of the CPU) the write and read can happen at the >> same time with the read result coming out a half clock before it has to >> be written to its destination. win-win. > > > Even on FPGAs, where the hard memory blocks are absurdly fast in > comparison to logic (as compared to most of the real world), you're > likely to see memory access to be multiple clocks unless your CPU > logic is really slow.
Not sure what you are referring to here. FPGA memory is *very* fast, almost as fast as registers. Likely you are thinking of external memory, in particular DRAM. -- Rick C
On 24/03/17 16:33, David Brown wrote:
> On 24/03/17 17:09, Tom Gardner wrote:
>> As far as I am concerned, caches and interrupts make it >> difficult to guarantee hard realtime performance, and C's >> explicit avoidance of multiprocessor biasses C away from >> "easy-to-implement". > > The XMOS devices have a lot more control over timing than you get on a > normal processor (especially a superscaler one with caches). The > development tools include a lot of timing analysis stuff too.
Exactly.
>> Yes, I know about libraries and modern compilers that >> may or may not compile your code in the way you expect! >> >> I'd far rather build on a solid foundation than have to >> employ (language) lawyers to sort out the mess ;) >> > > XC is still based on C !
Yes, but hopefully with little need to use the important areas that keep language lawyers employed. If the "awkward bits" still /have/ to be used, then Opportunities Will Have Been Missed.
>> Besides, I want to use Occam++ :) > > It is a /long/ time since I used Occam (except for shaving away unlikely > explanations). It was fun, as long as you were careful with the spacing > in your source code.
Indeed. I've toyed with Python and don't dislike its ethos, but semantically significant spaces make me shudder. For a start, how does an editor's copy-and-paste work reliably? Then consider refactoring browsers. I still remember makefiles, where tabs were invisibly different to spaces.
> XMOS programming is more inspired by Hoare's CSP than Occam. They > should probably implement Go for the devices.
Oh, <waves arms>. I've always regarded Occam (and now XC) as an instantiation of CSP in a language rather than a library. To that extent I haven't needed to distinguish.
>> Yes, those did strike me as limitations to the extent I'm >> skeptical about networking connectivity. But maybe an XMOS >> device plus an ESP8266 would be worth considering for some >> purposes. > > There are XMOS devices with Ethernet hardware, I believe - that seems > the best idea.
The hardware seems the least of the problem; the software stack is more of an issue.
>>> The hardware threading is great, and provides a really easy model for >>> all sorts of things. To make a UART transmitter, you have a thread that >>> waits for data coming in on a pipe. To transmit a bit, you set a pin, >>> wait for a bit time (using hardware timers), then move on to the next >>> bit. The code is simple and elegant. A UART receiver is not much >>> harder. There is lots of example code in this style. >>> >>> Then you realise that to implement a UART, you have used a quarter of >>> the chip's resources. Your elegant flashing light is another thread, as >>> is your PWM output. Suddenly you find you are using a 500 MIPS chip to >>> do the work of a $0.50 microcontroller, and you only have a thread or >>> two left for the actual application. >> >> Yes. However I don't care about wasting some resources if >> it makes the design easier/faster, so long as it isn't too >> expensive in terms of power and money. >> >> >>> And you end up trying to run FreeRTOS on one of your threads, or make >>> your own scheduler to multiplex several PWM channels in one thread. >>> Much of the elegance quickly disappears for real-world applications. >> >> Needing to run a separate RTOS would be a code smell. >> That's where the CSP+multicore approach /ought/ to be >> sufficient. For practicality, I exclude peripheral >> libraries and networking code from that statement. >> > > The chip has 8 hardware threads (per core). If you want more than 8 > threads, you either have a bigger and more expensive chip, or you have > to do something in software.
Yes. But I haven't thought through the ramifications of that, yet. In particular, I haven't developed my set of design patterns.
> You'll have fun, anyway - unless you get issues with dependency messes > from sub-projects under Eclipse. If that happens, I recommend > re-arranging things so that you have one project with multiple > directories - it will save you a lot of hair-pulling.
There's too much unreasoned prejudice against cut-and-paste, and too much in favour of common code bases. Sometimes cut-and-paste is optimum: e.g. when developing several PCBs over several years, the best way of ensuring you can regenerate a design is to copy all library components into project-specific libraries. I don't mind code duplication if it means that I can find what needs to be modified, and then modify it in the knowledge that I haven't affected other parts of a system. I've seen companies get that dreadfully wrong, with the result that mods took 6 months to get to a customer.
On 24/03/17 18:36, rickman wrote:
> On 3/24/2017 3:07 AM, Robert Wessel wrote: >> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: >> >>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote: >>>> >>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> wrote:
<snip>
> >>> There certainly is no reason for memory to be multiple cycle. I think >>> you are picturing various implementations where memory is slow compared >>> to the CPU. That's not a given. >> >> >> No, but such a situation is sufficiently uncommon that you'd really >> want to spec it up front. Most cores these days don't even have >> single cycle L1 caches. > > Again, assuming an architecture and requirements. This conversation is > not about designing the next generation of ARM CPU.
It would be making architecture assumptions to require single-cycle memory - saying that memory may be multiple cycle is the general case. (You might /want/ to make architecture assumptions here - I gave some other suggestions of specific hardware for locking in another post. The solutions discussed here are for general memory.)
> > >>>> It is important to understand how much synchronization you need >>>> between the threads. If the individual tasks are very independent, >>>> and need little or no synchronization with tasks running on other >>>> threads, then you need only weak support for such in the hardware (and >>>> your code would probably run well on a GPU). >>> >>> How dependent the tasks are will vary with the problem being solved. >>> I'm just looking at how best to utilize the facility available in FPGAs. >>> The guy who was talking about the time slicing was looking into that >>> when he was told about the need for a mutex. >>> >>> >>>>> I need to bat this around in my head a bit more I think. >>>>> >>>>> >>>>>> Several related algorithms exist, and generalizations of some of >>>>>> those >>>>>> (for example, Peterson's) can handle more than two threads. >>>>>> >>>>>> And once you have a mutex, you can implement everything else. >>>>>> >>>>>> OTOH, proper atomic instructions are a huge boon (even if only in >>>>>> LL/SC form), so they're provided by everyone who's serious about >>>>>> multiprocessor support. >>>>> >>>>> LL/SC??? Low level, s.... c...? >>>> >>>> >>>> Load-linked/store-conditional. That's what most RISCs use to >>>> implement atomic operations. Basically the load-linked establishes >>>> that you want to atomically update a location (as well as loading the >>>> existing value there), the store-conditional attempts to update that >>>> value (it's a store), but if something else had hit that location*, >>>> the SC will fail, and you have to retry the update. In that sense >>>> it's similar to a compare-and-swap, rather than a true atomic update >>>> like test-and-set. >>> >>> I need to think how that could be implemented. One way would be to have >>> an extra bit per memory location. Another would be to make the memory >>> interface "smart" where it tracks the location you read as part of this >>> function and handles the logic of monitoring other writes to it. That >>> could be pretty simple really as long as it is one location per >>> processor. >> >> >> The granularity can be pretty coarse. Commonly a cache-line or >> bigger. But you don't want to add a bit per memory word. With a >> cache, you'd keep track of any changes of ownership of the cache line. >> With direct memory, you can just track the memory area (perhaps make >> it blocks of 64B for simplicity) in each "thread", and then just watch >> all other writes from other threads/time-slices for matching >> addresses. > > Why don't I want to add a bit per memory location? If that bit happens > to be free, I'd be silly to let it go to waste. >
Normally, one does not have a bit free per memory location. LL/SC solutions usually have only a single tracker per CPU, or even just a single tracker per shared memory bus. The granularity might be for a machine word, or for a cache line. The expectation is that these sequences will be so short that it is very rare for a conflict to occur, so it is fine to have just one tracker. You set the tracker when reading from the memory to get the old value, then use a conditional store that will only do the write if nothing else has touched that same location since your load. It gives you a "test and set" type operation, but split into RISC-friendly separate load and store operations.
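A typical use looks like this, with hypothetical load_linked()/store_conditional() intrinsics standing in for whatever the instruction set actually provides (the names are placeholders, not a real toolchain API):

    /* Placeholders for the real LL/SC intrinsics or inline asm. */
    extern unsigned load_linked(volatile unsigned *addr);
    extern int store_conditional(volatile unsigned *addr, unsigned value);

    /* Test-and-set built from LL/SC: load_linked() reads the word and
       sets the reservation; store_conditional() writes only if the
       reservation is still intact, returning non-zero on success. */
    static inline int ll_sc_tas(volatile unsigned *lock)
    {
        unsigned old;
        do {
            old = load_linked(lock);
            if (old != 0)
                return 1;                          /* already locked */
        } while (!store_conditional(lock, 1));     /* retry if reservation lost */
        return 0;                                  /* we acquired it */
    }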
> >>> Would an instruction that reads a location while also writing to it do >>> the job? If you are trying to lock a resource by setting a location to >>> a 1 say, you set the 1 but also read it. If it comes back already a 1 >>> you know it was already locked. There doesn't need to be the same >>> difficulty of setting it to a 0 if you hold the lock. >> >> >> No, that's not enough in typical multi-processor implementations - if >> two cores/threads/whatever hit the same location, you can only return >> 0 to one of them, or both might think they acquires the lock. What >> you've described *is* a TAS, but not an atomic one. > > You are ignoring that the instruction also writes a 1 to the location. > So two processors *can't* both read a zero. Why is it not atomic? The > write and read happen on the same clock cycle. Much like Dekker's > algorithm it sets the flag to exclude other processors as a priority. > Then it can at leisure check if the flag was already set or if it > "captured" the flag.
Normally, you cannot read and write a memory location at the same time. It would only be possible with specialised hardware, or with a bus locking mechanism (which can be the basis of your synchronisation primitive). In a normal memory bus, you have a read operation then a write operation. If you don't have bus locking of some sort, then cpu 1 could read 0, then cpu 2 could read 0, then cpu 1 would write 1, then cpu 2 would write 1.
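Written out, the non-atomic version of the idea is just this, and the comment marks where the other CPU can slip in:

    /* A plain read-then-write "test and set" - NOT atomic.  On a normal
       bus the two accesses are separate, so cpu1 and cpu2 can both read
       0 before either write lands, and both think they own the lock. */
    int broken_tas(volatile int *lock)
    {
        int old = *lock;       /* bus read                               */
        *lock = 1;             /* <-- another CPU's read can land here   */
        return old;
    }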
> > Again, an abbreviation I don't know. What's a TAS? >
"test and set". It reads the old value in a memory location, sets the memory to 1, and returns the old value - all as an atomic operation. Atomicity is key here - and normal memories are not atomic. (If you are thinking of using an FPGA here, it is often possible to implement a memory block where you /do/ read the old value on the same clock cycle as writing the new one, making a TAS operation simple to implement.)
> >> OTOH, since you have only a single core, and you appear to be >> proposing that your TAS *will* be atomic (since it's going to execute >> in a single clock, and presumably not interleave with any other TAS in >> another timeslice), you're good to go. > > Not sure what you are assuming here. There is one "core", but it is > time shared N ways giving N functional processors. Each one operates on > a different clock cycle however, so any single instruction which takes > one clock cycle is truly atomic and the other processors can't interfere. >
If this is going to be efficient, it is unlikely that the core supports reading and writing within the same instruction - LL/SC is usually a better choice.
On 3/24/2017 2:06 PM, David Brown wrote:
> On 24/03/17 18:36, rickman wrote: >> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: >>> >>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote: >>>>> >>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> wrote: > > <snip> > >> >>>> There certainly is no reason for memory to be multiple cycle. I think >>>> you are picturing various implementations where memory is slow compared >>>> to the CPU. That's not a given. >>> >>> >>> No, but such a situation is sufficiently uncommon that you'd really >>> want to spec it up front. Most cores these days don't even have >>> single cycle L1 caches. >> >> Again, assuming an architecture and requirements. This conversation is >> not about designing the next generation of ARM CPU. > > It would be making architecture assumptions to require single-cycle > memory - saying that memory may be multiple cycle is the general case. > > (You might /want/ to make architecture assumptions here - I gave some > other suggestions of specific hardware for locking in another post. The > solutions discussed here are for general memory.)
We are having two different conversations here. I am designing a CPU, you are talking about the theory of CPU design.
>>>>> It is important to understand how much synchronization you need >>>>> between the threads. If the individual tasks are very independent, >>>>> and need little or no synchronization with tasks running on other >>>>> threads, then you need only weak support for such in the hardware (and >>>>> your code would probably run well on a GPU). >>>> >>>> How dependent the tasks are will vary with the problem being solved. >>>> I'm just looking at how best to utilize the facility available in FPGAs. >>>> The guy who was talking about the time slicing was looking into that >>>> when he was told about the need for a mutex. >>>> >>>> >>>>>> I need to bat this around in my head a bit more I think. >>>>>> >>>>>> >>>>>>> Several related algorithms exist, and generalizations of some of >>>>>>> those >>>>>>> (for example, Peterson's) can handle more than two threads. >>>>>>> >>>>>>> And once you have a mutex, you can implement everything else. >>>>>>> >>>>>>> OTOH, proper atomic instructions are a huge boon (even if only in >>>>>>> LL/SC form), so they're provided by everyone who's serious about >>>>>>> multiprocessor support. >>>>>> >>>>>> LL/SC??? Low level, s.... c...? >>>>> >>>>> >>>>> Load-linked/store-conditional. That's what most RISCs use to >>>>> implement atomic operations. Basically the load-linked establishes >>>>> that you want to atomically update a location (as well as loading the >>>>> existing value there), the store-conditional attempts to update that >>>>> value (it's a store), but if something else had hit that location*, >>>>> the SC will fail, and you have to retry the update. In that sense >>>>> it's similar to a compare-and-swap, rather than a true atomic update >>>>> like test-and-set. >>>> >>>> I need to think how that could be implemented. One way would be to have >>>> an extra bit per memory location. Another would be to make the memory >>>> interface "smart" where it tracks the location you read as part of this >>>> function and handles the logic of monitoring other writes to it. That >>>> could be pretty simple really as long as it is one location per >>>> processor. >>> >>> >>> The granularity can be pretty coarse. Commonly a cache-line or >>> bigger. But you don't want to add a bit per memory word. With a >>> cache, you'd keep track of any changes of ownership of the cache line. >>> With direct memory, you can just track the memory area (perhaps make >>> it blocks of 64B for simplicity) in each "thread", and then just watch >>> all other writes from other threads/time-slices for matching >>> addresses. >> >> Why don't I want to add a bit per memory location? If that bit happens >> to be free, I'd be silly to let it go to waste. >> > > Normally, one does not have a bit free per memory location. > > LL/SC solutions usually have only a single tracker tracker per cpu, or > even just a single tracker per shared memory bus. The granularity might > be for a machine word, or for a cache line. The expectation is that > these sequences will be so short that it is very rare for a conflict to > occur, so it is fine to have just one tracker. You set the tracker when > reading from the memory to get the old value, then use a conditional > store that will only do the write if nothing else has touched that same > location since your load. It gives you a "test and set" type operation, > but split into RISC-friendly separate load and store operations. > >> >>>> Would an instruction that reads a location while also writing to it do >>>> the job? 
If you are trying to lock a resource by setting a location to >>>> a 1 say, you set the 1 but also read it. If it comes back already a 1 >>>> you know it was already locked. There doesn't need to be the same >>>> difficulty of setting it to a 0 if you hold the lock. >>> >>> >>> No, that's not enough in typical multi-processor implementations - if >>> two cores/threads/whatever hit the same location, you can only return >>> 0 to one of them, or both might think they acquires the lock. What >>> you've described *is* a TAS, but not an atomic one. >> >> You are ignoring that the instruction also writes a 1 to the location. >> So two processors *can't* both read a zero. Why is it not atomic? The >> write and read happen on the same clock cycle. Much like Dekker's >> algorithm it sets the flag to exclude other processors as a priority. >> Then it can at leisure check if the flag was already set or if it >> "captured" the flag. > > Normally, you cannot read and write a memory location at the same time. > It would only be possible with specialised hardware, or with a bus > locking mechanism (which can be the basis of your synchronisation > primitive). In a normal memory bus, you have a read operation then a > write operation. If you don't have bus locking of some sort, then cpu 1 > could read 0, then cpu 2 could read 0, then cpu 1 would write 1, then > cpu 2 would write 1. > >> >> Again, an abbreviation I don't know. What's a TAS? >> > > "test and set". It reads the old value in a memory location, sets the > memory to 1, and returns the old value - all as an atomic operation. > Atomicity is key here - and normal memories are not atomic. > > (If you are thinking of using an FPGA here, it is often possible to > implement a memory block where you /do/ read the old value on the same > clock cycle as writing the new one, making a TAS operation simple to > implement.) > >> >>> OTOH, since you have only a single core, and you appear to be >>> proposing that your TAS *will* be atomic (since it's going to execute >>> in a single clock, and presumably not interleave with any other TAS in >>> another timeslice), you're good to go. >> >> Not sure what you are assuming here. There is one "core", but it is >> time shared N ways giving N functional processors. Each one operates on >> a different clock cycle however, so any single instruction which takes >> one clock cycle is truly atomic and the other processors can't interfere. >> > > If this is going to be efficient, it is unlikely that the core supports > reading and writing within the same instruction - LL/SC is usually a > better choice.
As I have said before, I am designing a CPU in an FPGA. I have FPGA features available to me that are different than what is available to chip designers. I do not have the same constraints as chip designers. I was asking what is required in an instruction set to support multi-processors and I think the answer is that the memory swap instruction would do the job efficiently. -- Rick C
On 24.3.2017 г. 19:36, rickman wrote:
>... > > Why don't I want to add a bit per memory location? If that bit happens > to be free, I'd be silly to let it go to waste.
I am not sure a bit per reservation unit will work in a multiprocessor environment (it certainly will on your single core design, obviously). Both the reserved address (range, typically) and the reserving processor will need to know who reserved what. But this is just a "gut feeling", I did not think seriously about it.
> > Again, an abbreviation I don't know. What's a TAS?
Test And Set, a classic 68k RMW opcode. Read a byte (just byte size allowed), adjust the Z flag (Z IIRC...) to the state of bit 7 (MSB), then write the byte back with bit 7 set. While doing this the processor provides some means (holds AS asserted on the 68k IIRC) such that the access to the address cannot be granted to another processor for the entire duration of the TAS opcode.

Looks like in your case a simple bset (again 68k lingo, bit test and set, but on two separate memory accesses, potentially interruptible by another bus master - not by a processor interrupt, it is a single instruction) will do what you need.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
On 3/24/2017 6:42 PM, Dimiter_Popoff wrote:
> On 24.3.2017 &#1075;. 19:36, rickman wrote: >> ... >> >> Why don't I want to add a bit per memory location? If that bit happens >> to be free, I'd be silly to let it go to waste. > > I am not sure a bit per reservation unit will work in a multiprocessor > environment (it certainly will on your single core design, obviously). > > Both the reserved address (range, typically) and the reserving processor > will need to know who reserved what. But this is just a "gut feeling", > I did not think seriously about it. >> >> Again, an abbreviation I don't know. What's a TAS? > > Test And Set, a classic 68k RMW opcoode. Read a byte (just byte size > allowed), adjust the Z flag (Z IIRC...) to the state of bit 7 (MSB), > then write the byte back with bit 7 set. While doing this the processor > provides some means (holds AS asserted on the 68k IIRC) such that the > access to the address can not be granted to another processor for the > entire duration of the TAS opcode. > > Looks like in your case a simple bset (again 68k lingo, bit test and > set, but on two separate memory accesses, potentially interruptible > by another bus master - not by a processor interrupt, it is a single > instruction) will do what you need.
Not interruptible in any way, shape or form in my approach. The read and write happen in the same clock cycle - one of the beauties of having dual port memory in an FPGA. -- Rick C
On 03/24/2017 04:22 PM, rickman wrote:
> On 3/24/2017 6:42 PM, Dimiter_Popoff wrote: >> On 24.3.2017 &#1075;. 19:36, rickman wrote: >>> ... >>> >>> Why don't I want to add a bit per memory location? If that bit happens >>> to be free, I'd be silly to let it go to waste. >> >> I am not sure a bit per reservation unit will work in a multiprocessor >> environment (it certainly will on your single core design, obviously). >> >> Both the reserved address (range, typically) and the reserving processor >> will need to know who reserved what. But this is just a "gut feeling", >> I did not think seriously about it. >>> >>> Again, an abbreviation I don't know. What's a TAS? >> >> Test And Set, a classic 68k RMW opcoode. Read a byte (just byte size >> allowed), adjust the Z flag (Z IIRC...) to the state of bit 7 (MSB), >> then write the byte back with bit 7 set. While doing this the processor >> provides some means (holds AS asserted on the 68k IIRC) such that the >> access to the address can not be granted to another processor for the >> entire duration of the TAS opcode. >> >> Looks like in your case a simple bset (again 68k lingo, bit test and >> set, but on two separate memory accesses, potentially interruptible >> by another bus master - not by a processor interrupt, it is a single >> instruction) will do what you need. > > Not interruptible in any way, shape or form in my approach. The read > and write happen in the same clock cycle - one of the beauties of having > dual port memory in an FPGA. >
Sounds similar to the old ARM answer to the problem: SWP Rdst, Rsrc, [addr]. -- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix.
On 24/03/17 19:12, rickman wrote:
> On 3/24/2017 2:06 PM, David Brown wrote: >> On 24/03/17 18:36, rickman wrote: >>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: >>>> >>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> wrote: >>>>>> >>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>> wrote: >> >> <snip> >> >>> >>>>> There certainly is no reason for memory to be multiple cycle. I think >>>>> you are picturing various implementations where memory is slow >>>>> compared >>>>> to the CPU. That's not a given. >>>> >>>> >>>> No, but such a situation is sufficiently uncommon that you'd really >>>> want to spec it up front. Most cores these days don't even have >>>> single cycle L1 caches. >>> >>> Again, assuming an architecture and requirements. This conversation is >>> not about designing the next generation of ARM CPU. >> >> It would be making architecture assumptions to require single-cycle >> memory - saying that memory may be multiple cycle is the general case. >> >> (You might /want/ to make architecture assumptions here - I gave some >> other suggestions of specific hardware for locking in another post. The >> solutions discussed here are for general memory.) > > We are having two different conversations here. I am designing a CPU, > you are talking about the theory of CPU design.
Aha! That is some useful information to bring to the table. You told us earlier about some things that you are /not/ doing, but not what you /are/ doing. (Or if you did, I missed it.)

In that case, I would recommend making some dedicated hardware for your synchronisation primitives. A simple method is one I described earlier - have a set of memory locations where the upper half of each 32-bit entry is a "thread id". You can only write to the entry if the current upper half is 0, or it matches the thread id you are writing to it. It should be straightforward to implement in an FPGA.

The disadvantage of this sort of solution is scaling - if your hardware supports 64 such semaphores, then that's all you've got. A solution utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as many semaphores as it wants. But you can always use the hardware ones to implement more software semaphores indirectly.
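A sketch of how software would use one such word (the base address, the register count and the owner-writes-zero release convention are assumptions for illustration, not a description of any existing block):

    #include <stdint.h>

    /* Assumed memory map: 64 hardware semaphore words at HW_SEM_BASE.
       Upper 16 bits: owning thread id (0 = free); lower 16 bits: value.
       The hardware accepts a write only if the owner field is currently
       0 or already equal to the id being written; it is also assumed to
       let the current owner write 0 to release. */
    #define HW_SEM_BASE ((volatile uint32_t *)0x40000000u)   /* made-up address */

    static int hw_sem_try_lock(unsigned n, uint16_t my_id)
    {
        volatile uint32_t *sem = &HW_SEM_BASE[n];
        *sem = ((uint32_t)my_id << 16) | 1u;   /* attempt the claim */
        return (*sem >> 16) == my_id;          /* read back: did it stick? */
    }

    static void hw_sem_unlock(unsigned n, uint16_t my_id)
    {
        volatile uint32_t *sem = &HW_SEM_BASE[n];
        if ((*sem >> 16) == my_id)
            *sem = 0;                          /* owner releases */
    }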
>> >> If this is going to be efficient, it is unlikely that the core supports >> reading and writing within the same instruction - LL/SC is usually a >> better choice. > > As I have said before, I am designing a CPU in an FPGA. I have FPGA > features available to me that are different than what is available to > chip designers. I do not have the same constraints as chip designers. > > I was asking what is required in an instruction set to support > multi-processors and I think the answer is that the memory swap > instruction would do the job efficiently. >
On 3/24/2017 8:48 PM, David Brown wrote:
> On 24/03/17 19:12, rickman wrote: >> On 3/24/2017 2:06 PM, David Brown wrote: >>> On 24/03/17 18:36, rickman wrote: >>>> On 3/24/2017 3:07 AM, Robert Wessel wrote: >>>>> On Fri, 24 Mar 2017 01:58:26 -0400, rickman <gnuarm@gmail.com> wrote: >>>>> >>>>>> On 3/24/2017 1:37 AM, Robert Wessel wrote: >>>>>>> On Fri, 24 Mar 2017 01:05:02 -0400, rickman <gnuarm@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> On 3/23/2017 11:43 PM, Robert Wessel wrote: >>>>>>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman <gnuarm@gmail.com> >>>>>>>>> wrote: >>> >>> <snip> >>> >>>> >>>>>> There certainly is no reason for memory to be multiple cycle. I >>>>>> think >>>>>> you are picturing various implementations where memory is slow >>>>>> compared >>>>>> to the CPU. That's not a given. >>>>> >>>>> >>>>> No, but such a situation is sufficiently uncommon that you'd really >>>>> want to spec it up front. Most cores these days don't even have >>>>> single cycle L1 caches. >>>> >>>> Again, assuming an architecture and requirements. This conversation is >>>> not about designing the next generation of ARM CPU. >>> >>> It would be making architecture assumptions to require single-cycle >>> memory - saying that memory may be multiple cycle is the general case. >>> >>> (You might /want/ to make architecture assumptions here - I gave some >>> other suggestions of specific hardware for locking in another post. The >>> solutions discussed here are for general memory.) >> >> We are having two different conversations here. I am designing a CPU, >> you are talking about the theory of CPU design. > > Aha! That is some useful information to bring to the table. You told > us earlier about some things that you are /not/ doing, but not what you > /are/ doing. (Or if you did, I missed it.) > > In that case, I would recommend making some dedicated hardware for your > synchronisation primitives. A simple method is one I described earlier > - have a set of memory locations where you the upper half of each 32-bit > entry is for a "thread id". You can only write to the entry if the > current upper half is 0, or it matches the thread id you are writing to > it. It should be straightforward to implement in an FPGA. > > The disadvantage of this sort of solution is scaling - if your hardware > supports 64 such semaphores, then that's all you've got. A solution > utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as > many semaphores as it wants. But you can always use the hardware ones > to implement more software semaphores indirectly.
I guess I didn't explain the full context. But as I mentioned, it is simpler I think to include the swap memory instruction that will allow a test and set operation to be implemented atomically without disrupting any other functions. It will use up an opcode, but it would be a simple one with the only difference from a normal memory write being the use of the read path. -- Rick C
