
Speaking of Multiprocessing...

Started by rickman March 23, 2017
On Thu, 23 Mar 2017 17:49:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 3/23/2017 4:47 PM, Tim Wescott wrote: >> On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote: >> >>> On 3/23/2017 4:19 PM, Tim Wescott wrote: >>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote: >>>> >>>>> I recall a discussion about the design of an instruction set >>>>> architecture where someone was saying an instruction was required to >>>>> test and set a bit or word as an atomic operation if it was desired to >>>>> support multiple processors. Is this really true? Is this a function >>>>> that can't be emulated with other operations including the disabling >>>>> of interrupts? >>>> >>>> AFAIK as long as you surround your "test and set" with an interrupt >>>> disable and an interrupt enable then you're OK. At least, you're OK >>>> unless you have a processor that treats interrupts really strangely. >>> >>> Rethink that for the case of SMP... (coincidentally, "Support Multiple >>> Processors" :> ) >> >> D'oh. Atomic to the common memory, not to each individual processor, yes. > >Yes. If the processor supports a RMW memory cycle AND the memory >arbiter honors that contract, then any competing processors would >explicitly be held off from accessing the location in question >until the RMW cycle terminated.
The R/M/W cycle was popular a few decades ago, when core memory was much slower than the processor. The read operation in core memory is destructive, so you have to write back the original value. This is usually done within the memory controller, but the same or a modified value could also be written back by the processor, so you get the R/M/W sequence practically for "free". In modern systems, things get complicated, since you may have to read a full 64 bit memory word, bypassing caches on both read and write, while keeping the RAS active through the whole sequence.
On 23/03/17 23:38, rickman wrote:
> I recall a discussion about the design of an instruction set > architecture where someone was saying an instruction was required to > test and set a bit or word as an atomic operation if it was desired to > support multiple processors. Is this really true? Is this a function > that can't be emulated with other operations including the disabling of > interrupts? >
There are many, many ways to implement synchronisation between threads, processors, whatever. In theory, they are mostly equivalent in that any one can be used to implement the others. In practice, there can be a lot of differences in the overheads of the hardware implementation, and in the speed achieved. Typical implementations are "compare-and-swap" instructions (x86 uses these) and load-linked/store-conditional (common on RISC systems where instructions either load /or/ store, not both). And of course, on single processor systems there is always the "disable all interrupts" method.

But if you can use dedicated hardware, there are many other methods. The XMOS devices have hardware support for pipelines and message passing. On a dual-core PPC device I used, there is a hardware block of semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that you can only access as a 32-bit read or write. You can write to it if the current ID is 0, or if the ID you are writing matches that of the semaphore. There is plenty of scope for variation based on that theme.
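As a concrete illustration of the compare-and-swap approach mentioned above, here is a minimal spinlock sketch in portable C11 (not code from the thread). On x86 the compare-exchange typically compiles to a LOCK CMPXCHG; on most RISC targets it becomes a load-linked/store-conditional retry loop.

#include <stdatomic.h>

/* Minimal spinlock sketch using C11 atomics.  Initialise with = {0}.
 * Real code would add back-off and think about priority inversion. */
typedef struct { atomic_int state; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l)
{
    int expected = 0;
    /* Atomically: if state == 0, set it to 1; otherwise retry. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;   /* a failed CAS overwrites 'expected' with the current value */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->state, 0, memory_order_release);
}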
On 3/24/2017 12:48 AM, upsidedown@downunder.com wrote:
> On Thu, 23 Mar 2017 17:49:39 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 3/23/2017 4:47 PM, Tim Wescott wrote: >>> On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote: >>> >>>> On 3/23/2017 4:19 PM, Tim Wescott wrote: >>>>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote: >>>>> >>>>>> I recall a discussion about the design of an instruction set >>>>>> architecture where someone was saying an instruction was required to >>>>>> test and set a bit or word as an atomic operation if it was desired to >>>>>> support multiple processors. Is this really true? Is this a function >>>>>> that can't be emulated with other operations including the disabling >>>>>> of interrupts? >>>>> >>>>> AFAIK as long as you surround your "test and set" with an interrupt >>>>> disable and an interrupt enable then you're OK. At least, you're OK >>>>> unless you have a processor that treats interrupts really strangely. >>>> >>>> Rethink that for the case of SMP... (coincidentally, "Support Multiple >>>> Processors" :> ) >>> >>> D'oh. Atomic to the common memory, not to each individual processor, yes. >> >> Yes. If the processor supports a RMW memory cycle AND the memory >> arbiter honors that contract, then any competing processors would >> explicitly be held off from accessing the location in question >> until the RMW cycle terminated. > > The R/M/W popular a few decades ago when the core memory was much > slower than the processor. The read operation in core memory is > destructive, so you have to write back the original value. This is > usually done within the memory controller, the same or modified value > could also be written back by the processor, so you get the R/M/W > sequence practically for "free".
On many "microprocessors", there are hints as to when RMW cycles are undertaken. E.g., the m68k would issue a single address strobe spanning the "two phase" RMW cycle (a consequence of the TAS opcode). But this requires the memory arbiter (for closely coupled coprocessors) to monitor /AS and not attempt early (read vs write) cycle termination (which is a potential performance hack in a shared memory system) by just watching the individual data strobes.

Other legacy processors usually had exploits that could be leveraged to deduce when RMW-ish cycles were in effect -- at the cost of a bit of external logic (e.g., decoding opcode fetch cycles to provide cleaner arbitration points for memory sharing). However, as bus interface units have become increasingly decoupled from execution units, it has become harder to reliably infer what is ACTUALLY happening in the CPU just by watching the bus.
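For comparison with the TAS discussion above: the same test-and-set semantics can be expressed portably with C11's atomic_flag, which on the m68k is the kind of operation the TAS instruction (with its indivisible RMW bus cycle) provides directly. A minimal sketch, not taken from the thread:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* Atomically read the old value and set the flag; spin until the
       old value was clear, i.e. we were the one who set it. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* busy-wait */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}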
> In modern systems, things get complicated, since you may have to read > a full 64 bit memory word, bypassing caches on both read and write > while keeping the RAS active through the whole sequence.
With SoCs, there's very little you can do to second-guess the processor, so you have to rely on it to perform this sort of access (especially as the memory in question might be entirely "internal" to the processor).
On 24/03/17 08:17, David Brown wrote:
> But if you can use dedicated hardware, there are many other methods. > The XMOS devices have hardware support for pipelines and message > passing. On a dual-core PPC device I used, there is a hardware block of > semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that > you can only access as a 32-bit read or write. You can write to it if > the current ID is 0, or if the ID you are writing matches that of the > semaphore. There is plenty of scope for variation based on that theme.
I received my first XMOS board from Digi-Key a couple of days ago, and I'm looking forward to using it for some simple experiments. I /feel/ that many low-level things will be much simpler, with fewer potential nasties lurking in the undergrowth. (I felt the same with the Transputer, for obvious reasons, but never had a suitable problem at that time.)

With your experience, did you find any undocumented gotchas and any pleasant or unpleasant surprises?
On 24/03/17 10:47, Tim Wescott wrote:
> On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote: > >> On 3/23/2017 4:19 PM, Tim Wescott wrote: >>> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote: >>> >>>> I recall a discussion about the design of an instruction set >>>> architecture where someone was saying an instruction was required to >>>> test and set a bit or word as an atomic operation if it was desired to >>>> support multiple processors. Is this really true? Is this a function >>>> that can't be emulated with other operations including the disabling >>>> of interrupts? >>> >>> AFAIK as long as you surround your "test and set" with an interrupt >>> disable and an interrupt enable then you're OK. At least, you're OK >>> unless you have a processor that treats interrupts really strangely. >> >> Rethink that for the case of SMP... (coincidentally, "Support Multiple >> Processors" :> ) > > D'oh. Atomic to the common memory, not to each individual processor, yes. > > Although it wouldn't have to be an instruction per se: you could have it > be an "instruction" to whatever hardware is controlling the common > memory, to hold off the other processors while it does a read/modify/ > write cycle.
Yes. But when you have multi-level caching, perhaps some with write-back semantics, it needs to force write-through, and be bus-locked all the way to the common memory. X86 has a LOCK prefix which acts on certain following instructions to make this happen, and SMP and multi-CPU architectures honor it. Clifford Heath.
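As a small illustration of the LOCK prefix mentioned above (a sketch assuming GCC-style inline assembly on x86, not code from the thread): the prefix turns an ordinary read-modify-write instruction into one performed atomically with respect to other bus agents. In portable code the same effect is obtained with __atomic_fetch_add() or C11 atomics.

/* Atomic increment on x86 via an explicit LOCK prefix (GCC inline asm).
 * Without "lock", two CPUs could both read the old value and one
 * increment would be lost. */
static inline void atomic_inc(volatile int *p)
{
    __asm__ __volatile__("lock; incl %0" : "+m"(*p) : : "memory");
}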
On 24/03/17 10:28, Tom Gardner wrote:
> On 24/03/17 08:17, David Brown wrote: >> But if you can use dedicated hardware, there are many other methods. >> The XMOS devices have hardware support for pipelines and message >> passing. On a dual-core PPC device I used, there is a hardware block of >> semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that >> you can only access as a 32-bit read or write. You can write to it if >> the current ID is 0, or if the ID you are writing matches that of the >> semaphore. There is plenty of scope for variation based on that theme. > > I received my first XMOS board from Digi-Key a couple of days > ago, and I'm looking forward to using it for some simple > experiments. I /feel/ that many low-level things will be > much simpler and with fewer potential nasties lurking in > the undergrowth. (I felt the same with the Transputer, for > obvious reasons, but never had a suitable problem at that > time) > > With your experience, did you find any undocumented gotchas > and any pleasant or unpleasant surprises? >
Before saying anything else, I would first note that my work with XMOS systems was about four years ago, when they first started getting popular. I believe many things that bugged me most have been improved since then, both in the hardware and software, but some may remain.

I think the devices themselves are a really neat idea. You have very fast execution, very efficient hardware multi-threading, very predictable timings, and a variety of inter-thread and inter-process communication methods.

Their "XC" programming language was also a neat idea, based on C with additional primitives to support the hardware features and multi-threading stuff, and an attempt to make some aspects of C safer (real arrays, control of when you can access variables, etc.).

However, IMHO the whole thing suffered from a number of serious flaws that limit the possibilities for the chips. Sure, they would work well in some circumstances - but I was left with the feeling that "if only they had done /this/, the devices would be so much better and could be used for so many more purposes". It is a little unfair to concentrate on the shortcomings rather than the innovations and features, but that is how I felt when using them. And again, I know that at least some issues here have been greatly improved since I last used them.

An obvious flaw with the chips is lack of memory. The basic device with one cpu and 8 threads had 64K ram that was for program memory and run-time data. There was no flash - you had to use an external SPI flash which used valuable pins (messing up the use of blocks of 8, 16 or 32 pins), and used up a thread if you wanted to be able to access the flash at run-time. And while you could implement an Ethernet MAC or a 480 Mbps USB 2.0 interface on the chip, there was nowhere near enough ram for buffering or to do anything useful with the interface. Adding external memory was ridiculously expensive in terms of pins, threads, and run-time inefficiency.

The hardware threading is great, and provides a really easy model for all sorts of things. To make a UART transmitter, you have a thread that waits for data coming in on a pipe. To transmit a bit, you set a pin, wait for a bit time (using hardware timers), then move on to the next bit. The code is simple and elegant. A UART receiver is not much harder. There is lots of example code in this style.

Then you realise that to implement a UART, you have used a quarter of the chip's resources. Your elegant flashing light is another thread, as is your PWM output. Suddenly you find you are using a 500 MIPS chip to do the work of a $0.50 microcontroller, and you only have a thread or two left for the actual application.

And you end up trying to run FreeRTOS on one of your threads, or make your own scheduler to multiplex several PWM channels in one thread. Much of the elegance quickly disappears for real-world applications.

Then there is the software. The XC language lets you write code that starts tasks in parallel, automatically allocates channels for communication, lets you declare timers and wait on them. That's all great in theory - but it quickly gets confusing when you try to figure out the details of when you can pass these around, when they get allocated and deallocated, or when you can have a thread create new threads. XC carefully tracks threads and data accesses, spotting and blocking all sorts of possible race conditions. If a variable is written by one thread, then it can't be accessed from another. You can work with arrays safely, but you can't take addresses. Data gets passed between threads using communication channels that are safe from race conditions and nicely synchronised.

And then you realise that to actually make the thing work, you would need far more channels than there are on the device, and they would need to be far faster - all you really wanted was for two threads to share a circular buffer, and you know in your application code when it is safe to use it. But you can't do that in XC - the language and the tools won't let you. So you have to write that code in C, with calls back and forth with the XC code that handles the multi-threading stuff.

And then you realise that from within the C, you need to access some hardware resources like timers, that can't be expressed properly in C, and you can't get back to the XC code at the time. So you end up with inline assembly.

Then there are the libraries and examples. These were written in such a wide variety of styles that it was impossible to figure out what was going on. A typical example project would involve a USB interface and, for example, SPDIF channels for a USB audio interface. The Eclipse-based IDE was fine, but the example did not come as a project - it came as a collection of interdependent projects. Some bits referred to files in different projects. Some bits merely required other projects to be compiled. Some bits of the code in one project would use assembly for hardware resources, others would use XC, others would use C intrinsic functions, and others would use a sort of XML file that defines the setup for your chip resources. If you change values in one file in one project (say, the USB vendor ID), you have to figure out which sub-projects need to be manually forced to re-build in order for it to take effect consistently throughout the project. Some parts use a fairly obvious configuration file - a header with defines that let you control things like IDs, number of channels, pins, etc. Except they don't - only /some/ of the sub-projects read and use the configuration file, other parts are hard-coded or use values from elsewhere. It was a complete mess.

Now, I know that newer XMOS devices have more resources, built-in flash, proper hardware peripherals for the devices that are most demanding or popular, and so on. And I can only hope that the language and tools have been improved to the point where inline assembly is not required, and that the examples and libraries have matured to the point that the libraries are usable as-is, and the examples show practical ways to develop code.

I really hope XMOS does well here - it is so good to see a company that thinks in a very different way and brings in these new ideas. So if your experience with modern XMOS devices and tools is good, I would love to hear about it.
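To make the "one thread per peripheral" style concrete, here is a rough C rendering of the UART transmitter described above. It is not XC and not XMOS code; pin_write(), timer_now() and wait_until() are hypothetical helpers standing in for XC's port and timer primitives.

#include <stdint.h>

/* Hypothetical hardware helpers -- in XC these would be a 1-bit output
   port and a hardware timer used with "when timerafter(t)". */
extern void     pin_write(unsigned level);
extern uint32_t timer_now(void);
extern void     wait_until(uint32_t t);

#define BIT_TICKS (100000000u / 115200u)   /* assumed 100 MHz timebase, 115200 baud */

void uart_tx_byte(uint8_t byte)
{
    uint32_t t = timer_now();

    pin_write(0); t += BIT_TICKS; wait_until(t);      /* start bit */
    for (int i = 0; i < 8; i++) {                     /* data bits, LSB first */
        pin_write((byte >> i) & 1u);
        t += BIT_TICKS; wait_until(t);
    }
    pin_write(1); t += BIT_TICKS; wait_until(t);      /* stop bit */
}

The point of the sketch is the structure: each bit time is paced by the timer, and the whole transmitter occupies one hardware thread.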
On 24/03/17 10:19, David Brown wrote:
> On 24/03/17 10:28, Tom Gardner wrote: >> On 24/03/17 08:17, David Brown wrote: >>> But if you can use dedicated hardware, there are many other methods. >>> The XMOS devices have hardware support for pipelines and message >>> passing. On a dual-core PPC device I used, there is a hardware block of >>> semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that >>> you can only access as a 32-bit read or write. You can write to it if >>> the current ID is 0, or if the ID you are writing matches that of the >>> semaphore. There is plenty of scope for variation based on that theme. >> >> I received my first XMOS board from Digi-Key a couple of days >> ago, and I'm looking forward to using it for some simple >> experiments. I /feel/ that many low-level things will be >> much simpler and with fewer potential nasties lurking in >> the undergrowth. (I felt the same with the Transputer, for >> obvious reasons, but never had a suitable problem at that >> time) >> >> With your experience, did you find any undocumented gotchas >> and any pleasant or unpleasant surprises? >> > > Before saying anything else, I would first note that my work with XMOS > systems was about four years ago, when they first started getting > popular. I believe many things that bugged me most have been improved > since then, both in the hardware and software, but some may remain. > > I think the devices themselves are a really neat idea. You have very > fast execution, very efficient hardware multi-threading, very > predictable timings, and a variety of inter-thread and inter-process > communication methods. > > Their "XC" programming language was also a neat idea, based on C with > additional primitives to support the hardware features and > multi-threading stuff, and an attempt to make some aspects of C safer > (real arrays, control of when you can access variables, etc.). > > However, IMHO the whole thing suffered from a number of serious flaws > that limit the possibilities for the chips. Sure, they would work well > in some circumstances - but I was left with the feeling that "if only > they had done /this/, the devices would be so much better and could be > used for so many more purposes". It is a little unfair to concentrate > on the shortcomings rather than the innovations and features, but that > is how I felt when using them. And again, I know that at least some > issues here have been greatly improved since I last used them. > > > A obvious flaw with the chips is lack of memory. The basic device with > one cpu and 8 threads had 64K ram that was for program memory and > run-time data. There was no flash - you had to use an external SPI > flash which used valuable pins (messing up the use of blocks of 8, 16 or > 32 pins), and used up a thread if you wanted to be able to access the > flash at run-time. And while you could implement an Ethernet MAC or a > 480 Mbps USB 2.0 interface on the chip, there was nowhere near enough > ram for buffering or to do anything useful with the interface. Adding > external memory was ridiculously expensive in terms of pins, threads, > and run-time inefficiency. > > The hardware threading is great, and provides a really easy model for > all sorts of things. To make a UART transmitter, you have a thread that > waits for data coming in on a pipe. To transmit a bit, you set a pin, > wait for a bit time (using hardware timers), then move on to the next > bit. The code is simple and elegant. A UART receiver is not much > harder. There is lots of example code in this style. 
> > Then you realise that to implement a UART, you have used a quarter of > the chip's resources. Your elegant flashing light is another thread, as > is your PWM output. Suddenly you find you are using a 500 MIPS chip to > do the work of a $0.50 microcontroller, and you only have a thread or > two left for the actual application. > > And you end up trying to run FreeRTOS on one of your threads, or make > your own scheduler to multiplex several PWM channels in one thread. > Much of the elegance quickly disappears for real-world applications. > > > Then there is the software. The XC language lets you write code that > starts tasks in parallel, automatically allocates channels for > communication, lets you declare timers and wait on them. That's all > great in theory - but it quickly gets confusing when you try to figure > out the details of when you can pass these around, when they get > allocated and deallocated, or when you can have a thread create new > threads. XC carefully tracks threads and data accesses, spotting and > blocking all sorts of possible race conditions. If a variable is > written by one thread, then it can't be accessed from another. You can > work with arrays safely, but you can't take addresses. Data gets passed > between threads using communication channels that are safe from race > conditions and nicely synchronised. > > And then you realise that to actually make the thing work, you would > need far more channels than there are on the device, and they would need > to be far faster - all you really wanted was for two threads to share a > circular buffer, and you know in your application code when it is safe > to use it. But you can't do that in XC - the language and the tools > won't let you. So you have to write that code in C, with calls back and > forth with the XC code that handles the multi-threading stuff. > > And then you realise that from within the C, you need to access some > hardware resources like timers, that can't be expressed properly in C, > and you can't get back to the XC code at the time. So you end up with > inline assembly. > > > Then there are the libraries and examples. These were written in such a > wide variety of styles that it was impossible to figure out what was > going on. A typical example project would involve a USB interface and, > for example, SPDIF channels for an USB audio interface. The > Eclipse-based IDE was fine, but the example did not come as a project - > it came as a collection of interdependent projects. Some bits referred > to files in different projects. Some bits merely required other > projects to be compiled. Some bits of the code in one project would use > assembly for hardware resources, others would use XC, others would use C > intrinsic functions, and others would use a sort of XML file that > defines the setup for your chip resources. If you change values in one > file in one project (say, the USB vendor ID), you have to figure out > which sub-projects need to be manually forced to re-build in order for > it to take effect consistently throughout the project. Some parts use a > fairly obvious configuration file - a header with defines that let you > control things like IDs, number of channels, pins, etc. Except they > don't - only /some/ of the sub-projects read and use the configuration > file, other parts are hard-coded or use values from elsewhere. It was a > complete mess. 
> > > Now, I know that newer XMOS devices have more resources, built-in flash, > proper hardware peripherals for the devices that are most demanding or > popular, and so on. And I can only hope that the language and tools > have been improved to the point where inline assembly is not required, > and that the examples and libraries have matured to the point that the > libraries are usable as-is, and the examples show practical ways to > develop code. > > I really hope XMOS does well here - it is so good to see a company that > thinks in a very different way and brings in these new ideas. So if > your experience with modern XMOS devices and tools is good, I would love > to hear about it.
Thanks for a speedy, comprehensive response. I'll re-read and digest it properly later.

My initial gut feel is that many of your points were valid and probably still are - because they /ought/ to still be valid.

The issues that most interest me relate to where you found it necessary to step outside the toolchain. Part of me thinks (hopes, really) that it is merely because your problem wasn't well suited to the device's strengths (esp. guaranteed timing), and/or was too big, and/or importing existing code/thinking led to friction, and/or the tools were immature.

I expect I'll end up agreeing with many of your observations, but I'll have fun finding that out :)
On 24/03/17 12:06, Tom Gardner wrote:
> On 24/03/17 10:19, David Brown wrote:
>> I really hope XMOS does well here - it is so good to see a company that >> thinks in a very different way and brings in these new ideas. So if >> your experience with modern XMOS devices and tools is good, I would love >> to hear about it. > > Thanks for a speedy, comprehensive response. I'll re-read > and digest it properly later. > > My initial gut feel is that many of your points were > valid and probably are still valid - because they > /ought/ to still be valid.
I know that at least some of my points are no longer an issue, or at least not as much of an issue - XMOS have devices with flash, USB hardware, etc. At least some of the toolchain issues should be fixable. And the mess of the examples and libraries is certainly fixable - at least, if one disregards the time and effort it would involve!
> > The issues that most interest me relate to where you found > it necessary to step outside the toolchain. Part of me thinks > (hopes, really) that it is merely because your problem > wasn't well suited to the devices strengths (esp. guaranteed > timing), and/or were too big, and/or importing existing > code/thinking lead to friction, and/or the tools were immature.
The existing code was mainly XMOS's own examples, libraries and reference designs... I do agree that much of their USB stuff was poorly suited to the devices and too big for them, and that probably made things worse - but it was XMOS's own code. With newer devices with hardware USB peripherals, I expect fewer such issues. I will go along with your hope - expectation, even - that the tools have matured and improved over time.
> > I expect I'll end up agreeing with many of your observations, > but I'll have fun finding that out :)
On 2017-03-23, Tim Wescott <seemywebsite@myfooter.really> wrote:
> On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote: > >> I recall a discussion about the design of an instruction set >> architecture where someone was saying an instruction was required to >> test and set a bit or word as an atomic operation if it was desired to >> support multiple processors. Is this really true? Is this a function >> that can't be emulated with other operations including the disabling of >> interrupts? > > AFAIK as long as you surround your "test and set" with an interrupt > disable and an interrupt enable then you're OK.
How does disabling interrupts prevent another processor from messing up your "atomic" operation?

--
Grant Edwards    grant.b.edwards at gmail.com
On 24/03/17 10:19, David Brown wrote:
> On 24/03/17 10:28, Tom Gardner wrote: >> On 24/03/17 08:17, David Brown wrote: >>> But if you can use dedicated hardware, there are many other methods. >>> The XMOS devices have hardware support for pipelines and message >>> passing. On a dual-core PPC device I used, there is a hardware block of >>> semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that >>> you can only access as a 32-bit read or write. You can write to it if >>> the current ID is 0, or if the ID you are writing matches that of the >>> semaphore. There is plenty of scope for variation based on that theme. >> >> I received my first XMOS board from Digi-Key a couple of days >> ago, and I'm looking forward to using it for some simple >> experiments. I /feel/ that many low-level things will be >> much simpler and with fewer potential nasties lurking in >> the undergrowth. (I felt the same with the Transputer, for >> obvious reasons, but never had a suitable problem at that >> time) >> >> With your experience, did you find any undocumented gotchas >> and any pleasant or unpleasant surprises? >> > > Before saying anything else, I would first note that my work with XMOS > systems was about four years ago, when they first started getting > popular. I believe many things that bugged me most have been improved > since then, both in the hardware and software, but some may remain. > > I think the devices themselves are a really neat idea. You have very > fast execution, very efficient hardware multi-threading, very > predictable timings, and a variety of inter-thread and inter-process > communication methods. > > Their "XC" programming language was also a neat idea, based on C with > additional primitives to support the hardware features and > multi-threading stuff, and an attempt to make some aspects of C safer > (real arrays, control of when you can access variables, etc.).
Yes, those are precisely the aspects that interest me. I'm particularly interested in easy-to-implement hard realtime systems.

As far as I am concerned, caches and interrupts make it difficult to guarantee hard realtime performance, and C's explicit avoidance of multiprocessing biases C away from "easy-to-implement". Yes, I know about libraries and modern compilers that may or may not compile your code in the way you expect! I'd far rather build on a solid foundation than have to employ (language) lawyers to sort out the mess ;)

Besides, I want to use Occam++ :)
> However, IMHO the whole thing suffered from a number of serious flaws > that limit the possibilities for the chips. Sure, they would work well > in some circumstances - but I was left with the feeling that "if only > they had done /this/, the devices would be so much better and could be > used for so many more purposes". It is a little unfair to concentrate > on the shortcomings rather than the innovations and features, but that > is how I felt when using them. And again, I know that at least some > issues here have been greatly improved since I last used them. > > > A obvious flaw with the chips is lack of memory. The basic device with > one cpu and 8 threads had 64K ram that was for program memory and > run-time data. There was no flash - you had to use an external SPI > flash which used valuable pins (messing up the use of blocks of 8, 16 or > 32 pins), and used up a thread if you wanted to be able to access the > flash at run-time. And while you could implement an Ethernet MAC or a > 480 Mbps USB 2.0 interface on the chip, there was nowhere near enough > ram for buffering or to do anything useful with the interface. Adding > external memory was ridiculously expensive in terms of pins, threads, > and run-time inefficiency.
Yes, those did strike me as limitations, to the extent that I'm skeptical about networking connectivity. But maybe an XMOS device plus an ESP8266 would be worth considering for some purposes.
> The hardware threading is great, and provides a really easy model for > all sorts of things. To make a UART transmitter, you have a thread that > waits for data coming in on a pipe. To transmit a bit, you set a pin, > wait for a bit time (using hardware timers), then move on to the next > bit. The code is simple and elegant. A UART receiver is not much > harder. There is lots of example code in this style. > > Then you realise that to implement a UART, you have used a quarter of > the chip's resources. Your elegant flashing light is another thread, as > is your PWM output. Suddenly you find you are using a 500 MIPS chip to > do the work of a $0.50 microcontroller, and you only have a thread or > two left for the actual application.
Yes. However I don't care about wasting some resources if it makes the design easier/faster, so long as it isn't too expensive in terms of power and money.
> And you end up trying to run FreeRTOS on one of your threads, or make > your own scheduler to multiplex several PWM channels in one thread. > Much of the elegance quickly disappears for real-world applications.
Needing to run a separate RTOS would be a code smell. That's where the CSP+multicore approach /ought/ to be sufficient. For practicality, I exclude peripheral libraries and networking code from that statement.
> Then there is the software. The XC language lets you write code that > starts tasks in parallel, automatically allocates channels for > communication, lets you declare timers and wait on them. That's all > great in theory - but it quickly gets confusing when you try to figure > out the details of when you can pass these around, when they get > allocated and deallocated, or when you can have a thread create new > threads. XC carefully tracks threads and data accesses, spotting and > blocking all sorts of possible race conditions. If a variable is > written by one thread, then it can't be accessed from another. You can > work with arrays safely, but you can't take addresses. Data gets passed > between threads using communication channels that are safe from race > conditions and nicely synchronised.
That's the kind of thing I'm interested in exploring.
> And then you realise that to actually make the thing work, you would > need far more channels than there are on the device, and they would need > to be far faster - all you really wanted was for two threads to share a > circular buffer, and you know in your application code when it is safe > to use it. But you can't do that in XC - the language and the tools > won't let you. So you have to write that code in C, with calls back and > forth with the XC code that handles the multi-threading stuff.
That's the kind of thing I'm interested in exploring.
> And then you realise that from within the C, you need to access some > hardware resources like timers, that can't be expressed properly in C, > and you can't get back to the XC code at the time. So you end up with > inline assembly.
At which point many advantages would have been lost.
> Then there are the libraries and examples. These were written in such a > wide variety of styles that it was impossible to figure out what was > going on. A typical example project would involve a USB interface and, > for example, SPDIF channels for an USB audio interface. The > Eclipse-based IDE was fine, but the example did not come as a project - > it came as a collection of interdependent projects. Some bits referred > to files in different projects. Some bits merely required other > projects to be compiled. Some bits of the code in one project would use > assembly for hardware resources, others would use XC, others would use C > intrinsic functions, and others would use a sort of XML file that > defines the setup for your chip resources. If you change values in one > file in one project (say, the USB vendor ID), you have to figure out > which sub-projects need to be manually forced to re-build in order for > it to take effect consistently throughout the project. Some parts use a > fairly obvious configuration file - a header with defines that let you > control things like IDs, number of channels, pins, etc. Except they > don't - only /some/ of the sub-projects read and use the configuration > file, other parts are hard-coded or use values from elsewhere. It was a > complete mess.
Irritating, but not fundamental, and as you point out, they could be fixed with application of time and money.
> Now, I know that newer XMOS devices have more resources, built-in flash, > proper hardware peripherals for the devices that are most demanding or > popular, and so on. And I can only hope that the language and tools > have been improved to the point where inline assembly is not required, > and that the examples and libraries have matured to the point that the > libraries are usable as-is, and the examples show practical ways to > develop code. > > I really hope XMOS does well here - it is so good to see a company that > thinks in a very different way and brings in these new ideas. So if > your experience with modern XMOS devices and tools is good, I would love > to hear about it.
I think we are largely in violent agreement.

I suspect my "hello world" program will be for their £10 StartKIT: echo anything it receives on the USB line, with a second task flipping uppercase to lowercase. That should give me a feel for low-end resource usage.

Then I'd like to make a reciprocal frequency counter to see how far I can push individual threads and their SERDES-like IO primitives. And I'll probably do some bitbashing to create analogue outputs that draw pretty XY pictures on an analogue scope.
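For reference, the arithmetic behind a reciprocal counter (a sketch, not anything from the thread): count whole input periods and the reference-clock ticks spanning exactly those periods, then divide.

#include <stdint.h>

#define F_REF_HZ 100000000.0           /* assumed 100 MHz reference timebase */

/* n_in  : number of complete input periods captured
   n_ref : reference-clock ticks spanning exactly those periods */
static double reciprocal_freq(uint32_t n_in, uint32_t n_ref)
{
    return (double)n_in * F_REF_HZ / (double)n_ref;
}

For example, 1000 input periods spanning 2,000,000 reference ticks gives 1000 * 1e8 / 2e6 = 50 kHz, with the resolution set by the reference clock rather than by the gate time.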
