Engineering degree for embedded systems

Started by hogwarts July 27, 2017
On 08/08/17 16:11, David Brown wrote:
> On 08/08/17 16:56, Tom Gardner wrote:
>> On 08/08/17 11:56, David Brown wrote:
>>> On 08/08/17 12:09, Tom Gardner wrote:
>>>> On 08/08/17 10:26, David Brown wrote:
>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
>>>
>>> The xCORE is a bit different, as is the language you use and the style
>>> of the code. Message passing is a very neat way to swap data between
>>> threads or cores, and is inherently safer than shared memory.
>>
>> Well, you can program xCOREs in C/C++, but I haven't
>> investigated that on the principle that I want to "kick
>> the tyres" of xC.
>>
>> ISTR seeing that the "interface" mechanisms in xC are
>> shared memory underneath, optionally involving memory
>> copies. That is plausible since xC interfaces have
>> "asynchronous nonblocking" "notify" and "clear
>> notification" annotations on methods. Certainly they
>> are convenient to use and get around some pain points
>> in pure CSP message passing.
>
> The actual message passing can be done in several ways. IIRC, it will
> use shared memory within the same cpu (8 logical cores), and channels
> ("real" message passing) between cpus.
>
> However, as long as it logically uses message passing then it is up to
> the tools to get the details right - it frees the programmer from having
> to understand about ordering, barriers, etc.
Just so. I'm pretty sure:
- all "pure CSP" message passing uses the xSwitch fabric;
- the xC interfaces use shared memory between cores on the same tile;
- whereas across different tiles they bundle up a memory copy and
  transmit that as messages across the xSwitch fabric.
I can't think of a simpler/better way of achieving the desired external
behaviour.
>>> In general, I agree. In this particular case, the Alpha is basically
>>> obsolete - but it is certainly possible that future cpu designs would
>>> have equally weak memory models. Such a weak model is easier to make
>>> faster in hardware - you need less synchronisation, cache snooping, and
>>> other such details.
>>
>> Reasonable, but given the current fixation on the mirage
>> of globally-coherent memory, I wonder whether that is a
>> lost cause.
>>
>> Sooner or later people will have to come to terms with
>> non-global memory and multicore processing and (preferably)
>> message passing. Different abstractions and tools /will/
>> be required. Why not start now, from a good sound base?
>> Why hobble next-gen tools with last-gen problems?
>
> That is /precisely/ the point - if you view it from the other side. A
> key way to implement message passing is to use shared memory underneath
> - but you isolate the messy details from the ignorant programmer. If
> you have written the message passing library correctly, using features
> such as "consume" orders, then the high-level programmer can think of
> passing messages while the library and the compiler conspire to give
> optimal correct code even on very weak memory model cpus.
>
> You are never going to get away from shared memory systems - for some
> kinds of multi-threaded applications, it is much, much more efficient
> than message passing. But it would be good if multi-threaded apps used
> message passing more often, as it is easier to get correct.
Oh dear. Violent agreement. How boring.
On Tue, 08 Aug 2017 17:11:22 +0200, David Brown
<david.brown@hesbynett.no> wrote:

> On 08/08/17 16:56, Tom Gardner wrote:
>> On 08/08/17 11:56, David Brown wrote:
>>> On 08/08/17 12:09, Tom Gardner wrote:
>>>> On 08/08/17 10:26, David Brown wrote:
>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
When there are a large number of cores/processors available, I would start a project by assigning a thread/process for each core. Later on you might have to do some fine adjustments to put multiple threads into one core or split one thread into multiple cores.
>> [snip]
>>
>> I'm currently in two minds as to whether I like
>> any departure from CSP purity :)
>>
>>>>> There is one "suboptimality" - the "consume" memory order. It's a bit
>>>>> weird, in that it is mainly relevant to the Alpha architecture, whose
>>>>> memory model is so weak that in "x = *p;" it can fetch the contents of
>>>>> *p before seeing the latest update of p. Because the C11 and C++11
>>>>> specs are not clear enough on "consume", all implementations (AFAIK)
>>>>> bump this up to the stronger "acquire", which may be slightly slower on
>>>>> some architectures.
>>>>
>>>> One of C/C++'s problems is deciding to cater for, um,
>>>> weird and obsolete architectures. I see /why/ they do
>>>> that, but on Mondays, Wednesdays and Fridays I'd prefer
>>>> a concentration on doing common architectures simply
>>>> and well.
>>>
>>> [snip]
>
> That is /precisely/ the point - if you view it from the other side. A
> key way to implement message passing is to use shared memory underneath
> - but you isolate the messy details from the ignorant programmer. If
> you have written the message passing library correctly, using features
> such as "consume" orders, then the high-level programmer can think of
> passing messages while the library and the compiler conspire to give
> optimal correct code even on very weak memory model cpus.
>
> You are never going to get away from shared memory systems - for some
> kinds of multi-threaded applications, it is much, much more efficient
> than message passing. But it would be good if multi-threaded apps used
> message passing more often, as it is easier to get correct.
What is the issue with shared memory systems? Use unidirectional FIFOs between threads in shared memory for the actual message. The real issue is how to inform the consuming thread that there is a new message available in the FIFO.
>>>> I'm disappointed that thread support might not be as
>>>> useful as desired, but the memory model and atomics are more
>>>> important.
>>>
>>> The trouble with thread support in C11/C++11 is that it is limited to
>>> very simple features - mutexes, condition variables and simple threads.
>>> But real-world use needs priorities, semaphores, queues, timers, and
>>> many other features. Once you are using RTOS-specific APIs for all
>>> these, you would use the RTOS APIs for threads and mutexes as well,
>>> rather than <threads.h> calls.
>>
>> That makes a great deal of sense to me, and it
>> brings into question how much it is worth bothering
>> about it in C/C++. No doubt I'll come to my senses
>> before too long :)
On Mon, 7 Aug 2017 20:09:23 -0500, Les Cargill
<lcargill99@comcast.com> wrote:

> upsidedown@downunder.com wrote:
>> On Sun, 6 Aug 2017 09:53:55 -0500, Les Cargill
>> <lcargill99@comcast.com> wrote:
>>
>>>> I have often wondered what this IoT hype is all about. It seems to be
>>>> very similar to the PLC (Programmable Logic Controller) used for
>>>> decades.
>>>
>>> Similar. But PLCs are pointed more at ladder logic for use in
>>> industrial settings. You generally cannot, for example, write a socket
>>> server that just does stuff on a PLC; you have to stay inside a dev
>>> framework that cushions it for you.
>>
>> In IEC-1131 (now IEC 61131-3) you can enter the program in the format
>> you are mostly familiar with, such as ladder logic or structured text
>> (ST), which is similar to Modula (and somewhat resembles Pascal) with
>> normal control structures.
>
> It may resemble Pascal, but it's still limited in what it can do. It's
> good enough for ... 90% of things that will need to be done, but I live
> outside that 90% myself.
At least in the CoDeSys implementation of IEC 1131, it is easy to write low-level functions in C, e.g. for setting up hardware registers, handling ISRs and so on. Just publish suitable "hooks" that can be used by the ST code, and which can then be accessed from function blocks or ladder logic. In large projects, different people can work on the various abstraction layers. Once these hooks (written in C etc.) are well defined, people familiar with ST or the other IEC 1131 languages can build their own applications. I wrote some hooks in C at the turn of the century and have not needed to touch them since; all the new functionality could be implemented by other people who were more familiar with IEC 1131.
>> IEC-1131 has been available for two decades.
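As a rough illustration of the kind of C "hook" described above (the register addresses, bit layout and function name below are all hypothetical - the actual glue for exporting a C function to ST code is tool-specific):

#include <stdint.h>

/* Hypothetical memory-mapped timer peripheral; addresses and bit layout
   are illustrative only. */
#define TIMER0_CTRL    (*(volatile uint32_t *)0x40010000u)
#define TIMER0_RELOAD  (*(volatile uint32_t *)0x40010004u)

/* Low-level hook written in C.  The IEC 61131-3 tool would import this
   as an external function, so ST or ladder code can start a periodic
   timer without knowing anything about the register layout. */
int32_t StartPeriodicTimer(uint32_t period_us)
{
    if (period_us == 0u)
        return -1;                 /* reject a nonsense argument */
    TIMER0_RELOAD = period_us;     /* assumes a 1 MHz timer clock */
    TIMER0_CTRL   = 1u;            /* enable bit, illustrative */
    return 0;
}

On the ST side this would then be declared as an external function and called from a function block; the exact declaration syntax depends on the tool.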
On 08/08/17 20:07, upsidedown@downunder.com wrote:
> On Tue, 08 Aug 2017 17:11:22 +0200, David Brown
> <david.brown@hesbynett.no> wrote:
>
>>>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
>
> When there are a large number of cores/processors available, I would
> start a project by assigning a thread/process for each core. Later on
> you might have to do some fine adjustments to put multiple threads
> into one core or split one thread into multiple cores.
The XMOS is a bit special - it has hardware multi-threading. The 32 virtual core device has 4 real cores, each with 8 hardware-threaded virtual cores, so you get one hardware thread per virtual core.
>>> [snip]
>>
>> You are never going to get away from shared memory systems - for some
>> kinds of multi-threaded applications, it is much, much more efficient
>> than message passing. But it would be good if multi-threaded apps used
>> message passing more often, as it is easier to get correct.
>
> What is the issue with shared memory systems? Use unidirectional
> FIFOs between threads in shared memory for the actual message. The
> real issue is how to inform the consuming thread that there is a new
> message available in the FIFO.
That is basically how you make a message passing system when you have shared memory for communication. The challenge for modern systems is making sure that other cpus see the same view of memory as the sending one. It is not enough to simply write the message, then update the head/tail pointers for the FIFO. You have cache coherency, write re-ordering buffers, out-of-order execution in the cpu, etc., as well as compiler re-ordering of writes.

It would be nice to see cpus (or chipsets) having better hardware support for a variety of synchronisation mechanisms, rather than just "flush all previous writes to memory before doing any new writes" instructions. Multi-port and synchronised memory is expensive, but surely it would be possible to have a small amount that could be used for things like mutexes, semaphores, and the control parts of queues.
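To make the ordering problem concrete, here is a minimal single-producer/single-consumer FIFO sketch in C11, using release/acquire operations on the head and tail indices so a consumer can never see an updated index before the data it guards (the element type and size are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 16u                      /* power of two, illustrative size */

typedef struct {
    uint32_t         data[SLOTS];
    _Atomic uint32_t head;             /* written only by the producer */
    _Atomic uint32_t tail;             /* written only by the consumer */
} spsc_fifo_t;

/* Producer: write the payload first, then publish it with a release
   store so the consumer cannot observe the new head before the data
   itself is visible. */
static bool fifo_put(spsc_fifo_t *f, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head - tail == SLOTS)
        return false;                  /* full */
    f->data[head % SLOTS] = value;
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return true;
}

/* Consumer: the acquire load of head pairs with the producer's release
   store, guaranteeing the payload read below is up to date. */
static bool fifo_get(spsc_fifo_t *f, uint32_t *value)
{
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (head == tail)
        return false;                  /* empty */
    *value = f->data[tail % SLOTS];
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return true;
}

On a strongly-ordered machine the release/acquire operations cost little or nothing; on a weakly-ordered one they become exactly the barrier instructions being discussed here.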
On Wed, 09 Aug 2017 10:03:40 +0200, David Brown
<david.brown@hesbynett.no> wrote:

> On 08/08/17 20:07, upsidedown@downunder.com wrote:
>> [snip]
>>
>> What is the issue with shared memory systems? Use unidirectional
>> FIFOs between threads in shared memory for the actual message. The
>> real issue is how to inform the consuming thread that there is a new
>> message available in the FIFO.
>
> That is basically how you make a message passing system when you have
> shared memory for communication. The challenge for modern systems is
> making sure that other cpus see the same view of memory as the sending
> one. It is not enough to simply write the message, then update the
> head/tail pointers for the FIFO. You have cache coherency, write
> re-ordering buffers, out-of-order execution in the cpu, etc., as well as
> compiler re-ordering of writes.
Sure, you have to put the pointers into non-cached memory or into write-through cache, or use some explicit instruction to perform a cache write-back. The problem is the granularity of the cache, typically at least a cache line or a virtual memory page.

While "volatile" just affects code generation, it would be nice to have e.g. a "no_cache" keyword to affect run-time execution and cache handling. This would put these variables into special program sections and let the linker put all variables requiring "no_cache" into the same cache line or virtual memory page. The actual implementation could then vary according to the hardware.

If some specific shared data is defined as having a single producer thread (with full R/W access) and multiple consumer threads (with read-only access) in a write-back cache system, the producer would trigger a write-through after each update, while each consumer would invalidate the cache before any read access, forcing a cache reload before using the data. The source code would be identical in both the producer and the consumer threads, but separate binary code could be compiled for the producer and the consumers.
> It would be nice to see cpus (or chipsets) having better hardware
> support for a variety of synchronisation mechanisms, rather than just
> "flush all previous writes to memory before doing any new writes"
> instructions.
Is that really such a bad limitation?
> Multi-port and synchronised memory is expensive, but
> surely it would be possible to have a small amount that could be used
> for things like mutexes, semaphores, and the control parts of queues.
Any system with memory-mapped I/O registers must have a mechanism that disables caching for those peripheral I/O registers. Extending this to some RAM locations should be helpful.

---

BTW, discussing massively parallel systems with shared memory resembles the memory-mapped file usage in some big database engines.

In these systems, big (up to terabytes) files are mapped into the virtual address space. After that, each byte in each memory-mapped file is accessed just as a huge (terabyte) array of bytes (or of some structured type) by simple assignment statements. With files larger than a few hundred megabytes, a 64-bit processor architecture is really nice to have :-)

The OS handles loading a segment from the physical disk file into memory using the normal page-fault loading and writeback mechanism. Instead of accessing the page file, the mechanism accesses the user database files. Thus you can think of the physical disks as the real memory and the computer's main memory as the L4 cache. Since the main memory is just one level in the cache hierarchy, there are similar cache consistency issues as with other cached systems. In transaction processing, typically some Commit/Rollback scheme is used.

I guess that when designing products around these massively parallel chips, studying the cache consistency tricks used by memory-mapped database file systems might be helpful.
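For reference, the memory-mapped file technique described above looks roughly like this with the POSIX mmap() call (error handling omitted for brevity; the file name is illustrative):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("records.db", O_RDWR);       /* illustrative file name */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; the OS pager moves data between disk and RAM,
       so the file is then used like one big in-memory array. */
    uint8_t *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

    base[0] ^= 1;                      /* a plain assignment touches the file */
    msync(base, st.st_size, MS_SYNC);  /* roughly a "commit" of dirty pages */

    munmap(base, st.st_size);
    close(fd);
    return 0;
}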
On 10/08/17 13:30, upsidedown@downunder.com wrote:
> On Wed, 09 Aug 2017 10:03:40 +0200, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> [snip]
>>
>> That is basically how you make a message passing system when you have
>> shared memory for communication. The challenge for modern systems is
>> making sure that other cpus see the same view of memory as the sending
>> one. It is not enough to simply write the message, then update the
>> head/tail pointers for the FIFO. You have cache coherency, write
>> re-ordering buffers, out-of-order execution in the cpu, etc., as well as
>> compiler re-ordering of writes.
>
> Sure, you have to put the pointers into non-cached memory or into
> write-through cache, or use some explicit instruction to perform a
> cache write-back.
You also need the data pointed to to be in coherent memory of some sort (or to synchronise it explicitly). It does not help if another processor sees the "data ready" flag become active before the data itself is visible!
> The problem is the granularity of the cache, typically at least a
> cache line or a virtual memory page.
No, that is rarely an issue. Most SMP systems have cache snooping for consistency. It /is/ a problem on non-uniform multi-processing systems. (And cache lines can lead to cache line thrashing, which is a performance problem but not a correctness problem.)
> > While "volatile" just affects code generation, it would be nice to > have a e.g. "no_cache" keyword to affect run time execution and cache > handling. This would put these variables into special program sections > and let the linker put all variables requiring "no_cache" into the > same cache line or virtual memory page. The actual implementation > could then vary according to hardware implementation.
That sounds like a disaster for coupling compilers, linkers, OS's, and processor MMU setups. I don't see this happening automatically. Doing so /manually/ - giving explicit sections to variables, and explicitly configuring an MMU / MPU to make a particular area of the address space non-cached is fine. I have done it myself on occasion. But that's different from trying to make it part of the standard language.
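A sketch of that manual approach, using the widely supported GCC section attribute; the section name, the linker region and the separate MPU configuration step are assumptions about the particular target setup:

#include <stdint.h>

/* Place the queue control block in a dedicated section.  A matching
   entry in the linker script puts ".uncached" into a memory region that
   the startup code configures as non-cacheable in the MPU/MMU.
   Illustrative linker script fragment:

       .uncached (NOLOAD) : { *(.uncached) } > SRAM_NOCACHE
*/
typedef struct {
    volatile uint32_t head;
    volatile uint32_t tail;
} queue_ctrl_t;

__attribute__((section(".uncached"))) queue_ctrl_t rx_queue_ctrl;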
> If some specific shared data is defined as having a single producer
> thread (with full R/W access) and multiple consumer threads (with
> read-only access) in a write-back cache system, the producer would
> trigger a write-through after each update, while each consumer would
> invalidate the cache before any read access, forcing a cache reload
> before using the data. The source code would be identical in both the
> producer and the consumer threads, but separate binary code could be
> compiled for the producer and the consumers.
That's what atomic access modes and fences are for in C11/C++11.
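For instance, the write-through/invalidate pairing described above maps onto C11 roughly as a release fence on the producer side and an acquire fence on the consumer side (a minimal sketch; the variable names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

static int data_value;       /* ordinary shared data, illustrative */
static atomic_bool ready;    /* static storage, so zero-initialised to false */

void producer(int v)
{
    data_value = v;                                 /* plain write */
    atomic_thread_fence(memory_order_release);      /* order it before the flag */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

bool consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);      /* order the flag before the data read */
    *out = data_value;
    return true;
}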
>> It would be nice to see cpus (or chipsets) having better hardware
>> support for a variety of synchronisation mechanisms, rather than just
>> "flush all previous writes to memory before doing any new writes"
>> instructions.
>
> Is that really such a bad limitation?
For big SMP systems like modern x86 or PPC chips? Yes, it is - these barriers can cost hundreds of cycles of delay. And if you want sequentially consistent barriers (not just acquire/release), so that all cores see the same order of memory, you need a broadcast that makes /all/ cores stop and flush all their write queues. (Cache lines don't need to be flushed - cache snooping takes care of that already.)

I have used a microcontroller with a dedicated "semaphore" peripheral block. It was very handy, and very efficient for synchronising between the two cores.
>> Multi-port and synchronised memory is expensive, but
>> surely it would be possible to have a small amount that could be used
>> for things like mutexes, semaphores, and the control parts of queues.
>
> Any system with memory-mapped I/O registers must have a mechanism that
> disables caching for those peripheral I/O registers. Extending this to
> some RAM locations should be helpful.
Agreed. But that ram would, in practice, be best implemented as a separate block of fast ram independent from the main system ram. For embedded systems, a bit of on-chip static ram would make sense. And note that it is /not/ enough to be uncached - you also need to make sure that writes are done in order, and that reads are not done speculatively or out of order.
> BTW, discussing massively parallel systems with shared memory
> resembles the memory-mapped file usage in some big database engines.
>
> [snip]
>
> Thus you can think of the physical disks as the real memory and the
> computer's main memory as the L4 cache. Since the main memory is just
> one level in the cache hierarchy, there are similar cache consistency
> issues as with other cached systems. In transaction processing,
> typically some Commit/Rollback scheme is used.
There is some saying about any big enough problem in computing being just an exercise in caching, but I forget the exact quotation. Serious caching systems are very far from easy to make, ensuring correctness, convenient use, and efficiency.
> I guess that when designing products around these massively parallel
> chips, studying the cache consistency tricks used by memory-mapped
> database file systems might be helpful.
Indeed.
On 2017-08-10 9:11 AM, David Brown wrote:
> That sounds like a disaster for coupling compilers, linkers, OS's, and
> processor MMU setups. I don't see this happening automatically. Doing
> so /manually/ - giving explicit sections to variables, and explicitly
> configuring an MMU / MPU to make a particular area of the address space
> non-cached is fine. I have done it myself on occasion. But that's
> different from trying to make it part of the standard language.
A couple of comments on this. When compiling for multiple processors, I have used named address spaces (IEC/ISO 18037) to define private and shared space. The nice part of that is that applications can start out running on a single platform and then be split later with minimal impact on the source code. Admittedly I have done this on non-MMU systems.

I have linked across multiple processors, including cases of heterogeneous processors.

Another comment, about inter-processor communication: we found out a long time ago that dual- or multi-port memory is not that much of an advantage in most applications. The data rate can actually be quite low. We have done quite a few consumer electronics packages with serial data well below a megabit, some as low as 8 kbit/s. It creates skew between processor execution, but generally has very limited impact on application function or performance.

w..
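A hedged sketch of what that can look like in source code; the __shared and __private space names (and the build guard) are hypothetical, since TR 18037 leaves the actual address-space identifiers to the implementation:

#include <stdint.h>

/* Hypothetical address-space qualifiers in the TR 18037 style, where a
   space name acts like a type qualifier.  On a single-processor build
   they can simply expand to nothing. */
#ifndef MULTIPROC_BUILD                /* hypothetical build guard */
#define __shared   /* implementation-defined space: visible to all processors */
#define __private  /* implementation-defined space: local to this processor   */
#endif

__shared volatile uint16_t rx_count;   /* shared counter, illustrative */
__private uint16_t scratch[32];        /* per-processor scratch buffer */

void poll(void)
{
    /* The same source can first run on one processor and later be split,
       with the compiler choosing the access sequence for each space. */
    if (rx_count != 0u) {
        scratch[0] = rx_count;
    }
}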
On 17/08/17 00:39, Walter Banks wrote:
> On 2017-08-10 9:11 AM, David Brown wrote:
>> [snip]
>
> A couple of comments on this. When compiling for multiple processors, I
> have used named address spaces (IEC/ISO 18037) to define private and
> shared space.
"IEC/ISO 18037" completely misses the point, and is a disaster for the world of embedded C programming. It is an enormous disappointment to anyone who programs small embedded systems in C, and it is no surprise that compiler implementers have almost entirely ignored it in the 15 years of its existence. Named address spaces are perhaps the only interesting and useful idea there, but the TR does not cover user-definable address spaces properly.
> The nice part of that is that applications can start out running on a
> single platform and then be split later with minimal impact on the
> source code.
>
> Admittedly I have done this on non-MMU systems.
On some systems, such a "no_cache" keyword/attribute is entirely possible. My comment is not that this would not be a useful thing, but that it could not be a part of the standard C language.

For example, on the Nios processor (Altera's soft cpu for their FPGAs - and I don't remember if this was just for the original Nios or the Nios II), the highest bit of an address was used to indicate "no cache, no reordering", but it was otherwise unused for address decoding. When you made a volatile access, the compiler ensured that the highest bit of the address was set. On that processor, implementing a "no_cache" keyword would be easy - it was already done for "volatile".

But on a processor that has an MMU? It would be a serious problem. And how would you handle casts to a no_cache pointer? Casting a pointer to normal data into a pointer to volatile is an essential operation in lots of low-level code. (It is implementation-defined behaviour, but works "as expected" in all compilers I have heard of.)

So for some processors, "no_cache" access is easy. For some, it would require support from the linker (or at least linker scripts) and MMU setup, but have no possibility for casts. For others, memory barrier instructions and cache flush instructions would be the answer. On larger processors, that could quickly become /very/ expensive - much more so than an OS call to get some uncached memory (dma_alloc_coherent() on Linux, for example).

Uncached accesses cannot be implemented sensibly or efficiently in the same way on different processors, and on some systems they cannot be done at all. The concept of a cache is alien to the C standards. Any code that might need uncached memory is inherently low-level and highly system-dependent. It is therefore a concept that has no place in the C standards, even though it is a feature that could be very useful in many specific implementations for specific targets. A great thing about C is that there is no problem having such implementation-specific features and extensions.
> I have linked across multiple processors, including cases of
> heterogeneous processors.
>
> Another comment, about inter-processor communication: we found out a
> long time ago that dual- or multi-port memory is not that much of an
> advantage in most applications. The data rate can actually be quite
> low. We have done quite a few consumer electronics packages with serial
> data well below a megabit, some as low as 8 kbit/s. It creates skew
> between processor execution, but generally has very limited impact on
> application function or performance.
>
> w..
On 2017-08-17 3:37 AM, David Brown wrote:
> "IEC/ISO 18037" completely misses the point, and is a disaster for > the world of embedded C programming. It is an enormous > disappointment to anyone who programs small embedded systems in C, > and it is no surprise that compiler implementers have almost entirely > ignored it in the 15 years of its existence. Named address spaces > are perhaps the only interesting and useful idea there, but the TR > does not cover user-definable address spaces properly.
Guilty - I wrote the section of 18037 on named address spaces, based on our use in consumer applications and earlier WG14 papers. We extended the named address space material to also include processor named spaces (N1351, N1386).

The fixed point material in 18037 is, in my opinion, reasonable.

We use both of these a lot, especially in programming the massively parallel ISAs I have been working on in the last few years.

w..
On 17/08/17 14:24, Walter Banks wrote:
> On 2017-08-17 3:37 AM, David Brown wrote:
>> "IEC/ISO 18037" completely misses the point, and is a disaster for
>> the world of embedded C programming. [snip] Named address spaces
>> are perhaps the only interesting and useful idea there, but the TR
>> does not cover user-definable address spaces properly.
>
> Guilty - I wrote the section of 18037 on named address spaces, based on
> our use in consumer applications and earlier WG14 papers.
>
> We extended the named address space material to also include processor
> named spaces (N1351, N1386).
I don't know the details of these different versions of the papers. I have the 2008 draft of ISO/IEC TR 18037:2008 in front of me. With all due respect to your work and experience here, I have a good deal of comments on this paper. Consider it constructive criticism, born of frustration at a major missed opportunity. In summary, TR 18037 is much like EC++ - a nice idea when you look at the title, but an almost total waste of time for everyone except compiler company marketing droids.

The basic idea of named address spaces that are syntactically like const and volatile qualifiers is, IMHO, a good plan. For an example usage, look at the gcc support for "__flash" address spaces in the AVR port of gcc:

<https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html>

The AVR needs different instructions for accessing data in flash and ram, and address spaces provide a neater and less error-prone solution than macros or function calls for flash data access. So far, so good - and if that is your work, then well done. The actual text of the document could, IMHO, benefit from a more concrete example usage of address spaces (such as for flash access, as that is likely to be a very popular usage).

The register storage class stuff, however, is not something I would like to see in the C standards. If I wanted to mess with specific cpu registers such as flag registers, I would be programming in assembly. C is /not/ assembly - we use C so that we don't have to use assembly. There may be a few specific cases of particularly awkward processors for which it is occasionally useful to have direct access to flag bits - those are very much in the minority, and they are getting more so as painful architectures like the COP8 and PIC16 are dropped in favour of C-friendly processors. It is absolutely fine to put support for condition code registers (or whatever) into compilers as target extensions. I can especially see how it can help compiler implementers write support libraries in C rather than assembly. But it is /not/ something to clutter up the C standards or for general embedded C usage.

The disappointing part of named address spaces is in Annex B.1. It is tantalisingly close to allowing user-defined address spaces with specific features such as neat access to data stored in other types of memory. But it is missing all the detail needed to make it work, how and when it could be used, examples, and all the thought into how it would interplay with other features of the language.

It also totally ignores some major issues that are very contrary to the spirit and philosophy of C. When writing C, one expects "x = 1;" to operate immediately as a short sequence of instructions, or even to be removed altogether by the compiler optimiser. With a user-defined address space, such as an SPI eeprom mapping, this could take significant time, it could interact badly with other code (such as another thread or an interrupt that is also accessing the SPI bus), it could depend on setup outside the control of the compiler, and it could fail. You need to think long and hard about whether this is something desirable in a C compiler - it would mean giving up the kind of transparency and low-level predictability that are some of the key reasons people choose C over C++ for such work. If the convenience of being able to access different types of data in the same way in code is worth it, then these issues must be made clear and the mechanisms developed; if not, then the idea should be dropped. A half-written, half-thought-out annex is not the answer.
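For concreteness, the AVR usage referred to above looks like this with avr-gcc's __flash named address space (a minimal sketch; see the linked gcc documentation for the exact requirements):

#include <stdint.h>

/* The table lives in flash and is read with LPM instructions, without
   progmem macros or accessor functions - the __flash qualifier tells
   the compiler which access sequence to generate.  Values are
   illustrative. */
static const __flash uint8_t sine_table[4] = { 0u, 50u, 98u, 142u };

uint8_t sine_sample(uint8_t idx)
{
    /* An ordinary-looking array access; the qualifier does the work. */
    return sine_table[idx & 3u];
}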
One point that is mentioned in Annex B is specific little-endian and big-endian access. This is a missed opportunity for the TR: qualifiers giving explicit endianness to a type would be extremely useful, completely independently of the named address space concept. Such qualifiers would be simple to implement on all but the weirdest of hardware platforms, and would be massively useful in embedded programming.
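In the absence of such qualifiers, the usual portable workaround is explicit byte-order accessors like the ones below - exactly the boilerplate an endianness qualifier would let the compiler generate on every access:

#include <stdint.h>

/* Read a 32-bit big-endian value from a byte buffer (e.g. a network
   header or a register map defined as big-endian), independent of the
   host's native byte order. */
static inline uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static inline void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}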
> The fixed point material in 18037 is, in my opinion, reasonable.
No, it is crap. Look at C99, and what it gave us over C90. One vital feature that made a huge difference to embedded programming is <stdint.h> with fixed-size integer types. There is no longer any need for every piece of embedded C software, every library, every RTOS, to define its own types u16, u16t, uint_16_t, uWORD, RTOS_u16, and whatever. Now we can write uint16_t and be done with it.

Then someone has come along and written this TR with a total disregard for this. So /if/ this support gets widely implemented, and /if/ people start using it, what types will people use? Either they will use "signed long _Fract" and friends, making for unreadable code due to the long names and having undocumented target-specific assumptions that make porting an error-prone disaster, or we are going to see a proliferation of fract15_t, Q31, fp0_15, and a dozen different incompatible variations. If this was going to be of any use, a set of specific, fixed-size type names should have been defined from day one. The assorted _Fract and _Accum types are /useless/. They should not exist. My suggestion for a naming convention would be uint0q16_t, int7q8_t, etc., for the number of bits before and after the binary point. Implementations should be free to implement those that they can handle efficiently, and drop any that they cannot - but there should be no ambiguity. This would also avoid the next point - C99 was well established before the TR was written, so what about the "long long" versions, for completeness? Of course, with a sensible explicit naming scheme, as many different types as you want could exist.

Then there is the control of overflow. It is one thing to say saturation would be a nice idea - but it is absolutely, totally and completely /wrong/ to allow this to be controllable by a pragma. Explicit in the type - yes, that's fine. Implicit, based on what preprocessing directives happen to have been passed before that bit of the source code is translated? Absolutely /not/. Equally, pragmas for precision and rounding - in fact, pragmas in general - are a terrible idea. Should the types behave differently in different files of the same code?

Next up - fixed point constants. Hands up all those who think it is intuitive that 0.5uk obviously means an "unsigned _Accum" constant? Write it as "(uint15q16_t) 0.5" instead - make it clear and explicit. The fixed point constant suffixes exist purely because someone thought there should be suffixes and picked some letters out of their hat. Oh, and for extra fun, let's make these suffixes subtly different from the conversion specifiers for printf - you remember, that function that is already too big, slow and complicated for many embedded C systems.

Then there is the selection of functions in <stdfix.h>. We have type-generic maths support in C99. There is no place for individual functions like abshr, abslr, abshk, abslk - a single type-generic absfx would do the job. We don't /need/ these underlying functions. The implementation may have them, but C programmers don't need to see that mess. Hide it away as implementation details. That would leave everything much simpler to describe, and much simpler to use, and mean it would work with explicit names for the types.

And in the thirteen years between this TR first being published and today, when implementations are still rare, incomplete and inefficient, we have gained microcontrollers that will do floating point quickly for under a dollar. Fixed point is rapidly becoming of marginal use, or even irrelevant.
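To illustrate the kind of explicit, fixed-size naming argued for above, this is roughly what it looks like when spelled out in plain C today (the type alias follows the naming suggestion in the post; the saturating multiply is a sketch):

#include <stdint.h>

typedef int16_t int0q15_t;   /* 1 sign bit, 15 fraction bits, per the suggested naming */

/* Saturating Q15 x Q15 -> Q15 multiply: widen, shift out the extra
   fraction bits with rounding, then clamp to the representable range. */
static int0q15_t mul_q15(int0q15_t a, int0q15_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b + (1 << 14)) >> 15;
    if (p >  32767) p =  32767;
    if (p < -32768) p = -32768;
    return (int0q15_t)p;
}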
As for the hardware I/O stuff, the less said about that the better. It will /never/ be used. It has no benefits over the system used almost everywhere today - volatile accesses through cast constant addresses.

The TR has failed to give the industry anything that embedded C programmers need, it has made suggestions that are worse than useless, and by putting in so much that is not helpful it has delayed any hope of implementation and standardisation for the ideas that might have been helpful.
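That idiom - a constant address cast to a pointer-to-volatile - is the familiar pattern below (the peripheral, addresses and bit position are illustrative):

#include <stdint.h>

/* Conventional hardware I/O in C today: a constant address cast to a
   pointer-to-volatile, so every access really hits the register. */
#define UART0_BASE   0x40004000u                        /* illustrative address */
#define UART0_DATA   (*(volatile uint32_t *)(UART0_BASE + 0x00u))
#define UART0_STATUS (*(volatile uint32_t *)(UART0_BASE + 0x04u))
#define UART0_TX_RDY (1u << 5)                          /* illustrative bit */

static void uart0_putc(char c)
{
    while ((UART0_STATUS & UART0_TX_RDY) == 0u) {
        /* busy-wait until the transmitter can take another byte */
    }
    UART0_DATA = (uint32_t)(uint8_t)c;
}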
> We use both of these a lot, especially in programming the massively
> parallel ISAs I have been working on in the last few years.
Implementation-specific extensions are clearly going to be useful for odd architectures like this. It is the attempt at standardisation in the TR that is a total failure.
