Engineering degree for embedded systems

Started by hogwarts July 27, 2017
On 08/08/17 16:11, David Brown wrote:
> On 08/08/17 16:56, Tom Gardner wrote:
>> On 08/08/17 11:56, David Brown wrote:
>>> On 08/08/17 12:09, Tom Gardner wrote:
>>>> On 08/08/17 10:26, David Brown wrote:
>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
>>>
>>> The xCORE is a bit different, as is the language you use and the style
>>> of the code. Message passing is a very neat way to swap data between
>>> threads or cores, and is inherently safer than shared memory.
>>
>> Well, you can program xCOREs in C/C++, but I haven't
>> investigated that on the principle that I want to "kick
>> the tyres" of xC.
>>
>> ISTR seeing that the "interface" mechanisms in xC are
>> shared memory underneath, optionally involving memory
>> copies. That is plausible since xC interfaces have
>> "asynchronous nonblocking" "notify" and "clear
>> notification" annotations on methods. Certainly they
>> are convenient to use and get around some pain points
>> in pure CSP message passing.
>
> The actual message passing can be done in several ways. IIRC, it will
> use shared memory within the same cpu (8 logical cores), and channels
> ("real" message passing) between cpus.
>
> However, as long as it logically uses message passing then it is up to
> the tools to get the details right - it frees the programmer from having
> to understand about ordering, barriers, etc.
Just so. I'm pretty sure:
- all "pure CSP" message passing uses the xSwitch fabric;
- the xC interfaces use shared memory between cores on the same tile;
- whereas across different tiles they bundle up a memory copy and
  transmit that as messages across the xSwitch fabric.
I can't think of a simpler/better way of achieving the desired external
behaviour.
>>> In general, I agree. In this particular case, the Alpha is basically
>>> obsolete - but it is certainly possible that future cpu designs would
>>> have equally weak memory models. Such a weak model is easier to make
>>> faster in hardware - you need less synchronisation, cache snooping, and
>>> other such details.
>>
>> Reasonable, but given the current fixation on the mirage
>> of globally-coherent memory, I wonder whether that is a
>> lost cause.
>>
>> Sooner or later people will have to come to terms with
>> non-global memory and multicore processing and (preferably)
>> message passing. Different abstractions and tools /will/
>> be required. Why not start now, from a good sound base?
>> Why hobble next-gen tools with last-gen problems?
>
> That is /precisely/ the point - if you view it from the other side. A
> key way to implement message passing is to use shared memory underneath
> - but you isolate the messy details from the ignorant programmer. If
> you have written the message passing library correctly, using features
> such as "consume" orders, then the high-level programmer can think of
> passing messages while the library and the compiler conspire to give
> optimal correct code even on very weak memory model cpus.
>
> You are never going to get away from shared memory systems - for some
> kinds of multi-threaded applications, it is much, much more efficient
> than message passing. But it would be good if multi-threaded apps used
> message passing more often, as it is easier to get correct.
Oh dear. Violent agreement. How boring.
On Tue, 08 Aug 2017 17:11:22 +0200, David Brown
<david.brown@hesbynett.no> wrote:

> On 08/08/17 16:56, Tom Gardner wrote:
>> On 08/08/17 11:56, David Brown wrote:
>>> On 08/08/17 12:09, Tom Gardner wrote:
>>>> On 08/08/17 10:26, David Brown wrote:
>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
When there are a large number of cores/processors available, I would start a project by assigning a thread/process for each core. Later on you might have to do some fine adjustments to put multiple threads into one core or split one thread into multiple cores.
>> [snip]
>>
>> I'm currently in two minds as to whether I like
>> any departure from CSP purity :)
>>
>>>>> There is one "suboptimality" - the "consume" memory order. It's a bit
>>>>> weird, in that it is mainly relevant to the Alpha architecture, whose
>>>>> memory model is so weak that in "x = *p;" it can fetch the contents of
>>>>> *p before seeing the latest update of p. Because the C11 and C++11
>>>>> specs are not clear enough on "consume", all implementations (AFAIK)
>>>>> bump this up to the stronger "acquire", which may be slightly slower on
>>>>> some architectures.
>>>>
>>>> One of C/C++'s problems is deciding to cater for, um,
>>>> weird and obsolete architectures. I see /why/ they do
>>>> that, but on Mondays, Wednesdays and Fridays I'd prefer
>>>> a concentration on doing common architectures simply
>>>> and well.
>>>
>>> [snip]
>
> That is /precisely/ the point - if you view it from the other side. A
> key way to implement message passing is to use shared memory underneath
> - but you isolate the messy details from the ignorant programmer. If
> you have written the message passing library correctly, using features
> such as "consume" orders, then the high-level programmer can think of
> passing messages while the library and the compiler conspire to give
> optimal correct code even on very weak memory model cpus.
>
> You are never going to get away from shared memory systems - for some
> kinds of multi-threaded applications, it is much, much more efficient
> than message passing. But it would be good if multi-threaded apps used
> message passing more often, as it is easier to get correct.
What is the issue with shared memory systems? Use unidirectional FIFOs between threads in shared memory for the actual message. The real issue is how to inform the consuming thread that there is a new message available in the FIFO.
>>>> I'm disappointed that thread support might not be as
>>>> useful as desired, but the memory model and atomics are more
>>>> important.
>>>
>>> The trouble with thread support in C11/C++11 is that it is limited to
>>> very simple features - mutexes, condition variables and simple threads.
>>> But real-world use needs priorities, semaphores, queues, timers, and
>>> many other features. Once you are using RTOS-specific APIs for all
>>> these, you would use the RTOS APIs for threads and mutexes as well,
>>> rather than <threads.h> calls.
>>
>> That makes a great deal of sense to me, and it
>> brings into question how much it is worth bothering
>> about it in C/C++. No doubt I'll come to my senses
>> before too long :)
On Mon, 7 Aug 2017 20:09:23 -0500, Les Cargill
<lcargill99@comcast.com> wrote:

> upsidedown@downunder.com wrote:
>> On Sun, 6 Aug 2017 09:53:55 -0500, Les Cargill
>> <lcargill99@comcast.com> wrote:
>>
>>>> I have often wondered what this IoT hype is all about. It seems to be
>>>> very similar to the PLC (Programmable Logic Controller) used for
>>>> decades.
>>>
>>> Similar. But PLCs are pointed more at ladder logic for use in
>>> industrial settings. You generally cannot, for example, write a socket
>>> server that just does stuff on a PLC; you have to stay inside a dev
>>> framework that cushions it for you.
>>
>> In IEC-1131 (now IEC 61131-3) you can enter the program in the format
>> you are mostly familiar with, such as ladder logic or structured text
>> (ST), which is similar to Modula (and somewhat resembles Pascal) with
>> normal control structures.
>
> It may resemble Pascal, but it's still limited in what it can do. It's
> good enough for ... 90% of things that will need to be done, but I live
> outside that 90% myself.
At least in the CoDeSys implementation of IEC 1131, it is easy to write low-level functions in C, e.g. for setting up hardware registers, handling ISRs and so on. Just publish suitable "hooks" that can be used by the ST code, and which can then be accessed from function blocks or ladder logic. In large projects, different people can work on the various abstraction layers. Once these hooks (written in C etc.) are well defined, people familiar with ST or the other IEC 1131 languages can build their own applications. I wrote some hooks in C at the turn of the century and have not needed to touch them since; all the new functionality could be implemented by other people who were more familiar with IEC 1131.
>> IEC-1131 has been available for two decades.
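As a rough illustration of the kind of C "hook" described above (the register addresses, bit layout and function name below are all hypothetical - the actual glue for exporting a C function to ST code is tool-specific):

#include <stdint.h>

/* Hypothetical memory-mapped timer peripheral; addresses and bit layout
   are illustrative only. */
#define TIMER0_CTRL    (*(volatile uint32_t *)0x40010000u)
#define TIMER0_RELOAD  (*(volatile uint32_t *)0x40010004u)

/* Low-level hook written in C.  The IEC 61131-3 tool would import this
   as an external function, so ST or ladder code can start a periodic
   timer without knowing anything about the register layout. */
int32_t StartPeriodicTimer(uint32_t period_us)
{
    if (period_us == 0u)
        return -1;                 /* reject a nonsense argument */
    TIMER0_RELOAD = period_us;     /* assumes a 1 MHz timer clock */
    TIMER0_CTRL   = 1u;            /* enable bit, illustrative */
    return 0;
}

On the ST side this would then be declared as an external function and called from a function block; the exact declaration syntax depends on the tool.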
On 08/08/17 20:07, upsidedown@downunder.com wrote:
> On Tue, 08 Aug 2017 17:11:22 +0200, David Brown
> <david.brown@hesbynett.no> wrote:
>
>>>>>> Consider single 32 core MCUs for £25 one-off. (xCORE)
>
> When there are a large number of cores/processors available, I would
> start a project by assigning a thread/process for each core. Later on
> you might have to do some fine adjustments to put multiple threads
> into one core or split one thread into multiple cores.
The XMOS is a bit special - it has hardware multi-threading. The 32 virtual core device has 4 real cores, each with 8 hardware-threaded virtual cores, so you get one hardware thread per virtual core.
>>> [snip]
>>
>> You are never going to get away from shared memory systems - for some
>> kinds of multi-threaded applications, it is much, much more efficient
>> than message passing. But it would be good if multi-threaded apps used
>> message passing more often, as it is easier to get correct.
>
> What is the issue with shared memory systems? Use unidirectional
> FIFOs between threads in shared memory for the actual message. The
> real issue is how to inform the consuming thread that there is a new
> message available in the FIFO.
That is basically how you make a message passing system when you have shared memory for communication. The challenge for modern systems is making sure that other cpus see the same view of memory as the sending one. It is not enough to simply write the message, then update the head/tail pointers for the FIFO. You have cache coherency, write re-ordering buffers, out-of-order execution in the cpu, etc., as well as compiler re-ordering of writes.

It would be nice to see cpus (or chipsets) having better hardware support for a variety of synchronisation mechanisms, rather than just "flush all previous writes to memory before doing any new writes" instructions. Multi-port and synchronised memory is expensive, but surely it would be possible to have a small amount that could be used for things like mutexes, semaphores, and the control parts of queues.
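To make the ordering problem concrete, here is a minimal single-producer/single-consumer FIFO sketch in C11, using release/acquire operations on the head and tail indices so a consumer can never see an updated index before the data it guards (the element type and size are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 16u                      /* power of two, illustrative size */

typedef struct {
    uint32_t         data[SLOTS];
    _Atomic uint32_t head;             /* written only by the producer */
    _Atomic uint32_t tail;             /* written only by the consumer */
} spsc_fifo_t;

/* Producer: write the payload first, then publish it with a release
   store so the consumer cannot observe the new head before the data
   itself is visible. */
static bool fifo_put(spsc_fifo_t *f, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head - tail == SLOTS)
        return false;                  /* full */
    f->data[head % SLOTS] = value;
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return true;
}

/* Consumer: the acquire load of head pairs with the producer's release
   store, guaranteeing the payload read below is up to date. */
static bool fifo_get(spsc_fifo_t *f, uint32_t *value)
{
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (head == tail)
        return false;                  /* empty */
    *value = f->data[tail % SLOTS];
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return true;
}

On a strongly-ordered machine the release/acquire operations cost little or nothing; on a weakly-ordered one they become exactly the barrier instructions being discussed here.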
On Wed, 09 Aug 2017 10:03:40 +0200, David Brown
<david.brown@hesbynett.no> wrote:

> On 08/08/17 20:07, upsidedown@downunder.com wrote:
>> [snip]
>>
>> What is the issue with shared memory systems? Use unidirectional
>> FIFOs between threads in shared memory for the actual message. The
>> real issue is how to inform the consuming thread that there is a new
>> message available in the FIFO.
>
> That is basically how you make a message passing system when you have
> shared memory for communication. The challenge for modern systems is
> making sure that other cpus see the same view of memory as the sending
> one. It is not enough to simply write the message, then update the
> head/tail pointers for the FIFO. You have cache coherency, write
> re-ordering buffers, out-of-order execution in the cpu, etc., as well as
> compiler re-ordering of writes.
Sure, you have to put the pointers into non-cached memory or into write-through cache, or use some explicit instruction to perform a cache write-back. The problem is the granularity of the cache, typically at least a cache line or a virtual memory page.

While "volatile" just affects code generation, it would be nice to have e.g. a "no_cache" keyword to affect run-time execution and cache handling. This would put these variables into special program sections and let the linker put all variables requiring "no_cache" into the same cache line or virtual memory page. The actual implementation could then vary according to the hardware.

If some specific shared data is defined as having a single producer thread (with full R/W access) and multiple consumer threads (with read-only access) in a write-back cache system, the producer would trigger a write-through after each update, while each consumer would invalidate the cache before any read access, forcing a cache reload before using the data. The source code would be identical in both the producer and the consumer threads, but separate binary code could be compiled for the producer and the consumers.
> It would be nice to see cpus (or chipsets) having better hardware
> support for a variety of synchronisation mechanisms, rather than just
> "flush all previous writes to memory before doing any new writes"
> instructions.
Is that really such a bad limitation?
> Multi-port and synchronised memory is expensive, but
> surely it would be possible to have a small amount that could be used
> for things like mutexes, semaphores, and the control parts of queues.
Any system with memory-mapped I/O registers must have a mechanism that disables caching for those peripheral I/O registers. Extending this to some RAM locations should be helpful.

---

BTW, discussing massively parallel systems with shared memory resembles the memory-mapped file usage in some big database engines.

In these systems, big (up to terabytes) files are mapped into the virtual address space. After that, each byte in each memory-mapped file is accessed just as a huge (terabyte) array of bytes (or of some structured type) by simple assignment statements. With files larger than a few hundred megabytes, a 64-bit processor architecture is really nice to have :-)

The OS handles loading a segment from the physical disk file into memory using the normal page-fault loading and writeback mechanism. Instead of accessing the page file, the mechanism accesses the user database files. Thus you can think of the physical disks as the real memory and the computer's main memory as the L4 cache. Since the main memory is just one level in the cache hierarchy, there are similar cache consistency issues as with other cached systems. In transaction processing, typically some Commit/Rollback scheme is used.

I guess that when designing products around these massively parallel chips, studying the cache consistency tricks used by memory-mapped database file systems might be helpful.
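For reference, the memory-mapped file technique described above looks roughly like this with the POSIX mmap() call (error handling omitted for brevity; the file name is illustrative):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("records.db", O_RDWR);       /* illustrative file name */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; the OS pager moves data between disk and RAM,
       so the file is then used like one big in-memory array. */
    uint8_t *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

    base[0] ^= 1;                      /* a plain assignment touches the file */
    msync(base, st.st_size, MS_SYNC);  /* roughly a "commit" of dirty pages */

    munmap(base, st.st_size);
    close(fd);
    return 0;
}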
On 10/08/17 13:30, upsidedown@downunder.com wrote:
> On Wed, 09 Aug 2017 10:03:40 +0200, David Brown
> <david.brown@hesbynett.no> wrote:
>
>> [snip]
>>
>> That is basically how you make a message passing system when you have
>> shared memory for communication. The challenge for modern systems is
>> making sure that other cpus see the same view of memory as the sending
>> one. It is not enough to simply write the message, then update the
>> head/tail pointers for the FIFO. You have cache coherency, write
>> re-ordering buffers, out-of-order execution in the cpu, etc., as well as
>> compiler re-ordering of writes.
>
> Sure, you have to put the pointers into non-cached memory or into
> write-through cache, or use some explicit instruction to perform a
> cache write-back.
You also need the data pointed to to be in coherent memory of some sort (or to synchronise it explicitly). It does not help if another processor sees the "data ready" flag become active before the data itself is visible!
> The problem is the granularity of the cache, typically at least a
> cache line or a virtual memory page.
No, that is rarely an issue. Most SMP systems have cache snooping for consistency. It /is/ a problem on non-uniform multi-processing systems. (And cache lines can lead to cache line thrashing, which is a performance problem but not a correctness problem.)
> > While "volatile" just affects code generation, it would be nice to > have a e.g. "no_cache" keyword to affect run time execution and cache > handling. This would put these variables into special program sections > and let the linker put all variables requiring "no_cache" into the > same cache line or virtual memory page. The actual implementation > could then vary according to hardware implementation.
That sounds like a disaster for coupling compilers, linkers, OS's, and processor MMU setups. I don't see this happening automatically. Doing so /manually/ - giving explicit sections to variables, and explicitly configuring an MMU / MPU to make a particular area of the address space non-cached is fine. I have done it myself on occasion. But that's different from trying to make it part of the standard language.
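A sketch of that manual approach, using the widely supported GCC section attribute; the section name, the linker region and the separate MPU configuration step are assumptions about the particular target setup:

#include <stdint.h>

/* Place the queue control block in a dedicated section.  A matching
   entry in the linker script puts ".uncached" into a memory region that
   the startup code configures as non-cacheable in the MPU/MMU.
   Illustrative linker script fragment:

       .uncached (NOLOAD) : { *(.uncached) } > SRAM_NOCACHE
*/
typedef struct {
    volatile uint32_t head;
    volatile uint32_t tail;
} queue_ctrl_t;

__attribute__((section(".uncached"))) queue_ctrl_t rx_queue_ctrl;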
> If some specific shared data is defined as having a single producer
> thread (with full R/W access) and multiple consumer threads (with
> read-only access) in a write-back cache system, the producer would
> trigger a write-through after each update, while each consumer would
> invalidate the cache before any read access, forcing a cache reload
> before using the data. The source code would be identical in both the
> producer and the consumer threads, but separate binary code could be
> compiled for the producer and the consumers.
That's what atomic access modes and fences are for in C11/C++11.
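For instance, the write-through/invalidate pairing described above maps onto C11 roughly as a release fence on the producer side and an acquire fence on the consumer side (a minimal sketch; the variable names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

static int data_value;       /* ordinary shared data, illustrative */
static atomic_bool ready;    /* static storage, so zero-initialised to false */

void producer(int v)
{
    data_value = v;                                 /* plain write */
    atomic_thread_fence(memory_order_release);      /* order it before the flag */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

bool consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);      /* order the flag before the data read */
    *out = data_value;
    return true;
}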
>> It would be nice to see cpus (or chipsets) having better hardware
>> support for a variety of synchronisation mechanisms, rather than just
>> "flush all previous writes to memory before doing any new writes"
>> instructions.
>
> Is that really such a bad limitation?
For big SMP systems like modern x86 or PPC chips? Yes, it is - these barriers can cost hundreds of cycles of delay. And if you want sequentially consistent barriers (not just acquire/release), so that all cores see the same order of memory, you need a broadcast that makes /all/ cores stop and flush all their write queues. (Cache lines don't need to be flushed - cache snooping takes care of that already.)

I have used a microcontroller with a dedicated "semaphore" peripheral block. It was very handy, and very efficient for synchronising between the two cores.
>> Multi-port and synchronised memory is expensive, but
>> surely it would be possible to have a small amount that could be used
>> for things like mutexes, semaphores, and the control parts of queues.
>
> Any system with memory-mapped I/O registers must have a mechanism that
> disables caching for those peripheral I/O registers. Extending this to
> some RAM locations should be helpful.
Agreed. But that ram would, in practice, be best implemented as a separate block of fast ram independent from the main system ram. For embedded systems, a bit of on-chip static ram would make sense. And note that it is /not/ enough to be uncached - you also need to make sure that writes are done in order, and that reads are not done speculatively or out of order.
> BTW, discussing massively parallel systems with shared memory
> resembles the memory-mapped file usage in some big database engines.
>
> [snip]
>
> Thus you can think of the physical disks as the real memory and the
> computer's main memory as the L4 cache. Since the main memory is just
> one level in the cache hierarchy, there are similar cache consistency
> issues as with other cached systems. In transaction processing,
> typically some Commit/Rollback scheme is used.
There is some saying about any big enough problem in computing being just an exercise in caching, but I forget the exact quotation. Serious caching systems are very far from easy to make, ensuring correctness, convenient use, and efficiency.
> I guess that when designing products around these massively parallel
> chips, studying the cache consistency tricks used by memory-mapped
> database file systems might be helpful.
Indeed.
On 2017-08-10 9:11 AM, David Brown wrote:
> That sounds like a disaster for coupling compilers, linkers, OS's, and
> processor MMU setups. I don't see this happening automatically. Doing
> so /manually/ - giving explicit sections to variables, and explicitly
> configuring an MMU / MPU to make a particular area of the address space
> non-cached is fine. I have done it myself on occasion. But that's
> different from trying to make it part of the standard language.
A couple of comments on this. When compiling for multiple processors, I have used named address spaces (IEC/ISO 18037) to define private and shared space. The nice part of that is that applications can start out running on a single platform and then be split later with minimal impact on the source code. Admittedly I have done this on non-MMU systems.

I have linked across multiple processors, including cases of heterogeneous processors.

Another comment, about inter-processor communication: we found out a long time ago that dual- or multi-port memory is not that much of an advantage in most applications. The data rate can actually be quite low. We have done quite a few consumer electronics packages with serial data well below a megabit, some as low as 8 kbit/s. It creates skew between processor execution, but generally has very limited impact on application function or performance.

w..
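A hedged sketch of what that can look like in source code; the __shared and __private space names (and the build guard) are hypothetical, since TR 18037 leaves the actual address-space identifiers to the implementation:

#include <stdint.h>

/* Hypothetical address-space qualifiers in the TR 18037 style, where a
   space name acts like a type qualifier.  On a single-processor build
   they can simply expand to nothing. */
#ifndef MULTIPROC_BUILD                /* hypothetical build guard */
#define __shared   /* implementation-defined space: visible to all processors */
#define __private  /* implementation-defined space: local to this processor   */
#endif

__shared volatile uint16_t rx_count;   /* shared counter, illustrative */
__private uint16_t scratch[32];        /* per-processor scratch buffer */

void poll(void)
{
    /* The same source can first run on one processor and later be split,
       with the compiler choosing the access sequence for each space. */
    if (rx_count != 0u) {
        scratch[0] = rx_count;
    }
}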
On 17/08/17 00:39, Walter Banks wrote:
> On 2017-08-10 9:11 AM, David Brown wrote:
>> [snip]
>
> A couple of comments on this. When compiling for multiple processors, I
> have used named address spaces (IEC/ISO 18037) to define private and
> shared space.
"IEC/ISO 18037" completely misses the point, and is a disaster for the world of embedded C programming. It is an enormous disappointment to anyone who programs small embedded systems in C, and it is no surprise that compiler implementers have almost entirely ignored it in the 15 years of its existence. Named address spaces are perhaps the only interesting and useful idea there, but the TR does not cover user-definable address spaces properly.
> The nice part of that is that applications can start out running on a
> single platform and then be split later with minimal impact on the
> source code.
>
> Admittedly I have done this on non-MMU systems.
On some systems, such a "no_cache" keyword/attribute is entirely possible. My comment is not that this would not be a useful thing, but that it could not be a part of the standard C language.

For example, on the Nios processor (Altera's soft cpu for their FPGAs - and I don't remember if this was just for the original Nios or the Nios II), the highest bit of an address was used to indicate "no cache, no reordering", but it was otherwise unused for address decoding. When you made a volatile access, the compiler ensured that the highest bit of the address was set. On that processor, implementing a "no_cache" keyword would be easy - it was already done for "volatile".

But on a processor that has an MMU? It would be a serious problem. And how would you handle casts to a no_cache pointer? Casting a pointer to normal data into a pointer to volatile is an essential operation in lots of low-level code. (It is implementation-defined behaviour, but works "as expected" in all compilers I have heard of.)

So for some processors, "no_cache" access is easy. For some, it would require support from the linker (or at least linker scripts) and MMU setup, but have no possibility for casts. For others, memory barrier instructions and cache flush instructions would be the answer. On larger processors, that could quickly become /very/ expensive - much more so than an OS call to get some uncached memory (dma_alloc_coherent() on Linux, for example).

Uncached accesses cannot be implemented sensibly or efficiently in the same way on different processors, and on some systems they cannot be done at all. The concept of a cache is alien to the C standards. Any code that might need uncached memory is inherently low-level and highly system-dependent. It is therefore a concept that has no place in the C standards, even though it is a feature that could be very useful in many specific implementations for specific targets. A great thing about C is that there is no problem having such implementation-specific features and extensions.
> I have linked across multiple processors, including cases of
> heterogeneous processors.
>
> Another comment, about inter-processor communication: we found out a
> long time ago that dual- or multi-port memory is not that much of an
> advantage in most applications. The data rate can actually be quite
> low. We have done quite a few consumer electronics packages with serial
> data well below a megabit, some as low as 8 kbit/s. It creates skew
> between processor execution, but generally has very limited impact on
> application function or performance.
>
> w..
On 2017-08-17 3:37 AM, David Brown wrote:
> "IEC/ISO 18037" completely misses the point, and is a disaster for > the world of embedded C programming. It is an enormous > disappointment to anyone who programs small embedded systems in C, > and it is no surprise that compiler implementers have almost entirely > ignored it in the 15 years of its existence. Named address spaces > are perhaps the only interesting and useful idea there, but the TR > does not cover user-definable address spaces properly.
Guilty - I wrote the section of 18037 on named address spaces, based on our use in consumer applications and earlier WG14 papers. We extended the named address space material to also include processor named spaces (N1351, N1386).

The fixed point material in 18037 is, in my opinion, reasonable.

We use both of these a lot, especially in programming the massively parallel ISAs I have been working on in the last few years.

w..
On 17/08/17 14:24, Walter Banks wrote:
> On 2017-08-17 3:37 AM, David Brown wrote:
>> "IEC/ISO 18037" completely misses the point, and is a disaster for
>> the world of embedded C programming. [snip] Named address spaces
>> are perhaps the only interesting and useful idea there, but the TR
>> does not cover user-definable address spaces properly.
>
> Guilty - I wrote the section of 18037 on named address spaces, based on
> our use in consumer applications and earlier WG14 papers.
>
> We extended the named address space material to also include processor
> named spaces (N1351, N1386).
I don't know the details of these different versions of the papers. I have the 2008 draft of ISO/IEC TR 18037:2008 in front of me. With all due respect to your work and experience here, I have a good deal of comments on this paper. Consider it constructive criticism, born of frustration at a major missed opportunity. In summary, TR 18037 is much like EC++ - a nice idea when you look at the title, but an almost total waste of time for everyone except compiler company marketing droids.

The basic idea of named address spaces that are syntactically like const and volatile qualifiers is, IMHO, a good plan. For an example usage, look at the gcc support for "__flash" address spaces in the AVR port of gcc:

<https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html>

The AVR needs different instructions for accessing data in flash and ram, and address spaces provide a neater and less error-prone solution than macros or function calls for flash data access. So far, so good - and if that is your work, then well done. The actual text of the document could, IMHO, benefit from a more concrete example usage of address spaces (such as for flash access, as that is likely to be a very popular usage).

The register storage class stuff, however, is not something I would like to see in the C standards. If I wanted to mess with specific cpu registers such as flag registers, I would be programming in assembly. C is /not/ assembly - we use C so that we don't have to use assembly. There may be a few specific cases of particularly awkward processors for which it is occasionally useful to have direct access to flag bits - those are very much in the minority, and they are getting more so as painful architectures like the COP8 and PIC16 are dropped in favour of C-friendly processors. It is absolutely fine to put support for condition code registers (or whatever) into compilers as target extensions. I can especially see how it can help compiler implementers write support libraries in C rather than assembly. But it is /not/ something to clutter up the C standards or for general embedded C usage.

The disappointing part of named address spaces is in Annex B.1. It is tantalisingly close to allowing user-defined address spaces with specific features such as neat access to data stored in other types of memory. But it is missing all the detail needed to make it work, how and when it could be used, examples, and all the thought into how it would interplay with other features of the language.

It also totally ignores some major issues that are very contrary to the spirit and philosophy of C. When writing C, one expects "x = 1;" to operate immediately as a short sequence of instructions, or even to be removed altogether by the compiler optimiser. With a user-defined address space, such as an SPI eeprom mapping, this could take significant time, it could interact badly with other code (such as another thread or an interrupt that is also accessing the SPI bus), it could depend on setup outside the control of the compiler, and it could fail. You need to think long and hard about whether this is something desirable in a C compiler - it would mean giving up the kind of transparency and low-level predictability that are some of the key reasons people choose C over C++ for such work. If the convenience of being able to access different types of data in the same way in code is worth it, then these issues must be made clear and the mechanisms developed; if not, then the idea should be dropped. A half-written, half-thought-out annex is not the answer.
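For concreteness, the AVR usage referred to above looks like this with avr-gcc's __flash named address space (a minimal sketch; see the linked gcc documentation for the exact requirements):

#include <stdint.h>

/* The table lives in flash and is read with LPM instructions, without
   progmem macros or accessor functions - the __flash qualifier tells
   the compiler which access sequence to generate.  Values are
   illustrative. */
static const __flash uint8_t sine_table[4] = { 0u, 50u, 98u, 142u };

uint8_t sine_sample(uint8_t idx)
{
    /* An ordinary-looking array access; the qualifier does the work. */
    return sine_table[idx & 3u];
}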
One point that is mentioned in Annex B is specific little-endian and big-endian access. This is a missed opportunity for the TR: qualifiers giving explicit endianness to a type would be extremely useful, completely independently of the named address space concept. Such qualifiers would be simple to implement on all but the weirdest of hardware platforms, and would be massively useful in embedded programming.
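In the absence of such qualifiers, the usual portable workaround is explicit byte-order accessors like the ones below - exactly the boilerplate an endianness qualifier would let the compiler generate on every access:

#include <stdint.h>

/* Read a 32-bit big-endian value from a byte buffer (e.g. a network
   header or a register map defined as big-endian), independent of the
   host's native byte order. */
static inline uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static inline void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}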
> The fixed point material in 18037 is, in my opinion, reasonable.
No, it is crap. Look at C99, and what it gave us over C90. One vital feature that made a huge difference to embedded programming is <stdint.h> with fixed-size integer types. There is no longer any need for every piece of embedded C software, every library, every RTOS, to define its own types u16, u16t, uint_16_t, uWORD, RTOS_u16, and whatever. Now we can write uint16_t and be done with it.

Then someone has come along and written this TR with a total disregard for this. So /if/ this support gets widely implemented, and /if/ people start using it, what types will people use? Either they will use "signed long _Fract" and friends, making for unreadable code due to the long names and having undocumented target-specific assumptions that make porting an error-prone disaster, or we are going to see a proliferation of fract15_t, Q31, fp0_15, and a dozen different incompatible variations. If this was going to be of any use, a set of specific, fixed-size type names should have been defined from day one. The assorted _Fract and _Accum types are /useless/. They should not exist. My suggestion for a naming convention would be uint0q16_t, int7q8_t, etc., for the number of bits before and after the binary point. Implementations should be free to implement those that they can handle efficiently, and drop any that they cannot - but there should be no ambiguity. This would also avoid the next point - C99 was well established before the TR was written, so what about the "long long" versions, for completeness? Of course, with a sensible explicit naming scheme, as many different types as you want could exist.

Then there is the control of overflow. It is one thing to say saturation would be a nice idea - but it is absolutely, totally and completely /wrong/ to allow this to be controllable by a pragma. Explicit in the type - yes, that's fine. Implicit, based on what preprocessing directives happen to have been passed before that bit of the source code is translated? Absolutely /not/. Equally, pragmas for precision and rounding - in fact, pragmas in general - are a terrible idea. Should the types behave differently in different files of the same code?

Next up - fixed point constants. Hands up all those who think it is intuitive that 0.5uk obviously means an "unsigned _Accum" constant? Write it as "(uint15q16_t) 0.5" instead - make it clear and explicit. The fixed point constant suffixes exist purely because someone thought there should be suffixes and picked some letters out of their hat. Oh, and for extra fun, let's make these suffixes subtly different from the conversion specifiers for printf - you remember, that function that is already too big, slow and complicated for many embedded C systems.

Then there is the selection of functions in <stdfix.h>. We have type-generic maths support in C99. There is no place for individual functions like abshr, abslr, abshk, abslk - a single type-generic absfx would do the job. We don't /need/ these underlying functions. The implementation may have them, but C programmers don't need to see that mess. Hide it away as implementation details. That would leave everything much simpler to describe, and much simpler to use, and mean it would work with explicit names for the types.

And in the thirteen years between this TR first being published and today, when implementations are still rare, incomplete and inefficient, we have gained microcontrollers that will do floating point quickly for under a dollar. Fixed point is rapidly becoming of marginal use, or even irrelevant.
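To illustrate the kind of explicit, fixed-size naming argued for above, this is roughly what it looks like when spelled out in plain C today (the type alias follows the naming suggestion in the post; the saturating multiply is a sketch):

#include <stdint.h>

typedef int16_t int0q15_t;   /* 1 sign bit, 15 fraction bits, per the suggested naming */

/* Saturating Q15 x Q15 -> Q15 multiply: widen, shift out the extra
   fraction bits with rounding, then clamp to the representable range. */
static int0q15_t mul_q15(int0q15_t a, int0q15_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b + (1 << 14)) >> 15;
    if (p >  32767) p =  32767;
    if (p < -32768) p = -32768;
    return (int0q15_t)p;
}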
As for the hardware I/O stuff, the less said about that the better. It will /never/ be used. It has no benefits over the system used almost everywhere today - volatile accesses through cast constant addresses.

The TR has failed to give the industry anything that embedded C programmers need, it has made suggestions that are worse than useless, and by putting in so much that is not helpful it has delayed any hope of implementation and standardisation for the ideas that might have been helpful.
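That idiom - a constant address cast to a pointer-to-volatile - is the familiar pattern below (the peripheral, addresses and bit position are illustrative):

#include <stdint.h>

/* Conventional hardware I/O in C today: a constant address cast to a
   pointer-to-volatile, so every access really hits the register. */
#define UART0_BASE   0x40004000u                        /* illustrative address */
#define UART0_DATA   (*(volatile uint32_t *)(UART0_BASE + 0x00u))
#define UART0_STATUS (*(volatile uint32_t *)(UART0_BASE + 0x04u))
#define UART0_TX_RDY (1u << 5)                          /* illustrative bit */

static void uart0_putc(char c)
{
    while ((UART0_STATUS & UART0_TX_RDY) == 0u) {
        /* busy-wait until the transmitter can take another byte */
    }
    UART0_DATA = (uint32_t)(uint8_t)c;
}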
> We use both of these a lot, especially in programming the massively
> parallel ISAs I have been working on in the last few years.
Implementation-specific extensions are clearly going to be useful for odd architectures like this. It is the attempt at standardisation in the TR that is a total failure.
