
ARM Cortex Mx vs the rest of the gang

Started by Klaus Kragelund May 30, 2017
On Mon, 12 Jun 2017 16:21:12 +0200, StateMachineCOM
<statemachineguru@gmail.com> wrote:
> I said that "... the FPU integration in the Cortex-M4F/M7 is horrible",
> because it adds tons of overhead and a lot of headache for the
> system-level software.
>
> The problem is that the ARM Vector Floating-Point (VFP) coprocessor
> comes with a big context of 32 32-bit registers (S0-S31). These
> registers need to be saved and restored as part of every context switch,
> just like the CPU registers.
No. Only when you switch to an FPU-enabled task. The task dispatcher just needs to keep track of whose context is in the FPU registers, and save/restore only when ownership changes.
> ARM has come up with some hardware optimizations called "lazy stacking
> and context switching" (see ARM AppNote 298 at
> http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf
> ). But as you will see in the AppNote, the scheme is quite involved and
> still requires much more stack RAM than a context switch without the
> VFP. The overhead of the ARM VFP in a multitasking system is so big, in
> fact, that often it outweighs the benefits of having hardware FPU in the
> first place. Often, a better solution would be to use the FPU in one
> task only, and forbid to use it anywhere else. In this case, preserving
> the FPU context would be unnecessary. (But it is difficult to reliably
> forbid using FPU in other parts of the same code, so it opens the door
> for race conditions around the FPU if the rule is violated.)
A good task dispatcher has the FPU enable bit as part of the task context. Code that enables this bit is easily found.
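
For illustration, a minimal sketch of that "track the owner" idea in C
(tcb_t, fpu_save() and fpu_restore() are placeholders here, not any
particular RTOS's API):

#include <stdint.h>

typedef struct tcb {
    uint32_t fp_context[33];   /* S0-S31 plus FPSCR */
    /* ... general-purpose register save area, stack pointer, etc. ... */
} tcb_t;

void fpu_save(uint32_t *ctx);          /* hypothetical asm helpers, */
void fpu_restore(const uint32_t *ctx); /* e.g. VSTM/VLDM on an M4F   */

static tcb_t *fpu_owner;   /* task whose state currently sits in the FPU */

/* Called only when a task that is not the current owner touches the FPU. */
void fpu_claim(tcb_t *newcomer)
{
    if (fpu_owner == newcomer)
        return;                              /* its state is already loaded */
    if (fpu_owner != NULL)
        fpu_save(fpu_owner->fp_context);     /* spill the previous owner */
    fpu_restore(newcomer->fp_context);       /* load the newcomer's state */
    fpu_owner = newcomer;
}

The context switch itself never touches the S registers; the 33-word
save/restore is only paid when FPU ownership actually changes hands.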
> Anyway, does it have to be that hard? Apparently not. For example the
> Renesas RX CPU comes also with single precision FPU, which is much
> better integrated with the CPU and does not have its own register
> context. Compared to the ARM VFP it is a pleasure to work with.
-- (Remove the obvious prefix to reply privately.) Created with Opera's e-mail client: http://www.opera.com/mail/
On 06/12/17 16:54, Don Y wrote:

> Instead, you enable the trap on the FPU opcodes so that if the "new" task
> attempts to use the FPU, you first swap out the FPU's state -- having
> remembered which task it belongs to (which may not be the task that executed
> immediately prior to the current task!). Having done so, you restore the
> saved FPU state for *this* task, disable the trap and let the instruction
> complete *in* the FPU. All the while, knowing that it may not complete
> before the current task loses control of the processor.
That sounds too complicated. If the fpu is busy, why not just put the
requesting task back on the ready queue, then try again next time
round? Set priorities accordingly.

Other solutions might include encapsulating the fpu within its own
task, with or without an input queue, then use messaging to talk to it?

Anyway, isn't this just a bit academic? Modern cpus are orders of
magnitude faster than early designs and have never been limited by cpu
throughput. Just take the simplest approach, save all registers to
start, then profile the code to see where the bottlenecks are. It's
neither economic nor sound engineering design to fine tune everything
just for the sake of it...

Chris
On 7/10/2017 6:30 AM, Chris wrote:
> On 06/12/17 16:54, Don Y wrote:
>
>> Instead, you enable the trap on the FPU opcodes so that if the "new" task
>> attempts to use the FPU, you first swap out the FPU's state -- having
>> remembered which task it belongs to (which may not be the task that executed
>> immediately prior to the current task!). Having done so, you restore the
>> saved FPU state for *this* task, disable the trap and let the instruction
>> complete *in* the FPU. All the while, knowing that it may not complete
>> before the current task loses control of the processor.
>
> That sounds too complicated.
Using the FPU *is* complicated in a multithreaded world! :>
> If the fpu is busy, why not just put the
> requesting task ready waiting back on the task queue, then try again
> next time round ?. Set priorities accordingly.
So, some event has occurred which forces a reschedule() operation. The
system has decided that TASK_A (hand-wave away the task/process/thread
finer points) is deserving of the processor (or, *a* processor core).
But, you want to defer the execution of this "task" because it's
inconvenient, at this time, and try for "second best". What if the
second choice also requires the FPU's services? Third choice? etc.

You want to artificially LOWER the timeliness constraints of TASK_A
because the FPU is busy -- even if the VERY FIRST opcode that TASK_A
fetches (after it RESUMES execution) might not be a floating point
instruction?

How do you model this in your system design? Do you profile the
frequency of floating point operations in each task and try to predict
the likelihood of one thread (task) starting a floating point operation
in the instant before a reschedule() event to be followed by another
thread (task) that happens to need to execute a floating point
operation AT SOME POINT (possibly hours from now)? Does the deferred
task (thread) ever regain its DESERVED priority (timeliness)? Or, once
"demoted", does it remain that way -- hoping its peers similarly get
demoted (by pure chance) so that its RELATIVE priority is reclaimed?

You're making a coarse-grained scheduling decision whereas the trap
approach just has the appearance of an opcode "taking longer" to
execute "in user space".
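
A rough sketch of what that trap can look like on a Cortex-M4F follows
(CPACR address and bit positions per the ARMv7-M manual; the TCB layout,
fault wiring and asm helpers are placeholders, so treat it as the shape
of the idea rather than production code):

#include <stdint.h>

#define SCB_CPACR       (*(volatile uint32_t *)0xE000ED88u)
#define CPACR_FPU_FULL  (0xFu << 20)        /* full access to CP10/CP11 */

typedef struct tcb {
    uint32_t fp_context[33];                /* S0-S31 plus FPSCR */
    /* ... the rest of the task control block ... */
} tcb_t;

extern tcb_t *current_task;                 /* maintained by the scheduler */
static tcb_t *fpu_owner;                    /* whose state sits in the FPU */

void fpu_save(uint32_t *ctx);               /* hypothetical asm helpers */
void fpu_restore(const uint32_t *ctx);

/* Context switch: save nothing -- just make the FPU inaccessible so the
 * next FP opcode (if one ever comes) raises a UsageFault (NOCP).
 * A DSB/ISB barrier belongs after the CPACR write. */
void fpu_disable_on_switch(void)
{
    SCB_CPACR &= ~CPACR_FPU_FULL;
}

/* UsageFault handler for the NOCP case (UsageFault must be enabled in
 * SHCSR, and the NOCP flag cleared in the CFSR -- both omitted here):
 * now we KNOW the running task wants the FPU, so pay for the swap. */
void fpu_nocp_fault(void)
{
    SCB_CPACR |= CPACR_FPU_FULL;            /* re-enable the FPU */
    if (fpu_owner != current_task) {
        if (fpu_owner != NULL)
            fpu_save(fpu_owner->fp_context);
        fpu_restore(current_task->fp_context);
        fpu_owner = current_task;
    }
    /* returning re-executes the faulting FP instruction, which now runs */
}

Only tasks that actually execute FP instructions ever pay for the
33-word save/restore; everyone else context-switches as if the FPU
didn't exist.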
> Other solutions might include encapsulating the fpu within it's own
> task, with or without input queue, then use messaging to talk to it ?
You can treat the FPU as a "device" -- like a UART, disk drive, NIC, etc. -- and impose sharing (through locks/mutexes) on it. But, this promotes the sharing to a very visible level and forces the developer (and all tasks) to consider the extent of FPU usage in each case where the device is "open()-ed" -- so you can push commands/messages at it.
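
As a sketch of that "device" view (mutex_t and the lock/unlock calls
stand in for whatever primitives your RTOS actually provides):

typedef struct mutex mutex_t;        /* opaque RTOS lock type (placeholder) */
void mutex_lock(mutex_t *m);
void mutex_unlock(mutex_t *m);

extern mutex_t fpu_lock;             /* one lock guarding the whole FPU */

float scale_sample(float x, float gain)
{
    float y;
    mutex_lock(&fpu_lock);           /* "open()" the FPU device */
    y = x * gain;                    /* FP work only while the lock is held */
    mutex_unlock(&fpu_lock);         /* "close()" it again */
    return y;
}

Every client now has to remember to take the lock around its FP work,
which is exactly the kind of visibility (and burden) described above.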
> Anyway, isn't this just a bit academic ?. Modern cpus are orders of
> magnitude faster than early designs and have never been limited by
> cpu throughput.
*Memory* is the bottleneck. Saving and restoring FPU state WHEN NOT NEEDED (by the task surrendering the CPU/FPU *or* the task acquiring it) generates lots of unnecessary memory activity. E.g., in the M4, the FPU state is ~100+ bytes that you're moving in and out, possibly needlessly.
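
For a sense of scale, a plain C mirror of that per-task VFP state
(assuming just S0-S31 plus FPSCR and no padding):

#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t s[32];      /* S0-S31 */
    uint32_t fpscr;      /* status/control word */
} vfp_context_t;

/* 33 words = 132 bytes out, and another 132 bytes back in, per switch. */
static_assert(sizeof(vfp_context_t) == 33 * sizeof(uint32_t),
              "VFP context is 33 words");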
> Just take the simplest approach, save all registers
> to start, then profile the code to see where the bottlenecks are. It's
> not economic, nor sound engineering design to fine tune everything
> just for the sake of it...
Imagine if every ISR preserved and restored the ENTIRE processor state "just to make things simple". Would you consider THAT to be "sound engineering"? :>
On 07/10/17 23:03, Don Y wrote:

 >
 > Using the FPU *is* complicated in a multithreaded world! :>

Yes, so you have to tightly define how it is accessed for best
results.

 >
 > You want to artificially LOWER the timeliness constraints of TASK_A
 > because the FPU is busy -- even if the VERY FIRST opcode that TASK_A
 > fetches (after it RESUMES execution) might not be a floating point
 > instruction?

If you have contention for a resource, someone has to wait.
Who waits depends on task priorities; fpu state may have
to be saved, but so what? There are various ways to provide
fair access, but if the design is so critically constrained by timing
issues, then the design is wrong and needs more resources. Ok, we
have all had to deal with that, but it shouldn't happen these days.

 >
 > How do you model this in your system design? Do you profile the
 > frequency of floating point operations in each task and try to
 > predict the likelihood of one thread (task) starting a floating point
 > operation in the instant before a reschedule() event to be followed
 > by another thread (task) that happens to need to execute a floating
 > point operation AT SOME POINT (possibly hours from now)? Does the
 > deferred task (thread) ever regain its DESERVED priority (timeliness)?
 > Or, once "demoted", does it remain that way -- hoping its peers
 > similarly get demoted (by pure chance) so that its RELATIVE priority
 > is reclaimed?
 >
 > You're making a coarse-grained scheduling decision whereas the
 > trap approach just has the appearance of an opcode "taking longer"
 > to execute "in user space".

Sorry, that doesn't make sense.

 >
 >> Other solutions might include encapsulating the fpu within it's own
 >> task, with or without input queue, then use messaging to talk to it ?
 >
 > You can treat the FPU as a "device" -- like a UART, disk drive,
 > NIC, etc. -- and impose sharing (through locks/mutexes) on it.
 > But, this promotes the sharing to a very visible level and forces
 > the developer (and all tasks) to consider the extent of FPU usage
 > in each case where the device is "open()-ed" -- so you can push
 > commands/messages at it.

If you use messaging ipc, the fpu is always ready for data, assuming
the queue is properly sized. If you want to make it priority aware,
just include that with the request, along with the pid of the
requester. All the fpu internal complexity is hidden from the requester,
which doesn't need to know. While this might not be ideal for an fpu,
a task based model is a great way to encapsulate complexity.

 >
 > Imagine if every ISR preserved and restored the ENTIRE processor state
 > "just to make things simple". Would you consider THAT to be "sound
 > engineering"? :>
 >

From a practical, keep-the-code-simple point of view, and
assuming there are no performance issues, that's the way I might do
it, but even the ancient 68000 had selective register save
instructions. I have used them at times in interrupt handlers, but it
requires poking around in the entrails and asm macros if you
typically write interrupt handlers in C.  Modern CPUs are orders of
magnitude faster, a king's ransom of riches in terms of throughput
and hardware options, which allows a much more high-level view of
the overall design. Ok, for a few apps like image processing and
compression etc, more care might be needed, but they are the
exceptions for embedded work, afaics.

We are, thankfully, past the stage where it was necessary to hand
optimise all the low level stuff to make systems work, and the added
complexity is not good for reliability, nor maintenance. It's rarely
properly documented, so the poor soul who replaces you will have no idea
why the design decisions were made. Good design is not just about the
hardware and code, but the whole project infrastructure and the needs
surrounding it.

This is a bit tl;dr, isn't it? But you do cover a lot of ground at
once :-)...

Chris
On 7/11/2017 9:24 AM, Chris wrote:
> On 07/10/17 23:03, Don Y wrote:
>
>> Using the FPU *is* complicated in a multithreaded world! :>
>
> Yes, so you have to tightly define how it is accessed for best
> results.
Or, design a strategy that will *adapt* to the needs of the application without having to make that decision /a priori/.
>> You want to artificially LOWER the timeliness constraints of TASK_A
>> because the FPU is busy -- even if the VERY FIRST opcode that TASK_A
>> fetches (after it RESUMES execution) might not be a floating point
>> instruction?
>
> If you have contention for a resource, somone has to wait.
Of course. But the "when" becomes a driving factor. You design the system based on the intrinsic priorities of the "actors" competing for those resources. You don't decide that "it's hard" to give an actor his just due and, thus, rejiggle the "priorities" to fit something more convenient.
>> Who waits depends on task priorities, fpu state may have
>> to be saved, but so what ?.
But it may *not* "have to be saved". You're assuming the task that is
assuming control of the processor *does* need the FPU and WILL need it
"presently" -- so save the state NOW instead of deferring the act until
it PROVES to be necessary (said "proof" being indicated by the
execution of a floating point operation).

The state of the FPU is *big* in most processors. With multicore chips,
that's multiplied by the number of cores. THE MEMORY BANDWIDTH IS FIXED
(and shared by *all* cores). Why move temporally distant data through
that pipe if you don't NEED to do so?
>> There various ways to provide
>> fair access, but if the design is so critically constrained by timing
>> issues, then the design is wrong and needs more resources. Ok, we
>> have all had to deal with that, but it shouldn't happen these days.
Why do compilers worry so much about optimization? We *surely*
shouldn't NEED the effective resource gain that these options provide,
right?

Unconditionally saving and restoring the FPU's state is akin to
unconditionally saving the entire state of the CPU for each interrupt
-- why invent things like FIRQ (which costs real silicon) if these
constraints "shouldn't happen these days"?

Why optimize away:

   foo += number;
   foo -= number;

SURELY we can afford a pair of integer (?) operations! :>
>> How do you model this in your system design? Do you profile the
>> frequency of floating point operations in each task and try to
>> predict the likelihood of one thread (task) starting a floating point
>> operation in the instant before a reschedule() event to be followed
>> by another thread (task) that happens to need to execute a floating
>> point operation AT SOME POINT (possibly hours from now)? Does the
>> deferred task (thread) ever regain its DESERVED priority (timeliness)?
>> Or, once "demoted", does it remain that way -- hoping its peers
>> similarly get demoted (by pure chance) so that its RELATIVE priority
>> is reclaimed?
>>
>> You're making a coarse-grained scheduling decision whereas the
>> trap approach just has the appearance of an opcode "taking longer"
>> to execute "in user space".
>
> Sorry, that doesn't make sense.
You are using the current state of the FPU (busy) to effectively make a
scheduling decision -- without knowledge of whether or not the task
that SHOULD be executing, next (based on the scheduling criteria
selected AT DESIGN TIME) will actually need the FPU *or* will need it
"in this next period of execution" (avoiding the term "time slice"
because preemptive schedulers tend not to be driven strictly by
"time"). *OR*, even "shortly".

When will the *deferred* highest priority task get his next opportunity
to run? If you jigger with the priorities, then he's no longer the most
eligible to run (you may, in fact, have introduced a deadlock that the
*design* had implicitly avoided in its assignment of "priority").

[I assume you understand that "priority" in the sense used in
scheduling is NOT a "small integer used to artificially impose order on
competing actors"]
>>> Other solutions might include encapsulating the fpu within it's own
>>> task, with or without input queue, then use messaging to talk to it ?
>>
>> You can treat the FPU as a "device" -- like a UART, disk drive,
>> NIC, etc. -- and impose sharing (through locks/mutexes) on it.
>> But, this promotes the sharing to a very visible level and forces
>> the developer (and all tasks) to consider the extent of FPU usage
>> in each case where the device is "open()-ed" -- so you can push
>> commands/messages at it.
>
> If you use messaging ipc, the fpu is always ready for data, assuming
> the queue is properly sized. If you want to make it priority aware,
> just include that with the request, along with a the pid of the
> requester. All the fpu internal complexity is hidden from the requester,
> which doesn't need to know. While this might not be ideal for an fpu,
> a task based model is a great way to encapsulate complexity.
If you want finer-grained access to the FPU, then you have to be
willing to save and restore the contexts of the individual clients on a
transactional basis. I.e., either load the FPU context of the IPC being
serviced *now*, run the opcode and then save the context as you're
passing the results to the client. Or, leave the most recently loaded
context *in* the FPU until you examine the next incoming IPC to
determine *if* there is a need to swap out the context currently
residing therein.

Your other arguments advocate unconditionally loading and saving the
*current* client's FPU context on each IPC -- regardless of recent past
history of that resource. My argument is to leave whatever context
happens to be *in* the FPU there -- in the hope that the next request
MIGHT be from the same client; only swap contexts when you KNOW the new
client is a different entity than the last and, therefore, avoid the
overhead of a save-restore PER IPC.
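
A sketch of that "leave it in place" policy for an FPU-server task
(msg_t, recv(), reply() and the per-client state helpers are
placeholders, not a real IPC API):

typedef struct {
    int    client;        /* identity of the requester */
    int    op;            /* which FP operation is wanted */
    double a, b, result;
} msg_t;

void   recv(msg_t *m);                    /* block for the next request */
void   reply(const msg_t *m);
void   save_fp_state(int client);         /* per-client FP context store */
void   load_fp_state(int client);
double do_fp_op(int op, double a, double b);

void fpu_server(void)
{
    int last_client = -1;                 /* whose context is "in" the FPU */
    msg_t m;

    for (;;) {
        recv(&m);
        if (m.client != last_client) {    /* ownership actually changed */
            if (last_client >= 0)
                save_fp_state(last_client);
            load_fp_state(m.client);
            last_client = m.client;
        }
        m.result = do_fp_op(m.op, m.a, m.b);
        reply(&m);
    }
}

Back-to-back requests from the same client cost nothing extra; the
save/restore is only paid when the requester actually changes.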
>> Imagine if every ISR preserved and restored the ENTIRE processor state
>> "just to make things simple". Would you consider THAT to be "sound
>> engineering"? :>
>
> From a practical, keep the code simple point of view and
> assuming there are no performance issues, that's the way I might do
> it, but even the ancient 68000 had selective register save
> instructions.
Ask yourself: why did the vendor include these instructions in the
processor's design? Why did they complicate the silicon, and the
programming model? Surely, the developer has adequate resources to
blindly save the entire state; why provide provisions to save only part
of it? "It shouldn't happen these days" :>

Why would an MCU vendor add silicon and programming complexity to a
design to support this sort of treatment of the FPU? Why waste an
engineer's time documenting it:
<http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf>

Surely, the developer shouldn't need to tune an application (OS) to
this extent, these days! (?)
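
For reference, the knob that app note documents boils down to two bits
(register address and bit positions as published for the Cortex-M4F;
this is illustrative only, not a drop-in driver):

#include <stdint.h>

#define FPU_FPCCR    (*(volatile uint32_t *)0xE000EF34u)
#define FPCCR_ASPEN  (1u << 31)   /* automatic FP state preservation */
#define FPCCR_LSPEN  (1u << 30)   /* lazy FP state preservation */

/* With both bits set (the reset default on the Cortex-M4F), exception
 * entry only *reserves* stack space for S0-S15/FPSCR; the registers
 * are written out later, and only if the handler really executes an
 * FP instruction. */
void fp_lazy_stacking_enable(void)
{
    FPU_FPCCR |= FPCCR_ASPEN | FPCCR_LSPEN;
}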
> Have used them at times in interrupt handlers, but it
> requires poking around in the entrails and asm macros if you
> typically write interrupt handlers in C. Modern cpu's are orders of
> magnitude faster, a king's ransome of riches in terms of throughput
> and hardware options, which allows a much more high level view of
> the overall design. Ok, for a few apps like image processing and
> compression etc, more care might be needed but they are the
> exceptions for embedded work, afaics.
Modern APPLICATIONS are orders of magnitude more complex! And, you
don't always use features to gain performance but, also, to gain
reliability, etc.

If I, the developer, KNOW that a particular task/process/thread doesn't
use the FPU, why wouldn't I want to take advantage of a mechanism that
tells me *if* an attempt is made to use the FPU (by THAT task)? And, if
so notified, wouldn't I want to *do* something about it?

If I, the developer, KNOW that my task's memory references are
constrained to the region [LOW,HIGH] -- because some other task
accesses the adjoining memory above/below that region -- wouldn't I
want to take advantage of a mechanism that tells me *if* an attempt is
made to access memory outside that region?

If I, the developer, KNOW that my task should NEVER be accessing a
particular file, device, etc. wouldn't I want to take advantage of a
mechanism that tells me *if* it tries to do so? Or, tries to WRITE to
program memory (CODE)? Or, tries to grow the stack beyond the limits
determined at design time? Or, tries to "hog" the CPU? etc.
> We are thankfully, past the stage where it was necessary to hand
> optimise all the low level stuff to make systems work and the added
> complexity is not good for reliability, nor maintenance.
The opposite is true. Why do we see increasingly complex OS's in use?
Ans: because you can design the mechanisms to detect and protect
against UNRELIABLE program operation *once* and leverage that across
applications and application domains.

Why do we see HLL's in use? Ans: it makes it easier for developers to
code larger programs *reliably*. (Why "larger"? Because applications
are getting orders of magnitude more complex).
> It's rarely
> properly documented, so the poor soul who replaces you will have no idea
> why the design decisions were made.
If the poor soul is competent to design an operating system, then he
SHOULD be skilled enough in his art to understand the ideas that are
frequently exploited in operating system designs. If not, he shouldn't
be tinkering with the OS's implementation. (You wouldn't want someone
who doesn't have a deep understanding of floating point issues to be
writing a floating point emulation library, would you?)

The developer (writing the *application*) need not be concerned about
the minutiae of how context switches are performed. Do you have to
understand how a multilevel page table is implemented (and traversed at
runtime) in order to use demand-paged virtual memory? OTOH, you *would*
if you were charged with maintaining that part of the codebase!
> Good design is not just about the
> hardware and code, but the whole project infrastructure and the needs
> surrounding it.
Good design is fitting the design *to* the application "most
effectively" (which are squishy words that the developer defines). If
every project could be handled with a PIC and 2KB of RAM, there'd be no
need for MMU's, FPU's, RTOS's, HLL's, SMP, IPC/RPC, etc. Thankfully,
(cuz that would be a world of pretty BORING applications) that's not
the case.

And, as applications ("projects") get larger, they quickly grow to a
point where they are "complex" (complex: too large to fit in one mind)
and have to rely on the efforts of many. Anything that can be done to
be a productivity/reliability/performance multiplier "by purchasing a
few more gates on a die" almost always has a net positive return.

Imagine if the authors of every application running on your PC had to
cooperate to ensure they were all LINKED at non-competing memory
addresses (because there was no relative addressing mode, segments,
virtual memory, etc.). Instead, the silicon -- and then the OS -- can
assume the burden of providing these mechanisms so the developers need
not be concerned with them.

[I'd wager most PC developers are clueless as to what happens under the
hood when their application is launched. And, I suspect there is a
boatload of documentation available for them *if* they decided they had
a genuine need to know -- at whatever level of detail they deemed
appropriate!]
> This is a bit tl:dr isn't it ?, but you do cover a lot of ground at
> once :-)...
IME, most non-trivial engineering decisions are hard to summarize in a
page (or ten :> ) or less.

Time to take advantage of 12 hours of rain to do some digging...
On 07/11/17 19:15, Don Y wrote:

 > Of course. But the "when" becomes a driving factor. You design
 > the system based on the intrinsic priorities of the "actors"
 > competing for those resources. You don't decide that "its hard"
 > to give an actor his just due and, thus, rejiggle the "priorities"
 > to fit something more convenient.

Make a rough estimate during development, then fine tune to fix
edge cases, or where a bit more headroom is needed for individual
tasks. No design is fixed in stone from the start.

 >
 > The state of the FPU is *big* in most processors. With multicore
 > chips, that's multiplied by the number of cores. THE MEMORY BANDWIDTH
 > IS FIXED (and shared by *all* cores). Why move temporally distant
 > data through that pipe if you don't NEED to do so?

You seem to be assuming high end systems running at the ragged
edge, which isn't the sort of work done here. Leave that to the mobile,
tablet and workstation / graphics people. You can't even be fluent in
all aspects of computing, let alone the electronics that enables it.

 > Why do compilers worry so much about optimization? We *surely*
 > shouldn't NEED the effective resource gain that these options
 > provide, right?.

Not sure. From a performance point of view, perhaps, but optimisation
can reduce memory footprint, critical for some embedded work.

 >
 > Unconditionally saving and restoring the FPU's state is akin to
 > unconditionally saving the entire state of the CPU for each
 > interrupt -- why invent things like FIRQ (which costs real silicon)
 > if these constraints "shouldn't happen these days"?

I guess you are talking ARM? FIRQ is a leftover from early ARM, fwir.
Have you seen the amount of tortuous code needed to get interrupts
working properly with the ARM7TDMI, for example? About 2 pages of dense
assembler, from memory. I rejected early ARM almost on that basis
alone, but there were other idiosyncrasies. They fixed it eventually
with a proper (68K) style vector table, but it took them a long time
:-). Cortex was when ARM finally came of age.

 > You are using the current state of the FPU (busy) to effectively
 > make a scheduling decision -- without knowledge of whether or not
 > the task that SHOULD be executing, next (based on the scheduling
 > criteria selected AT DESIGN TIME) will actually need the FPU *or*
 > will need it "in this next period of execution" (avoiding the term
 > "time slice" because preemptive schedulers tend not to be driven
 > strictly by "time"). *OR*, even "shortly".


It's a case of organising system design, task allocation etc, so
that you get a result that meets spec. Think systems engineering.
If you have limited resources, something has to give, but I would
prefer a situation where an fpu operation always runs to completion.
It's the simplest solution and has the fewest variables in terms
of estimating performance.

Interrupting and saving fpu state could be done, but only if all
other avenues have been explored. It's a whole can of worms best
avoided if possible and dependent on the actual fpu in use. It needs
memory to save context, added management code and maybe complex
synchronisation issues. Even if you make it work, it may turn out to
be less efficient than run to completion.

Anyway, all kinds of events affect scheduling decisions, even if
indirectly. To make a waiting process ready, for example. Don't see what
the problem is. Perhaps that's the issue: Some always look for
issues, while others assume everything is going to work.

 >
 > If you want finer-grained access to the FPU, then you have to be willing
 > to save and restore the contexts of the individual clients on a
 > transactional
 > basis.

I don't want fine grained access if possible. I want a black box to
feed data and get a result. Not really interested what happens
under the hood, so long as it meets requirements and is predictable.

 >
 > Ask yourself: why did the vendor include these instructions in the
 > processor's design? Why did they complicate the silicon, and the
 > programming model. Surely, the developer has adequate resources
 > to blindly save the entire state; why provide provisions to save
 > only part of it? "It shouldn't happen these days" :>

Simple: both memory and processors were slow in those days and needed
all the help they could get. Modern processors arguably don't need them
for most applications. Do commercial tool chains make use of them?
Last time I checked, gcc still didn't know about interrupts, though
some vendors do add extensions.

 >
 > Why would an MCU vendor add silicon and programming complexity
 > to a design to support this sort of treatment of the FPU?
 > Why waste an engineer's time documenting it:
 > <http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A
 >  _cortex_m4f_lazy_stacking_and_context_switching.pdf>

Competitive market, perhaps? Featureitis between vendors to cater for
the widest application and market share. With most work these days, you
only use a fraction of the internal arch and throughput. I see that as
good, as there's more freedom to think systems engineering, rather than
detail.

 >
 >> Have used them at times in interrupt handlers, but it
 >> requires poking around in the entrails and asm macros if you
 >> typically write interrupt handlers in C. Modern cpu's are orders of
 >> magnitude faster, a king's ransome of riches in terms of throughput
 >> and hardware options, which allows a much more high level view of
 >> the overall design. Ok, for a few apps like image processing and
 >> compression etc, more care might be needed but they are the
 >> exceptions for embedded work, afaics.
 >
 > Modern APPLICATIONS are orders of magnitude more complex! And,
 > you don't always use features to gain performance but, also, to
 > gain reliability, etc.

Perhaps many modern apps don't need it, but I don't write apps, so what
do I know? It's not only Windows and Linux that suffer from bloat
these days.

 >
 > The opposite is true. Why do we see increasingly complex OS's in use?
 > Ans: because you can design the mechanisms to detect and protect
 > against UNRELIABLE program operation *once* and leverage that across
 > applications and application domains.
 >

Are we talking about vanilla embedded work here, or big system design?

 >
 > Good design is fitting the design *to* the application "most effectively"
 > (which are squishy words that the developer defines). If every project
 > could be handled with a PIC and 2KB of RAM, there'd be no need for
 > MMU's, FPU's, RTOS's, HLL's, SMP, IPC/RPC, etc.
 >

Agreed, but much embedded work is not big systems stuff, but at the
simple state-driven-loop or rtos level. Ok, phones etc are all some
flavor of unix, Linux, whatever, but not typical embedded.

 >
 > Imagine if the authors of every application running on your PC had
 > to cooperate to ensure they were all LINKED at non-competing memory
 > addresses (because there was no relative addressing mode, segments,
 > virtual memory, etc.). Instead, the silicon -- and then the OS -- can
 > assume the burden of providing these mechanisms so the developers
 > need not be concerned with them.

That's why mainstream os's have loaders and memory management, because
you want maximum flexibility, whereas embedded is usually locked down
to a particular need.

I don't get into pc stuff, it's just a tool and I assume that it works,
which it generally does. Same for Linux, but FreeBSD gets more and more
interesting and is rock solid on X86 and Sparc here. After systemd and
other bloat issues, Linux becomes less and less attractive.

 >
 > IME, most non-trivial engineering decisions are hard to summarize in
 > a page (or ten :> ) or less.
 >
 > Time to take advantage of 12 hours of rain to do some digging...

Been chucking it down all day here today in Oxford, but that's uk
summer weather and as you say, an excuse to catch up with the groups
and get into some back burner ideas. Too many interests and not
enough time, as usual :-)...

Chris

On 7/11/2017 4:16 PM, Chris wrote:
>> The state of the FPU is *big* in most processors. With multicore
>> chips, that's multiplied by the number of cores. THE MEMORY BANDWIDTH
>> IS FIXED (and shared by *all* cores). Why move temporally distant
>> data through that pipe if you don't NEED to do so?
>
> You seem to be assuming high end systems running at the ragged
> edge, which isn't the sort of work done here. Leave that to the mobile,
> tablet and workstation / graphics people. You can't even be fluent in
> all aspects of computing, let alone the electronics that enables it.
No, I'm seeing an opportunity for an optimization that can be largely
transparent to *any* application (assuming the application makes use of
floating point operations -- with or without hardware assist) WITHOUT
burdening the developer with the details of its implementation.

E.g., 30+ years ago, I'd build floating point "subroutines" (ASM) with
a preamble that resembled:

   if (!flag) {
      save_floating_point_context(previous_owner)
      restore_floating_point_context(new_owner)
      flag = TRUE
   }
   ...   // body of actual "subroutine"

This allowed the "task switcher" (scheduler) to simply clear "flag" as
part of the normal context switch and DEFER handling the "floating
point unit" (which was a bunch of subroutines and a large shared
section of memory) to a time when the "new_owner" actually NEEDED it --
as indicated by his CALLing any of the floating point subroutines (ALL
of which had the above preamble). This allowed the "FPU" to be
implemented in a time-efficient manner (e.g., potentially leaving the
"floating point accumulator" in denormalized form instead of
normalizing after every operation!)

It's an obvious step from there to hooking the "helper routines" used
by many (esp *early*) compilers in the same way. And, from there, to
hooking the (early) *hardware* FPU's (e.g., Am9511) that were costly to
embrace in a multithreaded environment without such deferred
optimization. Finally, the more modern FPU's with better mechanisms to
detect these things IN HARDWARE (i.e., no need for that explicit
"if (flag)...")
>> Why do compilers worry so much about optimization? We *surely*
>> shouldn't NEED the effective resource gain that these options
>> provide, right?.
>
> Not sure. From a performance point of view, perhaps, but optimisation
> can reduce memory footprint, critical for some embedded work.
The point of all of these optimizations is they can be done, reliably, without requiring effort on the part of the developer. The folks responsible for designing/implementing your OS deal with this issue. Just like the compiler writers deal with the schemes/machinations to make your code smaller, faster, etc.
>> Unconditionally saving and restoring the FPU's state is akin to
>> unconditionally saving the entire state of the CPU for each
>> interrupt -- why invent things like FIRQ (which costs real silicon)
>> if these constraints "shouldn't happen these days"?
>
> I guess you are talking arm?. FIRQ is a leftover from early arm, fwir.
The thread is about ARMs (Cortex M4). FIRQ is still available in most (all?) ARM cores.
> Have you seen amount of tortuous code needed to get interrupts
> working properly with Arm7TDMI, for example?. About 2 pages of dense
> assembler, from memory. I rejected early arm almost on that basis
> alone, but there were other idiosyncracies. They fixed it eventually
> with a proper (68K) style vector table, but it took them a long time
> :-). Cortex was when Arm finally came of age.
>
>> You are using the current state of the FPU (busy) to effectively
>> make a scheduling decision -- without knowledge of whether or not
>> the task that SHOULD be executing, next (based on the scheduling
>> criteria selected AT DESIGN TIME) will actually need the FPU *or*
>> will need it "in this next period of execution" (avoiding the term
>> "time slice" because preemptive schedulers tend not to be driven
>> strictly by "time"). *OR*, even "shortly".
>
> It's a case of organisinmg system design, task allocation etc, so
> that you get a result that meets spec. Think systems engineering.
> If you have limited resources, something has to give, but would
> prefer a situation where an fpu operation always runs to completion.
> It's the simplest solution and the fewest variables in terms
> of estimating performance.
Do you turn the cache OFF in your designs -- because it makes it easier to estimate performance?
> Interrupting and saving fpu state could be done, but only if all
> other avenues have been explored. it's a whole can or worms best
> avoided if possible and dependent on the actual fpu in use. It needs
> memory to save context, added management code and maybe complex
> synchronisation issues. Even if you make it work, May turn out to
> be less efficient than run to completion.
Modern hardware FPU's tend to treat all opcodes as atomic. The
difference is with software emulations -- you'd not want to let the
emulation of FSIN run to completion when it can be interrupted at any
of the hundreds of opcode fetches spanning its duration.

[But, then you need to be able to preserve ALL of the emulator's state,
not just the state that visibly mirrors the hardware FPU!]
> Anyway, all kinds of events affect scheduling decisions, even if
> indirectly. To make a waiting process ready, for example. Don't see what
> the problem is. Perhaps that's the issue: Some always look for
> issues, while others assume everything is going to work.
Designing reliable products means thinking about everything that *can* go wrong and either ensuring it can't *or* being prepared to handle the case *when* it does.
>> If you want finer-grained access to the FPU, then you have to be willing
>> to save and restore the contexts of the individual clients on a
>> transactional basis.
>
> I don't want fine grained access if possible. I want a black box to
> feed data and get a result. Not really interested what happens
> under the hood, so long as it meets requirements and is predictable.
An FPU is essentially another CPU. As much (or more!) "internal state"
as the CPU itself. If you want to share that resource, then you need a
way of ensuring that task_A's FPU register contents aren't used (or
exposed!) by task_B's operations.

So, you either swap them in/out based on the identity of the (IPC)
client making the *new* request *or* examine the request and
selectively decide which portions of the FPU state are "safe" from
interference based on the nature of the FPU request (e.g., if it is an
attempt to FADD S0 and S1, then S2-S31 can be left in place -- only the
previous contents of S0 & S1 need to be preserved and the new client's
contents of S0 & S1 restored prior to servicing the request.)

[Think about the consequences of that sort of implementation: now you
have to track which *portions* of the FPU state are associated with
which tasks. *Or*, let the FPU emulation operate on FPU state *in* each
client's TCB]
>> Ask yourself: why did the vendor include these instructions in the
>> processor's design? Why did they complicate the silicon, and the
>> programming model. Surely, the developer has adequate resources
>> to blindly save the entire state; why provide provisions to save
>> only part of it? "It shouldn't happen these days" :>
>
> Simple, both memory and processors were slow in those days and needed
> all the help they could get. Modern processors arguably don't need them
> for most applications.
How do you KNOW that? As memory becomes increasingly the bottleneck,
the number of registers inside the processor (CPU, FPU, MMU, etc.)
increases in an attempt to cut down on memory traffic. E.g., the 99K
placed the bulk of the processor's registers *in* memory and just kept
a pointer to them (the Workspace Pointer) inside the CPU.

As the amount of state inside the CPU increases, the cost of context
switches goes up -- the memory accesses that have been "avoided" by
incorporating a register file eventually end up appearing "deferred"
(you pay the piper when the context switch comes along)
> Do commercial tool chains make use of them ?.
> Last time I checked, gcc still didn't know about interrupts, though
> some vendors do add extensions.
It still uses PUSH and POP -- even for a register-at-a-time.
>> Why would an MCU vendor add silicon and programming complexity
>> to a design to support this sort of treatment of the FPU?
>> Why waste an engineer's time documenting it:
>> <http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf>
>
> Competitive market perhaps ?. Featureitis between vendors to cater for
> widest application and market share. With most work these days, only
> use a fraction of the internal arch and throughout. I see that as good,
> as there's more freedom to think systems engineering, rather than detail.
Have you seen how many products use a Linux kernel when they don't
really *need* that level of functionality? How much does *it* draw into
the mix that the application itself doesn't intrinsically need?

Returning to my earlier comment, applications have become increasingly
complex. Some of this is natural progression. Some is a design tradeoff
("Let's use floating point instead of hassling with Q12.19..."). Some
is marketing hype.

I don't want "users" (Ma & Pa) to have to understand the consequences
of particular numeric data types. So, I use a BigRational form for the
"numbers" that users manipulate in their scripts. I can elect to give
them 200 digits of precision (or, let them opt for that themselves)
rather than explaining to them why you want to reorder:

    REALLY_BIG_NUMBER * REALLY_BIG_NUMBER * REALLY_BIG_NUMBER
  ---------------------------------------------------------------
  (REALLY_BIG_NUMBER * REALLY_BIG_NUMBER * REALLY_BIG_NUMBER) + 1

That comes at a cost: I "waste" some of the system's resources to
enable them to NOT need to think about this level of detail.

Similarly, I "waste" system resources to ensure program A can't stomp
on program B's code/data. Or, access a resource to which it should have
no need ("why is the MP3 player trying to access the NIC?")

All of these added complexities make the resulting system more robust
and easier to design within. (Easier just to *hide* a resource from
someone who shouldn't be needing it than it is to try to concoct a set
of ACL's that allow those who *should* have access to do so while
preventing those who shouldn't!)
>>> Have used them at times in interrupt handlers, but it
>>> requires poking around in the entrails and asm macros if you
>>> typically write interrupt handlers in C. Modern cpu's are orders of
>>> magnitude faster, a king's ransome of riches in terms of throughput
>>> and hardware options, which allows a much more high level view of
>>> the overall design. Ok, for a few apps like image processing and
>>> compression etc, more care might be needed but they are the
>>> exceptions for embedded work, afaics.
>>
>> Modern APPLICATIONS are orders of magnitude more complex! And,
>> you don't always use features to gain performance but, also, to
>> gain reliability, etc.
>
> Perhaps many modern apps don't need it, but don't write apps, so what
> do I know ?. It's not only windows and Linux that suffer from bloat
> these days.
*Systems* are more complex. In the past, products were isolated little
islands. Your mouse had no idea that it was sitting alongside a
keyboard. There was no interaction between them. Now, that is
increasingly NOT the case. It's now COMMON for applications to have
network connectivity (with all the complexity -- and risk -- that a
network stack brings to the design).

When the Unisite was released (80's?), it was "odd" in that it didn't
have a user interface: just two idiot lights, a power switch and a
"null modem" switch on the back. It *relied* on an external display
(glass TTY) to act as its user interface. Previous product offerings
had crippled little keypads and one-line displays that tried to provide
the same sort of information in a klunkier manner ("Use Mode 27 for
this...")

Now, it's common for a device to have no specific user interface and
rely on a richer interface provided by some external agency. No need
for DIP switches to configure a device: just set up a BOOTP server and
let the device *fetch* its configuration from a set of text files that
the user can prepare with more capable tools (than a kludgey keypad
interface).
>> The opposite is true. Why do we see increasingly complex OS's in use?
>> Ans: because you can design the mechanisms to detect and protect
>> against UNRELIABLE program operation *once* and leverage that across
>> applications and application domains.
>
> Are we talking about vanilla embedded work here, or big system design ?.
You're assuming embedded is NOT "big system design". The cash registers
at every store I visit are PC (or iPad) based. What do you call *them*?
Does a cash register need "DirectX" capabilities? Or, the ability to
read FAT12/16 filesystems?

My current system is distributed. I "waste" an entire core on each node
just servicing communications and RPC. <shrug> I'll *take* every
optimization that I can get "for free" to pay for these more costly
capabilities (that can't easily be optimized).
>> Good design is fitting the design *to* the application "most effectively"
>> (which are squishy words that the developer defines). If every project
>> could be handled with a PIC and 2KB of RAM, there'd be no need for
>> MMU's, FPU's, RTOS's, HLL's, SMP, IPC/RPC, etc.
>
> Agreed, but much embedded work is not big systems stuff, but at simple
> state driven loop or rtos level. Ok, phones etc are all some
> flavor of unix, Linux, whatever, but not typical embedded.
This is a THERMOSTAT:
<https://www.ifixit.com/Teardown/Nest+Learning+Thermostat+2nd+Generation+Teardown/13818>

Conceptually, it just implements:

   case mode {
   HEAT => if (temperature < setpoint) furnace(on)
   COOL => if (temperature > setpoint) ACbrrr(on)
   }

As I said, applications are getting increasingly complex! Do you *need*
that sort of capability in a thermostat? Questionable. OTOH, if it can
reduce your heating/cooling costs, then it's potentially "free". A
"dumb" thermostat can be MORE expensive!
>> Imagine if the authors of every application running on your PC had
>> to cooperate to ensure they were all LINKED at non-competing memory
>> addresses (because there was no relative addressing mode, segments,
>> virtual memory, etc.). Instead, the silicon -- and then the OS -- can
>> assume the burden of providing these mechanisms so the developers
>> need not be concerned with them.
>
> That's why mainstream os's have loaders and memory management, because
> you want maximum flexibity, whereas embedded is usually locked down
> to particular need.
"Usually" is a representation of The Past. Looked at the capabilities of "smart TV's" lately? My current system is "deeply embedded". But, provides a *richer* execution environment than a typical desktop PC -- because it aims to be more durable, extensible and reliable. You can replace a PC every few years; you wouldn't want to replace ALL the automation in a particular business every few years! "Is there something WRONG with the existing irrigation system? Burglar alarm? HVAC controls? Energy management system?? I.e., WHY should we be replacing/uprading it?"
> I don't get into pc stuff, it's just a tool and I assume that it works,
> which it generally does. Same for Linux, but FreeBSD gets more and more
> interesting and is rock solid on X86 and Sparc here. After systemd and
> other bloat issues, Linux becomes less and less attractive.
The top end of the "embedded" domain keeps nibbling at the underbelly
of the "desktop/mainframe" domain. 40 years ago, I could pilot a boat
with a few KB of code and an actuator for the rudder. Nowadays, cars
park and drive themselves -- undoubtedly with far more than a few KB
and a fractional MIPS of resources!

An embedded designer who isn't aware of the technologies that are
becoming increasingly "affordable" is doomed to designing 2 button mice
for the rest of his days.
>> IME, most non-trivial engineering decisions are hard to summarize in
>> a page (or ten :> ) or less.
>>
>> Time to take advantage of 12 hours of rain to do some digging...
>
> Been chucking it down all day here today in Oxford, but that's uk
> summer weather and as you say, an excuse to catch up with the groups
> and get into some back burner ideas. Too many interests and not
> enough time, as usual :-)...
Time gets scarcer and interests (for anyone with an imagination) multiply. The only solution I've found is to reduce the time spent asleep! :<
On Wed, 12 Jul 2017 11:44:38 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 7/11/2017 4:16 PM, Chris wrote:
>
>> Last time I checked, gcc still didn't know about interrupts, though
>> some vendors do add extensions.
>
>It still uses PUSH and POP -- even for a register-at-a-time.
FWIW: on modern x86 ["modern" meaning since Pentium Pro, circa 1995],
PUSH and POP instructions are internally converted to MOV instructions
referencing the appropriate [stack offset] addresses. A sequence of
PUSHes or POPs may be executed simultaneously and/or out of order.

x86 compilers still emit the PUSH and POP instructions because they are
more representative of the logical model expected by the programmer who
examines the generated code.
>Have you seen how many products use a Linux kernel when they don't really
>*need* that level of functionality? How much does *it* draw into
>the mix that the application itself doesn't intrinsically need?
A *really* minimal configuration provides little more than chipset
support, tasking, and memory management with MMU isolation. Depending
on the kernel version that could be as little as ~80 KB of code. You
can run a [tight] kernel+application image in as little as 1 MB.

You actually *can* run Linux sans MMU, but it is difficult because so
many existing drivers and software stacks assume the MMU is present and
enabled. You have to be willing/able to roll your own system software.

George
On 7/12/2017 2:46 PM, George Neuner wrote:
> On Wed, 12 Jul 2017 11:44:38 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 7/11/2017 4:16 PM, Chris wrote:
>>
>>> Last time I checked, gcc still didn't know about interrupts, though
>>> some vendors do add extensions.
>>
>> It still uses PUSH and POP -- even for a register-at-a-time.
>
> FWIW: on modern x86 ["modern" meaning since Pentium Pro, circa 1995],
> PUSH and POP instructions are internally converted to MOV instructions
> referencing the appropriate [stack offset] addresses. A sequence of
> PUSHes or POPs may be executed simultaneously and/or out of order.
On machines with more orthogonal instruction sets,
auto-pre/post-inc/decrement addressing modes could effectively
implement *a* stack using any register. So, a PUSH/POP (PULL) was just
a shorthand for a "well decorated" opcode:

   MOV (R6)+ R0

Even the '8 had a mechanism for doing this using particular "memory
indirect" addressing modes via a small set (16?) of specific memory
addresses.

[IIRC, the Nova's could conceptually keep indirecting through "random"
memory locations indefinitely... "never" coming up with a final
effective address!]

With processors that didn't have the same sort of orthogonality in
addressing modes available, PUSH/POP/PULL could *imply* the auto
inc/decrement register indirect mode on a *special* register (SP).
> x86 compilers still emit the PUSH and POP instructions because they
> are more representive of the logical model expected by the programmer
> who examines the generated code.
>
>> Have you seen how many products use a Linux kernel when they don't really
>> *need* that level of functionality? How much does *it* draw into
>> the mix that the application itself doesn't intrinsically need?
>
> A *really* minimal configuration provides little more than chipset
> support, tasking, and memory management with MMU isolation. Depending
> on the kernel version that could be as little as ~80 KB of code. You
> can run a [tight] kernel+application image in as little as 1 MB.
My point is that folks don't bother to trim that DEAD CODE from their
products. Either they figure it's not worth the effort (CODE memory is
cheap?) *or* they are fearful of their lack of DETAILED knowledge of
the kernel's internals and don't want to risk "breaking something".

How many devices support a web interface that, conceptually, should
only be accessed by a single client at any given time -- but don't
expressly PREVENT two or more simultaneous connections? Just drop the
cobbled code into the application and coax it to do what you want --
and hope the "extra" code is never accidentally activated (exploited!)

You don't, for example, think I'm going to elide the code from
PostgreSQL that supports the UUID type because I don't need/use it?
<grin> Rather, I'll *rationalize* that someone MIGHT make use of it in
the future and use that to justify leaving it in the codebase (despite
it being, effectively, dead code!)
> You actually *can* run Linux sans MMU, but it is difficult because so
> many existing drivers and software stacks assume the MMU is present
> and enabled. You have to be willing/able to roll your own system
> software.
Don Y <blockedofcourse@foo.invalid> wrote:
> On 7/11/2017 4:16 PM, Chris wrote:
>>> Unconditionally saving and restoring the FPU's state is akin to
>>> unconditionally saving the entire state of the CPU for each
>>> interrupt -- why invent things like FIRQ (which costs real silicon)
>>> if these constraints "shouldn't happen these days"?
>>
>> I guess you are talking arm?. FIRQ is a leftover from early arm, fwir.
>
> The thread is about ARMs (Cortex M4). FIRQ is still available in
> most (all?) ARM cores.
ARMv7-M did away with most modes, leaving only thread and handler modes
(corresponding to the old "usr" and "svc" modes). ARMv8-M added secure
variants of both. There are no equivalents of the "fiq", "irq", "abt",
"sys" and "und" modes.

Another difference is that R13 (stack pointer) is the only banked
register in the base architecture, plus secure state versions of some
control registers in ARMv8-M.

-a
