On 4/14/2017 2:53 AM, upsidedown@downunder.com wrote:
>>>>> Is there any potential downside to intentionally exposing
>>>>> (to a task/process/job/etc.) its current resource commitments?
>>>>> I.e., "You are currently holding X memory, Y CPU, Z ..."
>>>>>
>>>>> *If* the job is constrained to a particular set of quotas,
>>>>> then knowing *what* it is using SHOULDN'T give it any
>>>>> "exploitable" information, should it?
>>>>>
>>>>> [Open system; possibility for hostile actors]
>>>>
>>>> It depends on how paranoid you want to be.
>>>
>>> IIRC, some high-security systems require that a process _not_ be
>>> able to determine whether it is running on a single-CPU machine,
>>> a multi-CPU system, or in a virtual machine -- nor, in any case,
>>> what the CPU speed is.
>>
>> In practice, that's almost impossible to guarantee. Especially if you can
>> access other agencies that aren't thusly constrained. E.g., issue a
>> request to a time service, count a large number of iterations of a
>> loop, access the time service again...
>
> Exactly for that reason, the process is not allowed to ask for the
> time-of-day.
But, then it also can't be allowed to ask a whole class of questions
that would let it *infer* the amount of elapsed time. E.g., send an
HTTP request to *google's* time service; wait; send another request...
It's very difficult to apply "filters" that can screen arbitrary
information from the results to which they are applied!  :>
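To make that concrete -- a minimal sketch, assuming a hypothetical
fetch_remote_seconds() that queries some *external* (unconstrained)
time service:

    #include <stdio.h>

    /* Hypothetical: issues an HTTP request to an outside time
       service (e.g., google's) and returns its answer in seconds. */
    extern long fetch_remote_seconds(void);

    int main(void)
    {
        long t0 = fetch_remote_seconds();

        volatile unsigned long i;
        for (i = 0; i < 1000000000UL; i++)
            ;                            /* busy-work of known "size" */

        long t1 = fetch_remote_seconds();
        long elapsed = t1 - t0;

        /* Elapsed time -- and, hence, effective CPU speed -- inferred
           without ever consulting the *local* clock.                  */
        if (elapsed > 0)
            printf("~%g iterations/sec\n", 1e9 / (double)elapsed);

        return 0;
    }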
>>> Regarding quotas in more normal systems, quotas are used to keep
>>> rogue processes from overclaiming resources.
>>
>> But, that can be to limit the "damage" done by malicious processes
>> as well as processes that have undetected faults. It can also be
>> used to impose limits on tasks that are otherwise unconstrainable
>> (e.g., how would you otherwise limit the resources devoted to
>> solving a particular open-ended problem?)
>>
>>> In practice the sum of
>>> specific quotas for all processes can be much greater than the total
>>> resources available. Thus, a process may have to handle a denied
>>> request even if its own quota would have allowed it.
>>
>> Yes, but how the "failure" is handled can vary tremendously -- see
>> below.
>>
>>> Only when the sum of all (specific) quotas in a system is less than
>>> the total available resources should you be able to claim resources
>>> without checks, as long as your claim is less than the quota
>>> allocated to that process.
>>
>> That assumes the application is bug-free.
>>
>>> But how would a process know whether the sum of quotas is less or
>>> more than the resources available?  Thus, the only safe way is to
>>> check for failed resource allocations in any case.
>>
>> How a resource request that can't be *currently* satisfied is
>> handled need not be an outright "failure". The "appropriate"
>> semantics are entirely at the discretion of the developer.
>>
>> When a process goes to push a character out a serial port
>> while the output queue/buffer is currently full (i.e., "resource
>> unavailable"), it's common for the process to block until the
>> call can progress as expected.
>
> There can be many reasons why the Tx queue is full.
Yet a typical API handles ALL of those conditions in just *one* way!
Much cleaner to let the application decide how *it* wants the
"potential block" to be implemented, based on *its* understanding
of the problem space.
In my case, I allow a timer-object to be included in service
requests (everything is a service or modeled as such). If the
request can't be completed (and isn't "malformed"), the task
blocks until it can be satisfied *or* the timer expires (at
which time, the "original error" is returned).
So, if you want the task to continue immediately with the
error indication, you pass a NULL timer to the service,
effectively causing it to return immediately -- with PASS/FAIL
status (depending on whether or not the request was satisfied).
[Note that the timer doesn't limit the duration of the service request;
merely the length of time the task can be BLOCKED waiting for that
request to be processed]
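A minimal sketch of those semantics -- names assumed for illustration,
not my actual API; the "timer" is reduced to a millisecond budget and
the resource to a toy flag:

    #include <stdio.h>
    #include <stdbool.h>

    typedef long timer_ms;              /* 0 plays the role of the NULL timer */
    typedef enum { PASS, FAIL } status_t;

    static bool resource_free = false;  /* toy stand-in for the real resource */

    /* If the request can't be satisfied now (and isn't malformed),
       block -- crudely modeled here as polling -- until it can be,
       *or* until the timer expires, returning the original error.  */
    static status_t request_resource(timer_ms limit)
    {
        timer_ms waited = 0;

        while (!resource_free) {
            if (waited >= limit)
                return FAIL;            /* timer expired: original error */
            waited++;                   /* pretend a millisecond passed  */
        }
        return PASS;
    }

    int main(void)
    {
        /* NULL timer: return immediately with PASS/FAIL status */
        if (FAIL == request_resource(0))
            printf("continue immediately, error in hand\n");

        resource_free = true;

        /* willing to BLOCK up to 100ms for the request to be satisfied */
        if (PASS == request_resource(100))
            printf("request satisfied within the blocking window\n");

        return 0;
    }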
> For instance, in a
> TCP/IP or CANbus connection, the Tx queue can fill up if the
> physical connection is broken. In such cases, buffering outgoing
> messages for seconds, minutes or hours can be lethal, when the
> physical connection is restored and all buffered messages are sent at
> once. In such cases, it is important to kill the buffered Tx queue as
> soon as the line fault is detected.
Or, not! How often do you unplug a network cable momentarily?
Should EVERYTHING that is pending be unceremoniously aborted
by that act? Even if you reconnect the cable moments later?
>> When a process goes to reference a memory location that has
>> been swapped out of physical memory, the request *still*
>> completes -- despite the fact that the reference may take
>> thousands of times longer than "normal" (who knows *when* the
>> page will be restored?!)
>
> This is not acceptable in a hard real time system, unless the
> worst case delay can be firmly established. For this reason, in hard
> RT systems, virtual memory is seldom used -- or, at least, the pages
> used by the high priority tasks are locked into the process working
> set.
The key, here, is to know what the worst case delay is likely to be.
E.g., if two RT processes vie for a (single) shared resource, there
is a possibility that one will own the resource while the other is
wanting it. But, if the maximum time the holder can retain the
resource fits within the "slack time" budget of the other competitor,
then there is no reason why the competitor can't simply block *in*
the request. This is MORE efficient than returning FAIL and
having the competitor *spin* trying to reissue the request
(when BLOCKED, the competitor isn't competing for CPU cycles,
so the process holding the resource has more freedom to "get
its work done" and release the resource!)
Again, the *application* should decide how a potentially FAILed
request is handled. And, since one common solution is to spin,
reissuing the request, supporting this *in* the service makes life
easier for the developer AND makes for a more reliable product.
competitorA:
    ...
    result = request_resource(<parameters>, SLACK_TIME_A)
    if (SUCCESS != result) {
        /* not granted within A's slack time -- handle the miss */
        fail()
    }
    ...

competitorB:
    ...
    result = request_resource(<parameters>, SLACK_TIME_B)
    if (SUCCESS != result) {
        /* not granted within B's slack time -- handle the miss */
        fail()
    }
    ...
>> When a process goes to fetch the next opcode (in a fully preemptible
>> environment), there are no guarantees that it will retain ownership
>> of the processor for the next epsilon of time.
>
> There is a guarantee for the highest priority process only, but not
> for other processes. Still, hardware interrupts (such as the page fault
> interrupt) may change the order even for the highest priority process.
> For that reason, you should try to avoid page fault interrupts, e.g., by
> locking critical pages into the working set.
Again, that depends on the application and the rest of the activities
in the system.
E.g., I diarize recorded telephone conversations as a low-priority
task -- using whatever resources happen to be available at the time
(which I can't know, a priori). If the process happens to fault
pages in *continuously* (because other processes have "dibs" on
the physical memory on that processor), then the process runs slowly.
But, the rest of the processes are unaffected by its actions (because
the page faults are charged to the diarization task's ledger, not
"absorbed" in "system overhead").
OTOH, if the demands (or *resource reservations*) of the rest of the system
allow the diarization task to have MORE pages resident and, thus,
reduce its page fault overhead, the process runs more efficiently
(which means it can STOP making demands on the system SOONER -- once
it has finished!)
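And, for the tasks that genuinely *can't* tolerate unbounded
memory-reference times, the page-locking the quoted text advocates is
readily available -- e.g., on POSIX systems (a sketch; error handling
reduced to a message):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Pin every page the process currently maps -- and any it
           maps later -- into physical memory; no page faults on
           touched data thereafter (given sufficient privilege).   */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* ... hard-RT work with bounded memory access times ... */

        munlockall();           /* release the residency guarantee */
        return 0;
    }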
>> When a process wants to take a mutex, it can end up blocking
>> in that operation, "indefinitely".
>
> For this reason, I try to avoid mutexes as much as possible by
> concentrating on the overall architecture.
You can't always do that. Larger systems tend to require more
sharing. Regardless of how you implement this (mutex, monitor,
etc), the possibility of ANOTHER process having to wait increases
with the number of opportunities for conflict.
Rather than rely on a developer REMEMBERING that he may not be
granted the resource when he asks for it AND requiring him to
write code to spin on its acquisition, let the system make that
effort more convenient and robust for him. So, all he has to
do is consider how long he is willing to wait (BLOCK) if need be.
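POSIX offers one concrete analogue of this "state your wait budget"
style (not my API, but the same idea) -- a sketch:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    /* Acquire 'm', but BLOCK for at most 'budget_ms' milliseconds --
       no hand-rolled trylock/spin loop required of the developer.   */
    static int acquire_with_budget(pthread_mutex_t *m, long budget_ms)
    {
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);  /* timedlock uses realtime */
        deadline.tv_sec  += budget_ms / 1000;
        deadline.tv_nsec += (budget_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec  += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        return pthread_mutex_timedlock(m, &deadline);  /* 0 == acquired */
    }

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

        if (acquire_with_budget(&m, 50) == 0) {    /* willing to wait 50ms */
            /* ... use the shared resource ... */
            pthread_mutex_unlock(&m);
        } else {
            printf("budget exhausted -- handle the miss\n");
        }
        return 0;
    }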
>> Yet, developers have no problem adapting to these semantics.
>
> Because it is done in early architectural design. Trying to add
> last-ditch kludges during the testing phase is an invitation to
> disaster.
How is making a consistent "time constrained, blocking optional"
API a kludge?
>> Why can't a memory allocation request *block* until it can
>> be satisfied? Or, any other request for a resource that is
>> in scarce supply/overcommitted, currently?
>
> Not OK for any HRT system, unless there is a maximum acceptable value
> for the delay.
See above. Task A is perfectly happy to be BLOCKED while tasks B, C, D
and Q all vie for the processor. Yet, that doesn't preclude their use in
a RT system.
>> This is especially true in cases where resources can be overcommitted
>> as you may not be able to 'schedule' the use of those resources
>> to ensure that the "in use" amount is always less than the
>> "total available".
> Overcommitment is a no-no for HRT as well as high reliability systems.
Nonsense. It's a fact of life.
Do you really think our missile defense system quits when ONE deadline
is missed? ("Oh, My God! A missile got through! Run up the white flag!!")
The typical view of HRT is naive. It assumes hard deadlines "can't be missed".
That missing a network packet -- or a character received on a serial port -- is
as consequential as missing a bottle as it falls off the end of a conveyor
belt... or missing an orbital insertion burn on a deep space probe.
A *hard* deadline just means you should STOP ALL FURTHER WORK on any
task that has missed its hard deadline -- there is nothing more to
be gained by pursuing that goal.
The *cost* of that missed deadline can vary immensely!
If Windows misses a mouse event, the user may be momentarily puzzled
("why didn't the cursor move when I moved my mouse"). But, the
consequences are insignificant ("I guess I'll try again...")
OTOH, if a "tablet press monitor" (tablet presses form tablets/pills
by compressing granulation/powder at rates of ~200/second) happens
to "miss" deflecting a defective tablet from the "good tablets"
BARREL, the press must be stopped and the barrel's contents
individually inspected to isolate the defective tablet from the
other "good" tablets. (this is a time consuming and expensive
undertaking -- even for tablets that retail for lots of money!)
In each case, however, there is nothing that can be done (by the
process that was SUPPOSED to handle that situation BEFORE the
deadline) once the deadline has passed.
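In code terms, the policy those examples share is simply this (a
sketch; goal_met() and do_unit_of_work() are assumed stubs, for
illustration only):

    #include <time.h>
    #include <stdbool.h>

    static bool goal_met(void)        { return false; }  /* assumed */
    static void do_unit_of_work(void) { /* ... */ }      /* assumed */

    /* A *hard* deadline: once it passes, the work has NO residual
       value -- so stop spending resources on it and let the rest
       of the system get on with life.                             */
    void pursue_goal(const struct timespec *deadline)
    {
        struct timespec now;

        while (!goal_met()) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (now.tv_sec > deadline->tv_sec ||
                (now.tv_sec == deadline->tv_sec &&
                 now.tv_nsec >= deadline->tv_nsec))
                return;                /* nothing more to be gained */
            do_unit_of_work();
        }
    }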
> These days the hardware is so cheap that for a RT / high reliability
> system, I recommend 40-60 % usage of CPU channels and communications
> links. Going much higher than that is going to cause problems sooner
> or later.
Again, that depends on the "costs" of those "problems".
SWMBO's vehicle often misses button presses as it is "booting up".
OTOH, the video streamed from the backup camera appears instantly
(backing up being something that you often do within seconds of
"powering up" the car). It's annoying to not be able to access
certain GPS features in those seconds. But, it would be MORE
annoying to see "jerky" video while the system brings everything
on-line.
Or, reduce this start-up delay by installing a faster processor...
or by letting the software "hibernate" for potentially hours/days/weeks
between drives (and adding mechanisms to verify that the memory
image hasn't been corrupted in the meantime).
> A 90-100 % utilization might be OK for a time sharing system or
> mobile phone apps or for viewing cat videos :-)