Reply by Niklas Holsti August 9, 2021
On 2021-08-09 21:26, Paul Rubin wrote:
> Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
>> And, AIUI, at present that probability cannot be
>> computed, and certainly depends on the test suite being measured. For
>> example, on whether those tests lead to mispredictions in chains of
>> conditional branches.
>
> Maybe architectural simulation of the target cpu can help, if the
> architecture is known (i.e. exact workings of the pipelines, branch
> predictors etc).
For timing, one needs *micro*-architectural simulation (as the term is commonly used, for example in comp.arch). But I think you mean that, so this is a terminological quibble.
> And maybe forthcoming RISC-V cpus will be more open about this than
> the current ARM stuff is.
I suspect that the microarchitecture will be where the various RISC-V implementors will compete (that, and peripherals and packaging), so I'm not optimistic that they will be very open about their microarchitectures. However, I don't know how far into the micro-level the RISC-V standardization and open-source licensing extends.
Reply by Don Y August 9, 2021
On 8/9/2021 12:50 AM, Niklas Holsti wrote:
> If you are not satisfied with Don's approach (extreme over-provision of
> processor power)
It's not necessarily "extreme". I only grossly underestimate the processor's ability to do the "must do" aspects of the design. This ensures that they actually *will* get done. And, what's *left* will likely exceed the needs of what I'd *like* to (also) get done -- but, as that is (by definition) not a "must do" aspect of the design, there are no criteria as to HOW MUCH must get done (including "none" and "all").

In my world, I can defer execution of some tasks, *move* tasks from an overburdened processor to a less-burdened one, *add* physical processors to the mix (on demand), etc. So, trying to match (at design time) the load to the capabilities is neither necessary nor desirable/economical.

There's no (a priori) "hard limit" on the number/types of applications that you can run on your PC, is there? Instead, you dynamically review the performance you are experiencing and adjust your expectations accordingly. Maybe kill off some application that isn't *as* useful to you (at the moment) as some other. Or, defer invoking an application until some *other* has exceeded its utility.

You could, of course, just buy a bigger/faster PC! Yet, you don't (up to a point). Because that changes the economics of the solution. And, because you can "make do" with your current investment just by more intelligently scheduling its use!

By minimizing the "must do" aspect of the problem, you give yourself the most flexibility in how you address that/those requirements. The rest is "free", figuratively speaking.

For example, one of my applications handles telephony. It screens incoming callers, does speaker identification ("who is calling"), etc. To train the recognizer, I have it "listen in" on every call, given that it now KNOWS who I am talking with and can extract/update speech parameters from the additional "training material" that is present, FOR FREE, in the ongoing conversation (instead of explicitly requiring callers to "train" the recognizer in a "training session").

But, there's no reason that this has to happen:
- in real time
- while the conversation is in progress
- anytime "soon"

So, if you *record* the conversation (relatively inexpensive, as you are already moving the audio through the processor), you can pick some *later* time to run the model update task. Maybe in the middle of the night when there are fewer demands (i.e., no one telephoning you at 3AM!). And, if something happens to "come up" that is more important than this activity, you can checkpoint the operation, kill off the process, and let the more important activity have access to those resources.

Yes, it would be a much simpler design if you could just say: "I *need* the following resources to be able to update the recognizer models AS THE CONVERSATION IS HAPPENING". But, when the conversation was *over*, you'd have all those extra re$ource$ sitting idle.

Turn "hard" requirements into *soft* ones -- and then shift them, in time, to periods of lower resource utilization.

[The "hard real-time is hard but soft real-time is HARDER" idiom]
> you could try the hybrid WCET-estimation tools (RapiTime or
> TimeWeaver) which do not need to model the processors, but need to measure
> fine-grained execution times (on the basic-block level). The problem with such
> tools is that they cannot guarantee to produce an upper bound on the WCET, only
> a bound that holds with high probability. And, AIUI, at present that
> probability cannot be computed, and certainly depends on the test suite being
> measured. For example, on whether those tests lead to mispredictions in chains
> of conditional branches.
The equivalent mechanism in my world is monitoring *actual* load. As it increases, you have a mismatch between needs and capabilities. So, adjust one, the other, or both! Shed load (remove processes from the current node) and/or add capacity (*move* processes to another node, including bringing that other node "on-line" to handle the load!).

If you keep in mind the fact that you only have to deal with the MUST DO tasks, this is a lot easier to wrap your head around. E.g., eventually, there *will* be a situation where what you *want* to do exceeds the capabilities that you have available -- period. So, you have to remember that only some of those are MUST DOs. Yeah, it would be nice to have 3000 applications loaded and ready at a mouse-click... but that wouldn't be worth the cost of the system required to support it!
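(A minimal sketch of that monitor-and-adjust loop, in Python. The task records, thresholds, and the "send to another node" step are hypothetical illustrations of the idea described above, not Don's actual system.)

import time

CAPACITY = 1.0      # normalized capacity of this node
HIGH = 0.85         # start shedding above this utilization
LOW = 0.40          # room to take on (or resume) work below this

def monitor(tasks, idle_node_available, send_to_other_node):
    """tasks: list of dicts like {"name": ..., "load": 0.2, "must_do": True}."""
    while True:
        load = sum(t["load"] for t in tasks)
        if load > HIGH * CAPACITY:
            # Overloaded: first defer/kill optional ("want to do") work...
            optional = [t for t in tasks if not t["must_do"]]
            if optional:
                victim = max(optional, key=lambda t: t["load"])
                tasks.remove(victim)              # shed load
            # ...then migrate even MUST-DO work if another node can take it.
            elif idle_node_available():
                victim = max(tasks, key=lambda t: t["load"])
                tasks.remove(victim)
                send_to_other_node(victim)        # add capacity
        elif load < LOW * CAPACITY:
            pass  # could resume deferred tasks or accept migrated ones here
        time.sleep(1.0)  # re-evaluate periodically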
Reply by Paul Rubin August 9, 2021
Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
> And, AIUI, at present that probability cannot be
> computed, and certainly depends on the test suite being measured. For
> example, on whether those tests lead to mispredictions in chains of
> conditional branches.
Maybe architectural simulation of the target cpu can help, if the architecture is known (i.e. exact workings of the pipelines, branch predictors etc). And maybe forthcoming RISC-V cpus will be more open about this than the current ARM stuff is.
Reply by Niklas Holsti August 9, 2021
On 2021-08-09 1:34, George Neuner wrote:
> On Sat, 7 Aug 2021 13:40:19 +0300, Niklas Holsti
> <niklas.holsti@tidorum.invalid> wrote:
>
>> On 2021-08-07 5:55, George Neuner wrote:
>>
>>> More importantly, how do you handle chains of conditional branches
>>> and/or switch/case constructs which can mispredict at every decision
>>> point? The branch targets may not be in cache - neither mispredicted
>>> targets nor the actual one. Worst case for a horribly mispredicted
>>> switch/case can be absolutely dreadful.
>>
>>
>> I don't know if those questions are addressed to me (re the Bound-T
>> tool) or to WCET analysis in general.
>
> A little of both really. I was hoping you had some smart(er) approach
> to estimating misprediction effects in systems that use dynamic
> prediction - even if it's just heuristic.
Sorry, but no.
> A lot of more powerful chips now are being used even in 'small'
> systems, and some of them do have dynamic prediction. Note that Don
> is talking about Cortex A53, A55, etc.
Indeed, and therefore static WCET analysis is waning, and hybrid (partly measurement-based) WCET estimation is waxing.
> Dynamic prediction handles loop control quite well ... it's all the
> other branching code that is the problem.
>
> WCET has use outside the 'embedded' world also. A lot of my work was
> in QA/QC vision systems: tons of code, tons of features, workstation
> class processors, and still having to be /hard real time/.
(I would call that "embedded", because it is a computerized system used for a single purpose.)

One could say, perhaps meanly, that that is an ill-posed problem, with the wrong choice of processors. But cost matters, of course...

If you are not satisfied with Don's approach (extreme over-provision of processor power) you could try the hybrid WCET-estimation tools (RapiTime or TimeWeaver) which do not need to model the processors, but need to measure fine-grained execution times (on the basic-block level). The problem with such tools is that they cannot guarantee to produce an upper bound on the WCET, only a bound that holds with high probability. And, AIUI, at present that probability cannot be computed, and certainly depends on the test suite being measured. For example, on whether those tests lead to mispredictions in chains of conditional branches.
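(As a rough illustration of the measurement-based side of such hybrid tools -- not the actual RapiTime or TimeWeaver algorithms -- one can take the largest execution time observed for each basic block over the test suite and combine those high-water marks along the worst structural path. The block names and cycle counts below are invented.)

# Sketch: per-basic-block high-water marks from measurements, combined
# over the worst path of a tiny, loop-free control-flow graph.
# If the tests never exercised a block's true worst case (e.g. a chain
# of branch mispredictions), the result is optimistic -- hence only a
# probabilistic "bound".

measurements = {            # observed cycles per basic block, per test run
    "entry":    [12, 14, 13],
    "then_arm": [30, 41, 35],
    "else_arm": [25, 27, 26],
    "exit":     [10, 10, 11],
}
hwm = {blk: max(times) for blk, times in measurements.items()}

succ = {                    # control-flow graph: successors of each block
    "entry":    ["then_arm", "else_arm"],
    "then_arm": ["exit"],
    "else_arm": ["exit"],
    "exit":     [],
}

def bound(block):
    """High-water-mark time of the worst path starting at 'block'."""
    tails = [bound(s) for s in succ[block]] or [0]
    return hwm[block] + max(tails)

print(bound("entry"))       # -> 66 cycles (entry + then_arm + exit)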
Reply by George Neuner August 8, 2021
On Sat, 7 Aug 2021 13:40:19 +0300, Niklas Holsti
<niklas.holsti@tidorum.invalid> wrote:

>On 2021-08-07 5:55, George Neuner wrote:
>
>> More importantly, how do you handle chains of conditional branches
>> and/or switch/case constructs which can mispredict at every decision
>> point? The branch targets may not be in cache - neither mispredicted
>> targets nor the actual one. Worst case for a horribly mispredicted
>> switch/case can be absolutely dreadful.
>
>
>I don't know if those questions are addressed to me (re the Bound-T
>tool) or to WCET analysis in general.
A little of both really. I was hoping you had some smart(er) approach to estimating misprediction effects in systems that use dynamic prediction - even if it's just heuristic.

A lot of more powerful chips now are being used even in 'small' systems, and some of them do have dynamic prediction. Note that Don is talking about Cortex A53, A55, etc.
>The advanced WCET tools, like the
>ones from AbsInt, do try to cover such cases by their cache analysis to
>provide a safe upper bound on the WCET, but the degree of pessimism
>(over-estimation) increases with the complexity of the code.
>
>That said, in many programs most of the processing time is taken by
>relatively simple loops, for which the present I-cache analyses work
>quite well, and for which even D-cache analysis can (I believe) work
>satisfactorily if the memory addressing patterns are regular.
Dynamic prediction handles loop control quite well ... it's all the other branching code that is the problem.

WCET has use outside the 'embedded' world also. A lot of my work was in QA/QC vision systems: tons of code, tons of features, workstation class processors, and still having to be /hard real time/.

George
Reply by Don Y August 7, 2021
On 8/7/2021 10:46 AM, Niklas Holsti wrote:
> On 2021-08-07 20:26, Don Y wrote:
>> On 8/7/2021 4:20 AM, Niklas Holsti wrote:
>
> (In a discussion about WCET analysis more than stack analysis:)
>
>>> Ok. If you describe your needs, perhaps I can comment.
>>
>> I'm targeting some of the bigger ARM offerings -- A53/55. My
>> impression is they go through more "contortions" to eke out
>> additional performance than some of the smaller processors.
>
> Out of scope for Bound-T, I'm sure. Even the AbsInt tool has no support for
> static WCET analysis of those processors and their ARM page suggests a hybrid
> tool instead (https://www.absint.com/ait/arm.htm).
You can see why I favor hand-waving away all of the "tricks" the processors can play to improve performance! Granted, the numbers that result are TRULY "worst case" -- and likely significantly inflated! But, if you size for that, then all of the "noncritical" stuff runs "for free"!
>>> In my experience, access to memory-mapped registers tends to be not much
>>> slower than access to memory. Also, quite often the addresses in MMIO
>>> accesses are static or easy to derive in static analysis, so the WCET tool
>>> could choose the proper access time, if the tool knows enough about the
>>> system architecture.
>>
>> Yes, that last phrase being the kicker. Can this be simplified to something
>> as crude as a "memory map"?
>
> It seems so, if the access time is simply a function of the accessed address.
>
>>> What is VMM?
>>
>> Virtual Memory Management. I.e., when an opcode fetch (or argument
>> reference) can not only take longer than a cache miss... but
>> *considerably* longer as the physical memory is mapped in while the
>> instruction stream "stalls".
>
> I don't recall seeing any static WCET analysis for page faults, but there may
> be some for Translation Look-aside Buffer misses. Out of my competence, and out
> of scope for Bound-T, certainly.
Unlike a cache miss, it's not easy to predict those costs without an intimate model of the software. E.g., two "closely located" addresses can have entirely different behaviors (timing) on a page fault. And, you likely can't predict where those addresses will reside as that's a function of how they are mapped at runtime.

Unlike a conventional VMM system (faulting in/out pages from a secondary/disk store), I rely on the mechanism heavily for communication and process container hacks. I.e., there's a good reason to over-specify the hardware -- so you can be inefficient in your use of it! :-/
>> [Note that a page fault need not map physical memory in the
>> traditional sense. It can also cause some "special" function to be
>> invoked to provide the requisite data/access. So, the cost of a
>> fault can vary depending on *what* is faulting and which pager is
>> handling that fault]
>
> You may be able to map some of that into a very capable schedulability
> analyzer, one that can handle chains of "tasks" passing data/messages to each
> other. But translating the application logic and system behaviour into a model
> for such a schedulability analyzer is not trivial.
As my workload is dynamically defined (can grow or shrink, algorithmically), I don't really fret this. It's not like a closed system where what's inside *has* to work. If something isn't working, I can shed load (or move it to another processor). And, if something is underutilized, *add* load (possibly moved from another processor).

You typically do this intuitively on your workstation; if things start to run sluggishly, you kill off some applications (and make a mental note to run them, again, later). And, if the workstation isn't "doing anything", you *add* application processes. If you needed to "make world", you might power up another workstation so it didn't impede the activities of the "normal" workstation you're using.
Reply by Niklas Holsti August 7, 2021
On 2021-08-07 20:34, Don Y wrote:
> On 8/7/2021 9:50 AM, Niklas Holsti wrote:
>> On 2021-08-07 18:09, Don Y wrote:
>>> On 8/7/2021 4:09 AM, Niklas Holsti wrote:
>>
>>
>> (On how the Bound-T tool models instructions for stack and WCET
>> analysis:)
>>
>>>> The internal model includes not only the cost of the instruction,
>>>> but also its effect on the computation. For example, an instruction
>>>> like "increment the accumulator", INC A, would be decoded into a
>>>> model that says
>>>>
>>>>     WCET: 1 cycle.
>>>>     Control flow: to next instruction, unconditional.
>>>>     Effect: A := A + 1.
>>>>
>>>> and also the effect, if any, on condition flags.
>>>
>>> Why do you care about the "effect"?  Are you trying to determine
>>> which branches will be taken?  How many iterations through loops?
>>> etc.
>>
>> The effect is necessary for any analysis that depends on run-time
>> values. Loop bounds are the prime example for WCET analysis, but
>> stack-usage analysis also needs it. For example, the effect of a push
>> instruction is to increment the "local stack height" model variable by
>> some number that reflects the size of the pushed data (which size can,
>> in principle, be non-static and computed at run time).
>
> OK.  But, you're not trying to emulate/simulate the complete algorithm;
> just handle the side effects of value changes.
>
> But wouldn't you *have* to do a more thorough emulation?
>
> foo(count) {
>    for (i = 0; i < 5 * count; i++) {
>       diddle()
>    }
> }
The point and aim of _static_ analysis is to abstract the computation to model only the aspects that are relevant for the goals of the analysis, and especially to avoid any step-by-step emulation or interpretation -- that would become "dynamic analysis".

For that example, analysing the subprogram foo stand-alone, the WCET analysis in Bound-T would conclude that, in the loop:

- The variable "i" is an induction variable that increases by one on each iteration. This looks promising, and means that Bound-T can model the value of "i" as the initial value (zero) plus the loop iteration number (starting the count at iteration number zero).

- But the analysis will not find a numeric upper bound on the iteration number, because the value of "count" is unknown.

If this analysis of foo occurs while analysing some higher-level subprogram that calls foo, Bound-T would next try to compute bounds for the actual value of "count" in each such call. Say that the call is simply

   foo (33);

This would make Bound-T reanalyze foo in the context count=33, which would show that the loop can be repeated only under the condition

   (iteration number) < 5 * 33

equivalent to (iteration number) < 165, which bounds the number of loop iterations at 165 and lets Bound-T produce a WCET bound for this specific call of foo (assuming that Bound-T can produce a WCET bound for the "diddle" subprogram too).

Such repeated analyses down a call-path with increasing amount of context will of course increase the analysis time.
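(A toy sketch of that context-sensitive step -- not Bound-T's actual implementation: once "count" is known, the iteration bound falls out of the loop condition.)

def loop_bound(count):
    """Iterations of 'for (i = 0; i < 5 * count; i++)' in foo."""
    # i is an induction variable: i = iteration number, starting at 0,
    # step 1.  The body runs for every i with i < 5 * count,
    # i.e. i = 0 .. 5*count - 1.
    return max(5 * count, 0)

# Stand-alone analysis of foo: 'count' unknown, so no numeric bound.
# Re-analysis in the context of the call 'foo(33)':
print(loop_bound(33))   # -> 165, so WCET(foo) <= loop overhead + 165 * WCET(diddle)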
Reply by Niklas Holsti August 7, 2021
On 2021-08-07 20:26, Don Y wrote:
> On 8/7/2021 4:20 AM, Niklas Holsti wrote:
(In a discussion about WCET analysis more than stack analysis:)
>> Ok. If you describe your needs, perhaps I can comment.
>
> I'm targeting some of the bigger ARM offerings -- A53/55. My
> impression is they go through more "contortions" to eke out
> additional performance than some of the smaller processors.
Out of scope for Bound-T, I'm sure. Even the AbsInt tool has no support for static WCET analysis of those processors and their ARM page suggests a hybrid tool instead (https://www.absint.com/ait/arm.htm).
>> In my experience, access to memory-mapped registers tends to be not
>> much slower than access to memory. Also, quite often the addresses in
>> MMIO accesses are static or easy to derive in static analysis, so the
>> WCET tool could choose the proper access time, if the tool knows
>> enough about the system architecture.
>
> Yes, that last phrase being the kicker. Can this be simplified to
> something as crude as a "memory map"?
It seems so, if the access time is simply a function of the accessed address.
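(For illustration, such a "memory map" could be as crude as a table of address ranges with an access time for each, plus a pessimistic default for unclassified accesses. The ranges and cycle counts below are invented, not taken from any real tool or part.)

MEMORY_MAP = [                        # (first_address, last_address, cycles)
    (0x0000_0000, 0x0001_FFFF, 1),    # tightly-coupled SRAM
    (0x2000_0000, 0x2FFF_FFFF, 8),    # external DRAM
    (0x4000_0000, 0x4000_FFFF, 4),    # memory-mapped peripheral registers
]
DEFAULT_CYCLES = 20                   # pessimistic bound for unknown addresses

def access_cycles(addr):
    """Access time the analysis would assume for a statically known address."""
    for lo, hi, cycles in MEMORY_MAP:
        if lo <= addr <= hi:
            return cycles
    return DEFAULT_CYCLES

print(access_cycles(0x4000_0010))     # -> 4 (a peripheral-register access)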
>> What is VMM?
>
> Virtual Memory Management. I.e., when an opcode fetch (or argument
> reference) can not only take longer than a cache miss... but
> *considerably* longer as the physical memory is mapped in while the
> instruction stream "stalls".
I don't recall seeing any static WCET analysis for page faults, but there may be some for Translation Look-aside Buffer misses. Out of my competence, and out of scope for Bound-T, certainly.
> [Note that a page fault need not map physical memory in the
> traditional sense. It can also cause some "special" function to be
> invoked to provide the requisite data/access. So, the cost of a
> fault can vary depending on *what* is faulting and which pager is
> handling that fault]
You may be able to map some of that into a very capable schedulability analyzer, one that can handle chains of "tasks" passing data/messages to each other. But translating the application logic and system behaviour into a model for such a schedulability analyzer is not trivial.
Reply by Don Y August 7, 2021
On 8/7/2021 9:50 AM, Niklas Holsti wrote:
> On 2021-08-07 18:09, Don Y wrote:
>> On 8/7/2021 4:09 AM, Niklas Holsti wrote:
>
>
> (On how the Bound-T tool models instructions for stack and WCET analysis:)
>
>>> The internal model includes not only the cost of the instruction, but also
>>> its effect on the computation. For example, an instruction like "increment
>>> the accumulator", INC A, would be decoded into a model that says
>>>
>>>    WCET: 1 cycle.
>>>    Control flow: to next instruction, unconditional.
>>>    Effect: A := A + 1.
>>>
>>> and also the effect, if any, on condition flags.
>>
>> Why do you care about the "effect"? Are you trying to determine
>> which branches will be taken? How many iterations through loops?
>> etc.
>
> The effect is necessary for any analysis that depends on run-time values. Loop
> bounds are the prime example for WCET analysis, but stack-usage analysis also
> needs it. For example, the effect of a push instruction is to increment the
> "local stack height" model variable by some number that reflects the size of
> the pushed data (which size can, in principle, be non-static and computed at
> run time).
OK. But, you're not trying to emulate/simulate the complete algorithm; just handle the side effects of value changes.

But wouldn't you *have* to do a more thorough emulation?

foo(count) {
   for (i = 0; i < 5 * count; i++) {
      diddle()
   }
}
> The stack-usage analysis finds the maximum possible value of the local stack
> height in each subprogram, both globally and at each call site within the
> subprogram, then adds up the local stack heights along each possible call path
> to find the total usage for that call path, and then finds and reports the path
> with the largest total usage.
>
>> Doesn't this require a fairly complete model of the processor?
>
> Yes indeed. The model could be somewhat simpler for stack-usage analysis than
> for WCET analysis. For stack usage, the relevant effect of an instruction is
> just to increase or decrease the local stack height by a static constant. If
> push/pop happens in loops they are usually balanced so that a loop iteration
> has no net effect on stack usage, making loop iteration bounds unnecessary.
reverse(count, string) {
   while (count-- > 0) {
      reverse(count-1, &string[1])
      emit(string[0])
   }
}

No, that's a shitty example. I'll have to think on it when I have more time...
>> My "cost constrained" days were back in the days of the "small/cheap >> CPU" where you had to predict performance accurately -- but, doing so >> was relatively easy (because processors had very predictable >> instruction execution times). > > And that is the kind of processor that Bound-T was designed to support for WCET > analysis. Stack-usage analysis is not so sensitive.
Ah, OK. Time to get my neighbor's lunch ready...
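(To make the stack-usage bookkeeping quoted above concrete, here is a minimal sketch with an invented, recursion-free call graph; it is not Bound-T's implementation. Each subprogram has a maximal local stack height and a stack height at each call site; the worst case is the largest total over all call paths.)

local = {"main": 32, "handler": 48, "log": 24, "fmt": 40}   # max local height (bytes)
calls = {                       # (callee, local stack height at the call site)
    "main":    [("handler", 24), ("log", 16)],
    "handler": [("fmt", 40), ("log", 32)],
    "log":     [("fmt", 16)],
    "fmt":     [],
}

def worst_usage(f):
    """Largest total stack usage over all call paths rooted at f."""
    via_calls = [height_at_call + worst_usage(callee)
                 for callee, height_at_call in calls[f]]
    return max([local[f]] + via_calls)

print(worst_usage("main"))      # -> 112 (main -> handler -> log -> fmt: 24+32+16+40)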
Reply by Don Y August 7, 2021
On 8/7/2021 4:20 AM, Niklas Holsti wrote:
> On 2021-08-07 2:06, Don Y wrote:
>> On 8/6/2021 4:04 PM, Don Y wrote:
>>>> That said, for some processors it is easy to recognize at decode-time most
>>>> of the instructions that access the stack, and some versions of Bound-T let
>>>> one specify different access times for stack accesses and for general
>>>> (unclassified) accesses. That can be useful if the stack is located in fast
>>>> memory, but other data are in slower memory.
>>>
>>> I'm thinking, specifically, about I/Os -- which are increasingly memory
>>> mapped (including register spaces).
>>
>> Sorry, I should be more clear. I'm looking at the issues that would affect
>> *my* needs in *my* environment -- realizing these may be different than
>> those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
>
>
> Ok. If you describe your needs, perhaps I can comment.
I'm targeting some of the bigger ARM offerings -- A53/55. My impression is they go through more "contortions" to eke out additional performance than some of the smaller processors.
> In my experience, access to memory-mapped registers tends to be not much slower
> than access to memory. Also, quite often the addresses in MMIO accesses are
> static or easy to derive in static analysis, so the WCET tool could choose the
> proper access time, if the tool knows enough about the system architecture.
Yes, that last phrase being the kicker. Can this be simplified to something as crude as a "memory map"?
> If by "FPU emulation" you mean SW-implemented FP instructions, of course we > have encountered those. They are often hand-coded assembler with very complex > and tricky control flow, which makes it hard to find loop bounds automatically, > so manual annotations must be used instead. > > What is VMM?
Virtual Memory Management. I.e., when an opcode fetch (or argument reference) can not only take longer than a cache miss... but *considerably* longer as the physical memory is mapped in while the instruction stream "stalls".

[Note that a page fault need not map physical memory in the traditional sense. It can also cause some "special" function to be invoked to provide the requisite data/access. So, the cost of a fault can vary depending on *what* is faulting and which pager is handling that fault]