Stack analysis tools that really work?

Started by pozz July 29, 2021
On 8/5/2021 3:43 AM, Niklas Holsti wrote:
> On 2021-08-05 1:46, Don Y wrote:
>>> The source code is downloadable; see http://bound-t.com/download/src. The >>> copyright text is of my own writing, but I'm open to switching to some >>> better-known copyright such as GPL or even some non-viral version. >> >> Ah, OK. I may have a look at it to see how well it works and >> how much work to port to ARM objects. Given that you've not >> done so, already (despite your familiarity with the codebase), >> I suspect that's not a trivial task? > > There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of > course that architecture is out of date. I started a port to ARM Cortex-M, but > did not (yet) finish it for the reasons described on the status page. > > Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
> Porting to a new target processor means writing procedures to decode the
> machine instructions and translate them to the internal representation used by
> the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all caches cold -- as you are looking for WORST case timing? OTOH, a loop known to fit in a cache line *should* expect that speedup... :< )
> Sometimes it is also necessary to modify or extend the tool parts
> that read the executable files and especially the debugging information -- some
> of that may be compiler-specific, unfortunately. It is a fair amount of work
> and requires a good understanding both of the target processor and of the
> Bound-T internal representation.
The latter will likely be the bigger effort. It requires trying to infer your mindset when you began the design. [I now include "rationales" in my documentation for exactly this purpose; to let those "following" understand why particular choices were made instead of choices that may (in hindsight) appear better]
> And Ada, of course, but every programmer
> should know Ada, IMO :-)
>
>>> However, you may want to read about the state of the tool at
>>> http://bound-t.com/status.html.
>>
>> Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)
>
> The problems described on the status page are more relevant to WCET analysis
> than to stack analysis, but can affect stack analysis too, in some cases.
OK. I'll try to make some time to look through it. Sadly, a neighbor was admitted to hospital with congestive heart failure, last night. So, I've some added responsibilities cutting into my time (meal prep, chores around his house, etc.) until he's got a firmer prognosis... [I don't *need* to sleep, do I? :< ]
On 2021-08-05 16:16, Don Y wrote:
> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>> On 2021-08-05 1:46, Don Y wrote:
>
>>>> The source code is downloadable; see
>>>> http://bound-t.com/download/src. The copyright text is of my own
>>>> writing, but I'm open to switching to some better-known copyright
>>>> such as GPL or even some non-viral version.
>>>
>>> Ah, OK. I may have a look at it to see how well it works and
>>> how much work to port to ARM objects. Given that you've not
>>> done so, already (despite your familiarity with the codebase),
>>> I suspect that's not a trivial task?
>>
>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but
>> of course that architecture is out of date. I started a port to ARM
>> Cortex-M, but did not (yet) finish it for the reasons described on the
>> status page.
>>
>> Note that the tool is implemented in Ada.
>
> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
> will take much to "refresh" those memory cells. Esp as this likely doesn't
> have synchronization issues, RT constraints, etc.
Right. The only use of tasking features is to set an upper limit on the execution time of the analysis, and that feature is easy to remove if needed.

The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.
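In Ada that kind of limit is essentially one construct: asynchronous transfer of control. A minimal compilable sketch (invented names and limit, not the actual Bound-T code):

   with Ada.Text_IO;

   procedure Bounded_Analysis is

      Time_Limit : constant Duration := 60.0;   --  assumed limit, in seconds

      procedure Run_Analysis is
      begin
         delay 120.0;   --  stand-in for the real analysis work
      end Run_Analysis;

   begin
      select
         delay Time_Limit;
         Ada.Text_IO.Put_Line ("Analysis abandoned: time limit exceeded.");
      then abort
         Run_Analysis;
         Ada.Text_IO.Put_Line ("Analysis completed within the limit.");
      end select;
   end Bounded_Analysis;

The "then abort" part is abandoned as soon as the delay expires, which is also why removing the feature for a tasking-free build is trivial.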
>> Porting to a new target processor means writing procedures to decode
>> the machine instructions and translate them to the internal
>> representation used by the analysis.
>
> That shouldn't be hard -- except for the "timing uncertainties" in the
> processor. (Did you rely on some blanket generalization -- like all
> caches cold -- as you are looking for WORST case timing? OTOH, a
> loop known to fit in a cache line *should* expect that speedup... :< )
At present the tool assumes that every instruction and control transfer can be assigned its own WCET, in machine cycles, independent of other context or data values. The only dynamic aspects of instruction execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for SPARC, the concurrent execution of the Integer Unit and the Floating Point Unit is modelled to estimate where and for how long the IU has to wait for the FPU to complete an operation.

Good algorithms for computing WCET bounds for most kinds of instruction caches are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.

Data caches are still a hard problem, where the WCET analyses tend to be quite pessimistic. Moreover, the increasing use of eviction algorithms other than LRU (for example, pseudo-random eviction) lessens the precision of the cache analyses, even for instruction caches.
>> Sometimes it is also necessary to modify or extend the tool parts that
>> read the executable files and especially the debugging information --
>> some of that may be compiler-specific, unfortunately. It is a fair
>> amount of work and requires a good understanding both of the target
>> processor and of the Bound-T internal representation.
>
> The latter will likely be the bigger effort. It requires trying to infer
> your mindset when you began the design.
You could start by reading http://bound-t.com/whats-inside.html.
> [I now include "rationales" in my documentation for exactly this purpose;
> to let those "following" understand why particular choices were made
> instead of choices that may (in hindsight) appear better]
Me too. Ceterum censeo that calling the in-code documentation "comments" was a very poor choice of term and very misleading for what such documentation should contain and its importance.
On 8/5/2021 8:39 AM, Niklas Holsti wrote:
> On 2021-08-05 16:16, Don Y wrote:
>> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>>> On 2021-08-05 1:46, Don Y wrote:
>>
>>>>> The source code is downloadable; see http://bound-t.com/download/src. The
>>>>> copyright text is of my own writing, but I'm open to switching to some
>>>>> better-known copyright such as GPL or even some non-viral version.
>>>>
>>>> Ah, OK. I may have a look at it to see how well it works and
>>>> how much work to port to ARM objects. Given that you've not
>>>> done so, already (despite your familiarity with the codebase),
>>>> I suspect that's not a trivial task?
>>>
>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of
>>> course that architecture is out of date. I started a port to ARM Cortex-M,
>>> but did not (yet) finish it for the reasons described on the status page.
>>>
>>> Note that the tool is implemented in Ada.
>>
>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>> have synchronization issues, RT constraints, etc.
>
> Right. The only use of tasking features is to set an upper limit on the
> execution time of the analysis, and that feature is easy to remove if needed.
>
> The code mostly relies on the 1995 Ada standard, with some use of Ada 2005
> additions.
OK. I'll have to build a more current version of GNAT.
>>> Porting to a new target processor means writing procedures to decode the
>>> machine instructions and translate them to the internal representation used
>>> by the analysis.
>>
>> That shouldn't be hard -- except for the "timing uncertainties" in the
>> processor. (Did you rely on some blanket generalization -- like all
>> caches cold -- as you are looking for WORST case timing? OTOH, a
>> loop known to fit in a cache line *should* expect that speedup... :< )
>
> At present the tool assumes that every instruction and control transfer can be
> assigned its own WCET, in machine cycles, independent of other context or data
So, you just sum worst case times? You don't, for example, alter times based on conditionals? I assume this is semi table-driven (?) -- as a preface to inquiring as to how you develop (and verify!) a test suite?
> values. The only dynamic aspects of instruction execution time that are
> modelled are the possible pipeline stalls due to read-after-write interlocks.
> As a special case, in the version for SPARC, the concurrent execution of the
> Integer Unit and the Floating Point Unit is modelled to estimate where and for
> how long the IU has to wait for the FPU to complete an operation.
>
> Good algorithms for computing WCET bounds for most kinds of instruction caches
> are known, but I did not get around to implementing any of those, so only
> cache-less processors are supported now. If you need WCET analysis of caches,
> the AbsInt tool is best.
It's arguable whether (and to what extent) you should model I-cache effects. E.g., in a multithreaded environment, the cache can be purged at some frequency that a tool (or the developer, for that matter -- at least not wrt the instruction stream) can't know. Aim high and be pleasantly surprised...
> Data caches are still a hard problem, where the WCET analyses tend to be quite
> pessimistic. Moreover, the increasing use of eviction algorithms other than LRU
> (for example, pseudo-random eviction) lessens the precision of the cache
> analyses, even for instruction caches.
As would handling page faults or other NUMA issues. [How do you handle accesses to memories with different *basic* access times?]
>>> Sometimes it is also necessary to modify or extend the tool parts that read
>>> the executable files and especially the debugging information -- some of
>>> that may be compiler-specific, unfortunately. It is a fair amount of work
>>> and requires a good understanding both of the target processor and of the
>>> Bound-T internal representation.
>>
>> The latter will likely be the bigger effort. It requires trying to infer
>> your mindset when you began the design.
>
> You could start by reading http://bound-t.com/whats-inside.html.
>
>> [I now include "rationales" in my documentation for exactly this purpose;
>> to let those "following" understand why particular choices were made
>> instead of choices that may (in hindsight) appear better]
>
> Me too. Ceterum censeo that calling the in-code documentation "comments" was a
> very poor choice of term and very misleading for what such documentation should
> contain and its importance.
A "rationale" lets you talk to the next developer(s) informally. You can point out areas for likely optimization, areas fraught with dragons, etc. Otherwise, relying on leafing through hundreds of pages of code in the hope of stumbling on some "XXX ..." annotation is wasteful. And, *not* being tied to a "text only" presentation means you can explain things in ways that would be difficult to do (unambiguously) with prose. [How do I audibly explain the difference between front vowels and back vowels to a non-native speaker/listener?]
On 2021-08-06 11:09, Don Y wrote:
> On 8/5/2021 8:39 AM, Niklas Holsti wrote:
>> On 2021-08-05 16:16, Don Y wrote:
>>> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>>>> On 2021-08-05 1:46, Don Y wrote:
>>>
>>>>>> The source code is downloadable; see
>>>>>> http://bound-t.com/download/src. The copyright text is of my own
>>>>>> writing, but I'm open to switching to some better-known copyright
>>>>>> such as GPL or even some non-viral version.
>>>>>
>>>>> Ah, OK. I may have a look at it to see how well it works and
>>>>> how much work to port to ARM objects. Given that you've not
>>>>> done so, already (despite your familiarity with the codebase),
>>>>> I suspect that's not a trivial task?
>>>>
>>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex),
>>>> but of course that architecture is out of date. I started a port to
>>>> ARM Cortex-M, but did not (yet) finish it for the reasons described
>>>> on the status page.
>>>>
>>>> Note that the tool is implemented in Ada.
>>>
>>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>>> have synchronization issues, RT constraints, etc.
>>
>> Right. The only use of tasking features is to set an upper limit on
>> the execution time of the analysis, and that feature is easy to remove
>> if needed.
>>
>> The code mostly relies on the 1995 Ada standard, with some use of Ada
>> 2005 additions.
>
> OK. I'll have to build a more current version of GNAT.
For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.
>>>> Porting to a new target processor means writing procedures to decode
>>>> the machine instructions and translate them to the internal
>>>> representation used by the analysis.
>>>
>>> That shouldn't be hard -- except for the "timing uncertainties" in the
>>> processor. (Did you rely on some blanket generalization -- like all
>>> caches cold -- as you are looking for WORST case timing? OTOH, a
>>> loop known to fit in a cache line *should* expect that speedup... :< )
>>
>> At present the tool assumes that every instruction and control
>> transfer can be assigned its own WCET, in machine cycles, independent
>> of other context or data
>
> So, you just sum worst case times? You don't, for example, alter
> times based on conditionals?
(Note: I say "WCET" below, but really I should always say "upper bound on the WCET".)

The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The WCET for an execution path through the graph (from entry to return) is the sum of the WCETs of the basic blocks in the path, plus the WCETs assigned to each edge (control transfer) between basic blocks in the path. The WCET for the whole graph is computed by a pessimisation over all syntactically possible paths through the graph, using an Integer Linear Programming solver (lp_solve, for Bound-T). This method is common in WCET analysis and is called IPET, for Implicit Path Enumeration Technique.

By "syntactically possible" I mean that the tool generally does not attempt to find semantically or logically infeasible paths. For example, if a subprogram has this form:

   if boo then
      perform computation A;
   else
      perform computation B;
   end if;
   perform computation C;

then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) + WCET(C). But if the subprogram has this form:

   if boo then
      perform computation A;
   end if;
   perform computation C;
   if not boo then
      perform computation B;
   end if;

then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B). In other words, the analysis (usually) does not discover that the conditions "boo" and "not boo" are mutually exclusive.
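To make the ILP concrete: for the first (if/else) example above, the lp_solve input could look roughly like this, with invented cycle counts (10 for A, 12 for B, 5 for C); each x variable counts how many times the corresponding block executes:

   /* IPET sketch, hypothetical cycle counts. The objective is the
      total cycle count; its maximum is the WCET bound. */
   max: 10 xA + 12 xB + 5 xC;

   /* Structural flow constraints: the subprogram is entered once,
      so exactly one arm of the if executes, and C always executes. */
   xA + xB = 1;
   xC = 1;

   int xA, xB, xC;

lp_solve then reports 17 = max(10,12) + 5, the pessimisation over both arms.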
> I assume this is semi table-driven (?)
I don't understand that question. Please clarify.
> -- as a preface to inquiring as to
> how you develop (and verify!) a test suite?
In the same way as for any complex program. Final validation by running and measuring test code on a real processor or a cycle-accurate simulation.
>> Good algorithms for computing WCET bounds for most kinds of
>> instruction caches are known, but I did not get around to implementing
>> any of those, so only cache-less processors are supported now. If you
>> need WCET analysis of caches, the AbsInt tool is best.
>
> It's arguable whether (and to what extent) you should model I-cache
> effects.
> E.g., in a multithreaded environment, the cache can be purged at some
> frequency that a tool (or the developer, for that matter -- at least not
> wrt the instruction stream) can't know. Aim high and be pleasantly
> surprised...
Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET analysis. The most common solution is to first find out how the various tasks and interrupts can access memory blocks which potentially may collide in the caches, and then to take the worst possible interference into account in the schedulability analysis, which does know (or assumes) the worst case frequency of preemptions.

For processors where cache misses are much slower than cache hits (which is fast coming to mean almost all processors) IMO an I-cache analysis is necessary for static WCET analysis to be useful.
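For reference, the textbook way to fold CRPD into fixed-priority response-time analysis (this is the schedulability side, not something Bound-T computes) is to charge each preemption of task i by a higher-priority task j an extra cache-reload cost gamma_ij:

   R_i = C_i + B_i + sum over j in hp(i) of ceiling(R_i / T_j) * (C_j + gamma_ij)

where C is the task WCET, B the blocking time, T the period, and hp(i) the set of tasks with priority above i; the recurrence is iterated from R_i = C_i to a fixed point.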
> [How do you handle accesses to memories with different *basic* access
> times?]
For code fetch the address is always statically known, so the correct access time can be selected and included in the WCET for the instruction when the instruction is decoded.

For data accesses, if different instructions are used for different memories (as in the 8051 architecture, for example), the proper access time can be included in the WCET for the instruction when the instruction is decoded.

If the same instructions are used to access memories with different access times, depending on the address, and the memory addresses are dynamically determined at run time, the practical but pessimistic way is to use the largest of the access times, that is, assume that the slowest memory is accessed. In principle the tool can try to analyze the address computations and might find sufficient bounds on the address to ensure that a faster memory is accessed, but the analysis methods currently available in Bound-T seldom manage that, and can be quite slow.

That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.
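As a compilable sketch of what "selected when the instruction is decoded" can mean (invented address map and cycle counts, not actual Bound-T code):

   with Ada.Text_IO;

   procedure Fetch_Demo is

      type Code_Address_T is mod 2**32;

      Flash_First : constant Code_Address_T := 16#0800_0000#;
      Flash_Last  : constant Code_Address_T := 16#080F_FFFF#;

      --  The decoder calls something like this with the statically
      --  known fetch address of each instruction:

      function Fetch_Cycles (Address : Code_Address_T) return Positive is
      begin
         if Address in Flash_First .. Flash_Last then
            return 3;   --  assumed wait-states for flash
         else
            return 1;   --  assumed single-cycle on-chip RAM
         end if;
      end Fetch_Cycles;

   begin
      Ada.Text_IO.Put_Line
        ("Flash fetch costs" & Positive'Image (Fetch_Cycles (16#0800_0100#)));
   end Fetch_Demo;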
On 8/6/2021 12:58 PM, Niklas Holsti wrote:

>>>>>>> The source code is downloadable; see http://bound-t.com/download/src.
>>>>>>> The copyright text is of my own writing, but I'm open to switching to
>>>>>>> some better-known copyright such as GPL or even some non-viral version.
>>>>>>
>>>>>> Ah, OK. I may have a look at it to see how well it works and
>>>>>> how much work to port to ARM objects. Given that you've not
>>>>>> done so, already (despite your familiarity with the codebase),
>>>>>> I suspect that's not a trivial task?
>>>>>
>>>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of
>>>>> course that architecture is out of date. I started a port to ARM Cortex-M,
>>>>> but did not (yet) finish it for the reasons described on the status page.
>>>>>
>>>>> Note that the tool is implemented in Ada.
>>>>
>>>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>>>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>>>> have synchronization issues, RT constraints, etc.
>>>
>>> Right. The only use of tasking features is to set an upper limit on the
>>> execution time of the analysis, and that feature is easy to remove if needed.
>>>
>>> The code mostly relies on the 1995 Ada standard, with some use of Ada 2005
>>> additions.
>>
>> OK. I'll have to build a more current version of GNAT.
>
> For playing around, I would just use the GNAT Community Edition. Or the FSF
> GNAT that comes with MinGW32.
Ugh! No, I'll just build a fresh copy. I do all my development under NetBSD so have everything I want/need ('cept the Ada tools) there, already.
>>>>> Porting to a new target processor means writing procedures to decode the
>>>>> machine instructions and translate them to the internal representation
>>>>> used by the analysis.
>>>>
>>>> That shouldn't be hard -- except for the "timing uncertainties" in the
>>>> processor. (Did you rely on some blanket generalization -- like all
>>>> caches cold -- as you are looking for WORST case timing? OTOH, a
>>>> loop known to fit in a cache line *should* expect that speedup... :< )
>>>
>>> At present the tool assumes that every instruction and control transfer can
>>> be assigned its own WCET, in machine cycles, independent of other context or
>>> data
>>
>> So, you just sum worst case times? You don't, for example, alter
>> times based on conditionals?
>
> (Note: I say "WCET" below, but really I should always say "upper bound on the
> WCET".)
Understood.
> The analysis works on the control-flow graph (per subprogram). The WCET for
> each basic block is the sum of the WCETs of the instructions in that block. The
> WCET for an execution path through the graph (from entry to return) is the sum
> of the WCETs of the basic blocks in the path, plus the WCETs assigned to each
> edge (control transfer) between basic blocks in the path.
So, the costs of conditional control transfers are handled separately (?)
> The WCET for the
> whole graph is computed by a pessimisation over all syntactically possible
> paths through the graph, using an Integer Linear Programming solver (lp_solve,
> for Bound-T). This method is common in WCET analysis and is called IPET, for
> Implicit Path Enumeration Technique.
>
> By "syntactically possible" I mean that the tool generally does not attempt to
> find semantically or logically infeasible paths. For example, if a subprogram
> has this form:
>
>    if boo then
>       perform computation A;
>    else
>       perform computation B;
>    end if;
>    perform computation C;
>
> then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) +
> WCET(C). But if the subprogram has this form:
>
>    if boo then
>       perform computation A;
>    end if;
>    perform computation C;
>    if not boo then
>       perform computation B;
>    end if;
>
> then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B).
> In other words, the analysis (usually) does not discover that the conditions
> "boo" and "not boo" are mutually exclusive.
OK. This is different from how a DSE-based tool would approach the problem. It would walk through the code and "solve" for each variable that affects control flow or execution cost. So, on one "probe", it would set boo to "TRUE" and evaluate ALL of the consequences of that choice. Then, would repeat the exercise with boo as FALSE. (hence, it is more costly -- though could be made less so if it wasn't actively probing for new cases along the way)
>> I assume this is semi table-driven (?)
>
> I don't understand that question. Please clarify.
A large number of opcodes, each with particular costs. Do you build a jungle of conditionals to subdivide the "opcode space" into groups of similar-cost operations? Or, do you just have a table of:

   {opcode, mask, type[1], cost}

[1] used to apply some further heuristics to your refinement of "cost"
>> -- as a preface to inquiring as to
>> how you develop (and verify!) a test suite?
>
> In the same way as for any complex program. Final validation by running and
> measuring test code on a real processor or a cycle-accurate simulation.
Ah, OK. So, you could "can" that particular exemplar and use it to test a modification to the code (without having to run the app on "real hardware", again).

I don't rely on "live" tests for my code. Rather, I use tools to generate good test coverage and then verify the results are what I expect. I find this easier and more easily extensible (I can test ARM code by running an x86 port of that code).
>>> Good algorithms for computing WCET bounds for most kinds of instruction
>>> caches are known, but I did not get around to implementing any of those, so
>>> only cache-less processors are supported now. If you need WCET analysis of
>>> caches, the AbsInt tool is best.
>>
>> It's arguable whether (and to what extent) you should model I-cache effects.
>> E.g., in a multithreaded environment, the cache can be purged at some
>> frequency that a tool (or the developer, for that matter -- at least not
>> wrt the instruction stream) can't know. Aim high and be pleasantly
>> surprised...
>
> Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET
> analysis. The most common solution is to first find out how the various tasks
> and interrupts can access memory blocks which potentially may collide in the
> caches, and then to take the worst possible interference into account in the
> schedulability analysis, which does know (or assumes) the worst case frequency
> of preemptions.
>
> For processors where cache misses are much slower than cache hits (which is
> fast coming to mean almost all processors) IMO an I-cache analysis is necessary
> for static WCET analysis to be useful.
I look at it the other way around. Assume there is NO cache. You know that your code WILL run in less time than this case. Regardless of the frequency or presence of competing events.

Processors are fast enough, now, that you can usually afford to "step up" in capability for comparably little cost. And, if you've minimized the number of things that *must* meet specific timing/performance goals -- and left everything else as a lower priority/importance secondary goal -- then cache just increases the performance of those "less important" goals, pleasantly.
>> [How do you handle accesses to memories with different *basic* access times?]
>
> For code fetch the address is always statically known, so the correct access
> time can be selected and included in the WCET for the instruction when the
> instruction is decoded.
So, you know the access characteristics of each block of memory in which code may reside.
> For data accesses, if different instructions are used for different memories
> (as in the 8051 architecture, for example), the proper access time can be
> included in the WCET for the instruction when the instruction is decoded.
>
> If the same instructions are used to access memories with different access
> times, depending on the address, and the memory addresses are dynamically
> determined at run time, the practical but pessimistic way is to use the largest
> of the access times, that is, assume that the slowest memory is accessed. In
> principle the tool can try to analyze the address computations and might find
> sufficient bounds on the address to ensure that a faster memory is accessed,
> but the analysis methods currently available in Bound-T seldom manage that, and
> can be quite slow.
It would also require the user to be more specific about his execution environment.
> That said, for some processors it is easy to recognize at decode-time most of
> the instructions that access the stack, and some versions of Bound-T let one
> specify different access times for stack accesses and for general
> (unclassified) accesses. That can be useful if the stack is located in fast
> memory, but other data are in slower memory.
I'm thinking, specifically, about I/Os -- which are increasingly memory mapped (including register spaces).
On 8/6/2021 4:04 PM, Don Y wrote:
>> That said, for some processors it is easy to recognize at decode-time most of
>> the instructions that access the stack, and some versions of Bound-T let one
>> specify different access times for stack accesses and for general
>> (unclassified) accesses. That can be useful if the stack is located in fast
>> memory, but other data are in slower memory.
>
> I'm thinking, specifically, about I/Os -- which are increasingly memory
> mapped (including register spaces).
Sorry, I should be more clear. I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti
<niklas.holsti@tidorum.invalid> wrote:

> :
>At present the tool assumes that every instruction and control transfer
>can be assigned its own WCET, in machine cycles, independent of other
>context or data values. The only dynamic aspects of instruction
>execution time that are modelled are the possible pipeline stalls due to
>read-after-write interlocks. As a special case, in the version for
>SPARC, the concurrent execution of the Integer Unit and the Floating
>Point Unit is modelled to estimate where and for how long the IU has to
>wait for the FPU to complete an operation.
>
>Good algorithms for computing WCET bounds for most kinds of instruction
>caches are known, but I did not get around to implementing any of those,
>so only cache-less processors are supported now. If you need WCET
>analysis of caches, the AbsInt tool is best.
>
>Data caches are still a hard problem, where the WCET analyses tend to be
>quite pessimistic. Moreover, the increasing use of eviction algorithms
>other than LRU (for example, pseudo-random eviction) lessens the
>precision of the cache analyses, even for instruction caches.
> :
How do you handle branch/jump mispredicts?

More importantly, how do you handle chains of conditional branches and/or switch/case constructs which can mispredict at every decision point? The branch targets may not be in cache - neither mispredicted targets nor the actual one. Worst case for a horribly mispredicted switch/case can be absolutely dreadful.

Similar problem for table driven code: unless the jump is always made through the /same/ table element, then to a 1st approximation the jump will mispredict /every/ time.

Late binding OO code is a problem too, though not to the same degree: a lot of code really is monomorphic, with a given call usually or always targeting the same class and method. But real polymorphic code (like table driven) suffers greatly when dealing with heterogeneous objects.

George
On 2021-08-07 5:55, George Neuner wrote:
> On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti
> <niklas.holsti@tidorum.invalid> wrote:
>
>> :
>> At present the tool assumes that every instruction and control transfer
>> can be assigned its own WCET, in machine cycles, independent of other
>> context or data values. The only dynamic aspects of instruction
>> execution time that are modelled are the possible pipeline stalls due to
>> read-after-write interlocks.
...
> How do you handle branch/jump mispredicts?
See above re the assumptions of the tool. The WCET analysis function of the Bound-T tool is (at present) suited only for simple, deterministic processors. The tool development started in the late 1990s when the company I then worked for developed applications mainly for such processors.

If the processor uses a static branch prediction scheme (for example, predict forward branches as not taken and backward branches as taken), the mispredict penalty can be included in the WCET assigned to the non-predicted edge in the control-flow graph at instruction decode-time.

For dynamic (history-dependent) branch predictors, various models for static analysis have been proposed and published, but they have the same problems as cache-analysis models: the HW microarchitectures are becoming more and more complex and varied, making the modelling effort more expensive, both in implementation effort and in analysis time, and reducing the customer base for each specific model.

The WCET-analysis tools from AbsInt include detailed signal-by-signal simulations of the microarchitectures. Unfortunately such details are seldom documented well, and I believe AbsInt has had to enter special agreements with processor suppliers for access to such information. And the details can change with each processor revision...

Speculative executions such as branch prediction can create so-called "timing anomalies" which make the WCET analysis much harder. A timing anomaly is any property of a processor that makes it invalid to estimate the WCET of a sequence of instructions by assuming that the first instruction encounters its worst case (for example, a cache miss instead of a cache hit) and then continuing the analysis of the rest of the sequence while propagating that assumption forward as an assumption on the state of the processor.

The canonical example of timing anomalies are processors where a cache hit at one instruction changes the processor state in such a way that a later instruction takes much longer than it would have taken, had the first instruction encountered a cache miss instead of a hit. Propagating the "first instruction missed" assumption to the context of the later instruction may make the analysis under-estimate the execution time of that later instruction. To handle timing anomalies, the analysis must consider all possible combinations: first instruction hits or misses, second instruction hits or misses, etc. Exponential complexity.

It amuses me that such processor properties, which are/were the bane of static WCET analysis, are now in the limelight as the origin of the side channels for various malware, such as Spectre exploits.
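As a sketch of the static-prediction case (invented names and penalty, not actual Bound-T code), the decode-time assignment of edge WCETs could look like this:

   procedure Edge_Demo is

      type Code_Address_T is mod 2**32;

      Mispredict_Penalty : constant Natural := 3;   --  assumed cycles

      --  Static scheme: backward branches predicted taken, forward
      --  branches predicted not taken; the penalty goes on whichever
      --  edge the predictor gets wrong.

      procedure Edge_Cycles
        (Branch_Addr    : in  Code_Address_T;
         Target_Addr    : in  Code_Address_T;
         Taken_Cycles   : out Natural;
         Falling_Cycles : out Natural)
      is
         Predicted_Taken : constant Boolean := Target_Addr < Branch_Addr;
      begin
         if Predicted_Taken then
            Taken_Cycles   := 0;
            Falling_Cycles := Mispredict_Penalty;
         else
            Taken_Cycles   := Mispredict_Penalty;
            Falling_Cycles := 0;
         end if;
      end Edge_Cycles;

      T, F : Natural;

   begin
      --  A backward branch: the taken edge costs 0 extra cycles, the
      --  fall-through (mispredicted) edge costs 3.
      Edge_Cycles (Branch_Addr  => 16#0100#, Target_Addr => 16#00C0#,
                   Taken_Cycles => T, Falling_Cycles => F);
   end Edge_Demo;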
> More importantly, how do you handle chains of conditional branches
> and/or switch/case constructs which can mispredict at every decision
> point? The branch targets may not be in cache - neither mispredicted
> targets nor the actual one. Worst case for a horribly mispredicted
> switch/case can be absolutely dreadful.
I don't know if those questions are addressed to me (re the Bound-T tool) or to WCET analysis in general. The advanced WCET tools, like the ones from AbsInt, do try to cover such cases by their cache analysis to provide a safe upper bound on the WCET, but the degree of pessimism (over-estimation) increases with the complexity of the code.

That said, in many programs most of the processing time is taken by relatively simple loops, for which the present I-cache analyses work quite well, and for which even D-cache analysis can (I believe) work satisfactorily if the memory addressing patterns are regular.
On 2021-08-07 2:04, Don Y wrote:
> On 8/6/2021 12:58 PM, Niklas Holsti wrote:
>>> OK. I'll have to build a more current version of GNAT.
>>
>> For playing around, I would just use the GNAT Community Edition. Or
>> the FSF GNAT that comes with MinGW32.
>
> Ugh! No, I'll just build a fresh copy. I do all my development
> under NetBSD so have everything I want/need ('cept the Ada tools)
> there, already.
You are braver than I am. Note that the front-end of GNAT is implemented in Ada, so it has to be bootstrapped with an Ada compiler. And the current version of GNAT requires quite an up-to-date Ada compiler for the bootstrap. In practice, I think people have found that only GNAT can build GNAT.

Many Linux distributions come with GNAT pre-built. Debian-derived distributions have good support, I hear.
>> The analysis works on the control-flow graph (per subprogram). The
>> WCET for each basic block is the sum of the WCETs of the instructions
>> in that block. The WCET for an execution path through the graph (from
>> entry to return) is the sum of the WCETs of the basic blocks in the
>> path, plus the WCETs assigned to each edge (control transfer) between
>> basic blocks in the path.
>
> So, the costs of conditional control transfers are handled separately (?)
I'm not sure I understand your question. Each edge in the control-flow graph, that is, each transition from one instruction or one basic block to a possible next instruction or block, is assigned a WCET value when the instructions are decoded and entered in the control-flow graph.

Usually this WCET is zero for a fall-through edge from a non-branch instruction to the next, and is non-zero only for edges from branch instructions, whether conditional or unconditional. For a conditional branch, the "not taken" edge usually has a zero WCET and the "taken" edge has some non-zero WCET, as specified in the processor documentation (remember we are assuming a simple processor).
>>> I assume this is semi table-driven (?)
>>
>> I don't understand that question. Please clarify.
>
> A large number of opcodes, each with particular costs.
> Do you build a jungle of conditionals to subdivide the
> "opcode space" into groups of similar-cost operations?
> Or, do you just have a table of:
>
> {opcode, mask, type[1], cost}
>
> [1] used to apply some further heuristics to your
> refinement of "cost"
I would say that the approach to translating instructions into their analysis models (elements in the control-flow graph) is usually not table-driven, but emulates the field-by-field decoding and case analysis that would be done by a disassembler or emulator (and Bound-T can produce a disassembly of the code). The internal model includes not only the cost of the instruction, but also its effect on the computation. For example, an instruction like "increment the accumulator", INC A, would be decoded into a model that says

   WCET: 1 cycle.
   Control flow: to next instruction, unconditional.
   Effect: A := A + 1.

and also the effect, if any, on condition flags.
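In code, that decode step might look roughly like this (invented types and encoding, not the actual Bound-T internals):

   procedure Decode_Demo is

      type Register_T is (Reg_A, Reg_X);

      type Effect_T is record
         Target : Register_T;
         Addend : Integer;
      end record;

      type Flow_T is (To_Next, Taken_Branch);

      type Step_Model_T is record
         Cycles : Natural;    --  WCET of this instruction, in cycles
         Flow   : Flow_T;     --  where control can go next
         Effect : Effect_T;   --  effect on the computation
      end record;

      function Decode (Opcode : Natural) return Step_Model_T is
      begin
         case Opcode is
            when 16#3C# =>    --  assumed encoding of "INC A"
               return (Cycles => 1,
                       Flow   => To_Next,
                       Effect => (Target => Reg_A, Addend => 1));
            when others =>
               raise Program_Error;   --  instruction not modelled yet
         end case;
      end Decode;

      Model : constant Step_Model_T := Decode (16#3C#);

   begin
      pragma Assert (Model.Cycles = 1 and Model.Effect.Addend = 1);
      null;
   end Decode_Demo;

The point is that the effect (A := A + 1) is captured as data that the later analyses (loop bounds, stack height) can reason about, not just as a cycle count.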
>>> -- as a preface to inquiring as to
>>> how you develop (and verify!) a test suite?
>>
>> In the same way as for any complex program. Final validation by
>> running and measuring test code on a real processor or a
>> cycle-accurate simulation.
>
> Ah, OK. So, you could "can" that particular exemplar and use it
> to test a modification to the code (without having to run the app
> on "real hardware", again)
Yep.
> I don't rely on "live" tests for my code. Rather, I use tools to
> generate good test coverage and then verify the results are what I
> expect. I find this easier and more easily extensible (I can test
> ARM code by running an x86 port of that code)
At my former job, implementing SW for ESA satellite applications, we almost always tested the code first on normal PCs, in a simulated I/O environment, and then on the embedded target. Easier for Ada than for C.
>> For processors where cache misses are much slower than cache hits
>> (which is fast coming to mean almost all processors) IMO an I-cache
>> analysis is necessary for static WCET analysis to be useful.
>
> I look at it the other way around. Assume there is NO cache. You
> know that your code WILL run in less time than this case. Regardless
> of the frequency or presence of competing events.
>
> Processors are fast enough, now, that you can usually afford to "step up"
> in capability for comparably little cost.
Sometimes, but not always. The ESA applications we made were usually running (worst case) at about 70% processor load, and faster processors for space applications were very expensive, not only in euros but also in power, volume and mass, which are scarce resources in space.
On 2021-08-07 2:06, Don Y wrote:
> On 8/6/2021 4:04 PM, Don Y wrote:
>>> That said, for some processors it is easy to recognize at decode-time
>>> most of the instructions that access the stack, and some versions of
>>> Bound-T let one specify different access times for stack accesses and
>>> for general (unclassified) accesses. That can be useful if the stack
>>> is located in fast memory, but other data are in slower memory.
>>
>> I'm thinking, specifically, about I/Os -- which are increasingly memory
>> mapped (including register spaces).
>
> Sorry, I should be more clear. I'm looking at the issues that would affect
> *my* needs in *my* environment -- realizing these may be different than
> those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
Ok. If you describe your needs, perhaps I can comment.

In my experience, access to memory-mapped registers tends to be not much slower than access to memory. Also, quite often the addresses in MMIO accesses are static or easy to derive in static analysis, so the WCET tool could choose the proper access time, if the tool knows enough about the system architecture.

If by "FPU emulation" you mean SW-implemented FP instructions, of course we have encountered those. They are often hand-coded assembler with very complex and tricky control flow, which makes it hard to find loop bounds automatically, so manual annotations must be used instead.

What is VMM?