Stack analysis tools that really work?

Started by pozz July 29, 2021
On 8/5/2021 3:43 AM, Niklas Holsti wrote:
> On 2021-08-05 1:46, Don Y wrote:
>>> The source code is downloadable; see http://bound-t.com/download/src. The >>> copyright text is of my own writing, but I'm open to switching to some >>> better-known copyright such as GPL or even some non-viral version. >> >> Ah, OK. I may have a look at it to see how well it works and >> how much work to port to ARM objects. Given that you've not >> done so, already (despite your familiarity with the codebase), >> I suspect that's not a trivial task? > > There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of > course that architecture is out of date. I started a port to ARM Cortex-M, but > did not (yet) finish it for the reasons described on the status page. > > Note that the tool is implemented in Ada.
OK. It's been ~20 years since I wrote a line of Ada code but I doubt it will take much to "refresh" those memory cells. Esp as this likely doesn't have synchronization issues, RT constraints, etc.
> Porting to a new target processor means writing procedures to decode the
> machine instructions and translate them to the internal representation used by
> the analysis.
That shouldn't be hard -- except for the "timing uncertainties" in the processor. (Did you rely on some blanket generalization -- like all caches cold -- as you are looking for WORST case timing? OTOH, a loop known to fit in a cache line *should* expect that speedup... :< )
> Sometimes it is also necessary to modify or extend the tool parts
> that read the executable files and especially the debugging information -- some
> of that may be compiler-specific, unfortunately. It is a fair amount of work
> and requires a good understanding both of the target processor and of the
> Bound-T internal representation.
The latter will likely be the bigger effort. It requires trying to infer your mindset when you began the design. [I now include "rationales" in my documentation for exactly this purpose; to let those "following" understand why particular choices were made instead of choices that may (in hindsight) appear better]
> And Ada, of course, but every programmer
> should know Ada, IMO :-)
>
>>> However, you may want to read about the state of the tool at
>>> http://bound-t.com/status.html.
>>
>> Hmmm... that sounds ominous! :> (or, am I just too much of a Cynic?)
>
> The problems described on the status page are more relevant to WCET analysis
> than to stack analysis, but can affect stack analysis too, in some cases.
OK. I'll try to make some time to look through it. Sadly, a neighbor was admitted to hospital with congestive heart failure, last night. So, I've some added responsibilities cutting into my time (meal prep, chores around his house, etc.) until he's got a firmer prognosis... [I don't *need* to sleep, do I? :< ]
On 2021-08-05 16:16, Don Y wrote:
> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>> On 2021-08-05 1:46, Don Y wrote:
>
>>>> The source code is downloadable; see
>>>> http://bound-t.com/download/src. The copyright text is of my own
>>>> writing, but I'm open to switching to some better-known copyright
>>>> such as GPL or even some non-viral version.
>>>
>>> Ah, OK. I may have a look at it to see how well it works and
>>> how much work to port to ARM objects. Given that you've not
>>> done so, already (despite your familiarity with the codebase),
>>> I suspect that's not a trivial task?
>>
>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but
>> of course that architecture is out of date. I started a port to ARM
>> Cortex-M, but did not (yet) finish it for the reasons described on the
>> status page.
>>
>> Note that the tool is implemented in Ada.
>
> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
> will take much to "refresh" those memory cells. Esp as this likely doesn't
> have synchronization issues, RT constraints, etc.
Right. The only use of tasking features is to set an upper limit on the execution time of the analysis, and that feature is easy to remove if needed.

The code mostly relies on the 1995 Ada standard, with some use of Ada 2005 additions.
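In Ada that kind of limit is essentially one construct: asynchronous transfer of control. A minimal compilable sketch (invented names and limit, not the actual Bound-T code):

   with Ada.Text_IO;

   procedure Bounded_Analysis is

      Time_Limit : constant Duration := 60.0;   --  assumed limit, in seconds

      procedure Run_Analysis is
      begin
         delay 120.0;   --  stand-in for the real analysis work
      end Run_Analysis;

   begin
      select
         delay Time_Limit;
         Ada.Text_IO.Put_Line ("Analysis abandoned: time limit exceeded.");
      then abort
         Run_Analysis;
         Ada.Text_IO.Put_Line ("Analysis completed within the limit.");
      end select;
   end Bounded_Analysis;

The "then abort" part is abandoned as soon as the delay expires, which is also why removing the feature for a tasking-free build is trivial.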
>> Porting to a new target processor means writing procedures to decode
>> the machine instructions and translate them to the internal
>> representation used by the analysis.
>
> That shouldn't be hard -- except for the "timing uncertainties" in the
> processor. (Did you rely on some blanket generalization -- like all
> caches cold -- as you are looking for WORST case timing? OTOH, a
> loop known to fit in a cache line *should* expect that speedup... :< )
At present the tool assumes that every instruction and control transfer can be assigned its own WCET, in machine cycles, independent of other context or data values. The only dynamic aspects of instruction execution time that are modelled are the possible pipeline stalls due to read-after-write interlocks. As a special case, in the version for SPARC, the concurrent execution of the Integer Unit and the Floating Point Unit is modelled to estimate where and for how long the IU has to wait for the FPU to complete an operation.

Good algorithms for computing WCET bounds for most kinds of instruction caches are known, but I did not get around to implementing any of those, so only cache-less processors are supported now. If you need WCET analysis of caches, the AbsInt tool is best.

Data caches are still a hard problem, where the WCET analyses tend to be quite pessimistic. Moreover, the increasing use of eviction algorithms other than LRU (for example, pseudo-random eviction) lessens the precision of the cache analyses, even for instruction caches.
>> Sometimes it is also necessary to modify or extend the tool parts that
>> read the executable files and especially the debugging information --
>> some of that may be compiler-specific, unfortunately. It is a fair
>> amount of work and requires a good understanding both of the target
>> processor and of the Bound-T internal representation.
>
> The latter will likely be the bigger effort. It requires trying to infer
> your mindset when you began the design.
You could start by reading http://bound-t.com/whats-inside.html.
> [I now include "rationales" in my documentation for exactly this purpose;
> to let those "following" understand why particular choices were made
> instead of choices that may (in hindsight) appear better]
Me too. Ceterum censeo that calling the in-code documentation "comments" was a very poor choice of term and very misleading for what such documentation should contain and its importance.
On 8/5/2021 8:39 AM, Niklas Holsti wrote:
> On 2021-08-05 16:16, Don Y wrote:
>> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>>> On 2021-08-05 1:46, Don Y wrote:
>>
>>>>> The source code is downloadable; see http://bound-t.com/download/src. The
>>>>> copyright text is of my own writing, but I'm open to switching to some
>>>>> better-known copyright such as GPL or even some non-viral version.
>>>>
>>>> Ah, OK. I may have a look at it to see how well it works and
>>>> how much work to port to ARM objects. Given that you've not
>>>> done so, already (despite your familiarity with the codebase),
>>>> I suspect that's not a trivial task?
>>>
>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of
>>> course that architecture is out of date. I started a port to ARM Cortex-M,
>>> but did not (yet) finish it for the reasons described on the status page.
>>>
>>> Note that the tool is implemented in Ada.
>>
>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>> have synchronization issues, RT constraints, etc.
>
> Right. The only use of tasking features is to set an upper limit on the
> execution time of the analysis, and that feature is easy to remove if needed.
>
> The code mostly relies on the 1995 Ada standard, with some use of Ada 2005
> additions.
OK. I'll have to build a more current version of GNAT.
>>> Porting to a new target processor means writing procedures to decode the
>>> machine instructions and translate them to the internal representation used
>>> by the analysis.
>>
>> That shouldn't be hard -- except for the "timing uncertainties" in the
>> processor. (Did you rely on some blanket generalization -- like all
>> caches cold -- as you are looking for WORST case timing? OTOH, a
>> loop known to fit in a cache line *should* expect that speedup... :< )
>
> At present the tool assumes that every instruction and control transfer can be
> assigned its own WCET, in machine cycles, independent of other context or data
So, you just sum worst case times? You don't, for example, alter times based on conditionals? I assume this is semi table-driven (?) -- as a preface to inquiring as to how you develop (and verify!) a test suite?
> values. The only dynamic aspects of instruction execution time that are
> modelled are the possible pipeline stalls due to read-after-write interlocks.
> As a special case, in the version for SPARC, the concurrent execution of the
> Integer Unit and the Floating Point Unit is modelled to estimate where and for
> how long the IU has to wait for the FPU to complete an operation.
>
> Good algorithms for computing WCET bounds for most kinds of instruction caches
> are known, but I did not get around to implementing any of those, so only
> cache-less processors are supported now. If you need WCET analysis of caches,
> the AbsInt tool is best.
It's arguable whether (and to what extent) you should model I-cache effects. E.g., in a multithreaded environment, the cache can be purged at some frequency that a tool (or the developer, for that matter -- at least not wrt the instruction stream) can't know. Aim high and be pleasantly surprised...
> Data caches are still a hard problem, where the WCET analyses tend to be quite
> pessimistic. Moreover, the increasing use of eviction algorithms other than LRU
> (for example, pseudo-random eviction) lessens the precision of the cache
> analyses, even for instruction caches.
As would handling page faults or other NUMA issues. [How do you handle accesses to memories with different *basic* access times?]
>>> Sometimes it is also necessary to modify or extend the tool parts that read
>>> the executable files and especially the debugging information -- some of
>>> that may be compiler-specific, unfortunately. It is a fair amount of work
>>> and requires a good understanding both of the target processor and of the
>>> Bound-T internal representation.
>>
>> The latter will likely be the bigger effort. It requires trying to infer
>> your mindset when you began the design.
>
> You could start by reading http://bound-t.com/whats-inside.html.
>
>> [I now include "rationales" in my documentation for exactly this purpose;
>> to let those "following" understand why particular choices were made
>> instead of choices that may (in hindsight) appear better]
>
> Me too. Ceterum censeo that calling the in-code documentation "comments" was a
> very poor choice of term and very misleading for what such documentation should
> contain and its importance.
A "rationale" lets you talk to the next developer(s) informally. You can point out areas for likely optimization, areas fraught with dragons, etc. Otherwise, relying on leafing through hundreds of pages of code in the hope of stumbling on some "XXX ..." annotation is wasteful. And, *not* being tied to a "text only" presentation means you can explain things in ways that would be difficult to do (unambiguously) with prose. [How do I audibly explain the difference between front vowels and back vowels to a non-native speaker/listener?]
On 2021-08-06 11:09, Don Y wrote:
> On 8/5/2021 8:39 AM, Niklas Holsti wrote:
>> On 2021-08-05 16:16, Don Y wrote:
>>> On 8/5/2021 3:43 AM, Niklas Holsti wrote:
>>>> On 2021-08-05 1:46, Don Y wrote:
>>>
>>>>>> The source code is downloadable; see
>>>>>> http://bound-t.com/download/src. The copyright text is of my own
>>>>>> writing, but I'm open to switching to some better-known copyright
>>>>>> such as GPL or even some non-viral version.
>>>>>
>>>>> Ah, OK. I may have a look at it to see how well it works and
>>>>> how much work to port to ARM objects. Given that you've not
>>>>> done so, already (despite your familiarity with the codebase),
>>>>> I suspect that's not a trivial task?
>>>>
>>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex),
>>>> but of course that architecture is out of date. I started a port to
>>>> ARM Cortex-M, but did not (yet) finish it for the reasons described
>>>> on the status page.
>>>>
>>>> Note that the tool is implemented in Ada.
>>>
>>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>>> have synchronization issues, RT constraints, etc.
>>
>> Right. The only use of tasking features is to set an upper limit on
>> the execution time of the analysis, and that feature is easy to remove
>> if needed.
>>
>> The code mostly relies on the 1995 Ada standard, with some use of Ada
>> 2005 additions.
>
> OK. I'll have to build a more current version of GNAT.
For playing around, I would just use the GNAT Community Edition. Or the FSF GNAT that comes with MinGW32.
>>>> Porting to a new target processor means writing procedures to decode
>>>> the machine instructions and translate them to the internal
>>>> representation used by the analysis.
>>>
>>> That shouldn't be hard -- except for the "timing uncertainties" in the
>>> processor. (Did you rely on some blanket generalization -- like all
>>> caches cold -- as you are looking for WORST case timing? OTOH, a
>>> loop known to fit in a cache line *should* expect that speedup... :< )
>>
>> At present the tool assumes that every instruction and control
>> transfer can be assigned its own WCET, in machine cycles, independent
>> of other context or data
>
> So, you just sum worst case times? You don't, for example, alter
> times based on conditionals?
(Note: I say "WCET" below, but really I should always say "upper bound on the WCET".)

The analysis works on the control-flow graph (per subprogram). The WCET for each basic block is the sum of the WCETs of the instructions in that block. The WCET for an execution path through the graph (from entry to return) is the sum of the WCETs of the basic blocks in the path, plus the WCETs assigned to each edge (control transfer) between basic blocks in the path. The WCET for the whole graph is computed by a pessimisation over all syntactically possible paths through the graph, using an Integer Linear Programming solver (lp_solve, for Bound-T). This method is common in WCET analysis and is called IPET, for Implicit Path Enumeration Technique.

By "syntactically possible" I mean that the tool generally does not attempt to find semantically or logically infeasible paths. For example, if a subprogram has this form:

   if boo then
      perform computation A;
   else
      perform computation B;
   end if;
   perform computation C;

then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) + WCET(C). But if the subprogram has this form:

   if boo then
      perform computation A;
   end if;
   perform computation C;
   if not boo then
      perform computation B;
   end if;

then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B). In other words, the analysis (usually) does not discover that the conditions "boo" and "not boo" are mutually exclusive.
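To make the ILP concrete: for the first (if/else) example above, the lp_solve input could look roughly like this, with invented cycle counts (10 for A, 12 for B, 5 for C); each x variable counts how many times the corresponding block executes:

   /* IPET sketch, hypothetical cycle counts. The objective is the
      total cycle count; its maximum is the WCET bound. */
   max: 10 xA + 12 xB + 5 xC;

   /* Structural flow constraints: the subprogram is entered once,
      so exactly one arm of the if executes, and C always executes. */
   xA + xB = 1;
   xC = 1;

   int xA, xB, xC;

lp_solve then reports 17 = max(10,12) + 5, the pessimisation over both arms.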
> I assume this is semi table-driven (?)
I don't understand that question. Please clarify.
> -- as a preface to inquiring as to
> how you develop (and verify!) a test suite?
In the same way as for any complex program. Final validation by running and measuring test code on a real processor or a cycle-accurate simulation.
>> Good algorithms for computing WCET bounds for most kinds of
>> instruction caches are known, but I did not get around to implementing
>> any of those, so only cache-less processors are supported now. If you
>> need WCET analysis of caches, the AbsInt tool is best.
>
> It's arguable whether (and to what extent) you should model I-cache
> effects.
> E.g., in a multithreaded environment, the cache can be purged at some
> frequency that a tool (or the developer, for that matter -- at least not
> wrt the instruction stream) can't know. Aim high and be pleasantly
> surprised...
Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET analysis. The most common solution is to first find out how the various tasks and interrupts can access memory blocks which potentially may collide in the caches, and then to take the worst possible interference into account in the schedulability analysis, which does know (or assumes) the worst case frequency of preemptions.

For processors where cache misses are much slower than cache hits (which is fast coming to mean almost all processors) IMO an I-cache analysis is necessary for static WCET analysis to be useful.
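For reference, the textbook way to fold CRPD into fixed-priority response-time analysis (this is the schedulability side, not something Bound-T computes) is to charge each preemption of task i by a higher-priority task j an extra cache-reload cost gamma_ij:

   R_i = C_i + B_i + sum over j in hp(i) of ceiling(R_i / T_j) * (C_j + gamma_ij)

where C is the task WCET, B the blocking time, T the period, and hp(i) the set of tasks with priority above i; the recurrence is iterated from R_i = C_i to a fixed point.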
> [How do you handle accesses to memories with different *basic* access
> times?]
For code fetch the address is always statically known, so the correct access time can be selected and included in the WCET for the instruction when the instruction is decoded.

For data accesses, if different instructions are used for different memories (as in the 8051 architecture, for example), the proper access time can be included in the WCET for the instruction when the instruction is decoded.

If the same instructions are used to access memories with different access times, depending on the address, and the memory addresses are dynamically determined at run time, the practical but pessimistic way is to use the largest of the access times, that is, assume that the slowest memory is accessed. In principle the tool can try to analyze the address computations and might find sufficient bounds on the address to ensure that a faster memory is accessed, but the analysis methods currently available in Bound-T seldom manage that, and can be quite slow.

That said, for some processors it is easy to recognize at decode-time most of the instructions that access the stack, and some versions of Bound-T let one specify different access times for stack accesses and for general (unclassified) accesses. That can be useful if the stack is located in fast memory, but other data are in slower memory.
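As a compilable sketch of what "selected when the instruction is decoded" can mean (invented address map and cycle counts, not actual Bound-T code):

   with Ada.Text_IO;

   procedure Fetch_Demo is

      type Code_Address_T is mod 2**32;

      Flash_First : constant Code_Address_T := 16#0800_0000#;
      Flash_Last  : constant Code_Address_T := 16#080F_FFFF#;

      --  The decoder calls something like this with the statically
      --  known fetch address of each instruction:

      function Fetch_Cycles (Address : Code_Address_T) return Positive is
      begin
         if Address in Flash_First .. Flash_Last then
            return 3;   --  assumed wait-states for flash
         else
            return 1;   --  assumed single-cycle on-chip RAM
         end if;
      end Fetch_Cycles;

   begin
      Ada.Text_IO.Put_Line
        ("Flash fetch costs" & Positive'Image (Fetch_Cycles (16#0800_0100#)));
   end Fetch_Demo;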
On 8/6/2021 12:58 PM, Niklas Holsti wrote:

>>>>>>> The source code is downloadable; see http://bound-t.com/download/src.
>>>>>>> The copyright text is of my own writing, but I'm open to switching to
>>>>>>> some better-known copyright such as GPL or even some non-viral version.
>>>>>>
>>>>>> Ah, OK. I may have a look at it to see how well it works and
>>>>>> how much work to port to ARM objects. Given that you've not
>>>>>> done so, already (despite your familiarity with the codebase),
>>>>>> I suspect that's not a trivial task?
>>>>>
>>>>> There is an ARM TDMI port (32-bit ARM code and THUMB, pre Cortex), but of
>>>>> course that architecture is out of date. I started a port to ARM Cortex-M,
>>>>> but did not (yet) finish it for the reasons described on the status page.
>>>>>
>>>>> Note that the tool is implemented in Ada.
>>>>
>>>> OK. It's been ~20 years since I wrote a line of Ada code but I doubt it
>>>> will take much to "refresh" those memory cells. Esp as this likely doesn't
>>>> have synchronization issues, RT constraints, etc.
>>>
>>> Right. The only use of tasking features is to set an upper limit on the
>>> execution time of the analysis, and that feature is easy to remove if needed.
>>>
>>> The code mostly relies on the 1995 Ada standard, with some use of Ada 2005
>>> additions.
>>
>> OK. I'll have to build a more current version of GNAT.
>
> For playing around, I would just use the GNAT Community Edition. Or the FSF
> GNAT that comes with MinGW32.
Ugh! No, I'll just build a fresh copy. I do all my development under NetBSD so have everything I want/need ('cept the Ada tools) there, already.
>>>>> Porting to a new target processor means writing procedures to decode the
>>>>> machine instructions and translate them to the internal representation
>>>>> used by the analysis.
>>>>
>>>> That shouldn't be hard -- except for the "timing uncertainties" in the
>>>> processor. (Did you rely on some blanket generalization -- like all
>>>> caches cold -- as you are looking for WORST case timing? OTOH, a
>>>> loop known to fit in a cache line *should* expect that speedup... :< )
>>>
>>> At present the tool assumes that every instruction and control transfer can
>>> be assigned its own WCET, in machine cycles, independent of other context or
>>> data
>>
>> So, you just sum worst case times? You don't, for example, alter
>> times based on conditionals?
>
> (Note: I say "WCET" below, but really I should always say "upper bound on the
> WCET".)
Understood.
> The analysis works on the control-flow graph (per subprogram). The WCET for
> each basic block is the sum of the WCETs of the instructions in that block. The
> WCET for an execution path through the graph (from entry to return) is the sum
> of the WCETs of the basic blocks in the path, plus the WCETs assigned to each
> edge (control transfer) between basic blocks in the path.
So, the costs of conditional control transfers are handled separately (?)
> The WCET for the
> whole graph is computed by a pessimisation over all syntactically possible
> paths through the graph, using an Integer Linear Programming solver (lp_solve,
> for Bound-T). This method is common in WCET analysis and is called IPET, for
> Implicit Path Enumeration Technique.
>
> By "syntactically possible" I mean that the tool generally does not attempt to
> find semantically or logically infeasible paths. For example, if a subprogram
> has this form:
>
>    if boo then
>       perform computation A;
>    else
>       perform computation B;
>    end if;
>    perform computation C;
>
> then the WCET for the subprogram is estimated as max(WCET(A),WCET(B)) +
> WCET(C). But if the subprogram has this form:
>
>    if boo then
>       perform computation A;
>    end if;
>    perform computation C;
>    if not boo then
>       perform computation B;
>    end if;
>
> then the WCET for the subprogram is estimated as WCET(A) + WCET(C) + WCET(B).
> In other words, the analysis (usually) does not discover that the conditions
> "boo" and "not boo" are mutually exclusive.
OK. This is different from how a DSE-based tool would approach the problem. It would walk through the code and "solve" for each variable that affects control flow or execution cost. So, on one "probe", it would set boo to "TRUE" and evaluate ALL of the consequences of that choice. Then, would repeat the exercise with boo as FALSE. (hence, it is more costly -- though could be made less so if it wasn't actively probing for new cases along the way)
>> I assume this is semi table-driven (?)
>
> I don't understand that question. Please clarify.
A large number of opcodes, each with particular costs. Do you build a jungle of conditionals to subdivide the "opcode space" into groups of similar-cost operations? Or, do you just have a table of:

   {opcode, mask, type[1], cost}

[1] used to apply some further heuristics to your refinement of "cost"
>> -- as a preface to inquiring as to
>> how you develop (and verify!) a test suite?
>
> In the same way as for any complex program. Final validation by running and
> measuring test code on a real processor or a cycle-accurate simulation.
Ah, OK. So, you could "can" that particular exemplar and use it to test a modification to the code (without having to run the app on "real hardware", again).

I don't rely on "live" tests for my code. Rather, I use tools to generate good test coverage and then verify the results are what I expect. I find this easier and more easily extensible (I can test ARM code by running an x86 port of that code).
>>> Good algorithms for computing WCET bounds for most kinds of instruction
>>> caches are known, but I did not get around to implementing any of those, so
>>> only cache-less processors are supported now. If you need WCET analysis of
>>> caches, the AbsInt tool is best.
>>
>> It's arguable whether (and to what extent) you should model I-cache effects.
>> E.g., in a multithreaded environment, the cache can be purged at some
>> frequency that a tool (or the developer, for that matter -- at least not
>> wrt the instruction stream) can't know. Aim high and be pleasantly
>> surprised...
>
> Such Cache-Related Preemption Delays (CRPDs) are a much-studied problem in WCET
> analysis. The most common solution is to first find out how the various tasks
> and interrupts can access memory blocks which potentially may collide in the
> caches, and then to take the worst possible interference into account in the
> schedulability analysis, which does know (or assumes) the worst case frequency
> of preemptions.
>
> For processors where cache misses are much slower than cache hits (which is
> fast coming to mean almost all processors) IMO an I-cache analysis is necessary
> for static WCET analysis to be useful.
I look at it the other way around. Assume there is NO cache. You know that your code WILL run in less time than this case. Regardless of the frequency or presence of competing events.

Processors are fast enough, now, that you can usually afford to "step up" in capability for comparably little cost. And, if you've minimized the number of things that *must* meet specific timing/performance goals -- and left everything else as a lower priority/importance secondary goal -- then cache just increases the performance of those "less important" goals, pleasantly.
>> [How do you handle accesses to memories with different *basic* access times?]
>
> For code fetch the address is always statically known, so the correct access
> time can be selected and included in the WCET for the instruction when the
> instruction is decoded.
So, you know the access characteristics of each block of memory in which code may reside.
> For data accesses, if different instructions are used for different memories
> (as in the 8051 architecture, for example), the proper access time can be
> included in the WCET for the instruction when the instruction is decoded.
>
> If the same instructions are used to access memories with different access
> times, depending on the address, and the memory addresses are dynamically
> determined at run time, the practical but pessimistic way is to use the largest
> of the access times, that is, assume that the slowest memory is accessed. In
> principle the tool can try to analyze the address computations and might find
> sufficient bounds on the address to ensure that a faster memory is accessed,
> but the analysis methods currently available in Bound-T seldom manage that, and
> can be quite slow.
It would also require the user to be more specific about his execution environment.
> That said, for some processors it is easy to recognize at decode-time most of
> the instructions that access the stack, and some versions of Bound-T let one
> specify different access times for stack accesses and for general
> (unclassified) accesses. That can be useful if the stack is located in fast
> memory, but other data are in slower memory.
I'm thinking, specifically, about I/Os -- which are increasingly memory mapped (including register spaces).
On 8/6/2021 4:04 PM, Don Y wrote:
>> That said, for some processors it is easy to recognize at decode-time most of
>> the instructions that access the stack, and some versions of Bound-T let one
>> specify different access times for stack accesses and for general
>> (unclassified) accesses. That can be useful if the stack is located in fast
>> memory, but other data are in slower memory.
>
> I'm thinking, specifically, about I/Os -- which are increasingly memory
> mapped (including register spaces).
Sorry, I should be more clear. I'm looking at the issues that would affect *my* needs in *my* environment -- realizing these may be different than those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti
<niklas.holsti@tidorum.invalid> wrote:

> :
>At present the tool assumes that every instruction and control transfer
>can be assigned its own WCET, in machine cycles, independent of other
>context or data values. The only dynamic aspects of instruction
>execution time that are modelled are the possible pipeline stalls due to
>read-after-write interlocks. As a special case, in the version for
>SPARC, the concurrent execution of the Integer Unit and the Floating
>Point Unit is modelled to estimate where and for how long the IU has to
>wait for the FPU to complete an operation.
>
>Good algorithms for computing WCET bounds for most kinds of instruction
>caches are known, but I did not get around to implementing any of those,
>so only cache-less processors are supported now. If you need WCET
>analysis of caches, the AbsInt tool is best.
>
>Data caches are still a hard problem, where the WCET analyses tend to be
>quite pessimistic. Moreover, the increasing use of eviction algorithms
>other than LRU (for example, pseudo-random eviction) lessens the
>precision of the cache analyses, even for instruction caches.
> :
How do you handle branch/jump mispredicts?

More importantly, how do you handle chains of conditional branches and/or switch/case constructs which can mispredict at every decision point? The branch targets may not be in cache - neither mispredicted targets nor the actual one. Worst case for a horribly mispredicted switch/case can be absolutely dreadful.

Similar problem for table driven code: unless the jump is always made through the /same/ table element, then to a 1st approximation the jump will mispredict /every/ time.

Late binding OO code is a problem too, though not to the same degree: a lot of code really is monomorphic, with a given call usually or always targeting the same class and method. But real polymorphic code (like table driven) suffers greatly when dealing with heterogeneous objects.

George
On 2021-08-07 5:55, George Neuner wrote:
> On Thu, 5 Aug 2021 18:39:51 +0300, Niklas Holsti
> <niklas.holsti@tidorum.invalid> wrote:
>
>> :
>> At present the tool assumes that every instruction and control transfer
>> can be assigned its own WCET, in machine cycles, independent of other
>> context or data values. The only dynamic aspects of instruction
>> execution time that are modelled are the possible pipeline stalls due to
>> read-after-write interlocks.
...
> How do you handle branch/jump mispredicts?
See above re the assumptions of the tool. The WCET analysis function of the Bound-T tool is (at present) suited only for simple, deterministic processors. The tool development started in the late 1990s when the company I then worked for developed applications mainly for such processors.

If the processor uses a static branch prediction scheme (for example, predict forward branches as not taken and backward branches as taken), the mispredict penalty can be included in the WCET assigned to the non-predicted edge in the control-flow graph at instruction decode-time.

For dynamic (history-dependent) branch predictors, various models for static analysis have been proposed and published, but they have the same problems as cache-analysis models: the HW microarchitectures are becoming more and more complex and varied, making the modelling effort more expensive, both in implementation effort and in analysis time, and reducing the customer base for each specific model.

The WCET-analysis tools from AbsInt include detailed signal-by-signal simulations of the microarchitectures. Unfortunately such details are seldom documented well, and I believe AbsInt has had to enter special agreements with processor suppliers for access to such information. And the details can change with each processor revision...

Speculative executions such as branch prediction can create so-called "timing anomalies" which make the WCET analysis much harder. A timing anomaly is any property of a processor that makes it invalid to estimate the WCET of a sequence of instructions by assuming that the first instruction encounters its worst case (for example, a cache miss instead of a cache hit) and then continuing the analysis of the rest of the sequence while propagating that assumption forward as an assumption on the state of the processor.

The canonical example of timing anomalies are processors where a cache hit at one instruction changes the processor state in such a way that a later instruction takes much longer than it would have taken, had the first instruction encountered a cache miss instead of a hit. Propagating the "first instruction missed" assumption to the context of the later instruction may make the analysis under-estimate the execution time of that later instruction. To handle timing anomalies, the analysis must consider all possible combinations: first instruction hits or misses, second instruction hits or misses, etc. Exponential complexity.

It amuses me that such processor properties, which are/were the bane of static WCET analysis, are now in the limelight as the origin of the side channels for various malware, such as Spectre exploits.
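As a sketch of the static-prediction case (invented names and penalty, not actual Bound-T code), the decode-time assignment of edge WCETs could look like this:

   procedure Edge_Demo is

      type Code_Address_T is mod 2**32;

      Mispredict_Penalty : constant Natural := 3;   --  assumed cycles

      --  Static scheme: backward branches predicted taken, forward
      --  branches predicted not taken; the penalty goes on whichever
      --  edge the predictor gets wrong.

      procedure Edge_Cycles
        (Branch_Addr    : in  Code_Address_T;
         Target_Addr    : in  Code_Address_T;
         Taken_Cycles   : out Natural;
         Falling_Cycles : out Natural)
      is
         Predicted_Taken : constant Boolean := Target_Addr < Branch_Addr;
      begin
         if Predicted_Taken then
            Taken_Cycles   := 0;
            Falling_Cycles := Mispredict_Penalty;
         else
            Taken_Cycles   := Mispredict_Penalty;
            Falling_Cycles := 0;
         end if;
      end Edge_Cycles;

      T, F : Natural;

   begin
      --  A backward branch: the taken edge costs 0 extra cycles, the
      --  fall-through (mispredicted) edge costs 3.
      Edge_Cycles (Branch_Addr  => 16#0100#, Target_Addr => 16#00C0#,
                   Taken_Cycles => T, Falling_Cycles => F);
   end Edge_Demo;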
> More importantly, how do you handle chains of conditional branches
> and/or switch/case constructs which can mispredict at every decision
> point? The branch targets may not be in cache - neither mispredicted
> targets nor the actual one. Worst case for a horribly mispredicted
> switch/case can be absolutely dreadful.
I don't know if those questions are addressed to me (re the Bound-T tool) or to WCET analysis in general. The advanced WCET tools, like the ones from AbsInt, do try to cover such cases by their cache analysis to provide a safe upper bound on the WCET, but the degree of pessimism (over-estimation) increases with the complexity of the code.

That said, in many programs most of the processing time is taken by relatively simple loops, for which the present I-cache analyses work quite well, and for which even D-cache analysis can (I believe) work satisfactorily if the memory addressing patterns are regular.
On 2021-08-07 2:04, Don Y wrote:
> On 8/6/2021 12:58 PM, Niklas Holsti wrote:
>>> OK. I'll have to build a more current version of GNAT.
>>
>> For playing around, I would just use the GNAT Community Edition. Or
>> the FSF GNAT that comes with MinGW32.
>
> Ugh! No, I'll just build a fresh copy. I do all my development
> under NetBSD so have everything I want/need ('cept the Ada tools)
> there, already.
You are braver than I am. Note that the front-end of GNAT is implemented in Ada, so it has to be bootstrapped with an Ada compiler. And the current version of GNAT requires quite an up-to-date Ada compiler for the bootstrap. In practice, I think people have found that only GNAT can build GNAT.

Many Linux distributions come with GNAT pre-built. Debian-derived distributions have good support, I hear.
>> The analysis works on the control-flow graph (per subprogram). The
>> WCET for each basic block is the sum of the WCETs of the instructions
>> in that block. The WCET for an execution path through the graph (from
>> entry to return) is the sum of the WCETs of the basic blocks in the
>> path, plus the WCETs assigned to each edge (control transfer) between
>> basic blocks in the path.
>
> So, the costs of conditional control transfers are handled separately (?)
I'm not sure I understand your question. Each edge in the control-flow graph, that is, each transition from one instruction or one basic block to a possible next instruction or block, is assigned a WCET value when the instructions are decoded and entered in the control-flow graph.

Usually this WCET is zero for a fall-through edge from a non-branch instruction to the next, and is non-zero only for edges from branch instructions, whether conditional or unconditional. For a conditional branch, the "not taken" edge usually has a zero WCET and the "taken" edge has some non-zero WCET, as specified in the processor documentation (remember we are assuming a simple processor).
>>> I assume this is semi table-driven (?)
>>
>> I don't understand that question. Please clarify.
>
> A large number of opcodes, each with particular costs.
> Do you build a jungle of conditionals to subdivide the
> "opcode space" into groups of similar-cost operations?
> Or, do you just have a table of:
>
> {opcode, mask, type[1], cost}
>
> [1] used to apply some further heuristics to your
> refinement of "cost"
I would say that the approach to translating instructions into their analysis models (elements in the control-flow graph) is usually not table-driven, but emulates the field-by-field decoding and case analysis that would be done by a disassembler or emulator (and Bound-T can produce a disassembly of the code). The internal model includes not only the cost of the instruction, but also its effect on the computation. For example, an instruction like "increment the accumulator", INC A, would be decoded into a model that says

   WCET: 1 cycle.
   Control flow: to next instruction, unconditional.
   Effect: A := A + 1.

and also the effect, if any, on condition flags.
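In code, that decode step might look roughly like this (invented types and encoding, not the actual Bound-T internals):

   procedure Decode_Demo is

      type Register_T is (Reg_A, Reg_X);

      type Effect_T is record
         Target : Register_T;
         Addend : Integer;
      end record;

      type Flow_T is (To_Next, Taken_Branch);

      type Step_Model_T is record
         Cycles : Natural;    --  WCET of this instruction, in cycles
         Flow   : Flow_T;     --  where control can go next
         Effect : Effect_T;   --  effect on the computation
      end record;

      function Decode (Opcode : Natural) return Step_Model_T is
      begin
         case Opcode is
            when 16#3C# =>    --  assumed encoding of "INC A"
               return (Cycles => 1,
                       Flow   => To_Next,
                       Effect => (Target => Reg_A, Addend => 1));
            when others =>
               raise Program_Error;   --  instruction not modelled yet
         end case;
      end Decode;

      Model : constant Step_Model_T := Decode (16#3C#);

   begin
      pragma Assert (Model.Cycles = 1 and Model.Effect.Addend = 1);
      null;
   end Decode_Demo;

The point is that the effect (A := A + 1) is captured as data that the later analyses (loop bounds, stack height) can reason about, not just as a cycle count.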
>>> -- as a preface to inquiring as to
>>> how you develop (and verify!) a test suite?
>>
>> In the same way as for any complex program. Final validation by
>> running and measuring test code on a real processor or a
>> cycle-accurate simulation.
>
> Ah, OK. So, you could "can" that particular exemplar and use it
> to test a modification to the code (without having to run the app
> on "real hardware", again)
Yep.
> I don't rely on "live" tests for my code. Rather, I use tools to
> generate good test coverage and then verify the results are what I
> expect. I find this easier and more easily extensible (I can test
> ARM code by running an x86 port of that code)
At my former job, implementing SW for ESA satellite applications, we almost always tested the code first on normal PCs, in a simulated I/O environment, and then on the embedded target. Easier for Ada than for C.
>> For processors where cache misses are much slower than cache hits
>> (which is fast coming to mean almost all processors) IMO an I-cache
>> analysis is necessary for static WCET analysis to be useful.
>
> I look at it the other way around. Assume there is NO cache. You
> know that your code WILL run in less time than this case. Regardless
> of the frequency or presence of competing events.
>
> Processors are fast enough, now, that you can usually afford to "step up"
> in capability for comparably little cost.
Sometimes, but not always. The ESA applications we made were usually running (worst case) at about 70% processor load, and faster processors for space applications were very expensive, not only in euros but also in power, volume and mass, which are scarce resources in space.
On 2021-08-07 2:06, Don Y wrote:
> On 8/6/2021 4:04 PM, Don Y wrote:
>>> That said, for some processors it is easy to recognize at decode-time
>>> most of the instructions that access the stack, and some versions of
>>> Bound-T let one specify different access times for stack accesses and
>>> for general (unclassified) accesses. That can be useful if the stack
>>> is located in fast memory, but other data are in slower memory.
>>
>> I'm thinking, specifically, about I/Os -- which are increasingly memory
>> mapped (including register spaces).
>
> Sorry, I should be more clear. I'm looking at the issues that would affect
> *my* needs in *my* environment -- realizing these may be different than
> those you've previously targeted. (e.g., VMM, FPU emulation, etc.)
Ok. If you describe your needs, perhaps I can comment.

In my experience, access to memory-mapped registers tends to be not much slower than access to memory. Also, quite often the addresses in MMIO accesses are static or easy to derive in static analysis, so the WCET tool could choose the proper access time, if the tool knows enough about the system architecture.

If by "FPU emulation" you mean SW-implemented FP instructions, of course we have encountered those. They are often hand-coded assembler with very complex and tricky control flow, which makes it hard to find loop bounds automatically, so manual annotations must be used instead.

What is VMM?