Watchdog Timer Anti-patterns
The humble watchdog timer has been an essential part of our reliability tool chest for decades now. The way it works is straightforward and easy to understand, and most practical designs are easy to interface with.
There is a wealth of reference material that covers both the theory behind watchdog timers and practical design tips. But what we'll talk about today is of a slightly different nature.
Despite its straightforward operation and long history, the watchdog timer does occasionally get misused. Some ways to misuse it are so common that they constitute veritable patterns - or, more aptly, anti-patterns - of incorrect design or implementation. These anti-patterns will be our subject today.
Why talk about them? The operation of watchdog timers is inherently difficult to test since it's difficult to reliably enumerate and reproduce the failure modes that it mitigates. Consequently, we rely on code reviews and design reviews to guarantee its correct integration even more than we do for other components. A successful review hinges on understanding both correct and incorrect watchdog use patterns.
Of course, the anti-patterns we'll talk about today cannot span the range of all possible design or implementation errors. However, they are very common and easy to recognize if you know what to look for.
Let's start with the most straightforward anti-patterns - the ones that stem not so much from a temporary failure of reasoning, but from a more permanent lack of understanding.
Missing the Point
Most code reviews and audits are primarily concerned with subtle errors, the sort of errors that made the infamous Therac-25 incident possible. But every once in a while, you stumble upon code that is not just "subtly" wrong. Instead, it misses the point entirely.
The main difficulty with these instances is not so much fixing them as explaining why they need to be fixed. When the code is held up against a functional spec that simply says "software must use watchdog functionality", talking about the exact manner in which the watchdog is used looks a lot like quibbling over details.
Fire and Forget
This most basic anti-pattern has an unfortunate origin: it comes from well-intentioned, but ultimately badly-written labs in introductory university courses. It consists of keeping the WDT alive by unconditionally feeding it from a timer interrupt.
In this case, simply put, the watchdog timer doesn't do anything. The application loop may hang, while the timer interrupt keeps firing and keeps the watchdog happy. The system is stuck, but the WDT doesn't trigger any action.
It doesn't help that feeding the WDT from a timer interrupt is actually a perfectly legitimate technique. Doing so is not wrong per se. It's feeding it unconditionally that makes the watchdog useless.
Instead, resetting the watchdog timer should be tied to the system's main loop. If the loop hangs, the watchdog timer should not be reset.
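As a minimal sketch of what "conditional" feeding from a timer interrupt can look like, consider the following. All the names here (heartbeat_init(), main_loop_checkin(), timer_isr(), wdt_hw_feed()) are hypothetical, and the hardware feed is simulated with a counter so that the logic can be exercised on a host. The tolerance parameter is an assumption of mine, meant to cover main loops that run slower than the timer tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated hardware: on a real MCU, wdt_hw_feed() would write the
 * reload key to the watchdog peripheral instead of bumping a counter. */
static uint32_t wdt_feed_count;
static void wdt_hw_feed(void) { wdt_feed_count++; }

/* Set by the main loop, cleared by the timer ISR. */
static volatile bool loop_checked_in;

/* Consecutive timer ticks seen without a check-in from the main loop. */
static uint32_t silent_ticks;

/* Feeding stops once this many consecutive ticks pass without a check-in. */
static uint32_t max_silent_ticks = 1;

void heartbeat_init(uint32_t max_silent)
{
    assert(max_silent >= 1u);
    max_silent_ticks = max_silent;
    silent_ticks = 0;
    loop_checked_in = false;
}

/* Called by the main loop at the end of every completed iteration. */
void main_loop_checkin(void)
{
    loop_checked_in = true;
}

/* Periodic timer ISR: feeds the watchdog only while the main loop
 * keeps proving that it is still making progress. */
void timer_isr(void)
{
    if (loop_checked_in) {
        loop_checked_in = false;
        silent_ticks = 0;
        wdt_hw_feed();
        return;
    }
    if (silent_ticks < max_silent_ticks) {
        silent_ticks++;
    }
    if (silent_ticks < max_silent_ticks) {
        wdt_hw_feed(); /* loop is slow, but still within tolerance */
    }
    /* Otherwise: no feed, so the hardware watchdog is left to expire. */
}
```

The interrupt still does the feeding, but a hung main loop stops checking in, the feeds stop, and the watchdog gets to do its job.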
Suspending the WDT during Long Operations
In systems without hard real-time constraints, you occasionally end up with tasks that take a long time to complete. Common culprits include things like upgrading firmware or dumping logs. The offending task isn't stuck, it just takes a long time.
There's rarely a solid technical reason why the task really needs to take such a long time. Downloading several megabytes of firmware, for instance, can be time-consuming, but there are plenty of ways to break it down or do it asynchronously.
However, the proper fix is rarely as trivial as the tempting shortcut: simply suspending the watchdog timer during long operations. We know that the task runs fine, the reasoning goes, and the system isn't doing anything else in the meantime. So there's no harm in temporarily suspending the watchdog, right? In fact, the offending task may be one that doesn't commonly occur during normal operation, so it's not the end of the world if it gets stuck.
This almost sounds reasonable, were it not for some bad assumptions.
First, there's the part about "knowing" that the task runs fine. Many embedded systems don't have all that many tasks. Presumably, when you ship the system, you've already tested it and "know" that all of them "run fine", so why ship it with a WDT in the first place?
The watchdog timer is most helpful precisely in those corner cases that are not only difficult to test for, but also difficult to imagine. We don't employ a watchdog timer to guard against foreseeable errors during normal operation. We employ it as a last resort in case of abnormal, undetermined (or even non-deterministic) operation.
Second, getting stuck in an endless loop is not the only failure mode of a system. A stack overflow, for instance, can push it into an arbitrary state, where it might - haphazardly - perform other operations, without the WDT ever being enabled again. Smashing the stack of an access control system while it's transmitting logs or checking flash memory integrity may kick it into a state where all it does is blink LEDs and crash. But it might just as well kick it into a state where it opens doors at random.
Besides, even in a cooperative multitasking environment, it's rarely the case that the system is really running nothing but the currently scheduled task. It's also executing ISRs, for example. The system might get stuck in one, or the interaction between an ISR and the currently scheduled task might end up causing the system to hang. The watchdog timer covers these cases as well.
Finally, and equally important, it's not too easy to guarantee that the WDT will never be accidentally disabled, or that it will always be re-enabled correctly.
The corrupted stack I mentioned above is an obvious case where the program flow might be accidentally diverted past the point where the WDT is re-enabled. But abnormal conditions aside, integrating WDT operation with the rest of your firmware makes it as susceptible to bugs as any other part of the firmware. Just think about how many bugfix commit messages you've seen with phrases like "forgot to enable interrupts", "X flag was not cleared" or "task was not correctly re-scheduled". Why open yourself to one that says "watchdog timer was not correctly re-enabled"?
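To make the alternative concrete, here is a minimal sketch of breaking a long operation into bounded steps, so that the watchdog never has to be suspended. Everything in it is hypothetical: copy_job_t, copy_job_start(), copy_job_step() and CHUNK_SIZE are names made up for illustration, and memcpy() merely stands in for whatever slow work (programming flash, transmitting logs) the real task does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Upper bound on the work done per call, chosen so that one step
 * comfortably fits within the watchdog timeout. */
#define CHUNK_SIZE 256u

/* State of a long-running copy, kept between main loop iterations. */
typedef struct {
    const uint8_t *src;
    uint8_t *dst;
    size_t total;
    size_t done;
} copy_job_t;

void copy_job_start(copy_job_t *job, const uint8_t *src,
                    uint8_t *dst, size_t total)
{
    assert(job != NULL && src != NULL && dst != NULL);
    job->src = src;
    job->dst = dst;
    job->total = total;
    job->done = 0;
}

/* Processes at most CHUNK_SIZE bytes, then returns to the caller.
 * Returns 1 while work remains and 0 once the job is complete. */
int copy_job_step(copy_job_t *job)
{
    size_t left = job->total - job->done;
    size_t n = (left < CHUNK_SIZE) ? left : CHUNK_SIZE;

    memcpy(job->dst + job->done, job->src + job->done, n);
    job->done += n;
    return job->done < job->total;
}
```

The main loop would call copy_job_step() once per iteration and feed the watchdog as usual. The long task still takes just as long overall, but the watchdog stays armed the whole time.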
Reliability as an Afterthought
Reliability, much like its more glamorous counterpart, security, is not something you "add" to an application. It's something that you design for. Unless reliability is an explicit aspect of your engineering process, you won’t get a solid system, even if you tick all the reliability checkboxes - like having a watchdog timer.
Attempts to "add reliability" to a system during late development stages sometimes go so wrong that they do more harm than good. These are some of the cases that we will discuss next.
Watchdog Sprinkling
Earlier, we were talking about how sometimes you end up with a task that takes too long and causes a watchdog reset (and about the frightening anti-pattern this invites: pausing the watchdog while the task executes).
In systems that were never designed with a watchdog timer in mind in the first place, or where implementing and enabling a watchdog scheme is postponed for a long time, it routinely happens that all the tasks end up taking too long.
By that time, it's too late to start breaking tasks up or to rewrite sections in asynchronous terms. That often amounts to refactoring virtually the entire codebase.
The compromise? The code is inspected and calls to reset_wdt() are inserted in every loop, at the beginning of each function, and sometimes throughout the body of longer functions, as needed, until the system stops resetting. Calls to the function that feeds the watchdog timer are sprinkled here and there, hoping that they'll be close enough together that a "false" watchdog reset is never triggered.
More often than not, this hope is altogether false. Non-trivial programs have too many code paths for humans to perform reliable timing analyses, especially when you factor in ISRs. The human mind can barely analyze the logic of non-trivial programs and write them correctly. Timing is way beyond the limits of even the brightest engineers.
It's true that a watchdog timer that generates some "bad" resets definitely generates all the "good" ones, too. It would seem that this is better than no watchdog at all, but reality is a little murkier than that.
First, a device that randomly resets during normal operation is not too useful, no matter how quickly and how well it's restored to a known good state. Repeated glitches and stuttering can be dangerous in and of themselves. Even if safety is not affected in any way, this is still a quality issue.
Second, this approach makes it more difficult to reason about the system's behavior. It also tends to muddy non-technical waters. On systems without hard real-time constraints, it's tempting to solve the occasional random reset by increasing the watchdog timeout. This leads not only to unproductive haggling over functional specs, but also to a general loss of quality.
Timeout Auctioning
There is a question that is almost always followed by awkward silence in meetings that discuss system reliability: "How long should the watchdog timeout be?"
Even in systems that don't have hard real-time constraints, this is a question that you should answer neither randomly, nor through the universally unhelpful formula "as short as possible".
The timeout interval can be determined in relation to the system's main loop and response time requirements (in real-time systems), or based on convenience and user interaction requirements (in non-real-time systems).
For example, in a system that drives a linear actuator, the watchdog timeout interval can be chosen so as to ensure that if the application gets stuck while pushing the actuator at maximum speed, it never stays stuck long enough for the actuator to suffer mechanical damage if it encounters an obstacle. The length of that interval can be determined experimentally or based on data from the actuator's manufacturer. Either way, you get to make this decision based on real data and hard facts.
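The arithmetic behind such a decision can be sketched roughly as follows. This is only an illustration under my own assumptions: that the measured quantity is how long the actuator tolerates being driven against an obstacle (max_stall_ms), that the reset takes some time to actually cut the drive (reset_latency_ms), and that you want a safety margin on top. The function name and all three parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the longest acceptable watchdog timeout, in milliseconds.
 *
 * Worst case, the firmware hangs right after feeding the watchdog, so
 * the actuator keeps pushing for one full timeout plus the time the
 * reset needs to disable the drive. That sum must stay below the stall
 * time the actuator tolerates, less a safety margin.
 *
 * A return value of 0 means no timeout is short enough. */
uint32_t wdt_max_timeout_ms(uint32_t max_stall_ms,
                            uint32_t reset_latency_ms,
                            uint32_t margin_percent)
{
    assert(margin_percent < 100u);

    if (max_stall_ms <= reset_latency_ms) {
        return 0u;
    }
    /* 64-bit intermediate so the multiplication cannot overflow. */
    uint64_t budget_ms = (uint64_t)(max_stall_ms - reset_latency_ms);
    return (uint32_t)(budget_ms * (100u - margin_percent) / 100u);
}
```

With made-up figures of a 500 ms stall tolerance, a 20 ms reset latency and a 25% margin, the function returns 360. The particular numbers matter less than the fact that each input can be measured or looked up, which gives the timeout a technical justification.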
When this isn't done - often because the system is not designed with a watchdog from the very beginning, or because the decision is postponed - the timeout choice is made haphazardly.
Sometimes it's chosen so as to alleviate the problems caused by watchdog sprinkling. Sometimes it's fixed "by decree" and leads to watchdog sprinkling. Sometimes a ridiculously high value gets picked, allowing you to see that the system is stuck and giving you a chance to get a stack trace before it's reset (I wish my imagination were good enough that I could have made this one up...).
Unfortunately, this anti-pattern makes it impossible to make an informed assessment of a system's reliability. Is the timeout interval you've chosen good enough? Would there be any benefits in decreasing it? Or increasing it? Does it offer any substantial reliability guarantees, or does critical damage occur by the time the watchdog kicks in and resets the system?
It also leads to plenty of unproductive discussions and ill-defined specifications. Since there is no "real" (read: technical) reason why the timeout should be set to what it is, everyone whose problems would go away by increasing it or decreasing it will end up advocating for one of the two. There is a political dimension to every project, which is entirely unavoidable, and anything that cannot be settled through experimental evidence is bound to end up being debated in non-technical terms.
Sometimes the problem isn't systemic. It's not a matter of not understanding how a watchdog improves a system's reliability or a matter of how you approach designing a reliable system. It's simply bad design, and some patterns of bad design are so common that they are easy to recognize.
Incomplete Reset Tree
In many systems today, the watchdog timer is embedded in the main MCU or SoC. It's very common, in this case, for the SoC to be able to output the internal watchdog reset signal, so that it can be used to reset other devices in the system.
What's the point of this signal? Consider the case where a task gets stuck waiting for data from a hung peripheral. The watchdog will reset the CPU. But unless it also resets the troublemaking peripheral, the task will get stuck again (and the watchdog timer will expire again) as soon as it tries to talk to that peripheral device. The WDT reset output allows you to reset other devices in the system and return them to a known, valid state, too.
The WDT reset signal is not the only way to achieve this. It's also common for the processing unit to reset peripherals via GPIO lines, if not every time it boots, at least when it determines that it's booting after a watchdog reset. At the end of the day, how it's done is less important, as long as all the system's components are restored to a known good state after a watchdog reset. If this global known good state is not achieved, the individual recovery of some individual components may not count for much.
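A minimal sketch of the GPIO-based variant might look like this. The reset-cause flag, the pin numbers and the helper names are all hypothetical, and the GPIO port is simulated with plain arrays so that the logic can run on a host. A real implementation would read the MCU's reset status register and honor each peripheral's minimum reset pulse width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reset-cause flag. Real MCUs typically expose something
 * similar in a reset status register. */
#define RESET_CAUSE_WATCHDOG (1u << 0)

#define NUM_PINS 8u

/* Simulated GPIO port: current output level of each pin, and the
 * number of times each active-low reset line has been released. */
static uint8_t pin_level[NUM_PINS];
static uint8_t pulse_count[NUM_PINS];

static void gpio_write(unsigned pin, uint8_t level)
{
    assert(pin < NUM_PINS);
    if (pin_level[pin] == 0u && level == 1u) {
        pulse_count[pin]++; /* rising edge: reset released */
    }
    pin_level[pin] = level;
}

/* Hypothetical wiring: the reset lines of two external peripherals. */
static const unsigned peripheral_reset_pins[] = { 2u, 5u };

void reset_all_peripherals(void)
{
    size_t count = sizeof peripheral_reset_pins
                 / sizeof peripheral_reset_pins[0];

    for (size_t i = 0; i < count; i++) {
        gpio_write(peripheral_reset_pins[i], 0u); /* assert reset */
        /* Real hardware: wait out the minimum reset pulse width here. */
        gpio_write(peripheral_reset_pins[i], 1u); /* release reset */
    }
}

/* Called early during boot. Returns true if the peripherals were reset. */
bool boot_restore_known_state(uint32_t reset_cause)
{
    if ((reset_cause & RESET_CAUSE_WATCHDOG) != 0u) {
        reset_all_peripherals();
        return true;
    }
    return false;
}
```

This sketch only resets peripherals after a watchdog reset. Calling reset_all_peripherals() unconditionally on every boot is the simpler and more conservative choice when the boot time allows it.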
This anti-pattern is surprisingly common for a problem that is actually well-known. Ensuring that all devices start in a well-known state is not exactly arcane knowledge. Plus, the reasoning that goes into the careful design of the reset trees is easy to transplant to the watchdog reset flow. Nine out of ten times, the engineers who end up dealing with this anti-pattern are baffled not by the error itself, but by the fact that they didn't catch it in the first place.
How does it happen? Most of the cases I've seen are redesigns of old products that accumulate additional complexity through newer, more complex components. The additional failure modes that these components introduce sometimes go unnoticed, especially in industries that are not tightly regulated, where a formal risk analysis is not part of the design procedures.
Software Watchdogs
It's hard to put this in more diplomatic terms: there is no such thing as a software watchdog.
There are systems where it makes sense to supervise running processes, and restart the ones that crash, or even restart all processes from scratch. That's a common idiom for process supervisors and it's a perfectly valid design solution. Such a solution is readily available on many systems: systemd, for example, which is commonly used in embedded Linux systems, can do this with very little effort.
(And, to avoid flame wars, I will quickly point out that other init systems allow you to do this as well.)
But this is not a watchdog. Process supervision is useful, and can be used in conjunction with a hardware watchdog. But it is no substitute for one, and should not be used as such. Designs that rely on a process supervisor to reset the system and bring it back to a known good state in case of failure are simply incorrect.
A process supervisor is a process, just like the ones that it's supposed to supervise. It can crash, too. Worse yet, it cannot protect against failures that occur at lower levels of the software stack, like a kernel panic on a Linux system.
That's not to say process supervisors are useless, of course. For example, in multi-user systems, restarting a non-critical process that crashed may be sufficient for recovery, and is much cheaper than a full system reset. This can be implemented via a process supervisor, or in a multi-stage watchdog scheme. A process supervisor is also a good delegate for feeding the watchdog timer in a multi-process environment, since it already watches all other processes and derives information about their state.
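The "delegate" role can be sketched as follows: the supervisor feeds the hardware watchdog only after every supervised task has reported in since the previous feed. This is a deliberately simplified, single-address-space illustration. The names (task_checkin(), supervisor_poll(), NUM_TASKS) are hypothetical, the hardware feed is simulated with a counter, and in a real multi-process system the check-ins would arrive over some form of IPC and the shared mask would need atomic access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u
#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u)

/* Simulated hardware watchdog: a real system would write to the
 * watchdog peripheral or device node instead of bumping a counter. */
static uint32_t wdt_feed_count;
static void wdt_hw_feed(void) { wdt_feed_count++; }

/* One bit per supervised task, set when that task reports in.
 * Not safe for concurrent access as written. */
static uint32_t checkin_mask;

/* Called by each task whenever it completes a unit of useful work. */
void task_checkin(unsigned task_id)
{
    assert(task_id < NUM_TASKS);
    checkin_mask |= 1u << task_id;
}

/* Run periodically by the supervisor. Feeds the watchdog only when
 * every task has checked in since the last feed, and reports whether
 * it did so. */
bool supervisor_poll(void)
{
    if (checkin_mask != ALL_TASKS_MASK) {
        return false; /* at least one task is silent: do not feed */
    }
    checkin_mask = 0;
    wdt_hw_feed();
    return true;
}
```

The hardware watchdog remains the final authority here. If the supervisor itself crashes, or a single task stops reporting, the feeds stop and the hardware reset follows.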
Process supervision is, therefore, an excellent complement to - but not a replacement for - a hardware watchdog.
Process supervision is also a good option in systems where failure is, in fact, acceptable, and can be handled through some other means. Such systems are rare in the embedded world, but this is a fairly common idiom for cloud and distributed applications. In systems that don't need a watchdog timer in the first place, a process supervisor can offer some (but not all) of the reliability guarantees of a hardware watchdog, with far less design effort.
Recommendations and Conclusions
The six anti-patterns that we have covered are some of the most common design errors related to watchdog timers, and the ones I see most frequently. Design patterns (and anti-patterns), though, are nothing more than that: formal(-ish) references, good aids to reasoning, but not absolute references.
These patterns are easy to recognize during a design review, but you cannot review a design just by going through a list of ways to get something wrong and checking if anything matches. Instead, let us look at the more fundamental framework of the design errors that we've covered so far.
One way or another, all of these designs fail to tick one or more of the boxes that any good watchdog-based fail-safe scheme should tick:
1. The watchdog should be triggered when the device's essential functionality is lost or when its functioning becomes unpredictable, and should not be triggered while the device runs within parameters.
2. The watchdog's own functioning should be predictable and, once enabled, should no longer depend on correct software behavior.
3. If the watchdog triggers a system reset, the entire system should be brought back to a well-known good state.
4. The watchdog should trigger a system reset before operating under unpredictable conditions causes damage or unacceptable loss of function.
Some of these anti-patterns fail in straightforward ways. The one I called "fire and forget" obviously fails to tick box #1. And "software" watchdogs obviously fail to tick box #2. Others don't inherently break one of these recommendations, but make it difficult to ensure compliance. For instance, "watchdog sprinkling" makes it hard to ensure #1, and "timeout auctioning" makes it hard to give strong guarantees about both #1 (especially about the absence of false positives) and #4.
But besides their technical aspect, many of these anti-patterns have a human dimension as well. And I think that dimension is too important to dismiss as "office politics" or "bikeshedding in meetings". There is great value in understanding our limits, both in the practice of engineering and in its management. At the end of the day, if we were infallible, the machines we make would be infallible as well. In many ways, reliability engineering consists of managing our failures as much as it consists of managing those of our machines.