Kicking the dog -- how do you use watchdog timers?

Started by Tim Wescott May 9, 2016
Randy Yates recently started a thread on programming flash that had an 
interesting tangent into watchdog timers.  I thought it was interesting 
enough that I'm starting a thread here.

I had stated in Randy's thread that I avoid watchdogs, because they 
mostly seem to be a source of erroneous behavior to me.

However, on reflection I realized that I lied: I _do_ use watchdog 
timers, but not automatically.  To date I've only used them when the 
processor is spinning a motor that might crash into something or 
otherwise engage in damaging behavior if the processor goes nuts.  

In general, my rule on watchdogs, as with any other feature, is "use it 
if using it is better", which means that I think about the consequences 
of the thing popping off when I don't want it to (as during a code update 
or during development when I hit a breakpoint) vs. the consequences of 
not having the thing when the processor goes haywire.

Furthermore, if I use a watchdog I don't just treat updating the thing as 
a requirement check-box -- so you won't find a timer ISR in my code that 
unconditionally kicks the dog.  Instead, I'll usually have just one task 
(the motor control one, on most of my stuff) kick the dog when it feels 
it's operating correctly.  If I've got more than one critical task (i.e., 
if I'm running more than one motor out of one processor) I'll have a low-
priority built-in-test task that kicks the dog, but only if it's getting 
periodic assurances of health from the (multiple) critical tasks.
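A minimal sketch of that aggregator pattern in C: each critical task posts a health report, and one low-priority built-in-test task is the only thing that touches the hardware.  All the names here (kick_hw_watchdog, NUM_CRITICAL_TASKS) are invented for illustration -- the real kick is an MCU-specific register write, stubbed out below as a counter so the logic is visible:

```c
#include <stdbool.h>

#define NUM_CRITICAL_TASKS 2    /* e.g., one per motor */

static volatile bool task_alive[NUM_CRITICAL_TASKS];
static int hw_kicks;            /* counts kicks; stands in for real hardware */

/* Stand-in for the MCU-specific register write that restarts the
 * hardware watchdog.  (Hypothetical name -- use your part's HAL.) */
static void kick_hw_watchdog(void)
{
    hw_kicks++;
}

/* Each critical task calls this when it believes it is healthy. */
void report_alive(int task_id)
{
    task_alive[task_id] = true;
}

/* Run periodically from the low-priority built-in-test task.  Kicks
 * the dog only if EVERY critical task has checked in since the last
 * poll; returns whether the dog was kicked. */
bool bit_task_poll(void)
{
    for (int i = 0; i < NUM_CRITICAL_TASKS; i++) {
        if (!task_alive[i])
            return false;       /* someone missed -- let the dog bite */
    }
    for (int i = 0; i < NUM_CRITICAL_TASKS; i++)
        task_alive[i] = false;  /* demand fresh reports next period */
    kick_hw_watchdog();
    return true;
}
```

Note that the flags are cleared after each successful poll, so a task that wedges after reporting once can't coast on a stale report.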

Generally, in my systems, the result of the watchdog timer popping off is 
that the system will no longer work quite correctly, but it will operate 
safely.

So -- what do you do with watchdogs, and how, and why?  Always use 'em?  
Never use 'em?  Use 'em because the boss says so, but twiddle them in a 
"last part to break" bit of code?

Would you use a watchdog in a fly-by-wire system?  A pacemaker?  Why?  
Why not?  Could you justify _not_ using a watchdog in the top-level 
processor of a Mars rover or a satellite?

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Tim Wescott <seemywebsite@myfooter.really> writes:
> To date I've only used them when the processor is spinning a motor
> that might crash into something or otherwise engage in damaging
> behavior if the processor goes nuts.
I haven't done anything with motors, but have used watchdogs in comm 
gear that typically sits in a customer's closet somewhere.  The concern 
is less about the code going nuts and damaging something than about the 
box getting wedged somehow so that it stops working.

Typically a customer with a wedged box would call the help desk, and the 
support rep would first tell them to try going to the box and power 
cycling it.  The watchdog does essentially the same thing, without the 
customer getting involved.  At worst they experience a short service 
outage and hopefully shrug it off.  But usually it would happen without 
them even noticing.
> Would you use a watchdog in a fly-by-wire system? A pacemaker? Why?
Yes.  The idea again is not to prevent something from going nuts, but to 
restore the system from a wedged or broken state to a known good state 
if it gets in trouble somehow.

That's actually part of "official" Erlang programming methodology: to 
not engage in recovering from bad inputs or other sorts of defensive 
programming.  If the system passed QA and still managed to reach some 
scroggled state where unexpected input reached your little function deep 
in the weeds, who knows what else is borked that got it like that?  The 
Erlang motto is "let it crash", which means allow unexpected errors to 
unceremoniously blow away your whole process.  The Erlang supervision 
system sees the crash and restarts/reinitializes the process so it's 
again in a known good state, and that (reliability studies show) usually 
fixes things.

The above might sound cavalier, but Erlang was developed for serious 
high-end telecom gear that's supposed to keep running for decades at a 
time, purports to have nine 9's of uptime, allows code upgrades without 
interrupting any phone calls, etc.

Another Erlang saying is that a reliable system must be distributed: if 
it runs on only one CPU, the power cord is a single point of failure.  
So yeah, if a CPU fails, the supervision system sees that too, and 
transfers control to another one.
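The restart-to-known-good-state idea can be sketched in plain C, with a function call standing in for an Erlang process and a struct standing in for its state.  This is only an illustration of the supervision concept, not how OTP supervisors are actually implemented:

```c
#include <stdbool.h>

/* The task's whole state; init_state() returns it to "known good". */
struct task_state {
    int  counter;
    bool initialized;
};

static void init_state(struct task_state *s)
{
    s->counter = 0;
    s->initialized = true;
}

typedef int (*task_fn)(struct task_state *);  /* 0 = clean exit, else crash */

/* Supervise a task: on a crash, reinitialize to known good state and
 * restart it, up to max_restarts times.  Returns the number of
 * restarts used, or -1 if we gave up (Erlang would escalate to a
 * higher-level supervisor at that point). */
int supervise(task_fn task, struct task_state *s, int max_restarts)
{
    for (int restarts = 0; restarts <= max_restarts; restarts++) {
        init_state(s);              /* fresh, known good state each run */
        if (task(s) == 0)
            return restarts;
    }
    return -1;
}

/* Demo worker: "crashes" on its first two runs, then exits cleanly. */
static int demo_attempts;
int flaky_task(struct task_state *s)
{
    (void)s;
    return ++demo_attempts < 3;     /* nonzero = crash */
}
```

The key property is that the worker never tries to repair its own scroggled state; the supervisor simply throws that state away and starts over.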
> Could you justify _not_ using a watchdog in the top-level processor of
> a Mars rover or a satellite?
I'd include a watchdog.  Here's an Erlang programmer's take on the 
Curiosity Mars rover: 
http://jlouisramblings.blogspot.com/2012/08/getting-25-megalines-of-code-to-behave.html 
It doesn't mention watchdogs but draws interesting parallels (and 
differences) between Erlang philosophy and NASA philosophy.
On 5/9/2016 1:06 PM, Tim Wescott wrote:
> [original post snipped]
>
> So -- what do you do with watchdogs, and how, and why?  Always use 'em?
> Never use 'em?  Use 'em because the boss says so, but twiddle them in a
> "last part to break" bit of code?
Watchdog timers are not often used in FPGAs.  I guess that's because 
processes in HDL seldom get stuck or lost in the weeds.  ;)

When I did design a software project, we had multiple tasks each kicking 
another task, which would track what was going on and "pet" the watchdog 
to keep it from barking.  The various tasks had periods of "interest" 
different from the watchdog timeout, so this one task dealt with the 
appropriate time period of each of the tasks being watched.  Only this 
task needed to actually deal with the watchdog period.

-- 

Rick C
rickman wrote:

> On 5/9/2016 1:06 PM, Tim Wescott wrote:
>> [original post snipped]
>
> Watchdog timers are not often used in FPGAs.  I guess that's because
> processes in HDL seldom get stuck or lost in the weeds.  ;)
>
> When I did design a software project we had multiple tasks each kicking
> another task which would track what was going on and "pet" the watch dog
> to keep it from barking.  The various tasks had periods of "interest"
> different from the watch dog timeout, so this process dealt with the
> appropriate time period of each of the tasks being watched.  Only this
> task needed to actually deal with the watch dog period.
I'd say the FPGA equivalent to a watchdog is integrity checking 
hardware, like ECC RAM, state machines with explicit invalid state 
checking, all the way up to triple-modular redundancy.  I've never 
needed any of that nonsense because everything I work on remains 
pleasantly surrounded by atmosphere, but it's definitely out there.

-- 

Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.
On 5/9/2016 10:06 AM, Tim Wescott wrote:

> In general, my rule on watchdogs, as with any other feature, is "use it
> if using it is better", which means that I think about the consequences
> of the thing popping off when I don't want it to (as during a code update
> or during development when I hit a breakpoint) vs. the consequences of
> not having the thing when the processor goes haywire.
The problem is that you don't often know at design time which (if any) 
failures will require this sort of protection.  Even "bug free" code can 
reside in a system that experiences hardware faults (power supply 
fluctuations, input latchup, etc.).  So, do you try to bolt this 
capability on after the fact?  Or design it in from the start (hoping 
not to need it)?
> Furthermore, if I use a watchdog I don't just treat updating the thing as
> a requirement check-box -- so you won't find a timer ISR in my code that
> unconditionally kicks the dog.  Instead, I'll usually have just one task
> (the motor control one, on most of my stuff) kick the dog when it feels
> it's operating correctly.  If I've got more than one critical task (i.e.,
> if I'm running more than one motor out of one processor) I'll have a low-
> priority built-in-test task that kicks the dog, but only if it's getting
> periodic assurances of health from the (multiple) critical tasks.
There's no hard and fast rule for how you should implement a watchdog.  
It's a component in your system, just like any other component.

Putting the stroking of the watchdog in the idle task can leave your 
system vulnerable to any sort of momentary overload; or, necessitate an 
unduly long timeout (to accommodate short overloads).  Putting it in an 
ISR is almost always silly -- for obvious reasons.

OTOH, I currently use the software equivalent of that mechanism by 
having my "watchdog monitor" run as a very HIGH priority task!  But, one 
that spends most of its life blocking awaiting "sanity messages" from 
the various tasks that are trying to stroke this *virtual* watchdog.

Putting all of the watchdog (hardware) interface in one task allows a 
more consistent -- and discerning -- implementation.

First, it ensures any such activities will get logged!  If you've got 
lots of independent/autonomous tasks stroking the watchdog, you never 
know which one FORGOT to do so.  As a result, you can't recover (post 
mortem) when the device comes out of reset.

Second, it allows the "stroking" to be smarter and more demonstrable of 
sentience on the part of the individual "strokers".  I.e., instead of 
just twiddling a bit, you can engage the other party in a dialog and 
place further constraints on it to verify its sanity.  ("Why are you 
sending me these keep-alive messages at such an alarming rate?  I was 
only EXPECTING to receive them from you at a more modest rate.  Perhaps 
something has gone wrong in your implementation or process state??")

Third, it allows for tasks to *request* a watchdog intervention!  
("OhMiGosh!  The motor is ignoring my commands to turn off!  Somebody 
pull the plug -- NOW!!!!")  And, this can be logged for post mortem.
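The second point -- checking that keep-alives arrive at a *plausible* rate, not merely that they arrive -- might look like this in the monitor task.  The period limits and names are invented for the sketch:

```c
/* Verdict the monitor reaches each time a task's keep-alive arrives. */
enum stroke_verdict { STROKE_OK, STROKE_TOO_FAST, STROKE_TOO_SLOW };

/* Expected keep-alive period with tolerance, in milliseconds
 * (illustrative numbers -- tune per task). */
#define PERIOD_MIN_MS  80
#define PERIOD_MAX_MS 120

/* 'now_ms' is the current time; 'last_ms' is that task's previous
 * stroke.  Unsigned subtraction handles timer wraparound. */
enum stroke_verdict check_stroke(unsigned now_ms, unsigned last_ms)
{
    unsigned delta = now_ms - last_ms;
    if (delta < PERIOD_MIN_MS)
        return STROKE_TOO_FAST;   /* suspicious: runaway loop? */
    if (delta > PERIOD_MAX_MS)
        return STROKE_TOO_SLOW;   /* missed deadline: log and escalate */
    return STROKE_OK;
}
```

A stroker arriving too *fast* is just as much evidence of insanity as one arriving too slow, which is exactly the dialog described above.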
> Generally, in my systems, the result of the watchdog timer popping off is
> that the system will no longer work quite correctly, but it will operate
> safely.
>
> So -- what do you do with watchdogs, and how, and why?  Always use 'em?
> Never use 'em?  Use 'em because the boss says so, but twiddle them in a
> "last part to break" bit of code?
(sigh)  I have a lengthy paper/tutorial I wrote many years ago on the 
subject, as I'd had the "argument" with clients many times over the 
years.  People seem to have a naive concept of what watchdogs (sentries) 
can and can't do -- as well as when they are indicated vs. 
contraindicated.

[One of these days, I'll set up a web site and push all these documents 
out there.  But, far more interesting things to do with the few hours 
present in each day :-/ ]

Watchdogs take many forms -- hardware and software.  A process that 
deliberately KILL's processes that it suspects of being corrupt is just 
as much a watchdog as a piece of hardware that tugs on /RESET.  
Communication happens both in-band and out-of-band.  The former, of 
course, tends to rely on "some (software)" remaining operational.  The 
latter works around it.

A watchdog plays a LAGGING role in a system (it "happens" AFTER 
something has already gone wrong) as well as a LEADING role (it informs 
the user/environment of a potential "more significant" failure that 
hasn't percolated through the "system", yet!).  This role should not be 
glossed over.  INFORMATION IS CONVEYED by these mechanisms.  Simply 
ignoring that information (i.e., letting the device reset itself) is 
usually not a very good idea.

[Consider what happens when you have a device that is eager to start up 
quickly.  If the device has incurred a watchdog upset, everything 
appears to shut down, unceremoniously.  Then, as the device starts up 
again, it rushes to get everything running again -- just in time for it 
to be (possibly) shut down by the same, persistent failure retriggering 
the watchdog overrun.  SOMETHING wants to be able to detect when a 
watchdog event has occurred and adjust the RESTART procedure (different 
from the START procedure) accordingly.]
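That bracketed point -- choosing a RESTART procedure distinct from the START procedure -- reduces to a small decision at boot.  On real parts the reset cause comes from an MCU-specific status register; in this sketch it's a plain parameter so the logic stays visible, and the enum names are invented:

```c
/* Why are we booting?  On a real MCU this comes from a reset-status
 * register read very early in startup. */
enum reset_cause { RESET_POWER_ON, RESET_WATCHDOG, RESET_EXTERNAL };

enum boot_path   { BOOT_NORMAL_START, BOOT_CAUTIOUS_RESTART };

/* After a watchdog bite, take the cautious path: log the event, hold
 * outputs in a safe state, rate-limit further restarts -- instead of
 * racing back to full operation and getting bitten again. */
enum boot_path choose_boot_path(enum reset_cause cause)
{
    if (cause == RESET_WATCHDOG)
        return BOOT_CAUTIOUS_RESTART;
    return BOOT_NORMAL_START;
}
```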
I'm currently working on ways to signal remote devices when a watchdog 
event has been triggered in some OTHER remote device, without relying on 
in-band signalling (if the device is misbehaving, how do I know it will 
be ABLE to inform others that it has just been watchdogged?).  The point 
being that those other devices can adjust to this INFORMATION -- instead 
of wondering, AFTER SOME ARTIFICIAL DELAY, why some service/capability 
(in which the failed node played a part) isn't working properly.
> Would you use a watchdog in a fly-by-wire system?  A pacemaker?  Why?
> Why not?  Could you justify _not_ using a watchdog in the top-level
> processor of a Mars rover or a satellite?
What's the reliability of each system and PROTECTION system? I'd surely not want a watchdog on a Mars rover that resets more frequently than the round trip radio delay to its earth station! (some hand-waving, there, but the point should be obvious)
On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:

> On 5/9/2016 1:06 PM, Tim Wescott wrote:
>> [original post snipped]
>
> Watchdog timers are not often used in FPGAs.  I guess that's because
> processes in HDL seldom get stuck or lost in the weeds.  ;)
I've spent lab time next to unhappily cursing FPGA guys (good ones) trying to determine why their state machines have wedged. So I'm not sure that's an entirely accurate statement.
> When I did design a software project we had multiple tasks each kicking
> another task which would track what was going on and "pet" the watch dog
> to keep it from barking.  The various tasks had periods of "interest"
> different from the watch dog timeout, so this process dealt with the
> appropriate time period of each of the tasks being watched.  Only this
> task needed to actually deal with the watch dog period.
That's more or less what I do if I need to keep watch on multiple tasks.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 5/9/2016 5:13 PM, Tim Wescott wrote:
> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>> On 5/9/2016 1:06 PM, Tim Wescott wrote:
>>> [original post snipped]
>>
>> Watchdog timers are not often used in FPGAs.  I guess that's because
>> processes in HDL seldom get stuck or lost in the weeds.  ;)
>
> I've spent lab time next to unhappily cursing FPGA guys (good ones)
> trying to determine why their state machines have wedged.
>
> So I'm not sure that's an entirely accurate statement.
Ask them why their FSMs got stuck.  In development they may make 
mistakes, but you don't use watchdogs for debugging.  In fact, they get 
in the way.  I've never had a FSM failure in the field, but I suppose 
there is a first time.  I did say "seldom", not never.

A FSM in an FPGA is a separate entity.  No other process in the FPGA can 
step on its memory or cause it to miss a deadline.  CPUs are shared, 
which hugely complicates multi-process designs in all aspects.  You just 
don't have that in an FPGA.  By comparison, FPGAs are simple.  But maybe 
I've just not worked on an FPGA design that was complicated enough to 
compare to what the software guys do...
>> When I did design a software project we had multiple tasks each kicking
>> another task which would track what was going on and "pet" the watch dog
>> to keep it from barking.  The various tasks had periods of "interest"
>> different from the watch dog timeout, so this process dealt with the
>> appropriate time period of each of the tasks being watched.  Only this
>> task needed to actually deal with the watch dog period.
>
> That's more or less what I do if I need to keep watch on multiple tasks.
-- 

Rick C
On 09/05/16 19:06, Tim Wescott wrote:
> [original post snipped]
>
> Would you use a watchdog in a fly-by-wire system?  A pacemaker?  Why?
> Why not?  Could you justify _not_ using a watchdog in the top-level
> processor of a Mars rover or a satellite?
>
Quoting Tim Williams' book: "The most cost-effective way to ensure the 
reliability of a microprocessor-based product is to accept that the 
program (or data, or both -- my addition) *will* occasionally be 
corrupted, and to provide a means whereby the program flow can be 
automatically recovered, preferably transparently to the user.  This is 
the function of the microprocessor watchdog."

So, the whole thing is what to do "when" (not "if") shit (the 
unexpected) happens.

Pere
On 5/10/2016 9:36 AM, o pere o wrote:
> Quoting Tim Williams' book "The most cost-effective way to ensure the
> reliability of a microprocessor-based product is to accept that the
> program (or data or both, my addition) *will* occasionally be corrupted,
> and to provide a means whereby the program flow can be automatically
> recovered, preferably transparently to the user.  This is the function of
> the microprocessor watchdog."
>
> So, the whole thing is what to do "when" (not "if") shit (the
> unexpected) happens.
That's an interesting approach, just give up on making the system 
reliable and instead make it recover from a failure.  You do realize 
that just because Tim Williams said this, it doesn't make it gospel.  
It *is* possible to make programs that work, and in some cases a program 
can be *proven* to work.  But those are rare.

Sure, it's great if you can make your system recover from a catastrophic 
failure.  But there are many systems where that is not remotely a 
solution.  Virtually any real-time control needs to work, and the only 
other solution is to shut it down, preferably safely.  Even that is not 
always possible.

For any system where there is potential for harm to people or even 
equipment (depending on the cost), the best approach is an independent 
monitor that disconnects the errant controller.  In other words, when 
safety is important, a processor watchdog timer may not be adequate.

-- 

Rick C
rickman wrote:

> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>
>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>> trying to determine why their state machines have wedged.
>>
>> So I'm not sure that's an entirely accurate statement.
>
> Ask them why their FSMs got stuck.  In development they may make
> mistakes, but you don't use watchdogs for debugging.  In fact they get
> in the way.
Oh, that's easy.  Because of either:

  An error in the synchronous logic, leaving it in a defined state with 
  no way out (20% chance).

  An unsynchronized async input causing a race condition that static 
  timing couldn't catch (80% chance).

  Or a single event upset (0.0001% chance).

-- 

Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.
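For what it's worth, the first failure mode has a direct software analogue: a switch-based state machine whose default arm traps any state value it doesn't recognize and forces recovery instead of wedging (the HDL equivalent is a "when others"/default arm that routes back to a safe state).  A toy C version, with invented state names:

```c
/* Trivial three-state machine: idle -> run -> done -> idle. */
enum fsm_state { ST_IDLE, ST_RUN, ST_DONE };

/* One FSM step.  An unexpected state value (e.g. corrupted memory or
 * a single event upset) is trapped by the default arm and recovered
 * to a known state, rather than wedging forever. */
enum fsm_state fsm_step(enum fsm_state s, int go)
{
    switch (s) {
    case ST_IDLE: return go ? ST_RUN : ST_IDLE;
    case ST_RUN:  return go ? ST_RUN : ST_DONE;
    case ST_DONE: return ST_IDLE;
    default:      return ST_IDLE;   /* invalid state: recover, don't wedge */
    }
}
```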