
Kicking the dog -- how do you use watchdog timers?

Started by Tim Wescott May 9, 2016
On 11.05.2016 at 08:17, Tim Wescott wrote:
> Randy's point, I think, is that if something is _broken_, a reset isn't
> going to un-break it.
Non-reset is not going to, either.

It's worth trying to distinguish between a run-off-into-the-wild system and a permanently broken one. So trigger a global reset, and see if that makes it work again. If it does, things are better than before. If it doesn't, they're no worse. As problem-handling approaches go, that's a pretty impressive result.

That's what a watchdog ultimately is good for: to distinguish between a SEU and a FUBAR situation.

There's really nothing terminally wrong with having a watchdog. The main risk I see is that it's easy to fall into the trap of thinking of the Dog not as (almost) the last line of defense, but as the first, or even the only one you need. I.e. it's tempting to think: "Nice, now I've got a watchdog, so the rest of the system can be designed without a care."

> A processor reset should also reset the hardware, in much the same way
> that cops should always be honest -- "should" in this case indicates a
> moral requirement, but not, in all companies, a reasonable expectation.

OTOH, just because not all cops are honest, that doesn't make a world without any cops a better place. It can hardly be considered a concept's fault if some people implement it incorrectly. And it may not even be incorrect to leave resetting the rest of the circuit to the micro. There might be some information for the micro to be had from inspecting the state of other parts of the hardware, as left behind by the hosed system state. It all depends.
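Acting on that SEU-vs-FUBAR distinction usually starts at boot: read the reset-cause register and keep a count of consecutive watchdog resets in RAM that the startup code does not clear. A rough sketch of that idea, in which the HAL functions and the ".noinit" section name are invented and the way to reserve uninitialized RAM differs between toolchains:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical HAL calls -- substitute your part's reset-cause register
   and whatever mechanism your toolchain uses for RAM that the startup
   code leaves alone. */
extern bool reset_was_watchdog(void);
extern void enter_safe_degraded_mode(void);

#define WDT_RESET_LIMIT 3u

/* Survives a warm reset because it is kept out of the zero-initialized data. */
static uint32_t wdt_reset_count __attribute__((section(".noinit")));

void check_reset_cause(void)
{
    if (reset_was_watchdog()) {
        wdt_reset_count++;
        if (wdt_reset_count >= WDT_RESET_LIMIT) {
            /* Several watchdog resets in a row: probably not a transient
               upset.  Stop retrying and fall back to a minimal safe mode. */
            enter_safe_degraded_mode();
        }
    } else {
        /* Power-on or external reset: start counting from scratch. */
        wdt_reset_count = 0;
    }
}

One reset that clears on its own looks like a SEU; a string of them in a row looks like FUBAR, and the firmware can react differently to each.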
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
> I.e. it's tempting to think: "Nice, now I've got a watchdog, so the
> rest of the system can be designed without a care."
If you need an absolutely reliable product (medical safety, NASA, or whatever), you have to use ultra high assurance design processes that are not economically competitive in more typical application areas. If you don't use those processes, you aren't designing "without a care", but you're designing with an amount of care chosen through an engineering and business decision, based on how much product failure you're willing to tolerate. If falling back to a WDT is a cheap way to reach your acceptable failure rate, it seems like an ok option.

I worked on a thing a while back whose hardware randomly locked up every few thousand hours of operation. We never figured out why, and decided not to spend excessive resources studying it, given that it was coming due for a total redesign anyway. We had a few hundred of these things in the field, which meant that on average we logged maybe one WDT reset per day across the whole fleet. The application area was not even slightly safety critical and most of the resets were in the middle of the night when the device wasn't in use anyway. There was a slim possibility that a reset at the wrong time could actually inconvenience a customer and we'd get a support call. But AFAIK that never happened. Nobody ever noticed the resets.

I think the above is a typical story. I wasn't involved in the management decision to ship the thing despite the lockups (relying on the WDT), but I can't say that they made a wrong choice. In mathematics we prove things and then expect to be absolutely sure of them, but engineering is different. Most engineering is about making stuff that meets cost constraints and empirically works well enough for the application, and that's what they did.
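As a rough sanity check on those rates (the unit count and lockup interval below are invented round numbers, since the post only says "a few hundred" units and "every few thousand hours"):

    300 units x 24 h/day             = 7200 unit-hours/day
    7200 unit-hours / 7000 h/lockup  = roughly 1 lockup/day

which is consistent with logging about one WDT reset per day across the whole fleet.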
On Thu, 12 May 2016 20:34:19 +0200, Hans-Bernhard Bröker wrote:

> On 11.05.2016 at 08:17, Tim Wescott wrote:
>> Randy's point, I think, is that if something is _broken_, a reset isn't
>> going to un-break it.
>
> Non-reset is not going to, either.
>
> It's worth trying to distinguish between a run-off-into-the-wild system
> and a permanently broken one. So trigger a global reset, and see if
> that makes it work again. If it does, things are better than before. If
> it doesn't, they're no worse. As problem-handling approaches go, that's
> a pretty impressive result.
>
> That's what a watchdog ultimately is good for: to distinguish between a
> SEU and a FUBAR situation.
>
> There's really nothing terminally wrong with having a watchdog. The
> main risk I see is that it's easy to fall into the trap of thinking of
> the Dog not as (almost) the last line of defense, but as the first, or
> even the only one you need. I.e. it's tempting to think: "Nice, now
> I've got a watchdog, so the rest of the system can be designed without a
> care."
>
>> A processor reset should also reset the hardware, in much the same
>> way that cops should always be honest -- "should" in this case
>> indicates a moral requirement, but not, in all companies, a
>> reasonable expectation.
>
> OTOH, just because not all cops are honest, that doesn't make a world
> without any cops a better place.
>
> It can hardly be considered a concept's fault if some people implement
> it incorrectly. And it may not even be incorrect to leave resetting the
> rest of the circuit to the micro. There might be some information for
> the micro to be had from inspecting the state of other parts of the
> hardware, as left behind by the hosed system state. It all depends.
I meant my comment more as an encouragement to look at schematics or ask the hardware designers what reset does. I see your point about possibly letting the micro reset the rest of hardware -- either way, one should not assume things.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:

> On 11.05.2016 at 08:17, Tim Wescott wrote:
>> Randy's point, I think, is that if something is _broken_, a reset isn't
>> going to un-break it.
>
> Non-reset is not going to, either.
In general that is a logical fallacy.

Consider a situation where one section of the code, let's say one thread, hangs because of broken hardware, but other threads are still doing useful work, e.g., transmitting status information up to the cloud.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
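A single hung thread is, incidentally, exactly the case a supervised watchdog kick (the pattern Tim describes in the original post, quoted further down) is meant to catch: rather than kicking the dog from a timer ISR, each critical task reports in and a supervisor refreshes the watchdog only when all of them have. A rough sketch in C, with all names invented for illustration:

#include <stdbool.h>

extern void wdt_kick(void);   /* hypothetical hardware watchdog refresh */

#define NUM_CRITICAL_TASKS 2u

/* Each critical task sets its flag once per cycle when it judges itself healthy. */
static volatile bool task_healthy[NUM_CRITICAL_TASKS];

void task_report_healthy(unsigned id)
{
    if (id < NUM_CRITICAL_TASKS) {
        task_healthy[id] = true;
    }
}

/* Low-priority supervisor, run periodically and well inside the WDT timeout. */
void supervisor_poll(void)
{
    bool all_ok = true;

    for (unsigned i = 0; i < NUM_CRITICAL_TASKS; i++) {
        if (!task_healthy[i]) {
            all_ok = false;
        }
        task_healthy[i] = false;   /* demand a fresh report every cycle */
    }

    if (all_ok) {
        wdt_kick();                /* kick only when every task has checked in */
    }
    /* otherwise: let the dog time out and reset the system */
}

With that structure, one hung thread stops the kicks and forces the reset; whether that trade-off is right depends on whether the still-working threads (the cloud reporting, say) are worth more than a clean restart.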
On 13.5.2016 at 06:09, Randy Yates wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>
>> On 11.05.2016 at 08:17, Tim Wescott wrote:
>>> Randy's point, I think, is that if something is _broken_, a reset isn't
>>> going to un-break it.
>>
>> Non-reset is not going to, either.
>
> In general that is a logical fallacy.
>
> Consider a situation where one section of the code, let's say one
> thread, hangs because of broken hardware, but other threads are still
> doing useful work, e.g., transmitting status information up to the
> cloud.
>
Or the case where a reset causes the screen to reinitialize, and the glimpse you had of something vital to understanding the problem was too short.

Generalizations are almost always wrong, but in general having a dog (and being able to turn it off for situations like the above) is a good thing :) (well, not if it is an organic dog in your backyard yelping all the time, just a silicon sort of dog....). On certain systems it may even be smoke-saving, if reset is hit early enough.

On larger systems, of the complexity of a PC and remotely operated, it can at times eliminate the need for someone to go to the device and reset it... The latter is particularly useful during in situ fine-tuning which involves significant programming (has been for me).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
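Being able to turn the dog off for development or in-situ tuning is often just a build switch around the enable and kick calls. A minimal sketch, with the macro and HAL names invented; many parts can also freeze the watchdog while a debugger has the core halted, which serves the same purpose:

#include <stdint.h>

/* Hypothetical watchdog HAL; names invented for the sketch. */
extern void wdt_enable(uint32_t timeout_ms);
extern void wdt_kick(void);

void watchdog_init(void)
{
#if defined(DEBUG_BUILD)
    /* Development/tuning image: leave the dog off so breakpoints,
       single-stepping and long flash sessions don't cause resets. */
#else
    wdt_enable(2000u);   /* production image: 2 s timeout, for example */
#endif
}

void watchdog_kick(void)
{
#if !defined(DEBUG_BUILD)
    wdt_kick();
#endif
}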
On 13.05.2016 at 05:09, Randy Yates wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>> On 11.05.2016 at 08:17, Tim Wescott wrote:
>>> Randy's point, I think, is that if something is _broken_, a reset isn't
>>> going to un-break it.
>> Non-reset is not going to, either.
> In general that is a logical fallacy.
So a _broken_ system will be fixed by not doing a reset; really?
> Consider a situation where one section of the code, let's say one
> thread, hangs because of broken hardware, but other threads are still
> doing useful work, e.g., transmitting status information up to the
> cloud.
Then that won't be fixed by not doing a reset. Just like I said.
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>>>> if something is _broken_, a reset isn't going to un-break it.
>>> Non-reset is not going to, either.
>> In general that is a logical fallacy.
> So a _broken_ system will be fixed by not doing a reset; really?
The fallacy is the implication "reset won't fix something broken" => "reset is not worth attempting", which comes from the erroneous concept that something broken is unusable. In fact lots of brokenness takes the form of the device freezing up once in a while, when it's supposed to keep working. Resetting won't un-break the device: it's still a broken device that will freeze up again eventually. But if resetting clears the immediate symptom (the freeze-up) so you can keep using it, that might be good enough for your purposes.

Anyone who deals with technology products in the real world is used to this. My DSL modem freezes up every few months and I have to reset it manually since there's no WDT. This is a known problem with these modems. Resetting is a minor nuisance if I'm at home, but potentially a big headache if I want my home computer to stay online so I can connect to it while travelling. Consider these two solutions:

1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3 years: the vendor replaces it under warranty if it freezes in that period.

2) Add a WDT to the existing broken modem, i.e. it will still freeze now and then, but it self-resets in the event of a freeze.

I think #1 is an actual "fix", but #2 is more robust in practice. If I really cared about remote access, I'd want a WDT even with a new modem. If I had to pick between the two, I'd choose the WDT over improving the modem's underlying reliability.

There's only so far you can go in trying to make hardware failure-proof. Even NASA gives up at a certain point. They make stuff as reliable as they can, but then they deal with residual unreliability by adding backup hardware in case the primary still fails.
On 13.05.2016 at 21:10, Paul Rubin wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>>>>> if something is _broken_, a reset isn't going to un-break it.
>>>> Non-reset is not going to, either.
>>> In general that is a logical fallacy.
>> So a _broken_ system will be fixed by not doing a reset; really?
> The fallacy is the implication "reset won't fix something broken" =>
> "reset is not worth attempting", which comes from the erroneous concept
> that something broken is unusable.
I did not make that implication, so I have to object to this criticism being applied to my posts.
> 1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3
> years: the vendor replaces it under warranty if it freezes in that
> period.
>
> 2) Add a WDT to the existing broken modem, i.e. it will still freeze
> now and then, but it self-resets in the event of a freeze.
>
> I think #1 is an actual "fix",
It's not, for the use case you described, because you won't be home to let them in, so they won't be able to exchange it. And if you were home, there's no way they'll be there with the exchange device faster than you can reach the existing device's reset button (or power plug).
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>> The fallacy is the implication "reset won't fix something broken" =>
>> "reset is not worth attempting"
> I did not make that implication, so I have to object to this criticism
> being applied to my posts.
Well I don't understand what you were getting at then.
>> I think #1 is an actual "fix",
> It's not, for the use case you described,
Of course it's a fix. It changes a deployment of broken equipment into one of non-broken equipment. How can that be anything other than a fix? What else can it mean to fix something? The issue is that it's hardware, not mathematics. Just because it's not broken today doesn't mean it will never break. Therefore being able to mitigate potential failure is still important, maybe even more important than being able to fix existing actual failure.
In article <_-udncmXlPgeXq3KnZ2dnUU7-eOdnZ2d@giganews.com>, seemywebsite@myfooter.really says...
> Randy Yates recently started a thread on programming flash that had an
> interesting tangent into watchdog timers. I thought it was interesting
> enough that I'm starting a thread here.
>
> I had stated in Randy's thread that I avoid watchdogs, because they
> mostly seem to be a source of erroneous behavior to me.
>
> However, on reflection I realized that I lied: I _do_ use watchdog
> timers, but not automatically. To date I've only used them when the
> processor is spinning a motor that might crash into something or
> otherwise engage in damaging behavior if the processor goes nuts.
>
> In general, my rule on watchdogs, as with any other feature, is "use it
> if using it is better", which means that I think about the consequences
> of the thing popping off when I don't want it to (as during a code update
> or during development when I hit a breakpoint) vs. the consequences of
> not having the thing when the processor goes haywire.
>
> Furthermore, if I use a watchdog I don't just treat updating the thing as
> a requirement check-box -- so you won't find a timer ISR in my code that
> unconditionally kicks the dog. Instead, I'll usually have just one task
> (the motor control one, on most of my stuff) kick the dog when it feels
> it's operating correctly. If I've got more than one critical task (i.e.,
> if I'm running more than one motor out of one processor) I'll have a low-
> priority built-in-test task that kicks the dog, but only if it's getting
> periodic assurances of health from the (multiple) critical tasks.
>
> Generally, in my systems, the result of the watchdog timer popping off is
> that the system will no longer work quite correctly, but it will operate
> safely.
>
> So -- what do you do with watchdogs, and how, and why? Always use 'em?
> Never use 'em? Use 'em because the boss says so, but twiddle them in a
> "last part to break" bit of code?
>
> Would you use a watchdog in a fly-by-wire system? A pacemaker? Why?
> Why not? Could you justify _not_ using a watchdog in the top-level
> processor of a Mars rover or a satellite?
As you said, use them when they are needed, and that's what I do. Except with me their use is the rule and not the exception. Most of my systems have to run unattended for years on end and there is little chance that a person will be able to cycle the power or press a reset button.

That being said, I tend to use rather long time-out periods, so I don't get bitten on the butt by a WDT that is always on the verge of triggering, and if the WDT does expire, something has really gone awry. Also like you, I do have a couple of motor control systems that are a bit more safety critical, for which I use faster time-out periods, mainly to make sure that if they fail, the system can attempt to place itself in as safe a condition as possible.

I think a watchdog makes sense in any system that is far from home, like your rover example, or one where incorrect operation may be more dangerous than the system running off into la-la land.
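One way to get both behaviors (a generous timeout for the unattended case, a fast one plus a safe-state action for the motor case) is to let the build select the timeout and, on parts that have one, use the watchdog's early-warning interrupt to park the hardware just before the reset. A sketch with all HAL and build-flag names invented; check your part for the real mechanism:

#include <stdint.h>

/* Hypothetical HAL calls; names invented for the sketch. */
extern void wdt_enable(uint32_t timeout_ms);
extern void wdt_enable_early_warning_irq(uint32_t ms_before_reset);
extern void motor_disable_outputs(void);
extern void brake_engage(void);

void watchdog_setup(void)
{
#if defined(MOTOR_CONTROL_BUILD)
    /* Safety-relevant build: short timeout plus a warning interrupt,
       so the drive can be parked before the reset actually happens. */
    wdt_enable(250u);
    wdt_enable_early_warning_irq(50u);
#else
    /* Unattended box far from home: generous timeout so only a real
       hang trips it, never a WDT always on the verge of triggering. */
    wdt_enable(10000u);
#endif
}

/* Fires shortly before the watchdog reset, on parts that support it. */
void wdt_early_warning_isr(void)
{
    motor_disable_outputs();   /* stop driving the bridge */
    brake_engage();            /* leave the mechanics in a safe state */
    /* deliberately no watchdog refresh here -- let the reset proceed */
}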
