
Kicking the dog -- how do you use watchdog timers?

Started by Tim Wescott May 9, 2016
On 11.05.2016 at 08:17, Tim Wescott wrote:
> Randy's point, I think, is that if something is _broken_, a reset isn't
> going to un-break it.
Non-reset is not going to, either.

It's worth trying to distinguish between a run-off-into-the-wild system and a permanently broken one. So trigger a global reset, and see if that makes it work again. If it does, things are better than before. If it doesn't, they're no worse. As problem-handling approaches go, that's a pretty impressive result.

That's what a watchdog ultimately is good for: to distinguish between a SEU and a FUBAR situation.

There's really nothing terminally wrong with having a watchdog. The main risk I see is that it's easy to fall into the trap of thinking of the Dog not as (almost) the last line of defense, but as the first, or even the only one you need. I.e. it's tempting to think: "Nice, now I've got a watchdog, so the rest of the system can be designed without a care."

> A processor reset should also reset the hardware, in much the same way
> that cops should always be honest -- "should" in this case indicates a
> moral requirement, but not, in all companies, a reasonable expectation.

OTOH, just because not all cops are honest, that doesn't make a world without any cops a better place. It can hardly be considered a concept's fault if some people implement it incorrectly. And it may not even be incorrect to leave resetting the rest of the circuit to the micro. There might be some information for the micro to be had from inspecting the state of other parts of the hardware, as left behind by the hosed system state. It all depends.
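Acting on that SEU-vs-FUBAR distinction usually starts at boot: read the reset-cause register and keep a count of consecutive watchdog resets in RAM that the startup code does not clear. A rough sketch of that idea, in which the HAL functions and the ".noinit" section name are invented and the way to reserve uninitialized RAM differs between toolchains:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical HAL calls -- substitute your part's reset-cause register
   and whatever mechanism your toolchain uses for RAM that the startup
   code leaves alone. */
extern bool reset_was_watchdog(void);
extern void enter_safe_degraded_mode(void);

#define WDT_RESET_LIMIT 3u

/* Survives a warm reset because it is kept out of the zero-initialized data. */
static uint32_t wdt_reset_count __attribute__((section(".noinit")));

void check_reset_cause(void)
{
    if (reset_was_watchdog()) {
        wdt_reset_count++;
        if (wdt_reset_count >= WDT_RESET_LIMIT) {
            /* Several watchdog resets in a row: probably not a transient
               upset.  Stop retrying and fall back to a minimal safe mode. */
            enter_safe_degraded_mode();
        }
    } else {
        /* Power-on or external reset: start counting from scratch. */
        wdt_reset_count = 0;
    }
}

One reset that clears on its own looks like a SEU; a string of them in a row looks like FUBAR, and the firmware can react differently to each.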
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
> I.e. it's tempting to think: "Nice, now I've got a watchdog, so the
> rest of the system can be designed without a care."
If you need an absolutely reliable product (medical safety, NASA, or whatever), you have to use ultra high assurance design processes that are not economically competitive in more typical application areas. If you don't use those processes, you aren't designing "without a care", but you're designing with an amount of care chosen through an engineering and business decision, based on how much product failure you're willing to tolerate. If falling back to a WDT is a cheap way to reach your acceptable failure rate, it seems like an ok option.

I worked on a thing a while back whose hardware randomly locked up every few thousand hours of operation. We never figured out why, and decided not to spend excessive resources studying it, given that it was coming due for a total redesign anyway. We had a few hundred of these things in the field, which meant that on average we logged maybe one WDT reset per day across the whole fleet. The application area was not even slightly safety critical and most of the resets were in the middle of the night when the device wasn't in use anyway. There was a slim possibility that a reset at the wrong time could actually inconvenience a customer and we'd get a support call. But AFAIK that never happened. Nobody ever noticed the resets.

I think the above is a typical story. I wasn't involved in the management decision to ship the thing despite the lockups (relying on the WDT), but I can't say that they made a wrong choice. In mathematics we prove things and then expect to be absolutely sure of them, but engineering is different. Most engineering is about making stuff that meets cost constraints and empirically works well enough for the application, and that's what they did.
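As a rough sanity check on those rates (the unit count and lockup interval below are invented round numbers, since the post only says "a few hundred" units and "every few thousand hours"):

    300 units x 24 h/day             = 7200 unit-hours/day
    7200 unit-hours / 7000 h/lockup  = roughly 1 lockup/day

which is consistent with logging about one WDT reset per day across the whole fleet.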
On Thu, 12 May 2016 20:34:19 +0200, Hans-Bernhard Bröker wrote:

> On 11.05.2016 at 08:17, Tim Wescott wrote:
>> Randy's point, I think, is that if something is _broken_, a reset isn't
>> going to un-break it.
>
> Non-reset is not going to, either.
>
> It's worth trying to distinguish between a run-off-into-the-wild system
> and a permanently broken one. So trigger a global reset, and see if
> that makes it work again. If it does, things are better than before. If
> it doesn't, they're no worse. As problem-handling approaches go, that's
> a pretty impressive result.
>
> That's what a watchdog ultimately is good for: to distinguish between a
> SEU and a FUBAR situation.
>
> There's really nothing terminally wrong with having a watchdog. The
> main risk I see is that it's easy to fall into the trap of thinking of
> the Dog not as (almost) the last line of defense, but as the first, or
> even the only one you need. I.e. it's tempting to think: "Nice, now
> I've got a watchdog, so the rest of the system can be designed without a
> care."
>
>> A processor reset should also reset the hardware, in much the same
>> way that cops should always be honest -- "should" in this case
>> indicates a moral requirement, but not, in all companies, a
>> reasonable expectation.
>
> OTOH, just because not all cops are honest, that doesn't make a world
> without any cops a better place.
>
> It can hardly be considered a concept's fault if some people implement
> it incorrectly. And it may not even be incorrect to leave resetting the
> rest of the circuit to the micro. There might be some information for
> the micro to be had from inspecting the state of other parts of the
> hardware, as left behind by the hosed system state. It all depends.
I meant my comment more as an encouragement to look at schematics or ask the hardware designers what reset does. I see your point about possibly letting the micro reset the rest of hardware -- either way, one should not assume things.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:

> On 11.05.2016 at 08:17, Tim Wescott wrote:
>> Randy's point, I think, is that if something is _broken_, a reset isn't
>> going to un-break it.
>
> Non-reset is not going to, either.
In general that is a logical fallacy.

Consider a situation where one section of the code, let's say one thread, hangs because of broken hardware, but other threads are still doing useful work, e.g., transmitting status information up to the cloud.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
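A single hung thread is, incidentally, exactly the case a supervised watchdog kick (the pattern Tim describes in the original post, quoted further down) is meant to catch: rather than kicking the dog from a timer ISR, each critical task reports in and a supervisor refreshes the watchdog only when all of them have. A rough sketch in C, with all names invented for illustration:

#include <stdbool.h>

extern void wdt_kick(void);   /* hypothetical hardware watchdog refresh */

#define NUM_CRITICAL_TASKS 2u

/* Each critical task sets its flag once per cycle when it judges itself healthy. */
static volatile bool task_healthy[NUM_CRITICAL_TASKS];

void task_report_healthy(unsigned id)
{
    if (id < NUM_CRITICAL_TASKS) {
        task_healthy[id] = true;
    }
}

/* Low-priority supervisor, run periodically and well inside the WDT timeout. */
void supervisor_poll(void)
{
    bool all_ok = true;

    for (unsigned i = 0; i < NUM_CRITICAL_TASKS; i++) {
        if (!task_healthy[i]) {
            all_ok = false;
        }
        task_healthy[i] = false;   /* demand a fresh report every cycle */
    }

    if (all_ok) {
        wdt_kick();                /* kick only when every task has checked in */
    }
    /* otherwise: let the dog time out and reset the system */
}

With that structure, one hung thread stops the kicks and forces the reset; whether that trade-off is right depends on whether the still-working threads (the cloud reporting, say) are worth more than a clean restart.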
On 13.5.2016 at 06:09, Randy Yates wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>
>> On 11.05.2016 at 08:17, Tim Wescott wrote:
>>> Randy's point, I think, is that if something is _broken_, a reset isn't
>>> going to un-break it.
>>
>> Non-reset is not going to, either.
>
> In general that is a logical fallacy.
>
> Consider a situation where one section of the code, let's say one
> thread, hangs because of broken hardware, but other threads are still
> doing useful work, e.g., transmitting status information up to the
> cloud.
>
Or the case where a reset causes the screen to reinitialize, and the glimpse you had of something vital to understanding the problem was too short.

Generalizations are almost always wrong, but in general having a dog (and being able to turn it off for situations like the above) is a good thing :) (well, not if it is an organic dog in your backyard yelping all the time, just a silicon sort of dog....). On certain systems it may even be smoke-saving, if reset is hit early enough.

On larger systems, of the complexity of a PC and remotely operated, it can at times eliminate the need for someone to go to the device and reset it... The latter is particularly useful during in situ fine-tuning which involves significant programming (has been for me).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
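Being able to turn the dog off for development or in-situ tuning is often just a build switch around the enable and kick calls. A minimal sketch, with the macro and HAL names invented; many parts can also freeze the watchdog while a debugger has the core halted, which serves the same purpose:

#include <stdint.h>

/* Hypothetical watchdog HAL; names invented for the sketch. */
extern void wdt_enable(uint32_t timeout_ms);
extern void wdt_kick(void);

void watchdog_init(void)
{
#if defined(DEBUG_BUILD)
    /* Development/tuning image: leave the dog off so breakpoints,
       single-stepping and long flash sessions don't cause resets. */
#else
    wdt_enable(2000u);   /* production image: 2 s timeout, for example */
#endif
}

void watchdog_kick(void)
{
#if !defined(DEBUG_BUILD)
    wdt_kick();
#endif
}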
On 13.05.2016 at 05:09, Randy Yates wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>> On 11.05.2016 at 08:17, Tim Wescott wrote:
>>> Randy's point, I think, is that if something is _broken_, a reset isn't
>>> going to un-break it.
>> Non-reset is not going to, either.
> In general that is a logical fallacy.
So a _broken_ system will be fixed by not doing a reset; really?
> Consider a situation where one section of the code, let's say one
> thread, hangs because of broken hardware, but other threads are still
> doing useful work, e.g., transmitting status information up to the
> cloud.
Then that won't be fixed by not doing a reset. Just like I said.
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>>>> if something is _broken_, a reset isn't going to un-break it.
>>> Non-reset is not going to, either.
>> In general that is a logical fallacy.
> So a _broken_ system will be fixed by not doing a reset; really?
The fallacy is the implication "reset won't fix something broken" => "reset is not worth attempting", which comes from the erroneous concept that something broken is unusable. In fact lots of brokenness takes the form of the device freezing up once in a while, when it's supposed to keep working. Resetting won't un-break the device: it's still a broken device that will freeze up again eventually. But if resetting clears the immediate symptom (the freeze-up) so you can keep using it, that might be good enough for your purposes.

Anyone who deals with technology products in the real world is used to this. My DSL modem freezes up every few months and I have to reset it manually since there's no WDT. This is a known problem with these modems. Resetting is a minor nuisance if I'm at home, but potentially a big headache if I want my home computer to stay online so I can connect to it while travelling. Consider these two solutions:

1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3 years: the vendor replaces it under warranty if it freezes in that period.

2) Add a WDT to the existing broken modem, i.e. it will still freeze now and then, but it self-resets in the event of a freeze.

I think #1 is an actual "fix", but #2 is more robust in practice. If I really cared about remote access, I'd want a WDT even with a new modem. If I had to pick between the two, I'd choose the WDT over improving the modem's underlying reliability.

There's only so far you can go in trying to make hardware failure-proof. Even NASA gives up at a certain point. They make stuff as reliable as they can, but then they deal with residual unreliability by adding backup hardware in case the primary still fails.
On 13.05.2016 at 21:10, Paul Rubin wrote:
> Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>>>>> if something is _broken_, a reset isn't going to un-break it.
>>>> Non-reset is not going to, either.
>>> In general that is a logical fallacy.
>> So a _broken_ system will be fixed by not doing a reset; really?
> The fallacy is the implication "reset won't fix something broken" =>
> "reset is not worth attempting", which comes from the erroneous concept
> that something broken is unusable.
I did not make that implication, so I have to object to this criticism being applied to my posts.
> 1) Buy a new modem (still with no WDT) guaranteed not to freeze for 3
> years: the vendor replaces it under warranty if it freezes in that
> period.
>
> 2) Add a WDT to the existing broken modem, i.e. it will still freeze
> now and then, but it self-resets in the event of a freeze.
>
> I think #1 is an actual "fix",
It's not, for the use case you described, because you won't be home to let them in, so they won't be able to exchange it. And if you were home, there's no way they'll be there with the exchange device faster than you can reach the existing device's reset button (or power plug).
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
>> The fallacy is the implication "reset won't fix something broken" =>
>> "reset is not worth attempting"
> I did not make that implication, so I have to object to this criticism
> being applied to my posts.
Well I don't understand what you were getting at then.
>> I think #1 is an actual "fix",
> It's not, for the use case you described,
Of course it's a fix. It changes a deployment of broken equipment into one of non-broken equipment. How can that be anything other than a fix? What else can it mean to fix something? The issue is that it's hardware, not mathematics. Just because it's not broken today doesn't mean it will never break. Therefore being able to mitigate potential failure is still important, maybe even more important than being able to fix existing actual failure.
In article <_-udncmXlPgeXq3KnZ2dnUU7-eOdnZ2d@giganews.com>, seemywebsite@myfooter.really says...
> Randy Yates recently started a thread on programming flash that had an
> interesting tangent into watchdog timers. I thought it was interesting
> enough that I'm starting a thread here.
>
> I had stated in Randy's thread that I avoid watchdogs, because they
> mostly seem to be a source of erroneous behavior to me.
>
> However, on reflection I realized that I lied: I _do_ use watchdog
> timers, but not automatically. To date I've only used them when the
> processor is spinning a motor that might crash into something or
> otherwise engage in damaging behavior if the processor goes nuts.
>
> In general, my rule on watchdogs, as with any other feature, is "use it
> if using it is better", which means that I think about the consequences
> of the thing popping off when I don't want it to (as during a code update
> or during development when I hit a breakpoint) vs. the consequences of
> not having the thing when the processor goes haywire.
>
> Furthermore, if I use a watchdog I don't just treat updating the thing as
> a requirement check-box -- so you won't find a timer ISR in my code that
> unconditionally kicks the dog. Instead, I'll usually have just one task
> (the motor control one, on most of my stuff) kick the dog when it feels
> it's operating correctly. If I've got more than one critical task (i.e.,
> if I'm running more than one motor out of one processor) I'll have a low-
> priority built-in-test task that kicks the dog, but only if it's getting
> periodic assurances of health from the (multiple) critical tasks.
>
> Generally, in my systems, the result of the watchdog timer popping off is
> that the system will no longer work quite correctly, but it will operate
> safely.
>
> So -- what do you do with watchdogs, and how, and why? Always use 'em?
> Never use 'em? Use 'em because the boss says so, but twiddle them in a
> "last part to break" bit of code?
>
> Would you use a watchdog in a fly-by-wire system? A pacemaker? Why?
> Why not? Could you justify _not_ using a watchdog in the top-level
> processor of a Mars rover or a satellite?
As you said, use them when they are needed, and that's what I do. Except with me their use is the rule and not the exception. Most of my systems have to run unattended for years on end and there is little chance that a person will be able to cycle the power or press a reset button.

That being said, I tend to use rather long time-out periods, so I don't get bitten on the butt by a WDT that is always on the verge of triggering, and if the WDT does expire, something has really gone awry. Also like you, I do have a couple of motor control systems that are a bit more safety critical, for which I use faster time-out periods, mainly to make sure that if they fail, the system can attempt to place itself in as safe a condition as possible.

I think a watchdog makes sense in any system that is far from home, like your rover example, or one where incorrect operation may be more dangerous than the system running off into la-la land.
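One way to get both behaviors (a generous timeout for the unattended case, a fast one plus a safe-state action for the motor case) is to let the build select the timeout and, on parts that have one, use the watchdog's early-warning interrupt to park the hardware just before the reset. A sketch with all HAL and build-flag names invented; check your part for the real mechanism:

#include <stdint.h>

/* Hypothetical HAL calls; names invented for the sketch. */
extern void wdt_enable(uint32_t timeout_ms);
extern void wdt_enable_early_warning_irq(uint32_t ms_before_reset);
extern void motor_disable_outputs(void);
extern void brake_engage(void);

void watchdog_setup(void)
{
#if defined(MOTOR_CONTROL_BUILD)
    /* Safety-relevant build: short timeout plus a warning interrupt,
       so the drive can be parked before the reset actually happens. */
    wdt_enable(250u);
    wdt_enable_early_warning_irq(50u);
#else
    /* Unattended box far from home: generous timeout so only a real
       hang trips it, never a WDT always on the verge of triggering. */
    wdt_enable(10000u);
#endif
}

/* Fires shortly before the watchdog reset, on parts that support it. */
void wdt_early_warning_isr(void)
{
    motor_disable_outputs();   /* stop driving the bridge */
    brake_engage();            /* leave the mechanics in a safe state */
    /* deliberately no watchdog refresh here -- let the reset proceed */
}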
