Watchdog for an embedded Linux-based system

I have a Linux-based SOM that integrates complexity (CPU, memories, 
Ethernet PHY and so on) and I'm going to design a carrier board for it.

I'd like to add an external watchdog on the carrier board for two purposes:
- automatically restart the system when the applications running
   on Linux crash for some reason (mainly bugs)
- automatically restart the system when it is not able to
   start up

For me, the big issue is the second point. I have sometimes seen the system 
hang during startup (in the bootloader, during kernel initialization 
and so on).
I don't know the causes of these hangs, and the SOM manufacturer isn't 
much help. They say it can happen on complex systems based on a 
desktop OS such as Linux, and that a pulse on the main CPU RESET line 
may not be sufficient in certain odd situations.

So I'm thinking of adding an external watchdog that:
- monitors the Linux system activity (maybe a pin that an application
   drives high and low at a certain frequency);
- opens and then closes again a small relay that brings the main power
   supply voltage rail to the Linux system

Do you have better suggestions?
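
A minimal sketch of the application side of that heartbeat, assuming the pin is exposed as a writable file such as the legacy sysfs GPIO `value` file. The function name, its parameters and the idea of an `is_healthy` callback are illustrative assumptions, not something the SOM vendor provides.

```python
import time
from pathlib import Path


def heartbeat(value_path, period_s, beats, is_healthy=lambda: True, sleep=time.sleep):
    """Write up to `beats` alternating levels ("1", "0", ...) to value_path.

    One level is written every half period, so a full high/low cycle takes
    period_s. Toggling stops as soon as is_healthy() returns False, which
    lets the external watchdog time out. Returns the number of edges written.
    """
    path = Path(value_path)  # assumption: a GPIO value file, or any writable file
    level = 0
    edges = 0
    for _ in range(beats):
        if not is_healthy():
            break  # stop feeding the watchdog on purpose
        level ^= 1
        path.write_text(str(level))
        edges += 1
        sleep(period_s / 2)
    return edges
```

On a real board the path might look like `/sys/class/gpio/gpio17/value` (the pin number is hypothetical). Tying the toggle to an application-level health check, rather than to a free-running timer, is what makes the watchdog notice a hung application and not just a dead kernel.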

Reply by Mikko OH2HVJ ● March 13, 2017

pozz <pozzugno@gmail.com> writes:

> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>   low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>   voltage rail to the Linux system
>
> Do you have better suggestions?

Sounds good to me. If you have the external watchdog as a separate MCU
(with super simple firmware and an external HW watchdog of its own),
you can also ask the WD to cycle power on purpose.

Make sure you also take down all the incoming IO signals, they may keep
some part of the SOM powered and locked up. I've seen 0.4V to an IO pin
keep SRAM registers in a state that locked the system.
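
The ordering described above can be sketched as follows. This is only a model of the sequence, assuming the watchdog MCU has one control that isolates the IO signals and one that switches the supply; the callback names and the default delays are hypothetical and would have to come from the actual hardware.

```python
def power_cycle(set_io_enabled, set_power_enabled, sleep,
                off_time_s=2.0, settle_time_s=0.5):
    """Power-cycle the SOM without leaving it back-powered through IO pins.

    set_io_enabled / set_power_enabled are caller-supplied callbacks taking
    a bool; sleep is a callback taking seconds. Delays are assumed values.
    """
    set_io_enabled(False)      # take down incoming IO signals first
    set_power_enabled(False)   # then remove the main supply
    sleep(off_time_s)          # let the rails discharge
    set_power_enabled(True)    # restore the supply
    sleep(settle_time_s)       # wait for the rails to come up
    set_io_enabled(True)       # only now reconnect the IO signals
```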

--
mikko

Reply by pozz ● March 13, 2017

On 13/03/2017 09:09, Mikko OH2HVJ wrote:
> pozz <pozzugno@gmail.com> writes:
>
>> So I'm thinking to add an external watchdog that:
>> - monitors the Linux system activity (maybe a pin that goes high and
>>   low at a certain frequency by an application);
>> - open and close again a small relay that brings the main power supply
>>   voltage rail to the Linux system
>>
>> Do you have better suggestions?
>
> Sounds good to me. If you have the external watchdog as a separate MCU
> (with super simple firmware and an external HW watchdog of its own),
> you can also ask the WD to cycle power on purpose.

Can you suggest an external HW watchdog?


> Make sure you also take down all the incoming IO signals, they may keep
> some part of the SOM powered and locked up. I've seen 0.4V to an IO pin
> keep SRAM registers in a state that locked the system.

Hmmm..., this could be a little more complex. Anyway, thank you for this 
suggestion.

Reply by Tim Wescott ● March 13, 2017

On Mon, 13 Mar 2017 08:55:42 +0100, pozz wrote:

> I have a Linux-based SOM that integrates complexity (CPU, memories,
> Ethernet PHY and so on) and I'm going to design a carrier board for it.
> 
> I'd like to add an external watchdog on the carrier board for two
> purposes:
> - automatically rerun the system when the applications running
>    on Linux crash for some reasons (mainly bugs)
> - automatically rerun the system when the system is not able to
>    start-up
> 
> For the big issue is the second point. I sometime have seen the system
> hangs during startup (during bootloader, during kernel initialization
> and so on).
> I don't know the causes of these and the SOM manufacturer doesn't help
> too much. It says... it could happen on those complex systems based on a
> desktop OS as Linux. It says giving a pulse on main CPU RESET linux
> could be not sufficient in certain odd situations.
> 
> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>    low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>    voltage rail to the Linux system
> 
> Do you have better suggestions?

Sounds good.  I'd look for a power supply with an enable input (like most 
PC power supplies these days) and use that instead of a relay.

AFAIK, when the system gets bodged to the point of needing a power cycle, 
it's because some peripheral or another gets bodged in a way that won't 
get un-bodged merely as a result of sweet-talking from the CPU.  But, 
that's just a guess.

-- 
www.wescottdesign.com

Reply by Joe Chisolm ● March 13, 2017

On Mon, 13 Mar 2017 10:06:10 +0100, pozz wrote:

> On 13/03/2017 09:09, Mikko OH2HVJ wrote:
>> pozz <pozzugno@gmail.com> writes:
>>
>>> So I'm thinking to add an external watchdog that:
>>> - monitors the Linux system activity (maybe a pin that goes high and
>>>   low at a certain frequency by an application);
>>> - open and close again a small relay that brings the main power supply
>>>   voltage rail to the Linux system
>>>
>>> Do you have better suggestions?
>>
>> Sounds good to me. If you have the external watchdog as a separate MCU
>> (with super simple firmware and an external HW watchdog of its own),
>> you can also ask the WD to cycle power on purpose.
> 
> Do you suggest any external HW watchdog?
> 
> 
>> Make sure you also take down all the incoming IO signals, they may keep
>> some part of the SOM powered and locked up. I've seen 0.4V to an IO pin
>> keep SRAM registers in a state that locked the system.
> 
> Hmmm..., this could be a little more complex. Anyway, thank you for this 
> suggestion.

Depends on what free GPIO pins you have to reset a WDT.  A 555 and an
SSR can be rigged to do what you want.  If you have a serial port that 
supports full HW flow control or the like, you can use a CTS or other 
control pin.  A little dirt cheap uC can do the same thing and can
actually allow more control, as you might want a longer WDT time during
system boot than when the app is up and running.  The uC can listen on a
serial port, and when the app starts its heartbeat the uC can adjust the
timeout interval.
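
A rough model of that two-stage timeout, written in Python for readability rather than as firmware for any particular uC. The class name, method names and the timeout values in the usage note are assumptions made for illustration.

```python
class TwoStageWatchdog:
    """Watchdog with a long timeout during boot and a short one afterwards."""

    def __init__(self, boot_timeout_s, run_timeout_s):
        self.boot_timeout_s = boot_timeout_s
        self.run_timeout_s = run_timeout_s
        self.power_on(now=0.0)

    def power_on(self, now):
        # After (re)applying power, allow the longer boot window.
        self.timeout_s = self.boot_timeout_s
        self.deadline = now + self.timeout_s

    def heartbeat(self, now):
        # The first heartbeat means the app is up: switch to the short window.
        self.timeout_s = self.run_timeout_s
        self.deadline = now + self.timeout_s

    def expired(self, now):
        # True when the system should be power-cycled.
        return now >= self.deadline
```

With, say, a 120 s boot window and a 5 s run window, the main loop of the uC would call `expired()` periodically, power-cycle the SOM when it returns True, and then call `power_on()` to re-arm the long window.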

You can also do this with a simple shift register and appropriate
clock.  You clock in a high or low depending on what you need to
pop out the other end for the power reset.  The heartbeat resets
the shift register.
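
A small simulation of the shift-register idea, assuming the data input is tied high, the heartbeat drives the register's clear input, and the last stage's output triggers the power reset. The class and method names are made up for this sketch.

```python
class ShiftRegisterWatchdog:
    """Model of an N-stage shift register used as a watchdog timer."""

    def __init__(self, stages):
        self.stages = stages
        self.bits = [0] * stages

    def heartbeat(self):
        # The heartbeat clears every stage.
        self.bits = [0] * self.stages

    def clock(self):
        # Each clock shifts a 1 in at the first stage.
        self.bits = [1] + self.bits[:-1]
        # The last stage going high means no heartbeat arrived in time.
        return self.bits[-1] == 1
```

In this model the timeout is the number of stages multiplied by the clock period, so a slow clock and a few stages are enough for timeouts of many seconds.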

The tricky part is the initial power-on sequence of the WDT itself and
then the WDT powering on the main system.  Also, will you allow
a normal reset of the system, and how should the WDT react to it?

-- 
Chisolm
Republic of Texas

Reply by Don Y ● March 13, 2017

On 3/13/2017 12:55 AM, pozz wrote:
> I have a Linux-based SOM that integrates complexity (CPU, memories, Ethernet
> PHY and so on) and I'm going to design a carrier board for it.
>
> I'd like to add an external watchdog on the carrier board for two purposes:
> - automatically rerun the system when the applications running
>   on Linux crash for some reasons (mainly bugs)
> - automatically rerun the system when the system is not able to
>   start-up
>
> For the big issue is the second point. I sometime have seen the system hangs
> during startup (during bootloader, during kernel initialization and so on).
> I don't know the causes of these and the SOM manufacturer doesn't help too
> much. It says... it could happen on those complex systems based on a desktop OS
> as Linux. It says giving a pulse on main CPU RESET linux could be not
> sufficient in certain odd situations.
>
> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>   low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>   voltage rail to the Linux system
>
> Do you have better suggestions?

Don't use a watchdog to fix a DESIGN PROBLEM.  Find a new vendor.
(How can you have confidence in your product if one of the main components
can't even meet its minimal performance requirements?)

Reply by Dimiter_Popoff ● March 14, 2017

On 14.3.2017 г. 02:02, Don Y wrote:
> On 3/13/2017 12:55 AM, pozz wrote:
>> I have a Linux-based SOM that integrates complexity (CPU, memories,
>> Ethernet
>> PHY and so on) and I'm going to design a carrier board for it.
>>
>> I'd like to add an external watchdog on the carrier board for two
>> purposes:
>> - automatically rerun the system when the applications running
>>   on Linux crash for some reasons (mainly bugs)
>> - automatically rerun the system when the system is not able to
>>   start-up
>>
>> For the big issue is the second point. I sometime have seen the system
>> hangs
>> during startup (during bootloader, during kernel initialization and so
>> on).
>> I don't know the causes of these and the SOM manufacturer doesn't help
>> too
>> much. It says... it could happen on those complex systems based on a
>> desktop OS
>> as Linux. It says giving a pulse on main CPU RESET linux could be not
>> sufficient in certain odd situations.
>>
>> So I'm thinking to add an external watchdog that:
>> - monitors the Linux system activity (maybe a pin that goes high and
>>   low at a certain frequency by an application);
>> - open and close again a small relay that brings the main power supply
>>   voltage rail to the Linux system
>>
>> Do you have better suggestions?
>
> Don't use a watchdog to fix a DESIGN PROBLEM.  Find a new vendor.
> (How can you have confidence in your product if one of the main components
> can't even meet its minimal performance requirements?)

Hi Don,

this is how they do things nowadays, not much we can do about it.
Spread a mess over the currently exposed mess to cover it.
Like a friend of mine once said, our civilization won't be the
first one to fall.
Even in mass products - I had a phone from one of the leading
manufacturers which sometimes would not come out of reset no matter what
and for how long you press - took opening it and removing the battery
for a while to get it to work.
The messiness in our trade is past the point of no return, has been
for a while.

Dimiter

Reply by Don Y ● March 14, 2017

Hi Dimiter,

On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:

>> Don't use a watchdog to fix a DESIGN PROBLEM.  Find a new vendor.
>> (How can you have confidence in your product if one of the main components
>> can't even meet its minimal performance requirements?)
>
> this is how they do things nowadays, not much we can do about it.

Well, *I* can certainly not contribute to the practice!  If you've
got a "slow leak" in a tire, do you drive around with a *pump* in
the trunk (forever!) or do you fix the leak?

> Spread a mess over the currently exposed mess to cover it.

I think much of the problem comes from people treating software (and other
components) as "black boxes" -- despite the fact that they weren't rigorously
*designed* and *documented* as such.  Rather than UNDERSTAND what's going on,
they slap a band-aid on it to get the overall required behavior.  Yet they can
never be sure they've applied the RIGHT "band-aid" (cuz they don't know what
the PROBLEM is!)

Similarly, you see folks dismiss "intermittent"/sporadic failures as "flukes".
Hey, if it happened ONCE, who's to say it won't happen again?  Just because
it's not happening NOW (for some UNKNOWN reason), how do you know it won't
resume happening the day you start shipping product?  And, how do you know
it won't happen to EVERY unit that you ship??

I doubt the OP is constrained to having just *one* choice in terms of
"SoM/SoC running Linux".  And, what guarantee will he have that cycling
power WILL allow the device to boot properly?  (after all, didn't he just
recently apply power and watch it FAIL TO BOOT -- necessitating the power
cycling?)  Or, that there aren't other "issues" that will manifest DURING
OPERATION -- or, that *are* happening during operation but that he hasn't
rigorously identified in his test/validation procedure?

> Like a friend of mine once said, our civilization won't be the
> first one to fall.
> Even in mass products - I had a phone from one of the leading
> manufacturers which sometimes would not come out of reset no matter what
> and for how long you press - took opening it and removing the battery
> for a while to get it to work.
> The messiness in our trade is past the point of no return, has been
> for a while.

"Just push the product out the door.  Let the next guy worry about
why it doesn't work.  Chances are, a new version with a different set of
problems will be available at that time!  Offer the disgruntled user a
FREE UPGRADE (to that new set of disgruntling problems!  :> )"

Makes one wonder how tolerant of bugs, poor quality, etc. these same
folks are in the products that they *purchase* for their own use?!
How patient they'd be if their vehicle didn't start from time to
time unless they exited the cabin, closed the door, paused and
then repeated the attempt.  Or, a TV shutting off in the middle of
a ball game.  Or, their food order "disappearing" and that fact only
discovered after they'd waited 20 minutes for it to be served?

(sigh)

"Apres moi, le deluge!"

Reply by Dimiter_Popoff ● March 14, 2017

On 14.3.2017 г. 18:09, Don Y wrote:
> Hi Dimiter,
>
> On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:
>
>>> Don't use a watchdog to fix a DESIGN PROBLEM.  Find a new vendor.
>>> (How can you have confidence in your product if one of the main
>>> components
>>> can't even meet its minimal performance requirements?)
>>
>> this is how they do things nowadays, not much we can do about it.
>
> Well, *I* can certainly not contribute to the practice!  If you've
> got a "slow leak" in a tire, do you drive around with a *pump* in
> the trunk (forever!) or do you fix the leak?

What they seem to do in that situation is to put a larger tire over
the leaky one, pump a few kilograms of glue between the two and
move on....

Like I said before, past the point of no return. Way past really.

Dimiter

Reply by Don Y ● March 14, 2017

On 3/14/2017 10:28 AM, Dimiter_Popoff wrote:
> On 14.3.2017 г. 18:09, Don Y wrote:
>> Hi Dimiter,
>>
>> On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:
>>
>>>> Don't use a watchdog to fix a DESIGN PROBLEM.  Find a new vendor.
>>>> (How can you have confidence in your product if one of the main
>>>> components
>>>> can't even meet its minimal performance requirements?)
>>>
>>> this is how they do things nowadays, not much we can do about it.
>>
>> Well, *I* can certainly not contribute to the practice!  If you've
>> got a "slow leak" in a tire, do you drive around with a *pump* in
>> the trunk (forever!) or do you fix the leak?
>
> What they seem to do in that situation is to put a larger tire over
> the leaky one, pump a few kilograms of glue between the two and
> move on....

But, that would *solve* the problem -- assuming they similarly
increased the size of the remaining tires.

The "tire pump" remedy I mentioned just *perpetuates* the problem.
Akin to treating symptoms instead of the underlying problem.

It is particularly annoying to see how willingly folks will "dismiss"
problems THAT THEY, THEMSELVES, WITNESSED if they are intermittent (not
EASY to track down).

I designed a LORAN-C-based autopilot early in my career.  In our first
sea trial of the prototype (a piece of perf board bolted to a chunk of
lumber!), we defined a course to circumnavigate Cape Cod (Massachusetts)
<https://en.wikipedia.org/wiki/Cape_Cod> via a set of "waypoints"
designed to allow us to hug the coastline.

[The autopilot worked by letting you enter the latitude and longitude of
these waypoints, in sequence, and it would steer the boat *to* each,
in turn (at the time, most autopilots simply tried to keep a vessel
pointed in a particular direction and couldn't compensate for drift)]

We monitored the boat's course with a position plotter (took positional
information from a LORAN-C receiver and plotted those on a chart/map).
As expected, the vessel's track was true to the series of waypoints we'd
entered (you use the nominal coordinates of buoys in the ocean/bay
as markers cuz it's hard to tell one wave from another!  :> ).

But, there was a very noticeable 'S' in the track at one particular point.
As if the algorithm had "overflowed" and then overcompensated before
eventually returning to its ideal track.  I spent weeks (off-hours)
digging through my source code, the floating point library implementation,
etc. in an attempt to understand why this anomaly was present.

Unfortunately, I had no way to log the raw data from the LORAN receiver;
that was the purpose of the plotter's record!  Nor any way to log the
actions INTENDED by my device (had it COMMANDED the vessel to take such
a course?).

My boss dismissed the "problem" -- citing the fact that we had stopped
the vessel in that general area to do some deep sea fishing.  As such,
its movements were largely at the whim of the ocean current and the
MANUAL compensation that the skipper would occasionally bring to bear
(to keep the waves from turning the vessel sideways which would leave it
vulnerable to being toppled by larger waves).

I still have that code (I have *everything* I've ever designed/written)
and periodically drag it out hoping to stumble on some OTHER explanation.
I'd much prefer a genuine bug to explain the behavior than HOPE it was
as my boss had suggested!

A colleague had implemented a video game that suffered from an annoying
"problem" -- almost invariably when the player was having a REALLY GOOD
game!  Yet, it proved difficult to track down as the game is, by its
very nature, based on "random" events (pseudo-random number generators
governing its actions; non-repeatable interactions with the player, etc.).

The game was released with the "problem" unresolved (that market is very
aggressively competitive, with narrow windows of "opportunity").

Some *years* later, the problem was found to be a genuine bug (duh!).
Unfortunate that EVERY unit shipped had that bug.  *But*, a strong sense
of closure finally KNOWING the cause!

> Like I said before, past the point of no return. Way past really.
