Watchdog for an embedded Linux-based system

Started by pozz March 13, 2017
I have a Linux-based SOM that integrates the complexity (CPU, memories,
Ethernet PHY and so on) and I'm going to design a carrier board for it.

I'd like to add an external watchdog on the carrier board for two purposes:
- automatically restart the system when the applications running
   on Linux crash for some reason (mainly bugs)
- automatically restart the system when it is not able to
   start up

For me the big issue is the second point. I have sometimes seen the
system hang during startup (in the bootloader, during kernel
initialization and so on).
I don't know the causes of these hangs and the SOM manufacturer doesn't
help much. It says... it could happen on those complex systems based on a
desktop OS such as Linux. It says that pulsing the main CPU RESET line
might not be sufficient in certain odd situations.

So I'm thinking of adding an external watchdog that:
- monitors the Linux system activity (maybe a pin that an application
   toggles high and low at a certain frequency);
- opens and then closes again a small relay that feeds the main power
   supply rail to the Linux system

Do you have better suggestions?
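
A minimal sketch of the application side of that idea, assuming the heartbeat pin is exposed as a file that accepts "0" or "1" (for example a sysfs-style GPIO value file; the exact path depends on the board and is not given in this thread). `HeartbeatPin`, `run_heartbeat` and `is_healthy` are hypothetical names used only for illustration.

```python
import time
from pathlib import Path


class HeartbeatPin:
    """Toggle a GPIO value file so an external watchdog sees activity."""

    def __init__(self, value_path):
        # Assumption: value_path points at a file that accepts "0" or "1".
        self.value_path = Path(value_path)
        self.level = 0

    def toggle(self):
        # Flip the level and write it out as "0" or "1".
        self.level ^= 1
        self.value_path.write_text(str(self.level))
        return self.level


def run_heartbeat(pin, is_healthy, period_s, max_beats):
    """Toggle the pin while is_healthy() returns True; return beats sent."""
    beats = 0
    while beats < max_beats and is_healthy():
        pin.toggle()
        beats += 1
        time.sleep(period_s)
    return beats
```

Toggling only while the health check passes means a hung or crashed application leaves the pin at a static level, which is what the external watchdog would treat as a failure.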
pozz <pozzugno@gmail.com> writes:

> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>    low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>    voltage rail to the Linux system
>
> Do you have better suggestions?

Sounds good to me. If you have the external watchdog as a separate MCU
(with super simple firmware and an external HW watchdog of its own), you
can also ask the WD to cycle power on purpose.

Make sure you also take down all the incoming IO signals; they may keep
some part of the SOM powered and locked up. I've seen 0.4V on an IO pin
keep SRAM registers in a state that locked the system.

--
mikko
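
The ordering described above can be sketched as a small sequence. This is an illustration only: `isolate_io` and `set_power` are hypothetical board hooks (for example, disabling buffers on the incoming signals and driving the relay), and the 2 second off time is an assumed value, not one given in this thread.

```python
import time


def power_cycle(isolate_io, set_power, off_time_s=2.0, sleep=time.sleep):
    """Remove power from the SOM with its inputs isolated, then restore it.

    isolate_io(bool) and set_power(bool) are hypothetical board hooks.
    """
    isolate_io(True)    # stop driving the SOM's inputs first
    set_power(False)    # open the relay / disable the supply
    sleep(off_time_s)   # give the rails time to discharge
    set_power(True)     # restore power
    isolate_io(False)   # reconnect the IO signals once the SOM is powered
```

Isolating the inputs before removing power is the point of the sketch: no external signal should be left back-feeding the SOM while its supply is off.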
On 13/03/2017 09:09, Mikko OH2HVJ wrote:
> pozz <pozzugno@gmail.com> writes:
>
>> So I'm thinking to add an external watchdog that:
>> - monitors the Linux system activity (maybe a pin that goes high and
>>    low at a certain frequency by an application);
>> - open and close again a small relay that brings the main power supply
>>    voltage rail to the Linux system
>>
>> Do you have better suggestions?
>
> Sounds good to me. If you have the external watchdog as a separate MCU
> (with super simple firmware and an external HW watchdog of it's own),
> you can also ask the WD to cycle power on purpose.

Do you suggest any external HW watchdog?

> Make sure you also take down all the incoming IO signals, they may keep
> some part of the SOM powered and locked up. I've seen 0.4V to an IO pin
> keep SRAM registers in a state that locked the system.

Hmmm..., this could be a little more complex. Anyway, thank you for this suggestion.
On Mon, 13 Mar 2017 08:55:42 +0100, pozz wrote:

> I have a Linux-based SOM that intregrates complexity (CPU, memories,
> Ethernet PHY and so on) and I'm going to design a carrier board for it.
>
> I'd like to add an external watchdog on the carrier board for two
> purposes:
> - automatically rerun the system when the applications running
>    on Linux crash for some reasons (mainly bugs)
> - automatically rerun the system when the system is not able to
>    start-up
>
> For the big issue is the second point. I sometime have seen the system
> hangs during startup (during bootloader, during kernel initialization
> and so on).
> I don't know the causes of these and the SOM manufacturer doesn't help
> too much. It says... it could happen on those complex systems based on a
> desktop OS as Linux. It says giving a pulse on main CPU RESET linux
> could be not sufficient in certain odd situations.
>
> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>    low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>    voltage rail to the Linux system
>
> Do you have better suggestions?

Sounds good. I'd look for a power supply with an enable input (like most
PC power supplies these days) and use that instead of a relay.

AFAIK, when the system gets bodged to the point of needing a power cycle,
it's because some peripheral or another gets bodged in a way that won't
get un-bodged merely as a result of sweet-talking from the CPU. But,
that's just a guess.

--
www.wescottdesign.com
On Mon, 13 Mar 2017 10:06:10 +0100, pozz wrote:

> On 13/03/2017 09:09, Mikko OH2HVJ wrote:
>> pozz <pozzugno@gmail.com> writes:
>>
>>> So I'm thinking to add an external watchdog that:
>>> - monitors the Linux system activity (maybe a pin that goes high and
>>>    low at a certain frequency by an application);
>>> - open and close again a small relay that brings the main power supply
>>>    voltage rail to the Linux system
>>>
>>> Do you have better suggestions?
>>
>> Sounds good to me. If you have the external watchdog as a separate MCU
>> (with super simple firmware and an external HW watchdog of it's own),
>> you can also ask the WD to cycle power on purpose.
>
> Do you suggest any external HW watchdog?
>
>> Make sure you also take down all the incoming IO signals, they may keep
>> some part of the SOM powered and locked up. I've seen 0.4V to an IO pin
>> keep SRAM registers in a state that locked the system.
>
> Hmmm..., this could be a little more complex. Anyway, thank you for this
> suggestion.

Depends on what you have for free GPIO pins to reset a WDT. A 555 and an
SSR can be rigged to do what you want. If you have a serial port that
supports full HW flow control or the like, you can use a CTS or other
control pin.

A dirt cheap little uC can do the same thing and can actually allow more
control, as you might want a longer WDT time during system boot vs. up
and running the app. The uC can listen on a serial port, and when the app
starts its heartbeat the uC can adjust the timeout interval.

You can also do this with a simple shift register and an appropriate
clock. You clock in a high or low depending on what you need to pop out
the other end for the power reset. The heartbeat resets the shift
register.

The tricky part is the initial power-on sequence with the WDT and then
the WDT powering on the main system. Also, will you allow a normal reset
of the system, and how does the WDT react to it?

--
Chisolm
Republic of Texas
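
The "longer timeout during boot, shorter once the app is running" idea can be modelled in a few lines. This is a sketch of the logic only, written in Python for readability rather than as uC firmware; the class name, method names and the timeout values in the usage note are assumptions for illustration.

```python
class TwoStageWatchdog:
    """Watchdog with a long boot timeout and a short run-time timeout."""

    def __init__(self, boot_timeout_s, run_timeout_s, now=0.0):
        self.boot_timeout_s = boot_timeout_s
        self.run_timeout_s = run_timeout_s
        self.restart(now)

    def restart(self, now):
        """Call right after power is (re)applied to the main system."""
        self.booted = False
        self.deadline = now + self.boot_timeout_s

    def heartbeat(self, now):
        """The first heartbeat ends the boot phase; every heartbeat
        pushes the deadline out by the short run-time timeout."""
        self.booted = True
        self.deadline = now + self.run_timeout_s

    def expired(self, now):
        """True once the current deadline has passed without a heartbeat."""
        return now >= self.deadline
```

With, say, `TwoStageWatchdog(120, 5)`, the system gets 120 seconds to boot and send its first heartbeat, after which it must keep sending one at least every 5 seconds. When `expired()` returns true the supervisor would cycle power and call `restart()`, which also covers the case where the system hangs again during the next boot.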
On 3/13/2017 12:55 AM, pozz wrote:
> I have a Linux-based SOM that intregrates complexity (CPU, memories,
> Ethernet PHY and so on) and I'm going to design a carrier board for it.
>
> I'd like to add an external watchdog on the carrier board for two
> purposes:
> - automatically rerun the system when the applications running
>    on Linux crash for some reasons (mainly bugs)
> - automatically rerun the system when the system is not able to
>    start-up
>
> For the big issue is the second point. I sometime have seen the system
> hangs during startup (during bootloader, during kernel initialization
> and so on).
> I don't know the causes of these and the SOM manufacturer doesn't help
> too much. It says... it could happen on those complex systems based on a
> desktop OS as Linux. It says giving a pulse on main CPU RESET linux
> could be not sufficient in certain odd situations.
>
> So I'm thinking to add an external watchdog that:
> - monitors the Linux system activity (maybe a pin that goes high and
>    low at a certain frequency by an application);
> - open and close again a small relay that brings the main power supply
>    voltage rail to the Linux system
>
> Do you have better suggestions?

Don't use a watchdog to fix a DESIGN PROBLEM. Find a new vendor. (How can you have confidence in your product if one of the main components can't even meet its minimal performance requirements?)
On 14.3.2017 г. 02:02, Don Y wrote:
> On 3/13/2017 12:55 AM, pozz wrote:
>> I have a Linux-based SOM that intregrates complexity (CPU, memories,
>> Ethernet PHY and so on) and I'm going to design a carrier board for it.
>>
>> I'd like to add an external watchdog on the carrier board for two
>> purposes:
>> - automatically rerun the system when the applications running
>>    on Linux crash for some reasons (mainly bugs)
>> - automatically rerun the system when the system is not able to
>>    start-up
>>
>> For the big issue is the second point. I sometime have seen the system
>> hangs during startup (during bootloader, during kernel initialization
>> and so on).
>> I don't know the causes of these and the SOM manufacturer doesn't help
>> too much. It says... it could happen on those complex systems based on a
>> desktop OS as Linux. It says giving a pulse on main CPU RESET linux
>> could be not sufficient in certain odd situations.
>>
>> So I'm thinking to add an external watchdog that:
>> - monitors the Linux system activity (maybe a pin that goes high and
>>    low at a certain frequency by an application);
>> - open and close again a small relay that brings the main power supply
>>    voltage rail to the Linux system
>>
>> Do you have better suggestions?
>
> Don't use a watchdog to fix a DESIGN PROBLEM. Find a new vendor.
> (How can you have confidence in your product if one of the main components
> can't even meet its minimal performance requirements?)

Hi Don,

this is how they do things nowadays, not much we can do about it.
Spread a mess over the currently exposed mess to cover it.

Like a friend of mine once said, our civilization won't be the
first one to fall.

Even in mass products - I had a phone from one of the leading
manufacturers which sometimes would not come out of reset no matter what
and for how long you press - took opening it and removing the battery
for a while to get it to work.

The messiness in our trade is past the point of no return, has been
for a while.

Dimiter
Hi Dimiter,

On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:

>> Don't use a watchdog to fix a DESIGN PROBLEM. Find a new vendor.
>> (How can you have confidence in your product if one of the main components
>> can't even meet its minimal performance requirements?)
>
> this is how they do things nowadays, not much we can do about it.

Well, *I* can certainly not contribute to the practice! If you've got a "slow leak" in a tire, do you drive around with a *pump* in the trunk (forever!) or do you fix the leak?
> Spread a mess over the currently exposed mess to cover it.
I think much of the problem comes from people treating software (and
other components) as "black boxes" -- despite the fact that they weren't
rigorously *designed* and *documented* as such. Rather than UNDERSTAND
what's going on, they slap a band-aid on it to get the overall required
behavior. Yet, they can never be sure they've applied the RIGHT
"band-aid" (cuz they don't know what the PROBLEM is!)

Similarly, you see folks dismiss "intermittent"/sporadic failures as
"flukes". Hey, if it happened ONCE, who's to say it won't happen again?
Just because it's not happening NOW (for some UNKNOWN reason), how do you
know it won't resume happening the day you start shipping product? And,
how do you know it won't happen to EVERY unit that you ship??

I doubt the OP is constrained to having just *one* choice in terms of
"SoM/SoC running Linux". And, what guarantee will he have that cycling
power WILL allow the device to boot properly? (after all, didn't he just
recently apply power and watch it FAIL TO BOOT -- necessitating the power
cycling?) Or, that there aren't other "issues" that will manifest DURING
OPERATION -- or, that *are* happening during operation but that he hasn't
rigorously identified in his test/validation procedure?

> Like a friend of mine once said, our civilization won't be the
> first one to fall.
> Even in mass products - I had a phone from one of the leading
> manufacturers which sometimes would not come out of reset no matter what
> and for how long you press - took opening it and removing the battery
> for a while to get it to work.
> The messiness in our trade is past the point of no return, has been
> for a while.

"Just push the product out the door. Let the next guy worry about why
it doesn't work. Chances are, a new version with a different set of
problems will be available at that time! Offer the disgruntled user
a FREE UPGRADE (to that new set of disgruntling problems! :> )"

Makes one wonder how tolerant of bugs, poor quality, etc. these same
folks are in the products that they *purchase* for their own use?! How
patient they'd be if their vehicle didn't start from time to time unless
they exited the cabin, closed the door, paused and then repeated the
attempt. Or, a TV shutting off in the middle of a ball game. Or, their
food order "disappearing" and that fact only discovered after they'd
waited 20 minutes for it to be served?

(sigh) "Apres moi, le deluge!"
On 14.3.2017 г. 18:09, Don Y wrote:
> Hi Dimiter,
>
> On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:
>
>>> Don't use a watchdog to fix a DESIGN PROBLEM. Find a new vendor.
>>> (How can you have confidence in your product if one of the main
>>> components can't even meet its minimal performance requirements?)
>>
>> this is how they do things nowadays, not much we can do about it.
>
> Well, *I* can certainly not contribute to the practice! If you've
> got a "slow leak" in a tire, do you drive around with a *pump* in
> the trunk (forever!) or do you fix the leak?

What they seem to do in that situation is to put a larger tire over
the leaky one, pump a few kilograms of glue between the two and
move on....

Like I said before, past the point of no return. Way past really.

Dimiter
On 3/14/2017 10:28 AM, Dimiter_Popoff wrote:
> On 14.3.2017 г. 18:09, Don Y wrote:
>> Hi Dimiter,
>>
>> On 3/14/2017 3:01 AM, Dimiter_Popoff wrote:
>>
>>>> Don't use a watchdog to fix a DESIGN PROBLEM. Find a new vendor.
>>>> (How can you have confidence in your product if one of the main
>>>> components can't even meet its minimal performance requirements?)
>>>
>>> this is how they do things nowadays, not much we can do about it.
>>
>> Well, *I* can certainly not contribute to the practice! If you've
>> got a "slow leak" in a tire, do you drive around with a *pump* in
>> the trunk (forever!) or do you fix the leak?
>
> What they seem to do in that situation is to put a larger tire over
> the leaky one, pump a few kilograms of glue between the two and
> move on....

But, that would *solve* the problem -- assuming they similarly increased
the size of the remaining tires. The "tire pump" remedy I mentioned just
*perpetuates* the problem. Akin to treating symptoms instead of the
underlying problem.

It is particularly annoying to see how willingly folks will "dismiss"
problems THAT THEY, THEMSELVES, WITNESSED if they are intermittent (not
EASY to track down).

I designed a LORAN-C-based autopilot early in my career. In our first sea
trial of the prototype (a piece of perf board bolted to a chunk of
lumber!), we defined a course to circumnavigate Cape Cod (Massachusetts)
<https://en.wikipedia.org/wiki/Cape_Cod> via a set of "waypoints"
designed to allow us to hug the coastline.

[The autopilot worked by letting you enter the latitude and longitude of
these waypoints, in sequence, and it would steer the boat *to* each, in
turn (at the time, most autopilots simply tried to keep a vessel pointed
in a particular direction and couldn't compensate for drift)]

We monitored the boat's course with a position plotter (it took
positional information from a LORAN-C receiver and plotted it on a
chart/map). As expected, the vessel's track was true to the series of
waypoints we'd entered (you use the nominal coordinates of buoys in the
ocean/bay as markers cuz it's hard to tell one wave from another! :> ).

But, there was a very noticeable 'S' in the track at one particular
point. As if the algorithm had "overflowed" and then overcompensated
before eventually returning to its ideal track.

I spent weeks (off-hours) digging through my source code, the floating
point library implementation, etc. in an attempt to understand why this
anomaly was present. Unfortunately, I had no way to log the raw data from
the LORAN receiver; that was the purpose of the plotter's record! Nor any
way to log the actions INTENDED by my device (had it COMMANDED the vessel
to take such a course?).

My boss dismissed the "problem" -- citing the fact that we had stopped
the vessel in that general area to do some deep sea fishing. As such, its
movements were largely at the whim of the ocean current and the MANUAL
compensation that the skipper would occasionally bring to bear (to keep
the waves from turning the vessel sideways, which would leave it
vulnerable to being toppled by larger waves).

I still have that code (I have *everything* I've ever designed/written)
and periodically drag it out hoping to stumble on some OTHER explanation.
I'd much prefer a genuine bug to explain the behavior than HOPE it was as
my boss had suggested!

A colleague had implemented a video game that suffered from an annoying
"problem" -- almost invariably when the player was having a REALLY GOOD
game! Yet, it proved difficult to track down as the game is, by its very
nature, based on "random" events (pseudo-random number generators
governing its actions; non-repeatable interactions with the player,
etc.).

The game was released with the "problem" unresolved (that market is very
aggressively competitive, with narrow windows of "opportunity"). Some
*years* later, the problem was found to be a genuine bug (duh!).
Unfortunate that EVERY unit shipped had that bug. *But*, a strong sense
of closure finally KNOWING the cause!

> Like I said before, past the point of no return. Way past really.