Kicking the dog -- how do you use watchdog timers?

Started by Tim Wescott May 9, 2016
On Wednesday, 11 May 2016 at 02:58:08 UTC+2, Allan Herriman wrote:
> On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:
>
> > On Wednesday, 11 May 2016 at 01:59:02 UTC+2, Allan Herriman wrote:
> >> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:
> >>
> >> > On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> >> >> rickman wrote:
> >> >>
> >> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
> >> >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
> >> >>>>
> >> >>>>
> >> >>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
> >> >>>> trying to determine why their state machines have wedged.
> >> >>>>
> >> >>>> So I'm not sure that's an entirely accurate statement.
> >> >>>
> >> >>> Ask them why their FSMs got stuck. In development they may make
> >> >>> mistakes, but you don't use watchdogs for debugging. In fact they
> >> >>> get in the way.
> >> >>>
> >> >>>
> >> >> Oh, that's easy. Because of either:
> >> >>
> >> >> An error in the synchronous logic, leaving it in a defined state
> >> >> with no way out (20% chance).
> >> >>
> >> >> An unsynchronized async input causing a race condition that static
> >> >> timing couldn't catch (80% chance)
> >> >>
> >> >> Or a single event upset (0.0001% chance)
> >> >
> >> > I just recalled that when designing FSMs in HDL, there is typically a
> >> > synthesis option to recognize all unused states and design so they
> >> > return to the reset condition. This is a good way to deal with SEU
> >> > issues. It is very hard to prevent a hiccup from SEU, but recovery
> >> > can be built in.
> >>
> >> Please don't expect that illegal state coverage will make your FSM
> >> reliable. That will only help with illegal states, but illegal states
> >> aren't the only causes of lockups.
> >>
> >> Consider FSMs in two systems (perhaps on the same chip) talking to each
> >> other with some handshaking. There's a state that waits for a
> >> handshake signal from the other system. If both FSMs get in that state
> >> (from any cause: glitch, SEU, coding bug), the system will lock up.
> >>
> >> You should be able to see how a watchdog would help with that. The
> >> watchdog could be built into the FSM, or it could sit to the side and
> >> reset the whole FSM.
> >>
> >>
> > A watchdog on the AXI bus would be nice, it is easy to reconfigure the
> > programmable logic in a Zynq but if you have stuff on bus you have to
> > make absolutely sure no software is accessing that because it will halt
> > the whole system and only a reset will recover from that
>
> We leave the ARM watchdog permanently enabled in our Zynq systems for
> that very reason.
> If code (via e.g. a wrong pointer) accesses an address without an AXI
> address decode, it will hang the AXI, and hence the whole box.
Yep, but it would be nice if there were an option to get something like an access-denied error and handle it from there, instead of resetting the whole system. -Lasse
On 5/10/2016 7:58 PM, Allan Herriman wrote:
> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:
>
>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>> rickman wrote:
>>>
>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>
>>>>>
>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>>> trying to determine why their state machines have wedged.
>>>>>
>>>>> So I'm not sure that's an entirely accurate statement.
>>>>
>>>> Ask them why their FSMs got stuck. In development they may make
>>>> mistakes, but you don't use watchdogs for debugging. In fact they get
>>>> in the way.
>>>>
>>>>
>>> Oh, that's easy. Because of either:
>>>
>>> An error in the synchronous logic, leaving it in a defined state with
>>> no way out (20% chance).
>>>
>>> An unsynchronized async input causing a race condition that static
>>> timing couldn't catch (80% chance)
>>>
>>> Or a single event upset (0.0001% chance)
>>
>> I just recalled that when designing FSMs in HDL, there is typically a
>> synthesis option to recognize all unused states and design so they
>> return to the reset condition. This is a good way to deal with SEU
>> issues. It is very hard to prevent a hiccup from SEU, but recovery can
>> be built in.
>
> Please don't expect that illegal state coverage will make your FSM
> reliable. That will only help with illegal states, but illegal states
> aren't the only causes of lockups.
>
> Consider FSMs in two systems (perhaps on the same chip) talking to each
> other with some handshaking. There's a state that waits for a handshake
> signal from the other system. If both FSMs get in that state (from any
> cause: glitch, SEU, coding bug), the system will lock up.
Coding bug??? Not sure where you are getting this. I specifically excluded SEU because that is one situation that can cause problems with *any* design in unpredictable ways. Otherwise this is a system design issue. If your system is subject to "glitches" then a means should be designed into the handshake to resolve timeouts. Resetting the entire FPGA or board shouldn't be necessary.
> You should be able to see how a watchdog would help with that. The
> watchdog could be built into the FSM, or it could sit to the side and
> reset the whole FSM.
You are talking about a specified timeout on a communications protocol, not a watchdog.
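The distinction being argued here (a timeout designed into the handshake, as opposed to a reset of the whole device) can be illustrated with a small behavioural model. This is only a sketch in Python, not HDL from any poster; the class `HandshakeFsm`, the `timeout_ticks` parameter and the `force_wait` glitch model are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

IDLE, WAIT_ACK = "IDLE", "WAIT_ACK"


@dataclass
class HandshakeFsm:
    """One side of a handshake; it waits for an ack that may never come."""
    timeout_ticks: int      # 0 disables the timeout (hypothetical knob)
    state: str = IDLE
    waited: int = 0
    timeouts: int = 0

    def force_wait(self):
        """Model a glitch, SEU or bug dropping the FSM into the wait state."""
        self.state = WAIT_ACK
        self.waited = 0

    def tick(self, ack_seen):
        """Advance one clock tick."""
        if self.state != WAIT_ACK:
            return
        if ack_seen:
            self.state, self.waited = IDLE, 0
            return
        self.waited += 1
        # The timeout is local to this FSM; nothing else gets reset.
        if self.timeout_ticks and self.waited >= self.timeout_ticks:
            self.timeouts += 1
            self.state, self.waited = IDLE, 0


def run(timeout_ticks, ticks=100):
    """Put both sides into the wait state and report where they end up."""
    a = HandshakeFsm(timeout_ticks)
    b = HandshakeFsm(timeout_ticks)
    a.force_wait()
    b.force_wait()
    for _ in range(ticks):
        # Neither side sends an ack while it is itself waiting.
        a.tick(ack_seen=False)
        b.tick(ack_seen=False)
    return a.state, b.state
```

With the timeout disabled, `run(0)` leaves both sides stuck in `WAIT_ACK`, which is the lockup described in the quoted post. With `run(8)` both sides fall back to `IDLE` on their own, without any wider reset.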
>> How would you implement a watchdog for an FPGA which likely has many
>> independent FSMs? What would you monitor?
>
> Firstly, I create an architecture that doesn't have many interlocking
> FSMs. Significant parts of my design (particularly in the datapath) will
> not have any FSMs at all, and hence, no chance of FSM lockups.
If you can design sequential control logic that doesn't have FSMs, then you are a better man than I am... or you are good at renaming circuits. Everything sequential is an FSM other than simple data registers. A counter is an FSM.
> Then I consider each FSM independently. If possible, I make it
> inherently crashproof. If not, I may add a watchdog timer. Sometimes I
> will add a circuit that looks for bad signatures (e.g. unusual FIFO
> depths) instead.
Yes, defining transitions for every possible state is a good tool, if needed. But by default adding a watchdog timer is overkill, especially when it simply masks a bug rather than exposing it. The Transputer had math instructions that would halt the CPU when an overflow occurred. It sounded crazy at the time, but that is actually preferable to letting an erroneous system continue running. Watchdogs often do the opposite: they let a system continue running in a corrupt way rather than pointing to the bug.
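As a rough illustration of "defining transitions for every possible state", here is a Python model of a 2-bit state register with three legal encodings, where the unused encoding returns to reset. The state names, the encoding and the `flip_bit` upset model are assumptions made for this sketch; a real design would express the same idea in HDL or through a synthesis option.

```python
# Three legal encodings of a 2-bit state register; 0b11 is unused.
RESET, RUN, DONE = 0b00, 0b01, 0b10
LEGAL = {RESET, RUN, DONE}


def next_state(state, start, finished):
    """Next-state function with an explicit transition for every encoding."""
    if state == RESET:
        return RUN if start else RESET
    if state == RUN:
        return DONE if finished else RUN
    if state == DONE:
        return RESET
    # Unused encoding (0b11): recover by returning to reset.
    return RESET


def flip_bit(state, bit):
    """Model a single event upset flipping one bit of the state register."""
    return state ^ (1 << bit)
```

An upset that flips bit 1 of `RUN` lands in the unused encoding `0b11`, and `next_state` brings it back to `RESET` on the next tick. An upset that flips bit 0 of `RESET` lands in `RUN`, which is a legal state, so this mechanism does not notice it. That matches the caveat raised earlier in the thread that illegal-state coverage only helps with illegal states.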
> A recent example from a system I was designing for a client:
>
> The Xilinx transceivers need to be reset in a particular sequence to work
> properly (particularly at the higher data rates, e.g. > 10Gb/s).
> These transceivers don't have a lock output that works reliably (thanks
> Xilinx!). Instead, one must go to the next highest protocol layer (e.g.
> (Ethernet) PCS level) to monitor that protocol's sync to determine
> whether the transceiver is working.
>
> I coded a watchdog timer that would reset the transceiver if it hadn't
> seen PCS sync for a certain time. I can't get it to fail now.
Again, I don't call that a watchdog since it is actually a part of your protocol. A watchdog is used to catch problems you know nothing about when you still want the system to continue to run. In CPUs they reset the system so user intervention isn't required. But it is still a disruption to the user if they are using it at the time. In FPGAs the logic can be designed to not hang. It may require work to do the proper analysis, but it is not just possible, it also saves money in the long run when you don't need to fix difficult-to-find bugs. Bottom line is adding a watchdog to an FPGA to catch unknown problems shows that something is missing from the design process. -- Rick C
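Whatever one chooses to call it, the mechanism described in the quoted example (reset the transceiver if PCS sync has not been seen for a certain time) can be modelled in a few lines. This is a behavioural sketch in Python rather than the poster's actual logic; `SyncWatchdog`, `timeout_ticks` and the `pcs_sync` input are hypothetical names, and the real reset sequence for the transceiver is not modelled.

```python
class SyncWatchdog:
    """Requests a transceiver reset when sync is absent for too long."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_without_sync = 0
        self.resets_issued = 0

    def tick(self, pcs_sync):
        """Advance one tick; return True when a reset is requested."""
        if pcs_sync:
            # Sync seen: the link is considered healthy, restart the count.
            self.ticks_without_sync = 0
            return False
        self.ticks_without_sync += 1
        if self.ticks_without_sync >= self.timeout_ticks:
            # Restart the count so the transceiver gets time to re-sync.
            self.ticks_without_sync = 0
            self.resets_issued += 1
            return True
        return False
```

For example, with `wd = SyncWatchdog(3)`, feeding the sync sequence `True, False, False, False, True` produces a single reset request on the fourth tick.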
On 5/10/2016 8:22 PM, lasselangwadtchristensen@gmail.com wrote:
> A watchdog on the AXI bus would be nice, it is easy to reconfigure the
> programmable logic in a Zynq but if you have stuff on bus you have to
> make absolutely sure no software is accessing that because it will
> halt the whole system and only a reset will recover from that
My first FPGA design was to provide data on the PCI bus through a bus interface chip. Turns out the PCI bus will hang the entire CPU if a handshake is not completed. My very first iteration of the design had a bug in the FSM that locked up the PC. lol It got fixed very quickly. -- Rick C
rickman <gnuarm@gmail.com> writes:
> In FPGAs the logic can be designed to not hang. It may require work
> to do the proper analysis, but it is not just possible, but saves
> money in the long run when you don't need to fix difficult to find
> bugs.
As FPGA's get bigger and the circuits in them get more complicated, don't they face the same combinatorial explosion that big software systems do? There are tons of historical examples of Intel and similar CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big CPU or comparable chip will have an errata list. At some point you may have to accept that bugs are inevitable, and that a reliable system (besides preventing as many bugs as it can) also has to mitigate any remaining ones. Watchdogs are a time tested approach for that purpose.
On 5/10/2016 10:38 PM, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
>> In FPGAs the logic can be designed to not hang. It may require work
>> to do the proper analysis, but it is not just possible, but saves
>> money in the long run when you don't need to fix difficult to find
>> bugs.
>
> As FPGA's get bigger and the circuits in them get more complicated,
> don't they face the same combinatorial explosion that big software
> systems do? There are tons of historical examples of Intel and similar
> CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big
> CPU or comparable chip will have an errata list. At some point you may
> have to accept that bugs are inevitable, and that a reliable system
> (besides preventing as many bugs as it can) also has to mitigate any
> remaining ones. Watchdogs are a time tested approach for that purpose.
How many of those large ASICs with hang bugs had watchdog timers? The bug was a system level design problem, not a logic bug in an FSM. They could be dealt with by a software change, no? Even if they couldn't be dealt with in software, what would a watchdog do? Reset your entire computer/phone/flight nav? There is always the possibility of bugs in FPGAs. But bugs that require the use of a watchdog are a class of bugs that should be shaken out in debug unless the designers are not very good. If they can't find them in debug, they have to be pretty durn infrequent. Do you have any links to descriptions of such bugs? I'm curious. -- Rick C
rickman <gnuarm@gmail.com> writes:

> On 5/10/2016 10:38 PM, Paul Rubin wrote:
>> rickman <gnuarm@gmail.com> writes:
>>> In FPGAs the logic can be designed to not hang. It may require work
>>> to do the proper analysis, but it is not just possible, but saves
>>> money in the long run when you don't need to fix difficult to find
>>> bugs.
>>
>> As FPGA's get bigger and the circuits in them get more complicated,
>> don't they face the same combinatorial explosion that big software
>> systems do? There are tons of historical examples of Intel and similar
>> CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big
>> CPU or comparable chip will have an errata list. At some point you may
>> have to accept that bugs are inevitable, and that a reliable system
>> (besides preventing as many bugs as it can) also has to mitigate any
>> remaining ones. Watchdogs are a time tested approach for that purpose.
>
> How many of those large ASICs with hang bugs had watchdog timers? The
> bug was a system level design problem, not a logic bug in an FSM. They
> could be dealt with by a software change, no? Even if they couldn't
> be dealt with in software, what would a watchdog do? Reset your
> entire computer/phone/flight nav?
>
> There is always the possibility of bugs in FPGAs. But bugs that
> require the use of a watchdog are a class of bugs that should be
> shaken out in debug unless the designers are not very good. If they
> can't find them in debug, they have to be pretty durn infrequent.
>
> Do you have any links to descriptions of such bugs? I'm curious.
A WDT is also not a cure-all. Consider this scenario: a piece of hardware fails, changing the inputs to a piece of code in an unexpected way and causing the code to go into the weeds and the WDT to fire. But after restart, the hardware is still failed and providing unexpected inputs, the same bug occurs again, the WDT fires again, and the processor restarts again. Ad infinitum. So what did this fix? :) -- Randy Yates, DSP/Embedded Firmware Developer Digital Signal Labs http://www.digitalsignallabs.com
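The scenario above can be made concrete with a toy simulation. It is a sketch under simple assumptions: the `hardware_ok` flag stands in for the failed hardware, the code only kicks the watchdog while its inputs are sane, and the function names and tick counts are made up for illustration.

```python
def run_until_watchdog(hardware_ok, timeout_ticks=5, max_ticks=50):
    """Simulate one boot; return True if the watchdog fired."""
    ticks_since_kick = 0
    for _ in range(max_ticks):
        if hardware_ok:
            ticks_since_kick = 0    # main loop runs and kicks the dog
        else:
            ticks_since_kick += 1   # code is in the weeds, never kicks
        if ticks_since_kick >= timeout_ticks:
            return True
    return False


def count_resets(hardware_ok, max_boots=10):
    """Count watchdog resets over a bounded number of boots."""
    resets = 0
    for _ in range(max_boots):
        if not run_until_watchdog(hardware_ok):
            break                   # this boot ran cleanly, stop counting
        resets += 1
    return resets
```

With healthy hardware `count_resets(True)` is 0. With the hardware still failed, `count_resets(False)` returns 10, that is, every one of the 10 simulated boots ends in another watchdog reset.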
rickman <gnuarm@gmail.com> writes:
> How many of those large ASICs with hang bugs had watchdog timers?
I'd expect the WDT to be in the box that the ASIC is deployed in, not in the ASIC itself. That way the box resets if the ASIC locks up.
> Do you have any links to descriptions of such bugs? I'm curious.
https://duckduckgo.com/?q=cpu+lockup+errata
On 5/10/2016 11:48 PM, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
>> How many of those large ASICs with hang bugs had watchdog timers?
>
> I'd expect the WDT to be in the box that the ASIC is deployed in, not in
> the ASIC itself. That way the box resets if the ASIC locks up.
>
>> Do you have any links to descriptions of such bugs? I'm curious.
>
> https://duckduckgo.com/?q=cpu+lockup+errata
I looked at this list and of the first three only one was a lockup of any sort. The LCD Controller in the MPC823 can hang the CPU when the LCD is disabled while in aggressive mode (LAM). This has a very simple fix in software, before disabling the LCD, turn off the LAM. Where is the need for a watchdog of any sort? If an ASIC locks up, won't the CPU be able to figure it out and reset whatever is appropriate? -- Rick C
rickman <gnuarm@gmail.com> writes:
> Where is the need for a watchdog of any sort? If an ASIC locks up,
> won't the CPU be able to figure it out and reset whatever is
> appropriate?
What if the CPU is part of the ASIC?
On 5/10/2016 11:36 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes:
>
>> On 5/10/2016 10:38 PM, Paul Rubin wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>> In FPGAs the logic can be designed to not hang. It may require work
>>>> to do the proper analysis, but it is not just possible, but saves
>>>> money in the long run when you don't need to fix difficult to find
>>>> bugs.
>>>
>>> As FPGA's get bigger and the circuits in them get more complicated,
>>> don't they face the same combinatorial explosion that big software
>>> systems do? There are tons of historical examples of Intel and similar
>>> CPU's (hard silicon, not even FPGA's) locking up due to bugs. Any big
>>> CPU or comparable chip will have an errata list. At some point you may
>>> have to accept that bugs are inevitable, and that a reliable system
>>> (besides preventing as many bugs as it can) also has to mitigate any
>>> remaining ones. Watchdogs are a time tested approach for that purpose.
>>
>> How many of those large ASICs with hang bugs had watchdog timers? The
>> bug was a system level design problem, not a logic bug in an FSM. They
>> could be dealt with by a software change, no? Even if they couldn't
>> be dealt with in software, what would a watchdog do? Reset your
>> entire computer/phone/flight nav?
>>
>> There is always the possibility of bugs in FPGAs. But bugs that
>> require the use of a watchdog are a class of bugs that should be
>> shaken out in debug unless the designers are not very good. If they
>> can't find them in debug, they have to be pretty durn infrequent.
>>
>> Do you have any links to descriptions of such bugs? I'm curious.
>
> A WDT is also not a cure-all. Consider this scenario: a piece of
> hardware fails, changing the inputs to a piece of code in an unexpected
> way and causing the code to go into the weeds and the WDT to fire.
>
> But after restart, the hardware is still failed and providing unexpected
> inputs, the same bug occurs again, the WDT fires again, and the
> processor restarts again. Ad infinitum.
>
> So what did this fix? :)
Shouldn't a processor reset also reset the hardware? -- Rick C
