EmbeddedRelated.com

Kicking the dog -- how do you use watchdog timers?

Started by Tim Wescott May 9, 2016
On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> rickman wrote: > >> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>> >>> >>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>> trying to determine why their state machines have wedged. >>> >>> So I'm not sure that's an entirely accurate statement. >> >> Ask them why their FSMs got stuck. In development they may make >> mistakes, but you don't use watchdogs for debugging. In fact they get >> in the way. >> > > Oh, that's easy. Because of either: > > An error in the synchronous logic, leaving it in a defined state with no > way out (20% chance).
That's a system debug thing and actually shouldn't happen at all as there are tools to analyze for it.
> An unsynchronized async input causing a race condition that static > timing couldn't catch (80% chance)
Newbie mistake... that even... uh, experienced designers do once in a while... uh, sometimes... still, it wouldn't make it to a fielded system and so it does not create a need for a watchdog.
> Or a single event upset (0.0001% chance)
SEU is a possibility and in fact is a reason why watchdogs are used on FPGAs in spacecraft. Here on the ground the probability is more like 0.0000000000001 in a year. I didn't actually count the zeros, but it is a *lot*. You will never see it in your lifetime. -- Rick C
On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> rickman wrote: > >> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>> >>> >>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>> trying to determine why their state machines have wedged. >>> >>> So I'm not sure that's an entirely accurate statement. >> >> Ask them why their FSMs got stuck. In development they may make >> mistakes, but you don't use watchdogs for debugging. In fact they get >> in the way. >> > > Oh, that's easy. Because of either: > > An error in the synchronous logic, leaving it in a defined state with no > way out (20% chance). > > An unsynchronized async input causing a race condition that static > timing couldn't catch (80% chance) > > Or a single event upset (0.0001% chance)
I just recalled that when designing FSMs in HDL, there is typically a synthesis option to recognize all unused states and design so they return to the reset condition. This is a good way to deal with SEU issues. It is very hard to prevent a hiccup from SEU, but recovery can be built in. How would you implement a watchdog for an FPGA which likely has many independent FSMs? What would you monitor? -- Rick C
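The recovery idea above can be illustrated with a small behavioural model. This is only a sketch in Python, not synthesizable HDL and not the output of any particular synthesis option: the three-state one-hot machine, the names `State`, `next_state` and `flip_bit`, and the choice of IDLE as the reset state are all assumptions made for illustration.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical one-hot encoding: exactly one bit set in each legal state.
    IDLE = 0b001
    BUSY = 0b010
    DONE = 0b100


LEGAL = {int(s) for s in State}


def next_state(raw, start, finished):
    """Return the next value of the state register given its current raw bits."""
    raw = int(raw)
    if raw not in LEGAL:
        # The "safe" branch: every unused encoding returns to the reset state.
        return State.IDLE
    state = State(raw)
    if state is State.IDLE:
        return State.BUSY if start else State.IDLE
    if state is State.BUSY:
        return State.DONE if finished else State.BUSY
    return State.IDLE  # DONE always returns to IDLE


def flip_bit(raw, bit):
    """Model a single event upset by toggling one bit of the state register."""
    return int(raw) ^ (1 << bit)


if __name__ == "__main__":
    reg = flip_bit(State.BUSY, 0)  # upset: 0b010 becomes 0b011, not a one-hot code
    print(f"after upset: {reg:03b}")
    reg = next_state(reg, start=False, finished=False)
    print(f"one clock later: {State(reg).name}")
```

With a one-hot encoding any single bit flip produces an illegal code, so this branch catches it. With a dense binary encoding a flip can land in another legal state, and this mechanism alone would not notice.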
On Tue, 10 May 2016 15:36:07 +0200, o pere o <me@somewhere.net> wrote:

>On 09/05/16 19:06, Tim Wescott wrote: >> Randy Yates recently started a thread on programming flash that had an >> interesting tangent into watchdog timers. I thought it was interesting >> enough that I'm starting a thread here. >> >> I had stated in Randy's thread that I avoid watchdogs, because they >> mostly seem to be a source of erroneous behavior to me. >> >> However, on reflection I realized that I lied: I _do_ use watchdog >> timers, but not automatically. To date I've only used them when the >> processor is spinning a motor that might crash into something or >> otherwise engage in damaging behavior if the processor goes nuts. >> >> In general, my rule on watchdogs, as with any other feature, is "use it >> if using it is better", which means that I think about the consequences >> of the thing popping off when I don't want it to (as during a code update >> or during development when I hit a breakpoint) vs. the consequences of >> not having the thing when the processor goes haywire. >> >> Furthermore, if I use a watchdog I don't just treat updating the thing as >> a requirement check-box -- so you won't find a timer ISR in my code that >> unconditionally kicks the dog. Instead, I'll usually have just one task >> (the motor control one, on most of my stuff) kick the dog when it feels >> it's operating correctly. If I've got more than one critical task (i.e., >> if I'm running more than one motor out of one processor) I'll have a low- >> priority built-in-test task that kicks the dog, but only if it's getting >> periodic assurances of health from the (multiple) critical tasks. >> >> Generally, in my systems, the result of the watchdog timer popping off is >> that the system will no longer work quite correctly, but it will operate >> safely. >> >> So -- what do you do with watchdogs, and how, and why? Always use 'em? >> Never use 'em? Use 'em because the boss says so, but twiddle them in a >> "last part to break" bit of code? 
>> >> Would you use a watchdog in a fly-by-wire system? A pacemaker? Why? >> Why not? Could you justify _not_ using a watchdog in the top-level >> processor of a Mars rover or a satellite? >> > >Quoting Tim Williams' book "The most cost-effective way to ensure the >reliability of a microprocessor-based product is to accept that the >program (or data or both, my addition) *will* occasionally be corrupted, >and to provide a means whereby the program flow can be automatically >recovered, preferably transparently to the user. This is the function of >the microprocessor watchdog." > >So, the whole thing is what to do "when" (not "if") shit (the >unexpected) happens. > >Pere
Who is Tim Williams? I had to do some googling to find out what he actually said.

I still maintain that watchdog timers are only required in high-radiation environments, in which humans would start to get radiation sickness, or at least cancer in the long run.

Old electronic systems have been working for a decade or two without a reboot. I have maintained some computer systems that were designed to go through a thermal cycle every year. If I forgot to do the thermal cycling one year, I did the system restart the next year or the year after that; no big problem.
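For reference, the supervised kick described in Tim Wescott's opening post (quoted above), where a low-priority task services the watchdog only while it keeps receiving assurances of health from the critical tasks, might look roughly like the following. It is a sketch only: the class name `HealthMonitor`, the `kick` callable standing in for the hardware watchdog service routine, and the task names in the usage note are hypothetical and do not come from the thread.

```python
import time


class HealthMonitor:
    """Supervisor that kicks the watchdog only if every critical task checked in recently."""

    def __init__(self, task_names, max_silence_s, kick, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._kick = kick      # callable that services the hardware watchdog (assumed)
        self._clock = clock    # injectable so the logic can be exercised without waiting
        now = clock()
        self._last_seen = {name: now for name in task_names}

    def report_healthy(self, name):
        """Called by a critical task when it believes it is operating correctly."""
        if name not in self._last_seen:
            raise KeyError(f"unknown task: {name}")
        self._last_seen[name] = self._clock()

    def poll(self):
        """Run periodically from the low-priority task; returns True if the dog was kicked."""
        now = self._clock()
        if all(now - seen <= self._max_silence_s for seen in self._last_seen.values()):
            self._kick()
            return True
        return False  # withhold the kick and let the hardware timer expire
```

In use, each motor-control task would call `report_healthy("motor_a")` from its own loop, and the built-in-test task would call `poll()` at a rate comfortably faster than the watchdog timeout. There is deliberately no unconditional kick anywhere, in particular not in a timer ISR.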
On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:

> On 5/10/2016 12:48 PM, Rob Gaddi wrote: >> rickman wrote: >> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>>> >>>> >>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>>> trying to determine why their state machines have wedged. >>>> >>>> So I'm not sure that's an entirely accurate statement. >>> >>> Ask them why their FSMs got stuck. In development they may make >>> mistakes, but you don't use watchdogs for debugging. In fact they get >>> in the way. >>> >>> >> Oh, that's easy. Because of either: >> >> An error in the synchronous logic, leaving it in a defined state with >> no way out (20% chance). > > That's a system debug thing and actually shouldn't happen at all as > there are tools to analyze for it. > > >> An unsynchronized async input causing a race condition that static >> timing couldn't catch (80% chance) > > Newbie mistake... that even... uh, experienced designers do once in a > while... uh, sometimes... still, it wouldn't make it to a fielded > system and so is does not create a need for a watchdog.
Here's something from a comp.arch.fpga post I made in 2003: "When I was at Agilent I analysed the causes of failures in some FPGA developments. About half of all FPGA design related bugs (weighted by the time spent finding them) were associated with asynchronous logic and clock domain crossings. [snip] 0% of the clock domain crossing bugs had anything to do with metastability. Glitches and races were the cause." Regards, Allan
On 5/10/2016 7:11 PM, Allan Herriman wrote:
> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote: > >> On 5/10/2016 12:48 PM, Rob Gaddi wrote: >>> rickman wrote: >>> >>>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>>>> >>>>> >>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>>>> trying to determine why their state machines have wedged. >>>>> >>>>> So I'm not sure that's an entirely accurate statement. >>>> >>>> Ask them why their FSMs got stuck. In development they may make >>>> mistakes, but you don't use watchdogs for debugging. In fact they get >>>> in the way. >>>> >>>> >>> Oh, that's easy. Because of either: >>> >>> An error in the synchronous logic, leaving it in a defined state with >>> no way out (20% chance). >> >> That's a system debug thing and actually shouldn't happen at all as >> there are tools to analyze for it. >> >> >>> An unsynchronized async input causing a race condition that static >>> timing couldn't catch (80% chance) >> >> Newbie mistake... that even... uh, experienced designers do once in a >> while... uh, sometimes... still, it wouldn't make it to a fielded >> system and so is does not create a need for a watchdog. > > > Here's something from a comp.arch.fpga post I made in 2003: > > "When I was at Agilent I analysed the causes of failures in some FPGA > developments. > > About half of all FPGA design related bugs (weighted by the time spent > finding them) were associated with asynchronous logic and clock domain > crossings. [snip] 0% of the clock domain crossing bugs had anything to > do with metastability. Glitches and races were the cause."
Geeze, that just shouldn't happen. I'm not sure what they mean by "asynchronous logic" as real asynchronous logic is almost never used in FPGAs. Clock domain crossing is well understood so there is no reason to not get it right. It's the kind of thing that normally gets a big, red flag at design time and so is done correctly. -- Rick C
On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:

> On 5/10/2016 12:48 PM, Rob Gaddi wrote: >> rickman wrote: >> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>>> >>>> >>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>>> trying to determine why their state machines have wedged. >>>> >>>> So I'm not sure that's an entirely accurate statement. >>> >>> Ask them why their FSMs got stuck. In development they may make >>> mistakes, but you don't use watchdogs for debugging. In fact they get >>> in the way. >>> >>> >> Oh, that's easy. Because of either: >> >> An error in the synchronous logic, leaving it in a defined state with >> no way out (20% chance). >> >> An unsynchronized async input causing a race condition that static >> timing couldn't catch (80% chance) >> >> Or a single event upset (0.0001% chance) > > I just recalled that when designing FSMs in HDL, there is typically a > synthesis option to recognize all unused states and design so they > return to the reset condition. This is a good way to deal with SEU > issues. It is very hard to prevent a hiccup from SEU, but recovery can > be built in.
Please don't expect that illegal state coverage will make your FSM reliable. That will only help with illegal states, but illegal states aren't the only causes of lockups.

Consider FSMs in two systems (perhaps on the same chip) talking to each other with some handshaking. There's a state that waits for a handshake signal from the other system. If both FSMs get in that state (from any cause: glitch, SEU, coding bug), the system will lock up.

You should be able to see how a watchdog would help with that. The watchdog could be built into the FSM, or it could sit to the side and reset the whole FSM.
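The lockup described here involves only legal states, which a toy cycle-by-cycle model makes easy to see. The sketch below is Python rather than HDL, and everything in it is an assumption for illustration: the two-state master and slave, the state names, the injected fault, and the "sit to the side" watchdog that resets both machines after `timeout` cycles without a completed transfer.

```python
def step(master, slave):
    """Advance both FSMs by one clock; outputs are derived from the current states."""
    req = master == "SEND"   # master pulses req for one cycle
    ack = slave == "ACK"     # slave pulses ack for one cycle
    if master == "SEND":
        next_master = "WAIT"
    else:
        next_master = "SEND" if ack else "WAIT"
    if slave == "LISTEN":
        next_slave = "ACK" if req else "LISTEN"
    else:
        next_slave = "LISTEN"
    return next_master, next_slave, ack  # ack marks a completed transfer


def run(cycles, fault_at=None, timeout=None):
    """Simulate; returns (completed transfers, watchdog resets)."""
    master, slave = "SEND", "LISTEN"
    idle_cycles = transfers = resets = 0
    for cycle in range(cycles):
        if cycle == fault_at:
            # Injected upset: both sides now wait for each other, all states legal.
            master, slave = "WAIT", "LISTEN"
        master, slave, done = step(master, slave)
        if done:
            transfers += 1
            idle_cycles = 0
        else:
            idle_cycles += 1
            if timeout is not None and idle_cycles >= timeout:
                master, slave = "SEND", "LISTEN"  # watchdog resets both FSMs
                idle_cycles = 0
                resets += 1
    return transfers, resets


if __name__ == "__main__":
    print("no fault:          ", run(20))
    print("fault, no watchdog:", run(20, fault_at=5))
    print("fault, watchdog:   ", run(20, fault_at=5, timeout=4))
```

Without the watchdog, transfers stop for good after the fault. With it, the pair loses a few cycles and then resumes. Illegal-state recovery would not help in this case, because WAIT and LISTEN are both valid encodings.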
> How would you implement a watchdog for an FPGA which likely has many > independent FSMs? What would you monitor?
Firstly, I create an architecture that doesn't have many interlocking FSMs. Significant parts of my design (particularly in the datapath) will not have any FSMs at all, and hence, no chance of FSM lockups.

Then I consider each FSM independently. If possible, I make it inherently crashproof. If not, I may add a watchdog timer. Sometimes I will add a circuit that looks for bad signatures (e.g. unusual FIFO depths) instead.

A recent example from a system I was designing for a client: The Xilinx transceivers need to be reset in a particular sequence to work properly (particularly at the higher data rates, e.g. > 10Gb/s). These transceivers don't have a lock output that works reliably (thanks Xilinx!). Instead, one must go to the next highest protocol layer (e.g. (Ethernet) PCS level) to monitor that protocol's sync to determine whether the transceiver is working. I coded a watchdog timer that would reset the transceiver if it hadn't seen PCS sync for a certain time. I can't get it to fail now.

Regards,
Allan
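A watchdog of the kind described in the transceiver example is essentially a counter that is cleared by the sync indication. The following is a behavioural sketch in Python, not the actual design: the class name `LinkWatchdog`, the tick-based timeout and the single-tick reset request are assumptions, and a real implementation would be written in HDL and would have to follow the vendor's reset sequence.

```python
class LinkWatchdog:
    """Counts ticks without PCS sync and requests a transceiver reset on timeout."""

    def __init__(self, timeout_ticks):
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks
        self.count = 0

    def tick(self, pcs_sync):
        """Advance one tick; return True when the transceiver reset should be pulsed."""
        if pcs_sync:
            self.count = 0
            return False
        self.count += 1
        if self.count >= self.timeout_ticks:
            # Restart the timer so the reset sequence gets a full timeout to achieve sync.
            self.count = 0
            return True
        return False


if __name__ == "__main__":
    wd = LinkWatchdog(timeout_ticks=3)
    sync_trace = [True, False, False, True, False, False, False, False]
    print([wd.tick(s) for s in sync_trace])
```

The timeout has to be longer than the worst-case time the link needs to regain sync after a reset. Otherwise the watchdog would keep resetting a transceiver that was about to come up.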
On Wednesday, 11 May 2016 at 01:59:02 UTC+2, Allan Herriman wrote:
> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote: > > > On 5/10/2016 12:48 PM, Rob Gaddi wrote: > >> rickman wrote: > >> > >>> On 5/9/2016 5:13 PM, Tim Wescott wrote: > >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: > >>>> > >>>> > >>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) > >>>> trying to determine why their state machines have wedged. > >>>> > >>>> So I'm not sure that's an entirely accurate statement. > >>> > >>> Ask them why their FSMs got stuck. In development they may make > >>> mistakes, but you don't use watchdogs for debugging. In fact they get > >>> in the way. > >>> > >>> > >> Oh, that's easy. Because of either: > >> > >> An error in the synchronous logic, leaving it in a defined state with > >> no way out (20% chance). > >> > >> An unsynchronized async input causing a race condition that static > >> timing couldn't catch (80% chance) > >> > >> Or a single event upset (0.0001% chance) > > > > I just recalled that when designing FSMs in HDL, there is typically a > > synthesis option to recognize all unused states and design so they > > return to the reset condition. This is a good way to deal with SEU > > issues. It is very hard to prevent a hiccup from SEU, but recovery can > > be built in. > > Please don't expect that illegal state coverage will make your FSM > reliable. That will only help with illegal states, but illegal states > aren't the only causes of lockups. > > Consider FSMs in two systems (perhaps on the same chip) talking to each > other with some handshaking. There's a state that waits for a handshake > signal from the other system. If both FSMs get in that state (from any > cause: glitch, SEU, coding bug), the system will lock up. > > You should be able to see how a watchdog would help with that. The > watchdog could be built into the FSM, or it could sit to the side and > reset the whole FSM. >
A watchdog on the AXI bus would be nice. It is easy to reconfigure the programmable logic in a Zynq, but if you have stuff on the bus you have to make absolutely sure no software is accessing it, because that will halt the whole system and only a reset will recover from that.

-Lasse
On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:

> On 5/10/2016 7:11 PM, Allan Herriman wrote: >> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote: >> >>> On 5/10/2016 12:48 PM, Rob Gaddi wrote: >>>> rickman wrote: >>>> >>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>>>>> >>>>>> >>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>>>>> trying to determine why their state machines have wedged. >>>>>> >>>>>> So I'm not sure that's an entirely accurate statement. >>>>> >>>>> Ask them why their FSMs got stuck. In development they may make >>>>> mistakes, but you don't use watchdogs for debugging. In fact they >>>>> get in the way. >>>>> >>>>> >>>> Oh, that's easy. Because of either: >>>> >>>> An error in the synchronous logic, leaving it in a defined state with >>>> no way out (20% chance). >>> >>> That's a system debug thing and actually shouldn't happen at all as >>> there are tools to analyze for it. >>> >>> >>>> An unsynchronized async input causing a race condition that static >>>> timing couldn't catch (80% chance) >>> >>> Newbie mistake... that even... uh, experienced designers do once in a >>> while... uh, sometimes... still, it wouldn't make it to a fielded >>> system and so is does not create a need for a watchdog. >> >> >> Here's something from a comp.arch.fpga post I made in 2003: >> >> "When I was at Agilent I analysed the causes of failures in some FPGA >> developments. >> >> About half of all FPGA design related bugs (weighted by the time spent >> finding them) were associated with asynchronous logic and clock domain >> crossings. [snip] 0% of the clock domain crossing bugs had anything >> to do with metastability. Glitches and races were the cause." > > Geeze, that just shouldn't happen. I'm not sure what they mean by > "asynchronous logic" as real asynchronous logic is almost never used in > FPGAs. Clock domain crossing is well understood so there is no reason > to not get it right. 
It's the kind of thing that normally gets a big, > red flag at design time and so is done correctly.
I would not say that clock domain crossings are /well/ understood by beginners, or even moderately experienced designers.

BTW, I weighted the results with the time taken to find the bugs. There weren't that many bugs, it's just that they took a long time to find compared with straightforward functional bugs.

Many of the bugs were caused by integrating IP (written elsewhere) and it wasn't always obvious to the designers that signals were crossing clock domains.

Some of the bugs were created by the tools, e.g. when they replicated logic. That makes the bugs hard to find during source code review. It's actually better to review the post-synth netlist than the source code. (Better still to use an automated tool to do it.)

[From a 2008 c.a.f post of mine] here's a list of the sort of things that could go wrong. Please bear in mind that this list is historical (i.e. it was based on experience with older FPGA families and older tools, in a job I left over a decade ago.).

- (race) Passing vectors (i.e. multiple signals) from clock domain A to clock domain B and expecting all the bits to arrive on the same B clock.

- (race) As above, but adding multiple banks of retiming flip flops in the B clock domain, which fixed the (non-existent) metastability issue but did nothing about the race.

- (race) Passing a signal in clock domain A to multiple flip flops in clock domain B, and expecting the B flip flops to get the same value on the same clock.

- (race) As above, but created when the tools replicate the B logic to manage fanout.

- (glitch) Multiple signals in clock domain A hit some combinatorial logic producing a single signal which is sampled by a flip flop in clock domain B. Sometimes there may be a glitch which gets sampled by the B flip flop. It can be difficult to design combinatorial logic with good glitch coverage (and if you do, the tools will often remove it). (See XAPP 024, btw.)

- (glitch) Clock multiplexers made out of combinatorial logic with inadequate glitch coverage (or adequate glitch coverage removed by the tools).

- Using async reset or set inputs on flip flops to implement a logic function (rather than just using them for initialisation). I can remember a case where a design would fail even when we could prove mathematically that it couldn't fail. Rewriting it to avoid the use of async resets fixed the problem.

- Gating clocks to create a logic function. I know this sort of thing is done in ASICs to save power, but it just doesn't seem to work too well in FPGAs sometimes.

Regards,
Allan
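The first two race items in the list can be made concrete by enumerating what a receiving register might capture while a multi-bit value is changing. This is a sketch under a deliberately crude assumption: each changing bit arrives independently early or late relative to the B clock edge, so any mix of old and new bits is possible. The helper names are made up for illustration, and the Gray-code comparison is a common textbook remedy that is not mentioned in the thread.

```python
from itertools import combinations


def possible_samples(old, new, width):
    """All values that could be captured while `old` changes to `new`,
    assuming every changing bit may independently arrive before or after the edge."""
    changing = [b for b in range(width) if ((old ^ new) >> b) & 1]
    results = set()
    for k in range(len(changing) + 1):
        for subset in combinations(changing, k):
            value = old
            for b in subset:
                value ^= 1 << b  # this bit has already switched to its new value
            results.add(value)
    return results


def to_gray(n):
    """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
    return n ^ (n >> 1)


if __name__ == "__main__":
    binary = possible_samples(7, 8, width=4)
    gray = possible_samples(to_gray(7), to_gray(8), width=4)
    print("binary 7 to 8:", sorted(binary))
    print("gray   7 to 8:", sorted(gray))
```

Going from 7 to 8 in plain binary flips all four bits, so under this model the receiver can see any 4-bit value, including ones that were never sent. Extra retiming flip flops in the B domain only delay that wrong value; they do not remove it. When only one bit changes per step, the captured value is always either the old one or the new one.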
On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:

> Den onsdag den 11. maj 2016 kl. 01.59.02 UTC+2 skrev Allan Herriman: >> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote: >> >> > On 5/10/2016 12:48 PM, Rob Gaddi wrote: >> >> rickman wrote: >> >> >> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >> >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >> >>>> >> >>>> >> >>>> I've spent lab time next to unhappily cursing FPGA guys (good >> >>>> ones) >> >>>> trying to determine why their state machines have wedged. >> >>>> >> >>>> So I'm not sure that's an entirely accurate statement. >> >>> >> >>> Ask them why their FSMs got stuck. In development they may make >> >>> mistakes, but you don't use watchdogs for debugging. In fact they >> >>> get in the way. >> >>> >> >>> >> >> Oh, that's easy. Because of either: >> >> >> >> An error in the synchronous logic, leaving it in a defined state >> >> with no way out (20% chance). >> >> >> >> An unsynchronized async input causing a race condition that static >> >> timing couldn't catch (80% chance) >> >> >> >> Or a single event upset (0.0001% chance) >> > >> > I just recalled that when designing FSMs in HDL, there is typically a >> > synthesis option to recognize all unused states and design so they >> > return to the reset condition. This is a good way to deal with SEU >> > issues. It is very hard to prevent a hiccup from SEU, but recovery >> > can be built in. >> >> Please don't expect that illegal state coverage will make your FSM >> reliable. That will only help with illegal states, but illegal states >> aren't the only causes of lockups. >> >> Consider FSMs in two systems (perhaps on the same chip) talking to each >> other with some handshaking. There's a state that waits for a >> handshake signal from the other system. If both FSMs get in that state >> (from any cause: glitch, SEU, coding bug), the system will lock up. >> >> You should be able to see how a watchdog would help with that. 
The >> watchdog could be built into the FSM, or it could sit to the side and >> reset the whole FSM. >> >> > A watchdog on the AXI bus would nice, it is easy to reconfigure the > programmable logic in a Zynq but if you have stuff on bus you have to > make absolutely sure no software is accessing that because it will halt > the whole system and only a reset will recover from that
We leave the ARM watchdog permanently enabled in our Zynq systems for that very reason. If code (via e.g. a wrong pointer) accesses an address without an AXI address decode, it will hang the AXI, and hence the whole box. With a watchdog, it will reboot (perhaps to fail again, perhaps not). Regards, Allan
On 5/10/2016 8:50 PM, Allan Herriman wrote:
> On Tue, 10 May 2016 19:54:18 -0400, rickman wrote: > >> On 5/10/2016 7:11 PM, Allan Herriman wrote: >>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote: >>> >>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote: >>>>> rickman wrote: >>>>> >>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote: >>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote: >>>>>>> >>>>>>> >>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones) >>>>>>> trying to determine why their state machines have wedged. >>>>>>> >>>>>>> So I'm not sure that's an entirely accurate statement. >>>>>> >>>>>> Ask them why their FSMs got stuck. In development they may make >>>>>> mistakes, but you don't use watchdogs for debugging. In fact they >>>>>> get in the way. >>>>>> >>>>>> >>>>> Oh, that's easy. Because of either: >>>>> >>>>> An error in the synchronous logic, leaving it in a defined state with >>>>> no way out (20% chance). >>>> >>>> That's a system debug thing and actually shouldn't happen at all as >>>> there are tools to analyze for it. >>>> >>>> >>>>> An unsynchronized async input causing a race condition that static >>>>> timing couldn't catch (80% chance) >>>> >>>> Newbie mistake... that even... uh, experienced designers do once in a >>>> while... uh, sometimes... still, it wouldn't make it to a fielded >>>> system and so is does not create a need for a watchdog. >>> >>> >>> Here's something from a comp.arch.fpga post I made in 2003: >>> >>> "When I was at Agilent I analysed the causes of failures in some FPGA >>> developments. >>> >>> About half of all FPGA design related bugs (weighted by the time spent >>> finding them) were associated with asynchronous logic and clock domain >>> crossings. [snip] 0% of the clock domain crossing bugs had anything >>> to do with metastability. Glitches and races were the cause." >> >> Geeze, that just shouldn't happen. I'm not sure what they mean by >> "asynchronous logic" as real asynchronous logic is almost never used in >> FPGAs. 
Clock domain crossing is well understood so there is no reason >> to not get it right. It's the kind of thing that normally gets a big, >> red flag at design time and so is done correctly. > > > I would not say that clock domain crossings are /well/ understood by > beginners, or even moderately experienced designers.
Why would a beginner be designing a system without supervision? As I said, this is the sort of issue that gets a red flag and lots of attention in a design review.
> BTW, I weighted the results with the time taken to find the bugs. > There weren't that many bugs, it's just that they took a long time to > find compared with straightforward functional bugs.
That's why they get lots of attention up front rather than after they are a problem.
> Many of the bugs were caused by integrating IP (written elsewhere) and it > wasn't always obvious to the designers that signals were crossing clock > domains.
That's exactly what happened to me. A simple UART needed a FF at the data in port. I had designed the UART and didn't document that detail. I used it later in a test fixture and forgot to include the I/O FF. It bit me hard as I was writing the software that talked to this port and kept thinking the flaw was software.
> Some of the bugs were created by the tools, e.g. when they replicated > logic. That makes the bugs hard to find during source code review. It's > actually better to review the post-synth netlist than the source code. > (Better still to use an automated tool to do it.)
Not sure how that happens. Are you saying a design with 1 FF and many destinations had an async input? That alone is a no-no. If there had been a FF in front to remove metastability it would have provided protection from async inputs (race) when the second FF was replicated.
> [From a 2008 c.a.f post of mine] here's a list of the sort of things that > could go wrong. Please bear in mind that this list is historical (i.e. > it was based on experience with older FPGA families and older tools, in a > job I left over a decade ago.). > > > - (race) Passing vectors (i.e. multiple signals) from clock domain A > to clock domain B and expecting all the bits to arrive on the same B > clock. > > - (race) As above, but adding multiple banks of retiming flip flops in > the B clock domain, which fixed the (non-existent) metastability issue > but did nothing about the race. > > - (race) Passing a signal in clock domain A to multiple flip flops in > clock domain B, and expecting the B flip flops to get the same value > on the same clock. > > - (race) As above, but created when the tools replicate the B logic to > manage fanout. > > - (glitch) Multiple signals in clock domain A hit some combinatorial > logic producing a single signal which is sampled by a flip flop in > clock domain B. Sometimes there may be a glitch which gets sampled by > the B flip flop. > It can be difficult to design combinatorial logic with good glitch > coverage (and if you do, the tools will often remove it). (See XAPP > 024, btw.) > > - (glitch) Clock multiplexers made out of combinatorial logic with > inadequate glitch coverage (or adequate glitch coverage removed by the > tools). > > - Using async reset or set inputs on flip flops to implement a logic > function (rather than just using them for initialisation). I can > remember a case where a design would fail even when we could prove > mathematically that it couldn't fail. Rewriting it to avoid the use > of async resets fixed the problem. > > - Gating clocks to create a logic function. I know this sort of thing > is done in ASICs to save power, but it just doesn't seem to work too > well in FPGAs sometimes.
All of these issues are known bad practice. Messing with the clocks or using async inputs on FFs is an especially bad practice. I thought that ended in the 90s.

I did my first FPGA design in '95. I've always wondered how good that code was. I took some training on the Orcad schematic tools for FPGA design and learned about VHDL in one day. lol That made me the resident expert! I had to deal with changing compilers twice in the project, so I learned something about making your code portable very early. But I knew little about clock domain crossings and metastability. I guess I knew about race conditions though. That was not uncommon in discrete logic design which I had done.

Regardless, I don't see any reason to use a watchdog timer with an FPGA design unless you have SEU issues. A proper code review with experienced designers will catch all of the above problems. Adding a bandaid is not the solution when there is no reason to not have a good, clean system in an FPGA.

The reason why watchdogs are used with software is because software has so many more interactions and opportunities for something to screw up. Using an inherently serial processor to do multitasking is prone to problems with complex interactions. Clock domain crossing in FPGAs is similarly complex, but nearly always much more limited in scope, so much easier to focus on to resolve all the details and get right.

I just don't buy the need for a watchdog with an FPGA. Have you ever seen one used that didn't involve SEU?

-- Rick C
