Kicking the dog -- how do you use watchdog timers?| page 2

Reply by rickman ● May 10, 2016

On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> rickman wrote:
>
>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>
>>>
>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>> trying to determine why their state machines have wedged.
>>>
>>> So I'm not sure that's an entirely accurate statement.
>>
>> Ask them why their FSMs got stuck.  In development they may make
>> mistakes, but you don't use watchdogs for debugging.  In fact they get
>> in the way.
>>
>
> Oh, that's easy.  Because of either:
>
> An error in the synchronous logic, leaving it in a defined state with no
> way out (20% chance).

That's a system debug thing and actually shouldn't happen at all as 
there are tools to analyze for it.

> An unsynchronized async input causing a race condition that static
> timing couldn't catch (80% chance)

Newbie mistake... that even... uh, experienced designers make once in a 
while... uh, sometimes...  still, it wouldn't make it to a fielded 
system and so does not create a need for a watchdog.

> Or a single event upset (0.0001% chance)

SEU is a possibility and in fact is a reason why watchdogs are used on 
FPGAs in spacecraft.  Here on the ground the probability is more like 
0.0000000000001 in a year.  I didn't actually count the zeros, but it is 
a *lot*.  You will never see it in your lifetime.

-- 

Rick C

Reply by rickman ● May 10, 2016

On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> rickman wrote:
>
>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>
>>>
>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>> trying to determine why their state machines have wedged.
>>>
>>> So I'm not sure that's an entirely accurate statement.
>>
>> Ask them why their FSMs got stuck.  In development they may make
>> mistakes, but you don't use watchdogs for debugging.  In fact they get
>> in the way.
>>
>
> Oh, that's easy.  Because of either:
>
> An error in the synchronous logic, leaving it in a defined state with no
> way out (20% chance).
>
> An unsynchronized async input causing a race condition that static
> timing couldn't catch (80% chance)
>
> Or a single event upset (0.0001% chance)

I just recalled that when designing FSMs in HDL, there is typically a 
synthesis option to recognize all unused states and design so they 
return to the reset condition.  This is a good way to deal with SEU 
issues.  It is very hard to prevent a hiccup from SEU, but recovery can 
be built in.
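
As a rough illustration of that recovery idea, here is a small behavioural model in Python (not HDL, and not the output of any synthesis tool). The three-state one-hot machine, the state names, and the helper `flip_bit` are all hypothetical; the only point is that any encoding outside the legal set falls back to the reset state on the next clock.

```python
from enum import IntEnum


class State(IntEnum):
    """Hypothetical one-hot state encoding; IDLE is the reset state."""
    IDLE = 0b001
    BUSY = 0b010
    DONE = 0b100


LEGAL = {s.value for s in State}


def next_state(bits: int, start: bool) -> int:
    """One clock of the FSM, with illegal encodings sent back to reset."""
    if bits not in LEGAL:
        return State.IDLE          # unused/illegal encoding: recover
    if bits == State.IDLE:
        return State.BUSY if start else State.IDLE
    if bits == State.BUSY:
        return State.DONE
    return State.IDLE              # DONE -> IDLE


def flip_bit(bits: int, n: int) -> int:
    """Model a single event upset by inverting bit n of the state register."""
    return bits ^ (1 << n)


# Any single bit flip of a one-hot register gives zero or two bits set,
# so it is always an illegal encoding and is recovered on the next clock.
for s in State:
    for n in range(3):
        assert next_state(flip_bit(s, n), start=False) == State.IDLE
```

Note that this only covers upsets that land in an unused encoding; it says nothing about a machine that is stuck in a legal state.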

How would you implement a watchdog for an FPGA which likely has many 
independent FSMs?  What would you monitor?

-- 

Rick C

Reply by ● May 10, 2016

On Tue, 10 May 2016 15:36:07 +0200, o pere o <me@somewhere.net> wrote:

>On 09/05/16 19:06, Tim Wescott wrote:
>> Randy Yates recently started a thread on programming flash that had an
>> interesting tangent into watchdog timers.  I thought it was interesting
>> enough that I'm starting a thread here.
>>
>> I had stated in Randy's thread that I avoid watchdogs, because they
>> mostly seem to be a source of erroneous behavior to me.
>>
>> However, on reflection I realized that I lied: I _do_ use watchdog
>> timers, but not automatically.  To date I've only used them when the
>> processor is spinning a motor that might crash into something or
>> otherwise engage in damaging behavior if the processor goes nuts.
>>
>> In general, my rule on watchdogs, as with any other feature, is "use it
>> if using it is better", which means that I think about the consequences
>> of the thing popping off when I don't want it to (as during a code update
>> or during development when I hit a breakpoint) vs. the consequences of
>> not having the thing when the processor goes haywire.
>>
>> Furthermore, if I use a watchdog I don't just treat updating the thing as
>> a requirement check-box -- so you won't find a timer ISR in my code that
>> unconditionally kicks the dog.  Instead, I'll usually have just one task
>> (the motor control one, on most of my stuff) kick the dog when it feels
>> it's operating correctly.  If I've got more than one critical task (i.e.,
>> if I'm running more than one motor out of one processor) I'll have a low-
>> priority built-in-test task that kicks the dog, but only if it's getting
>> periodic assurances of health from the (multiple) critical tasks.
>>
>> Generally, in my systems, the result of the watchdog timer popping off is
>> that the system will no longer work quite correctly, but it will operate
>> safely.
>>
>> So -- what do you do with watchdogs, and how, and why?  Always use 'em?
>> Never use 'em?  Use 'em because the boss says so, but twiddle them in a
>> "last part to break" bit of code?
>>
>> Would you use a watchdog in a fly-by-wire system?  A pacemaker?  Why?
>> Why not?  Could you justify _not_ using a watchdog in the top-level
>> processor of a Mars rover or a satellite?
>>
>
>Quoting Tim Williams' book "The most cost-effective way to ensure the 
>reliability of a microprocessor-based product is to accept that the 
>program (or data or both, my addition) *will* occasionally be corrupted, 
>and to provide a means whereby the program flow can be automatically 
>recovered, preferably transparently to the user. This is the function of 
>the microprocessor watchdog."
>
>So, the whole thing is what to do "when" (not "if") shit (the 
>unexpected) happens.
>
>Pere
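
The scheme in the quoted post (a low-priority task that kicks the dog only while it keeps getting assurances of health from every critical task) might be sketched as follows. This is a Python model rather than firmware; the class name, the task names, and the `max_age` parameter are assumptions made for illustration, and the `kicks` counter stands in for the write to a real watchdog peripheral.

```python
class HealthGatedWatchdog:
    """Supervisor that kicks only if every critical task reported recently."""

    def __init__(self, task_names, max_age):
        self.max_age = max_age                      # oldest acceptable report
        self.last_report = {name: None for name in task_names}
        self.kicks = 0                              # stand-in for the hardware kick

    def report_healthy(self, name, now):
        """Called by a critical task when it believes it is operating correctly."""
        if name not in self.last_report:
            raise KeyError(name)
        self.last_report[name] = now

    def supervisor_poll(self, now):
        """Called from the low-priority task; returns True if the dog was kicked."""
        ok = all(
            t is not None and now - t <= self.max_age
            for t in self.last_report.values()
        )
        if ok:
            self.kicks += 1
        return ok


wd = HealthGatedWatchdog(["motor_a", "motor_b"], max_age=2)
wd.report_healthy("motor_a", now=0)
wd.report_healthy("motor_b", now=0)
kicked_early = wd.supervisor_poll(now=1)    # both reports are fresh
wd.report_healthy("motor_a", now=3)         # motor_b has gone quiet
kicked_late = wd.supervisor_poll(now=4)     # motor_b report is stale
```

With no kick arriving, the hardware timer would eventually expire, which is the intended outcome when one critical task stops reporting.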

Who is Tim Williams?

I had to do some googling to find out what he actually said.

I still maintain that watchdog timers are only required in
high-radiation environments, in which humans would start to get
radiation sickness or at least cancer in the long run.

Old electronics systems have been working for a decade or two without
reboot. I have maintained some computer systems that were designed to
do some thermal cycling every year. If I forgot to do the thermal
cycling one year, I did the system restart the next year or the year
after that; no big problem.

Reply by Allan Herriman ● May 10, 2016

On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:

> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>> rickman wrote:
>>
>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>
>>>>
>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>> trying to determine why their state machines have wedged.
>>>>
>>>> So I'm not sure that's an entirely accurate statement.
>>>
>>> Ask them why their FSMs got stuck.  In development they may make
>>> mistakes, but you don't use watchdogs for debugging.  In fact they get
>>> in the way.
>>>
>>>
>> Oh, that's easy.  Because of either:
>>
>> An error in the synchronous logic, leaving it in a defined state with
>> no way out (20% chance).
> 
> That's a system debug thing and actually shouldn't happen at all as
> there are tools to analyze for it.
> 
> 
>> An unsynchronized async input causing a race condition that static
>> timing couldn't catch (80% chance)
> 
> Newbie mistake... that even... uh, experienced designers do once in a
> while... uh, sometimes...  still, it wouldn't make it to a fielded
> system and so does not create a need for a watchdog.


Here's something from a comp.arch.fpga post I made in 2003:

"When I was at Agilent I analysed the causes of failures in some FPGA
developments.

About half of all FPGA design related bugs (weighted by the time spent
finding them) were associated with asynchronous logic and clock domain
crossings.  [snip]  0% of the clock domain crossing bugs had anything to 
do with metastability.  Glitches and races were the cause."


Regards,
Allan

Reply by rickman ● May 10, 2016

On 5/10/2016 7:11 PM, Allan Herriman wrote:
> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>
>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>> rickman wrote:
>>>
>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>
>>>>>
>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>>> trying to determine why their state machines have wedged.
>>>>>
>>>>> So I'm not sure that's an entirely accurate statement.
>>>>
>>>> Ask them why their FSMs got stuck.  In development they may make
>>>> mistakes, but you don't use watchdogs for debugging.  In fact they get
>>>> in the way.
>>>>
>>>>
>>> Oh, that's easy.  Because of either:
>>>
>>> An error in the synchronous logic, leaving it in a defined state with
>>> no way out (20% chance).
>>
>> That's a system debug thing and actually shouldn't happen at all as
>> there are tools to analyze for it.
>>
>>
>>> An unsynchronized async input causing a race condition that static
>>> timing couldn't catch (80% chance)
>>
>> Newbie mistake... that even... uh, experienced designers do once in a
>> while... uh, sometimes...  still, it wouldn't make it to a fielded
>> system and so does not create a need for a watchdog.
>
>
> Here's something from a comp.arch.fpga post I made in 2003:
>
> "When I was at Agilent I analysed the causes of failures in some FPGA
> developments.
>
> About half of all FPGA design related bugs (weighted by the time spent
> finding them) were associated with asynchronous logic and clock domain
> crossings.  [snip]  0% of the clock domain crossing bugs had anything to
> do with metastability.  Glitches and races were the cause."

Geeze, that just shouldn't happen.  I'm not sure what they mean by 
"asynchronous logic" as real asynchronous logic is almost never used in 
FPGAs.  Clock domain crossing is well understood so there is no reason 
to not get it right.  It's the kind of thing that normally gets a big, 
red flag at design time and so is done correctly.

-- 

Rick C

Reply by Allan Herriman ● May 10, 2016

On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:

> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>> rickman wrote:
>>
>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>
>>>>
>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>> trying to determine why their state machines have wedged.
>>>>
>>>> So I'm not sure that's an entirely accurate statement.
>>>
>>> Ask them why their FSMs got stuck.  In development they may make
>>> mistakes, but you don't use watchdogs for debugging.  In fact they get
>>> in the way.
>>>
>>>
>> Oh, that's easy.  Because of either:
>>
>> An error in the synchronous logic, leaving it in a defined state with
>> no way out (20% chance).
>>
>> An unsynchronized async input causing a race condition that static
>> timing couldn't catch (80% chance)
>>
>> Or a single event upset (0.0001% chance)
> 
> I just recalled that when designing FSMs in HDL, there is typically a
> synthesis option to recognize all unused states and design so they
> return to the reset condition.  This is a good way to deal with SEU
> issues.  It is very hard to prevent a hiccup from SEU, but recovery can
> be built in.

Please don't expect that illegal state coverage will make your FSM 
reliable.  That will only help with illegal states, but illegal states 
aren't the only causes of lockups.

Consider FSMs in two systems (perhaps on the same chip) talking to each 
other with some handshaking.  There's a state that waits for a handshake 
signal from the other system.  If both FSMs get in that state (from any 
cause: glitch, SEU, coding bug), the system will lock up.

You should be able to see how a watchdog would help with that.  The 
watchdog could be built into the FSM, or it could sit to the side and 
reset the whole FSM.
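
A toy model of that lockup, with the watchdog sitting to the side, could look like this. It is a Python sketch, not HDL; the two-state machines, the `TIMEOUT` value, and the choice to restart from (SEND, WAIT) are all assumptions made for illustration.

```python
SEND, WAIT = "SEND", "WAIT"
TIMEOUT = 4   # hypothetical: clocks without progress before the watchdog fires


def step(a, b):
    """One clock. A machine in SEND signals its peer and then waits;
    a machine in WAIT proceeds only when the peer is in SEND."""
    next_a = WAIT if a == SEND else (SEND if b == SEND else WAIT)
    next_b = WAIT if b == SEND else (SEND if a == SEND else WAIT)
    return next_a, next_b


def run(a, b, cycles, watchdog=True):
    """Clock both machines; optionally reset them after TIMEOUT idle clocks."""
    idle = 0
    resets = 0
    for _ in range(cycles):
        na, nb = step(a, b)
        idle = 0 if (na, nb) != (a, b) else idle + 1
        a, b = na, nb
        if watchdog and idle >= TIMEOUT:
            a, b, idle = SEND, WAIT, 0   # force a known-good starting pair
            resets += 1
    return a, b, resets


stuck = run(WAIT, WAIT, cycles=10, watchdog=False)
recovered = run(WAIT, WAIT, cycles=10, watchdog=True)
```

Both machines start in a perfectly legal state here, so illegal-state recovery never triggers; only the timeout notices that nothing is moving.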

> How would you implement a watchdog for an FPGA which likely has many
> independent FSMs?  What would you monitor?

Firstly, I create an architecture that doesn't have many interlocking 
FSMs.  Significant parts of my design (particularly in the datapath) will 
not have any FSMs at all, and hence, no chance of FSM lockups.

Then I consider each FSM independently.  If possible, I make it 
inherently crashproof.  If not, I may add a watchdog timer.  Sometimes I 
will add a circuit that looks for bad signatures (e.g. unusual FIFO 
depths) instead.

A recent example from a system I was designing for a client:

The Xilinx transceivers need to be reset in a particular sequence to work 
properly (particularly at the higher data rates, e.g. > 10Gb/s).
These transceivers don't have a lock output that works reliably (thanks 
Xilinx!).  Instead, one must go to the next higher protocol layer (e.g. 
the Ethernet PCS level) and monitor that protocol's sync to determine 
whether the transceiver is working.

I coded a watchdog timer that would reset the transceiver if it hadn't 
seen PCS sync for a certain time.  I can't get it to fail now.
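
The behaviour of such a timer can be modelled in a few lines. This is a Python sketch under stated assumptions, not the actual design: the class name, the `timeout_cycles` parameter, and the decision to restart the count after each reset pulse (so the transceiver gets time to come back up) are all hypothetical.

```python
class LinkWatchdog:
    """Pulse a reset if PCS sync has been absent for timeout_cycles clocks."""

    def __init__(self, timeout_cycles):
        self.timeout = timeout_cycles
        self.count = 0

    def tick(self, pcs_sync):
        """Call once per clock; returns True on the clock a reset is requested."""
        if pcs_sync:
            self.count = 0          # link is up: clear the timer
            return False
        self.count += 1
        if self.count >= self.timeout:
            self.count = 0          # give the transceiver a full period to recover
            return True
        return False


wd = LinkWatchdog(timeout_cycles=3)
sync_trace = [False, False, True, False, False, False, False]
reset_trace = [wd.tick(s) for s in sync_trace]
```

In a real design the timeout would have to be longer than the worst-case time for the reset sequence to complete and sync to appear, otherwise the watchdog would keep resetting a link that was about to come up.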

Regards,
Allan

Reply by ● May 10, 2016

On Wednesday, May 11, 2016 at 01:59:02 UTC+2, Allan Herriman wrote:
> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:
> 
> > On 5/10/2016 12:48 PM, Rob Gaddi wrote:
> >> rickman wrote:
> >>
> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
> >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
> >>>>
> >>>>
> >>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
> >>>> trying to determine why their state machines have wedged.
> >>>>
> >>>> So I'm not sure that's an entirely accurate statement.
> >>>
> >>> Ask them why their FSMs got stuck.  In development they may make
> >>> mistakes, but you don't use watchdogs for debugging.  In fact they get
> >>> in the way.
> >>>
> >>>
> >> Oh, that's easy.  Because of either:
> >>
> >> An error in the synchronous logic, leaving it in a defined state with
> >> no way out (20% chance).
> >>
> >> An unsynchronized async input causing a race condition that static
> >> timing couldn't catch (80% chance)
> >>
> >> Or a single event upset (0.0001% chance)
> > 
> > I just recalled that when designing FSMs in HDL, there is typically a
> > synthesis option to recognize all unused states and design so they
> > return to the reset condition.  This is a good way to deal with SEU
> > issues.  It is very hard to prevent a hiccup from SEU, but recovery can
> > be built in.
> 
> Please don't expect that illegal state coverage will make your FSM 
> reliable.  That will only help with illegal states, but illegal states 
> aren't the only causes of lockups.
> 
> Consider FSMs in two systems (perhaps on the same chip) talking to each 
> other with some handshaking.  There's a state that waits for a handshake 
> signal from the other system.  If both FSMs get in that state (from any 
> cause: glitch, SEU, coding bug), the system will lock up.
> 
> You should be able to see how a watchdog would help with that.  The 
> watchdog could be built into the FSM, or it could sit to the side and 
> reset the whole FSM.
> 

A watchdog on the AXI bus would be nice. It is easy to reconfigure the 
programmable logic in a Zynq, but if you have stuff on the bus you have to 
make absolutely sure no software is accessing it, because that will 
halt the whole system and only a reset will recover from that.

-Lasse

Reply by Allan Herriman ● May 10, 2016

On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:

> On 5/10/2016 7:11 PM, Allan Herriman wrote:
>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>>
>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>>> rickman wrote:
>>>>
>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>>
>>>>>>
>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>>>> trying to determine why their state machines have wedged.
>>>>>>
>>>>>> So I'm not sure that's an entirely accurate statement.
>>>>>
>>>>> Ask them why their FSMs got stuck.  In development they may make
>>>>> mistakes, but you don't use watchdogs for debugging.  In fact they
>>>>> get in the way.
>>>>>
>>>>>
>>>> Oh, that's easy.  Because of either:
>>>>
>>>> An error in the synchronous logic, leaving it in a defined state with
>>>> no way out (20% chance).
>>>
>>> That's a system debug thing and actually shouldn't happen at all as
>>> there are tools to analyze for it.
>>>
>>>
>>>> An unsynchronized async input causing a race condition that static
>>>> timing couldn't catch (80% chance)
>>>
>>> Newbie mistake... that even... uh, experienced designers do once in a
>>> while... uh, sometimes...  still, it wouldn't make it to a fielded
>>> system and so does not create a need for a watchdog.
>>
>>
>> Here's something from a comp.arch.fpga post I made in 2003:
>>
>> "When I was at Agilent I analysed the causes of failures in some FPGA
>> developments.
>>
>> About half of all FPGA design related bugs (weighted by the time spent
>> finding them) were associated with asynchronous logic and clock domain
>> crossings.  [snip]  0% of the clock domain crossing bugs had anything
>> to do with metastability.  Glitches and races were the cause."
> 
> Geeze, that just shouldn't happen.  I'm not sure what they mean by
> "asynchronous logic" as real asynchronous logic is almost never used in
> FPGAs.  Clock domain crossing is well understood so there is no reason
> to not get it right.  It's the kind of thing that normally gets a big,
> red flag at design time and so is done correctly.


I would not say that clock domain crossings are /well/ understood by 
beginners, or even moderately experienced designers.

BTW, I weighted the results with the time taken to find the bugs.
There weren't that many bugs, it's just that they took a long time to
find compared with straightforward functional bugs.

Many of the bugs were caused by integrating IP (written elsewhere) and it 
wasn't always obvious to the designers that signals were crossing clock 
domains.

Some of the bugs were created by the tools, e.g. when they replicated 
logic.  That makes the bugs hard to find during source code review.  It's 
actually better to review the post-synth netlist than the source code.  
(Better still to use an automated tool to do it.)


[From a 2008 c.a.f post of mine] here's a list of the sort of things that 
could go wrong.  Please bear in mind that this list is historical (i.e. 
it was based on experience with older FPGA families and older tools, in a 
job I left over a decade ago).


- (race) Passing vectors (i.e. multiple signals) from clock domain A
to clock domain B and expecting all the bits to arrive on the same B
clock.

- (race) As above, but adding multiple banks of retiming flip flops in
the B clock domain, which fixed the (non-existent) metastability issue
but did nothing about the race.

- (race) Passing a signal in clock domain A to multiple flip flops in
clock domain B, and expecting the B flip flops to get the same value
on the same clock.

- (race) As above, but created when the tools replicate the B logic to
manage fanout.

- (glitch) Multiple signals in clock domain A hit some combinatorial
logic producing a single signal which is sampled by a flip flop in
clock domain B.  Sometimes there may be a glitch which gets sampled by
the B flip flop.
It can be difficult to design combinatorial logic with good glitch
coverage (and if you do, the tools will often remove it).  (See XAPP
024, btw.)

- (glitch) Clock multiplexers made out of combinatorial logic with
inadequate glitch coverage (or adequate glitch coverage removed by the
tools).

- Using async reset or set inputs on flip flops to implement a logic
function (rather than just using them for initialisation).  I can
remember a case where a design would fail even when we could prove
mathematically that it couldn't fail.  Rewriting it to avoid the use
of async resets fixed the problem.

- Gating clocks to create a logic function.  I know this sort of thing
is done in ASICs to save power, but it just doesn't seem to work too
well in FPGAs sometimes.
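
To make the first race in the list concrete, here is a small Python model of what the B-domain flip flops can capture when a vector changes in domain A close to a B clock edge. Gray coding is not mentioned above; it is shown only as one commonly used mitigation for counters, and the function names are made up for this sketch.

```python
from itertools import product


def to_gray(n):
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g):
    """Reflected Gray code back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def possible_samples(old, new, width):
    """Values domain B might capture if each changing bit independently
    arrives either before or after the sampling clock edge."""
    changing = [i for i in range(width) if (old ^ new) >> i & 1]
    seen = set()
    for choice in product((0, 1), repeat=len(changing)):
        value = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                value ^= 1 << bit
        seen.add(value)
    return seen


# Binary 7 -> 8 changes all four bits, so any 4-bit value may be captured.
binary_case = possible_samples(7, 8, width=4)

# The same step in Gray code changes one bit: only old or new can be captured.
gray_case = {from_gray(g) for g in possible_samples(to_gray(7), to_gray(8), width=4)}
```

Extra banks of retiming flip flops do not change this picture, since every bit is still sampled independently; that is the second item in the list.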


Regards,
Allan

Reply by Allan Herriman ● May 10, 2016

On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:

> On Wednesday, May 11, 2016 at 01:59:02 UTC+2, Allan Herriman wrote:
>> On Tue, 10 May 2016 13:36:55 -0400, rickman wrote:
>> 
>> > On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>> >> rickman wrote:
>> >>
>> >>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>> >>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>> >>>>
>> >>>>
>> >>>> I've spent lab time next to unhappily cursing FPGA guys (good
>> >>>> ones)
>> >>>> trying to determine why their state machines have wedged.
>> >>>>
>> >>>> So I'm not sure that's an entirely accurate statement.
>> >>>
>> >>> Ask them why their FSMs got stuck.  In development they may make
>> >>> mistakes, but you don't use watchdogs for debugging.  In fact they
>> >>> get in the way.
>> >>>
>> >>>
>> >> Oh, that's easy.  Because of either:
>> >>
>> >> An error in the synchronous logic, leaving it in a defined state
>> >> with no way out (20% chance).
>> >>
>> >> An unsynchronized async input causing a race condition that static
>> >> timing couldn't catch (80% chance)
>> >>
>> >> Or a single event upset (0.0001% chance)
>> > 
>> > I just recalled that when designing FSMs in HDL, there is typically a
>> > synthesis option to recognize all unused states and design so they
>> > return to the reset condition.  This is a good way to deal with SEU
>> > issues.  It is very hard to prevent a hiccup from SEU, but recovery
>> > can be built in.
>> 
>> Please don't expect that illegal state coverage will make your FSM
>> reliable.  That will only help with illegal states, but illegal states
>> aren't the only causes of lockups.
>> 
>> Consider FSMs in two systems (perhaps on the same chip) talking to each
>> other with some handshaking.  There's a state that waits for a
>> handshake signal from the other system.  If both FSMs get in that state
>> (from any cause: glitch, SEU, coding bug), the system will lock up.
>> 
>> You should be able to see how a watchdog would help with that.  The
>> watchdog could be built into the FSM, or it could sit to the side and
>> reset the whole FSM.
>> 
>> 
> A watchdog on the AXI bus would be nice. It is easy to reconfigure the
> programmable logic in a Zynq, but if you have stuff on the bus you have to
> make absolutely sure no software is accessing it, because that will halt
> the whole system and only a reset will recover from that.

We leave the ARM watchdog permanently enabled in our Zynq systems for 
that very reason.
If code (via e.g. a wrong pointer) accesses an address without an AXI 
address decode, it will hang the AXI, and hence the whole box.  

With a watchdog, it will reboot (perhaps to fail again, perhaps not).

Regards,
Allan

Reply by rickman ● May 10, 2016

On 5/10/2016 8:50 PM, Allan Herriman wrote:
> On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:
>
>> On 5/10/2016 7:11 PM, Allan Herriman wrote:
>>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>>>
>>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>>>> rickman wrote:
>>>>>
>>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>>>
>>>>>>>
>>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good ones)
>>>>>>> trying to determine why their state machines have wedged.
>>>>>>>
>>>>>>> So I'm not sure that's an entirely accurate statement.
>>>>>>
>>>>>> Ask them why their FSMs got stuck.  In development they may make
>>>>>> mistakes, but you don't use watchdogs for debugging.  In fact they
>>>>>> get in the way.
>>>>>>
>>>>>>
>>>>> Oh, that's easy.  Because of either:
>>>>>
>>>>> An error in the synchronous logic, leaving it in a defined state with
>>>>> no way out (20% chance).
>>>>
>>>> That's a system debug thing and actually shouldn't happen at all as
>>>> there are tools to analyze for it.
>>>>
>>>>
>>>>> An unsynchronized async input causing a race condition that static
>>>>> timing couldn't catch (80% chance)
>>>>
>>>> Newbie mistake... that even... uh, experienced designers do once in a
>>>> while... uh, sometimes...  still, it wouldn't make it to a fielded
>>>> system and so does not create a need for a watchdog.
>>>
>>>
>>> Here's something from a comp.arch.fpga post I made in 2003:
>>>
>>> "When I was at Agilent I analysed the causes of failures in some FPGA
>>> developments.
>>>
>>> About half of all FPGA design related bugs (weighted by the time spent
>>> finding them) were associated with asynchronous logic and clock domain
>>> crossings.  [snip]  0% of the clock domain crossing bugs had anything
>>> to do with metastability.  Glitches and races were the cause."
>>
>> Geeze, that just shouldn't happen.  I'm not sure what they mean by
>> "asynchronous logic" as real asynchronous logic is almost never used in
>> FPGAs.  Clock domain crossing is well understood so there is no reason
>> to not get it right.  It's the kind of thing that normally gets a big,
>> red flag at design time and so is done correctly.
>
>
> I would not say that clock domain crossings are /well/ understood by
> beginners, or even moderately experienced designers.

Why would a beginner be designing a system without supervision?  As I 
said, this is the sort of issue that gets a red flag and lots of 
attention in a design review.

> BTW, I weighted the results with the time taken to find the bugs.
> There weren't that many bugs, it's just that they took a long time to
> find compared with straightforward functional bugs.

That's why they get lots of attention up front rather than after they 
are a problem.

> Many of the bugs were caused by integrating IP (written elsewhere) and it
> wasn't always obvious to the designers that signals were crossing clock
> domains.

That's exactly what happened to me.  A simple UART needed a FF at the 
data in port.  I had designed the UART and didn't document that detail. 
I used it later in a test fixture and forgot to include the I/O FF. 
It bit me hard as I was writing the software that talked to this port and 
kept thinking the flaw was in the software.

> Some of the bugs were created by the tools, e.g. when they replicated
> logic.  That makes the bugs hard to find during source code review.  It's
> actually better to review the post-synth netlist than the source code.
> (Better still to use an automated tool to do it.)

Not sure how that happens.  Are you saying a design with 1 FF and many 
destinations had an async input?  That alone is a no-no.  If there had 
been a FF in front to remove metastability it would have provided 
protection from async inputs (race) when the second FF was replicated.

> [From a 2008 c.a.f post of mine] here's a list of the sort of things that
> could go wrong.  Please bear in mind that this list is historical (i.e.
> it was based on experience with older FPGA families and older tools, in a
> job I left over a decade ago).
>
>
> - (race) Passing vectors (i.e. multiple signals) from clock domain A
> to clock domain B and expecting all the bits to arrive on the same B
> clock.
>
> - (race) As above, but adding multiple banks of retiming flip flops in
> the B clock domain, which fixed the (non-existent) metastability issue
> but did nothing about the race.
>
> - (race) Passing a signal in clock domain A to multiple flip flops in
> clock domain B, and expecting the B flip flops to get the same value
> on the same clock.
>
> - (race) As above, but created when the tools replicate the B logic to
> manage fanout.
>
> - (glitch) Multiple signals in clock domain A hit some combinatorial
> logic producing a single signal which is sampled by a flip flop in
> clock domain B.  Sometimes there may be a glitch which gets sampled by
> the B flip flop.
> It can be difficult to design combinatorial logic with good glitch
> coverage (and if you do, the tools will often remove it).  (See XAPP
> 024, btw.)
>
> - (glitch) Clock multiplexers made out of combinatorial logic with
> inadequate glitch coverage (or adequate glitch coverage removed by the
> tools).
>
> - Using async reset or set inputs on flip flops to implement a logic
> function (rather than just using them for initialisation).  I can
> remember a case where a design would fail even when we could prove
> mathematically that it couldn't fail.  Rewriting it to avoid the use
> of async resets fixed the problem.
>
> - Gating clocks to create a logic function.  I know this sort of thing
> is done in ASICs to save power, but it just doesn't seem to work too
> well in FPGAs sometimes.

All of these issues are known bad practice.  Messing with the clocks or 
using async inputs on FFs is an especially bad practice.  I thought that 
ended in the 90s.

I did my first FPGA design in '95.  I've always wondered how good that 
code was.  I took some training on the Orcad schematic tools for FPGA 
design and learned about VHDL in one day.  lol  That made me the 
resident expert!  I had to deal with changing compilers twice in the 
project, so I learned something about making your code portable very 
early.  But I knew little about clock domain crossings and 
metastability.  I guess I knew about race conditions though.  That was 
not uncommon in discrete logic design which I had done.

Regardless, I don't see any reason to use a watchdog timer with an FPGA 
design unless you have SEU issues.  A proper code review with 
experienced designers will catch all of the above problems.  Adding a 
bandaid is not the solution when there is no reason to not have a good, 
clean system in an FPGA.

The reason why watchdogs are used with software is because software has 
so many more interactions and opportunities for something to screw up. 
Using an inherently serial processor to do multitasking is prone to 
problems with complex interactions.  Clock domain crossing in FPGAs is 
similarly complex, but nearly always much more limited in scope, so much 
easier to focus on to resolve all the details and get right.

I just don't buy the need for a watchdog with an FPGA.  Have you ever 
seen one used that didn't involve SEU?

-- 

Rick C