Kicking the dog -- how do you use watchdog timers?| page 4

Reply by rickman ● May 11, 2016

On 5/11/2016 12:10 AM, Paul Rubin wrote:
> rickman <gnuarm@gmail.com> writes:
>> Where is the need for a watchdog of any sort?  If an ASIC locks up,
>> won't the CPU be able to figure it out and reset whatever is
>> appropriate?
>
> What if the CPU is part of the ASIC?

Ok, what if?

-- 

Rick C

Reply by Tim Wescott ● May 11, 2016

On Wed, 11 May 2016 00:10:39 -0400, rickman wrote:

> On 5/10/2016 11:36 PM, Randy Yates wrote:
>> rickman <gnuarm@gmail.com> writes:
>>
>>> On 5/10/2016 10:38 PM, Paul Rubin wrote:
>>>> rickman <gnuarm@gmail.com> writes:
>>>>> In FPGAs the logic can be designed to not hang.  It may require work
>>>>> to do the proper analysis, but it is not just possible, but saves
>>>>> money in the long run when you don't need to fix difficult to find
>>>>> bugs.
>>>>
>>>> As FPGA's get bigger and the circuits in them get more complicated,
>>>> don't they face the same combinatorial explosion that big software
>>>> systems do?  There are tons of historical examples of Intel and
>>>> similar CPU's (hard silicon, not even FPGA's) locking up due to bugs.
>>>>  Any big CPU or comparable chip will have an errata list.  At some
>>>> point you may have to accept that bugs are inevitable, and that a
>>>> reliable system (besides preventing as many bugs as it can) also has
>>>> to mitigate any remaining ones.  Watchdogs are a time tested approach
>>>> for that purpose.
>>>
>>> How many of those large ASICs with hang bugs had watchdog timers?  The
>>> bug was a system level design problem, not a logic bug in a FSM.  They
>>> could be dealt with by a software change, no?  Even if they couldn't
>>> be dealt with with software, what would a watchdog do?  Reset your
>>> entire computer/phone/flight nav?
>>>
>>> There is always the possibility of bugs in FPGAs.  But bugs that
>>> require the use of a watchdog are a class of bugs that should be
>>> shaken out in debug unless the designers are not very good.  If they
>>> can't find them in debug, they have to be pretty durn infrequent.
>>>
>>> Do you have any links to descriptions of such bugs?  I'm curious.
>>
>> A WDT is also not a cure-all. Consider this scenario: a piece of
>> hardware fails, changing the inputs to a piece of code in an unexpected
>> way and causing the code to go into the weeds and the WDT to fire.
>>
>> But after restart, the hardware is still failed and providing
>> unexpected inputs, the same bug occurs again, the WDT fires again, and
>> the processor restarts again. Ad-infinitum.
>>
>> So what did this fix? :)
> 
> Shouldn't a processor reset also reset the hardware?

Randy's point, I think, is that if something is _broken_, a reset isn't 
going to un-break it.

A processor reset should also reset the hardware, in much the same way 
that cops should always be honest -- "should" in this case indicates a 
moral requirement, but not, in all companies, a reasonable expectation.

-- 
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work!  See my website if you're interested
http://www.wescottdesign.com

Reply by rickman ● May 11, 2016

On 5/11/2016 2:17 AM, Tim Wescott wrote:
> On Wed, 11 May 2016 00:10:39 -0400, rickman wrote:
>
>> On 5/10/2016 11:36 PM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
>>>> On 5/10/2016 10:38 PM, Paul Rubin wrote:
>>>>> rickman <gnuarm@gmail.com> writes:
>>>>>> In FPGAs the logic can be designed to not hang.  It may require work
>>>>>> to do the proper analysis, but it is not just possible, but saves
>>>>>> money in the long run when you don't need to fix difficult to find
>>>>>> bugs.
>>>>>
>>>>> As FPGA's get bigger and the circuits in them get more complicated,
>>>>> don't they face the same combinatorial explosion that big software
>>>>> systems do?  There are tons of historical examples of Intel and
>>>>> similar CPU's (hard silicon, not even FPGA's) locking up due to bugs.
>>>>>   Any big CPU or comparable chip will have an errata list.  At some
>>>>> point you may have to accept that bugs are inevitable, and that a
>>>>> reliable system (besides preventing as many bugs as it can) also has
>>>>> to mitigate any remaining ones.  Watchdogs are a time tested approach
>>>>> for that purpose.
>>>>
>>>> How many of those large ASICs with hang bugs had watchdog timers?  The
>>>> bug was a system level design problem, not a logic bug in a FSM.  They
>>>> could be dealt with by a software change, no?  Even if they couldn't
>>>> be dealt with with software, what would a watchdog do?  Reset your
>>>> entire computer/phone/flight nav?
>>>>
>>>> There is always the possibility of bugs in FPGAs.  But bugs that
>>>> require the use of a watchdog are a class of bugs that should be
>>>> shaken out in debug unless the designers are not very good.  If they
>>>> can't find them in debug, they have to be pretty durn infrequent.
>>>>
>>>> Do you have any links to descriptions of such bugs?  I'm curious.
>>>
>>> A WDT is also not a cure-all. Consider this scenario: a piece of
>>> hardware fails, changing the inputs to a piece of code in an unexpected
>>> way and causing the code to go into the weeds and the WDT to fire.
>>>
>>> But after restart, the hardware is still failed and providing
>>> unexpected inputs, the same bug occurs again, the WDT fires again, and
>>> the processor restarts again. Ad-infinitum.
>>>
>>> So what did this fix? :)
>>
>> Shouldn't a processor reset also reset the hardware?
>
> Randy's point, I think, is that if something is _broken_, a reset isn't
> going to un-break it.
>
> A processor reset should also reset the hardware, in much the same way
> that cops should always be honest -- "should" in this case indicates a
> moral requirement, but not, in all companies, a reasonable expectation.

I'm not sure we are in the same conversation.  We were discussing how to
design systems, not what systems get designed.

-- 

Rick C

Reply by Allan Herriman ● May 11, 2016

On Tue, 10 May 2016 21:45:32 -0400, rickman wrote:

> On 5/10/2016 8:50 PM, Allan Herriman wrote:
>> On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:
>>
>>> On 5/10/2016 7:11 PM, Allan Herriman wrote:
>>>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>>>>
>>>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>>>>> rickman wrote:
>>>>>>
>>>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good
>>>>>>>> ones)
>>>>>>>> trying to determine why their state machines have wedged.
>>>>>>>>
>>>>>>>> So I'm not sure that's an entirely accurate statement.
>>>>>>>
>>>>>>> Ask them why their FSMs got stuck.  In development they may make
>>>>>>> mistakes, but you don't use watchdogs for debugging.  In fact they
>>>>>>> get in the way.
>>>>>>>
>>>>>>>
>>>>>> Oh, that's easy.  Because of either:
>>>>>>
>>>>>> An error in the synchronous logic, leaving it in a defined state
>>>>>> with no way out (20% chance).
>>>>>
>>>>> That's a system debug thing and actually shouldn't happen at all as
>>>>> there are tools to analyze for it.
>>>>>
>>>>>
>>>>>> An unsynchronized async input causing a race condition that static
>>>>>> timing couldn't catch (80% chance)
>>>>>
>>>>> Newbie mistake... that even... uh, experienced designers do once in
>>>>> a while... uh, sometimes...  still, it wouldn't make it to a fielded
>>>>> system and so does not create a need for a watchdog.
>>>>
>>>>
>>>> Here's something from a comp.arch.fpga post I made in 2003:
>>>>
>>>> "When I was at Agilent I analysed the causes of failures in some FPGA
>>>> developments.
>>>>
>>>> About half of all FPGA design related bugs (weighted by the time
>>>> spent finding them) were associated with asynchronous logic and clock
>>>> domain crossings.  [snip]  0% of the clock domain crossing bugs had
>>>> anything to do with metastability.  Glitches and races were the
>>>> cause."
>>>
>>> Geeze, that just shouldn't happen.  I'm not sure what they mean by
>>> "asynchronous logic" as real asynchronous logic is almost never used
>>> in FPGAs.  Clock domain crossing is well understood so there is no
>>> reason to not get it right.  It's the kind of thing that normally gets
>>> a big, red flag at design time and so is done correctly.
>>
>>
>> I would not say that clock domain crossings are /well/ understood by
>> beginners, or even moderately experienced designers.
> 
> Why would a beginner be designing a system without supervision?  As I
> said, this is the sort of issue that gets a red flag and lots of
> attention in a design review.
> 
> 
>> BTW, I weighted the results with the time taken to find the bugs.
>> There weren't that many bugs, it's just that they took a long time to
>> find compared with straightforward functional bugs.
> 
> That's why they get lots of attention up front rather than after they
> are a problem.
> 
> 
>> Many of the bugs were caused by integrating IP (written elsewhere) and
>> it wasn't always obvious to the designers that signals were crossing
>> clock domains.
> 
> That's exactly what happened to me.  A simple UART needed a FF at the
> data in port.  I had designed the UART and didn't document that detail.
>   I used it later in a test fixture and forgot to include the I/O FF.
> It bit me hard as I was writing the software that talked to this port and
> kept thinking the flaw was software.
> 
> 
>> Some of the bugs were created by the tools, e.g. when they replicated
>> logic.  That makes the bugs hard to find during source code review. 
>> It's actually better to review the post-synth netlist than the source
>> code. (Better still to use an automated tool to do it.)
> 
> Not sure how that happens.  Are you saying a design with 1 FF and many
> destinations had an async input?  That alone is a no-no.  If there had
> been a FF in front to remove metastability it would have provided
> protection from async inputs (race) when the second FF was replicated.

You seem to be making the assumption that having two flip flops in series
will stop the first one from being replicated.  I've seen it happen 
(albeit with a huge fanout on the second FF).

The only way to ensure that the first FF has not been replicated is to 
check, or to apply attributes that will tell the tools not to replicate 
it.  Even then, the tools may have bugs (they certainly have in the past) 
and you still need to check to be sure.

The good news is that the check can be automated.
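
As a rough illustration of what such an automated check might look like, here is a minimal Python sketch.  It assumes a made-up netlist representation (one record per flip flop giving its name, its clock domain and the name of whatever drives its D input); real tools export netlists in their own formats, so the `FlipFlop` type and the `find_unsafe_crossings` function are hypothetical.  It also ignores combinatorial logic between flip flops.  The idea is simply to flag any source that is sampled by more than one flip flop in a foreign clock domain, which is what a replicated first synchroniser stage looks like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlop:
    """Hypothetical netlist record: one flip flop after synthesis."""
    name: str
    clock: str      # clock domain of this flip flop
    d_source: str   # name of the flip flop (or input pin) driving D


def find_unsafe_crossings(flops, async_inputs=()):
    """Return {source: [sampling flop names]} for every source that is
    sampled by more than one flip flop in a foreign clock domain."""
    clock_of = {ff.name: ff.clock for ff in flops}
    # Asynchronous pins belong to no clock domain at all.
    for pin in async_inputs:
        clock_of[pin] = None

    samplers = defaultdict(list)
    for ff in flops:
        # Unknown sources are treated as being in the same domain.
        src_clock = clock_of.get(ff.d_source, ff.clock)
        if src_clock != ff.clock:          # this D input crosses domains
            samplers[ff.d_source].append(ff.name)

    return {src: sorted(names) for src, names in samplers.items()
            if len(names) > 1}


# The designer wrote one synchroniser flop, but the tool replicated it.
netlist = [
    FlipFlop("sync1", "clk_b", "rx_pin"),
    FlipFlop("sync1_rep", "clk_b", "rx_pin"),   # tool-created replica
    FlipFlop("sync2", "clk_b", "sync1"),
    FlipFlop("ctr", "clk_a", "ctr"),
]
report = find_unsafe_crossings(netlist, async_inputs=["rx_pin"])
print(report)
```

Running a check like this on the post-synthesis netlist rather than on the source is the point: the replica only exists after the tools have done their work.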


>> [From a 2008 c.a.f post of mine] here's a list of the sort of things
>> that could go wrong.  Please bear in mind that this list is historical
>> (i.e. it was based on experience with older FPGA families and older
>> tools, in a job I left over a decade ago.).
>>
>>
>> - (race) Passing vectors (i.e. multiple signals) from clock domain A to
>> clock domain B and expecting all the bits to arrive on the same B
>> clock.
>>
>> - (race) As above, but adding multiple banks of retiming flip flops in
>> the B clock domain, which fixed the (non-existent) metastability issue
>> but did nothing about the race.
>>
>> - (race) Passing a signal in clock domain A to multiple flip flops in
>> clock domain B, and expecting the B flip flops to get the same value on
>> the same clock.
>>
>> - (race) As above, but created when the tools replicate the B logic to
>> manage fanout.
>>
>> - (glitch) Multiple signals in clock domain A hit some combinatorial
>> logic producing a single signal which is sampled by a flip flop in
>> clock domain B.  Sometimes there may be a glitch which gets sampled by
>> the B flip flop.
>> It can be difficult to design combinatorial logic with good glitch
>> coverage (and if you do, the tools will often remove it).  (See XAPP
>> 024, btw.)
>>
>> - (glitch) Clock multiplexers made out of combinatorial logic with
>> inadequate glitch coverage (or adequate glitch coverage removed by the
>> tools).
>>
>> - Using async reset or set inputs on flip flops to implement a logic
>> function (rather than just using them for initialisation).  I can
>> remember a case where a design would fail even when we could prove
>> mathematically that it couldn't fail.  Rewriting it to avoid the use of
>> async resets fixed the problem.
>>
>> - Gating clocks to create a logic function.  I know this sort of thing
>> is done in ASICs to save power, but it just doesn't seem to work too
>> well in FPGAs sometimes.
> 
> All of these issues are known bad practice.  Messing with the clocks or
> using async inputs on FFs is an especially bad practice.  I thought that
> ended in the 90s.

Do you have a citation for "known bad practice"?  I wrote that list in (I
think) 2001, and I haven't seen anything containing /all/ of those points 
published prior to that date.

Allan

Reply by rickman ● May 11, 2016

On 5/11/2016 7:16 AM, Allan Herriman wrote:
> On Tue, 10 May 2016 21:45:32 -0400, rickman wrote:
>
>> On 5/10/2016 8:50 PM, Allan Herriman wrote:
>>> On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:
>>>
>>>> On 5/10/2016 7:11 PM, Allan Herriman wrote:
>>>>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>>>>>
>>>>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>>>>>> rickman wrote:
>>>>>>>
>>>>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good
>>>>>>>>> ones)
>>>>>>>>> trying to determine why their state machines have wedged.
>>>>>>>>>
>>>>>>>>> So I'm not sure that's an entirely accurate statement.
>>>>>>>>
>>>>>>>> Ask them why their FSMs got stuck.  In development they may make
>>>>>>>> mistakes, but you don't use watchdogs for debugging.  In fact they
>>>>>>>> get in the way.
>>>>>>>>
>>>>>>>>
>>>>>>> Oh, that's easy.  Because of either:
>>>>>>>
>>>>>>> An error in the synchronous logic, leaving it in a defined state
>>>>>>> with no way out (20% chance).
>>>>>>
>>>>>> That's a system debug thing and actually shouldn't happen at all as
>>>>>> there are tools to analyze for it.
>>>>>>
>>>>>>
>>>>>>> An unsynchronized async input causing a race condition that static
>>>>>>> timing couldn't catch (80% chance)
>>>>>>
>>>>>> Newbie mistake... that even... uh, experienced designers do once in
>>>>>> a while... uh, sometimes...  still, it wouldn't make it to a fielded
>>>>>> system and so does not create a need for a watchdog.
>>>>>
>>>>>
>>>>> Here's something from a comp.arch.fpga post I made in 2003:
>>>>>
>>>>> "When I was at Agilent I analysed the causes of failures in some FPGA
>>>>> developments.
>>>>>
>>>>> About half of all FPGA design related bugs (weighted by the time
>>>>> spent finding them) were associated with asynchronous logic and clock
>>>>> domain crossings.  [snip]  0% of the clock domain crossing bugs had
>>>>> anything to do with metastability.  Glitches and races were the
>>>>> cause."
>>>>
>>>> Geeze, that just shouldn't happen.  I'm not sure what they mean by
>>>> "asynchronous logic" as real asynchronous logic is almost never used
>>>> in FPGAs.  Clock domain crossing is well understood so there is no
>>>> reason to not get it right.  It's the kind of thing that normally gets
>>>> a big, red flag at design time and so is done correctly.
>>>
>>>
>>> I would not say that clock domain crossings are /well/ understood by
>>> beginners, or even moderately experienced designers.
>>
>> Why would a beginner be designing a system without supervision?  As I
>> said, this is the sort of issue that gets a red flag and lots of
>> attention in a design review.
>>
>>
>>> BTW, I weighted the results with the time taken to find the bugs.
>>> There weren't that many bugs, it's just that they took a long time to
>>> find compared with straightforward functional bugs.
>>
>> That's why they get lots of attention up front rather than after they
>> are a problem.
>>
>>
>>> Many of the bugs were caused by integrating IP (written elsewhere) and
>>> it wasn't always obvious to the designers that signals were crossing
>>> clock domains.
>>
>> That's exactly what happened to me.  A simple UART needed a FF at the
>> data in port.  I had designed the UART and didn't document that detail.
>>    I used it later in a test fixture and forgot to include the I/O FF.
>> It bit me hard as I was writing the software that talked to this port and
>> kept thinking the flaw was software.
>>
>>
>>> Some of the bugs were created by the tools, e.g. when they replicated
>>> logic.  That makes the bugs hard to find during source code review.
>>> It's actually better to review the post-synth netlist than the source
>>> code. (Better still to use an automated tool to do it.)
>>
>> Not sure how that happens.  Are you saying a design with 1 FF and many
>> destinations had an async input?  That alone is a no-no.  If there had
>> been a FF in front to remove metastability it would have provided
>> protection from async inputs (race) when the second FF was replicated.
>
> You seem to be making the assumption that having two flip flops in series
> will stop the first one from being replicated.  I've seen it happen
> (albeit with a huge fanout on the second FF).
>
> The only way to ensure that the first FF has not been replicated is to
> check, or to apply attributes that will tell the tools not to replicate
> it.  Even then, the tools may have bugs (they certainly have in the past)
> and you still need to check to be sure.
>
> The good news is that the check can be automated.
>
>
>>> [From a 2008 c.a.f post of mine] here's a list of the sort of things
>>> that could go wrong.  Please bear in mind that this list is historical
>>> (i.e. it was based on experience with older FPGA families and older
>>> tools, in a job I left over a decade ago.).
>>>
>>>
>>> - (race) Passing vectors (i.e. multiple signals) from clock domain A to
>>> clock domain B and expecting all the bits to arrive on the same B
>>> clock.
>>>
>>> - (race) As above, but adding multiple banks of retiming flip flops in
>>> the B clock domain, which fixed the (non-existent) metastability issue
>>> but did nothing about the race.
>>>
>>> - (race) Passing a signal in clock domain A to multiple flip flops in
>>> clock domain B, and expecting the B flip flops to get the same value on
>>> the same clock.
>>>
>>> - (race) As above, but created when the tools replicate the B logic to
>>> manage fanout.
>>>
>>> - (glitch) Multiple signals in clock domain A hit some combinatorial
>>> logic producing a single signal which is sampled by a flip flop in
>>> clock domain B.  Sometimes there may be a glitch which gets sampled by
>>> the B flip flop.
>>> It can be difficult to design combinatorial logic with good glitch
>>> coverage (and if you do, the tools will often remove it).  (See XAPP
>>> 024, btw.)
>>>
>>> - (glitch) Clock multiplexers made out of combinatorial logic with
>>> inadequate glitch coverage (or adequate glitch coverage removed by the
>>> tools).
>>>
>>> - Using async reset or set inputs on flip flops to implement a logic
>>> function (rather than just using them for initialisation).  I can
>>> remember a case where a design would fail even when we could prove
>>> mathematically that it couldn't fail.  Rewriting it to avoid the use of
>>> async resets fixed the problem.
>>>
>>> - Gating clocks to create a logic function.  I know this sort of thing
>>> is done in ASICs to save power, but it just doesn't seem to work too
>>> well in FPGAs sometimes.
>>
>> All of these issues are known bad practice.  Messing with the clocks or
>> using async inputs on FFs is an especially bad practice.  I thought that
>> ended in the 90s.
>
> Do you have a citation for "known bad practice"?  I wrote that list in (I
> think) 2001, and I haven't seen anything containing /all/ of those points
> published prior to that date.

No, I'm not really a history buff.  I don't know of any resource that 
lists problems to be avoided in digital logic design.  Do you have any? 
  I believe all of these issues are common knowledge.

You list 8 things you look for and the first 6 are clock domain crossing 
issues.  So that is really one issue, good clock domain crossing design 
and you have listed six ways that designers screw up.
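
The first item in the quoted list (passing a vector from domain A to domain B) can be made concrete with a small model.  The Python sketch below is only an illustration under a simplifying assumption: each changing bit independently arrives either before or after the sampling edge in domain B.  The function names are made up for this example.  Gray coding is brought in here as one commonly used remedy for counters; it was not proposed in the posts above.

```python
from itertools import product


def possible_samples(old, new, width):
    """All values a receiving register could capture if each changing bit
    independently arrives either before or after the sampling clock edge."""
    changing = [i for i in range(width) if (old ^ new) >> i & 1]
    results = set()
    for arrived in product([False, True], repeat=len(changing)):
        value = old
        for bit, has_arrived in zip(changing, arrived):
            if has_arrived:
                value ^= 1 << bit      # this bit already shows the new value
        results.add(value)
    return results


def to_gray(n):
    """Standard binary-reflected Gray code."""
    return n ^ (n >> 1)


# A binary counter stepping from 7 to 8 changes all four bits at once.
binary = possible_samples(7, 8, width=4)
# The same step in Gray code changes exactly one bit.
gray = possible_samples(to_gray(7), to_gray(8), width=4)
print(sorted(binary))
print(sorted(gray))
```

In this model the binary step can be captured as any of the 16 four-bit values, while the Gray-coded step can only be captured as the old or the new value.  Extra retiming flip flops in domain B would pass a wrong value along unchanged, which matches the second item in the list: they address metastability, not the race.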

Using async inputs on FFs has always been discouraged by the FPGA 
companies, in no small part because it makes the design hard to verify 
and I believe they have said it makes it hard to port to ASICs (maybe 
because of being hard to verify).

I have heard forever that it is hard to gate clocks properly.  It 
requires knowledge of gate delays and detailed timing which is typically 
avoided in FPGA designs in favor of unit-delay simulation with static
timing analysis.

-- 

Rick C

Reply by Rob Gaddi ● May 11, 2016

lasselangwadtchristensen@gmail.com wrote:

> Den onsdag den 11. maj 2016 kl. 02.58.08 UTC+2 skrev Allan Herriman:
>> On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:
>> 
>> > A watchdog on the AXI bus would be nice, it is easy to reconfigure the
>> > programmable logic in a Zynq but if you have stuff on bus you have to
>> > make absolutely sure no software is accessing that because it will halt
>> > the whole system and only a reset will recover from that
>> 
>> We leave the ARM watchdog permanently enabled in our Zynq systems for 
>> that very reason.
>> If code (via e.g. a wrong pointer) accesses an address without an AXI 
>> address decode, it will hang the AXI, and hence the whole box.  
>> 
>
> yep, but it would be nice if there was an option to get something similar 
> to an access denied and handle it from there instead of resetting the whole 
> system
>
> -Lasse

Wait, you're proposing that an error on the data bus should raise some
sort of Data Abort exception (perhaps at vector address 0x10) rather
than render the system catatonic?  Poppycock!  Who would ever design an
ARM-based CPU in such a way?

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com

Email address domain is currently out of order.  See above to fix.

Reply by rickman ● May 11, 2016

On 5/11/2016 12:23 PM, rickman wrote:
> On 5/11/2016 7:16 AM, Allan Herriman wrote:
>> On Tue, 10 May 2016 21:45:32 -0400, rickman wrote:
>>
>>> On 5/10/2016 8:50 PM, Allan Herriman wrote:
>>>> On Tue, 10 May 2016 19:54:18 -0400, rickman wrote:
>>>>
>>>>> On 5/10/2016 7:11 PM, Allan Herriman wrote:
>>>>>> On Tue, 10 May 2016 13:01:54 -0400, rickman wrote:
>>>>>>
>>>>>>> On 5/10/2016 12:48 PM, Rob Gaddi wrote:
>>>>>>>> rickman wrote:
>>>>>>>>
>>>>>>>>> On 5/9/2016 5:13 PM, Tim Wescott wrote:
>>>>>>>>>> On Mon, 09 May 2016 15:07:08 -0400, rickman wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've spent lab time next to unhappily cursing FPGA guys (good
>>>>>>>>>> ones)
>>>>>>>>>> trying to determine why their state machines have wedged.
>>>>>>>>>>
>>>>>>>>>> So I'm not sure that's an entirely accurate statement.
>>>>>>>>>
>>>>>>>>> Ask them why their FSMs got stuck.  In development they may make
>>>>>>>>> mistakes, but you don't use watchdogs for debugging.  In fact they
>>>>>>>>> get in the way.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Oh, that's easy.  Because of either:
>>>>>>>>
>>>>>>>> An error in the synchronous logic, leaving it in a defined state
>>>>>>>> with no way out (20% chance).
>>>>>>>
>>>>>>> That's a system debug thing and actually shouldn't happen at all as
>>>>>>> there are tools to analyze for it.
>>>>>>>
>>>>>>>
>>>>>>>> An unsynchronized async input causing a race condition that static
>>>>>>>> timing couldn't catch (80% chance)
>>>>>>>
>>>>>>> Newbie mistake... that even... uh, experienced designers do once in
>>>>>>> a while... uh, sometimes...  still, it wouldn't make it to a fielded
>>>>>>> system and so does not create a need for a watchdog.
>>>>>>
>>>>>>
>>>>>> Here's something from a comp.arch.fpga post I made in 2003:
>>>>>>
>>>>>> "When I was at Agilent I analysed the causes of failures in some FPGA
>>>>>> developments.
>>>>>>
>>>>>> About half of all FPGA design related bugs (weighted by the time
>>>>>> spent finding them) were associated with asynchronous logic and clock
>>>>>> domain crossings.  [snip]  0% of the clock domain crossing bugs had
>>>>>> anything to do with metastability.  Glitches and races were the
>>>>>> cause."
>>>>>
>>>>> Geeze, that just shouldn't happen.  I'm not sure what they mean by
>>>>> "asynchronous logic" as real asynchronous logic is almost never used
>>>>> in FPGAs.  Clock domain crossing is well understood so there is no
>>>>> reason to not get it right.  It's the kind of thing that normally gets
>>>>> a big, red flag at design time and so is done correctly.
>>>>
>>>>
>>>> I would not say that clock domain crossings are /well/ understood by
>>>> beginners, or even moderately experienced designers.
>>>
>>> Why would a beginner be designing a system without supervision?  As I
>>> said, this is the sort of issue that gets a red flag and lots of
>>> attention in a design review.
>>>
>>>
>>>> BTW, I weighted the results with the time taken to find the bugs.
>>>> There weren't that many bugs, it's just that they took a long time to
>>>> find compared with straightforward functional bugs.
>>>
>>> That's why they get lots of attention up front rather than after they
>>> are a problem.
>>>
>>>
>>>> Many of the bugs were caused by integrating IP (written elsewhere) and
>>>> it wasn't always obvious to the designers that signals were crossing
>>>> clock domains.
>>>
>>> That's exactly what happened to me.  A simple UART needed a FF at the
>>> data in port.  I had designed the UART and didn't document that detail.
>>>    I used it later in a test fixture and forgot to include the I/O FF.
>>> It bit me hard as I was writing the software that talked to this port and
>>> kept thinking the flaw was software.
>>>
>>>
>>>> Some of the bugs were created by the tools, e.g. when they replicated
>>>> logic.  That makes the bugs hard to find during source code review.
>>>> It's actually better to review the post-synth netlist than the source
>>>> code. (Better still to use an automated tool to do it.)
>>>
>>> Not sure how that happens.  Are you saying a design with 1 FF and many
>>> destinations had an async input?  That alone is a no-no.  If there had
>>> been a FF in front to remove metastability it would have provided
>>> protection from async inputs (race) when the second FF was replicated.
>>
>> You seem to be making the assumption that having two flip flops in series
>> will stop the first one from being replicated.  I've seen it happen
>> (albeit with a huge fanout on the second FF).
>>
>> The only way to ensure that the first FF has not been replicated is to
>> check, or to apply attributes that will tell the tools not to replicate
>> it.  Even then, the tools may have bugs (they certainly have in the past)
>> and you still need to check to be sure.
>>
>> The good news is that the check can be automated.
>>
>>
>>>> [From a 2008 c.a.f post of mine] here's a list of the sort of things
>>>> that could go wrong.  Please bear in mind that this list is historical
>>>> (i.e. it was based on experience with older FPGA families and older
>>>> tools, in a job I left over a decade ago.).
>>>>
>>>>
>>>> - (race) Passing vectors (i.e. multiple signals) from clock domain A to
>>>> clock domain B and expecting all the bits to arrive on the same B
>>>> clock.
>>>>
>>>> - (race) As above, but adding multiple banks of retiming flip flops in
>>>> the B clock domain, which fixed the (non-existent) metastability issue
>>>> but did nothing about the race.
>>>>
>>>> - (race) Passing a signal in clock domain A to multiple flip flops in
>>>> clock domain B, and expecting the B flip flops to get the same value on
>>>> the same clock.
>>>>
>>>> - (race) As above, but created when the tools replicate the B logic to
>>>> manage fanout.
>>>>
>>>> - (glitch) Multiple signals in clock domain A hit some combinatorial
>>>> logic producing a single signal which is sampled by a flip flop in
>>>> clock domain B.  Sometimes there may be a glitch which gets sampled by
>>>> the B flip flop.
>>>> It can be difficult to design combinatorial logic with good glitch
>>>> coverage (and if you do, the tools will often remove it).  (See XAPP
>>>> 024, btw.)
>>>>
>>>> - (glitch) Clock multiplexers made out of combinatorial logic with
>>>> inadequate glitch coverage (or adequate glitch coverage removed by the
>>>> tools).
>>>>
>>>> - Using async reset or set inputs on flip flops to implement a logic
>>>> function (rather than just using them for initialisation).  I can
>>>> remember a case where a design would fail even when we could prove
>>>> mathematically that it couldn't fail.  Rewriting it to avoid the use of
>>>> async resets fixed the problem.
>>>>
>>>> - Gating clocks to create a logic function.  I know this sort of thing
>>>> is done in ASICs to save power, but it just doesn't seem to work too
>>>> well in FPGAs sometimes.
>>>
>>> All of these issues are known bad practice.  Messing with the clocks or
>>> using async inputs on FFs is an especially bad practice.  I thought that
>>> ended in the 90s.
>>
>> Do you have a citation for "known bad practice"?  I wrote that list in (I
>> think) 2001, and I haven't seen anything containing /all/ of those points
>> published prior to that date.
>
> No, I'm not really a history buff.  I don't know of any resource that
> lists problems to be avoided in digital logic design.  Do you have any?
>   I believe all of these issues are common knowledge.
>
> You list 8 things you look for and the first 6 are clock domain crossing
> issues.  So that is really one issue, good clock domain crossing design
> and you have listed six ways that designers screw up.
>
> Using async inputs on FFs has always been discouraged by the FPGA
> companies, in no small part because it makes the design hard to verify
> and I believe they have said it makes it hard to port to ASICs (maybe
> because of being hard to verify).
>
> I have heard forever that it is hard to gate clocks properly.  It
> requires knowledge of gate delays and detailed timing which is typically
> avoided in FPGA designs in favor of unit-delay simulation with static
> timing analysis.

Since we have been discussing purely hardware issues and this is 
primarily a software group, I have started a post in comp.arch.fpga.  If 
you think it is appropriate to continue this discussion here, maybe add 
comp.arch.fpga to the list of groups.

-- 

Rick C

Reply by ● May 11, 2016

Den onsdag den 11. maj 2016 kl. 18.33.41 UTC+2 skrev Rob Gaddi:
> lasselangwadtchristensen@gmail.com wrote:
> 
> > Den onsdag den 11. maj 2016 kl. 02.58.08 UTC+2 skrev Allan Herriman:
> >> On Tue, 10 May 2016 17:22:43 -0700, lasselangwadtchristensen wrote:
> >> 
> >> > A watchdog on the AXI bus would be nice, it is easy to reconfigure the
> >> > programmable logic in a Zynq but if you have stuff on bus you have to
> >> > make absolutely sure no software is accessing that because it will halt
> >> > the whole system and only a reset will recover from that
> >> 
> >> We leave the ARM watchdog permanently enabled in our Zynq systems for 
> >> that very reason.
> >> If code (via e.g. a wrong pointer) accesses an address without an AXI 
> >> address decode, it will hang the AXI, and hence the whole box.  
> >> 
> >
> > yep, but it would be nice if there was an option to get something similar 
> > to an access denied and handle it from there instead of resetting the whole 
> > system
> >
> > -Lasse
> 
> Wait, you're proposing that an error on the data bus should raise some
> sort of Data Abort exception (perhaps at vector address 0x10) rather
> than render the system catatonic?  Poppycock!  Who would ever design an
> ARM-based CPU in such a way?

yes I know, crazy talk ;) 


-Lasse

Reply by Randy Yates ● May 12, 2016

Tim Wescott <tim@seemywebsite.com> writes:
> [...]
> On Wed, 11 May 2016 00:10:39 -0400, rickman wrote:
>> Shouldn't a processor reset also reset the hardware?
>
> Randy's point, I think, is that if something is _broken_, a reset isn't 
> going to un-break it.

Exactly. Discriminate "broken hardware" from "hardware that's gotten
into a bad state." The former won't benefit from a reset, the latter
will.
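
One possible way to act on that distinction is to count consecutive watchdog resets and stop rebooting once resets are clearly not helping.  The Python sketch below is a model of that boot-time decision, not code from any particular system; the threshold, the reset-cause strings and the `next_boot_action` function are all assumptions made for illustration.  On real hardware the counter would have to live in storage that survives a reset.

```python
MAX_WATCHDOG_RESETS = 3   # assumed threshold, tune per product


def next_boot_action(reset_cause, consecutive_wdt_resets):
    """Decide what to do at boot.

    reset_cause: "watchdog" or anything else (power-on, button, ...).
    consecutive_wdt_resets: count kept in storage that survives a reset.
    Returns (action, updated_count).
    """
    if reset_cause != "watchdog":
        return "run", 0                      # clean start clears the history
    count = consecutive_wdt_resets + 1
    if count >= MAX_WATCHDOG_RESETS:
        # Resets are not helping: assume something is really broken.
        return "safe_mode", count
    return "run", count                      # probably just a bad state


count = 0
actions = []
for cause in ["power_on", "watchdog", "watchdog", "watchdog", "watchdog"]:
    action, count = next_boot_action(cause, count)
    actions.append(action)
print(actions)
```

A fuller version would also clear the count after a period of healthy operation, so that rare, unrelated watchdog events spread over months do not add up to a false "broken" verdict.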
-- 
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com

Reply by rickman ● May 12, 2016

On 5/12/2016 12:44 PM, Randy Yates wrote:
> Tim Wescott <tim@seemywebsite.com> writes:
>> [...]
>> On Wed, 11 May 2016 00:10:39 -0400, rickman wrote:
>>> Shouldn't a processor reset also reset the hardware?
>>
>> Randy's point, I think, is that if something is _broken_, a reset isn't
>> going to un-break it.
>
> Exactly. Discriminate "broken hardware" from "hardware that's gotten
> into a bad state." The former won't benefit from a reset, the latter
> will.

Even broken hardware will benefit if the reset prevents actions that 
cause damage or disrupt a larger part of the system.  It is frequently 
the case that a power up self test is performed before a system controls 
dangerous devices or tries to communicate with the larger system.
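
A minimal sketch of that gating idea, in Python for readability: outputs stay disabled unless every self test passes.  The test names and the `power_up` function are hypothetical; a real system would probe its own RAM, sensors and actuators.

```python
def power_up(self_tests):
    """Run every self test; only enable outputs if all of them pass.

    self_tests: mapping of test name -> zero-argument callable returning bool.
    Returns (outputs_enabled, list_of_failed_test_names).
    """
    failed = [name for name, test in self_tests.items() if not test()]
    outputs_enabled = not failed
    return outputs_enabled, failed


# Hypothetical checks standing in for real hardware probes.
healthy = {"ram": lambda: True, "adc_reference": lambda: True}
broken = {"ram": lambda: True, "adc_reference": lambda: False}

print(power_up(healthy))
print(power_up(broken))
```

With this structure a watchdog reset on broken hardware ends in a known, inert state with a list of what failed, rather than in another attempt to drive the outputs.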

-- 

Rick C