I have some problems with the watchdog function. The watchdog is initialized to 1 second (WDTCTL=0x5A0C), clocked by ACLK, which runs from a 32 kHz crystal. (The program itself runs at 6 MHz from an external crystal.) My program structure does not allow me to reset the watchdog in the main loop, so a 500 ms timer ISR takes care of the periodic watchdog resets.

Let me describe the steps of this theoretical problem with large practical consequences:

1. I force the stack to get corrupted by calling a function like

   void MyTest(void) { char test[10]; sprintf(test,"01234567890123456"); }

   I know, it's bad programming, but I just want to simulate a stack corruption.

2. The overflow overwrites the return address on the stack, and the further program flow is undefined. (By the way, does anybody know a trick to check the stack integrity at C level when returning from a subroutine?)

3. In most cases the program generates an NMI_Fault interrupt, induced by a Flash Fault (FCTL3&ACCVIFG), because the runaway program tries to write illegally to flash memory. In the NMI handler the corrupted stack can be dealt with.

4. Please note that the Timer ISR can proceed as normal (!), since it just stores its PC on the stack and retrieves it as normal. The ISR does not know about the corrupted stack.

5. So far, I can understand what happens. Apart from the fact that the watchdog is of no use when the Timer ISR continues to reset it while the program is stuck due to a corrupted stack (!), in some cases even the Flash Fault NMI interrupt is not raised because of the corrupted stack. Now I'm in trouble: the watchdog is still being served and the program is stuck.

Even worse, I have seen instances where the watchdog was no longer being served (I have some control LEDs connected to the Timer ISR, so I'm sure that the watchdog did expire) and yet no PUC signal was triggered!

Can anybody bring more clarity to this phenomenon?

Nico
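For readers who want to check the configuration value quoted in the post, here is a minimal sketch that splits WDTCTL=0x5A0C into its fields. It assumes the WDTCTL bit layout of the MSP430x1xx family; the masks are redefined locally so the code compiles on a host PC, and `wdt_decode` and `wdt_timeout_ms` are hypothetical helper names, not part of any vendor header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed MSP430x1xx WDTCTL layout, redefined locally for a host build. */
#define WDT_PASSWORD 0x5A00u /* required in the high byte of every write */
#define WDT_HOLD     0x0080u /* 1 = watchdog stopped */
#define WDT_TMSEL    0x0010u /* 1 = interval timer mode, 0 = watchdog mode */
#define WDT_CNTCL    0x0008u /* 1 = clear the counter */
#define WDT_SSEL     0x0004u /* 1 = ACLK, 0 = SMCLK */
#define WDT_IS_MASK  0x0003u /* interval select */

typedef struct {
    bool password_ok;
    bool held;
    bool interval_mode;
    bool counter_cleared;
    bool aclk;
    uint32_t divider; /* clock cycles until the timer expires */
} wdt_config;

/* Split a value written to WDTCTL into its individual settings. */
wdt_config wdt_decode(uint16_t value)
{
    static const uint32_t dividers[4] = { 32768u, 8192u, 512u, 64u };
    wdt_config c;
    c.password_ok     = (value & 0xFF00u) == WDT_PASSWORD;
    c.held            = (value & WDT_HOLD) != 0u;
    c.interval_mode   = (value & WDT_TMSEL) != 0u;
    c.counter_cleared = (value & WDT_CNTCL) != 0u;
    c.aclk            = (value & WDT_SSEL) != 0u;
    c.divider         = dividers[value & WDT_IS_MASK];
    return c;
}

/* Timeout in milliseconds for a given source clock frequency in Hz. */
uint32_t wdt_timeout_ms(const wdt_config *c, uint32_t clock_hz)
{
    assert(c != 0 && clock_hz != 0u);
    return (uint32_t)(((uint64_t)c->divider * 1000u) / clock_hz);
}
```

Under these assumptions 0x5A0C reads as: correct password, watchdog mode, counter cleared, ACLK selected, divider 32768, which gives 1000 ms with a 32768 Hz crystal and matches the "1 sec" in the post.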
corrupted stack and watch dog functioning
Started by ●September 6, 2003
Reply by ●September 6, 2003
--- In msp430@msp4..., "resiproc" <N.Arends@r...> wrote:
> Can anybody bring more clarity in this phenomenon?
The problem you're observing (at least in the first case you describe)
is that simply clearing the watchdog timer at some periodic rate does
not "cover all the bases" of what might be going wrong in your system.
For example, using a 2Hz ISR to hit a 1Hz watchdog only tells you that
the particular ISR is still working -- it tells you nothing about
mainline code or the other ISRs.
Similarly, hitting the watchdog in your main loop doesn't tell you if
your ISRs are working ...
If you google-search comp.arch.embedded there was a pretty long thread
a while (6 months?) back that discussed various approaches that people
take. I think the gist of it was that each part of your application
that needs "watchdog monitoring" must in some way contribute (in an
AND / combinatorial sense) to clearing the watchdog -- if any
component fails to check in, the watchdog trips.
I think there was also a blurb in Embedded Systems Programming, either
the magazine or the on-line columns, about this, too.
Note that I am not suggesting that you clear the watchdog all over the
place -- rather, that there is a second, sort of "combinatorial"
method that you use to ultimately clear the watchdog within the
specified period.
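A minimal sketch of that "combinatorial" idea, written so it runs on a host PC. Each monitored part of the application sets its own flag, and only one place clears the watchdog, and only when every flag is present. All names here (`wdt_checkin`, `wdt_service`, the three CHECKIN_ flags) are hypothetical, and `hw_kick` is a stand-in for the real WDTCTL write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per component that must prove it is alive (illustrative set). */
#define CHECKIN_MAIN_LOOP 0x01u
#define CHECKIN_TIMER_ISR 0x02u
#define CHECKIN_UART_TASK 0x04u
#define CHECKIN_ALL (CHECKIN_MAIN_LOOP | CHECKIN_TIMER_ISR | CHECKIN_UART_TASK)

static volatile uint8_t checkins; /* flags collected since the last kick */
static unsigned kick_count;       /* host stand-in for the hardware */

/* On the target this would be the password-protected counter clear. */
static void hw_kick(void) { kick_count++; }

unsigned wdt_kick_count(void) { return kick_count; }

/* Called by each component. On the target, guard this read-modify-write
   against interrupts if both ISRs and mainline code call it. */
void wdt_checkin(uint8_t who)
{
    assert((who & ~CHECKIN_ALL) == 0u);
    checkins = (uint8_t)(checkins | who);
}

/* The single place that may clear the watchdog. It does so only when
   every component has checked in, then demands a fresh round. */
bool wdt_service(void)
{
    if (checkins != CHECKIN_ALL) {
        return false; /* someone is silent: let the watchdog run out */
    }
    checkins = 0u;
    hw_kick();
    return true;
}
```

With this arrangement a 2 Hz ISR that is still alive can no longer keep the watchdog quiet on its own; the main loop and the other monitored parts have to report in during the same period.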
As an aside, the Salvo RTOS libraries for the MSP430 are built by
default to clear the watchdog timer from inside the scheduler, which
is called in the application's main loop. While this simple method
does catch if a task fails to yield back to the scheduler (and some
other run-time and compile-time programming errors), it doesn't catch
(for example) a situation that wipes out all of the task control
blocks in RAM and leaves just the idling hook running. Salvo Pro users
can implement more sophisticated schemes as they see fit.
So, to summarize, really robust watchdog timer schemes are more
complex than what you've implemented so far.
--Andrew E. Kalman aek ... at ... pumpkininc ... dot ... com
Reply by ●September 8, 2003
At 19:41 06-09-03 +0200, you wrote:
> ...................
>Now I'm in trouble. The watchdog is still being served and the program got
>stuck.
>
>Even worse, I have seen instances where even the watchdog was not being
>served (I have some control LED's connected to the Timer ISR, so I'm sure
>that the watchdog did expired) and the program did not trigger the PUC
>signal !!!!!!!!!!
>
>Can anybody bring more clarity in this phenomenon?
>
>Nico
As other answers have pointed out, handling the watchdog is
far more complicated than one might think at first.
I developed my own technique, which I will try to describe here as best I can.
I follow these rules:
1) Never reset the watchdog inside an ISR
2) If possible, reset the watchdog ONLY in the main loop of the program
3) Write a reset-watchdog routine that will NOT reset the WDT unless it
has first checked for stack integrity.
This routine is called by the main loop, and as a parameter it receives the
expected value of the stack pointer.
It then checks that the received parameter is equal to the actual stack
pointer. If it is not, the subroutine loops indefinitely. Since it has not
reset the watchdog timer, the timer will expire and trigger the
watchdog action.
You may add other checks as well.
It may be a little tricky to figure out the value (of the stack pointer) to
pass to such a routine, but once you have it everything will go smoothly.
Hope this helps.
A.Morra
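A rough sketch of the checked reset routine described in this reply, arranged so the decision logic can be exercised on a host PC. It is not the poster's actual code. The assumptions: the actual stack pointer is passed in by the caller (on the target it would be read with a compiler-specific intrinsic or a line of inline assembly), `hw_kick` stands in for the WDTCTL write, and the function returns a status where the real routine would spin forever.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WDT_KICKED, WDT_WITHHELD } wdt_result;

static unsigned kicks; /* host stand-in for the hardware counter clear */

/* On the target: write the password plus the counter-clear bit to WDTCTL. */
static void hw_kick(void) { kicks++; }

unsigned wdt_kicks(void) { return kicks; }

/* Clear the watchdog only if the stack pointer is where the main loop
   expects it. expected_sp is a constant worked out once for the call
   site; actual_sp is the value read just before the call. */
wdt_result wdt_kick_checked(uintptr_t expected_sp, uintptr_t actual_sp)
{
    assert(expected_sp != 0u); /* 0 would suggest an uninitialised constant */

    if (actual_sp != expected_sp) {
        /* Target behaviour per the description above: for (;;) { }
           so that the watchdog expires. The host sketch returns instead. */
        return WDT_WITHHELD;
    }
    hw_kick();
    return WDT_KICKED;
}
```

Other checks (for example a guard word at the bottom of the stack area) could be added to the same condition before `hw_kick` is reached.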
Reply by ●September 8, 2003
Well, that's exactly the problem and the chicken/egg dilemma. When the stack gets corrupted (which I simulated by overwriting a local array), the program will not return to the main loop. You are right: in the main loop the stack pointer must be the same every time and can be checked under normal conditions. But that is of no use when the main loop is never entered again after a stack corruption.

Having the watchdog reset in the main loop also prevents you from executing a time-intensive procedure, for example outputting a large text through the serial port (provided it's not interrupt-driven, of course, but uses a for loop), which can take more than 1 second.

But a more severe problem is that overwriting the stack in some cases resulted in the watchdog no longer being served (no resets) AND the watchdog not generating a PUC signal (or at least the program did not restart normally). Then I'm stuck and a POR is needed.

Maybe an external watchdog is the only solution for this kind of problem.

Nico
Reply by ●September 9, 2003
resiproc wrote:
> Well, that's exactly the problem and the chicken/egg dilemma.
> When the stack gets corrupted (which I simulated by overwriting a local
> array) it will not return to the main loop.....
> You are right, in the main loop the stack pointer must be the same every
> time and can be checked under normal conditions.
> But it's of no use when the main loop is never entered anymore after a stack
> corruption.

That's why the watchdog is there: if it isn't serviced, the device is reset. Stack corruption is a bad design issue and should be ferreted out during debugging.

> Having the watchdog reset residing in the main loop will prevent you to
> execute a time-intensive procedure.

Then avoid writing them; this is bad practice anyway.

> For example, to output a large text through the serial port (provided it's
> not running on interrupt of course but using a for loop) which can take >1
> second.

Use an interrupt, it's what they are for. It is poor coding practice to hog the resources for any prolonged period of time. Either use a commercial RTOS if you haven't yet learned this yourself, or more rationally dedicate more time to the initial system design. This is a crucial phase that many people overlook; they just grab a processor that has the right mix of peripherals and slap down a design.

> But a more severe problem is that overwriting the stack in some case
> resulted in neither serving the watch dog anymore (no resets) AND the
> watchdog did not generated a PUC signal (or at least the program did not
> restarted normally). The I'm got stuck and a POR is needed.
> May be an external watchdog is the only solution for these kind of problems.

Good design and a well-thought-out debug process are the real answer to this, not reliance upon a last-ditch safety net device to keep your system up and running.

Before you ever lay mouse to schematic you should have designed the entire system on paper, figuring out bottlenecks like communications and developing strategies to overcome them. Tracking call depth and stack usage is a simple thing to do; there is no excuse for stack overflow. Having said that, I don't know what the compiler guys do in this respect, and they do seem to be very heavy on their stack usage, like stacking all registers in some circumstances, it seems. I guess in this case you'd need to understand how the compiler works in different circumstances and allow adequate stack space, but it should be something that can be profiled.

Cheers

Al
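One common way to profile stack usage as suggested above is "stack painting": fill the stack area with a known pattern at start-up and later count how much of the pattern survived. The sketch below is an illustration under assumptions, not something taken from the thread: it works on a plain array standing in for the stack region, assumes a stack that grows downward (as on the MSP430), and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xA5u /* arbitrary fill pattern */

/* Fill the stack region with the pattern. On a target this must run
   before the region is in use (or must skip the live part of it). */
void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL);
    for (size_t i = 0; i < size; i++) {
        region[i] = STACK_PAINT;
    }
}

/* With a downward-growing stack the untouched bytes sit at the low
   addresses, so count pattern bytes from the start of the region. */
size_t stack_unused_bytes(const uint8_t *region, size_t size)
{
    assert(region != NULL);
    size_t n = 0;
    while (n < size && region[n] == STACK_PAINT) {
        n++;
    }
    return n;
}

/* Deepest stack use seen so far, in bytes. */
size_t stack_peak_usage(const uint8_t *region, size_t size)
{
    return size - stack_unused_bytes(region, size);
}
```

Run the application through its worst-case paths, then read `stack_peak_usage` with a debugger or over a diagnostic port. A value that lands exactly on a pattern byte can hide real usage, so treat the result as a lower bound and keep some margin.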
Reply by ●September 9, 2003
Jack Ganssle talked about WDTs a while back in Embedded Systems Programming. He has some good ideas on implementing a robust watchdog timer. Here's the link to one of the articles.

http://www.embedded.com/design_library/esd/rt/OEG20030220S0037

Hope this helps.

Greg
Reply by ●September 9, 2003
Dear Nico

At 19:46 08-09-03 +0200, you wrote:
>Well, that's exactly the problem and the chicken/egg dilemma.
>When the stack gets corrupted (which I simulated by overwriting a local
>array) it will not return to the main loop.....

Following my rule, if execution does not return to the main program loop (which is the only place that can reset the watchdog, in my approach), the watchdog cannot be reset, so it should trigger.

>You are right, in the main loop the stack pointer must be the same every
>time and can be checked under normal conditions.
>But it's of no use when the main loop is never entered anymore after a stack
>corruption.

Again: no main loop, no way to service the watchdog, hence a forced reset.

>Having the watchdog reset residing in the main loop will prevent you to
>execute a time-intensive procedure.

This is not entirely true. If you really have to go through long procedures, you can track your time usage and issue as many watchdog resets as you need.

>For example, to output a large text through the serial port (provided it's
>not running on interrupt of course but using a for loop) which can take >1
>second.

That is seldom necessary, if ever. Better to do it using a buffer and an ISR; if you really have nothing else to do in the meantime, at least you can put the processor to sleep (while the slow serial peripheral works its way through) and spare some battery!

>But a more severe problem is that overwriting the stack in some case
>resulted in neither serving the watch dog anymore (no resets) AND the
>watchdog did not generated a PUC signal (or at least the program did not
>restarted normally). The I'm got stuck and a POR is needed.
>May be an external watchdog is the only solution for these kind of problems.

There are, of course, ways to stop the processor entirely through stack corruption. Once the processor runs away it can execute any sequence of instructions, including what you meant to be ASCII characters, not instructions.

This is the reason that disabling the watchdog is a password-protected access. But most programs include, as their first instruction, the correct watchdog disable command. Normally that is not a problem, since execution from there on IS the real program, but then you insist on resetting the watchdog inside an ISR ... so things can definitely go wrong.

An external watchdog may be of help ... sometimes! You still have to design your firmware with the above considerations in mind, at least in part. Also, you may want to know that a reset was triggered by the watchdog, but you no longer have an internal register to inspect ... so, be careful!

regards

A.Morra
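The "buffer and an ISR" suggestion for serial output could look roughly like the sketch below. It is a host-runnable illustration with hypothetical names, not code from the thread: `tx_isr_next` plays the part of the transmit-ready interrupt handler, which on a target would write the returned byte to the UART transmit register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_BUF_SIZE 16u /* power of two so indices wrap with a mask */

static volatile uint8_t tx_buf[TX_BUF_SIZE];
static volatile uint8_t tx_head; /* advanced only by mainline code */
static volatile uint8_t tx_tail; /* advanced only by the TX ISR */

/* Mainline side: queue one byte. Returns false when the buffer is full,
   so the caller can go on with other work and retry later. */
bool tx_put(uint8_t byte)
{
    uint8_t next = (uint8_t)((tx_head + 1u) & (TX_BUF_SIZE - 1u));
    if (next == tx_tail) {
        return false;
    }
    tx_buf[tx_head] = byte;
    tx_head = next;
    return true;
}

/* Queue as much of a message as fits; returns the number of bytes taken. */
size_t tx_write(const char *s, size_t len)
{
    size_t n = 0;
    while (n < len && tx_put((uint8_t)s[n])) {
        n++;
    }
    return n;
}

/* ISR side: fetch the next byte to send. Returns false when the buffer
   is empty, at which point a target would disable the TX interrupt. */
bool tx_isr_next(uint8_t *out)
{
    assert(out != NULL);
    if (tx_tail == tx_head) {
        return false;
    }
    *out = tx_buf[tx_tail];
    tx_tail = (uint8_t)((tx_tail + 1u) & (TX_BUF_SIZE - 1u));
    return true;
}
```

Because `tx_write` never blocks, the main loop keeps cycling while a long text drains in the background, so a watchdog that is serviced only from the main loop is not starved by the output. One slot is always left empty to tell "full" from "empty", so this buffer holds 15 bytes.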
Reply by ●September 9, 2003
The problem I have with most of this is that it assumes the old adage
that no program is bug free is true. And frankly that's bullshit. Not
only bullshit, but it engenders a subconscious pre-acceptance of
failure, or of something that isn't good enough. Basically, if all
software has bugs, why bother debugging beyond a certain point?
To me this is just wrong. I wouldn't claim to write bug free code all
the time; if I did I wouldn't have to debug. But I can say that I have
written code that, to all intents and purposes, was bug free, i.e. it
performed its task with no discernible deviation from its required
operation over the expected life of the product, or through n-million
iterations. Of the designs that I consider to have met these criteria
two of them lacked a WDT completely. All of them shared 3 things in
common, a thorough preliminary design phase, a comprehensively planned
testing strategy, and a thorough debugging cycle.
The programs were far from trivial as well, ranging up to 256k of
assembler. Two that come to mind were team efforts, the rest solo. Team
development is by far the hardest to get right, but with the right
attitude it is possible.
I can't pre-accept the notion of imperfection or failure.
Al
Reply by ●September 9, 2003
Al:
> The problem I have with most of this is that it assumes the old adage
> that no program is bug free is true. And frankly that's bullshit. Not
> only bullshit, but it engenders a subconscious pre-acceptance of
> failure, or of something that isn't good enough. basically if all
> software has bugs why bother debugging beyond a certain point?
I agree with your point. I think Jack Ganssle does too; I think he was
saying that it's responsible engineering to plan for worst-case scenarios,
and my reading of the majority of his writing is that he concurs with you
on the importance of rigorous design and test. He has over the years
talked about a variety of sources of system error, both internal
(program error) and external (ESD, etc.), and emphasizes the
importance of trying to provide solutions for all plausible error sources.
He doesn't argue in favor of relying on a WDT instead of designing
correctly; in fact he does the opposite. 'Plan ahead' seems to almost always
be one of his main points. But he also repeatedly emphasizes that one
should plan for the worst where failure is a very bad thing.
The assertion that no program can be known to be bug free is an interesting
one. I can't provide an argument one way or another, but it makes sense to
me on the face of it. If this were not true, foolproof test systems would
be standard, off the shelf, even built into every compiler, wouldn't they?
--Bruce
PS
For anyone else interested in this subject, here are my misc. links to
articles on fault-tolerance and robust system design in embedded systems.
Sorry for the HTML format of this message, but it saved me the trouble of
reformatting these:
Born to Fail
Watchdog Timers
Introduction to Watchdog Timers
Watching the Watchdog
Li'l Bow Wow
Locking Up Your Software
Solving the Software Safety Paradox
Sensible Software Testing
Mea Culpa
Getting By Without an RTOS
The Wisdom of Manned Space Flight
Cummins Centinel
Forget Me Not
Why The Towers Fell
Safety-critical Operating Systems
Use Processor Redundancy for Maximum Reliability
Fault-tolerant techniques
Analyzing System Failure
Switch Bounce and other dirty secrets
System-level Issues in Battery Charger Applications
Implementing Fault-tolerant Systems with Windows CE
Software Test Techniques For System Fault-tree Analysis
The Systematic Improvement of Fault Tolerance in the Rio File Cache
Software Techniques for Improving Microcontroller EMC performance
Reply by ●September 9, 2003
Oops! Sorry for the missing links.

--Bruce

Born to Fail
http://www.embedded.com/story/OEG20021211S0032
Watchdog Timers
http://www.embedded.com/2000/0011/0011feat4.htm
Introduction to Watchdog Timers
http://www.embedded.com/story/OEG20010920S0064
Watching the Watchdog
http://www.embedded.com/story/OEG20030220S0037
Li'l Bow Wow
http://www.embedded.com/story/OEG20030115S0042
Locking Up Your Software
http://www.embedded.com/2000/0012/0012murphy.htm
Solving the Software Safety Paradox
http://www.embedded.com/98/9812/9812feat2.htm
Sensible Software Testing
http://www.embedded.com/2000/0008/0008feat3.htm
Mea Culpa
http://www.embedded.com/story/OEG20020222S0023
Getting By Without an RTOS
http://www.embedded.com/2000/0009/0009feat4.htm
The Wisdom of Manned Space Flight
http://www.embedded.com/story/OEG20030210S0039
Cummins Centinel
http://www.embedded.com/story/OEG20010618S0078
Forget Me Not
http://www.embedded.com/story/OEG20010529S0121
Why The Towers Fell
http://www.embedded.com/story/OEG20020910S0051
Safety-critical Operating Systems
http://www.embedded.com/story/OEG20010829S0055
Use Processor Redundancy for Maximum Reliability
http://www.commsdesign.com/design_corner/OEG20020201S0008
Analyzing System Failure
http://www.jaluna.com/doc/c5/html/AppliDevGuide/c5085.html
Switch Bounce and other dirty secrets
http://www.maxim-ic.com/appnotes.cfm/appnote_number/287/ln/en
System-level Issues in Battery Charger Applications
http://www.maxim-ic.com/appnotes.cfm/appnote_number/680
Implementing Fault-tolerant Systems with Windows CE
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dncenet/html/faulttol.asp
Software Test Techniques For System Fault-tree Analysis
http://www.cs.virginia.edu/~jck/publications/safecomp.97.pdf
The Systematic Improvement of Fault Tolerance in the Rio File Cache
http://www.eecs.umich.edu/Rio/papers/rioPC.pdf
Software Techniques for Improving Microcontroller EMC performance
http://eu.st.com/stonline/books/pdf/docs/5833.pdf