EmbeddedRelated.com
Forums

corrupted stack and watch dog functioning

Started by resiproc September 6, 2003
I have some problems with the watchdog function.
The watchdog is initialized to a 1 s interval (WDTCTL=0x5A0C) and clocked by ACLK,
which is driven by a 32 kHz crystal.
(The program itself runs at 6 MHz, clocked by an external crystal.)
My program structure does not allow me to reset the watchdog in the main loop.
Therefore a 500 ms timer ISR is implemented to take care of the periodic
resets of the watchdog.
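
For readers decoding the WDTCTL value above, here is a minimal sketch. It assumes the bit layout of the MSP430x1xx watchdog register; the macro values are written out by hand rather than taken from a device header, and the helper names are made up for illustration, so check them against your own part. It shows why 0x5A0C on a 32768 Hz ACLK should give a 1 s timeout, and that a write without the 0x5A password in the high byte is not a valid watchdog access.

```c
#include <stdint.h>

/* Assumed MSP430x1xx WDTCTL bit layout; verify against your device header. */
#define WDTPW      0x5A00u  /* password, required in the high byte of every write */
#define WDTHOLD    0x0080u  /* 1 = watchdog stopped */
#define WDTTMSEL   0x0010u  /* 1 = interval timer mode, 0 = watchdog mode */
#define WDTCNTCL   0x0008u  /* 1 = clear the counter (serve the watchdog) */
#define WDTSSEL    0x0004u  /* 1 = ACLK, 0 = SMCLK */
#define WDTIS_MASK 0x0003u  /* interval select bits */

/* Clock divider selected by the interval select bits. */
static uint32_t wdt_divider(uint16_t wdtctl)
{
    static const uint32_t div[4] = { 32768u, 8192u, 512u, 64u };
    return div[wdtctl & WDTIS_MASK];
}

/* Timeout in milliseconds for a given source clock frequency in Hz. */
static uint32_t wdt_timeout_ms(uint16_t wdtctl, uint32_t clk_hz)
{
    return (uint32_t)((1000ull * wdt_divider(wdtctl)) / clk_hz);
}

/* A write is only accepted when the high byte equals the password. */
static int wdt_write_is_valid(uint16_t value)
{
    return (value & 0xFF00u) == WDTPW;
}
```

With these definitions, `wdt_timeout_ms(0x5A0C, 32768)` evaluates to 1000, and `0x5A0C` decodes as password + counter clear + ACLK with the longest interval.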

Let me describe the steps of this theoretical problem with large practical
consequences:

1. I force the stack to get corrupted by calling a function like

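#include <stdio.h>

/* Deliberately overflows a local buffer to simulate stack corruption. */
void MyTest(void)
{
  char test[10];
  /* 17 characters plus the terminating NUL are written into 10 bytes. */
  sprintf(test,"01234567890123456");
}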

I know, it's bad programming, but I just want to simulate a stack
corruption.

2. The overflow overwrites the return address on the stack, so the further
program flow is undefined.
(By the way, does anybody know a trick to check stack integrity
at the C level when returning from a subroutine?)

3. In most cases the program generates an NMI, induced by a flash access
violation (FCTL3 & ACCVIFG), because the runaway program happens to attempt
an illegal write to flash memory.
Inside the NMI handler the corrupted stack can be dealt with.

4. Please note that the Timer ISR can proceed as normal (!!!!!), since it
just stores its PC on the stack and retrieves it as normal. The ISR does not
know about the corrupted stack.

5. So far, I can understand what happens. The watchdog is of no use when the
timer ISR keeps resetting it while the program is stuck due to a corrupted
stack (!!!). On top of that, in some cases even the flash fault NMI is not
serviced because of the corrupted stack.
Now I'm in trouble: the watchdog is still being served and the program is
stuck.

Even worse, I have seen instances where the watchdog was no longer being
served (I have some control LEDs driven from the timer ISR, so I'm sure
that the watchdog did expire) and yet the PUC signal was not triggered!

Can anybody bring more clarity in this phenomenon?

Nico
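
On the question in step 2 about checking stack integrity at the C level: one common idea is a guard value ("canary") placed directly behind a buffer and verified before returning. The sketch below is only an illustration of that idea, not something from the thread; the struct, the pattern value and the function name are all hypothetical. It keeps the buffer and the guard in one struct so the layout is predictable, because the compiler decides where true locals end up on the stack. It catches only linear overruns that run through the guard.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY 0xA5C3u          /* arbitrary illustrative pattern */

struct guarded_buf {
    char     data[10];
    uint16_t canary;            /* sits directly behind the buffer */
};

/*
 * Simulates a careless copy of n bytes into data[]. Anything beyond
 * sizeof data spills into the canary. The copy is clamped to the size
 * of the struct so the demonstration itself stays well defined.
 * Returns 1 if the canary is intact, 0 if it was overwritten.
 */
static int copy_and_check(const char *src, size_t n)
{
    struct guarded_buf g;

    g.canary = CANARY;
    if (n > sizeof g) {
        n = sizeof g;
    }
    memcpy(&g, src, n);         /* data[] starts at offset 0 of the struct */

    /* On a target, a mismatch here would be the place to force a reset. */
    return g.canary == CANARY;
}
```

Copying 10 bytes leaves the guard intact, while copying 12 bytes of the test string from step 1 overwrites it and is detected before the function returns.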



--- In msp430@msp4..., "resiproc" <N.Arends@r...> wrote:
> Can anybody bring more clarity in this phenomenon?

The problem you're observing (at least in the first case you describe)
is that simply clearing the watchdog timer at some periodic rate does
not "cover all the bases" of what might be going wrong in your system.

For example, using a 2Hz ISR to hit a 1Hz watchdog only tells you that
the particular ISR is still working -- it tells you nothing about
mainline code or the other ISRs.

Similarly, hitting the watchdog in your main loop doesn't tell you if
your ISRs are working ...

If you google-search comp.arch.embedded there was a pretty long thread
a while (6 months?) back that discussed various approaches that people
take. I think the gist of it was that each part of your application
that needs "watchdog monitoring" must in some way contribute (in an
AND / combinatorial sense) to clearing the watchdog -- if any
component fails to contribute, the watchdog trips.

I think there was also a blurb in Embedded Systems Programming, either
the magazine or the on-line columns, about this, too.

Note that I am not suggesting that you clear the watchdog all over the
place -- rather, that there is a second, sort of "combinatorial"
method that you use to ultimately clear the watchdog within the
specified period.
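
The "combinatorial" idea above can be sketched roughly as follows. This is only one possible reading of it: the component names and the `wdt_kick()` stand-in are hypothetical, and on a real target the flag updates shared between ISRs and mainline code would need to be made atomic (for example by briefly disabling interrupts).

```c
#include <stdint.h>

/* One bit per monitored component (names are illustrative only). */
#define ALIVE_MAIN   0x01u
#define ALIVE_TIMER  0x02u
#define ALIVE_UART   0x04u
#define ALIVE_ALL    (ALIVE_MAIN | ALIVE_TIMER | ALIVE_UART)

static volatile uint8_t alive_flags;
static unsigned kick_count;     /* stands in for the hardware on a host */

/* On a target this would write the password and counter-clear bits. */
static void wdt_kick(void)
{
    kick_count++;
}

/* Each component calls this when it has made real progress. */
void report_alive(uint8_t who)
{
    alive_flags |= who;
}

/*
 * Called from exactly one place. The watchdog is cleared only when
 * every component has reported since the last clear (logical AND).
 * Returns 1 if the watchdog was cleared, 0 otherwise.
 */
int wdt_service(void)
{
    if ((alive_flags & ALIVE_ALL) != ALIVE_ALL) {
        return 0;
    }
    alive_flags = 0;
    wdt_kick();
    return 1;
}
```

If any one component stops reporting, `wdt_service()` stops clearing the watchdog and the timeout does the rest.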

As an aside, the Salvo RTOS libraries for the MSP430 are built by
default to clear the watchdog timer from inside the scheduler, which
is called in the application's main loop. While this simple method
does catch if a task fails to yield back to the scheduler (and some
other run-time and compile-time programming errors), it doesn't catch
(for example) a situation that wipes out all of the task control
blocks in RAM and leaves just the idling hook running. Salvo Pro users
can implement more sophisticated schemes as they see fit.

So, to summarize, really robust watchdog timer schemes are more
complex than what you've implemented so far.

--Andrew E. Kalman  aek ... at ... pumpkininc ... dot ... com


At 19:41 06-09-03 +0200, you wrote:
>  ...................
>Now I'm in trouble. The watchdog is still being served and the program got
>stuck.
>
>Even worse, I have seen instances where even the watchdog was not being
>served (I have some control LED's connected to the Timer ISR, so I'm sure
>that the watchdog did expired) and the program did not trigger the PUC
>signal !!!!!!!!!!
>
>Can anybody bring more clarity in this phenomenon?
>
>Nico

As other answers have pointed out, handling the watchdog is
far more complicated than one might think at first.
I developed my own technique, which I will try to describe here as best I can.
I follow these rules:
  1) Never reset the watchdog inside an ISR.
  2) If possible, reset the watchdog ONLY in the main loop of the
program.
  3) Write a reset-watchdog routine that will NOT reset the WDT unless it
has first checked stack integrity.
This routine is called by the main loop, and it receives the expected
value of the stack pointer as a parameter.
It then checks that the received parameter is equal to the actual stack
pointer. If it is not, the routine loops indefinitely; since it has not
reset the watchdog timer, the timer expires and triggers the
watchdog action.
You may add other checks as well.
It may be a little tricky to figure out the stack pointer value to
pass to this routine, but once you have it everything will go smoothly.
Hope this helps.
A.Morra
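
A rough sketch of the guarded reset routine described in rule 3 might look like this. It is a host-side illustration under stated assumptions, not the poster's actual code: the actual stack pointer is passed in as a second argument so the logic can be exercised off-target, whereas on an MSP430 the routine would read it itself (some toolchains offer an intrinsic for this; treat that as an assumption to verify). The function and counter names are hypothetical.

```c
#include <stdint.h>

static unsigned kick_count;     /* stands in for the hardware on a host */

/* On a target this would write the password and counter-clear bits. */
static void wdt_kick(void)
{
    kick_count++;
}

/*
 * Clears the watchdog only if the stack pointer has the value the main
 * loop expects at this call site.
 * Returns 1 if the watchdog was cleared, 0 on a mismatch.
 */
int wdt_reset_checked(uintptr_t expected_sp, uintptr_t actual_sp)
{
    if (expected_sp != actual_sp) {
        /* On a target: for (;;) { } and let the watchdog expire. */
        return 0;
    }
    wdt_kick();
    return 1;
}
```

The expected value can be captured once, at the same call depth in the main loop, during a known-good run.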


Well, that's exactly the problem and the chicken/egg dilemma.
When the stack gets corrupted (which I simulated by overwriting a local
array) the program will not return to the main loop.
You are right, in the main loop the stack pointer must be the same every
time and can be checked under normal conditions.
But that is of no use when the main loop is never entered again after a
stack corruption.

Having the watchdog reset reside in the main loop prevents you from
executing a time-intensive procedure,
for example outputting a large text through the serial port (not
interrupt-driven, of course, but using a for loop), which can take >1
second.

But a more severe problem is that overwriting the stack in some cases
resulted in the watchdog no longer being served (no resets) AND the
watchdog not generating a PUC signal (or at least the program did not
restart normally). Then I'm stuck and a POR is needed.
Maybe an external watchdog is the only solution for these kinds of problems.

Nico

resiproc wrote:
> Well, that's exactly the problem and the chicken/egg dilemma.
> When the stack gets corrupted (which I simulated by overwriting a local
> array) it will not return to the main loop.....
> You are right, in the main loop the stack pointer must be the same every
> time and can be checked under normal conditions.
> But it's of no use when the main loop is never entered anymore after a stack
> corruption.

That's why the watchdog is there: if it isn't serviced, the device is
reset. Stack corruption is a design flaw and should be ferreted
out during debugging.

> 
> Having the watchdog reset residing in the main loop will prevent you to
> execute a time-intensive procedure.

Then avoid writing them; this is bad practice anyway.

> For example, to output a large text through the serial port (provided it's
> not running on interrupt of course but using a for loop) which can take >1
> second.
> 

Use an interrupt; that's what they are for. It is poor coding practice
to hog the resources for any prolonged period of time. Either use a
commercial RTOS if you haven't yet learned this yourself, or, more
rationally, dedicate more time to the initial system design. This is a
crucial phase that many people overlook; they just grab a processor that
has the right mix of peripherals and slap down a design.


> But a more severe problem is that overwriting the stack in some case
> resulted in neither serving the watch dog anymore (no resets) AND the
> watchdog did not generated a PUC signal (or at least the program did not
> restarted normally). The I'm got stuck and a POR is needed.
> May be an external watchdog is the only solution for these kind of problems.

Good design and a well thought out debug process are the real answer to
this, not reliance upon a last-ditch safety net device to keep your
system up and running. Before you ever lay mouse to schematic you should
have designed the entire system on paper, figuring out bottlenecks like
communications, and developing strategies to overcome them.

Tracking call depth and stack usage is a simple thing to do; there is no
excuse for stack overflow. Having said that, I don't know what
the compiler guys do in this respect, and they do seem to be very heavy
on their stack usage, like stacking all registers in some circumstances,
it seems. I guess in this case you'd need to understand how the compiler
works in different circumstances and allow adequate stack space, but it
should be something that can be profiled.
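
One widely used way to profile stack usage is to fill the stack region with a known pattern at startup and later count how much of the pattern survived. The sketch below is only an illustration of that technique, not something from the post; it works on an ordinary array so it can be tried on a host, and it assumes a stack that grows downward from the top of the region, as on the MSP430. The function names and fill value are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u        /* arbitrary fill pattern */

/* Fill the whole stack region with the pattern, once, at startup. */
void stack_paint(uint8_t *base, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        base[i] = STACK_FILL;
    }
}

/*
 * The stack grows downward from base + size towards base, so bytes
 * that were never touched remain at the low end of the region.
 * Returns the number of bytes that still hold the fill pattern,
 * counted from the low end, i.e. the remaining headroom.
 */
size_t stack_unused(const uint8_t *base, size_t size)
{
    size_t n = 0;

    while (n < size && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Running the application through its worst-case paths and then reading `stack_unused()` gives a high-water mark; a value near zero means the stack allocation is too tight. A stored byte that happens to equal the fill value can make the result slightly optimistic.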

Cheers

Al


Jack Ganssle talked about WDT's a while back in Embedded Systems
Programming. He has some good ideas on implementation of a robust watchdog
timer.  Here's the link to one of the articles.

http://www.embedded.com/design_library/esd/rt/OEG20030220S0037

Hope this helps.

Greg

Dear Nico

At 19:46 08-09-03 +0200, you wrote:
>Well, that's exactly the problem and the chicken/egg dilemma.
>When the stack gets corrupted (which I simulated by overwriting a local
>array) it will not return to the main loop.....

Following my rule, if it does not return to the main program loop (which is 
the only one that can reset the watchdog, in my approach) the watchdog 
cannot be reset, so it should be triggered.

>You are right, in the main loop the stack pointer must be the same every
>time and can be checked under normal conditions.
>But it's of no use when the main loop is never entered anymore after a stack
>corruption.

Again ... no main loop ... no way to stop the watchdog --> obliged reset


>Having the watchdog reset residing in the main loop will prevent you to
>execute a time-intensive procedure.

This is not entirely true, if you really have to go through long 
procedures, you can track your time usage and issue as many watchdog resets 
as you need.

>For example, to output a large text through the serial port (provided it's
>not running on interrupt of course but using a for loop) which can take >1
>second.

That is seldom necessary, if ever. Better do it using a buffer and an ISR; 
if you really have nothing else to do in the meantime, at least you can put 
the processor to sleep (while the serial-AND-slow peripheral works its way) 
and spare some battery!
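
The buffer-plus-ISR approach suggested above could be sketched like this. It is a minimal illustration on a host, with hypothetical names: `tx_isr_next()` represents the body of a transmit-ready ISR, which on a target would write the returned byte to the UART transmit register and disable the transmit interrupt when the buffer runs empty.

```c
#include <stddef.h>
#include <stdint.h>

#define TX_SIZE 16u             /* must be a power of two */

static volatile uint8_t tx_buf[TX_SIZE];
static volatile uint8_t tx_head;    /* next slot to write (mainline) */
static volatile uint8_t tx_tail;    /* next slot to read (ISR) */

/* Queue one byte. Returns 1 on success, 0 if the buffer is full. */
int tx_put(uint8_t c)
{
    uint8_t next = (uint8_t)((tx_head + 1u) & (TX_SIZE - 1u));

    if (next == tx_tail) {
        return 0;               /* full: one slot is always left empty */
    }
    tx_buf[tx_head] = c;
    tx_head = next;
    return 1;
}

/* Body of the transmit-ready ISR: next byte to send, or -1 if empty. */
int tx_isr_next(void)
{
    uint8_t c;

    if (tx_tail == tx_head) {
        return -1;
    }
    c = tx_buf[tx_tail];
    tx_tail = (uint8_t)((tx_tail + 1u) & (TX_SIZE - 1u));
    return c;
}

/* Queue as much of a string as fits. Returns the number of bytes queued. */
size_t tx_write(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0' && tx_put((uint8_t)s[n])) {
        n++;
    }
    return n;
}
```

The main loop never blocks on the serial port: it queues what fits, keeps serving the watchdog, and retries the remainder on a later pass.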


>But a more severe problem is that overwriting the stack in some case
>resulted in neither serving the watch dog anymore (no resets) AND the
>watchdog did not generated a PUC signal (or at least the program did not
>restarted normally). The I'm got stuck and a POR is needed.
>May be an external watchdog is the only solution for these kind of problems.

There are, of course, ways to stop the processor entirely through stack
corruption.
Once the processor runs away, it can execute any sequence of
instructions, including bytes that you meant to be ASCII characters,
not instructions.
This is the reason that disabling the watchdog is a password-protected
access. But most programs include the correct watchdog-disable command
as their first instruction. Normally that is not a problem, since
execution from there on IS the real program, but then you insist on
resetting the watchdog inside an ISR ... so things can definitely go haywire.

An external watchdog may be of help ... sometimes! You still have to design
your firmware taking care of the above considerations, at least in part.
Also, you may want to be able to know that the reset was triggered by the
watchdog, but you no longer have an internal register to inspect ...
so, be careful!
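
For the internal watchdog, the register inspection mentioned above can be sketched as follows. This assumes the MSP430x1xx convention that a watchdog timeout sets the WDTIFG flag (bit 0 of IFG1) and that the flag survives the resulting PUC; verify this against your device's documentation. The register is passed in as a pointer so the logic can be tried on a host, and the enum and function names are hypothetical.

```c
#include <stdint.h>

#define WDTIFG 0x01u            /* assumed position of the flag in IFG1 */

enum reset_cause {
    RESET_OTHER,                /* power-up, reset pin, or anything else */
    RESET_WATCHDOG              /* the watchdog flag was found set */
};

/*
 * Call once, early in startup, before the flag can be set again.
 * Clears the flag so the next reset can be classified on its own.
 */
enum reset_cause classify_reset(volatile uint8_t *ifg1)
{
    if (*ifg1 & WDTIFG) {
        *ifg1 &= (uint8_t)~WDTIFG;
        return RESET_WATCHDOG;
    }
    return RESET_OTHER;
}
```

With an external watchdog this information is lost unless the hardware provides some other indication, which is the caution raised above.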


regards
A.Morra


The problem I have with most of this is that it assumes the old adage 
that no program is bug free is true. And frankly that's bullshit. Not 
only bullshit, but it engenders a subconscious pre-acceptance of 
failure, or of something that isn't good enough. Basically, if all
software has bugs, why bother debugging beyond a certain point?

To me this is just wrong. I wouldn't claim to write bug free code all 
the time, if I did I wouldn't have to debug, but I can say that I have 
written code that, to all intents and purposes, was bug free, i.e. it
performed its task with no discernible deviation from its required 
operation over the expected life of the product, or through n-million 
iterations. Of the designs that I consider to have met these criteria 
two of them lacked a WDT completely. All of them shared 3 things in 
common, a thorough preliminary design phase, a comprehensively planned 
testing strategy, and a thorough debugging cycle.

The programs were far from trivial as well, ranging up to 256k of 
assembler. Two that come to mind were team efforts, the rest solo. Team 
development is by far the hardest to get right, but with the right 
attitude it is possible.

I can't pre-accept the notion of imperfection or failure.

Al



Al:

> The problem I have with most of this is that it assumes the old adage
> that no program is bug free is true. And frankly that's bullshit. Not
> only bullshit, but it engenders a subconscious pre-acceptance of
> failure, or of something that isn't good enough. basically if all
> software has bugs why bother debugging beyond a certain point?


I agree with your point.  I think Jack Ganssle does too;  I think he was
saying that it's responsible engineering to plan for worst-case scenarios,
and my reading of the majority of his writing is that he concurs with you
on the importance of rigorous design and test.  He has over the years
talked about a variety of sources of system error with both internal
(program error) and external (ESD, etc.) sources, and emphasizes the
importance of trying to provide solutions for all plausible error sources.
He doesn't argue in favor of relying on a WDT instead of designing
correctly; in fact he does the opposite.  'Plan ahead' seems to almost always
be one of his main points.  But he also repeatedly emphasizes that one
should plan for the worst where failure is a very bad thing.

The assertion that no program can be known to be bug free is an interesting
one.  I can't provide an argument one way or another, but it makes sense to
me on the face of it.  If this were not true, foolproof test systems would
be standard, off the shelf, even built into every compiler, wouldn't they?

--Bruce

PS
For anyone else interested in this subject, here are my misc. links to
articles on fault-tolerance and robust system design in embedded systems.
Sorry for the HTML format of this message, but it saved me the trouble of
reformatting these:


Born to Fail
Watchdog Timers
Introduction to Watchdog Timers
Watching the Watchdog
Li'l Bow Wow
Locking Up Your Software
Solving the Software Safety Paradox
Sensible Software Testing
Mea Culpa
Getting By Without an RTOS
The Wisdom of Manned Space Flight
Cummins Centinel
Forget Me Not
Why The Towers Fell
Safety-critical Operating Systems
Use Processor Redundancy for Maximum Reliability
Fault-tolerant techniques
Analyzing System Failure
Switch Bounce and other dirty secrets
System-level Issues in Battery Charger Applications
Implementing Fault-tolerant Systems with Windows CE
Software Test Techniques For System Fault-tree Analysis
The Systematic Improvement of Fault Tolerance in the Rio File Cache
Software Techniques for Improving Microcontroller EMC performance







Oops!  Sorry for the missing links.

--Bruce


Born to Fail
http://www.embedded.com/story/OEG20021211S0032
Watchdog Timers
http://www.embedded.com/2000/0011/0011feat4.htm
Introduction to Watchdog Timers
http://www.embedded.com/story/OEG20010920S0064
Watching the Watchdog
http://www.embedded.com/story/OEG20030220S0037
Li'l Bow Wow
http://www.embedded.com/story/OEG20030115S0042
Locking Up Your Software
http://www.embedded.com/2000/0012/0012murphy.htm
Solving the Software Safety Paradox
http://www.embedded.com/98/9812/9812feat2.htm
Sensible Software Testing
http://www.embedded.com/2000/0008/0008feat3.htm
Mea Culpa
http://www.embedded.com/story/OEG20020222S0023
Getting By Without an RTOS
http://www.embedded.com/2000/0009/0009feat4.htm
The Wisdom of Manned Space Flight
http://www.embedded.com/story/OEG20030210S0039
Cummins Centinel
http://www.embedded.com/story/OEG20010618S0078
Forget Me Not
http://www.embedded.com/story/OEG20010529S0121
Why The Towers Fell
http://www.embedded.com/story/OEG20020910S0051
Safety-critical Operating Systems
http://www.embedded.com/story/OEG20010829S0055
Use Processor Redundancy for Maximum Reliability
http://www.commsdesign.com/design_corner/OEG20020201S0008
Analyzing System Failure
http://www.jaluna.com/doc/c5/html/AppliDevGuide/c5085.html
Switch Bounce and other dirty secrets
http://www.maxim-ic.com/appnotes.cfm/appnote_number/287/ln/en
System-level Issues in Battery Charger Applications
http://www.maxim-ic.com/appnotes.cfm/appnote_number/680
Implementing Fault-tolerant Systems with Windows CE
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dncenet/html/faulttol.asp
Software Test Techniques For System Fault-tree Analysis
http://www.cs.virginia.edu/~jck/publications/safecomp.97.pdf
The Systematic Improvement of Fault Tolerance in the Rio File Cache
http://www.eecs.umich.edu/Rio/papers/rioPC.pdf
Software Techniques for Improving Microcontroller EMC performance
http://eu.st.com/stonline/books/pdf/docs/5833.pdf