Debugging crashes which appear after a long time

Gurus,
      I am encountering a strange situation in one of our consumer
electronics embedded products.
The product runs on a proprietary RTOS and a custom processor supplied by
STMicroelectronics. What happens is that the box runs for around 8
hours without any problem, and then it crashes. I am clueless about
how to debug it. The debugger window throws the error
"task out of scope,stack frame cannot be set!".
I suspected this to be a stack overflow problem in the running tasks
and increased the stack size. That did not help.
I tried adding trace messages and printing them to the console. The problem
is that, because of the time required to print to the console, my application
does not even come up properly. I cannot use the debugger either, because of
this out-of-scope error. The irony is that the problem appears only after 8
hours, which makes me wait another 8 hours to reproduce the issue.
I am wondering whether there are any good approaches you experts have
used to solve such problems.
It would be helpful if someone could point me in the right direction. I am
looking for debugging tips that can help me sort out this
issue.
Looking forward to your replies, and thanks in advance,
Regards,
s.subbarayan

Reply by Vadim Borshchev ● March 23, 2006

ssubbarayan wrote:
> Now what happens is,the box runs for around 8
> hours with out any problem.After 8 hours it crashes.

If it happens consistently after a period of time, that is a good sign :)  Things to check:
- a memory leak?
- a hardware counter overflows?
- a counter variable overflows?
- some free-running hardware timer generates an interrupt that is not handled?
- the building air conditioning system starts/stops, generating a surge on the mains line?
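One way to test the overflow theories is to work backwards from the 8-hour figure and ask which counter width and increment rate would wrap in that time. The following is only a sketch: the function names are made up for illustration, and the real counter widths and tick rates have to come from the hardware manual and the RTOS configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Time to failure reported in the original post: about 8 hours. */
#define SECONDS_TO_CRASH (8u * 60u * 60u)   /* 28800 s */

/* Increment rate (counts per second) at which an unsigned counter of
 * `bits` bits (1..63) wraps after exactly `seconds` seconds. */
double wrap_rate_hz(unsigned bits, uint32_t seconds)
{
    assert(bits >= 1u && bits <= 63u && seconds > 0u);
    return (double)(1ULL << bits) / (double)seconds;
}

/* Seconds until an unsigned counter of `bits` bits (1..63) wraps when
 * it is incremented `rate_hz` times per second. */
double seconds_to_wrap(unsigned bits, double rate_hz)
{
    assert(bits >= 1u && bits <= 63u && rate_hz > 0.0);
    return (double)(1ULL << bits) / rate_hz;
}
```

For example, wrap_rate_hz(32, SECONDS_TO_CRASH) is roughly 149 kHz, and wrap_rate_hz(31, SECONDS_TO_CRASH), the case of a signed 32-bit counter going negative, is roughly 74.6 kHz. If some timer or event source in the system runs near one of those rates, it becomes a suspect.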

Would it crash if you ran another (idle) task instead of yours?  If yes - check the RTOS.

HTH,

  Vadim

Reply by Gary...@aol.com ● March 23, 2006

> I suspected this to be a stack over flow problem for the tasks running
> and increased the stack size.Still that did not help me.

Try stuffing a known pattern into the RAM used for the stacks. After it
crashes, check those areas to see if that pattern gets overwritten.
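A minimal sketch of that stack-painting idea follows. It assumes a stack that grows downward toward its base address and a word-aligned stack area; the fill value and the function names are arbitrary choices for illustration, not part of any particular RTOS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* arbitrary pattern, unlikely as real data */

/* Fill a task's stack area with the pattern. Call this before the
 * task is started. */
void stack_paint(uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++)
        base[i] = STACK_FILL;
}

/* Count the words at the low end of the stack area that still hold
 * the pattern. Assumes the stack grows downward toward `base`.
 * A result of 0 means the stack was used all the way to its limit,
 * which makes an overflow likely. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL)
        n++;
    return n;
}
```

After a crash, calling stack_unused_words() on every task stack (or simply inspecting the stack memory in the debugger) shows which task, if any, ran out of room.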

Reply by Mark McDougall ● March 23, 2006

ssubbarayan wrote:

> I am
> wondering are there any good approaches you experts would have used
> to solve such problems? It will be helpful if some one can point me
> in right direction.I am looking for some debugging tips which can
> help me to sort out this issue.

Do you have any theories at all? If so, you need to devise a 
configuration that will make it crash more often.

I had a problem years ago that occurred once every few *weeks*. 
Fortunately a code review threw up one theory reasonably early - and I 
spent the next few weeks proving it.

1. Can you accelerate the tasks that the system is doing in order to 
decrease the time between crashes? For example, if there is a regular 
task scheduled for, say every minute, drop that down to every 30s and 
see if it crashes after 4 hours. Every 5 sec? etc...

This could be done on a system-wide level (just to decrease the 
turn-around) or on a task level (to identify the task at fault).

Is it a consistent 8 hours? Or is that an average based on some 
probability of two inter-related events happening? Can you increase the 
number of tasks running in parallel to accelerate the crash?

2. Divide and conquer. Is it possible to disable certain tasks? Does it 
still crash when these tasks are not running? Does the system crash even 
with only a single idle task (and nothing else) running?

Your ideal situation is having (a) a configuration that doesn't crash 
(however cut-down that is) as well as (b) a configuration that crashes 
after a few mins. Then you can narrow in on the problem from there.

Regards,

-- 
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266

Reply by Peter ● March 23, 2006

"ssubbarayan" <ssubba@gmail.com> wrote in message 
news:1143086470.110143.320060@i39g2000cwa.googlegroups.com...
> Gurus,
>      I am encountering a strange situation in one of our consumer
> electronics embedded product.
> The product runs on prorietory RTOS and a custom processor supplied by
> ST Micro electronics.Now what happens is,the box runs for around 8
> hours with out any problem.After 8 hours it crashes.I am clueless on
> how to debug it.The debugger window throws error
> "task out of scope,stack frame cannot be set!".
> I suspected this to be a stack over flow problem for the tasks running
> and increased the stack size.

Even if the stack has overflowed, you should be able to look at the end of the
stack. You'll be able to find the last few values that were placed there.
The return addresses will tell you where these calls came from. This might
give you a clue as to what was happening just before it crashed.
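If the debugger cannot unwind the stack, the raw stack memory can still be scanned for values that look like return addresses. The sketch below assumes the code region bounds are known (on a real target they would come from the linker map file); the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan `words` entries of raw stack memory and collect every value
 * that falls inside the code region [text_start, text_end). These are
 * only candidate return addresses: ordinary data can land in the same
 * range by coincidence. Returns the number of candidates stored. */
size_t find_code_addresses(const uintptr_t *stack, size_t words,
                           uintptr_t text_start, uintptr_t text_end,
                           uintptr_t *out, size_t out_max)
{
    assert(stack != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < words && found < out_max; i++) {
        if (stack[i] >= text_start && stack[i] < text_end)
            out[found++] = stack[i];
    }
    return found;
}
```

Looking up each candidate in the map file gives a rough, unverified call chain for the moments before the crash.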

Peter

Reply by manja ● March 23, 2006

Hello,

Really very challenging and interesting problem......!!!!!

As people have already mentioned, a proper code review and some
brainstorming with everyone on the team will help a lot in
identifying the problem area and in reproducing the problem sooner (say,
after 10 minutes).

Some more points to add.

Instead of printfs, use while(1); or assert() on all the conditions
that you have assumed are impossible (doubtful conditions).

Use your own debug versions of malloc and free: write a signature
in each malloc'ed block and check for this signature when freeing
(basically to check for memory leaks).
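A rough sketch of such debug wrappers is shown below. The names, signature values, and header layout are invented for illustration, and the code wraps the standard malloc and free; a proprietary RTOS allocator would need the same idea applied to its own calls. Strictly speaking, the signatures catch overruns and wild pointers, while the live-block counter is what exposes leaks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DBG_HEAD  0xDEADBEEFu   /* written in front of every block */
#define DBG_TAIL  0xCAFEF00Du   /* written right behind every block */
#define DBG_FREED 0xFEEEFEEEu   /* marks a header as released */

typedef union {
    struct { size_t size; uint32_t magic; } info;
    max_align_t align;          /* keeps user data suitably aligned (C11) */
} dbg_header;

long dbg_live_blocks;           /* allocated but not yet freed */

void *dbg_malloc(size_t size)
{
    uint32_t tail = DBG_TAIL;
    if (size > SIZE_MAX - sizeof(dbg_header) - sizeof tail)
        return NULL;
    dbg_header *h = malloc(sizeof *h + size + sizeof tail);
    if (h == NULL)
        return NULL;
    h->info.size = size;
    h->info.magic = DBG_HEAD;
    memcpy((unsigned char *)(h + 1) + size, &tail, sizeof tail);
    dbg_live_blocks++;
    return h + 1;
}

/* Returns 0 on success, -1 if the head signature is wrong (wild
 * pointer or underrun), -2 if the tail signature was overwritten
 * (overrun). On failure the block is left in place for inspection. */
int dbg_free(void *p)
{
    uint32_t tail;
    if (p == NULL)
        return 0;
    dbg_header *h = (dbg_header *)p - 1;
    if (h->info.magic != DBG_HEAD)
        return -1;
    memcpy(&tail, (unsigned char *)p + h->info.size, sizeof tail);
    if (tail != DBG_TAIL)
        return -2;
    h->info.magic = DBG_FREED;
    dbg_live_blocks--;
    free(h);
    return 0;
}

/* Variant that stops on the spot, so the debugger shows the caller. */
void dbg_free_or_halt(void *p)
{
    int rc = dbg_free(p);
    assert(rc == 0);
    (void)rc;
}
```

Watching dbg_live_blocks over the 8-hour run shows whether the number of outstanding allocations keeps climbing, which would point to a leak.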

Above all, I suggest you read the article "Proactive Debugging" by
Jack Ganssle. It is a really good article and you will definitely get
some pointers from it.

Please keep us updated about the status of the problem, as it will be a
very good lesson for all.

Best Regards,
Venkatesh Manja.

ssubbarayan wrote:
> Gurus,
>       I am encountering a strange situation in one of our consumer
> electronics embedded product.
> The product runs on prorietory RTOS and a custom processor supplied by
> ST Micro electronics.Now what happens is,the box runs for around 8
> hours with out any problem.After 8 hours it crashes.I am clueless on
> how to debug it.The debugger window throws error
> "task out of scope,stack frame cannot be set!".
> I suspected this to be a stack over flow problem for the tasks running
> and increased the stack size.Still that did not help me.
> I tried to put trace messages and print it on console.Problem is, due
> to time required to print to console,my application does not come up
> even properly.I am not able to use the debugger too because of this out
> of scope error.Also irony is this problem appears only after 8
> hours,which makes me wait for another 8 hours to get the issue.
> I am wondering are there any good approaches you experts would have
> used to solve such problems?
> It will be helpful if some one can point me in right direction.I am
> looking for some debugging tips which can help me to sort out this
> issue.
> Looking farward for your replies and advanced thanks for the same,
> Regards,
> s.subbarayan

Reply by Ken Asbury ● March 23, 2006

ssubbarayan wrote:
> Gurus,
>       I am encountering a strange situation in one of our consumer
> electronics embedded product.
<snip>

> I suspected this to be a stack over flow problem for the tasks running
> and increased the stack size.Still that did not help me.

<snip>
> Regards,
> s.subbarayan

If it were stack leakage then increasing stack size would likely
have increased the time to failure.

You might look at separating code space from data space by a
greater distance to see if that changes the timing.  Buffer overrun
and the like...

Also, changing the order of global storage might give you some
insight.

Regards,
Ken Asbury

Reply by Ulf Samuelsson ● March 23, 2006

"ssubbarayan" <ssubba@gmail.com> skrev i meddelandet 
news:1143086470.110143.320060@i39g2000cwa.googlegroups.com...
> Gurus,
>      I am encountering a strange situation in one of our consumer
> electronics embedded product.
> The product runs on prorietory RTOS and a custom processor supplied by
> ST Micro electronics.Now what happens is,the box runs for around 8
> hours with out any problem.After 8 hours it crashes.I am clueless on
> how to debug it.The debugger window throws error
> "task out of scope,stack frame cannot be set!".
> I suspected this to be a stack over flow problem for the tasks running
> and increased the stack size.Still that did not help me.
> I tried to put trace messages and print it on console.Problem is, due
> to time required to print to console,my application does not come up
> even properly.I am not able to use the debugger too because of this out
> of scope error.Also irony is this problem appears only after 8
> hours,which makes me wait for another 8 hours to get the issue.
> I am wondering are there any good approaches you experts would have
> used to solve such problems?
> It will be helpful if some one can point me in right direction.I am
> looking for some debugging tips which can help me to sort out this
> issue.
> Looking farward for your replies and advanced thanks for the same,
> Regards,
> s.subbarayan
>

You can also use a faster interface (SPI) to another device, which buffers
up the data and sends it on using the slower UART.
You can try a software trace if you have enough RAM in the system.
This would be a circular buffer, so you always have access to the last
events.
Maybe you should save the stack pointer every time you do a context switch.
Your message indicates to me that there is a problem with one of the tasks,
and it would be good to know WHICH task.
Since the O/S is proprietary, it should be possible to change the printout
to have more info.
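The software trace could look roughly like the sketch below: a fixed-size circular buffer in RAM that records the task, an event code, and the stack pointer, for example from the context-switch hook. All names and field choices are assumptions for illustration, and the code is not interrupt-safe as written; on a real system the record function would need interrupts masked or a single writer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_ENTRIES 256u   /* power of two, so the index wraps with a mask */

typedef struct {
    uint32_t  timestamp;     /* e.g. RTOS tick count */
    uint16_t  task_id;       /* which task was running */
    uint16_t  event;         /* application-defined event code */
    uintptr_t sp;            /* stack pointer at the time of the event */
} trace_entry;

typedef struct {
    uint32_t    head;        /* slot that will be written next */
    uint32_t    count;       /* valid entries, saturates at TRACE_ENTRIES */
    trace_entry entry[TRACE_ENTRIES];
} trace_log;

/* Store one record, overwriting the oldest one when the buffer is full. */
void trace_record(trace_log *tlog, uint32_t now, uint16_t task_id,
                  uint16_t event, uintptr_t sp)
{
    assert(tlog != NULL);
    trace_entry *e = &tlog->entry[tlog->head];
    e->timestamp = now;
    e->task_id = task_id;
    e->event = event;
    e->sp = sp;
    tlog->head = (tlog->head + 1u) & (TRACE_ENTRIES - 1u);
    if (tlog->count < TRACE_ENTRIES)
        tlog->count++;
}

/* Look backwards in time: age 0 is the newest record, age 1 the one
 * before it, and so on. Returns NULL if no record that old exists. */
const trace_entry *trace_recent(const trace_log *tlog, uint32_t age)
{
    assert(tlog != NULL);
    if (age >= tlog->count)
        return NULL;
    return &tlog->entry[(tlog->head - 1u - age) & (TRACE_ENTRIES - 1u)];
}
```

Because nothing is printed at run time, the cost per event is a handful of stores. After the crash the buffer can be dumped with the debugger, or placed in a RAM region that survives reset, to see which task was running last and where its stack pointer was.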

-- 
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB

Reply by tbro...@hifn.com ● March 23, 2006

> It will be helpful if some one can point me in right direction.I am
> looking for some debugging tips which can help me to sort out this
> issue.

Well, here's a scattering of suggestions. Take the ones that fit and
trash the rest:

1 - Above all, keep trying to divide the problem in two with
experiments that have two nearly equally likely outcomes. If you can do
this, you will get through the problem fairly quickly no matter where
the issue lies.

2 - If it's a proprietary RTOS, you have the source, right? Track down
that error message and become an expert on what causes it. Oh, the
message is from the debugger, isn't it? Still, find out what you can
from the vendor docs / Google.

3 - The message sounds like a bad task pointer is being dereferenced.
Since you have control of the OS source code, you can add a signature
value to the task structure that you set when the task is created.
During debug, have the OS check for the correct signature value each
time it gets ready to dereference a pointer to a task structure, and
then pop up the debugger if the check fails.

4 - Check the state of your heap. Is it close to full? If so, track it
down as a heap issue. Start looking for leaks, determine how much
memory you think should be used and determine why there is a difference
(or if there isn't one, change the design.)

5 - Can you get 3 or 4 systems up so you can get several experiments in
per day?

6 - Can you get access to the failing system at night so you can get
close to 3 experiments in per day, morning, evening, and night?

7 - If you have values that roll over such as indices into ring
buffers, initialize them at startup to be close to rollover so you find
problems early on.

8 - You can define a structure which describes events in your code, and
make a ring buffer full of them somewhere you can find it after the
crash. Then log each interesting section of the code to see what is
happening shortly before the crash.

9 - How reliable is that 8 hours? Can you find a place to stick a
breakpoint before the crash?

10 - If the hardware is not yet reliable and you just can't explain the
problem in terms of software processes, check that the power, clock,
and reset lines are all clean and stable before you spend too much
time pulling your hair out. (I don't suspect this is the issue, though,
given the repeated 8 hour time span.)

11 - Is something unusual happening around the time of the failure? If
so, it's probably not a coincidence, and you can look for how that
occurrence is handled in the code (and/or hardware).

12 - Put a logic analyzer on the processor address bus, record the
addresses, and find a way to trigger on the fault. This should give you
some idea of where you are in the code at the time of failure. Of
course, the instruction stream is almost certainly cached, so you may
need to be more clever about what you watch, but the hardware events
should be closely related to where you are in the code, right?

13 - As previous writers have said, identify the critical parameters of
your system and double some of them and see if the frequency rises.
Packet size, number of sessions open, etc.
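Suggestion 7 can be sketched as follows for a free-running 32-bit tick counter. The helper names are hypothetical; the point is that starting the counter a few seconds before wrap-around makes any comparison that is not wrap-safe fail within seconds of boot instead of hours or days later.

```c
#include <assert.h>
#include <stdint.h>

/* Initial value for a 32-bit tick counter so that it wraps to zero
 * `seconds_before_wrap` seconds after start-up. */
uint32_t tick_start_value(uint32_t tick_hz, uint32_t seconds_before_wrap)
{
    assert(tick_hz > 0u && seconds_before_wrap <= UINT32_MAX / tick_hz);
    return UINT32_MAX - tick_hz * seconds_before_wrap + 1u;
}

/* Wrap-safe elapsed time: unsigned subtraction is modulo 2^32. */
uint32_t ticks_elapsed(uint32_t now, uint32_t since)
{
    return now - since;
}

/* Wrap-safe deadline test, valid while deadlines are less than 2^31
 * ticks in the future. */
int deadline_passed(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < 0x80000000u;
}

/* Naive test, shown for contrast: gives the wrong answer whenever the
 * deadline lies on the far side of the wrap. */
int deadline_passed_naive(uint32_t now, uint32_t deadline)
{
    return now >= deadline;
}
```

With the counter started at tick_start_value(1000, 5), a 10-second timeout set at boot has a deadline numerically smaller than the current tick, so the naive comparison reports it as expired immediately. That is the kind of bug this trick is meant to flush out early.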

Hope something here helps. Personally, I'd go for #3 first.
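A minimal sketch of suggestion 3, assuming a made-up task control block: the real RTOS structure, its field names, and the best place to hook the check will differ. Here the fault hook only counts failures so the example stays self-contained; on the target it would hit a breakpoint or halt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TASK_MAGIC 0x5441534Bu   /* ASCII "TASK" */

/* Hypothetical task control block with a signature as first member. */
typedef struct {
    uint32_t magic;
    void    *stack_ptr;
    int      priority;
} tcb;

unsigned task_fault_count;       /* number of failed signature checks */

/* Called when a check fails. On a real target, put a breakpoint or a
 * halt here so the debugger stops while the evidence is still fresh. */
void task_fault_hook(const tcb *t)
{
    (void)t;
    task_fault_count++;
}

void task_init(tcb *t, void *stack_ptr, int priority)
{
    assert(t != NULL);
    t->magic = TASK_MAGIC;
    t->stack_ptr = stack_ptr;
    t->priority = priority;
}

/* Clear the signature so stale pointers to a deleted task are caught. */
void task_destroy(tcb *t)
{
    assert(t != NULL);
    t->magic = 0u;
}

int task_is_valid(const tcb *t)
{
    return t != NULL && t->magic == TASK_MAGIC;
}

/* Use before every dereference of a task pointer in debug builds. */
#define TASK_CHECK(t) \
    do { if (!task_is_valid(t)) task_fault_hook(t); } while (0)
```

Placing TASK_CHECK() in the scheduler and in every RTOS call that takes a task pointer turns a crash eight hours later into a stop at the first use of a corrupted or stale task structure.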

    - Tim.
