EmbeddedRelated.com

Debugging crashes which appear after a long time

Started by ssubbarayan March 23, 2006
Gurus,
I am encountering a strange situation in one of our consumer
electronics embedded products.
The product runs on a proprietary RTOS and a custom processor supplied by
STMicroelectronics. Now what happens is, the box runs for around 8
hours without any problem. After 8 hours it crashes. I am clueless on
how to debug it. The debugger window throws the error
"task out of scope,stack frame cannot be set!".
I suspected this to be a stack overflow problem in the running tasks
and increased the stack size. Still, that did not help me.
I tried to put in trace messages and print them on the console. The problem is, due
to the time required to print to the console, my application does not even
come up properly. I am not able to use the debugger either, because of this out
of scope error. The irony is that this problem appears only after 8
hours, which makes me wait another 8 hours to reproduce the issue.
I am wondering whether there are any good approaches you experts have
used to solve such problems.
It would be helpful if someone could point me in the right direction. I am
looking for some debugging tips which can help me sort out this
issue.
Looking forward to your replies, and thanks in advance,
Regards,
s.subbarayan

ssubbarayan wrote:
> Now what happens is, the box runs for around 8
> hours without any problem. After 8 hours it crashes.

If it happens consistently after a period of time, that is a good sign :)
Things to check:
- memory leak?
- a hardware counter overflows?
- a variable used as a counter overflows?
- some free-running hardware timer generates an interrupt that is not handled?
- the building air conditioning system starts/stops, generating a surge on the mains line?

Would it crash if you ran another (idle) task instead of yours? If yes, check the RTOS.

HTH,
Vadim
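To put numbers on the counter-overflow items above, here is a minimal sketch in C. It assumes a 32-bit free-running tick counter; the function names are made up for illustration and are not from the poster's RTOS. The first helper estimates how long a counter of a given width lasts at a given tick rate (for example, a 16-bit counter incremented twice a second wraps after 32768 s, roughly 9.1 hours). The other two contrast a naive deadline comparison with a wrap-safe one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seconds until an unsigned counter that is `bits` wide wraps to zero
 * when it counts `ticks_per_second` ticks per second. */
double seconds_to_wrap(unsigned bits, double ticks_per_second)
{
    assert(bits >= 1 && bits <= 64 && ticks_per_second > 0.0);
    double counts = 1.0;
    for (unsigned i = 0; i < bits; i++) {
        counts *= 2.0;              /* counts ends up as 2^bits */
    }
    return counts / ticks_per_second;
}

/* Naive comparison: gives the wrong answer once `now` has wrapped
 * past zero while `deadline` was computed before the wrap. */
bool deadline_passed_naive(uint32_t now, uint32_t deadline)
{
    return now >= deadline;
}

/* Wrap-safe comparison: unsigned subtraction is modulo 2^32, so the
 * difference stays small when `now` is at or just past `deadline`,
 * even across a rollover. Valid while the two values are less than
 * 2^31 ticks apart. */
bool deadline_passed(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < 0x80000000u;
}
```

If the time to failure matches what `seconds_to_wrap` gives for one of the counters or timers in the system, that counter is worth a close look. This is only a way to generate a theory, not a diagnosis.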
> I suspected this to be a stack overflow problem in the running tasks
> and increased the stack size. Still, that did not help me.

Try stuffing a known pattern into the RAM used for the stacks. After it crashes, check those areas to see whether the pattern has been overwritten.
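A minimal sketch of that pattern check in C, assuming each task stack is a plain array of 32-bit words and that the stack grows downward from the highest address. The fill value and function names are illustrative, not part of any particular RTOS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* arbitrary value, unlikely as real data */

/* Fill a task's stack area with the pattern before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count the words at the low end that still hold the pattern.
 * Assumes the stack grows down from stack[words - 1] towards stack[0],
 * so these are the words the task has never touched. A result of 0
 * means the task reached (or ran past) the end of its stack. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Call `stack_paint` when each task is created, then inspect `stack_unused_words` from the debugger or a low-priority monitor task after the crash. A stack whose count has dropped to zero is the one to suspect.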
ssubbarayan wrote:

> I am wondering whether there are any good approaches you experts have
> used to solve such problems.
> It would be helpful if someone could point me in the right direction. I am
> looking for some debugging tips which can help me sort out this
> issue.

Do you have any theories at all? If so, you need to devise a configuration that will make it crash more often. I had a problem years ago that occurred once every few *weeks*. Fortunately a code review threw up one theory reasonably early, and I spent the next few weeks proving it.

1. Can you accelerate the tasks that the system is doing in order to decrease the time between crashes? For example, if there is a regular task scheduled for, say, every minute, drop that down to every 30 s and see if it crashes after 4 hours. Every 5 s? And so on. This could be done on a system-wide level (just to shorten the turn-around) or on a task level (to identify the task at fault). Is it a consistent 8 hours? Or is that an average based on some probability of two inter-related events happening? Can you increase the number of tasks running in parallel to accelerate the crash?

2. Divide and conquer. Is it possible to disable certain tasks? Does it still crash when these tasks are not running? Does the system crash even with only a single idle task (and nothing else) running?

Your ideal situation is having (a) a configuration that doesn't crash (however cut-down that is) as well as (b) a configuration that crashes after a few minutes. Then you can narrow in on the problem from there.

Regards,
--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266

"ssubbarayan" <ssubba@gmail.com> wrote in message 
news:1143086470.110143.320060@i39g2000cwa.googlegroups.com...
> Gurus,
> I am encountering a strange situation in one of our consumer
> electronics embedded products.
> The product runs on a proprietary RTOS and a custom processor supplied by
> STMicroelectronics. Now what happens is, the box runs for around 8
> hours without any problem. After 8 hours it crashes. I am clueless on
> how to debug it. The debugger window throws the error
> "task out of scope,stack frame cannot be set!".
> I suspected this to be a stack overflow problem in the running tasks
> and increased the stack size.

Even if the stack has overflowed, you should be able to look at the end of the stack. You'll be able to find the last few values that were placed there. The return addresses will tell you where these calls came from. This might give you a clue as to what was happening just before it crashed.

Peter

Hello,

A really challenging and interesting problem!

As people have already mentioned, a proper code review and some
brainstorming with everyone on the team will help a lot in
identifying the problem area and in reproducing the problem sooner (say,
after 10 minutes).

Some more points to add:

Instead of printf calls, use while(1); or assert() on all the conditions
that you have assumed to be impossible (the doubtful conditions).

Use your own debug versions of malloc and free: write a signature
into each malloc'ed block and check for that signature when freeing
(basically to check for memory leaks).

Above all, I suggest you read the article "Proactive Debugging" by
Jack Ganssle. It is a really good article and you will definitely get
some ideas from it.

Please keep us updated on the status of the problem, as it will be
good learning for all.

Best Regards,
Venkatesh Manja.
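A minimal sketch of such debug wrappers in C, under stated assumptions: the names `dbg_malloc`, `dbg_free` and friends are made up, the wrappers sit on top of the standard `malloc`/`free` rather than the proprietary RTOS allocator, and they are not thread-safe as written. A signature in front of the block catches frees of bad pointers, a signature behind it catches buffer overruns, and a live-block counter gives a rough leak indicator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DBG_HEAD 0xDEADBEEFu   /* signature stored in front of the user data */
#define DBG_TAIL 0xCAFEF00Du   /* signature stored right behind the user data */

enum { DBG_OK = 0, DBG_BAD_HEAD = -1, DBG_BAD_TAIL = -2 };

typedef struct {
    uint32_t magic;            /* DBG_HEAD while the block is live */
    size_t   size;             /* size requested by the caller */
} dbg_header_t;

static long dbg_live_blocks;   /* allocations not yet freed */

void *dbg_malloc(size_t size)
{
    dbg_header_t *h = malloc(sizeof *h + size + sizeof(uint32_t));
    if (h == NULL) {
        return NULL;
    }
    h->magic = DBG_HEAD;
    h->size = size;
    uint32_t tail = DBG_TAIL;
    /* memcpy because the tail may not be 4-byte aligned */
    memcpy((unsigned char *)(h + 1) + size, &tail, sizeof tail);
    dbg_live_blocks++;
    return h + 1;
}

/* Check both signatures of a block returned by dbg_malloc. */
int dbg_check(const void *p)
{
    const dbg_header_t *h = (const dbg_header_t *)p - 1;
    uint32_t tail;
    if (h->magic != DBG_HEAD) {
        return DBG_BAD_HEAD;   /* not our block, already freed, or underrun */
    }
    memcpy(&tail, (const unsigned char *)p + h->size, sizeof tail);
    return tail == DBG_TAIL ? DBG_OK : DBG_BAD_TAIL;
}

/* Free only if the block is intact; otherwise report and keep it
 * so it can be inspected in the debugger. */
int dbg_free(void *p)
{
    if (p == NULL) {
        return DBG_OK;
    }
    int rc = dbg_check(p);
    if (rc == DBG_OK) {
        dbg_header_t *h = (dbg_header_t *)p - 1;
        h->magic = 0;          /* best-effort detection of a later double free */
        dbg_live_blocks--;
        free(h);
    }
    return rc;
}

long dbg_live(void)
{
    return dbg_live_blocks;
}
```

On target, a non-zero return from `dbg_free` is a natural place for the `while(1);` or `assert()` suggested above, and a `dbg_live()` value that climbs steadily over the hours points at a leak.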


ssubbarayan wrote:
> Gurus,
> I am encountering a strange situation in one of our consumer
> electronics embedded products.
<snip>
> I suspected this to be a stack overflow problem in the running tasks
> and increased the stack size. Still, that did not help me.
<snip>
> Regards,
> s.subbarayan

If it were stack leakage, then increasing the stack size would likely have increased the time to failure.

You might look at separating code space from data space by a greater distance to see if that changes the timing. Buffer overrun and the like...

Also, changing the order of global storage might give you some insight.

Regards,
Ken Asbury

"ssubbarayan" <ssubba@gmail.com> wrote in message
news:1143086470.110143.320060@i39g2000cwa.googlegroups.com...
> Gurus,
> I am encountering a strange situation in one of our consumer
> electronics embedded products.
> The product runs on a proprietary RTOS and a custom processor supplied by
> STMicroelectronics. Now what happens is, the box runs for around 8
> hours without any problem. After 8 hours it crashes. I am clueless on
> how to debug it. The debugger window throws the error
> "task out of scope,stack frame cannot be set!".
> I suspected this to be a stack overflow problem in the running tasks
> and increased the stack size. Still, that did not help me.
> I tried to put in trace messages and print them on the console. The problem is, due
> to the time required to print to the console, my application does not even
> come up properly. I am not able to use the debugger either, because of this out
> of scope error. The irony is that this problem appears only after 8
> hours, which makes me wait another 8 hours to reproduce the issue.
> I am wondering whether there are any good approaches you experts have
> used to solve such problems.
> It would be helpful if someone could point me in the right direction. I am
> looking for some debugging tips which can help me sort out this
> issue.
> Looking forward to your replies, and thanks in advance,
> Regards,
> s.subbarayan

You can also use a faster interface (SPI) to another device, which buffers up the data and sends it on over the slower UART.

You can try a software trace if you have enough RAM in the system. This would be a circular buffer, so you always have access to the last events.

Maybe you should save the stack pointer every time you do a context switch. Your message indicates to me that there is a problem with one of the tasks, and it would be good to know WHICH task.

Since the O/S is proprietary, it should be possible to change the printout to include more info.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion, which may or may not be shared by my employer Atmel Nordic AB

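A minimal sketch of such a software trace in C. The structure layout, the names, and the idea of passing the saved stack pointer in `arg` at each context switch are assumptions for illustration. As written it is not interrupt-safe, so on a real system `trace_log` would need to run with interrupts masked or from a single context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 64u        /* power of two, so the index can be masked */

typedef struct {
    uint32_t tick;             /* timestamp from the system tick */
    uint16_t task_id;          /* which task was running */
    uint16_t event;            /* application-defined event code */
    uint32_t arg;              /* e.g. the saved stack pointer */
} trace_entry_t;

typedef struct {
    uint32_t      head;        /* total number of entries ever written */
    trace_entry_t entry[TRACE_DEPTH];
} trace_buf_t;

/* Global so it can be located in RAM from the debugger after a crash. */
trace_buf_t g_trace;

/* Record one event, overwriting the oldest entry when the buffer is full. */
void trace_log(uint32_t tick, uint16_t task_id, uint16_t event, uint32_t arg)
{
    trace_entry_t *e = &g_trace.entry[g_trace.head & (TRACE_DEPTH - 1u)];
    e->tick = tick;
    e->task_id = task_id;
    e->event = event;
    e->arg = arg;
    g_trace.head++;
}

/* Return the entry logged `age` events ago (0 = most recent),
 * or NULL if that entry was never written or has been overwritten. */
const trace_entry_t *trace_recent(uint32_t age)
{
    if (age >= TRACE_DEPTH || age >= g_trace.head) {
        return NULL;
    }
    return &g_trace.entry[(g_trace.head - 1u - age) & (TRACE_DEPTH - 1u)];
}
```

Writing four words to RAM costs far less than a console print, so this is much less likely to disturb the timing. After the crash, dump `g_trace` from the debugger and read backwards from `head` to see which task ran last and where its stack pointer was.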
> It would be helpful if someone could point me in the right direction. I am
> looking for some debugging tips which can help me sort out this
> issue.

Well, here's a scattering of suggestions. Take the ones that fit and trash the rest:

1 - Above all, keep trying to divide the problem in two with experiments that have two nearly equally likely outcomes. If you can do this, you will get through the problem fairly quickly no matter where the issue lies.

2 - If it's a proprietary RTOS, you have the source, right? Track down that error message and become an expert on what causes it. Oh, the message is from the debugger, isn't it? Still, find out what you can from the vendor docs / Google.

3 - The message sounds like a bad task pointer is being dereferenced. Since you have control of the OS source code, you can add a signature value to the task structure that you set when the task is created. During debug, have the OS check for the correct signature value each time it gets ready to dereference a pointer to a task structure, and then pop up the debugger if the check fails.

4 - Check the state of your heap. Is it close to full? If so, track it down as a heap issue. Start looking for leaks, determine how much memory you think should be used, and determine why there is a difference (or if there isn't one, change the design).

5 - Can you get 3 or 4 systems up so you can get several experiments in per day?

6 - Can you get access to the failing system at night so you can get close to 3 experiments in per day: morning, evening, and night?

7 - If you have values that roll over, such as indices into ring buffers, initialize them at startup to be close to rollover so you find problems early on.

8 - You can define a structure which describes events in your code, and make a ring buffer full of them somewhere you can find it after the crash. Then log each interesting section of the code to see what is happening shortly before the crash.

9 - How reliable is that 8 hours? Can you find a place to stick a breakpoint before the crash?

10 - If the hardware is not yet reliable and you just can't explain the problem in terms of software processes, check that the power, clock, and reset lines are all clean and stable before you spend too much time pulling your hair out. (I don't suspect this is the issue, though, given the repeated 8 hour time span.)

11 - Is something unusual happening around the time of the failure? If so, it's probably not a coincidence, and you can look for how that occurrence is handled in the code (and/or hardware).

12 - Put a logic analyzer on the processor address bus, record the addresses, and find a way to trigger on the fault. This should give you some idea of where you are in the code at the time of failure. Of course, the instruction stream is almost certainly cached, so you may need to be more clever about what you watch, but the hardware events should be closely related to where you are in the code, right?

13 - As previous writers have said, identify the critical parameters of your system, double some of them, and see if the frequency rises. Packet size, number of sessions open, etc.

Hope something here helps. Personally, I'd go for #3 first.

- Tim.
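Suggestion #3 might look something like the following minimal C sketch. The `tcb_t` layout, the magic value, and the `tcb_fault` hook are all hypothetical; the real task structure belongs to the proprietary RTOS and will look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCB_MAGIC 0x5441534Bu  /* ASCII "TASK", arbitrary signature */

/* Hypothetical task control block with a signature as its first field. */
typedef struct {
    uint32_t magic;            /* TCB_MAGIC for a valid task */
    void    *stack_ptr;        /* saved stack pointer */
    uint8_t  priority;
} tcb_t;

unsigned g_tcb_faults;         /* number of failed checks, visible in the debugger */

/* Set the signature when the task is created. */
void tcb_init(tcb_t *t, void *stack_ptr, uint8_t priority)
{
    t->magic = TCB_MAGIC;
    t->stack_ptr = stack_ptr;
    t->priority = priority;
}

bool tcb_valid(const tcb_t *t)
{
    return t != NULL && t->magic == TCB_MAGIC;
}

/* Hypothetical fault hook. On target this is where a breakpoint
 * instruction or while(1); would go so the debugger stops here. */
void tcb_fault(const tcb_t *t)
{
    (void)t;
    g_tcb_faults++;
}

/* Use in the OS wherever a task pointer is about to be dereferenced.
 * Returns the pointer if it looks valid, NULL after reporting otherwise. */
tcb_t *tcb_checked(tcb_t *t)
{
    if (!tcb_valid(t)) {
        tcb_fault(t);
        return NULL;
    }
    return t;
}
```

The point of the check is to stop at the first use of a bad task pointer, while the call stack still shows who passed it in, instead of at the later crash.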
