EmbeddedRelated.com
Forums

Unexplained Hang During Boot

Started by Unknown June 28, 2006
I am experiencing a very bizarre problem with vxWorks and I am hoping
that someone might be able to offer some suggestions on where to start
looking to determine the root of the problem.

VxWorks is being used on a Synergy Microsystems VME SBC which is PPC
based.  The problem seems to arise at random times after rebuilding the
OS image.  For instance, by commenting out a single 'printf' statement
such as printf("Message Received\n"); in an application-level piece of
code that is not even invoked; and rebuilding the image, the image can
hang while booting (early in the boot procedure).  Uncomment this
'printf' statement, rebuild the image, and the OS will boot without
error. Note that this routine is not called at any time during the boot
procedure so the code containing that printf is never even executed.

This problem has been experienced by multiple developers on different
modules.  I am not sure if this is a hardware, or a software type of
problem.  Can anyone think of any reason why something as non-intrusive
as commenting out a printf statement, in a function that is never even
invoked, would cause the OS to hang during boot?

The printf statement is only adding a handful of bytes to the resultant
image and larger images than the ones that fail have been booted
successfully.

Similar hangs have been produced by changing array sizes in uncalled
routines, etc., (i.e., add a few more bytes to an array in an uncalled
function and the image hangs during boot, add a few more bytes and the
image loads fine).

On 28 Jun 2006, eon_blue_80@verizon.net wrote:
> I am experiencing a very bizarre problem with vxWorks and I am
> hoping that someone might be able to offer some suggestions on where
> to start looking to determine the root of the problem.

[snip]

> The printf statement is only adding a handful of bytes to the
> resultant image and larger images than the ones that fail have been
> booted successfully.

This sounds like a cache problem. The "printf" itself is unrelated to the problem. It just changes the image size at the "right" place. You could add a ".bytes 7" or something in the code section and the same thing would result.

At some point in the boot sequence, there may be an alias between the data and code caches. It could be when the MMU is turned on. The address space will change and code must often jump in a very specific sequence. It may be a conflict with a device. For instance, an "eieio" instruction may be necessary in some cases, but due to code section alignment, the code executes at different times and the "eieio" becomes necessary or unnecessary depending on the build.

It is very good that you are trying to hunt this down. I've known several "senior" people who have let this type of problem go on forever.

You can toggle an LED or a general-purpose I/O watched with a scope, or you can use some polled console output to provide checkpoints in the boot sequence to see where the hang occurs.

The important point is that the "printf" has nothing to do with the problem besides making the code move around. You can verify this by inserting different dummy routines with different lengths (a cache line is typically 32 or 64 bytes). Observing a map file of the full image and knowing the location of these bytes can be helpful. For instance, if the code following this is an ethernet driver, then that may be helpful to know.

It could also be reading of garbage strings, code, or constant data. I have also seen one section of code round MMU rights and another read to the byte. Sometimes this rounding is wrong and a "bus error" happens due to memory not being sized right.

hth,
Bill Pringlemeir.

-- 
You have the right to remain silent -- so shut up!

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

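As a rough illustration of the map-file suggestion above, the following sketch checks whether a symbol taken from the map file crosses a cache-line boundary. The function names are made up for this example, and the 32-byte line size in the usage note is only the typical figure mentioned in the post, not a value known for this board.

```c
#include <stdint.h>
#include <stdbool.h>

/* Index of the cache line that contains addr. */
static uint32_t cache_line_index(uint32_t addr, uint32_t line_size)
{
    return addr / line_size;
}

/* True if the object occupying [addr, addr + size) crosses a
   cache-line boundary. A zero-sized object never straddles. */
static bool straddles_cache_line(uint32_t addr, uint32_t size,
                                 uint32_t line_size)
{
    if (size == 0) {
        return false;
    }
    return cache_line_index(addr, line_size) !=
           cache_line_index(addr + size - 1, line_size);
}
```

Feeding it the start address and size of the routines around the changed code, for a good and a bad build, may show whether a boot-time routine moves across a line boundary between the two images.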
eon_blue_80@verizon.net wrote:

> [snip]

Another possibility is that errant code is corrupting memory during the boot process. The most common case is the "wild pointer", where an uninitialized pointer is used to write data. Other possibilities would be over-running the reserved stack area or using pointers to buffers that have been returned to the buffer pool and re-used. I have also seen incorrect function prototypes cause this type of problem. If you are using vector tables in RAM, walking on them will cause this type of problem too.

The way I would attempt to solve this problem is with a logic analyzer. Start out by finding where the code hangs. Then see if the instruction sequence to get there took any unexplainable jumps. See whether the values at the departure point of the unexplainable sequence match the expected values for that address. If they don't match the expected values, use writes to those locations to trigger the logic analyzer and you should be able to locate the errant code. The departure from expected execution could also be caused by uninitialized or corrupted vectors in the vector table.

I am not familiar with the particular VME card you mentioned, but memory management hardware could protect you from a number of the things I described. Because it is a boot sequence problem, memory management hardware may not be operational at this point.

Another place to look would be the linker command file. Are all of the segments large enough and in non-overlapping regions of memory? The logic analyzer approach would lead you to this type of problem, but it could be a painful path that could be avoided by careful study.

Good Luck,
Bob

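The linker-file check suggested above can be done by hand, but a small host-side sketch makes the overlap test explicit. The struct, the function names and the region values in the usage note are all hypothetical; real start addresses and sizes would come from the linker command file or the map file.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* One memory region as described in a linker command file. */
struct mem_region {
    const char *name;
    uint32_t    start;
    uint32_t    size;
};

/* True if the two regions share at least one byte. The end addresses
   are computed in 64 bits so a region ending at 4 GiB does not wrap. */
static bool regions_overlap(const struct mem_region *a,
                            const struct mem_region *b)
{
    uint64_t a_end = (uint64_t)a->start + a->size;
    uint64_t b_end = (uint64_t)b->start + b->size;
    return a->start < b_end && b->start < a_end;
}

/* Index of the first region that overlaps a later one, or -1 if none. */
static int first_overlap(const struct mem_region *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (regions_overlap(&r[i], &r[j])) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

For example, a table with text at 0x00100000 (size 0x80000) followed by data at 0x00180000 is reported as clean, while growing text by a single byte makes `first_overlap` return the index of the text region.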
Bill Pringlemeir wrote:
> [snip]

Bill/Others,

Excellent and enlightening explanations from you all. We are facing a similar kind of issue with a proprietary STMicroelectronics board running a proprietary OS. Though the OS is different, I believe the problem is similar to the one being discussed here.

We faced a situation where just adding a printf inside one function, or just introducing a statement such as i=1 (though we did not use the variable 'i' anywhere else), made the feature work, and removing this statement made us lose the feature. We were trying hard to figure out the problem until one day we inspected the cache and disabled the data cache, and the feature worked just fine.

Now the question I would like answered is: what is the best way to figure out whether the problem is with cache memory? One more behaviour I have observed is that when we debug with a breakpoint the feature works fine, and when we use the binary production version of the same code it never works! This made debugging even more difficult. Could the cache have something to do with this difference between the debug and production versions?

I would like to avoid such problems in future, so it would be helpful if some of you enlightened ones could explain this to me.

I am also posting the query to comp.arch.embedded, as this will help me get input from a lot of experienced people. Pardon me in case I am wrong.

Looking forward to all your replies, and thanks in advance for the same,

Regards,
s.subbarayan

"ssubbarayan" <ssubba@gmail.com> wrote in message 
news:1151564578.472571.35220@b68g2000cwa.googlegroups.com...
> [snip]

So when you determine for certain that cache is the problem, what typically is the solution? Sprinkling cache flushes throughout the code? Or what?

Bo

> "ssubbarayan" <ssubba@gmail.com> wrote in message
>> Now the question I would like to understand it,whats the best way
>> to figure out whether the problem is with cache memory?One more
>> behaviour I have observed is when we debug with break point the
>> feature was working fine and when we use binary production version
>> of same code it never works! This made debugging further
>> difficult.Will the role of cache have something to do to bring this
>> difference between debug and production version?
>>
>> I would like to avoid such problems in future so it will be helpful
>> if some of you enlightened ones explain me this.

On 29 Jun 2006, bo@cephus.com wrote:
> So when you determine for certain that cache is the problem, what
> typically is the solution? Sprinkling cache flushes throughout the
> code? or what?

There are three possible issues. One is a direct effect of caching, another is alignment, and the other is timing.

If you have DMA, it will always retrieve from memory (i.e., SDRAM, flash). If your CPU is using a cache, it might be retrieving data from the memory or from the cache. For example, on one project we had a video capture device with a built-in convolution matrix that DMAed the results to the main processor. The code did not pay attention to the cache. After much debugging, the software developer for the imaging code decided that the HW was buggy. I examined this and noted that the buffer being used was fully cached. It started to work when we got memory that the MMU had marked as being non-cacheable.

Another example is on the PPC, where there is a "write buffer". The result can be that the PPC will not commit data to memory in the order that instructions are encountered, especially with a write-back cache. So, for instance, an AMD style NOR flash takes the command AA, 55, CMD. Without using the eieio command on the PowerPC, your flash driver will not work, as the commands can get written to the bus out of order. An MTD driver might loop forever trying to detect the flash type or an end of operation, etc. This might cause a hang during booting. Many HW devices use multiple writes to the same location.

Those are some examples of direct effects the cache might have on the order of memory accesses.

I had previously explained an alignment issue. It sounds like this is more like the OP's problem, as the code in question doesn't even execute. However, it can also be timing, as this will shift code and might change how the cache lines are fetched. If the compiler is aligning all code to a cache line, then this is not the problem.

The other instance is just timing. If code is relying on things being slow, and then a cache is enabled and speeds them up, an implicit delay may no longer be sufficient. Some slower/older HW devices must have fixed delays between accesses. It may also be that the code must be fast enough, like kicking a watchdog.

In all cases, the best thing you can do is insert some sort of trace, like toggling a general-purpose I/O connected to a scope. You can alter the timing to provide information or use multiple lines to encode some information. Multiple lines are better as they will reduce the amount of code needed. This mechanism suffers in that inserting the debug code can make the symptom appear or disappear.

An ICE, BDM, or JTAG debugger would also be useful. Let the system crash and then look at the stack and PC. Use HW breakpoints to work backwards from there. The problem with using a traditional debugger with breakpoints is that this alters the code flow (just like a printf). Hitting breakpoints will definitely affect what is in the cache.

Once you find the problem, you have to look at the structure to know what to do. For instance, it is often best to change the way a hardware device is accessed, such as making it non-cacheable or using a write-through cache. Sometimes it is not just the cache, but eieio instructions might be needed (or other PPC instructions like isync, sync, etc.).

Adding cache flushes may work. It would be much better to understand why it is crashing and then correct the problem. Just adding cache flushes might be equivalent to the printfs, i.e., it just shifts the code around.

fwiw,
Bill Pringlemeir.

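The multi-line GPIO trace described in the post above could be sketched roughly as follows. Everything named here is an assumption for illustration: on a real board `checkpoint_reg` would point at a memory-mapped GPIO or LED register from the board manual, whereas here it points at an ordinary variable so the sketch can run on a host. Four output lines are assumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO output register. */
static volatile uint32_t fake_gpio_reg;
static volatile uint32_t *checkpoint_reg = &fake_gpio_reg;

/* Drive a checkpoint number onto the low four GPIO lines, leaving the
   other bits of the register unchanged. Only the low four bits of id
   are used, so ids 0..15 can be told apart on a scope or analyzer. */
static void boot_checkpoint(uint32_t id)
{
    uint32_t value = *checkpoint_reg;
    value = (value & ~0xFu) | (id & 0xFu);
    *checkpoint_reg = value;
}
```

Calls such as `boot_checkpoint(1)`, `boot_checkpoint(2)` placed at successive stages of the boot code then leave the last reached stage visible on the pins after a hang. As the post warns, adding these calls also moves code around, so the symptom may shift.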
> eon_blue_80@verizon.net wrote:
> Another possibility is that errant code is corrupting memory during
> the boot process.

This is *unlikely*, as the OP noted that adding un-executed code would cause the problem. If code is directly corrupting memory, adding code that never runs would be unlikely to introduce the problem, especially if the added code makes no allocations and no writes to memory.

If simply switching the cache on or off makes the crash appear or disappear, I find it extremely unlikely that it is memory corruption. So there is a quick way to rule this out: disable/enable the cache with a crashing image. Often you can arrange the code so that the size is the same and just a constant has changed to disable/enable the cache.

fwiw,
Bill Pringlemeir.

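One way to arrange the "same size, only a constant changes" experiment from the post above is sketched here. The two `bsp_dcache_*` functions are hypothetical stand-ins for whatever cache routines the BSP or OS provides, and `g_use_dcache` is a made-up name; the point is only that both builds contain identical code and differ in one initialized data word.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real cache enable/disable routines. */
static bool data_cache_on;
static void bsp_dcache_enable(void)  { data_cache_on = true;  }
static void bsp_dcache_disable(void) { data_cache_on = false; }

/* The only thing edited between the two test builds. It lives in the
   data section, and volatile keeps the compiler from folding the test
   away, so the code layout should be the same in both images. */
volatile int g_use_dcache = 1;

/* Called once during early boot. */
static void boot_cache_setup(void)
{
    if (g_use_dcache) {
        bsp_dcache_enable();
    } else {
        bsp_dcache_disable();
    }
}
```

If a hanging image boots once `g_use_dcache` is set to 0, with the map file otherwise unchanged, that points toward the cache rather than memory corruption.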
Bill Pringlemeir wrote:

> Another example is on the PPC, there is a "write buffer". It can be
> the result that the PPC will not commit data to memory in the order
> that instructions are encountered; especially with a write-back cache.
> So, for instance, an AMD style NOR flash takes the command AA, 55,
> CMD. Without using the eieio command on the PowerPC, your flash
> driver will not work as the commands can get written to the bus out of
> order.

Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series, use a weakly ordered view of memory, so you have the potential for out-of-order reads but never writes. The write buffer does not affect write order, it just allows read-around-write.

Our AMCC PPC405 (strongly ordered view of memory) Flash driver started failing when we used it on the 440. After reading up on weakly ordered memory systems and ensuring that the Flash region was marked cached, guarded, we proceeded to put msync brackets around all reads from I/O devices with read side effects (real memory, or I/O devices without read side effects, don't need them), e.g.

    # uint16 read_flash_hword(uint16 *pFlashAddr);
    read_flash_hword:
        msync
        lhz   r3,0(r3)
        msync
        blr

The msync brackets ensured that the read could not be reordered with respect to any preceding or subsequent read or write. However, you are guaranteed to have multiple writes go in order to the device safely as long as your reads are protected as above. Furthermore, the PowerPC architecture is smart enough to execute RMW ops correctly on a given I/O address, e.g.

        lwz   r3,0(r4)
        ori   r3,r3,0x0040
        stw   r3,0(r4)

will result in the expected value being written to the address pointed to by r4. That is, the CPU will not perform the store before the load, due to register dependencies.

-- 
- Mark

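For drivers written in C rather than assembly, the bracketed read above might be wrapped as in this sketch. It assumes a GCC-style compiler with inline assembly, and it uses the `sync` mnemonic as a stand-in for the `msync` in the post; which barrier is actually required depends on the core, so treat the macro body as a placeholder. On a non-PowerPC host the macro is only a compiler barrier, which lets the function be exercised but proves nothing about bus ordering.

```c
#include <stdint.h>

#if defined(__powerpc__) || defined(__PPC__)
/* Full storage barrier on PowerPC (assumed adequate for this sketch). */
#define IO_BARRIER() __asm__ volatile ("sync" ::: "memory")
#else
/* Host fallback: stops compiler reordering only. */
#define IO_BARRIER() __asm__ volatile ("" ::: "memory")
#endif

/* Read a 16-bit I/O location with a barrier before and after, so the
   load is not reordered around neighbouring accesses. */
static inline uint16_t read_io_hword(const volatile uint16_t *addr)
{
    uint16_t value;

    IO_BARRIER();
    value = *addr;
    IO_BARRIER();
    return value;
}
```

The `volatile` qualifier keeps the compiler from caching or dropping the load; the barriers address the hardware side discussed in the post.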
On 29 Jun 2006, mrfirmware@gmail.com wrote:
> Bill Pringlemeir wrote:
>> Another example is on the PPC, there is a "write buffer". It can
>> be the result that the PPC will not commit data to memory in the
>> order that instructions are encountered; especially with a
>> write-back cache. So, for instance, an AMD style NOR flash takes
>> the command AA, 55, CMD. Without using the eieio command on the
>> PowerPC, your flash driver will not work as the commands can get
>> written to the bus out of order.

> Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
> use a weakly ordered view of memory so you have the potential for
> out of order reads but never writes. The write buffer does not
> affect write order, it just allows read-around-write.

I am absolutely sure of nothing. I might have the wrong terminology for the cache type. If multiple writes to the same location fit in the cache, only the last value will be written to the memory device. This makes perfect sense for SDRAM and is a very good optimization. Consider a frame pointer with some loop variables stored in one of these lines. Constantly committing the data from cache to SDRAM would seem to be a waste of time.

With AMD type flash, there are several writes to the same address. I didn't have access to a logic analyzer to see what cycles the CPU was performing on the flash. However, a straight 'C' implementation was not sufficient. You need to add some assembler instructions.

I guess it is wrong to say "out of order". I should have said not at all. AA and CMD are usually written to the same address. I did try msync commands and this was not effective.

fwiw,
Bill Pringlemeir.

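To make the "AA, 55, CMD" sequence above concrete, here is a sketch of how a C driver might issue it with a barrier after each store. The offsets 0x555 and 0x2AA are the usual word-mode unlock addresses for AMD-style command sets, but they are an assumption here and should be checked against the datasheet of the actual part. The `eieio` barrier follows the suggestion in the thread; whether it is sufficient, and how the region must be mapped, is exactly what the posters disagree about. A GCC-style compiler is assumed.

```c
#include <stdint.h>

#if defined(__powerpc__) || defined(__PPC__)
/* Orders stores to cache-inhibited, guarded storage on PowerPC. */
#define IO_WRITE_BARRIER() __asm__ volatile ("eieio" ::: "memory")
#else
/* Host fallback: stops compiler reordering only. */
#define IO_WRITE_BARRIER() __asm__ volatile ("" ::: "memory")
#endif

/* Assumed word-mode unlock offsets; verify against the datasheet. */
#define FLASH_UNLOCK_ADDR1 0x555u
#define FLASH_UNLOCK_ADDR2 0x2AAu

/* Issue the three-cycle unlock sequence followed by cmd. The volatile
   pointer forces the compiler to emit all three stores, even though
   the first and third hit the same address. */
static void flash_send_command(volatile uint16_t *base, uint16_t cmd)
{
    base[FLASH_UNLOCK_ADDR1] = 0x00AAu;
    IO_WRITE_BARRIER();
    base[FLASH_UNLOCK_ADDR2] = 0x0055u;
    IO_WRITE_BARRIER();
    base[FLASH_UNLOCK_ADDR1] = cmd;
    IO_WRITE_BARRIER();
}
```

Note that if the flash region is mapped cacheable with write-back, the first store to offset 0x555 can be merged with the third inside the cache, which matches the "not at all" behaviour described above; no barrier in C code fixes that without a non-cached mapping.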
After you reset after the hang, stop at the boot prompt and do an "e".
That might dump exception data if the previous hang was caused by an
exception.

lc
Bill Pringlemeir wrote:
> [snip]