Unexplained Hang During Boot

I am experiencing a very bizarre problem with vxWorks and I am hoping
that someone might be able to offer some suggestions on where to start
looking to determine the root of the problem.

VxWorks is being used on a Synergy Microsystems VME SBC which is PPC
based.  The problem seems to arise at random times after rebuilding the
OS image.  For instance, by commenting out a single 'printf' statement
such as printf("Message Received\n"); in an application level piece of
code that is not even invoked; and rebuilding the image, the image can
hang while booting (early in the boot procedure).  Uncomment this
'printf' statement, rebuild the image, and the OS will boot without
error. Note that this routine is not called at any time during the boot
procedure so the code containing that printf is never even executed.

This problem has been experienced by multiple developers on different
modules.  I am not sure if this is a hardware, or a software type of
problem.  Can anyone think of any reason why something as non-intrusive
as commenting out a printf statement, in a function that is never even
invoked, would cause the OS to hang during boot?

The printf statement is only adding a handful of bytes to the resultant
image and larger images than the ones that fail have been booted
successfully.

Similar hangs have been produced by changing array sizes in uncalled
routines, etc. (i.e., add a few more bytes to an array in an uncalled
function and the image hangs during boot, add a few more bytes and the
image loads fine).

Reply by ●June 28, 2006

On 28 Jun 2006, eon_blue_80@verizon.net wrote:
> I am experiencing a very bizarre problem with vxWorks and I am
> hoping that someone might be able to offer some suggestions on where
> to start looking to determine the root of the problem.

[snip]

> The printf statement is only adding a handful of bytes to the
> resultant image and larger images than the ones that fail have been
> booted successfully.

This sounds like a cache problem.  The "printf" is unrelated to the
code.  It just changes the image size at the "right" place.  You could
add a ".bytes 7" or something in the code section and the same thing
would result.

At some point in the boot sequence, there may be an alias between data
and code cache.  It could be when the MMU is turned on.  The address
space will change and code must often jump in a very specific
sequence.  It may be a conflict with a device.  For instance an "eieio"
instruction may be necessary in some cases, but due to code section
alignment, the code executes at different times and the "eieio"
becomes necessary or unnecessary depending on the build.

It is very good that you try to hunt this down.  I've known several
"senior" people who have let this type of problem go on forever.

You can toggle an LED or a general purpose I/O watched with a scope, or
you can use some polled console output to provide check points in the
boot sequence to see where the hang occurs.
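
A minimal sketch of such a check point, assuming a byte-wide debug port
(an LED bank or GPIO data register).  The names boot_checkpoint,
last_checkpoint and DEBUG_PORT are hypothetical, and the port is backed
by an ordinary variable here so that the sketch also builds on a host;
on a real board the pointer would hold a register address taken from
the BSP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical debug port.  A plain variable stands in for the
 * board-specific LED/GPIO register so the sketch builds anywhere. */
static volatile uint8_t fake_debug_port;
static volatile uint8_t *const DEBUG_PORT = &fake_debug_port;

/* Record how far the boot sequence got.  The value left on the port
 * identifies the last check point passed before the hang. */
static void boot_checkpoint(uint8_t id)
{
    assert(id != 0);            /* 0 is reserved for "nothing reached" */
    *DEBUG_PORT = id;
}

/* Read back the most recent check point (on hardware, read the LEDs). */
static uint8_t last_checkpoint(void)
{
    return *DEBUG_PORT;
}
```

Calls such as boot_checkpoint(1), boot_checkpoint(2), ... would be
placed at successive stages of the boot code.  Keep in mind that adding
the calls shifts the code too, so the symptom may move.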

The important point is that the "printf" has nothing to do with the
problem besides making the code move around.  You can verify this by
inserting different dummy routines with different lengths (a cache
line is typically 32/64 bytes).  Observing a map file of the full
image and knowing the location of these bytes can be helpful.  For
instance if code following this is an ethernet driver, then that may
be helpful to know.
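
When comparing map files from a booting and a hanging build, it can
help to see where a symbol falls within a cache line.  The following is
only a sketch; the function names are made up for illustration and the
line size (32 or 64 bytes) has to come from the CPU manual.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of an address within its cache line.
 * line_size must be a power of two, e.g. 32 or 64. */
static uint32_t cache_line_offset(uint32_t addr, uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return addr & (line_size - 1);
}

/* Nonzero if the len bytes starting at addr cross a line boundary. */
static int straddles_line(uint32_t addr, uint32_t len, uint32_t line_size)
{
    assert(len != 0);
    return cache_line_offset(addr, line_size) + len > line_size;
}
```

Feeding the addresses of the boot-time routines from both map files
through these helpers shows which ones changed their alignment between
the two builds.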

It could also be reading of garbage strings, code, constant data.  I
have also seen one section of code round MMU rights and another read
to the byte.  Sometimes this rounding is wrong and a "bus error"
happens due to memory not being sized right.

hth,
Bill Pringlemeir.

-- 
You have the right to remain silent -- so shut up!

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

Reply by MetalHead ●June 28, 2006

eon_blue_80@verizon.net wrote:

> [snip]

Another possibility is that errant code is corrupting memory during the
boot process. The most common case is the "wild pointer" where an
uninitialized pointer is used to write data. Other possibilities would be
over-running the stack reserved area or using pointers to buffers that
have been returned to the buffer pool and re-used. I have also seen
incorrect function prototypes cause this type of problem. If you are
using vector tables in RAM, walking on them will cause this type of
problem too.

The way I would attempt to solve this problem is with a logic analyzer.
Start out by finding where the code hangs. Then see if the instruction
sequence to get there took any un-explainable jumps. See if the 
departure point for the unexplainable sequence values match the expected 
values for that address. If they don't match the expected values, use 
writes to those locations to trigger the logic analyzer and you should 
be able to locate the errant code. The departure from expected execution 
could also be un-initialized or corrupted vectors in the vector table.

I am not familiar with the particular VME card you mentioned, but memory 
management hardware could protect you from a number of the things I 
described. Because it is a boot sequence problem, memory management 
hardware may not be operational at this point.

Another place to look would be the linker command file. Are all of the
segments large enough and in non-overlapping regions of memory? The
logic analyzer approach would lead you to this type of problem, but it
could be a painful path that could be avoided by careful study.
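
The overlap check can also be done mechanically from the start
addresses and sizes listed in the map file.  This is a host-side sketch
only; struct region and both function names are invented for the
example, and the values have to be copied in by hand from your own map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory region as read from a linker map (hypothetical layout). */
struct region {
    const char *name;
    uint32_t    start;
    uint32_t    size;
};

/* Nonzero if the two regions share at least one byte. */
static int regions_overlap(const struct region *a, const struct region *b)
{
    uint64_t a_end, b_end;

    if (a->size == 0 || b->size == 0)
        return 0;                       /* empty regions cannot overlap */
    a_end = (uint64_t)a->start + a->size;   /* one past the last byte */
    b_end = (uint64_t)b->start + b->size;
    return a->start < b_end && b->start < a_end;
}

/* Count overlapping pairs among n regions. */
static int count_overlaps(const struct region *r, size_t n)
{
    int hits = 0;
    size_t i, j;

    assert(r != NULL);
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (regions_overlap(&r[i], &r[j]))
                hits++;
    return hits;
}
```

A nonzero count points at a pair of sections worth inspecting in the
linker command file.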

Good Luck,
Bob

Reply by ssubbarayan ●June 29, 2006

Bill Pringlemeir wrote:
> [snip]

Bill/Others,
Excellent and enlightening explanations from you all.  We are facing a
similar kind of issue with a proprietary STMicroelectronics board
running a proprietary OS.  Though the OS is different, I believe the
problem is similar to the one being discussed here.

We faced a situation where just adding a printf inside one function, or
just introducing one assignment such as i=1 (we did not use the variable
'i' anywhere else), made a feature work, and removing that statement
made us lose the feature.  We were trying hard to figure out the problem
until one day we inspected the cache and found that with the data cache
disabled the feature worked just fine.

Now the question I would like to understand is: what is the best way to
figure out whether a problem is with cache memory?  One more behaviour I
have observed is that when we debug with breakpoints the feature works
fine, but when we use the binary production version of the same code it
never works!  This made debugging more difficult.  Could the cache be
responsible for this difference between the debug and production
versions?

I would like to avoid such problems in future, so it will be helpful if
some of you enlightened ones can explain this to me.

I am also posting the query to comp.arch.embedded as this will help me
get input from a lot of experienced people.  Pardon me in case I am wrong.

Looking forward to all your replies, and thanks in advance,

Regards,
s.subbarayan

Reply by Bo ●June 29, 2006

"ssubbarayan" <ssubba@gmail.com> wrote in message 
news:1151564578.472571.35220@b68g2000cwa.googlegroups.com...
> [snip]
>
So when you determine for certain that cache is the problem, what typically 
is the solution? Sprinkling cache flushes throughout the code? or what?

Bo

Reply by ●June 29, 2006

> "ssubbarayan" <ssubba@gmail.com> wrote in message 

>> Now the question I would like to understand is: what is the best way
>> to figure out whether a problem is with cache memory?  One more
>> behaviour I have observed is that when we debug with breakpoints the
>> feature works fine, but when we use the binary production version
>> of the same code it never works!  This made debugging more
>> difficult.  Could the cache be responsible for this difference
>> between the debug and production versions?
>>
>> I would like to avoid such problems in future, so it will be helpful
>> if some of you enlightened ones can explain this to me.

On 29 Jun 2006, bo@cephus.com wrote:

> So when you determine for certain that cache is the problem, what
> typically is the solution? Sprinkling cache flushes throughout the
> code? or what?

There are three possible issues.  One is a direct effect of caching,
another is alignment, and the third is timing.

If you have DMA, it will always retrieve from memory (i.e., SDRAM,
flash).  If your CPU is using a cache, it might be retrieving data
from the memory or from the cache.  For example, on one project we had
a video capture device with a built-in convolution matrix that DMAed
the results to the main processor.  The code did not pay attention to
the cache.  After much debugging, the software developer for the
imaging code decided that the HW was buggy.  I examined this and noted
that the buffer being used was fully cached.  It started to work when
we got memory that the MMU had marked as being non-cacheable.
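
The effect can be shown with a toy host-side model: one word of memory
shadowed by one cache entry.  Nothing here touches real hardware, and
all of the names are invented for the illustration.  On a target the
equivalent of cache_invalidate() would be the OS or BSP cache
invalidate call, or the buffer would be allocated non-cacheable as
described above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one memory word and the CPU's cached copy of it. */
struct cached_word {
    uint32_t memory;    /* what the DMA engine reads and writes */
    uint32_t cache;     /* the CPU's private copy */
    int      valid;     /* nonzero once the cache holds a copy */
};

/* CPU load: served from the cache once the line has been filled. */
static uint32_t cpu_read(struct cached_word *w)
{
    assert(w != 0);
    if (!w->valid) {
        w->cache = w->memory;   /* line fill on a miss */
        w->valid = 1;
    }
    return w->cache;
}

/* DMA transfer: goes straight to memory, the cache is not told. */
static void dma_write(struct cached_word *w, uint32_t value)
{
    w->memory = value;
}

/* Discard the cached copy so the next load refetches from memory. */
static void cache_invalidate(struct cached_word *w)
{
    w->valid = 0;
}
```

After a dma_write(), cpu_read() keeps returning the old value until
cache_invalidate() is called, which is the behaviour the imaging code
above was tripping over.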

Another example is on the PPC, there is a "write buffer".  It can be
the result that the PPC will not commit data to memory in the order
that instructions are encountered; especially with a write-back cache.
So, for instance, an AMD style NOR flash takes the command AA, 55,
CMD.  Without using the eieio command on the PowerPC, your flash
driver will not work as the commands can get written to the bus out of
order.  An MTD driver might loop forever trying to detect the flash
type or an end of operation, etc.  This might cause a hang during
booting.  Many HW devices use multiple writes to the same location.
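
A sketch of such a command write in C, under several assumptions: a
GCC-style compiler (for the inline asm), a 16-bit wide part, the unlock
offsets 0x555 and 0x2AA that AMD-style parts commonly use (check the
data sheet), and a flash window that is mapped caching-inhibited.  The
names IO_BARRIER and flash_command are made up.  On a non-PowerPC host
the macro degrades to a plain compiler barrier so the code still builds.

```c
#include <assert.h>
#include <stdint.h>

/* Keep I/O accesses in program order.  On PowerPC this emits eieio;
 * elsewhere it only stops the compiler from reordering. */
#if defined(__powerpc__) || defined(__PPC__)
#define IO_BARRIER() __asm__ volatile ("eieio" ::: "memory")
#else
#define IO_BARRIER() __asm__ volatile ("" ::: "memory")
#endif

/* Assumed unlock offsets, in 16-bit words from the flash base. */
#define FLASH_UNLOCK1 0x555u
#define FLASH_UNLOCK2 0x2AAu

/* Issue the AA, 55, CMD sequence with a barrier after every write. */
static void flash_command(volatile uint16_t *base, uint16_t cmd)
{
    assert(base != 0);
    base[FLASH_UNLOCK1] = 0x00AA;
    IO_BARRIER();
    base[FLASH_UNLOCK2] = 0x0055;
    IO_BARRIER();
    base[FLASH_UNLOCK1] = cmd;
    IO_BARRIER();
}
```

Note that AA and CMD go to the same address, so if the window were
cached write-back the barrier alone would not make both writes reach
the device.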

Those are some examples of direct changes the cache might have on the
order of memory accesses.

I had previously explained an alignment issue.  It sounds like this is
more like the OP's problem as the code in question doesn't even
execute.  However, it can also be the timing as this will shift code
and might change how the cache lines are fetched.  If the compiler is
aligning all code to a cache line, then this is not the problem.

The other instance is just timing.  If code is relying on things being
slow, and a cache is then enabled and speeds them up, an implicit delay
may no longer be sufficient.  Some slower/older HW devices must have
fixed delays between accesses.  It may also be that the code must be
fast enough, like kicking a watchdog.

In all cases, the best thing you can do is insert some sort of trace.
Like toggling a general purpose I/O connected to a scope.  You can
alter the timing to provide information or use multiple lines to
encode some information.  Multiple lines are better as they will
reduce the amount of code needed.  This mechanism suffers in that
inserting the debug code can make the symptom appear/disappear.

An ICE, BDM, or JTAG debugger would also be useful.  Let the system
crash and then look at the stack and PC.  Use HW breakpoints to work
backwards from there.

The problem with using a traditional debugger with breakpoints is that
this alters the code flow (just like a printf).  Hitting breakpoints
will definitely affect what is in the cache.

Once you find the problem, you have to look at the structure to know
what to do.  For instance, it is often best to change the way a
hardware device is accessed.  Like non-cacheable, write-through cache,
etc.  Sometimes it is not just the cache, but eieio instructions might
be needed (or other PPC instructions like isync, sync, etc).  Adding
cache flushes may work.  It would be much better to understand why it
is crashing and then correct the problem.  Just adding cache flushes
might be equivalent to the printfs.  Ie, it just shifts the code
around.

fwiw,
Bill Pringlemeir.

-- 
My cousin is an agoraphobic homosexual, which makes it kind of hard
for him to come out of the closet. - Bill Kelly

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

Reply by ●June 29, 2006

> eon_blue_80@verizon.net wrote:

> Another possibility is that errant code is corrupting memory during
> the boot process.

This is *unlikely* as the OP noted that adding un-executed code would
cause the problem.  If the code were directly corrupting memory, adding
un-executed code would be unlikely to introduce the problem,
especially if the added code makes no allocations and no writes to
memory.  If simply turning the cache on or off causes the crash, I find
it extremely unlikely that it is memory corruption.

So there is a quick way to rule this out.  Disable/enable the cache
with a crashing image.  Often you can arrange the code so that the
size is the same, just a constant has changed to disable/enable the
cache.
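
One way to arrange that, sketched with hypothetical names: read the
setting from a volatile constant so that the compiler keeps both
branches in the image and only one data word differs between the two
builds.  The enable/disable functions below are stand-ins for whatever
the BSP provides.  Both mode values are nonzero on the assumption that
some toolchains place zero-initialised objects in a different section.

```c
#include <assert.h>

/* Stand-ins for the platform's real cache control calls. */
static int dcache_on;
static void platform_dcache_enable(void)  { dcache_on = 1; }
static void platform_dcache_disable(void) { dcache_on = 0; }

#define DCACHE_MODE_ON  1
#define DCACHE_MODE_OFF 2

/* Change only this initialiser between the two test builds.
 * volatile keeps the compiler from folding the test below away. */
static volatile const int dcache_mode = DCACHE_MODE_ON;

/* Called once early in boot; same code size for either setting. */
static void early_cache_setup(void)
{
    int mode = dcache_mode;

    assert(mode == DCACHE_MODE_ON || mode == DCACHE_MODE_OFF);
    if (mode == DCACHE_MODE_ON)
        platform_dcache_enable();
    else
        platform_dcache_disable();
}
```

Comparing the two map files afterwards confirms whether the layout
really stayed identical.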

fwiw,
Bill Pringlemeir.

-- 
Anyone who  trades liberty for  security deserves neither  liberty nor
security - Benjamin Franklin

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

Reply by mrfirmware ●June 29, 2006

Bill Pringlemeir wrote:

> Another example is on the PPC, there is a "write buffer".  It can be
> the result that the PPC will not commit data to memory in the order
> that instructions are encountered; especially with a write-back cache.
> So, for instance, an AMD style NOR flash takes the command AA, 55,
> CMD.  Without using the eieio command on the PowerPC, your flash
> driver will not work as the commands can get written to the bus out of
> order.

Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
use a weakly ordered view of memory so you have the potential for out
of order reads but never writes. The write buffer does not affect write
order, it just allows read-around-write.

Our AMCC PPC405 (strongly ordered view of memory) Flash driver started
failing when we used it on the 440. After reading up on weakly ordered
memory systems and ensuring that the Flash region was marked cached,
guarded, we proceeded to put msync brackets around all reads from I/O
devices with read side effects (real memory and I/O devices without
read side effects don't need them), e.g.

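# uint16 read_flash_hword(uint16 *pFlashAddr);
read_flash_hword:
    msync               # complete all earlier storage accesses first
    lhz  r3,0(r3)       # load the halfword at pFlashAddr into r3
    msync               # finish the load before any later access
    blr                 # return with the value in r3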

The msync brackets ensured that the read could not be reordered with
any earlier or later read or write. However, you are guaranteed to have
multiple writes go in order to the device safely as long as your reads
are protected as above. Furthermore, the PowerPC architecture is smart
enough to execute RMW ops correctly on a given I/O address, e.g.

lwz r3,0(r4)
ori  r3,r3,0x0040
stw r3,0(r4)

will result in the expected value written to the address pointed to by
r4, that is, the CPU will not perform the store before the load due to
register dependencies.
-- 
- Mark

Reply by ●June 29, 2006

On 29 Jun 2006, mrfirmware@gmail.com wrote:
> Bill Pringlemeir wrote:

>> Another example is on the PPC, there is a "write buffer".  It can
>> be the result that the PPC will not commit data to memory in the
>> order that instructions are encountered; especially with a
>> write-back cache.  So, for instance, an AMD style NOR flash takes
>> the command AA, 55, CMD.  Without using the eieio command on the
>> PowerPC, your flash driver will not work as the commands can get
>> written to the bus out of order.

> Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
> use a weakly ordered view of memory so you have the potential for
> out of order reads but never writes. The write buffer does not
> affect write order, it just allows read-around-write.

I am absolutely sure of nothing.  I might have the wrong terminology
for the cache type.  If multiple writes to the same location fit in
the cache, only the last value will be written to the memory device.
This makes perfect sense for SDRAM and is a very good operation.
Consider a frame pointer with some loop variables stored in one of
these lines.  Constantly committing the data from cache to SDRAM would
seem to be a waste of time.

With AMD type flash, there are several writes to the same address.  I
didn't have access to a logic analyzer to see what cycles the CPU was
performing on the flash.  However, a straight 'C' implementation was
not sufficient.  You need to add some assembler instructions.

I guess it is wrong to say "out of order".  I should have said not at
all.  AA and CMD are usually written to the same address.  I did try
msync commands and this was not effective.

fwiw,
Bill Pringlemeir.

-- 
I never did give anybody hell.  I just told the truth and they thought
it was hell. - Harry S. Truman

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

Reply by LarryC ●June 29, 2006

After you reset after the hang, stop at the boot prompt and do an "e".
That might dump exception data if the previous hang was caused by an
exception.

lc
