Reply by June 30, 2006
Thank you everyone for all of your suggestions.  These suggestions will
be a great help when troubleshooting future problems.

As far as the original problem goes, using I/O probing we were able to
successfully narrow the error down to a relatively large segment of the
BSP.  Apparently there is a problem in the SCSI section of the BSP
(wild pointer or out of order type operation??) that causes the image
to hang when the bytes of the image are aligned in just the right way.
We have made a decision to disable SCSI support within the OS (which
has corrected the problem).  Hopefully, if time ever becomes available,
we can look into the SCSI section of the BSP and find the exact bug.

Reply by Didi June 30, 2006
> If you are referring to CPU_213 you need only to set CCR0 as specified.

This is what I was referring to, apparently you have it under control.
It was enough to stop me from using the 405 (I opted for the 5200).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------

mrfirmware wrote:
> Didi wrote:
>
> > There is one more possibility I know of. If the processor is a 405,
> > check its errata sheet. I recently discovered (while considering a
> > device, I opted not to use it) a late published error to be saying
> > basically you may not use its cache in copyback mode, it does not
> > work. Use write through....
>
> We haven't used write-through, ever, on the 405GPr and it has had nary
> a problem with copy-back at least for the past 4 years of the product
> life (thousands of blade servers). Do you have an errata number or doc.
> I could look at WRT to this cache bug? If you are referring to CPU_213
> you need only to set CCR0 as specified. Setting write-through mode is
> simply too big a hammer (for us).
> --
> - Mark
Reply by Jim Stewart June 30, 2006
eon_blue_80@verizon.net wrote:

> I am experiencing a very bizarre problem with vxWorks and I am hoping
> that someone might be able to offer some suggestions on where to start
> looking to determine the root of the problem.
>
> VxWorks is being used on a Synergy Microsystems VME SBC which is PPC
> based. The problem seems to arise at random times after rebuilding the
> OS image. For instance, by commenting out a single 'printf' statement
> such as printf("Message Received\n"); in an application level piece of
> code that is not even invoked; and rebuilding the image, the image can
> hang while booting (early in the boot procedure). Uncomment this
> 'printf' statement, rebuild the image, and the OS will boot without
> error. Note that this routine is not called at any time during the boot
> procedure so the code containing that printf is never even executed.

Reading your post, it's not clear how many different physical units
you've tried this on. If the answer is one, the problem could be a
byte of flash memory with a bad bit.
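
One way to test the bad-flash-bit theory is to read the programmed image back and compare it byte for byte against the file that was built. A minimal sketch in C, assuming the flash is memory-mapped and readable; `first_mismatch` and its parameters are hypothetical names, not part of VxWorks or the BSP:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare a memory-mapped flash image against a reference copy in RAM.
 * Returns the offset of the first differing byte, or len if they match.
 * (Hypothetical helper; the names are illustrative only.)
 */
size_t first_mismatch(const volatile uint8_t *flash,
                      const uint8_t *ref, size_t len)
{
    size_t i;

    assert(flash != NULL && ref != NULL);
    for (i = 0; i < len; i++) {
        if (flash[i] != ref[i]) {
            return i;   /* flash[i] ^ ref[i] shows which bits differ */
        }
    }
    return len;
}
```

Running the same comparison on a second unit would help separate a bad flash cell on one board from a fault in the image itself.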
Reply by mrfirmware June 30, 2006
Didi wrote:

> There is one more possibility I know of. If the processor is a 405,
> check its errata sheet. I recently discovered (while considering a
> device, I opted not to use it) a late published error to be saying
> basically you may not use its cache in copyback mode, it does not
> work. Use write through....

We haven't used write-through, ever, on the 405GPr and it has had nary
a problem with copy-back at least for the past 4 years of the product
life (thousands of blade servers). Do you have an errata number or doc.
I could look at WRT to this cache bug? If you are referring to CPU_213
you need only to set CCR0 as specified. Setting write-through mode is
simply too big a hammer (for us).

--
- Mark
Reply by Didi June 30, 2006
> This would be a good first step. The OP sounded like he was fishing for
> ideas, so I threw out a couple that I have run into in the past.

I also would tip on cache handling problems in the code. Forgotten
flush of the i-cache is something I have had to chase with my early
versions.

There is one more possibility I know of. If the processor is a 405,
check its errata sheet. I recently discovered (while considering a
device, I opted not to use it) a late published error to be saying
basically you may not use its cache in copyback mode, it does not
work. Use write through....

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------

MetalHead wrote:
> Bill Pringlemeir wrote:
>
> >>eon_blue_80@verizon.net wrote:
> >
> >>Another possibility is that errant code is corrupting memory during
> >>the boot process.
> >
> > This is *unlikely* as the OP noted that adding un-executed code would
> > cause the problem. If the code is directly corrupting memory this
> > would be unlikely to introduce the problem. Especially if the added
> > code makes no types of allocation, nor writes to memory. If simply
> > changing the cache on/off will cause the crash, I find it extremely
> > unlikely that it is a memory corruption.
>
> I have seen this happen in the past in this manner. By adding code into
> the code segment, you move the relative position of stuff around. Even
> if the code you added does not get executed, if the I/O drivers are at
> opposite end of the link map from the boot code, just increasing or
> decreasing the relative separation of components can cause the
> corruption to occur in a place that does not get executed during the
> boot process or causes a different kind of problem. C libraries are
> another good candidate for winding up at the far end of the link map. If
> you are lucky, this will show up as an illegal instruction trap, and if
> you are unlucky, it shows up as branches to nowhere or tight loops.
>
> > So there is a quick way to rule this out. Disable/enable the cache
> > with a crashing image. Often you can arrange the code so that the
> > size is the same, just a constant has changed to disable/enable the
> > cache.
>
> This would be a good first step. The OP sounded like he was fishing for
> ideas, so I threw out a couple that I have run into in the past.
>
> Bob
Reply by MetalHead June 29, 2006
Bill Pringlemeir wrote:

>>eon_blue_80@verizon.net wrote:
>
>>Another possibility is that errant code is corrupting memory during
>>the boot process.
>
> This is *unlikely* as the OP noted that adding un-executed code would
> cause the problem. If the code is directly corrupting memory this
> would be unlikely to introduce the problem. Especially if the added
> code makes no types of allocation, nor writes to memory. If simply
> changing the cache on/off will cause the crash, I find it extremely
> unlikely that it is a memory corruption.

I have seen this happen in the past in this manner. By adding code into
the code segment, you move the relative position of stuff around. Even
if the code you added does not get executed, if the I/O drivers are at
opposite end of the link map from the boot code, just increasing or
decreasing the relative separation of components can cause the
corruption to occur in a place that does not get executed during the
boot process or causes a different kind of problem. C libraries are
another good candidate for winding up at the far end of the link map. If
you are lucky, this will show up as an illegal instruction trap, and if
you are unlucky, it shows up as branches to nowhere or tight loops.

> So there is a quick way to rule this out. Disable/enable the cache
> with a crashing image. Often you can arrange the code so that the
> size is the same, just a constant has changed to disable/enable the
> cache.

This would be a good first step. The OP sounded like he was fishing for
ideas, so I threw out a couple that I have run into in the past.

Bob
Reply by LarryC June 29, 2006
After you reset after the hang, stop at the boot prompt and do an "e".
That might dump exception data if the previous hang was caused by an
exception.

lc
Bill Pringlemeir wrote:
> On 29 Jun 2006, mrfirmware@gmail.com wrote:
> > Bill Pringlemeir wrote:
>
> >> Another example is on the PPC, there is a "write buffer". It can
> >> be the result that the PPC will not commit data to memory in the
> >> order that instructions are encountered; especially with a
> >> write-back cache. So, for instance, an AMD style NOR flash takes
> >> the command AA, 55, CMD. Without using the eieio command on the
> >> PowerPC, your flash driver will not work as the commands can get
> >> written to the bus out of order.
>
> > Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
> > use a weakly ordered view of memory so you have the potential for
> > out of order reads but never writes. The write buffer does not
> > affect write order, it just allows read-around-write.
>
> I am absolutely sure of nothing. I might have the wrong terminology
> for the cache type. If multiple writes to the same location fit in
> the cache, only that last value will be written to the memory device.
> This makes perfect sense for SDRAM and is a very good operation.
> Consider a frame pointer with some loop variables stored in one of
> these lines. Constantly committing the data from cache to SDRAM would
> seem to be a waste of time.
>
> With AMD type flash, there are several writes to the same address. I
> didn't have access to a logic analyzer to see what cycles the CPU was
> performing on the flash. However, a straight 'C' implementation was
> not sufficient. You need to add some assembler instructions.
>
> I guess it is wrong to say "out of order". I should have said not at
> all. AA and CMD are usually written to the same address. I did try
> msync commands and this was not effective.
>
> fwiw,
> Bill Pringlemeir.
>
> --
> I never did give anybody hell. I just told the truth and they thought
> it was hell. - Harry S. Truman
>
> vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
Reply by June 29, 2006
On 29 Jun 2006, mrfirmware@gmail.com wrote:
> Bill Pringlemeir wrote:

>> Another example is on the PPC, there is a "write buffer". It can
>> be the result that the PPC will not commit data to memory in the
>> order that instructions are encountered; especially with a
>> write-back cache. So, for instance, an AMD style NOR flash takes
>> the command AA, 55, CMD. Without using the eieio command on the
>> PowerPC, your flash driver will not work as the commands can get
>> written to the bus out of order.

> Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
> use a weakly ordered view of memory so you have the potential for
> out of order reads but never writes. The write buffer does not
> affect write order, it just allows read-around-write.

I am absolutely sure of nothing. I might have the wrong terminology
for the cache type. If multiple writes to the same location fit in
the cache, only that last value will be written to the memory device.
This makes perfect sense for SDRAM and is a very good operation.
Consider a frame pointer with some loop variables stored in one of
these lines. Constantly committing the data from cache to SDRAM would
seem to be a waste of time.

With AMD type flash, there are several writes to the same address. I
didn't have access to a logic analyzer to see what cycles the CPU was
performing on the flash. However, a straight 'C' implementation was
not sufficient. You need to add some assembler instructions.

I guess it is wrong to say "out of order". I should have said not at
all. AA and CMD are usually written to the same address. I did try
msync commands and this was not effective.

fwiw,
Bill Pringlemeir.

--
I never did give anybody hell. I just told the truth and they thought
it was hell. - Harry S. Truman

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
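
The AA, 55, CMD sequence with the "assembler instructions" added might look like the following in C. This is only a sketch under several assumptions: the unlock offsets 0x555 and 0x2AA are the usual ones for AMD-style parts in 16-bit mode but should be checked against the data sheet, `flash_send_cmd` and `IO_EIEIO` are made-up names, the inline assembly assumes GCC, and the flash window is assumed to be mapped cache-inhibited so that each store actually reaches the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ordering barrier: eieio on PowerPC, a plain compiler barrier elsewhere
 * so the sketch still builds on a host machine (GCC syntax assumed). */
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
#define IO_EIEIO() __asm__ __volatile__("eieio" ::: "memory")
#elif defined(__GNUC__)
#define IO_EIEIO() __asm__ __volatile__("" ::: "memory")
#else
#define IO_EIEIO() ((void)0)
#endif

/* Assumed unlock offsets (in 16-bit words); verify for the real part. */
#define FLASH_UNLOCK1 0x555u
#define FLASH_UNLOCK2 0x2AAu

/*
 * Issue an AMD-style command: AA, 55, then the command byte.
 * 'volatile' keeps the compiler from merging the two stores to
 * FLASH_UNLOCK1; the barrier keeps the stores in program order.
 */
void flash_send_cmd(volatile uint16_t *base, uint16_t cmd)
{
    assert(base != NULL);

    base[FLASH_UNLOCK1] = 0x00AA;
    IO_EIEIO();
    base[FLASH_UNLOCK2] = 0x0055;
    IO_EIEIO();
    base[FLASH_UNLOCK1] = cmd;
    IO_EIEIO();
}
```

If the region were cacheable, the first store to FLASH_UNLOCK1 could be overwritten in the cache before it ever left the CPU, which matches the "not at all" behaviour described above.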
Reply by mrfirmware June 29, 2006
Bill Pringlemeir wrote:

> Another example is on the PPC, there is a "write buffer". It can be
> the result that the PPC will not commit data to memory in the order
> that instructions are encountered; especially with a write-back cache.
> So, for instance, an AMD style NOR flash takes the command AA, 55,
> CMD. Without using the eieio command on the PowerPC, your flash
> driver will not work as the commands can get written to the bus out of
> order.

Are you sure? Some PPC implementations, e.g. the AMCC PPC440 series,
use a weakly ordered view of memory so you have the potential for out
of order reads but never writes. The write buffer does not affect
write order, it just allows read-around-write.

Our AMCC PPC405 (strongly ordered view of memory) Flash driver started
failing when we used it on the 440. After reading up on weakly ordered
memory systems and ensuring that the Flash region was marked cached,
guarded we proceeded to put msync brackets round all reads to I/O
devices with read-side effects (real memory or I/O devices without
read-side effects don't need them), e.g.

    # uint16 read_flash_hword(uint16 *pFlashAddr);
    read_flash_hword:
        msync
        lhz     r3,0(r3)
        msync
        blr

The msync brackets ensured that the read could not be reordered with
any preceding or subsequent read or write. However, you are guaranteed
to have multiple writes go in order to the device safely as long as
your reads are protected as above. Furthermore, the PowerPC
architecture is smart enough to execute RMW ops. correctly on a given
I/O address, e.g.

    lwz     r3,0(r4)
    ori     r3,r3,0x0040
    stw     r3,0(r4)

will result in the expected value written to the address pointed to by
r4, that is, the CPU will not perform the store before the load due to
register dependencies.

--
- Mark
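
The bracketed read can also be written in C with inline assembly. A rough sketch, assuming GCC; `IO_SYNC` and `read_io_hword` are hypothetical names. On Book E cores such as the 440 the instruction is spelled msync; the sketch uses the classic spelling sync, which should assemble to the same opcode, so adjust it to whatever your assembler accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full barrier on PowerPC; compiler-only barrier on other hosts so the
 * sketch can still be built and exercised off-target (GCC assumed). */
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
#define IO_SYNC() __asm__ __volatile__("sync" ::: "memory")
#elif defined(__GNUC__)
#define IO_SYNC() __asm__ __volatile__("" ::: "memory")
#else
#define IO_SYNC() ((void)0)
#endif

/*
 * Read a 16-bit I/O location that has read side effects.
 * The barriers on either side keep the load from being reordered
 * with earlier or later accesses.
 */
uint16_t read_io_hword(const volatile uint16_t *addr)
{
    uint16_t value;

    assert(addr != NULL);
    IO_SYNC();
    value = *addr;
    IO_SYNC();
    return value;
}
```

Plain memory, or devices whose reads have no side effects, would not need the wrapper and can be read through an ordinary volatile pointer.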
Reply by June 29, 2006
> eon_blue_80@verizon.net wrote:
> Another possibility is that errant code is corrupting memory during > the boot process.
This is *unlikely* as the OP noted that adding un-executed code would
cause the problem. If the code is directly corrupting memory this
would be unlikely to introduce the problem. Especially if the added
code makes no types of allocation, nor writes to memory. If simply
changing the cache on/off will cause the crash, I find it extremely
unlikely that it is a memory corruption.

So there is a quick way to rule this out. Disable/enable the cache
with a crashing image. Often you can arrange the code so that the
size is the same, just a constant has changed to disable/enable the
cache.

fwiw,
Bill Pringlemeir.

--
Anyone who trades liberty for security deserves neither liberty nor
security - Benjamin Franklin

vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"
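
The same-size trick could be sketched like this: both code paths are always linked in and a single constant word selects between them, so flipping it leaves every address in the image unchanged. `bsp_cache_enable`, `bsp_cache_disable` and `cache_enable_flag` are hypothetical stand-ins for whatever the BSP really provides; the stubs below only record which path ran.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the real BSP cache routines (hypothetical names).
 * -1 = untouched, 0 = cache disabled, 1 = cache enabled. */
static int cache_state = -1;
static void bsp_cache_enable(void)  { cache_state = 1; }
static void bsp_cache_disable(void) { cache_state = 0; }

/*
 * One word decides the behaviour. Declaring it volatile stops the
 * compiler from folding the test away, so both branches stay in the
 * image and changing the value does not move any code.
 */
volatile const uint32_t cache_enable_flag = 1u;

void bsp_cache_setup(void)
{
    if (cache_enable_flag != 0u) {
        bsp_cache_enable();
    } else {
        bsp_cache_disable();
    }
    assert(cache_state != -1);  /* one of the two paths must have run */
}
```

Comparing the link maps of the two builds is a quick check that nothing moved when the flag changed.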