Reply by William Dennen January 25, 2007
On Tue, 23 Jan 2007 12:46:58 -0600, Bo opined:

> "William Dennen" <wdennen@gmail.com> wrote in message
> news:eopafk$s74$1@aioe.org...
>> Bo
>> I've an idea, if you've got sufficient hardware, that may shed some light
>> on where the problem is. You need 4 boards, 2 5100s and 2 anything VME.
>> Call the 5100s A & B, the others C & D. Set up C & D so they can
>> read/write each other's memory and also either A or B. NO shared memory
>> configured for these two. You've got A & B already set up. Create the
>> hang condition and then:
>> (1) can C & D still read/write each other?
>> (2) can either C or D read/write to either A or B?
>> (3) can either A or B read/write to either C or D?
>>
>> The essence of what you're trying to determine is whether the system
>> controller function is hosed or not. IF C & D can still read/write then
>> it is not and the hang condition is local to A/B.
>
> 1) C&D cannot read/write.
> 2) no
> 3) no
>
> ie it 'appears' to be an honest-to-God hardware lock-up--from which only a
> power cycle will recover. Scary, huh?
>
> Bo

Indeed it's scary, and it smells of an erratum: it appears that the system controller has left the scene. I would recommend getting Tundra to look at the issue. I'm sure they'll want a dump of the Universe registers and a bus trace, if you've got the capability. You can initiate the dialog at http://www.tundra.com/support.aspx?bid=481&id=962. Hopefully they can simulate the sequence ...

Regards
--
>@<
Bill Dennen wdennen@gmail.com Cluelessness: There are no stupid questions, but there are a LOT of inquisitive idiots. (despair.com)
Reply by Bo January 23, 2007
"William Dennen" <wdennen@gmail.com> wrote in message 
news:eopafk$s74$1@aioe.org...
> Bo
> I've an idea, if you've got sufficient hardware, that may shed some light
> on where the problem is. You need 4 boards, 2 5100s and 2 anything VME.
> Call the 5100s A & B, the others C & D. Set up C & D so they can
> read/write each other's memory and also either A or B. NO shared memory
> configured for these two. You've got A & B already set up. Create the
> hang condition and then:
> (1) can C & D still read/write each other?
> (2) can either C or D read/write to either A or B?
> (3) can either A or B read/write to either C or D?
>
> The essence of what you're trying to determine is whether the system
> controller function is hosed or not. IF C & D can still read/write then
> it is not and the hang condition is local to A/B.

1) C&D cannot read/write.
2) no
3) no

ie it 'appears' to be an honest-to-God hardware lock-up--from which only a
power cycle will recover. Scary, huh?

>
> Would I be correct in assuming that you've left the BSP configured for
> a hardware TAS? (I can't remember the #define precisely, but if you mucked
> with it, you know the one I mean).

I don't think that TAS has been changed---at least not by me.

Thanks for the suggestions and help,

Bo
Reply by William Dennen January 18, 2007
Bo
I've an idea, if you've got sufficient hardware, that may shed some light
on where the problem is.  You need 4 boards, 2 5100s and 2 anything VME. 
Call the 5100s A & B, the others C & D.  Set up C & D so they can
read/write each other's memory and also either A or B.  NO shared memory
configured for these two.  You've got A & B already set up.  Create the
hang condition and then:
(1) can C & D still read/write each other?
(2) can either C or D read/write to either A or B?
(3) can either A or B read/write to either C or D?

The essence of what you're trying to determine is whether the system
controller function is hosed or not.  IF C & D can still read/write then
it is not and the hang condition is local to A/B.
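
The three checks above amount to a small decision table. A minimal sketch of it in C follows, assuming the three yes/no answers have already been gathered by hand on the boards; the type and function names are hypothetical and are not part of VxWorks or any BSP.

```c
#include <stdbool.h>

/* Hypothetical result codes for the A/B/C/D experiment. */
typedef enum {
    FAULT_NONE,         /* every path still works */
    FAULT_LOCAL_TO_AB,  /* C<->D still works, so the bus itself is alive */
    FAULT_BUS_WIDE      /* C<->D fails too: system controller or bus is stuck */
} fault_scope_t;

/*
 * cd_ok    : check (1), C and D can still read/write each other
 * cd_to_ab : check (2), C or D can read/write A or B
 * ab_to_cd : check (3), A or B can read/write C or D
 */
fault_scope_t classify_hang(bool cd_ok, bool cd_to_ab, bool ab_to_cd)
{
    if (!cd_ok) {
        /* Boards with no shared memory configured are locked out as well. */
        return FAULT_BUS_WIDE;
    }
    if (cd_to_ab && ab_to_cd) {
        return FAULT_NONE;
    }
    /* The bus still arbitrates for C and D; only paths involving A/B fail. */
    return FAULT_LOCAL_TO_AB;
}

/* Human-readable label for logging the outcome. */
const char *fault_scope_name(fault_scope_t scope)
{
    switch (scope) {
    case FAULT_NONE:        return "no fault observed";
    case FAULT_LOCAL_TO_AB: return "hang local to A/B";
    case FAULT_BUS_WIDE:    return "bus-wide hang (system controller suspect)";
    }
    return "unknown";
}
```

Only check (1) decides between "bus-wide" and "local to A/B"; checks (2) and (3) just confirm that a hang exists at all.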

Would I be correct in assuming that you've left the BSP configured for
a hardware TAS? (I can't remember the #define precisely, but if you mucked
with it, you know the one I mean).

Regards
-- 
>@<
Bill Dennen wdennen@gmail.com Cluelessness: There are no stupid questions, but there are a LOT of inquisitive idiots. (despair.com)
Reply by Bo January 17, 2007
"CBFalconer" <cbfalconer@yahoo.com> wrote in message 
news:45AD929D.651D0282@yahoo.com...
>
> I have no idea whether this is applicable to the OP's problem, but
> in general memory is shared as long as it is not written. If a
> process wants to write to it, the page table for that process is
> modified to remap that portion, a copy of the original is made, and
> the write then proceeds. That portion of the memory is then no
> longer shared.
>
> If the memory is truly shared, so that one process's writes show up
> in other processes' memory space, then various synchronization
> protocols must be used. This can involve semaphores, monitors,
> critical sections, etc.
>
> Threads are generally lightweight processes, using memory shared
> with other threads in the same process, and will need the
> synchronization primitives to access it.
>
> --
> Chuck F (cbfalconer at maineline dot net)
> Available for consulting/temporary embedded and systems.
> <http://cbfalconer.home.att.net>
>

Yes, the memory is truly shared. However, it is my understanding that the RMW protection scheme across a VME backplane is generally implemented in hardware, and that for any HW that does not support RMW, the protection scheme must be emulated by SW, resulting in a much slower transaction.

In my particular case, it seems that the physical option jumper causes the HW to work/not work depending on its position, which seems divorced from SW in my view. That is, if it were a SW issue, the problem would exist regardless of the HW jumper position.

I do find it odd that earlier board models (using the same Tundra chip) do not exhibit the problem.

Bo
Reply by CBFalconer January 16, 2007
William Dennen wrote:
>
... snip ...
>
> I wish I did; still don't have a good handle on how shared memory
> is _really_ implemented in spite of mucking with it on and off for
> a number of years. The point is that if the memory spaces weren't
> shared the transactions would succeed, otherwise Tundra wouldn't be
> able to sell chip one. I suspect your implementation is drawing
> out a latent defect in the implementation of shared memory; I'm
> aware of another who encountered a similar hang using a more
> standard configuration (but totally weird in other ways). That
> too is unresolved as far as I know.

I have no idea whether this is applicable to the OP's problem, but in general memory is shared as long as it is not written. If a process wants to write to it, the page table for that process is modified to remap that portion, a copy of the original is made, and the write then proceeds. That portion of the memory is then no longer shared.

If the memory is truly shared, so that one process's writes show up in other processes' memory space, then various synchronization protocols must be used. This can involve semaphores, monitors, critical sections, etc.

Threads are generally lightweight processes, using memory shared with other threads in the same process, and will need the synchronization primitives to access it.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
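
For the truly shared case described in the post above, the simplest synchronization primitive is a test-and-set flag. The sketch below uses the standard C11 `atomic_flag` on a single host as an analogy only: it does not model a VME read-modify-write cycle or the Universe bridge, and the `shared_region_t` type and `region_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A shared region guarded by a test-and-set flag. */
typedef struct {
    atomic_flag locked;  /* clear = free, set = held */
    int counter;         /* the shared data being protected */
} shared_region_t;

void region_init(shared_region_t *r)
{
    atomic_flag_clear(&r->locked);
    r->counter = 0;
}

/*
 * Atomic test-and-set: returns true only for the caller that flipped
 * the flag from clear to set. The read and the write happen as one
 * indivisible step, which is what an RMW cycle guarantees.
 */
bool region_try_lock(shared_region_t *r)
{
    return !atomic_flag_test_and_set_explicit(&r->locked,
                                              memory_order_acquire);
}

void region_unlock(shared_region_t *r)
{
    atomic_flag_clear_explicit(&r->locked, memory_order_release);
}

/* Update the shared data only while holding the flag. */
bool region_increment(shared_region_t *r)
{
    if (!region_try_lock(r)) {
        return false;  /* someone else holds it; caller may retry */
    }
    r->counter++;
    region_unlock(r);
    return true;
}
```

If the test and the set could be split by another bus master, two owners could both see the flag as clear and both claim it, which is why the indivisibility matters more than the flag itself.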
Reply by William Dennen January 16, 2007
On Mon, 15 Jan 2007 10:54:28 -0600, Bo queried:

>
> Good point Bill. Do you know how I can test/change VxWorks to confirm it is
> or isn't a RMW issue?
>

I wish I did; I still don't have a good handle on how shared memory is _really_ implemented, in spite of mucking with it on and off for a number of years. The point is that if the memory spaces weren't shared the transactions would succeed; otherwise Tundra wouldn't be able to sell chip one. I suspect your configuration is drawing out a latent defect in the implementation of shared memory. I'm aware of another who encountered a similar hang using a more standard configuration (but totally weird in other ways). That too is unresolved as far as I know.

Regards
--
>@<
Bill Dennen wdennen@gmail.com Cluelessness: There are no stupid questions, but there are a LOT of inquisitive idiots. (despair.com)
Reply by Bo January 15, 2007
"William Dennen" <wdennen@gmail.com> wrote in message 
news:eo901m$vbs$1@aioe.org...
> On Mon, 08 Jan 2007 10:57:04 -0600, Bo opined:
>
>
>> So... the questions are:
>>
>> 1) is this a known HW issue with MVME5100 cards?
>> 2) if not, is there any possibility that the VxWorks BSP could cause the
>> behavior?
>> 3) can we conclude that both/all boards think they are system controller?
>> 4) is there a SW fix to make the auto-config jumper work as intended?
>
> I rather much wonder if this isn't some issue with the RMW mechanism being
> used within VxWorks resulting in a lock on the local bus.

Good point Bill. Do you know how I can test/change VxWorks to confirm it is or isn't a RMW issue?
> Such nasty
> behavior isn't seen if the transfers are not into shared memory spaces.
> My recollection on the auto-config jumper is that it's sensed by the
> Universe at initialization to determine if it needs to provide bus
> arbitration and isn't used afterwards.

This is what I thought as well. I do recall that at a previous employer we had similar issues with the same Tundra chip, and the workaround was extra crap that the BSP had to perform during initialization... but I really don't want to go that route again if avoidable.

Thanks,
Bo

>That this problem comes and goes
> depending whether auto-configuration or not is selected does suggest
> otherwise; but I doubt the root cause is the jumper setting.
>
> Regards
> --
>>@<
> Bill Dennen wdennen@gmail.com
> Cluelessness: There are no stupid questions,
> but there are a LOT of inquisitive idiots.
> (despair.com)
Reply by William Dennen January 12, 2007
On Mon, 08 Jan 2007 10:57:04 -0600, Bo opined:


> So... the questions are:
>
> 1) is this a known HW issue with MVME5100 cards?
> 2) if not, is there any possibility that the VxWorks BSP could cause the
> behavior?
> 3) can we conclude that both/all boards think they are system controller?
> 4) is there a SW fix to make the auto-config jumper work as intended?

I very much wonder if this isn't some issue with the RMW mechanism used within VxWorks resulting in a lock on the local bus. Such nasty behavior isn't seen if the transfers are not into shared memory spaces.

My recollection on the auto-config jumper is that it's sensed by the Universe at initialization to determine if it needs to provide bus arbitration, and isn't used afterwards. That this problem comes and goes depending on whether auto-configuration is selected does suggest otherwise, but I doubt the root cause is the jumper setting.

Regards
--
>@<
Bill Dennen wdennen@gmail.com Cluelessness: There are no stupid questions, but there are a LOT of inquisitive idiots. (despair.com)
Reply by Bo January 8, 2007
We are using Motorola MVME5100 boards and VxWorks 6.3. We modified the 
delivered BSP to allow us to have shared memory windows across all processor 
cards. Now, our problem appears to be that the auto syscon feature of the 
board is not working properly. That is to say, if any controller not in 
slot 0 is present and has its auto-config jumper set to AUTO, then we see 
the following behavior:

The system controller in slot 0 can read/write into slave card N with no 
problems. Then slave N can read/write into the slot 0 shared memory. So far 
so good. But after a slave access to the slot 0 shared memory, any further 
access across the VME bus results in a hang, i.e. the system controller in 
slot 0 can no longer read/write to slot N.
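
For anyone trying to reproduce the "can read/write" checks above, here is a minimal write-then-read-back probe in plain C. It is only a sketch: `probe_window` is a hypothetical helper, the pointer is assumed to already map the remote board's shared window, and a genuine bus hang would stall inside the loop rather than return a failure. On a real target something like VxWorks' `vxMemProbe()` is the usual way to survive a bus error, though that would not rescue a true lock-up either.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write a distinct pattern to each 32-bit word of a window and read it
 * straight back. Returns the index of the first word that does not read
 * back correctly, or 'words' if every word passes.
 *
 * 'window' is volatile so the compiler cannot drop the read-back.
 */
size_t probe_window(volatile uint32_t *window, size_t words, uint32_t seed)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t pattern = seed ^ (uint32_t)i;  /* differs per word */
        window[i] = pattern;
        if (window[i] != pattern) {
            return i;  /* first failing word */
        }
    }
    return words;  /* whole window verified */
}
```

Running the probe in the same order as the failing sequence (slot 0 into N, N into slot 0, then slot 0 into N again) would show which access is the last one to complete.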

If the jumpers on all cards (other than slot 0) are set to NO SYSCON, then 
all accesses across the VME bus appear to function properly and there are no 
hangs. This indicates a hardware issue to us, but we are not 100% certain.

We verified the same behavior across multiple 5100 cards (with various RAM 
amounts) and got the same result. We also verified the same behavior with 
slot 0 being a MVME6100 card and slot 1 being a MVME5100 card.

So... the questions are:

1) is this a known HW issue with MVME5100 cards?
2) if not, is there any possibility that the VxWorks BSP could cause the 
behavior?
3) can we conclude that both/all boards think they are system controller?
4) is there a SW fix to make the auto-config jumper work as intended?


Thanks,

Bo