EmbeddedRelated.com
Forums

Remote "watchdog"s

Started by Don Y April 20, 2016
On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel
<robertwessel2@yahoo.com> wrote:

>On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
><blockedofcourse@foo.invalid> wrote:
>
>>On 4/20/2016 9:19 PM, Robert Wessel wrote:
>>
>>> There's the obvious solution of using the power from the PoE PSE to
>>> drive an enable of some sort to the device's power supply.  Heck use
>>> that to energize a relay you've put across the mains input (some way
>>> of overriding that at the device would probably be prudent).
>>
>>If the device is NOT PoE powered, it's probably because it represents a
>>substantial load (25+W?).  I'm not sure it would be prudent to let
>>something remotely disconnect power (and possibly reapply it, moments
>>later) for large loads.
>>
>>OTOH, holding the device "in reset" (possibly indefinitely or even
>>"repeatedly") should be safe(r?)
>
>Presumably this is for cases where the device is so far gone that you
>want to hit the big-red-switch.  If you want more sophistication, you
>can put a controlling microprocessor on the device, and have that
>powered by PoE, and it could do things like force a reset, or actually
>power the device off if necessary.
Aircraft systems have an interesting parallel. Almost everything has its power disconnected via a circuit breaker in the cockpit. In ye olde days, these were actual breakers wired into the circuit, mounted on a panel (or several) in the cockpit, or a simple remote-operated breaker (usually for heavy loads). On recent aircraft, most of this is driven by the flight management system, which will pop up a little message saying it has pulled a breaker (if it happens automatically), or has a screen where you can pick a breaker to pull; the breakers themselves are often located in a more convenient physical location (presumably near the circuit they're protecting), and controlled remotely.

In the past it was not uncommon for the flight crew to attempt to cycle a breaker after a failure, but the modern policy is to just leave it alone (and powered off), and let maintenance deal with it on the ground. Obviously with exceptions where the loss of the system in question can be considered more dangerous than the possibility of a fire or other really bad result from the failing device.
On 4/21/2016 8:34 PM, Robert Wessel wrote:
> [...]
>
> Presumably this is for cases where the device is so far gone that you
> want to hit the big-red-switch.
The issue is trying to DECIDE that it's time to "pull the plug" and bring the box to its knees. It's not easy to know whether the failure we are seeing represents the failure of a particular PROCESS that happens to reside on that node at the present time, or whether the node itself is toast.

[(sigh) Trying to figure out how little I need to explain to put this all in adequate context...]

Everything is a client-(agent-)server model. Processes (servers/agents) export services to other processes (agents/clients) via an IPC mechanism. Processes can migrate, dynamically, between nodes. So, a client is never really assured of where it is executing -- nor where each of the services that it is consuming reside! If the target of an IPC is "not local" (to "this" node), then it magically becomes an RPC -- with no notification to the caller.

IPC/RPCs can be synchronous (blocking) or asynchronous (non-blocking), and can have timeouts that the RTOS enforces for the caller(s). The RTOS instance on each node is responsible for the actual IPC/RPC mechanics -- it enforces access policies, deadlines, marshalling, etc. So, two processes never talk to each other (even if they co-reside on the same node!) without at least one instance of the RTOS being involved (two instances if the target is remote).

The system is up continuously; there is never a "down time". So, a process can run on one particular node indefinitely. Or, get moved around to other nodes. Or, run to its logical completion. Or, be killed off. Etc.

I *expect* (remotely apparent) failures to manifest along these lines: Process A on node 1 issues an RPC to process B on node 2. The response comes back (whenever) and is totally wonky. Process A *suspects* process B of being compromised/corrupted/failed/dead. Some other process on some other node (or node 1) issues a request to some other process on node 2 (or, process B!) and gets a similar result.
The RTOS instances on these two sourcing nodes eventually realize there appear to be issues with node 2, or process B, or...

The RTOS instances "notice" if other processes on node 2 are becoming "suspect". If not, then perhaps the problem is local to process B and does not involve the entire node. (Can the RTOS instances communicate as expected?)

If the problem appears to be one (or more) processes (on node 2 -- or wherever), then the RTOS's try to restart the processes (some processes are fault tolerant, so restarting is effectively RESUMING). At the extreme, the RTOS can implement an effective "warm reset". Note that this problem may persist. It could be a latent bug in process B. Or, something wonky with its I/O's. Or, the region of physical memory that it is executing out of, etc.

If the problem appears to apply to ALL processes on node 2 AND the RTOS instance on that node is similarly off-line, then the node is effectively isolated -- there is nothing that can be done "from outside" to regain its proper operation short of a hardware reset.

On PD's, that happens when the PSE drops power to the node. When power is reapplied, the node executes its POST, reports its progress and goes looking for a boot image. It can either be rebooted as per normal *or* an "interactive" (in the sense that the rest of the system can interact with the Dx tool) diagnostic loaded. On nodes that are *not* PD's, I need a mechanism to implement the equivalent functionality.

If the node is found to be faulty, the processes that were running on it can be redispatched to other node(s) and the faulted node taken out of service and marked as unavailable. Of course, any physical I/O's that were present on the physical device are no longer available. And, anything that relied on those I/O's is similarly ineligible to execute. (And, anything that relied on the services provided by those things... etc.)
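The escalation ladder described above -- restart the suspect process, escalate to a warm reset when the fault looks node-wide or persistent, and fall back to a hardware reset when the node's RTOS itself is unreachable -- can be sketched roughly as follows. This is purely illustrative: the class name, the action strings, and the restart-count threshold are all my assumptions, not part of the actual RTOS.

```python
from collections import defaultdict

class FaultEscalator:
    """Illustrative sketch: aggregate suspicion reports from RTOS
    instances and decide how far up the escalation ladder to go."""

    def __init__(self, restart_limit=3):
        self.restart_limit = restart_limit
        # (node, process) -> set of distinct nodes reporting suspicion
        self.suspects = defaultdict(set)
        # (node, process) -> number of restarts already attempted
        self.restarts = defaultdict(int)

    def report(self, reporter, node, process):
        """A sourcing node reports a wonky RPC result."""
        self.suspects[(node, process)].add(reporter)

    def decide(self, node, process, rtos_reachable):
        """Pick the next action for a suspect process on a node."""
        if not rtos_reachable:
            # The node's RTOS is off-line: nothing can be done
            # "from outside" short of a hardware reset.
            return "node_reset"
        suspect_procs = {p for (n, p) in self.suspects if n == node}
        if len(suspect_procs) > 1:
            # Multiple processes on the same node look suspect:
            # treat the node, not the process, as the fault unit.
            return "warm_reset"
        if self.restarts[(node, process)] >= self.restart_limit:
            # Restarting hasn't helped; the fault may be latent in
            # the node (memory region, I/O), so escalate.
            return "warm_reset"
        self.restarts[(node, process)] += 1
        return "restart_process"
```

For example, a single suspect process gets restarted first, but a second suspect process on the same node (or an unreachable RTOS) escalates immediately.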
So, if a node was just being used for its compute resources (surplus MIPS and bytes), the user is not typically inconvenienced (assuming some other node can pick up the slack). If I know this to be the case, I can immediately bring another node online to support the processes that have died off. Then, engage more aggressive diagnostics on the "suspect" node without the user losing any capabilities.
> If you want more sophistication, you
> can put a controlling microprocessor on the device, and have that
> powered by PoE, and it could do things like force a reset, or actually
> power the device off if necessary.
I'm trying to figure out the minimal requirements to impose on a device to gain that "remote reset" capability -- and then let the device figure out how to address them.

E.g., one of my nodes is a COTS PC. It should be relatively easy to take an instance of my PD interface and connect the "power out" signal to a relay, FET, buffer, etc. that ties to the "reset button" on the PC.
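A PoE-powered supervisor of the sort described above might expose just three primitives: a momentary pulse (like tapping the PC's reset button), an indefinite hold, and a release. The sketch below is hypothetical -- `drive_reset` stands in for whatever GPIO/relay/FET driver actually asserts the line, and none of the names come from a real API.

```python
import time

class ResetSupervisor:
    """Illustrative sketch of a small PoE-powered controller that
    owns the host's reset line.  `drive_reset` is an assumed
    GPIO-write callable (True = reset asserted)."""

    def __init__(self, drive_reset):
        self.drive_reset = drive_reset
        self.held = False

    def pulse(self, width_s=0.2):
        # Momentary reset, like tapping the PC's reset button.
        self.drive_reset(True)
        time.sleep(width_s)
        self.drive_reset(False)

    def hold(self):
        # Hold the host in reset indefinitely (power stays applied).
        self.drive_reset(True)
        self.held = True

    def release(self):
        self.drive_reset(False)
        self.held = False
```

Keeping the interface this narrow means the device itself decides *how* to implement the reset (relay, FET, open-collector buffer on the reset header), while the rest of the system only ever sees pulse/hold/release.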
On Friday, April 22, 2016 at 05:53:11 UTC+2, robert...@yahoo.com wrote:
> [...]
>
> In the past it was not uncommon for the flight crew to attempt to
> cycle a breaker after a failure, but the modern policy is to just
> leave it alone (and powered off), and let maintenance deal with it on
> the ground. Obviously with exceptions where the loss of the system in
> question can be considered more dangerous than the possibility of a
> fire or other really bad result from the failing device.
https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255

-Lasse
On 4/21/2016 8:53 PM, Robert Wessel wrote:
> [...]
>
> In the past it was not uncommon for the flight crew to attempt to
> cycle a breaker after a failure, but the modern policy is to just
> leave it alone (and powered off), and let maintenance deal with it on
> the ground.
In my case, "just waiting" (i.e., for the plane to land) isn't a practical option -- the system is intended to run 24/7/365, so there's no "scheduled down time" or "end of flight" :>

As reset is, conceptually, the only time when a system's state can be "known", getting to that state seems to be the safest course of action.

What I *should* probably do is figure out how to hold PD's *in* RESET, though powered. That'll require yet another modification to the negotiation protocol. <frown>
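One way the "hold in RESET, though powered" extension might be modeled is as an extra state layered on the PSE/PD negotiation, reachable only from the powered state. This is purely illustrative -- the state and command names below are my invention, not part of standard 802.3af/at PoE detection and classification.

```python
from enum import Enum, auto

class PDState(Enum):
    DETECTED = auto()       # PD detected, not yet (re)powered
    POWERED = auto()        # normal operation
    HELD_IN_RESET = auto()  # power applied, but reset asserted

class HoldCmd(Enum):
    """Hypothetical verbs added to the PSE<->PD negotiation."""
    HOLD_RESET = auto()
    RELEASE_RESET = auto()
    POWER_CYCLE = auto()

def pd_step(state, cmd):
    """One transition of the (assumed) PD-side command handler."""
    if cmd is HoldCmd.HOLD_RESET and state is PDState.POWERED:
        return PDState.HELD_IN_RESET
    if cmd is HoldCmd.RELEASE_RESET and state is PDState.HELD_IN_RESET:
        return PDState.POWERED
    if cmd is HoldCmd.POWER_CYCLE:
        return PDState.DETECTED  # drop power; re-detect on reapply
    return state                 # ignore commands that don't apply
```

The point of keeping HELD_IN_RESET distinct from DETECTED is exactly the concern raised earlier in the thread: a held node keeps drawing (negotiated) power, so the PSE doesn't have to re-run detection/classification just to release the reset.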
On Fri, 22 Apr 2016 04:23:55 -0700 (PDT),
lasselangwadtchristensen@gmail.com wrote:

>[...]
>
>https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
I'm not sure that exactly applies. If the CB was pulled by maintenance or the pilots, the flight should never have been started in that condition (I don't think the configuration warning system is MEL-able). If it tripped because of an actual overload, well, what else would you have it do? You could make a case for lack of redundancy. And if it failed in such a way that it was open, but gave no indication, again, that doesn't really apply, except perhaps to suggest the need for additional redundancy.

In at least the first case, modern systems would likely have made it much harder to miss the pulled breaker, and might well have helped in the third case.

In any event, the configuration error was the cause of the accident, not the failure of the configuration warning system. And that actually supports *my* point -- let's say we were looking at the second case: there could actually be a fire risk that the tripped breaker is removing, vs. the pilots doing something really stupid, like taking off without flaps.
On 4/22/2016 4:47 PM, Robert Wessel wrote:
>> [...]
>>
>> https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
>
> [...]
>
> In any event, the configuration error was the cause of the accident,
> not the failure of the configuration warning system.
I think it underscores the fact that handling MULTIPLE errors is always problematic. Had the preflight check been completed (an error in itself), would the "problem" have gone unnoticed?
On Saturday, April 23, 2016 at 02:48:15 UTC+2, Don Y wrote:
> [...]
>
> I think it underscores the fact that handling MULTIPLE errors is
> always problematic. Had the preflight check been completed
> (an error in itself), would the "problem" have gone unnoticed?
From the TV documentary about the crash, it wasn't uncommon for pilots to pull the breaker on the configuration warning system because of false warnings while taxiing a bit fast. If that was why it wasn't on, we'll never know.

AFAIU, the outcome was a change to the warning system so there would be fewer false warnings, and the checklist was split into smaller sections so it wasn't so much work to start over with a section, as you are supposed to when disturbed in the middle.

-Lasse
On 4/22/2016 6:22 PM, lasselangwadtchristensen@gmail.com wrote:
> [...]
>
> From the TV documentary about the crash it wasn't uncommon for pilots
> to pull the breaker on the configuration warning system because of
> false warnings while taxiing a bit fast. If that was why it wasn't on
> we'll never know
"The National Transportation Safety Board determines that the probable cause of the accident was the flightcrew's failure to use the taxi checklist to ensure that the flaps and slats were extended for takeoff."

Error #1, which MASKS error #2 (or, which allows error #2 to be fatal):

"Contributing to the accident was the absence of electrical power to the airplane takeoff warning system which thus did not warn the flightcrew that the airplane was not configured properly for takeoff. The reason for the absence of electrical power could not be determined."
> AFAIU the outcome was a change to the warning system so there would be
> fewer false warnings, and the checklist split into smaller sections so
> it wasn't so much work to start over with a section, as you are
> supposed to when disturbed in the middle
This is the same sort of reasoning that goes into the installation of other warning devices. E.g., you would *think* that The Kitchen would be a great place to locate a smoke detector (as there are ignition sources, there). But, doing so causes too many false alarms -- which leads to folks disabling the detector.
On Fri, 22 Apr 2016 13:35:27 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/21/2016 8:53 PM, Robert Wessel wrote: >> On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel >> <robertwessel2@yahoo.com> wrote: >> >>> On Wed, 20 Apr 2016 22:10:20 -0700, Don Y >>> <blockedofcourse@foo.invalid> wrote: >>> >>>> On 4/20/2016 9:19 PM, Robert Wessel wrote: >>>> >>>>> There's the obvious solution of using the power from the PoE PSE to >>>>> drive an enable of some sort to the device's power supply. Heck use >>>>> that to energize a relay you've put across the mains input (some way >>>>> of overriding that at the device would probably be prudent). >>>> >>>> If the device is NOT PoE powered, it's probably because it represents a >>>> substantial load (25+W?). I'm not sure it would be prudent to let >>>> something remotely disconnect power (and possibly reapply it, moments >>>> later) for large loads. >>>> >>>> OTOH, holding the device "in reset" (possibly indefinitely or even >>>> "repeatedly") should be safe(r?) >>> >>> >>> Presumably this is for cases where the device is so far gone that you >>> want to hit the big-red-switch. If you want more sophistication, you >>> can put a controlling microprocessor on the device, and have that >>> powered by PoE, and it could do things like force a reset, or actually >>> power the device off if necessary. >> >> >> Aircraft systems have an interesting parallel. Almost everything have >> its power disconnected via a circuit breaking in the cockpit. In ye >> olde days, these were actually breakers wired into the circuit mounted >> on a panel (or several) in the cockpit, or a simple remote-operated >> breaker (usually for heavy loads). 
>> On recent aircraft, most of this
>> is driven by the flight management system, which will pop up a little
>> message saying it has pulled a breaker (if it happens automatically), or
>> has a screen where you can pick a breaker to pull. The breakers
>> themselves are often located in a more convenient physical location
>> (presumably near the circuit they're protecting), and they're
>> controlled remotely.
>>
>> In the past it was not uncommon for the flight crew to attempt to
>> cycle a breaker after a failure, but the modern policy is to just
>> leave it alone (and powered off) and let maintenance deal with it on
>> the ground. Obviously there are exceptions where the loss of the
>> system in question is considered more dangerous than the possibility
>> of a fire or other really bad result from the failing device.
>
>In my case, "just waiting" (i.e., for the plane to land) isn't a
>practical option -- the system is intended to run 24/7/365, so there's
>no "scheduled down time" or "end of flight" :>
Use a redundant system with at least two identical units, say A and B.
If A needs to be reset, to run a self test, or to take an application
or OS upgrade, switch control to B, perform the required maintenance
(including hardware replacement) on A, and verify that A is up and
running; then you can switch back to A.

If updates are required on both units, it is preferable to start with
the passive unit (B in this example). When B is up and running again,
try switching control to B. If B is not working properly after the
update, switch back to A and fix B before trying to switch again.

When B has been verified to be properly in charge, do the maintenance
on A, then preferably switch back to A and verify that the maintenance
on A also went OK.
>As reset is, conceptually, the only time when a system's state can
>be "known", getting to that state seems to be the safest course of
>action.
>
>What I *should* probably do is figure out how to hold PD's *in* RESET,
>though powered. That'll require yet another modification to the
>negotiation protocol. <frown>
Of course, when building a redundant system, the redundancy should be
designed in from the beginning -- not bolted on afterwards if/when the
reliability of a non-redundant system turns out to be too poor.
On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>> practical option -- the system is intended to run 24/7/365 so there's
>> no "scheduled down time" or "end of flight" :>
>
> Use a redundant system with at least two identical units, say A and B.
> If A needs to be reset, to run a self test, or to take an application
> or OS upgrade, switch control to B, perform the required maintenance
> (including hardware replacement) on A, and verify that A is up and
> running; then you can switch back to A.
The problem comes with I/O's. Not only do you have to "duplicate" the
field interface -- but, you also have to provide a RELIABLE means to
switch between any actuators driven by those two "duplicates". I.e.,
if A has an output set ON and B thinks it *really* should be OFF, how
is the controlled device to know who to listen to?

I consider physical I/O replication to be too troublesome. So, I only
provide redundancy on "virtual" entities (processes, state, etc.). In
that way, all I need are spare CPU's...
> If updates are required on both units, it is preferable to start with
> the passive unit (B in this example). When B is up and running again,
> try switching control to B. If B is not working properly after the
> update, switch back to A and fix B before trying to switch again.
>
> When B has been verified to be properly in charge, do the maintenance
> on A, then preferably switch back to A and verify that the maintenance
> on A also went OK.
>
>> As reset is, conceptually, the only time when a system's state can
>> be "known", getting to that state seems to be the safest course of
>> action.
>>
>> What I *should* probably do is figure out how to hold PD's *in* RESET,
>> though powered. That'll require yet another modification to the
>> negotiation protocol. <frown>
>
> Of course, when building a redundant system, the redundancy should be
> designed in from the beginning -- not bolted on afterwards if/when the
> reliability of a non-redundant system turns out to be too poor.