EmbeddedRelated.com
Forums

Remote "watchdog"s

Started by Don Y April 20, 2016
On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel
<robertwessel2@yahoo.com> wrote:

>On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
><blockedofcourse@foo.invalid> wrote:
>
>>On 4/20/2016 9:19 PM, Robert Wessel wrote:
>>
>>> There's the obvious solution of using the power from the PoE PSE to
>>> drive an enable of some sort to the device's power supply.  Heck use
>>> that to energize a relay you've put across the mains input (some way
>>> of overriding that at the device would probably be prudent).
>>
>>If the device is NOT PoE powered, it's probably because it represents a
>>substantial load (25+W?).  I'm not sure it would be prudent to let
>>something remotely disconnect power (and possibly reapply it, moments
>>later) for large loads.
>>
>>OTOH, holding the device "in reset" (possibly indefinitely or even
>>"repeatedly") should be safe(r?)
>
>Presumably this is for cases where the device is so far gone that you
>want to hit the big-red-switch.  If you want more sophistication, you
>can put a controlling microprocessor on the device, and have that
>powered by PoE, and it could do things like force a reset, or actually
>power the device off if necessary.
Aircraft systems have an interesting parallel. Almost everything has its power disconnected via a circuit breaker in the cockpit. In ye olde days, these were actual breakers wired into the circuit, mounted on a panel (or several) in the cockpit, or a simple remote-operated breaker (usually for heavy loads). On recent aircraft, most of this is driven by the flight management system, which will pop up a little message saying it has pulled a breaker (if it happens automatically), or has a screen where you can pick a breaker to pull; the breakers themselves are often located in a more convenient physical location (presumably near the circuit they're protecting), and controlled remotely.

In the past it was not uncommon for the flight crew to attempt to cycle a breaker after a failure, but the modern policy is to just leave it alone (and powered off), and let maintenance deal with it on the ground. Obviously with exceptions where the loss of the system in question can be considered more dangerous than the possibility of a fire or other really bad result from the failing device.
On 4/21/2016 8:34 PM, Robert Wessel wrote:
> [...]
>
> Presumably this is for cases where the device is so far gone that you
> want to hit the big-red-switch.
The issue is trying to DECIDE that it's time to "pull the plug" and bring the box to its knees. It's not easy to know whether the failure we are seeing represents the failure of a particular PROCESS that happens to reside on that node at the present time, or whether the node itself is toast.

[(sigh) Trying to figure out how little I need to explain to put this all in adequate context...]

Everything is a client-(agent-)server model. Processes (servers/agents) export services to other processes (agents/clients) via an IPC mechanism. Processes can migrate, dynamically, between nodes. So, a client is never really assured of where it is executing -- nor where each of the services that it is consuming reside! If the target of an IPC is "not local" (to "this" node), then it magically becomes an RPC -- with no notification to the caller.

IPC/RPCs can be synchronous (blocking) or asynchronous (non-blocking), and can have timeouts that the RTOS enforces for the caller(s). The RTOS instance on each node is responsible for the actual IPC/RPC mechanics -- it enforces access policies, deadlines, marshalling, etc. So, two processes never talk to each other (even if they co-reside on the same node!) without at least one instance of the RTOS being involved (two instances if the target is remote).

The system is up continuously; there is never a "down time". So, a process can run on one particular node indefinitely. Or, get moved around to other nodes. Or, run to its logical completion. Or, be killed off. Etc.

I *expect* (remotely apparent) failures to manifest along these lines: Process A on node 1 issues an RPC to process B on node 2. The response comes back (whenever) and is totally wonky. Process A *suspects* process B of being compromised/corrupted/failed/dead. Some other process on some other node (or node 1) issues a request to some other process on node 2 (or, process B!) and gets a similar result.
The RTOS instances on these two sourcing nodes eventually realize there appear to be issues with node 2, or process B, or...

The RTOS instances "notice" if other processes on node 2 are becoming "suspect". If not, then perhaps the problem is local to process B and does not involve the entire node. (Can the RTOS instances communicate as expected?)

If the problem appears to be one (or more) processes (on node 2 -- or wherever), then the RTOS's try to restart the processes (some processes are fault tolerant, so restarting is effectively RESUMING). At the extreme, the RTOS can implement an effective "warm reset". Note that this problem may persist. It could be a latent bug in process B. Or, something wonky with its I/O's. Or, the region of physical memory that it is executing out of, etc.

If the problem appears to apply to ALL processes on node 2 AND the RTOS instance on that node is similarly off-line, then the node is effectively isolated -- there is nothing that can be done "from outside" to regain its proper operation short of a hardware reset.

On PD's, that happens when the PSE drops power to the node. When power is reapplied, the node executes its POST, reports its progress and goes looking for a boot image. It can either be rebooted as per normal *or* an "interactive" (in the sense that the rest of the system can interact with the Dx tool) diagnostic loaded. On nodes that are *not* PD's, I need a mechanism to implement the equivalent functionality.

If the node is found to be faulty, the processes that were running on it can be redispatched to other node(s) and the faulted node taken out of service and marked as unavailable. Of course, any physical I/O's that were present on the physical device are no longer available. And, anything that relied on those I/O's is similarly ineligible to execute. (And, anything that relied on the services provided by those things... etc.)
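The escalation ladder described above -- restart the suspect process, escalate to a warm reset when the fault looks node-wide or persistent, and fall back to a hardware reset when the node's RTOS itself is unreachable -- can be sketched roughly as follows. This is purely illustrative: the class name, the action strings, and the restart-count threshold are all my assumptions, not part of the actual RTOS.

```python
from collections import defaultdict

class FaultEscalator:
    """Illustrative sketch: aggregate suspicion reports from RTOS
    instances and decide how far up the escalation ladder to go."""

    def __init__(self, restart_limit=3):
        self.restart_limit = restart_limit
        # (node, process) -> set of distinct nodes reporting suspicion
        self.suspects = defaultdict(set)
        # (node, process) -> number of restarts already attempted
        self.restarts = defaultdict(int)

    def report(self, reporter, node, process):
        """A sourcing node reports a wonky RPC result."""
        self.suspects[(node, process)].add(reporter)

    def decide(self, node, process, rtos_reachable):
        """Pick the next action for a suspect process on a node."""
        if not rtos_reachable:
            # The node's RTOS is off-line: nothing can be done
            # "from outside" short of a hardware reset.
            return "node_reset"
        suspect_procs = {p for (n, p) in self.suspects if n == node}
        if len(suspect_procs) > 1:
            # Multiple processes on the same node look suspect:
            # treat the node, not the process, as the fault unit.
            return "warm_reset"
        if self.restarts[(node, process)] >= self.restart_limit:
            # Restarting hasn't helped; the fault may be latent in
            # the node (memory region, I/O), so escalate.
            return "warm_reset"
        self.restarts[(node, process)] += 1
        return "restart_process"
```

For example, a single suspect process gets restarted first, but a second suspect process on the same node (or an unreachable RTOS) escalates immediately.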
So, if a node was just being used for its compute resources (surplus MIPS and bytes), the user is not typically inconvenienced (assuming some other node can pick up the slack). If I know this to be the case, I can immediately bring another node online to support the processes that have died off. Then, engage more aggressive diagnostics on the "suspect" node without the user losing any capabilities.
> If you want more sophistication, you
> can put a controlling microprocessor on the device, and have that
> powered by PoE, and it could do things like force a reset, or actually
> power the device off if necessary.
I'm trying to figure out the minimal requirements to impose on a device to gain that "remote reset" capability -- and then let the device figure out how to address them.

E.g., one of my nodes is a COTS PC. It should be relatively easy to take an instance of my PD interface and connect the "power out" signal to a relay, FET, buffer, etc. that ties to the "reset button" on the PC.
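A PoE-powered supervisor of the sort described above might expose just three primitives: a momentary pulse (like tapping the PC's reset button), an indefinite hold, and a release. The sketch below is hypothetical -- `drive_reset` stands in for whatever GPIO/relay/FET driver actually asserts the line, and none of the names come from a real API.

```python
import time

class ResetSupervisor:
    """Illustrative sketch of a small PoE-powered controller that
    owns the host's reset line.  `drive_reset` is an assumed
    GPIO-write callable (True = reset asserted)."""

    def __init__(self, drive_reset):
        self.drive_reset = drive_reset
        self.held = False

    def pulse(self, width_s=0.2):
        # Momentary reset, like tapping the PC's reset button.
        self.drive_reset(True)
        time.sleep(width_s)
        self.drive_reset(False)

    def hold(self):
        # Hold the host in reset indefinitely (power stays applied).
        self.drive_reset(True)
        self.held = True

    def release(self):
        self.drive_reset(False)
        self.held = False
```

Keeping the interface this narrow means the device itself decides *how* to implement the reset (relay, FET, open-collector buffer on the reset header), while the rest of the system only ever sees pulse/hold/release.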
On Friday, April 22, 2016 at 05:53:11 UTC+2, robert...@yahoo.com wrote:
> [...]
>
> In the past it was not uncommon for the flight crew to attempt to
> cycle a breaker after a failure, but the modern policy is to just
> leave it alone (and powered off), and let maintenance deal with it on
> the ground. Obviously with exceptions where the loss of the system in
> question can be considered more dangerous than the possibility of a
> fire or other really bad result from the failing device.
https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255

-Lasse
On 4/21/2016 8:53 PM, Robert Wessel wrote:
> [...]
>
> In the past it was not uncommon for the flight crew to attempt to
> cycle a breaker after a failure, but the modern policy is to just
> leave it alone (and powered off), and let maintenance deal with it on
> the ground.
In my case, "just waiting" (i.e., for the plane to land) isn't a practical option -- the system is intended to run 24/7/365, so there's no "scheduled down time" or "end of flight" :>

As reset is, conceptually, the only time when a system's state can be "known", getting to that state seems to be the safest course of action.

What I *should* probably do is figure out how to hold PD's *in* RESET, though powered. That'll require yet another modification to the negotiation protocol. <frown>
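One way the "hold in RESET, though powered" extension might be modeled is as an extra state layered on the PSE/PD negotiation, reachable only from the powered state. This is purely illustrative -- the state and command names below are my invention, not part of standard 802.3af/at PoE detection and classification.

```python
from enum import Enum, auto

class PDState(Enum):
    DETECTED = auto()       # PD detected, not yet (re)powered
    POWERED = auto()        # normal operation
    HELD_IN_RESET = auto()  # power applied, but reset asserted

class HoldCmd(Enum):
    """Hypothetical verbs added to the PSE<->PD negotiation."""
    HOLD_RESET = auto()
    RELEASE_RESET = auto()
    POWER_CYCLE = auto()

def pd_step(state, cmd):
    """One transition of the (assumed) PD-side command handler."""
    if cmd is HoldCmd.HOLD_RESET and state is PDState.POWERED:
        return PDState.HELD_IN_RESET
    if cmd is HoldCmd.RELEASE_RESET and state is PDState.HELD_IN_RESET:
        return PDState.POWERED
    if cmd is HoldCmd.POWER_CYCLE:
        return PDState.DETECTED  # drop power; re-detect on reapply
    return state                 # ignore commands that don't apply
```

The point of keeping HELD_IN_RESET distinct from DETECTED is exactly the concern raised earlier in the thread: a held node keeps drawing (negotiated) power, so the PSE doesn't have to re-run detection/classification just to release the reset.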
On Fri, 22 Apr 2016 04:23:55 -0700 (PDT),
lasselangwadtchristensen@gmail.com wrote:

>[...]
>
>https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
I'm not sure that exactly applies. If the CB was pulled by maintenance or the pilots, the flight should never have been started in that condition (I don't think the configuration warning system is MEL-able). If it tripped because of an actual overload, well, what else would you have it do? You could make a case for lack of redundancy. And if it failed in such a way that it was open, but gave no indication, again, that doesn't really apply, except perhaps to suggest the need for additional redundancy.

In at least the first case, modern systems would likely have made it much harder to miss the pulled breaker, and might well have helped in the third case.

In any event, the configuration error was the cause of the accident, not the failure of the configuration warning system. And that actually supports *my* point -- let's say we were looking at the second case: there could actually be a fire risk that the tripped breaker is removing, vs. the pilots doing something really stupid, like taking off without flaps.
On 4/22/2016 4:47 PM, Robert Wessel wrote:
>> [...]
>>
>> https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
>
> [...]
>
> In any event, the configuration error was the cause of the accident,
> not the failure of the configuration warning system.
I think it underscores the fact that handling MULTIPLE errors is always problematic. Had the preflight check been completed (an error in itself), would the "problem" have gone unnoticed?
On Saturday, April 23, 2016 at 02:48:15 UTC+2, Don Y wrote:
> [...]
>
> I think it underscores the fact that handling MULTIPLE errors is
> always problematic. Had the preflight check been completed
> (an error in itself), would the "problem" have gone unnoticed?
From the TV documentary about the crash, it wasn't uncommon for pilots to pull the breaker on the configuration warning system because of false warnings while taxiing a bit fast. If that was why it wasn't on, we'll never know.

AFAIU, the outcome was a change to the warning system so there would be fewer false warnings, and the checklist was split into smaller sections so it wasn't so much work to start over with a section, as you are supposed to when disturbed in the middle.

-Lasse
On 4/22/2016 6:22 PM, lasselangwadtchristensen@gmail.com wrote:
> [...]
>
> From the TV documentary about the crash it wasn't uncommon for pilots
> to pull the breaker on the configuration warning system because of
> false warnings while taxiing a bit fast. If that was why it wasn't on
> we'll never know
"The National Transportation Safety Board determines that the probable cause of the accident was the flightcrew's failure to use the taxi checklist to ensure that the flaps and slats were extended for takeoff."

Error #1, which MASKS error #2 (or, which allows error #2 to be fatal):

"Contributing to the accident was the absence of electrical power to the airplane takeoff warning system which thus did not warn the flightcrew that the airplane was not configured properly for takeoff. The reason for the absence of electrical power could not be determined."
> AFAIU the outcome was a change to the warning system so there would be
> fewer false warnings, and the checklist split into smaller sections so
> it wasn't so much work to start over with a section, as you are
> supposed to when disturbed in the middle
This is the same sort of reasoning that goes into the installation of other warning devices. E.g., you would *think* that The Kitchen would be a great place to locate a smoke detector (as there are ignition sources, there). But, doing so causes too many false alarms -- which leads to folks disabling the detector.
On Fri, 22 Apr 2016 13:35:27 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/21/2016 8:53 PM, Robert Wessel wrote: >> On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel >> <robertwessel2@yahoo.com> wrote: >> >>> On Wed, 20 Apr 2016 22:10:20 -0700, Don Y >>> <blockedofcourse@foo.invalid> wrote: >>> >>>> On 4/20/2016 9:19 PM, Robert Wessel wrote: >>>> >>>>> There's the obvious solution of using the power from the PoE PSE to >>>>> drive an enable of some sort to the device's power supply. Heck use >>>>> that to energize a relay you've put across the mains input (some way >>>>> of overriding that at the device would probably be prudent). >>>> >>>> If the device is NOT PoE powered, it's probably because it represents a >>>> substantial load (25+W?). I'm not sure it would be prudent to let >>>> something remotely disconnect power (and possibly reapply it, moments >>>> later) for large loads. >>>> >>>> OTOH, holding the device "in reset" (possibly indefinitely or even >>>> "repeatedly") should be safe(r?) >>> >>> >>> Presumably this is for cases where the device is so far gone that you >>> want to hit the big-red-switch. If you want more sophistication, you >>> can put a controlling microprocessor on the device, and have that >>> powered by PoE, and it could do things like force a reset, or actually >>> power the device off if necessary. >> >> >> Aircraft systems have an interesting parallel. Almost everything have >> its power disconnected via a circuit breaking in the cockpit. In ye >> olde days, these were actually breakers wired into the circuit mounted >> on a panel (or several) in the cockpit, or a simple remote-operated >> breaker (usually for heavy loads). 
>> On recent aircraft, most of this
>> is driven by the flight management system, which will pop up a little
>> message saying it has pulled a breaker (if it happens automatically), or
>> has a screen where you can pick a breaker to pull. The breakers
>> themselves are often located in a more convenient physical location
>> (presumably near the circuit they're protecting), and they're
>> controlled remotely.
>>
>> In the past it was not uncommon for the flight crew to attempt to
>> cycle a breaker after a failure, but the modern policy is to just
>> leave it alone (and powered off) and let maintenance deal with it on
>> the ground. Obviously there are exceptions where the loss of the
>> system in question is considered more dangerous than the possibility
>> of a fire or other really bad result from the failing device.
>
>In my case, "just waiting" (i.e., for the plane to land) isn't a
>practical option -- the system is intended to run 24/7/365, so there's
>no "scheduled down time" or "end of flight" :>
Use a redundant system with at least two identical units, say A and B.
If A needs to be reset, to run a self test, or to take an application
or OS upgrade, switch control to B, perform the required maintenance
(including hardware replacement) on A, and verify that A is up and
running; then you can switch back to A.

If updates are required on both units, it is preferable to start with
the passive unit (B in this example). When B is up and running again,
try switching control to B. If B is not working properly after the
update, switch back to A and fix B before trying to switch again.

When B has been verified to be properly in charge, do the maintenance
on A, then preferably switch back to A and verify that the maintenance
on A also went OK.
>As reset is, conceptually, the only time when a system's state can
>be "known", getting to that state seems to be the safest course of
>action.
>
>What I *should* probably do is figure out how to hold PD's *in* RESET,
>though powered. That'll require yet another modification to the
>negotiation protocol. <frown>
Of course, when building a redundant system, the redundancy should be
designed in from the beginning -- not bolted on afterwards if/when the
reliability of a non-redundant system turns out to be too poor.
On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>> practical option -- the system is intended to run 24/7/365 so there's
>> no "scheduled down time" or "end of flight" :>
>
> Use a redundant system with at least two identical units, say A and B.
> If A needs to be reset, to run a self test, or to take an application
> or OS upgrade, switch control to B, perform the required maintenance
> (including hardware replacement) on A, and verify that A is up and
> running; then you can switch back to A.
The problem comes with I/O's. Not only do you have to "duplicate" the
field interface -- but, you also have to provide a RELIABLE means to
switch between any actuators driven by those two "duplicates". I.e.,
if A has an output set ON and B thinks it *really* should be OFF, how
is the controlled device to know who to listen to?

I consider physical I/O replication to be too troublesome. So, I only
provide redundancy on "virtual" entities (processes, state, etc.). In
that way, all I need are spare CPU's...
> If updates are required on both units, it is preferable to start with
> the passive unit (B in this example). When B is up and running again,
> try switching control to B. If B is not working properly after the
> update, switch back to A and fix B before trying to switch again.
>
> When B has been verified to be properly in charge, do the maintenance
> on A, then preferably switch back to A and verify that the maintenance
> on A also went OK.
>
>> As reset is, conceptually, the only time when a system's state can
>> be "known", getting to that state seems to be the safest course of
>> action.
>>
>> What I *should* probably do is figure out how to hold PD's *in* RESET,
>> though powered. That'll require yet another modification to the
>> negotiation protocol. <frown>
>
> Of course, when building a redundant system, the redundancy should be
> designed in from the beginning -- not bolted on afterwards if/when the
> reliability of a non-redundant system turns out to be too poor.