
Remote "watchdog"s

Started by Don Y April 20, 2016
I have a distributed system.  As such, it is possible for parts of it
to crash, become unresponsive, suffer hardware failures, etc.

The "system" as a whole needs a way of coping with these.

In the normal course of operation, traffic between nodes gives some
reassurance that the nodes involved are "sane".
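That "traffic as reassurance" idea can be sketched as a simple per-node liveness monitor: any observed message refreshes a timestamp, and prolonged silence marks the node suspect. This is only an illustrative sketch; the names and the 5-second threshold are invented, not from any real protocol here.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEARTBEAT_TIMEOUT_MS 5000u  /* silence longer than this is suspect */

typedef struct {
    uint32_t last_seen_ms;  /* timestamp of last traffic from the node */
    bool     suspect;       /* latched once the node has gone quiet */
} node_health;

/* Call whenever any traffic arrives from the node. */
void node_saw_traffic(node_health *n, uint32_t now_ms)
{
    n->last_seen_ms = now_ms;
    n->suspect = false;
}

/* Call periodically; returns true if the node has gone quiet. */
bool node_check(node_health *n, uint32_t now_ms)
{
    if (now_ms - n->last_seen_ms > HEARTBEAT_TIMEOUT_MS)
        n->suspect = true;
    return n->suspect;
}
```

A suspect node would then be a candidate for the recovery protocols described below, rather than being trusted to vouch for itself.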

OTOH, if a node starts acting wonky, I can invoke recovery protocols
that (in an ideal world) will reset/restart the portions of the
node that appear to be misbehaving (e.g., if traffic to/from one
particular task is "odd", then perhaps that task has gone south;
but, the remainder of the node may be intact.  Or, some I/O that
the wacky task handles is misbehaving and the task is doing its
best to make sense of a broken environment).

If the node itself gets hosed, then a hardware watchdog on that
node *might* bring it back to nominal operating condition.

But, I still need a reliable way of making a (remote) node "safe",
"secure", etc.

I presently deliver PoE to each node and many of them live off that.
So, as a last ditch effort, I can command the switch to remove that
and unceremoniously bring the node down.

Except for the nodes that *aren't* dependent on that!

Rather than concoct some other mechanism that might work (less effectively),
perhaps it is easiest to codify this behavior in the hardware interface spec?
I.e., even if you don't RELY on PoE, you must still *honor* it AS IF you did!

In the simplest case, that means treating !PoE as a RESET state.  And,
ensuring that the hardware will remain safe/secure in perpetual reset
(i.e., not expecting RESET to be a *transient* state).
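The "!PoE == RESET" rule amounts to a small policy in the node's power/reset logic: loss of PoE forces (and holds) reset regardless of how the node is actually powered. A minimal sketch, with invented state names; on real hardware this would gate the board's reset line, which is omitted here:

```c
#include <stdbool.h>

typedef enum { NODE_RESET, NODE_RUNNING } node_state;

/* Honor PoE as if you relied on it: absence of PoE is an indefinitely
 * held RESET, even for a mains-powered node.  Reset must be a safe
 * state to occupy forever, not just a transient. */
node_state poe_reset_policy(node_state s, bool poe_present)
{
    (void)s;  /* policy is memoryless: PoE presence alone decides */
    if (!poe_present)
        return NODE_RESET;
    return NODE_RUNNING;
}
```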

Of course, the downside is that I now must be able to deliver something
akin to a PoE signal on every port (though I could possibly skimp on
the load carrying capability for many ports?)

Any holes in this?  Or, better alternatives?
On Wed, 20 Apr 2016 12:57:06 -0700, Don Y wrote:

> I have a distributed system.  As such, it is possible for parts of it
> to crash, become unresponsive, suffer hardware failures, etc.
>
> The "system" as a whole needs a way of coping with these.
>
> [...]
>
> OTOH, if a node starts acting wonky, I can invoke recovery protocols
> that (in an ideal world) will reset/restart the portions of the node
> that appear to be misbehaving (e.g., if traffic to/from one particular
> task is "odd", then perhaps that task has gone south; but, the
> remainder of the node may be intact.  Or, some I/O that the wacky task
> handles is misbehaving and the task is doing its best to make sense of
> a broken environment).
I would be more inclined to insist that each node has good
built-in-test, and that it do its own fault recovery -- i.e., to the
extent possible it should recognize its own problems, and at worst it
should respond to requests for information with "I'm sorry, I'm sick
right now and can't come to work."

Then if it goes _totally_ wonky, cut it off at the knees.
> [...]
>
> Rather than concoct some other mechanism that might work (less
> effectively), perhaps it is easiest to codify this behavior in the
> hardware interface spec?  I.e., even if you don't RELY on PoE, you
> must still *honor* it AS IF you did!
>
> In the simplest case, that means treating !PoE as a RESET state.  And,
> ensuring that the hardware will remain safe/secure in perpetual reset
> (i.e., not expecting RESET to be a *transient* state).
>
> Of course, the downside is that I now must be able to deliver
> something akin to a PoE signal on every port (though I could possibly
> skimp on the load carrying capability for many ports?)
>
> Any holes in this?  Or, better alternatives?
Sounds good.

For the pseudo-PoE idea, if you're worried about babbling nodes
consuming all the bandwidth, you could insist that the Ethernet PHY
layer be powered by the PoE, even if nothing else is.  Or insist on a
set of analog switches that effectively disconnect Ethernet at some
point.

And -- keep in mind that you can't protect from EVERY fault.  You have
to make a risk/reward tradeoff.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 4/20/2016 2:17 PM, Tim Wescott wrote:
> I would be more inclined to insist that each node has good
> built-in-test, and that it do its own fault recovery -- i.e., to the
> extent possible it should recognize its own problems, and at worst it
> should respond to requests for information with "I'm sorry, I'm sick
> right now and can't come to work."
Nodes are up "forever".  So, the POST only happens occasionally.
I have some run-time diagnostics (e.g., testing RAM as I scrub it;
sanity tests on I/O values, etc.).  But, if the node is already
"compromised" in some way, I can't rely on *it* being able to
adequately vouch for itself.

OTOH, if a *process* goes wonky, I can shut it down (the assumption
being that the rest of the node is intact).  But, if the node doesn't
(or can't!) listen, then I can't "tell" it to do anything!
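The "testing RAM as I scrub it" idea can be sketched as a word-at-a-time march: save the live contents, write and verify alternating patterns, then restore. This is a simplified illustration only; a real implementation would mask interrupts around each word and worry about caches, write buffers, and coupling faults between words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a region of RAM, testing each word in place and restoring its
 * live contents afterward.  Returns false on the first stuck bit. */
bool scrub_and_test(volatile uint32_t *mem, size_t nwords)
{
    static const uint32_t patterns[] = { 0x55555555u, 0xAAAAAAAAu };

    for (size_t i = 0; i < nwords; i++) {
        uint32_t saved = mem[i];            /* preserve live contents */
        for (size_t p = 0; p < 2; p++) {
            mem[i] = patterns[p];
            if (mem[i] != patterns[p]) {    /* readback mismatch */
                mem[i] = saved;
                return false;
            }
        }
        mem[i] = saved;                     /* restore ("scrub") */
    }
    return true;
}
```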
> Then if it goes _totally_ wonky, cut it off at the knees.
By pulling the plug (?)
> Sounds good.  For the pseudo-PoE idea, if you're worried about
> babbling nodes consuming all the bandwidth, you could insist that the
> Ethernet PHY layer be powered by the PoE, even if nothing else is.  Or
> insist on a set of analog switches that effectively disconnect
> Ethernet at some point.
A process control system I worked on many years ago had a locally
controlled relay to allow a node to isolate itself from the network.
And, fuses in-line.  If the network got compromised (a node chattering
endlessly), "normal" nodes would detect this and take themselves
offline.  The network controller (Master-Slave) would then apply a
high voltage to the network, blowing the fuses of any node that had
"lost its marbles".  After a prescribed delay, the "sane" nodes would
reconnect to the network.

But, here, each node is on its own segment.  So, I can simply choose
to block all traffic coming into the switch from that segment.  (A
node can't decide who it wants to talk to; rather, the switch has to
be configured to explicitly pass *desired* traffic to/from the other
nodes as the fabric configuration is dynamically updated.)

[The point was to ensure that an adversary having access to a node or
its drop can't flood the switch with bogus (or even forged) traffic
and compromise the operation of other nodes]
> And -- keep in mind that you can't protect from EVERY fault. You have to > make a risk/reward tradeoff.
Yes.  If it were a monolithic design, a single hardware "reset"
circuit would ensure EVERYTHING was held in a safe/secure state (by
explicitly choosing reset conditions to be such).  The distributed
nature prevents me from "distributing" that reset signal.

[Ages ago, I'd accomplished this with serially (EIA-232) connected
nodes by driving TxD to a LONG SPACE condition... then letting
hardware detect this (a one-shot) and drive the "local" RESET
accordingly.  Not quite as easy to do/detect with Ethernet interfaces
(without relying on the interface itself being operational) -- for an
EIA-232 link, you could just look at the incoming RxD signal coming
out of the level translator]
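The one-shot in that EIA-232 scheme can be mimicked in software: sample the raw RxD level at a fixed rate and trip once it has sat in the SPACE state longer than any legal character could. A sketch under invented assumptions (the 20-sample threshold and the sampling scheme are illustrative, not from the original hardware):

```c
#include <stdbool.h>
#include <stdint.h>

#define BREAK_SAMPLES 20  /* e.g. 20 consecutive 1 ms samples of SPACE */

typedef struct {
    uint16_t space_count;  /* consecutive SPACE samples seen so far */
} break_detector;

/* Feed one line sample (true == SPACE); returns true once a LONG SPACE
 * (break / remote-reset request) has been detected. */
bool break_sample(break_detector *d, bool space)
{
    if (space) {
        if (d->space_count < BREAK_SAMPLES)
            d->space_count++;
    } else {
        d->space_count = 0;  /* any MARK restarts the "one-shot" */
    }
    return d->space_count >= BREAK_SAMPLES;
}
```

Normal traffic always returns to MARK between characters, so only a deliberately held SPACE accumulates enough samples to trip the detector.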
On Wed, 20 Apr 2016 15:30:12 -0700, Don Y wrote:

> Nodes are up "forever".  So, the POST only happens occasionally.
> I have some run-time diagnostics (e.g., testing RAM as I scrub it;
> sanity tests on I/O values, etc.).  But, if the node is already
> "compromised" in some way, I can't rely on *it* being able to
> adequately vouch for itself.
POST is only part of BIT. When I did pseudo-military stuff, BIT meant continuous built-in test, that would pop a fault condition if a sensor went out of range or if some combination of sensors were out of bounds. FIT (Fault Isolation Test) was something you commanded, and took that part of the system off line. POST was FIT on steroids.
> OTOH, if a *process* goes wonky, I can shut it down (the assumption
> being that the rest of the node is intact).  But, if the node doesn't
> (or can't!) listen, then I can't "tell" it to do anything!
I'm not sure if by process you mean one thread of execution within the node's one processor, or if you mean that the node has multiple processors. At any rate, you should consider having local BIT that can at least tell if a process has gone _really_ wonky and do something.
>> Then if it goes _totally_ wonky, cut it off at the knees.
>
> By pulling the plug (?)
Yup. It's cheaper than detonators on each node, and less alarming to the customer.
-- www.wescottdesign.com
On 4/20/2016 5:27 PM, Tim Wescott wrote:
> POST is only part of BIT.  When I did pseudo-military stuff, BIT
> meant continuous built-in test, that would pop a fault condition if a
> sensor went out of range or if some combination of sensors were out
> of bounds.
That's just common-sense validation of inputs. I.e., if I command a motor to "move right" and sense the LEFT limit switch engaging, something is very wrong. Or, if a temperature sensor indicates 25.3C on one sample and 42.5C on the next sample -- when the thermal mass involved would otherwise prohibit such a change.
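That thermal-mass argument is just a rate-of-change bound on successive samples. A sketch, where the 2.0 C-per-sample limit is an invented figure for illustration (the real bound would come from the plant's physics):

```c
#include <stdbool.h>

#define MAX_DELTA_C 2.0  /* largest believable change between samples */

/* Reject a new temperature reading if it jumps further from the
 * previous one than the thermal mass could plausibly allow. */
bool temp_plausible(double prev_c, double next_c)
{
    double delta = next_c - prev_c;
    if (delta < 0.0)
        delta = -delta;
    return delta <= MAX_DELTA_C;
}
```

The 25.3 C -> 42.5 C jump from the text would fail this check, flagging either the sensor or its handler as suspect.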
> FIT (Fault Isolation Test) was something you commanded, and took that
> part of the system off line.  POST was FIT on steroids.
I can invoke POST by cycling power to the node and letting it go through its normal startup checks. But, that takes the whole node off-line for the duration. Instead, I "poke at" various aspects of the system while it is running to ensure that "as components" they appear to be operational.
>> OTOH, if a *process* goes wonky, I can shut it down (the assumption
>> being that the rest of the node is intact).  But, if the node doesn't
>> (or can't!) listen, then I can't "tell" it to do anything!
>
> I'm not sure if by process you mean one thread of execution within
> the node's one processor, or if you mean that the node has multiple
> processors.
"Process" being a container for threads and other resources.  "Job"
being a collection of processes.  So, my statement is intended to mean
"some planned activity is not behaving in a 'nominal' manner".

E.g., if a process is supposed to be reporting power consumption at 10
second intervals and the numbers suddenly make no sense (e.g., -37,
+JKLHKL, etc.), or the update interval lengthens or quickens, or it
stops responding to commands, or synchronous RPC's to the process
"hang indefinitely"...

These are things that a remote process/actor can discern from
observations FROM that other (remote) process.  They leak information
regarding the remote process's likely functionality (or lack thereof).
If you could observe ALL of the processes on a node, remotely, you
might be able to infer that there is some systemic failure on that
node.
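The power-reporting example combines two observable checks: is the value sane, and did it arrive on cadence? A sketch of such a remote monitor; the 10 s period matches the text, but the jitter allowance and power range are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define REPORT_PERIOD_MS 10000u  /* expected cadence, per the text */
#define PERIOD_SLOP_MS    2000u  /* tolerated jitter either way (invented) */
#define POWER_MIN_W          0.0
#define POWER_MAX_W      10000.0

typedef struct {
    uint32_t last_report_ms;  /* arrival time of the previous report */
} report_monitor;

/* Judge one report: the value must be in range AND it must have
 * arrived close to the expected interval. */
bool report_nominal(report_monitor *m, uint32_t now_ms, double watts)
{
    uint32_t gap = now_ms - m->last_report_ms;
    m->last_report_ms = now_ms;

    if (watts < POWER_MIN_W || watts > POWER_MAX_W)
        return false;  /* nonsense value, e.g. -37 */
    if (gap + PERIOD_SLOP_MS < REPORT_PERIOD_MS ||
        gap > REPORT_PERIOD_MS + PERIOD_SLOP_MS)
        return false;  /* cadence lengthened or quickened */
    return true;
}
```

A run of non-nominal reports from one process is grounds to restart it; non-nominal reports from *many* processes on a node hint at a systemic failure there.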
> At any rate, you should consider having local BIT that can at least
> tell if a process has gone _really_ wonky and do something.
The OS knows if a process isn't interacting with it properly.  The
process can't interact with anything else in the system (local or
remote) without the OS acting as intermediary.  But, the OS doesn't
want to *know* how the process should behave.  That information is
encoded in the interactions the process has with other actors (local
and remote) in the system.

Obviously, if a node "goes silent", it is highly suspect.  If this
happens to an isolated process, then (hopefully) SOMETHING on the node
is still sane and can help recover/restore that capability.

What I need is an "if all else fails" fallback.  So, I don't have to
worry that some I/O is being (erroneously) commanded to do something
that it shouldn't, etc.
On Wed, 20 Apr 2016 12:57:06 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

> I presently deliver PoE to each node and many of them live off that.
> So, as a last ditch effort, I can command the switch to remove that
> and unceremoniously bring the node down.
>
> Except for the nodes that *aren't* dependent on that!
>
> [...]
>
> Any holes in this?  Or, better alternatives?
There's the obvious solution of using the power from the PoE PSE to drive an enable of some sort to the device's power supply. Heck use that to energize a relay you've put across the mains input (some way of overriding that at the device would probably be prudent).
On 4/20/2016 9:19 PM, Robert Wessel wrote:

> There's the obvious solution of using the power from the PoE PSE to
> drive an enable of some sort to the device's power supply.  Heck, use
> that to energize a relay you've put across the mains input (some way
> of overriding that at the device would probably be prudent).
If the device is NOT PoE powered, it's probably because it represents
a substantial load (25+W?).  I'm not sure it would be prudent to let
something remotely disconnect power (and possibly reapply it, moments
later) for large loads.

OTOH, holding the device "in reset" (possibly indefinitely or even
"repeatedly") should be safe(r?)
On Wed, 20 Apr 2016 23:19:19 -0500, Robert Wessel wrote:

>> In the simplest case, that means treating !PoE as a RESET state.
>> And, ensuring that the hardware will remain safe/secure in perpetual
>> reset (i.e., not expecting RESET to be a *transient* state).
>
> There's the obvious solution of using the power from the PoE PSE to
> drive an enable of some sort to the device's power supply.  Heck, use
> that to energize a relay you've put across the mains input (some way
> of overriding that at the device would probably be prudent).
I also see a potential issue with motion control -- if a node is
controlling a motor or some such, you don't necessarily want to cut
off control at an arbitrary time.

I can envision a scenario where the communications is bollixed but the
control loop is perking along merrily -- then you shut off power to
the processor and WHANG! some serious metal hits a stop, and bits get
bent or fly off.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 4/21/2016 9:25 AM, Tim Wescott wrote:
> I also see a potential issue with motion control -- if a node is
> controlling a motor or some such, you don't necessarily want to cut
> off control at an arbitrary time.
>
> I can envision a scenario where the communications is bollixed but
> the control loop is perking along merrily -- then you shut off power
> to the processor and WHANG! some serious metal hits a stop, and bits
> get bent or fly off.
I don't see any way around that.  How many concurrent failures do you
envision handling?

In your example, the node (doing the motion control) would have to
have had a communication failure *and* its local
"watchdog(s)/daemons" would have had to have NOT detected that
failure -- in order to allow for the motion to continue.

By contrast, in my current approach, even a SINGLE failure could munge
your scenario (the PSE dropping power to the PD "unexpectedly").

As you said, upthread, you can't deal with every condition...
On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

> If the device is NOT PoE powered, it's probably because it represents
> a substantial load (25+W?).  I'm not sure it would be prudent to let
> something remotely disconnect power (and possibly reapply it, moments
> later) for large loads.
>
> OTOH, holding the device "in reset" (possibly indefinitely or even
> "repeatedly") should be safe(r?)
Presumably this is for cases where the device is so far gone that you want to hit the big-red-switch. If you want more sophistication, you can put a controlling microprocessor on the device, and have that powered by PoE, and it could do things like force a reset, or actually power the device off if necessary.