
Remote "watchdog"s

Started by Don Y April 20, 2016
I have a distributed system.  As such, it is possible for parts of it
to crash, become unresponsive, suffer hardware failures, etc.

The "system" as a whole needs a way of coping with these.

In the normal course of operation, traffic between nodes gives some
reassurance that the nodes involved are "sane".
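That "traffic as reassurance" idea can be sketched as a simple per-node liveness monitor: any observed message refreshes a timestamp, and prolonged silence marks the node suspect. This is only an illustrative sketch; the names and the 5-second threshold are invented, not from any real protocol here.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEARTBEAT_TIMEOUT_MS 5000u  /* silence longer than this is suspect */

typedef struct {
    uint32_t last_seen_ms;  /* timestamp of last traffic from the node */
    bool     suspect;       /* latched once the node has gone quiet */
} node_health;

/* Call whenever any traffic arrives from the node. */
void node_saw_traffic(node_health *n, uint32_t now_ms)
{
    n->last_seen_ms = now_ms;
    n->suspect = false;
}

/* Call periodically; returns true if the node has gone quiet. */
bool node_check(node_health *n, uint32_t now_ms)
{
    if (now_ms - n->last_seen_ms > HEARTBEAT_TIMEOUT_MS)
        n->suspect = true;
    return n->suspect;
}
```

A suspect node would then be a candidate for the recovery protocols described below, rather than being trusted to vouch for itself.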

OTOH, if a node starts acting wonky, I can invoke recovery protocols
that (in an ideal world) will reset/restart the portions of the
node that appear to be misbehaving (e.g., if traffic to/from one
particular task is "odd", then perhaps that task has gone south;
but, the remainder of the node may be intact.  Or, some I/O that
the wacky task handles is misbehaving and the task is doing its
best to make sense of a broken environment).

If the node itself gets hosed, then a hardware watchdog on that
node *might* bring it back to nominal operating condition.

But, I still need a reliable way of making a (remote) node "safe",
"secure", etc.

I presently deliver PoE to each node and many of them live off that.
So, as a last ditch effort, I can command the switch to remove that
and unceremoniously bring the node down.

Except for the nodes that *aren't* dependent on that!

Rather than concoct some other mechanism that might work (less effectively),
perhaps it is easiest to codify this behavior in the hardware interface spec?
I.e., even if you don't RELY on PoE, you must still *honor* it AS IF you did!

In the simplest case, that means treating !PoE as a RESET state.  And,
ensuring that the hardware will remain safe/secure in perpetual reset
(i.e., not expecting RESET to be a *transient* state).
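The "!PoE == RESET" rule amounts to a small policy in the node's power/reset logic: loss of PoE forces (and holds) reset regardless of how the node is actually powered. A minimal sketch, with invented state names; on real hardware this would gate the board's reset line, which is omitted here:

```c
#include <stdbool.h>

typedef enum { NODE_RESET, NODE_RUNNING } node_state;

/* Honor PoE as if you relied on it: absence of PoE is an indefinitely
 * held RESET, even for a mains-powered node.  Reset must be a safe
 * state to occupy forever, not just a transient. */
node_state poe_reset_policy(node_state s, bool poe_present)
{
    (void)s;  /* policy is memoryless: PoE presence alone decides */
    if (!poe_present)
        return NODE_RESET;
    return NODE_RUNNING;
}
```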

Of course, the downside is that I now must be able to deliver something
akin to a PoE signal on every port (though I could possibly skimp on
the load carrying capability for many ports?)

Any holes in this?  Or, better alternatives?
On Wed, 20 Apr 2016 12:57:06 -0700, Don Y wrote:

> I have a distributed system.  As such, it is possible for parts of it
> to crash, become unresponsive, suffer hardware failures, etc.
>
> The "system" as a whole needs a way of coping with these.
>
> [...]
>
> OTOH, if a node starts acting wonky, I can invoke recovery protocols
> that (in an ideal world) will reset/restart the portions of the node
> that appear to be misbehaving (e.g., if traffic to/from one particular
> task is "odd", then perhaps that task has gone south; but, the
> remainder of the node may be intact.  Or, some I/O that the wacky task
> handles is misbehaving and the task is doing its best to make sense of
> a broken environment).
I would be more inclined to insist that each node has good
built-in-test, and that it do its own fault recovery -- i.e., to the
extent possible it should recognize its own problems, and at worst it
should respond to requests for information with "I'm sorry, I'm sick
right now and can't come to work."

Then if it goes _totally_ wonky, cut it off at the knees.
> [...]
>
> Rather than concoct some other mechanism that might work (less
> effectively), perhaps it is easiest to codify this behavior in the
> hardware interface spec?  I.e., even if you don't RELY on PoE, you
> must still *honor* it AS IF you did!
>
> In the simplest case, that means treating !PoE as a RESET state.  And,
> ensuring that the hardware will remain safe/secure in perpetual reset
> (i.e., not expecting RESET to be a *transient* state).
>
> Of course, the downside is that I now must be able to deliver
> something akin to a PoE signal on every port (though I could possibly
> skimp on the load carrying capability for many ports?)
>
> Any holes in this?  Or, better alternatives?
Sounds good.

For the pseudo-PoE idea, if you're worried about babbling nodes
consuming all the bandwidth, you could insist that the Ethernet PHY
layer be powered by the PoE, even if nothing else is.  Or insist on a
set of analog switches that effectively disconnect Ethernet at some
point.

And -- keep in mind that you can't protect from EVERY fault.  You have
to make a risk/reward tradeoff.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 4/20/2016 2:17 PM, Tim Wescott wrote:
> I would be more inclined to insist that each node has good
> built-in-test, and that it do its own fault recovery -- i.e., to the
> extent possible it should recognize its own problems, and at worst it
> should respond to requests for information with "I'm sorry, I'm sick
> right now and can't come to work."
Nodes are up "forever".  So, the POST only happens occasionally.
I have some run-time diagnostics (e.g., testing RAM as I scrub it;
sanity tests on I/O values, etc.).  But, if the node is already
"compromised" in some way, I can't rely on *it* being able to
adequately vouch for itself.

OTOH, if a *process* goes wonky, I can shut it down (the assumption
being that the rest of the node is intact).  But, if the node doesn't
(or can't!) listen, then I can't "tell" it to do anything!
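The "testing RAM as I scrub it" idea can be sketched as a word-at-a-time march: save the live contents, write and verify alternating patterns, then restore. This is a simplified illustration only; a real implementation would mask interrupts around each word and worry about caches, write buffers, and coupling faults between words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a region of RAM, testing each word in place and restoring its
 * live contents afterward.  Returns false on the first stuck bit. */
bool scrub_and_test(volatile uint32_t *mem, size_t nwords)
{
    static const uint32_t patterns[] = { 0x55555555u, 0xAAAAAAAAu };

    for (size_t i = 0; i < nwords; i++) {
        uint32_t saved = mem[i];            /* preserve live contents */
        for (size_t p = 0; p < 2; p++) {
            mem[i] = patterns[p];
            if (mem[i] != patterns[p]) {    /* readback mismatch */
                mem[i] = saved;
                return false;
            }
        }
        mem[i] = saved;                     /* restore ("scrub") */
    }
    return true;
}
```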
> Then if it goes _totally_ wonky, cut it off at the knees.
By pulling the plug (?)
> Sounds good.  For the pseudo-PoE idea, if you're worried about
> babbling nodes consuming all the bandwidth, you could insist that the
> Ethernet PHY layer be powered by the PoE, even if nothing else is.  Or
> insist on a set of analog switches that effectively disconnect
> Ethernet at some point.
A process control system I worked on many years ago had a locally
controlled relay to allow a node to isolate itself from the network.
And, fuses in-line.  If the network got compromised (a node chattering
endlessly), "normal" nodes would detect this and take themselves
offline.  The network controller (Master-Slave) would then apply a
high voltage to the network, blowing the fuses of any node that had
"lost its marbles".  After a prescribed delay, the "sane" nodes would
reconnect to the network.

But, here, each node is on its own segment.  So, I can simply choose
to block all traffic coming into the switch from that segment.  (A
node can't decide who it wants to talk to; rather, the switch has to
be configured to explicitly pass *desired* traffic to/from the other
nodes as the fabric configuration is dynamically updated.)

[The point was to ensure that an adversary having access to a node or
its drop can't flood the switch with bogus (or even forged) traffic
and compromise the operation of other nodes]
> And -- keep in mind that you can't protect from EVERY fault. You have to > make a risk/reward tradeoff.
Yes.  If it were a monolithic design, a single hardware "reset"
circuit would ensure EVERYTHING was held in a safe/secure state (by
explicitly choosing reset conditions to be such).  The distributed
nature prevents me from "distributing" that reset signal.

[Ages ago, I'd accomplished this with serially (EIA-232) connected
nodes by driving TxD to a LONG SPACE condition... then letting
hardware detect this (a one-shot) and drive the "local" RESET
accordingly.  Not quite as easy to do/detect with Ethernet interfaces
(without relying on the interface itself being operational) -- for an
EIA-232 link, you could just look at the incoming RxD signal coming
out of the level translator]
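The one-shot in that EIA-232 scheme can be mimicked in software: sample the raw RxD level at a fixed rate and trip once it has sat in the SPACE state longer than any legal character could. A sketch under invented assumptions (the 20-sample threshold and the sampling scheme are illustrative, not from the original hardware):

```c
#include <stdbool.h>
#include <stdint.h>

#define BREAK_SAMPLES 20  /* e.g. 20 consecutive 1 ms samples of SPACE */

typedef struct {
    uint16_t space_count;  /* consecutive SPACE samples seen so far */
} break_detector;

/* Feed one line sample (true == SPACE); returns true once a LONG SPACE
 * (break / remote-reset request) has been detected. */
bool break_sample(break_detector *d, bool space)
{
    if (space) {
        if (d->space_count < BREAK_SAMPLES)
            d->space_count++;
    } else {
        d->space_count = 0;  /* any MARK restarts the "one-shot" */
    }
    return d->space_count >= BREAK_SAMPLES;
}
```

Normal traffic always returns to MARK between characters, so only a deliberately held SPACE accumulates enough samples to trip the detector.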
On Wed, 20 Apr 2016 15:30:12 -0700, Don Y wrote:

> Nodes are up "forever".  So, the POST only happens occasionally.
> I have some run-time diagnostics (e.g., testing RAM as I scrub it;
> sanity tests on I/O values, etc.).  But, if the node is already
> "compromised" in some way, I can't rely on *it* being able to
> adequately vouch for itself.
POST is only part of BIT. When I did pseudo-military stuff, BIT meant continuous built-in test, that would pop a fault condition if a sensor went out of range or if some combination of sensors were out of bounds. FIT (Fault Isolation Test) was something you commanded, and took that part of the system off line. POST was FIT on steroids.
> OTOH, if a *process* goes wonky, I can shut it down (the assumption
> being that the rest of the node is intact).  But, if the node doesn't
> (or can't!) listen, then I can't "tell" it to do anything!
I'm not sure if by process you mean one thread of execution within the node's one processor, or if you mean that the node has multiple processors. At any rate, you should consider having local BIT that can at least tell if a process has gone _really_ wonky and do something.
>> Then if it goes _totally_ wonky, cut it off at the knees.
>
> By pulling the plug (?)
Yup. It's cheaper than detonators on each node, and less alarming to the customer.
-- www.wescottdesign.com
On 4/20/2016 5:27 PM, Tim Wescott wrote:
> POST is only part of BIT.  When I did pseudo-military stuff, BIT
> meant continuous built-in test, that would pop a fault condition if a
> sensor went out of range or if some combination of sensors were out
> of bounds.
That's just common-sense validation of inputs. I.e., if I command a motor to "move right" and sense the LEFT limit switch engaging, something is very wrong. Or, if a temperature sensor indicates 25.3C on one sample and 42.5C on the next sample -- when the thermal mass involved would otherwise prohibit such a change.
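That thermal-mass argument is just a rate-of-change bound on successive samples. A sketch, where the 2.0 C-per-sample limit is an invented figure for illustration (the real bound would come from the plant's physics):

```c
#include <stdbool.h>

#define MAX_DELTA_C 2.0  /* largest believable change between samples */

/* Reject a new temperature reading if it jumps further from the
 * previous one than the thermal mass could plausibly allow. */
bool temp_plausible(double prev_c, double next_c)
{
    double delta = next_c - prev_c;
    if (delta < 0.0)
        delta = -delta;
    return delta <= MAX_DELTA_C;
}
```

The 25.3 C -> 42.5 C jump from the text would fail this check, flagging either the sensor or its handler as suspect.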
> FIT (Fault Isolation Test) was something you commanded, and took that
> part of the system off line.  POST was FIT on steroids.
I can invoke POST by cycling power to the node and letting it go through its normal startup checks. But, that takes the whole node off-line for the duration. Instead, I "poke at" various aspects of the system while it is running to ensure that "as components" they appear to be operational.
>> OTOH, if a *process* goes wonky, I can shut it down (the assumption
>> being that the rest of the node is intact).  But, if the node doesn't
>> (or can't!) listen, then I can't "tell" it to do anything!
>
> I'm not sure if by process you mean one thread of execution within
> the node's one processor, or if you mean that the node has multiple
> processors.
"Process" being a container for threads and other resources.  "Job"
being a collection of processes.  So, my statement is intended to mean
"some planned activity is not behaving in a 'nominal' manner".

E.g., if a process is supposed to be reporting power consumption at 10
second intervals and the numbers suddenly make no sense (e.g., -37,
+JKLHKL, etc.), or the update interval lengthens or quickens, or it
stops responding to commands, or synchronous RPC's to the process
"hang indefinitely"...

These are things that a remote process/actor can discern from
observations FROM that other (remote) process.  They leak information
regarding the remote process's likely functionality (or lack thereof).
If you could observe ALL of the processes on a node, remotely, you
might be able to infer that there is some systemic failure on that
node.
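The power-reporting example combines two observable checks: is the value sane, and did it arrive on cadence? A sketch of such a remote monitor; the 10 s period matches the text, but the jitter allowance and power range are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define REPORT_PERIOD_MS 10000u  /* expected cadence, per the text */
#define PERIOD_SLOP_MS    2000u  /* tolerated jitter either way (invented) */
#define POWER_MIN_W          0.0
#define POWER_MAX_W      10000.0

typedef struct {
    uint32_t last_report_ms;  /* arrival time of the previous report */
} report_monitor;

/* Judge one report: the value must be in range AND it must have
 * arrived close to the expected interval. */
bool report_nominal(report_monitor *m, uint32_t now_ms, double watts)
{
    uint32_t gap = now_ms - m->last_report_ms;
    m->last_report_ms = now_ms;

    if (watts < POWER_MIN_W || watts > POWER_MAX_W)
        return false;  /* nonsense value, e.g. -37 */
    if (gap + PERIOD_SLOP_MS < REPORT_PERIOD_MS ||
        gap > REPORT_PERIOD_MS + PERIOD_SLOP_MS)
        return false;  /* cadence lengthened or quickened */
    return true;
}
```

A run of non-nominal reports from one process is grounds to restart it; non-nominal reports from *many* processes on a node hint at a systemic failure there.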
> At any rate, you should consider having local BIT that can at least
> tell if a process has gone _really_ wonky and do something.
The OS knows if a process isn't interacting with it properly.  The
process can't interact with anything else in the system (local or
remote) without the OS acting as intermediary.  But, the OS doesn't
want to *know* how the process should behave.  That information is
encoded in the interactions the process has with other actors (local
and remote) in the system.

Obviously, if a node "goes silent", it is highly suspect.  If this
happens to an isolated process, then (hopefully) SOMETHING on the node
is still sane and can help recover/restore that capability.

What I need is an "if all else fails" fallback.  So, I don't have to
worry that some I/O is being (erroneously) commanded to do something
that it shouldn't, etc.
On Wed, 20 Apr 2016 12:57:06 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

> I presently deliver PoE to each node and many of them live off that.
> So, as a last ditch effort, I can command the switch to remove that
> and unceremoniously bring the node down.
>
> Except for the nodes that *aren't* dependent on that!
>
> [...]
>
> Any holes in this?  Or, better alternatives?
There's the obvious solution of using the power from the PoE PSE to drive an enable of some sort to the device's power supply. Heck use that to energize a relay you've put across the mains input (some way of overriding that at the device would probably be prudent).
On 4/20/2016 9:19 PM, Robert Wessel wrote:

> There's the obvious solution of using the power from the PoE PSE to
> drive an enable of some sort to the device's power supply.  Heck, use
> that to energize a relay you've put across the mains input (some way
> of overriding that at the device would probably be prudent).
If the device is NOT PoE powered, it's probably because it represents
a substantial load (25+W?).  I'm not sure it would be prudent to let
something remotely disconnect power (and possibly reapply it, moments
later) for large loads.

OTOH, holding the device "in reset" (possibly indefinitely or even
"repeatedly") should be safe(r?)
On Wed, 20 Apr 2016 23:19:19 -0500, Robert Wessel wrote:

>> In the simplest case, that means treating !PoE as a RESET state.
>> And, ensuring that the hardware will remain safe/secure in perpetual
>> reset (i.e., not expecting RESET to be a *transient* state).
>
> There's the obvious solution of using the power from the PoE PSE to
> drive an enable of some sort to the device's power supply.  Heck, use
> that to energize a relay you've put across the mains input (some way
> of overriding that at the device would probably be prudent).
I also see a potential issue with motion control -- if a node is
controlling a motor or some such, you don't necessarily want to cut
off control at an arbitrary time.

I can envision a scenario where the communications is bollixed but the
control loop is perking along merrily -- then you shut off power to
the processor and WHANG! some serious metal hits a stop, and bits get
bent or fly off.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On 4/21/2016 9:25 AM, Tim Wescott wrote:
> I also see a potential issue with motion control -- if a node is
> controlling a motor or some such, you don't necessarily want to cut
> off control at an arbitrary time.
>
> I can envision a scenario where the communications is bollixed but
> the control loop is perking along merrily -- then you shut off power
> to the processor and WHANG! some serious metal hits a stop, and bits
> get bent or fly off.
I don't see any way around that.  How many concurrent failures do you
envision handling?

In your example, the node (doing the motion control) would have to
have had a communication failure *and* its local
"watchdog(s)/daemons" would have had to have NOT detected that
failure -- in order to allow for the motion to continue.

By contrast, in my current approach, even a SINGLE failure could munge
your scenario (the PSE dropping power to the PD "unexpectedly").

As you said, upthread, you can't deal with every condition...
On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

> If the device is NOT PoE powered, it's probably because it represents
> a substantial load (25+W?).  I'm not sure it would be prudent to let
> something remotely disconnect power (and possibly reapply it, moments
> later) for large loads.
>
> OTOH, holding the device "in reset" (possibly indefinitely or even
> "repeatedly") should be safe(r?)
Presumably this is for cases where the device is so far gone that you want to hit the big-red-switch. If you want more sophistication, you can put a controlling microprocessor on the device, and have that powered by PoE, and it could do things like force a reset, or actually power the device off if necessary.