Reply by Don Y April 27, 2016
On 4/27/2016 3:25 PM, lasselangwadtchristensen@gmail.com wrote:
> On Saturday, April 23, 2016 at 21:59:19 UTC+2, Don Y wrote:
>> On 4/23/2016 7:26 AM, upsidedown@downunder.com wrote:
>>>> You've got a "loose thread" on a garment and tugging on it just
>>>> keeps unraveling the next layer.
>>>>
>>>> There's a point at which you say "what is the value of redundancy"
>>>> compared to the *cost*?
>>>
>>> Usually customers with a daily production volume more than 1 million
>>> USD/EUR each day require redundant systems. In other cases, when there
>>> is a risk to human life or there are a risk of ending into CNN main
>>> news :-), redundant systems are often used.
>>
>> I've worked in industries where a single machine could produce
>> greater value per run-time *hour*. "Redundancy" was not addressed
>> ON the machines, at all. Manufacturing was kept "available" by
>> having spare machines instead of having more RELIABLE machines.
>>
>> My point is that redundancy adds cost and complexity to a design.
>
> a guy went to work with worked nights at a local paper printing delivery
> list, they had some big IBM monster of a mechanical printer and since
> failure to print the lists meant no paper delivered they had an
> expensive "guaranteed in a few hours" service
>
> eventually they just got a row of regular laser printers, as long as few
> of them worked they could make the lists on time so they didn't need some
> service technician on standby. and buying a new was cheaper than fixing
Yeah, which is why I have many PCs. OTOH, we have exactly ONE furnace in the house, one ACbrrr, one roof, one set of exterior walls, etc. We *could* implement redundancy for each of these things -- at considerable cost and complexity! Having walls up, heat and a roof overhead goes far beyond the "few hours service" sort of guarantee!
Reply by April 27, 2016
On Saturday, April 23, 2016 at 21:59:19 UTC+2, Don Y wrote:
> On 4/23/2016 7:26 AM, upsidedown@downunder.com wrote:
> >> You've got a "loose thread" on a garment and tugging on it just
> >> keeps unraveling the next layer.
> >>
> >> There's a point at which you say "what is the value of redundancy"
> >> compared to the *cost*?
> >
> > Usually customers with a daily production volume more than 1 million
> > USD/EUR each day require redundant systems. In other cases, when there
> > is a risk to human life or there are a risk of ending into CNN main
> > news :-), redundant systems are often used.
>
> I've worked in industries where a single machine could produce
> greater value per run-time *hour*. "Redundancy" was not addressed
> ON the machines, at all. Manufacturing was kept "available" by
> having spare machines instead of having more RELIABLE machines.
>
> My point is that redundancy adds cost and complexity to a design.
>
A guy I went to work with worked nights at a local paper, printing delivery lists. They had some big IBM monster of a mechanical printer, and since failure to print the lists meant no paper delivered, they had an expensive "guaranteed in a few hours" service contract.

Eventually they just got a row of regular laser printers. As long as a few of them worked they could make the lists on time, so they didn't need a service technician on standby. And buying a new one was cheaper than fixing.

-Lasse
Reply by Don Y April 23, 2016
On 4/23/2016 7:26 AM, upsidedown@downunder.com wrote:
>> You've got a "loose thread" on a garment and tugging on it just
>> keeps unraveling the next layer.
>>
>> There's a point at which you say "what is the value of redundancy"
>> compared to the *cost*?
>
> Usually customers with a daily production volume more than 1 million
> USD/EUR each day require redundant systems. In other cases, when there
> is a risk to human life or there are a risk of ending into CNN main
> news :-), redundant systems are often used.
I've worked in industries where a single machine could produce greater value per run-time *hour*. "Redundancy" was not addressed ON the machines, at all. Manufacturing was kept "available" by having spare machines instead of having more RELIABLE machines.

My point is that redundancy adds cost and complexity to a design.

Toilet flappers fail regularly. Why not incorporate some redundancy in their design to eliminate/reduce this possibility?
Reply by April 23, 2016
On Sat, 23 Apr 2016 04:08:32 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/23/2016 3:46 AM, upsidedown@downunder.com wrote:
>> On Sat, 23 Apr 2016 00:54:56 -0700, Don Y
>> <blockedofcourse@foo.invalid> wrote:
>>
>>> On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>>>>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>>>>> practical option -- the system is intended to run 24/7/365 so there's
>>>>> no "scheduled down time" or "end of flight" :>
>>>>
>>>> Use a redundant system with at least two identical units, say A and B.
>>>> If A needs to be reseted, doing some self test or do some application
>>>> or OS upgrade, switch control to B, do the required maintenance
>>>> operation (including hardware replacement) on A. Check that A is up
>>>> and running, then you can switch back to A.
>>>
>>> The problem comes with I/O's. Not only do you have to "duplicate"
>>> the field interface -- but, you also have to provide a RELIABLE means
>>> to switch between any actuators driven by those two "duplicates".
>>
>> This is often done with a voting system or majority logic systems with
>> three or more controllers. The final output state is the one most
>> controllers can agree on. There can be different forms of voters, from
>> logic gates to relays to hydraulics etc.
>
>Yes, and the voting logic is also subject to failure.
Yes, the horizontal bar with three actuator coils is subject to earthquakes :-).
>And, of course, the actuators being controlled, etc.
The voting is done in the solid mechanical bar. Of course, this could break, but in such situations, how much is still alive anyway?
>You've got a "loose thread" on a garment and tugging on it just
>keeps unraveling the next layer.
>
>There's a point at which you say "what is the value of redundancy"
>compared to the *cost*?
Usually customers with a daily production volume of more than 1 million USD/EUR require redundant systems. In other cases, when there is a risk to human life or a risk of ending up on the CNN main news :-), redundant systems are often used.

In the nuclear industry, there is no point in using more than a few _identical_ units: in Fukushima, all the diesels required by the emergency cooling system got wet in the tsunami. Using 3+3 different emergency cooling systems would have saved the plant.
>What "extra" redundancy would have saved Columbia?
The STS was an idiotic design (due to weight constraints), so no redundancy would have saved the crew.
Reply by Don Y April 23, 2016
On 4/23/2016 3:46 AM, upsidedown@downunder.com wrote:
> On Sat, 23 Apr 2016 00:54:56 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>
>> On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>>>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>>>> practical option -- the system is intended to run 24/7/365 so there's
>>>> no "scheduled down time" or "end of flight" :>
>>>
>>> Use a redundant system with at least two identical units, say A and B.
>>> If A needs to be reseted, doing some self test or do some application
>>> or OS upgrade, switch control to B, do the required maintenance
>>> operation (including hardware replacement) on A. Check that A is up
>>> and running, then you can switch back to A.
>>
>> The problem comes with I/O's. Not only do you have to "duplicate"
>> the field interface -- but, you also have to provide a RELIABLE means
>> to switch between any actuators driven by those two "duplicates".
>
> This is often done with a voting system or majority logic systems with
> three or more controllers. The final output state is the one most
> controllers can agree on. There can be different forms of voters, from
> logic gates to relays to hydraulics etc.
Yes, and the voting logic is also subject to failure. And, of course, the actuators being controlled, etc.

You've got a "loose thread" on a garment and tugging on it just keeps unraveling the next layer.

There's a point at which you say "what is the value of redundancy" compared to the *cost*?

What "extra" redundancy would have saved Columbia?
Reply by April 23, 2016
On Sat, 23 Apr 2016 00:54:56 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>>> practical option -- the system is intended to run 24/7/365 so there's
>>> no "scheduled down time" or "end of flight" :>
>>
>> Use a redundant system with at least two identical units, say A and B.
>> If A needs to be reseted, doing some self test or do some application
>> or OS upgrade, switch control to B, do the required maintenance
>> operation (including hardware replacement) on A. Check that A is up
>> and running, then you can switch back to A.
>
>The problem comes with I/O's. Not only do you have to "duplicate"
>the field interface -- but, you also have to provide a RELIABLE means
>to switch between any actuators driven by those two "duplicates".
This is often done with a voting system or majority logic system with three or more controllers. The final output state is the one most controllers can agree on. There can be different forms of voters, from logic gates to relays to hydraulics etc.

One very simple mechanical voter is a horizontal bar connected to some mechanical control point, with three coils on the bar driven by three control systems. The bar moves in the direction (left or right) that at least two control systems are actively commanding. The disagreeing unit's coil is cancelled by one of the other two, and the third does the actual movement.

A simpler method exists when there is one "safe" state and another "dangerous" state which should not be entered by mistake or hardware failure: use a vertical bar with some weight to pull it down to the safe state. A coil then drives it to the "dangerous" state. Both ends could be driven separately, preferably from different systems.

One other approach is the select/execute method, in which one signal is needed to activate the relay power supply and another to activate the actual relay. A timer is needed to deactivate the relay power supply if no relay command is received. Thus, a Select alone or an Execute alone cannot cause any action. Cross-feeding select and execute from different controllers in a redundant system reduces the risk of spurious activity.
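For what it's worth, the majority voter and the select/execute idea can be sketched in a few lines of Python. This is only an illustration of the logic, not a real safety design: the names `majority_vote` and `SelectExecuteRelay` and the 2-second timeout are made up for the example, and time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import Counter


def majority_vote(commands):
    """Return the command a strict majority of controllers agree on, else None."""
    if not commands:
        return None
    value, count = Counter(commands).most_common(1)[0]
    return value if count * 2 > len(commands) else None


class SelectExecuteRelay:
    """Relay that operates only if Execute arrives while Select is still armed.

    Hypothetical model: Select powers the relay supply, a timer removes
    that power again, and Execute drives the relay coil itself.
    """

    def __init__(self, select_timeout_s=2.0):
        self.select_timeout_s = select_timeout_s  # assumed timer value
        self._armed_at = None  # time at which Select powered the supply
        self.closed = False

    def select(self, now):
        """Power the relay supply and start the timer."""
        self._armed_at = now

    def execute(self, now):
        """Drive the coil; return True only if the relay actually operated."""
        armed = (self._armed_at is not None
                 and now - self._armed_at <= self.select_timeout_s)
        self._armed_at = None  # one Select authorises at most one Execute
        if armed:
            self.closed = True
        return armed
```

With three controllers commanding "left", "left" and "right", `majority_vote` returns "left"; with three different commands it returns None. An `execute()` without a preceding `select()`, or one arriving after the timeout, does nothing. In a redundant system the `select()` and `execute()` calls would come from different controllers.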
>
>I.e., if A has an output set ON and B thinks it *really* should be OFF,
>how is the controlled device to know who to listen to?
>
>I consider physical I/O replication to be too troublesome. So, I
>only provide redundancy on "virtual" entities (processes, state, etc.).
>In that way, all I need are spare CPU's...
Relays and hydraulic valves may also fail, so you should look at the big picture and not just software.
Reply by Don Y April 23, 2016
On 4/22/2016 11:29 PM, upsidedown@downunder.com wrote:
>> In my case, "just waiting" (i.e., for the plane to land) isn't a
>> practical option -- the system is intended to run 24/7/365 so there's
>> no "scheduled down time" or "end of flight" :>
>
> Use a redundant system with at least two identical units, say A and B.
> If A needs to be reseted, doing some self test or do some application
> or OS upgrade, switch control to B, do the required maintenance
> operation (including hardware replacement) on A. Check that A is up
> and running, then you can switch back to A.
The problem comes with I/O's. Not only do you have to "duplicate" the field interface -- but, you also have to provide a RELIABLE means to switch between any actuators driven by those two "duplicates".

I.e., if A has an output set ON and B thinks it *really* should be OFF, how is the controlled device to know who to listen to?

I consider physical I/O replication to be too troublesome. So, I only provide redundancy on "virtual" entities (processes, state, etc.). In that way, all I need are spare CPU's...
> If updates are required on both units, it is preferable to start with
> he passive unit (say B in this example), When B is up and running
> again, try switching control to B. If B is not working properly after
> the update, switch back to A and fix B before trying to switch again.
>
> When B has been verified to be properly in charge, do the maintenance
> on A and preferably switch back to A and verify that the maintenance
> on A also went OK.
>
>> As reset is, conceptually, the only time when a system's state can
>> be "known", getting to that state seems to be the safest course of
>> action.
>>
>> What I *should* probably do is figure out how to hold PD's *in* RESET,
>> though powered. That'll require yet another modification to the
>> negotiation protocol. <frown>
>
> Of course, when doing redundant system, the redundancy should be
> designed into the system from the beginning and not just try to stick
> on some redundancy, if/when the reliability of a non-redundant system
> is found to be too bad.
Reply by April 23, 2016
On Fri, 22 Apr 2016 13:35:27 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 4/21/2016 8:53 PM, Robert Wessel wrote:
>> On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel
>> <robertwessel2@yahoo.com> wrote:
>>
>>> On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
>>> <blockedofcourse@foo.invalid> wrote:
>>>
>>>> On 4/20/2016 9:19 PM, Robert Wessel wrote:
>>>>
>>>>> There's the obvious solution of using the power from the PoE PSE to
>>>>> drive an enable of some sort to the device's power supply. Heck use
>>>>> that to energize a relay you've put across the mains input (some way
>>>>> of overriding that at the device would probably be prudent).
>>>>
>>>> If the device is NOT PoE powered, it's probably because it represents a
>>>> substantial load (25+W?). I'm not sure it would be prudent to let
>>>> something remotely disconnect power (and possibly reapply it, moments
>>>> later) for large loads.
>>>>
>>>> OTOH, holding the device "in reset" (possibly indefinitely or even
>>>> "repeatedly") should be safe(r?)
>>>
>>>
>>> Presumably this is for cases where the device is so far gone that you
>>> want to hit the big-red-switch. If you want more sophistication, you
>>> can put a controlling microprocessor on the device, and have that
>>> powered by PoE, and it could do things like force a reset, or actually
>>> power the device off if necessary.
>>
>>
>> Aircraft systems have an interesting parallel. Almost everything have
>> its power disconnected via a circuit breaking in the cockpit. In ye
>> olde days, these were actually breakers wired into the circuit mounted
>> on a panel (or several) in the cockpit, or a simple remote-operated
>> breaker (usually for heavy loads). On recent aircraft, most of this
>> is driven by the flight management system, which will pop up a little
>> message saying it's pulled a breaker (if it happens automatically), or
>> has a screen where you can pick a breaker to pull, and the breakers
>> themselves are often located in a more convenient physical location
>> (presumably near the circuit they're protecting), and they're
>> controlled remotely.
>>
>> In the past is was not uncommon for the flight crew to attempt to
>> cycle a breaker after a failure, but the modern policy is to just
>> leave it alone (and powered off), and let maintenance deal with it on
>> the ground. Obviously with exceptions where the loss of the system in
>> question can be considered more dangerous than the possibility of a
>> fire or other really bad result from the failing device.
>
>In my case, "just waiting" (i.e., for the plane to land) isn't a
>practical option -- the system is intended to run 24/7/365 so there's
>no "scheduled down time" or "end of flight" :>
Use a redundant system with at least two identical units, say A and B. If A needs to be reset, run some self test, or get an application or OS upgrade, switch control to B and do the required maintenance operation (including hardware replacement) on A. Check that A is up and running, then you can switch back to A.

If updates are required on both units, it is preferable to start with the passive unit (say B in this example). When B is up and running again, try switching control to B. If B is not working properly after the update, switch back to A and fix B before trying to switch again.

When B has been verified to be properly in charge, do the maintenance on A, and preferably switch back to A and verify that the maintenance on A also went OK.
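The update order described above can be written down as a small Python sketch. It is a simplification under stated assumptions: `rolling_update`, the `update` hook and the `healthy` check are hypothetical names, and a real system would also have to deal with state transfer and the I/O switchover itself.

```python
def rolling_update(units, active, update, healthy):
    """Update a redundant pair, passive unit first.

    units   -- two unit names, e.g. ("A", "B")
    active  -- the unit currently in control
    update  -- callable(unit) doing the maintenance (hypothetical hook)
    healthy -- callable(unit) -> bool, True if the unit works properly

    Returns (unit left in control, list of steps taken).
    """
    passive = units[1] if active == units[0] else units[0]
    log = []

    # Start with the passive unit so the active one keeps running the plant.
    update(passive)
    log.append(f"updated {passive}")

    # Try the updated unit in charge; fall back if it misbehaves.
    log.append(f"switched control to {passive}")
    if not healthy(passive):
        log.append(f"switched back to {active}; fix {passive} before retrying")
        return active, log

    # Only now is it safe to take the original active unit out of service.
    update(active)
    log.append(f"updated {active}")
    if healthy(active):
        log.append(f"switched control back to {active}")
        return active, log

    log.append(f"{active} failed verification; {passive} stays in control")
    return passive, log
```

Called as `rolling_update(("A", "B"), "A", do_update, check)`, it leaves A in control if both updates verify, and otherwise leaves control with whichever unit is still known to work.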
>As reset is, conceptually, the only time when a system's state can
>be "known", getting to that state seems to be the safest course of
>action.
>
>What I *should* probably do is figure out how to hold PD's *in* RESET,
>though powered. That'll require yet another modification to the
>negotiation protocol. <frown>
Of course, when building a redundant system, the redundancy should be designed into the system from the beginning, rather than stuck on afterwards if/when the reliability of a non-redundant system is found to be too poor.
Reply by Don Y April 22, 2016
On 4/22/2016 6:22 PM, lasselangwadtchristensen@gmail.com wrote:
> On Saturday, April 23, 2016 at 02:48:15 UTC+2, Don Y wrote:
>> On 4/22/2016 4:47 PM, Robert Wessel wrote:
>>> On Fri, 22 Apr 2016 04:23:55 -0700 (PDT),
>>> lasselangwadtchristensen@gmail.com wrote:
>>>
>>>> On Friday, April 22, 2016 at 05:53:11 UTC+2,
>>>> robert...@yahoo.com wrote:
>>>>> On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel
>>>>> <robertwessel2@yahoo.com> wrote:
>>>>>
>>>>>> On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
>>>>>> <blockedofcourse@foo.invalid> wrote:
>>>>>>
>>>>>>> On 4/20/2016 9:19 PM, Robert Wessel wrote:
>>>>>>>
>>>>>>>> There's the obvious solution of using the power from the PoE
>>>>>>>> PSE to drive an enable of some sort to the device's power
>>>>>>>> supply. Heck use that to energize a relay you've put across
>>>>>>>> the mains input (some way of overriding that at the device
>>>>>>>> would probably be prudent).
>>>>>>>
>>>>>>> If the device is NOT PoE powered, it's probably because it
>>>>>>> represents a substantial load (25+W?). I'm not sure it would be
>>>>>>> prudent to let something remotely disconnect power (and possibly
>>>>>>> reapply it, moments later) for large loads.
>>>>>>>
>>>>>>> OTOH, holding the device "in reset" (possibly indefinitely or
>>>>>>> even "repeatedly") should be safe(r?)
>>>>>>
>>>>>>
>>>>>> Presumably this is for cases where the device is so far gone that
>>>>>> you want to hit the big-red-switch. If you want more
>>>>>> sophistication, you can put a controlling microprocessor on the
>>>>>> device, and have that powered by PoE, and it could do things like
>>>>>> force a reset, or actually power the device off if necessary.
>>>>>
>>>>>
>>>>> Aircraft systems have an interesting parallel. Almost everything
>>>>> have its power disconnected via a circuit breaking in the cockpit.
>>>>> In ye olde days, these were actually breakers wired into the circuit
>>>>> mounted on a panel (or several) in the cockpit, or a simple
>>>>> remote-operated breaker (usually for heavy loads). On recent
>>>>> aircraft, most of this is driven by the flight management system,
>>>>> which will pop up a little message saying it's pulled a breaker (if
>>>>> it happens automatically), or has a screen where you can pick a
>>>>> breaker to pull, and the breakers themselves are often located in a
>>>>> more convenient physical location (presumably near the circuit
>>>>> they're protecting), and they're controlled remotely.
>>>>>
>>>>> In the past is was not uncommon for the flight crew to attempt to
>>>>> cycle a breaker after a failure, but the modern policy is to just
>>>>> leave it alone (and powered off), and let maintenance deal with it
>>>>> on the ground. Obviously with exceptions where the loss of the
>>>>> system in question can be considered more dangerous than the
>>>>> possibility of a fire or other really bad result from the failing
>>>>> device.
>>>>
>>>> https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
>>>
>>>
>>> I'm not sure that exactly applies. If the CB was pulled by maintenance
>>> or the pilots, the flight should never have been started in that
>>> condition (I don't think the configuration warning system is MEL-able).
>>> If it tripped because of an actual overload, well, what else would you
>>> have it do? You could make a case for lack of redundancy. And if it
>>> failed in such a way that it was open, but gave no indication, again,
>>> that doesn't really apply, except perhaps to suggest the need for
>>> additional redundancy.
>>>
>>> In at least the first case, modern systems would likely have made it
>>> much harder to miss the pulled breaker, and might well have helped in
>>> the third case.
>>>
>>> In any event, the configuration error was the cause of the accident, not
>>> the failure of the configuration warning system. And that actually
>>> supports *my* point - let's say we were looking at the second case,
>>> there could actually be a fire risk that the tripped breaker is
>>> removing, vs. the pilots doing something really stupid, like taking off
>>> without flaps.
>>
>> I think it underscores the fact that handling MULTIPLE errors is always
>> problematic. Had the preflight check been completed (an error in itself),
>> would the "problem" have gone unnoticed?
>
> From the TV documentary about the crash it wasn't uncommon for pilots to
> pull the breaker on configuration warning system because of false warnings
> while taxing a bit fast. If that was why it wasn't on we'll never know
"The National Transportation Safety Board determines that the probable cause of the accident was the flightcrew&#4294967295;s failure to use the taxi checklist to ensure that the flaps and slats were extended for takeoff." Error #1 which MASKS error #2 (or, which allows error #2 to be fatal): "Contributing to the accident was the absence of electrical power to the airplane takeoff warning system which thus did not warn the flightcrew that the airplane was not configured properly for takeoff. The reason for the absence of electrical power could not be determined."
> afaiu the outcome was a change to warming system so there would be less
> false warmings and the checklist split in smaller sections so it wasn't such
> much work to start over with a section, as you are supposed to when
> disturbed in the middle
This is the same sort of reasoning that goes into the installation of other warning devices. E.g., you would *think* that The Kitchen would be a great place to locate a smoke detector (as there are ignition sources, there). But, doing so causes too many false alarms -- which leads to folks disabling the detector.
Reply by April 22, 2016
On Saturday, April 23, 2016 at 02:48:15 UTC+2, Don Y wrote:
> On 4/22/2016 4:47 PM, Robert Wessel wrote:
> > On Fri, 22 Apr 2016 04:23:55 -0700 (PDT),
> > lasselangwadtchristensen@gmail.com wrote:
> >
> >> On Friday, April 22, 2016 at 05:53:11 UTC+2, robert...@yahoo.com wrote:
> >>> On Thu, 21 Apr 2016 22:34:06 -0500, Robert Wessel
> >>> <robertwessel2@yahoo.com> wrote:
> >>>
> >>>> On Wed, 20 Apr 2016 22:10:20 -0700, Don Y
> >>>> <blockedofcourse@foo.invalid> wrote:
> >>>>
> >>>>> On 4/20/2016 9:19 PM, Robert Wessel wrote:
> >>>>>
> >>>>>> There's the obvious solution of using the power from the PoE PSE to
> >>>>>> drive an enable of some sort to the device's power supply. Heck use
> >>>>>> that to energize a relay you've put across the mains input (some way
> >>>>>> of overriding that at the device would probably be prudent).
> >>>>>
> >>>>> If the device is NOT PoE powered, it's probably because it represents a
> >>>>> substantial load (25+W?). I'm not sure it would be prudent to let
> >>>>> something remotely disconnect power (and possibly reapply it, moments
> >>>>> later) for large loads.
> >>>>>
> >>>>> OTOH, holding the device "in reset" (possibly indefinitely or even
> >>>>> "repeatedly") should be safe(r?)
> >>>>
> >>>>
> >>>> Presumably this is for cases where the device is so far gone that you
> >>>> want to hit the big-red-switch. If you want more sophistication, you
> >>>> can put a controlling microprocessor on the device, and have that
> >>>> powered by PoE, and it could do things like force a reset, or actually
> >>>> power the device off if necessary.
> >>>
> >>>
> >>> Aircraft systems have an interesting parallel. Almost everything have
> >>> its power disconnected via a circuit breaking in the cockpit. In ye
> >>> olde days, these were actually breakers wired into the circuit mounted
> >>> on a panel (or several) in the cockpit, or a simple remote-operated
> >>> breaker (usually for heavy loads). On recent aircraft, most of this
> >>> is driven by the flight management system, which will pop up a little
> >>> message saying it's pulled a breaker (if it happens automatically), or
> >>> has a screen where you can pick a breaker to pull, and the breakers
> >>> themselves are often located in a more convenient physical location
> >>> (presumably near the circuit they're protecting), and they're
> >>> controlled remotely.
> >>>
> >>> In the past is was not uncommon for the flight crew to attempt to
> >>> cycle a breaker after a failure, but the modern policy is to just
> >>> leave it alone (and powered off), and let maintenance deal with it on
> >>> the ground. Obviously with exceptions where the loss of the system in
> >>> question can be considered more dangerous than the possibility of a
> >>> fire or other really bad result from the failing device.
> >>
> >> https://en.wikipedia.org/wiki/Northwest_Airlines_Flight_255
> >
> >
> > I'm not sure that exactly applies. If the CB was pulled by
> > maintenance or the pilots, the flight should never have been started
> > in that condition (I don't think the configuration warning system is
> > MEL-able). If it tripped because of an actual overload, well, what
> > else would you have it do? You could make a case for lack of
> > redundancy. And if it failed in such a way that it was open, but gave
> > no indication, again, that doesn't really apply, except perhaps to
> > suggest the need for additional redundancy.
> >
> > In at least the first case, modern systems would likely have made it
> > much harder to miss the pulled breaker, and might well have helped in
> > the third case.
> >
> > In any event, the configuration error was the cause of the accident,
> > not the failure of the configuration warning system. And that
> > actually supports *my* point - let's say we were looking at the second
> > case, there could actually be a fire risk that the tripped breaker is
> > removing, vs. the pilots doing something really stupid, like taking
> > off without flaps.
>
> I think it underscores the fact that handling MULTIPLE errors is
> always problematic. Had the preflight check been completed
> (an error in itself), would the "problem" have gone unnoticed?
From the TV documentary about the crash, it wasn't uncommon for pilots to pull the breaker on the configuration warning system because of false warnings while taxiing a bit fast. Whether that was why it wasn't on, we'll never know.

AFAIU the outcome was a change to the warning system so there would be fewer false warnings, and the checklist was split into smaller sections so it wasn't so much work to start over with a section, as you are supposed to when disturbed in the middle.

-Lasse