
Did NT cause the Yorktown's systems to fail?

Started by Guy Macon July 30, 2005


Jerry Avins wrote:

>The navy used Windows NT to run a heavy cruiser.
>They had to tow the Yorktown home from its shakedown cruise.

They could have had the same problem with any OS, including a RTOS.

They used a single NT Terminal Server with a bunch of workstations
spread across the ship, all talking to one application that controls
all ship functions including propulsion and armament. With no way to
take manual control. And no backups. And a database that anyone
can modify, with the "feature" that a particular wrong value in the
database crashes the server and keeps crashing it upon reboot. And
a bored sailor making modifications to the database all day.

That level of screwup transcends OS choice.

For a revolting PR spin that nevertheless has the basic facts, look here:

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33914
http://thomas.loc.gov/cgi-bin/query/R?r106:FLD001:E00659

(Note how they blame the user? Now imagine a sailor putting in the
fix suggested while the ship is under fire...)

For the whole story, look here:

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33541
Smart Ship inquiry a go -August 31, 1998

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33914
Navy: Calibration flaw crashed Yorktown LAN -November 9, 1998

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=34014
It's a smart lesson -November 23, 1998

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33639
Control-system designers say newer version could have prevented LAN crash -Dec 14 1998

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=34351
LETTER TO THE GCN EDITOR -January 11, 1999

http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=903
Navy cruises on with ATM -November 8, 1999

--
Guy Macon <http://www.guymacon.com/>, misc.business.product-dev Moderator


Guy Macon wrote:
> Jerry Avins wrote:
>
> >The navy used Windows NT to run a heavy cruiser.
> >They had to tow the Yorktown home from its shakedown cruise.
>
> They could have had the same problem with any OS, including a RTOS.
>
> They used a single NT Terminal Server with a bunch of workstations
> spread across the ship, all talking to one application that controls
> all ship functions including propulsion and armament. With no way to
> take manual control. And no backups. And a database that anyone
> can modify, with the "feature" that a particular wrong value in the
> database crashes the server and keeps crashing it upon reboot. And
> a bored sailor making modifications to the database all day.
>
> That level of screwup transcends OS choice.
>
> For a revolting PR spin that nevertheless has the basic facts, look here:
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33914
> http://thomas.loc.gov/cgi-bin/query/R?r106:FLD001:E00659
>
> (Note how they blame the user? Now imagine a sailor putting in the
> fix suggested while the ship is under fire...)
>
> For the whole story, look here:
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33541
> Smart Ship inquiry a go -August 31, 1998
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33914
> Navy: Calibration flaw crashed Yorktown LAN -November 9, 1998
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=34014
> It's a smart lesson -November 23, 1998
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=33639
> Control-system designers say newer version could have prevented LAN crash -Dec 14 1998
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=34351
> LETTER TO THE GCN EDITOR -January 11, 1999
>
> http://appserv.gcn.com/cgi-bin/udt/im.display.printable?client.id=gcn2&story.id=903
> Navy cruises on with ATM -November 8, 1999

"A systems administrator fed bad data into the ship's Remote Database
Manager, which caused a buffer overflow when the software tried to
divide by zero. The overflow crashed computers on the LAN and caused
the Yorktown to lose control of its propulsion system, Navy officials
said."

"Because the ships' new propulsion control system was developed
quickly, his programmers knew there were inherent risks, Rushton said"

"NT was never the cause of any problem on the ship, Rushton said. The
problems were all in programs, database and code within the individual
pieces of software that we were using"

"Using Windows NT, which is known to have some failure modes, on a
warship is similar to hoping that luck will be in our favor, wrote
Anthony DiGiorgio, an engineer with the Atlantic Fleet Technical
Support Center, in a June 1998 article titled "The Smart Ship is Not
The Answer.""

It seems the problem has two main causes:

- The system was not designed to cope with the failure of the
  database server.
- The system was developed "quickly".

It is common practice to have separate buses on systems with safety
constraints. One bus is the safety bus, and only certified devices are
connected to it. Another is the supervisory bus, where operators can
monitor and control the installation. There can be yet another bus for
deferred tasks such as data analysis. Gateways between the buses
ensure that errors do not propagate. A gateway between the safety bus
and another bus ensures that data coming from the non-safety bus
cannot compromise safety.

If an installation has identified safety constraints, Windows NT
should not be used as the central system; only certified safety
devices should be in control. However, the use of safety devices does
not in itself ensure that safety errors cannot occur. The whole
installation must be certified. A "quick" development is not
compatible with this kind of constraint. This was not a programming
error but a system design error.
Windows NT has advantages at some levels of a system. The wide range
of powerful software available for it and its connectivity with many
peripheral devices make it a system of choice for some tasks. The fact
that it is promoted for safety functions is probably the result of
budgetary constraints becoming central while technical constraints are
neglected. This happens more and more, and can be related to the
emphasis put on revenue and to managers' lack of expertise in
technical fields.
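
To make the gateway idea in this post concrete, here is a minimal Python sketch. It is only an illustration: the signal names, the accepted ranges, and the `gateway_forward` function are all made up, and nothing here reflects how the Yorktown's systems or any certified safety gateway were actually built. The point is simply that a message from the supervisory side crosses onto the safety bus only if it is on a whitelist and within a known-safe range.

```python
from dataclasses import dataclass

# Hypothetical whitelist: signal name -> (min, max) accepted on the safety bus.
SAFE_RANGES = {
    "throttle_percent": (0.0, 100.0),
    "valve_position": (0.0, 1.0),
}


@dataclass(frozen=True)
class Message:
    signal: str
    value: float


def gateway_forward(msg, safe_ranges=SAFE_RANGES):
    """Return msg if it may cross onto the safety bus, otherwise None."""
    limits = safe_ranges.get(msg.signal)
    if limits is None:
        return None  # unknown signal: never forwarded
    if isinstance(msg.value, bool) or not isinstance(msg.value, (int, float)):
        return None  # wrong type: never forwarded
    low, high = limits
    # NaN fails every comparison, so it is rejected here as well.
    if not (low <= msg.value <= high):
        return None
    return msg
```

A rejected message is dropped at the gateway, so a bad entry on the supervisory side stays on the supervisory side.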
Guy Macon wrote:
> Jerry Avins wrote:
>
>>The navy used Windows NT to run a heavy cruiser.
>>They had to tow the Yorktown home from its shakedown cruise.
>
> They could have had the same problem with any OS, including a RTOS.

Sheesh, Guy, the silliness of your conclusion is mind-boggling!

There have been high-integrity systems, including RTOS's, built since
the '80's that were immune to that sort of _system_ error, much less
_user_ error.

Except for the suits who were trying to justify their selection of
hardware, software, and integrator vendors, all the responses to the
article, even in your list of URL's, recognized that the affair showed a
catastrophic lack of professionalism in the selection, design,
implementation, and testing of the whole system. Note my use of order
here -- it's obvious that selection by suits took place before
engineering. Some of the corrections were listed, and I'm sure others
were buried too deep to be exposed.

John Perry


John Perry wrote:
>
>Guy Macon wrote:
>
>> Jerry Avins wrote:
>>
>>>The navy used Windows NT to run a heavy cruiser.
>>>They had to tow the Yorktown home from its shakedown cruise.
>>
>> They could have had the same problem with any OS, including a RTOS.
>
>Sheesh, Guy, the silliness of your conclusion is mind-boggling!
>
>There have been high-integrity systems, including RTOS's, built since
>the '80's that were immune to that sort of _system_ error, much less
>_user_ error.
>
>Except for the suits who were trying to justify their selection of
>hardware, software, and integrator vendors, all the responses to the
>article, even in your list of URL's, recognized that the affair showed a
>catastrophic lack of professionalism in the selection, design,
>implementation, and testing of the whole system. Note my use of order
>here -- it's obvious that selection by suits took place before
>engineering. Some of the corrections were listed, and I'm sure others
>were buried too deep to be exposed.

It appears to me that, your comment about mind-boggling silliness
notwithstanding, you agree with my conclusion - that the OS was not
the problem.

In my opinion, a system with the same catastrophic lack of
professionalism in the selection, design, implementation, and
testing of the whole system but with a different OS would still
have failed. The specific failure mode would have been different,
but simply applying a magic bullet of using another OS would have
done nothing to address the core problem of a bad system design.

There are problems that can be fixed with a different OS. A monolithic
application written in Ada that controls a database and engine propulsion
and which crashes when someone tells the database that valve X is closed
is not one of them.
On 31 Jul 2005 07:10:09 GMT, Guy Macon
<_see.web.page_@_www.guymacon.com_> wrote:

>John Perry wrote:
>>
>>Guy Macon wrote:
>>
>>> Jerry Avins wrote:
>>>
>>>>The navy used Windows NT to run a heavy cruiser.
>>>>They had to tow the Yorktown home from its shakedown cruise.
>>>
>>> They could have had the same problem with any OS, including a RTOS.
>>
>>Sheesh, Guy, the silliness of your conclusion is mind-boggling!
>>
>>There have been high-integrity systems, including RTOS's, built since
>>the '80's that were immune to that sort of _system_ error, much less
>>_user_ error.
>>
>>Except for the suits who were trying to justify their selection of
>>hardware, software, and integrator vendors, all the responses to the
>>article, even in your list of URL's, recognized that the affair showed a
>>catastrophic lack of professionalism in the selection, design,
>>implementation, and testing of the whole system. Note my use of order
>>here -- it's obvious that selection by suits took place before
>>engineering. Some of the corrections were listed, and I'm sure others
>>were buried too deep to be exposed.
>
>It appears to me that, your comment about mind-boggling silliness
>notwithstanding, you agree with my conclusion - that the OS was not
>the problem.

[snipped]

His order is "Selection" first. That implies they chose the wrong OS.
If one starts off with the wrong OS, then even if all the following
steps are done as they should be, one will end up with a product that
fails. If one chooses the right OS and the rest is done incorrectly,
the end result will still be failure.

In a building, if the foundations are unsuitable for the final load,
it does not matter whether the rest is built even 1000% over the
required spec. The building is still unsafe and will fail. One can
still build an unsafe building when the foundations are within spec,
but one cannot build a safe building if the foundations are not up to
the job.

It is the same in a chain of logic. If a certain step is wrong, then
the rest is wrong no matter how correct it would be as a separate
chain of logic. Hence a mistake at the end of a chain influences the
final conclusion only a bit. A mistake at the beginning of the chain
makes the final conclusion totally meaningless.

Regards
Anton Erasmus
Guy Macon wrote:
> John Perry wrote:
>
>>Guy Macon wrote:
>>
>>>Jerry Avins wrote:
>>>
>>>>The navy used Windows NT to run a heavy cruiser.
>>>>They had to tow the Yorktown home from its shakedown cruise.
>>>
>>>They could have had the same problem with any OS, including a RTOS.
>>
>>Sheesh, Guy, the silliness of your conclusion is mind-boggling!
>>
>>There have been high-integrity systems, including RTOS's, built since
>>the '80's that were immune to that sort of _system_ error, much less
>>_user_ error.
>>...
>
> It appears to me that, your comment about mind-boggling silliness
> notwithstanding, you agree with my conclusion - that the OS was not
> the problem.

Well, no. The OS was _part_ of the problem.

Good programs protect themselves against such obvious and common
operator error. Good software libraries check the processor exceptions
and do reasonable recovery. Good networking processes don't lock up the
network for hours at a time. Good OS's don't hang without timing out
when the network is unavailable.

Notice that only one of those characteristics didn't depend upon NT.
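
The first two points above (programs that protect themselves against operator error, and code that recovers from a processor exception) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the accepted range, and the "fall back to the last good reading" policy are all hypothetical, and the real Smart Ship software was not written this way as far as anything in this thread shows.

```python
def validate_calibration(value, low=0.001, high=1000.0):
    """First line of defence: reject a bad entry before it is stored.

    The accepted range is a made-up example; zero and NaN both fail it.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("calibration must be a number")
    if not (low <= value <= high):
        raise ValueError(f"calibration {value!r} outside [{low}, {high}]")
    return float(value)


def scaled_reading(raw, calibration, last_good):
    """Second line of defence: survive a zero that slipped in anyway.

    Instead of letting the exception take the process down, fall back
    to the last known good reading (a hypothetical recovery policy).
    """
    try:
        return raw / calibration
    except ZeroDivisionError:
        return last_good
```

Either check alone would have turned a fatal entry into a rejected or ignored one; having both is the "caught at many places" idea.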
> In my opinion, a system with the same catastrophic lack of
> professionalism in the selection, design, implementation, and
> testing of the whole system but with a different OS would still
> have failed. The specific failure mode would have been different,
> but simply applying a magic bullet of using another OS would have
> done nothing to address the core problem of a bad system design.

This is where we can agree, assuming your statements above concede that
the whole network would not have collapsed with another system (which is
the point of all of us who blame NT for the collapse).

For a less Pollyanna-ish view of Smart Ship:

http://wired-vig.wired.com/news/technology/0,1282,13987,00.html
http://lists.essential.org/1998/am-info/msg03829.html
http://www.usni.org/Proceedings/articles98/digiorgio.htm
http://mae.pennnet.com/Articles/Article_Display.cfm?Section=Articles&Subsection=Display&ARTICLE_ID=96013&KEYWORD=%22smart%20ship%22%20postmortem
http://cse.stanford.edu/class/cs201/projects-99-00/critical-systems/military.htm

> There are problems that can be fixed with a different OS. A monolithic
> application written in Ada that controls a database and engine propulsion
> and which crashes when someone tells the database that valve X is closed
> is not one of them.

And no one disagrees, as far as I know. Except for the ideologues, our
point is that NT was a major part of the problem, and specifically the
complete collapse of the ship's systems was an NT problem. The user
input error should have been caught at many places during program
execution, and if NT had protected itself properly, even the unit that
had the bad data would not have crashed.

jp

By the way, sorry about the "silliness". Wrong is not necessarily
silly, and I do know better.
> John Perry wrote:
> By the way, sorry about the "silliness". Wrong is not necessarily
> silly, and I do know better.

And, of course, disagreeing is not necessarily "wrong", either. Between
the obvious political posturing and the possibility of disgruntled
underlings, we'll probably never know for sure. It's remotely possible
that the error could have caused the application code at each of the
terminals to lock up and ignore input from a properly functioning
network, although I've never seen any indication from even the Smart
Ship defenders that that's what happened. That's the only thing I can
think of that would exonerate NT.

jp


John Perry wrote:
>
>Guy Macon wrote:
>
>> John Perry wrote:
>>
>>>Guy Macon wrote:
>>>
>>>>Jerry Avins wrote:
>>>>
>>>>>The navy used Windows NT to run a heavy cruiser.
>>>>>They had to tow the Yorktown home from its shakedown cruise.
>>>>
>>>>They could have had the same problem with any OS, including a RTOS.
>>>
>>>There have been high-integrity systems, including RTOS's, built since
>>>the '80's that were immune to that sort of _system_ error, much less
>>>_user_ error.
>>>...
>>
>> It appears to me that [...] you agree with my conclusion - that
>> the OS was not the problem.
>
>Well, no. The OS was _part_ of the problem.

Oh yes indeed. NT made it easier for the idiots to screw up. A
different group who were non-idiots would have had a hard time making
a reliable system based on NT, and would have had a much easier time
making a reliable system based on a better OS. A good OS is necessary,
but not sufficient.

>Good programs protect themselves against such obvious and common
>operator error. Good software libraries check the processor exceptions
>and do reasonable recovery. Good networking processes don't lock up the
>network for hours at a time. Good OS's don't hang without timing out
>when the network is unavailable.
>
>Notice than only one of those characteristics didn't depend upon NT.
>
>> In my opinion, a system with the same catastrophic lack of
>> professionalism in the selection, design, implementation, and
>> testing of the whole system but with a different OS would still
>> have failed. The specific failure mode would have been different,
>> but simply applying a magic bullet of using another OS would have
>> done nothing to address the core problem of a bad system design.
>
>This is where we can agree, assuming your statements above concede that
>the whole network would not have collapsed with another system (which is
>the point of all of us who blame NT for the collapse).

It is my considered opinion that, even if the OS didn't collapse
(which shouldn't be possible, but is with NT), the *application* would
still have failed to control the engines. Not in the same way, but IMO
the whole project was a game of whack-a-mole; take away the "NT allows
the network to crash" failure mode and one of the thousands of other
failure modes would have eventually bitten them in the arse.

Just as one can write assembly in any language, so one can write an
unreliable engine control program under any OS. The bare fact that
they had no way to start the engines with the network down tells us
that the engine control program was unreliable. There are dozens of
engineers in this newsgroup who would have built in a fallback method
of manually controlling the engines - the classic shouting at the
engine room operator through a tube, for example.

"Full Speed Ahead, Mr. Perry!"
"Aye sir, Full Speed Ahead!"

So it's the fault of NT and it's the fault of the idiots who designed
the system. Just replacing the OS would not have made a good system,
but replacing the idiots with some professionals would have resulted
in a system that kept working even if NT crashed. Of course the choice
of NT was and still is an indicator of an idiot who does not
understand high-availability systems....
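
For what a built-in fallback might look like in software, here is a minimal Python sketch of a local engine controller that stops trusting the network once it has gone quiet and obeys a local manual order instead. Everything in it is hypothetical: the `EngineController` class, the two-second timeout, and the idea of a local throttle lever are illustrations, not a description of any real shipboard system.

```python
import time


class EngineController:
    """Hypothetical local controller that sits next to the engine."""

    def __init__(self, timeout_s=2.0, clock=time.monotonic):
        self._clock = clock            # injectable so it can be tested
        self._timeout_s = timeout_s    # assumed watchdog interval
        self._last_heard = clock()
        self._network_order = 0.0
        self._manual_order = 0.0

    def on_network_order(self, throttle):
        """Called whenever an order arrives from the central server."""
        self._network_order = throttle
        self._last_heard = self._clock()

    def set_manual_order(self, throttle):
        """Called by a local lever or panel; works with the LAN down."""
        self._manual_order = throttle

    def network_alive(self):
        return (self._clock() - self._last_heard) <= self._timeout_s

    def commanded_throttle(self):
        # Obey the network while it is fresh, the local lever otherwise.
        if self.network_alive():
            return self._network_order
        return self._manual_order
```

The design choice is that the decision is made at the engine, not at the server, so a dead server or LAN cannot take the fallback down with it.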
>For a less Pollyanna-ish view of Smart Ship:
>http://wired-vig.wired.com/news/technology/0,1282,13987,00.html
>http://lists.essential.org/1998/am-info/msg03829.html
>http://www.usni.org/Proceedings/articles98/digiorgio.htm
>http://mae.pennnet.com/Articles/Article_Display.cfm?Section=Articles&Subsection=Display&ARTICLE_ID=96013&KEYWORD=%22smart%20ship%22%20postmortem
>http://cse.stanford.edu/class/cs201/projects-99-00/critical-systems/military.htm

Thanks! Good references.

>> There are problems that can be fixed with a different OS. A monolithic
>> application written in Ada that controls a database and engine propulsion
>> and which crashes when someone tells the database that valve X is closed
>> is not one of them.
>
>And no one disagrees, as far as I know. Except for the ideologues, our
>point is that NT was a major part of the problem, and specifically the
>complete collapse of the ship's systems was an NT problem. The user
>input error should have been caught at many places during program
>execution, and if NT had protected itself properly, even the unit that
>had the bad data would not have crashed.

I agree 100%. Plenty of blame to lay upon NT here. Yet still,
replacing the OS would not, IMO, have resulted in a reliable ship.


John Perry wrote:
>
>> John Perry wrote:
>>
>> By the way, sorry about the "silliness". Wrong is not necessarily
>> silly, and I do know better.

Thanks. I was a bit put off by that.

>And, of course, disagreeing is not necessarily "wrong", either. Between
>the obvious political posturing and the possibility of disgruntled
>underlings, we'll probably never know for sure. It's remotely possible
>that the error could have caused the application code at each of the
>terminals to lock up and ignore input from a properly functioning
>network, although I've never seen any indication from even the Smart
>Ship defenders that that's what happened. That's the only thing I can
>think of that would exonerate NT.

What application code at each of the terminals?

Unless I am mistaken, this was not an NT Server system. It was an NT
*Terminal* Server system -- all applications running on the server,
all terminals stop working if the server goes down. At the other end
of the LAN they had remote operator station units (OSUs) to talk to
the humans and remote headless embedded data acquisition units (DAUs)
to talk to engines, damage control, etc. No way for a sailor to talk
to a DAU other than through the network. As I understand it, all OSUs
and DAUs were controlled by a single giant Ada program running on the
NT Terminal Server.

I think that they are calling a terminal server failure a network
failure. On an NT Server system, bringing down the network doesn't
bring down the workstations. On an NT Terminal Server system it does.