Learning From Engineering Failures
I'm an informal student of engineering failures. They guide a lot of my attitude and approach towards engineering.
This is rooted in two of my favorite quotes:
- George Santayana: Those who cannot remember the past are condemned to repeat it.
- Louis Pasteur: Chance favors the prepared mind.
and leads to the ultimate advice I offer people:
- Trust nothing, and verify.
(See My Guiding Principles As An Engineer for the full list and interpretation.)
Here I'm passing along some useful resources for learning about engineering failures in the hope that you, too, will become a student of them, and apply what you learn to your work.
While I'm a software engineer and many of these are related to software, we can learn useful lessons from all engineering domains. They have many useful parallels.
Anybody can look good when things are going well, when things are working. Those are the happy paths, the happy day scenarios. But you need to think about what can go wrong. Those are the unhappy paths, the unhappy day scenarios. You need to anticipate and plan for them.
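As a small software illustration of planning for the unhappy paths, here is a minimal sketch in Python. The sensor, the function name parse_temperature, and the range limits are all hypothetical, chosen only for the example. The point is that the happy path is a single line, and everything else is anticipating what can go wrong.

```python
import math
from typing import Optional

# Hypothetical limits for an illustrative temperature sensor (assumed values).
MIN_CELSIUS = -40.0
MAX_CELSIUS = 125.0


def parse_temperature(raw: Optional[str]) -> float:
    """Return a validated reading in Celsius, or raise ValueError."""
    # Unhappy path: the sensor or link delivered nothing at all.
    if raw is None or not raw.strip():
        raise ValueError("no reading received")

    # Unhappy path: the text is not a number (noise, truncation, wrong format).
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"reading is not a number: {raw!r}") from None

    # Unhappy path: float() accepts "nan" and "inf", which are not usable readings.
    if not math.isfinite(value):
        raise ValueError(f"reading is not finite: {raw!r}")

    # Unhappy path: a well-formed number outside what the sensor can report.
    if not MIN_CELSIUS <= value <= MAX_CELSIUS:
        raise ValueError(f"reading out of range: {value}")

    # Happy path: only reached after every check above has passed.
    return value
```

A caller still has to decide what to do when ValueError is raised (retry, fall back to a safe state, alert someone), and that decision is part of the planning too.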
Engineering failures can have truly horrific consequences. Death, injury, destruction. Legal and financial disaster. The people involved may carry lifelong guilt, some justified, some not. The ripples and effects can be widespread and long lasting.
So it's important to study the failures, learn the lessons they offer, and apply them at the appropriate points. That offers the best probability of avoiding or mitigating future failures.
Life is full of risk, and if we allow risk aversion to paralyze us, we'll never accomplish anything. But that doesn't mean we can't do something about the risk.
One theme that comes up repeatedly is that incidents attributed to "human error" often have little to do with the poor human who had the bad luck to be at the controls when the incident occurred.
Instead, they are much more complex and nuanced cases of accidents that were waiting to happen. Sure, the operator was the one who performed the triggering action (or failed to perform some required action), but it's really a case of systemic problems, in a layered set of systems.
- Why did the operator think that was an appropriate thing to do?
- Why did the system allow the operator to do that?
- Why was the system designed and built in a way that allowed that to be done?
- Why was the system operated in this manner?
- Why did management allow the system to be operated in this manner?
- Why did the regulatory environment allow the system to be operated in this manner?
- Why were lurking problems not detected before the failure?
While we think of "the system" as the immediate technical system or equipment being used, there are many interacting layers of smaller "systems" involving many parties that make up the overall "system":
- Engineering design and development system.
- Manufacturing and construction system.
- Installation and deployment system.
- Operation system.
- Inspection system.
- Maintenance system.
- Training system.
- Assessment system.
- Certification system.
- Management system.
- Regulatory system.
Each one of these is a place to catch things and prevent failures, acting as a series of safety nets.
Understanding the true root causes and contributing factors is the only way we can determine where to apply the lessons learned, which often means at multiple layers.
Simply blaming a failure on human error and leaving it at that just means it will happen again. To some other poor unlucky human at the controls. Because none of the actual things that contributed to the failure will have been addressed; they will continue to sit there, buried in those layers, waiting for the next time.
Nancy Leveson (see Books below): "Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen."
This all started for me with Risks Digest some 30 years ago. The full title is Forum on Risks to the Public in Computers and Related Systems, ACM Committee on Computers and Public Policy, but readers simply refer to it as Risks. It's moderated by Peter G. Neumann, affectionately known as PGN.
Risks is essentially an online news clipping service where contributors send in news items they've noticed related to computers, sometimes with commentary or excerpts, sometimes just simple links. The result is an amazing collection of information and starting points to examine all kinds of risks and failures.
It includes full archives going back to 1985. Reading through those, it's depressing to see the same things cropping up again and again. Sometimes the underlying technologies have advanced, but the same problems keep repeating. The big lesson there is that we have failed to heed Santayana's warning, and continue to do so.
Security (as in cybersecurity) is a major long-running theme (everything I know about cybersecurity started with an item in Risks). So are things not working right, breaking down, collapsing, or tearing apart.
Risks formed the root of my study tree. It sensitized me to a number of issues and directed me to several authors worth reading. Much of my awareness of things started there, then expanded as I read things I found mentioned there and followed the references they mentioned.
I highly recommend dipping into Risks (or jumping into the deep end of the archives with Volume 1, Issue 1 and going from there). It's harrowing, frightening, and enlightening.
The rest of this post is a set of resources that I've used to learn about engineering failures. A few are a bit academic, some are aimed at a technical audience, and some are meant for a general audience. There's some duplication in coverage. But they're all interesting. There are many more once you start looking.
These are not rumor-mongering, conspiracy theories, disinformation, or misinformation. They focus on facts and reasoned analysis.
A lot of this stuff will scare the crap out of you. Take deep breaths. Keep calm and become an IEEE member.
Websites
- Risks Digest (items from the remaining websites often show up in Risks).
- Threatpost: focused on security.
- Krebs On Security: focused on security.
- Crypto-Gram: Schneier On Security: focused on security.
- The Register: technology news, sometimes with a wry twist.
Blog Posts, Articles, Papers, and Presentations
- Philip Koopman
- Michael Barr
- Phillip Johnston
- John Chidgey
  - Causality: "Chain of Events. Cause and Effect. We analyse what went right and what went wrong as we discover that many outcomes can be predicted, planned for and even prevented."
Books
- Peter G. Neumann
  - Computer-Related Risks: drawn from the first 10 years of Risks.
- Nancy Leveson (appeared in Risks Volume 1, Issue 1)
  - Engineering a Safer World: Systems Thinking Applied to Safety: free PDF downloads for each chapter. See my review.
- Sidney Dekker
- Eric Schlosser
  - Command and Control: the explosion in 1980 of a Titan II nuclear missile in its silo, and related mishaps.