Learning From Engineering Failures

Steve Branam●July 29, 2021

Introduction

I'm an informal student of engineering failures. They guide a lot of my attitude and approach towards engineering.

This is rooted in two of my favorite quotes:

George Santayana: Those who cannot remember the past are condemned to repeat it.
Louis Pasteur: Chance favors the prepared mind.

and leads to the ultimate advice I offer people:

Trust nothing, and verify.

(See My Guiding Principles As An Engineer for the full list and interpretation.)

Here I'm passing along some useful resources for learning about engineering failures in the hope that you, too, will become a student of them, and apply what you learn to your work.

While I'm a software engineer and many of these are related to software, we can learn useful lessons from all engineering domains. They have many useful parallels.

Anybody can look good when things are going well, when things are working. Those are the happy paths, the happy day scenarios. But you need to think about what can go wrong. Those are the unhappy paths, the unhappy day scenarios. You need to anticipate and plan for them.
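In software terms, planning for the unhappy paths might look something like the minimal sketch below. The function name `parse_temperature`, the input format, and the range limits are all hypothetical, chosen only for illustration.

```python
import math
from typing import Optional

# Hypothetical plausible range for a sensor, in degrees Celsius (an assumption).
MIN_C = -40.0
MAX_C = 125.0


def parse_temperature(raw: str) -> Optional[float]:
    """Return the reading in Celsius, or None if the input can't be trusted."""
    # Unhappy path: the input is not a number at all (or not even a string).
    try:
        value = float(raw.strip())
    except (ValueError, AttributeError):
        return None
    # Unhappy path: float() accepts "nan" and "inf", which are not readings.
    if not math.isfinite(value):
        return None
    # Unhappy path: a number, but outside what the sensor could report.
    if not (MIN_C <= value <= MAX_C):
        return None
    # Happy path: a plausible reading.
    return value
```

Only the last line is the happy path. Everything above it exists because something can go wrong.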

Engineering failures can have truly horrific consequences. Death, injury, destruction. Legal and financial disaster. The people involved may carry lifelong guilt, some justified, some not. The ripples and effects can be widespread and long lasting.

So it's important to study the failures, learn the lessons they offer, and apply them at the appropriate points. That offers the best probability of avoiding or mitigating future failures.

Life is full of risk, and if we allow risk aversion to paralyze us, we'll never accomplish anything. But that doesn't mean we can't do something about the risk.

Human Error

One theme that comes up repeatedly is that many incidents of "human error" have little to do with the poor human who had the bad luck to be at the controls when the incident occurred.

Instead, they are much more complex and nuanced cases of accidents that were waiting to happen. Sure, the operator was the one who performed the triggering action (or failed to perform some required action), but it's really a case of systemic problems, in a layered set of systems.

Why did the operator think that was an appropriate thing to do (or not do the appropriate thing)?
Why did the system allow the operator to do that?
Why was the system designed and built in a way that allowed that to be done?
Why was the system operated in this manner?
Why did management allow the system to be operated in this manner?
Why did the regulatory environment allow the system to be operated in this manner?
Why were lurking problems not detected before the failure?

While we think of "the system" as the immediate technical system or equipment being used, there are many interacting layers of smaller "systems" involving many parties that make up the overall "system":

Engineering design and development system.
Manufacturing and construction system.
Installation and deployment system.
Operation system.
Monitoring system.
Verification system.
Inspection system.
Maintenance system.
Training system.
Assessment system.
Certification system.
Management system.
Regulatory system.

Each one of these is a place to catch things and prevent failures, acting as a series of safety nets. Checks and double-checks, so that one simple action or missed action doesn't trigger disaster.
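As a rough software analogy for these layered safety nets, the sketch below runs a command through several independent checks, any of which can stop it. The command fields, check names, and limits are hypothetical, not taken from any real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ValveCommand:
    # All fields are hypothetical, for illustration only.
    operator_certified: bool
    percent_open: float
    system_in_maintenance: bool


# Each check returns None if it passes, or a reason string if it objects.
Check = Callable[[ValveCommand], Optional[str]]


def check_training(cmd: ValveCommand) -> Optional[str]:
    return None if cmd.operator_certified else "operator is not certified"


def check_range(cmd: ValveCommand) -> Optional[str]:
    if 0.0 <= cmd.percent_open <= 100.0:
        return None
    return "valve position out of range"


def check_maintenance(cmd: ValveCommand) -> Optional[str]:
    return "system is under maintenance" if cmd.system_in_maintenance else None


LAYERS: List[Check] = [check_training, check_range, check_maintenance]


def review(cmd: ValveCommand) -> List[str]:
    """Run every layer and collect all objections, not just the first."""
    objections = []
    for check in LAYERS:
        reason = check(cmd)
        if reason is not None:
            objections.append(reason)
    return objections
```

An empty list from `review` means no layer objected. Collecting every objection, rather than stopping at the first, reflects the idea that one bad command can expose gaps in several layers at once.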

Understanding the true root causes and contributing factors is the only way we can determine where to apply the lessons learned, often at multiple layers.

Simply blaming a failure on human error and leaving it at that just means it will happen again. To some other poor unlucky human at the controls. Because none of the actual things that contributed to the failure will have been addressed; they will continue to sit there, buried in those layers, waiting for the next time.

Nancy Leveson (see Books below): "Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen."

Risks Digest

This all started for me with Risks Digest some 30 years ago. The full title is Forum on Risks to the Public in Computers and Related Systems, ACM Committee on Computers and Public Policy, but readers simply refer to it as Risks. It's moderated by Peter G. Neumann, affectionately known as PGN.

Risks is essentially an online news clipping service where contributors send in news items they've noticed related to computers, sometimes with commentary or excerpts, sometimes just simple links. The result is an amazing collection of information and starting points to examine all kinds of risks and failures.

It includes full archives going back to 1985. Reading through those, it's depressing to see the same things cropping up again and again. Sometimes the underlying technologies have advanced, but the same problems repeat. The big lesson there is that we have failed to heed Santayana's warning, and continue to do so.

Security (as in cybersecurity) is a major long-running theme (everything I know about cybersecurity started with items in Risks). The other long-running themes are things not working right, breaking down, collapsing, or tearing apart.

Risks formed the root of my study tree. It sensitized me to a number of issues and directed me to several authors worth reading. Much of my awareness of things started there, then expanded as I read things I found mentioned there and followed the references they mentioned.

I highly recommend dipping into Risks (or jumping into the deep end of the archives with Volume 1, Issue 1 and going from there). It's harrowing, frightening, and enlightening.

Resources

The rest of this post is a set of resources that I've used to learn about engineering failures. A few are a bit academic, some are aimed at a technical audience, and some are meant for a general audience. There's some duplication in coverage. But they're all interesting. There are many more once you start looking.

These are not rumor-mongering, conspiracy theories, disinformation, or misinformation. They focus on facts and reasoned analysis.

A lot of this stuff will scare the crap out of you. Take deep breaths. Keep calm and become an IEEE member.

Websites

Risks Digest (items from the other websites listed below often show up in Risks).
Threatpost: focused on security.
Krebs On Security: focused on security.
Crypto-Gram: Schneier On Security: focused on security.
The Register: technology news, sometimes with a wry twist.

Blog Posts, Articles, Papers, and Presentations

Philip Koopman
- A Case Study of Toyota Unintended Acceleration and Software Safety
Michael Barr
- Toyota Expert Witness Case Study
- An Update on Toyota and Unintended Acceleration
Phillip Johnston
- What can Software Organizations Learn from the Boeing 737 MAX Saga?

Podcasts

John Chidgey
- Causality: "Chain of Events. Cause and Effect. We analyse what went right and what went wrong as we discover that many outcomes can be predicted, planned for and even prevented."

Books

Peter G. Neumann
- Computer-Related Risks: drawn from the first 10 years of Risks.
Nancy Leveson (appeared in Risks Volume 1, Issue 1)
- Engineering a Safer World: Systems Thinking Applied to Safety: free PDF downloads for each chapter. See my review.
- Safeware
Sidney Dekker
- The Field Guide to Understanding 'Human Error'
- Drift into Failure: From Hunting Broken Components to Understanding Complex Systems
Eric Schlosser
- Command and Control: the explosion in 1980 of a Titan II nuclear missile in its silo, and related mishaps.

TV Shows

Engineering Catastrophes
Engineering Disasters
Deadly Engineering
American Experience: Command and Control: documentary based on Schlosser's book.
Chernobyl
Frontline: Boeing's Fatal Flaw
