
Learning From Engineering Failures

Steve BranamJuly 29, 2021

Introduction

I'm an informal student of engineering failures. They guide a lot of my attitude and approach towards engineering.

This is rooted in two of my favorite quotes:

  • George Santayana: Those who do not remember the past are condemned to repeat it.
  • Louis Pasteur: Chance favors the prepared mind.

and leads to the ultimate advice I offer people:

  • Trust nothing, and verify.

(See My Guiding Principles As An Engineer for the full list and interpretation.)

Here I'm passing along some useful resources for learning about engineering failures in the hope that you, too, will become a student of them, and apply what you learn to your work.

While I'm a software engineer and many of these resources relate to software, we can learn useful lessons from all engineering domains; they have many parallels.

Anybody can look good when things are going well, when things are working. Those are the happy paths, the happy day scenarios. But you need to think about what can go wrong. Those are the unhappy paths, the unhappy day scenarios. You need to anticipate and plan for them.
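For example, here's a minimal sketch in C of what planning for the unhappy path looks like in code. The read_sensor() function and its error codes are hypothetical, purely for illustration: the point is that every call that can fail returns a status, and the caller checks it instead of assuming success.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    /* Hypothetical sensor-read helper: returns 0 on success, negative errno on failure. */
    static int read_sensor(int channel, int *value_out)
    {
        if (value_out == NULL || channel < 0) {
            return -EINVAL;      /* reject bad arguments rather than trusting the caller */
        }
        /* ... hardware access would go here; pretend it succeeded ... */
        *value_out = 42;
        return 0;
    }

    int main(void)
    {
        int value;
        int err = read_sensor(3, &value);
        if (err != 0) {          /* unhappy path: report and recover, don't plow ahead */
            fprintf(stderr, "sensor read failed: %s\n", strerror(-err));
            return 1;
        }
        printf("sensor value: %d\n", value); /* happy path */
        return 0;
    }

The happy path is two lines; most of the code exists to handle the ways things can go wrong. That ratio is typical of robust systems.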

Engineering failures can have truly horrific consequences. Death, injury, destruction. Legal and financial disaster. The people involved may carry lifelong guilt, some justified, some not. The ripples and effects can be widespread and long lasting.

So it's important to study the failures, learn the lessons they offer, and apply them at the appropriate points. That offers the best probability of avoiding or mitigating future failures.

Life is full of risk, and if we allow risk aversion to paralyze us, we'll never accomplish anything. But that doesn't mean we can't do something about the risk.


Human Error

One theme that comes up repeatedly is that incidents of "human error" often have little to do with the poor human who had the bad luck to be at the controls when the incident occurred.

Instead, they are much more complex and nuanced cases of accidents that were waiting to happen. Sure, the operator was the one who performed the triggering action (or failed to perform some required action), but it's really a case of systemic problems, in a layered set of systems.

  • Why did the operator think that was an appropriate thing to do (or not do the appropriate thing)?
  • Why did the system allow the operator to do that?
  • Why was the system designed and built in a way that allowed that to be done?
  • Why was the system operated in this manner?
  • Why did management allow the system to be operated in this manner?
  • Why did the regulatory environment allow the system to be operated in this manner?
  • Why were lurking problems not detected before the failure?

While we think of "the system" as the immediate technical system or equipment being used, there are many interacting layers of smaller "systems" involving many parties that make up the overall "system":

  • Engineering design and development system.
  • Manufacturing and construction system.
  • Installation and deployment system.
  • Operation system.
  • Monitoring system.
  • Verification system.
  • Inspection system.
  • Maintenance system.
  • Training system.
  • Assessment system.
  • Certification system.
  • Management system.
  • Regulatory system.

Each one of these is a place to catch things and prevent failures, acting as a series of safety nets. Checks and double-checks, so that one simple action or missed action doesn't trigger disaster.

Understanding the true root causes and contributing factors is the only way to determine where, across those multiple layers, to apply the lessons learned.

Simply blaming a failure on human error and leaving it at that just means it will happen again. To some other poor unlucky human at the controls. Because none of the actual things that contributed to the failure will have been addressed; they will continue to sit there, buried in those layers, waiting for the next time.

Nancy Leveson (see Books below): "Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen."


Risks Digest

This all started for me with Risks Digest some 30 years ago. The full title is Forum on Risks to the Public in Computers and Related Systems, ACM Committee on Computers and Public Policy, but readers simply refer to it as Risks. It's moderated by Peter G. Neumann, affectionately known as PGN.

Risks is essentially an online news clipping service where contributors send in news items they've noticed related to computers, sometimes with commentary or excerpts, sometimes just simple links. The result is an amazing collection of information and starting points to examine all kinds of risks and failures.

It includes full archives going back to 1985. Reading through those, it's depressing to see the same things cropping up again and again. Often only the underlying technologies have advanced; the problems themselves repeat. The big lesson there is that we have failed to heed Santayana's warning, and continue to do so.

Security (as in cybersecurity) is a major long-running theme (everything I know about cybersecurity started with items in Risks). The other long-running themes are things not working right, breaking down, collapsing, or tearing apart.

Risks formed the root of my study tree. It sensitized me to a number of issues and pointed me to several authors worth reading. Much of my awareness started there, then expanded as I followed the items it mentioned and the references those led to in turn.

I highly recommend dipping into Risks (or jumping into the deep end of the archives with Volume 1, Issue 1 and going from there). It's harrowing, frightening, and enlightening.


Resources

The rest of this post is a set of resources that I've used to learn about engineering failures. A few are a bit academic, some are aimed at a technical audience, and some are meant for a general audience. There's some duplication in coverage. But they're all interesting. There are many more once you start looking.

These are not rumor-mongering, conspiracy theories, disinformation, or misinformation. They focus on facts and reasoned analysis.

A lot of this stuff will scare the crap out of you. Take deep breaths. Keep calm and become an IEEE member.

Websites

Blog Posts, Articles, Papers, and Presentations

Podcasts

  • John Chidgey
    • Causality: "Chain of Events. Cause and Effect. We analyse what went right and what went wrong as we discover that many outcomes can be predicted, planned for and even prevented."

Books

TV Shows



