
Written in Blood: Case Studies of Systems Engineering Failure

Our world and our systems are safer than ever. A major reason why is that we’ve learned from prior mistakes. Many of our practices, rules, and standards are “written in blood” from past, tragic failures.

We learn so that we don’t repeat the same mistakes. Of course, we first identify the proximate causes—the specific events directly leading to a casualty. To truly learn, we must take a step back to examine the larger context: what were the preceding holes in the Swiss cheese, and how do we account for them in our systems engineering practice? This approach is increasingly important as systems grow in complexity. This post describes three case studies illustrating a common theme: increasing complexity demands strong systems engineering approaches to maximize safety, system performance, and suitability.

USS McCain Collision

In 2017, USS McCain collided with the merchant vessel Alnic MC in the Singapore Strait. McCain was heading to port to fix issues with its integrated bridge and navigation system (IBNS), which was prone to bugs and failures. The ship’s captain, distrustful of the buggy IBNS, had the crew operate in a backup manual mode that bypassed several safeguards.

The incident began in the early morning in a busy shipping lane, a high-demand situation. To reduce the workload on the junior helmsmen, the captain ordered a “split helm,” with one station controlling steering and the other controlling speed. In backup manual mode, changing to a split helm required a multi-step, manual process. The crew made several mistakes, starting with transferring steering to the wrong station without realizing it.

Believing steering was lost, the helmsman instinctively reduced thrust to slow the ship. However, the incomplete manual transfer had left the port and starboard throttle controls unganged, and only the port shaft slowed. Instead of decelerating, McCain, pushed by the uneven thrust, veered directly into the path of Alnic MC.

Helm station of a US Navy ship
IBNS on USS Dewey; US Navy photo

In the image above, taken aboard a sister ship of McCain, you will notice a large red button meant to take emergency control of the ship. This was certainly a situation in which to use it. However, the crew had an incorrect understanding of how the button worked and what would happen when it was pressed. Over the course of 16 seconds, multiple crewmembers pressed the big red buttons at their stations, and control switched between the bridge and aft backup stations three times. By the time the crew regained control, it was too late. A few seconds later, the bow of Alnic MC struck McCain, and ten sailors died.

USS McCain with damage from the collision
Damage to McCain; US Navy photo

The crew of McCain had indeed lost control of the ship, but not because of any technical failure. The loss of control was caused by a confluence of design, training, and operational deficiencies. Among these, a lack of trust in automation led the crew to operate in backup manual mode, without support and safeguards that would have improved safety. Most pressing, in my opinion, are the confusing controls, which didn’t retain the functionality of traditional physical controls. Critically, the IBNS on McCain replaced the physical throttles with digital displays alone; physical throttle levers would have made it obvious to the entire bridge team at a glance that the controls were not ganged.
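To make the design lesson concrete, here is a minimal, hypothetical sketch of the failure mode, not the actual IBNS logic or the exact event sequence: a multi-step transfer that is only partially completed leaves the throttles unganged, and an explicit check could surface that hidden state to the whole bridge team.

```python
# Toy model of a split-helm transfer; station names and steps are invented for illustration.
from dataclasses import dataclass


@dataclass
class HelmState:
    steering_station: str = "HELM"
    port_throttle_station: str = "HELM"
    stbd_throttle_station: str = "HELM"

    def throttles_ganged(self) -> bool:
        # Both shafts respond together only when they answer to the same station.
        return self.port_throttle_station == self.stbd_throttle_station


def transfer_steering(state: HelmState, to_station: str) -> None:
    state.steering_station = to_station


def transfer_throttles(state: HelmState, to_station: str) -> None:
    # The step that completes the transfer; skipping it leaves the shafts split.
    state.port_throttle_station = to_station
    state.stbd_throttle_station = to_station


state = HelmState()
transfer_steering(state, "LEE_HELM")        # step 1 of the split-helm change
state.port_throttle_station = "LEE_HELM"    # step 2 left half-finished; stbd shaft still at HELM

if not state.throttles_ganged():
    print("WARNING: throttles unganged; port and starboard shafts will respond differently")
```

Physical throttle levers make this state visible for free: two handles sitting at different angles are hard to miss. A digital display has to be deliberately designed to surface it.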

Chekov’s Gear Shifter

Composite of the Jeep gear shifter and actor Anton Yelchin in Star Trek costume

The same type of design shortcoming affects our daily lives as well. Anton Yelchin, known for playing Chekov in the 2009 Star Trek film, died tragically in 2016 at age 27. After parking his Jeep Grand Cherokee in his driveway, he exited the vehicle, unaware it was still in neutral. The car rolled backward, crushing him against a wall and killing him.

The issue was the design of the “monostable” shift knob, pictured above. At the time of the incident, the design had already been implicated in 266 crashes causing 68 injuries. Complaints centered on the shifter not providing adequate tactile or positional feedback, leading drivers to select the wrong gear, in particular putting the vehicle into neutral or reverse instead of park. Fiat Chrysler had already initiated a voluntary recall based on these complaints and was working on a software update to provide better feedback and to prevent the car from moving in certain conditions. Unfortunately, the fix wasn’t available at the time of Yelchin’s death, and the incident prompted Fiat Chrysler to fast-track its development and fielding.

This case highlights the risks of abandoning proven design principles. The monostable shifter resembled traditional designs but worked very differently, confusing users without adding functional benefits.

Breaking established design patterns is not necessarily a bad thing. Historically, function drove form: a mechanical linkage ran between the shifter and the transmission, for example, and users could develop mental models of the system by following those physical paths. Moving to software-defined systems enables much more flexibility in design, which can enhance performance and safety. But it also removes many of the constraints of physical systems, breaks mental models, and allows solutions to become much more complex. That can result in undesired emergent behavior.

Two related design principles are “if it’s not broken, don’t fix it” and “build on the user’s existing mental model.” When it is necessary to depart from existing concepts to achieve solution objectives, the designer must carefully consider the potential impacts and account for them. The monostable shifter was problematic because it looked like a traditional shifter but worked differently. Not only did that trip up users, it added nothing to the effectiveness of the solution; it was different just for the sake of being different.

Aircraft Flight Controls

Composite of older and newer aircraft cockpits
Composite of a Hawker Siddeley Trident (left) and an Airbus A380 (right) cockpit (Originals by Nimbus227 and Naddsy CC BY 2.0)

A positive example comes from aviation. Across manufacturers, models, and decades of technological advancement, controls and displays have remained relatively consistent, building on proven, successful designs. The basic primary flight display layout and colors are similar across aircraft, whether glass displays or traditional gauges, and essential controls have the same shape coding and direction of movement in any aircraft.

An effective, optimized, proven layout allows the pilot to focus on managing the aircraft and executing their mission. It also enables skills transfer; a pilot learning a new aircraft can focus on the unique qualities of that type rather than re-learning basic flight controls and displays.

Automation in modern aircraft has increased substantially. Airbus pioneered fly-by-wire: there are no direct linkages between the cockpit controls and the aircraft’s control surfaces; all inputs are mediated by software. That radically enhances safety. It’s nearly impossible to stall an Airbus because the flight computer won’t let you leave the safe flight envelope. Even so, it’s not infallible.
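As a rough illustration of what envelope protection means in software terms (a toy filter, not Airbus’s actual control laws), consider a command limiter that clamps the pilot’s input in the protected mode and passes it straight through in a degraded mode:

```python
# Hypothetical angle-of-attack limit; real limits vary by aircraft and configuration.
ALPHA_MAX_DEG = 15.0


def commanded_alpha(pilot_alpha_deg: float, protections_active: bool) -> float:
    if protections_active:
        # Protected mode: the command is clamped inside the safe flight envelope.
        return min(pilot_alpha_deg, ALPHA_MAX_DEG)
    # Degraded mode: the raw command passes through and a stall becomes possible.
    return pilot_alpha_deg


print(commanded_alpha(25.0, protections_active=True))   # 15.0: held inside the envelope
print(commanded_alpha(25.0, protections_active=False))  # 25.0: the protection is gone
```

The degraded branch is exactly the situation the next case study describes.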

Air France Flight 447

In 2009, Air France flight 447 crashed into the Atlantic on a flight from Rio de Janeiro to Paris. The aircraft entered icing conditions and the pitot tubes froze over, making the airspeed data unreliable. Without valid airspeed data, the autopilot disconnected, autothrust became unavailable, and the flight control software switched to an alternate control law. In this alternate law, the software no longer mediated the pilot’s control inputs, and the controls were much more sensitive.

The most junior of the pilots on board was the pilot flying, and he struggled to adapt to the abrupt change. He spent the first 30 seconds getting a back-and-forth roll under control, over-correcting as he got used to the more sensitive handling of the aircraft. While fighting the roll, he also pulled back on the stick, a natural tendency of pilots in tense situations; that caused the aircraft to climb very steeply and ultimately stall. He continued to pull back for almost the entire rest of the flight, even as the more experienced pilot tried to take control and push the nose forward to regain airspeed. The mismatched inputs triggered “DUAL INPUT” warnings that the crew either didn’t notice or ignored, and the plane continued to respond to the junior pilot’s incorrect inputs.

Without the software providing flight envelope protection, the normally unstallable aircraft fell out of the sky, and 228 people died. The last thing the pilot flying said was “We’re going to crash! This can’t be true. But what’s happening?” Just as with McCain, there was nothing inherently wrong with the aircraft, only a disconnect between the user’s understanding and the actual state of the system, fueled by a rapid change of state, a stressful situation, and an inability to rebuild situational awareness in time.

Recovery of wreckage from Flight 447 (Roberto Maltchik Repórter da TV Brasil, CC BY 3.0 br)

There are several lessons in this case study, but what stands out is the paradox of automation: the pilot was so used to the protection of the flight control software that his basic airmanship wasn’t there when he needed it. There are also lessons here about flight control design, the need for graceful degradation, and better training.

Lessons Learned

The key takeaway from these case studies for systems engineering practice is that the performance of the system is the product of human performance and technology performance. It doesn’t matter how great the technology is if the human can’t use it to safely and effectively accomplish their mission. That’s especially true in unusual or off-nominal situations. The robustness of the system depends on the ability of the human and/or technology to account for, adapt to, and recover from unusual situations.

System Performance = Human Performance × Technology Performance
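A toy illustration of why the relationship is multiplicative rather than additive (the numbers and scale are notional, not measured values):

```python
# Notional 0.0-1.0 scales; either factor near zero caps the whole system.
def system_performance(human: float, technology: float) -> float:
    return human * technology


print(system_performance(0.95, 0.95))  # ~0.90: capable technology, well-supported crew
print(system_performance(0.30, 0.99))  # ~0.30: excellent technology the crew can't use
```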

With the rise of software-defined systems, complexity is outpacing our ability to characterize emergent behavior. One role of systems engineering is to minimize and manage this complexity. Human-centered approaches have proven effective at supporting user performance within complex systems, especially when combined with the frequent user feedback and iteration of agile methods. Building from user needs ensures that complexity is added only when the solution requires it, not because it’s cool (the Jeep monostable shifter) or economical (the McCain IBNS). It also suggests other, non-design measures that support user performance, such as decision aids and training (AF447).

Adding human-centered design in SE is easy:
• Work from first principles, always
• Model user workflows, then ask “how might we…?”
• Evolutionary vs. revolutionary technology
• Thoughtfulness is next to godliness

Create for the user and the system will be successful

Human-centered approaches are a natural part of a holistic systems engineering program. Too often, engineers focus on developing technological aspects without a true understanding of the stakeholder and mission needs. These are the first principles that should guide all of our design decisions before we start to write code or bend metal.

From that deep understanding, we can ask my favorite question: how might we? Sometimes ‘how might we’ leads to small, incremental changes that add up to major performance gains. Other times it demands entirely rethinking the problem and solution, especially if revolutionary technology improvements are available.

Finally, designs must be thoughtful, putting practical needs ahead of novelty or cost savings to create the most effective solutions. This approach not only enhances system performance but also prevents failures. As the U.S. Navy report on the McCain collision noted:

“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence), without considering the effect to operators who are trained and proficient in operating legacy equipment.”

By prioritizing user needs, we can create systems that are safe, effective, and resilient in the face of challenges.

What case studies or examples have influenced your thinking? How do you apply user-centered or other approaches successfully in your practice? Share in the comments below.

Minimum Viable Product (MVP): You’re doing it wrong

Quibi was a short-lived short-form video app. It was founded in August 2018, launched in April 2020, and folded in December 2020, wiping out $1.75 billion of investors’ money. That’s twenty months from founding to launch and only about six months before the decision to shut down. Ouch.

Forbes chalked this up to “a misread of consumer interests”: though the content was pretty good, Quibi only worked as a phone app while customers wanted to stream on their TVs, and it lacked the social sharing features that might have drawn in new viewers. It was also a paid service competing with free options like YouTube and TikTok. According to The Wall Street Journal, the company’s attempts to address these issues came too late: “spending on advertising left little financial wiggle room when the company was struggling”.

If only there were some way Quibi could have validated its concept prior to wasting nearly two billion dollars1.

Read More

Agile isn’t faster

A common misconception is that Agile development processes are faster. I’ve heard this from leaders as a justification for adopting Agile processes and read it in proposals as a supposed differentiator. It’s not true. Nothing about Agile magically enables teams to architect, engineer, design, test, or validate any faster.

In fact, many parts of Agile are actually slower. Time spent on PI planning, backlog refinement, sprint planning, daily stand-ups2, and retrospectives is time the team isn’t developing. Much of that overhead is avoided in a Waterfall style where the development follows a set plan.

Read More

Agile SE Part One: What is Agile, Anyway?

Table of Contents

What is “Agile”?

Agile is a relatively new approach to software development based on the Agile Manifesto and Agile Principles. These documents are an easy read and you should absolutely check them out. I will sum them up as stating that development should be driven by what is most valuable to the customer and that our projects should align around delivering value.

Yes, I’ve obnoxiously italicized the word value as if it were in the glossary of a middle school textbook. That’s because value is the essence of this entire discussion.

Little-a Agile

With a little-a, “agile” is the ability to adapt to a changing situation. This means collaboration to understand the stakeholder needs and the best way to satisfy those needs. It means changing the plan when the situation (or your understanding of the situation) changes. It means understanding what is valuable to the customer, focusing on delivering that value, and minimizing non-value added effort.

Read More

Agile SE Part Zero: Overview

“Agile” is the latest buzzword in systems engineering. It has a fair share of both adherents and detractors, not to mention a long list of companies offering to sell tools, training, and coaching. What has been lacking is a thoughtful discussion about when agile provides value, when it doesn’t, and how to adapt agile practices to be effective in complex systems engineering projects.

I don’t claim this to be the end-all guide on agile systems engineering, but I hope it will at least spark some discussion. Please comment on the articles with details from your own experiences. If you’re interested in contributing or collaborating, please contact me at benjamin@engineeringforhumans.com; I’d love to add your voice to the site.

Read More

Agile Government Contracts

Agile is a popular and growing software development approach. It promotes a focus on the product rather than the project plan. This model is very attractive for many reasons and teams are adopting it across the defense industry. However, traditional government contracts and project management are entirely plan-driven. Can you really be agile in a plan-driven world?

Read More