
Written in Blood: Case Studies of Systems Engineering Failure

Our world and our systems are safer than ever. A major reason why is that we’ve learned from prior mistakes. Many of our practices, rules, and standards are “written in blood” from past, tragic failures.

We learn so that we don’t repeat the same mistakes. Of course, we first identify the proximate causes—the specific events directly leading to a casualty. To truly learn, we must step back and examine the larger context: what were the preceding holes in the Swiss cheese model, and how do we account for them in our systems engineering practice? This approach is increasingly important as systems grow in complexity. This post describes three case studies illustrating a common theme: increasing complexity demands strong systems engineering approaches to maximize safety, system performance, and suitability.

USS McCain Collision

In 2017, USS John S. McCain collided with the merchant vessel Alnic MC in the Singapore Strait. McCain was heading to port to fix issues with its Integrated Bridge and Navigation System (IBNS), which was prone to bugs and failures. The ship’s captain, distrustful of the buggy IBNS, operated in a manual backup mode that bypassed several safeguards.

The incident began in the early morning in a busy shipping lane, a high-demand situation. To reduce the workload on junior helmsmen, the captain ordered a “split helm,” with one station controlling steering and another controlling speed. In the backup manual mode, changing to a split helm required a multi-step, manual process. The crew made several mistakes, starting with transferring steering to the wrong station without realizing it.

Believing steering was lost, the helmsman instinctively reduced thrust to slow the ship. However, the incomplete manual transfer process had left the port and starboard throttle controls unganged, and only the port shaft slowed. Instead of decelerating, the McCain veered directly into the path of the Alnic MC.
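To make the failure mode concrete, here is a minimal, hypothetical sketch (my own illustration, not the actual IBNS software) of how a ganged throttle command drives both shafts together, and how an incomplete transfer that leaves the controls unganged produces asymmetric thrust:

    # Minimal sketch of ganged vs. unganged throttle behavior.
    # Hypothetical model for illustration only -- not the actual IBNS implementation.

    class Shaft:
        def __init__(self, name, thrust=15.0):
            self.name = name
            self.thrust = thrust  # ordered thrust, arbitrary units

    class ThrottleStation:
        def __init__(self, port, starboard, ganged=True):
            self.port = port
            self.starboard = starboard
            self.ganged = ganged  # an incomplete transfer can leave this False

        def set_thrust(self, value):
            # When ganged, one command drives both shafts together.
            self.port.thrust = value
            if self.ganged:
                self.starboard.thrust = value
            # When unganged, the starboard shaft keeps its previous order --
            # the asymmetry that swung McCain toward Alnic MC.

    port, stbd = Shaft("port"), Shaft("starboard")
    station = ThrottleStation(port, stbd, ganged=False)  # controls left unganged
    station.set_thrust(5.0)           # helmsman tries to slow the ship
    print(port.thrust, stbd.thrust)   # 5.0 15.0 -> asymmetric thrust, the ship veers

With physical, side-by-side throttle levers, that mismatch would have been visible to the whole bridge team at a glance; on a digital display it was not.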

Helm station of a US Navy ship
IBNS on USS Dewey; US Navy photo

In the image above, from a sister ship of McCain, you will notice a large red button meant to take emergency control of the ship. This was certainly a situation in which to use it. However, the crew had an incorrect understanding of how this button worked and what would happen when it was pressed. Over the course of 16 seconds, multiple crewmembers pressed the big red buttons at their stations and control switched between the bridge and aft backup stations three times. By the time the crew regained control, it was too late. A few seconds later, the bow of Alnic MC struck McCain, and ten sailors died.

USS McCain with damage from the collision
Damage to McCain; US Navy photo

The crew of McCain had indeed lost control of the ship. However, it was not because of any technical failure; it was caused by a confluence of design, training, and operational deficiencies. Among these, a lack of trust in automation led the crew to operate in backup manual mode, without the support and safeguards that would have improved safety. Most pressing, in my opinion, were the confusing controls that didn’t retain the functionality of traditional physical controls. Critically, the IBNS on McCain removed the physical throttles in favor of digital displays alone; physical throttle levers would have made it obvious to the entire bridge team at a glance that the controls were not ganged.

Chekov’s Gear Shifter

Composite of the Jeep gear shifter and actor Anton Yelchin in Star Trek costume

The same type of design shortcoming affects our daily lives as well. Anton Yelchin, known for playing Chekov in the 2009 Star Trek film, died tragically in 2016 at age 27. After parking his Jeep Grand Cherokee in his driveway, he exited the vehicle, unaware it was still in neutral. The car rolled backward, crushing him against a wall and killing him.

The issue was the design of the “monostable” shift knob, pictured above. At the time of the incident, the design had already been implicated in 266 crashes causing 68 injuries. Complaints centered on the shifter not providing adequate tactile or positional feedback to the driver, leading to incorrect gear selections and, in particular, leaving the vehicle in neutral or reverse instead of park. Fiat Chrysler had already initiated a voluntary recall based on these complaints and was working on a software update to provide better feedback and to prevent the car from moving in certain conditions. Unfortunately, the fix wasn’t available at the time of Yelchin’s death, and the incident prompted Fiat Chrysler to fast-track its development and fielding.
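The software update reportedly worked along the lines of an auto-park interlock. As a rough sketch of that idea (my own simplification, with hypothetical signals and thresholds, not FCA’s actual recall logic), the vehicle shifts itself into park when the driver appears to be exiting while the transmission is still in gear:

    # Rough sketch of an auto-park interlock -- hypothetical signals and thresholds,
    # not FCA's actual recall software.

    def should_auto_park(gear, speed_mph, driver_door_open, brake_pressed):
        """Shift to park if the driver appears to be exiting with the car still in gear."""
        exiting = driver_door_open and not brake_pressed
        return gear != "PARK" and speed_mph < 2.0 and exiting

    # Scenario like Yelchin's: car left in neutral, driver steps out.
    if should_auto_park(gear="NEUTRAL", speed_mph=0.0,
                        driver_door_open=True, brake_pressed=False):
        print("Engaging PARK automatically")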

This case highlights the risks of abandoning proven design principles. The monostable shifter resembled traditional designs but worked very differently, confusing users without adding functional benefits.

Breaking established design patterns is not necessarily a bad thing. Historically, function drove form; a mechanical linkage, for example, physically connected the shifter to the transmission. Users are able to develop mental models of a system by following these physical paths. Moving to software-defined systems enables much more flexibility in design, which can enhance performance and safety. But it also removes many of the constraints of physical systems, breaks mental models, and allows solutions to become much more complex. That can result in undesired emergent behavior.

Two related design principles are “if it’s not broken, don’t fix it” and “build on the user’s existing mental model.” When it is necessary to depart from existing concepts to achieve solution objectives, the designer must carefully consider the potential impacts and account for them. The monostable shifter was problematic because it looked like a traditional shifter but worked differently. Not only did that trip up users, it added nothing to the effectiveness of the solution; it was different simply for the sake of being different.

Aircraft Flight Controls

Composite of older and newer aircraft cockpits
Composite of a Hawker Siddeley Trident (left) and an Airbus A380 (right) cockpit (Originals by Nimbus227 and Naddsy CC BY 2.0)

A positive example comes from aviation. Across manufacturers, models, and decades of technology advancement, controls and displays have remained relatively consistent with proven success. Basic primary flight display layout and colors are similar across aircraft, whether glass displays or traditional gauges; essential controls have the same shape coding and movement in any aircraft.

An effective, optimized, proven layout allows the pilot to focus on managing the aircraft and executing their mission. It also enables skills transfer; a pilot learning a new aircraft can focus on the unique qualities of that type rather than re-learning basic flight controls and displays.

Automation in modern aircraft has increased substantially. Airbus pioneered fly-by-wire: there are no direct linkages between the cockpit controls and the aircraft’s control surfaces; all inputs are mediated by software. That radically enhances safety. It’s nearly impossible to stall an Airbus because the flight computer won’t let you leave the safe flight envelope. Even so, it’s not infallible.
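The heart of envelope protection is conceptually simple: the computer treats the pilot’s input as a request and clamps the resulting command to the safe envelope. A minimal sketch with made-up limits (not Airbus’s actual control laws):

    # Minimal sketch of angle-of-attack envelope protection.
    # Made-up limits for illustration; not Airbus's actual normal-law implementation.

    ALPHA_MAX = 15.0   # degrees, assumed stall-protection limit
    ALPHA_MIN = -5.0

    def protected_aoa_command(pilot_request_deg):
        """Under normal law, the commanded angle of attack is clamped to the
        safe envelope no matter how far the stick is pulled."""
        return max(ALPHA_MIN, min(ALPHA_MAX, pilot_request_deg))

    print(protected_aoa_command(25.0))  # -> 15.0: full back stick still can't stall the jet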

Air France Flight 447

In 2009, Air France Flight 447 crashed into the Atlantic on a flight from Rio de Janeiro to Paris. The aircraft entered icing conditions and its pitot tubes froze over, making the airspeed data unreliable. Without valid airspeed data, the autopilot disconnected, autothrust became unavailable, and the flight control software switched to an alternate control law. In this alternate law, the software no longer mediated the pilot’s control inputs, so the controls were much more sensitive.
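That chain of automatic downgrades can be pictured as a simple reconfiguration rule: once the airspeed sources disagree, automation and protections are shed and pilot inputs pass through with far less mediation. A deliberately simplified sketch (the real reconfiguration logic is far more involved):

    # Deliberately simplified sketch of control-law degradation on unreliable airspeed.
    # The real Airbus reconfiguration logic is far more involved than this.

    def reconfigure(valid_airspeed_sources):
        if valid_airspeed_sources >= 2:
            return {"autopilot": True, "autothrust": True,
                    "control_law": "normal", "envelope_protection": True}
        # Too few agreeing airspeed sources: shed automation and protections.
        return {"autopilot": False, "autothrust": False,
                "control_law": "alternate", "envelope_protection": False}

    print(reconfigure(valid_airspeed_sources=1))
    # The abrupt change in handling is what the AF447 crew had to absorb,
    # at night, in weather, with alarms sounding.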

The most junior of the pilots on board was the pilot flying, and he struggled to adapt to the abrupt change. He spent the first 30 seconds getting a back-and-forth roll under control, over-correcting as he adjusted to the aircraft’s more sensitive handling. As he fought the roll, he also pulled back on the stick, a natural tendency of pilots in tense situations; that caused the aircraft to climb steeply and ultimately stall. He continued to pull back for almost the entire remainder of the flight, even as the more experienced pilot tried to take control and push the nose down to regain airspeed. The mismatched inputs triggered “DUAL INPUT” warnings that the crew either didn’t notice or ignored, and the plane continued to respond to the junior pilot’s incorrect inputs.
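Part of the problem is that Airbus sidesticks are not mechanically linked: when both pilots make inputs at once, the aircraft responds to the sum of the two (limited to full single-stick travel) and announces “DUAL INPUT.” A simplified sketch of that behavior, using normalized, illustrative values:

    # Simplified sketch of dual sidestick input summing -- illustrative values only.

    MAX_DEFLECTION = 1.0  # normalized full stick travel

    def combined_pitch_input(left_stick, right_stick):
        """Unlinked sidesticks: simultaneous inputs are summed (clipped to one
        stick's full travel) and a DUAL INPUT warning is triggered."""
        if left_stick and right_stick:
            print("DUAL INPUT")  # easy to miss amid stall warnings and other alarms
        total = left_stick + right_stick
        return max(-MAX_DEFLECTION, min(MAX_DEFLECTION, total))

    # Senior pilot pushes forward (-0.5) while the junior pilot holds full back (+1.0):
    print(combined_pitch_input(-0.5, 1.0))  # -> 0.5, and the nose stays up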

Without the software providing flight envelope protection, the normally unstallable aircraft fell out of the sky and 228 people died. The last thing the pilot flying said was “We’re going to crash! This can’t be true. But what’s happening?” Just as with McCain, there was nothing inherently wrong with the aircraft, just a disconnect between the user’s understanding and the actual state of the system, fueled by a rapid change of state, a stressful situation, and an inability to rebuild situational awareness in time.

Recovery of wreckage from Flight 447 (Roberto Maltchik Repórter da TV Brasil, CC BY 3.0 br)

There are several lessons in this case study, but what stands out is the paradox of automation: the pilot was so used to the safety of the flight control software that his basic aviator skills weren’t available when he needed them. There are also lessons about flight control design, the need for graceful degradation, and better training.

Lessons Learned

The key takeaway from these case studies for systems engineering practice is that the performance of the system is the product of human performance and technology performance. It doesn’t matter how great the technology is if the human can’t use it to safely and effectively accomplish their mission. That’s especially true in unusual or off-nominal situations. The robustness of the system depends on the ability of the human and/or technology to account for, adapt to, and recover from unusual situations.

System Performance = Human Performance x Technology Performance
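Treating this as a product rather than a sum is the point: a weak human-performance term drags down the whole system no matter how good the technology is. A toy numeric illustration (the values are invented, not measurements):

    # Toy illustration of System = Human x Technology (invented values, not measurements).

    def system_performance(human, technology):
        return human * technology

    print(system_performance(0.95, 0.95))  # 0.9025 -- capable crew, usable technology
    print(system_performance(0.40, 0.99))  # 0.396  -- great technology, confused operator
    print(system_performance(0.95, 0.40))  # 0.38   -- capable crew, unusable technology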

With the rise of software-defined systems, complexity is outpacing our ability to characterize emergent behavior. One role of systems engineering is to minimize and manage this complexity. Human-centered approaches have proven to be effective at supporting user performance within complex systems, especially when combined with the frequent user feedback and iteration of the agile methodology. Building from user needs ensures that complexity is added when necessary for the sake of the solution rather than because it’s cool (Jeep monostable shifter) or economical (McCain IBNS). It also suggests other, non-design aspects that support user performance such as decision aids and training (AF447).

Adding human-centered design to SE is easy:
• Work from first principles, always
• Model user workflows, then ask “how might we…?”
• Evolutionary vs. revolutionary technology
• Thoughtfulness is next to godliness

Create for the user, and the system will be successful.

Human-centered approaches are a natural part of a holistic systems engineering program. Too often, engineers focus on developing technological aspects without a true understanding of the stakeholder and mission needs. These are the first principles that should guide all of our design decisions before we start to write code or bend metal.

From that deep understanding, we can ask my favorite question: how might we? Sometimes ‘how might we’ leads to small, incremental changes that add up to major performance gains. Other times it demands entirely rethinking the problem and solution, especially if revolutionary technology improvements are available.

Finally, designs must be thoughtful, putting practical needs ahead of novelty or cost savings to create the most effective solutions. This approach not only enhances system performance but also prevents failures. As the U.S. Navy report on the McCain collision noted:

“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence), without considering the effect to operators who are trained and proficient in operating legacy equipment.”
— US Navy report on the McCain collision

By prioritizing user needs, we can create systems that are safe, effective, and resilient in the face of challenges.

What case studies or examples have influenced your thinking? How do you apply user-centered or other approaches successfully in your practice? Share in the comments below.

Postal vehicles: Function over form

One of my favorite items in my small model collection is a 1:34 scale Grumman Long Life Vehicle (LLV) with sliding side doors, a roll-up rear hatch, and pull-back propulsion. The iconic vehicle has been plying our city streets for nearly 40 years, reliably delivering critical communiques, bills, checks, advertisements, Dear John letters, junk mail, magazines, catalogs, post cards from afar, chain letters, and Amazon packages.

Read More

System Design Lessons from the USS McCain

The Navy installed touch-screen steering systems to save money.

Ten sailors paid with their lives.

— ProPublica
USS McCain in 2019 (U.S. Navy Photo)

Ten sailors died after the crew of the destroyer USS John S. McCain lost control of their vessel, causing a collision with the merchant tanker Alnic MC. There was nothing technically wrong with the vessel or its controls. Though much of the blame was put on the Sailors and Officers aboard, the real fault rests with the design of the Integrated Bridge & Navigation System (IBNS).

Read More

World War III’s Bletchley Park

In a near-future battlefield against a peer adversary, effective employment of machine learning and autonomy is the deciding factor. While our adversary is adapting commercial, mass-market technologies and controlling them remotely, U.S. and allied forces dominate with the effective application of advanced technologies that make decisions faster and more accurately. The concept of Joint All-Domain Command and Control (JADC2) is a key enabler, driving better battlefield decisions through robust information sharing.

Fed by this information, advanced decision-aiding systems present courses of action (COAs) first to each commander and then to each crew in the battle, taking into account every possible factor: tasking, environment and terrain, threats, available sensors and effectors, and more. Options and recommendations adapt as the battle unfolds, supporting every decision with actionable information while deferring to human judgment.

In this campaign, the first few battles are handily won. It seems this war will be a cakewalk.

Until the enemy learns. They notice routines in behaviors and responses that are easy to exploit: System A is a higher threat priority, so is used as a diversion; displaced earth is flagged as a potential mine, so the enemy digs random holes to slow progress; fire comes in specific patterns, so the enemy knows when a barrage is over and quickly counters.

Pretty soon, these adaptations evolve into active attacks on the autonomy: dazzle camouflage tricks computer vision systems into seeing more or different units; noise added to radio communications causes military chatter to be misclassified as civilian; selective sensor jamming confuses the autonomy.

As the enemy learns to counter and attack these advanced capabilities, they become less helpful and eventually a liability. The operators come to deem them unreliable and revert to human decision-making and manual control of systems. The enemy has evened the battle, and our investment in advanced decision support systems is wasted. Even worse, our operators lack experience controlling the systems manually and are actually at a disadvantage; the technology has actively hurt us.

The solution is clear: we must be prepared to counter the enemy’s learning and to learn ourselves. This is not a new insight; learning and adaptation have always been essential elements of war, and now they are more important than ever. The lessons learned in the field must be fed back into the AI/ML/autonomy development process. A short feedback, development, testing, and deployment cycle is essential for autonomy to adapt to the adversary’s capabilities and TTPs, limiting the adversary’s ability to learn how to defend against and defeat our technologies.

In World War II, cryptography was the game-changing technology. You’re doubtless familiar with Bletchley Park, the codebreaking site that provided critical intelligence throughout the war. There, men and women worked tirelessly every single day of the war to analyze communication traffic, break the day’s codes, and pass intelligence to decision-makers. This work saved countless lives, leading directly to the Allied victory and shortening the war by an estimated two to four years. With the advancement of communications security, practically unbreakable encryption is now available to everyone. We will no longer have the advantage of snooping on enemy communication content and must develop some other unique capability to ensure our forces have the edge.

I submit that the advantage will come from military-grade autonomy: not the autonomous vehicles themselves, which are commodities, but the ability of the autonomy to respond to changing enemy behavior. One key advantage of traditional human control is adaptability to unique and changing situations, which current autonomy is not capable of; the state of the art in autonomous systems today more closely resembles video game NPCs, mindlessly applying the same routines to simple inputs. While we may have high hopes for the future of autonomy, the truth is that autonomous systems will be limited for the foreseeable future by an inability to think outside the box.

Average autonomous system

How, then, do we enable the autonomous systems to react rapidly to changing battlefield conditions?

World War III’s version of Bletchley Park will be a capability I’m calling the Battlefield Accelerated Tactics and Techniques Learning and Effectiveness Laboratory. BATTLE Lab is a simulation facility. It ingests data from the field in near-real time: every detail of every battle, including terrain, weather, friendly behaviors, enemy tactics, signals, and more. Through experimentation across hundreds of thousands of simulated engagements driven by observed behavior, we’ll develop courses of action for countering the enemy in every imaginable situation. Updated behavior models that reduce friendly vulnerabilities, exploit enemy weaknesses, and give our forces the edge will be pushed to the field multiple times per day.
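In software terms, BATTLE Lab is a continuous learning loop: collect field observations, run large batches of simulated engagements against them, and publish the best-performing behavior models back to deployed systems. A skeletal sketch of that loop, in which every function is a hypothetical placeholder rather than an existing system:

    # Skeletal sketch of the BATTLE Lab loop. Every function here is a
    # hypothetical placeholder, not an existing system.

    import random

    def collect_field_observations():
        # Stand-in for near-real-time battlefield data (terrain, weather, enemy TTPs).
        return {"enemy_jamming": random.random(), "decoy_rate": random.random()}

    def simulate_engagements(observations, n_candidates=50):
        # Stand-in for hundreds of thousands of simulated engagements
        # driven by the observed behavior.
        return [{"tactic_id": i, "effectiveness": random.random()}
                for i in range(n_candidates)]

    def publish_behavior_model(tactic):
        # Stand-in for pushing an updated behavior model to fielded autonomy.
        print(f"Publishing tactic {tactic['tactic_id']} "
              f"(effectiveness {tactic['effectiveness']:.2f})")

    def battle_lab_cycle():
        observations = collect_field_observations()
        candidates = simulate_engagements(observations)
        publish_behavior_model(max(candidates, key=lambda t: t["effectiveness"]))

    battle_lab_cycle()  # in practice, run several times per day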

Of course, we already do this today with extensive threat intelligence capabilities, training, and tactics. The difference is that the future battlefield will be chockablock with autonomous systems, which can more rapidly integrate new threat and behavior models generated by BATTLE Lab. We’ll be able to move faster, using autonomy and simulation to shorten the OODA loop while nearly instantly incorporating lessons from every battle.

Without BATTLE Lab, the enemy will learn how our autonomy operates and quickly find weaknesses to exploit; autonomous systems will be vulnerable to spoofing, jamming, and unexpected behaviors by enemy systems. Bletchley Park shortened the OODA loop by providing better intelligence to strategic decision-makers (“Observe”). BATTLE Lab will shorten the OODA loop by improving the ability of autonomy to understand the situation and make decisions (“Orient” and “Decide”).

BATTLE Lab is enabled by technology available and maturing today: low-cost uncrewed systems, battlefield connectivity, and edge processing.

A critical gap is human-autonomy interaction. To employ these advanced capabilities effectively, human crews need to task, trust, and collaborate with autonomous teammates, and these interaction strategies need to mature alongside autonomy capabilities to enhance employment at every step. Autonomy tactics may change rapidly as new models are disseminated from BATTLE Lab, and human teammates need to be able to understand and trust the autonomous system’s behaviors. Explainability and trust are topics of ongoing research; additional efforts to integrate these capabilities into mission planning and mission execution will also be needed.

What do you think the future battlefield will look like and what additional capabilities need to be developed to make it possible? Share your thoughts in the comments below.

OODA Loop: Observe, Orient, Decide, Act

All models are wrong, some models are useful.

George E. P. Box

“Observe, Orient, Decide, Act” (OODA) is a simple decision-making model developed by US Air Force Colonel John Boyd. The concept is straightforward: every entity in a competition is executing these four phases, and the side that can execute them more quickly and accurately will win. “OODA” is a useful shorthand for discussing human decision-making and is commonly used in military circles.

Of course, this simple phrase masks an enormous amount of complexity regarding the amount of information observed, the participant’s ability to orient, the quality of decision-making, and the actions available to execute. Yet it is this simplicity that gives the model its strength. Because the model is so simple, it holds at every scale: engagement to engagement, battle to battle, campaign to campaign. Strategic decision-makers are looking at the forest while tactical decision-makers are looking at the trees, yet they’re all executing an OODA loop at their relative scope and scale.
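Because the model is just four phases in a loop, it maps naturally onto code. The sketch below is purely illustrative, with toy, hypothetical observe/orient/decide/act functions, just to make the phases concrete:

    # Purely illustrative OODA loop with toy, hypothetical functions.

    def observe(world):
        return world["visible"]                  # Observe: gather what can be sensed

    def orient(beliefs, observation):
        beliefs.update(observation)              # Orient: fold new information into the picture
        return beliefs

    def decide(beliefs):
        return "advance" if beliefs.get("threat", 1.0) < 0.5 else "hold"   # Decide

    def act(world, action):
        world["last_action"] = action            # Act: change the world, then loop again

    world = {"visible": {"threat": 0.3}, "last_action": None}
    beliefs = {}
    for _ in range(3):                           # whoever cycles faster and more accurately wins
        beliefs = orient(beliefs, observe(world))
        act(world, decide(beliefs))
    print(world["last_action"])                  # -> advance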

What makes a good human factors engineer? Five critical skills

Recently, the head of a college human factors program asked for my perspective on the human factors (and user experience) skills valued in industry. Here are five critical qualities that emerged from our discussion, in no particular order:

Systems thinking

Making sense of complexity requires identifying relationships, patterns, feedback loops, and causality. Systems thinkers excel at identifying emergent properties of systems and are thus well suited to analyses such as safety, cybersecurity, and process, where outcomes may not be obvious from simply looking at the sum of the parts.

Read More

Military-industrial complex

The phrase “military-industrial complex” was coined by President Eisenhower in his farewell address to the nation in 1961. In this address, Eisenhower spoke of the deterrence value of military strength:

A vital element in keeping the peace is our military establishment. Our arms must be mighty, ready for instant action, so that no potential aggressor may be tempted to risk his own destruction.

Simultaneously, he warned of the potential danger in the growing relationship between the military establishment and the defense industry:

Read More

Agile SE Part Five: Agility on Large, Complex Programs


Putting it all together

In this series we’ve introduced agile concepts, requirements, contracting, and digital engineering (DE) for physical systems. These things are all enablers of agility, but they don’t make a program agile per se. The key to agility is how the program is planned and how functions are prioritized.

Agile program planning

A traditional waterfall program is planned using the Statement of Work (SOW), Work Breakdown Structure (WBS), and Integrated Master Schedule (IMS). This basically requires scheduling all of the work before the project starts, considering dependencies, key milestones, etc. Teams know what to work on because the schedule tells them what they’ll be working on when. At least in theory.

Read More

College interviewing tips

For several years I’ve been volunteering as an alumni interviewer for my alma mater. It’s enjoyable to spend a bit of time interacting with a younger generation and exploring their interests; my optimism is buoyed by their potential.

Read More

Minimum Viable Product (MVP): You’re doing it wrong

Quibi was a short-lived short-form video app. It was founded in August 2018, launched in April 2020, and folded in December 2020, wiping out $1.75 billion of investors’ money. That’s twenty months from founding to launch and just six months to fail. Ouch.

Forbes chalked this up to “a misread of consumer interests”; though the content was pretty good, Quibi only worked as a phone app while customers wanted TV streaming, and it lacked social sharing features that may have drawn in new viewers. It was also a paid service competing with free options like YouTube and TikTok. According to The Wall Street Journal, the company’s attempts to address the issues were too late: “spending on advertising left little financial wiggle room when the company was struggling”.

If only there were some way Quibi could have validated its concept prior to wasting nearly two billion dollars.

Read More