Our world and our systems are safer than ever. A major reason why is that we’ve learned from prior mistakes. Many of our practices, rules, and standards are “written in blood” from past, tragic failures.
We learn so that we don’t repeat the same mistakes. Of course, we first identify the proximate causes—the specific events directly leading to a casualty. To truly learn, we must take a step back to examine the larger context: what were the preceding holes in the Swiss cheese, and how do we account for them in our systems engineering practice? This approach is increasingly important as systems grow in complexity. This post describes three case studies illustrating a common theme: increasing complexity demands strong systems engineering approaches to maximize safety, system performance, and suitability.
USS McCain Collision
In 2017, USS McCain collided with the merchant vessel Alnic MC in the Singapore Strait. McCain was heading to port to fix issues with its integrated bridge and navigation system (IBNS), which was prone to bugs and failures. The ship’s captain, distrustful of the buggy IBNS, operated in a manual backup mode that bypassed several safeguards.
The incident started in the early morning in a busy shipping lane, a high-demand situation. To reduce workload on junior helmsmen, the captain ordered a “split helm” with one station controlling steering and one controlling speed. In the backup manual mode, changing to split helm required a multi-step, manual process. The crew made several mistakes, starting with transferring steering to the wrong station without realizing it.
Believing steering was lost, the helmsman instinctively reduced thrust to slow the ship. However, the incomplete manual transfer process had left the port and starboard throttle controls unganged, and only the port shaft slowed. Instead of decelerating, the McCain veered directly into the path of the Alnic MC.
In the image above, from a sister ship of McCain, you will notice a large red button meant to take emergency control of the ship. This was certainly a situation in which to use it. However, the crew had an incorrect understanding of how this button worked and what would happen when it was pressed. Over the course of 16 seconds, multiple crewmembers pressed the big red buttons at their stations and control switched between the bridge and aft backup stations three times. By the time the crew regained control, it was too late. A few seconds later, the bow of Alnic MC struck McCain, and ten sailors died.
The crew of McCain had indeed lost control of the ship. However, it was not because of any technical failure. It was caused by a confluence of design, training, and operational deficiencies. Among these, a lack of trust in automation led them to operate in backup manual mode, without the support and safeguards that would have improved safety. Most pressing, in my opinion, are the confusing controls that didn’t retain the functionality of traditional physical controls. Critically, the IBNS on McCain removed the physical throttles in favor of only digital displays; physical throttle controls would have made it obvious to the entire bridge team at a glance that the controls were not ganged.
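To make the ganging problem concrete, here is a minimal sketch of a two-shaft throttle model that surfaces the unganged state whenever a command leaves the shafts mismatched. Everything in it (`ThrottleState`, `set_thrust`, the warning text) is hypothetical and written for illustration; it is not how the actual IBNS worked.

```python
from dataclasses import dataclass


@dataclass
class ThrottleState:
    """Hypothetical two-shaft throttle model; not the actual IBNS design."""
    port: float = 0.0        # commanded thrust, 0.0 to 1.0
    starboard: float = 0.0   # commanded thrust, 0.0 to 1.0
    ganged: bool = True      # True when one command drives both shafts


def set_thrust(state: ThrottleState, value: float, shaft: str = "port") -> list[str]:
    """Apply a thrust command and return warnings for the bridge team."""
    if shaft not in ("port", "starboard"):
        raise ValueError(f"unknown shaft: {shaft}")

    warnings: list[str] = []
    if state.ganged:
        # Ganged: a single command moves both shafts together.
        state.port = state.starboard = value
    else:
        # Unganged: only the addressed shaft changes.
        setattr(state, shaft, value)
        if state.port != state.starboard:
            # Make the mismatch impossible to miss, as physical levers would.
            warnings.append(
                f"THROTTLES UNGANGED: port={state.port:.2f}, "
                f"starboard={state.starboard:.2f}"
            )
    return warnings


# Slowing only the port shaft while unganged produces an explicit warning.
state = ThrottleState(port=0.8, starboard=0.8, ganged=False)
for message in set_thrust(state, 0.3):
    print(message)
```

The point of the sketch is the design choice, not the code: a pair of physical levers shows a mismatch for free, while a software-defined control has to be deliberately designed to show it.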
Chekov’s Gear Shifter
The same type of design shortcoming affects our daily lives as well. Anton Yelchin, known for playing Chekov in the 2009 Star Trek film, died tragically in 2016 at age 27. After parking his Jeep Grand Cherokee in his driveway, he exited the vehicle, unaware it was still in neutral. The car rolled backward, crushing him against a wall and killing him.
The issue was the design of the “monostable” shift knob, pictured above. At the time of this incident, the design had already been implicated in 266 crashes causing 68 injuries. Drivers complained that the shifter didn’t provide adequate tactile or position feedback, which led them to select the wrong gear and, in particular, to leave the vehicle in neutral or reverse instead of park. Fiat Chrysler had already initiated a voluntary recall based on these complaints and was working on a software update to provide better feedback and to prevent the car from moving in certain conditions. Unfortunately, the fix wasn’t available at the time of Yelchin’s death, and the incident caused Fiat Chrysler to fast-track the development and fielding of the fix.
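A software safeguard of the kind described, one that prevents the car from moving in certain conditions, might look something like the sketch below. The conditions, the speed threshold, and all names (`VehicleState`, `apply_park_interlock`) are my assumptions for illustration; they are not taken from the actual Fiat Chrysler update.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Gear(Enum):
    PARK = "P"
    REVERSE = "R"
    NEUTRAL = "N"
    DRIVE = "D"


@dataclass(frozen=True)
class VehicleState:
    gear: Gear
    engine_running: bool
    driver_door_open: bool
    speed_kph: float


def apply_park_interlock(state: VehicleState, max_speed_kph: float = 2.0) -> VehicleState:
    """Shift to park when the driver appears to be leaving a vehicle that can roll.

    The trigger conditions and threshold are hypothetical.
    """
    driver_leaving = state.driver_door_open and state.engine_running
    nearly_stopped = state.speed_kph <= max_speed_kph
    if state.gear is not Gear.PARK and driver_leaving and nearly_stopped:
        return replace(state, gear=Gear.PARK)
    return state


# A driver opens the door with the engine running and the car in neutral.
leaving = VehicleState(Gear.NEUTRAL, engine_running=True, driver_door_open=True, speed_kph=0.0)
print(apply_park_interlock(leaving).gear)
```

An interlock like this compensates for a confusing control rather than fixing it; the driver's mental model is still wrong, but the consequence of the error is contained.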
This case highlights the risks of abandoning proven design principles. The monostable shifter resembled traditional designs but worked very differently, confusing users without adding functional benefits.
Breaking established design patterns is not necessarily a bad thing. Historically, function drove form with, for example, mechanical linkages between the shifter and transmission. Users are able to develop mental models of the system by following these physical paths. Moving to software-defined systems enables much more flexibility in design, which can enhance performance and safety. But it also removes many of the constraints of physical systems, breaks mental models, and allows solutions to become much more complex. That can result in undesired emergent behavior.
Two related design principles are “if it’s not broken, don’t fix it” and “build on the user’s existing mental model.” When it is necessary to depart from existing concepts to achieve solution objectives, the designer must carefully consider the potential impacts and account for them. The monostable shifter was problematic because it looked like a traditional shifter but worked differently. Not only did that trip up users, it didn’t actually add anything to the effectiveness of the solution; it was different just for the sake of being different.
Aircraft Flight Controls
A positive example comes from aviation. Across manufacturers, models, and decades of technology advancement, controls and displays have remained relatively consistent with proven success. Basic primary flight display layout and colors are similar across aircraft, whether glass displays or traditional gauges; essential controls have the same shape coding and movement in any aircraft.
An effective, optimized, proven layout allows the pilot to focus on managing the aircraft and executing their mission. It also enables skills transfer; a pilot learning a new aircraft can focus on the unique qualities of that type rather than re-learning basic flight controls and displays.
Automation in modern aircraft has increased substantially. Airbus pioneered fly-by-wire: there are no direct linkages between cockpit controls and aircraft control surfaces, and all inputs are mediated by software. That radically enhances safety. It’s nearly impossible to stall an Airbus because the flight computer won’t let you leave the safe flight envelope. Even so, it’s not infallible.
Air France Flight 447
In 2009, Air France flight 447 crashed into the Atlantic on a flight from Rio de Janeiro to Paris. The aircraft entered icing conditions and the pitot tubes froze over, causing airspeed data to be unreliable. Without valid airspeed data, the autopilot disconnected, the autothrust became unavailable, and the flight control software switched to an alternate control law. In this alternate law, the software didn’t mediate the pilot’s control inputs and so the controls were much more sensitive.
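The chain of mode changes described above can be sketched as a small state transition. This is a deliberately simplified illustration, not Airbus flight control logic: the class, the function, and the idea of returning a list of changes to announce are my own assumptions, and the sketch ignores how the system recovers when airspeed data returns.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FlightControlState:
    """Hypothetical, highly simplified view of the automation state."""
    control_law: str = "NORMAL"
    autopilot_engaged: bool = True
    autothrust_available: bool = True


def on_airspeed_validity(
    state: FlightControlState, airspeed_valid: bool
) -> tuple[FlightControlState, list[str]]:
    """Return the new state plus the changes that should be announced to the crew."""
    if airspeed_valid:
        return state, []

    # Loss of valid airspeed degrades three things at once.
    new_state = replace(
        state,
        control_law="ALTERNATE",
        autopilot_engaged=False,
        autothrust_available=False,
    )
    changes = [
        f"{name}: {getattr(state, name)} -> {getattr(new_state, name)}"
        for name in ("control_law", "autopilot_engaged", "autothrust_available")
        if getattr(state, name) != getattr(new_state, name)
    ]
    return new_state, changes


new_state, changes = on_airspeed_validity(FlightControlState(), airspeed_valid=False)
for change in changes:
    print(change)
```

Even in this toy version, a single sensor fault changes three aspects of how the aircraft behaves at the same moment, which is exactly the kind of abrupt transition a crew has to absorb under stress.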
The most junior of the pilots onboard was the pilot flying and he struggled to adapt to the abrupt change. He spent the first 30 seconds getting a back-and-forth roll under control, over-correcting as he got used to the more sensitive handling of the aircraft. As he was fighting this roll, he also pulled back on the stick, which is a natural tendency of pilots in tense situations; that caused the aircraft to climb very steeply and ultimately stall. He continued to pull back almost the entire rest of the flight, even as the more experienced pilot tried to take control and push the nose forward to regain airspeed; the mismatched inputs caused “DUAL INPUT” warnings that the crew either didn’t notice or ignored, and the plane continued to respond to the junior pilot’s incorrect inputs.
Without the software providing flight envelope protection, the normally-unstallable aircraft fell out of the sky and 228 people died. The last thing the pilot flying said was “We’re going to crash! This can’t be true. But what’s happening?” Just as with McCain, there was nothing inherently wrong with the aircraft, just a disconnect between the user’s understanding and the actual state of the system, fueled by a rapid change of state, a stressful situation, and an inability to rebuild situational awareness in time.
There are several lessons in this case study, but what stands out is the paradox of automation. The pilot was so used to the safety of the flight control software that his basic aviator skills weren’t available when he needed them. The case also raises questions about flight control design, the need for graceful degradation, and better training.
Lessons Learned
The key takeaway from these case studies for systems engineering practice is that the performance of the system is the product of human performance and technology performance. It doesn’t matter how great the technology is if the human can’t use it to safely and effectively accomplish their mission. That’s especially true in unusual or off-nominal situations. The robustness of the system depends on the ability of the human and/or technology to account for, adapt to, and recover from unusual situations.
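As a rough illustration of that multiplicative relationship, consider the sketch below. The 0-to-1 scores and the numbers plugged in are invented for the example; they are not measurements from any of the case studies.

```python
def system_performance(human: float, technology: float) -> float:
    """Multiplicative model: each factor is a score between 0.0 and 1.0."""
    for score in (human, technology):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be between 0.0 and 1.0")
    return human * technology


# Illustrative numbers only: the same technology, with the human
# performing well in a nominal situation and poorly in an off-nominal one.
nominal = system_performance(human=0.9, technology=0.99)
off_nominal = system_performance(human=0.2, technology=0.99)
print(f"nominal: {nominal:.3f}, off-nominal: {off_nominal:.3f}")
```

Because the factors multiply rather than add, excellent technology cannot make up for a human who has been left unable to perform.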
With the rise of software-defined systems, complexity is outpacing our ability to characterize emergent behavior. One role of systems engineering is to minimize and manage this complexity. Human-centered approaches have proven to be effective at supporting user performance within complex systems, especially when combined with the frequent user feedback and iteration of the agile methodology. Building from user needs ensures that complexity is added when necessary for the sake of the solution rather than because it’s cool (Jeep monostable shifter) or economical (McCain IBNS). It also suggests other, non-design aspects that support user performance such as decision aids and training (AF447).
Human-centered approaches are a natural part of a holistic systems engineering program. Too often, engineers focus on developing technological aspects without a true understanding of the stakeholder and mission needs. These are the first principles that should guide all of our design decisions before we start to write code or bend metal.
From that deep understanding, we can ask my favorite question: how might we? Sometimes ‘how might we’ leads to small, incremental changes that add up to major performance gains. Other times it demands entirely rethinking the problem and solution, especially if revolutionary technology improvements are available.
Finally, designs must be thoughtful, putting practical needs ahead of novelty or cost savings to create the most effective solutions. This approach not only enhances system performance but also prevents failures. As the U.S. Navy report on the McCain collision noted:
“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence), without considering the effect to operators who are trained and proficient in operating legacy equipment.”
— US Navy report on the McCain collision
By prioritizing user needs, we can create systems that are safe, effective, and resilient in the face of challenges.
What case studies or examples have influenced your thinking? How do you apply user-centered or other approaches successfully in your practice? Share in the comments below.