The Boeing 737 Max crashes represent a failure of systems engineering

The 737 is an excellent airplane with a long history of safe, efficient service. Boeing’s cockpit philosophy of direct pilot control and positive mechanical feedback represents excellent human factors¹. In the latest generation, the 737 Max, Boeing added a new component to the flight control system which deviated from this philosophy, resulting in two fatal crashes. This is a case study in the failure of human factors engineering and systems engineering.

The 737 Max and MCAS

You’ve certainly heard of the 737 Max, the fatal crashes in October 2018 and March 2019, and the Maneuvering Characteristics Augmentation System (MCAS) which has been cited as the culprit. Even if you’re already familiar, I highly recommend these two thorough and fascinating articles:

Darryl Campbell at The Verge traces the market pressures and regulatory environment which led to the design of the Max, describes the cockpit activities leading up to each crash, and analyzes the information Boeing provided to pilots.
Gregory Travis at IEEE Spectrum provides a thorough analysis of the technical design failures from the perspective of a software engineer along with an appropriately glib analysis of the business and regulatory environment.

Typically I’d caution against armchair analysis of an aviation incident until the final crash investigation report is in. However, given the availability of information on the design of the 737 Max, I think the engineering failures are clear even as the crash investigations continue.

Hazard analysis

The most glaring, obvious, and completely inexplicable design choice was a lack of redundancy in the MCAS sensor inputs. Gregory Travis attributes this to “inexperience, hubris, or lack of cultural understanding” on the part of the software team. That may well be the case, but it’s nowhere near the whole story.

There’s a team whose job it is to understand how the various aspects of the system work together: systems engineering². One essential job of the systems engineer is to understand all of the possible interactions among system components, how those components behave together under various conditions, and what happens if any part (or combination of parts) fails. That last question is addressed by hazard analysis techniques such as failure modes, effects, and criticality analysis (FMECA).

The details of risk management may vary among organizations, but the general principles are the same: (1) Identify hazards, (2) categorize by severity and probability, (3) mitigate/control risk as much as practical and to an acceptable level, (4) monitor for any issues. These techniques give the engineering team confidence that the system will be reasonably safe.

[Figure: FAA Safety Risk Management Process flowchart and Risk Categorization Matrix, from FAA Order 8040.4B, Safety Risk Management Policy.]
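To make the "categorize by severity and probability" step concrete, here is a minimal sketch of a risk matrix lookup. The numeric scales, the score thresholds, and the example hazard ratings are all assumptions invented for illustration; they are not the cells of the FAA matrix and not Boeing's actual ratings.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales. The scoring is invented for illustration
# and does NOT reproduce the matrix in FAA Order 8040.4B.
SEVERITY = {"minimal": 1, "minor": 2, "major": 3, "hazardous": 4, "catastrophic": 5}
LIKELIHOOD = {
    "extremely_improbable": 1,
    "extremely_remote": 2,
    "remote": 3,
    "probable": 4,
    "frequent": 5,
}


@dataclass
class Hazard:
    name: str
    severity: str
    likelihood: str


def risk_level(hazard: Hazard) -> str:
    """Map a hazard to a coarse risk level using assumed score thresholds."""
    score = SEVERITY[hazard.severity] * LIKELIHOOD[hazard.likelihood]
    if score >= 10:
        return "high"    # assumed: unacceptable without further mitigation
    if score >= 5:
        return "medium"  # assumed: acceptable only with controls and monitoring
    return "low"


# Hypothetical ratings for the same hazard before and after a mitigation
# (such as cross-checking sensors) lowers its likelihood.
unmitigated = Hazard("erroneous AoA drives nose-down trim", "catastrophic", "remote")
mitigated = Hazard("erroneous AoA drives nose-down trim", "catastrophic", "extremely_improbable")

print(unmitigated.name, "->", risk_level(unmitigated))
print(mitigated.name, "->", risk_level(mitigated))
```

The point of the sketch is that mitigation rarely changes severity; it moves the hazard along the likelihood axis until the combined risk is acceptable.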

On its own, the angle of attack (AoA) sensor is an important but not critical component. The pilots can fly the plane without it, though stall-protection, automatic trim, and autopilot functions won’t work normally, increasing pilot workload. The interaction between the sensor and flight control augmentation system, MCAS in the case of the Max, can be critical. If MCAS uses incorrect AoA information from a faulty sensor, it can push the nose down and cause the plane to lose altitude. If this happens, the pilots must be able to diagnose the situation and respond appropriately. Thus the probability of a crash caused by an AoA failure can be notionally figured as follows:

P(AoA sensor failure) × P(system unable to recognize failure) × P(system unable to adapt to failure) × P(pilots unable to diagnose failure) × P(pilots unable to disable MCAS) × P(pilots unable to safely fly without MCAS)

AoA sensors can fail, but that shouldn’t be much of an issue because the plane has at least two of them and it’s pretty easy for the computers to notice a mismatch between them and also with other sources of attitude data such as inertial navigation systems. Except, of course, that MCAS didn’t bother to cross-check; the probability of the Max failing to recognize and adapt to a potential AoA sensor failure was 100%. You can see where I’m going with this: the AoA sensor is a single point of failure with a direct path through MCAS to the flight controls. Single point of failure and flight controls in the same sentence ought to give any engineer chills.
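The notional product of probabilities above can be worked through numerically. This is only a sketch: every probability below is invented for illustration, the factor names are my own labels, and multiplying the terms assumes the events are independent, which a real analysis could not take for granted.

```python
import math

# Hypothetical per-flight probabilities, invented purely for illustration.
baseline = {
    "aoa_sensor_fails": 1e-4,
    "system_misses_failure": 1e-3,
    "system_cannot_adapt": 1e-2,
    "pilots_cannot_diagnose": 1e-1,
    "pilots_cannot_disable_mcas": 1e-1,
    "pilots_cannot_fly_without_mcas": 1e-2,
}


def chain_probability(factors: dict) -> float:
    """Multiply every link in the failure chain (assumes independence)."""
    return math.prod(factors.values())


# With no cross-check, the system never recognizes or adapts to a bad
# sensor, so those two links are certain to fail.
no_cross_check = {
    **baseline,
    "system_misses_failure": 1.0,
    "system_cannot_adapt": 1.0,
}

ratio = chain_probability(no_cross_check) / chain_probability(baseline)
print(f"with cross-check:    {chain_probability(baseline):.1e}")
print(f"without cross-check: {chain_probability(no_cross_check):.1e}")
print(f"risk multiplier:     {ratio:.0f}x")
```

With these made-up numbers, removing the two automated safeguards raises the overall probability by about five orders of magnitude. The absolute values mean nothing; the lesson is that every link set to 100% shifts the entire burden onto the remaining links, which here are all human.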

The next link in our failure chain is the pilots and their ability to recognize, diagnose, and respond to the issue. This implies proper training, procedures, and understanding of the system. From the news coverage, it seems that pilots were not provided sufficient information on the existence of MCAS and how to respond to its failure. Systems and human factors engineers, armed with a hazard analysis, should have known about and addressed this potential contributing factor to reduce the overall risk.

Finally, there’s the ability of the pilots to disable and fly without MCAS. The Ethiopian Airlines crew correctly diagnosed and responded to the issue but the aerodynamic forces apparently prevented them from manually correcting it. The ability to override those forces, plus the time it takes to correct the flight path, should have been part of the FMECA.

I have no specific knowledge of the hazard analyses performed on the 737 Max. Based on recent events, it seems that the risk of this type of failure was either severely underestimated or went unaddressed. Either way, it is poor systems engineering.

Cockpit human factors

An inaccurate hazard analysis, though inexcusable, could be an oversight. Compounding that, Boeing made a clear design decision in the cockpit controls which is hard to defend.

In previous 737 models, pilots could quickly override automatic trim control by yanking back on the yoke, similar to disabling cruise control in a car by hitting the brake. This is great human factors and it fit right in with Boeing’s cockpit philosophy of ensuring that the human was always in ultimate control. This function was removed in the Max.

As both the Lion Air and Ethiopian Airlines crews experienced, the aerodynamic forces being fed into the yoke are too strong for the human pilots to overcome. When MCAS directs the nose to go down, the nose goes down. Rather than simply control the airplane, Max pilots first have to disable the automated systems. Comparisons to HAL are not unwarranted.

In summary

Boeing is developing a fix for MCAS. It will include redundancy in AoA sensor inputs, not activating MCAS if the sensors disagree, MCAS activating only once per high-angle indication (i.e. not continuously activating after the pilots have given contrary commands), and limiting the feedback forces into the control yoke so that they aren’t stronger than the pilots. This functionality should have been part of the system to begin with.
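The first three elements of that fix amount to a small piece of gating logic, sketched below. This is not Boeing's implementation: the class name, the thresholds, and the choice to act on the lower of the two readings are all assumptions for illustration, and the limit on yoke feedback forces is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical thresholds in degrees; not taken from any Boeing document.
DISAGREE_LIMIT_DEG = 5.0
HIGH_AOA_DEG = 12.0


@dataclass
class McasGate:
    """Decides whether an MCAS-like function may command nose-down trim."""

    armed: bool = True  # True until the function fires during a high-AoA event

    def should_activate(self, aoa_left: float, aoa_right: float) -> bool:
        # Redundant inputs: if the two sensors disagree, trust neither.
        if abs(aoa_left - aoa_right) > DISAGREE_LIMIT_DEG:
            return False

        # Assumed conservative choice: both sensors must read high.
        if min(aoa_left, aoa_right) < HIGH_AOA_DEG:
            self.armed = True  # the high-AoA event is over, so re-arm
            return False

        # Fire at most once per high-AoA event.
        if self.armed:
            self.armed = False
            return True
        return False


gate = McasGate()
readings = [(14.0, 14.5), (15.0, 15.0), (3.0, 3.5), (74.5, 3.0)]
for left, right in readings:
    print(left, right, gate.should_activate(left, right))
```

In this sketch a single wildly wrong sensor (the last reading) produces no command at all, and a sustained high reading produces exactly one.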

Along with these fixes, Boeing is likely³ also re-conducting a complete hazard analysis of MCAS and other flight control systems. Boeing and the FAA should not clear the type until the hazards are completely understood, controlled, quantified, and deemed acceptable.

Many news stories frame the 737 Max crashes in terms of the market and regulatory pressures which resulted in the design. While I don’t disagree, these are not an excuse for the systems engineering failures. The 737 Max is a valuable case study for engineers of all types in any industry, and for systems engineers in high-risk industries in particular.

Footnotes:

¹ Airbus’s fly-by-wire approach has some strengths over Boeing’s philosophy, but the human factors design isn’t as strong; its independent side-sticks were a contributing factor to the Air France Flight 447 crash.
² Duh.
³ At least they certainly should be.