I can’t tell you how many times I’ve seen a senior DoD leader hold up their smartphone1 and wonder aloud why their military systems can’t work as seamlessly.
The answer is simple: there is no market that incentivizes companies to build seamless products for the military. Android phones and iPhones work so well because there is competition. If Facebook Messenger2 starts releasing buggy versions, users will uninstall it and switch to Signal, Telegram, Snapchat, or dozens of other messaging apps with various capabilities. Conversely, if a developer creates a fantastic new app that disrupts the incumbents, everyone will quickly switch to it. This forces the entire industry to continually innovate3.
Apple and Google are also in competition, so it’s in their interest to foster ecosystems of hardware and software developers that in turn build and maintain market share for their products. The market determines the success or failure of the companies in those ecosystems, and that competition produces excellent consumer technologies.
Good enough for government work
The US defense industry is not a competitive market, at least not in the same way4. Incentives across the military-industrial complex are misaligned and our nation’s security suffers for it. Even when everyone involved has the best of intentions, military prime contractors only win projects when they’re just cheap enough, just fast enough, and just good enough.
We joke that it’s “good enough for government work”, but the warfighter and the taxpayer deserve better.
Open architectures
The solution is relatively simple, at least in theory: the government needs to support the creation and enforcement of modular open system architecture (MOSA) standards for every aspect of the battlefield. We have a model for this already: Future Airborne Capability Environment (FACE) is an open software standard and certification process for military helicopters developed as a consortium between government acquisition agencies and major prime contractors. FACE has many benefits:
Software reuse across platforms: Solutions developed for one platform can be reused on all compliant platforms with few or no changes
Plug-and-play: Systems can be easily reconfigured for different mission sets
Speed and reliability: Developers can easily understand the interfaces and capabilities, and automated compliance checking ensures that delivered solutions will work
Competition: Anyone can develop to the published standards and offer competing products
Sustainment: If a supplier goes out of business, their components can be replaced easily without being hampered by proprietary interfaces
Upgradability: Software updates can be released faster and with less risk, as long as compliance checks are passed
All of this adds up to cost and schedule savings as well as the potential for more capable solutions. In addition to being an effective approach, FACE serves as a case study for other acquisition organizations on how to develop their own open standards and enforcement, which is helpful now that federal law requires the DoD to use MOSAs in systems development.
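To make the “published standards” idea concrete, here is a minimal sketch in Python of what developing to an open interface and checking compliance against it might look like. The interface, vendor, and method names are all hypothetical illustrations, not part of FACE or any real MOSA standard.

```python
# Hypothetical sketch (not the real FACE API): a published interface that any
# vendor's component must implement, plus a simple automated compliance check.
from abc import ABC, abstractmethod


class NavDataProvider(ABC):
    """Illustrative 'published standard' interface for a navigation data service."""

    @abstractmethod
    def position(self) -> tuple[float, float]:
        """Return (latitude, longitude) in decimal degrees."""

    @abstractmethod
    def ground_speed(self) -> float:
        """Return ground speed in knots."""


def check_compliance(component: object) -> bool:
    """Automated check: does the delivered component satisfy the published interface?"""
    required = ["position", "ground_speed"]
    return all(callable(getattr(component, name, None)) for name in required)


class VendorAGps:
    """A competing vendor's plug-in implementation of the same standard."""

    def position(self) -> tuple[float, float]:
        return (36.85, -75.98)

    def ground_speed(self) -> float:
        return 12.0


if __name__ == "__main__":
    # True: safe to integrate on any compliant platform
    print(check_compliance(VendorAGps()))
```

Any vendor whose component passes the published check can compete for the slot, which is exactly the plug-and-play, competition, and sustainment benefit described above.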
Future vision: There’s an app for that
I’m excited for the ecosystem that this surge will create. I imagine a future where warfighters choose what apps to use from an available library, just like an app store. Instead of program offices acquiring specific technologies, MOSAs will enable them to open up the competition, allow multiple vendors to make approved apps available, and pay those vendors proportionally by hours of use. This is better for the warfighter, who will be able to choose the solution that works best for their needs and mission. This is better for the government, which will offload development risk and funding. And this is better for innovative developers who truly care about delivering the best solutions, because they will be financially rewarded for doing so.
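As a toy illustration of the pay-by-usage idea, here is a short Python sketch that splits a fixed program budget across approved vendors in proportion to hours of use. The vendor names and numbers are invented; this is just one way such a scheme could be computed.

```python
# Minimal sketch of the pay-by-usage idea: split a fixed program budget across
# approved vendors in proportion to warfighter hours of use. All values invented.
def usage_based_payments(budget: float, hours_by_vendor: dict[str, float]) -> dict[str, float]:
    total_hours = sum(hours_by_vendor.values())
    if total_hours == 0:
        return {vendor: 0.0 for vendor in hours_by_vendor}
    return {
        vendor: budget * hours / total_hours
        for vendor, hours in hours_by_vendor.items()
    }


if __name__ == "__main__":
    monthly_use = {"Vendor A mapping app": 1200.0, "Vendor B mapping app": 300.0}
    print(usage_based_payments(100_000.0, monthly_use))
    # {'Vendor A mapping app': 80000.0, 'Vendor B mapping app': 20000.0}
```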
That’s a big vision, and a lot has to change before we can get there, but it’s just one of the possibilities opening up as we push toward developing and adopting MOSAs. If you’re interested in learning more and becoming part of the conversation, a new community called MOSA Network was recently launched. Start here with a brief analysis of the Tri-Services Memo based on the new law:
What’s your vision for a MOSA-enabled future? How else can consumer technologies inspire better battlefield solutions? How will you engage in the MOSA network?
Our world and our systems are safer than ever. A major reason why is that we’ve learned from prior mistakes. Many of our practices, rules, and standards are “written in blood” from past, tragic failures.
We learn so that we don’t repeat the same mistakes. Of course, we first identify the proximate causes—the specific events directly leading to a casualty. To truly learn, we must take a step back to examine the larger context: what were the preceding holes in the Swiss cheese, and how do we account for them in our systems engineering practice? This approach is increasingly important as systems grow in complexity. This post describes three case studies illustrating a common theme: increasing complexity demands strong systems engineering approaches to maximize safety, system performance, and suitability.
USS McCain Collision
In 2017, USS McCain collided with the merchant vessel Alnic MC in the Singapore Strait. McCain was heading to port to fix issues with its integrated bridge and navigation system (IBNS), which was prone to bugs and failures. The ship’s captain, distrustful of the buggy IBNS, operated in a manual backup mode that bypassed several safeguards.
The incident began in the early morning in a busy shipping lane, a high-demand situation. To reduce the workload on junior helmsmen, the captain ordered a “split helm,” with one station controlling steering and one controlling speed. In the backup manual mode, changing to split helm required a multi-step, manual process. The crew made several mistakes, starting with transferring steering to the wrong station without realizing it.
Believing steering was lost, the helmsman instinctively reduced thrust to slow the ship. However, the incomplete manual transfer process had left the port and starboard throttle controls unganged, and only the port shaft slowed. Instead of decelerating, the McCain veered directly into the path of the Alnic MC.
IBNS on USS Dewey; US Navy photo
In the image above, from a sister ship of McCain, you will notice a large red button meant to take emergency control of the ship. This was certainly a situation in which to use it. However, the crew had an incorrect understanding of how this button worked and what would happen when it was pressed. Over the course of 16 seconds, multiple crewmembers pressed the big red buttons at their stations and control switched between the bridge and aft backup stations three times. By the time the crew regained control, it was too late. A few seconds later, the bow of Alnic MC struck McCain, and ten sailors died.
Damage to McCain; US Navy photo
The crew of McCain had indeed lost control of the ship. However, it was not because of any technical failure. It was caused by a confluence of design, training, and operational deficiencies. Among these, a lack of trust in automation led the crew to operate in backup manual mode without the support and safeguards that would have improved safety. Most pressing, in my opinion, is the confusing control design, which didn’t retain the functionality of traditional physical controls. Critically, the IBNS on McCain replaced the physical throttles with digital displays alone; physical throttle controls would have made it obvious to the entire bridge team at a glance that the controls were not ganged.
Chekov’s Gear Shifter
The same type of design shortcoming affects our daily lives as well. Anton Yelchin, known for playing Chekov in the 2009 Star Trek film, died tragically in 2016 at age 27. After parking his Jeep Grand Cherokee in his driveway, he exited the vehicle, unaware it was still in neutral. The car rolled backward, crushing him against a wall and killing him.
The issue was the design of the “monostable” shift knob, pictured above. At the time of this incident, the design had already been implicated in 266 crashes causing 68 injuries. Complaints were that the shifter didn’t provide adequate tactile or position feedback to the driver, causing incorrect gears to be selected and particularly putting the vehicle into neutral or reverse instead of park. Fiat Chrysler had already initiated a voluntary recall based on these complaints and was working on a software update to provide better feedback and to prevent the car from moving in certain conditions. Unfortunately, the fix wasn’t available at the time of Yelchin’s death, and the incident caused Fiat Chrysler to fast-track the development and fielding of the fix.
This case highlights the risks of abandoning proven design principles. The monostable shifter resembled traditional designs but worked very differently, confusing users without adding functional benefits.
Breaking established design patterns is not necessarily a bad thing. Historically, function drove form: mechanical linkages between the shifter and transmission, for example. Users are able to develop mental models of the system by following these physical paths. Moving to software-defined systems enables much more flexibility in design, which can enhance performance and safety. But it also removes many of the constraints of physical systems, breaks mental models, and allows solutions to become much more complex. That can result in undesired emergent behavior.
Two related design principles are “if it’s not broken, don’t fix it” and building on the user’s existing mental model. When it is necessary to depart from existing concepts to achieve solution objectives, the designer must carefully consider the potential impacts and account for them. The monostable shifter was problematic because it looked like a traditional shifter but worked differently. Not only did that trip up users, it didn’t actually add anything to the effectiveness of the solution; it was different just for the sake of being different.
Aircraft Flight Controls
Composite of a Hawker Siddeley Trident (left) and an Airbus A380 (right) cockpit (originals by Nimbus227 and Naddsy, CC BY 2.0)
A positive example comes from aviation. Across manufacturers, models, and decades of technology advancement, controls and displays have remained relatively consistent with proven success. Basic primary flight display layout and colors are similar across aircraft, whether glass displays or traditional gauges; essential controls have the same shape coding and movement in any aircraft.
An effective, optimized, proven layout allows the pilot to focus on managing the aircraft and executing their mission. It also enables skills transfer; a pilot learning a new aircraft can focus on the unique qualities of that type rather than re-learning basic flight controls and displays.
Automation in modern aircraft has increased substantially. Airbus pioneered fly-by-wire: there are no direct linkages between cockpit controls and aircraft control surfaces; all inputs are mediated by software. That radically enhances safety. It’s nearly impossible to stall an Airbus because the flight computer won’t let you leave the safe flight envelope. Still, it’s not infallible.
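As a simplified illustration of envelope protection (not the actual Airbus control laws), the toy sketch below clamps the commanded angle of attack under a hypothetical “normal law” but passes the pilot’s demand straight through under “alternate law,” the degraded mode that figures in the accident described next. All numbers are made up.

```python
# Toy illustration of the envelope-protection idea -- NOT the actual Airbus
# control laws. In "normal law" the computer clamps the commanded angle of
# attack below the stall threshold; in "alternate law" the protection is gone
# and the pilot's demand passes through. All numbers are invented.
STALL_AOA_DEG = 15.0      # hypothetical stall angle of attack
PROTECTED_MAX_DEG = 12.0  # hypothetical protection limit, kept below stall


def commanded_aoa(pilot_demand_deg: float, law: str) -> float:
    if law == "normal":
        # Envelope protection: never command an angle of attack near the stall.
        return min(pilot_demand_deg, PROTECTED_MAX_DEG)
    # Alternate law: no protection, the demand goes straight through.
    return pilot_demand_deg


if __name__ == "__main__":
    print(commanded_aoa(20.0, "normal"))     # 12.0 -- clamped, aircraft keeps flying
    print(commanded_aoa(20.0, "alternate"))  # 20.0 -- exceeds stall angle, aircraft can stall
```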
Air France Flight 447
In 2009, Air France flight 447 crashed into the Atlantic on a flight from Rio de Janeiro to Paris. The aircraft entered icing conditions and the pitot tubes froze over, causing airspeed data to be unreliable. Without valid airspeed data, the autopilot disconnected, the autothrust became unavailable, and the flight control software switched to an alternate control law. In this alternate law, the software didn’t mediate the pilot’s control inputs and so the controls were much more sensitive.
The most junior of the pilots onboard was the pilot flying and he struggled to adapt to the abrupt change. He spent the first 30 seconds getting a back-and-forth roll under control, over-correcting as he got used to the more sensitive handling of the aircraft. As he was fighting this roll, he also pulled back on the stick, which is a natural tendency of pilots in tense situations; that caused the aircraft to climb very steeply and ultimately stall. He continued to pull back almost the entire rest of the flight, even as the more experienced pilot tried to take control and push the nose forward to regain airspeed; the mismatched inputs caused “DUAL INPUT” warnings that the crew either didn’t notice or ignored, and the plane continued to respond to the junior pilot’s incorrect inputs.
Without the software providing flight envelope protection, the normally-unstallable aircraft fell out of the sky and 228 people died. The last thing the pilot flying said was “We’re going to crash! This can’t be true. But what’s happening?” Just like with the McCain, there was nothing inherently wrong with the aircraft, just a disconnect between the user’s understanding and the actual state of the system, fueled by a rapid change of state, a stressful situation, and an inability to rebuild situational awareness in time.
There are several lessons in this case study, but what stands out is the paradox of automation: the pilot was so used to the safety of the flight control software that his basic aviator skills weren’t available when he needed them. There are also lessons here about flight control design, the need for graceful degradation, and better training.
Lessons Learned
The key takeaway from these case studies for systems engineering practice is that the performance of the system is the product of human performance and technology performance. It doesn’t matter how great the technology is if the human can’t use it to safely and effectively accomplish their mission. That’s especially true in unusual or off-nominal situations. The robustness of the system depends on the ability of the human and/or technology to account for, adapt to, and recover from unusual situations.
With the rise of software-defined systems, complexity is outpacing our ability to characterize emergent behavior. One role of systems engineering is to minimize and manage this complexity. Human-centered approaches have proven to be effective at supporting user performance within complex systems, especially when combined with the frequent user feedback and iteration of the agile methodology. Building from user needs ensures that complexity is added when necessary for the sake of the solution rather than because it’s cool (Jeep monostable shifter) or economical (McCain IBNS). It also suggests other, non-design aspects that support user performance such as decision aids and training (AF447).
Human-centered approaches are a natural part of a holistic systems engineering program. Too often, engineers focus on developing technological aspects without a true understanding of the stakeholder and mission needs. These are the first principles that should guide all of our design decisions before we start to write code or bend metal.
From that deep understanding, we can ask my favorite question: how might we? Sometimes ‘how might we’ leads to small, incremental changes that add up to major performance gains. Other times it demands entirely rethinking the problem and solution, especially if revolutionary technology improvements are available.
Finally, designs must be thoughtful, putting practical needs ahead of novelty or cost savings to create the most effective solutions. This approach not only enhances system performance but also prevents failures. As the U.S. Navy report on the McCain collision noted:
“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence), without considering the effect to operators who are trained and proficient in operating legacy equipment.” — US Navy report on the McCain collision
By prioritizing user needs, we can create systems that are safe, effective, and resilient in the face of challenges.
What case studies or examples have influenced your thinking? How do you apply user-centered or other approaches successfully in your practice? Share in the comments below.
One of my favorite items in my small model collection is a 1:34 scale Grumman Long Life Vehicle (LLV)5 with sliding side doors, a roll-up rear hatch, and pull-back propulsion. The iconic vehicle has been plying our city streets for nearly 40 years, reliably delivering critical communiques, bills, checks, advertisements, Dear John letters, junk mail, magazines, catalogs, post cards from afar, chain letters6, and Amazon packages.
Ten sailors died after the crew of the destroyer USS John S. McCain lost control of their vessel, causing a collision with the merchant tanker Alnic MC. There was nothing technically wrong with the vessel or its controls. Though much of the blame was put on the Sailors and Officers aboard, the real fault rests with the design of the Integrated Bridge & Navigation System (IBNS).
In a near-future battlefield against a peer adversary, effective employment of machine learning and autonomy is the deciding factor. While our adversary is adapting commercial, mass-market technologies and controlling them remotely, U.S. and allied forces dominate with the effective application of advanced technologies that make decisions faster and more accurately. The concept of Joint All-Domain Command and Control (JADC2) is a key enabler, driving better battlefield decisions through robust information sharing.
Fed by this information, advanced decision aiding systems present courses of action (COAs) to each commander and then crew in the battle, taking into account every possible factor: tasking, environment and terrain, threats, available sensors and effectors, etc. Options and recommendations adapt as the battle unfolds, supporting every decision made with actionable information while deferring to human judgment.
In this campaign, the first few battles are handily won. It seems this war will be a cakewalk.
Until the enemy learns. They notice routines in behaviors and responses that are easy to exploit: System A is a higher threat priority, so is used as a diversion; displaced earth is flagged as a potential mine, so the enemy digs random holes to slow progress; fire comes in specific patterns, so the enemy knows when a barrage is over and quickly counters.
Pretty soon, these adaptations evolve into active attacks on autonomy: dazzle camouflage tricks computer vision systems into seeing more or different units; noise added to radio communications causes military chatter to be misclassified as civilian; selective sensor jamming confuses autonomy.
As the enemy learns to counter and attack these advanced capabilities, they become less helpful and eventually become a liability. The operators deem them unreliable and revert to human decision-making and manual control of systems. The enemy has evened the battle, and our investment in advanced decision support systems is wasted. Even worse, our operators lack experience controlling the systems manually and are actually at a disadvantage; the technology has actively hurt us.
The solution is clear: we must be prepared to counter the enemy’s learning and to learn ourselves. This is not a new insight. Learning and adaptation have always been essential elements of war, and now they’re more important than ever. The lessons learned from the field must be fed back into the AI/ML/autonomy development process. A short feedback, development, testing, and deployment cycle is essential for autonomy to adapt to the adversary’s capabilities and tactics, techniques, and procedures (TTPs), limiting the adversary’s ability to learn how to defend against and defeat our technologies.
In World War II, cryptography was the game-changing technology. You’re doubtless familiar with Bletchley Park, the codebreaking site that provided critical intelligence throughout the war7. Here, men and women worked tirelessly every single day of the war to analyze communication traffic, break the day’s codes, and pass intelligence to decision-makers. This work saved countless lives, contributed directly to the Allied victory, and is credited with shortening the war by two to four years. With the advancement of communications security8, practically unbreakable encryption is now available to everyone. We will no longer have the advantage of snooping on enemy communication content and must develop some other unique capability to ensure our forces have the edge.
I submit that the advantage will come from military-grade autonomy. Not the autonomous vehicles themselves, which are commodities, but the ability of the autonomy to respond to changing enemy behavior9. One key advantage of traditional human control is adaptability to unique and changing situations, which current autonomy is not capable of; the state of the art in autonomous systems today more closely resembles video game NPCs, mindlessly applying the same routines based on simple inputs. While we may have high hopes for the future of autonomy, the truth is that autonomous systems will be limited for the foreseeable future by an inability to think outside the box.
Average autonomous system
How, then, do we enable the autonomous systems to react rapidly to changing battlefield conditions?
World War III’s version of Bletchley Park will be a capability I’m calling the Battlefield Accelerated Tactics and Techniques Learning and Effectiveness Laboratory. BATTLE Lab is a simulation facility. It ingests data from the field in near-real time, every detail of every battle including terrain, weather, friendly behaviors, enemy tactics, signals, etc. Through experimentation across hundreds of thousands of simulated engagements driven by observed behavior, we’ll develop courses of action for countering the enemy in every imaginable situation. Updated behavior models will be pushed to the field multiple times per day that reduce friendly vulnerabilities, exploit enemy weaknesses, and give our forces the edge.
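To make the concept a bit more tangible, here is a skeleton of one BATTLE Lab update cycle in Python. Every function, data structure, and number is invented for illustration; this is a sketch of the loop described above, not a proposed design.

```python
# Skeleton of one hypothetical BATTLE Lab cycle: ingest field observations,
# run many simulated engagements against them, then publish an updated model.
import random
from dataclasses import dataclass


@dataclass
class BehaviorModel:
    version: int
    parameters: dict


def ingest_field_reports() -> list[dict]:
    """Placeholder: pull near-real-time observations (terrain, weather, enemy tactics)."""
    return [{"enemy_tactic": random.choice(["decoy", "jamming", "massed_fire"])}]


def simulate_engagement(model: BehaviorModel, observation: dict) -> float:
    """Placeholder: score one simulated engagement (higher is better for friendly forces)."""
    return random.random()


def battle_lab_cycle(model: BehaviorModel, n_sims: int = 100_000) -> BehaviorModel:
    """One update cycle: simulate against observed behavior, then publish a new model version."""
    observations = ingest_field_reports()
    scores = [simulate_engagement(model, random.choice(observations)) for _ in range(n_sims)]
    # A real system would search and select candidate tactics here; this sketch
    # just records the average outcome and bumps the model version.
    return BehaviorModel(
        version=model.version + 1,
        parameters={**model.parameters, "mean_score": sum(scores) / n_sims},
    )  # pushed to fielded autonomous systems multiple times per day


if __name__ == "__main__":
    updated = battle_lab_cycle(BehaviorModel(version=1, parameters={}), n_sims=1_000)
    print(updated.version, updated.parameters)
```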
Of course, we already do this today with extensive threat intelligence capabilities, training, and tactics. The difference is that the future battlefield will be chockablock with autonomous systems, which can more rapidly integrate new threat and behavior models generated by BATTLE Lab. We’ll be able to move faster, using autonomy and simulation to shorten the OODA loop while nearly instantly incorporating lessons from every battle.
Without BATTLE Lab, the enemy will learn how our autonomy operates and quickly find weaknesses to exploit; autonomous systems will be weak to spoofing, jamming, and unexpected behaviors by enemy systems. Bletchley Park shortened the OODA loop by providing better intelligence to strategic decision-makers (“Observe”). BATTLE Lab will shorten the OODA Loop by improving the ability of autonomy to understand the situation and make decisions (“Orient” and “Decide”).
BATTLE Lab is enabled by technology available and maturing today: low-cost uncrewed systems10, battlefield connectivity, and edge processing.
A critical gap is human-autonomy interaction. To employ these advanced capabilities effectively, human crews need to be able to task, trust, and collaborate with autonomous teammates, and these interaction strategies need to mature alongside autonomy capabilities to enhance employment at every step. Autonomy tactics may change rapidly as new models are disseminated from the BATTLE Lab, and human teammates need to be able to understand and trust the autonomous system’s behaviors. Explainability and trust are topics of ongoing research; additional efforts to integrate these capabilities into mission planning and mission execution will also be needed.
What do you think the future battlefield will look like and what additional capabilities need to be developed to make it possible? Share your thoughts in the comments below.
Part 5: Agility on Large, Complex Programs (you’re here!)
Putting it all together
In this series we’ve introduced agile concepts, requirements, contracting, and digital engineering (DE) for physical systems. These are all enablers of agility; they don’t make a program agile per se. The key to agility is how the program is planned and how functions are prioritized.
Agile program planning
A traditional waterfall program is planned using the Statement of Work (SOW), Work Breakdown Structure (WBS), and Integrated Master Schedule (IMS). This basically requires scheduling all of the work before the project starts, considering dependencies, key milestones, etc. Teams know what to work on because the schedule tells them what they’ll be working on when. At least in theory.
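By contrast, an agile program plans by continuously prioritizing a backlog rather than scheduling all work up front. As one concrete example (assumed here for illustration, not something prescribed by this series), the sketch below ranks features using WSJF (weighted shortest job first), a common scaled-agile technique; the feature names and scores are invented.

```python
# Hedged sketch: prioritizing a backlog by economic value using WSJF
# (cost of delay divided by job size). All features and scores are invented.
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    business_value: int    # relative value to the mission/user
    time_criticality: int  # how quickly the value decays
    risk_reduction: int    # enablement / risk-reduction value
    job_size: int          # relative effort


def wsjf(f: Feature) -> float:
    """Weighted Shortest Job First: cost of delay divided by job size."""
    return (f.business_value + f.time_criticality + f.risk_reduction) / f.job_size


backlog = [
    Feature("Secure data link", 8, 5, 8, 13),
    Feature("New map overlay", 5, 3, 2, 3),
    Feature("Automated test harness", 3, 2, 8, 5),
]

for f in sorted(backlog, key=wsjf, reverse=True):
    print(f"{f.name}: WSJF = {wsjf(f):.2f}")
```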
Quibi was a short-lived short-form video app. It was founded in August 2018, launched in April 2020, and folded in December 2020, wiping out $1.75 billion of investors’ money. That’s twenty months from founding to launch and just six months to fail. Ouch.
Forbes chalked this up to “a misread of consumer interests”; though the content was pretty good, Quibi only worked as a phone app while customers wanted TV streaming, and it lacked social sharing features that may have drawn in new viewers. It was also a paid service competing with free options like YouTube and TikTok. According to The Wall Street Journal, the company’s attempts to address the issues were too late: “spending on advertising left little financial wiggle room when the company was struggling”.
If only there were some way Quibi could have validated its concept prior to wasting nearly two billion dollars11.
A common misconception is that Agile development processes are faster. I’ve heard this from leaders as a justification for adopting Agile processes and read it in proposals as a supposed differentiator. It’s not true. Nothing about Agile magically enables teams to architect, engineer, design, test, or validate any faster.
In fact, many parts of Agile are actually slower. Time spent on PI planning, backlog refinement, sprint planning, daily stand-ups12, and retrospectives is time the team isn’t developing. Much of that overhead is avoided in a Waterfall style where the development follows a set plan.
CALLBACK is the monthly newsletter of NASA’s Aviation Safety Reporting System (ASRS)13. Each edition features excerpts from real, first-person safety reports submitted to the system. Most of the reports come from pilots, many from air traffic controllers, and a few from the occasional maintainer, ground crew member, or flight attendant. Human factors concerns feature heavily, and the newsletters provide insight into current safety concerns14. ASRS receives five to nine thousand reports each month, so there’s plenty of content for the CALLBACK team to mine.
The February 2022 issue contained this report about swapped buttons:
This article is a quick detour on an important enabler for agile systems engineering. “Digital transformation” means re-imagining the way businesses operate in the digital age, including how we engineer systems. As future articles discuss scaling agile practices to larger and more complex systems, it will be very helpful to understand the possibilities that digital engineering unlocks.
Digital engineering enables the agile revolution
The knee-jerk reaction to agile systems engineering is this: “sure, agile is great for the speed and flexibility of software development, but there’s just no way to apply it to hardware systems”. Objections range from development times to lead times to the cost of producing physical prototypes.