Risk Management for Large Systems

4 min read · Oct 27, 2020

(Part 1 of 3)

An enterprise's ability to stay current and competitive depends on keeping its largest systems agile over time. Research confirms that systems with few defects, and systems in which new changes are easy to test, are much cheaper to support and maintain. Achieving large-system agility should therefore be a huge win-win!

Framing decisions around the so-called “iron triangle” trade-off between cost, speed, and quality is misleading. It is incompatible with research showing that a quality-first mindset allows all three dimensions to improve simultaneously.

The key is adopting practices that actively promote modular system architectures and continuous change, and that drastically reduce risk via continuous testing and feedback.

“These reductions in schedules and costs, of course, are due to the fact that finding and fixing bugs has been the #1 software cost driver for over 50 years.” -Capers Jones

Modern software factories adopt practices proven for “continuous change” of applications. As these factory-like practices become more common, the next open question is how to pursue speed and agility for larger, more diverse efforts.

For example, how do multiple factories collaborate effectively?

Why do small systems grow into larger systems but fail to maintain their speed and agility?

Large existing systems become significantly more challenging and costly to improve and manage over time. A major factor in this trend is that as a system grows, its increasing number of complex dependencies and nuances requires more and more regression testing to ensure that changes do not introduce bugs. As a result, systems grow for a time but eventually become stagnant, overly complex, and very hard to enhance in meaningful ways.
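To make the link between dependencies and regression testing concrete, here is a minimal sketch. It assumes a made-up dependency map (the module names are hypothetical, not from any real system) and treats the regression scope of a change as the changed module plus everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: each module lists the modules it depends on.
DEPENDS_ON = {
    "billing": ["accounts", "ledger"],
    "reports": ["billing", "ledger"],
    "portal": ["accounts", "reports"],
    "accounts": [],
    "ledger": [],
}


def regression_scope(changed, depends_on):
    """Return the changed module plus every module that depends on it."""
    # Invert the map so we can walk from a module to its dependents.
    dependents = {name: [] for name in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(module)

    # Breadth-first walk over dependents, direct and indirect.
    scope = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in scope:
                scope.add(dependent)
                queue.append(dependent)
    return scope


for module in DEPENDS_ON:
    print(module, "->", sorted(regression_scope(module, DEPENDS_ON)))
```

In this toy map, a change to a leaf such as "portal" needs only that module retested, while a change to a widely shared module such as "ledger" pulls most of the system into the regression scope. Each new dependency on a shared module widens the scope of every future change to it.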

Respondents to a 2013 Forrester Research survey of IT leaders at more than 3,700 companies estimated that they spent an average of 72% of their budgets on just “keeping the lights on” functions. According to Gartner, operating costs have continued to increase: from 67% of IT budgets in 2013 to 71% in 2017. How long will it be before maintenance is all a system can afford to do?

We are already at that point for many US Government systems. In 2016 the US Government Accountability Office found that 5,233 of the government’s almost 7,000 IT projects were spending “all of their funds on operations and maintenance”. Clearly this is a strategic issue for a majority of large “systems of systems” today.

As “systems of systems” grow and dependencies rise, the factory approach breaks down at a certain point. This is because the minimal scope of change for such systems grows, and the corresponding risk of unintended defects grows with it. The success of a brittle system actually becomes its Achilles' heel, because defects and outages have a bigger and bigger impact merely due to growth. These risk factors (increasing Scope of Change and increasing Risk of instability) delay decisions and slow down change management.

Manual testing (e.g., weeks of regression testing) for risk avoidance is a common symptom of this situation. Delays from the risk factors slow down changes to the system in production, which creates a growing tech-debt gap. Tech debt, in this case, is the refactoring that has not been done: adopting more efficient or more appropriate technology and models. This tech-debt gap is usually noticed by users only when they compare the system to successful consumer technology.

The infrequent changes also grow larger because the same volume of change gets funneled into fewer releases per year. This creates a vicious cycle of slower, larger changes that seem increasingly risky.

This chart shows the phases and options for agility as systems grow and age. On the left, factory-built applications and subsystems can keep up with ecosystem growth for a time, but the rate of change of modules slows down as they get larger and integrated with slow-moving systems.

Factory-built software systems are easily changed. As these systems grow, modern practices allow teams to control the size and structure in order to sustain speed and agility over time. Besides agility benefits, keeping subsystems understandable, with low defects, manageable levels of complexity, and with clearly defined dependencies between subsystems is also necessary to achieve very low operational costs and low cost-to-change.

Faster change is achievable only with a focus on testing (high quality) and modularity (small units of isolated change) to drive lower overall cost-to-change. This is the truth missing from the misguided concept that Cost, Schedule, and Quality form an “Iron Triangle”.

Quality, Speed, and Costs become controllable by managing Scope of change and Risk of change.

Improved testing and improved modularity are required for the factory model to scale.

Testing reduces the Risk of Change because it increases awareness of defects and quality. Frequent change limits the growth in Scope of Change, which reduces risk further!
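The interplay between testing and change frequency can be illustrated with a back-of-the-envelope model. This is a sketch under strong assumptions, not a measured result: every change is assumed to introduce a defect independently with the same probability, tests are assumed to catch a fixed fraction of those defects, and all names and numbers below are hypothetical.

```python
def release_risk(changes_per_year, releases_per_year, defect_rate, test_catch_rate):
    """Probability that a single release ships at least one defect.

    Assumes each change independently introduces a defect with probability
    `defect_rate`, and that testing catches a fraction `test_catch_rate`
    of those defects before release.
    """
    # Scope of Change: the same yearly volume is split across the releases.
    batch_size = changes_per_year / releases_per_year
    # Chance that one change ships a defect that testing did not catch.
    escaped = defect_rate * (1.0 - test_catch_rate)
    # At least one escaped defect among `batch_size` independent changes.
    return 1.0 - (1.0 - escaped) ** batch_size


# Hypothetical scenarios: 240 changes a year, 5% of changes introduce a defect.
for releases in (4, 12, 52):
    for catch_rate in (0.0, 0.5, 0.9):
        risk = release_risk(240, releases, 0.05, catch_rate)
        print(f"{releases:>2} releases/year, tests catch {catch_rate:.0%}: "
              f"risk per release = {risk:.2f}")
```

Under these assumptions, the risk carried by any single release falls both when tests catch more defects and when the same work is split into more, smaller releases. The model deliberately ignores other effects, such as how much easier a defect is to locate in a small change set.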

Therefore, Quality, Speed and Costs actually improve with continuous testing feedback and frequent change.

In the next article we’ll focus on Coordinating Structures and how coordinated changes keep developers, product managers, and stakeholders grounded in a common understanding of the system. System change coordination and testability can simplify change if we use these techniques to proactively manage the Scope of Change, Impact of Change, and Risk of Change.

Written by Matt Gunter