Defect Rate, Reliability, System Size, and Limits on Growth

Why are most legacy systems so hard to change? Why are teams able to build systems quickly when they are small and isolated, but not when they are large and interconnected? What does this mean for managing defects (and assumptions) for Cloud-Native & “Death Star” Architectures like these:

To find out more about how system growth rate, system size, and development costs are affected by defect rates, I built an Agent-Based Model in NetLogo which simulates two development teams “building” and “deploying” modules to a common, interconnected System.

The System is modeled as a “network of modules”, and adding or repairing a module in the network has a small chance of causing a neighboring module to fail. (Sound familiar?)

NetLogo Model comparing two Dev Teams, each with a different “Module Defect Rate”, building a Software System together.

By running experiments on this model, I was able to validate the following causal relationships:

1. Higher defect rates mean higher “Module Costs” and lower “Team Productivity”

2. Teams with higher defect rates don’t just slow themselves down; they slow other teams down too.

3. The maximum size of a system that can be built is limited by the Defect Rates of its modules.

4. Very small increases in a team’s Defect-Rates can cause their “Module Costs” to drastically increase.

While these insights have been known in the industry for decades, directly measuring these effects in real-life Software modules and systems is a big challenge.

“Poor measurement practices have led to the fact that a majority of companies do not know that achieving high levels of software quality will shorten schedules and lower costs at the same time. But testing alone is insufficient. A synergistic combination of defect prevention, pre-test defect removal, and formal testing using mathematical methods all need to be part of the quality technology stack.” Capers Jones ( https://www.ifpug.org/content/documents/Jones-SoftwareDefectOriginsAndRemovalMethodsDraft5.pdf )

One way to think about the measurement challenge is to list all the micro-assumptions Developers, Users, and Operators make as they define, build, use, and operate a system. Each assumption must always remain “true”, or else the result is a Defect. For systems that are growing and changing frequently, the assumptions fail from time to time and this is measured as the “Defect Rate”.

Here are a few common “micro-assumptions” that can lead to Defects:

  1. Assuming a divisor “should” never be zero
  2. Assuming data received from external systems is “clean/predictable”.
  3. Assuming code is correct because a Testcase is “Passing”
  4. Assuming code is robust because a Negative Testcase is “Passing”
  5. Assumptions about User behavior or knowledge
  6. Assumptions about Timezones, Date Formats, Daylight Saving Time, and leap years.
  7. Assumptions about performance and latency and timeouts.
  8. Assumptions about threads of Simultaneous work being “thread-safe”.
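As a small illustration of assumptions 1 and 3, here is a minimal Python sketch. The function names are hypothetical and chosen only for this example. It shows a testcase that passes for both versions of a function even though only one of them handles a zero divisor.

```python
from typing import Optional


def defect_rate_naive(defects: int, modules: int) -> float:
    # Hidden micro-assumption: modules is never zero.
    return defects / modules


def defect_rate_explicit(defects: int, modules: int) -> Optional[float]:
    # The assumption is now visible: an empty system has no defined rate.
    if modules == 0:
        return None
    return defects / modules


# This check passes for both versions, so a "Passing" testcase alone
# says nothing about whether the zero-divisor assumption holds.
assert defect_rate_naive(5, 200) == defect_rate_explicit(5, 200) == 0.025
```

The naive version only fails once some caller passes `modules = 0`, which is exactly the kind of assumption that breaks as a system grows and changes.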

Defect Rate changes across the Software Lifecycle

If we plot the defects by phase — as they are detected — we get a Rayleigh curve:

The phases other than ST and GA in the figure are: high-level design review (I0), low-level design review (I1), code inspection (I2), unit test (UT), and component test (CT). Given the defect removal pattern up through system test (ST), it is possible to estimate the defect rate when the product finally ships: the post general-availability phase (GA) in the figure. In this example, the X-axis is the development phase, and the defect rate at GA is low but not zero.

As described in the figure’s caption, the Rayleigh curve is sometimes used to predict a “Final Defect Rate” based on the number of Defects Detected in earlier phases and then projecting forward via the slope of the curve. This technique works when the “Defect Removal Efficiency” for each phase has a track-record of consistency. This consistency is reflected in the model by assigning to each module the defect rate of the team who built it.
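The projection idea can be sketched in a few lines of Python. This is one common formulation of a Rayleigh model, not necessarily the one behind the figure, and it rests on two assumptions: each phase is treated as one unit of time, and the curve is fitted by a simple grid search. The input data is synthetic, generated from the model itself, so the fit has a known answer to recover.

```python
import math


def rayleigh_cdf(t: float, peak: float) -> float:
    """Fraction of all defects found by time t; detection peaks at t = peak."""
    return 1.0 - math.exp(-(t * t) / (2.0 * peak * peak))


def phase_fractions(n_phases: int, peak: float) -> list[float]:
    """Fraction of all defects expected in each of the phases 1..n_phases."""
    return [rayleigh_cdf(i + 1, peak) - rayleigh_cdf(i, peak)
            for i in range(n_phases)]


def fit_rayleigh(observed: list[float]) -> tuple[float, float]:
    """Return (estimated total defects, peak) by least squares.

    The peak is grid-searched; for a fixed peak, the best total has a
    closed-form least-squares solution.
    """
    best = (float("inf"), 0.0, 0.0)
    for step in range(50, 1001):  # peak from 0.50 to 10.00 phases
        peak = step / 100.0
        frac = phase_fractions(len(observed), peak)
        total = (sum(o * f for o, f in zip(observed, frac))
                 / sum(f * f for f in frac))
        err = sum((o - total * f) ** 2 for o, f in zip(observed, frac))
        if err < best[0]:
            best = (err, total, peak)
    return best[1], best[2]


# Synthetic counts for I0, I1, I2, UT, CT, ST (total = 20, peak = 2.5).
observed = [20.0 * f for f in phase_fractions(6, 2.5)]
total, peak = fit_rayleigh(observed)
latent_at_ga = total - sum(observed)
print(f"estimated total: {total:.2f}, peak: {peak:.2f}, "
      f"latent at GA: {latent_at_ga:.2f}")
```

With real data the fit is only as trustworthy as the consistency of each phase’s Defect Removal Efficiency, which is the caveat made above.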

The Defects vs. Growth NetLogo Model

The NetLogo Model is designed to represent software systems that are developed by distributed teams that create modular components. The teams build the various modules and the “touchpoints” between modules create a network of linkages across the larger system.

Experimenting with the Model makes it possible to study important causal factors and to provide guidance on technology architecture, team coordination, and defect levels.

Because of the fairly well-known correlation between Team Practices and Defect Rates (e.g., Economics of Software Quality by Capers Jones), we assign all modules produced by the same Team a single Defect Rate that represents the result of the Team’s Processes & Practices. In reality, this will be somewhat variable, but that release-to-release variability is not currently a concern for this model.

The consequences of component defect levels show up primarily by affecting the following:
- Team Productivity (Actual number of ticks per module),
- System size (Total Modules), and
- Time-To-Repair (As modules are created and some fail, the failures can queue up waiting for the team to repair them.)

Other factors that can be explored in this model include:
- Network “structure” characteristics (e.g. Random, Small-World, Lattice)
- Cost-benefit of upfront testing vs backend repairs
- Impact of a low-quality team on the productivity of a high-quality team.

There are several possibilities for extending this model. Here are a few ideas:

  1. The model could explore how higher defect rates might be mitigated with very intentional network designs (vs randomly structured networks)
  2. The model could be better calibrated or actually use real defect data from a real-world system. This would make the Model more accurate and relevant.
  3. There is currently no coordination of “feature release” between Team 1 and Team 2. Modules are independent. However, including a super agent called “Feature” that represented the sum of multiple modules, created by different teams, would highlight a larger degree of productivity coupling between Teams. Hypothetically, this would show an even bigger “productivity win” for quality improvements.

Example Results (see the Appendix for a detailed model description)

The baseline run (Scenario 1) shows both teams working at their “Design Velocity” but without any Defects to deal with. Team1 has a configured “extra testing cost factor” of 0.5, which is why the Team1 “Cost per Module” is higher.

Scenario 2 shows Team1 with a very low defect rate of 0.01 while Team2 has a higher defect rate of 0.025. The simulation stops when there are 225 healthy modules. Clearly, Team2 was struggling: its “cost per module” was going through the roof. Team2 was only able to build 70 modules (vs 200 for Team1) and 30 of those were unhealthy at the end of the simulation.

In scenario 2b, we run the same model configuration several times to understand how much a Team with a high defect rate can slow down a Team with a lower defect rate. As the figure shows, Team2 spent the majority of its time repairing failed modules in all three runs. However, Team1 modules were not impacted by failures in Team2 modules at all in run (z). For run (z), we can see that Team1’s Max TTR is zero.

(Note: Ticks are the unit of time in the simulation. Cost is calculated as ‘Ticks/Module’ and the ‘Max TTR’ displays the max time-to-repair for any of the team’s modules.)

With scenario 3 we find that above a certain defect rate, the system growth is capped and costs spiral out of control. This seems dramatic, but most projects fail or stop being enhanced well before they get to this zone. (This also may cause you to wonder about the relationship between Cost, Speed, and Quality. If so, check out my older article: Risk Management for Large Systems for more on that topic.)

I hope you found this interesting and I hope to expand this type of research and continue to share more insights! (If you are interested in trying this model yourself, please reach out.)

Appendix: Model Logic and Miscellaneous Details

The Model uses two types of Agents: Team & Module
The goal of each run is for the Two Teams (Blue and Green) to build a “System” with 225 modules.

Modules: Modules are building blocks and combine with each other to form a single “System”. How they combine is based on the type of Network selected. The Model supports the following Network types:
“Random”
“Preferential”
“Strogatz”
“Small-world”
“Lattice”

Modules are created by Teams at a steady rate as long as none of the Team’s modules in the system have failed. Each module inherits its Defect-Rate from the Team that created it. Modules also track their own lifecycle and can fail based on internal or external changes affecting the System.

Teams: Teams have a “Creation Rate” property. Currently, this is hardcoded for all teams at 10 “ticks per module”. (For the current model version, having this fixed is helpful because we are interested in the effect of Quality and the Cost/Benefit-of-Quality on Productivity, and holding the base Creation Rate constant for both teams makes the data and model runs more comparable.)

Teams also have other properties related to Productivity and Quality. Teams have to decide their quality standards in terms of a Defect Rate.

In the model, the Defect Rate captures the average rate that failures will occur for the following reasons:

  1. Change-Failures: a new or repaired module might fail immediately (bad-fix) or cause a failure of another directly linked neighbor in the network. This “chance” occurs every time a “repair” is made to a module.
  2. Environment-Failures: an environment change causes a failure of a module that is on the outer edge of the System (module network). This “chance” occurs at each tick cycle. Because this failure scenario can overwhelm other failures, the current version arbitrarily “throttles” the number of edge nodes tested for failure per tick based on the size of the network.
    (3–13 nodes: 1, 13–50 nodes: 2, 50–150 nodes: 5, >150: 25)
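The throttle and the per-tick environment check could be sketched in Python roughly as follows. This is not the NetLogo source. The function names are hypothetical, and three details are assumptions: the listed ranges overlap at 13, 50, and 150, so the upper bound of each range is treated as inclusive; networks smaller than 3 nodes are given no checks; and the edge modules to test are picked at random. The failure test itself (a random float below the module’s Defect-Rate) follows the model description.

```python
import random


def edge_nodes_tested(network_size: int) -> int:
    """How many edge modules get an environment-failure check this tick."""
    if network_size < 3:
        return 0  # assumption: no value is given below 3 nodes
    if network_size <= 13:
        return 1
    if network_size <= 50:
        return 2
    if network_size <= 150:
        return 5
    return 25


def environment_failures(edge_defect_rates: list[float], network_size: int,
                         rng: random.Random) -> list[int]:
    """Return the indexes of the sampled edge modules that fail this tick."""
    k = min(edge_nodes_tested(network_size), len(edge_defect_rates))
    sampled = rng.sample(range(len(edge_defect_rates)), k)
    # A module fails when a random float falls below its own Defect-Rate.
    return [i for i in sampled if rng.random() < edge_defect_rates[i]]


rng = random.Random(42)
edge_rates = [0.01] * 8 + [0.025] * 4  # 12 edge modules in a 40-node network
print(environment_failures(edge_rates, network_size=40, rng=rng))
```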

Both Teams are “spending their ticks” to build, test, and repair modules based on the input parameters. As mentioned above, the creation rate is hardcoded at 10 ticks per module (0.1 modules per tick) for both Teams. The Teams are otherwise independent and can choose their own quality parameters:

Defect-Rate(0.0001–0.05) — Combines with a Random float value to simulate a probability distribution of failures. Failures occur when the Random Float is less than the module’s Defect-Rate for one module, or the subset of modules being tested.

Testing-Cost-Factor(0–2) — Defines the tick duration required for testing as a factor of the base creation rate of 10 ticks per module. For example, the actual Creation Rate is a function of Testing-Cost-Factor as follows:

  1. Testing-Cost-Factor = 0, Creation Rate = 10 ticks/module
  2. Testing-Cost-Factor = 1, Creation Rate = 20 ticks/module
  3. Testing-Cost-Factor = 2, Creation Rate = 30 ticks/module
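The three data points above fit a simple linear rule, which can be written as a short Python function. The linear form is an inference from those three points, and the function name is hypothetical.

```python
BASE_TICKS_PER_MODULE = 10  # hardcoded base creation rate in the model


def ticks_per_module(testing_cost_factor: float) -> float:
    """Ticks needed to create (and test) one module."""
    if not 0 <= testing_cost_factor <= 2:
        raise ValueError("Testing-Cost-Factor must be between 0 and 2")
    # Assumed linear rule, consistent with 0 -> 10, 1 -> 20, 2 -> 30.
    return BASE_TICKS_PER_MODULE * (1 + testing_cost_factor)


for factor in (0, 0.5, 1, 2):
    print(factor, ticks_per_module(factor))
```

Under this rule, the Testing-Cost-Factor of 0.5 used for Team1 in Scenario 1 would correspond to 15 ticks per module.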

Modules track their individual failure histories and repair status. Once a component is broken after deployment, it must be “Repaired” by the team that built it. Repairing a component takes the same effort as creating a new one, and Teams cannot create new components while they have Failed Components in the System.
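To make the “repair before you build” rule concrete, here is a heavily simplified Python sketch of a single team. It is not the NetLogo model. It leaves out the second team and Environment-Failures, it assumes each new module links to two randomly chosen existing modules, and it assumes the oldest failed module is repaired first. All names are hypothetical.

```python
import random


def simulate_team(defect_rate: float, ticks_per_module: int, total_ticks: int,
                  links_per_module: int = 2, seed: int = 0) -> dict:
    """One team building modules; repairs block new work."""
    rng = random.Random(seed)
    neighbors: list[list[int]] = []  # adjacency list, one entry per module
    failed: set[int] = set()
    # Each create or repair action costs the same number of ticks.
    for _ in range(total_ticks // ticks_per_module):
        if failed:
            target = min(failed)  # assumption: repair the oldest failure
            failed.discard(target)
        else:
            target = len(neighbors)  # create a new module
            k = min(links_per_module, target)
            links = rng.sample(range(target), k)
            neighbors.append(links)
            for n in links:
                neighbors[n].append(target)
        # Change-Failure: the touched module (bad fix) and each directly
        # linked neighbor may fail.
        for m in [target] + neighbors[target]:
            if rng.random() < defect_rate:
                failed.add(m)
    built = len(neighbors)
    return {
        "built": built,
        "healthy": built - len(failed),
        "cost_per_module": total_ticks / built if built else float("inf"),
    }


for rate in (0.0, 0.01, 0.025):
    print(rate, simulate_team(rate, ticks_per_module=10, total_ticks=2000))
```

Because every repair uses ticks that would otherwise create a module, the cost per module (Ticks/Module) can only stay at the base rate when nothing fails.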
