Improve Steps: 1, 2, 3, 4

Matt Gunter
9 min read · Jan 21, 2021

… Only then Increase Speed

Groundwork Required before Speeding up Releases

An excellent way to learn more about Improving Software Delivery Speed is to walk through a sequential example. This guide provides a step-by-step roadmap that “lays the necessary groundwork” for success. As we go, we’ll see that the sequencing of steps can either support or undermine further progress. We will make the case for the following sequence:
1. Improve Control
2. Reduce Defects
3. Improve Testing
4. Increase Tolerance for Disruption

Virtually every software system that undergoes repeated changes can benefit from additional Speed and Productivity. If you are releasing software less frequently than weekly, this guide may help. There is a huge opportunity to make software changes faster: Patching, Upgrades, and New Features can be 2–4 times faster and more productive. That does not mean, however, that end-to-end Speed and Full Automation should be the first priority.

Make Improvements in the Right Order…

In the real world, organizations have complex, brittle systems, and improvement steps should be thoughtfully directed toward long-term sustainable changes that generate incremental leverage for the next improvement step. This is continuous, incremental improvement (Lean refers to this as Kaizen)!

Just as standardization “locks in” recent improvements (as shown in the above graphic), the idea of Increasing the Granularity of Control and Reducing Uncertainty BEFORE Reducing Cycle Times can be an important consideration. Automating and speeding up change too soon, for example, can be counterproductive if the following “Four D’s” are not in place and standardized:

1. Improve the Degree of control (over test-environment configuration and provisioning)
2. Reduce code Defects (through comprehensive automated testing)
3. Cultivate testing Discipline (so that defects are found early, when the size of the change is small)
4. Increase Tolerance for Disruption (in order to de-risk production system failures)

Why are these steps and this particular ordering important? The ordering manages the second-order complexities that, if present, work to undermine frequent, efficient release cycles. For example, in order to write and execute tests efficiently, environment provisioning needs to be automated, repeatable, and under significant developer control via an API or CLI.

Likewise, good test automation should exist before making the significant architectural or operational changes required to increase robustness and tolerance for Disruption. This is because testing capabilities and test-automation practices form a safety net and become the “poka-yoke” that makes changes “mistake-proof”.

Without this “4D foundation” of control and standardization, any change will create a degree of uncertainty, delay, rework, and risk of disruption. Often there is resistance across the broader organization (security, compliance, finance, brand leadership, etc.) to supporting a faster rate of change. The “4D foundation” can provide clarity and assurance that speeding up releases is “Safe”.

This advice is echoed by the Lean “House” metaphor which says success depends on the strength of the pillars, foundation, and roof. If any aspect of the House fails, success cannot be achieved.

adapted from: https://www.researchgate.net/figure/The-house-of-lean-production-in-the-context-of-the-literature-review-representing-a-lean_fig4_228433776

1. Improve the Degree of Control over Environments

Developers use environments for various types of feedback while they develop. Historically, these environments have mostly been shared, static environments. To improve development stability and flow, however, environments must be based on a versioned set of artifacts, significantly standardized, isolated, and made available On-Demand via a Self-Service mechanism. This increase in “Degree of Control” puts developers in a position to receive faster, higher-quality feedback than was possible with a shared, static environment.

Because of the increased clarity, repeatability, and reconfigurability, developers can ask more questions and get more feedback. This increase in certainty is the primary goal. For example, configuration changes can be made to an environment in a controlled way without affecting other developers, since environments are isolated. A range of changes can be tested independently, in parallel. Once tested, the new code and any configuration changes can be assembled into a clearly defined “release” for “assured repeatability” and additional testing downstream by other groups.
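The self-service model described above can be sketched in miniature. This is an illustrative sketch only, assuming a made-up API: the `Release`, `Environment`, and `SelfServiceProvisioner` names are hypothetical, not any real platform’s interface.

```python
from dataclasses import dataclass, field
import uuid

# Sketch: a versioned "release" of artifacts plus on-demand, isolated
# environments keyed by that release. All names here are illustrative.

@dataclass(frozen=True)
class Release:
    version: str
    artifacts: tuple            # e.g. ("app:1.4.2", "config:sha-9f3c")

@dataclass
class Environment:
    env_id: str
    release: Release
    overrides: dict = field(default_factory=dict)  # per-env config changes

class SelfServiceProvisioner:
    """Developers call provision() via a CLI/API; each env is isolated."""
    def __init__(self):
        self._envs = {}

    def provision(self, release: Release, **overrides) -> Environment:
        env = Environment(env_id=uuid.uuid4().hex[:8], release=release,
                          overrides=overrides)
        self._envs[env.env_id] = env
        return env

    def teardown(self, env_id: str) -> None:
        self._envs.pop(env_id, None)

release = Release("1.4.2", ("app:1.4.2", "config:sha-9f3c"))
svc = SelfServiceProvisioner()
a = svc.provision(release, feature_flag_x=True)  # a's override...
b = svc.provision(release)                       # ...does not affect b
```

The key properties of the sketch mirror the text: environments are built from a versioned artifact set, created on demand, and isolated so one developer’s configuration change cannot leak into another’s environment.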

Why do we say “Degree of Control”? Because “Full Control” is not optimal. According to the Lean principle of “Flow”, resources should be delivered just-in-time, without wasted effort. Finding the best “Degree of Control” for developers will be an iterative process, and “Sufficient Operator Control” must also be preserved in order to ensure security, manage resource consumption, and enforce standardization. Cloud Foundry’s model of tenancy, self-service, buildpacks, and a marketplace of services is a great example to follow. Kubernetes is also a popular option, but it imposes a significant learning curve on developers and doesn’t have buildpacks or a marketplace out of the box. Another consideration is that one approach won’t work for all developers and system architectures, so some diversity in approaches will likely evolve.

Starting with environment provisioning and improving the individual developer’s “Degree of Control” sets the stage for broader adoption of test automation and for selecting the various testing practices that accelerate finding and removing Defects.

2. Reduce Code Defects

For most improvement scenarios, the next priority should be ensuring that Quality is stable. What does this mean? It means every type of defect is detected quickly and resolved quickly. Defects that exist for prolonged periods of time destabilize Quality and create rework, uncertainty, and other waste.

Ideally, test automation is built out that can identify any sort of departure from expectations. When done well, this means that code changes are tested comprehensively and have predictable results and almost zero defects. The reason this is so important is that Quality has a major impact on Speed and Cost. If Quality is not stable, we can’t change quickly and cheaply with any level of certainty. (Note: The so-called iron triangle of cost-quality-speed is wrong and should be replaced by the more accurate relationship shown in the chart on the right below.)
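To make “identifying any departure from expectations” concrete, here is a minimal sketch of automated checks: each test encodes one expectation, and a failing assertion is an immediately visible defect. The `apply_discount` function and its tests are hypothetical examples, not code from any real system.

```python
# Hypothetical function under test.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# Each test encodes one expectation; any departure fails loudly and early.
def test_typical_discount():
    assert apply_discount(100.0, 15) == 85.0

def test_rejects_bad_input():
    try:
        apply_discount(100.0, 150)
        assert False, "expected ValueError for out-of-range percent"
    except ValueError:
        pass
```

Run comprehensively on every change (e.g. via a test runner in CI), a suite like this is what makes results predictable: a change either preserves every recorded expectation or surfaces a defect while the change is still small.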

The de facto approach for gaining fast defect visibility is “Continuous Integration”. It is often combined, as needed, with other techniques like Test-Driven Development, Pair Programming, and Chaos Testing.
For a more detailed view of different test practices, see this whitepaper or the chart below (by @alexsotob):

The Test Pyramid is Evolving with many New Techniques emerging.

The limitations of pre-built, static environments and complex production requirements have historically resulted in the concept of a “Test Pyramid”, with many small unit tests running easily at the bottom and far fewer expensive system tests and manual tests executed at the pyramid’s top.

adapted from: https://codingjourneyman.com/2014/09/24/the-clean-coder-testing-strategies/

The pyramid has been an important consideration due to the large difference in how efficiently unit tests could be executed versus integration tests. This difference in performance is no longer a given. Containerization, technologies like Kubernetes, and testing tools like garden.io are closing the gap and allowing different testing approaches to work at all phases of the SDLC. The main point is that improving defect visibility must be a high priority for creating speed and agility for software developers. “Surprise defects” are unacceptable because they create Uncertainty and Excessive Delay that can undermine trust and progress.

As this test automation effort gains traction, we begin to gain a clear picture of how complex, brittle, or fragile our codebase is. If developers still struggle to make changes in a predictable timeframe, then changes may need to be made to the Code’s Structure and Architecture to improve change-impact isolation and reduce coupling.

3. Cultivate Testing Discipline

To incorporate automated tests into everyday work, development teams must be able to build Continuous Integration pipelines and optimize when those pipelines are triggered, which tests are run, and in what order, depending on the type and scope of the changes being made at the time. As testing options grow, running all of them on every change becomes unreasonable.
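One common shape for this optimization is path-based test selection: the pipeline maps the files a change touches to the suites worth running. The sketch below is hypothetical, assuming made-up path prefixes and suite names, but it shows the idea of tailoring test scope to change scope.

```python
# Hypothetical rules mapping changed-path prefixes to CI test suites.
SUITE_RULES = [
    ("src/api/",     ["unit", "contract", "integration"]),
    ("src/billing/", ["unit", "billing-regression"]),
    ("docs/",        []),               # doc-only changes skip tests
]
DEFAULT_SUITES = ["unit", "integration", "system"]  # unknown scope: run broadly

def select_suites(changed_paths):
    """Return an ordered, de-duplicated list of suites for this change."""
    suites = []
    for path in changed_paths:
        for prefix, rule_suites in SUITE_RULES:
            if path.startswith(prefix):
                suites.extend(rule_suites)
                break
        else:
            # Path matched no rule: be conservative and run the defaults.
            suites.extend(DEFAULT_SUITES)
    return list(dict.fromkeys(suites))  # dedupe while preserving order

print(select_suites(["docs/readme.md"]))     # []
print(select_suites(["src/api/routes.py"]))  # ['unit', 'contract', 'integration']
```

Note the conservative default: when the pipeline cannot classify a change, it runs the broad suites rather than skipping tests, which keeps the discipline safe while the rule set matures.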

Having a degree of control over environments and a growing catalog of automated tests puts developers in a good position to learn and achieve this shared Discipline around testing.

In cases where there are multiple, geographically dispersed teams, mechanisms must be put in place to facilitate smooth sharing of the versioning and testing metadata for all dependencies. For example, when a shared subsystem is versioned, the test history should be available to the teams consuming that subsystem so they can judge the risk of including that particular version into their next release. Likewise, if a consuming team is testing against a new subsystem version, the test results should flow back to the subsystem team for visibility.
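A minimal sketch of such a mechanism, assuming hypothetical names throughout: test results are recorded against a (subsystem, version) key, so a producing team’s results and a consuming team’s results accumulate in one shared history that either side can inspect.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class TestResult:
    suite: str
    passed: bool
    reported_by: str    # which team ran the tests

class TestMetadataStore:
    """Shared store of test history keyed by (subsystem, version)."""
    def __init__(self):
        self._results = defaultdict(list)

    def record(self, subsystem: str, version: str, result: TestResult):
        self._results[(subsystem, version)].append(result)

    def history(self, subsystem: str, version: str):
        return list(self._results[(subsystem, version)])

store = TestMetadataStore()
# The subsystem team publishes results alongside the version...
store.record("auth-lib", "2.1.0", TestResult("unit", True, "auth-team"))
# ...and a consuming team's results flow back for shared visibility.
store.record("auth-lib", "2.1.0", TestResult("integration", False, "checkout-team"))

# A consumer judging the risk of adopting version 2.1.0:
risky = any(not r.passed for r in store.history("auth-lib", "2.1.0"))
```

In practice this store would be a shared service or artifact-repository metadata rather than an in-memory dict, but the flow is the same: results travel with the version, in both directions.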

These practices will take time to develop, but they lay the foundation for frequently deploying small changes from many collaborating but interdependent dev teams!

4. Increase Tolerance for Disruption

So far we have focused only on pre-prod. While the above improvement steps are underway, changes made to the Production system are still likely to be very slow. This is because the process for changing the Production system is often not controlled by development teams. Production changes are controlled by teams in other org silos: the teams involved in security, change management, assuring compliance, and running the operations of the Production system.

To speed up the rate of change to production, we need more than just developers creating releases faster. We also need the risk of disrupting the production system to be much lower. Reducing the risk of disruption involves implementing a range of techniques such as blue-green deploys, canaries with smoke tests, circuit breakers, periodic health-checking that restarts failed processes, and quotas that limit resource usage. If these mechanisms are in place, then changes to production don’t have to have perfect operational behavior. The system is able to accommodate some disruption and exhibit a higher level of tolerance.
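To make one of these mechanisms concrete, here is a minimal sketch of a circuit breaker: after a threshold of consecutive failures it “opens” and fails fast instead of calling the unhealthy dependency, then allows a trial call after a cooldown. The class and parameters are illustrative, not a production implementation.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors."""
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds before a trial call
        self.failures = 0
        self.opened_at = None               # None means closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, permit one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                   # success resets the count
        return result

cb = CircuitBreaker(failure_threshold=2, reset_after=60.0)
def flaky():
    raise IOError("dependency down")

for _ in range(2):                          # two failures trip the breaker
    try:
        cb.call(flaky)
    except IOError:
        pass
```

This is exactly the tolerance the section describes: a misbehaving component is contained quickly, so a change does not need perfect operational behavior for the overall system to stay up.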

With frequent releases coming from developers, plus the de-risking of change due to higher tolerances of disruption, the gauntlet to Production can now be optimized (and negotiated) to better match the rate of change achieved by development.

Below, are a few additional planning tips for leaders considering this journey:

  1. Choose an Appropriate System. The ideal System will be one that has:
    - a history of slow, infrequent releases.
    - preferably, a variation in the length of time between releases.
    - a strategic, long-term rationale for speeding up change cycles.
    - staff that are engaged, equipped, and motivated to make these types of improvements.
    - a number of opportunities on the horizon to test, safely fail, and learn from early efforts.
    - a future-state model or architecture of what success looks like.
  2. Identify the most relevant components, automated tests, and new capabilities needed to implement the changes envisioned in #1 above. Also, create an aggregate view with a:
    - Description of component, test, or capability (what is needed?)
    - Cost in terms of Time (how long will each item take to complete?)
    - An overall program timeline that provides a rough-order-of-magnitude (ROM) view of expected Costs, Time, Skills, and Effort.
  3. Identify limitations or risks. What technical, procedural, or cultural factors or policies exist that would prevent teams from achieving the future state model? For each limitation or risk:
    - How can it be overcome or mitigated?
    - Who may be able to change a policy and is there a sound rationale?
    - Can the existing limitation or risk be accommodated or controlled without taking on unacceptable effort?
    - For limitations or risks that cannot be controlled, removed, or mitigated, what is the impact? Are they “showstoppers”?
  4. Identify a sequence of steps and a reasonable timeline for putting the Four D’s in place, Modernizing the System’s internal Architecture as needed, and achieving the Future State Architecture.

This is your initial “Roadmap”.

Conclusion

The benefits of putting the above improvements in place can be astonishing. Some development shops achieve a 4X improvement in productivity and a 6X improvement in lead time for changes.

However, for the progress to get this far, leadership must be actively involved. (Leadership is the “Roof” of the Lean House mentioned above.)

It may be valuable to revisit this planning guide (or the cheat-sheet below) periodically and update your roadmap with progress and revised insights.

Recap of the Four D’s with Detailed Description

Best of luck with your Lean Software Transformation!

(For more details on De-Risking the adoption of Lean Practices, see: Implementing Lean Practices: Managing the Transformation Risks)
