Why Large Systems Need “Coordinating Structures”
(Part 2 of 3)
In Part 1, we covered the challenges systems face as they scale up in size and emphasized how the myth of “the iron triangle” obscures the path to software productivity and agility. Research, as well as ample anecdotal evidence, shows that defects undermine the Speed, Cost, and Quality of software development. Over time, many large systems can no longer be changed without extensive regression testing, high risk, and high cost. This problem of systems becoming brittle, hard to change, and hard to replace is well documented across many industries. However, systems can maintain agility if they are structured and managed with a focus on low-risk, low-defect change.
In a 2018 CIO survey, Deloitte observed that core agility is a high priority for CIOs: ‘Many CIOs recognize that their legacy systems lack the agility needed to innovate and scale. Most (64 percent) are rolling out next-generation ERP or modernizing legacy platforms to address the limitations of existing systems. “If you don’t have high-performing infrastructure, then forget about projects and initiatives,” says Mike Tartakovsky, CIO of the US National Institute of Allergy and Infectious Diseases. “No one will trust you without solid infrastructure. That’s the core.”’
As systems get older and larger, and the user community expands to more teams and organizations, they become more critical to business operations and riskier to change. Enterprises need to anticipate this and compensate with structures and practices that control change costs and limit risk, so that change can be managed efficiently and comprehensively.
Agile and Extreme Programming practices work well at smaller scales, where the application architecture and operating model are defined and managed by cross-functional, mostly autonomous teams. These teams use such techniques (e.g. Agile, Extreme Programming, CI) to prioritize defect awareness and maintain helpful modularity (e.g. frequent refactoring of the codebase) in order to sustain “high velocity” and “low cost of change”. Using this approach on large systems, or to build a “system of systems”, means that teams have to find ways to coordinate each other’s concurrent changes. Even Enterprises that succeed with Agile development struggle to scale the approach to include multiple teams and governance organizations. The problem is worst in the largest Enterprises because they have the most silos and specialized teams controlling different aspects of the large system. Teams that manage databases, networks, security, business continuity, and change management are focused on economies of scale and stability, and they all need a seat at the table when change happens. This second-order need to “coordinate changes” therefore makes large enterprise systems exponentially harder and slower to change than smaller systems.
The word “exponential” here is not hyperbole. Systems are networks, and when systems scale up, their network effects are literally exponential.
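One rough way to see the growth rates involved is to count coordination paths. The sketch below is an illustration added under stated assumptions, not a model taken from this series: it treats each team or component as a node, counts pairwise links as n(n-1)/2, which grows quadratically, and counts the possible groups of two or more nodes as 2^n - n - 1, which grows exponentially. The function names are hypothetical.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def possible_groups(n: int) -> int:
    """Number of subsets with at least two members: 2**n - n - 1."""
    return 2 ** n - n - 1


if __name__ == "__main__":
    # Compare how the two counts grow as more teams join the network.
    for n in (2, 5, 10, 20):
        print(f"{n:>3} nodes: {pairwise_links(n):>4} pairs, "
              f"{possible_groups(n):>8} possible groups")
```

Which count matters in practice depends on how many teams a given change actually touches. If changes routinely involve arbitrary combinations of teams, the exponential count is the relevant one.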
Focusing on “Network Effects”
Network Effects are distinct from Economies of Scale because they are based on a growing structure, not just on volume. As Networks take shape, they affect cost, speed, and quality through the way their nodes are connected, not just through the number of nodes.
The fact that an Enterprise can negotiate better contracts with vendors because of higher volumes does not mean that it has a Network Effect, just an Economy of Scale. If, instead, it standardizes server configurations and moves to on-demand, self-service deployments, that creates a Network Effect. The Network in this case is two-sided. Developers experience less delay, with more options and control over deployments, while Operators see lower costs per deployment and easier maintenance. This two-sided network creates a Win-Win for both sides. If there is a downside, or a so-called “Trade-off”, it is that this approach requires structure: standardization, automation, and, most importantly, coordination. Maintaining the Win-Win network effect requires Continuous Coordination across teams, which, as we will discuss below, requires a system of Coordinating Structures.
Introducing “Coordinating Structures”
Some large systems, like the Internet, easily coordinate changes that affect multiple stakeholders and maintain a small “cost to change” precisely because of well-designed, stable “Coordinating Structures”. A major opportunity for Enterprises lies in recognizing the network effects associated with their technical and organizational structures, which determine whether they exploit exponential effects or are painted into a corner by them.
There are many types of “Coordinating Structures”. The Internet’s Domain Name System (DNS) is a familiar coordinating structure that allows IP addresses to be changed independently of the Domain and Hostname. This supports relocating servers and websites. Another example is the Structured Query Language (SQL), which is used to define the criteria for a database query but leaves the choice of processing algorithm up to the database (and allows DBAs to influence execution through tunable parameters, indexes, optimizer configuration, and hints). For example, a SQL statement asks the database for a few rows from a large table, and the database can be transparently tuned to pick the most efficient of several processing options for that query. SQL simultaneously supports developer productivity, operational control, and portability. Cloud-Native Buildpacks are a more modern coordinating structure that lets operators and developers coordinate their shared control over the containerization process, giving developers flexibility while preserving standardization and control for operators. Below is a short list of examples and the coordination advantage they provide.
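To make the SQL example above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It rests on assumptions: the `orders` table, its columns, the index name, and the `query_plan` helper are all hypothetical, and SQLite simply stands in for whatever database an Enterprise actually runs. The developer's query text never changes. Only the operator's index does, and the database picks a different execution plan on its own.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

# The developer's query: it states which rows are wanted, not how to find them.
QUERY = "SELECT id, amount FROM orders WHERE customer_id = ? ORDER BY id"
PARAMS = (42,)

plan_before = query_plan(conn, QUERY, PARAMS)
rows_before = conn.execute(QUERY, PARAMS).fetchall()

# The operator's change: add an index. The query text is untouched.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan_after = query_plan(conn, QUERY, PARAMS)
rows_after = conn.execute(QUERY, PARAMS).fetchall()

print("plan before:", plan_before)
print("plan after: ", plan_after)
print("same rows returned:", rows_before == rows_after)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than assuming it. The point is only that the plan mentions the new index after the operator's change while the returned rows stay identical.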
Ideally, with large systems, we want the rate of change to scale up with size, not slow down. We want the system to grow while its “resilience” and “change-ability” also improve. This win-win of size and rate of change is exemplified by the Internet. It gracefully handles many types of network requirements, link failures, and constant parallel, decoupled evolution. The well-defined, “auto-coordinating” structures of the Internet allow new teams to stand up websites and thousands of existing teams to make changes without contention or delay. This is the win-win of increasing size, higher rate of change, and autonomy that we need for our systems and teams as they grow.
Unfortunately, such win-win growth dynamics are not what we see in most Enterprise systems. Enterprises tend to undermine their systems’ agility by increasing manual, human-based coordination and relying on org structures with overlapping, unclear, uncoordinated responsibilities. This quickly results in a culture of meetings and emails rather than efficiency. For example, one silo may be responsible for change management, another for security, and still others for business continuity or government compliance. None is responsible for coordinating their overlapping efforts. Most Enterprises struggle to achieve even manual or semi-manual coordination. Completely automatic coordination is virtually non-existent in the Enterprise. A patchwork of pipelines, Confluence wikis, and SD-Elements surveys is as close as most get to “auto-coordination”.
In an insightful post, Jan Bosch describes how many Enterprises focus their efforts first on establishing Org structures and treat Processes and Technical Architecture as secondary concerns, which he explains is a very problematic order of priorities. Instead, the Business plan should drive the Technical Architecture, which should in turn drive Processes and then Organization. He simplifies this dichotomy as “BAPO vs. OPAB”.
The problematic OPAB model creates a vicious cycle in which the complexity of changes keeps increasing, which slows the rate of change as the system scales. Ultimately, this cycle results in System Stagnation and a general inability to innovate for a large majority of Enterprise systems.
Prioritizing Technical Architecture and “coordinating structures” lowers the Cost-of-change through better coordination, and it therefore becomes the most important design focus for maintaining agility as systems get larger. Systems that cannot remain agile will inevitably become rigid and fragile, and they will ultimately be replaced in one of two ways:
- If Cost-of-replacement drops enough, Enterprises can take the easy path and replace the whole obsolete system. Everyone knows that replacing laptops and phones is frequently cheaper than staying with out-of-date hardware. This dynamic is also a primary reason why customers abandon existing supplier relationships: it is easier and cheaper to just take another option.
- If Cost-of-change rises enough, it prevents adaptation, reduces options for improvement, and creates an “opportunity desert”, the exact opposite of what Enterprises are built for. The Enterprise itself then becomes vulnerable to competition and is left at the mercy of customer loyalty.
In the next and final post in this series, we focus on the convergence of the Quality First paradigm with the “Coordinating Structures” concept and how systems can grow larger, change faster, and be more reliable.