Loose coupling, architecture and team organization
This article discusses the relationship between code structure and organizational structure in software development. I discuss why software and teams do not scale easily, what lessons we can learn from nature and the Internet, and how we can reduce coupling in software and between teams to overcome the problems of scaling.
The article is based on my 20 years of experience building large software systems and on my impressions of the book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations” (Nicole Forsgren, Jez Humble and Gene Kim), which contains research data that backs up most of my claims here. I highly recommend the book.
Software and teams do not scale
Often the first release, perhaps written by one or two people, is surprisingly simple. It may have limited functionality, but it is written quickly and meets the customer's requirements. Interaction with the customer at this stage is excellent, because the customer is usually in direct contact with the developers. Bugs are fixed quickly, and new features can be added fairly painlessly. After a while, the pace slows down. Version 2.0 takes a little longer than expected. Bugs are harder to fix, and new features are no longer so easy to add. The natural response is to add new developers to the team, yet each additional person seems to reduce productivity. There is a feeling that the software atrophies as its complexity grows. In extreme cases, organizations find themselves running programs that are very expensive to maintain and almost impossible to change. The problem is that you do not need to make any “mistakes” for this to happen. It is so common that it can be called a “natural” property of software.
Why does this happen? There are two kinds of reasons: those related to the code and those related to the team. Neither code nor teams scale well.
As the code base grows, it becomes harder for one person to understand it. Human cognitive capacity has fixed limits: one person can keep the details of a small system in mind, but only until the system outgrows that capacity. Once a team grows to five or more people, it becomes almost impossible for any one person to know how all parts of the system work. And when no one understands the whole system, fear appears. In a large, tightly coupled system it is very difficult to predict the effect of any significant change, because the result is not localized. To minimize the impact of changes, developers start to use workarounds and code duplication instead of identifying common features and creating abstractions and generalizations. This complicates the system further and reinforces the negative trends. Developers no longer feel responsible for code they do not understand and are reluctant to refactor. Technical debt grows. Work also becomes unpleasant and unsatisfying, which stimulates a “talent drain”: the best developers, who can easily find work elsewhere, leave.
Teams also do not scale. As teams grow, communication becomes more complex. A simple formula comes into play:
c = n (n-1) / 2
(where n is the number of people, and c is the number of possible connections between team members)
| number of team members | number of possible connections |
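The formula above is easy to tabulate. The following is a minimal sketch; the function name and the team sizes are chosen only for illustration and do not come from the original table.

```python
def possible_connections(n: int) -> int:
    """Number of distinct pairs among n team members: c = n(n - 1) / 2."""
    if n < 0:
        raise ValueError("team size cannot be negative")
    return n * (n - 1) // 2


# Team sizes picked for illustration only.
for n in (1, 2, 5, 10, 20, 50, 100):
    print(f"{n:>4} members -> {possible_connections(n):>5} possible connections")
```

Doubling a team from 10 to 20 people raises the number of possible connections from 45 to 190, roughly a fourfold increase.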
As a team grows, its need for communication and coordination grows quadratically. Beyond a certain size it is very difficult for one team to remain a coherent whole, and the natural human tendency to split into smaller groups leads to informal subgroups forming, even if management plays no part in it. Peer-to-peer communication becomes harder and is naturally replaced by emerging leaders and top-down communication. Team members turn from equal stakeholders in the system into ordinary production workers. Motivation suffers, and the sense of ownership disappears because of diffusion of responsibility.
Management often steps in at this stage and formalizes the creation of new teams and management structures. But whether it happens formally or informally, large organizations find it difficult to maintain motivation and engagement.
These scaling pathologies are usually blamed on inexperienced developers and poor management. But this is unfair. Scaling issues are a “natural” property of growing and evolving software. They are what always happens unless you spot the problem early, recognize the inflection point, and make an effort to solve it. New software development teams are constantly being created, the amount of software in the world is constantly growing, and most software is relatively small. So quite often a successful, growing product is built by a team that has no experience of large-scale development. It is unrealistic to expect such developers to recognize the inflection point and know what to do when scaling problems begin to appear.
Scaling lessons from nature
I recently read Geoffrey West's excellent book “Scale”. It is about the mathematics of scale in biological and socio-economic systems. His thesis is that all large, complex systems obey fundamental laws of scale. It is a fascinating read and I highly recommend it. In this article, I want to focus on his point that many biological and social systems scale surprisingly well. Look at the body of a mammal. All mammals have the same cell types, bone structure, nervous and circulatory systems. Yet the size difference between a mouse and a blue whale is a factor of about 10^7. How does nature use the same materials and structure for organisms of such different sizes? The answer seems to be that evolution has discovered fractal branching structures. Look at a tree. Each part of it looks like a small tree. The same is true of mammalian circulatory and nervous systems: they are branching fractal networks in which a small part of your lungs or blood vessels looks like a smaller version of the whole.
Can we take these ideas from nature and apply them to software? I think we can learn important lessons. If we can build large systems consisting of small parts that themselves look like complete systems, it will be possible to contain the pathologies that affect most programs as they grow and develop.
Are there software systems that successfully scale by several orders of magnitude? The answer is obvious – the Internet, a global software system with millions of nodes. Subnets really look and work like smaller versions of the entire Internet.
Signs of loosely coupled software
The ability to isolate separate, loosely coupled components in a large system is the main method of successful scaling. The Internet, in fact, is an example of loosely coupled architecture. This means that each node, service or application on the network has the following properties:
- A common communication protocol is used.
- Data is transferred using a clear contract with other nodes.
- Communication does not require knowledge of specific implementation technologies.
- Versioning and deployment are independent.
The Internet is scalable because it is a network of nodes that communicate through a set of well-defined protocols. The nodes interact only through these protocols, and they do not need to know each other's implementation details. The global Internet is not deployed as a single system. Each node has its own version and its own deployment procedure. Individual nodes appear and disappear independently of each other. Conformance to the Internet protocols is the only thing that really matters for the system as a whole. Who created each site, when it was created or deleted, what version it has, what specific technologies and platforms it uses: none of this matters to the Internet as a whole. This is what we mean by loosely coupled software.
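The properties listed above can be illustrated with a small sketch. Everything in it is hypothetical: the `PriceNode` contract, the JSON message fields, the two node classes and their release numbers are invented for illustration. The point is only that the caller depends on the shared protocol and on nothing inside a node.

```python
import json
from typing import Protocol

PROTOCOL_VERSION = 1  # hypothetical version of the shared protocol


class PriceNode(Protocol):
    """The contract: JSON text in, JSON text out. Nothing else is promised."""

    def handle(self, request: str) -> str: ...


class DictBackedNode:
    """One implementation: prices are stored in a dictionary."""

    release = "2.3.0"  # the node's own version, never seen by callers

    def __init__(self) -> None:
        self._prices = {"apple": 120, "pear": 95}

    def handle(self, request: str) -> str:
        item = json.loads(request)["item"]
        reply = {"protocol": PROTOCOL_VERSION, "item": item,
                 "price_cents": self._prices[item]}
        return json.dumps(reply)


class FormulaNode:
    """A different implementation: the price is computed, not stored."""

    release = "0.9.1"  # versioned and deployed independently

    def handle(self, request: str) -> str:
        item = json.loads(request)["item"]
        reply = {"protocol": PROTOCOL_VERSION, "item": item,
                 "price_cents": 10 * len(item)}
        return json.dumps(reply)


def ask_price(node: PriceNode, item: str) -> int:
    """Talk to any node using only the shared protocol."""
    request = json.dumps({"protocol": PROTOCOL_VERSION, "item": item})
    reply = json.loads(node.handle(request))
    if reply["protocol"] != PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version")
    return reply["price_cents"]


for node in (DictBackedNode(), FormulaNode()):
    print(type(node).__name__, ask_price(node, "apple"))
```

Either node could be rewritten, re-versioned or replaced without touching `ask_price`, as long as it keeps honoring the message format.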
Signs of a loosely coupled organization
We can scale teams following the same principles:
- Each sub-team should look like a small software development organization.
- Internal processes and team communication should not go beyond the team.
- The technologies and processes used to implement the software should not be discussed outside the team.
- Teams should communicate with each other only on external issues: common protocols, functionality, service levels and resources.
Small development teams are more effective than large ones, so you need to break up large teams into smaller groups. The lessons of nature and the Internet are that subgroups should look like single small software development organizations. How small are they? Ideally, from one to five people.
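As a rough worked example of why splitting helps, the sketch below compares the possible connections in one large team with those in several small ones. It assumes, purely for illustration, that each pair of sub-teams needs a single channel for its external contract; the function names and the figure of 20 people are hypothetical.

```python
from math import comb
from typing import Tuple


def channels_in_one_team(people: int) -> int:
    """Possible person-to-person connections if everyone is in one team."""
    return comb(people, 2)


def channels_when_split(people: int, team_size: int) -> Tuple[int, int]:
    """Return (connections inside sub-teams, connections between sub-teams).

    Assumption: each pair of sub-teams needs only one channel between them.
    """
    full_teams, remainder = divmod(people, team_size)
    sizes = [team_size] * full_teams + ([remainder] if remainder else [])
    inside = sum(comb(size, 2) for size in sizes)
    between = comb(len(sizes), 2)
    return inside, between


print(channels_in_one_team(20))    # one team of 20 people
print(channels_when_split(20, 5))  # four teams of 5 people
```

Under these assumptions, 20 people in one team have 190 possible connections, while four teams of five have 40 connections inside the teams plus 6 between them.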
It is important that each team looks like a small independent software development organization. Other ways of organizing teams are less effective. There is often a temptation to divide a large team by function. We end up with a team of architects, a team of developers, a team of DBAs, a team of testers, a deployment team and a support team, but this solves none of the scaling problems discussed above. All of these teams have to take part in developing each feature, and often iteratively if you want to avoid waterfall-style project management.
Communication barriers between these functional teams become a major obstacle to efficient and timely delivery. The teams are tightly coupled because they need to share important internal details in order to work together. In addition, their interests do not coincide: developers are usually rewarded for new features, testers for quality, support for stability. These differing interests can lead to conflict and poor results. Why should developers care about logs if they never read them? Why should testers care about delivery if they are responsible only for quality?
Instead, we should organize teams around loosely coupled services that support a business function or a logical group of functions. Each sub-team should design, code, test, deploy, and maintain its own software. The members of such a team will most likely be generalists rather than narrow specialists, because in a small team they have to share these roles. They should aim to automate as much as possible: testing, deployment, monitoring. Teams must choose their own tools and design the architecture of their own systems. Although the protocols that services use to interact should be defined at the organization level, the choice of tools used to implement them should be delegated to the teams. This fits the DevOps model very well.
The level of independence a team has reflects the degree of coupling in the organization as a whole. Ideally, the organization should concern itself with the functionality of the software and, ultimately, with the business value the team delivers, as well as with the cost of the team's resources.
The software architect plays an important role here. The architect should not focus on the specific tools and technologies that teams use, or interfere with the internal architecture of services. Instead, the architect should focus on the protocols and interactions between services and on the health of the system as a whole.
Inverting Conway's Law: organization structure must model the target architecture
How do loose coupling in software and loose coupling between teams fit together? Conway's Law states:
“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
This is based on the observation that the architecture of a software system reflects the structure of the organization that creates it. We can “hack” Conway's Law by inverting it: organize our teams to reflect the architecture we want. With this in mind, we must align loosely coupled teams with loosely coupled software components. But should it be a one-to-one relationship? Ideally, I think, yes, although it seems acceptable for a small team to work on several loosely coupled services. I would say that the inflection point of scaling is higher for teams than for software, so this style of organization seems workable. It is important that the software components remain separate, with their own versioning and deployment, even if several of them are developed by a single team. We want to be able to split the team if it gets too big and hand the services it has developed over to different teams. We cannot do this if the services are tightly coupled or share a process, versioning, or deployment.
We must avoid having several teams work on the same components. This is an anti-pattern, and in a sense it is even worse than one large team working on one large code base, because communication barriers between teams lead to an even stronger feeling of lack of ownership and control.
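The alignment between teams and components can be checked mechanically. The sketch below is only an illustration: the ownership map, the team names and the component names are all hypothetical. One team may own several components, but a component listed under more than one team is reported as the anti-pattern described above.

```python
from collections import defaultdict
from typing import Dict, List


def components_with_several_owners(
    ownership: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each component owned by more than one team to its owning teams."""
    owners = defaultdict(list)
    for team, components in ownership.items():
        for component in components:
            owners[component].append(team)
    return {c: sorted(teams) for c, teams in owners.items() if len(teams) > 1}


# Hypothetical ownership map: team name -> components it builds and runs.
ownership = {
    "payments-team": ["billing", "invoicing"],  # one team, two services: fine
    "catalog-team": ["search", "billing"],      # "billing" is shared: not fine
    "support-team": ["helpdesk"],
}

print(components_with_several_owners(ownership))
# {'billing': ['catalog-team', 'payments-team']}
```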
The interaction between loosely coupled teams creating loosely coupled software is minimized. Take the Internet example again. Often you can use the API provided by another company, without any direct communication with it (if the process is simple and there is documentation). When teams interact, internal team development and implementation processes should not be discussed. Instead, functionality, service levels, and resources should be discussed.
Managing loosely coupled teams creating loosely coupled software should be easier than alternatives. A large organization should focus on providing teams with clear goals and requirements in terms of functionality and service levels. Resource requirements should come from the team, although they can be used by the organization to measure the return on investment.
Loosely coupled teams develop loosely coupled software
Loose coupling in software and between teams is key to building a highly effective organization, and my experience confirms this. I have worked in organizations where teams were divided by function, by software layer, or even by customer. I have also worked in large chaotic teams on a single code base. In all these cases we had the scaling problems described above. The most pleasant experience has always been when my team was a self-contained unit that independently built, tested and deployed independent services. But you do not need to rely on my anecdotes: Accelerate (described above) has research evidence to support this view.