Metrics and why we need them

These days metrics surprise no one. They are everywhere: in application logs, in project management, in product management, in people management, in managing anything at all. We can even claim to understand why they are needed. Unfortunately, not all of us, and not always.

In this article I will try to summarize the basic concepts of metrics from different areas. So, dear reader, welcome!

1. A complex system is a system whose behavior cannot be predicted

Before defining what metrics are, let's try to understand why they are needed. To do that, we first have to understand what a complex system is, and speculate a little on the topic.

For contrast (and for fun), let's first introduce the concept of a simple system.

A simple system is a system whose composition and operating principles are easy to describe. For such a system, it is easy to imagine its behavior and predict the result of its work: for example, a black box that multiplies an input number by 2.
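For example, in Python such a black box takes a couple of lines (a toy sketch of my own, purely for illustration):

    def black_box(x: float) -> float:
        # A simple system: we know its composition and operating principle,
        # so the output is trivially predictable for any input.
        return x * 2

    assert black_box(2) == 4  # no surprises, ever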

A complex system is a system that consists of many components. For such a system, it is difficult to describe the behavior and predict the result, since it may depend on many factors.

2. Modern computer programs are mostly complex systems

Everyone who works in IT is in luck! We work with programs and code, and we build our systems ourselves. Over the history of computer science, humanity has come up with a huge number of things that simplify computer systems: standard protocols, high-level programming languages, garbage collectors, schedulers, standard architectures, programming paradigms and much more. Everything that hides the complexity of an individual element from us. This is what OOP calls "encapsulation".

But none of this makes software simple. Since software most often tries to model the behavior of real systems (hello, DDD) and simulate them as closely as possible, it is forced to absorb a huge part of the complexity of the system it models.

Add to this the influence of everything else on the operation of the software: physical servers, virtualization/containerization systems, the network, databases, proxies, DNS, client devices and the rest of the real world. Note that each of these components is itself a complex system. Even if your software performs only one operation, returning the answer to the expression 2*2 over the network, your application deployed in a cluster is complex and its behavior is almost impossible to predict. If you don't see what I mean, here are a few ways a user can ask you what 2*2 equals and get something other than 4:

  1. The server crashed, and the user received a timeout error

  2. The user has network problems

  3. The user entered a query in a format you do not expect

  4. The user is using a browser version you do not support and receives a 500 error

  5. You did not pay for the domain name, and the user received a stub page from your hosting company

There is an approach in modeling theory that represents a system as a black box. What does that mean? It means we have a component, complex or simple, with a set of input signals and a set of output signals. What matters here:

  1. We can predict the output of the system

  2. We do not need to reveal the complexity of the black box's inner workings

  3. From our point of view, a black box can be considered a simple system, even though in fact it may not be one

If you're now thinking about microservices and OOP encapsulation, then I think you already understand where I'm going with this.

In what follows, we will proceed from the black-box approximation.
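To make the black-box view concrete, here is one way to express it in code (a sketch of mine; the names BlackBox and Doubler are invented for illustration): we fix an interface of inputs and outputs and deliberately hide everything else.

    from abc import ABC, abstractmethod

    class BlackBox(ABC):
        """A system observed only through its inputs and outputs."""

        @abstractmethod
        def process(self, signal: float) -> float:
            """Consume an input signal, produce an output signal."""

    class Doubler(BlackBox):
        # Internally this implementation happens to be simple, but a caller
        # cannot tell: from the outside it is just another BlackBox.
        def process(self, signal: float) -> float:
            return signal * 2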

In addition, the considerations below apply not only to computer programs but to any other complex system.

3. The control of complex systems is based on the PDCA cycle

People have been learning to manage complex systems for hundreds of years, first of all by managing other people. But as soon as humanity began to create objects more complex than a cart, the need arose to control complex systems outside of social interaction.

The foundation of modern management theory is the work of Walter Shewhart and William Edwards Deming. For simplicity, we will deliberately bypass the other pillars of management theory, such as Peter Drucker and many others.

What is management from the point of view of the founding fathers of management theory?

Before these two arrived, managing a system meant maintaining it, keeping it stable, and directing it top-down. An excellent example is Henry Ford's assembly line: the process is built to be as stable as possible, even with unskilled labor involved; labor is only a tool and is completely replaceable. Here we can highlight the first basic function of system management: maintaining the stability of the existing system. The Ford assembly line performed this function perfectly.

But Deming's work overturned this paradigm. I strongly recommend reading it, or at least reading about the System of Profound Knowledge and the 14 Points. Rather than repeat them, I will try to paraphrase and extract the essence. So Deming came and said:

  1. Let's distribute control to everyone and make it decentralized: give every employee (read: every element of the system) the opportunity to show leadership and propose process changes

  2. Let's make the system flexible: it should accept good changes and reject bad ones

How do we make the system changeable while filtering good changes from bad ones?

Answer: the Deming-Shewhart cycle. It is a fairly simple concept that has spawned countless variations. The idea is that we divide changes to the system into cycles of four steps (a schematic code sketch follows the list):

  1. Plan a change (Plan)

  2. Carry it out (Do)

  3. Check that the change had a positive effect on our indicators (Check)

  4. Adjust course based on the measurement results (Act)

  5. Repeat from the beginning
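Schematically, one pass of the cycle can be sketched in code like this (a loose sketch; all the names are hypothetical, and "lower is better" is just a toy assumption):

    import random

    def pdca_step(measure, apply_change, rollback):
        """One Deming-Shewhart iteration. The three arguments are callables
        supplied by the caller; the names are illustrative, not canon."""
        baseline = measure()      # Plan: record the indicator we want to improve
        apply_change()            # Do: carry out the planned change
        result = measure()        # Check: measure the impact of the change
        if result >= baseline:    # no improvement (lower is better here)
            rollback()            # Act: reject a change that did not help
        return measure()

    # Toy usage: an "error rate" that a change may or may not improve.
    state = {"error_rate": 0.10}
    pdca_step(
        measure=lambda: state["error_rate"],
        apply_change=lambda: state.update(error_rate=random.uniform(0.05, 0.15)),
        rollback=lambda: state.update(error_rate=0.10),
    )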

Don't you think it smells like Agile? And this was in the first half of the last century!

It is quite obvious that what was proposed as a model for managing an organization can be applied to change management in any complex system.

Thus, Deming introduced the second basic function of system management – making changes.

4. Change control is impossible without metrics

Steps 3 and 4 of the Deming-Shewhart cycle (Check and Act) are the most interesting for our current argument. Remember, we work with complex systems: just by observing them, it is hard to tell whether we made things better or worse. Moreover, we have presented our system as a black box. How, then, can we understand whether a change has been beneficial? By now the answer is quite obvious: metrics.

So: metrics are the results of a system's operation, the way the system manifests itself. And metrics support both management functions: maintaining stability and assessing changes.

5. Shewhart charts – a simple way to visualize metrics

Why do we keep going on about Deming and only Deming? We mentioned two founding fathers, and now it is time for Walter Shewhart to enter our story.

The fact is that Shewhart actively studied statistical methods of process control using metrics. He invented a visual tool for displaying metrics: control charts, now named after him. Control charts were designed first of all to stabilize already established processes, but nothing prevents us from using them to assess changes.

In its simplest form, a control chart is a time series on which the values of a metric are plotted. In addition, the chart contains a line showing the average metric value over a time interval (CL, the center line) and lines marking the upper and lower limits beyond which the value should not go (UCL and LCL). Anyone who has worked with monitoring systems knows exactly what I mean, since the vast majority of dashboards are built on exactly this principle.
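In code, the basic elements of such a chart take only a few lines. A sketch with NumPy, using the classic three-sigma convention for the control limits (the data is simulated):

    import numpy as np

    rng = np.random.default_rng(42)
    values = rng.normal(loc=10.0, scale=0.2, size=50)  # a stable process...
    values[30] = 12.9                                  # ...with one special-cause spike

    cl = values.mean()                         # CL: average over the interval
    sigma = values.std(ddof=1)                 # sample standard deviation
    ucl, lcl = cl + 3 * sigma, cl - 3 * sigma  # classic three-sigma limits

    out_of_control = values[(values > ucl) | (values < lcl)]
    print(f"CL={cl:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}, flagged={out_of_control}")

Points outside the limits signal special-cause variation that is worth investigating.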

6. Main types of metrics and some examples

Let's try to break metrics into logical groups and give examples for some complex systems.

  1. The main results of our process. Our system exists for a reason, and we need to understand what that reason is. For example:

    1. A business is run to make money; the logical main metric is profit.

    2. A development team ships features; the logical main metric is the number of released features / tasks / user stories.

    3. An online store sells goods; the logical main metrics are the number of sales and the revenue from them.

    4. A web server routes requests; the logical main metric is the number of requests processed.

  2. Quality indicators of the main process. The system produces results, but how well does it produce them? For example:

    1. For a business, the ratio of turnover to net profit.

    2. For a development team, the number of bugs that reached production, and time to market (TTM).

    3. For an online store, conversion through the sales funnel and the response time to user actions.

    4. For a web server, the ratio of processed requests to requests that failed through the server's fault, and server availability.

  3. Intermediate metrics and metrics of subsystems. These help us pinpoint exactly where things went wrong, or, when a metric leaves its acceptable range, predict that something will soon break. For example:

    1. Anything that contributes to the business.

    2. HR metrics, Git metrics and others.

    3. Subsystem metrics.

    4. Server metrics.

7. Once collected, metrics need to be analyzed

So, we have selected metrics and collected data. What next?

You probably already have alerts and triggers set up, but it is still useful to look at the data with your own eyes, especially when you are just starting to collect metrics.

A lot of literature exists on analyzing Shewhart charts, and data analysis is an entire science of its own. We will not try to cover the uncoverable; we will only present some basic techniques and patterns for analyzing the collected data.

Let's build a time series and calculate the trend line

The time representation is a time series, essentially the general case of a Shewhart chart: time is plotted along the horizontal axis and the metric value along the vertical axis.

Our trend line can be parallel to the time axis. According to Shewhart, a stable process is one whose trend line coincides with the line of the average value and whose deviations are approximately normal (Gaussian) in nature. Simply put, if we see a straight horizontal trend line, our process is stable and everything is fine.

Our trend line can show an upward or downward trend. This means something is constantly influencing our system and the system is drifting. For example, a constant increase in the time it takes to complete tasks in Jira may indicate that a system under constant tweaking is accumulating complexity and becoming harder to modify; that is bad, and something must be done to compensate. Or visits to our website keep growing: the marketing program is working, which is good, and something must be done so the service can cope with the growing load.
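Getting the trend line numerically is a single least-squares fit. A sketch covering both cases above (made-up metric values; the "stability" threshold is an arbitrary choice):

    import numpy as np

    values = np.array([12, 13, 12, 14, 15, 15, 16, 17, 18, 18], float)  # made-up metric
    t = np.arange(len(values))                       # time axis: observation index
    slope, intercept = np.polyfit(t, values, deg=1)  # least-squares linear trend
    trend = slope * t + intercept

    if abs(slope) < 1e-3:  # pick a threshold that makes sense for your metric
        print("roughly horizontal trend: the process looks stable")
    else:
        print(f"drift of {slope:+.3f} per step: something steadily pushes the system")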

Our trend line may have jumps. It is very good if we can match such jumps to changes in the system. For example, we adopted TDD, TTM went up, and the number of critical bugs went down. We abandoned TDD and saw the reverse. We added RAM to the server and got a faster response time.

It is also a good idea to play around with the display, for example by choosing a logarithmic scale.

We analyze the deviation of points from the trend line or from the average value (CL) over time

If we have built a trend line, we know a predicted value of the metric at any point in time. This means we can calculate how far the real metric value deviates from the trend line at that point.

Do not forget the physical meaning of the metric. It determines whether the sign of the deviation matters to us or whether we should evaluate only its absolute value.

If the deviation from the trend line tends to grow over time, that tells us our process is becoming less stable.
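The same idea in code: fit the trend, take the residuals, and watch their spread over time. A sketch (made-up data; the window size is an arbitrary choice):

    import numpy as np

    values = np.array([10.0, 10.2, 9.9, 10.4, 9.5, 10.8, 9.1, 11.3, 8.6, 11.9])
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)
    residuals = values - (slope * t + intercept)  # deviation from the trend line

    # Rolling standard deviation of the residuals: if it grows over time,
    # the process is becoming less stable even if the trend itself looks fine.
    window = 4
    rolling_std = [residuals[i:i + window].std()
                   for i in range(len(residuals) - window + 1)]
    print(np.round(rolling_std, 2))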

This is also where experiments with the derivative of the metric-versus-time graph belong. Recall that by calculating the derivative (which is quite easy to do numerically), we can draw conclusions about the rate of change of our process.
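Numerically, this is a single call. A sketch (np.gradient approximates the derivative with finite differences; the data is made up):

    import numpy as np

    visits = np.array([100, 102, 105, 110, 118, 130], float)  # e.g. daily visits
    rate = np.gradient(visits)  # approximate derivative: visits gained per day
    print(rate)                 # a growing rate means the growth is accelerating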

Analyzing the “frequency” representation of data

The "frequency" representation is a histogram: the metric value is plotted along the horizontal axis, and the number of times that value was observed along the vertical axis. "Frequency" is in quotation marks because the term comes from signal theory; it describes what is happening well but is not fully applicable here.

Regarding the frequency representation, note that if the quantity is continuous (not discrete, i.e. it takes arbitrary fractional values), we will probably need to aggregate values into bins to construct such a graph.
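A sketch of such binning with NumPy (the response-time data is simulated; the number of bins is a judgment call):

    import numpy as np

    response_ms = np.random.default_rng(0).normal(loc=120, scale=15, size=1000)
    counts, edges = np.histogram(response_ms, bins=20)  # aggregate into bins
    for left, right, n in zip(edges[:-1], edges[1:], counts):
        print(f"{left:6.1f}..{right:6.1f} ms | {'#' * (n // 5)}")

For this simulated data the ASCII histogram shows the familiar bell shape around 120 ms.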

Most likely we will see a normal (Gaussian) distribution, or some other symmetric distribution painfully similar to a normal one. If so, we are probably fine: according to Shewhart, a normal distribution corresponds to a stable process with random deviations. If the peak value satisfies us, we can make changes to the system to reduce the spread around the target value. Or we can do something to move the peak.

If the previous analysis (of deviations over time) showed that our process is unstable, we can build several histograms for different time intervals and compare them.

Besides variations on the normal distribution, we can get a huge number of other shapes. An interesting case is a combination of several Gaussians, which suggests several distinct processes, or several factors influencing the distribution. For example, site visits on weekdays versus weekends, or the number of bugs in tasks done by senior, middle and junior developers. It is very helpful to understand what causes such combinations. It can be useful to isolate points from the different domains and highlight them on the time axis, or to build samples with different filters: continuing the earlier example, check whether the time to complete a task depends on the developer's level, or on the number of story points.
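A sketch of the "split by the suspected factor" approach with pandas (all data here is simulated; weekday versus weekend is the hypothetical factor from the example above):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    dates = pd.date_range("2024-01-01", periods=200, freq="D")
    visits = np.where(dates.dayofweek < 5,
                      rng.normal(1000, 50, 200),   # weekday traffic
                      rng.normal(600, 50, 200))    # weekend traffic: the second hump
    df = pd.DataFrame({"date": dates, "visits": visits})

    # Split the suspected mixture by the candidate factor and compare the peaks.
    stats = df.groupby(df["date"].dt.dayofweek < 5)["visits"].agg(["mean", "std"])
    print(stats)  # two clearly different means: two overlaid processes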

We may also see some other typical distribution; here I invite the reader to dive into the topic on their own.

Let's continue playing with the data

There are many ways to extract the information we need from data: filtering, aggregation, smoothing, selecting groups of data, visualizing how some data depends on other data, and much more. We experiment and try to find implicit dependencies.
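For instance, smoothing with a rolling mean is a one-liner in pandas (a sketch; the window width is a matter of taste):

    import pandas as pd

    visits = pd.Series([980, 1010, 995, 1500, 990, 1005, 970])  # one promo-day spike
    smoothed = visits.rolling(window=3, center=True).mean()     # smooth out the noise
    print(smoothed)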

We build a model and predict the behavior of the system in the future

A very subtle and controversial point. As they say, all models are wrong, but some are useful. Here we try to pin down patterns such as "increasing story points leads to increased task completion time", "advertising campaigns increase website visits by 10% on average", "the server most often crashes when RAM usage exceeds 97%". I hope your conclusions will be less obvious, and you can add them to the progress report.

Moreover, the trend line can be extended indefinitely into the future, and if you back this up with conclusions from the constructed model, you can produce a scientific-looking text predicting that in six months the world economy will collapse. All such statements should be accompanied by caveats: "If the trend is chosen correctly, then…", "If the dependence does not change, then…", "If our reasoning is correct…".
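In code, extending the trend is trivially easy, which is exactly why those caveats matter. A sketch (simulated history):

    import numpy as np

    t = np.arange(12)  # twelve months of history
    revenue = 100 + 5 * t + np.random.default_rng(2).normal(0, 3, 12)
    slope, intercept = np.polyfit(t, revenue, deg=1)

    future = np.arange(12, 18)  # six months ahead
    forecast = slope * future + intercept
    # Valid only IF the trend is chosen correctly and the dependence does not change.
    print(np.round(forecast, 1))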

8. But actually, no

Although metrics are a generally accepted means of monitoring the processes of complex systems, with many books written and many methods developed for collecting them, there is also a strong line of criticism of introducing metrics to monitor complex systems. Take Goodhart's law and its many variations. One formulation of the law goes like this:

When a measure becomes a target, it ceases to be a good measure

That is, as soon as we single out a metric, we subconsciously start optimizing it, even to the detriment of the final result. This is especially visible in teams where KPIs (key performance indicators) are introduced as a measure of employee evaluation and compensation; many management theorists cite the same example. People begin to improve KPIs to the detriment of the company's overall results and its development prospects. The strategy is quite understandable: people were given a vector and shown what was expected of them, and they followed that vector. The same theorists say that a KPI may be used to evaluate a process, but never a person, and ideally no one should even know that they are being evaluated by some metric.

This applies to a lesser extent to technically complex systems, but the factor is still present. For example, suppose we measure CPU and RAM load on a server. If we declare these important metrics and demand that the machine running the application stay lightly loaded, we risk throwing out important pieces of logic just to satisfy the requirement.

Singling out any set of metrics immediately increases the odds of "not seeing the forest for the trees" by orders of magnitude. By putting metrics on display for everyone, we are saying: "this is important, and the rest is not". Even if we do not mean it that way, there will surely be people who understand us that way. We thereby reduce our complex system to a handful of its manifestations: simple ones that let us track the system's state without plunging into its complexity, but almost certainly insufficient for understanding and tracking the operation of the whole system and its dynamics.

You might say: let's introduce more metrics and look at the entire system! Here we hit a contradiction. On one hand, we wanted to simplify our understanding of the system through its simple manifestations; on the other, if there are many such manifestations, analyzing them may become harder than understanding the system itself. A good practice here is to distinguish "external" from "internal" metrics: external metrics are the few simple ones that are always in sight, while internal metrics are the whole range that helps us understand the causes of a failure but does not distract us during normal operation.

There is also the opinion that metrics, as a subtype of statistics, are needed only to prove to management that changes must be made, and then to show how well those changes affected the system. Naturally, all of this is accompanied by juggling the data: without direct deception, of course, but you all know that there are three kinds of lies…

9. What do we have as a result?

The founding fathers told us that without metrics it is difficult to manage complex systems. They also warned that metrics can become an evil that hides the essence of a system behind its manifestations. But metrics are just a tool, and you need to know how to use it. Like any tool, metrics have their areas of applicability, their pros and their cons.

10. Materials on the topic

  1. Out of the Crisis: A New Paradigm for Managing People, Systems and Processes, W. Edwards Deming

  2. Statistical Process Control: Business Optimization Using Shewhart Control Charts, Donald Wheeler and David Chambers

  3. Engineering metrics: what to measure, how and why?

  4. IT landscape as a complex system of systems

  5. GOST R ISO 7870-1-2011

  6. GOST R ISO 7870-2-2011

  7. Control Chart in JIRA, all its secrets

  8. The production process through the eyes of a Kanban practitioner
