Still measuring latency in percentiles?

Latency is an obvious metric that almost always comes to mind when we talk about monitoring our services. Whether it's a simple controller, a worker that reads events from a queue, or a service that backs up your MongoDB, it is any logical piece of code whose performance matters to us. Let's consider a simple example: you have a service that accepts requests from users and returns the data the UI needs. If this service is in production and you are a fairly mature project, you have most likely already configured metrics and monitoring. Maybe you have even set up alerts and PagerDuty, appointed on-call engineers, and report on SLA compliance once a month. Almost certainly one of those alerts covers the latency of this service. And let me guess: you are using percentiles as the aggregation statistic.

Did I guess right? What if I told you that this is not the best option? What if I went further and argued that percentiles can hide major latency changes after your next deployment? Let's take it step by step.

Let's start with a quick refresher on percentiles: what they are and how engineers use them day to day to monitor the health of services, look for cause and effect during outages, automatically roll back to the previous commit when the deployment with that refactoring you did after reading Fowler goes wrong, wake up at 3 a.m. to a call on a phone with the sound turned off, and show management beautiful, flat charts of three nines...

Percentile

So, we have a service and we want to track how quickly it responds. By latency we most often mean the time from receiving a request to the moment the service is ready to return the result (though some teams include other things in latency: for example, they end the interval at the moment the user actually receives the response. In our case that is not the best idea, because the metric would then include the speed of data transfer over the network, as well as everything related to encryption and decryption). By collecting this value for all user requests, we can track how fast our service runs and how our new commits affect that speed. And there are plenty of studies showing a direct relationship between service speed and user satisfaction (which can be expressed, for example, in conversion of new users into paying ones), which adds value.
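To make that measurement convention concrete, here is a minimal sketch in Python. The handler and the record_latency callback are purely illustrative names, not part of any real framework:

```python
import time

def handle_request(request):
    # Hypothetical handler: fetch whatever data the UI asked for.
    return {"user_id": request["user_id"], "data": "..."}

def handle_with_latency(request, record_latency):
    # Latency here is "request received" -> "result ready to return";
    # network transfer and TLS work are deliberately left outside the interval.
    started = time.monotonic()
    response = handle_request(request)
    record_latency(time.monotonic() - started)
    return response

# Example usage: print the latency of one request in milliseconds.
handle_with_latency({"user_id": 42}, lambda seconds: print(f"{seconds * 1000:.2f} ms"))
```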

So far, so simple. Now let's assume the service is heavily loaded: 100k requests per second. To avoid sending this metric to the monitoring system at that same rate, we want to act smarter: on the service side we buffer the data for 10 seconds, aggregate it, and only then send it to the monitoring system. And here we finally get to percentiles: this is the statistic we use to aggregate all the data so that a single value goes to the monitoring system every 10 seconds. In essence, we cut off some of the longest requests that could distort the picture, in order to roughly represent what the bulk of our users see. For example, p99 shows the maximum value after removing the 1% of the longest requests. In other words: p99 is greater than 99% of all the values being aggregated.
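For illustration, here is a rough sketch of that aggregation step. It uses a simplified nearest-rank p99 and made-up lognormal data; real monitoring clients typically use interpolated percentiles or histogram sketches instead of sorting a raw buffer:

```python
import random

def p99(values):
    # p99 is greater than 99% of all observed values: sort the window
    # and take the value that sits above 99% of the samples.
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]

# Pretend this is the buffer of latencies (in ms) for one 10-second window.
window = [random.lognormvariate(4.0, 0.5) for _ in range(10_000)]

# Only this single number is sent to the monitoring system for the window.
print(f"p99 for this window: {p99(window):.1f} ms")
```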

Let's assume we have set up an alert with the condition that three values in a row must not exceed 200 ms. After a month of work on a new version of the service, it's finally time to deploy! No sooner said than done: deploy and off to Grafana to check the charts! The red line marks the moment of deployment.

Did you notice the difference? Me neither. Shall we go show the manager the chart and the absence of alerts, and collect a promotion?

No. Or, less dramatically: not so fast. Staying within the SLA is great, but we also want to understand the dynamics of our metrics after the changes we make. At first glance, everything is fine, provided, of course, that p99 gives us enough information about those dynamics. Let's look at two points on this graph, green and purple. Green is the last metric value before the deployment. Purple is the first value immediately after it. And here are the real latency values that produced these two points after aggregation with the p99 statistic:

Oops! I'm sure you can see the difference now. Although p99 is the same, latency has clearly changed significantly after the deployment. Even though these graphs are made up and exaggerate a little for effect, we can say that user requests are now taking twice as long to process!
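To put made-up numbers on that picture, here is a tiny sketch in which the slowest 2% of requests (say, cache misses) pin p99 in place while the bulk of requests doubles in latency after the deploy. The data is fabricated purely for illustration:

```python
def p99(values):
    # Simplified nearest-rank estimate, same as in the earlier sketch.
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]

def mean(values):
    return sum(values) / len(values)

# Before the deploy: the bulk of requests take 50 ms, the slowest 2% take 200 ms.
before = [50.0] * 980 + [200.0] * 20
# After the deploy: the bulk doubled to 100 ms, the slow tail is unchanged.
after = [100.0] * 980 + [200.0] * 20

print(p99(before), p99(after))    # 200.0 200.0 -> the p99 chart looks flat
print(mean(before), mean(after))  # 53.0 102.0  -> users wait roughly twice as long
```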

Problem

Let's describe the problem in simple terms: a percentile, as a statistic, gives information about a single point. Think about it: p90 represents one value that is greater than 90% of all values. By applying this statistic we lose information about every value except that one; the rest simply don't factor in. As we saw in the graph above, this metric is not sensitive enough to reflect changes within those 90% of values. Now think about the user experience: there are plenty of businesses for which such a significant increase in latency is simply unacceptable! Is there anything we can do to improve the situation?

Trimmed mean

We can! The trimmed mean statistic lets you monitor the mean over a selected range of the data, giving a more accurate picture of the real state of affairs. For example, TM90 is the average after removing the 10% largest values (to strip out fluctuations: a cache miss that forced a trip to the database, a downstream service that took a long time to respond, or a GC that decided it was time to act). The trimmed mean is sensitive to changes that are invisible when you use percentiles!

By playing with this statistic's bounds, you can get a good idea of the user experience. For example, TM(99%:) shows the average among the 1% of the slowest requests, while TM99 is the average among the 99% of the fastest ones.
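As a sketch of what these bounds mean (the function below is an illustrative approximation; CloudWatch's exact trimming and interpolation rules may differ), reusing the made-up numbers from the earlier example:

```python
def trimmed_mean(values, lower=0.0, upper=1.0):
    # Mean of the values between the lower and upper quantile bounds, roughly:
    #   TM90     ~ trimmed_mean(values, 0.0, 0.90)
    #   TM(99%:) ~ trimmed_mean(values, 0.99, 1.0)
    ordered = sorted(values)
    lo = int(lower * len(ordered))
    hi = int(upper * len(ordered))
    kept = ordered[lo:hi]
    return sum(kept) / len(kept)

# Same made-up data as before: the bulk doubles, the tail stays put.
before = [50.0] * 980 + [200.0] * 20
after = [100.0] * 980 + [200.0] * 20

print(trimmed_mean(before, upper=0.90), trimmed_mean(after, upper=0.90))  # 50.0 vs 100.0
```

While p99 stayed flat in the earlier sketch, TM90 doubles along with the bulk of the requests, which is exactly the signal we were missing.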

To demonstrate the difference, let's look at the previous chart again, this time with the TM90 statistic plotted alongside.

It's easy to see that the changes after our deployment did not go unnoticed! With properly configured rollbacks, deployments, and load tests, a problematic commit might not even make it to production servers. Isn't that what we all want? 🙂 I hope this statistic will someday be added to every popular monitoring system. Amazon, for example, has already added it to CloudWatch.

Load with pleasure!
