Hello. Today we would like to talk about detecting anomalies in a microservice environment. This post summarizes a 40-minute talk we gave at the online conference DevOps Live 2020. To keep it from turning into a long read, we focus on a review of tools that detect anomalies in the distribution of metric values, help automate the monitoring of microservices, and can be adopted quickly by any team.
Anomaly detection is a very relevant topic now: with the transition to microservices, SRE and DevOps teams give much higher priority to turning alerts into a meaningful signal, reducing MTTD, and simplifying alert configuration when monitoring distributed environments.
With the transition to a microservice infrastructure, the number of services and monitored metrics grows many times over, and the complexity of controlling such a system increases exponentially.
By “control” we mean getting a meaningful signal when the performance of services or infrastructure degrades.
As a rule, to receive an alert, you need to configure alert triggers, that is, determine which metric value is considered a threshold.
But how do you set alert thresholds for hundreds of services and thousands of metrics?
And does it need to be done manually?
A few examples where anomaly detection can be extremely useful:
- increase in latency on one of hundreds of services;
- an increase in the number of errors at one of thousands of endpoints;
- decrease in the number of requests during the usually high period.
Such changes can get lost in a large volume of data: we may be unable to tune alert thresholds for every relevant metric because there are so many of them, or we may simply forget.
So, in performance monitoring, anomaly detection is used for:
- detecting problems;
- automating monitoring and avoiding manual thresholds in alerts;
- identifying “weak” signals when there are a lot of metrics and not all alert thresholds are set.
Let’s say the idea sounds interesting and you want to try anomaly detection. Where do you start?
There are the following options for implementing anomaly detection:
- do it yourself;
- as a feature of APM systems;
- as a service.
Let’s take a look at each of the ways.
Do it yourself
Even the standard monitoring tool Prometheus has built-in functions and statistical methods that let you find anomalies in time-series metrics.
For each analyzed metric you will need to create several recording rules, so overall the process is painstaking and laborious.
If we accept the hypothesis that the analyzed data have a normal distribution, then using simple statistics, we can define an anomaly as a value outside the range of three standard deviations (the “three sigma” rule).
Thus, for the baseline scenario of detecting anomalies, it is sufficient to calculate and record the mean, the standard deviation, and the z-score, which shows how many standard deviations an observed value lies from the mean.
For example, let’s take the http_requests_total metric. First, you need to aggregate it:
    # aggregate the request rate over five minutes
    - record: job:http_requests:rate5m
      expr: sum by (app) (rate(http_requests_total[5m]))
Next, we calculate the three data sets we need:
    # mean over one week
    - record: job:http_requests:rate5m:avg_over_time_1w
      expr: avg_over_time(job:http_requests:rate5m[1w])

    # standard deviation over one week
    - record: job:http_requests:rate5m:stddev_over_time_1w
      expr: stddev_over_time(job:http_requests:rate5m[1w])

    # z-score
    (job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w)
      / job:http_requests:rate5m:stddev_over_time_1w
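To make the arithmetic behind these rules easier to follow, here is a minimal Python sketch of the same mean, standard deviation, and z-score calculation together with the “three sigma” check. It runs outside Prometheus; the function names and the sample values are ours and purely illustrative.

```python
from statistics import fmean, pstdev


def z_score(history, value):
    """Return how many standard deviations `value` lies from the mean of `history`."""
    mean = fmean(history)
    std = pstdev(history)  # population standard deviation, as stddev_over_time computes
    if std == 0:
        return 0.0  # flat history: there is no spread to measure against
    return (value - mean) / std


def is_anomaly(history, value, threshold=3.0):
    """Three-sigma rule: flag values more than `threshold` deviations from the mean."""
    return abs(z_score(history, value)) > threshold


# Hypothetical 5-minute request rates collected over the analysis window.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomaly(history, 101))  # False: within the usual spread
print(is_anomaly(history, 150))  # True: far outside three standard deviations
```

The threshold of 3 follows from the normality assumption above; for data that is clearly not normally distributed it would need to be revisited.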
Most performance and application load data (request rate, latency) is seasonal: the load is uneven throughout the day, week, and year, and these variations repeat from one period to the next.
To take seasonality into account, you need to predict values from past data and compare the actual value with the prediction.
For prediction, you can use various statistical methods or even combine several predictions into one series.
Once you’ve got the predicted value, you need to assess how far the actual value deviates from it, and here again the z-score comes in handy.
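As an illustration of this idea, the sketch below uses one very simple prediction method: it averages the values observed at the same phase of previous seasons and scores the new value against them. This is our own simplified assumption, not the method from the referenced article; `seasonal_z_score` and its parameters are hypothetical names.

```python
from statistics import fmean, pstdev


def seasonal_z_score(series, period, value, seasons=3):
    """Score `value`, the next point after `series`, against past seasons.

    `series` holds evenly spaced samples and `period` is the number of
    samples in one season (for example, one day or one week).
    """
    n = len(series)
    # Samples taken exactly 1, 2, ... `seasons` periods before the new point.
    same_phase = [
        series[n - k * period]
        for k in range(1, seasons + 1)
        if n - k * period >= 0
    ]
    if len(same_phase) < 2:
        raise ValueError("need at least two past seasons of data")
    predicted = fmean(same_phase)
    spread = pstdev(same_phase)
    if spread == 0:
        return 0.0  # identical past seasons: no spread to measure against
    return (value - predicted) / spread


# Three hypothetical "days" with four samples each: low, medium, peak, medium.
series = [10, 50, 90, 30, 12, 52, 88, 31, 11, 51, 92, 29]
print(abs(seasonal_z_score(series, 4, 11.5)) > 3)  # False: typical for this phase
print(abs(seasonal_z_score(series, 4, 50)) > 3)    # True: normal at midday, not here
```

Note that 50 is close to the overall mean of the series, so a non-seasonal z-score would not flag it; comparing with the same phase of earlier seasons does.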
For more information on configuring recording rules in Prometheus to calculate upper and lower bounds in case of seasonality, refer to article …
Prometheus at maximum speed – PAD
The Prometheus Anomaly Detector (PAD) project, developed under the wing of Red Hat, automates the actions described in the previous section.
In PAD, you can select existing Prometheus metrics for analysis, and PAD will create the necessary recording rules and build predictions with lower and upper thresholds by running the values through the Prophet engine, which already takes seasonality into account.
PAD also creates Grafana dashboards and alerts that notify you when an anomaly is detected.
So far, this is not a production-grade solution but rather a proof of concept.
Anomaly detection as part of the APM system
Application Performance Monitoring systems are acquiring AIOps functionality – they strive to integrate machine learning mechanisms and anomaly detection, as they already contain a huge amount of information, data and metrics from applications.
Let’s look at the implementation of anomaly detection and how automated this process is in terms of setting up and monitoring new services and metrics.
The New Relic platform lets you configure notification policies for deviations of any chosen metric (infrastructure, EUM, or business metrics) from the baseline of predicted values.
Within a notification policy, you separately configure the sensitivity to deviations from the baseline and the direction of deviations (down only, up only, or both).
For example, if you choose the number of requests as a metric, then the system will consider as a problem either only an increase in the number of requests relative to the expected, or only a decrease, or any change relative to the baseline.
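The effect of the sensitivity and direction settings can be sketched in a few lines of Python. This is only a generic illustration of the idea, not New Relic’s API or algorithm: the function, the `Direction` enum, and the way the band is built from a standard deviation are all our assumptions.

```python
from enum import Enum


class Direction(Enum):
    UP = "up"
    DOWN = "down"
    BOTH = "both"


def violates_baseline(actual, predicted, stddev, sensitivity, direction):
    """Return True if `actual` leaves the band around `predicted` in a watched direction.

    The band is `predicted ± sensitivity * stddev`; a smaller `sensitivity`
    gives a narrower band and therefore more alerts.
    """
    too_high = actual > predicted + sensitivity * stddev
    too_low = actual < predicted - sensitivity * stddev
    if direction is Direction.UP:
        return too_high
    if direction is Direction.DOWN:
        return too_low
    return too_high or too_low


# Hypothetical numbers: 1000 requests expected, 700 observed.
print(violates_baseline(700, 1000, 50, 3, Direction.UP))    # False: drops are ignored
print(violates_baseline(700, 1000, 50, 3, Direction.DOWN))  # True
print(violates_baseline(700, 1000, 50, 3, Direction.BOTH))  # True
```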
To configure alerts, you need to create an alert policy, select the required metric, and specify the parameters.
Also in April 2020, a separate product was introduced for managing multiple metrics and their automatic analysis using machine learning algorithms – New Relic Applied Intelligence (AI).
Once activated, New Relic AI monitors the KPIs of connected applications and signals any anomalies it detects.
When adding a new application / service to the system, you must manually add it to the tracked list.
AppDynamics APM automatically calculates the baseline based on the standard deviation for KPI metrics of connected applications and displays them on an application interaction map.
An application can have several baselines; you can configure them yourself by choosing the type (daily, weekly, etc.) and the time period over which the baseline is calculated.
Setting up anomaly alerts is not automatic: it requires configuring a policy, a health rule, and conditions.
During configuration, you need to select one metric or a group of metrics and specify the percentage by which the metric value should deviate from the baseline for the condition in the health rule to be triggered.
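The kind of condition described here, a percentage deviation from the baseline applied to one metric or a group of metrics, can be sketched as follows. This is a generic illustration rather than AppDynamics configuration or code; the function and the metric names are hypothetical.

```python
def deviates_by_percent(value, baseline, percent):
    """Return True if `value` differs from `baseline` by more than `percent` percent."""
    if baseline == 0:
        raise ValueError("percentage deviation is undefined for a zero baseline")
    return abs(value - baseline) / abs(baseline) * 100 > percent


# Hypothetical group of metrics: name -> (current value, baseline value).
metrics = {
    "checkout_latency_ms": (260, 200),  # 30% above the baseline
    "search_latency_ms": (230, 200),    # 15% above the baseline
}

# Evaluate the same 25% condition for every metric in the group.
triggered = [
    name
    for name, (value, baseline) in metrics.items()
    if deviates_by_percent(value, baseline, 25)
]
print(triggered)  # ['checkout_latency_ms']
```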
Dynatrace has several out-of-the-box rules for detecting anomalies, and alerts are immediately available for them.
The following rules are available:
- detection of performance KPI degradation;
- detection of traffic drops;
- detection of transaction execution time degradation.
The main features and settings are shown in the screenshot.
The performance monitoring solution Instana provides more than 230 out-of-the-box rules for detecting health problems of services and applications, some of which relate to identifying anomalies in the KPIs of services.
These rules analyze latency, error rate, and traffic (number of requests) metrics.
Most of the rules detect anomalies in the distribution of metric values using a modified E-Divisive with Medians (EDM) algorithm.
In addition to the built-in rules, there is a builder for custom notification rules whose conditions are based on deviations of metric values from the baseline.
Rules can be set up in “easy” or “advanced” mode; the latter offers more options to choose from.
There are two baselines to choose from – daily and weekly for all major metrics.
We wrote more about the wizard for creating alert rules in the article on EUM.
Anomaly detection as a Service
If you have no APM solution in place and no desire to configure Prometheus manually, but want to quickly look at your data and get anomaly detection as a service, ready-made SaaS solutions are available.
Azure Metrics Advisor
Azure Metrics Advisor, one of the Microsoft services, lets you work with many sources of metrics and data types.
The main focus, judging by the documentation, is on e-commerce.
Multiple sources are supported (SQL Server, ElasticSearch, InfluxDB, MongoDB, MySQL, PostgreSQL and others), but there is no support for Prometheus as a source of metrics.
The service can integrate with many sources of metrics and data, from Prometheus to business intelligence systems.
Focused on both business users and SRE engineers.
Description of cases for e-commerce, gaming and other industries is available.
The service for data analysis and anomaly detection works mainly with metrics stored in InfluxDB.
It integrates easily with anything that can send metrics to InfluxDB; for example, you can send data for analysis directly from applications using third-party integration libraries.
Instead of a conclusion
- Using anomaly detection mechanisms allows you to identify problems in infrastructure and applications without setting static thresholds in alert triggers.
- The need to identify anomalies is due to the specifics of the microservice environment itself – there are many services, and even more metrics.
- If the company uses Prometheus, you can start with simple examples and manual data configuration for analysis.
- Most industrial APM solutions have integrated AIOps functionality, already collect a lot of necessary data for analysis and are able to identify anomalies in metrics out of the box.
Thank you for your attention.