The thorny path to a unified metrics storage

Prerequisites for creating a unified metrics repository

The MTS ecosystem includes many subsidiaries, each with its own IT department and its own decentralized monitoring based on Zabbix or Prometheus. This setup created a problem: to find out the status of a product, you first had to figure out which monitoring system was responsible for it. It was inconvenient. And with the growing load and complexity of the ecosystem, we wanted a single centralized, inexpensive, reliable, and easily scalable solution for storing metrics for the entire ecosystem, one that would simplify the work of product teams.

Basic requirements:

  • Monitor the health of the ecosystem and key business scenarios from a single point.

  • Scale to the growing load of 600+ products.

  • Everything “out of the box” to reduce the team’s labor costs for setting up monitoring.

Finding solutions and choosing a metrics storage system

At the start, we abandoned Zabbix, despite the fact that it was already part of the ecosystem. We initially tried to use it but ran into serious limitations:

  • High resource requirements. At a load of 30,000 nvps (new values per second), Zabbix hit resource limits: a virtual server with 32 CPU cores and 248 GB of RAM could not cope, and the system started having problems.

  • Low data compression rate. Zabbix did not compress data well, so it began to take up too much space. We realized it could not keep up with our growth.

  • Kubernetes compatibility. Kubernetes was being actively adopted across the ecosystem, and it is geared toward a different monitoring stack based on Prometheus.

In search of alternatives, we analyzed several options, such as Cortex, Mimir, and Thanos. In the end we settled on VictoriaMetrics for several reasons, the main one being resource efficiency.

This is clearly visible in the comparison table: for 10 million unique time series, VictoriaMetrics uses five times less memory than its closest competitors.

Setting up and scaling VictoriaMetrics

Our starting architecture looked like this. Multiple products sent metrics to a single repository through a load balancer. The traffic landed in a pool of vminsert components, the services responsible for writing data into VictoriaMetrics. In turn, vminsert wrote everything to the vmstorage storage service.

The setup had two readers:

  • Grafana — a system for visualizing metrics on dashboards.

  • vmalert – a component that evaluates alerting rules and warns when a metric crosses its acceptable threshold.

The entire read load went through the balancer and landed on the vmselect components, which handle read requests and fetch data from vmstorage.

This architecture turned out to be flexible. Components could be scaled independently to improve read or write performance. But as the load grew, difficulties began.
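To make the write and read paths above more concrete, here is a minimal end-to-end sketch. It assumes the standard VictoriaMetrics cluster layout, where vminsert accepts data on port 8480 under /insert/<tenant>/... and vmselect answers queries on port 8481 under /select/<tenant>/...; the balancer hostnames and the tenant ID "0" are placeholders.

```go
// Minimal end-to-end sketch (not production code): push one sample through the
// write path (vminsert) and read it back through the read path (vmselect).
// Hostnames, ports and the tenant ID are placeholders for illustration.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

const (
	insertURL = "http://vminsert-balancer:8480/insert/0/prometheus/api/v1/import/prometheus"
	selectURL = "http://vmselect-balancer:8481/select/0/prometheus/api/v1/query"
)

func main() {
	// Write: vminsert accepts plain Prometheus exposition format on the import endpoint.
	sample := `demo_requests_total{product="demo",dc="msk"} 42` + "\n"
	resp, err := http.Post(insertURL, "text/plain", strings.NewReader(sample))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Read: vmselect serves PromQL/MetricsQL queries, the same API Grafana and vmalert use.
	q := url.Values{"query": {`demo_requests_total{product="demo"}`}}
	resp, err = http.Get(selectURL + "?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON response in the Prometheus query API format
}
```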

Problem: vminsert crash

As the number of monitored hosts grew, so did the number of agents sending metrics, and with them the number of connections to the vminsert components. It turned out that vminsert handles this poorly: it starts to degrade and processes incoming traffic more slowly. The vminsert components therefore needed protection.

Solution: vmagent for buffering

To protect vminsert from overload, we used another VictoriaMetrics component, vmagent. It is a lightweight agent that can receive metrics over different protocols and scrape targets before forwarding the data to vminsert: a kind of "lightweight Prometheus", an intermediate link.

As a result we got:

  1. The number of connections dropped dramatically. vmagent has an internal buffer where it accumulates metrics into batches, which reduces the number of connections (see the sketch after this list).

  2. We gained protection against short-term failures in the pipeline. Metrics can sit in the vmagent buffer for a while, wait until vminsert and vmstorage stabilize, and then be flushed.

  3. vmagent can compress metrics, so we enabled the zstd compression algorithm and reduced the volume of transmitted traffic by a factor of four.
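To illustrate point 1, here is a deliberately simplified sketch of the buffering idea only, not vmagent's internals: an agent accumulates samples in memory and flushes them in one request per interval, so the receiving side sees far fewer connections. The endpoint is a placeholder.

```go
// Conceptual sketch of agent-side buffering (not vmagent's actual code): samples
// are accumulated in memory and flushed in a single request per interval, so the
// receiver sees one batched connection instead of thousands of small ones.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

type buffer struct {
	mu    sync.Mutex
	lines [][]byte
}

func (b *buffer) add(line string) {
	b.mu.Lock()
	b.lines = append(b.lines, []byte(line))
	b.mu.Unlock()
}

// flush sends the accumulated batch in one HTTP request.
func (b *buffer) flush(target string) {
	b.mu.Lock()
	if len(b.lines) == 0 {
		b.mu.Unlock()
		return
	}
	body := bytes.Join(b.lines, []byte("\n"))
	b.lines = nil
	b.mu.Unlock()

	resp, err := http.Post(target, "text/plain", bytes.NewReader(body))
	if err != nil {
		// A real agent (vmagent) keeps the batch and retries, which is what
		// lets it ride out short outages of vminsert and vmstorage.
		return
	}
	resp.Body.Close()
}

func main() {
	buf := &buffer{}
	go func() { // simulated metric producer
		for i := 0; ; i++ {
			buf.add(fmt.Sprintf(`demo_metric{n="%d"} 1`, i))
			time.Sleep(10 * time.Millisecond)
		}
	}()
	for range time.Tick(5 * time.Second) {
		// Placeholder endpoint; one connection per 5-second batch instead of one per sample.
		buf.flush("http://vminsert-balancer:8480/insert/0/prometheus/api/v1/import/prometheus")
	}
}
```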

The result was a solid working solution. But the vminsert problem was not the only one: difficulties also arose with reading data.

Problem: Alerts and dashboards compete for resources

The read load from the two readers, Grafana and vmalert, made them compete with each other. In a centralized system, every user can build complex dashboards and configure arbitrary alerts, which means their queries load VictoriaMetrics in parallel. As a result, the dashboards themselves loaded more slowly. We therefore wanted to separate this load.

Solution: writing to two clusters via vminsert

To reduce the influence of one process on the other, we decided to run two separate VictoriaMetrics clusters: one for Grafana and dashboards, the other for vmalert and alerting.

To do this, we needed to mirror the traffic so that both VictoriaMetrics clusters held the same copy of the data. At the time, only the vminsert component had this capability, so we used it to write to both clusters simultaneously. The scheme worked, but along the way it became clear that vminsert does not always cope well.

Problem: dual writes via vminsert break down

When one of the VictoriaMetrics clusters had problems or was unavailable, vminsert stopped writing data to both clusters at once. This applied not only to emergencies: even during retention runs or standard VictoriaMetrics background activity, performance dropped slightly. At that point we saw vminsert degrading, and we were not happy with it.

Solution: traffic mirroring with vmagent

Starting with version v1.69, vmagent supports traffic mirroring. This made it possible to move this function from vminsert to vmagent.

Once vmagent took over the mirroring, we got the behavior we wanted. If one cluster had problems, data kept being written to the second, while metrics for the problematic cluster were temporarily held in the vmagent buffer until it stabilized. But this solution was not ideal either.

In a configuration where one cluster was dedicated to Grafana and the other to vmalert, the failure of either one left us without dashboards or without alerting, which is unacceptable for a mission-critical system. We needed to make sure that if one cluster failed, both readers would keep working. To do this, we created a top-level vmselect layer between the readers and the clusters.

Solution: two vmselect pools

We allocated two separate vmselect pools, one for vmalert and the other for Grafana. Each pool worked with both VictoriaMetrics clusters.

Thus, we separated writing from reading and distributed the load between the pools, and if one cluster failed, requests from Grafana and vmalert could be routed to the working one. But we had not yet eliminated data loss.

Problem: Data Loss

vmagent has an internal buffer of limited size that lets it temporarily accumulate metrics. If problems in a cluster persist, for example a vmstorage outage lasts too long, vmagent can pile up connections and crash, losing data.

That would mean we could not ride out long VictoriaMetrics downtimes or carry out maintenance, because the agent buffer simply would not be enough. So we decided to add an external buffer: a separate queue where our metrics could be stored for longer.

Solution: use an external queue

To implement the queue for metrics, we used Kafka and wrote our own components, VMBuffer-Producer and VMBuffer-Consumer: services that deliver data to Kafka and read it back.

Those who have worked with VictoriaMetrics may ask: "Why reinvent the wheel? VictoriaMetrics can already work with Kafka." It was a forced decision. We tested the consumer and producer implementation that ships with VictoriaMetrics out of the box, and it did not suit us. The problems were the same as with vminsert: when one of the clusters degraded, the consumer started writing data more slowly (or stopped altogether) and did not return to a working state even after recovery. So we had to build our own implementation.

In the new scheme, vmagent sends data not directly to vminsert but to VMBuffer, which fully replicates the vminsert interface.

Each cluster has a consumer that reads data from Kafka. Each Kafka record is a batch of metrics roughly 100 MB in size (10 to 100 thousand samples). In addition, we create separate Kafka topics for important metric producers, such as Kubernetes clusters, so that their metrics do not mix with product metrics and are processed with priority.
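The article does not show the VMBuffer code, so the following is only a hedged sketch of what a consumer of this kind might look like, using the github.com/segmentio/kafka-go client: it reads metric batches from Kafka and forwards each batch to vminsert. Broker addresses, the topic name, and the vminsert URL are placeholders, and the real VMBuffer-Consumer certainly handles retries, priorities, and backpressure far more carefully.

```go
// Hedged sketch of a VMBuffer-Consumer-like service (not the actual MTS implementation):
// read batches of metrics from Kafka and forward each batch to vminsert.
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"

	"github.com/segmentio/kafka-go"
)

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092", "kafka-2:9092"}, // placeholder brokers
		Topic:   "metrics-products",                       // one of the per-producer topics
		GroupID: "vmbuffer-consumer",
	})
	defer reader.Close()

	const insertURL = "http://vminsert-balancer:8480/insert/0/prometheus/api/v1/import/prometheus"

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatalf("kafka read: %v", err)
		}
		// msg.Value is assumed to hold one batch of samples in Prometheus text format.
		resp, err := http.Post(insertURL, "text/plain", bytes.NewReader(msg.Value))
		if err != nil {
			// A real consumer would pause and retry instead of skipping the batch,
			// so that a degraded cluster does not lose data.
			log.Printf("vminsert write failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
}
```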

Final architecture

Our final architecture includes one more important addition. The MTS ecosystem is geo-distributed, with data processing centers located across the country, so sending metrics from Vladivostok to Moscow would be inefficient. We wanted metrics to reach the platform as quickly as possible without being shuttled between data centers, which means data should be sent to the nearest platform deployment.

We implemented this with GSLB (Global Server Load Balancing), which balances traffic based on geography:

  • We placed a vmagent in each data center, and now each product, via GSLB, writes its metrics to the nearest available vmagent.

  • When vmagent writes metrics to the VictoriaMetrics clusters, it also uses GSLB to pick the geographically closest one.

As a result, each VictoriaMetrics cluster directly receives only part of the traffic. To make sure each one still ends up with 100% of the data, the VMBuffer components in every cluster consume from all the Kafka queues. In other words, the VMBuffer in one VictoriaMetrics cluster receives the data written toward all of them.

Now that we know the final architecture, let's discuss how to support this solution and maintain a high level of availability. That required monitoring the monitoring itself.

How to monitor monitoring?

Classically, monitoring should be both whitebox and blackbox: we need to receive system performance metrics from the inside while also seeing everything through the user's eyes.

Whitebox metrics (from the inside):

  • Kafka – lag per topic. This is a key metric: if the lag grows, we are not keeping up with processing the metrics, which means a performance problem.

  • Nginx – percentage of failed read requests. This metric tracks whether requests are being served correctly.

  • vmselect – 95th percentile of response time. Shows how quickly clients get answers to their metric queries. It is easy to monitor and configure because all VictoriaMetrics components expose metrics in Prometheus format.

  • vmagent – incoming traffic volume. Another key metric. In a geo-distributed system it is important to watch for problems in every network segment. There is always a risk of losing traffic somewhere, so the traffic volume clearly highlights possible problems: as soon as it drops below the acceptable corridor, one of the segments has dropped out and we need to figure out why.

Blackbox metrics help monitor the system from the outside, the way users see it, so we probe the availability of all public interfaces (a minimal probe sketch follows the list):

  • VictoriaMetrics API.

  • Grafana.
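A minimal probe sketch, assuming the standard health endpoints (/health on VictoriaMetrics components, /api/health on Grafana); the hostnames are placeholders, and in practice the results would be exported as metrics rather than just logged.

```go
// Minimal blackbox probe sketch: check the public interfaces the way a user would.
// Hostnames are placeholders; /health (VictoriaMetrics components) and
// /api/health (Grafana) are the standard health endpoints.
package main

import (
	"log"
	"net/http"
	"time"
)

func probe(name, url string) {
	client := &http.Client{Timeout: 5 * time.Second}
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		log.Printf("%s: DOWN (%v)", name, err)
		return
	}
	defer resp.Body.Close()
	log.Printf("%s: status=%d latency=%s", name, resp.StatusCode, time.Since(start))
}

func main() {
	for range time.Tick(30 * time.Second) {
		probe("victoriametrics-api", "http://vmselect-balancer:8481/health")
		probe("grafana", "http://grafana.example.internal/api/health")
	}
}
```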

It remained to decide where to store these metrics, that is, who monitors the monitoring and watches the watchmen. We could of course store them in VictoriaMetrics itself, but if something happened to it, we would lose the data. So we decided to deploy a separate vmsingle next to each VictoriaMetrics cluster.

The vmsingle component is essentially VictoriaMetrics packaged as a single instance. It is not suited to large volumes of data, but it works well for storing the monitoring data of the VictoriaMetrics cluster itself. This keeps the metrics safe even if the main cluster has problems, so they are not lost when it fails.

Results of the monitoring system in numbers

We built a solid setup and learned how to scale it and store the data:

  • Performance allows processing up to 12 million samples per second.

  • The system covers more than 300 ecosystem products and more than 2,000 active infrastructures.

How to collect metrics

We faced several challenges at once:

  1. Ecosystem scale: more than 90,000 hosts across different network segments.

  2. A diverse stack: DBMS, web servers and a fairly large zoo of software.

  3. The need for a metrics structure: we wanted to bring metrics into a clear scheme where it would be easy to find everything needed to build dashboards, set up alerts, and configure monitoring.

  4. Minimizing costs for product teams: remember, we are building a platform, so collecting metrics should be as cheap and simple as possible for products.

There are two ways to achieve this.

Method 1: Prometheus exporters

The classic approach that many have probably encountered is Prometheus exporters. It is time-proven: next to each piece of software you place an exporter downloaded from GitHub, configure it, and install an additional collector that scrapes its metrics.

The approach works, but it is labor-intensive and has drawbacks:

  • Each piece of software has its own exporter, which means separate setup and support.

  • Inconsistent metric formats. Popular software may have several exporters, and when teams use different ones, metrics with the same meaning end up with different names.

  • No central updates or configuration. No one can guarantee that an exporter will be updated after the software itself is updated, or that the team keeps an eye on it and will restart it if it crashes.

We took a different path, based on the Telegraf agent and a pinch of magic.

Method 2: Telegraf agent

Telegraf is a lightweight agent written in Go. It can not only collect metrics but also send them to any number of destinations over different protocols and build a full processing pipeline.

The Telegraf concept is built around plugins. Input plugins can work with different types of software and are built right into the Telegraf image, so unlike exporters there is nothing extra to install.

We went a little further and built Telegraf itself into the golden OS image at MTS. This means that when any virtual machine is deployed at MTS, Telegraf is already installed and ready to work. But that's not even the magic: we also learned to manage it centrally.

Telegraf Management

To manage the Telegraf agent on each host, we developed Telegraf Helper, a service built into the golden image. It is deployed alongside Telegraf so engineers do not waste time installing and configuring each agent by hand. Telegraf Helper:

  • Monitors the Telegraf agent, collects statistics, and restarts it in case of errors.

  • On first launch and then on a schedule, updates the agent configuration from a centralized rules store via its API.

  • Combines metric collection rules for one host coming from different teams into a single configuration.

Simplified, a rule looks like this: "for all hosts matching the pg* mask, enable the Postgres metrics collection plugin." It applies to a whole group of hosts at once, with no per-host configuration. Moreover, several rules can apply to one host. For us it is a perfectly normal situation when, for some Unix host, a specialized Unix team creates rules for collecting the operating system metrics it needs, while the DBA team creates rules for collecting Postgres metrics from the same host. Both rules apply to that host; Telegraf Helper merges them into a valid Telegraf agent configuration and loads it into the agent so it starts collecting the required metrics (a rough sketch of this merging follows).
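The real Telegraf Helper implementation is not shown here, but the mask-matching and merging idea can be sketched roughly as follows. The rule structure, host masks, and plugin snippets are made up for illustration; in reality the rules would come from the centralized store via its API.

```go
// Rough sketch of the rule-merging idea behind a Telegraf Helper-like service
// (not the real MTS implementation): pick every rule whose host mask matches
// this host and concatenate the TOML fragments into one Telegraf config.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Rule is a hypothetical representation of a centrally stored collection rule.
type Rule struct {
	HostMask string // e.g. "pg*" for all Postgres hosts
	Fragment string // a Telegraf input plugin snippet in TOML
}

func buildConfig(hostname string, rules []Rule) string {
	var parts []string
	for _, r := range rules {
		if ok, _ := filepath.Match(r.HostMask, hostname); ok {
			parts = append(parts, r.Fragment)
		}
	}
	// Telegraf accepts multiple [[inputs.*]] blocks, so concatenation is enough here.
	return strings.Join(parts, "\n\n")
}

func main() {
	rules := []Rule{
		{HostMask: "pg*", Fragment: "[[inputs.postgresql]]\n  address = \"host=localhost user=telegraf sslmode=disable\""},
		{HostMask: "*", Fragment: "[[inputs.cpu]]\n  percpu = true"},
	}
	hostname, _ := os.Hostname()
	cfg := buildConfig(hostname, rules)
	// In reality the config would be written under /etc/telegraf/telegraf.d/ and the agent reloaded.
	fmt.Println(cfg)
}
```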

This approach radically reduces engineering costs. Nothing needs to be installed, and monitoring can be set up for an entire Postgres cluster at once. Even if a new node or server appears, it automatically falls under this monitoring.

Centralized collection of metrics has several other pleasant consequences:

  1. Metrics out of the box

This is what we wanted to achieve. As an enterprise company, we have an internal cloud that provides managed resources: DBaaS (database as a service), Managed Kubernetes, Kafka as a Service, and others. All of these tools are connected to the platform by default, and their metrics are collected automatically. So products that build their solutions on these services get not just a database as a service, but also monitoring of that database as a service:

  • All metrics are automatically collected and visualized in public dashboards.

  • Engineers immediately see data across their databases or clusters.

  • Standard alerting rules are also created by default, so the team does not waste time setting them up.

Centralized metric collection not only simplifies monitoring but also lets us standardize the calculation of quality indicators, or Service Level Indicators (SLIs).

  2. Typical SLI Formulas

To track ecosystem health, we ask all products to describe their health with a set of SLIs. These are formulas written in PromQL (MetricsQL) that evaluate the success of a product's business operations as a percentage, from 0% when everything is bad to 100% when everything is great. Because metrics are collected in a common repository in a unified format, we can make these formulas standard.

As an example, here is a rule that defines the golden "errors" signal: the percentage of failed requests.

For Kubernetes products, we can calculate the request success rate from Nginx metrics with a single formula: take all successful requests to the ingress controller and divide them by the total number of incoming requests, which gives the percentage of successful operations. This formula is the same for every product that uses Kubernetes; all that remains is to substitute the product name and get a standard indicator (an illustrative sketch follows).
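As an illustration only (the article does not give the exact formula), such a success-rate SLI could be expressed through the standard ingress-nginx metric nginx_ingress_controller_requests and evaluated via the Prometheus-compatible query API of vmselect; the namespace value, label names, and endpoint here are assumptions.

```go
// Illustrative SLI sketch (not the exact MTS formula): share of non-5xx requests
// on the ingress controller for one product, expressed in PromQL/MetricsQL and
// evaluated via the Prometheus-compatible query API.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	product := "my-product" // placeholder: each product substitutes its own namespace/ingress
	// Depending on the scrape configuration the label may be "exported_namespace" instead.
	query := fmt.Sprintf(
		`100 * sum(rate(nginx_ingress_controller_requests{namespace=%q, status!~"5.."}[5m]))`+
			` / sum(rate(nginx_ingress_controller_requests{namespace=%q}[5m]))`,
		product, product,
	)

	// vmselect exposes the same /api/v1/query endpoint that Grafana and vmalert use.
	endpoint := "http://vmselect-balancer:8481/select/0/prometheus/api/v1/query"
	resp, err := http.Get(endpoint + "?" + url.Values{"query": {query}}.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // SLI value in percent, 0-100
}
```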

So far everything sounds a little too positive, as if we have solved every problem. Time to add a fly in the ointment.

How to break a shared storage: bad advice

Since our storage is a shared space, anyone can connect and do something wrong. In the simplest case, mistakes degrade that product's own experience, but they can also affect the health of the whole storage and other products.

One of the problems affecting the customer experience is metrics arriving with a huge delivery lag.

Metrics with a huge lag

If metrics take 10 minutes to reach the storage, there is no point talking about operational monitoring: in that time users will already have called the contact center, and you will learn about the problem without any monitoring. So it is important to keep the lag down.

The metrics lag consists of two parts:

  • On the platform side: the time spent in the Kafka queue. This is our area of responsibility, and we monitor it.

  • On the product side: scrape time and waiting time in the vmagent buffer. This is part of the product infrastructure, and monitoring it is more complicated.

We needed a way to monitor the total delivery time of a metric. There was no ready-made solution, so we built our own: a separate component called Stream Metrics. It reads the entire stream of metrics from Kafka in parallel, takes the event time of each sample, compares it with the time of delivery to Kafka, and gets the delta. That delta is the metric lag, which tells us whether a product delivers its metrics on time (a simplified sketch follows).
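A hedged sketch of the idea behind such a component: read each batch from Kafka, compare the samples' event timestamps with the time the batch landed in Kafka, and report the delta. The batch format here is assumed to be Prometheus text format with millisecond timestamps; the real Stream Metrics parses its actual batch encoding.

```go
// Hedged sketch of a Stream-Metrics-like lag calculator (not the real implementation):
// compare the event timestamp of each sample with the time the batch landed in Kafka.
package main

import (
	"bufio"
	"bytes"
	"context"
	"log"
	"strconv"
	"strings"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092"}, // placeholder
		Topic:   "metrics-products",       // placeholder
		GroupID: "stream-metrics",
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		var maxLag time.Duration
		// Assume Prometheus text format: `metric{labels} value timestamp_ms` per line.
		sc := bufio.NewScanner(bytes.NewReader(msg.Value))
		for sc.Scan() {
			fields := strings.Fields(sc.Text())
			if len(fields) < 3 {
				continue // no explicit timestamp on this line, skip it
			}
			tsMs, err := strconv.ParseInt(fields[len(fields)-1], 10, 64)
			if err != nil {
				continue
			}
			// msg.Time is when the record was written to Kafka; the difference is the
			// product-side part of the delivery lag (scrape + vmagent buffering).
			if lag := msg.Time.Sub(time.UnixMilli(tsMs)); lag > maxLag {
				maxLag = lag
			}
		}
		log.Printf("topic=%s partition=%d max sample lag: %s", msg.Topic, msg.Partition, maxLag)
	}
}
```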

Next, we tracked the lag and looked at which products exceeded the acceptable values, then dug into the reasons: too few resources allocated to vmagent on the product side, compression not enabled, some settings missing, or something else. There are many possibilities, so you have to work with the team and help them set up metric delivery correctly.

High cardinality in VictoriaMetrics

Cardinality is the number of unique time series. For example, in VictoriaMetrics the http_request_total metric shows the number of HTTP requests. It has two labels: instance and path.

The cardinality here equals two, because the instance label has one unique value ("1") and path has two (read and write).

If we add a user_id label with a unique user identifier to this metric, and we have a million users, the total cardinality becomes two million.

This is important! For each unique combination of labels, VictoriaMetrics creates a new time series and spends additional resources on it. When there are many such unique time series, the load on the system increases sharply.
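The arithmetic is easy to verify in a few lines: cardinality is simply the number of distinct label combinations, as this small illustration with made-up label values shows. It also hints at why high cardinality is expensive, since every combination has to be tracked.

```go
// Tiny illustration of cardinality: the number of unique label combinations.
// Two path values times one million user_id values gives two million time series.
package main

import "fmt"

func main() {
	series := map[string]struct{}{}
	for _, path := range []string{"read", "write"} {
		for userID := 0; userID < 1_000_000; userID++ {
			key := fmt.Sprintf(`http_request_total{instance="1",path=%q,user_id="%d"}`, path, userID)
			series[key] = struct{}{}
		}
	}
	// Note how much memory even this toy map needs: a TSDB pays a similar price per series.
	fmt.Println("unique time series:", len(series)) // 2000000
}
```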

There is a separate metric, churn rate, which shows the number of new time series per unit of time. The chart highlighted the incident when the churn rate jumped: someone had created a lot of new time series, and VictoriaMetrics suffered. It also shows the moment when we worked with the teams, reduced the cardinality, and made life easier for the VictoriaMetrics clusters.

But first we need to understand whose cardinality we are reducing and find the source of the problem. Cardinality Explorer helps with this: a VMUI interface that shows statistics on all metrics in VictoriaMetrics and lets you find the ones with the highest cardinality.

VMUI Cardinality Explorer

In this case, we see a metric whose cardinality is growing significantly. We can drill down into the details and find out which labels drive its growth.

Here the cause was the peer address label, since it has the largest number of unique values. Removing this label greatly reduces the cardinality.

Reducing the cardinality

When we find such an issue, we start working with the team to optimize the label. The relabeling section in vmagent helps here: it allows any operation on labels, including dropping them.

A real example from working with a DBA team: they had a Postgres metric where one of the labels contained the text of the SQL query. Since SQL query text is highly variable, the cardinality grew significantly. The label was not critical for the team, so we dropped it and reduced the cardinality (see the sketch below).
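Declaratively this is done through vmagent's relabeling configuration, but the effect can be sketched in a few lines: drop the offending label from each sample's label set so that all SQL-text variants collapse into a single series. The metric and label names below are made up.

```go
// Conceptual sketch of what dropping a high-cardinality label achieves
// (in practice this is configured declaratively via vmagent relabeling rules,
// not written by hand): all values of the dropped label collapse into one series.
package main

import "fmt"

type labels map[string]string

func dropLabel(l labels, name string) labels {
	out := labels{}
	for k, v := range l {
		if k != name {
			out[k] = v
		}
	}
	return out
}

func main() {
	samples := []labels{
		{"__name__": "pg_stat_activity", "db": "billing", "query": "SELECT * FROM a"},
		{"__name__": "pg_stat_activity", "db": "billing", "query": "SELECT * FROM b"},
	}
	unique := map[string]struct{}{}
	for _, s := range samples {
		unique[fmt.Sprint(dropLabel(s, "query"))] = struct{}{}
	}
	// Without the drop there would be one series per SQL text; with it, just one.
	fmt.Println("series after dropping the query label:", len(unique))
}
```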

Conclusions

Now I will summarize and list recommendations based on our experience.

  1. Work with teams to reduce cardinality and lag

When creating a shared data repository, in our case for metrics, it is important to anticipate possible user errors and learn to monitor and automatically catch them. They can seriously degrade both the product's own experience and the overall health of the system, and this is where teams often need outside help.

  2. Create standards and best practices for collecting metrics

The second challenge of a centralized repository is establishing uniform rules across the entire landscape. Ideally they are embodied in a tool that simply does not allow mistakes, like Telegraf with centralized management, where the standard is built into the tool itself. If that is not possible, the standards need to be captured in regulatory documentation, demos held, and teams trained on how to connect correctly.

  3. Consider using VictoriaMetrics Enterprise

We use the Community version due to licensing restrictions. But the Enterprise version has many useful features such as downsampling, advanced statistics on tenants and vmanomaly (a machine learning component). If you can handle the licensing restrictions, the Enterprise version will save a lot of man-hours and stress.

  4. Distribute vmagents across different data centers and balance traffic using GSLB

If you are making a geo-distributed system, then it’s great to immediately include the ability to distribute traffic recording across several vmagents scattered across data centers. This will help avoid major changes in the future if you have to reconfigure the product infrastructure.
