A smarter approach to data

Monitoring can become expensive due to the huge volumes of data that need to be processed. In this article, you'll learn the best ways to store and process monitoring metrics to reduce costs and how VictoriaMetrics can help.

This article covers only open source solutions; VictoriaMetrics is one such open source project. You will get the most benefit from this article if you are familiar with Prometheus, Thanos, Mimir, or VictoriaMetrics.


In the previous part we saw how replacing Prometheus with VictoriaMetrics can improve monitoring efficiency. If that was the equivalent of buying a faster car to win races, in this part you will learn how to become a better driver by being smarter about your monitoring data.

Query tracing to find bottlenecks

PostgreSQL users should be familiar with EXPLAIN, a command used to understand how the database will execute a query. The information provided by EXPLAIN ANALYZE can help you figure out why a query runs slowly and what you can do to speed it up.

VictoriaMetrics has a very similar tool known as query tracing. Query tracing takes the guesswork out of speeding up VictoriaMetrics queries by showing you exactly where time is spent processing a request. If you want to experiment with query tracing, you can visit play.victoriametrics.com and try it yourself.

Let's take the following query as an example:

sum(rate(grpc_server_handled_total[5m]))

Running this query over the last 30 days of data takes about 4 seconds:

A query over 30 days completes in about 4 seconds. Can we find out why?

To find out, we can enable the Trace query switch in the UI and rerun the query. This shows the steps VictoriaMetrics took to process the request and the duration of each step:

Query over 30 days with tracing enabled

In the screenshot above, the blue bars represent the percentage of time spent on each step. Child steps are shown indented, and the absolute duration of each step is displayed below it. In addition to being displayed in the user interface, trace information is returned in JSON format, so you can also analyze traces programmatically.
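
If you prefer to work with raw traces, they can be requested directly from the query API by adding the trace=1 query argument. A minimal sketch, assuming a single-node instance reachable at localhost:8428:

# Instant query with tracing enabled; the JSON response includes a "trace" field
curl -s 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=sum(rate(grpc_server_handled_total[5m]))' \
  --data-urlencode 'trace=1'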

If we look deeper into the trace, it turns out that 91% of the time was spent in vmselect aggregating ~9,400 time series containing 13 million data samples:

Query for 30 days: processes 9.4 thousand time series, 13 million data samples

vmselect is the component of VictoriaMetrics that serves read queries, and in the playground environment only one CPU is allocated to it. It appears that this query is slow because it processes a huge amount of data on a single CPU. Therefore, to speed up the query, we can do one of two things:

  1. Allocate more resources to vmselect.

  2. Think smarter about your data.

In the next section we will look at the cardinality explorer, a tool that helps us understand how much data we store and where we can reduce it.

Cardinality explorer

Why does our query above return more than 9,000 time series and so many samples? To understand our data, we can use a tool called the cardinality explorer, accessible through the VictoriaMetrics user interface. For those following along on the playground, it is available under Explore > Explore cardinality.

Cardinality explorer view

Cardinality explorer shows information about metrics stored in VictoriaMetrics. The default view displays top metric names by number of time series, and as you can see from the screenshot above, grpc_server_handled_total is one of the largest metrics we store.

You may notice that the cardinality explorer reports only about 1,500 time series for this metric. This is because the cardinality explorer shows a one-day view, while the query we ran earlier covered 30 days. Over time, as applications are deployed and redeployed, old time series become inactive and new ones are created. This effect is known as churn rate, and it increases the number of time series stored over time.

Clicking on a metric name will take you to a drill-down view that shows the labels stored for the metric. Below is what it looks like for grpc_server_handled_total:

Cardinality explorer: grpc_server_handled_total metric details

The most “expensive” label for this metric is grpc_method, since it has 63 unique values. While 63 doesn't sound like a lot, the number of unique time series we need to store for the metric, i.e. its cardinality, is calculated by multiplying the number of unique values of each label. This means that grpc_method alone makes the number of time series our query has to retrieve 63 times larger.
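
A quick way to see how much a single label contributes is to count the series for the metric with and without it. A rough sketch in PromQL/MetricsQL, run against the same data:

count(grpc_server_handled_total)

count(count without (grpc_method) (grpc_server_handled_total))

The first query returns the total number of series with this metric name; the second collapses grpc_method before counting, so it shows how many series would remain if the label were not stored.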

The query we originally ran didn't need the precision that grpc_method provides. Since we don't need this particular label, we can get rid of it, and our query will run significantly faster. Controlling cardinality is a powerful tool when working with time series databases. For more details, see our article about the cardinality explorer.

Cardinality explorer allows you to determine:

  • Metric names with the greatest number of time series.

  • Labels with the greatest number of time series.

  • Values with the greatest number of time series for the selected label.

  • label=value pairs with the greatest number of time series.

  • Labels with the greatest number of unique values.

The cardinality explorer provides valuable insight into why a metric is expensive and hints on how to make it cheaper. It is available by default in VictoriaMetrics, and starting with VictoriaMetrics v1.94.0 you can even point the cardinality explorer at a Prometheus instance.
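
The same information is also available over HTTP: the cardinality explorer is backed by the /api/v1/status/tsdb endpoint, so it can be queried programmatically. A minimal sketch, assuming a single-node instance on localhost:8428 and the commonly used topN and date arguments:

# Top 10 metric names and labels by series count for the given day
curl -s 'http://localhost:8428/api/v1/status/tsdb?topN=10&date=2024-02-01'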

Streaming aggregation vs. recording rules

When working with Prometheus you can use recording rules to improve query speed. Recording rules pre-aggregate metrics, creating a new set of metrics but with a reduced number of time series. Queries on pre-aggregated metrics with fewer time series are faster than queries on raw metrics.

Recording rules concept: data is written to the database first and then aggregated by a ruler

Recording rules work with data that has already been written to the database and write an aggregated version of that data back to the database. This means that recording rules increase the total volume of data you need to store. Recording rules also need to be executed at regular intervals, which increases the overall load on the database.
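
As an illustration, a Prometheus-style recording rule that pre-aggregates the metric from the earlier example might look like this (a sketch; the group name, rule name, and evaluation interval are illustrative):

groups:
  - name: grpc_aggregation                  # illustrative group name
    interval: 2m                            # how often the rule is evaluated
    rules:
      - record: grpc_server_handled:without_grpc_method:rate5m
        expr: sum without (grpc_method) (rate(grpc_server_handled_total[5m]))

The aggregated series produced by such a rule are written back into the database alongside the raw ones, which is exactly the extra storage and load described above.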

VictoriaMetrics provides an alternative to recording rules in the form of streaming aggregation. Streaming aggregation is similar in concept to recording rules, except that the aggregation happens before the data reaches the database.

Streaming aggregation concept: data is aggregated before it enters the database

By aggregating data before writing it, you end up storing only the data you'll actually query later. Below is an example vmagent configuration for streaming aggregation:

- match: "grpc_server_handled_total"   # time series selector
  interval: "2m"                       # aggregate over a 2-minute interval
  outputs: [ "total" ]                 # aggregate as a counter total
  without: [ "grpc_method" ]           # group without the grpc_method label

# Result:
#   grpc_server_handled_total:2m_without_grpc_method_total

The above configuration is already applied on the playground, so you can query it if you're following along. Querying this metric instead of the non-aggregated one reduces execution time from 4 seconds to a split second:

Comparing a query over the original metric with a query over the metric produced by streaming aggregation
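
For reference, the query over the aggregated series has the same shape, just with the output metric name from the configuration above; since the "total" output is itself a counter, rate() still applies:

sum(rate(grpc_server_handled_total:2m_without_grpc_method_total[5m]))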

Streaming aggregation provides both cost savings and speed improvements for your queries:

  1. Aggregates incoming samples in streaming mode before writing data to remote storage.

  2. Works both with metrics ingested via any supported protocol and with metrics scraped from Prometheus-compatible targets.

  3. Serves as an alternative to statsd.

  4. Serves as an alternative to recording rules.

  5. Reduces the number of stored samples.

  6. Reduces the number of time series stored.

  7. Compatible with any tool that supports the Prometheus remote write protocol.

Streaming aggregation is available in vmagent, a metrics collection tool from the VictoriaMetrics ecosystem. Compatibility with Prometheus standards means that vmagent and streaming aggregation can be used with Prometheus or any other system that supports the Prometheus remote write protocol.

You can read more in the documentation on streaming aggregation.

Reducing significant figures

Operations with floating point numbers tend to produce results with high entropy, measurement errors or false precision.

For example, let's look at a common recording rule that calculates average CPU usage:

rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)

This recording rule produces the following results on the playground:

{"instance":"10.71.0.8:9101"}   37.12491991997818	
{"instance":"10.142.0.48:9100"} 37.12499331333188	

If you were asked which instance consumed more CPU in the results above, you would probably go digit by digit and stop at the first digit that differs. Similarly, if you were asked what the average consumption of instance 10.71.0.8:9101 is, you would probably say 37% rather than 37.12491991997818%. In either case, we don't need the full “length” of the numbers to give the answer. But storing samples with such values significantly hurts the compression ratio because of their high entropy.

VictoriaMetrics allows you to configure the number of significant figures you want to keep. Reducing the number of significant digits reduces the number of possible values and increases the likelihood that two values will be identical, which improves the compression ratio of the collected metrics.

According to the tests described in this article, going from 13 significant digits to 8 reduces the compressed size of each sample from 1.2 bytes to 0.8 bytes, saving a third of the network bandwidth/disk usage. If you took the first sample above and kept only 8 significant figures, it would change from 37.12491991997818 to 37.12492. For most applications, this loss of accuracy is barely noticeable.
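
With vmagent, for example, the number of significant figures can be limited on the write path via the -remoteWrite.significantFigures flag. A minimal sketch; the remote write URL is illustrative:

# Keep only 8 significant figures for samples sent to remote storage
/path/to/vmagent \
  -remoteWrite.url=http://victoriametrics:8428/api/v1/write \
  -remoteWrite.significantFigures=8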

Saving traffic costs

Network usage is also a monitoring expense. Traffic is typically free within a single availability zone of a cloud provider, but is charged between different availability zones.

Network traffic between availability zones

This means that if you want to build a highly available monitoring platform, you will need to send traffic between zones. Depending on your cloud provider's pricing, this traffic can become a significant part of your monthly bill.

vmagent, the VictoriaMetrics metrics collection agent, uses an improved version of the Prometheus remote write protocol with better compression:

Changes in network traffic after switching to VictoriaMetrics' own remote_write protocol

Above is a screenshot of network usage before and after a VictoriaMetrics user switched to the native VictoriaMetrics remote_write protocol. The user achieved 4.5 times lower network usage than with the standard Prometheus remote_write protocol, which directly reduced their cloud provider bill.

In addition to lower network usage out of the box, VictoriaMetrics can also be configured to reduce network usage further by trading CPU time, latency, or accuracy for less traffic:

  • remoteWrite.vmProtoCompressLevel: higher compression levels in exchange for higher CPU usage.

  • remoteWrite.maxBlockSize, remoteWrite.maxRowsPerBlock, remoteWrite.flushInterval: larger blocks and a better compression ratio in exchange for higher latency.

  • remoteWrite.significantFigures, remoteWrite.roundDigits: reduced precision/entropy in exchange for a better compression ratio.
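
For example, a vmagent instance shipping metrics to another zone could combine several of these flags. A minimal sketch; the remote write URL and the flag values are illustrative:

# Trade some CPU and latency for less cross-zone traffic
/path/to/vmagent \
  -remoteWrite.url=https://vm-in-another-zone:8428/api/v1/write \
  -remoteWrite.vmProtoCompressLevel=3 \
  -remoteWrite.flushInterval=5s \
  -remoteWrite.significantFigures=8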

We wrote about reducing traffic costs with VictoriaMetrics in more detail in a dedicated article.

Conclusion

Monitoring can be expensive and place a heavy workload on engineers. VictoriaMetrics reduces monitoring costs out of the box, and small configuration changes can reduce them even further. Moreover, VictoriaMetrics is a completely open source project supported by the team that wrote it.
