How to simplify the calculation of metrics from logs
My name is Dima Sinyavsky, I am an SRE engineer at Vi.Tech, an IT subsidiary of VseInstruments.ru. In this article I will tell you about our experience with vector.dev: how to turn logs into metrics the usual way, and how to automate this to the point where all you need is a yaml-developer.
Our website creates more than 100 thousand orders every day, and to keep all this working we run many services that write a lot of logs; we often need to calculate metrics from those logs.
If you already have commercial tooling for calculating metrics from logs, then your needs are probably covered. However, for those who build their systems from open source software, this may be useful.
Background
Vector.dev had been in use for a year before I joined the company. When I arrived, logs were already being processed through it, and metrics were somehow already being calculated from them. Colleagues had already taken a first pass at making the log-to-metrics conversion easier: some of the configurations were templated, some parameters were moved into yaml files, and generation and deployment were done through Ansible. My arrival began with a probationary period, and therefore with trials.
I'm undergoing initiation
In our company Vi.Tech, it so happens that the trial is an immersion in vector.dev. All SRE engineers should know this tool, because it is our main service for processing logs in the company.
My first task was to ensure the collection of log metrics to calculate SLI/SLO. So, in front of me are the access logs of the nginx http server in the form of JSON. JSON is already good, since these are structured logs.
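To make the starting point concrete, here is what such a structured access log entry might look like (the field names are illustrative, not our exact nginx log format):

```json
{
  "time": "2024-05-12T10:00:00+03:00",
  "method": "GET",
  "path": "/cart",
  "status": 200,
  "request_time": 0.012,
  "upstream": "site-app"
}
```

Because the log is already JSON, vector can parse it into named fields without fragile regexp parsing of a free-form access log line.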
Only a little remained:
Learn the Vector Remap Language
Understand the existing code for converting logs to metrics
Add log handlers for the new metrics
I dealt with the first one quickly enough, because vector.dev has good documentation and there is even an online VRL playground where you can practice with the VRL language without installing vector locally.
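As a taste of the language, here is a toy VRL snippet you can paste into the playground (the field names are illustrative, not our production remap code):

```vrl
# Parse a JSON access log stored in the .message field
. = parse_json!(string!(.message))

# Coerce fields that metrics will need into proper types
.status = to_int!(.status)
.duration_ms = to_float!(.request_time) * 1000
```

The `!` variants of VRL functions abort the event on failure, which is convenient while experimenting.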
After reading our guide on adding metrics, I tried adding metrics from logs on one service.
How we added metrics counting from logs before
To add a metric count for one service you need:
Copy a piece of our code, paste it, and correct it. An example of a piece to copy is a new transform, mysvc-accesslog-metrics-prepare.
This is a path filter: it selects suitable paths from the URL so that only those logs are counted into the metric. Note also that the `inputs` field is service-specific, so it has to be changed every time.
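A simplified sketch of such a copied transform (service and field names here are hypothetical, not our actual config):

```yaml
transforms:
  mysvc-accesslog-metrics-prepare:
    type: filter
    # inputs differs per service and must be edited on every copy-paste
    inputs: ["mysvc-accesslog-parsed"]
    condition:
      type: vrl
      # keep only the URL paths we want counted into the metric
      source: 'match(string!(.path), r''^/(cart|checkout)'')'
```

Every new service meant copying this block and fixing the name, the inputs, and the path regexp by hand.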
Find an existing file for the service and add to it, or create a new one. If a service is being added for the first time, then besides creating a file for working with metrics you also need some preliminary work to set up event filters for it, but we will not show that here – it is a whole screen of code…
Also add a line to the shared file. All metrics are published through the Prometheus exporter at the /metrics endpoint. The access_log_to_metrics transform serves as the source of all metrics for the exporter, so the new metrics source, mysvc-accesslog-metrics-prepare, must be connected to it. If you forget to add it, there will be no new metrics.
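The shared wiring could be sketched like this (a simplified assumption of the layout, with the transform names from the article but hypothetical details):

```yaml
transforms:
  access_log_to_metrics:
    type: log_to_metric
    inputs:
      - "othersvc-accesslog-metrics-prepare"
      - "mysvc-accesslog-metrics-prepare"   # the line that is easy to forget
    metrics:
      - type: counter
        field: path
        name: vi_http_requests_total
        tags:
          path: "{{ path }}"
          method: "{{ method }}"
          status: "{{ status }}"

sinks:
  prometheus:
    type: prometheus_exporter
    inputs: ["access_log_to_metrics"]
    address: "0.0.0.0:9598"
```

The `{{ field }}` placeholders are vector's template syntax for copying log fields into metric tags.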
And also add tests and debug it all… and so on for each service.
As a result, writing and debugging the code could take a whole day.
I had to do this for a dozen services.
It's hard. It's tedious. What to do? Rework it!
Enough copy-pasta! Refactoring: I reworked the calculation of metrics based on HTTP server request logs.
What I changed
I identified the labels needed in the HTTP request metrics. Studying the logs showed that the data set is almost always the same in structure, and for metrics we only need a few fields from it:
I created a new yml file for describing metrics (we had a similar one before, but it was much less flexible to customize). Part of the file is shown below.
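A hypothetical fragment in the spirit of that catalog file (the real field names in our metrics-catalog.yml may differ):

```yaml
# One entry per service: which logs to count into which metrics
- service_name: site
  container_name: site-app
  metrics:
    - metric_name: vi_http_requests_total
      metric_type: counter
      events:
        - path: /cart
          method: GET
        - path: /checkout
          method: POST
```

The whole point is that this file is declarative: no VRL, no transforms, just a description of what to count.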
I created a JSON schema to simplify writing the yaml and to validate it. If you use Visual Studio Code, you get drop-down hints and on-the-fly validation of the yml against the schema. Part of it is presented below.
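A minimal sketch of such a schema, assuming the catalog structure shown above (our real schema is larger):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "service_name": { "type": "string" },
      "metrics": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "metric_name": { "type": "string" },
            "metric_type": { "enum": ["counter", "histogram"] }
          },
          "required": ["metric_name", "metric_type"]
        }
      }
    },
    "required": ["service_name", "metrics"]
  }
}
```

Pointing VS Code's yaml extension at a schema like this is what gives the autocompletion and inline validation.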
The current version of the schema only supports the counter (metric_type: counter) and histogram (metric_type: histogram) metric types, simply because we have not needed other types yet.
I created new handlers (transforms) for vector.dev using jinja templates. Part of the template code is shown below.
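A hypothetical sketch of such a template: one filter transform is generated per catalog entry (variable names are assumptions, not our actual template):

```jinja
{# Generate a per-service filter transform from the metrics catalog #}
{% for svc in services %}
{{ svc.service_name }}-accesslog-metrics-prepare:
  type: filter
  inputs: ["{{ svc.service_name }}-accesslog-parsed"]
  condition:
    type: vrl
    source: 'match(string!(.path), r''{{ svc.path_re }}'')'
{% endfor %}
```

The generator renders this template with the parsed metrics-catalog.yml, so the hand-written copy-paste step disappears.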
I also added code generation for tests. Part of the test code is shown below.
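For illustration, a generated test could use vector's native unit-test format, roughly like this (a sketch under the assumption of the transform names used earlier):

```yaml
tests:
  - name: mysvc cart request passes the metrics-prepare filter
    inputs:
      - insert_at: mysvc-accesslog-metrics-prepare
        type: log
        log_fields:
          path: /cart
          method: GET
          status: "200"
    outputs:
      - extract_from: mysvc-accesslog-metrics-prepare
        conditions:
          - type: vrl
            source: '.path == "/cart"'
```

Such tests run with `vector test` and catch a broken filter before the config is rolled out.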
Protect yourself! – Cardinality explosion
When a label has too many values, for example when the `path` label receives URLs like `/products/product-name`, where `product-name` can be anything, a cardinality explosion is possible. We have more than 1 million products on our website vseinstrumenti.ru, which means a million unique `product-name` values. When a search robot comes to us and quickly crawls these pages, we get a cardinality explosion: the output becomes tens of thousands of metric series, and the response from the `/metrics` endpoint can grow to tens or hundreds of megabytes. The metrics collector simply will not have time to scrape them, and you will stop receiving metrics.
To solve this problem, we previously used a separate file describing replacements via regexp. This was inconvenient: the path matching condition had to be duplicated in two places – in the file describing the selection condition and in the file describing the regexp path replacements. After the rework it became more convenient.
Now for such cases we set a regular-expression matching condition (the `re` field in the log selection condition in metrics-catalog.yml) and immediately specify the replacement value in `label_override` – this is what will end up in the `path` label of the final metric.
Look at the example metrics-catalog.yml
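A hypothetical catalog fragment showing the pair of fields (the surrounding structure is an assumption):

```yaml
- metric_name: vi_http_requests_total
  metric_type: counter
  events:
    # match any concrete product page...
    - re: '^/products/(?P<name>([^/]+-\d+/))$'
      # ...but record it under one aggregated path label
      label_override: '/products/:name'
```

One entry now holds both the match condition and its replacement, so nothing has to be duplicated across files.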
As a result, we will see counter metric series like `vi_http_requests_total{path="/products/:name", method="GET", status="200", cluster_name="k8s_c1", container_name="site-app", service_name="site"}`.
In this case, all the dissimilar addresses in the `path` label will be replaced by `/products/:name`, and instead of tens of thousands of series there will be only hundreds, as we expect.
Idea for development
If you take a closer look at the regular expressions in our `re` fields, you can see redundancy, namely the part `(?P<name>(the expression itself))`. We give a name to one of the regular expression groups, but why?
And then we had the idea of automating the `label_override` value: write a tool that would look at `re: ^/product/(?P<name>([^/]+-\\d+/))$`, find the name inside `(?P<name>(the expression itself))`, and construct a value of the form `/products/:name` for `label_override`.
Currently, the JSON schema does not require `label_override`. Perhaps we should make it mandatory whenever the `re` block is present, to rule out the possibility of creating high-cardinality value sets for the label.
Additional protection against cardinality explosion
Replacement via `label_override` is great, but sometimes you can overlook it, or not immediately realize that a label may end up with too many values. This means the risk of a metrics cardinality explosion remains.
To protect against this, vector.dev offers the tag_cardinality_limit component. It tracks the number of values per label and the resulting number of metric series. If the cardinality exceeds the configured limit, it can drop some of a metric's labels or drop the event entirely.
We use this component with the following settings:
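A sketch of such a configuration (the limit value here is illustrative, not our production number):

```yaml
transforms:
  metrics-cardinality-limiter:
    type: tag_cardinality_limit
    inputs: ["access_log_to_metrics"]
    # probabilistic mode trades a little accuracy for much less memory
    mode: probabilistic
    # at most this many distinct values per tag
    value_limit: 500
    # drop the whole event once a tag exceeds the limit
    limit_exceeded_action: drop_event
```

The limiter sits between the log_to_metric transform and the Prometheus exporter, acting as a safety valve.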
Probabilistic mode uses less memory, but can sometimes let through a few extra metrics even after the `value_limit` on the number of values has been reached. This is due to the probabilistic nature of its operation. For us this is not critical, since all we need is to detect such behavior.
We collect vector's internal component metrics, which lets us monitor the parts of its operation we care about. In this case, we just need an alert on growth of the component_discarded_events_total metric for the component named "metrics-cardinality-limiter"; it draws our attention to the problem so we can move on to eliminating the cause – finding the high-cardinality metric. If you use VictoriaMetrics to store metrics, as we do, you can look for high-cardinality metrics in its Cardinality Explorer.
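Such an alert could be written as a Prometheus-style rule, roughly like this (the metric name assumes the internal_metrics source with its default `vector` namespace; thresholds and labels are assumptions):

```yaml
groups:
  - name: vector-cardinality
    rules:
      - alert: VectorCardinalityLimiterDroppingEvents
        expr: increase(vector_component_discarded_events_total{component_id="metrics-cardinality-limiter"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vector is dropping metrics due to tag cardinality limits"
```

Any non-zero discard rate from the limiter means some label has blown past its limit and deserves a look in Cardinality Explorer.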
How to add metrics now
We define the metric and the conditions for selecting logs into the metric in the metrics-catalog.yml file
In addition to metric selection filters, we have also added support for excluding events from the calculation. For example, you may need to exclude all requests with method="OPTIONS" from counting (see the exclude_events section of the metrics-catalog).
We generate a new config and roll it out via CI/CD
Done. The metrics are already being counted!
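As an illustration, a catalog entry with such an exclusion might look like this (a sketch; the exact field layout is an assumption):

```yaml
- service_name: site
  metrics:
    - metric_name: vi_http_requests_total
      metric_type: counter
      exclude_events:
        # CORS preflight requests should not inflate the request counter
        - method: OPTIONS
```

That is the whole workflow now: edit this file, let CI/CD regenerate and roll out the vector config.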
What did the redesign of metrics calculation give us?
Saved engineering time – instead of 3-5 hours of coding in VRL, debugging, and struggling with toml files, 20-30 minutes is now enough.
Simplified support – you no longer need a deep understanding of vector; it is enough to describe the metrics in yml, i.e. a yaml-developer is enough.
In fact, we didn't stop there. Later we invented the Unified Log Pipeline, adapted this solution to it, and made it even more flexible – it is now suitable for any logs.
How can this benefit you?
You can check out our open source vector.dev code for converting logs into metrics here https://github.com/vseinstrumentiru/vector.dev-metrics-to-logs-helper.
It includes:
a basic set of files for ansible and vector
metrics-catalog.yml
a JSON schema for metrics-catalog.yml
example logs
a Makefile for running tasks
a Readme on how to run it all
Use the ideas from this code to implement your own solution and adapt it to your needs.
If you have any questions about the code or how it was built, write to me in private messages here or by email. Current contacts are at https://github.com/r3code .
And if you have questions about vector.dev itself, take a look at the Russian-speaking Vector community group on Telegram: https://t.me/vectordev_ru