SRE Practices: Log Standardization
My name is Dima Sinyavsky, I am an SRE engineer at Vi.Tech – an IT subsidiary of VseInstruments.ru. Our website generates more than 100 thousand orders every day. We have many services and many developers who write a lot of logs. Logs are written so that we can understand what happened when something fails, so you need to be able to navigate them and find what you need, even if you are seeing them for the first time.
In previous articles I talked about how we switched from Elastic to vector.dev for log processing and how we now calculate metrics from logs with its help.
In this article I will talk about our experience standardizing logs: what problems we had, why we did it, and what it gave us as a team and as a company.
Background
In 2022, I joined Vi.Tech and dove into vector.dev to process logs. It is a very flexible solution that lets you collect logs of almost any structure from any application, and that is exactly what we did. Fortunately, some of the logs had already been converted to structured JSON, which made them easier to process.
We had a bunch of different transforms in vector and a separate ClickHouse table for each source, for each logged service, and the code was often copied from place to place with minor changes. With 5-10 applications this is still tolerable, but as there are more and more of them… you start to get tired of repeating the same work. It would be nice to automate it, but every application is different! Even neighboring applications may name a field cluster in one case and cluster_name in another, which means they cannot be put into a single database column without additional transformation to a common form. So each application can write logs in its own way, and we have to support all of it.
What it's like to work with such logs
Developers rarely look at the logs of someone else's application, for example a neighboring service, but they understand their own very well, no matter how exotic and mysterious the field names may be. However, DevOps and SRE engineers work with all systems and often need to look at logs from multiple applications at once, especially during an incident.
Of course, it's cool to feel like a know-it-all about logs, but not at 2 a.m. when a service you are seeing for the first time has crashed and its logs are more exotic than ever. If you found documentation on the logs (ahem… have you ever seen any?), you're in luck. If not, you go looking for the application's source code and its developer. You reached the developer and he helped you figure it out – you're in luck again! Couldn't get through on the phone – you sit and read the source code. How much time wasted… a lot!
Let's play a guessing game
Guess what this query to the log database will return (we use ClickHouse):
SELECT path, request_query FROM accesslogs
3.. 2.. 1.. Correct answer
Did you guess it? I didn't.
What's the problem?
Look at the picture below – this is what people see
And the truth is…
But each person has their own association in their head, shaped by individual experience
Problem – we call the same thing differently
Why is this bad:
It causes misunderstanding among people and misinterpretation of data.
It used to be funny: we are talking about the http request query, we look at the logs in ClickHouse, and it turns out the service writes the full URL into the request_query field.
It makes it impossible to template queries to the logs.
If log field names differ everywhere, I cannot write one query and then change only the service name in it to get, for example, http access logs. Selecting all fields with SELECT * will still work, but if you only need a few specific fields, it won't.
It makes it harder to find logs during an incident.
As I wrote above, you have to spend time getting to know the log structure and understanding the field names and meanings; sometimes you cannot do it without outside help or reading the source code.
It expands the data schema in the database.
Since we store logs in ClickHouse tables, each service has its own table with a set of columns based on its field names. There are many possible sets of fields; even the same http access logs from different applications can end up with different sets of columns (see the example after this list).
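To make the schema sprawl concrete, here is a hypothetical sketch (table and column names are invented for illustration, based on the cluster/cluster_name and request_query examples above) of how two services logging the same kind of http access events end up with different tables:

CREATE TABLE logs.service_a_access
(
    ts            DateTime,
    cluster       String,   -- one service calls it "cluster"
    path          String,   -- only the URL path
    request_query String,   -- only the query string
    status        UInt16
) ENGINE = MergeTree ORDER BY ts;

CREATE TABLE logs.service_b_access
(
    timestamp     DateTime,
    cluster_name  String,   -- its neighbor calls the same thing "cluster_name"
    request_query String,   -- …and here this field holds the full URL
    http_code     UInt16
) ENGINE = MergeTree ORDER BY timestamp;

Every such variation needs its own transform in vector and its own table in ClickHouse.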
All this makes working with logs varied, but costly and unpleasant.
Pleasant, predictable work with logs. How?
I wanted to redo all this to make it easier and more pleasant to work with logs, both for us and for the developers.
What we did:
We brought the problem to the developers' attention; they acknowledged it and wanted to take part in solving it.
We decided to start by agreeing on a dictionary – what we call things and what they mean.
Then document the agreements and begin implementing them.
It was the second month of discussions…
And then this happened
Then, re-reading the OpenTelemetry documentation on tracing once again, I remembered that the specification also covers logs; since that part had not yet been fully implemented for Go, we had not paid attention to it before. I started searching and found the OpenTelemetry Logs Data Model – it turned out to be exactly what we had started to create ourselves.
Why the OpenTelemetry Logs Data Model is good
It is based on the OpenTelemetry semantic conventions (SemConv) – the same kind of dictionary we had started to compile ourselves as a team. There is already an extensive registry of attributes with descriptions; it covers 80% of our needs, and for the rest we maintain a small additional dictionary of local names following the OpenTelemetry rules.
It gives a flexible model for presenting and storing logs – it allows you to store logs of any structure without changing the storage structure.
It also turned out to be a natural fit because our developers already use the OpenTelemetry tracing libraries, i.e. they had already seen the SemConv names and were implicitly following them in parts of their code.
Standard for logs
By combining our agreements with the teams and studying the OpenTelemetry Logs Data Model, we created and agreed upon our standard for logs in a month.
What the standard gave us:
Fixed agreements with the teams:
by names – we use the OpenTelemetry attribute registry; where it is missing something, we supplement it with our own dictionary following the OpenTelemetry naming rules
by format – applications write structured JSON logs
by storage structure – we defined how logs are stored in the ClickHouse database
by restrictions – we agreed on a maximum log message length, which is important for sizing the resources needed to store logs
Served as a starting point for simplifying work with logs in vector.dev
Implementation of the log standard in practice
What we did:
We adapted the template service and code libraries (modules, packages) to log according to the standard.
Now developers do not have to think about the log format or the names of the required fields; frequently used field names are provided in a constants package. This simplifies development and reduces typos.
We reworked the transform code in vector.dev.
Now we have a single set of log handlers for all logs – the Unified Log Pipeline.
We agreed to accept new services only with logs that follow our standard.
No discrepancies – no need to write new transforms in vector.dev.
Implementing the standard took us 4 months, but you can do it faster – now you know where and how to go.
How are logs collected now?
Model for storing logs in a database
ServiceName – name of the log source, string type
Timestamp – log event time, DateTime64 type
ObservedTimestamp – log reception time in vector, DateTime64 type
SeverityText – event severity (log level), string type
Body – text message or original entry, string type
Resource – information about the environment where the log was created, Map type
Attributes – attributes of the structured log that the application sent, Map type
TraceId, SpanId, TraceFlags – trace context data, string type
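To make this model concrete, here is a minimal sketch of what such a ClickHouse table could look like. The column names and types follow the storage model above; the engine, partitioning and ordering key are assumptions for illustration – our real table definition is not shown in this article.

CREATE TABLE unified_logs.logs
(
    Timestamp         DateTime64(9),
    ObservedTimestamp DateTime64(9),
    ServiceName       LowCardinality(String),
    SeverityText      LowCardinality(String),
    Body              String,
    Resource          Map(LowCardinality(String), String),
    Attributes        Map(LowCardinality(String), String),
    TraceId           String,
    SpanId            String,
    TraceFlags        String   -- per the storage model above; arrives as a number in the example JSON below
)
ENGINE = MergeTree               -- assumption: engine and keys chosen for illustration
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, Timestamp);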
Example of working with logs
1. Send a log from the application
import "vitech/pkg/http_accesslog_middleware"

// field names come from the shared semconv constants package
zlog.Str(semconv.ServerAddressKey, r.Host).
    Str(semconv.HTTPMethodKey, r.Method).
    Str(semconv.HTTPSchemeKey, httpScheme(r)).
    Str(semconv.URLPathKey, r.URL.Path).
    Str(semconv.URLQueryKey, r.URL.RawQuery).
    Int(semconv.HTTPStatusCodeKey, r.Status()).
    Info("http request received").Send()
2. vector.dev aggregator receives the log in JSON form
{
  "Timestamp": "2023-01-04 17:27:29.880230118",
  "SeverityText": "INFO",
  "Body": "http request received",
  "TraceId": "b2fa3d72711c1adad9ec88348c46f449",
  "SpanId": "85733005b2678b28",
  "TraceFlags": 0,
  "server.address": "promo",
  "http.method": "GET",
  "url.scheme": "https",
  "url.path": "/adv/1232",
  "url.query": "short=1",
  "http.status_code": 200
}
3. The log is processed and saved in ClickHouse
4. Look at the logs in ClickHouse
SELECT * FROM unified_logs.logs
WHERE ServiceName = 'promo' FORMAT Vertical;
Let's consider the resulting record in parts. Below is the data that we received from the application
Next, the vector.dev agent added information about the environment in which the log was created, in this case it is Kubernetes.
Then the vector.dev aggregator received the message, enriched it with data and wrote it to ClickHouse. For example, here it takes service_name from the pod annotation and puts it into the ServiceName field, while ObservedTimestamp, familiar from the storage model, records the time when the log was received by the aggregator.
Now any log can be saved in ClickHouse.
Now log processing is the SAME for all our applications!
Receiving logs from third-party applications
Not every application in the IT landscape gives you the opportunity to change its log format and field names to match the standard. But you still want to collect those logs, and sometimes you have to. How do you integrate them into the Unified Log Pipeline?
Option 1. Save as is
If the message is not JSON, we save the first 65535 characters of the text in the Body field, and in the additional service field PipelineErrors we record a flag that the message was too large. Yes, we won't parse such a log and will have to work with a plain string, but it will be in the single storage, available for viewing and searching!
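Such raw records can still be found later. A hedged query sketch, assuming PipelineErrors is materialized as its own String column next to the columns of the storage model (the article does not show its exact type or values):

SELECT ServiceName, ObservedTimestamp, Body
FROM unified_logs.logs
WHERE PipelineErrors != ''   -- assumption: non-empty means the record was stored as-is
ORDER BY ObservedTimestamp DESC
LIMIT 10;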
Option 2. Expansion via adapters
Vector.dev builds its processing components into a graph and allows wildcards when specifying a component's inputs, so sources can be connected dynamically. This lets us make an "adapter port" in our pipeline in the form inputs = ["ulp-adapter-*"] – it will pick up all ulp-adapter components.
So what is an adapter?
An adapter is a vector component that receives a log message and parses it into a vector event object – the same thing we normally do with JSON logs. The only difference is that in the adapter you have to describe the structure of the string so that vector understands what to extract and which event field to put it in. For example, vector has a powerful built-in parser, parse_grok; for the full list of available parsers, see the vector documentation.
As a result, we get a parsed message in which we can search by individual fields, just as with processed JSON messages.
Option 3. Preparation on agents
You can write a transform directly in the vector agent on the machine where the third-party application runs. With it, the logs that are read will be converted into JSON that suits us before being sent. Note that this creates additional load on that machine: the more complex the transformations (for example, many heavy regular expressions), the greater the load on the vector agent when processing logs.
How we managed to make log processing unified
The log standard fixed the agreements in place.
Vector.dev + ClickHouse made it possible to make processing and storage universal
As a result, this implementation was called “Unified Log Pipeline (ULP).”
What the Unified Log Pipeline gave us
Single code and tests of vector.dev handlers for all services:
Previously, for each new service it was necessary to write and debug code for the vector.dev aggregator
Now, to connect a new service, just adjust the service config in Kubernetes and apply it
Our mechanism for generating metrics from logs has become more universal: previously we could only build metrics from http access logs, now we can build them from any logs
A single storage structure in ClickHouse for all applications – the structure no longer changes, it is easier to maintain, and queries can be templated and reused (see the query sketch after this list)
Reduced the time to connect a service's logs
Previously it took 3 to 5 hours to write and test code for vector.dev
Now it takes 10-20 minutes to add an annotation to the service pod and apply it – 18 to 30 times less time
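To illustrate the templating point above: with a single schema, one query sketch works for any service – only the service name changes, and the attribute names come from the OpenTelemetry semantic conventions (the example below reuses the promo service and fields shown earlier; the time window is arbitrary):

SELECT
    Timestamp,
    Attributes['http.method']      AS method,
    Attributes['url.path']         AS path,
    Attributes['http.status_code'] AS status
FROM unified_logs.logs
WHERE ServiceName = 'promo'        -- swap in any other service name here
  AND Timestamp > now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100;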
There are some unanswered questions left
For example:
How to manage the storage period of logs in ULP?
How to collect logs in ULP not from Kubernetes?
How does ULP implement the inclusion of logs through pod annotations?
How to exclude logs from some kubernetes containers in ULP?
How did ULP manage to make the mechanism for calculating metrics from logs more universal?
These questions deserve detailed answers and are beyond the scope of this article; meanwhile we continue to improve our solution.
How can this benefit you?
Agree on and use the OpenTelemetry semantic conventions – you immediately get basic documentation for your logs and a description of what is stored where. This reduces the cognitive load on engineers and the time spent explaining "how it works".
Try the OpenTelemetry Logs Data Model as the basis for your own log standard. It gives flexibility in working with logs and ensures the necessary basic data is present. You can take our developments, supplement them with your own, and adapt the model to suit you.
Conclusion
Standardizing the logs allowed us to agree on log names and formats, standardize the libraries and make developers' work easier, including through templated queries for retrieving logs. It has also become easier for SRE to maintain the system for collecting, processing and storing logs – there is no longer a need to constantly change and extend the vector.dev handler code. Now we have a magic toggle switch to enable or disable log collection instead of hours of developing and debugging code for vector.
It was a great and interesting task. I express my gratitude to our SRE team!
We haven't stopped and continue to improve our Unified Log Pipeline solution, which we'll talk about later.