SRE Practices: Log Standardization

Searching for logs in a bunch of services

My name is Dima Sinyavsky, I am an SRE engineer at Vi.Tech – the IT subsidiary of VseInstruments.ru. Our website creates more than 100 thousand orders every day. We have many services and many developers who write a lot of logs. Logs are written so that, in case of a failure, you can understand what happened – which means you need to be able to navigate them and find what you need, even if you are seeing them for the first time.

In previous articles I talked about how we switched from Elastic to vector.dev for processing logs and how we now calculate metrics from logs with its help.

In this article I will tell you about our experience in standardizing logs. What problems were there, why did we do it and what did it give us as a team and company.

Background

In 2022, I joined Vi.Tech and dove into vector.dev for processing logs. It is a very flexible solution that lets you collect logs of almost any structure from any application, and that is exactly what we did. Fortunately, we had already converted some of the logs to structured JSON, which made them easier to process.

We had a pile of different transforms in vector and a ClickHouse table for each source, for each service that writes logs, and the code was often copy-pasted with minor changes. With 5-10 applications this is still tolerable, but as more and more appear, you get tired of repeating the same work. It would be nice to automate it, but every application is different! Even neighboring applications may name the same field cluster in one case and cluster_name in another, which means the values cannot go into a single database column without extra transformations to bring them to a common form. So every application can write logs in its own way, and we had to support all of it.
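
To give a feel for the duplication, here is a minimal sketch (not our real config; service and field names are illustrative) of two vector.dev remap transforms that differ only in how the source names the same field:

[transforms.parse_service_a]
type   = "remap"
inputs = ["service_a_logs"]
source = '''
. = parse_json!(.message)
.cluster_name = del(.cluster)   # service A calls the field "cluster"
'''

[transforms.parse_service_b]
type   = "remap"
inputs = ["service_b_logs"]
source = '''
. = parse_json!(.message)
# service B already calls it "cluster_name" -- same data, different key
'''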

I tried working with logs

Developers rarely look at the logs of someone else's application, say a neighboring service, but they understand their own perfectly well, no matter how exotic and mysterious the field names are. However, DevOps and SRE engineers work with all systems, and you often need to look at logs from several applications at once, especially during an incident.

Of course, it's cool to feel like a log know-it-all, but not at 2 a.m. when a service you are seeing for the first time has crashed and its logs are more exotic than ever. If you find documentation for the logs (ahem... have you ever seen any?), you're in luck. If not, you go looking for the application's source code and its developer. You got through to the developer and he helped you figure it out – lucky again! You didn't get through – you sit and read the source code. How much time is wasted... a lot!

...that feeling when the applications are different, the names of the log fields are different, but the content is the same

Let's play a guessing game

Guess what the query to the log database will return (we use Clickhouse):

SELECT path, request_query FROM accesslogs
3..2..1..

The correct answer: path is empty, and request_query contains the full URL.

Did you guess it? I didn't.

What's the problem?

Look at the picture below, this is what people see

Projections of truth

And the truth is…
It's all just a URL

But each person has their own association in their head, shaped by individual experience.

Problem – we call the same thing differently

Why is this bad:

  1. It causes misunderstandings between people and misinterpretation of data.
    There were funny moments: we are talking about the HTTP request query, we look at the logs in ClickHouse, and it turns out the service writes the full URL in the request_query field.

  2. Queries to the logs cannot be templated.
    If the field names differ everywhere, I cannot write one query and just change the service name in it to get, for example, HTTP access logs. SELECT * over all fields will work, but as soon as you need only a few specific fields, it breaks (see the example after this list).

  3. It is harder to find the logs you need during an incident.
    As I wrote above, you have to spend time getting to know the log structure and figuring out the names and their meanings, and sometimes you cannot do that without outside help or reading the source code.

  4. The data schema in the database keeps expanding.
    Since we store logs in ClickHouse tables, each service gets its own table with a set of columns derived from its field names. There are many possible field sets; even the same HTTP access logs can have different columns for different applications.
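
A small illustration of the templating problem (the table and column names here are made up, not our real schema):

-- The same question needs a different query per service,
-- because each table names the same fields differently.
SELECT url_path, status FROM logs_service_a WHERE status >= 500;
SELECT path, status_code FROM logs_service_b WHERE status_code >= 500;

-- With standardized names one query template, parameterized only by the service, would do.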

All this makes working with logs varied, but costly and unpleasant.

Pleasant, predictable work with logs. How?

I wanted to redo all this to make it easier and more pleasant to work with logs, both for us and for the developers.

What we did:

  1. We brought the problem to the developers; they acknowledged it and wanted to take part in solving it.

  2. We decided to start by agreeing on a dictionary – what we call things and what they mean.

  3. Document the agreements and start implementing them.

It was the second month of discussions…

And then this happened

OpenTelemetry Logs Data Model is what we were missing.

The next time I re-read the OpenTelemetry documentation on tracing, I remembered that the specification also covers logs; since that part had not yet been fully implemented for Go, we had not paid attention to it before. I started digging and found the OpenTelemetry Logs Data Model – it turned out to be exactly what we had started to create ourselves.

Why the OpenTelemetry Logs Data Model is good

  1. It is based on the OpenTelemetry semantic conventions: SemConv is the same kind of dictionary we had started compiling ourselves as a team. There is already an extensive registry of attributes with descriptions; it covers 80% of our needs, and for the rest we keep a small additional dictionary of local names that follows the OpenTelemetry naming rules.

  2. It gives a flexible model for representing and storing logs – you can store any logs without changing the storage structure.

It also felt close to us because our developers already use OpenTelemetry tracing libraries, so they have already seen the SemConv names and implicitly follow them in parts of the code.
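
For example, the slice of the dictionary for HTTP access logs might look like this (the attribute names are the ones our logs already use; the last entry is a hypothetical local addition made according to the OpenTelemetry naming rules):

server.address    -- host the request was addressed to
http.method       -- HTTP request method
url.path          -- path component of the URL
url.query         -- query string of the URL
http.status_code  -- HTTP response status code
vi.order.id       -- our order identifier (local dictionary)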

Standard for logs

By combining our agreements with the teams and studying the OpenTelemetry Logs Data Model, we created and agreed upon our standard for logs in a month.

We record the team’s agreements in the standard for logs

What the standard gave us:

  1. Fixed agreements with the teams:

    1. on naming – we use the OpenTelemetry attribute registry; where it falls short, we supplement it with our own dictionary following the OpenTelemetry naming rules

    2. on format – applications write structured JSON logs

    3. on storage structure – we defined how logs are stored in the ClickHouse database

    4. on limits – we agreed on the maximum length of a log message, which matters when sizing the storage resources for them

  2. Served as a starting point for simplifying work with logs in vector.dev

Implementation of the log standard in practice

What we did:

  1. We adapted the template service and the shared code libraries (modules, packages) to log according to the standard.
    Developers no longer have to think about the log format or the names of the required fields; frequently used field names live in a constants package (a sketch follows this list). This simplifies development and leaves less room for typos.

  2. We reworked the transform code in vector.dev.
    Now we have a single set of log handlers for all logs – the Unified Log Pipeline.

  3. We agreed to accept new services only if their logs follow the standard.
    No discrepancies, no new transforms to write in vector.dev.
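
A minimal sketch of what such a constants package might look like (the package path and the exact set of constants are illustrative, not our real code):

package semconv

// Attribute keys our services log most often, typed once and reused everywhere.
const (
	ServerAddressKey  = "server.address"
	HTTPMethodKey     = "http.method"
	HTTPSchemeKey     = "url.scheme"
	URLPathKey        = "url.path"
	URLQueryKey       = "url.query"
	HTTPStatusCodeKey = "http.status_code"
)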

Implementing the standard took us 4 months, but you can do it faster – now you know where to go and how.

How are logs collected now?

Model for storing logs in a database

  • ServiceName – name of the log source, string type

  • Timestamp – log event time, DateTime64 type

  • ObservedTimestamp – log reception time in vector, DateTime64 type

  • SeverityText – event severity (log level), string type

  • Body – text message or original entry, string type

  • Resource – information about the environment where the log was created (type map)

  • Attributes – attributes of the structured log that the application sent (type map)

  • TraceId, SpanId, TraceFlags – trace context data, string type
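
A sketch of a ClickHouse table for this model (the engine, the ordering key and the exact column types are illustrative; our real DDL may differ):

CREATE TABLE unified_logs.logs
(
    ServiceName       LowCardinality(String),
    Timestamp         DateTime64(9),
    ObservedTimestamp DateTime64(9),
    SeverityText      LowCardinality(String),
    Body              String,
    Resource          Map(String, String),
    Attributes        Map(String, String),
    TraceId           String,
    SpanId            String,
    TraceFlags        String
)
ENGINE = MergeTree
ORDER BY (ServiceName, Timestamp);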

Example of working with logs

1. Send a log from the application

import "vitech/pkg/http_accesslog_middleware"

// Log an incoming HTTP request using attribute keys from the constants package.
zlog.Str(semconv.ServerAddressKey, r.Host).
	Str(semconv.HTTPMethodKey, r.Method).
	Str(semconv.HTTPSchemeKey, httpScheme(r)).
	Str(semconv.URLPathKey, r.URL.Path).
	Str(semconv.URLQueryKey, r.URL.RawQuery).
	Int(semconv.HTTPStatusCodeKey, r.Status()).
	Info("http request received").Send()

2. vector.dev aggregator receives the log in JSON form

{
	"Timestamp": "2023-01-04 17:27:29.880230118",
	"SeverityText": "INFO",
	"Body": "http request received",
	"TraceId": "b2fa3d72711c1adad9ec88348c46f449",
	"SpanId": "85733005b2678b28",
	"TraceFlags": 0,
	"server.address": "promo",
	"http.method": "GET",
	"url.scheme": "https",
	"url.path": "/adv/1232",
	"url.query": "short=1",
	"http.status_code": 200
}

3. The log is processed and saved in Clickhouse

Log path from the application through the vector agent to the log storage

4. Look at the logs in Clickhouse

SELECT * FROM unified_logs.logs
WHERE ServiceName = 'promo' FORMAT Vertical;

Let's consider the resulting record in parts. Below is the data that we received from the application

Log data we received from the application

Next, the vector.dev agent added information about the environment in which the log was created, in this case it is Kubernetes.

Vector.dev agent added information about the environment in which the log was created

Then the vector.dev aggregator received the message, enriched it with data and recorded it in Clickhouse.

Vector.dev aggregator enriched the message with data

For example, here it takes service_name from the pod annotation and puts it into the ServiceName field of the storage model, and it records the time the log was received by the aggregator.
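
A sketch of that aggregator-side enrichment (the annotation key, component names and exact paths are illustrative):

[transforms.ulp-enrich]
type   = "remap"
inputs = ["from_agents"]
source = '''
# take the service name from the pod annotation set by the team
service_name = .kubernetes.pod_annotations."vi.tech/service_name"
.ServiceName = if service_name != null { service_name } else { "unknown" }

# remember when the aggregator actually saw the event
.ObservedTimestamp = now()
'''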

Now any log can be saved in ClickHouse.

Now log processing is the SAME for all our applications!

Same! Same! Same!

Receiving logs from third-party applications

Not every application in the IT landscape lets you change its log format and field names to match the standard. But you still want to collect its logs, and sometimes you have to. How do you integrate them into the Unified Log Pipeline?

Option 1. Save as is

If the message is not JSON, we save the first 65535 characters of the text in the Body field, and in the auxiliary service field PipelineErrors we record a flag if the message exceeded that limit. Yes, we don't parse such a log and have to work with a plain string, but it ends up in the single storage, available for viewing and searching!
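
A sketch of that fallback (the field names follow our standard; the component names and the exact error flag are illustrative):

[transforms.ulp-fallback]
type   = "remap"
inputs = ["raw_logs"]
source = '''
parsed, err = parse_json(.message)
if err != null {
	# not JSON: keep the raw text, truncated to the agreed maximum length
	msg = string!(.message)
	if length(msg) > 65535 {
		.PipelineErrors = "message truncated"
	}
	.Body = truncate(msg, 65535)
} else {
	. = merge(., object!(parsed))
}
'''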

Option 2. Expansion via adapters

Vector.dev builds a hierarchy of processing components as a graph and allows wildcards when specifying inputs, so source components can be connected dynamically. This lets us make an "adapter port" in our pipeline in the form inputs = ["ulp-adapter-*"], which will pick up all ulp-adapter-* components.
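
A sketch of the "adapter port" itself (component names are illustrative): the first shared transform of the pipeline lists its inputs with a wildcard, so any adapter whose name matches plugs in without changes to the pipeline.

[transforms.ulp-pipeline-entry]
type   = "remap"
inputs = ["ulp-adapter-*"]
source = '''
# ... common Unified Log Pipeline processing starts here ...
'''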

So what is an adapter?

An adapter is a vector component that receives a log message and parses it into a vector event object – the same thing we normally do with JSON logs. The difference is that in the adapter you have to describe the structure of the string so that vector understands what to extract and which event field to put it in. Vector has powerful parsers built in, for example parse_grok; for the full list of available parsers, see the vector documentation.

As a result we get a parsed message that can be searched by individual fields, just like processed JSON messages.
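
A sketch of such an adapter for a plain-text access log (the component and source names and the grok pattern are illustrative):

[transforms.ulp-adapter-legacyapp]
type   = "remap"
inputs = ["legacyapp_logs"]
source = '''
# describe the line structure so vector knows what to extract
parsed = parse_grok!(.message, "%{IPORHOST:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}")

# map the extracted pieces onto the standard attribute names
.Attributes."http.method" = parsed.method
.Attributes."url.path" = parsed.path
.Attributes."http.status_code" = to_int!(parsed.status)
.Body = del(.message)
'''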

Option 3. Preparation on agents

You can write a transform directly in the vector agent on the machine where the third-party application runs; it converts the logs it reads into JSON that suits us before they are sent. Keep in mind that this puts extra load on that machine: the more complex the transformations (for example, lots of heavy regular expressions), the higher the load on the vector agent while processing logs.

How we managed to make log processing unified

  1. The log standard locked in the agreements.

  2. Vector.dev + ClickHouse made it possible to make processing and storage universal

As a result, this implementation was called “Unified Log Pipeline (ULP).”

What the Unified Log Pipeline gave us

  1. Single code and tests of vector.dev handlers for all services:

    • Previously, for each new service it was necessary to write and debug code for the vector.dev aggregator

    • Now, to connect a new service, just adjust the service config in Kubernetes and apply it

  2. Our mechanism for generating metrics from logs has become more universal: previously we could only build metrics from http access logs, now we can build them from any logs

  3. A single storage structure in ClickHouse for all applications – now the structure does not change, it is easier to maintain, you can template and reuse queries

  4. Reduced log connection time for the service

    • Previously it took from 3 to 5 hours to write and test code for vector.dev

    • Now it takes 10-20 minutes to add an annotation to the service's pod and apply it (a sketch follows this list) – 18 to 30 times less time.
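
A sketch of what "connecting" a service looks like (the annotation keys here are illustrative – the real ones are defined by our ULP configuration):

# fragment of the service's Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promo
spec:
  template:
    metadata:
      annotations:
        vi.tech/logs-enabled: "true"
        vi.tech/service_name: "promo"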

There are some unanswered questions left

For example:

  • How to manage the storage period of logs in ULP?

  • How to collect logs in ULP not from Kubernetes?

  • How does ULP enable log collection via pod annotations?

  • How to exclude logs from some kubernetes containers in ULP?

  • How did ULP manage to make the mechanism for calculating metrics from logs more universal?

These questions deserve detailed answers and are beyond the scope of this article; meanwhile, we continue to improve our solution.

How can this benefit you?

Useful information here

  1. Agree on and use the OpenTelemetry semantic conventions – then you have basic documentation for your logs and a description of what lives where. This reduces the cognitive load on engineers and the time spent on "explaining how it works".

  2. Try the OpenTelemetry Logs Data Model as the basis for your own log standard. It gives flexibility in working with logs and ensures the necessary basic data is present. You can take our work as a starting point and supplement it with your own; adapt the model to suit you.

Conclusion

Log standardization allowed us to agree on log names and formats, standardize our libraries and make developers' work easier, including through templated queries for retrieving logs. It has also become easier for SRE to maintain the system for collecting, processing and storing logs – there is no longer a need to constantly change and extend the vector.dev handler code. Now we have a magic toggle switch that enables or disables log collection, instead of hours of developing and debugging vector code.

It was a great and interesting task. I express my gratitude to our SRE team!

We haven't stopped and continue to improve our Unified Log Pipeline solution, which we'll talk about later.
