How we built our monitoring system: the thorny path to stable operation of complex IT systems

We are building Amvera Cloud, a platform for hosting IT applications. For us, monitoring the service is of paramount importance: few people want to host their projects in an unstable cloud. Monitoring lets you notice a failure immediately and take action, and if you run a complex project consisting of many microservices, failures are unavoidable.

We arrived at everything described in this article through a series of outages and bugs that had to be fixed on the fly. The crashes happened during the beta test, but the question of ensuring stable operation is still a painful one for us. We paid a heavy price for this knowledge: the inconvenience of our users. We ourselves are only halfway toward the stability we would like to achieve, but I hope our experience helps someone avoid our mistakes and get things right from the start.

This article does not pretend to offer fundamentally new knowledge that an experienced SRE engineer does not already have. But it may be useful as a starting point for studying the technology stack for those who are just beginning to dive into the topic.

Let’s start with which kinds of monitoring are typically used, and in which situations.

  1. When an incident occurs, you need to find the cause and eliminate it quickly, so you need tools for investigating incidents. Log and trace analysis usually come to the rescue here.

  2. It is useful to proactively observe the behavior of the infrastructure and application performance metrics. For this there is Grafana, which visualizes metrics collected by Prometheus.

  3. Finally, it is useful to receive alerts when something goes wrong. For this task there is an Open Source tool, Sentry.

Since commercial solutions are either completely unavailable on our market or unreliable to depend on, we chose the path of adopting Open Source tools.

Monitoring Metrics

Choosing a solution for monitoring metrics was the easiest part. There is an excellent pairing of Grafana and the Prometheus time series database.

Prometheus is written in Go, works very fast, and is easy to install. We can say it was the only component that caused no problems during installation or operation.
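For illustration, here is a minimal sketch of how an application exposes metrics for Prometheus to scrape, using the official Python client (the metric names and port are arbitrary examples, not our actual setup):

```python
# A minimal sketch: exposing application metrics on an HTTP endpoint that
# Prometheus scrapes. Metric names and the port are arbitrary examples.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total number of handled requests")
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

@LATENCY.time()  # record how long each call takes
def handle_request() -> None:
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))  # simulated work

if __name__ == "__main__":
    start_http_server(8000)  # metrics become available at :8000/metrics
    while True:
        handle_request()
```

Prometheus is then pointed at this endpoint in its scrape configuration, and Grafana visualizes the resulting series.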

That said, Prometheus has one known quirk: it can go into a restart loop if it runs out of RAM, and you can find yourself without monitoring at the most critical moment. Reportedly, the VictoriaMetrics database can be used to work around this. We have not run into this problem ourselves and have not tried to solve it.

Log analysis

Log analysis turned out to be more complicated. The classic solution is the ELK (Elastic) stack or its OpenSearch equivalent with a more permissive license. For our tasks we tried EFK (an ELK variant with Fluentd instead of Logstash), but ran into a number of problems. And while the operational problems are solvable, the lack of functionality we needed was more serious. A classic drawback of ELK is its high resource consumption. In addition, you need to know how to “cook” it properly so that it runs without failures and provides the required level of security (there have been cases of data leaks and breaches caused by holes in ELK setups).

ELK (EFK) did not solve all of our problems. We needed not only to analyze the logs ourselves, but also to stream them to our clients, scoped to their individual projects. Unfortunately, we were unable to implement this with ELK.

So we wrote our own log collector that ships logs to MongoDB, from where they are streamed to clients. Frankly, this is not the best solution and it is not very fast, but at the current stage it does its job.
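To give an idea of the approach, here is a minimal sketch of the idea in Python with pymongo, not our actual collector: logs go into a capped collection, and a tailable cursor streams new entries per project (collection and field names are hypothetical):

```python
# A minimal sketch (not production code): logs are written to a capped
# MongoDB collection and streamed per project with a tailable cursor.
from datetime import datetime, timezone

from pymongo import MongoClient, CursorType

client = MongoClient("mongodb://localhost:27017")
db = client["monitoring"]

# A capped collection keeps insertion order and a bounded size,
# which works as a rolling log buffer.
if "logs" not in db.list_collection_names():
    db.create_collection("logs", capped=True, size=512 * 1024 * 1024)

def collect(project_id: str, line: str) -> None:
    """Store one log line coming from a project's container."""
    db.logs.insert_one({
        "project": project_id,
        "ts": datetime.now(timezone.utc),
        "line": line,
    })

def stream(project_id: str):
    """Yield new log lines for a single project as they arrive."""
    cursor = db.logs.find(
        {"project": project_id},
        cursor_type=CursorType.TAILABLE_AWAIT,
    )
    while cursor.alive:
        for doc in cursor:
            yield doc["line"]
```

A real collector would also need backpressure handling and per-project size limits; the sketch above skips all of that.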

It is worth mentioning right away that if you do not need to separate logs by client and stream them, there is an excellent Open Source solution: Grafana Loki. It is still somewhat “raw” and, due to its architecture, is best suited to storing and processing logs over a short retention period. But if you want something that just works and does not require a huge server cluster (as ELK does), we would recommend considering it.
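If you go the Loki route, getting logs in is straightforward. A hedged sketch of pushing a line over Loki's HTTP push API (the URL and labels are placeholders; the payload follows the documented /loki/api/v1/push format):

```python
# A sketch of pushing one log line to Loki's HTTP push API.
# The Loki URL and labels here are placeholder assumptions.
import json
import time

import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/push"  # assumed local Loki

def push_line(line: str, labels: dict) -> None:
    payload = {
        "streams": [{
            "stream": labels,  # e.g. {"app": "backend", "env": "prod"}
            "values": [[str(time.time_ns()), line]],  # nanosecond timestamp
        }]
    }
    resp = requests.post(
        LOKI_URL,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )
    resp.raise_for_status()

push_line("user signed in", {"app": "backend", "env": "prod"})
```

In practice you would usually let an agent such as Promtail or Fluent Bit do this shipping instead of the application itself.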

Trace analysis

In trace analysis, the OpenTelemetry standard has become established, and the Open Source tool Jaeger has become widespread.

If you don’t know what traces are, here is a simple example.

Imagine you have several interconnected microservices. To perform some process (say, a user adding a product to the cart), each microservice must do its own part of the work. With traces you can see what happened at each stage of the interaction and, most importantly, how long each stage took.

It is like a fairy-tale hero who has to pass a series of trials before receiving the reward. At each trial something can go wrong, and each trial takes time. Observing this “chain of trials”, only in the form of microservices, is what trace analysis is.
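As an illustration, here is a minimal sketch of instrumenting one step of such a flow with the OpenTelemetry Python SDK, exporting spans over OTLP to a collector or to Jaeger with OTLP ingestion enabled (the service name, endpoint, and attributes are placeholder assumptions):

```python
# A minimal sketch: one step of the "add to cart" flow instrumented with
# the OpenTelemetry SDK; spans are exported over OTLP gRPC.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "cart-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def add_to_cart(user_id: str, product_id: str) -> None:
    # Each span becomes one "trial" in the trace: you can see whether it
    # succeeded and how long it took.
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("product.id", product_id)
        # ... call the inventory and pricing services here ...
```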

Error notifications

For error notifications we chose Sentry, with alerts forwarded to Telegram. The only caveat: when a major crash happens, it looks like a DDoS of bug reports. There are a lot of them in the channel, and you need to make sure that Sentry itself does not go down.
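On the application side the setup is small. A minimal sketch with the Sentry Python SDK (the DSN and function are placeholders; the Telegram forwarding itself is configured in Sentry's alert rules and is not shown here):

```python
# A minimal sketch: initialize the Sentry SDK so unhandled exceptions and
# explicit captures become alerts. The DSN below is a placeholder.
import sentry_sdk

sentry_sdk.init(
    dsn="https://<key>@<your-sentry-host>/<project-id>",  # placeholder DSN
    environment="production",
    traces_sample_rate=0.1,  # optionally sample performance traces too
)

def charge_user(user_id: str) -> None:
    try:
        ...  # business logic
    except Exception as exc:
        # Report the error instead of silently swallowing it.
        sentry_sdk.capture_exception(exc)
        raise
```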

In an ideal world, you could also use PagerDuty to route error notifications to specific employees according to defined rules, but this service has left our market. And since our team is not very large yet, this is not critical for now.

Experienced SRE engineers may say that these tools alone are not enough, and that ensuring stability also requires an appropriate culture of developing, deploying, and updating software. We agree, but that is a topic for a separate article. Here we wanted to talk about the technology stack we chose for our tasks. I hope this information is useful for those who are thinking about building a monitoring system for their own project.


P.S. I invite you to take part in the beta test of our container cloud, Amvera Cloud. In it you can rent individual containers and, as with Heroku, deploy updates via git push (convenient for small projects that do not have full-fledged CI/CD set up). During the beta test the service is completely free. Afterwards we will credit everyone 1000 rubles to their balance for continued use of the cloud, which in most cases is enough for several months. We would be grateful if you tried the cloud and shared your feedback.
