The success story of some alerts

I want to tell you how we jumped through hoops while setting up alerts for client metrics. How we did it, why, and what we ran into along the way – read on 🙂

What is this article about

A few words about Clickhouse and Grafana, so that the rest of the conversation makes sense.

Clickhouse is a columnar database management system (DBMS) for online analytical query processing. We use it to store metrics about user actions on the site (clicks on key buttons, opening experimental pages, and so on). It provides a convenient interface for sending data directly from our client code, which suits us just fine.

Grafana is a platform for visualization, monitoring and data analysis. It lets you build graphs from the collected metrics, react to incidents, visually check hypotheses, and much more. One of its most significant features is automatic alerting, and it is largely thanks to this feature that this article exists.

Why do we need such specific alerts

It might seem that metrics processing is a purely business concern and should not directly affect developers. Unfortunately, it’s not that simple. A product is a complete machine in which everything is interconnected. Developers influence the collection of metrics just as much as the rollout of a new feature. Why? Because both the new functionality and the sending of metrics go through the same code. So when we roll out a new release, it is important to check not only that the project itself works, but also that metrics are being sent correctly. Thinking about how to solve this, we concluded that the day had finally come to build alerts for metrics.

We decided to ensure uninterrupted delivery of metrics from the front end to Clickhouse. Grafana turned out to be the most convenient tool for checking this delivery, because we already had graphs for the metrics, a plugin for working with Clickhouse, and an alert chat in Telegram. The stars aligned 💫

How alerts work

How do you define “lost metrics”? It’s very simple: lost metrics no longer make it into Clickhouse, which means the graph of their count per unit of time rapidly drops to zero.

So is it enough to count metrics per unit of time and send an alert when the count hits zero? Not quite. Counting over, say, an hour can lead to alerts firing at night (because night-time traffic is very close to zero), and that is a false alarm. So we reworked the mechanism to be more dynamic and less tied to the time of day.

We decided to watch for abrupt changes in the graph. At the current moment, we look at how the current number of metrics compares to the number 10 minutes ago, and express that as a percentage. We consider it normal if the value stays above 10%; if it falls below 10%, we send an alert. For projects that are opened rarely, we evaluate the change over a longer period (for example, an hour). The resulting metric is smoother, and right after a rollout you can quickly notice if the metrics have “fallen off”.

Another nuance: the “number at the current time” is not the current minute. Metrics sent right now have not yet had time to reach Clickhouse. So, to keep the measurement clean, we treat “the current time” as one minute ago (or one hour ago for rare metrics).
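A quick illustration of the rule on made-up numbers (these are not our real figures): say 500 unique users saw page_view ten minutes ago and only 30 see it now.

SELECT round(30 / 500 * 100, 2) AS diff_in_percent
-- returns 6: that is below the 10% threshold, so the alert fires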

Basic alert setup in the Grafana + Clickhouse bundle

Now let’s take a closer look at where to click and what each setting is responsible for.

Step 0. Set up the environment in Grafana

For us, this step was handled by the admins, so we didn’t have to do it ourselves. But if you are configuring everything from scratch, you will need to:

  • create a team space;

  • create a new dashboard;

  • add a new data source for Clickhouse (Configuration → Data Sources → Add data source);

  • add a user to connect to Clickhouse from Grafana (see the sketch after this list);

  • add a new channel for sending alerts (Alerting → Notification channels → New channel).
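For the Clickhouse user, a minimal sketch of what we mean. The user name, password and database here are made up, and this assumes RBAC-style user management; older installations define users in users.xml instead:

-- a hypothetical read-only user for Grafana
CREATE USER grafana_reader IDENTIFIED WITH sha256_password BY 'change_me';
GRANT SELECT ON metrics.* TO grafana_reader;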

Step 1. Setting up the metrics graph

We wrote two different queries (Query). One builds the graph that we can eyeball on the dashboard, and the second calculates the value used to trigger the alerts.

In this step we will look at the first query. We select the database, table and format we need (here it is important to pick Time Series, otherwise the graph will not be built), and then write the query body itself. The old-fashioned way, in SQL:

SELECT
  toStartOfMinute(created_at) AS time,
  uniqIf(user_id, event_type = 'page_view') AS show_metric
FROM $table
WHERE app_id = 999 AND $timeFilter
GROUP BY time
ORDER BY time
LIMIT 15

A few notes on the code.

This is not pure SQL, but SQL with some Clickhouse and Grafana additions mixed in. You can read about them in the documentation linked below. The additions here are:

  • variables (macros) $table and $timeFilter, which give access to data from the current Grafana page. $table is the table we specified when creating the query, and $timeFilter is the time interval selected in the drop-down menu of the Grafana interface (upper right corner of the graph panel).

  • aggregation functions: toStartOfMinute groups together timestamps that fall within the same minute, and uniqIf counts the number of unique users who saw our metric.

Don’t forget to include the $timeFilter condition so as not to overload Clickhouse with needlessly heavy queries.
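To make the macros less mysterious, here is roughly what the same query looks like with them expanded by hand. The table name metrics.events and the three-hour window are made up for illustration; the exact text Grafana substitutes depends on the plugin version:

SELECT
  toStartOfMinute(created_at) AS time,
  uniqIf(user_id, event_type = 'page_view') AS show_metric
FROM metrics.events                            -- what $table stands for
WHERE app_id = 999
  AND created_at >= now() - INTERVAL 3 HOUR    -- what $timeFilter stands for
GROUP BY time
ORDER BY time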

After these simple manipulations, we should get a graph of the page_view metric per minute. Cool! Let’s go set up alerts.

Step 2. Configuring the query for alerts

For the alerts, we write a second query, which will not be displayed on the chart but will only be used to evaluate the alert. It returns a single number – the percentage by which the graph has changed over the last 10 minutes. It is very similar to the previous query, but a little more complicated, because we need to take the percentage difference between two points in time.

A note: this query does in fact have to return a Time Series, otherwise the alerts will not work. That is why we mix the current timestamp into the result.

So, the code for this disgrace looks something like this (faint of heart, look away):

SELECT
  now(),
  ROUND(count_current / count_10m_ago * 100, 2) AS diff_in_percent
FROM (
  -- count for the "current" minute (one minute ago, so the data has had time to arrive)
  SELECT
    uniqIf(user_id, event_type = 'page_view') AS count_current
  FROM $table
  WHERE app_id = 999
    AND toStartOfMinute(created_at) = toStartOfMinute(
      dateadd(minute, -1, now())
    )
  LIMIT 60
) current, (
  -- the same count, but for the minute ten minutes earlier
  SELECT
    uniqIf(user_id, event_type = 'page_view') AS count_10m_ago
  FROM $table
  WHERE app_id = 999
    AND toStartOfMinute(created_at) = toStartOfMinute(
      dateadd(minute, -11, now())
    )
  LIMIT 60
) old

As mentioned above, it consists of three parts:

  • getting the current number of metrics (the first SELECT subquery);

  • getting the number of metrics 10 minutes ago (the second SELECT subquery);

  • calculating the change in the number of metrics and converting it to a percentage (the outer SELECT).
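And for rarely opened projects (mentioned above), here is the same pattern stretched to an hour. This is a sketch along the same lines rather than our production query, so adjust the periods to your own traffic:

SELECT
  now(),
  ROUND(count_current / count_1h_ago * 100, 2) AS diff_in_percent
FROM (
  SELECT uniqIf(user_id, event_type = 'page_view') AS count_current
  FROM $table
  WHERE app_id = 999
    AND toStartOfHour(created_at) = toStartOfHour(dateadd(hour, -1, now()))
) current, (
  SELECT uniqIf(user_id, event_type = 'page_view') AS count_1h_ago
  FROM $table
  WHERE app_id = 999
    AND toStartOfHour(created_at) = toStartOfHour(dateadd(hour, -2, now()))
) old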

Cool, the query is ready. Now we can move on to the alert setup itself.

Step 3. Setting up alerts in the Telegram channel

The remaining configuration comes down to simple manipulations in Grafana and choosing the right settings for your specific case. For us it looks like this:

Setting up an alert.
  • Name – the name of the alert that will appear in the message in the channel.

  • Evaluate every – evaluation interval (how often the alert rule is checked).

  • For – how long the condition must hold before the alert actually fires.

  • WHEN – choose the aggregating function.

  • OF query(B, 10m, now) – check query B over the range [now, 10 minutes ago].

  • IS BELOW – select the comparison (“below”) and specify the number 10 (that is, 10 percent).

Also of interest: you can configure how errors and recovery are handled. We set both fields to Keep last state to avoid receiving alerts in those cases.

Once everything is configured, we check that the alert works: press the Test rule button. A modal window appears, where it is important to make sure there are no errors. If everything is set up correctly, the current state will be shown:

Checking that the alert works.

Hooray, we have metrics alerts!

Something went wrong?

The whole story above looks very nice and simple, right? In reality the process was not so rosy. So now it’s time for some roasting – let’s talk about what you should pay attention to when setting up similar alerts, and what we wish we had known before diving in.

1. Sensitive Grafana

Working with Grafana can be compared to going to an expensive restaurant: you expect to taste something delicious, but the haute cuisine turns out to be an acquired taste, and mom’s cooking is still better. You try to work with dashboards neatly and carefully, and yet errors pop up in a thousand different places for no apparent reason.

So, a checklist for “gentle” work with Grafana:

  • No parallel work. It is better if only one person works on a panel (or even on the whole dashboard), and only in one tab. Otherwise something will fail to save somewhere and you will have to configure it all over again.

  • Save constantly. Every change to the dashboard. Imagine you are working in an old version of Microsoft Office.

  • Double-check constantly. After saving, it is better to go back in and verify that everything was really saved.

  • Incomprehensible logs. Be prepared not to understand the error messages on the first try and to fix things by trial and error. Grafana surfaces the error straight from Clickhouse, wrapped in JSON with extra information. To tame all of this you need to be Master Shifu, and even that might not be enough.

2. Clickhouse instability

This is not so much about development as about how Clickhouse is administered in the company. Unfortunately, scheduled maintenance can seriously affect our alerts. Any update rollout, or simply taking the service down, makes the metrics drop to zero because of the unavailability. And we get alerts. Not fatal, but sad. And a reason to wonder: do you really need all this?

3. Lack of information about stack configuration

This item is perhaps my favorite. Be prepared that you will not be able to google your errors, and set aside 3x the time for the task. There is very little information on the web about how the Grafana + Clickhouse combination behaves. The most reliable sources are the service and plugin documentation; if that is not enough, then GitHub issues and asking around.

4. “You don’t have the latest version!”

A continuation of point 2. Story time. Before taking on the production task, I thought it would be a good idea to figure out the alerts on my local machine. I downloaded Grafana, played with the settings, wrote scripts and even set up some alerts; I was pleased. Then, full of enthusiasm and with the feeling that the task was basically done, I went to production. It turned out that the Grafana versions in production and on my local machine were different. And that matters enormously, because none of my scripts worked, and I essentially had to do the task from scratch.

Moral: test in the same environment in which you will work.

Moral 2: be prepared that someday the versions will be updated, and you will find out about this when your metrics fall off.

5. Version differences, continued

So what breaks in production? One of the most significant problems was the Time Series, namely working with timestamps. In the latest version of Grafana, all the code above works fine. But our production (at the time of writing) was not on the latest version! So, with a slight sleight of hand, the time column turns from

toStartOfMinute(created_at) as time

into

toUnixTimestamp(toStartOfMinute(created_at), 'UTC')*1000 as time

A question for the Universe: how was I supposed to guess this transformation in advance?! 🙂
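For completeness, here is the first query from Step 1 with the same adjustment applied. This is simply the transformation above pasted into our example; a newer Grafana may not need it at all:

SELECT
  toUnixTimestamp(toStartOfMinute(created_at), 'UTC') * 1000 AS time,
  uniqIf(user_id, event_type = 'page_view') AS show_metric
FROM $table
WHERE app_id = 999 AND $timeFilter
GROUP BY time
ORDER BY time
LIMIT 15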

6. “Overfitting” of the algorithm

This point is about some phantom problems that came up during setup. Initially we had a very simple algorithm in mind: count the number of metrics per day and check whether it dropped to zero. But for some unknown reason, the alert for the daily count never fired. It simply was never sent, and that was that. If you have thoughts on this, or have run into it yourself, write in the comments – I would be glad to hear your stories.

Since we never got any alerts that way, we had to improvise and come up with a new algorithm. Which seems to have been for the best ✨

Outcomes

It’s funny that a modern front-end developer has to be able to do much more than just style buttons. That opens up many more opportunities for us to build convenient and useful products. But before adopting something new, you need to think hard: do you actually need it?

I would be glad if our experience turns out to be useful to you. Have a good web, everyone ✌🏻

Instead of an afterword: useful links
