The path from creating a basic monitoring system to an automation and decision-making system

We will go step by step from the first beautiful dashboard to deep automation, which lets the monitoring system make decisions independently, eliminate incidents, prevent them at early stages, and minimize their impact on the client and the business. You will also find a chapter dedicated to forecasting deviations using ML/AI models.

The approaches and best practices in this article are not tied to a specific tech stack. It could be a solution from a large global vendor, an in-house development, or a whole “zoo of technologies” combining both of these options.

Therefore, this article is equally useful for:

  • companies of any size that have online services and operations for customers, as well as automated solutions that support them;

  • companies at any level and stage of development of the monitoring system;

  • specialists who have experience in creating monitoring systems;

  • experts in this field who have the opportunity to move to a new stage of development.

Start

A high-quality monitoring system is especially important now that services are provided to clients 24/7/365 and the gold standard of reliability is at least 99.99% availability of automated systems. We stopped waiting for the start of the working day long ago: we submit meter readings, book hotels, take out loans and much more on weekends and in the middle of the night, because services are always available to us.
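To make the 99.99% figure tangible, here is a small sketch that converts an availability target into the downtime it allows. The function name and the choice of a non-leap year as the period are illustrative assumptions, not part of any particular monitoring product.

```python
# Sketch: how much downtime a given availability target allows.
# Assumption: the period is a non-leap year (365 days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) that still meets the target."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

At 99.99%, the whole yearly budget is under an hour, which is why the time to detect and react matters so much in the levels described below.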

We will now walk the path from a basic monitoring system to an automation and decision-making system, after which you can draw up an individual plan for developing your own monitoring system. But first, identify the current level of monitoring development in your company. Each level comes with a brief summary and hints (hereinafter referred to as HINT).

Levels of monitoring development in companies:

  0. Zero. You learn about problems in your systems from your own customers or from the media.

  1. Basic, or “Zyring”. Metrics, agents and dashboards.

  2. Advanced. Availability of events, notifications and alerts.

  3. Advanced monitoring with basic automation.

  4. Advanced monitoring with metrics and events prediction.

  5. PRO. Maximum automation at the level of a nuclear power plant. You no longer need the duty officer and the administrator of the automated system.

You are here…

LEVEL 0. Zero

The lowest level. You learn about serious failures and incidents from your clients when the situation has already gotten out of control. At the same time, you may learn about minor or isolated deviations very late or never at all. Incident resolution starts only after the situation is already at its worst, so the impact and damage are at their maximum.

HINT: Do not deny the need for a monitoring system and do not postpone building it. The main thing is to start; once you do, you will not want to stop.

LEVEL 1. Basic, or “Zyring”

At this stage, you have already organized basic monitoring: you have selected the system it is built on, installed the first infrastructure agents on your servers, and started creating your first dashboard with metrics.

HINT: Don't limit yourself to infrastructure-only monitoring at this stage. There are many essentially free, agentless ways to enrich it with other metrics without modifying your system: requesting the authorization or service URL and checking the response code, querying metrics from the database, extracting metrics from logs, and so on.
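As an illustration of such an agentless check, the sketch below requests a URL and records the response code and latency using only the Python standard library. The `ProbeResult` type, the `probe` function, and the rule that any 2xx or 3xx code counts as healthy are assumptions made for this example; adapt them to your own service.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    url: str
    status: int        # HTTP status code, or 0 if no response was received
    latency_ms: float  # time spent on the request, in milliseconds
    ok: bool           # assumption: 2xx and 3xx count as healthy


def probe(url: str, timeout: float = 5.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Request the URL once and turn the outcome into metric values."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error code
    except (urllib.error.URLError, OSError):
        status = 0         # DNS failure, refused connection, timeout, etc.
    latency_ms = (time.monotonic() - start) * 1000
    return ProbeResult(url, status, latency_ms, 200 <= status < 400)
```

Calling `probe("https://your-service.example/login")` on a schedule (the URL here is a placeholder) gives you two application-level metrics, status and latency, without installing anything on the monitored system. The `opener` parameter exists only so the function can be exercised without network access.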

Infrastructure monitoring alone is not enough: a CPU load above 90% does not, by itself, let you conclude that the service is degraded. The reverse is also true: application monitoring alone is not enough. If a rarely used operation shows no activity on a weekend night, you cannot clearly conclude that the system has a problem. Perhaps your clients are simply not using the service at that moment.

HINT: When setting up events and notifications, create complex conditions that take into account the behavior of application and infrastructure metrics, as well as their correlation with each other.
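A composite condition of this kind might look like the following sketch. All thresholds, field names, and the four result labels are assumptions chosen for illustration; the point is only that infrastructure and application metrics are evaluated together, and that low traffic is treated as "cannot conclude" rather than as a failure.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_percent: float         # infrastructure metric
    error_rate_percent: float  # application metric: failed / total operations
    operations_per_min: float  # application metric: traffic volume


def assess(s: Snapshot, cpu_limit: float = 90.0, error_limit: float = 5.0,
           min_traffic: float = 10.0) -> str:
    """Combine infrastructure and application signals into one verdict."""
    # Too little traffic: application metrics say nothing reliable.
    if s.operations_per_min < min_traffic:
        return "inconclusive"
    cpu_high = s.cpu_percent >= cpu_limit
    errors_high = s.error_rate_percent >= error_limit
    if cpu_high and errors_high:
        return "critical"   # both layers confirm the degradation
    if cpu_high or errors_high:
        return "warning"    # one layer deviates, worth a look
    return "ok"
```

For example, `assess(Snapshot(95, 0.1, 200))` yields `"warning"`: the CPU is busy, but clients are being served normally.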

So, you have a number of metrics for one or more systems, and you start creating your first dashboard. Below are some rules for creating it.

LEVEL 2. Advanced

As your monitoring system develops and new consumer systems and metrics are connected to it, you will soon realize that the number of dashboards keeps growing, but it is no longer possible to keep track of all the metrics visually and, most importantly, in a timely manner. At this point, your monitoring system will begin to move to a new level: working with alerting, notifications and the event console.

HINT: There is no such thing as too many notification channels. Email alone is not enough, and it is not a means of prompt response. Good practices include setting up SMS notifications for critical events and notifications to work chats (regardless of what tools you use).

It is important to divide events by their level of criticality. Start with three states: low, medium, high. When events fire simultaneously for several metrics, the duty officer needs to understand which of them are more important and should be handled first.
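The three states and the "most important first" ordering can be sketched as follows. The `Severity` and `Event` types and the tie-breaking rule (older events first within the same severity) are assumptions for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Event:
    metric: str
    severity: Severity
    raised_at: float  # Unix timestamp of when the event fired


def triage(events):
    """Order events for the duty officer: highest severity first,
    and within one severity, the oldest event first (an assumption)."""
    return sorted(events, key=lambda e: (-e.severity, e.raised_at))
```

With this ordering, two simultaneous HIGH events appear above any MEDIUM or LOW event regardless of the order in which they arrived in the console.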

HINT: An event or notification should contain the maximum amount of information: the name of the system, the current metric value, the threshold value, the values of auxiliary metrics, recommended actions, and a link to detailed instructions that list the contacts of key administrators.
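One way to make sure no field from this list is forgotten is to describe the notification as a structure and render the message from it. The sketch below is a minimal illustration; the class name, field names, and message layout are assumptions, and the values in the usage note are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    system: str
    metric: str
    current_value: float
    threshold: float
    auxiliary: dict = field(default_factory=dict)  # related metric values
    recommended_action: str = ""
    runbook_url: str = ""  # detailed instructions and administrator contacts

    def render(self) -> str:
        """Build the text sent by email, SMS, or chat."""
        lines = [
            f"[{self.system}] {self.metric} = {self.current_value} "
            f"(threshold {self.threshold})"
        ]
        lines += [f"  {name} = {value}" for name, value in self.auxiliary.items()]
        if self.recommended_action:
            lines.append(f"Action: {self.recommended_action}")
        if self.runbook_url:
            lines.append(f"Instructions: {self.runbook_url}")
        return "\n".join(lines)
```

For instance, `Notification("billing", "error_rate_percent", 7.5, 5.0, {"cpu_percent": 93}, "Check the database connection pool").render()` produces a message the duty officer can act on without opening a dashboard first.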

When you reach a large number of metrics, you will feel the need to introduce a cross-cutting identifier and create a single instruction with the necessary reactions of the duty administrator.

HINT: Introduce the metric codifier (a single identifier scheme for all metrics) right away, so that you don’t have to redo the metrics, instructions, and alerts that you’ve already created.
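The article does not prescribe a format for the cross-cutting identifier, so the scheme below is purely hypothetical: a system code, a layer (infrastructure, application, or business), and a three-digit number, for example `PAY-APP-001`. The sketch shows how such an ID can be validated and parsed so that metrics, alerts, and instruction sections all refer to the same key.

```python
import re
from typing import NamedTuple

# Hypothetical format: SYSTEM-LAYER-NNN, e.g. "PAY-APP-001".
METRIC_ID = re.compile(
    r"^(?P<system>[A-Z]{2,8})-(?P<layer>INF|APP|BIZ)-(?P<number>\d{3})$"
)


class MetricId(NamedTuple):
    system: str
    layer: str
    number: int


def parse_metric_id(raw: str) -> MetricId:
    """Validate an identifier and split it into its parts."""
    match = METRIC_ID.match(raw)
    if match is None:
        raise ValueError(f"not a valid metric ID: {raw!r}")
    return MetricId(match["system"], match["layer"], int(match["number"]))
```

Whatever format you choose, validating it at the point where a metric is registered keeps the codifier consistent as the number of metrics grows.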

LEVEL 3. Advanced monitoring with basic automation

After setting up monitoring, alerts, dashboards and notifications, you will notice that often the set of actions performed by the administrator or duty officer is typical, but these actions take time each time. This is especially important at the time of an incident, when every second counts.

HINT: automate absolutely (!) all the standard actions that are possible. Your task is to focus on resolving the deviation, not on the standard routine or bureaucracy.

For each event, it is important to set up the correct scenario based on pre-filled templates. An incident will be opened for the correct service and will already contain all known diagnostic information. All the specialists needed to resolve the incident will be brought into the conference call, and all the necessary diagnostic information will be sent by email. Logs will be collected for root cause analysis.

All this will allow you to save 7-10 minutes that you would have spent during an incident without configured automation.
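The scenario described above can be sketched as an ordered list of steps run automatically when an event fires. Everything here is a stand-in: the step functions only return text, whereas in a real setup each would call your service desk, conferencing, and log collection tools. The function names and the shape of the `event` dictionary are assumptions.

```python
from typing import Callable, Dict, List


def open_incident(event: Dict) -> str:
    return f"incident opened for {event['service']}"


def start_conference(event: Dict) -> str:
    return "conference started with " + ", ".join(event["teams"])


def collect_logs(event: Dict) -> str:
    return f"logs collected from {event['host']}"


def run_scenario(event: Dict, steps: List[Callable[[Dict], str]]) -> List[str]:
    """Run every step in order; a failing step is recorded, not fatal,
    so the remaining routine actions still happen."""
    journal = []
    for step in steps:
        try:
            journal.append(step(event))
        except Exception as exc:
            journal.append(f"{step.__name__} failed: {exc!r}")
    return journal
```

The design choice worth noting is that one broken step does not stop the rest: during an incident, a failed log collection should not prevent the incident from being opened or the conference from starting.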

LEVEL 4. Advanced monitoring with metrics and events prediction

In the next step of developing your monitoring system, you will find that configured events and alerts are not always sufficient, even if thresholds are set to react proactively to prevent incidents.

At this point, you will want to predict events and service degradation long before they have negative consequences for the client, so that you can eliminate the possible causes of a future incident. This is where forecasting using AI/ML solutions comes in.

In my own experience, we reached a lead time of 15 minutes. This is the time during which the administrator can connect to the system, then detect and eliminate the causes of a failure that could otherwise happen in the future. At the same time, the accuracy of the forecast was above 80%. That is, at least 4 out of 5 notifications from the monitoring system pointed to a situation that could actually lead to service degradation and incidents.

It is important to explain to administrators (it will be difficult and take time) that the ML/AI model makes a forecast similar to a weather forecast, which says that you need to take an umbrella because of possible precipitation. But if in 1 out of 5 cases (with a forecast accuracy of 80%) there was no rain, this does not mean that the model is bad and the forecast does not work.
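To give a feel for what "lead time" and "4 out of 5" mean in numbers, here is a deliberately simple sketch. It is not the ML/AI model discussed in the text: it fits a straight line (ordinary least squares) to recent samples of a rising metric and estimates how many minutes remain until a threshold is reached. The second function computes the share of alerts that were justified, which in ML terminology is usually called precision. Both function names are assumptions for this example.

```python
from typing import List, Optional, Tuple


def minutes_until_threshold(samples: List[Tuple[float, float]],
                            threshold: float) -> Optional[float]:
    """samples: (minute, value) pairs, oldest first.
    Returns minutes from the last sample until the fitted line reaches
    the threshold, or None if the metric is not trending upward."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or falling: no crossing expected
    intercept = mean_v - slope * mean_t
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


def precision(justified_alerts: int, total_alerts: int) -> float:
    """Share of predictive alerts that pointed to a real problem."""
    return justified_alerts / total_alerts
```

A metric that grows by 2 units per minute from 50 and has a threshold of 86 gives the administrator 15 minutes after the fourth sample; `precision(4, 5)` is the 80% from the umbrella analogy. A real forecasting model would account for seasonality and noise, which a straight line ignores.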

LEVEL 5. PRO

Maximum automation at the level of a nuclear power plant: you no longer need a duty officer or an administrator of the automated system! This is the last and highest level of development of the monitoring system. At this stage, it ceases to be a monitoring system and becomes a decision-making system that makes independent decisions, often without the intervention of an administrator or duty officer.

HINT: Due to the large number of metrics (coverage), high accuracy of forecasts and your trust in the monitoring system, you give it more opportunities to make independent decisions in order to prevent incidents, degradation of service or performance.

HINT: This can be implemented using control signals, API calls, or additional modules implemented as part of the monitoring system or as part of the automated system that you are monitoring.
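A minimal sketch of such a decision step is shown below, under two assumptions that are not from the article: the system acts on its own only for metrics that have an explicitly registered action, and only when the forecast confidence is at or above a configurable level (0.8 here). In every other case it hands the event to a person. The function names and metric IDs are hypothetical, and the action is a stub standing in for a real API call or control signal.

```python
from typing import Callable, Dict


def restart_worker_pool() -> str:
    # Stand-in for a real API call or control signal.
    return "worker pool restarted"


def decide(metric_id: str, confidence: float,
           actions: Dict[str, Callable[[], str]],
           min_confidence: float = 0.8) -> str:
    """Act automatically only when an action is registered for the metric
    and the forecast is trusted enough; otherwise escalate to a human."""
    action = actions.get(metric_id)
    if action is None or confidence < min_confidence:
        return f"escalate {metric_id} to the duty officer"
    return action()


ACTIONS = {"PAY-APP-001": restart_worker_pool}  # hypothetical allow-list
```

Keeping an explicit allow-list of actions lets you expand the system's autonomy gradually, one metric at a time, as your trust in its forecasts grows.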

Congratulations! You've made it to the finals…

…and you get a bonus in the form of a hint about which unusual functions can be further automated and assigned to the monitoring system.

The monitoring system is the eyes and hands of the administrator and the engineer on duty. The quality of support for your systems depends on how well the monitoring system works, how reliable it is, and how mature it is. Creating an effective monitoring system requires significant resources. But all investments in it are justified, because they are much smaller than the potential losses from not having one. And for socially significant and state-owned companies, reputational and financial losses can have serious consequences for society as a whole.
