Monitoring in simple terms, or how I explained the work of an SRE to my mother

Even though I am not the only IT specialist in the family, explaining my profession turned out to be rather difficult. “What is SRE? Like a system administrator or something? What is the difference?” And indeed, in the Russian Federation the boundaries between DevOps and SRE are blurred, and companies hiring a system administrator are looking for a warlock with experience in stabilizing production, so it is not surprising that a person with no connection to this field can get confused.

Achtung!

This is more of a life story. The author does not claim that this article is the ultimate truth; it is more of an ABC for those who want to understand the basic principles, monitoring in particular. For precise definitions and useful practices, go to Google and read their books on SRE. The books are very interesting, by the way!

This article also briefly covered a similar topic.

And yes, in this article we will not talk about specific tools or start holy wars such as ELK vs Loki.

If you notice any typos or gross errors, write in the comments so that I can correct them. I will be very grateful 🙂


Who are SREs?

SRE stands for Site Reliability Engineering (or Engineer): it is both a profession and a set of principles and practices for building fault-tolerant and scalable systems.

Often you don't need SRE. It's unlikely that a small toy store with a simple one-page website used by two and a half people a week will need a large, fault-tolerant, scalable system. It's expensive, it's complicated, and the people who know how to manage it are also expensive.
However, if you run a worldwide network of such stores (let's call it “Teen Universe”), you will run into many problems, and hiring a single SRE will not be enough to solve them. For example, SREs do the following:

  • Operation (maintaining stable operation of the service, Ops)

  • Capacity planning

  • Incident management

  • Monitoring and alerting

SRE's primary focus is the reliability and availability of the service.

For the curious: DevOps, SRE and system administration

TL;DR: DevOps is a philosophy of speeding up the whole path of code through development, testing, integration, and delivery, while SRE is a set of practices aimed at high service reliability. SRE uses DevOps ideas to some extent; it just points them in a different direction.

Now in a little more detail:

Operations engineers (they are also called system administrators) are divided into different types:

  • Technicians (support technicians), or Enikeyschiki – specialists who deal with “household stuff”: setting up printers, cleaning out viruses that suddenly ended up on someone's desktop computer, and so on. The typical “you're a programmer, aren't you?” requests.

  • Infrastructure operations – specialists who directly create the company's infrastructure: select tools, automate their deployment, establish security, etc.

The operations team (often called Ops) strives to make the system as stable as possible, which leads to disagreements with the developers (Dev), who, on the contrary, want to change it quite often.

In order to unite these two teams, DevOps principles were created, the key ones of which are described by the CALMS model:

1) Culture – developers, testers, and operations are one team whose members interact with each other and share responsibility for the quality of the product.
2) Automation – we automate routine work to free up time for writing code and other creative processes.
3) Lean (from Lean Management; some variants call it Support) – mistakes happen and no one is immune to them, so instead of looking for someone to blame, it is better to spend time on training and on preventing similar cases, relying on the experience gained.
4) Measurement – we can speed up a process only when we understand what we are doing, and to understand anything at all, you need to measure it. If resources allow, measure everything, but watch only the truly significant things. The “Plyushkin” (hoarder) approach, so to speak.
5) Sharing – if you have expertise in something, share it with the other team members. Sharing information lets the team work faster because everyone has a clear understanding of the project.

These principles drive practices such as CI/CD (continuous integration, testing, and delivery of code), monitoring, and the IaC (infrastructure as code) approach. Together, they speed up the development team's efforts without disturbing Ops' peace of mind.

SRE uses DevOps, but focuses less on the speed of delivering changes and more on the overall stability of the service for end users. You could say that SRE is what happens when a developer is rotated into the operations team.

What is monitoring?

The work of an SRE can be compared to the work of a doctor, although we do not treat people.
Most likely, you have encountered examples of monitoring in your life, perhaps without even noticing:

  • stock exchanges that track stock prices and exchange rates

  • all kinds of medical devices for monitoring vital signs

  • buckwheat prices and the annual “it's gone up again!”

Monitoring is the process of collecting metrics and analyzing them to make subsequent decisions.

What does SRE measure?

SREs typically work with the following concepts:

  • SLI (Service Level Indicator) – a metric that shows the actual level of service.

  • SLO (Service Level Objective) – a target for service quality. Users are considered satisfied if the actual indicators are equal to or better than the established SLO.

  • SLA (Service Level Agreement) – a service level agreement, usually set by the business. An SLA is a set of promises that the business makes to the user, and these are the most painful to break.

Let's imagine that we are the CEO of a large postal operator, “Bystro i tochka”. Every day each of our branches receives more than a hundred items and hands out about the same number, and the parcels themselves arrive in about a week. These metrics (how many items we have received and handed out, and how long it takes for a parcel to reach the recipient) are SLIs.

We aim to deliver parcels within two weeks; this is our delivery time objective (SLO). If a parcel is not delivered within three weeks, we guarantee to initiate a search procedure and to pay compensation in case of loss, according to our service level agreement (SLA).

To differentiate between SLO and SLA, you can ask the question “what happens if this condition is not met?” If you can answer that, then you are most likely looking at an SLA.
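The postal example can be sketched in a few lines of Python. This is only an illustration: the function name and the wording of the messages are made up, and the thresholds are the two and three weeks from the example above.

```python
SLO_DAYS = 14  # objective from the example: deliver within two weeks
SLA_DAYS = 21  # agreement from the example: search and compensation after three weeks

def classify_delivery(days_in_transit: int) -> str:
    """Say which promise, if any, a given delivery time has broken."""
    if days_in_transit > SLA_DAYS:
        # There is a defined consequence, so this is the SLA.
        return "SLA breached: start the search, compensate if the parcel is lost"
    if days_in_transit > SLO_DAYS:
        # The internal target is missed, but nothing is owed to the customer yet.
        return "SLO missed: objective not met, no penalty yet"
    return "OK: within the objective"

for days in (7, 16, 25):
    print(days, "days ->", classify_delivery(days))
```

Only the SLA branch answers the question “what happens if this condition is not met?”, which is exactly the test described above.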

It is worth noting that SLIs and SLOs are usually expressed as percentages. The SLI is the share of “good” events among all events:

SLI = good / total × 100%
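As a minimal sketch, the same formula in Python. The function name and the numbers are invented for illustration; what counts as a “good” event depends on the service.

```python
def sli_percent(good: int, total: int) -> float:
    """SLI as the percentage of good events among all events."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    return good / total * 100

# Made-up numbers: 985 of 1,000 parcels were delivered on time.
print(f"SLI = {sli_percent(985, 1000):.1f}%")
```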

The following SLIs are often used in IT:

  • Resource utilization

  • Number of errors (HTTP 4xx and 5xx responses)

  • Response time

  • Total availability time

  • The amount of time the system operates without errors

and others. You can read more about them in this article.

Obviously, we cannot maintain an SLO of 100%. There is no system (yet) that always works without errors, so SLOs are usually measured in “nines”: 99%, 99.9%, 99.99%, etc. The error budget is derived from the SLO: it is the share of failures we can afford without violating the objective.
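To get a feel for what the “nines” mean, here is a rough sketch that converts an availability SLO into the downtime it allows. The 30-day window and the function name are assumptions made for the example.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # assumed measurement window

def allowed_downtime(slo_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Downtime that still fits into an availability SLO over the window."""
    return window * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime(slo)} per 30 days")
```

Over 30 days this comes to roughly 7.2 hours for 99%, about 43 minutes for 99.9%, and a little over 4 minutes for 99.99%. Each extra nine cuts the allowance by a factor of ten.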

For example, if 100,000 parcels arrive at the sorting center every day and we want to process 99% of them, we can afford to leave only 1,000 unprocessed.
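The parcel arithmetic can be written down as a small sketch. The function names and the number of parcels already missed today are hypothetical.

```python
def error_budget(total_events: int, slo_percent: float) -> int:
    """Number of events that may fail without violating the SLO."""
    return round(total_events * (100 - slo_percent) / 100)

def budget_remaining(total_events: int, slo_percent: float, failed: int) -> int:
    """How much of the budget is left (negative means the SLO is already violated)."""
    return error_budget(total_events, slo_percent) - failed

budget = error_budget(100_000, 99.0)
print("Error budget:", budget, "parcels")

# Hypothetical: 400 parcels are already stuck today.
print("Remaining:", budget_remaining(100_000, 99.0, failed=400), "parcels")
```

While the remaining budget is positive, the team can afford risk (releases, experiments); once it is spent, stability comes first.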

SLO can be specified for different time windows, and they can be fixed by the calendar (month/year), or with a sliding window. Both methods have their pros and cons.

Sliding window

  • Pro: takes the dynamics over a period of time into account and is closer to the user experience

  • Con: during holidays user traffic increases, so SLIs can fluctuate a lot

Fixed window

  • Pro: simple and convenient for business processes

  • Con: users won't forget the December 31 fuckup just because a new month has begun
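The difference between the two windows is easy to see on made-up data. In the sketch below every name and number is invented: the service handles 100 requests a day without errors, except for a bad December 31, and we look at the SLI on January 2.

```python
from datetime import date, timedelta

def sli_percent(counts, days):
    """SLI over the given days; counts maps day -> (good, total)."""
    good = sum(counts[d][0] for d in days)
    total = sum(counts[d][1] for d in days)
    return good / total * 100

def calendar_month_sli(counts, today):
    """Fixed window: only the days of the current calendar month up to today."""
    days = [d for d in counts
            if (d.year, d.month) == (today.year, today.month) and d <= today]
    return sli_percent(counts, days)

def sliding_window_sli(counts, today, window_days=30):
    """Sliding window: the last window_days days, ending today."""
    start = today - timedelta(days=window_days - 1)
    days = [d for d in counts if start <= d <= today]
    return sli_percent(counts, days)

# Made-up data from December 1 to January 2: 100 good requests out of 100 every day...
counts = {date(2023, 12, 1) + timedelta(days=i): (100, 100) for i in range(33)}
# ...except December 31, when half of the requests failed.
counts[date(2023, 12, 31)] = (50, 100)

today = date(2024, 1, 2)
print(f"Fixed (calendar month): {calendar_month_sli(counts, today):.1f}%")
print(f"Sliding (30 days):      {sliding_window_sli(counts, today):.1f}%")
```

The calendar window shows a clean 100% because the month has just started, while the 30-day sliding window still remembers the outage and shows about 98.3%.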

How will you know if something is wrong?

Of course, we don't sit in front of monitors 24/7 and don't watch graphs during lunch (although this sometimes happens). Just as firefighters go to the scene of an emergency on a signal, we respond to alerts.

An alert is a notification about some event. An example is a fire alarm, which reacts to smoke.

Alerts can be delivered in different ways, most commonly through a messenger or a phone call.

You can also set up notification escalations. Non-critical alerts should be sent only to Telegram, while major failures should be escalated in several stages until someone confirms them: first a message in the messenger, then a call to the first on-call engineer, and if they do not answer, a call to the second, and so on.
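An escalation policy like this can be sketched as a simple loop. Everything here is hypothetical: the step names, the `Alert` class, and the assumption that any acknowledgement stops the chain. Real alerting tools have their own configuration formats.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Alert:
    name: str
    critical: bool

# Hypothetical escalation chain for major failures, in order.
ESCALATION_CHAIN = ["messenger", "call first on-call", "call second on-call"]

def escalate(alert: Alert, acknowledged: Callable[[str], bool]) -> List[str]:
    """Return the notification steps that were attempted, in order.

    `acknowledged(step)` tells whether somebody confirmed the alert at that step.
    """
    if not alert.critical:
        return ["messenger"]  # non-critical: a message only, nobody gets a call
    attempted = []
    for step in ESCALATION_CHAIN:
        attempted.append(step)
        if acknowledged(step):
            break  # someone has taken the incident, stop escalating
    return attempted

# The first on-call engineer is asleep, the second one picks up.
print(escalate(Alert("site is down", critical=True),
               lambda step: step == "call second on-call"))
print(escalate(Alert("disk is 80% full", critical=False), lambda step: False))
```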

What do you do during major outages?

Major failures (aka incidents, aka fuckups, aka upyachki) happen, and you cannot fully insure yourself against them. The first thing to do during an incident is to stabilize the system: if you can find the problem quickly, fix it; if not, use degradation scenarios.

Degradation scenarios are the developers' ace up the sleeve: they allow you to turn off certain functions of the application. This can be useful if a specific block fails or if you need to reduce the load on the service.
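One common way to implement degradation is feature flags. The sketch below assumes a made-up online store page with made-up flag names; it only shows the idea of switching off non-essential blocks while keeping the core flow alive.

```python
# Hypothetical feature flags of an online store.
features = {"checkout": True, "recommendations": True, "reviews": True}

# Blocks we are willing to lose during an incident (an assumption for this example).
NON_ESSENTIAL = ("recommendations", "reviews")

def degrade(flags: dict, to_disable=NON_ESSENTIAL) -> dict:
    """Return a copy of the flags with the non-essential features switched off."""
    return {name: enabled and name not in to_disable
            for name, enabled in flags.items()}

def render_product_page(flags: dict) -> list:
    """List the blocks that will be shown on the product page."""
    blocks = ["product info"]
    if flags["recommendations"]:
        blocks.append("recommendations")
    if flags["reviews"]:
        blocks.append("reviews")
    if flags["checkout"]:
        blocks.append("buy button")
    return blocks

print("Normal:  ", render_product_page(features))
print("Degraded:", render_product_page(degrade(features)))
```

In degraded mode the page loses recommendations and reviews, but the customer can still see the product and buy it.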

After the service has stabilized, it is necessary to find the problem and fix it, and also create a postmortem. Just like doctors perform an autopsy, we analyze the causes of the incident to prevent its recurrence in the future. In a postmortem, all aspects are important: the chronology, the measures taken to stabilize the service, the causes themselves, and potential factors that could have influenced the development of the situation during the incident.

Conclusion

I hope this article was useful to those who wanted to get their bearings in the basic concepts. A lot had to be left out (capacity planning alone is a super-complex topic that deserves a separate article with the hard tag) in order to introduce a wide audience to SRE, vacancies for which have started to appear actively on career platforms.

If you want to get into SRE, monitoring will be one of the last things you master. Tools like Kubernetes, Terraform, GitLab, etc. can be learned while building home servers, but learning monitoring requires traffic, requires fuckups, and requires real-world scenarios, which makes this topic even more complex and interesting.

Were you able to explain to your family/friends what you do for a living?
