Monitoring availability and uptime checks

I was recently asked what kind of alerting system I would build for a highly loaded service. Afterwards I dug deeper into the topic and found a recent Google Cloud blog post about using uptime checks to monitor availability. The ideas it outlines are mostly well known (particularly from the SRE book) or intuitive, but collected in a concise form they can serve as a checklist of sorts. Here it is.

Minimize noise/false positives

Having to constantly react to one thing after another leads to stress and burnout. An alert is needed only when manual intervention is indispensable. For other incidents, tickets can be created automatically and dealt with as part of normal work. Thus, the key questions are which events we alert on and why.

Alerts should be triggered when user satisfaction drops

Ideally, urgent intervention should be required only when the failure of certain components affects the business. If customers can’t make a purchase, or even just go to a store’s website, that’s important. If a node has failed, but everything continues to work without degrading the user experience, this can wait.

Alerts should be triggered by application failures, not infrastructure failures

The more automated the infrastructure is, the higher the level of abstraction at which we alert. Ideally, we should stop caring about infrastructure failures in themselves and pay attention only to application failures. In other words, alert only when the user does not get what the application is supposed to deliver.
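The three rules above can be condensed into a small routing decision. This is only a sketch: the `Event` fields and the `route` function are hypothetical names introduced for illustration, and in practice deciding whether users are affected is the hard part.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"      # wake someone up now
    TICKET = "ticket"  # handle as part of normal work


@dataclass(frozen=True)
class Event:
    name: str
    users_affected: bool  # assumption: we can tell whether user experience is degraded
    needs_human: bool     # assumption: automation cannot resolve this on its own


def route(event: Event) -> Action:
    """Page only when users are hurt AND a human is required; otherwise file a ticket."""
    if event.users_affected and event.needs_human:
        return Action.PAGE
    return Action.TICKET
```

Under these assumptions, a failed node that automation replaces without any user impact becomes a ticket, while a broken checkout flow becomes a page.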

You can use uptime checks as a base for availability monitoring

This is a very simple and fast monitoring method that requires no complex tools or special effort. At the same time, it lets us look at our service from different angles: for example, we can monitor whether the site loads, whether it is reachable from different regions, and whether our load balancer is responding.
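To show how little is needed, here is a minimal sketch of a single HTTP uptime probe using only the Python standard library. The `check_http` function, the `CheckResult` type, and the rule "status 200-399 means up" are assumptions made for this example, not something taken from the Google post.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]  # HTTP status, or None if no response arrived
    latency_s: float
    error: Optional[str] = None


def check_http(url: str, timeout_s: float = 5.0) -> CheckResult:
    """Probe a URL once; 'up' means a 2xx/3xx response arrived within the timeout."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
        return CheckResult(url, 200 <= status < 400, status, time.monotonic() - started)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return CheckResult(url, False, exc.code, time.monotonic() - started, str(exc))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # DNS failure, refused connection, timeout, or malformed URL.
        return CheckResult(url, False, None, time.monotonic() - started, str(exc))
```

Running the same probe on a schedule from machines in several regions, and against the load balancer address as well as the site itself, gives the different angles described above.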

Uptime checks are great for monitoring critical services that are infrequently used

Let’s imagine that we have a service that receives only a few requests per day, but it is very important for us that they are processed. Request monitoring may tell you about the problem too late. Simple but regular uptime checks will show that the service is dead before an important request comes to it.

Three things to evaluate when alerting with uptime checks

  • Check frequency. Let’s imagine that the previous check just happened, after which the service immediately broke. How long will it take us to respond to an incident, and how long can we tolerate a failure?

  • Correct configuration. We can carry out uptime checks using ICMP, for example. But what if the security team changes the firewall settings and the protocol gets blocked? Keep in mind changes in the system that could break the checks themselves.

  • Again, noise reduction. Large systems generate noise anyway. If we cannot say what an engineer is expected to do in response to an alert sent to them, perhaps we do not need that alert.
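The trade-off between check frequency and noise can be put into numbers. The sketch below assumes a common but by no means universal policy: alert only after several consecutive failed checks. Both names, `worst_case_detection_s` and `ConsecutiveFailureGate`, are hypothetical, and the bound assumes the timeout is shorter than the check interval.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float, failures_to_alert: int) -> float:
    """Upper bound on the time from the outage starting to the alert firing.

    The outage may begin right after a check has started, so the k-th failing
    check starts up to k intervals later and may need the full timeout before
    it is declared failed.
    """
    return failures_to_alert * interval_s + timeout_s


class ConsecutiveFailureGate:
    """Suppress one-off blips: fire only on the N-th failure in a row."""

    def __init__(self, failures_to_alert: int) -> None:
        if failures_to_alert < 1:
            raise ValueError("failures_to_alert must be at least 1")
        self.failures_to_alert = failures_to_alert
        self._streak = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one check result; return True exactly once per failure streak."""
        if check_passed:
            self._streak = 0
            return False
        self._streak += 1
        return self._streak == self.failures_to_alert
```

With a 60-second interval, a 10-second timeout, and three failures required, the bound is 3 * 60 + 10 = 190 seconds. Requiring more failures reduces noise but lengthens that window, so the number has to be weighed against how long a failure can be tolerated.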

Google didn’t publish the article for its own sake: it also promotes its own uptime check service, which supports monitoring of GCP, AWS, and your own hosts and applications.
