Grafana OnCall – Open Source hub for alerts and incidents

Hey Habr! I was surprised to find not a single mention here of Grafana OnCall, the open source incident response tool from Grafana Labs. That needs fixing, because we are growing fast, both in stars on GitHub and as part of Grafana Cloud, and the issues on GitHub come mainly from technical leads at FAANG companies.

In short, OnCall is a tool that helps a team set up reliable alerting and incident response, meet its SLAs, and not get woken up at night by calls.

OnCall is new to the Open Source world, but it is not a new product. It started a few years ago as a separate SaaS called Amixr.IO. Then Grafana Labs acquired Amixr.IO and integrated it into its ecosystem. And just recently we were finally able to make the OnCall source code public 🎉 This means it is now available to a wider circle of users: both those who work in infrastructure without Internet access and those who simply love Open Source.

What can Grafana OnCall do?

There are a lot of features, so I will describe only the main ones.

Collect alerts from all sources

Each integration is a set of templates that are applied to the data a monitoring system sends via webhook. Several integrations are available out of the box, and we are working to extend the list to cover all open source monitoring systems. Even if your source is not on the list, nothing prevents you from accepting its alerts through a generic webhook and editing the templates yourself. The most important thing is that OnCall lets you collect alerts from the entire monitoring zoo, from Prometheus to PRTG, in a single place.
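To illustrate the "any source that can send a webhook" point, here is a minimal Python sketch that builds (but does not send) a JSON alert. The URL and the payload fields are hypothetical: a real integration URL comes from your OnCall instance, and the payload shape is whatever your monitoring system emits.

```python
import json
import urllib.request

# Hypothetical URL: replace it with the one shown for your integration.
WEBHOOK_URL = "https://oncall.example.com/your-integration-url/"


def build_alert_request(url, payload):
    """Wrap an arbitrary JSON payload into a POST request for a webhook."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the integration templates decide which fields matter.
alert = {
    "title": "High CPU on db-1",
    "state": "alerting",
    "labels": {"service": "db", "region": "eu"},
}
request = build_alert_request(WEBHOOK_URL, alert)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Since the templates are applied on the OnCall side, the sender does not need to know anything about how the alert will eventually be rendered.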

Format alerts in templates

Inside OnCall, you can do a lot of magic with templates. For example, you can change how alerts look, attach links to runbooks, cut out unnecessary noise, and so on. You can make alerts pleasant to read. For example, they might look like this:

Or even neater:

Powerful formatting is available through Jinja2, including conditional operators and iterators. Every field can be formatted separately:

Group

One of OnCall's jobs is to fight alert storms in chats. Grouping is driven by this template:

All alerts for which the template renders to the same value are grouped together. For example, you can group by all labels: "{{ payload.labels }}", or by service: "{{ payload.labels['service'] }}", or by service and region: "{{ payload.labels['service'] }}_{{ payload.labels['region'] }}".
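The grouping rule can be sketched in plain Python. This is only an illustration of the idea, not OnCall's implementation: the function `grouping_key` is a hypothetical stand-in for the "service and region" template, and the sample payloads are made up.

```python
from collections import defaultdict


def grouping_key(payload):
    """Plain-Python stand-in for the service-and-region grouping template."""
    labels = payload["labels"]
    return f"{labels['service']}_{labels['region']}"


def group_alerts(payloads):
    """Put alerts whose key renders to the same string into one group."""
    groups = defaultdict(list)
    for payload in payloads:
        groups[grouping_key(payload)].append(payload)
    return dict(groups)


# Made-up alerts: two for the same service and region, one for another.
alerts = [
    {"labels": {"service": "db", "region": "eu"}, "title": "High CPU"},
    {"labels": {"service": "db", "region": "eu"}, "title": "Slow queries"},
    {"labels": {"service": "api", "region": "us"}, "title": "5xx spike"},
]
groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

The two "db" alerts from "eu" end up in one group and the "api" alert in another, so a chat would see two messages instead of three.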

Integrate with Slack and Telegram

Integration with messengers is what we are truly proud of. Look at the screenshot:

There are some cool details here:

  • Only one message went to the general channel, even though 1069 (!) alerts were grouped into it. We hid all that “garbage”.

  • OnCall neatly tagged the on-call engineer in the thread.

  • OnCall added an escalation log to the thread (more on that later). When the incident was opened, the upcoming steps were displayed in the log.

  • OnCall regularly checks whether the on-call engineer has forgotten about an incident that is marked as Acknowledged but has not been closed. If the engineer disappears somewhere, OnCall will re-escalate the incident.

Escalate

The “Escalator” is the heart of OnCall. Accepting an incident is not enough: someone has to be notified about it, and in a way that makes sure the SLA is met. OnCall has this escalation editor:

You can escalate to whoever is on call according to the schedule (more on that later), build a multi-level safety net in which an incident missed by one person goes to the next, pause escalations overnight, and so on. The main point is that this helps make sure someone on call picks the incident up and you meet your SLA.
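The multi-level idea can be sketched as a small simulation. This is a simplified model under stated assumptions, not how OnCall implements escalation: `Step`, `escalate` and the people in the chain are hypothetical, and waiting is only recorded in the log rather than actually performed.

```python
from dataclasses import dataclass


@dataclass
class Step:
    notify: str        # who gets notified at this step
    wait_minutes: int  # how long to wait for an acknowledgement


def escalate(steps, acknowledged_by):
    """Walk the chain until someone acknowledges; return the log of actions.

    `acknowledged_by` is the set of people who would react to a notification
    (a stand-in for real human behaviour in this simulation).
    """
    log = []
    for step in steps:
        log.append(f"notify {step.notify}")
        if step.notify in acknowledged_by:
            log.append(f"acknowledged by {step.notify}")
            return log
        log.append(f"no answer after {step.wait_minutes} min, escalating")
    log.append("chain exhausted, incident still open")
    return log


chain = [
    Step(notify="primary on-call", wait_minutes=5),
    Step(notify="secondary on-call", wait_minutes=10),
    Step(notify="team lead", wait_minutes=15),
]
log = escalate(chain, acknowledged_by={"secondary on-call"})
for line in log:
    print(line)
```

Here the primary misses the notification, so the incident moves on to the secondary, who acknowledges it and stops the chain.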

Schedule on-call shifts

This is one of the main features of Grafana OnCall for distributed teams. For example, in our team, the schedule looks like this:

And yes, we edit the on-call schedule right in Google Calendar. Grafana OnCall pulls the updates, and that is how it knows whom to notify about an incident.
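Conceptually, resolving "who is on call right now" from calendar events comes down to finding the shift that covers a given moment. The sketch below assumes the calendar has already been parsed into shifts; `Shift`, `on_call_at` and the names are hypothetical and only illustrate the lookup.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Shift(NamedTuple):
    person: str
    start: datetime
    end: datetime  # exclusive


def on_call_at(shifts, moment) -> Optional[str]:
    """Return whoever's shift covers `moment`, or None if there is a gap."""
    for shift in shifts:
        if shift.start <= moment < shift.end:
            return shift.person
    return None


UTC = timezone.utc
# Made-up shifts: alice covers the first half of the day, bob the second.
shifts = [
    Shift("alice", datetime(2022, 9, 1, 0, tzinfo=UTC), datetime(2022, 9, 1, 12, tzinfo=UTC)),
    Shift("bob", datetime(2022, 9, 1, 12, tzinfo=UTC), datetime(2022, 9, 2, 0, tzinfo=UTC)),
]
print(on_call_at(shifts, datetime(2022, 9, 1, 13, tzinfo=UTC)))
```

Keeping all timestamps timezone-aware matters for distributed teams, because the same shift is "morning" for one engineer and "night" for another.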

For those who like to plan things carefully, we have prepared a Terraform provider that lets you store the on-call schedule in Git, configure very, very, very complex rotations, and accept changes through a pull/merge request: https://grafana.com/blog/2022/08/29/get-started-with-grafana-oncall-and-terraform/

What’s under the hood

A notification system must, above all, deliver the alert; everything else comes second. That is why we use what we call the “boring stack”, which has proven itself over the years and which we know very well: Django, Celery, MySQL, RabbitMQ, Redis.

When we announced the Open Source release on Hacker News, we got a lot of criticism. Commenters felt that OnCall required too many dependencies. Although it is packaged for Helm and Docker Compose and in most cases starts with a couple of commands, everyone is put off by the “extra” database, queue and cache.

On the one hand, they really are superfluous if you just want to run OnCall to play around. On the other hand, OnCall has our three years of experience with several thousand customers under the hood. During those three years, every undelivered or delayed incident was investigated and led to some kind of improvement. Now OnCall can degrade partially in a controlled way, rate-limit correctly when one of several sources goes berserk, keep running when the Slack API behaves oddly, and recover instantly. I think all these details are a good topic for a separate article.
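As an illustration of rate limiting a single noisy source without affecting the others, here is a minimal fixed-window limiter in Python. It is a generic textbook technique, not OnCall's actual mechanism; the class name, the limits and the source names are assumptions made for the example.

```python
from collections import defaultdict


class PerSourceRateLimiter:
    """Fixed-window limiter: at most `limit` alerts per source per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._window_start = {}         # source -> start of current window
        self._count = defaultdict(int)  # source -> alerts seen in that window

    def allow(self, source, now):
        """Return True if an alert from `source` at time `now` is accepted."""
        start = self._window_start.get(source)
        if start is None or now - start >= self.window_seconds:
            # First alert from this source, or the old window has expired.
            self._window_start[source] = now
            self._count[source] = 0
        self._count[source] += 1
        return self._count[source] <= self.limit


limiter = PerSourceRateLimiter(limit=3, window_seconds=60)
# One source fires five alerts within five seconds...
noisy = [limiter.allow("prometheus", now=t) for t in range(5)]
# ...while another source sends a single alert in the same period.
quiet = limiter.allow("prtg", now=4)
print(noisy, quiet)
```

Because the counters are kept per source, the source that goes berserk is throttled while alerts from every other source still get through.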

For cases where stability is not so important, we are preparing a miniature version.

How to install

We have prepared three main environments: a developer environment, a production environment (so far only Helm for Kubernetes), and an environment for just poking around. If the environment you need is missing, please let us know right away in the issues on GitHub.

How to participate

Grafana OnCall is an Open Source project, and we are incredibly happy every time we get a PR from the community. If you want to add an integration with a new monitoring system or a new messenger, or just fix something simple, welcome to our GitHub: https://github.com/grafana/oncall

We have a Russian-language chat on Telegram, where the developers hang out: https://t.me/amixr_ru

We also hold monthly Zoom calls. The next one is on September 28th: we will show a secret feature that we have been working on for almost 2 months 😏, and we will also convene a working group on the Mattermost integration, to close the most popular issue. Come to the call, it is a great chance to join Open Source development: https://github.com/grafana/oncall/discussions/451
