Reducing on-call burnout through more effective alert monitoring
Introduction
Many of you have probably come across the "This is fine" meme, or the original comic it comes from. That is what a typical day can look like for many on-call engineers. They receive a lot of alerts, and dealing with too many of them leads to alert fatigue: a feeling of exhaustion caused by responding to alerts that have no priority or no clear actionable outcome. Making sure alerts are actionable, accurate, and not false positives is critical, because engineers who are constantly paged with false alerts may become disengaged and ignore even the important ones. To this end, multiple teams at Cloudflare run periodic alert reviews, with each team building its own dashboards for reporting. As members of the Observability team, we have seen teams report inaccuracies in alerts, or cases where alerts were not firing at all, and we have helped them combat noisy and flapping alerts.
The Observability team aims to broaden understanding of our tech stack by collecting and analyzing a wider range of data. In this post, we take a deeper look at alert observability: why it matters, and how we approach it at Cloudflare. We also look at how we closed the gaps in alert reporting in our architecture to make troubleshooting easier, using open source tooling and best practices. If you are interested in using alerts effectively and improving visibility, resilience, and the health of your on-call engineers, read on to learn about our approach and solutions.
Being on-call can be a real stressor: it disrupts sleep, eats into social and leisure time, and can lead to burnout. While burnout has several causes, one of them is receiving redundant or false alerts that are not important and require no action. Alert analytics helps reduce this risk by cutting unnecessary interruptions and improving the overall efficiency of the on-call process. It involves periodically reviewing alerts and feeding the results back into the system to improve alert quality. Unfortunately, only some companies or teams do this, even though it is information every on-call engineer or manager should have access to.
Alert analysis is useful for on-call engineers: it lets them easily see which alerts fired during their shift, helps them write handoff notes, and ensures nothing important is missed. Managers can build reports on top of these statistics to track improvements over time and to assess how exposed their on-call staff are to burnout. Alert analysis also helps with incident review, showing whether the right alerts fired and when an incident actually began.
Let's first understand how the alert stack is structured and how we used open source tools to gain greater visibility, allowing us to analyze and optimize its effectiveness.
Prometheus Architecture at Cloudflare
At Cloudflare we rely heavily on Prometheus for monitoring. We have data centers in over 310 cities, and each of them runs multiple Prometheus instances. In total, we have over 1,100 Prometheus servers. All alerts are sent to a central Alertmanager, where we have various integrations set up to route them. In addition, using Alertmanager webhooks, we store all alerts in a data warehouse for analysis.
Alert life cycle
Prometheus collects metrics from configured targets at specified intervals, evaluates rule expressions, displays the results, and can trigger alerts when alert conditions are met. Once an alert is in the firing state, it is sent to Alertmanager.
Depending on the configuration, when Alertmanager receives an alert, it can inhibit, group, silence, or route the alerts to a suitable receiver integration, such as chat, PagerDuty, or a ticketing system. When configured correctly, Alertmanager can reduce alert noise. Unfortunately, this is not always the case, as not all alerts are configured optimally.
In Alertmanager, alerts initially enter the firing state, where they can be inhibited or silenced. They return to the firing state when the silence expires or the suppressing alert is resolved, and eventually enter the resolved state.
Alertmanager sends notifications for firing and resolved events via its webhook integration. We used alertmanager2es, which receives alert notifications from Alertmanager via a webhook and inserts them into an Elasticsearch index for searching and analysis. alertmanager2es has been a reliable tool for us over the years, giving us ways to monitor alert volume, spot noisy alerts, and build some basic alert reporting. However, it had its drawbacks: because the silenced and inhibited states were missing, troubleshooting was difficult. We often found ourselves wondering why an alert wasn't firing. Was it silenced, or was it inhibited by another alert? Without concrete data, we couldn't confirm what was really going on.
Since Alertmanager does not send notifications for silenced or inhibited alert events via its webhook integration, the alert reporting we were doing was incomplete. However, the Alertmanager API does provide query capabilities: by querying the Alertmanager /api/alerts endpoint, we can retrieve data for silenced and inhibited alerts. Having all four states in the data warehouse enhances our ability to report on alerts and to troubleshoot issues with Alertmanager.
Solution
We decided to combine all alert states (firing, silenced, inhibited, and resolved) in one data store. Since we collect data from two different sources (the webhook and the API), each with a different format and potentially representing different events, we correlate alerts from both sources using the fingerprint field: a unique hash of an alert's label set that lets us match alerts between Alertmanager webhook payloads and API responses.
The Alertmanager API returns additional fields that the webhook payload does not, such as the silencedBy and inhibitedBy IDs, which let us identify alerts in the silenced and inhibited states. We store webhook and API responses in the data store as separate rows, and at query time we match them on the fingerprint field.
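As an illustration, here is a sketch of the kind of lookup this enables, written against the ClickHouse alerts table described in the next section (it assumes the fingerprint, alertname, status.state, status.silencedBy, and status.inhibitedBy columns used throughout this post; the real dashboards also use Grafana time macros, omitted here). It enriches firing events from the webhook with the silence IDs and inhibiting alerts reported by the API for the same fingerprint:

SELECT
    fingerprint,
    any(alertname)        AS alertname,
    anyLast(silenced_by)  AS silenced_by,
    anyLast(inhibited_by) AS inhibited_by
FROM alerts
ANY LEFT JOIN
(
    SELECT
        fingerprint,
        status.silencedBy  AS silenced_by,
        status.inhibitedBy AS inhibited_by
    FROM alerts
    WHERE status.state = 'suppressed'
) AS api USING (fingerprint)
WHERE status.state = 'firing'
GROUP BY fingerprint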
We decided to use vector.dev to transform the data as needed and store it in a data warehouse. Vector.dev (acquired by Datadog) is an open source, high-performance observability data pipeline that supports a wide range of sources, multiple sinks, and various data transformation operations.
Although we use ClickHouse to store this data, any other database could be used here. We chose ClickHouse as the data store because it provides various data manipulation capabilities: it can aggregate data on insertion using materialized views, deduplicate rows using the ReplacingMergeTree table engine, and it supports JOINs.
If we created a separate column for every alert label, the number of columns would grow rapidly as new alerts and unique labels were added. Instead, we created dedicated columns for a few common labels, such as alert priority, instance name, dashboard, alert-ref, and alertname, which help us analyze the data as a whole, and we store all remaining labels in a column of type Map(String, String). This keeps every label in the data store with minimal resource usage while still letting users query or filter alerts on specific labels. For example, we can select all Prometheus alerts with labels['service'] = 'Prometheus'.
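To make this concrete, a simplified version of such a table might look like the sketch below. This is not our production schema: the dotted status column names, the engine choice details, and the ORDER BY key are assumptions, and the real table carries more columns.

CREATE TABLE alerts
(
    timestamp             DateTime,
    fingerprint           String,
    alertname             String,
    priority              String,
    instance              String,
    dashboard             String,
    `status.state`        String,              -- firing / suppressed / resolved
    `status.silencedBy`   Array(String),       -- silence IDs, from the API
    `status.inhibitedBy`  Array(String),       -- fingerprints of inhibiting alerts, from the API
    labels                Map(String, String)  -- all remaining labels
)
ENGINE = ReplacingMergeTree
ORDER BY (fingerprint, timestamp);

-- filter on any label without needing a dedicated column
SELECT count()
FROM alerts
WHERE labels['service'] = 'Prometheus'
  AND timestamp > now() - INTERVAL 7 DAY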
Dashboards
Based on this data, we created several dashboards:
Alerts overview: To get an overview of all the alerts Alertmanager receives.
Alertname overview: To drill down into a specific alert.
Alerts overview by receiver: Similar to the alerts overview, but scoped to a specific team or receiver.
Alert state timeline: To show a timeline snapshot of alert volume.
Ticket alerts overview: To see which alerts reach the ticketing system.
Silences overview: To get an overview of Alertmanager silences.
Alerts overview
The image below is a screenshot of the alerts overview dashboard for a receiver. It includes general statistics and breakdowns by component, service, and alert name. The dashboard also displays the number of P1/P2 alerts over the last one/seven/thirty days, the top alerts for the current quarter, and a quarter-over-quarter comparison.
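Headline numbers like these fall out of simple aggregations over the stored events. A sketch, assuming a priority column with values such as 'P1' and 'P2' (the actual column name and values may differ):

SELECT
    countIf(timestamp > now() - INTERVAL 1 DAY)  AS p1_p2_last_1d,
    countIf(timestamp > now() - INTERVAL 7 DAY)  AS p1_p2_last_7d,
    countIf(timestamp > now() - INTERVAL 30 DAY) AS p1_p2_last_30d
FROM alerts
WHERE status.state = 'firing'
  AND priority IN ('P1', 'P2')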
Breakdown by component
We route alerts to teams, and a team can own multiple services or components. This panel shows the number of alerts per component over time for a given receiver. For example, alerts routed to the Observability team span components such as logging, metrics, traces, and errors, and the panel makes it easy to see which component is noisy and when.
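The panel boils down to a query along these lines. This is a sketch only: it assumes a receiver column on the stored rows and a component label inside the labels map, both of which may be named differently in the real schema.

SELECT
    toStartOfHour(timestamp) AS t,
    labels['component']      AS component,
    count()                  AS alerts
FROM alerts
WHERE receiver = 'observability'  -- hypothetical receiver name
  AND status.state = 'firing'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY t, component
ORDER BY t, component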
Alert state timeline
We created this view for each receiver using the Grafana state timeline panel. The dashboard shows how busy an on-call engineer was, and when. Red means an alert started firing, orange means it is still active, and green means it has resolved, so the panel shows when each alert started, how long it stayed active, and when it resolved. The highlighted alert switches between firing and resolved far too often: it is flapping. Flapping happens when an alert is misconfigured and needs tuning, for example by adjusting the alert threshold or increasing the 'for' duration in the alerting rule. The 'for' field in an alerting rule specifies how long the condition must hold before the alert starts firing; in other words, the alert will not fire until the condition has been met for 'X' minutes.
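Flapping alerts can also be surfaced directly from the stored notification events. The query below is a rough sketch: it counts how many separate firing events each fingerprint produced over the last day, assuming each row corresponds to one notification event; the threshold of 12 is arbitrary.

SELECT
    alertname,
    fingerprint,
    count() AS firing_events
FROM alerts
WHERE status.state = 'firing'
  AND timestamp > now() - INTERVAL 1 DAY
GROUP BY alertname, fingerprint
HAVING firing_events > 12  -- arbitrary flapping threshold
ORDER BY firing_events DESC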
Findings
Our analysis yielded a few interesting findings. We found alerts that fired but had no notification label set, meaning they were never routed to any team and only created unnecessary load on Alertmanager. We also found a few components generating a large number of alerts; digging in, we discovered they belonged to a cluster that had been decommissioned without its alerts being removed. These dashboards gave us the visibility to find and clean up this clutter.
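A query like the following can surface such unrouted alerts. The label name 'notify' is purely hypothetical here; substitute whatever label your routing tree keys on (missing keys in a ClickHouse Map come back as empty strings, so the check below also catches absent labels):

SELECT
    alertname,
    count() AS firing_events
FROM alerts
WHERE status.state = 'firing'
  AND labels['notify'] = ''  -- hypothetical routing label
GROUP BY alertname
ORDER BY firing_events DESC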
Alertmanager Suppressions
Inhibit rules on the Alertmanager side let you suppress a set of alerts when another set of alerts is already firing. We found that these suppressions sometimes do not work, and since there was no way to detect this, we only learned about it when a user reported being notified for an alert that should have been suppressed. To picture what a failed suppression is, imagine a Venn diagram of firing and suppressed alerts. Ideally, the two sets should not overlap, because suppressed alerts should not fire. If they do overlap, suppressed alerts are firing, and that overlap is what we count as a failed suppression.
Once the alerts were stored in ClickHouse, we were able to write a query that finds the fingerprints of alerts whose suppression failed:
SELECT $rollup(timestamp) AS t, count() AS count
FROM
(
    SELECT fingerprint, timestamp
    FROM alerts
    WHERE $timeFilter AND status.state = 'firing'
    GROUP BY fingerprint, timestamp
) AS firing
ANY INNER JOIN
(
    SELECT fingerprint, timestamp
    FROM alerts
    WHERE $timeFilter AND status.state = 'suppressed'
      AND notEmpty(status.inhibitedBy)
    GROUP BY fingerprint, timestamp
) AS suppressed USING (fingerprint)
GROUP BY t
The first panel in the image below shows the total number of alerts; the second shows the number of failed suppressions.
We can also create a breakdown for each unsuccessfully suppressed alert.
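For example, a variation of the query above (with the Grafana time macros omitted) groups the failed suppressions by alert name:

SELECT
    alertname,
    count() AS failed_suppressions
FROM
(
    SELECT DISTINCT fingerprint, alertname
    FROM alerts
    WHERE status.state = 'firing'
) AS firing
ANY INNER JOIN
(
    SELECT DISTINCT fingerprint
    FROM alerts
    WHERE status.state = 'suppressed'
      AND notEmpty(status.inhibitedBy)
) AS suppressed USING (fingerprint)
GROUP BY alertname
ORDER BY failed_suppressions DESC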
By looking up these fingerprints in the database, we were able to correlate the suppressions and found that the unsuccessfully suppressed alerts were part of a "ring" (circular) suppression. For example, the alert Service_XYZ_down was suppressed by server_OOR, server_OOR was suppressed by server_down, and server_down was in turn suppressed by server_OOR.
Failed suppressions like this can be avoided by configuring inhibit rules carefully and avoiding circular dependencies between them.
Silencing
Alertmanager provides a mechanism to silence alerts while they are being worked on or during maintenance. A silence mutes alerts for a specified period of time and is configured using matchers, which can match on an exact value, a regex, the alert name, or any other label; a silence matcher does not necessarily have to reference the alert name at all. By analyzing the data, we were able to match alerts to silence IDs by running a JOIN query across the alert and silence tables. We also found many "stale" silences: silences created for a very long period of time that are no longer relevant.
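Both findings can be pulled out with queries like the sketches below. They assume a silences table populated from the Alertmanager API using the API's field names (id, createdBy, comment, startsAt, endsAt), and that status.silencedBy on the alert rows is an array of silence IDs; the real schema may differ.

-- which silences are actively matching alerts, and how many events they mute
SELECT
    silence_id,
    any(s.createdBy) AS created_by,
    any(s.comment)   AS comment,
    count()          AS silenced_events
FROM
(
    SELECT arrayJoin(status.silencedBy) AS silence_id
    FROM alerts
    WHERE status.state = 'suppressed'
) AS silenced_alerts
INNER JOIN silences AS s ON s.id = silenced_alerts.silence_id
GROUP BY silence_id
ORDER BY silenced_events DESC;

-- "stale" silences: created for a very long window
SELECT id, createdBy, startsAt, endsAt
FROM silences
WHERE dateDiff('day', startsAt, endsAt) > 30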
Do-It-Yourself Alert Analysis
This directory contains a basic demo of alert observability. Running docker-compose up launches several containers, including Prometheus, Alertmanager, Vector, ClickHouse, and Grafana. The Vector container receives Alertmanager alerts via the API, transforms them, and writes the data to ClickHouse, and the Grafana dashboards show an overview of alerts and silences. Make sure you have Docker installed, then run docker compose up.
Go to http://localhost:3000/dashboards to see the ready-made demo dashboards.
Conclusion
As part of the Observability team, we manage Alertmanager as a multi-tenant system, so it is important for us to be able to detect and mitigate misuse and to make sure alerting works as intended. Alert analysis has greatly improved life for both on-call engineers and our own team by giving everyone quick insight into the alerting pipeline. Alert observability has made it much easier to troubleshoot questions such as why an alert did not fire, why a suppressed alert did fire, or which alert silenced or inhibited another, providing valuable insight for improving alert management.
Alert dashboards also make quick reviews and adjustments easy, streamlining operations. Teams use them in weekly alert reviews as visible evidence of how an on-call shift went and to identify the alerts that fire most often, which are candidates for deletion or for merging with others, preventing wasteful use of the system and improving overall alert management. We can also spot services that may need special attention. This increased alert observability has allowed some teams to make informed decisions about their on-call schedules, such as moving to longer but less frequent shifts, or combining on-call duty with unplanned (interrupt) work.
In conclusion, alert observability plays a critical role in preventing burnout by minimizing interruptions and making on-call work more efficient. Providing it as a service benefits every team: it removes the need for each team to build custom dashboards and fosters a culture of proactive monitoring.