Announcing Asserts

Introducing Asserts, a platform for analyzing and tracking metrics. By scanning your application's metrics in any Prometheus-compatible time-series database (TSDB), Asserts, in real time:

  • creates a map of application architecture and infrastructure,

  • builds dashboards,

  • tracks service level objectives (SLOs),

  • and runs automatic checks (Assertions) to identify changes and potential problems.

Our goal is to reduce alert fatigue and shorten the time it takes to find the root cause.

What is Asserts?

Asserts helps you quickly and easily understand application and infrastructure problems, even in highly scaled, complex architectures. You'll receive alerts on the key metrics your users care about most, so you'll know immediately when something important happens and won't waste time sorting through minor or false alerts.

Along with the most significant alerts, Asserts shows you everything you need to solve the problem: failures that may be causing it, plus associated changes and anomalies, all in one place. Built on Prometheus, Asserts works with any application and any architecture; all you need to get started are your key metrics.

Why is monitoring difficult?

Our team at Asserts lived and breathed monitoring at AppDynamics. AppDynamics customers relied on our SaaS platform to monitor their own applications, so quickly resolving outages was a top priority: we needed to get to the root of problems as fast as possible.

When we built cloud-native applications at AppDynamics, we realized how difficult it is to monitor applications effectively, even with the latest APM tools offering logs, metrics, traces, dashboards, and alerts. We faced a difficult trade-off: alerting on high-level metrics versus alerting on fine-grained, low-level ones.

When we set alerts on a few high-level business or user-experience metrics (e.g., Records Processed, PageRenderTime, APIAvailability), we got more reliable alerts, but troubleshooting became time-consuming and error-prone.

When we configured finer-grained alerts (for example, CPU Load, JVM GC, Request Timeouts), they often fired on trivial conditions that caused no user-visible problem. Being woken up at three in the morning by an alert is not a pleasant experience, especially when it turns out to be a false alarm.
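
For a rough illustration of the trade-off, here is what the two styles can look like as Prometheus alerting rules. This is a minimal sketch, not Asserts' own rules; the api_* metric names and all thresholds are placeholders for the example:

```yaml
groups:
  - name: high-level-alerts        # symptom-based: worth paging a human
    rules:
      - alert: APIAvailabilityLow
        # Error ratio over the last 5 minutes exceeds 1%.
        expr: |
          sum(rate(api_request_errors_total[5m]))
            / sum(rate(api_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
  - name: fine-grained-alerts      # cause-based: also fires on harmless blips
    rules:
      - alert: HighCPULoad
        # Non-idle CPU above 90% on any instance for 10 minutes.
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 10m
        labels:
          severity: warn
```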

Finding a needle in a haystack

Our troubleshooting sessions consisted of reviewing numerous dashboards and metrics to find the specific root cause of the problem: the classic needle in a haystack. We asked ourselves: how can we automate this process?

To avoid alert fatigue, we need to set alerts at a high level, on application or business metrics, so that alerts target the symptoms of potential problems. In addition, we need a way to automatically determine the root causes of those problems to speed up their resolution. While these automatic checks are somewhat similar to alerts, they differ in that they won't wake you up when they fire: they are troubleshooting aids, not the first line of defense against failure. We call them Assertions.
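
To make the distinction concrete, here is one way the same separation could be expressed with plain Prometheus and Alertmanager. This is a hedged sketch, not Asserts' actual mechanism: assertion-style checks carry a label that routing never pages on.

```yaml
# Alertmanager routing sketch: only severity="page" notifies a person;
# assertion-style checks are kept for troubleshooting but never page.
route:
  receiver: default
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty
    - matchers: ['severity="assertion"']
      receiver: blackhole          # recorded and visible in the UI, no notification sent
receivers:
  - name: default
  - name: pagerduty                # pager integration config omitted in this sketch
  - name: blackhole
```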

Getting rid of routine work

All applications are different, and Asserts makes it easy to set up your own assertions and service level objectives (SLOs). Still, in our deep industry experience we have noticed that applications share a large number of common problems. For example:

  • resource saturation, whether a disk filling up or a cloud service exceeding its provisioned capacity (see the example rule after this list);

  • a new release or configuration update leading to increased latency;

  • a gradual accumulation of errors leading to an SLO violation.
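
As a concrete illustration of the first item, a disk-saturation check might look roughly like this (a sketch only: the metrics assume node_exporter, and the threshold and filesystem filters are arbitrary):

```yaml
groups:
  - name: saturation-assertions
    rules:
      - alert: FilesystemAlmostFull
        # Less than 10% of the filesystem is still available.
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: assertion      # a troubleshooting aid, not a page
```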

We've created an entire library of assertions that you can use to find problems in your applications. Our assertion library is built on a taxonomy called SAAFE, which will help you navigate the problems your application faces.

Because Assertions get to the root cause of problems, you don't need to be an expert in every part of the application architecture. Collective knowledge becomes available to every team member: what has changed, which metrics to pay attention to, and what is normal and what is not.

To understand the relationship between symptoms and causes, you also need to understand the architecture of the application: a failure in one component often causes problems in another. When troubleshooting with metrics and dashboards alone, you have to build a model of these relationships on the fly. Asserts automatically learns your application's architecture from your metrics and creates visualizations so you can quickly focus on fixing problems.

Solution

Let's illustrate this with an example where we drink our own Kool-Aid and use Asserts to troubleshoot our own application. Briefly, about the architecture: we run a time-series database made up of several services. vmselect serves queries, and vmalert evaluates rules (queries); grafana and the other Asserts services also send their queries to vmselect.

The architecture of vmselect and its related services can be viewed in the Asserts entity graph.

Asserts builds a real-time application map from your metrics and makes it available through graph search

To keep an eye on the application, we set up service level objectives (SLOs) to track its performance. We knew something was wrong when Asserts sent us an alert indicating high API server latency and degraded rule evaluation for our clients.
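
A latency SLO of this kind can be expressed roughly as follows (a sketch only; the http_request_duration_seconds histogram, the 500 ms target, and the 99% objective are assumptions, not our actual configuration):

```yaml
groups:
  - name: api-server-slo
    rules:
      - alert: ApiServerLatencySloViolation
        # Fewer than 99% of requests completed within 0.5 s over the last 30 minutes.
        expr: |
          sum(rate(http_request_duration_seconds_bucket{job="api-server", le="0.5"}[30m]))
            / sum(rate(http_request_duration_seconds_count{job="api-server"}[30m])) < 0.99
        for: 5m
        labels:
          severity: page
```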

SLO violation on the API server

The vmalert SLO shows degradation

api-server-latency SLO violation notification in Slack

The alert includes handy View Impact and Start Troubleshooting links to help guide you through the troubleshooting process. The View Impact link shows the incidents of interest that occurred at the time of the alert.

The screen below shows that around the same time the api-server was experiencing latency issues, vmselect (which the API server depends on for time-series queries) suffered a failure:

The incident screen displays SLO violations and other assertions configured for notification.

The Start Troubleshooting link opens the Top Insights screen. Here services are ranked to identify hot spots in the system so that the user can get started right away. Assertions are indicated in blue (Amends, i.e., changes), yellow (warning), and red (critical).

When Assertions fire, they are indicated by blue, yellow, and red lines on this timeline. vmselect has the most problems, so it rose to the top.

The triggered assertions showed that the vmselect service had failed and become unavailable due to memory saturation.

Note that there were no blue lines (Amends) before the incident started; this is Asserts' way of saying that there had been no updates to rules (queries) or builds for any of the services.

The logs in View Logs show the error: vmselect ran out of memory while executing a query.
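
A memory-saturation check of the kind that caught this can be sketched with standard Kubernetes metrics (assuming cAdvisor and kube-state-metrics are scraped; the 90% threshold and the container label filter are illustrative):

```yaml
groups:
  - name: memory-saturation
    rules:
      - alert: ContainerNearMemoryLimit
        # Working-set memory above 90% of the configured limit for a vmselect container.
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container="vmselect"}
          )
            / on (namespace, pod, container)
              kube_pod_container_resource_limits{resource="memory", container="vmselect"}
            > 0.9
        for: 5m
        labels:
          severity: assertion
```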

We wanted to know:

  • Where did the bad queries come from?

  • How were our clients affected?

  • How could we prevent this from happening again?

To answer these questions, we started digging into the assertions firing in other services.

Watch a video of how Asserts captures assertions and provides an aggregated, filterable, and drill-down view of assertions to narrow down the problematic entities and metrics.

When assertions fire, they are indicated by yellow and red lines on this timeline. vmselect has the most problems, so it sits at the very top.

The assertions revealed anomalous network usage during this time: an unusual amount of traffic between Grafana and the ingress controller, which routes traffic from the browser frontend to backend services. This indicated that someone was evaluating particularly heavy queries through the Grafana query interface.

The assertions indicate that something strange is happening: in this case, outgoing network traffic on grafana and the ingress controller is significantly higher than normal.
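
A check of this kind compares current egress with a recent baseline. A minimal sketch using cAdvisor network metrics (the pod-name regex, the 3x factor, and the one-day baseline are assumptions):

```yaml
groups:
  - name: network-anomaly
    rules:
      - alert: UnusualNetworkEgress
        # Egress over the last 10 minutes is more than 3x the average of the past day.
        expr: |
          sum by (pod) (rate(container_network_transmit_bytes_total{pod=~"grafana.*|.*ingress.*"}[10m]))
            > 3 * sum by (pod) (rate(container_network_transmit_bytes_total{pod=~"grafana.*|.*ingress.*"}[1d]))
        for: 10m
        labels:
          severity: assertion
```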

While reviewing the assertions of other services, we also noticed that several of our vmalert instances were reporting errors. This means that rule evaluation failed while the time-series database query layer, vmselect, was down. Since we run separate vmalert instances for each client, we knew exactly which clients and rule groups were affected and which were not. (Note that client names are hidden in these screenshots.)

Each line here represents a rule group failure for a client, and the red line is the period of time during which its alerts were not evaluated.

The root of the problem became clear: a user had created a series of queries and run them through grafana, and they took so much memory to evaluate that they crashed the vmselect query service. This blocked rule evaluation for some clients. To solve the immediate problem, we launched more vmselect pods to increase the system's fault tolerance.

Watch the video to see how we identified the failed pods and their memory usage just before the failure.

Just before the failure, the pods hit their memory limit, which reduced the availability of the vmselect service until new pods were added.

Troubleshooting without Asserts

Before moving on to the fix, let's consider how this would have played out without assertions. The standard tools for metric-based troubleshooting are dashboards and ad hoc exploration with PromQL. When the first alert arrived at the start of the incident, we would have turned to a pre-built dashboard to find the problem quickly. For the dashboard to lead us to the root cause, we would have had to anticipate a problem like this one and add all the relevant metrics in advance. In practice this is difficult, so teams usually start with a minimal or generic dashboard and evolve it as they learn more about how their applications behave. The first problem with dashboards as troubleshooting tools, then, is that they are often incomplete: building them up takes time and experience. With Asserts, we draw on a library of reusable assertions, so we start with much richer background information.

Second, when troubleshooting with a dashboard, you have to decide whether each metric is worth further investigation while you hunt for the root cause. If the dashboard is large, this becomes quite labor-intensive, so only the most important metrics tend to be included, which again leaves dashboards incomplete and slows ad hoc exploration even further. When troubleshooting with assertions, you start from an already-narrowed list of potential problems. Once the search is limited to the metrics that matter, troubleshooting becomes much faster.

Preventing a recurrence

But how do we prevent this in the future? Since we don't intend to block users from running ad hoc queries, and it is hard to predict their resource consumption in advance, we separated ad hoc queries from alert rule evaluation by running multiple vmselect deployments.

Separate vmselect instances are now allocated to evaluating alert rules, which ensures alerts are delivered reliably. All dashboard and user queries are served by vmselect-user, and the worst that can happen if a user writes an expensive query is that they will have to rework it before they see results.

We now have two vmselect deployments: the original vmselect, which handles alerting, and vmselect-user, which handles ad hoc queries.
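
In Kubernetes terms, the separation boils down to a second vmselect Deployment and pointing vmalert and the dashboards at different Services. A simplified sketch follows; the names, addresses, ports, and flags here are illustrative, not our production manifests:

```yaml
# vmselect-user: a second deployment that serves only Grafana and ad hoc user queries.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmselect-user
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vmselect-user
  template:
    metadata:
      labels:
        app: vmselect-user
    spec:
      containers:
        - name: vmselect
          image: victoriametrics/vmselect        # tag pinned in a real manifest
          args:
            - "-storageNode=vmstorage-0.vmstorage:8401"
            - "-storageNode=vmstorage-1.vmstorage:8401"
# The original vmselect Service stays dedicated to vmalert, e.g.
#   vmalert:  -datasource.url=http://vmselect:8481/select/0/prometheus
#   Grafana:  datasource URL http://vmselect-user:8481/select/0/prometheus
```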

To summarize: Asserts will simplify your troubleshooting workflow and eliminate the daily grind of creating and managing dashboards and alerts.

