How to Build an Effective Monitoring Strategy with High Observability

Let's get this straight: the most important thing in development now is the performance and reliability of your infrastructure, because if your project lags or works intermittently, no features will save you. The client will simply go to competitors.

For this reason, the role of system monitoring has grown dramatically in recent years. Our systems have gone from being technological novelties to being critical infrastructure that everyday life depends on. However, there is a wide gap between box-ticking monitoring and monitoring that matches the complexity and depth of modern systems.

Even when engineers recognize the importance of effective monitoring, expectations almost never match the reality of building and operating such systems. Achieving comprehensive coverage that fully reflects how users interact with the product is an extremely difficult task. This is where the ideas of observability come to the fore. Observability should be seen not just as monitoring or a set of measures, but as a system-wide approach to working with the project's infrastructure and its systems.

Observability is an approach to software development and infrastructure management that focuses on practices that keep the development of an application under control and give a clear understanding of its state at any given time.

Efforts to implement observability often start off well and, as they should, focus on the parts of the application that are critical to the user experience. However, building truly complete observability takes a lot of effort, and so these efforts often fail. The process requires a deep understanding of how the system operates, beyond surface-level checks such as “the site works” or “the service is available”. It also requires accepting that observability is a fundamental part of understanding how the system works as a whole and how the user experience in the application is organized (the user journey, i.e. how the user actually uses the application).

Unfortunately, many developers, even knowing all this, keep making the same mistake. At first, they strive to monitor their applications and services comprehensively, but in the end they find themselves limited by business priorities or by monitoring tools that were meager from the start. The result is an imbalance between development speed and system observability: the developer no longer has the energy or resources to keep up with business requests for new features and ends up concentrating exclusively on coding whatever was requested. In such a situation, the monitoring system gets left behind.

A truly deep approach to observability requires moving to systems analysis, understanding user flows, and equal commitment from both the development team and project management. In practice, this means you need to do more than tick the boxes for monitoring features; you also need to constantly get feedback from your colleagues. In fact, observability is now one more process in software development, since it involves every part of the team, from rank-and-file engineers to product specialists and project managers. Without the full involvement of the whole team in building observability, you will not be able to create a real map of the user journey and identify the key points that deserve attention.

In the rest of this article we present an ideal vision of what an observability system should be and ways to implement it. Of course, we understand that in modern realities the ideal is practically unattainable, because of the very nature of modern development with its constantly changing tools, innovative approaches, and endless churning out of new features. But a description of the “ideal spherical system in a vacuum” can serve as a standard against which you can measure your current monitoring system and see how far you are from it at the moment.

Below we will look at two of the most critical things to keep an eye on.

Application performance and user experience


To ensure continuous operation of applications, you need checks that replicate browser actions and simulate user activity. This includes checking that users can log in, checking that typical user operations work correctly, and performing basic actions in the application.
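As a rough illustration of such a check, the sketch below runs a scripted user journey step by step and records which step fails and how long each step takes. The `run_journey` helper, the `StepResult` type, and the step names are hypothetical; in a real setup each step would drive a browser or call the application instead of the stubs used here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_journey(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run the steps in order and stop at the first failure,
    because a real user would be blocked at that point too."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
            results.append(StepResult(name, True, time.perf_counter() - started))
        except Exception as exc:
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, False, elapsed, str(exc)))
            break
    return results


# Hypothetical stubs standing in for real browser actions.
def open_login_page() -> None:
    pass


def log_in() -> None:
    pass


def add_item_to_cart() -> None:
    raise RuntimeError("cart service returned an error")


def check_out() -> None:
    pass


results = run_journey([
    ("open login page", open_login_page),
    ("log in", log_in),
    ("add item to cart", add_item_to_cart),
    ("check out", check_out),
])

for result in results:
    status = "ok" if result.ok else f"FAILED: {result.error}"
    print(f"{result.name}: {status}")
```

Stopping at the first failed step keeps the report focused on the point where the journey actually breaks, which is usually the most useful piece of information for a notification.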

Coverage checklist:

Notification checklist:

Business impact analysis

In addition to engineers, you may also receive information about issues or failures from your business team, which monitors conversions, sales, and revenue from the product. Everything may seem fine at first glance, but if your subscription sales suddenly plummet, it may indicate problems beyond the scope of the team monitoring the product. Recognizing these issues early will allow you to make more informed decisions and calmly organize the work of the product and technical teams, rather than panicking and patching up holes when the situation becomes dire.
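A minimal sketch of how a sudden drop in a business metric could be flagged, assuming you already collect the metric at regular intervals. The function name, the 50% threshold, and the sample numbers are illustrative assumptions, not recommendations.

```python
from statistics import mean
from typing import Sequence


def is_sudden_drop(history: Sequence[float], current: float,
                   max_drop: float = 0.5) -> bool:
    """Return True if `current` is more than `max_drop` (a fraction)
    below the average of the recent `history` values."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    if baseline <= 0:
        return False  # avoid dividing by zero on an empty baseline
    return (baseline - current) / baseline > max_drop


# Hypothetical hourly subscription sales for the last four hours.
recent_sales = [40, 44, 38, 42]

print(is_sudden_drop(recent_sales, current=12))  # far below the baseline
print(is_sudden_drop(recent_sales, current=35))  # within normal variation
```

A comparison against a simple average ignores seasonality (nights, weekends), so a real check would likely compare against the same period of a previous day or week.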

Coverage checklist:

Notification checklist:

Localizing problems on the frontend or backend

If we dive even deeper into our ecosystem, the first thing to determine is whether the issue lies in the interface itself or in the frontend-facing API services on the server side.
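The decision can be sketched as a small function that combines two independent checks: one that loads the interface and one that calls the frontend-facing API directly. The names `FaultArea` and `localize_fault` are hypothetical, and the rule that an API failure takes priority is an assumption (a broken API usually breaks the page as well).

```python
from enum import Enum


class FaultArea(Enum):
    NONE = "no fault detected"
    FRONTEND = "frontend"
    BACKEND_API = "frontend-facing API"


def localize_fault(page_check_ok: bool, api_check_ok: bool) -> FaultArea:
    """Decide where to look first, given the results of two probes."""
    # Assumption: if the API is failing, start on the backend, since
    # the page is likely failing only as a consequence.
    if not api_check_ok:
        return FaultArea.BACKEND_API
    # The API answers correctly but the page does not: suspect the frontend.
    if not page_check_ok:
        return FaultArea.FRONTEND
    return FaultArea.NONE


print(localize_fault(page_check_ok=False, api_check_ok=True).value)
print(localize_fault(page_check_ok=False, api_check_ok=False).value)
```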

Coverage checklist:

Notification checklist:

If the problem is on the frontend, the deployment team should roll back to the previous working version, disable the broken feature with a flag, or roll out a fix that will close the problem.

If the problem is in the frontend-facing API on the backend, then in an ideal world the next step would be to localize it. This is possible if the entire application is covered with logging and error tracing, which lets you quickly identify the problem area and makes troubleshooting easier.

Unfortunately, we do not live in an ideal world, and in practice this does not always work out. So we will talk about more realistic situations, when there is not enough information and it is not clear where to look next.

Backend monitoring

Coverage checklist:

Notification checklist:

Although these are the main cases, in practice things often go quite differently. Product development, especially for a commercially successful product, is flexible and fast-paced. As a result, development teams can easily miss something or end up in an unusual situation. For example, tracing may not be configured, so a microservice may simply respond with HTTP 200 instead of providing correct information about a failure. In such cases, instead of relying solely on tracing, you need to track errors and failures through specialized monitoring services.
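If a service can answer HTTP 200 while the request has actually failed, a check that looks only at the status code will miss the failure. The sketch below also inspects the response body. It assumes a hypothetical convention in which the service returns a JSON object and reports problems in an `error` field; your services may signal failures differently.

```python
import json


def is_real_success(status_code: int, body: str) -> bool:
    """Treat a response as successful only if the status is 2xx AND the
    body is a JSON object without a non-empty "error" field."""
    if not 200 <= status_code < 300:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # an empty or malformed body is not a valid answer
    if not isinstance(payload, dict):
        return False
    # Assumed convention: failures are reported in an "error" field.
    return not payload.get("error")


print(is_real_success(200, '{"items": []}'))
print(is_real_success(200, '{"error": "database timeout"}'))
```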

Search for service failures

Coverage checklist:

Notification checklist:

At this point it is crucial to get detailed information about the potential causes of a failure or performance degradation. These problems can stem from degradation of the underlying infrastructure or from communication with other system components. In the second case, we are talking about microservices within the application and connections to databases, caches, and other data stores. Do not forget about third-party APIs and other external factors either. At this stage, profiling of services becomes extremely important for accurately identifying the faulty functionality.
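As a small illustration of dependency profiling, the sketch below times every call to a named dependency and reports which one accounts for the most time. The `DependencyTimer` class and the dependency names are hypothetical, and the `time.sleep` calls merely stand in for real database and cache requests.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class DependencyTimer:
    """Collects call durations per dependency (database, cache, API...)."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def track(self, dependency: str) -> Iterator[None]:
        started = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the call raised an exception.
            self.samples[dependency].append(time.perf_counter() - started)

    def slowest(self) -> Optional[str]:
        """Name of the dependency with the largest total time, if any."""
        if not self.samples:
            return None
        return max(self.samples, key=lambda name: sum(self.samples[name]))


timer = DependencyTimer()

with timer.track("database"):
    time.sleep(0.05)   # stand-in for a slow query

with timer.track("cache"):
    time.sleep(0.001)  # stand-in for a fast cache lookup

print(timer.slowest())
```

Recording the duration in a `finally` block matters here: failed calls are often the slowest ones (for example, timeouts), and dropping them would hide exactly the dependency you are looking for.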

Service profiling and dependency analysis

Coverage checklist:

Notification checklist:

Infrastructure monitoring

So, we have finally reached the area where observability is normally built in most ecosystems: the server or cloud infrastructure on which the main product runs. Although most of the scenarios described above (such as application-level failures, service interaction problems, or interface failures) are not directly related to the servers, the likelihood of many application problems depends directly on how well the infrastructure performs and whether it is degrading.
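At its simplest, infrastructure monitoring compares collected resource metrics against limits. The sketch below assumes the metrics have already been gathered by some agent; the metric names and limit values are illustrative only.

```python
from typing import Dict, List


def breached_limits(metrics: Dict[str, float],
                    limits: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics whose value exceeds its limit.
    Metrics without a configured limit are ignored."""
    return sorted(
        name for name, value in metrics.items()
        if name in limits and value > limits[name]
    )


# Hypothetical limits and one reading from a single host.
limits = {"cpu_percent": 85, "memory_percent": 90, "disk_percent": 80}
reading = {
    "cpu_percent": 42,
    "memory_percent": 93.5,
    "disk_percent": 81,
    "open_connections": 120,  # no limit configured, so it is ignored
}

print(breached_limits(reading, limits))
```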

Coverage checklist:

Notification checklist:

In our opinion, the steps above build a monitoring ecosystem that is close to ideal, one we would like to see on any project. Of course, the ideal is unattainable and can only be strived for, but it is always better to have at least a reasonably relevant benchmark for setting the direction of the project and of the company as a whole.

Integrating user feedback


This could be the end of the article, but there is another fairly common scenario: problems with an application go unnoticed and reach a wide range of users. At such moments, a mountain of messages pours into support from dissatisfied customers who decided to report the failures or performance issues personally.

Customer support and development often exist on different planes. While developers generate ideas and build various cool features, the support team apologizes for things that are not its fault and listens to requests and demands it cannot fulfill. Most often, support is used as a spam filter whose task is to shield developers from overly active users and let them work in peace, while collecting feedback from customers along the way. Very often it is support that informs developers about critical failures or malfunctions, because in those cases support receives a barrage of messages. In an ideal world, however, support should not be the first line of notification, but only a source of confirmation for information you already have and a channel for responding to the user. That is, you, as developers, should learn about problems with the service not from your own support team and customers, but from failure and performance monitoring systems.

Only then will you be able to start fixing the failure before a wave of tickets arrives and buries your support team for several days, and to give your customers an honest answer: the problem is known and is being fixed. If you learn about the failure directly from customers, user dissatisfaction will only grow.

However, the world is not perfect, so you will always have situations where only your customers know about the problem and report it through your technical support. And such reports should never be ignored.
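One way to make sure such reports are not lost is to watch for topics that suddenly accumulate tickets. The sketch below assumes each ticket has already been tagged with a topic by the support team; the function name, the one-hour window, and the threshold of three reports are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def trending_topics(tickets: Iterable[Tuple[datetime, str]],
                    now: datetime,
                    window: timedelta = timedelta(hours=1),
                    min_reports: int = 3) -> List[str]:
    """Return the sorted topics that received at least `min_reports`
    tickets within `window` before `now`."""
    recent = Counter(
        topic for created, topic in tickets
        if now - window <= created <= now
    )
    return sorted(topic for topic, count in recent.items()
                  if count >= min_reports)


# Hypothetical tickets: (creation time, topic assigned by support).
now = datetime(2024, 1, 1, 12, 0)
tickets = [
    (datetime(2024, 1, 1, 9, 0), "login"),      # too old, outside the window
    (datetime(2024, 1, 1, 11, 10), "login"),
    (datetime(2024, 1, 1, 11, 30), "login"),
    (datetime(2024, 1, 1, 11, 40), "payments"),
    (datetime(2024, 1, 1, 11, 55), "login"),
]

print(trending_topics(tickets, now))
```

A topic that crosses the threshold without a matching alert from your own monitoring is a strong hint that there is a gap in coverage.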

Coverage checklist:

Notification checklist:

Summary

In this article, we outlined the basic elements of the philosophy behind building a purpose-built observability ecosystem, and emphasized the importance of coordinating efforts across monitoring, user experience analysis, and keeping applications working. In another publication, we will try to look at specific cases and work through some of these things in practice.

