How to Build an Effective Monitoring Strategy with High Observability
Let's get this straight: the most important thing in development now is the performance and reliability of your infrastructure, because if your project lags or works intermittently, no features will save you. The client will simply go to competitors.
For this reason, the role of system monitoring has increased dramatically in recent years. Our systems have moved from being technological innovations to being critical infrastructure without which everyday life is simply impossible. However, there is a gaping chasm between formal monitoring and monitoring that matches the complexity and depth of modern systems.
Engineers generally recognize the importance of effective monitoring, yet expectations almost never match the reality of building and operating such systems. After all, achieving comprehensive coverage that fully reflects how users interact with the product is an extremely difficult task. This is where the ideas of observability come to the fore. Observability should be seen not just as monitoring or a set of measures, but as a systemic approach to working with the project's infrastructure and its systems.
Observability is an approach to software development and infrastructure management that focuses on methods that enable controlled development of an application and a clear understanding of its state at any given time.
Efforts to implement observability often start off well and, as they should, focus on the critical user experience in the application. However, truly full-fledged observability takes a lot of effort, so these initiatives often fail. It requires understanding how the system operates beyond surface-level checks such as “the site works” or “the service is available”. It also requires accepting that observability is fundamental to understanding how the system works as a whole and how the user journey (how the user actually uses the application) is organized.
Unfortunately, many developers, even knowing all this, keep making the same mistake. At first, they strive to monitor their applications and services comprehensively, but in the end they find themselves constrained by business priorities or by monitoring tools that were meager from the start. The result is an imbalance between development speed and system observability: the developer has no energy or resources left after handling business requests for new features, and ends up concentrating exclusively on coding what they were told to code. In such a situation, the monitoring system is left behind.
A truly deep approach to observability requires moving to systems analysis, understanding user flows, and equal commitment from both the development team and project management. In practice, this means you need to do more than check off boxes for monitoring features; you also need to constantly get feedback from your colleagues. In fact, observability is now one more process in software development, since it involves the whole team, from rank-and-file engineers to product specialists and project managers. Without the full involvement of the whole team, you will not be able to build a real map of the user journey and identify the key nodes that deserve attention.
In the rest of this article we present an ideal vision of what an observability system should be and ways to implement it. Of course, we understand that the ideal is practically unattainable in modern realities because of the very nature of modern development, with its constantly changing tools, innovative approaches, and endless churn of new features. But a description of the “ideal spherical system in a vacuum” can serve as a standard against which you can measure your current monitoring system and see how far you are from it at the moment.
Below we will look at the most critical things to keep an eye on.
Application performance and user experience
To ensure continuous operation of applications, you need checks that replicate browser actions and simulate user activity. This includes checking that users can log in, that typical user operations work correctly, and that basic actions in the application can be performed.
Coverage checklist:
- Make sure you can log in and visit key pages.
- Check the load time of key pages and actions.
- Track JavaScript errors across common user scenarios.
- Make sure important features work correctly.
Notification checklist:
- Set up alerts for failed login attempts or increased login time.
- Set up an alert when page load time increases beyond a specified threshold.
- Track the number of JavaScript errors in critical user behavior scenarios.
- Set up alerts for crashes or other performance issues with your critical functions.
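As a rough illustration of the two checklists above, here is a minimal Python sketch of how the result of one simulated user scenario could be turned into alerts. The `CheckResult` structure, its field names, and the thresholds are our assumptions for the example; a real browser-automation tool would supply the actual measurements.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one simulated user scenario (hypothetical structure)."""
    name: str           # scenario name, e.g. "login"
    succeeded: bool     # did the scripted scenario complete?
    duration_ms: float  # how long the scenario took
    js_errors: int      # JavaScript errors seen during the scenario


def evaluate(result: CheckResult, max_duration_ms: float,
             max_js_errors: int = 0) -> list[str]:
    """Return alert messages for one check result (empty list = all good)."""
    alerts = []
    if not result.succeeded:
        alerts.append(f"{result.name}: scenario failed")
    if result.duration_ms > max_duration_ms:
        alerts.append(
            f"{result.name}: took {result.duration_ms:.0f} ms, "
            f"limit is {max_duration_ms:.0f} ms"
        )
    if result.js_errors > max_js_errors:
        alerts.append(f"{result.name}: {result.js_errors} JavaScript error(s)")
    return alerts
```

For example, `evaluate(CheckResult("login", True, 3500.0, 2), max_duration_ms=2000)` returns two messages: one for the slow login and one for the JavaScript errors.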
Business impact analysis
In addition to engineers, you may also receive information about issues or failures from your business team, which monitors conversions, sales, and revenue from the product. Everything may seem fine at first glance, but if your subscription sales suddenly plummet, it may indicate problems beyond the scope of the team monitoring the product. Recognizing these issues early will allow you to make more informed decisions and calmly organize the work of the product and technical teams, rather than panicking and patching up holes when the situation becomes dire.
Coverage checklist:
- Regularly review conversion rates and other key metrics.
- Set up alerts for sudden changes in revenue or other sensitive figures of this kind.
- Conduct an analysis of traffic sources and engagement metrics for anomalies.
- Coordinate the work of the technical and business teams to verify the impact of technical issues on business performance.
Notification checklist:
- Set alerts for significant drops or changes in conversion rates.
- Track revenue dynamics and other financial data.
- Watch for anomalies in traffic sources and spikes in user engagement metrics that deviate from historical user behavior patterns.
- Implement cross-functional alerts to notify both technical and business teams of detected issues at once.
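An alert on a significant drop in conversion can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the baseline is the plain average of recent historical rates, and the 30% relative drop limit is an arbitrary example value that each team would have to tune.

```python
from statistics import mean


def conversion_rate(orders: int, visits: int) -> float:
    """Share of visits that ended in an order; 0.0 when there were no visits."""
    return orders / visits if visits else 0.0


def conversion_drop_alert(history: list[float], current: float,
                          max_relative_drop: float = 0.3) -> bool:
    """True when `current` is more than `max_relative_drop` below the
    average of the historical rates (illustrative rule, not a standard)."""
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        return False  # a zero baseline cannot drop any further
    return (baseline - current) / baseline > max_relative_drop
```

With a history of roughly 4% conversion, a current rate of 2% triggers the alert, while 3.9% does not.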
Localizing problems on the frontend or backend
If we dive even deeper into our ecosystem, the first thing to determine is whether the issue lies in the interface itself or in the server-side API services that serve the interface.
Coverage checklist:
- Ensure that all front-end APIs are covered by health and performance monitoring.
- Set up browser error logging to record JavaScript failures and loading errors.
- Implement frontend performance monitoring to track page load times and user activity.
Notification checklist:
- Set up alerts for downtime or performance degradation of frontend APIs.
- Set up alerts for critical errors at the browser level or a sharp increase in the number of such errors.
- Monitor and alert on significant changes in frontend performance, such as a sharp increase in page load times.
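The first triage step can be expressed as a very small rule. The Python sketch below assumes that you already collect two numbers, the error rate of the frontend APIs and the share of page views with browser errors; the function name and both limits are hypothetical and serve only to show the idea.

```python
def localize(api_error_rate: float, browser_error_rate: float,
             api_limit: float = 0.05, browser_limit: float = 0.02) -> str:
    """Rough first guess at where to look: "backend", "frontend" or "unknown".

    Both rates are fractions between 0 and 1; the limits are example values.
    """
    if api_error_rate > api_limit:
        # Failing frontend APIs can also surface as browser errors,
        # so the API signal is checked first.
        return "backend"
    if browser_error_rate > browser_limit:
        return "frontend"
    return "unknown"
```

A rule like this does not replace investigation; it only suggests which team should look first.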
If the problem is on the frontend, the deployment team should roll back to the previous working version, disable the broken feature with a flag, or roll out a fix that will close the problem.
If something happened to the front-end API on the backend, then in an ideal world the next step would be to localize the problem. This will be possible if the entire application is covered with logging and error tracing, which will allow you to quickly identify the problem area and facilitate troubleshooting.
Unfortunately, we do not live in an ideal world and in practice this does not always work out, so we will talk about more realistic situations when there is not enough information and it is not clear where to look further.
Backend monitoring
Coverage checklist:
- Ensure that all backend services and endpoints are included in distributed tracing.
- Set up detailed logging to track errors and failures across all server services.
- Monitor server service performance metrics for signs of degradation or failure.
Notification checklist:
- Set up alerts for critical errors or failures in server services that may indicate a problem.
- Set thresholds for performance alerts to identify issues before they impact users.
- Use anomaly detection in traces and logs to automatically alert you to unusual patterns that indicate hidden problems.
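Anomaly detection does not have to start with anything sophisticated. As a minimal sketch, the Python function below flags a value (for example, a span duration taken from traces) that lies too many standard deviations away from recent history. The z-score rule and the limit of 3 are our assumptions for the example, not a recommendation for every workload.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], value: float,
                 z_limit: float = 3.0) -> bool:
    """True when `value` is more than `z_limit` sample standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change stands out
    return abs(value - mu) / sigma > z_limit
```

For a service that usually answers in about 100 ms, a 150 ms sample would be flagged, while 101 ms would not.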
Although these are the main cases, in practice things often go quite differently. Product development, especially for a commercially successful product, is flexible and fast-paced. As a result, development teams can easily miss something or end up in an unusual situation. For example, tracing may not be configured, so a failing microservice simply stops answering with HTTP 200 without providing any proper information about the failure. In such cases, instead of relying solely on tracing, you need to track errors and failures through specialized monitoring services.
Search for service failures
Coverage checklist:
- Implement detailed logging for all server services, capturing both normal operations and error conditions.
- Enable health checks for all services and ensure that each one can report its status at all times.
Notification checklist:
- Set up alerts for any increase in service errors, distinguishing critical errors from mere warnings.
- Receive alerts whenever a service response time threshold is exceeded, indicating potential performance issues.
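The notification rules above can be sketched as a single function over a window of log records and response times. The record format (a dict with a `level` key), the severity labels, and the default limits are assumptions made for this example.

```python
from collections import Counter


def service_alerts(records: list[dict], response_times_ms: list[float],
                   max_errors: int = 0, max_warnings: int = 10,
                   max_response_ms: float = 500.0) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for one observation window."""
    counts = Counter(record["level"] for record in records)
    alerts = []
    # Errors are treated as critical, warnings only matter in bulk.
    if counts["ERROR"] > max_errors:
        alerts.append(("critical", f"{counts['ERROR']} error record(s)"))
    if counts["WARNING"] > max_warnings:
        alerts.append(("warning", f"{counts['WARNING']} warning record(s)"))
    slow = sum(1 for t in response_times_ms if t > max_response_ms)
    if slow:
        alerts.append(
            ("warning", f"{slow} response(s) slower than {max_response_ms:.0f} ms")
        )
    return alerts
```

Keeping severity as a separate field makes it easy to route critical alerts to the on-call engineer and warnings to a dashboard.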
Now it is crucial to get detailed information about the potential causes of the failure or performance degradation. These problems can be related to degradation of the underlying infrastructure or caused by communication with other system components. In the second case, we are talking about microservices within applications and connections to databases, caches, and other storage systems. Also, do not forget about third-party APIs and other external factors. At this stage, profiling of services becomes extremely important for accurately identifying the faulty functionality.
Service profiling and dependency analysis
Coverage checklist:
- Profile and monitor key service functions to identify performance bottlenecks.
- Monitor the time it takes for services to interact with databases, caches, and other internal storage systems to identify latencies.
- Monitor connection and response times to external APIs to find issues with timeouts or abnormal latencies.
- Use network instrumentation to visualize and monitor the flow of requests through your microservices architecture, identifying faulty services.
Notification checklist:
- Set up alerts for major deviations in execution times from baseline values.
- Use alerting on error rates that exceed predefined thresholds in service functions or when interacting with dependencies.
- Implement alerts for timeouts or significant latencies in connections to databases, caches, external APIs, and other microservices.
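One lightweight way to measure how long calls to dependencies take is a timing decorator. The sketch below is an assumption-laden illustration: the in-memory `timings` store, the `timed` and `deviates` names, and the `load_user` function with its artificial delay are all hypothetical, and a real setup would export the measurements to a metrics system instead of keeping them in a dict.

```python
import time
from functools import wraps

# Durations in seconds, grouped by dependency name (in-memory, for illustration).
timings: dict[str, list[float]] = {}


def timed(dependency: str):
    """Record the wall-clock duration of every call under `dependency`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are timed too.
                elapsed = time.perf_counter() - start
                timings.setdefault(dependency, []).append(elapsed)
        return wrapper
    return decorator


def deviates(dependency: str, baseline_s: float, factor: float = 2.0) -> bool:
    """True when the latest call took more than `factor` times the baseline."""
    samples = timings.get(dependency, [])
    return bool(samples) and samples[-1] > baseline_s * factor


@timed("database")
def load_user(user_id: int) -> dict:
    time.sleep(0.01)  # stand-in for a real database query
    return {"id": user_id}
```

After `load_user(7)` runs, `deviates("database", baseline_s=0.001)` reports a deviation, because the simulated query takes about ten times the stated baseline.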
Infrastructure monitoring
So, we have finally reached the area where observability is usually built in most ecosystems: the server or cloud infrastructure on which the main product runs. Although most of the scenarios described above (such as application-level failures, service interaction problems, or interface failures) are not directly related to the infrastructure, the likelihood of many application problems depends directly on how the infrastructure performs and whether it is degrading.
Coverage checklist:
- Implement continuous monitoring of vital metrics for all physical and virtual servers.
- Monitor the health and performance of container orchestration systems to ensure optimal application deployment and scaling.
- Closely monitor network throughput and errors to identify potential bottlenecks or outages affecting application connectivity.
- Monitor cloud services and infrastructure components for availability and performance issues.
Notification checklist:
- Set up threshold-based alerts for critical server metrics (CPU, memory, disk I/O) to quickly identify when compute capacity is overloaded.
- Configure alerts for container orchestration issues, including failed deployments or unhealthy pods.
- Set network performance alerts to notify teams of potential connectivity issues or degraded network quality.
- Implement alerts for cloud service outages or performance degradation that may impact application availability or performance.
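Threshold-based alerts on server metrics are easy to sketch. In the Python example below, a metric counts as overloaded only when its last few samples are all above the limit, which is one common way to avoid alerting on a single spike. The metric names, the limits, and the three-sample rule are our assumptions, not values taken from any particular monitoring product.

```python
# Example limits in percent; real values depend on the workload.
LIMITS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_io_wait_percent": 20.0,
}


def sustained_breach(samples: list[float], limit: float,
                     min_samples: int = 3) -> bool:
    """True when the last `min_samples` samples are all above `limit`."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(v > limit for v in recent)


def overloaded_metrics(host_samples: dict[str, list[float]],
                       limits: dict[str, float] = LIMITS,
                       min_samples: int = 3) -> list[str]:
    """Names of the metrics with a sustained breach, sorted alphabetically."""
    return sorted(
        name
        for name, samples in host_samples.items()
        if name in limits and sustained_breach(samples, limits[name], min_samples)
    )
```

A host whose CPU readings are 50, 95, 96, 97 is reported as overloaded on `cpu_percent`, while a memory series of 90, 91, 40 is not, because the last sample has already recovered.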
In our opinion, the above steps build a monitoring ecosystem that is close to ideal, which we would like to see on any project. Of course, the ideal is unattainable and one can only strive for it, but it is always nicer to have at least some more or less relevant benchmark for establishing the development vector of the project and the company as a whole.
Integrating user feedback
The article could end here, but there is another fairly common scenario: problems with an application go unnoticed and reach a wide range of users. At such moments, support is flooded with messages from dissatisfied customers who decided to report failures or performance issues personally.
Customer support and development often exist on different planes. While developers generate ideas and build cool features, the support team apologizes for things that are not its fault and listens to requests and demands that it cannot fulfill. Most often, support is used as a spam filter whose task is to shield developers from overly active users and let them work in peace, while collecting feedback from clients. Very often it is support that informs developers about critical failures or malfunctions, because support is the one receiving the barrage of messages. In an ideal world, however, support should not be the first line of notification, only a source of confirmation for information you already have and a channel for responding to users. That is, you, as developers, should learn about problems with the service not from your own tech support and clients, but from failure and performance monitoring systems.
Only then will you be able to start fixing the failure before a wave of tickets buries your support team for several days, and to give your customers a truthful answer: the problem is known and is being fixed. If you learn about the failure directly from customers, user dissatisfaction will only grow.
However, the world is not perfect, so you will always have situations where only your customers know about the problem and report it through your technical support. And such reports should never be ignored.
Coverage checklist:
- Implement a system for categorizing and quantifying issues reported by users to enable trend analysis.
- For easy monitoring, integrate user feedback from multiple sources into a centralized dashboard, including support tickets, social media, and direct feedback from other communication channels.
- Use sentiment analysis tools to gauge user sentiment across platforms. This will help you identify potential issues by analyzing the topics users are discussing.
- Build processes to correlate spikes in tickets and user messages with data from monitoring tools to identify potential outages and issues.
Notification checklist:
- Set up real-time alerts for abnormal increases in user tickets, allowing you to quickly respond to emerging issues.
- Set up alerts for significant changes in sentiment analysis metrics that indicate a change in user perception that may not yet be reflected in support tickets.
- Implement cross-functional alerts to ensure that spikes in user feedback reach both the technical and customer support teams, so that they can work together to find solutions to problems.
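To show how ticket spikes can be correlated with monitoring data, here is a minimal Python sketch. It assumes you have hourly ticket counts and know whether a monitoring alert is currently active; the median-based rule, the multiplier, the minimum ticket count, and the category names are illustrative choices of ours.

```python
from statistics import median


def ticket_spike(hourly_counts: list[int], current: int,
                 multiplier: float = 3.0, min_tickets: int = 10) -> bool:
    """True when the current hour has unusually many tickets compared with
    the typical (median) hour in the history."""
    if not hourly_counts:
        return False  # no history to compare against
    typical = median(hourly_counts)
    return current >= min_tickets and current > typical * multiplier


def classify(spike: bool, monitoring_alert_active: bool) -> str:
    """Combine the support signal with the monitoring signal."""
    if spike and monitoring_alert_active:
        return "known incident"  # support can confirm that a fix is in progress
    if spike:
        return "blind spot"      # users see a problem that monitoring missed
    return "normal"
```

The "blind spot" case is the most valuable one: every such spike points to a gap in coverage that is worth closing once the incident is resolved.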
Conclusion
In this article, we outlined the basic elements of the philosophy behind building a purposeful observability ecosystem, and emphasized the importance of coordinating efforts across monitoring, user experience analysis, and keeping applications functional. In another publication, we will try to look at specific cases and analyze some of these things in practice.