Is your online store down? Monq business service monitoring will help you find the cause

For situations like this, we recommend trying the community version of Monq, an enterprise IT monitoring system that automatically builds a dependency tree of IT infrastructure components and tracks their status, helping you find the root cause of failures quickly. Later, as monitoring grows to cover more hosts and business processes, you can upgrade to the extended version.

There are no compiled statistics on the losses that online stores in Russia suffer from IT infrastructure failures; most likely, no one collects them. Foreign sources, however, cite impressive figures. It is estimated that, depending on store size, traffic, and other factors, one hour of e-commerce downtime at a major retailer can cost anywhere from $300,000 to more than $1 million. In a 2022 study from ITIC, 44% of companies reported that an hour of downtime costs them more than $1M (Uptime Institute).

By the way, this article has a video version on YouTube. If you prefer video, you can watch it here: https://youtu.be/i84nDbdPa5o.

Let's continue. In Monq, all infrastructure elements are represented as interconnected configuration items (CIs). Fig. 1 shows an example of a CI list. CIs in Monq can be business services, virtual machines, and even switch ports.

Next comes the “Operations Center”, a key tool that gives administrators a complete picture of each CI, showing both its current state and its impact on other components of the system. This lets you monitor large, heterogeneous IT environments with thousands of elements and get a comprehensive view of how the infrastructure is operating.

Fig. 1. Example of a list of configuration items in Monq

CIs are created in the CMDB and added to the resource-service model (RSM) automatically by low-code auto-discovery scripts. This greatly simplifies monitoring setup, since manually creating and managing the thousands of CIs a company typically has would be extremely labor-intensive. Many scripts are available out of the box, and you can also write your own low-code scripts to implement custom RSM management logic if the standard integrations are not sufficient.

A CMDB (Configuration Management Database) in a monitoring system is a database that stores information about the configuration items (CIs) of the IT infrastructure. These items may include servers, applications, network equipment, virtual machines, and other components critical to the functioning of the IT system. In short, the CMDB is a key component for proactive IT service management: it reduces the risk of downtime and helps you find and resolve the root causes of failures faster.

The role of the CMDB in the Monq monitoring system:

  1. A unified view of the state of the IT infrastructure (data centralization). The CMDB stores complete information about all configuration items and their relationships, which gives a single view of the IT infrastructure monitored by Monq.

  2. Management automation. The CMDB is integrated with the monitoring system to track configuration changes, which lets you quickly identify deviations and incidents.

  3. Support for the RSM. Dependencies between components are built in the CMDB, which lets you assess how the state of the infrastructure affects business processes.
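
To make the idea of dependency-based impact assessment more concrete, here is a minimal Python sketch. It is not Monq's implementation or API: the `ResourceServiceModel` class and the CI names are hypothetical, and the model is reduced to a plain directed graph in which a failure propagates to everything that depends on the failed item.

```python
from collections import defaultdict, deque


class ResourceServiceModel:
    """Toy dependency graph (hypothetical, not the Monq data model)."""

    def __init__(self):
        # Maps a CI to the set of CIs that depend on it.
        self._dependents = defaultdict(set)

    def add_dependency(self, dependent, dependency):
        """Record that `dependent` relies on `dependency`."""
        self._dependents[dependency].add(dependent)

    def impacted_by(self, failed_ci):
        """Return every CI that directly or transitively depends on failed_ci."""
        seen = set()
        queue = deque([failed_ci])
        while queue:
            current = queue.popleft()
            for parent in self._dependents[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Hypothetical CIs, loosely modeled on the online store example in this article.
rsm = ResourceServiceModel()
rsm.add_dependency("Online Store", "Payment for Goods")
rsm.add_dependency("Payment for Goods", "payment-pod-1")
rsm.add_dependency("Online Store", "Product Catalog")

impacted = rsm.impacted_by("payment-pod-1")
print(sorted(impacted))  # ['Online Store', 'Payment for Goods']
```

A breadth-first walk is used here so that each CI is visited once even when several dependency chains lead to the same business service.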

How Monq Enterprise Monitoring Works

Let's look at a specific example of how the resource-service model helps track relationships in the IT infrastructure. The operational map (Fig. 2) displays the key business services of a large retail chain, which underpin every process from product selection to payment and delivery.

An operational map is a top-level structure that shows how one object influences others through chains of connections. In our example, the map shows the normal state of business and IT processes, with all components functioning at 100%. Such ideal conditions do not always hold, however, and the monitoring system is ready to detect failures quickly.

Fig. 2. An example of a resource-service model of business services in a large retail chain

Now consider an emergency: the monitoring system has posted two alerts in the emergency chats, indicating a priority-one incident in the “Online Store” segment. A failure at this level usually means the service is completely unavailable, which directly affects key business processes.

On the heat map of the system state (Fig. 3), the performance level drops significantly, to 40%, which indicates that most customers cannot use the services.

Fig. 3. An example of a heat map of the system state when a key component fails
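
The article does not describe how the percentage on the heat map is calculated. As a rough illustration only, one common approach is to derive a parent's health from a weighted average of its children. The function name, the weights, and the child values below are assumptions chosen so that the example arrives at the 40% figure; they are not Monq's formula.

```python
def aggregate_health(children):
    """Weighted average of child health values (0 to 100).

    children: list of (health_percent, weight) pairs; the weights are hypothetical.
    """
    total_weight = sum(weight for _, weight in children)
    if total_weight == 0:
        raise ValueError("at least one child must have a non-zero weight")
    return sum(health * weight for health, weight in children) / total_weight


online_store_children = [
    (0, 3),    # "Payment for Goods" is completely unavailable
    (100, 1),  # product selection still works
    (100, 1),  # delivery still works
]
online_store_health = aggregate_health(online_store_children)
print(online_store_health)  # 40.0
```

Giving the payment service a higher weight reflects the point made above: when a key component fails, the health of the whole service drops sharply even though the other components are fine.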

To investigate problems in the IT infrastructure in detail, Monq provides a root cause analysis tool that simplifies the work of administrators and technical support engineers. In this case, a well-built resource-service model map (Fig. 4) shows that the cause of the incident is the complete unavailability of the “Payment for Goods” service. To confirm this, a root cause analysis was performed, starting from the top-level configuration item.

Fig. 4. Map of the resource-service model with indication of the source of failure

The status of the top-level configuration item (CI) is made up of several components, and in this example the Retail Sales component is at a critical level. Using the drill-down tool, you can pin down the problem by walking the entire dependency tree down to the primary incident (Fig. 5).

Fig. 5. The Retail Sales component is at a critical level

Root cause analysis determines how incidents affect the state of the selected CI (Fig. 6) and displays the complete list of incidents that disrupted business processes.

Fig. 6. Analysis of the root causes of failure
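
The drill-down described above can be pictured as a walk from the top-level CI through its failing dependencies until nothing below is failing. The sketch below is a simplified illustration, not Monq's algorithm: the `DEPENDS_ON` and `STATUS` tables, the CI names, and the two-state status model are all assumptions, and the dependency structure is assumed to be a tree without cycles.

```python
# Hypothetical model: each CI lists the CIs it depends on, plus a status.
DEPENDS_ON = {
    "Online Store": ["Product Catalog", "Payment for Goods"],
    "Payment for Goods": ["payment-pod-1", "payment-db"],
    "Product Catalog": [],
    "payment-pod-1": [],
    "payment-db": [],
}
STATUS = {
    "Online Store": "critical",
    "Payment for Goods": "critical",
    "Product Catalog": "ok",
    "payment-pod-1": "critical",
    "payment-db": "ok",
}


def find_root_causes(ci, depends_on, status, path=()):
    """Follow failing dependencies downward; return paths to the deepest failing CIs."""
    if status[ci] == "ok":
        return []
    path = path + (ci,)
    failing_children = [c for c in depends_on.get(ci, []) if status[c] != "ok"]
    if not failing_children:
        # Nothing below is failing, so this CI is a root-cause candidate.
        return [path]
    results = []
    for child in failing_children:
        results.extend(find_root_causes(child, depends_on, status, path))
    return results


root_causes = find_root_causes("Online Store", DEPENDS_ON, STATUS)
print(root_causes)
# [('Online Store', 'Payment for Goods', 'payment-pod-1')]
```

Returning the whole path rather than only the leaf keeps the chain of affected services visible, which is what an engineer needs when explaining the business impact of a low-level failure.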

Here is a little more detail on how the administrators were informed. The analysis showed that several alerts arrived at the same time:

1. The first signal was caused by failed synthetic monitoring builds, which reported that the product payment pages had stopped loading.

2. The second signal was raised because threshold values were exceeded on one of the infrastructure nodes.

3. The third signal came from an external monitoring system, and the number of events was significant.

Monq automatically suppressed the redundant signals (item 3), generating one signal with the rest linked to it as embedded information. This shows Monq's deduplication and event correlation mechanisms at work: they keep alerts from flooding the screen and let administrators process grouped events without sorting through each one individually (Fig. 7).

Fig. 7. Example of alerts on the operations center screen
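
One simple way to get this kind of grouping is to fold events that share a fingerprint into a single signal while they keep arriving within a time window. The following sketch assumes a fingerprint of CI plus check name and a five-minute window; the `Signal` class, the field names, and the sample events are hypothetical, and Monq's actual correlation rules may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    ci: str
    check: str
    first_seen: int
    last_seen: int
    events: list = field(default_factory=list)


def correlate(events, window_seconds=300):
    """Fold events sharing a (ci, check) fingerprint into one signal per burst."""
    open_signals = {}
    signals = []
    for event in sorted(events, key=lambda e: e["ts"]):
        key = (event["ci"], event["check"])
        signal = open_signals.get(key)
        # Start a new signal if none is open or the previous burst has gone quiet.
        if signal is None or event["ts"] - signal.last_seen > window_seconds:
            signal = Signal(event["ci"], event["check"], event["ts"], event["ts"])
            open_signals[key] = signal
            signals.append(signal)
        signal.events.append(event)
        signal.last_seen = event["ts"]
    return signals


# Timestamps are seconds from an arbitrary start; all values are made up.
events = [
    {"ts": 0, "ci": "payment-pod-1", "check": "http_5xx"},
    {"ts": 20, "ci": "payment-pod-1", "check": "http_5xx"},
    {"ts": 45, "ci": "payment-pod-1", "check": "http_5xx"},
    {"ts": 50, "ci": "node-7", "check": "cpu_threshold"},
    {"ts": 1000, "ci": "payment-pod-1", "check": "http_5xx"},
]
signals = correlate(events)
for s in signals:
    print(s.ci, s.check, len(s.events))
```

Here the three closely spaced events from the external system end up embedded in one signal, while the late event at `ts=1000` opens a new one because the window has expired.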

Alerts contain additional information about root causes, links to resources for further investigation, and information about the business processes affected or started as a result of these events. In this case, a process was started that sent a notification about the incident and triggered an escalation rule, which allows a limited time to resolve the incident before moving to the next stage.

Fig. 8. This is what a business process pipeline looks like in Monq
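
An escalation rule of this kind boils down to a chain of stages, each with a time budget. The sketch below is only an illustration of that logic: the recipients, the time limits, and the `current_escalation_level` function are hypothetical and are not taken from Monq.

```python
# Hypothetical chain: (who is notified, minutes allowed before escalating further).
ESCALATION_CHAIN = [
    ("on-call engineer", 15),
    ("team lead", 30),
    ("head of operations", 60),
]


def current_escalation_level(minutes_open, chain):
    """Return who is responsible after `minutes_open` minutes without resolution."""
    deadline = 0
    for recipient, minutes_allowed in chain:
        deadline += minutes_allowed
        if minutes_open < deadline:
            return recipient
    # The chain is exhausted: the incident stays at the last level.
    return chain[-1][0]


print(current_escalation_level(10, ESCALATION_CHAIN))  # on-call engineer
print(current_escalation_level(20, ESCALATION_CHAIN))  # team lead
print(current_escalation_level(50, ESCALATION_CHAIN))  # head of operations
```

Keeping the chain as data rather than code means the time limits can be tuned per service without touching the escalation logic.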

Analysis of the incident showed that the root cause was a problem in one of the containers (a Pod) serving the payment service. Once the problem is identified, the responsible employees or departments are promptly notified, and a ticket is automatically created in the service desk system for further processing.

Monq umbrella monitoring brings data from many disparate sources together on one screen, letting you keep the company's entire IT infrastructure under control. Specialized umbrella-type software from other vendors typically collects ready-made events and alerts from Zabbix, Prometheus, SCOM, and other systems. Unlike these competitors, Monq both collects ready-made alerts and works with raw data.

The system collects metrics and events, analyzes them, and visualizes them to assess the “health” of services and calculate SLA. One of its key features is alert storm protection, which automatically groups and filters notifications, reducing noise and helping you find the causes of failures quickly. It also helps the business understand how a shortage of IT resources (CPU, GPU, storage, memory, etc.) affects the stable operation of key services.
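
Working with raw data means the monitoring system itself decides when a metric becomes a problem. A common noise-reduction technique, shown here purely as an assumption about how such a rule might look, is to raise a signal only after several consecutive samples exceed a threshold. The function name, the threshold, and the sample values are hypothetical.

```python
def threshold_breaches(samples, threshold, min_consecutive=3):
    """Return the indexes at which `min_consecutive` samples in a row exceed the threshold."""
    run = 0
    breaches = []
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        # Fire once per run, at the moment the run becomes long enough.
        if run == min_consecutive:
            breaches.append(index)
    return breaches


# Made-up CPU utilisation samples, in percent.
cpu_percent = [55, 92, 95, 60, 91, 93, 97, 99, 40]
breach_indexes = threshold_breaches(cpu_percent, threshold=90)
print(breach_indexes)  # [6]
```

The short spike at the start does not raise a signal, while the sustained load later does, and only once, which keeps a single overloaded node from producing a stream of identical notifications.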

Using the monitoring event reporting service

To monitor the quality of the services provided, Monq includes a reporting service. It lets you create report templates and generate reports on the services of interest for a given period (Fig. 9). Reports can be sent automatically by email as files.

Fig. 9. IT infrastructure status report screen

The “Reports” module in Monq lets you configure several report templates:

  • Availability report: a report on the availability of an information system as a whole, with three levels of detail. An administrator or technical support engineer can get information on the availability of individual CIs, of information systems (IS) made up of CIs, or of complex ISs made up of many subsystems.

  • Availability (multi-report): a report on the availability of a complex system, which includes the availability indicators for the system as a whole and for each subsystem within it. For each subsystem, the multi-report shows its service time, working and non-working time, availability as a percentage and as time, maximum downtime, and other indicators, including SLA compliance.

SLA is one of the most important indicators in the report. It lets you monitor contractors and get a picture of internal processes and of how your employees work. After all, whether a server ran five minutes less in a week matters less than how customers felt and whether they were satisfied with the quality of the service overall.
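
The indicators listed for the availability reports follow from simple arithmetic: availability is the share of service time during which the system was up. The sketch below shows that calculation under stated assumptions. The `availability_report` function, its field names, the outage durations, and the 99.9% target are illustrative and do not reproduce Monq's report format.

```python
def availability_report(service_minutes, outages, sla_target_percent):
    """Summarise availability for one subsystem over a reporting period.

    outages: outage durations in minutes that fell inside the service time.
    """
    downtime = sum(outages)
    uptime = service_minutes - downtime
    availability = 100 * uptime / service_minutes
    return {
        "uptime_minutes": uptime,
        "downtime_minutes": downtime,
        "availability_percent": round(availability, 3),
        "max_outage_minutes": max(outages, default=0),
        "sla_met": availability >= sla_target_percent,
    }


# A 7-day period served around the clock is 7 * 24 * 60 = 10080 minutes.
report = availability_report(10080, [5, 42], sla_target_percent=99.9)
print(report["availability_percent"], report["sla_met"])
```

With these made-up numbers, 47 minutes of downtime in a week gives roughly 99.53% availability and misses a 99.9% target, whereas a single five-minute outage in the same week (about 99.95%) would still meet it.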

Conclusion

To conclude, let us recap the functionality of Monq, which covers the monitoring process end to end, from collecting and processing data to automated incident management and assessing the “health” of objects.

The system is built on processing data streams: events from external monitoring systems (Zabbix, Prometheus, etc.), topology change events, and raw logs and metrics. These are converted into signals, which are used to manage the resource-service model (RSM) automatically. In other words, the system can generate alerts from raw data, and it can also aggregate ready-made events from external systems, create signals based on them, and link them to CIs. Another important component is the synthetic monitoring module, which processes autotests and also acts as a source of signals. The payment service incident discussed above is an example of such a signal.

To describe the company's IT infrastructure and prepare monitoring logic, Monq includes a low-code automation engine that lets you implement both simple and complex scenarios, depending on your requirements. These scripts can fully replace several services and can be adapted to the company's needs without waiting for vendor releases. In addition, Monq offers several standard solutions out of the box, such as integrations with popular systems (Zabbix, Kubernetes, Prometheus, etc.) and ready-made content packs, which greatly simplify setting up monitoring and automation.

It is also worth noting that the new version 8 of Monq adds support for simplified business processes: automated pipelines for tasks such as notifying on-duty services, creating tickets in the service desk, integrating with external systems, and building escalation chains.

You can always download the community version of the product and try building monitoring with it to evaluate its quality. And once you have questions and your own opinion of the product, our team will answer them and help you deploy Monq monitoring across the company's entire IT infrastructure.

Contact us by email at askformonq@monqlab.com, or, if something doesn't work or other questions arise, join our community on Telegram.
