The Mail.Ru mail service, the Qiwi payment system, the VKontakte social network, and OVH, Europe’s largest hosting provider, have all suffered serious data center failures. The companies not only lost money because of equipment failure but also suffered reputational damage. In this post, we’ll show you how to protect your data center from such threats.
Anything from high server loads to inadequate safety precautions can cause a breakdown or disaster in a data center. For example, the 2019 fire at the DataLine data center in Moscow was caused by a short circuit in the air conditioning system, and last year’s outage of a number of VKontakte features was caused by overheating server equipment. OVH’s data centers were hit by a problem with an uninterruptible power supply.
Force majeure events this serious do not happen every day; most breakdowns are less critical. Nevertheless, the problem is very common: according to a survey conducted by the Tsody.rf resource, almost 80% of companies have faced service interruptions due to data center failures. Constant monitoring of the engineering infrastructure helps prevent failures or minimize their consequences.
What is data center monitoring?
Most often, data centers take a semi-automatic or fully automatic approach to monitoring.
With semi-automatic monitoring, a responsible specialist or team constantly watches the readings of every sensor in the data center, from temperature and humidity sensors to sensors that detect coolant spilled on the floor, and responds quickly when a reading goes outside its normal range. The disadvantages of this approach are expensive manual labor, the need for specialists to be constantly present in the data center, and the lack of tools for storing and analyzing historical data. Employees only react to problems and cannot identify the patterns behind them.
Automatic monitoring is carried out remotely: a cloud platform collects and processes the sensor data and presents it to the operator in a convenient format. To access the data, you just need to connect to the platform from any computer or mobile device with network access. This approach reduces the number of specialists required to maintain the system. In addition, operators can work remotely, which has become especially important now that companies are forced to operate under quarantine restrictions.
Benefits of remote data center monitoring
Remote monitoring of equipment in the data center makes servicing more efficient: the operator receives information about the state of the infrastructure before a technician travels to the site. This can save a significant amount of money if the data center is located in another city.
An important advantage of the cloud platform is long-term data storage. Changes in any given indicator can be tracked over a period of time in order to identify problems.
A practical example: Eaton supplied a large customer, a data center in Finland, with several uninterruptible power supplies connected to a remote monitoring system. Over time, the system began to record a steady temperature rise in the data center. It turned out that the air conditioning system had stopped working and its own sensors were not working either. Comprehensive remote monitoring made it possible to identify the problem before it became critical.
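The kind of slow drift described in this example can be spotted by fitting a trend line to stored readings. The following is a minimal Python sketch, assuming readings are available as (hours since start, temperature) pairs. The function names, the sample values, and the 0.05 °C per hour threshold are illustrative assumptions, not part of any vendor's product.

```python
from statistics import mean


def slope_per_hour(samples):
    """Least-squares slope of value against time, in units per hour.

    `samples` is a list of (hours_since_start, value) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    v_mean = mean(v for _, v in samples)
    num = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


def is_steadily_rising(samples, min_slope=0.05):
    """True if the fitted trend climbs faster than `min_slope` units per hour."""
    return slope_per_hour(samples) > min_slope


# Hypothetical readings: temperature in degrees C, sampled every 6 hours.
history = [(0, 22.0), (6, 22.4), (12, 23.1), (18, 23.5), (24, 24.2)]
flat = [(0, 22.0), (6, 22.0), (12, 22.1), (18, 22.0)]

print(is_steadily_rising(history))  # rising series
print(is_steadily_rising(flat))     # essentially flat series
```

A single reading in `history` would not trip a typical alarm threshold; it is the fitted slope over the stored history that exposes the problem, which is why long-term storage matters here.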
In addition to reporting the current condition of the equipment, the system can be trained to predict equipment service life and maintenance needs. To do this, you need to integrate the platform with machine learning and artificial intelligence tools.
Finally, remote monitoring makes the operation of the data center as transparent as possible: business owners and top managers can check the condition of the equipment themselves, at any time and from anywhere in the world.
How does the remote monitoring system work?
The “red zone”, a set of indicator values that signal an emergency, lets the system report incidents promptly and accurately. As soon as the indicators “turn red”, the system automatically notifies the responsible employees. It is important to configure alerts so that there are not too many of them: if the system reports even the smallest deviations, such messages will become routine and specialists will miss a truly important signal.
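One common way to keep alerts from becoming noise is to require a reading to stay in the red zone for several consecutive samples before notifying anyone, and to send only one notification per incident. The sketch below illustrates that idea in Python. The class name `RedZoneRule`, the 18-27 °C range, and the `hold` count are assumptions chosen for illustration, not settings from any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class RedZoneRule:
    """Hypothetical alert rule: a value must stay outside [low, high]
    for `hold` consecutive samples before a notification is sent."""
    low: float
    high: float
    hold: int = 3
    _streak: int = field(default=0, init=False, repr=False)
    _alerting: bool = field(default=False, init=False, repr=False)

    def update(self, value: float) -> bool:
        """Feed one reading; return True only when a new alert should be sent."""
        if self.low <= value <= self.high:
            # Back in the normal range: reset so a later incident alerts again.
            self._streak = 0
            self._alerting = False
            return False
        self._streak += 1
        if self._streak >= self.hold and not self._alerting:
            self._alerting = True
            return True
        return False


# Assumed normal range for air temperature, in degrees C.
rule = RedZoneRule(low=18.0, high=27.0, hold=3)
readings = [24.0, 28.1, 26.5, 28.0, 28.4, 29.0, 29.3, 25.0, 30.0]
alerts = [rule.update(r) for r in readings]
print(alerts)
```

With these readings, the brief spike to 28.1 is ignored, and the sustained excursion produces exactly one alert rather than one per sample.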
In order to have a complete picture of the state of equipment in the data center, it is recommended to monitor three main groups of parameters:
1) environmental parameters – temperature, relative humidity, air composition – allow you to monitor the correct operation of air conditioning and cooling systems;
2) parameters of uninterruptible power supplies – voltage of each battery cell, total battery voltage, current consumption, power consumption, UPS status – make it possible to predict the need for maintenance or replacement of the UPS;
3) server parameters – load, network traffic – help you understand how to use the computing power of the center more efficiently and prevent it from being overloaded.
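As one illustration of how the UPS parameters in the list above might feed a maintenance decision, the Python sketch below flags battery cells whose voltage has fallen noticeably below the rest of the string. The function name, the sample voltages, and the 0.3 V tolerance are hypothetical; real limits depend on the battery type and the manufacturer's guidance.

```python
from statistics import median


def weak_cells(cell_voltages, tolerance=0.3):
    """Return indices of cells more than `tolerance` volts below the median."""
    mid = median(cell_voltages)
    return [i for i, v in enumerate(cell_voltages) if mid - v > tolerance]


# Hypothetical per-cell readings in volts; index 3 is sagging.
cells = [13.5, 13.6, 13.5, 12.9, 13.6, 13.5]
print(weak_cells(cells))
```

Comparing each cell with the median of the string, rather than with a fixed number, keeps the check meaningful when the whole string's voltage shifts with charge state.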
How often metrics are collected depends on the parameter. Power supply parameters should be measured at least once a second, whereas temperature and humidity can be checked every 10-15 minutes. Remote monitoring systems allow you to adjust the collection frequency manually.
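Per-parameter collection frequencies can be expressed as a simple table of intervals. The Python sketch below assumes hypothetical metric names and shows how a collector might decide which metrics are due at a given moment. The one-second and 10-15 minute values follow the text; the server interval is an assumption, since the text gives no figure for it.

```python
# Hypothetical collection intervals in seconds, grouped as in the text.
INTERVALS = {
    "ups.battery_voltage": 1,
    "ups.power_consumption": 1,
    "env.temperature": 600,   # 10 minutes
    "env.humidity": 900,      # 15 minutes
    "server.cpu_load": 60,    # assumed; the text gives no figure
}


def due_metrics(now, last_collected, intervals=INTERVALS):
    """Return the sorted names of metrics whose interval has elapsed.

    `now` and the values of `last_collected` are timestamps in seconds;
    a metric that has never been collected is always due.
    """
    return sorted(
        name
        for name, interval in intervals.items()
        if name not in last_collected or now - last_collected[name] >= interval
    )


last = {name: 0 for name in INTERVALS}
print(due_metrics(1, last))    # only the UPS metrics
print(due_metrics(600, last))  # everything except humidity

# Manual adjustment: sample temperature every 5 minutes instead.
custom = {**INTERVALS, "env.temperature": 300}
print(due_metrics(300, last, custom))
```

Keeping the intervals in data rather than in code is what makes manual adjustment cheap: changing a frequency is a one-line configuration edit.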
Potential disadvantages of remote monitoring systems
In order for the remote monitoring system to work effectively, you should study its vulnerabilities: this will help prevent malfunctions.
Firstly, the system may contain errors caused by the human factor, that is, developer mistakes. Of course, no one is immune from mistakes, but in systems developed by reliable and experienced companies the chance of this is lower.
Secondly, there is a possibility of intrusion into the information infrastructure of the data center in order to steal data or disrupt the operation of critical infrastructure. To minimize this likelihood, technical security measures are required – for example, two-factor authentication at login, timely software updates, the use of cybersecurity software and, in general, the use of the most effective methods of protecting IT and OT infrastructure.
Thirdly, the equipment and computing infrastructure of the data center cannot be controlled through the remote monitoring system, since data is transmitted in one direction only: from the equipment to the cloud. The biggest risk is data spoofing, which would result in incidents not getting a proper response.
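One general technique for detecting spoofed readings is to have the sender attach a message authentication code that the receiver verifies. The Python sketch below uses HMAC-SHA256 from the standard library. It is an illustration under assumptions, not a description of how any specific monitoring platform works: the field names, the key, and the JSON encoding are hypothetical, and a real deployment would also need key management and replay protection.

```python
import hashlib
import hmac
import json


def sign_reading(reading: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding."""
    payload = json.dumps(reading, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_authentic(reading: dict, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_reading(reading, key), tag)


# Demo values only; a shared key must never be hard-coded in practice.
key = b"demo-key-not-for-production"
reading = {"sensor": "rack-12-temp", "ts": 1700000000, "value": 24.5}
tag = sign_reading(reading, key)

tampered = {**reading, "value": 21.0}
print(is_authentic(reading, tag, key))   # genuine reading
print(is_authentic(tampered, tag, key))  # altered value
```

A reading whose value was altered in transit no longer matches its tag, so the platform can discard it instead of treating a falsified "normal" value as proof that nothing is wrong.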
A remote monitoring system is the optimal tool for tracking the state of a data center’s engineering infrastructure. It lets you keep watch over equipment and computing power without capital expenditures and reduces the reputational and financial risks of incidents. At the same time, the risk of errors is low and can be reduced further.