Why MWS data centers are so reliable

MWS (MTS Web Services) includes the entire MTS data center infrastructure, both existing sites and those under construction. Our infrastructure is located throughout Russia, and the number of regions where we are present keeps growing.

Today MWS has 15 data centers, both large and modular. Although many people associate modular data centers with container equipment, in our case one is a full-fledged complex of industrial and office buildings with a total area of 3,500 square meters.

Our own network of geographically distributed data centers plays a key role in the development of MWS and of MTS as a whole. In this article we explain how we ensure its reliability, which practices we use when building and operating data centers, and why simply meeting the TIER III level is not enough for us.

Data center reliability level: Uptime Institute classification

As a reminder, TIER is the best-known certification awarded by the Uptime Institute. TIER certification objectively verifies that a data center is designed, built and operated in accordance with industry best practices. TIER status can be awarded at the design stage (Design), after construction (Constructed Facility), and for an operating data center under real conditions (Operational Sustainability).

TIER has four levels, which differ in the expected downtime per year and the corresponding availability. The higher the level, the higher the reliability:

  • TIER I – 28.8 hours (99.671% availability)

  • TIER II – 22.0 hours (99.749% availability)

  • TIER III – 1.6 hours (99.982% availability)

  • TIER IV – 0.4 hours (99.995% availability)
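
The downtime figures in this list follow directly from the percentages. A minimal sketch of the arithmetic, assuming a non-leap year of 8,760 hours (the function and dictionary names are ours, chosen for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a site may be unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Percentages as listed for each TIER level.
TIER_AVAILABILITY = {
    "TIER I": 99.671,
    "TIER II": 99.749,
    "TIER III": 99.982,
    "TIER IV": 99.995,
}

for tier, percent in TIER_AVAILABILITY.items():
    print(f"{tier}: {annual_downtime_hours(percent):.1f} hours per year")
```

Rounded to one decimal place, the results match the hours quoted for each level.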

What sets TIER III and higher levels apart from the lower ones is the ability to repair and upgrade engineering equipment without affecting the IT load. Meeting this requirement calls for significant infrastructure investment and ongoing testing and maintenance costs.

MWS data centers meet the TIER III level and are suitable for hosting critical services. However, this does not mean that our data centers stopped developing once they reached that standard. We have found the optimal balance between TIER III and TIER IV, and this rule applies to all our data centers, from large to modular.

Key parameters of data center reliability

The reliability of data centers is assessed according to very specific criteria. It is important to consider infrastructure redundancy – duplication of key components according to a redundancy scheme. This is necessary for their planned maintenance and operation in the event of an accident.

Let's look at redundancy schemes using a diesel generator set (DGS) in a data center as an example. For simplicity, we assume that the generators do not operate in parallel.

N

  • One diesel generator set connected to one power line

  • Failure of a diesel generator leads to a complete loss of power supply to the data center

N+1

  • Additional diesel generator set in reserve

  • In case of failure of the main diesel generator, the backup one will automatically turn on

  • The most common scheme for data centers

2N

  • Two diesel generator sets, each of which is capable of providing power to the data center

  • Two completely independent lines (each diesel generator is connected to its own power line)

  • Allows you to carry out scheduled maintenance or repair of one of the diesel generators without interrupting operation

2(N+1)

  • Combination of the 2N and N+1 schemes

  • Two independent diesel generators, each of which can power the data center, plus a reserve unit for each leg

  • An extremely reliable scheme, but it requires significant investment

TIER III requires all critical elements of the data center engineering systems to be made redundant under at least an N+1 scheme, where N is the number of elements needed for the systems to operate. The additional element (+1) takes over the load if one of the main elements (N) fails.
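
To compare the four schemes side by side, here is a small sketch that counts the units each one needs for a given N. The number of independent lines follows this article's description (a single line for N and N+1, two for 2N and 2(N+1)); the function name is hypothetical.

```python
def redundancy(n: int, scheme: str) -> tuple[int, int]:
    """Return (total units, independent power lines) for a redundancy scheme."""
    table = {
        "N": (n, 1),                 # no reserve at all
        "N+1": (n + 1, 1),           # one spare unit on the same line
        "2N": (2 * n, 2),            # two full sets on independent lines
        "2(N+1)": (2 * (n + 1), 2),  # two independent N+1 sets
    }
    return table[scheme]


for scheme in ("N", "N+1", "2N", "2(N+1)"):
    units, lines = redundancy(1, scheme)
    print(f"{scheme}: {units} unit(s), {lines} independent line(s)")
```

With N = 1, as in the diesel generator example, 2(N+1) needs four generators where N+1 needs two, which is where the extra cost comes from.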

Suppose a TIER III data center has two city power inputs and needs redundancy at a rated power of 2400 kW.

With N+1 redundancy, it has three power sources: two main and one backup. If one of the main sources fails, the backup diesel generator takes over the load.

The choice of redundancy scheme depends on the reliability and availability requirements of the data center. The N+1 scheme is the most common because it balances cost and reliability. The 2N and 2(N+1) schemes are used for data centers with very high reliability requirements, such as the TIER IV level.

MWS uses the 2N or 2(N+1) scheme for diesel generator sets, depending on what the site can accommodate.

Redundancy is needed not only for the “active” elements of the data center, such as diesel generators and air conditioners, but also for the associated infrastructure. It must be possible to take cables, pipes, electrical panels and other “passive” elements out of service for scheduled maintenance without risk to the rest of the equipment.

Resilient infrastructure: how it's done

The reliability of a data center depends on many factors, and a large share of them must be taken into account before construction begins. A well-designed infrastructure minimizes the risk of failures caused by internal and external factors.

Next, we will tell you in more detail about the systems, components and principles that determine the fault tolerance of our data centers.

Geo-distributed redundancy

In each availability zone, we usually have several data centers. Besides redundancy within each data center, it is important for us to have geo-redundancy at the city or regional level. To replicate disk arrays synchronously, data centers are typically located 10 to 70 km apart. Thanks to the connectivity configured between them, at least one independent copy is always available.

Construction approval

We select sites whose permitted-use category is “Communication” and submit the project for expert review. The experts confirm that the building complies with all standards: the requirements for the load-bearing capacity of floor slabs, heating, the sanitary protection zone and other parameters are met.

We select the territory for a data center carefully. We stopped placing data centers in existing buildings long ago and now choose sites that can be registered as our property.

In addition to complying with building codes, it is important that the facility does not pose a health hazard. Therefore, the data center is built in such a way as to minimize noise from diesel generators and air conditioners.

Power supply system

The power distribution system in a data center is critical. It must ensure continuous and reliable operation of servers and engineering equipment even after multiple technical failures or human errors.

The N+1 scheme for power supply systems has a significant drawback – all backup and main devices are connected in parallel to a single power bus, which ensures power distribution.

A single power line is a single point of failure: if it fails, the data center loses all functionality. If one external cable is damaged during excavation work, the entire backup power system can be lost.

In addition, when several diesel generators operate in parallel, they must be synchronized. Any desynchronization can lead to malfunctions. To solve this problem, you need to install controllers that synchronize the generators to a single sine wave.

Instead of the N+1 scheme, we use 2N or 2(N+1), depending on what the site allows. It is important to note that we strictly separate the IT load from the engineering systems: each has its own UPS system.

Let's consider the example of a site with two separate power systems of 2400 kW each. In the 2N scheme, each diesel generator is connected to its own cable line. Failure of any generator or cable will not interrupt the entire system.

In addition, we always require engineering equipment to be split into zones within the site itself. That is, if there are two power supply beams (independent feeds), the equipment of each beam must be housed in separate rooms.

Thus, we build the power supply system in data centers as follows:

  1. Diesel generators

Used as backup sources in case of main power failure.

  2. Two city inputs

The main power sources, which supply electricity from the city network. Having two inputs improves reliability because it reduces the likelihood of a complete power outage caused by a problem on one of the lines.

  3. Dual-powered server racks

All servers have two power supplies, each connected to its own power line. This allows the server to continue operating even if one of the units fails.

  4. Cross-connected ATS (automatic transfer switch) system

The scheme allows each of the feed beams (A and B) to receive electricity from either city input and from any diesel generator. The ATS provides maximum reliability: even if one or both city inputs are disconnected, or one of the diesel generators fails, both beams continue to receive energy from the remaining sources.

Even if three accidents occur simultaneously (on two power lines and one generator), our system will ensure stable operation of the data center.
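
The effect of the cross connection can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual switchgear: it assumes two city inputs and two diesel generators, assumes that any single healthy source can carry a beam, and uses hypothetical source names. The "straight" topology is included only as a contrast case.

```python
# Hypothetical source names, used only for illustration.
SOURCES = ("city_input_1", "city_input_2", "dgs_1", "dgs_2")

# Cross-connected scheme: each beam can be switched to any source.
CROSS = {"A": SOURCES, "B": SOURCES}
# Contrast case without the cross connection: each beam has its own pair.
STRAIGHT = {"A": ("city_input_1", "dgs_1"), "B": ("city_input_2", "dgs_2")}


def powered_beams(topology, failed):
    """Return, per beam, whether at least one reachable source is healthy."""
    return {
        beam: any(source not in failed for source in reachable)
        for beam, reachable in topology.items()
    }


# Three simultaneous failures: both city inputs and one generator.
failures = {"city_input_1", "city_input_2", "dgs_1"}
print(powered_beams(CROSS, failures))     # {'A': True, 'B': True}
print(powered_beams(STRAIGHT, failures))  # {'A': False, 'B': True}
```

In the cross-connected model both beams stay powered after three failures, while the straight topology loses beam A.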

Hot swapping of engineering equipment

The design and operation of the switchgear in the power supply system are also worth noting. Unlike a home installation, where replacing a breaker requires turning off all power, in a data center any circuit breaker can be hot-swapped without shutting down the power supply panel.

The system allows maintenance to be carried out on active equipment. Quickly replacing power components significantly reduces downtime and improves overall data center reliability.

Uninterruptible power supplies

The UPS is the heart of the data center. This is a universal solution to ensure reliability, efficiency and uninterrupted operation of the data center.

Main functions of the UPS:

  1. Providing high-quality power supply

UPSs smooth out fluctuations in the power grid and supply stabilized voltage to servers, which is extremely important for the reliable operation of power-sensitive equipment.

  2. Autonomous operation when the main power is turned off

In the event of an interruption in the supply of city electricity, the UPS automatically takes over the function of powering the equipment, ensuring uninterrupted operation until the diesel generator turns on.

N+1 in the context of a UPS

The N+1 scheme implies the presence of at least one backup UPS or power module in case of failure of the main one.

Smooth and automated switching between different power sources minimizes the risk of failures. Let's take a closer look at each stage of this process:

1. City power outage

In the event of an accident and loss of city power, the system automatically detects the problem. The fault is recorded at the main distribution board.

2. Load on the UPS

Because the UPS uses Online (double-conversion) technology, it takes over the load without interruption and continues to power the IT equipment from the batteries.

3. Starting diesel generators

While the UPSs are running, the diesel generators are started (the process may take several minutes).

4. Switching the load to diesel generators

Once the diesel generators have fully started and stabilized, the load is automatically transferred to them from the failed city input.

The UPSs do not turn off: they always operate in Online mode, but they stop powering the equipment from the batteries.

5. Charging UPS batteries

After transferring the load to the diesel generators, the UPS batteries go into charging mode to be ready for the next possible power outage.

6. Automatic switching control

All switching between power sources occurs automatically according to a pre-configured algorithm, which eliminates human error and ensures continuity of operation.
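
The six stages above can be condensed into a simple timeline model. This is only a sketch: the 120-second generator start time and the 900-second battery autonomy are hypothetical values chosen for illustration, not MWS figures, and the function name is ours.

```python
def it_load_source(t_seconds, generator_ready_after=120, battery_autonomy=900):
    """Return what ultimately supplies the IT load t seconds after city power is lost.

    In every case the load is fed through the online UPS; this function only
    says where the energy comes from. Default timings are hypothetical.
    """
    if t_seconds < 0:
        return "city input"          # normal operation before the outage
    if t_seconds >= generator_ready_after:
        return "diesel generators"   # load transferred, batteries recharge
    if t_seconds < battery_autonomy:
        return "UPS batteries"       # bridging until the generators are ready
    return "no power"                # batteries empty before generators start


for t in (-10, 0, 60, 120, 3600):
    print(t, it_load_source(t))
```

The model makes the key design constraint visible: battery autonomy must comfortably exceed the generator start time, otherwise there is a window with no power.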


For TIER III, one backup UPS is enough, but in our corporate data centers the approach to redundancy is much stricter. We use a 2(N+1) scheme for the UPSs of both the IT load and the engineering systems. Our UPS redundancy achieves a reliability level higher than TIER IV requires, ensuring stable operation even in the event of multiple failures.

Power systems are easy to scale thanks to modular UPSs, which can be combined to reach the required total power (for example, from 100 kW modules). The modular structure makes it easy to increase power and to maintain engineering equipment by replacing only the components that need it.

In the context of modular power supplies, the N+1 approach is sometimes interpreted as adding one extra module (+1) to the configuration. For genuine redundancy, however, the additional unit (+1) must be housed in a separate chassis.

Battery redundancy in UPS systems also deserves attention. Often N+1 redundancy is maintained for the UPSs themselves while the batteries are overlooked. Let's say you need to back up 100 kW of power for 15 minutes, and 50 batteries are enough for this. A genuine N+1 scheme requires two 100 kW UPSs. A common array of 50 batteries can be connected to both of them, but if that battery array fails, the whole system fails.

MWS installs two 100 kW UPSs, each with its own array of 50 batteries, which effectively makes the scheme 2N.
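
The difference between a shared and a dedicated battery array can be shown with a minimal sketch based on the 100 kW example. The class, the array names and the helper functions are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ups:
    name: str
    power_kw: int
    battery_array: str  # identifier of the battery array feeding this UPS


# Two 100 kW UPSs sharing one battery array.
SHARED = [Ups("ups_1", 100, "array_1"), Ups("ups_2", 100, "array_1")]
# Two 100 kW UPSs, each with its own battery array.
SEPARATE = [Ups("ups_1", 100, "array_1"), Ups("ups_2", 100, "array_2")]


def backup_power_kw(upses, failed_arrays):
    """Battery-backed power still available after the given arrays fail."""
    return sum(u.power_kw for u in upses if u.battery_array not in failed_arrays)


def survives(upses, failed_arrays, load_kw=100):
    """True if the remaining battery-backed UPSs can still carry the load."""
    return backup_power_kw(upses, failed_arrays) >= load_kw


print(survives(SHARED, {"array_1"}))    # False: the shared array is a single point of failure
print(survives(SEPARATE, {"array_1"}))  # True: the second UPS has its own batteries
```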

In terms of the external power supply connection scheme and the use of diesel generators, UPSs and other equipment, we fully comply with the TIER IV level.

Improving UPS reliability

Uptime Institute does not set specific requirements for battery types. We believe that batteries should last at least 10 years and meet the standards of EUROBAT, the European association of battery manufacturers. However, this says nothing about the quality of the components themselves.

The use of cheap and low-quality batteries increases the risk of failure, especially in emergency situations when the UPS is operating under load. If one battery fails, it can disrupt the entire system due to voltage and current mismatches.

Some may find it more convenient to install whatever batteries they like and then send an employee to check each UPS regularly. This approach is not for us: it is costly and prone to human error. Therefore, we pay a lot of attention to selecting high-quality system components.

Protection against power outages

Some servers and engineering equipment may have only one power supply, which makes them vulnerable to power outages. In modern data centers, where two power feeds are brought to each rack, even equipment with a single power supply can be protected with local automatic transfer switches (ATS).

A local ATS takes two power sources and switches them to a single output, so the equipment stays powered if one of the sources fails.
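
The selection logic of such a switch can be sketched in a few lines. This is a simplified model that assumes a preferred feed and ignores transfer time and switch-back delays. The function name and feed labels are hypothetical.

```python
def ats_output(feed_a_live, feed_b_live, preferred="A"):
    """Return which feed ("A" or "B") the single output uses, or None if both are down."""
    order = ("A", "B") if preferred == "A" else ("B", "A")
    live = {"A": feed_a_live, "B": feed_b_live}
    for feed in order:
        if live[feed]:
            return feed  # first live feed in priority order wins
    return None


print(ats_output(True, True))    # A
print(ats_output(False, True))   # B
print(ats_output(False, False))  # None
```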

Refrigeration center: principles of reliable operation

The refrigeration center maintains optimal temperature and humidity in the data center. It comprises several systems.

Our refrigeration centers are organized according to the classic N+1 scheme. Moreover, all engineering equipment works according to this scheme: chillers, pumps, air conditioners, humidification and dehumidification systems, ventilation units, etc.

Pumps circulate coolant through a piping system from the chillers to the internal air conditioners, which use heat exchangers and fans to supply cooled air to the server racks.

There is no direct information communication line between chillers located outside and indoor air conditioners. Their coordination occurs through physical interaction – through the hydraulic system in which the coolant circulates. These devices “communicate” with each other through changes in temperature and coolant flow.

It's important to note that in chiller-based cooling systems we use a ring circuit that can be switched to a beam (radial) circuit, so any work can be carried out without disconnecting the IT load. Using various types of shut-off valves, such as gate valves, we redirect the flow to isolate damaged sections or sections under maintenance without affecting the rest of the system.

All refrigeration systems are powered via two independent beams, which significantly increases reliability. TIER III allows machine rooms and engineering infrastructure to be connected to common uninterruptible power supplies. We took a different route: the refrigeration center is connected to its own independent UPS.

Even in a major accident with a complete loss of active cooling, the servers keep working. For this we use storage tanks: special containers holding a supply of cold coolant. If the electricity goes out, the coolant from the tanks flows to the air conditioners.

Storage tanks provide cooling for longer than the battery life of the IT UPS. Even if the emergency drags on, the UPS batteries are discharged and the chillers are not running, the servers will still be cooled.
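
How long a tank can carry the heat load can be estimated from the heat capacity of the stored coolant. The sketch below assumes the coolant has the properties of water (a glycol mixture would differ) and ignores mixing losses. The tank volume, temperature rise and heat load are hypothetical numbers, not MWS figures.

```python
def tank_autonomy_minutes(volume_m3, delta_t_k, heat_load_kw,
                          density_kg_m3=1000.0, specific_heat_kj_kg_k=4.186):
    """Minutes of cooling from a tank: stored energy (kJ) / heat load (kJ/s) / 60."""
    stored_kj = volume_m3 * density_kg_m3 * specific_heat_kj_kg_k * delta_t_k
    return stored_kj / heat_load_kw / 60


# Hypothetical case: 100 m3 of coolant allowed to warm by 6 K under a 2400 kW load.
minutes = tank_autonomy_minutes(volume_m3=100, delta_t_k=6, heat_load_kw=2400)
print(f"{minutes:.1f} minutes")  # about 17.4 minutes
```

With these illustrative numbers the tank outlasts a 15-minute battery reserve, which is the relationship the text describes.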

Finally, there are no air conditioners or pipes with cold coolant in the machine rooms. The Uptime Institute does not have such a requirement, but we want to completely eliminate contact between servers and liquid and to minimize the presence of service personnel in the machine rooms.

Fire extinguishing system

Uptime Institute does not give specific recommendations on how to implement a fire suppression system. There are different options: for example, water mist can be used. But under our internal standards we do not run pipes with liquids into the machine rooms, so we install gas fire suppression systems everywhere. We use a gas that is safe for people and equipment, breaks down quickly and leaves no contamination.

An early fire detection system is installed in the machine rooms. It detects a fire at the earliest stage, before open flames or heavy smoke appear. The system draws air through tubes from different parts of the machine room, including the area above the server racks and the hot and cold aisles (where the airflow that cools the servers is created). The sampled air enters a special unit and is analyzed for the smallest particles of combustion products.

The fire detection system has several alarm levels: for example, a warning about a possible malfunction or about the start of a fire. If a fire is confirmed, the system can release the gas automatically. Either way, staff receive a notification immediately.

Isolation of engineering equipment

According to the TIER III standard, utility infrastructure equipment (air conditioners, main distribution board, batteries, etc.) can be placed in one room. Obviously, this approach has disadvantages.

In our data centers, all important systems are located in isolated rooms, which corresponds to the TIER IV level. This separation is critical to prevent the data center from shutting down in the event of a fire or other disaster. Separating rooms also prevents equipment from interfering with each other.

Isolation of machine rooms

According to reports from the Uptime Institute, over the past three years, 40% of incidents in data centers were caused by human error. Of these incidents, 85% are due to employees not following procedures or flaws in the processes and procedures themselves.

To reduce the influence of the human factor, we use access control systems to define which employees may enter which premises. As a result, no one gets close to sensitive server equipment unless absolutely necessary.

Uptime Institute does not regulate the distribution of access rights, so our internal security standards exceed generally accepted TIER practices.

Battery Isolation

Batteries are the most fire-hazardous element of a data center. This is due to the chemical composition and properties of batteries, which under certain conditions (such as overheating, short circuit or physical damage) can easily cause a fire.

We isolate all batteries – they are physically separated from other premises and infrastructure elements. For additional security, battery rooms are located in different parts of the data center building.

Conclusion. Why is it important to exceed TIER III level?

Around the world, TIER certification plays a key role for commercial data centers: it is an objective and independent assessment that helps clients make a choice. But certification is not the only criterion of reliability.

MWS adheres to high standards of reliability and availability when constructing data centers, which may exceed those generally accepted in the market. Our system of distributed data centers is a unified utility infrastructure for the MTS ecosystem. The functioning of all MTS services depends on its effective operation. Therefore, we duplicate and reserve more than the TIER III standard requires.

Drawing on 20 years of our own experience with complex infrastructure projects, we pay close attention to aspects of data center construction that the Uptime Institute specifies insufficiently or not at all. As a result, our practice of building and operating data centers ensures even greater reliability and availability of services.
