Mission possible, or how we passed Tier III Facility certification in a working data center [Part 1]
Imagine a quest where you need to turn an ordinary data center into a real refrigeration chamber, without freezing the client equipment and without creating a tropical paradise where there should be technological cold. Or, for example, load the data center to the maximum and then switch off the power feeds, with everything expected to keep working. And all this under the watchful gaze of auditors who are ready to find fault with every degree!
In September, Nubes Data Center completed this quest and earned Tier III Facility certification from the Uptime Institute. Two of our colleagues describe how we joined the roughly 5% of data centers in the world that have been tested with clients already housed in them.
In this article, Alexey Sidorov, a senior refrigeration engineer, shares his story and explains how to survive in an environment where heat guns and server racks are playing their own version of cat and mouse, and the monitoring system has decided to stage its own Independence Day. Stock up on popcorn (just don't put it close to the servers): it's going to get hot! Or rather, cold. In any case, read on and see for yourself!
Who is stronger: heat guns or air conditioners?
The CEO and the Director of Data Center Operations set the team the task of passing the audit and obtaining the Tier III Facility certificate. This is an important document for the company, one that opens the door to cooperation with the most demanding clients. Since we are a young provider, obtaining this certificate is essential.
Neither my team nor I had ever been through anything like this before, and to be honest, I was nervous. Nervous, and actively preparing for the test.
Broadly speaking, the senior refrigeration engineer has one task: to bring all air conditioning systems into “combat” readiness, reach the maximum heat load, and “hold” it for a whole week while switching off backup units according to the redundancy scheme. On top of that, the monitoring and alerting system had to be brought up to scratch.
It does not sound that difficult. You turn on the heat guns throughout the data center, start the air conditioners, and wait quietly for seven days. And perhaps that is how it would have gone, except that the first and biggest difficulty is that the data center is already in operation and hosts dozens of clients. One mistake, and the consequences will be sad for everyone.
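The redundancy requirement described above, holding the full heat load while backup units are switched off, comes down to simple arithmetic. Below is a minimal sketch of that check, assuming an N+1 scheme in which any single unit may be off. The function names and the capacity figures are hypothetical and are not taken from the audit.

```python
def survives_single_unit_loss(unit_capacities_kw, heat_load_kw):
    """Return True if the heat load is still covered with any one unit off."""
    if len(unit_capacities_kw) < 2:
        # With a single unit there is no redundancy at all.
        return False
    total = sum(unit_capacities_kw)
    # Worst case: the largest unit is the one that is switched off.
    return total - max(unit_capacities_kw) >= heat_load_kw


def shutdown_margins(unit_capacities_kw, heat_load_kw):
    """Spare cooling capacity (kW) left when each unit in turn is switched off."""
    total = sum(unit_capacities_kw)
    return [total - capacity - heat_load_kw for capacity in unit_capacities_kw]


# Hypothetical hall: four precision air conditioners of 100 kW each.
hall_units_kw = [100.0, 100.0, 100.0, 100.0]

print(survives_single_unit_loss(hall_units_kw, 280.0))  # True
print(survives_single_unit_loss(hall_units_kw, 320.0))  # False
print(shutdown_margins(hall_units_kw, 280.0))
```

A negative margin for any unit means that switching off that unit during the test would leave the hall short of cooling.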
Let's turn everything on at full power and see how everything works
Preparing for an audit is a wonderful opportunity to spot shortcomings and correct them. During testing, we ran into non-obvious problems that we would hardly have noticed in normal operation. Here is what to pay attention to:
1. When all precision air conditioners were running at full load, shortcomings appeared that we had not seen during comprehensive testing. For example, after a simulated power loss the compressor took about 10 minutes to reach full power, and in that time the temperature rose to the peak values allowed by the SLA. We had to spend a lot of time studying the parameters and experimenting with configurations to find the optimal settings. After the adjustments, the time it took an air conditioner to reach the required mode dropped from 10 minutes to 3.
2. Some minor air conditioner alarms were not displayed on the monitoring dashboard, the color indication in some places did not match, and the locations of two sensors did not correspond to the mnemonic diagram. Of course, we corrected all this.
3. At the time of the audit, the halls already contained client racks, which greatly limited where we could place the heat guns. You need a lot of them, and they must be positioned so that the flow of hot air does not overheat the client racks. To solve this, we decided to build “cold” aisles. Since time was limited, we could only use materials that could be obtained very quickly. Reinforced film, framing profile, and adhesive tape turned out to be ideal!
Of course, there were also drawbacks. The surface of the racks turned out to be too textured, the adhesive tape did not stick well, and we constantly had to re-tape the gaps that opened up. Overall, though, the decision fully paid off. Would I resort to it again in a similar situation? Definitely yes!
4. When the air conditioners began to operate at full capacity, a problem with the LAC valves came to light. After a long period of inactivity, the operating rod inside the valve had seized, leaving the valve stuck in the open position. As a result, the air conditioners ended up with excess refrigerant in the circuit and ran at high pressure. Since we did not have much time, we literally had to “pinch off” the valve tubes to force the valve closed. The valves were, of course, later replaced with new ones.
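The recovery time from item 1 can be read from logged compressor data. Below is a minimal sketch, assuming the controller log can be exported as pairs of (seconds since power was restored, compressor load in percent). The function name, the 95% threshold, and the sample values are invented for illustration; they only mimic the 10-minute and 3-minute figures mentioned above and are not measurements from the audit.

```python
def time_to_full_power(samples, threshold_percent=95.0):
    """Return the first timestamp (s) at which load reaches the threshold.

    samples: iterable of (seconds_since_power_restored, load_percent),
             assumed to be sorted by time.
    Returns None if the threshold is never reached in the log.
    """
    for seconds, load_percent in samples:
        if load_percent >= threshold_percent:
            return seconds
    return None


# Invented sample logs, shaped like "before" and "after" tuning.
before_tuning = [(0, 0.0), (120, 20.0), (300, 45.0), (480, 80.0), (600, 100.0)]
after_tuning = [(0, 0.0), (60, 40.0), (120, 75.0), (180, 100.0)]

print(time_to_full_power(before_tuning) / 60)  # 10.0 (minutes)
print(time_to_full_power(after_tuning) / 60)   # 3.0 (minutes)
```

Tracking this single number across configuration changes makes it easier to see whether a given setting actually shortened the ramp-up.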
A good engineer is not the one who makes no mistakes, but the one who draws conclusions!
In the end, as you have already gathered, everything worked out for us, but we drew some conclusions. And if you, like me, are responsible for the cold in a data center, read to the end.
Conclusion No. 1. It is better to undergo Tier III Facility certification before clients move into the data center, so that there is no risk of overheating client equipment.
Conclusion No. 2. Preparation for the audit should begin 3-4 months in advance.
Conclusion No. 3. Testing sequential shutdown under the N+1 scheme could have helped us identify the problem with the compressor ramping up to full capacity earlier. But because the load in the data center was insufficient, this test was carried out only in a limited form.
Conclusion No. 4. It is important to record all configuration changes, track the amount of refrigerant charged, monitor the system for leaks, and demand formal explanations from the contractor whenever there are questions or doubts.
Conclusion No. 5. When accepting equipment, simulate as many refrigeration system faults as possible and verify that each one is reported identically in the monitoring system. And in general, all equipment must be accepted strictly according to the checklist.
Conclusion No. 6. During commissioning, the valves must be carefully inspected. For example, a creaking sound during operation is a clear sign of poor-quality installation, which will lead to failure later on.
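The acceptance check from Conclusion No. 5 can be kept as a simple comparison between the faults simulated on the equipment and the alarms that actually appeared on the monitoring dashboard. Below is a minimal sketch; the alarm names and function names are hypothetical and do not correspond to any particular controller or monitoring product.

```python
def missing_alarms(simulated, displayed):
    """Alarms that were triggered on the unit but never showed up in monitoring."""
    return sorted(set(simulated) - set(displayed))


def unexpected_alarms(simulated, displayed):
    """Alarms shown in monitoring that nobody triggered during the test."""
    return sorted(set(displayed) - set(simulated))


# Hypothetical acceptance run for one air conditioner.
simulated = ["high_pressure", "low_pressure", "filter_clogged", "fan_failure"]
displayed = ["high_pressure", "fan_failure", "door_open"]

print(missing_alarms(simulated, displayed))     # ['filter_clogged', 'low_pressure']
print(unexpected_alarms(simulated, displayed))  # ['door_open']
```

An empty result from both functions for every unit is a reasonable line item for the acceptance checklist.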
So that is our “chilling” story, and it only covers the cooling systems. My colleague will tell you everything about the electrical side in the next article. And his article will no longer be “chilling”, but electro-shocking.