How we migrated into a cloud of… ash
“Vanya, hello! I have good news and bad news, as they say. It looks like we are finishing our migration to the cloud today.”
That was the call I got from Viktor, our VP of Engineering, around 7 pm on March 9 last year. Viktor speaks Russian but has never lived in Russia, so he often adds “as they say” or some proverb or saying that only he knows. But that is not the point here.
Almost a year has passed, and I have finally found the courage to write down the whole story without embellishing anything or leaving anything out. I hope this honest account helps the reader do better and learn from our mistakes.
7 pm in California is that rare time when Moscow has not yet woken up but our people here have already finished for the day, so you can relax. Back then we did not yet have a team in Tomsk, and I usually rested from 7 to 10 or 11 pm. But not on March 10. And not for the rest of March, either. To be honest, I hardly remember what March of last year looked like after the 10th. Or April. Time seemed to speed up that spring and only let go in the summer.
We migrated into the ash cloud: the move
On March 10 (00:47 French time, the evening of March 9 for me in California), OVH's data center in Strasbourg burned down. More precisely, two data centers, if stacks of containers can be called separate buildings at all. Until that moment, all I knew about Strasbourg was that it is in France, on the border with Germany, and that it is home to the human rights court where political prisoners and other unjustly wronged people file complaints. That day I also learned that we had servers in a data center there, and that it was on fire.
At that time, I had the following myths in my head:
The fire will soon be extinguished
Our product (the firewall) will continue to work; customers just won't be able to log in to their accounts
We have backups, we will restore everything
NOT ONE of these statements turned out to be entirely true. But to make the rest easier to follow, here is a brief note about Wallarm:
We sell a firewall for APIs and applications that runs as an Ingress / Nginx / Envoy module, analyzes traffic, and sends the analysis results (malicious requests) to the cloud
Customers go to the cloud to view these results in the UI and pull them via the API into PagerDuty, Slack, etc.
We have three clouds, and Europe, the one that burned at OVH, is the oldest in terms of infrastructure
The US cloud already runs entirely on Kubernetes, in GCP, but it is new. That is why our first big US customers were hosted in Europe
Now let's go through the myths in order.
Myth #1. The fire will soon be extinguished
The first message said that a fire had broken out in one of the rooms of a single DC, SBG2, and nothing seemed to bode trouble. Fires happen, and DCs have fire suppression systems.
A couple of hours later it was clear that things were bad and the data center was lost, that is, it had burned down completely. The fire had also spread to the neighboring container, SBG1, and burned half of it.
Here is the text that OVH founder Octave Klaba posted on Twitter:
We have a major incident on SBG2. The fire declared in the building. Firefighters were immediately on the scene but could not control the fire in SBG2. The whole site has been isolated which impacts all services in SGB1-4. We recommend to activate your Disaster Recovery Plan.
The funniest part is that in the summer of 2019 we had dinner with him in San Francisco and had a nice talk. Octave told us about their big plans for growth and for going public. That was when I should have given him one of our branded fire extinguishers…
Myth #2. Our product will continue to work
This one was almost true, but not for all customers. We sell software that runs in the customer's infrastructure and, in theory, does not depend on our clouds, one of which burned down in France. In practice, it does. When starting in Kubernetes, each new pod calls the cloud API and registers there; without that, it won't start. Our tests covered the case where the cloud goes away while the service is already running, but not the startup itself.
That is, after some time, all our Kubernetes customers in Europe would have stopped working. Worse, they would simply have stopped receiving traffic until our module was removed from their Ingress, because our software sits inline in the traffic path.
We solved the problem with a stub that, without any checks, returned JSON saying everything was fine whenever a pod registered. Later we completely refactored this part and did it properly: now the Ingress starts even if the cloud is unavailable.
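The idea of starting even when the cloud is unavailable can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function `start_node`, the `register` callable, the cache file, and the empty default config are all hypothetical names and assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def start_node(register: Callable[[], dict], cache_path: Path) -> dict:
    """Return the node config, preferring a fresh registration.

    `register` is a hypothetical callable that talks to the cloud API and
    returns the node config as a dict, or raises OSError on network failure.
    """
    try:
        config = dict(register())
    except OSError:
        # Cloud unreachable: fall back to the last known config, or to an
        # assumed empty default, instead of refusing to start.
        if cache_path.exists():
            config = json.loads(cache_path.read_text())
        else:
            config = {"rules": []}
        config["degraded"] = True
        return config

    # Registration worked: remember the config for the next cold start.
    cache_path.write_text(json.dumps(config))
    config["degraded"] = False
    return config
```

The point of the design is that a cloud outage downgrades the node to its last known state rather than taking customer traffic down with it. The startup path is also easy to test by passing a `register` callable that raises.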
Myth #3. Restoring everything from backups
In a nutshell, the backups were in the same data center. When we rented the dedicated servers, we wanted them close together so that internal connectivity would be better (according to one version of the story). Or we simply did not check where the servers we were given were located (according to the other version).
Either way, backups must be stored separately, and in the cloud. That is what we do now.
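One way to keep this rule from being quietly broken is an automated check over an inventory of where data and its backups live. The sketch below is an illustration under assumptions: the `Placement` record, the `colocated_backups` function, and the data center names in the usage lines are hypothetical and do not describe a real inventory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Placement:
    """Where one dataset lives and where its backup copies are kept."""
    dataset: str
    primary_dc: str
    backup_dcs: Tuple[str, ...]


def colocated_backups(placements: Iterable[Placement]) -> List[str]:
    """Return datasets that have no backup copy outside their primary DC."""
    return [
        p.dataset
        for p in placements
        if not any(dc != p.primary_dc for dc in p.backup_dcs)
    ]


# Usage with made-up names: "rules" is backed up only next to the data,
# "accounts" has no backup at all, "events" has an off-site copy.
inventory = [
    Placement("rules", "dc-a", ("dc-a",)),
    Placement("events", "dc-a", ("dc-b",)),
    Placement("accounts", "dc-a", ()),
]
assert colocated_backups(inventory) == ["rules", "accounts"]
```

A check like this could run in CI or on a schedule, so that a backup landing next to its data is reported long before it matters.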
As a result, we had to restore everything from logs, from old and partial backups on people's machines, and by decompiling customers' rules. I will say a bit more about that below.
We hack ourselves
So, our backups burned together with the data they were backups of 🙂
Broadly, all data in our cloud can be divided into four types:
Firewall customer events (essentially, attack logs). They were not affected at all, because at the time they lived separately in Riak (we have since switched to S3, and we do not recommend the now-dead Riak to anyone)
Customer accounts and the API keys of their product instances (in the worst case, all keys and passwords can simply be reset)
Firewall rules unique to each customer. We call this LOM, the Local Training Set (yes, we came up with the name ourselves). It is the most critical data, because the system is trained and rules are added constantly so that there are no false positives
Everything else: logs, settings, charts, and so on. Losing this does not degrade the quality of service of the firewall itself.
Here it is worth saying more about our rules, the LOM. They reach the customer not in the form in which they are stored in the cloud, but in a compiled form. This is needed to make the firewall fast (we have to process even very large JSON requests to APIs with almost no delay, at the cost of CPU and memory). So the rules are compiled from descriptions into something like a decision tree (think automata theory) and are shipped to the customer in that form. As a result, you cannot simply take the rules from customers and put them back into the cloud in a form that can then be extended and used.
The idea of writing a decompiler for the rules came up almost immediately, and it was eventually implemented. So we hacked ourselves a little: we asked customers for the compiled files with the LOM rules tuned for them and, after decompiling them, uploaded them back to our cloud. We could not simply start training customers from scratch, because the product runs in blocking mode, and both the false-positive settings and the structure of the customer's application API are stored in the rules. The project was completed successfully, and we restored the rules with minimal losses.
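The relationship between stored rule descriptions and their compiled form can be illustrated with a toy example. The real LOM format is not described in this post, so everything below is an assumption: rules are modeled as hypothetical (path, action) pairs, the "compiled" form is a plain nested dictionary, and the names `compile_rules`, `decompile_rules`, and `LEAF` are invented for the sketch.

```python
from typing import Dict, List, Tuple

# A toy rule: (request path segments, action name). Both are hypothetical.
Rule = Tuple[Tuple[str, ...], str]

# Hypothetical marker key under which a tree node stores its action.
# The toy assumes no path segment is ever equal to this string.
LEAF = "__action__"


def compile_rules(rules: List[Rule]) -> Dict:
    """Fold flat rule descriptions into a tree that is fast to walk."""
    tree: Dict = {}
    for path, action in rules:
        node = tree
        for segment in path:
            node = node.setdefault(segment, {})
        node[LEAF] = action
    return tree


def decompile_rules(tree: Dict, prefix: Tuple[str, ...] = ()) -> List[Rule]:
    """Walk the tree and recover the flat (path, action) descriptions."""
    rules: List[Rule] = []
    for key, child in sorted(tree.items()):
        if key == LEAF:
            rules.append((prefix, child))
        else:
            rules.extend(decompile_rules(child, prefix + (key,)))
    return rules
```

In this toy the round trip is lossless, so `sorted(decompile_rules(compile_rules(rules)))` equals `sorted(rules)` when each path appears once. A real compiled format is built for speed rather than reversibility, which is why a decompiler there is a project of its own rather than a ten-line function.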
Looking ahead, I will say this: we DID NOT LOSE A SINGLE CUSTOMER because of this incident, even though nothing was in our favor from the very beginning. Moreover, from those who pay us for support we received not only words of support but also real help, in the form of the LOM files I wrote about above. Most importantly, we felt understanding and genuine love from our customers, which is hard to measure or put into words.
I would like to thank everyone once again. It was truly heartfelt.
I wrote to the CEO of Google Cloud, and he replied
We decided to build the new European cloud on Google Cloud in Europe. But we had no account manager in France, only in the USA. Google is not the easiest company to communicate with, and it does not process tickets very quickly.
In the end, I decided to try my luck and write to the very top: I sent an email to Thomas Kurian, the CEO of Google Cloud, with the subject “URGENT: GCP Account Manager is missing in the middle crisis”. He replied in 3 (!!!) minutes and solved the problem with the manager. Everything had to be done quickly at that point, and, surprisingly, it worked.
Commemorative souvenirs and awards
After the incident, we paid bonuses to everyone involved (the “firefighters”), split into categories according to their contribution to the recovery and the number of sleepless nights. And to add a little humor, we also issued a series of “This is fine” meme dogs, each marked with the category of the “firefighter” who received it.
Do not store backups in the same DC as the data
Back up secondary data that still has value, and do it in the cloud
Back up the Kubernetes master
Run yearly “exercises” that simulate which parts of the system fail in which scenarios, especially if you have many software components in different places
Do not be afraid to give customers timely and honest status updates; they will help and support you
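The “exercises” item can start on paper, before anything is actually switched off. The sketch below, under assumptions, takes a map of where each component runs and what it depends on, and lists everything that goes down when one location is lost. The function `affected_by_outage` and all component and location names are hypothetical.

```python
from typing import Dict, List, Set


def affected_by_outage(
    location_of: Dict[str, str],
    depends_on: Dict[str, List[str]],
    lost_location: str,
) -> Set[str]:
    """Return every component that fails when `lost_location` is lost.

    A component fails if it runs in the lost location or if any of its
    dependencies fails (applied transitively).
    """
    down = {c for c, loc in location_of.items() if loc == lost_location}
    changed = True
    while changed:  # repeat until no new failures are discovered
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down


# Usage with made-up names: losing "dc-a" takes out more than what runs there.
location_of = {"cloud-api": "dc-a", "backups": "dc-a", "node-startup": "customer"}
depends_on = {"node-startup": ["cloud-api"], "restore": ["backups"]}
print(sorted(affected_by_outage(location_of, depends_on, "dc-a")))
```

In this made-up map, losing one location also breaks a startup path that runs in the customer's infrastructure. That is the kind of hidden dependency such an exercise is meant to surface.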
We live in a world where a fire can start in a data center because of a single uninterruptible power supply and burn down 2 buildings. Engineering, fire suppression, isolation: there is no guarantee that any of it will work. Rely only on yourself and your team. And hedge the risks of that reliance with proven vendors (hint: we are now proven).
Thanks to our customers and to our teams: support, who carried all the communication and all the stress on the front line; DevOps, for extremely level-headed work on the incident and for standing up a new European cloud extremely fast; development, for the rules decompiler, for all the programmers' help, and for the detective work; everyone I forgot to mention; and, of course, Viktor, for coordinating the process and for the disaster recovery plan, which was ready before the fire. I simply do not have the words to express my gratitude to everyone who helped and took part in this project. Thanks!
Taking this opportunity, I would like to invite you to work with us at Wallarm. We can hire remotely anywhere and help those who want to live in Europe with the move. We are cool. Email firstname.lastname@example.org (we are especially looking for strong product managers, DevOps engineers, and Ruby/Go/C developers).