How to properly brew tea when the server room is on fire

The company is full-remote, so we conduct a meeting of the paranoid circle something like this. Sometimes with a banjo in the corner.

In the life of any project, disaster eventually strikes. We cannot know in advance what exactly it will be – a short circuit in the server room, an engineer who dropped the central database, or a beaver invasion. Nevertheless, it will definitely happen, and for an extremely idiotic reason.

By the way, I wasn’t joking about the beavers. In Canada they chewed through a cable and left the entire Tumbler Ridge area without fiber-optic service. Animals, it seems to me, do everything they can to suddenly deprive you of access to your resources.

In short, sooner or later someone will definitely break something, drop something, or upload the wrong config at the most inopportune moment. And this is where the thing that separates companies that successfully survive a fatal accident from those that run in circles trying to prop up their crumbling infrastructure comes in – the DRP. Today I’ll tell you how to write a Disaster Recovery Plan properly.

When should you start writing a DRP?

Every employee hour spent costs money. An hour of an expert who knows every inserted crutch by sight and understands what can break costs even more. And often the expert and the person who can write in understandable language are different people. In short, a good document is not cheap, and you need to understand exactly when it is worth producing.

You should definitely do this if:

  • You have a clear point of failure, but no clear instructions on how to recover from it

  • When a server, a group of nodes, or a database goes down, your business will grind to a halt and a rapidly spinning loss counter will start.

Actually, the main story here is primarily about money. Did the node with your Telegram bot for ordering coffee to the office go down, while the source code is safely sitting somewhere? Let it lie, we’ll restore it when we have time.

Did the HashiCorp Vault cluster fall over because of a connectivity loss between data centers? Your billing system will be sad without its passwords. Here we definitely need a DRP, which can be combined with an architecture refactoring.

Write documentation

People rarely like writing documentation. I, on the contrary, enjoy it, because it lets me escape from routine tasks and focus on making the architecture transparent. Either way, it is an extremely important matter, since a few tens of minutes spent taking notes can save you many thousands of dollars later, during a serious accident.

Right away, I want to highlight an important point: troubleshooting documentation is a very useful thing, but it is not yet a DRP. We take this approach:

  1. The employee who understands the system best sits down and writes a passport for the information system. It includes a basic description of why this thing is needed at all, what systems are connected to it, and who to run to if something breaks (a minimal sketch of such a passport is shown below).

  2. The on-duty engineer comes across a problem in the logs or records an incident. If, in the process of eliminating it, he had to do something not very obvious, he writes down all the details, down to the exact console commands, and adds them to the Troubleshooting section.

  3. If the accident repeats, we create a ticket to eliminate its causes.

  4. If the causes cannot be eliminated for architectural or financial reasons, we attach to the alert description a link to the internal documentation that describes how to fix it.

That's it, we already have some kind of basic process that allows any employee to quickly fix the problem.
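
For illustration, a minimal passport for a made-up service might look something like this (the service name, dependencies, and contacts here are placeholders, not a real setup):

Service: coffee-bot
Purpose: Telegram bot for ordering coffee to the office
Depends on: PostgreSQL (orders database), office Wi-Fi gateway
Consumers: office managers, the espresso machine dashboard
Who to run to: @jdoe (maintainer), #office-automation duty chat
Troubleshooting: link to the internal wiki page with known failures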

Turn on the inner paranoid

In general, we already have documentation and an understanding of how to fix basic problems, but we still don't have a DRP. For a real Disaster Recovery Plan, we first need to decide on a threat model.

There is a relatively limited list of what can break. It usually comes down to a small number of categories like this:

  • Application failure

  • OS services failure

  • Network failure

  • Hardware failure

  • Virtualization failure

  • Orchestration system failure

At this stage, we usually get together as a team, pour coffee and tea, and begin to think about how to break our system most effectively. Bonus points and a pizza delivery go to whoever finds a way to take the system down for as long as possible with minimal effort.

As a result of such work, our backlog usually grows sharply due to tasks like “If script N runs before script M, this will irreversibly destroy the central database. We need to fix it.” What remains after eliminating the technical debt is your threat model, which includes those recidivist beavers, the cleaning lady in the server room, and other natural disasters beyond your control.

Accept the inevitable and go drink tea

So, you have accepted the fact that eventually everything will break. You've taken into account all possible factors, including the radio relay link being blocked by a balloon festival and yet another fire at one of your providers.

Now the task is to write the DRP itself. No, this is not the extensive troubleshooting section that you wrote earlier. A DRP, at its core, is a kind of sealed red envelope that, once opened, contains instructions like:

  1. Pick up the blue telephone handset

  2. Dictate to the attendant: “Desire – Rusty – Seventeen – Dawn – Furnace – Nine – Kind-hearted – Homecoming – One – Freight car”

  3. Wait for the answer tone

Even if your team consists exclusively of top experts in their field, the document should be written so that a person with an IQ no higher than 80 can understand it. Analysis of real incidents shows that during a severe accident a stressed engineer often not only fails to fix the problem but, on the contrary, completely finishes off the half-dead system.

Therefore, our documents almost always begin the same way:

“A short top-level guide for the default scenario. There will be many more words and options with details below. Let's start with the basic sequence:

  1. Brew some tea and stop being nervous.

  2. Notify those responsible”

And yes, tea is a must. In 5 minutes a dead system won’t get any worse, but the risk that an engineer will urgently pull every switch he knows becomes much lower. The last step reads: “Take the cold Guinness out of the refrigerator (it must be kept as an emergency reserve). DRP completed.”

DRP structure

Here's the rough structure we use as a template. Of course, you can adapt it to your own needs.

Document type: DRP
Information system: Enigma cluster
Date of creation: 06/07/2022
Date of last update: 02/13/2024
Frequency of updating and testing: 1 year
DRP test date: 02/13/2024

  1. Information system architecture

  2. Information Systems

  3. Notifying those responsible

  4. Key questions before DRP

    1. What should I do by default?

    2. A short top-level guide for the default scenario.

    3. When does this DRP apply?

    4. Principles for selecting a site for deployment

    5. How to understand where the deployment will go?

  5. Decision making and situation analysis

  6. Service deployment order and timing

  7. RPO and RTO

  8. Instructions for deploying DRP infrastructure

  9. Additional materials

Thus, an engineer who opens the document for the first time during an accident should immediately understand how to carry out initial diagnostics and begin recovery. The overall structure should answer a few simple questions:

  1. Who should be notified immediately about the start of an accident?

  2. How to properly carry out initial diagnostics before touching anything? Ready-made simple console commands, links to dashboards, and other useful things (see the example after this list).

  3. How much time do I have? Can I try to fix the system or do I need to start everything from scratch?

  4. How do we understand that the accident is over and we have switched to normal mode?
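
For example, the initial-diagnostics block can be nothing more than a handful of copy-paste commands. A rough sketch only; the host, service, and path names here are placeholders:

# Is the primary database host reachable at all?
ping -c 3 mysql-central-nl-1.example.com

# Is the service alive, and what do the last minutes of its logs say?
systemctl status mysqld
journalctl -u mysqld --since "15 minutes ago" --no-pager | tail -n 50

# Or has the disk simply filled up?
df -h /var/lib/mysql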

It is very important not to confuse the system passport, with its detailed description, and the DRP. Don't overload the latter with unnecessary information. The engineer should be able to simply follow the instructions, copying command after command.

A good practice is to add an “Additional Materials” section at the end of the document. The main brief instructions and other documents can refer to it as needed. The Troubleshooting section we described earlier fits perfectly into this block. Any other supplementary information should also be moved to the end so as not to break the minimalist style of the main instructions.

Vertical Gantt chart, very helpful in complex systems

If the service is complex and individual elements are deployed in parallel, then I would strongly recommend adding a Gantt chart that clearly describes the order of recovery and the approximate time frame for each stage. In text form, such information is more difficult to perceive.

At the very end of the document there should be a clear description of the conditions for canceling emergency mode, so that it is clear at what point you should move the load back and return to everyday operation.

Importance of IaC

Formally, the Infrastructure-as-Code concept is not required to implement a Disaster Recovery Plan. However, in most large information systems an engineer will not meet the recovery deadlines if he has to run from server to server, urgently editing configurations and changing DNS records along the way.

It is much more correct to describe all the main and backup infrastructure in Terraform and its configuration in Ansible. Optionally, you can even bake ready-made images with HashiCorp Packer if you adhere to the concept of immutable infrastructure.

In this case, you can keep the DRP in a state of combat readiness at near-zero cost. Structurally, it looks something like this:

  1. Describe your test and production infrastructure in terraform.

  2. Describe your DRP infrastructure in terraform, but add variables to trigger the deployment.

In variables.tf, add something like this:

variable "enable_mysql_drp" {
 description = "Condition for creation of DRP MySQL droplet. Droplet is created only if variable is true"
 type        = bool
 default     = false
}

In main.tf, describe the necessary parameters of your temporary infrastructure and tie them to those conditions:

resource "digitalocean_droplet" "rover3-mysql-central-nl-1" {
 image      = var.ol7_base_image
 count      = var.enable_mysql_drp ? 1 : 0
 name       = "mysql-central-nl-1.example.com"
 region     = var.region
 size       = "c2-32vcpu-64gb"
 tags       = ["enigma", "enigma-central", "enigma-mysql-central"]
 monitoring = true
}
resource "digitalocean_droplet" "rover3-mysql-central-nl-1" {
 image      = var.ol7_base_image
 count      = var.enable_mysql_drp ? 1 : 0
 name       = "mysql-central-nl-1.example.com"
 region     = var.region
 size       = "c2-32vcpu-64gb"
 tags       = ["enigmа", "enigma-central", "enigma-mysql-central"]
 monitoring = true
}

Now the engineer can deploy the entire temporary infrastructure with a few commands like these, straight from your instructions:

cd ~/enigma/terraform/DO
terraform apply \
-var="enable_mysql_drp=true" \
-var="enable_indexator_drp=true" \
-var="enable_clickhouse_drp=true" \
-var="enable_statistics_drp=true"

If you chose images already baked with Packer as the source for deploying the emergency infrastructure, you will get an almost ready-to-use environment within a few minutes. The overhead of storing the images themselves is usually low, but they do require regular updating. A minimal template sketch is shown below.
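
This is roughly how such a Packer template could look. It is only a hedged sketch: the plugin version, base image slug, region, droplet size, and provisioning steps are placeholders, not the article's real setup.

packer {
  required_plugins {
    digitalocean = {
      version = ">= 1.0.0"
      source  = "github.com/digitalocean/digitalocean"
    }
  }
}

variable "do_token" {
  type      = string
  default   = env("DIGITALOCEAN_TOKEN")
  sensitive = true
}

locals {
  # Timestamp without separators, used to make snapshot names unique.
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "digitalocean" "drp_base" {
  api_token     = var.do_token
  image         = "centos-7-x64"      # placeholder base image slug
  region        = "ams3"              # placeholder region
  size          = "s-2vcpu-4gb"       # placeholder build droplet size
  ssh_username  = "root"
  snapshot_name = "drp-base-${local.timestamp}"
}

build {
  sources = ["source.digitalocean.drp_base"]

  # Bake everything that rarely changes into the image, so the DRP droplet
  # is ready to serve traffic almost immediately after boot.
  provisioner "shell" {
    inline = [
      "yum -y update",
      "yum -y install mariadb-server", # example package only
    ]
  }
}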

Another option is to use Ansible directly to configure the entire new infrastructure according to your requirements. Also, don’t forget the time it takes to download and restore the database from a cold backup. This can be expensive and slow, so take it into account when planning; the rough steps look something like the sketch below.
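
A hedged sketch of those post-deployment steps; the inventory path, playbook name, bucket, dump file, and database name are hypothetical examples:

# Configure the freshly created droplets.
ansible-playbook -i inventory/drp.ini playbooks/mysql-central.yml

# Pull the latest cold backup and restore it. On large databases this step
# usually dominates the total recovery time, so account for it in the RTO.
s3cmd get s3://enigma-backups/mysql/central-latest.sql.gz /tmp/
gunzip /tmp/central-latest.sql.gz
mysql -u root -p enigma_central < /tmp/central-latest.sql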

Short checklist

Here are a few key points to consider:

  1. Murphy's laws have not been repealed. If something can fail, it will fail sooner or later, with a bang and side effects in unexpected places.

  2. Before you begin, conduct a full analysis and show the business how much a possible disaster would cost them and how much the DRP backup scenario would cost. Most often, resources are allocated after that. Sometimes the conclusion is that a day of downtime is cheaper than keeping a full standby reserve.

  3. Make sure you have backups.

  4. Make sure that backups are not only being made, but can actually be restored.

  5. Get creative and describe the threat model.

  6. Write the DRP. Then throw out most of the text without losing the meaning. Move anything secondary to the end of the document; if necessary, the engineer will look there.

  7. Include phone numbers, telegram accounts and other contacts of all key people in the document.

  8. The DRP needs to be tested. Seriously, this is a must. Businesses never have time for these tasks, but it is a very important process. The infrastructure of any living project can change beyond recognition in a year: tokens expire, access gets lost, configurations become invalid. Therefore, build at least an annual limited simulation into your business processes. Otherwise, at the moment of the accident, the engineer will type in commands from outdated instructions and finish off your infrastructure.

  9. Give the DRP to trainees in a restricted environment and see how they cope: do they struggle with the crutches or study the document with interest? Your plan is good if even the most inexperienced team member can figure it out.

By the way, if your company’s corporate policy allows it, I would strongly recommend taking a closer look at LLMs as an assistant for internal documentation. Yes, you need to understand that something like GPT-4 is not the ultimate source of truth. However, it is much more convenient to fix an accident in the middle of the night if there is an expert sitting next to you who keeps all 120 pages of documentation about your system's unique crutches in front of its eyes. We have only just begun to implement this approach, and it is already proving itself.

A neural network can quickly analyze a raw log and build hypotheses about the causes of an accident. This is very valuable when the on-duty engineer does not know all the intricacies of the failed system. I'll tell you more about that next time.
