How not to survive an accident: bad advice

It seems like such a simple thing: if something breaks, restore it from a backup. Modern software lets you do this in just a few clicks, and you are the hero of the day. Perhaps no one will even notice that there was an accident.

But no! Nothing can stop a patriot, as they said in one wonderful film. Even during post-accident recovery, you can make such a mess that tomorrow your employment record will be nailed to the fence by the front gate with a 200 mm nail, and the fame of your deeds will echo through the vastness of the IT universe for a long time.

Today we will talk about how to avoid covering yourself forever in the glory of the mightiest specialist of all. Six reasons, some obvious and some less so, to stop and think. And yes, while reading, do not forget to hold up the Sarcasm sign, and remember: everything written below is based on real cases.

You are the admin, so you know better than they do what they need!

And no one has the right to tell you in what order, how urgently, and what exactly needs to be restored. Sure, you could play polite and make the rounds of the other department heads to draw up a recovery plan in case of an accident, but why? The CEO doesn't understand any of it and will sign anything, while the people on the ground will each pull the blanket toward themselves. Sales will say their database is the most important. The warehouse will say that an hour without shipments costs a week's revenue. And production will be genuinely surprised that such a trivial question is even being asked.

No, of course you could make the effort and bring them all together to set priorities and understand how critical specific services are to the business, then sign an SLA and build a rigorous plan. But why? IT is your area of responsibility, and others have no business poking their noses into it. Especially those who don't understand it.

Only ten people use this application? Then it is clearly some kind of nonsense and cannot possibly be a critical link in the production chain. The multi-terabyte file share, on the other hand, has always been held in high esteem. Ask anyone: everyone keeps something there, so it is quite obvious that right after the domain controller, it is the first thing to bring back up. Well, and mail, I guess. The rest can wait; nothing bad will happen.
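With the Sarcasm sign lowered for a moment, the alternative hinted at above (agree on criticality with the business, then restore in that order) can be sketched in a few lines. This is only an illustration under stated assumptions: the service names, the RTO values, and the dependency lists are all hypothetical, not taken from any real plan.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    rto_minutes: int  # agreed maximum downtime; the values below are made up
    depends_on: list = field(default_factory=list)


def recovery_order(services):
    """Return service names in restore order: dependencies first,
    then the tightest agreed RTO first."""
    by_name = {s.name: s for s in services}
    ordered, visiting = [], set()

    def visit(svc):
        if svc.name in ordered:
            return
        if svc.name in visiting:
            raise ValueError(f"dependency cycle at {svc.name}")
        visiting.add(svc.name)
        # Restore dependencies before the service itself.
        for dep in sorted(svc.depends_on, key=lambda n: by_name[n].rto_minutes):
            visit(by_name[dep])
        visiting.discard(svc.name)
        ordered.append(svc.name)

    for svc in sorted(services, key=lambda s: s.rto_minutes):
        visit(svc)
    return ordered


# Hypothetical inventory: the ten-user app turns out to have the tightest RTO.
services = [
    Service("file-share", rto_minutes=480, depends_on=["domain-controller"]),
    Service("mail", rto_minutes=240, depends_on=["domain-controller"]),
    Service("production-app", rto_minutes=30, depends_on=["domain-controller"]),
    Service("domain-controller", rto_minutes=60),
]
print(recovery_order(services))
# ['domain-controller', 'production-app', 'mail', 'file-share']
```

The point of the sketch is that the order comes from numbers the business signed off on, not from how many terabytes a service occupies.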

Don’t bother with manuals. It’s impossible to know everything

After all, what did you pay for? Fine, if the software were free and you had assembled the hardware yourself from whatever parts were lying around, you could still somehow justify spending a week poring over setup and operation manuals. But first you bought backup software at the price of a cast-iron bridge, then a hefty storage system to integrate with it, which ate nearly two annual budgets. And on top of that you laid a separate network. So any fool understands that there is no need to read anything more complicated than the Quick Start Guide. All these thick manuals are, of course, very important and useful, but your head is not made of rubber. You can't cram everything into it.

Doing the initial setup is the sacred duty of any vendor, and if something changes later, everything inside is always intuitive, and the developers have built in protection against any attempt to shoot yourself in the foot. And if something does not go according to plan, you can always call support. Or write an email. Or even find a Telegram chat where people will be happy to help you almost immediately.

Also, do not try to look under the hood of the applications installed on your beloved servers. Did the DB admin say something about backups already configured through a third-party application? Eh, it doesn't matter. Two backups are always better than one. Besides, you still need to configure backup for your clustered mail server. Although what is there to configure? It's in a cluster, so what could happen to it? Back up any one node and that's all there is to it. Better go and start moving data to pass-through disks.

So there is no need to know anything more complicated than an operating system, and that only at the level of an advanced sysadmin. You're not paid for that.

The main thing now is to recover; where and how doesn't matter

If an accident has already happened, the first goal is to restore data and bring up critical services. So the motto for these troubled times is: “I see the goal, I see no obstacles.” What matters now is speed of reaction. Click-click, and a backup of the European server is already being deployed somewhere near Tver. It doesn't matter that the site there is half-dead and holds a server and a half; the main thing is to be fast. After all, the backups are stored there, so recovery will be quicker. Personal data of Europeans? Oh, who cares right now about your distant GDPR and the like. We need to recover. The hosting costs five times more than the one that went down? What difference does it make; they'll thank me later for reacting so fast. Running out of disk space? Then whatever doesn't fit is clearly some kind of nonsense and doesn't need restoring. The main thing now is to restore prod!
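For those who prefer to look before they click, the three questions skipped above (where may this data live, will it fit, and what will it cost) can be asked by a small pre-restore check. This is a rough sketch under assumptions: the `RestoreTarget` fields, the site names, and every number are hypothetical, and the region check is only a stand-in for whatever your actual data residency rules require.

```python
from dataclasses import dataclass


@dataclass
class RestoreTarget:
    name: str
    region: str          # where the restored data would physically live
    free_gb: int         # free disk space at the target
    monthly_cost: float  # hosting cost in whatever currency you budget in


def check_target(target, backup_size_gb, allowed_regions, max_monthly_cost):
    """Return a list of reasons the target is unsuitable (empty list = OK)."""
    problems = []
    if target.region not in allowed_regions:
        problems.append(f"region {target.region} is not allowed for this data")
    if target.free_gb < backup_size_gb:
        problems.append(f"only {target.free_gb} GB free, {backup_size_gb} GB needed")
    if target.monthly_cost > max_monthly_cost:
        problems.append(f"costs {target.monthly_cost}, budget is {max_monthly_cost}")
    return problems


# Hypothetical sites: the convenient one near Tver and a planned EU standby.
tver = RestoreTarget("tver-site", region="ru", free_gb=500, monthly_cost=5000.0)
standby = RestoreTarget("eu-standby", region="eu", free_gb=4000, monthly_cost=900.0)

for site in (tver, standby):
    issues = check_target(site, backup_size_gb=2000,
                          allowed_regions={"eu"}, max_monthly_cost=1000.0)
    print(site.name, "OK" if not issues else issues)
```

Running a check like this takes seconds, which is considerably less than explaining afterwards why European personal data ended up on a server and a half near Tver.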

One for all and all for one!

Uniform backup rules for the entire infrastructure are absolutely normal. The faint-hearted may object that there are systems whose failure would be fatal, and that protecting them means building active-active clusters, adding redundancy, and thinking through replication options. Take the network, for example. Can you imagine how much full redundancy would cost? You would have to build a second network right next to the first. But we are already protected from the clumsy hands of admins by continuous backup of the configs, so we can roll back at any time. If some SFP module burns out, even the most junior admin knows exactly which store to run to for a new one, and our suppliers are ready to deliver even a new network core at the first call. And in the half hour while they are on the way, nothing bad will happen.

Fine, others will argue, that was about the truly critical systems, without which everything grinds to a halt. But what about those where you really can wait an hour? A mail server, say, or a website where you can quickly put up an apologetic placeholder page? The main thing here, too, is not to give in to the paranoids and not to start building active-passive systems. They come with a mountain of problems of their own, synchronization for one. You could, of course, build a system that takes snapshots on a schedule and lets you roll back quickly with minimal losses. But that again means money for licenses and “idle” hardware. And by and large these risks are theoretical; it is protection against the improbable.

So, to avoid wasting extra money and time, a single solution with the maximum functionality is always enough, even if it does not protect against all risks. For everyone knows that multiple points of failure are always worse than one.

Schrödinger’s Disaster Recovery Plan

There are exactly two opposite problems with this excellent document. Some don't have it at all; others print it, sign it, and lock it in a safe. As a result, neither the first group nor the second knows how to behave in an emergency. The first justify themselves by saying that the infrastructure is a living thing and constantly changing, which means the plan would constantly need to be updated, re-approved, and so on. So it is much better not to bother with paperwork, to trust your own knowledge, and not to get in the way of people doing their work. They're professionals!

The second group is not afraid of paperwork: they document everything carefully and, in the event of an accident, follow the plan step by step. For, as everyone knows, if the plan was signed by the CEO, then all responsibility lies with him, and the plan itself must be followed to the letter. The main thing is that the document resembles a local copy of Wikipedia: the most detailed power-on procedures for absolutely all equipment, application dependency diagrams, the boot order of machines, who talks to whom over which port, numerous configuration checks, and basic tests after power-on. In short, level 80 bureaucracy. And if something has changed since the last approval, or halfway through the scenario things went nothing like what was described, well, that's just the cost of doing business. We will find the guilty and punish them; the main thing is to follow the plan strictly. The more paper, the better covered you are.

Rumor has it that there is a third, middle option: some people not only write step-by-step instructions but also allow employees to use common sense while carrying them out. And some entirely mythical characters even run drills, deliberately switching off entire systems and simulating typical infrastructure problems. True, it is unclear who hires such maniacs.
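If you happen to employ such maniacs, both failure modes described in this section (a plan that has drifted away from the infrastructure, and a plan that has never been tried in practice) are easy to flag automatically. Below is a minimal sketch, assuming you record three dates somewhere: the last plan review, the last infrastructure change, and the last drill. The function name, its parameters, and the 180-day drill interval are all assumptions made up for illustration.

```python
from datetime import date, timedelta


def plan_warnings(last_review, last_infra_change, last_drill, today,
                  max_drill_age=timedelta(days=180)):
    """Return warnings about a DR plan; an empty list means it looks current."""
    warnings = []
    if last_infra_change > last_review:
        warnings.append("infrastructure changed after the plan was last reviewed")
    if last_drill is None:
        warnings.append("the plan has never been exercised in a drill")
    elif today - last_drill > max_drill_age:
        warnings.append("the last drill is older than the allowed interval")
    return warnings


# The plan from the safe: signed in January, never rehearsed since.
print(plan_warnings(
    last_review=date(2023, 1, 10),
    last_infra_change=date(2023, 6, 1),
    last_drill=None,
    today=date(2023, 9, 1),
))
```

A check like this does not replace common sense during the accident itself, but it does make it harder to discover in the middle of one that the plan describes last year's infrastructure.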

Everyone should know their place

It would be a real crime and a waste of the employer's money to let employees be distracted from the main job in their employment contract and dig into what their colleagues do. First, it distracts the colleagues themselves and keeps them from working at full capacity. Second, the employee is also constantly distracted from his own work, which increases the chance of mistakes. As a result, we get two people who are not working at full capacity. So it is completely unacceptable for a hypothetical network engineer to talk closely with a hypothetical engineer from the virtualization group. That way you will end up with neither a decent network nor properly working virtual machines. If they have a joint project, great: one team files a request with the other, a technical specification is drawn up, and the work is carried out strictly according to it. Uncontrolled exchange of knowledge is the path to chaos.

And most importantly, here is what it will save you from: in the event of an accident, if the relevant specialist is not on site, no one will even think of touching his territory. Even if he could supervise the work remotely, it is better to wait until he arrives than to break everything even more.

Those are the six bad tips.

They may sound trite, but it never hurts to check yourself once more and make sure that, in the event of an accident, you will not fail and will come out of the situation brilliantly.

What harmful advice for colleagues would you add?
