While everyone was celebrating my birthday, I repaired the cluster until morning, and the developers dumped their mistakes on me

Here’s a story that changed my approach to devops forever. Back in pre-COVID times, long, long before them, when the guys and I were only starting to think about our own business and were freelancing on random orders, one offer landed in my inbox.

The company that wrote to us was in data analytics. It processed thousands of requests every day. They came to us with the words: guys, we have ClickHouse and we want to automate its configuration and installation. We want Ansible, Terraform, Docker, and all of it stored in Git. We want a cluster of four nodes with two replicas in each.

A standard request, one of dozens, and it calls for the same good standard solution. We said “okay”, and in 2-3 weeks everything was ready. They accepted the work and began moving to the new ClickHouse cluster using our utility.

No one on their side wanted to tinker with ClickHouse or knew how. At the time we thought this was their main problem, so the company's CTO simply gave my team the go-ahead to automate the work as much as possible so that they would never have to go in there again.

We supported the move, and other tasks appeared: set up backups and monitoring. At that very moment the company's CTO left for another project, leaving one of their own in charge of us: Leonid. Lenya was not a very gifted guy. A simple developer who was suddenly put in charge of ClickHouse. It seems this was his first assignment to lead anything, and the sudden honor went to his head.

Together we started on backups. I proposed backing up the raw source data right away. Just take it, zip it and elegantly drop it into some S3 bucket. The raw data is gold. There was another option: back up the tables themselves in ClickHouse, using FREEZE and copying. But Lenya came up with his own solution.

He announced that we needed a second ClickHouse cluster, and that from now on we would write data to two clusters: the main one and the backup one. I told him: Lenya, that will not be a backup, it will be an active replica. If data starts getting lost in production, it will be lost in your backup too.

But Lenya gripped the wheel firmly and refused to listen to my arguments. We argued with him in chat for a long time, but there was nothing to be done: Lenya was driving the project, and we were just hired guys off the street.

We monitored the state of the cluster and charged only for admin work. Pure administration of ClickHouse, without touching the data. The cluster was available, the disks were OK, the nodes were OK.

We still had no idea that we had received this order because of a terrible misunderstanding within their team.

The manager was unhappy that ClickHouse was slow and that data was sometimes lost. He gave his CTO the task of figuring it out. The CTO figured it out as best he could and concluded that they just needed to automate ClickHouse, and that was it. But as soon became clear, they did not need a devops team at all.

All of this turned out to be very, very painful. And the most galling part: it happened on my birthday.

Friday evening. I booked a table at my favorite wine bar and called my homies.

Just before we left, we got a task to run an ALTER. We did it, and everything was okay: the ALTER went through, ClickHouse confirmed it. We had already gathered at the bar when they wrote to us that data was missing. We counted, and everything seemed to be there. So we went off to celebrate.

The restaurant was noisy on a Friday night. We ordered drinks and food and lounged on the sofas. All this time my Slack was slowly filling with messages. They were writing something about missing data. I thought: morning is wiser than evening. Especially today.

Closer to eleven, the calls started. It was the head of the company… “He has probably decided to wish me a happy birthday,” I thought, very unsure of it, and picked up the phone.

And I heard something like: “You fucked up our data! I pay you, but nothing works! You were in charge of backups, and you didn’t do shit! Go fix it!” Only rougher.

– You know what, go fuck yourself! It’s my birthday today, and I’m going to drink now, not deal with your junior-grade hacks made of shit and sticks!

That’s what I didn’t say. Instead, I took out my laptop and set to work.

No, I was fuming, fuming like hell! I poured caustic “I told you so” into the chat, because the backup that was not a backup had, of course, saved nothing.

The guys and I figured out how to stop writes manually and check everything. We confirmed that some of the data really was not being written.

We stopped writes and counted the number of events per day. Then we inserted more data, and a third of it was not written. Three shards of 2 replicas. You insert 100,000 rows, and 33,000 are not written.
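To make the arithmetic concrete, here is a minimal Python sketch of one way a loss of about a third can arise with three shards: rows are routed by id modulo the shard count, and one shard silently drops its writes. The routing rule, the names `route`, `insert_rows` and `lost_rows`, and the idea that exactly one shard was at fault are assumptions for illustration, not a diagnosis of what happened in this cluster.

```python
from collections import Counter

SHARDS = 3  # "three shards of 2 replicas", as in the story


def route(row_id: int) -> int:
    """Pick a shard by row id modulo the shard count (an assumed routing rule)."""
    return row_id % SHARDS


def insert_rows(row_ids, broken_shards=frozenset()):
    """Return per-shard stored counts; rows routed to a broken shard vanish silently."""
    stored = Counter()
    for row_id in row_ids:
        shard = route(row_id)
        if shard in broken_shards:
            continue  # hypothetical failure mode: the write is dropped without an error
        stored[shard] += 1
    return stored


def lost_rows(sent: int, stored: Counter) -> int:
    """Compare what was sent with what actually landed."""
    return sent - sum(stored.values())


sent = 100_000
stored = insert_rows(range(sent), broken_shards={2})
print(lost_rows(sent, stored))  # 33333
```

Comparing the number of rows sent with the number stored, which is what we did by hand that night, is the quickest way to spot a loss of this shape.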

There was complete confusion. Everyone told each other to fuck off in turn: Lenya was sent there first, followed by me and the founder of the company. Only the CTO, who had joined in, tried to steer our shouting calls and chat threads toward finding a solution to the problem.

No one understood what was really happening.

The guys and I freaked out when we realized that a third of all the data was not merely unwritten: it was lost! It turned out that the company's procedure was as follows: after insertion, the source data was deleted irrevocably, and events were destroyed in batches. I imagined Sergei converting all of this into lost rubles.

My birthday was heading for the trash heap too. We sat at the bar and generated ideas, trying to solve the puzzle thrown at us. The cause of the ClickHouse failure was not obvious. Maybe it was the network, maybe the Linux settings. Anything, really; there was no shortage of hypotheses.

I never took a developer's oath, but it would have been dishonest to abandon the guys on the other end of the line, even if they blamed us for everything. I was 99% sure that the problem was not in our decisions, not on our side. The 1% chance that we had screwed up still burned with anxiety. But whichever side the trouble was on, it had to be fixed. Leaving clients, whatever they are like, with such a terrible data loss is too cruel.

Until three in the morning we worked at the restaurant table. We pushed events in with INSERT SELECT and rushed to fill in the gaps. When data has been lost, this is how it is done: you take the average data for the previous days and insert it in place of the data that got fucked up.
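The gap-filling idea above can be sketched roughly like this. We did it in ClickHouse with INSERT SELECT; the Python below only illustrates the averaging logic, and the function name `backfill_gaps`, the `lookback` parameter and the sample numbers are made up for the example.

```python
from statistics import mean


def backfill_gaps(daily_totals, lookback=3):
    """Replace None entries with the mean of up to `lookback` preceding values.

    Values that were themselves backfilled count as preceding values.
    """
    filled = []
    for value in daily_totals:
        if value is None:
            history = [v for v in filled[-lookback:] if v is not None]
            # With no history there is nothing to average, so the gap stays a gap.
            value = mean(history) if history else None
        filled.append(value)
    return filled


# Made-up daily event counts; None marks the day whose data was lost.
print(backfill_gaps([90, 110, 100, None, 105]))
```

This produces plausible numbers, not the real ones: the lost events themselves are gone, so the filled days should be treated as estimates.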

After three in the morning, a friend and I went to my place and ordered beer from a liquor store. I sat with my laptop and the ClickHouse problems while my friend told me something. In the end, an hour later he took offense that I was working instead of drinking beer with him, and left. A classic: that is what being friends with a devops guy is like.

By 6 a.m. I had recreated the table, and the data started filling in. Everything worked without loss.

Then it got hard. Everyone blamed each other for losing the data. If a new bug had appeared, I’m sure a shootout would have started.

In these flame wars, we finally began to understand that the company thought we were the guys who work with the data and look after the structure of the tables. They had confused admins with DBAs. And what they came asking of us was not what you ask of administrators.

Their main complaint was: what the fuck, you were responsible for backups and didn’t do them properly, you lost the data. And all of it laced with swearing.

I wanted justice. I dug up the correspondence and attached screenshots of everything, showing Leonid pushing with all his might for the backup that ended up being built. Their CTO took our side after my phone call. After that, Lenya admitted his guilt.

The head of the company, on the other hand, did not want to blame his own people. Screenshots and words didn’t work on him. He believed that since we were the experts here, we should have convinced everyone and insisted on our solution. Apparently, our task had been to teach Lenya and, on top of that, to go over his head, even though he was the appointed project manager, reach the boss and personally lay out all our doubts about the backup concept.

The chat oozed hatred and aggression, hidden and undisguised. I didn’t know what to do. Everything had come to a standstill. And then I was advised to take the simplest route: write to the boss privately and set up a meeting with him. Man, people in real life are not as brazen as they are in chat. The boss replied to my message: come over, no problem.

It was the funniest meeting of my career. My ally on the client side, the CTO, could not find the time. I went to the meeting with the boss and Lenya.

Over and over I replayed our possible dialogue in my head. I arrived far too early, half an hour ahead. The nerves kicked in, and I smoked 10 cigarettes. I understood everything: I was fucking alone. I would not be able to convince them. And I stepped into the elevator.

While it was going up, I flicked my lighter so hard that I broke it.

In the end, Lenya was not at the meeting. And the boss and I had a great talk about everything! Sergei told me about his pain. He didn’t want to “automate ClickHouse”; he wanted the queries to work.

I saw not a jerk but a good guy who worried about his business and was immersed in work 24/7. Chats often paint people as villains, scoundrels and idiots. But in real life they are people just like you.

Sergei didn’t need a couple of devops for hire. The problem they faced turned out to be much larger.

I said that I could solve his problems; it was just a completely different job, and I had a DBA friend for it. If we had found out at the start what was really going on with them, we would have avoided a lot. Late, but we realized that the problem lay in the shitty handling of data, not in the infrastructure.

We shook hands, and the fee was raised two and a half times, but on one condition: I take on absolutely all the mess with their data and ClickHouse. In the elevator, I got in touch with that same DBA, Max, and brought him into the work. The entire cluster had to be dug through.

The project we took over was full of trash. Starting with the aforementioned “backup”. It turned out that this “backup” cluster was not isolated. They tested everything on it, and sometimes even let it into production.

The in-house developers had put together their own custom data “insert”. It worked like this: batch up the files, run a script and merge the data into a table. But the main problem was that a huge amount of data was pulled for one simple query. Data joins, a query every second. All for the sake of one number: the total per day.

The in-house developers were using the analytics tool incorrectly. They went into Grafana and wrote their grand query. It pulled data for 2 weeks. The result was a beautiful graph. But in reality the query was fired every 10 seconds. All of it piled up in a queue, because ClickHouse simply could not keep up with the processing. This was the main cause. Nothing worked in Grafana, queries sat in the queue, and old, stale data kept arriving.
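The queue build-up is easy to see in a small Python sketch. It assumes a single dashboard panel firing one query per refresh and a server that runs those queries strictly one at a time. The 25-second query duration in the example is a made-up number; the only figure taken from the story is the 10-second refresh.

```python
def backlog_after(elapsed_s: int, refresh_s: int, query_s: int) -> int:
    """Count queries still waiting or running after `elapsed_s` seconds.

    Assumes one query is issued every `refresh_s` seconds starting at t=0,
    and that queries run one at a time, each taking `query_s` seconds.
    """
    free_at = 0    # moment the server finishes the query it is working on
    issued = 0
    completed = 0
    for t in range(0, elapsed_s, refresh_s):
        issued += 1
        start = max(t, free_at)      # wait in the queue if the server is busy
        free_at = start + query_s
        if free_at <= elapsed_s:
            completed += 1
    return issued - completed


# Refresh every 10 s, hypothetical 25 s heavy query, after 10 minutes:
print(backlog_after(600, 10, 25))  # 36
# The same refresh with a hypothetical 5 s query never falls behind:
print(backlog_after(600, 10, 5))   # 0
```

Once a query takes longer than the refresh interval, the backlog grows without bound, and every graph on the dashboard shows results for a moment that has long passed.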

We reconfigured the cluster and redesigned the insert. The in-house developers rewrote their “insert”, and it began to shard data correctly.

Max conducted a full audit of the infrastructure. He outlined a plan for moving to a full-fledged backend. But this did not suit the company. They expected Max to produce a magic secret that would let them work the old way, only efficiently. Lenya, who had learned nothing, was still in charge of the project. Out of everything that was proposed, he again chose his own alternative. As always, it was the most select … daring decision. Lenya believed that his company had a special path. Thorny and full of icebergs.

That, in fact, is where we parted ways: we had done what we could.

Covered in bruises and wiser for this story, we opened our own business and formulated several principles for ourselves. We will never again start work the way we did back then.

After this project Max the DBA joined us, and we still work great together. The ClickHouse case taught us to conduct a full and thorough audit of the infrastructure before starting work. We dig into how everything works, and only then do we accept tasks. And where we used to rush straight into maintaining the infrastructure, now we first do a one-time project that helps us understand how to bring it into working order.

And yes, we steer clear of projects with shitty infrastructure. Even for a lot of money, even out of friendship. Running sick projects is not profitable. This realization helped us grow. Either a one-time project to put the infrastructure in order followed by a service contract, or we just sail past. Past another iceberg.

PS So if you have questions about your infrastructure, feel free to leave a request…

We have 2 free audits per month, perhaps your project will be among them.
