How we moved away from outsourcing and built our own effective DevOps team

My name is Kirill Shagin, and I lead the SRE, DevOps, and DBA teams at Vi.Tech, a subsidiary of VI.ru. We use a modern stack in our IT solutions: four K8s clusters and more than a million pipeline runs per month.

In this article, I share how we built an effective DevOps team and gradually moved most of our services away from outsourcing. Of course, the processes were not built in a day, nor on the first try. We tried a variety of approaches to reach our performance targets; some were abandoned, while others took root and are still in use today. Read on for the details.

How we came to outsourcing

When the company was small, several teams handled release deployments between them, but as the business and infrastructure grew, this stopped working. At some point we realized we needed to adopt DevOps practices. We deployed a Kubernetes cluster and moved all of our then-existing services into it. However, we lacked the expertise to configure the whole system properly, so we decided to bring in experienced outside specialists.

The outsourced engineers set up Kubernetes, built CI/CD pipelines, and began introducing modern DevOps practices. As a result, we identified two permanent streams of work handled by the outsourcer:

  1. Helping developers with day-to-day problems (duty shifts).

  2. Completing planned tasks from the development teams (“scale this up,” “spin that up”).

Three years later, the problems began…

As the company grew, the number of developers increased, and with it the number of new services and planned tasks. To keep up with the work, we would have had to expand the outsourcing team, which required additional funding we did not have.

The result: overdue tasks, let-down developers, and missed goals. To get out of the situation somehow, the development teams began doing some of the tasks themselves. This bred a whole zoo of technologies, since the developers never dove deeply into DevOps practices and completed tasks in a hurry, based on whatever Google turned up.

We take routine tasks back from the outsourcer

When we decided to take over duty shifts ourselves, our starting conditions were as follows:

  1. A team of six people (five engineers and one team lead).

  2. Every engineer takes duty shifts.

  3. No duty processes exist, so everyone works “by feel.”

The processes were unregulated, tasks were distributed chaotically, and the queue was not respected, so some tasks got lost. On top of that, response and resolution times did not meet customer expectations.

To fix the situation, we documented a duty process, under which the engineer:

  1. Takes a task.

  2. Sets its status to “in progress.”

  3. Solves the problem.

  4. Carries over anything unfinished to the next working day.

  5. Updates the documentation once the problem is solved.

Using Grafana OnCall and Slack, we automated contacting the duty engineer. A developer wrote in the Slack channel, and the bot sent the duty engineer a message containing the task text and a link to the thread.

Message to the duty engineer from the bot
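
For illustration, here is a minimal sketch of such a notification in Go using the slack-go library; the channel ID, environment variables, and message layout are assumptions for the example, not our production code.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/slack-go/slack"
)

// notifyDutyEngineer sends the duty engineer a message containing the task
// text and a permalink to the Slack thread where the request originated.
func notifyDutyEngineer(api *slack.Client, dutyChannelID, taskText, threadURL string) error {
	msg := fmt.Sprintf("New duty request:\n%s\nThread: %s", taskText, threadURL)
	_, _, err := api.PostMessage(dutyChannelID, slack.MsgOptionText(msg, false))
	return err
}

func main() {
	api := slack.New(os.Getenv("SLACK_BOT_TOKEN"))
	err := notifyDutyEngineer(api, "C0123456789",
		"CI pipeline is stuck on the deploy stage",
		"https://example.slack.com/archives/C0123456789/p1700000000000000")
	if err != nil {
		log.Fatal(err)
	}
}
```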

Is the new duty process working well?

After introducing the new duty process, we analyzed its effectiveness and identified the following problems:

  • the automation did not register requests;

  • it was unclear how to hand requests over between duty shifts;

  • once again, customer expectations did not match reality. For example, a developer could show up on Sunday and ask for help, and nobody would answer.

To address these issues, we first improved the bot by integrating it with our ticketing systems, Jira Service Management and YouTrack. Requests for the duty engineer came in from Jira, and planned tasks were registered in YouTrack. We also wrote regulations for working with tasks, containing ten rules. One of them: the duty engineer responds only to requests from the bot in which the customer has set a priority. When a customer created a task in Slack, buttons appeared: “incident” or “scheduled.”
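
A minimal sketch of how the bot might route those button presses, in Go with only the standard library; the endpoint path, payload shape, and the Jira/YouTrack stubs are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// slackPayload is a minimal slice of Slack's interactive-message payload;
// only the fields this sketch needs are modeled.
type slackPayload struct {
	Actions []struct {
		Value string `json:"value"` // "incident" or "scheduled"
	} `json:"actions"`
	Message struct {
		Text string `json:"text"`
	} `json:"message"`
}

// handleButton routes a request to the duty queue or the planned backlog,
// depending on which button the customer pressed.
func handleButton(w http.ResponseWriter, r *http.Request) {
	var p slackPayload
	if err := json.Unmarshal([]byte(r.FormValue("payload")), &p); err != nil || len(p.Actions) == 0 {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	switch p.Actions[0].Value {
	case "incident":
		createJiraRequest(p.Message.Text) // goes to the duty engineer via Jira
	case "scheduled":
		createYouTrackIssue(p.Message.Text) // registered as a planned task
	}
	w.WriteHeader(http.StatusOK)
}

// Stubs: a real bot would call the Jira and YouTrack REST APIs here.
func createJiraRequest(text string)   { log.Println("Jira request:", text) }
func createYouTrackIssue(text string) { log.Println("YouTrack issue:", text) }

func main() {
	http.HandleFunc("/slack/actions", handleButton)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```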

Another important rule: requests must be solved one at a time, from oldest to newest, so that tasks are not lost. We also wrote internal regulations for engineers:

  • a duty shift runs daily from 09:00 to 09:00 the next day;

  • one engineer should not work more than two shifts in a row;

  • the duty engineer is excused from meetings unrelated to incidents.

To close the gap between developer expectations and the actual work of duty engineers on requests, we wrote external regulations with the following conditions:

  • on weekdays from 09:00 to 18:00 (Moscow time), the duty engineer resolves requests that come from the bot in Slack;

  • on weekdays from 18:00 to 09:00 and on weekends, the duty engineer works only on incidents in the production environment;

  • at night, the duty engineer does not dig for the root cause, but only mitigates the incident’s impact;

  • duty covers all environments, both prod and dev.

These process improvements solved the problem of handing requests over between shifts: when a duty engineer had an hour left on a shift, they passed their open tasks to the next duty engineer. We also began counting requests and sorting them into categories. Duty was no longer a chaotic set of actions but a clear, consistent process.

What else could go wrong?

The company kept growing, and the number of requests grew with it: we were receiving 500 tasks per month. The DevOps team was growing too, and new engineers needed to be immersed in our infrastructure and processes, so knowledge sharing became essential.

Customers also raised their demands on the quality and speed of request resolution. On top of that, tasks outside our area of responsibility started landing on us: VPN problems, for example, are the job of an entirely different department.

Answering the new challenges

To resolve these issues we did the following:

  • To share knowledge, we created a Slack channel for duty engineers, where an engineer can find answers in the history or ask colleagues who are online;

  • To meet customer requirements for quality and speed, we introduced SLA targets: response to a problem within 2 hours, solution within 1 hour;

  • To route requests correctly according to areas of responsibility, we modified the bot: a button in Slack now sends the request to the first line, where specialists can dispatch it properly.

  • To track engineer efficiency, we introduced the following metrics (see the sketch after this list):

    • time spent resolving a request;

    • number of times a task is returned to the duty engineer;

    • time taken by requesters to verify the solution.
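
For illustration, this is roughly how those three metrics could be declared with the prometheus/client_golang library; the metric names, buckets, and example observations are assumptions, not our actual instrumentation.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Time spent resolving a duty request.
	resolutionSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "duty_request_resolution_seconds",
		Help:    "Time spent resolving a duty request.",
		Buckets: prometheus.ExponentialBuckets(60, 2, 12), // from 1 min up to ~2 days
	})
	// How many times requests bounced back to the duty engineer.
	returnsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "duty_request_returns_total",
		Help: "Returns of requests to the duty engineer within one task.",
	})
	// How long requesters take to verify a proposed solution.
	verificationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "duty_solution_verification_seconds",
		Help:    "Time requesters take to verify the solution.",
		Buckets: prometheus.ExponentialBuckets(60, 2, 12),
	})
)

func main() {
	prometheus.MustRegister(resolutionSeconds, returnsTotal, verificationSeconds)

	// Example observations; in the real bot these would fire on status changes.
	resolutionSeconds.Observe(3600) // a request resolved in one hour
	returnsTotal.Inc()              // a task came back to the duty engineer

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9101", nil))
}
```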

Number of created and solved tasks by day

SLA for solution

Time to resolve requests

Time to respond and resolve requests

Now it takes us an hour to solve a problem, and response time is under an hour.

The latest improvements to the process gave us the following results:

  • we can now commit to a specific response time for requests during a duty shift;

  • we have verified metrics on duty-shift requests;

  • we understand what kinds of problems people bring us and what can be automated;

  • we clearly see our growth points;

  • we know at what point we need to add duty engineers so that SLA targets don’t suffer.

We take planned tasks back from the outsourcer

To take over planned tasks, we needed to answer the following questions:

  • how to collect requirements from developers;

  • how five engineers can serve 20 development teams (200+ people) with different stacks;

  • how to untangle the zoo of technologies in pipelines and in build and delivery tools.

We introduced two-week sprints starting on Wednesdays. Each sprint includes two calls with customers: on the first we discuss task priorities, and on the second we report on sprint progress.

We kept the stack the outsourced team had used, and declined to support or develop the tools the development teams had piled up.

Next, we needed to collect feedback on our work. And here we ran into unexpected obstacles.

Funny stories about collecting feedback

To collect feedback, we had to answer the following questions:

  • What should we ask?

  • Whom should we ask?

  • What survey format should we use?

Our team lead developed a detailed survey covering all our cases. We decided to survey heads of departments and directorates, reasoning that leaders are the primary source of goals, and planned tasks, as a rule, grow out of goals. The survey was built in Google Forms.

The survey asked respondents to rate the following on a scale from 1 to 10:

  • the speed of completing your planned tasks;

  • the quality of completing your planned tasks;

  • how well you are kept informed about the status of your tasks.

There was also a free-form question about what the respondent would like to improve in the DevOps team’s work.

We made a survey; now everyone will fill it in and tell us everything

We sent out the survey, set a deadline, and waited for feedback, periodically reminding people to complete it. But by the deadline, only two people had filled it in. No analysis was possible on such a sample: asked about the speed of completing tasks, one respondent gave a 2 and the other a 10.

Second attempt at collecting feedback

We threw out Google Forms, armed ourselves with pen and paper, and went to the team leads in person, scheduling online meetings with those who work remotely. The conversations did not make anything clearer: the requirements we received lacked specifics, along the lines of “make it good” and “do it yesterday.”

The survey experiment failed

The evolution of feedback collection

We closed the survey topic, which we had persistently pursued for five months. During that time, the DevOps department tripled in size, so we identified three areas and assigned a separate DevOps team to each:

  • the cluster track: seniors and leads;

  • the test-bench track: middles;

  • the CI/CD track: middles and juniors.

We decided that if collecting feedback from customers didn’t work, we would collect data from statistics instead. The first statistic we gathered was the time a task spends in each status. For measurement, we split statuses into two groups: those where we influence the time in status and those where we don’t. We wrote an exporter in Go that pulled the data from the YouTrack API; the results were stored in VictoriaMetrics and displayed in Grafana.
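
A simplified sketch of what such an exporter might look like: the YouTrack response is flattened, and the metric layout and status split are assumptions, not our actual code. VictoriaMetrics then scrapes the standard /metrics endpoint like any Prometheus target, and Grafana charts the series.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauge of how long each task has been sitting in its current status,
// labeled by whether we influence that time or not.
var timeInStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "youtrack_time_in_status_seconds",
		Help: "How long a task has spent in its current status.",
	},
	[]string{"task", "status", "influence"},
)

// Statuses whose duration depends on us (an illustrative split).
var weInfluence = map[string]bool{
	"In Progress": true,
	"Code Review": true,
}

// task is a simplified projection of a YouTrack issue; the real API
// reports statuses via custom fields, flattened here for brevity.
type task struct {
	ID          string `json:"id"`
	Status      string `json:"status"`
	StatusSince int64  `json:"statusSince"` // unix ms, hypothetical field
}

func scrape(baseURL, token string) {
	req, _ := http.NewRequest("GET", baseURL+"/api/issues", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Println("YouTrack scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	var tasks []task
	if err := json.NewDecoder(resp.Body).Decode(&tasks); err != nil {
		log.Println("decode failed:", err)
		return
	}
	for _, t := range tasks {
		influence := "external"
		if weInfluence[t.Status] {
			influence = "ours"
		}
		age := time.Since(time.UnixMilli(t.StatusSince)).Seconds()
		timeInStatus.WithLabelValues(t.ID, t.Status, influence).Set(age)
	}
}

func main() {
	prometheus.MustRegister(timeInStatus)
	go func() {
		for {
			scrape(os.Getenv("YOUTRACK_URL"), os.Getenv("YOUTRACK_TOKEN"))
			time.Sleep(time.Minute)
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```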

To make proper use of the collected data, we set thresholds for operations and color-coded the graphs. This time we got a true picture of our work, and it did not make us happy: all the graphs were red.

The requests yielded the following indicators:

  • from “request created” to “estimated”: 30 days at the 50th percentile;

  • from “in progress” to “blocked on the client”: 8.77 days at the 50th percentile and 18.3 days at the 99th;

  • from task creation to “completed”: 47.5 days at the 50th percentile and 354 days at the 99th;

  • from “in progress” to “awaiting review”: 7 days at the 50th percentile and 14 days at the 99th.

Working on indicators in the red zone

First of all, we checked how tasks were being estimated. It turned out that engineers in the CI/CD track, staffed by middles and juniors, were producing inaccurate estimates. We also found a problem in how tasks were worded: as before, developers were filing abstract tasks, for example, “deploy it quickly, efficiently, with Postgres, and across 3 DCs.”

Two more problems were incorrect task statuses and a large number of requests stuck in Code Review status, i.e., the work is done, but the requester must verify it before the task can be closed. As you know, everyone loves to create, but few like to verify. As a result, most of our tasks sat in the verification status without action for a long time.

To solve these problems, we took the following measures:

  • We added detail to the statuses:

    • Waiting on the data-center department;

    • Waiting for testing by development – we are waiting for verification from the developers;

    • Waiting for a response – we are waiting for answers to our questions.

  • We took control of Code Review: every day we personally remind people which tasks they need to check.

  • We limited the number of tasks allowed in the Code Review state: if a customer has two tasks awaiting verification, their new requests are not accepted until those checks are done (a sketch of such a check follows this list).
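
For illustration, a minimal sketch of how such a limit could be checked against the YouTrack REST API; the state name, field list, and reviewLimit constant are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

const reviewLimit = 2 // max tasks per customer awaiting their verification

// countInReview asks YouTrack how many issues a reporter has in Code Review.
// The query follows YouTrack's search syntax; the exact state name and
// field set are assumptions for this sketch.
func countInReview(baseURL, token, reporter string) (int, error) {
	q := url.QueryEscape(fmt.Sprintf("State: {Code Review} reporter: %s", reporter))
	req, err := http.NewRequest("GET", baseURL+"/api/issues?fields=id&query="+q, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var issues []struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&issues); err != nil {
		return 0, err
	}
	return len(issues), nil
}

func main() {
	n, err := countInReview(os.Getenv("YOUTRACK_URL"), os.Getenv("YOUTRACK_TOKEN"), "some.customer")
	if err != nil {
		panic(err)
	}
	if n >= reviewLimit {
		fmt.Println("limit reached: new requests go on hold until the reviews are done")
	}
}
```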

Summary metrics

Implementing these latest improvements gave us the following metrics and process changes:

What has changed in the processes

  • We collect tasks into a sprint strictly up to a cut-off: if customers want their tasks included in the sprint, the tasks must be ready by Friday.

  • We estimate the collected tasks the day before the sprint starts, so that customers never come to prioritization with an unestimated task.

  • We do preparatory work immediately rather than waiting for planning.

  • We hold team syncs twice a day. This lets the team lead compare the tasks and goals set at the start of the day against where things stand at the end of it, which shortens the feedback loop and allows quick decisions.

  • We forbade changing sprints on the fly.

  • We strictly fixed the sprint schedule.

  • We kept the option of submitting a request without an estimate, but by default such a request is assigned an estimate of the entire sprint.

What was automated

  • For Kubernetes and for virtual machines, we introduced t-shirt sizing of allocated resources (s, m, l, xl, xxl). Now teams easily understand each other when sizing machines (see the sketch after this list).

  • We packaged everything on our technology radar into Ansible roles.

  • We built a playbook generator. It turns Ansible roles into playbooks, letting you deploy with just four button clicks.

  • We created deployment templates in Terraform, so only configuration files remain in the service repositories.

  • We use CDK for Terraform to hand developers a YAML-based configuration. This reins in developers’ imagination so they can’t build anything beyond the YAML.
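
To illustrate the t-shirt idea, here is a minimal sketch in Go; the specific CPU and memory values per size are assumptions, not our actual catalog. The point is not the exact numbers but having one shared vocabulary for both Kubernetes and VMs.

```go
package main

import "fmt"

// Size is a t-shirt label for an allocation of compute resources.
type Size string

const (
	S   Size = "s"
	M   Size = "m"
	L   Size = "l"
	XL  Size = "xl"
	XXL Size = "xxl"
)

// Resources pairs CPU cores with memory in GiB.
type Resources struct {
	CPUCores int
	MemGiB   int
}

// catalog maps sizes to concrete resources; the values are illustrative.
var catalog = map[Size]Resources{
	S:   {CPUCores: 1, MemGiB: 2},
	M:   {CPUCores: 2, MemGiB: 4},
	L:   {CPUCores: 4, MemGiB: 8},
	XL:  {CPUCores: 8, MemGiB: 16},
	XXL: {CPUCores: 16, MemGiB: 32},
}

func main() {
	// A request for an "l" machine resolves to the same concrete spec
	// whether it lands in Kubernetes or on a virtual machine.
	r := catalog[L]
	fmt.Printf("size l => %d cores, %d GiB RAM\n", r.CPUCores, r.MemGiB)
}
```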

Results of the implementation of new processes

  • from request creation to estimate: 20 hours instead of 30 days;

  • from estimate to implementation: 20 hours instead of 45 days;

  • Code Review: 3.5 hours instead of 7 days.

Based on my experience, I would like to offer the following advice:

  1. Automate interaction with your team. Otherwise you will have to:

  • explain to every customer, every time, all the intricacies of working with the DevOps department – “go there,” “click here,” “do this”;

  • police the customer’s compliance with every part of the process;

  • fix errors in the tasks the customer files.

Now our customer only needs to press one button to deploy a service. Pressing it automatically creates four tasks in DevOps, which the corresponding engineers pick up (a sketch of such task creation appears after this list).

  2. Surveys only work in conjunction with statistics that you collect yourself.

  3. Measure every stage of work on tasks.

  4. Focus on improving the stages you directly influence.

  5. Maximally decomposed tasks help keep estimates accurate. For example, if a task is estimated at 10 hours, it is not decomposed enough; our maximum task in a sprint is now 6 hours.

  6. Automation is the key to lowering estimates: it removes the human factor, repetition, and the chance of error, and it simplifies our colleagues’ lives.
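
To illustrate the one-button fan-out from the first piece of advice, here is a minimal sketch that files tasks through YouTrack's REST API; the project ID and task summaries are invented for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// newIssue is a minimal payload for YouTrack's POST /api/issues:
// a project reference plus a summary.
type newIssue struct {
	Project struct {
		ID string `json:"id"`
	} `json:"project"`
	Summary string `json:"summary"`
}

// createIssue files one task in YouTrack.
func createIssue(baseURL, token, projectID, summary string) error {
	var payload newIssue
	payload.Project.ID = projectID
	payload.Summary = summary
	body, _ := json.Marshal(payload)

	req, err := http.NewRequest("POST", baseURL+"/api/issues", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("YouTrack returned %s", resp.Status)
	}
	return nil
}

func main() {
	base, token := os.Getenv("YOUTRACK_URL"), os.Getenv("YOUTRACK_TOKEN")
	// One button press fans out into four tasks; these summaries are invented.
	for _, s := range []string{
		"Create repository and CI pipeline",
		"Provision test bench",
		"Prepare Terraform deployment template",
		"Set up monitoring and alerts",
	} {
		if err := createIssue(base, token, "0-0", s); err != nil {
			panic(err)
		}
	}
}
```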
