How to assemble and launch a department for handling production incidents: a step-by-step guide

A bit of theory: L1 is the first line of support, a service that monitors client resources and answers technical questions that do not require specialized knowledge. When a site or application stops working properly, L1 is the first to respond to the incident and, if necessary, escalates to the next support lines: L2 or L3.

How incidents are usually handled

I think most competent agencies and production studios set up monitoring for their clients' resources after development. The purpose of monitoring is to track whether the resource is working and to notify the project team and the client about incidents.

Most often, monitoring is handled by project managers. We worked that way for a long time too. But this approach has its drawbacks. Count them on your fingers.

Managers cannot always respond promptly

We are not even talking about cases when resources go down at night and someone has to bring the site back up. Even during working hours, a project manager may be in a meeting with other clients or with the team. As a result, it is not always possible to contact the client right away or to get DevOps started on finding the cause of the error and fixing it.

Incident resolution takes time, and other management tasks suffer

Handling an incident means organizing the team's work to eliminate the error while staying in touch with the client. Yet some incidents are insignificant: for example, the resource becomes unavailable for five seconds and then recovers on its own. I am sure many have encountered this. According to our internal statistics, about 70% of incidents were of this kind. Even so, notifications went out to both the client and the team. Every false alarm ultimately costs employees time and the company money.

Project managers spend time reporting on each incident

Ideally, a manager should write a report for each incident, describing in detail the problem itself, its cause, the actions the team took, and other details. This applies even to false alarms. A report takes time: from an hour to almost a full working day in complex cases. The project manager accumulates a backlog, the report is delayed, and the client is dissatisfied.

There are probably other issues that we haven't mentioned. Feel free to add to the list in the comments. Below are instructions for creating an L1 department that takes this work off the managers.

How to create an L1 department and what it should be responsible for

Here is a guide to creating a department based on our experience:

  • We start by developing the department's work algorithm, from receiving a client's request to preparing documentation. It should include a workflow. Below is the basic scheme that works for us.

  • Then, taking into account the number of projects and the incident-handling workload, we calculate how many people the department needs and what the preliminary work schedule will be. We chose a 2/2 schedule (two days on, two days off), as it best suited the company's conditions. With such a schedule the system is more predictable, and as a result management spends less effort on coordination within the company.

  • Next, we calculate the budget for the department and start looking for people. Here is an example of the job responsibilities of an L1 specialist. You can reuse it in your own job postings.

I am separately attaching our table for calculating the hourly schedule, from which we derived the required number of people.
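
As a rough illustration of the kind of arithmetic such a table encodes (this is a sketch, not our actual spreadsheet), here is a minimal headcount estimate for a 2/2 rotation. The function name, the 12-hour shift length, and the `absence_rate` reserve for vacations and sick leave are all assumptions made up for this example.

```python
import math


def staff_needed(coverage_hours_per_day: int,
                 shift_hours: int,
                 seats_per_shift: int,
                 absence_rate: float = 0.0) -> int:
    """Estimate headcount for a 2/2 rotation (illustrative assumptions only)."""
    # Shifts required to cover one day, e.g. 24 h of coverage / 12 h shifts = 2.
    shifts_per_day = math.ceil(coverage_hours_per_day / shift_hours)
    # On a 2/2 rotation each person works half of the calendar days,
    # so every seat in every shift needs two people who alternate.
    base = shifts_per_day * seats_per_shift * 2
    # Hypothetical reserve for vacations and sick leave.
    return math.ceil(base / (1.0 - absence_rate))
```

Under these assumptions, `staff_needed(24, 12, 1)` gives 4 people for one round-the-clock seat, and adding a 15% absence reserve raises the estimate to 5. Real numbers depend on how many incidents one specialist can handle in parallel, which the sketch hides inside `seats_per_shift`.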

Optimization of processes related to incident handling

Again, everyone has their own processes, but when we launched the department, we took the following steps:

  • restructured the order of response and interaction with the client;

  • created a system to warn of potential problems;

  • removed all incident communication and report preparation from managers and handed it over to department employees;

  • took on customer requests regarding the technical condition of websites;

  • set up monitoring of SSL certificates and client domains.
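
The last item can be partly automated. Below is a minimal sketch of a certificate expiry check using only the Python standard library; it is not the tooling we use, and the function names and the 14-day warning threshold are assumptions for illustration. Domain registration expiry is not covered here, since it requires a separate data source.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter field, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now_ts: float) -> int:
    """Whole days left between now_ts (Unix time) and the notAfter date."""
    expires_ts = ssl.cert_time_to_seconds(not_after)
    return int((expires_ts - now_ts) // 86400)


def needs_renewal(not_after: str, now_ts: float, warn_days: int = 14) -> bool:
    """True when the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now_ts) <= warn_days
```

A scheduled job could call `needs_renewal(fetch_not_after(host), time.time())` for each client host and send the result to the L1 department. Keeping the date arithmetic separate from the network call makes the check easy to test without touching a real server.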

One more thing: for short-term outages (less than 5 seconds), it is better to configure the system so that notifications go only to the L1 department. At this stage, the department determines the cause of the failure. If the alarm turns out not to be false, the incident is handed over to the team.

If the failure is more serious, in the first minutes we write to the project manager or DevOps. Only after that do we go to the client and report that we see the problem and are already working on it. Along the way, we answer the client's questions and update the work status.
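
The routing rule described above can be sketched as a small function. This is an illustration under our own simplifying assumptions, not a real alerting configuration: the `Recipient` names and the `confirmed_by_l1` flag are hypothetical, and only the 5-second threshold comes from the text.

```python
from enum import Enum
from typing import List


class Recipient(Enum):
    L1 = "L1 department"
    TEAM = "project manager / DevOps"
    CLIENT = "client"


SHORT_OUTAGE_SECONDS = 5  # threshold mentioned in the article


def notification_order(outage_seconds: float,
                       confirmed_by_l1: bool = False) -> List[Recipient]:
    """Return who is notified, in order, for an outage of the given length."""
    if outage_seconds < SHORT_OUTAGE_SECONDS:
        # Short outage: only L1 sees it and checks whether the alarm is false.
        if confirmed_by_l1:
            # Not a false alarm: hand it over to the team.
            return [Recipient.L1, Recipient.TEAM]
        return [Recipient.L1]
    # Serious failure: the team first, and only then the client.
    return [Recipient.L1, Recipient.TEAM, Recipient.CLIENT]
```

The point of returning an ordered list is that the order itself is part of the rule: the client hears from us only after the team is already engaged.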

Project workflow

Work with projects can be structured differently. For example, a manager comes and says: “I need technical support for a project. I want to set up monitoring so that you can watch server metrics, answer questions about availability, and respond to incidents.”

Others ask us simply to monitor whether the resource is working and to keep weekly statistics on outages, without working through such cases ourselves. So everything here depends on the needs of a specific business and its teams.
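
For the statistics-only format, the weekly numbers can be produced with a few lines of code. A minimal sketch, assuming outages are already available as a list of dates (the function name and input shape are hypothetical):

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def outages_per_week(outage_dates: Iterable[date]) -> Dict[Tuple[int, int], int]:
    """Count outages per ISO week, keyed by (ISO year, ISO week number)."""
    counts: Counter = Counter()
    for day in outage_dates:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)
```

Grouping by ISO year and week, rather than by week number alone, keeps weeks around New Year from being merged across years.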

However, no matter what framework you choose, there is one more important artifact you will need when setting up your department – an escalation map for each client.

It should contain a step-by-step algorithm of actions with a specific resource: who we contact, when we write to DevOps, how we set tasks for analyzing the incident, what we do if the incident occurs outside of working hours, and so on.

This is just a piece of the map, a visualization of stage 1. The full map in PDF can be taken from the link on the Disk.
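
The map itself is a diagram, but the same information can also be kept in a machine-readable form. Below is a minimal sketch of one possible structure; the field names, contacts, timings, and working hours are all invented for illustration and do not reproduce our actual map.

```python
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was detected
    contact: str        # who we contact at this step
    action: str         # what we do or ask for


@dataclass(frozen=True)
class EscalationMap:
    client: str
    work_start: time
    work_end: time
    working_hours: List[EscalationStep]
    off_hours: List[EscalationStep]

    def steps_due(self, detected_at: time, minutes_elapsed: int) -> List[EscalationStep]:
        """Steps that should already have been taken for this incident."""
        in_hours = self.work_start <= detected_at < self.work_end
        steps = self.working_hours if in_hours else self.off_hours
        return [s for s in steps if s.after_minutes <= minutes_elapsed]


# Hypothetical example map for a single client.
EXAMPLE_MAP = EscalationMap(
    client="Example Client",
    work_start=time(9, 0),
    work_end=time(18, 0),
    working_hours=[
        EscalationStep(0, "project manager", "report the incident in the project chat"),
        EscalationStep(10, "DevOps", "create a task to analyze the incident"),
    ],
    off_hours=[
        EscalationStep(0, "on-call DevOps", "call by phone"),
        EscalationStep(15, "project manager", "leave a summary for the morning"),
    ],
)
```

With a structure like this, an L1 specialist (or a script) can ask which steps are due, for example `EXAMPLE_MAP.steps_due(time(23, 0), 20)` for an incident detected at night 20 minutes ago.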

What difficulties arise in the department's work

The main difficulties are related to the fact that employees are not always able to assess the nature and scale of an incident: if the problem is atypical, there is a temptation to immediately run to DevOps or team leaders and ask for help. This creates a certain risk that the hours of expensive specialists will be wasted.

So far we see only two main ways to solve the problem:

  1. Improve the technical literacy of specialists.

  2. Introduce regulations that describe a variety of possible situations.

Conclusion

Every outsourcing contractor has clients it has worked with for many years: you have closed a bunch of projects together, the processes are established, and everything runs as usual. Then one day the client's website or application goes down because of some incident. Despite years of established relationships, a couple of such cases can cost you the client. L1 exists precisely to prevent situations where something as small as an alert notification costs you a large contract.

Thank you for reading the article to the end. If you have a similar department in your company, share your experience in the comments. I wish you all cool projects and fewer alerts.
