Problem Management, or how to turn problems into opportunities
I would like to share my experience as someone directly involved in creating, shaping and developing a complex and very important problem management process at X5 Tech. I hope our case will be useful to everyone who is developing Problem Management in their own company, as well as to those who are only planning to create such a function or are weighing whether it is worth implementing.
First, let's define the terms:
Problem Management is the practice of searching for the root causes of incidents that occur in IT services, working with those causes and managing their life cycle.
A problem is the cause of one or more incidents. Until it is eliminated, the incidents will keep occurring. We also use the word for a set of incidents whose cause is unknown but whose effect is obvious.
A little background
Problem Management has existed at X5 for a long time but, so to speak, in its infancy. There were several attempts to develop it fully, and as a result the area accumulated documentation, related processes and functions. However, coverage never exceeded 3% of all incoming incidents, so let’s be frank: until recently Problem Management at X5 was an optional process.
Different people accumulated fragmentary data on the causes of problems in Excel, notepads and presentations. A huge number of urgent problems were eliminated as ordinary bug fixes, and after each major outage a plan was drawn up to prevent a repeat. But in reality there was no problem management to speak of: the solution to an incident stayed within the area of responsibility of a specific group of specialists, no one else knew about it, and it very rarely left a digital trace.
As a result, we had “elusive” problems that drifted from one area to another over and over again, and it was almost impossible to catch and eliminate them.
Problem Management was in this state from 2017 until last year. Only two managers dealt with it as best they could, while also performing the other tasks of their department. The need for a centralized approach to the process was no longer just in the air; it had become vital for the company.
It became clear that this could not continue: repeated incidents were starting to cost us too much, and effective solutions for eliminating them required statistical data. We also realized that we could not keep expanding our staff to handle the ever-growing number of incidents, which rises automatically as the number of our stores increases. Therefore, in October 2023, the X5 Tech IT support department took the process seriously: we decided to overhaul it completely and implement Problem Management everywhere.
How it was implemented
To begin with, we took 2-3 support groups in different divisions of our department and gave them one task: to link incidents to specific problems and record those problems. The goal was to link at least 60% of all incidents handled by these groups.
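The coverage target can be expressed as a simple ratio. Below is a minimal Python sketch, assuming each incident is a record with an optional `problem_id` field. The field names and the sample data are hypothetical and are not taken from our ITSM system.

```python
def linkage_coverage(incidents: list[dict]) -> float:
    """Return the share of incidents (0..1) that are linked to some problem."""
    if not incidents:
        return 0.0
    linked = sum(1 for inc in incidents if inc.get("problem_id") is not None)
    return linked / len(incidents)


# Hypothetical sample: three of the five incidents are linked to a problem.
sample = [
    {"id": "INC-1", "problem_id": "PRB-10"},
    {"id": "INC-2", "problem_id": None},
    {"id": "INC-3", "problem_id": "PRB-10"},
    {"id": "INC-4", "problem_id": "PRB-11"},
    {"id": "INC-5", "problem_id": None},
]

coverage = linkage_coverage(sample)
target_met = coverage >= 0.60  # the pilot goal: at least 60% of incidents linked
print(f"coverage: {coverage:.0%}, target met: {target_met}")
```

Counting per support group rather than over the whole department would only require filtering the list before calling the function.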
It is worth explaining separately how an incident is linked to a problem.
If the incident is simple, there is a standard solution for it, and it is easy to attribute it to a specific problem; there are no questions here. But there are more complex incidents that are not so easy to classify right away. These get turned over and over, broken down into components, compared first with one problem, then with another. In short, it turns into a whole detective story.
For example, a store reports that goods cannot be weighed at a particular checkout or that the cash register keyboard does not work. Here everything is clear and there is no room for ambiguity. The incident is easy to identify correctly, resolve and link to a problem.
The second example is a request saying that a specific product cannot be sold at the checkout. Here everything is more complicated. There are many possible causes: from a poorly readable barcode to connection problems between the cash register and the server, or a product card that was not uploaded. This requires a more detailed diagnosis, often involving several support groups, and linking to a problem is possible only once the cause of the request has been established.
Or, for example, a store complains that order data arrives late. There are also many options here: from the same communication problems to an upload task that finished late.
After some time, the groups began to meet twice a week to discuss pressing issues: how convenient the existing ITSM system (the system for working with incidents, problems, configurations, etc.) is for our purposes, whether everyone understands the principles of creating problems and linking incidents, and how much effort the task takes. It turned out that linking an incident to a problem took about five minutes, the same as the average time to resolve the incident! A lot, of course.
So the teams collected all the comments on the system and passed improvement tasks to the development team. Two months later these were implemented.
For example, the “Open a problem” tab began to open faster, and the list of problems in it is now displayed in a convenient format, which reduced the time needed to find a specific problem. The problem card also gained a field showing the number of incidents that depend on a mass incident. And there were many other “goodies” that made the specialists' work more convenient.
Overall, the results were quite consistent with expectations, so we decided to start linking incidents to problems in the other IT support departments as well.
How to deal with the problem
Clearly, linking incidents to a problem cannot be an end in itself. Now I will describe how we work, and plan to work, with incidents in the new paradigm.
It's simple. The support specialist receives the incident and closes it with a temporary solution (puts out the fire, so to speak). A simple incident is resolved with a standard solution; a more complex one requires a more thorough search. Next, the incident is linked to a problem, and several roles work on that problem:
Coordinator – based on data on similar incidents, identifies a problem, formulates its name and description and registers a card; describes the workaround; initiates initial diagnosis; coordinates all work aimed at solving the problem.
Support staff – link incidents to problems; apply the temporary solutions described in the problems.
Analyst – verifies problem cards; prepares statistical reporting on the problems with the greatest impact.
Technical owner and technical experts – a) look for the best temporary solution; b) find the root cause of the problem and propose a fix.
For the future, the plan for working with a problem is more ambitious. We expect that a problem will be created and supplemented automatically, based on the linked incidents, with minimal labor costs. Analysts and technical experts will then evaluate whether it is possible to: a) hand the temporary solution over to the first line of support; b) monitor the occurrence of incidents caused by the problem; c) automate the temporary solution.
Known error
Known errors and how we work with them deserve a separate mention.
What is a known error? It is a problem that we know about but, for some reason, do not eliminate. Why? There may be several reasons, but they all ultimately come down to one thing: fixing it would cost the company more than owning the problem.
However, not every problem can be declared a known error; otherwise employees would be strongly tempted to stop looking for the root causes of incidents and simply give the problem a new status, known error.
That is why the Problem Management Committee (or Problem Committee) was organized, which meets regularly – once every two weeks.
The process looks like this. A coordinator who sees a potential known error must first make sure that all other options have been exhausted:
the solution cannot be automated and transferred to the first line;
the incident caused by the problem cannot be caught in a timely manner through monitoring.
To do this, the coordinator obtains approval from the heads of first-line support, operational support, and support for corporate services and analytics. With the approvals in hand, the coordinator cheerfully goes to the committee and explains why this or that problem should be recognized as a known error.
By the way, the story does not end there. Known errors are regularly rechecked: what if an opportunity has appeared to eliminate one and save the company money? For example, the vendor has resumed support in our country and is ready to fix everything without damaging X5’s budget.
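The logic of this section can be summarized as two preconditions plus a cost comparison. Here is a minimal Python sketch of that idea. The class, the field names, the comparison horizon and the figures are all hypothetical, and in practice the decision is made by people at the committee, not by a script.

```python
from dataclasses import dataclass


@dataclass
class KnownErrorCandidate:
    problem_id: str
    can_automate_or_move_to_first_line: bool
    can_detect_via_monitoring: bool
    fix_cost: float               # estimated one-off cost of eliminating the root cause
    yearly_ownership_cost: float  # estimated yearly cost of living with the problem
    horizon_years: float = 1.0    # assumed period over which the costs are compared


def may_go_to_committee(c: KnownErrorCandidate) -> bool:
    """True if both preconditions hold and the fix costs more than ownership."""
    if c.can_automate_or_move_to_first_line or c.can_detect_via_monitoring:
        return False  # a cheaper measure is still available, so try it first
    return c.fix_cost > c.yearly_ownership_cost * c.horizon_years


candidate = KnownErrorCandidate(
    problem_id="PRB-42",
    can_automate_or_move_to_first_line=False,
    can_detect_via_monitoring=False,
    fix_cost=5_000_000.0,
    yearly_ownership_cost=600_000.0,
    horizon_years=3.0,
)
print(may_go_to_committee(candidate))
```

Rechecking a known error then amounts to re-running the same comparison with updated estimates, for example after a vendor lowers the cost of the fix.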
Adding value to the process, or problem analytics
The work of problem analysts is special and should be discussed separately.
I should note that our staffing setup is not entirely standard, at least for us. The Problem Analytics area is located in my department. It currently includes a team leader and two employees. At the same time, we now have 8 people working in this role on the second line of support!
How did this happen? At some point we realized that with the existing team we could not cover all the problems in all areas fully and with sufficient quality. After discussing this with fellow managers of related departments, we decided to try to develop and expand this area by creating similar roles in their departments, with everyone working to the same standards and under a single team leader.
Thus, we have dedicated roles in all areas, including support experts (people who take part in project activities and start working on support and Problem Management at the earliest stages, in fact even before the project/product is handed over to support). Structurally they belong to their own departments, but functionally they work under the head of the problem analyst team in my department. This is how labor franchising works =)
Now let's figure out who exactly these problem analysts are.
Problem analysts are the people who, among other things, look at the entire pool of problems, identify the most critical ones and bring them to the attention of the departments concerned, and also prepare cross-sectional data extracts to show colleagues what they should pay attention to. For example, reporting for business units helps them prioritize tasks and allocate budget for the necessary improvements.
The work of analysts creates additional value in the Problem Management process: they automate solutions (more on this below) and help reduce the time required to resolve problems. Thanks to these measures, the cost of owning a problem goes down and its negative consequences become smaller overall.
Now about the main directions of their work.
1. Verification of problem cards
Thanks to regular verification, we can be sure that:
key fields are filled in correctly for all problems (name, description, status, etc.);
a workaround has been found and described;
the main tasks that reduce labor and support costs have been completed or are in progress (these are tasks to automate and robotize incident resolution, and to transfer incidents to the first line of support).
What does this ultimately give us? Effective Problem Management! Judge for yourself:
We can correctly prioritize the tasks for solving problems. As one fairy tale put it: “A king cannot think about everyone – a king must think about what is important.” In other words, problems with the greatest impact should be addressed first.
We have comprehensive, reliable statistics that show how many problems have no automated incident resolution; which problems generate the most new incidents per week; for how many problems incident resolution cannot be transferred to the first line (which would reduce its cost); and so on.
We can assess in good time whether problems need to be decomposed, which in turn affects how quickly root causes are identified and therefore the overall lifespan of the problems.
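Statistics of this kind are straightforward to compute once the cards are verified. A minimal Python sketch follows, assuming incidents carry a `problem_id` and a `registered` date and problem cards carry an `automated` flag. These field names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def top_problems_by_new_incidents(incidents, today, days=7, limit=3):
    """Rank problems by the number of linked incidents registered in the last `days` days."""
    since = today - timedelta(days=days)
    counts = Counter(
        inc["problem_id"]
        for inc in incidents
        if inc["problem_id"] is not None and since < inc["registered"] <= today
    )
    return counts.most_common(limit)


def count_without_automation(problems):
    """Count the problems whose incident resolution is not automated."""
    return sum(1 for p in problems if not p["automated"])


# Hypothetical sample data.
today = date(2024, 3, 15)
incidents = [
    {"problem_id": "PRB-1", "registered": date(2024, 3, 14)},
    {"problem_id": "PRB-1", "registered": date(2024, 3, 13)},
    {"problem_id": "PRB-1", "registered": date(2024, 3, 10)},
    {"problem_id": "PRB-2", "registered": date(2024, 3, 12)},
    {"problem_id": "PRB-1", "registered": date(2024, 3, 1)},   # older than a week
    {"problem_id": None, "registered": date(2024, 3, 14)},     # not linked
]
problems = [
    {"id": "PRB-1", "automated": False},
    {"id": "PRB-2", "automated": True},
]

top = top_problems_by_new_incidents(incidents, today)
not_automated = count_without_automation(problems)
print(top, not_automated)
```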
2. Reporting
This is the main direction for analysts. It can be roughly divided into two streams.
The first is regular reports for customers. They are designed so that colleagues, despite their heavy workload and crazy schedules, can get up-to-date information on the status of problems in their departments relatively quickly and easily:
For business units. Here analysts play a supporting role, because the development team manages neither the budgets nor the task backlog. Thanks to the reports, business units see the top problems in their area of responsibility and, based on this data, help prioritize their elimination.
For the IT support department. Having seen the top problems in their own division, each manager will make every effort to eliminate them quickly and effectively.
The second stream is reports for specific requests, built in Power BI. These are mostly reports that, for various reasons, are not implemented in the Health Card*.
*The Health Card is a single internal IT reporting portal for visualizing the health of the company’s production IT services. Recently, reporting for the Problem Management area was added to it as well. On the portal we can view general statistics by area (number of open and resolved problems, statistics on reasons for registration, dynamics, coverage, etc.), as well as more detailed data: for example, information on problems in a particular area, details on problems (registration dates, priority, number of related requests, full name of the coordinator), details for given periods, etc.
Visually it looks like this:
The reasons vary: perhaps the customer’s area is narrowly specialized, or the requested report is so complex to build that it amounts to a development project, yet the customer does not need it on a regular basis. The customers in this case are related support departments. By narrowly specialized we mean that the report does not reflect any widespread request or need, so publishing it on the common portal is not worthwhile: it would consume shared resources and development effort.
3. Automation
Automation was not, and perhaps to this day is not, the area of responsibility of problem analysts. But the guys understand that they have the necessary competencies, so they try to put them to use.
I'll show you with an example.
We decided to start counting the number of incidents related to each problem, gave the teams the appropriate instructions, and were glad that we were doing the right thing. And then we stopped to think: how do we check that the teams are not making mistakes and are linking to a problem exactly those incidents that it caused, and not merely “similar” ones? After all, linkage is a valuable metric, and it is important that it be as accurate as possible.
At first there were two options:
Check everything manually. We even piloted this option to estimate the labor costs, and they turned out to be huge.
Call on a neural network for help. But we would have had to wait too long for that help.
Yet everything had to be done quickly and without putting a crazy strain on people.
While the working group was thinking, the head of the analytics team, a very passionate person, built a tool in Python on the side that checks the incidents linked to a problem against an editable reference list. The match results are calculated and expressed as percentages.
The essence of the tool is the automated selection of incidents linked to a problem and their verification using keywords, which the system itself suggests based on incident analysis and which can be edited manually.
Of course, this method is not 100% accurate, but all our tests, including comparisons with fully manual operator work, showed an accuracy of 80% or higher. And at the moment this is exactly what we need!
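To give a feel for the approach, here is a minimal Python sketch of a keyword check. It is an illustration and not the actual tool: suggesting keywords by how many incident descriptions contain a word is my assumption, and the function names, the stop-word list and the sample descriptions are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "the", "at", "is", "are", "not", "does", "and", "of", "to"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of meaningful words (Latin letters and digits only)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def suggest_keywords(descriptions: list[str], top_n: int = 2) -> set[str]:
    """Suggest the words that appear in the largest number of incident descriptions."""
    doc_freq = Counter()
    for text in descriptions:
        doc_freq.update(tokenize(text))
    # Sort by frequency, then alphabetically, so the suggestion is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:top_n]}


def check_links(descriptions: list[str], keywords: set[str]):
    """Return (match percentage, indexes of incidents with no keyword hit)."""
    misses = [i for i, text in enumerate(descriptions) if not (keywords & tokenize(text))]
    if not descriptions:
        return 0.0, misses
    return 100.0 * (len(descriptions) - len(misses)) / len(descriptions), misses


# Incidents linked to a hypothetical "scale does not weigh goods" problem.
linked = [
    "Scale at checkout 3 does not weigh goods",
    "Goods are not weighed, scale shows zero",
    "Checkout scale frozen",
    "Cash register keyboard not working",  # probably linked by mistake
]

suggested = suggest_keywords(linked)
# The reference list is editable: drop a word that is too generic, add specific ones.
reference = (suggested | {"weigh", "weighed"}) - {"checkout"}
percent, suspicious = check_links(linked, reference)
print(f"{percent:.0f}% matched, check manually: {suspicious}")
```

Only the incidents flagged as suspicious need a manual look, which is where the savings over a fully manual check come from.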
The second striking example is a tool that estimates the cost of ownership of problems. Analysts developed it after meetings with business units, during which it became clear that the criteria for assessing a problem needed to be expanded. It is important to consider not only the impact and the number of incidents, but also the average cost of support across those incidents and the total amount of time and FTE spent on support.
As a result, we got a console utility, a Python script, a calculator if you like, which takes the numbers of the problems of interest as input. The output is the total number of related incidents and tasks, the amount of time (in hours) spent on resolving them, and the number of FTE spent on handling those requests, all of which is easily converted into cost.
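The arithmetic behind such a calculator is simple. Below is a minimal Python sketch rather than the actual script. The record layout, the figure of 160 working hours per FTE per month and the hourly rate are assumptions made for illustration.

```python
from dataclasses import dataclass

HOURS_PER_FTE_MONTH = 160.0  # assumption: one FTE is about 160 working hours per month


@dataclass
class WorkItem:
    problem_id: str
    kind: str     # "incident" or "task"
    hours: float  # time spent on resolving the item


def cost_of_ownership(items, problem_ids, hourly_rate):
    """Summarize the work items linked to the given problems."""
    selected = [i for i in items if i.problem_id in problem_ids]
    hours = sum(i.hours for i in selected)
    return {
        "incidents": sum(1 for i in selected if i.kind == "incident"),
        "tasks": sum(1 for i in selected if i.kind == "task"),
        "hours": hours,
        "fte_months": hours / HOURS_PER_FTE_MONTH,
        "cost": hours * hourly_rate,  # hourly_rate is a hypothetical blended support rate
    }


# Hypothetical sample data.
items = [
    WorkItem("PRB-1", "incident", 0.5),
    WorkItem("PRB-1", "incident", 1.5),
    WorkItem("PRB-1", "task", 6.0),
    WorkItem("PRB-2", "incident", 2.0),
]

report = cost_of_ownership(items, {"PRB-1"}, hourly_rate=20.0)
print(report)
```

Passing several problem numbers at once gives the combined figures, which is convenient when a group of problems belongs to one business unit.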
What have we achieved?
First of all, we gained the ability to identify problems, digitize them in every sense of the word and create a predictable track for solving them. Taken together, this lets us optimize support work, reduce the cost and number of requests, increase system stability, etc.
Our incident base has shrunk by more than 200 thousand requests, which means the company can optimize support work and make it much more efficient.
The business, in turn, no longer just “guesses” but literally sees and tracks all the “particularly problematic” areas and IT systems, which allows it to manage improvements more effectively.
Of course, we understand that this is really just the beginning of a long journey. There is still a lot of work ahead to be done and optimized. Quality management of problems should become less resource-intensive (through automation), coverage should become broader, effective interaction with the event management process should appear, quality should be assessed, preferably using AI, and much more.
We are now actively working on creating a strategy for the development of Problem Management, which will certainly include everything that will allow us to make this area even more useful and effective.