Postmortem, or how to analyze an incident and not quarrel with anyone (well … or almost anyone)

Before we start, a warning: do not google the word “postmortem”, and especially not the pictures. At the turn of the 19th and 20th centuries there was a not especially pleasant tradition of photographing relatives who had recently left this world. The text below has nothing to do with that practice.

What is a postmortem in the realm of information technology?

To paraphrase Tolkien, tales of how we succeeded are monotonous and boring, but tales of incidents are often breathtaking. The postmortem is one variety of these cozy fireside stories.

Jokes aside, a postmortem is a detailed analysis of a problem that arose in one project, affected several projects, or perhaps hit the entire company: an issue that caused, or could have caused, reputational or monetary damage.

The Anatomy Lesson of Dr. Tulp is a 1632 painting by Rembrandt.

Simply put, the purpose of the exercise is to examine a new, non-obvious, or unexpected situation or bug that you or the company do not want to face again. We study the incident to understand what actually happened and to prevent it in the future, or at least to learn to detect it earlier, minimize the damage, and so on.

A postmortem can be an email, a text file, a Jira ticket: in fact, anything that is available to a fairly wide range of colleagues. That range can be your team, several teams, a department, or maybe the whole company.

Postmortems come with different approaches, different policies, and different cultures, and sometimes with no culture at all. In this article I will try to briefly cover the most common practices and the most popular pitfalls.

So what is worthy of a postmortem?

Here opinions immediately divide. We launch an investigation when what happened was:

  1. Something new

  2. Something unpleasant

  3. Something big

(Sometimes a combination of several of the features listed above is used.)

Speaking for myself: I think it only makes sense to dissect something new. Re-investigating an identical case is, in my deepest conviction, not worth a full-fledged analysis; there, most likely, the goal should simply be to count the losses. Every other element of the incident has already been sorted out, so going over it again would be running in circles and a waste of time.

In the end, the goal of a good postmortem is to help us get better. It should be an analysis, not a scarecrow for colleagues. Postmortems should not be used to frighten people (even as a joke) or be built into KPIs. I once came across a very strange item in a technical director’s KPIs: “no more than X postmortems in the area of responsibility per quarter.”

What can this lead to (and in fact will lead to, sooner or later)? Simply this: for the sake of meeting the KPI, incidents that truly deserve analysis will not get it. Dust gets swept under the carpet and the usual concealment of problems begins. If we are not honest with each other, a postmortem loses all meaning.

We have talked about the form; now it is time to talk about the content.

Perhaps I’ll start with what should not be in the document.

Names. Not “Ivanov did not pay attention to the strange behavior of the metric” and not “on-duty DevOps engineer Ivanov ignored the alert”, but “the on-duty DevOps engineer set the threshold too high”. We do not name names; we designate roles, and not only the roles of those who made a mistake but also the roles of everyone else involved in the event. We are not out to accuse anyone: as a rule, the name of the culprit is already known, and we do not want anyone’s blood or other “organizational conclusions”. We want to understand what the incident looked like from different points of view: engineers, clients, finance, support, anyone. We want to extend our checklists, training, and reference materials so that the result of our postmortem is knowledge of the form “this situation looks just like the one we had last quarter … you know, the trouble the guys wrote that postmortem about.”
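The “roles, not names” rule can be checked mechanically before a draft goes out. The following is only a minimal sketch: the `ROLES_BY_NAME` roster and both function names are hypothetical, and in practice the mapping would have to come from wherever your team keeps its duty schedule.

```python
import re

# Hypothetical roster mapping personal names to roles (an assumption for
# illustration; it is not taken from any real system).
ROLES_BY_NAME = {
    "Ivanov": "the on-duty DevOps engineer",
    "Petrova": "the support team lead",
}


def find_names(text, roles_by_name=ROLES_BY_NAME):
    """Return the known names that still appear in the text, sorted."""
    return sorted(
        name
        for name in roles_by_name
        if re.search(rf"\b{re.escape(name)}\b", text)
    )


def replace_names_with_roles(text, roles_by_name=ROLES_BY_NAME):
    """Replace every whole-word occurrence of a known name with its role."""
    for name, role in roles_by_name.items():
        pattern = rf"\b{re.escape(name)}\b"
        # A function replacement keeps the role text literal.
        text = re.sub(pattern, lambda _match, role=role: role, text)
    return text
```

For instance, `replace_names_with_roles("At 02:10 Ivanov muted the alert.")` gives `"At 02:10 the on-duty DevOps engineer muted the alert."`. A simple substitution like this will not repair grammar or capitalization, so the result still needs a human read.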

Some time ago, a friend told me about a truly terrible anti-example of a postmortem:

A large meeting room, similar to an amphitheater, was booked. “On stage”, at a table, sat the group of those who had let the incident happen. One by one, anyone who wished to call down thunder from heaven on the accused, or simply to taunt them, stepped up to the podium.

Attendance at this amphitheater was encouraged in every possible way (specially trained people carefully monitored how full the “auditorium” was).

I could not work out what the purpose of this spectacle might be. To shame the guilty? They have already punished themselves a hundred times over for what happened. But initiative and independence will obviously drop to zero after this, and an endless sign-off on every single step will begin (well, just in case). Moreover, this awaits not only the accused who ended up at the table but also the spectators and accusers: no one wants to end up as the victim of such a trial.

Okay, that is what to leave out. Now let’s deal with what goes in.

Postmortem content

As a rule, a postmortem consists of three large sections:

  • Timeline

  • Details

  • Conclusions

And now the details.

It is important to note here that the compiler of the postmortem, whoever that may be, can hardly know all the details. So the compiler does not have to be a subject-matter expert, but rather a bearer of some general knowledge. You cannot know everything, but you need to know where to find everything out, so this work usually falls on the shoulders of the project manager, the product manager and … (no, I cannot hold a Moscow Art Theater pause here) the support team lead, as the person who is in contact with everyone in the company as well as with users.

What should be included in any case?

A title or subject line with a brief description of the trouble.

Part 1.

Sad chronicle.

  1. Affected projects or clients

  2. Accurate incident timeline

    1. When did it start?

    2. When was it noticed? NB! The beginning of an incident and its discovery do not always coincide.

    3. Who noticed it and what did it look like? NB! This is worth adding to the checklists.

    4. The delay between detection and the start of mitigation work (sometimes the time between the start of the incident and the start of mitigation is given instead, sometimes both).

  3. What steps were taken to fix this?

  4. When exactly was the problem fixed?
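The delays listed in the timeline above are plain differences between timestamps, so they can be computed rather than estimated by eye. Here is a minimal sketch, assuming all four timestamps are recorded in the same time zone; the `IncidentTimeline` class and its field names are hypothetical, not part of any standard tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical container for the four key moments of an incident."""

    started: datetime       # when the incident actually began
    detected: datetime      # when someone (or something) noticed it
    work_started: datetime  # when mitigation work began
    resolved: datetime      # when the problem was fixed

    def __post_init__(self):
        points = [self.started, self.detected, self.work_started, self.resolved]
        if points != sorted(points):
            raise ValueError("timeline timestamps must be in chronological order")

    @property
    def time_to_detect(self) -> timedelta:
        """Gap between the start of the incident and its discovery."""
        return self.detected - self.started

    @property
    def detection_to_work(self) -> timedelta:
        """Delay between detection and the start of mitigation work."""
        return self.work_started - self.detected

    @property
    def start_to_work(self) -> timedelta:
        """Delay between the start of the incident and the start of mitigation."""
        return self.work_started - self.started

    @property
    def total_duration(self) -> timedelta:
        """Time from the start of the incident to the fix."""
        return self.resolved - self.started
```

With made-up timestamps of 02:00 (start), 02:40 (detection), 02:55 (work begins), and 04:10 (fix) on the same day, `time_to_detect` is 40 minutes, `detection_to_work` is 15 minutes, and `total_duration` is 2 hours 10 minutes. The chronology check also catches a common slip: a detection time entered earlier than the start time.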

Part 2.

Details.

  1. Brief description of what actually happened and why.

  2. Technical details (full version) – this is usually the most voluminous part of our story, for which you will have to go to a variety of teams.

  3. Direct monetary impact – here we go to the finance team.

  4. Estimated monetary impact – here we go to the product team.

  5. Indirect and reputational impact – here we ask the product team and SMM: how unhappy are people with us?

Part 3.

So what do we do?

  1. Could we have noticed and fixed the problem earlier? What do we need for this?

  2. Recurrence prevention measures and lessons learned from the incident (this is where you take on the role of Stan Marsh from South Park and say “we learned a lot today”)

I will add right away that you can often get a human conversation going with that same DevOps engineer, who has turned gloomy and lost the gift of printable speech, by opening with something like: “Listen, we got burned by such-and-such an event and lost money. But there is an idea that introducing service X, for next to nothing, would insure us against this Armageddon catching us by surprise again. What do you think?”

(Based on true events.)

A part that may be absent, but that would be nice to have.

What else can be included in our document?

  1. What were we trying to achieve and what went wrong? Well, we did not break everything on purpose … We wanted to do something good, but … (see the first two parts)

  2. What did the problem look like from different angles? This section is not always present because describing everything is laborious, but it is still well worth investing in.

  3. Colleague comments. Yes, comments and additions are welcome, but without the kind of unacceptable behavior I described earlier in the text. Shut that down immediately and wield the banhammer mercilessly!

(Here I wanted to give an example of a real postmortem, but I realized that explaining the specifics of the company and its processes would require a separate article, and decided to skip it.)
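In place of a real example, the three-part structure described above can at least be sketched as an empty skeleton. This is only an illustration under stated assumptions: the `Postmortem` class, its field names, and the plain-text layout are all hypothetical, and your own document may live in a ticket or a wiki page instead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical skeleton mirroring the three parts described in the text."""

    title: str
    timeline: List[str] = field(default_factory=list)     # Part 1
    details: List[str] = field(default_factory=list)      # Part 2
    conclusions: List[str] = field(default_factory=list)  # Part 3

    def render(self) -> str:
        """Render the document as plain text, one numbered list per part."""
        lines = [self.title, ""]
        sections = [
            ("Part 1. Timeline", self.timeline),
            ("Part 2. Details", self.details),
            ("Part 3. Conclusions", self.conclusions),
        ]
        for heading, items in sections:
            lines.append(heading)
            if not items:
                # Show empty parts explicitly so that gaps stay visible.
                lines.append("  (to be filled in)")
            for number, item in enumerate(items, start=1):
                lines.append(f"  {number}. {item}")
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"
```

Calling `Postmortem(title="Checkout outage").render()` with nothing filled in produces the title followed by three headings, each marked “(to be filled in)”, which makes a reasonable starting point to hand to the teams who own the details.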

And in the end, paraphrasing another classic, this time the Strugatskys, I will say: “The people do not need unhealthy postmortems – the people need healthy postmortems.”

Dixi.
