T-shirts, money, two cakes: how we forgot how to estimate tasks

Hello, Habr! My name is Artyom, and I am a team lead at Skyeng. My development team has a customer: the product manager, or simply Vanya. Vanya believes that our task estimation scheme is far from ideal. For example, an estimate of 2 days tells him nothing: he will see his task in production in a week, or in 10 days. Or more. Or less.

This is not because we fail tasks, but because a traditional estimate really covers only the time the developer spends writing code. But there is also testing and code review. OK, we can include all of that in the estimate. But still:

  • we have a queue before both development and testing,
  • there are rework rounds; we are not without sin,
  • urgent tasks fly in,
  • when an implementation affects several services, we wait for a review from adjacent teams.

How do we learn to answer the question “When?” if predictability is out of the question?

How we came to doubt estimates

Our team, like many in the company, has a very useful meeting: a technical review (or, in short, a tech review). It takes a decent amount of time and effort, but it adds predictability: we outline the technical solution to the task in advance and estimate it at the same time.

Since we always work remotely, everything happens in JIRA: there is a board that visualizes the stages of work. A card leaves the Tech Review status and moves to Ready for Development once we have described and estimated everything. This is the moment we commit to doing the work.


“Ready for development” has a WIP limit: it can hold no more than 8 tasks at a time. The reverse rule also applies: as soon as the number of tasks in the column drops, we start a new tech review.

Fact: we spend a significant amount of time estimating. A tech review usually takes place twice a week, and working through 4 tasks in detail and estimating them can take 1.5-3 hours. But that is not all: afterwards we may also spend time figuring out why an estimate was exceeded.

At the same time, neither estimation nor post-mortems add value to our product. Rather, we are wasting time on them. And money. For a long time I doubted the need for these procedures, and at some point I was ready for a serious conversation with the product manager. We both acknowledged the problem.

“The shirt is dry and completely …” not XS

We decided to experiment with estimation approaches. I suggested T-Shirt Sizing, a technique that uses T-shirt sizes as the unit of measurement. You find the smallest task you have ever had to do and call it XS. The remaining tasks are then estimated by how much larger they are than XS, and sized S, M, L or XL accordingly.


What won us over was the chance to estimate “by eye”. The idea was simple: we accumulate statistics on how long development takes to complete a task of each size, calculate the average, and can then predict the timing.

The customer will forgive an error of a day or two, which means no more post-mortems. And there is no need to spend time on tech reviews and interactive voting. Smooth sailing!

We worked this way for several months, collecting statistics. And only Vanya looked askance at us.

It turned out that an XS, like an S, might take us 1 day one time and 10 days the next. And an L takes 5 days, or as many as 15. That is because we pick up some work first, some second and some fifth, so tasks of the same size spend different amounts of time in waiting statuses. Oops: so much for the averages.
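To illustrate why the averages let us down, here is a minimal sketch in Python. The lead times are made up to match the spread described above (they are not our real statistics), and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical lead times in days per T-shirt size (illustrative only).
lead_times = {
    "XS": [1, 10, 2, 9],
    "L": [5, 15, 6, 14],
}


def summarize(samples):
    """Return (mean, min, max) for a list of lead times in days."""
    return mean(samples), min(samples), max(samples)


summary = {size: summarize(days) for size, days in lead_times.items()}

for size, (avg, shortest, longest) in summary.items():
    # The mean looks like a usable forecast, but the range shows it is not.
    print(f"{size}: mean {avg:.1f} d, range {shortest}-{longest} d")
```

With data like this, the mean for XS sits in the middle of a range where almost no task actually lands, so it predicts very little.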

In short, the spread here is not a couple of days, and for Vanya little has changed. We declared the experiment a failure; still, the idea that tasks can be categorized somehow stuck in my head. And I kept thinking along those lines.

“Everyone loves cakes. Puff!” Donkey from Shrek

So do I. Besides, a child’s birthday is a great occasion! I go to my favorite site and start choosing:

  • you can have it one way or another,
  • you can have it decorated or undecorated,
  • you can have 2 kg or you can have 5 kg.

I will not reveal my taste preferences, but I chose a cake. And it was delivered on the agreed date. What follows is the philosophy of a team lead who overate cake.

Of course, I am not Newton, and a cake is not an apple, but the insight came.

I could choose from many options, but whatever I chose, the delivery date did not change. I needed a cake in a week, and the bakery was ready to provide that service. The cake’s size, weight and all the extras did not greatly affect the final result; more precisely, in this case they did not affect it at all. It’s not about the size, as they say. Then what is it about? The price.

For example, the bakery offered express orders: for an additional fee, they would bring me the same fancy cake in just a couple of days rather than in 5. My order, being more valuable than the others, would jump the queue. In effect, the bakery has two SLAs: one for regular orders and one for VIP orders. That is something to think about.

The SLA idea clicked because I had read about it in the Kanban Guide

From the point of view of the Kanban method, everything is a service. And although we do not deliver cakes, and our product cannot be touched or eaten, development is also a service. And we, too, treat different tasks differently.

Recall our board:


The service consists of several stages (development, code review, testing), and the “Ready for development” column is our commitment point to the customer.

We do some things at our usual pace, but when burning tasks arrive, we drop everything. It remains to work out what SLAs we have, and then we can strike an agreement with Vanya.

How to determine your team’s SLA: building a spectral diagram (it’s simple)

To understand what classes of service we have and what SLAs they have, Kanban suggests building the following chart:

  • The X axis shows Lead Time (LT), the time it takes to deliver a task. In our case, this is the time from Ready for Development to Done.
  • The Y axis shows frequency: how many tasks we completed in LT1, LT2, LT3, etc.

We took the tasks closed over the past few months and got the following:


We closed 3 tasks in one day, 6 in two, the most in 5, and on one task we struggled for more than two weeks…

Well, now it is time to analyze. What are these tasks? Why did they end up here? Why do we complete more tasks in one particular LT than in the others? You can go and ask customers and assignees, and also study the comments on the tasks.

Here is what we managed to unearth. This is our regular work.
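For those who want to try this on their own data, here is a minimal sketch of how such a chart could be built. It assumes you can export a pair of dates per task (when it entered Ready for Development and when it was done); the dates below are made up, and `lead_time_days` and `lead_time_histogram` are hypothetical helper names.

```python
from collections import Counter
from datetime import date

# Hypothetical (committed, done) date pairs; in practice these would come
# from a tracker export.
tasks = [
    (date(2020, 3, 2), date(2020, 3, 3)),
    (date(2020, 3, 2), date(2020, 3, 4)),
    (date(2020, 3, 3), date(2020, 3, 5)),
    (date(2020, 3, 2), date(2020, 3, 7)),
    (date(2020, 3, 4), date(2020, 3, 9)),
    (date(2020, 3, 5), date(2020, 3, 10)),
    (date(2020, 3, 2), date(2020, 3, 18)),
]


def lead_time_days(committed, done):
    """Lead time in whole calendar days between commitment and delivery."""
    return (done - committed).days


def lead_time_histogram(task_dates):
    """Map each lead time (X axis) to the number of tasks with it (Y axis)."""
    return Counter(lead_time_days(c, d) for c, d in task_dates)


histogram = lead_time_histogram(tasks)

# A text version of the chart: one row per lead time, one mark per task.
for lt in sorted(histogram):
    print(f"{lt:>3} d | {'#' * histogram[lt]}")
```

Calendar days are used here for simplicity; whether to count weekends is a choice to agree on with the customer.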

The spread is quite large, but it can be analyzed.

In general, the bulk of the tasks fell in the 7-14 day range, and a couple flew very far out; this tail included several tasks (not all) that required PRs to other services. Tasks completed in 3-4 days are the exception rather than the rule.

So I can already tell the customer that if a task goes through as regular work, with 75% probability it will reach production within 10 days.

And with 90% probability, within 14 days. And if the development affects other company services, the wait will be a little longer: we need a code review from another team and then one more release.
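The 75% and 90% figures are simply percentiles of the lead-time distribution. Below is a minimal sketch using the nearest-rank method, which is one of several ways to define a percentile and not necessarily the one behind our chart. The sample is made up so that it reproduces the 10 and 14 days above; `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical lead times (days) for the regular class of service.
regular = [3, 4, 7, 7, 8, 8, 9, 9, 9, 10,
           10, 10, 10, 10, 10, 11, 12, 14, 21, 25]

sla_75 = percentile(regular, 75)
sla_90 = percentile(regular, 90)
print(f"75% of tasks done within {sla_75} days, 90% within {sla_90} days")
```

Nearest-rank always returns a lead time that was actually observed, which makes the number easy to explain to a customer.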

Let’s go further. We called this class “Important”.


For one reason or another, these tasks are picked up earlier than others: either their value is higher or their cost of delay is higher.

And here, too, we can state an SLA: with 75% probability a task will reach production in 5 days, with 90% probability in 7. Shall we continue?

The very tasks for which we drop everything and grind away: blockers.


In 100% of cases, these are either minor follow-ups that we missed when implementing the main feature, or bugs that affect vital functionality in production.

Although we managed to resolve all such cases within 2 days, we will still state the 90th percentile. First, you should never promise 100% to anyone 🙂 Second, you need to allow for variability: recall the regular work, where several tasks took 20+ days because of a dependency on other teams.

Done! We can agree with Vanya on SLAs for all classes of service:


We chose the 90% deadlines deliberately: this is, in effect, the customer’s tolerance for missed deadlines. That is, if 1 task in 10 does not fit into the SLA, the customer is ready to forgive us.

If your customer is less forgiving, it is better to state, for example, the 95th percentile.
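Once an SLA is agreed, it is worth checking from time to time whether you still stay within the tolerance. A minimal sketch, assuming a 14-day SLA for regular work and a made-up batch of recent lead times; `sla_hit_rate` is a hypothetical helper.

```python
def sla_hit_rate(lead_times, sla_days):
    """Fraction of tasks delivered within sla_days."""
    if not lead_times:
        raise ValueError("no tasks")
    return sum(lt <= sla_days for lt in lead_times) / len(lead_times)


# Hypothetical lead times (days) of the last ten regular tasks.
recent = [6, 9, 10, 12, 13, 14, 8, 11, 14, 20]

rate = sla_hit_rate(recent, sla_days=14)
within_tolerance = rate >= 0.9  # 90% agreed with the customer
print(f"hit rate {rate:.0%}, within tolerance: {within_tolerance}")
```

Here one task out of ten missed the deadline, which is exactly the miss rate a 90% agreement allows.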

Instead of a conclusion

– And what stops Vanya from filing only important tasks or blockers?
– Horizontal WIP limits.

We agreed on a limit on the number of tasks per class of service: you cannot have more than two blockers or more than two important tasks in progress. Your numbers may differ; this is a matter of agreement with the customer. You can’t set such limits in JIRA without plugins, so a verbal agreement is definitely needed. Tools are tools, but you get nowhere without talking to people.

Thank you for your attention, and good luck with your planning!
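If you ever want to back the verbal agreement with a small script, the check itself is trivial. A minimal sketch: the limits are the ones we agreed on, while the class labels and the `can_pull` helper are hypothetical and have nothing to do with JIRA's own API.

```python
from collections import Counter

# Limits agreed with the customer: at most two blockers and two important
# tasks in progress at once. The class labels are illustrative.
WIP_LIMITS = {"blocker": 2, "important": 2}


def can_pull(in_progress, service_class, limits=WIP_LIMITS):
    """Return True if one more task of this class fits within its limit.

    in_progress is a list of class labels for tasks currently in work.
    Classes without an entry in limits (e.g. regular work) are not
    restricted by this check.
    """
    limit = limits.get(service_class)
    if limit is None:
        return True
    return Counter(in_progress)[service_class] < limit


board = ["blocker", "regular", "blocker", "important"]
print(can_pull(board, "blocker"))    # two blockers are already in work
print(can_pull(board, "important"))  # only one important task so far
```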
