How to run an A/B test offline

Hello! My name is Ilya, I am a product analyst at Samokat.tech.

Running A/B tests is routine work for analysts. But what if you need to conduct an experiment in the physical world? What quirks and limitations does offline bring? How do you choose and evaluate metrics?

Let me show you with an example – how we tried to deliver Samokat orders even faster.

I'll walk you through the whole journey: from finding the idea, doing a preliminary assessment, and choosing the main metric, to summing up the results and drawing conclusions. I'll also cover the mistakes we made and what this experience taught us. Let's go!

Offline product analytics: features and limitations

Samokat.tech product analysts work in one of three directions – operational, customer, or commercial. I work in the operations department, on the team that manages the workload in assembly and delivery. Our goal is to plan availability intervals effectively so that orders are delivered quickly.

Here are examples of the problems we solve:

  • how many partner couriers we need to deliver this month's orders;

  • what caused the sharp spike in load on one courier partner last week;

  • what factors (weather, road conditions, etc.) we need to take into account when planning the load on partner couriers at specific dark stores.

We do research, build various dashboards, and conduct experiments to improve business processes. All this is adjusted for the limitations of the physical world.

Offline product analytics has its own peculiarities:

  1. The sample for experiments is very limited – counts are in the thousands, not the millions as online.

  2. There is a heavy dependence on the human factor – you can come up with an ingenious algorithm that significantly improves the lives of the people using it, but getting them to actually start using it is a separate, big, difficult task. You need to "sell" the idea and monitor its adoption.

Let me show, with a real story, how to take these limitations into account when running A/B tests.

Case: how to make express delivery even faster

The Samokat app shows a badge with the order delivery time – from 15 or from 30 minutes. This time depends on how far you are from the nearest dark store.

Each dark store has two zones: in the first we deliver orders in 15 minutes, and in the second in 30.

The delivery radius is determined roughly as follows: all addresses of users no further than 1,200 meters from the nearest dark store fall into the express delivery zone – from 15 minutes – and those located further away fall into the zone from 30 minutes.
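The rule above can be sketched as a tiny function. The 1,200-meter threshold comes from the text; the function and variable names are illustrative, and the sketch ignores the polygon shapes discussed next:

```python
# Sketch of the zone-assignment rule: distance to the nearest dark store
# decides which delivery SLA the address gets. Names are illustrative.

EXPRESS_RADIUS_M = 1200  # addresses within this distance get the 15-minute SLA

def delivery_sla_minutes(distance_to_darkstore_m: float) -> int:
    """Return the promised delivery time (in minutes) for an address."""
    return 15 if distance_to_darkstore_m <= EXPRESS_RADIUS_M else 30

print(delivery_sla_minutes(800))   # → 15 (inside the express zone)
print(delivery_sla_minutes(1500))  # → 30 (outside it)
```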

However, in reality the shape of a delivery zone is not circular at all: it depends on a specific geographic polygon (a predefined area that designates the delivery zone).

This is what the delivery zones look like.

The higher the demand for delivery in a city or region, the more densely dark stores are located. The case described below began with the question: "What if we take these standard boundaries and move them?"

The idea arose when we were switching from one navigation service to another. We use them to determine the distance from a dark store to the delivery address, but for the same address different services can show different distances.

This is normal – each service works according to its own logic. But we needed to understand whether the tool change would cost us anything. Due to differences in the calculations, some addresses had to move from the 30-minute delivery zone to the 15-minute one, and vice versa. How would users react to these changes? Would partner couriers manage to deliver to the same addresses under the accelerated SLA? These were the main questions we were looking to answer.

While analyzing the potential effects of these changes, we thought: what if we ourselves expanded the 15-minute delivery zone for some dark stores as an experiment?

What would happen to the SLA in that case? Where is the ideal radius at which we speed up delivery without increasing delays?

Could it really be that in all these years no one had tried to change the delivery zones or find the optimal radius for each individual dark store? After digging through the depths of Confluence, we found that there had been earlier attempts to answer the question of whether delivery could be sped up. But no final answer had been reached: there was no test group, comparisons were made "before / after", and the metrics that were monitored did not give the desired result.

This time we decided to do everything like a classic experiment: select similar groups A and B, formulate a key hypothesis, simultaneously monitor other important metrics so as not to degrade them, and calculate the expected changes, their economic effect, and the time required to run the test.

An offline A/B test: in search of the main metric

In search of the main metric for the experiment, the first thing we did was narrow the list of possible options down to three.

Personally, I insisted on a conversion metric – intuitively it seemed that if a client is hesitating whether to order and sees that the order will arrive in 15 minutes (quickly) rather than the previous 30, they are more likely to order.

Intuition is good, but it is not enough. Of the three metrics I wanted to choose one with justification – the number of dark stores available for the experiment is limited, and testing everything would be impractical. Besides, the more hypotheses we test at once, the higher the probability of a type I error – finding differences where there actually are none. Corrections can be applied, but then, given the small number of dark stores, we sacrifice power – the probability of detecting improvements if there are any.
Fortunately, we already have two groups with 15- and 30-minute delivery, so we can estimate the expected effect of the changes on historical data, which will help us decide which of the three metrics to take as the main one.
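The multiple-testing trade-off above can be made concrete with a tiny sketch. Under a Bonferroni correction (one standard adjustment; the article does not name a specific method), testing all three metrics at once means each individual test must clear a stricter significance threshold, which is what costs power:

```python
# Bonferroni correction: to keep the overall false-positive rate at alpha
# while testing m hypotheses, each individual test is run at alpha / m.
alpha = 0.05  # overall significance level

for m in (1, 2, 3):
    per_test_alpha = alpha / m
    print(f"{m} metric(s) tested: each test needs p < {per_test_alpha:.4f}")
```

The stricter the per-test threshold, the lower the power of each test – one more argument for picking a single main metric up front.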

Below, for clarity, we will need some numbers – they are fictitious (NDA), but I hope they demonstrate the general logic and course of our reasoning.

Let's take conversion. We know that for the group of users to whom we deliver in 30 minutes or more it is 12.3%, and for the 15-minute group it is 15.7%.

The picture shows random values, as an example.

Let's assume that for the test users, whose orders will now be delivered in 15 minutes instead of 30, conversion rises halfway between the two values, i.e. by (15.7% − 12.3%) / 2 = 1.7%. Then, knowing the number of users we would roll the change out to if the experiment succeeds, it is easy to calculate the economic effect of this growth.

Doing this exercise for each of the three metrics, we can understand, to a first approximation, which of the hypotheses would give a noticeable increase in revenue (GMV) if the experiment succeeds.
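The back-of-the-envelope calculation can be written out explicitly. All inputs below are made up, mirroring the fictitious figures in the text; only the logic matters:

```python
# Rough GMV effect of moving users from the 30-minute to the 15-minute zone.
# All numbers are fictitious (NDA), chosen only to show the reasoning.
conv_30 = 0.123            # conversion in the 30-minute zone
conv_15 = 0.157            # conversion in the 15-minute zone
uplift = (conv_15 - conv_30) / 2   # assume conversion rises halfway: 1.7 p.p.
users_affected = 10_000            # users who would switch zones (made up)
avg_check = 1_000                  # average order value (made up)

extra_orders = users_affected * uplift
extra_gmv = extra_orders * avg_check
print(f"uplift: {uplift:.1%}, extra GMV per period: {extra_gmv:,.0f}")
```

Repeating the same arithmetic with each metric's expected shift gives a quick ranking of the three candidate hypotheses by potential money.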

The average check value is fictitious; it is needed only to show the logic of the reasoning.

As a result, it turned out that a potentially tangible effect could only come from increasing the average order frequency per month.

Progress of the experiment

So, we launched a classic A/B test offline. We took two groups of 50 dark stores with similar order frequency. In the test group, the delivery zone was expanded by 100 meters; in the control group it was not. The experiment ran for a month, so that users had time to notice the change in delivery time from 30 to 15 minutes and (possibly) start ordering from Samokat more often. We also needed this period to collect enough data to detect a significant difference in the metrics, if there was one.
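The article does not describe exactly how the two similar groups were formed; one simple way to balance groups on order frequency is to sort stores by the metric and alternate assignment. A minimal sketch on fictitious data:

```python
import random

# Fictitious dark stores, each with a weekly order count (made-up data).
random.seed(42)
stores = [{"id": i, "orders_per_week": random.randint(500, 5000)}
          for i in range(100)]

# Sort by the balancing metric, then alternate: neighbours in sorted order
# are similar, so groups A and B end up with close average frequencies.
stores.sort(key=lambda s: s["orders_per_week"])
group_a = stores[0::2]   # test: delivery zone widened by 100 m
group_b = stores[1::2]   # control: zones unchanged

mean = lambda g: sum(s["orders_per_week"] for s in g) / len(g)
print(f"A: {len(group_a)} stores, mean {mean(group_a):.0f}; "
      f"B: {len(group_b)} stores, mean {mean(group_b):.0f}")
```

This is only one possible matching scheme; the PSM approach mentioned at the end of the article is a more rigorous alternative.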

Besides order frequency, the dashboard tracked other metrics it was important not to worsen: the share of delays and the load on partner couriers (how many orders they have to deliver per hour), as well as an informational metric to check that the zone changes had actually taken effect (the share of orders with a 15-minute SLA).

This is what one of the report tabs looks like.

Something went wrong

Our delays grew significantly. Experienced partner couriers get used to the routes and addresses they deliver to in 30 minutes – and then suddenly the monitor says they have to make it in 15. We expected that delays might grow, but not by that much.

The dotted line marks the launch day of the zone expansion – delays in the test group rise immediately.

Since the experiment involved a very small number of users, we did not rush to cancel it: the cost of continuing was much lower than the potential effect we were trying to detect.

Test results

For the remaining metrics, the results were as follows:

  • Delays > 1 minute → +30%

  • Delay > 10 minutes → +15%

  • Courier load and idle time → no change

  • Overall increase in the share of orders SLA 15 → +10%

Unfortunately, we did not manage to increase order frequency. We tested a hypothesis that looked promising, and it was not confirmed – we consider that a positive result too.
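The article does not say which statistical test was used to compare the groups. For share-type metrics such as the delay rate, one common option is a two-proportion z-test, which fits in a few lines of pure Python (the counts below are fictitious):

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Fictitious counts: orders delayed > 1 minute, test vs control (13% vs 10%).
z, p = two_proportion_z_test(390, 3000, 300, 3000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With samples this small, a difference has to be fairly large before such a test flags it – which is exactly the power limitation discussed earlier.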

What else could be improved? How could it have been made simpler?

Besides the limited sample, offline experiments have another important difference from online ones – we are often trying to change processes that happen in real life every day: how partner couriers deliver and carry orders, what routes they take, and so on.

In this article we tried to show that experiment ideas can sometimes arise from a simple question: "What happens if we slightly change this familiar thing?"

A second, more practical takeaway: if you are choosing between a bunch of metrics for a test and are not sure which of them you are actually influencing, a fairly simple approach is to discard metrics based on a rough estimate of the expected effect, in money or in other parameters important to you.

There are more advanced and effective methods: apply corrections for multiple hypothesis testing, or use PSM (propensity score matching) so that the test and control groups are as similar as possible on the metrics you care about. But sometimes the simpler methods work.
