A/B test design mistakes I thought I would never make

When I launched my first experiments, I was sure that all those “three / five / seven most common blunders” I had read about in articles and heard about at conferences were certainly not about me. Besides, I had help with test design: the company had adopted a large, beautiful research template.

But in practice, pitfalls were waiting. Let’s talk about what can happen if you tweak the design a little or skip a field in your template, and how to fix it all.

I wanted to help new users, but naturally they did not behave as intended

Skyeng’s primary sales tool is a free introductory video lesson with a facilitator. We conduct lessons on our own platform, and sometimes a student tries to join a call but their microphone or camera is not picked up.


This can happen for dozens of reasons, from a trivial misclick on a browser notification (as in the picture) to completely exotic cases: for example, one person tried to join from a Tesla, which runs its own software that we do not support.

If the problem cannot be fixed quickly, the introductory lesson fails for technical reasons:

  • the student is left with a negative impression,
  • the teacher’s lesson is disrupted,
  • the school loses the conversion to payment here and now (this is our department’s main metric), compensates the teacher for the time spent, and starts the process of rescheduling the lesson.

Everyone suffers. So last year we launched a number of projects to reduce technical failures. Each idea was tested: the business wanted to understand whether the feature worked and whether the cost of supporting it would pay off.


One of the solutions we had to test was an equipment check quest. Here are its main screens as they looked in the original version.

The idea is simple: do not wait until the student joins the lesson, but invite them to check the camera and microphone in advance, right after they submit an application for lessons. If something goes wrong, we create a ticket for technical support, and the team has several hours to solve the problem.

When I divided users into test and control, I expected that people in the test group would click on the widget and complete the quest. What could go wrong?

In the control group (“A”) everything went on as usual: people submitted an application and went about their business. But after the test, we saw that the share of technical failures in groups “A” and “B” matched to within a hundredth of a percentage point. Hmm, did everyone in the test group complete the quest and it did not help, or did no one even enter it? We did not know: there was no logging.

The two stages had merged into one, and it turned out that we could not separate them. I had to restart the test and log the key stage “entered the quest”. We found out that only about 10% of users entered it. There was no significant improvement in the metric: the quest sank into oblivion, and the equipment check itself was eventually built into onboarding during a global redesign. And now I check at the start whether I have data for all the key stages of the funnel.
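Here is a minimal sketch of that kind of pre-launch check. It is not our production code: the stage names, the `(user_id, group, stage)` event format, and the helper functions are all hypothetical and exist only to illustrate the idea. The check takes a sample of logged events and reports which funnel stages have no data at all for the test group, and what share of users reach a given stage.

```python
from collections import defaultdict

# Hypothetical stage names: the real event names are not given in this post.
REQUIRED_STAGES = [
    "application_submitted",
    "quest_entered",
    "quest_completed",
    "lesson_started",
]


def users_per_stage(events, group):
    """Map stage -> set of distinct user ids seen in `group`.

    Each event is assumed to be a (user_id, group, stage) tuple.
    """
    users = defaultdict(set)
    for user_id, event_group, stage in events:
        if event_group == group:
            users[stage].add(user_id)
    return users


def missing_stages(events, group, required=REQUIRED_STAGES):
    """Return the required stages that have no logged users in `group`."""
    users = users_per_stage(events, group)
    return [stage for stage in required if not users.get(stage)]


def stage_rate(events, group, stage, base_stage="application_submitted"):
    """Share of users in `group` who reached `stage`, relative to `base_stage`."""
    users = users_per_stage(events, group)
    base = len(users.get(base_stage, ()))
    return len(users.get(stage, ())) / base if base else 0.0
```

Run on a smoke-test sample before launch, a non-empty result from `missing_stages(events, "B")` would have shown right away that “entered the quest” was never logged.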

I didn’t ask myself, “Can we roll it out?” Or a story about a call that really is very important to us

Besides technical problems, sometimes the student simply does not show up for that same free introductory lesson: they overslept, it slipped their mind, their plans changed, and so on.

Therefore, before each lesson, the methodologist needs to find a student who is ready for the lesson: the system gives the methodologist several contacts to call. This “eats up” 12-15% of the time that a person could spend on something more useful or pleasant.

It looks like a good opportunity for automation: let a robot make the calls. But we need an A/B test: after all, some people may hang up when they hear a robot, so the risk of losing something is obvious. We ran the test and at first everything went surprisingly well, but … we were let down by perfectionism.

In a number of scenarios, the robot had to transfer the call to a human operator: for example, if a student wanted to cancel a lesson, the operator had to make changes in the CRM. And sometimes the robot simply ran into talkative people. The system was not designed for serious speech recognition and dialogue support, so here, too, a human had to be brought in.

We wanted to make the user experience as seamless as possible.

Therefore, we decided to transfer such calls immediately to the incoming telephony line, even if the question was not urgent. Methodologists, in the same situations, would say: “You will be called back in 3-5 minutes to reschedule the lesson”, which gave the operators time to spread the workload and help everyone.

The operators could not negotiate with the robot the same way, and it created spikes of several urgent calls per minute. The scheme turned out not to scale.


At peak moments, the situation resembled a classic game :) Thanks to Wikipedia and its contributor perepelin30 for the photo.

We returned to the scheme the methodologists used: if a person clearly asked to reschedule, the robot would answer “We will call you back”. Only potentially urgent issues were transferred to operators immediately. After these changes, the test had to be run again, since the change could affect key metrics. And now, before each experiment, we ask the question: “Ok, if everything goes well, can we roll it out?”
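One way to make that question concrete is a rough capacity check on test traffic. The sketch below is only an illustration under loud assumptions: the function names, the idea of scaling the observed peak linearly by the test’s traffic share, and the capacity figure are all hypothetical, not how our telephony actually works. It counts the busiest minute of transfers to operators and projects it to a full rollout.

```python
from collections import Counter


def peak_transfers_per_minute(transfer_times_sec):
    """Largest number of transfers that started within one clock minute.

    `transfer_times_sec` holds transfer start times in seconds
    from an arbitrary origin.
    """
    per_minute = Counter(int(t // 60) for t in transfer_times_sec)
    return max(per_minute.values(), default=0)


def can_roll_out(transfer_times_sec, test_share, capacity_per_minute):
    """Project the observed peak to 100% of traffic and compare it with capacity.

    Assumes (crudely) that the peak grows linearly with the traffic share.
    Returns (fits_capacity, projected_peak).
    """
    observed_peak = peak_transfers_per_minute(transfer_times_sec)
    projected_peak = observed_peak / test_share
    return projected_peak <= capacity_per_minute, projected_peak
```

For example, if the robot handled half of the traffic and the busiest minute had three transfers, the projected peak is six per minute; with operators able to take four, `can_roll_out` returns `(False, 6.0)` and the scheme needs rethinking before the test starts, not after.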

Launched the test, checked that everything was going well, and went off to deal with a pile of routine tasks

Skyeng has a very cool and growing audience: kids who learn math and English with us. But we cannot conduct an introductory lesson for a child if a parent is not present. Legally, we are not allowed to. Therefore, if the child connects alone, the lesson falls through. You know the rest: a negative experience, rescheduling, and so on.

Parents were always warned about this verbally, during the call in which the lesson time was agreed. But time passed between the call and the lesson and, of course, not everyone remembered the agreement.


Then the solution came: let’s send an SMS reminder. A text roughly like this went to the parent shortly before the introductory lesson.

An increase in the number of introductory lessons that go ahead without disruption does not automatically mean an increase in conversion to payment. You need to estimate the ROI. To do this, let’s run an experiment:

  • we randomly split all applications for the children’s courses into two groups,
  • we send nothing to parents in the first group: they get the regular flow,
  • parents in the other group get two SMS reminders: 24 hours and 1-2 hours before the start of the lesson.
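The random split from the list above can be sketched as follows. This is one common approach, not necessarily the one our system uses: hashing the application id with a per-experiment salt gives a stable, roughly 50/50 assignment. The function name, the salt, and the group labels are hypothetical.

```python
import hashlib


def assign_group(application_id, salt="sms_reminder_test"):
    """Deterministically assign an application to 'control' or 'sms'.

    The same id always lands in the same group, and a different salt
    gives an independent split for another experiment.
    """
    key = f"{salt}:{application_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "sms" if int(digest, 16) % 2 else "control"
```

Because the assignment depends only on the id and the salt, it can be recomputed later and compared with what was actually logged, which makes bugs in the split easier to spot.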

We started the test, ran a check on the first day, and went off to clear the routine backlog.

A couple of weeks later, I looked at the dashboard, and there, besides the test and control groups, were some other users.


If we wanted a 50/50 split, then the red line clearly says something went wrong.

It turned out that a banal bug was to blame: something was wrong with the events, and the SMS was not being sent to everyone when the triggers fired. The bug was fixed, but the test had to be restarted. In the end, even if your test design is correct and every template is filled in, this does not mean that the test will run smoothly. You should look in on it as often as possible.
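A check like this is cheap to automate. Below is a minimal sketch, assuming each application carries a logged group label; the function name, the labels, and the 0.001 threshold are hypothetical choices for illustration. It flags users who ended up outside the two expected groups and runs a chi-square test (one degree of freedom) for a mismatch against the planned 50/50 split.

```python
import math
from collections import Counter


def split_health(assignments, expected=("control", "sms"), alpha=0.001):
    """Check a planned 50/50 split.

    Returns the counts of unexpected group labels, the chi-square p-value
    for the two expected groups, and whether a sample ratio mismatch
    is flagged at level `alpha`.
    """
    counts = Counter(assignments)
    unexpected = {g: n for g, n in counts.items() if g not in expected}
    a, b = counts[expected[0]], counts[expected[1]]
    total = a + b
    if total == 0:
        return {"unexpected": unexpected, "p_value": None, "srm": False}
    half = total / 2
    chi2 = ((a - half) ** 2 + (b - half) ** 2) / half
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"unexpected": unexpected, "p_value": p_value, "srm": p_value < alpha}
```

Scheduled daily, a report with a non-empty `unexpected` entry or `srm` set to `True` would have pointed at the problem long before the two-week mark.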

P.S. I really hope that this text will help someone make fewer mistakes in their tests. Most likely you will have, or already have, funny cases of your own: it would be cool if you shared them too someday!

P.P.S. The post is based on a talk in the Rostov IT community RnDTech. If you live somewhere in the south of the country, join in: the guys are doing great work.
