Holograms Teach in China, or How to Conduct an A/B Test for Sales Automation in Education

Last year, in one of the company's internal chats, a team of researchers shared interesting observations about the level of automation in the Chinese education market: from teacher holograms delivering lectures to "autopilots", where a software-generated image of a teacher leads an individual, interactive video lesson.

We are far from China, but who says it isn't worth trying? My team decided to replace the free introductory English lesson with an automated version. We were very afraid of depriving site visitors of human warmth, but we came up with a safe way to test it, discovered an unexpected effect, and found additional profit.

Our department is responsible for those very free introductory lessons from Skyeng that popular YouTube bloggers (and not only they) invite everyone to. How does such a lesson go? At the appointed time, a specially trained teacher calls you through our video platform, assesses your language level, and offers a program and a suitable package of classes.

As a product team, we are responsible for optimizing the Customer Sales Costs (CSC) metric: the cost of selling a package of services to one client. Our goal is to sell Skyeng classes to as many people as possible while spending as little money as possible. Automation is one of our key strategies: we automate calls, trigger communications, and enrollment.
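For concreteness, here is a minimal sketch of how such a metric might be computed; the function name and the numbers are illustrative assumptions, not our actual figures.

```python
# Hypothetical sketch of the CSC metric; inputs are illustrative, not real data.
def customer_sales_cost(total_sales_costs: float, paying_clients: int) -> float:
    """Cost of selling a package of services to one client."""
    return total_sales_costs / paying_clients

# e.g. teacher hours, manager calls and trigger communications for one period
print(customer_sales_cost(total_sales_costs=50_000.0, paying_clients=500))  # 100.0
```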

At peak times, we began to run short of teachers for introductory lessons

From time to time there are spikes in the number of applications for introductory lessons that can be neither predicted nor fully served.

It is known that not every application converts into a payment. In 2019, the school introduced scoring: an algorithm that predicts the likelihood of a purchase based on the user's behavior on the site and historical data on users with similar behavior. If the forecast is "green", the user is sent to an introductory lesson; if "red", they are not.
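A minimal sketch of that green/red routing, assuming the scoring model outputs a purchase probability; the threshold here is a made-up placeholder, not the production value.

```python
# Hypothetical routing on top of the scoring model's output.
GREEN_THRESHOLD = 0.5  # placeholder cutoff, not the real production value

def route_application(purchase_probability: float) -> str:
    """Send 'green' leads to a teacher-led introductory lesson, skip 'red' ones."""
    if purchase_probability >= GREEN_THRESHOLD:
        return "introductory_lesson"  # "green" forecast
    return "no_lesson"                # "red" forecast
```

The "no_lesson" branch is exactly the gap the automatic lesson described below was built to fill.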

But we promise everyone a free lesson, including users with a "red" scoring label. For them, we came up with an inexpensive way to keep our promise: an automatic introductory lesson with no teacher, no manager calls, and no upsell after class.

The auto-lesson was launched as a side project; we did not plan to spend much time or money on it. So there was no talk of generated teacher images and the like, as in China :)

The product for cold traffic was a set of several slides:

  • a questionnaire to identify the student's interests, goals, and experience,
  • a quick test of English level,
  • an introduction to Vimbox, the platform on which classes are held,
  • a video with blogger Tanya Starikova from the Skyeng YouTube channel, who explains: "Today I am here, and tomorrow your real teacher will be",
  • a set of recommended courses and a payment showcase (just in case).

Tracking what happened to users who took the automatic introductory lesson, we saw payments. A local meme appeared in the team: "the auto-lesson is so good that it sells even to those who were never supposed to buy at all."

Since then, one thought would not let go of me: "What if this could work on hot users?"

A product that was not supposed to generate money generates money. Are we going to test further?

I really wanted to test the auto-lesson in a full-fledged A/B test on warm traffic.

But every experiment carries a risk of lost profit for its duration, and in our case the change was radical. How bad could conversion get if we replaced the tried-and-true path of managers and objection-handling teachers with a few slides and a blogger video? Moreover, the introductory lesson meets the user at the most crucial moment, when their first impression of the school and the product is formed.

We counted the risk in money and didn’t like it.

How we assessed the risks: an excerpt from the team checklist

Worst-case test group conversion = 0

Worst-case new users remaining = [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * (1 - [ Test group traffic, share ]) * [ Baseline conversion, share ]

Short-term loss in new users = [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * [ Baseline conversion, share ] - [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * (1 - [ Test group traffic, share ]) * [ Baseline conversion, share ] = [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * [ Baseline conversion, share ] * [ Test group traffic, share ]

Long-term loss in revenue = [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * [ Baseline conversion, share ] * [ Test group traffic, share ] * [ LTV per user, $ ]

Long-term loss in profit = [ Experiment duration, weeks ] * [ Incoming traffic, applications/week ] * [ Baseline conversion, share ] * [ Test group traffic, share ] * [ Profit per user, $ ]
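The same checklist expressed as code, a minimal sketch with made-up inputs purely for illustration:

```python
# Risk formulas from the checklist above; all inputs are illustrative.
def experiment_risk(duration_weeks: float, applications_per_week: float,
                    baseline_conversion: float, test_share: float,
                    ltv_per_user: float, profit_per_user: float) -> dict:
    """Worst case: conversion in the test group drops to zero."""
    total_applications = duration_weeks * applications_per_week
    expected_new_users = total_applications * baseline_conversion
    # New users we would still get from the untouched control share
    worst_case_users = total_applications * (1 - test_share) * baseline_conversion
    lost_users = expected_new_users - worst_case_users  # = total * conversion * share
    return {
        "lost_new_users": lost_users,
        "lost_revenue": lost_users * ltv_per_user,
        "lost_profit": lost_users * profit_per_user,
    }

print(experiment_risk(duration_weeks=3, applications_per_week=1000,
                      baseline_conversion=0.10, test_share=0.2,
                      ltv_per_user=500, profit_per_user=100))
# {'lost_new_users': 60.0, 'lost_revenue': 30000.0, 'lost_profit': 6000.0}
```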

In addition, we ask ourselves the following questions:

  • Will the new experience affect the decision to take the target action?
  • What value are we taking away from the user experience, and what value will we compensate with?
  • Does the tested option match our quality of service?
  • How does our piece of the product fit the user's expectations formed at other stages?
  • How much can we earn if everything works out?

We came up with a safety net in the design: send the test group to the automatic lesson, but if no payment occurs, pick the user up, that is, call and offer a regular lesson with a teacher.

This way, we could measure conversion to first payment in the automatic lesson within a short window (before the pickup) and compare it with the control group over an identical window. At the same time, the main risk of zero conversion caused by radically replacing the teacher with a few slides and a video disappeared. This experimental design seemed acceptable.
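A sketch of that safety net under assumed names; the length of the pickup window is a placeholder, since the article does not state it.

```python
# Safety-net flow for test-group users; names and window length are assumed.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PICKUP_WINDOW = timedelta(days=3)  # placeholder, not the actual window

@dataclass
class TestGroupUser:
    auto_lesson_at: datetime           # when the automatic lesson was taken
    paid_at: Optional[datetime] = None

def next_step(user: TestGroupUser, now: datetime) -> str:
    """Decide what happens to a test-group user after the automatic lesson."""
    if user.paid_at is not None:
        return "converted_in_auto_lesson"    # counts toward the short-window metric
    if now - user.auto_lesson_at >= PICKUP_WINDOW:
        return "pickup_call_regular_lesson"  # fall back to a teacher-led lesson
    return "wait_for_payment"
```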

This design carried risks of its own, but far less significant ones.

We lengthened and complicated the user's path to payment by adding an auto-lesson before the regular one. Our expert judgment was that this might depress conversion compared to the control group. But we had removed the main risk, so we launched the test.

Week two of the experiment, and the product manager writes to me: "Something is wrong with the metrics"

The experiment was designed for 3 weeks of recruiting new users plus 14 days for the conversion windows to close before summing up the final results.

We monitor running experiments on dashboards to see whether the key metric, conversion to first payment, has dropped.

My first thought when the product manager wrote about the current metrics: "Everything has collapsed, we'll have to stop the test." But the concern was caused not by a drop but, on the contrary, by stable growth of the key metric in the test group compared to the control. We had not anticipated the possibility of an increase in application-to-payment conversion, and it really looked strange.

What does an analyst do in such cases? Recheck the dashboard, the data source it is built on, the honesty of the split, and everything else that can be checked. We checked. Everything was correct. Since these were preliminary results and we were only halfway through the test, we didn't celebrate: we chalked it up to chance and kept monitoring.

Every remaining day we expected the trend to reverse. It never did.

The experiment is over. Final result: the test group outperforms the control group on the department's key metric, application-to-payment conversion, by 2.13 percentage points.

The graph shows the bootstrap histogram and the cumulative p-value, also computed via bootstrap.

The result is not accidental; we double-checked.

How we summarize the results of A/B tests and rule out randomness:

  • We randomize the traffic split (if we find the randomization is skewed, we normalize it manually)
  • We check the honesty of the traffic split: we look at the uniformity of the distribution of acquisition channels and geography using Cochran's criterion and a chi-square test
  • We break the funnel into sub-stages to localize the effect
  • We use bootstrap to estimate confidence intervals instead of relying on a normal distribution (see the sketch after this list)
  • Finally, we use the cumulative p-value to check the stability of the results over time.
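A sketch of the split check and the bootstrap estimate from this list, using numpy/scipy on made-up data (the counts and conversion rates are illustrative, not ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# --- Split honesty: channel distribution should match between groups ---
# Rows: control and test groups; columns: acquisition channels (toy counts).
channels = np.array([[400, 350, 250],
                     [410, 340, 250]])
chi2, p_split, *_ = stats.chi2_contingency(channels)
print(f"split check p-value: {p_split:.3f}")  # a large p-value: no evidence of skew

# --- Bootstrap CI for the uplift, without assuming a normal distribution ---
control = rng.binomial(1, 0.10, size=5000)  # toy binary conversions per user
test = rng.binomial(1, 0.12, size=5000)

diffs = np.array([
    rng.choice(test, size=test.size, replace=True).mean()
    - rng.choice(control, size=control.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the uplift: [{lo:.4f}, {hi:.4f}]")  # excludes 0 => significant
```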

What do 2 percentage points mean for Europe's largest online school? We ran the numbers for 2019: knowing the school's growth rate, profit per user, and the confidence intervals, we got tens of millions in additional profit per year.

The last question we needed to answer: at which sub-stage of the funnel did the key metric grow?

Application-to-payment conversion consists of two stages: application-to-lesson and lesson-to-payment. If the growth in end-to-end conversion had come from the application-to-lesson stage, it would mean we had increased spending on teachers; in that case, to decide on rolling out the experiment, we would have to compare not only group conversions but also CSC. In our case, however, the growth in end-to-end conversion came from the lesson-to-payment stage. That meant we brought the same number of students to the lesson and kept teacher costs the same, but users who reached a regular lesson after the auto-lesson bought more often.
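To make the decomposition concrete, here is a toy calculation; the volumes are made up, and only the 2.13 percentage point end-to-end gap mirrors the actual result:

```python
# Toy funnel decomposition; application and lesson counts are invented.
def stage_conversions(applications: int, lessons: int, payments: int):
    app_to_lesson = lessons / applications
    lesson_to_pay = payments / lessons
    end_to_end = payments / applications  # == app_to_lesson * lesson_to_pay
    return app_to_lesson, lesson_to_pay, end_to_end

control = stage_conversions(applications=10_000, lessons=6_000, payments=600)
test = stage_conversions(applications=10_000, lessons=6_000, payments=813)

for name, c, t in zip(("app->lesson", "lesson->pay", "end-to-end"), control, test):
    print(f"{name}: control {c:.4f} vs test {t:.4f}")
# app->lesson is identical, so the entire uplift sits in lesson->pay:
# teacher costs stay flat while end-to-end conversion grows by 2.13 pp.
```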

Conversions within the short window (purchases made in the automatic lesson itself), however, did not perform as well.

So, in the end, did soulless automation win?

The automatic introductory lesson was not good enough to completely eliminate the need for teachers at the sales stage. But it turned out to be a great tool for warming users up before a purchase. We built it into onboarding and tested it as part of another experiment: booking an introductory lesson on your own, without a call from an operator. That experiment also proved successful and was rolled out to all traffic.


I am very glad that we decided to test a hypothesis that seemed risky.

If there is something in your team's backlog that seems "not great" but leaves you with the feeling of "what if it works", test it. It might be exactly the thing that brings value and money, and breaks a couple of your stereotypes. Take a chance!
