How to choose the right method for each type of metric and sample size

Hi all! I’m Vanya Kastornov, a product analyst at Liga Stavok. Everyone around is talking about the need for A/B testing, yet tests are often run without checking statistical significance, or people simply don’t know how to run them at all. This won’t be an abstract explanation of why and how to do A/B tests; instead I’ll try to clear the fog around which statistical methods to use in which situations, and give example Python scripts you can use right away.

Who is this article for: analysts, product managers, or anyone planning to run an A/B test themselves. How to read it? I tried to keep a logical sequence while writing, but really this is more of a cheat sheet: any time you need to run an A/B test, you can find the case that fits and use it; there is a link to a summary diagram at the end.

A few words and let’s get started. While writing this article I tried to avoid heavy scientific terminology so as not to scare you away, but some of it slipped in, sorry about that …)) Well, let’s begin …

We have two samples that we obtained at random. The first thing we need to do before starting an experiment is to define the metric (or metrics) on which we will base our decision: the one the hypothesis itself is built around. The choice of statistical method also depends on the type of metric we are looking at. We will divide metrics into three types:

  1. Quantitative (number of bids or orders)

  2. Proportional or qualitative (conversion to purchase, CTR)

  3. Ratios (profitability, or the ratio of opened event cards to the number of purchases)

Each type of metric calls for its own approach. Since the largest share of A/B tests is run on proportion metrics, we will start with them; they are also the simplest.
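To make the distinction concrete, here is a small sketch (the event log and column names are hypothetical) computing one metric of each type with pandas:

```python
import pandas as pd

# hypothetical event log: one row per user session
events = pd.DataFrame({
    'user_id':      [1, 1, 2, 3, 3, 3],
    'bets':         [2, 0, 1, 4, 0, 1],   # quantitative: number of bets
    'purchased':    [1, 0, 0, 1, 0, 1],   # qualitative: did the session convert
    'cards_opened': [5, 2, 3, 6, 1, 2],
})

# 1. quantitative metric: average number of bets per user
bets_per_user = events.groupby('user_id')['bets'].sum().mean()

# 2. proportion metric: share of sessions that converted
conversion = events['purchased'].mean()

# 3. ratio metric: purchases per opened card
purchases_per_card = events['purchased'].sum() / events['cards_opened'].sum()

print(bets_per_user, conversion, purchases_per_card)
```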

Proportions (qualitative)

It’s simple: the best method to test the statistical significance of a change in proportions is the chi-square test. But before running it, it is worth checking that the test group is large enough for the minimum change you want to detect as statistically significant. This can be done using one of the tools on the web, for example, this, or using the statsmodels.stats library in Python; here is a sample script:

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import TTestIndPower

baseline_cr = 0.2  # baseline conversion rate
min_effect = 0.05  # minimum detectable change

effect_size = proportion_effectsize(baseline_cr, baseline_cr + min_effect)

alpha = 0.05  # significance level
power = 0.8   # power level
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(effect_size, power=power, alpha=alpha, alternative="two-sided")

print(f"Required sample size per group: {sample_size:.0f}")

In both options, you need to specify the baseline conversion rate, the minimum detectable change, the significance level alpha (the fraction of times a difference is detected when there is none), and the power level (the fraction of times the minimum change is detected when it does exist).

Having found the required sample size, we can proceed to the test itself.

Again, the internet is full of sites with online calculators that will let you do this, for example, this. If you prefer to test in Python, here is an example script.

import numpy as np
import scipy.stats as stats

# Load the data: [successes, failures] for each group
group_A = [50, 100]
group_B = [60, 90]

# Run the test
chi2, p, dof, ex = stats.chi2_contingency([group_A, group_B], correction=False)

# Confidence interval for the change (here the lift is the odds ratio,
# so the interval is built on the log scale and then exponentiated)
lift = (group_B[0]/group_B[1])/(group_A[0]/group_A[1])
std_error = np.sqrt(1/group_B[0] + 1/group_B[1] + 1/group_A[0] + 1/group_A[1])
log_ci = stats.norm.interval(0.95, loc=np.log(lift), scale=std_error)
ci = (np.exp(log_ci[0]), np.exp(log_ci[1]))

# Print the results
print("Chi-square p-value: ", p)
print("Lift confidence interval: ", ci)

# Check whether there is a change
if p < 0.05 and ci[0] > 1:
    print("The variant is better.")
else:
    print("No difference.")


The second most popular A/B tests are run on quantitative metrics; examples include the number of bets, session length, and so on. When analyzing quantitative metrics, it is important to choose the right method for testing statistical significance. Since we chose randomized splitting, we will most likely face high variance in the data, which hurts our ability to detect significant results, so I recommend using CUPED (Controlled-experiment Using Pre-Experiment Data), a statistical method increasingly used in A/B testing to improve the accuracy of results.

This way we can smooth out variance coming from seasonality, marketing promotions, and other factors. You can read more about the method here or in the original article.

To apply CUPED, we need a data set with each user’s metric value over the pre-test period and its value during the test. The ambrosia library offers a convenient ready-made transformer for this, and the adjustment itself is only a couple of lines; here is an example script:

import pandas as pd

# Load the data: one row per user, with the metric value during the test
# ('experiment_data') and over the pre-test period ('pre_experiment_data')
data = pd.read_csv('experiment_data.csv')

# Manual CUPED adjustment (the ambrosia library provides an equivalent transformer)
covariate = data['pre_experiment_data']
outcome = data['experiment_data']

theta = outcome.cov(covariate) / covariate.var()
data['experiment_data'] = outcome - theta * (covariate - covariate.mean())

print(f"theta: {theta:.3f}")

# Save to csv
data.to_csv('adjusted_experiment_data.csv', index=False)

Once we have the data, it is worth estimating the sample size; there are two options.
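As with proportions, the required sample size for a quantitative metric can be estimated with TTestIndPower, this time with a standardized effect size (Cohen’s d); the baseline numbers below are hypothetical:

```python
from statsmodels.stats.power import TTestIndPower

# hypothetical baseline values for the metric
baseline_std = 4.0  # standard deviation of the metric
min_effect = 0.5    # minimum change we want to detect

# standardized effect size (Cohen's d)
effect_size = min_effect / baseline_std

sample_size = TTestIndPower().solve_power(
    effect_size=effect_size, power=0.8, alpha=0.05, alternative="two-sided"
)
print(f"Required sample size per group: {sample_size:.0f}")
```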

If the sample is large…

Then you need to check whether the data is normally distributed. In general it’s simple: the script below, which uses the Shapiro-Wilk test, will work in most cases:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import shapiro, norm

data = pd.read_csv('adjusted_experiment_data.csv')['experiment_data']

# Apply the Shapiro-Wilk test
stat, p = shapiro(data)

alpha = 0.05
if p > alpha:
    print("The distribution is normal.")
else:
    print("The distribution is not normal.")

# Plot the distribution
fig, ax = plt.subplots()
ax.hist(data, bins=5, density=True, alpha=0.5, label="Data")

# Fit a normal distribution for comparison
mu, std = norm.fit(data)
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)
pdf = norm.pdf(x, mu, std)
ax.plot(x, pdf, 'k', linewidth=2, label="Normal distribution")
ax.legend()
plt.show()

So, the distribution turned out to be normal. What’s next?

Next, we can run the most widely used method for testing the null hypothesis of no difference: the t-test. The best known is Student’s t-test, and it will do the job; if the variances of the groups are unequal, Welch’s t-test is better. In fact, you can always use Welch’s test without making an error, though you may get a slightly lower significance level. Both tests are easy to run with the scipy.stats library; here are sample scripts:

import pandas as pd
import scipy.stats as stats

data = pd.read_csv('adjusted_experiment_data.csv')

control = data[data['group_type'] == 'control']['experiment_data']
test = data[data['group_type'] == 'test']['experiment_data']

# Welch's t-test
welch_t, welch_p = stats.ttest_ind(control, test, equal_var=False)

# Student's t-test
student_t, student_p = stats.ttest_ind(control, test, equal_var=True)

print("Welch's t-test:")
print("t-statistic: ", welch_t)
print("p-value: ", welch_p)

print("\nStudent's t-test:")
print("t-statistic: ", student_t)
print("p-value: ", student_p)

But it also happens that the data does not follow a normal distribution. What then?

In this case, the Mann-Whitney U-test is the best fit. Unlike parametric tests such as Student’s or Welch’s t-test, the Mann-Whitney U-test makes no assumptions about the shape of the underlying distribution. To run it, we can turn to the same scipy.stats library:

# control and test are the series defined in the t-test script above
u, p = stats.mannwhitneyu(control, test, alternative="two-sided")

print("Mann-Whitney U-test:")
print("U-statistic: ", u)
print("p-value: ", p)

We have covered what to do when the sample size is decent, but sometimes there is very little data; then bootstrap comes to the rescue. The main idea of bootstrap is to repeatedly resample, with replacement, from the original data to create a large number of “samples”. Each resample is similar to the original sample but has slightly different values due to the random replacement. More details can be found in this article, and an example script looks like this:

import numpy as np

# control and test are the arrays defined in the previous scripts

# Set the number of bootstrap samples
n_bootstrap = 10000

# Generate samples and compute the differences in means
bootstrap_diff = []
for i in range(n_bootstrap):
    control_sample = np.random.choice(control, size=len(control), replace=True)
    test_sample = np.random.choice(test, size=len(test), replace=True)
    bootstrap_diff.append(np.mean(test_sample) - np.mean(control_sample))
bootstrap_diff = np.array(bootstrap_diff)

# Compute the confidence interval
ci = np.percentile(bootstrap_diff, [2.5, 97.5])

# Two-sided bootstrap p-value: the share of resampled differences
# falling on the other side of zero
p = 2 * min((bootstrap_diff <= 0).mean(), (bootstrap_diff >= 0).mean())

print("Bootstrap mean difference 95% CI: ({:.2f}, {:.2f})".format(ci[0], ci[1]))
print("p-value: ", p)

Phew, that seems to be everything for quantitative metrics …


A ratio contains two numbers: most often the numerator is the number of actions performed and the denominator is the number of users, for example, the number of bets per user who opened event cards, or the number of opened cards per number of views. Sometimes the hypothesis is precisely about moving such a metric, so here we will focus on two options for testing statistical significance, depending on the sample size.
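To make the structure of a ratio metric concrete, here is a small sketch (the data and column names are hypothetical) computing the metric per group as a ratio of sums, rather than as a mean of per-user ratios:

```python
import pandas as pd

# hypothetical user-level data: per-user numerator and denominator of the ratio
data = pd.DataFrame({
    'group':        ['control', 'control', 'test', 'test'],
    'bets':         [3, 1, 4, 2],    # numerator: bets placed
    'cards_opened': [10, 5, 8, 6],   # denominator: event cards opened
})

# ratio metric per group: sum(numerator) / sum(denominator)
sums = data.groupby('group')[['bets', 'cards_opened']].sum()
ratio = sums['bets'] / sums['cards_opened']
print(ratio)
```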

First of all, I again recommend reducing the variance in the data. CUPED will not work here, since it reduces noise by controlling for a single variable strongly correlated with the variable of interest. If the change in the studied variable is driven by the treatment itself (e.g. showing a training banner or granting a free bet), CUPED cannot be used, because there are no other factors to control for. In this case, it is better to use the difference-in-differences (diff-in-diff) method, which accounts for other factors affecting the results and isolates the effect of the treatment on the variable under study.

To use diff-in-diff, you can use the script below. The data should be structured so that each row corresponds to one test participant and the columns represent the variables, for example: participant ID, A/B test group, time period, dependent variable:

import pandas as pd

# load the data from a csv file into a DataFrame
data = pd.read_csv('data.csv')

# filter the data by group and time period
control = data[(data['group'] == 'control') & (data['time'] == 'before')]
treatment = data[(data['group'] == 'treatment') & (data['time'] == 'before')]

# mean of the dependent variable for the control and treatment groups before the treatment
control_before = control['dependent_variable'].mean()
treatment_before = treatment['dependent_variable'].mean()

# filter the data by the period after the treatment
control = data[(data['group'] == 'control') & (data['time'] == 'after')]
treatment = data[(data['group'] == 'treatment') & (data['time'] == 'after')]

# mean of the dependent variable for the control and treatment groups after the treatment
control_after = control['dependent_variable'].mean()
treatment_after = treatment['dependent_variable'].mean()

# difference between the means for each group before and after the treatment
control_diff = control_after - control_before
treatment_diff = treatment_after - treatment_before

# estimate of the treatment effect using the Diff-in-Diff method
diff_in_diff = treatment_diff - control_diff

# save the transformed data
output_data = pd.DataFrame({'group': ['control', 'treatment'], 'before': [control_before, treatment_before], 'after': [control_after, treatment_after], 'diff': [control_diff, treatment_diff]})
output_data.to_csv('output.csv', index=False)

So, the data is in order; let’s move on to testing.

Large sample

Let’s assume the sample is impressively large; then we only need a t-test, but first it is worth looking at the variance again. Say our metric is CTR: clicks and views are both random variables, and when we combine them into a single CTR metric, they have a joint distribution. Moreover, if randomization is done by user_id, one user can generate multiple views, so the views are not independent of each other. It is better to use the delta method to approximate the variance of the ratio and then run a t-test; the scipy.stats library comes to the rescue. The script might look like this.

import pandas as pd
import numpy as np
import scipy.stats as stats

# user-level data: one row per user with the numerator (e.g. clicks)
# and denominator (e.g. views) of the ratio; column names are assumed
data = pd.read_csv('data.csv')

def delta_var(num, den):
    # delta-method approximation of the variance of sum(num) / sum(den)
    mu_n, mu_d = num.mean(), den.mean()
    var_n, var_d = num.var(), den.var()
    cov_nd = num.cov(den)
    ratio_var = (var_n / mu_d**2
                 - 2 * mu_n * cov_nd / mu_d**3
                 + mu_n**2 * var_d / mu_d**4)
    return ratio_var / len(num)

control = data[data['group'] == 'control']
treatment = data[data['group'] == 'treatment']

# ratio metric per group
r_control = control['numerator'].sum() / control['denominator'].sum()
r_treatment = treatment['numerator'].sum() / treatment['denominator'].sum()

# standard error of the difference between the two ratios
se_diff = np.sqrt(delta_var(control['numerator'], control['denominator'])
                  + delta_var(treatment['numerator'], treatment['denominator']))

# t-statistic and p-value using the delta-method standard error
t_stat = (r_treatment - r_control) / se_diff
df = len(data) - 2
p_val = stats.t.sf(abs(t_stat), df) * 2

# print the results
print('Delta-method t-test:')
print(f"Standard Error: {se_diff}")
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_val}")

Small sample

And finally, the last method we will consider, to be applied when the sample size is small. Bootstrap is relevant here too, but since we are analyzing a ratio metric, observations in the data may be dependent, so block bootstrap is a better fit. In ordinary bootstrap, data is drawn from the original sample with replacement; in block bootstrap, data is drawn in groups (blocks). This is necessary because we want to preserve the dependency structure in the data. At the end, we run the t-test we already know.

The script is similar to regular bootstrap:

import pandas as pd
import numpy as np
import scipy.stats as stats

# user-level data as in the diff-in-diff script: group, time, dependent_variable
data = pd.read_csv('data.csv')
after = data[data['time'] == 'after']

control = after[after['group'] == 'control']['dependent_variable'].values
treatment = after[after['group'] == 'treatment']['dependent_variable'].values

observed_effect = treatment.mean() - control.mean()

# block bootstrap parameters: resampling whole blocks preserves
# the dependency structure within a block
n_bootstrap = 1000
block_size = 10  # tune to the expected dependency structure

def block_resample(x, block_size):
    # draw random starting points and glue contiguous blocks together
    n_blocks = int(np.ceil(len(x) / block_size))
    starts = np.random.randint(0, len(x) - block_size + 1, size=n_blocks)
    return np.concatenate([x[s:s + block_size] for s in starts])[:len(x)]

# generate bootstrap estimates of the effect
bootstrap_effects = []
for _ in range(n_bootstrap):
    boot_control = block_resample(control, block_size)
    boot_treatment = block_resample(treatment, block_size)
    bootstrap_effects.append(boot_treatment.mean() - boot_control.mean())

# bootstrap standard error
bootstrap_std_err = np.std(bootstrap_effects)

# run the t-test
t_statistic = observed_effect / bootstrap_std_err
p_value = stats.t.sf(np.abs(t_statistic), len(after) - 2) * 2

print("Observed treatment effect: ", observed_effect)
print("Bootstrap estimate of standard error: ", bootstrap_std_err)
print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

Power Check

An important step after running statistical significance tests is the power check. An underpowered experiment will suffer from a high false negative rate (Type II errors). The effect size is usually the difference in means between the control and test groups divided by the standard deviation. To check the power after receiving the results, you can use TTestIndPower from the statsmodels library; the script might look like this:

import pandas as pd
from statsmodels.stats.power import TTestIndPower

# load the user-level data from the CSV file
data = pd.read_csv('data.csv')
after = data[data['time'] == 'after']

# sample sizes based on the number of unique participant IDs
n1 = after[after['group'] == 'control']['id'].nunique()
n2 = after[after['group'] == 'treatment']['id'].nunique()

# standardized effect size based on the observed effect
observed_effect = (after[after['group'] == 'treatment']['dependent_variable'].mean()
                   - after[after['group'] == 'control']['dependent_variable'].mean())
effect_size = observed_effect / after['dependent_variable'].std()

# test parameters
alpha = 0.05        # significance level
target_power = 0.8  # desired power

# sample size needed for the desired power, and the power actually achieved
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(effect_size=effect_size, alpha=alpha, power=target_power, ratio=n2/n1)
achieved_power = power_analysis.power(effect_size=effect_size, nobs1=n1, alpha=alpha, ratio=n2/n1)

print("Sample size required: ", sample_size)
print("Power of the test: ", achieved_power)


In summary, to conduct a successful A/B test, you need to choose an analysis method that is appropriate for specific metrics and sample size, as well as take into account all the factors that can affect the results. It is important to monitor the power of the experiment after it has been run to ensure that the results are meaningful.

For easier understanding, I added a general scheme to Miro, which will act as a navigator for choosing a statistical method for your test.

I hope this article will help you with your A/B tests and increase the value of your decisions, which will positively affect your product and personal progress.

Subscribe to my telegram channel, where I talk about product development, analytics, and hedonism, add this article to your favorites, and see you soon!
