How not to lose data reliability

Statistical foundations and consequences of peeking

The main statistical tool in A/B testing is the p-value, which is used to determine the statistical significance of differences between groups. In an ideal experiment, the p-value is calculated once, after the test has reached a predetermined sample size. With peeking, however, results are checked multiple times while the data are still being collected, which inflates the likelihood of falsely declaring statistical significance where none exists. This phenomenon is known as inflation of the Type I error rate.

Peeking can significantly inflate the false-positive rate, because every new "look" at the data adds another chance of stumbling upon a temporary "significant" difference that disappears as the test continues. For example, if the analyst checks the data twice, the chance of a false positive can roughly double; with five checks it can grow about 3.2 times, and with 10,000 checks, about twelve times.
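This inflation is easy to demonstrate by simulation. The sketch below runs many A/A tests (both groups have the same click probability, so any "significant" result is a false positive) and checks a standard two-proportion z-test at ten interim looks; all parameters are illustrative assumptions:

```python
import numpy as np

# simulated A/A tests: no true difference, so every rejection is a false positive
rng = np.random.default_rng(0)
n_experiments = 2000      # number of simulated experiments
n_per_group = 1000        # final sample size per group
n_peeks = 10              # number of interim looks at the data
z_crit = 1.96             # nominal two-sided 5% threshold, applied at every look

false_positives = 0
for _ in range(n_experiments):
    a = rng.binomial(1, 0.5, n_per_group)
    b = rng.binomial(1, 0.5, n_per_group)
    for i in range(1, n_peeks + 1):
        n = n_per_group * i // n_peeks        # cumulative sample size at this look
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
            false_positives += 1              # declared "significant" at some peek
            break

print(f"False-positive rate with {n_peeks} peeks: "
      f"{false_positives / n_experiments:.3f}")
```

With ten looks, the observed false-positive rate ends up several times higher than the nominal 5%, even though each individual check uses the usual threshold.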

In my experience, even with correction methods such as Bonferroni, which are designed to control for multiple comparisons and reduce false positives, peeking can still lead to errors. These methods tighten the criteria for statistical significance, but even with them, incorrect application of the testing procedure can lead to incorrect conclusions.
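For reference, the Bonferroni idea applied to peeking is simply to test each of the k planned looks at a significance level of alpha / k. A minimal sketch of what that does to the per-look threshold (k = 10 is an assumed number of looks):

```python
from scipy.stats import norm

# Bonferroni correction for k interim looks:
# each look is tested at alpha / k instead of alpha
alpha, k = 0.05, 10
alpha_per_look = alpha / k
z_crit = norm.ppf(1 - alpha_per_look / 2)   # two-sided per-look critical z

print(f"Per-look alpha: {alpha_per_look}, critical z: {z_crit:.3f}")
# versus 1.96 for a single fixed-sample test: each peek must clear a higher bar
```

The correction is known to be conservative for sequential looks, since the looks share data and are therefore positively correlated, which is one reason purpose-built sequential methods are preferred.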

It is important to understand that the individual checks are not independent of one another: each look shares most of its data with the previous ones. As a result, the more often the data are checked, the greater the likelihood of finding spurious differences that arise from chance rather than from a true effect.

Solution

Sequential testing

Sequential testing involves checking the data gradually, using statistical methods specifically designed to adjust significance thresholds at each stage of the analysis. An example is the group sequential testing method, which adjusts critical values and confidence intervals according to the number of analyses performed.

The method is typically used to control the Type I error rate when the data are analysed several times over the course of an experiment. A distinctive feature of sequential testing is the ability to stop the test as soon as pre-set boundaries are crossed, which allows conclusions to be drawn faster and at lower cost than with fixed-sample A/B tests.

Let me give you an example. statsmodels does not ship a ready-made group sequential design class, so here is a minimal self-contained sketch of a Pocock-style group sequential test on NumPy (the Pocock boundary uses the same critical z-value at every look; the value of about 2.555 for ten looks at two-sided alpha = 0.05 comes from standard Pocock tables):

import numpy as np

# generate random data: binary outcomes with success probability 0.5
np.random.seed(42)
p0 = 0.5                          # success probability under the null hypothesis
data = np.random.binomial(1, 0.5, 300)

# sequential design settings
alpha = 0.05                      # overall significance level
k = 10                            # number of planned interim looks
# Pocock boundary: the same critical z-value is applied at every look;
# for k = 10 looks at two-sided alpha = 0.05 it is approximately 2.555
z_crit = 2.555

# testing process: analyse the cumulative data at each look
results = []
for i in range(1, k + 1):
    cum_data = data[:30 * i]      # cumulative data at this step
    n = len(cum_data)
    p_hat = cum_data.mean()
    # one-sample z-statistic for H0: p = p0
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    stop = abs(z) > z_crit
    results.append({"look": i, "n": n, "z": z, "stop": stop})
    if stop:                      # boundary crossed: stop the test early
        break

The data are simulated as the results of a binary trial, where 0.5 is the probability of success. The sequential analysis proceeds in ten steps, and at each step the procedure checks whether the stopping criterion has been reached. If it has, the test stops.
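A common alternative to the constant Pocock boundary is an alpha-spending approach, which spends almost none of the error budget at early looks and requires overwhelming evidence to stop early. A sketch of the Lan-DeMets O'Brien-Fleming-type spending function, alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), where t is the fraction of the planned information collected so far (the ten-look schedule here is an illustrative assumption):

```python
import numpy as np
from scipy.stats import norm

# O'Brien-Fleming-type alpha-spending function (Lan-DeMets):
# cumulative alpha spent by information fraction t
alpha = 0.05
z_half = norm.ppf(1 - alpha / 2)
information_fraction = np.linspace(0.1, 1.0, 10)  # ten equally spaced looks

spent = 2 * (1 - norm.cdf(z_half / np.sqrt(information_fraction)))
for t, a in zip(information_fraction, spent):
    print(f"t = {t:.1f}  cumulative alpha spent = {a:.5f}")
# by t = 1.0 the full alpha = 0.05 has been spent
```

The practical effect is that early stops happen only for very large effects, while the final look retains nearly the full nominal significance level.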

Bayesian approach

In the Bayesian approach to A/B testing, a prior probability distribution is used, which is updated as new data arrive, yielding a posterior distribution. This process is based on Bayes' theorem, which allows continuous integration of new information about user behavior and the effectiveness of interventions.

The main feature of the Bayesian approach is its flexibility and its ability to handle peeking gracefully. In frequentist methods, peeking can inflate the Type I error rate because each look at the data is treated as a separate test. In the Bayesian approach, by contrast, each new look simply updates the posterior probabilities, which helps avoid the cumulative effect of multiple comparisons.

Let's say there are two versions of a website, A and B, and we want to test whether version B increases the number of clicks compared to version A. We start from the prior assumption that both versions are equally likely to be effective or ineffective:

import numpy as np

# function for updating posterior probabilities
def bayes_update(prior, likelihood):
    unnormalized_posterior = prior * likelihood
    return unnormalized_posterior / unnormalized_posterior.sum()

# generate data: 1 = click, 0 = no click
# group A
clicks_a = np.random.binomial(1, 0.3, 500)
# group B
clicks_b = np.random.binomial(1, 0.35, 500)

# prior probabilities: [P(ineffective), P(effective)]
priors = np.array([0.5, 0.5])

# likelihoods of the observed data under each hypothesis
likelihoods_a = np.array([1 - clicks_a.mean(), clicks_a.mean()])
likelihoods_b = np.array([1 - clicks_b.mean(), clicks_b.mean()])

# update the prior probabilities for group A
posterior_a = bayes_update(priors, likelihoods_a)
# update the prior probabilities for group B
posterior_b = bayes_update(priors, likelihoods_b)

print(f"Posterior probabilities for group A: {posterior_a}")
print(f"Posterior probabilities for group B: {posterior_b}")

Example output (the exact values vary from run to run, since no random seed is set):

Posterior probabilities for group A: [0.72 0.28]
Posterior probabilities for group B: [0.622 0.378]

We generated click data for groups A and B from a binomial distribution and started from equal prior probabilities that each version of the website is effective or ineffective. We then computed the likelihood of the observed data under each hypothesis and used the bayes_update function to turn the priors into posterior probabilities.
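The toy model above compares only two discrete hypotheses. A more standard Bayesian treatment of click-through rates, sketched below under the same simulated data, places a Beta prior on each rate: because the Beta distribution is conjugate to the binomial, the posterior is again a Beta and can be re-examined at any time as data accumulate. The uniform Beta(1, 1) priors and the Monte Carlo sample size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
clicks_a = rng.binomial(1, 0.30, 500)   # group A, true CTR 0.30
clicks_b = rng.binomial(1, 0.35, 500)   # group B, true CTR 0.35

# Beta(1, 1) priors (uniform); the conjugate posterior is
# Beta(1 + successes, 1 + failures)
post_a = (1 + clicks_a.sum(), 1 + len(clicks_a) - clicks_a.sum())
post_b = (1 + clicks_b.sum(), 1 + len(clicks_b) - clicks_b.sum())

# Monte Carlo estimate of P(CTR_B > CTR_A) under the posteriors
samples_a = rng.beta(*post_a, 100_000)
samples_b = rng.beta(*post_b, 100_000)
prob_b_better = (samples_b > samples_a).mean()

print(f"P(B > A) = {prob_b_better:.3f}")
```

The quantity P(B > A) can be reported directly to stakeholders and recomputed after every batch of data without the threshold adjustments that frequentist peeking requires.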


Thus, properly understanding and managing the peeking problem leads to more accurate data-driven decision making.

A/B tests are just one of the tools used in analytics. OTUS experts cover more such tools in practical online courses. More details in the catalogue.
