
My name is Kolya, I work as a data analyst at X5 Tech. Sasha and I are continuing our series of articles on A/B testing. Previous articles can be found in our profile descriptions.

Many hypotheses

When conducting experiments, sometimes it becomes necessary to test several hypotheses at once. For example, you have generated 100 images for an advertising mailing using neural networks, and you need to check which of them increase the average revenue per client. When testing one hypothesis, the probability of a type I error is equal to the significance level \alpha. As the number of hypotheses increases, if there is actually no effect, the probability of getting at least one type I error is expressed by the formula:

\mathbb{P}(FP > 0) = 1 - (1 - \alpha)^m

where FP is the number of false positive results (type I errors) and m is the number of hypotheses.

Below is a graph of this function for \alpha = 0.05. When testing 100 hypotheses, the probability of getting at least one type I error is close to one.
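
A quick check of the formula (a minimal sketch, assuming independent tests):

alpha = 0.05
for m in [1, 10, 100]:
    print(m, 1 - (1 - alpha) ** m)  # 0.05 for m=1, ~0.40 for m=10, ~0.994 for m=100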

In addition, a large number of hypotheses without an effect may prevent us from detecting an effect where it actually exists. Suppose one hypothesis has an effect while the others do not, and the effect size equals the expected effect used to calculate the required group size. We pick the option with the largest point estimate of the effect. The graph below shows the probability of choosing the wrong hypothesis depending on the number of hypotheses:

If only one of 100 hypotheses has an effect, then with probability 0.24 one of the 99 no-effect hypotheses will have a larger point estimate of the effect than the hypothesis that actually has one.
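
This probability can be reproduced with a short Monte Carlo sketch. The assumption made explicit here: the effect of the single "true" hypothesis equals the minimum detectable effect for \alpha = 0.05 and \beta = 0.1, so its point estimate, measured in units of the standard error of the difference in means, is approximately N(3.24, 1), while the no-effect estimates are N(0, 1).

import numpy as np
from scipy import stats

alpha, beta = 0.05, 0.1
# expected effect in units of the standard error of the difference in means
mde_in_se = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(1 - beta)  # ~3.24

def prob_wrong_choice(m, n_iter=100_000):
    """Probability that a no-effect hypothesis has the largest point estimate."""
    true_est = np.random.normal(mde_in_se, 1, n_iter)
    null_est = np.random.normal(0, 1, (n_iter, m - 1))
    return np.mean(null_est.max(axis=1) > true_est)

print(prob_wrong_choice(100))  # ~0.24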

Testing a large number of hypotheses increases the probability of errors. A simple way to reduce the probability of errors is to reduce the number of hypotheses being tested. In many situations, there is no need to test the entire range of possible options; it is enough to expertly select several of the best candidates. This will save resources on conducting experiments and reduce the probability of errors.

If two or more hypotheses do need to be tested, corrections for multiple testing should be applied; they allow the error probabilities to be controlled at the specified levels.

Metrics

Before we move on to adjusting for multiple testing, let's define the metrics we want to control. Let's introduce some notations:

  • m — the total number of hypotheses;

  • TN — the number of true negative results;

  • FN — the number of false negative results;

  • FP — the number of false positive results;

  • TP — the number of true positive results.

                        H_0 is true    H_0 is false    Total
We do not reject H_0    TN             FN              TN + FN
We reject H_0           FP             TP              FP + TP
Total                   TN + FP        FN + TP         m

To control for Type I errors in multiple testing, FWER or FDR are commonly used.

FWER (Family-wise error rate) is the probability that at least one of the tests gives a false positive result when the null hypotheses are true.

FWER = \mathbb{P}(FP > 0)

FDR (False discovery rate) — the average proportion of false positive results among all positive results.

FDR = \mathbb{E} \left( \frac{FP}{\max(FP + TP, ~ 1)} \right)

All other things being equal, decision algorithms that control FWER at a given level allow fewer false positives than algorithms that control FDR. Suppose 20 out of 100 hypotheses actually have an effect. To keep FDR at the 0.05 level, it is enough to select 20 hypotheses while allowing one false positive among them; FWER in this case equals 1. An algorithm that controls FWER will select fewer hypotheses, which reduces the number of false positives but increases the number of false negatives.
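
In numbers: with 19 true positives and 1 false positive among the 20 selected hypotheses,

\frac{FP}{FP + TP} = \frac{1}{20} = 0.05, \qquad \mathbb{P}(FP > 0) = 1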

We also need an analogue of power for multiple testing. We will use the average proportion of true positives among hypotheses with an effect.

POWER = \mathbb{E} \left( \frac{TP}{\max(TP+FN, ~ 1)} \right)

Corrections

There are many variants of corrections for multiple hypothesis testing. We will consider three of them: Bonferroni, Holm, and Benjamini-Hochberg.

The Bonferroni method proposes testing each hypothesis at a significance level of \alpha / m, where m is the number of hypotheses. This allows FWER to be controlled at a level not exceeding \alpha. The Bonferroni method is simple to implement but less powerful than the other corrections.

The Holm method also controls FWER at a level not exceeding \alpha. Its algorithm works as follows:

  1. Sort the p-values in ascending order: p_{(1)}, \ldots, p_{(m)}. Denote the null hypothesis corresponding to the value p_{(i)} as H_{(i)}.

  2. Define the significance levels \alpha_i = \alpha / (m - i + 1).

  3. Iterate over the p-values p_{(1)}, \ldots, p_{(m)} in order. If p_{(i)} < \alpha_i, reject H_{(i)} and move on to the next value. Otherwise, stop and conclude that the data do not contradict the hypotheses H_{(i)}, \ldots, H_{(m)}.

The Benjamini-Hochberg method controls FDR at a level not exceeding \alpha. It works in the same way as the Holm method, but uses the significance levels \alpha_i = \alpha \cdot i / m.
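
To make the difference between the corrections concrete, here is a small illustration of the per-hypothesis significance levels for m = 5 and \alpha = 0.05 (applied to the sorted p-values):

import numpy as np

m, alpha = 5, 0.05
print('bonferroni:        ', np.round(np.full(m, alpha / m), 4))            # [0.01 0.01 0.01 0.01 0.01]
print('holm:              ', np.round(alpha / np.arange(m, 0, -1), 4))      # [0.01 0.0125 0.0167 0.025 0.05]
print('benjamini_hochberg:', np.round(alpha * np.arange(1, m + 1) / m, 4))  # [0.01 0.02 0.03 0.04 0.05]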

Below are the implementations of these corrections. You can also use ready-made implementations, for example the function statsmodels.stats.multitest.multipletests.

from collections import defaultdict
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def method_without_correct(pvalues, alpha=0.05):
    """Проверяет значимость отличий без поправок.

    :param pvalues (np.array): массив pvalue
    :param alpha (float): уровень значимости

    :return res (np.array): массив из нулей и единиц:
        0 - значимых отличий нет, 1 - значимые отличия есть.
    """
    res = (pvalues <= alpha).astype(int)
    return res

def method_bonferroni(pvalues, alpha=0.05):
    """Применяет метод Бонферрони для проверки значимости изменений.

    :param pvalues (np.array): массив pvalue
    :param alpha (float): уровень значимости

    :return res (np.array): массив из нулей и единиц:
        0 - значимых отличий нет, 1 - значимые отличия есть.
    """
    m = len(pvalues)
    res = (pvalues <= alpha / m).astype(int)
    return res

def method_holm(pvalues, alpha=0.05):
    """Применяет метод Холма для проверки значимости изменений.
    
    :param pvalues (np.array): массив pvalue
    :param alpha (float): уровень значимости

    :return res (np.array): массив из нулей и единиц:
        0 - значимых отличий нет, 1 - значимые отличия есть.
    """
    m = len(pvalues)
    alphas = alpha / np.arange(m, 0, -1)
    sorted_pvalue_indexes = np.argsort(pvalues)
    res = np.zeros(m, dtype=int)
    for alpha_, pvalue_index in zip(alphas, sorted_pvalue_indexes):
        pvalue = pvalues[pvalue_index]
        if pvalue < alpha_:
            res[pvalue_index] = 1
        else:
            break
    return res

def method_benjamini_hochberg(pvalues, alpha=0.05):
    """Применяет метод Бенджамини-Хохберга для проверки значимости изменений.
    
    :param pvalues (np.array): массив pvalue
    :param alpha (float): уровень значимости

    :return res (np.array): массив из нулей и единиц:
        0 - значимых отличий нет, 1 - значимые отличия есть.
    """
    m = len(pvalues)
    alphas = alpha * np.arange(1, m+1) / m
    sorted_pvalue_indexes = np.argsort(pvalues)
    res = np.zeros(m, dtype=int)
    for alpha_, pvalue_index in zip(alphas, sorted_pvalue_indexes):
        pvalue = pvalues[pvalue_index]
        if pvalue <= alpha_:
            res[pvalue_index] = 1
        else:
            break
    return res
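
For comparison, here is a minimal sketch of how the same decisions can be obtained with the ready-made function mentioned above, statsmodels.stats.multitest.multipletests (the method names 'bonferroni', 'holm' and 'fdr_bh' correspond to the three corrections; the p-values are purely illustrative):

import numpy as np
from statsmodels.stats.multitest import multipletests

pvalues = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, pvalues_corrected, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(method, reject.astype(int))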

Numerical experiments

Let's run numerical experiments to see how the corrections behave under different conditions. We test 100 hypotheses about the equality of means using Student's t-test. The significance level is 0.05, the acceptable probability of a type II error is 0.1, and the expected effect is 276. The group size, computed with the formula for the required group size when testing a single hypothesis, is 100. The experiments cover situations where from 0 to 100 of the 100 hypotheses have an effect. The results with metric estimates for the different methods are shown in the graphs below.

def estimate_sample_size(effect, std, alpha, beta):
    """Оценка необходимого размер групп."""
    t_alpha = stats.norm.ppf(1 - alpha / 2, loc=0, scale=1)
    t_beta = stats.norm.ppf(1 - beta, loc=0, scale=1)
    var = 2 * std ** 2
    sample_size = int(np.ceil((t_alpha + t_beta) ** 2 * var / (effect ** 2)))
    return sample_size

def run(dict_methods, mean, std, sample_size, n_exp, n_iter=1000):
    """Проводит симуляцию экспериментов.
    
    :param dict_methods: словарь с методами множественного тестирования
    :param mean: среднее значение генерируемых данных
    :param std: стандартное отклонение генерируемых данных
    :param sample_size: размер групп
    :param n_exp: количество проверяемых гипотез
    :param n_iter: количество итераций симуляции

    :return method_2_result: результаты применения методов к данным симуляции
    """
    list_n_exp_with_effect = [0, 1] + list(range(10, n_exp, 10)) + [n_exp-1, n_exp]
    method_2_result = {
        method_name: defaultdict(list)
        for method_name in dict_methods
    }
    for _ in range(n_iter):
        a_values, b_values = np.random.normal(mean, std, (2, n_exp, sample_size))
        prev_n_exp_with_effect = 0
        for n_exp_with_effect in list_n_exp_with_effect:
            b_values[prev_n_exp_with_effect: n_exp_with_effect] += effect
            prev_n_exp_with_effect = n_exp_with_effect
            pvalues = stats.ttest_ind(a_values, b_values, axis=1).pvalue
            for method_name, method in dict_methods.items():
                res = method(pvalues)
                method_2_result[method_name][n_exp_with_effect].append({
                    'FN': sum(res[:n_exp_with_effect] == 0),
                    'TP': sum(res[:n_exp_with_effect] == 1),
                    'FP': sum(res[n_exp_with_effect:] == 1),
                    'TN': sum(res[n_exp_with_effect:] == 0),
                })
    return method_2_result

def plot_metrics(method_2_result, n_exp, alpha=0.05, beta=0.1):
    """Рисует графики."""
    method_2_metrics = {method_name: defaultdict(list) for method_name in method_2_result}
    markers = ['o', 'v', '<', '>', '^', 's']
    for method_name in method_2_result:
        list_n_exp_with_effect = sorted(method_2_result[method_name].keys())
        for n_exp_with_effect in list_n_exp_with_effect:
            results = method_2_result[method_name][n_exp_with_effect]
            fwer = np.mean([x['FP'] > 0 for x in results])
            fdr = np.mean([x['FP'] / max(1, x['FP'] + x['TP']) for x in results])
            power = np.mean([x['TP'] / max(1, x['FN'] + x['TP']) for x in results])
            method_2_metrics[method_name]['FWER'].append(fwer)
            method_2_metrics[method_name]['FDR'].append(fdr)
            method_2_metrics[method_name]['POWER'].append(power)
    for metric_name in ['FWER', 'FDR', 'POWER']:
        min_value = 1
        for idx, (method_name, metrics) in enumerate(method_2_metrics.items()):
            marker = markers[idx % len(markers)]
            if metric_name in ['FWER', 'FDR']:
                X = list_n_exp_with_effect[:1] + list_n_exp_with_effect[2: -1]
                Y = metrics[metric_name][:1] + metrics[metric_name][2: -1]
            elif metric_name == 'POWER':
                X = list_n_exp_with_effect[1: -2] + list_n_exp_with_effect[-1:]
                Y = metrics[metric_name][1: -2] + metrics[metric_name][-1:]
            min_value = min(min_value, min(Y))
            plt.plot(X, Y, f'-{marker}', label=method_name)
        if metric_name in ['FWER', 'FDR']:
            plt.hlines(alpha, 0, n_exp, 'k', linestyles="--", label=f'alpha={alpha}')
        else:
            plt.hlines(1-beta, 0, n_exp, 'k', linestyles="--", label=f'power={1-beta}')
        plt.title(metric_name)
        plt.xlabel('Number of hypotheses with effect')
        plt.legend()
        plt.grid()
        plt.xlim([0, n_exp])
        plt.ylim([0.8 if min_value > 0.8 else 0, 1])
        plt.show()

alpha = 0.05              # significance level
beta = 0.1                # acceptable probability of a type II error
mean = 1000               # average revenue per user in the control group
std = 600                 # standard deviation
effect = 276              # expected effect size
sample_size = estimate_sample_size(effect, std, alpha, beta)
n_exp = 100               # number of hypotheses being tested
dict_methods = {
    'without_correct': method_without_correct,
    'bonferroni': method_bonferroni,
    'holm': method_holm,
    'benjamini_hochberg': method_benjamini_hochberg,
}
method_2_result = run(dict_methods, mean, std, sample_size, n_exp)
plot_metrics(method_2_result, n_exp)

The FWER and FDR graphs are plotted for points from 0 to 99. At point 100 there are no hypotheses without an effect, so the FWER and FDR values there are zero for all methods.

The FWER plots show that the Bonferroni and Holm methods control FWER at a level no greater than \alpha, while the Benjamini-Hochberg method does not. The FDR plot shows that the Benjamini-Hochberg method controls FDR at a level no greater than \alpha.

The power graphs are plotted for points from 1 to 100. At point 0 there are no hypotheses with an effect, so the power values there are zero for all methods.

In the design of the experiments, the acceptable probability of a type II error was fixed at 0.1, which corresponds to a power of 0.9. The powers of the methods with corrections for multiple testing are less than the desired 0.9.

Power

To control the power at a given level, let's try using a significance level equal to \alpha / m in the formula for estimating the required group size.

sample_size = estimate_sample_size(effect, std, alpha/n_exp, beta)
method_2_result = run(dict_methods, mean, std, sample_size, n_exp)
plot_metrics(method_2_result, n_exp)

The power has increased; the minimum power values are around 0.9, as desired.

The power of the Bonferroni method is slightly below 0.9. This is because the normal approximation is used instead of the Student's t-distribution when estimating the required group size. For \alpha = 0.05 the difference is negligible, but for \alpha = 0.0005 the group size estimate differs: 215 instead of 217.
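
A sketch of this check, reusing estimate_sample_size from the code above and iterating because the degrees of freedom of the t-distribution depend on the group size:

def estimate_sample_size_t(effect, std, alpha, beta, n_iter=10):
    """Required group size using Student's t quantiles instead of normal quantiles."""
    n = estimate_sample_size(effect, std, alpha, beta)  # normal approximation as a starting point
    for _ in range(n_iter):
        df = 2 * n - 2
        t_alpha = stats.t.ppf(1 - alpha / 2, df)
        t_beta = stats.t.ppf(1 - beta, df)
        n = int(np.ceil((t_alpha + t_beta) ** 2 * 2 * std ** 2 / effect ** 2))
    return n

print(estimate_sample_size(effect, std, alpha / n_exp, beta))    # 215 (normal approximation)
print(estimate_sample_size_t(effect, std, alpha / n_exp, beta))  # ~217 (t-distribution)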

Shared control group

When testing 100 hypotheses, each hypothesis had its own control group. These control groups can be combined into a single control group used to test all hypotheses. On the one hand, the p-values will become dependent. On the other hand, the larger control group increases sensitivity.

What group size should be chosen to maintain the desired power level? Let us recall the formula that links the criterion parameters:

\varepsilon^2 = \left[ \Phi^{-1} \left( 1- \alpha / 2 \right) + \Phi^{-1} \left( 1-\beta \right) \right]^2 \left(\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}\right)

We test m hypotheses with a combined control group. Assume n_A = nm, n_B = n, and \sigma = \sigma_A = \sigma_B. Then

n = \frac{m+1}{m} \frac{\sigma^2}{\varepsilon^2} \left[ \Phi^{-1} \left( 1- \alpha / 2 \right) + \Phi^{-1} \left( 1-\beta \right) \right]^2

At m=1 we get (m+1)/m = 2, and at m=100 we get (m+1)/m = 1.01. Previously the group size was 215; now we will use a group size of 215 / 2 \cdot 1.01 \approx 109. Let's run a numerical experiment:
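
As a quick check, the same number can be obtained directly from the formula above (a sketch using the Bonferroni-adjusted significance level \alpha / m, as in the previous section):

m = 100
z_sum = stats.norm.ppf(1 - (alpha / m) / 2) + stats.norm.ppf(1 - beta)
n = (m + 1) / m * std ** 2 / effect ** 2 * z_sum ** 2
print(int(np.ceil(n)))  # 109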

sample_size = 109
list_n_exp_with_effect = [0, 1] + list(range(10, n_exp, 10)) + [n_exp-1, n_exp]
method_2_result = {
    method_name: defaultdict(list)
    for method_name in dict_methods
}
for _ in range(1000):
    a_values, b_values = np.random.normal(mean, std, (2, n_exp, sample_size))
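    # pool all individual control groups into one shared control group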
    a_values = [a_values.flatten()]
    prev_n_exp_with_effect = 0
    for n_exp_with_effect in list_n_exp_with_effect:
        b_values[prev_n_exp_with_effect: n_exp_with_effect] += effect
        prev_n_exp_with_effect = n_exp_with_effect
        pvalues = stats.ttest_ind(a_values, b_values, axis=1).pvalue
        for method_name, method in dict_methods.items():
            res = method(pvalues)
            method_2_result[method_name][n_exp_with_effect].append({
                'FN': sum(res[:n_exp_with_effect] == 0),
                'TP': sum(res[:n_exp_with_effect] == 1),
                'FP': sum(res[n_exp_with_effect:] == 1),
                'TN': sum(res[n_exp_with_effect:] == 0),
            })
plot_metrics(method_2_result, n_exp)

The quality of the methods has not changed, while the required amount of data has almost halved. In fact, you can gain even more with an optimal division of users into groups; you can read how to do this in our article Determining the Optimal Group Size for Multiple A/B Testing.

Results

Testing a large number of hypotheses without adjusting for multiple testing increases the probability of false positives. FWER and FDR are metrics used for multiple testing. Bonferroni and Holm corrections allow you to control FWER. The Holm method has greater power. The Benjamini-Hochberg correction allows you to control FDR. The more hypotheses you test, the more data you need to get the desired power. You can increase the sensitivity of tests by using a single control group.
