A
Many hypotheses
When conducting experiments, sometimes it becomes necessary to test several hypotheses at once. For example, you have generated 100 images for an advertising mailing using neural networks, and you need to check which of them increase the average revenue per client. When testing one hypothesis, the probability of a type I error is equal to the significance level . As the number of hypotheses increases, if there is actually no effect, the probability of getting at least one type I error is expressed by the formula:
— the number of false positive results (type I errors), and — the number of hypotheses.
Below is a graph of this function for When testing 100 hypotheses, the probability of getting a type I error is close to one.
In addition, a large number of hypotheses without an effect may prevent us from detecting an effect where it actually exists. Let's say one of the hypotheses has an effect, while the other hypotheses do not. The effect size is equal to the expected effect, which was used to calculate the required group size. We want to choose one option that will have a larger point estimate of the effect. The graph below shows the probability of choosing the wrong hypothesis depending on the number of hypotheses:
If among 100 hypotheses there is only one hypothesis with an effect, then with a probability of 0.24 among 99 hypotheses without an effect there will be a hypothesis whose point estimate of the effect will be greater than the point estimate of the effect of the hypothesis that actually had an effect.
Testing a large number of hypotheses increases the probability of errors. A simple way to reduce the probability of errors is to reduce the number of hypotheses being tested. In many situations, there is no need to test the entire range of possible options; it is enough to expertly select several of the best candidates. This will save resources on conducting experiments and reduce the probability of errors.
If it is necessary to test two or more hypotheses, then it is necessary to apply corrections for multiple testing, which allow you to control the probabilities of errors at given levels.
Metrics
Before we move on to adjusting for multiple testing, let's define the metrics we want to control. Let's introduce some notations:
m is the total number of hypotheses;
TN — number of true negative results;
FN — number of false negative results;
FP — number of false positive results;
TP is the number of true positive results.
true | incorrect | Total | |
We do not reject | TN | FN | TN+FN |
We reject | FP | TP | FP+TP |
Total | TN+FP | FN+TP | m |
To control for Type I errors in multiple testing, FWER or FDR are commonly used.
FWER (Family-wise error rate) is the probability that at least one of the tests gives a false positive result when the null hypotheses are true.
All other things being equal, decision algorithms that control FWER at a given significance level allow fewer false positives than algorithms that control FDR. Let's say that 20 hypotheses out of 100 actually have an effect. To control FDR at a significance level of 0.05, it is enough to select 20 hypotheses, allowing one false positive. In this case, FWER will be equal to 1. An algorithm that controls FWER will select fewer hypotheses, which will lead to a decrease in false positives, but, at the same time, to an increase in false negatives.
We also need an analogue of power for multiple testing. We will use the average proportion of true positives among hypotheses with an effect.
Amendments
There are many variations of corrections for multiple hypothesis testing. We will consider three of them: Bonferroni, Holm, and Benjamini-Hochbreg.
Bonferroni method proposes to use a significance level equal to 0 for testing each hypothesis Where — the number of hypotheses. This allows us to control FWER at a level not exceeding The Bonferroni method is simple to implement, but is less powerful than other corrections.
Holm Method also controls FWER at a level not exceeding . Let's describe the algorithm of its operation:
Let's arrange the p-values in ascending order: . Let us denote the null hypothesis corresponding to the value How .
Let's define the levels of significance .
We sequentially iterate over the p-values from the series . If then we reject and move on to the next value. Otherwise, we say that the data hypotheses do not contradict.
Benjamini-Hochberg method controls FDR at a level not exceeding . This method works in the same way as the Holm method, only the values are used as significance levels. .
Below are the implementations of the amendments. You can use ready-made implementations, for example, from the function statsmodels.stats.multitest.multipletests
.
Hidden text
from collections import defaultdict
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def method_without_correct(pvalues, alpha=0.05):
"""Проверяет значимость отличий без поправок.
:param pvalues (np.array): массив pvalue
:param alpha (float): уровень значимости
:return res (np.array): массив из нулей и единиц:
0 - значимых отличий нет, 1 - значимые отличия есть.
"""
res = (pvalues <= alpha).astype(int)
return res
def method_bonferroni(pvalues, alpha=0.05):
"""Применяет метод Бонферрони для проверки значимости изменений.
:param pvalues (np.array): массив pvalue
:param alpha (float): уровень значимости
:return res (np.array): массив из нулей и единиц:
0 - значимых отличий нет, 1 - значимые отличия есть.
"""
m = len(pvalues)
res = (pvalues <= alpha / m).astype(int)
return res
def method_holm(pvalues, alpha=0.05):
"""Применяет метод Холма для проверки значимости изменений.
:param pvalues (np.array): массив pvalue
:param alpha (float): уровень значимости
:return res (np.array): массив из нулей и единиц:
0 - значимых отличий нет, 1 - значимые отличия есть.
"""
m = len(pvalues)
alphas = alpha / np.arange(m, 0, -1)
sorted_pvalue_indexes = np.argsort(pvalues)
res = np.zeros(m, dtype=int)
for alpha_, pvalue_index in zip(alphas, sorted_pvalue_indexes):
pvalue = pvalues[pvalue_index]
if pvalue < alpha_:
res[pvalue_index] = 1
else:
break
return res
def method_benjamini_hochberg(pvalues, alpha=0.05):
"""Применяет метод Бенджамини-Хохберга для проверки значимости изменений.
:param pvalues (np.array): массив pvalue
:param alpha (float): уровень значимости
:return res (np.array): массив из нулей и единиц:
0 - значимых отличий нет, 1 - значимые отличия есть.
"""
m = len(pvalues)
alphas = alpha * np.arange(1, m+1) / m
sorted_pvalue_indexes = np.argsort(pvalues)
res = np.zeros(m, dtype=int)
for alpha_, pvalue_index in zip(alphas, sorted_pvalue_indexes):
pvalue = pvalues[pvalue_index]
if pvalue <= alpha_:
res[pvalue_index] = 1
else:
break
return res
Numerical experiments
Let's conduct numerical experiments to see how the corrections behave under different conditions. We will test 100 hypotheses for equality of means using the Student's t-test. The significance level is 0.05, the permissible probability of a type II error is 0.1, and the expected effect is 275. The group size is selected using the formula for estimating the required group size when testing one hypothesis and is 100. The experiments considered situations where among 100 hypotheses there were from 0 to 100 hypotheses with an effect. The results with metric estimates for different methods are shown in the graphs below.
Hidden text
def estimate_sample_size(effect, std, alpha, beta):
"""Оценка необходимого размер групп."""
t_alpha = stats.norm.ppf(1 - alpha / 2, loc=0, scale=1)
t_beta = stats.norm.ppf(1 - beta, loc=0, scale=1)
var = 2 * std ** 2
sample_size = int(np.ceil((t_alpha + t_beta) ** 2 * var / (effect ** 2)))
return sample_size
def run(dict_methods, mean, std, sample_size, n_exp, n_iter=1000):
"""Проводит симуляцию экспериментов.
:param dict_methods: словарь с методами множественного тестирования
:param mean: среднее значение генерируемых данных
:param std: стандартное отклонение генерируемых данных
:param sample_size: размер групп
:param n_exp: количество проверяемых гипотез
:param n_iter: количество итераций симуляции
:return method_2_result: результаты применения методов к данным симуляции
"""
list_n_exp_with_effect = [0, 1] + list(range(10, n_exp, 10)) + [n_exp-1, n_exp]
method_2_result = {
method_name: defaultdict(list)
for method_name in dict_methods
}
for _ in range(n_iter):
a_values, b_values = np.random.normal(mean, std, (2, n_exp, sample_size))
prev_n_exp_with_effect = 0
for n_exp_with_effect in list_n_exp_with_effect:
b_values[prev_n_exp_with_effect: n_exp_with_effect] += effect
prev_n_exp_with_effect = n_exp_with_effect
pvalues = stats.ttest_ind(a_values, b_values, axis=1).pvalue
for method_name, method in dict_methods.items():
res = method(pvalues)
method_2_result[method_name][n_exp_with_effect].append({
'FN': sum(res[:n_exp_with_effect] == 0),
'TP': sum(res[:n_exp_with_effect] == 1),
'FP': sum(res[n_exp_with_effect:] == 1),
'TN': sum(res[n_exp_with_effect:] == 0),
})
return method_2_result
def plot_metrics(method_2_result, n_exp, alpha=0.05, beta=0.1):
"""Рисует графики."""
method_2_metrics = {method_name: defaultdict(list) for method_name in method_2_result}
markers = ['o', 'v', '<', '>', '^', 's']
for method_name in method_2_result:
list_n_exp_with_effect = sorted(method_2_result[method_name].keys())
for n_exp_with_effect in list_n_exp_with_effect:
results = method_2_result[method_name][n_exp_with_effect]
fwer = np.mean([x['FP'] > 0 for x in results])
fdr = np.mean([x['FP'] / max(1, x['FP'] + x['TP']) for x in results])
power = np.mean([x['TP'] / max(1, x['FN'] + x['TP']) for x in results])
method_2_metrics[method_name]['FWER'].append(fwer)
method_2_metrics[method_name]['FDR'].append(fdr)
method_2_metrics[method_name]['POWER'].append(power)
for metric_name in ['FWER', 'FDR', 'POWER']:
min_value = 1
for idx, (method_name, metrics) in enumerate(method_2_metrics.items()):
marker = markers[idx % len(markers)]
if metric_name in ['FWER', 'FDR']:
X = list_n_exp_with_effect[:1] + list_n_exp_with_effect[2: -1]
Y = metrics[metric_name][:1] + metrics[metric_name][2: -1]
elif metric_name == 'POWER':
X = list_n_exp_with_effect[1: -2] + list_n_exp_with_effect[-1:]
Y = metrics[metric_name][1: -2] + metrics[metric_name][-1:]
min_value = min(min_value, min(Y))
plt.plot(X, Y, f'-{marker}', label=method_name)
if metric_name in ['FWER', 'FDR']:
plt.hlines(alpha, 0, n_exp, 'k', linestyles="--", label=f'alpha={alpha}')
else:
plt.hlines(1-beta, 0, n_exp, 'k', linestyles="--", label=f'power={1-beta}')
plt.title(metric_name)
plt.xlabel('Количество гипотез с эффектом')
plt.legend()
plt.grid()
plt.xlim([0, n_exp])
plt.ylim([0.8 if min_value > 0.8 else 0, 1])
plt.show()
alpha = 0.05 # уровень значимости
beta = 0.1 # допустимая вероятность ошибки II рода
mean = 1000 # средняя выручка с пользователя в контрольной группе
std = 600 # стандартное отклонение
effect = 276 # ожидаемый размер эффекта
sample_size = estimate_sample_size(effect, std, alpha, beta)
n_exp = 100 # количество проверяемых гипотез
dict_methods = {
'without_correct': method_without_correct,
'bonferroni': method_bonferroni,
'holm': method_holm,
'benjamini_hochberg': method_benjamini_hochberg,
}
method_2_result = run(dict_methods, mean, std, sample_size, n_exp)
plot_metrics(method_2_result, n_exp)
The FWER and FDR graphs are plotted for points from 0 to 99. At point 100 there are no hypotheses without an effect, so the FWER and FDR values there are zero for all methods.
From the FWER plots, it is clear that the Bonferroni and Holm methods control the FWER at a level no greater than but the Benjamini-Hochberg method does not. The plot for FDR shows that the Benjamini-Hochberg method controls FDR at a level no greater than .
Power graphs are plotted for points from 1 to 100. At point 0 there are no hypotheses with an effect, the power values there are zero for all methods.
In the design of the experiments, the acceptable probability of a type II error was fixed at 0.1, which corresponds to a power of 0.9. The powers of the methods with corrections for multiple testing are less than the desired 0.9.
Power
To control the power at a given level, let's try using a significance level equal to in the formula for estimating the required group size.
Hidden text
sample_size = estimate_sample_size(effect, std, alpha/n_exp, beta)
method_2_result = run(dict_methods, mean, std, sample_size, n_exp)
plot_metrics(method_2_result, n_exp)
The power has increased, the minimum power values are around 0.9, as we wanted.
The power of the Bonferroni method is slightly less than 0.9. This is due to the fact that the normal distribution approximation is used instead of the Student's t-distribution when estimating the required group size. For the differences might be imperceptible, but The group size estimate differs: 215 instead of 217.
General control
When testing 100 hypotheses, each hypothesis had its own control groups. The control groups can be combined into a single control group and used to test all hypotheses. On the one hand, the p-values will become dependent. On the other hand, increasing the size of the control group will lead to increased sensitivity.
What group size should be chosen to maintain the desired power level? Let us recall the formula that links the criterion parameters:
We are checking hypotheses with a combined control group. Let us assume , And Then
At we get and when we get . Previously the group size was 215, now we will use a group size equal to Let's conduct a numerical experiment:
Hidden text
sample_size = 109
list_n_exp_with_effect = [0, 1] + list(range(10, n_exp, 10)) + [n_exp-1, n_exp]
method_2_result = {
method_name: defaultdict(list)
for method_name in dict_methods
}
for _ in range(1000):
a_values, b_values = np.random.normal(mean, std, (2, n_exp, sample_size))
a_values = [a_values.flatten()]
prev_n_exp_with_effect = 0
for n_exp_with_effect in list_n_exp_with_effect:
b_values[prev_n_exp_with_effect: n_exp_with_effect] += effect
prev_n_exp_with_effect = n_exp_with_effect
pvalues = stats.ttest_ind(a_values, b_values, axis=1).pvalue
for method_name, method in dict_methods.items():
res = method(pvalues)
method_2_result[method_name][n_exp_with_effect].append({
'FN': sum(res[:n_exp_with_effect] == 0),
'TP': sum(res[:n_exp_with_effect] == 1),
'FP': sum(res[n_exp_with_effect:] == 1),
'TN': sum(res[n_exp_with_effect:] == 0),
})
plot_metrics(method_2_result, n_exp)
The quality of the methods has not changed, and the required amount of data has decreased almost 2 times. In fact, you can get even more gain with optimal division of users into groups. How to do this – read in our article Determining the Optimal Group Size for Multiple A/B Testing.
Results
Testing a large number of hypotheses without adjusting for multiple testing increases the probability of false positives. FWER and FDR are metrics used for multiple testing. Bonferroni and Holm corrections allow you to control FWER. The Holm method has greater power. The Benjamini-Hochberg correction allows you to control FDR. The more hypotheses you test, the more data you need to get the desired power. You can increase the sensitivity of tests by using a single control group.