Power Analysis of Statistical Tests Using Bucketing

Abstract

This article examines the impact of bucketing on the power of statistical tests under different data distributions and sample sizes. Particular attention is paid to how the power of the test depends on the number of buckets and on the sample size. The findings have practical implications for the design and analysis of A/B tests and other experimental research.

Introduction

In A/B experiments, the bucketing method is often used to optimize calculations. Bucketization involves dividing the sample into several groups (buckets), within which the data is processed separately. This paper examines how bucketization affects the power of statistical tests used to analyze experimental results.

Methodology

Bucketization is the process of dividing the total sample randomly into several subgroups (buckets), which are then analyzed separately. This method is widely used in statistical research, especially in A/B testing.

What are the advantages of bucketization that are often highlighted?

APPLICATION OF BUCKETING:

  1. Reducing the impact of outliers: By dividing the overall sample into many smaller groups, the influence of outliers within each group is diluted, making test results more stable and reliable.

  2. Improved evaluation of effects: Bucketization allows for a more accurate assessment of the effects of introduced changes, since comparisons are made within more homogeneous and comparable groups.

  3. Controlling data heterogeneity: Dividing the sample into groups helps control for heterogeneity in the data, for example when participants have different demographic or geographic backgrounds.

HOW TO USE BUCKETING:

  1. Determining Bucket Size: It is necessary to determine the optimal number and size of buckets. Too few buckets may not provide sufficient control over variability, while too many may result in over-segmentation and loss of statistical power.

  2. Randomization: Participants should be randomly assigned to buckets to minimize any bias and ensure group comparability.

  3. Data analysis: With bucketing, the unit of observation shifts from individual users to buckets, so millions of user-level observations are reduced to thousands of bucket-level observations.
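The user-to-bucket transition described in the steps above can be sketched in a few lines of NumPy (a minimal illustration with made-up parameters, not code from the study):

```python
import numpy as np

rng = np.random.default_rng(0)

num_users = 1_000_000
num_buckets = 1_000

# User-level metric values (illustrative lognormal data)
values = rng.lognormal(mean=np.log(50), sigma=0.75, size=num_users)

# Randomly assign each user to a bucket
bucket_indices = rng.integers(0, num_buckets, size=num_users)

# Aggregate: per-bucket sums and counts -> per-bucket means
bucket_sums = np.bincount(bucket_indices, weights=values, minlength=num_buckets)
bucket_counts = np.bincount(bucket_indices, minlength=num_buckets)
bucket_means = bucket_sums / bucket_counts

# A million user-level observations become a thousand bucket-level ones
print(values.shape, "->", bucket_means.shape)
```

Downstream tests then operate on `bucket_means` (or on the sums and counts directly, as in the delta-method test below), which is what makes the computation cheap.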

Now, to the study itself

Synthetic data generated from different distributions were used for the analysis. The power of the tests was assessed under three conditions:

  1. With and without the use of bucketing.

  2. When changing the number of buckets with a fixed number of users.

  3. When changing the number of users with a fixed number of buckets.

Parallelization was used to speed up calculations.

Which tests were used?

Student's t-test (using the delta method for bucketing)
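The per-group variance in this test comes from the first-order delta method for a ratio of two means. Writing $\mu_x, \sigma_x^2$ for the mean and variance of the numerator, $\mu_y, \sigma_y^2$ for the denominator, and $\operatorname{cov}(x, y)$ for their covariance, the approximation implemented as `var_0` and `var_1` in the code is:

```latex
\operatorname{Var}\!\left(\frac{\bar{x}}{\bar{y}}\right)
  \approx \frac{\sigma_x^2}{\mu_y^2}
        + \frac{\mu_x^2 \sigma_y^2}{\mu_y^4}
        - \frac{2\,\mu_x \operatorname{cov}(x, y)}{\mu_y^3}
```

The standard error of the difference in ratios is then obtained by dividing each group's variance by its number of buckets, as in the statistic at the end of the function.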

import numpy as np
from scipy import stats


def t_test_deltamethod(x_0: np.ndarray, y_0: np.ndarray, x_1: np.ndarray, y_1: np.ndarray) -> tuple:
    n_0 = y_0.shape[0]
    n_1 = y_1.shape[0]

    mean_x_0, var_x_0 = np.mean(x_0), np.var(x_0)
    mean_x_1, var_x_1 = np.mean(x_1), np.var(x_1)

    mean_y_0, var_y_0 = np.mean(y_0), np.var(y_0)
    mean_y_1, var_y_1 = np.mean(y_1), np.var(y_1)

    cov_0 = np.mean((x_0 - mean_x_0) * (y_0 - mean_y_0))
    cov_1 = np.mean((x_1 - mean_x_1) * (y_1 - mean_y_1))

    # Delta-method variance of the ratio metric in each group
    var_0 = var_x_0 / mean_y_0 ** 2 + var_y_0 * mean_x_0 ** 2 / mean_y_0 ** 4 - 2 * mean_x_0 / mean_y_0 ** 3 * cov_0
    var_1 = var_x_1 / mean_y_1 ** 2 + var_y_1 * mean_x_1 ** 2 / mean_y_1 ** 4 - 2 * mean_x_1 / mean_y_1 ** 3 * cov_1

    rto_0 = np.sum(x_0) / np.sum(y_0)
    rto_1 = np.sum(x_1) / np.sum(y_1)

    statistic = (rto_1 - rto_0) / np.sqrt(var_0 / n_0 + var_1 / n_1)
    pvalue = 2 * np.minimum(stats.norm.cdf(statistic), 1 - stats.norm.cdf(statistic))
    return statistic, pvalue

Results

This section presents the key results of a study whose purpose was to evaluate the impact of bucketing on the power of statistical tests under various experimental conditions. The results are grouped into three main areas: the impact of using bucketing, the impact of the number of buckets, and the impact of sample size on the power of the tests. Each section is accompanied by corresponding graphs that illustrate observed trends and statistical findings.

What data was used:

In this work, two different distributions were chosen: normal and lognormal.

Reasons for choosing such distributions:

  1. The normal distribution was chosen as a like-for-like baseline: after bucketing, the distribution of bucket means tends to normal.

  2. The lognormal distribution was chosen to bring the data closer to reality, since many real-world metrics are approximately lognormally distributed.
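The claim that bucket means tend to normal is just the central limit theorem at work, and it is easy to check empirically (a quick sketch with illustrative parameters, not part of the original study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

num_users = 200_000
num_buckets = 1_000

# Heavily right-skewed user-level data (lognormal, as in the study)
values = rng.lognormal(mean=np.log(50), sigma=0.75, size=num_users)

# Bucket means: each averages roughly 200 users
bucket_indices = rng.integers(0, num_buckets, size=num_users)
bucket_means = (np.bincount(bucket_indices, weights=values, minlength=num_buckets)
                / np.bincount(bucket_indices, minlength=num_buckets))

# Skewness drops sharply after aggregation (CLT at work)
print("user-level skew:  ", stats.skew(values))
print("bucket-level skew:", stats.skew(bucket_means))
```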

For normal distribution the following parameters were used: mean = 50, variance = 10.

Normal distribution density: mean = 50, variance = 10


For the lognormal distribution the following parameters were used: mean = log(50), variance = 0.75.

Lognormal distribution density: mean = log(50), variance = 0.75


Results 1: Power of the test with and without bucketing

Experiment parameters:

num_users = 1000000
num_buckets = 10000
alpha = .05

lifts = np.asarray([1., 1.0001, 1.0002, 1.0005, 1.001, 1.002])
num_simulations = 1000

The power of the test was assessed in two groups: with and without bucketing. The results show that the power remains almost unchanged regardless of whether bucketing is applied.

  1. For normal distribution

    Dependence of test power on lift (relative difference)


    import numpy as np
    import plotly.graph_objects as go
    from scipy.stats import ttest_ind
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from tqdm import tqdm
    
    # Simulation parameters
    np.random.seed(42)
    num_users = 1000000
    num_buckets = 10000
    mean_control = 50 
    std_dev = 10 
    alpha = .05
    
    lifts = np.asarray([1., 1.0001, 1.0002, 1.0005, 1.001, 1.002])
    num_simulations = 1000
    
    
    def run_simulation(lift):
        mean_treatment = mean_control * lift
        control_group = np.random.normal(mean_control, std_dev, num_users)
        treatment_group = np.random.normal(mean_treatment, std_dev, num_users)
        
        # Bucketize: assign each user to a random bucket and aggregate
        bucket_indices = np.random.randint(0, num_buckets, num_users)
        bucket_sums_control = np.bincount(bucket_indices, weights=control_group, minlength=num_buckets)
        bucket_sums_treatment = np.bincount(bucket_indices, weights=treatment_group, minlength=num_buckets)
        bucket_counts = np.bincount(bucket_indices, minlength=num_buckets)
        
        # t-test on individual data vs. delta-method t-test on aggregated data
        t_stat_ind, p_value_ind = ttest_ind(treatment_group, control_group)
        t_stat_agg, p_value_agg = t_test_deltamethod(bucket_sums_control, bucket_counts, bucket_sums_treatment, bucket_counts)
        
        return (p_value_ind <= alpha, p_value_agg <= alpha)
    
    # Parallel execution with a progress bar
    TPR = {lift: 0 for lift in lifts}
    TPR_b = {lift: 0 for lift in lifts}
    
    with ThreadPoolExecutor(max_workers=4) as executor:
        future_to_lift = {executor.submit(run_simulation, lift): lift
                          for lift in lifts for _ in range(num_simulations)}
        for future in tqdm(as_completed(future_to_lift), total=len(future_to_lift), desc="Simulating"):
            lift = future_to_lift[future]  # match each result to its lift
            reject_ind, reject_agg = future.result()
            TPR[lift] += reject_ind
            TPR_b[lift] += reject_agg
    
    # Convert rejection counts into power estimates
    for lift in lifts:
        TPR[lift] /= num_simulations
        TPR_b[lift] /= num_simulations
    
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(x=(np.asarray(list(TPR.keys()))-1)*100, y=list(TPR.values()), mode="lines+markers", name="Individual Data"))
    fig.add_trace(go.Scatter(x=(np.asarray(list(TPR.keys()))-1)*100, y=list(TPR_b.values()), mode="lines+markers", name="Aggregated Data"))
    
    fig.update_layout(title="TPR Individual Data vs Aggregated Data (normal distribution)",
                      xaxis_title="Lift %",
                      yaxis_title="Power",
                      height=600,
                      width=1000,
                      template="plotly_white")
    
    fig.show()
  2. For lognormal distribution:

    Dependence of test power on lift (relative difference)


    ...
    np.random.seed(42)
    num_users = 1000000
    num_buckets = 10000
    mean_control = np.log(50)  
    std_dev = .75  
    alpha = .05
    
    lifts = np.asarray([1., 1.0001, 1.0002, 1.0005, 1.001, 1.002])
    num_simulations = 1000
    
    
    def run_simulation(lift):
        mean_treatment = mean_control * lift
        control_group = np.random.lognormal(mean_control, std_dev, num_users)
        treatment_group = np.random.lognormal(mean_treatment, std_dev, num_users)
        ...

Results 2: The influence of the number of buckets on the power of the test

Experiment parameters:

num_users = 100000  # Total number of users
num_simulations = 500  # Number of simulations for each bucket count
mean_control = np.log(50)  # Mean of the lognormal distribution (log scale)
sigma = 0.75  # Standard deviation of the lognormal distribution
alpha = 0.05  # Significance level
lift = 1.001  # Lift value

# Range of bucket counts to test
bucket_ranges = [1, 5, 10, 20, 30, 50, 70, 100, 500, 1000, 2000, 3000, 5000, 7000, 10000, 11000, 12000, 15000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]

The analysis showed that, for a fixed number of users, changing the number of buckets does not have a significant impact on the power of the test. However, as the number of buckets decreases, the variability of the results increases. Part of the variability in power stems from the relatively small number of users (100k), chosen to keep the computational load manageable on a local machine.

Dependence of test power on the number of buckets


np.random.seed(42)
num_users = 100000  # Total number of users
num_simulations = 500  # Number of simulations for each bucket count
mean_control = np.log(50)  # Mean of the lognormal distribution (log scale)
sigma = 0.75  # Standard deviation of the lognormal distribution
alpha = 0.05  # Significance level
lift = 1.001  # Lift value

# Range of bucket counts to test
bucket_ranges = [1, 5, 10, 20, 30, 50, 70, 100, 500, 1000, 2000, 3000, 5000, 7000, 10000, 11000, 12000, 15000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]

def run_simulation(num_buckets):
    power_individual = 0
    power_aggregated = 0
    for _ in range(num_simulations):
        # Generate data
        mean_treatment = mean_control * lift
        control_group = np.random.lognormal(mean_control, sigma, num_users)
        treatment_group = np.random.lognormal(mean_treatment, sigma, num_users)
            
        # Bucketization
        bucket_indices = np.random.randint(0, num_buckets, num_users)
        bucket_sums_control = np.bincount(bucket_indices, weights=control_group, minlength=num_buckets)
        bucket_sums_treatment = np.bincount(bucket_indices, weights=treatment_group, minlength=num_buckets)
        bucket_counts = np.bincount(bucket_indices, minlength=num_buckets)
        
        # t-tests
        _, p_value_ind = ttest_ind(treatment_group, control_group)
        _, p_value_agg = t_test_deltamethod(bucket_sums_control, bucket_counts, bucket_sums_treatment, bucket_counts)
        
        # Update power counters
        if p_value_ind <= alpha:
            power_individual += 1
        if p_value_agg <= alpha:
            power_aggregated += 1
    
    return num_buckets, power_individual / num_simulations, power_aggregated / num_simulations

# Run simulations in parallel
results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(run_simulation, num_buckets): num_buckets for num_buckets in bucket_ranges}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Simulating"):
        result = future.result()
        results.append(result)
        print(f"Num Buckets: {result[0]}, Power Individual: {result[1]}, Power Aggregated: {result[2]}")

# Sort results by the number of buckets for plotting
results.sort(key=lambda x: x[0])

num_buckets = [result[0] for result in results]
power_individual = [result[1] for result in results]
power_aggregated = [result[2] for result in results]

fig = go.Figure()

fig.add_trace(go.Scatter(x=num_buckets, y=power_individual, mode="lines+markers", name="Individual Data"))

fig.add_trace(go.Scatter(x=num_buckets, y=power_aggregated, mode="lines+markers", name="Aggregated Data"))

fig.update_layout(
    title="Statistical Power vs Number of Buckets",
    xaxis_title="Number of Buckets",
    yaxis_title="Power",
    height=600,
    template="plotly_white"
)

fig.show()

Results 3: Effect of sample size on test power

Experiment parameters:

num_buckets = 10000  
num_simulations = 1000  # Number of simulations for each number of users
mean_control = np.log(50)  # Mean of the lognormal distribution (log scale)
sigma = 0.75  # Standard deviation of the lognormal distribution
alpha = 0.05  # Significance level
lift = 1.001  # Lift value

user_ranges = [10000, 100000, 500000, 1000000] 

The results show that the power of the test increases with the number of users (and hence with the number of users per bucket). This highlights the importance of an adequate sample size for powerful statistical analysis.

Dependence of test power on the number of observations


np.random.seed(42)
num_buckets = 10000  
num_simulations = 1000  # Number of simulations for each number of users
mean_control = np.log(50)  # Mean of the lognormal distribution (log scale)
sigma = 0.75  # Standard deviation of the lognormal distribution
alpha = 0.05  # Significance level
lift = 1.001  # Lift value

user_ranges = [10000, 100000, 500000, 1000000] 

def run_simulation(num_users):
    power_individual = 0
    power_aggregated = 0
    for _ in range(num_simulations):
        # Generate data
        mean_treatment = mean_control * lift
        control_group = np.random.lognormal(mean_control, sigma, num_users)
        treatment_group = np.random.lognormal(mean_treatment, sigma, num_users)
            
        # Bucketization
        bucket_indices = np.random.randint(0, num_buckets, num_users)
        bucket_sums_control = np.bincount(bucket_indices, weights=control_group, minlength=num_buckets)
        bucket_sums_treatment = np.bincount(bucket_indices, weights=treatment_group, minlength=num_buckets)
        bucket_counts = np.bincount(bucket_indices, minlength=num_buckets)
        
        # t-tests
        _, p_value_ind = ttest_ind(treatment_group, control_group)
        _, p_value_agg = t_test_deltamethod(bucket_sums_control, bucket_counts, bucket_sums_treatment, bucket_counts)
        
        # Update power counters
        if p_value_ind <= alpha:
            power_individual += 1
        if p_value_agg <= alpha:
            power_aggregated += 1
    
    return num_users, power_individual / num_simulations, power_aggregated / num_simulations

results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(run_simulation, num_users): num_users for num_users in user_ranges}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Simulating"):
        result = future.result()
        results.append(result)
        print(f"Num Users: {result[0]}, Power Individual: {result[1]}, Power Aggregated: {result[2]}")

# Sort results by the number of users for plotting
results.sort(key=lambda x: x[0])

num_users = [result[0] for result in results]
power_individual = [result[1] for result in results]
power_aggregated = [result[2] for result in results]

fig = go.Figure()

fig.add_trace(go.Scatter(x=num_users, y=power_individual, mode="lines+markers", name="Individual Data"))

fig.add_trace(go.Scatter(x=num_users, y=power_aggregated, mode="lines+markers", name="Aggregated Data"))

# Add title and axis labels
fig.update_layout(
    title="Statistical Power vs Number of Users for 10k buckets",
    xaxis_title="Number of Users",
    yaxis_title="Power",
    height=600,
    template="plotly_white"
)

fig.show()

The study showed the following:

  1. Power of the test with and without bucketing: The power of the tests was almost identical regardless of whether bucketing was applied, indicating that bucketing had little effect on the overall performance of the statistical tests for the given data distributions.

  2. Dependence of power on the number of buckets: The power turned out to be almost independent of the number of buckets for a fixed number of users. However, the fewer the buckets, the greater the variability between results with and without bucketing.

  3. Dependence of power on the number of users: Power increased with the number of users for a constant number of buckets. This result highlights the importance of a sufficient sample size for statistical tests to perform well.
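As a rough cross-check of finding 3, the power of a two-sample test of means can be approximated analytically rather than by simulation. The sketch below uses the standard normal approximation with the parameters from Results 1 (mean = 50, std = 10, lift = 0.1%); it is our illustration, not code from the study:

```python
import numpy as np
from scipy import stats

def approx_power(n_per_group: int, effect: float, std: float, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided, two-sample test of means."""
    se = std * np.sqrt(2.0 / n_per_group)   # standard error of the mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    shift = effect / se                     # standardized effect under the alternative
    # P(reject H0) under the alternative
    return stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit)

# Power grows with sample size for a fixed small effect
# (lift = 0.1% on a mean of 50 -> absolute effect = 0.05)
for n in [10_000, 100_000, 500_000, 1_000_000]:
    print(n, round(approx_power(n, effect=0.05, std=10), 3))
```

The same monotone growth with sample size is what the simulations show; the analytic formula is only exact for normal data, which is why the simulation-based approach is still needed for the lognormal case.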

Conclusion

The results may differ somewhat for other distributions; examining them is a natural next step.

The use of bucketing is justified. The one caveat is not to choose fewer than 1000 buckets, because the variance of the difference in power with and without bucketing grows as the number of buckets decreases. The power of the tests does not differ enough to warrant removing bucketing from the infrastructure.
