CUPED A/B tests in one step

Hi!

I am writing this article for those who are already familiar with CUPED but want a deeper understanding of the method and a look at it from a different angle. I will not explain the basic CUPED algorithm in detail here: there is already plenty of material on that topic online. Instead, we will examine the method through the lens of regression. The purpose of the article is to introduce the reader to a theorem that is incredibly useful for understanding how regressions work and, most importantly, to demonstrate how to use this theorem to run CUPED tests not in three consecutive steps (as in the basic algorithm), but with a single regression.

Content:

  • Refreshing your mind about CUPED

  • Theorem: Special case and intuition

  • The connection of the theorem with CUPED

  • CUPED tests in one step!

  • Appendix: General case of the theorem

Refreshing your mind about CUPED

First, let's quickly recall what CUPED does. CUPED stands for Controlled-experiment Using Pre-Experiment Data: an A/B testing method that uses data about test participants known from the pre-test period. Most often, such data are the values of the tested metric measured for the participants before the start of the experiment. Using this information reduces the variance of the estimated statistic, which increases the sensitivity of the test. The original CUPED algorithm looks like this:

  1. Construct a regression of the form metric_i = \hat\alpha + \hat\theta * preMetric_i + \epsilon_i, where metric_i is the value of the metric for the i-th user during the test period, and preMetric_i is the value of the metric for the same user measured during the pre-test period;

  2. Using the estimated value \hat\theta, calculate the residual metric^{(cuped)}_i = metric_i - \hat\theta * preMetric_i;

  3. Apply a standard statistical test to compare the calculated residuals in the test and control groups.

    The output is an unbiased estimate of the difference between the groups, but with lower variance (a minimal code sketch of these three steps is given right after this list).
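
To make these steps concrete, here is a minimal sketch of the basic algorithm, assuming a hypothetical DataFrame df with columns metric, preMetric and testDummy (1 for the test group, 0 for control); the same logic appears again in the full example at the end of the article.

import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind

def cuped_ttest(df: pd.DataFrame):
    # step 1: regress the experiment-period metric on the pre-experiment metric
    theta = smf.ols('metric ~ 1 + preMetric', data=df).fit().params['preMetric']
    # step 2: compute the CUPED residuals
    cuped_metric = df['metric'] - theta * df['preMetric']
    # step 3: standard t-test on the residuals of the two groups
    return ttest_ind(cuped_metric[df['testDummy'] == 1],
                     cuped_metric[df['testDummy'] == 0])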

For some time I myself could not grasp the "physics" of this algorithm. I understood the mathematical proof that the variance of the estimate decreases (you can read about it here). But what meaning does the value metric^{(cuped)}_i carry? Why can a statistical test be applied to it and yield an adequate, interpretable comparison of the two groups? Not mathematically, but just… humanly? The answer to these questions came from the theorem on the separation of regressors, known in the English-language literature as the Frisch–Waugh–Lovell theorem. It also showed me how to analyze CUPED tests more simply than described in the original article.

Theorem: Special case and intuition

When constructing a regression, we explain part of the variance of the dependent variable using the variances of the explanatory regressors. A simple case is a regression with two variables:

y_i = \hat\alpha + \hat{\beta} x_i + \hat{\gamma} z_i + \epsilon_i , \quad cov(x,z) \neq 0

where x_i, z_i are explanatory variables with some non-zero correlation, \hat\alpha, \hat\beta, \hat\gamma are the estimated regression coefficients, and \epsilon_i is the residual. When building such a regression, we say that the variance of y is explained by two terms: the part of the variance of x independent of z and the part of the variance of z independent of x, plus the error. OLS estimation cleanly separates which part of the variation of y is associated only with x and which part only with z. The beta and gamma coefficients "take on" only the relevant part of the variance of the dependent variable, even if the regressors are somewhat correlated (i.e. share a common part of their variance). You can verify this in practice using the theorem on the separation of regressors (here, for the simple case of a regression with two variables).

Statement: the estimate of the coefficient \hat{\beta} obtained in a multiple regression with two variables of the form

y_i = \hat\alpha + \hat{\beta} x_i + \hat{\gamma} z_i + \epsilon_i , \quad cov(x,z) \neq 0

is completely identical to the estimate \hat{\beta}' obtained from the following sequence of one-variable regressions:

  1. regress y on z: y_i = \hat{\alpha}' + \hat{\gamma}' z_i + \delta^{yz}_i and calculate the regression residuals \delta^{yz}_i = y_i - \hat{y}_i;

  2. regress x on z: x_i = \hat{\alpha}'' + \hat{\gamma}'' z_i + \delta^{xz}_i and calculate the residuals \delta^{xz}_i = x_i - \hat{x}_i;

  3. regress the residuals from the first step on the residuals from the second step, that is, \delta^{yz}_i = \hat{\beta}' \delta^{xz}_i + \epsilon_i.

    The estimate \hat\beta' is exactly equal to \hat{\beta} in both magnitude and significance!

Now let's figure out what all this means. In the first step, the regression of y on z associates the variance of y with the variance of z. By calculating the residuals of such a regression, we obtain the part of the variance of y that is not associated with z. In the second step, we similarly extract the part of the variance of x that is independent of z, that is, the residuals of the regression of x on z. The last step is the most interesting. We take the part of y unrelated to z and the part of x uncorrelated with z, regress the first on the second, that is, \delta^{yz} on \delta^{xz}, and oops, we get exactly the same coefficient that stood next to x in the original regression with two variables.

How to interpret this? It is practical proof that the coefficient on x in the two-variable regression describes only the relationship of y with the part of x that is independent of z. That is, the presence of z in the regression equation "cleans" the coefficient on x, leaving in it only the relationship of y with the independent part of x. Even though x and z are correlated and, it would seem, a simple OLS fit could produce biased coefficients. But no. This is literally a demonstration that in multiple regression, even with correlated regressors, OLS cleanly splits the variation of the dependent variable into independent parts. And this will be useful to us later for CUPED.

At this point, I will stop torturing you with hand-waving arguments about "related and unrelated variances" and give a simple empirical demonstration for the case under consideration.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(321)

# setting coefficients 
alpha = 1
beta = 1.2
gamma = 0.8

# number of observations and noise
n_obs = 100
std_noise = 5
noise = np.random.normal(0, std_noise, n_obs)

# generating correlated x and z
mean_x, mean_z = 0, 0
std_x, std_z = 1, 1
corr_xz = 0.1
cov_m = [[std_x**2,            std_x*std_z*corr_xz],
         [std_x*std_z*corr_xz, std_z**2]]
x, z = np.random.multivariate_normal([mean_x, mean_z], cov_m, n_obs).T

# constructing y
y = alpha + beta * x + gamma * z + noise

# transforming to dataframe
data = pd.DataFrame({'x': x, 'z': z, 'y': y})

# regression with two variables 
reg_0 = smf.ols('y ~ 1 + x + z', data=data).fit()

# algorithm from theorem
reg_yz = smf.ols('y ~ 1 + z', data=data).fit()
data['delta_yz'] = data['y'] - reg_yz.predict()

reg_xz = smf.ols('x ~ 1 + z', data=data).fit()
data['delta_xz'] = data['x'] - reg_xz.predict()

reg_deltas = smf.ols('delta_yz ~ 1 + delta_xz', data=data).fit()

# comparing results 
print("beta =",     round(reg_0.params['x'], 5), 
      "; pvalue =", round(reg_0.pvalues['x'], 3))
print("beta' =",    round(reg_deltas.params['delta_xz'], 5), 
      "; pvalue' =", round(reg_deltas.pvalues['delta_xz'], 3))

# output:
# beta = 1.21274 ; pvalue = 0.014
# beta' = 1.21274 ; pvalue' = 0.014

There are a couple of obvious but still necessary remarks here. First, everything said above holds for any term in the regression equation: the algorithm can be repeated for z to obtain the same coefficient \hat\gamma. Second, the statement obviously also holds when x and z are uncorrelated, i.e. share no common variance. In that case the algorithm becomes even simpler, since we do not need to extract the part of x that is independent of z: x is already entirely independent of it, and the algorithm reduces to two steps.

Algorithm for the case when x and z are uncorrelated (a code sketch of it follows right after this list):

  1. regress y on z: y_i = \hat{\alpha}' + \hat{\gamma}' z_i + \delta^{yz}_i and calculate the regression residuals \delta^{yz}_i = y_i - \hat{y}_i;

  2. regress these residuals on x: \delta^{yz}_i = \hat{\beta}' x_i + \epsilon_i.

    That's it: the resulting estimate \hat\beta' is identical to the estimate \hat\beta from the regression y_i = \hat\alpha + \hat\beta x_i + \hat\gamma z_i + \epsilon_i.
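
Here is a small empirical sketch of this two-step shortcut (the data-generating setup is purely illustrative and mine, not from the original article): with x and z generated independently, the two-step estimate matches the coefficient on x from the full regression up to the tiny sample correlation between x and z.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(321)

n_obs = 1000
x = np.random.normal(0, 1, n_obs)   # x and z are generated independently
z = np.random.normal(0, 1, n_obs)
y = 1 + 1.2 * x + 0.8 * z + np.random.normal(0, 5, n_obs)
data = pd.DataFrame({'x': x, 'z': z, 'y': y})

# full regression with both variables
reg_full = smf.ols('y ~ 1 + x + z', data=data).fit()

# two-step shortcut: residualize y on z, then regress the residuals on x
data['delta_yz'] = data['y'] - smf.ols('y ~ 1 + z', data=data).fit().predict()
reg_short = smf.ols('delta_yz ~ 1 + x', data=data).fit()

# the two estimates coincide up to the small sample correlation between x and z
print(round(reg_full.params['x'], 4), round(reg_short.params['x'], 4))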

And it is precisely this case that is directly related to CUPED A/B tests.

The connection of the theorem with CUPED

An attentive reader will have already noticed the similarity between the sequence of actions in CUPED and the algorithm from the theorem: estimate some regression, calculate residuals, and then do something with those residuals. It may seem that the last steps of the two algorithms differ: in CUPED a t-test is applied to the residuals, while in the theorem the last step is a regression. In fact, there is no difference, once we recall that a t-test can be performed as a regression of the following form:

metric_i = \hat\alpha + \hat\beta * \mathbb I [user_i \in testSample] + \epsilon_i ,

where metric_i is the metric of the i-th user (both groups are combined into one vector), and \mathbb I is an indicator equal to 1 if the user belongs to the test sample and 0 if to the control sample. The coefficient \hat\beta equals the difference in means between the groups, and its significance exactly coincides with the p-value of a regular t-test.
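
A quick check of this equivalence on synthetic data (the group sizes and effect size here are arbitrary, chosen only for illustration): the p-value of a regular Student t-test coincides with the p-value of the dummy coefficient, and the coefficient itself equals the difference in means.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind
np.random.seed(321)

control = np.random.normal(0.0, 1, 500)
test = np.random.normal(0.1, 1, 500)
df = pd.DataFrame({'metric': np.append(test, control),
                   'testDummy': np.append([1] * len(test), [0] * len(control))})

# Student t-test (equal variances) vs regression on the group dummy
t_stat, pvalue_ttest = ttest_ind(test, control)
reg = smf.ols('metric ~ 1 + testDummy', data=df).fit()

print(round(pvalue_ttest, 6), round(reg.pvalues['testDummy'], 6))                  # identical p-values
print(round(test.mean() - control.mean(), 6), round(reg.params['testDummy'], 6))   # identical effect estimates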

Okay, so the CUPED algorithm is actually completely identical to the algorithm from the theorem (for the case of uncorrelated regressors), and the same intuition applies to CUPED. In the test we measure the difference in sample means, and the variance of a sample mean is directly proportional to the variance of the metric itself. The variance of the metric can in turn be split into two terms: a constant part \mathbb Var(Sample), related to the fact that the sample consists of different people with different activity levels, and a component \mathbb Var(PeriodNoise), caused by the fact that the activity of people in the sample is not constant but changes from period to period. That is:

\mathbb Var(\hat\beta) \propto \mathbb Var(metric) = \mathbb Var (Sample) + \mathbb Var(PeriodNoise),

where \hat\beta denotes the estimated difference in means between the groups. With a fixed sample, the first term is constant from period to period, i.e. it is the same in both the test and pre-test periods. When, in a CUPED test, we build a regression of the form metric_i = \hat\alpha + \hat\theta * preMetric_i + \epsilon_i, the component \hat\theta * preMetric_i describes precisely the part of the metric's variance associated with the heterogeneity of the participants. It then follows that the meaning of the quantity metric_i^{(cuped)} = metric_i - \hat\theta * preMetric_i is the part of the metric's variation cleared of the effect of sample heterogeneity, containing only the variation that arose during the experiment:

\mathbb Var(\hat\beta') \propto \mathbb Var(metric^{(cuped)}) = \mathbb Var(metric) - \mathbb Var(Sample) = \mathbb Var(PeriodNoise)
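
As a sanity check of this decomposition, here is a small synthetic sketch (a toy setup of my own, not from the original article): each user's metric is a stable personal level plus period noise, and subtracting \hat\theta * preMetric removes most of the variance tied to user heterogeneity.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(321)

n_obs = 10_000
user_level = np.random.lognormal(0, 1, n_obs)   # stable per-user component, shared by both periods
df = pd.DataFrame({'preMetric': user_level + np.random.normal(0, 1, n_obs),
                   'metric':    user_level + np.random.normal(0, 1, n_obs)})

theta = smf.ols('metric ~ 1 + preMetric', data=df).fit().params['preMetric']
df['metricCuped'] = df['metric'] - theta * df['preMetric']

# the variance associated with user heterogeneity is largely removed
print(round(df['metric'].var(), 2), round(df['metricCuped'].var(), 2))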

We've sorted out the interpretation. And that means we can finally move on to the practical part – an alternative algorithm for CUPED tests, which consists of one step.

CUPED tests in one step!

Let's put together the three conclusions we got above:

  1. The regressor partitioning theorem states that the coefficients in a multiple regression can be obtained alternatively by using a sequence of regressions with fewer variables.

  2. This algorithm is completely identical to the sequence of actions in CUPED.

  3. A t-test can be performed via a regression.

Given these facts, a question naturally arises: can we run the theorem in reverse and represent the CUPED algorithm as a single multiple regression? Of course we can! That's what this whole article has been building up to!

Claim: The original CUPED algorithm is completely identical to the linear regression of the following form:

metric_i = \hat\alpha + \hat\beta * \mathbb I [i \in testSample] + \hat\gamma * preMetric_i + \epsilon_i,

where metric_i is the value of the metric for the i-th user during the experiment, preMetric_i is the value of the metric for the same user during the pre-test period, and \mathbb I [i \in testSample] is the indicator of whether the user belongs to the test group (1 for the test group, 0 for the control group).

The estimate of the coefficient \hat\beta reflects the magnitude and significance of the difference in means between the groups, and its variance is similar to the variance of the estimate from the standard CUPED algorithm. Here is the proof:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import ttest_ind
np.random.seed(321)

n_obs = 1000
uplift = 0.12   # +12%

# generate pre-experiment metrics of clients
pre_c = np.random.lognormal(mean=0, sigma=1, size=n_obs)
pre_t = np.random.lognormal(mean=0, sigma=1, size=n_obs)

# generate noise 
noise_c = np.random.lognormal(mean=0, sigma=0.8, size=n_obs)
noise_t = np.random.lognormal(mean=0, sigma=0.8, size=n_obs)

# construct experiment metrics, using pre-experiment, uplift and noise
c = pre_c * (1 + noise_c)
t = pre_t * (1 + uplift + noise_t)

# pack into dataframe
df = pd.DataFrame({'preMetric': np.append(pre_t, pre_c),
                   'metric': np.append(t, c), 
                   'testDummy': np.append([1] * len(t), [0] * len(c))})

# CUPED basic algorithm
cuped_reg = smf.ols('metric ~ 1 + preMetric', data=df).fit()
delta = (df.metric - cuped_reg.params['preMetric'] * df.preMetric).values
delta_t, delta_c = delta[0:len(t)], delta[len(t):]
t_stat_cuped, pvalue_cuped = ttest_ind(delta_t, delta_c, equal_var=False)

# CUPED via single regression using theorem
theorem_reg = smf.ols('metric ~ 1 + testDummy + preMetric', data=df).fit()
pvalue_theorem = theorem_reg.pvalues['testDummy']

# compare results
print('p-value cuped =', round(pvalue_cuped, 5))
print("p-value regression =", round(pvalue_theorem, 5))

# Output: 
# p-value cuped = 0.013
# p-value regression = 0.013

That's it! Now you can analyze CUPED tests with a single regression. The entire analysis fits into the line smf.ols('metric ~ 1 + testDummy + preMetric', data=df).fit(). And no three-step algorithms with residual calculations. Admittedly, this is not an incredibly useful life hack: the basic CUPED algorithm fits into 3-4 lines of code. The main value of this trick, and of this entire article, lies elsewhere: to learn about a theorem that demonstrates the principles of regression, to see its connection with CUPED A/B tests, to look at this method from a different angle and, as a result, to deepen your understanding of it. I hope it was useful!
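
A small usage note, continuing from the theorem_reg fit in the block above: the effect estimate and its confidence interval come straight out of the same regression object.

# continuing from the previous block:
# theorem_reg = smf.ols('metric ~ 1 + testDummy + preMetric', data=df).fit()
effect = theorem_reg.params['testDummy']                   # estimated difference in means between groups
ci_low, ci_high = theorem_reg.conf_int().loc['testDummy']  # 95% confidence interval for the effect
print(round(effect, 4), round(ci_low, 4), round(ci_high, 4))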

Appendix: General case of the theorem

It would be unforgivable to formulate the theorem on partitioning regressors only for a special case. Let's close this gestalt here. Suppose we have a multiple regression with two (conditional) groups of regressors:

y_i = \alpha + \beta_1 x_{1,i} + ... + \beta_n x_{n,i} + \gamma_1 z_{1,i} + ... + \gamma_m z_{m,i} + \epsilon_i

Let's rewrite the equation in vector form, collecting all the x and all the z into two groups (the constant \alpha can, for convenience, also be assigned to one of the groups). We obtain an equation of the following form:

\mathbb Y = \mathbb X \beta + \mathbb Z \gamma + \epsilon,

where \beta and \gamma are the vectors of regression coefficients for the regressor matrices \mathbb X and \mathbb Z.

Statement: the estimate of the coefficient vector \beta obtained in this regression is identical in magnitude and significance to the estimate of the coefficient vector \beta' in a regression of the following form:

\mathbb M_Z \mathbb Y = \mathbb M_Z \mathbb X \beta' + \epsilon'

where \mathbb M_Z \mathbb Y denotes the projection of \mathbb Y onto the orthogonal complement of the image \hat{\mathbb Y} = \mathbb Z \hat{\gamma}; simply put, the residuals from the regression of \mathbb Y on the group of regressors \mathbb Z. \mathbb M_Z \mathbb X has a similar interpretation: the residuals from the regression of \mathbb X on \mathbb Z.
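
For reference, \mathbb M_Z here is the standard residual-maker (annihilator) matrix; written out explicitly, this is the textbook formula rather than anything specific to this article:

\mathbb M_Z = \mathbb I - \mathbb Z (\mathbb Z^\top \mathbb Z)^{-1} \mathbb Z^\top ,

so that \mathbb M_Z \mathbb Y is exactly the vector of residuals from regressing \mathbb Y on \mathbb Z, and likewise for \mathbb M_Z \mathbb X.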

In general, everything is exactly the same as in the special case, only not for a single coefficient but for a whole group of them. I will not reproduce the proof of the theorem here; a link to it is given below.
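
And a short numerical sanity check of the general statement (toy data generated only for this demonstration), this time with plain numpy instead of statsmodels:

import numpy as np
np.random.seed(321)

n, k, m = 500, 2, 3
X = np.random.normal(size=(n, k))
Z = np.column_stack([np.ones(n), np.random.normal(size=(n, m))])  # the constant goes into the Z group
y = X @ np.array([1.5, -0.7]) + Z @ np.array([1.0, 0.3, -0.2, 0.5]) + np.random.normal(0, 2, n)

# full regression of y on [X, Z]
beta_full = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)[0][:k]

# FWL: residualize both y and X on Z, then regress the residuals
M_Z = np.eye(n) - Z @ np.linalg.inv(Z.T @ Z) @ Z.T
beta_fwl = np.linalg.lstsq(M_Z @ X, M_Z @ y, rcond=None)[0]

print(np.round(beta_full, 6))
print(np.round(beta_fwl, 6))   # identical to the X-coefficients from the full regression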

Useful materials and sources

  1. Derivation of the regressor partitioning theorem (Frisch–Waugh–Lovell theorem).

  2. A detailed analysis of CUPED testing with reference to the regressor partitioning theorem.
