Why and how to tame ratio metrics in A/B tests
❗main idea❗
Linearization lets you move from a ratio metric with dependent observations to an average user metric with independent observations. In doing so:

- In experiments, the difference between linearized metrics preserves the direction of the observed change in the target ratio metric;
- The statistical significance of the observed effect on the new metric is consistent with that of the original ratio metric and can be computed with a t-test;
- Sensitivity enhancement techniques such as CUPED or stratification can be applied to the linearized metric.
A little reminder about terms
By signals we mean any numeric information coming from an individual user, order, banner, or other entity. In an experiment, a metric is an aggregate over these signals (usually the average). For example:
| Signal | Metric |
|---|---|
| User spending | Average user spend (ARPU) |
| Has the user placed an order? | User conversion to order |
| Number of items in an order | Average number of items per order |
| Did the banner impression convert into a click? | CTR |
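As a minimal illustration with made-up numbers and hypothetical column names, here is how per-user signals aggregate into metrics:

```python
import pandas as pd

# Hypothetical per-user signals
signals = pd.DataFrame({
    "user_id":   [1, 2, 3, 4],
    "spend":     [100.0, 0.0, 50.0, 0.0],  # user spending
    "has_order": [1, 0, 1, 0],             # has the user placed an order?
})

arpu = signals["spend"].mean()            # average user spend (ARPU) -> 37.5
conversion = signals["has_order"].mean()  # user conversion to order -> 0.5
```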
1. Formally about ratio metrics
Let us introduce two terms: the unit of analysis and the randomization unit.
The unit of analysis is the entity for which we want to calculate the metric. It can be a user, a session, an order, a button, a banner, a time period, etc.
For example, take the total amount of money received over a certain period. If we divide (normalize) it by the number of users, we get ARPU; if by the number of orders, we get the average order value.
Alternatively, you can group the data by user, compute each user's spending, and build the distribution of those values: its mean is ARPU. Doing the same with orders, the mean of the order-price distribution is the average order value. We are simply redistributing the same total over different units of analysis.
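The equivalence of the two views can be sketched in pandas (hypothetical data and column names):

```python
import pandas as pd

# Hypothetical order log: the same total revenue viewed through two units of analysis
orders = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "order_id": [10, 11, 12],
    "price":    [100.0, 50.0, 30.0],
})

total = orders["price"].sum()               # 180.0
arpu = total / orders["user_id"].nunique()  # normalize by users  -> 90.0
aov = total / orders["order_id"].nunique()  # normalize by orders -> 60.0

# Equivalently: the mean of the per-user distribution is ARPU,
# the mean of the per-order distribution is the average order value.
assert arpu == orders.groupby("user_id")["price"].sum().mean()
assert aov == orders["price"].mean()
```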
The randomization unit is the entity we randomly assign to the test or control group in an A/B test; it can also be a user, a session, an order, or a time period. In practice it is most often users that are randomized, so we will use them in the examples below.
Randomization helps neutralize the influence of unobservable factors that could distort the results of the experiment. These include any nominal characteristic of the user, such as age, gender, or geolocation, as well as fuzzier ones: “overly active”, “only clicks on profitable promotions”, “unsure / doesn’t know what to choose”, and so on.
Thanks to randomization, we know that an observed difference between the metric values in the groups arose for one of two reasons:

- either because of our experimental mechanics;
- or by pure chance, i.e. random assignment happened to place higher-performing users in the same group.
In addition, randomization ensures the independence of signals received from users: the spending or clicks of one user do not depend on the actions of another. Independence of observations is one of the key assumptions for applying a statistical test to the data.
When the unit of analysis coincides with the randomization unit, a statistical test can safely be used to check the hypothesis. This applies to any user conversion and any average user metric, since for such a metric we have a single sample of independent user signals.
Because statistical tests require independent observations, all signals in A/B tests should be defined in terms of the randomization unit, i.e. the user.
When you want to calculate a metric for a unit of analysis different from the randomization unit, a ratio metric appears; its formal definition has a peculiarity: it is expressed through the randomization unit.
The unit of analysis of interest in an experiment is always dependent on the randomization unit. The dependent entities and their signals can therefore be “collapsed” into two corresponding user signals. For example, if a user has three orders with certain prices, it is enough to record, for that user, the number of orders and the total spend across them. Having done this for all users, you can sum the two new columns and divide one sum by the other, obtaining the average value per unit of analysis of interest, in our case, the average order value.
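A sketch of this “collapsing” in pandas, with hypothetical column names:

```python
import pandas as pd

# Hypothetical order log: several orders per user
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "price":   [100.0, 50.0, 30.0, 20.0],
})

# Collapse each user's orders into two user-level signals
user_signals = orders.groupby("user_id").agg(
    spend=("price", "sum"),      # numerator signal
    n_order=("price", "count"),  # denominator signal
)

# Ratio metric (average order value) as a ratio of sums of user signals
aov = user_signals["spend"].sum() / user_signals["n_order"].sum()  # -> 50.0
```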
A ratio metric is a non-user-level metric with dependent observations that is explicitly expressed as the ratio of the sums of the corresponding user signals.
2. Why can’t ratio metrics be evaluated with a t-test?
Formally, because the independence assumption is violated: order prices, session lengths, impression-to-click conversions, etc. within a single user should be considered correlated. The units of analysis belonging to one user, and the values of their signals, depend on the characteristics of that very user.
Informally, this can be checked with synthetic A/A tests: randomly split users into groups and assess statistical significance, for example, over their orders. The p-value distribution for the ratio metric will then deviate from uniform, while the p-value distribution for the average user metric over the same signals will be correctly uniform.
Python code

```python
import numpy as np
import pandas as pd
from scipy import stats
from tqdm import tqdm


def aa_testing(df, stat_test, n_trials=10_000, alpha=0.05):
    rng = np.random.default_rng()
    p_vals_ratio = []
    p_vals_avg = []
    users = df['user_id'].unique()
    # A/A simulations
    for _ in tqdm(range(n_trials)):
        usr_hits = np.where(rng.random(len(users)) < 0.5, 't', 'c')
        tmp_df = pd.DataFrame(data=usr_hits, index=users, columns=['group'])
        aa_df = tmp_df.merge(df, left_index=True, right_on='user_id')
        # for the ratio metric (order prices)
        ordr_price_df = aa_df.groupby(['group', 'order_id'], as_index=False).agg(price=('gmv', 'sum'))
        t_price = ordr_price_df[ordr_price_df.group == 't'].price.values
        c_price = ordr_price_df[ordr_price_df.group == 'c'].price.values
        p_vals_ratio.append(stat_test(t_price, c_price))
        # for the user-level metric (spend per user)
        usr_spend_df = aa_df.groupby(['group', 'user_id'], as_index=False).agg(spend=('gmv', 'sum'))
        t_spend = usr_spend_df[usr_spend_df.group == 't'].spend.values
        c_spend = usr_spend_df[usr_spend_df.group == 'c'].spend.values
        p_vals_avg.append(stat_test(t_spend, c_spend))
    fpr_ratio = sum(np.array(p_vals_ratio) < alpha) / n_trials
    fpr_avg = sum(np.array(p_vals_avg) < alpha) / n_trials
    return p_vals_ratio, fpr_ratio, p_vals_avg, fpr_avg


def t_test(s1, s2):
    return stats.ttest_ind(s1, s2).pvalue


# order_df: an order-level log with user_id, order_id and gmv columns
p_vals_ratio, fpr_ratio, p_vals_avg, fpr_avg = aa_testing(order_df, stat_test=t_test)
```
It turns out that if you evaluate ratio metrics with a plain t-test, the type I error rate in your experiments will be inflated, leading to a noticeable number of useless mechanics shipped to the product without any positive impact on the business.
A logical question then arises: how do we evaluate the statistical significance of a ratio metric? There are three common options:

1. Compute a proxy metric: the pre-averaged average value per user.
2. Use bootstrap.
3. Use the delta method.

Each of them has its own disadvantages. Let's look at them using the average order value, banner CTR, and average session length as examples.
3. Methods for statistical assessment of the difference in ratio metrics and their disadvantages
3.1. Pre-averaged average
A naive solution is to convert the ratio metric into an average user proxy metric by first averaging the corresponding signals within each user. This yields the pre-averaged average of the user values. The picture below shows the general formulas for per-user average order value, per-user CTR, and per-user average session length.
For such a metric it is correct to compute significance with standard statistical tests, since these signals are independent. However, the metric has a problem: in experiments, the observed effect in the pre-averaged average and in the target ratio metric may have different signs, i.e. the effects can point in opposite directions. For example, the per-user average order value may increase statistically significantly while the actual global average order value has fallen. This can be verified either on a corpus of experiments or with simulations.
In the picture, the x-axis shows the observed change in the pre-averaged mean and the y-axis the effect in the ratio metric. Co-directional changes are shown as green dots, opposite-directional ones as red dots. Since the effects in the two metrics need not share a direction in experiments, the pre-averaged average is a poor proxy for measuring statistical significance of a ratio metric.
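A tiny synthetic illustration of such a divergence (made-up numbers, not taken from the experiments above):

```python
import numpy as np

# Each inner list holds one user's order prices
control = [[100.0], [10.0]]
test = [[110.0], [8.0] * 10]  # the second user now places many cheap orders

def global_aov(groups):
    # target ratio metric: total revenue / total number of orders
    prices = np.concatenate(groups)
    return prices.sum() / len(prices)

def pre_averaged(groups):
    # proxy metric: mean of per-user average order values
    return np.mean([np.mean(u) for u in groups])

# The pre-averaged average goes up (55 -> 59)...
assert pre_averaged(test) > pre_averaged(control)
# ...while the actual average order value falls (55 -> ~17.3)
assert global_aov(test) < global_aov(control)
```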
3.2. Bootstrap
When you don’t know how to calculate statistical significance for your metric, bootstrap can help.
In essence, the problem with assessing the difference in ratio metrics via a t-test comes from the dependence of observations: you cannot simply take the sample variance and estimate the standard error of the mean for the difference in ratio metrics. With bootstrap, you can.
For ratio metrics, resample users with replacement within each group, take their corresponding numerator and denominator signals, compute the difference of the bootstrapped ratio metrics, and repeat this many, many times. The result is an empirical distribution of the difference in ratio metrics: its mean is the observed effect in the experiment, and its spread gives the empirical standard error. From this distribution you can compute the coveted p-value.
How to bootstrap ratio metrics in Python

```python
import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def bootstrap_ratio(num_t, denom_t, num_c, denom_c, n_trials=5_000):
    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)
    rng = np.random.default_rng()
    ratio_diff_distrib = []
    # bootstrap iterations
    for _ in range(n_trials):
        users_idx_t = rng.choice(np.arange(0, len(num_t)), size=len(num_t))
        boot_num_t = num_t[users_idx_t]
        boot_denom_t = denom_t[users_idx_t]
        ratio_t = boot_num_t.sum() / boot_denom_t.sum()
        users_idx_c = rng.choice(np.arange(0, len(num_c)), size=len(num_c))
        boot_num_c = num_c[users_idx_c]
        boot_denom_c = denom_c[users_idx_c]
        ratio_c = boot_num_c.sum() / boot_denom_c.sum()
        ratio_diff_distrib.append(ratio_t - ratio_c)
    # two-sided p-value from the empirical distribution of the difference
    mean = np.mean(ratio_diff_distrib)
    se = np.std(ratio_diff_distrib)
    quant = stats.norm.cdf(x=0, loc=mean, scale=se)
    p_value = quant * 2 if 0 < mean else (1 - quant) * 2
    return p_value
```
Bootstrap produces a correct p-value for ratio metrics, but the drawback is obvious: bootstrap is computationally expensive and does not scale within an experimentation platform where dozens of ratio metrics may be calculated.
3.3. Delta method
At its core, the delta method does the same thing as bootstrap, but analytically rather than empirically, via a formula. Using it, you can correctly estimate the variance of the ratio metric in the test and control groups; knowing the variances and the number of users in each group, you can estimate the standard error for the difference in ratio metrics and, accordingly, the t-statistic and p-value. Read more about the delta method in the original article or from colleagues.
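In symbols, for per-user numerator signals $X$ and denominator signals $Y$ with means $\mu_X, \mu_Y$ and variances $\sigma_X^2, \sigma_Y^2$, the delta method approximates the variance of the ratio of means as:

```latex
\operatorname{Var}\left(\frac{\bar{X}}{\bar{Y}}\right)
\approx \frac{1}{n}\left(
    \frac{\sigma_X^2}{\mu_Y^2}
    - \frac{2\,\mu_X}{\mu_Y^3}\operatorname{cov}(X, Y)
    + \frac{\mu_X^2}{\mu_Y^4}\,\sigma_Y^2
\right)
```

In the code below, `est_ratio_var` returns the bracketed part; the division by the group size `n` happens later, when the standard error of the difference is assembled.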
Delta method in Python

```python
import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def delta_method(num_t, denom_t, num_c, denom_c):
    def est_ratio_var(num, denom):
        mean_num, mean_denom = np.mean(num), np.mean(denom)
        # ddof=1 to match the sample covariance returned by np.cov
        var_num, var_denom = np.var(num, ddof=1), np.var(denom, ddof=1)
        cov = np.cov(num, denom)[0, 1]
        # main formula for estimating the variance of a ratio metric
        ratio_var = (
            (var_num / mean_denom**2)
            - (2 * (mean_num / mean_denom**3) * cov)
            + ((mean_num**2 / mean_denom**4) * var_denom)
        )
        return ratio_var

    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)
    ratio_t, ratio_c = num_t.sum() / denom_t.sum(), num_c.sum() / denom_c.sum()
    var_t, var_c = est_ratio_var(num_t, denom_t), est_ratio_var(num_c, denom_c)
    n_t, n_c = len(num_t), len(num_c)
    uplift = ratio_t - ratio_c
    se = np.sqrt(var_t / n_t + var_c / n_c)
    t = uplift / se
    p_value = 2 * (1 - stats.norm.cdf(abs(t)))
    return p_value
```
If we compute statistical significance with the bootstrap and the delta method on identical samples of the difference in ratio metrics, the p-values of the two methods coincide to sufficient accuracy and show a linear relationship with unit slope. In other words, the delta-method p-values are consistent with those obtained via bootstrap. Accordingly, in A/A tests the delta method gives uniform p-value distributions for ratio metrics and does not inflate the type I error.
This approach also has a drawback. Since we do not work with user signals directly, sensitivity enhancement methods cannot be applied to ratio metrics: in experiments, CUPED can be used for user conversions and average user metrics, but not for ratio metrics.
But there is one excellent method that does not have all the disadvantages described above.
4. Linearization
Let's define a function that linearizes two user signals into a single signal per user. From these new signals we build a new average user linearized metric, which serves as a proxy for the original ratio metric.
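In symbols (using $X_i, Y_i$ for user $i$'s numerator and denominator signals and $C$ for the control group), the linearized signal is:

```latex
L_i = X_i - R^{C} Y_i,
\qquad
R^{C} = \frac{\sum_{j \in C} X_j}{\sum_{j \in C} Y_j}
```

Here $R^{C}$ is simply the value of the ratio metric in the control group, so the coefficient is a constant computed once per experiment.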
Options in Python

A simple pandas version:

```python
# user_df holds user-level signals with the experiment group column
is_cntrl = user_df.group == 'c'

# linearized average order value: spend and order counts per user
AOV_cntrl = user_df[is_cntrl].spend.sum() / user_df[is_cntrl].n_order.sum()
user_df['lin_aov'] = user_df.spend - AOV_cntrl * user_df.n_order

# linearized CTR: banner clicks and views per user
CTR_cntrl = user_df[is_cntrl].n_bnr_clkd.sum() / user_df[is_cntrl].n_bnr_vwd.sum()
user_df['lin_ctr'] = user_df.n_bnr_clkd - CTR_cntrl * user_df.n_bnr_vwd
```

In general, for arrays of user signals:

```python
import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def linearization(num_t, denom_t, num_c, denom_c):
    def lin(num, denom, cntrl_ratio):
        return num - cntrl_ratio * denom

    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)
    CNTRL_ratio = num_c.sum() / denom_c.sum()
    lin_signals_t = lin(num_t, denom_t, CNTRL_ratio)
    lin_signals_c = lin(num_c, denom_c, CNTRL_ratio)
    p_val = stats.ttest_ind(lin_signals_t, lin_signals_c).pvalue
    return p_val
```
What are the values of the linearized signals, how should we think of them, and how do we interpret the new metric?
Linearized signals can be treated as each unit's contribution to the change of the ratio metric from its value in the control to the observed value in the test. This can be demonstrated clearly by the distributions of these contributions in the experimental groups.
The mathematical expectation of the linearized metric in the control group is always zero: in total there is no contribution in the control, since contributions are measured relative to the ratio metric in the control. In the test group the average contribution will be some other value, positive or negative.
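A quick numeric check of this zero-mean property on synthetic signals (assumed names, made-up data):

```python
import numpy as np

# Hypothetical control-group user signals
rng = np.random.default_rng(7)
n_order = rng.integers(1, 5, size=1000)                # denominator signal
spend = n_order * rng.uniform(10, 100, size=1000)      # numerator signal

aov_c = spend.sum() / n_order.sum()                    # ratio metric in control
lin = spend - aov_c * n_order                          # linearized signals

# Zero by construction:
# mean(spend) - AOV_c * mean(n_order) = (sum(spend) - AOV_c * sum(n_order)) / n = 0
assert abs(lin.mean()) < 1e-6
```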
At the same time, the linearized metric has a number of positive features.
First, unlike the pre-averaged average, the difference between linearized metrics always shares direction with the change in the target ratio metric. For example, if CTR increased or decreased in an experiment, the linearized CTR always changes in the same direction.
Second, linearized user signals can be considered independent, so their statistical significance can be assessed with a t-test. Moreover, the p-values for the linearized metric are consistent with the values obtained via the delta method on the original ratio metric. This means the delta method can be replaced by linearization, yielding the same p-values to sufficient accuracy. Accordingly, linearization shows correct results in A/A tests.
Third, linearization makes it possible to apply sensitivity enhancement techniques to ratio metrics, reducing the group sizes needed to detect effects or increasing the power of observed results. For example, to use CUPED with a ratio metric, linearize it and also linearize the corresponding signals from the pre-experiment period. You then have two average user metrics to which CUPED can be applied.
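A minimal sketch of CUPED on linearized signals, assuming `lin_exp` and `lin_pre` hold the experiment-period and pre-period linearized signals of the same users (hypothetical names; in practice theta is often estimated on pooled test and control data):

```python
import numpy as np

def cuped_adjust(lin_exp, lin_pre):
    # theta minimizes the variance of the adjusted metric
    theta = np.cov(lin_exp, lin_pre)[0, 1] / np.var(lin_pre, ddof=1)
    # centering the covariate keeps the metric's mean unchanged
    return lin_exp - theta * (lin_pre - lin_pre.mean())
```

The adjusted signals keep the same mean as `lin_exp` but have lower variance whenever the pre-period signals are correlated with the experiment-period ones, which is what increases the power of the t-test.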
The picture below shows that this criterion turns out to be more powerful than plain linearization and the delta method.
Conclusion
Linearization in A/B tests is a cheap-to-compute and highly scalable way to transform a ratio metric into an average user metric. It preserves the direction of the observed effect relative to the change in the target ratio metric. In experiments, the difference between linearized metrics also has a significance level consistent with that of the original ratio metric and is computed with a t-test. And since linearization gives us user-level signals, sensitivity enhancement methods become applicable to ratio metrics.
And it is the linearization of ratio metrics using CUPED that is implemented in our A/B testing platform.
References
Original article – Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments;
Ilya Katsev – How to measure user happiness;
Roman Budylin – How to create sensitive metrics for AB testing;
Vitaly Polshkov – Effective A/B testing.