Why and how to tame ratio metrics in A/B tests

Main idea

Linearization allows you to move from a ratio metric with dependent observations to an average user metric with independent observations. In particular:

  • In experiments, the difference between linearized metrics preserves the direction of the observed change in the target ratio metric;

  • The level of statistical significance of the observed effect for the new metric is consistent with that of the original ratio metric and is calculated with a t-test;

  • Sensitivity enhancement techniques such as CUPED or stratification can be applied to the linearized metric.

A little reminder about terms

By signals we mean any numeric information coming from an individual user, order, banner, or any other entity. In an experiment, a metric is an aggregate over these signals (usually the average). For example:

Signal | Metric
User spending | Average spend per user (ARPU)
Has the user placed an order? | User conversion to order
Number of items in an order | Average number of items per order
Did a banner impression convert into a click? | CTR


1. Formally about ratio metrics

Let us introduce two terms, namely the unit of analysis and the randomization unit.

Unit of analysis – the entity with respect to which we want to calculate the metric. It can be a user, session, order, button, banner, time period, etc.

For example, take the total amount of money received over a certain period. If we divide (normalize) it by the number of users, we get ARPU; if by the number of orders, the average order value.

You can also group the data by users, compute each user's spend, and build the distribution of these values: its mean is ARPU. Doing the same for orders gives the distribution of order prices, whose mean is the average order value. We simply redistribute the same total amount across different units of analysis.
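For example (with made-up numbers): say 1,000,000 was received over the period from 10,000 users who placed 25,000 orders. Then ARPU = 1,000,000 / 10,000 = 100, while the average order value = 1,000,000 / 25,000 = 40: the same money, just spread over different units of analysis.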

Distribution of an array of money across different units of analysis

Unit of randomization – the entity that we randomly assign to the test or control group in an A/B test; it can also be a user, session, order, or time period. In practice, however, it is most often users that are randomized, and we will use them as the example in what follows.

Randomization helps neutralize the influence of unobservable factors that can distort the results of an experiment. These factors include any nominal characteristic of a user, such as age, gender, or geolocation, as well as fuzzier ones: "overly active", "only clicks on profitable promotions", "unsure/doesn't know what to choose", and so on.

Thanks to randomization, we know that an observed difference between the metric values in the groups arose for one of two reasons:

  • Either because of our experimental mechanics;

  • Or by pure chance, i.e. random assignment to groups may have put the higher-performing users into the same group.

In addition, randomization ensures the independence of signals received from users: the spending or clicks of one user do not depend on the actions of another. Independence of observations is one of the key assumptions for applying a statistical test to the data.

When the unit of analysis coincides with the unit of randomization, a statistical test can be safely used to check the hypothesis. This applies to any user conversion and any average user metric, since for such a metric we have a single sample of independent user signals.

Application of statistical test to different signals of the randomization unit (user)

Because statistical tests require independent observations, all signals in A/B tests should be defined in terms of the randomization unit, i.e. the user.

When you want to calculate a metric in an experiment relative to a unit of analysis different from the randomization unit, a ratio metric appears, and its peculiarity is that it has to be formally defined through the randomization unit.

The unit of analysis of interest in an experiment is always dependent on the randomization unit. The dependent entities and their signals can therefore be "collapsed" into two corresponding user-level signals. For example, if a user has three orders with certain prices, then for this user you just count the number of orders and the total spend on them. Having done this for all users, you sum the two new columns and divide one sum by the other, obtaining the average value for the unit of analysis of interest, in our case the average order value.

A ratio metric is a metric at a non-user level, with dependent observations, but one that can be explicitly expressed as a ratio of sums of the corresponding user signals.
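In notation (the symbols $X_u$ and $Y_u$ are chosen here for illustration): if for user $u$ we denote the numerator signal by $X_u$ (e.g. the user's total spend) and the denominator signal by $Y_u$ (e.g. the user's number of orders), then over $n$ randomized users the ratio metric is

$$
R \;=\; \frac{\sum_{u=1}^{n} X_u}{\sum_{u=1}^{n} Y_u},
$$

which for these particular signals is exactly the average order value.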

Formal definition of the average order value through the randomization unit. Ratio metric

2. Why can't ratio metrics be evaluated with a t-test?

Formally, because the assumption of independent observations is violated: order prices, session lengths, impression-to-click conversions, etc. within one user should reasonably be treated as correlated. The units of analysis belonging to one user, and the values of their signals, depend on the characteristics of that very user.

Informally, this thesis can be tested with synthetic A/A tests: randomly split users into groups and assess statistical significance on, for example, their orders. The p-value distribution for the ratio metric will then differ from uniform, while the p-value distribution for the average user metric built on the same signals will be correctly uniform.

Python code
import numpy as np
import pandas as pd
from scipy import stats
from tqdm import tqdm


def aa_testing(df, stat_test, n_trials=10_000, alpha=0.05):
    rng = np.random.default_rng()
    p_vals_ratio = []
    p_vals_avg = []

    users = df['user_id'].unique()

    # a/a simulations
    for _ in tqdm(range(n_trials)):
        # randomly split users into test ('t') and control ('c')
        usr_hits = np.where(rng.random(len(users)) < 0.5, 't', 'c')

        tmp_df = pd.DataFrame(data=usr_hits, index=users, columns=['group'])
        aa_df = tmp_df.merge(df, left_index=True, right_on='user_id')

        # for ratio metric: order-level prices (unit of analysis = order)
        ordr_price_df = aa_df.groupby(['group', 'order_id'], as_index=False).agg(price=('gmv', 'sum'))
        t_price = ordr_price_df[ordr_price_df.group == 't'].price.values
        c_price = ordr_price_df[ordr_price_df.group == 'c'].price.values
        p_vals_ratio.append(stat_test(t_price, c_price))

        # for user lvl metric: per-user spend (unit of analysis = user)
        usr_spend_df = aa_df.groupby(['group', 'user_id'], as_index=False).agg(spend=('gmv', 'sum'))
        t_spend = usr_spend_df[usr_spend_df.group == 't'].spend.values
        c_spend = usr_spend_df[usr_spend_df.group == 'c'].spend.values
        p_vals_avg.append(stat_test(t_spend, c_spend))

    # share of significant results = empirical false positive rate
    fpr_ratio = sum(np.array(p_vals_ratio) < alpha) / n_trials
    fpr_avg = sum(np.array(p_vals_avg) < alpha) / n_trials
    return p_vals_ratio, fpr_ratio, p_vals_avg, fpr_avg


def t_test(s1, s2):
    return stats.ttest_ind(s1, s2).pvalue


# order_df: one row per (user_id, order_id) with the order's gmv
p_vals_ratio, fpr_ratio, p_vals_avg, fpr_avg = aa_testing(order_df, stat_test=t_test)
Comparison of A/A tests on the same signals for the ratio metric and the user average

It turns out that if you evaluate ratio metrics with a plain t-test, the type I error rate in your experiments will be inflated, which leads to a significant increase in the number of dud mechanics shipped to the product without any positive impact on the business.

Then a logical question arises: how do we evaluate the statistical significance of a ratio metric in this case? There are three common options:

  1. Calculate using a proxy metric expressed as a pre-averaged average value per user.

  2. Using bootstrap.

  3. Using the delta method.

Each of them has its own disadvantages. Let's look at them using the examples of the average order value, banner CTR, and average session length.

3. Methods for statistical assessment of the difference in ratio metrics and their disadvantages

3.1. Pre-averaged average

A naive solution is to convert the ratio metric into an average user proxy metric by first averaging the corresponding signals within each user. This produces a pre-averaged average of the user values. The picture below shows the general formula, along with the average order value per user, CTR per user, and average session length per user.

Defining Pre-Averaged User Signals
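A minimal pandas sketch of the pre-averaged proxy, assuming a user-level user_df with per-user sums like the one used in the linearization section below (the column names spend, n_order, n_bnr_clkd, n_bnr_vwd are assumptions):

# pre-averaged proxies: first average within each user, then average across users
# (only defined for users with a non-zero denominator, e.g. at least one order / banner view)
user_df['avg_check_per_user'] = user_df.spend / user_df.n_order
user_df['ctr_per_user'] = user_df.n_bnr_clkd / user_df.n_bnr_vwd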

For such a metric, it is correct to assess significance with standard statistical tests, since these signals are independent. However, the metric has a problem: in experiments, the observed effect in the pre-averaged average and in the target ratio metric may have different signs, i.e. the effects may point in opposite directions. For example, the average order value per user may increase statistically significantly while the actual global average order value has fallen. This can be verified either on a corpus of experiments or with simulations.

Comparison of the co-directionality of effects in the pre-averaged average and in the target ratio metric

In the picture, the Ox axis shows the observed change in the pre-averaged mean and the Oy axis shows the effect in the ratio metric. Co-directional changes are shown as green dots, opposite-direction changes as red dots. Since experiments can show effects of opposite signs in these two metrics, the pre-averaged average is a poor proxy for measuring statistical significance of the ratio metric.

3.2. Bootstrap

When you don’t know how to calculate statistical significance for your metric, bootstrap can help.

In fact, the problem with assessing the difference in ratio metrics using a t-test stems from the dependence of observations: you cannot simply take the sample variance and estimate the standard error of the mean for the difference in ratio metrics. But with bootstrap you can.

For ratio metrics, you resample users with replacement within each group, take their corresponding numerator and denominator signals, calculate the difference of the bootstrapped ratio metrics, and repeat this many, many times. The result is an empirical distribution of the ratio-metric difference: its mean is the observed effect in the experiment, and its standard deviation is an empirical standard error of the mean. From this distribution the coveted p-value can already be calculated.

How to bootstrap ratio metrics in Python
import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def bootstrap_ratio(num_t, denom_t, num_c, denom_c, n_trials=5_000):
    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)
    rng = np.random.default_rng()
    ratio_diff_distrib = []

    # bootstrap iterations: resample users with replacement in each group
    for _ in range(n_trials):
        users_idx_t = rng.choice(np.arange(0, len(num_t)), size=len(num_t))
        boot_num_t = num_t[users_idx_t]
        boot_denom_t = denom_t[users_idx_t]
        ratio_t = boot_num_t.sum() / boot_denom_t.sum()

        users_idx_c = rng.choice(np.arange(0, len(num_c)), size=len(num_c))
        boot_num_c = num_c[users_idx_c]
        boot_denom_c = denom_c[users_idx_c]
        ratio_c = boot_num_c.sum() / boot_denom_c.sum()

        ratio_diff_distrib.append(ratio_t - ratio_c)

    # two-sided p-value from the normal approximation of the bootstrap distribution
    mean = np.mean(ratio_diff_distrib)
    se = np.std(ratio_diff_distrib)
    quant = stats.norm.cdf(x=0, loc=mean, scale=se)
    p_value = quant * 2 if 0 < mean else (1 - quant) * 2

    return p_value
  
Empirical distributions of differences in ratio metrics obtained by bootstrap

Bootstrap produces correct p-values for ratio metrics, but the disadvantage of the approach is obvious: bootstrap is computationally expensive and does not scale within an experimentation platform where dozens of ratio metrics may be calculated.

3.3. Delta method

At its core, the delta method does the same thing as bootstrap, but analytically rather than empirically, through a formula. Using it, you can correctly compute the variance of the ratio metric in the test and control groups; knowing the variances and the number of users in the groups, you can then estimate the standard error of the difference in ratio metrics and, accordingly, the t-statistic and p-value. Read more about the delta method in the original article or from colleagues in the industry.
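In essence (a standard first-order delta-method approximation, in the per-user notation introduced above, where $X_u$ and $Y_u$ are one user's numerator and denominator signals), the variance of the ratio of means within a group of $n$ users is estimated as

$$
\operatorname{Var}\!\left(\frac{\bar X}{\bar Y}\right) \;\approx\; \frac{1}{n}\left(\frac{\operatorname{Var}(X)}{\bar Y^{2}} \;-\; \frac{2\,\bar X\,\operatorname{Cov}(X, Y)}{\bar Y^{3}} \;+\; \frac{\bar X^{2}\,\operatorname{Var}(Y)}{\bar Y^{4}}\right),
$$

which is exactly the expression computed by est_ratio_var in the code below; the 1/n factor is applied later, when the standard error of the difference between the groups is assembled.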

Delta method in Python
import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def delta_method(num_t, denom_t, num_c, denom_c):
    def est_ratio_var(num, denom):
        mean_num, mean_denom = np.mean(num), np.mean(denom)
        var_num, var_denom = np.var(num), np.var(denom)
        cov = np.cov(num, denom)[0, 1]
        # main formula for estimating the variance of a ratio metric
        ratio_var = (
            (var_num / mean_denom**2)
            - (2 * (mean_num / mean_denom**3) * cov)
            + ((mean_num**2 / mean_denom**4) * var_denom)
        )
        return ratio_var

    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)

    ratio_t, ratio_c = num_t.sum() / denom_t.sum(), num_c.sum() / denom_c.sum()
    var_t, var_c = est_ratio_var(num_t, denom_t), est_ratio_var(num_c, denom_c)
    n_t, n_c = len(num_t), len(num_c)

    uplift = ratio_t - ratio_c
    se = np.sqrt(var_t / n_t + var_c / n_c)

    t = uplift / se
    # two-sided p-value from the normal approximation
    p_value = 2 * (1 - stats.norm.cdf(abs(t)))
    return p_value
The essence of the delta method

If we assess statistical significance with the bootstrap and the delta method on identical samples for the difference in ratio metrics, the p-values of the two methods coincide with sufficient accuracy and show a linear relationship with unit slope. We can say that the delta-method p-values are consistent with those obtained via bootstrap. Accordingly, in A/A tests the delta method gives uniform p-value distributions for ratio metrics and does not inflate the type I error.

Consistency of p-values of the delta method and bootstrap. Results of A/A tests of the delta method

The approach also has a disadvantage. Since we do not work directly with user signals, sensitivity-boosting methods cannot be applied to ratio metrics: in experiments, CUPED can be used for user conversions and average user metrics, but not for ratio metrics.

But there is one excellent method that is free of all the disadvantages described above.

4. Linearization

Let's define a function that linearizes two user signals into a single signal per user. From these new signals we will build a new average user linearized metric, which will serve as a proxy for the original ratio metric.
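Following the formula from the original article (in the same per-user notation as above), the linearized signal of user $u$ is

$$
L_u \;=\; X_u - k\,Y_u, \qquad k \;=\; \frac{\sum_{u \in \text{control}} X_u}{\sum_{u \in \text{control}} Y_u},
$$

i.e. the coefficient $k$ is simply the value of the ratio metric in the control group.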

Options in Python

A simple version in pandas

# user_df contains user-level signals with the experiment group info
is_cntrl = user_df.group == 'c'

# linearized average order value: per-user spend and order count
AOV_cntrl = user_df[is_cntrl].spend.sum() / user_df[is_cntrl].n_order.sum()
user_df['lin_aov'] = user_df.spend - AOV_cntrl * user_df.n_order

# linearized CTR: per-user banner clicks and views
CTR_cntrl = user_df[is_cntrl].n_bnr_clkd.sum() / user_df[is_cntrl].n_bnr_vwd.sum()
user_df['lin_ctr'] = user_df.n_bnr_clkd - CTR_cntrl * user_df.n_bnr_vwd

In general, for arrays with user signals

import numpy as np
from scipy import stats


def to_np_array(*arrays):
    res = [np.array(arr, dtype="float") for arr in arrays]
    return res if len(res) > 1 else res[0]


# num_t, num_c are arrays of user signals for the numerator in each group
# denom_t, denom_c are arrays of user signals for the denominator in each group
def linearization(num_t, denom_t, num_c, denom_c):
    def lin(num, denom, cntrl_ratio):
        return num - cntrl_ratio * denom

    num_t, denom_t, num_c, denom_c = to_np_array(num_t, denom_t, num_c, denom_c)

    # the value of the ratio metric in the control group is the linearization coefficient
    CNTRL_ratio = num_c.sum() / denom_c.sum()

    lin_signals_t = lin(num_t, denom_t, CNTRL_ratio)
    lin_signals_c = lin(num_c, denom_c, CNTRL_ratio)

    # linearized signals are user-level and independent, so a t-test is valid
    p_val = stats.ttest_ind(lin_signals_t, lin_signals_c).pvalue

    return p_val
  

What are the values of the linearized signals, how should we perceive them, and how do we interpret the new metric?

Linearized signals can be treated as each user's contribution to the change of the ratio metric from its initial value in the control to the observed value in the test. This can be clearly demonstrated by the distributions of these contributions in the experimental groups.

The mean of the linearized metric in the control group is always exactly zero, i.e. in total there is no contribution in the control, since we measure contributions relative to the ratio metric in the control. In the test group the average contribution will be some other value, either positive or negative.
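This follows directly from the definition of the coefficient $k$:

$$
\sum_{u \in \text{control}} L_u \;=\; \sum_{u \in \text{control}} X_u \;-\; k \sum_{u \in \text{control}} Y_u \;=\; 0,
$$

so the mean linearized signal in the control group is zero by construction.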

Distributions of linearized user signals for different metrics

At the same time, the linearized metric has a number of positive features.

Firstly, unlike the pre-averaged average, the difference between linearized metrics always moves in the same direction as the target ratio metric. For example, if CTR increased or decreased in an experiment, the linearized CTR will always change in the same direction.
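A short sketch of why the sign is preserved (assuming the denominator signals $Y_u$ are non-negative counts, such as orders or impressions): since the control mean of the linearized signals is zero, the difference between the groups is

$$
\bar L_t - \bar L_c \;=\; \frac{\sum_{u \in t} X_u - k \sum_{u \in t} Y_u}{n_t} \;=\; \frac{\sum_{u \in t} Y_u}{n_t}\,\bigl(R_t - R_c\bigr),
$$

where $R_t$ and $R_c$ are the values of the ratio metric in test and control; the positive factor in front means the difference always has the same sign as the change $R_t - R_c$ of the target ratio metric.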

Comparison of the co-directionality of effects with the target ratio metric for linearization and pre-averaged average

Secondly, linearized user signals can be considered independent, and their statistical significance can be determined with a t-test. Moreover, the p-values for the linearized metric are consistent with the values obtained by the delta method on the original ratio metric. This means the delta method can be replaced by linearization while obtaining the same p-values with sufficient accuracy. Accordingly, linearization shows correct results in A/A tests.

Consistency of p-values for linearization and the delta method. Results of A/A tests of linearization

Third, linearization makes it possible to apply sensitivity-boosting techniques to ratio metrics, either to reduce the group sizes needed to detect an effect or to increase power on the observed results. For example, to use CUPED with a ratio metric, linearize it and also linearize the corresponding signals from the pre-experiment period. You get two average user metrics, to which CUPED can then be applied, as in the sketch below.
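A minimal sketch of this combination, assuming you already have the linearized user signals for the experiment period and for the pre-experiment period; the function name and the pooled estimation of theta are implementation choices for the illustration, not something prescribed above:

import numpy as np
from scipy import stats


# lin_t, lin_c: linearized signals during the experiment (test / control users)
# pre_t, pre_c: linearized signals of the same users in the pre-experiment period
def cuped_on_linearized(lin_t, pre_t, lin_c, pre_c):
    lin_t, pre_t, lin_c, pre_c = (np.asarray(a, dtype=float) for a in (lin_t, pre_t, lin_c, pre_c))

    # theta is estimated once on the pooled test + control data
    lin_all, pre_all = np.concatenate([lin_t, lin_c]), np.concatenate([pre_t, pre_c])
    theta = np.cov(lin_all, pre_all)[0, 1] / np.var(pre_all, ddof=1)

    # subtract the part of the variance explained by the pre-period covariate
    cuped_t = lin_t - theta * (pre_t - pre_all.mean())
    cuped_c = lin_c - theta * (pre_c - pre_all.mean())

    # the adjusted signals are still independent user-level values, so a t-test applies
    return stats.ttest_ind(cuped_t, cuped_c).pvalue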

The picture below shows that such a criterion turns out to be more powerful than plain linearization and the delta method.

Comparison of powers for linearization, delta method and linearization+CUPED

Conclusion

Linearization in A/B tests is an easy-to-compute and highly scalable method for transforming a ratio metric into an average user metric. It preserves the direction of the observed effect relative to the change in the target ratio metric. In experiments, the difference between linearized metrics also has a level of statistical significance consistent with that of the original ratio metric and is calculated with a t-test. And since linearization gives us user-level signals, it becomes possible to apply sensitivity-boosting methods to ratio metrics.

And it is exactly linearization of ratio metrics combined with CUPED that is implemented in our A/B testing platform.

References

  1. Original article – Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments;

  2. Ilya Katsev – How to measure user happiness;

  3. Roman Budylin – How to create sensitive metrics for AB testing;

  4. Vitaly Polshkov – Effective A/B testing.
