How VWE helps reduce variance and improve data accuracy

Several approaches are traditionally used to combat high variance. One is to increase the sample size: the more data, the more accurate the estimate. In practice, though, time and resource constraints often make it impossible to collect a large amount of data.
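As a rough illustration of why more data helps, the standard error of a sample mean shrinks in proportion to 1/sqrt(n). The sketch below assumes independent observations with a common standard deviation; the function name `standard_error` and the value sigma = 2 are illustrative choices, not taken from the article.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent observations with std sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)

# Quadrupling the sample size only halves the standard error.
for n in (25, 100, 400):
    print(n, round(standard_error(2.0, n), 3))
```

The slow 1/sqrt(n) rate is what makes "just collect more data" expensive: each further halving of the error needs four times as many observations.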

Another approach is to use smoothing methods such as least squares and its variants. These help reduce the impact of individual noisy observations but do not always cope with systematic errors.

There are also regularization methods, such as Ridge and Lasso regression, which add a penalty on model complexity and thereby reduce variance. However, they bring limitations of their own: they affect how the model can be interpreted and require careful tuning of their parameters.

All of these approaches have their limitations. Increasing sample size is an expensive and time-consuming solution. Smoothing and regularization methods may not cope with data heterogeneity and require careful tuning.

The variance-weighted estimator (VWE) takes a different route: it accounts for data heterogeneity by weighting each observation according to how accurate it is. This article looks at how the method works.

VWE

The idea behind VWE is that estimates with lower variance should be given more weight because they are more accurate. Unlike a simple average, where all data points are weighted equally, VWE takes the varying accuracy of the data into account, which can significantly reduce the variance of the final estimate.

The working principle of the VWE method can be described as follows:

Each observation is assigned a weight equal to the inverse of its variance. The weight w_i for the i-th observation is:

w_i = \frac{1}{\sigma_i^2}

where \sigma_i^2 is the variance of the i-th observation.

All weights are normalized so that their sum equals one. This is achieved by dividing each weight by the sum of all weights:

\lambda_i = \frac{w_i}{\sum_j w_j}

where \lambda_i is the normalized weight of the i-th observation.

The final estimate is calculated as a weighted average of the values, using the normalized weights:

\bar{X}_{VWE} = \sum_i \lambda_i x_i

where x_i is the value of the i-th observation.
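The three steps above can be sketched as one small function. This is a minimal illustration, assuming the variances are known and strictly positive; the name `variance_weighted_estimate` and the sample values are hypothetical.

```python
import numpy as np

def variance_weighted_estimate(values, variances):
    """Combine estimates by inverse-variance weighting.

    Returns the weighted estimate and the normalized weights.
    """
    values = np.asarray(values, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if values.shape != variances.shape:
        raise ValueError("values and variances must have the same shape")
    if np.any(variances <= 0):
        raise ValueError("variances must be strictly positive")

    weights = 1.0 / variances            # step 1: w_i = 1 / sigma_i^2
    lambdas = weights / weights.sum()    # step 2: normalize so the weights sum to 1
    estimate = np.sum(lambdas * values)  # step 3: weighted average
    return estimate, lambdas

# Two hypothetical estimates: 10 (variance 1) and 12 (variance 4)
estimate, lambdas = variance_weighted_estimate([10.0, 12.0], [1.0, 4.0])
print(estimate, lambdas)
```

The same number can be obtained with `np.average(values, weights=1 / variances)`, since `np.average` normalizes the weights internally; the explicit version is shown only to mirror the steps.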

Combining these steps gives the basic VWE formula:

\bar{X}_{VWE} = \frac{\sum_i \frac{x_i}{\sigma_i^2}}{\sum_i \frac{1}{\sigma_i^2}}

Here \bar{X}_{VWE} is the variance-weighted average. This formula minimizes the variance of the final estimate by using information about the accuracy of each individual observation.

Variance reduction is achieved by weighting observations with less variance more heavily than observations with more variance.
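For readers who want the reasoning behind the minimum-variance claim, here is a short derivation sketch. It assumes the observations are independent and unbiased with known variances; \mu is a Lagrange multiplier introduced only for this derivation.

```latex
\begin{align*}
\operatorname{Var}\Bigl(\sum_i \lambda_i x_i\Bigr) &= \sum_i \lambda_i^2 \sigma_i^2,
  \qquad \text{subject to } \sum_i \lambda_i = 1, \\
2 \lambda_i \sigma_i^2 &= \mu \quad \text{for all } i \quad \text{(Lagrange condition)}, \\
\lambda_i &= \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}, \\
\operatorname{Var}\bigl(\bar{X}_{VWE}\bigr) &= \frac{1}{\sum_j 1/\sigma_j^2} \le \min_i \sigma_i^2 .
\end{align*}
```

Under these assumptions the combined estimate is at least as precise as the best single observation.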

Let's look at an example of using VWE in an A/B test. Let's say there are two sets of data – A and B, each of which contains the results of an experiment. The variance of results in set A is significantly lower than in set B.

By using the simple average method, you can obtain a general estimate that does not account for differences in the accuracy of data sets. However, with VWE, we will give more weight to the results from set A, resulting in a reduction in the overall variance of the estimate.

Calculation example:

Weights for each set, taking the variance of set A as 0.5 and the variance of set B as 2:

  • w_A = \frac{1}{0.5} = 2

  • w_B = \frac{1}{2} = 0.5

Normalized weights:

  • \lambda_A = \frac{2}{2 + 0.5} = \frac{2}{2.5} = 0.8

  • \lambda_B = \frac{0.5}{2 + 0.5} = \frac{0.5}{2.5} = 0.2

Final weighted average, with group means \bar{x}_A = 2.2 and \bar{x}_B = 4.6:

  \bar{X}_{VWE} = 0.8 \cdot \bar{x}_A + 0.2 \cdot \bar{x}_B = 0.8 \cdot 2.2 + 0.2 \cdot 4.6 = 2.68

Thus, provided the variances are known accurately, the weighted average has a lower variance than a simple average that treats all data equally.
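The size of the gain can be checked with a few lines of arithmetic. The sketch below assumes that 0.5 and 2.0 are the variances of the two estimates being combined and that the estimates are independent; the variable names are illustrative.

```python
import numpy as np

variances = np.array([0.5, 2.0])  # variances of the estimates from sets A and B

# Equal weights (simple average): Var = sum(sigma_i^2) / n^2
n = len(variances)
var_simple = variances.sum() / n**2

# Inverse-variance weights: Var = 1 / sum(1 / sigma_i^2)
var_vwe = 1.0 / np.sum(1.0 / variances)

print(f"simple average variance: {var_simple:.3f}")  # 0.625
print(f"VWE variance:            {var_vwe:.3f}")     # 0.400
```

In this example the weighted estimate is also more precise than set A alone (0.4 versus 0.5), because set B still contributes some information.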

Examples in Python

Let's apply VWE to the results of an A/B test. The data comes from two test groups with different variances:

import numpy as np

# results for groups A and B
group_A_results = np.array([1, 2, 2, 3, 3])
group_B_results = np.array([1, 4, 5, 6, 7])

# variances for the groups (taken as given here)
variance_A = 0.5
variance_B = 2.0

# compute the weights
weights_A = 1 / variance_A
weights_B = 1 / variance_B

# normalize the weights
total_weight = weights_A + weights_B
lambda_A = weights_A / total_weight
lambda_B = weights_B / total_weight

# compute the weighted average of the group means
mean_A = np.mean(group_A_results)
mean_B = np.mean(group_B_results)

vwe_result = lambda_A * mean_A + lambda_B * mean_B

print(f"VWE result for the A/B test: {vwe_result:.2f}")

Output:

VWE result for the A/B test: 2.68

Let's use VWE to improve the forecasting accuracy of an ML model by combining the predictions of several models with different accuracy:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# training data
X_train, y_train = ...  # data

# train the models
model1 = RandomForestRegressor()
model2 = LinearRegression()

model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# model predictions
pred1 = model1.predict(X_train)
pred2 = model2.predict(X_train)

# error variances, approximated by the MSE
# (measured on the training data here, which favours models that overfit;
# a held-out set is a safer choice)
variance1 = mean_squared_error(y_train, pred1)
variance2 = mean_squared_error(y_train, pred2)

# compute the weights
weights1 = 1 / variance1
weights2 = 1 / variance2

# normalize the weights
total_weight = weights1 + weights2
lambda1 = weights1 / total_weight
lambda2 = weights2 / total_weight

# final weighted prediction
final_prediction = lambda1 * pred1 + lambda2 * pred2
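Because the example above measures errors on the data the models were trained on, a flexible model such as a random forest can look more accurate than it really is. One possible variant, sketched below, estimates the error variances on a held-out split instead. The synthetic data from `make_regression`, the split size, and all variable names are assumptions made only so the sketch runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, used only to make the sketch runnable
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = [RandomForestRegressor(n_estimators=50, random_state=0), LinearRegression()]
for model in models:
    model.fit(X_train, y_train)

# Error "variances" measured on data the models have not seen
val_preds = np.column_stack([m.predict(X_val) for m in models])
mse = np.array(
    [mean_squared_error(y_val, val_preds[:, k]) for k in range(len(models))]
)

# Inverse-variance weights, normalized to sum to one
weights = 1.0 / mse
lambdas = weights / weights.sum()

# Weighted combination of the two models' predictions
blended = val_preds @ lambdas
print("weights:", lambdas)
print("blended MSE:", mean_squared_error(y_val, blended))
```

Note that the blended MSE printed here is computed on the same validation split that produced the weights, so it is somewhat optimistic; a separate test split would give a fairer figure.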

Let's use VWE to aggregate stock price forecasts using data with different variances:

import numpy as np

# predictions from two different forecast sources
source1_predictions = np.array([100, 102, 101, 103, 104])
source2_predictions = np.array([99, 101, 100, 102, 103])

# variances for the sources
variance_source1 = 1.0
variance_source2 = 4.0

# compute the weights
weights_source1 = 1 / variance_source1
weights_source2 = 1 / variance_source2

# normalize the weights
total_weight = weights_source1 + weights_source2
lambda_source1 = weights_source1 / total_weight
lambda_source2 = weights_source2 / total_weight

# final weighted forecast
final_prediction = lambda_source1 * source1_predictions + lambda_source2 * source2_predictions

print(f"Final weighted stock price forecast: {final_prediction}")

Output:

Final weighted stock price forecast: [ 99.8 101.8 100.8 102.8 103.8]

Of course there are some problems…

VWE assumes that the observations are independent. When the data are correlated, inverse-variance weights are no longer the best choice and the variance of the final estimate is easily misjudged, so correlation can distort both the weights and the final result.
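When the covariance between the estimates is known, a standard generalization is to take the weights from the full covariance matrix: w = Σ⁻¹1 / (1ᵀΣ⁻¹1). The sketch below illustrates this; the helper name `min_variance_weights` and the covariance values (variances 1 and 4, correlation 0.5) are hypothetical.

```python
import numpy as np

def min_variance_weights(cov):
    """Weights summing to 1 that minimize the variance of the combined estimate."""
    cov = np.asarray(cov, dtype=float)
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)  # Sigma^{-1} 1
    return raw / raw.sum()

# Hypothetical covariance: variances 1 and 4, correlation 0.5 (covariance 1)
cov = np.array([[1.0, 1.0],
                [1.0, 4.0]])

w_corr = min_variance_weights(cov)                     # uses the correlation
w_indep = min_variance_weights(np.diag(np.diag(cov)))  # ignores it: plain VWE weights

# True variance of the combined estimate under each set of weights
var_corr = w_corr @ cov @ w_corr
var_indep = w_indep @ cov @ w_indep

print("weights with correlation:    ", w_corr, "variance:", var_corr)
print("weights ignoring correlation:", w_indep, "variance:", var_indep)
```

In this particular case the plain VWE weights (0.8 and 0.2) give a combined variance of about 1.12, worse than using the first estimate alone, while the independence formula would have reported 1/(1 + 0.25) = 0.8.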

VWE can be sensitive to outliers, especially if they have low variance and therefore high weight.

Weights in VWE are calculated based on variance, and errors in variance estimation can result in incorrect weights.
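In practice the variances are rarely given and have to be estimated from the data. One common choice, sketched here, is to use the estimated variance of each group mean, s²/n, with the unbiased sample variance. This assumes independent, identically distributed observations within each group; the function name `variance_of_mean` is illustrative, and the resulting numbers differ from the fixed variances used in the examples above.

```python
import numpy as np

def variance_of_mean(sample):
    """Estimated variance of the sample mean: s^2 / n, with unbiased s^2 (ddof=1)."""
    sample = np.asarray(sample, dtype=float)
    if sample.size < 2:
        raise ValueError("need at least two observations to estimate a variance")
    return sample.var(ddof=1) / sample.size

group_a = [1, 2, 2, 3, 3]
group_b = [1, 4, 5, 6, 7]

var_a = variance_of_mean(group_a)
var_b = variance_of_mean(group_b)
print(var_a, var_b)
```

With only five observations per group these variance estimates are themselves noisy, which is exactly how incorrect weights can arise.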

Still, in the right hands VWE is a valuable tool, particularly when the data is heterogeneous and has varying levels of accuracy.

To successfully apply VWE, it is important to carefully select and check the source data, take into account possible correlations and outliers, and use correct methods for estimating variance.


The article was prepared ahead of the start of the "Machine Learning. Advanced" course.
