Bootstrap: from theory to practice in Python
Introduction
Application
Restrictions
Bootstrap scheme
Efron's confidence interval
Hall's confidence interval
t-percentile confidence interval
Python implementation
Problems
Notes
Introduction
Bootstrap is a computational statistical method that estimates the distribution of a sample statistic (for example, the median, the kurtosis, or the mean) by repeatedly generating samples from an existing sample using the Monte Carlo method.
In simple terms, the bootstrap lets you "impersonate" the population by repeatedly drawing "pseudo-samples" with replacement from the original sample.
A bootstrap sample is a pseudo-sample drawn with replacement from the original sample; that is, it may contain the same observation from the original sample several times. Moreover, a bootstrap sample must be the same size as the original sample.
In essence, when we bootstrap, we want to draw conclusions about some statistic of the general population from the sample we have. We repeatedly draw bootstrap samples from the original sample, compute the statistic on each of them, build its distribution, construct a confidence interval, and draw conclusions. Alternatively, we can simply take, say, the mean or the median of that distribution to get a point estimate.
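To make this concrete, here is a minimal sketch (toy data invented for illustration) of drawing bootstrap samples with numpy and collecting a statistic:
import numpy as np

rng = np.random.default_rng(42)
x = np.array([3.1, 4.7, 2.2, 5.0, 3.8, 4.1])  # toy original sample

# 1000 pseudo-samples, each the same size as x, drawn with replacement
boot = rng.choice(x, size=(1000, x.size), replace=True)
boot_medians = np.median(boot, axis=1)  # the statistic on each pseudo-sample
print(boot_medians.mean())              # point estimate of the median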
Bootstrap is typically used in the following cases:
theoretical distribution of data is unknown;
the sample size is small for direct statistical evaluation;
there are no parametric or non-parametric alternatives;
it is necessary to estimate complex statistics for which analytical formulas are hard to obtain.
The key advantage of the bootstrap is that it can be applied to a wide range of problems, even when all other methods, both parametric and non-parametric, fail.
This method does not require the researcher to make any assumptions about the distribution of the data. The only condition that must be met is that the sample be representative.
Application
Now let's take a closer look at where bootstrapping is used in practice. It is very common in Data Science, particularly in machine learning, where it allows you to evaluate model quality, the uncertainty of predictions, and much more.
Model quality assessment
Bootstrapping is used to obtain more accurate estimates of model quality metrics such as accuracy, recall, F1-score, etc. Bootstrap samples are repeatedly generated from the original dataset by random selection with replacement. The model is then trained and tested on each sample, after which the metrics are averaged. This reduces the bias of the estimates and yields confidence intervals.
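As a rough sketch of this idea (the names y_test, X_test, and model below are hypothetical, standing in for your own held-out data and fitted classifier), a metric such as accuracy can be bootstrapped over the test set like this:
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, N=10_000, alpha=0.05, seed=0):
    # resample test-set indices with replacement, N times
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, y_true.size, size=(N, y_true.size))
    accs = (y_true[idx] == y_pred[idx]).mean(axis=1)  # accuracy per pseudo-sample
    return np.quantile(accs, (alpha / 2, 1 - alpha / 2))

# usage (assumes 1-D numpy arrays):
# lo, hi = bootstrap_accuracy_ci(y_test, model.predict(X_test))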
Assessing the uncertainty of predictions
In neural networks, the bootstrap is used to estimate the uncertainty of model predictions. To do this, several models are created on different bootstrap samples and predictions are made for each of them. The spread of predictions characterizes the uncertainty. This is important, for example, for tasks with a high cost of error, where you need to know how confident the model is in its answer.
Interpretation and analysis of features
By training models on bootstrap samples, you can estimate the importance of features by the frequency of their use in trees (for random forest) or by the spread of weights (for neural networks). This gives an understanding of which features contribute most to the model's predictions.
Active learning
Bootstrapping is used in active learning, an approach where the model itself selects the most informative examples for labeling and retraining. One strategy is to request labeling for examples where the predictions of models trained on different bootstrap samples differ the most.
Ensemble methods
The bootstrap is the basis of some ensemble algorithms, such as bagging and random forest. These methods build multiple base models on different bootstrap samples of the training data. The predictions of the base models are then combined, which often produces more accurate and stable results than a single model.
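For illustration, here is a minimal bagging sketch using scikit-learn (assuming it is installed; the dataset is synthetic): each base tree is trained on its own bootstrap sample of the training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True: each base tree sees its own bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0).fit(X_tr, y_tr)
print(bag.score(X_te, y_te))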
Thus, bootstrapping allows improving the quality, reliability, and interpretability of machine learning models and is used in a fairly wide range of tasks. It is especially useful when the size of the training sample is limited, helping to make the most of the available data.
Restrictions
Bootstrap has several limitations. Let's first look at those that affect the accuracy of the results. These are:
number of generated bootstrap samples (pseudo samples);
original sample size.
The larger the number of observations in the original sample and bootstrap samples, the more accurate the result will be, and vice versa. The number of generated bootstrap samples is more important here.
In any case, these factors are not blocking, i.e. it is possible to work with small samples and with a relatively small number of bootstrap samples. In this case, it is simply necessary to correctly interpret the result and understand the consequences, which will be discussed in the problems section.
Bootstrapping can also be resource-intensive, especially when working with large amounts of data and a large number of iterations, since we have to draw many bootstrap samples from the original sample.
As we know from combinatorics, the number of such unique samples is $\binom{2n-1}{n}$; that is, algorithmically, exhaustive bootstrapping would cost us on the order of $O\!\left(\binom{2n-1}{n}\right)$ operations, which is very sad.
In reality, no one does this; everyone is constrained by their computing power, so all unique bootstrap samples are never extracted: a random pseudo-sample is simply drawn $N$ times. With small probability, some of these samples may turn out to be non-unique, but even so, it does not really matter to us. In other words, Monte Carlo simulation is used for bootstrapping.
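To get a feel for how fast the count of unique bootstrap samples grows, a quick check with the standard library (the sizes are chosen arbitrarily):
import math

# number of distinct bootstrap samples of size n from n elements
for n in (5, 10, 20, 50):
    print(n, math.comb(2 * n - 1, n))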
To summarize, the accuracy of the bootstrapping result, as already mentioned, is determined by the number of pseudo-samples, but, on the other hand, this is very expensive in terms of computing power.
Thus, it is necessary to find a compromise between the accuracy and the resource intensity of bootstrapping, which is achieved by using the Monte Carlo method (in particular, by controlling the number of pseudo-samples $N$).
The last but very important nuance is the purpose of the bootstrap. Bootstrap is a great method for working in Middle-earth (the middle of the distribution); however, in the far reaches (the tails), it performs noticeably worse. This happens because only a small number of observations from the tails of the original sample make it into the bootstrap samples.
That is, with the bootstrap we can analyze and draw conclusions about average trends, but not about extremes. For example, we can construct a confidence interval for the median, but a confidence interval for the 99th percentile will be extremely inaccurate.
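A quick sketch of this effect on synthetic normal data (sizes are arbitrary): the bootstrap distribution of the median is rich, while the 99th percentile is pieced together from a handful of the largest original observations.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
boot = rng.choice(x, size=(10_000, x.size), replace=True)

medians = np.median(boot, axis=1)
p99 = np.quantile(boot, 0.99, axis=1)
# the tail statistic takes far fewer distinct values
print(np.unique(medians).size, np.unique(p99).size)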
Bootstrap scheme
Efron's confidence interval
So, let's move on to the bootstrap scheme. Initially, we have a sample, which is part of a general population. We want to study some statistic and draw conclusions about the general population; it is in this light that we will treat the original sample.
Next, we determine how many bootstrap samples (of the same size as the original sample) our computing power allows us to extract from the original sample, and we draw that number, $N$, of pseudo-samples with replacement.
For each bootstrap sample obtained, we calculate the statistic through which we want to draw conclusions about the general population.
Once we have calculated all the statistics, we plot the distribution of the statistics and calculate the 2.5% and 97.5% quantiles.
This is what the algorithm looks like. Now let's describe it mathematically.
Designations:
$\theta$ — the statistic on the general population;
$\hat{\theta}$ — the sample statistic;
$\hat{\theta}^*$ — the statistic on a bootstrap sample;
$N$ — the number of bootstrap samples;
$n$ — the size of the original sample (and hence of each bootstrap sample);
$\alpha$ — the significance level.
Algorithm:
1. A sample is a representative portion of the larger population you want to study.
2. From the original sample, $n$ elements are selected at random with replacement (the same element can be selected several times), forming a bootstrap sample.
3. The statistic $\hat{\theta}^*$ is calculated on the generated bootstrap sample.
4. Steps 2 and 3 are repeated $N$ times (for example, 10,000, 100,000, or 1,000,000 times).
5. We find the Efron confidence interval, that is, the values below which lie 2.5% and 97.5% of the $\hat{\theta}^*$ values (if necessary, the percentage of values outside the boundaries can be chosen differently, for example 1%, 0.1%, etc.). The confidence interval itself has the form:
$\left[\hat{\theta}^*_{\alpha/2};\ \hat{\theta}^*_{1-\alpha/2}\right]$
So, the scheme described above is the classical one, and a confidence interval constructed according to it is called the Efron confidence interval.
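Anticipating the full implementation below, here is a minimal sketch of the Efron interval for the mean (toy data invented for illustration):
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(50, 10, size=300)  # toy original sample

boot_means = rng.choice(x, size=(10_000, x.size), replace=True).mean(axis=1)
# Efron interval: plain quantiles of the bootstrap distribution
print(np.quantile(boot_means, (0.025, 0.975)))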
In addition to Efron's confidence interval, there are the Hall confidence interval and the t-percentile confidence interval. The advantage of the last two is that they provide an unbiased estimate of the sample statistic, since centering is part of their calculation.
Efron's confidence interval gives a biased estimate, while the t-percentile and Hall intervals give unbiased ones.
Estimation bias can occur when the original sample has very few observations, when the distribution has heavy tails, or when the distribution of the sample statistic is skewed, multimodal, or otherwise very different from the traditional ones.
In particular, the t-percentile confidence interval is somewhat wider than the Efron and Hall intervals, which brings it closer to its analytical analogues; the other two are noticeably narrower, that is, they account for the variability of the statistic to a lesser extent.
Hall's confidence interval
So, as already mentioned, the algorithm for constructing the Hall confidence interval gives an unbiased estimate. This is achieved by centering, which consists in subtracting the statistic of the original sample from each statistic obtained on a bootstrap sample.
Apart from the centering and a small adjustment to the formula for the confidence interval, this scheme is no different from the previous one. Let's go through it point by point, recalling the notation:
Designations:
$\theta$ — the statistic on the general population;
$\hat{\theta}$ — the sample statistic;
$\hat{\theta}^*$ — the statistic on a bootstrap sample;
$N$ — the number of bootstrap samples;
$n$ — the size of the original sample (and hence of each bootstrap sample);
$\alpha$ — the significance level.
Algorithm:
1. A sample is a representative portion of the larger population you want to study.
2. The statistic $\hat{\theta}$ is calculated on the original sample.
3. From the original sample, $n$ elements are selected at random with replacement, forming a bootstrap sample.
4. The statistic $\hat{\theta}^*$ is calculated on the generated bootstrap sample.
5. $\hat{\theta}$ is subtracted from the statistic obtained on the bootstrap sample, giving $d^* = \hat{\theta}^* - \hat{\theta}$.
6. Steps 3, 4, and 5 are repeated $N$ times (for example, 10,000, 100,000, or 1,000,000 times).
7. We find the Hall confidence interval, which has the form:
$\left[\hat{\theta} - d^*_{1-\alpha/2};\ \hat{\theta} - d^*_{\alpha/2}\right]$
This is how an unbiased estimate is obtained.
t-percentile confidence interval
The t-percentile confidence interval is another scheme for constructing a bootstrap confidence interval. Like the Hall confidence interval, the t-percentile uses centering. However, in this case, the difference between the bootstrap-sample statistic and the original-sample statistic is additionally divided by the standard error.
This confidence interval turns out to be somewhat wider than Efron's and Hall's, which compensates for the underestimation of the spread of the statistic, though not completely. The t-percentile confidence interval also has better asymptotic convergence.
So, let's move on to discussing the algorithm:
Designations:
$\theta$ — the statistic on the general population;
$\hat{\theta}$ — the sample statistic;
$\hat{\theta}^*$ — the statistic on a bootstrap sample;
$N$ — the number of bootstrap samples;
$n$ — the size of the original sample (and hence of each bootstrap sample);
$\alpha$ — the significance level;
$\widehat{se}^*$ — the standard error of a bootstrap sample;
$\widehat{se}$ — the standard error of the original sample.
Algorithm:
1. A sample is a representative portion of the larger population you want to study.
2. The statistic $\hat{\theta}$ and the standard error $\widehat{se}$ are calculated on the original sample.
3. From the original sample, $n$ elements are selected at random with replacement, forming a bootstrap sample.
4. The statistic $\hat{\theta}^*$ and the standard error $\widehat{se}^*$ are calculated on the generated bootstrap sample.
5. $\hat{\theta}$ is subtracted from the bootstrap-sample statistic and the result is divided by $\widehat{se}^*$; the resulting value is denoted $t^* = (\hat{\theta}^* - \hat{\theta}) / \widehat{se}^*$.
6. Steps 3, 4, and 5 are repeated $N$ times (for example, 10,000, 100,000, or 1,000,000 times).
7. We find the t-percentile confidence interval, which has the form:
$\left[\hat{\theta} - \widehat{se}\, t^*_{1-\alpha/2};\ \hat{\theta} - \widehat{se}\, t^*_{\alpha/2}\right]$
Python implementation
Finally, let's look at an implementation of the bootstrap in Python. We will write a fairly simple function that supports the different bootstrapping schemes and accepts a function that calculates the desired statistic. It is worth noting that, for simplicity, this function is assumed to come from numpy and to have an axis parameter. This requirement is satisfied, for example, by mean(), median(), and var().
import numpy as np
from typing import Callable, Literal


def get_bootstrap_ci(
    X: np.ndarray, func: Callable, N: int = 10**3,
    kind: Literal['Efron', 'Hall', 't-percentile'] = 't-percentile',
    alpha: float = 0.05
) -> tuple[float, float]:
    n = X.size
    # N bootstrap samples of size n, drawn with replacement (Monte Carlo)
    bootstrap_samples = np.random.choice(X, (N, n), replace=True)
    # the statistic on each bootstrap sample
    theta_hat_star = func(bootstrap_samples, axis=1)
    if kind == 't-percentile':
        theta_hat = func(X)
        se_theta_hat = np.std(X) / np.sqrt(n)
        se_theta_hat_star = np.std(bootstrap_samples, axis=1) / np.sqrt(n)
        # studentize: center by the sample statistic, scale by the bootstrap SE
        theta_hat_star = (theta_hat_star - theta_hat) / se_theta_hat_star
        left, right = np.quantile(theta_hat_star, (1 - alpha / 2, alpha / 2))
        ci = (theta_hat - se_theta_hat * left, theta_hat - se_theta_hat * right)
    elif kind == 'Hall':
        theta_hat = func(X)
        # center the bootstrap statistics by the sample statistic
        theta_hat_star -= theta_hat
        left, right = np.quantile(theta_hat_star, (1 - alpha / 2, alpha / 2))
        ci = (theta_hat - left, theta_hat - right)
    elif kind == 'Efron':
        # plain percentile interval: the alpha/2 and 1 - alpha/2 quantiles
        left, right = np.quantile(theta_hat_star, (alpha / 2, 1 - alpha / 2))
        ci = (left, right)
    else:
        raise ValueError('Unknown method')
    return ci
Now let's see this function in action. Suppose we have a sample of 52,334 observations containing respondents' ages. Then, roughly, but for the sake of illustration, we can treat the available observations as the general population, draw a sample from it, and try to compute a confidence interval for the mean age using the bootstrap.
First, let's import the necessary libraries:
import numpy as np
import pandas as pd
import seaborn as sns
let's load the data (note 1):
df = pd.read_excel('analysis/data/data_msk.xlsx')
and now let's see how they are distributed:
sns.displot(df.age, aspect=1.8, bins=df.age.nunique())
We can see that the distribution has a positive skewness coefficient and is bimodal, though not strongly so, which should not greatly interfere with our experiment.
Next, to make the experiment reproducible, we set up the state:
np.random.seed(34)
Now let's calculate the mean for the "general population", draw a sample, and calculate its mean:
sample = df.age.sample(500)
print(f'Mean of the "general population": {np.mean(df.age)}')
print(f'Sample mean: {np.mean(sample)}')
output:
Mean of the "general population": 68.54052814613827
Sample mean: 67.768
We can see that the sample mean differs noticeably from the population mean. I chose this example deliberately to show that even in such cases the bootstrap works correctly.
So now let's look at the results of the different schemes one by one:
Efron's confidence interval:
ci = get_bootstrap_ci(sample, np.mean, alpha=0.01, kind='Efron')
print(ci)
print(f'Interval width: {ci[1] - ci[0]}')
output:
(66.99199, 68.53201)
Interval width: 1.5400199999999984
Hall's confidence interval:
ci = get_bootstrap_ci(sample, np.mean, alpha=0.01, kind='Hall')
print(ci)
print(f'Interval width: {ci[1] - ci[0]}')
output:
(67.00399, 68.54401)
Interval width: 1.5400199999999984
t-percentile confidence interval:
ci = get_bootstrap_ci(sample, np.mean, alpha=0.01)
print(ci)
print(f'Interval width: {ci[1] - ci[0]}')
output:
(67.0257231115649, 68.56861294427233)
Interval width: 1.5428898327074307
Important note! To reproduce the experiment, the lines that set the seed and generate the sample must be rerun each time before a confidence interval is computed with a given method. This ensures that identical bootstrap samples are generated by the choice() function inside get_bootstrap_ci().
In this case, at the 1% significance level, the population mean lies outside Efron's confidence interval but inside the t-percentile and Hall intervals. It is important to note that in this example a bias has appeared, and the Efron confidence interval inherits it. That is why the population mean falls inside the Hall confidence interval but not the Efron one, even though the two do not differ in width.
It is also noteworthy that the t-percentile confidence interval turned out to be the widest of all, while the Hall and Efron intervals, as noted, are equal in width but narrower than the t-percentile, which is quite natural.
Problems
Compared to analytical confidence intervals, confidence intervals obtained using bootstrapping (no matter which scheme is chosen) are somewhat narrower if the original sample is small. This is a disadvantage, since we are essentially underestimating the spread of sample statistics with a small number of observations in the original sample.
With bootstrapping, we asymptotically cover only about 63% of the observations from the original sample, and about 37% of observations do not fall into a given bootstrap sample. This is a consequence of sampling with replacement: the probability that a particular observation is never drawn in $n$ draws is $(1 - 1/n)^n \to e^{-1} \approx 0.37$.
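This fact is easy to verify with a quick simulation (the sample size is arbitrary):
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
idx = rng.integers(0, n, size=n)  # one bootstrap sample, as indices
print(np.unique(idx).size / n)    # close to 1 - 1/e ~ 0.632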
From the previous fact it follows that the bootstrap is a great technique for working in Middle-earth (the middle of the distribution), but not in the far reaches (the tails). And when the distribution of the original sample has heavy tails (that is, many outliers), the bootstrap can start to perform poorly even in Middle-earth.
When there is structure in the data (regression, time series), the bootstrap scheme must be designed so that it takes this structure into account (Note 2).
Notes
Note 1. The data can be downloaded via the link.
Note 2. The presence of structure in the data is a big issue when bootstrapping. The existing structure must be taken into account, either by designing more advanced methods modeled on the classical bootstrap or by using existing ones. Specific methods for bootstrapping regressions and time series can be found via the link.