What a data analyst might be asked about statistics in an interview: 3 topics
Binomial distribution criteria
The binomial distribution is one of the fundamental distributions in statistics. It describes the number of successes in a series of independent trials, each of which has two possible outcomes: success or failure.
To apply the binomial distribution, these criteria must be met:
Fixed number of trials n
The number of trials must be fixed in advance. For example, if a coin is tossed 10 times, then n = 10.
Independence of trials
The outcome of each trial must not depend on the results of other trials. For example, coin tosses are independent of each other: the probability of getting heads does not change from toss to toss.
Two possible outcomes
Each trial must have exactly two mutually exclusive outcomes, conventionally called success and failure. For example, when tossing a coin, “heads” can be treated as success and “tails” as failure.
Constant probability of success p
The probability of success p must remain constant for each trial. For example, the probability of getting heads on each coin flip remains 0.5 if the coin is fair.
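When all four criteria hold, the probability of exactly k successes follows the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The following is a minimal sketch, not part of the original article: it assumes a fair coin (p = 0.5) and n = 10 tosses, and the helper name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # assumed for illustration: 10 tosses of a fair coin

prob_5_heads = binomial_pmf(5, n, p)
# The probabilities over all possible k must add up to 1
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(exactly 5 heads in 10 tosses) = {prob_5_heads:.4f}")
print(f"Sum over all k = {total:.4f}")
```

If any criterion fails (for example, the probability of success drifts between trials), this formula no longer describes the number of successes.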
Outliers and methods for their detection
Outliers are data points that deviate significantly from the other observed values in the data set. They can occur for a variety of reasons: measurement errors, unusual conditions or rare events. Outliers can strongly influence the results of statistical analysis by distorting summary statistics such as the mean and the standard deviation.
There are several methods for identifying outliers that can be used, depending on the nature of the data and the specifics of the analysis. Let's look at the most common ones: z-score, interquartile range (IQR) and visualization (box plots).
z-score method
The z-score shows how far (in standard deviations) each value is from the mean of the data set. The formula for calculating the z-score is:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the data value, $\mu$ is the mean, and $\sigma$ is the standard deviation. Typically, values with a z-score greater than 3 or less than -3 are considered outliers, since about 99.7% of data in a normal distribution lie within three standard deviations of the mean.
Example in Python:
import pandas as pd
data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)
# compute the z-score (pandas .std() uses the sample standard deviation, ddof=1)
df['z_score'] = (df['score'] - df['score'].mean()) / df['score'].std()
# keep only the rows with -3 < z < 3
df_filtered = df[(df['z_score'] > -3) & (df['z_score'] < 3)]
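It can help to look at the z-scores themselves before filtering. The sketch below is an added illustration that uses only Python's standard `statistics` module and the same scores as above; the variable names are assumptions. In this small sample the value 120 has a z-score of about 2.46, so the ±3 rule does not flag it: a single extreme value inflates the standard deviation, and in small samples z-scores cannot become very large.

```python
from statistics import mean, stdev

scores = [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]

mu = mean(scores)
sigma = stdev(scores)  # sample standard deviation (n - 1 in the denominator)

z_scores = [(x - mu) / sigma for x in scores]
# Values whose z-score lies outside the (-3, 3) band
flagged = [x for x, z in zip(scores, z_scores) if abs(z) > 3]

for x, z in zip(scores, z_scores):
    print(f"{x:>4}  z = {z:+.2f}")
print("Flagged as outliers:", flagged)
```

For data sets this small, a lower threshold or a method based on quartiles may be more informative than the ±3 rule.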
Interquartile range (IQR method)
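IQR is a measure of data dispersion equal to the difference between the third quartile $Q_3$ and the first quartile $Q_1$. The formula for calculating the IQR is:
$$IQR = Q_3 - Q_1$$
Outliers are considered to be values that fall outside the interval bounded by
$Q_1 - 1.5 \cdot IQR$ (lower limit) and $Q_3 + 1.5 \cdot IQR$ (upper limit).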
Example in Python:
import pandas as pd
data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)
# compute the quartiles
Q1 = df['score'].quantile(0.25)
Q3 = df['score'].quantile(0.75)
IQR = Q3 - Q1
# define the outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# keep only the values inside the bounds
df_filtered = df[(df['score'] >= lower_bound) & (df['score'] <= upper_bound)]
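To see which values the IQR rule actually flags, the bounds can be computed by hand. This is a minimal sketch using only the standard library, added for illustration; it assumes the `inclusive` method of `statistics.quantiles`, which interpolates linearly between data points and should agree with the default behaviour of the pandas `quantile()` call above.

```python
from statistics import quantiles

scores = [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]

# n=4 returns the three cut points: Q1, the median, and Q3
q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are treated as outliers
outliers = [x for x in scores if x < lower_bound or x > upper_bound]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers:", outliers)
```

With these scores the rule flags 42 and 120, because quartiles are much less sensitive to extreme values than the mean and standard deviation.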
Visualization (box plots)
Box plots provide a visual way to detect outliers. In a box plot, outliers appear as points outside the whiskers, which represent the range of data that does not include the outliers.
Example in Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# sample data
data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)
# draw the box plot
sns.boxplot(x=df['score'])
plt.title('Box Plot of Scores')
plt.show()
Central limit theorem
The central limit theorem states that, given a large enough sample size, the distribution of the sample mean will approximate a normal distribution, regardless of the shape of the original distribution of the data. Formally, if $X_1, X_2, \dots, X_n$ are independent and identically distributed random variables with a finite mean $\mu$ and a finite variance $\sigma^2$, then the distribution of the standardized mean
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$$
approaches the standard normal distribution as the sample size $n$ increases ($n \to \infty$).
The CLT allows the use of the normal distribution for various statistical methods even when the original data is not normal.
With a large sample size, the sample mean becomes an accurate estimate of the population mean, because its standard error shrinks in proportion to $1/\sqrt{n}$.
The CLT allows you to build confidence intervals for the population mean based on the sample mean and standard error. For example, for a large enough sample you can use the formula:
$$\bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $z$ is the critical value for a given confidence level (for example, 1.96 for a 95% confidence interval), $\sigma$ is the standard deviation, and $n$ is the sample size.
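The confidence interval formula can be sketched in a few lines. The summary statistics below (mean 50, standard deviation 10, sample size 100) are hypothetical values chosen only for illustration; the critical value comes from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics, chosen only for illustration
sample_mean = 50.0
std_dev = 10.0
n = 100
confidence = 0.95

# Two-sided critical value of the standard normal distribution
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
standard_error = std_dev / sqrt(n)
margin = z * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
print(f"z = {z:.2f}, standard error = {standard_error:.2f}")
print(f"{confidence:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")
```

Quadrupling the sample size halves the standard error, so the interval becomes half as wide.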
To illustrate the central limit theorem, let's write code that draws many samples from a population and shows that the distribution of their means is close to normal:
import numpy as np
import matplotlib.pyplot as plt
# parameters
population_mean = 50
population_std_dev = 10
sample_size = 30 # sample size used to demonstrate the CLT
num_samples = 1000 # number of samples
# generate a population from a normal distribution
population = np.random.normal(loc=population_mean, scale=population_std_dev, size=100000)
# compute the mean of each sample
sample_means = [np.mean(np.random.choice(population, sample_size)) for _ in range(num_samples)]
# plot a histogram of the sample means
plt.figure(figsize=(10, 6))
plt.hist(sample_means, bins=30, edgecolor="k", alpha=0.7)
plt.title('Histogram of sample means (n = {})'.format(sample_size))
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
# label the lines directly so the legend refers to them and not to the histogram bars
plt.axvline(np.mean(sample_means), color="r", linestyle="dashed", linewidth=2, label='Mean of sample means')
plt.axvline(population_mean, color="g", linestyle="dashed", linewidth=2, label='Population mean')
plt.legend()
plt.show()
# test the sample means for normality
from scipy.stats import normaltest
stat, p_value = normaltest(sample_means)
print('Normality test statistic: {:.3f}, p-value: {:.3f}'.format(stat, p_value))
if p_value > 0.05:
    print('The sample means are consistent with a normal distribution (fail to reject H0)')
else:
    print('The sample means are not normally distributed (reject H0)')
Example output:
Normality test statistic: 0.689, p-value: 0.708
The sample means are consistent with a normal distribution (fail to reject H0)
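Because the population in the code above is itself normal, its sample means are normal for any sample size, so the effect is easier to see with a non-normal population. The sketch below is an added illustration, not the author's code: it assumes an exponential population (strongly right-skewed) and uses skewness as a rough measure of asymmetry; the helper names `skewness` and `sample_mean_skewness` are hypothetical. For this population the skewness of the sample mean is 2/√n in theory, so it should move toward 0 (the value for a normal distribution) as n grows.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the sketch is reproducible


def skewness(values):
    """Mean cubed deviation from the mean, divided by the cubed standard deviation."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)


def sample_mean_skewness(sample_size, num_samples=2000):
    """Skewness of the means of samples drawn from an exponential population."""
    means = [
        fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]
    return skewness(means)


# Larger samples should give sample means that are less skewed
results = {n: sample_mean_skewness(n) for n in (1, 10, 100)}
for n, skew in results.items():
    print(f"n = {n:>3}: skewness of sample means = {skew:.2f}")
```

The exact numbers depend on the seed, but the skewness should fall steadily as the sample size increases.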