What a data analyst might be asked about statistics in an interview: 3 topics

Binomial distribution criteria

The binomial distribution is one of the fundamental distributions in statistics. It describes the number of successes in a series of independent trials, each of which has two possible outcomes: success or failure.

To apply the binomial distribution, these criteria must be met:

  1. Fixed number of trials n

    • The number of trials must be fixed in advance. For example, if a coin is tossed 10 times, then n = 10.

  2. Independent trials

    • The outcome of each trial must not depend on the outcomes of the others. For example, coin tosses are independent of each other: the probability of getting heads does not change from toss to toss.

  3. Two possible outcomes

    • Each trial must have exactly two mutually exclusive outcomes, conventionally called success and failure. For example, when tossing a coin, “heads” can be treated as success and “tails” as failure.

  4. Constant probability of success p

    • The probability of success p must remain constant for each trial. For example, the probability of getting heads on each coin flip remains 0.5 if the coin is fair.
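When all four criteria hold, the probability of exactly k successes in n trials is P(X = k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch of that formula using only the standard library; the function name binomial_pmf is chosen here for illustration and is not from the original article:

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# fair coin tossed 10 times: probability of exactly 5 heads
p_five_heads = binomial_pmf(5, 10, 0.5)
print(round(p_five_heads, 4))  # 0.2461

# the probabilities over all possible k sum to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```

If any criterion fails (for example, drawing cards without replacement changes p from trial to trial), this formula no longer applies.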

Outliers and methods for their detection

Outliers are observations that deviate significantly from the other values in the data set. They can occur for a variety of reasons: measurement errors, unusual conditions, or rare events. Outliers can strongly influence the results of statistical analysis by distorting summary statistics such as the mean and the standard deviation.

Several methods exist for identifying outliers; the right choice depends on the nature of the data and the specifics of the analysis. Let's look at the most common ones: z-score, interquartile range (IQR) and visualization (box plots).

z-score method

The z-score shows how far (in standard deviations) each value is from the mean of the data set. The formula for calculating the z-score is:

\[ Z = \frac{X - \mu}{\sigma} \]

where X is the data value, \mu is the mean, and \sigma is the standard deviation. Typically, values with a z-score greater than 3 or less than -3 are considered outliers, since about 99.7% of the data in a normal distribution lie within three standard deviations of the mean.

Example in Python:

import pandas as pd

data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)

# compute the z-score (pandas .std() uses the sample standard deviation, ddof=1)
df['z_score'] = (df['score'] - df['score'].mean()) / df['score'].std()

# keep only rows with |z| < 3
# note: on this small sample no |z| reaches 3, so every row is kept
df_filtered = df[(df['z_score'] > -3) & (df['z_score'] < 3)]
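
The cutoff of 3 is a convention rather than a fixed rule, and on small samples it can be too strict. The sketch below uses only the standard library and a hypothetical helper, z_score_outliers (not part of the original example), to compare two thresholds on the same data:

```python
from statistics import mean, stdev

scores = [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]


def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold.

    Uses the sample standard deviation (n - 1 in the denominator),
    which matches the pandas default.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]


z_of_120 = (120 - mean(scores)) / stdev(scores)
print(round(z_of_120, 2))  # 2.46

flagged_at_3 = z_score_outliers(scores, threshold=3.0)  # nothing is flagged
flagged_at_2 = z_score_outliers(scores, threshold=2.0)  # only 120 is flagged
```

The extreme value itself inflates the mean and the standard deviation, which is one reason the z-score method works better on larger, roughly normal samples.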

Interquartile range (IQR) method

The IQR is a measure of data dispersion equal to the difference between the third quartile Q3 and the first quartile Q1. The formula is:

IQR = Q3 - Q1

Outliers are values that fall outside these limits:

Q1 - 1.5 \times IQR (lower limit) and Q3 + 1.5 \times IQR (upper limit).

Example in Python:

import pandas as pd

data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)

# compute the quartiles
Q1 = df['score'].quantile(0.25)
Q3 = df['score'].quantile(0.75)
IQR = Q3 - Q1

# compute the outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# keep only the values inside [lower_bound, upper_bound]
df_filtered = df[(df['score'] >= lower_bound) & (df['score'] <= upper_bound)]
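
To see which values the rule actually flags, the same calculation can be sketched with the standard library alone. This assumes that statistics.quantiles with method="inclusive" interpolates the same way as the default of pandas' quantile(), which holds for this data set; the variable names are illustrative:

```python
from statistics import quantiles

scores = [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]

# method="inclusive" interpolates linearly between data points
q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# list the outliers instead of filtering them out
outliers = [x for x in scores if x < lower_bound or x > upper_bound]
print(q1, q3, iqr)               # 64.0 75.5 11.5
print(lower_bound, upper_bound)  # 46.75 92.75
print(outliers)                  # [42, 120]
```

Because quartiles are barely affected by extreme values, the IQR rule flags both 42 and 120 here.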

Visualization (box plots)

Box plots provide a visual way to detect outliers. In a box plot, outliers appear as points outside the whiskers, which represent the range of data that does not include the outliers.

Example in Python:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# sample data
data = {'score': [56, 65, 67, 74, 75, 42, 76, 63, 67, 85, 120]}
df = pd.DataFrame(data)

# draw the box plot
sns.boxplot(x=df['score'])
plt.title('Box Plot of Scores')
plt.show()

Central limit theorem

The central limit theorem states that, given a large enough sample size, the distribution of the sample mean approximates a normal distribution, regardless of the shape of the original distribution of the data. Formally, if X_1, X_2, \ldots, X_n are independent and identically distributed random variables with finite mean \mu and finite variance \sigma^2, then the distribution of the standardized mean \sqrt{n}(\overline{X}_n - \mu)/\sigma approaches the standard normal distribution as the sample size n increases.

The CLT allows the use of the normal distribution for various statistical methods even when the original data is not normal.

With a large sample size, the sample mean becomes an accurate estimate of the population mean, because its standard error \sigma/\sqrt{n} shrinks as n grows.

The CLT also allows you to build confidence intervals for the population mean based on the sample mean and standard error. For example, for a large enough sample you can use the formula:

    \overline{X} \pm Z \cdot \left(\frac{\sigma}{\sqrt{n}}\right),

where \overline{X} is the sample mean, Z is the critical value for a given confidence level (for example, 1.96 for a 95% confidence interval), \sigma is the standard deviation, and n is the sample size.
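
The interval formula can be sketched in a few lines. The function name and the input numbers below are hypothetical and chosen only for illustration; the critical value comes from the standard library's NormalDist:

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    # two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# hypothetical numbers: sample mean 50, standard deviation 10, n = 100
low, high = mean_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```

In practice \sigma is usually unknown and replaced by the sample standard deviation, which is reasonable only when the sample is large.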

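To illustrate the central limit theorem, let's write code that draws many samples of a fixed size from a population, plots the distribution of the sample means, and tests it for normality. Note that the population here is itself normal, so the sample means are normal for any sample size; the same check is more telling with a skewed population.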

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import normaltest

# parameters
population_mean = 50
population_std_dev = 10
sample_size = 30  # sample size used to demonstrate the CLT
num_samples = 1000  # number of samples

# generate the population from a normal distribution
population = np.random.normal(loc=population_mean, scale=population_std_dev, size=100000)

# compute the mean of each sample (np.random.choice samples with replacement by default)
sample_means = [np.mean(np.random.choice(population, sample_size)) for _ in range(num_samples)]

# plot a histogram of the sample means
plt.figure(figsize=(10, 6))
plt.hist(sample_means, bins=30, edgecolor="k", alpha=0.7)
plt.title('Histogram of sample means (n = {})'.format(sample_size))
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
plt.axvline(np.mean(sample_means), color="r", linestyle="dashed", linewidth=2, label='Mean of sample means')
plt.axvline(population_mean, color="g", linestyle="dashed", linewidth=2, label='Population mean')
plt.legend()
plt.show()

# test the sample means for normality (D'Agostino-Pearson test; H0: the data are normal)
stat, p_value = normaltest(sample_means)
print('Normality test statistic: {:.3f}, p-value: {:.3f}'.format(stat, p_value))

if p_value > 0.05:
    print('Sample means are consistent with a normal distribution (fail to reject H0)')
else:
    print('Sample means are not normally distributed (reject H0)')

Output of one run (the values vary between runs because no random seed is set):

Normality test statistic: 0.689, p-value: 0.708
Sample means are consistent with a normal distribution (fail to reject H0)
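
Because the population in the example above is itself normal, the sample means would look normal at any sample size. A sketch that makes the effect of n visible, assuming an exponential (strongly skewed) population instead; the helper names, the seed and the chosen sample sizes are illustrative and not from the original article:

```python
import random


def skewness(values):
    """Population skewness: close to 0 for a symmetric distribution."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


def sample_mean_skewness(sample_size, num_samples=5000, seed=0):
    """Skewness of the means of num_samples samples drawn from Exp(1)."""
    rng = random.Random(seed)
    means = [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(num_samples)
    ]
    return skewness(means)


# the skewness of the sample means should shrink toward 0 as n grows
skew_by_size = {n: sample_mean_skewness(n) for n in (1, 5, 50)}
for n, skew in skew_by_size.items():
    print(f"n = {n:>2}: skewness of sample means = {skew:.2f}")
```

Skewness is used here as a simple numeric stand-in for "how far from normal"; a histogram or a normality test at each n would show the same trend.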

