3 Questions to Prepare for an Interview

The first question concerns cross-validation.

What is cross-validation?

Cross-validation is a model evaluation method where the data is divided into multiple sets and the model is trained and tested on different combinations of these sets. The main idea here is to get a more accurate and reliable estimate of the model's performance.

Basic Cross-Validation Techniques

K-Fold Cross-Validation

The most popular method. It consists of the following:

  1. Splitting data into k parts: the data is divided into k equal subsets or “folds”.

  2. Training and testing: the model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, each time with a new fold for testing.

  3. Averaging the results: all k test results are averaged to obtain a final estimate of the model's performance.

For example, if there are 1000 observations and we use 5-fold cross-validation, the data will be divided into 5 parts of 200 observations. The model is trained on 800 observations and tested on 200. This process is repeated 5 times, each time with a new testing set.
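
In practice, this procedure is rarely written by hand. A minimal sketch using scikit-learn's cross_val_score (the model, dataset, and metric here are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# toy dataset with 1000 observations, as in the example above
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())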

Time-Series Cross-Validation

A special case of cross-validation, time-series cross-validation, is used for time-series data. Since such data is time-dependent, traditional cross-validation with random splits is not suitable: it would allow the model to train on the future and be tested on the past. In this method, the data is divided so that each test set contains later time points than the training set.

Example:

  1. Separating data by time: The data is divided into training and test sets such that the test sets always follow the training sets in time.

  2. Training and testing: the model is trained on earlier data and tested on later data. This process is repeated several times, each time with a later data interval, as shown in the sketch after this list.
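
A minimal sketch of this scheme using scikit-learn's TimeSeriesSplit (the toy series and the model are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# toy time series: 100 time points with a single feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) + np.random.randn(100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # the test indices always come after the training indices
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    print(f"train up to {train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}, "
          f"R^2 = {model.score(X[test_idx], y[test_idx]):.2f}")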

Cross-validation helps detect overfitting, a situation where a model fits the training data too closely and performs poorly on new data.

Let's move on to the next question.

How would you handle a data set with missing values?

If missing values occur in only a small number of rows, the easiest way is to remove those rows. In Python, this can be done with the pandas dropna() method:

df = df.dropna()

This approach is suitable if the missing values are random and there are few of them. However, if the missing data is not random, important information can be lost.

If a particular column has a lot of missing values, it might make sense to remove the entire column:

df = df.dropna(axis=1)  # drops every column that contains at least one missing value

This method is applicable if the column does not contain important information and its removal will not affect the data analysis.
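
If dropping every such column is too aggressive, you can remove only the columns whose share of missing values exceeds a threshold; a small sketch (the 50% cutoff is an arbitrary assumption):

# fraction of missing values in each column
missing_share = df.isnull().mean()

# keep only the columns where less than half of the values are missing
df = df.loc[:, missing_share < 0.5]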

For numerical data, the mean or the median is often substituted for the missing values:

# impute with the mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# or, alternatively, with the median
df['column_name'] = df['column_name'].fillna(df['column_name'].median())

The mean is appropriate for normally distributed data, and the median is appropriate for data with outliers.

For categorical data, you can substitute the most frequent value (the mode):

df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])

Regression models can be used to predict missing values based on other features:

from sklearn.linear_model import LinearRegression

# complete rows (no missing values) and rows where the target column is missing
complete_data = df.dropna()
missing_data = df[df['column_name'].isnull()]

# fit the model on the complete rows
model = LinearRegression()
model.fit(complete_data[['feature1', 'feature2']], complete_data['column_name'])

# predict the missing values and fill them in
df.loc[df['column_name'].isnull(), 'column_name'] = model.predict(missing_data[['feature1', 'feature2']])

This takes into account the dependencies between features, which generally improves the accuracy of the imputation.

In random imputation, missing values are replaced with values randomly sampled from the available data:

import numpy as np

def random_imputation(df, feature):
    missing = df[feature].isnull()
    num_missing = missing.sum()
    # sample replacement values (with replacement) from the observed ones
    sampled = np.random.choice(df[feature].dropna(), num_missing)
    df.loc[missing, feature] = sampled
    return df

df = random_imputation(df, 'column_name')

This can preserve the distribution of the data, but may introduce noise.

Multiple imputation performs several iterations of missing value imputation and combines the results:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)
# fit_transform returns a NumPy array with the missing values filled in
df_imputed = imp.fit_transform(df)

The method better preserves data variability.

Can you explain how Principal Component Analysis (PCA) works?

PCA helps simplify data analysis by reducing its dimensionality while preserving as much information as possible.

The first step in PCA is to standardize the data so that each variable contributes equally to the analysis. Standardization is done by subtracting the mean of each variable and dividing by its standard deviation:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

Standardization is necessary because PCA is sensitive to the scale of the variables. If the variables have different scales, those with a larger scale will dominate the analysis.

After standardization, the covariance matrix is calculated to understand how the variables are related to each other:

import numpy as np

cov_matrix = np.cov(scaled_data, rowvar=False)

The covariance matrix shows how changes in one variable are related to changes in other variables.

The next step is to calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors indicate the directions in which the data has the most variance, and the eigenvalues indicate how much of the variance in the data is explained by each principal component:

eig_values, eig_vectors = np.linalg.eig(cov_matrix)

The eigenvectors define the axes of the new coordinate system along which the variance of the data is maximal.

The eigenvectors are sorted in descending order of their eigenvalues. This way, it is possible to determine which principal components contain the most information:

sorted_index = np.argsort(eig_values)[::-1]
sorted_eig_values = eig_values[sorted_index]
sorted_eig_vectors = eig_vectors[:, sorted_index]

The first few principal components will contain most of the variation in the data.
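
How much variance each component retains can be checked directly from the sorted eigenvalues; a short sketch continuing the code above:

# share of the total variance explained by each principal component
explained_variance_ratio = sorted_eig_values / np.sum(sorted_eig_values)
print(explained_variance_ratio[:2])          # first two components
print(np.cumsum(explained_variance_ratio))   # cumulative share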

In the final step, the data is transformed into a new space defined by the principal components:

n_components = 2  # choose the number of principal components
pca_components = sorted_eig_vectors[:, :n_components]
transformed_data = np.dot(scaled_data, pca_components)

This way we reduced the dimensionality of the data while preserving most of its information.
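
As a cross-check, the same transformation (up to the sign of each component) can be obtained with scikit-learn's ready-made implementation, which is what is typically used in practice; a minimal sketch continuing from scaled_data above:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
transformed_sklearn = pca.fit_transform(scaled_data)

# share of variance explained by the two selected components
print(pca.explained_variance_ratio_)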


You can gain more practical machine learning skills in practical online courses from industry experts.

We also invite all beginners in this field to an open lesson dedicated to data preparation in Pandas. There we will go through the data processing steps in sequence: handling missing values, handling duplicates, and searching for anomalies.

You can sign up for the lesson via the link.
