Data Preprocessing for Machine Learning
In this article, we will work through an example of preprocessing data for further research, for example with a clustering method. But first, let's clarify what machine learning is and what stages it consists of.
Machine learning is the process by which a computer learns to interpret data, make decisions, and generate recommendations based on that data, both during training and afterwards.
If the data used for prediction is incorrect, biased, or incomplete, the forecast results will be inaccurate as well.
Data preparation is the process of turning "raw" data into a form suitable for further processing and analysis.
Data preprocessing includes the following procedures:
data checking;
data cleaning;
data transformation;
data augmentation;
data optimization.
Data checking includes identifying:
duplicates, contradictions, errors;
anomalous observations;
missing values.
Data cleaning includes:
eliminating duplicates, inconsistencies and errors;
handling anomalous observations;
handling missing values.
When solving various problems, missing values must be handled: they can be deleted, ignored, or filled in. Deletion, however, should be used with care: if the gaps make up a large share of the entire data sample, removing them is not advisable.
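As a rough sketch, the share of gaps can guide this decision (the data here is synthetic and the 5% threshold is an arbitrary assumption, not a rule from the article):

```python
import pandas as pd
import numpy as np

# synthetic column with gaps: 2 missing out of 8 values (illustrative, not from data.csv)
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0])

missing_share = s.isna().mean()  # fraction of missing values

# drop the gaps only when they make up a small share of the sample
# (the 5% cutoff is an arbitrary assumption for this sketch)
if missing_share < 0.05:
    s = s.dropna()
```

Here the gaps make up 25% of the sample, so under this rule they would be kept and filled instead of dropped.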
Gaps can be filled using:
zeros;
mode, median or mean;
indicator variables.
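The filling strategies above can be sketched in pandas on a synthetic column (the values here are illustrative, not from data.csv):

```python
import pandas as pd
import numpy as np

# synthetic column with two gaps
s = pd.Series([3.0, np.nan, 5.0, 7.0, np.nan, 5.0])

filled_zero = s.fillna(0)             # fill with zeros
filled_median = s.fillna(s.median())  # fill with the median
filled_mean = s.fillna(s.mean())      # fill with the mean
filled_mode = s.fillna(s.mode()[0])   # fill with the mode
indicator = s.isna().astype(int)      # indicator variable: 1 where the value was missing
```

Which strategy is appropriate depends on the feature: the median is robust to outliers, while an indicator variable preserves the fact that the value was missing.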
Data transformation includes:
renaming features;
sorting, grouping data;
variable coding;
data normalization.
Data augmentation involves creating new features and aggregating existing ones.
Data optimization includes, among other things, reducing the volume of the data and the dimensionality of the feature space.
Data preparation can take up to 80% of the time spent on a machine learning project, so it is important to use tools specifically designed for it.
In this article, we will look at an example written in Python in the Google Colaboratory environment.
First, import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
and load the data.csv file, which contains the following data:
df = pd.read_csv('/content/data.csv')
Data Description
children – number of children in the family;
days_employed – total length of service in days;
dob_years – client’s age in years;
education – client’s education level;
education_id – education level identifier;
family_status – marital status;
family_status_id – marital status identifier;
gender – gender of the client;
income_type – type of employment;
debt – whether the client has had loan repayment debt (1 – yes, 0 – no);
total_income – monthly income;
purpose – the purpose of obtaining a loan.
Let's look at the contents of the file using the info() method:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   children          21525 non-null  int64
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64
 3   education         21525 non-null  object
 4   education_id      21525 non-null  int64
 5   family_status     21525 non-null  object
 6   family_status_id  21525 non-null  int64
 7   gender            21525 non-null  object
 8   income_type       21525 non-null  object
 9   debt              21525 non-null  int64
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
From this output, we can conclude that the feature types match their semantic content. However, the features days_employed and total_income contain missing values.
Detection and processing of anomalous observations
The next step is to check the features for anomalous values.
Anomalous observations (outliers) are observations that differ sharply from the rest with respect to the characteristic under study.
To visualize the results, we will use a box plot (a "box-and-whisker" chart). Constructing such a chart relies on the concept of a quartile.
The term "quartile" denotes quantiles that are multiples of ¼.
A quantile of order p is a value x such that a fraction p of the ordered sample lies below x.
Quartiles divide the sample into 4 parts.
The first quartile Q1, of order 0.25, is the number that cuts off ¼ of the ordered sample: 25% of the values are less than Q1 and 75% are greater.
The second quartile Q2, of order 0.5 (the median), is the number for which 50% of the ordered sample is less than Q2 and 50% is greater.
The third quartile Q3, of order 0.75, is the number for which 75% of the values of the ordered sample are less than Q3 and 25% are greater.
The interquartile range (IQR) is calculated by the formula IQR = Q3 - Q1.
The box-and-whisker plot looks as follows: the "box" is bounded by the quartiles Q1 and Q3, with the median Q2 inside it. The whiskers extend to the left and right by a distance of 1.5 * IQR.
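The whisker boundaries can also be computed directly; a minimal sketch on a synthetic sample:

```python
import pandas as pd

# synthetic sample with one clearly extreme value (40)
s = pd.Series([1, 2, 4, 4, 5, 6, 7, 8, 9, 40])

q1 = s.quantile(0.25)      # first quartile Q1
q3 = s.quantile(0.75)      # third quartile Q3
iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower whisker boundary
upper = q3 + 1.5 * iqr     # upper whisker boundary

# observations outside the whiskers are treated as anomalous
outliers = s[(s < lower) | (s > upper)]
```

Any value outside [lower, upper] would fall beyond the whiskers on the box plot.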
Using the head() and tail() methods, we see that the variable days_employed contains quite a lot of negative values.
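A self-contained sketch of such a check (the numbers here are synthetic, not from data.csv):

```python
import pandas as pd

# synthetic days_employed column with negative entries, as observed in the dataset
df_demo = pd.DataFrame({'days_employed': [-8437.7, -4024.8, 340266.1, -926.2]})

print(df_demo.head())  # first rows of the frame
print(df_demo.tail())  # last rows of the frame

n_negative = (df_demo['days_employed'] < 0).sum()  # how many values are negative
```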
Now let's look at the descriptive statistics for all variables:
df.describe()
The following conclusions can be drawn for each of the characteristics:
The children feature ranges from -1 to 20; anomalous values are likely present.
The days_employed feature ranges from -18,388.9 to 401,755.4; anomalous values are likely present.
The dob_years feature ranges from 0 to 75; anomalous values are likely present.
The debt feature ranges from 0 to 1; there are no anomalous values.
The total_income feature ranges from 20,667 to 2,265,604; there are no anomalous values.
The variables family_status_id and education_id are unique identifiers and carry no semantic content, so we do not check them for anomalies.
Next, we will analyze the anomalous values based on specific characteristics.
Let's start with the children feature. Using the .value_counts() method, we look at the unique values and their counts.
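In the notebook this is a single call on df; a self-contained sketch with a synthetic column containing the kinds of entry errors discussed below:

```python
import pandas as pd

# synthetic children column with the suspicious values -1 and 20
df_demo = pd.DataFrame({'children': [0, 1, 2, -1, 20, 1, 0]})

# unique values and how often each occurs
counts = df_demo['children'].value_counts()
print(counts)
```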
Let's assume that the anomalous observations are the result of technical data-entry errors, so we replace the values -1 and 20 with 1 and 2, respectively, using the .replace() method.
df['children'] = df['children'].replace(-1, 1)
df['children'] = df['children'].replace(20, 2)
The next feature to consider is days_employed.
Earlier, when analyzing the descriptive statistics, we noticed that this feature contains both anomalously large and anomalously small observations. Therefore, we will identify the borrower employment types with negative values of the variable.
df[df['days_employed'] < 0]['income_type'].value_counts()
Now – with positive values of the variable.
df[df['days_employed'] > 0]['income_type'].value_counts()
Let's replace the negative values with positive ones using the .loc[] and .abs() methods.
df.loc[df['days_employed'] < 0, 'days_employed'] = df['days_employed'].abs()
Let's look at the descriptive statistics for the variable:
df['days_employed'].describe()
This feature contains anomalous observations. To visualize the data, let's plot a histogram of length of service using the .hist() method.
df.hist('days_employed', bins =25)
Let's analyze the maximum length of service in days and years.
max_day = df['days_employed'].max()
max_day
We get 401755.4
Let's convert days into years.
max_year = max_day/365
max_year
We get 1100.69
We can conclude that the maximum recorded length of service exceeds 1,100 years, which is clearly impossible.
Now let's analyze the mean and median length of service in years using the .mean() and .quantile() methods.
mean_day = df['days_employed'].mean()
mean_year = mean_day/365
mean_year
We get 183.33
median_day = df['days_employed'].quantile(0.5)
median_year = median_day/365
median_year
We get 6.01
We see that the mean and median length of service differ greatly; further investigation is required.
Let's analyze the total number, average, maximum and minimum values of the attribute days_employed by type of borrower employment.
df_type = df.groupby(['income_type']).agg({'days_employed' : ['count', 'mean', 'max', 'min']})
We conclude that anomalously large length-of-service observations are typical for two employment types: "unemployed" and "pensioner". Since the groups "unemployed", "on maternity leave", "entrepreneur", and "student" are each represented by a single record, they can be removed from consideration using the .drop() method, reducing the dimensionality of the feature space.
df.drop(df[df['income_type'] == 'безработный'].index, inplace = True)
df.drop(df[df['income_type'] == 'в декрете'].index, inplace = True)
df.drop(df[df['income_type'] == 'предприниматель'].index, inplace = True)
df.drop(df[df['income_type'] == 'студент'].index, inplace = True)
Let's plot graphs for each class of borrowers using the matplotlib and seaborn libraries. For convenience, we split the dataframe df by borrower type.
df1 = df[df['income_type'] == 'госслужащий']
df2 = df[df['income_type'] == 'компаньон']
df3 = df[df['income_type'] == 'пенсионер']
df4 = df[df['income_type'] == 'сотрудник']
We build a combined chart for the four groups of borrowers.
# set the figure size
plt.figure(figsize=(18, 12))
# position of the first subplot
plt.subplot(2, 2, 1)
# draw the histogram
ax = sns.histplot(df1['days_employed'], bins=25, color="purple", kde=True)
# subplot title
ax.set_title('Класс заемщиков: «госслужащий»')
# position of the second subplot
plt.subplot(2, 2, 2)
ax = sns.histplot(df2['days_employed'], bins=25, color="orange", kde=True)
ax.set_title('Класс заемщиков: «компаньон»')
# position of the third subplot
plt.subplot(2, 2, 3)
ax = sns.histplot(df3['days_employed'], bins=25, color="green", kde=True)
ax.set_title('Класс заемщиков: «пенсионер»')
# position of the fourth subplot
plt.subplot(2, 2, 4)
ax = sns.histplot(df4['days_employed'], bins=25, color="red", kde=True)
ax.set_title('Класс заемщиков: «сотрудник»')
We can see that the "pensioner" group contains clearly anomalous observations. Let's assume that the length of service in this group is recorded not in days but in hours.
df.loc[df['income_type'] == 'пенсионер', 'days_employed'] = df['days_employed']/24
For convenience of analysis, we also convert length of service into years, then look at the statistics after the transformations performed.
df['years_employed'] = df['days_employed']/365
df_type1 = df.groupby(['income_type']).agg({'days_employed':['count', 'mean', 'max', 'min']})
df_type1
Let's plot the graphs again for each group of borrowers to check how the data has changed.
The resulting charts show that the "pensioner" group still contains anomalous observations, so the transformations must continue.
For each employment type, we calculate the 0.95 and 0.05 quantiles of length of service.
print(int(df.loc[df['income_type'] == 'госслужащий', 'years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'сотрудник', 'years_employed'].quantile(0.95)))
24
16
45
19
print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.05)))
print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.05)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.05)))
print(int(df.loc[df['income_type'] == 'сотрудник', 'years_employed'].quantile(0.05)))
0
0
37
0
Let us conduct a graphical analysis of the anomalous length-of-service observations.
The box-and-whisker plot shows that anomalous observations are present in all groups except "pensioner".
The share of observations exceeding the 0.95 quantile is large, and deleting them could lose significant data, so let's also calculate the 0.99 quantile of length of service.
print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'компаньон', 'years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'сотрудник','years_employed'].quantile(0.99)))
33
28
45
30
print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'сотрудник','years_employed'].quantile(0.01)))
0
0
37
0
We will remove observations exceeding the 0.99 quantile.
df.drop(df[(df['years_employed'] > 33) & (df['income_type'] == 'госслужащий')].index, inplace = True)
df.drop(df[(df['years_employed'] > 28) & (df['income_type'] == 'компаньон')].index, inplace = True)
df.drop(df[(df['years_employed'] > 30 ) & (df['income_type'] == 'сотрудник')].index, inplace = True)
Let's build a box-and-whisker graph after the transformations.
sns.boxplot(x = 'income_type', y = 'years_employed', data = df)
The number of anomalous observations has decreased, but some are still present in three groups. Let us examine individual high length-of-service values by comparing them with the borrower's age, i.e. the dob_years variable.
df.groupby(['income_type']).agg({'dob_years' : ['min', 'max']})
We can notice that the minimum age value in each of the groups is 0.
Let's count the number of zero values of the feature by employment type using the .value_counts() method.
df[df['dob_years'] == 0]['income_type'].value_counts()
This is a small enough number of observations relative to the group totals that they can be removed.
df.drop(df[df['dob_years'] == 0].index, inplace = True)
Let's see what the minimum age is now.
df.groupby(['income_type']).agg({'dob_years' : ['min', 'max']})
Now let’s analyze the 0.95 and 0.99 quantiles:
print(int(df.loc[df['income_type'] == 'госслужащий','dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'пенсионер','dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'сотрудник', 'dob_years'].quantile(0.95)))
58
57
69
58
print(int(df.loc[df['income_type'] == 'госслужащий','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'компаньон','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'пенсионер','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'сотрудник','dob_years'].quantile(0.99)))
65
64
72
63
We will also construct a graph for analyzing anomalous observations.
sns.boxplot(x = 'income_type', y = 'dob_years', data = df)
Let us define the boundaries of the “whiskers” to search for anomalous values.
Q1 = int(df.loc[df['income_type'] == 'сотрудник','dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'сотрудник', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_S = Q3 + 1.5*IQR
up_S
We get 72.0
Q1 = int(df.loc[df['income_type'] == 'пенсионер', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'пенсионер', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
low_P = Q1 - 1.5*IQR
low_P
We get 44.0
Q1 = int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_K = Q3 + 1.5*IQR
up_K
We get 71.0
Q1 = int(df.loc[df['income_type'] == 'госслужащий', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'госслужащий', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_G = Q3 + 1.5*IQR
up_G
We get 72.0
Let us calculate the number of values for each type of borrower that go beyond the corresponding “whiskers”.
len(df[(df['dob_years'] > 72) & (df['income_type'] == 'сотрудник')])
We get 1
len(df[(df['dob_years'] < 44) & (df['income_type'] == 'пенсионер')])
We get 79
For the other types, the count is also a single value each. Given the small number of anomalous values, we remove them from the sample.
df.drop(df[(df['dob_years'] > 72) & (df['income_type'] == 'сотрудник')].index, inplace = True)
df.drop(df[(df['dob_years'] < 44) & (df['income_type'] == 'пенсионер')].index, inplace = True)
df.drop(df[(df['dob_years'] > 71) & (df['income_type'] == 'компаньон')].index, inplace = True)
df.drop(df[(df['dob_years'] > 72) & (df['income_type'] == 'госслужащий')].index, inplace = True)
Let's see how the sample changed after the transformations.
sns.boxplot(x = 'income_type', y = 'dob_years', data = df)
Based on this diagram, we can conclude that there are no anomalous values.
Now let's move on to the gender variable.
Looking at the unique values with the previously used .value_counts() method, we see one observation with an undefined gender, which can be excluded from the sample.
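A sketch of this check and removal; the column values, including the 'XNA' placeholder standing in for the one undefined gender, are assumptions for illustration — use whatever value `value_counts()` actually shows in your data:

```python
import pandas as pd

# synthetic gender column; 'XNA' is an assumed placeholder for the undefined value
df_demo = pd.DataFrame({'gender': ['F', 'M', 'F', 'XNA', 'M']})

print(df_demo['gender'].value_counts())  # reveals the single odd value

# keep only rows with a defined gender
df_demo = df_demo[df_demo['gender'].isin(['F', 'M'])]
```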
When examining the family_status variable, we also look at its unique values. To reduce the dimensionality of the feature space, we combine the values into two classes: married and single/unmarried.
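One way to perform this merge is with a mapping; the category labels and their grouping below are assumptions for illustration, not the exact values in data.csv:

```python
import pandas as pd

# synthetic family_status values (assumed label set, for illustration only)
df_demo = pd.DataFrame({'family_status': [
    'женат / замужем', 'гражданский брак', 'в разводе',
    'вдовец / вдова', 'Не женат / не замужем']})

# hypothetical grouping of the original categories into two classes
married = {'женат / замужем', 'гражданский брак'}
df_demo['family_status'] = df_demo['family_status'].apply(
    lambda v: 'married' if v in married else 'single')
```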
The total_income variable contains anomalously large observations. Handling them is similar to the whisker-boundary procedures carried out earlier: perform a quantile analysis and decide on further processing.
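One possible treatment is to cap anomalously large incomes at the upper whisker boundary rather than delete them; a sketch on synthetic values:

```python
import pandas as pd

# synthetic monthly incomes with one anomalously large value
income = pd.Series([21000, 35000, 42000, 50000, 58000, 2_200_000], dtype=float)

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # upper whisker boundary

# cap (winsorize) everything above the boundary instead of dropping the rows
income_capped = income.clip(upper=upper)
```

Capping keeps the observation in the sample while removing its distorting influence; whether to cap or drop is a judgment call for the specific task.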
Analysis and elimination of duplicates
The next step is the analysis and elimination of duplicates.
Let's look at the unique values of the education feature:
df['education'].unique()
Let's convert all elements to the same case using the .str.lower() method and look at the unique values again.
df['education'] = df['education'].str.lower()
df['education'].unique()
Let's analyze the presence of duplicates (without analyzing the variable purpose):
df.duplicated().sum()
We get 71
Let's remove all existing duplicates using the .drop_duplicates() method.
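A self-contained sketch of the duplicate check and removal on a synthetic frame:

```python
import pandas as pd

# synthetic frame with one exact duplicate row
df_demo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

n_dupes = df_demo.duplicated().sum()  # count fully duplicated rows

# drop the duplicates and rebuild a clean index
df_demo = df_demo.drop_duplicates().reset_index(drop=True)
```

Calling reset_index(drop=True) after dropping rows keeps the index contiguous, which avoids surprises in later positional operations.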
Analysis and handling of missing values
As established earlier, gaps are present in two variables: days_employed and total_income. The days_employed variable was analyzed and processed earlier, when we converted it into years_employed, so let's start with the total_income variable.
Let's create dataframes for each income_type class. Since the structure of the dataframe df has changed, we cannot reuse the previously created variables df1, ..., df4.
df_S = df[df['income_type'] == 'сотрудник']
df_P = df[df['income_type'] == 'пенсионер']
df_K = df[df['income_type'] == 'компаньон']
df_G = df[df['income_type'] == 'госслужащий']
Let's build histograms for each borrower type, as described earlier.
For each type, we find the median and the mean.
df_S.agg({'total_income' : ['mean', 'median']})
df_P.agg({'total_income' : ['mean', 'median']})
df_K.agg({'total_income' : ['mean', 'median']})
df_G.agg({'total_income' : ['mean', 'median']})
We replace the gaps with the group median, using a lambda function and the .fillna() method (applied through .transform() so the filled values keep the original row alignment).
group = df.groupby('income_type')
df['total_income'] = group['total_income'].transform(lambda x: x.fillna(x.median()))
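To verify the result, we can check that no gaps remain after the fill; a self-contained sketch of the group-wise median fill on synthetic data:

```python
import pandas as pd
import numpy as np

# synthetic frame with one gap per income_type group
df_demo = pd.DataFrame({
    'income_type': ['a', 'a', 'a', 'b', 'b', 'b'],
    'total_income': [10.0, np.nan, 30.0, 100.0, 200.0, np.nan],
})

# fill each gap with the median of its own income_type group
df_demo['total_income'] = (df_demo.groupby('income_type')['total_income']
                                  .transform(lambda x: x.fillna(x.median())))

remaining = df_demo['total_income'].isna().sum()  # should be 0 after the fill
```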
Next, let's analyze the years_employed variable. Let's determine how many clients would have had to start working before the age of 18.
df.loc[df['years_employed'] > df['dob_years'] - 18].shape[0]
We get 1908
And how many would have started working before the age of 14?
df.loc[df['years_employed'] > df['dob_years'] - 14].shape[0]
We get 784
The results show that the number of clients who would have started working before age 14 (784) is less than half the number who would have started before 18 (1908). We delete the years_employed variable, since it is uninformative.
We also remove the redundant columns, creating a new dataframe df_new:
df_new = df.drop(['education_id', 'family_status_id', 'days_employed', 'purpose', 'years_employed'], axis = 1)
Let's look at the result of the data preprocessing:
df_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20316 entries, 0 to 21524
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   children       20316 non-null  int64
 1   dob_years      20316 non-null  int64
 2   education      20316 non-null  object
 3   family_status  20316 non-null  object
 4   gender         20316 non-null  object
 5   income_type    20316 non-null  object
 6   debt           20316 non-null  int64
 7   total_income   20316 non-null  float64
 8   purpose_lem    20316 non-null  object
dtypes: float64(1), int64(3), object(5)
memory usage: 1.5+ MB
Thus, this article examined the data preprocessing process using a specific example from the banking industry. The next machine learning stage, such as clustering, would follow.