Data Preprocessing for Machine Learning

In this article we will walk through an example of preprocessing data for further analysis, such as clustering. But first, let's clarify what machine learning is and what stages it consists of.

Machine learning is the process by which a computer learns to interpret data, make decisions, and produce recommendations based on that data. The model relies on data both during training and later, when it is applied.

If the data used for prediction is incorrect, biased, or incomplete, the predictions will be inaccurate.

Data preparation is the process of turning raw data into a form suitable for further processing and analysis.

Data preprocessing includes the following procedures:

  • data checking;

  • data cleaning;

  • data transformation;

  • data augmentation;

  • data optimization.

Data checking includes identifying:

  • duplicates, contradictions, errors;

  • anomalous observations;

  • missing values.

Data cleaning includes:

  • eliminating duplicates, inconsistencies, and errors;

  • handling anomalous observations;

  • handling missing values.

Missing values have to be dealt with in most problems. They can be removed, ignored, or filled in. Removal, however, should be used with care: if the gaps make up a large share of the whole sample, deleting them is not advisable.

Missing values can be filled using (see the sketch after this list):

  • zeros;

  • the mode, median, or mean;

  • indicator variables.
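As a brief sketch (a toy example, not the dataset used below; the column names are illustrative), these strategies look like this in pandas:

import pandas as pd
import numpy as np

# toy frame with gaps: 'col' is numeric, 'cat' is categorical
toy = pd.DataFrame({'col': [1.0, np.nan, 3.0, np.nan, 5.0],
                    'cat': ['a', 'b', None, 'b', 'a']})

# fill with zeros
toy['col_zero'] = toy['col'].fillna(0)

# fill with the mean, median, or mode
toy['col_mean'] = toy['col'].fillna(toy['col'].mean())
toy['col_median'] = toy['col'].fillna(toy['col'].median())
toy['cat_mode'] = toy['cat'].fillna(toy['cat'].mode()[0])

# indicator variable marking where a value was missing
toy['col_missing'] = toy['col'].isna().astype(int)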

Data transformation includes:

  • renaming features;

  • sorting, grouping data;

  • encoding variables;

  • data normalization.

Data augmentation involves creating new features and aggregating existing ones.

Finally, data optimization aims to streamline the preparation process itself. Data preparation can take up to 80% of the time spent on a machine learning project, so it is important to use tools designed specifically for this task.

In this article we will look at an example written in Python in the Google Colaboratory environment.

First, import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

and load the data.csv file, which contains the following data:

df = pd.read_csv('/content/data.csv')

Data Description

  • children – number of children in the family;

  • days_employed – total length of service in days;

  • dob_years – client’s age in years;

  • education – client’s education level;

  • education_id – education level identifier;

  • family_status – marital status;

  • family_status_id – marital status identifier;

  • gender – gender of the client;

  • income_type – type of employment;

  • debt – whether the client has had past-due loan payments (1 – yes, 0 – no);

  • total_income – monthly income;

  • purpose – the purpose of obtaining a loan.

Let's look at the contents of the file using the info() method:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB

Based on this output we can conclude that the feature types match their semantic content. However, the features days_employed and total_income contain missing values.

Detection and processing of anomalous observations

The next step is to check the features for anomalous values.

Anomalous observations are observations that differ sharply from the rest in terms of the characteristic under study.

To visualize the results we will use a box plot (a "box-and-whisker" plot). Constructing such a plot relies on the concept of a quartile.

The term "quartile" denotes the quantiles that are multiples of ¼.

A quantile of order \alpha is a value x_\alpha such that

F(x_\alpha) = P(X < x_\alpha) = \alpha

Quartiles divide the sample into 4 parts.

The first quartile Q1 is the quantile of order 0.25: 25% of the values in the ordered sample are less than Q1 and 75% are greater.

The second quartile Q2, the quantile of order 0.5 (the median), is the number for which 50% of the ordered sample is less than Q2 and 50% is greater.

The third quartile Q3 is the quantile of order 0.75: 75% of the values in the ordered sample are less than Q3 and 25% are greater.

The interquartile range is calculated by the formula:

IQR = Q3 - Q1

A box-and-whisker plot is constructed as follows: the "box" is bounded by the quartiles Q1 and Q3, with the median Q2 inside it, and the whiskers extend to the left and right for a distance of 1.5 * IQR. Values lying beyond the whiskers are treated as anomalous.
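As a minimal sketch, the whisker boundaries for an arbitrary numeric series (here total_income, chosen only as an example) can be computed directly from the quartiles:

# quartiles and interquartile range of a numeric series
s = df['total_income']
Q1 = s.quantile(0.25)
Q3 = s.quantile(0.75)
IQR = Q3 - Q1

# values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as anomalous
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = s[(s < lower) | (s > upper)]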

Using the head() and tail() methods, we can see that the days_employed variable contains quite a lot of negative values.
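The calls themselves are simply:

df.head()   # first 5 rows
df.tail()   # last 5 rows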

Now let's look at the descriptive statistics for all variables:

df.describe()

The following conclusions can be drawn for each of the characteristics:

The children feature has a range from –1 to 20; anomalous values are most likely present.

The days_employed feature has a range from –18,388.9 to 401,755.4; anomalous values are most likely present.

The dob_years feature has a range from 0 to 75; anomalous values are most likely present.

The debt feature has a range from 0 to 1; there are no anomalous values.

The total_income feature has a range from 20,667 to 2,265,604; anomalously large values are possible here as well.

The variables family_status_id and education_id are identifiers and carry no semantic meaning of their own, so we do not check them for anomalies.

Next, let's analyze the anomalous values feature by feature. We start with children: using the .value_counts() method, we look at the unique values stored in the feature and how often each occurs.
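The call looks like this:

df['children'].value_counts()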

Let's assume these anomalous observations are the result of technical data-entry errors, so we replace the values –1 and 20 with 1 and 2, respectively, using the .replace() method.

df['children'] = df['children'].replace(-1, 1)
df['children'] = df['children'].replace(20, 2)

The next feature we will consider is days_employed.

Earlier, when analyzing the descriptive statistics, we noticed that this feature contains both anomalously large and anomalously small observations. Let's identify the employment types of borrowers with negative values of the variable.

df[df['days_employed'] < 0]['income_type'].value_counts()

Now the same for positive values of the variable.

df[df['days_employed'] > 0]['income_type'].value_counts()

Let's replace the negative values with positive ones using .loc[] and .abs().

df.loc[df['days_employed'] < 0, 'days_employed'] = df['days_employed'].abs()

Let's look at the descriptive statistics for the variable:

df['days_employed'].describe()

This feature contains anomalous observations. To visualize the data, let's plot a histogram of the length of service using the .hist() method.

df.hist('days_employed', bins=25)

Let's analyze the maximum length of service in days and years.

max_day = df['days_employed'].max()
max_day

We get 401755.4

Let's convert days into years.

max_year = max_day/365
max_year

We get 1100.69

In other words, the maximum recorded length of service is more than 1,100 years, which is clearly anomalous.

Now let's analyze the mean and median length of service in years using the .mean() and .quantile() methods.

mean_day = df['days_employed'].mean()
mean_year = mean_day/365
mean_year

We get 183.33

median_day = df['days_employed'].quantile(0.5)
median_year = median_day/365
median_year

We get 6.01

We see that the mean and median length of service differ greatly; further investigation is required.

Let's compute the count, mean, maximum, and minimum of days_employed grouped by borrower employment type.

df_type = df.groupby(['income_type']).agg({'days_employed' : ['count', 'mean', 'max', 'min']})

We conclude that anomalously large length-of-service values are typical of two employment types: 'безработный' (unemployed) and 'пенсионер' (pensioner). Since the borrower groups 'безработный' (unemployed), 'в декрете' (on maternity leave), 'предприниматель' (entrepreneur), and 'студент' (student) are each represented by a single record, they can be removed from consideration with the .drop() method, reducing the dimensionality of the feature space.

df.drop(df[df['income_type'] == 'безработный'].index, inplace = True)
df.drop(df[df['income_type'] == 'в декрете'].index, inplace = True)
df.drop(df[df['income_type'] == 'предприниматель'].index, inplace = True)
df.drop(df[df['income_type'] == 'студент'].index, inplace = True)

Let's plot the distributions for each borrower class using the matplotlib and seaborn libraries. For convenience, we split the dataframe df by borrower type.

df1 = df[df['income_type'] == 'госслужащий']
df2 = df[df['income_type'] == 'компаньон']
df3 = df[df['income_type'] == 'пенсионер']
df4 = df[df['income_type'] == 'сотрудник']

Now we build a combined figure for the four groups of borrowers.

# set up the figure grid
fig = plt.figure(figsize=(18, 12))

# position of the first plot
plt.subplot(2, 2, 1)

# draw the histogram
ax = sns.histplot(df1['days_employed'], bins=25, color="purple", kde=True)

# plot title
ax.set_title('Класс заемщиков: «госслужащий»')

# position of the second plot
plt.subplot(2, 2, 2)
ax = sns.histplot(df2['days_employed'], bins=25, color="orange", kde=True)
ax.set_title('Класс заемщиков: «компаньон»')

# position of the third plot
plt.subplot(2, 2, 3)
ax = sns.histplot(df3['days_employed'], bins=25, color="green", kde=True)
ax.set_title('Класс заемщиков: «пенсионер»')

# position of the fourth plot
plt.subplot(2, 2, 4)
ax = sns.histplot(df4['days_employed'], bins=25, color="red", kde=True)
ax.set_title('Класс заемщиков: «сотрудник»')

Notice that the 'пенсионер' (pensioner) group clearly contains anomalous observations. Let's assume that the length of service in this group is recorded not in days but in hours.

df.loc[df['income_type'] == 'пенсионер', 'days_employed'] = df['days_employed']/24

For convenience, we also convert the length of service into years and look at the statistics after the transformations performed.

df['years_employed'] = df['days_employed']/365
df_type1 = df.groupby(['income_type']).agg({'days_employed':['count', 'mean', 'max', 'min']})
df_type1

Let's plot the graphs again for each group of borrowers to check how the data has changed; a compact loop (sketched below) avoids repeating the plotting code.
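The sketch below is equivalent to the plotting code above, just applied to the updated dataframe:

# redraw the histograms for the four borrower classes after the correction
groups = ['госслужащий', 'компаньон', 'пенсионер', 'сотрудник']
fig = plt.figure(figsize=(18, 12))
for i, g in enumerate(groups, start=1):
    plt.subplot(2, 2, i)
    ax = sns.histplot(df.loc[df['income_type'] == g, 'days_employed'], bins=25, kde=True)
    ax.set_title(f'Класс заемщиков: «{g}»')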

Based on the rebuilt plots, the 'пенсионер' group still contains anomalous observations, so the transformations need to continue.

For each type of employment, we calculate the 0.95 and 0.05 length of service quantiles.

print(int(df.loc[df['income_type'] == 'госслужащий', 'years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'сотрудник', 'years_employed'].quantile(0.95)))

24
16
45
19

print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.05)))

print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.05)))

print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.05)))

print(int(df.loc[df['income_type'] == 'сотрудник', 'years_employed'].quantile(0.05)))

0
0
37
0

Let's conduct a graphical analysis of the anomalous length-of-service observations.
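This is presumably the same boxplot call that appears again later in the article:

sns.boxplot(x = 'income_type', y = 'years_employed', data = df)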

The box-and-whisker plot shows that anomalous observations are present in all groups except 'пенсионер'.

The share of observations above the 0.95 quantile is large, and deleting them could lose meaningful data, so let's also calculate the 0.99 quantile of the length of service.

print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'компаньон', 'years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'сотрудник','years_employed'].quantile(0.99)))

33
28
45
30

print(int(df.loc[df['income_type'] == 'госслужащий','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'компаньон','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'пенсионер','years_employed'].quantile(0.01)))
print(int(df.loc[df['income_type'] == 'сотрудник','years_employed'].quantile(0.01)))

0
0
37
0

We will remove observations exceeding the 0.99 quantile.

df.drop(df[(df['years_employed'] > 33) & (df['income_type'] == 'госслужащий')].index, inplace = True)

df.drop(df[(df['years_employed'] > 28) & (df['income_type'] == 'компаньон')].index, inplace = True)

df.drop(df[(df['years_employed'] > 30 ) & (df['income_type'] == 'сотрудник')].index, inplace = True)

Let's build a box-and-whisker graph after the transformations.

sns.boxplot(x = 'income_type', y = 'years_employed', data = df)

The number of anomalous observations has decreased, but some are still present in three groups. Let's analyze the individual high length-of-service values by comparing them with the borrower's age.

We now turn to the dob_years variable.

df.groupby(['income_type']).agg({'dob_years' : ['min', 'max']})

We can notice that the minimum age value in each of the groups is 0.

Let's count the zero values of the feature by employment type using the .value_counts() method.

df[df['dob_years'] == 0]['income_type'].value_counts()

This is a small enough share of the total observations in each group that these rows can simply be removed.

df.drop(df[df['dob_years'] == 0].index, inplace = True)

Let's see what the minimum age is now.

df.groupby(['income_type']).agg({'dob_years' : ['min', 'max']})

Now let’s analyze the 0.95 and 0.99 quantiles:

print(int(df.loc[df['income_type'] == 'госслужащий','dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'пенсионер','dob_years'].quantile(0.95)))
print(int(df.loc[df['income_type'] == 'сотрудник', 'dob_years'].quantile(0.95)))

58
57
69
58

print(int(df.loc[df['income_type'] == 'госслужащий','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'компаньон','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'пенсионер','dob_years'].quantile(0.99)))
print(int(df.loc[df['income_type'] == 'сотрудник','dob_years'].quantile(0.99)))

65
64
72
63

We will also construct a graph for analyzing anomalous observations.

sns.boxplot(x = 'income_type', y = 'dob_years', data = df)

Let us define the boundaries of the “whiskers” to search for anomalous values.

Q1 = int(df.loc[df['income_type'] == 'сотрудник','dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'сотрудник', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_S = Q3 + 1.5*IQR
up_S

We get 72.0

Q1 = int(df.loc[df['income_type'] == 'пенсионер', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'пенсионер', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
low_P = Q1 - 1.5*IQR
low_P

We get 44.0

Q1 = int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'компаньон', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_K = Q3 + 1.5*IQR
up_K

We get 71.0

Q1 = int(df.loc[df['income_type'] == 'госслужащий', 'dob_years'].quantile(0.25))
Q3 = int(df.loc[df['income_type'] == 'госслужащий', 'dob_years'].quantile(0.75))
IQR = Q3 - Q1
up_G = Q3 + 1.5*IQR
up_G

We get 72.0

Let us calculate the number of values ​​for each type of borrower that go beyond the corresponding “whiskers”.

len(df[(df['dob_years'] > 72) & (df['income_type'] == 'сотрудник')])

We get 1

len(df[(df['dob_years'] < 44) & (df['income_type'] == 'пенсионер')])

We get 79

For the other borrower types the count is likewise a single observation each. Given the small number of anomalous values, we remove them from the sample.

df.drop(df[(df['dob_years'] > 72) & (df['income_type'] == 'сотрудник')].index, inplace = True)

df.drop(df[(df['dob_years'] < 44) & (df['income_type'] == 'пенсионер')].index, inplace = True)

df.drop(df[(df['dob_years'] > 71) & (df['income_type'] == 'компаньон')].index, inplace = True)

df.drop(df[(df['dob_years'] > 72) & (df['income_type'] == 'госслужащий')].index, inplace = True)

Let's see how the sample changed after the transformations.

sns.boxplot(x = 'income_type', y = 'dob_years', data = df)

Based on this diagram, we can conclude that there are no anomalous values.

Now let's move on to studying the variable gender.

If we look at the unique values with the .value_counts() method used earlier, we will see one observation with an undefined gender, which can be excluded from the sample.
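A hedged sketch of this step (the expected labels 'F' and 'M' are an assumption about the data; check the actual output of value_counts() first):

df['gender'].value_counts()

# keep only the rows whose gender is one of the expected labels (assumed 'F'/'M')
df = df[df['gender'].isin(['F', 'M'])]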

When examining the family_status variable, let's also look at the unique values. To reduce the dimensionality of the feature space, we combine the values into two classes, married and single/unmarried (a sketch follows below).
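A possible sketch of the merge (the Russian category names listed here are assumptions; substitute the values that value_counts() actually returns):

df['family_status'].value_counts()

# categories treated as 'married' are illustrative
married = ['женат / замужем', 'гражданский брак']
df['family_status'] = df['family_status'].apply(
    lambda x: 'married' if x in married else 'single/unmarried')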

In the total_income variable one can notice anomalously large observations. Processing them is similar to the whisker-boundary procedure carried out earlier: perform the quantile analysis and decide how to treat the values that fall outside the boundaries (see the sketch below).
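By analogy with the earlier steps, a sketch of that processing might be:

# upper whisker boundary for total_income
Q1 = df['total_income'].quantile(0.25)
Q3 = df['total_income'].quantile(0.75)
upper = Q3 + 1.5 * (Q3 - Q1)

# share of observations above the upper whisker; if small, they may be dropped
print((df['total_income'] > upper).mean())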

Analysis and elimination of duplicates

The next step is the analysis and removal of duplicates.

Let's look at the unique values of the education feature:

df['education'].unique()

Let's convert all values to lowercase using the .str.lower() method and look at the unique values again.

df['education'] = df['education'].str.lower()
df['education'].unique()

Let's analyze the presence of duplicates (without analyzing the variable purpose):

df.duplicated().sum()

We get 71

Let's remove all the existing duplicates using the .drop_duplicates() method:
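# remove full-row duplicates and verify that none remain
df = df.drop_duplicates()
df.duplicated().sum()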

Analysis and processing of missing values

As established earlier, missing values are present in two variables: days_employed and total_income. The days_employed variable was analyzed and processed above, where we derived years_employed from it. Let's start with the total_income variable.

Let's create a dataframe for each income_type class. Since the structure of the dataframe df has changed, we cannot reuse the previously created variables df1, ..., df4.

df_S =  df[df['income_type'] == 'сотрудник']
df_P =  df[df['income_type'] == 'пенсионер']
df_K =  df[df['income_type'] == 'компаньон']
df_G =  df[df['income_type'] == 'госслужащий']

Let's build histograms for each type of borrower, as described earlier (a sketch is given below).
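A minimal sketch reusing the loop pattern from above (total_income still contains gaps at this point; seaborn simply skips them):

fig = plt.figure(figsize=(18, 12))
for i, (name, part) in enumerate([('сотрудник', df_S), ('пенсионер', df_P),
                                  ('компаньон', df_K), ('госслужащий', df_G)], start=1):
    plt.subplot(2, 2, i)
    ax = sns.histplot(part['total_income'], bins=25, kde=True)
    ax.set_title(f'Класс заемщиков: «{name}»')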

For each type we find the median and the mean.

df_S.agg({'total_income' : ['mean', 'median']})
df_P.agg({'total_income' : ['mean', 'median']})
df_K.agg({'total_income' : ['mean', 'median']})
df_G.agg({'total_income' : ['mean', 'median']})

We replace the gaps with the median value of each group, using a lambda function together with the .fillna() method.

group = df.groupby('income_type')
# fill gaps with the median of each income_type group; transform keeps the result
# aligned with the original index
df['total_income'] = group['total_income'].transform(lambda x: x.fillna(x.median()))

Next, let's move on to analyzing the years_employed variable. Let's determine how many clients would have had to start working before the age of 18, that is, whose length of service exceeds their age minus 18.

df.loc[df['years_employed'] > df['dob_years'] - 18].shape[0]

We get 1908

And how many would have had to start working before the age of 14?

df.loc[df['years_employed'] > df['dob_years'] - 14].shape[0]

We get 784

Based on these results, the number of records implying a start of work before age 14 is less than half of those implying a start before age 18, but both figures point to implausible values. The years_employed variable is therefore uninformative, and we will delete it.

We will also remove the redundant columns, creating a new dataframe df_new:

df_new = df.drop(['education_id', 'family_status_id', 'days_employed', 'purpose', 'years_employed'], axis = 1)

Let's look at the result of the data preprocessing:

df_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20316 entries, 0 to 21524
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   children       20316 non-null  int64  
 1   dob_years      20316 non-null  int64  
 2   education      20316 non-null  object 
 3   family_status  20316 non-null  object 
 4   gender         20316 non-null  object 
 5   income_type    20316 non-null  object 
 6   debt           20316 non-null  int64  
 7   total_income   20316 non-null  float64
 8   purpose_lem    20316 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 1.5+ MB

Thus, this article has examined the data preprocessing workflow using a concrete example from the banking industry. The next stage of machine learning, for example clustering, can now follow.
