Data Cleaning Methods in Pandas
Missing values
Missing values can occur for a variety of reasons. Regardless of the reason, they must first be discovered.
isnull() in Pandas returns a DataFrame or Series of boolean values, where True indicates a missing value. Likewise, notnull() returns True for elements where the value is present:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 2, 3]})
# boolean mask: True marks missing values
print("Missing values:\n", df.isnull())
# the inverse mask: True marks present values
print("Present values:\n", df.notnull())
The code will display a table where True denotes missing values.
What to do with them next? One approach is to delete the rows or columns that contain gaps, but this can discard useful information. Another approach is to fill the gaps in.
To fill in the gaps, use the fillna() method:
# fill gaps with the column means (numeric_only avoids errors on non-numeric columns)
df.fillna(df.mean(numeric_only=True), inplace=True)
print("df with gaps filled by column means:\n", df)
Gaps can also be filled column by column, with a different strategy for each column, or with a custom function:
# fill gaps with a per-column strategy
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())
# custom fill function applied to every column
def custom_fill(series):
    return series.fillna(series.mean())
df = df.apply(custom_fill)
Removing with dropna()
By default, dropna() removes rows that contain at least one missing value:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Name': ['ivan', 'artem', np.nan, 'diana', 'eva'],
    'Age': [25, np.nan, np.nan, 27, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Miami']
})
# drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
dropna() deletes every row containing at least one NaN.
To remove columns containing missing values, you can use the parameter axis=1:
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)
You can remove rows where all values are missing using the parameter how='all':
# add a row consisting entirely of NaN
df.loc[5] = [np.nan, np.nan, np.nan]
# drop rows where every value is NaN
df_cleaned_all = df.dropna(how='all')
print(df_cleaned_all)
You can restrict the NaN check to specific columns using the subset parameter:
# drop rows with missing values in the specified columns
df_cleaned_subset = df.dropna(subset=['Name', 'City'])
print(df_cleaned_subset)
In this case, only rows with missing values in the Name or City columns are deleted.
The thresh parameter sets the minimum number of non-missing values a row (or column) must have in order to be kept:
# drop rows with fewer than two non-missing values
df_cleaned_thresh = df.dropna(thresh=2)
print(df_cleaned_thresh)
Fill strategies
Mean — often used for numeric data to minimize the impact of gaps on the data distribution:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, np.nan]
})
# fill missing values in column A with that column's mean
df['A'] = df['A'].fillna(df['A'].mean())
Median
The median may be preferred over the mean in cases where the data contains outliers that could skew the mean:
# fill missing values in column B with the median
df['B'] = df['B'].fillna(df['B'].median())
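A quick illustration of why the median resists outliers (toy values chosen purely for this sketch):

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, np.nan, 1000])  # 1000 is an outlier
print(s.mean())    # 251.5 — pulled up by the outlier
print(s.median())  # 2.5 — unaffected
# imputing with the median keeps the filled value near the bulk of the data
print(s.fillna(s.median()).tolist())
```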
Mode
For categorical data or data with discrete values, you can use mode:
# treating C as a categorical column: fill gaps with the mode
df['C'] = df['C'].fillna(df['C'].mode()[0])
You can use custom functions for more complex filling logic. For example, you can fill the gaps with values derived from other columns:
# fill missing values in column A from column B using custom logic
df['A'] = df.apply(lambda row: row['B'] if pd.isnull(row['A']) else row['A'], axis=1)
When the fill should take grouping by some attribute into account, use apply() or transform() together with groupby():
# suppose there is a group column, and we want to fill gaps with the mean within each group
df['group'] = ['X', 'X', 'Y', 'Y', 'Z']
df['A'] = df.groupby('group')['A'].transform(lambda x: x.fillna(x.mean()))
What about duplicates?
Duplicates can be either completely identical records or partial duplicates, when only some fields match. In any case, duplicates introduce noise into the data, increase its volume and can lead to incorrect analytical conclusions.
drop_duplicates() removes complete duplicate rows and also provides options for handling partial duplicates. First, let's remove complete duplicates:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['a', 'b', 'b', 'c', 'c', 'c'],
    'C': [100, 200, 200, 300, 300, 300]
})
# drop duplicate rows
df_unique = df.drop_duplicates()
print(df_unique)
The code will remove all complete duplicate rows, leaving unique combinations of values across all columns.
Often you need to remove duplicates based not on all columns, but only on some. For this, drop_duplicates() has a subset parameter:
# drop duplicates based only on columns 'A' and 'B'
df_unique_subset = df.drop_duplicates(subset=['A', 'B'])
print(df_unique_subset)
The keep parameter controls which duplicates are dropped: the first occurrence, the last, or all of them:
# drop all duplicates except the last occurrence
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
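To drop all of them, as the text above mentions, pass keep=False — a small sketch on a frame made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['a', 'b', 'b', 'c']
})
# keep=False drops every row that has a duplicate, keeping none of them
df_no_dups = df.drop_duplicates(keep=False)
print(df_no_dups['A'].tolist())  # [1, 3]
```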
You can use a combination of Pandas methods to preprocess the data before removing duplicates. For example, you can convert all strings to lowercase, remove spaces, convert dates to a single format, etc.:
# convert all strings to lowercase and remove spaces
df['B'] = df['B'].str.lower().str.replace(' ', '')
# now drop duplicates
df_preprocessed = df.drop_duplicates()
print(df_preprocessed)
Categorical data
get_dummies() converts categorical variables into dummy/indicator variables; it is used quite often in ML:
import pandas as pd
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green']
})
# convert categorical data into indicator variables
dummies = pd.get_dummies(df['Color'])
print(dummies)
As a result, each unique value in the Color column becomes a separate indicator column (0 or 1) that marks the presence or absence of that category in each row.
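In ML pipelines, get_dummies() is often called with drop_first=True to avoid redundant, collinear columns, and with prefix to name the result (both are standard pandas parameters):

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
# drop_first=True removes the first category (alphabetically 'Blue'),
# since it is implied when all other indicator columns are 0
dummies = pd.get_dummies(df['Color'], prefix='Color', drop_first=True)
print(dummies.columns.tolist())  # ['Color_Green', 'Color_Red']
```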
factorize() produces a numeric representation of categorical data, assigning each unique category an integer identifier:
# get a numeric representation of the categorical data
codes, uniques = pd.factorize(df['Color'])
print(codes)    # array of codes
print(uniques)  # array of unique values
The str methods allow you to perform operations such as case conversion, searching and replacing substrings, splitting strings, and so on:
df_text = pd.DataFrame({
    'Text': ['This is a test.', 'Hello, World!', 'Otus!']
})
# convert all texts to lowercase
df_text['Text_lower'] = df_text['Text'].str.lower()
# remove punctuation
df_text['Text_clean'] = df_text['Text'].str.replace(r'[^\w\s]', '', regex=True)
print(df_text)
For more targeted cleaning and extraction you can also use str.replace() and str.findall().
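For example, str.findall() returns every regex match in each row as a list (sample strings invented for this sketch):

```python
import pandas as pd

df_text = pd.DataFrame({'Text': ['order 12 and 7 items', 'no digits here']})
# extract all runs of digits from each string
df_text['numbers'] = df_text['Text'].str.findall(r'\d+')
print(df_text['numbers'].tolist())  # [['12', '7'], []]
```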
Converting text data to numbers is often required for analytical and ML models. For example, after applying factorize() to categorical data, you can map the numeric codes back to text categories via the uniques array:
# convert the codes back into text categories
df['Color_restored'] = codes
df['Color_restored'] = df['Color_restored'].map(lambda x: uniques[x])
print(df)
What cleaning methods do you know?
In anticipation of the start of the Data Analyst course, I want to invite you to free webinars about preparing data in Pandas, as well as about SQL in the realities of a beginning data analyst.