Data Cleaning Methods in Pandas

Pandas offers a rich toolkit for cleaning data. Let's look at the main methods.

Missing values

Missing values can occur for a variety of reasons. Regardless of the reason, they must first be detected.

isnull() in Pandas returns a DataFrame or Series with boolean values, where True indicates a missing value. Likewise, notnull() returns True for elements where the value is present:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [np.nan, 'x', 'y']
})

# boolean mask: True marks a missing value
print("Missing-value mask:\n", df.isnull())

# number of missing values per column
print("Missing per column:\n", df.isnull().sum())

The code will display a table where True denotes missing values.

What to do with them next? One approach is to delete the rows or columns containing gaps, but this can lose information. Another approach is to fill the gaps in.
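The trade-off between the two approaches can be seen in a minimal sketch (the data here is purely illustrative):

```python
import pandas as pd
import numpy as np

# a small frame with scattered gaps
df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [np.nan, 2, 3, 4],
})

# deleting loses whole rows; filling keeps every row
dropped = df.dropna()
filled = df.fillna(0)

print(dropped.shape)  # (1, 2) — only the one fully populated row survives
print(filled.shape)   # (4, 2) — all rows are kept
```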

To fill in the gaps, use the fillna() method:

# fill gaps with the column means (numeric columns only)
df.fillna(df.mean(numeric_only=True), inplace=True)

print("df with gaps filled by the column means:\n", df)

Gaps can also be filled on a per-column basis, for example with the mean in one column and the median in another, or with a custom function:

# per-column filling: mean for A, median for B
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())

# a custom function for filling
def custom_fill(series):
    return series.fillna(series.mean())

df = df.apply(custom_fill)

Removing with dropna()

By default, dropna() removes rows that contain at least one missing value:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['ivan', 'artem', np.nan, 'diana', 'eva'],
    'Age': [25, np.nan, np.nan, 27, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Miami']
})

# drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)

dropna() deletes every row that contains at least one NaN.

To remove columns containing missing values, use the axis=1 parameter:

df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)

You can remove rows where all values are missing using the how='all' parameter:

# add a row consisting entirely of NaN
df.loc[5] = [np.nan, np.nan, np.nan]

# drop rows where all values are NaN
df_cleaned_all = df.dropna(how='all')
print(df_cleaned_all)

You can specify which columns should be checked for NaN using the subset parameter:

# drop rows with missing values in specific columns
df_cleaned_subset = df.dropna(subset=['Name', 'City'])
print(df_cleaned_subset)

In this case, only rows with missing values in the Name or City columns will be deleted.

The thresh parameter specifies the minimum number of non-missing values a row or column must contain to be kept:

# drop rows with fewer than two non-missing values
df_cleaned_thresh = df.dropna(thresh=2)
print(df_cleaned_thresh)

Filling options

Filling with the mean is often used for numeric data to minimize the impact of gaps on the data distribution.

Mean:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, np.nan]
})

# fill gaps in column A with that column's mean
df['A'] = df['A'].fillna(df['A'].mean())

Median

The median may be preferred over the mean in cases where the data contains outliers that could skew the mean:
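How strongly a single outlier pulls the mean away from the typical values, while the median stays put, can be seen on a tiny hypothetical series:

```python
import pandas as pd

# an income-like series with one extreme outlier (made-up values)
s = pd.Series([30, 35, 40, 45, 10_000])

print(s.mean())    # 2030.0 — dragged far up by the outlier
print(s.median())  # 40.0 — stays near the typical values
```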

# fill gaps in column B with the median
df['B'] = df['B'].fillna(df['B'].median())

Mode

For categorical data or data with discrete values, you can use mode:

# treating C as a categorical column: fill gaps with the mode
df['C'] = df['C'].fillna(df['C'].mode()[0])

You can use custom functions for more complex filling logic. For example, you can fill gaps with values derived from other columns:

# fill gaps in column A with custom logic based on column B
df['A'] = df.apply(lambda row: row['B'] if pd.isnull(row['A']) else row['A'], axis=1)

When the fill should take grouping by some feature into account, use apply() or transform() together with groupby():

# suppose there is a group column; fill gaps with the mean within each group
df['group'] = ['X', 'X', 'Y', 'Y', 'Z']
df['A'] = df.groupby('group')['A'].transform(lambda x: x.fillna(x.mean()))

What about duplicates?

Duplicates can be either completely identical records or partial duplicates, when only some fields match. In any case, duplicates introduce noise into the data, increase its volume and can lead to incorrect analytical conclusions.
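Before deleting anything, it is worth inspecting the duplicates first. A minimal sketch with illustrative data, using duplicated() to build a boolean mask of repeated rows:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['a', 'b', 'b', 'c'],
})

# boolean mask: True marks rows that repeat an earlier row
mask = df.duplicated()

print(mask.sum())  # number of duplicate rows
print(df[mask])    # inspect them before deleting
```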

drop_duplicates() not only deletes complete duplicate rows, but also provides options for handling partial duplicates. First, let's remove complete duplicates:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['a', 'b', 'b', 'c', 'c', 'c'],
    'C': [100, 200, 200, 300, 300, 300]
})

# drop duplicate rows
df_unique = df.drop_duplicates()

print(df_unique)

The code will remove all complete duplicate rows, leaving unique combinations of values across all columns.

Often you need to remove duplicates based not on all columns but only on some. For this, drop_duplicates() has a subset parameter:

# drop duplicates based only on columns 'A' and 'B'
df_unique_subset = df.drop_duplicates(subset=['A', 'B'])

print(df_unique_subset)

The keep parameter controls which occurrences are deleted: everything but the first, everything but the last, or all of them:

# drop all duplicates except the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

print(df_keep_last)
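With keep=False, every row that has a duplicate anywhere is dropped, so only fully unique rows remain. A sketch on the same illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['a', 'b', 'b', 'c', 'c', 'c'],
    'C': [100, 200, 200, 300, 300, 300],
})

# keep=False drops every row that occurs more than once
df_no_dupes = df.drop_duplicates(keep=False)

print(df_no_dupes)  # only the single unique first row remains
```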

You can use a combination of Pandas methods to preprocess the data before removing duplicates. For example, you can convert all strings to lowercase, remove spaces, convert dates to a single format, etc.:

# convert all strings to lowercase and remove spaces
df['B'] = df['B'].str.lower().str.replace(' ', '')

# now drop duplicates
df_preprocessed = df.drop_duplicates()

print(df_preprocessed)

Categorical data

get_dummies() converts categorical variables into dummy (indicator) variables; it comes up quite often in ML:

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green']
})

# convert categorical data into indicator variables
dummies = pd.get_dummies(df['Color'])
print(dummies)

As a result, each unique value in the Color column becomes a separate column of indicator variables (0 or 1) marking the presence or absence of that category in each row.

factorize() produces a numeric representation of categorical data by assigning each unique category an integer code:

# get a numeric representation of the categorical data
codes, uniques = pd.factorize(df['Color'])

print(codes)    # array of integer codes
print(uniques)  # array of unique values

String methods (the .str accessor) support operations such as case conversion, substring search and replace, string splitting, and so on:

df_text = pd.DataFrame({
    'Text': ['This is a test.', 'Hello, World!', 'Otus!']
})

# convert all texts to lowercase
df_text['Text_lower'] = df_text['Text'].str.lower()

# remove punctuation
df_text['Text_clean'] = df_text['Text'].str.replace(r'[^\w\s]', '', regex=True)

print(df_text)

Beyond that, str.replace() handles substitutions and str.findall() extracts all matches of a pattern.
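For instance, str.findall() with a digit pattern pulls every number out of each string; the frame below is illustrative:

```python
import pandas as pd

df_text = pd.DataFrame({
    'Text': ['Order 12 shipped', 'Items: 3 and 7', 'No numbers here']
})

# extract every run of digits from each string as a list
df_text['numbers'] = df_text['Text'].str.findall(r'\d+')

print(df_text['numbers'])  # [['12'], ['3', '7'], []]
```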

Converting text data to numeric form is often required for analytics and ML models. For example, after applying factorize() to categorical data, you can convert the numeric codes back into text categories using the uniques array:

# convert codes back into text categories
df['Color_restored'] = codes
df['Color_restored'] = df['Color_restored'].map(lambda x: uniques[x])

print(df)

What cleaning methods do you know?

Ahead of the start of the Data Analyst course, I'd like to invite you to free webinars about preparing data in Pandas, as well as about SQL in the day-to-day work of a beginning data analyst.
