Accelerating exploratory data analysis with the pandas-profiling library

When you start working with a new dataset, the first thing you need to do is understand it. This means, for example, finding out the ranges of values each variable takes, the variables' types, and the number of missing values.

The pandas library gives us many useful tools for performing exploratory data analysis (EDA). Before using them, though, you usually start with more general functions, such as df.describe(). The capabilities of such functions are limited, however, and the initial stages of EDA tend to look very much alike from one dataset to the next.

The author of the material we are publishing today is not a fan of repetitive work. Looking for tools that would let him perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library. Its output is not a set of individual indicators but a fairly detailed HTML report containing most of what you may need to know about the analyzed data before starting to work with it more closely.

Here we will look at the features of the pandas-profiling library using the Titanic dataset as an example.

Exploratory data analysis with pandas

I decided to experiment with pandas-profiling on the Titanic dataset because it contains variables of different types and has missing values. I believe the pandas-profiling library is especially interesting when the data has not yet been cleaned and requires further processing that depends on its particular features. To perform such processing successfully, you need to know where to start and what to look out for, and this is where the features of pandas-profiling come in handy.

First, we import the data and use pandas to get descriptive statistics:

# import required packages
import pandas as pd
import pandas_profiling
import numpy as np

# data import
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# calculating descriptive statistics
df.describe()

After executing this code snippet, you get what is shown in the following figure.



Descriptive statistics obtained using standard pandas tools

Although this output contains a lot of useful information, it does not tell us everything worth knowing about the data under study. For example, we can infer that the DataFrame has 891 rows, but if this needs to be checked, another line of code is required to determine the frame's size. Although these computations are not particularly resource-intensive, repeating them over and over inevitably wastes time that is probably better spent cleaning the data.
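
For reference, such a check takes a single line of standard pandas:

# verify the exact number of rows and columns
print(df.shape)  # (891, 12) for the standard Titanic train.csv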

Exploratory data analysis with pandas-profiling

Now let's do the same with pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the above line of code generates a report with exploratory data analysis indicators. The code shown above displays the report inline, but you can also export the result as an HTML file that can, for example, be shown to someone else.
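
For example, here is a minimal sketch of saving the report to disk using the library's to_file method (its exact signature varies between pandas-profiling versions):

# build the report object and save it as a standalone HTML file
profile = pandas_profiling.ProfileReport(df)
profile.to_file('titanic_report.html')  # the file name is arbitrary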

The first part of the report is an Overview section giving basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what deserves special attention; these warnings can hint at where to focus your data-cleaning efforts.



The Overview section of the report

Exploratory variable analysis

The Overview section is followed by useful information about each variable, including, among other things, small charts describing each variable's distribution.



Information about the numeric variable Age

As the previous example shows, pandas-profiling gives us some useful indicators, such as the number and percentage of missing values, along with the descriptive statistics we have already seen. Since Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
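
You can verify this observation with plain pandas, where a positive skewness value confirms a right-skewed distribution:

# positive skewness indicates a right-skewed distribution
print(df['Age'].skew())

# share of missing values in Age, also reported by pandas-profiling
print(df['Age'].isnull().mean())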

When considering a categorical variable, the output is slightly different from that for a numeric variable.



Information about the categorical variable Sex

Namely, instead of the mean, minimum, and maximum, the pandas-profiling library reports the number of distinct classes. Since Sex is a binary variable, its values fall into two classes.
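
The same class breakdown can be reproduced with standard pandas calls:

# number of distinct classes and their frequencies
print(df['Sex'].nunique())       # 2
print(df['Sex'].value_counts())  # counts for male and female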

If you, like me, enjoy exploring code, you may be wondering how exactly the pandas-profiling library computes these indicators. Finding out is not hard, since the library's code is open and available on GitHub. Not being a big fan of using “black boxes” in my projects, I took a look at the source code. For example, here is what the mechanism for processing numeric variables, the describe_numeric_1d function, looks like:

def describe_numeric_1d(series, **kwargs):
    """Describe a numeric (`TYPE_NUM`) variable (a Series).

    Also create histograms (mini and full) of its distribution.

    Parameters
    ----------
    series : Series
        The variable to describe.

    Returns
    -------
    Series
        The stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # Drop missing values once to avoid recomputing
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this code snippet may seem rather large and complex, it is actually easy to understand. The library's source code contains a function that determines the types of variables. If the library encounters a numeric variable, the function above computes the indicators we looked at. It uses standard pandas operations on Series objects, such as series.mean(), and stores the results of the calculations in the stats dictionary. The histograms are generated with an adapted version of matplotlib.pyplot.hist; the adaptation allows the function to work with different types of datasets.
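
As a rough illustration of the same pattern, here is a simplified sketch (not the library's code) that collects a few of the same statistics for the Age column into a dictionary:

# a simplified imitation of the library's stats-dict approach
age = df['Age'].dropna()
stats = {
    'mean': age.mean(),
    'std': age.std(),
    'iqr': age.quantile(0.75) - age.quantile(0.25),
    'skewness': age.skew(),
}
print(pd.Series(stats, name='Age'))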

Correlation indicators and sample data

After the variable analysis, pandas-profiling shows Pearson and Spearman correlation matrices in the Correlations section.



Pearson Correlation Matrix

If necessary, you can set the threshold values used in the correlation calculation directly in the line of code that generates the report. In this way you can specify what strength of correlation is considered important for your analysis.
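
For example, here is a sketch assuming your pandas-profiling version accepts the correlation_threshold keyword (1.x releases do; newer releases configure correlations differently):

# flag variable pairs whose correlation exceeds 0.8 instead of the default;
# keyword availability depends on the library version
pandas_profiling.ProfileReport(df, correlation_threshold=0.8)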

Finally, in the Sample section, the pandas-profiling report displays, as an example, a piece of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, since the first few observations may form a sample that does not reflect the characteristics of the whole dataset.



The section containing a sample of the data under study

For that reason, I do not recommend paying attention to this last section. It is better to use the command df.sample(5) instead, which randomly selects 5 observations from the dataset.
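
For example:

# randomly select 5 observations instead of relying on the first rows;
# random_state is optional and only makes the draw reproducible
df.sample(5, random_state=42)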

Results

To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need to quickly form a rough overall picture of the data or send someone an exploratory data analysis report. At the same time, real work with the data, taking its particular features into account, is still done manually, just as it is without pandas-profiling.

If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, rendered with nbviewer. The corresponding code can be found in this GitHub repository.

Dear readers! How do you start analyzing new datasets?
