Useful tricks and best practices from Kaggle

This article covers techniques you would otherwise only pick up after countless hours of study and practice.

About this project

Kaggle is a great place. It’s a goldmine for data scientists and machine learning engineers. There are not many platforms where you can find high-quality, efficient, reproducible, expert-vetted code examples all in one place.

Since its launch, it has hosted over 164 competitions. These competitions attract experts and professionals from all over the world to the platform. As a result, every competition has a lot of high quality notebooks and scripts, as well as a huge amount of open source datasets that Kaggle provides.

At the beginning of my data science journey, I came to Kaggle to find datasets and hone my skills. Whenever I tried to work through other people’s examples and code snippets, I was overwhelmed by their complexity and immediately lost motivation.

But now I find myself spending a lot of time reading other people’s notebooks and submitting competition entries. Sometimes there are things to spend the whole weekend on. And sometimes I find simple yet incredibly effective techniques and best practices that can only be learned by observing other professionals.

Be that as it may, my OCD practically forces me to share every bit of data science knowledge I have. So I present to you the first issue of my weekly series, Kaggle Tricks and Best Practices. Throughout the series, I’ll write about anything that can be useful during a typical data science workflow: code snippets for common libraries, best practices followed by leading Kaggle experts, and so on – everything I’ve learned over the past week. Enjoy!

1. Displaying only the lower half of the correlation matrix

A good correlation matrix can say a lot about your dataset. It is usually built to see the pairwise correlation between your features and the target variable. Based on it, you can decide which features to keep and feed into your machine learning algorithm.

But today datasets contain so many features that dealing with a correlation matrix like this can be overwhelming:

As informative as it is, a plot like this is too much to take in. Correlation matrices are symmetric about the main diagonal, so they contain duplicate information. The diagonal itself is also useless. Let’s see how to plot only the useful half:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

houses = pd.read_csv('data/melb_data.csv')

# Calculate pairwise correlation
# (on newer pandas versions you may need houses.corr(numeric_only=True))
matrix = houses.corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(matrix, dtype=bool))

# Create a custom diverging palette
cmap = sns.diverging_palette(250, 15, s=75, l=40,
                             n=9, center="light", as_cmap=True)

plt.figure(figsize=(16, 12))

sns.heatmap(matrix, mask=mask, center=0, annot=True,
            fmt=".2f", square=True, cmap=cmap)

plt.show()

The resulting plot is much easier to interpret and free of redundant, distracting data. First, we build the correlation matrix with the DataFrame.corr method. Then we call np.ones_like with dtype set to bool to create a matrix of True values with the same shape as our DataFrame:

>>> np.ones_like(matrix, dtype=bool)[:5]

array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True]])

Then we pass it to NumPy’s np.triu function, which returns a two-dimensional boolean mask with False values for the lower triangle of the matrix. Seaborn’s heatmap hides every cell where the mask is True, so passing this mask plots only the lower triangle:

sns.heatmap(matrix, mask=mask, center=0, annot=True,
            fmt=".2f", square=True, cmap=cmap)

I also made some additions to make the graph a little better, like adding my own color palette.

2. Including missing values in value_counts

A handy little trick with value_counts is that you can see the proportion of missing values in any column by setting dropna to False (and normalize to True to get fractions instead of counts):

>>> houses.CouncilArea.value_counts(dropna=False, normalize=True).head()

NaN              0.100810
Moreland         0.085641
Boroondara       0.085420
Moonee Valley    0.073417
Darebin          0.068778
Name: CouncilArea, dtype: float64

By determining the proportion of values that are missing, you can decide whether to drop or impute them. However, if you want to see the proportion of missing values in every column, value_counts is not the best option. Instead, you can do:

>>> missing_props = houses.isna().sum() / len(houses)
>>> missing_props[missing_props > 0].sort_values(ascending=False)

BuildingArea    0.474963
YearBuilt       0.395803
CouncilArea     0.100810
Car             0.004566
dtype: float64

First, find the proportions by dividing the number of missing values by the length of the DataFrame. Then filter out the columns at 0%, i.e. keep only the columns that actually have missing values.
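
Building on that, here is a minimal sketch of how you might act on these proportions, reusing houses and missing_props from the snippet above; the 0.4 threshold is an arbitrary choice for illustration:

>>> threshold = 0.4
>>> cols_to_drop = missing_props[missing_props > threshold].index
>>> houses = houses.drop(columns=cols_to_drop)

Columns missing more than 40% of their values (here, BuildingArea) are dropped; everything else is kept for imputation.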

3. Using Pandas DataFrame Styler

Many of us never realize the enormous untapped potential of pandas. An underused and often overlooked feature of pandas is its ability to style DataFrames. Using the .style attribute of a pandas DataFrame, you can conditionally format and style it. As a first example, let’s see how you can change the background color based on the value of each cell:

>>> diamonds = sns.load_dataset('diamonds')

>>> pd.crosstab(diamonds.cut, diamonds.clarity).style.background_gradient(cmap='rocket_r')

It is practically a heatmap without using the Seaborn heatmap function. Here we count every combination of diamond cut and clarity using pd.crosstab. With .style.background_gradient and a color palette, you can easily see which combinations are the most common. From the styled DataFrame alone we can see that most diamonds have an ideal cut, and the most common combination is ideal cut with VS2 clarity.

We can even go further by calculating the average price of each cut and clarity combination in a crosstab:

>>> pd.crosstab(diamonds.cut, diamonds.clarity, aggfunc=np.mean,
...             values=diamonds.price).style.background_gradient(cmap='flare')

This time we are aggregating diamond prices for each cut and clarity combination. From the styled DataFrame we see that the most expensive diamonds have VS2 clarity or a premium cut. But it would be better if we displayed the aggregated prices rounded off. We can change that with .style as well:

>>> agg_prices = pd.crosstab(diamonds.cut, diamonds.clarity, aggfunc=np.mean,
...                          values=diamonds.price).style.background_gradient(cmap='flare')

>>> agg_prices.format('{:.2f}')

By passing the format string {:.2f} to the .format method, we specify a precision of two decimal places.

With .style, the only limit is your imagination. With a basic knowledge of CSS, you can create your own styling functions to suit your needs. Check out the official pandas guide for more information.
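
As a minimal sketch of such a custom styling function (the 4000 threshold and the colors are arbitrary, purely for illustration), you can apply an elementwise function that returns CSS strings:

import numpy as np
import pandas as pd
import seaborn as sns

diamonds = sns.load_dataset('diamonds')

# A hypothetical elementwise styling function: paint "expensive" cells red
def highlight_expensive(value):
    color = 'tomato' if value > 4000 else 'lightgreen'
    return f'background-color: {color}'

# On pandas >= 2.1, Styler.applymap has been renamed to Styler.map
pd.crosstab(diamonds.cut, diamonds.clarity,
            aggfunc=np.mean, values=diamonds.price).style.applymap(highlight_expensive)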

4. Setting up global graph configurations using Matplotlib

When doing EDA (Exploratory Data Analysis), you will find yourself keeping some Matplotlib settings the same for all of your plots. For example, you might want to apply a custom palette to all graphs, use larger fonts for labels, change the position of the legend, use fixed sizes for shapes, and more.

Specifying each custom graph change can be a rather boring, repetitive, and time-consuming task. Fortunately, you can use rcParams from Matplotlib to set global configurations for your plots:

from matplotlib import rcParams

rcParams is just a plain old Python dictionary containing Matplotlib’s defaults:
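
For example, here is a quick way to peek at a couple of those defaults (the exact values depend on your Matplotlib version, so treat the comments as illustrative):

from matplotlib import rcParams

print(rcParams['figure.figsize'])   # default figure size in inches, e.g. [6.4, 4.8]
print(rcParams['figure.dpi'])       # default resolution, e.g. 100.0
print(len(rcParams))                # several hundred tweakable settings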

You can customize almost every possible aspect of each individual plot. What I usually do, and have seen others do, is set the figures to a fixed size, adjust the label font sizes, and make a few other changes:

# Remove top and right spines
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

# Set fixed figure size
rcParams['figure.figsize'] = [12, 9]

# Set dots per inch to 300, very high quality images
rcParams['figure.dpi'] = 300

# Enable autolayout
rcParams['figure.autolayout'] = True

# Set global font size
rcParams['font.size'] = 16

# Fontsize of ticklabels
rcParams['xtick.labelsize'] = 10
rcParams['ytick.labelsize'] = 10

You can avoid a lot of repetitive work by setting all of this right after importing Matplotlib. You can see all of the other available settings by calling rcParams.keys().
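
Since rcParams behaves like an ordinary dictionary, a simple key filter is enough to find the settings you care about (the 'font' substring below is just an example):

from matplotlib import rcParams

# List every rcParams key related to fonts
font_keys = [key for key in rcParams if 'font' in key]
print(font_keys)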

5. Setting up global Pandas configurations

Just like Matplotlib, pandas has global configurations that you can play with. Of course, most of them are related to display options. The official user guide says that the entire pandas option system can be controlled with 5 functions available directly from the pandas namespace:

  • get_option() / set_option() – get / set the value of one parameter.

  • reset_option() – reset one or several parameters to their default values.

  • describe_option() – display the description of one or more parameters (a quick example follows this list).

  • option_context() – execute a block of code with a set of options that revert to their previous values afterwards.
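
Of these, describe_option is the only one not demonstrated below, so here is a quick sketch; it prints the option’s documentation along with its current and default values:

>>> import pandas as pd
>>> pd.describe_option('display.max_rows')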

All options have case-insensitive names and are matched under the hood with a regular expression. You can use pd.get_option to find out the current default behavior and change it to your liking with set_option:

>>> pd.get_option('display.max_columns')
20

For example, the above option controls the number of columns displayed when a DataFrame has many columns. Most datasets today contain more than 20 variables, and whenever you call .head or another display function, pandas inserts an annoying ellipsis to truncate the result:

>>> houses.head()

I would rather see all columns by scrolling through them. Let’s change this behavior:

>>> pd.set_option('display.max_columns', None)

Above, I remove the constraint entirely:

>>> houses.head()

You can revert to the default setting with:

pd.reset_option('display.max_columns')

As with columns, you can adjust the default number of rows displayed. If you set display.max_rows to 5, you don’t have to call .head():

>>> pd.set_option('display.max_rows', 5)
>>> houses

Plotly is becoming very popular these days, so it would be nice to set it as the default plotting backend for pandas. By doing so, you will get interactive charts whenever you call .plot on a pandas DataFrame:

pd.set_option('plotting.backend', 'plotly')

Note that you need to have plotly installed for this.
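
As a minimal sketch of what this looks like in practice (assuming plotly is installed and reusing the Melbourne housing DataFrame from earlier, which has a Price column):

>>> pd.set_option('plotting.backend', 'plotly')

>>> # .plot now returns an interactive plotly figure instead of a Matplotlib Axes
>>> fig = houses['Price'].plot(kind='hist')
>>> fig.show()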

If you don’t want to mess up the default behavior, or just want to change certain settings temporarily, you can use pd.option_context as a context manager. The temporary change is only applied to the block of code under the statement. For example, when numbers get large, pandas has the annoying habit of switching to scientific notation. You can temporarily avoid this with:

>>> df = pd.DataFrame(np.random.randn(5, 5))
>>> pd.reset_option('display.max_rows')
>>> with pd.option_context('float_format', '{:f}'.format):
...     print(df.describe())

You can see the full list of available options in the official pandas user guide.

