10 things you might not know about scikit-learn

In this translated article, Rebecca Vickery shares some interesting scikit-learn features. Originally posted on towardsdatascience.com.

Scikit-learn is one of the most widely used Python machine learning libraries. Its simple, standard interface allows for data preprocessing, training, optimization, and model evaluation.

The project, started by David Cournapeau as part of the Google Summer of Code program, saw its first public release in 2010. Since its inception, the library has evolved into a rich infrastructure for building machine learning models, and new features keep expanding the range of tasks it handles and its ease of use. In this article, I will introduce ten of the most interesting features you might not know about.

1. Built-in datasets

The scikit-learn API includes built-in datasets, both synthetic and real, that you can load with a single line of code. They are extremely useful when you are just learning or want to quickly test something.

There are also dedicated generators for creating synthetic data yourself: make_regression() for regression problems, make_blobs() for clustering, and make_classification() for classification (a sketch of the latter two follows the code below).

Each method returns the data already split into X (features) and y (target variable), so it can be used directly to train a model.

# Toy regression data set loading
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

# Synthetic regression data set loading
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, noise=100, random_state=0)
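For clustering and classification, the other two generators follow the same pattern; a minimal sketch:

# Synthetic classification and clustering data sets
from sklearn.datasets import make_classification, make_blobs

# Binary classification: 1,000 samples with 20 features
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Clustering: 1,000 samples grouped around 3 centers; y holds the cluster labels
X, y = make_blobs(n_samples=1000, centers=3, random_state=0)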

2. Access to third-party public datasets

If you want to access a variety of public datasets directly through scikit-learn, check out the handy function that imports data straight from openml.org. The site hosts over 21,000 datasets that can be used in machine learning projects.

from sklearn.datasets import fetch_openml

X,y = fetch_openml("wine", version=1, as_frame=True, return_X_y=True)

3. Ready-made baseline models

When building a machine learning model for a project, it is wise to create a baseline model first: a dummy model that, for example, always predicts the most common class. This gives you a benchmark against which to compare your more complex model, and a sanity check that the real model performs better than random guessing.

The scikit-learn library provides DummyClassifier() for classification problems and DummyRegressor() for regression (see the sketch after the code below).

from sklearn.dummy import DummyClassifier

# Fit the model on the wine dataset and return the model score
dummy_clf = DummyClassifier(strategy="most_frequent", random_state=0)

dummy_clf.fit(X, y)

dummy_clf.score(X, y)
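For regression, DummyRegressor() is used the same way. A minimal sketch on the built-in diabetes dataset; the "mean" strategy always predicts the mean of the training target:

from sklearn.dummy import DummyRegressor
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Always predicts the mean of y seen during fit;
# its R^2 score on the training data is 0.0 by definition
dummy_reg = DummyRegressor(strategy="mean")
dummy_reg.fit(X, y)
dummy_reg.score(X, y)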

4. A built-in visualization API

The scikit-learn library has a built-in plotting API that lets you visualize how a model performs without importing separate plotting libraries. It offers partial dependence plots, confusion matrices, ROC curves, and precision-recall curves.

import matplotlib.pyplot as plt 
from sklearn import metrics, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X,y = load_breast_cancer(return_X_y = True)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# Note: plot_roc_curve was removed in scikit-learn 1.2;
# use metrics.RocCurveDisplay.from_estimator(clf, X_test, y_test) instead
metrics.plot_roc_curve(clf, X_test, y_test)
plt.show()


[Figure: ROC curve drawn by the built-in plotting API (author's illustration)]
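The confusion matrix and precision-recall plots follow the same one-liner pattern. In scikit-learn 1.2 and later the plot_* helpers were replaced by Display classes; a sketch of that newer API, reusing clf and the test split from the snippet above:

from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay

# Pass a fitted estimator and held-out data, just like plot_roc_curve
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
plt.show()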

5. Built-in methods of feature selection

One of the ways to improve the quality of the model is to use only the most useful features in training or to remove the least informative ones. This process is called feature selection.

Scikit-learn offers a number of feature selection methods; one of them is SelectPercentile(), which keeps the given percentile of the most informative features, ranked by a statistical test you specify.

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_wine(return_X_y=True)

# Keep the top 60% of features as ranked by the chi-squared test
X_transformed = SelectPercentile(chi2, percentile=60).fit_transform(X, y)
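To see which columns survived the cut, the fitted selector exposes get_support(); a short sketch:

# Fit the selector separately so we can inspect it
selector = SelectPercentile(chi2, percentile=60).fit(X, y)

# Indices of the kept features (pass indices=False for a boolean mask)
print(selector.get_support(indices=True))

X_transformed = selector.transform(X)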

6. Pipelines for chaining the stages of the machine learning process

Besides offering a huge list of machine learning algorithms, scikit-learn provides a number of functions for preprocessing and transforming data. To make the machine learning process reproducible and easy to reuse, scikit-learn offers pipelines that combine the various preprocessing steps and model training into a single workflow.

A pipeline stores all stages of the workflow as a single object that exposes the fit and predict methods (see the prediction sketch after the code below). When you call fit on a pipeline object, the preprocessing and model training steps run automatically.

from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X,y = load_breast_cancer(return_X_y = True)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

# Chain together scaling the variables with the model
pipe = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestClassifier())])
pipe.fit(X_train, y_train)

pipe.score(X_test, y_test)
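The fitted pipeline serves predictions the same way; calling predict applies the stored scaler to the new data before the model sees it:

# Scaling happens automatically before prediction
y_pred = pipe.predict(X_test)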

7. ColumnTransformer to vary preprocessing methods for different features

Many datasets contain features of different types that require different preprocessing steps. For example, you may face a mix of categorical and numeric data, where you want to scale the numeric columns and convert the categorical features to numbers using one-hot encoding.

For this, scikit-learn provides ColumnTransformer, which lets you assign the most appropriate preprocessing method to specific columns, selected by index or by column name.

from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load auto93 data set which contains both categorical and numeric features
X,y = fetch_openml("auto93", version=1, as_frame=True, return_X_y=True)

# Create lists of numeric and categorical features  
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

# Create a numeric and categorical transformer to perform preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value="missing")),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Use the ColumnTransformer to apply to the correct features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append the regressor to the preprocessor
lr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

# Fit the complete pipeline
lr.fit(X_train, y_train)
print("model score: %.3f" % lr.score(X_test, y_test))

8. Easily get an HTML diagram of your pipeline

Pipelines often get quite complex, especially when working with real data, so it is very convenient that scikit-learn can render an HTML diagram of your pipeline's steps.

from sklearn import set_config

set_config(display='diagram')   

# In a notebook, displaying the pipeline object now renders the diagram
lr


[Figure: HTML diagram of the pipeline (author's illustration)]
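If you need the diagram outside a notebook, estimator_html_repr() returns the same HTML as a string, which you can write to a standalone file:

from sklearn.utils import estimator_html_repr

# Save the interactive diagram to pipeline.html
with open('pipeline.html', 'w') as f:
    f.write(estimator_html_repr(lr))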

9. Plotting function for visualizing decision trees

The plot_tree() function draws the decision steps of a fitted decision tree model.

import matplotlib.pyplot as plt 
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_breast_cancer

X,y = load_breast_cancer(return_X_y = True)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

plot_tree(clf, filled=True)
plt.show()
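If you need a plain-text version of the same tree, for example for logging, scikit-learn also provides export_text():

from sklearn.tree import export_text

# Text rendering of the fitted tree's decision rules
print(export_text(clf))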

10. Many third-party libraries that extend scikit-learn functions

There are many third-party libraries that are compatible with scikit-learn and extend its functionality.

Examples include the Category Encoders library, which provides a wider choice of encoding methods for categorical features, and ELI5, for more detailed model interpretation (a short ELI5 sketch follows the code below).

Both libraries can also be used directly inside a scikit-learn pipeline.

# Pipeline using Weight of Evidence transformer from category encoders

from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

# Load auto93 data set which contains both categorical and numeric features
X,y = fetch_openml("auto93", version=1, as_frame=True, return_X_y=True)

# Create lists of numeric and categorical features  
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

# Create a numeric and categorical transformer to perform preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value="missing")),
        # Weight of Evidence encoding; note that WOEEncoder expects a binary target
        ('woe', ce.woe.WOEEncoder())])

# Use the ColumnTransformer to apply to the correct features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append the regressor to the preprocessor
lr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

# Fit the complete pipeline
lr.fit(X_train, y_train)
print("model score: %.3f" % lr.score(X_test, y_test))
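ELI5 follows a similarly simple pattern. A minimal sketch, assuming the eli5 package is installed; explain_weights() summarizes the feature importances of a fitted model:

import eli5
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Outside a notebook, render the Explanation object as text
print(eli5.format_as_text(eli5.explain_weights(clf)))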

Thanks for your attention!
