Building a pipeline in scikit-learn – step by step guide

In this article, I will show how to build a pipeline with the built-in tools of the scikit-learn library and reduce the amount of code needed to transform data. The article is intended for beginners who are just starting to learn Data Science but already know the basic concepts.

This article is about scikit-learn, one of the most popular Python libraries for classical machine learning. Besides a large collection of machine learning algorithms, scikit-learn lets you build pipelines: chains of steps through which data is passed. A pipeline works like an assembly line, where each link is either a transformer or a predictor model (usually the last link).

An important stage of data processing is transformation. Without a pipeline, the data has to go through separate tools: encoders (convert categorical features into numeric vectors), imputers (fill gaps in the data), and scalers (bring features to the same scale). Each tool has to be fitted on the training set, used to transform it, and then applied separately to the test set. The result is a lot of repetitive code.
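To make the repetition concrete, here is a rough sketch of the manual approach (the X_train and X_test names are hypothetical):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Each tool is fitted on the training set and then applied to both sets separately
imputer = SimpleImputer(strategy='median')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train_imp)
X_test_sc = scaler.transform(X_test_imp)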

A pipeline collects all these tools into a single chain without the repetitive code. It is enough to fit the pipeline on the training set once and then run all the necessary transformations with one command: it takes features as input, transforms them, and returns the result.
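With a pipeline, the same sketch collapses to a couple of calls (continuing the hypothetical names from above; the real pipeline is assembled step by step below):

from sklearn.pipeline import Pipeline

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler())])
X_train_sc = pipe.fit_transform(X_train)   # fit every step and transform the training set
X_test_sc = pipe.transform(X_test)         # apply the already fitted steps to the test set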

Below, we will assemble a pipeline from individual tools, train it on a training data set, and make a prediction.

Preparation

  1. The first step is to install the minimum required version of the library, 1.1. If you run the code in JupyterHub at Yandex Practicum, the preinstalled version of sklearn is 0.24, so you need to install a newer one to avoid compatibility issues.

!pip install scikit-learn==1.1
  2. Then we import pandas to load the data, as well as all the sklearn tools we will need in this example: scalers, encoders, and imputers. The list also contains auxiliary tools for splitting samples, calculating metrics, and other operations.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
  3. Make sure that the version of the library is not lower than the required one. I show this only as an example – in real development this step is not needed.

import sklearn
sklearn.__version__
  4. Now we turn off the pandas warning about the upcoming deprecation of one of its methods, so that it does not distract from development, and limit the output to eight columns. In this form the data will fit on one screen and will be more convenient to work with.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns',  8)
  5. For this example, let’s take a popular free dataset with information about real estate prices in California, which is often mentioned in training articles and books on Data Science. We will download it from the website of Hugging Face, the developer of the Transformers library (which contains neural networks and other tools for working with text data).

data = pd.read_csv('https://huggingface.co/datasets/leostelon/california-housing/raw/main/housing.csv')
data.info()
display(data.describe())
  6. The next step is to choose the target feature: median_house_value, the median house price for the district. The numerical features will be the median age of the houses, the total number of rooms and bedrooms, the population, the number of households, and the median income. The categorical feature is the position relative to the ocean.

Then we split the dataset into training and test samples:

target = data['median_house_value']
features = data.drop(['median_house_value'], axis=1)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=44)

This completes the preparatory part – let’s move on to creating a pipeline.

Pipeline example

For clarity, we will create two separate pipelines – for numerical and for categorical features – and then combine them into one.

  1. Let’s start with the numerical features. The task of the pipe_num pipeline at this stage is to fill in gaps, if there are any, and scale the features.

For this we use SimpleImputer, which fills the gaps with the median value of the feature.

StandardScaler standardizes the data – it subtracts the mean and divides by the standard deviation computed on the training set (a quick check of this is sketched after the code below).

In the last line, we create a Pipeline object. We pass it a list in which each element is a tuple of two values: the name of the step, which we choose ourselves so that we can refer to it later, and the transformer itself.

simple_imputer = SimpleImputer(strategy='median')
std_scaler = StandardScaler()
pipe_num = Pipeline([('imputer', simple_imputer), ('scaler', std_scaler)])
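As a quick sanity check of what StandardScaler does, here is a small sketch with made-up numbers (not part of the article’s pipeline):

import numpy as np

x = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()     # subtract the mean, divide by the standard deviation
print(np.allclose(scaled, manual))    # True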
  2. Next, we call fit_transform on pipe_num – the same thing happens as if we had applied SimpleImputer and StandardScaler separately. All features are used here except the categorical one.

res_num = pipe_num.fit_transform(features_train.drop(['ocean_proximity'], axis=1))

The transformer above returns the data as a NumPy array without column names. To bring the column names back, for convenience, you can use the pipeline object's get_feature_names_out method.

In this case, the names could also be taken directly from the dataframe, since this pipeline keeps the original set of features. But if the set of columns changes after the transformation, for example because of an encoder, that approach will not work. Using get_feature_names_out is the more universal approach, so we will use it for all pipelines.
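With scikit-learn 1.1 (the version we installed), the pipeline object exposes this method directly; assuming the pipeline was fitted on a DataFrame, so the input column names were remembered, the call is simply:

pipe_num.get_feature_names_out()   # names of the columns after all pipeline steps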

  3. At the end, for verification, we display the result of transforming the numerical features:

res_df_num = pd.DataFrame(res_num, 
columns=pipe_num['scaler'].get_feature_names_out(features.drop(['ocean_proximity'], axis=1).columns))
res_df_num
  4. This is a helper line in which we check that there are no gaps left in the data. It will not go into the final pipeline and is needed only for demonstration.

res_df_num.info()
  5. Now let’s create a pipeline for the categorical features. SimpleImputer replaces gaps with the value 'unknown', and OneHotEncoder encodes categorical features into numeric values that models can understand. The handle_unknown='ignore' argument is needed to avoid errors when the test data contains categories that did not occur in the training set (a small standalone demonstration follows the code below). At the end, we create a pipeline from the imputer and the encoder and check that it works.

Although there are no gaps in the categorical feature of this dataset, they can appear in new data when the model is used in practice. In that case, you need a solution that keeps the code from failing – this is what the imputer is for.

s_imputer = SimpleImputer(strategy='constant', fill_value="unknown")
ohe_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
pipe_cat = Pipeline([('imputer', s_imputer), ('encoder', ohe_encoder)])

res_cat = pipe_cat.fit_transform(features_train[['ocean_proximity']])

res_df_cat = pd.DataFrame(res_cat, columns=pipe_cat.get_feature_names_out())
res_df_cat
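To see why handle_unknown='ignore' matters, here is a small standalone sketch (the category values are made up and the encoder is fitted only on two of them):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND']}))
# A category the encoder has never seen becomes a row of zeros instead of raising an error
print(enc.transform(pd.DataFrame({'ocean_proximity': ['ISLAND']})))   # [[0. 0.]]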
  6. Now we combine the features into one pipeline.

There is a nuance here: you cannot simply put the categorical and numerical features one after the other, because they have to be processed differently. scikit-learn has a tool for this, ColumnTransformer, which works as a pipeline builder. With it, you can parallelize the pipeline: one sub-pipeline is applied to one part of the columns, and another to the rest.

When creating a ColumnTransformer object, a list of tuples is passed. In each tuple, in addition to the name and the transformer, as when creating a Pipeline object, you also specify which columns it applies to.

Note that in addition to elementary transformers (SimpleImputer, StandardScaler, etc.), ColumnTransformer can also accept Pipeline objects as transformers.

Using a list comprehension, we create lists of the numerical and the categorical columns. In our example the dataset is small and the lists could be set manually, but in practice there can be many more columns, so it is better to automate the process. The code below checks the column type: if it is object, the column is treated as categorical, otherwise as numerical. (The same lists can also be built with pandas – see the sketch after the code below.)

col_transformer = ColumnTransformer([('num_preproc', pipe_num, [x for x in features.columns if features[x].dtype!='object']),
                                     ('cat_preproc', pipe_cat, [x for x in features.columns if features[x].dtype=='object'])])
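The pandas alternative reads a little shorter; a sketch under the same assumption that object columns are the categorical ones:

num_cols = features.select_dtypes(exclude='object').columns.tolist()  # numerical columns
cat_cols = features.select_dtypes(include='object').columns.tolist()  # categorical columns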
  7. In this line, we get the result of running the column transformer:

res = col_transformer.fit_transform(features_train)
  8. Let’s convert the result to a dataframe. At the same time, we will strip extra information from the column names: the get_feature_names_out method of the ColumnTransformer object automatically prefixes each column name with the name of the transformer that produced it. Because of this, fewer columns fit on the page and the output becomes inconvenient to read. (An alternative that avoids the prefixes altogether is sketched after the code below.)

res_df = pd.DataFrame(res, columns = [x.split('__')[-1] for x in col_transformer.get_feature_names_out()])
res_df
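As an aside, ColumnTransformer also accepts a verbose_feature_names_out parameter (added in scikit-learn 1.0, so available in the 1.1 we installed) that removes these prefixes at the source. A sketch, reusing the num_cols and cat_cols lists from the pandas example above:

col_transformer = ColumnTransformer(
    [('num_preproc', pipe_num, num_cols),
     ('cat_preproc', pipe_cat, cat_cols)],
    verbose_feature_names_out=False)
# After fitting, get_feature_names_out() returns names without the 'num_preproc__' prefix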
  9. We assemble the final pipeline from its parts using the ColumnTransformer. It consists of the preprocessing step and the Ridge model from the scikit-learn library – a linear regression with regularization. This is one of the simpler models: it predicts the target feature as a sum of the other features multiplied by coefficients (after training, we will take a quick look at those coefficients).

model = Ridge()

final_pipe = Pipeline([('preproc', col_transformer),
                       ('model', model)])
  10. We train the final pipeline on the training set, make a prediction, and compute the error (mean_squared_error with squared=False returns the root mean squared error).

final_pipe.fit(features_train, target_train)
preds = final_pipe.predict(features_test)
mean_squared_error(target_test, preds, squared=False)
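Because the whole chain is one object, after training you can reach its parts by step name; for example, an illustrative look at the Ridge coefficients mentioned in step 9:

print(final_pipe['model'].coef_)        # one weight per transformed feature
print(final_pipe['model'].intercept_)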

If you put the entire pipeline together and exclude the demo code fragments, it looks like this:

simple_imputer = SimpleImputer(strategy='median')
std_scaler = StandardScaler()

pipe_num = Pipeline([('imputer', simple_imputer), ('scaler', std_scaler)])

s_imputer = SimpleImputer(strategy='constant', fill_value="unknown")
ohe_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
pipe_cat = Pipeline([('imputer', s_imputer), ('encoder', ohe_encoder)])

col_transformer = ColumnTransformer([('num_preproc', pipe_num, [x for x in features.columns if features[x].dtype!='object']),
                                     ('cat_preproc', pipe_cat, [x for x in features.columns if features[x].dtype=='object'])])

model = Ridge()

final_pipe = Pipeline([('preproc', col_transformer),
                       ('model', model)])

final_pipe.fit(features_train, target_train)
preds = final_pipe.predict(features_test)

As a result, all the code for preprocessing, training and prediction takes literally half a page.

In the following articles, we will explore the intricacies of using pipelines with scikit-learn’s cross-validation tools and show how to build more complex custom pipelines and use them in your pet projects.

You can learn how to work with datasets, pipelines, Jupyter Notebook, and many other tools not mentioned in this article in the “Data Scientist” course from Yandex Practicum. Depending on your schedule and your usual pace of work, you can complete the program in 5, 8, or 16 months.
