Building a pipeline in scikit-learn: a step-by-step guide
This article covers scikit-learn, one of the most popular Python libraries for classical machine learning. In addition to offering a large number of machine learning algorithms, scikit-learn lets you build pipelines: chains of steps through which data is passed. A pipeline works like a conveyor belt, where each link acts as a transformer or, usually as the last link, a predictive model.
An important stage of data processing is transforming the data. Without a pipeline, the data must go through separate transformers: encoders (convert categorical features to numeric vectors), imputers (fill gaps in the data), and scalers (bring features to the same scale). Each tool has to be fitted on the training set, applied to it, and then applied separately to the test set. The result is a lot of repetitive code.
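As a rough illustration of that repetition, here is a minimal sketch of the manual approach. The two tiny arrays are made up for the example and are not part of the housing dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: one gap in each set.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[np.nan], [3.0]])

# Each tool is fitted on the training set, then applied to both sets.
imputer = SimpleImputer(strategy='median')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)
```

Every additional tool adds another fit_transform / transform pair, and the intermediate variables have to be threaded through by hand.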
A pipeline gathers all these tools into one object and removes the repetitive code. It is enough to fit the pipeline on the training sample and then apply it for all the necessary transformations with a single command. It takes features as input, transforms them, and produces the result.
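A minimal sketch of this idea, assuming two made-up arrays in place of real data: the imputer and the scaler are wrapped in one Pipeline, which is fitted once and then reused.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: one gap in each set.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test = np.array([[np.nan], [3.0]])

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler())])

X_train_scaled = pipe.fit_transform(X_train)  # fit every step on the training set
X_test_scaled = pipe.transform(X_test)        # reuse the fitted steps on new data
```

The statistics (median, mean, standard deviation) come from the training set only, so nothing from the test set leaks into the transformation.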
Below, we will assemble a pipeline from individual tools, fit it on a training data set, and make a prediction.
Preparation
The first step is to install the minimum required version of the library, 1.1. If you run the code in the Yandex Practicum JupyterHub, the default version of sklearn there is 0.24, so you need to install a later one to avoid compatibility issues.
!pip install scikit-learn==1.1
We then import pandas to load the data, along with all the sklearn tools needed in this example: scalers, encoders, and imputers. The list also includes auxiliary tools for splitting the data, calculating metrics, and other operations.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
Make sure that the installed version of the library is not lower than the required one. This check is shown only as an example; in real development it is not needed.
import sklearn
sklearn.__version__
Now we turn off the pandas warning about the future deprecation of one of the methods so that it does not distract from development, and limit the data output to eight columns. In this form, the tables fit on one screen and are more convenient to work with.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 8)
For this example, let’s take a popular free dataset with information about the cost of real estate in California, which is often mentioned in training articles and books on Data Science. Let’s download it from the website of Hugging Face, the developer of the Transformers library (which contains neural networks and other tools for working with text data).
data = pd.read_csv('https://huggingface.co/datasets/leostelon/california-housing/raw/main/housing.csv')
data.info()
display(data.describe())
The next step is to choose the target feature. Here it is median_house_value, the median house price for the district. The numerical features are the median age of the houses, the total number of rooms and bedrooms in them, the population, the number of households, and the median income per family. The categorical feature is the position relative to the ocean.
Then we split the dataset into training and test samples:
target = data['median_house_value']
features = data.drop(['median_house_value'], axis=1)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=44)
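To see what the split does, here is a small sketch on a made-up ten-row dataframe (the column names x and y are hypothetical). With test_size=0.2, two rows go to the test sample, and a fixed random_state makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe, invented for illustration.
toy = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

X_train, X_test, y_train, y_test = train_test_split(
    toy[['x']], toy['y'], test_size=0.2, random_state=44)

# Repeating the call with the same random_state gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    toy[['x']], toy['y'], test_size=0.2, random_state=44)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

Features and target are split together, so the row indices of X_train and y_train stay aligned.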
This completes the preparatory part – let’s move on to creating a pipeline.
Pipeline example
For clarity, we will create two separate pipelines – for numerical and categorical features, and then collect them into one.
Let's start with the numerical features. The task of the pipe_num pipeline at this stage is to fill in any gaps and scale the features.
For this we use SimpleImputer, which fills in the gaps with the median value of the feature, and StandardScaler, which standardizes the data: it subtracts the mean and divides by the standard deviation, both computed on the training sample.
In the last line, we create an object of the Pipeline class. We pass it a list in which each element is a tuple of two values: a name, which we choose ourselves so that we can refer to the step later, and the transformer itself.
simple_imputer = SimpleImputer(strategy='median')
std_scaler = StandardScaler()
pipe_num = Pipeline([('imputer', simple_imputer), ('scaler', std_scaler)])
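The step names are what make the fitted parts of a pipeline reachable. The sketch below, on a made-up one-column array, retrieves both steps by name and uses the scaler's stored statistics to reproduce the standardization by hand: subtract the mean and divide by the standard deviation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler())])
X_out = pipe.fit_transform(X)

imputer = pipe.named_steps['imputer']  # access a step through named_steps
scaler = pipe['scaler']                # or by indexing with the step name

# Reproduce the scaler by hand from its fitted statistics.
X_filled = imputer.transform(X)
manual = (X_filled - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_out))  # True
```

Here mean_ and scale_ are the per-feature mean and standard deviation that StandardScaler learned during fitting.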
The next step is to call fit_transform on pipe_num. The result is the same as if we had applied SimpleImputer and StandardScaler separately. All features are used here except the categorical one.
res_num = pipe_num.fit_transform(features_train.drop(['ocean_proximity'], axis=1))
The pipeline above returns the data as a NumPy array with no column names. To restore the column names and make the data easier to work with, you can use the get_feature_names_out method of the pipeline object or of one of its steps.
In this case, the names could also be taken directly from the dataframe, since this pipeline keeps the original set of features. But if the set of columns changes after the transformation, for example when an encoder is used, that shortcut will not work. Using get_feature_names_out is the more universal approach, so we will use it for all pipelines.
At the end, for verification, we display the result of the transformation of numerical features:
res_df_num = pd.DataFrame(res_num,
columns=pipe_num['scaler'].get_feature_names_out(features.drop(['ocean_proximity'], axis=1).columns))
res_df_num
This is an auxiliary line that checks that there are no gaps left in the data. It will not go into the final pipeline and is needed only for demonstration.
res_df_num.info()
Now let's create a pipeline for categorical features. With SimpleImputer we replace gaps with the value 'unknown', and OneHotEncoder encodes the categorical feature into numerical values that models can understand. The handle_unknown='ignore' argument is needed to avoid errors when the test data contains categories that were not in the training sample. The sparse=False argument makes the encoder return a regular dense array; in scikit-learn 1.2 and later this parameter is called sparse_output. At the end, we create a pipeline from the imputer and the encoder and check that it works.
Although there are no gaps in the categorical feature in this dataset, they can occur in new data when the model is used in practice. In that case, you need a solution that keeps the code from crashing, and this is what the imputer is for.
s_imputer = SimpleImputer(strategy='constant', fill_value="unknown")
ohe_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
pipe_cat = Pipeline([('imputer', s_imputer), ('encoder', ohe_encoder)])
res_cat = pipe_cat.fit_transform(features_train[['ocean_proximity']])
res_df_cat = pd.DataFrame(res_cat, columns=pipe_cat.get_feature_names_out())
res_df_cat
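The effect of handle_unknown='ignore' can be checked on a toy example. In the sketch below (the dataframes are made up), the new data contains a category that was never seen during fitting and a gap. The encoder is left with its default sparse output and converted with toarray(), so the sketch does not depend on the name of the sparse argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration.
train = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND']})
new = pd.DataFrame({'ocean_proximity': ['ISLAND', np.nan]})

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])
pipe.fit(train)

# 'ISLAND' was not seen during fit; the gap becomes 'unknown', also unseen.
encoded_new = pipe.transform(new).toarray()
print(encoded_new)  # both rows are all zeros
```

Because the training data had no gaps, the value 'unknown' is itself a new category for the encoder, so both rows are encoded as all zeros instead of raising an error.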
Now we combine the features into one pipeline.
There is a nuance here: categorical and numerical features cannot simply be put one after the other, because they must be processed differently. scikit-learn has a tool for this, ColumnTransformer, which works as a pipeline builder. With it, you can run branches side by side: one pipeline is applied to one part of the columns, and another to the rest.
When creating a ColumnTransformer object, a list of tuples is passed. Each tuple contains the name and the transformer, as when creating a Pipeline object, plus the columns to which the transformer is applied.
Note that in addition to elementary transformers (SimpleImputer, StandardScaler, etc.), ColumnTransformer also accepts Pipeline objects as transformers.
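As a minimal sketch of the (name, transformer, columns) tuples, assume a made-up dataframe with one numerical and one categorical column. The step names num and cat are arbitrary, and sparse_threshold=0 is set only so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataframe, invented for illustration.
toy = pd.DataFrame({'income': [1.0, 2.0, 3.0],
                    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND']})

ct = ColumnTransformer(
    [('num', StandardScaler(), ['income']),            # applied to 'income' only
     ('cat', OneHotEncoder(), ['ocean_proximity'])],   # applied to the categorical column
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(toy)
print(out.shape)  # (3, 3)
```

The outputs of the branches are stacked side by side in the order of the tuples: first the scaled column, then the two one-hot columns.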
Using list comprehensions, we create the lists of numerical and categorical columns. In our example, the dataset is small, and the lists could be written out by hand. But in practice there can be many more columns, so it is better to automate the process. The code below checks the column type: if it is object, the column is treated as categorical; otherwise, as numerical.
col_transformer = ColumnTransformer([('num_preproc', pipe_num, [x for x in features.columns if features[x].dtype!='object']),
('cat_preproc', pipe_cat, [x for x in features.columns if features[x].dtype=='object'])])
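The column lists can also be built once and stored in variables, which keeps the ColumnTransformer call short. The sketch below is a variant, not the approach used above: it relies on the pandas helper is_numeric_dtype instead of comparing the dtype with 'object', and the dataframe is made up for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy dataframe, invented for illustration.
toy = pd.DataFrame({'population': [100, 250],
                    'median_income': [3.5, 4.1],
                    'ocean_proximity': ['INLAND', 'NEAR BAY']})

# Numeric columns go to one list, everything else to the other.
num_cols = [c for c in toy.columns if is_numeric_dtype(toy[c])]
cat_cols = [c for c in toy.columns if not is_numeric_dtype(toy[c])]

print(num_cols)  # ['population', 'median_income']
print(cat_cols)  # ['ocean_proximity']
```

The two lists can then be passed as the third element of the corresponding tuples.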
The next line runs the transformer on the training sample and returns the result:
res = col_transformer.fit_transform(features_train)
Let's convert the result to a dataframe. Along the way, we strip extra information from the column names: by default, the get_feature_names_out method of the ColumnTransformer object prefixes each column with the name of the transformer that produced it. Because of this, fewer columns fit on the page, and the output becomes less convenient to use.
res_df = pd.DataFrame(res, columns = [x.split('__')[-1] for x in col_transformer.get_feature_names_out()])
res_df
We assemble the finished pipeline from its parts. It consists of the preprocessing step (the ColumnTransformer) and a model. Ridge from the scikit-learn library is linear regression with regularization. It is one of the simpler models: it predicts the target feature as a sum of the other features multiplied by coefficients.
model = Ridge()
final_pipe = Pipeline([('preproc', col_transformer),
('model', model)])
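The "sum of features multiplied by coefficients" can be verified directly. In this sketch, with made-up numbers, the prediction of a fitted Ridge model is rebuilt from its coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data, invented for illustration.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([1.0, 2.0, 4.0, 7.0])

ridge = Ridge(alpha=1.0).fit(X, y)

# Prediction = features times coefficients, plus the intercept.
manual = X @ ridge.coef_ + ridge.intercept_
print(np.allclose(manual, ridge.predict(X)))  # True
```

The alpha parameter controls the strength of the regularization; 1.0 is the library default.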
We train the final pipeline on the training set, make a prediction for the test set, and compute the root mean squared error (RMSE) of that prediction.
final_pipe.fit(features_train, target_train)
preds = final_pipe.predict(features_test)
mean_squared_error(target_test, preds, squared=False)
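With squared=False, mean_squared_error returns the square root of the MSE, that is, the RMSE, which is expressed in the same units as the target. The sketch below computes the same quantity on made-up numbers by taking the root explicitly, so it does not depend on that argument.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, invented for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # root of the MSE

# The same value straight from the definition.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
```

For the housing data this means the error is measured in the same units as median_house_value.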
If you combine the entire pipeline and exclude demo code fragments from it, then it will look like this:
simple_imputer = SimpleImputer(strategy='median')
std_scaler = StandardScaler()
pipe_num = Pipeline([('imputer', simple_imputer), ('scaler', std_scaler)])
s_imputer = SimpleImputer(strategy='constant', fill_value="unknown")
ohe_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
pipe_cat = Pipeline([('imputer', s_imputer), ('encoder', ohe_encoder)])
col_transformer = ColumnTransformer([('num_preproc', pipe_num, [x for x in features.columns if features[x].dtype!='object']),
('cat_preproc', pipe_cat, [x for x in features.columns if features[x].dtype=='object'])])
model = Ridge()
final_pipe = Pipeline([('preproc', col_transformer),
                       ('model', model)])
final_pipe.fit(features_train, target_train)
preds = final_pipe.predict(features_test)
As a result, all the code for preprocessing, training and prediction takes literally half a page.
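To check how such a pipeline behaves on imperfect new data, here is a self-contained sketch on a made-up five-row dataframe. All values and the toy_ names are hypothetical. It mirrors the structure above, leaves the encoder with its default sparse output, and then predicts for rows that contain a gap and an unseen category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data, invented for illustration.
toy_train = pd.DataFrame({
    'median_income': [2.0, 3.0, np.nan, 5.0, 6.0],
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'NEAR BAY', 'INLAND'],
})
toy_target = pd.Series([100.0, 180.0, 150.0, 260.0, 300.0])

toy_num = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
toy_cat = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
                    ('encoder', OneHotEncoder(handle_unknown='ignore'))])
toy_preproc = ColumnTransformer([('num_preproc', toy_num, ['median_income']),
                                 ('cat_preproc', toy_cat, ['ocean_proximity'])])
toy_pipe = Pipeline([('preproc', toy_preproc), ('model', Ridge())])
toy_pipe.fit(toy_train, toy_target)

# New rows with a missing number, an unseen category, and a missing category.
toy_new = pd.DataFrame({'median_income': [np.nan, 4.0],
                        'ocean_proximity': ['ISLAND', np.nan]})
toy_preds = toy_pipe.predict(toy_new)
```

The single predict call runs imputation, scaling, encoding, and the model, and returns one finite prediction per row despite the gaps.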
In the following articles, we will explore the intricacies of using pipelines with scikit-learn cross-validation tools and show how to make more complex custom pipelines and use them in your pet projects.
You can learn how to work with datasets, pipelines, Jupyter Notebook and many other tools not mentioned in this article on the course “Data Scientist” from the Practicum. Depending on your schedule and the usual pace of work, you can master the program in 5, 8 or 16 months.