How to write data transformers in Sklearn

Today we are going to learn how to create custom Sklearn transformers that let you integrate almost any function or data transformation into Sklearn's pipeline classes.
What for?
Just one call to fit and one to predict – how cool would that be? You get the data, train the pipeline once, and it takes care of pre-processing, feature engineering and modeling. All you have to do is call fit.
What pipeline is powerful enough for this? Sklearn has many transformers, but not one for every preprocessing situation. So is such a pipeline just a pipe dream? Absolutely not.
What are Sklearn pipelines?
Here is a simple pipeline that imputes missing values with the mean, scales the features and trains an XGBRegressor on X and y:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import xgboost as xgb
xgb_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    xgb.XGBRegressor()
)
_ = xgb_pipe.fit(X, y)
In this post I went into great detail about Sklearn pipelines and their benefits.
The most notable advantage of pipelines is the ability to combine all preprocessing and modeling steps into a single estimator, preventing data leakage because fit is never called on the validation data. As a bonus, you get concise, reproducible, modular code.
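To see the leakage-prevention benefit in action, here is a minimal sketch (assuming X and y from above are loaded) that cross-validates the whole pipeline; the imputer and scaler are refit on the training split of every fold and never see the validation split:
from sklearn.model_selection import cross_val_score

# The entire pipeline is refit inside each fold, so imputation means and
# scaling statistics are learned from the training folds only.
scores = cross_val_score(xgb_pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(scores.mean())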
But this whole idea of atomic, tidy pipelines breaks down as soon as you need to perform operations that are not built into Sklearn as estimators, for example:
Extract regular expression patterns to clean up text data (see the sketch after this list).
Combine existing features into one, based on domain knowledge.
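As a sketch of the first item (the column name text_col and the regex are hypothetical, purely for illustration), such a cleanup step starts out as an ordinary function, which we will later learn to plug into a pipeline:
import pandas as pd

def strip_html_tags(X: pd.DataFrame, y=None):
    # Hypothetical cleanup: strip anything that looks like an HTML tag
    # from a text column before further feature engineering.
    X = X.copy()
    X["text_col"] = X["text_col"].str.replace(r"<[^>]+>", "", regex=True)
    return X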
To retain the full benefits of pipelines, we need a way to integrate custom pre-processing and feature engineering logic into Sklearn. This is where custom transformers come into play.
Integration of simple functions through FunctionTransformer
In the September 2021 TPS competition on Kaggle, one of the ideas – adding the number of missing values per row as a new feature – significantly increased the performance of the model. This operation is not implemented in Sklearn, so let's write a function that does it once the data is imported:
tps_df = pd.read_csv("data/train.csv")
tps_df.head()

Data source: Kaggle
>>> tps_df.shape
(957919, 120)
>>> # Find the number of missing values across rows
>>> tps_df.isnull().sum(axis=1)
0 1
1 0
2 5
3 2
4 8
..
957914 0
957915 4
957916 0
957917 1
957918 4
Length: 957919, dtype: int64
This function takes a DataFrame and implements the operation:
def num_missing_row(X: pd.DataFrame, y=None):
    # Calculate some metrics across rows
    num_missing = X.isnull().sum(axis=1)
    num_missing_std = X.isnull().std(axis=1)
    # Add the above series as new features to the df
    X["#missing"] = num_missing
    X["num_missing_std"] = num_missing_std
    return X
To add the function to a pipeline, pass it to FunctionTransformer:
from sklearn.preprocessing import FunctionTransformer
num_missing_estimator = FunctionTransformer(num_missing_row)
Passing a custom function to FunctionTransformer creates an estimator with fit, transform and fit_transform methods:
# Check number of columns before
print(f"Number of features before preprocessing: {len(tps_df.columns)}")
# Apply the custom estimator
tps_df = num_missing_estimator.transform(tps_df)
print(f"Number of features after preprocessing: {len(tps_df.columns)}")
------------------------------------------------------
Number of features before preprocessing: 120
Number of features after preprocessing: 122
Since we have a simple stateless function, there is no need for fit to do anything: it just returns the estimator untouched. The only requirement of FunctionTransformer is that the passed function must accept the data as its first argument. Optionally, if the function needs the target array, it can be passed as well:
# FunctionTransformer signature
def custom_function(X, y=None):
    ...

estimator = FunctionTransformer(custom_function)  # no errors
custom_pipeline = make_pipeline(StandardScaler(), estimator, xgb.XGBRegressor())
custom_pipeline.fit(X, y)
FunctionTransformer also accepts an inverse of the passed function in case you need to undo the changes:
def custom_function(X, y=None):
    ...

def inverse_of_custom(X, y=None):
    ...

estimator = FunctionTransformer(func=custom_function, inverse_func=inverse_of_custom)
For details on the other arguments, see the documentation.
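One argument worth highlighting is kw_args, which forwards extra keyword arguments to the wrapped function; a small sketch (add_constant is a made-up function for illustration):
def add_constant(X, y=None, constant=1):
    return X + constant

# kw_args passes keyword arguments through to the wrapped function
estimator = FunctionTransformer(add_constant, kw_args={"constant": 5})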
Integration of complex pre-processing steps
One of the most common scaling options for skewed data is the logarithmic transformation. But if a feature contains even a single 0, transforming it with np.log or PowerTransformer will raise an error.
To get around this, Kaggle competitors add 1 to all samples and only then apply the transformation. If the transformation is applied to the target array, an inverse transformation is also needed: after predicting, apply the exponential function and subtract 1:
import numpy as np

y_transformed = np.log(y + 1)
_ = model.fit(X, y_transformed)
preds = np.exp(model.predict(X)) - 1
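As a side note, NumPy already provides this exact pair of operations as np.log1p and np.expm1, which are also more numerically stable for values close to zero:
y_transformed = np.log1p(y)          # same as np.log(y + 1)
preds = np.expm1(model.predict(X))   # same as np.exp(...) - 1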
This approach works, but the same problem remains – we cannot include the code in a pipeline out of the box. Of course, we could turn to our new friend FunctionTransformer, but it is not suitable for complex pre-processing steps like this one.
Instead, let's write our own transformer class and implement the fit and transform methods manually. In the end we will once again have a Sklearn-compatible estimator. Let's start:
from sklearn.base import BaseEstimator, TransformerMixin
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    pass
First we create a class that inherits from BaseEstimator and TransformerMixin from sklearn.base. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.
Let's write the __init__ method, where we initialize an instance of PowerTransformer:
from sklearn.preprocessing import PowerTransformer
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()
Next, let's write fit, where we add 1 to all features in the data and fit the PowerTransformer:
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self
The fit method must return the transformer itself, which is done by returning self. Now let's check what we wrote:
>>> custom_log = CustomLogTransformer()
>>> custom_log.fit(tps_df)
CustomLogTransformer()
So far, it works as it should.
Next we add transform, where, after adding 1 to the passed data, we call transform of the underlying PowerTransformer:
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1
        return self._estimator.transform(X_copy)
Let's check it once more:
custom_log = CustomLogTransformer()
custom_log.fit(tps_df)
transformed_tps = custom_log.transform(tps_df)
>>> transformed_tps[:5, :5]
array([[ 0.48908946, -2.17126787, -1.79124946, -0.52828469, nan],
[ 0.38660665, -0.29384644, 1.31313666, 0.1901713 , -0.34236997],
[-0.04286469, -0.05047097, -1.16463754, 0.95459266, 1.71830766],
[-0.584329 , -1.5743182 , -1.02444525, -0.15117546, 0.46952437],
[-0.87027925, -0.13045462, -0.10489176, -0.36806683, 1.21317668]])
Works as it should. As I said before, we need a method to undo the transformation:
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1
        return self._estimator.transform(X_copy)

    def inverse_transform(self, X):
        X_reversed = self._estimator.inverse_transform(np.copy(X))
        return X_reversed - 1
Instead of inverse_transform we could have used np.exp. Now let's do a final check:
custom_log = CustomLogTransformer()
tps_transformed = custom_log.fit_transform(tps_df)
tps_inversed = custom_log.inverse_transform(tps_transformed)
But wait! We didn't write fit_transform – where did it come from? It's simple: when you inherit from BaseEstimator and TransformerMixin, you get fit_transform for free.
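Roughly speaking, the fit_transform supplied by TransformerMixin just chains the two methods we wrote ourselves; a simplified sketch of its behavior:
class TransformerMixinSketch:
    # Simplified illustration of what TransformerMixin provides for free
    def fit_transform(self, X, y=None, **fit_params):
        if y is None:
            return self.fit(X, **fit_params).transform(X)
        return self.fit(X, y, **fit_params).transform(X)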
After the inverse transformation, we can compare the result with the original data:
>>> tps_df.values[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])
>>> tps_inversed[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])
Now we have our own transformer. Let's put all the code together:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
xgb_pipe = make_pipeline(
    FunctionTransformer(num_missing_row),
    SimpleImputer(strategy="constant", fill_value=-99999),
    CustomLogTransformer(),
    xgb.XGBClassifier(
        n_estimators=1000, tree_method="gpu_hist", objective="binary:logistic"
    ),
)
X, y = tps_df.drop("claim", axis=1), tps_df[["claim"]].values.flatten()
split = train_test_split(X, y, test_size=0.33, random_state=1121218)
X_train, X_test, y_train, y_test = split
xgb_pipe.fit(X_train, y_train)
preds = xgb_pipe.predict_proba(X_test)
>>> roc_auc_score(y_test, preds[:, 1])
0.7986831816726399
Even though the transformation hurt the model, we made our pipeline work!
In short, the skeleton of a custom transformer class should be:
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        pass

    def transform(self, X):
        pass

    def inverse_transform(self, X):
        pass
This way you get fit_transform without any extra effort. If you don't need __init__, fit, transform or inverse_transform, omit them and Sklearn's parent classes will take care of the rest. The logic inside these methods depends entirely on your needs.
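Inheriting from BaseEstimator also gives you get_params and set_params for free, which is what makes the custom transformer clonable and usable with tools such as GridSearchCV over the whole pipeline. A quick sanity check (the dict is empty because our __init__ takes no tunable parameters):
>>> CustomLogTransformer().get_params()
{}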