How to write data transformers in Sklearn

Today we are going to learn how to create custom Sklearn transformers that let you integrate almost any function or data transformation into Sklearn’s pipeline classes.

Why bother?

A single call to fit and a single call to predict: how cool would that be? You get the data, fit the pipeline once, and it takes care of preprocessing, feature engineering, and modeling. All you have to do is call fit.

What kind of pipeline is powerful enough for this? Sklearn has many transformers, but not one for every preprocessing scenario. So is our pipeline just a pipe dream? Absolutely not.

What are Sklearn pipelines?

Here is a simple pipeline that fills in the missing values of numeric features, scales them, and trains an XGBRegressor on X, y:

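from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import xgboost as xgb

# Impute missing values, scale the features, then fit the regressor
xgb_pipe = make_pipeline(
    SimpleImputer(),  # default strategy: mean
    StandardScaler(),
    xgb.XGBRegressor(),
)

_ = xgb_pipe.fit(X, y)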

In an earlier post I went into great detail about Sklearn pipelines and their benefits.

The most notable advantage of pipelines is that they combine all preprocessing and modeling steps into a single estimator. This prevents data leakage, because fit is never called on validation data. As a bonus, pipelines give you concise, reproducible, modular code.
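To make the leakage point concrete, here is a minimal sketch on synthetic data. It assumes Ridge as a lightweight stand-in for XGBRegressor, and the names X_demo, y_demo and pipe are made up for this example. Because the imputer and the scaler live inside the pipeline, cross_val_score refits them on the training part of every fold, so the validation part never influences the imputation means or the scaling statistics.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries so the imputer has work to do
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Ridge is a stand-in model; any Sklearn-compatible regressor would do
pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

# Every fold fits the whole pipeline on its own training split only
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.round(3))
```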

But this whole idea of atomic, neat pipelines breaks down as soon as you need an operation that is not built into Sklearn as an estimator, for example:

  • Extract regular expression patterns to clean up text data.

  • Combine existing features into one, based on domain knowledge.

To retain the full benefits of pipelines, we need a way to integrate custom preprocessing and feature engineering logic into Sklearn. This is where custom transformers come into play.
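As a rough illustration of the two cases listed above, here they are as plain pandas functions. Everything in this sketch is hypothetical: the column names address, width and length, the five-digit pattern, and the function names are invented for the example. Functions of this shape, which take a DataFrame and return a DataFrame, are exactly what the rest of the article wraps into pipeline steps.

```python
import pandas as pd

def extract_zip_code(X: pd.DataFrame) -> pd.DataFrame:
    # Regex-based cleaning: pull the first five-digit group out of a text column
    X = X.copy()
    X["zip_code"] = X["address"].str.extract(r"(\d{5})", expand=False)
    return X

def add_area(X: pd.DataFrame) -> pd.DataFrame:
    # Domain knowledge: combine two existing features into a new one
    X = X.copy()
    X["area"] = X["width"] * X["length"]
    return X

demo = pd.DataFrame(
    {
        "address": ["12 Main St, 90210", "no zip here"],
        "width": [2.0, 3.0],
        "length": [5.0, 4.0],
    }
)
result = add_area(extract_zip_code(demo))
print(result)
```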

Integrating simple functions with FunctionTransformer

In the September 2021 Tabular Playground Series (TPS) competition on Kaggle, one idea significantly improved model performance: adding the number of missing values in each row as a new feature. This operation is not implemented in Sklearn, so let’s write a function that performs it after we import the data:

import pandas as pd
tps_df = pd.read_csv("data/train.csv")

Data source: Kaggle

>>> tps_df.shape
(957919, 120)

>>> # Find the number of missing values across rows
>>> tps_df.isnull().sum(axis=1)
0         1
1         0
2         5
3         2
4         8
957914    0
957915    4
957916    0
957917    1
957918    4
Length: 957919, dtype: int64

This function takes a DataFrame and implements the operation:

def num_missing_row(X: pd.DataFrame, y=None):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    is_missing = X.isnull()

    # Per-row count and standard deviation of the missing-value indicators
    X["#missing"] = is_missing.sum(axis=1)
    X["num_missing_std"] = is_missing.std(axis=1)

    return X

To add the function to a pipeline, pass it to FunctionTransformer:

from sklearn.preprocessing import FunctionTransformer

num_missing_estimator = FunctionTransformer(num_missing_row)

Passing a custom function to FunctionTransformer creates an estimator with fit, transform and fit_transform methods:

# Check number of columns before
print(f"Number of features before preprocessing: {len(tps_df.columns)}")

# Apply the custom estimator
tps_df = num_missing_estimator.transform(tps_df)
print(f"Number of features after preprocessing: {len(tps_df.columns)}")


Number of features before preprocessing: 120
Number of features after preprocessing: 122
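If you do not have the Kaggle data at hand, the same idea can be tried on a toy DataFrame. This is only a sketch: add_missing_count, the n_missing column and the toy frame are names made up for this example, not part of the competition code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_missing_count(X: pd.DataFrame) -> pd.DataFrame:
    # Return a copy with one extra column: missing values per row
    X = X.copy()
    X["n_missing"] = X.isnull().sum(axis=1)
    return X

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

counter = FunctionTransformer(add_missing_count)
toy_out = counter.fit_transform(toy)
print(toy_out)
```

The input frame keeps its two columns, and the returned frame gains a third one holding the per-row counts.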

Because the function is simple and stateless, there is no need to call fit: it just returns the estimator untouched. The only requirement of FunctionTransformer is that the passed function accepts the data as its first argument. The function may also declare an optional y=None argument, although FunctionTransformer itself calls it with the data only (plus any keyword arguments supplied through kw_args):

# FunctionTransformer signature
def custom_function(X, y=None):
    return X  # placeholder: your transformation goes here

estimator = FunctionTransformer(custom_function)  # no errors

custom_pipeline = make_pipeline(StandardScaler(), estimator, xgb.XGBRegressor())
custom_pipeline.fit(X, y)
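Extra arguments can be forwarded to the wrapped function through the kw_args parameter of FunctionTransformer. A minimal sketch, where clip_values and its bounds are hypothetical names chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def clip_values(X, lower=0.0, upper=1.0):
    # The data arrives as the first argument, the rest comes from kw_args
    return np.clip(X, lower, upper)

clipper = FunctionTransformer(clip_values, kw_args={"lower": 0.0, "upper": 10.0})

X_demo = np.array([[-5.0, 3.0], [12.0, 7.0]])
clipped = clipper.fit_transform(X_demo)
print(clipped)
```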

FunctionTransformer also accepts the inverse of the passed function in case you need to undo the changes:

def custom_function(X, y=None):
    ...  # forward transformation goes here

def inverse_of_custom(X, y=None):
    ...  # undo the forward transformation here

estimator = FunctionTransformer(func=custom_function, inverse_func=inverse_of_custom)

For details on the other arguments, see the documentation.
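A concrete pair of inverse functions makes this easier to picture. The sketch below assumes NumPy's np.log1p and np.expm1 as the forward and inverse functions, and the array X_demo is made up for the example:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) and expm1(x) = exp(x) - 1 undo each other
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_demo = np.array([[0.0, 1.0], [3.0, 7.0]])
X_logged = log_tf.fit_transform(X_demo)
X_back = log_tf.inverse_transform(X_logged)

print(np.allclose(X_back, X_demo))
```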

Integrating complex preprocessing steps

One of the most common ways to rescale skewed data is a logarithmic transformation. But if a feature contains even a single 0, np.log returns -inf (with a runtime warning), and PowerTransformer with method="box-cox" raises an error, because Box-Cox requires strictly positive data.

To work around this, Kaggle competitors add 1 to every sample and only then apply the transformation. If the transformation is applied to the target array, an inverse transformation is needed as well: after predicting, apply the exponential function and subtract 1:

import numpy as np

y_transformed = np.log(y + 1)

_ = model.fit(X, y_transformed)
# Undo the transform: exponentiate first, then subtract 1
preds = np.exp(model.predict(X)) - 1
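The order of operations in the inverse step matters: exponentiate first, then subtract 1. A small numeric check on made-up target values (y_demo is purely illustrative) shows that subtracting inside the exponent does not recover the original values:

```python
import numpy as np

y_demo = np.array([0.0, 1.0, 9.0, 99.0])

# Forward transform: shift by 1, then take the logarithm
y_log = np.log(y_demo + 1)

# Correct inverse: exponentiate, then subtract the shift
recovered = np.exp(y_log) - 1

# Incorrect inverse: the shift is subtracted inside the exponent
wrong = np.exp(y_log - 1)

print(np.allclose(recovered, y_demo), np.allclose(wrong, y_demo))
```

NumPy also ships np.log1p and np.expm1, which compute the same two steps directly.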

It works, but the same problem remains: we cannot include this code in a pipeline out of the box. Of course, we could turn to our new friend FunctionTransformer, but it is not well suited to more complex preprocessing steps like this one.

Instead, let’s write our own transformer class and implement its fit and transform methods by hand. In the end we will again have a Sklearn-compatible estimator. Let’s start:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    pass  # the methods are added step by step

First we create a class that inherits from BaseEstimator and TransformerMixin in sklearn.base. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.

Let’s write an __init__ method in which we create a PowerTransformer instance:

from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

Next we write fit, where we add 1 to all features in the data and fit the PowerTransformer:

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)

        return self

The fit method must return the transformer itself, which is done by returning self. Let’s check what we have so far:

custom_log = CustomLogTransformer()
custom_log.fit(tps_df)

So far, it works as expected.

Next comes transform, which adds 1 to the passed data and then calls the transform method of PowerTransformer:

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)

        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1

        return self._estimator.transform(X_copy)

Let’s run another check:

custom_log = CustomLogTransformer()
custom_log.fit(tps_df)

transformed_tps = custom_log.transform(tps_df)

>>> transformed_tps[:5, :5]
array([[ 0.48908946, -2.17126787, -1.79124946, -0.52828469,         nan],
       [ 0.38660665, -0.29384644,  1.31313666,  0.1901713 , -0.34236997],
       [-0.04286469, -0.05047097, -1.16463754,  0.95459266,  1.71830766],
       [-0.584329  , -1.5743182 , -1.02444525, -0.15117546,  0.46952437],
       [-0.87027925, -0.13045462, -0.10489176, -0.36806683,  1.21317668]])

Works as it should. As I said before, we need a method to undo the transformation:

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)

        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1

        return self._estimator.transform(X_copy)

    def inverse_transform(self, X):
        X_reversed = self._estimator.inverse_transform(np.copy(X))

        return X_reversed - 1

If the forward step were a plain np.log, we could use np.exp here instead of inverse_transform. Now let’s do a final check:

custom_log = CustomLogTransformer()

tps_transformed = custom_log.fit_transform(tps_df)
tps_inversed = custom_log.inverse_transform(tps_transformed)

But wait! We never wrote fit_transform. Where did it come from?

It’s simple: when you inherit from BaseEstimator and TransformerMixin, you get the fit_transform method for free.
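The inherited method is easy to see on a deliberately tiny transformer. AddOne below is a hypothetical class written only for this demonstration. It defines fit and transform and nothing else, yet fit_transform is available because TransformerMixin supplies it:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddOne(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn, so just return the transformer itself
        return self

    def transform(self, X):
        return np.asarray(X) + 1

adder = AddOne()

# fit_transform is never defined in AddOne; it comes from TransformerMixin
out = adder.fit_transform(np.array([[0.0, 1.0], [2.0, 3.0]]))
print(out)
```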

After the reverse transformation, you can compare it with the original data:

>>> tps_df.values[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])

>>> tps_inversed[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])
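The same round-trip comparison can be done programmatically with np.allclose rather than by eye. The sketch below repeats the class in compact form so that it runs on its own, and it uses synthetic exponential data (X_demo) as an assumed stand-in for the competition data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        return self._estimator.inverse_transform(np.copy(X)) - 1

# Skewed, non-negative synthetic data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.exponential(size=(100, 3))

custom_log = CustomLogTransformer()
X_back = custom_log.inverse_transform(custom_log.fit_transform(X_demo))

# True when the inverse recovers the input up to floating-point error
round_trip_ok = np.allclose(X_back, X_demo)
print(round_trip_ok)
```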

Now we have our own transformer. Let’s put all the code together:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

xgb_pipe = make_pipeline(
    FunctionTransformer(num_missing_row),  # add the missing-value features
    SimpleImputer(strategy="constant", fill_value=-99999),
    CustomLogTransformer(),
    xgb.XGBClassifier(  # tree_method="gpu_hist" requires a GPU
        n_estimators=1000, tree_method="gpu_hist", objective="binary:logistic"
    ),
)

X, y = tps_df.drop("claim", axis=1), tps_df[["claim"]].values.flatten()
split = train_test_split(X, y, test_size=0.33, random_state=1121218)
X_train, X_test, y_train, y_test = split

xgb_pipe.fit(X_train, y_train)
preds = xgb_pipe.predict_proba(X_test)

>>> roc_auc_score(y_test, preds[:, 1])

Even though the transformation hurt the model, we made our pipeline work!

In short, the skeleton of a custom transformer class should look like this:

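class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass  # store hyperparameters here

    def fit(self, X, y=None):
        return self  # learn anything needed from X, then return self

    def transform(self, X):
        ...  # return the transformed data

    def inverse_transform(self, X):
        ...  # return the data in its original form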

This way you get fit_transform without any extra effort. If you do not need a custom __init__ or inverse_transform, simply leave them out. The fit and transform methods, however, should be defined (fit can just return self), because the inherited fit_transform calls both of them. The logic of these methods depends entirely on your needs.
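When a transformer does need settings, the usual Sklearn convention is to accept them as keyword arguments in __init__ and store them unchanged under the same name, which is what lets the inherited get_params and sklearn.base.clone work. A minimal sketch, with ShiftedLog and its shift parameter being hypothetical names for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ShiftedLog(BaseEstimator, TransformerMixin):
    def __init__(self, shift=1.0):
        # Stored as-is under the same name so get_params and clone can find it
        self.shift = shift

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log(np.asarray(X, dtype=float) + self.shift)

    def inverse_transform(self, X):
        return np.exp(np.asarray(X, dtype=float)) - self.shift

shifted = ShiftedLog(shift=2.0)
params = shifted.get_params()
shifted_copy = clone(shifted)

X_demo = np.array([[0.0, 1.0], [5.0, 10.0]])
X_back = shifted.inverse_transform(shifted.fit_transform(X_demo))
print(params, np.allclose(X_back, X_demo))
```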
