# How to write data transformers in Sklearn

Today we are going to learn how to create custom Sklearn transformers that let you integrate almost any function or data transformation into Sklearn pipelines.

## What for?

Just one call to `fit` and one to `predict`: how cool would that be? You get the data, train the pipeline once, and it takes care of preprocessing, feature engineering, and modeling. All you have to do is call `fit`.

What pipeline is powerful enough for this? Sklearn ships many transformers, but not one for every preprocessing situation. So is our pipeline just a pipe dream? Absolutely not.

## What are Sklearn pipelines?

Here is a simple pipeline that fills in missing values with the column mean, scales the features, and trains an `XGBRegressor` on `X`, `y`:

```
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import xgboost as xgb

# Impute missing values, scale the features, then fit the model
xgb_pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    xgb.XGBRegressor()
)
_ = xgb_pipe.fit(X, y)
```

In a separate post I went into great detail about Sklearn pipelines and their benefits.

The most notable advantage of pipelines is the ability to combine all preprocessing and modeling steps into a single estimator, which prevents data leakage because `fit` is never called on validation data. Concise, reproducible, modular code comes as a bonus.
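To make the leakage point concrete, here is a minimal sketch on synthetic data. The `Ridge` model and the random arrays are stand-ins chosen for illustration and are not part of the original example. Because imputation and scaling live inside the pipeline, `cross_val_score` refits them on the training part of each fold only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the cells to simulate missing values
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), Ridge())

# Each fold fits the imputer and scaler on its training part only,
# so statistics of the validation part never leak into preprocessing
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean())
```

Doing the imputation and scaling on the full dataset before cross-validation would let validation rows influence the means and variances used in training.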

But this whole idea of atomic, neat pipelines breaks down as soon as you need operations that are not built into Sklearn as estimators, for example:

- Extracting regular expression patterns to clean up text data.
- Combining existing features into one, based on domain knowledge.

To retain the full benefits of pipelines, we need a way to integrate custom preprocessing and feature engineering logic into Sklearn. This is where custom transformers come into play.

## Integration of simple functions through FunctionTransformer

In the September 2021 Tabular Playground Series (TPS) competition on Kaggle, one idea significantly improved model performance: adding the number of missing values in each row as a new feature. This operation is not implemented in Sklearn, so let’s write a function for it after loading the data:

```
import pandas as pd

tps_df = pd.read_csv("data/train.csv")
tps_df.head()
```

Data source: Kaggle

```
>>> tps_df.shape
(957919, 120)
>>> # Find the number of missing values across rows
>>> tps_df.isnull().sum(axis=1)
0 1
1 0
2 5
3 2
4 8
..
957914 0
957915 4
957916 0
957917 1
957918 4
Length: 957919, dtype: int64
```

This function takes a DataFrame and implements the operation:

```
def num_missing_row(X: pd.DataFrame, y=None):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    # Flag missing cells as 0/1 so that row statistics are well defined
    missing = X.isnull().astype(int)
    # Add the row-wise count and standard deviation as new features
    X["#missing"] = missing.sum(axis=1)
    X["num_missing_std"] = missing.std(axis=1)
    return X
```

To add the function to a pipeline, pass it to `FunctionTransformer`:

```
from sklearn.preprocessing import FunctionTransformer
num_missing_estimator = FunctionTransformer(num_missing_row)
```

Passing a custom function to `FunctionTransformer` creates an estimator with `fit`, `transform`, and `fit_transform` methods:

```
# Check number of columns before
print(f"Number of features before preprocessing: {len(tps_df.columns)}")
# Apply the custom estimator
tps_df = num_missing_estimator.transform(tps_df)
print(f"Number of features after preprocessing: {len(tps_df.columns)}")
------------------------------------------------------
Number of features before preprocessing: 120
Number of features after preprocessing: 122
```
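The same idea can be tried without downloading the competition data. The sketch below uses a three-row toy DataFrame and a hypothetical helper named `count_missing` (both invented for illustration) to show that the wrapped function behaves like any other transformer:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_missing(X: pd.DataFrame) -> pd.DataFrame:
    # Return a copy with one extra column: the number of missing cells per row
    out = X.copy()
    out["n_missing"] = X.isnull().sum(axis=1)
    return out


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})
counter = FunctionTransformer(count_missing)

# fit() has nothing to learn here, so fit_transform and transform agree
result = counter.fit_transform(toy)
print(result["n_missing"].tolist())  # [1, 2, 0]
```

Returning a copy keeps the input frame untouched, which makes repeated calls safe.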

Because the function is stateless, there is no need to call `fit`: it simply returns the estimator itself, untouched. The only requirement of `FunctionTransformer` is that the passed function accept the data as its first argument. The signature may also declare an optional `y=None`, but `FunctionTransformer` calls the function with the data only (plus anything given in `kw_args`), so a target array is not forwarded to it:

```
# FunctionTransformer signature
def custom_function(X, y=None):
    ...  # custom logic goes here; it must return the transformed data

estimator = FunctionTransformer(custom_function)  # no errors
custom_pipeline = make_pipeline(StandardScaler(), estimator, xgb.XGBRegressor())
custom_pipeline.fit(X, y)
```

`FunctionTransformer` also accepts the inverse of the passed function in case you need to undo the changes:

```
def custom_function(X, y=None):
    ...

def inverse_of_custom(X, y=None):
    ...

estimator = FunctionTransformer(func=custom_function, inverse_func=inverse_of_custom)
```
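As a concrete instance of a function and its inverse, here is a small sketch that pairs NumPy’s `np.log1p` with `np.expm1`. The choice of this pair and the tiny input array are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) and expm1(x) = exp(x) - 1 undo each other
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, check_inverse=True)

values = np.array([[0.0, 1.0], [9.0, 99.0]])
forward = log_tf.fit_transform(values)
restored = log_tf.inverse_transform(forward)

print(np.allclose(restored, values))  # True
```

With `check_inverse=True` (the default), `fit` checks on a subset of the data that the two functions really are inverses and warns if they are not.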

For details on the other arguments, see the documentation.

## Integration of complex pre-processing steps

One of the most common ways to rescale skewed data is the logarithmic transformation. But if a feature contains at least one 0, it cannot be applied directly: `np.log` returns `-inf` for zeros, and `PowerTransformer` with `method="box-cox"` raises an error because it requires strictly positive data.

To work around this limitation, Kaggle contestants add 1 to every sample and only then apply the transformation. If the transformation is applied to the target array, an inverse transformation is required: after predicting, you apply the exponential function and subtract 1:

```
y_transformed = np.log(y + 1)
_ = model.fit(X, y_transformed)
# Undo the transformation: exponentiate the predictions first, then subtract 1
preds = np.exp(model.predict(X)) - 1
```
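The effect of the shift is easy to verify on a few numbers. This sketch uses NumPy’s built-in `np.log1p` and `np.expm1`, which compute `log(1 + y)` and `exp(z) - 1` directly; the sample array is made up for illustration:

```python
import numpy as np

y_demo = np.array([0.0, 1.0, 10.0, 100.0])

# np.log(0) is -inf (NumPy also emits a RuntimeWarning, silenced here)
with np.errstate(divide="ignore"):
    print(np.log(y_demo)[0])  # -inf

# Shifting by 1 keeps zeros finite: log1p(0) == 0
y_log = np.log1p(y_demo)

# expm1 undoes both the log and the shift in one step
y_back = np.expm1(y_log)
print(np.allclose(y_back, y_demo))  # True
```

The order of operations in the inverse matters: the exponential comes first and the subtraction of 1 second.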

It works, but the same problem remains: we cannot include this code in a pipeline out of the box. Of course, you could turn to our new friend `FunctionTransformer`, but it is not well suited to more complex preprocessing steps like this one.

Instead, let’s write our own transformer class and implement the `fit` and `transform` methods manually. In the end we will again have a Sklearn-compatible estimator. Let’s start:

```
from sklearn.base import BaseEstimator, TransformerMixin

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    pass
```

First we create a class that inherits from `BaseEstimator` and `TransformerMixin` from `sklearn.base`. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.
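What the two parent classes contribute can be seen on a toy class. The sketch below uses a hypothetical `AddConstant` transformer, invented for illustration, to show the parameter handling that comes from `BaseEstimator` and the `fit_transform` method that comes from `TransformerMixin`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: adds a fixed constant to the data."""

    def __init__(self, constant=1.0):
        # Store constructor arguments under the same name so get_params finds them
        self.constant = constant

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X + self.constant


adder = AddConstant(constant=2.0)

# BaseEstimator provides get_params / set_params, which clone() and grid search rely on
print(adder.get_params())     # {'constant': 2.0}
print(clone(adder).constant)  # 2.0

# TransformerMixin provides fit_transform on top of fit and transform
print(adder.fit_transform(np.zeros((2, 2))))
```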

Let’s write the `__init__` method, where we initialize a `PowerTransformer` instance:

```
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()
```

Next, let’s write `fit`, where we add 1 to all features in the data and fit the `PowerTransformer`:

```
import numpy as np

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        # Shift a copy of the data by 1 before fitting the inner transformer
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self
```

The `fit` method must return the transformer itself, which is done by returning `self`. Let’s check what we wrote:

```
custom_log = CustomLogTransformer()
>>> custom_log.fit(tps_df)
CustomLogTransformer()
```

So far, it works as it should.

Next comes `transform`, which adds 1 to the passed data and then calls the `transform` method of `PowerTransformer`:

```
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self

    def transform(self, X):
        # Apply the same shift as in fit, then the fitted transformation
        X_copy = np.copy(X) + 1
        return self._estimator.transform(X_copy)
```

Let’s check it another way:

```
custom_log = CustomLogTransformer()
custom_log.fit(tps_df)
transformed_tps = custom_log.transform(tps_df)
>>> transformed_tps[:5, :5]
array([[ 0.48908946, -2.17126787, -1.79124946, -0.52828469, nan],
[ 0.38660665, -0.29384644, 1.31313666, 0.1901713 , -0.34236997],
[-0.04286469, -0.05047097, -1.16463754, 0.95459266, 1.71830766],
[-0.584329 , -1.5743182 , -1.02444525, -0.15117546, 0.46952437],
[-0.87027925, -0.13045462, -0.10489176, -0.36806683, 1.21317668]])
```

Works as it should. As I said before, we need a method to undo the transformation:

```
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)
        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1
        return self._estimator.transform(X_copy)

    def inverse_transform(self, X):
        # Undo the power transformation first, then remove the shift of 1
        X_reversed = self._estimator.inverse_transform(np.copy(X))
        return X_reversed - 1
```

If the forward step were a plain `np.log`, you could use `np.exp` here instead of `inverse_transform`. Now let’s do a final check:

```
custom_log = CustomLogTransformer()
tps_transformed = custom_log.fit_transform(tps_df)
tps_inversed = custom_log.inverse_transform(tps_transformed)
```

But wait! We didn’t write `fit_transform`, so where did it come from? It’s simple: when you inherit from `BaseEstimator` and `TransformerMixin`, you get the `fit_transform` method for free (`TransformerMixin` supplies it on top of your `fit` and `transform`).

After the reverse transformation, you can compare it with the original data:

```
>>> tps_df.values[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])
>>> tps_inversed[:5, 5]
array([0.35275, 0.17725, 0.25997, 0.4293 , 0.34079])
```
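The round trip can also be reproduced without the competition data. The sketch below is a variant under stated assumptions: the class name `ShiftedPowerTransformer` is hypothetical, the data is synthetic, and the inner `PowerTransformer` is created inside `fit` and stored with a trailing underscore, which is the usual Sklearn convention for fitted attributes rather than what the class above does:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class ShiftedPowerTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: shift the data by 1, then apply PowerTransformer."""

    def fit(self, X, y=None):
        # Trailing underscore marks an attribute learned during fit
        self.power_ = PowerTransformer().fit(np.asarray(X) + 1)
        return self

    def transform(self, X):
        return self.power_.transform(np.asarray(X) + 1)

    def inverse_transform(self, X):
        # Undo the power transformation first, then remove the shift
        return self.power_.inverse_transform(np.asarray(X)) - 1


# Synthetic right-skewed data that contains exact zeros
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=(500, 3))
skewed[:10] = 0.0

tf = ShiftedPowerTransformer()
restored = tf.inverse_transform(tf.fit_transform(skewed))

# Compare the restored values with the originals, up to floating-point error
print(np.allclose(restored, skewed, atol=1e-6))
```

Creating the inner estimator in `fit` means every call to `fit` starts from a fresh state, which plays well with `clone` and cross-validation.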

Now we have our own transformer. Let’s put all the code together:

```
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

xgb_pipe = make_pipeline(
    FunctionTransformer(num_missing_row),
    SimpleImputer(strategy="constant", fill_value=-99999),
    CustomLogTransformer(),
    xgb.XGBClassifier(
        n_estimators=1000, tree_method="gpu_hist", objective="binary:logistic"
    ),
)

X, y = tps_df.drop("claim", axis=1), tps_df[["claim"]].values.flatten()
split = train_test_split(X, y, test_size=0.33, random_state=1121218)
X_train, X_test, y_train, y_test = split

xgb_pipe.fit(X_train, y_train)
preds = xgb_pipe.predict_proba(X_test)
>>> roc_auc_score(y_test, preds[:, 1])
0.7986831816726399
```

Even though the transformation hurt the model, we made our pipeline work!

In short, the skeleton of a custom transformer class looks like this:

```
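class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Learn anything needed from X here, then return the transformer itself
        return self

    def transform(self, X):
        # Return the transformed data
        ...

    def inverse_transform(self, X):
        # Return the data mapped back to the original scale
        ...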
```

So you get `fit_transform` without any effort. If you don’t need `__init__` or `inverse_transform`, leave them out and Sklearn’s parent classes will take care of the rest. A pipeline does call `fit` and `transform`, so keep those two; a `fit` with nothing to learn can simply return `self`. The logic of these methods depends entirely on your needs.
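As one more illustration of the skeleton, here is a sketch of a transformer that actually learns something in `fit`. The `QuantileClipper` class, its parameters, and the synthetic data are all hypothetical and only meant to show the pattern:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column bounds from the training data only
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


rng = np.random.default_rng(7)
train = rng.normal(size=(300, 2))
new = np.array([[100.0, -100.0]])  # extreme values unseen during training

pipe = make_pipeline(QuantileClipper(lower=0.05, upper=0.95), StandardScaler())
pipe.fit(train)

# make_pipeline names each step after its lowercased class name
clipper = pipe.named_steps["quantileclipper"]
clipped = clipper.transform(new)
print(clipped)
```

The extreme values are pulled back to the bounds learned from the training data, and no `inverse_transform` is defined because clipping cannot be undone.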
