Gradient boosting with CatBoost (part 2)

In the first part of the article, I talked about the concept of gradient boosting and the libraries that implement this algorithm, and took a close look at one of them. Today we will continue talking about CatBoost and take a look at Cross Validation, the Overfitting Detector, ROC-AUC, Snapshots and Predict. Let's go!

Up to this point, we measured quality on one specific fold: we took our sample split once into a training and a test part. This is not entirely correct: we may have grabbed an unrepresentative piece of our dataset, obtained good quality on that very piece, and then, when the model meets real data, the quality will be extremely sad. To avoid this, you should use Cross Validation.

Let's split our dataset into pieces and then train the model as many times as there are pieces. First we train the model on all pieces except the first one, which is used for validation; then the same happens with the second piece, and the whole thing is repeated up to the last piece of our sample:

from catboost import Pool, cv

params = {
    'loss_function': 'Logloss',
    'iterations': 150,
    'custom_loss': 'AUC',
    'random_seed': 63,
    'learning_rate': 0.5
}

cv_data = cv(
    params=params,
    pool=Pool(X, label=y, cat_features=cat_features),
    fold_count=5, # Split the sample into 5 folds
    shuffle=True, # Shuffle our data
    partition_random_seed=0,
    plot=True, # Can't do without the visualizer
    stratified=True, # Preserve the class balance in each fold
    verbose=False
)

When the code runs we again see the visualizer, only now it draws not one curve but a curve averaged over all folds. If you uncheck the Standard Deviation box, you will see each curve separately and can analyze the folds where quality is bad or good.

What does the cv function return? If pandas is installed (I have a separate article about this library), a DataFrame; if not, a Python dictionary. Either way, it contains the value of each metric at every iteration:

Here we see the value of each metric at every iteration, aggregated over all folds.
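For example, here is a quick peek at the structure of the result (a small sketch, assuming pandas is installed so that cv returned a DataFrame):

print(cv_data.columns.tolist()) # column names such as 'test-Logloss-mean'
print(cv_data.head())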

Let's print the best Logloss value and the step at which it was achieved:

import numpy as np

best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
print("Best validation Logloss score, stratified: {:.4f}+/-{:.3f} on step {}".format(
    best_value, cv_data['test-Logloss-std'][best_iter], best_iter))

Best validation Logloss score, stratified: 0.1577+/-0.002 on step 52

We see that the best Logloss is 0.1577 and it was achieved at step 52 with a standard deviation of 0.002.

At the end of the first part I touched on the Overfitting Detector. What is it? It is a detector of overfitting, a great thing that saves a Data Scientist's time. When we train the model, all the iterations after the overfitting point are useless, so why waste time waiting for them to finish? The aforementioned Overfitting Detector will help us with this.

When creating the model, we add the early_stopping_rounds parameter, in this case equal to 20: if the metric on the validation set does not improve for 20 iterations in a row, training is stopped:

model_with_early_stop = CatBoostClassifier(
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    early_stopping_rounds=20
)

model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    verbose=False,
    plot=True
)

Here you can see that the model reached its best validation score at iteration 28; the error keeps getting worse for the next 20 iterations, and training stops at iteration 48. Let's read the tree_count_ attribute and see the number of trees left after training:

print(model_with_early_stop.tree_count_)

28
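Besides tree_count_, a trained CatBoost model also exposes get_best_iteration and get_best_score; a small sketch of reading the best iteration and the metric values on the eval set:

# Best iteration on the eval set and the corresponding metric values
print(model_with_early_stop.get_best_iteration())
print(model_with_early_stop.get_best_score())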

Let's look not only at Logloss but also at a more meaningful metric, in this case AUC. To use AUC as the metric we pass the eval_metric parameter. Note that the visualizer will now display AUC first:

model_with_early_stop = CatBoostClassifier(
    eval_metric="AUC",
    iterations=200,
    random_seed=63,
    learning_rate=0.3,
    early_stopping_rounds=20
)

model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    verbose=False,
    plot=True
)

We see that overfitting set in at iteration 44; another 20 iterations passed and training was stopped:

Reading tree_count_ again, we get:

print(model_with_early_stop.tree_count_)

44

Let's go further and talk about the get_roc_curve function: it returns the ROC curve data (false positive rate, true positive rate and thresholds). The ROC curve is the dependence of TPR on FPR, and each point on it corresponds to its own decision threshold.

from catboost.utils import get_roc_curve
from sklearn import metrics

eval_pool = Pool(X_test, y_test, cat_features=cat_features)
curve = get_roc_curve(model, eval_pool)
(fpr, tpr, thresholds) = curve
roc_auc = metrics.auc(fpr, tpr)

Now we render the ROC curve:

import matplotlib.pyplot as plt

plt.figure(figsize=(16, 8))
lw=2

plt.plot(fpr, tpr, color="darkorange",
         lw=lw, label="ROC curve (area = %0.2f)" % roc_auc, alpha=0.5)

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--", alpha=0.5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('Receiver operating characteristic', fontsize=20)
plt.legend(loc="lower right", fontsize=16)
plt.show()

The area under the ROC curve is called AUC. The larger the AUC, the better: the closer we are to the ideal point (0, 1), and a perfect classifier has AUC = 1.0.
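As a quick sanity check, the same number can be computed from the predicted probabilities with scikit-learn's roc_auc_score; up to rounding, it should match the area obtained from get_roc_curve above:

from sklearn.metrics import roc_auc_score

# AUC computed directly from the class-1 probabilities
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))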

There are also functions that separately compute FPR and FNR as functions of the threshold:

from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve

(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)

Let's visualize these curves:

plt.figure(figsize=(16, 8))
lw=2

plt.plot(thresholds, fpr, color="blue", lw=lw, label="FPR", alpha=0.5)
plt.plot(thresholds, fnr, color="green", lw=lw, label="FNR", alpha=0.5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('Thresholds', fontsize=16)
plt.ylabel('Error rate', fontsize=16)
plt.title('FPR-FNR curves', fontsize=16)
plt.legend(loc="lower left", fontsize=16)
plt.show()

To pick the optimal decision boundary on this chart, we use select_threshold:

from catboost.utils import select_threshold

print(select_threshold(model=model, data=eval_pool, FNR=0.01))
print(select_threshold(model=model, data=eval_pool, FPR=0.01))

0.5323909210615109
0.9895850986242051

In the first case (FNR ≤ 0.01) we should set the threshold at 0.5323, in the second (FPR ≤ 0.01) at 0.9895. Of course, you must understand that the 0.5 boundary is not always the right one: it depends on the problem, on where a mistake is more costly and where it is not so critical, and the decision should be made based on that.
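Once a boundary is chosen, applying it is simple: compare the class-1 probabilities against it instead of the default 0.5. A minimal sketch using the FPR-based threshold from above:

# Predict class 1 only when the probability clears the chosen boundary
threshold = select_threshold(model=model, data=eval_pool, FPR=0.01)
probabilities = model.predict_proba(X_test)[:, 1]
print((probabilities > threshold).astype(int))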

Now let's take a look at Snapshots. All sorts of things happen: the power goes out, the computer or laptop freezes, or training crashes for some other reason, and you have to start everything anew. To avoid this unpleasant situation, CatBoost can save the progress of model training:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=150,
    save_snapshot=True,
    snapshot_file="snapshot.bkp", # File to write training progress to
    snapshot_interval=1, # Interval (in seconds) between snapshots
    random_seed=42
)

model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    cat_features=cat_features,
    verbose=True
)

Let's simulate a situation where training fails with an error at some iteration:

1080:	learn: 0.1174802	test: 0.1512820	best: 0.1506310 (585)	total: 16.3s	remaining: 13.9s
1081:	learn: 0.1174613	test: 0.1512905	best: 0.1506310 (585)	total: 16.3s	remaining: 13.8s
1082:	learn: 0.1174327	test: 0.1512617	best: 0.1506310 (585)	total: 16.3s	remaining: 13.8s
1083:	learn: 0.1174143	test: 0.1512679	best: 0.1506310 (585)	total: 16.3s	remaining: 13.8s
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-45-aab67cd70f42> in <module>
      9 )
     10 
---> 11 model.fit(
     12     X_train, y_train,
     13     eval_set=(X_test, y_test),

In this case, we just run the cell again and training will resume from iteration 1083:

1083:	learn: 0.1174143	test: 0.1512679	best: 0.1506310 (585)	total: 16.3s	remaining: 14.1s
1084:	learn: 0.1173739	test: 0.1512903	best: 0.1506310 (585)	total: 16.3s	remaining: 14.1s
1085:	learn: 0.1173333	test: 0.1512818	best: 0.1506310 (585)	total: 16.4s	remaining: 14s
1086:	learn: 0.1172675	test: 0.1512872	best: 0.1506310 (585)	total: 16.4s	remaining: 14.1s
1087:	learn: 0.1172435	test: 0.1512959	best: 0.1506310 (585)	total: 16.4s	remaining: 14.1s
1088:	learn: 0.1171932	test: 0.1512984	best: 0.1506310 (585)	total: 16.4s	remaining: 14.1s
1089:	learn: 0.1171045	test: 0.1513513	best: 0.1506310 (585)	total: 16.4s	remaining: 14s
1090:	learn: 0.1170768	test: 0.1513511	best: 0.1506310 (585)	total: 16.4s	remaining: 14s
1091:	learn: 0.1170621	test: 0.1513434	best: 0.1506310 (585)	total: 16.5s	remaining: 14s
1092:	learn: 0.1170396	test: 0.1513455	best: 0.1506310 (585)	total: 16.5s	remaining: 14s
1093:	learn: 0.1170104	test: 0.1513388	best: 0.1506310 (585)	total: 16.5s	remaining: 14s
1094:	learn: 0.1169427	test: 0.1513257	best: 0.1506310 (585)	total: 16.5s	remaining: 14s
1095:	learn: 0.1169269	test: 0.1513051	best: 0.1506310 (585)	total: 16.5s	remaining: 14s

Next, it's worth talking about predictions. CatBoost has a predict_proba function that produces predictions in the scikit-learn format: the first column contains the probabilities of belonging to class 0, and the second the probabilities of belonging to class 1:

print(model.predict_proba(X_test))

[[0.0155 0.9845]
 [0.0064 0.9936]
 [0.0137 0.9863]
 ...
 [0.0472 0.9528]
 [0.0091 0.9909]
 [0.0121 0.9879]]

Next, predict. It returns the classes directly; the decision threshold in this case is 0.5:

print(model.predict(X_test))

[1 1 1 ... 1 1 1]

You can also choose the prediction type "RawFormulaVal". What is it? It is hard to force boosting to predict numbers in the range from 0 to 1: the result of boosting is the sum of the values of each of the trees, so internally the model predicts numbers from minus infinity to plus infinity. When we need a probability, we apply the sigmoid function to map the result into the range from 0 to 1.

raw_pred = model.predict(
    X_test,
    prediction_type="RawFormulaVal"
)

print(raw_pred)

[4.1528 5.0524 4.2755 ... 3.0048 4.6904 4.4035]

Let’s use the sigmoid function to get the predictions we need:

from numpy import exp

sigmoid = lambda x: 1 / (1 + exp(-x))
probabilities = sigmoid(raw_pred)
print(probabilities)

[0.9845 0.9936 0.9863 ... 0.9528 0.9909 0.9879]
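To make sure this is exactly what predict_proba does under the hood, we can compare the two results; np.allclose should return True:

import numpy as np

# The sigmoid of the raw scores should match the class-1 column of predict_proba
print(np.allclose(sigmoid(raw_pred), model.predict_proba(X_test)[:, 1]))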

And another way to get the probabilities:

from catboost import FeaturesData

# All features in this example are categorical, so pass them as strings
X_prepared = X_test.values.astype(str).astype(object)

fast_prediction = model.predict_proba(
    FeaturesData(
        cat_feature_data=X_prepared,
        cat_feature_names=list(X_test)
    )
)

print(fast_prediction)

[[0.0155 0.9845]
 [0.0064 0.9936]
 [0.0137 0.9863]
 ...
 [0.0472 0.9528]
 [0.0091 0.9909]
 [0.0121 0.9879]]

There are times when you have some special metric that CatBoost does not support, and you want to see at which iteration the model started to overfit. In such moments it is worth using staged prediction: it returns the result of the model's work at each iteration from ntree_start to ntree_end with step eval_period. Let's see how it works:

prediction_gen = model.staged_predict_proba(
    X_test,
    ntree_start=0,
    ntree_end=5,
    eval_period=1
)

for iteration, predictions in enumerate(prediction_gen):
    print(f"Iteration: {iteration}, predictions: {predictions}")

Iteration: 0, predictions: [[0.4689 0.5311]
 [0.4689 0.5311]
 [0.4689 0.5311]
 ...
 [0.4689 0.5311]
 [0.4689 0.5311]
 [0.4689 0.5311]]
Iteration: 1, predictions: [[0.439 0.561]
 [0.439 0.561]
 [0.439 0.561]
 ...
 [0.439 0.561]
 [0.439 0.561]
 [0.439 0.561]]
Iteration: 2, predictions: [[0.4113 0.5887]
 [0.4113 0.5887]
 [0.4113 0.5887]
 ...
 [0.4113 0.5887]
 [0.4113 0.5887]
 [0.4113 0.5887]]
Iteration: 3, predictions: [[0.384 0.616]
 [0.384 0.616]
 [0.384 0.616]
 ...
 [0.384 0.616]
 [0.384 0.616]
 [0.384 0.616]]
Iteration: 4, predictions: [[0.359 0.641]
 [0.359 0.641]
 [0.359 0.641]
 ...
 [0.359 0.641]
 [0.359 0.641]
 [0.359 0.641]]

Next, you can calculate the metric and determine which iteration is the best for your metric.
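For example, here is a sketch of how one might pick the best iteration for a custom metric, with scikit-learn's f1_score standing in for a metric CatBoost doesn't know about (any function of the true labels and predictions would slot in the same way):

from sklearn.metrics import f1_score

# Evaluate the custom metric at every iteration and remember the best one
best_iter, best_score = 0, -1.0
staged = model.staged_predict_proba(X_test, ntree_start=0, eval_period=1)
for iteration, proba in enumerate(staged):
    score = f1_score(y_test, (proba[:, 1] > 0.5).astype(int))
    if score > best_score:
        best_iter, best_score = iteration, score

print(f"Best iteration for F1: {best_iter}, F1 = {best_score:.4f}")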

This concludes the second part of the article on Gradient Boosting with CatBoost. In the last part, I’ll touch on MultiClassification, Metric Evaluation, Eval Metrics, and Parameter Tuning.
