that is the question
The Gini coefficient (or Gini index), the Lorenz curve, TPR (true positive rate), and FPR (false positive rate) are among the most popular metrics in economic problems solved with machine learning. All of them are used to assess model quality and are, one way or another, related to each other. Let's recall how they are calculated.
Suppose you need to predict the creditworthiness of a borrower. A trustworthy borrower belongs to class 1, an unreliable one to class 0. There are then four possible forecasting outcomes:
1) True Positives – a trustworthy borrower is correctly predicted as trustworthy;
2) False Positives – an unreliable borrower is incorrectly predicted as trustworthy;
3) True Negatives – an unreliable borrower is correctly predicted as unreliable;
4) False Negatives – a trustworthy borrower is incorrectly predicted as unreliable.
For our task, TPR = True Positives / (True Positives + False Negatives) reflects the bank's profitability, and FPR = False Positives / (False Positives + True Negatives) reflects the bank's losses. Improving one indicator worsens the other, so a response threshold is introduced: predicted values above it are assigned to class 1, and values below it to class 0.
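As a minimal sketch (the labels and predictions below are made up for illustration, not taken from the article's dataset), the four outcome counts and the two rates can be computed directly:

```python
# Toy ground-truth labels (1 = trustworthy, 0 = unreliable) and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # trustworthy, caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # unreliable, approved
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # unreliable, declined
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # trustworthy, missed

tpr = tp / (tp + fn)  # "profitability": share of trustworthy borrowers found
fpr = fp / (fp + tn)  # "losses": share of unreliable borrowers approved
print(tpr, fpr)  # 0.75 0.25
```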
But for a business it is not enough to calculate these indicators: decisions must be mathematically and statistically justified. Therefore the analysis includes the ROC curve, the area under the curve (AUC), and the Gini coefficient. A curve of the sorted predicted target values is plotted (Fig. 1), and the area of the figure under this curve – the AUC – is calculated; it characterizes the quality of the algorithm. The Gini coefficient, Gini = 2 * AUC - 1, is then used as the final metric for decision making.
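The AUC-to-Gini step can be illustrated without any libraries by using the rank (Mann-Whitney) definition of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative. The scores below are invented for illustration:

```python
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # model's predicted probability of class 1

pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

# AUC = share of (positive, negative) pairs ranked correctly; ties count as 0.5
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
gini = 2 * auc - 1
print(round(auc, 4), round(gini, 4))  # 0.8889 0.7778
```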
This indicator is simple to calculate and easy to interpret, which is why it is popular and widely used in bank scoring models. But is a single metric enough, and can you “rely” on Gini in managerial matters? Let’s figure it out.
As the economy develops, lending grows, and with it the level of debt. Credit risk needs to be managed, which means the task of improving the borrower rating model arises.
As an example, let’s take a dataset with observations of the quantitative and qualitative characteristics of borrowers throughout the economic cycle and beyond, for which a default flag has been assigned. The table below shows an example of the labeled data.
| TIN | Observation ID | Date | F1 | F2 | F3 | F4 | … | F17 | default | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 1234567890 | 1 | 20170930 | 100 000 | 89 | 0.4 | 530 | … | A | 0 | train |
| 0987654321 | 2 | 20180930 | 259 058 | 56 | 0.78 | 942 | … | C | 1 | train |
| 7418529631 | 3 | 20151231 | 1 000 680 | 77 | 0.63 | 5022 | … | F | 1 | test |
| 1472583699 | 4 | 20200531 | 68 012 | 69 | 0.7 | 135 | … | A | 0 | test |
The qualitative indicators need to be transformed. Many machine learning models work only with numerical factors and cannot handle others. However, in business, important indicators are not always numerical, so various variable-encoding schemes are used. In this problem, the WOE (weight of evidence) transformation was applied. This approach expresses the importance of a feature value as a number (the WOE weight) that can be included in the set of factors for training a forecasting model. It is important that the indicator values are ranked: A is the best value, B is good, C is satisfactory, and so on. The WOE weight is calculated as the natural logarithm of the ratio of the proportion of good observations to the proportion of bad observations. I am using the following function:
import numpy as np
import pandas as pd

# df – the dataset; split_column – the column that splits the sample into
# train and test; split – 'train' or 'test'; factor – the factor column.
def woe(df, split_column, split, factor):
    good = df[(df[split_column] == split) & (df['default'] == 0)][factor]
    bad = df[(df[split_column] == split) & (df['default'] == 1)][factor]
    woe = pd.concat([np.log((good.value_counts() / good.count())
                            / (bad.value_counts() / bad.count()))
                     .rename('WOE train')], axis=1)
    return woe
# create the transformed column for the first factor
woe_F13 = {'A': woe(df, 'split', 'train', 'F13')['WOE train']['A'],
           'B': woe(df, 'split', 'train', 'F13')['WOE train']['B'],
           'C': woe(df, 'split', 'train', 'F13')['WOE train']['C'],
           'D': woe(df, 'split', 'train', 'F13')['WOE train']['D']}
df['F13_woe'] = df['F13'].replace(woe_F13)
By analogy, I will transform the remaining factors (F14 – F17).
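The per-factor transformation can also be built in a loop. A minimal self-contained sketch, with a made-up mini-dataset whose column names mirror the article's layout:

```python
import numpy as np
import pandas as pd

# Made-up mini-dataset in the article's layout (all rows in the train split).
df = pd.DataFrame({
    'F13':     ['A', 'A', 'B', 'B', 'A', 'B'],
    'F14':     ['C', 'D', 'C', 'D', 'C', 'D'],
    'default': [0, 1, 1, 0, 0, 1],
    'split':   ['train'] * 6,
})

train = df[df['split'] == 'train']
for factor in ['F13', 'F14']:
    # share of each category among good (default == 0) and bad (default == 1) borrowers
    good = train.loc[train['default'] == 0, factor].value_counts(normalize=True)
    bad = train.loc[train['default'] == 1, factor].value_counts(normalize=True)
    df[factor + '_woe'] = df[factor].map(np.log(good / bad).to_dict())
```

Note that a category that never occurs among good (or bad) borrowers in the train split would get an undefined (NaN) weight; in practice such categories are merged or smoothed.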
For forecasting I use a logistic model.
For convenience, I will put the factors into a separate list.
feat_cols = ['F1', 'F2', 'F3', 'F4', 'F5', 'F8', 'F9', 'F13_woe', 'F14_woe', 'F15_woe', 'F17_woe']
# split the dataset into train and test samples
df_train = df[df['split'] == 'train'][['ИНН', 'ID наблюдения', 'default', 'split'] + feat_cols].copy()
df_test = df[df['split'] == 'test'][['ИНН', 'ID наблюдения', 'default', 'split'] + feat_cols].copy()
df_full = pd.concat([df_train, df_test], axis=0).reset_index()
# import the libraries
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
# fit the model
log_reg = sm.Logit(df_train['default'], df_train[feat_cols]).fit()
And now let's calculate the train/test Gini of the predictive model.
from sklearn.metrics import roc_auc_score  # import the ROC AUC metric
print('Gini Train:', 2 * roc_auc_score(df_full[df_full['split'] == 'train']['default'], log_reg.predict(df_full[df_full['split'] == 'train'][feat_cols])) - 1)
print('Gini Test :', 2 * roc_auc_score(df_full[df_full['split'] == 'test']['default'], log_reg.predict(df_full[df_full['split'] == 'test'][feat_cols])) - 1)
Based on the results obtained, it was decided to trust this model. However, during its analysis it was proposed to consider adding a new factor, F18. This indicator is qualitative, so it must be converted with the woe function. We retrained the model with the new set of predictors and recalculated the Gini.
The results show that on the training set the model performs better with the additional factor, while on the test set it performs better without it. Since the decision is made on the basis of the larger test Gini, the additional factor would not be added to the model.
The choice in favor of the model without the new factor is rather contentious, so we calculate an additional metric – the mean absolute error (MAE). This indicator is the average of the absolute differences between the actual and predicted values and does not contradict the logic of the task. To do this, we import the necessary library and calculate the error for the models with and without the additional factor.
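As a quick sanity check of the definition on made-up numbers (not the article's data):

```python
# MAE = mean of |actual - predicted|, computed by hand.
actual = [0, 1, 1, 0]
predicted = [0.1, 0.7, 0.4, 0.2]
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(round(mae, 4))  # 0.3
```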
from sklearn.metrics import mean_absolute_error
# MAE for the model without the new factor
print('MAE Train:', mean_absolute_error(df_full[df_full['split'] == 'train']['default'], log_reg.predict(df_full[df_full['split'] == 'train'][feat_cols])))
print('MAE Test:', mean_absolute_error(df_full[df_full['split'] == 'test']['default'], log_reg.predict(df_full[df_full['split'] == 'test'][feat_cols])))
# MAE for the model with the new factor; log_reg_f18 denotes the model
# retrained on feat_cols + ['F18_woe'] (names assumed for illustration)
print('MAE Train of the model with factor F18:', mean_absolute_error(df_full[df_full['split'] == 'train']['default'], log_reg_f18.predict(df_full[df_full['split'] == 'train'][feat_cols + ['F18_woe']])))
print('MAE Test of the model with factor F18:', mean_absolute_error(df_full[df_full['split'] == 'test']['default'], log_reg_f18.predict(df_full[df_full['split'] == 'test'][feat_cols + ['F18_woe']])))
The mean absolute error is interpreted as “the smaller, the better”. The results show that the model with the additional factor predicts with a smaller error.
Let’s compare all the results of the metrics.
| Indicator | Split | Model without add. factor (1) | Model with add. factor (2) | Model with the best result |
|---|---|---|---|---|
| Gini | Train | 0.5119 | 0.5137 | 2 |
| Gini | Test | 0.6297 | 0.6134 | 1 |
| MAE | Train | 0.3911 | 0.3882 | 2 |
| MAE | Test | 0.3953 | 0.3948 | 2 |
It follows from the table that including the new factor F18 increases the predictive power of the model. However, this conclusion only became available after calculating an additional metric, which suggests that the Gini coefficient alone is not enough to assess model quality. More experiments are needed to confirm this hypothesis, and this article is the beginning of them.