The price of model quality

Description of the problem

When building any machine learning model, the question of the optimal price-quality ratio always arises. On the one hand, data scientists try to build the best-performing model possible; on the other hand, the budget allocated for building it is always limited. Some data sources may be paid, some require a complicated procedure for collecting relevant information, and the time a modeler can spend on a particular model is also limited, because experiments with different features, samples and parameters can be carried out almost endlessly. As a result, models are used in production that could be significantly improved with additional resources, but those costs are often hard to justify, in particular because model quality metrics can be very difficult to translate into specific business indicators expressed in money. In this article I want to propose an approach that relates model quality metrics to financial benefit, using one class of models as an example: probability-of-default models. Essentially the same ideas can be applied to any classification model.

So, let's assume a fairly standard situation: we have a flow of clients with a certain initial default rate (DR) and a default probability model with a certain Gini coefficient; we have ranked all clients by this model and are ready to reject some share of the clients with the worst scores. The question is: what will the default rate be in the remaining sample after we have eliminated the worst performers? This question does not arise out of nowhere; it corresponds to standard decision-making practice. We have some initial flow of customers (with some default rate) and a model that ranks them in some way, and we are willing to sacrifice some of the customers (and the profit they bring) in order to get a lower default rate.

For example, suppose we want to issue car loans to 100,000 potential clients, 1 million rubles each. Let's say we are ready to cut off 10% of the flow, i.e. we will issue loans to 9 out of 10 applicants. Then, with DR = 5%, we will lose 5% * 100,000 * 1,000,000 = 5 billion rubles to defaults, but if we manage to reduce the DR to 3%, the losses will be only 3% * 100,000 * 1,000,000 = 3 billion rubles, i.e. the net gain from such a reduction in DR is 2 billion rubles. So if a potential improvement of the model can provide such a reduction in the default rate, it will bring a net profit of 2 billion rubles in our case. This alone may be a sufficient argument to hire a separate data scientist (or an entire team) to work on the task. It is important to understand that we usually cannot influence the initial default level (the quality of the flow) or the share of clients we reject: these decisions are made by the business based on the desired profit, the target market share, and so on. The only thing we, as data scientists, can influence is the quality of the model, which is generally described by its Gini coefficient (provided that the model is stable out-of-time and this indicator makes sense at all).
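The same arithmetic, as a quick sanity check in code (the numbers are the ones from the example above):

n_clients = 100_000        # potential borrowers
loan_amount = 1_000_000    # rubles per loan
for dr in (0.05, 0.03):
    losses = dr * n_clients * loan_amount
    print(f"DR = {dr:.0%}: expected default losses = {losses / 1e9:.1f} bn rubles")
# the difference, 2 bn rubles, is the gain from reducing DR from 5% to 3%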

Thus, a model with higher ranking ability will produce a lower default rate, which can lead to a significant reduction in losses. Our goal is to find a reasonably accurate formula relating the Gini coefficient of the model to the resulting default rate, so that we can show the direct financial effect of improving model quality.

Formal statement of the problem

Note that although classification models formally predict a probability (a number from 0 to 1), this probability often comes from a score (a value from minus infinity to plus infinity), which is converted into a probability by the logistic transformation; accordingly, the score can be recovered from the probability by the inverse logistic transformation. In many situations the score is distributed approximately normally, and if it is not, normality can be achieved by a monotonic transformation of the score (in our setting only the order in which the score ranks clients matters, not the score values themselves). It is fairly standard to assume that, in addition to the score produced by the model, there is a score unknown to us, and together they form an "ideal" score that completely separates "good" clients from "bad" ones (i.e., every "bad" client has a lower "ideal" score than every "good" one).
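For reference, assuming the usual logistic link, the score s and the probability P are related by

P = \frac{1}{1 + e^{-s}}, \qquad s = \ln\frac{P}{1-P}.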

More formally, let's consider the following model.

Let x and \varepsilon be independent random variables distributed as N(0, 1): x is the score produced by the machine learning model, and \varepsilon is some unaccounted-for information that nevertheless affects the outcome.

z = \alpha x + \sqrt{1-\alpha^2}\,\varepsilon, where \alpha is some constant (note that then z \sim N(0,1) as well), and z is the latent variable that determines the outcome.

Let us also fix some threshold t, and when z < t we observe an event (borrower default, illness, structural failure, etc.). The probability of this event is p = P(z < t) = \Phi(t), where \Phi is the standard normal CDF.

Simulation shows that p and the threshold have almost no influence on the Gini coefficient, while Gini turns out to be almost exactly equal to the coefficient \alpha.
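A minimal sketch of how such a check can be run (the values of p and \alpha below are illustrative; roc_auc_score computes the AUC of the score x against the event z < t):

import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(size=n)   # model score
e = rng.normal(size=n)   # unaccounted-for information
for p in (0.1, 0.5, 0.7):
    t = norm.ppf(p)      # threshold such that P(z < t) = p
    for alpha in (0.2, 0.5, 0.8):
        z = alpha * x + np.sqrt(1 - alpha**2) * e
        gini = 2 * roc_auc_score(z < t, -x) - 1
        print(f"p={p}, alpha={alpha}, Gini={gini:.3f}")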

Let's try to find some explanation for this using mathematical reasoning.

According to the probabilistic interpretation of AUC, in our case

AUC = P(x_1 > x_2 | z_1 > t > z_2),

where (x_1, z_1) and (x_2, z_2) are two independent realizations of (x, z) defined above (with the same parameters \alpha, t, p). Let us assume that x_1 - x_2, conditional on z_1 > t and z_2 < t, is an approximately normally distributed random variable. Then we only need to calculate its mean and variance to obtain the corresponding probability.

We can see that
(1-p)E(x|\alpha x + \sqrt{1-\alpha^2}\varepsilon > t) = \int dx\, d\varepsilon\, I(z > t)\, x\, \frac{\exp(-x^2/2-\varepsilon^2/2)}{2\pi} =

= \int \frac{dx\, dz}{\sqrt{1-\alpha^2}}\, I(z > t)\, x\, \frac{\exp\Big(-\frac{x^2+z^2-2xz\alpha}{2(1-\alpha^2)}\Big)}{2\pi} \approx \int dx\, dz\, I(z > t)\, (x + x^2 z \alpha)\, \frac{\exp(-x^2/2-z^2/2)}{2\pi} = \alpha \frac{e^{-t^2/2}}{\sqrt{2\pi}},

since for sufficiently small \alpha we have 1-\alpha^2\approx 1 and \exp(xz \alpha)\approx 1 + xz \alpha.

in the same way

(1-p)E(x^2|\alpha x + \sqrt{1-\alpha^2}\varepsilon > t) \approx \int dx\, dz\, I(z > t)\, \frac{\exp(-x^2/2-z^2/2)}{2\pi}\, (x^2 + x^3 z \alpha) = 1-p,

so that E(x^2|z > t) \approx 1 and

D(x|\alpha x + \sqrt{1-\alpha^2}\varepsilon > t) \approx 1,

because the squared conditional mean is of order \alpha^2 and can be neglected. The symmetric calculation for the other class gives p\,E(x|\alpha x + \sqrt{1-\alpha^2}\varepsilon < t) \approx -\alpha \frac{e^{-t^2/2}}{\sqrt{2\pi}} and D(x|\alpha x + \sqrt{1-\alpha^2}\varepsilon < t) \approx 1,

hence

E(x_1 - x_2|z_1 > t > z_2) \approx \frac{\alpha}{p(1-p)} \frac{e^{-t^2/2}}{\sqrt{2\pi}}, \qquad D(x_1 - x_2|z_1 > t > z_2) \approx 1 + 1 = 2.

Then

P(x_1 > x_2 | z_1 > t > z_2) \approx \Phi \Bigg( \frac{\alpha}{p(1-p)\sqrt{2}} \frac{e^{-t^2/2}}{\sqrt{2\pi}} \Bigg) \approx \frac12 + \frac{\alpha\, e^{-t^2/2}}{2\sqrt{2}\,\pi\, p(1-p)},

since for sufficiently small x, \Phi(x) \approx \frac12 + \frac1{\sqrt{2\pi}}x
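Combining the last two approximations, we obtain

Gini = 2\,AUC - 1 \approx \frac{\alpha\, e^{-t^2/2}}{\sqrt{2}\,\pi\, p(1-p)}.

For moderate values of p the right-hand side is numerically close to \alpha itself, which agrees with the earlier observation that Gini is almost exactly equal to \alpha.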

Now let p_x denote the share of clients we discard: we reject the clients with the lowest scores, i.e. we keep only those with x above the threshold t_x = \Phi^{-1}(p_x). Note also that in the simulation below we take p_x = p, i.e. we discard the same proportion of clients as the proportion of "bad" clients in the flow. In this case, with Gini = 100%, we would discard exactly all the "bad" clients and the default rate would drop to 0.
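The helper DR_new(gini, p, p_x) used in the code below returns the expected default rate among the kept clients for a flow with base default rate p, a discarded share p_x, and a model with the given Gini. Its exact definition is not shown here; the sketch below is one possible implementation consistent with the model above, under the assumption that the Gini coefficient can be plugged in directly as the correlation \alpha (as the observation above suggests). It reproduces the values printed further down up to small numerical differences, and it must be defined before running the simulation code.

from scipy.stats import norm, multivariate_normal

def DR_new(gini, p, p_x):
    # Default rate among the kept clients, P(z < t | x > t_x), for the
    # latent-variable model above, with corr(x, z) taken to be the Gini
    # coefficient (an assumption based on the Gini ~ alpha observation).
    t = norm.ppf(p)      # default threshold: P(z < t) = p
    t_x = norm.ppf(p_x)  # score cut-off: the worst p_x share has x < t_x
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, gini], [gini, 1.0]])
    p_both_low = bvn.cdf([t_x, t])         # P(x < t_x, z < t)
    return (p - p_both_low) / (1 - p_x)    # = P(z < t, x > t_x) / P(x > t_x)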

import numpy as np
from scipy.stats import norm
import sklearn
from sklearn import metrics
import matplotlib.pyplot as plt
x = np.random.normal(size = 1000000)
e = np.random.normal(size = 1000000)

alpha_array = (np.arange(20) + 0.999) / 20
p_list = [0.1, 0.5, 0.7]
gini_list = []
DR_real_list = []
DR_calculated_list = []
for p in p_list:
    t = norm.ppf(p)  # threshold such that P(z < t) = p
    gini_array = alpha_array * 0
    DR_real_array = alpha_array * 0
    DR_calculated_array = alpha_array * 0
    for i, alpha in enumerate(alpha_array):
        z = alpha * x + np.sqrt(1 - alpha * alpha) * e  # latent variable
        # Gini of the score x for predicting the event z < t
        gini_array[i] = 2 * sklearn.metrics.roc_auc_score(z < t, -x) - 1
        # observed default rate among kept clients (here p_x = p, cut-off at t)
        DR_real_array[i] = np.mean(z[x > t] < t)
        # default rate predicted by the formula
        DR_calculated_array[i] = DR_new(gini_array[i], p, p)
    gini_list.append(gini_array)
    DR_real_list.append(DR_real_array)
    DR_calculated_list.append(DR_calculated_array)

plt.rcParams['figure.figsize'] = [10, 10]
for i, gini_array in enumerate(gini_list):
    plt.plot(gini_array, DR_real_list[i], label="DR real p = " + str(p_list[i]))
    plt.plot(gini_array, DR_calculated_list[i], label="DR calculated p = " + str(p_list[i]))
plt.legend(loc="upper left")
plt.xlabel('Gini')
plt.ylabel('DR')

As we can see, there is some discrepancy between the formula and the values obtained from the simulation, but overall it is not significant.

Let's work through a practical example: suppose we need to issue 100,000 car loans, 1 million rubles each, i.e. 100 billion rubles in total. Assume the initial DR for this flow of clients is 10%, and we are ready to weed out the 10% worst clients (so in an ideal world we would have no defaults at all). Suppose we have a model that ranks clients with Gini = 30%. In this case, direct losses from defaults will be 8.71 billion rubles. Now suppose we have a plan that would let us increase the Gini by 1, 5, 10 or 50 percentage points; what financial result would that bring? According to the calculations above, the reduction in default losses would amount to savings of 52 million, 270 million, 559 million and 3.846 billion rubles, respectively.

print(DR_new(0.30, 0.1, 0.1))
print(DR_new(0.30, 0.1, 0.1)-DR_new(0.30 + 0.01, 0.1, 0.1))
print(DR_new(0.30, 0.1, 0.1)-DR_new(0.30 + 0.05, 0.1, 0.1))
print(DR_new(0.30, 0.1, 0.1)-DR_new(0.30 + 0.10, 0.1, 0.1))
print(DR_new(0.30, 0.1, 0.1)-DR_new(0.30 + 0.50, 0.1, 0.1))

0.0871097134754345
0.0005272288318745877
0.0027060299128032483
0.005594040703079159
0.03845790575519499

Verification on a real dataset

Let's now take a real dataset and check that the patterns identified above hold when models are developed on real data. We will use the "default of credit card clients" dataset (credit card customers from Taiwan). Since it is not easy to obtain a model with a predetermined Gini, we will instead vary model quality by restricting the feature set and build three models: on 4, 5, and all 23 available features.

from ucimlrepo import fetch_ucirepo 
  
default_of_credit_card_clients = fetch_ucirepo(id=350) 
  
X = default_of_credit_card_clients.data.features 
y = default_of_credit_card_clients.data.targets 

from catboost import CatBoostClassifier
model = CatBoostClassifier(verbose = 0)
DR_initial = np.mean(y.values)
number_of_features = [4, 5, 23]
p_x_array = (np.arange(49) + 1)/100
DR_real_list = []
DR_calculated_list = []
gini_list = []
for n in number_of_features:
    model.fit(X[X.columns[:n]], y)  # model on the first n features
    pred = model.predict_proba(X[X.columns[:n]])[:, 1]
    gini = 2 * sklearn.metrics.roc_auc_score(y, pred) - 1
    gini_list.append(gini)
    DR_real_array = p_x_array * 0
    DR_calculated_array = p_x_array * 0
    for i, p_x in enumerate(p_x_array):
        # cut off the p_x share of clients with the highest predicted default probability
        t_pred = np.sort(pred)[-int(pred.shape[0] * p_x)]
        DR_real_array[i] = np.mean(y.values[pred < t_pred])     # observed DR among kept clients
        DR_calculated_array[i] = DR_new(gini, DR_initial, p_x)  # DR predicted by the formula
    DR_real_list.append(DR_real_array)
    DR_calculated_list.append(DR_calculated_array)

plt.rcParams['figure.figsize'] = [10, 10]
for i, gini_array in enumerate(gini_list):
    plt.plot(p_x_array, DR_real_list[i], label="DR real gini = " + str(gini_list[i]))
    plt.plot(p_x_array, DR_calculated_list[i], label="DR calculated gini = " + str(gini_list[i]))
plt.legend(loc="upper left")
plt.xlabel('$p_x$')
plt.ylabel('DR')

Although there is some discrepancy between the real default rate and the default rate calculated with our formula, the curves are generally close to each other, and the formula can be used as an approximation.

Conclusions

From all of the above we can conclude that the resulting formula describes the default rate at various Gini values quite accurately, so we can estimate with sufficient accuracy the direct financial benefit of improving this metric. This argument can be used to justify a larger modeling budget and a more careful, thoughtful approach to model development.
