How do neural networks issue loans?

It is no secret that credit scoring is a common practice for assessing borrowers. It is what stops an ordinary worker earning 40 thousand a month from taking out five mortgages, so that the country does not turn into one big "The Big Short"…

And it is equally no secret that in the modern world a credit card limit is set not by a bank employee but by a neural network, or simply by a machine learning algorithm.

But consumer loans and credits are not the only forms of obligation: large companies, too, can end up "bankrupt" or "in default".

Risk assessment lets us link, for example, the reliability of a company's bonds (essentially an IOU, a promise to return money) to an understanding of how likely the company is to succeed.

Maybe tomorrow the local LLC "Pivtorgnom" will go bankrupt, and only 1 million rubles will be recoverable out of the 2 million it borrowed? This kind of company scoring is called a failure score (financial risk index, FRI).

Companies are also scored for fraudulent schemes – a fraud score (due diligence index, DDI) – the kind of system that would flag a company like Walter White's car wash…

In this article we will focus primarily on credit scoring and find out why so-called "young" adults were denied even minimal loans…

Traditional Scoring Systems: Linear Regression

Until recently, most credit risk assessment models were based on the analysis of past payment patterns using statistical algorithms: linear and logistic regression.

These traditional credit scoring models worked by finding weights for static factors that affect a borrower's creditworthiness.

Linear regression simply models the relationship between a dependent variable (in this case, the probability of default) and one or more independent variables (e.g. income level, age, credit history).

The model is constructed by minimizing the sum of squared errors between the predicted and actual values of the dependent variable.
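A minimal sketch of that fitting step, on made-up numbers (the feature values and labels below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [income in thousands, age]; target 1 = default, 0 = repaid
X_toy = np.array([[30, 25], [60, 45], [50, 35], [70, 50], [40, 29]])
y_toy = np.array([0, 1, 0, 1, 0])

lin = LinearRegression()
lin.fit(X_toy, y_toy)  # finds the weights that minimize the sum of squared errors

print(lin.coef_, lin.intercept_)
# Note: the prediction is unbounded -- it can fall outside [0, 1],
# one reason plain linear regression fits default prediction poorly
print(lin.predict(np.array([[45, 30]])))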

It follows that a hypothetical father of a family with an income of 50 thousand rubles was denied a credit limit of 500 thousand rubles for obvious reasons. But banks strive to make money on borrowers: they need more clients and fewer defaults. This contradiction forces banks to constantly improve their scoring systems.

If a young man of 19, sufficiently responsible and conscientious, lives with his parents and takes out a loan twice his salary, he is quite capable of paying off the debt.

On the other hand, a man with a prestigious salary may be unable to make even the minimum payment because of an advanced gambling addiction. In real life, "conscientiousness" is not determined solely by income level or age.

The financial institution sets the parameters for assessing a borrower: age, marital status, place of residence, employment, property ownership, credit history requirements, income level, as well as the results of court proceedings, tax reporting… And often that is not all of the additional data.

Banking software analyzes far more diverse indicators: in addition to solvency, it predicts the borrower's behavior and evaluates their financial discipline, family situation (the presence of children, for example), and career prospects – everything we talked about above.

Despite its simplicity and interpretability, linear regression has limited applicability in the context of credit scoring, as it often fails to adequately capture complex, nonlinear relationships in the data, which will be discussed below.

Logistic regression, unlike linear regression, is used for binary classification – it calculates the probability of an event, default, occurring.

Logistic regression models the logarithm of the odds of an event occurring as a linear combination of independent variables.

And the independent variables are the easily determined ones: age, gender, income level.

This method, unlike the least squares method used in linear regression, is better suited for problems where the dependent variable is categorical (as in our case).

In the context of credit scoring, the dependent variable is a binary indicator of whether the borrower defaulted on the loan.
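In other words, the model computes a linear score and passes it through the sigmoid to get a probability. A small illustration with made-up weights (the coefficients below are assumptions, not fitted values):

import numpy as np

# Hypothetical coefficients: intercept plus weights for age, income (thousands)
# and credit history -- invented purely for illustration
b = -1.5
w = np.array([-0.02, -0.03, 1.2])

x = np.array([30, 50, 1])  # a 30-year-old, 50k income, with past credit debts
log_odds = b + w @ x                     # the linear combination = log-odds of default
p_default = 1 / (1 + np.exp(-log_odds))  # the sigmoid maps log-odds to a probability
print(round(p_default, 3))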

Let's say we have a table that contains the following columns:

  • Age: Client's age.

  • Income: The client's annual income.

  • Credit History (Credit_History): Information about whether the client has previous credit debts (0 – no, 1 – yes).

  • Overdraft: Information about whether the customer uses an overdraft (0 – no, 1 – yes).

  • Creditworthy: The target variable indicating whether the customer will repay the loan on time (0 = no, 1 = yes).

The last column is the target; it is from this label that the model learns to estimate the probability of default.

Age | Income | Credit history | Overdraft | Creditworthy
----|--------|----------------|-----------|--------------
25  | 30000  | 0              | 0         | 1
45  | 60000  | 1              | 1         | 0
35  | 50000  | 0              | 0         | 1
50  | 70000  | 1              | 0         | 0
29  | 40000  | 0              | 1         | 1

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Create the DataFrame (the toy dataset from the table above)
data = {
    'Age': [25, 45, 35, 50, 29],
    'Income': [30000, 60000, 50000, 70000, 40000],
    'Credit_History': [0, 1, 0, 1, 0],
    'Overdraft': [0, 1, 0, 0, 1],
    'Creditworthy': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Define the independent variables (features) and the dependent variable (target)
X = df[['Age', 'Income', 'Credit_History', 'Overdraft']]
y = df['Creditworthy']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Print the model coefficients
print("Model coefficients:")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Visualize the coefficients
features = X.columns
coefficients = model.coef_.flatten()

plt.figure(figsize=(10, 6))
plt.bar(features, coefficients)
plt.xlabel('Factors')
plt.ylabel('Coefficients')
plt.title('Logistic regression model coefficients')
plt.show()

# Plot the dependence of the predicted probability on age
# Build synthetic inputs for the prediction
age_range = np.linspace(20, 60, 100)
X_age = np.zeros((100, X.shape[1]))  # Initialize an array of zeros
X_age[:, 0] = age_range  # Set the age
X_age[:, 1] = np.mean(df['Income'])  # Use the mean income
X_age[:, 2] = np.mean(df['Credit_History'])  # Use the mean credit history value
X_age[:, 3] = np.mean(df['Overdraft'])  # Use the mean overdraft value

# Predict the probability
probabilities = model.predict_proba(X_age)[:, 1]

# Plot the curve
plt.figure(figsize=(10, 6))
plt.plot(age_range, probabilities, color="blue", marker="o")
plt.xlabel('Age')
plt.ylabel('Probability of repaying the loan on time')
plt.title('Predicted probability as a function of age')
plt.grid(True)
plt.show()


The resulting coefficients let us understand the impact of each factor on the predicted probability of repayment, which is critical for making lending decisions.

From the standpoint of worldly wisdom, this is a primitive approach. We give an extremely simplified example; in reality, the data banks hold still needs cleaning – it often has many gaps, and the dataset is biased.

It is important to note that logistic regression assumes the absence of multicollinearity among predictors, as high correlations between independent variables can lead to unstable coefficient estimates.
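One common diagnostic, sketched here with statsmodels under the assumption that the toy feature matrix X from above is available: variance inflation factors flag predictors that are too correlated with each other.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Reusing the feature matrix X from the toy example above;
# a VIF above ~5-10 is a common rule-of-thumb signal of multicollinearity
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif)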

And in general, as we know, high dimensionality (an impressive number of features) brings its own learning problems – the curse of dimensionality.

This is a problem for banks from the standpoint of discovering interdependencies and doing research. And, as we know, a bank wants as many patterns as possible, so that it can lend even to clients who look insolvent "at first glance".

Estimating a borrower's probability of default is quite simple – each model coefficient can be interpreted as the change in the log-odds of the event – but even that is not enough to recruit many conscientious credit clients.
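Concretely, exponentiating the fitted coefficients turns them into odds ratios – a short sketch continuing the toy model above:

import numpy as np

# Continuing the toy model above: each coefficient is the change in log-odds
# per one-unit increase of a feature; exp() turns it into an odds ratio
for feature, coef in zip(X.columns, model.coef_.flatten()):
    print(f"{feature}: odds multiplied by {np.exp(coef):.2f} per unit increase")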

Moreover, problems arise when adding new data.

Of course, banks often use not only pure logistic regression, but also the proverbial gradient boosting as part of an ensemble with neural networks.

Decision trees or random forests can also be a perfectly adequate choice for classical scoring. Even at a reasonable tree depth, the model remains relatively easy to interpret – you can peek at the splits.
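A quick sketch of that transparency, reusing the toy dataset from earlier: sklearn's export_text prints the learned splits as plain rules anyone can read.

from sklearn.tree import DecisionTreeClassifier, export_text

# Reusing X and y from the toy example above; a shallow tree stays readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# export_text prints the learned splits as plain if/else rules
print(export_text(tree, feature_names=list(X.columns)))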

Conventionally, funds cycled through some casino "intermediary" are a pass-through money scheme that says nothing about the real owner of the account. This is where additional scoring comes in: it identifies the "money mules" and blocks accounts under Federal Law 115 (Russia's anti-money-laundering law). In this sense, the scoring system is far more branched: it has to separate fraudsters, insolvent clients, "drops", and so on.

Although logistic regression almost always remains the final link in a bank’s unified scoring model, it works as a final predictive model, not as a starting one.

Logistic regression also requires sample balancing, especially when the number of defaults is significantly smaller than the number of successful loans.

And that is a problem… A bank tries to issue loans to solvent, responsible clients, so in banking practice the number of defaults is far lower than the number of positive cases.

The peculiarity of the banking system produces an inherently unbalanced dataset. This, by the way, is why banks like to run "experiments" and issue loans to dubious borrowers, as if to tease out additional patterns.
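One standard mitigation, as a minimal sketch on the toy training split from above: reweight the classes so the rare defaults carry as much weight in the loss as the plentiful good loans.

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights samples inversely to class frequency,
# so the rare defaults are not drowned out by the majority class
balanced_model = LogisticRegression(class_weight='balanced')
balanced_model.fit(X_train, y_train)

# An alternative is oversampling the minority class,
# e.g. with SMOTE from the imbalanced-learn package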

Still, traditional scoring was used for many years simply because there was nowhere to get information about a user's behavior: we have their passport data, transactions, card-to-card transfers, a short list of property… What else?

Nothing. With the advent of the Internet came a ton of open data about a person, and banks gained access to advanced scoring. But… how do you interpret that much complex, unstructured data?

Now the problem becomes defining the "predictors" – the very features that will serve as a yardstick and show the borrower's true face. And this concerns not only the interpretation of raw data, but of sequential data as well.

Moreover, the transactions themselves hold an impressive number of patterns that even the trained eye of a bank employee cannot spot. This is where other models should come in – neural networks that can find patterns in sequential data.

In total: banks classically estimate default probabilities with logistic regressions, decision trees, or simply ensembles built on the same gradient boosting. But such methods usually support only ready-made, static, structured tabular data. No transaction streams, no raw credit histories.

Because of this, behavioral "factors" drift out of sight, and we sometimes find "irresponsible" rich people on the lists of approved borrowers…

With the advent of smartphones and the availability of the Internet, new types of data have become available.

Now data can be semi-structured, unstructured and truly Big, but it can provide detailed insights into the customer and their creditworthiness.

We are talking about behavioral scoring – it evaluates the client's creditworthiness based on their current behavior and use of credit products.

Although some banks loudly advertise a dedicated department that takes care of "clients" with overdue payments, in reality the secondary – or even primary – assessment is carried out by machine learning models.

The boosting model on tabular data takes into account static characteristics of customers: demographic data, income level, duration of employment and other factors that do not depend on time.
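For reference, a minimal sketch of such a tabular boosting model on the toy features from earlier; sklearn's GradientBoostingClassifier stands in here for whatever production library a bank actually uses (e.g. XGBoost, LightGBM, CatBoost) – that choice is an assumption, not something the article prescribes.

from sklearn.ensemble import GradientBoostingClassifier

# Reusing the toy tabular split (X_train, y_train, X_test) from earlier
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print(gbm.predict_proba(X_test)[:, 1])  # predicted probability of repayment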

And behavior is, by definition, a temporal concept.

And the flagship architecture for sequential, time-ordered data remains the recurrent one. So over time, oddly enough, banks began to integrate neural networks.

Recurrent networks on sequential borrower data…

We are talking about a model based on card transactions, a model based on current account transactions, and a model based on credit history data.

In addition to these sequential models, a boosting model is built on tabular data. They almost always work in tandem. There is an article from Alfa-Bank about a unified neural network for credit scoring, where the final link in the ensemble chain is a logistic regression that works with the outputs of the other neural networks.

So a unified network is built on the principle of an ensemble of decision trees and RNNs. The bank's task is to cover the largest amount of data and make decisions from the largest number of factors: issue as many loans as possible with as few defaults as possible.

The card transaction model focuses on the analysis of purchases and payments made by customers using their credit or debit cards.

Recurrent neural networks (RNNs) or their more sophisticated variants are used everywhere: long short-term memory (LSTM) or gated recurrent unit (GRU) networks.

Recall that the key disadvantage of plain recurrent networks is that they gradually forget: with a large number of card "operations", the model can accidentally lose a couple of the client's destructive delinquencies. LSTM and GRU partially solve this problem.

When processing transactions, the models identify patterns in customer behavior that signal potential financial difficulties or, conversely, a stable financial situation.

For example, a sharp increase in spending or frequent payment delays may indicate an elevated risk of default. Or, conversely, a constant turnover of funds may signal illegal business activity or literally "churning" money. As we know, turnover does not equal net profit.

Once funds arrive in a sole proprietor's current account, one can calculate the share of money flowing from there to personal cards and "estimate" how much the entrepreneur can withdraw from turnover. A very speculative example, but an illustrative one.

Moreover, banks see where a client sends money when paying by card. The proverbial "Board" bar and transactions of 100 thousand, against a salary of 140 thousand, can lead to interesting thoughts.

The current account transaction model analyzes the movement of funds in customers' bank accounts, including deposits, withdrawals, transfers and other transactions.

It is important to consider both regular patterns (such as monthly salary receipts) and anomalies (such as large withdrawals or frequent transfers to third-party accounts) that indicate changes in the client's financial situation.

The model, based on credit history data, analyzes sequences of events related to clients’ credit activities, such as loan issuance, repayment schedule, cases of delinquency and debt restructuring.

The data comes from OKB, NBKI and Equifax, i.e. from the credit history bureaus – all in the form of multidimensional arrays, which are turned into embeddings that the models can then work with.

The process uses nonlinear activation functions, such as ReLU or ELU, which lets the model identify and classify hidden patterns and anomalies that point to potential credit risks.
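As a hedged sketch of that idea: categorical credit-history events can be mapped into dense embeddings before a recurrent layer reads the sequence. The event codes and layer sizes below are invented for illustration, not taken from any bank's pipeline.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Hypothetical event vocabulary: 0 = padding, 1 = loan issued, 2 = payment,
# 3 = delinquency, 4 = restructuring -- the codes are invented for illustration
n_event_types = 5
embedding_dim = 8

event_model = Sequential([
    Embedding(input_dim=n_event_types, output_dim=embedding_dim),
    LSTM(32),                        # reads the event sequence
    Dense(1, activation='sigmoid'),  # probability of default
])
event_model.compile(optimizer='adam', loss='binary_crossentropy')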

To begin, let's assume we have customer transaction data in the form of a time series. For example, the data includes information about transaction amounts, transaction types, and timestamps.

It is necessary to transform the data into a format convenient for feeding into RNN. For example, we can use a sequence of transactions for a week, where each transaction has several features (e.g. transaction amount, type, etc.).

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Load and prepare the data
def load_and_preprocess_data(file_path):
    data = pd.read_csv(file_path)
    # Parse the timestamps and move them into the index
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data.set_index('timestamp', inplace=True)
    # Normalize the (numeric) features to [0, 1]
    scaler = MinMaxScaler()
    data_scaled = scaler.fit_transform(data)
    return data_scaled

data_scaled = load_and_preprocess_data('transactions.csv')

# Build sequences for the RNN: predict the target column (here assumed to be
# the last one) from the previous week of transactions
sequence_length = 7  # one week
targets = data_scaled[:, -1]  # assume the last column holds the score/label
generator = TimeseriesGenerator(data_scaled, targets, length=sequence_length, batch_size=32)

Now let's create the RNN model. We will use LSTM (Long Short-Term Memory) cells, as they are good at handling tasks involving long-term dependencies.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Define the model
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(sequence_length, data_scaled.shape[1])))
model.add(Dense(1))  # Output layer that predicts the credit score
model.compile(optimizer="adam", loss="mean_squared_error")

# Train the model
model.fit(generator, epochs=20)

This too is an extremely simplified example, but it is enough to roughly show how such scoring works.
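To actually use such a model, you feed it a client's most recent window of transactions; a sketch continuing the code above:

# Score a client: take their last `sequence_length` scaled transactions
last_week = data_scaled[-sequence_length:]             # shape: (7, n_features)
last_week = last_week.reshape(1, sequence_length, -1)  # add the batch dimension

score = model.predict(last_week)[0, 0]
print(f"Predicted score: {score:.3f}")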

Although neural networks are the state of the art in current bank scoring, classical systems also have an advantage: traditional models, like the same decision trees, are relatively transparent.

"The past year has been marked by the GDPR. The EU General Data Protection Regulation states that you must explain how you process data in a 'concise, transparent, understandable and easily accessible form, using clear and plain language'."

So banks can explain how the data was used or processed. But with neural networks, it is difficult for, say, an American company to understand why 50 Black families did not receive a mortgage and were left without their coveted home.

Neural networks make mistakes and can also be biased – young adults are left without a mortgage and struggle to rent a home. If in our country getting a mortgage is accompanied by the "sweaty" realization that you will overpay for the apartment five times over, in some countries obtaining a loan is a truly serious matter.

The black box problem is a real one: it is hard to accuse a recommendation system of political discrimination, but a bank with its neural network ensemble can be accused of exactly that.

End-to-end models or total control

It is obvious that an ensemble of neural networks (in essence, a naive logical summation of predictions) is something of a crutch.

The world is complex and diverse, so the logistic regression can receive completely contradictory inputs. By the credit history, it is obvious that our client will fail the loan and stop answering the collectors' calls; by the transactions, our friend has finally gotten back on his feet…

Although transaction-based probabilities are usually given lower weightings, they are sometimes more important than other parameters, especially tabular ones. Therefore, banks are interested in creating a continuous network, an end-to-end model without simple logical summation.

Alfa-Bank, though, solved this problem in an interesting way: they did not build an end-to-end model, but they did not simply sum the scores of the individual neural networks/classical models either – they converted the outputs of the neural networks into embeddings with additional information embedded in them.

Only after collecting the embeddings from the different models was a regression fitted on top of them, which formed a set of coefficients and produced a specific score.
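A rough sketch of that stacking scheme, with invented shapes and random placeholder data (this illustrates the principle, not Alfa-Bank's actual pipeline):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative shapes only
card_emb = np.random.rand(1000, 32)     # card-transaction model embeddings
account_emb = np.random.rand(1000, 32)  # current-account model embeddings
history_emb = np.random.rand(1000, 16)  # credit-history model embeddings
tabular = np.random.rand(1000, 10)      # static tabular features
labels = np.random.randint(0, 2, 1000)  # default labels (random placeholders)

# Concatenate all the representations and fit the final logistic regression
stacked = np.hstack([card_emb, account_emb, history_emb, tabular])
final_model = LogisticRegression(max_iter=1000)
final_model.fit(stacked, labels)
scores = final_model.predict_proba(stacked)[:, 1]  # the final credit score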

But banks are trying to universalize their models, to look for more patterns in order to give as many loans as possible and get as few defaults as possible.
