t2 Fintech experience

Introduction

Hello, I'm Bulat Yusupov, a business analyst on the t2 Fintech team. In this post I describe how our team researched the signs and indicators that can be used to predict a user's financial condition. Today's topic is bankruptcy.

Personal bankruptcy is a hot topic. The number of bankrupts in Russia reached 350 thousand people in 2023, up from 278 thousand in 2022, while total debt rose from 384 to 437 billion rubles (https://fedresurs.ru/news/16e6ef70-49cd-4e6b-8f84-691963b99a9e). Bankruptcy proceedings open the door to debt restructuring and relief from financial burdens for individuals, but the phenomenon also calls for closer analysis. For financial institutions and companies, the ability to predict bankruptcies is of fundamental value because it helps them avoid risks and make informed decisions.

Target Variable Source

The main source of the target variable was the EFRSB (the Unified Federal Register of Bankruptcy Information), a federal resource containing reports on court cases and bankruptcies. It provides detailed information on both individuals and legal entities. The presence or absence of a future bankruptcy case became the binary target used for training. Note that a bankruptcy case associated with a person does not always mean that bankruptcy actually occurred: determining that more accurately requires examining the case documents and the court decision.

Modeling

So we decided to take matters into our own hands and develop a model for predicting borrower bankruptcy. The model estimates the risk that a borrower goes bankrupt within three years. More precisely, it predicts the probability that the borrower will attempt to file for bankruptcy within three years of the scoring date.
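The three-year target described above can be sketched as a labeling function. This is a minimal illustration rather than the team's actual code: the function names, the `HORIZON_YEARS` constant, and the assumption that each borrower has at most one relevant case date are all hypothetical.

```python
from datetime import date
from typing import Optional

HORIZON_YEARS = 3  # prediction window stated in the text


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def bankruptcy_target(scoring_date: date, first_case_date: Optional[date]) -> int:
    """Return 1 if a bankruptcy case appears within the horizon after scoring."""
    if first_case_date is None:
        return 0  # no case found in the register (assumed to mean "no event")
    window_end = add_years(scoring_date, HORIZON_YEARS)
    return int(scoring_date < first_case_date <= window_end)


# Made-up dates, for illustration only.
label = bankruptcy_target(date(2020, 1, 15), date(2022, 6, 1))  # inside the window
```

Cases dated before the scoring date are labeled 0 here. In practice such borrowers might instead be excluded from the training sample, which is a design choice the source does not specify.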
Model training consisted of the following stages: searching for and collecting relevant features, data preprocessing, feature filtering, hyperparameter tuning, and model validation. We chose CatBoost as the base algorithm because it handles categorical variables effectively, and the raw sources contain quite a few of them.
Our in-house development standards and automated procedures significantly reduce development time, which lets the team solve such problems quickly and efficiently.

Qualitative analysis

WOE (Weight of Evidence) analysis and the calculation of Information Value were among the tools used for feature selection. Several features known only to telecom operators showed a fairly high Information Value and significance for the models, for example the presence of an invalid document (doc_invalid), subscriber lifetime (bc_lifetime), minimum balance amount (bal_min_amt), and the total number of SMS sent (sms_tot_cnt).
When generating hypotheses, we always spell out the empirical rationale on which the model will rely:

  • Possession of an invalid document (doc_invalid): A subscriber's invalid or outdated documents may indicate a risk of fraud or irregular activity. In telecom, the correctness of subscriber data directly affects the subscriber's ability to use the services in full.

  • Subscriber lifetime (bc_lifetime): The assumption is that the longer a person has been an active subscriber, the more loyal and reliable they are. Lifetime may correlate with the subscriber's age, the stability of their behavior, and their general activity. Subscribers with a long lifetime tend to show more stable and predictable behavior patterns, which makes their activity easier to analyze and forecast.

  • Minimum amount on balance (bal_min_amt): Balance plays an important role in determining the subscriber's financial condition. A regularly low balance may indicate infrequent use of services and possible financial difficulties. Conversely, subscribers with a high balance often demonstrate greater financial stability and predictability.

  • Total number of SMS sent (sms_tot_cnt): The number of messages sent is an indicator of subscriber activity. A large number of SMS can indicate a high level of engagement, but also possible suspicious activity such as spam, or SMS traffic related to debt notifications and collection agencies.
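The WOE and Information Value calculation mentioned above can be sketched in plain Python. This is a hedged illustration under several assumptions: the feature is already binned, WOE is defined as the log ratio of the non-bankrupt share to the bankrupt share (sign conventions vary), and a small additive `smoothing` term, which is not mentioned in the source, guards against empty cells. The function name, bin labels, and data are all made up.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def woe_iv(
    bins: Sequence[Hashable],
    targets: Sequence[int],
    smoothing: float = 0.5,
) -> Tuple[Dict[Hashable, float], float]:
    """Return per-bin WOE and the feature's total Information Value."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [non-bankrupt, bankrupt]
    for b, y in zip(bins, targets):
        counts[b][1 if y else 0] += 1

    n_bins = len(counts)
    total_non = sum(c[0] for c in counts.values()) + smoothing * n_bins
    total_ev = sum(c[1] for c in counts.values()) + smoothing * n_bins

    woe: Dict[Hashable, float] = {}
    iv = 0.0
    for b, (non, ev) in counts.items():
        share_non = (non + smoothing) / total_non  # share of non-bankrupts in bin
        share_ev = (ev + smoothing) / total_ev     # share of bankrupts in bin
        woe[b] = math.log(share_non / share_ev)
        iv += (share_non - share_ev) * woe[b]
    return woe, iv


# Hypothetical pre-binned values of bal_min_amt with made-up targets.
bins = ["low"] * 4 + ["high"] * 4
targets = [1, 1, 1, 0, 0, 0, 0, 1]
woe, iv = woe_iv(bins, targets)
```

With this sign convention, a negative WOE marks a bin where bankrupt borrowers are over-represented. Features whose Information Value stays close to zero carry little signal and are natural candidates for removal at the filtering stage.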

[Figures: charts for the features doc_invalid, bc_lifetime, bal_min_amt, and sms_tot_cnt]

Metrics

Defining Metrics
Metrics are quantitative indicators used to evaluate the performance of a machine learning model. The main metrics used to evaluate binary classification models include F1-measure, recall, precision, and ROC AUC.

  • F1-measure is the harmonic mean of precision and recall, providing a balance between them.

  • Recall is the proportion of correctly predicted positive examples among all actual positive examples.

  • Precision is the proportion of true positive examples among all positive examples predicted by the model.

  • ROC AUC (Area Under the ROC Curve) is a generalized metric that evaluates a model's ability to distinguish between positive and negative classes at different classification thresholds.
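The threshold-based metrics defined above can be computed directly from the confusion counts. The sketch below is illustrative only: the function name and the toy labels are made up, and class 1 is assumed to mean "bankruptcy case appeared".

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On these toy labels precision is 3/5 and recall is 3/4. All three values depend on the chosen classification threshold, unlike ROC AUC, which summarizes performance across thresholds.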

The model was assessed against these metrics, and the feature space was optimized for ROC AUC. The model reached a ROC AUC of 0.86 on the training set and 0.82 on the test set. The moderate gap between the two suggests limited overfitting, and the test value indicates good discriminative power for this segment.
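ROC AUC has a convenient probabilistic reading: it is the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all pairs, which is simple but quadratic, so it is meant as an explanation rather than a production routine. The function name and the scores are made up.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Computing the same quantity separately on the training and test samples gives a pair of values comparable to the train and test AUC reported above.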

Application

The model for predicting personal bankruptcy suits a fairly wide range of business cases, including use as an additional decision criterion. Features available only to telecom operators, such as balances, lifetime, and features based on subscribers' contacts with each other and with subscribers of other operators, make it useful not only to banks but also to marketplaces and insurance companies. Banks, for example, can use the model to reduce credit risk in lending decisions and to improve customer service through personalized offers and services. Marketplaces can assess customers when offering BNPL products and vet their sellers, self-employed people, and individual entrepreneurs. Insurance companies can screen out fraud by problem clients.

Development

We plan to improve the model's quality metrics and add new functionality, in particular bankruptcy scoring for legal entities. Such improvements rest on expanding the set of data sources and on deeper feature engineering. The unique subscriber-activity data and t2's experience drive further quality gains, not only in our own models but also in those of the company's partners. Continuous improvement and the integration of the latest advances in data analytics and machine learning should ultimately give us a tool that is an integral part of risk management.

Conclusion

The desire to create a new product led our team to build a model for predicting personal bankruptcy. Thanks to the telecom operator's data and our research, we created a product that benefits companies across different business sectors. The simple solution architecture, based on a single machine learning algorithm and t2 data, has already shown good results and can be applied in a variety of business cases. Future work on accuracy and functionality should improve reliability and efficiency, and adopting new technologies and methodologies should help turn the model into an indispensable risk management tool.
