A neural network approach to modeling card transactions

A bank client can make up to several thousand debit and credit card transactions per year. The bank, in turn, keeps the entire history of these operations. The result is a volume of data large enough to deserve the buzzword Big Data. Data scientists love having a large amount of information available for a problem, since all machine learning methods boil down to identifying dependencies between the data and the target variable: the more data there is and the richer the feature description, the more complex the dependencies that increasingly complex models can detect.

Due to the high density of the transaction history, it becomes possible to model a variety of target variables, including the most valuable for the bank: the client’s default, interest in credit products, and the client’s income.

As part of the Alpha Battle 2.0 competition on boosters.pro, participants were asked to solve a credit scoring problem using only the customer’s transactional data for the previous year. After the competition ended, a sandbox was set up: a copy of the competition, but without a time limit and without cash prizes. The competition dataset can be used in scientific publications, theses, and term papers.

Description of data

Competitors were provided with a dataset of 1.5 million loan product issuances. Each object in the sample comes with a feature description in the form of the client’s transaction history going back one year. The type of product issued was also provided. The training sample consists of issuances over a period of N days; the test sample contains issuances over the subsequent K days. In total, the dataset contained 450 million transactions, about 6 gigabytes in Parquet format. Realizing that such a volume of data could become a serious barrier to entry, we split the dataset into 120 files and implemented batch preprocessing methods, which made it possible to solve the competition problem on a personal laptop.

Each card transaction was represented as a set of 19 features. Since we were placing a large amount of data in the public domain, we had to hide the feature values by mapping them to category numbers.

Features of card transactions

  1. Transaction currency identifier
  2. Transaction type identifier
  3. Unique identifier of the card type
  4. Plastic card transaction type identifier
  5. Card transaction group identifier, such as debit card or credit card
  6. E-commerce flag
  7. Payment system type identifier
  8. Flag for debiting funds from / depositing funds to the card
  9. Unique identifier for the type of point of sale
  10. Country ID of the transaction
  11. Transaction city ID
  12. Transaction store category ID
  13. Day of the week on which the transaction was made
  14. Hour at which the transaction was made
  15. Number of days until the date of the loan
  16. Week of the year in which the transaction was made
  17. Number of hours since the last transaction for this client
  18. Normalized amount of the transaction; 0.0 corresponds to missing values


The target variable in the competition was a binary value corresponding to the default flag for the loan product. AUC ROC was chosen as the metric for assessing the quality of solutions.
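As a reminder of what the metric measures: AUC ROC equals the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen non-defaulted one, with ties counted as one half. Below is a minimal pure-Python sketch of that definition; the function name `roc_auc` is illustrative and is not taken from the competition code.

```python
def roc_auc(labels, scores):
    """AUC ROC via the rank-sum (Mann-Whitney U) formulation.

    labels: sequence of 0/1 default flags; scores: model outputs.
    Tied scores share their average rank, so a tie counts as 1/2.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Find the run [i, j) of equal scores.
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks are 1-based, so the run occupies ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC ROC needs both classes to be present")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

In practice a library implementation would be used instead; the sketch only makes the ranking interpretation explicit.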

Basic approach to solving the problem

Each sample object is represented as a multidimensional time series consisting of real-valued and categorical features. Time series classification can be solved with the classical approach: generate a huge number of features, then select the most significant and stable ones. Obvious aggregations of real-valued features are the mean, median, sum, minimum, maximum, and variance over fixed time intervals.
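The aggregations listed above can be sketched as follows. This is a minimal illustration, not the baseline code: it assumes each transaction is reduced to a pair (days before the loan, normalized amount), and the window lengths and feature names are hypothetical.

```python
import statistics

# Aggregations named in the text, applied to the normalized amount.
STATS = {
    "sum": sum,
    "mean": statistics.fmean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "var": statistics.pvariance,
}


def aggregate_amounts(transactions, windows=(30, 90, 365)):
    """Build aggregate features over fixed time windows.

    transactions: list of (days_before_loan, amount) pairs.
    windows: window lengths in days (hypothetical values).
    """
    features = {}
    for window in windows:
        amounts = [amount for days, amount in transactions if days <= window]
        prefix = f"amnt_{window}d_"
        features[prefix + "count"] = len(amounts)
        for name, func in STATS.items():
            # An empty window has no defined statistics: emit NaN.
            features[prefix + name] = func(amounts) if amounts else float("nan")
    return features
```

NaN is used for empty windows on the assumption that the downstream model (for example, gradient boosting) handles missing values natively.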

For categorical features, you can use the counts of occurrences of each value of each categorical variable, or go further and use vectors from matrix factorizations or methods based on them, such as LDA and BigARTM. The latter produces a vector representation for all categorical features at once thanks to its support for multiple modalities. Features can be selected by importance, obtained with the popular permutation importance method or the less popular target permutation. A basic approach that scores 0.752 AUC ROC on the public LB can be found at git
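The simplest of these options, counting how often each category value occurs in a client's history, might look like the sketch below. The feature names `mcc_category` and `day_of_week` and the vocabulary sizes are assumptions made for the example.

```python
from collections import Counter


def count_features(transactions, vocab_sizes):
    """Count occurrences of every value of every categorical feature.

    transactions: list of dicts mapping feature name -> category id.
    vocab_sizes: dict mapping feature name -> number of unique values.
    Returns one flat feature row for the client.
    """
    row = {}
    for name, size in vocab_sizes.items():
        counts = Counter(t[name] for t in transactions)
        for value in range(size):
            # Values that never occur still get a column, filled with 0,
            # so that every client has the same set of features.
            row[f"{name}_{value}_cnt"] = counts.get(value, 0)
    return row
```

With fewer than 200 unique values per feature, the resulting table stays manageable, although it is wide and sparse, which is one reason feature selection matters in this approach.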

Neural network architecture

Multidimensional time series can be classified with the methods used for text classification if you mentally replace each word of the text with a set of categorical features. In natural language processing, it is customary to assign each word a numerical vector whose dimension is much smaller than the size of the vocabulary. Word vectors are usually pre-trained on a huge corpus of text documents using unsupervised methods: word2vec, FastText, BERT, GPT-3. The main motivation for pre-training is the huge number of parameters that must be learned because of the large vocabulary, combined with a labeled dataset for the applied problem that is usually small. In this problem the situation is the opposite: fewer than 200 unique values for each categorical variable and a large labeled dataset.

In summary, a vector representation for each categorical feature can be obtained with a standard Embedding layer, which works as follows: first the desired size of the vector representation is set, then the lookup matrix is initialized from some distribution, and then its weights are learned along with the rest of the network by backpropagation. The embedding size is usually chosen to be at most half the number of unique values of the categorical feature. Any real-valued feature can be converted into a categorical one by binning. The vector for each transaction is the concatenation of the embeddings corresponding to the values of its categorical features.
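To make the mechanics concrete, here is a framework-free sketch of the lookup, the binning of a real-valued feature, and the concatenation into a transaction vector. Training by backpropagation is not shown, and the class name, the initialization range, and the feature names in the usage note are all illustrative assumptions.

```python
import bisect
import random


class Embedding:
    """Lookup table with one vector per category id (training not shown)."""

    def __init__(self, num_categories, rng):
        # Rule of thumb from the text: at most half the number of unique values.
        self.dim = max(1, num_categories // 2)
        # Initialize the lookup matrix from a small uniform distribution.
        self.table = [
            [rng.uniform(-0.05, 0.05) for _ in range(self.dim)]
            for _ in range(num_categories)
        ]

    def __call__(self, category_id):
        return self.table[category_id]


def to_bin(value, edges):
    """Turn a real value into a category id using sorted bin edges."""
    return bisect.bisect_right(edges, value)


def embed_transaction(transaction, embeddings):
    """Concatenate the embeddings of all categorical features of a transaction."""
    vector = []
    for name, embedding in embeddings.items():
        vector.extend(embedding(transaction[name]))
    return vector
```

For example, with `embeddings = {"currency": Embedding(11, rng), "hour": Embedding(24, rng), "amnt_bin": Embedding(4, rng)}` and bin edges `[0.25, 0.5, 0.75]` for the normalized amount, every transaction becomes a vector of 5 + 12 + 2 = 19 numbers. In a real network the same role is played by the framework's embedding layer, whose weights are updated during training.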

Next, we feed the sequence of transaction vectors into a BiLSTM to model temporal dependencies. Then we collapse the sequence (time) dimension by concatenating the outputs of GlobalMaxPool1D and GlobalAveragePool1D computed over the sequences of vectors from both the recurrent and the embedding layers. Finally, the resulting vector is passed through an intermediate fully connected layer, followed by a fully connected layer with a sigmoid activation to solve the binary classification problem.
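The pooling step is what turns a sequence of arbitrary length into a fixed-size vector. The following plain-Python sketch shows only that step, applied to a sequence of vectors such as the recurrent layer's outputs; the function names are illustrative and the sketch stands in for the corresponding framework layers.

```python
def global_max_pool(sequence):
    """Element-wise maximum over the time axis of a sequence of vectors."""
    return [max(column) for column in zip(*sequence)]


def global_average_pool(sequence):
    """Element-wise mean over the time axis of a sequence of vectors."""
    return [sum(column) / len(column) for column in zip(*sequence)]


def pool_sequence(sequence):
    """Concatenate max and average pooling into one fixed-size vector.

    For vectors of size d the result always has size 2 * d,
    regardless of how many transactions the client has.
    """
    return global_max_pool(sequence) + global_average_pool(sequence)
```

Applying the same pooling to the embedding sequence as well and concatenating both results gives the input of the intermediate fully connected layer described above.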

Neural network training

Consider the neural network architecture offered to competition participants as an advanced baseline. It differs only slightly from the one described above and contains hundreds of thousands of trainable parameters. Heavy models tend to overfit, memorizing the training sample and performing poorly on new objects. Machine learning courses teach L1 and L2 regularization in the context of logistic regression as a way to combat overfitting; we will use this knowledge and set regularization coefficients for all parameters of the neural network. Do not forget to use Dropout before the fully connected layer and SpatialDropout1D after the embedding layer. It is also worth remembering that you can always lighten the network by reducing the size of the hidden layers.

Computing resources are limited, so there is a natural desire to use them as efficiently as possible: speeding up each training cycle lets you test more hypotheses. An indirect sign that training can be accelerated is low GPU utilization. Usually the problem lies in suboptimal batch generation. The first step is to make sure that your generator has an asynchronous queue implementation. When working with sequences, you should also look at the distribution of their lengths.

The graph shows that the distribution of sequence lengths is close to exponential, so padding every sequence to the maximum length increases the volume of input data to the neural network by three orders of magnitude. A simple solution that speeds up training by a factor of 3 is Sequence Bucketing: split the dataset into buckets by sequence length and sample each batch from a single bucket. When training recurrent layers, it is recommended to choose network parameters that are supported by the cuDNN implementation, which speeds up training in this task by a factor of five.
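A minimal sketch of Sequence Bucketing is shown below. It is not the baseline implementation: the bucket edges, the left-padding convention, and the use of 0 as the padding id (which assumes real category ids start from 1) are all assumptions made for the example, as is the ordering of each sequence from oldest to newest transaction.

```python
import random
from collections import defaultdict


def bucket_batches(sequences, bucket_edges, batch_size, pad_value=0, seed=0):
    """Group sequences into length buckets and build padded batches.

    bucket_edges: ascending maximum lengths, e.g. (8, 64, 256).
    Every batch contains sequences from a single bucket, so padding
    never exceeds that bucket's edge.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for seq in sequences:
        # Smallest edge that fits; longer sequences go to the last bucket
        # and keep only their most recent items.
        edge = next((e for e in bucket_edges if len(seq) <= e), bucket_edges[-1])
        buckets[edge].append(list(seq[-edge:]))

    batches = []
    for items in buckets.values():
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            chunk = items[start:start + batch_size]
            # Pad on the left, only up to the longest sequence in the batch.
            width = max(len(s) for s in chunk)
            batches.append([[pad_value] * (width - len(s)) + s for s in chunk])

    # Shuffle batches so that training does not see buckets in a fixed order.
    rng.shuffle(batches)
    return batches
```

The saving comes from the fact that short histories, which dominate the dataset, are no longer padded to the length of the longest one.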

Let’s say your network both trains quickly and does not overfit. We suggest checking that you are using all the ideas from the following list to get the best quality:

  1. Cyclical Learning Rate. Thanks to its LR schedule, it avoids overfitting in the first epoch because of the low base LR, and avoids getting stuck in local minima because of the damped sawtooth schedule. The base_lr and max_lr hyperparameters can be chosen with the LRFinder algorithm. It is also worth looking at the One Cycle Policy and Super-Convergence.

  2. Macroeconomic indicators and clients’ payment behavior can change over the N days during which the training sample was collected, relative to the following K days allotted for the test. As a simple strategy for modeling this temporal shift, you can apply exponential decay to object weights over time.

  3. The authors of random forest and gradient boosting did not stop at training a single tree. With neural networks it is better to follow a similar scenario: train an ensemble of networks on different subsets of objects and with varied architecture hyperparameters.

  4. Each transaction is described by a large number of features, so feature selection can improve the quality of the model.
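The first two ideas from the list are easy to express in code. The sketch below assumes that the damped sawtooth schedule corresponds to the "triangular2" policy of Cyclical Learning Rate, where the amplitude is halved after every cycle, and it parameterizes the exponential decay of object weights by a half-life; both choices and the function names are assumptions, not the baseline's settings.

```python
def triangular2_lr(step, base_lr, max_lr, step_size):
    """Cyclical learning rate with a sawtooth whose amplitude halves each cycle.

    The LR rises linearly from base_lr to the peak over step_size steps,
    falls back over the next step_size steps, then the cycle repeats
    with half the amplitude.
    """
    cycle = step // (2 * step_size)
    # x goes 1 -> 0 -> 1 within a cycle; 0 is the peak.
    x = abs(step / step_size - 2 * cycle - 1)
    scale = 1.0 / (2 ** cycle)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


def time_decay_weights(days_before_end, half_life_days):
    """Exponentially decaying object weights.

    days_before_end: for each training object, how many days before the
    end of the training period the loan was issued.
    An object half_life_days older than the newest one gets half the weight.
    """
    return [0.5 ** (days / half_life_days) for days in days_before_end]
```

The resulting weights can be passed to the training loop as per-object sample weights, so that the most recent issuances, which are closest to the test period, influence the loss the most.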

You can consolidate all this theoretical knowledge in practice by making a submission in the competition sandbox.


Transactions in Hadoop are updated once a day, so online processing is not required; in the worst case, the pipeline needs to finish within 24 hours. The pipeline is implemented as a DAG in Airflow consisting of the following steps:

  1. Loading and aggregating new data with pyspark. The model uses each user’s transaction history for the previous year. The best scenario is to keep the historical data preprocessed as of the previous model run and add the new data to it. Output: a relatively compact parquet dataframe.

  2. Prediction in Python using one of the frameworks: tensorflow, pytorch. Output: a tsv table containing the id and many fields with model predictions.

  3. Uploading the predictions to Hadoop and feeding them to the final model.
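The incremental update from the first step can be illustrated with an in-memory stand-in for the pyspark job. This is only a sketch of the idea of reusing the previously preprocessed history: the data layout (a dict from client id to dated records) and the function name are hypothetical, and a production version would operate on distributed dataframes.

```python
from datetime import date, timedelta


def update_history(history, new_transactions, run_date, depth_days=365):
    """Merge new transactions into stored history and keep a one-year window.

    history: dict client_id -> list of (transaction_date, record).
    new_transactions: iterable of (client_id, transaction_date, record).
    Returns a new dict; the stored history is not modified.
    """
    cutoff = run_date - timedelta(days=depth_days)
    updated = {client: list(rows) for client, rows in history.items()}
    for client, tx_date, record in new_transactions:
        updated.setdefault(client, []).append((tx_date, record))

    for client in list(updated):
        # Drop everything older than the window and restore date order.
        rows = sorted(
            (row for row in updated[client] if row[0] > cutoff),
            key=lambda row: row[0],
        )
        if rows:
            updated[client] = rows
        else:
            del updated[client]
    return updated
```

Only the new day's transactions need to be preprocessed on each run, which helps keep the daily pipeline well within the 24-hour budget.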

Neural networks trained only on transactional data are slightly inferior to classical algorithms that use explicit signals from all the data available to the bank, including transactions and credit bureau data. Fortunately, the model has an orthogonal component and is built into the business process as an addition to the existing models. This simple approach to mixing classical modeling methods with neural networks allows the models to be developed independently and significantly reduces implementation time.

In conclusion, I would emphasize that the approach described in this article can be used to classify any time series of a similar structure.
