How we taught the neural network to parse the names of goods in receipts
A year ago we launched ProChek, an application for storing and analyzing receipts. It receives the user’s receipts from the Federal Tax Service (FTS), sorts them, and builds expense charts. ProChek has a killer feature: it automatically determines the category of each product on the receipt. That is, it does not simply assign the entire receipt to one category, as banking applications do in their reports. ProChek really breaks an Auchan receipt down item by item: apples go to groceries, cat food to pets, and washing powder to household chemicals. A hybrid solution using artificial intelligence is responsible for categorization. The estimated accuracy of the model is 95%.
My name is Alexander Bondin, I am the head of development at STM Labs, and in this article I will tell you how the categorizer works, what the AI is responsible for, and what real people are responsible for. I will also share a life hack on how to mark up a couple of million objects for training a neural network almost for free, without registration and SMS.
How We Collect Data
By law, all purchase receipts are stored by the tax service in electronic form. With the user’s permission, ProChek receives all of their receipts from the FTS database automatically. The main thing is that a phone number or email is attached to the receipt; to do this, you need to give your contact details to the cashier when buying. Alternatively, you can add purchases manually by scanning the QR code on a paper receipt.
All user receipts in ProChek are stored in a convenient form and can be sorted and analyzed depending on the task. For example, you can assign categories manually or automatically, sort receipts into thematic folders, export reports in the required format, and build expense charts.
The categorizer in the application automatically determines the category of each product, so the user does not need to spend time sorting purchases. The application then builds spending charts for the different categories.
To train a categorization neural network, you need a huge array of data: receipts in which each position has already been assigned a category. We already had a large array of receipts left over from a previous project, and we wanted to use it for ML. But it turned out to be impossible to work with: the sample had already been processed, the abbreviations expanded, and the errors corrected. It didn’t look like real input.
Therefore, we began to collect a dataset from the organic data coming into ProChek. We started with just 8,000 positions. At the time of publication, there are already more than 2 million positions in the dataset. And it continues to grow: every week ProChek receives 25,000 receipts, which is almost 300,000 product items per month. As the dataset grows, so does the accuracy.
Disclaimer about personal data security
Note that in all manipulations we use data in anonymized form. For processing, we take only the names and prices of goods. From the dataset, it is impossible to determine who made the purchase or with which card.
First attempts at NLP
So, we get receipts. Each position on a receipt has a name and a price, plus some other metadata that is not important here. Remember what the names on receipts look like: abbreviations everywhere and obscure codes made of numbers and letters.
Therefore, a large part of the task is to process the name and work out what the user bought.
At first glance, natural language processing (NLP) should help here. But NLP usually deals with texts longer than the names of goods on a receipt. Those are usually no longer than 100 characters, and some names are just a few letters long, such as the Russian abbreviations for “ultrasound” or “service station”. Classical NLP approaches are good for analyzing reviews on Kinopoisk or Twitter posts. They will help you understand what L. N. Tolstoy meant when he described the oak over several pages. But telling “Sour Cream Cheese” from “Glazed Cheese” is a different task.
Usually, NLP takes a large amount of text and builds a vector space of words based on how often they occur and where they stand relative to each other. For receipt names, such a vector space carries almost no semantic load: the names are too short and are full of errors, abbreviations, codes, and article numbers.
We tried different methods of data preprocessing and transformation and different methods of cluster analysis for markup, and we used pre-trained vectors and multilingual models. For example, we worked with bag-of-words and word2vec. The results were mediocre.
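To see why frequency-based vectors say so little about such short strings, here is a minimal bag-of-words sketch. It is a toy illustration rather than the code we actually ran: the whitespace tokenizer and the three sample names (including the abbreviated spelling) are assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(name: str) -> Counter:
    """Count lower-cased, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(name.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical receipt lines: the first two are the same product,
# the third is a different product that happens to share one token.
full = bag_of_words("glazed cheese")
abbreviated = bag_of_words("glz chs")
other = bag_of_words("sour cream cheese")

same_product_similarity = cosine(full, abbreviated)
different_product_similarity = cosine(full, other)
```

In this sketch the abbreviated spelling of the same product shares no tokens with the full one, so its similarity is zero, while a different product scores higher just because of one common word.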
As a result, we decided to take on a large and complex solution: the DeepPavlov framework, which is based on the BERT transformer. DeepPavlov is an open-source library for natural language processing with models pre-trained on a huge array of Russian-language text. Its developers processed all the classical literature, part of the modern literature, posts on social networks and forums, and Wikipedia articles.
In general, DeepPavlov is more often used for chatbots and other conversational tasks. But what difference does it make, if our new solution could eventually tell even milk from body milk, not to mention more obvious things?
Data preprocessing and normalization
As I already said, the data on a receipt is very different from the texts that NLP methods usually work with. And since we have a “very short text” at the input, we need to feed the model as much of its meaningful part as possible.
Therefore, we introduced data preprocessing, which brings each name to a normalized form:
convert all characters to lower case
collapse extra spaces
remove special characters, since they barely affect the identification of the product
delete numbers and alphanumeric codes, since different stores use different lengths and formats
in certain positions, replace numbers with number words: for example, “AI-95” turns into “gasoline ninety-fifth”
Thus, we turn many short texts into one index entry. This simplifies labeling the training data: instead of assigning a class, say, a hundred thousand times, we mark up a normalized position once. Then, by mapping the normalized name back to the non-normalized names, we get those hundred thousand labeled positions. At the moment we store more than 110 thousand normalized positions, which correspond to 2.3 million products.
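The normalization steps and the resulting index can be sketched roughly as follows. This is a simplified illustration, not our production code: the regular expression, the `NUMBER_WORDS` table with its single entry, the sample names, and the “fuel” category are all assumptions. Note that in this sketch the hyphen in “ninety-fifth” is also removed by the special-character step.

```python
import re
from collections import defaultdict

# Hypothetical whitelist of fragments where the number carries meaning
# and is spelled out instead of being deleted.
NUMBER_WORDS = {"ai-95": "gasoline ninety-fifth"}


def normalize(name: str) -> str:
    """Bring a raw receipt name to a normalized form."""
    text = name.lower()
    for fragment, words in NUMBER_WORDS.items():
        text = text.replace(fragment, words)
    # Replace special characters (anything that is not a letter, digit or space).
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Drop tokens containing digits: plain numbers and alphanumeric codes.
    tokens = [t for t in text.split() if not any(ch.isdigit() for ch in t)]
    # Joining with single spaces also collapses extra whitespace.
    return " ".join(tokens)


def build_index(raw_names):
    """Map each normalized name to all raw names that collapse into it."""
    index = defaultdict(list)
    for raw in raw_names:
        index[normalize(raw)].append(raw)
    return index


raw_names = ["MILK 3.2% 1L", "Milk  2.5%", "milk #4410", "AI-95 Premium"]
index = build_index(raw_names)

# Label each normalized position once...
labels = {"milk": "groceries", "gasoline ninety fifth premium": "fuel"}
# ...and every raw name mapped to it inherits the label.
labeled = {raw: labels[key] for key, raws in index.items() for raw in raws}
```

Here three differently written milk positions collapse into one index entry, so a single label covers all of them.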
How did we come up with these normalization rules?
At first we tried to:
handle abbreviations with pymorphy2. But mistakes often happened: for example, “smrz” was expanded as “self-development” rather than “self-tapping screw”
introduce lemmatization, which brings every word to its initial form: nominative singular for nouns and adjectives, infinitive for verbs. As a result, “socks” turned into “sock”, and “tea” was treated as a verb
use stemming, which finds the stem of a word without taking its ending into account
Finding all such exceptions and writing a rule for each one looked like too much work, so we abandoned these methods.
We also tried replacing numbers with words everywhere, but number words are quite long, so they started to carry even more weight than the product names and introduced a lot of noise. As a result, we simply dropped almost all numbers and alphanumeric codes, except for cases like the gasoline mentioned above.
Yes, this can cause errors: for example, a toy Mercedes 125 model could be recognized as a car rather than a toy. But, firstly, there have been no such cases yet, and secondly, we plan to prevent such mismatches in the future by taking the price of the item into account.
With about a million positions in the dataset, we noticed that the model correctly categorizes even a strange bunch of abbreviations in a single name that a person could not decipher without Google. So normalization works as it should.
ML: expectations and reality
In the spring, seasonal items appeared, such as garden tools and watermelons. The model began to make mistakes more often, because it had never seen anything like them in the organic data. So we gave the model a vote of no confidence and introduced a dictionary.
The task of the dictionary is to store all normalized and labeled positions. Now each incoming normalized position is first looked up in the dictionary. If a match is found, the position is immediately assigned a category without any AI. If there is no match, the model gets a chance to determine the category itself.
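The dictionary-first lookup can be sketched like this. It is a minimal illustration under assumptions: the function name `categorize`, the `(category, confidence)` shape of the model output, the stub model, and the “household goods” category are hypothetical and stand in for the real components.

```python
from typing import Callable, Dict, Tuple

# (category, confidence between 0 and 1); the shape of the model output is an assumption.
Prediction = Tuple[str, float]


def categorize(
    normalized_name: str,
    dictionary: Dict[str, str],
    model: Callable[[str], Prediction],
) -> Tuple[str, str]:
    """Return (category, source), where source is "dictionary" or "model"."""
    category = dictionary.get(normalized_name)
    if category is not None:
        # Exact match: the model is not called at all.
        return category, "dictionary"
    predicted, _confidence = model(normalized_name)
    return predicted, "model"


model_calls = []


def stub_model(name: str) -> Prediction:
    """Stand-in for the real classifier; always answers "groceries"."""
    model_calls.append(name)
    return "groceries", 0.5


dictionary = {"watermelon": "groceries", "garden rake": "household goods"}
hit = categorize("garden rake", dictionary, stub_model)
miss = categorize("cat food", dictionary, stub_model)
```

In the example, “garden rake” is resolved from the dictionary without touching the model, and only the unknown “cat food” reaches the stub classifier.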
Besides letting us “cheat” when assigning categories, the dictionary makes it more convenient to prepare a training sample for the model: the dictionary is mapped onto the whole volume of incoming positions, so the resulting sample largely mirrors the real input stream.
In fact, we found a way to improve accuracy even more: we began to use some auxiliary receipt attributes, not just names. For example, subscriptions such as ivi or okko are identified by the TIN, and medicines by the marking code. In these cases, we assign a category to the purchase at the preprocessor level, even before parsing the name, but we still add the position to the dictionary and the training dataset.
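A rough sketch of such attribute rules is shown below. Everything specific in it is an assumption made for illustration: the TIN values are placeholders rather than real taxpayer numbers, the `marking_group` field presumes that the marking code has already been decoded into a product group somewhere upstream, and the category names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Placeholder TINs, not real taxpayer numbers.
SUBSCRIPTION_TINS = {"0000000001", "0000000002"}


@dataclass
class Position:
    name: str                            # normalized position name
    seller_tin: Optional[str] = None     # TIN taken from the receipt
    marking_group: Optional[str] = None  # assumed to be decoded from the marking code upstream


def categorize_by_attributes(position: Position) -> Optional[str]:
    """Return a category from receipt attributes alone, or None if no rule fires."""
    if position.seller_tin in SUBSCRIPTION_TINS:
        return "subscriptions"
    if position.marking_group == "medicine":
        return "medicines"
    return None


def preprocess(position: Position, dictionary: Dict[str, str]) -> Optional[str]:
    """Apply attribute rules before the name is parsed; remember every hit."""
    category = categorize_by_attributes(position)
    if category is not None:
        # The position still goes into the dictionary (and, in the same way,
        # into the training dataset).
        dictionary[position.name] = category
    return category


dictionary: Dict[str, str] = {}
by_tin = preprocess(Position("monthly plan", seller_tin="0000000001"), dictionary)
by_marking = preprocess(Position("tablets", marking_group="medicine"), dictionary)
no_rule = preprocess(Position("cat food"), dictionary)
```

Positions that match no rule return `None` and continue to the dictionary lookup and the model as usual.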
How to mark up data quickly, without registration and SMS
The output of the model consists of two parts: the estimated category for the position and the probability of this category as a percentage, that is, how confident the model is in its decision. The model’s category is accepted only if the confidence is above 95%.
If the percentage is below the threshold, then in production the position remains without a category; about 17% of positions end up this way. These positions, along with the proposed category, are sent for validation to a live data engineer, who labels the position, adds it to the dictionary, and retrains the model on the new data.
The volume of incoming data kept growing, and the data engineers no longer had time to label all the positions that the model could not cope with. We didn’t want to outsource the markup, so we brought in our own colleagues from STM Labs.
We launched a Telegram bot where any employee could become an assessor, for example, during a tea break.
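The threshold logic can be sketched as follows. The function names and the in-memory review queue are hypothetical; only the rule itself (accept above 95%, otherwise leave the position uncategorized and send it for validation) comes from the description above.

```python
from typing import Dict, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.95  # the category is accepted only above this value


def accept_or_queue(
    name: str,
    category: str,
    confidence: float,
    review_queue: List[Tuple[str, str]],
) -> Optional[str]:
    """Return the category if the model is confident enough, otherwise queue it."""
    if confidence > CONFIDENCE_THRESHOLD:
        return category
    # Below the threshold: the position stays uncategorized in production,
    # and the proposed category goes to a human for validation.
    review_queue.append((name, category))
    return None


def apply_review(name: str, confirmed_category: str, dictionary: Dict[str, str]) -> None:
    """After validation the labeled position is added to the dictionary."""
    dictionary[name] = confirmed_category


review_queue: List[Tuple[str, str]] = []
dictionary: Dict[str, str] = {}

confident = accept_or_queue("cat food", "pets", 0.99, review_queue)
unsure = accept_or_queue("glz chs", "groceries", 0.80, review_queue)
borderline = accept_or_queue("washing powder", "household chemicals", 0.95, review_queue)

apply_review("glz chs", "groceries", dictionary)
```

Note that a confidence of exactly 0.95 is not “above” the threshold, so in this sketch the borderline position is also queued.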
How it works: The bot sends the user a normalized position name with the category that the model assumed. The task of the user is to evaluate whether the category is correctly assigned.
The entry threshold for such markup is low: the average person can almost always recognize what kind of product is listed on a receipt. Still, we introduced double validation: all the data was rechecked by someone from the data engineering team. But even with this re-check, the bot saved us a lot of man-hours.
To speed up the process, we added an element of gamification: for each answer we gave out a point, and the record holders were given tote bags, thermal mugs, and other merchandise. Thus, in the first week alone, volunteer assessors labeled about 30,000 positions.
But then we ran into the downside of gamification: some people picked any answer just to score points. Overall, the markup speed grew a lot, but at some point the quality suffered. After explanations and educational conversations, everything returned to normal.
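One possible way to model double validation in code is shown below. It is only a sketch under assumptions: the `MarkupTask` structure, its field names, and the rule that the data engineer’s category has the final word are hypothetical details, not a description of the bot’s internals.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MarkupTask:
    name: str                                # normalized position name shown by the bot
    proposed_category: str                   # category the model assumed
    assessor_agrees: Optional[bool] = None   # first check: the answer given in the bot
    engineer_category: Optional[str] = None  # second check: category set by a data engineer


def accept(task: MarkupTask, dictionary: Dict[str, str]) -> bool:
    """Add the position to the dictionary only after both checks are done."""
    if task.assessor_agrees is None or task.engineer_category is None:
        return False
    # Assumption: the re-check by the data engineer has the final word.
    dictionary[task.name] = task.engineer_category
    return True


dictionary: Dict[str, str] = {}

pending = MarkupTask("cat food", "pets", assessor_agrees=True)
pending_accepted = accept(pending, dictionary)

checked = MarkupTask(
    "washing powder", "groceries",
    assessor_agrees=False, engineer_category="household chemicals",
)
checked_accepted = accept(checked, dictionary)
```

A task answered only in the bot stays out of the dictionary until someone from the data engineering team has looked at it.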
How it works now
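At the moment, the categorizer works according to the following scheme: auxiliary attributes are checked at the preprocessor level, the normalized name is then looked up in the dictionary, and only if there is no match is the model called. Positions where the model’s confidence is below the threshold are sent for manual markup.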
The category is assigned to positions instantly: the receipt has just been loaded, and the user can already see which categories the purchases belong to. There are no delays thanks to the microservice architecture on the flexiflow platform.
flexiflow is a high-load application platform developed by STM Labs. It allows you to quickly launch and easily scale services for working with big data.
The categorizer is one of ProChek’s microservices. It was developed as part of the ProChek application, but it can be used independently to solve separate problems. Through the API, the categorizer can be connected, for example, to a store’s information system to check the categories for all issued receipts or to divide the product catalog into categories.
Conclusions and plans
It would seem that the classification problem is very primitive in its essence, and using BERT to decipher receipts is like firing a cannon at sparrows. But other methods did not work, so we did not hesitate to take such a powerful technology.
Thanks to high-quality natural language processing, the rest of the machine learning in this task is quite simple. After vectorizing the data, we process it with ordinary logistic regression.
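To illustrate the last step, here is a minimal logistic regression trained by gradient descent on ready-made vectors. It is a simplified sketch: the two-dimensional toy vectors stand in for real BERT embeddings, the task is reduced to two classes, and the learning rate and number of epochs are arbitrary values chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability that vector x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(
    vectors: List[Sequence[float]],
    labels: List[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    n_samples, n_features = len(vectors), len(vectors[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(vectors, labels):
            error = predict_proba(x, weights, bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


# Toy "embeddings": class 1 = groceries, class 0 = household chemicals.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(vectors, labels)
grocery_like = predict_proba([0.85, 0.15], weights, bias)
chemical_like = predict_proba([0.15, 0.85], weights, bias)
```

The probability returned by `predict_proba` plays the same role as the model confidence that is compared with the 95% threshold.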
We continue to work on the accuracy of our hybrid solution with a dictionary and a neural network. We have several areas for further development:
expand the dataset and retrain the model on new data
learn to recognize each individual product, not just its category
use all data from receipts, including marking tags and GTIN barcodes
add parameters for training – for example, use the names of stores and the time of purchase.
If you have ideas for the development of the project, share them in the comments. I will also be happy to answer questions there.