Our experience of using AI technologies to classify documents for filing in court

How did we create, train and release a service that uses machine learning to recognize and classify legal documents? In this article, we describe our experience of developing this solution for automating the work of lawyers and collectors, and the difficulties we encountered along the way.

AI technologies for credit file processing

Our service is built on OCR (machine text recognition) technology. In addition to OCR, we use machine learning technologies.

The principle of recognition and classification is as follows:

  1. OCR (machine text recognition) is the first link in the cycle. It extracts text from a document image. The OCR was trained on examples of documents compiled according to different templates, and the training took into account fonts, font sizes, and the rotation of the document.

  2. NLP (Natural Language Processing) is the second link. It analyzes the text and determines what type of document the image belongs to. Natural language processing is needed to extract meaning from the words and letters that the OCR algorithms have recognized.

  3. CNN (Convolutional Neural Network) is the third and last link. It checks whether the image is a passport in cases where NLP cannot tell what kind of document it is. The convolutional neural network comes into play when the two previous, text-based methods are not effective, for example when the text is poorly printed. This often happens with passport scans.
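The three links above can be sketched as a simple cascade. This is only an illustration of the control flow: the function names, the `"passport"` and `"unknown"` labels, and the stand-in stages are assumptions made for the example, not the actual service code.

```python
from typing import Callable, Optional


def classify_page(
    image: bytes,
    ocr: Callable[[bytes], str],
    nlp_classify: Callable[[str], Optional[str]],
    cnn_is_passport: Callable[[bytes], bool],
) -> str:
    """Run the OCR -> NLP -> CNN cascade on one page image."""
    text = ocr(image)              # link 1: extract text from the image
    doc_type = nlp_classify(text)  # link 2: classify by the extracted text
    if doc_type is not None:
        return doc_type
    # link 3: the text was not enough, so look at the image itself
    return "passport" if cnn_is_passport(image) else "unknown"


# Stand-in stages for demonstration only (hypothetical, not real models).
def fake_ocr(image: bytes) -> str:
    return image.decode("utf-8", errors="ignore")


def fake_nlp(text: str) -> Optional[str]:
    return "consent" if "consent" in text.lower() else None


def fake_cnn(image: bytes) -> bool:
    return image.startswith(b"\x00PASSPORT")
```

Passing the stages in as callables keeps the cascade itself independent of any particular OCR engine or model, which also makes it easy to test with stubs like the ones above.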

How OCR works


NLP did not appear in the service immediately; before that, a different algorithm was used. Why did we change it, and how did we arrive at the recognition cycle described above?

The Path to NLP and What Came Before

How did we initially determine the document type during classification? The initial algorithm was keyword-based: it looked for words in the document that help identify its type. For example, suppose we need to find out whether a document is a Consent to the processing of personal data. If the program finds the words “Consent to the processing of personal data” in the text, the answer is yes.

It worked like this:

  • The file was converted to an image.

  • Text was extracted from the image.

  • The text was searched for a single “keyword”.
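The last step of this list can be sketched as follows. The text is assumed to be already extracted by OCR, and the keyword table (including the `loan_agreement` entry) is a hypothetical example rather than the real configuration.

```python
from typing import Optional

# Hypothetical keyword table: one phrase per document type.
KEYWORDS = {
    "consent": "consent to the processing of personal data",
    "loan_agreement": "loan agreement",
}


def classify_by_keyword(text: str) -> Optional[str]:
    """Return the first document type whose keyword occurs in the text."""
    # Lowercase and collapse whitespace so line breaks do not split the phrase.
    normalized = " ".join(text.lower().split())
    for doc_type, phrase in KEYWORDS.items():
        if phrase in normalized:
            return doc_type
    return None
```

Note that a single OCR slip such as "C0nsent" is enough to make this exact-match lookup return `None`.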

But this approach can cause problems. We have encountered the following difficulties:

  1. A problem arises if the document consists of more than one page and the keyword appears only on the first one. PDF files with credit dossiers are long and contain all the documents at once, so the program determines the type incorrectly.

  2. If the same keywords appear on different pages that belong to different documents, the program also stumbles.

  3. Inaccurate OCR extraction is also a big problem.

To avoid these problems, you need to assign “keywords” to each page of the document and make sure the Python code can distinguish one class from another, even if their keywords overlap.
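One way to picture per-page keywords with overlapping sets is to score each page against every class and pick the class with the most hits. This is a sketch under assumptions: the keyword sets and the simple hit-counting rule are invented for the example.

```python
from collections import Counter

# Hypothetical keyword sets; "borrower" is shared by both classes.
CLASS_KEYWORDS = {
    "loan_agreement": {"loan agreement", "borrower", "interest rate"},
    "consent": {"consent", "personal data", "borrower"},
}


def classify_pages(pages: list[str]) -> list[str]:
    """Label each page with the class whose keyword set matches most often."""
    labels = []
    for page in pages:
        text = page.lower()
        # Count how many phrases of each class occur on this page.
        hits = Counter({
            doc_type: sum(phrase in text for phrase in phrases)
            for doc_type, phrases in CLASS_KEYWORDS.items()
        })
        best, score = hits.most_common(1)[0]
        labels.append(best if score > 0 else "unknown")
    return labels
```

Because the shared word contributes one hit to both classes, the decision rests on the keywords that are unique to each class.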

Add to this the inaccuracy of OCR, which can extract incorrect text: miss a letter, change “O” to “0”, or not see something… And “keywords” will no longer work.

Yes, these problems can be solved by extracting text from the PDF file itself and creating a regular expression for each “keyword”, but this is very time-consuming and does not protect against OCR inaccuracy. And it is not always possible to extract text from the entire PDF file, since it may contain photographs and data in other formats.
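A regular expression for one “keyword” might look like the sketch below. The table of confusable characters is an illustrative assumption; a real one would have to be collected from actual OCR errors, which is part of why this approach is time-consuming.

```python
import re

# Characters that OCR commonly confuses (illustrative assumption).
CONFUSABLE = {"o": "[o0]", "0": "[o0]", "l": "[l1i]", "i": "[l1i]", "1": "[l1i]"}


def keyword_pattern(phrase: str) -> re.Pattern:
    """Build a regex that tolerates confusable characters and extra whitespace."""
    parts = []
    for ch in phrase.lower():
        if ch.isspace():
            parts.append(r"\s+")
        else:
            parts.append(CONFUSABLE.get(ch, re.escape(ch)))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = keyword_pattern("consent to the processing of personal data")
```

The pattern matches text where “O” became “0”, but it still fails when OCR drops a letter entirely, so it only covers part of the inaccuracy problem.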

As a result, we introduced NLP into the product. What changed?

NLP (Natural Language Processing) is a more advanced version of “keyword” search. The model converts the text of a document into a numeric format, where each number indicates the importance of a certain word or phrase. How does it work? Something like the picture.

The model is trained on these numerical values. It learns which words and phrases are most common in each document type, and in this way learns to identify specific document types.

When the model receives a document for classification, it converts the text into numbers and, based on the trained weights, predicts the probability that the document belongs to each type.

That is, with the help of NLP we evaluate all the words in the document at once, and do not look for specific ones as in the search by “keywords”.
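The idea of turning text into word-importance numbers and scoring every class at once can be sketched in pure Python. The article does not name the model used, so TF-IDF weighting with a nearest-centroid rule is an assumption chosen for brevity, and the training texts are invented. In practice a library such as scikit-learn would typically be used for this.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TfidfCentroidClassifier:
    """TF-IDF vectors plus nearest-centroid scoring (illustrative stand-in)."""

    def fit(self, texts: list[str], labels: list[str]) -> "TfidfCentroidClassifier":
        docs = [tokenize(t) for t in texts]
        doc_freq = Counter(word for doc in docs for word in set(doc))
        n = len(docs)
        # Smoothed inverse document frequency: rarer words weigh more.
        self.idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}
        sums = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            sums[label].update(self._vector(doc))  # accumulate per-class weights
        self.centroids = {label: dict(vec) for label, vec in sums.items()}
        return self

    def _vector(self, tokens: list[str]) -> dict[str, float]:
        counts = Counter(tokens)
        return {w: c * self.idf[w] for w, c in counts.items() if w in self.idf}

    def predict_scores(self, text: str) -> dict[str, float]:
        """Cosine similarity to each class, normalized to sum to 1."""
        vec = self._vector(tokenize(text))
        scores = {label: cosine(vec, c) for label, c in self.centroids.items()}
        total = sum(scores.values())
        if total == 0:
            return {label: 1 / len(scores) for label in scores}
        return {label: s / total for label, s in scores.items()}


# Invented training examples, for illustration only.
model = TfidfCentroidClassifier().fit(
    [
        "consent to the processing of personal data",
        "i give consent for personal data processing",
        "loan agreement between lender and borrower",
        "the borrower repays the loan with interest",
    ],
    ["consent", "consent", "loan", "loan"],
)
scores = model.predict_scores("c0nsent to the pr0cessing of personal data")
```

Even though OCR damaged two words in the last line, the remaining words still pull the document toward the `consent` class, which is exactly the advantage over searching for one exact phrase. The normalized scores are only rough stand-ins for the probabilities a trained model would output.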

Other challenges we faced: how to connect the technologies together and how to extract data from passports

In addition, there were two main difficulties in developing the product: choosing the right technologies and linking them together so that they reinforce each other, and recognizing text in passports.

We solved the first problem through various tests, and as a result we got a fairly complex product with a multi-stage cascade architecture. Some blocks are monolithic, while others are built as microservices.

Why did this happen? There are two services inside our product – one for document classification, and the other for extracting attributes from them. Although we are only talking about the first one now, the second one can work together with it if it is needed to perform specific tasks.

At the stage of recognizing text in passports, problems sometimes arose, since the printing there is not as even as in other documents. To handle poorly printed text, we began to use a CNN (convolutional neural network), which works with the pixels of the image and with the presence or absence of a certain color in them. And if one neural network fails to recognize the text, we try another.

CNN has the following operating algorithm:

Convolution: The network applies filters to the image, highlighting key features such as text areas, logos, and seals.

Subsampling: The network reduces the size of the data by keeping only the most important of the extracted features.

Classification: Based on these features, the network determines what type the document image belongs to.
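The first two steps can be shown on a toy example in pure Python. This is a sketch, not the production network: the 5x5 "image", the hand-written filter, and the function names are assumptions, and in a real CNN the filter values are learned during training. The final classification layer is omitted.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 5x5 toy "image": a bright vertical stroke (1) on a dark background (0).
image = [[0, 0, 1, 0, 0] for _ in range(5)]

# Filter that responds where brightness rises from left to right.
kernel = [[-1, 1], [-1, 1]]

features = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool(features)           # 2x2 summary of the strongest responses
```

Here every row of `features` is `[0, 2, -2, 0]`: the filter fires on the left edge of the stroke. Pooling then shrinks the map to `[[2, 0], [2, 0]]`, keeping only the strongest response in each window.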

The working principle of a convolutional neural network


But for some time we could not find the right parameters for the CNN. We experimented with the convolution settings, the kernel, the image size and some others. As a result, we managed to reach 96% accuracy on images, up from 89% before.

Development and initial testing

During the development process, we tried five different machine learning algorithms before we found the right one. We also kept adjusting its hyperparameters to find the most universal settings for the different types of incoming dossier documents, which vary in quality.

We followed a similar process when selecting and retraining the neural network algorithms for recognizing photos and images. At the end of the tests, we settled on a unique algorithm that, on the one hand, recognized dossier documents with 95-99% accuracy and, on the other hand, could be trained on a new type of document within 1-2 days.

What we managed to create

We continue to actively develop our product and add new functions to it. After all the improvements and testing, we achieved the following result compared to manual document processing: the time for processing a credit file decreased from 10-20 minutes to 1-2 minutes, and the percentage of errors in determining document types decreased from 10-20% to 1-2%.

We did not have any particular difficulties with implementing OCR and recognizing legal documents, since they are ordinary text, only formatted according to certain rules. The program recognizes them with almost complete accuracy, because we trained it for a long time on exactly those documents that lawyers and collectors need in their work.

Recently we also started extending the service so that it can classify and prepare documents for work with the electronic executive inscription of a notary. The basic work process is the same as when classifying files for submission to court, but the types of documents are slightly different.
