From labeling to implementation

Automating business processes often requires processing paper documents with a complex structure – for example, invoices, bills, and so on. A common scenario: a mailbox receives scans of paid invoices; the mailbox is parsed, and the invoice details and due dates are entered into the ERP. Parsing such documents manually, however, is a long and labor-intensive process. Artificial intelligence can offer a solution here.

In this article, we will take a closer look at our approach to developing a system that recognizes information from paper invoices using computer vision and machine learning technologies.

What you will learn:

  • Which approach we use to design complex computer vision pipelines.

  • How to organize continuous additional training of a model in production.

  • How to monitor model quality metrics.

Model training

A key role in the project is played by Datapipe, an open-source tool we use to build the data processing pipeline. When input data changes – whether new data is added or existing data is modified – Datapipe automatically tracks the change and recalculates only the steps whose inputs have changed and require updating. Each step of the pipeline is a function whose result is passed on to the next step. Thanks to this, we save computing time and can organize continuous additional training of the model in production, recalculating only the functions where it is really necessary.
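To make the idea of change tracking concrete, here is a minimal sketch of incremental recomputation in plain Python. This is not Datapipe's actual API – the Step class and the hash-based change detection are simplified assumptions for illustration only.

    import hashlib
    import json

    class Step:
        """A pipeline step that recomputes only when its input changes."""

        def __init__(self, name, fn):
            self.name = name
            self.fn = fn
            self._input_hash = None  # fingerprint of the last processed input
            self._output = None      # cached result of the last run

        def run(self, data):
            digest = hashlib.sha256(
                json.dumps(data, sort_keys=True).encode()
            ).hexdigest()
            if digest != self._input_hash:  # input changed -> recompute
                self._output = self.fn(data)
                self._input_hash = digest
            return self._output  # otherwise reuse the cached result

    # Chain steps: the output of one step becomes the input of the next
    steps = [
        Step("normalize", lambda rows: [r.lower() for r in rows]),
        Step("dedupe", lambda rows: sorted(set(rows))),
    ]

    def run_pipeline(data):
        for step in steps:
            data = step.run(data)
        return data

    print(run_pipeline(["Invoice A", "invoice a", "Invoice B"]))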

The model training pipeline is structured like this:

  1. Data marked up by the moderator in Label Studio is automatically loaded into the pipeline.

  2. The labeled data is added to the so-called “frozen dataset” – it no longer changes and is used to train ML models.

  3. Then we split the data into two parts:

    • Training set (train) – for training the model.

    • Validation set (val) – for checking and assessing the quality of the model.

  4. Images undergo transformation: each picture is divided into smaller segments (crops) to facilitate model training. Within each crop, even smaller regions are highlighted to improve detection and classification accuracy (steps 3 and 4 are sketched after this list).

  5. The models are trained sequentially. First, the object detection models (YOLOv5) are trained, and then OCR (Google Cloud Vision) is applied for text recognition.

  6. Sequential testing of the models. After each training run, predictions are made on the training and validation data, quality metrics are calculated, and the model with the best scores is deployed to the production environment.
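To illustrate steps 3 and 4, here is a hedged sketch of a deterministic train/val split and fixed-size image crops. The val_fraction, crop size, and stride are assumed values; the real pipeline's split ratio and crop geometry are project-specific.

    import random
    from pathlib import Path

    from PIL import Image

    def split_dataset(paths, val_fraction=0.2, seed=42):
        """Deterministically split labeled images into train and val sets."""
        paths = sorted(paths)
        random.Random(seed).shuffle(paths)
        n_val = int(len(paths) * val_fraction)
        return paths[n_val:], paths[:n_val]  # train, val

    def make_crops(image_path, crop_size=640, stride=512):
        """Slice one image into overlapping square crops for training."""
        img = Image.open(image_path)
        w, h = img.size
        crops = []
        for top in range(0, max(h - crop_size, 0) + 1, stride):
            for left in range(0, max(w - crop_size, 0) + 1, stride):
                crops.append(img.crop((left, top, left + crop_size, top + crop_size)))
        return crops

    train_paths, val_paths = split_dataset(Path("frozen_dataset").glob("*.jpg"))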

Now that the model has been trained and tested, it is ready for use in production. The process is designed to automatically process every new invoice that comes into the system and use a trained model to extract the necessary data.

It happens like this:

  1. New images arrive in the RabbitMQ queue. This is a kind of mailbox with data about new images.

  2. The daemon process monitors this queue. The program runs in the background, checking for new images; when a new message is detected, the daemon picks it up and submits the image for processing (a consumer sketch follows this list).

  3. Using the best model, the system makes a prediction.

  4. The prediction result is sent to the API. Other programs will be able to use this information to perform further actions.

  5. The image is removed from the queue.
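A minimal consumer sketch using the pika client for RabbitMQ. The queue name and the predict and post_result helpers are hypothetical placeholders for the model inference and the API call.

    import pika

    QUEUE = "invoices"  # assumed queue name

    def predict(image_bytes):
        ...  # run the best model on the image (hypothetical helper)

    def post_result(result):
        ...  # send the prediction to the downstream API (hypothetical helper)

    def on_message(channel, method, properties, body):
        result = predict(body)  # the message body carries the new image
        post_result(result)     # hand the prediction to the API
        channel.basic_ack(delivery_tag=method.delivery_tag)  # remove from queue

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()  # the daemon loop: block and wait for new messages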

Models in Action: Invoice Processing

1. Data collection

It all starts with an email containing a scan of a paid invoice. This image becomes the first step towards automatic information recognition.

Once the image is in the system, it is transferred to Label Studio, the moderator interface we chose for its ease of use. Here the original invoice image is displayed on screen and is then passed on for processing by the models of our system.

2. Invoice detection in the image

The first stage of processing is to detect the invoice itself in the image. Because photographs may contain extraneous objects such as hands, backgrounds, or other items, it is important to isolate only the invoice. The model's task is to determine the exact boundaries of the invoice in order to eliminate unnecessary details and concentrate on the desired area. Using the labeled data, we trained a YOLOv5 model to accurately determine the position of the invoice in the image, taking into account the rotation angle at which the photo was taken (0°, 90°, 180°, 270°).

When fed a new image, the YOLOv5 model analyzes it and makes a prediction for it:

  • Bbox – boundaries of an object in the image.

  • Class – what class each object belongs to. At this stage we are interested in the “Invoice” class.

  • Score – the probability that the object belongs to the specified class.

That is, the model highlights the detected invoice and indicates its level of confidence in the detection. After this, the system transfers the invoice's coordinates and score to the next processing stage.
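A hedged inference sketch using the public torch.hub interface to YOLOv5; the weights file best.pt and the image path are assumed examples.

    import torch

    # Load a custom-trained YOLOv5 model (the weights path is an assumed example)
    model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

    results = model("invoice_photo.jpg")  # run detection on a new image
    detections = results.xyxy[0]          # rows of [x1, y1, x2, y2, score, class]

    for x1, y1, x2, y2, score, cls in detections.tolist():
        print(f"class={int(cls)} score={score:.2f} "
              f"bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")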

3. Detection of key fields on the invoice

After determining the position of the invoice in the photo, the system proceeds to detect key fields on it.

Since invoices contain a lot of different information, the image is first divided into small areas (crops). This pre-separation improves the accuracy of key field detection. Each crop is then processed by the YOLOv5 model, which analyzes its contents and identifies areas of interest: dates, amounts, item names, and other important elements.

After processing all the crops, the system combines the results to form a complete picture of the recognized fields on the entire invoice.
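Combining per-crop detections back into full-image coordinates amounts to offsetting each box by its crop's origin. A minimal sketch; the tuple layout here is an assumption:

    def merge_crop_detections(crop_results):
        """Map per-crop boxes back to the coordinates of the full invoice image.

        crop_results: list of (crop_left, crop_top, detections), where each
        detection is (x1, y1, x2, y2, score, cls) relative to its crop.
        """
        merged = []
        for left, top, detections in crop_results:
            for x1, y1, x2, y2, score, cls in detections:
                merged.append((x1 + left, y1 + top, x2 + left, y2 + top, score, cls))
        return merged

In practice, overlapping crops yield duplicate boxes near crop borders, which are typically removed with non-maximum suppression before the results are passed on.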

4. Recognition of numeric and text data

Now that the system has identified and classified the key fields on the invoice, it is time to move on to the stage of text and numeric data recognition. Here we use Google Cloud Vision OCR, which is responsible for extracting all the necessary information.

Each highlighted area found in the previous steps is transferred to Google Cloud Vision OCR. This service analyzes the image and extracts text data: amounts, product names, invoice numbers and other important details.

Google Cloud Vision OCR does a good job of recognizing text even on complex backgrounds and in a variety of fonts, ensuring high accuracy and completeness of information extraction. The system then associates the resulting text and numeric values with the corresponding fields on the invoice.
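A minimal sketch of sending one cropped field image to Google Cloud Vision OCR. It assumes Google application credentials are configured in the environment, and the file name is an example.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()  # reads GOOGLE_APPLICATION_CREDENTIALS

    def ocr_field(image_bytes):
        """Extract text from one cropped field image."""
        image = vision.Image(content=image_bytes)
        response = client.text_detection(image=image)
        if response.error.message:
            raise RuntimeError(response.error.message)
        annotations = response.text_annotations
        return annotations[0].description if annotations else ""  # full text first

    with open("field_crop.png", "rb") as f:
        print(ocr_field(f.read()))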

5. Additional training of the model

The invoice recognition domain changes constantly: new invoice formats appear, fonts change, new fields are added, and the design of the invoices themselves is regularly updated. For the models to adapt effectively to these changes and continue to recognize data with high accuracy, they need to be retrained regularly.

We deliver to the customer not just a trained model, but a full-fledged pipeline with an automated additional-training process based on newly labeled data. This ensures continuous improvement of the model's performance as it adapts to constantly changing conditions in the working environment.

As the system operates, the moderator manually labels new data, which gradually accumulates in the dataset. When the volume of this data reaches the required level, the dataset is frozen and fed into the pipeline.

The Datapipe platform efficiently integrates new data by automatically processing only the parts that have changed. Thanks to this flexibility, new processing steps can be easily added to the pipeline and adapted to changing customer requirements. After each additional training cycle, updated metrics necessary to control the quality of the model’s performance are automatically calculated. This cycle continues, keeping the system up to date and accurate in all conditions.

6. Monitoring and evaluation of quality metrics

To assess the effectiveness of a model, it is necessary to regularly analyze its key metrics.

Analysis of key quality metrics includes calculating the following indicators (a sketch of computing them follows this list):

  • Precision and Recall to evaluate the accuracy and completeness of object detection.

  • F1-score as the harmonic mean of Precision and Recall shows how accurately and completely the model recognizes the data we are interested in. The closer it is to 1, the more accurate the prediction.

  • Weighted and macro F1-score for an overall assessment of the model's performance, taking into account the different weights of object categories and their distribution among classes. The weighted F1-score shows how well the model predicts when each class (type of recognized object) is weighted by its frequency. The macro F1-score evaluates the model's prediction accuracy for each class equally, regardless of frequency.
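These metrics can be computed with scikit-learn. A minimal sketch over made-up per-field labels; y_true and y_pred are hypothetical examples.

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Hypothetical per-object results: true field classes vs. model predictions
    y_true = ["date", "amount", "date", "item", "amount", "item"]
    y_pred = ["date", "amount", "item", "item", "amount", "date"]

    print("precision:  ", precision_score(y_true, y_pred, average="weighted", zero_division=0))
    print("recall:     ", recall_score(y_true, y_pred, average="weighted", zero_division=0))
    print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
    print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))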

For monitoring and visualizing metrics, we prefer the Metabase platform. It is well suited to tracking how the model's quality indicators change in real time; its dashboards chart the accuracy of the model's predictions over time.

Conclusion

Thus, our team has successfully developed an AI system for automatically recognizing information from paper invoices. We went from labeling data and setting up a pipeline to sequential training and continuous additional training of the models.

Many thanks to the team who worked on the project:

  • Alexander Kozlov, Lead ML Engineer;

  • Andrey Tatarinov, CEO/CTO Epoch8.co / AGIMA.AI.
