Help me read what is written here? (OCR)

Tons of archival papers, checks and invoices are scanned and digitized in many industries: retail, logistics, banking, and more. Companies gain a competitive advantage when they digitize quickly and find the information they need.

In 2020, my colleagues and I also had to solve the problem of high-quality document digitization, and on this project we worked together with the company Verigram… Here is how we digitized documents, using the example of a customer ordering a SIM card right from home.

Digitization allowed us to automatically fill out legal documents and applications for services, and also opened access to analytics of fiscal receipts, tracking price dynamics and total spending.

We use Optical Character Recognition (OCR) technology to convert various types of documents (scanned documents, PDF files or photos from a digital camera) into editable and searchable formats.

Working with standard documents: problem statement

For the user, ordering a SIM card looks like this:

  • the user decides to order a SIM card;

  • downloads the application;

  • photographs the identity card to automatically fill out the questionnaire;

  • the courier delivers the SIM card.

Important: the user photographs the ID with their own smartphone, which has its own camera resolution, image quality, architecture and other characteristics. At the output, we get a textual representation of the information in the uploaded image.

The purpose of the OCR project: build a fast and accurate cross-platform model that takes up little memory on the device.

The top-level processing sequence for a standard document image looks like this:

  1. The borders of the document are detected: the background we are not interested in is excluded and the perspective of the document image is corrected.

  2. The fields of interest are located: name, surname, year of birth, etc. For each field, a model then predicts the corresponding text.

  3. Post-processing: the predicted text is cleaned up.

Localization of document boundaries

The image of the document captured by the device camera is compared with a set of pre-prepared masks of standard documents: the front or back of an ID card, a new or old document, a passport page, or a driver’s license.

First, we pre-process the image: after a series of morphological operations, we obtain its binary (black-and-white) representation.
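
To make this step concrete, here is a minimal pure-Python sketch of binarization followed by a morphological closing. It assumes a grayscale image stored as a list of rows with values 0..255. The threshold of 128 and the 3x3 neighbourhood are illustrative assumptions, not the parameters we used in the project, and a real pipeline would normally rely on an image-processing library.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0..255) to binary: 1 = dark ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def _morph(binary, combine):
    """Apply a 3x3 morphological operation; pixels outside the image count as 0."""
    h, w = len(binary), len(binary[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                binary[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            ]
            row.append(combine(window))
        out.append(row)
    return out


def dilate(binary):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return _morph(binary, max)


def erode(binary):
    """A pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    return _morph(binary, min)


def closing(binary):
    """Dilation followed by erosion: fills small gaps inside strokes."""
    return erode(dilate(binary))
```

Closing is only one possible combination; which operations are chained, and in what order, depends on the documents being processed.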

The technique works like this: each type of document has fixed fields whose width and height do not change, for example the name of the document in the upper right corner, as in the picture below. They serve as anchor fields from which the distances to the other fields of the document are calculated. If the number of anchor fields detected for the mask being tested exceeds a certain threshold, we stop at that mask. This is how a suitable mask is selected.
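
The selection logic can be sketched as follows. This is a simplified illustration, not our production code: boxes are assumed to be (x, y, w, h) tuples normalized to 0..1, and the names `match_score` and `select_mask`, the tolerance and the threshold are all hypothetical.

```python
def match_score(detected, reference, tolerance=0.05):
    """Fraction of a mask's anchor fields that have a detected field close to them."""
    def close(a, b):
        return all(abs(p - q) <= tolerance for p, q in zip(a, b))

    hits = sum(any(close(ref, det) for det in detected) for ref in reference)
    return hits / len(reference)


def select_mask(detected, masks, threshold=0.8, tolerance=0.05):
    """Try the masks in order and stop at the first one whose score reaches the threshold."""
    for name, reference in masks.items():
        if match_score(detected, reference, tolerance) >= threshold:
            return name
    return None  # no mask fits: the document type is unknown


# Hypothetical anchor fields for two masks and two fields detected in a photo
masks = {
    "id_front": [(0.60, 0.05, 0.30, 0.08), (0.05, 0.30, 0.40, 0.06)],
    "id_back": [(0.10, 0.80, 0.80, 0.15), (0.05, 0.05, 0.20, 0.05)],
}
detected = [(0.10, 0.81, 0.79, 0.15), (0.06, 0.05, 0.20, 0.05)]
chosen = select_mask(detected, masks)
```

Returning `None` when nothing matches gives the application a clear signal to ask the user for another photo.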

This is what the selection of a suitable mask looks like

As a result of this stage:

  • the perspective of the image is corrected;

  • the type of document is determined;

  • the image is cropped according to the found mask with the background removed.

In our example, we determined that the uploaded photo is the front side of an identity card of the Republic of Kazakhstan of the design issued after 2014. Knowing the coordinates of the fields corresponding to this mask, we localize them and cut them out for further processing.
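
Cutting out the fields is then a matter of slicing. A minimal sketch, assuming the document image is already cropped and perspective-corrected, stored as a list of pixel rows, and that the mask stores field boxes as fractions of the image size (the field name and coordinates in the example are made up):

```python
def crop_fields(image, field_boxes):
    """Cut each field out of the document image.

    image: list of pixel rows; field_boxes: name -> (x, y, w, h) as fractions of image size.
    """
    height, width = len(image), len(image[0])
    crops = {}
    for name, (x, y, w, h) in field_boxes.items():
        left, top = round(x * width), round(y * height)
        right, bottom = round((x + w) * width), round((y + h) * height)
        crops[name] = [row[left:right] for row in image[top:bottom]]
    return crops


# A 10x20 test image whose pixel value encodes its position
image = [[r * 20 + c for c in range(20)] for r in range(10)]
crops = crop_fields(image, {"surname": (0.5, 0.2, 0.25, 0.1)})
```

Storing coordinates as fractions keeps the mask independent of the resolution of the user's camera.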

The next stage is text recognition. But first, here is how we collect the data for training the model.

Text recognition

Training data

We prepare data for training in one of the following ways.

The first method is used if there is enough real data. We select and label the fields using the CVAT annotation tool. As a result, we get an XML file with the names of the fields and their attributes. Returning to our example: to train a text recognition model, we feed it the localized field images together with their corresponding text, which is treated as ground truth.

But more often than not, there is not enough real data, or the resulting set does not cover the entire dictionary of symbols (for example, some letters like “ъ” or “ь” may not occur in real data). To get a large dataset at no cost and avoid annotator errors, you can create synthetic data with augmentation.

First, we render random text from the dictionary we are interested in (Cyrillic, Latin, etc.) on a white background, apply 2D transformations (rotations, shifts, scaling and their combinations) to each piece of text, and then glue the pieces into a word or a line of text. In other words, we synthesize images of text.
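
The two ingredients here, random text and a 2D transformation, can be sketched in a few lines. This is an illustration only: it works on point coordinates rather than rendered pixels, and the alphabet, length limits and transformation parameters are arbitrary assumptions.

```python
import math
import random


def random_text(alphabet, min_len=3, max_len=10, rng=random):
    """Draw a random string from the alphabet of interest."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def affine(angle_deg=0.0, scale=1.0, shift=(0.0, 0.0)):
    """2x3 matrix: rotate by angle_deg, scale uniformly, then shift."""
    a = math.radians(angle_deg)
    c, s = scale * math.cos(a), scale * math.sin(a)
    return [[c, -s, shift[0]], [s, c, shift[1]]]


def transform(matrix, points):
    """Apply the 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
text = random_text("АБВГДЕЖЗ", rng=rng)
# Corners of a hypothetical 20x10 text box: slightly rotated, enlarged and shifted
box = [(0, 0), (20, 0), (20, 10), (0, 10)]
corners = transform(affine(angle_deg=5, scale=1.1, shift=(2, -1)), box)
```

In a real generator the same matrix would be applied to the rendered image of the text, with the parameters sampled at random for every example.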

Examples of 2D transforms

A good example of 2D transformations is provided by the Python library Text-Image-Augmentation-python… An arbitrary image (on the left) is fed to the input, and various types of distortion can be applied to it.

Applying different types of distortion
Distortion, perspective and stretching of an image using the Text-Image-Augmentation-python library

After 2D transformation, composite augmentation effects are added to the text image: glare, blur, noise in the form of lines and dots, background, and more.
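
A few of these effects are simple enough to sketch in pure Python on a grayscale image stored as a list of rows (0 = black, 255 = white). The function names and default parameters are hypothetical, and glare and background substitution are left out for brevity.

```python
import random


def add_dot_noise(image, fraction=0.02, rng=random):
    """Set a fraction of randomly chosen pixels to black (0) or white (255)."""
    noisy = [row[:] for row in image]
    h, w = len(image), len(image[0])
    for _ in range(int(fraction * h * w)):
        noisy[rng.randrange(h)][rng.randrange(w)] = rng.choice((0, 255))
    return noisy


def add_line(image, row_index, value=0):
    """Draw a horizontal line across the whole image."""
    lined = [row[:] for row in image]
    lined[row_index] = [value] * len(image[0])
    return lined


def box_blur(image):
    """3x3 mean filter; border pixels average only the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                image[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out
```

Each function returns a new image, so the effects can be chained in a random order to produce many variants of the same text.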

Examples of images from the training set we built using augmentation

This is how a training sample can be created.

Training sample

Text recognition

The next step is to recognize the text of a standard document. We have already selected the mask and cut out the text fields. From here there are two options: segment the characters and recognize each one separately, or predict the entire text at once.

Character-by-character text recognition

This method builds two models. The first segments the letters: it finds the start and end of each character in the image. The second recognizes each character individually, after which all the characters are glued together.

Local text prediction without segmentation (end-to-end solution)

We used the second option – text recognition without segmentation into letters, because this method turned out to be less labor-intensive and more productive for us.

Ideally, the neural network outputs an exact copy of the text whose image is fed to the input. In practice, since the text in the image can be handwritten, distorted, stretched or compressed, characters in the model output can be duplicated.

The difference between the recognition results of a real and an ideal model

To get around the problem of duplicate characters, let’s add a special character, for example "-", to the dictionary. At the training stage, each textual representation is encoded with this character so that it can be decoded by the following rules:

  • consecutive repetitions of the same character that are not separated by the special character are collapsed into one;

  • the special character itself is then removed.
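
These two rules fit in a few lines of Python. The sketch below assumes "-" as the special character, which means "-" itself must not occur in the real texts; the function name `ctc_collapse` is ours for illustration.

```python
from itertools import groupby

BLANK = "-"  # the special character added to the dictionary


def ctc_collapse(path, blank=BLANK):
    """Decode a raw model output: merge consecutive repeats, then drop the special character."""
    merged = (ch for ch, _ in groupby(path))
    return "".join(ch for ch in merged if ch != blank)


decoded = ctc_collapse("hh-ee-ll-lloo")  # the "-" between the two l groups keeps the double letter
```

The special character is what makes genuinely doubled letters representable: "ll" alone collapses to "l", while "l-l" decodes to "ll".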

Thus, during training, an image is fed to the input and passes through the convolutional and recurrent layers, which produce a matrix of the probabilities of each symbol at each step.

Thanks to CTC encoding, the ground-truth text has many valid encoded representations, each with its own probability. The training objective is to maximize the sum of the probabilities of all representations of the ground-truth text. After the text is recognized and its representation is selected, the decoding described above is applied.
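
For a tiny example, this sum can be computed by brute force, which makes the objective easy to see. The sketch below enumerates every possible output sequence and adds up the probabilities of those that decode to the target. It is for illustration only: the enumeration grows exponentially with the number of steps, and real CTC implementations compute the same quantity with dynamic programming and minimize its negative logarithm.

```python
from itertools import groupby, product


def collapse(path, blank):
    """Merge consecutive repeats, then drop the special symbol."""
    return [k for k, _ in groupby(path) if k != blank]


def ctc_probability(probs, target, blank=0):
    """Sum the probabilities of all paths that decode to target.

    probs[t][k] is the probability of symbol index k at step t.
    """
    n_steps, n_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(n_symbols), repeat=n_steps):
        if collapse(path, blank) == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


# Two steps, two symbols: index 0 is the special symbol, index 1 is the letter "a"
probs = [[0.5, 0.5], [0.5, 0.5]]
p_a = ctc_probability(probs, [1])  # paths "aa", "-a" and "a-" all decode to "a"
```

Here three of the four equally likely paths decode to "a", so its probability is 0.75, while the empty text gets the remaining 0.25.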

Model architecture for text recognition

We tried training the model on different neural network architectures, with and without recurrent layers, following the scheme described above. In the end, we settled on a variant without recurrent layers. To speed up inference, we also borrowed ideas from the different versions of MobileNet. Our model graph looked like this:

Final model schema

Decoding methods

I want to highlight the two most common decoding methods: CTC_Greedy_Decoder and Beam_Search.

The CTC_Greedy_Decoder method takes, at each step, the index of the most probable symbol. After that, duplicate symbols and the special symbol specified during training are removed.
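
A minimal sketch of greedy decoding, assuming the model output is a list of per-step probability rows and that the special symbol "-" is part of the alphabet string (the function name and the example numbers are illustrative):

```python
from itertools import groupby


def ctc_greedy_decode(probs, alphabet, blank="-"):
    """Take the most probable symbol at each step, merge repeats, drop the special symbol."""
    best = [alphabet[max(range(len(step)), key=step.__getitem__)] for step in probs]
    return "".join(ch for ch, _ in groupby(best) if ch != blank)


alphabet = "-ab"
probs = [
    [0.10, 0.80, 0.10],  # "a"
    [0.20, 0.70, 0.10],  # "a" again, merged with the previous step
    [0.90, 0.05, 0.05],  # "-"
    [0.10, 0.20, 0.70],  # "b"
]
text = ctc_greedy_decode(probs, alphabet)
```

Greedy decoding is fast because it looks at each step in isolation, but for the same reason it can miss a text whose probability is spread over several different paths.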

The Beam_Search method is a beam search algorithm based on the principle that the next predicted symbol depends on the previously predicted symbols. At each step it keeps several of the most probable candidate texts, maximizes the conditional probabilities of character occurrence, and outputs the resulting text.
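
Below is a sketch of one common variant, CTC prefix beam search without a language model. It is not necessarily the exact decoder we used; the function name, the beam width and the example are assumptions. For every candidate text it tracks two probabilities, ending in the special symbol or not, so that paths decoding to the same text are merged.

```python
from collections import defaultdict


def ctc_beam_search(probs, alphabet, beam_width=3, blank=0):
    """Return the most probable text found by CTC prefix beam search."""
    # prefix (tuple of symbol indices) -> [P(ends in blank), P(ends in non-blank)]
    beams = {(): (1.0, 0.0)}
    for step in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(step):
                if k == blank:
                    # A blank leaves the text unchanged
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == k:
                    # Same symbol again: merged unless a blank came in between
                    nxt[prefix][1] += p_nb * p
                    nxt[prefix + (k,)][1] += p_b * p
                else:
                    nxt[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(pr) for prefix, pr in ranked[:beam_width]}
    best = max(beams, key=lambda prefix: sum(beams[prefix]))
    return "".join(alphabet[k] for k in best)


# Index 0 is the special symbol, index 1 is "a"
probs = [[0.6, 0.4], [0.6, 0.4]]
text = ctc_beam_search(probs, "-a")
```

In this example the special symbol is the most probable choice at both steps, so a greedy decoder would return an empty string (probability 0.36), while the beam search returns "a", whose three paths add up to 0.64.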


Post-processing

In production, the model may make mistakes on new data. Such cases need to be caught, or the user has to be warned that recognition failed and asked to retake the photo of the document. A simple post-processing routine helps here: it restricts the prediction for a specific field to a limited vocabulary. For example, a numeric field should return only digits.

Another example of post-processing is fields with a limited set of values, where the closest value is selected from a dictionary based on edit distance. Validation is one more example: the date of birth field cannot contain dates from the 18th century.
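
These three checks can be sketched as follows. The function names, the maximum edit distance of 2 and the age limit of 120 years are illustrative assumptions rather than our production settings.

```python
from datetime import date


def keep_digits(text):
    """Numeric field: drop every character that is not a digit."""
    return "".join(ch for ch in text if ch.isdigit())


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def closest_value(text, allowed, max_distance=2):
    """Dictionary field: pick the nearest allowed value, or None if nothing is close enough."""
    best = min(allowed, key=lambda value: edit_distance(text, value))
    return best if edit_distance(text, best) <= max_distance else None


def is_plausible_birth_date(d, today, max_age=120):
    """Validation: reject dates in the future or implausibly far in the past."""
    return date(today.year - max_age, 1, 1) <= d <= today
```

When `closest_value` returns `None` or the date check fails, the application can ask the user to photograph the document again instead of filling the form with a wrong value.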

Optimizing the model

Optimization techniques

The previous stage gave us a 600-kilobyte model, and recognition with it was too slow. We needed to optimize the model, focusing on faster text recognition and a smaller size.

The following techniques helped us in this:

  • Model quantization, which converts floating-point computations into faster integer computations.

An example of quantizing the ReLU function

  • Pruning unnecessary connections. Weights that are small in magnitude have little effect on the prediction and can be removed.

  • Mobile versions of neural network architectures, for example MobileNetV1 or MobileNetV2, which increase the speed of text recognition.
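
The arithmetic behind the first two techniques can be shown on a plain list of weights. This is a sketch of the idea only, not the TensorFlow tooling we used: it assumes a simple affine 8-bit scheme and magnitude-based pruning, and all names and parameters are illustrative.

```python
def quantize(values, num_bits=8):
    """Affine quantization of floats to unsigned integers.

    Returns (quantized values, scale, zero_point).
    """
    # Extend the range to include 0.0 so that zero is represented exactly
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale when all values are 0
    zero_point = round(-lo / scale)
    q = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


def prune(weights, fraction=0.5):
    """Zero out the given fraction of weights with the smallest magnitude.

    Weights tied with the cutoff value are zeroed as well.
    """
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)  # close to the originals, within one step
pruned = prune([0.9, -0.05, 0.4, 0.01], fraction=0.5)
```

Storing one byte per weight instead of four accounts for a fourfold size reduction on its own; pruning and a lighter architecture contribute the rest.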

As a result of optimization, quality dropped by only 0.5%, while speed increased 6 times and the size of the model shrank to 60 kilobytes.

Bringing the model to production

The process of bringing the model into production looks like this:

We create a 32-bit TensorFlow model, freeze it, and save it with additional optimizations such as weight or unit pruning. We then apply 8-bit quantization, compile the model into an Android or iOS library, and deploy it to the main project.


Recommendations

  • During the deployment phase, set static allocation of tensors in the model graph. For example, in our case, the speed doubled after we specified a fixed batch size.

  • Do not use LSTM and GRU networks for training on synthetic data, as they learn which characters occur together. In randomly generated synthetic data, the sequence of characters does not correspond to real text. In addition, these networks reduce speed, which matters on mobile devices, especially older ones.

  • Choose fonts carefully for your training set. Prepare a set of fonts that can render all the characters of your dictionary. For example, the OCR B Regular font is not suitable for a Cyrillic dictionary.

  • Try to train your own models, as open-source libraries may not fit your case. Before training our own models, we tried Tesseract and a number of other solutions. Since we were planning to deploy the library to Android and iOS, they were too large. In addition, their recognition quality was insufficient.
