Recognizing Hebrew text: my experience and solutions

Hello everyone! I would like to share a recent project where I developed a plugin for recognizing Hebrew text. The task was not easy, especially since the client had already tried Tesseract OCR and its recognition accuracy left much to be desired. In this article, I will tell you about the difficulties I encountered and how I overcame them.

Problems with Tesseract OCR

Tesseract is a fairly powerful OCR engine that supports many languages, including Hebrew. However, the client was disappointed with its performance on Hebrew: Tesseract misrecognized characters often enough to make the results almost useless. The main issues I ran into were:

  • Lack of a quality dataset. To effectively train OCR models, large amounts of data are needed, and good datasets for Hebrew are rare.

  • Lack of models in popular frameworks. Popular OCR frameworks such as MMOCR, EasyOCR, and PaddleOCR ship no Hebrew recognition models, even though they typically support many other languages out of the box.

My solution

I decided to develop my own model using PaddleOCR since this framework provides more options for customization and training.

Creating an artificial dataset

Since it was impossible to find a ready-made dataset, I created one myself. Here's what I did:

1. Collected a dictionary. I took a large Hebrew dictionary covering both standard and specialized terms.

2. Generated word images. I used the Python library PIL (Pillow) to generate word images with different fonts and backgrounds. This added variety and improved model training; a minimal sketch of this step follows Fig. 1.

Fig. 1. Artificial images with Hebrew
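As an illustration, here is a minimal sketch of that generation step with Pillow. The word list, font paths, and output paths are placeholder assumptions, and the string reversal is a naive stand-in for proper right-to-left shaping (which would need Pillow built with libraqm, or a library like python-bidi). The label file follows PaddleOCR's recognition format: one line per sample with a tab-separated image path and label.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Placeholder inputs: dictionary words and fonts that include Hebrew glyphs.
WORDS = ["שלום", "ספר", "מחשב"]
FONTS = ["fonts/NotoSansHebrew-Regular.ttf", "fonts/DavidLibre-Regular.ttf"]

def render_word(word: str, out_path: str) -> None:
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(24, 48))
    # Naive right-to-left handling: reverse the string before drawing.
    # Proper shaping (niqqud etc.) needs Pillow built with libraqm instead.
    display = word[::-1]
    # Size the canvas to the rendered text plus a small margin.
    left, top, right, bottom = font.getbbox(display)
    size = (right - left + 20, bottom - top + 20)
    # Vary the background brightness so the dataset is not uniformly white.
    img = Image.new("L", size, color=random.randint(200, 255))
    ImageDraw.Draw(img).text((10 - left, 10 - top), display,
                             font=font, fill=random.randint(0, 60))
    img.save(out_path)
    # PaddleOCR's recognition format: one "image_path\tlabel" line per sample.
    with open("data/train/labels.txt", "a", encoding="utf-8") as f:
        f.write(f"{out_path}\t{word}\n")

for i, word in enumerate(WORDS):
    render_word(word, f"data/train/word_{i}.png")
```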

3. Data augmentation. I applied various augmentation techniques – adding noise, changing brightness and contrast, rotations, and distortions. This increased the amount of data and made the model more robust; a sketch of such a pipeline follows Fig. 2.

Fig. 2. Examples of augmentations for text
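Here is a sketch of such a pipeline using the albumentations library; the choice of transforms and the parameter values are illustrative assumptions, not the exact settings from the project:

```python
import albumentations as A
import cv2

# A pipeline roughly matching the augmentations described above.
transform = A.Compose([
    A.GaussNoise(p=0.3),                                   # sensor-like noise
    A.RandomBrightnessContrast(brightness_limit=0.2,
                               contrast_limit=0.2, p=0.5), # lighting variation
    A.Rotate(limit=5, border_mode=cv2.BORDER_REPLICATE, p=0.5),  # slight tilt
    A.ElasticTransform(alpha=1, sigma=10, p=0.2),          # mild distortion
])

image = cv2.imread("data/train/word_0.png")  # placeholder path
augmented = transform(image=image)["image"]
cv2.imwrite("data/train/word_0_aug.png", augmented)
```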

Model training in PaddleOCR

Once I had created the synthetic dataset, I started training the model in PaddleOCR. Here are the main steps:

  1. Tuning the training parameters. I carefully selected hyperparameters such as the contrast and brightness ranges, the maximum rotation angle, and the amount of image noise, so that the model would generalize well and perform better on data it had not seen before. The batch size and number of epochs were also chosen to keep the training process fast.

  2. Using pre-trained models. I took pre-trained PaddleOCR models as a basis and fine-tuned them on my dataset. This significantly reduced training time and improved results; a sketch of launching such a run follows this list.

  3. Validation and testing. At each stage of training, the model was validated on a separate data set to monitor the process and avoid overfitting.
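For context, PaddleOCR training is driven by a YAML config plus the tools/train.py script in the PaddleOCR repository, with individual fields overridable via -o. A minimal sketch of launching a fine-tuning run from Python follows; the config path, checkpoint location, and override values are placeholder assumptions:

```python
import subprocess

# Assumes the PaddleOCR repo is cloned and this is run from its root, with a
# recognition config prepared; all paths and values below are placeholders.
subprocess.run(
    [
        "python3", "tools/train.py",
        "-c", "configs/rec/hebrew_rec.yml",
        "-o",
        # Start from a downloaded pre-trained recognition checkpoint.
        "Global.pretrained_model=./pretrain/rec/best_accuracy",
        # Character dict with one Hebrew character per line.
        "Global.character_dict_path=./hebrew_dict.txt",
        "Global.epoch_num=100",
        "Train.loader.batch_size_per_card=128",
    ],
    check=True,
)
```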

Results and future plans

After completing the training, I compared the new model with Tesseract OCR on the test dataset. The results were impressive: my model showed much better accuracy and robustness.
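To give an idea of how such a comparison can be scripted, here is a sketch measuring exact-match word accuracy for both engines. It assumes the paddleocr 2.x Python API, pytesseract with the heb language pack installed, placeholder model paths, and a test list in PaddleOCR's tab-separated label format:

```python
import pytesseract
from PIL import Image
from paddleocr import PaddleOCR

# Hypothetical test list: one "image_path\tlabel" line per sample.
with open("data/test/labels.txt", encoding="utf-8") as f:
    samples = [line.rstrip("\n").split("\t") for line in f]

# Point the recognizer at the fine-tuned model; paths are placeholders.
ocr = PaddleOCR(rec_model_dir="./inference/hebrew_rec",
                rec_char_dict_path="./hebrew_dict.txt",
                use_angle_cls=False)

def read_tesseract(path: str) -> str:
    return pytesseract.image_to_string(Image.open(path), lang="heb").strip()

def read_paddle(path: str) -> str:
    # det=False skips detection and recognizes the whole image as one crop.
    result = ocr.ocr(path, det=False, cls=False)
    return result[0][0][0].strip()  # first (text, confidence) pair

for name, read in (("Tesseract", read_tesseract), ("PaddleOCR", read_paddle)):
    correct = sum(read(path) == label for path, label in samples)
    print(f"{name}: {correct / len(samples):.1%} exact-match accuracy")
```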

Plans for the future

1. Dataset expansion. I will continue to increase the volume and diversity of the training data, using more sophisticated tools for generating synthetic data. I will also mix real examples of Hebrew text into the synthetic dataset.

Fig. 3. Artificially generated text on a real background

2. Model optimization. I plan to conduct additional experiments with the neural network architecture and hyperparameters to achieve maximum accuracy.

3. Integration and testing. Once the model is finalized, we will integrate the plugin into the client's workflows and test it extensively in real conditions.

This is how we managed to improve Hebrew text recognition step by step. I hope my experience will be useful to those facing similar problems. I break down more problems like this in my Telegram channel “Brains are askew”. Subscribe!
