Creating a Synthetic Dataset to Train a Model Using Paddle OCR

Hello, colleagues! We are continuing the topic of developing a plugin for recognizing Hebrew with Paddle OCR. Last time I forgot to introduce myself, so I'll do that in this post.
My name is Alexey, and I run a company that builds solutions with AI technologies. I'm involved in development myself, but I trust my team more – we've managed to assemble a team of great professionals. In this blog I plan to share stories from our joint work.

Let's get back to the topic of the article. Today we'll take a closer look at creating a synthetic dataset for training a model with Paddle OCR. This work was done by my colleague Alexander, an expert in computer vision.

When we faced the task of recognizing Hebrew text, it became clear that a ready-made dataset with the required characteristics was almost impossible to find. This prompted us to create our own dataset, which turned out to be not only useful but also a good opportunity to practice generating synthetic data. In this post we describe in detail how we approached the process.

Step 1: Data Collection

The first thing we did was find a large and diverse Hebrew dictionary. Using open sources, we compiled a database of over 350,000 words, from which we took every fifth word to speed up training. It was important to cover not only everyday vocabulary but also specialized terms that appear in specific niches: our task involved recognizing documentation, which is full of set phrases and domain-specific terms.
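
As a minimal sketch of that subsampling step (the file names here are hypothetical, and one word per line is assumed):

# Load the word list, one word per line (file name is illustrative)
with open("hebrew_words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

# Keep every fifth word to shrink the dataset and speed up training
subset = words[::5]

with open("hebrew_words_subset.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(subset))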

Step 2: Generating Word Images

To generate images from words, we used the Python library PIL (Python Imaging Library). This step was quite interesting, as it allowed us to create a variety of visual scenarios that the model would encounter in real life.

  1. Selecting fonts and text sizes: We picked several popular fonts and varied the text size to expose the model to a range of styles. Using fonts with different characteristics (such as serif and sans serif) helps the model cope better with real text.

  2. Background images: To make the images as realistic as possible, we used several background options, from plain to textured. For textured backgrounds we took images of paper, fabric, and other surfaces, imitating real-world conditions in which text might be printed or written.

  3. Text positioning: We generated several image variants with different margins between the text and the edges.

  4. Fonts: Hebrew has fonts with varying degrees of “decorativeness” and readability. We included both standard and more ornate ones to make the model more robust to different types of text.

  5. Augmentations: For more variety, augmentations such as blur, rotation, and noise were applied during training; a minimal sketch of these distortions follows right after this list.
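
PaddleOCR can apply this kind of augmentation itself through its training configuration, but purely as an illustration here is a minimal PIL/NumPy sketch of the distortions mentioned in point 5 (the function name and parameter values are our own assumptions):

from random import uniform

import numpy as np
from PIL import Image, ImageFilter

def augment(img, max_angle=5, max_blur=1.5, noise_std=10):
    # Small random rotation; expand the canvas so the text is not clipped
    img = img.rotate(uniform(-max_angle, max_angle), expand=True, fillcolor=(127, 127, 127))
    # Random Gaussian blur
    img = img.filter(ImageFilter.GaussianBlur(radius=uniform(0, max_blur)))
    # Additive Gaussian noise
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))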

Here is example code that uses PIL to generate an image of a word on a random background:

from random import randint

from PIL import Image, ImageDraw, ImageFont

def draw(p, text="אריאל", max_gap=5, font_size=50):
    fnt = ImageFont.truetype(p, font_size)
    size = fnt.getbbox(text)
    # Get the coordinates of the text bounding box,
    # where x1 and y1 are the left and top offsets
    x1, y1, x2, y2 = size
    # Randomly enlarge the margins on all sides
    x1 -= randint(0, max_gap)
    y1 -= randint(0, max_gap)
    x2 += randint(0, max_gap)
    y2 += randint(0, max_gap)
    # Create the image with a random background color
    img = Image.new('RGB', (x2 - x1, y2 - y1),
                    color=(randint(75, 150), randint(75, 150), randint(75, 150)))
    d = ImageDraw.Draw(img)
    # Draw the text shifted by (-x1, -y1) so the empty margins are ignored
    # and the whole text fits inside the image
    d.text((-x1, -y1), text, font=fnt, fill=(0, 0, 0))
    return img

img = draw("path_to_font.ttf", "שלום")

img.show()
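
To feed such images into training, each one also needs a text label. Below is a rough sketch, under our own assumptions about file layout, of how the draw function above could be used to batch-generate images and write a tab-separated label file (image path, a tab, then the text), the plain-text annotation format commonly used for PaddleOCR text-recognition training; the directory and file names are illustrative:

import os

def build_dataset(font_path, words, out_dir="train_images", label_path="rec_gt_train.txt"):
    os.makedirs(out_dir, exist_ok=True)
    with open(label_path, "w", encoding="utf-8") as labels:
        for i, word in enumerate(words):
            img = draw(font_path, word)
            # Save the rendered word and record an "image_path<TAB>text" line
            img_path = os.path.join(out_dir, f"{i:06d}.jpg")
            img.save(img_path)
            labels.write(f"{img_path}\t{word}\n")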

Results

We often have to deal with a lack of data, and such situations become a small challenge that forces us to find creative solutions. In the end, creating a synthetic dataset turned out to be not only necessary but also an extremely useful exercise for us. The model trained on this dataset showed excellent results in recognizing Hebrew text in a variety of scenarios.

If you too are struggling with a lack of suitable data to train your model, don’t be afraid to experiment and create your own datasets. This process will not only improve your skills, but also give you precise control over the quality and variety of data you use for training.

P.S.

And again I urge you to subscribe to the TG channel “Brains are askew”: https://t.me/+vpPjcjWrhgViYTA6
We plan to publish original content on similar topics there.

P.P.S. Leave a + in the comments if you spotted a dolphin on the cover. Leave a ++ if you're one of those who remember this pastime from childhood…
