Rapid development of a prototype HTR system on open data

2. Classify each individual symbol with a pretrained model.

This approach restricts the input data: the handwritten text must be written in unconnected letters.

This article presents a way to get results as quickly as possible using Google Colab as a platform for training the HTR model.

To create a prototype, we need a dataset with handwritten characters, for example, CoMNIST. Download it, unpack it and remove unnecessary symbols:

# download and unpack the CoMNIST dataset
! wget https://github.com/GregVial/CoMNIST/raw/master/images/Cyrillic.zip
! unzip Cyrillic.zip
# remove the folders with images of the letters "I" and "Ъ" (the latter tends to cause problems)
! rm -R Cyrillic/I
# the comment above names both folders, so "Ъ" is removed as well
! rm -R Cyrillic/Ъ
# example image from the dataset
from IPython.display import Image

Also, a model is needed to classify handwritten characters:

# Remove the preinstalled packages that are not suitable for running the algorithm
!pip uninstall keras tensorflow h5py -y
# Install the required dependencies
!pip install keras==2.2.5 tensorflow==1.14.0 h5py==2.10.0

It is advisable to restart the runtime after reinstalling keras and tensorflow. The files on the virtual machine are kept.

Next, the imports and the functions for training:

import os
import cv2
import time
from tqdm import tqdm
from PIL import Image, ImageFilter, ImageOps
import numpy as np 
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split

from tensorflow import keras
from keras.models import Sequential
from keras import optimizers
from keras.layers import Convolution2D, MaxPooling2D, Dropout, Flatten, Dense, Reshape, LSTM, BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam
from keras import backend as K
from keras.constraints import maxnorm
import tensorflow as tf

def emnist_model(labels_num=None):
    model = Sequential()
    model.add(Convolution2D(filters=32, kernel_size=(3, 3), padding='valid', input_shape=(28, 28, 1), activation='relu'))
    model.add(Convolution2D(filters=64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # flatten the pooled feature maps so the dense layers receive a 1-D vector
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(labels_num, activation='softmax'))
    model.compile(loss="categorical_crossentropy", optimizer="adadelta", metrics=['accuracy'])
    return model
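
As a quick sanity check of the architecture, the spatial size of the feature maps can be worked out by hand. The sketch below assumes 'valid' padding throughout (the Keras default when padding is not given) and a pooling stride equal to the pool size; the helper name conv_out is hypothetical. It shows how many values a flattened feature map would hold before the dense layers.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of a 'valid' convolution or pooling window."""
    return (size - kernel) // stride + 1

side = 28
side = conv_out(side, 3)            # first 3x3 convolution  -> 26
side = conv_out(side, 3)            # second 3x3 convolution -> 24
side = conv_out(side, 2, stride=2)  # 2x2 max pooling        -> 12

flat_features = side * side * 64    # 64 filters in the second convolution
print(side, flat_features)          # 12 9216
```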

def emnist_train(model, X_train, y_train_cat, X_test, y_test_cat):
    t_start = time.time()
    # Set a learning rate reduction
    learning_rate_reduction = keras.callbacks.ReduceLROnPlateau(monitor="val_acc", patience=3, verbose=1, factor=0.5, min_lr=0.00001)
    # validation_data is required so that the callback can monitor val_acc:
    model.fit(X_train, y_train_cat, validation_data=(X_test, y_test_cat), callbacks=[learning_rate_reduction], batch_size=64, epochs=9)
    print("Training done, dT:", time.time() - t_start)
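
To see what the learning rate reduction does with patience=3 and factor=0.5, here is a simplified pure-Python simulation. It is only a sketch of the idea, not the Keras implementation: cooldown is ignored, the starting rate of 1.0 and the accuracy values are made up, and the function name is hypothetical.

```python
def simulate_reduce_on_plateau(val_acc, lr=1.0, patience=3, factor=0.5,
                               min_lr=0.00001, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (simplified)."""
    best = float("-inf")
    wait = 0
    history = []
    for acc in val_acc:
        if acc > best + min_delta:      # validation accuracy improved
            best = acc
            wait = 0
        else:                           # no improvement in this epoch
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

print(simulate_reduce_on_plateau([0.50, 0.60, 0.60, 0.60, 0.60, 0.70]))
# [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
```

After three epochs without improvement the rate is halved, and it is never reduced below min_lr.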

Functions for working with images and augmentation:

def load_image_as_gray(path_to_image):
    img = Image.open(path_to_image)
    return np.array(img.convert("L"))

def load_image(path_to_image):
    img = Image.open(path_to_image)
    return img

def convert_rgba_to_rgb(pil_img):
    background = Image.new("RGB", pil_img.size, (255, 255, 255))
    background.paste(pil_img, mask = pil_img.split()[3])
    return background

def prepare_rgba_img(img_path):
    img = load_image(img_path)
    if np.array(img).shape[2] == 4:
      new_img = convert_rgba_to_rgb(img)
      return new_img
    return img

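# blur the images
for lett in os.listdir("Cyrillic/"):
  for l in os.listdir(f"Cyrillic/{lett}"):
    if l != ".ipynb_checkpoints":
      img = Image.open(f"Cyrillic/{lett}/"+l)
      blurImage = img.filter(ImageFilter.BoxBlur(15))
      # assumed save step: "blur_" is the prefix that the later loops filter on
      blurImage.save(f"Cyrillic/{lett}/blur_"+l)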

# rotate the images by +20 degrees
for lett in os.listdir("Cyrillic/"):
  for l in os.listdir(f"Cyrillic/{lett}"):
    if l != ".ipynb_checkpoints" and "blur_" not in l:
      img = Image.open(f"Cyrillic/{lett}/"+l)
      rotImage = img.rotate(20)
      # assumed save step: "rot20_" is the prefix that the next loop filters on
      rotImage.save(f"Cyrillic/{lett}/rot20_"+l)

# rotate the images by -20 degrees
for lett in os.listdir("Cyrillic/"):
  for l in os.listdir(f"Cyrillic/{lett}"):
    if ".ipynb_checkpoints" not in l and "rot20_" not in l and "blur_" not in l:
      img = Image.open(f"Cyrillic/{lett}/"+l)
      rotImage = img.rotate(-20)
      # assumed save step; the "rot-20_" prefix is a hypothetical name
      rotImage.save(f"Cyrillic/{lett}/rot-20_"+l)

# resize the images to 28x28
for lett in os.listdir("Cyrillic/"):
  for l in os.listdir(f"Cyrillic/{lett}"):
    if l != ".ipynb_checkpoints":
      img = Image.open(f"Cyrillic/{lett}/"+l)
      resized = img.resize((28, 28))
      # assumed save step: overwrite the file in place
      resized.save(f"Cyrillic/{lett}/"+l)

Note that the images in the CoMNIST dataset are RGBA PNG files in which the useful information is stored only in the alpha channel, so the alpha channel has to be transferred onto an RGB image:

# convert the images from RGBA to RGB
for lett in os.listdir("Cyrillic/"):
  for l in os.listdir(f"Cyrillic/{lett}"):
    if l != ".ipynb_checkpoints":
      rgb_img = prepare_rgba_img(f"Cyrillic/{lett}/"+l)
      # assumed save step: overwrite the file in place
      rgb_img.save(f"Cyrillic/{lett}/"+l)
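
What pasting with the alpha channel as a mask achieves can be shown on a tiny array. The sketch below is a NumPy illustration of compositing over a white background, assuming the RGB channels of the source are black and only alpha carries the strokes; the function name alpha_over_white is hypothetical, and rounding may differ slightly from Pillow.

```python
import numpy as np

def alpha_over_white(rgba):
    """Composite an RGBA array (H, W, 4, uint8) over a white background."""
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0
    out = rgb * alpha + 255.0 * (1.0 - alpha)
    return np.rint(out).astype(np.uint8)

# A 1x2 image with black RGB everywhere: one transparent and one opaque pixel.
pixels = np.array([[[0, 0, 0, 0], [0, 0, 0, 255]]], dtype=np.uint8)
result = alpha_over_white(pixels)
print(result[0, 0], result[0, 1])
```

The transparent pixel becomes white (background) and the opaque one stays black (ink), which is the dark-on-light form the rest of the pipeline works with.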


y_num = {l:i+1 for i, l in enumerate(np.unique(y_train))}
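
The snippets in this part use X_train, X_test, y_train and y_test, whose construction is not shown. One possible way to build them is sketched below, assuming each sub-folder of Cyrillic/ is named after its letter and that the images are already 28x28. The names list_samples and build_split, the 20% test share and the seed are hypothetical choices, not the author's.

```python
import os

def list_samples(root="Cyrillic"):
    """Return (path, label) pairs, using each sub-folder name as the label."""
    samples = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        # skip hidden entries such as Colab's .ipynb_checkpoints
        if label.startswith(".") or not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                samples.append((path, label))
    return samples

def build_split(root="Cyrillic", test_size=0.2, seed=42):
    """Load every image as grayscale and split into train and test parts."""
    import numpy as np
    from PIL import Image
    from sklearn.model_selection import train_test_split

    samples = list_samples(root)
    X = [np.array(Image.open(path).convert("L")) for path, _ in samples]
    y = [label for _, label in samples]
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)
```

With this sketch the four variables would be obtained as X_train, X_test, y_train, y_test = build_split().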

X_train = np.reshape(np.array(X_train), (np.array(X_train).shape[0], 28, 28, 1))
X_test = np.reshape(np.array(X_test), (np.array(X_test).shape[0], 28, 28, 1))

X_train = X_train.astype(np.float32)
X_train /= 255.0
X_test = X_test.astype(np.float32)
X_test /= 255.0

y_train_num = [y_num[i] for i in y_train]
y_test_num = [y_num[i] for i in y_test]

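# class indices start at 1, so the one-hot width is len(y_num) + 1 (same as the model output)
y_train_cat = keras.utils.to_categorical(np.array(y_train_num), len(y_num) + 1)
y_test_cat = keras.utils.to_categorical(np.array(y_test_num), len(y_num) + 1)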
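
Because the class indices in y_num start at 1, the one-hot vectors need one more column than there are letters, and column 0 is never used. A toy NumPy sketch of the same encoding follows; the labels are made up and np.eye stands in for to_categorical.

```python
import numpy as np

labels = ["Б", "А", "В", "А"]                      # toy labels, not real data
y_num_demo = {l: i + 1 for i, l in enumerate(np.unique(labels))}
num_classes = len(y_num_demo) + 1                  # +1 for the unused index 0

indices = np.array([y_num_demo[l] for l in labels])
one_hot = np.eye(num_classes, dtype=np.float32)[indices]

print(indices)        # class index of every sample
print(one_hot.shape)  # (4, 4): 4 samples, 3 letters + 1 unused column
```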


model = emnist_model(len(y_num)+1)
emnist_train(model, X_train, y_train_cat, X_test, y_test_cat)

Training in Google Colab on a VM without a TPU took the author about 40 minutes.

Next, the model can be tested on an image of handwritten text, for example:

Figure 1 - Image with handwritten text

Next is the code for splitting the image into individual characters. It works by thickening the strokes of the separate (unconnected) symbols and then finding their contours:

# split a line of text into individual letters
def letters_extract(image_file: str, out_size=28):
    img = cv2.imread(image_file)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, thresh = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    img_erode = cv2.erode(thresh, np.ones((3, 3), np.uint8), iterations=1)

    # Get contours
    contours, hierarchy = cv2.findContours(img_erode, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)

    output = img.copy()

    letters = []
    for idx, contour in enumerate(contours):
        (x, y, w, h) = cv2.boundingRect(contour)
        if hierarchy[0][idx][3] == 0:
            cv2.rectangle(output, (x, y), (x + w, y + h), (70, 0, 0), 1)
            letter_crop = gray[y:y + h, x:x + w]
            size_max = max(w, h)
            letter_square = 255 * np.ones(shape=[size_max, size_max], dtype=np.uint8)
            if w > h:
                y_pos = size_max//2 - h//2
                letter_square[y_pos:y_pos + h, 0:w] = letter_crop
            elif w < h:
                x_pos = size_max//2 - w//2
                letter_square[0:h, x_pos:x_pos + w] = letter_crop
            else:
                # the crop is already square, use it as is
                letter_square = letter_crop

            letters.append((x, w, cv2.resize(letter_square, (out_size, out_size), interpolation=cv2.INTER_AREA)))

    # Sort array in place by X-coordinate
    letters.sort(key=lambda x: x[0], reverse=False)
    return letters
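
Why erosion thickens the letters may not be obvious: on a light background, taking the minimum over each 3x3 neighbourhood spreads the dark pixels outward. The NumPy sketch below illustrates the principle only. The helper erode3x3 is hypothetical, and its border handling is not guaranteed to match cv2.erode.

```python
import numpy as np

def erode3x3(img):
    """Grayscale erosion with a 3x3 window: each pixel becomes the local minimum."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.min(shifted, axis=0)

# White 5x5 background (255) with a single dark pixel (0) in the centre.
img = np.full((5, 5), 255, dtype=np.uint8)
img[2, 2] = 0

eroded = erode3x3(img)
print((eroded == 0).sum())  # 9: the dark pixel has grown into a 3x3 block
```

Thicker strokes close small gaps inside a letter, so each letter is more likely to produce a single contour.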

The result of splitting the image into characters:

import matplotlib.pyplot as plt
%matplotlib inline
lttrs = letters_extract("test.png", 28)
plt.imshow(lttrs[0][2], cmap="gray")
Figure 2 - first character after splitting the image

The trained model returns class indices, so a dictionary (here, y_num) makes the output easier to interpret. Keep in mind that the letter “Ё” comes first in it: the labels are sorted by Unicode code point, and Ё sits outside the main alphabet block, before “А”. Next, the code to output the result:

def get_lettr(ind, y_num):
  back_y = {v:k for k, v in y_num.items()}
  return back_y[ind]

for i in range(len(lttrs)):
  img_arr = lttrs[i][2]
  img_arr = img_arr/255.0
  input_img_arr = img_arr.reshape((1, 28, 28, 1))
  result = model.predict_classes([input_img_arr])
  print(get_lettr(result[0], y_num))
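
The ordering of the labels can be checked directly. The sketch below uses a handful of letters rather than the real label set and builds the inverse dictionary once instead of on every call; the names y_num_demo and index_to_letter are hypothetical.

```python
letters = ["А", "Б", "Е", "Ё", "Я"]

# Sorting compares Unicode code points, and Ё (U+0401) precedes А (U+0410).
ordered = sorted(letters)
print(ordered)                        # ['Ё', 'А', 'Б', 'Е', 'Я']
print(hex(ord("Ё")), hex(ord("А")))   # 0x401 0x410

# Same numbering scheme as y_num, plus the inverse mapping built once.
y_num_demo = {l: i + 1 for i, l in enumerate(ordered)}
index_to_letter = {v: k for k, v in y_num_demo.items()}
print(index_to_letter[1])             # Ё
```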

The results for a pair of images:

Figure 3 - Demonstration of the results of the prototype HTR tool

Figure 3 shows the results of the tool. Obvious errors are marked in red, and a problem area, discussed below, is marked in yellow. The recognition accuracy is satisfactory. However, because of the way the splitting algorithm works, the letter “Ы” is split into two separate characters, “Ь” and “I”, and is passed to the classification model in that form. Spaces between words are also not recognized.
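
Since letters_extract already returns the x-coordinate and width of every character, spaces could be inferred from the horizontal gaps between neighbouring boxes. The sketch below is one possible heuristic, not part of the prototype: the function name and the 0.25 threshold are assumptions that would need tuning on real handwriting.

```python
def join_with_spaces(boxes, chars, gap_ratio=0.25):
    """boxes: (x, w) pairs sorted by x; chars: the predicted letter for each box."""
    text = ""
    for i, ((x, w), ch) in enumerate(zip(boxes, chars)):
        text += ch
        if i + 1 < len(boxes):
            gap = boxes[i + 1][0] - (x + w)   # horizontal distance to the next letter
            if gap > gap_ratio * w:           # wide gap: treat it as a word break
                text += " "
    return text

boxes = [(0, 10), (12, 10), (40, 10), (52, 10)]
print(join_with_spaces(boxes, ["д", "а", "н", "у"]))  # да ну
```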

Thus, a prototype HTR tool with satisfactory recognition quality was obtained in a short time. The prototype can be improved considerably, for example:

– change / expand the dataset of letters using fonts from the service handwritter.ru;

– use the augmentation and recognition methods described in the article “First Place at AI Journey 2020 Digital Peter”, including the method of splitting lines of connected handwriting into separate characters;

– develop and connect an algorithm for correcting spelling errors;

– use a hidden Markov model to predict the next character / word;

– train the model to recognize individual words / phrases, which can also be generated using handwritter.ru fonts.
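
As an illustration of the spelling-correction idea from the list above, a recognized word can be replaced by the closest entry of a vocabulary. The sketch uses difflib from the Python standard library; the function name, the toy vocabulary and the 0.6 cutoff are assumptions, and a real system would need a proper dictionary.

```python
import difflib

def correct_word(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the input if nothing is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

vocabulary = ["привет", "мир", "текст"]    # toy vocabulary, not a real dictionary
print(correct_word("привег", vocabulary))  # привет
print(correct_word("ъъъ", vocabulary))     # ъъъ (no close match, left unchanged)
```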
