Passport Data Generation for Model Training

NTA professional community.

To train a neural network, you need a dataset with enough training data. In ML model development, compiling a suitable dataset is often what takes most of the time and effort. When a dataset cannot be assembled from real data, synthetic data is generated instead. While developing a passport “recognizer” without enough real samples, we needed to generate passport data and the corresponding images of individual fields (series and number, date of issue, and so on).


Passport data format

Consider an example of the main spread of a passport (pages 2 and 3):

Passport fields cannot be generated independently of one another because they are interconnected. Besides the most obvious dependencies (the name depends on gender; the date of issue must be at least 14 years after the date of birth), there are others, for example:

the first two digits of the passport series are the OKATO code of the federal subject where the form was printed (usually the same as the region of issue) or the special series 09;
the first two digits of the subdivision code are the subject code, and the third digit takes values from 0 to 4;
passports without a machine-readable zone (MRZ) were issued from October 1, 1997 to June 30, 2011, and passports with an MRZ from July 1, 2011 to the present (the MRZ is the machine-readable record at the bottom of page 3, highlighted with a black frame on the passport image; more about it in the section on image generation);
the year the form was printed (the last two digits of the series) is in most cases the year of issue or the year before;
each value of the subdivision code field corresponds to up to 15 possible department names.
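The rules listed above can also be turned into a consistency check for generated records. The following is a minimal sketch, not the article's code: the function names, the argument formats (a 4-digit series string, a subdivision code with or without a hyphen), and the choice of which rules to check are assumptions made for illustration.

```python
import datetime


def add_years(d: datetime.date, years: int) -> datetime.date:
    """Shift a date by whole years (29 February falls back to 28 February)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def check_passport_rules(series: str, issue_date: datetime.date,
                         birthday: datetime.date, dep_code: str) -> list[str]:
    """Return the list of violated rules (empty if the record is consistent)."""
    problems = []
    # The date of issue must be at least 14 years after the date of birth.
    if issue_date < add_years(birthday, 14):
        problems.append("issued before the holder turned 14")
    # The last two digits of the series: year of issue or the year before.
    form_year = int(series[2:4])
    allowed = {issue_date.year % 100, (issue_date.year - 1) % 100}
    if form_year not in allowed:
        problems.append("form year does not match the issue year")
    # The third digit of the subdivision code takes values from 0 to 4.
    digits = dep_code.replace("-", "")
    if digits[2] not in "01234":
        problems.append("third digit of the subdivision code is out of range")
    return problems


ok = check_passport_rules("4511", datetime.date(2012, 3, 1),
                          datetime.date(1990, 5, 5), "770-001")
bad = check_passport_rules("4505", datetime.date(2012, 3, 1),
                           datetime.date(2000, 5, 5), "775-001")
```

A check like this is useful mainly as a regression test for the generator: every generated passport should come back with an empty list.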

Data generation

The following data was used for generation:

Based on this data and the identified rules, a Passport data-class object is generated. First, gender and the corresponding full name are randomly selected. The probability of choosing a first name, patronymic or last name is calculated taking into account its prevalence.
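For illustration, the Passport data class could look roughly like the sketch below. The field names follow the attributes used later in the article's code (surname, name, patronymic, series, number, birthday, issueDate, gender, codeDep); the field types and the sample values are assumptions.

```python
import datetime
from dataclasses import dataclass


@dataclass
class Passport:
    # Field order mirrors the positional constructor call in generateData.
    surname: str
    name: str
    patronymic: str
    series: str               # 4 digits: region code + year the form was printed
    number: str               # 6 digits
    birthday: datetime.date
    issueDate: datetime.date
    gender: str               # "M" or "F"
    codeDep: str              # subdivision code


# Hypothetical sample record, used only to show the shape of the object.
sample = Passport("ИВАНОВ", "ИВАН", "ИВАНОВИЧ", "4511", "123456",
                  datetime.date(1990, 5, 5), datetime.date(2012, 3, 1),
                  "M", "770001")
```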

Then the date of issue is selected from the interval between July 1, 2011 (the day the machine-readable zone, MRZ, was introduced) and the present day. The date of birth is then calculated (at least 14 years before the date of issue). A region is selected at random, and a series is generated from it and the date of issue. Finally, one of the Federal Migration Service departments in the selected region is chosen at random.

import random
import math

def generateData(n=None, batch=False, batchSize=1000) -> Passport | list[Passport]:
    # Omitted here: loading the data described above (full names, their
    # selection probabilities, the list of regions and FMS departments) and
    # the functions that generate individual passport elements (series,
    # number, etc.). The full code is linked below.
    gender = random.choice(["M","F"])
    if gender =="F":
        surname = random.choices(femaleSurnames,fSurnameProb)[0]
        name = random.choices(femaleNames,fNameProb)[0]
        midname = random.choices(femaleMidnames,fMidnameProb)[0]
    else:
        surname = random.choices(maleSurnames,mSurnameProb)[0]
        name = random.choices(maleNames,mNameProb)[0]
        midname = random.choices(maleMidnames,mMidnameProb)[0]
    issueDate = genIssueDate()
    birthday = genBirthday(issueDate)
    # Randomly select the region where the passport was issued
    dep = deps.sample(1).to_dict(orient="records")[0]
    series= genSeries(issueDate,dep)
    dfNeededCode = codes[codes["code"].str.match(f'{str(dep["ГИБДД"])}')]
    neededCode = dfNeededCode.code.to_list()
    selCode = random.choice(neededCode)
    number = genNumber()
    return Passport(surname,name,midname,series,number,
              birthday,issueDate,gender,selCode)

To generate a large number of passports (over 10 thousand) quickly, the function implements a batch mode. Generation is faster because gender and full names are sampled for the whole batch at once:

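# Fragment of generateData for batch mode; assumes `import time` at module level.
res = []
start = time.time()
batchNum = math.ceil(n/batchSize)
dep= deps.sample(n,replace=True).to_dict(orient="records")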
for i in range(batchNum):
    print(f"Batch № {i+1} started")
    gender = random.choice(["M","F"])
    if gender =="F":
        surname = random.choices(femaleSurnames,fSurnameProb,k=batchSize)
        name = random.choices(femaleNames,fNameProb,k=batchSize)
        midname = random.choices(femaleMidnames,fMidnameProb,k=batchSize)
    else:
        surname = random.choices(maleSurnames,mSurnameProb,k=batchSize)
        name = random.choices(maleNames,mNameProb,k=batchSize)
        midname = random.choices(maleMidnames,mMidnameProb,k=batchSize)
    for j in range(batchSize):
        issueDate = genIssueDate()
        birthday = genBirthday(issueDate)
        series= genSeries(issueDate,dep[i*batchSize+j])
        dfNeededCode = codes[codes["code"].str.match(f'{str(dep[i*batchSize+j]["ГИБДД"])}')]
        neededCode = dfNeededCode.code.to_list()
        selCode = random.choice(neededCode)
        number = genNumber()
        passport= Passport(surname[j],name[j],midname[j], series, number,birthday,issueDate,gender,selCode)
        res.append(passport)
        print(f"{i*batchSize+j+1} out of {n},elapsed time={time.time()-start}",flush=True,end="\r")
        if i*batchSize+j+1==n:
            break

As a result, we get realistic passport data (except for the place of birth). The full code is available here.

Machine-readable recording and image generation

Based on the generated data, we can move on to image generation. Before writing our own tools, we looked for ready-made ones and found a number of websites (and applications, which we did not download) of dubious legality. So that the resulting tool can be used only for generating synthetic data (and not for registering on marketplaces), it generates only separate fragments of the passport, on which the model is then trained.

The most interesting fragment for this task is the previously mentioned machine-readable zone (MRZ). The format appeared in the 1980s and is now present in most passports around the world. It is regulated by the international standard ISO/IEC 7501-1, which allows it to be used worldwide. The most common MRZ type (type 3), which is used in the passport of a citizen of the Russian Federation, consists of two lines of 44 characters each.

The allowed alphabet consists of Latin letters, digits, and the < symbol. Each letter of the Russian alphabet corresponds to exactly one character of this alphabet, which makes it possible to convert passport data to the MRZ unambiguously (and back).

[Table: correspondence between Russian letters (for example И, О, С, Щ, Ъ) and MRZ characters]
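A one-to-one transliteration like the one described above can be sketched with str.translate. The mapping below is the correspondence commonly documented for the Russian internal passport (several letters map to digits, for example Я to 8); it is reproduced from memory and should be checked against the official specification before use. The function name latinize_sketch is hypothetical and is not the article's latinize.

```python
# Assumed one-to-one mapping: 33 Cyrillic letters -> 33 MRZ characters.
CYRILLIC = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
MRZ_CHARS = "ABVGDE2JZIQKLMNOPRSTUFHC34WXY9678"

TO_MRZ = str.maketrans(CYRILLIC, MRZ_CHARS)
FROM_MRZ = str.maketrans(MRZ_CHARS, CYRILLIC)


def latinize_sketch(text: str) -> str:
    """Convert upper- or lower-case Cyrillic text to MRZ characters."""
    return text.upper().translate(TO_MRZ)


def delatinize_sketch(text: str) -> str:
    """Inverse conversion; possible because the mapping is one-to-one."""
    return text.translate(FROM_MRZ)


encoded = latinize_sketch("Юлия")
decoded = delatinize_sketch(encoded)
```

Because every Cyrillic letter maps to exactly one character, no digraphs are needed and the field lengths in the MRZ equal the lengths of the original names.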

In the passport of a citizen of the Russian Federation, the MRZ stores the contents of the fields “Date of issue”, “Subdivision code”, “Last name”, “First name”, “Patronymic”, “Gender”, and “Date of birth”, as well as the series and number. The fields “Passport issued by” and “Place of birth” are therefore absent from the MRZ. In addition, if the combined length of the last name, first name, and patronymic exceeds 36 characters (the 39 positions that follow the PNRUS prefix, minus 3 separator characters), the name is stored in the MRZ only partially. This case is handled by the handleLong function:

def handleLong(surname, name, patronymic):
    person = ""
    if len(surname) + len(name) + len(patronymic) > 36:
        if len(surname) > 34:
            person = f"{surname[:34]}<<{name[0]}<{patronymic[0]}"
        elif len(surname) + len(name) >= 36:
            lim = 37 - 2 - len(surname)
            person = f"{surname}<<{name[:lim]}<{patronymic[0]}"
        elif len(surname) + len(name) <= 35:
            limPatr = 39 - 2 - 1 - (len(surname) + len(name))
            person = f"{surname}<<{name}<{patronymic[:limPatr]}"
    elif len(surname) + len(name) + len(patronymic) == 36:
        person = f"{surname}<<{name}<{patronymic}"
    else:
        person = f"{surname}<<{name}<{patronymic}" + "<" * (
            36 - (len(surname) + len(name) + len(patronymic))
        )
        
    return person

In addition to the passport data itself, the MRZ contains 4 check digits (formally 5, but the expiration date is always filled with the < symbol, so its check digit is also always <). They are calculated modulo 10 with the repeating weights 7, 3, 1, 7, 3, 1, ... as follows:

Step 1. From left to right, multiply each digit of the corresponding numeric data element by the weight in the corresponding sequential position.

Step 2. Add up the results of each multiplication.

Step 3. Divide the resulting sum by 10 (the modulus).

Step 4. The remainder of the division is the check digit.

Check digits are calculated for positions 1-9 (series and number), 14-19 (date of birth), 29-42 (additional data elements: the last digit of the passport series, the date of issue, and the subdivision code), and 1-43 (the entire second line of the MRZ, including the previous check digits).
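The four steps above can be sketched as a single function. The text only describes numeric fields; the values used here for letters (A=10 ... Z=35) and for the filler < (0) follow the ICAO Doc 9303 convention and are included so that the function also works on fields containing letters. The names char_value and check_digit are illustrative and are not the article's checkAll.

```python
WEIGHTS = (7, 3, 1)  # repeating weight sequence


def char_value(ch: str) -> int:
    """Numeric value of one MRZ character."""
    if ch in "0123456789":
        return int(ch)
    if ch == "<":
        return 0
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    raise ValueError(f"character not allowed in the MRZ: {ch!r}")


def check_digit(field: str) -> int:
    """Weighted sum of the character values, modulo 10."""
    total = sum(char_value(ch) * WEIGHTS[i % 3] for i, ch in enumerate(field))
    return total % 10


# Worked example for the date 520727:
# 5*7 + 2*3 + 0*1 + 7*7 + 2*3 + 7*1 = 103, and 103 mod 10 = 3
date_check = check_digit("520727")
```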

Knowing all this, we can build the string representation of the MRZ from the previously generated passport data.

import datetime
import re

def formMRZ(surname: str,name: str,patronymic: str,serie: str | int,number: str | int,birthday: datetime.date,gender: str,issueDate: datetime.date,departament: str | int,) -> tuple[str, str]:
    '''Returns first and second lines in Russian National Passport implementation of MRZ,according to personal data provided'''

    topConst = "PNRUS"
    surname = re.sub("[-, ]", "<", surname)
    name = re.sub("[-, ]", "<", name)
    patronymic = re.sub("[-, ]", "<", patronymic)
    person = handleLong(surname, name, patronymic)
    topRow = topConst + person
    serieNumber = str(serie)[:-1] + str(number)
    birthdayMRZ = dateToString(birthday)
    issueMRZ = dateToString(issueDate)
    lastPart = f"{str(serie)[-1]}{issueMRZ}{departament}"
    checkSum1, checkSum2, checkSum3, finalCheckSum = checkAll(
        [serieNumber, birthdayMRZ, lastPart]
    )
    bottomRow = f"{serieNumber}{checkSum1}RUS{birthdayMRZ}{checkSum2}{gender}<<<<<<<{lastPart}<{checkSum3}{finalCheckSum}"
    return topRow, bottomRow
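The listing above relies on a dateToString helper that is not shown. Since MRZ dates are written as YYMMDD, it is presumably equivalent to the sketch below; the name date_to_mrz and the 6-digit subdivision code width are assumptions. The width table simply restates the f-string that builds bottomRow and confirms that the line is 44 characters long.

```python
import datetime


def date_to_mrz(d: datetime.date) -> str:
    """Format a date as YYMMDD, the date format used in the MRZ."""
    return d.strftime("%y%m%d")


# Widths of the parts of the second MRZ line, in order of appearance.
BOTTOM_ROW_WIDTHS = {
    "series (first 3 digits) + number": 9,
    "check digit 1": 1,
    "RUS": 3,
    "date of birth": 6,
    "check digit 2": 1,
    "gender": 1,
    "expiration date and its check digit (fillers)": 7,
    "last series digit + date of issue + subdivision code": 13,
    "filler": 1,
    "check digit 3": 1,
    "final check digit": 1,
}

birth_mrz = date_to_mrz(datetime.date(1990, 5, 5))
row_length = sum(BOTTOM_ROW_WIDTHS.values())
```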

To turn the text data into images, we used TextRecognitionDataGenerator (trdg), a synthetic data generator for OCR tasks. An important advantage of this tool is that any font in .ttf format can be added, which makes it possible to generate text in any language. It also lets you adjust blur, character spacing, and the background image type so that the generated images look as realistic as possible.

Consider the generation process using the MRZ as an example. Since it is regulated by an international standard, the font is known: OCR-B. Because the MRZ is located on the third (laminated) page of the passport, there may be glare on it. To reflect this in the resulting images, we created a folder with real images of the blank bottom part of the passport to use as backgrounds.

Background image example

To generate images, the GeneratorFromStrings class from trdg.generators is used with the following parameters:


import pandas as pd
import numpy as np
from PIL import Image
from trdg.generators import GeneratorFromStrings
from mrzCheck import formMRZ,latinize
from genPassportData import generateData

count = 10
passports  = generateData(count)
strings=[]
for passport in passports:
    print(passport)
    first,second=formMRZ(
            latinize(passport.surname),
            latinize(passport.name),
            latinize(passport.patronymic),
            passport.series,
            passport.number,
            passport.birthday,
            passport.gender,
            passport.issueDate,
            passport.codeDep,
        )
    strings.append(first)
    strings.append(second)


generator = GeneratorFromStrings(
    strings, # the MRZ text lines themselves
    count=len(strings), # number of lines
    fonts=["data/ocr-b.ttf"], # the font prescribed by the standard
    blur=1, # blur to simulate a scanned image
    character_spacing=5, # character spacing tuned for a final image size of 580 by 96
    background_type=4, # a random background image is taken from the folder
    image_dir="data/bg" # path to the folder with background images
)

Since the MRZ in the passport consists of 2 lines, the final image is assembled from three parts: the generated image of the first line, a blank image, and the generated image of the second line. For each MRZ, its true value is written to the labels.csv file for subsequent training:

mid = Image.open("data/MRZBack.jpg")
i = 1
rows = []  # label records; DataFrame.append was removed in pandas 2.0

for img, lbl in generator:
    if lbl[0] == "P":  # First MRZ line (it always starts with P)
        first = img
        prev = lbl
    else:
        # Stack vertically: first line, blank spacer, second line (same size)
        wholeImg = np.vstack([first, mid.resize(first.size), img.resize(first.size)])
        rows.append({"Filename": f"{i}.png", "Words": f"{prev}\n{lbl}"})
        im = Image.fromarray(wholeImg)
        im.save(f"output_images/test/{i}.png")
        i += 1
dfMrz = pd.DataFrame(rows, columns=["Filename", "Words"])
dfMrz.to_csv("output_images/test/labels.csv")

As a result, we obtain images that resemble scans of the bottom part of the third page of the passport containing the MRZ and can be used to train the model: