A simple DOCX template engine using the Smart Document Engine

At Smart Engines we build document recognition systems, and we decided to check how much time it takes to create an MVP of a tool that pre-fills standard DOCX templates with data extracted from scans and photos of documents. In this article, we show how to quickly build a simple template engine on top of our Smart Document Engine recognition system, which is ready to use and requires no prior training on the user's side. If that sounds interesting, read on!


How often does such a mundane task as filling out a contract, a certificate, or some other business document cause a lot of problems and inconvenience? It is one thing to copy a full name from a freshly scanned passport into a bank agreement. It is quite another to fill out several dozen documents of the same type, with the data coming from other documents, and to do it all quickly, without errors, and in the presence of an impatient customer.

Of course, all these operations can and should be automated. RPA systems, for example, can come to the rescue: the user only needs to perform the correct sequence of actions for loading and unloading data. But to introduce RPA in a company, you usually have to complete a whole large project: select the required components, configure them, teach users how to work with them, and so on. What if the user only needs to scan a set of standard documents and fill out ordinary Word documents with the correct details?

We'll show you how easy it is to put together a minimal yet functional template engine using Smart Document Engine and Python with several public packages. Everything is demonstrated using the macOS version of the SDK, but the same approach works on Windows and Linux as well.

Document recognition with Smart Document Engine

At the heart of the template engine is, of course, document recognition with Smart Document Engine. The library has a number of integration interfaces (the main C++ interface and a set of wrappers), but for maximum simplicity we implement the recognition function as a CLI application, based on the C++ example that ships with the SDK package.

To compile the docengine_sample console example supplied with the SDK, you need to insert the client signature, which is taken from the SDK documentation, into the code:

// Creating a session object - a main handle for performing recognition.
std::unique_ptr<se::doc::DocSession> session(
    engine->SpawnSession(*session_settings, "ABCDEFG…."));

After that, the example can be built (for example, with the build_cpp.sh script located next to it).

The docengine_sample console example takes three parameters: the path to the document image, the path to the configuration archive, and the mask of document types. Let's check recognition in the default mode on, for example, a payment order:

$ DYLD_LIBRARY_PATH=../../bin ./docengine_sample ../../testdata/rus.payment_order_sample.png ../../data-zip/bundle_docengine_photo.se "rus.payment_order*"
Smart Document Engine version 1.11.0
image_path = ../../testdata/rus.payment_order_sample.png
config_path = ../../data-zip/bundle_docengine_photo.se
document_types = rus.payment_order*

(... a lot of additional information is hidden ...)

    Text fields (35 in total):
        amount                    : 20 003 000-00
        amount_words              : ДВАДЦАТЬ МИЛЛИОНОВ ТРИ ТЫСЯЧИ РУБЛЕЙ 00 КОПЕЕК
        beneficiary               : ООО "МЕЧТА"
        beneficiary_account       : 11223344556677889900
        beneficiary_bank          : МЕЖДУНАРОДНЫЙ ЗАПАДНЫЙ БАНК
        beneficiary_bank_invoice  : 33344455566677788899
        bik_beneficiary           : 987654321
        bik_payer                 : 012345678

(... and so on ...)

For the purpose of creating a template engine, we slightly modify the application code:

1. In the configuration bundle bundle_docengine_photo.se, the default mode is optimized for photo recognition (in our demo application, this mode is used when recognizing documents from photos taken directly on the device). We switch the recognition session to the "universal" mode, which is more suitable when it is not known in advance whether a scan or a photo will be recognized (in the demo application, this mode is used when recognizing images from the gallery):

session_settings->SetCurrentMode("universal"); // switch to the universal mode
// For starting the session we need to set up the mask of document types
//     which will be recognized.
session_settings->AddEnabledDocumentTypes(document_types.c_str());

2. Remove all debug/information output and simplify the function OutputRecognitionResult so that it writes out the type and text fields of the recognized document in JSON format:

void OutputRecognitionResult(
    const se::doc::DocResult& recog_result) {
  if (recog_result.GetDocumentsCount() == 0) {
    printf("{}\n");
  } else {
    const se::doc::Document& doc = recog_result.DocumentsBegin().GetDocument();
    printf("{\"DOCTYPE\": \"%s\"", doc.GetAttribute("type"));
    for (auto f_it = doc.TextFieldsBegin();
         f_it != doc.TextFieldsEnd();
         ++f_it) {
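      // Escape double quotes and backslashes so that the value stays valid JSON
      // (requires #include <regex> and #include <string>).
      std::string escaped_value = std::regex_replace(
          f_it.GetField().GetOcrString().GetFirstString().GetCStr(),
          std::regex("([\"\\\\])"), "\\$1");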
      printf(",\"%s\": \"%s\"", 
          f_it.GetKey(),
          escaped_value.c_str());
    }
    printf("}\n");
  }
}

3. Rename the resulting source file to docengine_cli.cpp and move it to the directory that contains the dynamic library libdocengine.dylib (in my case, the bin directory of the SDK package). Then compile it with an rpath setting so that the executable looks for the library next to itself:

$ clang++ docengine_cli.cpp -O2 -I ../include -L. -l docengine -o docengine_cli -Wl,-rpath,"@executable_path"

Let's check (newlines have been added to the program output for readability):

$ ./docengine_cli ../testdata/rus.payment_order_sample.png ../data-zip/bundle_docengine_photo.se "rus.payment_order*"
{"DOCTYPE": "rus.payment_order.type1","amount": "20 003 000-00",
 "amount_words": "ДВАДЦАТЬ МИЛЛИОНОВ ТРИ ТЫСЯЧИ РУБЛЕЙ 00 КОПЕЕК",
 "beneficiary": "ООО \"МЕЧТА\"","beneficiary_account": "11223344556677889900",
 "beneficiary_bank": "МЕЖДУНАРОДНЫЙ ЗАПАДНЫЙ БАНК",
 "beneficiary_bank_invoice": "33344455566677788899",
 "bik_beneficiary": "987654321","bik_payer": "012345678",
 "budget_classification_code": "","code1": "0401060","code_payment": "",
 "date": "05.11.2020","date_document_payment_basis": "",
 "debiting_date": "05.11.2020","inn_beneficiary": "1111111111",
 "inn_payer": "1234567890","invoice_number": "98765432109876543210",
 "kpp_beneficiary": "222222222","kpp_payer": "125125125",
 "number_document_basis_payment": "","number_payment_order": "345",
 "oktmo_code": "","payer": "ИП \"ДОБРОЕ УТРО\"",
 "payer_bank": "ПУШКИНСКОЕ ОТДЕЛЕНИЕ БАНК \"ЗДОРОВЬЕ\"",
 "payer_bank_invoice": "12345678901234567890","payment_code": "0",
 "payment_reason_code": "","payment_type": "","place_payment": "8",
 "purpose_payment": "ОПЛАТА ПО ДОГОВОРУ №23456 ЗА ВЫПОЛНЕНИЕ СТРОИТЕЛЬНЫХ И ФУНКЦИОНАЛЬНЫХ РАБОТ ПО ИССЛЕДОВАНИЮ ОРГАНИЗМА. НДС НЕ ОБЛАГАЕТСЯ",
 "purpose_payment_1": "","receipt_date": "05.11.2020","tax_period": "",
 "type_payment": "","wage_type": "01"}
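As a quick sanity check, output of this shape can be parsed with Python's standard json module. The snippet below is only a minimal sketch: it uses a shortened copy of the output above as a string literal, and the variable names are illustrative rather than taken from the template engine code.

```python
import json

# A shortened copy of the docengine_cli output shown above (illustrative sample).
sample_output = (
    '{"DOCTYPE": "rus.payment_order.type1","amount": "20 003 000-00",'
    '"beneficiary": "ООО \\"МЕЧТА\\"","bik_payer": "012345678"}'
)

fields = json.loads(sample_output)

# The escaped quotes in the CLI output come back as plain quotes.
print(fields["DOCTYPE"])
print(fields["beneficiary"])
```

Note that all values stay strings, so a field such as bik_payer keeps its leading zero.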

Just what we need! Now let's move on to the template engine.

Template engine

What do we want? We want a simple GUI application that can load template documents in DOCX format with tags of the form ${very_important_info} placed in strategic locations, load images of the required documents, and save the document filled in with the extracted data.
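To make the tag syntax concrete, here is a small sketch that lists the tags found in a piece of template text. The regular expression, and in particular the set of characters allowed inside a tag name, is an assumption made for illustration; the helper find_tags is hypothetical and is not part of the template engine described here.

```python
import re

# Matches tags such as ${payer_name}; the allowed characters are an assumption.
TAG_PATTERN = re.compile(r"\$\{([A-Za-z0-9_]+)\}")

def find_tags(text):
    """Return the tag names found in text, in order of first appearance."""
    seen = []
    for name in TAG_PATTERN.findall(text):
        if name not in seen:
            seen.append(name)
    return seen

template_text = "Payer: ${payer_name}, bank: ${payer_bank_name}. Signed by ${payer_name}."
print(find_tags(template_text))
```

A helper like this could be used, for example, to warn about template tags that have no counterpart in the configuration.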

First of all, let's create a configuration file that specifies which CLI application to launch, which configuration bundle to use, which document type masks correspond to the document types we are interested in (say, the Russian payment order, the Russian certificate of income of an individual, and the Armenian social card), and how fields from different documents map to template tags.

Suppose we want to extract from the payment order the names of the payer and the payer's bank, the beneficiary's details, the amount in words, and the purpose of the payment. From the 2-NDFL form we extract the full name and date of birth (formally speaking, the certificate of income of an individual is no longer called 2-NDFL, but such a familiar term will not be easy to get rid of). Finally, from the Armenian social card we extract the full name in Armenian and the number itself. This is quite enough to demonstrate the capabilities of the template engine. The configuration file (config.json) turned out like this:

{
  "executable": "docengine_cli",
  "bundle": "bundle_docengine_photo.se",
  "sessions": {
    "rus_payment_order": {
      "documents_mask": "rus.payment_order*",
      "text": "payment order"
    },
    "arm_social_card": {
      "documents_mask": "arm.ref_public*",
      "text": "social card"
    },
    "rus_2ndfl": {
      "documents_mask": "rus.2ndfl*",
      "text": "income form"
    }
  },
  "tags": {
    "rus_payment_order:payer": "payer_name",
    "rus_payment_order:payer_bank": "payer_bank_name",
    "rus_payment_order:beneficiary": "beneficiary_name",
    "rus_payment_order:beneficiary_account": "beneficiary_account",
    "rus_payment_order:beneficiary_bank": "beneficiary_bank_name",
    "rus_payment_order:bik_beneficiary": "beneficiary_bik",
    "rus_payment_order:kpp_beneficiary": "beneficiary_kpp",
    "rus_payment_order:amount_words": "payment_amount",
    "rus_payment_order:purpose_payment": "payment_purpose",
    "rus_2ndfl:surname": "surname",
    "rus_2ndfl:name": "name",
    "rus_2ndfl:patronymic": "patronymic",
    "rus_2ndfl:birth_date": "birth_date",
    "arm_social_card:name_patronymic_surname": "arm_fio",
    "arm_social_card:public_service_number": "arm_number"
  }
}
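The keys of the "tags" section have the form session:field, and the values are the template tag names. The sketch below shows one way to read this mapping, using a small excerpt of the configuration and a made-up recognition result. The function name map_fields_to_tags is hypothetical and serves only as an illustration.

```python
def map_fields_to_tags(tags_config, session_name, recognized):
    """Translate recognized fields of one session into template tag values."""
    result = {}
    for source, tag_name in tags_config.items():
        # "rus_payment_order:payer" -> session "rus_payment_order", field "payer"
        session, _, field = source.partition(":")
        if session == session_name and field in recognized:
            result[tag_name] = recognized[field]
    return result

# An excerpt of the "tags" section from config.json.
tags_config = {
    "rus_payment_order:payer": "payer_name",
    "rus_payment_order:amount_words": "payment_amount",
    "rus_2ndfl:surname": "surname",
}
# A made-up recognition result with a single text field.
recognized = {"DOCTYPE": "rus.payment_order.type1", "payer": 'ИП "ДОБРОЕ УТРО"'}
print(map_fields_to_tags(tags_config, "rus_payment_order", recognized))
```

Fields that are not listed in the configuration, as well as configured fields that are missing from the result, are simply skipped.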

The configuration file is placed in the resources directory along with everything needed to run recognition: the configuration bundle bundle_docengine_photo.se, the executable file docengine_cli, and the library libdocengine.dylib.

For the template engine itself, we write a simple GUI application in wxPython. There is no point in going into details; suffice it to say that the whole thing took me about two hours (with no prior wx experience) and 292 lines of code. Let's look only at the procedures for recognizing an image and filling in a template.

In the GUI application, image recognition is triggered by pressing a button that corresponds to one of the recognition sessions specified in config.json. We ask the user to select a file with a document image, run docengine_cli using the subprocess module, and parse the JSON it prints. After that, we update the dictionary of tag values according to the tags defined in config.json:

def loadImage(self, event):
  '''
    Loads an image, recognizes the document, and updates the tag dictionary
  '''
  button_name = event.GetEventObject().GetName() # matches a key in the "sessions" dictionary of config.json
  self.tlog.AppendText('Loading image of %s...\n' % self.config['sessions'][button_name]['text'])

  with wx.FileDialog(self, 'Open %s image file' % self.config['sessions'][button_name]['text'], \
                     wildcard="PNG, JPG or TIF image (*.png;*.jpg;*.jpeg;*.tif;*.tiff)|*.png;*.jpg;*.jpeg;*.tif;*.tiff", \
                     style=wx.FD_OPEN | wx.FD_FILE_MUST_EXIST) as fileDialog:
    if fileDialog.ShowModal() == wx.ID_CANCEL:
      return

    pathname = fileDialog.GetPath()
    try:
      self.tlog.AppendText('Recognizing %s...\n' % pathname)
      # run docengine_cli
      output = subprocess.run([
        os.path.join(self.resources_path, self.config['executable']), # path to the docengine_cli executable
        pathname, # path to the image
        os.path.join(self.resources_path, self.config['bundle']), # path to the Smart Document Engine configuration bundle
        self.config['sessions'][button_name]['documents_mask'] # document type mask
      ], capture_output = True)

      # parse the docengine_cli output
      output_json = None
      try:
        output_json = json.loads(output.stdout)
      except Exception:
        pass

      if output_json is None:
        self.tlog.AppendText('Failed to retrieve any data.\n')
      else:
        # update the tag dictionary
        any_fields_extracted = False
        for tag in self.config['tags'].keys():
          if tag.split(':')[0] != button_name:
            continue
          prop_name = tag.split(':')[-1]
          if prop_name not in output_json.keys():
            continue
          prop_value = output_json[prop_name]
          self.keyval[self.config['tags'][tag]] = prop_value
          self.tlog.AppendText('Extracted %s: %s\n' % (self.config['tags'][tag], prop_value))
          any_fields_extracted = True

        if not any_fields_extracted:
          self.tlog.AppendText('No fields extracted.\n')

    except Exception as e:
      self.tlog.AppendText('Cannot process file %s: %s\n' % (pathname, str(e)))

Filling in the template is triggered by another button. The previously selected DOCX template is loaded using the python-docx package, and then all paragraphs of the document and all paragraphs of every cell of every table are scanned, with the tags found being replaced by the values extracted from the recognized documents. Most likely, the template filling could have been done in a simpler way, but I'm already in my pajamas:

def applyTagsToParagraph(self, paragraph):
  '''
    Applies the tag dictionary self.keyval to a single paragraph of a DOCX document, keeping the formatting of the run that contains the "$" character.
  '''
  for i in range(len(paragraph.runs)):
    while '$' in paragraph.runs[i].text:
      end_index = -1
      found_key = None
      composite_text=""
      for j in range(i, len(paragraph.runs)):
        composite_text += paragraph.runs[j].text
        for key in self.keyval.keys():
          if '${%s}' % key in composite_text:
            found_key = key
            end_index = j
            break
        if found_key is not None:
          break
      if found_key is not None:
        paragraph.runs[i].text = composite_text.replace('${%s}' % found_key, self.keyval[found_key])
        for k in range(i + 1, end_index + 1):
          paragraph.runs[k].clear()
      else:
        break
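The reason for the nested loops is that Word may split a single tag across several runs of a paragraph, so a tag cannot always be found by looking at one run at a time. The sketch below reproduces the same merging idea on a plain list of strings that stands in for paragraph.runs, so it can be tried without python-docx. The function apply_tags_to_runs is a hypothetical illustration and not part of the application code.

```python
def apply_tags_to_runs(runs, keyval):
    """Replace ${tag} markers in a list of run texts; tags may span several runs."""
    runs = list(runs)
    for i in range(len(runs)):
        while "$" in runs[i]:
            composite = ""
            match = None
            # Glue the following runs together until a complete known tag appears.
            for j in range(i, len(runs)):
                composite += runs[j]
                match = next((k for k in keyval if "${%s}" % k in composite), None)
                if match is not None:
                    break
            if match is None:
                break
            # The merged text goes into the first run; the absorbed runs are emptied.
            runs[i] = composite.replace("${%s}" % match, keyval[match])
            for k in range(i + 1, j + 1):
                runs[k] = ""
    return runs

print(apply_tags_to_runs(["Payer: ${pay", "er_name}", " (bank)"], {"payer_name": "ACME"}))
```

As in the method above, the merged text inherits the formatting of the first run, and a "$" that does not belong to any known tag is left untouched.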

def saveDocument(self, event):
  '''
    Loads the document template from self.template_path, applies the tag dictionary self.keyval to the document, and prompts the user to save the resulting document.
  '''
  if len(self.keyval) == 0:
    self.tlog.AppendText('Nothing to apply.\n')
    return

  self.tlog.AppendText('Applying values to template file %s:\n' % self.template_path)
  for k, v in self.keyval.items():
    self.tlog.AppendText('  %s: %s\n' % (k, v))

  document = docx.Document(self.template_path)

  # apply to the paragraphs of the document
  for paragraph in document.paragraphs:
    self.applyTagsToParagraph(paragraph)
  # apply to the tables of the document
  for table in document.tables:
    for row in table.rows:
      for cell in row.cells:
        for paragraph in cell.paragraphs:
          self.applyTagsToParagraph(paragraph)

  with wx.FileDialog(self, "Save DOCX file", wildcard="DOCX files (*.docx)|*.docx", \
                     style=wx.FD_SAVE | wx.FD_OVERWRITE_PROMPT) as fileDialog:

    if fileDialog.ShowModal() == wx.ID_CANCEL:
      return

    pathname = fileDialog.GetPath()
    # just in case, append the .docx extension and save the file without overwriting an existing one
    if not pathname.lower().endswith('.docx'):
      new_pathname = pathname + '.docx'
      while os.path.exists(new_pathname):
        new_pathname = new_pathname[:-5] + '-copy.docx'
      pathname = new_pathname
    try:
      document.save(pathname)
      self.tlog.AppendText('Saved to %s\n' % pathname)
    except IOError:
      self.tlog.AppendText('Cannot save to file %s\n' % pathname)

The template engine is ready! To run it like any other application, you can use a handy tool called pyinstaller. It lets you create a ready-made application for the target operating system, pack the resources directory inside it, and assign an icon:

$ pyinstaller -w docengine_templater.py --name="Docengine Templater" --add-data resources:resources -i docengine.icns
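When the application is packed this way, the resources directory is no longer guaranteed to sit next to the script. A common approach, sketched below, relies on the sys._MEIPASS attribute that PyInstaller sets in a packed application. The helper resources_dir is hypothetical, and the way the original application computes self.resources_path may well differ.

```python
import os
import sys

def resources_dir(script_file):
    """Return the resources directory for a frozen or a plain-script run."""
    # PyInstaller sets sys._MEIPASS to the folder holding the bundled data;
    # otherwise fall back to the directory of the script itself.
    base = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(script_file)))
    return os.path.join(base, "resources")

print(resources_dir(os.path.join("app", "docengine_templater.py")))
```

In the application itself, the argument would typically be __file__.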

Testing!

To test the templating engine, let’s create a simple .docx file using all the tags we previously added in config.json:

The template engine window after loading the template and recognizing three images:

Saved document:

That's about it! The template engine code (and the modified sample application docengine_cli.cpp) can be found here.

If you are interested in the Smart Document Engine product, you can learn more about it on our company website, or contact our specialists there for details.

Thank you for your attention!
