How we taught ABBYY FineReader PDF to edit whole paragraphs
Today we updated ABBYY FineReader 15 and released it under the name ABBYY FineReader PDF, because it brings together all the tools for working with PDF. To mark the occasion, we are publishing the first post in a series about the program's features. It covers one interesting capability that has been in the program for several months but that not everyone may know about.
How long has it been since you last opened a PDF? We bet it was recently. Your computer almost certainly holds a couple of scans, and perhaps a presentation layout, an analytical report or a technical manual. What do people usually do with these documents? According to an ABBYY survey, 62% of respondents search for information in PDFs, 60% copy text from them, and 52% edit them: they make changes to the file and correct errors and typos.
Even now, not everyone knows that text in a PDF can be edited. True, changing such files does not work like editing an ordinary text document. ABBYY FineReader PDF, with its multifunctional text editor for PDFs and scans, lets you quickly make changes directly in the PDF, without the tedious conversion of the file to other formats. As you edit, the text flows smoothly from line to line, as in MS Word. You can add or delete several words, change entire paragraphs, or even swap them.
In this post we reveal the technical details of editing multi-line text fragments in FineReader: how we changed the program's engine, how editing works under the hood and how it looks to the user. Let's go!
The PDF format is used all over the world: its contents look the same on any computer, smartphone or tablet, whatever the operating system. This is convenient and helps avoid awkward situations. For example, you write a text in MS Word and send it to your colleagues, they open it in LibreOffice or WordPad, the formatting shifts, and the fun begins. PDF is certainly more convenient in this respect, but text is where things get complicated. 70% of all existing PDF documents contain text; the other 30% do not, because they are images.
Let's start with PDFs that do contain text. To edit a PDF, you need to understand how text is stored in it. Have you ever opened a PDF in Notepad? If so, you saw something far from readable text: a long run of cryptic drawing instructions and binary data.
A lot of work is needed to turn all of this into something the user can read clearly.
Task: understand PDF
The content of each page in a PDF file is stored as streams of drawing commands for text, images or vector graphics. The structure of the file is defined by PDF objects, for example a page, a picture or a comment (paragraphs, lines of text and letters are just parts of an object). A character in a PDF is represented by a glyph. How the glyphs are written is determined by the font. Each character is stored separately: it has a font, a character code within that font and the coordinates of its position on the page. Where the glyphs end up is determined by the command stream. In addition, letters are combined into text runs, but these runs carry no semantic meaning.
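To make the idea of a command stream concrete, here is a minimal sketch in Python. The sample stream is hand-written for illustration, not taken from a real file, and the parser is deliberately simplified: it handles only the BT, ET, Tf, Td and Tj operators, ignores transformation matrices, hex strings and TJ arrays, and does not unescape string contents. The names TextRun and extract_runs are our own assumptions, not part of any library.

```python
import re
from dataclasses import dataclass

# Hand-written sample (assumption, not from a real file).
# BT/ET begin and end a text object, Tf selects a font and size,
# Td moves the start of the line, Tj shows a string.
SAMPLE_STREAM = """
BT
/F1 12 Tf
72 700 Td
(Hello) Tj
0 -14 Td
(world) Tj
ET
"""


@dataclass
class TextRun:
    font: str
    size: float
    x: float
    y: float
    text: str


# A token is either a literal string in parentheses or a run of
# non-space characters (numbers, names such as /F1, operators).
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/?[^\s()]+")


def extract_runs(stream: str) -> list[TextRun]:
    """Collect (font, size, position, text) for every Tj operator."""
    runs: list[TextRun] = []
    operands: list[str] = []
    font, size = "", 0.0
    x = y = 0.0
    for tok in TOKEN.findall(stream):
        if tok == "BT":            # a new text object resets the position
            x = y = 0.0
            operands.clear()
        elif tok == "Tf":          # operands: /FontName size
            font, size = operands[-2].lstrip("/"), float(operands[-1])
            operands.clear()
        elif tok == "Td":          # operands: dx dy, relative to line start
            x += float(operands[-2])
            y += float(operands[-1])
            operands.clear()
        elif tok == "Tj":          # operand: (string), parentheses stripped
            runs.append(TextRun(font, size, x, y, operands[-1][1:-1]))
            operands.clear()
        elif tok == "ET":
            operands.clear()
        else:                      # anything else is an operand
            operands.append(tok)
    return runs
```

Calling extract_runs(SAMPLE_STREAM) yields two independent runs, "Hello" at (72, 700) and "world" at (72, 686). Nothing in the stream says that they belong to the same paragraph, which is exactly the difficulty described above.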
PDF has none of the lines or paragraphs found in text-format documents. Even the order of the text is not always defined. In other words, you see text, but strictly speaking the text does not exist. What exists is a chaos of hard-to-read instructions, like those you see when opening the file in a text editor, that must be drawn at specific places in the document with the appropriate formatting.
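One simple way to recover lines from unordered glyphs is to group them by baseline. The sketch below is only an illustration under simplifying assumptions (horizontal left-to-right text, one column, a fixed tolerance); the Glyph type and the group_into_lines function are hypothetical names and do not describe how FineReader does it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # left edge on the page
    y: float  # baseline; PDF coordinates grow upwards


def group_into_lines(glyphs, tolerance=2.0):
    """Group glyphs whose baselines differ by at most `tolerance`.

    Returns the lines as strings, top of the page first.
    """
    lines = []
    # Visit glyphs from the top of the page to the bottom.
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= tolerance:
            lines[-1].append(glyph)   # same baseline: same line
        else:
            lines.append([glyph])     # baseline jumped: start a new line
    # Within a line, order glyphs from left to right.
    return [
        "".join(g.char for g in sorted(line, key=lambda g: g.x))
        for line in lines
    ]
```

For glyphs given in arbitrary order, such as "i" at (80, 700), "o" at (72, 686), "H" at (72, 700.5) and "k" at (80, 686), the function returns the lines "Hi" and "ok".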
“What about the text?” you ask.
The text in a PDF does exist, and it can even be edited. To make that possible, we teach our technology to understand the structure of the text, for example to identify and highlight lines. More on this below.
PDF libraries and how we changed them
To make it possible to edit entire paragraphs, we drastically changed our internal subsystem (library), which we call PdfTools. It opens PDF files, parses the command streams (that is, it works out where the text and the pictures are and recreates the structure of the document) and lets its callers manipulate this data: read it, modify it and save it back to PDF.
The PdfTools subsystem contains all the tools needed to read the contents and wrap them in objects (page, picture, comment) that are convenient for a program to work with. Our products, ABBYY FineReader PDF among them, already work with these objects.
How it was before. In FineReader 14 we could edit text only within a single line. After an edit, a “rendering” step was needed: placing the glyphs in their new positions.
In general, rendering means visualization. We use the word differently: for us it is the placement of objects in a PDF at their positions. For PDF professionals, it is a visualization that no one else sees. When we mean visualization in the usual sense, we say “rasterization”.
The whole process lived in the PdfTools subsystem. It helped us arrange the contents of the PDF into lines and edit them. For example, suppose the glyph “A” has to go into the fifth position. FineReader told PdfTools to put the glyph “A”, with the specified size and font, into the fifth position, and PdfTools inserted “A” and moved all the glyphs that followed it to their new places in the line. Single-line editing is fairly easy: the text simply shifts to the right or, if it is written in Hebrew or Arabic, to the left. This was enough for minor corrections, such as fixing a typo, but not for larger changes to the text of a PDF document.
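The single-line case can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PdfTools API: Glyph and insert_glyph are hypothetical names, x is taken to be the left edge of a glyph, and the line is stored in logical (reading) order.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width


def insert_glyph(line, index, char, width, rtl=False):
    """Insert a glyph at `index` and shift every following glyph.

    Followers move right for left-to-right text and left for
    right-to-left text. Returns a new list; the input is not modified.
    """
    if index < len(line):
        old = line[index]
        # The new glyph takes over the leading edge of the glyph it displaces.
        x = old.x + old.width - width if rtl else old.x
    elif line:
        last = line[-1]
        x = last.x - width if rtl else last.x + last.width
    else:
        x = 0.0
    shift = -width if rtl else width
    shifted = [Glyph(g.char, g.x + shift, g.width) for g in line[index:]]
    return line[:index] + [Glyph(char, x, width)] + shifted
```

Inserting “a” with width 6 at index 1 into the glyphs “c” (x = 0, width 5) and “t” (x = 5, width 4) puts “a” at x = 5 and moves “t” to x = 11. Nothing here can move a glyph to another line, which is why this approach stops at typo-sized fixes.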
What we decided to change. When the multi-line editing task came up, we realized that it would be hard to solve within the PdfTools library alone. We needed to learn how to find larger fragments in PDF text automatically: to “see” paragraphs, understand where their boundaries are, what formatting the whole fragment should have and what happens when text moves from one line to the next. To determine all of these parameters, we decided to use our other OCR technologies, Document Analysis (DA) and Synthesis, which can build the structure of a document.
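Once the boundaries of a paragraph are known, moving text from one line to the next becomes a reflow problem. The sketch below shows a plain greedy word wrap as an illustration only. It assumes every character has the same width, whereas a real implementation would take widths from font metrics, and the reflow function is a hypothetical name rather than part of FineReader.

```python
def reflow(words, max_width, char_width=1.0, space_width=1.0):
    """Greedy word wrap: fill each line until the next word no longer fits."""
    lines = []
    current = []  # words on the line being built
    used = 0.0    # width already taken on that line
    for word in words:
        word_width = len(word) * char_width
        needed = word_width if not current else used + space_width + word_width
        if current and needed > max_width:
            # The word does not fit: close the line and start a new one.
            lines.append(" ".join(current))
            current, used = [word], word_width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(" ".join(current))
    return lines
```

With a width of 16 characters, the words of “the counterparty name became longer” wrap into three lines: “the counterparty”, “name became” and “longer”. Editing any word only requires running the wrap again over the paragraph.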
Document Analysis and Synthesis
To identify blocks in the text, ABBYY FineReader PDF uses Document Analysis technology. It finds paragraphs, tables and pictures. The program outlines the blocks it has found with thin, pale frames, so that it is easier for the user to make changes.
Next, we improved another subsystem of our program, Synthesis. We have already explained on Habr why it is needed. In short, it determines the structure and all the characteristics of the recognized text: which fonts and sizes are used, which styles (bold, italic, underline), where the headings, lists and indents are, and many other parameters that can be set in MS Word. We modified Synthesis so that, when recognizing and re-creating a page, it restores the original parameters of the text very accurately.
Underlined text features
PDF has no “underline” text attribute of the kind familiar to MS Word users. An underline in a PDF is a vector graphic that has nothing to do with the text. Without additional work on the product, editing “underlined” text would move the characters as usual while the lines that mark the underline stayed where they were. ABBYY FineReader PDF can detect underlined text and edit it in the way the user expects.
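One plausible way to link such vector lines to text is a purely geometric test: a horizontal stroke that lies just below a line's baseline and within its horizontal extent is treated as its underline. The sketch below illustrates this idea only. TextLine, Stroke, find_underlines and the thresholds are assumptions made for the example and do not describe FineReader's actual heuristics.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    x0: float        # left edge
    x1: float        # right edge
    baseline: float  # PDF coordinates grow upwards


@dataclass
class Stroke:
    x0: float
    x1: float
    y: float  # a horizontal vector line


def find_underlines(lines, strokes, max_gap=3.0, slack=0.5):
    """Map the index of each text line to the strokes that underline it."""
    result = {}
    for i, line in enumerate(lines):
        for stroke in strokes:
            # The stroke sits at or slightly below the baseline ...
            below = 0.0 <= line.baseline - stroke.y <= max_gap
            # ... and does not stick out beyond the text horizontally.
            inside = (stroke.x0 >= line.x0 - slack
                      and stroke.x1 <= line.x1 + slack)
            if below and inside:
                result.setdefault(i, []).append(stroke)
    return result
```

Once a stroke is attached to a line in this way, an editor can move or resize it together with the glyphs instead of leaving it behind.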
Editing tables in PDF
Editing tables has also changed. Previously, the program “saw” a table as separate lines and edited it as such. Now, when working with tables, ABBYY FineReader PDF determines the contents of each cell and can extract the text from it and work with it. This is convenient when you need to fix a wrong digit or change a period to a semicolon while keeping the structure of the table, and to do it quickly, without converting the PDF document to other formats.
How to edit a scan?
Multi-line editing is also available for scans. The user does not even need to think about whether the document is a scan or not: ABBYY FineReader PDF determines this itself and starts the necessary mechanisms. For example, there may be a typo in the contract date, or the name of the counterparty may have changed: it has become longer and should “flow” onto the next line.
The program first recognizes the scan and then prepares it for editing. The recognized text is placed not in the original document but in its virtual “double”, and all editing operations take place in that double.
When the user has finished editing the document, the program automatically collects all the changes from the page and replaces the corresponding fragments in the original document. Our task is to embed the text back into the PDF document without damaging anything else that is already in it.
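The “collect the changes, then replace only those fragments” idea can be illustrated with plain lists of strings and Python's standard difflib module. This is a sketch under a strong simplification: a page is modelled as a list of text lines, while the real program works with PDF objects. The function names changed_fragments and apply_fragments are hypothetical.

```python
import difflib


def changed_fragments(original, edited):
    """Return (start, end, replacement) for every changed range of lines."""
    matcher = difflib.SequenceMatcher(a=original, b=edited, autojunk=False)
    return [
        (i1, i2, edited[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_fragments(original, fragments):
    """Replace only the changed ranges and leave everything else untouched."""
    result = list(original)
    # Apply from the end so that earlier indices stay valid.
    for start, end, replacement in sorted(fragments, key=lambda f: f[0], reverse=True):
        result[start:end] = replacement
    return result
```

If only the date line differs between the original page and its edited double, changed_fragments reports a single fragment covering that line, and apply_fragments rewrites that line alone.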
Editing a scan directly saves the time otherwise spent converting the document to another format and back. This is convenient when you need to quickly fix a date or another piece of text that was overlooked.