Demystifying PDF Parsing: Pipelining

Review, implementation methods and conclusions

Converting unstructured documents like PDFs and scanned images into structured or semi-structured formats is an important part of artificial intelligence. However, due to the intricate nature of PDFs and the complexity of the tasks involved in PDF parsing, the process may not seem so obvious at first glance.

This series of articles is dedicated to demystifying PDF parsing. In the previous article, we described the main task of PDF parsing, classified the existing methods, and gave a brief description of each.

In this article, we will focus on the pipeline approach. We will start with an overview of the method itself, then demonstrate several strategies for its implementation using ready-made frameworks specialized for this task, and finally analyze the results obtained.

Review

The pipeline approach considers the task of parsing PDF files as shown in Figure 1.

Figure 1: General algorithm of the pipeline approach. Image by the author.

The pipeline approach can be divided into the following five stages:

  • Preprocessing PDF files to correct problems such as blurriness or skewed page orientation. This step includes image enhancement, position correction, etc.

  • Conducting a layout analysis, which can be divided into two stages: visual and semantic structure analysis. The first reveals the structure of the document and highlights similar areas, while the second marks these areas with specific types, such as text, heading, list, table, figure, etc. This stage also determines the page reading order.

  • Recognizing the content of each region identified during layout analysis. This includes table recognition, text recognition, and the identification of other components such as formulas, flow charts, and special symbols.

  • Reproducing the structure of a document page based on previously obtained results.

  • Outputting structured or semi-structured information in a format such as Markdown, JSON, or HTML.
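
To make the stages concrete, here is a schematic sketch of such a pipeline as plain Python functions. Every name below is illustrative only; it is not the API of any particular framework.

from typing import Any, List


def preprocess(path: str) -> List[Any]:
    """Stage 1: load pages, enhance scanned images, fix skewed orientation (stub)."""
    ...


def analyze_layout(pages: List[Any]) -> Any:
    """Stage 2: detect visual regions, assign semantic types, determine reading order (stub)."""
    ...


def recognize_regions(layout: Any) -> Any:
    """Stage 3: run OCR, table recognition, and formula recognition on each region (stub)."""
    ...


def rebuild_structure(regions: Any) -> Any:
    """Stage 4: reassemble the page structure from the recognized regions (stub)."""
    ...


def render(document: Any, fmt: str = "markdown") -> str:
    """Stage 5: serialize the result to Markdown, JSON, or HTML (stub)."""
    ...


def parse_pdf(path: str, fmt: str = "markdown") -> str:
    return render(rebuild_structure(recognize_regions(analyze_layout(preprocess(path)))), fmt=fmt)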

Next, we will look at several PDF parsing frameworks. This will give us a concrete picture of the pipeline approach and lead to the conclusions discussed at the end.

Marker

Marker is a pipeline based on deep learning models that converts PDF, EPUB, and MOBI documents to Markdown.

General process

As shown in Figure 2, the process of working with Marker is divided into the following four stages:

Figure 2: The Marker pipeline. Image by the author.

Step 1: Split the pages into blocks and extract the text using PyMuPDF and OCR. The corresponding code:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    doc = pymupdf.open(fname, filetype=filetype)
    if filetype != "pdf":
        conv = doc.convert_to_pdf()
        doc = pymupdf.open("pdf", conv)


    blocks, toc, ocr_stats = get_text_blocks(
        doc,
        tess_lang,
        spell_lang,
        max_pages=max_pages,
        parallel=int(parallel_factor * settings.OCR_PARALLEL_WORKERS)
    )

Step 2: Use the layout segmenter to classify the individual blocks and order them with the column detector. The corresponding code:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Unpack the models from the list


    texify_model, layoutlm_model, order_model, edit_model = model_lst


    block_types = detect_document_block_types(
        doc,
        blocks,
        layoutlm_model,
        batch_size=int(settings.LAYOUT_BATCH_SIZE * parallel_factor)
    )


    # Find headers and footers


    bad_span_ids = filter_header_footer(blocks)
    out_meta["block_stats"] = {"header_footer": len(bad_span_ids)}


    annotate_spans(blocks, block_types)


    # Dump debug data if the corresponding flags are set


    dump_bbox_debug_data(doc, blocks)


    blocks = order_blocks(
        doc,
        blocks,
        order_model,
        batch_size=int(settings.ORDERER_BATCH_SIZE * parallel_factor)
    )
    ...
    ...

Step 3: Filter out headers and footers, fix code and table blocks, and apply the Texify model to formulas. The corresponding code looks like this:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Fix code blocks
    code_block_count = identify_code_blocks(blocks)
    out_meta["block_stats"]["code"] = code_block_count
    indent_blocks(blocks)


    # Fix tables
    merge_table_blocks(blocks)
    table_count = create_new_tables(blocks)
    out_meta["block_stats"]["table"] = table_count


    for page in blocks:
        for block in page.blocks:
            block.filter_spans(bad_span_ids)
            block.filter_bad_span_types()


    filtered, eq_stats = replace_equations(
        doc,
        blocks,
        block_types,
        texify_model,
        batch_size=int(settings.TEXIFY_BATCH_SIZE * parallel_factor)
    )
    out_meta["block_stats"]["equations"] = eq_stats
    ...
    ...

Step 4: Post-process the text using the editor model. The corresponding code:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Copy to avoid modifying the original data
    merged_lines = merge_spans(filtered)
    text_blocks = merge_lines(merged_lines, filtered)
    text_blocks = filter_common_titles(text_blocks)
    full_text = get_full_text(text_blocks)


    # Handle joined empty blocks
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    full_text = re.sub(r'(\n\s){3,}', '\n\n', full_text)


    # Replace bullet characters with -
    full_text = replace_bullets(full_text)


    # Post-process the text with the editor model
    full_text, edit_stats = edit_full_text(
        full_text,
        edit_model,
        batch_size=settings.EDITOR_BATCH_SIZE * parallel_factor
    )
    out_meta["postprocess_stats"] = {"edit": edit_stats}


    return full_text, out_meta
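
For completeness, a minimal usage sketch is shown below. It follows the Marker README of that period; load_all_models and the exact return value are assumptions that should be checked against the installed version.

from marker.convert import convert_single_pdf
from marker.models import load_all_models

# Load the layout, ordering, texify, and editor models once, then convert a file.
# load_all_models is assumed from the project README; verify against your version.
model_lst = load_all_models()
full_text, out_meta = convert_single_pdf("paper.pdf", model_lst, max_pages=10)

with open("paper.md", "w", encoding="utf-8") as f:
    f.write(full_text)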

Conclusions on Marker

So far, we've only described the general process of how Marker works. But we already have something to discuss – some conclusions we can draw based on the information we've received.

Conclusion 1: Layout analysis can be divided into several subtasks. The first of these is obtaining page blocks through the PyMuPDF API; when a page needs OCR, Marker routes it through Tesseract or OCRmyPDF, as the code below shows.

def ocr_entire_page(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    if settings.OCR_ENGINE == "tesseract":
        return ocr_entire_page_tess(page, lang, spellchecker)
    elif settings.OCR_ENGINE == "ocrmypdf":
        return ocr_entire_page_ocrmp(page, lang, spellchecker)
    else:
        raise ValueError(f"Unknown OCR engine {settings.OCR_ENGINE}")




def ocr_entire_page_tess(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    try:
        full_tp = page.get_textpage_ocr(flags=settings.TEXT_FLAGS, dpi=settings.OCR_DPI, full=True, language=lang)
        blocks = page.get_text("dict", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)["blocks"]
        full_text = page.get_text("text", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)


        if len(full_text) == 0:
            return []


        # Check whether OCR worked. If not, return an empty list


        # OCR may fail if a blank page with faintly printed text was scanned
        if detect_bad_ocr(full_text, spellchecker):
            return []
    except RuntimeError:
        return []
    return blocks
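
For pages that already contain a text layer, the same kind of block structure can be obtained directly from PyMuPDF without OCR. A minimal sketch (the file name is a placeholder):

import pymupdf  # PyMuPDF

doc = pymupdf.open("paper.pdf")
page = doc[0]

# "dict" mode returns a blocks -> lines -> spans hierarchy with bounding boxes,
# which is the raw material for the later layout analysis and ordering steps.
blocks = page.get_text("dict", sort=True)["blocks"]
for block in blocks:
    print(block["bbox"], "image" if block["type"] == 1 else "text")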

Conclusion 2: Fine-tuning (or retraining) small multimodal pre-trained models such as LayoutLMv3 can be quite useful for specific problems. For example, Marker fine-tunes LayoutLMv3 to serve as the layout segmenter model that determines block types.

def load_layout_model():
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        settings.LAYOUT_MODEL_NAME,
        torch_dtype=settings.MODEL_DTYPE,
    ).to(settings.TORCH_DEVICE_MODEL)


    model.config.id2label = {
        0: "Caption",
        1: "Footnote",
        2: "Formula",
        3: "List-item",
        4: "Page-footer",
        5: "Page-header",
        6: "Picture",
        7: "Section-header",
        8: "Table",
        9: "Text",
        10: "Title"
    }


    model.config.label2id = {v: k for k, v in model.config.id2label.items()}
    return model

The dataset used for this fine-tuning is the open DocLayNet dataset.

Conclusion 3: When parsing PDF files, the number of columns on a page matters a great deal, because it determines the reading order. Marker also includes a fine-tuned LayoutLMv3 that acts as a column detector. This model determines the number of columns on a page and then uses the midpoint method to re-order the blocks:

def add_column_counts(doc, doc_blocks, model, batch_size):
    for i in range(0, len(doc_blocks), batch_size):
        batch = range(i, min(i + batch_size, len(doc_blocks)))
        rgb_images = []
        bboxes = []
        words = []
        for pnum in batch:
            page = doc[pnum]
            rgb_image, page_bboxes, page_words = get_inference_data(page, doc_blocks[pnum])
            rgb_images.append(rgb_image)
            bboxes.append(page_bboxes)
            words.append(page_words)


        predictions = batch_inference(rgb_images, bboxes, words, model)
        for pnum, prediction in zip(batch, predictions):
            doc_blocks[pnum].column_count = prediction




def order_blocks(doc, doc_blocks: List[Page], model, batch_size=settings.ORDERER_BATCH_SIZE):
    add_column_counts(doc, doc_blocks, model, batch_size)


    for page_blocks in doc_blocks:
        if page_blocks.column_count > 1:
            # Re-sort the blocks depending on their position
            split_pos = page_blocks.x_start + page_blocks.width / 2
            left_blocks = []
            right_blocks = []
            for block in page_blocks.blocks:
                if block.x_start <= split_pos:
                    left_blocks.append(block)
                else:
                    right_blocks.append(block)
            page_blocks.blocks = left_blocks + right_blocks
    return doc_blocks

This is similar to my approach in the article Advanced RAG 02: Unveiling PDF Parsing.

Conclusion 4: Specialized models can be trained to process mathematical formulas. For example, the Texify model used by Marker is based on the Donut architecture. It was trained on images of equations and the corresponding LaTeX sources collected from the internet (including the im2latex dataset). Training was carried out on 4x A6000 GPUs over approximately two days, which corresponds to roughly 6 epochs.

Conclusion 5: A model can also be used for post-processing. The basic idea is to train a T5 model that takes nearly finished text and refines it: removing artifacts, adding spaces, and inserting newlines.

def load_editing_model():
    if not settings.ENABLE_EDITOR_MODEL:
        return None


    model = T5ForTokenClassification.from_pretrained(
            settings.EDITOR_MODEL_NAME,
            torch_dtype=settings.MODEL_DTYPE,
        ).to(settings.TORCH_DEVICE_MODEL)
    model.eval()


    model.config.label2id = {
        "equal": 0,
        "delete": 1,
        "newline-1": 2,
        "space-1": 3,
    }
    model.config.id2label = {v: k for k, v in   model.config.label2id.items()}
    return model

This is all the information I could find on postprocessor training and dataset construction for now.
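
To make the idea more tangible, here is a toy sketch of how per-token edit labels like those above could be applied. It assumes one label per character and treats newline-1 and space-1 as insertions before the character; this is an illustrative simplification, not Marker's actual postprocessing code.

def apply_edit_labels(text: str, labels: list) -> str:
    # Assumed semantics: "equal" keeps the character, "delete" drops it,
    # "newline-1" / "space-1" insert a newline / space before the character.
    out = []
    for ch, label in zip(text, labels):
        if label == "delete":
            continue
        if label == "newline-1":
            out.append("\n")
        elif label == "space-1":
            out.append(" ")
        out.append(ch)
    return "".join(out)


labels = ["equal"] * 5 + ["space-1"] + ["equal"] * 5 + ["newline-1"] + ["equal"] * 3
print(apply_edit_labels("Helloworld.Next", labels))  # -> "Hello world.\nNext"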

Disadvantages of Marker

Naturally, Marker has its drawbacks:

  • Instead of training and fine-tuning a specialized model for layout analysis, a built-in function from PyMuPDF is used. The effectiveness of this approach is questionable.

  • Marker does not always recognize tables and their captions correctly, and here it is inferior to, for example, Nougat (an OCR-free solution based on a small model, which will be presented in detail in the next article). Figure 3 shows the recognition results for Table 3 from the paper “Attention Is All You Need”: the original table is on the left, the Marker results are in the middle, and the Nougat results are on the right.

Figure 3: Comparison of table detection and recognition, original table is table 3 from the article "Attention Is All You Need". Image by the author.
  • Only languages similar to English are supported. You won't be able to parse PDF files in languages such as Japanese or Hindi.

PaperMage

Papermage is an open-source framework for analyzing and processing visually rich, structured scientific documents. It provides clear and intuitive abstractions for representing and manipulating textual and visual elements in a document.

Papermage combines various models of natural language processing (NLP) and computer vision (CV) into a single framework. It offers ready-to-use solutions for common scenarios of scientific document processing.

Next, we'll cover how PaperMage works and discuss the overall process with source code examples. Then, we'll talk about the takeaways from PaperMage.

Components

Three main components can be distinguished in PaperMage:

  • Magelib: A library containing primitives and methods for representing and manipulating visually rich documents as multimodal structures.

  • Predictors: An implementation that combines various modern models for scientific document analysis into a single interface, even if the individual models are implemented in different frameworks or operate on different modalities.

  • Recipes: Well-tested, “turnkey” combinations of individual modules, often single-modal, composed into complex and extensible multimodal pipelines. These combinations are called Recipes.

Basic data classes

Magelib provides three basic data classes for representing the main elements of visually rich structured documents: Document, Layers, and Entities.

Document and layers

Figure 4 shows how PaperMage creates and presents documents.

Figure 4: How PaperMage creates and presents documents. Source: PaperMage.

Once the document structure has been extracted using various algorithms and models, PaperMage conceptualizes it as an annotation layer used to store both textual and visual information.

A little later we will look at the source code and walk through the execution of recipe.run().

Entities

As shown in Figure 5, an entity is a unit of multimodal content.

Figure 5: PaperMage Entity. Source: PaperMage.

But what do we do with discontinuous elements in a document, such as sentences that span entire columns or even pages, or, say, are interrupted by floating graphs or footnotes?

PaperMage uses two member variables: spans and boxes. As shown in Figure 5, spans locate the sentence's text within the document's character sequence, while boxes record its visual coordinates on the page. This approach is flexible enough to accommodate even awkward layouts.
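
Conceptually, such a discontinuous sentence could be represented roughly as follows. The import path and constructor arguments are assumptions inferred from the snippets in this article (Entity(boxes=...), Box.create_enclosing_box(), 'spans': [[start, end]]), so treat this as a sketch rather than the exact PaperMage API.

from papermage.magelib import Entity, Span, Box

# One logical sentence whose text is split across a column/page break:
# two character spans and two visual boxes (coordinates are made up for illustration).
sentence = Entity(
    spans=[Span(start=120, end=185), Span(start=186, end=240)],
    boxes=[
        Box(l=0.51, t=0.75, w=0.37, h=0.05, page=0),  # fragment at the bottom of one column
        Box(l=0.12, t=0.10, w=0.37, h=0.05, page=1),  # continuation on the next page
    ],
)
print(sentence.spans, sentence.boxes)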

In addition, we have the ability to access entities in different ways, as shown in Figure 6.

Figure 6: Different ways to access entities. Source: PaperMage.

To better understand how Papermage works, we'll start with a concrete example of parsing a PDF file and dig deeper into the process as we go.

General process and code analysis

The test code looks like this:

from papermage.recipes import CoreRecipe

core_recipe = CoreRecipe()

doc = core_recipe.run("YOUR_PDF_PATH")

First, core_recipe = CoreRecipe() calls the CoreRecipe class constructor, where the related libraries and models are initialized.

class CoreRecipe(Recipe):
    def __init__(
        self,
        ivila_predictor_path: str = "allenai/ivila-row-layoutlm-finetuned-s2vl-v2",
        bio_roberta_predictor_path: str = "allenai/vila-roberta-large-s2vl-internal",
        svm_word_predictor_path: str = "https://ai2-s2-research-public.s3.us-west-2.amazonaws.com/mmda/models/svm_word_predictor.tar.gz",
        dpi: int = 72,
    ):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.dpi = dpi


        self.logger.info("Instantiating recipe...")
        self.parser = PDFPlumberParser()
        self.rasterizer = PDF2ImageRasterizer()


        # with warnings.catch_warnings():
        #     warnings.simplefilter("ignore")
        #     self.word_predictor = SVMWordPredictor.from_path(svm_word_predictor_path)


        self.publaynet_block_predictor = LPEffDetPubLayNetBlockPredictor.from_pretrained()
        self.ivila_predictor = IVILATokenClassificationPredictor.from_pretrained(ivila_predictor_path)
        self.sent_predictor = PysbdSentencePredictor()
        self.logger.info("Finished instantiating recipe")

Because Recipe is the parent class of CoreRecipe, the call core_recipe.run() goes to Recipe::run().

class Recipe:
    @abstractmethod
    def run(self, input: Any) -> Document:
        if isinstance(input, Path):
            if input.suffix == ".pdf":
                return self.from_pdf(pdf=input)
            if input.suffix == ".json":
                return self.from_json(doc=input)


            raise NotImplementedError("Filetype not yet supported.")


        if isinstance(input, Document):
            return self.from_doc(doc=input)


        if isinstance(input, str):
            if os.path.exists(input):
                input = Path(input)
                return self.run(input=input)
            else:
                return self.from_str(text=input)


        raise NotImplementedError("Document input not yet supported.")

Execution then reaches CoreRecipe::from_pdf() and CoreRecipe::from_doc():

class CoreRecipe(Recipe):
    ...
    ...
    def from_pdf(self, pdf: Path) -> Document:
        self.logger.info("Parsing document...")
        doc = self.parser.parse(input_pdf_path=pdf)


        self.logger.info("Rasterizing document...")
        images = self.rasterizer.rasterize(input_pdf_path=pdf, dpi=self.dpi)
        doc.annotate_images(images=list(images))
        self.rasterizer.attach_images(images=images, doc=doc)
        return self.from_doc(doc=doc)


    def from_doc(self, doc: Document) -> Document:
        # self.logger.info("Predicting words...")
        # words = self.word_predictor.predict(doc=doc)
        # doc.annotate_layer(name=WordsFieldName, entities=words)


        self.logger.info("Predicting sentences...")
        sentences = self.sent_predictor.predict(doc=doc)
        doc.annotate_layer(name=SentencesFieldName, entities=sentences)


        self.logger.info("Predicting blocks...")
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            blocks = self.publaynet_block_predictor.predict(doc=doc)
        doc.annotate_layer(name=BlocksFieldName, entities=blocks)


        self.logger.info("Predicting figures and tables...")
        figures = []
        tables = []
        for block in blocks:
            if block.metadata.type == "Figure":
                figure = Entity(boxes=block.boxes)
                figures.append(figure)
            elif block.metadata.type == "Table":
                table = Entity(boxes=block.boxes)
                tables.append(table)
        doc.annotate_layer(name=FiguresFieldName, entities=figures)
        doc.annotate_layer(name=TablesFieldName, entities=tables)


        # self.logger.info("Predicting vila...")
        vila_entities = self.ivila_predictor.predict(doc=doc)
        doc.annotate_layer(name="vila_entities", entities=vila_entities)


        for entity in vila_entities:
            entity.boxes = [
                Box.create_enclosing_box(
                    [b for t in doc.intersect_by_span(entity, name=TokensFieldName) for b in t.boxes]
                )
            ]
            # entity.text = make_text(entity=entity, document=doc)
        preds = group_by(entities=vila_entities, metadata_field="label", metadata_values_map=VILA_LABELS_MAP)
        doc.annotate(*preds)
        return doc

The general process is shown in Figure 7:

Figure 7: General workflow of PaperMage; unlabeled arrows represent the merge operation, also known as the annotation feature in PaperMage. Image by the author.

In Figure 7 we can see that the PaperMage processing process also follows a pipeline approach.

Initially, the layout is extracted with the PDFPlumber library. Then specialized algorithms and models are applied on top of the layout analysis results to detect the other objects on the page: sentences, figures, tables, headings, and so on.

Next, we will focus our attention on three important processes:

  • Sentence Breakdown

  • Analysis of the layout structure

  • Logical structure analysis

Sentence Breakdown

To split the text into sentences, PySBD is used: a rule-based Python package for sentence boundary detection.

The input is a sequence of tokens. The output is the span of each sentence.

[
Unannotated Entity: {'spans': [[0, 212]]}, 
Unannotated Entity: {'spans': [[212, 367]]},  
…
]
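
A minimal standalone PySBD sketch produces the same kind of character spans (the sample text is, of course, a placeholder):

import pysbd

seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
text = "BERT is a language model. It was introduced in 2018."
for span in seg.segment(text):
    # each TextSpan carries the sentence text plus its character offsets
    print(span.start, span.end, repr(span.sent))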

Analysis of the layout structure

The LPEffDetPubLayNetBlockPredictor model is used to analyze the page layout structure. It is a deep learning object detection model provided by LayoutParser, and its main task is to segment the document into regions of visual blocks.

The input is a page image, accessed as doc.images. The output is a bounding box and the corresponding type for each block. The box includes the X and Y coordinates of the top-left corner, the width, the height, and the page number.

[
Unannotated Entity: {'boxes': [[0.5179840190298606, 0.752760137345049, 0.3682081491355128, 0.15176369855069774, 0]], 'metadata': {'type': 'Text'}}, 
Unannotated Entity: {'boxes': [[0.5145780320135539, 0.5080924136055337, 0.3675624668198144, 0.23725746136663078, 0]], 'metadata': {'type': 'Text'}}, 
…
]
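
LayoutParser exposes this family of PubLayNet-trained detectors through a small API. The sketch below uses the well-known Detectron2 checkpoint from the LayoutParser model zoo rather than the exact EfficientDet weights wrapped by LPEffDetPubLayNetBlockPredictor, so treat it as an approximation:

import layoutparser as lp
from pdf2image import convert_from_path

# Rasterize the first page and run a PubLayNet-trained block detector on it.
image = convert_from_path("paper.pdf", dpi=72)[0]
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)
for block in layout:
    print(block.type, block.score, block.coordinates)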

Logical structure analysis

The IVILATokenClassificationPredictor model is used to analyze the logical structure of a document. It divides the document into organizational units such as title, abstract, main body, footnotes, captions, etc.

The input is page-level data passed as a dictionary.

{
        'words': ['word1', 'word2', ...],
        'bbox': [[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
        'block_ids': [0, 0, 0, 1 ...],
        'line_ids': [0, 1, 1, 2 ...],
        'labels': [0, 0, 0, 1 ...], # could be empty
    }

The output is the span of each entity.

[
Unannotated Entity: {'spans': [[0, 80]], 'metadata': {'label': 'Title'}}, 
Unannotated Entity: {'spans': [[81, 157]], 'metadata': {'label': 'Author'}}, 
Unannotated Entity: {'spans': [[158, 215]], 'metadata': {'label': 'Paragraph'}}, 
...
]

Thoughts and Conclusions on PaperMage

PDF Parsing Abstraction

The abstraction proposed by PaperMage for the PDF parsing task is quite effective. It involves dividing the entire PDF into types such as doc, layer, and entities, which makes it easier to classify and manage elements.

Scalability

PaperMage has developed a framework that is easily extensible, making future development easier.

For example, to add a custom predictor, we just need to inherit from the BasePredictor base class and override the function _predict().

from .base_predictor import BasePredictor


class YOUR_NEW_Predictor(BasePredictor):
    ...
    ...
    def _predict(self, doc: Document) -> List[YOUR_RET_TYPE]:
    ...
    ...

Parallelism

Figure 7 shows that PaperMage has potential to improve through parallelization, which is a viable direction for optimization.

Although the current version of PaperMage does not contain any parallelism-related code, adding parallel processing logic could significantly improve the efficiency of PDF parsing.
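
A sketch of what such logic could look like is given below. predict_page is a hypothetical per-page step (for example, rasterization plus block prediction); PaperMage provides no such helper out of the box.

from concurrent.futures import ProcessPoolExecutor


def predict_page(page_index: int):
    """Hypothetical per-page work: rasterize the page and run the block predictor on it."""
    ...


def predict_all_pages(num_pages: int, max_workers: int = 4):
    # Pages are independent at this stage, so they can be processed in parallel
    # and the results merged back into the Document afterwards.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_page, range(num_pages)))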

Unstructured

Unstructured is an open-source tool for pre-processing unstructured data. In the previous article, we already described how it works in general terms.

Now we'll talk about the conclusions we made when considering the unstructured framework, in particular how it can help us in developing our own PDF parsing tool.

About layout analysis

The layout analysis in unstructured is done very meticulously.

If we set strategy='hi_res', an object detection model such as YOLOX or detectron2 is used to analyze the layout. To improve detection, it is combined with PDFMiner, and the results of the two methods are merged to obtain the final layout, as shown in Figure 8.
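
A minimal sketch of invoking this strategy (the file name is a placeholder; the table-structure flag is optional):

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="bert.pdf",
    strategy="hi_res",            # object detection model + PDFMiner, merged as in Figure 8
    infer_table_structure=True,   # keep detected table structure in element metadata
)
for el in elements:
    print(el.category, "-", el.text[:60])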

Figure 8: PDF parsing process with strategy='hi_res' in unstructured. Image by the author.

Figures 9 and 10 show visualizations of the layout analysis results for page 16 of the BERT paper; the boxes represent the boundaries of each region. The object detection model results, shown in Figure 9, are more accurate: tables and images are better integrated into the document structure. The PDFMiner detection results, shown in Figure 10, by contrast, split up the contents of tables and images.

Figure 9: Visualization of the object detection model results (inferred_layout) for page 16 of the BERT paper, the boxes represent the boundaries of each region. Screenshot by the author.

Figure 10: Visualization of PDFMiner detection results (extracted_layout) for page 16 of the BERT paper, the boxes represent the boundaries of each region. Screenshot by the author.

The code responsible for merging the layouts looks like this: a double loop evaluates the relationship between each region detected by PDFMiner (extracted_layout) and each region returned by the object detection model (inferred_layout), and then decides whether to merge them.

def merge_inferred_layout_with_extracted_layout(
    inferred_layout: Collection[LayoutElement],
    extracted_layout: Collection[TextRegion],
    page_image_size: tuple,
    same_region_threshold: float = inference_config.LAYOUT_SAME_REGION_THRESHOLD,
    subregion_threshold: float = inference_config.LAYOUT_SUBREGION_THRESHOLD,
) -> List[LayoutElement]:
    """Merge two layouts to produce a single layout."""
    extracted_elements_to_add: List[TextRegion] = []
    inferred_regions_to_remove = []
    w, h = page_image_size
    full_page_region = Rectangle(0, 0, w, h)
    for extracted_region in extracted_layout:
        extracted_is_image = isinstance(extracted_region, ImageTextRegion)
        if extracted_is_image:
            # For our purposes we skip extracted images: they contain no text for us, and it is
            # usually hard to get good text bounding boxes for them.


            is_full_page_image = region_bounding_boxes_are_almost_the_same(
                extracted_region.bbox,
                full_page_region,
                FULL_PAGE_REGION_THRESHOLD,
            )


            if is_full_page_image:
                continue
        region_matched = False
        for inferred_region in inferred_layout:
            if inferred_region.source in CHIPPER_VERSIONS:
                continue
            ...
            ...

About customization

Unstructured provides a variety of intermediate results that can be easily customized.

In the previous article, we considered three issues related to the data obtained from unstructured:

  • Table parsing

  • Rearranging detected blocks, especially in two-column PDFs

  • Extracting multi-level headings

The last two problems can be solved by modifying the intermediate structure. As an example, Figure 11 shows the final layout of the second page of the BERT paper.

Figure 11: Visualization of the final layout of the second page of the BERT paper. Screenshot by the author.

At the same time, we can easily get the available layout analysis results:

[


LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text="These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.9519357085227966, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text="word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.9517233967781067, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text="• We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ", source=<Source.YOLOX: 'yolox'>, type="List-item", prob=0.9414362907409668, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text="ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.938507616519928, image_path=None, parent=None), 




LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text="• We show that pre-trained representations reduce the need for many heavily-engineered task- specific architectures. BERT is the first fine- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-specific architectures. ", source=<Source.YOLOX: 'yolox'>, type="List-item", prob=0.9461237788200378, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text="• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ", source=<Source.YOLOX: 'yolox'>, type="List-item", prob=0.9213815927505493, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text="2 Related Work ", source=<Source.YOLOX: 'yolox'>, type="Section-header", prob=0.8663332462310791, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text="There is a long history of pre-training general lan- guage representations, and we briefly review the most widely-used approaches in this section. ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.928022563457489, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text="2.2 Unsupervised Fine-tuning Approaches ", source=<Source.YOLOX: 'yolox'>, type="Section-header", prob=0.8346447348594666, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text="As with the feature-based approaches, the first works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.9344717860221863, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text="2.1 Unsupervised Feature-based Approaches ", source=<Source.YOLOX: 'yolox'>, type="Section-header", prob=0.8317819237709045, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text="Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.9450697302818298, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text="More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ", source=<Source.YOLOX: 'yolox'>, type="Text", prob=0.9476840496063232, image_path=None, parent=None)


]

Using the above information, we can easily perform tasks such as sorting and extracting multi-level headings.
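
As a sketch, the snippet below works directly over a list of such elements (here called layout_elements). It relies only on the bbox, text, and type fields visible in the dump above; the x-coordinate used to split the two columns is specific to this page, not a general rule.

def reading_order(layout_elements, column_split=850):
    # Left column first (top to bottom), then the right column.
    left = [el for el in layout_elements if el.bbox.x1 < column_split]
    right = [el for el in layout_elements if el.bbox.x1 >= column_split]
    return sorted(left, key=lambda el: el.bbox.y1) + sorted(right, key=lambda el: el.bbox.y1)


def section_headers(layout_elements):
    # Multi-level headings can be pulled out simply by filtering on the detected type.
    return [el.text.strip() for el in layout_elements if el.type == "Section-header"]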

Therefore, when developing our own PDF parsing tools, we should strive to preserve as much useful intermediate information and metadata as possible.

About detection and recognition of tables

To detect and recognize tables, the unstructured framework uses Table Transformer.

The Table Transformer model was proposed in the paper PubTables-1M: Towards comprehensive table extraction from unstructured documents. That paper presents a new dataset, PubTables-1M, designed for table extraction from unstructured documents, as well as for table structure recognition and functional analysis, as shown in Figure 12.

Figure 12: Illustration of the three table extraction subtasks considered in the PubTables-1M dataset. Source: PubTables-1M: Towards comprehensive table extraction from unstructured documents.

Table Transformer is a DETR-based model trained on the PubTables-1M dataset to solve tasks such as table detection and structure recognition.

You will find more information about table processing in my previous article.
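
Outside of unstructured, the same Table Transformer checkpoints are available through Hugging Face transformers; a minimal detection sketch (the page image path is a placeholder):

import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image = Image.open("page.png").convert("RGB")
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions into thresholded boxes in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])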

On the detection and recognition of formulas

The unstructured framework lacks a dedicated module for detecting and recognizing formulas, which shows in the mediocre results in Figure 13.

Figure 13: On the left is the result of parsing a paragraph on page 6 of the BERT article, including the formula highlighted in red. On the right is the original article. Screenshot by the author.

Conclusion

This article has provided an overview of the pipeline approach to PDF parsing. It examined three frameworks that implement this approach, described them in detail, and outlined the conclusions drawn from them.

In summary:

  • Although Marker has a few drawbacks, it is a lightweight and fast tool.

  • Although PaperMage is primarily designed for working with scientific documents, its exceptional scalability serves as a good starting point for further development.

  • Unstructured is a comprehensive PDF parsing pipeline framework. Its strengths include detailed layout analysis and extensive customization capabilities.

Overall, the pipeline approach to PDF parsing is easily interpretable and customizable, making it a widely used method. However, its effectiveness largely depends on the performance of each model or algorithm used in the process. Therefore, the training data and the structure of each model must be carefully considered.

