Create PDF document in Python using pText

One of the most flexible and familiar ways to generate PDFs is to write LaTeX code and compile it with the appropriate program. But there are other ways that may be simpler and clearer than LaTeX. To mark the start of the Fullstack Python developer course, we present a translation of an article on generating PDFs with the pText library, written by Joris Schellekens, the developer of pText.


In this tutorial, we will use pText, a Python library for reading, processing and creating PDF documents. It offers both a low-level model (giving you access to exact coordinates and layout if you choose to use them) and a high-level model (where you can delegate the precise calculation of margins, positions, etc.). We will look at how to create and inspect a PDF document in Python using pText, as well as how to use some LayoutElement objects to add barcodes and tables.

Portable Document Format (PDF) is not a WYSIWYG (What You See Is What You Get) format. It was designed to be platform-independent, that is, independent of the underlying operating system and rendering engine.

To achieve this, PDF was designed to work more like a programming language: it relies on a series of instructions and operations to produce a result. In fact, PDF is based on a scripting language, PostScript, which was the first device-independent page description language. It has operators that change the graphics state. At a high level they look something like this:

  • Set the font to Helvetica.

  • Set the stroke color to black.

  • Go to (60,700).

  • Draw glyph “H”.

This explains several things:

  • Why it is so difficult to accurately extract text from a PDF.

  • Why it is difficult to edit a PDF document.

  • Why most PDF libraries take a very low-level approach to content creation (you have to specify the coordinates at which to display text, fields, etc.).
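To make the instruction-based model more concrete, here is a minimal sketch that assembles the kind of operator sequence described above as plain bytes. It uses only the standard library, not pText. The function name text_content_stream, the font size 12 and the resource name /F1 are assumptions made for illustration; in a real PDF, /F1 would also have to be mapped to Helvetica in the page's resource dictionary.

```python
def text_content_stream(text, x, y, font_size):
    """Return raw PDF operators that draw `text` at position (x, y)."""
    # Backslashes and parentheses are special inside a PDF string literal.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    operators = [
        "BT",                   # begin a text object
        f"/F1 {font_size} Tf",  # select font resource /F1 at the given size
        "0 0 0 rg",             # black fill color (glyphs are filled by default)
        f"{x} {y} Td",          # move the text position to (x, y)
        f"({escaped}) Tj",      # draw the glyphs of the string
        "ET",                   # end the text object
    ]
    return "\n".join(operators).encode("ascii")


stream = text_content_stream("H", 60, 700, 12)
print(stream.decode("ascii"))
```

Even for a single glyph, the caller has to supply exact coordinates. This is the bookkeeping that a high-level layout model is meant to take over.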

Installing pText

pText can be downloaded from GitHub or installed via pip:

$ pip install ptext-joris-schellekens

Note: At the time of this writing, version 1.8.6 does not install external dependencies like python-barcode and qrcode by default. If an error message appears, install them manually:

$ pip install qrcode python-barcode requests

Create PDF document in Python using pText

pText has two intuitive key classes, Document and Page, which represent the document and the pages within it. This is the basic structure for creating PDF documents. In addition, the PDF class is an API for loading and saving the documents we create. With this in mind, let’s create an empty PDF file:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF

# Create an empty Document
document = Document()

# Create an empty page
page = Page()

# Add the Page to the Document
document.append_page(page)

# Write the Document to a file
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

Most of the code here is self-explanatory. We start by creating a blank Document, then add a blank Page to it with the append_page() function, and finally save the file with PDF.dumps().

Note that we open the file with the “wb” flag to write in binary mode, since we don’t want Python to encode the data as text. This gives us an empty PDF called output.pdf on the filesystem.
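Because the file is binary, it should also be read back in binary mode. The following is a rough sanity check using only the standard library, not a pText feature: a PDF file begins with a %PDF- header and ends with an %%EOF marker. The helper name looks_like_pdf and the sample bytes are hypothetical, and the sample is a stand-in rather than a complete PDF.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Rough check: the data starts with '%PDF-' and ends with '%%EOF'."""
    return data.startswith(b"%PDF-") and data.rstrip().endswith(b"%%EOF")


# Stand-in bytes for illustration only; this is not a valid PDF.
sample = b"%PDF-1.7\n...objects...\nstartxref\n2274\n%%EOF\n"
print(looks_like_pdf(sample))          # True
print(looks_like_pdf(b"plain text"))   # False

# With a real file, read it in binary mode first:
# with open("output.pdf", "rb") as pdf_file_handle:
#     print(looks_like_pdf(pdf_file_handle.read()))
```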

Creating a Hello World Document Using pText

Of course, blank PDFs don’t contain much information. Let’s add content to the page before adding it to the document instance.

Similar to the two classes described earlier, to add content to the page, we will add a PageLayout indicating the type of layout we would like to see and add one or more paragraphs to that layout.

In the object hierarchy, Document is the lowest-level instance and Paragraph is the highest-level one, sitting on top of the PageLayout and therefore the Page. Let’s add a paragraph to our page:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from ptext.io.read.types import Decimal

document = Document()
page = Page()

# Setting a layout manager on the Page
layout = SingleColumnLayout(page)

# Adding a Paragraph to the Page
layout.add(Paragraph("Hello World", font_size=Decimal(20), font="Helvetica"))

document.append_page(page)

with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

You will notice that we have added two additional objects:

  • A PageLayout instance, more specifically an instance of its SingleColumnLayout subclass: this class keeps track of where content is added to the page, which areas are still available for future content, what the page margins are, and what the leading (the space between Paragraph objects) should be.

Since we are only working with one column here, we are using SingleColumnLayout. Alternatively, we can use MultiColumnLayout.

  • A Paragraph instance: this class represents a block of text. You can set properties such as font, font_size, font_color, and many more. You can find more examples in the documentation.

The code generates the file output.pdf containing our paragraph.

Checking the Generated PDF with pText

Note: This section is optional if you are not interested in the inner workings of a PDF document.

But it can be very helpful to know a little about the format (for example, when debugging the classic “why is my content not showing on this page” problem). Typically a PDF reader reads the document starting from the last bytes:

xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
trailer
<</Root 1 0 R /Info 2 0 R /Size 11 /ID [<61e6d144af4b84e0e0aa52deab87cfe9><61e6d144af4b84e0e0aa52deab87cfe9>]>>
startxref
2274
%%EOF

Here we see the end-of-file marker (%%EOF) and the cross-reference table (usually abbreviated as xref).

The xref section is delimited by the “xref” and “startxref” tokens.
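The number that follows startxref is the byte offset of the xref keyword, which is how a reader that starts at the end of the file finds the table. A minimal sketch of that lookup with the standard library follows; the function name find_startxref is hypothetical, and the sample bytes are a shortened version of the listing above.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    # Search backwards so that the most recently written table wins.
    position = data.rfind(b"startxref")
    if position == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the first whitespace-separated token after the keyword.
    tokens = data[position + len(b"startxref"):].split()
    return int(tokens[0])


tail_of_file = b"trailer\n<</Root 1 0 R /Info 2 0 R /Size 11>>\nstartxref\n2274\n%%EOF\n"
print(find_startxref(tail_of_file))  # 2274
```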

The xref (there may be several in the document) acts as a reference table for the PDF reader.

It contains the byte offset (counted from the start of the file) of each object in the PDF. The first line of the xref (0 11) says that this xref contains 11 entries and that the first of them describes object number 0.

Each subsequent line consists of a byte offset, followed by the so-called generation number and the letter f or n:

  • Objects marked with f are free and not expected to be rendered.

  • Objects marked with the letter n are “in use”.
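The layout described above is regular enough to parse in a few lines. The sketch below handles only the simple case shown in this article, a single subsection given as text. The names XrefEntry and parse_xref_section are hypothetical and not part of pText.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    offset: int
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse an xref table that consists of a single subsection."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    # Subsection header: number of the first object, then the entry count.
    first_object, count = (int(part) for part in lines[1].split())
    entries = []
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        entries.append(
            XrefEntry(first_object + index, int(offset), int(generation), flag == "n")
        )
    return entries


xref_text = """xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
"""

entries = parse_xref_section(xref_text)
print(len(entries))       # 11
print(entries[1].offset)  # 15
print(entries[0].in_use)  # False
```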

At the bottom of the xref we find the trailer dictionary. Dictionaries in PDF syntax are delimited by << and >>. This dictionary contains the following pairs:

  • /Root 1 0 R

  • /Info 2 0 R

  • /Size 11

  • /ID [<61e6d144af4b84e0e0aa52deab87cfe9> <61e6d144af4b84e0e0aa52deab87cfe9>]

The trailer dictionary is the starting point for a PDF reader and links to all other data. In this case:

  • /Root: This is another dictionary that refers to the actual content of the document.

  • /Info: This is a dictionary containing the document’s meta information (author, title, and so on).

Expressions of the form 1 0 R are called “references” in PDF syntax. This is where the xref table comes in handy. To find the object associated with 1 0 R, we look up object 1 (generation number 0). The xref table tells us that we can expect to find this object at byte offset 15 of the document. If we check this, we will find:

1 0 obj
<</Pages 3 0 R>>
endobj

Note that the object starts with 1 0 obj and ends with endobj. This is another confirmation that we are actually dealing with object 1. This dictionary tells us that we can find the pages of the document in object 3:

3 0 obj
<</Count 1 /Kids [4 0 R]
 /Type /Pages>>
endobj

This is a /Pages dictionary, and it tells us that there is one page in this document (the /Count entry). The /Kids entry is usually an array with one object reference per page. We can expect to find the first page in object 4:

4 0 obj
<</Type /Page /MediaBox [0 0 595 842]
 /Contents 5 0 R /Resources 6 0 R /Parent 3 0 R>>
endobj

This dictionary contains several interesting entries:

  • /MediaBox: the physical dimensions of the page (in this case, an A4 page).

  • /Contents: a reference to a (usually compressed) stream of PDF content instructions.

  • /Resources: a reference to a dictionary containing all the resources (fonts, images, and so on) used to render this page.
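The A4 claim can be checked with a little arithmetic, assuming the default PDF unit of 1/72 inch. The function name media_box_size_mm is hypothetical.

```python
POINTS_PER_INCH = 72   # the default PDF user-space unit is 1/72 inch
MM_PER_INCH = 25.4


def media_box_size_mm(media_box):
    """Convert a [llx, lly, urx, ury] MediaBox from points to millimetres."""
    llx, lly, urx, ury = media_box
    width_mm = (urx - llx) / POINTS_PER_INCH * MM_PER_INCH
    height_mm = (ury - lly) / POINTS_PER_INCH * MM_PER_INCH
    return width_mm, height_mm


width, height = media_box_size_mm([0, 0, 595, 842])
print(round(width), round(height))  # 210 297
```

The result, 210 × 297 mm, matches the A4 paper size.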

Let’s check object 5 to see what is actually displayed on this page:

5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
xÚãR@
È<§ž`a¥£šÔw3T0„É
€!K¡š3B˜„žœenl7'§999ù
åùE9)š
!Y(’!8õÂyšT*î
endstream
endobj

As mentioned earlier, this content stream is compressed. The /Filter entry tells you which compression method was used. If we decompress object 5, we get the actual content instructions:

5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
            q
            BT
            0.000000 0.000000 0.000000 rg
            /F1 1.000000 Tf            
            20.000000 0 0 20.000000 60.000000 738.000000 Tm            
            (Hello world) Tj
            ET            
            Q
endstream
endobj
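The decompression step can be reproduced with Python’s standard zlib module, since /FlateDecode data is in zlib format. The sketch below does not read a real file. It compresses a hand-typed copy of the operators and inflates it again, so the byte string content is an illustrative assumption.

```python
import zlib

content = (
    b"q\nBT\n0 0 0 rg\n/F1 1 Tf\n"
    b"20 0 0 20 60 738 Tm\n(Hello world) Tj\nET\nQ\n"
)

# What a PDF writer does for a /FlateDecode stream: deflate the operators.
# Data deflated at the highest level usually begins with the bytes 0x78 0xDA,
# which show up as "xÚ" in a raw dump such as the compressed object 5 above.
compressed = zlib.compress(content, 9)

# What a PDF reader does: inflate the bytes between "stream" and "endstream".
restored = zlib.decompress(compressed)

print(restored == content)  # True
print(restored.decode("ascii"))
```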

Finally, we are at a level where we can decode content. Each line consists of arguments followed by their operator. Let’s quickly go through the operators:

  • q: save the current graphics state (by pushing it onto the stack);

  • BT: begin text;

  • 0 0 0 rg: set the current non-stroking (fill) color to RGB (0, 0, 0), which is black;

  • /F1 1 Tf: set the current font to /F1 (an entry in the resource dictionary mentioned earlier) and the font size to 1.

  • 20.000000 0 0 20.000000 60.000000 738.000000 Tm: set the text matrix. Text matrices deserve a guide of their own; suffice it to say that this matrix controls the font size and text position. Here we scale the font to size 20 and position the text cursor at (60, 738). The PDF coordinate system starts at the bottom left corner of the page, so (60, 738) is near the top left of the page (given that the page height is 842 units).

  • (Hello world) Tj: strings in PDF syntax are delimited by ( and ). This command tells the PDF reader to display the string “Hello world” at the position we specified earlier using the text matrix, in the font, size, and color set by the preceding commands.

  • ET: end of text.

  • Q: pop the graphics state from the stack (thus restoring the graphics state).
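The “arguments followed by their operator” rule can be sketched as a small tokenizer. This is a deliberately simplified illustration, not how pText parses content: it does not handle nested parentheses, arrays, hex strings, dictionaries, comments or inline images. The names TOKEN and parse_content_stream are hypothetical.

```python
import re

# A token is a (string literal), a /Name, a number, or an operator keyword.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s/()]+|[-+]?\d*\.?\d+|[A-Za-z*]+")


def parse_content_stream(text):
    """Group tokens into (operator, operands) pairs; operands come first."""
    instructions, operands = [], []
    for token in TOKEN.findall(text):
        if token[0] in "(/+-." or token[0].isdigit():
            operands.append(token)                  # still collecting arguments
        else:
            instructions.append((token, operands))  # an operator closes the group
            operands = []
    return instructions


stream = """q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q
"""

instructions = parse_content_stream(stream)
for operator, operands in instructions:
    print(operator, operands)

# The last two operands of Tm are the text position.
tm_operands = next(ops for op, ops in instructions if op == "Tm")
x, y = float(tm_operands[4]), float(tm_operands[5])
print(x, y)  # 60.0 738.0
```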

Adding Other pText Layout Elements to Pages

pText comes with a wide variety of LayoutElement objects. In the previous example, we briefly explored Paragraph. But there are other elements as well, such as UnorderedList, OrderedList, Image, Shape, Barcode and Table. Let’s create a slightly more complex table and barcode example. Tables are made up of TableCells, which we add to the Table instance. A barcode can be one of many types of barcodes – we’ll be using a QR code:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from ptext.io.read.types import Decimal
from ptext.pdf.canvas.layout.table import Table, TableCell
from ptext.pdf.canvas.layout.barcode import Barcode, BarcodeType
from ptext.pdf.canvas.color.color import X11Color

document = Document()
page = Page()

# Layout
layout = SingleColumnLayout(page)

# Create and add heading
layout.add(Paragraph("DefaultCorp Invoice", font="Helvetica", font_size=Decimal(20)))

# Create and add barcode
layout.add(Barcode(data="0123456789", type=BarcodeType.QR, width=Decimal(64), height=Decimal(64)))

# Create and add table
table = Table(number_of_rows=5, number_of_columns=4)

# Header row
table.add(TableCell(Paragraph("Item", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Unit Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Amount", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))

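# Data rows: item name, unit price, amount
for item, unit_price, amount in [("Lorem", 4.99, 1), ("Ipsum", 9.99, 2), ("Dolor", 1.99, 3), ("Sit", 1.99, 1)]:
    table.add(Paragraph(item))
    table.add(Paragraph(str(unit_price)))
    table.add(Paragraph(str(amount)))
    # Format the total to two decimals so float rounding noise is not printed
    table.add(Paragraph(f"{unit_price * amount:.2f}"))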

# Set padding
table.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))
layout.add(table)

# Append page
document.append_page(page)

# Persist PDF to file
with open("output4.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

Some implementation details:

  • pText supports a variety of color models, including RGBColor, HexColor, X11Color, and HSVColor.

  • You can add LayoutElement objects directly to a Table object, but you can also wrap them in a TableCell object. This gives you some additional options such as col_span and row_span or, in this case, background_color.

  • If font, font_size, or font_color are not specified, Paragraph will default to Helvetica, size 12, black.
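The color models listed above are different notations for the same underlying color. The sketch below uses only the standard library rather than pText’s own color classes, whose constructors are not shown in this article. It converts the hex notation of the X11 color SlateGray to RGB and HSV. The helper name hex_to_rgb is hypothetical.

```python
import colorsys


def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers in 0..255."""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


# SlateGray is defined as #708090 in the X11 color list.
r, g, b = hex_to_rgb("#708090")
print((r, g, b))  # (112, 128, 144)

# colorsys works with floats in 0..1 and returns hue as a fraction of a turn.
h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
print(round(h * 360), round(s * 100), round(v * 100))  # 210 22 56
```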

The code generates a PDF containing the heading, the QR code and the invoice table.

Conclusion

In this tutorial, we covered pText, a library for reading, writing and manipulating PDF files. We’ve covered key classes like Document and Page, as well as some elements like Paragraph, Barcode, and PageLayout. Finally, we created several PDFs with different content and examined how PDFs store data under the hood.

PDF documents are pleasing to the eye and convenient for electronic document workflows, such as generating invoices and reports. This is especially relevant in large organizations and on their internal networks, so the development of large and complex web projects often involves PDF generation. If you are interested in working on complex web projects and do not want to choose between back end and front end, take a closer look at the Fullstack Python developer course.
