5 Ways to Convert PDF to Word in Python: A Comparative Guide

Converting PDF documents into editable Microsoft Word files allows you to make changes, add annotations, and work with PDF content more efficiently.

In this post, I have collected five ways to convert PDF to Word in Python, using free open-source and commercial libraries, and highlighted the pros and cons of each.

  • Convert PDF to Word with PyPDF2 and python-docx

  • Convert PDF to Word with pdfplumber and python-docx

  • Convert PDF to Word with pdf2docx

  • Convert PDF to Word with Spire.PDF for Python

  • Convert PDF to Word with Aspose.Words for Python via .NET

Convert PDF to Word with PyPDF2 and python-docx

PyPDF2 is a free and open-source PDF library for Python that provides a wide range of functions for reading, manipulating, and processing PDF documents.

python-docx is a free and open-source library for creating and updating Microsoft Word (.docx) files.

To install them via PyPI, use the following pip commands.

pip install PyPDF2
pip install python-docx

Code example:

from PyPDF2 import PdfReader
from docx import Document

# Create a new Word document
document = Document()

# Open the PDF file
with open("C:\\Users\\Administrator\\Desktop\\Input.pdf", "rb") as file:

    # Create a PdfReader object
    pdf_reader = PdfReader(file)

    # Loop through each page of the PDF
    for page in pdf_reader.pages:

        # Extract the text from the page (may be empty for image-only pages)
        text = page.extract_text() or ""

        # Add a paragraph containing the text to the Word document
        document.add_paragraph(text)

# Save the Word document
document.save("output.docx")

Pros:

  • Free and open source.

Cons:

  • Only the text will be extracted and placed into the Word document.

  • All formatting and layout of the original PDF file will be lost.
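
Since only plain text survives this conversion, it can help to at least keep paragraph and page boundaries. The following is a minimal sketch, assuming that blank lines in the extracted text mark paragraph breaks (a heuristic that will not hold for every PDF). The names split_paragraphs and pdf_text_to_docx are hypothetical helpers, not part of PyPDF2 or python-docx.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines (a heuristic)."""
    blocks = re.split(r"\n\s*\n", text or "")
    # Re-join hard-wrapped lines inside each block and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pdf_text_to_docx(pdf_path, docx_path):
    """Hypothetical helper: one Word paragraph per detected PDF paragraph."""
    # Third-party imports are local so split_paragraphs works on its own.
    from PyPDF2 import PdfReader
    from docx import Document

    reader = PdfReader(pdf_path)
    document = Document()
    last_index = len(reader.pages) - 1

    for index, page in enumerate(reader.pages):
        for paragraph in split_paragraphs(page.extract_text()):
            document.add_paragraph(paragraph)
        # Keep the original page boundaries, except after the final page.
        if index < last_index:
            document.add_page_break()

    document.save(docx_path)
```

Calling pdf_text_to_docx("Input.pdf", "output.docx") would then produce a document with separate paragraphs and page breaks, although fonts, images, and layout are still lost.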

Convert PDF to Word with pdfplumber and python-docx

python-docx is a free and open-source library for creating and updating Microsoft Word (.docx) files.

pdfplumber is a free and open-source Python library for extracting text and tables from PDF files.

You can install them via PyPI using the following commands.

pip install pdfplumber
pip install python-docx

Code example:

import pdfplumber
from docx import Document

# Open the PDF file
with pdfplumber.open("C:\\Users\\Administrator\\Desktop\\Input.pdf") as pdf:

    # Extract the text of each page; fall back to "" if a page yields no text
    pages_text = [page.extract_text() or "" for page in pdf.pages]

# Join the pages with newlines so text from adjacent pages does not run together
text = "\n".join(pages_text)

# Create a new Word document
document = Document()

# Add a paragraph containing the text to the Word document
document.add_paragraph(text)

# Save the Word document
document.save("output.docx")

Pros:

  • Free and open source.

Cons:

  • Only the text will be extracted and placed into the Word document.

  • All formatting and layout of the original PDF file will be lost.
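
Because pdfplumber can also extract tables, part of the structure can be carried over to Word. The sketch below is one possible approach, not a complete converter: it copies every table pdfplumber detects into a plain Word table. The names normalize_table and pdf_tables_to_docx are hypothetical, and how well tables are detected depends heavily on the PDF.

```python
def normalize_table(rows):
    """Replace None cells with "" and pad short rows to a common width."""
    width = max((len(row) for row in rows), default=0)
    return [
        ["" if cell is None else str(cell) for cell in row]
        + [""] * (width - len(row))
        for row in rows
    ]


def pdf_tables_to_docx(pdf_path, docx_path):
    """Hypothetical helper: copy each detected PDF table into a Word table."""
    # Third-party imports are local so normalize_table works on its own.
    import pdfplumber
    from docx import Document

    document = Document()
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for raw_table in page.extract_tables():
                rows = normalize_table(raw_table)
                if not rows or not rows[0]:
                    continue  # skip empty tables
                table = document.add_table(rows=len(rows), cols=len(rows[0]))
                for r, row in enumerate(rows):
                    for c, value in enumerate(row):
                        table.cell(r, c).text = value
                # Empty paragraph as a spacer between consecutive tables.
                document.add_paragraph()
    document.save(docx_path)
```

This only transfers cell text; borders, merged cells, and column widths from the PDF are not reproduced.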

Convert PDF to Word with pdf2docx

pdf2docx is a Python library that provides a simple and effective way to convert PDF files into Microsoft Word documents (.docx). It is a free and open-source library that can be used for various purposes such as document conversion, data extraction, and text processing.

It can be installed from PyPI using the following pip command.

pip install pdf2docx

Code example:

from pdf2docx import Converter

def convert_pdf_to_docx(pdf_file, docx_file):

    # Create a Converter object
    cv = Converter(pdf_file)

    # Convert the PDF to docx; start=0 and end=None cover all pages
    cv.convert(docx_file, start=0, end=None)
    cv.close()

# Convert the PDF to a Docx file
convert_pdf_to_docx("C:\\Users\\Administrator\\Desktop\\Input.pdf", "Output.docx")
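
The start and end arguments used above can also limit the conversion to part of the document. The following sketch assumes that start is a zero-based page index and end is exclusive, which matches my reading of the pdf2docx documentation but is worth verifying against the version you install. The helpers to_zero_based_range and convert_page_range are hypothetical names introduced for illustration.

```python
def to_zero_based_range(first_page, last_page=None):
    """Map 1-based inclusive page numbers to a 0-based start, exclusive end."""
    if first_page < 1:
        raise ValueError("first_page must be >= 1")
    if last_page is not None and last_page < first_page:
        raise ValueError("last_page must be >= first_page")
    # A 1-based inclusive last page N equals a 0-based exclusive end of N.
    return first_page - 1, last_page


def convert_page_range(pdf_file, docx_file, first_page=1, last_page=None):
    """Hypothetical helper: convert only the given pages of a PDF."""
    # Third-party import is local so to_zero_based_range works on its own.
    from pdf2docx import Converter

    start, end = to_zero_based_range(first_page, last_page)
    cv = Converter(pdf_file)
    try:
        cv.convert(docx_file, start=start, end=end)
    finally:
        # Release the file handle even if the conversion fails.
        cv.close()
```

For example, convert_page_range("Input.pdf", "Output.docx", first_page=2, last_page=3) would convert only the second and third pages under these assumptions.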

Pros:

  • Free and open source.

  • Both text and graphic elements are converted.

  • Formatting and layout are preserved.

Cons:

Convert PDF to Word with Spire.PDF for Python

Spire.PDF for Python is a multi-functional library for working with PDF documents in Python. It provides a wide range of tools for creating, modifying, and programmatically manipulating PDF files.

To install it from PyPI, use the following pip command.

pip install Spire.PDF

Code example:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Convert PDF to Word with a flow layout (uncomment to enable)
# doc.ConvertOptions.SetPdfToDocOptions(True, True)

# Save to a docx file
doc.SaveToFile("Output.docx", FileFormat.DOCX)

# Release resources
doc.Close()

Pros:

  • Both text and graphic elements are converted.

  • Formatting and layout are preserved (in Fixed Page Layout mode).

  • The conversion speed is high.

Cons:

Convert PDF to Word with Aspose.Words for Python via .NET

Aspose.Words for Python via .NET is a commercial library for manipulating and converting Microsoft Word documents (.docx, .doc) using Python. It also supports converting other formats such as PDF and HTML to Word format.

It can be installed from PyPI using the following pip command.

pip install aspose-words

Code example:

import aspose.words as aw

# Load the PDF document
doc = aw.Document("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Save it as a Docx file
doc.save("Output.docx")

Pros:

  • Both text and graphic elements are converted.

  • Formatting and layout are preserved.

  • The conversion speed is high.

Cons:

  • Commercial library, not free.

Conclusion

Free and open-source libraries provide a convenient way to work with PDF and Word documents using Python without any licensing or cost issues. Commercial solutions typically offer more advanced features and better performance than free and open-source libraries. Choosing between these options depends on your specific requirements, budget, and the level of features you need.
