How to remove blank back pages from PDF after double-sided scanning

About two months ago, I wrote an article on how to scan multi-page duplex documents when only a conventional auto-feed scanner is at hand. In it I touched on the problem that MFPs often offer duplex printing but only a single-sided scanner.

However, once the problem of quickly scanning large double-sided documents was solved, another one appeared: some of the sheets may be printed on one side only. This means the PDF will contain blank pages that show nothing but, for example, scans of perforations or ring holes.

Of course, you can delete a few pages from a PDF manually, but what if there are hundreds of such files, and the documents themselves have several tens or even hundreds of pages like in the photo?

TL;DR

Solution: use a bash script and the PDFtk console program:

#!/bin/bash
# Removes blank pages from PDFs after double-sided scanning
# Described in the article https://habr.com/ru/articles/733754/
datetime=$(date +"%Y-%m-%d_%H-%M-%S")
# Create a single log file for all actions
log_file="blank_page_remover_$datetime.log"
touch "$log_file"
# Loop over all PDF files in the current directory
for file in *.pdf; do
  # Skip if nothing matched, and skip results of a previous run
  [[ -e $file && $file != *"_без пустых.pdf" ]] || continue
  echo "Processing $file..." >> "$log_file"
  # Split the PDF file into separate pages
  echo "Splitting $file into separate pages..." >> "$log_file"
  pdftk "$file" burst output "${file%.*}_pg_%04d.pdf" >> "$log_file" 2>&1
  # Delete page files smaller than the threshold (70000 bytes, roughly 70 KB)
  echo "Deleting page files smaller than 70 kilobytes..." >> "$log_file"
  for page in "${file%.*}"_pg_*.pdf; do
    size=$(wc -c < "$page")
    if [[ $size -lt 70000 ]]; then
      echo "Deleting $page (size: $size bytes)..." >> "$log_file"
      rm "$page"
    fi
  done
  # Merge the remaining pages into a new file ("без пустых" means "without blanks")
  echo "Merging the remaining pages into a new file..." >> "$log_file"
  pdftk "${file%.*}"_pg_*.pdf cat output "${file%.*}_без пустых.pdf" compress >> "$log_file" 2>&1
  # Delete temporary files
  echo -e "Deleting temporary files...\n" >> "$log_file"
  rm "${file%.*}"_pg_*.pdf
done

Option to remove blank pages from pdf using a local program

Before starting to write my script, I honestly tried to figure out how to remove blank pages from PDF using the standard tools of some program:

I tried to do it with the free, open-source PDFsam Basic, which is available for Linux, Windows, and macOS, because I had found instructions on the Internet, but they turned out to be outdated.
I tried to do it with Adobe Acrobat Pro, but it didn’t work for me. I followed these instructions:
1. Open the PDF file in Adobe Acrobat.
2. Click on the “Tools” tab on the top menu bar.
3. Select “Pages” from the list of tools on the right.
4. Click “Crop” from the “Pages” tool menu.
5. In the Page Crop dialog box, select the Remove White Margins and Remove White Margins for All Pages options.
6. Click “OK” to apply the changes.
  These steps should have automatically removed all blank pages from the PDF file, but this didn’t happen for me.

*Adobe Acrobat Pro and removing blank pages*

I tried to do it with PDF-XChange Editor, but it didn’t work for me either. I followed these instructions:
1. Open the PDF file: choose File > Open or press Ctrl + O on your keyboard, then browse to and select the PDF file you want to remove blank pages from.
2. Once the PDF file is open, click the Organize tab on the top toolbar.
3. With all pages selected, click the Delete Blank Pages button.
  The progress bar ran to completion, but the blank pages stayed in place with each of the three options.

Using a local program would of course be the best option, because the PDFs never leave the computer, which gives more privacy and security than online tools.

Option to remove blank pages from pdf using online tools

But since it didn’t work for me with local tools, I decided to try online services.

I found a few online tools that looked as if they could help remove blank pages from a PDF file automatically:

Sejda (https://www.sejda.com/delete-pdf-pages)
smallpdf (https://smallpdf.com/delete-pages-from-pdf)
DeftPDF (https://deftpdf.com/delete-pdf-pages)

In none of them could I find an option to detect blank pages automatically, although the search engine did return links (for the query "pdf remove blank pages") to pages on these services that no longer exist.

And of course, the use of online tools can compromise the confidentiality and security of your documents.

Option to remove blank pages from PDF using a local bash script and the PDFtk console program

After the failure, I decided to write my own script that will remove blank pages from all PDF files in the current directory.

While researching, I came across a great discussion of how best to remove blank pages from a PDF using the command line. Various methods were suggested, but all my documents were scanned, which means that even an empty sheet still carried some information: scans of binding holes or just dirt from the scanner.

I settled on the following algorithm:

Split the PDF document into separate page files.
Delete the pages smaller than a certain size.
Merge the remaining pages back together.
Repeat for every PDF file in the current folder.
PROFIT
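
The "certain size" in the second step is the only tunable part of the algorithm. Below is a minimal sketch of a dry run that lists which page files would fall under a given cutoff without deleting anything. The function name list_small_pages is hypothetical and is not part of the original script; it assumes the pages have already been split into separate files.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print the page files
# that are smaller than a size threshold, without deleting anything.
# Usage: list_small_pages THRESHOLD_BYTES FILE...
list_small_pages() {
  local threshold=$1
  shift
  local page size
  for page in "$@"; do
    [[ -f $page ]] || continue   # skip a glob that matched nothing
    size=$(wc -c < "$page")
    size=$((size))               # normalize padding that some wc versions add
    if (( size < threshold )); then
      printf '%s\t%s\n' "$size" "$page"
    fi
  done
}
```

For example, after splitting a hypothetical mydoc.pdf with pdftk mydoc.pdf burst output mydoc_pg_%04d.pdf, running list_small_pages 70000 mydoc_pg_*.pdf shows the size and name of each page that a 70000-byte cutoff would remove. If real pages show up in the list, the cutoff is probably too high for that scan quality.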

After some simple tinkering, the result was the file blank_page_remover.sh:

#!/bin/bash
# Removes blank pages from PDFs after double-sided scanning
# Described in the article https://habr.com/ru/articles/733754/
datetime=$(date +"%Y-%m-%d_%H-%M-%S")
# Create a single log file for all actions
log_file="blank_page_remover_$datetime.log"
touch "$log_file"
# Loop over all PDF files in the current directory
for file in *.pdf; do
  # Skip if nothing matched, and skip results of a previous run
  [[ -e $file && $file != *"_без пустых.pdf" ]] || continue
  echo "Processing $file..." >> "$log_file"
  # Split the PDF file into separate pages
  echo "Splitting $file into separate pages..." >> "$log_file"
  pdftk "$file" burst output "${file%.*}_pg_%04d.pdf" >> "$log_file" 2>&1
  # Delete page files smaller than the threshold (70000 bytes, roughly 70 KB)
  echo "Deleting page files smaller than 70 kilobytes..." >> "$log_file"
  for page in "${file%.*}"_pg_*.pdf; do
    size=$(wc -c < "$page")
    if [[ $size -lt 70000 ]]; then
      echo "Deleting $page (size: $size bytes)..." >> "$log_file"
      rm "$page"
    fi
  done
  # Merge the remaining pages into a new file ("без пустых" means "without blanks")
  echo "Merging the remaining pages into a new file..." >> "$log_file"
  pdftk "${file%.*}"_pg_*.pdf cat output "${file%.*}_без пустых.pdf" compress >> "$log_file" 2>&1
  # Delete temporary files
  echo -e "Deleting temporary files...\n" >> "$log_file"
  rm "${file%.*}"_pg_*.pdf
done

To run the script, you will need PDFtk (short for PDF Toolkit), which is a command-line tool for working with PDF files. How to install it for different operating systems can be found in the previous article.
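
Since the script logs PDFtk errors to a file rather than to the screen, a missing PDFtk installation is easy to overlook. A minimal sketch of an early check follows; the function name require_tool is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given command is available on PATH,
# otherwise print a message to stderr and return a non-zero status.
require_tool() {
  local tool=$1
  if ! command -v "$tool" > /dev/null 2>&1; then
    echo "Required tool not found: $tool" >&2
    return 1
  fi
}

# Example use at the top of a script: stop early when PDFtk is missing.
# require_tool pdftk || exit 1
```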

How to use the script to remove blank pages from a PDF document

To run a bash script on a computer, follow these steps, depending on your operating system:

For Linux and macOS:

Open Terminal: press Ctrl + Alt + T on Linux, or open Terminal from the Applications > Utilities folder on macOS.
Change to the directory where the script is located: use the cd command followed by the directory path. For example:
cd /path/to/script
Make the script executable:
chmod +x blank_page_remover.sh
Run the script by typing ./ followed by the name of the script:
./blank_page_remover.sh
PROFIT!
The script will create new PDF files without blank pages and a detailed log of actions.

*Terminal in Ubuntu and the result of running the blank_page_remover.sh script*

For Windows (using Git Bash or WSL):

Install Git Bash or WSL: if you haven’t already, install Git Bash or Windows Subsystem for Linux (WSL).
Open Git Bash or WSL: right-click the folder containing the script and select Git Bash Here or Open in WSL.
Make the script executable:
chmod +x blank_page_remover.sh
Run the script by typing ./ followed by the name of the script:
./blank_page_remover.sh
PROFIT!
The script will create new PDF files without blank pages and a detailed log of actions.
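
After a run, it can be worth checking how many pages were actually removed from each document. The sketch below relies on PDFtk's dump_data operation, whose output includes a NumberOfPages line. The function names are hypothetical, and the sketch assumes the cleaned file sits next to the original with the "_без пустых.pdf" suffix that the script uses.

```shell
#!/bin/bash
# Hypothetical helper: read `pdftk FILE dump_data` output on stdin and
# print the value of its NumberOfPages line.
page_count_from_dump() {
  awk '/^NumberOfPages:/ { print $2; exit }'
}

# Hypothetical helper (requires pdftk): print the page count of an original
# PDF and of the cleaned copy the script wrote next to it.
report_removed_pages() {
  local original=$1
  local cleaned="${original%.*}_без пустых.pdf"
  local before after
  before=$(pdftk "$original" dump_data | page_count_from_dump)
  after=$(pdftk "$cleaned" dump_data | page_count_from_dump)
  echo "$original: $before -> $after pages"
}
```

Calling report_removed_pages on each original file gives a quick sanity check: if the second number is much lower than expected, the size threshold probably caught pages with real content.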

Conclusion

Removing blank pages from PDF files after duplex scanning can be a daunting task, especially when working with large volumes of documents. However, this article provided you with a solution in the form of using an automatic local bash script with the PDFtk console program.

By following the detailed instructions, you can effectively get rid of blank pages and keep your scanned PDF documents looking clean and professional.

Regardless of the size or complexity of your files, this solution will streamline your workflow and save you time and effort.

Author: Mikhail Shardin,

May 10, 2023