pypdf: Supercharge PDF Text Extraction in Python

August 4, 2023

Khuyen Tran

Motivation

Extracting text from PDFs often captures undesired elements, such as headers, footers, page numbers, or small captions. This happens because most PDF files carry no semantic layer marking which text is a header, a footer, or body content, which makes it difficult to isolate the main content.

For example, when extracting text from a report PDF, you might encounter the following:

Header: Company Report 2023
Page 1
Main content starts here...
...
Footer: Confidential Information

In such cases, the inclusion of headers, footers, and other irrelevant elements can clutter the output and hinder analysis. The solution involves applying specific logic to filter out these undesired portions of the text.

Introduction to PyPDF

PyPDF is a free and open-source, pure-Python PDF library that can split, merge, crop, transform, and manipulate PDFs. It also supports extracting text and metadata from PDF files, making it a powerful tool for PDF processing.

You can install PyPDF using pip:

pip install pypdf

Extract PDF Text Precisely

To understand how PyPDF works, let’s extract text from a sample PDF, attachment.pdf. We first extract the text without applying any filtering logic:

from pypdf import PdfReader

# Extracting text without filtering headers, footers, or specific elements
reader = PdfReader("attachment.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

Output:

The Crazy Ones
October 14, 1998
Heres to the crazy ones. The misﬁts. The rebels. The troublemakers.
The round pegs in the square holes.
The ones who see things diﬀerently. Theyre not fond of rules. And
they have no respect for the status quo. You can quote them,
disagree with them, glorify or vilify them.
About the only thing you cant do is ignore them. Because they change
things. They invent. They imagine. They heal. They explore. They
create. They inspire. They push the human race forward.
Maybe they have to be crazy.
How else can you stare at an empty canvas and see a work of art? Or
sit in silence and hear a song thats never been written? Or gaze at
a red planet and see a laboratory on wheels?
We make tools for these kinds of people.
While some see them as the crazy ones, we see genius. Because the
people who are crazy enough to think they can change the world,
are the ones who do.

The output contains every text element on the page, including “The Crazy Ones” and “October 14, 1998.”

To extract only relevant text, we use the visitor_text feature of PyPDF, which enables us to apply custom logic for filtering. Below, we demonstrate how to filter out small-font elements (e.g., headers, footers, or captions) using a visitor function.

from pypdf import PdfReader

# Threshold to consider "small" text
SMALL_FONT_THRESHOLD = 10

# Prepare a list to store the filtered text
parts = []


# Visitor function that keeps only text at or above the font-size threshold
def visitor_body(text, cm, tm, font_dict, font_size):
    if font_size >= SMALL_FONT_THRESHOLD:
        parts.append(text)

The visitor function processes each text snippet in the PDF and filters it based on specific criteria. Key parameters include:

text: The actual text string being processed.
cm (current transformation matrix): Represents the text’s positioning and scaling on the page. For instance, cm[4] and cm[5] indicate the horizontal and vertical positions.
tm (text matrix): The matrix that positions and scales text within the current text object. Like cm, it is a six-element affine matrix.
font_dict: A dictionary containing font-related metadata, such as font type or style.
font_size: Indicates the size of the text snippet, allowing us to filter out small font elements like headers or footers.
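
Before choosing a threshold, it can help to look at what the visitor actually receives. The sketch below is one possible way to do that: it records each fragment with the position values described above and then counts how many characters appear at each font size. The helper names (make_recording_visitor, chars_per_font_size, inspect_page) are hypothetical and not part of PyPDF, and reading positions from cm[4] and cm[5] follows the description in this article rather than a guarantee for every PDF.

```python
from collections import Counter


def make_recording_visitor(records):
    """Return a visitor that stores every fragment it is handed (hypothetical helper)."""

    def visitor(text, cm, tm, font_dict, font_size):
        records.append(
            {
                "text": text,
                "x": cm[4],  # horizontal position, per the parameter list above
                "y": cm[5],  # vertical position, per the parameter list above
                "font_size": font_size,
            }
        )

    return visitor


def chars_per_font_size(records):
    """Count non-whitespace characters seen at each font size."""
    counts = Counter()
    for record in records:
        counts[record["font_size"]] += len("".join(record["text"].split()))
    return counts


def inspect_page(path, page_index=0):
    """Summarise the font sizes used on one page of a real PDF."""
    # Imported lazily so the two helpers above can be tried without pypdf installed.
    from pypdf import PdfReader

    records = []
    page = PdfReader(path).pages[page_index]
    page.extract_text(visitor_text=make_recording_visitor(records))
    return chars_per_font_size(records)
```

Calling inspect_page("attachment.pdf").most_common() would list the font sizes on the first page, most frequent first. The size that carries most of the characters is usually the body text, which gives a starting point for the threshold.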

By integrating the visitor function into the extract_text method, we can filter and extract only the desired content:

reader = PdfReader("attachment.pdf")
page = reader.pages[0]
page.extract_text(visitor_text=visitor_body)

# Combine all filtered parts into a single string
text_body = "".join(parts)

print(text_body)

Output:

Heres to the crazy ones. The misﬁts. The rebels. The troublemakers.
The round pegs in the square holes.
The ones who see things diﬀerently. Theyre not fond of rules. And
they have no respect for the status quo. You can quote them,
disagree with them, glorify or vilify them.
About the only thing you cant do is ignore them. Because they change
things. They invent. They imagine. They heal. They explore. They
create. They inspire. They push the human race forward.
Maybe they have to be crazy.
How else can you stare at an empty canvas and see a work of art? Or
sit in silence and hear a song thats never been written? Or gaze at
a red planet and see a laboratory on wheels?
We make tools for these kinds of people.
While some see them as the crazy ones, we see genius. Because the
people who are crazy enough to think they can change the world,
are the ones who do.

The result no longer contains “The Crazy Ones” or “October 14, 1998”; only the body text remains. This makes the extracted text more suitable for further analysis or processing.
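
The snippets above collect text in a module-level parts list, so running them again for a second page would keep appending to the same list. The following is a minimal sketch of one way around that, assuming the goal is to keep fragments at or above a font-size threshold on every page. The names make_size_filter, extract_body_text, and min_font_size are hypothetical and introduced only for illustration.

```python
def make_size_filter(min_font_size, parts):
    """Return a visitor that keeps fragments at or above min_font_size."""

    def visitor(text, cm, tm, font_dict, font_size):
        if font_size >= min_font_size:
            parts.append(text)

    return visitor


def extract_body_text(path, min_font_size=10):
    """Extract filtered text from every page of the PDF at `path`."""
    # Imported lazily so make_size_filter can be tested without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    pages_text = []
    for page in reader.pages:
        parts = []  # fresh list per page, so fragments from different pages never mix
        page.extract_text(visitor_text=make_size_filter(min_font_size, parts))
        pages_text.append("".join(parts))
    return "\n".join(pages_text)
```

With this in place, extract_body_text("attachment.pdf") would return the filtered text of the whole document as one string. Building the visitor in a factory function keeps the threshold and the output list explicit instead of relying on globals.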

Conclusion

The visitor_text feature of PyPDF provides a powerful way to control and filter text extraction from PDF files. By applying custom logic, such as filtering based on font size, you can isolate the main content and exclude irrelevant elements. This enhances the usability and accuracy of the extracted text, making PyPDF an indispensable tool for PDF processing.
