To programmatically highlight text in PDFs, you can use the txtmarker
library. This guide will walk you through the installation, usage, and an advanced example of highlighting text based on questions using extractive QA.
Installation
To install txtmarker
, run the following commands:
pip install txtmarker
Highlighting Text in PDFs
Here is an example of how to use txtmarker
:
from txtmarker.factory import Factory
from pdf2image import convert_from_path
from IPython.display import display
highlighter = Factory.create("pdf")
highlighter.highlight(
"The Fascinating World of Penguins.pdf",
"output.pdf",
[("Walking challenge", "Penguins have to deal with flippers instead of feet")],
)
This code will highlight the text “Penguins have to deal with flippers instead of feet” in the PDF file “The Fascinating World of Penguins.pdf” and save the output to a new PDF file called “output.pdf”.
Extractive QA
What if you want to highlight text in a PDF by answering specific questions? For this, you can combine txtmarker
with extractive QA using the transformers
library.
For this use case, install the following libraries:
pip install pdf2image transformers txtmarker
brew install poppler
Below is the utility code for extracting text from PDFs, running extractive QA, and highlighting relevant answers.
import re
from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
from transformers import pipeline
from IPython.display import display
from txtmarker.factory import Factory
# Create pipeline
nlp = pipeline("question-answering")
# Create highlighter
highlighter = Factory.create("pdf")
# Extracts text from pdf
def extract(path):
text = extract_text(path)
# Clean data
text = re.sub(r"\n+", " ", text)
return re.sub(r"[^\x20-\x7F]+", "", text)
# Renders first page of pdf file as image
def highlight(path, highlights):
# Get PDF text
context = extract(path)
# Run extractive qa
highlights = [(name, qa(context, question)) for name, question in highlights]
# Create annotated file
highlighter.highlight(path, "out.pdf", highlights)
# Render pdf as image
images = convert_from_path("out.pdf", size=(800, None), single_file=True)
display(images[0])
# Runs extractive qa
def qa(context, question):
return nlp(context=context, question=question)["answer"]
To highlight answers to specific questions in a PDF, use the following code:
highlight("The Fascinating World of Penguins.pdf", [
("Walking problem", "What is the main challenge penguins face when trying to walk?"),
("Known for", "What is penguin known for?"),
("Flying problem", "What is the main challenge penguins face when trying to fly?"),
])
This code will highlight the answers to the questions in the PDF file “The Fascinating World of Penguins.pdf” and display the output as an image.
Conclusion
In this blog post, we have explored how to use Python to highlight text in PDFs using the txtmarker
library. With these tools, you can automate the process of highlighting text in PDFs and make it easier to extract relevant information from large documents.