txtmarker: Question-Based PDF Highlighting with Python

To programmatically highlight text in PDFs, you can use the txtmarker library. This guide walks through installation, basic usage, and an advanced example that highlights the answers to questions using extractive QA.

Installation

To install txtmarker, run the following command:

pip install txtmarker

Highlighting Text in PDFs

Here is an example of how to use txtmarker. Each highlight is a (name, text) tuple: a name for the highlight and the text to mark.

from txtmarker.factory import Factory

highlighter = Factory.create("pdf")
highlighter.highlight(
    "The Fascinating World of Penguins.pdf",
    "output.pdf",
    [("Walking challenge", "Penguins have to deal with flippers instead of feet")],
)

This code will highlight the text “Penguins have to deal with flippers instead of feet” in the PDF file “The Fascinating World of Penguins.pdf” and save the output to a new PDF file called “output.pdf”.
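
A phrase can only be highlighted if it actually occurs in the PDF, so it can help to check the phrases against the document's text first. The sketch below is a hypothetical helper, not part of txtmarker: `normalize` and `missing_highlights` are names made up for illustration, and the helper works on a plain string, such as text extracted from the PDF beforehand.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so line breaks in the PDF text do not block a match."""
    return re.sub(r"\s+", " ", text).strip()


def missing_highlights(document_text, highlights):
    """Return the names of highlights whose text is not found in document_text.

    highlights uses the same (name, text) tuples passed to highlighter.highlight.
    This is a hypothetical pre-check, not a txtmarker function.
    """
    haystack = normalize(document_text)
    return [name for name, text in highlights if normalize(text) not in haystack]


# Stand-in for text extracted from the PDF (note the line break mid-sentence)
sample_text = "Penguins have to deal with\nflippers instead of feet."

highlights = [
    ("Walking challenge", "Penguins have to deal with flippers instead of feet"),
    ("Diet", "Penguins eat only seaweed"),
]

print(missing_highlights(sample_text, highlights))  # ['Diet']
```

Any name returned by the check points to a phrase that is worth correcting before calling `highlighter.highlight`.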

Extractive QA

What if you want to highlight text in a PDF by answering specific questions? For this, you can combine txtmarker with extractive QA, which selects the span of text that answers a question, using the transformers library.

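For this use case, install the following libraries. pdf2image relies on poppler; the brew command is for macOS, and on Debian or Ubuntu the equivalent package is poppler-utils. The transformers pipeline also needs a backend such as PyTorch to be installed.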

pip install pdf2image transformers txtmarker
brew install poppler

Below is the utility code for extracting text from PDFs, running extractive QA, and highlighting relevant answers.

import re

from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
from transformers import pipeline

from IPython.display import display

from txtmarker.factory import Factory

# Create pipeline
nlp = pipeline("question-answering")

# Create highlighter
highlighter = Factory.create("pdf")

# Extracts text from pdf
def extract(path):
  text = extract_text(path)

  # Clean data
  text = re.sub(r"\n+", " ", text)
  return re.sub(r"[^\x20-\x7F]+", "", text)

# Answers each question, highlights the answers in the pdf and renders the first page as an image
def highlight(path, highlights):
  # Get PDF text
  context = extract(path)

  # Run extractive qa
  highlights = [(name, qa(context, question)) for name, question in highlights]

  # Create annotated file
  highlighter.highlight(path, "out.pdf", highlights)

  # Render pdf as image
  images = convert_from_path("out.pdf", size=(800, None), single_file=True)
  display(images[0])

# Runs extractive qa
def qa(context, question):
  return nlp(context=context, question=question)["answer"]
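
The two substitutions in `extract` can be tried on a plain string to see their effect. This is a small sketch using only the standard library; `clean` is a hypothetical name for the cleaning step, and the sample string is made up.

```python
import re


def clean(text):
    """Apply the same two substitutions used in extract."""
    # Replace each run of newlines with a single space
    text = re.sub(r"\n+", " ", text)
    # Drop every run of characters outside the range \x20-\x7F
    return re.sub(r"[^\x20-\x7F]+", "", text)


# Made-up sample with a curly apostrophe and a paragraph break
sample = "Penguins can\u2019t fly.\n\nThey swim instead."
print(clean(sample))  # Penguins cant fly. They swim instead.
```

Note that characters outside the kept range, such as the curly apostrophe, are removed and not replaced. The answers returned by `qa` are taken from this cleaned text.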

To highlight answers to specific questions in a PDF, use the following code:

highlight("The Fascinating World of Penguins.pdf", [
  ("Walking problem", "What is the main challenge penguins face when trying to walk?"),
  ("Known for", "What are penguins known for?"),
  ("Flying problem", "What is the main challenge penguins face when trying to fly?"),
])

This code highlights the answers to the questions in “The Fascinating World of Penguins.pdf”, saves the annotated file as “out.pdf”, and displays its first page as an image.
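
The question-answering pipeline returns a confidence `score` alongside each `answer`, which can be used to skip weak answers before highlighting. The following is a sketch under assumptions: `select_highlights`, the `ask` callable, and the 0.3 threshold are hypothetical, and `fake_ask` with its made-up results stands in for the real pipeline so the example runs without downloading a model.

```python
def select_highlights(questions, ask, min_score=0.3):
    """Build (name, answer) tuples, keeping only answers scored at least min_score.

    questions: list of (name, question) tuples
    ask: callable that takes a question and returns a dict with "answer" and "score"
    min_score: hypothetical cutoff, to be tuned per document
    """
    selected = []
    for name, question in questions:
        result = ask(question)
        if result["score"] >= min_score:
            selected.append((name, result["answer"]))
    return selected


# Made-up results standing in for the QA pipeline output
canned = {
    "What do penguins use to swim?": {"answer": "flippers", "score": 0.82},
    "Who wrote the document?": {"answer": "the ice", "score": 0.04},
}


def fake_ask(question):
    """Stub for the pipeline call; returns a canned result."""
    return canned[question]


questions = [
    ("Swimming", "What do penguins use to swim?"),
    ("Author", "Who wrote the document?"),
]

print(select_highlights(questions, fake_ask))  # [('Swimming', 'flippers')]
```

With the real pipeline, `ask` could be `lambda question: nlp(context=context, question=question)`, and the resulting list can be passed to `highlighter.highlight` as before.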

Conclusion

In this blog post, we explored how to highlight text in PDFs with Python using the txtmarker library, both for fixed passages and for answers found with extractive QA. With these tools, you can automate highlighting and make it easier to find relevant information in large documents.

Link to txtmarker.

Work with Khuyen Tran
