txtmarker: Question-Based PDF Highlighting with Python

To programmatically highlight text in PDFs, you can use the txtmarker library. This guide will walk you through the installation, usage, and an advanced example of highlighting text based on questions using extractive QA.

Installation

To install txtmarker, run the following commands:

pip install txtmarker

Highlighting Text in PDFs

Here is an example of how to use txtmarker:

from txtmarker.factory import Factory
from pdf2image import convert_from_path
from IPython.display import display

highlighter = Factory.create("pdf")
highlighter.highlight(
    "The Fascinating World of Penguins.pdf",
    "output.pdf",
    [("Walking challenge", "Penguins have to deal with flippers instead of feet")],
)

This code will highlight the text “Penguins have to deal with flippers instead of feet” in the PDF file “The Fascinating World of Penguins.pdf” and save the output to a new PDF file called “output.pdf”.

Extractive QA

What if you want to highlight text in a PDF by answering specific questions? For this, you can combine txtmarker with extractive QA using the transformers library.

For this use case, install the following libraries:

pip install pdf2image transformers txtmarker
brew install poppler

Below is the utility code for extracting text from PDFs, running extractive QA, and highlighting relevant answers.

import re

from pdf2image import convert_from_path
from pdfminer.high_level import extract_text
from transformers import pipeline

from IPython.display import display

from txtmarker.factory import Factory

# Create pipeline
nlp = pipeline("question-answering")

# Create highlighter
highlighter = Factory.create("pdf")

# Extracts text from pdf
def extract(path):
  text = extract_text(path)

  # Clean data
  text = re.sub(r"\n+", " ", text)
  return re.sub(r"[^\x20-\x7F]+", "", text)

# Renders first page of pdf file as image
def highlight(path, highlights):
  # Get PDF text
  context = extract(path)

  # Run extractive qa
  highlights = [(name, qa(context, question)) for name, question in highlights]

  # Create annotated file
  highlighter.highlight(path, "out.pdf", highlights)

  # Render pdf as image
  images = convert_from_path("out.pdf", size=(800, None), single_file=True)
  display(images[0])

# Runs extractive qa
def qa(context, question):
  return nlp(context=context, question=question)["answer"]

To highlight answers to specific questions in a PDF, use the following code:

highlight("The Fascinating World of Penguins.pdf", [
  ("Walking problem", "What is the main challenge penguins face when trying to walk?"),
  ("Known for", "What is penguin known for?"),
  ("Flying problem", "What is the main challenge penguins face when trying to fly?"),
])

This code will highlight the answers to the questions in the PDF file “The Fascinating World of Penguins.pdf” and display the output as an image.

Conclusion

In this blog post, we have explored how to use Python to highlight text in PDFs using the txtmarker library. With these tools, you can automate the process of highlighting text in PDFs and make it easier to extract relevant information from large documents.

Link to txtmarker.

Search

Better Outputs

txtmarker: Question-Based PDF Highlighting with Python

txtmarker: Question-Based PDF Highlighting with Python

Installation

Highlighting Text in PDFs

Extractive QA

Conclusion

Search

Related Posts

Dynamic Report Generation with Jinja Templates

3 Tools to Track and Visualize the Execution of your Python Code

From Python to Paper: Visualizing Calculations with Handcalcs

Related Posts

IPyflow: Automatic Dependency Tracking in Jupyter Notebooks

DB Browser for SQLite: A Visual Alternative to Command Line

Python’s dropwhile: A Clean Approach to Sequential Filtering

Stay up-to-date with
data skills using
CodeCut

Drop a line

Get in touch

Follow Us on Social Media

txtmarker: Question-Based PDF Highlighting with Python

txtmarker: Question-Based PDF Highlighting with Python

Installation

Highlighting Text in PDFs

Extractive QA

Conclusion

Search

Related Posts

Related Posts

Stay up-to-date with data skills using CodeCut

Follow Us on Social Media

Work with Khuyen Tran

Work with Khuyen Tran

Stay up-to-date with
data skills using
CodeCut