Transform Any PDF into Searchable AI Data with Docling

What if complex research papers could be transformed into AI-searchable data using fewer than 10 lines of Python?

Financial reports, research documents, and analytical papers often contain vital tables and formulas that traditional PDF tools fail to extract properly. This results in the loss of structured data that could inform key decisions.

Docling, developed by IBM Research, is an AI-first document processing tool that preserves the relationships between text, tables, and formulas. With just three lines of code, you can convert any document into structured data.

In this tutorial, you’ll learn how to build a complete pipeline that takes documents in any format and turns them into high-quality RAG-ready chunks for AI applications.

Setting Up Your Document Processing Pipeline

What is Docling?

Docling is an AI-first document processing tool developed by IBM Research. It transforms complex documents—like PDFs, Excel spreadsheets, and Word files—into structured data while preserving their original structure, including text, tables, and formulas.

To install Docling, run the following command:

pip install docling

What is RAG?

RAG (Retrieval-Augmented Generation) is an AI technique that combines document retrieval with language generation. Instead of relying solely on training data, RAG systems search through external documents to find relevant information, then use that context to generate accurate, up-to-date responses.

This process requires converting documents into structured, searchable chunks. Docling handles this conversion seamlessly.
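To make the retrieve-then-generate idea concrete, here is a minimal sketch that uses only the standard library. Word overlap stands in for real embedding similarity, and the chunks, the question, and the helper names `tokenize` and `retrieve` are all hypothetical, not part of Docling or any RAG framework.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, chunks, k=2):
    """Rank chunks by shared words with the query (a toy stand-in for embeddings)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]

# Hypothetical chunks standing in for the output of a document converter
chunks = [
    "Docling converts PDF documents to JSON or Markdown.",
    "FAISS is a library for similarity search.",
    "OCR can be applied to scanned PDF documents.",
]

question = "What output formats does Docling support for PDF documents?"
context = retrieve(question, chunks, k=2)

# The retrieved text becomes the context section of the language model prompt
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```

A real system would replace the word-overlap score with embedding similarity and send the prompt to a language model.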

Quick Start: Your First Document Conversion

Docling transforms a document into structured data with just three lines of code. Let’s see this in action by converting a PDF document: Docling’s own technical report from arXiv. It is a good example because it contains many element types, including tables, formulas, and text.

from docling.document_converter import DocumentConverter

# Initialize converter with default settings
converter = DocumentConverter()

# Convert any document format - we'll use the Docling technical report itself
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)

# Access structured data immediately
doc = result.document
print(f"Successfully processed document from: {source_url}")

To iterate through each document element, we will use the doc.iterate_items() method. This method returns tuples of (item, level). For example:

  • (TextItem(label='paragraph', text='Introduction text...'), 0) – top-level paragraph
  • (TableItem(label='table', text='| Col1 | Col2 |...'), 1) – table at depth 1
  • (TextItem(label='section_header', text='Section 2'), 0) – section heading

from collections import defaultdict

# Create a dictionary to categorize all document elements by type
element_types = defaultdict(list)

# Iterate through all document elements and group them by label
for item, _ in doc.iterate_items():
    element_type = item.label
    element_types[element_type].append(item)

# Display the breakdown of document structure
print("Document structure breakdown:")
for element_type, items in element_types.items():
    print(f"  {element_type}: {len(items)} elements")

The output shows the different types of elements Docling extracted from the document.

Document structure breakdown:
  picture: 13 elements
  section_header: 31 elements
  text: 102 elements
  list_item: 22 elements
  code: 2 elements
  footnote: 1 elements
  caption: 3 elements
  table: 5 elements
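The same (item, level) tuples can also be filtered rather than grouped. The sketch below uses a hypothetical `StandInItem` class and made-up pairs in place of real Docling items, so it runs without converting a document. It assumes only that each item exposes `label` and `text`, as in the loop above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class StandInItem:
    """Hypothetical stand-in exposing the two attributes used here: label and text."""
    label: str
    text: str

# Hypothetical (item, level) pairs shaped like the tuples from doc.iterate_items()
pairs = [
    (StandInItem("section_header", "Abstract"), 1),
    (StandInItem("text", "This technical report introduces Docling"), 1),
    (StandInItem("section_header", "1 Introduction"), 1),
    (StandInItem("text", "Converting PDF documents back"), 1),
]

# Count elements per label, then keep only the section headers as a rough outline
label_counts = Counter(item.label for item, _ in pairs)
headings = [item.text for item, _ in pairs if item.label == "section_header"]

print(dict(label_counts))
print(headings)
```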

Let’s look at a structured element that is crucial for RAG applications, starting with the first table:

first_table = element_types["table"][0]
print(first_table.export_to_dataframe().to_markdown())
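|    | CPU. | Thread budget. | native backend.TTS | native backend.Pages/s | native backend.Mem | pypdfium backend.TTS | pypdfium backend.Pages/s | pypdfium backend.Mem |
|---:|:-----|:---------------|:-------------------|:-----------------------|:-------------------|:---------------------|:-------------------------|:---------------------|
|  0 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
|  1 | (16 cores) Intel(R) E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |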

Here is how the table looks in the original PDF:

[Image: CPU performance comparison table from the original PDF document showing merged cells structure]

Comparing the two shows both Docling’s accuracy and one structural difference: Docling captured the numerical data and text but flattened the merged cells into separate columns.

While this loses visual formatting, it benefits RAG applications since each row contains complete information without complex cell merging logic.
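One way to use such self-contained rows is to serialize each one as a short text record before embedding it. The sketch below is only an illustration: the `rows` list copies a few values from the extracted table by hand, and `row_to_text` is a hypothetical helper, not a Docling function.

```python
# Hand-copied subset of the extracted benchmark table (three of its columns)
rows = [
    {"CPU": "Apple M3 Max", "native backend.Mem": "6.20 GB", "pypdfium backend.Mem": "2.56 GB"},
    {"CPU": "(16 cores) Intel(R) E5-2690", "native backend.Mem": "6.16 GB", "pypdfium backend.Mem": "2.42 GB"},
]

def row_to_text(row):
    """Serialize one table row as 'column: value' pairs so it can be embedded on its own."""
    return "; ".join(f"{column}: {value}" for column, value in row.items())

row_texts = [row_to_text(row) for row in rows]
for text in row_texts:
    print(text)
```

Because every record repeats the column names, a retrieved row can be understood without the rest of the table.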

Next, look at the first six list items:

first_list_items = element_types["list_item"][0:6]
for list_item in first_list_items:
    print(list_item.text)
· Converts PDF documents to JSON or Markdown format, stable and lightning fast
· Understands detailed page layout, reading order, locates figures and recovers table structures
· Extracts metadata from the document, such as title, authors, references and language
· Optionally applies OCR, e.g. for scanned PDFs
· Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)
· Can leverage different accelerators (GPU, MPS, etc).

These match the corresponding list in the original PDF.

[Image: Original PDF showing the first 6 list items]

Look at the first caption element:

first_caption = element_types["caption"][0]
print(first_caption.text)

This matches the image caption in the original PDF.

[Image: Screenshot of a technical diagram from the original PDF showing Docling's processing pipeline with labeled components and arrows indicating workflow sequence]


Export Options for Different Use Cases

Docling provides multiple ways to export the document data, including Markdown, JSON, and dictionary formats.

For human review and documentation, Markdown format preserves the document structure beautifully.

# Human-readable markdown for review
markdown_content = doc.export_to_markdown()
print(markdown_content[:500] + "...")
<!-- image -->

## Docling Technical Report

Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research R¨ uschlikon, Switzerland

## Abstract

This technical report introduces Docling , an easy to use, self-contained, MITli...

Compare this to the original PDF:

[Image: Screenshot of the original PDF title page showing "Docling Technical Report Version 1.0" with multiple author names and IBM Research affiliation in formal academic layout]

Docling preserves all original content while converting complex PDF formatting into clean markdown. Every author name, title, and abstract text remains intact, creating searchable structure perfect for RAG applications.

For programmatic processing and API integrations, JSON format provides structured access to all document elements:

import json

# JSON for programmatic processing
json_dict = doc.export_to_dict()
json_str = json.dumps(json_dict)  # Serialize for files or API responses

print('JSON keys:', json_dict.keys())
JSON keys: dict_keys(['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages'])

The JSON structure reveals Docling’s comprehensive document analysis. Key sections include texts for paragraphs, tables for structured data, pictures for images, and pages for layout information.
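A common next step is to write the exported dictionary to disk and load it back later. The sketch below shows that round trip with the standard `json` module. The small `json_dict` is a hand-made stand-in with a few of the keys listed above, not real Docling output, and the file name is arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for doc.export_to_dict(); the real dictionary is much larger
json_dict = {
    "schema_name": "DoclingDocument",
    "name": "example",
    "texts": [{"label": "text", "text": "Version 1.0"}],
    "tables": [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "document.json"
    # Write the dictionary as indented JSON, then read it back
    out_path.write_text(json.dumps(json_dict, indent=2), encoding="utf-8")
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))

print(sorted(reloaded.keys()))
print(len(reloaded["texts"]), "text element(s)")
```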

For Python development workflows, the same dictionary gives immediate access to all document elements without any serialization step.

# Python dictionary for immediate use
dict_repr = doc.export_to_dict()

# Preview the structure
num_texts = len(dict_repr['texts'])
num_tables = len(dict_repr['tables'])

print(f"Text elements: {num_texts}")
print(f"Table elements: {num_tables}")
Text elements: 985
Table elements: 5

Configuring PdfPipelineOptions for Advanced Processing

The default Docling configuration works well for most documents, but PdfPipelineOptions unlocks advanced processing capabilities. These options control OCR engines, table recognition, AI enrichments, and performance settings.

PdfPipelineOptions becomes essential when working with scanned documents, complex layouts, or specialized content requiring AI-powered understanding.

Enable Image Extraction

By default, Docling does not extract images from the document. However, you can enable image extraction by setting the generate_picture_images option to True.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(generate_picture_images=True)

# Create converter with image extraction enabled
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

Display the first image:

# Extract and display the first image
from IPython.display import Image, display

for item, _ in doc_enhanced.iterate_items():
    if item.label == "picture":
        image_data = item.image

        # Get the image URI
        uri = str(image_data.uri)

        # Display the image using IPython
        display(Image(url=uri))
        break

[Image: Yellow rubber duck toy on light surface]

The output image matches the first image of the PDF.

Table Recognition Enhancement

To use the more sophisticated AI model for table extraction instead of the default fast model, you can set the table_structure_options.mode to TableFormerMode.ACCURATE.

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption

# Enhanced table processing for complex layouts
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Create converter with enhanced table processing
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

AI-Powered Content Understanding

AI enrichments enhance extracted content with semantic understanding. Picture descriptions, formula detection, and code parsing improve RAG accuracy by adding crucial context.

In the code below, we:

  • Set the do_picture_description option to True to enable picture description extraction
  • Set the picture_description_options option to use the SmolVLM-256M-Instruct model from Hugging Face

from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# AI-powered content enrichment
pipeline_options = PdfPipelineOptions(
    do_picture_description=True,  # AI-generated image descriptions
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture. Be precise and concise.",
    ),
    generate_picture_images=True,
)

converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

Extract the picture description from the second picture:

second_picture = doc_enhanced.pictures[1]

print(f"Caption: {second_picture.caption_text(doc=doc_enhanced)}")

# Check for annotations
for annotation in second_picture.annotations:
    print(annotation.text)
Caption: Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
### Image Description

The image is a flowchart that depicts a sequence of steps from a document, likely a report or a document. The flowchart is structured with various elements such as text, icons, and arrows. Here is a detailed description of the flowchart:

#### Step 1: Parse
- **Description:** The first step in the process is to parse the document. This involves converting the text into a format that can be easily understood by the user.

#### Step 2: Ocr
- **Description:** The second step is to perform OCR (Optical Character Recognition) on the document. This involves converting the text into a format that can be easily read by the OCR software.

#### Step 3: Layout Analysis
- **Description:** The third step is to analyze the document's layout. This involves examining the document's structure, including the layout of the text, the alignment of the text, and the alignment of the document's content

Here is the original image:

[Image: Technical flowchart diagram from original PDF showing Docling's processing pipeline with parse, OCR, and layout analysis steps]

The detailed description shows how Docling’s picture analysis transforms visual content into text that can be indexed and searched, making diagrams accessible to RAG systems.

Performance and Memory Management

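Processing a large document can be time-consuming. To speed up the process, we can use the following settings (names follow recent Docling 2.x releases, so check them against your installed version):

  • The page_range argument of convert() to process only a specific page range.
  • The max_num_pages argument of convert() to reject documents longer than a given page count.
  • The images_scale option to reduce the image resolution for speed.
  • The generate_page_images option to skip page images to save memory.
  • The do_table_structure option to skip table structure extraction.
  • The accelerator_options option to set the number of CPU threads.

# Optimized for large documents
from docling.datamodel.pipeline_options import AcceleratorOptions

pipeline_options = PdfPipelineOptions(
    generate_page_images=False,  # Skip page images to save memory
    do_table_structure=False,  # Skip table structure extraction
    accelerator_options=AcceleratorOptions(num_threads=8),  # Use multiple cores
)
converter_fast = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# page_range is an argument of convert(), not a pipeline option
result_fast = converter_fast.convert("https://arxiv.org/pdf/2408.09869", page_range=(1, 3))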

Building Your RAG Pipeline

We’ll build our RAG pipeline in five steps:

  • Document Processing: Use Docling to convert documents into structured data
  • Chunking: Break documents into smaller, searchable pieces
  • Create Embeddings: Convert text chunks into vector representations
  • Store in Vector Database: Save embeddings in FAISS for fast similarity search
  • Query: Retrieve relevant chunks and generate contextual responses

Tools for RAG Pipelines

Building RAG pipelines requires three essential tools:

  • Docling: converts documents into structured data
  • LangChain: manages document workflows and chain orchestration, and provides integrations for embedding models
  • FAISS: stores and retrieves document chunks

These tools work together to create complete RAG pipelines that can process, store, and retrieve document content intelligently.

LangChain

LangChain simplifies building AI applications by providing components for document loading, text processing, and chain orchestration. It integrates seamlessly with vector stores and language models.

For a comprehensive introduction to LangChain fundamentals and local AI workflows, see our LangChain and Ollama guide.

FAISS

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search in high-dimensional spaces. It enables fast retrieval of the most relevant document chunks based on embedding similarity.
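Conceptually, the simplest form of similarity search compares the query vector with every stored vector and keeps the closest ones. The pure-Python sketch below shows that exhaustive search with squared L2 distance. The two-dimensional vectors and the names `squared_l2` and `nearest` are made up for illustration, and FAISS itself is far faster and offers approximate indexes as well.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, stored, k=2):
    """Return the ids of the k stored vectors closest to the query (exhaustive search)."""
    return heapq.nsmallest(k, stored, key=lambda vec_id: squared_l2(query, stored[vec_id]))

# Hypothetical 2-dimensional embeddings keyed by chunk id
stored = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
}

top_ids = nearest([1.0, 0.0], stored, k=2)
print(top_ids)
```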

For production use cases requiring robust database integration, consider implementing semantic search with pgvector in PostgreSQL or using Pinecone for cloud-based vector search as alternatives to FAISS.

Let’s install the additional packages for RAG functionality:

# Install additional packages for RAG functionality
pip install docling sentence-transformers langchain-community langchain-huggingface faiss-cpu
# Note: Use faiss-gpu if you have CUDA support

Document Processing

Convert the document into structured data using Docling.

from docling.document_converter import DocumentConverter

# Initialize converter with default settings
converter = DocumentConverter()

# Convert the document into structured data
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)

# Access structured data immediately
doc = result.document

Chunking

AI models have limited context windows that can’t process entire documents at once. Chunking solves this by breaking documents into smaller, searchable pieces that fit within these constraints. This improves retrieval accuracy by finding the most relevant sections rather than entire documents.
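The basic idea can be sketched without any library: slide a fixed-size window over the text and let neighbouring windows share a few words. This is a deliberately naive illustration. It counts whitespace-separated words instead of model tokens, and `window_chunks` is a hypothetical helper, not a Docling API.

```python
def window_chunks(words, max_words=8, overlap=2):
    """Split a list of words into windows of at most max_words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

text = (
    "Docling implements a linear pipeline of operations, "
    "which execute sequentially on each given document"
)
pieces = window_chunks(text.split(), max_words=8, overlap=2)
for piece in pieces:
    print(piece)
```

Fixed windows ignore headings, lists, and tables, which is the gap that structure-aware chunkers aim to close.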

Docling provides two main chunking strategies:

  • HierarchicalChunker: Focuses purely on document structure, creating chunks based on headings and sections
  • HybridChunker: Combines structure-aware chunking with token-based limits, preserving document hierarchy while respecting model constraints

Let’s compare how these chunkers process the same document.

First, create a helper function to print the chunk content:

def print_chunk(chunk):
    print(f"Chunk length: {len(chunk.text)} characters")
    if len(chunk.text) > 30:
        print(f"Chunk content: {chunk.text[:30]}...{chunk.text[-30:]}")
    else:
        print(f"Chunk content: {chunk.text}")
    print("-" * 50)

Next, process the document with the HierarchicalChunker:

from docling.chunking import HierarchicalChunker

# Process with HierarchicalChunker (structure-based)
hierarchical_chunker = HierarchicalChunker()
hierarchical_chunks = list(hierarchical_chunker.chunk(doc))

print(f"HierarchicalChunker: {len(hierarchical_chunks)} chunks")

# Print the first 5 chunks
for chunk in hierarchical_chunks[:5]:
    print_chunk(chunk)
HierarchicalChunker: 114 chunks
Chunk length: 11 characters
Chunk content: Version 1.0
--------------------------------------------------
Chunk length: 295 characters
Chunk content: Christoph Auer Maksym Lysak Ah... Kuropiatnyk Peter W. J. Staar
--------------------------------------------------
Chunk length: 50 characters
Chunk content: AI4K Group, IBM Research R¨ us...arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc...on of new features and models.
--------------------------------------------------
Chunk length: 792 characters
Chunk content: Converting PDF documents back ... gap to proprietary solutions.
--------------------------------------------------

Compare this to the HybridChunker:

from docling.chunking import HybridChunker

# Process with HybridChunker (token-aware)
hybrid_chunker = HybridChunker(max_tokens=512, overlap_tokens=50)
hybrid_chunks = list(hybrid_chunker.chunk(doc))

print(f"HybridChunker: {len(hybrid_chunks)} chunks")

# Print the first 5 chunks
for chunk in hybrid_chunks[:5]:
    print_chunk(chunk)
HybridChunker: 50 chunks
Chunk length: 358 characters
Chunk content: Version 1.0
Christoph Auer Mak...arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc...on of new features and models.
--------------------------------------------------
Chunk length: 1858 characters
Chunk content: Converting PDF documents back ... accelerators (GPU, MPS, etc).
--------------------------------------------------
Chunk length: 1436 characters
Chunk content: To use Docling, you can simply...and run it inside a container.
--------------------------------------------------
Chunk length: 796 characters
Chunk content: Docling implements a linear pi...erialized to JSON or Markdown.
--------------------------------------------------

The comparison shows key differences:

  • HierarchicalChunker: Creates many small chunks, roughly one per document element, such as a single paragraph or a version line
  • HybridChunker: Creates fewer, larger chunks by combining related sections within token limits
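The difference can be pictured as a merge step: start from small element-level pieces and greedily combine neighbours while they still fit a size budget. The sketch below is a simplification for intuition only, not HybridChunker's actual algorithm. It measures characters rather than tokens, and both `merge_small_chunks` and the sample pieces are hypothetical.

```python
def merge_small_chunks(pieces, max_chars=60):
    """Greedily merge consecutive pieces while the merged text stays within max_chars."""
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current}\n{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The budget would be exceeded, so close the current chunk and start a new one
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

# Hypothetical element-level pieces, similar in spirit to the small chunks above
pieces = [
    "Version 1.0",
    "AI4K Group, IBM Research",
    "Abstract text that is already fairly long on its own.",
]
merged = merge_small_chunks(pieces, max_chars=60)
print(len(merged), "chunks after merging")
```

Note that this sketch never splits a piece that is already longer than the budget, whereas a token-aware chunker would also need to handle that case.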

We will use HybridChunker because it respects document boundaries (won’t split tables inappropriately) while ensuring chunks fit within embedding model constraints.

from docling.chunking import HybridChunker

# Initialize the chunker
chunker = HybridChunker(max_tokens=512, overlap_tokens=50)

# Create the chunks
rag_chunks = list(chunker.chunk(doc))

print(f"Created {len(rag_chunks)} intelligent chunks")
Created 50 intelligent chunks

Creating a Vector Store

A vector store is a database that stores text as numerical vectors called embeddings, which an embedding model produces. These vectors capture semantic meaning, allowing the system to find related content even when different words are used.

When you search for “document processing,” the vector store finds chunks about “PDF parsing” or “text extraction” because their embeddings are mathematically close. This enables semantic search beyond exact keyword matching.
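"Mathematically close" usually means a high cosine similarity between two embedding vectors. The sketch below computes it by hand for three made-up, three-dimensional vectors. Real embeddings have hundreds of dimensions, and these numbers are invented purely to show the calculation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors; a real embedding model would produce these from the text
vectors = {
    "document processing": [0.9, 0.1, 0.0],
    "PDF parsing": [0.8, 0.2, 0.1],
    "rubber duck": [0.0, 0.1, 0.9],
}

query_vec = vectors["document processing"]
sim_pdf = cosine_similarity(query_vec, vectors["PDF parsing"])
sim_duck = cosine_similarity(query_vec, vectors["rubber duck"])

print(f"PDF parsing: {sim_pdf:.3f}")
print(f"rubber duck: {sim_duck:.3f}")
```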

Create the vector store for semantic search across your document chunks:

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create the vector store
texts = [chunk.text for chunk in rag_chunks]
vectorstore = FAISS.from_texts(texts, embeddings)

print(f"Built vector store with {len(texts)} chunks")
Built vector store with 50 chunks

Now you can search your knowledge base with semantic similarity:

# Search the knowledge base
query = "How does document processing work?"
relevant_docs = vectorstore.similarity_search(query, k=3)

print(f"Query: '{query}'")
print(f"Found {len(relevant_docs)} relevant chunks:")

for i, match in enumerate(relevant_docs, 1):  # Avoid shadowing the Docling `doc`
    print(f"\nResult {i}:")
    print(f"Content: {match.page_content[:150]}...")
Query: 'How does document processing work?'
Found 3 relevant chunks:

Result 1:
Content: Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a...

Result 2:
Content: In the final pipeline stage, Docling assembles all prediction results produced on each page into a well-defined datatype that encapsulates a converted...

Result 3:
Content: Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, suc...

The search results show effective semantic retrieval. The vector store found relevant chunks about Docling’s architecture and design when searching for “document processing” – demonstrating how RAG systems match meaning, not just keywords.
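The final step of a RAG pipeline passes the retrieved text to a language model. A minimal sketch of assembling that prompt is shown here. `build_prompt` and the character budget are assumptions for illustration, and the two passages are shortened from the search results above. In a real pipeline they would come from the `page_content` of each retrieved result.

```python
def build_prompt(question, passages, max_chars=500):
    """Concatenate numbered passages until the character budget is used up."""
    context_parts, used = [], 0
    for i, passage in enumerate(passages, 1):
        entry = f"[{i}] {passage}"
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# Passages shortened from the retrieved chunks shown above
passages = [
    "Docling implements a linear pipeline of operations, which execute "
    "sequentially on each given document (see Fig. 1).",
    "Docling is designed to allow easy extension of the model library and pipelines.",
]

prompt = build_prompt("How does document processing work?", passages)
print(prompt)
```

The resulting string can be sent to any chat or completion model. Numbering the passages makes it easy to ask the model to cite which one it used.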

Conclusion

This tutorial demonstrated building a robust document processing pipeline that handles complex, real-world documents. Your pipeline preserves critical elements like tables, mathematical formulas, and document structure while generating semantically meaningful chunks for retrieval-augmented generation systems.

The ability to transform documents into AI-ready data with minimal code, at no cost, is a significant advance in document processing workflows. For enhanced reasoning capabilities in your RAG workflows, explore our guide on building data science workflows with DeepSeek and LangChain, which combines advanced language models with document processing pipelines.
