Table of Contents
- Setting Up Your Document Processing Pipeline
- Quick Start: Your First Document Conversion
- Export Options for Different Use Cases
- Configuring PdfPipelineOptions for Advanced Processing
- Building Your RAG Pipeline
- Conclusion
What if complex research papers could be transformed into AI-searchable data using fewer than 10 lines of Python?
Financial reports, research documents, and analytical papers often contain vital tables and formulas that traditional PDF tools fail to extract properly. This results in the loss of structured data that could inform key decisions.
Docling, developed by IBM Research, is an AI-first document processing tool that preserves the relationships between text, tables, and formulas. With just three lines of code, you can convert any document into structured data.
In this tutorial, you’ll learn how to build a complete pipeline that takes documents in any format and turns them into high-quality RAG-ready chunks for AI applications.
Setting Up Your Document Processing Pipeline
What is Docling?
Docling is an AI-first document processing tool developed by IBM Research. It transforms complex documents—like PDFs, Excel spreadsheets, and Word files—into structured data while preserving their original structure, including text, tables, and formulas.
To install Docling, run the following command:
pip install docling
What is RAG?
RAG (Retrieval-Augmented Generation) is an AI technique that combines document retrieval with language generation. Instead of relying solely on training data, RAG systems search through external documents to find relevant information, then use that context to generate accurate, up-to-date responses.
This process requires converting documents into structured, searchable chunks. Docling handles this conversion seamlessly.
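As a rough sketch of that flow (the retriever and llm objects here are placeholders, not a real API; the rest of this tutorial builds each piece concretely):
def rag_answer(question, retriever, llm):
    # 1. Retrieve the chunks most relevant to the question (placeholder retriever)
    chunks = retriever.search(question, k=3)
    context = "\n\n".join(chunk.text for chunk in chunks)
    # 2. Generate an answer grounded in that context (placeholder llm callable)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")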
Quick Start: Your First Document Conversion
Docling transforms any document into structured data with just three lines of code. Let’s see this in action by converting a PDF document – specifically, Docling’s own technical report from arXiv. This is a good example because it contains a lot of different types of elements, including tables, formulas, and text.
from docling.document_converter import DocumentConverter
import pandas as pd
# Initialize converter with default settings
converter = DocumentConverter()
# Convert any document format - we'll use the Docling technical report itself
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)
# Access structured data immediately
doc = result.document
print(f"Successfully processed document from: {source_url}")
To iterate through each document element, we will use the doc.iterate_items() method. This method returns tuples of (item, level). For example:
- (TextItem(label='paragraph', text='Introduction text...'), 0) – top-level paragraph
- (TableItem(label='table', text='| Col1 | Col2 |...'), 1) – table at depth 1
- (TextItem(label='heading', text='Section 2'), 0) – section heading
from collections import defaultdict
# Create a dictionary to categorize all document elements by type
element_types = defaultdict(list)
# Iterate through all document elements and group them by label
for item, _ in doc.iterate_items():
    element_type = item.label
    element_types[element_type].append(item)

# Display the breakdown of document structure
print("Document structure breakdown:")
for element_type, items in element_types.items():
    print(f"  {element_type}: {len(items)} elements")
The output shows the different types of elements Docling extracted from the document.
Document structure breakdown:
picture: 13 elements
section_header: 31 elements
text: 102 elements
list_item: 22 elements
code: 2 elements
footnote: 1 elements
caption: 3 elements
table: 5 elements
Let’s look specifically for structured elements like tables and formulas that are crucial for RAG applications:
first_table = element_types["table"][0]
print(first_table.export_to_dataframe().to_markdown())
|    | CPU.                        | Thread budget. | native backend.TTS | native backend.Pages/s | native backend.Mem | pypdfium backend.TTS | pypdfium backend.Pages/s | pypdfium backend.Mem |
|---:|:----------------------------|:---------------|:-------------------|:-----------------------|:-------------------|:---------------------|:-------------------------|:---------------------|
|  0 | Apple M3 Max                | 4              | 177 s 167 s        | 1.27 1.34              | 6.20 GB            | 103 s 92 s           | 2.18 2.45                | 2.56 GB              |
|  1 | (16 cores) Intel(R) E5-2690 | 16 4 16        | 375 s 244 s        | 0.60 0.92              | 6.16 GB            | 239 s 143 s          | 0.94 1.57                | 2.42 GB              |
Here is how the table looks in the original PDF:
The extracted table shows both Docling’s accuracy and where its output differs structurally from the original PDF: all numerical data and text were captured, but the merged cell structure was flattened into separate columns.
While this loses visual formatting, it benefits RAG applications since each row contains complete information without complex cell merging logic.
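Because each flattened row stands on its own, one practical option is to collect every table as Markdown text for later chunking. This is a small sketch reusing the export_to_dataframe() call shown above (to_markdown() requires the tabulate package):
# Collect every extracted table as Markdown text for later indexing
table_markdowns = [
    table.export_to_dataframe().to_markdown()
    for table in element_types["table"]
]
print(f"Collected {len(table_markdowns)} tables as Markdown")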
Next, look at the first list item element:
first_list_items = element_types["list_item"][0:6]
for list_item in first_list_items:
    print(list_item.text)
· Converts PDF documents to JSON or Markdown format, stable and lightning fast
· Understands detailed page layout, reading order, locates figures and recovers table structures
· Extracts metadata from the document, such as title, authors, references and language
· Optionally applies OCR, e.g. for scanned PDFs
· Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)
· Can leverage different accelerators (GPU, MPS, etc).
These match the list items in the original PDF.
Look at the first caption element:
first_caption = element_types["caption"][0]
print(first_caption.text)
This matches the image caption in the original PDF.
Export Options for Different Use Cases
Docling provides multiple ways to export the document data, including Markdown, JSON, and dictionary formats.
For human review and documentation, Markdown format preserves the document structure beautifully.
# Human-readable markdown for review
markdown_content = doc.export_to_markdown()
print(markdown_content[:500] + "...")
<!-- image -->
## Docling Technical Report
Version 1.0
Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
AI4K Group, IBM Research R¨ uschlikon, Switzerland
## Abstract
This technical report introduces Docling , an easy to use, self-contained, MITli...
Compare this to the original PDF:
Docling preserves all original content while converting complex PDF formatting into clean markdown. Every author name, title, and abstract text remains intact, creating searchable structure perfect for RAG applications.
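If you want to keep this export for review, you can simply write it to disk (the file name below is arbitrary):
from pathlib import Path

# Save the Markdown export for later review or indexing
Path("docling_report.md").write_text(markdown_content, encoding="utf-8")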
For programmatic processing and API integrations, JSON format provides structured access to all document elements:
import json
# JSON for programmatic processing
json_dict = doc.export_to_dict()
print('JSON keys:', json_dict.keys())
JSON keys: dict_keys(['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages'])
The JSON structure reveals Docling’s comprehensive document analysis. Key sections include texts for paragraphs, tables for structured data, pictures for images, and pages for layout information.
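To hand this structure to another service, or to avoid re-running the conversion, you can dump the dictionary to a JSON file (again, the file name is arbitrary):
import json

# Persist the full document structure as JSON
with open("docling_report.json", "w", encoding="utf-8") as f:
    json.dump(json_dict, f, ensure_ascii=False)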
For Python development workflows, the dictionary format enables immediate access to all document elements.
# Python dictionary for immediate use
dict_repr = doc.export_to_dict()
# Preview the structure
num_texts = len(dict_repr['texts'])
num_tables = len(dict_repr['tables'])
print(f"Text elements: {num_texts}")
print(f"Table elements: {num_tables}")
Text elements: 985
Table elements: 5
Configuring PdfPipelineOptions for Advanced Processing
The default Docling configuration works well for most documents, but PdfPipelineOptions unlocks advanced processing capabilities. These options control OCR engines, table recognition, AI enrichments, and performance settings.
PdfPipelineOptions becomes essential when working with scanned documents, complex layouts, or specialized content requiring AI-powered understanding.
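For example, scanned PDFs usually need OCR. The snippet below is a minimal sketch assuming the EasyOCR engine that ships with Docling; treat the exact options as a starting point rather than a fixed recipe:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Sketch: force OCR on for scanned PDFs using the bundled EasyOCR engine
ocr_pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=EasyOcrOptions())
ocr_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=ocr_pipeline_options)}
)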
Enable Image Extraction
By default, Docling does not extract images from the document. However, you can enable image extraction by setting the generate_picture_images option to True.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import PdfFormatOption

pipeline_options = PdfPipelineOptions(generate_picture_images=True)

# Create converter with image extraction enabled
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document
Display the first image:
# Extract and display the first image
from IPython.display import Image, display
for item, _ in doc_enhanced.iterate_items():
    if item.label == "picture":
        image_data = item.image
        # Get the image URI
        uri = str(image_data.uri)
        # Display the image using IPython
        display(Image(url=uri))
        break
The output image matches the first image of the PDF.
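If you also want the extracted pictures on disk, one approach is to decode the data URI yourself. This sketch assumes the URI is a base64-encoded PNG data URI, which is what generate_picture_images produced above:
import base64

# Sketch: save the first embedded picture by decoding its base64 data URI
for item, _ in doc_enhanced.iterate_items():
    if item.label == "picture":
        uri = str(item.image.uri)
        _, encoded = uri.split(",", 1)  # drop the "data:image/png;base64," prefix
        with open("first_picture.png", "wb") as f:
            f.write(base64.b64decode(encoded))
        break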
Table Recognition Enhancement
To use the more sophisticated AI model for table extraction instead of the default fast model, you can set the table_structure_options.mode to TableFormerMode.ACCURATE.
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
# Enhanced table processing for complex layouts
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
# Create converter with enhanced table processing
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document
AI-Powered Content Understanding
AI enrichments enhance extracted content with semantic understanding. Picture descriptions, formula detection, and code parsing improve RAG accuracy by adding crucial context.
In the code below, we:
- Set the do_picture_description option to True to enable picture description extraction
- Set the picture_description_options option to use the SmolVLM-256M-Instruct model from Hugging Face
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions
# AI-powered content enrichment
pipeline_options = PdfPipelineOptions(
    do_picture_description=True,  # AI-generated image descriptions
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture. Be precise and concise.",
    ),
    generate_picture_images=True,
)

converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document
Extract the picture description from the second picture:
second_picture = doc_enhanced.pictures[1]
print(f"Caption: {second_picture.caption_text(doc=doc_enhanced)}")
# Check for annotations
for annotation in second_picture.annotations:
    print(annotation.text)
Caption: Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
### Image Description
The image is a flowchart that depicts a sequence of steps from a document, likely a report or a document. The flowchart is structured with various elements such as text, icons, and arrows. Here is a detailed description of the flowchart:
#### Step 1: Parse
- **Description:** The first step in the process is to parse the document. This involves converting the text into a format that can be easily understood by the user.
#### Step 2: Ocr
- **Description:** The second step is to perform OCR (Optical Character Recognition) on the document. This involves converting the text into a format that can be easily read by the OCR software.
#### Step 3: Layout Analysis
- **Description:** The third step is to analyze the document's layout. This involves examining the document's structure, including the layout of the text, the alignment of the text, and the alignment of the document's content
Here is the original image:
The detailed description shows how Docling’s picture analysis transforms visual content into text that can be indexed and searched, making diagrams accessible to RAG systems.
Performance and Memory Management
Processing a large document can be time-consuming. To speed up the process, we can:
- Pass the max_num_pages and page_range arguments to convert() to limit which pages are processed.
- Lower the images_scale option to reduce image resolution.
- Set the generate_page_images option to False to skip page images and save memory.
- Set the do_table_structure option to False to skip table structure extraction.
- Tune Docling’s accelerator settings (thread count, GPU) for the available hardware.
# Optimized for large documents
pipeline_options = PdfPipelineOptions(
    generate_page_images=False,  # Skip page images to save memory
    do_table_structure=False,    # Skip table structure extraction
)
converter_fast = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# Page limits are arguments of convert(), not of PdfPipelineOptions
result_fast = converter_fast.convert(source_url, max_num_pages=4, page_range=(1, 3))
Building Your RAG Pipeline
We’ll build our RAG pipeline in five steps:
- Document Processing: Use Docling to convert documents into structured data
- Chunking: Break documents into smaller, searchable pieces
- Create Embeddings: Convert text chunks into vector representations
- Store in Vector Database: Save embeddings in FAISS for fast similarity search
- Query: Retrieve relevant chunks and generate contextual responses
Tools for RAG Pipelines
Building RAG pipelines requires three essential tools:
- Docling: converts documents into structured data
- LangChain: manages document workflows, chain orchestration, and provides embedding models
- FAISS: stores and retrieves document chunks
These tools work together to create complete RAG pipelines that can process, store, and retrieve document content intelligently.
LangChain
LangChain simplifies building AI applications by providing components for document loading, text processing, and chain orchestration. It integrates seamlessly with vector stores and language models.
For a comprehensive introduction to LangChain fundamentals and local AI workflows, see our LangChain and Ollama guide.
FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search in high-dimensional spaces. It enables fast retrieval of the most relevant document chunks based on embedding similarity.
For production use cases requiring robust database integration, consider implementing semantic search with pgvector in PostgreSQL or using Pinecone for cloud-based vector search as alternatives to FAISS.
Let’s install the additional packages for RAG functionality:
# Install additional packages for RAG functionality
pip install docling sentence-transformers langchain-community langchain-huggingface faiss-cpu
# Note: Use faiss-gpu if you have CUDA support
Document Processing
Convert the document into structured data using Docling.
from docling.document_converter import DocumentConverter
# Initialize converter with default settings
converter = DocumentConverter()
# Convert the document into structured data
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)
# Access structured data immediately
doc = result.document
Chunking
AI models have limited context windows that can’t process entire documents at once. Chunking solves this by breaking documents into smaller, searchable pieces that fit within these constraints. This improves retrieval accuracy by finding the most relevant sections rather than entire documents.
Docling provides two main chunking strategies:
- HierarchicalChunker: Focuses purely on document structure, creating chunks based on headings and sections
- HybridChunker: Combines structure-aware chunking with token-based limits, preserving document hierarchy while respecting model constraints
Let’s compare how these chunkers process the same document.
First, create a helper function to print the chunk content:
def print_chunk(chunk):
    print(f"Chunk length: {len(chunk.text)} characters")
    if len(chunk.text) > 30:
        print(f"Chunk content: {chunk.text[:30]}...{chunk.text[-30:]}")
    else:
        print(f"Chunk content: {chunk.text}")
    print("-" * 50)
Next, process the document with the HierarchicalChunker:
from docling.chunking import HierarchicalChunker
# Process with HierarchicalChunker (structure-based)
hierarchical_chunker = HierarchicalChunker()
hierarchical_chunks = list(hierarchical_chunker.chunk(doc))
print(f"HierarchicalChunker: {len(hierarchical_chunks)} chunks")
# Print the first 5 chunks
for chunk in hierarchical_chunks[:5]:
    print_chunk(chunk)
HierarchicalChunker: 114 chunks
Chunk length: 11 characters
Chunk content: Version 1.0
--------------------------------------------------
Chunk length: 295 characters
Chunk content: Christoph Auer Maksym Lysak Ah... Kuropiatnyk Peter W. J. Staar
--------------------------------------------------
Chunk length: 50 characters
Chunk content: AI4K Group, IBM Research R¨ us...arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc...on of new features and models.
--------------------------------------------------
Chunk length: 792 characters
Chunk content: Converting PDF documents back ... gap to proprietary solutions.
--------------------------------------------------
Compare this to the HybridChunker:
from docling.chunking import HybridChunker
# Process with HybridChunker (token-aware)
hybrid_chunker = HybridChunker(max_tokens=512, overlap_tokens=50)
hybrid_chunks = list(hybrid_chunker.chunk(doc))
print(f"HybridChunker: {len(hybrid_chunks)} chunks")
# Print the first 5 chunks
for chunk in hybrid_chunks[:5]:
    print_chunk(chunk)
HybridChunker: 50 chunks
Chunk length: 358 characters
Chunk content: Version 1.0
Christoph Auer Mak...arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc...on of new features and models.
--------------------------------------------------
Chunk length: 1858 characters
Chunk content: Converting PDF documents back ... accelerators (GPU, MPS, etc).
--------------------------------------------------
Chunk length: 1436 characters
Chunk content: To use Docling, you can simply...and run it inside a container.
--------------------------------------------------
Chunk length: 796 characters
Chunk content: Docling implements a linear pi...erialized to JSON or Markdown.
--------------------------------------------------
The comparison shows key differences:
- HierarchicalChunker: Creates many small chunks by splitting at every section boundary
- HybridChunker: Creates fewer, larger chunks by combining related sections within token limits
We will use HybridChunker because it respects document boundaries (it won’t split tables inappropriately) while ensuring chunks fit within embedding model constraints.
from docling.chunking import HybridChunker
# Initialize the chunker
chunker = HybridChunker(max_tokens=512, overlap_tokens=50)
# Create the chunks
rag_chunks = list(chunker.chunk(doc))
print(f"Created {len(rag_chunks)} intelligent chunks")
Created 50 intelligent chunks
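One optional refinement, sketched below and not required for the rest of the tutorial: HybridChunker can be pointed at the Hugging Face model id of the embedding model used in the next section, so its token counts match what that model actually sees.
from docling.chunking import HybridChunker

# Sketch: count tokens with the same model used for embeddings below
aligned_chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
)
aligned_chunks = list(aligned_chunker.chunk(doc))
print(f"Aligned chunker produced {len(aligned_chunks)} chunks")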
Creating a Vector Store
A vector store is a database that converts text into numerical vectors called embeddings. These vectors capture semantic meaning, allowing the system to find related content even when different words are used.
When you search for “document processing,” the vector store finds chunks about “PDF parsing” or “text extraction” because their embeddings are mathematically close. This enables semantic search beyond exact keyword matching.
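To make “mathematically close” concrete, here is a small illustrative check, assuming the same all-MiniLM-L6-v2 model used in the next step: related phrases score a noticeably higher cosine similarity than unrelated ones.
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

# Compare embedding similarity for related vs. unrelated phrases
emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = emb.embed_query("document processing")
b = emb.embed_query("PDF parsing")
c = emb.embed_query("chocolate cake recipe")
print(f"related:   {cosine(a, b):.3f}")
print(f"unrelated: {cosine(a, c):.3f}")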
Create the vector store for semantic search across your document chunks:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Create the vector store
texts = [chunk.text for chunk in rag_chunks]
vectorstore = FAISS.from_texts(texts, embeddings)
print(f"Built vector store with {len(texts)} chunks")
Built vector store with 50 chunks
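Optionally, the index can be written to disk so you don’t re-embed the document on every run. save_local and load_local are LangChain’s FAISS helpers; recent versions require the allow_dangerous_deserialization flag when reloading a locally saved index:
# Persist the FAISS index and reload it later instead of re-embedding
vectorstore.save_local("docling_faiss_index")
reloaded_store = FAISS.load_local(
    "docling_faiss_index", embeddings, allow_dangerous_deserialization=True
)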
Now you can search your knowledge base with semantic similarity:
# Search the knowledge base
query = "How does document processing work?"
relevant_docs = vectorstore.similarity_search(query, k=3)
print(f"Query: '{query}'")
print(f"Found {len(relevant_docs)} relevant chunks:")
for i, doc in enumerate(relevant_docs, 1):
    print(f"\nResult {i}:")
    print(f"Content: {doc.page_content[:150]}...")
Query: 'How does document processing work?'
Found 3 relevant chunks:
Result 1:
Content: Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a...
Result 2:
Content: In the final pipeline stage, Docling assembles all prediction results produced on each page into a well-defined datatype that encapsulates a converted...
Result 3:
Content: Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, suc...
The search results show effective semantic retrieval. The vector store found relevant chunks about Docling’s architecture and design when searching for “document processing” – demonstrating how RAG systems match meaning, not just keywords.
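To close the loop on step five of the plan, the retrieved chunks just need to be packed into a prompt for a language model. The sketch below leaves the model call as a placeholder (call_llm is hypothetical); swap in whichever LLM client you prefer, such as OpenAI or a local Ollama model.
# Sketch of the generation step: retrieved chunks become the LLM's context.
# call_llm is a placeholder for your preferred LLM client.
def answer_question(query: str, k: int = 3) -> str:
    docs = vectorstore.similarity_search(query, k=k)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return call_llm(prompt)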
Conclusion
This tutorial demonstrated building a robust document processing pipeline that handles complex, real-world documents. Your pipeline preserves critical elements like tables, mathematical formulas, and document structure while generating semantically meaningful chunks for retrieval-augmented generation systems.
The capability to transform any document format into AI-ready data using minimal code—at no cost—represents a significant advancement in document processing workflows. For enhanced reasoning capabilities in your RAG workflows, explore our guide on building data science workflows with DeepSeek and LangChain which combines advanced language models with document processing pipelines.