PDF Table Extraction: Docling vs Marker vs LlamaParse Compared

Table of Contents

Introduction
The Test Document
Docling: TableFormer Deep Learning
Marker: Vision Transformer Pipeline
LlamaParse: LLM-Guided Extraction
Summary
Try It Yourself

Introduction
Have you ever copied a table from a PDF into a spreadsheet, only to find the formatting completely broken? Cells shift, values land in the wrong columns, and merged headers lose their structure.
This happens because PDFs do not store tables as structured data. They simply place text at specific coordinates on a page.
For example, a table that looks like this on screen:
┌───────┬───────┐
│ Name  │ Score │
├───────┼───────┤
│ Alice │ 92    │
│ Bob   │ 85    │
└───────┴───────┘

is stored in the PDF as a flat list of positioned text:
"Name" at (x=72, y=710)
"Score" at (x=200, y=710)
"Alice" at (x=72, y=690)
"92" at (x=200, y=690)
"Bob" at (x=72, y=670)
"85" at (x=200, y=670)

A table extraction tool must analyze those positions, determine which text belongs in each cell, and rebuild the table structure.
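To make that reconstruction step concrete, here is a toy sketch (not any real extractor's API) that rebuilds the example table from positioned fragments, assuming cells in the same row share an exact y-coordinate:

```python
from collections import defaultdict

# Text fragments as a PDF stores them: (text, x, y)
fragments = [
    ("Name", 72, 710), ("Score", 200, 710),
    ("Alice", 72, 690), ("92", 200, 690),
    ("Bob", 72, 670), ("85", 200, 670),
]

# Group fragments into rows by y-coordinate, then sort each row by x
rows = defaultdict(list)
for text, x, y in fragments:
    rows[y].append((x, text))

table = [
    [text for x, text in sorted(cells)]
    for y, cells in sorted(rows.items(), reverse=True)  # top of page first
]
print(table)  # [['Name', 'Score'], ['Alice', '92'], ['Bob', '85']]
```

Real tools must also handle fragments whose coordinates only roughly line up, which is where the deep learning models below come in.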
The challenge becomes even greater with multi-level headers, merged cells, or tables that span multiple pages. Many tools struggle with at least one of these scenarios.
While doing research, I came across three Python tools for extracting tables from PDFs: Docling, Marker, and LlamaParse. To compare them fairly, I ran each tool on the same PDF and evaluated the results.
In this article, I’ll walk through what I found and help you decide which tool may work best for your needs.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


The Test Document
All examples use the same PDF: the Docling Technical Report from arXiv. This paper contains tables with the features that make extraction difficult:

Multi-level headers with sub-columns
Merged cells spanning multiple rows
Numeric data that is easy to misalign

source = "https://arxiv.org/pdf/2408.09869"

Some tools require a local file path instead of a URL, so let’s download the PDF first:
import urllib.request

# Download PDF locally (used by Marker later)
local_pdf = "docling_report.pdf"
urllib.request.urlretrieve(source, local_pdf)

Docling: TableFormer Deep Learning
Docling is IBM’s open-source document converter built specifically for structured extraction. Its table pipeline works in two steps:

Detect table regions using a layout analysis model that finds tables, text, and figures on each page
Reconstruct cell structure using TableFormer, a deep learning model that maps each cell to its row and column position

Here is what that looks like in practice:
PDF page with mixed content
┌─────────────────────┐
│ Text paragraph…     │
│ Name   Score        │
│ Alice  92           │
│ Bob    85           │
│ (figure)            │
└─────────────────────┘


Step 1: Layout model detects table region
┌─────────────────────┐
│ ┌─────────────────┐ │
│ │ Name   Score    │ │
│ │ Alice  92       │ │
│ │ Bob    85       │ │
│ └─────────────────┘ │
└─────────────────────┘


Step 2: TableFormer maps cells to rows and columns
┌───────┬───────┐
│ Name  │ Score │
├───────┼───────┤
│ Alice │ 92    │
│ Bob   │ 85    │
└───────┴───────┘

The result is a pandas DataFrame for each table, ready for analysis.

For Docling’s full document processing capabilities beyond tables, including chunking and RAG integration, see Transform Any PDF into Searchable AI Data with Docling.

To install Docling, run:
pip install docling

This article uses docling v2.63.0.
Table Extraction
To extract tables from the PDF, we need to first convert it to a Docling document using DocumentConverter:
from docling.document_converter import DocumentConverter

# Convert PDF
converter = DocumentConverter()
result = converter.convert(source)

Once we have the Docling document, we can loop through all detected tables and export each one as a pandas DataFrame:
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe(doc=result.document)
    print(f"Table {i + 1}: {df.shape[0]} rows × {df.shape[1]} columns")

Table 1: 2 rows × 8 columns
Table 2: 1 rows × 5 columns
Table 3: 0 rows × 0 columns

The PDF contains 5 tables, but Docling only detected 3.
Table 3 returned 0 rows. This means the layout model flagged it as a table but TableFormer couldn’t extract any structure from it.
Let’s look at the first table. Here’s the original from the PDF:

And here’s what Docling extracted:
# Export the first table as a DataFrame
table_1 = result.document.tables[0]
df_1 = table_1.export_to_dataframe(doc=result.document)
df_1

| CPU | Thread budget | native TTS | native Pages/s | native Mem | pypdfium TTS | pypdfium Pages/s | pypdfium Mem |
|---|---|---|---|---|---|---|---|
| Apple M3 Max (16 cores) | 4 16 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
| Intel(R) Xeon E5-2690 | 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |

Notice how Docling handles this complex table:

Docling smartly handled the multi-level header by flattening it into separate columns (“native backend” → “native TTS”, “native Pages/s”, “native Mem”).
However, it merged each CPU’s two thread-budget rows into one, packing values like “177 s 167 s” into single cells.
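If Docling's packed cells are otherwise acceptable for your pipeline, they can often be recovered after the fact. A hypothetical post-processing sketch using the values from the first row (this is not part of Docling's API):

```python
import re

# Packed cells as Docling extracted them for the Apple M3 Max row
packed = {"Thread budget": "4 16", "TTS": "177 s 167 s", "Pages/s": "1.27 1.34"}

# Split each packed cell back into its two per-budget values
budgets = packed["Thread budget"].split()      # ['4', '16']
tts = re.findall(r"\d+\s*s", packed["TTS"])    # ['177 s', '167 s']
pages = packed["Pages/s"].split()              # ['1.27', '1.34']

rows = list(zip(budgets, tts, pages))
print(rows)  # [('4', '177 s', '1.27'), ('16', '167 s', '1.34')]
```

This works only when you know how many values each cell packs, which is exactly the fragility that makes per-cell extraction preferable.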

Now the second table. Here’s the original from the PDF:

And here’s what Docling extracted:
# Export the second table as a DataFrame
table_2 = result.document.tables[1]
df_2 = table_2.export_to_dataframe(doc=result.document)
df_2

| | human | MRCNN R50 R101 | FRCNN R101 | YOLO v5x6 |
|---|---|---|---|---|
| 0 | Caption Footnote Formula List-item Page-footer… | 84-89 83-91 83-85 87-88 93-94 85-89 69-71 83-8… | 68.4 71.5 70.9 71.8 60.1 63.4 81.2 80.8 61.6 5… | 70.1 73.7 63.5 81.0 58.9 72.0 72.0 68.4 82.2 8… |

We can see that Docling did not handle this table as well as the first one:

Docling merged the MRCNN sub-columns (R50, R101) into a single “MRCNN R50 R101” column instead of two separate ones.
All 13 rows were collapsed into one, concatenating values like “68.4 71.5 70.9…” into a single cell.

Complex tables with multi-level headers and merged cells remain a challenge for Docling’s table extraction.
Performance
Docling took about 28 seconds for the full 6-page PDF on an Apple M1 (16 GB RAM), thanks to its lightweight two-stage pipeline.
Marker: Vision Transformer Pipeline
Marker is an open-source PDF-to-Markdown converter built on the Surya layout engine. Unlike Docling’s two-stage pipeline, Marker runs five stages for table extraction:

Layout detection: a Vision Transformer identifies table regions on each page
OCR error detection: flags misrecognized text
Bounding box detection: locates individual cell boundaries
Table recognition: reconstructs row/column structure from detected cells
Text recognition: extracts text from all detected regions

Here is how the five stages work together:
PDF page
┌─────────────────────┐
│ Text paragraph…     │
│ Name   Score        │
│ Alice  92           │
│ Bob    85           │
└─────────────────────┘


1. Layout detection → finds [TABLE] region
2. OCR error detection → fixes misread text


3. Bounding box detection
┌──────────────────┐
│ [Name]  [Score]  │
│ [Alice] [92]     │
│ [Bob]   [85]     │
└──────────────────┘


4. Table recognition → maps cells to rows/columns
5. Text recognition → extracts final text


| Name  | Score |
|-------|-------|
| Alice | 92    |
| Bob   | 85    |

To install Marker, run:
pip install marker-pdf

Table Extraction
Marker provides a dedicated TableConverter that extracts only tables from a document, returning them as Markdown:
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

models = create_model_dict()
converter = TableConverter(artifact_dict=models)
rendered = converter(local_pdf)
table_md, _, images = text_from_rendered(rendered)

Since TableConverter returns all tables as a single Markdown string, we split them on blank lines:
tables = table_md.strip().split("\n\n")
print(f"Tables found: {len(tables)}")

Tables found: 3

Let’s look at the first table. Here’s the original from the PDF:

And here’s what Marker extracted:
print(tables[0])

| CPU | Thread | native backend | | | pypdfium backend | | |
|---|---|---|---|---|---|---|---|
| | budget | TTS | Pages/s | Mem | TTS | Pages/s | Mem |
| Apple M3 Max<br>(16 cores) | 4<br>16 | 177 s<br>167 s | 1.27<br>1.34 | 6.20 GB | 103 s<br>92 s | 2.18<br>2.45 | 2.56 GB |
| Intel(R) Xeon<br>E5-2690<br>(16 cores) | 4<br>16 | 375 s<br>244 s | 0.60<br>0.92 | 6.16 GB | 239 s<br>143 s | 0.94<br>1.57 | 2.42 GB |

Marker preserves the original table format well:

While Docling flattened this into prefixed column names like “native TTS”, Marker preserves the two-tier header (“native backend” → TTS, Pages/s, Mem) as separate rows, keeping the parent header visible.
While Docling packed these values into single strings like “177 s 167 s” with no separator, Marker keeps them distinguishable with <br> tags, making them easy to split programmatically later.
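Those `<br>` tags make Marker's cells straightforward to unpack into one row per thread budget. A small stdlib sketch (hypothetical post-processing, not part of Marker):

```python
# One Marker row with <br>-separated cells (Mem spans both thread budgets)
row = {"Thread budget": "4<br>16", "TTS": "177 s<br>167 s", "Mem": "6.20 GB"}

# Split each <br>-separated cell; cells with a single value apply to every row
split = {col: val.split("<br>") for col, val in row.items()}
n = max(len(vals) for vals in split.values())
rows = [
    {col: (vals[i] if len(vals) == n else vals[0]) for col, vals in split.items()}
    for i in range(n)
]
print(rows)
```

The result is two clean rows, one per thread budget, with the shared memory value repeated.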

Let’s look at the second table. Here’s the original from the PDF:

And here’s what Marker extracted:
print(tables[1])

| | human<br>MRCNN | FRCNN YOLO | |
|---|---|---|---|
| | R50 R101 | R101 | v5x6 |
| Caption | 84-89 68.4 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 70.9 71.8 | 73.7 | 77.2 |
| Formula | 83-85 60.1 63.4 | 63.5 | 66.2 |
| List-item | 87-88 81.2 80.8 | 81.0 | 86.2 |
| Page-footer | 93-94 61.6 59.3 | 58.9 | 61.1 |
| Page-header | 85-89 71.9 70.0 | 72.0 | 67.9 |
| Picture | 69-71 71.7 72.7 | 72.0 | 77.1 |
| Section-header | 83-84 67.6 69.3 | 68.4 | 74.6 |
| Table | 77-81 82.2 82.9 | 82.2 | 86.3 |
| Text | 84-86 84.6 85.8 | 85.4 | 88.1 |
| Title | 60-72 76.7 80.4 | 79.9 | 82.7 |
| All | 82-83 72.4 73.5 | 73.4 | 76.8 |

This table has several column-merging issues:

“human” and “MRCNN” are merged into one header (human<br>MRCNN), and “FRCNN” and “YOLO” are combined into a single header (FRCNN YOLO).
The human, MRCNN R50, and MRCNN R101 values are packed into one cell (“84-89 68.4 71.5”), while the MRCNN R50 and R101 columns are empty.
The R50 and R101 sub-columns collapsed into a single “R50 R101” cell.

Despite these issues, Marker still preserves all 12 rows individually, while Docling collapsed them into one.
Let’s look at the third table. Here’s the original from the PDF:

And here’s what Marker extracted:
print(tables[2])

| | human | MRCNN | MRCNN | FRCNN | YOLO |
|---|---|---|---|---|---|
| | human | R50 | R101 | R101 | v5x6 |
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
| List-item | 87-88 | 81.2 | 80.8 | 81.0 | 86.2 |
| Page-footer | 93-94 | 61.6 | 59.3 | 58.9 | 61.1 |
| Page-header | 85-89 | 71.9 | 70.0 | 72.0 | 67.9 |
| Picture | 69-71 | 71.7 | 72.7 | 72.0 | 77.1 |
| Section-header | 83-84 | 67.6 | 69.3 | 68.4 | 74.6 |
| Table | 77-81 | 82.2 | 82.9 | 82.2 | 86.3 |
| Text | 84-86 | 84.6 | 85.8 | 85.4 | 88.1 |
| Title | 60-72 | 76.7 | 80.4 | 79.9 | 82.7 |
| All | 82-83 | 72.4 | 73.5 | 73.4 | 76.8 |

Since the layout of this table is simpler, Marker’s vision model correctly separates all columns and preserves all 12 rows. This shows that Marker’s accuracy depends heavily on the visual complexity of the original table.
Performance
TableConverter took about 6 minutes on an Apple M1 (16 GB RAM), roughly 13x slower than Docling. The speed difference comes down to how each tool handles text:

Docling extracts text that is already stored in the PDF, skipping OCR. It only runs its layout model and TableFormer on detected tables.
Marker runs Surya’s full text recognition model on every page, regardless of whether the PDF already contains selectable text.

LlamaParse: LLM-Guided Extraction
LlamaParse is a cloud-hosted document parser by LlamaIndex that takes a different approach:

Cloud-based: the PDF is uploaded to LlamaCloud instead of being processed locally
LLM-guided: an LLM interprets each page and identifies tables, returning structured row data

Here is how it works:
PDF file
┌─────────────────────┐
│ Name   Score        │
│ Alice  92           │
│ Bob    85           │
└─────────────────────┘

          ▼ upload
┌─────────────────────┐
│      LlamaCloud     │
│                     │
│ LLM reads the page  │
│ and identifies      │
│ table structure     │
└─────────────────────┘

          ▼ response
┌───────┬───────┐
│ Name  │ Score │
├───────┼───────┤
│ Alice │ 92    │
│ Bob   │ 85    │
└───────┴───────┘

For extracting structured data from images like receipts using the same LlamaIndex ecosystem, see Turn Receipt Images into Spreadsheets with LlamaIndex.

To install LlamaParse, run:
pip install llama-parse

This article uses llama-parse v0.6.54.
LlamaParse requires an API key from LlamaIndex Cloud. The free tier includes 10,000 credits per month (basic parsing costs 1 credit per page; advanced modes like parse_page_with_agent cost more).
Create a .env file with your API key:
LLAMA_CLOUD_API_KEY=llx-…

from dotenv import load_dotenv

load_dotenv()

Table Extraction
To extract tables, we create a LlamaParse instance with two key settings:

parse_page_with_agent: tells LlamaCloud to use an LLM agent that reads each page and returns structured items (tables, text, figures)
output_tables_as_HTML=True: returns tables as HTML instead of Markdown, which better preserves multi-level headers

from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    output_tables_as_HTML=True,
)
result = parser.parse(local_pdf)

We can then iterate through each page’s items and collect only the tables:
all_tables = []
for page in result.pages:
    for item in page.items:
        if item.type == "table":
            all_tables.append(item)

print(f"Items tagged as table: {len(all_tables)}")

Items tagged as table: 5

Not all items tagged as “table” are actual tables. LlamaParse’s LLM sometimes misidentifies non-table content (like the paper’s title page) as a table. We can filter these out by keeping only tables with more than 2 rows:
tables = [t for t in all_tables if len(t.rows) > 2]
print(f"Actual tables: {len(tables)}")

Actual tables: 4
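Since we asked for HTML tables, each one can be converted into rows of cells for downstream analysis. `pandas.read_html` does this in one call; for illustration, here is a stdlib-only sketch using `html.parser` on a tiny hypothetical table:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Feed it table HTML (here a toy example, not LlamaParse's actual output)
table_parser = TableRows()
table_parser.feed(
    "<table><tr><th>Name</th><th>Score</th></tr>"
    "<tr><td>Alice</td><td>92</td></tr></table>"
)
print(table_parser.rows)  # [['Name', 'Score'], ['Alice', '92']]
```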

Let’s look at the first table. Here’s the original from the PDF:

And here’s what LlamaParse extracted:
print(tables[0].md)

| CPU | Threadbudget | native backend<br/>TTS | native backend<br/>Pages/s | native backend<br/>Mem | pypdfium backend<br/>TTS | pypdfium backend<br/>Pages/s | pypdfium backend<br/>Mem |
|---|---|---|---|---|---|---|---|
| Apple M3 Max<br/>(16 cores) | 4 | 177 s | 1.27 | 6.20 GB | 103 s | 2.18 | 2.56 GB |
| | 16 | 167 s | 1.34 | | 92 s | 2.45 | |
| Intel(R) Xeon<br/>E5-2690<br/>(16 cores) | 4 | 375 s | 0.60 | 6.16 GB | 239 s | 0.94 | 2.42 GB |
| | 16 | 244 s | 0.92 | | 143 s | 1.57 | |

LlamaParse produces the best result for this table among the three tools:

All values are correctly placed in individual cells. Docling packed multiple values like “177 s 167 s” into single strings, and Marker split multi-line CPU names across extra rows.
Multi-line entries like “Apple M3 Max / (16 cores)” stay in one cell via <br/> tags, avoiding Marker’s row-splitting issue.
The two-tier header is flattened into native backend<br/>TTS rather than kept as separate rows like Marker, but the grouping is still readable.

Let’s look at the second table. Here’s the original from the PDF:

And here’s what LlamaParse extracted:
print(tables[1].md)

| | human | MRCNN<br/>R50 | MRCNN<br/>R101 | FRCNN<br/>R101 | YOLO<br/>v5x6 |
|---|---|---|---|---|---|
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
| List-item | 87-88 | 81.2 | 80.8 | 81.0 | 86.2 |
| Page-footer | 93-94 | 61.6 | 59.3 | 58.9 | 61.1 |
| Page-header | 85-89 | 71.9 | 70.0 | 72.0 | 67.9 |
| Picture | 69-71 | 71.7 | 72.7 | 72.0 | 77.1 |
| Section-header | 83-84 | 67.6 | 69.3 | 68.4 | 74.6 |
| Table | 77-81 | 82.2 | 82.9 | 82.2 | 86.3 |
| Text | 84-86 | 84.6 | 85.8 | 85.4 | 88.1 |
| Title | 60-72 | 76.7 | 80.4 | 79.9 | 82.7 |
| All | 82-83 | 72.4 | 73.5 | 73.4 | 76.8 |

LlamaParse produces the most accurate extraction of this table among the three tools:

All 12 data rows are preserved with correct values. Docling merged all rows into a single row.
Each column is correctly separated, while Marker merged some into combined headers like “FRCNN YOLO”.
The MRCNN sub-columns (R50, R101) use <br/> tags to keep the parent header visible (e.g., MRCNN<br/>R50), unlike Marker which lost the grouping entirely.

Let’s look at the third table. Here’s the original from the PDF:

And here’s what LlamaParse extracted:
print(tables[2].md)

| | human | R50 | R100 | R101 | v5x6 |
|---|---|---|---|---|---|
| Caption | 84-89 | 68.4 | 71.5 | 70.1 | 77.7 |
| Footnote | 83-91 | 70.9 | 71.8 | 73.7 | 77.2 |
| Formula | 83-85 | 60.1 | 63.4 | 63.5 | 66.2 |
| List-item | 87-88 | 81.2 | 80.8 | 81.0 | 86.2 |
| Page-footer | 93-94 | 61.6 | 59.3 | 58.9 | 61.1 |
| Page-header | 85-89 | 71.9 | 70.0 | 72.0 | 67.9 |
| Picture | 69-71 | 71.7 | 72.7 | 72.0 | 77.1 |
| Section-header | 83-84 | 67.6 | 69.3 | 68.4 | 74.6 |
| Table | 77-81 | 82.2 | 82.9 | 82.2 | 86.3 |
| Text | 84-86 | 84.6 | 85.8 | 85.4 | 88.1 |
| Title | 60-72 | 76.7 | 80.4 | 79.9 | 82.7 |

The data values are correct but header information is partially lost:

Parent model names (MRCNN, FRCNN, YOLO) are stripped from headers, unlike the previous table which used <br/> tags to preserve them.
“MRCNN R101” appears as “R100” (a typo), and the two R101 columns (MRCNN and FRCNN) are indistinguishable.
Marker handled this table better, keeping all 12 rows with proper column names. Docling missed this table entirely.

Unlike Docling and Marker, LlamaParse actually detects the fourth table. Here’s the original from the PDF:

And here’s what LlamaParse extracted:
print(tables[3].md)

| class label | Count | % of TotalTrain | % of TotalTest | % of TotalVal | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>All | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Fin | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Man | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Sci | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Law | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Pat | triple inter-annotator mAP @ 0.5-0.95 (%)<br/>Ten |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caption | 22524 | 2.04 | 1.77 | 2.32 | 84-89 | 40-61 | 86-92 | 94-99 | 95-99 | 69-78 | n/a |
| Footnote | 6318 | 0.60 | 0.31 | 0.58 | 83-91 | n/a | 100 | 62-88 | 85-94 | n/a | 82-97 |
| Formula | 25027 | 2.25 | 1.90 | 2.96 | 83-85 | n/a | n/a | 84-87 | 86-96 | n/a | n/a |
| List-item | 185660 | 17.19 | 13.34 | 15.82 | 87-88 | 74-83 | 90-92 | 97-97 | 81-85 | 75-88 | 93-95 |
| Page-footer | 70878 | 6.51 | 5.58 | 6.00 | 93-94 | 88-90 | 95-96 | 100 | 92-97 | 100 | 96-98 |
| Page-header | 58022 | 5.10 | 6.70 | 5.06 | 85-89 | 66-76 | 90-94 | 98-100 | 91-92 | 97-99 | 81-86 |
| Picture | 45976 | 4.21 | 2.78 | 5.31 | 69-71 | 56-59 | 82-86 | 69-82 | 80-95 | 66-71 | 59-76 |
| Section-header | 142884 | 12.60 | 15.77 | 12.85 | 83-84 | 76-81 | 90-92 | 94-95 | 87-94 | 69-73 | 78-86 |
| Table | 34733 | 3.20 | 2.27 | 3.60 | 77-81 | 75-80 | 83-86 | 98-99 | 58-80 | 79-84 | 70-85 |
| Text | 510377 | 45.82 | 49.28 | 45.00 | 84-86 | 81-86 | 88-93 | 89-93 | 87-92 | 71-79 | 87-95 |
| Title | 5071 | 0.47 | 0.30 | 0.50 | 60-72 | 24-63 | 50-63 | 94-100 | 82-96 | 68-79 | 24-56 |
| Total | 1107470 | 941123 | 99816 | 66531 | 82-83 | 71-74 | 79-81 | 89-94 | 86-91 | 71-76 | 68-85 |

LlamaParse correctly extracts all 12 data rows plus the Total row with accurate values:

The two-tier headers are flattened into combined names like “% of TotalTrain”, losing the visual grouping but keeping the association.
The “triple inter-annotator mAP” prefix is repeated for every sub-column (All, Fin, Man, etc.), making headers verbose but unambiguous.
All numeric values and n/a entries match the original.

Performance
LlamaParse finished in 17 seconds, roughly 40% faster than Docling (28 s) and 20x faster than Marker (6 min).
This is because LlamaParse offloads the work to LlamaCloud’s servers: the 17-second runtime depends on network speed and server load, not your local hardware.
Summary
The table below summarizes the key differences we found after testing all three tools on the same PDF:

| Feature | Docling | Marker | LlamaParse |
|---|---|---|---|
| Table detection | TableFormer | Vision Transformer | LLM (cloud) |
| Multi-level headers | Flattens into prefixed names | Keeps as separate rows | Preserves with <br/> tags |
| Row separation | Concatenates into one cell | Separates with <br> tags | Keeps each value in its own cell |
| Speed (6-page PDF) | ~28s | ~6 min | ~17s |
| Dependencies | PyTorch + models | PyTorch + models | API key |
| Pricing | Free (MIT) | Free (GPL-3.0) | Free tier (10k credits/month) |

In short:

Docling is the fastest local option and gives you DataFrames out of the box, but it struggles with complex tables, sometimes merging rows and packing values together.
Marker preserves rows reliably and runs locally, but it is the slowest and can merge column headers on tricky layouts.
LlamaParse produces the most accurate tables overall, but it requires a cloud API and the free tier is limited to 10,000 credits per month.

So which one should you use?

For simple tables, start with Docling. It is free, fast, and produces DataFrames that are immediately ready for analysis.
If you must stay local and Docling struggles with the layout, Marker is the better alternative.
Use LlamaParse when accuracy matters most and your documents aren’t sensitive, since all pages are uploaded to LlamaCloud for processing.

Try It Yourself
These benchmarks are based on a single academic PDF tested on an Apple M1 (16 GB RAM). Table complexity, document length, and hardware all affect the results. The best way to pick the right tool is to run each one on a sample of your own PDFs.
Docling and Marker are completely free, and LlamaParse’s free tier gives you 10,000 credits per month to experiment with.
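A minimal harness for that comparison might look like this (the wrappers you plug in are your own; the names below are placeholders, not any tool's API):

```python
import time

def benchmark(extractors: dict, pdf_path: str) -> dict:
    """Time each table-extraction callable on the same PDF."""
    timings = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        extract(pdf_path)
        timings[name] = time.perf_counter() - start
    return timings

# Plug in thin wrappers around each tool, for example:
# timings = benchmark({
#     "docling": lambda p: DocumentConverter().convert(p),
#     "marker": lambda p: table_converter(p),
#     "llamaparse": lambda p: parser.parse(p),
# }, "docling_report.pdf")
timings = benchmark({"noop": lambda p: None}, "docling_report.pdf")
print(timings)
```

Run it on a handful of representative documents rather than a single page, since per-document model loading can dominate short runs.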
Related Tutorials

From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI: Use LLM-guided web scraping to extract structured data from HTML pages without manual selector maintenance
Structured Output Tools for LLMs Compared: Compare tools for enforcing schemas and structured formats on LLM outputs

📚 Want to go deeper? My book shows you how to build data science projects that actually make it to production. Get the book →


Transform Any PDF into Searchable AI Data with Docling

Table of Contents

Setting Up Your Document Processing Pipeline
What is Docling?
What is RAG?

Quick Start: Your First Document Conversion
Export Options for Different Use Cases
Configuring PdfPipelineOptions for Advanced Processing
Enable Image Extraction
Table Recognition Enhancement
AI-Powered Content Understanding
Performance and Memory Management

Building Your RAG Pipeline
Tools for RAG Pipelines
Document Processing
Chunking
Creating a Vector Store

Conclusion

What if complex research papers could be transformed into AI-searchable data using fewer than 10 lines of Python?
Financial reports, research documents, and analytical papers often contain vital tables and formulas that traditional PDF tools fail to extract properly. This results in the loss of structured data that could inform key decisions.
Docling, developed by IBM Research, is an AI-first document processing tool that preserves the relationships between text, tables, and formulas. With just three lines of code, you can convert any document into structured data.
Key Takeaways
Here’s what you’ll learn:

Convert any PDF into structured data with just 3 lines of Python code
Extract tables, formulas, and text while preserving relationships between elements
Build complete RAG pipelines that process 50 chunks in under 60 seconds
Use AI-powered image descriptions to make diagrams searchable
Optimize processing speed by 10x with parallel processing and selective extraction


Setting Up Your Document Processing Pipeline
What is Docling?
Docling is an AI-first document processing tool developed by IBM Research. It transforms complex documents (like PDFs, Excel spreadsheets, and Word files) into structured data while preserving their original structure, including text, tables, and formulas.
To install Docling, run the following command:
pip install docling

What is RAG?
RAG (Retrieval-Augmented Generation) is an AI technique that combines document retrieval with language generation. Instead of relying solely on training data, RAG systems search through external documents to find relevant information, then use that context to generate accurate, up-to-date responses.
This process requires converting documents into structured, searchable chunks. Docling handles this conversion seamlessly.
Quick Start: Your First Document Conversion
Docling transforms any document into structured data with just three lines of code. Let’s see this in action by converting a PDF, specifically Docling’s own technical report from arXiv. It is a good test case because it contains many different element types, including tables, formulas, and text.
from docling.document_converter import DocumentConverter
import pandas as pd

# Initialize converter with default settings
converter = DocumentConverter()

# Convert any document format – we'll use the Docling technical report itself
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)

# Access structured data immediately
doc = result.document
print(f"Successfully processed document from: {source_url}")

To iterate through each document element, we will use the doc.iterate_items() method. This method returns tuples of (item, level). For example:

(TextItem(label='paragraph', text='Introduction text…'), 0) – top-level paragraph
(TableItem(label='table', text='| Col1 | Col2 |…'), 1) – table at depth 1
(TextItem(label='heading', text='Section 2'), 0) – section heading

from collections import defaultdict

# Create a dictionary to categorize all document elements by type
element_types = defaultdict(list)

# Iterate through all document elements and group them by label
for item, _ in doc.iterate_items():
    element_type = item.label
    element_types[element_type].append(item)

# Display the breakdown of document structure
print("Document structure breakdown:")
for element_type, items in element_types.items():
    print(f"  {element_type}: {len(items)} elements")

The output shows the different types of elements Docling extracted from the document.
Document structure breakdown:
picture: 13 elements
section_header: 31 elements
text: 102 elements
list_item: 22 elements
code: 2 elements
footnote: 1 elements
caption: 3 elements
table: 5 elements

Let’s look specifically for structured elements like tables and formulas that are crucial for RAG applications:
first_table = element_types["table"][0]
print(first_table.export_to_dataframe(doc=doc).to_markdown())

|   | CPU. | Thread budget. | native backend.TTS | native backend.Pages/s | native backend.Mem | pypdfium backend.TTS | pypdfium backend.Pages/s | pypdfium backend.Mem |
|---|------|----------------|--------------------|------------------------|--------------------|----------------------|--------------------------|----------------------|
| 0 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
| 1 | (16 cores) Intel(R) E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |

Here is how the table looks in the original PDF:

The extracted table shows Docling’s accuracy and structural differences from the original PDF. Docling captured all numerical data and text perfectly but flattened the merged cell structure into separate columns.
While this loses visual formatting, it benefits RAG applications since each row contains complete information without complex cell merging logic.
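To see why the flattening is easy to recover from, note that the dotted headers such as native backend.TTS encode the original two header levels as "group.metric" strings, so splitting on the first dot restores the hierarchy. The sketch below is illustrative only: the unflatten helper is hypothetical and not part of Docling's API, and the column names are copied from the extracted table above.

```python
# Flattened headers as Docling's export produced them
flat_columns = [
    "CPU",
    "Thread budget",
    "native backend.TTS",
    "native backend.Pages/s",
    "native backend.Mem",
    "pypdfium backend.TTS",
    "pypdfium backend.Pages/s",
    "pypdfium backend.Mem",
]

def unflatten(columns):
    """Group 'group.metric' column names back into a two-level mapping."""
    groups = {}
    for name in columns:
        group, _, metric = name.partition(".")
        # Columns without a dot keep their own name as the only metric
        groups.setdefault(group, []).append(metric or group)
    return groups

print(unflatten(flat_columns))
```

Running this recovers the two merged header groups (native backend and pypdfium backend), each with its TTS, Pages/s, and Mem sub-columns.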
Next, look at the first few list item elements:
first_list_items = element_types["list_item"][0:6]
for list_item in first_list_items:
    print(list_item.text)

· Converts PDF documents to JSON or Markdown format, stable and lightning fast
· Understands detailed page layout, reading order, locates figures and recovers table structures
· Extracts metadata from the document, such as title, authors, references and language
· Optionally applies OCR, e.g. for scanned PDFs
· Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)
· Can leverage different accelerators (GPU, MPS, etc).

This matches the original PDF list item.

Look at the first caption element:
first_caption = element_types["caption"][0]
print(first_caption.text)

This matches the image caption in the original PDF.

Export Options for Different Use Cases
Docling provides multiple ways to export the document data, including Markdown, JSON, and dictionary formats.
For human review and documentation, Markdown format preserves the document structure beautifully.
# Human-readable markdown for review
markdown_content = doc.export_to_markdown()
print(markdown_content[:500] + "…")

<!-- image -->

## Docling Technical Report

Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar

AI4K Group, IBM Research R¨ uschlikon, Switzerland

## Abstract

This technical report introduces Docling , an easy to use, self-contained, MITli…

Compare this to the original PDF:

Docling preserves all original content while converting complex PDF formatting into clean markdown. Every author name, title, and abstract text remains intact, creating searchable structure perfect for RAG applications.
For programmatic processing and API integrations, JSON format provides structured access to all document elements:
import json

# JSON for programmatic processing
json_dict = doc.export_to_dict()

print('JSON keys:', json_dict.keys())

JSON keys: dict_keys(['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages'])

The JSON structure reveals Docling’s comprehensive document analysis. Key sections include texts for paragraphs, tables for structured data, pictures for images, and pages for layout information.
For Python development workflows, the dictionary format enables immediate access to all document elements.
# Python dictionary for immediate use
dict_repr = doc.export_to_dict()

# Preview the structure
num_texts = len(dict_repr['texts'])
num_tables = len(dict_repr['tables'])

print(f"Text elements: {num_texts}")
print(f"Table elements: {num_tables}")

Text elements: 985
Table elements: 5

Configuring PdfPipelineOptions for Advanced Processing
The default Docling configuration works well for most documents, but PdfPipelineOptions unlocks advanced processing capabilities. These options control OCR engines, table recognition, AI enrichments, and performance settings.
PdfPipelineOptions becomes essential when working with scanned documents, complex layouts, or specialized content requiring AI-powered understanding.
Enable Image Extraction
By default, Docling does not extract images from the document. However, you can enable image extraction by setting the generate_picture_images option to True.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(generate_picture_images=True)

# Create converter with image extraction enabled
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

Display the first image:
# Extract and display the first image
from IPython.display import Image, display

for item, _ in doc_enhanced.iterate_items():
    if item.label == "picture":
        image_data = item.image

        # Get the image URI
        uri = str(image_data.uri)

        # Display the image using IPython
        display(Image(url=uri))
        break

The output image matches the first image of the PDF.
Table Recognition Enhancement
To use the more sophisticated AI model for table extraction instead of the default fast model, you can set the table_structure_options.mode to TableFormerMode.ACCURATE.
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption

# Enhanced table processing for complex layouts
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Create converter with enhanced table processing
converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

AI-Powered Content Understanding
AI enrichments enhance extracted content with semantic understanding. Picture descriptions, formula detection, and code parsing improve RAG accuracy by adding crucial context.
In the code below, we:

Set the do_picture_description option to True to enable picture description extraction
Set the picture_description_options option to use the SmolVLM-256M-Instruct model from Hugging Face

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# AI-powered content enrichment
pipeline_options = PdfPipelineOptions(
    do_picture_description=True,  # AI-generated image descriptions
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture. Be precise and concise.",
    ),
    generate_picture_images=True,
)

converter_enhanced = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result_enhanced = converter_enhanced.convert("https://arxiv.org/pdf/2408.09869")
doc_enhanced = result_enhanced.document

Extract the picture description from the second picture:
second_picture = doc_enhanced.pictures[1]

print(f"Caption: {second_picture.caption_text(doc=doc_enhanced)}")

# Check for annotations
for annotation in second_picture.annotations:
    print(annotation.text)

Caption: Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
### Image Description

The image is a flowchart that depicts a sequence of steps from a document, likely a report or a document. The flowchart is structured with various elements such as text, icons, and arrows. Here is a detailed description of the flowchart:

#### Step 1: Parse
– **Description:** The first step in the process is to parse the document. This involves converting the text into a format that can be easily understood by the user.

#### Step 2: Ocr
– **Description:** The second step is to perform OCR (Optical Character Recognition) on the document. This involves converting the text into a format that can be easily read by the OCR software.

#### Step 3: Layout Analysis
– **Description:** The third step is to analyze the document's layout. This involves examining the document's structure, including the layout of the text, the alignment of the text, and the alignment of the document's content

Here is the original image:

The detailed description shows how Docling’s picture analysis transforms visual content into text that can be indexed and searched, making diagrams accessible to RAG systems.
Performance and Memory Management
Processing a large document can be time-consuming. To speed up the process, we can use:

The page_range option to process only a specific page range.
The max_num_pages option to limit the number of pages to process.
The images_scale option to reduce the image resolution for speed.
The generate_page_images option to skip page images to save memory.
The do_table_structure option to skip table structure extraction.
The enable_parallel_processing option to use multiple cores.

# Optimized for large documents
pipeline_options = PdfPipelineOptions(
    max_num_pages=4,  # Limit processing to first 4 pages
    page_range=[1, 3],  # Process specific page range
    generate_page_images=False,  # Skip page images to save memory
    do_table_structure=False,  # Skip table structure extraction
    enable_parallel_processing=True,  # Use multiple cores
)

Building Your RAG Pipeline
We’ll build our RAG pipeline in five steps:

Document Processing: Use Docling to convert documents into structured data
Chunking: Break documents into smaller, searchable pieces
Create Embeddings: Convert text chunks into vector representations
Store in Vector Database: Save embeddings in FAISS for fast similarity search
Query: Retrieve relevant chunks and generate contextual responses

Tools for RAG Pipelines
Building RAG pipelines requires three essential tools:

Docling: converts documents into structured data
LangChain: manages document workflows, chain orchestration, and provides embedding models
FAISS: stores and retrieves document chunks

These tools work together to create complete RAG pipelines that can process, store, and retrieve document content intelligently.
LangChain
LangChain simplifies building AI applications by providing components for document loading, text processing, and chain orchestration. It integrates seamlessly with vector stores and language models.
For a comprehensive introduction to LangChain fundamentals and local AI workflows, see our LangChain and Ollama guide.
FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search in high-dimensional spaces. It enables fast retrieval of the most relevant document chunks based on embedding similarity.
For production use cases requiring robust database integration, consider implementing semantic search with pgvector in PostgreSQL or using Pinecone for cloud-based vector search as alternatives to FAISS.
Let’s install the additional packages for RAG functionality:
# Install additional packages for RAG functionality
pip install docling sentence-transformers langchain-community langchain-huggingface faiss-cpu
# Note: Use faiss-gpu if you have CUDA support

Document Processing
Convert the document into structured data using Docling.
from docling.document_converter import DocumentConverter

# Initialize converter with default settings
converter = DocumentConverter()

# Convert the document into structured data
source_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(source_url)

# Access structured data immediately
doc = result.document

Chunking
AI models have limited context windows that can’t process entire documents at once. Chunking solves this by breaking documents into smaller, searchable pieces that fit within these constraints. This improves retrieval accuracy by finding the most relevant sections rather than entire documents.
Docling provides two main chunking strategies:

HierarchicalChunker: Focuses purely on document structure, creating chunks based on headings and sections
HybridChunker: Combines structure-aware chunking with token-based limits, preserving document hierarchy while respecting model constraints

Let’s compare how these chunkers process the same document.
First, create a helper function to print the chunk content:
def print_chunk(chunk):
    print(f"Chunk length: {len(chunk.text)} characters")
    if len(chunk.text) > 30:
        print(f"Chunk content: {chunk.text[:30]}…{chunk.text[-30:]}")
    else:
        print(f"Chunk content: {chunk.text}")
    print("-" * 50)

Next, process the document with the HierarchicalChunker:
from docling.chunking import HierarchicalChunker

# Process with HierarchicalChunker (structure-based)
hierarchical_chunker = HierarchicalChunker()
hierarchical_chunks = list(hierarchical_chunker.chunk(doc))

print(f"HierarchicalChunker: {len(hierarchical_chunks)} chunks")

# Print the first 5 chunks
for chunk in hierarchical_chunks[:5]:
    print_chunk(chunk)

HierarchicalChunker: 114 chunks
Chunk length: 11 characters
Chunk content: Version 1.0
--------------------------------------------------
Chunk length: 295 characters
Chunk content: Christoph Auer Maksym Lysak Ah… Kuropiatnyk Peter W. J. Staar
--------------------------------------------------
Chunk length: 50 characters
Chunk content: AI4K Group, IBM Research R¨ us…arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc…on of new features and models.
--------------------------------------------------
Chunk length: 792 characters
Chunk content: Converting PDF documents back … gap to proprietary solutions.
--------------------------------------------------

Compare this to the HybridChunker:
from docling.chunking import HybridChunker

# Process with HybridChunker (token-aware)
hybrid_chunker = HybridChunker(max_tokens=512, overlap_tokens=50)
hybrid_chunks = list(hybrid_chunker.chunk(doc))

print(f"HybridChunker: {len(hybrid_chunks)} chunks")

# Print the first 5 chunks
for chunk in hybrid_chunks[:5]:
    print_chunk(chunk)

HybridChunker: 50 chunks
Chunk length: 358 characters
Chunk content: Version 1.0
Christoph Auer Mak…arch R¨ uschlikon, Switzerland
--------------------------------------------------
Chunk length: 431 characters
Chunk content: This technical report introduc…on of new features and models.
--------------------------------------------------
Chunk length: 1858 characters
Chunk content: Converting PDF documents back … accelerators (GPU, MPS, etc).
--------------------------------------------------
Chunk length: 1436 characters
Chunk content: To use Docling, you can simply…and run it inside a container.
--------------------------------------------------
Chunk length: 796 characters
Chunk content: Docling implements a linear pi…erialized to JSON or Markdown.
--------------------------------------------------

The comparison shows key differences:

HierarchicalChunker: Creates many small chunks by splitting at every section boundary
HybridChunker: Creates fewer, larger chunks by combining related sections within token limits

We will use HybridChunker because it respects document boundaries (won’t split tables inappropriately) while ensuring chunks fit within embedding model constraints.
from docling.chunking import HybridChunker

# Initialize the chunker
chunker = HybridChunker(max_tokens=512, overlap_tokens=50)

# Create the chunks
rag_chunks = list(chunker.chunk(doc))

print(f"Created {len(rag_chunks)} intelligent chunks")

Created 50 intelligent chunks

Creating a Vector Store
A vector store is a database that converts text into numerical vectors called embeddings. These vectors capture semantic meaning, allowing the system to find related content even when different words are used.
When you search for “document processing,” the vector store finds chunks about “PDF parsing” or “text extraction” because their embeddings are mathematically close. This enables semantic search beyond exact keyword matching.
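"Mathematically close" here usually means a high cosine similarity between the embedding vectors. The sketch below illustrates the idea with toy 3-dimensional vectors; real embedding models like all-MiniLM-L6-v2 output 384 dimensions, and the vector values below are invented purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for three phrases (hypothetical values)
doc_processing = [0.9, 0.1, 0.3]
pdf_parsing = [0.8, 0.2, 0.4]  # related concept, nearby vector
cooking = [0.1, 0.9, 0.1]      # unrelated concept, distant vector

print(cosine_similarity(doc_processing, pdf_parsing))  # high, ~0.98
print(cosine_similarity(doc_processing, cooking))      # low, ~0.24
```

FAISS performs exactly this kind of comparison, just over thousands of high-dimensional vectors with optimized index structures instead of a Python loop.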
Create the vector store for semantic search across your document chunks:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create the vector store
texts = [chunk.text for chunk in rag_chunks]
vectorstore = FAISS.from_texts(texts, embeddings)

print(f"Built vector store with {len(texts)} chunks")

Built vector store with 50 chunks

Now you can search your knowledge base with semantic similarity:
# Search the knowledge base
query = "How does document processing work?"
relevant_docs = vectorstore.similarity_search(query, k=3)

print(f"Query: '{query}'")
print(f"Found {len(relevant_docs)} relevant chunks:")

for i, doc in enumerate(relevant_docs, 1):
    print(f"\nResult {i}:")
    print(f"Content: {doc.page_content[:150]}…")

Query: 'How does document processing work?'
Found 3 relevant chunks:

Result 1:
Content: Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a…

Result 2:
Content: In the final pipeline stage, Docling assembles all prediction results produced on each page into a well-defined datatype that encapsulates a converted…

Result 3:
Content: Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, suc…

The search results show effective semantic retrieval. The vector store found relevant chunks about Docling’s architecture and design when searching for “document processing” – demonstrating how RAG systems match meaning, not just keywords.
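The pipeline's final step, generating a contextual response, typically stitches the retrieved chunks into a prompt for a language model. The sketch below is a minimal, model-agnostic illustration: build_prompt and the toy chunks are hypothetical, not part of Docling or LangChain, and in practice you would pass the resulting prompt to any chat model.

```python
def build_prompt(query, chunks):
    """Join retrieved chunks into a grounded prompt for a language model."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Toy chunks standing in for vectorstore.similarity_search() results
retrieved = [
    "Docling implements a linear pipeline of operations on each document.",
    "Docling assembles all prediction results into a well-defined datatype.",
]
prompt = build_prompt("How does document processing work?", retrieved)
print(prompt)
```

Numbering the chunks lets the model cite which retrieved passage supports each part of its answer, which makes responses easier to verify against the source document.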
Conclusion
This tutorial demonstrated building a robust document processing pipeline that handles complex, real-world documents. Your pipeline preserves critical elements like tables, mathematical formulas, and document structure while generating semantically meaningful chunks for retrieval-augmented generation systems.
The capability to transform any document format into AI-ready data using minimal code at no cost represents a significant advancement in document processing workflows. For enhanced reasoning capabilities in your RAG workflows, explore our guide on building data science workflows with DeepSeek and LangChain which combines advanced language models with document processing pipelines.
