The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

Table of Contents

Introduction
What Are Typed Data Containers?
Using Dictionaries
Using NamedTuple
Using dataclass
Using Pydantic
Final Thoughts
Related Tutorials

Introduction
Imagine you’re processing customer records. The pipeline runs without errors, but customers never receive their welcome emails. After digging through the code, you discover the issue is a simple typo in a dictionary key.
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo

def send_welcome_email(customer):
    email = customer.get("email")  # Returns None silently
    if email:
        print(f"Sending email to {email}")
    # No email sent, no error raised

customer = load_customer(["C001", "Alice", "alice@example.com"])
send_welcome_email(customer)  # Nothing happens

Since .get() returns None for a missing key, the bug stays hidden.
This is exactly the type of issue we want to catch earlier. In this article, we’ll look at how typed data containers like NamedTuple, dataclass, and Pydantic surface these bugs sooner — in your IDE, under static analysis, or at object creation.
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
What Are Typed Data Containers?
Python offers several ways to structure data, each adding more safety than the last:

dict: No protection. Bugs surface only when you access a missing key.
NamedTuple: Basic safety. Catches typos at write time in your IDE and at runtime.
dataclass: Static analysis support. Tools like mypy catch errors before your code runs.
Pydantic: Full protection. Validates data the moment you create an instance.

Let’s see how each tool handles the same customer data:
Using Dictionaries
Dictionaries are quick to create but provide no safety:
customer = {
    "customer_id": "C001",
    "name": "Alice Smith",
    "email": "alice@example.com",
    "age": 28,
    "is_premium": True,
}

print(customer["name"])

Alice Smith

Typo Bugs
A typo in the key name causes a KeyError at runtime:
customer["emial"] # Typo: should be "email"

KeyError: 'emial'

The error tells you what went wrong but not where. When dictionaries pass through multiple functions, finding the source of a typo can take significant debugging time:
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo here

def validate_customer(customer):
    return customer  # Passes through unchanged

def send_email(customer):
    return customer["email"]  # KeyError raised here

customer = load_customer(["C001", "Alice", "alice@example.com"])
validated = validate_customer(customer)
send_email(validated)  # Error points here, but bug is in load_customer

KeyError                          Traceback (most recent call last)
     13 customer = load_customer(["C001", "Alice", "alice@example.com"])
     14 validated = validate_customer(customer)
---> 15 send_email(validated)  # Error points here, but bug is in load_customer

Cell In[6], line 10, in send_email(customer)
      9 def send_email(customer):
---> 10     return customer["email"]

KeyError: 'email'

The stack trace shows where the KeyError was raised, not where "emial" was written. The bug and its symptom are 13 lines apart here, but in production code, they could be in different files entirely.
Using .get() makes it worse by returning None silently:
email = customer.get("email") # Returns None – key is "emial" not "email"
print(f"Sending email to: {email}")

Sending email to: None

This silent failure is dangerous: your notification system might skip thousands of customers, or worse, your code could write None to a database column, corrupting your data pipeline.
Type Confusion
Typos cause crashes, but wrong types can corrupt your data silently. Since dictionaries have no schema, nothing stops you from assigning the wrong type to a field:
customer = {
    "customer_id": "C001",
    "name": 123,  # Should be a string
    "age": "twenty-eight",  # Should be an integer
}

total_age = customer["age"] + 5

TypeError: can only concatenate str (not "int") to str

The error message is misleading: it says “concatenate str” but the real problem is that age should never have been a string in the first place.
Using NamedTuple
NamedTuple is a lightweight way to define a fixed structure with named fields and type hints, like a dictionary with a schema:
from typing import NamedTuple

class Customer(NamedTuple):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
    is_premium=True,
)

print(customer.name)

Alice Smith

IDE Autocomplete Catches Typos
Most IDEs can’t reliably autocomplete dictionary keys, so typing customer[" shows no suggestions. With NamedTuple, typing customer. displays all available fields: customer_id, name, email, age, is_premium.
Even if you skip autocomplete and type manually, typos are flagged instantly with squiggly lines:
customer.emial
         ~~~~~

Running the code will raise an error:
customer.emial

AttributeError: 'Customer' object has no attribute 'emial'

The error names the exact object and missing attribute, so you know immediately what to fix.
Immutability Prevents Accidental Changes
NamedTuples are immutable, meaning once created, their values cannot be changed:
customer.name = "Bob" # Raises an error

AttributeError: can't set attribute

This prevents bugs where data is accidentally modified during processing.
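When you do need a modified copy, NamedTuple provides the standard-library _replace() method, which returns a new instance instead of mutating the original:

```python
from typing import NamedTuple

class Customer(NamedTuple):
    customer_id: str
    name: str
    email: str

customer = Customer("C001", "Alice", "alice@example.com")

# _replace returns a NEW instance; the original is untouched
updated = customer._replace(name="Bob")

print(customer.name)  # Alice
print(updated.name)   # Bob
```

This keeps the safety of immutability while still allowing updates in a controlled, explicit way.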
Limitations: No Runtime Type Validation
Type hints in NamedTuple are not enforced at runtime, so you can still pass in wrong types:
# Wrong types are accepted without error
customer = Customer(
    customer_id="C001",
    name=123,  # Should be str, but int is accepted
    email="alice@example.com",
    age="twenty-eight",  # Should be int, but str is accepted
    is_premium=True,
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

The code runs, but with incorrect data types. The bug surfaces later when you try to use the data.
Using dataclass
dataclass reduces the boilerplate of writing classes that mainly hold data. Instead of manually writing __init__ and other methods, you just declare your fields.
It provides the same IDE support as NamedTuple, plus three additional features:

Mutable objects: You can change field values after creation
Mutable defaults: Safe defaults for lists and dicts with field(default_factory=list)
Post-init logic: Run custom validation or compute derived fields with __post_init__

from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False  # Default value

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Smith, Premium: False

Mutability Allows Updates
Dataclass trades NamedTuple’s immutability protection for flexibility. You can modify fields after creation:
customer.name = "Alice Johnson" # Changed after marriage
customer.is_premium = True # Upgraded their account

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Johnson, Premium: True

For extra safety, use @dataclass(slots=True) to prevent accidentally adding new attributes:
@dataclass(slots=True)
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="alice@example.com",
    age=28,
)

customer.nmae = "Bob"  # Typo

AttributeError: 'Customer' object has no attribute 'nmae'

Mutable Defaults with default_factory
Mutable defaults like lists don’t work as expected. You might think each instance gets its own empty list, but Python creates the default [] once and all instances share it:
from typing import NamedTuple

class Order(NamedTuple):
    order_id: str
    items: list = []

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Also has "apple"!

Order 1: ['apple']
Order 2: ['apple']

Order 2 has “apple” even though we only added it to Order 1. Modifying one order’s items affects every order.
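This is the same trap as mutable default arguments in functions — Python evaluates the default once, at definition time, not per call:

```python
def add_item(item, items=[]):  # Default list created ONCE, at definition time
    items.append(item)
    return items

print(add_item("apple"))   # ['apple']
print(add_item("banana"))  # ['apple', 'banana'] – same shared list!

def add_item_safe(item, items=None):
    if items is None:
        items = []  # Fresh list per call
    items.append(item)
    return items

print(add_item_safe("apple"))   # ['apple']
print(add_item_safe("banana"))  # ['banana']
```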
Dataclass prevents this mistake by rejecting mutable defaults:
@dataclass
class Order:
    items: list = []
ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

Dataclass offers field(default_factory=…) as the solution. The factory function runs at instance creation, not class definition, so each object gets its own list:
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: str
    items: list = field(default_factory=list)  # Each instance gets its own list

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Not affected by order1

Order 1: ['apple']
Order 2: []

Unlike the NamedTuple example, Order 2 stays empty because it has its own list.
Post-Init Validation with __post_init__
Without validation, invalid data passes through silently:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="",  # Empty name
    email="invalid",
    age=-100,
)
print(f"Created: {customer}")  # No error – bad data is in your system

Created: Customer(customer_id='C001', name='', email='invalid', age=-100, is_premium=False)

Dataclass provides __post_init__ to catch these issues at creation time so you can validate fields before the object is used:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if self.age < 0:
            raise ValueError(f"Age cannot be negative: {self.age}")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="invalid-email",
    age=28,
)

ValueError: Invalid email: invalid-email

The error message tells you exactly what’s wrong, making the bug easy to fix.
Limitations: Manual Validation Only
__post_init__ requires you to write every validation rule yourself. If you forget to check a field, bad data can still slip through.
In this example, __post_init__ only validates email format, so wrong types for name and age pass undetected:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name=123,  # No validation for name type
    email="alice@example.com",
    age="twenty-eight",  # No validation for age type
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

Type hints alone don’t enforce types at runtime. For automatic validation, you need a library that actually checks types when objects are created.
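To see why hand-rolled checking doesn’t scale, here is a sketch (not the article’s code) of generic type checks in __post_init__ using dataclasses.fields — it only handles plain annotations like str and int, not generics such as list[str] or Optional:

```python
from dataclasses import dataclass, fields

@dataclass
class Customer:
    customer_id: str
    name: str
    age: int

    def __post_init__(self):
        # Compare each value against its annotation; works only for simple types
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(f.type, type) and not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} expected {f.type.__name__}, got {type(value).__name__}"
                )

Customer(customer_id="C001", name="Alice", age=28)  # OK

try:
    Customer(customer_id="C001", name=123, age=28)
except TypeError as e:
    print(e)  # name expected str, got int
```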
📚 For comprehensive coverage of dataclasses and Pydantic in production workflows, check out Production-Ready Data Science.
Using Pydantic
Pydantic is a data validation library that enforces type hints at runtime. Unlike NamedTuple and dataclass, it actually checks that values match their declared types when objects are created. Install it with:
pip install pydantic

To create a Pydantic model, inherit from BaseModel and declare your fields with type hints:
from pydantic import BaseModel

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Age: {customer.age}")

Alice Smith, Age: 28

For using Pydantic to enforce structured outputs from AI models, see our PydanticAI tutorial.
Runtime Validation
Remember how dataclass accepted name=123 without complaint? Pydantic catches this automatically with a ValidationError:
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name=123,
        email="alice@example.com",
        age="thirty",
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
Input should be a valid string [type=string_type, input_value=123, input_type=int]
age
Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='thirty', input_type=str]

The error message shows:

Which fields failed validation (name, age)
What was expected (valid string, valid integer)
What was received (123 as int, 'thirty' as str)

This tells you everything you need to fix the bug in one place, instead of digging through stack traces.
Type Coercion
Unlike dataclass which stores whatever you pass, Pydantic automatically converts compatible types to match your type hints:
customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age="28",  # String "28" is converted to int 28
    is_premium="true",  # String "true" is converted to bool True
)

print(f"Age: {customer.age} (type: {type(customer.age).__name__})")
print(f"Premium: {customer.is_premium} (type: {type(customer.is_premium).__name__})")

Age: 28 (type: int)
Premium: True (type: bool)

This is useful when reading data from CSV files or APIs where everything comes as strings.
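For example, rows from csv.DictReader arrive with every value as a string, and the Customer model from above (a sketch assuming Pydantic v2, with an in-memory file standing in for a real CSV) coerces them on creation:

```python
import csv
import io

from pydantic import BaseModel

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

# Simulate a CSV file – DictReader yields only strings
raw = io.StringIO(
    "customer_id,name,email,age,is_premium\n"
    "C001,Alice,alice@example.com,28,true\n"
)

customers = [Customer(**row) for row in csv.DictReader(raw)]

print(customers[0].age, type(customers[0].age).__name__)  # 28 int
```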
Constraint Validation
Beyond types, you often need business rules: age must be positive, names can’t be empty, customer IDs must follow a pattern.
In dataclass, you define fields in one place and validate them in __post_init__. The validation logic grows with each constraint:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if not self.customer_id:
            raise ValueError("Customer ID cannot be empty")
        if not self.name or len(self.name) < 1:
            raise ValueError("Name cannot be empty")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")
        if self.age < 0 or self.age > 150:
            raise ValueError(f"Age must be between 0 and 150: {self.age}")

Pydantic puts constraints directly in Field(), keeping rules next to the data they validate:
from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str = Field(min_length=1)
    email: str
    age: int = Field(ge=0, le=150)  # Age must be between 0 and 150
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name="",  # Empty name
        email="alice@example.com",
        age=-5,  # Negative age
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
String should have at least 1 character [type=string_too_short, input_value='', input_type=str]
age
Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-5, input_type=int]

Nested Validation
Data structures are rarely flat. A customer has an address, an order contains items. When something is wrong inside a nested object, you need to know exactly where.
Pydantic validates each level and reports the full path to any error:
from pydantic import BaseModel, Field, ValidationError

class Address(BaseModel):
    street: str
    city: str
    zip_code: str = Field(pattern=r"^\d{5}$")  # Must be 5 digits

class Customer(BaseModel):
    customer_id: str
    name: str
    address: Address

try:
    customer = Customer(
        customer_id="C001",
        name="Alice Smith",
        address={
            "street": "123 Main St",
            "city": "New York",
            "zip_code": "invalid",  # Invalid zip code
        },
    )
except ValidationError as e:
    print(e)

1 validation error for Customer
address.zip_code
String should match pattern '^\d{5}$' [type=string_pattern_mismatch, input_value='invalid', input_type=str]

The error message shows address.zip_code, pinpointing the exact location in the nested structure.
For extracting structured data from documents using Pydantic, see our LlamaIndex data extraction guide.
Final Thoughts
To summarize what each tool provides:

dict: Quick to create. No structure or validation.
NamedTuple: Fixed structure with IDE autocomplete. Immutable.
dataclass: Mutable fields, safe defaults, custom logic via __post_init__.
Pydantic: Runtime type enforcement, automatic type coercion, built-in constraints.

Personally, I use dict for quick prototyping:
stats = {"rmse": 0.234, "mae": 0.189, "r2": 0.91}

Then Pydantic when the code moves to production. For example, a training config should reject invalid values like negative learning rates:
from pydantic import BaseModel, Field

class TrainingConfig(BaseModel):
    epochs: int = Field(ge=1)
    batch_size: int = Field(ge=1)
    learning_rate: float = Field(gt=0)

config = TrainingConfig(epochs=10, batch_size=32, learning_rate=0.001)

Pick the level of protection that matches your needs. A notebook experiment doesn’t need Pydantic, but a production API does.
Related Tutorials

Database Integration: SQLModel vs psycopg2 for combining Pydantic-style validation with databases
Testing: Pytest for Data Scientists to test your data containers
Configuration: Hydra for Python Configuration for validated configuration management


pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Table of Contents

Introduction
Tool Strengths at a Glance
Setup
Syntax Comparison
Data Loading Performance
Query Optimization
GroupBy Performance
Memory Efficiency
Join Operations
Interoperability
Decision Matrix
Final Thoughts

Introduction
pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.
Each tool excels in different scenarios:

Tool     Backend    Execution Model              Best For
pandas   C/Python   Eager, single-threaded       Small datasets, prototyping, ML integration
Polars   Rust       Lazy/Eager, multi-threaded   Large-scale analytics, data pipelines
DuckDB   C++        SQL-first, multi-threaded    SQL workflows, embedded analytics, file queries

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Tool Strengths at a Glance
pandas
pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

Direct compatibility with scikit-learn, statsmodels, and visualization libraries
Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
Mature time series functionality
Familiar syntax that most data scientists already know

Polars
Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

Speeds up operations by using all available CPU cores by default
Builds a query plan first, then executes only what’s needed
Streaming mode for processing datasets larger than RAM
Expressive method chaining with a pandas-like API

DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

Native SQL syntax with full analytical query support
Queries CSV, Parquet, and JSON files directly without loading
Uses disk storage automatically when data exceeds available memory
Zero-configuration embedded database requiring no server setup

Setup
Install all three libraries:
pip install pandas polars duckdb

Generate sample data for benchmarking:
import pandas as pd
import numpy as np

np.random.seed(42)
n_rows = 5_000_000

data = {
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
    "region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
    "amount": np.random.rand(n_rows) * 1000,
    "quantity": np.random.randint(1, 100, size=n_rows),
}

df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")

Created sales_data.csv with 5,000,000 rows

Syntax Comparison
All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of common tasks.
Filtering Rows
pandas:
Uses bracket notation with boolean conditions, which is concise but can become hard to read with complex conditions.
import pandas as pd

df_pd = pd.read_csv("sales_data.csv")
result_pd = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_pd.head()

       category region      amount  quantity
7   Electronics   West  662.803066        80
15  Electronics  North  826.004963        25
30  Electronics  North  766.081832         7
31  Electronics   West  772.084261        36
37  Electronics   East  527.967145        35

Polars:
Uses method chaining with pl.col() expressions, avoiding the repeated df["column"] references required by pandas.
import polars as pl

df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
    (pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()

category       region   amount      quantity
str            str      f64         i64
"Electronics"  "West"   662.803066  80
"Electronics"  "North"  826.004963  25
"Electronics"  "North"  766.081832  7
"Electronics"  "West"   772.084261  36
"Electronics"  "East"   527.967145  35

DuckDB:
Uses standard SQL with a WHERE clause, which is immediately readable for anyone who knows SQL.
import duckdb

result_duckdb = duckdb.sql("""
    SELECT * FROM 'sales_data.csv'
    WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()

      category region      amount  quantity
0  Electronics   West  662.803066        80
1  Electronics  North  826.004963        25
2  Electronics  North  766.081832         7
3  Electronics   West  772.084261        36
4  Electronics   East  527.967145        35

Selecting Columns
pandas:
Double brackets return a DataFrame with selected columns.
result_pd = df_pd[["category", "amount"]]
result_pd.head()

      category      amount
0         Food  516.653322
1        Books  937.337226
2  Electronics  450.941022
3         Food  674.488081
4         Food  188.847906

Polars:
The select() method clearly communicates column selection intent.
result_pl = df_pl.select(["category", "amount"])
result_pl.head()

category       amount
str            f64
"Food"         516.653322
"Books"        937.337226
"Electronics"  450.941022
"Food"         674.488081
"Food"         188.847906

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQL users.
result_duckdb = duckdb.sql("""
    SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()

      category      amount
0         Food  516.653322
1        Books  937.337226
2  Electronics  450.941022
3         Food  674.488081
4         Food  188.847906

GroupBy Aggregation
pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before further use.
result_pd = df_pd.groupby("category").agg({
    "amount": ["sum", "mean"],
    "quantity": "sum"
})
result_pd.head()

                   amount              quantity
                      sum        mean       sum
category
Books        6.247506e+08  499.998897  62463285
Clothing     6.253924e+08  500.139837  62505224
Electronics  6.244453e+08  499.938189  62484265
Food         6.254034e+08  499.916417  62577943
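One common fix (illustrative, using a small made-up DataFrame rather than the article’s benchmark data) is to flatten those multi-level headers before further use:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Books", "Books", "Food"],
    "amount": [10.0, 20.0, 5.0],
    "quantity": [1, 2, 3],
})

result = df.groupby("category").agg({"amount": ["sum", "mean"], "quantity": "sum"})

# Join each (column, aggregation) pair into a single flat name
result.columns = ["_".join(col) for col in result.columns]

print(result.columns.tolist())  # ['amount_sum', 'amount_mean', 'quantity_sum']
```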

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.
result_pl = df_pl.group_by("category").agg([
    pl.col("amount").sum().alias("amount_sum"),
    pl.col("amount").mean().alias("amount_mean"),
    pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()

category       amount_sum  amount_mean  quantity_sum
str            f64         f64          i64
"Clothing"     6.2539e8    500.139837   62505224
"Books"        6.2475e8    499.998897   62463285
"Electronics"  6.2445e8    499.938189   62484265
"Food"         6.2540e8    499.916417   62577943

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.
result_duckdb = duckdb.sql("""
    SELECT
        category,
        SUM(amount) as amount_sum,
        AVG(amount) as amount_mean,
        SUM(quantity) as quantity_sum
    FROM 'sales_data.csv'
    GROUP BY category
""").df()
result_duckdb.head()

      category    amount_sum  amount_mean  quantity_sum
0         Food  6.254034e+08   499.916417    62577943.0
1  Electronics  6.244453e+08   499.938189    62484265.0
2     Clothing  6.253924e+08   500.139837    62505224.0
3        Books  6.247506e+08   499.998897    62463285.0

Adding Columns
pandas:
The assign() method creates new columns with repeated DataFrame references like df_pd["amount"].
result_pd = df_pd.assign(
    amount_with_tax=df_pd["amount"] * 1.1,
    high_value=df_pd["amount"] > 500
)
result_pd.head()

      category region      amount  quantity  amount_with_tax  high_value
0         Food  South  516.653322        40       568.318654        True
1        Books   East  937.337226        45      1031.070948        True
2  Electronics  North  450.941022        93       496.035124       False
3         Food   East  674.488081        46       741.936889        True
4         Food   East  188.847906        98       207.732697       False

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.
result_pl = df_pl.with_columns([
    (pl.col("amount") * 1.1).alias("amount_with_tax"),
    (pl.col("amount") > 500).alias("high_value")
])
result_pl.head()

category       region   amount      quantity  amount_with_tax  high_value
str            str      f64         i64       f64              bool
"Food"         "South"  516.653322  40        568.318654       true
"Books"        "East"   937.337226  45        1031.070948      true
"Electronics"  "North"  450.941022  93        496.035124       false
"Food"         "East"   674.488081  46        741.936889       true
"Food"         "East"   188.847906  98        207.732697       false

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.
result_duckdb = duckdb.sql("""
    SELECT *,
        amount * 1.1 as amount_with_tax,
        amount > 500 as high_value
    FROM df_pd
""").df()
result_duckdb.head()

      category region      amount  quantity  amount_with_tax  high_value
0         Food  South  516.653322        40       568.318654        True
1        Books   East  937.337226        45      1031.070948        True
2  Electronics  North  450.941022        93       496.035124       False
3         Food   East  674.488081        46       741.936889        True
4         Food   East  188.847906        98       207.732697       False

Conditional Logic
pandas:
Requires np.where() for simple conditions or slow apply() for complex logic, which breaks method chaining.
import numpy as np

result_pd = df_pd.assign(
    value_tier=np.where(
        df_pd["amount"] > 700, "high",
        np.where(df_pd["amount"] > 300, "medium", "low")
    )
)
result_pd[["category", "amount", "value_tier"]].head()

      category      amount value_tier
0         Food  516.653322     medium
1        Books  937.337226       high
2  Electronics  450.941022     medium
3         Food  674.488081     medium
4         Food  188.847906        low

Polars:
The when().then().otherwise() chain is readable and integrates naturally with method chaining.
result_pl = df_pl.with_columns(
    pl.when(pl.col("amount") > 700).then(pl.lit("high"))
    .when(pl.col("amount") > 300).then(pl.lit("medium"))
    .otherwise(pl.lit("low"))
    .alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()

category       amount      value_tier
str            f64         str
"Food"         516.653322  "medium"
"Books"        937.337226  "high"
"Electronics"  450.941022  "medium"
"Food"         674.488081  "medium"
"Food"         188.847906  "low"

DuckDB:
Standard SQL CASE WHEN syntax is immediately readable for anyone who knows SQL.
result_duckdb = duckdb.sql("""
    SELECT category, amount,
        CASE
            WHEN amount > 700 THEN 'high'
            WHEN amount > 300 THEN 'medium'
            ELSE 'low'
        END as value_tier
    FROM df_pd
""").df()
result_duckdb.head()

      category      amount value_tier
0         Food  516.653322     medium
1        Books  937.337226       high
2  Electronics  450.941022     medium
3         Food  674.488081     medium
4         Food  188.847906        low

Window Functions
pandas:
Uses groupby().transform() which requires repeating the groupby clause for each calculation.
result_pd = df_pd.assign(
    category_avg=df_pd.groupby("category")["amount"].transform("mean"),
    category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()

      category      amount  category_avg  category_rank
0         Food  516.653322    499.916417       604342.0
1        Books  937.337226    499.998897        78423.0
2  Electronics  450.941022    499.938189       685881.0
3         Food  674.488081    499.916417       407088.0
4         Food  188.847906    499.916417      1015211.0

Polars:
The over() expression appends the partition to any calculation, avoiding repeated group definitions.
result_pl = df_pl.with_columns([
    pl.col("amount").mean().over("category").alias("category_avg"),
    pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()

category       amount      category_avg  category_rank
str            f64         f64           f64
"Food"         516.653322  499.916417    604342.0
"Books"        937.337226  499.998897    78423.0
"Electronics"  450.941022  499.938189    685881.0
"Food"         674.488081  499.916417    407088.0
"Food"         188.847906  499.916417    1015211.0

DuckDB:
SQL window functions with OVER (PARTITION BY …) are the industry standard for this type of calculation.
result_duckdb = duckdb.sql("""
    SELECT category, amount,
        AVG(amount) OVER (PARTITION BY category) as category_avg,
        RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
    FROM df_pd
""").df()
result_duckdb.head()

   category      amount  category_avg  category_rank
0  Clothing  513.807166    500.139837         608257
1  Clothing  513.806596    500.139837         608258
2  Clothing  513.806515    500.139837         608259
3  Clothing  513.806063    500.139837         608260
4  Clothing  513.806056    500.139837         608261

Data Loading Performance
pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.
pandas
Single-threaded CSV parsing loads data sequentially.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Chunk 1 → Chunk 2 → Chunk 3 → … → End │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

pandas_time = %timeit -o pd.read_csv("sales_data.csv")

1.05 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars
Multi-threaded parsing distributes file reading across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

polars_time = %timeit -o pl.read_csv("sales_data.csv")

137 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB
Similar to Polars, file reading is distributed across all available cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()

762 ms ± 77.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(f"Polars is {pandas_time.average / polars_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}× faster than pandas")

Polars is 7.7× faster than pandas
DuckDB is 1.4× faster than pandas

While Polars leads with a 7.7× speedup in CSV reading, DuckDB’s 1.4× improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.
Query Optimization
pandas: No Optimization
pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Load ALL rows → 10M rows in memory │
│ Step 2: Filter (amount > 100) → 5M rows in memory │
│ Step 3: GroupBy → New DataFrame │
│ Step 4: Mean → Final result │
└─────────────────────────────────────────────────────────────┘
Memory: ████████████████████████████████ (high – stores all intermediates)

def pandas_query():
    return (
        pd.read_csv("sales_data.csv")
        .query('amount > 100')
        .groupby('category')['amount']
        .mean()
    )

pandas_opt_time = %timeit -o pandas_query()

1.46 s ± 88.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

Full CSV load: All rows are read before filtering
No predicate pushdown: Rows are filtered after loading the entire file into memory
No projection pushdown: All columns are loaded, even unused ones
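You can approximate both pushdowns by hand in pandas — reading only the needed columns and filtering chunk by chunk — though execution stays single-threaded. A sketch using a small hypothetical stand-in file (`sales_data_demo.csv` and its extra columns are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for sales_data.csv
pd.DataFrame({
    "category": ["A", "B", "A", "C"],
    "amount": [50.0, 150.0, 300.0, 90.0],
    "unused1": [1, 2, 3, 4],
    "unused2": ["x", "y", "z", "w"],
}).to_csv("sales_data_demo.csv", index=False)

# Manual "projection pushdown": load only the two needed columns;
# manual "predicate pushdown": filter each chunk before keeping it
chunks = [
    chunk.query("amount > 100")
    for chunk in pd.read_csv(
        "sales_data_demo.csv",
        usecols=["category", "amount"],
        chunksize=2,
    )
]
result = pd.concat(chunks).groupby("category")["amount"].mean()
print(result)
```

This keeps peak memory closer to one chunk plus the accumulated matches, rather than the full file.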

Polars: Lazy Evaluation
Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ scan_csv → filter → group_by → agg │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (filter during scan) │
│ • Projection pushdown (read only needed columns) │
│ • Multi-threaded execution (parallel across CPU cores) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – no intermediate DataFrames)

query_pl = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").mean().alias("avg_amount"))
)

# View the optimized query plan
print(query_pl.explain())

AGGREGATE[maintain_order: false]
    [col("amount").mean().alias("avg_amount")] BY [col("category")]
    FROM
    Csv SCAN [sales_data.csv] [id: 4687118704]
    PROJECT 2/4 COLUMNS
    SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

Predicate pushdown: SELECTION filters during scan, not after loading
Projection pushdown: PROJECT 2/4 COLUMNS reads only what’s needed
Operation reordering: Aggregate runs on filtered data, not the full dataset

Execute the optimized query:
def polars_query():
    return (
        pl.scan_csv("sales_data.csv")
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").mean().alias("avg_amount"))
        .collect()
    )

polars_opt_time = %timeit -o polars_query()

148 ms ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer
DuckDB’s SQL optimizer applies similar optimizations automatically:
┌─────────────────────────────────────────────────────────────┐
│ Query Plan Built: │
│ SQL → Parser → Optimizer → Execution Plan │
│ │
│ Optimizations Applied: │
│ • Predicate pushdown (WHERE during scan) │
│ • Projection pushdown (SELECT only needed columns) │
│ • Vectorized execution (process 1024 rows per batch) │
└─────────────────────────────────────────────────────────────┘
Memory: ████████ (low – streaming execution)

def duckdb_query():
    return duckdb.sql("""
        SELECT category, AVG(amount) as avg_amount
        FROM 'sales_data.csv'
        WHERE amount > 100
        GROUP BY category
    """).df()

duckdb_opt_time = %timeit -o duckdb_query()

245 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:
print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}× faster than pandas")

Polars is 9.9× faster than pandas
DuckDB is 6.0× faster than pandas

Polars outperforms DuckDB (9.9× vs 6.0×) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.
GroupBy Performance
Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.
Let’s load the data for the groupby benchmarks:
# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded
pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Group A → Group B → Group C → Group D → … → Aggregate │ │
│ └─────────────────────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────────────────────┘

def pandas_groupby():
    return df_pd.groupby("category")["amount"].mean()

pandas_groupby_time = %timeit -o pandas_groupby()

271 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded
Polars splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def polars_groupby():
    return df_pl.group_by("category").agg(pl.col("amount").mean())

polars_groupby_time = %timeit -o polars_groupby()

31.1 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB splits data across cores, computes partial aggregates in parallel, then merges the results.
┌─────────────────────────────────────────────────────────────┐
│ CPU Core 1 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 2 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 3 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ CPU Core 4 ┌──────────────┐ │
│ │ ████████████ │ → Partial Aggregate │
│ ↓ │
│ Final Merge → Result │
└─────────────────────────────────────────────────────────────┘

def duckdb_groupby():
    return duckdb.sql("""
        SELECT category, AVG(amount)
        FROM df_pd
        GROUP BY category
    """).df()

duckdb_groupby_time = %timeit -o duckdb_groupby()

29 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}× faster than pandas")

Polars is 8.7× faster than pandas
DuckDB is 9.4× faster than pandas

DuckDB and Polars perform similarly (9.4× vs 8.7×), both leveraging parallel execution. DuckDB’s slight edge comes from late materialization and vector-at-a-time pipelined execution, which avoids creating intermediate results that Polars may still materialize for some operations.
Memory Efficiency
pandas: Full Memory Load
pandas loads the entire dataset into RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████│ │
│ │██████████████████ ALL 10M ROWS ████████████████████████│ │
│ │████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 707,495 KB (entire dataset in memory) │
└─────────────────────────────────────────────────────────────┘

df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")

pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas throws an out-of-memory error.
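Before the dataset outgrows RAM entirely, you can often shrink pandas' footprint by downcasting dtypes. A sketch with hypothetical data (column names and sizes are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: repeated string labels stored as Python objects
df = pd.DataFrame({
    "category": np.random.choice(["Books", "Food"], size=100_000),
    "amount": np.random.rand(100_000) * 1000,
})

before = df.memory_usage(deep=True).sum()

# object strings -> categorical codes, float64 -> smallest safe float
df["category"] = df["category"].astype("category")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(f"{before / after:.1f}x smaller")
```

This buys headroom, but it doesn't change pandas' fundamental full-load model — which is where the streaming approaches below come in.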
Polars: Streaming Mode
Polars can process data in streaming mode, handling chunks without loading everything:
┌─────────────────────────────────────────────────────────────┐
│ RAM │
│ ┌────────────────────────────────────────────────────────┐ │
│ │█ │ │
│ │ (result only) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
│ Usage: 0.06 KB (streams chunks, keeps only result) │
└─────────────────────────────────────────────────────────────┘

result_pl_stream = (
    pl.scan_csv("sales_data.csv")
    .group_by("category")
    .agg(pl.col("amount").mean())
    .collect(streaming=True)
)

polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")

Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset in memory:
(
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 500)
    .sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk
DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:
┌─────────────────────────────────────────────────────────────┐
│ RAM Disk (if needed) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │█ │ │░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ (up to 500MB) │ → │ (overflow here) │ │
│ │ │ │ │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ Usage: 0.42 KB (spills to disk when RAM full) │
└─────────────────────────────────────────────────────────────┘

# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")

# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
    SELECT category, AVG(amount) as avg_amount
    FROM 'sales_data.csv'
    GROUP BY category
""").df()

duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")

DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory is limited.
print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}× less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}× less memory than pandas")

pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)

Polars uses 11,791,583× less memory than pandas
DuckDB uses 1,684,512× less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.
Join Operations
Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.
Let’s create two tables for join benchmarking:
import numpy as np

# Create orders table (1M rows)
orders_pd = pd.DataFrame({
    "order_id": range(1_000_000),
    "customer_id": np.random.randint(1, 100_000, size=1_000_000),
    "amount": np.random.rand(1_000_000) * 500
})

# Create customers table (100K rows)
customers_pd = pd.DataFrame({
    "customer_id": range(100_000),
    "region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})

# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded
pandas processes the join on a single CPU core.
┌─────────────────────────────────────────────┐
│ CPU Core 1 │
│ ┌─────────────────────────────────────────┐ │
│ │ Row 1 → Row 2 → Row 3 → … → Row 1M │ │
│ └─────────────────────────────────────────┘ │
│ CPU Core 2 [idle] │
│ CPU Core 3 [idle] │
│ CPU Core 4 [idle] │
└─────────────────────────────────────────────┘

def pandas_join():
    return orders_pd.merge(customers_pd, on="customer_id", how="left")

pandas_join_time = %timeit -o pandas_join()

60.4 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded
Polars distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def polars_join():
    return orders_pl.join(customers_pl, on="customer_id", how="left")

polars_join_time = %timeit -o polars_join()

11.8 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded
Similar to Polars, DuckDB distributes the join across all available CPU cores.
┌─────────────────────────────────────────────┐
│ CPU Core 1 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 2 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 3 ┌────────────────┐ │
│ │ ████████████ │ │
│ CPU Core 4 ┌────────────────┐ │
│ │ ████████████ │ │
└─────────────────────────────────────────────┘

def duckdb_join():
    return duckdb.sql("""
        SELECT o.*, c.region
        FROM orders_pd o
        LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
    """).df()

duckdb_join_time = %timeit -o duckdb_join()

55.7 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of the joins:
print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}× faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}× faster than pandas")

Polars is 5.1× faster than pandas
DuckDB is 1.1× faster than pandas

Polars delivers a 5.1× speedup while DuckDB shows only 1.1× improvement. Both tools use multi-threading, but Polars’ join algorithm and native DataFrame output avoid the conversion overhead that DuckDB incurs when returning results via .df().
Interoperability
All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.
pandas DataFrame to DuckDB
Query pandas DataFrames directly with SQL:
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "sales": [100, 200, 150]
})

# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)

product sales
0 B 200
1 C 150

Polars to pandas
Convert Polars DataFrames when ML libraries require pandas:
df_polars = pl.DataFrame({
    "feature1": [1, 2, 3],
    "feature2": [4, 5, 6],
    "target": [0, 1, 0]
})

# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))

<class 'pandas.core.frame.DataFrame'>

DuckDB to Polars
Get query results as Polars DataFrames:
result = duckdb.sql("""
    SELECT category, SUM(amount) as total
    FROM 'sales_data.csv'
    GROUP BY category
""").pl()

print(type(result))
print(result)

<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
┌─────────────┬──────────┐
│ category    ┆ total    │
│ ---         ┆ ---      │
│ str         ┆ f64      │
╞═════════════╪══════════╡
│ Electronics ┆ 6.2445e8 │
│ Food        ┆ 6.2540e8 │
│ Clothing    ┆ 6.2539e8 │
│ Books       ┆ 6.2475e8 │
└─────────────┴──────────┘

Combined Pipeline Example
Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage all three:
# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
SELECT category, region,
SUM(amount) as total_amount,
COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY category, region
""").pl()

# Step 2: Polars for additional transformations
enriched = (
aggregated
.with_columns([
(pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
pl.col("category").str.to_uppercase().alias("category_upper")
])
.filter(pl.col("order_count") > 100000)
)

# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())

category region total_amount order_count avg_order_value category_upper
0 Food East 1.563586e+08 312918 499.679004 FOOD
1 Food North 1.563859e+08 312637 500.215456 FOOD
2 Clothing North 1.560532e+08 311891 500.345286 CLOTHING
3 Clothing East 1.565054e+08 312832 500.285907 CLOTHING
4 Food West 1.560994e+08 312662 499.259318 FOOD

📖 Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix
No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.
Performance Summary
Benchmark results from 10 million rows on a single machine:

Operation           | pandas | Polars              | DuckDB
CSV Read (10M rows) | 1.05s  | 137ms               | 762ms
GroupBy             | 271ms  | 31ms                | 29ms
Join (1M rows)      | 60ms   | 12ms                | 56ms
Memory Usage        | 707 MB | 0.06 KB (streaming) | 0.42 KB (spill-to-disk)

Polars leads in CSV reading (7.7× faster than pandas) and joins (5× faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.
Feature Comparison
Each tool makes different trade-offs between speed, memory, and ecosystem integration:

Feature            | pandas    | Polars    | DuckDB
Multi-threading    | No        | Yes       | Yes
Lazy evaluation    | No        | Yes       | N/A (SQL)
Query optimization | No        | Yes       | Yes
Larger-than-RAM    | No        | Streaming | Spill-to-disk
SQL interface      | No        | Limited   | Native
ML integration     | Excellent | Good      | Limited

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQL syntax.
Recommendations
The best tool depends on your data size, workflow preferences, and constraints:

Small data (<1M rows): Use pandas for simplicity
Large data (1M-100M rows): Use Polars or DuckDB for 5-10× speedup
SQL-preferred workflow: Use DuckDB
DataFrame-preferred workflow: Use Polars
Memory-constrained: Use Polars (streaming) or DuckDB (spill-to-disk)
ML pipeline integration: Use pandas (convert from Polars/DuckDB as needed)
Production data pipelines: Use Polars (DataFrame) or DuckDB (SQL) based on team preference

Final Thoughts
If your codebase is built on pandas, you don't need to rewrite everything. Migrate where it matters:

Profile first: Find which pandas operations are slow
Replace with Polars: CSV reads, groupbys, and joins see the biggest gains
Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for final ML steps. Convert with df.to_pandas() when needed.
Related Resources

Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames
A Deep Dive into DuckDB for Data Scientists
Scaling Pandas Workflows with PySpark’s Pandas API
Delta Lake: Transform pandas Prototypes into Production

pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Coiled: Scale Python Data Pipeline to the Cloud in Minutes

Table of Contents

Introduction
What is Coiled?
Setup
Serverless Functions: Process Data with Any Framework
When You Need More: Distributed Clusters with Dask
Cost Optimization
Environment Synchronization
Conclusion

Introduction
A common challenge data scientists face is that local machines simply can’t handle large-scale datasets. Once your analysis reaches the 50GB+ range, you’re pushed into difficult choices:

Sample your data and hope patterns hold
Buy more RAM or rent expensive cloud VMs
Learn Kubernetes and spend days configuring clusters
Build Docker images and manage container registries

Each option adds complexity, cost, or compromises your analysis quality.
Coiled eliminates these tradeoffs. It provisions ephemeral compute clusters on AWS, GCP, or Azure using simple Python APIs. You get distributed computing power without DevOps expertise, automatic environment synchronization without Docker, and 70% cost savings through smart spot instance management.
In this article, you’ll learn how to scale Python data workflows to the cloud with Coiled:

Serverless functions: Run pandas, Polars, or DuckDB code on cloud VMs with a simple decorator
Parallel processing: Process multiple files simultaneously across cloud machines
Distributed clusters: Aggregate data across files using managed Dask clusters
Environment sync: Replicate your local packages to the cloud without Docker
Cost optimization: Reduce cloud spending with spot instances and auto-scaling

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What is Coiled?
Coiled is a lightweight cloud platform that runs Python code on powerful cloud infrastructure without requiring Docker or Kubernetes knowledge. It supports four main capabilities:

Batch Jobs: Submit and run Python scripts asynchronously on cloud infrastructure
Serverless Functions: Execute Python functions (Pandas, Polars, PyTorch) on cloud VMs with decorators
Dask Clusters: Provision multi-worker clusters for distributed computing
Jupyter Notebooks: Launch interactive Jupyter servers directly on cluster schedulers

Key features across these capabilities:

Framework-Agnostic: Works with Pandas, Polars, Dask, or any Python library
Automatic Package Sync: Local packages replicate to cloud workers without Docker
Cost Optimization: Spot instances, adaptive scaling, and auto-shutdown reduce spending
Simple APIs: Decorate functions or create clusters with 2-3 lines of code

To install Coiled and Dask, run:
pip install coiled dask[complete]

Setup
First, create a free Coiled account by running this command in your terminal:
coiled login

This creates a free Coiled Hosted account with 200 CPU-hours per month. Your code runs on Coiled’s cloud infrastructure with no AWS/GCP/Azure account needed.
Note: For production workloads with your own cloud account, run coiled setup PROVIDER (setup guide).
Serverless Functions: Process Data with Any Framework
The simplest way to scale Python code to the cloud is with serverless functions. Decorate any function with @coiled.function, and Coiled handles provisioning cloud VMs, installing packages, and executing your code.
Scale Beyond Laptop Memory with Cloud VMs
Imagine you need to process the NYC Taxi dataset — 12GB of compressed files (50GB+ once expanded) — on a laptop with only 16GB of RAM. Your machine simply doesn’t have enough memory for this workload.
With Coiled, you can run the exact same code on a cloud VM with 64GB of RAM by simply adding the @coiled.function decorator.
import coiled
import pandas as pd

@coiled.function(
    memory="64 GiB",
    region="us-east-1"
)
def process_month_with_pandas(month):
    # Read 12GB file directly into pandas
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )

    # Compute tipping patterns by hour
    df["hour"] = df["tpep_pickup_datetime"].dt.hour
    result = df.groupby("hour")["tip_amount"].mean()

    return result

# Run on cloud VM with 64GB RAM
january_tips = process_month_with_pandas(1)
print(january_tips)

Output:
hour
0 3.326543
1 2.933899
2 2.768246
3 2.816333
4 3.132973

Name: tip_amount, dtype: float64

This function runs on a cloud VM with 64GB RAM, processes the entire month in memory, and returns just the aggregated result to your laptop.
You can view the function’s execution progress and resource usage in the Coiled dashboard.

Parallel Processing with .map()
By default, Coiled Functions run sequentially, just like normal Python functions. They can also run in parallel via the .map() method.
Process all 12 months in parallel using .map():
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_month(month):
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return df["tip_amount"].mean()

# Process 12 months in parallel on 12 cloud VMs
months = range(1, 13)
monthly_tips = list(process_month.map(months))

print("Average tips by month:", monthly_tips)

Output:
Average tips by month: [2.65, 2.58, 2.72, 2.68, 2.75, 2.81, 2.79, 2.73, 2.69, 2.71, 2.66, 2.62]

When you call .map() with 12 months, Coiled spins up 12 cloud VMs simultaneously, runs process_month() on each VM with a different month, then returns all results.
The execution flow:
VM 1: yellow_tripdata_2024-01.parquet → compute mean → 2.65
VM 2: yellow_tripdata_2024-02.parquet → compute mean → 2.58
VM 3: yellow_tripdata_2024-03.parquet → compute mean → 2.72
… (all running in parallel)
VM 12: yellow_tripdata_2024-12.parquet → compute mean → 2.62

Coiled collects: [2.65, 2.58, 2.72, …, 2.62]

Each VM works in complete isolation with no data sharing or coordination between them.
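Conceptually, .map() behaves like a cloud-side version of the standard library's executor map: each task is independent and results come back in input order. A local stand-in (the per-month function body here is a hypothetical placeholder, not real taxi data):

```python
from concurrent.futures import ThreadPoolExecutor

def process_month(month):
    # Stand-in for the per-VM work: each call is fully independent
    return round(2.5 + 0.01 * month, 2)

# Fan out 12 independent tasks, collect results in input order
with ThreadPoolExecutor(max_workers=12) as pool:
    monthly_tips = list(pool.map(process_month, range(1, 13)))

print(monthly_tips)
```

The difference with Coiled is only where the workers live: cloud VMs with their own memory instead of local threads.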

The dashboard confirms 12 tasks were executed, matching the 12 months we passed to .map().
Framework-Agnostic: Use Any Python Library
Coiled Functions aren’t limited to pandas. You can use any Python library (Polars, DuckDB, PyTorch, scikit-learn) without any additional configuration. The automatic package synchronization works for all dependencies.
Example with Polars:
Polars is a fast DataFrame library optimized for performance. It works seamlessly with Coiled:
import coiled
import polars as pl

@coiled.function(memory="64 GiB", region="us-east-1")
def process_with_polars(month):
    df = pl.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return (
        df
        .filter(pl.col("tip_amount") > 0)
        .group_by("PULocationID")
        .agg(pl.col("tip_amount").mean())
        .sort("tip_amount", descending=True)
        .head(5)
    )

result = process_with_polars(1)
print(result)

Output:
shape: (5, 2)
┌──────────────┬────────────┐
│ PULocationID ┆ tip_amount │
│ ---          ┆ ---        │
│ i64          ┆ f64        │
╞══════════════╪════════════╡
│ 138          ┆ 4.52       │
│ 230          ┆ 4.23       │
│ 161          ┆ 4.15       │
│ 234          ┆ 3.98       │
│ 162          ┆ 3.87       │
└──────────────┴────────────┘

Example with DuckDB:
DuckDB provides fast SQL analytics directly on Parquet files:
import coiled
import duckdb

@coiled.function(memory="64 GiB", region="us-east-1")
def query_with_duckdb(month):
    con = duckdb.connect()
    result = con.execute(f"""
        SELECT
            DATE_TRUNC('hour', tpep_pickup_datetime) as pickup_hour,
            AVG(tip_amount) as avg_tip,
            COUNT(*) as trip_count
        FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet'
        WHERE tip_amount > 0
        GROUP BY pickup_hour
        ORDER BY avg_tip DESC
        LIMIT 5
    """).fetchdf()
    return result

result = query_with_duckdb(1)
print(result)

Output:
pickup_hour avg_tip trip_count
0 2024-01-15 14:00:00 4.23 15234
1 2024-01-20 18:00:00 4.15 18456
2 2024-01-08 12:00:00 3.98 12789
3 2024-01-25 16:00:00 3.87 14567
4 2024-01-12 20:00:00 3.76 16234

Coiled automatically detects your local Polars and DuckDB installations and replicates them to cloud VMs. No manual configuration needed.
When You Need More: Distributed Clusters with Dask
Serverless functions work great for independent file processing. However, when you need to combine and aggregate data across all your files into a single result, you need a Dask cluster.
For example, suppose you want to calculate total revenue by pickup location across all 12 months of data. With Coiled Functions, each VM processes one month independently:
@coiled.function(memory="64 GiB", region="us-east-1")
def get_monthly_revenue_by_location(month):
    df = pd.read_parquet(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    )
    return df.groupby("PULocationID")["total_amount"].sum()

# This returns 12 separate DataFrames, one per month
results = list(get_monthly_revenue_by_location.map(range(1, 13)))
print(f'Number of DataFrames: {len(results)}')

Output:
Number of DataFrames: 12

The problem is that you get 12 separate DataFrames that you need to manually combine.
Here’s what happens: VM 1 processes January and returns a DataFrame like:
PULocationID  total_amount
138           15000
230           22000

VM 2 processes February and returns:
PULocationID  total_amount
138           18000
230           19000

Each VM works independently and has no knowledge of the other months’ data. To get yearly totals per location, you’d need to write code to merge these 12 DataFrames and sum the revenue for each location.
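For completeness, merging those per-month results by hand is a one-liner in pandas, though at scale Dask performs the equivalent shuffle for you. A sketch with two hypothetical monthly Series (the location IDs and totals are invented):

```python
import pandas as pd

# Two hypothetical monthly results keyed by PULocationID
jan = pd.Series({138: 15000, 230: 22000}, name="total_amount")
feb = pd.Series({138: 18000, 230: 19000}, name="total_amount")

# Concatenate and sum per location to get combined totals
yearly = pd.concat([jan, feb]).groupby(level=0).sum()
print(yearly)
```

This works for small result sets, but it runs on your laptop and requires all partial results to fit in local memory — exactly the constraint a Dask cluster removes.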
With a Dask cluster, workers coordinate to give you one global result:
import coiled
import dask.dataframe as dd

# For production workloads, you can scale to 50+ workers
cluster = coiled.Cluster(n_workers=3, region="us-east-1")
client = cluster.get_client()  # Attach Dask to the cluster's scheduler

# Read all 12 months of 2024 data
files = [
    f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-{month:02d}.parquet"
    for month in range(1, 13)
]
df = dd.read_parquet(files)  # Lazy: builds a plan, doesn't load data yet

# This returns ONE DataFrame with total revenue per location across all months
total_revenue = (
    df.groupby("PULocationID")["total_amount"].sum().compute()
)  # Executes the plan

total_revenue.head()

Output:
PULocationID
1 563645.70
2 3585.80
3 67261.41
4 1687265.08
5 602.98
Name: total_amount, dtype: float64

You can see that we got a single DataFrame with the total revenue per location across all months.
Here is what happens under the hood:
When you call .compute(), Dask executes the plan in four steps:
Step 1: Data Distribution
├─ Worker 1: [Jan partitions 1-3, Apr partitions 1-2, Jul partitions 1-3]
├─ Worker 2: [Feb partitions 1-4, May partitions 1-3, Aug partitions 1-2]
└─ Worker 3: [Mar partitions 1-3, Jun partitions 1-2, Sep-Dec partitions]

Step 2: Local Aggregation (each worker groups its data)
├─ Worker 1: {location_138: $45,000, location_230: $63,000}
├─ Worker 2: {location_138: $38,000, location_230: $55,000}
└─ Worker 3: {location_138: $50,000, location_230: $62,000}

Step 3: Shuffle (redistribute so each location lives on one worker)
├─ Worker 1: All location_230 data → $63,000 + $55,000 + $62,000
├─ Worker 2: All location_138 data → $45,000 + $38,000 + $50,000
└─ Worker 3: All other locations…

Step 4: Final Result
location_138: $133,000 (yearly total)
location_230: $180,000 (yearly total)

This shuffle-and-combine process is what makes Dask different from Coiled Functions. Workers actively coordinate and share data to produce one unified result.
Cost Optimization
Cloud costs can spiral quickly. Coiled provides three mechanisms to reduce spending:
1. Spot Instances
You can reduce cloud costs by 60-90% using spot instances. These are discounted servers that cloud providers can reclaim when demand increases. When an interruption occurs, Coiled:

Gracefully shuts down the affected worker
Redistributes its work to healthy workers
Automatically launches a replacement worker

cluster = coiled.Cluster(
    n_workers=50,
    spot_policy="spot_with_fallback",  # Use spot instances with on-demand backup
    region="us-east-1"
)

Cost comparison for m5.xlarge instances:

On-demand: $0.192/hour
Spot: $0.05/hour
Savings: 74%

For a 100-worker cluster:

On-demand: $19.20/hour = $460/day
Spot: $5.00/hour = $120/day

2. Adaptive Scaling
Adaptive scaling automatically adds workers when you have more work and removes them when idle, so you only pay for what you need. Coiled enables this with the adapt() method:
cluster = coiled.Cluster(region="us-east-1")
cluster.adapt(minimum=10, maximum=50) # Scale between 10-50 workers

Serverless functions also support auto-scaling by specifying a worker range:
@coiled.function(n_workers=[10, 300])
def process_data(files):
    return results

This saves money during light workloads while delivering performance during heavy computation. No manual monitoring required.
3. Automatic Shutdown
To prevent paying for unused resources, Coiled automatically shuts down clusters after 20 minutes of inactivity by default. You can customize this with the idle_timeout parameter:
cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-1",
    idle_timeout="1 hour"  # Keep cluster alive for longer workloads
)

This prevents the common mistake of leaving clusters running overnight.
Environment Synchronization
The “Works on My Machine” Problem When Scaling to Cloud
Imagine this scenario: your pandas code works locally but fails on a cloud VM because the environment has different package versions or missing dependencies.
Docker solves this by packaging your environment into a container that runs identically on your laptop and cloud VMs. However, getting it running on cloud infrastructure involves a complex workflow:

Write a Dockerfile listing all dependencies and versions
Build the Docker image (wait 5-10 minutes)
Push to cloud container registry (AWS ECR, Google Container Registry)
Configure cloud VMs (EC2/GCE instances with proper networking and security)
Pull and run the image on cloud machines (3-5 minutes per VM)
Rebuild and redeploy every time you add a package (repeat steps 2-5)

This Docker + cloud workflow slows down development and requires expertise in both containerization and cloud infrastructure management.
Coiled’s Solution: Automatic Package Synchronization
Coiled eliminates Docker entirely through automatic package synchronization. Your local environment replicates to cloud workers with no Dockerfile required.
Instead of managing Docker images and cloud infrastructure, you simply add a decorator to your function:
import coiled
import pandas as pd

@coiled.function(memory="64 GiB", region="us-east-1")
def process_data():
    df = pd.read_parquet("s3://my-bucket/data.parquet")
    # Your analysis code here
    return df.describe()

result = process_data() # Runs on cloud VM with your exact package versions

What Coiled does automatically:

Scans your local environment (pip, conda packages with exact versions)
Creates a dependency manifest (a list of all packages and their versions)
Installs packages on cloud workers with matching versions
Reuses built environments when your dependencies haven’t changed
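As a rough illustration of the first two steps (this is not Coiled's actual implementation), the standard library alone can snapshot an environment and derive a cache key from it, which captures why unchanged dependencies allow a built environment to be reused:

```python
import hashlib
import json
from importlib.metadata import distributions

def build_manifest() -> dict:
    """Snapshot the local environment: package name -> exact installed version."""
    return {dist.metadata["Name"]: dist.version for dist in distributions()}

def environment_key(manifest: dict) -> str:
    """Stable hash of the manifest. Unchanged dependencies produce the same key,
    which is what makes reusing a previously built environment possible."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

manifest = build_manifest()
print(environment_key(manifest)[:12])  # short fingerprint of this environment
```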

This is faster than Docker builds in most cases thanks to intelligent caching, and requires zero configuration.
Conclusion
Coiled transforms cloud computing from a multi-day DevOps project into simple Python operations. Whether you’re processing a single large file with Pandas, querying multiple files with Polars, or running distributed aggregations with Dask clusters, Coiled provides the right scaling approach for your needs.
I recommend starting with serverless functions for single-file processing, then scaling to Dask clusters when you need truly distributed computing. Coiled removes the infrastructure burden from data science workflows, letting you focus on analysis instead of operations.
Related Tutorials

Polars vs Pandas: Performance Benchmarks and When to Switch – Choose the right DataFrame library for your workloads
DuckDB: Fast SQL Analytics on Parquet Files – Master SQL techniques for processing data with DuckDB
PySpark Pandas API: Familiar Syntax at Scale – Explore Spark as an alternative to Dask for distributed processing

What’s New in PySpark 4.0: Arrow UDFs, Native Visualization, and Dynamic UDTFs

Table of Contents

Introduction
From Pandas UDFs to Arrow UDFs: Next-Gen Performance
Native Data Visualization (PySpark 4.0+)
Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Conclusion

Introduction
PySpark 4.0 introduces transformative improvements that enhance performance, streamline workflows, and enable flexible data transformations in distributed processing.
This release delivers three key enhancements:
Arrow-optimized UDFs accelerate custom transformations by operating directly on Arrow data structures, eliminating the serialization overhead of Pandas UDFs.
Native Plotly visualization enables direct DataFrame plotting without conversion, streamlining exploratory data analysis and reducing memory overhead.
Dynamic schema UDTFs adapt output columns to match input data at runtime, enabling flexible pivot tables and aggregations where column structure depends on data values.
For comprehensive coverage of core PySpark SQL functionality, see the Complete Guide to PySpark SQL.
From Pandas UDFs to Arrow UDFs: Next-Gen Performance
The pandas_udf function requires converting Arrow data to Pandas format and back again for each operation. This serialization cost becomes significant when processing large datasets.
PySpark 3.5+ introduces Arrow-optimized UDFs via the useArrow=True parameter, which operates directly on Arrow data structures, avoiding the Pandas conversion entirely and improving performance.
Let’s compare the performance with a weighted sum calculation across multiple columns on 100,000 rows:
import pandas as pd
import pyarrow.compute as pc
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFComparison").getOrCreate()

# Create test data with multiple numeric columns
data = [(float(i), float(i*2), float(i*3)) for i in range(100000)]
df = spark.createDataFrame(data, ["val1", "val2", "val3"])

Create a timing decorator to measure the execution time of the functions:
import time
from functools import wraps

# Timing decorator
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"{func.__name__}: {elapsed:.2f}s")
        wrapper.elapsed_time = elapsed
        return result

    return wrapper

Use the timing decorator to measure the execution time of the pandas_udf function:
@pandas_udf(DoubleType())
def weighted_sum_pandas(v1: pd.Series, v2: pd.Series, v3: pd.Series) -> pd.Series:
    return v1 * 0.5 + v2 * 0.3 + v3 * 0.2

@timer
def run_pandas_udf():
    result = df.select(
        weighted_sum_pandas(df.val1, df.val2, df.val3).alias("weighted")
    )
    result.count()  # Trigger computation
    return result

result_pandas = run_pandas_udf()
pandas_time = run_pandas_udf.elapsed_time

run_pandas_udf: 1.33s

Use the timing decorator to measure the execution time of the Arrow-optimized UDF using useArrow:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType(), useArrow=True)
def weighted_sum_arrow(v1, v2, v3):
    term1 = pc.multiply(v1, 0.5)
    term2 = pc.multiply(v2, 0.3)
    term3 = pc.multiply(v3, 0.2)
    return pc.add(pc.add(term1, term2), term3)

@timer
def run_arrow_udf():
    result = df.select(
        weighted_sum_arrow(df.val1, df.val2, df.val3).alias("weighted")
    )
    result.count()  # Trigger computation
    return result

result_arrow = run_arrow_udf()
arrow_time = run_arrow_udf.elapsed_time

run_arrow_udf: 0.43s

Measure the speedup:
speedup = pandas_time / arrow_time
print(f"Speedup: {speedup:.2f}x faster")

Speedup: 3.06x faster

The output shows that the Arrow-optimized version is 3.06x faster than the pandas_udf version!
The performance gain comes from avoiding serialization. Arrow-optimized UDFs use PyArrow compute functions like pc.multiply() and pc.add() directly on Arrow data, while pandas_udf must convert each column to Pandas and back.
Trade-off: The 3.06x performance improvement comes at the cost of using PyArrow’s less familiar compute API instead of Pandas operations. However, this becomes increasingly valuable as dataset size and column count grow.
Native Data Visualization (PySpark 4.0+)
Visualizing PySpark DataFrames traditionally requires converting to Pandas first, then using external libraries like matplotlib or plotly. This adds memory overhead and extra processing steps.
PySpark 4.0 introduces a native plotting API powered by Plotly, enabling direct visualization from PySpark DataFrames without any conversion.
Let’s visualize sales data across product categories:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Visualization").getOrCreate()

# Create sample sales data
sales_data = [
    ("Electronics", 5000, 1200),
    ("Electronics", 7000, 1800),
    ("Clothing", 3000, 800),
    ("Clothing", 4500, 1100),
    ("Furniture", 6000, 1500),
    ("Furniture", 8000, 2000),
]

sales_df = spark.createDataFrame(sales_data, ["category", "sales", "profit"])
sales_df.show()

+-----------+-----+------+
|   category|sales|profit|
+-----------+-----+------+
|Electronics| 5000|  1200|
|Electronics| 7000|  1800|
|   Clothing| 3000|   800|
|   Clothing| 4500|  1100|
|  Furniture| 6000|  1500|
|  Furniture| 8000|  2000|
+-----------+-----+------+

Create a scatter plot directly from the PySpark DataFrame using the .plot() method:
# Direct plotting without conversion
sales_df.plot(kind="scatter", x="sales", y="profit", color="category")

You can also use shorthand methods such as plot.scatter() and plot.bar() for specific chart types:
# Scatter plot with shorthand
sales_df.plot.scatter(x="sales", y="profit", color="category")

# Bar chart by category
category_totals = (
    sales_df.groupBy("category")
    .agg({"sales": "sum"})
    .withColumnRenamed("sum(sales)", "total_sales")
)
category_totals.plot.bar(x="category", y="total_sales")

The native plotting API supports 8 chart types:
- scatter: Scatter plots with color grouping
- bar: Bar charts for categorical comparisons
- line: Line plots for time series
- area: Area charts for cumulative values
- pie: Pie charts for proportions
- box: Box plots for distributions
- histogram: Histograms for frequency analysis
- kde/density: Density plots for probability distributions
By default, PySpark visualizes up to 1,000 rows. For larger datasets, configure the limit:
# Increase visualization row limit
spark.conf.set("spark.sql.pyspark.plotting.max_rows", 5000)

Dynamic Schema Generation with UDTF analyze() (PySpark 4.0+)
Python UDTFs (User-Defined Table Functions) generate multiple rows from a single input row, but they come with a critical limitation: you must define the output schema upfront. When your output columns depend on the input data itself (like creating pivot tables or dynamic aggregations where column names come from data values), this rigid schema requirement becomes a problem.
For example, a word-counting UDTF requires you to specify all output columns upfront, even though the words themselves are unknown until runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, StructField, IntegerType

# Schema must be defined upfront with fixed column names
@udtf(returnType=StructType([
    StructField("hello", IntegerType()),
    StructField("world", IntegerType()),
    StructField("spark", IntegerType())
]))
class StaticWordCountUDTF:
    def eval(self, text: str):
        words = text.split(" ")
        yield tuple(words.count(word) for word in ["hello", "world", "spark"])

# Only works for exactly these three words
result = StaticWordCountUDTF(lit("hello world hello spark"))
result.show()

+-----+-----+-----+
|hello|world|spark|
+-----+-----+-----+
|    2|    1|    1|
+-----+-----+-----+

If the input text contains a different set of words, the output won’t contain the count of the new words.
result = StaticWordCountUDTF(lit("hi world hello spark"))
result.show()

+-----+-----+-----+
|hello|world|spark|
+-----+-----+-----+
|    1|    1|    1|
+-----+-----+-----+

PySpark 4.0 introduces the analyze() method for UDTFs, enabling dynamic schema determination based on input data. Instead of hardcoding your output schema, analyze() inspects the input and generates the appropriate columns at runtime.
from pyspark.sql.functions import udtf, lit
from pyspark.sql.types import StructType, IntegerType
from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

@udtf
class DynamicWordCountUDTF:
    @staticmethod
    def analyze(text: AnalyzeArgument) -> AnalyzeResult:
        """Dynamically create schema based on input text"""
        schema = StructType()
        # Create one column per unique word in the input
        for word in sorted(set(text.value.split(" "))):
            schema = schema.add(word, IntegerType())
        return AnalyzeResult(schema=schema)

    def eval(self, text: str):
        """Generate counts for each word"""
        words = text.split(" ")
        # Use same logic as analyze() to determine column order
        unique_words = sorted(set(words))
        yield tuple(words.count(word) for word in unique_words)

# Schema adapts to any input text
result = DynamicWordCountUDTF(lit("hello world hello spark"))
result.show()

+-----+-----+-----+
|hello|spark|world|
+-----+-----+-----+
|    2|    1|    1|
+-----+-----+-----+

Now try with completely different words:
# Different words – schema adapts automatically
result2 = DynamicWordCountUDTF(lit("python data science"))
result2.show()

+----+------+-------+
|data|python|science|
+----+------+-------+
|   1|     1|      1|
+----+------+-------+

The columns change from hello, spark, world to data, python, science without any code modifications.
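The one requirement is that analyze() and eval() derive the column order the same way; otherwise counts land under the wrong headers. That shared logic is plain Python and can be checked outside Spark (the helper names here are just for illustration):

```python
def word_columns(text: str) -> list:
    """Column order shared by analyze() and eval(): sorted unique words."""
    return sorted(set(text.split(" ")))

def word_counts(text: str) -> tuple:
    """One count per column, in the same order as word_columns()."""
    words = text.split(" ")
    return tuple(words.count(word) for word in word_columns(text))

print(word_columns("hello world hello spark"))  # ['hello', 'spark', 'world']
print(word_counts("hello world hello spark"))   # (2, 1, 1)
print(word_columns("python data science"))      # ['data', 'python', 'science']
```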
Conclusion
PySpark 4.0 makes distributed computing faster and easier to use. Arrow-optimized UDFs speed up custom transformations, native visualization removes conversion steps, and dynamic UDTFs handle flexible data structures.
These improvements address real bottlenecks without requiring major code changes, making PySpark more practical for everyday data engineering tasks.
Manim: Create Mathematical Animations Like 3Blue1Brown Using Python

Table of Contents

Motivation
What is Manim?
Create a Blue Square that Grows from the Center
Turn a Square into a Circle
Customize Manim
Write Mathematical Equations with a Moving Frame
Moving and Zooming Camera
Graph
Move Objects Together
Trace Path
Recap

Motivation
Have you ever struggled with math concepts in a machine learning algorithm and turned to 3Blue1Brown as a learning resource? 3Blue1Brown is a popular math YouTube channel created by Grant Sanderson, known for exceptional explanations and stunning animations.
What if you could create similar animations to explain data science concepts to your teammates, managers, or followers?
Grant developed a Python package called Manim that enables you to create mathematical animations or pictures using Python. In this article, you will learn how to create beautiful mathematical animations using Manim.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What is Manim?
Manim is an animation engine for creating precise, explanatory math videos. Note that there are two versions of Manim:

The original version created by Grant Sanderson
The community-maintained version

Since the Manim Community version is updated more frequently and better tested, we’ll use it for this tutorial.
To install the dependencies for the package, visit the Manim documentation. After the dependencies are installed, run:
pip install manim

Create a Blue Square that Grows from the Center
We’ll start by creating a blue square that grows from the center:
from manim import *

class GrowingSquare(Scene):
    def construct(self):
        square = Square(color=BLUE, fill_opacity=0.5)
        self.play(GrowFromCenter(square))
        self.wait()

The code leverages four key Manim elements to produce the animation.

Scene: Base class providing animation infrastructure via construct() and play() methods
Square(color, fill_opacity): Mobject for creating squares with specified stroke color and fill transparency
GrowFromCenter: Growth animation starting from the object’s center
wait(): 1-second pause in animation timeline

Save the script above as start.py. Now run the command below to generate a video for the script:
manim -p -ql start.py GrowingSquare

A video called GrowingSquare.mp4 will be saved in your local directory. You should see a blue square growing from the center.

Explanation of the options:

-p: play the video once it finishes generating
-ql: generate a video with low quality

To generate a video with high quality, use -qh instead.
To create a GIF instead of a video, add --format=gif to the command:
manim -p -ql --format=gif start.py GrowingSquare

Turn a Square into a Circle
Creating a square alone is not that exciting. Let’s transform the square into a circle using the Transform animation.
from manim import *

class SquareToCircle(Scene):
    def construct(self):
        circle = Circle()
        circle.set_fill(PINK, opacity=0.5)

        square = Square()
        square.rotate(PI / 4)

        self.play(Create(square))
        self.play(Transform(square, circle))
        self.play(FadeOut(square))

This code demonstrates creating shapes, styling them, and applying different animation effects.

Circle: Creates a circular shape
set_fill(color, opacity): Sets the fill color and transparency (0.5 = 50% transparent)
rotate(angle): Rotates the object by an angle (PI/4 = 45 degrees)
Create: Animation that draws the shape onto the screen
Transform: Smoothly morphs one shape into another
FadeOut: Makes the object gradually disappear

Find a comprehensive list of shapes in the Manim geometry documentation.
Customize Manim
If you don’t want the background to be black, you can change it to white using self.camera.background_color:
from manim import *

class CustomBackground(Scene):
    def construct(self):
        self.camera.background_color = WHITE

        square = Square(color=BLUE, fill_opacity=0.5)
        circle = Circle(color=RED, fill_opacity=0.5)

        self.play(Create(square))
        self.wait()
        self.play(Transform(square, circle))
        self.wait()

Find other ways to customize Manim in the configuration documentation.
Write Mathematical Equations with a Moving Frame
You can create animations that write mathematical equations using the MathTex class:
from manim import *

class WriteEquation(Scene):
    def construct(self):
        equation = MathTex(r"e^{i\pi} + 1 = 0")

        self.play(Write(equation))
        self.wait()

This code demonstrates writing mathematical equations using LaTeX.

MathTex: Creates mathematical equations using LaTeX notation (the r prefix indicates a raw string for LaTeX commands)
Write: Animation that makes text appear as if being written by hand

Or show step-by-step solutions to equations:
from manim import *

class EquationSteps(Scene):
    def construct(self):
        step1 = MathTex(r"2x + 5 = 13")
        step2 = MathTex(r"2x = 8")
        step3 = MathTex(r"x = 4")

        self.play(Write(step1))
        self.wait()
        self.play(Transform(step1, step2))
        self.wait()
        self.play(Transform(step1, step3))
        self.wait()

Let’s break down this code:

Three equation stages: Separate MathTex objects for each step of the solution
Progressive transformation: Transform morphs the first equation through intermediate and final stages
Timed pauses: wait() creates natural breaks between steps for a teaching pace

You can also highlight specific parts of equations using frames:
from manim import *

class MovingFrame(Scene):
    def construct(self):
        # Write equations
        equation = MathTex("2x^2-5x+2", "=", "(x-2)(2x-1)")

        # Create animation
        self.play(Write(equation))

        # Add moving frames
        framebox1 = SurroundingRectangle(equation[0], buff=.1)
        framebox2 = SurroundingRectangle(equation[2], buff=.1)

        # Create animations
        self.play(Create(framebox1))
        self.wait()

        # Replace frame 1 with frame 2
        self.play(ReplacementTransform(framebox1, framebox2))
        self.wait()

In this code, we use the following elements:

MathTex with multiple arguments: Breaks the equation into separate parts that can be individually accessed and highlighted
SurroundingRectangle(mobject, buff): Creates a box around an equation part (buff=0.1 sets the space between the box and the content)
ReplacementTransform: Smoothly moves and morphs one frame into another, replacing the first frame with the second

Moving and Zooming Camera
You can move the camera and zoom in on specific parts of an equation using a class that inherits from MovingCameraScene:
from manim import *

class MovingCamera(MovingCameraScene):
    def construct(self):
        equation = MathTex(
            r"\frac{d}{dx}(x^2) = 2x"
        )

        self.play(Write(equation))
        self.wait()

        # Zoom in on the derivative
        self.play(
            self.camera.frame.animate.scale(0.5).move_to(equation[0])
        )
        self.wait()

This code shows how to zoom in on specific parts of your animation.

MovingCameraScene: A special scene type that allows the camera to move and zoom
self.camera.frame: Represents the camera’s view that you can move and zoom
animate.scale(0.5): Zooms in by making the camera view 50% smaller (objects appear twice as big)
move_to(equation[0]): Centers the camera on the first part of the equation

Graph
You can use Manim to create annotated graphs:
from manim import *

class Graph(Scene):
    def construct(self):
        axes = Axes(
            x_range=[-3, 3, 1],
            y_range=[-5, 5, 1],
            x_length=6,
            y_length=6,
        )

        # Add labels
        axes_labels = axes.get_axis_labels(x_label="x", y_label="f(x)")

        # Create a function graph
        graph = axes.plot(lambda x: x**2, color=BLUE)
        graph_label = axes.get_graph_label(graph, label="x^2")

        self.add(axes, axes_labels)
        self.play(Create(graph))
        self.play(Write(graph_label))
        self.wait()

This code shows how to create mathematical function graphs with labeled axes.

Axes: Creates a coordinate grid with specified ranges (e.g., x from -3 to 3, y from -5 to 5) and sizes
get_axis_labels: Adds “x” and “f(x)” labels to the coordinate axes
plot(lambda x: x**2): Plots a mathematical function (here, x squared) as a curve on the axes
get_graph_label: Adds a label showing the function’s equation next to the plotted curve
self.add: Instantly displays objects on screen without animating them in

If you want to get an image of the last frame of a scene, add -s to the command:
manim -p -qh -s more.py Graph

You can also animate the process of setting up the axes:
manim -p -qh more.py Graph

Move Objects Together
You can use VGroup to group different Manim objects and move them together:
from manim import *

class MoveObjectsTogether(Scene):
    def construct(self):
        square = Square(color=BLUE)
        circle = Circle(color=RED)

        # Group objects
        group = VGroup(square, circle)
        group.arrange(RIGHT, buff=1)

        self.play(Create(group))
        self.wait()

        # Move the entire group
        self.play(group.animate.shift(UP * 2))
        self.wait()

This code shows how to group objects and move them together as one unit.

VGroup: Groups multiple objects together so you can control them all at once (like selecting multiple items)
arrange(RIGHT, buff=1): Lines up the objects horizontally from left to right with 1 unit of space between them
shift(UP * 2): Moves the entire group upward by 2 units (multiplying by 2 makes it move twice as far)

You can also create multiple groups and manipulate them independently or together:
from manim import *

class GroupCircles(Scene):
    def construct(self):
        # Create circles
        circle_green = Circle(color=GREEN)
        circle_blue = Circle(color=BLUE)
        circle_red = Circle(color=RED)

        # Set initial positions
        circle_green.shift(LEFT)
        circle_blue.shift(RIGHT)

        # Create 2 different groups
        gr = VGroup(circle_green, circle_red)
        gr2 = VGroup(circle_blue)
        self.add(gr, gr2)
        self.wait()

        # Shift 2 groups down
        self.play((gr + gr2).animate.shift(DOWN))

        # Move only 1 group
        self.play(gr.animate.shift(RIGHT))
        self.play(gr.animate.shift(UP))

        # Shift 2 groups to the right
        self.play((gr + gr2).animate.shift(RIGHT))
        self.play(circle_red.animate.shift(RIGHT))
        self.wait()

Trace Path
You can use TracedPath to create a trace of a moving object:
from manim import *

class TracePath(Scene):
    def construct(self):
        dot = Dot(color=RED)

        # Create traced path
        path = TracedPath(dot.get_center, stroke_color=BLUE, stroke_width=4)
        self.add(path, dot)

        # Move the dot in a circular pattern
        self.play(
            MoveAlongPath(dot, Circle(radius=2)),
            rate_func=linear,
            run_time=4,
        )
        self.wait()

This code shows how to create a trail that follows a moving object.

Dot: A small circular point that you can move around
TracedPath(dot.get_center): Creates a line that draws itself following the dot’s path (like a pen trail)
get_center: Gets the current position of the dot so TracedPath knows where to draw
MoveAlongPath(dot, Circle(radius=2)): Moves the dot along a circular path with radius 2
rate_func=linear: Makes the dot move at a constant speed (no speeding up or slowing down)
run_time=4: The animation takes 4 seconds to complete

For a more complex example, you can create a rolling circle that traces a cycloid pattern:
from manim import *

class RollingCircleTrace(Scene):
    def construct(self):
        # Create circle and dot
        circ = Circle(color=BLUE).shift(4 * LEFT)
        dot = Dot(color=BLUE).move_to(circ.get_start())

        # Group dot and circle
        rolling_circle = VGroup(circ, dot)
        trace = TracedPath(circ.get_start)

        # Rotate the circle
        rolling_circle.add_updater(lambda m: m.rotate(-0.3))

        # Add trace and rolling circle to the scene
        self.add(trace, rolling_circle)

        # Shift the circle to 8*RIGHT
        self.play(rolling_circle.animate.shift(8 * RIGHT), run_time=4, rate_func=linear)

This code creates a rolling wheel animation that draws a curved pattern as it moves.

Setup: Creates a blue circle at the left side and places a blue dot at the circle’s edge
Grouping: Uses VGroup to combine the circle and dot so they move together
Path tracking: TracedPath watches the circle’s starting point and draws a line following it
Continuous rotation: add_updater makes the circle rotate automatically every frame, creating the rolling effect
Rolling motion: As the group shifts right, the rotation updater makes it look like a wheel rolling across the screen
Cycloid pattern: The traced path creates a wave-like curve showing the mathematical path a point on a rolling wheel makes

Recap
Congratulations! You have just learned how to use Manim and what it can do. To recap, there are three kinds of objects that Manim provides:

Mobjects: Objects that can be displayed on the screen, such as Circle, Square, Matrix, Angle, etc.
Scenes: Canvas for animations such as Scene, MovingCameraScene, etc.
Animations: Animations applied to Mobjects such as Write, Create, GrowFromCenter, Transform, etc.

There is so much more Manim can do that I cannot cover here. The best way to learn is through practice, so I encourage you to try the examples in this article and check out Manim’s tutorial.
Behave: Write Readable ML Tests with Behavior-Driven Development

Table of Contents

Motivation
What is behave?
Invariance Testing
Directional Testing
Minimum Functionality Testing
Behave’s Trade-offs
Conclusion

Motivation
Imagine you create an ML model to predict customer sentiment based on reviews. Upon deploying it, you realize that the model incorrectly labels certain positive reviews as negative when they’re rephrased using negative words.

This is just one example of how an extremely accurate ML model can fail without proper testing. Thus, testing your model for accuracy and reliability is crucial before deployment.
But how do you test your ML model? One straightforward approach is to write a unit test:
from textblob import TextBlob

def test_sentiment_the_same_after_paraphrasing():
    sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."
    sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

    sentiment_original = TextBlob(sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(sent_paraphrased).sentiment.polarity

    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

This approach works but can be challenging for non-technical or business participants to understand. Wouldn’t it be nice if you could incorporate project objectives and goals into your tests, expressed in natural language?
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

That is when behave comes in handy.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What is behave?
behave is a Python framework for behavior-driven development (BDD). BDD is a software development methodology that:

Emphasizes collaboration between stakeholders (such as business analysts, developers, and testers)
Enables users to define requirements and specifications for a software application

Since behave provides a common language and format for expressing requirements and specifications, it can be ideal for defining and validating the behavior of machine learning models.
To install behave, type:
pip install behave

Let’s use behave to perform various tests on machine learning models.

📚 For comprehensive unit testing strategies and best practices, check out Production-Ready Data Science.

Invariance Testing
Invariance testing tests whether an ML model produces consistent results under different conditions.
An example of invariance testing involves verifying if a model is invariant to paraphrasing. An ideal model should maintain consistent sentiment scores even when a positive review is rephrased using negative words like “wasn’t bad” instead of “was good.”
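Stripped of the BDD layer, the invariance property is simply an assertion that two polarity scores agree in sign. A minimal, framework-free sketch of that check (the polarity values are illustrative):

```python
def same_sentiment(score_a: float, score_b: float) -> bool:
    """True when two polarity scores agree in sign (both positive or both negative)."""
    both_positive = score_a > 0 and score_b > 0
    both_negative = score_a < 0 and score_b < 0
    return both_positive or both_negative

# Paraphrasing should not flip the sign of the sentiment score
assert same_sentiment(0.66, 0.40)       # invariant: both positive
assert not same_sentiment(0.66, -0.38)  # violated: the sign flipped
```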

Feature File
To use behave for invariance testing, create a directory called features. Under that directory, create a file called invariant_test_sentiment.feature.
└── features/
    └── invariant_test_sentiment.feature

Within the invariant_test_sentiment.feature file, we will specify the project requirements:
Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment

The “Given,” “When,” and “Then” parts of this file present the actual steps that will be executed by behave during the test.
The Feature section serves as living documentation to provide context but does not trigger test execution.
Python Step Implementation
To implement the steps used in the scenarios with Python, start with creating the features/steps directory and a file called invariant_test_sentiment.py within it:
└── features/
    ├── invariant_test_sentiment.feature
    └── steps/
        └── invariant_test_sentiment.py

The invariant_test_sentiment.py file contains the following code, which tests whether the sentiment produced by the TextBlob model is consistent between the original text and its paraphrased version.
from behave import given, then, when
from textblob import TextBlob

@given("a text")
def step_given_positive_sentiment(context):
    context.sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."

@when("the text is paraphrased")
def step_when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

@then("both text should have the same sentiment")
def step_then_sentiment_analysis(context):
    # Get sentiment of each sentence
    sentiment_original = TextBlob(context.sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(context.sent_paraphrased).sentiment.polarity

    # Print sentiment
    print(f"Sentiment of the original text: {sentiment_original:.2f}")
    print(f"Sentiment of the paraphrased sentence: {sentiment_paraphrased:.2f}")

    # Assert that both sentences have the same sentiment
    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

Explanation of the code above:

The steps are identified using decorators matching the feature’s predicate: given, when, and then.
The decorator accepts a string containing the rest of the phrase in the matching scenario step.
The context variable allows you to share values between steps.

Run the Test
To run the invariant_test_sentiment.feature test, type the following command:
behave features/invariant_test_sentiment.feature

Output:
Feature: Sentiment Analysis # features/invariant_test_sentiment.feature:1
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment
      Traceback (most recent call last):
        assert both_positive or both_negative
      AssertionError

      Captured stdout:
      Sentiment of the original text: 0.66
      Sentiment of the paraphrased sentence: -0.38

Failing scenarios:
  features/invariant_test_sentiment.feature:6  Paraphrased text

0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
2 steps passed, 1 failed, 0 skipped, 0 undefined

The output shows that the first two steps passed and the last step failed, indicating that the model is affected by paraphrasing.
Directional Testing
Directional testing is a statistical method used to assess whether the impact of an independent variable on a dependent variable is in a particular direction, either positive or negative.
An example of directional testing is to check whether the presence of a specific word has a positive or negative effect on the sentiment score of a given text.

To use behave for directional testing, we will create two files directional_test_sentiment.feature and directional_test_sentiment.py.
└── features/
├──── directional_test_sentiment.feature
└──── steps/
└──── directional_test_sentiment.py

Feature File
The code in directional_test_sentiment.feature specifies the requirements of the project as follows:
Feature: Sentiment Analysis with Specific Word
As a data scientist
I want to ensure that the presence of a specific word
has a positive or negative effect on the sentiment score of a text

Scenario: Sentiment analysis with specific word
Given a sentence
And the same sentence with the addition of the word 'awesome'
When I input the new sentence into the model
Then the sentiment score should increase

Notice the “And” keyword in the scenario. Because the preceding step starts with “Given,” behave treats the “And” step as another “Given.”
Python Step Implementation
The code in directional_test_sentiment.py implements a test scenario, which checks whether the presence of the word “awesome” positively affects the sentiment score generated by the TextBlob model.
from behave import given, then, when
from textblob import TextBlob

@given("a sentence")
def step_given_positive_word(context):
    context.sent = "I love this product"

@given("the same sentence with the addition of the word '{word}'")
def step_given_a_positive_word(context, word):
    context.new_sent = f"I love this {word} product"

@when("I input the new sentence into the model")
def step_when_use_model(context):
    context.sentiment_score = TextBlob(context.sent).sentiment.polarity
    context.adjusted_score = TextBlob(context.new_sent).sentiment.polarity

@then("the sentiment score should increase")
def step_then_positive(context):
    assert context.adjusted_score > context.sentiment_score

The second step uses the parameter syntax {word}. When the .feature file is run, the value specified for {word} in the scenario is automatically passed to the corresponding step function.
This means that if the scenario states that the same sentence should include the word “awesome,” behave will automatically replace {word} with “awesome.”

This parameterization is useful when you want to try different values for {word} by editing only the .feature file, leaving the .py file untouched.
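To exercise several words in a single run, behave also supports Gherkin's Scenario Outline, which repeats the scenario once per row of an Examples table while reusing the same step definitions. A sketch (the second word is an illustrative addition, not from the original project):

```gherkin
Feature: Sentiment Analysis with Specific Word

  Scenario Outline: Sentiment analysis with the word '<word>'
    Given a sentence
    And the same sentence with the addition of the word '<word>'
    When I input the new sentence into the model
    Then the sentiment score should increase

    Examples:
      | word      |
      | awesome   |
      | fantastic |
```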

Run the Test
behave features/directional_test_sentiment.feature

Output:
Feature: Sentiment Analysis with Specific Word
As a data scientist
I want to ensure that the presence of a specific word has a positive or negative effect on the sentiment score of a text
Scenario: Sentiment analysis with specific word
Given a sentence
And the same sentence with the addition of the word 'awesome'
When I input the new sentence into the model
Then the sentiment score should increase

1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 0 skipped
4 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can infer that the sentiment score increases due to the new word’s presence.
Minimum Functionality Testing
Minimum functionality testing is a type of testing that verifies if the system or product meets the minimum requirements and is functional for its intended use.
One example of minimum functionality testing is to check whether the model can handle different types of inputs, such as numerical, categorical, or textual data. To test with diverse inputs, generate test data using Faker for more comprehensive validation.

To use minimum functionality testing for input validation, create two files minimum_func_test_input.feature and minimum_func_test_input.py.
└── features/
├──── minimum_func_test_input.feature
└──── steps/
└──── minimum_func_test_input.py

Feature File
The code in minimum_func_test_input.feature specifies the project requirements as follows:
Feature: Test my_ml_model

Scenario: Test integer input
Given I have an integer input of 42
When I run the model
Then the output should be an array of one number

Scenario: Test float input
Given I have a float input of 3.14
When I run the model
Then the output should be an array of one number

Scenario: Test list input
Given I have a list input of [1, 2, 3]
When I run the model
Then the output should be an array of three numbers

Python Step Implementation
The code in minimum_func_test_input.py implements the requirements, checking if the output generated by predict for a specific input type meets the expectations.
from behave import given, then, when

import ast
import numpy as np
from sklearn.linear_model import LinearRegression
from typing import Union

def predict(input_data: Union[int, float, str, list]):
    """Create a model to predict input data"""

    # Reshape the input data
    if isinstance(input_data, (int, float, list)):
        input_array = np.array(input_data).reshape(-1, 1)
    else:
        raise ValueError("Input type not supported")

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on a sample dataset
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2, 4, 6, 8, 10])
    model.fit(X, y)

    # Predict the output using the input array
    return model.predict(input_array)

@given("I have an integer input of {input_value}")
def step_given_integer_input(context, input_value):
    context.input_value = int(input_value)

@given("I have a float input of {input_value}")
def step_given_float_input(context, input_value):
    context.input_value = float(input_value)

@given("I have a list input of {input_value}")
def step_given_list_input(context, input_value):
    # Safer than eval for parsing literals like "[1, 2, 3]"
    context.input_value = ast.literal_eval(input_value)

@when("I run the model")
def step_when_run_model(context):
    context.output = predict(context.input_value)

@then("the output should be an array of one number")
def step_then_check_single_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 1

@then("the output should be an array of three numbers")
def step_then_check_triple_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 3

Run the Test
behave features/minimum_func_test_input.feature

Output:
Feature: Test my_ml_model

Scenario: Test integer input
Given I have an integer input of 42
When I run the model
Then the output should be an array of one number

Scenario: Test float input
Given I have a float input of 3.14
When I run the model
Then the output should be an array of one number

Scenario: Test list input
Given I have a list input of [1, 2, 3]
When I run the model
Then the output should be an array of three numbers

1 feature passed, 0 failed, 0 skipped
3 scenarios passed, 0 failed, 0 skipped
9 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can conclude that the model outputs match our expectations.
Behave’s Trade-offs
This section will outline some drawbacks of using behave compared to pytest, and explain why it may still be worth considering the tool.
Learning Curve
Using Behavior-Driven Development (BDD) in behave may result in a steeper learning curve than the more traditional testing approach used by pytest.

Counter argument: The focus on collaboration in BDD can lead to better alignment between business requirements and software development, resulting in a more efficient development process overall.

Slower performance
behave tests can be slower than pytest tests because behave must parse the feature files and map them to step definitions before running the tests.

Counter argument: behave’s focus on well-defined steps can lead to tests that are easier to understand and modify, reducing the overall effort required for test maintenance.

Less flexibility
behave is more rigid in its syntax, while pytest allows more flexibility in defining tests and fixtures.

Counter argument: behave’s rigid structure can help ensure consistency and readability across tests, making them easier to understand and maintain over time.

Conclusion
You’ve learned how to use behave to write readable tests for a data science project.
Key takeaways:
How behave works:

Feature files serve as living documentation: They communicate test intent in natural language while driving actual test execution
Step decorators bridge features and code: @given, @when, and @then decorators map feature file steps to Python test implementations

Three essential test types:

Invariance testing: Ensures your model produces consistent results when inputs are paraphrased or slightly modified
Directional testing: Validates that specific changes have the expected positive or negative impact on predictions
Minimum functionality testing: Verifies your model handles different input types correctly

Despite trade-offs like a steeper learning curve and slower performance compared to pytest, behave excels where it matters most for ML testing: making model behavior transparent and testable by both technical and non-technical team members.
Related Tutorials

Configuration Management: Hydra for Python Configuration for managing test specifications with YAML-like syntax

Behave: Write Readable ML Tests with Behavior-Driven Development

Build Production-Ready LLM Agents with LangChain 1.0 Middleware

Table of Contents

Introduction
Introduction to Middleware Pattern
Installation
Message Summarization
PII Detection and Filtering
Human-in-the-Loop
Task Planning
Intelligent Tool Selection
Building a Production Agent with Multiple Middleware
Final Thoughts

Introduction
Have you ever wanted to extend your LLM agent with custom behaviors like:

Summarizing messages to manage context windows
Filtering PII to protect sensitive data
Requesting human approval for critical actions

…but weren’t sure how to build them?
If you’ve tried this in LangChain v0.x, you probably ran into complex pre/post hooks that were hard to scale or test.
LangChain 1.0 introduces a composable middleware architecture that solves these problems by providing reusable, testable components that follow web server middleware patterns.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Introduction to Middleware Pattern
Building on the LangChain fundamentals we covered earlier, LangChain 1.0 introduces middleware components that give you fine-grained control over agent execution. Each middleware is a self-contained component that:

Focuses on a single responsibility (monitor, modify, control, or enforce)
Can be tested independently
Composes with other middleware through a standard interface

The four middleware categories are:

Monitor: Track agent behavior with logging, analytics, and debugging
Modify: Transform prompts, tool selection, and output formatting
Control: Add retries, fallbacks, and early termination logic
Enforce: Apply rate limits, guardrails, and PII detection

This article covers five essential middleware components:

Message summarization (modify): Manage context windows by condensing long conversations
PII filtering (enforce): Protect sensitive data by redacting emails and phone numbers
Human-in-the-loop (control): Pause execution for critical actions requiring approval
Task planning (modify): Structure complex requests into manageable subtasks
Intelligent tool selection (modify): Pre-filter tools to reduce costs and improve accuracy

Let’s explore how each middleware component improves production agent workflows.
Installation
Install LangChain 1.0 and the OpenAI integration:
# Option 1: pip
pip install langchain langchain-openai

# Option 2: uv (faster alternative to pip)
uv add langchain langchain-openai

Note: If you’re upgrading from LangChain v0.x, add the -U flag: pip install -U langchain langchain-openai

You’ll also need an OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"

Message Summarization
When building conversational agents, message history grows with each turn. Long conversations quickly exceed model context windows, causing API errors or degraded performance.
SummarizationMiddleware automates this by:

Monitoring token count across the conversation
Condensing older messages when thresholds are exceeded
Preserving recent context for immediate relevance

The benefits:

Reduced API costs from sending fewer tokens per request
Faster responses with smaller context windows
Complete context through summaries plus full recent history

Here’s how to use SummarizationMiddleware as part of an agent:
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware

agent = create_agent(
    model="openai:gpt-4o",
    tools=[],
    middleware=[
        SummarizationMiddleware(
            model="openai:gpt-4o-mini",
            max_tokens_before_summary=400,
            messages_to_keep=5
        )
    ]
)

This configuration sets up automatic conversation management:

model="openai:gpt-4o" – The primary model for agent responses
max_tokens_before_summary=400 – Triggers summarization when conversation exceeds 400 tokens
messages_to_keep=5 – Preserves the 5 most recent messages in full
model="openai:gpt-4o-mini" – Uses a faster, cheaper model for creating summaries

Note: These configuration values are set low for demonstration purposes to quickly show summarization behavior. Production applications typically use max_tokens_before_summary=4000 and messages_to_keep=20 (the recommended defaults).

Let’s use this agent to simulate a customer support conversation and track token usage.
First, let’s set up a realistic customer support conversation with multiple turns:
# Simulate a customer support conversation
conversation_turns = [
"I ordered a laptop last week but haven't received it yet. Order #12345.",
"Can you check the shipping status? I need it for work next Monday.",
"Also, I originally wanted the 16GB RAM model but ordered 8GB by mistake.",
"Is it too late to change the order? Or should I return and reorder?",
"What's your return policy on laptops? Do I need the original packaging?",
"If I return it, how long does the refund take to process?",
"Can I get expedited shipping on the replacement 16GB model?",
"Does the 16GB version come with the same warranty as the 8GB?",
"Are there any promotional codes I can use for the new order?",
"What if the new laptop arrives damaged? What's the process?",
]

Next, define helper functions to track token usage and verify summarization:

estimate_token_count(): Calculates approximate tokens by counting words in all messages
get_actual_tokens(): Extracts the actual token count from the model’s response metadata
print_token_comparison(): Displays estimated vs actual tokens to show when summarization occurs

def estimate_token_count(messages):
    """Estimate total tokens in message history."""
    return sum(len(msg.content.split()) * 1.3 for msg in messages)

def get_actual_tokens(response):
    """Extract actual token count from response metadata."""
    last_ai_message = response["messages"][-1]
    if hasattr(last_ai_message, 'usage_metadata') and last_ai_message.usage_metadata:
        return last_ai_message.usage_metadata.get("input_tokens", 0)
    return None

def print_token_comparison(turn_number, estimated, actual):
    """Print token count comparison for a conversation turn."""
    if actual is not None:
        print(f"Turn {turn_number}: ~{int(estimated)} tokens (estimated) → {actual} tokens (actual)")
    else:
        print(f"Turn {turn_number}: ~{int(estimated)} tokens (estimated)")

Finally, run the conversation and observe token usage across turns:
from langchain_core.messages import HumanMessage

messages = []
for i, question in enumerate(conversation_turns, 1):
    messages.append(HumanMessage(content=question))

    estimated_tokens = estimate_token_count(messages)
    response = agent.invoke({"messages": messages})
    messages.extend(response["messages"][len(messages):])

    actual_tokens = get_actual_tokens(response)
    print_token_comparison(i, estimated_tokens, actual_tokens)

Output:
Turn 1: ~16 tokens (estimated) → 24 tokens (actual)
Turn 2: ~221 tokens (estimated) → 221 tokens (actual)
Turn 3: ~408 tokens (estimated) → 415 tokens (actual)
Turn 4: ~646 tokens (estimated) → 509 tokens (actual)
Turn 5: ~661 tokens (estimated) → 524 tokens (actual)
Turn 6: ~677 tokens (estimated) → 379 tokens (actual)
Turn 7: ~690 tokens (estimated) → 347 tokens (actual)
Turn 8: ~705 tokens (estimated) → 184 tokens (actual)
Turn 9: ~721 tokens (estimated) → 204 tokens (actual)
Turn 10: ~734 tokens (estimated) → 195 tokens (actual)

Notice the pattern in the token counts:

Turns 1-3: Tokens grow steadily (24 → 221 → 415) as the conversation builds
Turn 4: Summarization kicks in with actual tokens dropping to 509 despite 646 estimated
Turn 8: Most dramatic reduction with only 184 actual tokens sent vs 705 estimated (74% reduction!)

Once past the 400-token threshold, the middleware automatically condenses older messages while preserving the 5 most recent turns. This keeps token usage low even as the conversation continues.
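The underlying pattern is simple: once the estimated size crosses the threshold, fold everything except the most recent messages into a single summary entry. A minimal, framework-free sketch of that idea (the summary string here is a placeholder; the real middleware generates it with gpt-4o-mini):

```python
def estimate_tokens(messages: list[str]) -> int:
    """Rough token estimate: ~1.3 tokens per whitespace-separated word."""
    return int(sum(len(m.split()) * 1.3 for m in messages))

def condense(messages: list[str], max_tokens: int = 400, keep: int = 5) -> list[str]:
    """Replace older messages with one summary entry once the budget is exceeded."""
    if estimate_tokens(messages) <= max_tokens or len(messages) <= keep:
        return messages
    older, recent = messages[:-keep], messages[-keep:]
    # Placeholder summary; SummarizationMiddleware would call the summary model here.
    summary = f"[summary of {len(older)} earlier message(s)]"
    return [summary] + recent

history = ["long opening message " * 100] + [f"turn {i}" for i in range(5)]
print(condense(history)[0])  # [summary of 1 earlier message(s)]
```

The same loop runs on every turn, so the history stays bounded: older content collapses into the summary while the `keep` most recent messages survive verbatim.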
PII Detection and Filtering
Customer support conversations often contain sensitive information like email addresses, phone numbers, and account IDs. Logging or storing this data without redaction creates compliance and security risks.
PIIMiddleware automatically protects personally identifiable information (PII) by:

Built-in detectors for common PII types (email, credit cards, IP addresses)
Custom regex patterns for domain-specific sensitive data
Multiple protection strategies: redact, mask, hash, or block
Automatic application to all messages before model processing

First, configure the agent with multiple PII detectors:
Each detector in this example demonstrates a different protection strategy:

Email detector: Uses built-in pattern with redact strategy (complete replacement)
Phone detector: Uses custom regex \b\d{3}-\d{3}-\d{4}\b with mask strategy (partial visibility)
Account ID detector: Uses custom pattern \b[A-Z]{2}\d{8}\b with redact strategy (complete removal)

from langchain.agents import create_agent
from langchain.agents.middleware import PIIMiddleware
from langchain_core.messages import HumanMessage

agent = create_agent(
    model="openai:gpt-4o",
    tools=[],
    middleware=[
        # Built-in email detector - replaces emails with [REDACTED_EMAIL]
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        # Custom phone number pattern - shows only last 4 digits
        PIIMiddleware(
            "phone",
            detector=r"\b\d{3}-\d{3}-\d{4}\b",
            strategy="mask",
            apply_to_input=True,
        ),
        # Custom regex pattern for account IDs (e.g., AB12345678)
        PIIMiddleware(
            "account_id",
            detector=r"\b[A-Z]{2}\d{8}\b",
            strategy="redact",
            apply_to_input=True,
        ),
    ],
)

Next, create a message containing sensitive information and invoke the agent:
# Create a message with PII
original_message = HumanMessage(content="My email is john@example.com, phone is 555-123-4567, and account is AB12345678")
print(f"Original message: {original_message.content}")

# Invoke the agent
response = agent.invoke({"messages": [original_message]})

Output:
Original message: My email is john@example.com, phone is 555-123-4567, and account is AB12345678

Finally, inspect the message that was actually sent to the model to verify redaction:
# Check what was actually sent to the model (after PII redaction)
input_message = response["messages"][0]
print(f"Message sent to model: {input_message.content}")

Output:
Message sent to model: My email is [REDACTED_EMAIL], phone is ****4567, and account is [REDACTED_ACCOUNT_ID]

The middleware successfully processed all three types of sensitive information:

Email: Completely redacted to [REDACTED_EMAIL]
Phone: Masked to show only last 4 digits (****4567)
Account ID: Completely redacted to [REDACTED_ACCOUNT_ID]
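At heart, the redact and mask strategies are regex substitutions. A standalone sketch with Python's re module, using the same phone and account-ID patterns as above (the email pattern is a simplified stand-in for the built-in detector, which is more thorough):

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # simplified; built-in detector is stricter
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
ACCOUNT_ID = re.compile(r"\b[A-Z]{2}\d{8}\b")

def sanitize(text: str) -> str:
    """Redact emails and account IDs; mask phone numbers to the last 4 digits."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = PHONE.sub(lambda m: "****" + m.group()[-4:], text)
    text = ACCOUNT_ID.sub("[REDACTED_ACCOUNT_ID]", text)
    return text

print(sanitize("My email is john@example.com, phone is 555-123-4567, and account is AB12345678"))
# My email is [REDACTED_EMAIL], phone is ****4567, and account is [REDACTED_ACCOUNT_ID]
```

This is only the substitution step; the middleware's value is applying it automatically to every message before it reaches the model.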

Human-in-the-Loop
Autonomous agents can perform sensitive actions like processing refunds or modifying account settings. Executing these without human oversight creates risk of errors or abuse.
HumanInTheLoopMiddleware automates approval workflows by pausing execution and waiting for approval before proceeding:
from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware
from langchain_core.tools import tool
from langgraph.checkpoint.memory import MemorySaver

@tool
def process_refund(amount: float, reason: str) -> str:
    """Process a customer refund. Use this when a customer requests a refund."""
    return f"Refund of ${amount} processed for reason: {reason}"

# Create memory checkpointer for state persistence
memory = MemorySaver()

agent = create_agent(
    model="openai:gpt-4o",
    tools=[process_refund],
    middleware=[HumanInTheLoopMiddleware(interrupt_on={"process_refund": True})],
    checkpointer=memory,  # Required for state persistence
    system_prompt="You are a customer support agent. Use the available tools to help customers. When a customer asks for a refund, use the process_refund tool.",
)

This configuration sets up an agent that:

Uses HumanInTheLoopMiddleware to pause execution before calling process_refund
Uses a checkpointer (MemorySaver) to save agent state during interruptions, allowing execution to resume after approval

Now let’s invoke the agent with a refund request:
# Agent pauses before executing sensitive tools
response = agent.invoke(
    {"messages": [("user", "I need a refund of $100 for my damaged laptop")]},
    config={"configurable": {"thread_id": "user-123"}},
)

The agent will pause when it tries to process the refund. To verify this happened, let’s define helper functions for interrupt detection.
def has_interrupt(response):
    """Check if response contains an interrupt."""
    return "__interrupt__" in response

def display_action(action):
    """Display pending action details."""
    print(f"Pending action: {action['name']}")
    print(f"Arguments: {action['args']}")
    print()

def get_user_approval():
    """Prompt user for approval and return decision."""
    approval = input("Approve this action? (yes/no): ")
    if approval.lower() == "yes":
        print("✓ Action approved")
        return True
    else:
        print("✗ Action rejected")
        return False

Now use these helpers to check for interrupts and process approval:
if has_interrupt(response):
    print("Execution interrupted – waiting for approval\n")

    interrupts = response["__interrupt__"]
    for interrupt in interrupts:
        for action in interrupt.value["action_requests"]:
            display_action(action)
            approved = get_user_approval()

Output:
Execution interrupted – waiting for approval

Pending action: process_refund
Arguments: {'amount': 100, 'reason': 'Damaged Laptop'}

Approve this action? (yes/no): yes
✓ Action approved

The middleware successfully intercepted the process_refund tool call before execution, displaying all necessary details (action name and arguments) for human review. Only after explicit approval does the agent proceed with the sensitive operation.
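The interception pattern itself is a small wrapper: consult an approver before running the sensitive callable. A framework-free sketch of the idea (the auto-approval policy here is purely illustrative; in the middleware a human decides, as in the flow above):

```python
from typing import Callable

def guarded(fn: Callable, approve: Callable[[str, dict], bool]) -> Callable:
    """Wrap a sensitive function so it only runs after explicit approval."""
    def wrapper(**kwargs):
        if not approve(fn.__name__, kwargs):
            return f"{fn.__name__} rejected"
        return fn(**kwargs)
    return wrapper

def process_refund(amount: float, reason: str) -> str:
    return f"Refund of ${amount} processed for reason: {reason}"

# Illustrative policy: auto-approve small refunds, reject the rest.
safe_refund = guarded(process_refund, lambda name, args: args["amount"] <= 500)
print(safe_refund(amount=100, reason="damaged laptop"))   # Refund of $100 processed for reason: damaged laptop
print(safe_refund(amount=1200, reason="damaged laptop"))  # process_refund rejected
```

What the middleware adds on top of this wrapper is state persistence: the checkpointer lets the agent pause mid-run and resume after the human answers, even in a separate process.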
Task Planning
Complex tasks like “refactor my codebase” or “analyze this dataset” need to be broken down into smaller, manageable steps. Without explicit planning, agents may jump between subtasks at random or skip critical steps entirely.
TodoListMiddleware enables structured task management by:

Automatically providing a write_todos tool for task planning
Tracking completion status across multi-step workflows
Returning structured todo items in agent results

The benefits:

Better task decomposition through explicit step-by-step planning
Progress tracking to monitor complex workflow completion
Reduced errors from skipped or forgotten subtasks

Here’s how to enable planning for an agent:
from langchain.agents import create_agent
from langchain.agents.middleware import TodoListMiddleware
from langchain_core.tools import tool

@tool
def analyze_code(file_path: str) -> str:
    """Analyze code quality and find issues."""
    return f"Analyzed {file_path}: Found 3 code smells, 2 security issues"

@tool
def refactor_code(file_path: str, changes: str) -> str:
    """Refactor code with specified changes."""
    return f"Refactored {file_path}: {changes}"

agent = create_agent(
    model="openai:gpt-4o",
    tools=[analyze_code, refactor_code],
    middleware=[TodoListMiddleware()]
)

This configuration automatically injects planning capabilities into the agent.
Now let’s ask the agent to perform a multi-step refactoring task:
from langchain_core.messages import HumanMessage

response = agent.invoke({
    "messages": [HumanMessage("I need to refactor my authentication module. First analyze it, then suggest improvements, and finally implement the changes.")]
})

Check the agent’s todo list to see how it planned the work:
# Access the structured todo list from the response
if "todos" in response:
    print("Agent's Task Plan:")
    for i, todo in enumerate(response["todos"], 1):
        status = todo.get("status", "pending")
        print(f"{i}. [{status}] {todo['content']}")

Output:
Agent's Task Plan:
1. [in_progress] Analyze the authentication module code to identify quality issues and areas for improvement.
2. [pending] Suggest improvements based on the analysis of the authentication module.
3. [pending] Implement the suggested improvements in the authentication module code.

Nice! The agent automatically decomposed the multi-step refactoring request into 3 distinct tasks, with 1 in progress and 2 pending. This structured approach ensures systematic execution without skipping critical steps.
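The todos list behaves like a simple status pipeline. A minimal sketch of advancing it, using the same dict shape as response["todos"] above (the "completed" status name is an assumption for illustration; the article's output only shows in_progress and pending):

```python
def advance(todos: list[dict]) -> list[dict]:
    """Mark the current in_progress item completed and start the next pending one."""
    todos = [dict(t) for t in todos]  # copy so the caller's list is untouched
    for t in todos:
        if t["status"] == "in_progress":
            t["status"] = "completed"
            break
    for t in todos:
        if t["status"] == "pending":
            t["status"] = "in_progress"
            break
    return todos

plan = [
    {"content": "Analyze the authentication module", "status": "in_progress"},
    {"content": "Suggest improvements", "status": "pending"},
    {"content": "Implement the changes", "status": "pending"},
]
print([t["status"] for t in advance(plan)])  # ['completed', 'in_progress', 'pending']
```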
Intelligent Tool Selection
Agents with many tools (10+) face a scaling problem: sending all tool descriptions with every request wastes tokens and degrades performance. The model must process irrelevant options, increasing latency and cost.
LLMToolSelectorMiddleware solves this by using a smaller model to pre-filter relevant tools:

Uses a secondary LLM (separate from the main agent model) to pre-filter and limit the tools sent to the main model
Allows critical tools to always be included in selection
Analyzes queries to select only relevant tools

The benefits:

Lower costs from sending fewer tool descriptions per request
Faster responses with smaller tool context
Better accuracy when model isn’t distracted by irrelevant options

Let’s create an agent with many tools for a customer support scenario:
from langchain.agents import create_agent
from langchain.agents.middleware import LLMToolSelectorMiddleware
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

# Define multiple tools for different support scenarios
@tool
def lookup_order(order_id: str) -> str:
    """Look up order details and shipping status."""
    return f"Order {order_id}: Shipped on 2025-01-15"

@tool
def process_refund(order_id: str, amount: float) -> str:
    """Process a customer refund."""
    return f"Refund of ${amount} processed for order {order_id}"

@tool
def check_inventory(product_id: str) -> str:
    """Check product inventory levels."""
    return f"Product {product_id}: 42 units in stock"

@tool
def update_address(order_id: str, new_address: str) -> str:
    """Update shipping address for an order."""
    return f"Address updated for order {order_id}"

@tool
def cancel_order(order_id: str) -> str:
    """Cancel an existing order."""
    return f"Order {order_id} cancelled"

@tool
def track_shipment(tracking_number: str) -> str:
    """Track package location."""
    return f"Package {tracking_number}: Out for delivery"

@tool
def apply_discount(order_id: str, code: str) -> str:
    """Apply discount code to order."""
    return f"Discount {code} applied to order {order_id}"

@tool
def schedule_delivery(order_id: str, date: str) -> str:
    """Schedule delivery for specific date."""
    return f"Delivery scheduled for {date}"

Configure the agent with intelligent tool selection:
agent = create_agent(
    model="openai:gpt-4o",
    tools=[
        lookup_order, process_refund, check_inventory,
        update_address, cancel_order, track_shipment,
        apply_discount, schedule_delivery
    ],
    middleware=[
        LLMToolSelectorMiddleware(
            model="openai:gpt-4o-mini",  # Use cheaper model for selection
            max_tools=3,  # Limit to 3 most relevant tools
            always_include=["lookup_order"],  # Always include order lookup
        )
    ]
)

This configuration creates an efficient filtering system:

model="openai:gpt-4o-mini" – Uses a smaller, faster model for tool selection
max_tools=3 – Limits to 3 most relevant tools per query
always_include=["lookup_order"] – Ensures order lookup is always available

Before testing the agent with different customer requests, define a helper function to display tool usage:
def show_tools_used(response):
    """Display which tools were called during agent execution."""
    tools_used = []
    for msg in response["messages"]:
        if hasattr(msg, "tool_calls") and msg.tool_calls:
            for tool_call in msg.tool_calls:
                tools_used.append(tool_call["name"])

    if tools_used:
        print(f"Tools used: {', '.join(tools_used)}")
    print(f"Response: {response['messages'][-1].content}\n")

Test with a package tracking query:
# Example 1: Package tracking query
response = agent.invoke({
    "messages": [HumanMessage("Where is my package? Tracking number is 1Z999AA10123456784")]
})
show_tools_used(response)

Output:
Tools used: track_shipment
Response: Your package with tracking number 1Z999AA10123456784 is currently out for delivery.

Test with a refund request:
# Example 2: Refund request
response = agent.invoke({
    "messages": [HumanMessage("I need a refund of $50 for order ORD-12345")]
})
show_tools_used(response)

Output:
Tools used: lookup_order, process_refund
Response: The refund of $50 for order ORD-12345 has been successfully processed.

Test with an inventory check:
# Example 3: Inventory check
response = agent.invoke({
    "messages": [HumanMessage("Do you have product SKU-789 in stock?")]
})
show_tools_used(response)

Output:
Tools used: check_inventory
Response: Yes, we currently have 42 units of product SKU-789 in stock.

The middleware demonstrated precise tool selection across different query types:

track_shipment for tracking numbers
lookup_order + process_refund for refund requests
check_inventory for stock queries

Each request filtered out 5+ irrelevant tools, sending only what was needed to the main model.
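The pre-filtering idea can be approximated without an LLM at all. A toy keyword-overlap selector over tool descriptions (illustrative only: LLMToolSelectorMiddleware asks a smaller model to choose, not string matching; the tool names and defaults mirror the example above):

```python
def select_tools(query: str, tools: dict[str, str], max_tools: int = 3,
                 always_include: tuple[str, ...] = ("lookup_order",)) -> list[str]:
    """Rank tools by word overlap between the query and each tool's description."""
    words = set(query.lower().split())
    ranked = sorted(
        tools,
        key=lambda name: len(words & set(tools[name].lower().split())),
        reverse=True,
    )
    selected = [t for t in always_include if t in tools]
    for name in ranked:
        if len(selected) >= max_tools:
            break
        if name not in selected:
            selected.append(name)
    return selected

tools = {
    "lookup_order": "look up order details and shipping status",
    "process_refund": "process a customer refund",
    "track_shipment": "track package location",
    "check_inventory": "check product inventory levels",
}
print(select_tools("where is my package tracking number", tools))
# ['lookup_order', 'track_shipment', 'process_refund']
```

Only the top-ranked tools (plus the always-included ones) would be forwarded to the main model, which is exactly the cost-saving behavior the middleware provides.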
Building a Production Agent with Multiple Middleware
Let’s combine three middleware components to build a production-ready customer support agent that handles a realistic scenario: a customer with a long conversation history requesting a refund and sharing their email address.
from langchain.agents import create_agent
from langchain.agents.middleware import (
    SummarizationMiddleware,
    PIIMiddleware,
    HumanInTheLoopMiddleware
)
from langchain_core.tools import tool
from langgraph.checkpoint.memory import MemorySaver

@tool
def process_refund(amount: float, reason: str) -> str:
    """Process a customer refund."""
    return f"Refund of ${amount} processed for reason: {reason}"

# Create agent with three middleware components
agent = create_agent(
    model="openai:gpt-4o",
    tools=[process_refund],
    middleware=[
        SummarizationMiddleware(
            model="openai:gpt-4o-mini",
            max_tokens_before_summary=400,
            messages_to_keep=5
        ),
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        HumanInTheLoopMiddleware(interrupt_on={"process_refund": True})
    ],
    checkpointer=MemorySaver()
)

Now test with a realistic customer interaction, processing each message to show how middleware handles them.
First, define a helper function to track middleware behavior using the helper functions defined earlier:
def process_message_with_tracking(agent, messages, thread_id, turn_num):
    """Process messages and show middleware behavior."""
    print(f"\n— Turn {turn_num} —")
    print(f"User: {messages[-1][1]}")

    response = agent.invoke(
        {"messages": messages},
        config={"configurable": {"thread_id": thread_id}}
    )

    # Check for interrupts (human-in-the-loop)
    if has_interrupt(response):
        print("⏸ Execution paused for approval")
    else:
        # Show agent response
        agent_message = response["messages"][-1].content
        print(f"Agent: {agent_message}")

        # Check for PII redaction
        full_response = str(response["messages"])
        if "[REDACTED_EMAIL]" in full_response:
            print("🔒 PII detected and redacted")

    return response

Now simulate a customer conversation that demonstrates all three middleware components:

Turns 1-3: Normal conversation flow about a damaged laptop
Turn 4: Customer shares email and asks for confirmation (tests PIIMiddleware redaction)
Turn 5: Customer requests $1200 refund (triggers HumanInTheLoopMiddleware approval)

messages = []

# Turn 1: Initial complaint
messages.append(("user", "I ordered a laptop but it arrived damaged."))
process_message_with_tracking(agent, messages, "customer-456", 1)

# Turn 2: Additional details
messages.append(("user", "I already tried troubleshooting but it won't turn on."))
process_message_with_tracking(agent, messages, "customer-456", 2)

# Turn 3: More context
messages.append(("user", "The screen is cracked and there's a dent on the corner."))
process_message_with_tracking(agent, messages, "customer-456", 3)

# Turn 4: PII exposure – test if middleware hides email from model
messages.append(("user", "My email is customer@example.com. Can you confirm my email address?"))
response = process_message_with_tracking(agent, messages, "customer-456", 4)

# Turn 5: Sensitive action request – triggers human approval
messages.append(("user", "Can I get a full refund of $1200?"))
response = process_message_with_tracking(agent, messages, "customer-456", 5)

# If interrupted, show approval flow
if has_interrupt(response):
interrupts = response["__interrupt__"]
for interrupt in interrupts:
for action in interrupt.value["action_requests"]:
display_action(action)
approved = get_user_approval()

Output:
— Turn 1 —
User: I ordered a laptop but it arrived damaged.
Agent: I'm sorry to hear that your laptop arrived damaged. To resolve this issue, I can assist you in processing a refund. Could you please provide the amount of the purchase and any specific reason you would like to include for the refund?

— Turn 2 —
User: I already tried troubleshooting but it won't turn on.
Agent: Thank you for sharing that information. I'll proceed with processing a refund for the damaged laptop. Could you please provide the purchase amount, so I can include it in the refund request?

— Turn 3 —
User: The screen is cracked and there's a dent on the corner.
Agent: I'm sorry to hear about the condition of your laptop. I will help you process a refund. Could you please let me know the purchase amount, so I can proceed with the refund request?

— Turn 4 —
User: My email is customer@example.com. Can you confirm my email address?
Agent: I'm sorry, but I'm unable to confirm or access email addresses for privacy and security reasons. However, I can assist you with processing a refund. Could you please provide the amount you paid for the laptop so that I can proceed with the refund request?
🔒 PII detected and redacted

— Turn 5 —
User: Can I get a full refund of $1200?
⏸ Execution paused for approval

Pending action: process_refund
Arguments: {'amount': 1200, 'reason': 'Laptop arrived damaged with a cracked screen and dent on the corner, and it will not turn on after troubleshooting.'}

Approve this action? (yes/no): yes
✓ Action approved

The output demonstrates proper security controls:

Turn 4: The agent states it is “unable to confirm or access email addresses,” confirming PIIMiddleware successfully redacted customer@example.com to [REDACTED_EMAIL]
Email protection: Model never saw the actual address, preventing data leaks or logging
Refund approval: $1200 transaction didn’t execute until human approval was granted

For coordinating multiple agents with shared state and workflows, explore our LangGraph tutorial.

Final Thoughts
Building production LLM agents with LangChain 1.0 middleware requires minimal infrastructure code. Each component handles one concern: managing context windows, protecting sensitive data, controlling execution flow, or structuring complex tasks.
The best approach is incremental. Add one middleware at a time, test its behavior, then combine it with others. This modular design lets you start simple and expand as your agent’s requirements evolve.
Related Tutorials

Structured Outputs: Enforce Structured Outputs from LLMs with PydanticAI for type-safe agent responses
RAG Implementation: Build a Complete RAG System with 5 Open-Source Tools for question-answering agents
Vector Storage: Implement Semantic Search in Postgres Using pgvector and Ollama for production-grade embedding storage


Build Production-Ready LLM Agents with LangChain 1.0 Middleware Read More »

Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing

Table of Contents

Introduction
Dataset Generation
Simple Regex: Basic Pattern Extraction
pregex: Build Readable Patterns
pyparsing: Parse Structured Ticket Headers
Conclusion

Introduction
Imagine you’re analyzing customer support tickets to extract contact information and error details. Tickets contain customer messages with email addresses in various formats and phone numbers with inconsistent formatting (some (555) 123-4567, others 555-123-4567).

ticket_id
message

0
Contact me at nichole70@kemp.com or (798)034-325 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

How do you extract the email addresses and phone numbers from the tickets?
This article shows three approaches to text pattern matching: regex, pregex, and pyparsing.
Key Takeaways
Here’s what you’ll learn:

Understand when regex patterns are sufficient and when they fall short
Write maintainable text extraction code using pregex’s readable components
Parse structured text with inconsistent formatting using pyparsing

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Dataset Generation
Let’s create sample datasets that will be used throughout the article. We’ll generate customer support ticket data using the Faker library:
Install Faker:
pip install faker

First, let’s generate customer support tickets with simple contact information:
from faker import Faker
import csv
import pandas as pd
import random

fake = Faker()
Faker.seed(40)

# Define phone patterns
phone_patterns = ["(###)###-####", "###-###-####", "### ### ####", "###.###.####"]

# Define email TLDs
email_tlds = [".com", ".org", ".io", ".net"]

# Generate phone numbers and emails
phones = []
emails = []

for i in range(4):
# Generate phone with specific pattern
phone = fake.numerify(text=phone_patterns[i])
phones.append(phone)

# Generate email with specific TLD
email = fake.user_name() + "@" + fake.domain_word() + email_tlds[i]
emails.append(email)

# Define sentence structures
sentence_structures = [
lambda p, e: f"Contact me at {e} or {p} to resolve this issue.",
lambda p, e: f"You can reach me by phone ({p}) or email ({e}) anytime.",
lambda p, e: f"My contact details: {e} and {p}.",
lambda p, e: f"Feel free to call {p} or email {e} for assistance."
]

# Create CSV with 4 rows
with open("data/tickets.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["ticket_id", "message"])

for i in range(4):
message = sentence_structures[i](phones[i], emails[i])
writer.writerow([i, message])

Set the display option to show the full width of the columns:
pd.set_option("display.max_colwidth", None)

Load and preview the tickets dataset:
df_tickets = pd.read_csv("data/tickets.csv")
df_tickets.head()

ticket_id
message

0
Contact me at nichole70@kemp.com or (798)034-325 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

Simple Regex: Basic Pattern Extraction
Regular expressions (regex) are patterns that match text based on rules. They excel at finding structured data like emails, phone numbers, and dates in unstructured text.
Extract Email Addresses
Start with a simple pattern that matches basic email formats, composed of:

Username: [a-z]+ – One or more lowercase letters (e.g. maria)
Separator: @ – Literal @ symbol
Domain: [a-z]+ – One or more lowercase letters (e.g. gmail or outlook)
Dot: \. – Literal dot (escaped)
Extension: (?:org|net|com|io) – One of the listed extensions (.com, .org, .io, .net)

import re

# Match basic email format: letters@domain.extension
email_pattern = r'[a-z]+@[a-z]+\.(?:org|net|com|io)'

df_tickets['emails'] = df_tickets['message'].apply(
lambda x: re.findall(email_pattern, x)
)

df_tickets[['message', 'emails']].head()

message
emails

0
Contact me at nichole70@kemp.com or (798)034-325 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

This pattern works for simple emails but misses variations with:

Other characters in the username such as numbers, dots, underscores, plus signs, or hyphens
Other characters in the domain such as numbers, dots, or hyphens
Other extensions that are not .com, .org, .io, or .net

Let’s expand the pattern to handle more formats:
# Handle emails with numbers, dots, underscores, hyphens, plus signs
improved_email = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

df_tickets['emails_improved'] = df_tickets['message'].apply(
lambda x: re.findall(improved_email, x)
)

df_tickets[['message', 'emails_improved']].head()

message
emails_improved

0
Contact me at nichole70@kemp.com or (798)034-325 to resolve this issue.

1
You can reach me by phone (970-295-1452) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

The improved pattern successfully extracts all emails from the tickets! Let’s move on to extracting phone numbers.
Extract Phone Numbers
Common phone number formats are:

(XXX)XXX-XXXX – With parentheses
XXX-XXX-XXXX – Without parentheses
XXX XXX XXXX – With spaces
XXX.XXX.XXXX – With dots

To handle all four phone formats, we can use the following pattern:

\(? – Optional opening parenthesis
\d{3} – Exactly 3 digits (area code)
[-.\s]? – Optional hyphen, dot, or space
\)? – Optional closing parenthesis
\d{3} – Exactly 3 digits (prefix)
[-.\s] – Hyphen, dot, or space (required)
\d{4} – Exactly 4 digits (line number)

# Define phone pattern
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}'

df_tickets['phones'] = df_tickets['message'].apply(
lambda x: re.findall(phone_pattern, x)
)

df_tickets[['message', 'phones']].head()

message
phones

0
Contact me at hfuentes@anderson.com or (798)034-3254 to resolve this issue.

1
You can reach me by phone (702-951-4528) or email (russellbrandon@simon-rogers.org) anytime.

2
My contact details: ehamilton@silva.io and 242 844 7293.

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.

Awesome! We are able to extract all phone numbers from the tickets!
While these patterns work, they are difficult to understand and modify for anyone unfamiliar with regex.
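Before reaching for a new library, note that the standard library's re.VERBOSE flag also lets you document a pattern inline with whitespace and comments. A minimal sketch of the phone pattern above, rewritten in verbose mode:

```python
import re

# re.VERBOSE ignores insignificant whitespace in the pattern
# and allows inline comments after '#'
phone_pattern = re.compile(
    r"""
    \(?          # optional opening parenthesis
    \d{3}        # area code
    \)?          # optional closing parenthesis
    [-.\s]?      # optional separator
    \d{3}        # prefix
    [-.\s]       # separator
    \d{4}        # line number
    """,
    re.VERBOSE,
)

print(phone_pattern.findall("Call 901.794.1337 or (798)034-3254 today."))
```

Verbose mode keeps the pattern stdlib-only while annotating each component, though modifying it still requires regex fluency.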

📖 Readable code reduces maintenance burden and improves team productivity. Check out Production-Ready Data Science for detailed guidance on writing production-quality code.

In the next section, we will use pregex to build more readable patterns.
pregex: Build Readable Patterns
pregex is a Python library that lets you build regex patterns using readable Python syntax instead of regex symbols. It breaks complex patterns into self-documenting components that clearly express validation logic.
Install pregex:
pip install pregex

Extract Email Addresses
Let’s extract emails using pregex’s readable components.
In the code, we will use the following components:

Username: OneOrMore(AnyButWhitespace()) – One or more non-whitespace characters (e.g. maria95)
Separator: @ – Literal @ symbol
Domain name: OneOrMore(AnyButWhitespace()) – One or more non-whitespace characters (e.g. gmail or outlook)
Extension: Either(".com", ".org", ".io", ".net") – Match specific extensions (.com, .org, .io, .net)

from pregex.core.classes import AnyButWhitespace
from pregex.core.quantifiers import OneOrMore
from pregex.core.operators import Either

username = OneOrMore(AnyButWhitespace())
at_symbol = "@"
domain_name = OneOrMore(AnyButWhitespace())
extension = Either(".com", ".org", ".io", ".net")

email_pattern = username + at_symbol + domain_name + extension

# Extract emails
df_tickets["emails_pregex"] = df_tickets["message"].apply(
lambda x: email_pattern.get_matches(x)
)

df_tickets[["message", "emails_pregex"]].head()

message
emails_pregex

0
Contact me at hfuentes@anderson.com or (798)034-3254 to resolve this issue.
[hfuentes@anderson.com]

1
You can reach me by phone (702-951-4528) or email (russellbrandon@simon-rogers.org) anytime.
[(russellbrandon@simon-rogers.org]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[ehamilton@silva.io]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[ogarcia@howell-chavez.net]

The output shows the extracted emails. Note that row 1 captures a leading parenthesis, because AnyButWhitespace() matches ( as well — a reminder that broad character classes trade precision for simplicity.
pregex transforms pattern matching from symbol decoding into readable code. OneOrMore(username_chars) communicates intent more clearly than [a-zA-Z0-9._%+-]+, reducing the time teammates spend understanding and modifying validation logic.
Extract Phone Numbers
Now extract phone numbers with multiple components:

First three digits: Optional("(") + Exactly(AnyDigit(), 3) + Optional(")")
Separator: Either(" ", "-", ".")
Second three digits: Exactly(AnyDigit(), 3)
Last four digits: Exactly(AnyDigit(), 4)

from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Optional, Exactly
from pregex.core.operators import Either

# Build phone pattern using pregex
first_three = Optional("(") + Exactly(AnyDigit(), 3) + Optional(")")
separator = Either(" ", "-", ".")
second_three = Exactly(AnyDigit(), 3)
last_four = Exactly(AnyDigit(), 4)

phone_pattern = first_three + Optional(separator) + second_three + separator + last_four

# Extract phone numbers
df_tickets['phones_pregex'] = df_tickets['message'].apply(
lambda x: phone_pattern.get_matches(x)
)

df_tickets[['message', 'phones_pregex']].head()

message
phones_pregex

0
Contact me at hfuentes@anderson.com or (798)034-3254 to resolve this issue.
[(798)034-3254]

1
You can reach me by phone (702-951-4528) or email (russellbrandon@simon-rogers.org) anytime.
[(702-951-4528]

2
My contact details: ehamilton@silva.io and 242 844 7293.
[242 844 7293]

3
Feel free to call 901.794.1337 or email ogarcia@howell-chavez.net for assistance.
[901.794.1337]

If your system requires the raw regex pattern, you can get it with get_compiled_pattern():
print("Compiled email pattern:", email_pattern.get_compiled_pattern().pattern)
print("Compiled phone pattern:", phone_pattern.get_compiled_pattern().pattern)

Compiled email pattern: \S+@\S+(?:\.com|\.org|\.io|\.net)
Compiled phone pattern: \(?\d{3}\)?(?: |-|\.)?\d{3}(?: |-|\.)\d{4}
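Because get_compiled_pattern() returns a standard compiled pattern object, the result can be dropped into existing re-based code. A quick sketch using the email pattern string printed above:

```python
import re

# The regex string that pregex produced for the email components above
compiled = re.compile(r"\S+@\S+(?:\.com|\.org|\.io|\.net)")

text = "Feel free to email ogarcia@howell-chavez.net for assistance."
print(compiled.findall(text))  # ['ogarcia@howell-chavez.net']
```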

For more pregex examples including URLs and time patterns, see PRegEx: Write Human-Readable Regular Expressions in Python.

Parse Structured Ticket Headers
Now let’s tackle a more complex task: parsing structured ticket headers that contain multiple fields:
Ticket: 1000 | Priority: High | Assigned: John Doe # escalated

We will use Capture to extract just the values we need from each ticket:
from pregex.core.quantifiers import OneOrMore
from pregex.core.classes import AnyDigit, AnyLetter, AnyWhitespace
from pregex.core.groups import Capture

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"

# Define patterns with Capture to extract just the values
whitespace = AnyWhitespace()
ticket_id_pattern = "Ticket:" + whitespace + Capture(OneOrMore(AnyDigit()))
priority_pattern = "Priority:" + whitespace + Capture(OneOrMore(AnyLetter()))
name_pattern = (
"Assigned:"
+ whitespace
+ Capture(OneOrMore(AnyLetter()) + " " + OneOrMore(AnyLetter()))
)

# Define separator pattern (whitespace around pipe)
separator = whitespace + "|" + whitespace

# Combine all patterns with separators
ticket_pattern = (
ticket_id_pattern
+ separator
+ priority_pattern
+ separator
+ name_pattern
)
Next, define a function to extract the ticket components from the captured groups:
def get_ticket_components(ticket_string, ticket_pattern):
"""Extract ticket components from a ticket string."""
try:
captures = ticket_pattern.get_captures(ticket_string)[0]
return pd.Series(
{
"ticket_id": captures[0],
"priority": captures[1],
"assigned": captures[2],
}
)
except IndexError:
return pd.Series(
{"ticket_id": None, "priority": None, "assigned": None}
)

Apply the function with the pattern defined above to the sample ticket.

components = get_ticket_components(sample_ticket, ticket_pattern)
print(components.to_dict())

{'ticket_id': '1000', 'priority': 'High', 'assigned': 'John Doe'}

This looks good! Let’s apply it to ticket headers with inconsistent whitespace around the separators. Start by creating the dataset:
import pandas as pd

# Create tickets with embedded comments and variable whitespace
tickets = [
"Ticket: 1000 | Priority: High | Assigned: John Doe # escalated",
"Ticket: 1001 | Priority: Medium | Assigned: Maria Garcia # team lead",
"Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent",
"Ticket: 1003 | Priority: High | Assigned: Bob Johnson # on-call"
]

df_tickets = pd.DataFrame({'ticket': tickets})
df_tickets.head()

ticket

0
Ticket: 1000 | Priority: High | Assigned: John Doe # escalated

1
Ticket: 1001 | Priority: Medium | Assigned: Maria Garcia # team lead

2
Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent

3
Ticket: 1003 | Priority: High | Assigned: Bob Johnson # on-call

# Extract individual components using the function
df_pregex = df_tickets.copy()
components_df = df_pregex["ticket"].apply(get_ticket_components, ticket_pattern=ticket_pattern)

df_pregex = df_pregex.assign(**components_df)

df_pregex[["ticket_id", "priority", "assigned"]].head()

ticket_id
priority
assigned

0
1000
High
John Doe

1
None
None
None

2
None
None
None

3
1003
High
Bob Johnson

We can see that pregex misses Tickets 1 and 2 because AnyWhitespace() matches exactly one whitespace character, while those rows use zero or multiple spaces around the separators.
Making pregex patterns flexible enough for variable formatting requires adding optional quantifiers to the whitespace pattern so that it can match zero or more spaces around the separators.
As these fixes accumulate, pregex’s readability advantage diminishes, and you end up with code that’s as hard to understand as raw regex but more verbose.
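For comparison, the equivalent fix in raw regex is to put \s* (zero or more whitespace characters) around every label and separator — a sketch using the ticket fields from above:

```python
import re

# \s* tolerates zero, one, or many spaces around each label and pipe
flexible = re.compile(
    r"Ticket:\s*(\d+)\s*\|\s*Priority:\s*([A-Za-z]+)\s*\|\s*Assigned:\s*([A-Za-z]+ [A-Za-z]+)"
)

match = flexible.search("Ticket:1002| Priority:Low |Assigned:Alice Smith # non-urgent")
print(match.groups())  # ('1002', 'Low', 'Alice Smith')
```

This handles the rows pregex missed, but the pattern is exactly the kind of symbol soup the article set out to avoid.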
When parsing structured data with consistent patterns but varying details, pyparsing provides more robust handling than regex.
pyparsing: Parse Structured Ticket Headers
Unlike regex’s pattern matching approach, pyparsing lets you define grammar rules using Python classes, making parsing logic explicit and maintainable.
Install pyparsing:
pip install pyparsing

Let’s parse the complete structure with pyparsing, including:

Ticket ID: Word(nums) – One or more digits (e.g. 1000)
Priority: Word(alphas) – One or more letters (e.g. High)
Name: Word(alphas) + Word(alphas) – First and last name (e.g. John Doe)

We will also use pythonStyleComment so that Python-style comments are ignored during parsing.
from pyparsing import Word, alphas, nums, Literal, pythonStyleComment

# Define grammar components
ticket_num = Word(nums)
priority = Word(alphas)
name = Word(alphas) + Word(alphas)

# Define complete structure
ticket_grammar = (
"Ticket:"
+ ticket_num
+ "|"
+ "Priority:"
+ priority
+ "|"
+ "Assigned:"
+ name
)

# Automatically ignore Python-style comments throughout parsing
ticket_grammar.ignore(pythonStyleComment)

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"
sample_result = ticket_grammar.parse_string(sample_ticket)
print(sample_result)

['Ticket:', '1000', '|', 'Priority:', 'High', '|', 'Assigned:', 'John', 'Doe']

Awesome! We are able to extract the ticket components from the ticket with a much simpler pattern!
Compare this to the pregex implementation:
ticket_pattern = (
"Ticket:" + whitespace + Capture(OneOrMore(AnyDigit()))
+ whitespace + "|" + whitespace
+ "Priority:" + whitespace + Capture(OneOrMore(AnyLetter()))
+ whitespace + "|" + whitespace
+ "Assigned:"
+ whitespace
+ Capture(OneOrMore(AnyLetter()) + " " + OneOrMore(AnyLetter()))
)

We can see that pyparsing handles structured data better than pregex for the following reasons:

No whitespace boilerplate: pyparsing handles spacing automatically while pregex requires + whitespace + between every component
Self-documenting: Word(alphas) clearly means “letters” while pregex’s nested Capture(OneOrMore(AnyLetter())) is less readable

To extract ticket components, assign names using () syntax and access them via dot notation:
# Define complete structure
ticket_grammar = (
"Ticket:"
+ ticket_num("ticket_id")
+ "|"
+ "Priority:"
+ priority("priority")
+ "|"
+ "Assigned:"
+ name("assigned")
)

# Automatically ignore Python-style comments throughout parsing
ticket_grammar.ignore(pythonStyleComment)

sample_ticket = "Ticket: 1000 | Priority: High | Assigned: John Doe # escalated"
sample_result = ticket_grammar.parse_string(sample_ticket)

# Access the components by name
print(
f"Ticket ID: {sample_result.ticket_id}",
f"Priority: {sample_result.priority}",
f"Assigned: {' '.join(sample_result.assigned)}",
)

Ticket ID: 1000 Priority: High Assigned: John Doe

Let’s apply this to the entire dataset.
# Parse all tickets and create columns
def parse_ticket(ticket, ticket_grammar):
result = ticket_grammar.parse_string(ticket)
return pd.Series(
{
"ticket_id": result.ticket_id,
"priority": result.priority,
"assigned": " ".join(result.assigned),
}
)

df_pyparsing = df_tickets.copy()
components_df_pyparsing = df_pyparsing["ticket"].apply(parse_ticket, ticket_grammar=ticket_grammar)
df_pyparsing = df_pyparsing.assign(**components_df_pyparsing)

df_pyparsing[["ticket_id", "priority", "assigned"]].head()

ticket_id
priority
assigned

0
1000
High
John Doe

1
1001
Medium
Maria Garcia

2
1002
Low
Alice Smith

3
1003
High
Bob Johnson

The output looks good!
Let’s try to parse some more structured data with pyparsing.
Extract Code Blocks from Markdown
Use SkipTo to extract Python code between code block markers without complex regex patterns like r'```python(.*?)```':
from pyparsing import Literal, SkipTo

code_start = Literal("```python")
code_end = Literal("```")

code_block = code_start + SkipTo(code_end)("code") + code_end

markdown = """```python
def hello():
print("world")
```"""

result = code_block.parse_string(markdown)
print(result.code)

def hello():
print("world")

Parse Nested Structures
nested_expr handles arbitrary nesting depth, which regex fundamentally cannot parse:
from pyparsing import nested_expr

# Default: parentheses
nested_list = nested_expr()
result = nested_list.parse_string("((2 + 3) * (4 - 1))")
print(result.as_list())

[[['2', '+', '3'], '*', ['4', '-', '1']]]
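nested_expr also accepts custom opening and closing delimiters, so the same approach works for braces or brackets — a small sketch:

```python
from pyparsing import nested_expr

# Parse brace-delimited nesting instead of the default parentheses
braces = nested_expr("{", "}")
result = braces.parse_string("{a {b c} d}")
print(result.as_list())
```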

Conclusion
So how do you know when to use each tool? Choose your tool based on your needs:
Use simple regex when:

Extracting simple, well-defined patterns (emails, phone numbers with consistent format)
Pattern won’t need frequent modifications

Use pregex when:

Pattern has multiple variations (different phone number formats)
Need to document pattern logic through readable code

Use pyparsing when:

Need to extract multiple fields from structured text (ticket headers, configuration files)
Must handle variable formatting (inconsistent whitespace, embedded comments)

In summary, start with simple regex, adopt pregex when readability matters, and switch to pyparsing when structure becomes complex.
Related Tutorials
Here are some related text processing tools:

Text similarity matching: 4 Text Similarity Tools: When Regex Isn’t Enough compares regex preprocessing, difflib, RapidFuzz, and Sentence Transformers for matching product names and handling data variations
Business entity extraction: langextract vs spaCy: AI-Powered vs Rule-Based Entity Extraction evaluates regex, spaCy, GLiNER, and langextract for extracting structured information from financial documents


Choose the Right Text Pattern Tool: Regex, Pregex, or Pyparsing Read More »

Faker: Generate Realistic Test Data in Python with One Line of Code

Table of Contents

Motivation
Basics of Faker
Location-Specific Data Generation
Create Text
Create Profile Data
Create Random Python Datatypes
Conclusion

Motivation
Let’s say you want to create data with certain data types (bool, float, text, integers) and special characteristics (names, addresses, colors, emails, phone numbers, locations) to test a Python library or a specific implementation. But finding that specific kind of data takes time. You wonder: is there a quick way to create your own data?
What if there is a package that enables you to create fake data in one line of code such as this:
fake.profile()

{
'address': '076 Steven Trace\nJillville, ND 12393',
'birthdate': datetime.date(1981, 11, 19),
'blood_group': 'O-',
'company': 'Johnson-Rodriguez',
'current_location': (Decimal('61.969848'), Decimal('121.407164')),
'job': 'Patent examiner',
'mail': 'ohicks@hotmail.com',
'name': 'Katie Romero',
'residence': '271 Smith Wells\nMichaelport, MN 40933',
'sex': 'F',
'ssn': '281-84-3963',
'username': 'eparker',
'website': ['https://www.gonzalez.com/', 'https://rogers-scott.com/']
}

This can be done with Faker, a Python package that generates fake data for you, ranging from a specific data type to specific characteristics of that data, and the origin or language of the data. Let’s discover how we can use Faker to create fake data.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Basics of Faker
Start with installing the package:
pip install Faker

Import Faker:
from faker import Faker

fake = Faker()

Some basic methods of Faker:
print(fake.color_name())
print(fake.name())
print(fake.address())
print(fake.job())
print(fake.date_of_birth(minimum_age=30))
print(fake.city())

Tan
Kristin Buck
715 Peter Views
Abigailport, ME 57602
Systems analyst
1946-03-07
Evanmouth

Let’s say you are the author of a fiction book who wants to create a character but finds it difficult and time-consuming to come up with a realistic name and information. You can write:
name = fake.name()
color = fake.color_name()
city = fake.city()
job = fake.job()

print(f'Her name is {name}. She lives in {city}. Her favorite color is {color}. She works as a {job}')

Her name is Debra Armstrong. She lives in Beanview. Her favorite color is GreenYellow. She works as a Lawyer

With Faker, you can generate a persuasive example instantly!
Location-Specific Data Generation
Luckily, we can also specify the locale of the data we want to fake. Maybe the character you want to create is from Italy, and you also want to create instances of her friends. Since you are from the US, it is difficult to come up with information relevant to that location. This can easily be handled by passing a locale to the Faker constructor:
fake = Faker('it_IT')

for _ in range(10):
print(fake.name())

Angelica Donarelli-Marangoni
Rosaria Castiglione
Federica Iacovelli
Puccio Armellini
Dina Donini-Alboni
Dott. Carolina Marrone
Olga Nosiglia
Graziella Russo
Paulina Galiazzo
Dott. Riccardo Padovano

Or create information from multiple locations:
fake = Faker(['ja_JP','zh_CN','es_ES','en_US','fr_FR'])

for _ in range(10):
print(fake.city())

齐齐哈尔市
Blakefort
North Joeborough
玉兰市
Saint Suzanne-les-Bains
Melilla
調布市
富津市
Maillot-sur-Mer
East Jamesshire

If you are from one of these countries, I hope you recognize the locations. If you are curious about other locales you can specify, check out the doc here.
Create Text
Create Random Text
We can create random text with:
fake = Faker('en_US')
print(fake.text())

Gas threat perhaps minute energy thus. Relate group science car discussion budget art.
Let visit reach senior. Story once list almost. Enough major everyone.

Try with the Vietnamese language:
fake = Faker('vi_VN')
print(fake.text())

Như không cho số vậy tại đến. Hơn các thay. Khi từ cũng không rất là.
Gần được cho có nơi như vẫn cho. Nơi đi về giống.
Mà cũng từ nhưng lớn. Từng của nếu khi như nhưng.

None of this random text makes sense, but it is a quick way to create text for testing.
Create Text from Selected Words
Or we can also create text from a list of words:
fake = Faker()
my_information = ['dog','swimming', '21', 'slow', 'girl', 'coffee', 'flower','pink']

print(fake.sentence(ext_word_list=my_information))
print(fake.sentence(ext_word_list=my_information))

Coffee pink coffee.
Dog pink 21 pink.
Create Profile Data
We can quickly create a profile with:
fake = Faker()
fake.profile()

{'job': 'Nurse, adult',
'company': 'Johnson, Moore and Glover',
'ssn': '762-56-8929',
'residence': '742 Shane Groves\nLake Jasminefort, GU 12583',
'current_location': (Decimal('-77.3842165'), Decimal('7.407430')),
'blood_group': 'B-',
'website': ['https://brooks.com/'],
'username': 'brownamanda',
'name': 'Carolyn Navarro',
'sex': 'F',
'address': '505 Lewis Grove Apt. 588\nHowardville, ID 68181',
'mail': 'larry00@hotmail.com',
'birthdate': datetime.date(1946, 6, 13)}

As we can see, most relevant information about a person is created with ease, including mail, SSN, username, and website.
What is even more useful is that we can create a dataframe of 100 users from different countries:
import pandas as pd

fake = Faker(['it_IT','ja_JP', 'zh_CN', 'de_DE','en_US'])
profiles = [fake.profile() for i in range(100)]

pd.DataFrame(profiles).head()

job
company
ssn
residence
current_location
blood_group
website
username
name
sex
address
mail
birthdate

0
Physiological scientist
Sobrero-Mazzanti Group
CLGTNO59H42A473Z
Incrocio Cabrini, 14 Appartamento 59\n74100, L…
(-88.2637715, 149.968584)
AB+
[http://federici-endrizzi.it/, http://www.paru…]
giuliagreco
Dott. Liliana Serraglio
F
Vicolo Milo, 0\n64020, Ripattoni (TE)
giolittiflavio@gmail.com
1998-10-10

1
花火師
阿部運輸株式会社
701-41-9799
和歌山県印旛郡本埜村鳥越20丁目23番18号
(79.245074, 109.117174)
O+
[https://suzuki.com/, http://ishikawa.jp/]
lyamamoto
斉藤 明美
F
東京都江戸川区神明内40丁目12番20号
akemiyamada@yahoo.com
1916-12-09

2
小説家
小林食品株式会社
103-28-5057
島根県富津市細野7丁目16番1号
(-84.3304275, 38.093874)
A+
[https://tanaka.jp/, http://www.fujita.net/, h…]
minoru62
渡辺 英樹
M
青森県川崎市川崎区長畑22丁目27番12号
minoru35@yahoo.com
2008-02-17

3
ゲームクリエイター
佐藤水産有限会社
123-85-7967
宮城県調布市隼町3丁目22番12号 アーバン台東327
(-49.3689775, -134.762867)
AB-
[http://www.sato.org/, http://kato.net/, http:…]
ayamamoto
鈴木 洋介
M
栃木県川崎市中原区虎ノ門30丁目27番20号
yuta56@hotmail.com
1917-01-25

4
薬剤師
合同会社高橋建設
891-98-2169
山梨県山武郡横芝光町轟4丁目22番10号 コート天神島159
(-62.1493985, -105.171377)
B+
[http://yamashita.jp/, http://www.shimizu.com/]
yosukekimura
田中 真綾
F
山口県府中市下吉羽6丁目20番2号
hayashiyuki@yahoo.com
2001-08-09

Create Random Python Datatypes
If we just care about the type of the data, without caring much about its content, we can easily generate random datatypes such as:
Boolean:
print(fake.pybool())

False

A list of 5 elements with different data_type:
print(fake.pylist(nb_elements=5, variable_nb_elements=True))

['juan28@example.org', 8515, 6618, 'UexWQJkGrJFGBAVfHgUt']

A decimal with 5 left digits and 6 right digits (after the .):
print(fake.pydecimal(left_digits=5, right_digits=6, positive=False, min_value=None, max_value=None))

-26114.564612

You can find more about other Python datatypes that you can create here.
Conclusion
I hope you find Faker a helpful tool for creating data efficiently. Even if you have no immediate use for it, it is good to know that a tool exists that lets you generate realistic data with ease for specific needs such as testing.
Feel free to check out more information about Faker here.
Faker: Generate Realistic Test Data in Python with One Line of Code

3 Tools That Automatically Convert Python Code to LaTeX Math

Table of Contents

Introduction
Key Takeaways
Setting Up the Environment
IPython.display.Latex: Built-in LaTeX Rendering
handcalcs: Step-by-Step Calculations
latexify-py: Automated Function Conversion
SymPy: Symbolic Mathematics
Summary
Related Tutorials

Introduction
Imagine you are a financial analyst building financial models in Python who needs to present them to non-technical executives. Since they are not familiar with Python, you need to show them the mathematical foundations behind your algorithms, not just code blocks. How can you do that?
The best way to present mathematical models is LaTeX, a powerful system for typesetting mathematical notation and equations that is widely used in academic papers and technical reports.
However, writing LaTeX by hand is tedious, especially for complex equations. In this article, you will learn how to convert Python code to LaTeX in Jupyter notebooks using four tools: IPython.display.Latex, handcalcs, latexify-py, and SymPy.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

Key Takeaways
Here’s what you’ll learn:

Transform Python calculations into professional LaTeX equations using four specialized tools
Generate step-by-step mathematical documentation automatically with handcalcs magic commands
Convert Python functions to clean LaTeX notation instantly using latexify-py decorators
Perform symbolic mathematics including equation solving and algebraic manipulation with SymPy

Setting Up the Environment
Install the required packages using pip or uv.
# Using pip
pip install handcalcs latexify-py sympy

# Using uv (recommended)
uv add handcalcs latexify-py sympy

📚 For production-ready notebook workflows and development best practices, check out Production-Ready Data Science.

IPython.display.Latex: Built-in LaTeX Rendering
The simplest approach uses Jupyter’s built-in IPython.display.Latex rendering. It is ideal when you want precise control over mathematical notation.
Let’s create a professional-looking compound interest calculation.
Start with defining the variables, where P is the principal amount, r is the annual interest rate, and t is the number of years.
P = 10000 # principal amount
r = 0.08 # annual interest rate
t = 5 # number of years

Now we are ready to display the calculation in LaTeX. To make the calculation easier to follow, we will break it down into three parts:

Display the formula
Substitute the variables into the formula
Display the result

from IPython.display import Latex, display

# Calculate the result
A = P * (1 + r) ** t

# Display the calculation with LaTeX
display(Latex(r"$A = P(1 + r)^t$"))
display(Latex(f"$A = {P:,}(1 + {r})^{{{t}}}$"))
display(Latex(f"$A = {A:,.2f}$"))

 
\displaystyle A = P(1 + r)^t
\displaystyle A = 10{,}000\,(1 + 0.08)^{5}
\displaystyle A = 14{,}693.28
 
This shows step-by-step substitutions, but writing LaTeX for each step is manual and slow.
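One subtlety in the manual approach is the f-string brace escaping: doubled braces emit literal braces for LaTeX, while the innermost pair interpolates the value. A quick plain-Python check of what the middle string expands to:

```python
P, r, t = 10000, 0.08, 5

# {{ and }} emit literal braces for LaTeX; {t} interpolates, so {{{t}}} -> {5}
step = f"$A = {P:,}(1 + {r})^{{{t}}}$"
print(step)  # $A = 10,000(1 + 0.08)^{5}$
```
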
Wouldn’t it be nice if the steps, substitutions, and LaTeX code were generated automatically from plain Python code? That is where handcalcs comes in.
handcalcs: Step-by-Step Calculations
handcalcs automatically converts Python calculations into step-by-step mathematical documentation. It’s perfect for technical reports and educational content.
Jupyter Magic Command
To use handcalcs in Jupyter, we need to load the extension first.
import handcalcs.render

# Enable handcalcs in Jupyter
%load_ext handcalcs.render

Now we can use the %%render magic command to render the calculation.
%%render
# Step-by-step substitutions for compound interest
A = P * (1 + r)**t

 
\displaystyle A = P\,\left(1+r\right)^{t} = 10000\,\left(1+0.080\right)^{5} = 14693.281
 
This renders as a complete step-by-step calculation showing all substitutions and intermediate results. All without writing a single line of LaTeX code!
Function Decorator
Use the @handcalc decorator to render calculations from regular Python functions. Set jupyter_display=True to display the rendered LaTeX in Jupyter.
from handcalcs import handcalc

@handcalc(jupyter_display=True)
def calculate_compound_interest(P, r, t):
    A = P * (1 + r)**t
    return A

# Calling the function renders the calculation with substitutions
result = calculate_compound_interest(10000, 0.08, 5)

The return value is a plain number that can be used in further calculations.
print(f"Result: {result:,.2f}")

Output:
Result: 14,693.28

latexify-py: Automated Function Conversion
Unlike handcalcs, which renders step-by-step numeric substitutions, latexify-py focuses on function-level documentation without the intermediate arithmetic. It’s ideal when you want a clean, reusable formula and don’t need to show the intermediate steps.
import latexify

# Simple function conversion
@latexify.function
def A(P, r, t):
    return P * (1 + r) ** t

A

 
\displaystyle A(P, r, t) = P \cdot \mathopen{}\left( 1 + r \mathclose{}\right)^{t}
 
A function decorated with latexify.function can still be called like a normal Python function to compute the result.
result = A(10000, 0.08, 5)
print(f"Result: {result:,.2f}")

Output:
Result: 14,693.28

SymPy: Symbolic Mathematics
handcalcs and latexify-py excel at rendering clear results from concrete values, but they cannot handle symbolic tasks such as solving for a variable or computing derivatives and integrals. For these tasks, use SymPy.
To create a symbolic equation, start by defining the symbols and the equation.
from sympy import symbols, Eq, solve

# Define the variables
A, P, r, t = symbols("A P r t", positive=True)

# Define the equation
eq = Eq(A, P * (1 + r) ** t)
eq

 
\displaystyle A = P \left(r + 1\right)^{t}
 
After setting up the equation, we are ready to perform symbolic calculations.
Substitute Variables
Let’s compute A (amount) from given P (principal), r (interest rate), and t (time) by solving for A and substituting values.
# Solve for A
A_expr = solve(eq, A)[0]

# Substitute the values
A_result = A_expr.subs({P: 10000, r: 0.08, t: 5})
A_result

 
\displaystyle 14693.280768
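As a quick sanity check (plain floating-point arithmetic, independent of SymPy), direct evaluation gives the same number:

```python
P, r, t = 10000, 0.08, 5

# Direct float evaluation of A = P * (1 + r)**t
A = P * (1 + r) ** t
print(round(A, 6))  # 14693.280768
```
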
 
Solve for a Variable
Let’s solve the equation for t to answer the question: “How many years will it take for the investment to reach A (amount) given P (principal) and r (interest rate)?”
t_sol = solve(eq, t)[0]
t_sol

The result is the formula to solve for t.
 
\displaystyle \frac{\log{\left(A \right)} - \log{\left(P \right)}}{\log{\left(r + 1 \right)}}
 
Now we can substitute the variables into the equation to answer a more specific question: “How many years will it take for the investment to reach $5,000 given a principal of $1,000 and an annual interest rate of 8%?”
t_result = t_sol.subs({P: 1000, r: 0.08, A: 5000}).evalf(2)
t_result

 
\displaystyle 21.0
 
The result shows that it will take approximately 21 years for the investment to reach $5,000.
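The closed-form expression for t can be cross-checked with plain math.log (a small sketch, independent of SymPy):

```python
import math

A_target, P, r = 5000, 1000, 0.08

# t = (log(A) - log(P)) / log(1 + r)
t_years = (math.log(A_target) - math.log(P)) / math.log(1 + r)
print(round(t_years, 1))  # 20.9
```

Rounded up, this matches the roughly 21 years SymPy reported.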
Expand and Factor an Expression
We can also use SymPy to expand and factor an expression.
Assume t = 2. With an annual rate r and two compounding periods, the expression becomes:
compound_expr = P * (1 + r) ** 2
compound_expr

 
\displaystyle P(1 + r)^2
 
Let’s expand the expression using the expand function.
from sympy import expand

expanded_expr = expand(compound_expr)
expanded_expr

 
\displaystyle P r^{2} + 2 P r + P
 
We can then turn the expression back into a product of factors using the factor function.
from sympy import factor

factored_expr = factor(expanded_expr)
factored_expr

 
\displaystyle P \left(r + 1\right)^{2}
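SymPy also handles the calculus side; for example (a minimal sketch, not part of the original walkthrough), differentiating the compound-interest expression with respect to t:

```python
from sympy import symbols, diff, log

P, r, t = symbols("P r t", positive=True)

# d/dt [P * (1 + r)**t] = P * (1 + r)**t * log(1 + r)
dA_dt = diff(P * (1 + r) ** t, t)
print(dA_dt)
```

This is the kind of symbolic manipulation that the rendering-focused tools above cannot do at all.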
 
Summary
Converting Python code to LaTeX in Jupyter notebooks transforms your technical documentation from code-heavy to mathematically elegant. Here’s when to use each tool:

Use IPython.display.Latex when: You need precise control over mathematical notation
Use handcalcs when: You want step-by-step calculation documentation
Use latexify-py when: You want automatic function-to-LaTeX conversion
Use SymPy when: You need symbolic math, such as solving equations or computing derivatives and integrals

Related Tutorials

Broader Visualization: Top 6 Python Libraries for Visualization: Which One to Use? for comprehensive visualization options beyond mathematical notation
Advanced Mathematical Operations: 5 Essential Itertools for Data Science for complex mathematical workflows and performance optimization
