5 Python Tools for Structured LLM Outputs: A Practical Comparison

Table of Contents

Introduction
Post-Generation Validation
Instructor: Simplest Integration
PydanticAI: Type-Safe Agents
LangChain: Ecosystem Integration

Pre-Generation Constraints
Outlines: Guaranteed Valid JSON
Guidance: Branching During Generation

Final Thoughts

Introduction
An LLM can give you exactly the information you need, just not in the shape you asked for. The content may be correct, but when the structure is off, it can break downstream systems that expect a specific format.
Consider these common structured output challenges:
Invalid JSON: LLMs often wrap JSON in conversational text, causing json.loads() to fail even when the data is correct.
Here's the task information you requested:
{"title": "Review report", "priority": "high"}
Let me know if you need anything else!

Missing fields: LLMs skip required properties like hours or completed, even when the schema requires them.
{"title": "Review report", "priority": "high"}
# Missing: hours, completed

Wrong types: LLMs may return strings like “four” instead of numeric values, causing type errors in downstream processing.
{"title": "Review report", "hours": "four"}
# Expected: "hours": 4.0

Schema violations: Output passes type checks but breaks business rules like maximum values or allowed ranges.
{"title": "Review report", "hours": 200}
# Constraint: hours must be <= 100
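A minimal standard-library sketch shows how the first failure mode bites and a common workaround: pull the JSON object out of the conversational wrapper before parsing. The extract_json helper is illustrative, not part of any tool covered below.

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull the first {...} block out of a chatty LLM reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    return json.loads(match.group(0))

# A reply wrapped in conversational text: json.loads(raw_reply) would fail
raw_reply = (
    "Here's the task information you requested:\n"
    '{"title": "Review report", "priority": "high"}\n'
    "Let me know if you need anything else!"
)

print(extract_json(raw_reply))  # → {'title': 'Review report', 'priority': 'high'}
```

This handles the wrapper problem but none of the others (missing fields, wrong types, schema violations), which is why dedicated tools go further.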

This article covers five tools that solve these problems using two different approaches:
Post-Generation Validation
The LLM generates output freely, then validation checks the result against your schema. If validation fails, the error is sent back to the LLM for self-correction.
Here are the pros and cons of this approach:

Pros: Works with any LLM provider (OpenAI, Anthropic, local models). No special setup required.
Cons: Retries cost extra API calls. Complex schemas may need multiple attempts.

LLM Output → Validate → Failed: "hours must be float"

Retry with error

LLM Output → Validate → Success: {"hours": 4.0}

Tools using this approach: Instructor, PydanticAI, LangChain
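The validate-then-retry loop that all three tools implement can be sketched with a stub in place of the real client. Here llm_call and the hours check are stand-ins for the LLM API and the Pydantic schema:

```python
def llm_call(prompt: str) -> dict:
    """Stub LLM: returns a wrong type first, the corrected value once it sees the error."""
    if "hours must be" in prompt:  # error feedback present → corrected answer
        return {"title": "Review report", "hours": 4.0}
    return {"title": "Review report", "hours": "four"}  # first attempt: wrong type

def generate_with_retries(prompt: str, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        output = llm_call(prompt)
        if isinstance(output.get("hours"), float):  # the "validation" step
            return output
        # Feed the validation error back into the prompt for self-correction
        prompt += "\nValidation failed: hours must be a float. Please fix."
    raise RuntimeError("validation still failing after retries")

print(generate_with_retries("Extract task info"))  # → {'title': 'Review report', 'hours': 4.0}
```

Each retry is a fresh API call, which is where the extra cost of this approach comes from.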
Pre-Generation Constraints
Instead of fixing errors after generation, invalid tokens are blocked during generation. The LLM can only output valid JSON because invalid choices are never available.
Here are the pros and cons of this approach:

Pros: 100% schema compliance. No wasted API calls on invalid outputs.
Cons: Requires local models or specific inference servers. More setup complexity.

Schema: "priority" must be "low", "medium", or "high"

LLM generates → Only valid tokens available → {"priority": "high"}

100% valid output (no retries)

Tools using this approach: Outlines, Guidance
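The masking mechanism can be sketched in a few lines of plain Python: at each step, keep only the tokens that extend the partial output toward some allowed value. The single-character vocabulary here is a toy stand-in for a real tokenizer:

```python
def valid_next_tokens(partial: str, allowed: list[str], vocab: list[str]) -> list[str]:
    """Keep only tokens that extend `partial` toward some allowed value."""
    return [
        tok for tok in vocab
        if any(v.startswith(partial + tok) for v in allowed)
    ]

allowed = ["low", "medium", "high"]
vocab = ["l", "o", "w", "h", "i", "g", "m", "e", "d", "u", "x", "z"]

# Starting from "hi", only the token that stays inside "high" survives
print(valid_next_tokens("hi", allowed, vocab))  # → ['g']
```

Because invalid tokens never appear in the candidate set, the model cannot produce an out-of-schema value in the first place.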

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


Post-Generation Validation
With these tools, the LLM generates freely without constraints. Validation happens afterward, and failed outputs can trigger retries.
Instructor: Simplest Integration
Instructor (12.3k stars) wraps any LLM client with Pydantic validation and automatic retry.
Unlike PydanticAI’s dependency injection or LangChain’s ecosystem complexity, Instructor stays focused on one thing: structured outputs with minimal code.
To install Instructor, run:
pip install instructor

This article uses instructor v1.14.4.
To use Instructor:

Define a Pydantic model with your desired fields
Wrap your LLM client (OpenAI, Anthropic, Ollama, etc.) with Instructor
Pass the model as response_model in your API call

The code below extracts sales lead information from an email:
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

class SalesLead(BaseModel):
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]

client = instructor.from_openai(OpenAI())

email = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan. Can we schedule a demo?"

lead = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Extract sales lead info from this email: {email}"}],
    response_model=SalesLead,
    max_retries=3,
)
print(lead)

Output:
company_size='enterprise' priority='high'

Each field matches the schema: company_size and priority are constrained to the allowed Literal values.
The first LLM response may return an invalid value like “large” instead of “enterprise”. When this happens, Instructor sends the validation error back for self-correction.
PydanticAI: Type-Safe Agents
PydanticAI (14.5k stars) brings FastAPI’s developer experience to AI agents.
While Instructor focuses on extraction, PydanticAI adds tools and dependency injection for fetching external data.
To install PydanticAI, run:
pip install pydantic-ai

This article uses pydantic-ai v1.48.0.
PydanticAI uses async internally. If running in a Jupyter notebook, apply nest_asyncio to avoid event loop conflicts:
import nest_asyncio

nest_asyncio.apply()

For basic extraction, PydanticAI takes a different approach with an Agent abstraction, but the output resembles Instructor's:
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import Literal

class SalesLead(BaseModel):
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]

agent = Agent("openai:gpt-4o", output_type=SalesLead)

email = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan. Can we schedule a demo?"

result = agent.run_sync(f"Extract sales lead info from this email: {email}")
print(result.output)

company_size='enterprise' priority='high'

Where PydanticAI stands out is tools and dependency injection. Tools are functions the agent can call during generation to fetch external data. Dependency injection passes data into those tools without hardcoding values.
To use PydanticAI with tools and dependency injection:

Create a dataclass for external data (e.g., pricing table)
Add deps_type to the agent to specify the dependency class
Decorate functions with @agent.tool to make them callable
Provide dependencies when calling run_sync()

from pydantic_ai import Agent, RunContext
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Literal

class SalesLead(BaseModel):
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]
    monthly_price: int

@dataclass
class PricingTable:
    prices: dict[str, int]

agent = Agent(
    "openai:gpt-4o",
    deps_type=PricingTable,
    output_type=SalesLead,
)

@agent.tool
def get_price(ctx: RunContext[PricingTable], company_size: str) -> str:
    """Get monthly price for a company size tier."""
    price = ctx.deps.prices.get(company_size.lower(), 0)
    return f"Monthly price for {company_size}: ${price}"

email = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan."

result = agent.run_sync(
    f"Extract sales lead info from this email: {email}",
    deps=PricingTable(prices={"startup": 99, "smb": 499, "enterprise": 1999}),
)
print(result.output)

company_size='enterprise' priority='high' monthly_price=1999

The output shows monthly_price=1999, which matches the enterprise tier in the PricingTable. The LLM called get_price("enterprise") to retrieve this value.
For a deeper dive into PydanticAI’s capabilities, see Enforce Structured Outputs from LLMs with PydanticAI.
LangChain: Ecosystem Integration
LangChain (125k stars) offers structured outputs as part of a comprehensive framework.
While Instructor and PydanticAI focus on extraction, LangChain provides structured outputs as part of a larger ecosystem. This includes integrations with vector stores, tools, and monitoring.
To install LangChain, run:
pip install langchain langchain-openai

This article uses langchain v1.2.7 and langchain-openai v1.1.7.
To use LangChain for structured outputs:

Create a chat model (OpenAI, Anthropic, Google, etc.)
Call .with_structured_output(YourModel) to add schema enforcement
Use .invoke() with your prompt

The code below extracts sales lead information from an email:
from langchain_openai import ChatOpenAI
from pydantic import BaseModel
from typing import Literal

class SalesLead(BaseModel):
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]

model = ChatOpenAI(model="gpt-4o")
structured = model.with_structured_output(SalesLead)

email = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan. Can we schedule a demo?"

lead = structured.invoke(f"Extract sales lead info from this email: {email}")
print(lead)

Output:
company_size='enterprise' priority='high'

The output resembles Instructor's and PydanticAI's, since all three use Pydantic models for schema enforcement.
LangChain’s value is ecosystem integration. You can combine structured outputs with:

Vector stores for RAG pipelines
Document loaders for PDFs, web pages, and databases
Memory for conversation history
LangSmith for monitoring and tracing
And many more integrations

When to Use Each Tool
LangChain covers the most features, but I find the simpler tools easier to maintain when you don’t need the full ecosystem.

Instructor: One pip install, zero framework concepts. Choose when extraction is your only need.
PydanticAI: Adds tools without the full LangChain ecosystem. Choose when you need external data but not RAG or memory.
LangChain: Full ecosystem with learning curve. Choose when you’re already using LangChain or need its integrations.

For production patterns like PII filtering and human approval workflows, see Build Production-Ready LLM Agents with LangChain 1.0 Middleware.
Pre-Generation Constraints
Unlike post-generation validation tools, which check output after the fact, these tools guide the LLM token by token: invalid tokens are blocked before they are generated. This guarantees 100% schema compliance, with no API calls wasted on invalid outputs.
Outlines: Guaranteed Valid JSON
Outlines (13.3k stars) guarantees valid output by constraining token sampling during generation.
Among pre-generation constraint tools, Outlines is the simplest.
To install Outlines, run:
pip install outlines

This article uses outlines v1.2.9.
The code resembles Instructor, but works differently. At each generation step, Outlines checks which tokens would keep the output valid and blocks all others. The model can only choose from schema-compliant tokens:
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from pydantic import BaseModel
from typing import Literal

class SalesLead(BaseModel):
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]

# Load local model for direct token control
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B"),
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B"),
)

email = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan."

result = model(
    f"Extract sales lead info from this email: {email}",
    SalesLead,
    max_new_tokens=100,
)
print(result)

Output:
company_size='enterprise' priority='high'

The company_size and priority fields contain valid Literal values. Invalid values are impossible because Outlines blocks those tokens during generation.
Beyond schema validation, Outlines supports regex and choice constraints that block invalid tokens during generation.
For example, this regex enforces a phone number format:
from outlines.types import Regex

result = model("New York office phone number:", output_type=Regex(r"\(\d{3}\) \d{3}-\d{4}"))
print(result)

Output:
(212) 555-0147

Similarly, a Literal type restricts output to predefined values:
Sentiment = Literal["positive", "negative", "neutral"]
result = model("The product exceeded expectations! Sentiment:", output_type=Sentiment)
print(result)

Output:
positive

These constraints work at the token level: the model cannot generate invalid characters because they are blocked before generation.
Guidance: Branching During Generation
Guidance (19k stars) lets you run Python control flow during generation.
Like Outlines, Guidance uses token masking to enforce schema compliance. Guidance goes further by letting Python if/else statements run as the model generates. The model’s output becomes a variable you can check, then generation continues down the chosen branch.
To install Guidance, run:
pip install guidance

This article uses guidance v0.3.0.
The @guidance decorator creates reusable functions that combine branching with constrained output:

select() constrains the model to choose from a fixed list of options
Python if/else runs during generation based on the model’s choice
gen_json() constrains output to match different schemas per branch

from guidance import models, system, user, assistant, select, guidance
from guidance import json as gen_json
from pydantic import BaseModel
from typing import Literal

class SalesLead(BaseModel):
    model_config = dict(extra="forbid")
    company_size: Literal["startup", "smb", "enterprise"]
    priority: Literal["low", "medium", "high"]

class SupportTicket(BaseModel):
    model_config = dict(extra="forbid")
    issue_type: Literal["billing", "technical", "account"]
    urgency: Literal["low", "medium", "high"]

lm = models.Transformers("Qwen/Qwen2.5-1.5B")

@guidance
def classify_email(lm, email):
    with system():
        lm += "You classify emails and extract structured data."
    with user():
        lm += f"Classify and extract info from: {email}"
    with assistant():
        lm += f"Category: {select(['sales', 'support'], name='category')}\n"
        if lm["category"] == "sales":
            lm += gen_json(name="result", schema=SalesLead)
        else:
            lm += gen_json(name="result", schema=SupportTicket)
    return lm

email1 = "Hi, I'm the CTO of a 500-person company. We're interested in your enterprise plan."
result1 = lm + classify_email(email1)
print(f"Category: {result1['category']}, Result: {result1['result']}")

Output:
Category: sales, Result: {"company_size": "enterprise", "priority": "high"}

The model classified this as “sales” and generated a SalesLead with enterprise company size and high priority.
The @guidance decorator makes the function reusable. Calling it with a different email runs the same branching logic:
email2 = "URGENT: My account is locked and I can't log in. Please help!"
result2 = lm + classify_email(email2)
print(f"Category: {result2['category']}, Result: {result2['result']}")

Output:
Category: support, Result: {"issue_type": "account", "urgency": "high"}

This time the model classified the email as “support” and generated a SupportTicket instead. The branching logic automatically selected the correct schema based on the classification.
When to Use Each Tool

Outlines: Choose when you need guaranteed schema compliance with straightforward extraction. Simpler API, easier to get started.
Guidance: Choose when you need branching logic during generation. Python if/else runs as the model generates, enabling different schemas per branch.

Final Thoughts
This article covered two approaches to structured LLM outputs:

Post-generation validation (Instructor, PydanticAI, LangChain): Works with any provider. Instructor and PydanticAI automatically retry on validation failure; LangChain requires explicit retry configuration.
Pre-generation constraints (Outlines, Guidance): Blocks invalid tokens during generation, guarantees valid output

I recommend starting with post-generation tools for their simplicity and provider flexibility. Switch to pre-generation tools when you want to eliminate retry costs or need constraints like regex patterns.
Did I miss a tool you use for structured outputs? Let me know in the comments.

📚 Want to go deeper? My book shows you how to build data science projects that actually make it to production. Get the book →



From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI

Table of Contents

Introduction
What is ScrapeGraphAI?
Setup
Installation
OpenAI Configuration
Local Models with Ollama

Natural Language Prompts
Structured Output with Pydantic
JavaScript Content
Multi-Page Scraping
Key Takeaways

Introduction
BeautifulSoup is the go-to library for web scraping thanks to its simple API and flexible parsing. The workflow is straightforward: fetch HTML, inspect elements in DevTools, and write selectors to extract data:
from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").text
    books.append({"title": title, "price": price})

pprint(books[:3])

Output:
[{'price': '£51.77', 'title': 'A Light in the Attic'},
{'price': '£53.74', 'title': 'Tipping the Velvet'},
{'price': '£50.10', 'title': 'Soumission'}]

The output is correct, but selectors are tightly coupled to the HTML structure. This means when the site redesigns, everything breaks, so you spend more time maintaining selectors than extracting data:
# Before: <article class="product_pod">
# After: <div class="book-card">
soup.select("article.product_pod") # Now returns []

# Before: <p class="price_color">£51.77</p>
# After: <span class="price">£51.77</span>
soup.select_one("p.price_color") # Returns None, crashes on .text

What if you could just describe the data you want and let an LLM figure out the extraction? That’s where ScrapeGraphAI comes in.

💻 Get the Code: The complete source code for this tutorial is available on GitHub. Clone it to follow along!

What is ScrapeGraphAI?
ScrapeGraphAI is an open-source Python library for LLM-powered web scraping. Rather than writing CSS selectors, you describe the data you want in plain English.
Key benefits:

No selector maintenance: Describe what data you want, not where it lives in the HTML
Self-healing scrapers: The LLM adjusts automatically when websites redesign
Structured output: Define Pydantic schemas for type-safe extraction
JavaScript support: Built-in rendering for React, Vue, and Angular sites
Multi-provider: Use OpenAI, Anthropic, or local models via Ollama


Setup
Installation
Install ScrapeGraphAI and Playwright for browser automation:
pip install scrapegraphai playwright
playwright install

OpenAI Configuration
For cloud-based extraction, you’ll need an OpenAI API key. Store it in a .env file:
OPENAI_API_KEY=your-api-key-here

Then load it in your script:
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

Local Models with Ollama
For zero API costs, use local models via Ollama. ScrapeGraphAI requires two models:

LLM (llama3.2): Interprets your prompts and extracts data
Embedding model (nomic-embed-text): Converts page content into a format the LLM can search

📖 New to Ollama? See our complete guide to running local LLMs with Ollama.

Install Ollama and pull both:
# Install Ollama from https://ollama.ai
ollama pull llama3.2
ollama pull nomic-embed-text

Then configure ScrapeGraphAI to use local inference:
graph_config_local = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": False,
    "headless": True,
}

The same extraction code works with both configurations. Switch between cloud and local by changing the config.
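One way to exploit that interchangeability is to pick the config from an environment variable. SCRAPER_BACKEND is an invented name for this sketch, not a ScrapeGraphAI setting:

```python
import os

def pick_config(cloud_config: dict, local_config: dict) -> dict:
    """Select the scraper config from an environment variable (defaults to cloud)."""
    backend = os.getenv("SCRAPER_BACKEND", "cloud")
    return local_config if backend == "local" else cloud_config

cloud = {"llm": {"model": "openai/gpt-4o-mini"}}
local = {"llm": {"model": "ollama/llama3.2"}}

os.environ["SCRAPER_BACKEND"] = "local"
print(pick_config(cloud, local)["llm"]["model"])  # → ollama/llama3.2
```

The rest of the scraping code stays untouched; only the dict passed as config changes.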
Natural Language Prompts
ScrapeGraphAI extraction works in three steps:

Prompt: Describe the data you want in plain English
Source: Provide the URL to scrape
Config: Set your LLM provider and credentials

Pass these to SmartScraperGraph and call run():
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the first 5 book titles and their prices",
    source="https://books.toscrape.com",
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'content': [{'price': '£51.77', 'title': 'A Light in the Attic'},
{'price': '£53.74', 'title': 'Tipping the Velvet'},
{'price': '£50.10', 'title': 'Soumission'},
{'price': '£47.82', 'title': 'Sharp Objects'},
{'price': '£54.23', 'title': 'Sapiens: A Brief History of Humankind'}]}

The LLM understood “first 5 book titles and their prices” without any knowledge of the page’s HTML structure.
Structured Output with Pydantic
Raw scraped data often needs cleaning and validation. With ScrapeGraphAI, you can define a Pydantic schema to get type-safe, validated output directly from extraction.
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List
import os

load_dotenv()

class Book(BaseModel):
    title: str = Field(description="The title of the book")
    price: float = Field(description="Price in GBP as a number")
    rating: int = Field(description="Star rating from 1 to 5")

class BookCatalog(BaseModel):
    books: List[Book]

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the first 3 books with their titles, prices, and star ratings",
    source="https://books.toscrape.com",
    schema=BookCatalog,
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'books': [{'price': 51.77, 'rating': 5, 'title': 'A Light in the Attic'},
{'price': 53.74, 'rating': 5, 'title': 'Tipping the Velvet'},
{'price': 50.1, 'rating': 5, 'title': 'Soumission'}]}

The output matches the Pydantic schema:

price: Converted from the '£51.77' string to the float 51.77
rating: Extracted from the star icons as the integer 5
title: Captured as a string

Missing or invalid fields raise validation errors instead of slipping through.

The data is analysis-ready, so you don’t need any post-processing in pandas.
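As a quick illustration with the three validated books above, the prices can go straight into a standard-library aggregate with no string cleaning:

```python
from statistics import mean

# The schema-validated output from the extraction above
books = [
    {"title": "A Light in the Attic", "price": 51.77, "rating": 5},
    {"title": "Tipping the Velvet", "price": 53.74, "rating": 5},
    {"title": "Soumission", "price": 50.1, "rating": 5},
]

# Prices are already floats, so arithmetic works directly
avg_price = round(mean(b["price"] for b in books), 2)
print(avg_price)  # → 51.87
```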
For more advanced LLM output validation patterns, see our PydanticAI guide.
JavaScript Content
Modern websites built with React, Vue, or Angular render content dynamically. BeautifulSoup only parses the initial HTML before JavaScript runs, so it misses the actual content.
To demonstrate this, let’s fetch a JavaScript-rendered page with BeautifulSoup:
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get("https://quotes.toscrape.com/js/").content, "html.parser")
print(soup.select(".quote"))

Output:
[]

The result is an empty list because the content loads via JavaScript after the initial HTML is served.
Selenium can handle JavaScript, but requires explicit waits and complex timing logic.
ScrapeGraphAI uses Playwright to handle JavaScript rendering automatically. The headless parameter controls whether the browser runs visibly or in the background:
from scrapegraphai.graphs import SmartScraperGraph
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,  # Browser runs in background
}

# quotes.toscrape.com/js loads content via JavaScript
smart_scraper = SmartScraperGraph(
    prompt="Extract the first 3 quotes with their text and authors",
    source="https://quotes.toscrape.com/js/",
    config=graph_config,
)

result = smart_scraper.run()

Output:
{'content': [{'author': 'Albert Einstein',
'quote': 'The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.'},
{'author': 'J.K. Rowling',
'quote': 'It is our choices, Harry, that show what we truly are, '
'far more than our abilities.'},
{'author': 'Albert Einstein',
'quote': 'There are only two ways to live your life. One is as '
'though nothing is a miracle. The other is as though '
'everything is a miracle.'}]}

Unlike the empty BeautifulSoup result, ScrapeGraphAI successfully extracted all three quotes from the JavaScript-rendered page. The LLM chose sensible field names (author, quote) based solely on our natural language prompt.
Multi-Page Scraping
Research tasks often require data from multiple sources. Scraping multiple sites usually requires building individual scrapers for each layout, then manually combining the results into a unified format.
SearchGraph automates this workflow. It searches the web, scrapes relevant pages, and returns aggregated results:
from scrapegraphai.graphs import SearchGraph
import os
from dotenv import load_dotenv

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "max_results": 3,
    "verbose": False,
}

search_graph = SearchGraph(
    prompt="Find the top 3 Python web scraping libraries and their GitHub stars",
    config=graph_config,
)

result = search_graph.run()
print(result)

Output:
{'sources': ['https://github.com/luminati-io/Python-scraping-libraries',
             'https://brightdata.com/blog/web-data/python-web-scraping-libraries',
             'https://www.geeksforgeeks.org/python/python-web-scraping-tutorial/',
             'https://www.projectpro.io/article/python-libraries-for-web-scraping/625'],
 'top_libraries': [{'github_stars': '~52.3k', 'name': 'Requests'},
                   {'github_stars': '~53.7k', 'name': 'Scrapy'},
                   {'github_stars': '~31.2k', 'name': 'Selenium'},
                   {'github_stars': 1800, 'name': 'BeautifulSoup'}]}

For scraping multiple known URLs with the same prompt, use SmartScraperMultiGraph:
from scrapegraphai.graphs import SmartScraperMultiGraph
import os
from dotenv import load_dotenv

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

multi_scraper = SmartScraperMultiGraph(
    prompt="Extract the page title and main heading",
    source=[
        "https://books.toscrape.com",
        "https://quotes.toscrape.com",
    ],
    config=graph_config,
)

result = multi_scraper.run()
print(result)

Output:
{'main_headings': ['All products', 'Quotes to Scrape'],
 'page_titles': ['Books to Scrape', 'Quotes to Scrape'],
 'sources': ['https://books.toscrape.com', 'https://quotes.toscrape.com']}

Both approaches return consistent, structured output regardless of the underlying HTML differences between sites.
Key Takeaways
ScrapeGraphAI shifts web scraping from writing CSS selectors to describing the data you want:

Natural language prompts replace hard-coded CSS selectors and XPath expressions
Pydantic schemas provide type-safe, validated output ready for analysis
Built-in JavaScript rendering handles React, Vue, and Angular sites automatically
Multi-provider support lets you choose between cloud APIs and local models
SearchGraph automates multi-source research with a single prompt

The library is best suited for:

Exploratory data collection where site structures vary
Research tasks requiring data from multiple sources
Projects where scraper maintenance costs exceed development time
Extracting structured data from JavaScript-heavy applications

For high-volume production workloads on sites with stable HTML, Scrapy remains the faster choice. ScrapeGraphAI pays off when the time saved on selector updates outweighs the per-request LLM cost.
Related Tutorials

Turn Receipt Images into Spreadsheets with LlamaIndex: Extract structured data from images and PDFs instead of web pages
Transform Any PDF into Searchable AI Data with Docling: Convert PDF documents into RAG-ready structured data


📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

From CSS Selectors to Natural Language: Web Scraping with ScrapeGraphAI

The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

Table of Contents

Introduction
What Are Typed Data Containers?
Using Dictionaries
Using NamedTuple
Using dataclass
Using Pydantic
Final Thoughts
Related Tutorials

Introduction
Imagine you’re processing customer records. The pipeline runs without errors, but customers never receive their welcome emails. After digging through the code, you discover the issue is a simple typo in a dictionary key.
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo

def send_welcome_email(customer):
    email = customer.get("email")  # Returns None silently
    if email:
        print(f"Sending email to {email}")
    # No email sent, no error raised

customer = load_customer(["C001", "Alice", "alice@example.com"])
send_welcome_email(customer)  # Nothing happens

Since .get() returns None for a missing key, the bug stays hidden.
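A quick way to see the difference: `.get()` hides the typo, while direct indexing fails fast at the access site. A minimal, self-contained sketch:

```python
customer = {"customer_id": "C001", "name": "Alice", "emial": "alice@example.com"}  # typo key

# .get() silently hides the typo by returning None
assert customer.get("email") is None

# Direct indexing raises immediately, surfacing the bug where it happens
try:
    customer["email"]
except KeyError as err:
    print(f"Caught early: {err}")
```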
This is exactly the type of issue we want to catch earlier. In this article, we’ll look at how typed data containers like NamedTuple, dataclass, and Pydantic help surface these bugs at runtime.

Interactive Course: Master Python data containers with hands-on exercises in our interactive Python data containers course.

What Are Typed Data Containers?
Python offers several ways to structure data, each adding more safety than the last:

dict: No protection. Bugs surface only when you access a missing key.
NamedTuple: Basic safety. Catches typos at write time in your IDE and at runtime.
dataclass: Static analysis support. Tools like mypy catch errors before your code runs.
Pydantic: Full protection. Validates data the moment you create an instance.

Let’s see how each tool handles the same customer data:
Using Dictionaries
Dictionaries are quick to create but provide no safety:
customer = {
    "customer_id": "C001",
    "name": "Alice Smith",
    "email": "alice@example.com",
    "age": 28,
    "is_premium": True,
}

print(customer["name"])

Alice Smith

Typo Bugs
A typo in the key name causes a KeyError at runtime:
customer["emial"] # Typo: should be "email"

KeyError: 'emial'

The error tells you what went wrong but not where. When dictionaries pass through multiple functions, finding the source of a typo can take significant debugging time:
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo here

def validate_customer(customer):
    return customer  # Passes through unchanged

def send_email(customer):
    return customer["email"]  # KeyError raised here

customer = load_customer(["C001", "Alice", "alice@example.com"])
validated = validate_customer(customer)
send_email(validated)  # Error points here, but bug is in load_customer

KeyError                                  Traceback (most recent call last)
     13 customer = load_customer(["C001", "Alice", "alice@example.com"])
     14 validated = validate_customer(customer)
---> 15 send_email(validated)  # Error points here, but bug is in load_customer

Cell In[6], line 10, in send_email(customer)
      9 def send_email(customer):
---> 10     return customer["email"]

KeyError: 'email'

The stack trace shows where the KeyError was raised, not where "emial" was written. The bug and its symptom are 13 lines apart here, but in production code, they could be in different files entirely.
Using .get() makes it worse by returning None silently:
email = customer.get("email") # Returns None – key is "emial" not "email"
print(f"Sending email to: {email}")

Sending email to: None

This silent failure is dangerous: your notification system might skip thousands of customers, or worse, your code could write None to a database column, corrupting your data pipeline.
Type Confusion
Typos cause crashes, but wrong types can corrupt your data silently. Since dictionaries have no schema, nothing stops you from assigning the wrong type to a field:
customer = {
    "customer_id": "C001",
    "name": 123,  # Should be a string
    "age": "twenty-eight",  # Should be an integer
}

total_age = customer["age"] + 5

TypeError: can only concatenate str (not "int") to str

The error message is misleading: it says “concatenate str” but the real problem is that age should never have been a string in the first place.
Using NamedTuple
NamedTuple is a lightweight way to define a fixed structure with named fields and type hints, like a dictionary with a schema:
from typing import NamedTuple

class Customer(NamedTuple):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
    is_premium=True,
)

print(customer.name)

Alice Smith

IDE Autocomplete Catches Typos
Your IDE can’t autocomplete dictionary keys, so typing customer[" shows no suggestions. With NamedTuple, typing customer. displays all available fields: customer_id, name, email, age, is_premium.
Even if you skip autocomplete and type manually, typos are flagged instantly with squiggly lines:
customer.emial
         ~~~~~

Running the code will raise an error:
customer.emial

AttributeError: 'Customer' object has no attribute 'emial'

The error names the exact object and missing attribute, so you know immediately what to fix.
Immutability Prevents Accidental Changes
NamedTuples are immutable, meaning once created, their values cannot be changed:
customer.name = "Bob" # Raises an error

AttributeError: can't set attribute

This prevents bugs where data is accidentally modified during processing.
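When you do need an updated value, NamedTuple's `_replace` method returns a new instance instead of mutating the original. A short sketch:

```python
from typing import NamedTuple

class Customer(NamedTuple):
    name: str
    email: str

customer = Customer(name="Alice", email="alice@example.com")

# _replace builds a new instance; the original stays intact
updated = customer._replace(name="Alice Johnson")

print(customer.name)  # Alice
print(updated.name)   # Alice Johnson
```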
Limitations: No Runtime Type Validation
Type hints in NamedTuple are not enforced at runtime, so you can still pass in wrong types:
# Wrong types are accepted without error
customer = Customer(
    customer_id="C001",
    name=123,  # Should be str, but int is accepted
    email="alice@example.com",
    age="twenty-eight",  # Should be int, but str is accepted
    is_premium=True,
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

The code runs, but with incorrect data types. The bug surfaces later when you try to use the data.
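To make the deferred failure concrete, here is a minimal illustration: creation succeeds, and the crash only appears when the bad value is finally used.

```python
from typing import NamedTuple

class Customer(NamedTuple):
    name: str
    age: int

# Wrong types are accepted silently at creation time
customer = Customer(name=123, age="twenty-eight")

# ...and only blow up later, far from where the bad data entered
try:
    customer.age + 1
except TypeError as err:
    print(f"Deferred failure: {err}")
```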
Using dataclass
dataclass reduces the boilerplate of writing classes that mainly hold data. Instead of manually writing __init__ and other methods, you just declare your fields.
It provides the same IDE support as NamedTuple, plus three additional features:

Mutable objects: You can change field values after creation
Mutable defaults: Safe defaults for lists and dicts with field(default_factory=list)
Post-init logic: Run custom validation or compute derived fields with __post_init__

from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False  # Default value

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Smith, Premium: False

Mutability Allows Updates
Dataclass trades NamedTuple’s immutability protection for flexibility. You can modify fields after creation:
customer.name = "Alice Johnson" # Changed after marriage
customer.is_premium = True # Upgraded their account

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Johnson, Premium: True

For extra safety, use @dataclass(slots=True) (Python 3.10+) to prevent accidentally adding new attributes:
@dataclass(slots=True)
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="alice@example.com",
    age=28,
)

customer.nmae = "Bob"  # Typo

AttributeError: 'Customer' object has no attribute 'nmae'

Mutable Defaults with default_factory
Mutable defaults like lists don’t work as expected. You might think each instance gets its own empty list, but Python creates the default [] once and all instances share it:
from typing import NamedTuple

class Order(NamedTuple):
    order_id: str
    items: list = []

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Also has "apple"!

Order 1: ['apple']
Order 2: ['apple']

Order 2 has “apple” even though we only added it to Order 1. Modifying one order’s items affects every order.
Dataclass prevents this mistake by rejecting mutable defaults:
@dataclass
class Order:
    items: list = []

ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

Dataclass offers field(default_factory=…) as the solution. The factory function runs at instance creation, not class definition, so each object gets its own list:
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: str
    items: list = field(default_factory=list)  # Each instance gets its own list

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Not affected by order1

Order 1: ['apple']
Order 2: []

Unlike the NamedTuple example, Order 2 stays empty because it has its own list.
Post-Init Validation with __post_init__
Without validation, invalid data passes through silently:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="",  # Empty name
    email="invalid",
    age=-100,
)
print(f"Created: {customer}")  # No error - bad data is in your system

Created: Customer(customer_id='C001', name='', email='invalid', age=-100, is_premium=False)

Dataclass provides __post_init__ to catch these issues at creation time so you can validate fields before the object is used:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if self.age < 0:
            raise ValueError(f"Age cannot be negative: {self.age}")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="invalid-email",
    age=28,
)

ValueError: Invalid email: invalid-email

The error message tells you exactly what’s wrong, making the bug easy to fix.
Limitations: Manual Validation Only
__post_init__ requires you to write every validation rule yourself. If you forget to check a field, bad data can still slip through.
In this example, __post_init__ only validates email format, so wrong types for name and age pass undetected:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name=123,  # No validation for name type
    email="alice@example.com",
    age="twenty-eight",  # No validation for age type
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

Type hints alone don’t enforce types at runtime. For automatic validation, you need a library that actually checks types when objects are created.
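You could hand-roll a generic type check with `dataclasses.fields`, but it quickly gets verbose and fragile (this sketch breaks with string annotations or generic types like `list[str]`, which is exactly why a validation library earns its keep):

```python
from dataclasses import dataclass, fields

@dataclass
class Customer:
    name: str
    age: int

    def __post_init__(self):
        # Compare each value against its annotated type (simple types only)
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} expected {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )

try:
    Customer(name=123, age=28)  # Wrong type for name
except TypeError as err:
    print(err)
```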

📚 For comprehensive coverage of dataclasses and Pydantic in production workflows, check out Production-Ready Data Science.

Using Pydantic
Pydantic is a data validation library that enforces type hints at runtime. Unlike NamedTuple and dataclass, it actually checks that values match their declared types when objects are created. Install it with:
pip install pydantic

To create a Pydantic model, inherit from BaseModel and declare your fields with type hints:
from pydantic import BaseModel

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Age: {customer.age}")

Alice Smith, Age: 28

For using Pydantic to enforce structured outputs from AI models, see our PydanticAI tutorial.
Runtime Validation
Remember how dataclass accepted name=123 without complaint? Pydantic catches this automatically with a ValidationError:
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name=123,
        email="alice@example.com",
        age="thirty",
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
  Input should be a valid string [type=string_type, input_value=123, input_type=int]
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='thirty', input_type=str]

The error message shows:

Which fields failed validation (name, age)
What was expected (valid string, valid integer)
What was received (123 as int, 'thirty' as str)

This tells you everything you need to fix the bug in one place, instead of digging through stack traces.
Type Coercion
Unlike dataclass which stores whatever you pass, Pydantic automatically converts compatible types to match your type hints:
customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age="28",  # String "28" is converted to int 28
    is_premium="true",  # String "true" is converted to bool True
)

print(f"Age: {customer.age} (type: {type(customer.age).__name__})")
print(f"Premium: {customer.is_premium} (type: {type(customer.is_premium).__name__})")

Age: 28 (type: int)
Premium: True (type: bool)

This is useful when reading data from CSV files or APIs where everything comes as strings.
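To see why this matters, note that the standard library's csv module hands every field back as a string, even for numeric columns; coercion is what turns "28" back into 28 without manual casting at each call site:

```python
import csv
import io

# csv.DictReader yields strings for every column, regardless of content
raw = io.StringIO("customer_id,age,is_premium\nC001,28,true\n")
row = next(csv.DictReader(raw))

print(row)                 # {'customer_id': 'C001', 'age': '28', 'is_premium': 'true'}
print(type(row["age"]))    # <class 'str'>
```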
Constraint Validation
Beyond types, you often need business rules: age must be positive, names can’t be empty, customer IDs must follow a pattern.
In dataclass, you define fields in one place and validate them in __post_init__. The validation logic grows with each constraint:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if not self.customer_id:
            raise ValueError("Customer ID cannot be empty")
        if not self.name or len(self.name) < 1:
            raise ValueError("Name cannot be empty")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")
        if self.age < 0 or self.age > 150:
            raise ValueError(f"Age must be between 0 and 150: {self.age}")

Pydantic puts constraints directly in Field(), keeping rules next to the data they validate:
from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str = Field(min_length=1)
    email: str
    age: int = Field(ge=0, le=150)  # Age must be between 0 and 150
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name="",  # Empty name
        email="alice@example.com",
        age=-5,  # Negative age
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
  String should have at least 1 character [type=string_too_short, input_value='', input_type=str]
age
  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-5, input_type=int]

Nested Validation
Data structures are rarely flat. A customer has an address, an order contains items. When something is wrong inside a nested object, you need to know exactly where.
Pydantic validates each level and reports the full path to any error:
from pydantic import BaseModel, Field, ValidationError

class Address(BaseModel):
    street: str
    city: str
    zip_code: str = Field(pattern=r"^\d{5}$")  # Must be 5 digits

class Customer(BaseModel):
    customer_id: str
    name: str
    address: Address

try:
    customer = Customer(
        customer_id="C001",
        name="Alice Smith",
        address={
            "street": "123 Main St",
            "city": "New York",
            "zip_code": "invalid",  # Invalid zip code
        },
    )
except ValidationError as e:
    print(e)

1 validation error for Customer
address.zip_code
  String should match pattern '^\d{5}$' [type=string_pattern_mismatch, input_value='invalid', input_type=str]

The error message shows address.zip_code, pinpointing the exact location in the nested structure.
For extracting structured data from documents using Pydantic, see our LlamaIndex data extraction guide.
Final Thoughts
To summarize what each tool provides:

dict: Quick to create. No structure or validation.
NamedTuple: Fixed structure with IDE autocomplete. Immutable.
dataclass: Mutable fields, safe defaults, custom logic via __post_init__.
Pydantic: Runtime type enforcement, automatic type coercion, built-in constraints.

Personally, I use dict for quick prototyping:
stats = {"rmse": 0.234, "mae": 0.189, "r2": 0.91}

Then Pydantic when the code moves to production. For example, a training config should reject invalid values like negative learning rates:
from pydantic import BaseModel, Field

class TrainingConfig(BaseModel):
    epochs: int = Field(ge=1)
    batch_size: int = Field(ge=1)
    learning_rate: float = Field(gt=0)

config = TrainingConfig(epochs=10, batch_size=32, learning_rate=0.001)

Pick the level of protection that matches your needs. A notebook experiment doesn’t need Pydantic, but a production API does.
Related Tutorials

SQLModel vs psycopg2: Combine Pydantic-style validation with database integration
Pytest for Data Scientists: Test your data containers and processing pipelines
Hydra for Python Configuration: Manage validated configuration with YAML-based pipelines

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

Turn Receipt Images into Spreadsheets with LlamaIndex

Table of Contents

Introduction
What You Will Learn
Introduction to LlamaIndex
Basic Image Processing with LlamaParse
Structured Data Extraction with Pydantic
Compare Extraction with Ground Truth
Process the Images for Better Extraction
Export Clean Data to CSV or Excel
Speed Up Processing with Async Parallel Execution
Try It Yourself
Conclusion and Next Steps

Introduction
Manual data entry from receipts, invoices, and contracts wastes hours and introduces errors. What if you could automatically extract structured data from these documents in minutes?
In this article, you’ll learn how to transform receipt images into structured data using LlamaIndex, then export the results to a spreadsheet for analysis.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!


What You Will Learn

Convert scanned receipts to structured data with LlamaParse and Pydantic models
Validate extraction accuracy by comparing results against ground truth annotations
Fix parsing errors by preprocessing low-quality images
Export clean receipt data to spreadsheet format

Introduction to LlamaIndex
LlamaIndex is a framework that connects LLMs with your data through three core capabilities:

Data ingestion: Built-in readers for PDFs, images, web pages, and databases that automatically parse content into processable nodes.
Structured extraction: LLM-powered conversion of unstructured text into Pydantic models with automatic validation.
Retrieval and indexing: Vector stores and semantic search that enable context-augmented queries over your documents.

It eliminates boilerplate code for loading, parsing, and querying data, letting you focus on building LLM applications.
The table below compares LlamaIndex with two other popular frameworks for LLM applications:

Framework
Purpose
Best For

LlamaIndex
Document ingestion and structured extraction
Converting unstructured documents into query-ready data

LangChain
LLM orchestration and tool integration
Building conversational agents with multiple LLM calls

LangGraph
Stateful workflow management
Coordinating long-running, multi-agent processes

Installation
Start by installing the required packages for this tutorial:

llama-index: Core LlamaIndex framework with base indexing and retrieval functionality
llama-parse: Document parsing service for PDFs, images, and complex layouts
llama-index-program-openai: OpenAI integration for structured data extraction with Pydantic
python-dotenv: Load environment variables from .env files
rapidfuzz: Fuzzy string matching library for comparing company names with minor variations

pip install llama-index llama-parse llama-index-program-openai python-dotenv rapidfuzz

Environment Setup
Create a .env file to store your API keys:
# .env
LLAMA_CLOUD_API_KEY="your-llama-parse-key"
OPENAI_API_KEY="your-openai-key"

Get your API keys from:

LlamaParse API: cloud.llamaindex.ai
OpenAI API: platform.openai.com/api-keys

Load the environment variables from the .env file with load_dotenv:
from dotenv import load_dotenv
import os

load_dotenv()

Configure the default LLM with Settings:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.context_window = 8000

Settings stores global defaults so every query engine and program reuses the same LLM configuration. Keeping temperature at 0 nudges the model to return deterministic, structured outputs.
Basic Image Processing with LlamaParse
In this tutorial, we will use the SROIE Dataset v2 from Kaggle. This dataset contains real-world receipt scans from the ICDAR 2019 competition.
You can download the dataset directly from Kaggle’s website or use the Kaggle CLI:
# Install the Kaggle CLI once
uv pip install kaggle

# Configure Kaggle credentials (run once per environment)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

# Create a workspace folder and download the full archive (~1 GB)
mkdir -p data
kaggle datasets download urbikn/sroie-datasetv2 -p data

# Extract everything and inspect a few image files
unzip -q -o data/sroie-datasetv2.zip -d data

This tutorial uses data from the data/SROIE2019/train/ directory, which contains:

img: Original receipt images
entities: Ground truth annotations for validation

Load the first 10 receipts into a list of paths:
from pathlib import Path

receipt_dir = Path("data/SROIE2019/train/img")
num_receipts = 10
receipt_paths = sorted(receipt_dir.glob("*.jpg"))[:num_receipts]

Take a look at the first receipt:
from IPython.display import Image

first_receipt_path = receipt_paths[0]
Image(filename=first_receipt_path)

Next, use LlamaParse to convert the first receipt into markdown.
from llama_parse import LlamaParse

# Parse receipts with LlamaParse
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # Output format
    num_workers=4,  # Number of parallel workers for faster processing
    language="en",  # Language hint for OCR accuracy
    skip_diagonal_text=True,  # Ignore rotated or diagonal text
)
first_receipt = parser.load_data(first_receipt_path)[0]

Preview the markdown for the first receipt:
# Preview the first receipt
preview = "\n".join(first_receipt.text.splitlines()[:10])
print(preview)

Output:
tan woon yann
BOOK TA K (TAMAN DAYA) SDN BHD
789417-W
NO.5: 55,57 & 59, JALAN SAGU 18,
TAMAN DaYA,
81100 JOHOR BAHRU,
JOHOR.

LlamaParse successfully converts receipt images to text, but there is no structure: vendor names, dates, and totals are all mixed together in plain text. This format is not ideal for exporting to spreadsheets or analytics tools for further analysis.
The next section uses Pydantic models to extract structured fields like company, total, and purchase_date automatically.
Structured Data Extraction with Pydantic
Pydantic is a Python library that uses type hints for data validation and automatic type conversion. By defining a receipt schema once, you can extract consistent structured data from receipts regardless of their format or layout.
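To see what "validation and automatic type conversion" means in practice, here is a standalone sketch (the `Total` model is illustrative, not part of the pipeline): Pydantic coerces compatible values and rejects incompatible ones instead of failing silently downstream.

```python
from pydantic import BaseModel, ValidationError

class Total(BaseModel):
    amount: float  # the type hint drives validation and coercion

# The numeric string "9.00" is coerced to the float 9.0
print(Total(amount="9.00").amount)

# A non-numeric string raises a ValidationError up front
try:
    Total(amount="nine")
except ValidationError as e:
    print("rejected:", len(e.errors()), "error")
```

This is exactly the behavior we rely on below: the LLM's raw output is forced through the schema before it reaches the DataFrame.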
Start by defining two Pydantic models that represent receipt structure:
from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field

class ReceiptItem(BaseModel):
    """Represents a single line item extracted from a receipt."""

    description: str = Field(description="Item name exactly as shown on the receipt")
    quantity: int = Field(default=1, ge=1, description="Integer quantity of the item")
    unit_price: Optional[float] = Field(
        default=None, ge=0, description="Price per unit in the receipt currency"
    )
    discount_amount: float = Field(
        default=0.0, ge=0, description="Discount applied to this line item"
    )

class Receipt(BaseModel):
    """Structured fields extracted from a retail receipt."""

    company: str = Field(description="Business or merchant name")
    purchase_date: Optional[date] = Field(
        default=None, description="Date in YYYY-MM-DD format"
    )
    address: Optional[str] = Field(default=None, description="Address of the business")
    total: float = Field(description="Final charged amount")
    items: List[ReceiptItem] = Field(default_factory=list)

Create an OpenAIPydanticProgram that instructs the LLM to extract data according to our Receipt model:
from llama_index.program.openai import OpenAIPydanticProgram

prompt = """
You are extracting structured data from a receipt.
Use the provided text to populate the Receipt model.
Interpret every receipt date as day-first.
If a field is missing, return null.

{context_str}
"""

receipt_program = OpenAIPydanticProgram.from_defaults(
    output_cls=Receipt,
    llm=Settings.llm,
    prompt_template_str=prompt,
)

Process the first parsed document to make sure everything works before scaling to the full batch:
# Process the first receipt
structured_first_receipt = receipt_program(context_str=first_receipt.text)

# Print the receipt as a JSON string for better readability
print(structured_first_receipt.model_dump_json(indent=2))

Output:
{
"company": "tan woon yann BOOK TA K (TAMAN DAYA) SDN BHD",
"purchase_date": "2018-12-25",
"address": "NO.5: 55,57 & 59, JALAN SAGU 18, TAMAN DaYA, 81100 JOHOR BAHRU, JOHOR.",
"total": 9.0,
"items": [
{
"description": "KF MODELLING CLAY KIDDY FISH",
"quantity": 1,
"unit_price": 9.0,
"discount_amount": 0.0
}
]
}

LlamaIndex populates the Pydantic schema with extracted values:

company: Vendor name from the receipt header
purchase_date: Parsed date (2018-12-25)
total: Final amount (9.0)
items: Line items with description, quantity, and price

Now that the extraction works, let’s scale it to process all receipts in a batch. The function uses each receipt’s filename as a unique identifier:
def extract_documents(paths: List[str], prompt: str, id_column: str = "receipt_id") -> List[dict]:
    """Extract structured data from documents using LlamaParse and LLM."""
    results: List[dict] = []

    # Initialize parser with OCR settings
    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=4,
        language="en",
        skip_diagonal_text=True,
    )

    # Convert images to markdown text
    documents = parser.load_data(paths)

    # Create structured extraction program
    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    # Extract structured data from each document
    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append(
            {
                id_column: document_id,
                "data": parsed_document,
            }
        )
    return results

# Extract structured data from all receipts
structured_receipts = extract_documents(receipt_paths, prompt)
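
Each `receipt_id` comes from `Path.stem`, which strips the directory and file extension from the path, leaving a stable identifier:

```python
from pathlib import Path

# .stem keeps only the base filename, so it doubles as a unique receipt ID
path = Path("data/SROIE2019/train/img/X00016469612.jpg")
print(path.stem)  # X00016469612
```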

Convert the extracted receipts into a DataFrame for easier inspection:
import pandas as pd

def transform_receipt_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Apply standard transformations to receipt DataFrame columns."""
    df = df.copy()
    df["company"] = df["company"].str.upper()
    df["total"] = pd.to_numeric(df["total"], errors="coerce")
    df["purchase_date"] = pd.to_datetime(
        df["purchase_date"], errors="coerce", dayfirst=True
    ).dt.date
    return df

def create_extracted_df(records: List[dict], id_column: str = "receipt_id") -> pd.DataFrame:
    """Flatten extracted Receipt objects into a tidy DataFrame."""
    df = pd.DataFrame(
        [
            {
                id_column: record[id_column],
                "company": record["data"].company,
                "total": record["data"].total,
                "purchase_date": record["data"].purchase_date,
            }
            for record in records
        ]
    )
    return transform_receipt_columns(df)

extracted_df = create_extracted_df(structured_receipts)
extracted_df

|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | TAN WOON YANN BOOK TA K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 8 | X51005200938 | TH MNAN | 0 | 2023-10-11 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Most receipts are extracted correctly, but receipt X51005200938 shows issues:

The company name is incomplete (“TH MNAN”)
Total is 0 instead of the actual amount
Date (2023-10-11) appears incorrect

Compare Extraction with Ground Truth
To verify the extraction accuracy, load the ground-truth annotations from data/SROIE2019/train/entities:
def normalize_date(value: str) -> str:
    """Normalize date strings to consistent format."""
    value = (value or "").strip()
    if not value:
        return value
    # Convert hyphens to slashes
    value = value.replace("-", "/")
    parts = value.split("/")
    # Convert 2-digit years to 4-digit (e.g., 18 -> 2018)
    if len(parts[-1]) == 2:
        parts[-1] = f"20{parts[-1]}"
    return "/".join(parts)

def create_ground_truth_df(
    label_paths: List[str], id_column: str = "receipt_id"
) -> pd.DataFrame:
    """Create ground truth DataFrame from label JSON files."""
    records = []
    # Load each JSON file and extract key fields
    for path in label_paths:
        payload = pd.read_json(Path(path), typ="series").to_dict()
        records.append(
            {
                id_column: Path(path).stem,
                "company": payload.get("company"),
                "total": payload.get("total"),
                "purchase_date": normalize_date(payload.get("date")),
            }
        )

    df = pd.DataFrame(records)
    # Apply same transformations as extracted data
    return transform_receipt_columns(df)

# Load ground truth annotations
label_dir = Path("data/SROIE2019/train/entities")
label_paths = sorted(label_dir.glob("*.txt"))[:num_receipts]

ground_truth_df = create_ground_truth_df(label_paths)
ground_truth_df

|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | BOOK TA .K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT) SDN. BHD. | 20 | 2019-01-23 |
| 8 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.45 | 2018-02-12 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Let’s validate extraction accuracy by comparing results against ground truth.
Company names often have minor variations (spacing, punctuation, extra characters), so we’ll use fuzzy matching to tolerate these formatting differences.
from rapidfuzz import fuzz

def fuzzy_match_score(text1: str, text2: str) -> float:
    """Calculate fuzzy match score (0-100) between two strings."""
    return fuzz.token_set_ratio(str(text1), str(text2))

Test the fuzzy matching with sample company names:
# Nearly identical strings score high
print(f"Score: {fuzzy_match_score('BOOK TA K SDN BHD', 'BOOK TA .K SDN BHD'):.2f}")

# Punctuation changes the tokens ('D.I.Y.' vs 'DIY'), lowering the score
print(f"Score: {fuzzy_match_score('MR D.I.Y. JOHOR', 'MR DIY JOHOR'):.2f}")

# Completely different strings score low
print(f"Score: {fuzzy_match_score('ABC TRADING', 'XYZ COMPANY'):.2f}")

Output:
Score: 97.14
Score: 55.17
Score: 27.27

Now build a comparison function that merges extracted and ground truth data, then applies fuzzy matching for company names and exact matching for numeric fields:
def compare_receipts(
    extracted_df: pd.DataFrame,
    ground_truth_df: pd.DataFrame,
    id_column: str,
    fuzzy_match_cols: List[str],
    exact_match_cols: List[str],
    fuzzy_threshold: int = 80,
) -> pd.DataFrame:
    """Compare extracted and ground truth data with explicit column specifications."""
    comparison_df = extracted_df.merge(
        ground_truth_df,
        on=id_column,
        how="inner",
        suffixes=("_extracted", "_truth"),
    )

    # Fuzzy matching
    for col in fuzzy_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_score"] = comparison_df.apply(
            lambda row: fuzzy_match_score(row[extracted_col], row[truth_col]),
            axis=1,
        )
        comparison_df[f"{col}_match"] = comparison_df[f"{col}_score"] >= fuzzy_threshold

    # Exact matching
    for col in exact_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_match"] = (
            comparison_df[extracted_col] == comparison_df[truth_col]
        )

    return comparison_df

comparison_df = compare_receipts(
extracted_df,
ground_truth_df,
id_column="receipt_id",
fuzzy_match_cols=["company"],
exact_match_cols=["total", "purchase_date"],
)

Inspect any rows where the company, total, or purchase-date checks fail:
def get_mismatch_rows(comparison_df: pd.DataFrame) -> pd.DataFrame:
    """Get mismatched rows, excluding match indicator columns."""
    # Separate the boolean match columns from the underlying data columns
    match_columns = [col for col in comparison_df.columns if col.endswith("_match")]
    data_columns = sorted(
        col
        for col in comparison_df.columns
        if col.endswith("_extracted") or col.endswith("_truth")
    )

    # Keep rows where at least one check failed
    has_mismatch = ~comparison_df[match_columns].all(axis=1)

    return comparison_df[has_mismatch][data_columns]

mismatch_df = get_mismatch_rows(comparison_df)

mismatch_df

|   | company_extracted | company_truth | purchase_date_extracted | purchase_date_truth | total_extracted | total_truth |
|---|---|---|---|---|---|---|
| 8 | TH MNAN | PERNIAGAAN ZHENG HUI | 2023-10-11 | 2018-02-12 | 0 | 112.45 |

This confirms what we saw earlier. All receipts match the ground truth annotations except for receipt ID X51005200938 for the following fields:

Company name
Total
Purchase date

Let’s take a closer look at this receipt to see if we can identify the issue.
import IPython.display as display

file_to_inspect = receipt_dir / "X51005200938.jpg"

display.Image(filename=file_to_inspect)

This receipt appears smaller than the others in the dataset, which may affect OCR readability. In the next section, we will scale up the receipt to improve the extraction.
Process the Images for Better Extraction
Create a function to scale up the receipt:
from PIL import Image

def scale_image(image_path: Path, output_dir: Path, scale_factor: int = 3) -> Path:
    """Scale up an image using high-quality resampling.

    Args:
        image_path: Path to the original image
        output_dir: Directory to save the scaled image
        scale_factor: Factor to scale up the image (default: 3x)

    Returns:
        Path to the scaled image
    """
    # Load the image
    img = Image.open(image_path)

    # Scale up the image using high-quality resampling
    new_size = (img.width * scale_factor, img.height * scale_factor)
    img_resized = img.resize(new_size, Image.Resampling.LANCZOS)

    # Save to output directory with same filename
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / image_path.name
    img_resized.save(output_path, quality=95)

    return output_path

Apply the function to the problematic receipt:
problematic_receipt_path = receipt_dir / "X51005200938.jpg"
adjusted_receipt_dir = Path("data/SROIE2019/train/img_adjusted")

scaled_image_path = scale_image(problematic_receipt_path, adjusted_receipt_dir, scale_factor=3)

Let’s extract the structured data from the scaled image:
problematic_structured_receipts = extract_documents([scaled_image_path], prompt)
problematic_extracted_df = create_extracted_df(problematic_structured_receipts)

problematic_extracted_df

|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |

Nice! Scaling fixes the extraction: the company name and purchase date are now accurate. The total reads 112.46 instead of the annotated 112.45, an acceptable one-cent discrepancy since the printed amount on the receipt genuinely looks like 112.46.
Export Clean Data to CSV or Excel
Apply the scaling fix to all receipts. Copy the remaining images to the processed directory, excluding the already-scaled receipt:
import shutil

clean_receipt_paths = [scaled_image_path]
# Copy all receipts except the already processed one
for receipt_path in receipt_paths:
    if receipt_path != problematic_receipt_path:  # Skip the already scaled image
        output_path = adjusted_receipt_dir / receipt_path.name
        shutil.copy2(receipt_path, output_path)
        clean_receipt_paths.append(output_path)
        print(f"Copied {receipt_path.name}")

Let’s run the pipeline again with the processed images:
clean_structured_receipts = extract_documents(clean_receipt_paths, prompt)
clean_extracted_df = create_extracted_df(clean_structured_receipts)
clean_extracted_df

|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |
| 1 | X00016469612 | TAN WOON YANN | 9 | 2018-12-25 |
| 2 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 3 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 4 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 5 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 6 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 7 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 8 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Awesome! All receipts now match the ground truth annotations, apart from the one-cent total discrepancy on X51005200938 discussed above.
Now we can export the dataset to a spreadsheet with just a few lines of code:
import pandas as pd

# Export to CSV
output_path = Path("reports/receipts.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_extracted_df.to_csv(output_path, index=False)
print(f"Exported {len(clean_extracted_df)} receipts to {output_path}")

Output:
Exported 10 receipts to reports/receipts.csv

The exported data can now be imported into spreadsheet applications, analytics tools, or business intelligence platforms.
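
For example, you can sanity-check the export by reading it back and aggregating spend. The snippet below inlines two sample rows via `StringIO` so it runs standalone without the `reports/receipts.csv` file:

```python
from io import StringIO
import pandas as pd

# Two sample rows standing in for reports/receipts.csv
sample_csv = StringIO(
    "receipt_id,company,total,purchase_date\n"
    "X00016469612,TAN WOON YANN,9.0,2018-12-25\n"
    "X00016469620,MR D.I.Y. (JOHOR) SDN BHD,33.9,2019-01-12\n"
)

df = pd.read_csv(sample_csv, parse_dates=["purchase_date"])
print(round(df["total"].sum(), 2))  # total spend across receipts
```

In a real workflow you would pass the CSV path to `pd.read_csv` instead of the in-memory buffer.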
Speed Up Processing with Async Parallel Execution
LlamaIndex supports asynchronous processing to handle multiple receipts concurrently. By combining async/await with the parser's aload_data() method, you can parse receipts in parallel instead of sequentially, significantly reducing total processing time.
Here’s how to modify the extraction function to use async processing. Setting num_workers=10 means the parser will process up to 10 receipts concurrently:
import asyncio

async def extract_documents_async(
    paths: List[str], prompt: str, id_column: str = "receipt_id"
) -> List[dict]:
    """Extract structured data from documents using async LlamaParse."""
    results: List[dict] = []

    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=10,  # Process 10 receipts concurrently
        language="en",
        skip_diagonal_text=True,
    )

    # Use async method for parallel processing
    documents = await parser.aload_data(paths)

    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append({id_column: document_id, "data": parsed_document})

    return results

# Run with top-level await (works directly in Jupyter notebooks)
structured_receipts = await extract_documents_async(receipt_paths, prompt)

See the LlamaIndex async documentation for more details.
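
Note that top-level `await` only works in environments like Jupyter; in a plain Python script you would drive the coroutine with `asyncio.run`. Illustrated here with a stand-in coroutine, since the real pipeline needs API keys:

```python
import asyncio

async def extract_stub(paths):
    # Stand-in for extract_documents_async: yields control like real I/O would
    await asyncio.sleep(0)
    return [{"receipt_id": p, "data": None} for p in paths]

# asyncio.run creates the event loop, runs the coroutine, and tears it down
results = asyncio.run(extract_stub(["X00016469612", "X51005230617"]))
print([r["receipt_id"] for r in results])
```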
Try It Yourself
The concepts from this tutorial are available as a reusable pipeline in this GitHub repository. The code includes both synchronous and asynchronous versions:
Synchronous pipelines (simple, sequential processing):

Generic pipeline (document_extraction_pipeline.py): Reusable extraction function that works with any Pydantic schema
Receipt pipeline (extract_receipts_pipeline.py): Complete example with Receipt schema, image scaling, and data transformations

Asynchronous pipelines (parallel processing with 3-10x speedup):

Async generic pipeline (async_document_extraction_pipeline.py): Concurrent document processing
Async receipt pipeline (async_extract_receipts_pipeline.py): Batch receipt processing with progress tracking

Run the receipt extraction example:
# Synchronous version (simple, sequential)
uv run extract_receipts_pipeline.py

# Asynchronous version (parallel processing, 3-10x faster)
uv run async_extract_receipts_pipeline.py

Or create your own extractor by importing extract_structured_data() and providing your custom Pydantic schema, extraction prompt, and optional preprocessing functions.

Learn production-ready practices for data science and AI projects in Production-Ready Data Science.

Conclusion and Next Steps
This tutorial demonstrated how LlamaIndex automates receipt data extraction with minimal code. You converted scanned images to structured data, validated results against ground truth, and exported a clean CSV ready for analysis.
Here are some ideas to enhance this receipt extraction pipeline:

Richer schemas: Add nested Pydantic models for vendor details, payment methods, and itemized line items
Validation rules: Flag outliers like totals over $500 or future dates for manual review
Multi-stage workflows: Create custom workflows that combine image preprocessing, extraction, validation, and export steps with error handling
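
As a sketch of the validation-rules idea, a small post-extraction check could flag suspicious receipts for manual review. The function name, field choices, and $500 threshold below are illustrative, not part of the tutorial code:

```python
from datetime import date
from typing import List, Optional

def flag_for_review(
    company: str, total: float, purchase_date: Optional[date]
) -> List[str]:
    """Return human-readable flags for receipts that need manual review."""
    flags = []
    if total > 500:  # illustrative outlier threshold
        flags.append(f"unusually large total: {total}")
    if purchase_date and purchase_date > date.today():
        flags.append(f"future date: {purchase_date}")
    if not company.strip():
        flags.append("missing company name")
    return flags

# A receipt with an outlier total and a future date gets two flags
print(flag_for_review("ABC HO TRADING", 1250.0, date(2030, 1, 1)))
```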

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

