The Hidden Cost of Python Dictionaries (And 3 Safer Alternatives)

Table of Contents

Introduction
What Are Typed Data Containers?
Using Dictionaries
Using NamedTuple
Using dataclass
Using Pydantic
Final Thoughts
Related Tutorials

Introduction
Imagine you’re processing customer records. The pipeline runs without errors, but customers never receive their welcome emails. After digging through the code, you discover the issue is a simple typo in a dictionary key.
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo

def send_welcome_email(customer):
    email = customer.get("email")  # Returns None silently
    if email:
        print(f"Sending email to {email}")
    # No email sent, no error raised

customer = load_customer(["C001", "Alice", "alice@example.com"])
send_welcome_email(customer)  # Nothing happens

Since .get() returns None for a missing key, the bug stays hidden.
This is exactly the type of issue we want to catch earlier. In this article, we'll look at how typed data containers like NamedTuple, dataclass, and Pydantic help surface these bugs sooner: in your IDE, during static analysis, or at runtime.
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
What Are Typed Data Containers?
Python offers several ways to structure data, each adding more safety than the last:

dict: No protection. Bugs surface only when you access a missing key.
NamedTuple: Basic safety. Catches typos at write time in your IDE and at runtime.
dataclass: Static analysis support. Tools like mypy catch errors before your code runs.
Pydantic: Full protection. Validates data the moment you create an instance.

Let’s see how each tool handles the same customer data:
Using Dictionaries
Dictionaries are quick to create but provide no safety:
customer = {
    "customer_id": "C001",
    "name": "Alice Smith",
    "email": "alice@example.com",
    "age": 28,
    "is_premium": True,
}

print(customer["name"])

Alice Smith

Typo Bugs
A typo in the key name causes a KeyError at runtime:
customer["emial"] # Typo: should be "email"

KeyError: 'emial'

The error tells you what went wrong but not where. When dictionaries pass through multiple functions, finding the source of a typo can take significant debugging time:
def load_customer(row):
    return {"customer_id": row[0], "name": row[1], "emial": row[2]}  # Typo here

def validate_customer(customer):
    return customer  # Passes through unchanged

def send_email(customer):
    return customer["email"]  # KeyError raised here

customer = load_customer(["C001", "Alice", "alice@example.com"])
validated = validate_customer(customer)
send_email(validated)  # Error points here, but bug is in load_customer

KeyError                                  Traceback (most recent call last)
Cell In[6], line 15
     13 customer = load_customer(["C001", "Alice", "alice@example.com"])
     14 validated = validate_customer(customer)
---> 15 send_email(validated)  # Error points here, but bug is in load_customer

Cell In[6], line 10, in send_email(customer)
      9 def send_email(customer):
---> 10     return customer["email"]  # KeyError raised here

KeyError: 'email'

The stack trace shows where the KeyError was raised, not where "emial" was written. The bug and its symptom are 13 lines apart here, but in production code, they could be in different files entirely.
Using .get() makes it worse by returning None silently:
email = customer.get("email") # Returns None – key is "emial" not "email"
print(f"Sending email to: {email}")

Sending email to: None

This silent failure is dangerous: your notification system might skip thousands of customers, or worse, your code could write None to a database column, corrupting your data pipeline.
Type Confusion
Typos cause crashes, but wrong types can corrupt your data silently. Since dictionaries have no schema, nothing stops you from assigning the wrong type to a field:
customer = {
    "customer_id": "C001",
    "name": 123,  # Should be a string
    "age": "twenty-eight",  # Should be an integer
}

total_age = customer["age"] + 5

TypeError: can only concatenate str (not "int") to str

The error message is misleading: it says “concatenate str” but the real problem is that age should never have been a string in the first place.
Using NamedTuple
NamedTuple is a lightweight way to define a fixed structure with named fields and type hints, like a dictionary with a schema:
from typing import NamedTuple

class Customer(NamedTuple):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
    is_premium=True,
)

print(customer.name)

Alice Smith

IDE Autocomplete Catches Typos
Your IDE can’t autocomplete dictionary keys, so typing customer[" shows no suggestions. With NamedTuple, typing customer. displays all available fields: customer_id, name, email, age, is_premium.
Even if you skip autocomplete and type manually, typos are flagged instantly with squiggly lines:
customer.emial
         ~~~~~

Running the code will raise an error:
customer.emial

AttributeError: 'Customer' object has no attribute 'emial'

The error names the exact object and missing attribute, so you know immediately what to fix.
Immutability Prevents Accidental Changes
NamedTuples are immutable, meaning once created, their values cannot be changed:
customer.name = "Bob" # Raises an error

AttributeError: can't set attribute

This prevents bugs where data is accidentally modified during processing.
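When you do need an updated value, the idiomatic pattern is to build a new instance rather than mutate the old one. NamedTuple provides the _replace method for exactly this:
updated = customer._replace(name="Alice Johnson")
print(updated.name)  # Alice Johnson
print(customer.name)  # Alice Smith (the original is unchanged)
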
Limitations: No Runtime Type Validation
Type hints in NamedTuple are not enforced at runtime, so you can still pass in wrong types:
# Wrong types are accepted without error
customer = Customer(
    customer_id="C001",
    name=123,  # Should be str, but int is accepted
    email="alice@example.com",
    age="twenty-eight",  # Should be int, but str is accepted
    is_premium=True,
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight

The code runs, but with incorrect data types. The bug surfaces later when you try to use the data.
Using dataclass
dataclass reduces the boilerplate of writing classes that mainly hold data. Instead of manually writing __init__ and other methods, you just declare your fields.
It provides the same IDE support as NamedTuple, plus three additional features:

Mutable objects: You can change field values after creation
Mutable defaults: Safe defaults for lists and dicts with field(default_factory=list)
Post-init logic: Run custom validation or compute derived fields with __post_init__

from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False  # Default value

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Smith, Premium: False

Mutability Allows Updates
Dataclass trades NamedTuple’s immutability protection for flexibility. You can modify fields after creation:
customer.name = "Alice Johnson" # Changed after marriage
customer.is_premium = True # Upgraded their account

print(f"{customer.name}, Premium: {customer.is_premium}")

Alice Johnson, Premium: True

For extra safety, use @dataclass(slots=True) to prevent accidentally adding new attributes:
@dataclass(slots=True)
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="alice@example.com",
    age=28,
)

customer.nmae = "Bob"  # Typo

AttributeError: 'Customer' object has no attribute 'nmae'

Mutable Defaults with default_factory
Mutable defaults like lists don’t work as expected. You might think each instance gets its own empty list, but Python creates the default [] once and all instances share it:
from typing import NamedTuple

class Order(NamedTuple):
    order_id: str
    items: list = []

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Also has "apple"!

Order 1: ['apple']
Order 2: ['apple']

Order 2 has “apple” even though we only added it to Order 1. Modifying one order’s items affects every order.
Dataclass prevents this mistake by rejecting mutable defaults:
@dataclass
class Order:
    items: list = []

ValueError: mutable default <class 'list'> for field items is not allowed: use default_factory

Dataclass offers field(default_factory=…) as the solution. The factory function runs at instance creation, not class definition, so each object gets its own list:
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: str
    items: list = field(default_factory=list)  # Each instance gets its own list

order1 = Order("001")
order2 = Order("002")

order1.items.append("apple")
print(f"Order 1: {order1.items}")
print(f"Order 2: {order2.items}")  # Not affected by order1

Order 1: ['apple']
Order 2: []

Unlike the NamedTuple example, Order 2 stays empty because it has its own list.
Post-Init Validation with __post_init__
Without validation, invalid data passes through silently:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="",  # Empty name
    email="invalid",
    age=-100,
)
print(f"Created: {customer}")  # No error - bad data is in your system

Created: Customer(customer_id='C001', name='', email='invalid', age=-100, is_premium=False)

Dataclass provides __post_init__ to catch these issues at creation time so you can validate fields before the object is used:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if self.age < 0:
            raise ValueError(f"Age cannot be negative: {self.age}")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name="Alice",
    email="invalid-email",
    age=28,
)

ValueError: Invalid email: invalid-email

The error message tells you exactly what’s wrong, making the bug easy to fix.
Limitations: Manual Validation Only
__post_init__ requires you to write every validation rule yourself. If you forget to check a field, bad data can still slip through.
In this example, __post_init__ only validates email format, so wrong types for name and age pass undetected:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")

customer = Customer(
    customer_id="C001",
    name=123,  # No validation for name type
    email="alice@example.com",
    age="twenty-eight",  # No validation for age type
)

print(f"Name: {customer.name}, Age: {customer.age}")

Name: 123, Age: twenty-eight
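
A static type checker does catch this before the code runs, which is the "static analysis support" mentioned earlier. Running mypy over this snippet reports both bad arguments (the file name and line numbers here are illustrative):
$ mypy customer.py
customer.py:13: error: Argument "name" to "Customer" has incompatible type "int"; expected "str"
customer.py:15: error: Argument "age" to "Customer" has incompatible type "str"; expected "int"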

Type hints alone don’t enforce types at runtime. For automatic validation, you need a library that actually checks types when objects are created.

📚 For comprehensive coverage of dataclasses and Pydantic in production workflows, check out Production-Ready Data Science.

Using Pydantic
Pydantic is a data validation library that enforces type hints at runtime. Unlike NamedTuple and dataclass, it actually checks that values match their declared types when objects are created. Install it with:
pip install pydantic

To create a Pydantic model, inherit from BaseModel and declare your fields with type hints:
from pydantic import BaseModel

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age=28,
)

print(f"{customer.name}, Age: {customer.age}")

Alice Smith, Age: 28

For using Pydantic to enforce structured outputs from AI models, see our PydanticAI tutorial.
Runtime Validation
Remember how dataclass accepted name=123 without complaint? Pydantic catches this automatically with a ValidationError:
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name=123,
        email="alice@example.com",
        age="thirty",
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
  Input should be a valid string [type=string_type, input_value=123, input_type=int]
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='thirty', input_type=str]

The error message shows:

Which fields failed validation (name, age)
What was expected (valid string, valid integer)
What was received (123 as int, 'thirty' as str)

This tells you everything you need to fix the bug in one place, instead of digging through stack traces.
Type Coercion
Unlike dataclass which stores whatever you pass, Pydantic automatically converts compatible types to match your type hints:
customer = Customer(
    customer_id="C001",
    name="Alice Smith",
    email="alice@example.com",
    age="28",  # String "28" is converted to int 28
    is_premium="true",  # String "true" is converted to bool True
)

print(f"Age: {customer.age} (type: {type(customer.age).__name__})")
print(f"Premium: {customer.is_premium} (type: {type(customer.is_premium).__name__})")

Age: 28 (type: int)
Premium: True (type: bool)

This is useful when reading data from CSV files or APIs where everything comes as strings.
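For example, every value produced by csv.DictReader is a string, and the model validates and converts a whole row in one step (a minimal sketch using an inline CSV string):
import csv
import io

csv_data = "customer_id,name,email,age,is_premium\nC001,Alice Smith,alice@example.com,28,true\n"

for row in csv.DictReader(io.StringIO(csv_data)):
    customer = Customer(**row)  # "28" -> 28, "true" -> True
    print(customer.age, customer.is_premium)

28 True
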
Constraint Validation
Beyond types, you often need business rules: age must be positive, names can’t be empty, customer IDs must follow a pattern.
In dataclass, you define fields in one place and validate them in __post_init__. The validation logic grows with each constraint:
@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    age: int
    is_premium: bool = False

    def __post_init__(self):
        if not self.customer_id:
            raise ValueError("Customer ID cannot be empty")
        if not self.name or len(self.name) < 1:
            raise ValueError("Name cannot be empty")
        if "@" not in self.email:
            raise ValueError(f"Invalid email: {self.email}")
        if self.age < 0 or self.age > 150:
            raise ValueError(f"Age must be between 0 and 150: {self.age}")

Pydantic puts constraints directly in Field(), keeping rules next to the data they validate:
from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    customer_id: str
    name: str = Field(min_length=1)
    email: str
    age: int = Field(ge=0, le=150)  # Age must be between 0 and 150
    is_premium: bool = False

try:
    customer = Customer(
        customer_id="C001",
        name="",  # Empty name
        email="alice@example.com",
        age=-5,  # Negative age
    )
except ValidationError as e:
    print(e)

2 validation errors for Customer
name
  String should have at least 1 character [type=string_too_short, input_value='', input_type=str]
age
  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-5, input_type=int]

Nested Validation
Data structures are rarely flat. A customer has an address, an order contains items. When something is wrong inside a nested object, you need to know exactly where.
Pydantic validates each level and reports the full path to any error:
from pydantic import BaseModel, Field, ValidationError

class Address(BaseModel):
    street: str
    city: str
    zip_code: str = Field(pattern=r"^\d{5}$")  # Must be 5 digits

class Customer(BaseModel):
    customer_id: str
    name: str
    address: Address

try:
    customer = Customer(
        customer_id="C001",
        name="Alice Smith",
        address={
            "street": "123 Main St",
            "city": "New York",
            "zip_code": "invalid",  # Invalid zip code
        },
    )
except ValidationError as e:
    print(e)

1 validation error for Customer
address.zip_code
  String should match pattern '^\d{5}$' [type=string_pattern_mismatch, input_value='invalid', input_type=str]

The error message shows address.zip_code, pinpointing the exact location in the nested structure.
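If you need to handle failures programmatically rather than printing them, the same details are available as structured data via e.errors():
# Inside the except block above
for err in e.errors():
    print(err["loc"], err["msg"])

('address', 'zip_code') String should match pattern '^\d{5}$'
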
For extracting structured data from documents using Pydantic, see our LlamaIndex data extraction guide.
Final Thoughts
To summarize what each tool provides:

dict: Quick to create. No structure or validation.
NamedTuple: Fixed structure with IDE autocomplete. Immutable.
dataclass: Mutable fields, safe defaults, custom logic via __post_init__.
Pydantic: Runtime type enforcement, automatic type coercion, built-in constraints.

Personally, I use dict for quick prototyping:
stats = {"rmse": 0.234, "mae": 0.189, "r2": 0.91}

Then Pydantic when the code moves to production. For example, a training config should reject invalid values like negative learning rates:
from pydantic import BaseModel, Field

class TrainingConfig(BaseModel):
    epochs: int = Field(ge=1)
    batch_size: int = Field(ge=1)
    learning_rate: float = Field(gt=0)

config = TrainingConfig(epochs=10, batch_size=32, learning_rate=0.001)
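
Passing an invalid value now fails at construction time instead of silently producing a broken training run:
TrainingConfig(epochs=10, batch_size=32, learning_rate=-0.001)

ValidationError: 1 validation error for TrainingConfig
learning_rate
  Input should be greater than 0 [type=greater_than, input_value=-0.001, input_type=float]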

Pick the level of protection that matches your needs. A notebook experiment doesn’t need Pydantic, but a production API does.
Related Tutorials

Database Integration: SQLModel vs psycopg2 for combining Pydantic-style validation with databases
Testing: Pytest for Data Scientists to test your data containers
Configuration: Hydra for Python Configuration for validated configuration management

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →


Turn Receipt Images into Spreadsheets with LlamaIndex

Table of Contents

Introduction
What You Will Learn
Introduction to LlamaIndex
Basic Image Processing with LlamaParse
Structured Data Extraction with Pydantic
Compare Extraction with Ground Truth
Process the Images for Better Extraction
Export Clean Data to CSV or Excel
Speed Up Processing with Async Parallel Execution
Try It Yourself
Conclusion and Next Steps

Introduction
Manual data entry from receipts, invoices, and contracts wastes hours and introduces errors. What if you could automatically extract structured data from these documents in minutes?
In this article, you’ll learn how to transform receipt images into structured data using LlamaIndex, then export the results to a spreadsheet for analysis.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What You Will Learn

Convert scanned receipts to structured data with LlamaParse and Pydantic models
Validate extraction accuracy by comparing results against ground truth annotations
Fix parsing errors by preprocessing low-quality images
Export clean receipt data to spreadsheet format

Introduction to LlamaIndex
LlamaIndex is a framework that connects LLMs with your data through three core capabilities:

Data ingestion: Built-in readers for PDFs, images, web pages, and databases that automatically parse content into processable nodes.
Structured extraction: LLM-powered conversion of unstructured text into Pydantic models with automatic validation.
Retrieval and indexing: Vector stores and semantic search that enable context-augmented queries over your documents.

It eliminates boilerplate code for loading, parsing, and querying data, letting you focus on building LLM applications.
The table below compares LlamaIndex with two other popular frameworks for LLM applications:

Framework  | Purpose                                       | Best For
LlamaIndex | Document ingestion and structured extraction  | Converting unstructured documents into query-ready data
LangChain  | LLM orchestration and tool integration        | Building conversational agents with multiple LLM calls
LangGraph  | Stateful workflow management                  | Coordinating long-running, multi-agent processes

Installation
Start with installing the required packages for this tutorial, including:

llama-index: Core LlamaIndex framework with base indexing and retrieval functionality
llama-parse: Document parsing service for PDFs, images, and complex layouts
llama-index-program-openai: OpenAI integration for structured data extraction with Pydantic
python-dotenv: Load environment variables from .env files
rapidfuzz: Fuzzy string matching library for comparing company names with minor variations

pip install llama-index llama-parse llama-index-program-openai python-dotenv rapidfuzz

Environment Setup
Create a .env file to store your API keys:
# .env
LLAMA_CLOUD_API_KEY="your-llama-parse-key"
OPENAI_API_KEY="your-openai-key"

Get your API keys from:

LlamaParse API: cloud.llamaindex.ai
OpenAI API: platform.openai.com/api-keys

Load the environment variables from the .env file with load_dotenv:
from dotenv import load_dotenv
import os

load_dotenv()

Configure the default LLM with Settings:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.context_window = 8000

Settings stores global defaults so every query engine and program reuses the same LLM configuration. Keeping temperature at 0 nudges the model to return deterministic, structured outputs.
Basic Image Processing with LlamaParse
In this tutorial, we will use the SROIE Dataset v2 from Kaggle. This dataset contains real-world receipt scans from the ICDAR 2019 competition.
You can download the dataset directly from Kaggle’s website or use the Kaggle CLI:
# Install the Kaggle CLI once
uv pip install kaggle

# Configure Kaggle credentials (run once per environment)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

# Create a workspace folder and download the full archive (~1 GB)
mkdir -p data
kaggle datasets download urbikn/sroie-datasetv2 -p data

# Extract everything and inspect a few image files
unzip -q -o data/sroie-datasetv2.zip -d data

This tutorial uses data from the data/SROIE2019/train/ directory, which contains:

img: Original receipt images
entities: Ground truth annotations for validation

Load the first 10 receipts into a list of paths:
from pathlib import Path

receipt_dir = Path("data/SROIE2019/train/img")
num_receipts = 10
receipt_paths = sorted(receipt_dir.glob("*.jpg"))[:num_receipts]

Take a look at the first receipt:
from IPython.display import Image

first_receipt_path = receipt_paths[0]
Image(filename=first_receipt_path)

Next, use LlamaParse to convert the first receipt into markdown.
from llama_parse import LlamaParse

# Parse receipts with LlamaParse
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # Output format
    num_workers=4,  # Number of parallel workers for faster processing
    language="en",  # Language hint for OCR accuracy
    skip_diagonal_text=True,  # Ignore rotated or diagonal text
)
first_receipt = parser.load_data(first_receipt_path)[0]

Preview the markdown for the first receipt:
# Preview the first receipt
preview = "\n".join(first_receipt.text.splitlines()[:10])
print(preview)

Output:
tan woon yann
BOOK TA K (TAMAN DAYA) SDN BHD
789417-W
NO.5: 55,57 & 59, JALAN SAGU 18,
TAMAN DaYA,
81100 JOHOR BAHRU,
JOHOR.

LlamaParse successfully converts receipt images to text, but there is no structure: vendor names, dates, and totals are all mixed together in plain text. This format is not ideal for exporting to spreadsheets or analytics tools for further analysis.
The next section uses Pydantic models to extract structured fields like company, total, and purchase_date automatically.
Structured Data Extraction with Pydantic
Pydantic is a Python library that uses type hints for data validation and automatic type conversion. By defining a receipt schema once, you can extract consistent structured data from receipts regardless of their format or layout.
Start by defining two Pydantic models that represent receipt structure:
from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationInfo, model_validator

class ReceiptItem(BaseModel):
"""Represents a single line item extracted from a receipt."""

description: str = Field(description="Item name exactly as shown on the receipt")
quantity: int = Field(default=1, ge=1, description="Integer quantity of the item")
unit_price: Optional[float] = Field(
default=None, ge=0, description="Price per unit in the receipt currency"
)
discount_amount: float = Field(
default=0.0, ge=0, description="Discount applied to this line item"
)

class Receipt(BaseModel):
"""Structured fields extracted from a retail receipt."""

company: str = Field(description="Business or merchant name")
purchase_date: Optional[date] = Field(
default=None, description="Date in YYYY-MM-DD format"
)
address: Optional[str] = Field(default=None, description="Address of the business")
total: float = Field(description="Final charged amount")
items: List[ReceiptItem] = Field(default_factory=list)

Create an OpenAIPydanticProgram that instructs the LLM to extract data according to our Receipt model:
from llama_index.program.openai import OpenAIPydanticProgram

prompt = """
You are extracting structured data from a receipt.
Use the provided text to populate the Receipt model.
Interpret every receipt date as day-first.
If a field is missing, return null.

{context_str}
"""

receipt_program = OpenAIPydanticProgram.from_defaults(
    output_cls=Receipt,
    llm=Settings.llm,
    prompt_template_str=prompt,
)

Process the first parsed document to make sure everything works before scaling to the full batch:
# Process the first receipt
structured_first_receipt = receipt_program(context_str=first_receipt.text)

# Print the receipt as a JSON string for better readability
print(structured_first_receipt.model_dump_json(indent=2))

Output:
{
  "company": "tan woon yann BOOK TA K (TAMAN DAYA) SDN BHD",
  "purchase_date": "2018-12-25",
  "address": "NO.5: 55,57 & 59, JALAN SAGU 18, TAMAN DaYA, 81100 JOHOR BAHRU, JOHOR.",
  "total": 9.0,
  "items": [
    {
      "description": "KF MODELLING CLAY KIDDY FISH",
      "quantity": 1,
      "unit_price": 9.0,
      "discount_amount": 0.0
    }
  ]
}

LlamaIndex populates the Pydantic schema with extracted values:

company: Vendor name from the receipt header
purchase_date: Parsed date (2018-12-25)
total: Final amount (9.0)
items: Line items with description, quantity, and price
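
The result is an ordinary Pydantic object, so the extracted fields are plain attributes:
print(structured_first_receipt.company)
print(structured_first_receipt.total)

tan woon yann BOOK TA K (TAMAN DAYA) SDN BHD
9.0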

Now that the extraction works, let’s scale it to process all receipts in a batch. The function uses each receipt’s filename as a unique identifier:
def extract_documents(paths: List[str], prompt: str, id_column: str = "receipt_id") -> List[dict]:
    """Extract structured data from documents using LlamaParse and LLM."""
    results: List[dict] = []

    # Initialize parser with OCR settings
    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=4,
        language="en",
        skip_diagonal_text=True,
    )

    # Convert images to markdown text
    documents = parser.load_data(paths)

    # Create structured extraction program
    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    # Extract structured data from each document
    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append(
            {
                id_column: document_id,
                "data": parsed_document,
            }
        )
    return results

# Extract structured data from all receipts
structured_receipts = extract_documents(receipt_paths, prompt)

Convert the extracted receipts into a DataFrame for easier inspection:
import pandas as pd

def transform_receipt_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Apply standard transformations to receipt DataFrame columns."""
    df = df.copy()
    df["company"] = df["company"].str.upper()
    df["total"] = pd.to_numeric(df["total"], errors="coerce")
    df["purchase_date"] = pd.to_datetime(
        df["purchase_date"], errors="coerce", dayfirst=True
    ).dt.date
    return df

def create_extracted_df(records: List[dict], id_column: str = "receipt_id") -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                id_column: record[id_column],
                "company": record["data"].company,
                "total": record["data"].total,
                "purchase_date": record["data"].purchase_date,
            }
            for record in records
        ]
    )
    return transform_receipt_columns(df)

extracted_df = create_extracted_df(structured_receipts)
extracted_df

#  | receipt_id   | company                                          | total | purchase_date
0  | X00016469612 | TAN WOON YANN BOOK TA K (TAMAN DAYA) SDN BHD     | 9     | 2018-12-25
1  | X00016469619 | INDAH GIFT & HOME DECO                           | 60.3  | 2018-10-19
2  | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD                        | 33.9  | 2019-01-12
3  | X00016469622 | YONGFATT ENTERPRISE                              | 80.9  | 2018-12-25
4  | X00016469623 | MR D.I.Y. (M) SDN BHD                            | 30.9  | 2018-11-18
5  | X00016469669 | ABC HO TRADING                                   | 31    | 2019-01-09
6  | X00016469672 | SOON HUAT MACHINERY ENTERPRISE                   | 327   | 2019-01-11
7  | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T)  | 20    | 2019-01-23
8  | X51005200938 | TH MNAN                                          | 0     | 2023-10-11
9  | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD                 | 26.6  | 2018-01-18

Most receipts are extracted correctly, but receipt X51005200938 shows issues:

The company name is incomplete (“TH MNAN”)
Total is 0 instead of the actual amount
Date (2023-10-11) appears incorrect

Compare Extraction with Ground Truth
To verify the extraction accuracy, load the ground-truth annotations from data/SROIE2019/train/entities:
def normalize_date(value: str) -> str:
    """Normalize date strings to consistent format."""
    value = (value or "").strip()
    if not value:
        return value
    # Convert hyphens to slashes
    value = value.replace("-", "/")
    parts = value.split("/")
    # Convert 2-digit years to 4-digit (e.g., 18 -> 2018)
    if len(parts[-1]) == 2:
        parts[-1] = f"20{parts[-1]}"
    return "/".join(parts)

def create_ground_truth_df(
    label_paths: List[str], id_column: str = "receipt_id"
) -> pd.DataFrame:
    """Create ground truth DataFrame from label JSON files."""
    records = []
    # Load each JSON file and extract key fields
    for path in label_paths:
        payload = pd.read_json(Path(path), typ="series").to_dict()
        records.append(
            {
                id_column: Path(path).stem,
                "company": payload.get("company"),
                "total": payload.get("total"),
                "purchase_date": normalize_date(payload.get("date")),
            }
        )

    df = pd.DataFrame(records)
    # Apply same transformations as extracted data
    return transform_receipt_columns(df)

# Load ground truth annotations
label_dir = Path("data/SROIE2019/train/entities")
label_paths = sorted(label_dir.glob("*.txt"))[:num_receipts]

ground_truth_df = create_ground_truth_df(label_paths)
ground_truth_df

#  | receipt_id   | company                                 | total  | purchase_date
0  | X00016469612 | BOOK TA .K (TAMAN DAYA) SDN BHD         | 9      | 2018-12-25
1  | X00016469619 | INDAH GIFT & HOME DECO                  | 60.3   | 2018-10-19
2  | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD               | 33.9   | 2019-01-12
3  | X00016469622 | YONGFATT ENTERPRISE                     | 80.9   | 2018-12-25
4  | X00016469623 | MR D.I.Y. (M) SDN BHD                   | 30.9   | 2018-11-18
5  | X00016469669 | ABC HO TRADING                          | 31     | 2019-01-09
6  | X00016469672 | SOON HUAT MACHINERY ENTERPRISE          | 327    | 2019-01-11
7  | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT) SDN. BHD.  | 20     | 2019-01-23
8  | X51005200938 | PERNIAGAAN ZHENG HUI                    | 112.45 | 2018-02-12
9  | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD        | 26.6   | 2018-01-18

Let’s validate extraction accuracy by comparing results against ground truth.
Company names often have minor variations (spacing, punctuation, extra characters), so we’ll use fuzzy matching to tolerate these formatting differences.
from rapidfuzz import fuzz

def fuzzy_match_score(text1: str, text2: str) -> float:
    """Calculate fuzzy match score between two strings."""
    return fuzz.token_set_ratio(str(text1), str(text2))

Test the fuzzy matching with sample company names:
# Nearly identical strings score high
print(f"Score: {fuzzy_match_score('BOOK TA K SDN BHD', 'BOOK TA .K SDN BHD'):.2f}")

# Different punctuation still matches well
print(f"Score: {fuzzy_match_score('MR D.I.Y. JOHOR', 'MR DIY JOHOR'):.2f}")

# Completely different strings score low
print(f"Score: {fuzzy_match_score('ABC TRADING', 'XYZ COMPANY'):.2f}")

Output:
Score: 97.14
Score: 55.17
Score: 27.27

Now build a comparison function that merges extracted and ground truth data, then applies fuzzy matching for company names and exact matching for numeric fields:
def compare_receipts(
    extracted_df: pd.DataFrame,
    ground_truth_df: pd.DataFrame,
    id_column: str,
    fuzzy_match_cols: List[str],
    exact_match_cols: List[str],
    fuzzy_threshold: int = 80,
) -> pd.DataFrame:
    """Compare extracted and ground truth data with explicit column specifications."""
    comparison_df = extracted_df.merge(
        ground_truth_df,
        on=id_column,
        how="inner",
        suffixes=("_extracted", "_truth"),
    )

    # Fuzzy matching
    for col in fuzzy_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_score"] = comparison_df.apply(
            lambda row: fuzzy_match_score(row[extracted_col], row[truth_col]),
            axis=1,
        )
        comparison_df[f"{col}_match"] = comparison_df[f"{col}_score"] >= fuzzy_threshold

    # Exact matching
    for col in exact_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_match"] = (
            comparison_df[extracted_col] == comparison_df[truth_col]
        )

    return comparison_df

comparison_df = compare_receipts(
    extracted_df,
    ground_truth_df,
    id_column="receipt_id",
    fuzzy_match_cols=["company"],
    exact_match_cols=["total", "purchase_date"],
)

Inspect any rows where the company, total, or purchase-date checks fail:
def get_mismatch_rows(comparison_df: pd.DataFrame) -> pd.DataFrame:
    """Get mismatched rows, excluding match indicator columns."""
    # Extract match columns and data columns
    match_columns = [col for col in comparison_df.columns if col.endswith("_match")]
    data_columns = sorted(
        [
            col
            for col in comparison_df.columns
            if col.endswith("_extracted") or col.endswith("_truth")
        ]
    )

    # Check for rows where not all matches are True
    has_mismatch = comparison_df[match_columns].all(axis=1).eq(False)

    return comparison_df[has_mismatch][data_columns]

mismatch_df = get_mismatch_rows(comparison_df)

mismatch_df

#  | company_extracted | company_truth        | purchase_date_extracted | purchase_date_truth | total_extracted | total_truth
8  | TH MNAN           | PERNIAGAAN ZHENG HUI | 2023-10-11              | 2018-02-12          | 0               | 112.45

This confirms what we saw earlier. All receipts match the ground truth annotations except receipt X51005200938, which mismatches on the following fields:

Company name
Total
Purchase date

Let’s take a closer look at this receipt to see if we can identify the issue.
import IPython.display as display

file_to_inspect = receipt_dir / "X51005200938.jpg"

display.Image(filename=file_to_inspect)

This receipt appears smaller than the others in the dataset, which may affect OCR readability. In the next section, we will scale up the receipt to improve the extraction.
Process the Images for Better Extraction
Create a function to scale up the receipt:
from PIL import Image

def scale_image(image_path: Path, output_dir: Path, scale_factor: int = 3) -> Path:
    """Scale up an image using high-quality resampling.

    Args:
        image_path: Path to the original image
        output_dir: Directory to save the scaled image
        scale_factor: Factor to scale up the image (default: 3x)

    Returns:
        Path to the scaled image
    """
    # Load the image
    img = Image.open(image_path)

    # Scale up the image using high-quality resampling
    new_size = (img.width * scale_factor, img.height * scale_factor)
    img_resized = img.resize(new_size, Image.Resampling.LANCZOS)

    # Save to output directory with same filename
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / image_path.name
    img_resized.save(output_path, quality=95)

    return output_path

Apply the function to the problematic receipt:
problematic_receipt_path = receipt_dir / "X51005200938.jpg"
adjusted_receipt_dir = Path("data/SROIE2019/train/img_adjusted")

scaled_image_path = scale_image(problematic_receipt_path, adjusted_receipt_dir, scale_factor=3)

Let’s extract the structured data from the scaled image:
problematic_structured_receipts = extract_documents([scaled_image_path], prompt)
problematic_extracted_df = create_extracted_df(problematic_structured_receipts)

problematic_extracted_df

#  | receipt_id   | company              | total  | purchase_date
0  | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12

Nice! Scaling fixes the extraction: the company name and purchase date are now accurate. The total reads 112.46 instead of 112.45, an acceptable difference since the final digit is genuinely ambiguous on the printed receipt.
Export Clean Data to CSV or Excel
Now build a clean image set: keep the scaled receipt and copy the remaining original images into the adjusted directory:
import shutil

clean_receipt_paths = [scaled_image_path]
# Copy all receipts except the already processed one
for receipt_path in receipt_paths:
    if receipt_path != problematic_receipt_path:  # Skip the already scaled image
        output_path = adjusted_receipt_dir / receipt_path.name
        shutil.copy2(receipt_path, output_path)
        clean_receipt_paths.append(output_path)
        print(f"Copied {receipt_path.name}")

Let’s run the pipeline again with the processed images:
clean_structured_receipts = extract_documents(clean_receipt_paths, prompt)
clean_extracted_df = create_extracted_df(clean_structured_receipts)
clean_extracted_df

#  | receipt_id   | company                                          | total  | purchase_date
0  | X51005200938 | PERNIAGAAN ZHENG HUI                             | 112.46 | 2018-02-12
1  | X00016469612 | TAN WOON YANN                                    | 9      | 2018-12-25
2  | X00016469619 | INDAH GIFT & HOME DECO                           | 60.3   | 2018-10-19
3  | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD                        | 33.9   | 2019-01-12
4  | X00016469622 | YONGFATT ENTERPRISE                              | 80.9   | 2018-12-25
5  | X00016469623 | MR D.I.Y. (M) SDN BHD                            | 30.9   | 2018-11-18
6  | X00016469669 | ABC HO TRADING                                   | 31     | 2019-01-09
7  | X00016469672 | SOON HUAT MACHINERY ENTERPRISE                   | 327    | 2019-01-11
8  | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T)  | 20     | 2019-01-23
9  | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD                 | 26.6   | 2018-01-18

Awesome! All receipts now match the ground truth annotations.
Now we can export the dataset to a spreadsheet with just a few lines of code:
import pandas as pd

# Export to CSV
output_path = Path("reports/receipts.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_extracted_df.to_csv(output_path, index=False)
print(f"Exported {len(clean_extracted_df)} receipts to {output_path}")

Output:
Exported 10 receipts to reports/receipts.csv

The exported data can now be imported into spreadsheet applications, analytics tools, or business intelligence platforms.
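Excel export works the same way through DataFrame.to_excel (it requires the openpyxl package):
# Export to Excel (needs: pip install openpyxl)
clean_extracted_df.to_excel("reports/receipts.xlsx", index=False)
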
Speed Up Processing with Async Parallel Execution
LlamaIndex supports asynchronous processing to handle multiple receipts concurrently. By using async/await with LlamaParse's aload_data() method, you can process receipts in parallel instead of sequentially, significantly reducing total processing time.
Here’s how to modify the extraction function to use async processing. Setting num_workers=10 means the parser will process up to 10 receipts concurrently:
import asyncio

async def extract_documents_async(
    paths: List[str], prompt: str, id_column: str = "receipt_id"
) -> List[dict]:
    """Extract structured data from documents using async LlamaParse."""
    results: List[dict] = []

    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=10,  # Process 10 receipts concurrently
        language="en",
        skip_diagonal_text=True,
    )

    # Use async method for parallel processing
    documents = await parser.aload_data(paths)

    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append({id_column: document_id, "data": parsed_document})

    return results

# Run with top-level await (works in Jupyter; use asyncio.run() in a script)
structured_receipts = await extract_documents_async(receipt_paths, prompt)

See the LlamaIndex async documentation for more details.
Try It Yourself
The concepts from this tutorial are available as a reusable pipeline in this GitHub repository. The code includes both synchronous and asynchronous versions:
Synchronous pipelines (simple, sequential processing):

Generic pipeline (document_extraction_pipeline.py): Reusable extraction function that works with any Pydantic schema
Receipt pipeline (extract_receipts_pipeline.py): Complete example with Receipt schema, image scaling, and data transformations

Asynchronous pipelines (parallel processing with 3-10x speedup):

Async generic pipeline (async_document_extraction_pipeline.py): Concurrent document processing
Async receipt pipeline (async_extract_receipts_pipeline.py): Batch receipt processing with progress tracking

Run the receipt extraction example:
# Synchronous version (simple, sequential)
uv run extract_receipts_pipeline.py

# Asynchronous version (parallel processing, 3-10x faster)
uv run async_extract_receipts_pipeline.py

Or create your own extractor by importing extract_structured_data() and providing your custom Pydantic schema, extraction prompt, and optional preprocessing functions.
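A hypothetical invocation could look like this (the parameter names are illustrative; check the repository for the exact signature):
# Hypothetical usage sketch; see the repository for the actual API
from document_extraction_pipeline import extract_structured_data

results = extract_structured_data(
    paths=receipt_paths,
    output_cls=Receipt,
    prompt=prompt,
)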

Learn production-ready practices for data science and AI projects in Production-Ready Data Science.

Conclusion and Next Steps
This tutorial demonstrated how LlamaIndex automates receipt data extraction with minimal code. You converted scanned images to structured data, validated results against ground truth, and exported a clean CSV ready for analysis.
Here are some ideas to enhance this receipt extraction pipeline:

Richer schemas: Add nested Pydantic models for vendor details, payment methods, and itemized line items
Validation rules: Flag outliers like totals over $500 or future dates for manual review (see the sketch after this list)
Multi-stage workflows: Create custom workflows that combine image preprocessing, extraction, validation, and export steps with error handling
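
As a sketch of the validation-rules idea, you could subclass the Receipt model above with Pydantic field validators (the $500 threshold is arbitrary):
from datetime import date
from pydantic import field_validator

class AuditedReceipt(Receipt):
    @field_validator("total")
    @classmethod
    def flag_large_total(cls, value: float) -> float:
        if value > 500:
            raise ValueError(f"Total {value} exceeds 500; flag for manual review")
        return value

    @field_validator("purchase_date")
    @classmethod
    def reject_future_dates(cls, value):
        if value is not None and value > date.today():
            raise ValueError(f"Purchase date {value} is in the future")
        return value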

📚 Want to go deeper? Learning new techniques is the easy part. Knowing how to structure, test, and deploy them is what separates side projects from real work. My book shows you how to build data science projects that actually make it to production. Get the book →

