Table of Contents
- Introduction
- What You Will Learn
- Introduction to LlamaIndex
- Basic Image Processing with LlamaParse
- Structured Data Extraction with Pydantic
- Compare Extraction with Ground Truth
- Process the Images for Better Extraction
- Export Clean Data to CSV or Excel
- Speed Up Processing with Async Parallel Execution
- Try It Yourself
- Conclusion and Next Steps
Introduction
Manual data entry from receipts, invoices, and contracts wastes hours and introduces errors. What if you could automatically extract structured data from these documents in minutes?
In this article, you’ll learn how to transform receipt images into structured data using LlamaIndex, then export the results to a spreadsheet for analysis.
💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
What You Will Learn
- Convert scanned receipts to structured data with LlamaParse and Pydantic models
- Validate extraction accuracy by comparing results against ground truth annotations
- Fix parsing errors by preprocessing low-quality images
- Export clean receipt data to spreadsheet format
Introduction to LlamaIndex
LlamaIndex is a framework that connects LLMs with your data through three core capabilities:
- Data ingestion: Built-in readers for PDFs, images, web pages, and databases that automatically parse content into processable nodes.
- Structured extraction: LLM-powered conversion of unstructured text into Pydantic models with automatic validation.
- Retrieval and indexing: Vector stores and semantic search that enable context-augmented queries over your documents.
It eliminates boilerplate code for loading, parsing, and querying data, letting you focus on building LLM applications.
The table below compares LlamaIndex with two other popular frameworks for LLM applications:
Framework | Purpose | Best For |
---|---|---|
LlamaIndex | Document ingestion and structured extraction | Converting unstructured documents into query-ready data |
LangChain | LLM orchestration and tool integration | Building conversational agents with multiple LLM calls |
LangGraph | Stateful workflow management | Coordinating long-running, multi-agent processes |
Installation
Start by installing the required packages for this tutorial:
- llama-index: Core LlamaIndex framework with base indexing and retrieval functionality
- llama-parse: Document parsing service for PDFs, images, and complex layouts
- llama-index-program-openai: OpenAI integration for structured data extraction with Pydantic
- python-dotenv: Load environment variables from .env files
- rapidfuzz: Fuzzy string matching library for comparing company names with minor variations
pip install llama-index llama-parse llama-index-program-openai python-dotenv rapidfuzz
Environment Setup
Create a .env file to store your API keys:
# .env
LLAMA_CLOUD_API_KEY="your-llama-parse-key"
OPENAI_API_KEY="your-openai-key"
Get your API keys from:
- LlamaParse API: cloud.llamaindex.ai
- OpenAI API: platform.openai.com/api-keys
Load the environment variables from the .env file with load_dotenv:
from dotenv import load_dotenv
import os
load_dotenv()
Configure the default LLM with Settings:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.context_window = 8000
Settings stores global defaults so every query engine and program reuses the same LLM configuration. Keeping temperature at 0 nudges the model to return deterministic, structured outputs.
Basic Image Processing with LlamaParse
In this tutorial, we will use the SROIE Dataset v2 from Kaggle. This dataset contains real-world receipt scans from the ICDAR 2019 competition.
You can download the dataset directly from Kaggle’s website or use the Kaggle CLI:
# Install the Kaggle CLI once
uv pip install kaggle
# Configure Kaggle credentials (run once per environment)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key
# Create a workspace folder and download the full archive (~1 GB)
mkdir -p data
kaggle datasets download urbikn/sroie-datasetv2 -p data
# Extract everything and inspect a few image files
unzip -q -o data/sroie-datasetv2.zip -d data
This tutorial uses data from the data/SROIE2019/train/ directory, which contains:
- img: Original receipt images
- entities: Ground truth annotations for validation
Load the first 10 receipts into a list of paths:
from pathlib import Path
receipt_dir = Path("data/SROIE2019/train/img")
num_receipts = 10
receipt_paths = sorted(receipt_dir.glob("*.jpg"))[:num_receipts]
Take a look at the first receipt:
from IPython.display import Image
first_receipt_path = receipt_paths[0]
Image(filename=first_receipt_path)
Next, use LlamaParse to convert the first receipt into markdown.
from llama_parse import LlamaParse
# Parse receipts with LlamaParse
parser = LlamaParse(
api_key=os.environ["LLAMA_CLOUD_API_KEY"],
result_type="markdown", # Output format
num_workers=4, # Number of parallel workers for faster processing
language="en", # Language hint for OCR accuracy
skip_diagonal_text=True, # Ignore rotated or diagonal text
)
first_receipt = parser.load_data(first_receipt_path)[0]
Preview the markdown for the first receipt:
# Preview the first receipt
preview = "\n".join(first_receipt.text.splitlines()[:10])
print(preview)
Output:
tan woon yann
BOOK TA K (TAMAN DAYA) SDN BHD
789417-W
NO.5: 55,57 & 59, JALAN SAGU 18,
TAMAN DaYA,
81100 JOHOR BAHRU,
JOHOR.
LlamaParse successfully converts receipt images to text, but there is no structure: vendor names, dates, and totals are all mixed together in plain text. This format is not ideal for exporting to spreadsheets or analytics tools for further analysis.
The next section uses Pydantic models to extract structured fields like company, total, and purchase_date automatically.
Structured Data Extraction with Pydantic
Pydantic is a Python library that uses type hints for data validation and automatic type conversion. By defining a receipt schema once, you can extract consistent structured data from receipts regardless of their format or layout.
Start by defining two Pydantic models that represent receipt structure:
from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field
class ReceiptItem(BaseModel):
"""Represents a single line item extracted from a receipt."""
description: str = Field(description="Item name exactly as shown on the receipt")
quantity: int = Field(default=1, ge=1, description="Integer quantity of the item")
unit_price: Optional[float] = Field(
default=None, ge=0, description="Price per unit in the receipt currency"
)
discount_amount: float = Field(
default=0.0, ge=0, description="Discount applied to this line item"
)
class Receipt(BaseModel):
"""Structured fields extracted from a retail receipt."""
company: str = Field(description="Business or merchant name")
purchase_date: Optional[date] = Field(
default=None, description="Date in YYYY-MM-DD format"
)
address: Optional[str] = Field(default=None, description="Address of the business")
total: float = Field(description="Final charged amount")
items: List[ReceiptItem] = Field(default_factory=list)
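Pydantic also coerces compatible inputs automatically: numeric strings become floats, ISO date strings become date objects, and nested dictionaries become ReceiptItem instances. A quick sanity check with made-up values illustrates this:
sample = Receipt(
    company="EXAMPLE TRADING SDN BHD",
    purchase_date="2018-12-25",  # ISO string coerced to datetime.date
    total="9.00",                # numeric string coerced to float
    items=[{"description": "MODELLING CLAY", "unit_price": 9.0}],
)
print(sample.total, type(sample.total))  # 9.0 <class 'float'>
print(sample.purchase_date)              # 2018-12-25
print(sample.items[0].quantity)          # defaults to 1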
Create an OpenAIPydanticProgram that instructs the LLM to extract data according to our Receipt model:
from llama_index.program.openai import OpenAIPydanticProgram
prompt = """
You are extracting structured data from a receipt.
Use the provided text to populate the Receipt model.
Interpret every receipt date as day-first.
If a field is missing, return null.
{context_str}
"""
receipt_program = OpenAIPydanticProgram.from_defaults(
output_cls=Receipt,
llm=Settings.llm,
prompt_template_str=prompt,
)
Process the first parsed document to make sure everything works before scaling to the full batch:
# Process the first receipt
structured_first_receipt = receipt_program(context_str=first_receipt.text)
# Print the receipt as a JSON string for better readability
print(structured_first_receipt.model_dump_json(indent=2))
Output:
{
"company": "tan woon yann BOOK TA K (TAMAN DAYA) SDN BHD",
"purchase_date": "2018-12-25",
"address": "NO.5: 55,57 & 59, JALAN SAGU 18, TAMAN DaYA, 81100 JOHOR BAHRU, JOHOR.",
"total": 9.0,
"items": [
{
"description": "KF MODELLING CLAY KIDDY FISH",
"quantity": 1,
"unit_price": 9.0,
"discount_amount": 0.0
}
]
}
LlamaIndex populates the Pydantic schema with extracted values:
- company: Vendor name from the receipt header
- purchase_date: Parsed date (2018-12-25)
- total: Final amount (9.0)
- items: Line items with description, quantity, and price
Now that the extraction works, let’s scale it to process all receipts in a batch. The function uses each receipt’s filename as a unique identifier:
def extract_documents(paths: List[str], prompt: str, id_column: str = "receipt_id") -> List[dict]:
"""Extract structured data from documents using LlamaParse and LLM."""
results: List[dict] = []
# Initialize parser with OCR settings
parser = LlamaParse(
api_key=os.environ["LLAMA_CLOUD_API_KEY"],
result_type="markdown",
num_workers=4,
language="en",
skip_diagonal_text=True,
)
# Convert images to markdown text
documents = parser.load_data(paths)
# Create structured extraction program
program = OpenAIPydanticProgram.from_defaults(
output_cls=Receipt,
llm=Settings.llm,
prompt_template_str=prompt,
)
# Extract structured data from each document
for path, doc in zip(paths, documents):
document_id = Path(path).stem
parsed_document = program(context_str=doc.text)
results.append(
{
id_column: document_id,
"data": parsed_document,
}
)
return results
# Extract structured data from all receipts
structured_receipts = extract_documents(receipt_paths, prompt)
Convert the extracted receipts into a DataFrame for easier inspection:
import pandas as pd
def transform_receipt_columns(df: pd.DataFrame) -> pd.DataFrame:
"""Apply standard transformations to receipt DataFrame columns."""
df = df.copy()
df["company"] = df["company"].str.upper()
df["total"] = pd.to_numeric(df["total"], errors="coerce")
df["purchase_date"] = pd.to_datetime(
df["purchase_date"], errors="coerce", dayfirst=True
).dt.date
return df
def create_extracted_df(records: List[dict], id_column: str = "receipt_id") -> pd.DataFrame:
df = pd.DataFrame(
[
{
id_column: record[id_column],
"company": record["data"].company,
"total": record["data"].total,
"purchase_date": record["data"].purchase_date,
}
for record in records
]
)
return transform_receipt_columns(df)
extracted_df = create_extracted_df(structured_receipts)
extracted_df
| | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | TAN WOON YANN BOOK TA K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 8 | X51005200938 | TH MNAN | 0 | 2023-10-11 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |
Most receipts are extracted correctly, but receipt X51005200938 shows issues:
- The company name is incomplete (“TH MNAN”)
- Total is 0 instead of the actual amount
- Date (2023-10-11) appears incorrect
Compare Extraction with Ground Truth
To verify the extraction accuracy, load the ground-truth annotations from data/SROIE2019/train/entities:
def normalize_date(value: str) -> str:
"""Normalize date strings to consistent format."""
value = (value or "").strip()
if not value:
return value
# Convert hyphens to slashes
value = value.replace("-", "/")
parts = value.split("/")
# Convert 2-digit years to 4-digit (e.g., 18 -> 2018)
if len(parts[-1]) == 2:
parts[-1] = f"20{parts[-1]}"
return "/".join(parts)
def create_ground_truth_df(
label_paths: List[str], id_column: str = "receipt_id"
) -> pd.DataFrame:
"""Create ground truth DataFrame from label JSON files."""
records = []
# Load each JSON file and extract key fields
for path in label_paths:
payload = pd.read_json(Path(path), typ="series").to_dict()
records.append(
{
id_column: Path(path).stem,
"company": payload.get("company"),
"total": payload.get("total"),
"purchase_date": normalize_date(payload.get("date")),
}
)
df = pd.DataFrame(records)
# Apply same transformations as extracted data
return transform_receipt_columns(df)
# Load ground truth annotations
label_dir = Path("data/SROIE2019/train/entities")
label_paths = sorted(label_dir.glob("*.txt"))[:num_receipts]
ground_truth_df = create_ground_truth_df(label_paths)
ground_truth_df
| | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | BOOK TA .K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT) SDN. BHD. | 20 | 2019-01-23 |
| 8 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.45 | 2018-02-12 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |
Let’s validate extraction accuracy by comparing results against ground truth.
Company names often have minor variations (spacing, punctuation, extra characters), so we’ll use fuzzy matching to tolerate these formatting differences.
from rapidfuzz import fuzz
def fuzzy_match_score(text1: str, text2: str) -> int:
"""Calculate fuzzy match score between two strings."""
return fuzz.token_set_ratio(str(text1), str(text2))
Test the fuzzy matching with sample company names:
# Nearly identical strings score high
print(f"Score: {fuzzy_match_score('BOOK TA K SDN BHD', 'BOOK TA .K SDN BHD'):.2f}")
# Different punctuation still matches well
print(f"Score: {fuzzy_match_score('MR D.I.Y. JOHOR', 'MR DIY JOHOR'):.2f}")
# Completely different strings score low
print(f"Score: {fuzzy_match_score('ABC TRADING', 'XYZ COMPANY'):.2f}")
Output:
Score: 97.14
Score: 55.17
Score: 27.27
Now build a comparison function that merges extracted and ground truth data, then applies fuzzy matching for company names and exact matching for numeric fields:
def compare_receipts(
extracted_df: pd.DataFrame,
ground_truth_df: pd.DataFrame,
id_column: str,
fuzzy_match_cols: List[str],
exact_match_cols: List[str],
fuzzy_threshold: int = 80,
) -> pd.DataFrame:
"""Compare extracted and ground truth data with explicit column specifications."""
comparison_df = extracted_df.merge(
ground_truth_df,
on=id_column,
how="inner",
suffixes=("_extracted", "_truth"),
)
# Fuzzy matching
for col in fuzzy_match_cols:
extracted_col = f"{col}_extracted"
truth_col = f"{col}_truth"
comparison_df[f"{col}_score"] = comparison_df.apply(
lambda row: fuzzy_match_score(row[extracted_col], row[truth_col]),
axis=1,
)
comparison_df[f"{col}_match"] = comparison_df[f"{col}_score"] >= fuzzy_threshold
# Exact matching
for col in exact_match_cols:
extracted_col = f"{col}_extracted"
truth_col = f"{col}_truth"
comparison_df[f"{col}_match"] = (
comparison_df[extracted_col] == comparison_df[truth_col]
)
return comparison_df
comparison_df = compare_receipts(
extracted_df,
ground_truth_df,
id_column="receipt_id",
fuzzy_match_cols=["company"],
exact_match_cols=["total", "purchase_date"],
)
Inspect any rows where the company, total, or purchase-date checks fail:
def get_mismatch_rows(comparison_df: pd.DataFrame) -> pd.DataFrame:
"""Get mismatched rows, excluding match indicator columns."""
# Extract match columns and data columns
match_columns = [col for col in comparison_df.columns if col.endswith("_match")]
data_columns = sorted([col for col in comparison_df.columns if col.endswith("_extracted") or col.endswith("_truth")])
# Check for rows where not all matches are True
has_mismatch = comparison_df[match_columns].all(axis=1).eq(False)
return comparison_df[has_mismatch][data_columns]
mismatch_df = get_mismatch_rows(comparison_df)
mismatch_df
| | company_extracted | company_truth | purchase_date_extracted | purchase_date_truth | total_extracted | total_truth |
|---|---|---|---|---|---|---|
| 8 | TH MNAN | PERNIAGAAN ZHENG HUI | 2023-10-11 | 2018-02-12 | 0 | 112.45 |
This confirms what we saw earlier. All receipts match the ground truth annotations except receipt X51005200938, which fails on the following fields:
- Company name
- Total
- Purchase date
Let’s take a closer look at this receipt to see if we can identify the issue.
import IPython.display as display
file_to_inspect = receipt_dir / "X51005200938.jpg"
display.Image(filename=file_to_inspect)
This receipt appears smaller than the others in the dataset, which may affect OCR readability. In the next section, we will scale up the receipt to improve the extraction.
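You can confirm the size difference by printing this receipt's pixel dimensions next to the first receipt's (a quick check; the exact numbers depend on your copy of the dataset):
from PIL import Image as PILImage  # aliased to avoid clashing with IPython.display.Image

# Compare the dimensions of the problematic receipt with the first receipt
for path in [first_receipt_path, file_to_inspect]:
    with PILImage.open(path) as img:
        print(f"{path.name}: {img.width} x {img.height} pixels")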
Process the Images for Better Extraction
Create a function to scale up the receipt:
from PIL import Image
def scale_image(image_path: Path, output_dir: Path, scale_factor: int = 3) -> Path:
"""Scale up an image using high-quality resampling.
Args:
image_path: Path to the original image
output_dir: Directory to save the scaled image
scale_factor: Factor to scale up the image (default: 3x)
Returns:
Path to the scaled image
"""
# Load the image
img = Image.open(image_path)
# Scale up the image using high-quality resampling
new_size = (img.width * scale_factor, img.height * scale_factor)
img_resized = img.resize(new_size, Image.Resampling.LANCZOS)
# Save to output directory with same filename
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / image_path.name
img_resized.save(output_path, quality=95)
return output_path
Apply the function to the problematic receipt:
problematic_receipt_path = receipt_dir / "X51005200938.jpg"
adjusted_receipt_dir = Path("data/SROIE2019/train/img_adjusted")
scaled_image_path = scale_image(problematic_receipt_path, adjusted_receipt_dir, scale_factor=3)
Let’s extract the structured data from the scaled image:
problematic_structured_receipts = extract_documents([scaled_image_path], prompt)
problematic_extracted_df = create_extracted_df(problematic_structured_receipts)
problematic_extracted_df
| | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |
Nice! Scaling fixes the extraction: the company name and purchase date are now accurate. The total reads 112.46 instead of 112.45, which is acceptable because the printed 112.45 is easy to misread as 112.46 on the scanned receipt.
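To double-check the fix programmatically, you can reuse compare_receipts on just this receipt. Note that the exact-match check on total will still flag the one-cent difference even though the company and date now line up:
# Compare the re-extracted receipt against its ground truth row
fixed_comparison_df = compare_receipts(
    problematic_extracted_df,
    ground_truth_df,
    id_column="receipt_id",
    fuzzy_match_cols=["company"],
    exact_match_cols=["total", "purchase_date"],
)
fixed_comparison_df[["receipt_id", "company_match", "total_match", "purchase_date_match"]]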
Export Clean Data to CSV or Excel
Now build the final set of processed images: keep the scaled receipt and copy the remaining originals into the processed directory:
import shutil
clean_receipt_paths = [scaled_image_path]
# Copy all receipts except the already processed one
for receipt_path in receipt_paths:
if receipt_path != problematic_receipt_path: # Skip the already scaled image
output_path = adjusted_receipt_dir / receipt_path.name
shutil.copy2(receipt_path, output_path)
clean_receipt_paths.append(output_path)
print(f"Copied {receipt_path.name}")
Let’s run the pipeline again with the processed images:
clean_structured_receipts = extract_documents(clean_receipt_paths, prompt)
clean_extracted_df = create_extracted_df(clean_structured_receipts)
clean_extracted_df
| | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |
| 1 | X00016469612 | TAN WOON YANN | 9 | 2018-12-25 |
| 2 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 3 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 4 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 5 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 6 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 7 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 8 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |
Awesome! All receipts now match the ground truth annotations.
Now we can export the dataset to a spreadsheet with just a few lines of code:
import pandas as pd
# Export to CSV
output_path = Path("reports/receipts.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_extracted_df.to_csv(output_path, index=False)
print(f"Exported {len(clean_extracted_df)} receipts to {output_path}")
Output:
Exported 10 receipts to reports/receipts.csv
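For an Excel file instead, DataFrame.to_excel works the same way (this assumes the openpyxl package is installed, which pandas uses as its default .xlsx writer):
# Export to Excel (requires openpyxl: pip install openpyxl)
excel_path = Path("reports/receipts.xlsx")
clean_extracted_df.to_excel(excel_path, index=False, sheet_name="receipts")
print(f"Exported {len(clean_extracted_df)} receipts to {excel_path}")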
The exported data can now be imported into spreadsheet applications, analytics tools, or business intelligence platforms.
Speed Up Processing with Async Parallel Execution
LlamaIndex supports asynchronous processing to handle multiple receipts concurrently. By using async/await with the parser's aload_data() method, you can process receipts in parallel instead of sequentially, significantly reducing total processing time.
Here’s how to modify the extraction function to use async processing. Setting num_workers=10 means the parser will process up to 10 receipts concurrently:
import asyncio
async def extract_documents_async(
paths: List[str], prompt: str, id_column: str = "receipt_id"
) -> List[dict]:
"""Extract structured data from documents using async LlamaParse."""
results: List[dict] = []
parser = LlamaParse(
api_key=os.environ["LLAMA_CLOUD_API_KEY"],
result_type="markdown",
num_workers=10, # Process 10 receipts concurrently
language="en",
skip_diagonal_text=True,
)
# Use async method for parallel processing
documents = await parser.aload_data(paths)
program = OpenAIPydanticProgram.from_defaults(
output_cls=Receipt,
llm=Settings.llm,
prompt_template_str=prompt,
)
for path, doc in zip(paths, documents):
document_id = Path(path).stem
parsed_document = program(context_str=doc.text)
results.append({id_column: document_id, "data": parsed_document})
return results
# In a notebook, you can await the coroutine directly since Jupyter runs an event loop
structured_receipts = await extract_documents_async(receipt_paths, prompt)
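In a standalone script there is no running event loop, so wrap the coroutine with asyncio.run instead; a minimal sketch:
def main() -> None:
    """Run the async extraction pipeline outside a notebook."""
    receipts = asyncio.run(extract_documents_async(receipt_paths, prompt))
    print(f"Extracted {len(receipts)} receipts")

if __name__ == "__main__":
    main()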
See the LlamaIndex async documentation for more details.
Try It Yourself
The concepts from this tutorial are available as a reusable pipeline in this GitHub repository. The code includes both synchronous and asynchronous versions:
Synchronous pipelines (simple, sequential processing):
- Generic pipeline (document_extraction_pipeline.py): Reusable extraction function that works with any Pydantic schema
- Receipt pipeline (extract_receipts_pipeline.py): Complete example with Receipt schema, image scaling, and data transformations
Asynchronous pipelines (parallel processing with 3-10x speedup):
- Async generic pipeline (async_document_extraction_pipeline.py): Concurrent document processing
- Async receipt pipeline (async_extract_receipts_pipeline.py): Batch receipt processing with progress tracking
Run the receipt extraction example:
# Synchronous version (simple, sequential)
uv run extract_receipts_pipeline.py
# Asynchronous version (parallel processing, 3-10x faster)
uv run async_extract_receipts_pipeline.py
Or create your own extractor by importing extract_structured_data() and providing your custom Pydantic schema, extraction prompt, and optional preprocessing functions.
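As a rough sketch, a custom invoice extractor might look like the snippet below. The Invoice model and invoice_prompt here are placeholders, and the exact signature of extract_structured_data() lives in the repository, so check it before wiring up the call:
from typing import Optional
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    """Hypothetical schema for invoice extraction."""
    vendor: str = Field(description="Vendor or supplier name")
    invoice_number: Optional[str] = Field(default=None, description="Invoice reference number")
    total: float = Field(description="Total amount due")

invoice_prompt = """
You are extracting structured data from an invoice.
Use the provided text to populate the Invoice model.
If a field is missing, return null.
{context_str}
"""

# Hypothetical call; confirm the argument names against the repository code
# invoices = extract_structured_data(paths, output_cls=Invoice, prompt=invoice_prompt)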
Learn production-ready practices for data science and AI projects in Production-Ready Data Science.
Conclusion and Next Steps
This tutorial demonstrated how LlamaIndex automates receipt data extraction with minimal code. You converted scanned images to structured data, validated results against ground truth, and exported a clean CSV ready for analysis.
Here are some ideas to enhance this receipt extraction pipeline:
- Richer schemas: Add nested Pydantic models for vendor details, payment methods, and itemized line items
- Validation rules: Flag outliers like totals over $500 or future dates for manual review
- Multi-stage workflows: Create custom workflows that combine image preprocessing, extraction, validation, and export steps with error handling