
Turn Receipt Images into Spreadsheets with LlamaIndex

Introduction

Manual data entry from receipts, invoices, and contracts wastes hours and introduces errors. What if you could automatically extract structured data from these documents in minutes?

In this article, you’ll learn how to transform receipt images into structured data using LlamaIndex, then export the results to a spreadsheet for analysis.

💻 Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!

What You Will Learn

  • Convert scanned receipts to structured data with LlamaParse and Pydantic models
  • Validate extraction accuracy by comparing results against ground truth annotations
  • Fix parsing errors by preprocessing low-quality images
  • Export clean receipt data to spreadsheet format

Introduction to LlamaIndex

LlamaIndex is a framework that connects LLMs with your data through three core capabilities:

  1. Data ingestion: Built-in readers for PDFs, images, web pages, and databases that automatically parse content into processable nodes.
  2. Structured extraction: LLM-powered conversion of unstructured text into Pydantic models with automatic validation.
  3. Retrieval and indexing: Vector stores and semantic search that enable context-augmented queries over your documents.

It eliminates boilerplate code for loading, parsing, and querying data, letting you focus on building LLM applications.
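
To see how little code that involves, here is a minimal sketch that exercises all three capabilities together (assuming a docs/ folder of text files and an OpenAI key in the environment, neither of which is part of this tutorial's pipeline):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest: read every supported file in docs/ into Document objects
documents = SimpleDirectoryReader("docs").load_data()

# Index: embed the documents into an in-memory vector store
index = VectorStoreIndex.from_documents(documents)

# Retrieve: run a context-augmented query over the documents
response = index.as_query_engine().query("What topics do these files cover?")
print(response)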

The table below compares LlamaIndex with two other popular frameworks for LLM applications:

| Framework | Purpose | Best For |
|---|---|---|
| LlamaIndex | Document ingestion and structured extraction | Converting unstructured documents into query-ready data |
| LangChain | LLM orchestration and tool integration | Building conversational agents with multiple LLM calls |
| LangGraph | Stateful workflow management | Coordinating long-running, multi-agent processes |

Installation

Start by installing the required packages for this tutorial:

  • llama-index: Core LlamaIndex framework with base indexing and retrieval functionality
  • llama-parse: Document parsing service for PDFs, images, and complex layouts
  • llama-index-program-openai: OpenAI integration for structured data extraction with Pydantic
  • python-dotenv: Load environment variables from .env files
  • rapidfuzz: Fuzzy string matching library for comparing company names with minor variations

pip install llama-index llama-parse llama-index-program-openai python-dotenv rapidfuzz

Environment Setup

Create a .env file to store your API keys:

# .env
LLAMA_CLOUD_API_KEY="your-llama-parse-key"
OPENAI_API_KEY="your-openai-key"

Get your API keys from LlamaCloud (for LlamaParse) and the OpenAI platform.

Load the environment variables from the .env file with load_dotenv:

from dotenv import load_dotenv
import os

load_dotenv()

Configure the default LLM with Settings:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.context_window = 8000

Settings stores global defaults so every query engine and program reuses the same LLM configuration. Keeping temperature at 0 nudges the model to return deterministic, structured outputs.

Basic Image Processing with LlamaParse

In this tutorial, we will use the SROIE Dataset v2 from Kaggle. This dataset contains real-world receipt scans from the ICDAR 2019 competition.

You can download the dataset directly from Kaggle’s website or use the Kaggle CLI:

# Install the Kaggle CLI once
uv pip install kaggle

# Configure Kaggle credentials (run once per environment)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

# Create a workspace folder and download the full archive (~1 GB)
mkdir -p data
kaggle datasets download urbikn/sroie-datasetv2 -p data

# Extract everything and inspect a few image files
unzip -q -o data/sroie-datasetv2.zip -d data

This tutorial uses data from the data/SROIE2019/train/ directory, which contains:

  • img: Original receipt images
  • entities: Ground truth annotations for validation

Load the first 10 receipts into a list of paths:

from pathlib import Path

receipt_dir = Path("data/SROIE2019/train/img")
num_receipts = 10
receipt_paths = sorted(receipt_dir.glob("*.jpg"))[:num_receipts]

Take a look at the first receipt:

from IPython.display import Image

first_receipt_path = receipt_paths[0]
Image(filename=first_receipt_path)

First receipt

Next, use LlamaParse to convert the first receipt into markdown.

from llama_parse import LlamaParse


# Parse receipts with LlamaParse
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # Output format
    num_workers=4,  # Number of parallel workers for faster processing
    language="en",  # Language hint for OCR accuracy
    skip_diagonal_text=True,  # Ignore rotated or diagonal text
)
first_receipt = parser.load_data(first_receipt_path)[0]

Preview the markdown for the first receipt:

# Preview the first receipt
preview = "\n".join(first_receipt.text.splitlines()[:10])
print(preview)

Output:

tan woon yann
BOOK TA K (TAMAN DAYA) SDN BHD
789417-W
NO.5: 55,57 & 59, JALAN SAGU 18,
TAMAN DaYA,
81100 JOHOR BAHRU,
JOHOR.

LlamaParse successfully converts receipt images to text, but there is no structure: vendor names, dates, and totals are all mixed together in plain text. This format is not ideal for exporting to spreadsheets or analytics tools for further analysis.

The next section uses Pydantic models to extract structured fields like company, total, and purchase_date automatically.

Structured Data Extraction with Pydantic

Pydantic is a Python library that uses type hints for data validation and automatic type conversion. By defining a receipt schema once, you can extract consistent structured data from receipts regardless of their format or layout.
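
As a quick illustration of that validation and coercion (a toy model, separate from the receipt schema):

from pydantic import BaseModel, ValidationError


class LineTotal(BaseModel):
    amount: float


# The string "9.00" is coerced to the float 9.0 automatically
print(LineTotal(amount="9.00").amount)

# Invalid input raises a descriptive error instead of failing silently
try:
    LineTotal(amount="nine")
except ValidationError as err:
    print(err)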

Start by defining two Pydantic models that represent receipt structure:

from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationInfo, model_validator


class ReceiptItem(BaseModel):
    """Represents a single line item extracted from a receipt."""

    description: str = Field(description="Item name exactly as shown on the receipt")
    quantity: int = Field(default=1, ge=1, description="Integer quantity of the item")
    unit_price: Optional[float] = Field(
        default=None, ge=0, description="Price per unit in the receipt currency"
    )
    discount_amount: float = Field(
        default=0.0, ge=0, description="Discount applied to this line item"
    )


class Receipt(BaseModel):
    """Structured fields extracted from a retail receipt."""

    company: str = Field(description="Business or merchant name")
    purchase_date: Optional[date] = Field(
        default=None, description="Date in YYYY-MM-DD format"
    )
    address: Optional[str] = Field(default=None, description="Address of the business")
    total: float = Field(description="Final charged amount")
    items: List[ReceiptItem] = Field(default_factory=list)
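
Before wiring the schema into an LLM program, you can sanity-check it by hand. Note how Pydantic coerces the date and price strings (the sample values below are invented for illustration):

sample = Receipt(
    company="DEMO STORE",  # hypothetical values for a quick schema check
    purchase_date="2019-01-12",  # ISO string coerced to datetime.date
    total="33.90",  # numeric string coerced to float
    items=[{"description": "WIDGET", "quantity": 2, "unit_price": 1.50}],
)
print(sample.purchase_date, sample.total)  # 2019-01-12 33.9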

Create an OpenAIPydanticProgram that instructs the LLM to extract data according to our Receipt model:

from llama_index.program.openai import OpenAIPydanticProgram

prompt = """
You are extracting structured data from a receipt.
Use the provided text to populate the Receipt model.
Interpret every receipt date as day-first.
If a field is missing, return null.

{context_str}
"""

receipt_program = OpenAIPydanticProgram.from_defaults(
    output_cls=Receipt,
    llm=Settings.llm,
    prompt_template_str=prompt,
)

Process the first parsed document to make sure everything works before scaling to the full batch:

# Process the first receipt
structured_first_receipt = receipt_program(context_str=first_receipt.text)

# Print the receipt as a JSON string for better readability
print(structured_first_receipt.model_dump_json(indent=2))

Output:

{
  "company": "tan woon yann BOOK TA K (TAMAN DAYA) SDN BHD",
  "purchase_date": "2018-12-25",
  "address": "NO.5: 55,57 & 59, JALAN SAGU 18, TAMAN DaYA, 81100 JOHOR BAHRU, JOHOR.",
  "total": 9.0,
  "items": [
    {
      "description": "KF MODELLING CLAY KIDDY FISH",
      "quantity": 1,
      "unit_price": 9.0,
      "discount_amount": 0.0
    }
  ]
}

LlamaIndex populates the Pydantic schema with extracted values:

  • company: Vendor name from the receipt header
  • purchase_date: Parsed date (2018-12-25)
  • total: Final amount (9.0)
  • items: Line items with description, quantity, and price

Now that the extraction works, let’s scale it to process all receipts in a batch. The function uses each receipt’s filename as a unique identifier:

def extract_documents(
    paths: List[str], prompt: str, id_column: str = "receipt_id"
) -> List[dict]:
    """Extract structured data from documents using LlamaParse and LLM."""
    results: List[dict] = []

    # Initialize parser with OCR settings
    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=4,
        language="en",
        skip_diagonal_text=True,
    )

    # Convert images to markdown text
    documents = parser.load_data(paths)

    # Create structured extraction program
    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    # Extract structured data from each document
    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append(
            {
                id_column: document_id,
                "data": parsed_document,
            }
        )
    return results

# Extract structured data from all receipts
structured_receipts = extract_documents(receipt_paths, prompt)

Convert the extracted receipts into a DataFrame for easier inspection:

import pandas as pd


def transform_receipt_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Apply standard transformations to receipt DataFrame columns."""
    df = df.copy()
    df["company"] = df["company"].str.upper()
    df["total"] = pd.to_numeric(df["total"], errors="coerce")
    df["purchase_date"] = pd.to_datetime(
        df["purchase_date"], errors="coerce", dayfirst=True
    ).dt.date
    return df


def create_extracted_df(records: List[dict], id_column: str = "receipt_id") -> pd.DataFrame:
    df = pd.DataFrame(
        [
            {
                id_column: record[id_column],
                "company": record["data"].company,
                "total": record["data"].total,
                "purchase_date": record["data"].purchase_date,
            }
            for record in records
        ]
    )
    return transform_receipt_columns(df)


extracted_df = create_extracted_df(structured_receipts)
extracted_df
|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | TAN WOON YANN BOOK TA K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 8 | X51005200938 | TH MNAN | 0 | 2023-10-11 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Most receipts are extracted correctly, but receipt X51005200938 shows issues:

  • The company name is incomplete (“TH MNAN”)
  • Total is 0 instead of the actual amount
  • Date (2023-10-11) appears incorrect

Compare Extraction with Ground Truth

To verify the extraction accuracy, load the ground-truth annotations from data/SROIE2019/train/entities:

def normalize_date(value: str) -> str:
    """Normalize date strings to consistent format."""
    value = (value or "").strip()
    if not value:
        return value
    # Convert hyphens to slashes
    value = value.replace("-", "/")
    parts = value.split("/")
    # Convert 2-digit years to 4-digit (e.g., 18 -> 2018)
    if len(parts[-1]) == 2:
        parts[-1] = f"20{parts[-1]}"
    return "/".join(parts)


def create_ground_truth_df(
    label_paths: List[str], id_column: str = "receipt_id"
) -> pd.DataFrame:
    """Create ground truth DataFrame from label JSON files."""
    records = []
    # Load each JSON file and extract key fields
    for path in label_paths:
        payload = pd.read_json(Path(path), typ="series").to_dict()
        records.append(
            {
                id_column: Path(path).stem,
                "company": payload.get("company"),
                "total": payload.get("total"),
                "purchase_date": normalize_date(payload.get("date")),
            }
        )

    df = pd.DataFrame(records)
    # Apply same transformations as extracted data
    return transform_receipt_columns(df)


# Load ground truth annotations
label_dir = Path("data/SROIE2019/train/entities")
label_paths = sorted(label_dir.glob("*.txt"))[:num_receipts]

ground_truth_df = create_ground_truth_df(label_paths)
ground_truth_df
|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X00016469612 | BOOK TA .K (TAMAN DAYA) SDN BHD | 9 | 2018-12-25 |
| 1 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 2 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 3 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 4 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 5 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 6 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 7 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT) SDN. BHD. | 20 | 2019-01-23 |
| 8 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.45 | 2018-02-12 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Let’s validate extraction accuracy by comparing results against ground truth.

Company names often have minor variations (spacing, punctuation, extra characters), so we’ll use fuzzy matching to tolerate these formatting differences.

from rapidfuzz import fuzz


def fuzzy_match_score(text1: str, text2: str) -> float:
    """Calculate the fuzzy match score (0-100) between two strings."""
    return fuzz.token_set_ratio(str(text1), str(text2))

Test the fuzzy matching with sample company names:

# Nearly identical strings score high
print(f"Score: {fuzzy_match_score('BOOK TA K SDN BHD', 'BOOK TA .K SDN BHD'):.2f}")

# Punctuation differences reduce the score noticeably
print(f"Score: {fuzzy_match_score('MR D.I.Y. JOHOR', 'MR DIY JOHOR'):.2f}")

# Completely different strings score low
print(f"Score: {fuzzy_match_score('ABC TRADING', 'XYZ COMPANY'):.2f}")

Output:

Score: 97.14
Score: 55.17
Score: 27.27

Now build a comparison function that merges extracted and ground truth data, then applies fuzzy matching for company names and exact matching for numeric fields:

def compare_receipts(
    extracted_df: pd.DataFrame,
    ground_truth_df: pd.DataFrame,
    id_column: str,
    fuzzy_match_cols: List[str],
    exact_match_cols: List[str],
    fuzzy_threshold: int = 80,
) -> pd.DataFrame:
    """Compare extracted and ground truth data with explicit column specifications."""
    comparison_df = extracted_df.merge(
        ground_truth_df,
        on=id_column,
        how="inner",
        suffixes=("_extracted", "_truth"),
    )

    # Fuzzy matching
    for col in fuzzy_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_score"] = comparison_df.apply(
            lambda row: fuzzy_match_score(row[extracted_col], row[truth_col]),
            axis=1,
        )
        comparison_df[f"{col}_match"] = comparison_df[f"{col}_score"] >= fuzzy_threshold

    # Exact matching
    for col in exact_match_cols:
        extracted_col = f"{col}_extracted"
        truth_col = f"{col}_truth"
        comparison_df[f"{col}_match"] = (
            comparison_df[extracted_col] == comparison_df[truth_col]
        )

    return comparison_df


comparison_df = compare_receipts(
    extracted_df,
    ground_truth_df,
    id_column="receipt_id",
    fuzzy_match_cols=["company"],
    exact_match_cols=["total", "purchase_date"],
)

Inspect any rows where the company, total, or purchase-date checks fail:

def get_mismatch_rows(comparison_df: pd.DataFrame) -> pd.DataFrame:
    """Get mismatched rows, excluding the match indicator columns."""
    # Separate the match indicator columns from the data columns
    match_columns = [col for col in comparison_df.columns if col.endswith("_match")]
    data_columns = sorted(
        col
        for col in comparison_df.columns
        if col.endswith("_extracted") or col.endswith("_truth")
    )

    # Keep rows where at least one check failed
    has_mismatch = ~comparison_df[match_columns].all(axis=1)

    return comparison_df[has_mismatch][data_columns]


mismatch_df = get_mismatch_rows(comparison_df)
mismatch_df
|   | company_extracted | company_truth | purchase_date_extracted | purchase_date_truth | total_extracted | total_truth |
|---|---|---|---|---|---|---|
| 8 | TH MNAN | PERNIAGAAN ZHENG HUI | 2023-10-11 | 2018-02-12 | 0 | 112.45 |

This confirms what we saw earlier: every receipt matches its ground truth annotation except X51005200938, which fails on the following fields:

  • Company name
  • Total
  • Purchase date

Let’s take a closer look at this receipt to see if we can identify the issue.

import IPython.display as display

file_to_inspect = receipt_dir / "X51005200938.jpg"

display.Image(filename=file_to_inspect)

Problematic receipt

This receipt appears smaller than the others in the dataset, which may affect OCR readability. In the next section, we will scale up the receipt to improve the extraction.

Process the Images for Better Extraction

Create a function to scale up the receipt:

from PIL import Image


def scale_image(image_path: Path, output_dir: Path, scale_factor: int = 3) -> Path:
    """Scale up an image using high-quality resampling.

    Args:
        image_path: Path to the original image
        output_dir: Directory to save the scaled image
        scale_factor: Factor to scale up the image (default: 3x)

    Returns:
        Path to the scaled image
    """
    # Load the image
    img = Image.open(image_path)

    # Scale up the image using high-quality resampling
    new_size = (img.width * scale_factor, img.height * scale_factor)
    img_resized = img.resize(new_size, Image.Resampling.LANCZOS)

    # Save to output directory with same filename
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / image_path.name
    img_resized.save(output_path, quality=95)

    return output_path

Apply the function to the problematic receipt:

problematic_receipt_path = receipt_dir / "X51005200938.jpg"
adjusted_receipt_dir = Path("data/SROIE2019/train/img_adjusted")

scaled_image_path = scale_image(problematic_receipt_path, adjusted_receipt_dir, scale_factor=3)

Let’s extract the structured data from the scaled image:

problematic_structured_receipts = extract_documents([scaled_image_path], prompt)
problematic_extracted_df = create_extracted_df(problematic_structured_receipts)

problematic_extracted_df
|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |

Nice! Scaling fixes the extraction: the company name and purchase date are now accurate. The total reads 112.46 versus the ground-truth 112.45, an acceptable difference since the printed amount genuinely looks like 112.46 on the receipt.

Export Clean Data to CSV or Excel

Apply the scaling fix to all receipts. Copy the remaining images to the processed directory, excluding the already-scaled receipt:

import shutil

clean_receipt_paths = [scaled_image_path]
# Copy all receipts except the already processed one
for receipt_path in receipt_paths:
    if receipt_path != problematic_receipt_path:  # Skip the already scaled image
        output_path = adjusted_receipt_dir / receipt_path.name
        shutil.copy2(receipt_path, output_path)
        clean_receipt_paths.append(output_path)
        print(f"Copied {receipt_path.name}")

Let’s run the pipeline again with the processed images:

clean_structured_receipts = extract_documents(clean_receipt_paths, prompt)
clean_extracted_df = create_extracted_df(clean_structured_receipts)
clean_extracted_df
|   | receipt_id | company | total | purchase_date |
|---|---|---|---|---|
| 0 | X51005200938 | PERNIAGAAN ZHENG HUI | 112.46 | 2018-02-12 |
| 1 | X00016469612 | TAN WOON YANN | 9 | 2018-12-25 |
| 2 | X00016469619 | INDAH GIFT & HOME DECO | 60.3 | 2018-10-19 |
| 3 | X00016469620 | MR D.I.Y. (JOHOR) SDN BHD | 33.9 | 2019-01-12 |
| 4 | X00016469622 | YONGFATT ENTERPRISE | 80.9 | 2018-12-25 |
| 5 | X00016469623 | MR D.I.Y. (M) SDN BHD | 30.9 | 2018-11-18 |
| 6 | X00016469669 | ABC HO TRADING | 31 | 2019-01-09 |
| 7 | X00016469672 | SOON HUAT MACHINERY ENTERPRISE | 327 | 2019-01-11 |
| 8 | X00016469676 | S.H.H. MOTOR (SUNGAI RENGIT SN. BHD. (801580-T) | 20 | 2019-01-23 |
| 9 | X51005230617 | GERBANG ALAF RESTAURANTS SDN BHD | 26.6 | 2018-01-18 |

Awesome! All receipts now match the ground truth annotations.

Now we can export the dataset to a spreadsheet with just a few lines of code:

import pandas as pd

# Export to CSV
output_path = Path("reports/receipts.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_extracted_df.to_csv(output_path, index=False)
print(f"Exported {len(clean_extracted_df)} receipts to {output_path}")

Output:

Exported 10 receipts to reports/receipts.csv

The exported data can now be imported into spreadsheet applications, analytics tools, or business intelligence platforms.
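
If you prefer Excel, pandas' to_excel works the same way on the DataFrame built above (it needs an Excel writer backend such as openpyxl installed):

# Export to Excel instead (requires the openpyxl package)
excel_path = Path("reports/receipts.xlsx")
clean_extracted_df.to_excel(excel_path, index=False, sheet_name="receipts")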

Speed Up Processing with Async Parallel Execution

LlamaIndex supports asynchronous processing to handle multiple receipts concurrently. By using async/await with the parser's aload_data() method, you can process receipts in parallel instead of sequentially, significantly reducing total processing time.

Here’s how to modify the extraction function to use async processing. Setting num_workers=10 means the parser will process up to 10 receipts concurrently:

import asyncio


async def extract_documents_async(
    paths: List[str], prompt: str, id_column: str = "receipt_id"
) -> List[dict]:
    """Extract structured data from documents using async LlamaParse."""
    results: List[dict] = []

    parser = LlamaParse(
        api_key=os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",
        num_workers=10,  # Process 10 receipts concurrently
        language="en",
        skip_diagonal_text=True,
    )

    # Use async method for parallel processing
    documents = await parser.aload_data(paths)

    program = OpenAIPydanticProgram.from_defaults(
        output_cls=Receipt,
        llm=Settings.llm,
        prompt_template_str=prompt,
    )

    for path, doc in zip(paths, documents):
        document_id = Path(path).stem
        parsed_document = program(context_str=doc.text)
        results.append({id_column: document_id, "data": parsed_document})

    return results


# Run with top-level await (available in Jupyter notebooks)
structured_receipts = await extract_documents_async(receipt_paths, prompt)
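
In a plain Python script, where top-level await is not available, wrap the coroutine in asyncio.run instead:

# Equivalent call from a script rather than a notebook
structured_receipts = asyncio.run(extract_documents_async(receipt_paths, prompt))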

See the LlamaIndex async documentation for more details.

Try It Yourself

The concepts from this tutorial are available as a reusable pipeline in this GitHub repository. The code includes both synchronous pipelines (simple, sequential processing) and asynchronous pipelines (parallel processing with a 3-10x speedup).

Run the receipt extraction example:

# Synchronous version (simple, sequential)
uv run extract_receipts_pipeline.py

# Asynchronous version (parallel processing, 3-10x faster)
uv run async_extract_receipts_pipeline.py

Or create your own extractor by importing extract_structured_data() and providing your custom Pydantic schema, extraction prompt, and optional preprocessing functions.

Learn production-ready practices for data science and AI projects in Production-Ready Data Science.

Conclusion and Next Steps

This tutorial demonstrated how LlamaIndex automates receipt data extraction with minimal code. You converted scanned images to structured data, validated results against ground truth, and exported a clean CSV ready for analysis.

Here are some ideas to enhance this receipt extraction pipeline:

  • Richer schemas: Add nested Pydantic models for vendor details, payment methods, and itemized line items
  • Validation rules: Flag outliers like totals over $500 or future dates for manual review (see the sketch after this list)
  • Multi-stage workflows: Create custom workflows that combine image preprocessing, extraction, validation, and export steps with error handling
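
As a starting point for the validation-rules idea, here is a sketch that layers outlier flags on top of the Receipt model defined earlier (the thresholds are placeholders, not recommendations):

from datetime import date
from typing import List

from pydantic import Field, model_validator


class ValidatedReceipt(Receipt):
    """Receipt with simple outlier checks layered on top."""

    flags: List[str] = Field(default_factory=list)

    @model_validator(mode="after")
    def flag_outliers(self) -> "ValidatedReceipt":
        # Placeholder thresholds; tune them for your data
        if self.total > 500:
            self.flags.append("total_over_500")
        if self.purchase_date and self.purchase_date > date.today():
            self.flags.append("future_date")
        return self

Swapping ValidatedReceipt in as the output_cls of the extraction program would then surface these flags alongside the extracted fields.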
