Sparrow: Document Processing Made Simple

The Problem with Traditional Approaches

Extracting structured information from a document often involves juggling multiple specialized libraries and APIs, including:

Tesseract for Optical Character Recognition (OCR)
OpenCV for image processing
Transformers for entity extraction

This complexity can lead to significant development overhead, making it difficult to build reliable and efficient document processing pipelines.

# Traditional approach mixing multiple libraries
import re

import cv2
import pytesseract
from transformers import pipeline

def extract_invoice_data(image_path):
    # OCR with Tesseract (cv2.imread returns None if the file cannot be read)
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    text = pytesseract.image_to_string(image)

    # Use regex to find invoice number
    match = re.search(r'Invoice #: (\d+)', text)
    invoice_num = match.group(1) if match else None

    # Use transformer for entity extraction
    nlp = pipeline("ner")
    entities = nlp(text)

    # Complex post-processing logic
    # ... more code to structure the data
    return {"invoice_number": invoice_num, "entities": entities}

Introducing Sparrow

Sparrow is a unified API that handles various document types and extraction methods, making it easy to switch between different backends (local or cloud) and extraction models while maintaining a consistent output structure.

Examples

Bank Statement

Input document:

Query:

Output:

{
  "bank": "First Platypus Bank",
  "address": "1234 Kings St., New York, NY 12123",
  "account_holder": "Mary G. Orta",
  "account_number": "1234567890123",
  "statement_date": "3/1/2022",
  "period_covered": "2/1/2022 - 3/1/2022",
  "account_summary": {
    "balance_on_march_1": "$25,032.23",
    "total_money_in": "$10,234.23",
    "total_money_out": "$10,532.51"
  },
  "transactions": [
    {
      "date": "02/01",
      "description": "PGD EasyPay Debit",
      "withdrawal": "203.24",
      "deposit": "",
      "balance": "22,098.23"
    },
    {
      "date": "02/02",
      "description": "AB&B Online Payment*****",
      "withdrawal": "71.23",
      "deposit": "",
      "balance": "22,027.00"
    },
    {
      "date": "02/04",
      "description": "Check No. 2345",
      "withdrawal": "",
      "deposit": "450.00",
      "balance": "22,477.00"
    },
    {
      "date": "02/05",
      "description": "Payroll Direct Dep 23422342 Giants",
      "withdrawal": "",
      "deposit": "2,534.65",
      "balance": "25,011.65"
    },
    {
      "date": "02/06",
      "description": "Signature POS Debit - TJP",
      "withdrawal": "84.50",
      "deposit": "",
      "balance": "24,927.15"
    },
    {
      "date": "02/07",
      "description": "Check No. 234",
      "withdrawal": "1,400.00",
      "deposit": "",
      "balance": "23,527.15"
    },
    {
      "date": "02/08",
      "description": "Check No. 342",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    },
    {
      "date": "02/09",
      "description": "FPB AutoPay***** Credit Card",
      "withdrawal": "456.02",
      "deposit": "",
      "balance": "23,096.13"
    },
    {
      "date": "02/08",
      "description": "Check No. 123",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    },
    {
      "date": "02/09",
      "description": "FPB AutoPay***** Credit Card",
      "withdrawal": "156.02",
      "deposit": "",
      "balance": "23,096.13"
    },
    {
      "date": "02/08",
      "description": "Cash Deposit",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    }
  ],
  "valid": "true"
}
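
Every amount in this output is returned as a string, so a consumer usually converts the values before doing arithmetic on them. The following is a minimal sketch, assuming the transactions have been loaded into Python dictionaries; the helper names `to_amount` and `balance_mismatches` are illustrative and not part of Sparrow. It totals the withdrawals and reports any row whose balance does not follow from the previous row.

```python
from decimal import Decimal

# Three consecutive rows copied from the output above.
transactions = [
    {"date": "02/01", "description": "PGD EasyPay Debit",
     "withdrawal": "203.24", "deposit": "", "balance": "22,098.23"},
    {"date": "02/02", "description": "AB&B Online Payment*****",
     "withdrawal": "71.23", "deposit": "", "balance": "22,027.00"},
    {"date": "02/04", "description": "Check No. 2345",
     "withdrawal": "", "deposit": "450.00", "balance": "22,477.00"},
]


def to_amount(text):
    """Convert '1,400.00' or '$25,032.23' to Decimal; an empty string counts as zero."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned) if cleaned else Decimal("0")


def balance_mismatches(rows):
    """Return (date, description) for rows where balance != previous balance + deposit - withdrawal."""
    mismatches = []
    for previous, current in zip(rows, rows[1:]):
        expected = (to_amount(previous["balance"])
                    + to_amount(current["deposit"])
                    - to_amount(current["withdrawal"]))
        if expected != to_amount(current["balance"]):
            mismatches.append((current["date"], current["description"]))
    return mismatches


total_out = sum(to_amount(row["withdrawal"]) for row in transactions)
print(total_out)                          # 274.47
print(balance_mismatches(transactions))   # []
```

Decimal is used instead of float so that currency values compare exactly. Running the same check over a full statement is one way to spot rows that deserve a second look.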

Bonds Table

Input document:

Query:

[{"instrument_name":"str", "valuation":0}]

Output:

{
  "data": [
    {
      "instrument_name": "UNITS BLACKROCK FIX INC DUB FDS PLC ISHS EUR INV GRD CP BD IDX/INST/E",
      "valuation": 19049
    },
    {
      "instrument_name": "UNITS ISHARES III PLC CORE EUR GOVT BOND UCITS ETF/EUR",
      "valuation": 83488
    },
    {
      "instrument_name": "UNITS ISHARES III PLC EUR CORP BOND 1-5YR UCITS ETF/EUR",
      "valuation": 213030
    },
    {
      "instrument_name": "UNIT ISHARES VI PLC/JP MORGAN USD E BOND EUR HED UCITS ETF DIST/HDGD/",
      "valuation": 32774
    },
    {
      "instrument_name": "UNITS XTRACKERS II SICAV/EUR HY CORP BOND UCITS ETF/-1D-/DISTR.",
      "valuation": 23643
    }
  ],
  "valid": "true"
}
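
The query doubles as a schema: each value in the template is a placeholder for the expected type ("str" for text, 0 for a number), and the response above carries the matching rows under "data". As a rough sketch of how a consumer might check a response against its query, the hypothetical `conforms` helper below walks both structures together. It is an illustration only, not Sparrow's own validation logic.

```python
import json

query = '[{"instrument_name":"str", "valuation":0}]'

# Two rows copied from the output above.
response = {
    "data": [
        {"instrument_name": "UNITS ISHARES III PLC CORE EUR GOVT BOND UCITS ETF/EUR",
         "valuation": 83488},
        {"instrument_name": "UNITS ISHARES III PLC EUR CORP BOND 1-5YR UCITS ETF/EUR",
         "valuation": 213030},
    ],
    "valid": "true",
}


def conforms(template, value):
    """Return True if value has the shape and types described by the query template."""
    if isinstance(template, list):
        if not isinstance(value, list):
            return False
        if not template:  # an empty template list accepts any list
            return True
        return all(conforms(template[0], item) for item in value)
    if isinstance(template, dict):
        # Extra keys in the response (such as "valid") are tolerated.
        return isinstance(value, dict) and all(
            key in value and conforms(sub, value[key]) for key, sub in template.items()
        )
    if template == "str":
        return isinstance(value, str)
    if isinstance(template, (int, float)) and not isinstance(template, bool):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


template = json.loads(query)
print(conforms(template, response["data"]))                                   # True
print(conforms(template, [{"instrument_name": "X", "valuation": "19049"}]))   # False
```

The second call fails because the valuation arrives as a string while the template asks for a number.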

Lab Results

Input document:

Query:

{"patient_name": "str", "patient_age": "str", "patient_pid": 0, "lab_results": [{"investigation": "str", "result": 0.00, "reference_value": "str", "unit": "str"}]}

Output:

{
  "patient_name": "Yash M. Patel",
  "patient_age": "21 Years",
  "patient_pid": 555,
  "lab_results": [
    {
      "investigation": "Hemoglobin (Hb)",
      "result": 12.5,
      "reference_value": "13.0 - 17.0",
      "unit": "g/dL"
    },
    {
      "investigation": "RBC COUNT",
      "result": 5.2,
      "reference_value": "4.5 - 5.5",
      "unit": "mill/cumm"
    },
    {
      "investigation": "Packed Cell Volume (PCV)",
      "result": 57.5,
      "reference_value": "40 - 50",
      "unit": "%"
    },
    {
      "investigation": "Mean Corpuscular Volume (MCV)",
      "result": 87.75,
      "reference_value": "83 - 101",
      "unit": "fL"
    },
    {
      "investigation": "MCH",
      "result": 27.2,
      "reference_value": "27 - 32",
      "unit": "pg"
    },
    {
      "investigation": "MCHC",
      "result": 32.8,
      "reference_value": "32.5 - 34.5",
      "unit": "g/dL"
    },
    {
      "investigation": "RDW",
      "result": 13.6,
      "reference_value": "11.6 - 14.0",
      "unit": "%"
    },
    {
      "investigation": "WBC COUNT",
      "result": 9000,
      "reference_value": "4000-11000",
      "unit": "cumm"
    },
    {
      "investigation": "Neutrophils",
      "result": 60,
      "reference_value": "50 - 62",
      "unit": "%"
    },
    {
      "investigation": "Lymphocytes",
      "result": 31,
      "reference_value": "20 - 40",
      "unit": "%"
    },
    {
      "investigation": "Eosinophils",
      "result": 1,
      "reference_value": "00 - 06",
      "unit": "%"
    },
    {
      "investigation": "Monocytes",
      "result": 7,
      "reference_value": "00 - 10",
      "unit": "%"
    },
    {
      "investigation": "Basophils",
      "result": 1,
      "reference_value": "00 - 02",
      "unit": "%"
    },
    {
      "investigation": "Absolute Neutrophils",
      "result": 6000,
      "reference_value": "1500 - 7500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Lymphocytes",
      "result": 3100,
      "reference_value": "1300 - 3500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Eosinophils",
      "result": 100,
      "reference_value": "00 - 500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Monocytes",
      "result": 700,
      "reference_value": "200 - 950",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Basophils",
      "result": 100,
      "reference_value": "00 - 300",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Platelet Count",
      "result": 320000,
      "reference_value": "150000 - 410000",
      "unit": "cumm"
    }
  ],
  "valid": "true"
}
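
Because "result" is requested as a number while "reference_value" stays a string, the two can be compared once the range is parsed. The sketch below assumes the reference values follow the "low - high" pattern seen in this output; the `out_of_range` helper and its regular expression are illustrative and not part of Sparrow.

```python
import re

# Four rows copied from the output above.
lab_results = [
    {"investigation": "Hemoglobin (Hb)", "result": 12.5,
     "reference_value": "13.0 - 17.0", "unit": "g/dL"},
    {"investigation": "RBC COUNT", "result": 5.2,
     "reference_value": "4.5 - 5.5", "unit": "mill/cumm"},
    {"investigation": "Packed Cell Volume (PCV)", "result": 57.5,
     "reference_value": "40 - 50", "unit": "%"},
    {"investigation": "WBC COUNT", "result": 9000,
     "reference_value": "4000-11000", "unit": "cumm"},
]

# Matches "13.0 - 17.0" as well as "4000-11000".
RANGE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def out_of_range(rows):
    """Return the names of investigations whose result lies outside the reference range."""
    flagged = []
    for row in rows:
        match = RANGE_PATTERN.match(row["reference_value"])
        if match is None:
            continue  # skip rows whose range is not in the "low - high" form
        low, high = float(match.group(1)), float(match.group(2))
        if not low <= row["result"] <= high:
            flagged.append(row["investigation"])
    return flagged


print(out_of_range(lab_results))
# ['Hemoglobin (Hb)', 'Packed Cell Volume (PCV)']
```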

How to Use Sparrow

To use Sparrow, pass a JSON query string that lists the field names and types to extract, together with the input document and the pipeline that should process it.

Arguments

query: JSON query string argument with field names and types to fetch
file-path: input document (image or multipage PDF)
pipeline: Sparrow pipeline used to process query request
crop_size=N: crop N pixels from all borders of the input images
debug-dir: folder where processed images are stored
debug: if True, additional messages will be printed

Options

--options mlx: set MLX as backend for local inference
--options mlx-community/Qwen2-VL-72B-Instruct-4bit: name for Vision LLM model, supported by MLX
--options tables_only: process tables only
--options validation_off: disable response validation

Examples

Running locally with Apple MLX backend:

./sparrow.sh '[{"instrument_name":"str", "valuation":0}]' --pipeline "sparrow-parse" --options mlx --options mlx-community/Qwen2-VL-72B-Instruct-4bit --file-path "/data/bonds_table.png"

Sparrow Parse pipeline, with GPU backend on Hugging Face:

./sparrow.sh '[{"instrument_name":"str", "valuation":0}]' --pipeline "sparrow-parse" --options huggingface --options katanaml/sparrow-qwen2-vl-7b --file-path "/data/bonds_table.png"
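
When the script is called from Python, passing the arguments as a list avoids shell quoting of the JSON query altogether. The sketch below assumes a local Sparrow checkout with sparrow.sh in the working directory; `build_command` and `run` are hypothetical helpers, and only the flags shown in the commands above are used.

```python
import json
import subprocess


def build_command(query, file_path, options, pipeline="sparrow-parse", script="./sparrow.sh"):
    """Assemble the argument list for sparrow.sh from a Python query template."""
    command = [script, json.dumps(query), "--pipeline", pipeline]
    for option in options:
        command += ["--options", option]  # each option gets its own --options flag
    command += ["--file-path", file_path]
    return command


def run(command):
    """Run the command without a shell, so the JSON query needs no escaping."""
    return subprocess.run(command, check=True, capture_output=True, text=True)


command = build_command(
    query=[{"instrument_name": "str", "valuation": 0}],
    file_path="/data/bonds_table.png",
    options=["mlx", "mlx-community/Qwen2-VL-72B-Instruct-4bit"],
)
print(command)
# result = run(command)  # uncomment where sparrow.sh is available
```

Building the query with json.dumps also means the template can be kept as an ordinary Python structure and reused to inspect the response.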

Conclusion

Sparrow simplifies document processing by providing a unified API that handles various document types and extraction methods. Its schema-based approach ensures a consistent output format regardless of the input document format or the extraction backend used. With Sparrow, you can streamline your document processing pipeline and reduce development overhead.

Link to Sparrow.
