Sparrow: Document Processing Made Simple

The Problem with Traditional Approaches

Extracting structured information from a document often involves juggling multiple specialized libraries and APIs, including:

  • Tesseract for Optical Character Recognition (OCR)
  • OpenCV for image processing
  • Transformers for entity extraction

This complexity can lead to significant development overhead, making it difficult to build reliable and efficient document processing pipelines.

# Traditional approach mixing multiple libraries
import re

import cv2
import pytesseract
from transformers import pipeline

def extract_invoice_data(image_path):
    # OCR with Tesseract (cv2.imread returns None if the file cannot be read)
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    text = pytesseract.image_to_string(image)

    # Use regex to find invoice number
    match = re.search(r'Invoice #: (\d+)', text)
    invoice_num = match.group(1) if match else None

    # Use transformer for entity extraction
    nlp = pipeline("ner")
    entities = nlp(text)

    # Complex post-processing logic
    # ... more code to structure the data
    return {"invoice_number": invoice_num, "entities": entities}

Introducing Sparrow

Sparrow is a unified API that handles various document types and extraction methods, making it easy to switch between different backends (local or cloud) and extraction models while maintaining a consistent output structure.

Examples

Bank Statement

Input document:

Query:

*

Output:

{
  "bank": "First Platypus Bank",
  "address": "1234 Kings St., New York, NY 12123",
  "account_holder": "Mary G. Orta",
  "account_number": "1234567890123",
  "statement_date": "3/1/2022",
  "period_covered": "2/1/2022 - 3/1/2022",
  "account_summary": {
    "balance_on_march_1": "$25,032.23",
    "total_money_in": "$10,234.23",
    "total_money_out": "$10,532.51"
  },
  "transactions": [
    {
      "date": "02/01",
      "description": "PGD EasyPay Debit",
      "withdrawal": "203.24",
      "deposit": "",
      "balance": "22,098.23"
    },
    {
      "date": "02/02",
      "description": "AB&B Online Payment*****",
      "withdrawal": "71.23",
      "deposit": "",
      "balance": "22,027.00"
    },
    {
      "date": "02/04",
      "description": "Check No. 2345",
      "withdrawal": "",
      "deposit": "450.00",
      "balance": "22,477.00"
    },
    {
      "date": "02/05",
      "description": "Payroll Direct Dep 23422342 Giants",
      "withdrawal": "",
      "deposit": "2,534.65",
      "balance": "25,011.65"
    },
    {
      "date": "02/06",
      "description": "Signature POS Debit - TJP",
      "withdrawal": "84.50",
      "deposit": "",
      "balance": "24,927.15"
    },
    {
      "date": "02/07",
      "description": "Check No. 234",
      "withdrawal": "1,400.00",
      "deposit": "",
      "balance": "23,527.15"
    },
    {
      "date": "02/08",
      "description": "Check No. 342",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    },
    {
      "date": "02/09",
      "description": "FPB AutoPay***** Credit Card",
      "withdrawal": "456.02",
      "deposit": "",
      "balance": "23,096.13"
    },
    {
      "date": "02/08",
      "description": "Check No. 123",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    },
    {
      "date": "02/09",
      "description": "FPB AutoPay***** Credit Card",
      "withdrawal": "156.02",
      "deposit": "",
      "balance": "23,096.13"
    },
    {
      "date": "02/08",
      "description": "Cash Deposit",
      "withdrawal": "",
      "deposit": "25.00",
      "balance": "23,552.15"
    }
  ],
  "valid": "true"
}

Bonds Table

Input document:

Query:

[{"instrument_name":"str", "valuation":0}]

Output:

{
  "data": [
    {
      "instrument_name": "UNITS BLACKROCK FIX INC DUB FDS PLC ISHS EUR INV GRD CP BD IDX/INST/E",
      "valuation": 19049
    },
    {
      "instrument_name": "UNITS ISHARES III PLC CORE EUR GOVT BOND UCITS ETF/EUR",
      "valuation": 83488
    },
    {
      "instrument_name": "UNITS ISHARES III PLC EUR CORP BOND 1-5YR UCITS ETF/EUR",
      "valuation": 213030
    },
    {
      "instrument_name": "UNIT ISHARES VI PLC/JP MORGAN USD E BOND EUR HED UCITS ETF DIST/HDGD/",
      "valuation": 32774
    },
    {
      "instrument_name": "UNITS XTRACKERS II SICAV/EUR HY CORP BOND UCITS ETF/-1D-/DISTR.",
      "valuation": 23643
    }
  ],
  "valid": "true"
}

Lab Results

Input document:

Query:

{"patient_name": "str", "patient_age": "str", "patient_pid": 0, "lab_results": [{"investigation": "str", "result": 0.00, "reference_value": "str", "unit": "str"}]}

Output:

{
  "patient_name": "Yash M. Patel",
  "patient_age": "21 Years",
  "patient_pid": 555,
  "lab_results": [
    {
      "investigation": "Hemoglobin (Hb)",
      "result": 12.5,
      "reference_value": "13.0 - 17.0",
      "unit": "g/dL"
    },
    {
      "investigation": "RBC COUNT",
      "result": 5.2,
      "reference_value": "4.5 - 5.5",
      "unit": "mill/cumm"
    },
    {
      "investigation": "Packed Cell Volume (PCV)",
      "result": 57.5,
      "reference_value": "40 - 50",
      "unit": "%"
    },
    {
      "investigation": "Mean Corpuscular Volume (MCV)",
      "result": 87.75,
      "reference_value": "83 - 101",
      "unit": "fL"
    },
    {
      "investigation": "MCH",
      "result": 27.2,
      "reference_value": "27 - 32",
      "unit": "pg"
    },
    {
      "investigation": "MCHC",
      "result": 32.8,
      "reference_value": "32.5 - 34.5",
      "unit": "g/dL"
    },
    {
      "investigation": "RDW",
      "result": 13.6,
      "reference_value": "11.6 - 14.0",
      "unit": "%"
    },
    {
      "investigation": "WBC COUNT",
      "result": 9000,
      "reference_value": "4000-11000",
      "unit": "cumm"
    },
    {
      "investigation": "Neutrophils",
      "result": 60,
      "reference_value": "50 - 62",
      "unit": "%"
    },
    {
      "investigation": "Lymphocytes",
      "result": 31,
      "reference_value": "20 - 40",
      "unit": "%"
    },
    {
      "investigation": "Eosinophils",
      "result": 1,
      "reference_value": "00 - 06",
      "unit": "%"
    },
    {
      "investigation": "Monocytes",
      "result": 7,
      "reference_value": "00 - 10",
      "unit": "%"
    },
    {
      "investigation": "Basophils",
      "result": 1,
      "reference_value": "00 - 02",
      "unit": "%"
    },
    {
      "investigation": "Absolute Neutrophils",
      "result": 6000,
      "reference_value": "1500 - 7500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Lymphocytes",
      "result": 3100,
      "reference_value": "1300 - 3500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Eosinophils",
      "result": 100,
      "reference_value": "00 - 500",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Monocytes",
      "result": 700,
      "reference_value": "200 - 950",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Absolute Basophils",
      "result": 100,
      "reference_value": "00 - 300",
      "unit": "cells/mcL"
    },
    {
      "investigation": "Platelet Count",
      "result": 320000,
      "reference_value": "150000 - 410000",
      "unit": "cumm"
    }
  ],
  "valid": "true"
}
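In the examples above, the query works as a template: each field name is paired with a placeholder ("str" for text, 0 or 0.00 for numbers), and the output follows the same shape plus a "valid" flag. The sketch below shows one way to check a response against such a template on the client side. It is not Sparrow's own validation logic; the function name matches_template and the rule that extra keys are allowed are assumptions made for illustration.

```python
import json


def matches_template(template, value):
    """Return True if `value` has the shape described by `template`.

    Hypothetical helper, not part of Sparrow.
    """
    if isinstance(template, dict):
        # Every templated key must be present; extra keys such as "valid" are allowed.
        return isinstance(value, dict) and all(
            key in value and matches_template(sub, value[key])
            for key, sub in template.items()
        )
    if isinstance(template, list):
        if not isinstance(value, list):
            return False
        if not template:
            return True
        # The first list element describes the shape of every item.
        return all(matches_template(template[0], item) for item in value)
    if isinstance(template, str):
        return isinstance(value, str)
    if isinstance(template, (int, float)):
        # Numeric placeholder (0 or 0.00): accept int or float, but not bool.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


# A shortened version of the lab results query and output shown above.
query = json.loads(
    '{"patient_name": "str", "patient_pid": 0, '
    '"lab_results": [{"investigation": "str", "result": 0.00, "unit": "str"}]}'
)
output = {
    "patient_name": "Yash M. Patel",
    "patient_pid": 555,
    "lab_results": [
        {"investigation": "Hemoglobin (Hb)", "result": 12.5,
         "reference_value": "13.0 - 17.0", "unit": "g/dL"},
        {"investigation": "WBC COUNT", "result": 9000,
         "reference_value": "4000-11000", "unit": "cumm"},
    ],
    "valid": "true",
}

print(matches_template(query, output))  # True
```

For a list query such as the bonds table, the rows are returned under the "data" key, so the list to check would be output["data"].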

How to Use Sparrow

To use Sparrow, pass a JSON query string that lists the field names and types to extract, along with the input document and the pipeline that should process it.
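Writing the query string by hand is error-prone because the JSON itself contains double quotes. As a small sketch, the query can be built from a Python structure with json.dumps and quoted for the shell with shlex.quote, both from the standard library. The variable names here are illustrative only.

```python
import json
import shlex

# Field names with type placeholders, as in the bonds table example.
fields = [{"instrument_name": "str", "valuation": 0}]

# Serialize to the JSON query string that Sparrow expects as its first argument.
query = json.dumps(fields)

# Wrap the string so a POSIX shell passes it through as a single argument.
shell_safe_query = shlex.quote(query)

print(shell_safe_query)
```

The printed value is the query wrapped in single quotes, which can be pasted directly after ./sparrow.sh on the command line.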

Arguments

  • query: JSON query string argument with field names and types to fetch
  • file-path: input document (image or multipage PDF)
  • pipeline: Sparrow pipeline used to process query request
  • crop_size=N: crop N pixels from all borders of the input images
  • debug-dir: folder where processed images are stored
  • debug: if True, additional messages will be printed

Options

  • --options mlx: set MLX as backend for local inference
  • --options mlx-community/Qwen2-VL-72B-Instruct-4bit: name of a Vision LLM model supported by MLX
  • --options tables_only: process tables only
  • --options validation_off: disable response validation

Examples

Running locally with Apple MLX backend:

./sparrow.sh '[{"instrument_name":"str", "valuation":0}]' --pipeline "sparrow-parse" --options mlx --options mlx-community/Qwen2-VL-72B-Instruct-4bit --file-path "/data/bonds_table.png"

Sparrow Parse pipeline, with GPU backend on Hugging Face:

./sparrow.sh '[{"instrument_name":"str", "valuation":0}]' --pipeline "sparrow-parse" --options huggingface --options katanaml/sparrow-qwen2-vl-7b --file-path "/data/bonds_table.png"
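The same commands can be launched from Python. The sketch below assembles the argument list using only the flags shown above (--pipeline, --options, --file-path) and passes it to subprocess.run as a list, which sidesteps shell quoting entirely. The helper names build_sparrow_command and run_sparrow are hypothetical, and the sketch assumes sparrow.sh sits in the current working directory and writes its result to standard output.

```python
import json
import subprocess


def build_sparrow_command(query, file_path, pipeline="sparrow-parse", options=()):
    """Assemble the sparrow.sh argument list (hypothetical helper)."""
    command = ["./sparrow.sh", json.dumps(query), "--pipeline", pipeline]
    for option in options:
        # Each option is passed with its own --options flag, as in the examples.
        command += ["--options", option]
    command += ["--file-path", file_path]
    return command


def run_sparrow(command):
    """Run the command and return whatever it printed (assumed to be the result)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


command = build_sparrow_command(
    [{"instrument_name": "str", "valuation": 0}],
    "/data/bonds_table.png",
    options=["mlx", "mlx-community/Qwen2-VL-72B-Instruct-4bit"],
)
print(command)

# With Sparrow installed and the file in place, the call would be:
# result_text = run_sparrow(command)
```

Passing a list rather than a single string means the JSON query reaches the script as one argument without any escaping.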

Conclusion

Sparrow simplifies document processing by providing a unified API that handles various document types and extraction methods. Its schema-based approach ensures consistent output format regardless of the input document format or the extraction backend used. With Sparrow, you can streamline your document processing pipeline and reduce development overhead.

Link to Sparrow.

Work with Khuyen Tran