What is Entity Extraction?

Entity extraction (also called Named Entity Recognition or NER) automatically identifies and classifies key information in unstructured text. For instance, a financial report contains company names, monetary figures, executives, dates, and locations that analysts rely on for competitive analysis and executive tracking.

Extracting these entities manually is time-consuming and error-prone. Automated entity extraction provides a faster and more reliable alternative.

In this course, you’ll learn three modern tools for entity extraction:

  1. spaCy: Production-ready NER with pre-trained models
  2. GLiNER: Zero-shot custom entity recognition
  3. langextract: AI-powered extraction with source grounding

Sample Document

Throughout this course, we’ll extract entities from this earnings report.

We chose this report because it’s dense with overlapping entity types, which is exactly what makes real-world extraction challenging:

  • Monetary amounts appear in different contexts: revenue ($81.4B), dividends ($0.24), and forecasted ranges ($89B-$93B)
  • Named entities overlap: “Apple Inc.” appears both as a company name and as a stock ticker (AAPL), and “SEC” is an abbreviation that needs context to identify
  • Temporal references mix formats: exact dates (June 30, 2023), quarters (Q4 2023), and relative time (year over year)

Why Not Use Regex?

Regular expressions define text patterns using special syntax to find matches in strings. While they may seem like a natural first choice for entity extraction, they require a separate pattern for each entity type and fail when formats vary.

Here’s what extracting financial amounts, dates, stock symbols, and quarters with regex looks like:

From the code above, several limitations become apparent:

  • Each entity type requires its own pattern, resulting in verbose boilerplate code that is difficult to read and maintain.
  • The patterns only match numeric quarter formats like “Q4 2023” and miss textual forms such as “third quarter” unless additional exact-match patterns are added.
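
The limitations above can be reproduced with a minimal sketch. The sample sentence and the four patterns below are assumptions written for illustration (they reuse figures quoted earlier, not the full earnings report), and `extract_with_regex` is a hypothetical helper name, so treat the result as indicative only.

```python
import re

# Hypothetical sample text assembled from figures quoted earlier in this course.
SAMPLE = (
    "Apple Inc. (AAPL) reported third quarter revenue of $81.4B "
    "on June 30, 2023 and declared a dividend of $0.24. "
    "Guidance for Q4 2023 is $89B-$93B."
)

MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)

# One hand-written pattern per entity type (illustrative, not exhaustive).
PATTERNS = {
    "money": r"\$\d+(?:\.\d+)?[BM]?",
    "date": rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}",
    "ticker": r"\(([A-Z]{1,5})\)",
    "quarter": r"Q[1-4] \d{4}",
}


def extract_with_regex(text):
    """Return every match for each pattern, keyed by entity type."""
    return {label: re.findall(pattern, text) for label, pattern in PATTERNS.items()}


results = extract_with_regex(SAMPLE)
for label, matches in results.items():
    print(f"{label}: {matches}")
```

In this sketch the quarter pattern finds “Q4 2023” but not “third quarter”, and the money pattern splits the range “$89B-$93B” into two unrelated amounts.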

Quiz

A document contains dates in formats like “January 15, 2024”, “15/01/2024”, and “2024-01-15”. What challenge does regex face here?

Production-Grade Named Entity Recognition

spaCy provides pre-trained models that automatically identify entities like PERSON, ORG, MONEY, DATE, and PERCENT from context. No pattern writing required.

Let’s install spaCy and download a small English model to get started:

pip install spacy
python -m spacy download en_core_web_sm

Extracting entities with spaCy takes just two steps:

  • Load the model
  • Process your text
💡 What the output shows
  • spaCy extracted three entity types (ORG, MONEY, PERSON) without any configuration
  • The model understood that “Apple Inc.” is a company, not just a fruit
  • It captured the complete monetary amount “$81.4 billion” including the unit
  • Person names are recognized even without titles like “CEO”
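
The two steps above can be sketched as follows. The helper name `extract_entities` is hypothetical; `spacy.load`, `doc.ents`, `ent.text`, and `ent.label_` are part of spaCy's documented API. The entities returned depend on the model version, so no output is shown here.

```python
def extract_entities(text, nlp=None):
    """Return (text, label) pairs for every entity spaCy finds in `text`."""
    if nlp is None:
        # Imported lazily so the helper can be defined before spaCy is installed.
        import spacy

        nlp = spacy.load("en_core_web_sm")  # Step 1: load the model
    doc = nlp(text)  # Step 2: process your text
    return [(ent.text, ent.label_) for ent in doc.ents]
```

Calling `extract_entities("Apple Inc. CEO Tim Cook reported revenue of $81.4 billion.")` requires the model downloaded above. Passing an already loaded pipeline as `nlp` avoids reloading the model for every document.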

How spaCy NER Works

spaCy labels each token individually using its BILUO tagging scheme, then groups consecutive entity tokens into spans:

"Apple"  "Inc."  "CEO"  "Tim"  "Cook"  "$81.4"  "billion"
   │        │      │      │       │       │         │
   ▼        ▼      ▼      ▼       ▼       ▼         ▼
 B-ORG   L-ORG    O    B-PER   L-PER  B-MONEY   L-MONEY
   └───┬───┘             └──┬──┘        └────┬────┘
       ▼                    ▼                ▼
 "Apple Inc." → ORG   "Tim Cook" → PERSON   "$81.4 billion" → MONEY
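  • Begin / Inside / Last mark the first, middle, and final tokens of a multi-token entity (a two-token entity such as “Tim Cook” needs only B and L)
  • Unit marks single-token entities (e.g., “London” → U-GPE, since spaCy’s English models label cities as GPE)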
  • O means outside any entity

The model learns these tagging patterns from thousands of labeled examples during training.
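
The grouping step in the diagram can be sketched in plain Python. This is a simplified illustration of how BILUO tags map to spans, not spaCy's internal code: `biluo_to_spans` is a hypothetical helper, tokens are joined with single spaces, and the tags use the full label PERSON where the diagram abbreviates it to PER.

```python
def biluo_to_spans(tokens, tags):
    """Group BILUO-tagged tokens into (entity_text, label) spans."""
    spans = []
    current = []  # tokens of the entity currently being built
    for token, tag in zip(tokens, tags):
        if tag == "O":  # outside any entity
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            spans.append((token, label))
        elif prefix == "B":  # first token of a multi-token entity
            current = [token]
        elif prefix == "I":  # middle token
            current.append(token)
        elif prefix == "L":  # last token closes the span
            current.append(token)
            spans.append((" ".join(current), label))
            current = []
    return spans


tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "$81.4", "billion"]
tags = ["B-ORG", "L-ORG", "O", "B-PERSON", "L-PERSON", "B-MONEY", "L-MONEY"]
spans = biluo_to_spans(tokens, tags)
print(spans)
# [('Apple Inc.', 'ORG'), ('Tim Cook', 'PERSON'), ('$81.4 billion', 'MONEY')]
```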

Quiz

How does spaCy determine that “Apple Inc.” is an ORG entity?

Exercise: Build a Contact List

Extracting from Business Documents

Exercise: Export Contact List

Visualizing Entities with displaCy

Zero-Shot Custom Entity Extraction

GLiNER solves spaCy’s limitation of fixed entity types. Instead of being locked into categories like ORG or GPE, GLiNER lets you define custom types using natural language descriptions.

pip install gliner

GLiNER offers several pretrained models. We’ll use gliner_small-v2.1 with threshold=0.3, which keeps only entities whose confidence score is at least 0.3:

💡 What the output shows
  • GLiNER recognized custom entity types without any training
  • Confidence scores vary: “Tim Cook” (0.563) scores highest, likely because personal names are distinctive, while “$81.4 billion” (0.310) barely clears the threshold, possibly because “Currency” is a less common label
📝 Other model options

For higher accuracy, try gliner_medium-v2.1. For multilingual support, use gliner_multi-v2.1.
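
A minimal sketch of the GLiNER call described in this section, under a few assumptions: `extract_custom` is a hypothetical helper name, and the Hugging Face model ID `urchade/gliner_small-v2.1` is assumed to be where the gliner_small-v2.1 weights are published. `GLiNER.from_pretrained` and `predict_entities` are the library's own methods. Scores vary with model version, so none are shown.

```python
def extract_custom(text, labels, model=None, threshold=0.3):
    """Return (text, label, score) for each entity at or above `threshold`."""
    if model is None:
        # Imported lazily so the helper can be defined before gliner is installed.
        from gliner import GLiNER

        # Assumed Hugging Face ID for the gliner_small-v2.1 model.
        model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
    entities = model.predict_entities(text, labels, threshold=threshold)
    return [(e["text"], e["label"], round(e["score"], 3)) for e in entities]
```

For example, `extract_custom(report_text, ["Company", "Person", "Currency"])` would label spans with those three custom types, where `report_text` stands in for the earnings report. Swapping in gliner_medium-v2.1 or gliner_multi-v2.1 only changes the model name.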

How GLiNER Works

Instead of tagging individual tokens, GLiNER scores entire spans against every label you provide. The highest-scoring label wins, and spans below your threshold are filtered out:

┌──────────────┬───────────┬──────────────────┐
│  Span        │  Label    │  Confidence      │
├──────────────┼───────────┼──────────────────┤
│ Apple Inc    │ Company   │ ████░░░░░░░ 0.36 │ ✓ above 0.3
│ Apple Inc    │ Person    │ █░░░░░░░░░░ 0.05 │ ✗
├──────────────┼───────────┼──────────────────┤
│ Tim Cook     │ Company   │ █░░░░░░░░░░ 0.04 │ ✗
│ Tim Cook     │ Person    │ ██████░░░░░ 0.56 │ ✓ above 0.3
├──────────────┼───────────┼──────────────────┤
│ $81.4 billion│ Company   │ ░░░░░░░░░░░ 0.01 │ ✗
│ $81.4 billion│ Currency  │ ███░░░░░░░░ 0.31 │ ✓ above 0.3
└──────────────┴───────────┴──────────────────┘
                            threshold = 0.3 ▲

This gives you two controls spaCy doesn’t: custom labels (any text, not a fixed set) and a confidence threshold to filter results.
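
The selection rule in the table can be sketched without the model. The scores below are copied from the table; `pick_labels` is a hypothetical helper, and treating a score exactly equal to the threshold as a pass is an assumption of this sketch rather than documented GLiNER behavior.

```python
# (span, label, score) triples copied from the table above.
SCORES = [
    ("Apple Inc", "Company", 0.36),
    ("Apple Inc", "Person", 0.05),
    ("Tim Cook", "Company", 0.04),
    ("Tim Cook", "Person", 0.56),
    ("$81.4 billion", "Company", 0.01),
    ("$81.4 billion", "Currency", 0.31),
]


def pick_labels(scores, threshold=0.3):
    """Keep the best-scoring label per span, then drop spans below the threshold."""
    best = {}
    for span, label, score in scores:
        if span not in best or score > best[span][1]:
            best[span] = (label, score)
    return {span: hit for span, hit in best.items() if hit[1] >= threshold}


selected = pick_labels(SCORES)
print(selected)
# {'Apple Inc': ('Company', 0.36), 'Tim Cook': ('Person', 0.56),
#  '$81.4 billion': ('Currency', 0.31)}
```

Raising the threshold trades recall for precision: `pick_labels(SCORES, threshold=0.5)` keeps only “Tim Cook”.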

Quiz

How does GLiNER decide which label to assign to a text span?

Extracting Business Entities

Exercise: Parse Business Metrics

Using Confidence Scores for Quality Control

Exercise: Route Low-Confidence to Review

AI-Powered Extraction with Source Grounding

langextract uses large language models (Gemini, GPT) to understand entity relationships and provide source attribution.

It captures semantic context like “AI startup WaveOne” (category + name) and “between $89 billion and $93 billion” (revenue ranges) as complete phrases rather than separate pieces.

Let’s install langextract along with its dependencies to try it out:

pip install langextract python-dotenv google-genai

To authenticate, add your API key to a .env file. This course uses Gemini (get a key from AI Studio), but OpenAI models also work:

# .env file
LANGEXTRACT_API_KEY=your-api-key-here
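
Before making any API calls, it can help to confirm the key is visible to Python. The sketch below assumes the .env file has already been loaded into the process environment (for example with `load_dotenv()` from python-dotenv); `get_api_key` is a hypothetical helper, not part of langextract.

```python
import os


def get_api_key(env=None):
    """Return LANGEXTRACT_API_KEY, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("LANGEXTRACT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "LANGEXTRACT_API_KEY is not set; add it to your .env file or export it."
        )
    return key
```

Failing early with a descriptive message is usually easier to debug than an authentication error raised in the middle of an extraction run.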

langextract uses an LLM to extract entities. You provide examples that teach the model what to look for and how to format the output:

Example (you provide):
  ┌─────────────────────────────────────────────────────┐
  │ Text: "Microsoft Corp. CEO Satya Nadella reported   │
  │        Q2 2024 revenue of $65B"                     │
  │                                                     │
  │ Extractions:                                        │
  │   company    → "Microsoft Corp."                    │
  │   executive  → "CEO Satya Nadella"    ← role + name │
  │   quarter    → "Q2 2024"                            │
  │   financial  → "$65B"                               │
  └──────────────────────┬──────────────────────────────┘
                         │ teaches format
                         ▼
  New text: "Apple Inc... CEO Tim Cook... $81.4 billion"
                         │
                         ▼
  Output (model generates):
  ┌─────────────────────────────────────────────────────┐
  │   company    → "Apple Inc."                         │
  │   executive  → "CEO Tim Cook"         ← same format │
  │   executive  → "CFO Luca Maestri"     ← generalized │
  │   financial  → "undisclosed amount"   ← semantic    │
  └─────────────────────────────────────────────────────┘

The LLM generalizes from your examples. One example showing “CEO Satya Nadella” is enough for it to also extract “CFO Luca Maestri” and understand “undisclosed amount” as a financial figure, something spaCy and GLiNER would miss.

Few-Shot Learning with Examples

To use langextract, provide two components:

  • Prompt: A description listing entity types to extract (companies, executives, financial figures)
  • Examples: Sample text paired with labeled extractions showing expected output
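
The two components can be sketched as follows, reusing the Microsoft example from the diagram above. Several details are assumptions: the prompt wording, the helper name `run_extraction`, and the model ID `gemini-2.5-flash`. The calls `lx.data.ExampleData`, `lx.data.Extraction`, and `lx.extract` follow the langextract README, and the function expects the API key configured earlier.

```python
# Prompt: describes which entity types to extract (wording is illustrative).
PROMPT = (
    "Extract companies, executives, quarters, and financial figures. "
    "Use the exact wording from the text."
)

# Example: sample text paired with labeled extractions.
EXAMPLE_TEXT = "Microsoft Corp. CEO Satya Nadella reported Q2 2024 revenue of $65B"
EXAMPLE_LABELS = [
    ("company", "Microsoft Corp."),
    ("executive", "CEO Satya Nadella"),  # role + name
    ("quarter", "Q2 2024"),
    ("financial", "$65B"),
]


def run_extraction(text, model_id="gemini-2.5-flash"):
    """Run langextract on `text` and return (class, text) pairs."""
    # Imported lazily so the prompt and examples can be inspected without the library.
    import langextract as lx

    examples = [
        lx.data.ExampleData(
            text=EXAMPLE_TEXT,
            extractions=[
                lx.data.Extraction(extraction_class=cls, extraction_text=span)
                for cls, span in EXAMPLE_LABELS
            ],
        )
    ]
    result = lx.extract(
        text_or_documents=text,
        prompt_description=PROMPT,
        examples=examples,
        model_id=model_id,  # assumed model name; substitute the one you have access to
    )
    return [(e.extraction_class, e.extraction_text) for e in result.extractions]
```

Every example extraction is copied verbatim from the example text, which matters for source grounding: extracted phrases can only be traced back to the document when they appear in it exactly.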

Now extract entities from the earnings report:

💡 What the output shows
  • Role-linked executives (“CEO Tim Cook”) instead of just the name
  • Semantic understanding of “undisclosed amount” as a financial figure
  • Market reaction “up 2% year over year” captured with full context

Quiz

The example extracts “CEO Satya Nadella” as an executive. How does this affect the model’s output?


langextract extracted “undisclosed amount” as a financial figure. Why would spaCy and GLiNER likely miss this?

Exercise: Analyze Customer Feedback

Visualizing Extractions

When to Use Each Tool
