What is Entity Extraction?

Entity extraction (also called Named Entity Recognition or NER) automatically identifies and classifies key information from unstructured text. For instance, financial reports contain company names, monetary figures, executives, dates, and locations used for competitive analysis and executive tracking.

Extracting these entities manually is time-consuming and error-prone. Automated entity extraction provides a faster and more reliable alternative.

In this course, you’ll learn three modern tools for entity extraction:

  1. spaCy: Production-ready NER with pre-trained models
  2. GLiNER: Zero-shot custom entity recognition
  3. langextract: AI-powered extraction with source grounding

Sample Document

Throughout this course, we’ll extract entities from this earnings report.

We chose this report because it’s dense with overlapping entity types, which is exactly what makes real-world extraction challenging:

  • Monetary amounts appear in different contexts: revenue ($81.4B), dividends ($0.24), and forecasted ranges ($89B-$93B)
  • Named entities overlap: “Apple Inc.” and its stock ticker (AAPL) refer to the same company, and “SEC” is an abbreviation that needs context to identify
  • Temporal references mix formats: exact dates (June 30, 2023), quarters (Q4 2023), and relative time (year over year)

Why Not Use Regex?

Regular expressions define text patterns using special syntax to find matches in strings. While they may seem like a natural first choice for entity extraction, they require a separate pattern for each entity type and fail when formats vary.

Here’s what extracting financial amounts, dates, stock symbols, and quarters with regex looks like:

From the code above, several limitations become apparent:

  • Each entity type requires its own pattern, resulting in verbose boilerplate code that is difficult to read and maintain.
  • The patterns only match numeric quarter formats like “Q4 2023” and miss textual forms such as “third quarter” unless additional exact-match patterns are added.
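The second limitation can be reproduced in a few lines. This is a minimal sketch: the sample sentence and both patterns are illustrative assumptions, not the course's original code or report.

```python
import re

# Hypothetical sample sentence (not the course's earnings report).
text = "Revenue for Q4 2023 reached $81.4 billion, up from the third quarter."

# One hand-written pattern per entity type.
patterns = {
    "MONEY": r"\$\d+(?:\.\d+)?(?:\s(?:billion|million))?",
    "QUARTER": r"\bQ[1-4]\s\d{4}\b",
}

# Collect every match for every pattern.
matches = {label: re.findall(pattern, text) for label, pattern in patterns.items()}

for label, found in matches.items():
    print(label, found)
```

The QUARTER pattern finds “Q4 2023” but not “third quarter”; covering the textual form would require yet another pattern.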

Quiz

A document contains dates in formats like “January 15, 2024”, “15/01/2024”, and “2024-01-15”. What challenge does regex face here?

Production-Grade Named Entity Recognition

spaCy provides pre-trained models that automatically identify entities like PERSON, ORG, MONEY, DATE, and PERCENT from context. No pattern writing required.

Let’s install spaCy and download a small English model to get started:

pip install spacy
python -m spacy download en_core_web_sm

Extracting entities with spaCy takes just two steps:

  • Load the model
  • Process your text
💡 What the output shows
  • spaCy extracted three entity types (ORG, MONEY, PERSON) without any configuration
  • The model understood that “Apple Inc.” is a company, not just a fruit
  • It captured the complete monetary amount “$81.4 billion” including the unit
  • Person names are recognized even without titles like “CEO”
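The two steps can be sketched as follows. The function name extract_entities is made up for this example, and the import sits inside the function so the sketch only needs spaCy and the en_core_web_sm model at call time.

```python
def extract_entities(text, model_name="en_core_web_sm"):
    """Return (text, label) pairs for every entity spaCy finds in `text`."""
    import spacy  # requires `pip install spacy` plus the downloaded model

    nlp = spacy.load(model_name)  # Step 1: load the model
    doc = nlp(text)               # Step 2: process your text
    return [(ent.text, ent.label_) for ent in doc.ents]
```

In real code you would typically load the model once and reuse nlp across documents, since loading is the slow part. The exact entities returned depend on the model version.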

How spaCy NER Works

spaCy labels each token individually using its BILUO tagging scheme, then groups consecutive entity tokens into spans:

  "Apple"    "Inc."      "CEO"      "Tim"     "Cook"     "$81.4"   "billion"
     │          │          │          │          │          │          │
     ▼          ▼          ▼          ▼          ▼          ▼          ▼
   B-ORG      L-ORG        O      B-PERSON   L-PERSON    B-MONEY    L-MONEY
     └────┬─────┘                     └────┬─────┘          └────┬─────┘
          ▼                                ▼                     ▼
 "Apple Inc." → ORG               "Tim Cook" → PERSON   "$81.4 billion" → MONEY
  • Begin / Inside / Last mark multi-token entities
  • Unit marks single-token entities (e.g., “London” → U-GPE)
  • O means outside any entity

The model learns these tagging patterns from thousands of labeled examples during training.
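To make the grouping step concrete, here is a minimal sketch that turns BILUO tags into entity spans. It is a plain-Python illustration, not spaCy's internal implementation. The function name biluo_to_spans is made up for this example, and the tags use spaCy's PERSON label.

```python
def biluo_to_spans(tokens, tags):
    """Group BILUO-tagged tokens into (text, label) entity spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":  # outside any entity: reset any open span
            current, label = [], None
            continue
        prefix, _, ent_type = tag.partition("-")
        if prefix == "U":  # single-token entity
            spans.append((token, ent_type))
        elif prefix == "B":  # start a new multi-token entity
            current, label = [token], ent_type
        elif prefix in ("I", "L") and label == ent_type:
            current.append(token)
            if prefix == "L":  # last token closes the span
                spans.append((" ".join(current), label))
                current, label = [], None
    return spans


tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "$81.4", "billion"]
tags = ["B-ORG", "L-ORG", "O", "B-PERSON", "L-PERSON", "B-MONEY", "L-MONEY"]
print(biluo_to_spans(tokens, tags))
```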

Quiz

How does spaCy determine that “Apple Inc.” is an ORG entity?

Exercise: Build a Contact List

Scenario

The sales team wants to build a contact database from meeting notes. They only need people’s names, not dates or other information.

Task

Extract only PERSON entities into a list.

💡 Hint

Use ent.label_ to check an entity’s type.

Extracting from Business Documents

First, create a helper function that extracts entities and groups them by type:
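A minimal sketch of such a helper, assuming only that the input behaves like a spaCy Doc (an ents sequence whose items have text and label_). The name group_entities is illustrative and may differ from the course's original helper.

```python
from collections import defaultdict


def group_entities(doc):
    """Group entity texts by label, e.g. {"ORG": [...], "PERSON": [...]}."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)
```

With a loaded pipeline, group_entities(nlp(report)) would return one list of entity strings per label, where report is whatever text you pass in.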

Here’s an earnings report with companies, executives, financial figures, and dates:

Extract and display all entities found:

💡 What the output shows
  • spaCy extracted 20+ entities across 6 different types from a single document
  • It recognized textual dates like “third quarter” and “the quarter ending June 30, 2023”
  • All five monetary values were captured with their full amounts
  • However, some domain terms are misclassified: “iPhone” as ORG and “AI” as GPE (geopolitical entity, such as a country or city)

Quiz

Why did spaCy classify “iPhone” as ORG instead of a product?

Exercise: Export Contact List

Scenario

HR needs a spreadsheet of all people mentioned in meeting notes, with their mention positions for reference.

Task

Create a DataFrame with only PERSON entities, columns: name, position.

💡 Hint

Filter by ent.label_ and use ent.start_char for position.

Visualizing Entities with displaCy

spaCy includes displaCy, a built-in visualizer that highlights entities directly in your text. This helps you quickly verify extraction results and debug misclassifications.

The displacy.render() function generates an HTML visualization with color-coded entity labels:

💡 What the output shows
  • Each entity type has a distinct color: teal for ORG, purple for PERSON, beige for MONEY, and mint for DATE
  • Labels appear inline next to each entity, making it easy to verify classifications at a glance

When documents contain many entities, you can filter to show only specific types using the options parameter:

💡 What the output shows

“Q4 2023” is no longer highlighted since DATE was excluded from the filter.
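The filtering can be sketched like this. The helper name render_entities and the default label tuple are assumptions for illustration; displacy.render, style="ent", and the "ents" option are part of spaCy's displaCy API.

```python
def render_entities(doc, labels=("ORG", "PERSON", "MONEY")):
    """Return displaCy HTML that highlights only the given entity labels."""
    from spacy import displacy  # requires spaCy to be installed

    # Entity types not listed under "ents" are left unhighlighted.
    options = {"ents": list(labels)}
    # jupyter=False returns the markup as a string instead of displaying it.
    return displacy.render(doc, style="ent", options=options, jupyter=False)
```

Leaving DATE out of labels is what keeps a span like “Q4 2023” unhighlighted.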

Zero-Shot Custom Entity Extraction

GLiNER solves spaCy’s limitation of fixed entity types. Instead of being locked into categories like ORG or GPE, GLiNER lets you define custom types using natural language descriptions.

pip install gliner

GLiNER offers several pretrained models. We’ll use gliner_small-v2.1 with threshold=0.3 to capture entities with at least 30% confidence:

💡 What the output shows
  • GLiNER recognized custom entity types without any training
  • Confidence scores vary: “Tim Cook” (0.563) scores highest, likely because personal names are distinctive, while “$81.4 billion” (0.310) scores lower, possibly because “Currency” is a less typical label for a monetary amount
📝 Other model options

For higher accuracy, try gliner_medium-v2.1. For multilingual support, use gliner_multi-v2.1.
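A minimal sketch of the extraction call, assuming the model is published on Hugging Face as urchade/gliner_small-v2.1 (the course only names gliner_small-v2.1). The helper name extract_custom_entities is made up for this example.

```python
def extract_custom_entities(text, labels, threshold=0.3):
    """Return (text, label, score) tuples for spans at or above the threshold."""
    from gliner import GLiNER  # requires `pip install gliner`

    # Assumed Hugging Face model ID; swap in the medium or multi variant as needed.
    model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
    entities = model.predict_entities(text, labels, threshold=threshold)
    return [(ent["text"], ent["label"], round(ent["score"], 3)) for ent in entities]
```

The labels argument is just a list of strings such as ["Company", "Person", "Currency"], which is what makes the extraction zero-shot.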

How GLiNER Works

Instead of tagging individual tokens, GLiNER scores entire spans against every label you provide. The highest-scoring label wins, and spans below your threshold are filtered out:

┌──────────────┬───────────┬──────────────────┐
│  Span        │  Label    │  Confidence      │
├──────────────┼───────────┼──────────────────┤
│ Apple Inc    │ Company   │ ████░░░░░░░ 0.36 │ ✓ above 0.3
│ Apple Inc    │ Person    │ █░░░░░░░░░░ 0.05 │ ✗
├──────────────┼───────────┼──────────────────┤
│ Tim Cook     │ Company   │ █░░░░░░░░░░ 0.04 │ ✗
│ Tim Cook     │ Person    │ ██████░░░░░ 0.56 │ ✓ above 0.3
├──────────────┼───────────┼──────────────────┤
│ $81.4 billion│ Company   │ ░░░░░░░░░░░ 0.01 │ ✗
│ $81.4 billion│ Currency  │ ███░░░░░░░░ 0.31 │ ✓ above 0.3
└──────────────┴───────────┴──────────────────┘
                            threshold = 0.3 ▲

This gives you two controls spaCy doesn’t: custom labels (any text, not a fixed set) and a confidence threshold to filter results.
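The selection logic in the table can be sketched in plain Python. This is a simplified illustration that uses the scores shown above, not GLiNER's actual decoding code; the function name pick_labels is hypothetical.

```python
# Scores copied from the table above: span -> {label: confidence}.
span_scores = {
    "Apple Inc": {"Company": 0.36, "Person": 0.05},
    "Tim Cook": {"Company": 0.04, "Person": 0.56},
    "$81.4 billion": {"Company": 0.01, "Currency": 0.31},
}


def pick_labels(span_scores, threshold=0.3):
    """Keep each span's best-scoring label if it reaches the threshold."""
    kept = []
    for span, scores in span_scores.items():
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= threshold:
            kept.append((span, best_label, scores[best_label]))
    return kept


for span, label, score in pick_labels(span_scores):
    print(f"{span!r} -> {label} ({score:.2f})")
```

Raising the threshold to 0.5 would keep only “Tim Cook”, which shows how the threshold trades recall for precision.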

Quiz

How does GLiNER decide which label to assign to a text span?

Extracting Business Entities

First, define entity types specific to financial documents:

We’ll use the same earnings report from the spaCy section:

Extract entities and group them by type:

💡 What the output shows
  • GLiNER found entities spaCy couldn’t: “WaveOne” as STARTUP, “third quarter” as QUARTER
  • “iPhone” is now PRODUCT instead of ORG
  • “Cupertino headquarters” was captured as a complete LOCATION phrase

Quiz

“Apple Inc.” scored 0.908 while “$0.24 per share” scored 0.302. What explains this gap?

Exercise: Parse Business Metrics

Scenario

The BI team needs to automatically extract KPIs from quarterly reports to populate dashboards. They want to capture metric names and time periods from business summaries.

Task

Extract metrics and time periods from a business report using custom labels.

💡 Hint

Use labels like "Metric" and "Time Period" to capture business KPIs and dates.

Using Confidence Scores for Quality Control

To implement quality control, categorize entities by confidence and flag low-scoring matches for manual review.

First, extract entities with a low threshold to capture all potential matches:

Sort entities into two groups based on a 0.5 confidence threshold:

  • High confidence: Entities scoring 0.5 or above
  • Needs review: Entities below 0.5 that require manual check
💡 What the output shows
  • “Apple” scores high (0.798) because it’s unambiguously a company in this context
  • “Java” scores low (0.366) because it could mean the programming language, coffee, or the Indonesian island
  • The model correctly flags ambiguous terms for human review
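The split itself needs no library support. Here is a minimal sketch that works on the dictionaries GLiNER returns; the two scores come from the notes above, while the labels and the function name split_by_confidence are assumptions for illustration.

```python
REVIEW_THRESHOLD = 0.5


def split_by_confidence(entities, threshold=REVIEW_THRESHOLD):
    """Split entities into (high_confidence, needs_review) lists of tuples."""
    high_confidence, needs_review = [], []
    for ent in entities:
        target = high_confidence if ent["score"] >= threshold else needs_review
        target.append((ent["text"], ent["label"], ent["score"]))
    return high_confidence, needs_review


# Scores from the output notes; the labels here are hypothetical.
entities = [
    {"text": "Apple", "label": "Company", "score": 0.798},
    {"text": "Java", "label": "Company", "score": 0.366},
]

high, review = split_by_confidence(entities)
print("High confidence:", high)
print("Needs review:", review)
```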

Quiz

What is the key advantage of GLiNER over spaCy?

Exercise: Route Low-Confidence to Review

Scenario

Your data pipeline extracts entities from customer emails. Ambiguous extractions need human review before updating the CRM.

Task

Create a needs_review list containing entities with score < 0.5, storing tuples of (text, label, score).

💡 Hint

Each entity has ent['text'], ent['label'], and ent['score'] keys.

AI-Powered Extraction with Source Grounding

langextract uses large language models (Gemini, GPT) to understand entity relationships and provide source attribution.

It captures semantic context like “AI startup WaveOne” (category + name) and “between $89 billion and $93 billion” (revenue ranges) as complete phrases rather than separate pieces.

Let’s install langextract along with its dependencies to try it out:

pip install langextract python-dotenv google-genai

To authenticate, add your API key to a .env file. This course uses Gemini (get a key from AI Studio), but OpenAI models also work:

# .env file
LANGEXTRACT_API_KEY=your-api-key-here

langextract uses an LLM to extract entities. You provide examples that teach the model what to look for and how to format the output:

Example (you provide):
  ┌─────────────────────────────────────────────────────┐
  │ Text: "Microsoft Corp. CEO Satya Nadella reported   │
  │        Q2 2024 revenue of $65B"                     │
  │                                                     │
  │ Extractions:                                        │
  │   company    → "Microsoft Corp."                    │
  │   executive  → "CEO Satya Nadella"    ← role + name │
  │   quarter    → "Q2 2024"                            │
  │   financial  → "$65B"                               │
  └──────────────────────┬──────────────────────────────┘
                         │ teaches format
                         ▼
  New text: "Apple Inc... CEO Tim Cook... $81.4 billion"
                         │
                         ▼
  Output (model generates):
  ┌─────────────────────────────────────────────────────┐
  │   company    → "Apple Inc."                         │
  │   executive  → "CEO Tim Cook"         ← same format │
  │   executive  → "CFO Luca Maestri"     ← generalized │
  │   financial  → "undisclosed amount"   ← semantic    │
  └─────────────────────────────────────────────────────┘

The LLM generalizes from your examples. One example showing “CEO Satya Nadella” is enough for it to also extract “CFO Luca Maestri” and understand “undisclosed amount” as a financial figure, something spaCy and GLiNER would miss.

Few-Shot Learning with Examples

To use langextract, provide two components:

  • Prompt: A description listing entity types to extract (companies, executives, financial figures)
  • Examples: Sample text paired with labeled extractions showing expected output

Now extract entities from the earnings report:

💡 What the output shows
  • Role-linked executives (“CEO Tim Cook”) instead of just the name
  • Semantic understanding of “undisclosed amount” as a financial figure
  • Market reaction “up 2% year over year” captured with full context
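Putting the prompt and the example together might look like the sketch below. It follows the langextract README as we understand it, so check the call signatures against your installed version. The prompt wording, the model ID gemini-2.5-flash, and the function name extract_with_examples are assumptions; the example text and its extractions come from the diagram earlier in this section.

```python
def extract_with_examples(text, model_id="gemini-2.5-flash"):
    """Run a few-shot extraction and return (class, text) pairs."""
    import langextract as lx  # requires langextract and LANGEXTRACT_API_KEY

    # Prompt: describes which entity types to extract (wording is illustrative).
    prompt = (
        "Extract companies, executives, quarters, and financial figures "
        "using the exact text from the source."
    )
    # Example: sample text paired with the labeled extractions we expect.
    examples = [
        lx.data.ExampleData(
            text="Microsoft Corp. CEO Satya Nadella reported Q2 2024 revenue of $65B",
            extractions=[
                lx.data.Extraction(extraction_class="company", extraction_text="Microsoft Corp."),
                lx.data.Extraction(extraction_class="executive", extraction_text="CEO Satya Nadella"),
                lx.data.Extraction(extraction_class="quarter", extraction_text="Q2 2024"),
                lx.data.Extraction(extraction_class="financial", extraction_text="$65B"),
            ],
        )
    ]
    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )
    return [(e.extraction_class, e.extraction_text) for e in result.extractions]
```

Because the example labels “CEO Satya Nadella” with the role included, the model is nudged to return executives in the same role-plus-name format.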

Quiz

The example extracts “CEO Satya Nadella” as an executive. How does this affect the model’s output?


langextract extracted “undisclosed amount” as a financial figure. Why would spaCy and GLiNER likely miss this?

Exercise: Analyze Customer Feedback

Scenario

The product team reviews app store feedback to prioritize fixes. They need to identify which feature users mention and whether the feedback is positive or negative.

Task

Complete the example by identifying what text to extract for each label. Paste your AI Studio key in the secure field below.

💡 Hint

Read the example: “Love the calendar sync, hate the notification sounds.” What words are features? What words express how the user feels?

Visualizing Extractions

langextract can generate an interactive HTML visualization where each entity is color-coded and hoverable. First, save the results to a JSONL file, then generate the visualization:

💡 What the output shows
  • Each entity type gets a distinct color in the visualization
  • Hovering over highlighted text shows the extraction class and any attributes
  • The full source text is displayed with all entities highlighted inline
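The save-then-visualize flow can be sketched as follows, based on the pattern shown in the langextract README. The file names and the helper name save_visualization are assumptions; result stands for the object returned by an earlier extraction call.

```python
def save_visualization(result, jsonl_name="extraction_results.jsonl",
                       html_name="visualization.html"):
    """Save extraction results as JSONL, then write the HTML visualization."""
    import langextract as lx  # requires `pip install langextract`

    # Step 1: save the annotated document to a JSONL file.
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")

    # Step 2: generate the interactive HTML from that file.
    html = lx.visualize(jsonl_name)
    # In notebooks, visualize may return a display object with a .data attribute.
    html_text = html.data if hasattr(html, "data") else html

    with open(html_name, "w", encoding="utf-8") as f:
        f.write(html_text)
    return html_name
```

Opening the resulting HTML file in a browser shows the highlighted source text.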

Quiz

What does langextract use under the hood to extract entities?

When to Use Each Tool

Now that you’ve seen all three tools in action, here’s how they compare across key dimensions to help you decide which fits your workflow:

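┌───────────────────────┬────────────────┬────────────────┬──────────────┐
│ Feature               │ spaCy          │ GLiNER         │ langextract  │
├───────────────────────┼────────────────┼────────────────┼──────────────┤
│ Setup                 │ Model download │ Model download │ API key      │
│ Speed                 │ Fast           │ Moderate       │ Slower (API) │
│ Cost                  │ Free           │ Free           │ Per-request  │
│ Privacy               │ Local          │ Local          │ Cloud API    │
│ Custom Types          │ Limited        │ Zero-shot      │ Few-shot     │
│ Context Understanding │ Basic          │ Good           │ Best         │
└───────────────────────┴────────────────┴────────────────┴──────────────┘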

Here’s when to reach for each tool:

  • Start with spaCy if your entities fit standard types (PERSON, ORG, MONEY). It’s fast, free, and runs locally.
  • Move to GLiNER when you need custom entity types. It adds zero-shot flexibility while still running locally.
  • Use langextract when you need the deepest context understanding. It captures relationships and nuance that local models miss, at the cost of API calls.

Course Complete!

Nice work finishing this course.

Work with Khuyen Tran