Entity Extraction with spaCy and LLMs

What is Entity Extraction?

Entity extraction (also called Named Entity Recognition or NER) automatically identifies and classifies key information in unstructured text. For instance, financial reports contain company names, monetary figures, executive names, dates, and locations that analysts use for competitive analysis and executive tracking.

Extracting these entities manually is time-consuming and error-prone. Automated entity extraction provides a faster and more reliable alternative.

In this course, you’ll learn three modern tools for entity extraction:

spaCy: Production-ready NER with pre-trained models
GLiNER: Zero-shot custom entity recognition
langextract: AI-powered extraction with source grounding

Sample Document

Throughout this course, we’ll extract entities from this earnings report.

earning_report = """
Apple Inc. (NASDAQ: AAPL) reported third quarter revenue of $81.4 billion,
up 2% year over year. CEO Tim Cook stated that Services revenue reached
a new all-time high of $21.2 billion. The company's board of directors
declared a cash dividend of $0.24 per share.

CFO Luca Maestri mentioned that iPhone revenue was $39.3 billion for
the quarter ending June 30, 2023. The company expects total revenue
between $89 billion and $93 billion for the fourth quarter.

Apple's Cupertino headquarters announced the acquisition of AI startup
WaveOne for an undisclosed amount. The deal is expected to close in
Q4 2023, pending regulatory approval from the SEC.
"""

print("Earnings report loaded!")
print(f"Document length: {len(earning_report)} characters")

We chose this report because it’s dense with overlapping entity types, which is exactly what makes real-world extraction challenging:

Monetary amounts appear in different contexts: revenue ($81.4B), dividends ($0.24), and forecasted ranges ($89B-$93B)
Named entities overlap: “Apple Inc.” appears alongside its stock ticker (AAPL), so one company has two surface forms, and “SEC” is an abbreviation that needs context to resolve
Temporal references mix formats: exact dates (June 30, 2023), quarters (Q4 2023), and relative time (year over year)

Why Not Use Regex?

Regular expressions define text patterns using special syntax to find matches in strings. While they may seem like a natural first choice for entity extraction, they require a separate pattern for each entity type and fail when formats vary.

Here’s what extracting financial amounts, dates, stock symbols, and quarters with regex looks like:

import re

earning_report = """
Apple Inc. (NASDAQ: AAPL) reported third quarter revenue of $81.4 billion,
up 2% year over year. CEO Tim Cook stated that Services revenue reached
a new all-time high of $21.2 billion. CFO Luca Maestri mentioned that
iPhone revenue was $39.3 billion for the quarter ending June 30, 2023.
"""

# Each entity type needs a separate complex pattern
financial_pattern = r"\$(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.[0-9]+)?(?:\s*(?:billion|million|trillion))?"
date_pattern = r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},\s+\d{4}"
stock_pattern = r"\b(?:NASDAQ|NYSE|NYSEARCA):\s*[A-Z]{2,5}\b"
quarter_pattern = r"\b(Q[1-4]\s+\d{4})\b"

print("Financial amounts:", re.findall(financial_pattern, earning_report, re.IGNORECASE))
print("Dates:", re.findall(date_pattern, earning_report))
print("Stock symbols:", re.findall(stock_pattern, earning_report))
print("Quarters:", re.findall(quarter_pattern, earning_report))

From the code above, several limitations become apparent:

Each entity type requires its own pattern, resulting in verbose boilerplate code that is difficult to read and maintain.
The patterns only match numeric quarter formats like “Q4 2023” and miss textual forms such as “third quarter” unless additional exact-match patterns are added.
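
To make the second limitation concrete, here is a minimal sketch that runs the lesson's quarter_pattern on two sentences adapted from the report. The textual_quarter_pattern is a hypothetical addition (it is not part of the lesson code) showing the kind of extra exact-match pattern you would have to write:

```python
import re

# Pattern copied from the lesson: matches numeric quarters such as "Q4 2023".
quarter_pattern = r"\b(Q[1-4]\s+\d{4})\b"

# Hypothetical extra pattern for spelled-out quarters (not in the lesson code).
textual_quarter_pattern = r"\b(?:first|second|third|fourth)\s+quarter\b"

sample = (
    "Apple reported third quarter revenue of $81.4 billion. "
    "The deal is expected to close in Q4 2023."
)

# The numeric pattern finds only the "Q4 2023" form.
numeric_hits = re.findall(quarter_pattern, sample)

# The spelled-out form is found only once a second pattern is added.
textual_hits = re.findall(textual_quarter_pattern, sample, re.IGNORECASE)

print("Numeric quarters:", numeric_hits)
print("Textual quarters:", textual_hits)
```

Each new phrasing of a quarter would need yet another hand-written pattern like this one.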

Quiz

A document contains dates in formats like “January 15, 2024”, “15/01/2024”, and “2024-01-15”. What challenge does regex face here?

Production-Grade Named Entity Recognition

spaCy provides pre-trained models that automatically identify entities like PERSON, ORG, MONEY, DATE, and PERCENT from context. No pattern writing required.

Let’s install spaCy and download a small English model to get started:

pip install spacy
python -m spacy download en_core_web_sm

Extracting entities with spaCy takes just two steps:

Load the model
Process your text

💡 What the output shows

spaCy extracted three entity types (ORG, MONEY, PERSON) without any configuration
The model understood that “Apple Inc.” is a company, not just a fruit
It captured the complete monetary amount “$81.4 billion” including the unit
Person names are recognized even without titles like “CEO”

How spaCy NER Works

spaCy labels each token individually using its BILUO tagging scheme, then groups consecutive entity tokens into spans:

"Apple"  "Inc."  "CEO"  "Tim"  "Cook"  "$81.4"  "billion"
   │        │      │      │       │       │         │
   ▼        ▼      ▼      ▼       ▼       ▼         ▼
 B-ORG   L-ORG    O    B-PER   L-PER  B-MONEY   L-MONEY
   └───┬───┘             └──┬──┘        └────┬────┘
       ▼                    ▼                ▼
 "Apple Inc." → ORG   "Tim Cook" → PERSON   "$81.4 billion" → MONEY

Begin / Inside / Last mark multi-token entities
Unit marks single-token entities (e.g., “London” → U-GPE)
O means outside any entity

The model learns these tagging patterns from thousands of labeled examples during training.
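
The grouping step in the diagram can be sketched in plain Python. This illustrates BILUO decoding only and is not spaCy's internal code: biluo_to_spans is a hypothetical helper, the tags are copied from the diagram (which abbreviates PERSON as PER), and tokens are joined with spaces for simplicity:

```python
def biluo_to_spans(tokens, tags):
    """Group BILUO-tagged tokens into (text, label) entity spans."""
    spans = []
    current = []  # tokens of the multi-token entity being built
    for token, tag in zip(tokens, tags):
        if tag == "O":  # outside any entity
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            spans.append((token, label))
        elif prefix == "B":  # first token of a multi-token entity
            current = [token]
        elif prefix == "I":  # middle token
            current.append(token)
        elif prefix == "L":  # last token: close the span
            current.append(token)
            spans.append((" ".join(current), label))
            current = []
    return spans


# Tokens and tags taken from the diagram above.
tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "$81.4", "billion"]
tags = ["B-ORG", "L-ORG", "O", "B-PER", "L-PER", "B-MONEY", "L-MONEY"]

spans = biluo_to_spans(tokens, tags)
print(spans)
```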

Quiz

How does spaCy determine that “Apple Inc.” is an ORG entity?

Exercise: Build a Contact List

Scenario

The sales team wants to build a contact database from meeting notes. They only need people’s names, not dates or other information.

Task

Extract only PERSON entities into a list.

💡 Hint

Use ent.label_ to check an entity’s type.

Extracting from Business Documents

First, create a helper function that extracts entities and groups them by type:
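
A minimal sketch of such a helper, assuming each entity exposes .text and .label_ the way spaCy's entity spans do. The name group_entities and the SimpleNamespace stand-ins are hypothetical and are used so the example runs without a model; with spaCy loaded you would pass nlp(text).ents instead:

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(ents):
    """Group entity texts by label, keeping the order in which they appear."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in entities (hypothetical data, not real model output).
ents = [
    SimpleNamespace(text="Apple Inc.", label_="ORG"),
    SimpleNamespace(text="Tim Cook", label_="PERSON"),
    SimpleNamespace(text="$81.4 billion", label_="MONEY"),
    SimpleNamespace(text="Luca Maestri", label_="PERSON"),
]

grouped = group_entities(ents)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```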

Here’s an earnings report with companies, executives, financial figures, and dates:

Extract and display all entities found:

💡 What the output shows

spaCy extracted 20+ entities across 6 different types from a single document
It recognized textual dates like “third quarter” and “the quarter ending June 30, 2023”
All five monetary values were captured with their full amounts
However, some domain terms are misclassified: “iPhone” as ORG and “AI” as GPE (location)

Quiz

Why did spaCy classify “iPhone” as ORG instead of a product?

Exercise: Export Contact List

Scenario

HR needs a spreadsheet of all people mentioned in meeting notes, with their mention positions for reference.

Task

Create a DataFrame with only PERSON entities, columns: name, position.

💡 Hint

Filter by ent.label_ and use ent.start_char for position.

Visualizing Entities with displaCy

spaCy includes displaCy, a built-in visualizer that highlights entities directly in your text. This helps you quickly verify extraction results and debug misclassifications.

The displacy.render() function generates an HTML visualization with color-coded entity labels:

💡 What the output shows

Each entity type has a distinct color: teal for ORG, purple for PERSON, beige for MONEY, and mint for DATE
Labels appear inline next to each entity, making it easy to verify classifications at a glance

When documents contain many entities, you can filter to show only specific types using the options parameter:

💡 What the output shows

“Q4 2023” is no longer highlighted since DATE was excluded from the filter.

Zero-Shot Custom Entity Extraction

GLiNER solves spaCy’s limitation of fixed entity types. Instead of being locked into categories like ORG or GPE, GLiNER lets you define custom types using natural language descriptions.

pip install gliner

GLiNER offers several pretrained models. We’ll use gliner_small-v2.1 with threshold=0.3 to capture entities with at least 30% confidence:

💡 What the output shows

GLiNER recognized custom entity types without any training
Confidence scores vary: “Tim Cook” (0.563) scores highest, likely because personal names are distinctive, while “$81.4 billion” (0.310) scores lower, possibly because “Currency” is a less common label

📝 Other model options

For higher accuracy, try gliner_medium-v2.1. For multilingual support, use gliner_multi-v2.1.

How GLiNER Works

Instead of tagging individual tokens, GLiNER scores entire spans against every label you provide. The highest-scoring label wins, and spans below your threshold are filtered out:

┌──────────────┬───────────┬──────────────────┐
│  Span        │  Label    │  Confidence      │
├──────────────┼───────────┼──────────────────┤
│ Apple Inc    │ Company   │ ████░░░░░░░ 0.36 │ ✓ above 0.3
│ Apple Inc    │ Person    │ █░░░░░░░░░░ 0.05 │ ✗
├──────────────┼───────────┼──────────────────┤
│ Tim Cook     │ Company   │ █░░░░░░░░░░ 0.04 │ ✗
│ Tim Cook     │ Person    │ ██████░░░░░ 0.56 │ ✓ above 0.3
├──────────────┼───────────┼──────────────────┤
│ $81.4 billion│ Company   │ ░░░░░░░░░░░ 0.01 │ ✗
│ $81.4 billion│ Currency  │ ███░░░░░░░░ 0.31 │ ✓ above 0.3
└──────────────┴───────────┴──────────────────┘
                            threshold = 0.3 ▲

This gives you two controls spaCy doesn’t: custom labels (any text, not a fixed set) and a confidence threshold to filter results.
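
The selection rule in the table can be sketched in a few lines. The scores below are copied from the table rather than produced by a model, and select_labels is a hypothetical helper that illustrates the rule, not GLiNER's implementation:

```python
# Span -> {label: score}, copied from the table above (not real model output).
span_scores = {
    "Apple Inc": {"Company": 0.36, "Person": 0.05},
    "Tim Cook": {"Company": 0.04, "Person": 0.56},
    "$81.4 billion": {"Company": 0.01, "Currency": 0.31},
}


def select_labels(span_scores, threshold):
    """Keep each span's best-scoring label if it meets the threshold."""
    results = []
    for span, label_scores in span_scores.items():
        # The highest-scoring label wins for this span.
        best_label, best_score = max(label_scores.items(), key=lambda kv: kv[1])
        # Spans whose best score falls below the threshold are dropped.
        if best_score >= threshold:
            results.append({"text": span, "label": best_label, "score": best_score})
    return results


lenient = select_labels(span_scores, threshold=0.3)
strict = select_labels(span_scores, threshold=0.5)

print("threshold=0.3:", [(e["text"], e["label"]) for e in lenient])
print("threshold=0.5:", [(e["text"], e["label"]) for e in strict])
```

Raising the threshold trades recall for precision: at 0.5 only the “Tim Cook” row in this table survives.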

Quiz

How does GLiNER decide which label to assign to a text span?

Extracting Business Entities

First, define entity types specific to financial documents:

We’ll use the same earnings report from the spaCy section:

Extract entities and group them by type:

💡 What the output shows

GLiNER assigned custom types that spaCy’s fixed label set lacks: “WaveOne” as STARTUP, “third quarter” as QUARTER
“iPhone” is now PRODUCT instead of ORG
“Cupertino headquarters” was captured as a complete LOCATION phrase

Quiz

“Apple Inc.” scored 0.908 while “$0.24 per share” scored 0.302. What explains this gap?

Exercise: Parse Business Metrics

Scenario

The BI team needs to automatically extract KPIs from quarterly reports to populate dashboards. They want to capture metric names and time periods from business summaries.

Task

Extract metrics and time periods from a business report using custom labels.

💡 Hint

Use labels like "Metric" and "Time Period" to capture business KPIs and dates.

Using Confidence Scores for Quality Control

To implement quality control, categorize entities by confidence and flag low-scoring matches for manual review.

First, extract entities with a low threshold to capture all potential matches:

Sort entities into two groups based on a 0.5 confidence threshold:

High confidence: Entities scoring 0.5 or above
Needs review: Entities below 0.5 that require manual check
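
A minimal sketch of this split, assuming each entity is a dict with text, label, and score keys. The two scores are taken from this lesson's sample results; the labels and the split_by_confidence name are hypothetical:

```python
def split_by_confidence(entities, cutoff=0.5):
    """Partition entities into (high_confidence, needs_review) by score."""
    high_confidence = [e for e in entities if e["score"] >= cutoff]
    needs_review = [e for e in entities if e["score"] < cutoff]
    return high_confidence, needs_review


# Hand-written stand-ins for model output (labels are hypothetical).
entities = [
    {"text": "Apple", "label": "Company", "score": 0.798},
    {"text": "Java", "label": "Company", "score": 0.366},
]

high_confidence, needs_review = split_by_confidence(entities)

print("High confidence:", [e["text"] for e in high_confidence])
print("Needs review:", [e["text"] for e in needs_review])
```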

💡 What the output shows

“Apple” scores high (0.798) because it’s unambiguously a company in this context
“Java” scores low (0.366), likely because it could mean the programming language, coffee, or the Indonesian island
The 0.5 cutoff routes ambiguous terms like this one to human review

Quiz

What is the key advantage of GLiNER over spaCy?

Exercise: Route Low-Confidence to Review

Scenario

Your data pipeline extracts entities from customer emails. Ambiguous extractions need human review before updating the CRM.

Task

Create a needs_review list containing entities with score < 0.5, storing tuples of (text, label, score).

💡 Hint

Each entity has ent['text'], ent['label'], and ent['score'] keys.

AI-Powered Extraction with Source Grounding

langextract uses large language models (Gemini, GPT) to understand entity relationships and to ground each extraction in the source text, so every result can be traced back to where it appears.

It captures semantic context like “AI startup WaveOne” (category + name) and “between $89 billion and $93 billion” (revenue ranges) as complete phrases rather than separate pieces.
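
Source grounding means each extraction can be traced to a character span in the original text. The idea can be illustrated with plain string search; this is a sketch of the concept, not how langextract aligns extractions internally, and locate is a hypothetical helper:

```python
def locate(source, extraction_text):
    """Return (start, end) character offsets of the extraction, or None."""
    start = source.find(extraction_text)
    if start == -1:
        return None  # the extraction cannot be grounded in the source
    return start, start + len(extraction_text)


source = (
    "Apple's Cupertino headquarters announced the acquisition of AI startup "
    "WaveOne for an undisclosed amount."
)

for phrase in ["AI startup WaveOne", "undisclosed amount", "Microsoft"]:
    span = locate(source, phrase)
    if span is None:
        print(f"{phrase!r}: not found in source")
    else:
        start, end = span
        print(f"{phrase!r}: chars {start}-{end} -> {source[start:end]!r}")
```

An extraction that cannot be located in the source can be a sign that the model paraphrased or invented the text, which is why grounded offsets are useful for review.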

Let’s install langextract along with its dependencies to try it out:

pip install langextract python-dotenv google-genai

To authenticate, add your API key to a .env file. This course uses Gemini (get a key from AI Studio), but OpenAI models also work:

# .env file
LANGEXTRACT_API_KEY=your-api-key-here

langextract uses an LLM to extract entities. You provide examples that teach the model what to look for and how to format the output:

Example (you provide):
  ┌─────────────────────────────────────────────────────┐
  │ Text: "Microsoft Corp. CEO Satya Nadella reported   │
  │        Q2 2024 revenue of $65B"                     │
  │                                                     │
  │ Extractions:                                        │
  │   company    → "Microsoft Corp."                    │
  │   executive  → "CEO Satya Nadella"    ← role + name │
  │   quarter    → "Q2 2024"                            │
  │   financial  → "$65B"                               │
  └──────────────────────┬──────────────────────────────┘
                         │ teaches format
                         ▼
  New text: "Apple Inc... CEO Tim Cook... $81.4 billion"
                         │
                         ▼
  Output (model generates):
  ┌─────────────────────────────────────────────────────┐
  │   company    → "Apple Inc."                         │
  │   executive  → "CEO Tim Cook"         ← same format │
  │   executive  → "CFO Luca Maestri"     ← generalized │
  │   financial  → "undisclosed amount"   ← semantic    │
  └─────────────────────────────────────────────────────┘

The LLM generalizes from your examples. One example showing “CEO Satya Nadella” is enough for it to also extract “CFO Luca Maestri” and to treat “undisclosed amount” as a financial figure, something spaCy and GLiNER would likely miss.

Few-Shot Learning with Examples

To use langextract, provide two components:

Prompt: A description listing entity types to extract (companies, executives, financial figures)
Examples: Sample text paired with labeled extractions showing expected output

import os
from dotenv import load_dotenv
import langextract as lx
from langextract import extract

load_dotenv()

def extract_financial_entities(text):
    """Extract entities using langextract."""
    prompt_description = """Extract business entities: companies, executives,
    financial figures, quarters, locations, products, startups,
    regulatory bodies, stock_symbols, market_reaction."""

    examples = [
        lx.data.ExampleData(
            text="Microsoft Corp. (NYSE: MSFT) CEO Satya Nadella reported Q2 2024 revenue of $65B, down 5% quarter-over-quarter.",
            extractions=[
                lx.data.Extraction(extraction_class="company", extraction_text="Microsoft Corp."),
                lx.data.Extraction(extraction_class="executive", extraction_text="CEO Satya Nadella"),
                lx.data.Extraction(extraction_class="stock_symbol", extraction_text="NYSE: MSFT"),
                lx.data.Extraction(extraction_class="quarter", extraction_text="Q2 2024"),
                lx.data.Extraction(extraction_class="financial_figure", extraction_text="$65B"),
                lx.data.Extraction(extraction_class="market_reaction", extraction_text="down 5% quarter-over-quarter"),
            ]
        )
    ]

    return extract(
        text_or_documents=text,
        prompt_description=prompt_description,
        examples=examples,
        model_id="gemini-2.5-flash"
    )

Now extract entities from the earnings report:

from collections import defaultdict

earning_report = """
Apple Inc. (NASDAQ: AAPL) reported third quarter revenue of $81.4 billion,
up 2% year over year. CEO Tim Cook stated that Services revenue reached
a new all-time high of $21.2 billion. The company's board of directors
declared a cash dividend of $0.24 per share.

CFO Luca Maestri mentioned that iPhone revenue was $39.3 billion for
the quarter ending June 30, 2023. The company expects total revenue
between $89 billion and $93 billion for the fourth quarter.

Apple's Cupertino headquarters announced the acquisition of AI startup
WaveOne for an undisclosed amount. The deal is expected to close in
Q4 2023, pending regulatory approval from the SEC.
"""

result = extract_financial_entities(earning_report)

non_empty = [e for e in result.extractions if e.extraction_text]
print(f"Extracted {len(non_empty)} entities:")

grouped = defaultdict(list)
for extraction in result.extractions:
    if extraction.extraction_text:  # Filter empty extractions
        grouped[extraction.extraction_class].append(extraction.extraction_text)

for entity_class, texts in grouped.items():
    print(f"\n{entity_class.upper()} ({len(texts)} found):")
    for text in texts:
        print(f"  '{text}'")

💡 What the output shows

Role-linked executives (“CEO Tim Cook”) instead of just the name
Semantic understanding of “undisclosed amount” as a financial figure
Market reaction “up 2% year over year” captured with full context

Quiz

The example extracts “CEO Satya Nadella” as an executive. How does this affect the model’s output?

langextract extracted “undisclosed amount” as a financial figure. Why would spaCy and GLiNER likely miss this?

Exercise: Analyze Customer Feedback

Scenario

The product team reviews app store feedback to prioritize fixes. They need to identify which feature users mention and whether the feedback is positive or negative.

Task

Complete the example by identifying what text to extract for each label. Before running it, make your AI Studio key available as LANGEXTRACT_API_KEY.

💡 Hint

Read the example: “Love the calendar sync, hate the notification sounds.” What words are features? What words express how the user feels?

from collections import defaultdict
import langextract as lx
from langextract import extract

feedback = "The new dark mode is amazing! But the search function is painfully slow."

examples = [
    lx.data.ExampleData(
        text="Love the calendar sync, hate the notification sounds.",
        extractions=[
            lx.data.Extraction(extraction_class="feature", extraction_text="calendar sync"),
            lx.data.Extraction(extraction_class="sentiment", extraction_text="Love"),
            lx.data.Extraction(extraction_class="feature", extraction_text="___"),
            lx.data.Extraction(extraction_class="sentiment", extraction_text="___"),
        ]
    )
]

result = extract(
    text_or_documents=feedback,
    prompt_description="Extract features mentioned and user sentiment.",
    examples=examples,
    model_id="gemini-2.5-flash"
)

entities = defaultdict(list)
for e in result.extractions:
    if e.extraction_text:
        entities[e.extraction_class].append(e.extraction_text)
print(dict(entities))

Visualizing Extractions

langextract can generate an interactive HTML visualization where each entity is color-coded and hoverable. First, save the results to a JSONL file, then generate the visualization:

import langextract as lx
from langextract import extract

text = "Apple CEO Tim Cook reported $81.4 billion in Q3 2023 revenue."

result = extract(
    text_or_documents=text,
    prompt_description="Extract companies, executives, and financial figures.",
    examples=[
        lx.data.ExampleData(
            text="Microsoft CEO Satya Nadella reported $65B revenue.",
            extractions=[
                lx.data.Extraction(extraction_class="company", extraction_text="Microsoft"),
                lx.data.Extraction(extraction_class="executive", extraction_text="CEO Satya Nadella"),
                lx.data.Extraction(extraction_class="financial_figure", extraction_text="$65B"),
            ]
        )
    ],
    model_id="gemini-2.5-flash"
)

# Save results and generate interactive visualization
lx.io.save_annotated_documents([result], output_name="extractions.jsonl", output_dir=".")
html_content = lx.visualize("extractions.jsonl")

with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)
    else:
        f.write(html_content)

💡 What the output shows

Each entity type gets a distinct color in the visualization
Hovering over highlighted text shows the extraction class and any attributes
The full source text is displayed with all entities highlighted inline

Quiz

What does langextract use under the hood to extract entities?

When to Use Each Tool

Now that you’ve seen all three tools in action, here’s how they compare across key dimensions to help you decide which fits your workflow:

Feature                 spaCy            GLiNER           langextract
Setup                   Model download   Model download   API key
Speed                   Fast             Moderate         Slower (API)
Cost                    Free             Free             Per-request
Privacy                 Local            Local            Cloud API
Custom Types            Limited          Zero-shot        Few-shot
Context Understanding   Basic            Good             Best

Here’s when to reach for each tool:

Start with spaCy if your entities fit standard types (PERSON, ORG, MONEY). It’s fast, free, and runs locally.
Move to GLiNER when you need custom entity types. It adds zero-shot flexibility while still running locally.
Use langextract when you need the deepest context understanding. It captures relationships and nuance that local models miss, at the cost of API calls.

Work with Khuyen Tran