BERTopic: Enhance Topic Models with Expert-Defined Themes

Motivation

Data scientists and analysts often want to steer topic modeling with domain knowledge or specific themes they expect to find, but traditional topic modeling approaches offer little control over the topics that are generated.

Introduction to BERTopic

Topic modeling is a text mining technique that discovers abstract topics in a collection of documents. It helps in organizing, searching, and understanding large volumes of text data by finding common themes or patterns.

BERTopic is a topic modeling library that leverages BERT embeddings and c-TF-IDF to create easily interpretable topics. You can install it using pip:

pip install bertopic

As covered in BERTopic: Harnessing BERT for Interpretable Topic Modeling, the library provides powerful topic visualization and automatic topic discovery. In this post, we will cover guided topic modeling.
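
Before adding seed words, it helps to see the default workflow. The snippet below is a minimal, unguided sketch using BERTopic's standard fit_transform, get_topic_info, and visualize_topics calls on the 20 Newsgroups data used throughout this post.

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load a sample corpus (the same dataset is reused in the guided example below)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Fit a default BERTopic model and let it discover topics automatically
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Summarize the discovered topics
print(topic_model.get_topic_info().head())

# Interactive inter-topic distance map (Plotly figure)
fig = topic_model.visualize_topics()
fig.show()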

Guided Topic Modeling with Seed Words

Seed words are predefined sets of words that represent themes or topics you expect or want to find in your documents. BERTopic allows you to guide the topic modeling process using these seed words. By providing seed words, you can:

  • Direct the model towards specific themes of interest
  • Incorporate domain expertise into the topic discovery process
  • Ensure certain important themes are captured

Here’s how to implement guided topic modeling with seed words:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load example data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Define seed topics with related words for each theme
seed_topic_list = [
    ["drug", "cancer", "drugs", "doctor"],        # Medical theme
    ["windows", "drive", "dos", "file"],          # Computer theme
    ["space", "launch", "orbit", "lunar"]         # Space theme
]

# Create and train the model with seed topics
topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

Let’s examine how the model processes these seed words and discovers topics:

# Examine discovered topics
print("\nFirst topic (Sports):")
print(topic_model.get_topic(0))

print("\nSecond topic (Cryptography):")
print(topic_model.get_topic(1))

print("\nFifth topic (Space Exploration):")
print(topic_model.get_topic(4))

Output:

First topic (Sports):
[('game', 0.010652), ('team', 0.009260), ('games', 0.007348), 
 ('he', 0.007269), ('players', 0.006459), ('season', 0.006363), 
 ('hockey', 0.006247), ('play', 0.005889), ('25', 0.005802), 
 ('year', 0.005770)]

Second topic (Cryptography):
[('key', 0.015048), ('clipper', 0.012965), ('chip', 0.012280), 
 ('encryption', 0.011336), ('keys', 0.010264), ('escrow', 0.008797), 
 ('government', 0.007993), ('nsa', 0.007754), ('algorithm', 0.007132), 
 ('be', 0.006736)]

Fifth topic (Space Exploration):
[('space', 0.019632), ('launch', 0.016378), ('orbit', 0.010814), 
 ('lunar', 0.010734), ('moon', 0.008701), ('nasa', 0.007899), 
 ('shuttle', 0.006732), ('mission', 0.006472), ('earth', 0.006001), 
 ('station', 0.005720)]

The results show how seed words influence topic discovery:

  • Seed Word Integration: In the space topic (Topic 4, the fifth topic), the space-related seed words (‘space’, ‘launch’, ‘orbit’, ‘lunar’) have high weights. The model expands on these words to include related terms like ‘shuttle’, ‘mission’, and ‘station’.
  • Natural Topic Discovery: The model also discovers prominent topics like sports (Topic 0) and cryptography (Topic 1), even though it was only seeded with medical, computer, and space themes. This shows that seed words guide the model without constraining it; the sketch after this list shows how to inspect the full topic table yourself.
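
To check this on your own run, you can list every discovered topic and search for the topics closest to a seeded theme. The snippet below is a small sketch using BERTopic's get_topic_info and find_topics methods on the model trained above; the exact topic numbers will vary between runs.

# Overview of all topics; Topic -1 collects outlier documents
print(topic_model.get_topic_info().head(10))

# Find the topics most similar to the medical seed words
similar_topics, similarities = topic_model.find_topics("drug cancer doctor", top_n=3)
print(similar_topics, similarities)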

Conclusion

Guided topic modeling with seed words in BERTopic offers a powerful way to balance user expertise with automated topic discovery. While seed words help direct the model toward specific themes of interest, the model maintains the flexibility to discover other important topics and expand the seed topics with related terms, resulting in a more comprehensive and nuanced topic analysis.

Link to BERTopic
