Motivation
Data scientists and analysts often want to guide topic modeling with domain knowledge or specific themes they expect to find, but traditional topic modeling approaches offer no direct control over which topics are generated.
Introduction to BERTopic
Topic modeling is a text mining technique that discovers abstract topics in a collection of documents. It helps in organizing, searching, and understanding large volumes of text data by finding common themes or patterns.
BERTopic is a topic modeling library that leverages BERT embeddings and c-TF-IDF to create easily interpretable topics. You can install it using pip:
pip install bertopic
As covered in BERTopic: Harnessing BERT for Interpretable Topic Modeling, the library provides powerful topic visualization and automatic topic discovery. In this post, we will cover guided topic modeling.
Guided Topic Modeling with Seed Words
Seed words are predefined sets of words that represent themes or topics you expect or want to find in your documents. BERTopic allows you to guide the topic modeling process using these seed words. By providing seed words, you can:
- Direct the model towards specific themes of interest
- Incorporate domain expertise into the topic discovery process
- Ensure certain important themes are captured
Here’s how to implement guided topic modeling with seed words:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Load example data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Define seed topics with related words for each theme
seed_topic_list = [
    ["drug", "cancer", "drugs", "doctor"],   # Medical theme
    ["windows", "drive", "dos", "file"],     # Computer theme
    ["space", "launch", "orbit", "lunar"],   # Space theme
]
# Create and train the model with seed topics
topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
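Before inspecting individual topics, it helps to get an overview of everything the model found. BERTopic's get_topic_info() returns a DataFrame with each topic's ID, size, and representative words; the exact topics and counts will vary between runs:
# Overview of all discovered topics: ID, document count, and top words
# Topic -1 collects outlier documents that were not assigned to any topic
print(topic_model.get_topic_info().head(10))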
Let’s examine how the model processes these seed words and discovers topics (topic IDs are zero-based, so get_topic(4) returns the fifth topic):
# Examine discovered topics
print("\nFirst topic (Sports):")
print(topic_model.get_topic(0))
print("\nSecond topic (Cryptography):")
print(topic_model.get_topic(1))
print("\nFifth topic (Space Exploration):")
print(topic_model.get_topic(4))
Output:
First topic (Sports):
[('game', 0.010652), ('team', 0.009260), ('games', 0.007348),
('he', 0.007269), ('players', 0.006459), ('season', 0.006363),
('hockey', 0.006247), ('play', 0.005889), ('25', 0.005802),
('year', 0.005770)]
Second topic (Cryptography):
[('key', 0.015048), ('clipper', 0.012965), ('chip', 0.012280),
('encryption', 0.011336), ('keys', 0.010264), ('escrow', 0.008797),
('government', 0.007993), ('nsa', 0.007754), ('algorithm', 0.007132),
('be', 0.006736)]
Fifth topic (Space Exploration):
[('space', 0.019632), ('launch', 0.016378), ('orbit', 0.010814),
('lunar', 0.010734), ('moon', 0.008701), ('nasa', 0.007899),
('shuttle', 0.006732), ('mission', 0.006472), ('earth', 0.006001),
('station', 0.005720)]
The results show how seed words influence topic discovery:
- Seed Word Integration: In Topic 4 (the fifth topic), the space-related seed words (‘space’, ‘launch’, ‘orbit’, ‘lunar’) carry the highest weights, and the model expands on them with related terms like ‘shuttle’, ‘mission’, and ‘station’. A programmatic way to confirm this match is sketched after this list.
- Natural Topic Discovery: The model also surfaces prominent topics like sports (Topic 0) and cryptography (Topic 1), even though no seed words were provided for them. Seed words guide the model without constraining it to the seeded themes.
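To check which discovered topic best matches a seed theme, you can use BERTopic's find_topics, which searches topics by semantic similarity to a query term. A minimal sketch (topic IDs and similarity scores will vary between runs):
# Find the topics most semantically similar to a query term
similar_topics, similarity = topic_model.find_topics("space", top_n=3)
print(similar_topics, similarity)

# Inspect the best match to verify it captures the seeded space theme
print(topic_model.get_topic(similar_topics[0]))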
Conclusion
Guided topic modeling with seed words in BERTopic offers a powerful way to balance user expertise with automated topic discovery. While seed words help direct the model toward specific themes of interest, the model maintains the flexibility to discover other important topics and expand the seed topics with related terms, resulting in a more comprehensive and nuanced topic analysis.