BERTopic: Harnessing BERT for Interpretable Topic Modeling

Topic modeling is a popular technique in NLP for discovering abstract topics that occur in a collection of documents.

BERTopic leverages BERT to generate contextualized document embeddings, capturing semantics better than bag-of-words. It also provides excellent topic visualization capabilities and allows fine-tuning topic representations using language models like GPT.
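
As a sketch of that flexibility, you can pass in your own embedding model and a representation model that refines the default keyword-based topic descriptions. KeyBERTInspired is one of several representation models BERTopic ships (the LLM-based ones follow the same pattern); availability depends on your BERTopic version:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer

# Embed documents with a sentence-transformer of your choice
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Refine the default c-TF-IDF keywords into more coherent topic descriptions
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
)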

For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts.

from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.

We start by instantiating BERTopic. We set language to "english" since our documents are in English.

from bertopic import BERTopic

topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(docs)
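
Once fitted, the model can also assign topics to documents it has never seen via transform. A minimal sketch, with a made-up example document:

# Assign topics to unseen documents (the example text here is hypothetical)
new_docs = ["The goalie made an incredible save in overtime."]
new_topics, new_probs = topic_model.transform(new_docs)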

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

freq = topic_model.get_topic_info()
freq.head(5)
| Topic | Count | Name | Representation | Representative_Docs |
|-------|-------|------|----------------|---------------------|
| 0 | 1823 | 0_game_team_games_he | [game, team, games, he, players, season, hockey, nhl, league, played] | ["Deeply rooted rivalry?" Ahem, Jokerit has been in the Finnish league for two seasons…] |
| 1 | 630 | 1_key_clipper_chip_encryption | [key, clipper, chip, encryption, keys, escrow, government, secure, privacy, public] | [I am not an expert in the cryptography science by any means. However, I have studied the Clipper chip proposal extensively and have some thoughts on the matter… 1) The…] |
| 2 | 527 | 2_idjits_ites_cheek_dancing | [idjits, ites, cheek, dancing, yep, consistently, considered, wrt, miller, observations] | [consistently, Yep. Dancing With Reality Cheek-to-Cheek to Stay Considered Consistent and On-the-Cheek. Some of Miller's keener observations (and…] |
| 3 | 446 | 3_israel_israeli_jews_arab | [israel, israeli, jews, arab, arabs, jewish, palestine, peace, land, occupied] | [This a "tried and true" method utilized by the Israelis in order to portray anyone who is critical of Israeli policies as being anti-Semitic…] |
| -1 | 6789 | -1_to_the_is_of | [to, the, is, of, and, you, for, it, in, that] | [It's like refusing 'God's kingdom come'. I often take positions opposite those of mainstream Christianity. But not in this case.…] |

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated:

topic_model.get_topic(0)  # Select the most frequent topic
[('game', 0.010318688564543007),
 ('team', 0.008992489388365084),
 ('games', 0.0071658097402482355),
 ('he', 0.006986923839656088),
 ('players', 0.00631255726099582),
 ('season', 0.006207025740053),
 ('hockey', 0.006108581738112714),
 ('play', 0.0057638598847672895),
 ('25', 0.005625421684874428),
 ('year', 0.005577343029862753)]

Access the predicted topics for the first 10 documents:

topic_model.topics_[:10]
[0, -1, 54, 29, 92, -1, -1, 0, 0, -1]
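
Because many documents can end up in the -1 outlier topic, newer BERTopic versions also provide reduce_outliers to reassign them to their closest topic. A hedged sketch (check the exact signature for your version):

# Reassign outlier (-1) documents to the nearest non-outlier topic
new_topics = topic_model.reduce_outliers(docs, topics)

# Update the topic representations to reflect the new assignments
topic_model.update_topics(docs, topics=new_topics)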

Visualize topics:

topic_model.visualize_topics()

We can visualize the top terms of a few topics by creating bar charts from the c-TF-IDF scores of each topic representation. The relative c-TF-IDF scores, both between and within topics, offer insight into how distinctive each term is, and make it easy to compare topic representations to each other.

fig = topic_model.visualize_barchart(top_n_topics=8)
fig.show()

Link to BERTopic.
