BERTopic: Harnessing BERT for Interpretable Topic Modeling

Khuyen Tran

Topic modeling is a popular technique in NLP for discovering abstract topics that occur in a collection of documents.

BERTopic leverages BERT to generate contextualized document embeddings, capturing semantics better than bag-of-words. It also provides excellent topic visualization capabilities and allows fine-tuning topic representations using language models like GPT.
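To see why bag-of-words loses meaning, here is a minimal sketch in plain Python. It is not part of BERTopic; the `bag_of_words` helper and the two sentences are made up for illustration. Two sentences with opposite meanings end up with the same word counts, which is the kind of information contextual embeddings are meant to keep.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens, discarding word order."""
    return Counter(text.lower().split())


# Two illustrative sentences with the same words but different meanings
a = "the dog bit the man"
b = "the man bit the dog"

# The counts are identical, so a bag-of-words model cannot tell them apart
same_representation = bag_of_words(a) == bag_of_words(b)
print(same_representation)  # True
```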

For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts:

from sklearn.datasets import fetch_20newsgroups

# Load every post and strip headers, footers, and quoted replies
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.

We start by instantiating BERTopic. We set language to "english" because our documents are in English.

from bertopic import BERTopic

topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(docs)

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

freq = topic_model.get_topic_info()
freq.head(5)

Topic	Count	Name	Representation	Representative_Docs
0	1823	0_game_team_games_he	[game, team, games, he, players, season, hockey, nhl, league, played]	[“\n\n”Deeply rooted rivalry?” Ahem, Jokerit has been in the Finnish league for two seasons.\n\n\n\n\n\nThe reason why they’re…]
1	630	1_key_clipper_chip_encryption	[key, clipper, chip, encryption, keys, escrow, government, secure, privacy, public]	[“\nI am not an expert in the cryptography science by any means.\nHowever, I have studied the Clipper chip proposal extensively\nand have some thoughts on the matter…\n\n1) The…”]
2	527	2_idjits_ites_cheek_dancing	[idjits, ites, cheek, dancing, yep, consistently, considered, wrt, miller, observations]	[“consistently\n\n\n, \nYep.\n, \nDancing With Reality Cheek-to-Cheek to Stay Considered Consistent and On-the-Cheek\n\nSome of Miller’s keener observations (and…”]
3	446	3_israel_israeli_jews_arab	[israel, israeli, jews, arab, arabs, jewish, palestine, peace, land, occupied]	[“\nThis a “tried and true” method utilized by the Israelis in\norder to portray anyone who is critical of Israeli policies as\nbeing anti-Semitic…”]
-1	6789	-1_to_the_is_of	[to, the, is, of, and, you, for, it, in, that]	[“It’s like refusing ‘God’s kingdom come’.\n\nI often take positions opposite those of mainstream Christianity.\nBut not in this case.\n…”]

Topic -1 refers to all outliers and should typically be ignored. Next, let’s take a look at one of the frequent topics that was generated:

topic_model.get_topic(0)  # Select the most frequent topic

[('game', 0.010318688564543007),
 ('team', 0.008992489388365084),
 ('games', 0.0071658097402482355),
 ('he', 0.006986923839656088),
 ('players', 0.00631255726099582),
 ('season', 0.006207025740053),
 ('hockey', 0.006108581738112714),
 ('play', 0.0057638598847672895),
 ('25', 0.005625421684874428),
 ('year', 0.005577343029862753)]
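The output above is a plain list of (word, score) pairs, so it is easy to post-process. As a rough sketch, the hypothetical helper `topic_label` below (our own function, not a BERTopic API) joins the highest-scoring words into a short label, similar to the word part of the Name column shown earlier. The sample list is copied from the first five entries of the output above.

```python
def topic_label(word_scores, n_words=4):
    """Join the n highest-scoring words of a topic into a short label."""
    top = sorted(word_scores, key=lambda pair: pair[1], reverse=True)[:n_words]
    return "_".join(word for word, _ in top)


# First five (word, score) pairs from the get_topic(0) output above
topic_0 = [
    ("game", 0.010318688564543007),
    ("team", 0.008992489388365084),
    ("games", 0.0071658097402482355),
    ("he", 0.006986923839656088),
    ("players", 0.00631255726099582),
]

print(topic_label(topic_0))  # game_team_games_he
```

With a fitted model, the same helper could be called as topic_label(topic_model.get_topic(0)).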

Access the predicted topics for the first 10 documents:

topic_model.topics_[:10]

[0, -1, 54, 29, 92, -1, -1, 0, 0, -1]
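Since -1 marks outliers, it usually makes sense to leave it out when counting topic assignments. The sketch below uses only the standard library and the ten labels shown above; `topic_counts` and `OUTLIER_TOPIC` are names we made up for this example.

```python
from collections import Counter

OUTLIER_TOPIC = -1  # label used for outliers in the output above


def topic_counts(topics, outlier=OUTLIER_TOPIC):
    """Count how often each topic was assigned, skipping the outlier label."""
    return Counter(t for t in topics if t != outlier)


# Predicted topics for the first 10 documents, copied from the output above
first_ten = [0, -1, 54, 29, 92, -1, -1, 0, 0, -1]

counts = topic_counts(first_ten)
print(counts.most_common(1))  # [(0, 3)]

# Share of documents that were not assigned to any topic
outlier_share = first_ten.count(OUTLIER_TOPIC) / len(first_ten)
print(outlier_share)  # 0.4
```

On the full model, passing topic_model.topics_ instead of first_ten would give the counts for every document.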

Visualize topics:

topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

fig = topic_model.visualize_barchart(top_n_topics=8)
fig.show()
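To build some intuition for the c-TF-IDF scores behind these bar charts, here is a simplified sketch in plain Python. It assumes the commonly described form of the formula, where a term's frequency within a topic is multiplied by log(1 + A / f), with A the average number of words per topic and f the term's frequency across all topics. The function `c_tf_idf`, the whitespace tokenizer, and the toy documents are our own assumptions, and this is not the library's actual implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per topic.

    class_docs maps a topic id to the list of documents assigned to it.
    Returns a dict of topic id -> {term: score}.
    """
    # Treat all documents of a topic as one long document and count its terms
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in class_docs.items()
    }

    # Frequency of each term across all topics
    total_per_term = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)

    # Average number of words per topic
    avg_words = sum(total_per_term.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in counts.items()
        }
    return scores


# Toy documents, invented for illustration
toy_topics = {
    0: ["the team won the game", "the hockey game"],
    1: ["the encryption key", "the clipper chip key"],
}

scores = c_tf_idf(toy_topics)
top_word_topic_0 = max(scores[0], key=scores[0].get)
print(top_word_topic_0)  # game
```

Note that "the" is the most frequent word in topic 0, yet "game" scores higher because "the" also appears often in the other topic. That is the effect to look for when comparing scores between and within topics.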

Link to BERTopic.
