Generic selectors
Exact matches only
Search in title
Search in content
Post Type Selectors
Filter by Categories
About Article
Analyze Data
Archive
Best Practices
Better Outputs
Blog
Code Optimization
Code Quality
Command Line
Daily tips
Dashboard
Data Analysis & Manipulation
Data Engineer
Data Visualization
DataFrame
Delta Lake
DevOps
DuckDB
Environment Management
Feature Engineer
Git
Jupyter Notebook
LLM
LLM
Machine Learning
Machine Learning
Machine Learning & AI
Manage Data
MLOps
Natural Language Processing
NumPy
Pandas
Polars
PySpark
Python Tips
Python Utilities
Python Utilities
Scrape Data
SQL
Testing
Time Series
Tools
Visualization
Visualization & Reporting
Workflow & Automation
Workflow Automation

Automate Topic Discovery with Top2Vec

Table of Contents

Automate Topic Discovery with Top2Vec

Manually identifying topics from large text collections can be a subjective and inconsistent process, requiring significant trial and error.

Top2Vec is a powerful library that automatically discovers topics by finding dense clusters of semantically similar documents and identifying the words that attract those documents together.

In this example, we will use the Top2Vec library to detect topics in a dataset of fake news articles.

We load the “Fake-News” dataset from OpenML using the following code:

from top2vec import Top2Vec
from sklearn.datasets import fetch_openml

news = fetch_openml("Fake-News")
text = news.data["text"].to_list()

Next, we create a Top2Vec model with the text data, using the “learn” speed and 8 worker threads:

model = Top2Vec(documents=text, speed="learn", workers=8)

The model automatically determines the number of topics in the data. We can retrieve this number using the get_num_topics method:

model.get_num_topics()

Output:

72

Get the topics in decreasing size:

topic_words, word_scores, topic_nums = model.get_topics(71)

print("\nSecond Topic - Top 5 Words and Scores:")
for word, score in zip(topic_words[1][:5], word_scores[1][:5]):
    print(f"Word: {word:<20} Score: {score:.4f}")

Output:

Second Topic - Top 5 Words and Scores:
Word: clinton              Score: 0.5631
Word: sanders              Score: 0.5293
Word: hillary              Score: 0.5234
Word: clintons             Score: 0.4866
Word: dnc                  Score: 0.4667

We can search for topics most similar to the keyword “president” using the search_topics method:

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["president"], num_topics=2)

first_five_words = [topic[:5] for topic in topic_words]

print("Topics most similar to president:")
for topic_num, words in zip(topic_nums, first_five_words):
    print(f"Topic {topic_num}: {words}")

Output:

Topics most similar to president:
Topic 44: ['manafort' 'nnkerry' 'trump' 'bannon' 'presidential']
Topic 2: ['trump' 'presidential' 'neocon' 'presidency' 'donald']

Finally, we can generate word clouds for the topics most similar to “president” using the generate_topic_wordcloud method:

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

Output:

Conclusion

Top2Vec is a powerful library for automated topic modeling that can help you discover hidden topics in large text collections. With its ability to automatically determine the number and content of topics, Top2Vec can save you significant time and effort in your text analysis tasks.

Link to Top2Vec.

Leave a Comment

Your email address will not be published. Required fields are marked *

0
    0
    Your Cart
    Your cart is empty
    Scroll to Top

    Work with Khuyen Tran

    Work with Khuyen Tran