Automate Topic Discovery with Top2Vec

Automate Topic Discovery with Top2Vec

Manually identifying topics from large text collections can be a subjective and inconsistent process, requiring significant trial and error.

Top2Vec is a powerful library that automatically discovers topics by finding dense clusters of semantically similar documents and identifying the words that attract those documents together.

In this example, we will use the Top2Vec library to detect topics in a dataset of fake news articles.

We load the “Fake-News” dataset from OpenML using the following code:

from top2vec import Top2Vec
from sklearn.datasets import fetch_openml

news = fetch_openml("Fake-News")
text = news.data["text"].to_list()

Next, we create a Top2Vec model with the text data, using the “learn” speed and 8 worker threads:

model = Top2Vec(documents=text, speed="learn", workers=8)

The model automatically determines the number of topics in the data. We can retrieve this number using the get_num_topics method:

model.get_num_topics()

Output:

72

Get the topics in decreasing size:

topic_words, word_scores, topic_nums = model.get_topics(71)

print("\nSecond Topic - Top 5 Words and Scores:")
for word, score in zip(topic_words[1][:5], word_scores[1][:5]):
    print(f"Word: {word:<20} Score: {score:.4f}")

Output:

Second Topic - Top 5 Words and Scores:
Word: clinton              Score: 0.5631
Word: sanders              Score: 0.5293
Word: hillary              Score: 0.5234
Word: clintons             Score: 0.4866
Word: dnc                  Score: 0.4667

We can search for topics most similar to the keyword “president” using the search_topics method:

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["president"], num_topics=2)

first_five_words = [topic[:5] for topic in topic_words]

print("Topics most similar to president:")
for topic_num, words in zip(topic_nums, first_five_words):
    print(f"Topic {topic_num}: {words}")

Output:

Topics most similar to president:
Topic 44: ['manafort' 'nnkerry' 'trump' 'bannon' 'presidential']
Topic 2: ['trump' 'presidential' 'neocon' 'presidency' 'donald']

Finally, we can generate word clouds for the topics most similar to “president” using the generate_topic_wordcloud method:

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

Output:

Conclusion

Top2Vec is a powerful library for automated topic modeling that can help you discover hidden topics in large text collections. With its ability to automatically determine the number and content of topics, Top2Vec can save you significant time and effort in your text analysis tasks.

Link to Top2Vec.

Search

Related Posts

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top

Work with Khuyen Tran

Work with Khuyen Tran