Manually identifying topics from large text collections can be a subjective and inconsistent process, requiring significant trial and error.
Top2Vec is a powerful library that automatically discovers topics by finding dense clusters of semantically similar documents and identifying the words that attract those documents together.
In this example, we will use the Top2Vec library to detect topics in a dataset of fake news articles.
We load the “Fake-News” dataset from OpenML using the following code:
from top2vec import Top2Vec
from sklearn.datasets import fetch_openml
news = fetch_openml("Fake-News")
text = news.data["text"].to_list()
Next, we create a Top2Vec model with the text data, using the “learn” speed and 8 worker threads:
model = Top2Vec(documents=text, speed="learn", workers=8)
The model automatically determines the number of topics in the data. We can retrieve this number using the get_num_topics
method:
model.get_num_topics()
Output:
72
Get the topics in decreasing size:
topic_words, word_scores, topic_nums = model.get_topics(71)
print("\nSecond Topic - Top 5 Words and Scores:")
for word, score in zip(topic_words[1][:5], word_scores[1][:5]):
print(f"Word: {word:<20} Score: {score:.4f}")
Output:
Second Topic - Top 5 Words and Scores:
Word: clinton Score: 0.5631
Word: sanders Score: 0.5293
Word: hillary Score: 0.5234
Word: clintons Score: 0.4866
Word: dnc Score: 0.4667
We can search for topics most similar to the keyword “president” using the search_topics
method:
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["president"], num_topics=2)
first_five_words = [topic[:5] for topic in topic_words]
print("Topics most similar to president:")
for topic_num, words in zip(topic_nums, first_five_words):
print(f"Topic {topic_num}: {words}")
Output:
Topics most similar to president:
Topic 44: ['manafort' 'nnkerry' 'trump' 'bannon' 'presidential']
Topic 2: ['trump' 'presidential' 'neocon' 'presidency' 'donald']
Finally, we can generate word clouds for the topics most similar to “president” using the generate_topic_wordcloud
method:
for topic in topic_nums:
model.generate_topic_wordcloud(topic)
Output:
data:image/s3,"s3://crabby-images/b8ffa/b8ffa3ae19f503e4d81de40722dd6218393f7e03" alt=""
Conclusion
Top2Vec is a powerful library for automated topic modeling that can help you discover hidden topics in large text collections. With its ability to automatically determine the number and content of topics, Top2Vec can save you significant time and effort in your text analysis tasks.