Visualizing text data in 2D typically requires several steps: cleaning, encoding, and dimensionality reduction. These processes can be time-consuming.
The texthero library simplifies this task, letting you run all of these steps in a single short pipeline.
The following example demonstrates how to use texthero to visualize CNN news article descriptions from a Kaggle dataset. Each point in the resulting plot represents an article, color-coded by its category.
import pandas as pd
import texthero as hero

# Load the data
df = pd.read_csv("small_CNN.csv")

# Clean the text, encode it with TF-IDF, and reduce it to two dimensions with PCA
df["pca"] = (
    df["Description"]
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Create the visualization. hero.scatterplot draws with Plotly and displays
# the figure itself, so no matplotlib figure or plt.show() call is needed.
hero.scatterplot(df, col="pca", color="Category", title="CNN News")

This code cleans the text, applies TF-IDF encoding, runs PCA, and draws a 2D scatter plot of the articles in only a few lines.
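To make the three steps less of a black box, here is a minimal sketch of what cleaning, TF-IDF encoding, and PCA each do, written with NumPy only. It is an illustration, not texthero's implementation: the helper names (clean, tfidf, pca_2d), the smoothed IDF formula, and the four sample headlines are all assumptions made for this example, and the cleaning step is far simpler than hero.clean.

```python
import re
from collections import Counter

import numpy as np


def clean(text):
    """Lowercase the text and keep only alphabetic tokens (simplified stand-in for hero.clean)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return an (n_docs, n_terms) TF-IDF matrix and the sorted vocabulary."""
    tokenized = [clean(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(tokenized):
        for term, n in Counter(doc).items():
            counts[row, index[term]] = n
    # Smoothed inverse document frequency (one common variant, assumed here)
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1
    return counts * idf, vocab


def pca_2d(matrix):
    """Project rows onto the two directions of largest variance using SVD."""
    centered = matrix - matrix.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:2].T


# Hypothetical sample data standing in for the article descriptions
docs = [
    "Stocks rally as markets recover",
    "Markets fall on weak earnings",
    "Team wins the championship final",
    "Star striker scores in the final",
]
weights, vocab = tfidf(docs)
points = pca_2d(weights)  # one (x, y) pair per document
```

Each row of points is the 2D position of one document, which is the kind of value the "pca" column holds before plotting.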