How to Cluster Text with TF‑IDF, KMeans and PCA in Python

This article walks through a complete Python workflow that loads the 20 Newsgroups dataset, preprocesses the documents, vectorizes them with TF‑IDF, groups them using KMeans, reduces dimensions with PCA, and visualizes the resulting clusters, illustrating each step with code and plots.

Code DAO

The article demonstrates a step‑by‑step pipeline for clustering textual data using scikit‑learn. It begins by loading a subset of the 20 Newsgroups corpus (categories: comp.graphics, comp.os.ms-windows.misc, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian) with fetch_20newsgroups, shuffling the 3451 documents and stripping headers, footers, and quotes.

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics','comp.os.ms-windows.misc','rec.sport.baseball','rec.sport.hockey','alt.atheism','soc.religion.christian']
dataset = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, remove=('headers','footers','quotes'))

After converting the raw texts into a pandas.DataFrame, a preprocess_text function is defined to clean each document: it removes links, non‑alphabetic characters, and (optionally) stopwords, then converts the text to lowercase and strips excess whitespace.

import re
import nltk
from nltk.corpus import stopwords

def preprocess_text(text: str, remove_stopwords: bool) -> str:
    """Sanitize a string: remove links, non-alphabetic characters and (optionally) stopwords, then lowercase and strip whitespace."""
    text = re.sub(r"http\S+", "", text)      # drop URLs
    text = re.sub("[^A-Za-z]+", " ", text)   # keep letters only
    if remove_stopwords:
        tokens = nltk.word_tokenize(text)
        tokens = [w for w in tokens if w.lower() not in stopwords.words("english")]
        text = " ".join(tokens)
    return text.lower().strip()
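The article mentions converting the raw texts to a DataFrame and applying the cleaner, but that step is not shown. A minimal sketch, assuming columns named `text` and `cleaned` (both column names are assumptions) and using a simplified cleaner so the example runs without NLTK downloads:

```python
import re
import pandas as pd

# Toy stand-in for dataset.data from fetch_20newsgroups
texts = [
    "Check http://example.com for OpenGL 3.2 demos!",
    "The goalie made 41 saves in last night's game.",
]
df = pd.DataFrame({"text": texts})

def simple_clean(text: str) -> str:
    # Same regex steps as preprocess_text, minus NLTK stopword removal
    text = re.sub(r"http\S+", "", text)
    text = re.sub("[^A-Za-z]+", " ", text)
    return text.lower().strip()

df["cleaned"] = df["text"].apply(simple_clean)
print(df["cleaned"].tolist())
```

In the full pipeline the same pattern would be `df["cleaned"] = df["text"].apply(lambda t: preprocess_text(t, remove_stopwords=True))`.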

The cleaned corpus is vectorized with TfidfVectorizer (sublinear TF, min_df=5, max_df=0.95), producing a sparse matrix X that encodes term importance across documents.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.95)
X = vectorizer.fit_transform(df['cleaned'])

KMeans clustering (3 clusters, random_state=42) is then applied to the TF‑IDF vectors, and the resulting labels are stored in the dataframe.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
clusters = kmeans.labels_
df['cluster'] = clusters
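The article fixes k=3 up front; one common sanity check (not part of the original article) is the silhouette score, which rewards tight, well-separated clusters. A sketch on a toy corpus:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = [
    "hockey game goal", "baseball pitcher inning",
    "windows graphics driver", "opengl rendering graphics",
    "church faith prayer", "atheism religion debate",
]
X_toy = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_toy)

# Score lies in [-1, 1]; higher means tighter, better-separated clusters
score = silhouette_score(X_toy, km.labels_)
print(round(score, 3))
```

Comparing this score across several values of k is a simple way to validate the choice of three clusters.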

To visualize the high‑dimensional data, PCA reduces X to two components. The reduced coordinates (x0, x1) are added to the dataframe.

from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
pca_vecs = pca.fit_transform(X.toarray())
df['x0'] = pca_vecs[:, 0]
df['x1'] = pca_vecs[:, 1]
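Two components rarely capture much of the variance of a high-dimensional TF‑IDF matrix, so the scatter plot is only a rough projection; `explained_variance_ratio_` quantifies what is retained. A sketch on random data standing in for `X.toarray()`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dense = rng.random((100, 50))  # stand-in for X.toarray()

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(dense)

# Fraction of the total variance the two components keep
retained = pca.explained_variance_ratio_.sum()
print(coords.shape, round(retained, 3))
```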

Top keywords for each centroid are extracted by averaging TF‑IDF vectors per cluster and selecting the highest‑scoring terms. Based on the keywords, the numeric cluster IDs are mapped to semantic labels: 0 → "sport", 1 → "technology", 2 → "religion".

import numpy as np
import pandas as pd

def get_top_keywords(n_terms):
    """Print the n_terms highest-scoring TF-IDF terms for each cluster."""
    df_mean = pd.DataFrame(X.toarray()).groupby(clusters).mean()
    terms = vectorizer.get_feature_names_out()
    for i, row in df_mean.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([terms[t] for t in np.argsort(row)[-n_terms:]]))

get_top_keywords(10)

cluster_map = {0: "sport", 1: "technology", 2: "religion"}
df['cluster'] = df['cluster'].map(cluster_map)
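Since the true newsgroup categories are known, the mapping can be checked with a cross-tabulation of true category against assigned label (a step not shown in the article). A sketch with hypothetical toy rows:

```python
import pandas as pd

# Hypothetical rows standing in for df after the cluster_map step
toy = pd.DataFrame({
    "category": ["rec.sport.hockey", "rec.sport.baseball",
                 "comp.graphics", "soc.religion.christian"],
    "cluster": ["sport", "sport", "technology", "religion"],
})
# Rows: true categories, columns: assigned semantic clusters
ct = pd.crosstab(toy["category"], toy["cluster"])
print(ct)
```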

Finally, a scatter plot visualizes the two‑dimensional PCA representation, coloring points by their semantic cluster using Seaborn.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 7))
plt.title("TF‑IDF + KMeans 20newsgroup clustering", fontsize=18)
plt.xlabel("X0", fontsize=16)
plt.ylabel("X1", fontsize=16)
sns.scatterplot(data=df, x='x0', y='x1', hue='cluster', palette='viridis')
plt.show()

The resulting plot shows three well‑separated groups corresponding to sports, technology, and religion. Minor overlap between sports and technology clusters is explained by shared terminology that receives similar TF‑IDF weights.

All code snippets above constitute the complete reproducible script for the described workflow.

Figures in the original article: dataset preview, sparse matrix illustration, KMeans centroids, KMeans convergence, PCA reduction, top keywords per cluster, final clustering visualization.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, NLP, PCA, TF-IDF, scikit-learn, KMeans, text clustering
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy: making life better!