How to Cluster Text with TF‑IDF, KMeans and PCA in Python

This article walks through a complete Python workflow that loads the 20 Newsgroups dataset, preprocesses the documents, vectorizes them with TF‑IDF, groups them using KMeans, reduces dimensions with PCA, and visualizes the resulting clusters, illustrating each step with code and plots.

Code DAO

The article demonstrates a step‑by‑step pipeline for clustering textual data using scikit‑learn. It begins by loading a subset of the 20 Newsgroups corpus (categories: comp.graphics, comp.os.ms-windows.misc, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian) with fetch_20newsgroups, shuffling the 3451 documents and stripping headers, footers, and quotes.

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics','comp.os.ms-windows.misc','rec.sport.baseball','rec.sport.hockey','alt.atheism','soc.religion.christian']
dataset = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, remove=('headers','footers','quotes'))

After converting the raw texts into a pandas.DataFrame, a preprocess_text function is defined to clean each document: it removes links, non‑alphabetic characters, and (optionally) stopwords, then converts the text to lowercase and strips excess whitespace.

import re
import nltk
from nltk.corpus import stopwords

def preprocess_text(text: str, remove_stopwords: bool) -> str:
    """Sanitize a string: remove links, non-alphabetic characters and (optionally) stopwords, then lowercase and strip whitespace."""
    text = re.sub(r"http\S+", "", text)      # drop URLs
    text = re.sub("[^A-Za-z]+", " ", text)   # keep letters only
    if remove_stopwords:
        tokens = nltk.word_tokenize(text)
        tokens = [w for w in tokens if w.lower() not in stopwords.words("english")]
        text = " ".join(tokens)
    return text.lower().strip()
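The article mentions converting the raw texts to a DataFrame and applying the cleaner, but that step is not shown. A minimal sketch, assuming columns named `text` and `cleaned` (both column names are assumptions) and using a simplified cleaner so the example runs without NLTK downloads:

```python
import re
import pandas as pd

# Toy stand-in for dataset.data from fetch_20newsgroups
texts = [
    "Check http://example.com for OpenGL 3.2 demos!",
    "The goalie made 41 saves in last night's game.",
]
df = pd.DataFrame({"text": texts})

def simple_clean(text: str) -> str:
    # Same regex steps as preprocess_text, minus NLTK stopword removal
    text = re.sub(r"http\S+", "", text)
    text = re.sub("[^A-Za-z]+", " ", text)
    return text.lower().strip()

df["cleaned"] = df["text"].apply(simple_clean)
print(df["cleaned"].tolist())
```

In the full pipeline the same pattern would be `df["cleaned"] = df["text"].apply(lambda t: preprocess_text(t, remove_stopwords=True))`.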

The cleaned corpus is vectorized with TfidfVectorizer (sublinear TF, min_df=5, max_df=0.95), producing a sparse matrix X that encodes term importance across documents.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=0.95)
X = vectorizer.fit_transform(df['cleaned'])

KMeans clustering (3 clusters, random_state=42) is then applied to the TF‑IDF vectors, and the resulting labels are stored in the dataframe.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
clusters = kmeans.labels_
df['cluster'] = clusters
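The article fixes k=3 up front; one common sanity check (not part of the original article) is the silhouette score, which rewards tight, well-separated clusters. A sketch on a toy corpus:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = [
    "hockey game goal", "baseball pitcher inning",
    "windows graphics driver", "opengl rendering graphics",
    "church faith prayer", "atheism religion debate",
]
X_toy = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_toy)

# Score lies in [-1, 1]; higher means tighter, better-separated clusters
score = silhouette_score(X_toy, km.labels_)
print(round(score, 3))
```

Comparing this score across several values of k is a simple way to validate the choice of three clusters.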

To visualize the high‑dimensional data, PCA reduces X to two components. The reduced coordinates (x0, x1) are added to the dataframe.

from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
pca_vecs = pca.fit_transform(X.toarray())
df['x0'] = pca_vecs[:, 0]
df['x1'] = pca_vecs[:, 1]
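Two components rarely capture much of the variance of a high-dimensional TF‑IDF matrix, so the scatter plot is only a rough projection; `explained_variance_ratio_` quantifies what is retained. A sketch on random data standing in for `X.toarray()`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dense = rng.random((100, 50))  # stand-in for X.toarray()

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(dense)

# Fraction of the total variance the two components keep
retained = pca.explained_variance_ratio_.sum()
print(coords.shape, round(retained, 3))
```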

Top keywords for each centroid are extracted by averaging TF‑IDF vectors per cluster and selecting the highest‑scoring terms. Based on the keywords, the numeric cluster IDs are mapped to semantic labels: 0 → "sport", 1 → "technology", 2 → "religion".

import numpy as np
import pandas as pd

def get_top_keywords(n_terms):
    """Print the n_terms highest-scoring TF-IDF terms for each cluster."""
    df_mean = pd.DataFrame(X.toarray()).groupby(clusters).mean()
    terms = vectorizer.get_feature_names_out()
    for i, row in df_mean.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([terms[t] for t in np.argsort(row)[-n_terms:]]))

get_top_keywords(10)

cluster_map = {0: "sport", 1: "technology", 2: "religion"}
df['cluster'] = df['cluster'].map(cluster_map)
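Since the true newsgroup categories are known, the mapping can be checked with a cross-tabulation of true category against assigned label (a step not shown in the article). A sketch with hypothetical toy rows:

```python
import pandas as pd

# Hypothetical rows standing in for df after the cluster_map step
toy = pd.DataFrame({
    "category": ["rec.sport.hockey", "rec.sport.baseball",
                 "comp.graphics", "soc.religion.christian"],
    "cluster": ["sport", "sport", "technology", "religion"],
})
# Rows: true categories, columns: assigned semantic clusters
ct = pd.crosstab(toy["category"], toy["cluster"])
print(ct)
```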

Finally, a scatter plot visualizes the two‑dimensional PCA representation, coloring points by their semantic cluster using Seaborn.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 7))
plt.title("TF‑IDF + KMeans 20newsgroup clustering", fontsize=18)
plt.xlabel("X0", fontsize=16)
plt.ylabel("X1", fontsize=16)
sns.scatterplot(data=df, x='x0', y='x1', hue='cluster', palette='viridis')
plt.show()

The resulting plot shows three well‑separated groups corresponding to sports, technology, and religion. Minor overlap between sports and technology clusters is explained by shared terminology that receives similar TF‑IDF weights.

All code snippets above constitute the complete reproducible script for the described workflow.

Figures in the original article: dataset preview, sparse matrix illustration, KMeans centroids, KMeans convergence, PCA reduction, top keywords per cluster, final clustering visualization.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, NLP, PCA, TF-IDF, scikit-learn, KMeans, text clustering
Written by Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy: making life better!