
Extracting Regression from Production Requests Using Clustering Algorithms

This article explains how to apply TF‑IDF weighting and the K‑means clustering algorithm in Python to identify a small set of representative regression cases from hundreds of thousands of production request records, including guidance on selecting the optimal number of clusters.

360 Quality & Efficiency

Nov 2, 2018

The article describes a method for extracting representative regression cases from large volumes of production request logs by leveraging text mining techniques and clustering algorithms.

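TF‑IDF (Term Frequency–Inverse Document Frequency) is introduced as a weighting scheme that evaluates the importance of a term within a document relative to a corpus. The term frequency is calculated as tf(w, d) = count(w, d) / size(d), where count(w, d) is the number of times term w occurs in document d and size(d) is the total number of terms in d. The inverse document frequency is idf(w) = log(n / docs(w, D)), where n is the number of documents in the corpus D and docs(w, D) is the number of documents that contain w. The combined TF‑IDF weight for a term is tf(w, d) * idf(w), and for a query q = w_1, ..., w_k against a document d it is summed over all query terms: tf‑idf(q, d) = sum_{i=1..k} tf(w_i, d) * idf(w_i).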
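The formulas above can be sketched directly in Python. This is a minimal illustration, not the article's code: the sample requests are invented, documents are assumed to be pre-tokenized lists, and returning 0.0 for a term that appears in no document is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter


def tf(word, doc):
    """tf(w, d) = count(w, d) / size(d), with doc given as a list of tokens."""
    return Counter(doc)[word] / len(doc) if doc else 0.0


def idf(word, corpus):
    """idf(w) = log(n / docs(w, D)).

    Assumption: a word found in no document gets 0.0 instead of raising.
    """
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(query, doc, corpus):
    """Sum tf(w_i, d) * idf(w_i) over every term w_i in the query."""
    return sum(tf(w, doc) * idf(w, corpus) for w in query)


# Hypothetical tokenized requests, invented for illustration only.
corpus = [
    ["get", "user", "profile"],
    ["get", "user", "orders"],
    ["post", "order", "create"],
]
score = tf_idf(["user", "profile"], corpus[0], corpus)
print(score)
```

A term that occurs in every document has idf(w) = log(1) = 0, so it contributes nothing to the score, which is the intended effect of the weighting.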

The article then introduces the K‑Means text clustering algorithm, outlining its popularity and basic principle of iteratively assigning data points to the nearest cluster centroid and updating centroids until convergence. It notes that K‑Means works best with continuous attributes.

Detailed steps of the K‑Means algorithm are provided: (a) initialize k centroids, (b) assign each sample to the nearest centroid, (c) recompute centroids as the arithmetic mean of assigned points, (d) repeat assignment and centroid update until the clustering stabilizes, and (e) output the final clusters.
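Steps (a) through (e) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions rather than the article's implementation: initial centroids are sampled from the data points, distance is squared Euclidean, an empty cluster keeps its previous centroid, and the names `k_means`, `max_iter`, and `seed` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Arithmetic mean of a non-empty list of points, per coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # (a) Initialize k centroids (assumption: sample them from the data).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (b) Assign each sample to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # (d) Stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # (c) Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    # (e) Output the final clusters.
    return assignment, centroids


# Two well-separated groups of 2-D points, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```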

A Python implementation is presented, showing how to convert request texts into feature vectors using TF‑IDF, apply K‑Means clustering, and obtain the resulting clusters. Visual illustrations of the feature extraction and clustering process are included.
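The original code is not reproduced in this summary. A pipeline of this shape might look like the sketch below, which assumes scikit-learn is used (the summary does not name the library). The request strings and the choice of two clusters are invented for illustration. Note that `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its weights differ slightly from the plain formula given earlier.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical request lines, invented for illustration only.
requests = [
    "GET /api/user/profile uid=1001",
    "GET /api/user/profile uid=2002",
    "GET /api/user/orders uid=1001",
    "POST /api/cart/add sku=88 qty=2",
    "POST /api/cart/add sku=91 qty=1",
    "POST /api/cart/remove sku=88",
]

# Turn each request text into a sparse TF-IDF feature vector.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(requests)

# Cluster the vectors; n_clusters=2 is an assumption for this toy data.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)

# Print each request next to the cluster it was assigned to.
for label, request in zip(labels, requests):
    print(label, request)
```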

The article discusses the challenge of selecting the appropriate number of clusters k. It explains that the K‑Means error function decreases as k increases, potentially reaching zero when each record forms its own cluster, which is undesirable. An elbow method is suggested: incrementally test different k values, plot the corresponding error, and choose the point where the error reduction sharply slows.
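The elbow search can be sketched as a loop over candidate k values. The example below assumes scikit-learn and takes its `inertia_` attribute (the sum of squared distances from each sample to its nearest centroid) as the error function. The request data is invented, and the elbow is left for the reader to pick from the printed values or a plot.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical request lines, invented for illustration only.
requests = [
    "GET /api/user/profile uid=1001",
    "GET /api/user/profile uid=2002",
    "GET /api/user/orders uid=1001",
    "POST /api/cart/add sku=88 qty=2",
    "POST /api/cart/add sku=91 qty=1",
    "POST /api/cart/remove sku=88",
]
features = TfidfVectorizer().fit_transform(requests)

# Fit K-Means for k = 1 .. number of records and record the error each time.
inertias = []
for k in range(1, len(requests) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias.append(model.inertia_)

# Show the error and how much it dropped compared with the previous k.
for k, inertia in enumerate(inertias, start=1):
    drop = inertias[k - 2] - inertia if k > 1 else 0.0
    print(f"k={k} error={inertia:.4f} drop={drop:.4f}")
```

When k equals the number of distinct records, every record sits on its own centroid and the error falls to zero, which is why the smallest error alone cannot be used to choose k.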

Finally, the results are summarized: from over 800,000 request records, the algorithm identified eight representative data points that capture the essential regression patterns.
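The summary does not say how a representative is chosen from each cluster. One common choice, assumed here and not confirmed by the source, is the record closest to its cluster centroid. The function name `closest_to_centroids` and the sample data are hypothetical.

```python
def closest_to_centroids(vectors, labels, centroids):
    """Return {cluster_id: index of the member vector nearest its centroid}."""
    best = {}
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Squared Euclidean distance from this vector to its own centroid.
        dist = sum((x - c) ** 2 for x, c in zip(vec, centroids[label]))
        if label not in best or dist < best[label][0]:
            best[label] = (dist, i)
    return {label: idx for label, (dist, idx) in best.items()}


# Toy feature vectors with known labels and centroids, invented for illustration.
vectors = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (10.0, 12.0),
]
labels = [0, 0, 0, 1, 1, 1]
centroids = [(1 / 3, 1 / 3), (10.0, 11.0)]

representatives = closest_to_centroids(vectors, labels, centroids)
print(representatives)
```

Each returned index points back to one original request, so k clusters yield at most k regression cases.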

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Tags: clustering, TF-IDF, text mining, K-Means, regression extraction

Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
