Master 10 Popular Clustering Algorithms in Python with Scikit‑Learn
This tutorial introduces unsupervised clustering, explains its purpose, and walks through installing scikit‑learn and implementing ten popular clustering algorithms (Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN, K‑Means, Mini‑Batch K‑Means, Mean Shift, OPTICS, Spectral Clustering, and Gaussian Mixture), complete with code examples and visualizations.
Clustering (or cluster analysis) is an unsupervised learning task that automatically discovers natural groups in data, often used for pattern discovery, market segmentation, anomaly detection, and feature engineering.
Understand that clustering searches for natural groups in the feature space of input data.
Recognize that there is no single best algorithm for all datasets; many algorithms exist.
Learn how to install scikit‑learn and apply ten top clustering algorithms in Python.
1. Clustering
Cluster analysis is an unsupervised machine‑learning task that finds natural groupings in data without any predefined labels. It differs from supervised learning, which learns to predict known target values.
Clustering techniques are suitable when there are no classes to predict, but instances need to be divided into natural groups. —Source: "Data Mining: Practical Machine Learning Tools and Techniques", 2016.
Clusters are dense regions in the feature space where examples are closer to each other than to points in other clusters. They may have centroids and boundaries, and can reflect underlying mechanisms in the data domain.
2. Clustering Algorithms
Many clustering algorithms use similarity or distance measures to discover dense regions. Before applying them, it is good practice to scale the data.
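Scaling matters because a distance measure weighs every feature by its numeric range, so a feature measured in thousands can drown out one measured in fractions. The sketch below standardizes a toy array with scikit-learn's StandardScaler; the array is hypothetical and chosen only to make the difference in scale obvious.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 is below 1.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3],
              [4000.0, 0.4]])

# Rescale each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for both columns
print(X_scaled.std(axis=0))   # approximately 1 for both columns
```

After scaling, both columns contribute equally to Euclidean distances. Whether standardization is the right choice depends on the data; it is one common option, not a requirement of any algorithm covered here.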
The following ten algorithms are covered:
Affinity Propagation
Agglomerative Clustering
BIRCH
DBSCAN
K‑Means
Mini‑Batch K‑Means
Mean Shift
OPTICS
Spectral Clustering
Gaussian Mixture
Each example uses the make_classification function to generate a synthetic 2‑D dataset with 1,000 samples, then visualizes the resulting clusters.
3. Clustering Algorithm Examples
1. Library Installation
sudo pip install scikit-learn
Verify the installation and check the version:
# check scikit-learn version
import sklearn
print(sklearn.__version__)
2. Dataset Generation
The dataset is created with two informative features and one cluster per class:
# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
for class_value in range(2):
    row_ix = where(y == class_value)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
3. Affinity Propagation
Affinity Propagation finds a set of exemplars that best represent the data.
We designed a method called "Affinity Propagation" that takes pairwise similarities as input and exchanges real‑valued messages between data points until a set of high‑quality exemplars and corresponding clusters emerges. —Source: "Clustering by Passing Messages Between Data Points", 2007.
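To make "exemplars" concrete before the full example, here is a small sketch on hypothetical blob data (not the tutorial dataset). It assumes the default Euclidean affinity and checks that each cluster center reported by scikit-learn is an actual row of the input, identified by `cluster_centers_indices_`. Note that the number of clusters is never passed in.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Hypothetical toy data: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=0.6, random_state=0)

# No n_clusters argument: the algorithm decides how many exemplars to keep.
model = AffinityPropagation(damping=0.9, random_state=0).fit(X)

exemplar_rows = model.cluster_centers_indices_
print("number of exemplars found:", len(exemplar_rows))
# Each reported center is one of the original samples.
print("exemplars are rows of X:",
      np.array_equal(X[exemplar_rows], model.cluster_centers_))
```

The exemplar count depends on `damping` and on the `preference` parameter, so it can differ from the number of blobs used to generate the data.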
# affinity propagation clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = AffinityPropagation(damping=0.9)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
4. Agglomerative Clustering
Agglomerative (hierarchical) clustering merges samples until a desired number of clusters is reached.
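The merge process can be inspected on a tiny, hypothetical one-dimensional dataset. This sketch assumes the default Ward linkage and prints the `children_` attribute, which records each merge in the hierarchy, one row per merge.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two obvious groups on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

print(labels)           # first three points share one label, last three the other
print(model.children_)  # one row per merge: the two nodes that were joined
```

With six samples the full tree contains five merges, so `children_` has five rows. Indices below 6 refer to original samples and larger indices refer to clusters formed by earlier merges.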
# agglomerative clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = AgglomerativeClustering(n_clusters=2)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
5. BIRCH
BIRCH builds a tree structure to incrementally cluster large datasets.
BIRCH incrementally and dynamically clusters incoming multi‑dimensional metric data points to produce high‑quality clusters within memory and time constraints. —Source: "BIRCH: An Efficient Data Clustering Method for Very Large Databases", 1996.
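The incremental aspect can be sketched with `Birch.partial_fit`, which updates the tree one chunk at a time. The blob data, chunk size, and threshold below are hypothetical choices for illustration, not values from the tutorial.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Hypothetical toy data: two well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.5, random_state=0)

model = Birch(threshold=0.5, n_clusters=2)

# Feed the data in chunks, as if it arrived as a stream.
chunk_size = 100
for start in range(0, len(X), chunk_size):
    model.partial_fit(X[start:start + chunk_size])

labels = model.predict(X)
print(np.unique(labels))
```

Each call only sees its own chunk, so the whole dataset never has to be passed to a single `fit` call. That is the property that makes BIRCH attractive for large datasets.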
# BIRCH clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = Birch(threshold=0.01, n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
6. DBSCAN
DBSCAN discovers dense regions based on a distance threshold and a minimum number of points.
We propose a new clustering algorithm, DBSCAN, which relies on a density‑based notion of clusters to discover arbitrarily shaped clusters. —Source: "A Density‑Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", 1996.
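A hand-checkable sketch shows how the two parameters interact. The nine points are hypothetical: two tight groups of four and one isolated point. With `eps=0.3` and `min_samples=3`, every grouped point has at least three samples (itself included) within distance 0.3, while the isolated point does not and is labeled `-1` (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight squares and one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [20.0, 20.0]])

model = DBSCAN(eps=0.3, min_samples=3).fit(X)
labels = model.labels_

print(labels)                      # [ 0  0  0  0  1  1  1  1 -1]
print(model.core_sample_indices_)  # indices of the eight core points
```

Because noise is reported with the label `-1`, a plotting loop over `unique(yhat)` like the one in the full example draws noise points as one more color alongside the real clusters.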
# DBSCAN clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = DBSCAN(eps=0.30, min_samples=9)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
7. K‑Means
K‑Means partitions data into k clusters by minimizing intra‑cluster variance.
The main purpose of this paper is to describe a process that partitions an N‑dimensional population into k sets based on samples. —Source: "Some Methods for Classification and Analysis of Multivariate Observations", 1967.
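What "minimizing intra‑cluster variance" means in practice can be sketched in plain NumPy. This is a simplified version of Lloyd's iterations with random initialization, not scikit-learn's implementation (which adds k-means++ seeding and convergence checks). The function names `assign` and `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Return, for each sample, the index of its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=10, seed=0):
    """Alternate between assigning points and recomputing centers."""
    rng = np.random.default_rng(seed)
    # Simple initialization: k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return assign(X, centers), centers

# Hypothetical toy data: two small, well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels, centers = lloyd_kmeans(X, k=2)
# Sum of squared distances to the assigned center (what KMeans calls inertia).
inertia = ((X - centers[labels]) ** 2).sum()
print(labels, inertia)
```

Each assign-then-update step cannot increase the sum of squared distances, which is why the procedure settles on a (possibly local) minimum.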
# k-means clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = KMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
8. Mini‑Batch K‑Means
Mini‑Batch K‑Means updates centroids using small random batches, speeding up training on large datasets.
We suggest a mini‑batch optimization of the k‑means clustering algorithm. Compared with the classic batch algorithm, this reduces computational cost by orders of magnitude while providing better solutions than online stochastic gradient descent. —Source: "Web‑Scale K‑Means Clustering", 2010.
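The batch-wise update can be made explicit with `MiniBatchKMeans.partial_fit`, which adjusts the centroids using only the batch it is given. The blob data and the batch size of 100 are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: two blobs, 1,000 samples in total.
X, _ = make_blobs(n_samples=1000, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

model = MiniBatchKMeans(n_clusters=2, random_state=0)

# Update the centroids one batch at a time instead of using all rows at once.
batch_size = 100
for start in range(0, len(X), batch_size):
    model.partial_fit(X[start:start + batch_size])

labels = model.predict(X)
print(model.cluster_centers_)
```

Calling `fit` as in the full example performs the batching internally; `partial_fit` is mainly useful when the data arrives in pieces or does not fit in memory.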
# mini‑batch k‑means clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = MiniBatchKMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
9. Mean Shift
Mean Shift iteratively shifts points towards the mode of the density estimate.
We prove that the recursive mean‑shift procedure converges to a stationary point of the underlying density function, demonstrating its applicability to density‑mode detection. —Source: "Mean Shift: A Robust Approach Toward Feature Space Analysis", 2002.
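Mean Shift takes no cluster count; the kernel bandwidth controls how many modes it finds. The sketch below sets the bandwidth explicitly with scikit-learn's `estimate_bandwidth` helper on hypothetical blob data. The `quantile=0.2` value is an arbitrary illustrative choice.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Hypothetical toy data: two well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[-7, -7], [7, 7]],
                  cluster_std=0.5, random_state=0)

# Derive a bandwidth from nearest-neighbor distances in the data.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

model = MeanShift(bandwidth=bandwidth).fit(X)
print("estimated bandwidth:", bandwidth)
print("modes found:", len(model.cluster_centers_))
```

A larger quantile gives a larger bandwidth and tends to merge nearby modes; a smaller one tends to split clusters. When `bandwidth` is omitted, as in the full example, `MeanShift` estimates it internally.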
# mean shift clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = MeanShift()
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
10. OPTICS
OPTICS extends DBSCAN by producing an ordering of points that captures the clustering structure at multiple density levels.
We introduce a new algorithm for clustering analysis that does not explicitly generate a clustering of a dataset; instead, it creates an augmented ordering of the database that represents its density‑based clustering structure. —Source: "OPTICS: Ordering Points To Identify the Clustering Structure", 1999.
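The idea of one ordering serving several density levels can be sketched with `cluster_optics_dbscan`, which extracts DBSCAN-style labels from a fitted OPTICS model at a chosen `eps` without refitting. The eight points and the two `eps` values are hypothetical; the helper name `extract` is introduced only for this example.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Hypothetical toy data: two tight squares about 1.3 units apart.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1]])

# Fit once: this computes the ordering and reachability distances.
model = OPTICS(min_samples=3).fit(X)

def extract(eps):
    """Cut the stored reachability plot at a given eps (no refit needed)."""
    return cluster_optics_dbscan(reachability=model.reachability_,
                                 core_distances=model.core_distances_,
                                 ordering=model.ordering_,
                                 eps=eps)

tight = extract(0.3)  # squares stay separate
loose = extract(2.0)  # squares merge into one cluster

print(tight)  # [0 0 0 0 1 1 1 1]
print(loose)  # [0 0 0 0 0 0 0 0]
```

The jump between the two squares shows up as a single large reachability value (about 1.27 here), so any `eps` below it separates them and any `eps` above it merges them.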
# optics clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# note: eps is only used when cluster_method='dbscan'; the default 'xi' extraction ignores it
model = OPTICS(eps=0.8, min_samples=10)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
11. Spectral Clustering
Spectral clustering uses eigenvectors of a similarity matrix derived from the data to perform dimensionality reduction before clustering.
A promising alternative that has emerged in many fields is to use spectral methods for clustering, which rely on the top eigenvectors of a matrix derived from pairwise distances. —Source: "On Spectral Clustering: Analysis and an Algorithm", 2002.
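Working from a similarity graph rather than raw coordinates is what lets spectral methods follow non-convex shapes. The sketch below compares K-Means with Spectral Clustering on a hypothetical two-moons dataset, assuming a nearest-neighbor affinity with `n_neighbors=10`. It scores both against the generating labels with the adjusted Rand index (ARI); no particular scores are claimed here.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: two interleaved half-circles.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Build the similarity matrix from a k-nearest-neighbor graph.
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

km_ari = adjusted_rand_score(y, km_labels)
sc_ari = adjusted_rand_score(y, sc_labels)
print("k-means ARI: ", km_ari)
print("spectral ARI:", sc_ari)
```

On shapes like these, a graph-based affinity often tracks the curved structure better than centroid distances do, but the result depends on `n_neighbors` and the noise level.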
# spectral clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = SpectralClustering(n_clusters=2)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
12. Gaussian Mixture
Gaussian Mixture Models (GMM) fit a mixture of multivariate Gaussian distributions to the data.
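Because the model is probabilistic, each sample receives a membership probability for every component, not just a single hard label. This sketch, on hypothetical overlapping blobs, shows the soft assignments from `predict_proba` next to the hard labels from `predict`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two overlapping blobs.
X, _ = make_blobs(n_samples=300, centers=[[-3, 0], [3, 0]],
                  cluster_std=1.5, random_state=0)

model = GaussianMixture(n_components=2, random_state=0).fit(X)

proba = model.predict_proba(X)  # shape (300, 2); each row sums to 1
hard = model.predict(X)         # most probable component per sample

print(proba[:3].round(3))
print(hard[:3])
```

Samples near the overlap get probabilities close to 0.5 for both components, which is information a hard-assignment method such as K-Means does not expose.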
# Gaussian mixture model clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = GaussianMixture(n_components=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
4. Summary
Clustering discovers natural groups in the feature space of input data.
There is no single best algorithm for all datasets; many algorithms exist.
Scikit‑learn provides implementations for a wide range of clustering algorithms that can be installed and used in Python.