Master 10 Popular Clustering Algorithms in Python with Scikit‑Learn

This tutorial introduces clustering as an unsupervised learning task, explains what it is used for, and walks through installing scikit‑learn and running ten popular clustering algorithms: Affinity Propagation, Agglomerative Clustering, BIRCH, DBSCAN, K‑Means, Mini‑Batch K‑Means, Mean Shift, OPTICS, Spectral Clustering, and Gaussian Mixture. Each algorithm comes with a code example and a scatter plot.

MaGe Linux Operations

Sep 8, 2022

Clustering (or cluster analysis) is an unsupervised learning task that automatically discovers natural groups in data, often used for pattern discovery, market segmentation, anomaly detection, and feature engineering.

Understand that clustering searches for natural groups in the feature space of input data.

Recognize that there is no single best algorithm for all datasets; many algorithms exist.

Learn how to install scikit‑learn and apply ten top clustering algorithms in Python.

1. Clustering

Cluster analysis is an unsupervised machine‑learning task that finds natural groupings in data without any predefined labels. It differs from supervised learning, which learns to predict known target values.

Clustering techniques are suitable when there are no classes to predict, but instances need to be divided into natural groups. —Source: "Data Mining: Practical Machine Learning Tools and Techniques", 2016.

Clusters are dense regions in the feature space where examples are closer to each other than to points in other clusters. They may have centroids and boundaries, and can reflect underlying mechanisms in the data domain.

2. Clustering Algorithms

Many clustering algorithms use similarity or distance measures to discover dense regions. Before applying them, it is good practice to scale the data.
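
As a minimal sketch of the scaling step, the snippet below standardizes a small made-up array whose two columns sit on very different scales. The array values are hypothetical, and StandardScaler is only one common choice; another scaler (for example MinMaxScaler) may suit some datasets better.

```python
# Sketch: standardize features so no single feature dominates distance calculations.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column is roughly 1,000 times larger than the first
X_raw = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0],
                  [4.0, 4000.0]])

# Rescale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)

print(X_scaled.mean(axis=0))  # column means after scaling
print(X_scaled.std(axis=0))   # column standard deviations after scaling
```

Without this step, a distance-based algorithm would be driven almost entirely by the second column.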

The following ten algorithms are covered:

Affinity Propagation

Agglomerative Clustering

BIRCH

DBSCAN

K‑Means

Mini‑Batch K‑Means

Mean Shift

OPTICS

Spectral Clustering

Gaussian Mixture

Each example uses the make_classification function to generate a synthetic 2‑D dataset with 1,000 samples, then visualizes the resulting clusters.

3. Clustering Algorithm Examples

1. Library Installation

pip install scikit-learn

Verify the installation and check the version:

# check scikit-learn version
import sklearn
print(sklearn.__version__)

2. Dataset Generation

The dataset is created with two informative features and one cluster per class:

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
for class_value in range(2):
    row_ix = where(y == class_value)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()

Scatter plot of the synthetic dataset colored by class

3. Affinity Propagation

Affinity Propagation finds a set of exemplars that best represent the data.

We designed a method called "Affinity Propagation" that takes pairwise similarities as input and exchanges real‑valued messages between data points until a set of high‑quality exemplars and corresponding clusters emerges. —Source: "Clustering by Passing Messages Between Data Points", 2007.

# affinity propagation clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = AffinityPropagation(damping=0.9)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()

Scatter plot of clusters identified by Affinity Propagation
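
The exemplars themselves can be read back from the fitted model. The sketch below reuses the same dataset and damping value; the random_state argument is an added assumption, used only to make runs repeatable.

```python
# Sketch: inspect the exemplars chosen by Affinity Propagation.
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# random_state is an assumption added here for repeatable runs
model = AffinityPropagation(damping=0.9, random_state=4)
model.fit(X)

# Row indices of the samples selected as exemplars, and their coordinates
exemplar_ix = model.cluster_centers_indices_
exemplars = X[exemplar_ix]

print("number of exemplars:", len(exemplar_ix))
print("first exemplars:")
print(exemplars[:3])
```

Unlike a centroid, each exemplar is an actual row of the dataset, which can make the clusters easier to interpret.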

4. Agglomerative Clustering

Agglomerative clustering is a hierarchical method: it starts with every sample as its own cluster and repeatedly merges the closest pair of clusters until the desired number of clusters remains.

# agglomerative clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = AgglomerativeClustering(n_clusters=2)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
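
Which clusters count as "closest" depends on the linkage criterion. The example above relies on scikit-learn's default (ward); the sketch below, offered as an illustration rather than a recommendation, compares cluster sizes across the four linkage options on the same dataset.

```python
# Sketch: compare cluster sizes under different linkage criteria.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    yhat = model.fit_predict(X)
    # Count how many samples landed in each of the two clusters
    sizes[linkage] = np.bincount(yhat, minlength=2)
    print(linkage, sizes[linkage])
```

Very unbalanced sizes under one linkage are a hint that it is chaining points together or splitting off outliers instead of separating the two groups.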

5. BIRCH

BIRCH builds a tree structure to incrementally cluster large datasets.

BIRCH incrementally and dynamically clusters incoming multi‑dimensional metric data points to produce high‑quality clusters within memory and time constraints. —Source: "BIRCH: An Efficient Data Clustering Method for Very Large Databases", 1996.

# BIRCH clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = Birch(threshold=0.01, n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
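
Because BIRCH is incremental, the tree can also be built chunk by chunk. The sketch below assumes the data arrives in ten batches (an arbitrary number chosen for illustration) and uses the estimator's partial_fit method.

```python
# Sketch: build the BIRCH tree incrementally from batches of data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import Birch

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

model = Birch(threshold=0.01, n_clusters=2)

# Feed the data in ten chunks, as if it arrived over time
for batch in np.array_split(X, 10):
    model.partial_fit(batch)

# Assign every sample to one of the final clusters
yhat = model.predict(X)
print(np.bincount(yhat))
```

This pattern is useful when the full dataset does not fit in memory, since only one batch has to be loaded at a time.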

6. DBSCAN

DBSCAN discovers dense regions based on a distance threshold and a minimum number of points.

We propose a new clustering algorithm, DBSCAN, which relies on a density‑based notion of clusters to discover arbitrarily shaped clusters. —Source: "A Density‑Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", 1996.

# DBSCAN clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = DBSCAN(eps=0.30, min_samples=9)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
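
In scikit-learn, DBSCAN labels points that belong to no dense region as -1 (noise). The plotting loop above draws that label as if it were one more cluster, so it can help to count noise points separately. A minimal sketch with the same parameters:

```python
# Sketch: separate DBSCAN noise points (label -1) from real clusters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

yhat = DBSCAN(eps=0.30, min_samples=9).fit_predict(X)

# Label -1 marks noise; every other label is a cluster id
noise_mask = yhat == -1
n_noise = int(noise_mask.sum())
n_clusters = len(np.unique(yhat[~noise_mask]))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```

A large noise count usually suggests that eps is too small or min_samples too large for the data's density.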

7. K‑Means

K‑Means partitions data into k clusters by minimizing intra‑cluster variance.

The main purpose of this paper is to describe a process that partitions an N‑dimensional population into k sets based on samples. —Source: "Some Methods for Classification and Analysis of Multivariate Observations", 1967.

# k-means clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = KMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
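
The quantity K-Means minimizes is the within-cluster sum of squared distances, which scikit-learn exposes as inertia_. The sketch below recomputes it by hand to make the objective concrete; n_init and random_state are added assumptions for repeatable runs.

```python
# Sketch: recompute the K-Means objective (within-cluster sum of squares) by hand.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# n_init and random_state are assumptions added for repeatability
model = KMeans(n_clusters=2, n_init=10, random_state=4)
model.fit(X)

# Look up the assigned centroid for every sample
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared distances from each sample to its centroid
wcss = ((X - assigned_centers) ** 2).sum()

print("manual sum of squares:", wcss)
print("model.inertia_:       ", model.inertia_)
```

The two printed values should agree up to floating-point error.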

8. Mini‑Batch K‑Means

Mini‑Batch K‑Means updates centroids using small random batches, speeding up training on large datasets.

We suggest a mini‑batch optimization of the k‑means clustering algorithm. Compared with the classic batch algorithm, this reduces computational cost by orders of magnitude while providing better solutions than online stochastic gradient descent. —Source: "Web‑Scale K‑Means Clustering", 2010.

# mini‑batch k‑means clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = MiniBatchKMeans(n_clusters=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
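
One way to see the trade-off is to compare the final objective value of both variants on the same data. In this sketch the batch size of 100, n_init, and random_state are illustrative assumptions, not values from the original example.

```python
# Sketch: compare full-batch and mini-batch K-Means on the same data.
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans, MiniBatchKMeans

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# Full-batch K-Means uses every sample in each update
full = KMeans(n_clusters=2, n_init=10, random_state=4).fit(X)

# Mini-batch K-Means updates centroids from 100 random samples at a time
mini = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=3,
                       random_state=4).fit(X)

print("full-batch inertia:", full.inertia_)
print("mini-batch inertia:", mini.inertia_)
```

On a dataset this small the speed difference is negligible; the benefit of mini-batches shows up on much larger datasets.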

9. Mean Shift

Mean Shift iteratively shifts each point toward the nearest mode (peak) of a kernel density estimate; points that converge to the same mode form one cluster.

We prove that the recursive mean‑shift procedure converges to a stationary point of the underlying density function, demonstrating its applicability to density‑mode detection. —Source: "Mean Shift: A Robust Approach Toward Feature Space Analysis", 2002.

# mean shift clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = MeanShift()
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
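
The main setting for Mean Shift is the kernel bandwidth. When none is given, as above, scikit-learn estimates one internally. The sketch below makes that step explicit with estimate_bandwidth; the quantile of 0.2 is an arbitrary illustrative value.

```python
# Sketch: choose the Mean Shift bandwidth explicitly.
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# quantile=0.2 is an illustrative assumption; smaller values give a narrower kernel
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=4)

model = MeanShift(bandwidth=bandwidth)
yhat = model.fit_predict(X)

print("bandwidth:", bandwidth)
print("clusters found:", len(model.cluster_centers_))
```

A narrower kernel tends to produce more, smaller clusters, while a wider one merges nearby modes.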

10. OPTICS

OPTICS extends DBSCAN by producing an ordering of points that captures the clustering structure at multiple density levels.

We introduce a new algorithm for clustering analysis that does not explicitly generate a clustering of a dataset; instead, it creates an augmented ordering of the database that represents its density‑based clustering structure. —Source: "OPTICS: Ordering Points To Identify the Clustering Structure", 1999.

# optics clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
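# note: scikit-learn uses eps only when cluster_method='dbscan'; with the default 'xi' it is ignored
model = OPTICS(eps=0.8, min_samples=10)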
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
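
The ordering that OPTICS produces is available on the fitted model. As a rough sketch, the snippet below reads the reachability distances in visiting order, which is the data behind a reachability plot; it leaves every setting except min_samples at its default.

```python
# Sketch: read the OPTICS ordering and reachability distances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

model = OPTICS(min_samples=10)
model.fit(X)

# Reachability distances arranged in the order OPTICS visited the points
reachability = model.reachability_[model.ordering_]

# Some entries can be infinite, so summarize only the finite ones
finite = reachability[np.isfinite(reachability)]

print("points ordered:", len(model.ordering_))
print("median reachability:", float(np.median(finite)))
```

Plotted in this order, valleys of low reachability correspond to dense clusters and peaks mark the gaps between them.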

11. Spectral Clustering

Spectral clustering uses eigenvectors of a similarity matrix derived from the data to perform dimensionality reduction before clustering.

A promising alternative that has emerged in many fields is to use spectral methods for clustering, which rely on the top eigenvectors of a matrix derived from pairwise distances. —Source: "On Spectral Clustering: Analysis and an Algorithm", 2002.

# spectral clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = SpectralClustering(n_clusters=2)
yhat = model.fit_predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()

12. Gaussian Mixture

Gaussian Mixture Models (GMM) fit a mixture of multivariate Gaussian distributions to the data.

# Gaussian mixture model clustering
from numpy import unique, where
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=4)
model = GaussianMixture(n_components=2)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
pyplot.show()
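
Unlike the other algorithms here, a Gaussian mixture also provides soft assignments: a probability for each sample under each component. The sketch below shows them via predict_proba; the random_state and the 0.9 cutoff are illustrative assumptions.

```python
# Sketch: soft (probabilistic) cluster assignments from a Gaussian mixture.
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# random_state is an assumption added for repeatable runs
model = GaussianMixture(n_components=2, random_state=4)
model.fit(X)

# One row per sample, one column per component; each row sums to 1
proba = model.predict_proba(X)

# Samples whose top probability is below 0.9 (arbitrary cutoff) sit near the boundary
uncertain = int((proba.max(axis=1) < 0.9).sum())

print(proba[:3])
print("uncertain samples:", uncertain)
```

The hard labels returned by predict correspond to the most probable component in each row.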

4. Summary

Clustering discovers natural groups in the feature space of input data.

There is no single best algorithm for all datasets; many algorithms exist.

Scikit‑learn provides implementations for a wide range of clustering algorithms that can be installed and used in Python.
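
Since no single algorithm is best, one practical approach is to try several and compare them with an internal quality score. The sketch below uses the silhouette score, a metric not covered in this tutorial and chosen here only as an example; the set of candidate models is likewise an assumption.

```python
# Sketch: compare several clustering algorithms with the silhouette score.
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)

# Candidate models, each configured to return two clusters
candidates = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=4),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "birch": Birch(threshold=0.01, n_clusters=2),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[name] = silhouette_score(X, labels)
    print(f"{name}: {scores[name]:.3f}")
```

A score like this favors compact, well-separated clusters, so it should be read alongside a plot and domain knowledge, not treated as the final answer.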
