Mastering Feature Binning with sklearn: Uniform, Quantile, and K‑Means Methods
This article explains why discretizing continuous variables improves model stability, introduces three common binning techniques—equal-width, equal-frequency, and clustering—and demonstrates how to implement each using scikit‑learn's KBinsDiscretizer with Python code examples on a synthetic score dataset.
When building classification models, feature engineering often requires discretizing continuous variables into categorical bins, which can make models more stable and reduce over‑fitting. Three binning methods are commonly used:
Equal‑width binning
Equal‑frequency (quantile) binning
Clustering (k‑means) binning
The following example uses a synthetic score dataset to illustrate the concepts.
import numpy as np
import pandas as pd
np.random.seed(1)
n = 20
ID = np.arange(1, n+1)
SCORE = np.random.normal(80, 10, n).astype('int')
df = pd.DataFrame({'ID': ID, 'SCORE': SCORE})

scikit‑learn's KBinsDiscretizer class performs the binning operations.
Parameter n_bins: number of bins (default 5).
Parameter strategy: binning strategy, options are:
uniform – equal‑width bins
quantile – equal‑frequency bins
kmeans – clustering‑based bins
Parameter encode: how the binned feature is encoded. Options are onehot (sparse one‑hot, the default), onehot-dense (dense one‑hot), and ordinal (integer bin index).
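As a rough illustration of the encode parameter, the sketch below bins a small toy array with ordinal and dense one-hot encoding. The array `x` and the variable names are made up for this example and are not part of the article's dataset.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data, not the article's score dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).reshape(-1, 1)

# encode="ordinal": a single column holding the integer bin index
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
x_ordinal = ordinal.fit_transform(x)

# encode="onehot-dense": one indicator column per bin, as a dense array
onehot = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
x_onehot = onehot.fit_transform(x)

print(x_ordinal.ravel())  # bin index of each sample
print(x_onehot.shape)     # (number of samples, number of bins)
```

With encode="onehot" (the default), the result would be a SciPy sparse matrix rather than a dense array.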
Since KBinsDiscretizer expects a column vector, reshape the data:
score = df['SCORE'].values.reshape(-1, 1)

Equal‑width binning
Divide the data into three bins.
from sklearn.preprocessing import KBinsDiscretizer
dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="uniform")
label_uniform = dis.fit_transform(score)  # fit the bin edges, then assign each score to a bin

With a minimum value of 56 and a maximum of 97, the bin edges are [56, 69.6667, 83.3333, 97].
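The equal-width edges can be checked by hand, since they are evenly spaced between the minimum and the maximum. The sketch below rebuilds the same synthetic scores so that it runs on its own, then compares a manual np.linspace computation with the fitted bin_edges_ attribute. The names `manual_edges` and `fitted_edges` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Rebuild the article's synthetic scores so the block is self-contained
np.random.seed(1)
score = np.random.normal(80, 10, 20).astype("int").reshape(-1, 1)

dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
dis.fit(score)

# Equal-width edges: n_bins + 1 evenly spaced points from min to max
manual_edges = np.linspace(score.min(), score.max(), 3 + 1)
fitted_edges = dis.bin_edges_[0]  # bin_edges_ holds one edge array per column

print(manual_edges)
print(fitted_edges)
```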
Equal‑frequency binning
Each bin contains roughly the same number of samples. Using three bins:
dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="quantile")
label_quantile = dis.fit_transform(score)

Clustering binning
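First cluster the continuous variable with one‑dimensional k‑means, then use the cluster label as the bin identifier, so that each value falls into the bin of its nearest cluster center.
dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="kmeans")
label_kmeans = dis.fit_transform(score)  # the new discretizer must be fitted before it can transform

Comparison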
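One way to see how the three strategies differ is to count the samples that land in each bin. The sketch below is self-contained: it rebuilds the same synthetic scores, fits each strategy in a loop, and tallies the bin sizes. The `counts` dictionary is a name introduced here for illustration, and the exact quantile edges may vary slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Rebuild the article's synthetic scores so the block is self-contained
np.random.seed(1)
score = np.random.normal(80, 10, 20).astype("int").reshape(-1, 1)

counts = {}
for strategy in ("uniform", "quantile", "kmeans"):
    dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = dis.fit_transform(score).ravel().astype(int)
    # np.bincount tallies how many samples fall into each bin index
    counts[strategy] = np.bincount(labels, minlength=3)
    print(strategy, dis.bin_edges_[0].round(2), counts[strategy])
```

The labels from the three fitted discretizers can also be placed side by side in the DataFrame: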
df["label_uniform"] = label_uniform
df["label_quantile"] = label_quantile
df["label_kmeans"] = label_kmeans

References:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
https://mp.weixin.qq.com/s/H19asF7Qo_0Wc5FIn8Qkww
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".