Mastering Feature Binning with sklearn: Uniform, Quantile, and K‑Means Methods

This article explains why discretizing continuous variables improves model stability, introduces three common binning techniques—equal-width, equal-frequency, and clustering—and demonstrates how to implement each using scikit‑learn's KBinsDiscretizer with Python code examples on a synthetic score dataset.

Model Perspective

Aug 14, 2022

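When building classification models, feature engineering often involves discretizing continuous variables into categorical bins. Binning dampens the effect of outliers and small fluctuations in the raw values, which can make a model more stable and less prone to over‑fitting. Three common binning methods are: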

Equal‑width binning

Equal‑frequency (quantile) binning

Clustering (k‑means) binning

The following example uses a synthetic score dataset to illustrate the concepts.

import numpy as np
import pandas as pd
np.random.seed(1)

n = 20
ID = np.arange(1, n+1)
SCORE = np.random.normal(80, 10, n).astype('int')
df = pd.DataFrame({'ID': ID, 'SCORE': SCORE})

scikit‑learn's KBinsDiscretizer class performs the binning operations.

Parameter n_bins: number of bins (default 5).

Parameter strategy: binning strategy, options are:

uniform – equal‑width bins

quantile – equal‑frequency bins

kmeans – clustering‑based bins

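Parameter encode: how the binned result is encoded. The options are onehot (the default, returns a sparse one‑hot matrix), onehot-dense (a dense one‑hot array), and ordinal (the integer index of the bin).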
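As a minimal sketch of how the encode options differ, the snippet below bins a tiny hand‑made array (the array x is an illustrative assumption, not part of the score dataset) and compares the shape of each result.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column (illustrative only): six evenly spaced values
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# ordinal: one column holding the bin index of each value
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal",
                           strategy="uniform").fit_transform(x)

# onehot-dense: one 0/1 column per bin, returned as a dense array
onehot_dense = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                                strategy="uniform").fit_transform(x)

# onehot: the same 0/1 columns, returned as a SciPy sparse matrix
onehot = KBinsDiscretizer(n_bins=3, encode="onehot",
                          strategy="uniform").fit_transform(x)

print(ordinal.ravel())     # bin index per value
print(onehot_dense.shape)  # (6, 3)
print(onehot.shape)        # (6, 3)
```

Ordinal encoding keeps the bin order in a single column, which is convenient for inspection and for tree‑based models; one‑hot encoding is usually the safer choice for linear models that should not treat bin indices as magnitudes.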

KBinsDiscretizer expects a 2‑D array of shape (n_samples, n_features), so reshape the single SCORE column into a column vector:

score = df['SCORE'].values.reshape(-1, 1)
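A quick way to check the shapes involved is sketched below. It rebuilds the SCORE column with the same seed and also shows an alternative, selecting the column with double brackets, which yields the same 2‑D array; the variable names col_1d, col_2d, and col_2d_alt are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"SCORE": np.random.normal(80, 10, 20).astype("int")})

col_1d = df["SCORE"].values             # 1-D array, shape (20,)
col_2d = col_1d.reshape(-1, 1)          # column vector, shape (20, 1)
col_2d_alt = df[["SCORE"]].to_numpy()   # double brackets keep the 2-D shape

print(col_1d.shape, col_2d.shape, col_2d_alt.shape)
```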

Equal‑width binning

Divide the data into three bins.

from sklearn.preprocessing import KBinsDiscretizer

dis = KBinsDiscretizer(n_bins=3,
                      encode="ordinal",
                      strategy="uniform")
label_uniform = dis.fit_transform(score)  # learn the bin edges, then return each score's bin index

With a minimum value of 56 and a maximum of 97, the bin edges are [56, 69.6667, 83.3333, 97].
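The edges can be checked by hand: equal‑width binning splits the range from the minimum to the maximum into n_bins intervals of the same length. The sketch below rebuilds the score data with the same seed and compares np.linspace against the fitted bin_edges_ attribute; the names manual_edges and fitted_edges are illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(1)
score = np.random.normal(80, 10, 20).astype("int").reshape(-1, 1)

dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
label_uniform = dis.fit_transform(score)

# n_bins intervals need n_bins + 1 evenly spaced edges between min and max
manual_edges = np.linspace(score.min(), score.max(), 3 + 1)

# bin_edges_ holds one array of edges per input feature
fitted_edges = dis.bin_edges_[0]

print(manual_edges)
print(fitted_edges)
```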

Equal‑frequency binning

Each bin contains roughly the same number of samples. Using three bins:

dis = KBinsDiscretizer(n_bins=3,
                      encode="ordinal",
                      strategy="quantile")
label_quantile = dis.fit_transform(score)
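To see what "roughly the same number" means on this dataset, the sketch below counts the samples that land in each quantile bin. Exact edges can differ slightly between scikit‑learn versions because the quantile interpolation method has changed over time, so no specific output is shown; recent releases may also emit a FutureWarning about that default.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(1)
score = np.random.normal(80, 10, 20).astype("int").reshape(-1, 1)

dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
label_quantile = dis.fit_transform(score).ravel().astype(int)

# Number of samples assigned to bins 0, 1, and 2
counts = np.bincount(label_quantile, minlength=3)

print(dis.bin_edges_[0])  # edges sit at the 1/3 and 2/3 quantiles of SCORE
print(counts)             # the three counts should be close to 20 / 3
```

With 20 samples and three bins the counts cannot be exactly equal, and tied values (the dataset contains repeated scores) can shift a sample from one bin to the next.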

Clustering binning

Run one‑dimensional k‑means on the continuous variable, then assign each value to the bin of its nearest cluster center. Bins are numbered in increasing order of the centers.

dis = KBinsDiscretizer(n_bins=3,
                      encode="ordinal",
                      strategy="kmeans")
label_kmeans = dis.fit_transform(score)  # fit first: calling transform on an unfitted discretizer raises an error
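The fitted edges show how the clusters translate into intervals. As far as the scikit‑learn implementation goes, the inner edges sit midway between neighbouring cluster centers, while the outer edges are the column minimum and maximum. The sketch below fits the discretizer on the same score data and inspects bin_edges_; treat it as an illustration rather than a description of the internals.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(1)
score = np.random.normal(80, 10, 20).astype("int").reshape(-1, 1)

dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
label_kmeans = dis.fit_transform(score).ravel().astype(int)

edges = dis.bin_edges_[0]
print(edges)                     # [min, inner edge, inner edge, max]
print(np.bincount(label_kmeans, minlength=3))  # samples per cluster bin
```

Unlike equal‑width binning, the inner edges adapt to where the values concentrate, and unlike equal‑frequency binning, the bins are not forced to hold similar numbers of samples.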

Comparison

Add the three label columns to the DataFrame to compare the strategies side by side:

df["label_uniform"] = label_uniform
df["label_quantile"] = label_quantile
df["label_kmeans"] = label_kmeans
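One simple way to summarize the comparison is to count how many samples each strategy puts in each bin. The self‑contained sketch below rebuilds the dataset, applies all three strategies in a loop, and tabulates the counts; the counts table is an illustrative addition, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(1)
n = 20
df = pd.DataFrame({"ID": np.arange(1, n + 1),
                   "SCORE": np.random.normal(80, 10, n).astype("int")})
score = df[["SCORE"]].to_numpy()

# Fit one discretizer per strategy and store its ordinal labels
for strategy in ("uniform", "quantile", "kmeans"):
    dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    df[f"label_{strategy}"] = dis.fit_transform(score).ravel().astype(int)

# Rows are bin indices, columns are strategies, cells are sample counts
label_cols = ["label_uniform", "label_quantile", "label_kmeans"]
counts = df[label_cols].apply(lambda col: col.value_counts().sort_index())
counts = counts.fillna(0).astype(int)

print(df.sort_values("SCORE").to_string(index=False))
print(counts)
```

Sorting by SCORE before printing makes it easy to see where each strategy places its cut points.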

References:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

https://mp.weixin.qq.com/s/H19asF7Qo_0Wc5FIn8Qkww


Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
