SMOTE Techniques for Handling Imbalanced Classification in Machine Learning
This article explains the SMOTE oversampling method for imbalanced classification, demonstrates how to generate synthetic minority samples, evaluates models with and without SMOTE using scikit‑learn pipelines, and explores advanced variants such as Borderline‑SMOTE, SVMSMOTE and ADASYN with concrete code examples and benchmark results.
Imbalanced classification problems arise when one class dominates a dataset, causing most machine‑learning algorithms to perform poorly on the minority class. The article introduces SMOTE (Synthetic Minority Over‑sampling Technique) as a data‑augmentation strategy that creates new minority samples by interpolating between a randomly selected minority instance and one of its k‑nearest minority neighbors (typically k=5).
SMOTE first selects a minority class instance at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors at random and connecting the two with a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances.
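As a minimal, standalone sketch of this interpolation step (NumPy only; the helper name smote_point and the example vectors are illustrative, not part of imbalanced-learn):
import numpy as np

rng = np.random.default_rng(1)

def smote_point(x_i, x_nn):
    # Convex combination of a minority sample and one of its minority neighbors:
    # x_i + lam * (x_nn - x_i), with lam drawn uniformly from [0, 1].
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_nn - x_i)

x_i = np.array([1.0, 2.0])     # a minority class instance
x_nn = np.array([1.5, 2.5])    # one of its k nearest minority neighbors
print(smote_point(x_i, x_nn))  # a synthetic point on the segment between them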
Using the imbalanced-learn Python library, the article walks through a complete example: a synthetic binary dataset with 10,000 samples and a 1:100 class ratio is created with make_classification, the original class distribution is verified with Counter, and a scatter plot visualises the severe imbalance.
from collections import Counter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99],
                           flip_y=0, random_state=1)
counter = Counter(y)
print(counter)  # Counter({0: 9900, 1: 100})
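The scatter plot can be reproduced with a short matplotlib snippet; this is a sketch assuming the X, y, and counter objects from the code above, not the article's exact plotting code:
from numpy import where
from matplotlib import pyplot

# Plot the samples of each class separately to make the imbalance visible.
for label in counter.keys():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()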
Applying SMOTE() balances the classes, and the new distribution is confirmed:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
counter = Counter(y)
print(counter)  # Counter({0: 9900, 1: 9900})
The article then fits a DecisionTreeClassifier on the original imbalanced data using repeated stratified 10‑fold cross‑validation (3 repeats) and reports a mean ROC‑AUC of 0.761. Re‑training the same pipeline with SMOTE applied only to the training folds raises the mean ROC‑AUC to 0.809, illustrating the benefit of oversampling.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
model = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % scores.mean())
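The code for the SMOTE-augmented run is not reproduced above; a minimal sketch using an imbalanced-learn Pipeline (so oversampling is applied only within each training fold), reusing the X, y, cv and model imports already defined, might look like this:
from imblearn.pipeline import Pipeline

# Oversample with SMOTE inside each training fold, then fit the tree.
pipeline = Pipeline([('over', SMOTE()), ('model', DecisionTreeClassifier())])
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % scores.mean())  # the article reports about 0.809 for this setup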
Combining SMOTE with random undersampling (via RandomUnderSampler) often yields better performance than undersampling alone, as reported in the original SMOTE paper. The article shows how to configure a pipeline that first oversamples the minority class to a 1:10 minority-to-majority ratio and then undersamples the majority class to reach roughly a 1:2 final balance.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % scores.mean())
Parameter tuning is explored by varying the number of nearest neighbors (k) used by SMOTE. A grid‑search loop evaluates k from 1 to 7, revealing that higher k values (e.g., k=7) can improve ROC‑AUC to 0.853, while lower values give slightly lower scores.
k_values = [1,2,3,4,5,6,7]
for k in k_values:
    over = SMOTE(sampling_strategy=0.1, k_neighbors=k)
    pipeline = Pipeline([('over', over), ('model', DecisionTreeClassifier())])
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, scores.mean()))
Advanced SMOTE variants are then presented:
Borderline‑SMOTE: focuses oversampling on minority samples near the decision boundary, using the BorderlineSMOTE class.
SVMSMOTE: uses an SVM, rather than k‑NN alone, to locate support vectors that approximate the decision boundary and generates synthetic points near them.
ADASYN: adaptively generates more synthetic samples for minority instances that are harder to learn, based on the local density of the minority class.
Each variant is demonstrated with a concise code snippet similar to the basic SMOTE example, and the resulting class distributions and scatter plots are described. The article notes potential drawbacks, such as noise introduction when synthetic points overlap heavily with the majority class (Borderline‑SMOTE) and the risk of over‑focusing on outliers (ADASYN).
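For reference, a minimal sketch of how these variant classes are invoked in imbalanced-learn; each exposes the same fit_resample interface as SMOTE and is a drop-in replacement in the examples above (assumes the original imbalanced X, y and the Counter import from the first example):
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, ADASYN

# Each variant is applied independently; only one would be used at a time.
for sampler in (BorderlineSMOTE(), SVMSMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))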
In summary, the article provides a step‑by‑step guide to applying SMOTE and its extensions for imbalanced classification, shows how to integrate them into scikit‑learn pipelines, evaluates their impact on model performance, and highlights practical considerations for parameter selection and combination with undersampling.