Mastering Imbalanced Data: Practical Techniques with imbalanced-learn
Learn what imbalanced data is, why it hampers machine learning models, and explore a comprehensive suite of preprocessing strategies—including under‑sampling, over‑sampling (SMOTE, ADASYN), combined sampling, ensemble methods, and class‑weight adjustments—using the imbalanced‑learn library with concrete Python code examples.
1. What is Imbalanced Data
Imbalanced data refers to a dataset where the number of samples across classes is unevenly distributed, which is common in real‑world tasks.
Credit card fraud data: 99% normal, 1% fraud
Loan overdue data (the example dataset used below)
Imbalanced data usually arises from the data generation process; minority class samples occur less frequently and require longer collection periods.
In machine-learning classification tasks, imbalanced data biases models toward the majority class. Besides choosing appropriate evaluation metrics, improving performance on the minority class requires preprocessing the data or adjusting the model.
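To see why this bias matters, here is a minimal, self-contained sketch on synthetic data (not the competition dataset used below): a classifier that always predicts the majority class reports 99% accuracy while catching none of the minority cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 99:1 split, mimicking the credit-card fraud ratio above
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this illustration

clf = DummyClassifier(strategy='most_frequent').fit(X, y)  # always predicts class 0
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.99, looks excellent
print(recall_score(y, pred))    # 0.0, every minority sample is missed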
Main methods to handle imbalanced data:
Under‑sampling
Over‑sampling
Combined sampling
Ensemble
Adjust class or sample weights
2. Imbalanced Data Handling Methods
The imbalanced-learn library provides many techniques; the examples below use this library.
pip install -U imbalanced-learn
https://github.com/scikit-learn-contrib/imbalanced-learn
Data source: "Shandong Province 2nd Data Application Innovation & Entrepreneurship Competition – Rizhao Sub‑competition – Public Fund Loan Overdue Prediction".
First, inspect the data:
import pandas as pd
train_data = './data/train.csv'
test_data = './data/test.csv'
train_df = pd.read_csv(train_data)
test_df = pd.read_csv(test_data)
print(train_df.groupby(['label']).size())
# label 0 37243
# label 1 2757
2.1 Under‑sampling
Under‑sampling reduces the number of majority‑class samples to match the minority class, achieving balance.
Because under‑sampling discards data, it inevitably changes the distribution of the majority class (increasing variance). A good under‑sampling strategy should preserve the original data distribution as much as possible.
Which majority samples can be removed?
Overlapping data (redundant samples)
Noisy data that interferes with the minority distribution
Two ideas for under‑sampling:
Boundary-adjacent matching, e.g., using TomekLinks or NearMiss (the original figure illustrates this with a 6-nearest-neighbour example).
Explanation of TomekLinks: for each minority sample, find its 1-NN; if that nearest neighbour is a majority sample, the pair forms a Tomek link, and the majority sample is removed because it is treated as noise.
from imblearn.under_sampling import TomekLinks
X_train = train_df.drop(['id', 'type', 'label'], axis=1)  # keep feature columns only
y = train_df['label']
tl = TomekLinks()
X_us, y_us = tl.fit_resample(X_train, y)
print(pd.Series(y_us).value_counts())
# 0 36069
# 1 2757
Clustering-based under-sampling replaces each cluster of majority samples with its centroid:
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 2757
# 1 2757
Under-sampling methods provided by imbalanced-learn:
Random majority under‑sampling with replacement
Extraction of majority‑minority Tomek links
Under‑sampling with Cluster Centroids
NearMiss versions 1, 2 and 3 (see the sketch after this list)
Condensed Nearest Neighbour
One‑Sided Selection
Neighbourhood Cleaning Rule
Edited Nearest Neighbours
Instance Hardness Threshold
Repeated Edited Nearest Neighbours
AllKNN
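Of the methods listed above, NearMiss (mentioned earlier alongside TomekLinks) keeps the majority samples that lie closest to the minority class; its three versions differ in how "closest" is defined. A minimal sketch reusing X_train and y from above:
from imblearn.under_sampling import NearMiss

# NearMiss-1 keeps the majority samples whose average distance to their
# nearest minority neighbours is smallest
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X_train, y)
print(pd.Series(y_nm).value_counts())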
2.2 Over‑sampling
Over-sampling adds minority samples until they match the majority class in number, which changes the minority-class distribution (its variance).
A simple way is random copying; a more sophisticated way generates synthetic samples, e.g., SMOTE.
SMOTE creates new samples by interpolating between a minority sample and one of its K‑nearest minority neighbors.
Select a minority sample and find its K nearest minority-class neighbours
Randomly choose one of those neighbours
Create the synthetic sample by adding a random proportion of the difference between the neighbour and the sample to the sample's features
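In formula form, each synthetic point is x_new = x + λ · (x_neighbor − x) with λ drawn uniformly from [0, 1]. A minimal illustrative sketch of that interpolation step (not the library's implementation):
import numpy as np

rng = np.random.default_rng(42)

x = np.array([1.0, 2.0])           # a minority sample
neighbor = np.array([3.0, 1.0])    # one of its K nearest minority neighbours

lam = rng.uniform(0, 1)            # random proportion
x_new = x + lam * (neighbor - x)   # synthetic sample lies on the segment between the two
print(x_new)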
from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 37243
Borderline-SMOTE focuses on minority samples near the decision boundary ("danger" samples) while ignoring safe or noisy samples.
from imblearn.over_sampling import BorderlineSMOTE
bsmote = BorderlineSMOTE(k_neighbors=5, random_state=42)
X_res, y_res = bsmote.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 37243
ADASYN generates synthetic samples similarly to SMOTE but adapts the number of samples per minority instance based on the local distribution of majority neighbours.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 36690
Over-sampling methods provided by imbalanced-learn:
Random minority over‑sampling with replacement
SMOTE – Synthetic Minority Over‑sampling Technique
SMOTENC – SMOTE for Nominal and Continuous features (see the sketch after this list)
bSMOTE (1 & 2) – Borderline SMOTE types 1 and 2
SVM SMOTE – Support Vectors SMOTE
ADASYN – Adaptive Synthetic Sampling Approach for Imbalanced Learning
KMeans‑SMOTE
ROSE – Random OverSampling Examples
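From the list above, SMOTENC extends SMOTE to datasets that mix nominal and continuous features: continuous features are interpolated as usual, while each categorical feature takes the most frequent value among the neighbours. A minimal sketch, assuming purely for illustration that columns 0 and 2 of X_train are categorical:
from imblearn.over_sampling import SMOTENC

# categorical_features lists the column indices of the nominal features;
# [0, 2] is a placeholder -- adapt it to your own data
smote_nc = SMOTENC(categorical_features=[0, 2], k_neighbors=5, random_state=42)
X_res, y_res = smote_nc.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())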
2.3 Combined Sampling
Combined sampling applies over‑sampling first, then under‑sampling, e.g., SMOTE+Tomek‑links or SMOTE+Edited Nearest Neighbours.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_res, y_res = smote_tomek.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 36260
# 1 36260
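The SMOTE + Edited Nearest Neighbours combination mentioned above is available as SMOTEENN; a minimal sketch (ENN typically removes more samples than Tomek links, so the resulting counts will differ):
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=0)
X_res, y_res = smote_enn.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())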
2.4 Ensemble Methods
Ensembles that internally resample each base learner's training set to a balanced dataset are available in imblearn.ensemble, e.g., BalancedRandomForestClassifier:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=4, weights=[0.2, 0.3, 0.5],
                           random_state=0)
clf = BalancedRandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]))
Ensemble methods provided by imbalanced-learn:
Easy Ensemble classifier (see the sketch after this list)
Balanced Random Forest
Balanced Bagging
RUSBoost
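As another example from the list above, EasyEnsembleClassifier trains a boosted learner on each of several randomly under-sampled balanced subsets and aggregates their predictions. A minimal sketch reusing the synthetic X, y from the previous example:
from imblearn.ensemble import EasyEnsembleClassifier

# Each of the 10 members sees a balanced, randomly under-sampled subset
eec = EasyEnsembleClassifier(n_estimators=10, random_state=0)
eec.fit(X, y)
print(eec.predict(X[:5]))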
2.5 Adjust Class or Sample Weights
Adjusting class_weight or sample_weight re-weights classes in the training loss and can mitigate imbalance without resampling, e.g., in LightGBM.
import numpy as np
import lightgbm as lgb

clf = lgb.LGBMClassifier(num_leaves=31,
                         min_child_samples=np.random.randint(20, 25),
                         max_depth=25,
                         learning_rate=0.1,
                         class_weight={0: 1, 1: 10},  # up-weight the minority class
                         n_estimators=500,
                         n_jobs=30)
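Per-sample weights can also be computed and passed at fit time; a minimal sketch using scikit-learn's compute_sample_weight on the loan-data features from above, where the 'balanced' scheme weights each sample inversely to its class frequency:
from sklearn.utils.class_weight import compute_sample_weight
import lightgbm as lgb

# 'balanced' weights each sample inversely proportional to its class frequency
y_loan = train_df['label']
weights = compute_sample_weight(class_weight='balanced', y=y_loan)

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.1)
clf.fit(X_train, y_loan, sample_weight=weights)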
3. Summary
Under-sampling reduces the number of majority-class samples.
Over‑sampling increases minority samples.
Combined sampling performs over‑sampling followed by under‑sampling.
Ensemble methods build multiple balanced subsets (under-sampled majority plus the minority samples), train a model on each, and aggregate their predictions.
Both under‑ and over‑sampling alter the original data distribution and may cause over‑fitting; experiment to find the method that best fits the actual data distribution.
4. References
Learning from Imbalanced Data
Two Modifications of CNN (Tomek links, CNN)
imbalanced-learn API: https://imbalanced-learn.org/stable/