Mastering Imbalanced Data: Practical Techniques with imbalanced-learn
Learn what imbalanced data is, why it hampers machine learning models, and explore a comprehensive suite of preprocessing strategies—including under‑sampling, over‑sampling (SMOTE, ADASYN), combined sampling, ensemble methods, and class‑weight adjustments—using the imbalanced‑learn library with concrete Python code examples.
1. What is Imbalanced Data
Imbalanced data refers to a dataset where the number of samples across classes is unevenly distributed, which is common in real‑world tasks.
Credit card fraud data: 99% normal, 1% fraud
Loan overdue data (the example dataset used below)
Imbalanced data usually arises from the data generation process; minority class samples occur less frequently and require longer collection periods.
In machine-learning classification tasks, imbalanced data biases models toward the majority class. Besides choosing appropriate evaluation metrics, improving performance on the minority class requires preprocessing the data or adjusting the model.
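To see why this bias matters, here is a minimal, self-contained sketch on synthetic data (not the competition dataset used below): a classifier that always predicts the majority class reports 99% accuracy while catching none of the minority cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 99:1 split, mimicking the credit-card fraud ratio above
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this illustration

clf = DummyClassifier(strategy='most_frequent').fit(X, y)  # always predicts class 0
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.99, looks excellent
print(recall_score(y, pred))    # 0.0, every minority sample is missed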
Main methods to handle imbalanced data:
Under‑sampling
Over‑sampling
Combined sampling
Ensemble
Adjust class or sample weights
2. Imbalanced Data Handling Methods
The imbalanced-learn library provides many techniques; the examples below use this library.
pip install -U imbalanced-learn
https://github.com/scikit-learn-contrib/imbalanced-learn
Data source: "Shandong Province 2nd Data Application Innovation & Entrepreneurship Competition – Rizhao Sub‑competition – Public Fund Loan Overdue Prediction".
First, inspect the data:
import pandas as pd
train_data = './data/train.csv'
test_data = './data/test.csv'
train_df = pd.read_csv(train_data)
test_df = pd.read_csv(test_data)
print(train_df.groupby(['label']).size())
# label 0 37243
# label 1 2757
2.1 Under‑sampling
Under‑sampling reduces the number of majority‑class samples to match the minority class, achieving balance.
Because under‑sampling discards data, it inevitably changes the distribution of the majority class (increasing variance). A good under‑sampling strategy should preserve the original data distribution as much as possible.
Which majority samples can be removed?
Overlapping data (redundant samples)
Noisy data that interferes with the minority distribution
Two ideas for under‑sampling:
Boundary-adjacent matching, e.g., using TomekLinks or NearMiss (the original figure illustrates this with a 6-nearest-neighbour example).
Explanation of TomekLinks: for each minority sample, find its 1-NN; if that nearest neighbour is a majority sample, the pair forms a Tomek link, and the majority sample is removed because it is treated as noise.
from imblearn.under_sampling import TomekLinks
X_train = train_df.drop(['id', 'type', 'label'], axis=1)  # keep feature columns only
y = train_df['label']
tl = TomekLinks()
X_us, y_us = tl.fit_resample(X_train, y)
print(pd.Series(y_us).value_counts())
# 0 36069
# 1 2757
Clustering-based under-sampling replaces each cluster of majority samples with its centroid:
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 2757
# 1 2757
Under-sampling methods provided by imbalanced-learn:
Random majority under‑sampling with replacement
Extraction of majority‑minority Tomek links
Under‑sampling with Cluster Centroids
NearMiss versions 1, 2 and 3 (see the sketch after this list)
Condensed Nearest Neighbour
One‑Sided Selection
Neighbourhood Cleaning Rule
Edited Nearest Neighbours
Instance Hardness Threshold
Repeated Edited Nearest Neighbours
AllKNN
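Of the methods listed above, NearMiss (mentioned earlier alongside TomekLinks) keeps the majority samples that lie closest to the minority class; its three versions differ in how "closest" is defined. A minimal sketch reusing X_train and y from above:
from imblearn.under_sampling import NearMiss

# NearMiss-1 keeps the majority samples whose average distance to their
# nearest minority neighbours is smallest
nm = NearMiss(version=1)
X_nm, y_nm = nm.fit_resample(X_train, y)
print(pd.Series(y_nm).value_counts())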
2.2 Over‑sampling
Over-sampling adds minority samples until they match the majority class in number, which changes the minority-class distribution (its variance).
A simple way is random copying; a more sophisticated way generates synthetic samples, e.g., SMOTE.
SMOTE creates new samples by interpolating between a minority sample and one of its K‑nearest minority neighbors.
Select a minority sample and find its K nearest minority-class neighbours
Randomly choose one of those neighbours
Create the synthetic sample by adding a random proportion of the difference between the neighbour and the sample to the sample's features
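In formula form, each synthetic point is x_new = x + λ · (x_neighbor − x) with λ drawn uniformly from [0, 1]. A minimal illustrative sketch of that interpolation step (not the library's implementation):
import numpy as np

rng = np.random.default_rng(42)

x = np.array([1.0, 2.0])           # a minority sample
neighbor = np.array([3.0, 1.0])    # one of its K nearest minority neighbours

lam = rng.uniform(0, 1)            # random proportion
x_new = x + lam * (neighbor - x)   # synthetic sample lies on the segment between the two
print(x_new)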
from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 37243
Borderline-SMOTE focuses on minority samples near the decision boundary ("danger" samples) while ignoring safe or noisy samples.
from imblearn.over_sampling import BorderlineSMOTE
bsmote = BorderlineSMOTE(k_neighbors=5, random_state=42)
X_res, y_res = bsmote.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 37243
ADASYN generates synthetic samples similarly to SMOTE but adapts the number of samples per minority instance based on the local distribution of majority neighbours.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 37243
# 1 36690
Over-sampling methods provided by imbalanced-learn:
Random minority over‑sampling with replacement
SMOTE – Synthetic Minority Over‑sampling Technique
SMOTENC – SMOTE for Nominal and Continuous features (see the sketch after this list)
bSMOTE (1 & 2) – Borderline SMOTE types 1 and 2
SVM SMOTE – Support Vectors SMOTE
ADASYN – Adaptive Synthetic Sampling Approach for Imbalanced Learning
KMeans‑SMOTE
ROSE – Random OverSampling Examples
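From the list above, SMOTENC extends SMOTE to datasets that mix nominal and continuous features: continuous features are interpolated as usual, while each categorical feature takes the most frequent value among the neighbours. A minimal sketch, assuming purely for illustration that columns 0 and 2 of X_train are categorical:
from imblearn.over_sampling import SMOTENC

# categorical_features lists the column indices of the nominal features;
# [0, 2] is a placeholder -- adapt it to your own data
smote_nc = SMOTENC(categorical_features=[0, 2], k_neighbors=5, random_state=42)
X_res, y_res = smote_nc.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())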
2.3 Combined Sampling
Combined sampling applies over‑sampling first, then under‑sampling, e.g., SMOTE+Tomek‑links or SMOTE+Edited Nearest Neighbours.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_res, y_res = smote_tomek.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())
# 0 36260
# 1 36260
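The SMOTE + Edited Nearest Neighbours combination mentioned above is available as SMOTEENN; a minimal sketch (ENN typically removes more samples than Tomek links, so the resulting counts will differ):
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=0)
X_res, y_res = smote_enn.fit_resample(X_train, y)
print(pd.Series(y_res).value_counts())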
2.4 Ensemble Methods
Ensembles that internally resample each base learner's training set to a balanced dataset are available in imblearn.ensemble, e.g., BalancedRandomForestClassifier:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=4, weights=[0.2, 0.3, 0.5],
                           random_state=0)
clf = BalancedRandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]))
Ensemble methods provided by imbalanced-learn:
Easy Ensemble classifier (see the sketch after this list)
Balanced Random Forest
Balanced Bagging
RUSBoost
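As another example from the list above, EasyEnsembleClassifier trains a boosted learner on each of several randomly under-sampled balanced subsets and aggregates their predictions. A minimal sketch reusing the synthetic X, y from the previous example:
from imblearn.ensemble import EasyEnsembleClassifier

# Each of the 10 members sees a balanced, randomly under-sampled subset
eec = EasyEnsembleClassifier(n_estimators=10, random_state=0)
eec.fit(X, y)
print(eec.predict(X[:5]))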
2.5 Adjust Class or Sample Weights
Adjusting class_weight or sample_weight re-weights classes in the training loss and can mitigate imbalance without resampling, e.g., in LightGBM.
import numpy as np
import lightgbm as lgb

clf = lgb.LGBMClassifier(num_leaves=31,
                         min_child_samples=np.random.randint(20, 25),
                         max_depth=25,
                         learning_rate=0.1,
                         class_weight={0: 1, 1: 10},  # up-weight the minority class
                         n_estimators=500,
                         n_jobs=30)
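Per-sample weights can also be computed and passed at fit time; a minimal sketch using scikit-learn's compute_sample_weight on the loan-data features from above, where the 'balanced' scheme weights each sample inversely to its class frequency:
from sklearn.utils.class_weight import compute_sample_weight
import lightgbm as lgb

# 'balanced' weights each sample inversely proportional to its class frequency
y_loan = train_df['label']
weights = compute_sample_weight(class_weight='balanced', y=y_loan)

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.1)
clf.fit(X_train, y_loan, sample_weight=weights)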
3. Summary
Under-sampling reduces the number of majority-class samples.
Over‑sampling increases minority samples.
Combined sampling performs over‑sampling followed by under‑sampling.
Ensemble methods build multiple balanced subsets (under-sampled majority plus the minority samples), train a model on each, and aggregate their predictions.
Both under‑ and over‑sampling alter the original data distribution and may cause over‑fitting; experiment to find the method that best fits the actual data distribution.
4. References
Learning from Imbalanced Data
Two Modifications of CNN (Tomek links, CNN)
imbalanced-learn API: https://imbalanced-learn.org/stable/