
Master Data Sampling Techniques in Python for Machine Learning

This article explains common data sampling methods—random, stratified, oversampling, undersampling, and adaptive sampling—and provides Python code examples using scikit-learn and imbalanced-learn to implement each technique on the Iris dataset and synthetic data.


Data Sampling

In machine learning, sampling refers to selecting a subset of examples from a dataset for training or evaluating a model. The following are common sampling methods:

Random Sampling: select samples uniformly at random for the training or test set; an unrepresentative draw can lead to over- or under-fitting.

Stratified Sampling: divide samples into strata by class or label and sample randomly within each stratum, ensuring every class is represented and reducing the risk of over- or under-fitting.

Oversampling: duplicate or synthesize minority-class samples to balance class counts; this mitigates imbalance but can cause overfitting.

Undersampling: randomly remove majority-class samples until class sizes are roughly balanced; this speeds up training but risks discarding information and under-fitting.

Adaptive Sampling: dynamically adjust the sampling strategy based on classifier performance, focusing on difficult or easily confused samples to improve accuracy.
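scikit-learn has no single "adaptive sampling" API; the idea is closest in spirit to boosting. As a rough illustration (a toy loop sketched here, not a library routine), one can repeatedly resample the training data with weights that grow for samples the current model misclassifies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy adaptive-sampling loop: samples the current model gets wrong
# receive a higher chance of being drawn in the next round.
X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
weights = np.full(len(X), 1.0 / len(X))

for _ in range(3):
    idx = rng.choice(len(X), size=len(X), p=weights)   # weighted resample
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    wrong = clf.predict(X) != y
    weights[wrong] *= 2.0        # focus on hard samples
    weights /= weights.sum()     # renormalize to a probability distribution
```

The doubling factor and three rounds are arbitrary choices for illustration; AdaBoost uses a principled weight update derived from each round's error rate.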

Python Implementation

In Python, the scikit-learn and imbalanced-learn libraries provide these sampling methods; the first examples use the Iris dataset.

Random Sampling

Use sklearn.model_selection.train_test_split to randomly split the dataset.

<code>from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
</code>
<code>from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
</code>
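To see why a purely random split may be unrepresentative, one can inspect the class counts in the resulting test set (a small check added here, not part of the original example). Iris has exactly 50 samples per class, but a random 30% split need not keep that balance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Per-class counts in the 45-sample test set; with a plain random
# split these can drift away from an even 15/15/15 balance.
print(np.bincount(y_test))
```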

Stratified Sampling

When the dataset is small, stratified sampling ensures each class retains its proportion. Use StratifiedShuffleSplit from scikit-learn.

<code>from sklearn.model_selection import StratifiedShuffleSplit
</code>

Full code:

<code>from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
</code>
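The same check confirms that stratification preserves class proportions exactly (a verification sketch, not part of the original example). With 50 samples per class and a 30% test split, each class contributes 35 samples to the training set and 15 to the test set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_index, test_index = next(sss.split(X, y))

# Each class keeps its 1/3 share in both subsets.
print(np.bincount(y[train_index]))  # [35 35 35]
print(np.bincount(y[test_index]))   # [15 15 15]
```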

Oversampling

Imbalanced data can be addressed with oversampling. The imbalanced-learn library offers RandomOverSampler, SMOTE, and ADASYN. Example using SMOTE:

<code>pip install imbalanced-learn</code>
<code>from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate imbalanced data
X, y = make_classification(n_classes=2, class_sep=2,
                          weights=[0.1, 0.9], n_informative=3,
                          n_redundant=1, flip_y=0, n_features=20,
                          n_clusters_per_class=1, n_samples=1000,
                          random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("Before oversampling, class 0 count:", sum(y_train == 0))
print("Before oversampling, class 1 count:", sum(y_train == 1))
print("After oversampling, class 0 count:", sum(y_train_res == 0))
print("After oversampling, class 1 count:", sum(y_train_res == 1))
</code>

SMOTE creates synthetic minority samples by interpolating between a minority instance and its nearest neighbors.
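That interpolation can be sketched by hand (a simplified illustration of the idea, not SMOTE's actual implementation): pick a minority point, one of its nearest minority neighbors, and a random factor lam in [0, 1], then place the synthetic point on the segment between them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))   # stand-in minority-class points

# Nearest minority neighbor of each point (column 0 is the point itself).
nn = NearestNeighbors(n_neighbors=2).fit(minority)
_, idx = nn.kneighbors(minority)
neighbors = minority[idx[:, 1]]

# Synthetic sample = point + lam * (neighbor - point), lam in [0, 1]
lam = rng.random((20, 1))
synthetic = minority + lam * (neighbors - minority)
```

Each synthetic row lies on the line segment between a minority point and its neighbor, which is exactly the geometric picture behind SMOTE.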

Undersampling

Undersampling reduces the majority-class size. Example using RandomUnderSampler:

<code>from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
</code>

Undersampling balances classes but may discard useful information, potentially affecting model generalization.
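The effect of RandomUnderSampler can be sketched with plain NumPy (a hand-rolled approximation, not the library's exact internals): keep every minority sample and an equally sized random subset of the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)

# With weights=[0.9, 0.1], class 0 is the majority class.
rng = np.random.default_rng(42)
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Keep all minority samples plus a same-sized random majority subset.
keep = np.concatenate([minority,
                       rng.choice(majority, size=len(minority), replace=False)])
X_res, y_res = X[keep], y[keep]
```

After resampling, both classes have the minority-class count, which makes the information loss concrete: every dropped majority row is gone for good.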


Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
