Master Data Sampling Techniques in Python for Machine Learning
This article explains common data sampling methods—random, stratified, oversampling, undersampling, and adaptive sampling—and provides Python code examples using scikit-learn and imbalanced-learn to implement each technique on the Iris dataset and synthetic data.
Data Sampling
In machine learning, sampling refers to randomly selecting samples from a dataset for training or evaluating a model. The following are common sampling methods:
Random Sampling: randomly select samples for the training or test set; a split that is not representative of the population can lead to over- or under-fitting.
Stratified Sampling: divide samples into strata by class or label and sample randomly within each stratum, ensuring every class is represented and reducing the risk of over- or under-fitting.
Oversampling: duplicate or synthesize minority-class samples to balance class counts; this mitigates imbalance but can cause over-fitting.
Undersampling: randomly remove majority-class samples until class sizes are roughly balanced; this speeds up training but risks discarding information and under-fitting.
Adaptive Sampling: dynamically adjust the sampling strategy based on classifier performance, focusing on difficult or easily confused samples to improve accuracy.
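Adaptive sampling has no single standard API; as a minimal sketch of the idea (assuming scikit-learn is available, with a boosting-style weighting loop chosen for illustration), samples the current classifier misclassifies get higher weight, so later training sets focus on them:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy adaptive-sampling loop: misclassified samples get higher
# weight, so the next resampled training set focuses on them.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
weights = np.full(len(X), 1.0 / len(X))

for _ in range(3):
    # Draw a training set according to the current weights.
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[idx], y[idx])
    # Upweight samples the model still gets wrong, then renormalize.
    wrong = clf.predict(X) != y
    weights[wrong] *= 2.0
    weights /= weights.sum()

print("weight mass on hard samples:", weights[wrong].sum())
```

The doubling factor and round count here are arbitrary choices for the sketch; production schemes (e.g. AdaBoost) derive the reweighting from the weighted error rate instead.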
Python Implementation
In Python, scikit-learn provides the splitting utilities and the imbalanced-learn library the resampling methods. The first examples use the Iris dataset; the resampling examples use synthetic data.
Random Sampling
Use sklearn.model_selection.train_test_split to randomly split the dataset.
<code>from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
</code>
<code>from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
</code>
Stratified Sampling
When the dataset is small, stratified sampling keeps each class at its original proportion in both splits. Use StratifiedShuffleSplit from scikit-learn (train_test_split also accepts a stratify argument for the same effect).
<code>from sklearn.model_selection import StratifiedShuffleSplit
</code>
Full code:
<code>from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
</code>
Oversampling
Imbalanced data can be addressed with oversampling. The imbalanced-learn library offers RandomOverSampler, SMOTE, and ADASYN. Example using SMOTE:
<code>pip install imbalanced-learn</code>
<code>from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# generate imbalanced data
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print("Class 0 count before oversampling:", sum(y_train == 0))
print("Class 1 count before oversampling:", sum(y_train == 1))
print("Class 0 count after oversampling:", sum(y_train_res == 0))
print("Class 1 count after oversampling:", sum(y_train_res == 1))
</code>
SMOTE creates synthetic minority samples by interpolating between a minority instance and its nearest neighbors.
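That interpolation step can be sketched directly with NumPy and scikit-learn (a simplified illustration of the idea, not imbalanced-learn's actual implementation): a synthetic point x_new = x + gap * (x_nn - x), with gap drawn uniformly from [0, 1], always lies on the segment between a minority sample and one of its minority-class neighbors. The 2-D minority points below are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # hypothetical minority-class points

# Find each minority sample's nearest minority neighbor
# (n_neighbors=2 because the nearest hit is the point itself).
nn = NearestNeighbors(n_neighbors=2).fit(minority)
_, idx = nn.kneighbors(minority)

# SMOTE-style step: interpolate between a sample and its neighbor.
x, x_nn = minority[0], minority[idx[0, 1]]
x_new = x + rng.uniform() * (x_nn - x)
print("synthetic sample:", x_new)
```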
Undersampling
Undersampling reduces the majority-class size. Example using RandomUnderSampler:
<code>from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
</code>
Undersampling balances classes but may discard useful information, potentially affecting model generalization.
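Under the hood, random undersampling simply drops majority-class rows at random. A minimal NumPy sketch of the same idea (an illustration, not imbalanced-learn's actual code) on the synthetic data from above:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)
rng = np.random.default_rng(42)

# Keep every minority sample; keep an equally sized random
# subset of the majority class.
counts = Counter(y)
minority_label = min(counts, key=counts.get)
majority_label = max(counts, key=counts.get)
min_idx = np.flatnonzero(y == minority_label)
maj_idx = rng.choice(np.flatnonzero(y == majority_label),
                     size=len(min_idx), replace=False)
keep = np.concatenate([min_idx, maj_idx])

X_res, y_res = X[keep], y[keep]
print("class counts after undersampling:", Counter(y_res))
```

In practice, prefer RandomUnderSampler, and apply any resampling only to the training split (as the SMOTE example above does) so no information leaks into the test set.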
Reference: Sany Ho, "Pure Random Sampling (train_test_split) and Stratified Sampling (StratifiedShuffleSplit) with sklearn", CSDN blog.
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".