Master Data Sampling Techniques in Python for Machine Learning
This article explains common data sampling methods—random, stratified, oversampling, undersampling, and adaptive sampling—and provides Python code examples using scikit-learn and imbalanced-learn to implement each technique on the Iris dataset and synthetic data.
Data Sampling
In machine learning, sampling refers to randomly selecting samples from a dataset for training or evaluating a model. The following are common sampling methods:
Random Sampling: select samples uniformly at random for the training or test set; on small or imbalanced datasets the resulting split may not be representative of the class distribution, which can distort both training and evaluation.
Stratified Sampling: divide samples into strata by class or label and randomly select from each stratum, so every class is represented in proportion to its share of the dataset.
Oversampling: duplicate minority‑class samples (or synthesize new ones) to balance class counts, which mitigates imbalance but can cause over‑fitting to the repeated samples.
Undersampling: randomly remove majority‑class samples to approach balanced class sizes, which speeds up training but discards data and risks under‑fitting.
Adaptive Sampling: dynamically adjust the sampling strategy based on classifier performance, focusing on difficult or confusing samples to improve accuracy.
Python Implementation
In Python, scikit-learn provides the splitting utilities and imbalanced-learn provides the resampling methods; the examples use the Iris dataset and synthetic data.
Random Sampling
Use sklearn.model_selection.train_test_split to randomly split the dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris features and labels
iris = load_iris()
X = iris.data
y = iris.target
# Hold out 30% of the samples as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Stratified Sampling
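A plain random split does not guarantee that class proportions carry over to the test set. The minimal sketch below compares the per-class test counts of a purely random split with those of a stratified one on Iris. It relies on the `stratify` argument of `train_test_split`, which is not used elsewhere in this article, and the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Purely random 70/30 split: class counts in the test set are left to chance
_, _, _, y_test_random = train_test_split(X, y, test_size=0.3, random_state=42)

# Same split size, but stratified on the labels
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

random_counts = Counter(y_test_random.tolist())
strat_counts = Counter(y_test_strat.tolist())
print("random test counts:    ", sorted(random_counts.items()))
print("stratified test counts:", sorted(strat_counts.items()))
```

Iris has 50 samples per class, so the stratified 30% test set holds 15 samples of each class, while the random split's counts depend on the shuffle.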
When the dataset is small, stratified sampling ensures each class retains its proportion. Use StratifiedShuffleSplit from scikit-learn.
from sklearn.model_selection import StratifiedShuffleSplit
Full code:
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
iris = load_iris()
X = iris.data
y = iris.target
# One stratified 70/30 split; each class keeps its share in both parts
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Oversampling
Imbalanced data can be addressed with oversampling. The imbalanced-learn library offers RandomOverSampler, SMOTE, and ADASYN. Example using SMOTE:
# Install first (shell command): pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate imbalanced data: about 10% class 0 (minority), 90% class 1
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Resample the training set only, so the test set keeps the original distribution
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print("Class 0 count before oversampling:", (y_train == 0).sum())
print("Class 1 count before oversampling:", (y_train == 1).sum())
print("Class 0 count after oversampling:", (y_train_res == 0).sum())
print("Class 1 count after oversampling:", (y_train_res == 1).sum())
SMOTE creates synthetic minority samples by interpolating between a minority instance and one of its nearest minority‑class neighbors.
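The interpolation step can be sketched directly with NumPy. This is a simplified illustration of the idea, not imbalanced-learn's implementation: the toy data `X_min`, the helper `synthesize`, and the choice of `k = 5` neighbors are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # toy minority-class samples (illustrative data)
k = 5                             # neighbors considered per sample (assumed value)

# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neighbor_idx = nn.kneighbors(X_min)

def synthesize(i):
    """Hypothetical helper: one synthetic point on the segment from sample i to a neighbor."""
    j = rng.choice(neighbor_idx[i, 1:])   # pick one of the k neighbors, skipping i itself
    lam = rng.random()                    # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

# Generate 10 synthetic samples from randomly chosen minority instances
X_new = np.array([synthesize(i) for i in rng.integers(0, len(X_min), size=10)])
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.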
Undersampling
Undersampling reduces the majority‑class size. Example using RandomUnderSampler:
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# Generate imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Randomly drop majority-class samples until both classes have the same size
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
Undersampling balances the classes but discards majority‑class samples that may carry useful information, which can hurt model generalization.
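Adaptive Sampling
Adaptive sampling has no single standard library call, so the following is only one possible reading of the description given earlier: train a classifier on a small subset, score the remaining samples by how poorly the classifier predicts their true label, and add the hardest ones to the training subset. The classifier (`LogisticRegression`), the seed size, the number of rounds, and the batch size are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

rng = np.random.default_rng(42)
n_rounds, batch_size = 5, 50  # assumed values for the sketch

# Start from a small seed with 25 samples per class so both classes are present
seed_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=25, replace=False) for c in (0, 1)
])
selected = np.zeros(len(y), dtype=bool)
selected[seed_idx] = True

for _ in range(n_rounds):
    clf = LogisticRegression(max_iter=1000).fit(X[selected], y[selected])
    pool_idx = np.flatnonzero(~selected)
    proba = clf.predict_proba(X[pool_idx])
    # Probability the model assigns to each sample's true label; low means "hard"
    true_class_proba = proba[np.arange(len(pool_idx)), y[pool_idx]]
    hardest = pool_idx[np.argsort(true_class_proba)[:batch_size]]
    selected[hardest] = True  # focus the next round on the difficult samples

X_adaptive, y_adaptive = X[selected], y[selected]
print("selected samples:", selected.sum())
```

The sketch ends with 300 selected samples (50 seed samples plus 5 rounds of 50). Concentrating on hard samples can also pull in noisy or mislabeled points, so the selection is worth inspecting before training a final model on it.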
References: CSDN blog by Sany Ho – “Pure Random Sampling (train_test_split) and Stratified Sampling (StratifiedShuffleSplit) with sklearn”.
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
