Improving Class Imbalance in Machine Learning with Class Weights: A Python Logistic Regression Walkthrough

The article demonstrates, with Python code, how tuning the class_weight parameter of logistic regression (first the default, then the 'balanced' option, and finally manually tuned weights found via grid search) can raise the F1 score from 0 to about 0.16 on imbalanced data. It closes with further techniques such as feature engineering and threshold adjustment.


This tutorial shows how to address class imbalance in binary classification using class weights with scikit‑learn's LogisticRegression. Three experiments are presented.
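
The snippets below assume that x_train, x_test, y_train, y_test already exist. The original dataset is not shown, so here is a minimal, hypothetical setup (make_classification with roughly 5% positives is an assumption, not the article's data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the article's dataset: ~5% minority class.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)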

1. Baseline logistic regression

A simple model is trained with default equal class weights. The code is:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Default class_weight=None treats both classes equally.
lr = LogisticRegression(solver='newton-cg')
lr.fit(x_train, y_train)
pred_test = lr.predict(x_test)
f1_test = f1_score(y_test, pred_test)
print('For the testing data, the f1-score:', f1_test)

The resulting F1 score on the test set is 0.0, indicating the model fails to predict the minority class.

2. Logistic regression with class_weight='balanced'

Setting the class_weight parameter to 'balanced' automatically adjusts weights inversely proportional to class frequencies.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='newton-cg', class_weight='balanced')
lr.fit(x_train, y_train)
pred_test = lr.predict(x_test)
f1_test = f1_score(y_test, pred_test)
print('For the testing data, the f1-score:', f1_test)

The F1 score improves to 0.10098851188885921, a gain of roughly 0.10 over the baseline's zero.
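
For reference, 'balanced' computes each class weight as n_samples / (n_classes * count of that class). A minimal sketch that reproduces those weights:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights scikit-learn derives internally for class_weight='balanced'.
classes = np.unique(y_train)
print(compute_class_weight('balanced', classes=classes, y=y_train))
# The same values by hand:
print(len(y_train) / (len(classes) * np.bincount(y_train)))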

3. Manual class‑weight tuning via grid search

The author searches for the optimal weight pair {0: w0, 1: w1} where w1 = 1 - w0. A grid of 200 weight values between 0 and 0.99 is evaluated using GridSearchCV with stratified 5‑fold cross‑validation and F1 scoring.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import numpy as np

lr = LogisticRegression(solver='newton-cg')
# Candidate weights for class 0; class 1 receives the complement.
weights = np.linspace(0.0, 0.99, 200)
param_grid = {'class_weight': [{0: x, 1: 1.0 - x} for x in weights]}
gridsearch = GridSearchCV(estimator=lr,
                          param_grid=param_grid,
                          cv=StratifiedKFold(),  # stratified 5-fold by default
                          n_jobs=-1,
                          scoring='f1',
                          verbose=2).fit(x_train, y_train)
print(gridsearch.best_params_)
# Plotting omitted for brevity

The plot shows the highest F1 score around a minority‑class weight of 0.93. The best weight combination found is approximately {0: 0.06467, 1: 0.93533}.

Training with these weights yields an F1 score of 0.1579371474617244 on the test set:

lr = LogisticRegression(solver='newton-cg', class_weight={0: 0.06467336683417085, 1: 0.9353266331658292})
lr.fit(x_train, y_train)
pred_test = lr.predict(x_test)
f1_test = f1_score(y_test, pred_test)
print('For the testing data, the f1-score:', f1_test)

Manually adjusting the class weights therefore improves the F1 score by roughly 6 percentage points over the balanced setting (from about 0.101 to 0.158), while shifting some errors from the majority to the minority class.

Further techniques to raise performance

Feature engineering: beyond the given predictors, one can create frequency‑based, interaction, group‑based, or statistical features.
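
The article's predictors are not shown; as one hedged illustration, a frequency feature on a hypothetical categorical column could be built like this:

import pandas as pd

# Replace each category with its relative frequency in the data.
df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY']})
df['city_freq'] = df['city'].map(df['city'].value_counts(normalize=True))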

Threshold tuning: the default decision threshold is 0.5; searching for an optimal threshold (via grid or random search) can further improve the F1 score.
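
A simple sweep over candidate thresholds, scored with F1 (a sketch that reuses the fitted lr and test split from above):

import numpy as np
from sklearn.metrics import f1_score

# Predicted probability of the minority class for each test sample.
proba_test = lr.predict_proba(x_test)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_test, (proba_test >= t).astype(int)) for t in thresholds]
print('Best threshold:', thresholds[int(np.argmax(scores))], 'best F1:', max(scores))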

Advanced algorithms: try boosting, bagging, stacking, or hybrid models instead of plain logistic regression.
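
For example, a gradient-boosted model can be dropped in with minimal changes. A sketch; the 10x up-weighting of minority samples is an arbitrary assumption, not a tuned value:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import f1_score

clf = HistGradientBoostingClassifier(random_state=42)
# Up-weight minority-class samples during training.
clf.fit(x_train, y_train, sample_weight=np.where(y_train == 1, 10.0, 1.0))
print('For the testing data, the f1-score:', f1_score(y_test, clf.predict(x_test)))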

Conclusion

The guide illustrates how to use class_weight in scikit‑learn, how to manually tune weights with grid search, and how these steps can substantially improve the F1 score on imbalanced datasets. Additional improvements can be achieved through richer features, threshold optimization, or more powerful classifiers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Python, logistic regression, scikit-learn, F1 score, imbalanced data, grid search, class weight
Written by

Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
