Logistic Regression Tutorial with scikit-learn

This article introduces logistic regression, explains its theoretical basis, details key scikit-learn parameters, and provides a complete Python example for breast cancer classification, covering data preprocessing, model training, prediction, and evaluation with classification reports.

Qunar Tech Salon

Sep 19, 2018

Logistic regression is a binary classification model derived from linear regression: it applies a sigmoid transformation to the linear output so that a continuous value is mapped to a probability. It is also referred to as logit regression or binary regression.

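In scikit-learn the implementation is the LogisticRegression class in sklearn.linear_model. Important parameters include:

C: inverse of the regularization strength (default 1.0); smaller values mean stronger regularization.
penalty: regularization norm, either 'l1' or 'l2' (default 'l2'). The 'l1' penalty requires a solver that supports it, such as 'liblinear' or 'saga'.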
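
To make the effect of C concrete, here is a minimal sketch rather than part of the original tutorial. It assumes synthetic data from make_classification as a stand-in for a real dataset, and the names count_nonzero_coefs and nonzero_by_C are illustrative. It fits an L1-penalized model at several values of C and counts how many coefficients remain non-zero; stronger regularization (smaller C) generally leaves fewer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the article's dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)


def count_nonzero_coefs(C):
    """Fit an L1-penalized model and count the coefficients that stay non-zero."""
    # liblinear is one of the solvers that supports the L1 penalty.
    model = LogisticRegression(C=C, penalty='l1', solver='liblinear')
    model.fit(X, y)
    return int(np.count_nonzero(model.coef_))


# Smaller C means stronger regularization.
nonzero_by_C = {C: count_nonzero_coefs(C) for C in (0.01, 0.1, 1.0, 10.0)}
for C, n in nonzero_by_C.items():
    print('C=%-5s non-zero coefficients: %d' % (C, n))
```

The exact counts depend on the data and the scikit-learn version, so none are quoted here; the trend across values of C is the point.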

The following example demonstrates a full workflow on the Wisconsin breast‑cancer dataset, from data loading to model evaluation.

#coding=utf-8
import pandas as pd
import numpy as np
# train_test_split lives in model_selection (cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Define column names (the raw data file has no header row)
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

# Load data (CSV converted from the original .data file)
data = pd.read_csv('breast-cancer-wisconsin.csv', names=column_names)
# Replace missing values marked with '?' by NaN
data = data.replace(to_replace='?', value=np.nan)
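# Drop rows containing any missing values
data = data.dropna(how='any')
# 'Bare Nuclei' is read as strings because of the '?' markers; cast it back to integers
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int)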

# Split features and target
X = data[column_names[1:10]]   # the 9 feature columns (sample ID and class excluded)
y = data['Class']
# Train‑test split (25% test size, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standardize features (zero mean, unit variance)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Initialize and train Logistic Regression (using L1 penalty)
# The L1 penalty needs a compatible solver; liblinear supports it
lr = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train)

# Predict on test set
lr_y_predict = lr.predict(X_test)
# Evaluate accuracy and detailed classification report
print('Accuracy:', lr.score(X_test, y_test))
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

The model achieves high accuracy (around 95% on this dataset) and provides precision, recall, and F1‑score for each class via classification_report.
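
To show what the per-class numbers in the report mean, the following sketch computes precision, recall, and F1-score by hand for one class. The helper binary_metrics and the label lists are made up for illustration and are not output from the model above; the codes 2 (benign) and 4 (malignant) follow the dataset's class coding.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for illustration: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 2, 4, 4, 4, 4, 2]

precision, recall, f1 = binary_metrics(y_true, y_pred, positive=4)
print('precision=%.2f recall=%.2f f1=%.2f' % (precision, recall, f1))
# prints: precision=0.75 recall=0.75 f1=0.75
```

Here three malignant samples are caught, one is missed, and one benign sample is flagged by mistake, so precision and recall for the malignant class are both 3/4.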

Mathematically, logistic regression models the log‑odds as a linear function z = w·x + b and maps it to a probability with the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). The probability of belonging to class A is σ(z), while the probability of class B is 1 - σ(z). This simple yet powerful formulation makes logistic regression a fundamental starting point for binary classification tasks.
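
The formula above can be sketched in a few lines of plain Python. The weights, bias, and sample below are hypothetical values chosen only for illustration, and the helper names sigmoid and predict_proba are not scikit-learn's API.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x, b):
    """Return (P(class A), P(class B)) for one sample x under weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w·x + b
    p_a = sigmoid(z)
    return p_a, 1.0 - p_a


# Hypothetical weights, bias, and sample.
w = [0.8, -0.4]
b = 0.1
x = [1.0, 2.0]

p_a, p_b = predict_proba(w, x, b)
print('P(A)=%.4f  P(B)=%.4f' % (p_a, p_b))
```

A score of z = 0 gives a probability of exactly 0.5, which is why 0.5 is the usual decision threshold between the two classes.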

In summary, logistic regression is fast to train, easy to interpret, and serves as an excellent introductory algorithm for anyone learning machine‑learning classification techniques.
