Logistic Regression Tutorial with scikit-learn
This article introduces logistic regression, explains its theoretical basis, details key scikit-learn parameters, and provides a complete Python example for breast cancer classification, covering data preprocessing, model training, prediction, and evaluation with classification reports.
Logistic regression is a binary classification model derived from linear regression by applying a sigmoid transformation to map continuous outputs to probabilities. It is often referred to as logistic, logit, or binary regression.
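In scikit-learn the implementation is LogisticRegression. Two important parameters are C and penalty. C is the inverse of the regularization strength (default 1.0), so smaller values mean stronger regularization. penalty selects the regularization norm, most commonly 'l1' or 'l2' (default 'l2'). Note that 'l1' only works with a solver that supports it, such as 'liblinear' or 'saga'.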
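To make the role of C concrete, here is a minimal sketch that fits the same model at three values of C and compares the size of the learned weight vector. It uses synthetic data from make_classification rather than the breast-cancer data, and the variable name coef_norms is only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default L2 penalty; only C changes between runs
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    # Euclidean norm of the weight vector: shrinks as regularization gets stronger
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C =', C, '-> ||w|| =', round(coef_norms[C], 4))
```

The smallest C should give the smallest weight norm, since a small C puts more weight on the penalty term relative to the data-fit term.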
The following example demonstrates a full workflow on the Wisconsin breast‑cancer dataset, from data loading to model evaluation.
#coding=utf-8
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Define column names (the raw data file has no header row)
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# Load data (CSV converted from the original .data file)
data = pd.read_csv('breast-cancer-wisconsin.csv', names=column_names)
# Replace missing values marked with '?' by NaN
data = data.replace(to_replace='?', value=np.nan)
# Drop rows containing any missing values
data = data.dropna(how='any')
# Split features and target
X = data[column_names[1:10]] # first 9 feature columns
y = data['Class']
# Train‑test split (25% test size, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Standardize features (zero mean, unit variance)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Initialize and train Logistic Regression (L1 penalty needs a solver that supports it, e.g. liblinear)
lr = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train)
# Predict on test set
lr_y_predict = lr.predict(X_test)
# Evaluate accuracy and detailed classification report
print('Accuracy:', lr.score(X_test, y_test))
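print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))
The model achieves high accuracy (around 95% on this dataset) and classification_report prints precision, recall, and F1‑score for each class. In this dataset the Class column uses 2 for benign and 4 for malignant, which is why target_names is listed in that order.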
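As a rough illustration of what the report contains, the sketch below computes precision and recall for the malignant class by hand from a confusion matrix. The labels y_true and y_pred are made-up toy values, not output from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 2 = benign, 4 = malignant
y_true = [2, 2, 2, 2, 4, 4, 4, 2, 4, 2]
y_pred = [2, 2, 4, 2, 4, 4, 2, 2, 4, 2]

# With labels=[2, 4], class 4 (malignant) is treated as the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[2, 4]).ravel()

precision = tp / (tp + fp)  # share of predicted malignant cases that are truly malignant
recall = tp / (tp + fn)     # share of truly malignant cases that were detected

print('precision:', precision, 'recall:', recall)

# The same numbers from scikit-learn's helpers
print(precision_score(y_true, y_pred, pos_label=4),
      recall_score(y_true, y_pred, pos_label=4))
```

For a screening task like this one, recall on the malignant class is often the number to watch, because a false negative is a missed cancer.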
Mathematically, logistic regression models the log‑odds as a linear function z = w·x + b and maps it to a probability with the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). The probability of belonging to class A is σ(z), while the probability of class B is 1 - σ(z). This simple yet powerful formulation makes logistic regression a fundamental starting point for binary classification tasks.
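The formula can be checked against scikit-learn directly. The sketch below, again on synthetic data, computes z = w·x + b from the fitted coefficients, applies a hand-written sigmoid, and compares the result with predict_proba. The helper sigmoid and the names p_manual and p_sklearn are illustrative, not part of the library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear score z = w.x + b built from the fitted weights and intercept
z = X @ model.coef_.ravel() + model.intercept_[0]

p_manual = sigmoid(z)                     # probability of the positive class, by hand
p_sklearn = model.predict_proba(X)[:, 1]  # the same probability from scikit-learn

print(np.allclose(p_manual, p_sklearn))
```

A score of z = 0 maps to a probability of 0.5, which is the default decision boundary used by predict.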
In summary, logistic regression is fast to train, easy to interpret, and serves as an excellent introductory algorithm for anyone learning machine‑learning classification techniques.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.