
Random Forest Classification with PCA and Hyper‑Parameter Tuning on the Breast Cancer Dataset

This tutorial walks through loading the scikit‑learn breast‑cancer dataset, preprocessing it, building baseline and PCA‑reduced Random Forest models, applying RandomizedSearchCV and GridSearchCV for hyper‑parameter optimization, and evaluating the final models using recall as the primary metric.

Python Programming Learning Circle

In this article we demonstrate a complete machine-learning workflow for binary classification of the scikit-learn breast-cancer dataset using Random Forest: data loading, exploratory checks, preprocessing, PCA-based dimensionality reduction, two-stage hyper-parameter tuning, and evaluation on a held-out test set.

First, the data is loaded and a DataFrame is created:

import pandas as pd
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
data['cancer'] = dataset['target']
display(data.head())        # display() is available in Jupyter/IPython notebooks
data.info()                 # info() prints directly and returns None, so no display() needed
display(data.isna().sum())  # confirms there are no missing values
display(data.describe())

After confirming the data quality, we split it into training and test sets (50 % each) while preserving class ratios:

from sklearn.model_selection import train_test_split
X = data.drop('cancer', axis=1)
y = data['cancer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=2020, stratify=y)

Features are standardized before modeling:

import numpy as np
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
y_train = np.array(y_train)

A baseline Random Forest classifier is trained on the scaled data. The perfect training accuracy below is expected for an unconstrained forest and says nothing about generalization; the held-out test set is evaluated later:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train_scaled, y_train)
print(rfc.score(X_train_scaled, y_train))  # 1.0

Feature importance is extracted and visualized to understand which variables drive predictions.
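That step can be sketched as follows (a self-contained example: it refits a forest on the raw data rather than reusing the `rfc` and scaled training split from above, and `random_state=0` is an assumption added for reproducibility):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

dataset = load_breast_cancer()
rfc = RandomForestClassifier(random_state=0)
rfc.fit(dataset['data'], dataset['target'])

# Pair each feature name with its importance and sort descending;
# Random Forest importances are normalized to sum to 1.
importances = pd.Series(rfc.feature_importances_,
                        index=dataset['feature_names']).sort_values(ascending=False)
print(importances.head(10))  # the variables that drive predictions most
```

Sorting first makes the ranking easy to plot, e.g. with `importances.head(10).plot.barh()`.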

To reduce dimensionality and training time, Principal Component Analysis (PCA) is applied. An initial PCA with 30 components shows that the first 10 components capture over 95 % of the variance, so we retain 10 components:

from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(X_train_scaled)
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)
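The 95 % claim can be checked directly from the cumulative explained-variance ratio of a full 30-component PCA (a self-contained sketch; for brevity it standardizes and fits on the whole dataset rather than only the training split as the article does):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer()['data']
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all 30 components, then accumulate the variance ratios
pca_full = PCA(n_components=30).fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
print(f'Variance captured by the first 10 components: {cumvar[9]:.4f}')
```

The same array also shows how few components a stricter or looser threshold would need, which is useful when tuning `n_components`.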

The PCA‑reduced data is used to train a second baseline Random Forest model:

rfc_pca = RandomForestClassifier()
rfc_pca.fit(X_train_scaled_pca, y_train)
print(rfc_pca.score(X_train_scaled_pca, y_train))  # 1.0

Hyper‑parameter tuning is performed in two stages. First, RandomizedSearchCV explores a wide range of values for n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, and bootstrap:

from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(100, 1000, 10)],
    'max_features': ['log2', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(1, 15, 15)],
    'min_samples_split': [int(x) for x in np.linspace(2, 50, 10)],
    'min_samples_leaf': [int(x) for x in np.linspace(2, 50, 10)],
    'bootstrap': [True, False]
}
rs = RandomizedSearchCV(rfc, param_dist, n_iter=100, cv=3, verbose=1, n_jobs=-1, random_state=0)
rs.fit(X_train_scaled_pca, y_train)
print(rs.best_params_)

Based on the RandomizedSearch results, a narrower grid is defined and GridSearchCV conducts an exhaustive search:

from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [300, 500, 700],
    'max_features': ['sqrt'],
    'max_depth': [2, 3, 7, 11, 15],
    'min_samples_split': [2, 3, 4, 22, 23, 24],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7],
    'bootstrap': [False]
}
gs = GridSearchCV(rfc, param_grid, cv=3, verbose=1, n_jobs=-1)
gs.fit(X_train_scaled_pca, y_train)
print(gs.best_params_)

The best model from GridSearchCV is evaluated on the held‑out test set alongside the two baseline models. Recall is used as the primary metric because false negatives are critical in cancer diagnosis. Note that load_breast_cancer encodes malignant as 0 and benign as 1, and recall_score defaults to pos_label=1, so the scores below are recall for the benign class; pass pos_label=0 to measure sensitivity for malignant tumors:

from sklearn.metrics import confusion_matrix, recall_score
y_pred = rfc.predict(X_test_scaled)
y_pred_pca = rfc_pca.predict(X_test_scaled_pca)
y_pred_gs = gs.best_estimator_.predict(X_test_scaled_pca)
print('Baseline recall:', recall_score(y_test, y_pred))
print('Baseline + PCA recall:', recall_score(y_test, y_pred_pca))
print('Tuned + PCA recall:', recall_score(y_test, y_pred_gs))
conf_matrix = confusion_matrix(y_test, y_pred_gs)
print(conf_matrix)
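The relationship between the confusion matrix and recall, and the effect of `pos_label`, can be seen on a toy example (the labels below are hypothetical, using the same encoding as the dataset: 0 = malignant, 1 = benign):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions for six samples
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]

# recall = TP / (TP + FN), computed for the class chosen as "positive"
print(recall_score(y_true, y_pred))               # benign class (default pos_label=1): 0.75
print(recall_score(y_true, y_pred, pos_label=0))  # malignant class: 1.0
print(confusion_matrix(y_true, y_pred))           # rows = true class, columns = predicted class
```

Here one benign sample is predicted malignant (recall 3/4 for class 1), while both malignant samples are caught (recall 2/2 for class 0), illustrating why the choice of positive class matters for this dataset.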

The results show that the original baseline Random Forest achieves the highest recall (≈94.97 %), illustrating that dimensionality reduction and extensive hyper‑parameter tuning do not always improve performance for this task. The case study highlights the importance of empirical testing when optimizing models for critical applications such as cancer detection.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
