
Principal Component Analysis (PCA) with Python: Theory and Practical Example on the Breast Cancer Dataset

This article explains the fundamentals of Principal Component Analysis (PCA), demonstrates its application on the Breast Cancer Wisconsin dataset using Python code, and shows how scaling, PCA transformation, scree plots, and feature-group comparisons can reveal data structure and improve predictive modeling.


Principal Component Analysis (PCA) is a powerful dimensionality-reduction technique. It constructs uncorrelated principal components that capture the most variance in a dataset, making it a useful tool for visualizing how well classes separate.

What is PCA? PCA reduces the number of features by constructing principal components (PCs) where PC1 explains the largest variance, PC2 the next largest, and so on. The first two PCs often summarize the data well enough to be plotted in two dimensions.
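As a toy illustration (synthetic data, not the article's dataset), PCA applied to two strongly correlated features assigns almost all of the variance to PC1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two strongly correlated features: most variance lies along one direction
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # PC1 dominates
```

Because the two columns are nearly collinear, the first ratio is close to 1 and the second close to 0, which is exactly the property that lets the first few PCs summarize a high-dimensional dataset.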

Dataset – The Breast Cancer Wisconsin (Diagnostic) dataset is used. It contains 30 features derived from measurements of cell nuclei (e.g., mean symmetry, worst smoothness) and a binary target indicating malignant or benign tumors.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Assemble the features and the binary target into one DataFrame
data = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
data['y'] = cancer['target']
```

After loading the data, the features are scaled to zero mean and unit variance because PCA is sensitive to the scale of variables.

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the features (not the target column) to zero mean and unit variance
features = data.drop('y', axis=1)
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Obtain principal components
pca = PCA().fit(scaled)
pc = pca.transform(scaled)
pc1 = pc[:, 0]
pc2 = pc[:, 1]

# Plot the first two principal components
# In this dataset, y == 1 is benign (red) and y == 0 is malignant (blue)
plt.figure(figsize=(10, 10))
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
plt.scatter(pc1, pc2, c=colour, edgecolors='#000000')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```

The resulting scatter plot shows two clusters corresponding to malignant and benign tumors, though some overlap remains. A scree plot is then generated to display the proportion of variance explained by each PC.

```python
# Proportion of variance explained by the first ten components
var = pca.explained_variance_ratio_[0:10]
labels = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10']

plt.figure(figsize=(15, 7))
plt.bar(labels, var)
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()
```
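A common follow-up, sketched here as a self-contained snippet (it is not part of the original walkthrough), is the cumulative explained-variance curve, which indicates how many components are needed to retain a given share of the variance:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
scaled = StandardScaler().fit_transform(cancer['data'])

pca = PCA().fit(scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that retains 95% of the variance
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{n_95} components explain 95% of the variance")
```

The 95% threshold is a conventional rule of thumb, not a hard rule; the right cut-off depends on the downstream task.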

Next, two feature groups are defined to compare their predictive power: one based on symmetry and smoothness, the other on perimeter and concavity. PCA is applied separately to each group, and the resulting PC1‑PC2 scatter plots reveal that the second group separates the classes more clearly.

```python
group_1 = ['mean symmetry', 'symmetry error', 'worst symmetry',
           'mean smoothness', 'smoothness error', 'worst smoothness']

group_2 = ['mean perimeter', 'perimeter error', 'worst perimeter',
           'mean concavity', 'concavity error', 'worst concavity']
```
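The article describes applying PCA to each group separately but omits that code; a minimal self-contained sketch, reusing the scaling and plotting pattern from earlier, might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
data = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
data['y'] = cancer['target']

group_1 = ['mean symmetry', 'symmetry error', 'worst symmetry',
           'mean smoothness', 'smoothness error', 'worst smoothness']
group_2 = ['mean perimeter', 'perimeter error', 'worst perimeter',
           'mean concavity', 'concavity error', 'worst concavity']

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
for ax, group in zip(axes, [group_1, group_2]):
    # Scale each feature subset and project onto its first two PCs
    scaled = StandardScaler().fit_transform(data[group])
    pc = PCA(n_components=2).fit_transform(scaled)
    ax.scatter(pc[:, 0], pc[:, 1], c=colour, edgecolors='#000000')
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
plt.show()
```

Side-by-side axes make the comparison direct: the perimeter/concavity panel shows two much more distinct clusters than the symmetry/smoothness panel.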

Logistic regression models are trained on each feature group (70% train, 30% test). The first group achieves ~74% accuracy, while the second reaches ~97%, confirming that PCA‑guided feature selection can identify more predictive subsets.

```python
from sklearn.model_selection import train_test_split
import sklearn.metrics as metric
import statsmodels.api as sm

for i, g in enumerate([group_1, group_2]):
    x = sm.add_constant(data[g])  # add an intercept term
    y = data['y']
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=101)

    # Fit a logistic regression and round predicted probabilities to 0/1
    model = sm.Logit(y_train, x_train).fit(disp=0)
    predictions = np.around(model.predict(x_test))
    accuracy = metric.accuracy_score(y_test, predictions)
    print(f"Accuracy of Group {i+1}: {accuracy:.2f}")
```

In summary, PCA provides a quick visual assessment of data separability and helps prioritize features before building predictive models, but it should be used alongside other exploratory tools such as box plots and information value analyses.


Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
