Understanding PCA: A Step-by-Step Guide to Dimensionality Reduction in Machine Learning

This article explains the principal component analysis (PCA) method for reducing data dimensionality, walks through the mathematical steps, implements them manually with NumPy, and then shows the same workflow with scikit‑learn's PCA class, with concrete code examples and output.


PCA (Principal Component Analysis) is a core dimensionality‑reduction technique: using basic linear algebra and statistics, it projects an n × m data matrix A onto a lower‑dimensional subspace while preserving as much of the data's variance as possible.

The computation proceeds as follows: first compute the mean of each column, M = mean(A); then center the data, C = A - M; next calculate the covariance matrix, V = cov(C); perform the eigen‑decomposition, values, vectors = eig(V); finally, sort the eigenvectors by descending eigenvalue and select the top k as the principal components, forming the projection matrix B. The reduced data is then the projection P = C · B, as sketched below.
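As a compact sketch of those steps, the following function (pca_project is an illustrative name, not from the original article) sorts the eigenvectors and keeps the top k; the article's own step‑by‑step listing follows.

from numpy import argsort, cov, mean
from numpy.linalg import eig

def pca_project(A, k):
    # center columns by subtracting the column means
    C = A - mean(A, axis=0)
    # covariance matrix of the centered data (features as rows for cov)
    V = cov(C.T)
    # eigen-decomposition; eig does not guarantee sorted eigenvalues
    values, vectors = eig(V)
    # indices of the eigenvalues in descending order
    order = argsort(values)[::-1]
    # projection matrix B: the top-k eigenvectors as columns
    B = vectors[:, order[:k]]
    # project the centered data onto the principal components
    return C.dot(B)

For the 3 × 2 example below, pca_project(A, 1) returns a 3 × 1 result.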

Manual NumPy implementation (3 × 2 example):

from numpy import array, mean, cov
from numpy.linalg import eig

# define a 3 x 2 matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# mean of each column
M = mean(A, axis=0)
print(M)
# center columns by subtracting the column means
C = A - M
print(C)
# covariance matrix of the centered data
V = cov(C.T)
print(V)
# eigen-decomposition of the covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project the centered data onto both eigenvectors
P = vectors.T.dot(C.T)
print(P.T)

Running the code prints the original matrix, the column means, the centered data, the covariance matrix, its eigenvectors and eigenvalues, and the projected data. For this example the second eigenvalue is zero, so the second column of the projection is all zeros: only the first eigenvector carries variance, and the 3 × 2 matrix can be projected to a 3 × 1 matrix with no loss of information.
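To make the 3 × 1 reduction explicit, a minimal follow‑on sketch (reusing C and vectors from the listing above) projects onto the first eigenvector alone; for this data the first eigenvalue happens to be the largest, while in general the eigenvectors should be sorted by eigenvalue first.

# keep only the first eigenvector (the dominant one for this data)
B = vectors[:, :1]     # 2 x 1 projection matrix
P1 = C.dot(B)          # 3 x 1 projected data
print(P1)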

Using scikit‑learn’s PCA class simplifies reuse on new data:

from numpy import array
from sklearn.decomposition import PCA

A = array([[1, 2], [3, 4], [5, 6]])
print(A)

pca = PCA(n_components=2)        # request two components
pca.fit(A)
print(pca.components_)           # principal axes
print(pca.explained_variance_)   # eigenvalues
B = pca.transform(A)             # project the original data
print(B)

The output matches the manual calculation: the same principal components, eigenvalues (exposed as explained_variance_), and projections are obtained, up to floating‑point rounding and a possible sign flip of the components.
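Because the fitted PCA object stores the learned mean and components, it can project rows it has never seen, which is what makes the class convenient for reuse; the new matrix here is a hypothetical example, not from the original article.

# reuse the PCA fitted on A to project new, unseen rows
new_rows = array([[7, 8], [9, 10]])   # hypothetical new data
print(pca.transform(new_rows))        # centered with A's mean, then projected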

In summary, the article details the PCA workflow, demonstrates a from‑scratch NumPy implementation, and shows how scikit‑learn’s PCA class can be leveraged for efficient dimensionality reduction on larger datasets.

