Implementing PCA for Face Recognition with PaddlePaddle: A Step‑by‑Step Guide
This article walks through a complete PCA‑based face‑recognition pipeline using the PaddlePaddle framework, covering dataset preparation, library installation, image vectorization, PCA dimensionality reduction, training, testing, and performance evaluation with detailed code examples.
Overview
Principal Component Analysis (PCA) is a dimensionality‑reduction technique that projects high‑dimensional data onto a lower‑dimensional subspace while preserving the largest variance. The following example implements PCA with PaddlePaddle’s linear‑algebra API for a face‑recognition task using the ORL face dataset.
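Before diving into faces, the core idea can be seen on a tiny synthetic example. The sketch below (NumPy for self-containment; the variable names are mine, not from the article's code) builds 2-D data stretched along one direction, eigendecomposes the covariance matrix, and checks that projecting onto the top eigenvector preserves the largest variance:

```python
import numpy as np

# Toy data: 200 points stretched along (1, 1), so the first principal
# component should point roughly in that direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T  # rotate the stretched axis to 45 degrees

Xc = X - X.mean(axis=0)                 # center the data
C = Xc.T @ Xc / (len(Xc) - 1)           # 2x2 sample covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)  # eigenvalues in ascending order
pc1 = eig_vecs[:, -1]                   # direction of largest variance

# The variance of the 1-D projection equals the largest eigenvalue.
projected = Xc @ pc1
print(np.isclose(projected.var(ddof=1), eig_vals[-1]))  # True
```

The same structure (center, form covariance, eigendecompose, project) reappears in the PaddlePaddle implementation below, just with one covariance trick added for high-dimensional images.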
Dataset
The ORL dataset contains 40 subjects, each with 10 grayscale face images stored as PGM files. Download the archive from http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data/att_faces.tar.Z and extract the directory structure s1/1.pgm … s40/10.pgm.
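The helper below is not part of the original pipeline; it is a quick sanity check (the function name is mine) that the extraction actually produced all 400 PGM files in the layout the loader expects:

```python
import os

def check_orl_layout(root):
    """Return the list of missing image paths in an ORL-style directory tree
    (s1/1.pgm ... s40/10.pgm); an empty list means the layout is complete."""
    missing = []
    for person_id in range(1, 41):
        for img_no in range(1, 11):
            path = os.path.join(root, f"s{person_id}", f"{img_no}.pgm")
            if not os.path.exists(path):
                missing.append(path)
    return missing
```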
Required Libraries
pip install opencv-python
pip install paddlepaddle

Image Vectorization
Each image is read in grayscale and reshaped into a one‑dimensional tensor.
import cv2
import paddle

def img2vector(image_path):
    img = cv2.imread(image_path, 0)  # read as grayscale
    return paddle.reshape(paddle.to_tensor(img, dtype='float32'), [1, -1])

Dataset Loader
import os
import numpy as np

class ORLDataset:
    def __init__(self, data_path, k, train=True, seed=0):
        self.data_path = data_path
        self.k = k          # number of images per subject used for training
        self.train = train
        self.seed = seed    # fixed seed keeps the split consistent across instances

    def load_orl(self):
        train_imgs, train_lbls = [], []
        test_imgs, test_lbls = [], []
        # Seeded permutation of image indices 1-10. The same seed must be used
        # for the train and test instances, otherwise the two objects would
        # draw different permutations and the splits could overlap.
        sample = np.random.RandomState(self.seed).permutation(10) + 1
        for person_id in range(1, 41):
            for idx, img_no in enumerate(sample):
                img_path = os.path.join(self.data_path, f's{person_id}', f'{img_no}.pgm')
                vec = img2vector(img_path)
                if idx < self.k:
                    train_imgs.append(vec)
                    train_lbls.append(person_id)
                else:
                    test_imgs.append(vec)
                    test_lbls.append(person_id)
        if self.train:
            return paddle.concat(train_imgs, axis=0), paddle.to_tensor(train_lbls, dtype='int64')
        else:
            return paddle.concat(test_imgs, axis=0), paddle.to_tensor(test_lbls, dtype='int64')

PCA Implementation
The PCA function follows these steps:

1. Cast the input matrix to float32 and obtain its shape.
2. Compute the column-wise mean and center the data.
3. Form the covariance matrix C = A·Aᵀ. Because the number of training images (280) is far smaller than the number of pixels per image (10304), this rows × rows matrix is much cheaper to decompose than Aᵀ·A.
4. Obtain eigenvalues and eigenvectors of C with paddle.linalg.eigh, which returns eigenvalues in ascending order.
5. Keep the r eigenvectors with the largest eigenvalues and map them back to pixel space: for each eigenvector v of A·Aᵀ, the vector Aᵀ·v is an eigenvector of Aᵀ·A with the same eigenvalue.
6. Normalize each selected eigenvector to unit length.
7. Project the centered data onto the normalized eigenvectors to obtain the reduced representation.
def PCA(data, r):
    data = paddle.cast(data, 'float32')
    rows, _ = data.shape  # rows = number of training images
    # 1. Mean and centering (the mean broadcasts over all rows)
    data_mean = paddle.mean(data, axis=0)
    A = data - data_mean
    # 2. Covariance matrix (rows x rows, the "snapshot" trick)
    C = paddle.matmul(A, A, transpose_y=True)
    # 3. Eigendecomposition; eigh returns eigenvalues in ASCENDING order
    eig_vals, eig_vecs = paddle.linalg.eigh(C)
    # 4. Keep the r eigenvectors with the largest eigenvalues (the last r
    #    columns) and map them back to pixel space via A.T
    eig_vecs = paddle.matmul(A, eig_vecs[:, -r:], transpose_x=True)
    # 5. Normalize each column to unit length
    eig_vecs = eig_vecs / paddle.norm(eig_vecs, p=2, axis=0, keepdim=True)
    # 6. Reduced representation of the training data
    reduced = paddle.matmul(A, eig_vecs)
    return reduced, data_mean, eig_vecs

Training and Testing
For each target dimension r (10, 20, 30, 40) the experiment uses 7 images per subject for training and the remaining 3 for testing. The nearest‑neighbor classifier is based on Euclidean distance in the reduced space.
def face_recognize(data_path):
    for r in range(10, 41, 10):
        print(f"When reduced to {r} dimensions:")
        train_set = ORLDataset(data_path, k=7, train=True)
        test_set = ORLDataset(data_path, k=7, train=False)
        train_data, train_labels = train_set.load_orl()
        test_data, test_labels = test_set.load_orl()
        # PCA on the training data
        train_reduced, mean_vec, V_r = PCA(train_data, r)
        # Center the test data with the TRAINING mean, then project
        test_centered = test_data - mean_vec
        test_reduced = paddle.matmul(test_centered, V_r)
        # Nearest-neighbor classification by Euclidean distance
        correct = 0
        for i in range(len(test_labels)):
            dists = paddle.sum(paddle.square(train_reduced - test_reduced[i]), axis=1)
            nearest = paddle.argmin(dists)
            if train_labels[nearest] == test_labels[i]:
                correct += 1
        accuracy = correct / len(test_labels)
        print(f"Classification accuracy: {accuracy:.2%}\n")

Results
r = 10 → 67.5 % accuracy
r = 20 → 35.0 % accuracy
r = 30 → 67.5 % accuracy
r = 40 → 40.0 % accuracy
The accuracy does not increase monotonically with r: each run draws a random train/test split, so individual results fluctuate considerably. The broader pattern is that a few tens of components already capture most of the discriminative variance in the ORL faces, while additional components contribute little signal and can degrade nearest-neighbor distances.
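Rather than sweeping r by hand, a common criterion is the cumulative explained-variance ratio of the eigenvalues. The sketch below is not from the original article (the helper name and the synthetic eigenvalue spectrum are mine, for illustration):

```python
import numpy as np

def choose_r(eig_vals, threshold=0.95):
    """Smallest r whose top-r eigenvalues explain >= threshold of the total
    variance. `eig_vals` are assumed sorted in descending order."""
    ratios = np.cumsum(eig_vals) / np.sum(eig_vals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Synthetic eigenvalue spectrum that decays geometrically.
spectrum = 100.0 * 0.5 ** np.arange(10)
print(choose_r(spectrum))  # → 5
```

In the pipeline above, the eigenvalues are already available as `eig_vals` inside `PCA`, so the same criterion could replace the fixed r values.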
Conclusion
PaddlePaddle’s paddle.linalg module provides efficient primitives for eigen‑decomposition, matrix multiplication, and SVD, enabling concise implementations of PCA and related dimensionality‑reduction methods. Such techniques are useful for data compression, feature extraction, and improving the computational efficiency of downstream machine‑learning models.
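As the conclusion notes, SVD offers an equivalent route to PCA: the right singular vectors of the centered data matrix are the principal directions, and the squared singular values equal the eigenvalues of the (unscaled) covariance matrix. A minimal NumPy sketch of that equivalence (synthetic data, not the face pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Xc = X - X.mean(axis=0)  # center, as in PCA

# Eigendecomposition route (ascending eigenvalue order from eigh).
C = Xc.T @ Xc
eig_vals, _ = np.linalg.eigh(C)

# SVD route: squared singular values equal the eigenvalues of Xc.T @ Xc.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.sort(s ** 2), eig_vals))  # True
```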
References
PaddlePaddle linalg API documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/linalg/Overview_cn.html
PCA‑Principal‑Components‑Analysis repository: https://github.com/Gaoshiguo/PCA-Principal-Components-Analysis/tree/master
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
