Master Factor Analysis in Python: From Theory to Practical Implementation
This article explains the origins and core concepts of factor analysis, outlines its algorithmic steps, demonstrates how to perform the analysis using Python's factor_analyzer library—including data preparation, adequacy tests, eigenvalue selection, rotation, and visualization—culminating in extracting new latent variables.
Origin
Factor analysis originated in 1904, when the British psychologist Charles Spearman observed strong correlations among students' English, French, and classical language scores and hypothesized a common underlying factor, which he called "language ability". This insight led to the definition of factor analysis as a method for uncovering hidden common factors behind correlated variables.
Basic Idea
The basic idea is illustrated with a student who scores perfectly in mathematics, physics, chemistry, and biology, suggesting a strong "rational thinking" factor that drives high scores in science subjects. Factor analysis assumes that observed variables are generated by one or more latent variables (factors) that cannot be measured directly.
It reduces a set of correlated variables to a smaller number of factors that represent the original variables and can be used for classification.
Algorithm Uses
Factor analysis, similar to principal component analysis, aims to describe hidden, unobservable variables underlying a set of measured variables and can be used for comprehensive evaluation.
By exploiting correlations among indicators, it infers latent common factors that jointly influence the indicators, reducing the number of variables while preserving essential information.
Steps of Factor Analysis
Standardize the data sample.
Compute the correlation matrix R.
Obtain eigenvalues and eigenvectors of R.
Determine the number of principal factors based on cumulative contribution.
Calculate the factor loading matrix A.
Finalize the factor model.
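As a sketch under stated assumptions (synthetic data with one built-in latent factor; the 80% threshold is illustrative), the six steps above can be carried out directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample: one latent factor drives four observed variables
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[0.9, 0.8, 0.7, 0.6]]) + 0.4 * rng.normal(size=(200, 4))

# 1. Standardize the data sample
Z = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Compute the correlation matrix R
R = np.corrcoef(Z, rowvar=False)
# 3. Eigenvalues and eigenvectors of R, sorted in descending order
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Number of factors from cumulative contribution (80% threshold here)
cum = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cum, 0.80)) + 1
# 5. Loading matrix A: eigenvectors scaled by sqrt of their eigenvalues
A = eigvecs[:, :k] * np.sqrt(eigvals[:k])
# 6. The finalized model keeps the first k factors
print(k, A.shape)
```

This is the principal-component style of extraction; the factor_analyzer library used below wraps a similar pipeline behind a scikit-learn-like interface.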
factor_analyzer Library
The core Python library for factor analysis is factor_analyzer, whose main modules are:
factor_analyzer.factor_analyzer (key module: the FactorAnalyzer class plus the adequacy tests calculate_bartlett_sphericity and calculate_kmo)
factor_analyzer.rotator (the Rotator class used for factor rotation)
Detailed Example
Using a student grades dataset, the following code demonstrates the workflow.
<code># Data processing
import pandas as pd
import numpy as np
# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # SimHei font so Chinese labels render
plt.rcParams['axes.unicode_minus'] = False # Fix minus sign display
# Factor analysis
from factor_analyzer import FactorAnalyzer
</code>Load the data:
<code>df = pd.read_excel('data/grades2.xlsx', index_col=0).iloc[:, :-3]
df = df.dropna()
df.head()
</code>Adequacy Tests
Before performing factor analysis, test the adequacy of the correlation matrix.
Bartlett's Sphericity Test
<code>from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value
</code>Result: (638.4879, 2.33e-126). The p-value is effectively zero, so we reject the null hypothesis that the correlation matrix is an identity matrix: the variables are correlated enough for factor analysis.
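The statistic itself is easy to reproduce. A minimal sketch on synthetic stand-in data (the grades file is assumed unavailable here), using Bartlett's formula chi2 = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic correlated data standing in for the grades table
latent = rng.normal(size=(150, 1))
X = latent + 0.5 * rng.normal(size=(150, 4))

n, p = X.shape
R = np.corrcoef(X, rowvar=False)
# Bartlett's sphericity statistic and its chi-squared p-value
chi_square = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
dof = p * (p - 1) / 2
p_value = stats.chi2.sf(chi_square, dof)
print(chi_square, p_value)
```

Strongly correlated columns give a near-singular R, a small determinant, and hence a large statistic with a tiny p-value, exactly the pattern seen in the result above.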
KMO Test
<code>from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df)
kmo_model
</code>Result: 0.8849 (>0.6), confirming suitability for factor analysis.
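The KMO measure compares raw correlations with partial correlations: if the partials stay small, the correlations are plausibly driven by shared factors. A minimal sketch on synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic correlated data standing in for the grades table
latent = rng.normal(size=(200, 1))
X = latent + 0.5 * rng.normal(size=(200, 4))

R = np.corrcoef(X, rowvar=False)
# Partial correlations come from the inverse correlation matrix
Rinv = np.linalg.inv(R)
scale = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
partial = -Rinv / scale

off = ~np.eye(R.shape[0], dtype=bool)  # off-diagonal mask
r2 = (R[off] ** 2).sum()
q2 = (partial[off] ** 2).sum()
kmo = r2 / (r2 + q2)  # approaches 1 when partials are small
print(kmo)
```

By convention, values above 0.6 are considered adequate and values above 0.8 good, so the 0.8849 reported above comfortably clears the bar.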
Selecting the Number of Factors
Compute eigenvalues of the correlation matrix and sort them in descending order.
Eigenvalues and Eigenvectors
<code># Extract as many factors as there are variables, without rotation
faa = FactorAnalyzer(df.shape[1], rotation=None)
faa.fit(df)
# get_eigenvalues returns the original eigenvalues and the common-factor eigenvalues
ev, cfev = faa.get_eigenvalues()
ev
</code>Eigenvalues: [3.7605, 0.7315, 0.4438, 0.3891, 0.3708, 0.3043]
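Two common selection rules can be checked directly on the eigenvalues reported above: the Kaiser criterion (keep factors whose eigenvalue exceeds 1) and the cumulative contribution of the leading factors:

```python
import numpy as np

# Eigenvalues of the correlation matrix reported above
ev = np.array([3.7605, 0.7315, 0.4438, 0.3891, 0.3708, 0.3043])

# Kaiser criterion: number of eigenvalues greater than 1
kaiser_k = int((ev > 1).sum())
# Contribution of each factor and cumulative contribution of the first two
contrib = ev / ev.sum()
cum_two = contrib[:2].sum()
print(kaiser_k, cum_two)
```

The Kaiser rule alone would keep a single factor here; keeping two, as the article does below, raises the cumulative contribution to roughly 75%.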
Visualization
<code># Scatter and line plot of eigenvalues
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
</code>Factor Rotation
Building the Factor Model
Choose varimax (maximum variance) rotation with two factors.
<code># Choose varimax rotation with 2 factors
faa_two = FactorAnalyzer(2, rotation='varimax')
faa_two.fit(df)
# Communalities (shared variance)
faa_two.get_communalities()
</code>Communalities: [0.5189, 0.6104, 0.6212, 0.6098, 0.6657, 0.6816]
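A communality is the share of a variable's variance explained by the retained factors, i.e. the row sum of its squared loadings. A quick check with an illustrative (not fitted) loading matrix:

```python
import numpy as np

# Illustrative loadings for three variables on two factors (hypothetical values)
loadings = np.array([
    [0.70, 0.15],
    [0.10, 0.75],
    [0.60, 0.40],
])
# Communality per variable: row sum of squared loadings
communalities = (loadings ** 2).sum(axis=1)
print(communalities)
```

Low communalities flag variables that the chosen factors explain poorly; the values above (all around 0.5 to 0.7) indicate a reasonable two-factor fit.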
Other rotation options include varimax, promax, oblimin, oblimax, quartimin, quartimax, and equamax; varimax, oblimax, quartimax, and equamax are orthogonal rotations, while promax, oblimin, and quartimin are oblique.
Factor Variance
<code>faa_two.get_factor_variance()
</code>This returns three arrays, one entry per factor: the variance (sum of squared loadings), the proportional variance, and the cumulative variance.
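These three quantities can be recomputed from the loading matrix itself; a sketch with an illustrative loading matrix (hypothetical values, not the fitted ones):

```python
import numpy as np

# Illustrative loadings: 6 variables, 2 factors (hypothetical values)
loadings = np.array([
    [0.7, 0.2], [0.6, 0.3], [0.8, 0.1],
    [0.2, 0.7], [0.3, 0.6], [0.1, 0.8],
])
n_vars = loadings.shape[0]

# Variance: sum of squared loadings per factor
ss_loadings = (loadings ** 2).sum(axis=0)
# Proportional variance: share of total variance explained by each factor
proportion = ss_loadings / n_vars
# Cumulative variance across the factors
cumulative = np.cumsum(proportion)
print(ss_loadings, proportion, cumulative)
```

The last entry of the cumulative array is the overall share of variance the retained factors explain, the same figure used earlier to choose the number of factors.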
Visualizing Latent Variables
Heatmap of absolute factor loadings to see which variables relate strongly to each latent factor.
<code>df1 = pd.DataFrame(np.abs(faa_two.loadings_), index=df.columns)
ax = sns.heatmap(df1, annot=True, cmap="BuPu")
ax.yaxis.set_tick_params(labelsize=15)
plt.title('Factor Analysis', fontsize='xx-large')
plt.ylabel('Feature', fontsize='xx-large')
plt.show()
</code>Transforming to New Variables
Convert the original data into the two extracted factors.
<code>df2 = pd.DataFrame(faa_two.transform(df))
</code>The resulting table shows the scores for each observation on the two latent factors.
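A common follow-up for comprehensive evaluation (not part of the original post) is to combine the factor scores into a single ranking, weighting each factor by its variance contribution. A sketch with hypothetical scores and weights:

```python
import numpy as np
import pandas as pd

# Hypothetical factor scores for four students on the two factors
scores = pd.DataFrame(
    [[1.2, -0.3], [0.1, 0.8], [-0.9, 0.2], [-0.4, -0.7]],
    columns=['factor1', 'factor2'],
)
# Hypothetical variance proportions of the two factors
weights = np.array([0.62, 0.12])

# Composite score: variance-weighted combination of the factor scores
scores['composite'] = scores[['factor1', 'factor2']] @ (weights / weights.sum())
print(scores.sort_values('composite', ascending=False))
```

In practice the weights would come from get_factor_variance() and the scores from transform(df); the normalization makes the weights sum to one so the composite stays on the same scale as the factor scores.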
Reference
Author: 洋洋菜鸟 – https://blog.csdn.net/qq_25990967/article/details/122566533
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".