How to Perform Two-Way ANOVA with Python’s statsmodels: Theory and Code
This article explains the theory behind two‑factor ANOVA, distinguishes cases with and without interaction, presents the mathematical model, and demonstrates a complete Python implementation using statsmodels, including data setup, model fitting, and interpretation of the ANOVA table.
Two-Factor ANOVA
When two factors may affect a response variable, a two‑factor ANOVA is used. The basic idea is to select several levels for each factor, conduct experiments for every combination of factor levels, and then analyze the variance of the collected data.
Mathematical Model
Assume factor A has a levels and factor B has b levels. For each combination of levels, the response is assumed to follow a normal distribution whose mean depends on the combination and whose variance \sigma^2 is the same for all combinations. If n replicates are taken at each combination, the observations Y_{ijk} (i = 1, ..., a; j = 1, ..., b; k = 1, ..., n) are independent and normally distributed. The model can be written as:
Y_{ijk}=\mu+\alpha_i+\beta_j+(\alpha\beta)_{ij}+\epsilon_{ijk}
where \mu is the overall mean, \alpha_i is the effect of the i‑th level of factor A, \beta_j is the effect of the j‑th level of factor B, (\alpha\beta)_{ij} is the interaction effect of that pair of levels, and \epsilon_{ijk} is the random error, assumed to be independent N(0, \sigma^2).
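The article does not spell out the side conditions or the hypotheses being tested. One common convention, given here as an assumption and not as the source's own statement, is to impose sum-to-zero constraints so that the effects are uniquely defined, and then test three null hypotheses:

```latex
\sum_{i=1}^{a}\alpha_i = 0,\qquad
\sum_{j=1}^{b}\beta_j = 0,\qquad
\sum_{i=1}^{a}(\alpha\beta)_{ij} = 0 \ \text{for each } j,\qquad
\sum_{j=1}^{b}(\alpha\beta)_{ij} = 0 \ \text{for each } i

H_{0A}:\ \alpha_1=\dots=\alpha_a=0,\qquad
H_{0B}:\ \beta_1=\dots=\beta_b=0,\qquad
H_{0AB}:\ (\alpha\beta)_{ij}=0 \ \text{for all } i,j
```

statsmodels formulas use treatment (reference-level) coding by default instead of sum-to-zero coding, so the fitted coefficients differ from these constrained effects, but the F-tests in the ANOVA table address the same hypotheses.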
Two-Way ANOVA Without Interaction
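If prior knowledge indicates that the two factors do not interact, the experiment can be performed without replication (one observation per combination of levels), which simplifies the analysis. With a single observation per cell, an interaction effect could not be separated from random error anyway, so the no-interaction assumption is essential here. The model reduces to: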
Y_{ij}=\mu+\alpha_i+\beta_j+\epsilon_{ij}
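The total sum of squares is decomposed into the sums of squares for factor A, factor B, and error, SS_T = SS_A + SS_B + SS_E, with a-1, b-1, and (a-1)(b-1) degrees of freedom respectively. Each mean square is the sum of squares divided by its degrees of freedom, and the test statistics are the ratios of each factor’s mean square to the error mean square: F_A = MS_A / MS_E and F_B = MS_B / MS_E. Under the null hypothesis that a factor has no effect, its statistic follows an F‑distribution with (a-1, (a-1)(b-1)) degrees of freedom for factor A and (b-1, (a-1)(b-1)) for factor B.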
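The decomposition can be sketched directly with NumPy. The data table below is hypothetical and exists only to illustrate the arithmetic; the variable names are likewise illustrative, and SciPy is assumed to be available for the F‑distribution tail probability.

```python
import numpy as np
from scipy import stats

# Hypothetical data: rows are the a = 3 levels of factor A,
# columns are the b = 4 levels of factor B, one observation per cell.
y = np.array([[ 9.0, 11.0, 12.0, 10.0],
              [ 7.0,  9.0, 10.0,  8.0],
              [12.0, 13.0, 15.0, 12.0]])
a, b = y.shape

grand = y.mean()
row_mean = y.mean(axis=1)  # level means of factor A
col_mean = y.mean(axis=0)  # level means of factor B

# Sums of squares for the additive (no-interaction) model
ss_total = ((y - grand) ** 2).sum()
ss_A = b * ((row_mean - grand) ** 2).sum()
ss_B = a * ((col_mean - grand) ** 2).sum()
resid = y - row_mean[:, None] - col_mean[None, :] + grand
ss_E = (resid ** 2).sum()

# Degrees of freedom, F statistics, and upper-tail p-values
df_A, df_B, df_E = a - 1, b - 1, (a - 1) * (b - 1)
ms_E = ss_E / df_E
F_A = (ss_A / df_A) / ms_E
F_B = (ss_B / df_B) / ms_E
p_A = stats.f.sf(F_A, df_A, df_E)
p_B = stats.f.sf(F_B, df_B, df_E)

print(f"SS_T={ss_total:.3f}  SS_A={ss_A:.3f}  SS_B={ss_B:.3f}  SS_E={ss_E:.3f}")
print(f"F_A={F_A:.3f} (p={p_A:.4f})  F_B={F_B:.3f} (p={p_B:.4f})")
```

The residual term is computed from the fitted additive model and not by subtraction, so comparing SS_T with SS_A + SS_B + SS_E is a quick check that the decomposition holds.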
Two-Way ANOVA With Interaction
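When interaction may exist, the full model includes the interaction term, and each combination of levels must be replicated (n ≥ 2) so that interaction can be separated from random error. The total sum of squares is partitioned into four components: factor A sum of squares, factor B sum of squares, interaction sum of squares, and error sum of squares, SS_T = SS_A + SS_B + SS_{AB} + SS_E, with a-1, b-1, (a-1)(b-1), and ab(n-1) degrees of freedom respectively. Each component’s mean square is compared with the error mean square, and the resulting ratio is referred to an F‑distribution to test significance.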
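As a rough illustration, the four-way partition can be computed by hand for a balanced design. The sketch below reuses the yield data from the implementation section of this article, reshaped on the assumption that each consecutive pair of rows holds the two replicates of one concentration level; the variable names are illustrative, and SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Yield data from the article's example, arranged as
# (concentration level, replicate, temperature level).
y = np.array([[11, 11, 13, 10], [10, 11, 9, 12],
              [9, 10, 7, 6], [7, 8, 11, 10],
              [5, 13, 12, 14], [11, 14, 13, 10]], dtype=float).reshape(3, 2, 4)
b, n, a = y.shape  # b concentrations, n replicates, a temperatures

grand = y.mean()
mean_A = y.mean(axis=(0, 1))  # temperature means, shape (a,)
mean_B = y.mean(axis=(1, 2))  # concentration means, shape (b,)
mean_cell = y.mean(axis=1)    # cell means, shape (b, a)

# Four-component partition of the total sum of squares
ss_total = ((y - grand) ** 2).sum()
ss_A = b * n * ((mean_A - grand) ** 2).sum()
ss_B = a * n * ((mean_B - grand) ** 2).sum()
ss_AB = n * ((mean_cell - mean_B[:, None] - mean_A[None, :] + grand) ** 2).sum()
ss_E = ((y - mean_cell[:, None, :]) ** 2).sum()

df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "E": a * b * (n - 1)}
ms_E = ss_E / df["E"]
for name, ss in [("A", ss_A), ("B", ss_B), ("AB", ss_AB)]:
    F = (ss / df[name]) / ms_E
    p = stats.f.sf(F, df[name], df["E"])  # upper-tail probability
    print(f"{name}: SS={ss:.4f}  df={df[name]}  F={F:.4f}  p={p:.4f}")
print(f"E: SS={ss_E:.4f}  df={df['E']}")
```

Because the design is balanced, these hand-computed sums of squares should agree with the table that `anova_lm` prints for the same data.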
Implementation with statsmodels
The example below uses a chemical process measured at three concentration levels and four temperature levels, with two replicate yield measurements at each combination (24 observations in total). It tests whether the yield differs significantly across temperatures (factor A), across concentrations (factor B), and whether there is a significant interaction between the two.
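import numpy as np
import statsmodels.api as sm

# Yields as a 6 x 4 table: columns are the four temperature levels,
# each consecutive pair of rows is one concentration level (2 replicates per cell)
y = np.array([[11, 11, 13, 10], [10, 11, 9, 12],
              [9, 10, 7, 6], [7, 8, 11, 10],
              [5, 13, 12, 14], [11, 14, 13, 10]]).flatten()
A = np.tile(np.arange(1, 5), (6, 1)).flatten()                # temperature level (1-4) of each observation
B = np.tile(np.arange(1, 4).reshape(3, 1), (1, 8)).flatten()  # concentration level (1-3) of each observation
d = {'x1': A, 'x2': B, 'y': y}

model = sm.formula.ols("y~C(x1)+C(x2)+C(x1):C(x2)", d).fit()  # Note the syntax for interaction terms
anovat = sm.stats.anova_lm(model)  # Perform two‑factor ANOVA
print(anovat)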
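The resulting ANOVA table has one row each for factor A (C(x1)), factor B (C(x2)), their interaction (C(x1):C(x2)), and the residual error. Its columns give the degrees of freedom (df), sums of squares (sum_sq), mean squares (mean_sq), F‑statistics (F), and p‑values (PR(>F)). An effect is judged significant when its p‑value falls below the chosen significance level.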
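A minimal sketch of reading the table programmatically follows. It refits the same model from a pandas DataFrame and compares each p‑value with a significance level of 0.05; that threshold is an assumption made for illustration, since the article does not fix one. The formula `C(x1) * C(x2)` is shorthand for both main effects plus their interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Same data and factor coding as the article's example
y = np.array([[11, 11, 13, 10], [10, 11, 9, 12],
              [9, 10, 7, 6], [7, 8, 11, 10],
              [5, 13, 12, 14], [11, 14, 13, 10]]).flatten()
data = pd.DataFrame({
    "x1": np.tile(np.arange(1, 5), 6),    # temperature level, cycles 1..4
    "x2": np.repeat(np.arange(1, 4), 8),  # concentration level, 8 obs each
    "y": y,
})

model = smf.ols("y ~ C(x1) * C(x2)", data=data).fit()
table = anova_lm(model)

alpha = 0.05  # assumed significance level
for term in ["C(x1)", "C(x2)", "C(x1):C(x2)"]:
    F = table.loc[term, "F"]
    p = table.loc[term, "PR(>F)"]
    verdict = "significant" if p < alpha else "not significant"
    print(f"{term}: F = {F:.3f}, p = {p:.4f} -> {verdict} at alpha = {alpha}")
```

On this data, only the concentration factor C(x2) falls below the 0.05 threshold; temperature and the interaction do not.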
Reference: Si Shoukui (司守奎) and Sun Xijing (孙玺菁), Python数学实验与建模 (Python Mathematical Experiments and Modeling)
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".