Tackling Multicollinearity: Ridge and LASSO Regression Explained with Python
This article explains how multicollinearity undermines ordinary least squares estimates, introduces ridge and LASSO regularization as remedies, and demonstrates their application on a 1966 French economic dataset using Python’s statsmodels, complete with code and interpretation of results.
In multiple linear regression, when the design matrix is not of full column rank, infinitely many coefficient vectors minimize the residual sum of squares, so ordinary least squares has no unique solution; a common remedy is to add a regularization term to the objective. Ridge regression and LASSO regression are the two most popular regularized variants of linear regression.
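Concretely, both estimators penalize the least-squares objective, with \(\lambda \ge 0\) controlling the regularization strength:

```latex
% Ridge regression: squared (L2) penalty on the coefficients
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2

% LASSO: absolute-value (L1) penalty, which can drive coefficients exactly to zero
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
```

Ridge has the closed form \(\hat{\beta}^{\text{ridge}} = (X^{T}X + \lambda I)^{-1}X^{T}y\): adding \(\lambda I\) makes the matrix invertible even when \(X\) is rank-deficient, which is exactly why the method helps under multicollinearity. LASSO has no closed form but its L1 penalty performs variable selection.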
Multicollinearity
In theory, when discussing a multiple linear regression model, the columns of the design matrix are generally required to be linearly independent. When strong linear relationships exist among the column vectors, severe multicollinearity occurs, causing the design matrix to become ill‑conditioned. In such cases, ordinary least squares (OLS) estimates have excessively large variances, making OLS performance unsatisfactory. To address this, statisticians have proposed many improvement methods; ridge regression and LASSO regression are two of the most widely used.
Case Study
Problem
The dataset below, proposed by E. Malinvaud in 1966, concerns the French economy. The dependent variable \(y\) is total imports, and the three explanatory variables \(x_1, x_2, x_3\) are total domestic output, stock formation, and total consumption (all in billions of French francs).
Ordinary least squares applied to this problem yields a regression equation with good statistical diagnostics, yet the coefficient of \(x_1\) comes out negative. This contradicts economic intuition: France is a net importer of raw materials, so an increase in domestic output should raise total imports, and the coefficient should be positive. The sign reversal is caused by multicollinearity among the three explanatory variables.
The correlation matrix of \(x_1, x_2, x_3\) shows pairwise correlations as high as 0.9, indicating near-linear dependence among the predictors; the code below makes one such relationship explicit by regressing \(x_3\) on \(x_1\) in a simple linear regression. Because the predictors move together, \(x_3\) cannot be held constant while \(x_1\) changes, so the usual "all else equal" interpretation of a regression coefficient breaks down, and the sign of a single coefficient cannot be read in isolation.
<code>import numpy as np
import statsmodels.api as sm
a = np.loadtxt("data/economic.txt") # load x1, x2, x3, y (11 rows, 4 columns)
x = a[:,:3] # independent variables matrix
X = sm.add_constant(x) # add intercept column
md = sm.OLS(a[:,3], X).fit() # fit OLS model
b = md.params # regression coefficients
yhat = md.predict(X) # fitted values
print(md.summary()) # model summary
print("Correlation matrix:\n", np.corrcoef(x.T))
X1 = sm.add_constant(a[:,0]) # x1 with intercept column
md1 = sm.OLS(a[:,2], X1).fit() # regress x3 on x1 to expose their near-linear relation
print("Regression coefficients:", md1.params)
</code>
The term "multicollinearity" was introduced by R. Frisch in 1934 to denote linear relationships among explanatory variables. In real economic problems, multicollinearity is common due to the nature of economic variables. Its causes can be grouped into three categories:
Many economic variables are correlated and share common trends (e.g., national income growth leads to simultaneous increases in consumption, savings, and investment). Including several of these correlated variables in a model can generate multicollinearity.
Using lagged explanatory variables can also cause multicollinearity because current and lagged values of economic variables are often highly correlated.
Sample data limitations can induce multicollinearity; even if variables are not linearly related in the population, a narrow or biased sample may exhibit linear dependence.
Definition: The column vectors \(x_1, \dots, x_p\) of the design matrix are said to exhibit multicollinearity if there exist constants \(c_0, c_1, \dots, c_p\), not all zero, such that \(c_0 + c_1 x_1 + \cdots + c_p x_p \approx 0\). When this relation holds exactly, the variables are perfectly collinear and the design matrix is rank-deficient.
References
司守奎, 孙玺菁. Python数学实验与建模 (Si Shoukui, Sun Xijing. Mathematical Experiments and Modeling with Python).
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".