Build and Optimize Multiple Linear Regression in Python
This article walks through constructing a multiple linear regression model for house price prediction using Python, covering data exploration, dummy variable creation, model fitting with statsmodels, diagnosing multicollinearity via VIF, and applying optimizations to improve predictive accuracy.
Preface
Multiple linear regression is a common entry‑point for machine learning, but it contains many nuances worth studying. This article interleaves theory with code to demonstrate how to build and optimize a multiple linear regression model.
Python Practice
We use a classic house‑price prediction case to illustrate a concise, complete workflow, including accuracy‑boosting techniques.
Data Exploration
The dataset is a cleaned US regional house‑price CSV.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data, then inspect column types and the first rows
df = pd.read_csv('house_prices.csv')
df.info()
df.head()
The main columns are:
neighborhood/area: district and area
bedrooms/bathrooms: number of bedrooms and bathrooms
style: house style
Multiple Linear Regression Modeling
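from statsmodels.formula.api import ols

# The lower-case formula interface `ols` adds an intercept automatically;
# the upper-case `OLS` class does not (add one with statsmodels.api.add_constant).
# Formula syntax: dependent ~ independent, with + joining the predictors.
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()
The summary also reports a p-value for the Intercept term, which can be ignored.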
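To make the intercept remark concrete, here is a minimal sketch in plain NumPy rather than statsmodels. The toy numbers are hypothetical and generated exactly from price = 50 + 2 * area; they are not taken from the article's dataset. Including a column of ones in the design matrix recovers both parameters, while omitting it forces the fitted line through the origin and distorts the slope.

```python
import numpy as np

# Hypothetical toy data generated exactly from price = 50 + 2 * area.
area = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = 50.0 + 2.0 * area

# With an intercept: the design matrix gets a leading column of ones.
X_with = np.column_stack([np.ones_like(area), area])
beta_with, *_ = np.linalg.lstsq(X_with, price, rcond=None)

# Without an intercept: the fitted line is forced through the origin.
X_without = area.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, price, rcond=None)

print("with intercept [intercept, slope]:", beta_with)
print("without intercept [slope]:", beta_without)
```

The formula interface used above adds that column of ones for you, which is why an Intercept row appears in the summary.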
Model Optimization
The low accuracy of this first model stems largely from the categorical variables neighborhood and style, which it does not use yet. First we inspect their distributions:
# nominal variables
nominal_vars = ['neighborhood', 'style']
for each in nominal_vars:
    print(each, ':')
    # Count how many rows fall into each level of the variable
    print(df[each].value_counts())
    print('=' * 35)
Creating Dummy Variables
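Categorical variables must be converted to dummy variables before they can enter the regression. The principle: a variable with n levels is split into n columns coded 0/1, and one of those columns is dropped so that the design matrix keeps full rank. The dropped level becomes the baseline against which the others are compared. The steps are:
Select the nominal variable(s) to convert
Use pandas.get_dummies to create the dummy columns
Concatenate the dummies with the original data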
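The three steps above can be sketched as follows. This is a minimal illustration, not the article's original code: the small DataFrame is a hypothetical stand-in for the house-price data, and the choice of C as the dropped baseline level is an assumption made to match the A and B dummies used later in the article.

```python
import pandas as pd

# Hypothetical miniature stand-in for the house-price data.
df = pd.DataFrame({
    "price": [300000, 450000, 280000, 800000, 310000, 790000],
    "area": [1200, 1800, 1100, 2600, 1250, 2500],
    "neighborhood": ["A", "C", "A", "B", "C", "B"],
})

# Steps 1 and 2: pick the nominal variable and create one 0/1 column per level.
dummies = pd.get_dummies(df["neighborhood"], dtype=int)

# Drop one level (assumed here to be C) so the dummies plus the intercept
# stay full rank; C becomes the baseline for comparison.
dummies = dummies.drop(columns="C")

# Step 3: concatenate the dummies with the original data.
results = pd.concat([df, dummies], axis=1)
print(results)
```

The resulting columns A and B can then be added to the model formula alongside the continuous predictors.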
After adding dummies, the model accuracy improves, but multicollinearity appears.
We address multicollinearity using the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient is inflated due to correlation with other predictors.
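Written out, the VIF of predictor i is the standard quantity below, where R_i^2 is the R-squared obtained by regressing predictor i on all the other predictors. This is the quantity the detection function in the next section computes.

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

A predictor that is uncorrelated with the others has R_i^2 = 0 and a VIF of 1; as R_i^2 approaches 1, the VIF grows without bound. A common rule of thumb treats values above 10 (some authors use 5) as a sign of problematic multicollinearity; the article does not say which threshold it applies.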
VIF Detection Function
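def vif(df, col_i):
    """Return the variance inflation factor of one column.

    df: DataFrame containing the predictors to test
    col_i: name of the column to test
    """
    cols = list(df.columns)
    cols.remove(col_i)
    # Regress the tested column on all remaining predictors
    formula = col_i + ' ~ ' + ' + '.join(cols)
    r2 = ols(formula, df).fit().rsquared
    return 1.0 / (1.0 - r2)
Run detection on selected variables: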
# results: the original data with the dummy columns (A, B) concatenated
test_data = results[['area', 'bedrooms', 'bathrooms', 'A', 'B']]
for i in test_data.columns:
    print(i, '\t', vif(df=test_data, col_i=i))
The results show high VIF values for bedrooms and bathrooms, indicating that the two are strongly correlated with each other. We drop bedrooms and refit:
lm = ols(formula='price ~ area + bathrooms + A + B', data=results).fit()
lm.summary()
Accuracy drops slightly, but removing the collinear predictor makes the coefficient estimates more stable, which should help the model generalize. Re-run VIF detection:
# bedrooms was dropped from the model, so test the remaining predictors
test_data = results[['area', 'bathrooms', 'A', 'B']]
for i in test_data.columns:
    print(i, '\t', vif(df=test_data, col_i=i))
VIF confirms that multicollinearity is mitigated. Visual checks such as scatter plots and heatmaps can also flag correlated predictors, and an implausible coefficient sign, e.g. a negative effect of bathrooms on price, is another warning sign.
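The before-and-after pattern described here can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the article's results: the data are randomly generated so that bathrooms is nearly a copy of bedrooms, and vif_numpy is a hypothetical helper that computes the same 1 / (1 - R^2) quantity with NumPy least squares instead of statsmodels.

```python
import numpy as np

def vif_numpy(X, j):
    """VIF of column j of X, from a least-squares fit that includes an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic predictors (assumption): bathrooms is almost a copy of bedrooms.
rng = np.random.default_rng(0)
n = 500
area = rng.normal(size=n)
bedrooms = rng.normal(size=n)
bathrooms = bedrooms + 0.1 * rng.normal(size=n)

# VIF with all three predictors present.
names = ["area", "bedrooms", "bathrooms"]
X_full = np.column_stack([area, bedrooms, bathrooms])
vif_before = {name: vif_numpy(X_full, j) for j, name in enumerate(names)}

# VIF after dropping bedrooms.
kept = ["area", "bathrooms"]
X_reduced = np.column_stack([area, bathrooms])
vif_after = {name: vif_numpy(X_reduced, j) for j, name in enumerate(kept)}

print("before:", vif_before)
print("after: ", vif_after)
```

In this setup the two correlated columns show very large VIF values at first, and the VIF of bathrooms falls close to 1 once bedrooms is removed.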
Model Interpretation
The regression model is highly interpretable; printing parameters reveals relationships between dependent and independent variables.
The final model achieves an R-squared of 0.916. Coefficients for continuous variables (e.g., area) read directly as the change in price per unit increase. The dummy variables A and B indicate that, holding other factors constant, houses in neighborhood A are $8,707.18 cheaper than those in the baseline neighborhood C, and houses in neighborhood B are $449,896.73 more expensive.
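Because both dummy coefficients are measured against the same baseline (neighborhood C), they can also be compared with each other. The short calculation below uses only the two figures quoted above; the variable names are illustrative.

```python
# Dummy coefficients quoted in the text, each relative to baseline neighborhood C.
coef_A = -8707.18     # A vs. C: A is cheaper
coef_B = 449896.73    # B vs. C: B is more expensive

# The baseline cancels out when two non-baseline levels are compared.
b_vs_a = coef_B - coef_A
print(f"B vs. A: ${b_vs_a:,.2f}")
```

So, under this model and holding the other predictors constant, a house in neighborhood B is predicted to cost about $458,603.91 more than a comparable house in neighborhood A.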
Conclusion
Using multiple linear regression, we built a house‑price model, evaluated predictor significance, mitigated multicollinearity, and refined the model to improve predictive precision. The approach demonstrates how careful preprocessing and diagnostic checks can substantially boost regression performance.