
Build and Optimize Multiple Linear Regression in Python

This article walks through constructing a multiple linear regression model for house price prediction using Python, covering data exploration, dummy variable creation, model fitting with statsmodels, diagnosing multicollinearity via VIF, and applying optimizations to improve predictive accuracy.

Python Crawling & Data Mining

Aug 28, 2020


Preface

Multiple linear regression is a common entry‑point for machine learning, but it contains many nuances worth studying. This article interleaves theory with code to demonstrate how to build and optimize a multiple linear regression model.

Python Practice

We use a classic house‑price prediction case to illustrate a concise, complete workflow, including accuracy‑boosting techniques.

Data Exploration

The dataset is a cleaned US regional house‑price CSV.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('house_prices.csv')
df.info(); df.head()

neighborhood/area: district and area of the house
bedrooms/bathrooms: number of bedrooms and bathrooms
style: house style

Multiple Linear Regression Modeling

from statsmodels.formula.api import ols
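# The formula-based ols (lower case) adds an intercept automatically;
# the array-based OLS (upper case) does not unless a constant column is added.
# Formula syntax: dependent ~ predictor1 + predictor2 + ...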
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()

In the summary output, the p-value reported for the Intercept can be ignored; attention belongs on the R-squared and on the p-values of the predictors.

Model Optimization

The fit is weak because the categorical variables neighborhood and style have not been used yet. First we inspect their distributions:

# nominal variables
nominal_vars = ['neighborhood', 'style']
for each in nominal_vars:
    print(each, ':')
    print(df[each].value_counts())
    print('='*35)

Creating Dummy Variables

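Categorical variables must be converted to dummy variables before they can enter the regression. The principle: a variable with n levels is split into n 0/1 indicator columns, and one of them is dropped so that the design matrix keeps full rank. The steps are:

1. Select the nominal variable(s) to convert.
2. Create the dummies with pandas.get_dummies.
3. Concatenate the dummies with the original data.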
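
The steps above can be sketched as follows. This is a minimal illustration on a tiny made-up DataFrame (the rows are hypothetical, not taken from house_prices.csv). It assumes the neighborhood levels are A, B, and C, with C dropped as the baseline, which matches the A and B columns and the `results` name used later in this article.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from house_prices.csv.
df = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "A", "C"],
    "area": [1200, 2500, 1800, 900, 2100],
    "price": [300000, 900000, 450000, 250000, 520000],
})

# 1. Select the nominal variable to convert.
# 2. Expand it into 0/1 indicator columns named after its levels (A, B, C).
dummies = pd.get_dummies(df["neighborhood"], dtype=int)

# 3. Drop one level (C here) so the columns stay full rank alongside the
#    intercept, then concatenate the remaining dummies with the original data.
results = pd.concat([df, dummies.drop(columns="C")], axis=1)

print(results)
```

The same pattern would apply to style; which level to drop is a modeling choice, and the dropped level becomes the reference that the other coefficients are compared against.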

After adding dummies, the model accuracy improves, but multicollinearity appears.
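
To see the effect of the dummies on fit quality, the R-squared of the model with and without them can be compared. The sketch below uses synthetic data (an assumption made purely so the example is self-contained; the coefficients and noise level are invented), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the house-price data (all values are invented).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "area": rng.normal(2000, 500, n),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
# Price depends on area and on the neighborhood, plus noise.
offset = df["neighborhood"].map({"A": -10_000, "B": 400_000, "C": 0})
df["price"] = 300 * df["area"] + offset + rng.normal(0, 50_000, n)

# Add 0/1 dummies for A and B; C is the dropped baseline level.
dummies = pd.get_dummies(df["neighborhood"], dtype=int)
results = pd.concat([df, dummies[["A", "B"]]], axis=1)

base = ols("price ~ area", data=results).fit()
with_dummies = ols("price ~ area + A + B", data=results).fit()

print("R-squared without dummies:", round(base.rsquared, 3))
print("R-squared with dummies:   ", round(with_dummies.rsquared, 3))
```

Because the second model nests the first, its in-sample R-squared cannot be lower; the point of interest is how large the gain is.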

We address multicollinearity using the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient is inflated due to correlation with other predictors.
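
Written out, the VIF of the i-th predictor is computed from the R-squared obtained by regressing that predictor on all the other predictors:

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

When the predictor is uncorrelated with the others, R_i^2 = 0 and the VIF equals 1; as R_i^2 approaches 1, the VIF grows without bound. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.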

VIF Detection Function

def vif(df, col_i):
    """df: full DataFrame
    col_i: column name to test"""
    cols = list(df.columns)
    cols.remove(col_i)
    formula = col_i + '~' + '+'.join(cols)
    r2 = ols(formula, df).fit().rsquared
    return 1.0 / (1.0 - r2)

Run detection on selected variables:

test_data = results[['area', 'bedrooms', 'bathrooms', 'A', 'B']]
for i in test_data.columns:
    print(i, '\t', vif(df=test_data, col_i=i))

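The results show high VIFs for bedrooms and bathrooms, which indicates that the two are strongly correlated with each other. Since they largely carry the same information, dropping one of them is enough. We drop bedrooms and refit: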

lm = ols(formula='price ~ area + bathrooms + A + B', data=results).fit()
lm.summary()

Accuracy drops slightly, but removing the multicollinearity improves generalization. Re-run VIF detection on the remaining predictors:

test_data = results[['area', 'bathrooms', 'A', 'B']]
for i in test_data.columns:
    print(i, '\t', vif(df=test_data, col_i=i))

The VIFs confirm that the multicollinearity has been mitigated. Scatter plots and correlation heatmaps offer a visual check for correlated predictors, and implausible coefficients, such as a negative effect of bathrooms on price, are another warning sign.
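
The correlation matrix behind such a heatmap can be computed directly with pandas. The sketch below uses synthetic predictors (invented for illustration); passing the resulting matrix to `sns.heatmap(corr, annot=True)` would draw it, assuming seaborn is installed.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (invented values): bathrooms is built from bedrooms.
rng = np.random.default_rng(1)
n = 500
bedrooms = rng.integers(1, 6, n).astype(float)
predictors = pd.DataFrame({
    "area": rng.normal(2000, 500, n),
    "bedrooms": bedrooms,
    "bathrooms": 0.75 * bedrooms + rng.normal(0, 0.2, n),
})

# Pairwise Pearson correlations; values near +1 or -1 flag redundant pairs.
corr = predictors.corr()
print(corr.round(2))

# List predictor pairs whose absolute correlation exceeds a chosen threshold
# (0.8 is an arbitrary cutoff picked for this illustration).
threshold = 0.8
cols = list(corr.columns)
flagged = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(flagged)
```

Pairwise correlation only catches two-variable redundancy, which is why it complements the VIF rather than replacing it.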

Model Interpretation

The regression model is highly interpretable; printing parameters reveals relationships between dependent and independent variables.

The final model achieves an R-squared of 0.916. Coefficients for continuous variables (e.g., area) are read directly as the change in price per unit change in the variable. The dummy variables A and B are read relative to the dropped baseline, neighborhood C: holding other factors constant, a house in neighborhood A is $8,707.18 cheaper than one in neighborhood C, and a house in neighborhood B is $449,896.73 more expensive.
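
The baseline reading of dummy coefficients can be checked on a toy example. In the sketch below the data is synthetic and noise-free (all numbers are invented, and C is assumed to be the dropped baseline level), so the fitted coefficients for A and B recover the built-in price gaps relative to C.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Invented, noise-free data: price = 50000 + 300 * area + neighborhood gap.
rng = np.random.default_rng(0)
n = 60
area = rng.uniform(800, 3000, n)
hood = np.array(["A", "B", "C"] * 20)
gap = {"A": -10_000.0, "B": 400_000.0, "C": 0.0}  # gaps relative to baseline C

results = pd.DataFrame({
    "area": area,
    "A": (hood == "A").astype(int),
    "B": (hood == "B").astype(int),
    "price": 50_000 + 300 * area + np.array([gap[h] for h in hood]),
})

lm = ols("price ~ area + A + B", data=results).fit()

# Each dummy coefficient is the price difference versus the baseline (C),
# holding area constant; the area coefficient is the price per unit of area.
print(lm.params.round(2))
```

With real, noisy data the coefficients are estimates rather than exact gaps, so their standard errors and p-values in `lm.summary()` should be read alongside them.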

Conclusion

Using multiple linear regression, we built a house‑price model, evaluated predictor significance, mitigated multicollinearity, and refined the model to improve predictive precision. The approach demonstrates how careful preprocessing and diagnostic checks can substantially boost regression performance.
