Mastering Regression: Key Assumptions, Metrics, and Model Evaluation
This article explains the fundamental assumptions of linear regression, compares linear and nonlinear models, discusses multicollinearity, outliers, regularization, heteroscedasticity, VIF, stepwise regression, and reviews essential evaluation metrics such as MAE, MSE, RMSE, R² and Adjusted R².
Regression analysis provides a solid foundation for many machine learning algorithms. This article summarizes ten important regression problems and five key evaluation metrics.
Assumptions of Linear Regression
Linear regression has four assumptions:
Linearity: the relationship between independent variable (x) and dependent variable (y) should be linear.
Independence: features should be independent, minimizing multicollinearity.
Normality: residuals should follow a normal distribution.
Homoscedasticity: variance of data points around the regression line should be constant for all values.
Concept of Residuals
Residuals are the differences between observed and predicted values, measuring the distance of data points from the regression line.
A residual plot is a good way to assess a regression model; random scatter without patterns indicates a suitable linear model.
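As a minimal sketch of this check, the snippet below (using synthetic linear data, an assumption for illustration) fits a line with least squares and inspects the residuals; for a well-specified linear model they center on zero with no systematic pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
# Synthetic data with a true linear relationship plus noise.
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# Fit a simple linear model by least squares.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With an intercept term, least-squares residuals average to zero;
# a residual plot should show only random scatter around this line.
print(round(residuals.mean(), 6))
```

In practice one would plot `residuals` against the fitted values and look for curvature or funnel shapes rather than relying on the mean alone.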
Linear vs. Non‑Linear Regression Models
Both address regression problems but differ in the relationship they assume between features and target.
Linear models assume a linear relationship and fit a straight line (or hyperplane), while non‑linear models make no such assumption and fit curves.
Two straightforward ways to check linearity:
Residual plot
Scatter plot
Multicollinearity
Multicollinearity occurs when some features are highly correlated, making it difficult for the model to learn distinct patterns and degrading performance. It should be mitigated before training.
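A quick first check for multicollinearity is the pairwise correlation matrix. The sketch below (synthetic features, chosen for illustration) flags a near-duplicate column by its correlation close to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                     # independent feature

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)

# A pairwise correlation near 1 flags redundant, collinear features.
print(round(corr[0, 1], 2))
```

Pairwise correlation misses collinearity involving three or more features; the VIF discussed later in this article handles that case.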
Impact of Outliers
Outliers are data points far from the average range.
Outliers pull the best‑fit line toward them, increasing error rates and resulting in high MSE.
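The pull of a single outlier on the fitted line can be demonstrated directly. In this sketch (toy data, an assumption for illustration), corrupting one observation sharply inflates the MSE of the refit line.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 3.0 * x + rng.normal(0, 1.0, size=x.size)

def fit_mse(x, y):
    # Fit a line by least squares and return its mean squared error.
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    return np.mean((y - pred) ** 2)

mse_clean = fit_mse(x, y)

# Inject one extreme outlier and refit: the line is pulled toward it.
y_out = y.copy()
y_out[-1] += 100.0
mse_outlier = fit_mse(x, y_out)

print(mse_clean < mse_outlier)  # True: one outlier inflates MSE sharply
```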
MSE and MAE
MSE (Mean Squared Error) measures the squared difference between actual and predicted values; MAE (Mean Absolute Error) measures the absolute difference. MSE penalizes large errors more heavily, while MAE is more robust to outliers.
L1 and L2 Regularization
When data are scarce, basic linear regression tends to overfit; L1 (Lasso) and L2 (Ridge) regularization help mitigate this.
L1 adds the absolute value of coefficients as a penalty, driving some coefficients exactly to zero and thereby performing feature selection.
L2 adds the squared magnitude of coefficients as a penalty, shrinking large coefficients.
Both are useful when training data are limited, variance is high, features outnumber observations, or multicollinearity exists.
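As a minimal NumPy sketch of the L2 case, ridge regression has the closed form w = (XᵀX + αI)⁻¹Xᵀy, and increasing α shrinks the coefficient vector. The data below are synthetic (an assumption for illustration); Lasso (L1) has no closed form and is typically fit iteratively, e.g. by coordinate descent as in scikit-learn's `Lasso`.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 10                       # few observations, many features
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:2] = [5.0, -3.0]            # only two informative features
y = X @ true_w + rng.normal(0, 0.5, size=n)

def ridge(X, y, alpha):
    # L2 closed form: w = (X'X + alpha * I)^-1 X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)    # plain least squares
w_l2 = ridge(X, y, 10.0)    # penalized: coefficients are shrunk

print(np.linalg.norm(w_l2) < np.linalg.norm(w_ols))  # True by construction
```

The shrinkage is guaranteed: the norm of the ridge solution is non-increasing in α, which is what stabilizes the model when features outnumber observations.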
Heteroscedasticity
Heteroscedasticity means the variance of data points around the best‑fit line varies across the range, leading to uneven residual dispersion and unreliable predictions. Plotting residuals helps detect it.
Large variance differences often arise from features with vastly different scales (e.g., a column ranging from 1 to 100 000).
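Beyond eyeballing a residual plot, a crude numeric check is to compare residual spread across the range of the fitted values. The sketch below generates deliberately heteroscedastic data (an assumption for illustration) and shows the residual standard deviation growing along the range; formal tests such as Breusch–Pagan exist for a rigorous version of this check.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 200)
# Synthetic heteroscedastic data: noise standard deviation grows with x.
y = 2.0 * x + rng.normal(0, 0.2 * x)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Uneven residual spread across the range signals heteroscedasticity.
low, high = residuals[:100], residuals[100:]
print(round(high.std() / low.std(), 1))  # noticeably above 1
```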
Variance Inflation Factor (VIF)
VIF quantifies how well a variable can be predicted from the other variables: it equals 1/(1 − R²), where R² comes from regressing that variable on the rest. A high VIF (commonly above 5 or 10) indicates strong multicollinearity, and such variables should be removed or combined.
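A minimal NumPy sketch of this definition, computing 1/(1 − R²) directly (synthetic data, chosen so one pair of features is nearly collinear):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the remaining columns; VIF = 1/(1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent feature
X = np.column_stack([x1, x2, x3])

# x1 is well predicted by x2 (high VIF); x3 is not (VIF near 1).
print(vif(X, 0) > 10, vif(X, 2) < 5)
```

In practice, `statsmodels` provides `variance_inflation_factor` for the same computation.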
Stepwise Regression
Stepwise regression iteratively adds or removes predictors based on statistical significance, aiming to minimize error between observed and predicted values while efficiently handling high‑dimensional data.
Evaluation Metrics
Using a regression example (predicting salary from work experience), the following metrics are introduced:
Mean Absolute Error (MAE)
MAE is the average absolute difference between actual and predicted values; lower values indicate better models.
Advantages: easy to interpret, same unit as output, relatively robust to outliers.
Disadvantages: the absolute value function is not differentiable at zero, which complicates its use as a loss function with gradient‑based optimizers.
Mean Squared Error (MSE)
MSE squares the differences before averaging; it is differentiable everywhere, making it suitable as a loss function.
Advantages: differentiable, usable as loss.
Disadvantages: units are squared, harder to interpret; sensitive to outliers.
Root Mean Squared Error (RMSE)
RMSE is the square root of MSE, restoring the original unit while still being sensitive to outliers.
The choice among MAE, MSE, and RMSE depends on the problem context.
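The three metrics can be computed side by side on the salary example from above; the concrete numbers here are toy values chosen for illustration.

```python
import numpy as np

# Toy salary example: actual vs predicted salary (in thousands).
y_true = np.array([30.0, 45.0, 60.0, 80.0, 100.0])
y_pred = np.array([32.0, 44.0, 55.0, 85.0, 90.0])

err = y_true - y_pred
mae = np.mean(np.abs(err))   # same unit as salary, robust to outliers
mse = np.mean(err ** 2)      # squared unit, penalizes large errors more
rmse = np.sqrt(mse)          # back to the original unit

print(mae, mse, rmse)
```

RMSE is always at least as large as MAE, and the gap widens as the errors become more uneven, which is one way to spot the influence of outliers.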
R² Score
R² is at most 1, indicating goodness of fit. An R² of 0 means the model performs no better than always predicting the mean; 1 means a perfect fit; negative values (possible for very poor models or on held‑out data) indicate performance worse than predicting the mean.
R² can increase or stay constant as more features are added, even if they are irrelevant.
Adjusted R² Score
Adjusted R² accounts for the number of predictors, penalizing the inclusion of irrelevant features and providing a more reliable measure of model performance.
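Concretely, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below (synthetic one-feature data, an assumption for illustration) shows the penalty growing as p increases with no gain in fit.

```python
import numpy as np

def r2_adjusted(y_true, y_pred, p):
    """Return (R^2, adjusted R^2) with p the number of predictors."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2, 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(0, 1, size=n)
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

r2, adj_r2_1 = r2_adjusted(y, pred, p=1)
_, adj_r2_5 = r2_adjusted(y, pred, p=5)  # as if 4 useless predictors were added

print(adj_r2_5 < adj_r2_1 <= r2)  # True: more predictors -> bigger penalty
```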
References:
https://mp.weixin.qq.com/s/Sx1lf2Ia6FPblTQdC7-fdg
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".