How to Perform Multiple Linear Regression in R with the Birthweight Dataset
This article outlines the theory of multiple linear regression, shows how to fit such a model in R with the lm() function using the birthwt dataset, and explains how to interpret the output, read the diagnostic plots, and handle categorical variables.
In practical analysis, a single dependent variable is often determined by several independent variables. The multiple linear regression equation can be written in matrix form in terms of a response vector, a predictor matrix (including a column of ones for the constant term), and a coefficient vector. The coefficients can then be estimated by least squares using matrix operations, although doing this by hand is tedious for all but the smallest problems.
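For reference, the matrix form described above is conventionally written as follows. The symbols y, X, β, ε, n, and p are notation chosen here rather than taken from the original, and the closed-form estimator assumes that the matrix X'X is invertible.

\[
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}\mathbf{y}
\]

Here y is the n × 1 response vector, X is the n × (p + 1) predictor matrix whose first column is all ones, β is the (p + 1) × 1 coefficient vector, and ε is the n × 1 vector of errors.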
In R, the lm() function is used for this analysis.
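As a small check that lm() produces the same estimates as the matrix calculation, the sketch below solves the least-squares normal equations directly and compares the result with lm(). It assumes the birthwt data from the MASS package and uses only two predictors (age and lwt) to keep the matrices small; the names X, y, beta_hat, and fit are illustrative. Internally, lm() uses a QR decomposition rather than this formula, so the two results should agree only up to rounding error.

<code>library(MASS)   # provides the birthwt data

# Design matrix with a leading column of ones (the constant term)
X <- model.matrix(bwt ~ age + lwt, data = birthwt)
y <- birthwt$bwt

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Fit the same model with lm() and compare the coefficients
fit <- lm(bwt ~ age + lwt, data = birthwt)
all.equal(as.vector(beta_hat), unname(coef(fit)))
</code>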
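The following example uses the birthwt dataset from the MASS package to study which factors influence newborn weight and to predict weight from those factors.
<code>library(MASS)   # provides the birthwt data set
data(birthwt)
# race is coded 1/2/3, so as.factor() is needed to treat it as categorical
birthwt.lm <- lm(bwt ~ age + lwt + as.factor(race) + smoke + ptl + ht + ui + ftv, data = birthwt)
summary(birthwt.lm)   # coefficient table, R-squared, and overall F-test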
</code>Result:
<code>Call:
lm(formula = bwt ~ age + lwt + as.factor(race) + smoke + ptl +
    ht + ui + ftv, data = birthwt)

Residuals:
     Min       1Q   Median       3Q      Max
-1825.26  -435.21    55.91   473.46  1701.20

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      2927.962    312.904   9.357  < 2e-16 ***
age                -3.570      9.620  -0.371 0.711012
lwt                 4.354      1.736   2.509 0.013007 *
as.factor(race)2 -488.428    149.985  -3.257 0.001349 **
as.factor(race)3 -355.077    114.753  -3.094 0.002290 **
smoke            -352.045    106.476  -3.306 0.001142 **
ptl               -48.402    101.972  -0.475 0.635607
ht               -592.827    202.321  -2.930 0.003830 **
ui               -516.081    138.885  -3.716 0.000271 ***
ftv               -14.058     46.468  -0.303 0.762598
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,	Adjusted R-squared:  0.2047
F-statistic: 6.376 on 9 and 179 DF,  p-value: 7.891e-08
</code>Plotting the results:
<code>par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(birthwt.lm)
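</code>The regression model relates newborn weight (bwt, in grams) to maternal age (age), maternal weight at the last menstrual period (lwt, in pounds), race, smoking status during pregnancy (smoke), number of previous premature labours (ptl), history of hypertension (ht), uterine irritability (ui), and number of physician visits in the first trimester (ftv). Not all of these variables need to be included; variable selection is challenging and is part of model evaluation.
In this example, lwt, race, smoke, ht, and ui have statistically significant coefficients at the 0.05 level, whereas age, ptl, and ftv do not. This is consistent with an ANOVA on the model. Model assessment can use R‑squared and Adjusted R‑squared (here 0.2427 and 0.2047, so the model explains only about a quarter of the variance in birth weight), and optionally AIC or BIC, which summary() does not report.
The diagnostic plots of the residuals help judge model adequacy and identify outliers or high‑leverage points. Such points may be removed to obtain a better fit, provided the removal can be justified.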
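The assessment steps mentioned above can be sketched as follows. This is a minimal illustration rather than a complete model-selection procedure: the reduced model (dropping age, ptl, and ftv) and the 4/n threshold for Cook's distance are assumptions chosen for the example, not taken from the original analysis, and the names reduced.lm and cd are illustrative.

<code>library(MASS)
birthwt.lm <- lm(bwt ~ age + lwt + as.factor(race) + smoke + ptl + ht + ui + ftv,
                 data = birthwt)

# Sequential ANOVA table: terms are tested in the order they enter the formula
anova(birthwt.lm)

# Information criteria (lower is better when comparing models fitted to the same data)
AIC(birthwt.lm)
BIC(birthwt.lm)

# Hypothetical reduced model without the three non-significant terms,
# compared with the full model by a partial F-test and by AIC
reduced.lm <- update(birthwt.lm, . ~ . - age - ptl - ftv)
anova(reduced.lm, birthwt.lm)
AIC(reduced.lm, birthwt.lm)

# Cook's distance flags influential observations;
# 4/n is a common rule of thumb, not a strict cutoff
cd <- cooks.distance(birthwt.lm)
which(cd > 4 / nrow(birthwt))
</code>
Observations flagged this way are candidates for closer inspection, not automatic deletion.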
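Categorical variables are converted to dummy variables with factor() (or as.factor(), as in the formula above). For example, race has three categories (1 = white, 2 = black, 3 = other) and is encoded as two dummy variables. The first level serves as the reference, so the coefficients as.factor(race)2 and as.factor(race)3 are differences in expected birth weight relative to level 1, with the other predictors held fixed. The relevel() function can be used to set a different reference level.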
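A minimal sketch of the dummy coding and of relevel() is shown below. It works on a copy of the data (bw); the labels white, black, and other follow the coding documented for the birthwt data in MASS. The names bw, race_ref, and fit_other, and the choice of a smaller predictor set, are illustrative assumptions.

<code>library(MASS)

# Work on a copy so the original birthwt object is left unchanged
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Dummy coding: the reference level ("white") has no column of its own
head(model.matrix(~ race, data = bw))

# Make "other" the reference level and refit
bw$race_ref <- relevel(bw$race, ref = "other")
fit_other <- lm(bwt ~ lwt + race_ref + smoke + ht + ui, data = bw)
coef(fit_other)
</code>
After releveling, the race coefficients are reported relative to the "other" group instead of the "white" group; the fitted values of the model do not change, only the parameterization.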
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".