
How to Perform Multiple Linear Regression in R with the Birthweight Dataset

This article outlines the theory of multiple linear regression, fits such a model in R with the lm() function on the birthwt dataset, and discusses how to read the output and the diagnostic plots and how categorical variables are handled.

Model Perspective

In practical analysis, a single dependent variable is often determined by several independent variables. The multiple linear regression equation can be written in matrix form in terms of a response vector, a predictor (design) matrix that includes a column of ones for the constant term, and a coefficient vector. The coefficients can then be estimated with matrix operations, although doing this by hand is tedious for all but the smallest problems.
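For reference, the matrix form described above is usually written as follows. The notation is the standard textbook one rather than something taken from this article: n observations and p predictors are assumed, so X is an n × (p + 1) matrix whose first column is all ones.

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
\]

The second expression is the ordinary least squares estimate, which exists when the columns of X are linearly independent.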

In R, the lm() function is used for this analysis.
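To connect the matrix view with lm(), the following minimal sketch estimates the coefficients by solving the normal equations directly and compares them with what lm() returns. It uses only three predictors (age, lwt, smoke) to keep the example small. That subset and the object names X, y, beta_hat, and fit are choices made for this illustration.

<code>library(MASS)   # provides the birthwt data frame

# Design matrix with a leading column of ones (the constant term).
X <- model.matrix(~ age + lwt + smoke, data = birthwt)
y <- birthwt$bwt

# Solve the normal equations (X'X) b = X'y instead of inverting X'X explicitly.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (internally via a QR decomposition),
# so the two sets of coefficients should agree up to numerical tolerance.
fit <- lm(bwt ~ age + lwt + smoke, data = birthwt)
all.equal(as.vector(beta_hat), unname(coef(fit)))
</code>

In practice lm() is preferable, because the QR approach is numerically more stable than forming X'X.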

Example: use the birthwt dataset (from the MASS package) to study the factors that influence newborn weight and to predict weight from them.
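<code>library(MASS)   # provides the birthwt data frame
data(birthwt)
# race is stored as an integer code, so as.factor() makes lm() treat it as categorical
birthwt.lm <- lm(bwt ~ age + lwt + as.factor(race) + smoke + ptl + ht + ui + ftv, data = birthwt)
summary(birthwt.lm)   # residual summary, coefficient table, R-squared, overall F-test
</code>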

Result:

<code>Call:
lm(formula = bwt ~ age + lwt + as.factor(race) + smoke + ptl + 
    ht + ui + ftv, data = birthwt)

Residuals:
     Min       1Q   Median       3Q      Max 
-1825.26 -435.21   55.91  473.46 1701.20 

Coefficients:
                     Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)          2927.962   312.904   9.357  &lt; 2e-16 ***
age                  -3.570     9.620  -0.371 0.711012    
lwt                   4.354     1.736   2.509 0.013007 *  
as.factor(race)2   -488.428   149.985  -3.257 0.001349 ** 
as.factor(race)3   -355.077   114.753  -3.094 0.002290 ** 
smoke               -352.045   106.476  -3.306 0.001142 ** 
ptl                 -48.402   101.972  -0.475 0.635607    
ht                  -592.827   202.321  -2.930 0.003830 ** 
ui                  -516.081   138.885  -3.716 0.000271 ***
ftv                  -14.058    46.468  -0.303 0.762598    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427, Adjusted R-squared:  0.2047 
F-statistic: 6.376 on 9 and 179 DF,  p-value: 7.891e-08
</code>

Plotting the results:

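<code>par(mfrow = c(2,2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(birthwt.lm)      # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
</code>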

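The regression model relates newborn weight (bwt, in grams) to maternal age (age), the mother's weight at her last menstrual period (lwt, in pounds), race, smoking status during pregnancy (smoke), number of previous premature labours (ptl), history of hypertension (ht), uterine irritability (ui), and number of physician visits in the first trimester (ftv). Not all variables need to be included; deciding which ones to keep is a difficult problem and part of model evaluation. In this example, lwt, race, smoke, ht, and ui have statistically significant coefficients at the 0.05 level, while age, ptl, and ftv do not, a conclusion consistent with an ANOVA on the model. Model assessment can use R-squared, adjusted R-squared, and optionally AIC or BIC (which summary() does not report). Here the R-squared of 0.2427 (adjusted 0.2047) indicates that the model explains only about a quarter of the variance in birth weight. Diagnostic plots of the residuals help judge model adequacy and identify outliers or high-leverage points; such points should be examined and, where justified, may be removed to obtain a better fit.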
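The assessment tools mentioned above can be sketched as follows. The reduced model, which keeps only the terms that were significant in the summary, and the 4/n cutoff for Cook's distance are assumptions made for this illustration; the cutoff is a common rule of thumb, not a value given in the article.

<code>library(MASS)
birthwt.lm <- lm(bwt ~ age + lwt + as.factor(race) + smoke + ptl + ht + ui + ftv, data = birthwt)

# Information criteria: lower values indicate a better fit/complexity trade-off.
AIC(birthwt.lm)
BIC(birthwt.lm)

# Sequential (Type I) F-tests, one row per model term.
anova(birthwt.lm)

# Hypothetical reduced model without age, ptl, and ftv.
reduced.lm <- lm(bwt ~ lwt + as.factor(race) + smoke + ht + ui, data = birthwt)
AIC(birthwt.lm, reduced.lm)     # side-by-side comparison
anova(reduced.lm, birthwt.lm)   # partial F-test for the three dropped terms

# Cook's distance measures how much each observation influences the fit.
cd <- cooks.distance(birthwt.lm)
which(cd > 4 / nrow(birthwt))   # candidates to inspect, not to delete automatically
</code>

Flagged observations are worth checking against the raw data before any decision to exclude them.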

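Categorical variables are converted to dummy variables using factor() or, as in the model above, as.factor(). For example, race has three categories (1 = white, 2 = black, 3 = other) and is encoded as two dummy variables, with the first level (white) serving as the reference. The coefficients as.factor(race)2 and as.factor(race)3 are therefore differences in expected birth weight relative to that reference category. The relevel() function can be used to set a different reference level.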
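A minimal sketch of changing the reference level and then predicting for a new case is shown below. The working copy bw, the choice of "other" as the new reference, the reduced set of predictors, and the values in new_mother are all hypothetical and chosen only for illustration. The race labels follow the coding given in the MASS documentation.

<code>library(MASS)

# Work on a copy so the original birthwt data frame is left untouched.
bw <- birthwt
# Race codes in MASS::birthwt: 1 = white, 2 = black, 3 = other.
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Make "other" the reference level instead of the default first level ("white").
bw$race <- relevel(bw$race, ref = "other")

fit <- lm(bwt ~ lwt + race + smoke + ht + ui, data = bw)
coef(fit)   # the race coefficients are now named racewhite and raceblack

# Hypothetical mother; the factor must use the same levels as the training data.
new_mother <- data.frame(lwt = 130,
                         race = factor("white", levels = levels(bw$race)),
                         smoke = 1, ht = 0, ui = 0)
predict(fit, newdata = new_mother, interval = "prediction")
</code>

Changing the reference level does not change the fitted values, only which category the dummy coefficients are measured against.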

Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
