R programming for beginners (GV900)

Lesson 17: Multiple linear regression

Monday, January 29, 2024

Video of Lesson 17

1 Outline

  • Basics of multiple linear regression

  • Multiple linear regression with R

  • Interpretation of multiple linear regression

  • Model specification

library(tidyverse)

2 Basics of multiple linear regression

  • Multiple linear regression is an extension of simple linear regression, which models an outcome Y with a single predictor X_1:

\[ Y = \beta_0 + \beta_1 X_1 + \epsilon \]

  • Example: We want to predict the miles per gallon (mpg) of a car based on its weight (wt). The model is expressed as follows:

\[ \text{mpg} = \beta_0 + \beta_1 \text{wt} + \epsilon \]
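As a minimal sketch of this simple model, the call below fits mpg on wt using the built-in mtcars data. The object name fit_simple is chosen here for illustration only.

```r
# Simple linear regression: one outcome (mpg), one predictor (wt, in 1000 lbs)
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept (beta_0) and slope for wt (beta_1)
coef(fit_simple)
```

The slope on wt is the only predictor effect this model can estimate; everything else that affects mpg ends up in the error term.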

Why do we need multiple linear regression?

In the real world, an outcome is rarely affected by only one factor; it usually depends on more than one predictor variable. For example, the miles per gallon of a car may be affected mainly by its weight, but it is also affected by other factors, such as rear axle ratio and horsepower. Therefore, we need to include more than one predictor variable in the model.

  • To run multiple linear regression, we just add more predictors to the model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]

  • Example: We want to predict the miles per gallon (mpg) of a car based on its weight (wt), horsepower (hp), rear axle ratio (drat), and transmission type (am: automatic or manual). The model is expressed as follows:

\[ \text{mpg} = \beta_0 + \beta_1 \text{wt} + \beta_2 \text{hp} + \beta_3 \text{drat} + \beta_4 \text{am} + \epsilon \]
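To see how the categorical predictor am enters this equation, the sketch below builds the design matrix R would use for the model. It assumes am is converted to a factor first (in mtcars, 0 = automatic and 1 = manual); the names cars and X are illustrative only.

```r
# Convert am to a factor so R treats it as categorical (0 = automatic, 1 = manual)
cars <- transform(mtcars, am = factor(am))

# Design matrix for the four-predictor model: one column per beta
X <- model.matrix(mpg ~ wt + hp + drat + am, data = cars)

colnames(X)  # the factor am appears as a single 0/1 dummy column, am1
head(X, 3)   # first three rows of the design matrix
```

The column am1 equals 1 for manual cars and 0 for automatic cars, so its coefficient is the difference in predicted mpg between the two transmission types, holding the other predictors fixed.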

3 Multiple linear regression with R

  • We use the lm() function to run multiple linear regression in R
mtcars |> 
  mutate(am = factor(am)) |> # convert am to factor
  lm(mpg ~ wt + hp + drat + am, data = _) |>
  summary()

Call:
lm(formula = mpg ~ wt + hp + drat + am, data = mutate(mtcars, 
    am = factor(am)))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2882 -1.7531 -0.6827  1.1691  5.5211 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.027077   6.185177   4.855  4.5e-05 ***
wt          -2.726092   0.937791  -2.907 0.007209 ** 
hp          -0.036373   0.009814  -3.706 0.000958 ***
drat         0.981018   1.377101   0.712 0.482341    
am1          1.578521   1.559281   1.012 0.320363    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.56 on 27 degrees of freedom
Multiple R-squared:  0.8428,    Adjusted R-squared:  0.8196 
F-statistic:  36.2 on 4 and 27 DF,  p-value: 1.75e-10
  • We can interpret the model as follows:

    • The coefficient of weight is -2.73, which means that the predicted miles per gallon of a car decreases by 2.73 for each additional 1000 pounds of weight, all other things equal. The coefficient is statistically significant (p < 0.01), so the weight of a car is a statistically significant predictor of its miles per gallon.

    • The coefficient of horsepower is -0.036, which means that the predicted miles per gallon of a car decreases by about 3.6 for each additional 100 horsepower, all other things equal. The coefficient is statistically significant (p < 0.001), so the horsepower of a car is a statistically significant predictor of its miles per gallon.

    • The coefficient of rear axle ratio is 0.98, which means that the predicted miles per gallon of a car increases by about 1 for each additional unit of rear axle ratio, all other things equal. However, the coefficient is not statistically significant (p = 0.48).

    • The coefficient of transmission type (am1) is 1.58, which means that the predicted miles per gallon of a car with a manual transmission is 1.58 higher than that of a car with an automatic transmission, all other things equal. This coefficient is not statistically significant either (p = 0.32).
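One way to check the "all other things equal" reading is to predict mpg for two cars that differ in a single predictor. The sketch below does this for weight; the two cars (car_a, car_b) and their values are hypothetical, made up purely for illustration.

```r
cars <- transform(mtcars, am = factor(am))
fit <- lm(mpg ~ wt + hp + drat + am, data = cars)

# Two hypothetical cars, identical except that car_b is 1000 lbs heavier
car_a <- data.frame(wt = 3, hp = 150, drat = 3.5,
                    am = factor(0, levels = c(0, 1)))
car_b <- transform(car_a, wt = wt + 1)

# Difference in predicted mpg between the heavier and the lighter car
gap <- predict(fit, car_b) - predict(fit, car_a)

gap               # change in predicted mpg for +1000 lbs
coef(fit)["wt"]   # the same number: the coefficient of wt
```

Whatever values are chosen for hp, drat, and am, the gap stays the same as long as both cars share them, which is exactly what the coefficient of wt describes.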

4 Model specification

  • The model specification is very important in multiple linear regression

  • It is also a very difficult task. To some extent, it is easier to run multiple linear regression than to specify the model.

  • We need to specify the model based on our research question, the theory, and the data. We should put forward a hypothesis about the relationship between the outcome and the predictors, rather than search for a well-fitting model by trying every possible combination of predictors. For instance, a different hypothesis might replace rear axle ratio (drat) with quarter-mile time (qsec):

# mtcars |> 
#   head()
mtcars |>
  mutate(am = factor(am)) |> # convert am to factor
  lm(mpg ~ wt + hp + qsec + am, data = _) |>
  summary()

Call:
lm(formula = mpg ~ wt + hp + qsec + am, data = mutate(mtcars, 
    am = factor(am)))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4975 -1.5902 -0.1122  1.1795  4.5404 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 17.44019    9.31887   1.871  0.07215 . 
wt          -3.23810    0.88990  -3.639  0.00114 **
hp          -0.01765    0.01415  -1.247  0.22309   
qsec         0.81060    0.43887   1.847  0.07573 . 
am1          2.92550    1.39715   2.094  0.04579 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.435 on 27 degrees of freedom
Multiple R-squared:  0.8579,    Adjusted R-squared:  0.8368 
F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
  • It is not the purpose of this course to teach you how to specify a model, which should be based on your research question and the theory.

  • However, we can use technical skills to test whether a model is a good fit or not. We will learn how to do this in future courses.
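As a rough preview only, the sketch below puts the two specifications from this lesson side by side using adjusted R-squared and AIC. These are just two of several possible fit measures, and the object names (m_drat, m_qsec, fit_table) are illustrative.

```r
cars <- transform(mtcars, am = factor(am))

# The two specifications fitted in this lesson
m_drat <- lm(mpg ~ wt + hp + drat + am, data = cars)
m_qsec <- lm(mpg ~ wt + hp + qsec + am, data = cars)

# Side-by-side fit measures: higher adjusted R-squared and lower AIC
# both indicate a closer fit to this particular data set
fit_table <- data.frame(
  model  = c("wt + hp + drat + am", "wt + hp + qsec + am"),
  adj_r2 = c(summary(m_drat)$adj.r.squared, summary(m_qsec)$adj.r.squared),
  aic    = c(AIC(m_drat), AIC(m_qsec))
)
fit_table
```

A closer fit on these measures does not by itself make a specification the right one; the choice of predictors should still follow from the research question and the theory.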


Thank you!