GV950-7-SP: Introduction to Quantitative Methods and Data Analysis II

Week 19: Lab Practice Exercises

Setup

Code
library(tidyverse)
#library(ggplot2) if you are more comfortable with ggplot2 and base R
library(jtools) # table packages
library(ggeffects) # for marginal effects
library(ggpubr) # for regression formula in plots
library(glue)

Task 1: Load the data

Let’s use the British Election Study (BES) data which we used in Homework 1.

Code
bes <- read.csv("bes.csv") # use your own path

Task 2: OLS assumptions check

2.1. Main independent variable and hypothesis

We want to understand the Conservatives’ vote share in 2019. To do so, first choose a continuous main independent variable that you think would help to explain this dependent variable. State a hypothesis and a credible mechanism to link X to Y.

\[Con19 = Turnout19\]

H1: The higher the turnout in 2019, the higher the share of votes for the Conservatives party.

2.2. Control variables and hypotheses

Choose two control variables that you think would help to explain the dependent variable. State a hypothesis and a credible mechanism to link X to Y.

\[Con19 = Turnout19 + Lab17 + Electorate19\]

H2: The higher the Labour party’s vote share in 2017, the higer the share of votes for the Conservatives party in 2019.

H3: The higher the electorate in 2019, the lower the share of votes for the Conservatives party.

2.3. Run the regression Model

Code
mod1 <- lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = bes)
summary(mod1)

Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.323  -5.672   2.635   9.593  22.852 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.887e+01  8.933e+00   8.828  < 2e-16 ***
Turnout19    -3.946e-01  1.126e-01  -3.505 0.000489 ***
Lab17        -6.380e-01  3.544e-02 -18.004  < 2e-16 ***
Electorate19  2.514e-04  5.788e-05   4.344 1.63e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.96 on 626 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4026,    Adjusted R-squared:  0.3998 
F-statistic: 140.6 on 3 and 626 DF,  p-value: < 2.2e-16

2.4. Check for Linearity

Code
mod1 |>
  ggplot(aes(x = Turnout19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Turnout of 2019", y = "Residuals")

Code
#or
ggplot(mod1, aes(x = Turnout19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Turnout of 2019", y = "Residuals")

Code
mod1 |>
  ggplot(aes(x = Lab17, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Labour party share of 2017", y = "Residuals")

Code
mod1 |>
  ggplot(aes(x = Electorate19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Electorate of 2019", y = "Residuals")

2.5. Check for Homoscedasticity

Code
mod1 |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Fitted values", y = "Residuals")

Code
#or
ggplot(mod1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Fitted values", y = "Residuals")

Task 3: Check for Outliers

3.1. What is the assumption of outliers?

3.2. How can we detect outliers?

Code
cooksD <- cooks.distance(mod1)
cutoff <- 4 / (nrow(bes) - length(mod1$coefficients) - 2)
plot(cooksD, main = "Cook's distance")
abline(h = cutoff, lty = 2, col = "red") 

Code
plot(mod1, which = 4) |> 
  abline(h = cutoff, col = "red")

3.3. How to deal with outliers?

Code
# Cook's distance
mod2 <- bes |>
  select(Con19, Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  mutate(cooksd = cooks.distance(mod1)) |>
  filter(cooksd < cutoff) |>
  # filter out the influential points
  lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = _)

mod2 |>
  summary()

Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = filter(mutate(drop_na(select(bes, 
    Con19, Turnout19, Lab17, Electorate19)), cooksd = cooks.distance(mod1)), 
    cooksd < cutoff))

Residuals:
    Min      1Q  Median      3Q     Max 
-36.728  -5.085   2.449   7.898  18.618 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.071e+02  8.015e+00  13.360  < 2e-16 ***
Turnout19    -5.520e-01  9.739e-02  -5.668 2.26e-08 ***
Lab17        -7.507e-01  3.085e-02 -24.333  < 2e-16 ***
Electorate19  1.013e-04  5.614e-05   1.804   0.0717 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 587 degrees of freedom
Multiple R-squared:  0.5523,    Adjusted R-squared:   0.55 
F-statistic: 241.4 on 3 and 587 DF,  p-value: < 2.2e-16
Code
export_summs(mod1, mod2)
Model 1 Model 2
(Intercept) 78.87 *** 107.08 ***
(8.93)    (8.02)   
Turnout19 -0.39 *** -0.55 ***
(0.11)    (0.10)   
Lab17 -0.64 *** -0.75 ***
(0.04)    (0.03)   
Electorate19 0.00 *** 0.00    
(0.00)    (0.00)   
N 630        591       
R2 0.40     0.55    
*** p < 0.001; ** p < 0.01; * p < 0.05.
Code
# install.packages("huxtable") if you don't have it
Code
plot(mod1, which = 4) |> 
  abline(h = 0.02, col = "red")

Code
mod2a <- bes |>
  select(Con19, Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  mutate(cooksd = cooks.distance(mod1)) |>
  filter(cooksd < 0.02) |>
  # filter out the influential points
  lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = _)

mod2a |>
  summary()

Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = filter(mutate(drop_na(select(bes, 
    Con19, Turnout19, Lab17, Electorate19)), cooksd = cooks.distance(mod1)), 
    cooksd < 0.02))

Residuals:
    Min      1Q  Median      3Q     Max 
-42.448  -5.303   2.615   9.010  21.123 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.120e+01  8.630e+00  10.567  < 2e-16 ***
Turnout19    -4.231e-01  1.076e-01  -3.931 9.41e-05 ***
Lab17        -6.838e-01  3.391e-02 -20.165  < 2e-16 ***
Electorate19  1.419e-04  5.646e-05   2.514   0.0122 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.24 on 619 degrees of freedom
Multiple R-squared:  0.4514,    Adjusted R-squared:  0.4487 
F-statistic: 169.7 on 3 and 619 DF,  p-value: < 2.2e-16
Code
export_summs(mod1, mod2, mod2a)
Model 1 Model 2 Model 3
(Intercept) 78.87 *** 107.08 *** 91.20 ***
(8.93)    (8.02)    (8.63)   
Turnout19 -0.39 *** -0.55 *** -0.42 ***
(0.11)    (0.10)    (0.11)   
Lab17 -0.64 *** -0.75 *** -0.68 ***
(0.04)    (0.03)    (0.03)   
Electorate19 0.00 *** 0.00     0.00 *  
(0.00)    (0.00)    (0.00)   
N 630        591        623       
R2 0.40     0.55     0.45    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Task 4: Check for multicollinearity

4.1. What is the assumption of multicollinearity?

4.2. How can we check for multicollinearity?

Code
bes |>
  select(Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  cor()
              Turnout19       Lab17 Electorate19
Turnout19     1.0000000 -0.56793983   0.18821553
Lab17        -0.5679398  1.00000000  -0.09995258
Electorate19  0.1882155 -0.09995258   1.00000000

Task 5: Multiple Regression: interaction term with a dummy variable

5.1. What is the interaction term?

5.2. How to run regression with the interaction term?

5.2.1. Run the regression model with a dummy variable

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2\]

Where \(x_2\) is a dummy variable, so when \(x_2 = 0\), the equation becomes:

\[ y = \beta_0 + \beta_1x_1\]

And when \(x_2 = 1\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 \\ \\ &= (\beta_0 + \beta_2) + \beta_1x_1 \end{align*}\]

Code
mod3 <- bes |>
  lm(Con19 ~ Turnout19 + ConPPCsex19, data = _)
mod3 |>
  summary()

Call:
lm(formula = Con19 ~ Turnout19 + ConPPCsex19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.339 -11.541   1.715  12.248  39.568 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7.1505     7.6170  -0.939    0.348    
Turnout19     0.7855     0.1120   7.011 6.12e-12 ***
ConPPCsex19  -5.6565     1.3743  -4.116 4.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.88 on 628 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1018,    Adjusted R-squared:  0.09899 
F-statistic: 35.61 on 2 and 628 DF,  p-value: 2.249e-15
Code
mod3 |> 
  ggeffect(c("Turnout19", "ConPPCsex19")) |> 
  plot(show_ci = F) +
  stat_regline_equation( #library(ggpubr)
    formula = y ~ x,
    label.x.npc = "left",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and ConPPCsex19 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019",
       )

5.2.2. Run the regression model with the interaction term of a dummy variable

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1 * x_2\]

Where \(x_2\) is a dummy variable, so when \(x_2 = 0\), the equation becomes:

\[ y = \beta_0 + \beta_1x_1\]

And when \(x_2 = 1\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 + \beta_3x_1 \\ \\ &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x_1 \end{align*}\]

Code
mod4 <- bes |>
  lm(Con19 ~ Turnout19 * ConPPCsex19, data = _)
summary(mod4)

Call:
lm(formula = Con19 ~ Turnout19 * ConPPCsex19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.046 -11.493   1.893  12.159  39.786 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -4.3393     9.5699  -0.453    0.650    
Turnout19               0.7439     0.1410   5.276 1.82e-07 ***
ConPPCsex19           -13.2244    15.6404  -0.846    0.398    
Turnout19:ConPPCsex19   0.1129     0.2324   0.486    0.627    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.89 on 627 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1022,    Adjusted R-squared:  0.09789 
F-statistic: 23.79 on 3 and 627 DF,  p-value: 1.367e-14
Code
mod4 |> 
  ggeffect(c("Turnout19", "ConPPCsex19")) |>
  plot(show_ci = F) +
  stat_regline_equation(
    formula = y ~ x,
    label.x.npc = "left",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and ConPPCsex19 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019")

Bonus question

  • Run the regression model with the interaction term of a continuous variable
Code
mod5 <- bes |>
  lm(Con19 ~ Turnout19 * Lab17, data = _) 
summary(mod5)

Call:
lm(formula = Con19 ~ Turnout19 * Lab17, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.578  -4.436   3.682   8.604  18.449 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     201.654501  19.347048  10.423  < 2e-16 ***
Turnout19        -1.902698   0.275354  -6.910 1.20e-11 ***
Lab17            -2.992372   0.376425  -7.949 8.76e-15 ***
Turnout19:Lab17   0.034659   0.005514   6.286 6.12e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.76 on 626 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4212,    Adjusted R-squared:  0.4184 
F-statistic: 151.8 on 3 and 626 DF,  p-value: < 2.2e-16
Code
mod5 |> 
  ggeffect(c("Turnout19", "Lab17")) |>
  plot(show_ci = F) +
  stat_regline_equation(
    formula = y ~ x,
    label.x.npc = "center",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and Lab17 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019",
       color = "Lab17")


\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1 * x_2\]

When \(x_2 = 24.14\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 * 24.14 + \beta_3x_1 * 24.14 \\ \\ & = (\beta_0 + 24.14\beta_2) + (\beta_1 + 24.14\beta_3)x_1 \\ \\ & = (201.65 + 24.14*(-2.99)) + (-1.90 + 24.14*0.035)x_1 \\ \\ & = 130 - 1.1x_1 \end{align*}\]

The End