GV950-7-SP: Introduction to Quantitative Methods and Data Analysis II

Week 19: Lab Practice Exercises

Setup

Code

library(tidyverse)
#library(ggplot2) if you are more comfortable with ggplot2 and base R
library(jtools) # table packages
library(ggeffects) # for marginal effects
library(ggpubr) # for regression formula in plots
library(glue)

Task 1: Load the data

Let’s use the British Election Study (BES) data which we used in Homework 1.

Code

bes <- read.csv("bes.csv") # use your own path

Task 2: OLS assumptions check

2.1. Main independent variable and hypothesis

We want to understand the Conservatives’ vote share in 2019. To do so, first choose a continuous main independent variable that you think would help to explain this dependent variable. State a hypothesis and a credible mechanism to link X to Y.

\[Con19 = Turnout19\]

H1: The higher the turnout in 2019, the higher the share of votes for the Conservatives party.

2.2. Control variables and hypotheses

Choose two control variables that you think would help to explain the dependent variable. State a hypothesis and a credible mechanism to link X to Y.

\[Con19 = Turnout19 + Lab17 + Electorate19\]

H2: The higher the Labour party’s vote share in 2017, the higer the share of votes for the Conservatives party in 2019.

H3: The higher the electorate in 2019, the lower the share of votes for the Conservatives party.

2.3. Run the regression Model

Code

mod1 <- lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = bes)
summary(mod1)


Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.323  -5.672   2.635   9.593  22.852 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.887e+01  8.933e+00   8.828  < 2e-16 ***
Turnout19    -3.946e-01  1.126e-01  -3.505 0.000489 ***
Lab17        -6.380e-01  3.544e-02 -18.004  < 2e-16 ***
Electorate19  2.514e-04  5.788e-05   4.344 1.63e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.96 on 626 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4026,    Adjusted R-squared:  0.3998 
F-statistic: 140.6 on 3 and 626 DF,  p-value: < 2.2e-16

2.4. Check for Linearity

Code

mod1 |>
  ggplot(aes(x = Turnout19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Turnout of 2019", y = "Residuals")

Code

#or
ggplot(mod1, aes(x = Turnout19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Turnout of 2019", y = "Residuals")

Code

mod1 |>
  ggplot(aes(x = Lab17, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Labour party share of 2017", y = "Residuals")

Code

mod1 |>
  ggplot(aes(x = Electorate19, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Electorate of 2019", y = "Residuals")

2.5. Check for Homoscedasticity

Code

mod1 |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Fitted values", y = "Residuals")

Code

#or
ggplot(mod1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "red") +
  labs(x = "Fitted values", y = "Residuals")

Task 3: Check for Outliers

3.1. What is the assumption of outliers?

3.2. How can we detect outliers?

Code

cooksD <- cooks.distance(mod1)
cutoff <- 4 / (nrow(bes) - length(mod1$coefficients) - 2)
plot(cooksD, main = "Cook's distance")
abline(h = cutoff, lty = 2, col = "red")

Code

plot(mod1, which = 4) |> 
  abline(h = cutoff, col = "red")

3.3. How to deal with outliers?

Code

# Cook's distance
mod2 <- bes |>
  select(Con19, Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  mutate(cooksd = cooks.distance(mod1)) |>
  filter(cooksd < cutoff) |>
  # filter out the influential points
  lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = _)

mod2 |>
  summary()


Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = filter(mutate(drop_na(select(bes, 
    Con19, Turnout19, Lab17, Electorate19)), cooksd = cooks.distance(mod1)), 
    cooksd < cutoff))

Residuals:
    Min      1Q  Median      3Q     Max 
-36.728  -5.085   2.449   7.898  18.618 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.071e+02  8.015e+00  13.360  < 2e-16 ***
Turnout19    -5.520e-01  9.739e-02  -5.668 2.26e-08 ***
Lab17        -7.507e-01  3.085e-02 -24.333  < 2e-16 ***
Electorate19  1.013e-04  5.614e-05   1.804   0.0717 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 587 degrees of freedom
Multiple R-squared:  0.5523,    Adjusted R-squared:   0.55 
F-statistic: 241.4 on 3 and 587 DF,  p-value: < 2.2e-16

Code

export_summs(mod1, mod2)

	Model 1	Model 2
(Intercept)	78.87 ***	107.08 ***
	(8.93)	(8.02)
Turnout19	-0.39 ***	-0.55 ***
	(0.11)	(0.10)
Lab17	-0.64 ***	-0.75 ***
	(0.04)	(0.03)
Electorate19	0.00 ***	0.00
	(0.00)	(0.00)
N	630	591
R2	0.40	0.55
* p < 0.001; p < 0.01; * p < 0.05.

Code

# install.packages("huxtable") if you don't have it

Code

plot(mod1, which = 4) |> 
  abline(h = 0.02, col = "red")

Code

mod2a <- bes |>
  select(Con19, Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  mutate(cooksd = cooks.distance(mod1)) |>
  filter(cooksd < 0.02) |>
  # filter out the influential points
  lm(Con19 ~ Turnout19 + Lab17 + Electorate19, data = _)

mod2a |>
  summary()


Call:
lm(formula = Con19 ~ Turnout19 + Lab17 + Electorate19, data = filter(mutate(drop_na(select(bes, 
    Con19, Turnout19, Lab17, Electorate19)), cooksd = cooks.distance(mod1)), 
    cooksd < 0.02))

Residuals:
    Min      1Q  Median      3Q     Max 
-42.448  -5.303   2.615   9.010  21.123 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.120e+01  8.630e+00  10.567  < 2e-16 ***
Turnout19    -4.231e-01  1.076e-01  -3.931 9.41e-05 ***
Lab17        -6.838e-01  3.391e-02 -20.165  < 2e-16 ***
Electorate19  1.419e-04  5.646e-05   2.514   0.0122 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.24 on 619 degrees of freedom
Multiple R-squared:  0.4514,    Adjusted R-squared:  0.4487 
F-statistic: 169.7 on 3 and 619 DF,  p-value: < 2.2e-16

Code

export_summs(mod1, mod2, mod2a)

	Model 1	Model 2	Model 3
(Intercept)	78.87 ***	107.08 ***	91.20 ***
	(8.93)	(8.02)	(8.63)
Turnout19	-0.39 ***	-0.55 ***	-0.42 ***
	(0.11)	(0.10)	(0.11)
Lab17	-0.64 ***	-0.75 ***	-0.68 ***
	(0.04)	(0.03)	(0.03)
Electorate19	0.00 ***	0.00	0.00 *
	(0.00)	(0.00)	(0.00)
N	630	591	623
R2	0.40	0.55	0.45
* p < 0.001; p < 0.01; * p < 0.05.

Task 4: Check for multicollinearity

4.1. What is the assumption of multicollinearity?

4.2. How can we check for multicollinearity?

Code

bes |>
  select(Turnout19, Lab17, Electorate19) |>
  drop_na() |>
  cor()

              Turnout19       Lab17 Electorate19
Turnout19     1.0000000 -0.56793983   0.18821553
Lab17        -0.5679398  1.00000000  -0.09995258
Electorate19  0.1882155 -0.09995258   1.00000000

Task 5: Multiple Regression: interaction term with a dummy variable

5.1. What is the interaction term?

5.2. How to run regression with the interaction term?

5.2.1. Run the regression model with a dummy variable

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2\]

Where \(x_2\) is a dummy variable, so when \(x_2 = 0\), the equation becomes:

\[ y = \beta_0 + \beta_1x_1\]

And when \(x_2 = 1\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 \\ \\ &= (\beta_0 + \beta_2) + \beta_1x_1 \end{align*}\]

Code

mod3 <- bes |>
  lm(Con19 ~ Turnout19 + ConPPCsex19, data = _)
mod3 |>
  summary()


Call:
lm(formula = Con19 ~ Turnout19 + ConPPCsex19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.339 -11.541   1.715  12.248  39.568 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7.1505     7.6170  -0.939    0.348    
Turnout19     0.7855     0.1120   7.011 6.12e-12 ***
ConPPCsex19  -5.6565     1.3743  -4.116 4.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.88 on 628 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1018,    Adjusted R-squared:  0.09899 
F-statistic: 35.61 on 2 and 628 DF,  p-value: 2.249e-15

Code

mod3 |> 
  ggeffect(c("Turnout19", "ConPPCsex19")) |> 
  plot(show_ci = F) +
  stat_regline_equation( #library(ggpubr)
    formula = y ~ x,
    label.x.npc = "left",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and ConPPCsex19 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019",
       )

5.2.2. Run the regression model with the interaction term of a dummy variable

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1 * x_2\]

Where \(x_2\) is a dummy variable, so when \(x_2 = 0\), the equation becomes:

\[ y = \beta_0 + \beta_1x_1\]

And when \(x_2 = 1\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 + \beta_3x_1 \\ \\ &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x_1 \end{align*}\]

Code

mod4 <- bes |>
  lm(Con19 ~ Turnout19 * ConPPCsex19, data = _)
summary(mod4)


Call:
lm(formula = Con19 ~ Turnout19 * ConPPCsex19, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.046 -11.493   1.893  12.159  39.786 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -4.3393     9.5699  -0.453    0.650    
Turnout19               0.7439     0.1410   5.276 1.82e-07 ***
ConPPCsex19           -13.2244    15.6404  -0.846    0.398    
Turnout19:ConPPCsex19   0.1129     0.2324   0.486    0.627    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.89 on 627 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1022,    Adjusted R-squared:  0.09789 
F-statistic: 23.79 on 3 and 627 DF,  p-value: 1.367e-14

Code

mod4 |> 
  ggeffect(c("Turnout19", "ConPPCsex19")) |>
  plot(show_ci = F) +
  stat_regline_equation(
    formula = y ~ x,
    label.x.npc = "left",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and ConPPCsex19 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019")

Bonus question

Run the regression model with the interaction term of a continuous variable

Code

mod5 <- bes |>
  lm(Con19 ~ Turnout19 * Lab17, data = _) 
summary(mod5)


Call:
lm(formula = Con19 ~ Turnout19 * Lab17, data = bes)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.578  -4.436   3.682   8.604  18.449 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     201.654501  19.347048  10.423  < 2e-16 ***
Turnout19        -1.902698   0.275354  -6.910 1.20e-11 ***
Lab17            -2.992372   0.376425  -7.949 8.76e-15 ***
Turnout19:Lab17   0.034659   0.005514   6.286 6.12e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.76 on 626 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4212,    Adjusted R-squared:  0.4184 
F-statistic: 151.8 on 3 and 626 DF,  p-value: < 2.2e-16

Code

mod5 |> 
  ggeffect(c("Turnout19", "Lab17")) |>
  plot(show_ci = F) +
  stat_regline_equation(
    formula = y ~ x,
    label.x.npc = "center",
    label.y.npc = "top",
    show.legend = F
    ) +
  labs(title = "Turnout19 and Lab17 on Con19",
       x = "Turnout of 2019",
       y = "Conservatives votes share in 2019",
       color = "Lab17")

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1 * x_2\]

When \(x_2 = 24.14\), the equation becomes:

\[\begin{align*} y &= \beta_0 + \beta_1x_1 + \beta_2 * 24.14 + \beta_3x_1 * 24.14 \\ \\ & = (\beta_0 + 24.14\beta_2) + (\beta_1 + 24.14\beta_3)x_1 \\ \\ & = (201.65 + 24.14*(-2.99)) + (-1.90 + 24.14*0.035)x_1 \\ \\ & = 130 - 1.1x_1 \end{align*}\]

GV950-7-SP: Introduction to Quantitative Methods and Data Analysis II

Setup

Task 1: Load the data

Task 2: OLS assumptions check

2.1. Main independent variable and hypothesis

2.2. Control variables and hypotheses

2.3. Run the regression Model

2.4. Check for Linearity

2.5. Check for Homoscedasticity

Task 3: Check for Outliers

3.1. What is the assumption of outliers?

3.2. How can we detect outliers?

3.3. How to deal with outliers?

Task 4: Check for multicollinearity

4.1. What is the assumption of multicollinearity?

4.2. How can we check for multicollinearity?

Task 5: Multiple Regression: interaction term with a dummy variable

5.1. What is the interaction term?

5.2. How to run regression with the interaction term?

5.2.1. Run the regression model with a dummy variable

5.2.2. Run the regression model with the interaction term of a dummy variable

Bonus question

The End