Data

We’re going to return to the Chile data set (carData::Chile). The variables of interest are as follows:

  • vote: a factor with levels: A, will abstain; N, will vote no (against Pinochet); U, undecided; Y, will vote yes (for Pinochet).
  • statusquo: higher values = more support for the status quo, standardized to have mean zero and standard deviation of one
  • income: monthly income, in pesos
  • education: factor variable with 3 levels: primary only (baseline), secondary (S), post-secondary (PS)
  • sex: factor variable with two levels, female (baseline) and male (M)
  • region: A factor with levels: C, Central; M, Metropolitan Santiago area; N, North; S, South; SA, city of Santiago.
chile <- carData::Chile

# re-code vote as dummy 
chile$vote_yes <- ifelse(chile$vote == "Y", 1, 
                         ifelse(chile$vote == "N", 0, NA))

# re-code sex as dummy 
chile$female <- ifelse(chile$sex == "F", 1, 0)

# re-code income to 1000s
chile$income_1000 <- chile$income / 1000

# re-code education as dummies 
chile$educS <- ifelse(chile$education == "S", 1, 0)
chile$educPS <- ifelse(chile$education == "PS", 1, 0)

# remove missing
chile <- na.omit(chile)
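
A quick sanity check that the recodes line up with the original variables (optional):

# check the recodes against the original variables
table(chile$vote, chile$vote_yes, useNA = "ifany")
table(chile$sex, chile$female)
summary(chile$income_1000)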

Cross Validation

Estimate the following model:

\[ voteyes_i = \beta_0 + \beta_1statusquo_i + \beta_2income1000_i + \beta_3educS_i + \beta_4educPS_i + \beta_5female_i + u_i \]

m1 <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female, chile)
summary(m1)
## 
## Call:
## lm(formula = vote_yes ~ statusquo + income_1000 + educS + educPS + 
##     female, data = chile)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09279 -0.10727 -0.02865  0.06813  1.04154 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.085e-01  1.232e-02  41.289  < 2e-16 ***
## statusquo    3.870e-01  5.905e-03  65.540  < 2e-16 ***
## income_1000 -7.976e-05  1.659e-04  -0.481 0.630656    
## educS       -4.537e-02  1.455e-02  -3.119 0.001848 ** 
## educPS      -7.772e-02  1.968e-02  -3.950 8.12e-05 ***
## female       4.425e-02  1.262e-02   3.507 0.000464 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2579 on 1697 degrees of freedom
## Multiple R-squared:  0.7348, Adjusted R-squared:  0.734 
## F-statistic: 940.5 on 5 and 1697 DF,  p-value: < 2.2e-16

Clustered SE

Clustered standard errors adjust for the possibility that observations within the same group (e.g., a region) are not independent of one another.

# vcovCL function in sandwich package
se_cluster <- sqrt(diag(sandwich::vcovCL(m1, cluster = ~ region, type = "HC1")))
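
For reference, the full coefficient table can be re-computed with the clustered variance-covariance matrix via lmtest::coeftest() (an aside; the lmtest package is not used elsewhere in this document):

# coefficient table using the region-clustered vcov (requires lmtest)
lmtest::coeftest(m1, vcov. = sandwich::vcovCL(m1, cluster = ~ region, type = "HC1"))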

Coefficient Plot

Create a figure that plots the coefficient estimates for all independent variables along with their 95% confidence bounds (a “coefficient plot”).

# confidence intervals 
lci <- m1$coefficients - 2*se_cluster
uci <- m1$coefficients + 2*se_cluster
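
The multiplier 2 approximates the 95% critical value; with this many residual degrees of freedom, the exact t-based value is essentially 1.96:

# exact 95% critical value given m1's residual degrees of freedom
qt(0.975, df = m1$df.residual)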

# data frame  
results <- data.frame(coefficient = coef(m1)[-1], 
           lower.ci = lci[-1],
           upper.ci = uci[-1])

library(ggplot2)
ggplot(results, aes(x = rownames(results), y = coefficient)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci), width = 0.2) +
  labs(title = "Coefficient Plot with 95% Confidence Bounds",
       x = "Variable",
       y = "Effect on Probability of Voting for Pinochet") +
  geom_hline(yintercept=0, linetype="dashed", color = "red") +
  coord_flip() +
  theme_minimal()

Use 10-fold cross validation to calculate the out-of-sample (test) mean squared error and \(R^2\).

  • Out-of-sample (test) mean squared error measures the average squared prediction error (how far the predicted vote_yes values fall from the actual values); lower is better.
  • Out-of-sample (test) \(R^2\) measures the proportion of variance in the binary outcome (vote_yes) explained by the model; higher is better.

How do we expect the out-of-sample (test) metrics to compare to the within-sample (train) metrics from ‘m1’?

# set up the training procedure
m1_trainControl <- caret::trainControl(method = "cv", number = 10)

# train the model using cross-validation
m1_10fCV <- caret::train(formula(m1), data = chile, method = "lm", trControl = m1_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
## Loading required package: lattice
# cross-validation results 
print(m1_10fCV)
## Linear Regression 
## 
## 1703 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1532, 1533, 1533, 1532, 1533, 1533, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2579919  0.7348998  0.1686353
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# compare to values from m1
train <- c(summary(m1)["sigma"], summary(m1)["r.squared"])
test <- c(m1_10fCV$results["RMSE"], m1_10fCV$results["Rsquared"])

cbind(train, test)
##           train     test     
## sigma     0.2578879 0.2579919
## r.squared 0.7348246 0.7348998
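
For intuition, here is a by-hand sketch of the same 10-fold procedure (the seed and fold assignment are arbitrary, so the numbers will differ slightly from caret's; the test \(R^2\) below uses the simple \(1 - MSE/\text{Var}(y)\) definition, whereas caret's Rsquared is based on the squared correlation between predicted and observed values):

# by-hand 10-fold cross-validation for m1
set.seed(123)                                         # arbitrary seed, for reproducibility
folds <- sample(rep(1:10, length.out = nrow(chile)))  # random fold assignment

mse <- rep(NA, 10)
for (f in 1:10){
  
  # estimate the model without fold f
  m <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female, chile[folds != f, ])
  
  # predict vote_yes for the held-out fold and store its test MSE
  pred <- predict(m, newdata = chile[folds == f, ])
  mse[f] <- mean((chile$vote_yes[folds == f] - pred)^2)
  
}

sqrt(mean(mse))                     # out-of-sample RMSE
1 - mean(mse)/var(chile$vote_yes)   # out-of-sample R^2 (one common definition)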

What does it mean if the test RMSE and \(R^2\) are similar to the train RMSE and \(R^2\)?

Fixed Effects

Fixed effects adjust for the possibility of unobserved heterogeneity across groups by controlling for group-level omitted/confounding variables.

Estimate a new model that uses fixed effects for region to account for all regional-level variance in vote_yes.

m2 <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female + region, chile)
summary(m2)
## 
## Call:
## lm(formula = vote_yes ~ statusquo + income_1000 + educS + educPS + 
##     female + region, data = chile)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08466 -0.10696 -0.02765  0.06649  1.04268 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5085125  0.0168712  30.141  < 2e-16 ***
## statusquo    0.3876215  0.0060062  64.537  < 2e-16 ***
## income_1000 -0.0001101  0.0001688  -0.652 0.514265    
## educS       -0.0456477  0.0145827  -3.130 0.001776 ** 
## educPS      -0.0770650  0.0197225  -3.907 9.69e-05 ***
## female       0.0437699  0.0126388   3.463 0.000547 ***
## regionM      0.0383948  0.0377065   1.018 0.308702    
## regionN     -0.0020142  0.0217556  -0.093 0.926245    
## regionS     -0.0086080  0.0179731  -0.479 0.632042    
## regionSA     0.0083283  0.0173408   0.480 0.631095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.258 on 1693 degrees of freedom
## Multiple R-squared:  0.7352, Adjusted R-squared:  0.7338 
## F-statistic: 522.2 on 9 and 1693 DF,  p-value: < 2.2e-16
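
As an aside, the same fixed-effects specification can be estimated with the fixest package, which absorbs the region dummies and can cluster the standard errors by region in one step (a sketch, assuming fixest is available; it is not used elsewhere in this document):

# region fixed effects via fixest; the non-region coefficients match m2
m2_fe <- fixest::feols(vote_yes ~ statusquo + income_1000 + educS + educPS + female | region,
                       data = chile, cluster = ~ region)
summary(m2_fe)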

Use 10-fold cross validation to calculate the out-of-sample (test) mean squared error and \(R^2\). How do we expect the out-of-sample (test) metrics from the fixed effects model to compare to the out-of-sample (test) metrics from m1?

# set up the training procedure
m2_trainControl <- caret::trainControl(method = "cv", number = 10)

# train the model using cross-validation
m2_10fCV <- caret::train(formula(m2), data = chile, method = "lm", trControl = m2_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# compare to values from m1
m1_test <- c(m1_10fCV$results["RMSE"], m1_10fCV$results["Rsquared"])
m2_test <- c(m2_10fCV$results["RMSE"], m2_10fCV$results["Rsquared"])

cbind(m1_test, m2_test)
##          m1_test   m2_test  
## RMSE     0.2579919 0.2577978
## Rsquared 0.7348998 0.7328427

As expected, the region fixed effects do not improve out-of-sample performance: the out-of-sample (test) RMSE of the fixed effects model is essentially identical to that of the original model (0.2578 vs. 0.2580).

The out-of-sample (test) \(R^2\) of the fixed effects model is slightly smaller than the out-of-sample (test) \(R^2\) of the original model (0.733 vs. 0.735).

This is because the region dummies mostly pick up region-specific noise in the training data (none of them is statistically significant in m2). Those patterns don’t generalize to new (test) data, so the additional parameters do not reduce prediction error, and predictive performance (\(R^2\)) is slightly worse.
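
The (small) gap between the fixed effects model's within-sample and out-of-sample fit can be seen directly:

# m2: within-sample vs. out-of-sample R^2
summary(m2)$r.squared          # train
m2_10fCV$results["Rsquared"]   # test (10-fold CV)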

Leave One Out CV

By Hand

error <- rep(NA, nrow(chile))  # preallocate storage for prediction errors

for (i in 1:nrow(chile)){
  
  # estimate model without observation i 
  m <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female, chile[-i,])

  # use model to predict vote_yes for observation i 
  pred <- predict(m, newdata = chile[i,], type = "response")
  
  # store difference between actual and predicted 
  error[i] <- chile$vote_yes[i] - pred
  
}

# root mean squared error (RMSE)
sqrt(mean(error^2))  
## [1] 0.2582178

With caret

# set up the training procedure
m1_trainControl <- caret::trainControl(method = "LOOCV")

# train the model using LOOCV
m1_LOOCV <- caret::train(formula(m1), data = chile, method = "lm", trControl = m1_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# print RMSE 
m1_LOOCV$results["RMSE"]
##        RMSE
## 1 0.2582178

Analytically

For OLS, the LOOCV mean squared error has an analytic solution (where \(\hat{u}_i\) is the residual and \(h_i\) the leverage of observation \(i\)):

\[ \text{LOOCV} = \frac{1}{N} \sum_{i=1}^N \left[\frac{\hat{u}_i}{(1 - h_i)} \right]^2 \]

# analytical RMSE
sqrt( sum( (m1$residuals / (1 - hatvalues(m1)))^2 ) / nrow(chile) )
## [1] 0.2582178

Family-Wise Error Rates

Consider the following randomized survey experiment, which tested different messaging strategies to promote vaccination against COVID-19.

  • vaccine_message: the message respondents received. One of:
    • Control – no message
    • Patriotism – emphasizing vaccination as good for the country
    • People you know – emphasizing social norms regarding vaccination
    • Physician recommend – asking the respondent to consider what they’d do if their personal doctor recommended vaccination
    • Preventing harm – emphasizing that vaccination is important to reduce risks for others
    • Scientists recommend – emphasizing the scientific consensus regarding vaccination
  • vaccine_attitude: respondents’ self-reported likelihood of getting vaccinated
    • 1-7 scale, higher values indicate higher likelihood
load("data/experiment")
print(s1)
## 
## Call:
## lm(formula = vaccine_attitude ~ vaccine_message, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.854 -1.810  1.146  2.190  2.405 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          4.59510    0.03611 127.263  < 2e-16 ***
## vaccine_messagePatriotism            0.06050    0.05114   1.183 0.236748    
## vaccine_messagePeople you know       0.12696    0.05098   2.490 0.012777 *  
## vaccine_messagePhysician recommend   0.25871    0.05126   5.047 4.52e-07 ***
## vaccine_messagePreventing harm       0.16941    0.05108   3.317 0.000912 ***
## vaccine_messageScientists recommend  0.21474    0.05093   4.217 2.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.284 on 23979 degrees of freedom
##   (697 observations deleted due to missingness)
## Multiple R-squared:  0.001483,   Adjusted R-squared:  0.001275 
## F-statistic: 7.122 on 5 and 23979 DF,  p-value: 1.151e-06

What issues might arise here with respect to using p-values in the regression output combined with \(\alpha=0.05\) to test each distinct coefficient?

  • This experiment has multiple treatment arms, so we are testing several coefficients at once.
  • Conducting many tests, each at \(\alpha=0.05\), inflates the family-wise error rate: the probability of at least one false positive across the whole set of tests.
  • Left unadjusted, this increases the rate of false-positive findings.
  • One remedy is to adjust \(\alpha\) before testing for statistical significance, as in the correction below.
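
For example, with six independent tests each at \(\alpha=0.05\) (the same count used in the correction below) and all null hypotheses true, the probability of at least one false positive is far above 5%:

# family-wise error rate with 6 independent tests at alpha = 0.05
1 - (1 - 0.05)^6
## [1] 0.2649081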
# Šidák-corrected alpha
alpha <- 1 - (1-0.05)^(1/6) # 0.008512445

# significance tests
coef(s1)[2,4] < alpha # patriotism 
## [1] FALSE
coef(s1)[3,4] < alpha # people you know
## [1] FALSE
coef(s1)[4,4] < alpha # physician 
## [1] TRUE
coef(s1)[5,4] < alpha # preventing harm 
## [1] TRUE
coef(s1)[6,4] < alpha # scientists
## [1] TRUE
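
An alternative is to adjust the p-values themselves rather than \(\alpha\): base R's p.adjust() implements several such corrections (Bonferroni shown here as an example; it leads to the same conclusions as the Šidák-adjusted tests above):

# adjust the treatment-effect p-values instead of alpha (Bonferroni)
p.adjust(coef(s1)[-1, 4], method = "bonferroni") < 0.05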
# residual degrees of freedom (used for the critical value below)
s1$df[2]
## [1] 23979

Let’s make another coefficient plot, this time with base R.

crit <- qt(1 - alpha/2, s1$df[2])  # critical value at the corrected alpha

df <- as.data.frame(coef(s1))
df$lci <- df$Estimate - crit*df$`Std. Error`
df$uci <- df$Estimate + crit*df$`Std. Error`

plot(1:5, coef(s1)[-1,1], pch = 19,
     xlab = "Messaging Treatment", ylab = "Estimated Treatment Effect",
     main = "Effect of Messaging on Vaccine Attitudes",
     ylim = c(-0.1, 0.5), xlim = c(0.5, 5.5), axes = FALSE)
axis(1, at=1:5, labels=c("Patriotism", "Familiar", "Physician", "Harm", "Scientists"))
axis(2, at=c(0,0.1,0.2,0.3,0.4,0.5))
segments(1, df$lci[2], 1, df$uci[2])
segments(2, df$lci[3], 2, df$uci[3])
segments(3, df$lci[4], 3, df$uci[4])
segments(4, df$lci[5], 4, df$uci[5])
segments(5, df$lci[6], 5, df$uci[6])
abline(h = 0, lty = 2)
box()