We’re going to return to the chile data set. The variables of interest are as follows:

- vote: a factor with levels: A, will abstain; N, will vote no (against Pinochet); U, undecided; Y, will vote yes (for Pinochet).
- statusquo: higher values = more support for the status quo, standardized to have mean zero and standard deviation of one.
- income: monthly income, in pesos.
- education: factor variable with 3 levels: primary only (baseline), secondary (S), post-secondary (PS).
- sex: factor variable with two levels, female (baseline) and male (M).
- region: a factor with levels: C, Central; M, Metropolitan Santiago area; N, North; S, South; SA, city of Santiago.

chile <- carData::Chile
# re-code vote as dummy
chile$vote_yes <- ifelse(chile$vote == "Y", 1,
ifelse(chile$vote == "N", 0, NA))
# re-code sex as dummy
chile$female <- ifelse(chile$sex == "F", 1, 0)
# re-code income to 1000s
chile$income_1000 <- chile$income / 1000
# re-code education as dummies
chile$educS <- ifelse(chile$education == "S", 1, 0)
chile$educPS <- ifelse(chile$education == "PS", 1, 0)
# remove missing
chile <- na.omit(chile)
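Before modeling, it may be worth a quick sanity check that the recoding behaved as intended (an optional addition, not part of the original code):
# cross-tab of the original vote factor against the new dummy
# (A and U rows were dropped by na.omit, since vote_yes is NA for them)
table(chile$vote, chile$vote_yes, useNA = "ifany")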
Estimate the following model:
\[ \text{voteyes}_i = \beta_0 + \beta_1 \text{statusquo}_i + \beta_2 \text{income1000}_i + \beta_3 \text{educS}_i + \beta_4 \text{educPS}_i + \beta_5 \text{female}_i + u_i \]
m1 <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female, chile)
summary(m1)
##
## Call:
## lm(formula = vote_yes ~ statusquo + income_1000 + educS + educPS +
## female, data = chile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09279 -0.10727 -0.02865 0.06813 1.04154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.085e-01 1.232e-02 41.289 < 2e-16 ***
## statusquo 3.870e-01 5.905e-03 65.540 < 2e-16 ***
## income_1000 -7.976e-05 1.659e-04 -0.481 0.630656
## educS -4.537e-02 1.455e-02 -3.119 0.001848 **
## educPS -7.772e-02 1.968e-02 -3.950 8.12e-05 ***
## female 4.425e-02 1.262e-02 3.507 0.000464 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2579 on 1697 degrees of freedom
## Multiple R-squared: 0.7348, Adjusted R-squared: 0.734
## F-statistic: 940.5 on 5 and 1697 DF, p-value: < 2.2e-16
Clustered standard errors adjust for the possibility that observations within the same group (e.g., a region) are not independent from each other.
# vcovCL function in sandwich package
se_cluster <- sqrt(diag(sandwich::vcovCL(m1, cluster = ~ region, type = "HC1")))
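To see how clustering changes inference, one option is to recompute the full coefficient table with the clustered variance-covariance matrix (a sketch, assuming the lmtest package is installed):
# coefficient table using the clustered variance-covariance matrix
lmtest::coeftest(m1, vcov = sandwich::vcovCL(m1, cluster = ~ region, type = "HC1"))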
Create a figure that plots the coefficient estimates for all independent variables along with their 95% confidence bounds (a “coefficient plot”).
# approximate 95% confidence intervals (estimate +/- 2 clustered SEs)
lci <- m1$coefficients - 2*se_cluster
uci <- m1$coefficients + 2*se_cluster
# data frame of estimates (dropping the intercept)
results <- data.frame(term = names(coef(m1))[-1],
                      coefficient = coef(m1)[-1],
                      lower.ci = lci[-1],
                      upper.ci = uci[-1])
library(ggplot2)
ggplot(results, aes(x = term, y = coefficient)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci), width = 0.2) +
  labs(title = "Coefficient Plot with 95% Confidence Bounds",
       x = "Variable",
       y = "Effect on Probability of Voting for Pinochet") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  theme_minimal()
Use 10-fold cross validation to calculate the out-of-sample (test) root mean squared error (RMSE) and \(R^2\). (The RMSE measures how far the predicted vote_yes values are from the actual values; the lower the better.) How do we expect the out-of-sample (test) metrics to compare to the within-sample (train) metrics from m1?
# set up the training procedure
m1_trainControl <- caret::trainControl(method = "cv", number = 10)
# train the model using cross-validation
m1_10fCV <- caret::train(formula(m1), data = chile, method = "lm", trControl = m1_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
## Loading required package: lattice
# cross-validation results
print(m1_10fCV)
## Linear Regression
##
## 1703 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1532, 1533, 1533, 1532, 1533, 1533, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2579919 0.7348998 0.1686353
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# compare to values from m1
train <- c(summary(m1)["sigma"], summary(m1)["r.squared"])
test <- c(m1_10fCV$results["RMSE"], m1_10fCV$results["Rsquared"])
cbind(train, test)
## train test
## sigma 0.2578879 0.2579919
## r.squared 0.7348246 0.7348998
What does it mean if the test RMSE and \(R^2\) are similar to the train RMSE and \(R^2\)? It means the model is not overfitting: its in-sample fit generalizes to data the model has not seen.
Fixed effects adjust for the possibility of unobserved heterogeneity across groups by controlling for group-level omitted/confounding variables.
Estimate a new model that uses fixed effects for region to account for all regional-level variance in vote_yes.
m2 <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female + region, chile)
summary(m2)
##
## Call:
## lm(formula = vote_yes ~ statusquo + income_1000 + educS + educPS +
## female + region, data = chile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08466 -0.10696 -0.02765 0.06649 1.04268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5085125 0.0168712 30.141 < 2e-16 ***
## statusquo 0.3876215 0.0060062 64.537 < 2e-16 ***
## income_1000 -0.0001101 0.0001688 -0.652 0.514265
## educS -0.0456477 0.0145827 -3.130 0.001776 **
## educPS -0.0770650 0.0197225 -3.907 9.69e-05 ***
## female 0.0437699 0.0126388 3.463 0.000547 ***
## regionM 0.0383948 0.0377065 1.018 0.308702
## regionN -0.0020142 0.0217556 -0.093 0.926245
## regionS -0.0086080 0.0179731 -0.479 0.632042
## regionSA 0.0083283 0.0173408 0.480 0.631095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.258 on 1693 degrees of freedom
## Multiple R-squared: 0.7352, Adjusted R-squared: 0.7338
## F-statistic: 522.2 on 9 and 1693 DF, p-value: < 2.2e-16
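As a supplementary check (an addition to the original write-up), we can test whether the region dummies are jointly significant with a nested-model F-test:
# joint F-test of the region fixed effects (m1 is nested in m2)
anova(m1, m2)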
Use 10-fold cross validation to calculate the out-of-sample (test) root mean squared error (RMSE) and \(R^2\). How do we expect the out-of-sample (test) metrics from the fixed effects model to compare to the out-of-sample (test) metrics from m1?
# set up the training procedure
m2_trainControl <- caret::trainControl(method = "cv", number = 10)
# train the model using cross-validation
m2_10fCV <- caret::train(formula(m2), data = chile, method = "lm", trControl = m2_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# compare to values from m1
m1_test <- c(m1_10fCV$results["RMSE"], m1_10fCV$results["Rsquared"])
m2_test <- c(m2_10fCV$results["RMSE"], m2_10fCV$results["Rsquared"])
cbind(m1_test, m2_test)
## m1_test m2_test
## RMSE 0.2579919 0.2577978
## Rsquared 0.7348998 0.7328427
As expected, the fixed effects model does not improve out-of-sample performance: its test \(R^2\) (0.733) is slightly smaller than that of the original model (0.735), and its test RMSE is essentially unchanged (0.2578 vs. 0.2580, a difference well within cross-validation noise).
This is because the fixed effects model fits the training data more closely, including noise tied to the specific region dummies. Those region-specific patterns don’t generalize to new (test) data, so the extra parameters add little or no predictive power.
Now compute the leave-one-out cross-validation (LOOCV) test RMSE, first with a manual loop:
# pre-allocate storage for prediction errors
error <- rep(NA, nrow(chile))
for (i in 1:nrow(chile)){
  # estimate model without observation i
  m <- lm(vote_yes ~ statusquo + income_1000 + educS + educPS + female, chile[-i,])
  # use model to predict vote_yes for observation i
  pred <- predict(m, newdata = chile[i,], type = "response")
  # store difference between actual and predicted
  error[i] <- chile$vote_yes[i] - pred
}
# root mean squared error (RMSE)
sqrt(mean(error^2))
## [1] 0.2582178
We can do the same with caret:
# set up the training procedure
m1_trainControl <- caret::trainControl(method = "LOOCV")
# train the model using LOOCV
m1_LOOCV <- caret::train(formula(m1), data = chile, method = "lm", trControl = m1_trainControl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# print RMSE
m1_LOOCV$results["RMSE"]
## RMSE
## 1 0.2582178
For OLS, there is an analytic solution for the LOOCV mean squared error (where \(h_i\) is the leverage of observation \(i\)), so no refitting is needed:
\[ \text{LOOCV} = \frac{1}{N} \sum_{i=1}^N \left[\frac{\hat{u}_i}{(1 - h_i)} \right]^2 \]
# analytical RMSE
sqrt( sum( (m1$residuals / (1 - hatvalues(m1)))^2 ) / nrow(chile) )
## [1] 0.2582178
Consider the following randomized survey experiment, which tested different messaging strategies to promote vaccination against COVID-19.
- vaccine_message: the message respondents received, one of six levels; the baseline is the reference category, and the five treatment levels (visible in the output below) are Patriotism, People you know, Physician recommend, Preventing harm, and Scientists recommend.
- vaccine_attitude: respondents’ self-reported likelihood of getting vaccinated
load("data/experiment")
print(s1)
##
## Call:
## lm(formula = vaccine_attitude ~ vaccine_message, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.854 -1.810 1.146 2.190 2.405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.59510 0.03611 127.263 < 2e-16 ***
## vaccine_messagePatriotism 0.06050 0.05114 1.183 0.236748
## vaccine_messagePeople you know 0.12696 0.05098 2.490 0.012777 *
## vaccine_messagePhysician recommend 0.25871 0.05126 5.047 4.52e-07 ***
## vaccine_messagePreventing harm 0.16941 0.05108 3.317 0.000912 ***
## vaccine_messageScientists recommend 0.21474 0.05093 4.217 2.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.284 on 23979 degrees of freedom
## (697 observations deleted due to missingness)
## Multiple R-squared: 0.001483, Adjusted R-squared: 0.001275
## F-statistic: 7.122 on 5 and 23979 DF, p-value: 1.151e-06
What issues might arise here with respect to using p-values in the regression output combined with \(\alpha=0.05\) to test each distinct coefficient? Testing several coefficients at \(\alpha=0.05\) each inflates the family-wise error rate: the probability of at least one false positive across the tests is well above 0.05. One remedy is a Šidák correction to the threshold (computed here for six tests, one per model coefficient):
# Šidák-corrected alpha: six tests with family-wise error rate 0.05
alpha <- 1 - (1-0.05)^(1/6) # 0.008512445
# significance tests
coef(s1)[2,4] < alpha # patriotism
## [1] FALSE
coef(s1)[3,4] < alpha # people you know
## [1] FALSE
coef(s1)[4,4] < alpha # physician
## [1] TRUE
coef(s1)[5,4] < alpha # preventing harm
## [1] TRUE
coef(s1)[6,4] < alpha # scientists
## [1] TRUE
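An alternative to adjusting the threshold (an addition not in the original code) is to adjust the p-values themselves, e.g. with Holm’s method:
# Holm-adjusted p-values for the five treatment coefficients
p.adjust(coef(s1)[-1, 4], method = "holm")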
# residual degrees of freedom (used for the critical t value below)
s1$df[2]
## [1] 23979
Let’s make another coefficient plot, this time with base R.
crit <- qt(1 - alpha/2, s1$df[2]) # critical t value at the Šidák-corrected alpha
df <- as.data.frame(coef(s1))
df$lci <- df$Estimate - crit*df$`Std. Error`
df$uci <- df$Estimate + crit*df$`Std. Error`
plot(1:5, coef(s1)[-1,1], pch = 19, xlab = "Messaging Treatment", ylab = "Estimated Treatment Effect", main = "Effect of Messaging on Vaccine Attitudes", ylim = c(-0.1,0.5), xlim = c(0.5,5.5), axes = FALSE)
axis(1, at=1:5, labels=c("Patriotism", "Familiar", "Physician", "Harm", "Scientists"))
axis(2, at=c(0,0.1,0.2,0.3,0.4,0.5))
segments(1, df$lci[2], 1, df$uci[2])
segments(2, df$lci[3], 2, df$uci[3])
segments(3, df$lci[4], 3, df$uci[4])
segments(4, df$lci[5], 4, df$uci[5])
segments(5, df$lci[6], 5, df$uci[6])
abline(h = 0, lty = 2)
box()
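As a side note, the five segments() calls above can be collapsed into a single vectorized call (an equivalent alternative, not in the original):
# vectorized equivalent of the five segments() calls
segments(1:5, df$lci[-1], 1:5, df$uci[-1])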