For this lab, we’ll work with a slightly modified version of the dataset on smoking habits originally collected by John Mullahy.
Let’s first load the data as cigs and re-code
income to be in 1,000s of dollars:
# load data
cigs <- read.csv("data/smoking.csv")
# re-code income to 1000s
cigs$income <- cigs$income / 1000
Visualize: Regress cigs on income and plot
the regression line in the same plot with the data points.
# model
m1 <- lm(cigs ~ income, data=cigs)
# plot reg line
plot(cigs$income, cigs$cigs, xlab="Income", ylab="Cigarettes per day")
abline(m1)
Based on this plot, do you think there is an issue with heteroskedasticity?
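Beyond eyeballing the plot, a formal test can back up the visual judgment. A minimal sketch using the Breusch-Pagan test from the lmtest package (an extra package, not otherwise used in this lab):
# Breusch-Pagan test: regresses the squared residuals on the predictors;
# a small p-value is evidence of heteroskedasticity
library(lmtest)
bptest(m1)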
Calculate heteroskedasticity-robust standard errors for this model.
library(sandwich)
HC0_sw <- sqrt(diag(vcovHC(m1, type="HC0")))
HC3_sw <- sqrt(diag(vcovHC(m1))) # default is HC3
round(data.frame(EST = coef(m1),
                 SE = summary(m1)$coef[,2],
                 HC0 = HC0_sw,
                 HC3 = HC3_sw), 5)
##                 EST      SE     HC0     HC3
## (Intercept) 7.14468 1.12814 0.99213 0.99532
## income      0.07987 0.05282 0.04979 0.04995
HC3 penalizes high-leverage observations by inflating their error variance estimates. In other words, it adjusts the estimated covariance matrix of the residuals (\(\hat{\Sigma}\)) by scaling each squared residual (\(\hat{u}_i^2\)) according to how much leverage (\(h_i\)) that observation has.
\[HC3: \quad \hat{\Sigma} = \operatorname{diag}\left(\frac{\hat{u}_i^2}{(1-h_i)^2}\right)\]
Leverage is a measure of how unusual each observation is relative to other observations.
The diagonal of the projection (hat) matrix gives each observation’s leverage:
\[P = X(X^TX)^{-1}X^T\]
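A minimal sketch connecting this formula to R’s built-in hatvalues() (X and P are new objects introduced here for illustration):
# build the projection matrix by hand; its diagonal should equal hatvalues(m1)
X <- model.matrix(m1)
P <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(P)), unname(hatvalues(m1)))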
# sum of hats = K + 1
sum(hatvalues(m1))
## [1] 2
length(coef(m1))
## [1] 2
sort(hatvalues(m1), decreasing = TRUE)[1:3]
##       217       371       514
## 0.0064876 0.0064876 0.0064876
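With the leverages in hand, the HC3 formula above can be reproduced by hand; the sketch below should match the HC3 column from vcovHC() earlier (bread, meat, and V_hc3 are new objects introduced here):
# sandwich form: (X'X)^-1 X' diag(u^2 / (1-h)^2) X (X'X)^-1
X <- model.matrix(m1)
u2 <- residuals(m1)^2
h <- hatvalues(m1)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% diag(u2 / (1 - h)^2) %*% X
V_hc3 <- bread %*% meat %*% bread
sqrt(diag(V_hc3)) # compare to HC3_sw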
Influence is a measure of how much each observation changes the estimate when removed. High leverage points can substantially affect estimates because OLS minimizes squared errors.
Cook’s D is the sum of squared differences in predicted values with and without the observation, standardized by the mean squared error of the regression (times the number of estimated parameters). Common rules of thumb flag observations with D greater than 1 or greater than 4/N.
sort(cooks.distance(m1), decreasing = TRUE)[1:3]
##        620        195        456
## 0.03901001 0.02000700 0.02000700
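A quick sketch applying the 4/N rule of thumb mentioned above (cutoff is a new object introduced here):
# flag observations whose Cook's D exceeds 4/N
cutoff <- 4 / nobs(m1)
sum(cooks.distance(m1) > cutoff) # how many observations are flagged
head(which(cooks.distance(m1) > cutoff)) # the first few flagged row labels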
Looking at the plot again, do you think any of the observations might have high influence? Label the three most influential observations with their ID number in your plot.
# three most influential observations
top_points <- cigs[order(cooks.distance(m1), decreasing = TRUE),][1:3,]
# label in plot
plot(cigs$income, cigs$cigs, xlab="Income", ylab="Cigarettes per day")
abline(m1)
text(top_points$income, top_points$cigs, labels = rownames(top_points), pos=c(2,3,2))
library(ggplot2)
# install.packages("ggrepel")
library(ggrepel)
ggplot(cigs, aes(x = income, y = cigs)) +
  geom_point() +
  geom_text_repel(data = top_points, aes(label = rownames(top_points))) +
  geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2])
Re-estimate the model without the three most influential observations. Compare coefficients.
m1r <- lm(cigs ~ income, data=cigs[-c(as.numeric(rownames(top_points))), ])
HC3_m1 <- sqrt(diag(vcovHC(m1)))
HC3_m1r <- sqrt(diag(vcovHC(m1r)))
round(data.frame(m1_est = coef(m1),
                 m1_se = HC3_m1,
                 m1r_est = coef(m1r),
                 m1r_se = HC3_m1r), 5)
##              m1_est   m1_se m1r_est  m1r_se
## (Intercept) 7.14468 0.99532 7.46027 0.98030
## income      0.07987 0.04995 0.05242 0.04757
Robust standard errors correct the standard errors, but they do not make OLS efficient. An alternative estimator, generalized least squares (GLS), is more efficient under heteroskedasticity.
GLS minimizes the weighted sum of squared residuals, where the weights are \(1/w_i\).
\[\sum_{i=1}^{N}\frac{(y_i - x_i\beta)^2}{w_i} \]
In feasible generalized least squares (FGLS), we estimate \(w_i\) from the data:
\[\hat{w}_i = e^{x_i\hat{\delta}}\] We can estimate \(\delta\) using OLS and the observed residuals:
\[\ln(\hat{u}_i^2) = x_i\delta + \epsilon_i\]
# regress logged squared residuals on predictors
m1_res <- lm(log(m1$residuals^2) ~ m1$model[,2])
# estimate wls with exp(y-hat) as weights
m1_wls <- lm(m1$model[,1] ~ m1$model[,2],
             weights = 1 / exp(m1_res$fitted.values))
summary(m1_wls)
##
## Call:
## lm(formula = m1$model[, 1] ~ m1$model[, 2], weights = 1/exp(m1_res$fitted.values))
##
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9450 -0.9434 -0.9386  1.0044  6.8332 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.95314    1.01612   6.843 1.54e-11 ***
## m1$model[, 2]  0.09027    0.05129   1.760   0.0788 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.477 on 805 degrees of freedom
## Multiple R-squared: 0.003834, Adjusted R-squared: 0.002597
## F-statistic: 3.098 on 1 and 805 DF, p-value: 0.07875
# robust standard errors
HC3_wls <- sqrt(diag(vcovHC(m1_wls)))
round(data.frame(ols_est = coef(m1),
                 ols_se = HC3_m1,
                 wls_est = coef(m1_wls),
                 wls_se = HC3_wls), 5)
##             ols_est  ols_se wls_est  wls_se
## (Intercept) 7.14468 0.99532 6.95314 0.96636
## income      0.07987 0.04995 0.09027 0.04872
Imagine we have a system of equations:
\[\begin{aligned} y_{1i} &= 2y_{2i} + 1x_{1i} + u_{1i} \\ y_{2i} &= 1z_{1i} + 1x_{1i} + u_{2i} \end{aligned}\]
\(y_1\) and \(y_2\) are endogenous (causally downstream of \(x_1\))
\(x_1\) and \(z_1\) are exogenous (independent variables)
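Substituting the second equation into the first (a derivation added here for reference, using the coefficients above) gives the reduced form for \(y_1\); it shows that \(y_2\) carries both \(x_1\) and \(u_2\) into the first equation:
\[y_{1i} = 2(z_{1i} + x_{1i} + u_{2i}) + x_{1i} + u_{1i} = 2z_{1i} + 3x_{1i} + 2u_{2i} + u_{1i}\]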
Let’s assume the error terms and exogenous variables are standard normal, and create a function that generates simulated data from this system of equations.
## function for simulations
sims <- function(N=1000, B_y2z1=1, B_y2x1=1, B_y1y2=2, B_y1x1=1){
  # exog vars
  z1 <- rnorm(N)
  x1 <- rnorm(N)
  u1 <- rnorm(N)
  u2 <- rnorm(N)
  # endog vars
  y2 <- B_y2z1*z1 + B_y2x1*x1 + u2
  y1 <- B_y1y2*y2 + B_y1x1*x1 + u1
  out <- data.frame(z1, x1, u1, u2, y1, y2)
  return(out)
}
data <- sims() # dimensions?
Now, imagine we estimate the model for \(y_1\), but we exclude \(x_1\). \(x_1\) is correlated with \(y_1\) and \(y_2\), so it is an omitted variable.
The coefficient on \(y_2\) will be biased because omitting \(x_1\) leads to correlation between \(y_2\) and the error term.
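A quick check in one simulated data set (d is a new object introduced here): with \(x_1\) omitted, the composite error is \(x_{1i} + u_{1i}\), which is correlated with \(y_2\).
# the composite error after omitting x1 is x1 + u1; check its correlation with y2
d <- sims()
cor(d$y2, d$x1 + d$u1)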
Let’s estimate the model for 1000 simulated data sets and compare the average coefficient estimate to the true one:
# for loop
betas <- rep(NA, 1000)
for (i in 1:1000){
  data <- sims() # simulate data
  betas[i] <- coef(lm(y1 ~ y2, data))[2] # estimate coef
}
# sapply
betas <- sapply(1:1000, function(x) coef(lm(y1 ~ y2, data=sims()))[2])
# compare to known coef of 2 (the estimate is biased upward)
mean(betas)
## [1] 2.332034
hist(betas)
The instrumental variable (\(z_1\)) has two key properties: it is relevant (correlated with the endogenous regressor \(y_2\)) and exogenous (uncorrelated with the error term \(u_1\)).
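A quick sketch checking both properties in one simulated data set (d is a new object introduced here):
d <- sims()
cor(d$z1, d$y2) # relevance: clearly nonzero by construction
cor(d$z1, d$u1) # exogeneity: approximately zero by construction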
The instrumental variable estimator for the bivariate case (biased in finite samples, but consistent):
\[\beta_1 = \frac{Cov(y_{1i},z_{1i})}{Cov(y_{2i},z_{1i})}\]
betas <- c()
for (i in 1:1000){
  data <- sims()
  betas[i] <- cov(data$y1, data$z1) / cov(data$y2, data$z1)
}
mean(betas)
## [1] 1.999254
Two-stage least squares (2SLS) allows for instrumental variables estimation in multiple regression, with multiple instrumental variables and multiple endogenous variables.
Stage 1: Purge (residualize) the endogenous independent variable (\(y_{2i}\)) of the variance correlated with the error term \(u_{1i}\) by regressing it on all available exogenous and instrumental variables.
Stage 2: Use the fitted values (\(\hat{y}_{2i}\)) from stage 1 as the instrument for \(y_{2i}\) in the original regression model.
betas <- c()
for (i in 1:1000){
  data <- sims()
  stage1 <- lm(y2 ~ z1, data=data) # regress y2 on z1
  stage2 <- lm(data$y1 ~ stage1$fitted.values) # regress y1 on predicted y2
  betas[i] <- coef(stage2)[2] # instrumental variable estimates
}
mean(betas)
## [1] 1.998447
Running stage 2 by OLS recovers the instrumental variable estimates, but its standard errors are incorrect: they are based on residuals computed from the fitted values \(\hat{y}_{2i}\) rather than the observed \(y_{2i}\). The matrix representation is useful for calculating correct standard errors.
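A quick sketch of the problem in one simulated data set (d, s1, s2, b, resid_naive, and resid_correct are new objects introduced here):
# stage-2 OLS residuals use fitted y2; correct 2SLS residuals use observed y2
d <- sims()
s1 <- lm(y2 ~ z1, data=d)
s2 <- lm(d$y1 ~ s1$fitted.values)
b <- coef(s2)
resid_naive <- d$y1 - b[1] - b[2]*s1$fitted.values # what summary(s2) uses
resid_correct <- d$y1 - b[1] - b[2]*d$y2 # what 2SLS standard errors should use
c(sd(resid_naive), sd(resid_correct))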
\[\beta_1 = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y_1\]
Where \(X\) is the design matrix of the exogenous and endogenous variables in the model for \(y_{1i}\), and \(Z\) is the design matrix of the exogenous variables in that model plus the instrumental variables for \(y_{2i}\).
betas <- c()
for (i in 1:1000){
  data <- sims()
  X <- as.matrix(cbind(1, data$y2))
  Z <- as.matrix(cbind(1, data$z1))
  y <- as.vector(data$y1)
  betas[i] <- ( solve(t(X)%*%Z %*% solve(t(Z)%*%Z) %*% t(Z)%*%X) %*%
                  t(X)%*%Z %*% solve(t(Z)%*%Z) %*% t(Z)%*%y )[2]
}
mean(betas)
## [1] 2.000865
2SLS Variance Assuming Homoskedasticity
# pull it out of the for loop
X <- as.matrix(cbind(1, data$y2))
Z <- as.matrix(cbind(1, data$z1))
y <- as.vector(data$y1)
SLS2 <- solve(t(X)%*%Z %*% solve(t(Z)%*%Z) %*% t(Z)%*%X) %*%
  t(X)%*%Z %*% solve(t(Z)%*%Z) %*% t(Z)%*%y
# sample size
N <- nrow(data)
# calculate sigma^2 hat
s2_hat <- (1 / (N - ncol(X)))*sum( (y - X%*%SLS2)^2 )
# calculate estimated covariance matrix
VB <- solve( t(X)%*%Z %*% solve(t(Z)%*%Z) %*% t(Z)%*%X ) * s2_hat
sqrt(diag(VB))
## [1] 0.04502212 0.04493959
Correcting for Heteroskedasticity
# calculate sigma^2 hat vector
s2_hat <- diag( as.vector((y - X%*%SLS2)^2) )
# calculate Omega
Om <- t(Z) %*% s2_hat %*% Z
# additional component matrices
tXZ <- t(X)%*%Z # X'Z
tZZi <- solve(t(Z)%*%Z) # (Z'Z)^-1
tZX <- t(Z)%*%X # Z'X
# calculate estimated covariance matrix
VB <- solve( tXZ %*% tZZi %*% tZX ) %*%
  ( tXZ %*% tZZi %*% Om %*% tZZi %*% tZX ) %*%
  solve( tXZ %*% tZZi %*% tZX )
sqrt(diag(VB))
## [1] 0.04519332 0.04375632
The ivreg package reports regression diagnostics along with the instrumental variables estimates.
#install.packages("ivreg")
library(ivreg)
m_iv <- ivreg(y1 ~ 1 | y2 | z1, data=sims(), method="OLS")
summary(m_iv, vcov = sandwich::vcovHC)
##
## Call:
## ivreg(formula = y1 ~ 1 | y2 | z1, data = sims(), method = "OLS")
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.02068 -0.97912 -0.05045  0.94603  4.90798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02288    0.04565  -0.501    0.616    
## y2           1.99186    0.04567  43.611   <2e-16 ***
##
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 998     571.9  <2e-16 ***
## Wu-Hausman         1 997     110.0  <2e-16 ***
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.436 on 998 degrees of freedom
## Multiple R-Squared: 0.8833, Adjusted R-squared: 0.8832
## Wald test: 1902 on 1 and 998 DF, p-value: < 2.2e-16