Leverage: a normalized vector length that measures how unusual each observation's predictor values are relative to the other observations
larger leverage means higher potential to influence regression line (think leverage in physical terms / torque)
HC2 and HC3 penalize high-leverage obs by inflating their error variance estimate
\(0 \leq h_i \leq 1\), \(\sum_{i=1}^{N} h_i = K+1\)
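For reference (an added note, using the standard definition): the leverage values are the diagonal entries of the hat matrix,
\[ h_i = \boldsymbol{x}_i^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{x}_i, \qquad \boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top} \]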
Leverage calculation
# sum of hats = K + 1
sum(hatvalues(m1))
[1] 5
length(coef(m1))
[1] 5
# if all hatvalues were equal
alleq <- length(coef(m1)) / length(hatvalues(m1))
# hatvalues gives leverage for all obs
sort(round(hatvalues(m1), 3), decreasing = TRUE)[1:20] / alleq
Linear hypothesis test
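The call that produced this output is not shown here; a sketch of the kind of call that yields it, assuming m1 is the fitted model and that the supplied coefficient covariance matrix comes from sandwich::vcovHC (the specific HC type is an assumption):
library(car)
library(sandwich)
# test H0: the two economic-conditions coefficients are equal,
# supplying a robust coefficient covariance matrix
linearHypothesis(m1, "economic.cond.national = economic.cond.household",
                 vcov. = vcovHC(m1, type = "HC3"))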
Hypothesis:
economic.cond.national - economic.cond.household = 0
Model 1: restricted model
Model 2: Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 1521
2 1520 1 12.695 0.0003779 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Testing for heteroskedasticity
Homoskedasticity means that the variance of \(u_i\) is independent of the \(x_{ki}\)
We can use the squared residuals, \(\hat{u}_i^2\), as an estimate of \(\text{Var}(u_i)\) and check the extent to which they can be explained by \(\boldsymbol{x}_i\)
A Breusch-Pagan test regresses the \(\hat{u}_i^2\) on the \(\boldsymbol{x}_i\)
A modified White test regresses the \(\hat{u}_i^2\) on \(\hat{y}_i\) and \(\hat{y}_i^2\)
Example, F-test
# calculate squared residuals from m1
res <- m1$residuals^2
# regress squared residuals on predictors
m1_res_1 <- lm(res ~ m1$model[,2] + m1$model[,3] + m1$model[,4] + m1$model[,5])
summary(m1_res_1) # use F-stat + p-value
Call:
lm(formula = res ~ m1$model[, 2] + m1$model[, 3] + m1$model[,
4] + m1$model[, 5])
Residuals:
Min 1Q Median 3Q Max
-15.110 -7.589 -2.316 5.678 58.084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1242 1.4241 0.087 0.93049
m1$model[, 2] 1.1096 0.1572 7.060 2.53e-12 ***
m1$model[, 3] -0.7444 0.4948 -1.504 0.13269
m1$model[, 4] 0.8455 0.2988 2.830 0.00472 **
m1$model[, 5] 0.5683 0.2830 2.008 0.04479 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.624 on 1520 degrees of freedom
Multiple R-squared: 0.04422, Adjusted R-squared: 0.0417
F-statistic: 17.58 on 4 and 1520 DF, p-value: 4.097e-14
Example, B-P test
# B-P test, use chi-squared test chi2 = N*R2, on K df
1 - pchisq(q  = summary(m1_res_1)$r.squared * sum(summary(m1_res_1)$df[1:2]),
           df = summary(m1_res_1)$df[1] - 1)
[1] 7.91589e-14
# use lmtest package
library(lmtest)
bptest(m1)
studentized Breusch-Pagan test
data: m1
BP = 67.43, df = 4, p-value = 7.914e-14
Example, White test
# use lmtest package, specify function for variance
y_hat <- predict(m1)
bptest(m1, varformula = ~ y_hat + I(y_hat^2))
studentized Breusch-Pagan test
data: m1
BP = 7.1803, df = 2, p-value = 0.02759
Clustering
Clustering arises when observations are drawn from a relevant subpopulation not captured by the predictors in the model
Schools, countries, congressional districts, time points, etc.
Non-independence –> errors are correlated for observations that share a group
As with heteroskedasticity, clustering leads to biased SEs (and hypothesis tests, etc.)
Non-diagonal covariance matrix for \(\boldsymbol{u}\)
To this point, we have dealt with situations where \(u_i \perp\!\!\!\!\perp u_j\)
With clustered errors this no longer holds, and \(\text{Var}(\boldsymbol{u})\) is no longer diagonal
The off-diagonal elements are the covariances of the errors between each pair of observations (illustrated below)
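As an added illustration (not in the original slides): with four observations in two clusters of two, the error covariance matrix takes a block form
\[
\text{Var}(\boldsymbol{u}) =
\begin{pmatrix}
\sigma^2_1 & \sigma_{12} & 0 & 0 \\
\sigma_{12} & \sigma^2_2 & 0 & 0 \\
0 & 0 & \sigma^2_3 & \sigma_{34} \\
0 & 0 & \sigma_{34} & \sigma^2_4
\end{pmatrix}
\]
where \(\sigma_{12}\) and \(\sigma_{34}\) are the within-cluster error covariances and the zeros reflect independence across clusters.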
“Fixed effects”
One common (and easy) solution is to estimate a model with fixed effects for groups
This just means that each group (except one) gets a binary variable in the estimated model
The indicator for group \(j\) gives the difference in average \(\hat{y}_i\) comparing group \(j\) to the excluded (baseline) group
Since we have removed what is shared across observations within groups, the errors are now independent, and SEs are unbiased
Example: Vocabulary
      year          sex          education       vocabulary
 Min.   :1974   Female:17148   Min.   : 0.00   Min.   : 0.000
 1st Qu.:1987   Male  :13203   1st Qu.:12.00   1st Qu.: 5.000
 Median :1994                  Median :12.00   Median : 6.000
 Mean   :1995                  Mean   :13.03   Mean   : 6.004
 3rd Qu.:2006                  3rd Qu.:15.00   3rd Qu.: 7.000
 Max.   :2016                  Max.   :20.00   Max.   :10.000
Example fixed effects
m2_a <- lm(vocabulary ~ education + as.factor(year), data = vocab)
m2_naive <- lm(vocabulary ~ education, data = vocab)
                      naive        fixed_effects
(Intercept)           1.678        1.952
                      (0.047)      (0.064)
education             0.332        0.343
                      (0.003)      (0.004)
as.factor(year)1976                0.041
                                   (0.069)
as.factor(year)1978                −0.117
                                   (0.068)
as.factor(year)1982                −0.398
                                   (0.066)
as.factor(year)1984                −0.218
                                   (0.069)
as.factor(year)1987                −0.545
                                   (0.066)
as.factor(year)1988                −0.512
                                   (0.078)
as.factor(year)1989                −0.416
                                   (0.077)
as.factor(year)1990                −0.311
                                   (0.080)
as.factor(year)1991                −0.286
                                   (0.077)
as.factor(year)1993                −0.485
                                   (0.076)
as.factor(year)1994                −0.376
                                   (0.065)
as.factor(year)1996                −0.520
                                   (0.065)
as.factor(year)1998                −0.413
                                   (0.071)
as.factor(year)2000                −0.504
                                   (0.071)
as.factor(year)2004                −0.456
                                   (0.069)
as.factor(year)2006                −0.410
                                   (0.070)
as.factor(year)2008                −0.623
                                   (0.073)
as.factor(year)2010                −0.548
                                   (0.070)
as.factor(year)2012                −0.695
                                   (0.071)
as.factor(year)2014                −0.682
                                   (0.067)
as.factor(year)2016                −0.637
                                   (0.065)
Num.Obs.              30351        30351
R2                    0.229        0.238
R2 Adj.               0.229        0.237
AIC                   123720.0     123416.3
BIC                   123745.0     123616.0
Log.Lik.              −61857.020   −61684.162
F                     9010.050     429.745
RMSE                  1.86         1.85
(standard errors in parentheses)
Example fixed effects
m2_b <- lm(vocabulary ~ 0 + education + as.factor(year), data = vocab)
summary(m2_b)
As with HC estimators, clustered SEs have a few variations:
The ordinary version w/no further adjustments
Ordinary with finite-sample adjustment (similar to HC1; Stata default with cluster option, sandwich default): \(\left( \frac{N-1}{N-K} \right) \left( \frac{G}{G-1} \right)\) (computed in the small example after this list)
“CR3”: similar to HC3 robust estimator (and is also conservative; can take a while to estimate)
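As a quick illustration of the finite-sample adjustment above (the values of N, K, and G here are hypothetical, chosen only for the arithmetic):
# hypothetical sample: N observations, K estimated coefficients, G clusters
N <- 1525; K <- 5; G <- 22
# adjustment factor applied to the ordinary clustered covariance matrix
(N - 1) / (N - K) * G / (G - 1)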
Using sandwich
library(sandwich)
# cluster by year with HC1 adjustment (default)
m2_cov <- vcovCL(m2, cluster = ~ year, type = "HC1")
# cluster by year with HC3 adjustment (takes a bit to estimate)
# m2_cov <- vcovCL(m2, cluster = ~ year, type = "HC3")
# ses
cbind(lm = sqrt(diag(vcov(m2))), cluster = sqrt(diag(m2_cov)))
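A natural follow-up (a sketch, assuming the m2_cov object above) is to use the clustered covariance matrix in coefficient tests:
library(lmtest)
# t-tests of the coefficients using the clustered covariance matrix
coeftest(m2, vcov. = m2_cov)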
An influential observation is one which, when removed, changes the estimate substantially
High leverage points can substantially affect estimates because OLS minimizes squared errors
Hypothetical example
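The slide's figure is not reproduced here; a minimal simulated sketch of the same idea (the data and seed are made up):
# simulate 30 well-behaved points plus one high-leverage, large-residual point
set.seed(1)
x <- c(rnorm(30), 10)
y <- c(rnorm(30), 10)
# the single influential point pulls the slope toward itself
coef(lm(y ~ x))            # with the influential observation
coef(lm(y[-31] ~ x[-31]))  # without it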
Studentized residuals
The residual for an observation based on a regression line estimated from all data except that observation, standardized by its estimated standard error
# use car package to test for outliers using studentized residuals
car::outlierTest(lm(data[,2] ~ data[,1]))
rstudent unadjusted p-value Bonferroni p
25 6.37629 2.0434e-06 5.1086e-05
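The studentized residuals themselves can also be extracted directly (a sketch, reusing the same hypothetical model object):
# externally studentized residuals for each observation
rstudent(lm(data[,2] ~ data[,1]))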
Measures of influence
Cook’s D: sum of squared differences in predicted values with and without the observation, standardized by the number of coefficients times the mean squared error of the regression
DFFITS: difference in predicted value for the focal observation with and without that observation, standardized by the standard error of the regression (estimated without the focal observation) times the square root of the focal observation's leverage; both measures are computed in the sketch after this list
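Both measures are available in base R; a sketch, reusing the hypothetical data object from the outlier test above:
m_inf <- lm(data[,2] ~ data[,1])
cooks.distance(m_inf)     # Cook's D for each observation
dffits(m_inf)             # DFFITS for each observation
influence.measures(m_inf) # all standard influence measures at once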
Violation of OLS assumption 5 means that OLS is no longer the most efficient unbiased estimator
Robust SEs correct the SEs, but do not make OLS most efficient
An alternative estimator, generalized least squares (GLS), is more efficient
GLS definition
GLS minimizes the weighted sum of squared residuals, where the weights are \(1 / w_i\), \(w_i = w(\boldsymbol{x}_i)\), and \(w\) is a function that maps \(\boldsymbol{x}_i\) to \(\text{Var}(u_i)\)
Specifically, we find the values of \(\boldsymbol{\beta}\) that minimize:
\[ \sum_{i=1}^{N} \frac{(y_i - \boldsymbol{x}_i^{\top}\boldsymbol{\beta})^2}{w_i} \]
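A minimal sketch with simulated data (not from the slides): when \(\text{Var}(u_i)\) is proportional to a known \(w_i\), lm() with weights = 1/w minimizes exactly this weighted sum:
set.seed(2)
n <- 500
x <- runif(n)
w <- exp(x)                              # assumed variance function of x
y <- 1 + 2 * x + rnorm(n, sd = sqrt(w))
gls_fit <- lm(y ~ x, weights = 1 / w)
coef(gls_fit)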
We rarely know the \(w_i\) with certainty, so they must be estimated from the data; this results in feasible generalized least squares, where we assume: