Heteroskedasticity

POLSCI 630: Probability and Basic Regression

March 25, 2025

Homoskedasticity assumption

  1. \(\text{Var}(\boldsymbol{u}|\textbf{X}) = \sigma^2\textbf{I}\) (Spherical errors)

This implies:

  • Homoskedasticity (the error variance does not vary with the IVs; error variance that does vary with the IVs is called heteroskedasticity)

  • No autocorrelation (no correlation of residuals among observations, e.g., adjacent units in space or time)

Consequences of violation

  1. Estimator for \(\text{Var}(\hat{\boldsymbol{\beta}})\) is biased

  2. All things based on \(\text{Var}(\hat{\boldsymbol{\beta}})\) are wrong

    • SEs and confidence intervals
    • t-tests
    • F-tests
  3. OLS is no longer the best linear unbiased estimator (not BLUE)

Derive variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\)

\(\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(\text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} \right)\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u}) \, \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]'\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u}) \, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

assume \(\text{E}(\boldsymbol{u}\boldsymbol{u}'|\mathbf{X}) = \text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\):

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

Variance w/ heteroskedasticity

What if we don’t assume \(\text{E}(\boldsymbol{u}\boldsymbol{u}'|\textbf{X}) = \sigma^2\textbf{I}\)?

\[ \text{Var}(\hat{\boldsymbol{\beta}}) = (\textbf{X}'\textbf{X})^{-1}\textbf{X}' \text{E}(\boldsymbol{u}\boldsymbol{u}' | \mathbf{X}) \textbf{X} (\textbf{X}'\textbf{X})^{-1} \]

\[ = (\textbf{X}'\textbf{X})^{-1}\textbf{X}' \bf{\Sigma} \textbf{X} (\textbf{X}'\textbf{X})^{-1} \]

“robust”, “Huber”, “Huber-White”, “heteroskedasticity-consistent” (HC), “cluster-robust”, etc., standard errors…

  • use an estimate of \(\bf{\Sigma}\) to adjust the SEs and get appropriate uncertainty estimates

“Robust” (Huber-White) SEs

Consistent estimator for arbitrary heteroskedasticity with independent errors, i.e., \(\mathbf{\Sigma} = \text{diag}(\boldsymbol{\sigma}^2)\)

  • where \(\boldsymbol{\sigma}^2 = \begin{bmatrix} \sigma_1^2 & \sigma_2^2 & \dots & \sigma_N^2 \end{bmatrix}\)

\(\hat{\text{Var}}(\hat{\boldsymbol{\beta}}) = (\textbf{X}'\textbf{X})^{-1}\textbf{X}' \hat{\mathbf{\Sigma}} \textbf{X} (\textbf{X}'\textbf{X})^{-1}\)

  • where \(\hat{\mathbf{\Sigma}} = \text{diag}(\hat{\boldsymbol{u}}^2)\)

In words: substitute a diagonal matrix where the entries are each observation’s squared residual

Example
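
The model m1 summarized below uses the BEPS data (British Election Panel Study, presumably from the carData package). A minimal sketch of how it could be constructed, assuming gender has been recoded to a numeric indicator:

library(carData)   # BEPS: 1997-2001 British Election Panel Study

# assumption: gender recoded to a numeric indicator (exact coding not shown in the slides)
BEPS$gender <- as.numeric(BEPS$gender == "male")

m1 <- lm(Europe ~ I(age/10) + gender + economic.cond.national + 
           economic.cond.household, data = BEPS)
summary(m1)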


Call:
lm(formula = Europe ~ I(age/10) + gender + economic.cond.national + 
    economic.cond.household, data = BEPS)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2178 -2.5821 -0.1303  2.9648  6.1123 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              8.53095    0.47496  17.961  < 2e-16 ***
I(age/10)                0.14562    0.05242   2.778  0.00554 ** 
gender                   0.42393    0.16504   2.569  0.01030 *  
economic.cond.national  -0.71951    0.09966  -7.220 8.19e-13 ***
economic.cond.household -0.15339    0.09438  -1.625  0.10432    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.21 on 1520 degrees of freedom
Multiple R-squared:  0.05498,   Adjusted R-squared:  0.05249 
F-statistic: 22.11 on 4 and 1520 DF,  p-value: < 2.2e-16

Example

library(sandwich)

# adjust SEs
N <- nrow(m1$model)
K <- summary(m1)$df[1]
X <- cbind(rep(1, N), m1$model[,-1])
colnames(X)[1] <- "Intercept"
X <- as.matrix(X)

# default standard error
se_default <- sqrt(diag((sum(m1$residuals^2) / m1$df.residual) * (solve(t(X) %*% X))))

# robust standard error (plug squared residuals in)
se_robust <- sqrt(diag( 
                      solve(t(X) %*% X) %*% t(X) %*%
                      diag(m1$residuals^2) %*% 
                      X %*% solve(t(X) %*% X)
                      )
)

# print
round(data.frame(coefficient = coef(m1), 
           se.default = se_default, 
           se.robust = se_robust,
           se.robust.sandwich = sqrt(diag(sandwich::vcovHC(m1, type = "HC0")))), 
      5)
                        coefficient se.default se.robust se.robust.sandwich
(Intercept)                 8.53095    0.47496   0.47660            0.47660
I(age/10)                   0.14562    0.05242   0.05273            0.05273
gender                      0.42393    0.16504   0.16538            0.16538
economic.cond.national     -0.71951    0.09966   0.09917            0.09917
economic.cond.household    -0.15339    0.09438   0.09319            0.09319

Different types

  1. HC0: previous slides
  2. HC1: finite-sample adjustment, \(\frac{N}{N-K} \times\) HC0
  3. HC2: \(\mathbf{\hat{\Sigma}} = \text{diag}\left(\frac{\hat{u}_i^2}{1-h_i}\right)\), where \(h_i\) is the leverage for unit \(i\)
  4. HC3: \(\mathbf{\hat{\Sigma}} = \text{diag}\left(\frac{\hat{u}_i^2}{(1-h_i)^2}\right)\) (a by-hand sketch of HC1–HC3 follows after this list)
  • All are asymptotically equivalent
  • HC2 and HC3 perform better than HC0 and HC1
  • HC2 is unbiased under homoskedasticity, while HC3 is conservative
  • HC3 may perform best in “small” samples (N < 500)
  • HC2 and HC3 are affected by observations with very large leverage
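
A rough sketch (reusing m1; the helper objects below are mine, not from any package) of how the HC1–HC3 variants can be built from the same pieces as the HC0 calculation above:

# pieces of the sandwich
X  <- model.matrix(m1)            # same design matrix as built by hand earlier
u2 <- residuals(m1)^2             # squared residuals
h  <- hatvalues(m1)               # leverages
N  <- nrow(X); K <- ncol(X)       # K = number of estimated coefficients
bread <- solve(crossprod(X))      # (X'X)^{-1}
meat  <- function(w) crossprod(X, X * (u2 / w))   # X' diag(u2 / w) X

# HC0-HC3 standard errors
se_HC0 <- sqrt(diag(bread %*% meat(rep(1, N)) %*% bread))
se_HC1 <- sqrt(N / (N - K)) * se_HC0
se_HC2 <- sqrt(diag(bread %*% meat(1 - h) %*% bread))
se_HC3 <- sqrt(diag(bread %*% meat((1 - h)^2) %*% bread))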

Projection matrix

For outcome \(y\) and fitted values \(\hat{y}\), there is a projection matrix \(\mathbf{P}\) such that \(\hat{y} = \mathbf{P}y\).

\[ \begin{align} \hat{y} &= \mathbf{X}\hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty \\ \mathbf{P} &= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \end{align} \]

Sometimes called the “hat matrix” because “\(\mathbf{P}\) puts the hat on \(y\).”
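
A quick numerical check of these identities (a sketch, assuming m1 from the earlier example):

# hat matrix for m1
X <- model.matrix(m1)
P <- X %*% solve(crossprod(X)) %*% t(X)       # X (X'X)^{-1} X'

# P "puts the hat on y"
all.equal(unname(drop(P %*% m1$model[, 1])), unname(fitted(m1)))

# diagonal of P = the leverages discussed next
all.equal(unname(diag(P)), unname(hatvalues(m1)))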

Leverage

\(\mathbf{P} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

Diagonal gives each observation’s leverage

  • normalized vector length: measure of how unusual each observation is relative to other obs
  • larger leverage means higher potential to influence regression line (think leverage in physical terms / torque)
  • HC2 and HC3 penalize high-leverage obs by inflating their error variance estimate
  • \(0 \leq h_i \leq 1\), \(\sum_{i=1}^N h_i = K+1\)

Leverage calculation

# sum of hats = K + 1
sum(hatvalues(m1))
[1] 5
length(coef(m1))
[1] 5
# if all hatvalues were equal
alleq <- length(coef(m1)) / length(hatvalues(m1))

# hatvalues gives leverage for all obs
sort(round(hatvalues(m1), 3), decreasing = TRUE)[1:20] / alleq
  542   653   730  1086    63   110   216   573  1076  1244    94   344   390 
4.270 3.355 3.355 3.050 2.745 2.745 2.745 2.745 2.745 2.745 2.440 2.440 2.440 
  395   437   442   447   450   555   558 
2.440 2.440 2.440 2.440 2.440 2.440 2.440 
# boxplot
bp <- boxplot(hatvalues(m1), ylim=c(0,0.02))

# list top 5 leverage observations
text(x = 1.1, y = 0.015 - (1:5)*0.001, 
     labels = names(bp$out[order(bp$out, decreasing = T)][1:5]), cex = 0.7)

Using sandwich

library(sandwich)

# calculate HC0 and HC3 using sandwich
HC0 <- sqrt(diag(vcovHC(m1, type="HC0")))
HC3 <- sqrt(diag(vcovHC(m1))) # default is HC3

# compare
table <- cbind(coef(m1), sqrt(diag(vcov(m1))), se_robust, HC0, HC3)
colnames(table) <- c("B","SE","HC0 by hand","sandwich HC0","sandwich HC3")
round(table, 3)
                             B    SE HC0 by hand sandwich HC0 sandwich HC3
(Intercept)              8.531 0.475       0.477        0.477        0.479
I(age/10)                0.146 0.052       0.053        0.053        0.053
gender                   0.424 0.165       0.165        0.165        0.166
economic.cond.national  -0.720 0.100       0.099        0.099        0.100
economic.cond.household -0.153 0.094       0.093        0.093        0.094

Using car

library(car)

# calculate HC0 and HC3 using car
HC0 <- sqrt(diag(hccm(m1, type="hc0")))
HC3 <- sqrt(diag(hccm(m1, type="hc3")))

# compare
table <- cbind(coef(m1), sqrt(diag(vcov(m1))), se_robust, HC0, HC3)
colnames(table) <- c("B","SE","HC0 by hand","car HC0","car HC3")
round(table, 3)
                             B    SE HC0 by hand car HC0 car HC3
(Intercept)              8.531 0.475       0.477   0.477   0.479
I(age/10)                0.146 0.052       0.053   0.053   0.053
gender                   0.424 0.165       0.165   0.165   0.166
economic.cond.national  -0.720 0.100       0.099   0.099   0.100
economic.cond.household -0.153 0.094       0.093   0.093   0.094

Robust test statistics

  1. Robust t-tests for regression coefficients use robust SEs

  2. In general, hypothesis tests require the adjusted covariance matrix of the coefficients, with test statistics asymptotically distributed chi-squared (see the sketch below)
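
For example (a sketch using lmtest with the sandwich covariance; HC3 is an arbitrary choice here):

library(lmtest)
library(sandwich)

# robust t-tests for the coefficients of m1
coeftest(m1, vcov = vcovHC(m1, type = "HC3"))

# robust Wald (chi-squared) test that both economic-conditions coefficients are zero
waldtest(m1, . ~ . - economic.cond.national - economic.cond.household,
         vcov = function(x) vcovHC(x, type = "HC3"), test = "Chisq")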

Example: marginal effects

# sandwich
cbind(coef(m1), sqrt(diag(sandwich::vcovHC(m1))))
                              [,1]       [,2]
(Intercept)              8.5309469 0.47871785
I(age/10)                0.1456237 0.05295004
gender                   0.4239306 0.16592752
economic.cond.national  -0.7195125 0.09964462
economic.cond.household -0.1533928 0.09364993
# margins, robust
library(marginaleffects)
options(marginaleffects_print_omit = c("s.value", "contrast"))
avg_slopes(m1, vcov = "HC3")

                    Term Contrast Estimate Std. Error     z Pr(>|z|)    S
 age                        dY/dX   0.0146     0.0053  2.75  0.00596  7.4
 economic.cond.household    dY/dX  -0.1534     0.0937 -1.64  0.10152  3.3
 economic.cond.national     dY/dX  -0.7195     0.0996 -7.22  < 0.001 40.8
 gender                     1 - 0   0.4239     0.1659  2.55  0.01062  6.6
    2.5 %  97.5 %
  0.00418  0.0249
 -0.33699  0.0302
 -0.91481 -0.5242
  0.09872  0.7491

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Example: conditional MEs

# include interaction
m1_b <- lm(Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household*political.knowledge, data=BEPS)

# conditional MEs for ego retrospections
slopes(m1_b, 
       variables = c("economic.cond.household"), 
       newdata = datagrid(political.knowledge = seq(0, 3, 1)),
       vcov = "HC3")

                    Term political.knowledge Estimate Std. Error      z
 economic.cond.household                   0   0.0711     0.1641  0.433
 economic.cond.household                   1  -0.0879     0.1079 -0.815
 economic.cond.household                   2  -0.2470     0.0987 -2.504
 economic.cond.household                   3  -0.4061     0.1462 -2.778
 Pr(>|z|)   S  2.5 %  97.5 %
  0.66468 0.6 -0.251  0.3928
  0.41519 1.3 -0.299  0.1236
  0.01229 6.3 -0.440 -0.0537
  0.00546 7.5 -0.693 -0.1196

Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, political.knowledge, predicted_lo, predicted_hi, predicted, Europe, age, gender, economic.cond.national, economic.cond.household 
Type:  response 

Alternatives

Simulation:

# draw your sims using the robust vcov and MASS::mvrnorm
sims <- MASS::mvrnorm(1000, mu=coef(m1_b), Sigma=sandwich::vcovHC(m1_b))

Resampling

  • Jackknife approximates HC3

  • Simple bootstrap asymptotically equivalent to HC estimators, but may be poor in small samples

  • Wild bootstrap better choice w/ heteroskedastic data (approximates HC0; sketched after this list)

    • Can use transformations of residuals to mimic HC2 and HC3 (i.e., divide each by \(\sqrt{1 - h}\) or \(1 - h\))
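
A minimal wild-bootstrap sketch with Rademacher weights (the implementation details here are illustrative, not from a package), reusing m1:

set.seed(630)
X    <- model.matrix(m1)
yhat <- fitted(m1)
u    <- residuals(m1)

# re-generate the outcome B times by randomly flipping residual signs,
# keeping X fixed, then refit and collect the coefficients
B <- 1000
boot_coefs <- replicate(B, {
  ystar <- yhat + u * sample(c(-1, 1), length(u), replace = TRUE)
  coef(lm.fit(X, ystar))
})

# bootstrap SEs (should approximate the HC0 robust SEs)
apply(boot_coefs, 1, sd)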

Using linearHypothesis

HC3 adjustment:

library(car)

linearHypothesis(m1, 
                 c("economic.cond.national - economic.cond.household = 0"), 
                 white.adjust = "hc3")
Linear hypothesis test

Hypothesis:
economic.cond.national - economic.cond.household = 0

Model 1: restricted model
Model 2: Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1   1521                        
2   1520  1 12.695 0.0003779 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing for heteroskedasticity

Homoskedasticity means that the variance of \(u_i\) is independent of the \(x_{ki}\)

  • We can use the squared residuals, \(\hat{u}_i^2\), as an estimate of \(\text{Var}(u_i)\) and check the extent to which they can be explained by \(\boldsymbol{x}_i\)

  • A Breusch-Pagan test regresses the \(\hat{u}_i^2\) on the \(\boldsymbol{x}_i\)

  • A modified White test regresses the \(\hat{u}_i^2\) on \(\hat{y}_i\) and \(\hat{y}_i^2\)

Example, F-test

# calculate squared residuals from m1
res <- m1$residuals^2

# regress squared residuals on predictors
m1_res_1 <- lm(res ~ m1$model[,2] + m1$model[,3] + m1$model[,4] + m1$model[,5])
summary(m1_res_1) # use F-stat + p-value

Call:
lm(formula = res ~ m1$model[, 2] + m1$model[, 3] + m1$model[, 
    4] + m1$model[, 5])

Residuals:
    Min      1Q  Median      3Q     Max 
-15.110  -7.589  -2.316   5.678  58.084 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.1242     1.4241   0.087  0.93049    
m1$model[, 2]   1.1096     0.1572   7.060 2.53e-12 ***
m1$model[, 3]  -0.7444     0.4948  -1.504  0.13269    
m1$model[, 4]   0.8455     0.2988   2.830  0.00472 ** 
m1$model[, 5]   0.5683     0.2830   2.008  0.04479 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.624 on 1520 degrees of freedom
Multiple R-squared:  0.04422,   Adjusted R-squared:  0.0417 
F-statistic: 17.58 on 4 and 1520 DF,  p-value: 4.097e-14

Example, B-P test

# B-P test, use chi-squared test chi2 = N*R2, on K df
1 - pchisq(q = summary(m1_res_1)$r.squared * sum(summary(m1_res_1)$df[1:2]), 
           df = summary(m1_res_1)$df[1] - 1)
[1] 7.91589e-14
# use lmtest package
library(lmtest)
bptest(m1)

    studentized Breusch-Pagan test

data:  m1
BP = 67.43, df = 4, p-value = 7.914e-14

Example, White test

# use lmtest package, specify function for variance
y_hat <- predict(m1)
bptest(m1, varformula = ~ y_hat + I(y_hat^2))

    studentized Breusch-Pagan test

data:  m1
BP = 7.1803, df = 2, p-value = 0.02759

Clustering

Clustering arises when observations are drawn from a relevant subpopulation not captured by the predictors in the model

  • Schools, countries, congressional districts, time points, etc.
  • Non-independence -> errors are correlated for observations that share a group
  • As with heteroskedasticity, clustering leads to biased SEs (and hypothesis tests, etc.)

Non-diagonal covariance matrix for \(u_i\)

To this point, we have dealt with situations where \(u_i \perp\!\!\!\!\perp u_j\)

  • Clustering implies \(\text{Cov}(u_i, u_j) \neq 0\) for observations in the same group
  • In which case, \(\text{Var}(\boldsymbol{u})\) is not diagonal
  • The off-diagonal elements are the covariances of the errors between each pair of observations (a toy example follows)
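
As a toy illustration (hypothetical numbers), here is what \(\text{Var}(\boldsymbol{u})\) might look like for four observations in two clusters of two: non-zero covariances within a cluster, zeros across clusters.

# toy block-diagonal error covariance: 2 clusters x 2 observations
Sigma <- matrix(0, 4, 4)
Sigma[1:2, 1:2] <- matrix(c(1.0, 0.5,
                            0.5, 1.0), 2, 2)   # cluster 1
Sigma[3:4, 3:4] <- matrix(c(2.0, 0.8,
                            0.8, 2.0), 2, 2)   # cluster 2
Sigma   # off-diagonal blocks are all zero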

“Fixed effects”

One common (and easy) solution is to estimate a model with fixed effects for groups

  • This just means that each group (except one) gets a binary variable in the estimated model
  • The indicator for group \(j\) gives the difference in average \(\hat{y}_i\) comparing group \(j\) to the excluded (baseline) group
  • Since the group indicators absorb what is shared across observations within a group, the remaining errors are treated as independent and the usual SEs are unbiased

Example: Vocabulary

      year          sex          education       vocabulary    
 Min.   :1974   Female:17148   Min.   : 0.00   Min.   : 0.000  
 1st Qu.:1987   Male  :13203   1st Qu.:12.00   1st Qu.: 5.000  
 Median :1994                  Median :12.00   Median : 6.000  
 Mean   :1995                  Mean   :13.03   Mean   : 6.004  
 3rd Qu.:2006                  3rd Qu.:15.00   3rd Qu.: 7.000  
 Max.   :2016                  Max.   :20.00   Max.   :10.000  

Example fixed effects

m2_a <- lm(vocabulary ~ 
             education + 
             as.factor(year), 
           data = vocab)

m2_naive <- lm(vocabulary ~ 
                 education, 
               data = vocab)
                     naive       fixed_effects
(Intercept)          1.678       1.952
                     (0.047)     (0.064)
education            0.332       0.343
                     (0.003)     (0.004)
as.factor(year)1976              0.041
                                 (0.069)
as.factor(year)1978              −0.117
                                 (0.068)
as.factor(year)1982              −0.398
                                 (0.066)
as.factor(year)1984              −0.218
                                 (0.069)
as.factor(year)1987              −0.545
                                 (0.066)
as.factor(year)1988              −0.512
                                 (0.078)
as.factor(year)1989              −0.416
                                 (0.077)
as.factor(year)1990              −0.311
                                 (0.080)
as.factor(year)1991              −0.286
                                 (0.077)
as.factor(year)1993              −0.485
                                 (0.076)
as.factor(year)1994              −0.376
                                 (0.065)
as.factor(year)1996              −0.520
                                 (0.065)
as.factor(year)1998              −0.413
                                 (0.071)
as.factor(year)2000              −0.504
                                 (0.071)
as.factor(year)2004              −0.456
                                 (0.069)
as.factor(year)2006              −0.410
                                 (0.070)
as.factor(year)2008              −0.623
                                 (0.073)
as.factor(year)2010              −0.548
                                 (0.070)
as.factor(year)2012              −0.695
                                 (0.071)
as.factor(year)2014              −0.682
                                 (0.067)
as.factor(year)2016              −0.637
                                 (0.065)
Num.Obs.             30351       30351
R2                   0.229       0.238
R2 Adj.              0.229       0.237
AIC                  123720.0    123416.3
BIC                  123745.0    123616.0
Log.Lik.             −61857.020  −61684.162
F                    9010.050    429.745
RMSE                 1.86        1.85

Example fixed effects

m2_b <- lm(vocabulary ~ 0 + education + as.factor(year), data = vocab)
summary(m2_b)

Call:
lm(formula = vocabulary ~ 0 + education + as.factor(year), data = vocab)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4002 -1.1160  0.0501  1.2561  8.4605 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
education           0.342925   0.003553   96.52   <2e-16 ***
as.factor(year)1974 1.952034   0.064345   30.34   <2e-16 ***
as.factor(year)1976 1.992816   0.064444   30.92   <2e-16 ***
as.factor(year)1978 1.834844   0.064272   28.55   <2e-16 ***
as.factor(year)1982 1.553608   0.062154   25.00   <2e-16 ***
as.factor(year)1984 1.734251   0.066500   26.08   <2e-16 ***
as.factor(year)1987 1.407238   0.063705   22.09   <2e-16 ***
as.factor(year)1988 1.439784   0.076217   18.89   <2e-16 ***
as.factor(year)1989 1.535556   0.074888   20.50   <2e-16 ***
as.factor(year)1990 1.640631   0.078597   20.87   <2e-16 ***
as.factor(year)1991 1.666054   0.075186   22.16   <2e-16 ***
as.factor(year)1993 1.466978   0.074182   19.77   <2e-16 ***
as.factor(year)1994 1.576093   0.063906   24.66   <2e-16 ***
as.factor(year)1996 1.431860   0.064095   22.34   <2e-16 ***
as.factor(year)1998 1.539516   0.069905   22.02   <2e-16 ***
as.factor(year)2000 1.448530   0.069558   20.82   <2e-16 ***
as.factor(year)2004 1.495615   0.068992   21.68   <2e-16 ***
as.factor(year)2006 1.541744   0.068867   22.39   <2e-16 ***
as.factor(year)2008 1.329086   0.072270   18.39   <2e-16 ***
as.factor(year)2010 1.403861   0.069063   20.33   <2e-16 ***
as.factor(year)2012 1.257142   0.070749   17.77   <2e-16 ***
as.factor(year)2014 1.269685   0.066842   19.00   <2e-16 ***
as.factor(year)2016 1.315095   0.064893   20.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.847 on 30328 degrees of freedom
Multiple R-squared:  0.9158,    Adjusted R-squared:  0.9158 
F-statistic: 1.435e+04 on 23 and 30328 DF,  p-value: < 2.2e-16

“Clustered SEs”

\(\text{Var}(\hat{\boldsymbol{\beta}}) = (\textbf{X}'\textbf{X})^{-1}\textbf{X}' \bf{\Sigma} \textbf{X} (\textbf{X}'\textbf{X})^{-1}\)

A generalization of robust standard errors

  • Assume \(\text{Var}(\boldsymbol{u})\) is “block-diagonal”: unrestricted within clusters, but zero across clusters
  • Estimate \(\hat{\mathbf{\Sigma}}_c = \hat{\boldsymbol{u}}_{c} \hat{\boldsymbol{u}}_{c}'\) for each cluster \(c\)
  • Calculate \(\textbf{X}' \hat{\mathbf{\Sigma}} \textbf{X} = \sum_{c=1}^C \textbf{X}_c' \hat{\mathbf{\Sigma}}_c \textbf{X}_c\)

Example clustered SEs

\(\textbf{X}' \hat{\mathbf{\Sigma}} \textbf{X} = \sum_{c=1}^C \textbf{X}_c' \hat{\mathbf{\Sigma}}_c \textbf{X}_c\)

# estimate model
m2 <- lm(vocabulary ~ education, data = vocab)

# clusters
groups <- unique(vocab$year)

# estimate cluster-specific X' Sigma X
XSX <- list()
X <- as.matrix(cbind(1, m2$model[,-1]))
for (i in 1:length(groups)){
  X_c <- X[vocab$year == groups[i], ]
  e <- as.matrix(m2$residuals[vocab$year == groups[i]])
  XSX[[i]] <- t(X_c) %*% (e %*% t(e)) %*% X_c
}

# calculate variance of beta
VB <- solve(t(X) %*% X) %*% Reduce("+", XSX) %*% solve(t(X) %*% X)

# HC1 adjustment
VB_h1 <- ((nrow(m2$model) - 1) / (nrow(m2$model) - length(coef(m2)))) * 
         (length(groups) / (length(groups) - 1)) * 
         VB

df <- data.frame(estimate = coef(m2),
                 se_default = sqrt(diag(vcov(m2))),
                 se_clust_h1 = sqrt(diag(VB_h1)),
                 se_clust_h1_sandwich = sqrt(diag(sandwich::vcovCL(m2, 
                                                                   cluster = ~ year, 
                                                                   type = "HC1")
                                                  )
                                             )
                 )
round(df, 3)
            estimate se_default se_clust_h1 se_clust_h1_sandwich
(Intercept)    1.678      0.047       0.104                0.104
education      0.332      0.003       0.008                0.008

Variations

As with HC estimators, clustered SEs have a few variations:

  • The ordinary version w/no further adjustments
  • Ordinary with finite-sample adjustment (similar to HC1; Stata default with cluster option, sandwich default): \(\left( \frac{N-1}{N-K} \right) \left( \frac{G}{G-1} \right)\)
  • “CR3”: similar to HC3 robust estimator (and is also conservative; can take a while to estimate)

Using sandwich

library(sandwich)

# cluster by year with HC1 adjustment (default)
m2_cov <- vcovCL(m2, cluster = ~ year, type = "HC1")

# cluster by year with HC3 adjustment (takes a bit to estimate)
#m2_cov <- vcovCL(m2, cluster = ~ year, type = "HC3")

# ses
cbind(lm = sqrt(diag(vcov(m2))), cluster = sqrt(diag(m2_cov)))
                     lm     cluster
(Intercept) 0.046802088 0.104448756
education   0.003496304 0.008183786

Leverage and influence

  • An influential observation is one which, when removed, changes the estimate substantially
  • High leverage points can substantially affect estimates because OLS minimizes squared errors

Hypothetical example

Studentized residuals

The residual for an observation based on a regression line estimated from all data except the observation

# use car package to test for outliers using studentized residuals
car::outlierTest(lm(data[,2] ~ data[,1]))
   rstudent unadjusted p-value Bonferroni p
25  6.37629         2.0434e-06   5.1086e-05

Measures of influence

  • Cook’s D: sum of squared differences in predicted values with and without the observation, standardized by the number of coefficients times the mean squared error of the regression

  • DFFITS: difference in predicted value for the focal observation, standardized by the standard error of the regression (estimated without the focal observation) multiplied by the square root of the focal observation’s leverage

  • Cut offs, as always, are contentious, but…

    • Cook’s D: > 1 or > 4 / N

    • |DFFITS|: > 2(sqrt(K / N))

Influence calculations

# model
m <- lm(data[,2] ~ data[,1])

# cook's distance
cook <- round(cooks.distance(m), 3)

# dffits
dfit <- abs(round(dffits(m), 3))

# table
N <- nrow(data)   # number of obs in the hypothetical data (not the earlier BEPS N)
inf_tab <- cbind(cook, cook > 4 / N, dfit, dfit > 2*sqrt(1 / N), 
                 round(hatvalues(m) / (2/nrow(data)), 3), round(rstudent(m),3)
)
colnames(inf_tab) <- c("Cook's D", "> 4/N ?", "|DFFITS|", "> 2*sqrt(K/N) ?",
                       "Leverage/equality",
                       "rstudent")
inf_tab
   Cook's D > 4/N ? |DFFITS| > 2*sqrt(K/N) ? Leverage/equality rstudent
1     0.007       0    0.120               0             0.545    0.563
2     0.000       0    0.022               0             1.336   -0.064
3     0.012       0    0.151               0             0.504   -0.735
4     0.006       0    0.105               0             0.804    0.401
5     0.011       0    0.144               0             0.782    0.556
6     0.027       0    0.232               0             0.719   -0.938
7     0.004       0    0.091               0             0.550    0.424
8     0.003       0    0.082               0             1.015   -0.275
9     0.917       1    1.564               1             2.748   -2.946
10    0.010       0    0.142               0             0.506   -0.690
11    0.000       0    0.005               0             0.813    0.018
12    0.002       0    0.054               0             0.576    0.247
13    0.006       0    0.106               0             0.580    0.479
14    0.038       0    0.276               0             0.797   -1.059
15    0.000       0    0.013               0             0.549   -0.062
16    0.003       0    0.078               0             0.701   -0.319
17    0.065       0    0.360               0             1.661    0.920
18    0.005       0    0.094               0             0.511   -0.455
19    0.067       0    0.365               0             1.784    0.895
20    0.002       0    0.061               0             0.504   -0.299
21    0.045       0    0.297               0             1.281    0.879
22    0.025       0    0.220               0             0.771   -0.859
23    0.006       0    0.111               0             0.501   -0.542
24    0.000       0    0.010               0             0.516    0.049
25    3.442       1    4.331               1             3.946    6.376

Dealing with outliers and influence

  • Omit high-leverage observations
    • Typically requires theoretical rationale
  • Change loss function
    • Least absolute deviation instead of least squares (e.g., median regression; sketched after this list)
  • Change distributional assumptions
    • t with low df
    • mixture models
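
A sketch of the loss-function alternative, using quantreg for least-absolute-deviation (median) regression on the hypothetical example data (the package choice is mine, not from the slides):

library(quantreg)

# LAD regression minimizes the sum of absolute residuals, so the outlier gets less pull
m_lad <- rq(data[, 2] ~ data[, 1], tau = 0.5)
cbind(OLS = coef(lm(data[, 2] ~ data[, 1])), LAD = coef(m_lad))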

Generalized (Weighted) least squares

Violation of OLS assumption 5 means that OLS is no longer the most efficient unbiased estimator

  • Robust SEs correct the SEs, but do not make OLS most efficient

  • An alternative estimator, generalized least squares (GLS), is more efficient

GLS definition

GLS minimizes the weighted sum of squared residuals, where the weights are \(1 / w_i\), \(w_i = w(\boldsymbol{x}_i)\), and \(w\) is a function that maps \(\boldsymbol{x}_i\) to \(\text{Var}(u_i)\)

  • Specifically, we find the values of \(\boldsymbol{\beta}\) that minimize:

\[ \sum_{i=1}^N \frac{(y_i - \boldsymbol{x}_i \boldsymbol{\beta})^2}{w_i} \]

GLS definition

In matrix form, where \(\mathbf{\Omega}\) is an \(N \times N\) weighting matrix, with the \(w_i\) on the main diagonal

\[ \hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}' \mathbf{\Omega}^{-1} \mathbf{X})^{-1} (\mathbf{X}' \mathbf{\Omega}^{-1} \boldsymbol{y}) \]
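
A minimal sketch of this formula with a diagonal \(\mathbf{\Omega}\) (the function name is mine); with \(\mathbf{\Omega} = \text{diag}(w_1, \dots, w_N)\) it is equivalent to lm() with weights = 1/w:

# GLS estimator with a diagonal weighting matrix (i.e., weighted least squares)
gls_fit <- function(X, y, w) {
  Omega_inv <- diag(1 / w)                 # Omega^{-1}
  solve(t(X) %*% Omega_inv %*% X,          # (X' Omega^{-1} X)^{-1} ...
        t(X) %*% Omega_inv %*% y)          # ... X' Omega^{-1} y
}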

Feasible GLS

We rarely know the \(w_i\) with certainty, so they must be estimated from the data - this results in feasible generalized least squares, where we assume:

\[ \text{Var}(u_i | \boldsymbol{x}_i) = \sigma^2 e^{\boldsymbol{x}_i \boldsymbol{\delta}} \]

  • This means that the error variance is an exponential (and hence non-linear) function of the predictors in the original regression model

Feasible GLS

If we don’t know \(\boldsymbol{\delta}\) (which is almost always the case), we can estimate it using OLS and the observed residuals:

\[ \text{ln}(\hat{u}_i^2) = \boldsymbol{x}_i \boldsymbol{\delta} + \epsilon_i \]

  • log because we’re estimating \(e^{\boldsymbol{x}_i \boldsymbol{\delta}}\)
  • squared because \(\text{Var}(u_i | \boldsymbol{x}_i) = \text{E}(u_i^2 | x_i) - \left( \text{E}(u_i|x_i) \right)^2 = \text{E}(u_i^2 | x_i)\)

Then, \(\hat{w}_i = e^{\boldsymbol{x}_i \boldsymbol{\hat{\delta}}}\)

Example

# regress logged squared residuals on predictors
m1_res_3 <- lm(log(m1$residuals^2) ~ m1$model[,2] + m1$model[,3] 
               + m1$model[,4] + m1$model[,5])

# estimate weighted least squares with 1/exp(y-hat) as the weights
m1_wls_1 <- lm(m1$model[,1] ~ m1$model[,2] + m1$model[,3] 
               + m1$model[,4] + m1$model[,5], 
             weights = 1 / exp(m1_res_3$fitted.values))
summary(m1_wls_1)

Call:
lm(formula = m1$model[, 1] ~ m1$model[, 2] + m1$model[, 3] + 
    m1$model[, 4] + m1$model[, 5], weights = 1/exp(m1_res_3$fitted.values))

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-4.3564 -1.2733 -0.0651  1.4120  3.5118 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.43226    0.45724  18.442  < 2e-16 ***
m1$model[, 2]  0.15029    0.05345   2.812  0.00499 ** 
m1$model[, 3]  0.36639    0.16140   2.270  0.02334 *  
m1$model[, 4] -0.71728    0.09714  -7.384 2.52e-13 ***
m1$model[, 5] -0.12176    0.08925  -1.364  0.17269    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.591 on 1520 degrees of freedom
Multiple R-squared:  0.05429,   Adjusted R-squared:  0.0518 
F-statistic: 21.81 on 4 and 1520 DF,  p-value: < 2.2e-16

Example

You can also use the White version (instead of B-P) via the fitted values and their squares:

# regress logged squared residuals on predictors
m1_res_4 <- lm(log(m1$residuals^2) ~ m1$fitted.values + I(m1$fitted.values^2))

# estimate weighted least squares with 1/exp(y-hat) as the weights
m1_wls_2 <- lm(m1$model[,1] ~ m1$model[,2] + m1$model[,3] + m1$model[,4] + m1$model[,5], 
             weights = 1 / exp(m1_res_4$fitted.values))
summary(m1_wls_2)

Call:
lm(formula = m1$model[, 1] ~ m1$model[, 2] + m1$model[, 3] + 
    m1$model[, 4] + m1$model[, 5], weights = 1/exp(m1_res_4$fitted.values))

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-5.2366 -1.2856 -0.0695  1.5048  3.7130 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.51021    0.47234  18.017  < 2e-16 ***
m1$model[, 2]  0.13975    0.05217   2.679  0.00746 ** 
m1$model[, 3]  0.42078    0.16525   2.546  0.01098 *  
m1$model[, 4] -0.71491    0.09670  -7.393 2.36e-13 ***
m1$model[, 5] -0.13851    0.09415  -1.471  0.14146    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.598 on 1520 degrees of freedom
Multiple R-squared:  0.05995,   Adjusted R-squared:  0.05747 
F-statistic: 24.23 on 4 and 1520 DF,  p-value: < 2.2e-16

Properties of FGLS

  1. Biased but consistent, though only under the more stringent assumption that \(\text{E}(u_i | \boldsymbol{x}_i) = 0\)
  2. Asymptotically more efficient than OLS
  3. Test statistics (e.g., F tests) require using the same weights for the restricted and unrestricted models (sketched below)
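
A sketch of point 3, comparing restricted and unrestricted WLS fits while holding the weights fixed at those estimated above (the particular restriction is just an illustration):

# same weights for both models
w <- 1 / exp(m1_res_3$fitted.values)

# unrestricted: all four predictors; restricted: drop the two economic-conditions measures
unrestricted <- lm(m1$model[, 1] ~ m1$model[, 2] + m1$model[, 3] +
                     m1$model[, 4] + m1$model[, 5], weights = w)
restricted   <- lm(m1$model[, 1] ~ m1$model[, 2] + m1$model[, 3], weights = w)

anova(restricted, unrestricted)   # F test using a common set of weights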

Misspecification

If the model for the variance is misspecified:

  1. SEs from WLS are incorrect, but this can be solved with robust SEs, same as for standard OLS (sketched below)
  2. WLS is not guaranteed to be asymptotically more efficient than OLS, but may often be in practice
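
For point 1, a sketch of pairing the WLS fit with robust SEs (HC3 chosen arbitrarily):

library(lmtest)
library(sandwich)

# robust SEs on top of the weighted fit, in case the variance model is misspecified
coeftest(m1_wls_1, vcov = vcovHC(m1_wls_1, type = "HC3"))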