OLS Assumptions & Properties

POLSCI 630: Probability and Basic Regression

January 21, 2025

Desirable properties

Our estimator, \(\hat{\beta}\), is a random variable: it produces a different estimate each time it is “run”

We would like it to have the following properties

  • Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • Minimum variance / Efficiency: \(\text{E} \left[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2 \right]\) is the smallest possible

Bias and variance, visually

Distribution of hypothetical estimator (\(\hat{\boldsymbol{\beta}}\)) across repeated sampling:
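
The figure itself is not reproduced here. As a rough sketch of the idea (all numbers below are hypothetical, not from the slides), we can simulate the sampling distributions of an unbiased estimator and a biased but lower-variance estimator of a population mean:

# sketch: sampling distributions of two hypothetical estimators of a population mean of 5
set.seed(123)                                                         # hypothetical seed
unbiased <- replicate(5000, mean(rnorm(50, mean = 5, sd = 2)))        # sample mean
biased   <- replicate(5000, 0.9 * mean(rnorm(50, mean = 5, sd = 2)))  # shrunken mean: biased, lower variance
c(mean(unbiased), var(unbiased))
c(mean(biased), var(biased))  # centered away from 5, but with a tighter spread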

Gauss-Markov theorem

Given some necessary assumptions, OLS has the smallest variance among linear, unbiased estimators; i.e., it is the:

  • Best
  • Linear
  • Unbiased
  • Estimator

OLS is only the “best” of “unbiased” estimators!

Sometimes (but not right now) we might prefer a biased estimator with less variance.

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  1. Linear in the parameters

  2. Independent, identically distributed observations from random sample of relevant population

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  3. No perfect collinearity

Stated plainly, every IV contributes unique information to the system

  • No constants (every variable actually varies, except intercept)
  • No IVs perfect linear combinations of each other
    • If you regress each IV on all the others, no \(R^2 = 1\)

Correlations between independent variables are fine, as long as they aren’t perfect

Gauss-Markov assumptions

More generally:

  3. \(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) has full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(1000),
                 x2 = rnorm(1000),
                 x3 = rnorm(1000),
                 x4 = rnorm(1000),
                 x5 = rnorm(1000))

df$x6 <- with(df, x1 + 2*x2 - x3) # introduce perfect collinearity

X <- as.matrix(df)

solve(t(X) %*% X) # can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 7.84165e-17

x6 is a linear combination of x1, x2, and x3.
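
A quick check of the earlier bullet (“if you regress each IV on all the others, no \(R^2 = 1\)”), reusing the df simulated in the block above; this is a sketch, not part of the original slides:

# regress x6 on the other IVs: R^2 is (numerically) 1, flagging perfect collinearity
summary(lm(x6 ~ x1 + x2 + x3 + x4 + x5, data = df))$r.squared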

Gauss-Markov assumptions

More generally:

  3. \(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) has full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(5),
                 x2 = rnorm(5),
                 x3 = rnorm(5),
                 x4 = rnorm(5),
                 x5 = rnorm(5),
                 x6 = rnorm(5))

X <- as.matrix(df); dim(X) # more columns than rows
[1] 5 6
solve(t(X) %*% X) # also can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 9.47409e-18

More variables than observations means an infinite number of solutions (e.g., infinitely many lines pass through a single point)

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  4. \(\text{E}(\boldsymbol{u}|\mathbf{X}) = \text{E}(u|x_1,x_2,...,x_K) = 0\) (exogeneity)

The expected value of the error is 0 for any realization of \(\mathbf{X}\): no correlation between functions of the IVs and the error term

  • Nothing that’s correlated with both DV and at least one IV
  • Nothing that’s correlated with both DV and some linear or non-linear function of IVs
  • No measurement error in IVs
  • No “reverse causality” (DV causing IVs)
  • We will spend a week unpacking these issues

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  5. \(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors: homoskedasticity and no autocorrelation)


  • What does the right-hand side look like? (see the sketch after this list)
  • Given heteroskedastic errors, there are other linear unbiased estimators with lower variance
  • We will spend a week unpacking this issue
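
A sketch of the right-hand side: an \(N \times N\) matrix with a constant error variance on the diagonal and zeros (no error covariance across observations) everywhere else.

\[\sigma^2\mathbf{I} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}\]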

Gauss-Markov assumptions

Note: These assumptions do not need to be met in order to estimate an OLS model.

These assumptions need to be met for OLS to be BLUE.

It’s your job to theoretically and empirically justify that these assumptions are being met!

Prove OLS estimator is unbiased

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\]

Prove:

\[\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\]

Proof

\(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}]\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

Proof

\(\hat{\boldsymbol{\beta}} = \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \text{E}(\mathbf{I}\boldsymbol{\beta} | \mathbf{X}) + \text{E} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} | \mathbf{X} \right)\)

\(= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)

Assume \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\):

\(= \boldsymbol{\beta} + 0 = \boldsymbol{\beta}\)

The estimator \(\hat{\boldsymbol{\beta}}\) is unbiased for \(\boldsymbol{\beta}\) for any realization of \(\mathbf{X}\) (assuming random sampling)
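
A minimal simulation sketch of this result (the seed, sample size, and true coefficients below are hypothetical): averaging \(\hat{\boldsymbol{\beta}}\) over many random samples from a model with exogenous errors recovers \(\boldsymbol{\beta}\).

# Monte Carlo check of unbiasedness under exogeneity
set.seed(630)                    # hypothetical seed
B_true <- c(1, 2)                # hypothetical intercept and slope
est <- replicate(2000, {
  x <- rnorm(200)
  y <- B_true[1] + B_true[2] * x + rnorm(200)  # error independent of x
  coef(lm(y ~ x))
})
rowMeans(est)                    # approximately (1, 2)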

Assumption to prove unbiased

\(\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)

If \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\) cannot be assumed, then in general \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X}) \neq 0\) ...

and thus \(\text{E}(\hat{\boldsymbol{\beta}}) \neq \boldsymbol{\beta}\)

  • The exogeneity / zero conditional mean assumption is thus necessary for OLS to be an unbiased estimator

Confounding example

# set seed
set.seed(837)

# true correlation between x1 and x2, sample 1000 from pop
X <- cbind(rep(1,1000), 
           MASS::mvrnorm(1000, 
                     mu = c(0,0), 
                     Sigma = matrix(c(1,0.5,0.5,1), 2, 2)
                     )
           )

# true betas
B <- c(1, 1, 1)

# true error variance
sigma2 <- 1

# true model generates sampled y from pop
y <- X %*% B + rnorm(1000, 0, sigma2)

Confounding example

# estimate B using sample
solve(t(X) %*% X) %*% t(X) %*% y
          [,1]
[1,] 1.0208625
[2,] 0.9996043
[3,] 1.0394309

But what if we estimate without \(x_2\)?

# what if we estimate without x2?
solve(t(X[, -3]) %*% X[, -3]) %*% t(X[, -3]) %*% y
         [,1]
[1,] 1.019398
[2,] 1.554281
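
The slope on \(x_1\) is now biased upward, toward 1.5 rather than 1. With \(x_2\) omitted, the standard omitted variable bias result gives

\[\text{E}(\hat{\beta}_1) = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} = 1 + 1 \cdot \frac{0.5}{1} = 1.5,\]

because part of \(x_2\)’s effect is picked up by the correlated \(x_1\).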

Variance of \(\hat{\boldsymbol{\beta}}\)

We would like it to have the following properties

  • Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • Minimum variance / Efficiency: \(\text{E} \left[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2 \right]\) is the smallest possible

But what is the variance of \(\hat{\boldsymbol{\beta}}\)? How can we derive it in general terms?

Derive variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\)

\(\hat{\boldsymbol{\beta}} = \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \text{Var} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} \mid \mathbf{X} \right)\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u} | \mathbf{X}) \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \right]'\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u} | \mathbf{X}) \, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

assume \(\text{E}(\boldsymbol{u}\boldsymbol{u}'|\mathbf{X}) = \text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\):

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

Assumption

  5. \(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors, homoskedasticity)

This implies:

  • Homoskedasticity (constant error variance across values of the IVs; non-constant error variance is called heteroskedasticity)

  • No autocorrelation (no correlation of residuals among observations, e.g., adjacent units in space or time)

What we get, in return, is (1) a simple form for \(\text{Var}(\hat{\boldsymbol{\beta}})\) and (2) the property that OLS is efficient

Covariance matrix of \(\hat{\boldsymbol{\beta}}\)

\[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

  • A \((K+1) \times (K+1)\), symmetric matrix

  • Main diagonal gives variances

  • Off-diagonal elements give covariances

  • What does this look like?
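
Concretely, it looks like this (a sketch):

\[\sigma^2(\mathbf{X}'\mathbf{X})^{-1} =
\begin{bmatrix}
\text{Var}(\hat{\beta}_0) & \text{Cov}(\hat{\beta}_0,\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_0,\hat{\beta}_K) \\
\text{Cov}(\hat{\beta}_1,\hat{\beta}_0) & \text{Var}(\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_1,\hat{\beta}_K) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(\hat{\beta}_K,\hat{\beta}_0) & \text{Cov}(\hat{\beta}_K,\hat{\beta}_1) & \cdots & \text{Var}(\hat{\beta}_K)
\end{bmatrix}\]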

We typically do not know \(\sigma^2\) and use \(\frac{\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}}{N - K - 1}\) (MSE) as an estimate

  • Next week!

Getting some intuition

See also this tutorial on the variance matrix for \(\hat{\boldsymbol{\beta}}\)

Two issues

“Multicollinearity”

  • Thought of as a problem when the correlation between two predictors is very high (e.g., > 0.9)
  • How should we think about this?

Including irrelevant variables in the model

  • No bias
  • Less efficient because of non-zero correlation with other predictors
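
A small simulation sketch of these points (the seed, sample size, and function name below are hypothetical, not from the slides): as the correlation between two predictors grows, \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) yields larger standard errors, even though the coefficient estimates remain unbiased.

# SE of the first slope when the two predictors are correlated at rho
se_slope <- function(rho, n = 1000) {
  X <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
  y <- 1 + X[, 1] + X[, 2] + rnorm(n)
  sqrt(diag(vcov(lm(y ~ X))))[2]   # SE of the coefficient on the first predictor
}
set.seed(630)                      # hypothetical seed
round(c(se_slope(0), se_slope(0.5), se_slope(0.9)), 3)  # SEs grow with rho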

Example SE calculation using trees

Let’s use the same example from last week:

# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

Calculate covariance matrix for \(\boldsymbol{\beta}\)

# calculate residuals
resid <- as.numeric(y - X %*% beta)

# calculate degrees of freedom for estimating sigma^2
resid_df <- nrow(X) - length(beta)

# estimate error variance (MSE)
s2 <- sum(resid^2) / resid_df

# calculate covariance matrix for beta
beta_cov <- s2 * (solve(t(X) %*% X))
beta_cov
           Intercept       Girth      Height
Intercept 74.6189461  0.43217138 -1.05076889
Girth      0.4321714  0.06983578 -0.01786030
Height    -1.0507689 -0.01786030  0.01693933

What does it mean that two coefficients are correlated?

Example

## simulate from the same "population" 100 times

# number of sims
sims <- 100

# store each B1 and B2
betas <- array(NA, c(sims, 2))

# loop over sims
for (i in 1:sims){
  
  # true correlation between x1 and x2, sample 1000 from pop
  X <- cbind(rep(1,1000), 
             MASS::mvrnorm(1000, 
                           mu = c(0,0), 
                           Sigma = matrix(c(1,0.5,0.5,1), 2, 2)
             )
  )
  
  # true betas
  B <- c(1, 1, 1)
  
  # true error variance
  sigma2 <- 1
  
  # true model generates sampled y from pop
  y <- X %*% B + rnorm(1000, 0, sigma2)
  
  # store betas
  betas[i, ] <- (solve(t(X)%*%X) %*% t(X)%*%y)[2:3]
}

# calculate covariance matrix for beta using the last sim (true B used in place of beta-hat for the residuals)
resid <- y - X%*%B
as.numeric(sum(resid^2)/(1000-3))*solve(t(X)%*%X)
              [,1]          [,2]          [,3]
[1,]  1.120415e-03 -1.759557e-05  8.970814e-05
[2,] -1.759557e-05  1.498524e-03 -6.971701e-04
[3,]  8.970814e-05 -6.971701e-04  1.407637e-03

Visualization
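
The plot itself is not reproduced here; a sketch of what it shows, using the betas stored in the loop above: the simulated coefficient pairs scatter around (1, 1) with a negative tilt, matching the negative off-diagonal covariance.

# scatter the 100 simulated (B1, B2) estimates
plot(betas[, 1], betas[, 2],
     xlab = expression(hat(beta)[1]), ylab = expression(hat(beta)[2]),
     main = "Coefficient estimates across repeated samples")
cor(betas[, 1], betas[, 2])  # negative correlation between the two estimates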

Calculate standard errors

Main diagonal gives the estimated variance of each coefficient, so the square roots are the SEs:

# calculate standard errors
se <- sqrt(diag(beta_cov))

# print betas and SEs
round(cbind(beta, se), 3)
                     se
Intercept -57.988 8.638
Girth       4.708 0.264
Height      0.339 0.130

Compare to lm:

m1 <- lm(Volume ~ Girth + Height, data = trees)
round(cbind(coef(m1), sqrt(diag(vcov(m1)))), 3)
               [,1]  [,2]
(Intercept) -57.988 8.638
Girth         4.708 0.264
Height        0.339 0.130

summary(m1)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16