OLS Assumptions & Properties

POLSCI 630: Probability and Basic Regression

January 21, 2025

Desirable properties

Our estimator, \(\hat{\beta}\), is a random variable: it produces a different estimate each time it is “run”

We would like it to have the following properties

  • Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • Minimum variance / Efficiency: \(\text{E} \left[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2 \right]\) is the smallest possible

Bias and variance, visually

Distribution of hypothetical estimator (\(\hat{\boldsymbol{\beta}}\)) across repeated sampling:
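
The figure itself is not reproduced here. As a rough sketch of the idea (all numbers below are hypothetical, not from the slides), we can simulate the sampling distributions of an unbiased estimator and a biased but lower-variance estimator of a population mean:

# sketch: sampling distributions of two hypothetical estimators of a population mean of 5
set.seed(123)                                                         # hypothetical seed
unbiased <- replicate(5000, mean(rnorm(50, mean = 5, sd = 2)))        # sample mean
biased   <- replicate(5000, 0.9 * mean(rnorm(50, mean = 5, sd = 2)))  # shrunken mean: biased, lower variance
c(mean(unbiased), var(unbiased))
c(mean(biased), var(biased))  # centered away from 5, but with a tighter spread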

Gauss-Markov theorem

Given some necessary assumptions, OLS has the smallest variance among linear, unbiased estimators; i.e., it is the:

  • Best
  • Linear
  • Unbiased
  • Estimator

OLS is only the “best” of “unbiased” estimators!

Sometimes (but not right now) we might prefer a biased estimator with less variance.

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  1. Linear in the parameters

  2. Independent, identically distributed observations from random sample of relevant population

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  3. No perfect collinearity

Stated plainly, every IV contributes unique information to the system

  • No constants (every variable actually varies, except intercept)
  • No IVs perfect linear combinations of each other
    • If you regress each IV on all the others, no \(R^2 = 1\)

Correlations between independent variables are fine, as long as they aren’t perfect

Gauss-Markov assumptions

More generally:

  3. \(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) has full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(1000),
                 x2 = rnorm(1000),
                 x3 = rnorm(1000),
                 x4 = rnorm(1000),
                 x5 = rnorm(1000))

df$x6 <- with(df, x1 + 2*x2 - x3) # introduce perfect collinearity

X <- as.matrix(df)

solve(t(X) %*% X) # can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 7.84165e-17

x6 is a linear combination of x1, x2, and x3.
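
A quick check of the earlier bullet (“if you regress each IV on all the others, no \(R^2 = 1\)”), reusing the df simulated in the block above; this is a sketch, not part of the original slides:

# regress x6 on the other IVs: R^2 is (numerically) 1, flagging perfect collinearity
summary(lm(x6 ~ x1 + x2 + x3 + x4 + x5, data = df))$r.squared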

Gauss-Markov assumptions

More generally:

  3. \(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) has full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(5),
                 x2 = rnorm(5),
                 x3 = rnorm(5),
                 x4 = rnorm(5),
                 x5 = rnorm(5),
                 x6 = rnorm(5))

X <- as.matrix(df); dim(X) # more columns than rows
[1] 5 6
solve(t(X) %*% X) # also can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 9.47409e-18

More variables than observations means an infinite number of solutions (e.g., infinitely many lines pass through a single point)

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  4. \(\text{E}(\boldsymbol{u}|\mathbf{X}) = \text{E}(u|x_1,x_2,...,x_K) = 0\) (exogeneity)

The expected value of the error is 0 for any realization of \(\mathbf{X}\): no correlation between functions of the IVs and the error term

  • Nothing that’s correlated with both DV and at least one IV
  • Nothing that’s correlated with both DV and some linear or non-linear function of IVs
  • No measurement error in IVs
  • No “reverse causality” (DV causing IVs)
  • We will spend a week unpacking these issues

Gauss-Markov assumptions

Assumptions necessary in order for OLS to be BLUE:

  5. \(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors: homoskedasticity and no autocorrelation)


  • What does the right-hand side look like? (see the sketch after this list)
  • Given heteroskedastic errors, there are other linear unbiased estimators with lower variance
  • We will spend a week unpacking this issue
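
A sketch of the right-hand side: an \(N \times N\) matrix with a constant error variance on the diagonal and zeros (no error covariance across observations) everywhere else.

\[\sigma^2\mathbf{I} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}\]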

Gauss-Markov assumptions

Note: These assumptions do not need to be met in order to estimate an OLS model.

These assumptions need to be met for OLS to be BLUE.

It’s your job to theoretically and empirically justify that these assumptions are being met!

Prove OLS estimator is unbiased

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\]

Prove:

\[\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\]

Proof

\(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}]\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

Proof

\(\hat{\boldsymbol{\beta}} = \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \text{E}(\mathbf{I}\boldsymbol{\beta} | \mathbf{X}) + \text{E} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} | \mathbf{X} \right)\)

\(= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)

Assume \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\):

\(= \boldsymbol{\beta} + 0 = \boldsymbol{\beta}\)

The estimator \(\hat{\boldsymbol{\beta}}\) is unbiased for \(\boldsymbol{\beta}\) for any realization of \(\mathbf{X}\) (assuming random sampling)
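
A minimal simulation sketch of this result (the seed, sample size, and true coefficients below are hypothetical): averaging \(\hat{\boldsymbol{\beta}}\) over many random samples from a model with exogenous errors recovers \(\boldsymbol{\beta}\).

# Monte Carlo check of unbiasedness under exogeneity
set.seed(630)                    # hypothetical seed
B_true <- c(1, 2)                # hypothetical intercept and slope
est <- replicate(2000, {
  x <- rnorm(200)
  y <- B_true[1] + B_true[2] * x + rnorm(200)  # error independent of x
  coef(lm(y ~ x))
})
rowMeans(est)                    # approximately (1, 2)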

Assumption to prove unbiased

\(\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)

If \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\) cannot be assumed, then in general \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X}) \neq 0\) ...

and thus \(\text{E}(\hat{\boldsymbol{\beta}}) \neq \boldsymbol{\beta}\)

  • The exogeneity / zero conditional mean assumption is thus necessary for OLS to be an unbiased estimator

Confounding example

# set seed
set.seed(837)

# true correlation between x1 and x2, sample 1000 from pop
X <- cbind(rep(1,1000), 
           MASS::mvrnorm(1000, 
                     mu = c(0,0), 
                     Sigma = matrix(c(1,0.5,0.5,1), 2, 2)
                     )
           )

# true betas
B <- c(1, 1, 1)

# true error variance
sigma2 <- 1

# true model generates sampled y from pop
y <- X %*% B + rnorm(1000, 0, sigma2)

Confounding example

# estimate B using sample
solve(t(X) %*% X) %*% t(X) %*% y
          [,1]
[1,] 1.0208625
[2,] 0.9996043
[3,] 1.0394309

But what if we estimate without \(x_2\)?

# what if we estimate without x2?
solve(t(X[, -3]) %*% X[, -3]) %*% t(X[, -3]) %*% y
         [,1]
[1,] 1.019398
[2,] 1.554281
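
The slope on \(x_1\) is now biased upward, toward 1.5 rather than 1. With \(x_2\) omitted, the standard omitted variable bias result gives

\[\text{E}(\hat{\beta}_1) = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} = 1 + 1 \cdot \frac{0.5}{1} = 1.5,\]

because part of \(x_2\)’s effect is picked up by the correlated \(x_1\).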

Variance of \(\hat{\boldsymbol{\beta}}\)

We would like it to have the following properties

  • Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • Minimum variance / Efficiency: \(\text{E} \left[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2 \right]\) is the smallest possible

But what is the variance of \(\hat{\boldsymbol{\beta}}\)? How can we derive it in general terms?

Derive variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\)

\(\hat{\boldsymbol{\beta}} = \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)

\(\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \text{Var} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} \mid \mathbf{X} \right)\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u} | \mathbf{X}) \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \right]'\)

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u} | \mathbf{X}) \, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

assume \(\text{E}(\boldsymbol{u}\boldsymbol{u}'|\mathbf{X}) = \text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\):

\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\)

\(= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

Assumption

  5. \(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors, homoskedasticity)

This implies:

  • Homoskedasticity (constant error variance across values of the IVs; non-constant error variance is called heteroskedasticity)

  • No autocorrelation (no correlation of residuals among observations, e.g., adjacent units in space or time)

What we get, in return, is (1) a simple form for \(\text{Var}(\hat{\boldsymbol{\beta}})\) and (2) the property that OLS is efficient

Covariance matrix of \(\hat{\boldsymbol{\beta}}\)

\[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

  • A \((K+1) \times (K+1)\), symmetric matrix

  • Main diagonal gives variances

  • Off-diagonal elements give covariances

  • What does this look like?
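
Concretely, it looks like this (a sketch):

\[\sigma^2(\mathbf{X}'\mathbf{X})^{-1} =
\begin{bmatrix}
\text{Var}(\hat{\beta}_0) & \text{Cov}(\hat{\beta}_0,\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_0,\hat{\beta}_K) \\
\text{Cov}(\hat{\beta}_1,\hat{\beta}_0) & \text{Var}(\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_1,\hat{\beta}_K) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(\hat{\beta}_K,\hat{\beta}_0) & \text{Cov}(\hat{\beta}_K,\hat{\beta}_1) & \cdots & \text{Var}(\hat{\beta}_K)
\end{bmatrix}\]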

We typically do not know \(\sigma^2\) and use \(\frac{\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}}{N - K - 1}\) (MSE) as an estimate

  • Next week!

Getting some intuition

See also this tutorial on the variance matrix for \(\hat{\boldsymbol{\beta}}\)

Two issues

“Multicollinearity”

  • Thought of as a problem when the correlation between two predictors is very high (e.g., > 0.9)
  • How should we think about this?

Including irrelevant variables in the model

  • No bias
  • Less efficient because of non-zero correlation with other predictors
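
A small simulation sketch of these points (the seed, sample size, and function name below are hypothetical, not from the slides): as the correlation between two predictors grows, \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) yields larger standard errors, even though the coefficient estimates remain unbiased.

# SE of the first slope when the two predictors are correlated at rho
se_slope <- function(rho, n = 1000) {
  X <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
  y <- 1 + X[, 1] + X[, 2] + rnorm(n)
  sqrt(diag(vcov(lm(y ~ X))))[2]   # SE of the coefficient on the first predictor
}
set.seed(630)                      # hypothetical seed
round(c(se_slope(0), se_slope(0.5), se_slope(0.9)), 3)  # SEs grow with rho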

Example SE calculation using trees

Let’s use the same example from last week:

# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

Calculate covariance matrix for \(\boldsymbol{\beta}\)

# calculate residuals
resid <- as.numeric(y - X %*% beta)

# calculate degrees of freedom for estimating sigma^2
resid_df <- nrow(X) - length(beta)

# estimate error variance (MSE)
s2 <- sum(resid^2) / resid_df

# calculate covariance matrix for beta
beta_cov <- s2 * (solve(t(X) %*% X))
beta_cov
           Intercept       Girth      Height
Intercept 74.6189461  0.43217138 -1.05076889
Girth      0.4321714  0.06983578 -0.01786030
Height    -1.0507689 -0.01786030  0.01693933

What does it mean that two coefficients are correlated?

Example

## simulate from the same "population" 100 times

# number of sims
sims <- 100

# store each B1 and B2
betas <- array(NA, c(sims, 2))

# loop over sims
for (i in 1:sims){
  
  # true correlation between x1 and x2, sample 1000 from pop
  X <- cbind(rep(1,1000), 
             MASS::mvrnorm(1000, 
                           mu = c(0,0), 
                           Sigma = matrix(c(1,0.5,0.5,1), 2, 2)
             )
  )
  
  # true betas
  B <- c(1, 1, 1)
  
  # true error variance
  sigma2 <- 1
  
  # true model generates sampled y from pop
  y <- X %*% B + rnorm(1000, 0, sigma2)
  
  # store betas
  betas[i, ] <- (solve(t(X)%*%X) %*% t(X)%*%y)[2:3]
}

# calculate covariance matrix for beta using the last sim (true B used in place of beta-hat for the residuals)
resid <- y - X%*%B
as.numeric(sum(resid^2)/(1000-3))*solve(t(X)%*%X)
              [,1]          [,2]          [,3]
[1,]  1.120415e-03 -1.759557e-05  8.970814e-05
[2,] -1.759557e-05  1.498524e-03 -6.971701e-04
[3,]  8.970814e-05 -6.971701e-04  1.407637e-03

Visualization
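
The plot itself is not reproduced here; a sketch of what it shows, using the betas stored in the loop above: the simulated coefficient pairs scatter around (1, 1) with a negative tilt, matching the negative off-diagonal covariance.

# scatter the 100 simulated (B1, B2) estimates
plot(betas[, 1], betas[, 2],
     xlab = expression(hat(beta)[1]), ylab = expression(hat(beta)[2]),
     main = "Coefficient estimates across repeated samples")
cor(betas[, 1], betas[, 2])  # negative correlation between the two estimates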

Calculate standard errors

Main diagonal gives the estimated variance of each coefficient, so the square roots are the SEs:

# calculate standard errors
se <- sqrt(diag(beta_cov))

# print betas and SEs
round(cbind(beta, se), 3)
                     se
Intercept -57.988 8.638
Girth       4.708 0.264
Height      0.339 0.130

Compare to lm:

m1 <- lm(Volume ~ Girth + Height, data = trees)
round(cbind(coef(m1), sqrt(diag(vcov(m1)))), 3)
               [,1]  [,2]
(Intercept) -57.988 8.638
Girth         4.708 0.264
Height        0.339 0.130

summary(m1)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16