OLS Assumptions & Properties
POLSCI 630: Probability and Basic Regression
January 21, 2025
Desirable properties
Our estimator, \(\hat{\boldsymbol{\beta}}\), is a random variable: it produces a different estimate each time it is “run” on a new sample
We would like it to have the following properties
Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)
Minimum variance / Efficiency: \(\text{E}\left[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2\right]\) is the smallest possible
Bias and variance, visually
Distribution of a hypothetical estimator (\(\hat{\boldsymbol{\beta}}\)) across repeated sampling:
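A minimal simulation sketch of what such a picture shows (the data-generating process, sample size, and number of repetitions are made-up values of my own choosing):

# sketch: sampling distribution of the OLS slope across repeated samples
set.seed(630)                      # arbitrary seed for reproducibility
true_beta <- 1                     # true slope
est <- replicate(2000, {           # 2000 repeated samples
  x <- rnorm(100)                  # one sample of 100 observations
  y <- true_beta * x + rnorm(100)  # DGP with standard normal errors
  coef(lm(y ~ x))[2]               # OLS slope estimate
})
hist(est, breaks = 40, main = "Sampling distribution of the OLS slope")
abline(v = true_beta, lwd = 2)     # unbiased: distribution centered on the truth
abline(v = mean(est), lty = 2)     # mean of the estimates (nearly identical)

The spread of this histogram is the estimator's variance; an efficient estimator makes it as narrow as possible.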
Gauss-Markov theorem
Given certain assumptions, OLS has the smallest variance among linear, unbiased estimators; i.e., it is the:
Best
Linear
Unbiased
Estimator
OLS is only the “best” of “unbiased” estimators!
Sometimes (but not right now) we might prefer a biased estimator with less variance.
Gauss-Markov assumptions
Assumptions necessary in order for OLS to be BLUE:
Linear in the parameters
Independent, identically distributed observations from a random sample of the relevant population
Gauss-Markov assumptions
Assumptions necessary in order for OLS to be BLUE:
No perfect collinearity
Stated plainly, every IV contributes unique information to the system
No constants (every variable actually varies, except the intercept)
No IV is a perfect linear combination of the others
If you regress each IV on all of the others, no \(R^2 = 1\)
Correlations between independent variables OK, as long as they aren’t perfect
Gauss-Markov assumptions
More generally:
\(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) is full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(1000),
                 x2 = rnorm(1000),
                 x3 = rnorm(1000),
                 x4 = rnorm(1000),
                 x5 = rnorm(1000))
df$x6 <- with(df, x1 + 2 * x2 - x3) # introduce perfect collinearity
X <- as.matrix(df)
solve(t(X) %*% X) # can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 7.84165e-17
x6 is a linear combination of x1, x2, and x3.
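A sketch of how one might diagnose this rank deficiency in the X defined above (the throwaway outcome variable below is my own addition, purely for illustration):

# rank of X falls one short of its number of columns
qr(X)$rank                          # 5 (one short of ncol(X) = 6)
# lm() drops an aliased column; alias() reports the exact dependency
df$fake_y <- rnorm(1000)            # throwaway outcome, just for illustration
m <- lm(fake_y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df)
alias(m)                            # shows x6 = x1 + 2*x2 - x3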
Gauss-Markov assumptions
More generally:
\(\text{rank}(\mathbf{X})=K+1\) (\(\mathbf{X}\) is full column rank, so \(\mathbf{X}'\mathbf{X}\) is invertible)
df <- data.frame(x1 = rnorm(5),
                 x2 = rnorm(5),
                 x3 = rnorm(5),
                 x4 = rnorm(5),
                 x5 = rnorm(5),
                 x6 = rnorm(5))
X <- as.matrix(df); dim(X) # more columns than rows
solve(t(X) %*% X) # also can't invert t(X) %*% X
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 9.47409e-18
More variables than observations means an infinite number of solutions (e.g., infinitely many lines pass through a single point)
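A related sketch: with only 5 rows but 6 predictors (plus the intercept), lm() can estimate at most as many coefficients as there are observations and reports the rest as NA (the throwaway outcome is again my own addition):

df$fake_y <- rnorm(5)               # throwaway outcome, just for illustration
coef(lm(fake_y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df))
# the coefficients lm() cannot pin down come back as NA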
Gauss-Markov assumptions
Assumptions necessary in order for OLS to be BLUE:
\(\text{E}(\boldsymbol{u}|\mathbf{X}) = \text{E}(u|x_1,x_2,...,x_K) = 0\) (exogeneity)
The expected value of the error is 0 for any realization of \(\mathbf{X}\); no correlation between any function of the IVs and the error term
No omitted factor that is correlated with both the DV and at least one IV
No omitted factor that is correlated with both the DV and some linear or non-linear function of the IVs
No measurement error in the IVs (see the sketch after this list)
No “reverse causality” (the DV causing the IVs)
We will spend a week unpacking these issues
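The sketch referenced in the measurement-error bullet: a minimal simulation with made-up values showing how noise in an IV pulls the OLS slope toward zero.

# sketch: measurement error in an IV attenuates the slope
set.seed(630)                       # arbitrary seed
x_true <- rnorm(1000)               # true regressor
x_obs  <- x_true + rnorm(1000)      # observed with noise (error variance 1)
y      <- 1 * x_true + rnorm(1000)  # true slope is 1
coef(lm(y ~ x_obs))[2]              # roughly 0.5: attenuation bias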
Gauss-Markov assumptions
Assumptions necessary in order for OLS to be BLUE:
\(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors, AKA homoskedasticity)
What does the right-hand side look like? (See the sketch after this list.)
Given heteroskedastic errors, there are other linear unbiased estimators with lower variance
We will spend a week unpacking this issue
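To sketch the answer to the question above: \(\sigma^2\mathbf{I}\) is an \(N \times N\) matrix with the common error variance \(\sigma^2\) on the main diagonal (homoskedasticity) and zeros everywhere off the diagonal (no autocorrelation):

\[\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}\]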
Gauss-Markov assumptions
Note: These assumptions do not need to be met in order to estimate an OLS model.
These assumptions need to be met for OLS to be BLUE .
It’s your job to theoretically and empirically justify that these assumptions are being met!
Prove OLS estimator is unbiased
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\]
Prove:
\[\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\]
Proof
\(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\)
\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}]\)
\(= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)
\(= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)
Proof
\(\hat{\boldsymbol{\beta}} = \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\)
\(\text{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \text{E}(\mathbf{I}\boldsymbol{\beta} | \mathbf{X}) + \text{E} \left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} | \mathbf{X} \right)\)
\(= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)
Assume \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\) :
\(= \boldsymbol{\beta} + 0 = \boldsymbol{\beta}\)
The estimator \(\hat{\boldsymbol{\beta}}\) is unbiased for \(\boldsymbol{\beta}\) for any realization of \(\mathbf{X}\) (assuming random sampling)
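A quick simulation sketch of this result, with a hypothetical design matrix and coefficient values of my own choosing (\(\boldsymbol{\beta}\) is held fixed while new errors are drawn each repetition):

# sketch: the average of beta-hat across repeated samples is close to beta
set.seed(630)
X <- cbind(1, rnorm(200))                    # fixed design: intercept + one IV
B <- c(2, 3)                                 # true coefficients
est <- replicate(5000, {
  y <- X %*% B + rnorm(200)                  # fresh error draw each repetition
  as.numeric(solve(t(X) %*% X) %*% t(X) %*% y)
})
rowMeans(est)                                # approximately c(2, 3)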
Assumption needed to prove unbiasedness
\(\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X})\)
If \(\text{E}(\boldsymbol{u}|\mathbf{X}) = 0\) cannot be assumed, then \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \text{E}(\boldsymbol{u} | \mathbf{X}) \neq 0\)...
And thus \(\text{E}(\hat{\boldsymbol{\beta}}) \neq \boldsymbol{\beta}\)
The exogeneity / zero conditional mean assumption is thus necessary for OLS to be an unbiased estimator
Confounding example
# set seed
set.seed(837)
# true correlation between x1 and x2, sample 1000 from pop
X <- cbind(rep(1, 1000),
           MASS::mvrnorm(1000,
                         mu = c(0, 0),
                         Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)))
# true betas
B <- c(1, 1, 1)
# true error variance
sigma2 <- 1
# true model generates sampled y from pop (rnorm takes the SD, hence sqrt)
y <- X %*% B + rnorm(1000, 0, sqrt(sigma2))
Confounding example
# estimate B using sample
solve(t(X) %*% X) %*% t(X) %*% y
[,1]
[1,] 1.0208625
[2,] 0.9996043
[3,] 1.0394309
But what if we estimate without \(x_2\)?
# what if we estimate without x2?
solve(t(X[, -3]) %*% X[, -3]) %*% t(X[, -3]) %*% y
[,1]
[1,] 1.019398
[2,] 1.554281

The coefficient on \(x_1\) is now biased upward: with \(\text{Corr}(x_1, x_2) = 0.5\) and \(\beta_2 = 1\), its expected value is roughly \(\beta_1 + 0.5\beta_2 = 1.5\) (omitted variable bias).
Variance of \(\hat{\boldsymbol{\beta}}\)
We would like it to have the following properties
Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)
Minimum variance / Efficiency: \(\text{E}\left[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2\right]\) is the smallest possible
But what is the variance of \(\hat{\boldsymbol{\beta}}\)? How can we derive it in general terms?
Assumption
\(\text{Var}(\boldsymbol{u}|\mathbf{X}) = \sigma^2\mathbf{I}\) (spherical errors, homoskedasticity)
This implies:
Homoskedasticity (constant error variance; error variance that changes with the IVs is called heteroskedasticity)
No autocorrelation (no correlation of residuals among observations, e.g., adjacent units in space or time)
What we get, in return, is (1) a simple form for \(\text{Var}(\hat{\boldsymbol{\beta}})\) and (2) the property that OLS is efficient
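A short derivation sketch of that simple form, starting from \(\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u}\) (from the unbiasedness proof; \(\boldsymbol{\beta}\) is a constant and contributes no variance) and applying the spherical-errors assumption:

\[\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \text{Var}\left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{u} \,|\, \mathbf{X}\right) \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \, \text{Var}(\boldsymbol{u}|\mathbf{X}) \, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \end{aligned}\]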
Covariance matrix of \(\hat{\boldsymbol{\beta}}\)
\[\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
A \((K+1) \times (K+1)\), symmetric matrix
Main diagonal gives variances
Off-diagonal elements give covariances
What does this look like? (See the sketch after this list.)
We typically do not know \(\sigma^2\) and use \(\frac{\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}}{N - K - 1}\) (the MSE) as an estimate
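A sketch of that layout, with coefficient variances on the main diagonal and covariances off it:

\[\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \begin{pmatrix} \text{Var}(\hat{\beta}_0) & \text{Cov}(\hat{\beta}_0,\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_0,\hat{\beta}_K) \\ \text{Cov}(\hat{\beta}_1,\hat{\beta}_0) & \text{Var}(\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_1,\hat{\beta}_K) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(\hat{\beta}_K,\hat{\beta}_0) & \text{Cov}(\hat{\beta}_K,\hat{\beta}_1) & \cdots & \text{Var}(\hat{\beta}_K) \end{pmatrix}\]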
Getting some intuition
Two issues
“Multicollinearity”
Often thought of as a problem when the correlation between two predictors is very high (e.g., > 0.9)
How should we think about this? (See the sketch after this list.)
Including irrelevant variables in the model
No bias
Less efficient because of non-zero correlation with other predictors
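The sketch referenced above: a minimal simulation (my own parameter choices) showing that the standard error on a predictor inflates as its correlation with another IV approaches 1.

# sketch: SE on the first IV as Corr(x1, x2) grows
set.seed(630)
se_for_cor <- function(r) {
  X <- MASS::mvrnorm(500, mu = c(0, 0),
                     Sigma = matrix(c(1, r, r, 1), 2, 2))  # two IVs, corr = r
  y <- drop(X %*% c(1, 1)) + rnorm(500)                    # both true slopes = 1
  summary(lm(y ~ X))$coefficients[2, "Std. Error"]         # SE on the first IV
}
sapply(c(0, 0.5, 0.9, 0.99), se_for_cor)                   # SEs grow sharply near 1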
Example SE calculation using trees
Let’s use the same example from last week:
# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"
# define y vector
y <- trees$Volume
# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y
Calculate the covariance matrix for \(\hat{\boldsymbol{\beta}}\)
# calculate residuals
resid <- as.numeric(y - X %*% beta)
# calculate degrees of freedom for estimating sigma^2
resid_df <- nrow(X) - length(beta)
# estimate error variance (MSE)
s2 <- sum(resid^2) / resid_df
# calculate covariance matrix for beta
beta_cov <- s2 * solve(t(X) %*% X)
beta_cov
Intercept Girth Height
Intercept 74.6189461 0.43217138 -1.05076889
Girth 0.4321714 0.06983578 -0.01786030
Height -1.0507689 -0.01786030 0.01693933
What does it mean that two coefficients are correlated?
Visualization
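As a sketch of one such visualization (assuming approximate normality of the sampling distribution, and using the beta and beta_cov objects computed above), one can draw coefficient pairs and plot them:

# sketch: simulated draws of the (Girth, Height) coefficient pair
draws <- MASS::mvrnorm(5000, mu = beta[, 1], Sigma = beta_cov)
plot(draws[, "Girth"], draws[, "Height"],
     xlab = "Girth coefficient", ylab = "Height coefficient")
cor(draws[, "Girth"], draws[, "Height"])  # negative: overestimating one tends to
                                          # go with underestimating the other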
Calculate standard errors
The main diagonal holds the estimated variances of the coefficients, so their square roots are the SEs:
# calculate standard errors
se <- sqrt(diag(beta_cov))
# print betas and SEs
round(cbind(beta, se), 3)
se
Intercept -57.988 8.638
Girth 4.708 0.264
Height 0.339 0.130
Compare to lm:
m1 <- lm(Volume ~ Girth + Height, data = trees)
round(cbind(coef(m1), sqrt(diag(vcov(m1)))), 3)
[,1] [,2]
(Intercept) -57.988 8.638
Girth 4.708 0.264
Height 0.339 0.130
The full summary(m1) output:

Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16