FIRST DRAFT! PLEASE LET ME KNOW IF YOU FIND ANY MISTAKES!
The purpose of this document is to help you develop intuition about the variance of the estimated regression coefficients:
\[ \text{Var}(\boldsymbol{\hat{\beta}}) = \sigma^2 (\textbf{X}'\textbf{X})^{-1} \] The first term, \(\sigma^2\), is the error variance of the regression: \(\text{Var}(u_i)\). The second term is the inverse of \(\textbf{X}'\textbf{X}\), the independent variable matrix (including the first column of ones) premultiplied by its own transpose. Why is the formula like this?
We can begin by looking at the variance of the coefficient estimator in the bivariate case. This will help to get the right intuition. Consider the model:
\[ y_i = \beta_0 + \beta_1x_{1i} + u_i \] The variance of \(\hat{\beta}_1\) is:
\[ \text{Var} (\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^N (x_{1i} - \bar{x}_1)^2} \] Thus, the variance of the estimator for \(\beta_1\) is the error variance of the model (the variance of the individual errors, \(u_i\)) divided by the sum of the squared deviations of the independent variable from its own mean. This means that the variance of \(\hat{\beta}_1\) is determined by two things:
How much variance in \(y_i\) is left over after accounting for its shared variance with \(x_{1i}\). The more error variance in the model, the more uncertainty there is in the relationship between \(x_{1i}\) and \(y_i\). One way to think about this is that there is a larger “noise to signal” ratio, which makes it harder to “see” how the two variables are related, because there is less information in the data.
The variance of the independent variable, \(x_{1i}\) (the denominator is the variance of \(x_{1i}\) multiplied by \(N-1\)). The larger the variance in the independent variable, all else equal, the smaller the uncertainty in our estimate of \(\beta_1\). Why? There are at least two intuitions here. First, the larger the “spread” in \(x_{1i}\), the easier it is to trace out its relationship with \(y_i\). You can see this in the figure below. Second, we can think about our regression as a “variance decomposition”: we are dividing the variance of \(y_i\) into two components: a systematic component (associated with \(x_{1i}\)) and a random or stochastic component (associated with \(u_i\)). The more variance there is in \(x_{1i}\), the larger the proportion of variance in \(y_i\) that is systematic as opposed to random, and the lower the uncertainty in our estimator.
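To make this concrete, here is a small simulation sketch (the sample size, coefficient values, and spreads below are arbitrary choices for illustration): we fit the same bivariate model twice, once with a low-spread \(x\) and once with a high-spread \(x\), and compare the standard errors of the slope.

# illustrative simulation: same model and same errors, different spread in x
set.seed(123)
N <- 200
u <- rnorm(N, sd = 2)            # identical errors in both cases
x_narrow <- rnorm(N, sd = 0.5)   # low-variance regressor
x_wide   <- rnorm(N, sd = 3)     # high-variance regressor
y_narrow <- 1 + 0.5 * x_narrow + u
y_wide   <- 1 + 0.5 * x_wide + u
# the slope SE shrinks as the spread of x grows
summary(lm(y_narrow ~ x_narrow))$coefficients["x_narrow", "Std. Error"]
summary(lm(y_wide ~ x_wide))$coefficients["x_wide", "Std. Error"]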
Let’s extend this to multiple regression. Notice that the form of the variance of \(\boldsymbol{\hat{\beta}}\) is very similar to the formula for the variance of the coefficient in the bivariate case:
\[ \text{Var}(\boldsymbol{\hat{\beta}}) = \sigma^2 (\textbf{X}'\textbf{X})^{-1} \]
In the bivariate case, we divide the error variance by the sum of squared deviations of the one independent variable from its mean (that is, the variance of this variable multiplied by the sample size minus 1).
In the multivariate case, we multiply the error variance by the inverse of the product of the independent variable matrix with itself. Multiplying by the inverse of a matrix is kind of like dividing by the matrix, so this is similar to dividing the error variance by a function of the independent variables, just like in the bivariate case.
And the matrix product of the independent variable matrix with itself is related to the covariance matrix of the independent variables multiplied by \(N-1\), so this is also like the bivariate case, where we divide by the variance of the independent variable. (If the independent variables have a population mean of 0, then \(\textbf{X}'\textbf{X}\) minus the row and column for the intercept is, approximately, the estimated covariance matrix of the independent variables multiplied by \(N-1\), with the approximation improving as \(N\) grows. To see this for yourself, set the \(N\) parameter in the code below to a very large number and compare cov(X) to (1/(N-1))*t(X)%*%X.)
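The chunk that generates the data is not shown here, so the following is a minimal stand-in sketch (the value of \(N\), the correlation between the variables, and the seed are arbitrary): with mean-zero variables, the two matrices get closer and closer as \(N\) grows.

# minimal sketch: variables with population mean zero
set.seed(42)
N <- 1000                    # increase this to see the two matrices converge
x1 <- rnorm(N)
x2 <- 0.5 * x1 + rnorm(N)    # correlated with x1, still mean zero
X <- cbind(x1, x2)           # no intercept column here
# nearly identical when the sample means are close to 0
cov(X)
(1/(N-1)) * t(X) %*% X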
Let’s take a look at the product of the independent variable matrix with itself in a simple case, to see exactly what it looks like. Consider the following model:
\[ y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i \]
We have two independent variables, so our independent variable matrix, \(\textbf{X}\), will be \(N \times (K + 1) = N \times 3\). Let’s imagine we have 5 observations (so \(N = 5\)) and that the two independent variables are positively correlated. Our \(\textbf{X}\) matrix might look like the following, with one row for each of the \(N\) observations, the first column of 1s for the intercept, and one column for each of the independent variables:
## [,1] [,2] [,3]
## [1,] 1 2.242462 3.227800
## [2,] 1 2.606496 3.232277
## [3,] 1 3.928984 5.458814
## [4,] 1 3.712018 11.828242
## [5,] 1 -2.425005 -6.579600
What about \(\textbf{X}'\textbf{X}\)? It will be a three by three matrix that is symmetric:
## [,1] [,2] [,3]
## [1,] 5.00000 10.06495 17.16753
## [2,] 10.06495 46.91910 96.97294
## [3,] 17.16753 96.97294 233.86341
The first entry is just the sample size, because it is the dot product of the column of ones with itself, which is just the sum of \(N\) ones, which equals \(N\). The remaining diagonal elements are related to the variances of the two independent variables.
The off-diagonal elements in the first row and first column are just the sums of the observations for each of the independent variables, because we are taking the dot product of a column of ones with those observations. For example, the [1,2] and [2,1] entries of \(\textbf{X}'\textbf{X}\) are just the sum of the observations for \(x_{1i}\). Why is this useful? To see why, note that we can rewrite the variance of a variable as follows:
\[ \text{Var} (x) = \frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N-1} = \frac{\sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i\right)^2 / N}{N-1} \]
The diagonal element for each independent variable is just the sum of the squared observations of that variable (the first term in the numerator on the right-hand side, \(\sum_{i=1}^N x_i^2\)). The off-diagonal element for that variable in the first row and first column is the sum of its observations (\(\sum_{i=1}^N x_i\), the quantity that is squared and divided by \(N\) in the second term of the numerator). In other words, the diagonal term, in combination with its respective off-diagonal elements in the first row and column of the matrix, and the first entry of the matrix at [1,1] (which supplies the \(N\)), give us all the information necessary to calculate the variance of that variable. Now we can see even more clearly how similar the multivariate formula is to the bivariate one: in both cases we are “dividing” the error variance by something closely related to the variance of the independent variables.
The remaining off-diagonal elements are related to the covariances of the independent variables. Remember that the covariance of two variables is the sum of the products of their deviations from their respective means, divided by \(N-1\). Thus, we have something similar to what is going on with the variances, as described in the immediately preceding paragraph. The off-diagonal entries outside the first row and column give the sums of products of the various independent variables with each other, the off-diagonal entries in the first row and column give sums of variable values, and the first entry of the matrix gives the sample size. Again, we have all the information needed to calculate the covariances of the independent variables with each other. In this sense, we can indeed think of the \(\textbf{X}'\textbf{X}\) matrix as akin to the variance-covariance matrix of the independent variables.
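As a quick check, here is a small sketch (the data are made up, not the matrix printed above) showing that the entries of \(\textbf{X}'\textbf{X}\) really do contain everything needed to recover the variances and covariances:

# toy data: intercept column plus two regressors
set.seed(1)
N <- 5
x1 <- rnorm(N)
x2 <- rnorm(N)
X <- cbind(1, x1, x2)
XtX <- t(X) %*% X
# variance of x1 from the [2,2], [1,2], and [1,1] entries of X'X
(XtX[2, 2] - XtX[1, 2]^2 / XtX[1, 1]) / (N - 1)
var(x1)
# covariance of x1 and x2 from the [2,3], [1,2], [1,3], and [1,1] entries
(XtX[2, 3] - XtX[1, 2] * XtX[1, 3] / XtX[1, 1]) / (N - 1)
cov(x1, x2)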
At this point, however, you may be asking yourself two questions. First, as explained in the preceding paragraphs, the information to calculate variances and covariances is scattered across the matrix - how is it ultimately combined? Second, in the multivariate case, the covariances of the independent variables are present, in addition to the variances of each variable - what role do they play in the calculation of standard errors? The answer to both is found in what happens when we take the inverse of this matrix.
Let’s consider a generic 3x3 symmetric matrix:
\[ A = \begin{pmatrix} a & b & c \\ b & d & e \\ c & e & f \\ \end{pmatrix} \]
The formula for the inverse of this matrix is as follows (thanks GPT!):
\[ A^{-1} = \frac{1}{\text{det}(A)} \begin{pmatrix} df-e^2 & ce-bf & be-cd \\ ce-bf & af-c^2 & bc-ae \\ be-cd & bc-ae & ad-b^2 \\ \end{pmatrix} \] And the determinant of A is given by:
\[\text{det}(A) = a(d f - e^2) - b(b f - c e) + c(b e - c d)\]
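This is easy to verify numerically; the sketch below (the particular matrix is just an arbitrary invertible symmetric example) builds the inverse from the formula above and compares it to solve():

# an arbitrary symmetric, invertible 3x3 matrix
A <- matrix(c(4, 1, 2,
              1, 5, 3,
              2, 3, 6), nrow = 3, byrow = TRUE)
a <- A[1, 1]; b <- A[1, 2]; c <- A[1, 3]
d <- A[2, 2]; e <- A[2, 3]; f <- A[3, 3]
detA <- a*(d*f - e^2) - b*(b*f - c*e) + c*(b*e - c*d)
adjA <- matrix(c(d*f - e^2, c*e - b*f, b*e - c*d,
                 c*e - b*f, a*f - c^2, b*c - a*e,
                 b*e - c*d, b*c - a*e, a*d - b^2), nrow = 3, byrow = TRUE)
(1/detA) * adjA   # should match the next line
solve(A)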
The first thing to notice is that we are dividing by the determinant of the matrix, which is a scalar value, and which integrates all of the values present in the matrix - including the “covariance” terms among the independent variables. The determinant is a measure of volume: its absolute value gives the volume of the parallelepiped spanned by the matrix’s columns, so a larger determinant means a larger volume. What is volume in the context of our independent variables?
Consider again that we can think of \(\textbf{X}'\textbf{X}\) as similar to the covariance matrix of the independent variables. The volume associated with this matrix is related to the amount of information contained in the multivariate distribution of the independent variables - essentially, the amount of variation captured by the set. This volume increases as: (1) the variance of each independent variable increases, and (2) the covariance between pairs of independent variables decreases. The first point is the same logic as in the bivariate case: more variation in the independent variables, all else equal, means more systematic variation in \(y_i\) relative to random variation. The second point holds because as the correlation between two variables goes up, they have less independent variance - that is, they contribute less unique information about \(y_i\). So this is the first way that the covariance terms enter into the variance of the estimator: the larger the correlations among the independent variables, the less unique information there is about the individual coefficients, and the higher the variance of the estimator will be.
We can see this in particular in the first term of the determinant, \(a(df - e^2)\). In our matrix of interest, \(\textbf{X}'\textbf{X}\), \(a\) is the sample size, \(d\) and \(f\) are the sums of squares for \(x_1\) and \(x_2\), and \(e\) is the sum of the products of \(x_{1i}\) and \(x_{2i}\). This first term therefore multiplies the sample size by the product of the total variation in the two independent variables minus the square of their shared variation, and is thus a measure of the unique information in the independent variables.
To see this visually, consider the following two distributions (you can drag the figure around to adjust the view). In both cases, the two variables have identical variances, but in the first case their correlation is 0, while in the second it is 0.9. You can see clearly that the spread of the first is much larger than the second, and this is closely related both to the determinants of the covariance matrices, and to the unique information contained in each set.
(Interactive figure: two simulated bivariate distributions with equal variances, one with correlation 0 and one with correlation 0.9.)
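The same comparison can be made numerically; the sketch below (the sample size is arbitrary, and the correlations match those in the figure) simulates the two cases and compares the determinants of their covariance matrices:

# two bivariate normal samples with equal variances, different correlations
library(MASS)
set.seed(99)
N <- 10000
X_uncorr <- mvrnorm(N, mu = c(0, 0), Sigma = matrix(c(1, 0, 0, 1), nrow = 2))
X_corr   <- mvrnorm(N, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2))
# the correlated pair has a much smaller determinant: less unique variation
det(cov(X_uncorr))   # should be close to 1
det(cov(X_corr))     # should be close to 1 - 0.9^2 = 0.19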
Overall, a large determinant means more unique variation in the independent variables, and thus more information for estimating the coefficients. In turn, we are dividing the entries of the matrix by a larger value, which results in smaller entries for \(\text{Var}(\hat{\boldsymbol{\beta}})\), and thus more precise estimates of the coefficients.
Let’s turn now to the entries of the matrix portion of the inverse formula (the matrix that we divide by the determinant, as shown above). Recall that the diagonal elements of the variance matrix for the coefficients give the variances of the coefficient estimators, and their square roots give us the standard errors.
Let’s look at the [2,2] entry as an example, which will ultimately correspond with the variance of the regression coefficient for \(x_1\) (after dividing by the determinant and scaling by \(\sigma^2\)). The formula is \(af - c^2\). It turns out that this is the determinant of the submatrix that results from eliminating the 2nd row and 2nd column! This means that the entry for [2,2] in \(A^{-1}\) is the ratio of the determinant of the submatrix without \(x_1\) to the determinant of the overall matrix that includes \(x_1\). Recall that the determinant is a measure of how much information is contained in the matrix. This ratio thus reflects how much of the total information (\(\text{det}(A)\)) is due to variables other than \(x_1\) (\(\text{det}(A_{-[2,2]})\)). When this value is large, \(x_1\) is providing a smaller contribution to the overall amount of variation in the independent variables, and its standard error will be correspondingly larger, relative to the other variables.
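Here is a quick numerical check of this cofactor fact, using the same arbitrary symmetric matrix as in the earlier sketch (any invertible symmetric matrix would do):

# the [2,2] entry of the inverse equals det(submatrix)/det(full matrix)
A <- matrix(c(4, 1, 2,
              1, 5, 3,
              2, 3, 6), nrow = 3, byrow = TRUE)
solve(A)[2, 2]
det(A[-2, -2]) / det(A)   # drop row 2 and column 2, then take the ratio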
In sum, the determinant for the overall matrix \(\textbf{X}'\textbf{X}\) tells us about the spread (and thus amount of information) in the multivariate distribution of the independent variables. When this value is large, we will get smaller values in \((\textbf{X}'\textbf{X})^{-1}\), which will produce smaller standard errors after multiplying by \(\sigma^2\). The extent to which any given coefficient has a relatively large or small standard error, however, depends on the amount of unique variation it contributes to the total variation, which can be captured (inversely) with the determinant of the submatrix formed from eliminating that variable’s row and column.
One last thing to note is that, when the variables are on different scales, the ratio is not quite the same as the proportion of unique information (in the sense of predictive power), because that difference in “spread” is due in part to the different scaling of the variables (see the example below with the trees data). That is, if a variable is on a different scale, its coefficient will also be on a different scale, and the standard error should be adjusted accordingly. Imagine a regression where income is measured in either dollars or thousands of dollars. In the first case, the coefficient will be much smaller, and the standard error needs to be smaller as well.
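To see the scaling point concretely with the trees data used below (the division by 10 is purely illustrative), rescaling Girth rescales its coefficient and its standard error by the same factor, while the t statistic is untouched:

# same model, but with Girth divided by 10: coefficient and SE both become
# 10 times larger, while the t statistic and p-value do not change
m_raw    <- lm(Volume ~ Girth + Height, data = trees)
m_scaled <- lm(Volume ~ I(Girth / 10) + Height, data = trees)
summary(m_raw)$coefficients[2, ]      # the Girth row
summary(m_scaled)$coefficients[2, ]   # the rescaled Girth row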
Hopefully, this was somewhat enlightening, and helps to develop your intuitions about where our standard errors come from!
Here is an example with the trees data. Let’s calculate the standard error for the coefficient for Girth using only determinants, rather than the inverse of \(\textbf{X}' \textbf{X}\):
# trees data, estimate model, store output
m <- lm(Volume ~ Girth + Height, data=trees)
summ_m <- summary(m)
# calculate X transpose X
XtX <- t(model.matrix(m)) %*% model.matrix(m)
XtX
## (Intercept) Girth Height
## (Intercept) 31.0 410.70 2356.0
## Girth 410.7 5736.55 31524.7
## Height 2356.0 31524.70 180274.0
# find determinant of X transpose X
det_XtX <- det(XtX)
det_XtX
## [1] 8147126
# find determinant of submatrix removing row and col 2
det_XtX_sub <- det(XtX[c(1,3),c(1,3)])
det_XtX_sub
## [1] 37758
# estimate variance of errors
sig2 <- summ_m$sigma^2
sig2
## [1] 15.06862
# find SE for Girth
sqrt(sig2 * (det_XtX_sub / det_XtX))
## [1] 0.2642646
# compare to lm
summary(m)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
And what about for Height?
# find determinant of submatrix removing row and col 3
det_XtX_sub_3 <- det(XtX[c(1,2),c(1,2)])
# find SE for Height
sqrt(sig2 * (det_XtX_sub_3 / det_XtX))
## [1] 0.1301512
There is a sense in which Height is less important for predicting Volume than Girth, but it will still have a smaller standard error. Why? Because it is scaled differently, in units that contribute more to the total spread. The ratio of determinants is thus smaller than for Girth, but this is mostly just due to scaling! We can see that by summarizing the variables.
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
If we standardized the variables, this would no longer be an issue, and the standard errors would actually be identical. The coefficient for Girth, however, would be much larger, reflecting the fact that Girth matters more than Height; this makes the estimate for Girth more precise relative to its magnitude, and thus more “significant” in terms of p-values.
# trees data, estimate model, store output
m_st <- lm(Volume ~ scale(Girth) + scale(Height), data=trees[trees$Girth<15,])
# XtX
XtX_st <- t(model.matrix(m_st)) %*% model.matrix(m_st)
# submatrices
XtX_st_2 <- XtX_st[c(1,3),c(1,3)]
XtX_st_3 <- XtX_st[c(1,2),c(1,2)]
# Girth vs Height
det(XtX_st_2) / det(XtX_st) # Girth
## [1] 0.05182783
det(XtX_st_3) / det(XtX_st) # Height
## [1] 0.05182783
The exact equality here is a quirk of the case with only two independent variables, and would not hold more generally. The reason is that, in the two-variable case, the determinant of each submatrix is simply the variance of the “other” independent variable multiplied by \(N(N-1)\) (to see this, go back to the formulas for the entries of the inverse of X'X and compare them to the formula for the variance given previously). The variances are, by construction, equal to 1 in both cases. Since these submatrix determinants become more complicated in the more general case, the equality will not hold in general.
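As an illustrative check (with simulated data, not part of the original example), adding a third standardized regressor shows that the determinant ratios are generally no longer equal:

# three standardized regressors: the ratios generally differ
set.seed(7)
N <- 100
x1 <- rnorm(N)
x2 <- 0.6 * x1 + rnorm(N)
x3 <- 0.3 * x1 - 0.5 * x2 + rnorm(N)
X <- cbind(1, scale(x1), scale(x2), scale(x3))
XtX <- t(X) %*% X
# det(submatrix without variable j) / det(full matrix), for x1, x2, x3
det(XtX[-2, -2]) / det(XtX)
det(XtX[-3, -3]) / det(XtX)
det(XtX[-4, -4]) / det(XtX)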