Linear Regression Introduction
POLSCI 630: Probability and Basic Regression
Part 1. The Flexible Linear Model
Goals for regression
Regression has a variety of uses in the social sciences
- Characterizing associations between variables
- Prediction to future observations
- Causal effect estimation
. . .
Regression can in theory do all of the above if the right assumptions are met.
Notation
- generic units of analysis are indicated by \(i=1,...,N\)
- scalars are lower-case italics (e.g., \(x_{1i}\))
- (column) vectors are lower-case bold italics (e.g., \(\boldsymbol{x}_i\), transpose: \(\boldsymbol{x}_i'\))
- matrices are upper-case bold (e.g., \(\mathbf{X}\))
- random variables are generally Roman (e.g., \(x\))
- model parameters are generally Greek (e.g., \(\beta_1\), \(\boldsymbol{\beta}\))
- generic model variables (and associated parameters) are indicated by \(j=1,...,K\)
- error terms are indicated by \(u_i\) or \(\boldsymbol{u}\), but sometimes \(e_i\) or even \(\epsilon_i\)
- error variance is indicated by \(\sigma^2\) or \(\sigma_u^2\) if necessary
What is regression?
Regression describes the relationship between one or more independent variables, \(\boldsymbol{x}_1,\boldsymbol{x}_{2},...,\boldsymbol{x}_{K}\), and a dependent variable, \(\boldsymbol{y}\).
Conditional expected value: \(\text{E}(y_i|x_i)\)
- One way to do this is to simply calculate the mean of the DV at each value of the IV
- But there are typically only a few observations at each value of the IV, so those estimates are noisy
- By imposing a model on the relationship, we leverage “knowledge” that comes from the model assumptions to get more efficient estimates (see the sketch below)
- This is a very general idea that you should internalize until you feel it in your bones
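A minimal sketch of this idea in R, using the built-in trees data that appears throughout these slides (the binning and object names are illustrative): raw cell means bounce around because each cell has few observations, while the model-based conditional means change smoothly.
# raw conditional means of Volume at each (rounded) value of Girth
girth_bin <- round(trees$Girth)
raw_means <- tapply(trees$Volume, girth_bin, mean)

# model-based conditional means from a linear fit
fit <- lm(Volume ~ Girth, data = trees)
model_means <- predict(fit, newdata = data.frame(Girth = as.numeric(names(raw_means))))

# compare: raw means are noisier, model-based means are smoother
cbind(raw = round(raw_means, 1), model = round(model_means, 1))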
Conditional expected value based on linear model
The linear regression model
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
- \(\boldsymbol{y}\) is an \(N \times 1\) vector
- \(\mathbf{X}\) is an \(N \times (K+1)\) matrix
- \(\boldsymbol{\beta}\) is a \((K+1) \times 1\) vector
- \(\boldsymbol{u}\) is an \(N \times 1\) vector
. . .
For data with \(N\) observations and \(K\) predictors:
- The first column of \(\mathbf{X}\) is all 1s (see the model.matrix() sketch below)
- The first element of \(\boldsymbol{\beta}\) is the y-intercept
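A quick way to see this structure (a sketch with the built-in trees data): model.matrix() builds \(\mathbf{X}\) from a formula, and its first column is all 1s.
# design matrix for a two-predictor model; first column carries the intercept
head(model.matrix(Volume ~ Girth + Height, data = trees))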
Vectors and matrices
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
\[\begin{bmatrix} 10.3 \\ 10.3 \\ 10.2 \\ 16.4 \end{bmatrix} = \begin{bmatrix} 1 & 8.3 & 70 \\ 1 & 8.6 & 65 \\ 1 & 8.8 & 63 \\ 1 & 10.5 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}\]
The linear regression model, expanded
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x_1} + \beta_2\boldsymbol{x_2} + \ldots + \beta_K\boldsymbol{x_K} + \boldsymbol{u}\]
- \(\boldsymbol{y}\) is an \(N \times 1\) vector
- Each \(\boldsymbol{x_k}\) is an \(N \times 1\) vector
- Each \(\beta_k\) is a scalar
- \(\boldsymbol{u}\) is an \(N \times 1\) vector
The linear regression model, single unit
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]
- This is the model for a single unit, \(i\)
- Each symbol represents a scalar
Model components
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]
- \(\beta_0\) is the (y-)intercept or constant term
- The predicted value of \(y_i\) when all \(x_{ki} = 0\)
- \(\beta_k\) is the coefficient or slope for predictor \(x_{ki}\)
- Change in \(y_i\) associated with a 1-unit change in \(x_{ki}\)
- \(u_i\) is the error or disturbance term
- The random deviation of the observed value from the fitted value
Model components, visually
# fit bivariate regression of tree volume on girth, then plot the fitted line,
# the intercept, the slope, and one residual
my.lm <- lm(trees$Volume ~ trees$Girth)
par(mar=c(5,4,1,1))
plot(trees$Girth, trees$Volume, ylim=c(-40,80), xlim=c(-5,30),
main = "",
xlab="Diameter of cherry trees", ylab="Volume of cherry trees",
axes=F)
axis(1, at=seq(-5,30,5))
axis(2, at=seq(-40,80,20), las=2)
abline(v=0, lty=2, col="purple")
text(3,60, "y-axis (x = 0)", col="purple")
abline(h=-36.94, lty=3, col="blue")
abline(a = coef(my.lm)[1], b = coef(my.lm)[2])
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2],
lwd=3, col="red")
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + 5*coef(my.lm)[2])
segments(5+3, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2])
text(9,-5, expression(paste(Delta, "y")))
text(6.5,-16, expression(paste(Delta, "x")))
text(9, -30, expression(paste(beta[0], " = -36.94")), col="blue")
text(6, 15, expression(paste(beta[1], " = ", paste(Delta, "y", "/", Delta, "x"), )), col="red")
text(6.1, 3.5, " = 5.07", col="red")
arrows(6,-29, 0,-36.94, angle = 5, col="blue")
points(0,-36.94, col="blue", pch=19)
points(20.6,77, col="brown")
points(20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], pch=19, col="black")
segments(20.6,77, 20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], col="green", lwd=3)
text(19.5,72, expression(paste("u"[31])), col="green")
text(23,62, expression(paste("predicted value of ", "y"[31])), cex=0.75, col="black")
text(18,82.5, expression(paste("observed value of ", "y"[31])), cex=0.75, col="brown")
Multivariate case, visually
When we have more than one predictor, we are fitting a hyperplane to the data
library(scatterplot3d)
plot3d <- scatterplot3d(trees$Height, trees$Girth, trees$Volume,
                        angle=55, scale.y=0.7, pch=16, color = "red", main = "Regression Plane",
                        xlab = "Height", ylab = "Girth", zlab = "Volume")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)
plot3d$plane3d(my.lm, lty.box = "solid")
detach("package:scatterplot3d", unload=TRUE)
Multivariate case, visually
Bivariate “slices” of the hyperplane give conditional relationships
library(patchwork)
library(ggplot2)
source("helper_functions.R")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)

plot.a <- trees %>%
  ggplot(aes(x = Height, y = Volume))+
  geom_point()+
  geom_abline(intercept = coef(my.lm)[1] + coef(my.lm)[3]*mean(trees$Girth),
              slope = coef(my.lm)[2])+
  labs(x = "Height", y = "Volume",
       title = "Volume by Height",
       subtitle = "Diameter held constant")+
  theme_630()

plot.b <- trees %>%
  ggplot(aes(x = Girth, y = Volume))+
  geom_point()+
  geom_abline(intercept = coef(my.lm)[1] + coef(my.lm)[2]*mean(trees$Height),
              slope = coef(my.lm)[3])+
  labs(x = "Diameter", y = "Volume",
       title = "Volume by Diameter",
       subtitle = "Height held constant")+
  theme_630()

plot.a + plot.b
Expected value of \(y_i\)
There are two pieces to each regression model:
- A systematic part: \(\boldsymbol{x}_i'\boldsymbol{\beta}\)
- A stochastic part: \(u_i\) (where \(\text{E}(u_i) = 0\))
. . .
The systematic part gives us the expected value of \(y_i\), which is generally our predicted value (\(\hat{y}_i\))
\[\text{E}(y_i | \boldsymbol{x}_i) = \boldsymbol{x}_i'\boldsymbol{\beta} = \hat{y}_i\]
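A quick check of this identity (a sketch with the trees data): the fitted values that lm() reports are exactly \(\mathbf{X}\hat{\boldsymbol{\beta}}\).
m <- lm(Volume ~ Girth + Height, data = trees)
X <- model.matrix(m)                 # N x (K+1) design matrix
yhat_manual <- X %*% coef(m)         # x_i'beta-hat for each observation
all.equal(as.numeric(yhat_manual), as.numeric(fitted(m)))   # TRUE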
The errors
We will assume several things about the errors
- They are completely random disturbances
- They have an expected value of zero
- They have an error variance \(\sigma^2\), which is the model-unexplained portion of the dependent variable
- They are uncorrelated with the model predictors
. . .
Eventually, we will deal with the fall-out from violations of these assumptions
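Note that these are assumptions about the unobserved errors. The OLS residuals satisfy the sample analogues of the second and fourth assumptions mechanically, so they cannot be used to test those assumptions directly; a small sketch with the trees model:
m <- lm(Volume ~ Girth + Height, data = trees)
mean(resid(m))              # essentially zero, by construction
cor(resid(m), trees$Girth)  # essentially zero, by construction
cor(resid(m), trees$Height) # essentially zero, by construction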
Example lm() output
Call:
lm(formula = trees$Volume ~ trees$Girth + trees$Height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
trees$Girth 4.7082 0.2643 17.816 < 2e-16 ***
trees$Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Why linear regression?
Linear here means linear in the parameters (\(\boldsymbol{\beta}\))
Those parameters can be applied to non-linear transformations of \(x\)
Define \(x_2 = x_1^2\):
\[\text{income}_i = \beta_0 + \beta_1 \text{age}_{i} + \beta_2 \text{age}^2_{i} + u_i\]
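In R, such a term can be written directly in the formula; a sketch assuming a hypothetical data frame df with income and age columns:
# I() protects the arithmetic inside the formula
m_quad <- lm(income ~ age + I(age^2), data = df)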
Why linear regression?
We can also transform the DV and/or the IVs in ways that allow for non-linear relationships
\[\text{ln}(y_i) = \beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2x_{2i}+\beta_3x_{3i} + u_i\]
- This implies a non-constant relationship of \(x_{1i}\) to \(y_i\)
- It is still a linear combination of (transformed) predictors
- A 1-unit change in \(\text{ln}(x_{1i})\) implies a \(\beta_1\)-unit change in \(\text{ln}(y_i)\)
- To characterize the relationship between the untransformed variables, we need to do additional work (we will have a section on this)
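A sketch of the log-log specification in lm(), assuming a hypothetical data frame dat in which y and x1 are strictly positive:
# transform the DV and one IV inside the formula; the model stays linear in the betas
m_log <- lm(log(y) ~ log(x1) + x2 + x3, data = dat)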
Why linear regression?
Examples of non-linear regression models (a sketch of estimating the first follows):
- \(y_i = x_{1i}^{\alpha}+u_i\)
- \(y_i = e^{\beta_0+\beta_1x_{1i}+u_i}\)
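The first model cannot be rewritten as a linear combination of (transformed) predictors with an additive error, so it requires a non-linear estimator such as nls(); a sketch assuming a hypothetical data frame dat with a positive predictor x1:
# non-linear least squares needs starting values for the parameters
m_nls <- nls(y ~ x1^alpha, data = dat, start = list(alpha = 1))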
Part 2. Estimation and Interpretation
Estimation goals
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
We would like to generate an estimate of \(\boldsymbol{\beta}\) using a sample with \(N\) observations.
What qualities would we want in an ideal estimator?
Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)
Consistency: \(\lim_{N\rightarrow\infty} \space \text{Pr}(|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}|>\epsilon)=0, \space \forall \space \epsilon > 0\)
Efficiency: \(\text{E} \left((\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})^2 \right)\) (the variance of the estimator) is smallest possible
Ordinary least squares estimator
The ordinary least squares (OLS) estimator has these properties
- In general, find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals
. . .
- For bivariate case: find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize:
\[\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i})^2 \right] =\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N\hat{u}_i^2 \right]\]
. . .
This should be intuitive: we minimize the (squared) distances between the actual values and the model’s “best guesses”
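One way to see the least-squares idea directly (a sketch, not how lm() works internally) is to minimize the sum of squared residuals numerically with optim() and compare the result to the estimates reported later:
# sum of squared residuals as a function of (b0, b1)
ssr <- function(b, y, x) sum((y - b[1] - b[2]*x)^2)

# numerical minimization for the trees data, starting from (0, 0)
opt <- optim(par = c(0, 0), fn = ssr, y = trees$Volume, x = trees$Girth)
opt$par   # approximately -36.94 and 5.07, matching OLS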
Minimizing
To find the minimum with respect to some variable
- Take the first partial derivative
- Set it to zero
- Solve for parameter of interest
- Verify that 2nd partial derivative is positive
OLS estimators for bivariate case
OLS estimator for \(\beta_1\):
\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]
OLS estimator for \(\beta_0\):
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1\]
- To see where \(\hat{\beta}_0\) comes from, take the expected value of both sides of \(y = \beta_0 + \beta_1 x_1 + u\) (using \(\text{E}(u) = 0\)) and solve for \(\beta_0\); replacing the population means with sample means gives the estimator above
Example, by hand and with lm()
Regression of Volume on Girth in the trees data

# estimate b1
b1 <- cov(trees$Volume, trees$Girth) / var(trees$Girth)

# estimate b0
b0 <- mean(trees$Volume) - b1*mean(trees$Girth)

# print estimates
c(b0, b1)
[1] -36.943459 5.065856
. . .
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)
# print estimates
coef(m1)
(Intercept) Girth
-36.943459 5.065856
OLS estimator for multiple regression
\[\hat{\boldsymbol{\beta}}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\boldsymbol{y}\]
. . .
- Note the similarity to the bivariate case:
\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]
Example, by hand
# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y
# print beta
round(beta, 3)
[,1]
Intercept -57.988
Girth 4.708
Height 0.339
. . .
summary(lm(Volume ~ Girth + Height, data=trees))
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
What if we forget intercept?
# define X matrix (no column of 1s, so no intercept)
X <- as.matrix(trees[, c("Girth", "Height")])

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y
# print beta
round(beta, 3)
[,1]
Girth 5.044
Height -0.477
What if we forget the intercept?
Taking out the intercept forces the fitted line (or plane) to run through the origin (0, 0)
my.lm <- lm(trees$Volume ~ trees$Girth)
my.lm.ni <- lm(trees$Volume ~ 0 + trees$Girth)

trees %>%
  ggplot(aes(x = Girth, y = Volume))+
geom_point(col = "blue")+
geom_abline(slope = coef(my.lm)[2], intercept = coef(my.lm)[1], col = "blue")+
geom_abline(slope = coef(my.lm.ni)[1], intercept = 0, col = "red")+
xlim(c(0,22))+
ylim(c(-60, 80))+
annotate("text", x = 10, y = 40,
label = expression(paste("Volume = ", beta[0]," + ", beta[1], "Girth")),
col = "blue")+
annotate("text", x = 5, y = 20,
label = expression(paste("Volume = ", beta[1], "Girth")),
col = "red")+
geom_vline(xintercept = 0, lty = "dashed")+
geom_hline(yintercept = 0, lty = "dashed")+
theme_630()
Predicted/expected values
The predicted or expected value of \(y_i\) is simply the systematic portion of the RHS: \(\beta_0 + \beta_1x_{1i}\)
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)
# print estimates
coef(m1)
(Intercept) Girth
-36.943459 5.065856
- “A tree with a 15 inch diameter is predicted to have a volume of \(-36.94 + (5.07)(15) = 39.11\) cubic inches.”
With multiple IVs
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth + Height, data = trees)
# predicted values are dot product of coefficients with x-values
coef(m1) %*% c(1, 15, 60) # 15 inch diameter and 60 ft height
[,1]
[1,] 32.98982
. . .
coef(m1)[1]*1 + coef(m1)[2]*15 + coef(m1)[3]*60
(Intercept)
32.98982
. . .
- “A tree with a 15 inch diameter and 60ft height is predicted to have a volume of 33 cubic inches.”
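The same prediction can be obtained with predict(), which handles the dot product for you (note that newdata must use the original variable names):
predict(m1, newdata = data.frame(Girth = 15, Height = 60))   # about 33 cubic inches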
Interpreting coefficients
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]
We are interested in how \(\text{E}(y_i)\) changes as a function of \(x_{1i}\)
- When you think “rate of change” you should think “derivative”
- With more than one variable, think partial derivative
. . .
\[ \begin{align} \text{E}(y_i) &= \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} \\ \frac{\partial{\text{E}(y_i)}}{\partial{x_{1i}}} &= \beta_1 \end{align} \]
Tells us how the function is changing at a given point in response to infinitesimal changes in one of the function’s variables
Slope of tangent line, linear
par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
segments(7, 2*6 + 2, 5, 2*6 - 2, lty=1, col="blue", lwd=2)
points(6, 2*6, col="blue")
title(main="First derivative of 2x at x=2 and x=6")
text(2, 12, "Slope = 2", col="red")
text(6, 6, "Slope = 2", col="blue")
Slope of tangent line, non-linear
par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First derivative of log(x) at x=2 and x=6")
text(2, -0.5, "Slope = 1/2", col="red")
segments(8, log(6) + 2/6, 4, log(6) - 2/6, lty=1, col="blue")
points(6, log(6), col="blue")
text(6, 0.5, "Slope = 1/6", col="blue")
Interpreting coefficients
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]
With discrete IVs, we cannot take the 1st derivative
. . .
- Instead, we think about first differences for a change in \(x_{1i}\) from some value (\(k_1\)) to another (\(k_2\))
\[ \begin{align} &\text{E}(y_i[x_{1i} = k_2]) - \text{E}(y_i[x_{1i} = k_1]) \\ &= (\beta_0 + \beta_1k_2 + \beta_2x_{2i}) - (\beta_0 + \beta_1k_1 + \beta_2x_{2i}) \\ &= \beta_1(k_2 - k_1) \end{align} \]
- If \(k_2 - k_1 = 1\), then this just equals \(\beta_1\) (see the sketch below)
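A first difference can be computed directly from model predictions; a sketch with the trees model, moving Girth from 10 to 11 inches while holding Height at its mean:
m <- lm(Volume ~ Girth + Height, data = trees)
nd <- data.frame(Girth = c(10, 11), Height = mean(trees$Height))
diff(predict(m, newdata = nd))   # equals the Girth coefficient, about 4.71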
First difference, linear
par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
curve(2*x, 2, 3, col="blue", add=T, lwd=2)
title(main="First diff of 2x at x=2")
text(2, 12, "Slope = 2", col="red")
text(2, 10, "First diff = 2*3 - 2*2 = 2", col="blue")
First difference, non-linear
par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First difference of log(x) at x=2")
curve(log(x), 2, 3, col="blue", add=T, lwd=2)
text(3, -0.5, paste0("First diff = log(3) - log(2) = ", round(log(3)-log(2),3)), col="blue")
text(1.5, 1.5, "Slope of tangent = 1/2", col="red")
Example, CEO salaries
# ceo data (salary in $1,000s, profits in $1Ms, age in years)
ceo <- read.csv("data/ceosalary.csv")

# estimate model using OLS and store in object
m_ceo <- lm(salary ~ age + profits, data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = salary ~ age + profits, data = ceo)
Residuals:
Min 1Q Median 3Q Max
-892.9 -331.6 -100.2 253.9 4445.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 469.3863 276.7318 1.696 0.0916 .
age 4.9620 4.8794 1.017 0.3106
profits 0.5604 0.1016 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
The importance of scale
The substantive meaning of a given coefficient depends on how \(y_i\) and \(x_{ki}\) are scaled
# estimate model using OLS and store in object
m_ceo <- lm(salary/1000 ~ age + profits, data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = salary/1000 ~ age + profits, data = ceo)
Residuals:
Min 1Q Median 3Q Max
-0.8929 -0.3316 -0.1002 0.2539 4.4454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4693863 0.2767318 1.696 0.0916 .
age 0.0049620 0.0048794 1.017 0.3106
profits 0.0005604 0.0001016 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5416 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Standardized coefficients
Common to scale to mean of 0 and standard deviation of 1
Two kinds:
- Only standardize (one or more) IVs
- Also standardize the DV
. . .
The scale() function in R will Z-score by default (i.e., subtract the mean, divide by the SD; see the sketch below)
- Unit of change becomes SD
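A quick sketch of what scale() is doing, using the profits variable from the ceo data:
# Z-scoring by hand matches scale()
z_manual <- (ceo$profits - mean(ceo$profits)) / sd(ceo$profits)
all.equal(z_manual, as.numeric(scale(ceo$profits)))   # TRUE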
Standardized CEO
# estimate model using OLS and store in object
m_ceo <- lm(scale(salary) ~ scale(age) + scale(profits), data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = scale(salary) ~ scale(age) + scale(profits), data = ceo)
Residuals:
Min 1Q Median 3Q Max
-1.5196 -0.5643 -0.1705 0.4322 7.5654
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.063e-17 6.928e-02 0.000 1.000
scale(age) 7.112e-02 6.994e-02 1.017 0.311
scale(profits) 3.858e-01 6.994e-02 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9217 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Care in interpretation with standardized vars
0-1 coding
PoliSci often uses 0-1 (or min-max) coding
\[\frac{x_i - \min(x)}{\max(x) - \min(x)}\]
. . .
Either the theoretical or the sample minima/maxima can be used.
If the lowest observed value is higher than the lowest theoretically possible value, you may have a choice to make.
This raises a similar issue to “care in interpretation” with standardized variables.
0-1 example
# recode Girth and Volume
trees$Volume_01 <- (trees$Volume - min(trees$Volume)) / (max(trees$Volume) - min(trees$Volume))
trees$Girth_01 <- (trees$Girth - min(trees$Girth)) / (max(trees$Girth) - min(trees$Girth))

# estimate model
m_tree <- lm(Volume_01 ~ Girth_01, data = trees)
summary(m_tree)
Call:
lm(formula = Volume_01 ~ Girth_01, data = trees)
Residuals:
Min 1Q Median 3Q Max
-0.120739 -0.046507 0.002275 0.052317 0.143515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07630 0.02160 -3.533 0.0014 **
Girth_01 0.93278 0.04555 20.478 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06365 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
Part 3. Model Fit
Model fit
We often evaluate our models by their ability to account for variation in our outcome – i.e. how well they “fit” the data.
- There are many different measures of model fit.
- We will discuss the most common today; later this semester, we will turn to alternatives.
- Model fit is not everything! Depends on your goals.
Sums of squares
People sometimes talk about the following sums of squares (computed in the sketch below):
Total: \(\sum_N{(y_i - \bar{y})^2}\)
Model / Explained: \(\sum_N{(\hat{y}_i - \bar{y})^2}\)
Error / Residual: \(\sum_N{(y_i - \hat{y}_i)^2}\)
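These are easy to compute from a fitted model; a sketch with the trees model, which also shows that the total sum of squares equals the model plus residual sums of squares (a decomposition that holds for OLS with an intercept):
m <- lm(Volume ~ Girth + Height, data = trees)
y <- trees$Volume

ss_total <- sum((y - mean(y))^2)          # total
ss_model <- sum((fitted(m) - mean(y))^2)  # model / explained
ss_resid <- sum(resid(m)^2)               # error / residual

c(total = ss_total, model = ss_model, resid = ss_resid,
  model_plus_resid = ss_model + ss_resid)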
Errors vs residuals
The errors, \(u_i\), are population quantities, and unobservable
. . .
- Residuals are sample estimates, \(\hat{u}_i\), and have an expected value of 0 by construction
. . .
- An unbiased estimator of the error variance, \(\hat{\sigma}^2\), is:
\[\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]
Mean squared error
More generally, we can define the mean squared error for any model as:
\[\text{MSE} = \text{E} \left((\hat{\theta} - \theta)^2 \right)\]
- Expected value of quadratic loss from applying model
Interpretable only with reference to scale of \(\theta\)
- useful in relative terms (for comparing fit between two versions of same model)
Incorporates both bias and variance
- For unbiased estimators such as OLS, \(\text{MSE} \equiv \text{Error Variance}\)
MSE in linear regression
In linear regression, mean squared error typically refers to the following:
\[\text{MSE} = \frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]
i.e. we divide by degrees of freedom, not sample size
- More complex model → smaller denominator → higher MSE, all else equal
\(R^2\)
A common measure of model fit is \(R^2\), which is 1 minus the ratio of residual to total variation in \(y\):
\[1 - \frac{\text{SS}_R}{\text{SS}_T} = 1 - \frac{\sum_N{(y_i - \hat{y}_i)^2}}{\sum_N{(y_i - \bar{y})^2}}\]
This can be interpreted as the proportionate reduction in error of prediction (relative to always guessing \(\bar{\boldsymbol{y}}\))
It is also the squared correlation (shared variance) between \(\boldsymbol{y}\) and \(\hat{\boldsymbol{y}}\)
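Both interpretations are easy to verify with the trees model:
m <- lm(Volume ~ Girth + Height, data = trees)
1 - sum(resid(m)^2) / sum((trees$Volume - mean(trees$Volume))^2)  # definition
cor(trees$Volume, fitted(m))^2                                    # squared correlation
summary(m)$r.squared                                              # as reported by lm()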
\(R^2\)
Given Model 1 with K covariates and Model 2 with K + 1 covariates, \(R^2_2 \geq R^2_1\).
. . .
Why is this true?
. . .
Why is this a problem?
Adjusted \(R^2\)
A better version of \(R^2\) uses the MSE in the numerator, thus penalizing model complexity
\[R^2_{adj} = 1 - \frac{\text{MSE}}{\text{Var}(\boldsymbol{y})} = 1 - \frac{\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}}{\frac{\sum_N{(y_i - \bar{\boldsymbol{y}})^2}}{N - 1}}\]
. . .
If you add a useless variable, adjusted \(R^2\) will typically decrease (while unadjusted \(R^2\) can never decrease).
Example
m2 <- lm(Volume ~ Girth + Height, data = trees)
s2 <- summary(m2)
c(round(s2$r.squared, 3), round(s2$adj.r.squared, 5))
[1] 0.94800 0.94423
. . .
set.seed(3587)
m3 <- lm(Volume ~ Girth + Height + rnorm(length(trees$Volume)), data = trees)
s3 <- summary(m3)
c(round(s3$r.squared, 3), round(s3$adj.r.squared, 5))
[1] 0.95000 0.94398
By hand
# calculate r2
r2 <- 1 - var(m2$residuals) / var(trees$Volume)

# calculate mean squared error
mse <- sum(m2$residuals^2) / m2$df.residual

# adjusted r2
r2_adj <- 1 - mse / var(trees$Volume)
# print
c(round(sqrt(mse),3),round(r2,3),round(r2_adj, 3))
[1] 3.882 0.948 0.944
. . .
summary(m2)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16