Linear Regression Introduction

POLSCI 630: Probability and Basic Regression

Published

January 14, 2025

Part 1. The Flexible Linear Model

Goals for regression

Regression has a variety of uses in the social sciences

  • Characterizing associations between variables
  • Prediction for new (future) observations
  • Causal effect estimation

. . .

Regression can in theory do all of the above if the right assumptions are met.

Notation

  • generic units of analysis are indicated by \(i=1,...,N\)
  • scalars are lower-case italics (e.g., \(x_{1i}\))
  • (column) vectors are lower-case bold italics (e.g., \(\boldsymbol{x}_i\), transpose: \(\boldsymbol{x}_i'\))
  • matrices are upper-case bold (e.g., \(\mathbf{X}\))
  • random variables are generally Roman (e.g., \(x\))
  • model parameters are generally Greek (e.g., \(\beta_1\), \(\boldsymbol{\beta}\))
  • generic model variables (and associated parameters) are indicated by \(j=1,...,K\)
  • error terms are indicated by \(u_i\) or \(\boldsymbol{u}\), but sometimes \(e_i\) or even \(\epsilon_i\)
  • error variance is indicated by \(\sigma^2\) or \(\sigma_u^2\) if necessary

What is regression?

Regression describes the relationship between one or more independent variables, \(\boldsymbol{x}_1,\boldsymbol{x}_{2},...,\boldsymbol{x}_{K}\), and a dependent variable, \(\boldsymbol{y}\).

Conditional expected value: \(\text{E}(y_i|x_i)\)

  • One way to do this is simply to calculate the mean of the DV at each value of the IV
  • But with only a small number of observations at each value of the IV, those estimates are noisy
  • By imposing a model on the relationship, we leverage “knowledge” that comes from the model assumptions to get more efficient estimates (see the sketch below)
  • This is a very general idea that you should internalize until you feel it in your bones
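As a rough sketch (my example, not from the original slides) using R’s built-in trees data: the binned conditional means bounce around because few observations fall in each bin, while the linear model pools information across all observations.

# mean of the DV within each (rounded) value of the IV -- noisy, few obs per bin
girth_bin <- round(trees$Girth)
tapply(trees$Volume, girth_bin, mean)

# the linear model instead pools information across all observations
coef(lm(Volume ~ Girth, data = trees))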

Conditional expected value based on linear model

The linear regression model

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]

  • \(\boldsymbol{y}\) is an \(N \times 1\) vector
  • \(\mathbf{X}\) is an \(N \times (K+1)\) matrix
  • \(\boldsymbol{\beta}\) is a \((K+1) \times 1\) vector
  • \(\boldsymbol{u}\) is an \(N \times 1\) vector

. . .

For data with \(N\) observations and \(K\) predictors:

  • The first column of \(\mathbf{X}\) is all 1s
  • The first element of \(\boldsymbol{\beta}\) is the y-intercept

Vectors and matrices

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]

\[\begin{bmatrix} 10.3 \\ 10.3 \\ 10.2 \\ 16.4 \end{bmatrix} = \begin{bmatrix} 1 & 8.3 & 70 \\ 1 & 8.6 & 65 \\ 1 & 8.8 & 63 \\ 1 & 10.5 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}\]

The linear regression model, expanded

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]

\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x_1} + \beta_2\boldsymbol{x_2} + \ldots + \beta_K\boldsymbol{x_K} + \boldsymbol{u}\]

  • \(\boldsymbol{y}\) is an \(N \times 1\) vector
  • Each \(\boldsymbol{x_k}\) is an \(N \times 1\) vector
  • Each \(\beta_k\) is a scalar
  • \(\boldsymbol{u}\) is an \(N \times 1\) vector

The linear regression model, single unit


\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]

  • This is the model for a single unit, \(i\)
  • Each symbol represents a scalar

Model components

\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]

  • \(\beta_0\) is the (y-)intercept or constant term
    • The predicted value of \(y_i\) when all \(x_{ki} = 0\)
  • \(\beta_k\) is the coefficient or slope for predictor \(x_{ki}\)
    • Change in \(y_i\) associated with a 1-unit change in \(x_{ki}\)
  • \(u_i\) is the error or disturbance term
    • The random deviation of the observed value from the fitted value

Model components, visually

# fit the bivariate model and draw a scatterplot annotated with the intercept, slope, and one residual
my.lm <- lm(trees$Volume ~ trees$Girth)
par(mar=c(5,4,1,1))
plot(trees$Girth, trees$Volume, ylim=c(-40,80), xlim=c(-5,30), 
     main = "",
     xlab="Diameter of cherry trees", ylab="Volume of cherry trees",
     axes=F)
axis(1, at=seq(-5,30,5))
axis(2, at=seq(-40,80,20), las=2)
abline(v=0, lty=2, col="purple")
text(3,60, "y-axis (x = 0)", col="purple")
abline(h=-36.94, lty=3, col="blue")
abline(a = coef(my.lm)[1], b = coef(my.lm)[2])
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2], 
         5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2],
         lwd=3, col="red")
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2],
         5+3, coef(my.lm)[1] + 5*coef(my.lm)[2])
segments(5+3, coef(my.lm)[1] + 5*coef(my.lm)[2],
         5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2])
text(9,-5, expression(paste(Delta, "y")))
text(6.5,-16, expression(paste(Delta, "x")))
text(9, -30, expression(paste(beta[0], " = -36.94")), col="blue")
text(6, 15, expression(paste(beta[1], " = ", paste(Delta, "y", "/", Delta, "x"))), col="red")
text(6.1, 3.5, " = 5.07", col="red")
arrows(6,-29, 0,-36.94, angle = 5, col="blue")
points(0,-36.94, col="blue", pch=19)
points(20.6,77, col="brown")
points(20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], pch=19, col="black")
segments(20.6,77, 20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], col="green", lwd=3)
text(19.5,72, expression(paste("u"[31])), col="green")
text(23,62, expression(paste("predicted value of ", "y"[31])), cex=0.75, col="black")
text(18,82.5, expression(paste("observed value of ", "y"[31])), cex=0.75, col="brown")

Multivariate case, visually

When we have more than one predictor, we are fitting a hyperplane to the data

# 3D scatterplot of the trees data with the fitted regression plane
library(scatterplot3d)
plot3d <- scatterplot3d(trees$Height, trees$Girth, trees$Volume,
angle=55, scale.y=0.7, pch=16, color ="red", main ="Regression Plane",
xlab = "Height", ylab = "Girth", zlab = "Volume")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)
plot3d$plane3d(my.lm, lty.box = "solid")

detach("package:scatterplot3d", unload=TRUE)

Multivariate case, visually

Bivariate “slices” of the hyperplane give conditional relationships

library(patchwork)
library(ggplot2)
source("helper_functions.R")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)

plot.a <- 
  trees %>%
  ggplot(aes(x = Height, y = Volume))+
  geom_point()+
  geom_abline(intercept =  coef(my.lm)[1] + coef(my.lm)[3]*mean(trees$Girth),
              slope = coef(my.lm)[2])+
  labs(x = "Height", y = "Volume",
       title = "Volume by Height",
       subtitle = "Diameter held constant")+
  theme_630()

plot.b <- 
  trees %>%
  ggplot(aes(x = Girth, y = Volume))+
  geom_point()+
  geom_abline(intercept =  coef(my.lm)[1] + coef(my.lm)[2]*mean(trees$Height),
              slope = coef(my.lm)[3])+
  labs(x = "Diameter", y = "Volume",
       title = "Volume by Diameter",
       subtitle = "Height held constant")+
  theme_630()

plot.a + plot.b

Expected value of \(y_i\)

There are two pieces to each regression model:

  • A systematic part: \(\boldsymbol{x}_i'\boldsymbol{\beta}\)
  • A stochastic part: \(u_i\) (where \(\text{E}(u_i) = 0\))

. . .

The systematic part gives us the expected value of \(y_i\), which is generally our predicted value (\(\hat{y}_i\))

\[\text{E}(y_i | \boldsymbol{x}_i) = \boldsymbol{x}_i'\boldsymbol{\beta} = \hat{y}_i\]

The errors

We will assume several things about the errors

  • They are completely random disturbances
  • They have an expected value of zero
  • They have an error variance \(\sigma^2\), which is the model-unexplained portion of the dependent variable
  • They are uncorrelated with the model predictors

. . .

Eventually, we will deal with the fallout from violations of these assumptions

Example lm output


Call:
lm(formula = trees$Volume ~ trees$Girth + trees$Height)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -57.9877     8.6382  -6.713 2.75e-07 ***
trees$Girth    4.7082     0.2643  17.816  < 2e-16 ***
trees$Height   0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Why linear regression?

Linear here means linear in the parameters (\(\boldsymbol{\beta}\))

Those parameters can be applied to non-linear transformations of \(x\)

Define \(x_2 = x_1^2\):

\[\text{income}_i = \beta_0 + \beta_1 \text{age}_{i} + \beta_2 \text{age}^2_{i} + u_i\]
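For instance, a minimal sketch with simulated data (the variable names and coefficient values below are made up for illustration): the squared term just enters as another column of \(\mathbf{X}\), so the model remains linear in the \(\beta\)s.

# simulated example: income is a quadratic function of age, but the model
# is still linear in the betas (age^2 is just another regressor)
set.seed(630)
age <- runif(500, 22, 65)
income <- 20 + 3*age - 0.03*age^2 + rnorm(500, sd = 5)
coef(lm(income ~ age + I(age^2)))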

Why linear regression?

We can also transform the DV and/or the IVs in ways that allow for non-linear relationships

\[\text{ln}(y_i) = \beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2x_{2i}+\beta_3x_{3i} + u_i\]

  • This implies a non-constant relationship of \(x_{1i}\) to \(y_i\)
  • It is still a linear combination of (transformed) predictors
    • A 1-unit change in \(\text{ln}(x_{1i})\) implies a \(\beta_1\)-unit change in \(\text{ln}(y_i)\)
  • To characterize the relationship between the untransformed variables, we need to do additional work (we will have a section on this)
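As a sketch with the trees data (my example, not part of the original notes), a log-log specification is still estimated with lm(); the coefficient on log(Girth) describes the relationship between the transformed variables, not a constant unit change in the originals.

# log-log model: linear in the parameters, non-linear in the original variables
m_log <- lm(log(Volume) ~ log(Girth) + Height, data = trees)
coef(m_log)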

Why linear regression?

Examples of non-linear regression models:

  • \(y_i = x_{1i}^{\alpha}+u_i\)
  • \(y_i = e^{\beta_0+\beta_1x_{1i}+u_i}\)

Part 2. Estimation and Interpretation

Estimation goals

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]

We would like to generate an estimate of \(\boldsymbol{\beta}\) using a sample with \(N\) observations.

What qualities would we want in an ideal estimator?

  • Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • Consistency: \(\lim_{N\rightarrow\infty} \text{Pr}(|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}|>\epsilon)=0\) for all \(\epsilon > 0\)

  • Efficiency: \(\text{E} \left((\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})^2 \right)\) (the variance of the estimator, when it is unbiased) is as small as possible
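A small Monte Carlo sketch (simulated data, not part of the original slides) illustrating what unbiasedness looks like in practice: across repeated samples, the OLS slope estimates average out to the true value.

# simulate many samples from y = 1 + 2x + u and estimate the slope each time
set.seed(630)
b1_hat <- replicate(2000, {
  x <- rnorm(100)
  y <- 1 + 2*x + rnorm(100)
  coef(lm(y ~ x))[2]
})
mean(b1_hat)   # should be close to the true value of 2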

Ordinary least squares estimator

The ordinary least squares (OLS) estimator has these properties

  • In general, find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals

. . .

  • For bivariate case: find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize:

\[\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i})^2 \right] =\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N\hat{u}_i^2 \right]\]

. . .

Should be intuitive: minimize the (squared) distance between the actual value and the model’s “best guess”

Minimizing

To find the minimum of a function with respect to a parameter:

  1. Take the first partial derivative
  2. Set it to zero
  3. Solve for the parameter of interest
  4. Verify that the 2nd partial derivative is positive
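Applying these steps to the bivariate sum of squared residuals gives the two first-order conditions (the “normal equations”); solving them yields the estimators on the next slide.

\[ \begin{align} \frac{\partial}{\partial \beta_0} \sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i})^2 &= -2\sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i}) = 0 \\ \frac{\partial}{\partial \beta_1} \sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i})^2 &= -2\sum_{i=1}^N x_{1i}(y_i-\beta_0-\beta_1x_{1i}) = 0 \end{align} \]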

OLS estimators for bivariate case

OLS estimator for \(\beta_1\):

\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]

OLS estimator for \(\beta_0\):

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1\]

  • To see where \(\hat{\beta}_0\) comes from, take the expected value of \(y_i = \beta_0 + \beta_1x_{1i} + u_i\): since \(\text{E}(u_i) = 0\), \(\bar{y} = \beta_0 + \beta_1\bar{x}_1\) in the sample analogue, which rearranges to the expression above

Example, by hand and with lm()

Regression of Volume on Girth in trees data

# estimate b1
b1 <- cov(trees$Volume, trees$Girth) / var(trees$Girth)

# estimate b0
b0 <- mean(trees$Volume) - b1*mean(trees$Girth)

# print estimates
c(b0,b1)
[1] -36.943459   5.065856

. . .

# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)

# print estimates
coef(m1)
(Intercept)       Girth 
 -36.943459    5.065856 

OLS estimator for multiple regression

\[\hat{\boldsymbol{\beta}}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\boldsymbol{y}\]

. . .

  • Note the similarity to the bivariate case:

\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]

Example, by hand

# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# print beta
round(beta, 3)
             [,1]
Intercept -57.988
Girth       4.708
Height      0.339

. . .

summary(lm(Volume ~ Girth + Height, data=trees))

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

What if we forget the intercept?

# define X matrix (without the column of 1s for the intercept)
X <- as.matrix(trees[, c("Girth", "Height")])

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# print beta
round(beta, 3)
         [,1]
Girth   5.044
Height -0.477

What if we forget the intercept?

Taking out the intercept forces the regression line to run through the origin (0,0)

my.lm <- lm(trees$Volume ~ trees$Girth)
my.lm.ni <- lm(trees$Volume ~ 0 + trees$Girth)

trees %>%
  ggplot(aes(x = Girth, y = Volume))+
  geom_point(col = "blue")+
  geom_abline(slope = coef(my.lm)[2], intercept = coef(my.lm)[1], col = "blue")+
  geom_abline(slope = coef(my.lm.ni)[1], intercept = 0, col = "red")+
  xlim(c(0,22))+
  ylim(c(-60, 80))+
  annotate("text", x = 10, y = 40, 
           label =  expression(paste("Volume = ", beta[0]," + ", beta[1], "Girth")),
           col = "blue")+
  annotate("text", x = 5, y = 20, 
           label =  expression(paste("Volume = ", beta[1], "Girth")),
           col = "red")+
  geom_vline(xintercept = 0, lty = "dashed")+
  geom_hline(yintercept = 0, lty = "dashed")+
  theme_630()

Predicted/expected values

The predicted or expected value of \(y_i\) is simply the systematic portion of the RHS: \(\beta_0 + \beta_1x_{1i}\)

# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)

# print estimates
coef(m1)
(Intercept)       Girth 
 -36.943459    5.065856 


  • “A tree with a 15 inch diameter is predicted to have a volume of \(-36.94 + (5.07)(15) = 39.11\) cubic feet.”
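The same prediction can be obtained with predict() and a new data frame (a quick sketch; it matches the hand calculation up to rounding of the coefficients):

# predicted volume for a tree with a 15 inch diameter
predict(m1, newdata = data.frame(Girth = 15))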

With multiple IVs

# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth + Height, data = trees)

# predicted values are dot product of coefficients with x-values
coef(m1) %*% c(1, 15, 60) # 15 inch diameter and 60 ft height
         [,1]
[1,] 32.98982

. . .

coef(m1)[1]*1 + coef(m1)[2]*15 + coef(m1)[3]*60
(Intercept) 
   32.98982 


. . .

  • “A tree with a 15 inch diameter and 60 ft height is predicted to have a volume of 33 cubic feet.”
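Again, predict() gives the same answer from a new data frame (sketch):

# predicted volume for a 15 inch diameter, 60 ft tall tree
predict(m1, newdata = data.frame(Girth = 15, Height = 60))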

Interpreting coefficients

\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]

We are interested in how \(\text{E}(y_i)\) changes as a function of \(x_{1i}\)

  • When you think “rate of change” you should think “derivative”
  • With more than one variable, think partial derivative

. . .

\[ \begin{align} \text{E}(y_i) &= \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} \\ \frac{\partial{\text{E}(y_i)}}{\partial{x_{1i}}} &= \beta_1 \end{align} \]

Tells us how the function is changing at a given point in response to infinitesimal changes in one of the function’s variables

Slope of tangent line, linear

par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
segments(7, 2*6 + 2, 5, 2*6 - 2, lty=1, col="blue", lwd=2)
points(6, 2*6, col="blue")
title(main="First derivative of 2x at x=2 and x=6")
text(2, 12, "Slope = 2", col="red")
text(6, 6, "Slope = 2", col="blue")

Slope of tangent line, non-linear

par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First derivative of log(x) at x=2 and x=6")
text(2, -0.5, "Slope = 1/2", col="red")
segments(8, log(6) + 2/6, 4, log(6) - 2/6, lty=1, col="blue")
points(6, log(6), col="blue")
text(6, 0.5, "Slope = 1/6", col="blue")

Interpreting coefficients

\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]

With discrete IVs, we cannot take the 1st derivative

. . .

  • Instead, we think about first differences for a change in \(x_{1i}\) from some value (\(k_1\)) to another (\(k_2\))

\[ \begin{align} &\text{E}(y_i[x_{1i} = k_2]) - \text{E}(y_i[x_{1i} = k_1]) \\ &= (\beta_0 + \beta_1k_2 + \beta_2x_{2i}) - (\beta_0 + \beta_1k_1 + \beta_2x_{2i}) \\ &= \beta_1(k_2 - k_1) \end{align} \]

  • If \(k_2 - k_1 = 1\), then this just equals \(\beta_1\)
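A quick sketch in R (the values 15, 16, and 60 are chosen for illustration): computing the first difference directly from predicted values recovers the Girth coefficient, because the model is linear.

# first difference in predicted Volume for Girth moving from 15 to 16,
# holding Height fixed at 60
m_fd <- lm(Volume ~ Girth + Height, data = trees)
predict(m_fd, newdata = data.frame(Girth = 16, Height = 60)) -
  predict(m_fd, newdata = data.frame(Girth = 15, Height = 60))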

First difference, linear

par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
curve(2*x, 2, 3, col="blue", add=T, lwd=2)
title(main="First diff of 2x at x=2")
text(2, 12, "Slope = 2", col="red")
text(2, 10, "First diff = 2*3 - 2*2 = 2", col="blue")

First difference, non-linear

par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First difference of log(x) at x=2")
curve(log(x), 2, 3, col="blue", add=T, lwd=2)
text(3, -0.5, paste0("First diff = log(3) - log(2) = ", round(log(3)-log(2),3)), col="blue")
text(1.5, 1.5, "Slope of tangent = 1/2", col="red")

Example, CEO salaries

# ceo data (salary in $1,000s, profits in $1Ms, age in years)
ceo <- read.csv("data/ceosalary.csv")

# estimate model using OLS and store in object
m_ceo <- lm(salary ~ age + profits, data = ceo)

# summary
summary(m_ceo)

Call:
lm(formula = salary ~ age + profits, data = ceo)

Residuals:
   Min     1Q Median     3Q    Max 
-892.9 -331.6 -100.2  253.9 4445.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 469.3863   276.7318   1.696   0.0916 .  
age           4.9620     4.8794   1.017   0.3106    
profits       0.5604     0.1016   5.516 1.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared:  0.1602,    Adjusted R-squared:  0.1505 
F-statistic: 16.59 on 2 and 174 DF,  p-value: 2.539e-07

The importance of scale

The substantive meaning of a given coefficient depends on how \(y_i\) and \(x_{ki}\) are scaled

# estimate model using OLS and store in object
m_ceo <- lm(salary/1000 ~ age + profits, data = ceo)

# summary
summary(m_ceo)

Call:
lm(formula = salary/1000 ~ age + profits, data = ceo)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8929 -0.3316 -0.1002  0.2539  4.4454 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.4693863  0.2767318   1.696   0.0916 .  
age         0.0049620  0.0048794   1.017   0.3106    
profits     0.0005604  0.0001016   5.516 1.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5416 on 174 degrees of freedom
Multiple R-squared:  0.1602,    Adjusted R-squared:  0.1505 
F-statistic: 16.59 on 2 and 174 DF,  p-value: 2.539e-07

Standardized coefficients

Common to scale to mean of 0 and standard deviation of 1

Two kinds:

  • Only standardize (one or more) IVs

  • Also standardize the DV

. . .

The scale() function in R will z-score by default (i.e., subtract the mean and divide by the SD)

  • The unit of change becomes one SD
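A minimal check (using the built-in trees data) that scale() is just z-scoring by hand:

# scale() subtracts the mean and divides by the SD
z1 <- as.numeric(scale(trees$Girth))
z2 <- (trees$Girth - mean(trees$Girth)) / sd(trees$Girth)
all.equal(z1, z2)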

Standardized CEO

# estimate model using OLS and store in object
m_ceo <- lm(scale(salary) ~ scale(age) + scale(profits), data = ceo)

# summary
summary(m_ceo)

Call:
lm(formula = scale(salary) ~ scale(age) + scale(profits), data = ceo)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5196 -0.5643 -0.1705  0.4322  7.5654 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -4.063e-17  6.928e-02   0.000    1.000    
scale(age)      7.112e-02  6.994e-02   1.017    0.311    
scale(profits)  3.858e-01  6.994e-02   5.516 1.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9217 on 174 degrees of freedom
Multiple R-squared:  0.1602,    Adjusted R-squared:  0.1505 
F-statistic: 16.59 on 2 and 174 DF,  p-value: 2.539e-07

Care in interpretation with standardized vars

0-1 coding

PoliSci often uses 0-1 (or min-max) coding

\[\frac{x_i - \min(x)}{\max(x) - \min(x)}\]

. . .

You can use either theoretical or sample minima/maxima.

  • If the lowest observed value is higher than the lowest theoretically possible value, you may have a choice to make

  • Similar issue to “care in interpretation” with standardized vars

0-1 example

# recode Girth and Volume
trees$Volume_01 <- (trees$Volume - min(trees$Volume)) / (max(trees$Volume) - min(trees$Volume)) 
trees$Girth_01 <- (trees$Girth - min(trees$Girth)) / (max(trees$Girth) - min(trees$Girth))

# estimate model
m_tree <- lm(Volume_01 ~ Girth_01, data = trees)
summary(m_tree)

Call:
lm(formula = Volume_01 ~ Girth_01, data = trees)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.120739 -0.046507  0.002275  0.052317  0.143515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.07630    0.02160  -3.533   0.0014 ** 
Girth_01     0.93278    0.04555  20.478   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06365 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

Part 3. Model Fit

Model fit

We often evaluate our models by their ability to account for variation in our outcome – i.e. how well they “fit” the data.

  • There are many different measures of model fit.
  • We will discuss the most common today; later this semester, we will turn to alternatives.
  • Model fit is not everything! Depends on your goals.

Sums of squares

People sometimes talk about the following sums of squares:

  • Total: \(\sum_N{(y_i - \bar{y})^2}\)

  • Model / Explained: \(\sum_N{(\hat{y}_i - \bar{y})^2}\)

  • Error / Residual: \(\sum_N{(y_i - \hat{y}_i)^2}\)
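When the model includes an intercept, the total sum of squares decomposes into the model and residual pieces; a quick sketch with the trees data:

# verify SS_Total = SS_Model + SS_Residual
m <- lm(Volume ~ Girth + Height, data = trees)
ss_t <- sum((trees$Volume - mean(trees$Volume))^2)
ss_m <- sum((fitted(m) - mean(trees$Volume))^2)
ss_r <- sum(residuals(m)^2)
all.equal(ss_t, ss_m + ss_r)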

Errors vs residuals

The errors, \(u_i\), are population quantities, and unobservable

. . .

  • Residuals are sample estimates, \(\hat{u}_i\); when the model includes an intercept, they have a mean of exactly 0 by construction

. . .

  • An unbiased estimator of the error variance, \(\hat{\sigma}^2\), is:

\[\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]
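A quick sketch: computing \(\hat{\sigma}\) by hand for the trees model matches the “residual standard error” reported by summary().

# sigma-hat by hand (N - K - 1 degrees of freedom; here K = 2 predictors)
m <- lm(Volume ~ Girth + Height, data = trees)
sigma2_hat <- sum(residuals(m)^2) / (nrow(trees) - 2 - 1)
c(by_hand = sqrt(sigma2_hat), summary_lm = summary(m)$sigma)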

Mean squared error

More generally, we can define the mean squared error for any model as:

\[\text{MSE} = \text{E} \left((\hat{\theta} - \theta)^2 \right)\]

  • Expected value of quadratic loss from applying model
  • Interpretable only with reference to scale of \(\theta\)

    • useful in relative terms (for comparing fit between two versions of same model)
  • Incorporates both bias and variance

    • For unbiased estimators such as OLS, \(\text{MSE}\) reduces to the variance of the estimator

MSE in linear regression

In linear regression, mean squared error typically refers to the following:

\[\text{MSE} = \frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]

i.e. we divide by degrees of freedom, not sample size

  • More complex model → smaller denominator → higher MSE, all else equal

\(R^2\)

A common measure of model fit is \(R^2\), which is 1 minus the ratio of residual to total variation in \(y\):

\[1 - \frac{\text{SS}_R}{\text{SS}_T} = 1 - \frac{\sum_N{(y_i - \hat{y}_i)^2}}{\sum_N{(y_i - \bar{y})^2}}\]

  • This can be interpreted as the proportionate reduction in error of prediction (relative to always guessing \(\bar{\boldsymbol{y}}\))

  • It is also the squared correlation (shared variance) between \(\boldsymbol{y}\) and \(\hat{\boldsymbol{y}}\)
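A quick check of the second interpretation with the trees model (sketch):

# R^2 equals the squared correlation between y and the fitted values
m <- lm(Volume ~ Girth + Height, data = trees)
c(summary(m)$r.squared, cor(trees$Volume, fitted(m))^2)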

\(R^2\)

Given Model 1 with \(K\) covariates and Model 2 containing those same \(K\) covariates plus one more, \(R^2_2 \geq R^2_1\).

. . .

Why is this true?

. . .

Why is this a problem?

Adjusted \(R^2\)

A better version of \(R^2\) replaces the residual sum of squares with the MSE (and the total sum of squares with the sample variance of \(\boldsymbol{y}\)), thus penalizing model complexity

\[R^2_{adj} = 1 - \frac{\text{MSE}}{\text{Var}(\boldsymbol{y})} = 1 - \frac{\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}}{\frac{\sum_N{(y_i - \bar{\boldsymbol{y}})^2}}{N - 1}}\]

. . .

If you add a useless variable, adjusted \(R^2\) will typically decrease (unlike \(R^2\), which cannot).

Example

m2 <- lm(Volume ~ Girth + Height, data = trees)
s2 <- summary(m2)

c(round(s2$r.squared, 3), round(s2$adj.r.squared, 5))
[1] 0.94800 0.94423

. . .

set.seed(3587)
m3 <- lm(Volume ~ Girth + Height + rnorm(length(trees$Volume)), data = trees)
s3 <- summary(m3)

c(round(s3$r.squared, 3), round(s3$adj.r.squared, 5))
[1] 0.95000 0.94398

By hand

# calculate r2
r2 <- 1 - var(m2$residuals) / var(trees$Volume)

# calculate mean squared error
mse <- (sum(m2$residuals^2) / m2$df.residual)

# adjusted r2
r2_adj <- 1 -  mse / var(trees$Volume)

# print
c(round(sqrt(mse),3),round(r2,3),round(r2_adj, 3))
[1] 3.882 0.948 0.944

. . .

summary(m2)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16