Linear Regression Introduction
POLSCI 630: Probability and Basic Regression
Part 1. The Flexible Linear Model
Goals for regression
Regression has a variety of uses in the social sciences
- Characterizing associations between variables
- Prediction to future observations
- Causal effect estimation
. . .
Regression can in theory do all of the above if the right assumptions are met.
Notation
- generic units of analysis are indicated by \(i=1,...,N\)
- scalars are lower-case italics (e.g., \(x_{1i}\))
- (column) vectors are lower-case bold italics (e.g., \(\boldsymbol{x}_i\), transpose: \(\boldsymbol{x}_i'\))
- matrices are upper-case bold (e.g., \(\mathbf{X}\))
- random variables are generally Roman (e.g., \(x\))
- model parameters are generally Greek (e.g., \(\beta_1\), \(\boldsymbol{\beta}\))
- generic model variables (and associated parameters) are indicated by \(j=1,...,K\)
- error terms are indicated by \(u_i\) or \(\boldsymbol{u}\), but sometimes \(e_i\) or even \(\epsilon_i\)
- error variance is indicated by \(\sigma^2\) or \(\sigma_u^2\) if necessary
What is regression?
Regression describes the relationship between one or more independent variables, \(\boldsymbol{x}_1,\boldsymbol{x}_{2},...,\boldsymbol{x}_{K}\), and a dependent variable, \(\boldsymbol{y}\).
Conditional expected value: \(\text{E}(y_i|x_i)\)
- One way to do this is to simply calculate the mean of the DV at each value of the IV
- But there are typically only a few observations at each value of the IV, so those estimates are noisy
- By imposing a model on the relationship, we leverage “knowledge” that comes from the model assumptions to get more efficient estimates (see the sketch below)
- This is a very general idea that you should internalize until you feel it in your bones
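A minimal sketch of this idea in R, using the built-in trees data that appears throughout these slides (the binning and object names are illustrative): raw cell means bounce around because each cell has few observations, while the model-based conditional means change smoothly.
# raw conditional means of Volume at each (rounded) value of Girth
girth_bin <- round(trees$Girth)
raw_means <- tapply(trees$Volume, girth_bin, mean)

# model-based conditional means from a linear fit
fit <- lm(Volume ~ Girth, data = trees)
model_means <- predict(fit, newdata = data.frame(Girth = as.numeric(names(raw_means))))

# compare: raw means are noisier, model-based means are smoother
cbind(raw = round(raw_means, 1), model = round(model_means, 1))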
Conditional expected value based on linear model
The linear regression model
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
- \(\boldsymbol{y}\) is an \(N \times 1\) vector
- \(\mathbf{X}\) is an \(N \times (K+1)\) matrix
- \(\boldsymbol{\beta}\) is a \((K+1) \times 1\) vector
- \(\boldsymbol{u}\) is an \(N \times 1\) vector
. . .
For data with \(N\) observations and \(K\) predictors:
- The first column of \(\mathbf{X}\) is all 1s (see the model.matrix() sketch below)
- The first element of \(\boldsymbol{\beta}\) is the y-intercept
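A quick way to see this structure (a sketch with the built-in trees data): model.matrix() builds \(\mathbf{X}\) from a formula, and its first column is all 1s.
# design matrix for a two-predictor model; first column carries the intercept
head(model.matrix(Volume ~ Girth + Height, data = trees))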
Vectors and matrices
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
\[\begin{bmatrix} 10.3 \\ 10.3 \\ 10.2 \\ 16.4 \end{bmatrix} = \begin{bmatrix} 1 & 8.3 & 70 \\ 1 & 8.6 & 65 \\ 1 & 8.8 & 63 \\ 1 & 10.5 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}\]
The linear regression model, expanded
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x_1} + \beta_2\boldsymbol{x_2} + \ldots + \beta_K\boldsymbol{x_K} + \boldsymbol{u}\]
- \(\boldsymbol{y}\) is an \(N \times 1\) vector
- Each \(\boldsymbol{x_k}\) is an \(N \times 1\) vector
- Each \(\beta_k\) is a scalar
- \(\boldsymbol{u}\) is an \(N \times 1\) vector
The linear regression model, single unit
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]
- This is the model for a single unit, \(i\)
- Each symbol represents a scalar
Model components
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \ldots + \beta_Kx_{Ki} + u_i\]
- \(\beta_0\) is the (y-)intercept or constant term
- The predicted value of \(y_i\) when all \(x_{ki} = 0\)
- \(\beta_k\) is the coefficient or slope for predictor \(x_{ki}\)
- Change in \(y_i\) associated with a 1-unit change in \(x_{ki}\)
- \(u_i\) is the error or disturbance term
- The random deviation of the observed value from the fitted value
Model components, visually
# fit bivariate regression of tree volume on girth, then plot the fitted line,
# the intercept, the slope, and one residual
my.lm <- lm(trees$Volume ~ trees$Girth)
par(mar=c(5,4,1,1))
plot(trees$Girth, trees$Volume, ylim=c(-40,80), xlim=c(-5,30),
main = "",
xlab="Diameter of cherry trees", ylab="Volume of cherry trees",
axes=F)
axis(1, at=seq(-5,30,5))
axis(2, at=seq(-40,80,20), las=2)
abline(v=0, lty=2, col="purple")
text(3,60, "y-axis (x = 0)", col="purple")
abline(h=-36.94, lty=3, col="blue")
abline(a = coef(my.lm)[1], b = coef(my.lm)[2])
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2],
lwd=3, col="red")
segments(5, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + 5*coef(my.lm)[2])
segments(5+3, coef(my.lm)[1] + 5*coef(my.lm)[2],
5+3, coef(my.lm)[1] + (5+3)*coef(my.lm)[2])
text(9,-5, expression(paste(Delta, "y")))
text(6.5,-16, expression(paste(Delta, "x")))
text(9, -30, expression(paste(beta[0], " = -36.94")), col="blue")
text(6, 15, expression(paste(beta[1], " = ", paste(Delta, "y", "/", Delta, "x"), )), col="red")
text(6.1, 3.5, " = 5.07", col="red")
arrows(6,-29, 0,-36.94, angle = 5, col="blue")
points(0,-36.94, col="blue", pch=19)
points(20.6,77, col="brown")
points(20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], pch=19, col="black")
segments(20.6,77, 20.6,coef(my.lm)[1]+20.6*coef(my.lm)[2], col="green", lwd=3)
text(19.5,72, expression(paste("u"[31])), col="green")
text(23,62, expression(paste("predicted value of ", "y"[31])), cex=0.75, col="black")
text(18,82.5, expression(paste("observed value of ", "y"[31])), cex=0.75, col="brown")
Multivariate case, visually
When we have more than one predictor, we are fitting a hyperplane to the data
library(scatterplot3d)
plot3d <- scatterplot3d(trees$Height, trees$Girth, trees$Volume,
                        angle=55, scale.y=0.7, pch=16, color = "red", main = "Regression Plane",
                        xlab = "Height", ylab = "Girth", zlab = "Volume")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)
plot3d$plane3d(my.lm, lty.box = "solid")
detach("package:scatterplot3d", unload=TRUE)
Multivariate case, visually
Bivariate “slices” of the hyperplane give conditional relationships
library(patchwork)
library(ggplot2)
source("helper_functions.R")
my.lm <- lm(trees$Volume ~ trees$Height + trees$Girth)

plot.a <- trees %>%
  ggplot(aes(x = Height, y = Volume))+
  geom_point()+
  geom_abline(intercept = coef(my.lm)[1] + coef(my.lm)[3]*mean(trees$Girth),
              slope = coef(my.lm)[2])+
  labs(x = "Height", y = "Volume",
       title = "Volume by Height",
       subtitle = "Diameter held constant")+
  theme_630()

plot.b <- trees %>%
  ggplot(aes(x = Girth, y = Volume))+
  geom_point()+
  geom_abline(intercept = coef(my.lm)[1] + coef(my.lm)[2]*mean(trees$Height),
              slope = coef(my.lm)[3])+
  labs(x = "Diameter", y = "Volume",
       title = "Volume by Diameter",
       subtitle = "Height held constant")+
  theme_630()

plot.a + plot.b
Expected value of \(y_i\)
There are two pieces to each regression model:
- A systematic part: \(\boldsymbol{x}_i'\boldsymbol{\beta}\)
- A stochastic part: \(u_i\) (where \(\text{E}(u_i) = 0\))
. . .
The systematic part gives us the expected value of \(y_i\), which is generally our predicted value (\(\hat{y}_i\))
\[\text{E}(y_i | \boldsymbol{x}_i) = \boldsymbol{x}_i'\boldsymbol{\beta} = \hat{y}_i\]
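A quick check of this identity (a sketch with the trees data): the fitted values that lm() reports are exactly \(\mathbf{X}\hat{\boldsymbol{\beta}}\).
m <- lm(Volume ~ Girth + Height, data = trees)
X <- model.matrix(m)                 # N x (K+1) design matrix
yhat_manual <- X %*% coef(m)         # x_i'beta-hat for each observation
all.equal(as.numeric(yhat_manual), as.numeric(fitted(m)))   # TRUE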
The errors
We will assume several things about the errors
- They are completely random disturbances
- They have an expected value of zero
- They have an error variance \(\sigma^2\), which is the model-unexplained portion of the dependent variable
- They are uncorrelated with the model predictors
. . .
Eventually, we will deal with the fall-out from violations of these assumptions
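Note that these are assumptions about the unobserved errors. The OLS residuals satisfy the sample analogues of the second and fourth assumptions mechanically, so they cannot be used to test those assumptions directly; a small sketch with the trees model:
m <- lm(Volume ~ Girth + Height, data = trees)
mean(resid(m))              # essentially zero, by construction
cor(resid(m), trees$Girth)  # essentially zero, by construction
cor(resid(m), trees$Height) # essentially zero, by construction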
Example lm() output
Call:
lm(formula = trees$Volume ~ trees$Girth + trees$Height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
trees$Girth 4.7082 0.2643 17.816 < 2e-16 ***
trees$Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Why linear regression?
Linear here means linear in the parameters (\(\boldsymbol{\beta}\))
Those parameters can be applied to non-linear transformations of \(x\)
Define \(x_2 = x_1^2\):
\[\text{income}_i = \beta_0 + \beta_1 \text{age}_{i} + \beta_2 \text{age}^2_{i} + u_i\]
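In R, such a term can be written directly in the formula; a sketch assuming a hypothetical data frame df with income and age columns:
# I() protects the arithmetic inside the formula
m_quad <- lm(income ~ age + I(age^2), data = df)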
Why linear regression?
We can also transform the DV and/or the IVs in ways that allow for non-linear relationships
\[\text{ln}(y_i) = \beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2x_{2i}+\beta_3x_{3i} + u_i\]
- This implies a non-constant relationship of \(x_{1i}\) to \(y_i\)
- It is still a linear combination of (transformed) predictors
- A 1-unit change in \(\text{ln}(x_{1i})\) implies a \(\beta_1\)-unit change in \(\text{ln}(y_i)\)
- To characterize the relationship between the untransformed variables, we need to do additional work (we will have a section on this)
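A sketch of the log-log specification in lm(), assuming a hypothetical data frame dat in which y and x1 are strictly positive:
# transform the DV and one IV inside the formula; the model stays linear in the betas
m_log <- lm(log(y) ~ log(x1) + x2 + x3, data = dat)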
Why linear regression?
Examples of non-linear regression models (a sketch of estimating the first follows):
- \(y_i = x_{1i}^{\alpha}+u_i\)
- \(y_i = e^{\beta_0+\beta_1x_{1i}+u_i}\)
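The first model cannot be rewritten as a linear combination of (transformed) predictors with an additive error, so it requires a non-linear estimator such as nls(); a sketch assuming a hypothetical data frame dat with a positive predictor x1:
# non-linear least squares needs starting values for the parameters
m_nls <- nls(y ~ x1^alpha, data = dat, start = list(alpha = 1))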
Part 2. Estimation and Interpretation
Estimation goals
\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}\]
We would like to generate an estimate of \(\boldsymbol{\beta}\) using a sample with \(N\) observations.
What qualities would we want in an ideal estimator?
Unbiasedness: \(\text{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)
Consistency: \(\lim_{N\rightarrow\infty} \space \text{Pr}(|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}|>\epsilon)=0, \space \forall \space \epsilon > 0\)
Efficiency: \(\text{E} \left((\hat{\boldsymbol{\beta}}-\boldsymbol{\beta})^2 \right)\) (the variance of the estimator) is smallest possible
Ordinary least squares estimator
The ordinary least squares (OLS) estimator has these properties
- In general, find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals
. . .
- For bivariate case: find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize:
\[\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N(y_i-\beta_0-\beta_1x_{1i})^2 \right] =\min_{\beta_0,\beta_1} \left[\sum_{i=1}^N\hat{u}_i^2 \right]\]
. . .
This should be intuitive: we minimize the (squared) distances between the actual values and the model’s “best guesses”
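One way to see the least-squares idea directly (a sketch, not how lm() works internally) is to minimize the sum of squared residuals numerically with optim() and compare the result to the estimates reported later:
# sum of squared residuals as a function of (b0, b1)
ssr <- function(b, y, x) sum((y - b[1] - b[2]*x)^2)

# numerical minimization for the trees data, starting from (0, 0)
opt <- optim(par = c(0, 0), fn = ssr, y = trees$Volume, x = trees$Girth)
opt$par   # approximately -36.94 and 5.07, matching OLS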
Minimizing
To find the minimum with respect to some variable
- Take the first partial derivative
- Set it to zero
- Solve for parameter of interest
- Verify that 2nd partial derivative is positive
OLS estimators for bivariate case
OLS estimator for \(\beta_1\):
\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]
OLS estimator for \(\beta_0\):
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1\]
- To see where \(\hat{\beta}_0\) comes from, take the expected value of both sides of \(y = \beta_0 + \beta_1 x_1 + u\) (using \(\text{E}(u) = 0\)) and solve for \(\beta_0\); replacing the population means with sample means gives the estimator above
Example, by hand and with lm()
Regression of Volume on Girth in the trees data

# estimate b1
b1 <- cov(trees$Volume, trees$Girth) / var(trees$Girth)

# estimate b0
b0 <- mean(trees$Volume) - b1*mean(trees$Girth)

# print estimates
c(b0, b1)
[1] -36.943459 5.065856
. . .
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)
# print estimates
coef(m1)
(Intercept) Girth
-36.943459 5.065856
OLS estimator for multiple regression
\[\hat{\boldsymbol{\beta}}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\boldsymbol{y}\]
. . .
- Note the similarity to the bivariate case:
\[\hat{\beta}_1 = \frac{cov(y,x_1)}{var(x_1)}\]
Example, by hand
# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y
# print beta
round(beta, 3)
[,1]
Intercept -57.988
Girth 4.708
Height 0.339
. . .
summary(lm(Volume ~ Girth + Height, data=trees))
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
What if we forget intercept?
# define X matrix (no column of 1s, so no intercept)
X <- as.matrix(trees[, c("Girth", "Height")])

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y
# print beta
round(beta, 3)
[,1]
Girth 5.044
Height -0.477
What if we forget the intercept?
Taking out the intercept forces the fitted line (or plane) to run through the origin (0, 0)
my.lm <- lm(trees$Volume ~ trees$Girth)
my.lm.ni <- lm(trees$Volume ~ 0 + trees$Girth)

trees %>%
  ggplot(aes(x = Girth, y = Volume))+
geom_point(col = "blue")+
geom_abline(slope = coef(my.lm)[2], intercept = coef(my.lm)[1], col = "blue")+
geom_abline(slope = coef(my.lm.ni)[1], intercept = 0, col = "red")+
xlim(c(0,22))+
ylim(c(-60, 80))+
annotate("text", x = 10, y = 40,
label = expression(paste("Volume = ", beta[0]," + ", beta[1], "Girth")),
col = "blue")+
annotate("text", x = 5, y = 20,
label = expression(paste("Volume = ", beta[1], "Girth")),
col = "red")+
geom_vline(xintercept = 0, lty = "dashed")+
geom_hline(yintercept = 0, lty = "dashed")+
theme_630()
Predicted/expected values
The predicted or expected value of \(y_i\) is simply the systematic portion of the RHS: \(\beta_0 + \beta_1x_{1i}\)
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)
# print estimates
coef(m1)
(Intercept) Girth
-36.943459 5.065856
- “A tree with a 15 inch diameter is predicted to have a volume of \(-36.94 + (5.07)(15) = 39.11\) cubic inches.”
With multiple IVs
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth + Height, data = trees)
# predicted values are dot product of coefficients with x-values
coef(m1) %*% c(1, 15, 60) # 15 inch diameter and 60 ft height
[,1]
[1,] 32.98982
. . .
coef(m1)[1]*1 + coef(m1)[2]*15 + coef(m1)[3]*60
(Intercept)
32.98982
. . .
- “A tree with a 15 inch diameter and 60ft height is predicted to have a volume of 33 cubic inches.”
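The same prediction can be obtained with predict(), which handles the dot product for you (note that newdata must use the original variable names):
predict(m1, newdata = data.frame(Girth = 15, Height = 60))   # about 33 cubic inches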
Interpreting coefficients
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]
We are interested in how \(\text{E}(y_i)\) changes as a function of \(x_{1i}\)
- When you think “rate of change” you should think “derivative”
- With more than one variable, think partial derivative
. . .
\[ \begin{align} \text{E}(y_i) &= \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} \\ \frac{\partial{\text{E}(y_i)}}{\partial{x_{1i}}} &= \beta_1 \end{align} \]
Tells us how the function is changing at a given point in response to infinitesimal changes in one of the function’s variables
Slope of tangent line, linear
par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
segments(7, 2*6 + 2, 5, 2*6 - 2, lty=1, col="blue", lwd=2)
points(6, 2*6, col="blue")
title(main="First derivative of 2x at x=2 and x=6")
text(2, 12, "Slope = 2", col="red")
text(6, 6, "Slope = 2", col="blue")
Slope of tangent line, non-linear
par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First derivative of log(x) at x=2 and x=6")
text(2, -0.5, "Slope = 1/2", col="red")
segments(8, log(6) + 2/6, 4, log(6) - 2/6, lty=1, col="blue")
points(6, log(6), col="blue")
text(6, 0.5, "Slope = 1/6", col="blue")
Interpreting coefficients
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\]
With discrete IVs, we cannot take the 1st derivative
. . .
- Instead, we think about first differences for a change in \(x_{1i}\) from some value (\(k_1\)) to another (\(k_2\))
\[ \begin{align} &\text{E}(y_i[x_{1i} = k_2]) - \text{E}(y_i[x_{1i} = k_1]) \\ &= (\beta_0 + \beta_1k_2 + \beta_2x_{2i}) - (\beta_0 + \beta_1k_1 + \beta_2x_{2i}) \\ &= \beta_1(k_2 - k_1) \end{align} \]
- If \(k_2 - k_1 = 1\), then this just equals \(\beta_1\) (see the sketch below)
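A first difference can be computed directly from model predictions; a sketch with the trees model, moving Girth from 10 to 11 inches while holding Height at its mean:
m <- lm(Volume ~ Girth + Height, data = trees)
nd <- data.frame(Girth = c(10, 11), Height = mean(trees$Height))
diff(predict(m, newdata = nd))   # equals the Girth coefficient, about 4.71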
First difference, linear
par(mar=c(5,4,2,1))
curve(2*x, 0, 10)
segments(3, 2*2 + 2, 1, 2*2 - 2, lty=1, col="red", lwd=2)
points(2, 2*2, col="red")
curve(2*x, 2, 3, col="blue", add=T, lwd=2)
title(main="First diff of 2x at x=2")
text(2, 12, "Slope = 2", col="red")
text(2, 10, "First diff = 2*3 - 2*2 = 2", col="blue")
First difference, non-linear
par(mar=c(5,4,2,1))
curve(log(x), 0, 10)
segments(4, log(2) + 2/2, 0, log(2) - 2/2, lty=1, col="red")
points(2, log(2), col="red")
title(main="First difference of log(x) at x=2")
curve(log(x), 2, 3, col="blue", add=T, lwd=2)
text(3, -0.5, paste0("First diff = log(3) - log(2) = ", round(log(3)-log(2),3)), col="blue")
text(1.5, 1.5, "Slope of tangent = 1/2", col="red")
Example, CEO salaries
# ceo data (salary in $1,000s, profits in $1Ms, age in years)
ceo <- read.csv("data/ceosalary.csv")

# estimate model using OLS and store in object
m_ceo <- lm(salary ~ age + profits, data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = salary ~ age + profits, data = ceo)
Residuals:
Min 1Q Median 3Q Max
-892.9 -331.6 -100.2 253.9 4445.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 469.3863 276.7318 1.696 0.0916 .
age 4.9620 4.8794 1.017 0.3106
profits 0.5604 0.1016 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
The importance of scale
The substantive meaning of a given coefficient depends on how \(y_i\) and \(x_{ki}\) are scaled
# estimate model using OLS and store in object
m_ceo <- lm(salary/1000 ~ age + profits, data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = salary/1000 ~ age + profits, data = ceo)
Residuals:
Min 1Q Median 3Q Max
-0.8929 -0.3316 -0.1002 0.2539 4.4454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4693863 0.2767318 1.696 0.0916 .
age 0.0049620 0.0048794 1.017 0.3106
profits 0.0005604 0.0001016 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5416 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Standardized coefficients
Common to scale to mean of 0 and standard deviation of 1
Two kinds:
- Only standardize (one or more) IVs
- Also standardize the DV
. . .
The scale() function in R will Z-score by default (i.e., subtract the mean, divide by the SD; see the sketch below)
- Unit of change becomes SD
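A quick sketch of what scale() is doing, using the profits variable from the ceo data:
# Z-scoring by hand matches scale()
z_manual <- (ceo$profits - mean(ceo$profits)) / sd(ceo$profits)
all.equal(z_manual, as.numeric(scale(ceo$profits)))   # TRUE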
Standardized CEO
# estimate model using OLS and store in object
m_ceo <- lm(scale(salary) ~ scale(age) + scale(profits), data = ceo)
# summary
summary(m_ceo)
Call:
lm(formula = scale(salary) ~ scale(age) + scale(profits), data = ceo)
Residuals:
Min 1Q Median 3Q Max
-1.5196 -0.5643 -0.1705 0.4322 7.5654
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.063e-17 6.928e-02 0.000 1.000
scale(age) 7.112e-02 6.994e-02 1.017 0.311
scale(profits) 3.858e-01 6.994e-02 5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9217 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Care in interpretation with standardized vars
0-1 coding
PoliSci often uses 0-1 (or min-max) coding
\[\frac{x_i - \min(x)}{\max(x) - \min(x)}\]
. . .
Either the theoretical or the sample minima/maxima can be used.
If the lowest observed value is higher than the lowest theoretically possible value, you may have a choice to make.
This raises a similar issue to “care in interpretation” with standardized variables.
0-1 example
# recode Girth and Volume
trees$Volume_01 <- (trees$Volume - min(trees$Volume)) / (max(trees$Volume) - min(trees$Volume))
trees$Girth_01 <- (trees$Girth - min(trees$Girth)) / (max(trees$Girth) - min(trees$Girth))

# estimate model
m_tree <- lm(Volume_01 ~ Girth_01, data = trees)
summary(m_tree)
Call:
lm(formula = Volume_01 ~ Girth_01, data = trees)
Residuals:
Min 1Q Median 3Q Max
-0.120739 -0.046507 0.002275 0.052317 0.143515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07630 0.02160 -3.533 0.0014 **
Girth_01 0.93278 0.04555 20.478 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06365 on 29 degrees of freedom
Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
Part 3. Model Fit
Model fit
We often evaluate our models by their ability to account for variation in our outcome – i.e. how well they “fit” the data.
- There are many different measures of model fit.
- We will discuss the most common today; later this semester, we will turn to alternatives.
- Model fit is not everything! Depends on your goals.
Sums of squares
People sometimes talk about the following sums of squares (computed in the sketch below):
Total: \(\sum_N{(y_i - \bar{y})^2}\)
Model / Explained: \(\sum_N{(\hat{y}_i - \bar{y})^2}\)
Error / Residual: \(\sum_N{(y_i - \hat{y}_i)^2}\)
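These are easy to compute from a fitted model; a sketch with the trees model, which also shows that the total sum of squares equals the model plus residual sums of squares (a decomposition that holds for OLS with an intercept):
m <- lm(Volume ~ Girth + Height, data = trees)
y <- trees$Volume

ss_total <- sum((y - mean(y))^2)          # total
ss_model <- sum((fitted(m) - mean(y))^2)  # model / explained
ss_resid <- sum(resid(m)^2)               # error / residual

c(total = ss_total, model = ss_model, resid = ss_resid,
  model_plus_resid = ss_model + ss_resid)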
Errors vs residuals
The errors, \(u_i\), are population quantities, and unobservable
. . .
- Residuals are sample estimates, \(\hat{u}_i\), and have an expected value of 0 by construction
. . .
- An unbiased estimator of the error variance, \(\hat{\sigma}^2\), is:
\[\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]
Mean squared error
More generally, we can define the mean squared error for any model as:
\[\text{MSE} = \text{E} \left((\hat{\theta} - \theta)^2 \right)\]
- Expected value of quadratic loss from applying model
Interpretable only with reference to scale of \(\theta\)
- useful in relative terms (for comparing fit between two versions of same model)
Incorporates both bias and variance
- For unbiased estimators such as OLS, \(\text{MSE} \equiv \text{Error Variance}\)
MSE in linear regression
In linear regression, mean squared error typically refers to the following:
\[\text{MSE} = \frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}\]
i.e. we divide by degrees of freedom, not sample size
- More complex model → smaller denominator → higher MSE, all else equal
\(R^2\)
A common measure of model fit is \(R^2\), which is 1 minus the ratio of residual to total variation in \(y\):
\[1 - \frac{\text{SS}_R}{\text{SS}_T} = 1 - \frac{\sum_N{(y_i - \hat{y}_i)^2}}{\sum_N{(y_i - \bar{y})^2}}\]
This can be interpreted as the proportionate reduction in error of prediction (relative to always guessing \(\bar{\boldsymbol{y}}\))
It is also the squared correlation (shared variance) between \(\boldsymbol{y}\) and \(\hat{\boldsymbol{y}}\)
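Both interpretations are easy to verify with the trees model:
m <- lm(Volume ~ Girth + Height, data = trees)
1 - sum(resid(m)^2) / sum((trees$Volume - mean(trees$Volume))^2)  # definition
cor(trees$Volume, fitted(m))^2                                    # squared correlation
summary(m)$r.squared                                              # as reported by lm()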
\(R^2\)
Given Model 1 with K covariates and Model 2 with K + 1 covariates, \(R^2_2 \geq R^2_1\).
. . .
Why is this true?
. . .
Why is this a problem?
Adjusted \(R^2\)
A better version of \(R^2\) uses the MSE in the numerator, thus penalizing model complexity
\[R^2_{adj} = 1 - \frac{\text{MSE}}{\text{Var}(\boldsymbol{y})} = 1 - \frac{\frac{\sum_N{(y_i - \hat{y}_i)^2}}{N - K - 1}}{\frac{\sum_N{(y_i - \bar{\boldsymbol{y}})^2}}{N - 1}}\]
. . .
If you add a useless variable, adjusted \(R^2\) will typically decrease (while unadjusted \(R^2\) can never decrease).
Example
m2 <- lm(Volume ~ Girth + Height, data = trees)
s2 <- summary(m2)
c(round(s2$r.squared, 3), round(s2$adj.r.squared, 5))
[1] 0.94800 0.94423
. . .
set.seed(3587)
m3 <- lm(Volume ~ Girth + Height + rnorm(length(trees$Volume)), data = trees)
s3 <- summary(m3)
c(round(s3$r.squared, 3), round(s3$adj.r.squared, 5))
[1] 0.95000 0.94398
By hand
# calculate r2
r2 <- 1 - var(m2$residuals) / var(trees$Volume)

# calculate mean squared error
mse <- sum(m2$residuals^2) / m2$df.residual

# adjusted r2
r2_adj <- 1 - mse / var(trees$Volume)
# print
c(round(sqrt(mse),3),round(r2,3),round(r2_adj, 3))
[1] 3.882 0.948 0.944
. . .
summary(m2)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16