Regression has a variety of uses in the social sciences
Characterizing associations between variables
Prediction of future observations
Causal effect estimation
Regression can in theory do all of the above if the right assumptions are met.
Notation
generic units of analysis are indicated by \(i=1,...,N\)
scalars are lower-case italics (e.g., \(x_{1i}\))
(column) vectors are lower-case bold italics (e.g., \(\boldsymbol{x}_i\), transpose: \(\boldsymbol{x}_i'\))
matrices are upper-case bold (e.g., \(\mathbf{X}\))
random variables are generally denoted by Roman (Latin) letters (e.g., \(x\))
model parameters are generally Greek (e.g., \(\beta_1\), \(\boldsymbol{\beta}\))
generic model variables (and associated parameters) are indicated by \(j=1,...,K\)
error terms are indicated by \(u_i\) or \(\boldsymbol{u}\), but sometimes \(e_i\) or even \(\epsilon_i\)
error variance is indicated by \(\sigma^2\) or \(\sigma_u^2\) if necessary
What is regression?
Regression describes the relationship between one or more independent variables, \(\boldsymbol{x}_1,\boldsymbol{x}_{2},...,\boldsymbol{x}_{K}\), and a dependent variable, \(\boldsymbol{y}\).
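For reference, the model and the ordinary least squares (OLS) estimator can be written in matrix form using the notation above. This is the standard textbook statement, supplied here on the assumption that it is the formula the matrix code on this slide implements:

\[\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{u}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{y}\]

Here \(\mathbf{X}\) is the \(N \times (K+1)\) matrix whose first column is all 1s (for the intercept) and whose remaining columns hold the independent variables.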
# define X matrix (with 1st column of 1s for intercept)
X <- as.matrix(cbind(rep(1, nrow(trees)), trees[, c("Girth", "Height")]))
colnames(X)[1] <- "Intercept"

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# print beta
round(beta, 3)
[,1]
Intercept -57.988
Girth 4.708
Height 0.339
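As a cross-check, the matrix result can be compared with R's built-in lm(), which fits the same model. This is a minimal sketch using the built-in trees data; the object names m_check and beta_hat are illustrative and do not come from the slides.

```r
# Fit the same model with lm(); the formula interface adds the intercept automatically
m_check <- lm(Volume ~ Girth + Height, data = trees)

# Recompute the matrix solution for comparison
X <- cbind(Intercept = 1, as.matrix(trees[, c("Girth", "Height")]))
y <- trees$Volume
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Compare the two sets of estimates (names dropped so only the numbers are compared)
all.equal(unname(coef(m_check)), as.numeric(beta_hat), tolerance = 1e-6)

# Full regression table
summary(m_check)
```

In practice lm() is preferable to inverting \(\mathbf{X}'\mathbf{X}\) directly, because it uses a numerically more stable decomposition.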
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
What if we forget the intercept?
# define X matrix (no column of 1s, so no intercept)
X <- as.matrix(trees[, c("Girth", "Height")])

# define Y vector
y <- trees$Volume

# calculate beta
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# print beta
round(beta, 3)
[,1]
Girth 5.044
Height -0.477
What if we forget the intercept?
Taking out the intercept forces the regression plane to pass through the origin: the predicted Volume is 0 when Girth and Height are both 0
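The same no-intercept fit can be reproduced with lm() by removing the intercept in the formula. This is a sketch on the built-in trees data; m_noint is an illustrative name.

```r
# "0 +" (equivalently "- 1") in the formula drops the intercept
m_noint <- lm(Volume ~ 0 + Girth + Height, data = trees)
round(coef(m_noint), 3)

# With both predictors at zero, the fitted value is forced to be zero
origin_pred <- predict(m_noint, newdata = data.frame(Girth = 0, Height = 0))
origin_pred
```

Note that the Height coefficient changes sign relative to the model with an intercept, which is one reason to omit the intercept only when theory clearly calls for it.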
Predicted/expected values
The predicted or expected value of \(y_i\) is simply the systematic portion of the right-hand side (RHS): \(\beta_0 + \beta_1x_{1i}\)
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth, data = trees)

# print estimates
coef(m1)
(Intercept) Girth
-36.943459 5.065856
“A tree with a 15 inch diameter is predicted to have a volume of \(-36.94 + (5.07)(15) = 39.11\) cubic feet.”
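The same prediction can be obtained with predict(), which avoids rounding the coefficients by hand. This is a minimal sketch; new_tree, pred, and by_hand are illustrative names.

```r
# Fit the one-predictor model
m1 <- lm(Volume ~ Girth, data = trees)

# Data frame holding the x-value(s) to predict at
new_tree <- data.frame(Girth = 15)

# Predicted volume for a tree with a 15 inch diameter
pred <- predict(m1, newdata = new_tree)
pred

# The same quantity computed directly from the unrounded coefficients
by_hand <- unname(coef(m1)[1] + coef(m1)[2] * 15)
all.equal(unname(pred), by_hand)
```

Because predict() works with the unrounded coefficients, its result (about 39.04) differs slightly from arithmetic done with coefficients rounded to two decimals.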
With multiple IVs
# estimate model using OLS and store in object 'm1'
m1 <- lm(Volume ~ Girth + Height, data = trees)

# predicted values are dot product of coefficients with x-values
coef(m1) %*% c(1, 15, 60) # 15 inch diameter and 60 ft height
[,1]
[1,] 32.98982
coef(m1)[1] * 1 + coef(m1)[2] * 15 + coef(m1)[3] * 60
(Intercept)
32.98982
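“A tree with a 15 inch diameter and 60 ft height is predicted to have a volume of 33 cubic feet.”
In a linear model, the first difference is the change in the expected value of \(y_i\) when \(x_{1i}\) moves from \(k_1\) to \(k_2\), holding the other variables fixed: \(\beta_1(k_2 - k_1)\). If \(k_2 - k_1 = 1\), then this just equals \(\beta_1\)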
First difference, linear
First difference, non-linear
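The contrast between the linear and non-linear cases can be sketched on the trees data. The helper first_diff(), the quadratic specification, and the holding value of 76 ft for Height are illustrative assumptions, not taken from the slides.

```r
# Hypothetical helper: change in predicted Volume when Girth moves from k1 to k2,
# holding Height fixed at an illustrative value
first_diff <- function(model, k1, k2, height = 76) {
  p <- predict(model, newdata = data.frame(Girth = c(k1, k2), Height = height))
  unname(p[2] - p[1])
}

# Linear in Girth: the first difference is the same wherever we start
m_lin <- lm(Volume ~ Girth + Height, data = trees)
fd_lin_a <- first_diff(m_lin, 10, 11)
fd_lin_b <- first_diff(m_lin, 15, 16)
c(fd_lin_a, fd_lin_b, unname(coef(m_lin)["Girth"]))

# Non-linear in Girth (quadratic term): the first difference depends on the starting value
m_quad <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)
fd_quad_a <- first_diff(m_quad, 10, 11)
fd_quad_b <- first_diff(m_quad, 15, 16)
c(fd_quad_a, fd_quad_b)
```

In the linear model a one-unit first difference always equals the Girth coefficient; in the quadratic model it has to be reported for specific values of \(k_1\) and \(k_2\).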
Example, CEO salaries
# ceo data (salary in $1,000s, profits in $1Ms, age in years)
ceo <- read.csv("data/ceosalary.csv")

# estimate model using OLS and store in object
m_ceo <- lm(salary ~ age + profits, data = ceo)

# summary
summary(m_ceo)
Call:
lm(formula = salary ~ age + profits, data = ceo)
Residuals:
   Min     1Q Median     3Q    Max
-892.9 -331.6 -100.2  253.9 4445.4

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 469.3863   276.7318   1.696   0.0916 .
age           4.9620     4.8794   1.017   0.3106
profits       0.5604     0.1016   5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
The importance of scale
The substantive meaning of a given coefficient depends on how \(y_i\) and \(x_{ji}\) are scaled
# estimate model using OLS and store in object
# (salary / 1000 rescales the DV from $1,000s to $1Ms)
m_ceo <- lm(salary / 1000 ~ age + profits, data = ceo)

# summary
summary(m_ceo)
Call:
lm(formula = salary/1000 ~ age + profits, data = ceo)
Residuals:
    Min      1Q  Median      3Q     Max
-0.8929 -0.3316 -0.1002  0.2539  4.4454

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4693863  0.2767318   1.696   0.0916 .
age         0.0049620  0.0048794   1.017   0.3106
profits     0.0005604  0.0001016   5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5416 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
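The same point can be checked without the CEO file by rescaling an independent variable in the built-in trees data. This is a sketch; m_in and m_ft are illustrative names, and converting Girth from inches to feet is an assumed example rather than one from the slides.

```r
# Girth measured in inches (as stored in the data)
m_in <- lm(Volume ~ Girth + Height, data = trees)

# Girth measured in feet (1 foot = 12 inches)
m_ft <- lm(Volume ~ I(Girth / 12) + Height, data = trees)

# The coefficient is multiplied by 12: a one-foot change is twelve one-inch changes
slope_in <- unname(coef(m_in)["Girth"])
slope_ft <- unname(coef(m_ft)[2])
c(slope_in * 12, slope_ft)

# The t statistic (and therefore the p-value) is unaffected by the rescaling
t_in <- summary(m_in)$coefficients[2, "t value"]
t_ft <- summary(m_ft)$coefficients[2, "t value"]
c(t_in, t_ft)
```

Rescaling changes the unit in which the coefficient is expressed, not the fit of the model or the strength of the evidence.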
Standardized coefficients
Common to scale to mean of 0 and standard deviation of 1
Two kinds:
Only standardize (one or more) IVs
Also standardize the DV
scale() function in R will Z-score by default (i.e. subtract mean, divide by SD)
Unit of change becomes SD
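Both kinds of standardization can be sketched on the built-in trees data, showing how the standardized coefficients relate to the raw ones. The object names are illustrative.

```r
# Raw (unstandardized) model
m_raw <- lm(Volume ~ Girth + Height, data = trees)
sd_x <- sapply(trees[, c("Girth", "Height")], sd)
sd_y <- sd(trees$Volume)

# Kind 1: standardize only the IVs; each coefficient is the raw slope times SD(x)
m_std_x <- lm(Volume ~ scale(Girth) + scale(Height), data = trees)
all.equal(unname(coef(m_std_x)[-1]), unname(coef(m_raw)[-1] * sd_x))

# Kind 2: also standardize the DV; each coefficient is the raw slope times SD(x) / SD(y)
m_std_xy <- lm(scale(Volume) ~ scale(Girth) + scale(Height), data = trees)
all.equal(unname(coef(m_std_xy)[-1]), unname(coef(m_raw)[-1] * sd_x / sd_y))
```

In the first case a coefficient reads as "a one-SD change in x is associated with this many units of y"; in the second, as "this many SDs of y".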
Standardized CEO
# estimate model using OLS and store in object
m_ceo <- lm(scale(salary) ~ scale(age) + scale(profits), data = ceo)

# summary
summary(m_ceo)
Call:
lm(formula = scale(salary) ~ scale(age) + scale(profits), data = ceo)
Residuals:
    Min      1Q  Median      3Q     Max
-1.5196 -0.5643 -0.1705  0.4322  7.5654

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -4.063e-17  6.928e-02   0.000    1.000
scale(age)      7.112e-02  6.994e-02   1.017    0.311
scale(profits)  3.858e-01  6.994e-02   5.516 1.24e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9217 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Care in interpretation with standardized vars
0-1 coding
PoliSci often uses 0-1 (or min-max) coding
\[\frac{x_i - \min(x)}{\max(x) - \min(x)}\]
Either theoretical or sample minima/maxima.
If lowest observed value is higher than lowest theoretically possible value, you may have a choice to make
Similar issue to “care in interpretation” with standardized vars
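The choice between sample and theoretical endpoints can be made explicit in a small helper. This is a sketch: rescale01() and the 1-7 response scale are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: min-max (0-1) coding with optional theoretical endpoints.
# By default the sample minimum and maximum are used.
rescale01 <- function(x, lo = min(x, na.rm = TRUE), hi = max(x, na.rm = TRUE)) {
  (x - lo) / (hi - lo)
}

# Hypothetical responses on a 1-7 scale where nobody chose 1 or 7
x <- c(2, 3, 5, 6)

# Sample endpoints: the lowest observed value becomes 0, the highest becomes 1
coded_sample <- rescale01(x)
coded_sample

# Theoretical endpoints: 0 and 1 correspond to the ends of the scale itself
coded_theory <- rescale01(x, lo = 1, hi = 7)
coded_theory
```

With sample endpoints the coefficient describes a move across the observed range; with theoretical endpoints it describes a move across the full scale, part of which may never be observed in the data.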