POLSCI 630: Probability and Basic Regression
February 18, 2025
We can model non-linear relationships using linear regression
But! Interpretation gets trickier in these cases.
Useful to remember:
A constant (linear) effect in log of \(x\) is non-linear in \(x\) itself
But a constant (linear) effect on \(\text{ln}(y_i)\) implies a non-constant (multiplicative) effect on \(y_i\) itself
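A quick base-R illustration of the first bullet (a sketch, not part of the original example): equal multiplicative steps in \(x\) are equal additive steps in \(\text{ln}(x)\).

```r
# Equal multiplicative steps in x become equal additive steps in log(x):
x <- c(1, 10, 100, 1000)  # each value is 10x the previous
log(x)                    # 0, 2.302585, 4.60517, 6.907755
diff(log(x))              # constant steps of log(10) ~ 2.302585
```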
As an example, let’s explore the salaries of professors in the US using carData::Salaries
rank discipline yrs.since.phd yrs.service sex
AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
Prof :266 Median :21.00 Median :16.00
Mean :22.31 Mean :17.61
3rd Qu.:32.00 3rd Qu.:27.00
Max. :56.00 Max. :60.00
salary
Min. : 57800
1st Qu.: 91000
Median :107300
Mean :113706
3rd Qu.:134185
Max. :231545
It is reasonable to think that the effect of a variable (e.g., years of service) on salary is multiplicative: each additional year raises salary by a roughly constant percentage of its current level, not by a constant dollar amount
Let’s say we have the following model: \(\text{ln}(y_i) = \boldsymbol{x}_i' \boldsymbol{\beta} + u_i\)
\(y_i = e^{\boldsymbol{x}_i' \boldsymbol{\beta}} e^{u_i}, \quad \text{E}(y_i) = \text{E}(e^{\boldsymbol{x}_i' \boldsymbol{\beta}} e^{u_i})\)
\(= e^{\boldsymbol{x}_i' \boldsymbol{\beta}} \text{E}(e^{u_i})\)
If \(u_i \sim N(0, \sigma^2), \quad \text{E}(e^{u_i}) = e^{\frac{\sigma^2}{2}}\)
\(\text{E}(y_i) = e^{\boldsymbol{x}_i' \boldsymbol{\beta}} e^{\frac{\sigma^2}{2}}\)
Mean-zero error doesn’t drop out!
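We can verify \(\text{E}(e^{u_i}) = e^{\frac{\sigma^2}{2}}\) by simulation (an illustrative sketch; the seed and \(\sigma\) are arbitrary):

```r
# Simulate mean-zero normal errors and check E(exp(u)) = exp(sigma^2 / 2)
set.seed(630)
sigma <- 0.5
u <- rnorm(1e6, mean = 0, sd = sigma)
mean(exp(u))      # simulated value: ~1.133, not 1!
exp(sigma^2 / 2)  # theoretical value: 1.1331...
```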
m1 <- lm(log(salary) ~ yrs.since.phd + sex + discipline, data = carData::Salaries)

Call:
lm(formula = log(salary) ~ yrs.since.phd + sex + discipline,
data = carData::Salaries)
Residuals:
Min 1Q Median 3Q Max
-0.84527 -0.15697 -0.00855 0.15550 0.62670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.125e+01 4.189e-02 268.671 < 2e-16 ***
yrs.since.phd 9.584e-03 9.075e-04 10.560 < 2e-16 ***
sexMale 6.639e-02 3.830e-02 1.734 0.0838 .
disciplineB 1.447e-01 2.319e-02 6.239 1.14e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2244 on 393 degrees of freedom
Multiple R-squared: 0.2615, Adjusted R-squared: 0.2559
F-statistic: 46.39 on 3 and 393 DF, p-value: < 2.2e-16
Let’s calculate the expected value of \(y\) for a male professor, 10 years from PhD, from a “theoretical” department (discipline = A)
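One way to do this by hand, using the rounded coefficients printed above and assuming normal errors so that \(\text{E}(e^{u_i}) = e^{\frac{\sigma^2}{2}}\) (a sketch, not the only approach):

```r
# Rounded estimates from the m1 summary above
b0 <- 11.25; b_yrs <- 0.009584; b_male <- 0.06639; b_discB <- 0.1447
sigma <- 0.2244  # residual standard error

# Male, 10 years from PhD, discipline A (so disciplineB = 0)
xb <- b0 + b_yrs * 10 + b_male * 1 + b_discB * 0
Ey <- exp(xb) * exp(sigma^2 / 2)  # retransform with the normal correction
Ey  # roughly $92,700
```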
What if we do not assume normality in the original errors?
Let’s use a simple example:
\(\text{ln}(y_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\)
\(\text{E}(y_i) = e^{\beta_0}e^{\beta_1x_{1i}}e^{\beta_2x_{2i}}\text{E}(e^{u_i})\)
What is the expected change in \(y_i\) for a \(\Delta\) change in \(x_{1i}\)? That is, what happens to \(\text{E}(y_i)\) when \(x_{1i}\) becomes \(x_{1i} + \Delta x_{1i}\)?
Let’s take the ratio:
\[\frac{e^{\beta_0}e^{\beta_1(x_{1i} + \Delta x_{1i})}e^{\beta_2x_{2i}}\text{E}(e^{u_i})} {e^{\beta_0}e^{\beta_1x_{1i}}e^{\beta_2x_{2i}}\text{E}(e^{u_i})} = \frac{e^{\beta_1(x_{1i} + \Delta x_{1i})}}{e^{\beta_1x_{1i}}} = e^{\beta_1(\Delta x_{1i})}\]
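So a \(\Delta\) change in \(x_{1i}\) multiplies \(\text{E}(y_i)\) by \(e^{\beta_1 \Delta x_{1i}}\). Plugging in the yrs.since.phd slope printed earlier (a sketch using the rounded estimate):

```r
b1 <- 0.009584  # rounded slope on yrs.since.phd from m1
exp(b1 * 1)     # one extra year: factor of ~1.0096 (about +1%)
exp(b1 * 10)    # ten extra years: factor of ~1.10 (about +10%)
```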
Useful to remember:
\(\text{E}(y_i) = e^{\boldsymbol{x}_i' \boldsymbol{\beta}} \text{E}(e^{u_i})\)
Notice that the marginal effect of \(x_1\) is not constant anymore:
\(\frac{\partial \text{E}(y_i)}{\partial x_{1i}} = \beta_1 e^{\boldsymbol{x}_i' \boldsymbol{\beta}} \text{E}(e^{u_i}) = \beta_1\text{E}(y_i)\)
Since the marginal effect is conditional on the levels of the \(\boldsymbol{x}_i\), reporting it with a single number requires a choice. Two common strategies:
Fix all IVs to central tendencies
Calculate the average marginal effect across observations
# B1
B1 <- coef(m1)["yrs.since.phd"]
# exp(X_B)
eXB <- exp(predict(m1,
newdata = data.frame(yrs.since.phd = mean(carData::Salaries$yrs.since.phd),
sex = "Male",
discipline = "B"
)
)
)
# calculate estimate of E(exp(u)) (Duan's smearing estimator)
E_eu <- sum(exp(m1$residuals)) / nrow(m1$model)
# calculate ME at these central values
B1 * eXB * E_eu
## ME_i = B1 * exp(X_iB) * sum(exp(res_i)) / N
# B1
B1 <- coef(m1)["yrs.since.phd"]
# exp(X_B) [generate predicted values for all observations]
eXB <- exp(predict(m1))
# calculate estimate of E(exp(u)) (Duan's smearing estimator)
E_eu <- sum(exp(m1$residuals)) / nrow(m1$model)
# calculate ME for each observation
mes_yrs <- B1 * eXB * E_eu
# calculate mean of MEs (the average marginal effect)
mean(mes_yrs)
Let’s say you want to see how the marginal effect of years since PhD changes as a function of itself (e.g., from 1 year to 20)
# B1
B1 <- coef(m1)["yrs.since.phd"]
# expand.grid: all combinations of the focal and moderating values
values <- expand.grid(yrs.since.phd = 1:20,
                      sex = c("Female", "Male"),
                      discipline = c("A", "B"))
# exp(X_B) [predicted values across the grid]
eXB <- exp(predict(m1, newdata = values))
# calculate estimate of E(exp(u)) (Duan's smearing estimator)
E_eu <- sum(exp(m1$residuals)) / nrow(m1$model)
# calculate ME for each observation
mes_yrs <- B1 * eXB * E_eu
First difference gives difference in expected values of \(y_i\) for a discrete change in \(x_{1i}\)
Different options here
Move focal variable from a particular value to another particular value (e.g., 10 to 11)
Move focal variable by a fixed amount, allowing obs to keep own values of all vars
# exp(X_B) low yrs
eXB_low <- exp(predict(m1,
newdata = data.frame(yrs.since.phd = 10,
sex = "Male",
discipline = "B"
)
)
)
# exp(X_B) high yrs
eXB_high <- exp(predict(m1,
newdata = data.frame(yrs.since.phd = 11,
sex = "Male",
discipline = "B"
)
)
)
# calculate estimate of E(exp(u)) (Duan's smearing estimator)
E_eu <- sum(exp(m1$residuals)) / nrow(m1$model)
# calculate FD
(eXB_high - eXB_low) * E_eu
# exp(X_B) low (original values)
eXB_low <- exp(predict(m1))
# exp(X_B) high (+1 year from original values)
newdata <- m1$model
newdata$yrs.since.phd <- newdata$yrs.since.phd + 1
eXB_high <- exp(predict(m1, newdata = newdata))
# calculate estimate of E(exp(u)) (Duan's smearing estimator)
E_eu <- sum(exp(m1$residuals)) / nrow(m1$model)
# calculate average FD
mean((eXB_high - eXB_low) * E_eu)
\[y_i = \beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2 \text{ln}(x_{2i}) +u_i\]
\(\frac{\Delta y_i}{\Delta x_{1i}} = \beta_1 \text{ln}(x_{1i}+\Delta x_{1i}) - \beta_1 \text{ln}(x_{1i}) = \beta_1 \text{ln}(\frac{x_{1i} + \Delta x_{1i}}{x_{1i}})\)
Call:
lm(formula = salary ~ log(yrs.since.phd) + sex + discipline,
data = carData::Salaries)
Residuals:
Min 1Q Median 3Q Max
-73925 -16473 -2228 14208 93982
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39475 6171 6.397 4.50e-10 ***
log(yrs.since.phd) 20522 1629 12.598 < 2e-16 ***
sexMale 7476 4264 1.753 0.0804 .
disciplineB 15961 2581 6.183 1.58e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25100 on 393 degrees of freedom
Multiple R-squared: 0.3186, Adjusted R-squared: 0.3134
F-statistic: 61.25 on 3 and 393 DF, p-value: < 2.2e-16
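Using the printed coefficient on log(yrs.since.phd), the implied dollar change for a proportional change in experience (a sketch with the rounded estimate):

```r
b1 <- 20522     # rounded coefficient on log(yrs.since.phd)
b1 * log(1.10)  # a 10% increase in yrs.since.phd: ~ +$1,956
b1 * log(2)     # doubling yrs.since.phd: ~ +$14,225
```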
The log-log model combines the previous two, and so all of the above applies!
\[\text{ln}(y_i) = \beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2 \text{ln}(x_{2i}) + u_i\]
\(\text{E}(y_i) = \text{E}(e^{\beta_0 + \beta_1 \text{ln}(x_{1i}) + \beta_2 \text{ln}(x_{2i}) + u_i})\)
\(\frac{e^{\beta_0}e^{\beta_1 \left(\text{ln}(x_{1i} + \Delta x_{1i}) \right)} e^{\beta_2 \text{ln}(x_{2i})}\text{E}(e^{u_i})} {e^{\beta_0}e^{\beta_1 \left(\text{ln}(x_{1i}) \right)}e^{\beta_2 \text{ln}(x_{2i})}\text{E}(e^{u_i})} = e^{\beta_1 \text{ln} \left(\frac{x_{1i}+\Delta x_{1i}}{x_{1i}} \right)}\)
The factor change in \(y\) for a \(p\)% change in \(x\) is \(e^{\beta \text{ln}(\frac{100+p}{100})}\)
We can thus roughly interpret \(\beta\) as the % change in \(y\) for a 1% change in \(x\)
Call:
lm(formula = log(salary) ~ log(yrs.since.phd) + sex + discipline,
data = carData::Salaries)
Residuals:
Min 1Q Median 3Q Max
-0.76594 -0.13749 0.00164 0.14007 0.59995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.93192 0.05149 212.292 < 2e-16 ***
log(yrs.since.phd) 0.18557 0.01359 13.651 < 2e-16 ***
sexMale 0.06913 0.03558 1.943 0.0528 .
disciplineB 0.14943 0.02154 6.937 1.66e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2094 on 393 degrees of freedom
Multiple R-squared: 0.3569, Adjusted R-squared: 0.352
F-statistic: 72.69 on 3 and 393 DF, p-value: < 2.2e-16
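Checking the elasticity interpretation with the printed estimate (a sketch with the rounded value): the factor change for a 1% increase in yrs.since.phd is close to \(1 + \beta/100\).

```r
b1 <- 0.18557                      # rounded coefficient on log(yrs.since.phd)
factor_chg <- exp(b1 * log(1.01))  # exact factor change for a 1% increase
(factor_chg - 1) * 100             # ~0.185%, close to b1 itself
```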
A non-linear relationship can also be specified via a polynomial function for IVs, e.g.:
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{1i}^2 + \beta_3x_{2i} + u_i\]
The marginal effect of \(x_{1i}\) is the first partial derivative:
\[\frac{\partial y_i}{\partial x_{1i}} = \beta_1 + 2\beta_2x_{1i}\]
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{1i}^2 + \beta_3x_{1i}^3 + \beta_4x_{2i} + u_i\]
The marginal effect of \(x_{1i}\) is the first partial derivative:
\[\frac{\partial y_i}{\partial x_{1i}} = \beta_1 + 2\beta_2x_{1i} + 3\beta_3x_{1i}^2\]
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{1i}^2 + \beta_3x_{2i} + u_i\]
The first difference is:
\(\frac{\Delta y_i}{\Delta x_{1i}} = \left( \beta_1(x_{1i} + \Delta x_{1i}) + \beta_2(x_{1i} + \Delta x_{1i})^2 \right) - \left( \beta_1(x_{1i}) + \beta_2(x_{1i})^2 \right)\)
\(= \beta_1(\Delta x_{1i}) + \beta_2(\Delta x_{1i}^2 + 2x_{1i}\Delta x_{1i})\)
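A quick numeric check that the two expressions above agree (arbitrary illustrative values):

```r
b1 <- 0.5; b2 <- -0.2  # arbitrary coefficients
x1 <- 3;  d  <- 0.25   # starting value and change
fd_direct  <- (b1 * (x1 + d) + b2 * (x1 + d)^2) - (b1 * x1 + b2 * x1^2)
fd_formula <- b1 * d + b2 * (d^2 + 2 * x1 * d)
all.equal(fd_direct, fd_formula)  # TRUE
```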
# load Lucid pub opinion data
lucid <- read.csv("data/Lucid_Data.csv",
stringsAsFactors = F)
# estimate regression of left-right economic policy prefs on IVs
m4 <- lm(econ_mean ~ age_scale + I(age_scale^2) +
male + educ_scale + income_scale,
data = lucid)
summary(m4)
Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male +
educ_scale + income_scale, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.43693 -0.13101 -0.00162 0.11975 0.69088
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.322067 0.011080 29.068 < 2e-16 ***
age_scale -0.158514 0.049534 -3.200 0.00138 **
I(age_scale^2) 0.119425 0.056702 2.106 0.03524 *
male 0.033077 0.005582 5.926 3.33e-09 ***
educ_scale 0.021022 0.010864 1.935 0.05304 .
income_scale 0.120404 0.010602 11.356 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1882 on 4684 degrees of freedom
(58 observations deleted due to missingness)
Multiple R-squared: 0.05823, Adjusted R-squared: 0.05723
F-statistic: 57.93 on 5 and 4684 DF, p-value: < 2.2e-16
The marginal effect of age_scale is \(\beta_1 + 2\beta_2 \, \text{age\_scale}\)
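Evaluating \(\beta_1 + 2\beta_2 \, \text{age\_scale}\) with the printed m4 estimates (a sketch with rounded values):

```r
b1 <- -0.158514; b2 <- 0.119425  # m4 estimates for age_scale, age_scale^2
age <- c(0, 0.25, 0.5, 0.75, 1)
b1 + 2 * b2 * age  # ME rises with age and changes sign
-b1 / (2 * b2)     # turning point: ~0.664
```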
An interaction is a situation in which the relationship of each of two IVs to the DV depends on the level of the other
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{1i}x_{2i} + u_i\]
The marginal effect of \(x_{1i}\) is the first partial derivative:
\[\frac{\partial y_i}{\partial x_{1i}} = \beta_1 + \beta_3x_{2i}\]
The first difference is:
\[\frac{\Delta y_i}{\Delta x_{1i}} = \beta_1\Delta x_{1i} + \beta_3\Delta x_{1i}x_{2i}\]
# estimate regression of left-right economic policy prefs on IVs
m5 <- lm(econ_mean ~ age_scale + I(age_scale^2) + male + educ_scale +
income_scale*know_mean,
data = lucid)
summary(m5)
Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male +
educ_scale + income_scale * know_mean, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.46091 -0.13112 -0.00257 0.11944 0.69693
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3157633 0.0135866 23.241 < 2e-16 ***
age_scale -0.1561355 0.0495193 -3.153 0.00163 **
I(age_scale^2) 0.1232286 0.0566972 2.173 0.02980 *
male 0.0348407 0.0056419 6.175 7.16e-10 ***
educ_scale 0.0257784 0.0111371 2.315 0.02068 *
income_scale 0.1561915 0.0227687 6.860 7.79e-12 ***
know_mean 0.0007747 0.0165371 0.047 0.96264
income_scale:know_mean -0.0557548 0.0330965 -1.685 0.09213 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1881 on 4682 degrees of freedom
(58 observations deleted due to missingness)
Multiple R-squared: 0.05968, Adjusted R-squared: 0.05827
F-statistic: 42.45 on 7 and 4682 DF, p-value: < 2.2e-16
# standardized
m5_s <- lm(econ_mean ~ age_scale + I(age_scale^2) + male + educ_scale +
scale(income_scale)*scale(know_mean),
data = lucid)
summary(m5_s)
Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male +
educ_scale + scale(income_scale) * scale(know_mean), data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.46091 -0.13112 -0.00257 0.11944 0.69693
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.367927 0.011513 31.958 < 2e-16 ***
age_scale -0.156136 0.049519 -3.153 0.00163 **
I(age_scale^2) 0.123229 0.056697 2.173 0.02980 *
male 0.034841 0.005642 6.175 7.16e-10 ***
educ_scale 0.025778 0.011137 2.315 0.02068 *
scale(income_scale) 0.036094 0.003114 11.591 < 2e-16 ***
scale(know_mean) -0.006602 0.002984 -2.212 0.02698 *
scale(income_scale):scale(know_mean) -0.004767 0.002829 -1.685 0.09213 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1881 on 4682 degrees of freedom
(58 observations deleted due to missingness)
Multiple R-squared: 0.05968, Adjusted R-squared: 0.05827
F-statistic: 42.45 on 7 and 4682 DF, p-value: < 2.2e-16
Ey <- predict(m5, newdata = data.frame(
# set non-focal variables at central tendencies
age_scale = mean(lucid$age_scale, na.rm = T),
male = median(lucid$male, na.rm = T),
educ_scale = mean(lucid$educ_scale, na.rm = T),
# vary two interacting variables across reasonable values of each
income_scale = rep(seq(0, 1, 0.05), 3),
know_mean = c(rep(0, 21), rep(0.5, 21), rep(1, 21))
)
)
The marginal effect of income_scale is \(\beta_6 + \beta_8 \, \text{know\_mean}\)
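Plugging the printed m5 estimates into this expression (a sketch with rounded values):

```r
b_inc <- 0.1561915; b_int <- -0.0557548  # m5: income_scale, income_scale:know_mean
know  <- c(0, 0.5, 1)
b_inc + b_int * know  # ~0.156, ~0.128, ~0.100: income matters less at high knowledge
```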
Alternative moderators of an interactive relationship need to be included as additional interaction terms, not just as separate controls
A simple interaction between two variables implies a linear function for the conditional marginal effect:
\[\frac{\partial y_i}{\partial x_{1i}} = \beta_1 + \beta_3x_{2i}\]
The interflex package provides diagnostics and more flexible estimators for when this linearity assumption is doubtful (see also https://doi.org/10.1093/poq/nfac004)
\[ \begin{aligned} y_i = \beta_0 &+ \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} \\ &+ \beta_4x_{1i}x_{2i} + \beta_5x_{1i}x_{3i} + \beta_6x_{2i}x_{3i}\\ &+ \beta_7x_{1i}x_{2i}x_{3i} \\ &+ u_i \end{aligned} \]
In practice, you need strong theory and a lot of data to warrant estimating a three-way interaction.
\[\frac{\partial y_i}{\partial x_{1i}} = \beta_1 + \beta_4x_{2i} + \beta_5x_{3i} + \beta_7x_{2i}x_{3i}\]
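A numeric check of this conditional marginal effect (illustrative, arbitrary betas; the formula is exact because \(y\) is linear in \(x_{1i}\) given \(x_{2i}\) and \(x_{3i}\)):

```r
b <- c(0.1, 0.4, -0.3, 0.2, 0.15, -0.1, 0.05, 0.25)  # b0..b7, arbitrary
y <- function(x1, x2, x3)
  b[1] + b[2]*x1 + b[3]*x2 + b[4]*x3 +
  b[5]*x1*x2 + b[6]*x1*x3 + b[7]*x2*x3 + b[8]*x1*x2*x3
x2 <- 0.7; x3 <- -1.2
me_formula <- b[2] + b[5]*x2 + b[6]*x3 + b[8]*x2*x3
me_slope   <- y(2, x2, x3) - y(1, x2, x3)  # unit change in x1
all.equal(me_formula, me_slope)  # TRUE
```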