<- read.csv("data/ceosalary.csv") ceosalary
Lab 2
Data
In this lab, we will continue to work with data on CEO salaries from 1990, from the dataset ceosalary.csv
.
The relevant variables are:
salary
: 1990 compensation ($1000s)age
: Age (years)profits
: 1990 profits ($ millions)grad
: binary for graduate school
Model Fit
\[salary_i = \beta_0 + \beta_1 profits_i + u_i\]
<- lm(salary ~ profits, ceosalary)
m1 summary(m1)
Call:
lm(formula = salary ~ profits, data = ceosalary)
Residuals:
Min 1Q Median 3Q Max
-872.4 -319.7 -119.8 242.0 4484.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 746.9238 45.7979 16.31 < 2e-16 ***
profits 0.5723 0.1009 5.67 5.81e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 541.6 on 175 degrees of freedom
Multiple R-squared: 0.1552, Adjusted R-squared: 0.1504
F-statistic: 32.14 on 1 and 175 DF, p-value: 5.805e-08
R Squared
\[R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\Sigma_N (y_i - \hat{y_i})^2}{\Sigma_N (y_i - \bar{y})^2}\] \(R^2\) is the proportionate reduction in error associated with the dependent variable when moving from an intercept-only model to the model of interest. It is the proportion of total variance in the dependent variable that is accounted for by the model.
# observed y
<- ceosalary$salary
y
# predicted y
<- cbind(1, ceosalary$profits)
X <- solve(t(X) %*% X) %*% t(X) %*% y
beta <- X %*% beta
y_hat
# sum of squared residuals
<- sum((y - y_hat)^2)
SSR
# total sum of squares
<- sum((y - mean(y))^2)
SST
# r squared
1 - (SSR / SST)
[1] 0.1551789
1 - var(m1$residuals)/var(m1$model$salary)
[1] 0.1551789
Mean Squared Error
\[MSE = \frac{\Sigma_N (y_i - \hat{y_i})^2}{N - K - 1} \] Mean squared error is the estimated variance of prediction errors, or the estimated residual variance, \(\hat{\sigma}^2\).
# mse
/ (177 - 1 - 1) SSR
[1] 293350.7
sum(m1$residuals^2) / m1$df.residual
[1] 293350.7
Adjusted R Squared
\[R_{adj}^2 = 1 - \frac{MSE}{Var(y)} = 1 - \frac{\frac{\Sigma_N (y_i - \hat{y_i})^2}{N - K - 1}}{\frac{\Sigma_N (y_i - \bar{y})^2}{N - 1}}\]
Adjusted \(R^2\) is mean squared error divided by the variance of the dependent variable. The difference between adjusted and unadjusted \(R^2\) is that the former divided by the residual degrees of freedom (sample size minus parameters estimated), and thus adjusts \(R^2\) downwards to account for the increase in fit associated with adding additional parameters, regardless of how well they predict the dependent variable out-of-sample.
1 - (sum(m1$residuals^2) / m1$df.residual) / var(m1$model$salary)
[1] 0.1503514
Multiple Regression
\[salary_i = \beta_0 + \beta_1 profits_{i} + \beta_2 age_{i} + u_i\]
<- lm(salary ~ profits + age, ceosalary)
m2 summary(m2)
Call:
lm(formula = salary ~ profits + age, data = ceosalary)
Residuals:
Min 1Q Median 3Q Max
-892.9 -331.6 -100.2 253.9 4445.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 469.3863 276.7318 1.696 0.0916 .
profits 0.5604 0.1016 5.516 1.24e-07 ***
age 4.9620 4.8794 1.017 0.3106
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared: 0.1602, Adjusted R-squared: 0.1505
F-statistic: 16.59 on 2 and 174 DF, p-value: 2.539e-07
Comparing Model Fit
How does \(\hat{\beta}_1\) change from m1
to m2
?
How does \(R^2\) change from m1
to m2
?
How does adjusted \(R^2\) change from m1
to m2
?
Plotting Multiple Regression
Plot the relationship between profits and salary. Be careful! Consider the interpretation of \(\hat{\beta}_1\).
# Fix age at its mean and vary profits for the profits-salary relationship
<- data.frame(
profits_range profits = seq(min(ceosalary$profits), max(ceosalary$profits), length.out = 100),
age = mean(ceosalary$age)
)$salary <- predict(m2, newdata = profits_range)
profits_range
# base R
plot(profits_range$profits, profits_range$salary, type = "l",
xlab="Firm Profits (million USD)", ylab="CEO Salary (thousand USD)",
main = "CEO Salary by Firm Profits at Mean Age")
# ggplot
ggplot(profits_range, aes(x = profits, y = salary)) +
geom_line(size = 1) +
labs(
title = "Relationship Between Profits and Salary",
x = "Profits",
y = "Salary") +
theme_minimal()
Add the relationship between age and salary to the same plot.
# Fix profits at its mean and vary age for the age-salary relationship
<- data.frame(
age_range age = seq(min(ceosalary$age), max(ceosalary$age), length.out = 100),
profits = mean(ceosalary$profits)
)$salary <- predict(m2, newdata = age_range)
age_range
# combine datasets
$variable <- "Profits"
profits_range$variable <- "Age"
age_range<- bind_rows(profits_range, age_range)
combined_data
ggplot() +
geom_line(data = age_range, aes(x = age, y = salary), color = "red") +
geom_line(data = profits_range, aes(x = profits, y = salary), color = "blue") +
scale_x_continuous(name = "Age (Years)",
sec.axis = sec_axis(~ ., name = "Profits (Million USD)")) +
labs(
y = "Salary (Thousand USD)",
title = "CEO Salary by Age and Firm Profits"
+
) theme_minimal() +
theme(
axis.title.x.bottom = element_text(color = "red"),
axis.title.x.top = element_text(color = "blue"),
axis.line = element_line(color = "black"),
plot.title = element_text(hjust = 0.5)
)