Lab 2

Data

In this lab, we will continue to work with data on CEO salaries from 1990, from the dataset ceosalary.csv.

The relevant variables are:

salary: 1990 compensation ($1000s)
age: Age (years)
profits: 1990 profits ($ millions)
grad: binary for graduate school

ceosalary <- read.csv("data/ceosalary.csv")

Model Fit

\[salary_i = \beta_0 + \beta_1 profits_i + u_i\]

m1 <- lm(salary ~ profits, ceosalary)
summary(m1)


Call:
lm(formula = salary ~ profits, data = ceosalary)

Residuals:
   Min     1Q Median     3Q    Max 
-872.4 -319.7 -119.8  242.0 4484.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 746.9238    45.7979   16.31  < 2e-16 ***
profits       0.5723     0.1009    5.67 5.81e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 541.6 on 175 degrees of freedom
Multiple R-squared:  0.1552,    Adjusted R-squared:  0.1504 
F-statistic: 32.14 on 1 and 175 DF,  p-value: 5.805e-08

R Squared

\[R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\Sigma_N (y_i - \hat{y_i})^2}{\Sigma_N (y_i - \bar{y})^2}\] $R^2$ is the proportionate reduction in error associated with the dependent variable when moving from an intercept-only model to the model of interest. It is the proportion of total variance in the dependent variable that is accounted for by the model.

# observed y 
y <- ceosalary$salary

# predicted y 
X <- cbind(1, ceosalary$profits)
beta <- solve(t(X) %*% X) %*% t(X) %*% y
y_hat <- X %*% beta

# sum of squared residuals
SSR <- sum((y - y_hat)^2)

# total sum of squares
SST <- sum((y - mean(y))^2)

# r squared
1 - (SSR / SST)

[1] 0.1551789

1 - var(m1$residuals)/var(m1$model$salary)

[1] 0.1551789

Mean Squared Error

\[MSE = \frac{\Sigma_N (y_i - \hat{y_i})^2}{N - K - 1} \] Mean squared error is the estimated variance of prediction errors, or the estimated residual variance, $\hat{\sigma}^2$.

# mse
SSR / (177 - 1 - 1)

[1] 293350.7

sum(m1$residuals^2) / m1$df.residual

[1] 293350.7

Adjusted R Squared

\[R_{adj}^2 = 1 - \frac{MSE}{Var(y)} = 1 - \frac{\frac{\Sigma_N (y_i - \hat{y_i})^2}{N - K - 1}}{\frac{\Sigma_N (y_i - \bar{y})^2}{N - 1}}\]

Adjusted $R^2$ is mean squared error divided by the variance of the dependent variable. The difference between adjusted and unadjusted $R^2$ is that the former divided by the residual degrees of freedom (sample size minus parameters estimated), and thus adjusts $R^2$ downwards to account for the increase in fit associated with adding additional parameters, regardless of how well they predict the dependent variable out-of-sample.

1 - (sum(m1$residuals^2) / m1$df.residual) / var(m1$model$salary)

[1] 0.1503514

Multiple Regression

\[salary_i = \beta_0 + \beta_1 profits_{i} + \beta_2 age_{i} + u_i\]

m2 <- lm(salary ~ profits + age, ceosalary)
summary(m2)


Call:
lm(formula = salary ~ profits + age, data = ceosalary)

Residuals:
   Min     1Q Median     3Q    Max 
-892.9 -331.6 -100.2  253.9 4445.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 469.3863   276.7318   1.696   0.0916 .  
profits       0.5604     0.1016   5.516 1.24e-07 ***
age           4.9620     4.8794   1.017   0.3106    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 541.6 on 174 degrees of freedom
Multiple R-squared:  0.1602,    Adjusted R-squared:  0.1505 
F-statistic: 16.59 on 2 and 174 DF,  p-value: 2.539e-07

Comparing Model Fit

How does $\hat{\beta}_1$ change from m1 to m2?

How does $R^2$ change from m1 to m2?

How does adjusted $R^2$ change from m1 to m2?

Plotting Multiple Regression

Plot the relationship between profits and salary. Be careful! Consider the interpretation of $\hat{\beta}_1$.

# Fix age at its mean and vary profits for the profits-salary relationship
profits_range <- data.frame(
  profits = seq(min(ceosalary$profits), max(ceosalary$profits), length.out = 100),
  age = mean(ceosalary$age)
)
profits_range$salary <- predict(m2, newdata = profits_range)

# base R 
plot(profits_range$profits, profits_range$salary, type = "l",
     xlab="Firm Profits (million USD)", ylab="CEO Salary (thousand USD)",
     main = "CEO Salary by Firm Profits at Mean Age")

# ggplot
ggplot(profits_range, aes(x = profits, y = salary)) +
  geom_line(size = 1) +
  labs(
    title = "Relationship Between Profits and Salary",
    x = "Profits",
    y = "Salary") +
  theme_minimal()

Add the relationship between age and salary to the same plot.

# Fix profits at its mean and vary age for the age-salary relationship
age_range <- data.frame(
  age = seq(min(ceosalary$age), max(ceosalary$age), length.out = 100),
  profits = mean(ceosalary$profits)
)
age_range$salary <- predict(m2, newdata = age_range)

# combine datasets
profits_range$variable <- "Profits"
age_range$variable <- "Age"
combined_data <- bind_rows(profits_range, age_range)

ggplot() +
  geom_line(data = age_range, aes(x = age, y = salary), color = "red") +
  geom_line(data = profits_range, aes(x = profits, y = salary), color = "blue") +
  scale_x_continuous(name = "Age (Years)", 
                     sec.axis = sec_axis(~ ., name = "Profits (Million USD)")) +
  labs(
    y = "Salary (Thousand USD)",
    title = "CEO Salary by Age and Firm Profits"
  ) +
  theme_minimal() +
  theme(
    axis.title.x.bottom = element_text(color = "red"),
    axis.title.x.top = element_text(color = "blue"),
    axis.line = element_line(color = "black"),
    plot.title = element_text(hjust = 0.5)
  )