We may be willing to tolerate a little bias if it reduces variance by enough
Bias-variance tradeoff
[Figure: James et al., p. 31]
Bias-variance tradeoff
[Figure: James et al., p. 36]
Bias-interpretability tradeoff
In addition to increasing variance, increasing flexibility (reducing bias) may also decrease model interpretability
More variables, more parameters, and more complexity make a model more difficult to understand and explain
e.g., cubic vs linear relationship of X to Y
e.g., non-parametric models
e.g., “deep learning” neural networks with millions of parameters
Upshot
Constructing a useful estimator doesn’t necessarily mean minimizing bias
We care about expected distance from the true value
This is a function of both bias and variance
We are willing to increase bias to decrease variance if it reduces our generalization error
May also increase interpretability as a byproduct
Implications for assessing fit
We expect our “training” MSE (fit to current data) to underestimate “test” MSE (fit to future data).
In-sample fit includes fit to signal and fit to noise.
Reporting training MSE and \(R^2\) overstates model fit
Corrections we have discussed so far (adjusted \(R^2\), e.g.) indicate whether adding parameters improves fit to current data at rates better than chance. This is a low bar to clear!
Implications for assessing fit
If we have new/unseen/“test” data, use it!
But we usually don’t. What’s the next best thing?
Hold out some of your data
Another obvious possibility is to “hold out” some of your training data to use as test data
Fit model on random subsample of your total dataset
Use the held out data as “test” data to assess fit
# randomly sample rows to include/exclude from training
train_index <- sample(1:nrow(df), floor(nrow(df) * .7), replace = FALSE)
# fit the model on the training subsample
model <- lm(y ~ independent_variables, data = df[train_index, ])
# compare held-out observations to held-out predictions
heldout_observations <- df[-train_index, "y"]
heldout_predictions <- predict(model, newdata = df[-train_index, ])
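The held-out (test) MSE is then just the mean squared difference between these two vectors; a minimal sketch, continuing the hypothetical df/y example above:
# estimate the test MSE from the held-out observations
heldout_mse <- mean((heldout_observations - heldout_predictions)^2)
heldout_mse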
Cross-validation
Not obvious how much data to hold out / how sensitive results are to which observations are held out.
We could maximize the training sample by minimizing the amount of data “held out”, but this also increases the variance of our estimate of the test MSE (in the extreme, we would be evaluating on only 1 held-out observation!)
Cross-validation
Common to repeat this process across equally-sized chunks of held-out data and average loss across the K “folds”
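Before handing the bookkeeping to caret, here is a minimal manual sketch of the mechanics, reusing the hypothetical df/y setup from the hold-out example above:
# manual K-fold CV sketch (hypothetical df with outcome y, as above)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(df)))  # assign each row to a fold
fold_mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ independent_variables, data = df[folds != k, ])
  preds <- predict(fit, newdata = df[folds == k, ])
  mean((df[folds == k, "y"] - preds)^2)
})
mean(fold_mse)  # K-fold estimate of the test MSE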
Example K-fold, using caret
# load caret, load data, and fit in-sample
library(caret)
BEPS <- carData::BEPS
m1 <- lm(Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household,
         data = BEPS)
# estimate the model using 10-fold CV
m1_trainControl <- trainControl(method = "cv", number = 10)
m1_10fCV <- train(formula(m1), data = BEPS, method = "lm", trControl = m1_trainControl)
print(m1_10fCV)
Linear Regression
1525 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1373, 1373, 1371, 1373, 1372, 1373, ...
Resampling results:
RMSE Rsquared MAE
3.211572 0.05474133 2.766456
Tuning parameter 'intercept' was held constant at a value of TRUE
Cross-validation
Can extend this such that K = N (i.e. we estimate the model \(N\) times on \(N-1\) data points)
Rather than evaluating fit based on in-sample residuals, we instead use out-of-sample prediction errors
The MSE of these predictions is an estimate of the true test MSE
This is leave-one-out cross-validation
Example LOOCV, using caret
# load caret package
library(caret)
# estimate the model using LOOCV
m1_trainControl <- trainControl(method = "LOOCV")
m1_LOOCV <- train(formula(m1), data = BEPS, method = "lm", trControl = m1_trainControl)
print(m1_LOOCV)
Linear Regression
1525 samples
4 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 1524, 1524, 1524, 1524, 1524, 1524, ...
Resampling results:
RMSE Rsquared MAE
3.215089 0.04893308 2.768443
Tuning parameter 'intercept' was held constant at a value of TRUE
Example LOOCV, analytically
For OLS, there is an analytic solution for LOOCV (where \(e_i\) is the residual and \(h_i\) is the leverage of observation \(i\)):
\[ \text{CV}_{(N)} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2 \]
In words, inflate each observation’s squared residual by its leverage
It’s an HC3 adjustment!
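Applied to m1, this analytic shortcut needs only base R residuals and leverages:
# analytic LOOCV for OLS: inflate each squared residual by its leverage
loocv_mse <- mean((residuals(m1) / (1 - hatvalues(m1)))^2)
sqrt(loocv_mse)  # matches the RMSE reported by caret's LOOCV above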
Bias and variance in cross-validation
How to choose \(K\)? Using only a portion of the data for training biases the estimate of the test MSE upward (the training sample is smaller, so the fitted model is higher variance and predicts less well)
As \(K \rightarrow N\), the bias goes to 0 but the variance can also increase
b/c of potential for higher correlations across folds
Common practice to use 10 folds (i.e. hold 10% of data out at a time)
There is debate about this - unclear whether standard practice is actually better than LOOCV
With small \(K\), the final estimate depends (to some degree) on the randomness of the divisions into folds (we would get a different answer with different folds)
Extension: estimate K-fold CV \(M\) times and average the results
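caret supports this directly through repeated K-fold CV; a minimal sketch (the choice of 5 repeats here is illustrative):
# repeat 10-fold CV 5 times and average the resampling results
m1_repeatControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
m1_rep10fCV <- train(formula(m1), data = BEPS, method = "lm",
                     trControl = m1_repeatControl)
print(m1_rep10fCV)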
Cross-validation for model choice
You may often wish to choose among models of varying complexity
One criterion is to choose the model that minimizes the estimated test MSE (a code sketch follows this list):
Estimate a series of models with different sets of parameters
Calculate K-fold test MSE for each
Choose the model with the lowest test MSE
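A minimal sketch using caret and the BEPS model from earlier; the reduced specification here is purely illustrative:
# compare a reduced and the full specification by 10-fold CV RMSE
cv_control <- trainControl(method = "cv", number = 10)
cv_reduced <- train(Europe ~ I(age/10) + gender, data = BEPS,
                    method = "lm", trControl = cv_control)
cv_full <- train(formula(m1), data = BEPS,
                 method = "lm", trControl = cv_control)
c(reduced = cv_reduced$results$RMSE, full = cv_full$results$RMSE)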
Other selection criteria
If \(\hat{L}\) is the value of the likelihood function at the maximum:
Akaike Information Criterion (\(\text{AIC} = 2(K+1) - 2\ln(\hat{L})\))
Bayesian Information Criterion (\(\text{BIC} = \ln(N)(K+1) - 2\ln(\hat{L})\))
Choose model with smallest AIC or BIC
If we assume normal errors, maximum likelihood and ordinary least squares are equivalent, and so AIC and BIC can be used as model selection criteria
AIC and BIC in R
It is easy to get AIC and BIC values using base R
AIC(m1)
[1] 7891.712
BIC(m1)
[1] 7923.69
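To connect these numbers to the formulas above, AIC and BIC can be reproduced from the log-likelihood; note that R counts the residual variance as an estimated parameter, so the degrees of freedom it uses equal the number of coefficients plus one:
# reproduce AIC(m1) and BIC(m1) from the log-likelihood
ll <- logLik(m1)
-2 * as.numeric(ll) + 2 * attr(ll, "df")              # AIC
-2 * as.numeric(ll) + log(nobs(m1)) * attr(ll, "df")  # BIC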
Regularization
Regularization methods place constraints on the parameter space during estimation
These constraints operate to reduce model flexibility
This increases bias (relative to OLS), but reduces variance, ideally reducing test MSE
These methods may also be used to increase interpretability (again, by trading for bias)
Penalized Regression
One way to regularize is to augment the loss function to explicitly penalize coefficient size; for ridge regression, for example, we minimize
\[ \frac{SS_R}{N} + \lambda \sum_{k} \hat{\beta}_k^2 \]
\(\lambda\) is a hyperparameter chosen by the researcher (often selected by cross-validation)
larger \(\lambda\) means larger penalties and more shrinkage
Dividing \(SS_R\) by the sample size ensures that the loss function is independent of \(N\), which makes the scale of \(\lambda\) comparable across applications (though some software does not do this, e.g., MASS)
\(\hat{\beta}_k\) is sensitive to scale! Standardize inputs prior to estimation
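The glmnet examples below use a standardized fit m1_std that is defined elsewhere in the slides; a plausible construction (an assumption, chosen to match the variable names in the output below) is:
# hypothetical construction of the standardized model used below:
# scale the outcome and all predictors before estimation
m1_std <- lm(scale(Europe) ~ scale(age) + scale(as.numeric(gender)) +
               scale(economic.cond.national) + scale(economic.cond.household),
             data = BEPS)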
Penalized Regression
LASSO (Least Absolute Shrinkage and Selection Operator) instead penalizes the absolute size of the coefficients, \(\frac{SS_R}{N} + \lambda \sum_{k} |\hat{\beta}_k|\), which can shrink some coefficients exactly to zero
library(glmnet)
# run ridge regression with lambda = 100
m1_ridge <- glmnet(x = m1_std$model[, -1], y = m1_std$model[, 1],
                   alpha = 0,  # indicates ridge penalty
                   lambda = 100 / nrow(m1_std$model))  # MASS's lambda is inflated b/c it does not standardize the RSS by N; divide by N to make comparable
cbind(m1_ridge$beta, coef(m1_rr)[-1])
4 x 2 sparse Matrix of class "dgCMatrix"
s0
scale(age) 0.06489721 0.06489788
scale(as.numeric(gender)) -0.06087671 -0.06087852
scale(economic.cond.national) -0.17996402 -0.17996571
scale(economic.cond.household) -0.04486008 -0.04486036
Example, with glmnet
library(glmnet)
# run a lasso regression at several values of lambda
m1_lasso <- glmnet(x = m1_std$model[, -1], y = m1_std$model[, 1],
                   alpha = 1,  # indicates lasso penalty
                   lambda = c(0.01, 0.02, 0.1, 0.2))
round(cbind(m1_lasso$beta, coef(m1_std)[-1]), 3)
The researcher must choose the value of \(\lambda\)
There is unlikely to be theoretical guidance
We can choose based on model fit using cross-validation
Estimate ridge or lasso many times with varying values and choose the model with lowest MSE
Example, using glmnet
Hitters <- na.omit(ISLR::Hitters)
# run lasso regression across many lambdas, chosen by cross-validation
m2_lasso_cv <- cv.glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[, -1],
                         y = Hitters[, 19],
                         alpha = 1)
m2_lasso_cv$lambda.min
[1] 2.220313
Example, using glmnet
Using minimum lambda
# refit lasso model with the minimum-MSE lambda
m2_lasso <- glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[, -1],
                   y = Hitters[, 19],
                   alpha = 1,
                   lambda = m2_lasso_cv$lambda.min)
round(m2_lasso$beta, 3)
# run ridge regression across many lambdas, chosen by cross-validation
m2_ridge_cv <- cv.glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[, -1],
                         y = Hitters[, 19],
                         alpha = 0)
m2_ridge_cv$lambda.min
[1] 25.52821
plot(m2_ridge_cv)
Example, using glmnet
# refit ridge model with the minimum-MSE lambda
m2_ridge <- glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[, -1],
                   y = Hitters[, 19],
                   alpha = 0,
                   lambda = m2_ridge_cv$lambda.min)
round(cbind(m2_ridge$beta, m2_lasso$beta), 3)
Elastic net
Adds another tuning parameter, \(\alpha\), which ranges from 0 to 1 and “mixes” the LASSO and ridge penalties
When \(\alpha = 1\), we’ve got LASSO; when \(\alpha = 0\), we’ve got ridge
Example, using caret
library(caret)
# set up cross-validation approach
cv_10 = trainControl(method = "cv", number = 10)
# run elastic net regression, tuning alpha and lambda over a grid
m2_enet_cv <- train(Salary ~ ., data = Hitters[, 1:19],
                    method = "glmnet",
                    trControl = cv_10,
                    tuneLength = 10)
m2_enet_cv
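The selected mixing parameter and penalty can then be pulled from the fitted train object:
# best (alpha, lambda) combination found by 10-fold CV
m2_enet_cv$bestTune
# coefficients at the selected tuning values
coef(m2_enet_cv$finalModel, s = m2_enet_cv$bestTune$lambda)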