Bias and Variance

POLSCI 630: Probability and Basic Regression

April 8, 2025

Flexibility and variance

There is a fundamental trade-off in statistical modeling.

Adding parameters to the model (i.e. making it more complex)…

  • increases “flexibility”: it allows the model to better fit the current (“training” or “in-sample”) data
  • increases variance: larger uncertainty about the true parameter values (more variation from sample to sample)

Flexibility and variance

Your model’s fit contains…

  • systematic elements of the true data-generating process

    • generalizable, likely to remain consistent under repeated sampling / new data
  • stochastic elements of the true data-generating process

    • noise, unlikely to replicate on new data

Simple example
Estimating a simpler model

# fit a simpler (quadratic) model to the original data
m <- lm(y ~ x + I(x^2), data = dat)
summary(m)

Call:
lm(formula = y ~ x + I(x^2), data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2901 -1.2920  0.0877  1.2712  4.9043 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.06939    0.25141   0.276    0.783    
x           -1.91563    0.21028  -9.110 1.12e-14 ***
I(x^2)       0.96122    0.18426   5.217 1.03e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.924 on 97 degrees of freedom
Multiple R-squared:  0.5666,    Adjusted R-squared:  0.5577 
F-statistic: 63.41 on 2 and 97 DF,  p-value: < 2.2e-16
# out-of-sample R-squared: share of variance in the new data (dat2) explained by the model
1 - 
  ( sum( (predict(m, newdata = data.frame(x = dat2$x)) - dat2$y)^2 ) / (length(m$residuals) - 1)) / 
  var(dat2$y)
[1] 0.4994287

Test mean squared error

The true model of \(y_i\) is: \(y_i = f(\boldsymbol{x}_i) + u_i\), and we want to estimate \(f\) using \(\hat{f}\) to make predictions \(\hat{y}_i\)

  • Common practice to evaluate \(\hat{f}\) using variance of prediction errors on new data (i.e. test mean squared error):

\[ \text{E} \left( \left(y_{0i} - \hat{f}(\boldsymbol{x}_{0i}) \right)^2 \right) \]
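As a quick illustration using the quadratic example above (assuming dat2 holds the new data, as in the earlier out-of-sample calculation), the test MSE is just the mean squared prediction error on the new observations:

# estimated test MSE for the quadratic model on the new data in dat2
mean( (dat2$y - predict(m, newdata = dat2))^2 )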

Bias-variance tradeoff

Test mean squared error, \(\text{E} \left( \left(y_{0i} - \hat{f}(\boldsymbol{x}_{0i}) \right)^2 \right)\), can be decomposed as:

\[ \text{Var}\left(\hat{f}(x_{0i})\right) + \left[\text{Bias}\left(\hat{f}(x_{0i})\right)\right]^2 + \text{Var}(u_{0i}) \]

  • variance of estimator
  • bias of estimator squared
  • “irreducible” error variance

We may be willing to tolerate a little bias if it reduces variance by enough
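A minimal simulation sketch of the decomposition (the data-generating process, sample size, and test point below are illustrative assumptions, not the data from the earlier slides): draw many training samples from a known quadratic \(f\), fit a too-simple (linear) and a correctly specified (quadratic) model to each, and compare the variance and squared bias of their predictions at a single test point.

set.seed(630)
f  <- function(x) -2*x + x^2                          # true regression function
x0 <- 2                                               # test point of interest

preds <- replicate(1000, {
  x <- rnorm(100)
  y <- f(x) + rnorm(100, sd = 2)                      # fresh training sample
  c(linear    = predict(lm(y ~ x),          newdata = data.frame(x = x0)),
    quadratic = predict(lm(y ~ x + I(x^2)), newdata = data.frame(x = x0)))
})

apply(preds, 1, var)                                  # variance of each estimator at x0
(rowMeans(preds) - f(x0))^2                           # squared bias of each estimator at x0

The linear fit has lower variance but substantial bias at this test point; the quadratic fit is approximately unbiased but more variable. Which is preferable depends on how the two components trade off in the test MSE.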

Bias-variance tradeoff

James et al., p. 31

Bias-variance tradeoff

James et al., p. 36

Bias-interpretability tradeoff

In addition to increasing variance, increasing flexibility (reducing bias) may also decrease model interpretability

  • More variables, more parameters, and more complexity make a model more difficult to understand and explain
  • e.g., cubic vs linear relationship of X to Y
  • e.g., non-parametric models
  • e.g., “deep learning” neural networks with millions of parameters

Upshot

Constructing a useful estimator doesn’t necessarily mean minimizing bias

  • We care about expected distance from the true value
  • This is a function of both bias and variance

We are willing to increase bias to decrease variance if it reduces our generalization error

  • May also increase interpretability as a byproduct

Implications for assessing fit

We expect our “training” MSE (fit to current data) to underestimate “test” MSE (fit to future data).

In-sample fit includes fit to signal and fit to noise.

  • Reporting training MSE and \(R^2\) overstates model fit

Corrections we have discussed so far (e.g., adjusted \(R^2\)) indicate whether adding parameters improves fit to the current data at a rate better than chance. This is a low bar to clear!

Implications for assessing fit

If we have new/unseen/“test” data, use it!

But we usually don’t. What’s the next best thing?

Hold out some of your data

Another obvious possibility is to “hold out” some of your training data to use as test data

  • Fit model on random subsample of your total dataset

  • Use the held out data as “test” data to assess fit

# randomly sample rows to include in training (here, 70% of the data)
train_index <- sample(1:nrow(df), floor(nrow(df)*.7), replace = FALSE)

# fit the model on the training subsample
model <- lm(y ~ independent_variables, data = df[train_index,])

# compare held-out observations to held-out predictions
heldout_observations <- df$y[-train_index]
heldout_predictions <- predict(model, newdata = df[-train_index,])
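The estimated test MSE is then just the average squared difference between the held-out observations and predictions:

# estimated test MSE on the held-out 30%
mean( (heldout_observations - heldout_predictions)^2 )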

Cross-validation

Not obvious how much data to hold out / how sensitive results are to which observations are held out.

We could maximize our training sample size by minimizing the amount of data “held out”, but this may also increase the variance of our estimate of the test MSE (in the extreme, we are assessing fit on a single held-out observation!)

Cross-validation

Common practice is to repeat this process across \(K\) equally-sized chunks of held-out data and average the loss across the \(K\) “folds”
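A minimal base-R sketch of the procedure (df, y, and independent_variables are hypothetical placeholders, as in the hold-out example above):

K <- 10
folds <- sample(rep(1:K, length.out = nrow(df)))                  # randomly assign each row to a fold

fold_mse <- sapply(1:K, function(k) {
  fit  <- lm(y ~ independent_variables, data = df[folds != k, ])  # train on the other K-1 folds
  pred <- predict(fit, newdata = df[folds == k, ])                # predict the held-out fold
  mean( (df$y[folds == k] - pred)^2 )
})
mean(fold_mse)                                                     # K-fold estimate of the test MSE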

Example K-fold, using caret

# load packages and data, fit in-sample
library(caret)
BEPS <- carData::BEPS
m1 <- lm(Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household, data = BEPS)

# estimate model using 10-fold cross-validation
m1_trainControl <- trainControl(method = "cv", number = 10)
m1_10fCV <- train(formula(m1), data = BEPS, method = "lm", trControl = m1_trainControl)
print(m1_10fCV)
Linear Regression 

1525 samples
   4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1373, 1373, 1371, 1373, 1372, 1373, ... 
Resampling results:

  RMSE      Rsquared    MAE     
  3.211572  0.05474133  2.766456

Tuning parameter 'intercept' was held constant at a value of TRUE

Cross-validation

Can extend this such that K = N (i.e. we estimate the model \(N\) times on \(N-1\) data points)

  • Rather than evaluating fit based on in-sample residuals, we instead use out-of-sample prediction errors

  • The MSE of these predictions is an estimate of the true test MSE

This is leave-one-out cross-validation

Example LOOCV, using caret

# load caret package
library(caret)

# estimate models using LOOCV
m1_trainControl <- trainControl(method = "LOOCV")
m1_LOOCV <- train(formula(m1), data = BEPS, method = "lm", trControl = m1_trainControl)
print(m1_LOOCV)
Linear Regression 

1525 samples
   4 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 1524, 1524, 1524, 1524, 1524, 1524, ... 
Resampling results:

  RMSE      Rsquared    MAE     
  3.215089  0.04893308  2.768443

Tuning parameter 'intercept' was held constant at a value of TRUE

Example LOOCV, analytically

For OLS, there is an analytic solution for LOOCV (where \(h_i\) is the leverage of observation \(i\)):

\[ \text{LOOCV} = \frac{1}{N} \sum_{i=1}^N \left[\frac{\hat{u}_i}{(1 - h_i)} \right]^2 \]

# RMSE of the leave-one-out prediction errors (matches the caret LOOCV RMSE above)
sqrt( sum( (m1$residuals / (1 - hatvalues(m1)))^2 ) / nrow(BEPS) )
[1] 3.215089
  • In words: inflate each observation’s squared residual by its leverage
  • It’s the same adjustment used by the HC3 variance estimator!
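The same number can be recovered by brute force, refitting the model \(N\) times and predicting each left-out observation (slow, but a useful check on the analytic shortcut):

# brute-force LOOCV: drop each observation, refit, predict it out of sample
loo_errors <- sapply(1:nrow(BEPS), function(i) {
  fit_i <- lm(formula(m1), data = BEPS[-i, ])
  BEPS$Europe[i] - predict(fit_i, newdata = BEPS[i, ])
})
sqrt(mean(loo_errors^2))   # reproduces the analytic RMSE above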

Bias and variance in cross-validation

How to choose \(K\)? Using only a portion of the data for training leads to upward bias in the estimate of test MSE (b/c the model is fit to a smaller sample, so its prediction errors tend to be larger)

  • As \(K \rightarrow N\), this bias goes to 0, but the variance of the estimate can also increase
    • b/c the training sets overlap almost completely, so the fold-level errors are highly correlated
  • Common practice to use 10 folds (i.e. hold 10% of data out at a time)
  • There is debate about this - unclear whether standard practice is actually better than LOOCV

With small \(K\), the final estimate depends (to some degree) on the randomness of the divisions into folds (we would get a different answer with different folds)

  • Extension: estimate K-fold CV \(M\) times and average the results (sketched below using caret)
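caret implements this repeated K-fold procedure directly; a sketch (the number of repeats here is arbitrary):

# 10-fold CV repeated 5 times; loss is averaged over all 50 held-out fits
m1_repControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
m1_rep10fCV <- train(formula(m1), data = BEPS, method = "lm", trControl = m1_repControl)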

Cross-validation for model choice

You may often wish to choose among models of varying complexity

One criterion is to choose the model that minimizes the test MSE:

  • Estimate a series of models with different sets of parameters
  • Calculate K-fold test MSE for each
  • Choose the model with the lowest test MSE (see the sketch below)
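For example, using caret and the BEPS model from earlier, we might compare the original specification to one that adds a quadratic in age (the added term is purely illustrative):

# candidate models of differing complexity, each evaluated by 10-fold CV
ctrl <- trainControl(method = "cv", number = 10)
cv_simple <- train(Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household,
                   data = BEPS, method = "lm", trControl = ctrl)
cv_flex   <- train(Europe ~ I(age/10) + I((age/10)^2) + gender + economic.cond.national + economic.cond.household,
                   data = BEPS, method = "lm", trControl = ctrl)

# choose the specification with the lower cross-validated RMSE
c(simple = cv_simple$results$RMSE, flexible = cv_flex$results$RMSE)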

Other selection criteria

If \(\hat{L}\) is the value of the likelihood function at its maximum, and \(K+1\) is the number of estimated parameters (for OLS: the coefficients plus the error variance):

Akaike Information Criterion (\(\text{AIC} = 2(K+1) - 2\ln(\hat{L})\))

Bayesian Information Criterion (\(\text{BIC} = \ln(N)(K+1) - 2\ln(\hat{L})\))

  • Choose the model with the smallest AIC or BIC
  • If we assume normal errors, maximum likelihood and ordinary least squares give the same coefficient estimates, so AIC and BIC can be used as model selection criteria for OLS as well

AIC and BIC in R

It is easy to get AIC and BIC values using base R

AIC(m1)
[1] 7891.712
BIC(m1)
[1] 7923.69
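These reproduce the formulas above when computed by hand from the maximized log-likelihood (for m1, \(K + 1 = 6\): five coefficients plus the error variance):

# manual AIC and BIC for m1
ll <- as.numeric(logLik(m1))
k  <- attr(logLik(m1), "df")   # number of estimated parameters (6)
c(AIC = 2*k - 2*ll, BIC = log(nrow(BEPS))*k - 2*ll)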

Regularization

Regularization methods place constraints on the parameter space during estimation

  • These constraints operate to reduce model flexibility
  • This increases bias (relative to OLS), but reduces variance, ideally reducing test MSE
  • These methods may also be used to increase interpretability (again, by trading for bias)

Penalized Regression

One way to regularize is to augment the loss function to explicitly penalize coefficient size

\[ \begin{aligned} \hat{\boldsymbol{\beta}}_{OLS} &= \min(SS_R) \\ \hat{\boldsymbol{\beta}}_{Penalized} &= \min \left( \frac{SS_R}{2N} + \lambda \sum_{k=1}^{K} f(\hat{\beta}_k) \right) \end{aligned} \]

  • \(\lambda\) is the hyperparameter chosen by the researcher (often selected by cross-validation)
    • larger \(\lambda\) means larger penalties and more shrinkage
  • Dividing \(SS_R\) by the sample size makes the loss function’s scale independent of \(N\), so \(\lambda\) is comparable across applications (though some software, e.g. MASS::lm.ridge, does not do this)
  • \(\hat{\beta}_k\) is sensitive to scale! Standardize inputs prior to estimation

Penalized Regression

LASSO (Least Absolute Shrinkage and Selection Operator)

\[ \hat{\boldsymbol{\beta}}_{LASSO} = \min \left( \frac{SS_R}{2N} + \lambda \sum_{k=1}^{K} |\hat{\beta}_k| \right) \]

  • Penalize sum of coefficient absolute values
  • Shrinks uninformative coefficients exactly to zero, effectively removing them from the model
  • LASSO is thus a good choice when one wishes to reduce model complexity

Penalized Regression

Ridge regression

\[ \hat{\boldsymbol{\beta}}_{RR} = \min \left( \frac{SS_R}{2N} + \lambda \sum_{k=1}^{K} \hat{\beta}_k^2 \right) \]

  • Penalize sum of squared coefficients
  • Shrinks coefficients toward (but not exactly to) zero, with less informative coefficients shrunk more
  • A limitation of ridge is that it does not do subset selection
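As noted above, penalized coefficients are sensitive to the scale of the inputs. A small illustration with ridge computed “by hand” on the BEPS data from earlier (the lambda value is arbitrary and the \(1/2N\) scaling is omitted for simplicity): measuring age in decades rather than years does not simply rescale its coefficient by 10, because the penalty depends on the units of each input.

# ridge estimates by hand: (X'X + lambda*I)^{-1} X'y
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

y         <- BEPS$Europe - mean(BEPS$Europe)                     # centered outcome
X_years   <- scale(cbind(age = BEPS$age,      econ = BEPS$economic.cond.national), scale = FALSE)
X_decades <- scale(cbind(age = BEPS$age / 10, econ = BEPS$economic.cond.national), scale = FALSE)

# same data, same lambda -- but the 'decades' age coefficient is not 10x the 'years' coefficient
cbind(years   = drop(ridge_beta(X_years,   y, 1000)),
      decades = drop(ridge_beta(X_decades, y, 1000)))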

Penalized Regression

Elastic net regression

\[ \hat{\boldsymbol{\beta}}_{EN} = \min \left( \frac{SS_R}{2N} + \frac{\lambda(1 - \alpha)}{2} \sum_{k=1}^{K} \hat{\beta}_k^2 + \lambda\alpha \sum_{k=1}^{K} |\hat{\beta}_k| \right) \]

  • Adds another tuning parameter, \(\alpha\), which ranges from 0-1 and “mixes” LASSO and Ridge
  • When \(\alpha\) = 1, we’ve got LASSO; when \(\alpha\) = 0, we’ve got ridge

Ridge vs LASSO, visually

Why the LASSO is sparse:

James et al., p. 222

Example using MASS

# estimate original model with scaled vars
m1_std <- lm(scale(Europe) ~ scale(age) + scale(as.numeric(gender)) + scale(economic.cond.national) + scale(economic.cond.household), data = BEPS)

# estimate ridge regression with lambda = 100
m1_rr <- MASS::lm.ridge(formula(m1_std), data = BEPS, lambda = 100)
round(cbind(coef(m1_std), coef(m1_rr)), 3)
                                 [,1]   [,2]
(Intercept)                     0.000  0.000
scale(age)                      0.069  0.065
scale(as.numeric(gender))      -0.064 -0.061
scale(economic.cond.national)  -0.192 -0.180
scale(economic.cond.household) -0.043 -0.045

Example using glmnet

library(glmnet)

# run ridge regression with lambda = 100
m1_ridge <- glmnet(x = m1_std$model[,-1], 
                   y = m1_std$model[,1], 
                   alpha = 0, # indicates ridge penalty
                   lambda = 100 / nrow(m1_std$model)) # MASS doesn't divide RSS by N, so divide its lambda by N to put it on glmnet's scale

cbind(m1_ridge$beta, coef(m1_rr)[-1])
4 x 2 sparse Matrix of class "dgCMatrix"
                                        s0            
scale(age)                      0.06489721  0.06489788
scale(as.numeric(gender))      -0.06087671 -0.06087852
scale(economic.cond.national)  -0.17996402 -0.17996571
scale(economic.cond.household) -0.04486008 -0.04486036

Example, with glmnet

library(glmnet)

# run a lasso regression
m1_lasso <- glmnet(x = m1_std$model[,-1], 
                   y = m1_std$model[,1], 
                   alpha = 1, # indicates lasso penalty
                   lambda = c(0.01, 0.02, 0.1, 0.2))
round(cbind(m1_lasso$beta, coef(m1_std)[-1]), 3)
4 x 5 sparse Matrix of class "dgCMatrix"
                                   s0     s1     s2     s3       
scale(age)                      .      .      0.050  0.060  0.069
scale(as.numeric(gender))       .      .     -0.046 -0.055 -0.064
scale(economic.cond.national)  -0.009 -0.109 -0.178 -0.185 -0.192
scale(economic.cond.household)  .      .     -0.030 -0.037 -0.043

Selecting the tuning parameter

The researcher must choose the value of \(\lambda\)

  • There is unlikely to be theoretical guidance

  • We can choose based on model fit using cross-validation

    • Estimate ridge or lasso many times with varying values and choose the model with lowest MSE

Example, using glmnet

Hitters <- na.omit(ISLR::Hitters)

# run lasso regression with many lambdas
m2_lasso_cv <- cv.glmnet(x = model.matrix(Salary ~ ., 
                                          data = Hitters[, 1:19])[,-1], 
                         y = Hitters[, 19], 
                         alpha = 1)
m2_lasso_cv$lambda.min
[1] 2.220313

Example, using glmnet

Using minimum lambda

# refit lasso model with minimum MSE lambda
m2_lasso <- glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[,-1], 
                   y = Hitters[, 19], 
                   alpha = 1,
                   lambda = m2_lasso_cv$lambda.min)
round(m2_lasso$beta, 3)
18 x 1 sparse Matrix of class "dgCMatrix"
                s0
AtBat       -1.692
Hits         5.973
HmRun        0.049
Runs         .    
RBI          .    
Walks        4.997
Years      -10.073
CAtBat       .    
CHits        .    
CHmRun       0.591
CRuns        0.713
CRBI         0.375
CWalks      -0.592
LeagueN     33.105
DivisionW -119.198
PutOuts      0.276
Assists      0.200
Errors      -2.245

Using 1SE above min lambda

# refit lasso model with the largest lambda whose CV error is within 1 SE of the minimum
m2_lasso_1sd <- glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[,-1], 
                   y = Hitters[, 19], 
                   alpha = 1,
                   lambda = m2_lasso_cv$lambda.1se)
round(m2_lasso_1sd$beta, 3)
18 x 1 sparse Matrix of class "dgCMatrix"
             s0
AtBat     .    
Hits      1.365
HmRun     .    
Runs      .    
RBI       .    
Walks     1.500
Years     .    
CAtBat    .    
CHits     .    
CHmRun    .    
CRuns     0.146
CRBI      0.335
CWalks    .    
LeagueN   .    
DivisionW .    
PutOuts   0.066
Assists   .    
Errors    .    

Example, using glmnet

# run ridge regression with many lambdas
m2_ridge_cv <- cv.glmnet(x = model.matrix(Salary ~ ., 
                                          data = Hitters[, 1:19])[,-1], 
                         y = Hitters[, 19], 
                         alpha = 0)
m2_ridge_cv$lambda.min
[1] 25.52821
plot(m2_ridge_cv)

Example, using glmnet

# refit ridge model with minimum MSE lambda
m2_ridge <- glmnet(x = model.matrix(Salary ~ ., data = Hitters[, 1:19])[,-1], 
                   y = Hitters[, 19], 
                   alpha = 0,
                   lambda = m2_ridge_cv$lambda.min)
round(cbind(m2_ridge$beta, m2_lasso$beta), 3)
18 x 2 sparse Matrix of class "dgCMatrix"
                s0       s0
AtBat       -0.686   -1.692
Hits         2.773    5.973
HmRun       -1.303    0.049
Runs         1.049    .    
RBI          0.714    .    
Walks        3.352    4.997
Years       -8.904  -10.073
CAtBat      -0.001    .    
CHits        0.133    .    
CHmRun       0.686    0.591
CRuns        0.290    0.713
CRBI         0.261    0.375
CWalks      -0.273   -0.592
LeagueN     38.543   33.105
DivisionW -122.808 -119.198
PutOuts      0.263    0.276
Assists      0.170    0.200
Errors      -3.648   -2.245

Elastic net

Elastic net regression

\[ \hat{\boldsymbol{\beta}}_{EN} = \min \left( \frac{SS_R}{2N} + \frac{\lambda(1 - \alpha)}{2} \sum_{k=1}^{K} \hat{\beta}_k^2 + \lambda\alpha \sum_{k=1}^{K} |\hat{\beta}_k| \right) \]

  • Adds another tuning parameter, \(\alpha\), which ranges from 0-1 and “mixes” LASSO and Ridge
  • When \(\alpha\) = 1, we’ve got LASSO; when \(\alpha\) = 0, we’ve got ridge

Example, using caret

library(caret)

# set up cross-validation approach
cv_10 = trainControl(method = "cv", number = 10)

# run enet regression with many lambdas
m2_enet_cv <- train(Salary ~ ., 
                    data = Hitters[, 1:19][,-1], 
                    method = "glmnet",
                    trControl = cv_10,
                    tuneLength = 10)
m2_enet_cv
glmnet 

263 samples
 17 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 239, 237, 236, 236, 237, 235, ... 
Resampling results across tuning parameters:

  alpha  lambda       RMSE      Rsquared   MAE     
  0.1      0.1179470  337.1146  0.4569737  239.9463
  0.1      0.2724728  336.9989  0.4572353  239.8635
  0.1      0.6294474  335.7753  0.4601295  238.8601
  0.1      1.4541051  334.8405  0.4621547  237.6963
  0.1      3.3591715  334.5240  0.4619881  237.3114
  0.1      7.7601218  334.4284  0.4601385  236.4201
  0.1     17.9268877  333.4158  0.4605723  234.0868
  0.1     41.4134358  332.0295  0.4619638  231.8182
  0.1     95.6704080  329.8511  0.4689479  230.2697
  0.1    221.0110510  332.3967  0.4713953  233.7606
  0.2      0.1179470  338.0095  0.4549354  240.5453
  0.2      0.2724728  337.1458  0.4568900  239.9124
  0.2      0.6294474  335.9206  0.4597630  238.8633
  0.2      1.4541051  334.9432  0.4617918  237.6213
  0.2      3.3591715  334.6453  0.4613388  237.2397
  0.2      7.7601218  334.7482  0.4585609  236.0518
  0.2     17.9268877  333.5899  0.4590780  233.4087
  0.2     41.4134358  331.2632  0.4639370  230.8382
  0.2     95.6704080  331.3002  0.4674805  231.9413
  0.2    221.0110510  339.9189  0.4633366  242.0437
  0.3      0.1179470  338.1534  0.4546115  240.6253
  0.3      0.2724728  337.2920  0.4565623  239.9665
  0.3      0.6294474  336.0645  0.4594129  238.8640
  0.3      1.4541051  335.0762  0.4613655  237.5777
  0.3      3.3591715  334.8752  0.4603166  237.1819
  0.3      7.7601218  334.9238  0.4574169  235.5291
  0.3     17.9268877  333.6444  0.4581311  232.8973
  0.3     41.4134358  331.5666  0.4631639  231.2284
  0.3     95.6704080  333.5685  0.4647607  233.9864
  0.3    221.0110510  350.6040  0.4483223  253.3529
  0.4      0.1179470  338.2250  0.4544537  240.6223
  0.4      0.2724728  337.4501  0.4562223  240.0128
  0.4      0.6294474  336.2156  0.4590592  238.8391
  0.4      1.4541051  335.2694  0.4607937  237.5880
  0.4      3.3591715  335.2775  0.4588215  237.2343
  0.4      7.7601218  334.8884  0.4568945  235.1001
  0.4     17.9268877  333.3698  0.4583234  232.0654
  0.4     41.4134358  331.8898  0.4629589  231.6207
  0.4     95.6704080  336.8757  0.4595396  237.2452
  0.4    221.0110510  361.0185  0.4367743  263.9717
  0.5      0.1179470  338.3334  0.4541743  240.6566
  0.5      0.2724728  337.6295  0.4558153  240.0784
  0.5      0.6294474  336.4037  0.4586007  238.8440
  0.5      1.4541051  335.5122  0.4600479  237.6380
  0.5      3.3591715  335.7310  0.4572205  237.2826
  0.5      7.7601218  334.9944  0.4559907  234.7921
  0.5     17.9268877  332.8613  0.4595313  231.5704
  0.5     41.4134358  332.4408  0.4623855  232.0718
  0.5     95.6704080  340.7912  0.4533061  241.0801
  0.5    221.0110510  371.6798  0.4220470  275.0217
  0.6      0.1179470  338.4191  0.4539813  240.7096
  0.6      0.2724728  337.7655  0.4554672  240.1345
  0.6      0.6294474  336.5795  0.4581195  238.8273
  0.6      1.4541051  335.8177  0.4591576  237.7283
  0.6      3.3591715  336.0653  0.4558238  237.2654
  0.6      7.7601218  335.0224  0.4552847  234.4877
  0.6     17.9268877  332.6923  0.4597547  231.5810
  0.6     41.4134358  333.2894  0.4609658  232.6709
  0.6     95.6704080  345.4337  0.4451713  245.7922
  0.6    221.0110510  383.6031  0.4011638  286.9270
  0.7      0.1179470  338.5368  0.4536842  240.7464
  0.7      0.2724728  337.8875  0.4551388  240.1739
  0.7      0.6294474  336.7836  0.4575836  238.8350
  0.7      1.4541051  336.1571  0.4581585  237.8468
  0.7      3.3591715  336.0960  0.4552841  237.0383
  0.7      7.7601218  335.0320  0.4546871  234.3301
  0.7     17.9268877  332.6645  0.4596965  231.7496
  0.7     41.4134358  334.2697  0.4591874  233.5248
  0.7     95.6704080  350.0371  0.4379719  250.6555
  0.7    221.0110510  396.3846  0.3694465  299.5499
  0.8      0.1179470  338.6718  0.4533797  240.8178
  0.8      0.2724728  338.0437  0.4547679  240.2195
  0.8      0.6294474  337.0235  0.4569662  238.8584
  0.8      1.4541051  336.4878  0.4571542  237.9632
  0.8      3.3591715  335.9191  0.4553522  236.7445
  0.8      7.7601218  335.0229  0.4542200  234.0179
  0.8     17.9268877  332.5913  0.4598636  231.8564
  0.8     41.4134358  335.1930  0.4576883  234.4226
  0.8     95.6704080  353.8639  0.4343062  254.9515
  0.8    221.0110510  408.7359  0.3263137  311.3987
  0.9      0.1179470  338.8456  0.4529737  240.9110
  0.9      0.2724728  338.2491  0.4543162  240.2984
  0.9      0.6294474  337.3052  0.4562565  238.9018
  0.9      1.4541051  336.8246  0.4561569  238.0873
  0.9      3.3591715  335.6601  0.4557509  236.4092
  0.9      7.7601218  334.8533  0.4542180  233.5306
  0.9     17.9268877  332.5098  0.4601025  231.9450
  0.9     41.4134358  336.2682  0.4559157  235.5111
  0.9     95.6704080  357.5829  0.4297634  259.1683
  0.9    221.0110510  417.9415  0.3170228  319.7487
  1.0      0.1179470  339.0675  0.4525418  241.0737
  1.0      0.2724728  338.4985  0.4537807  240.4088
  1.0      0.6294474  337.7213  0.4553150  239.0309
  1.0      1.4541051  337.2338  0.4550189  238.2637
  1.0      3.3591715  335.4596  0.4559416  236.1369
  1.0      7.7601218  334.4036  0.4550731  233.0208
  1.0     17.9268877  332.5114  0.4601205  232.0317
  1.0     41.4134358  337.5744  0.4535119  236.6848
  1.0     95.6704080  360.9757  0.4252960  263.2606
  1.0    221.0110510  427.6143  0.3182270  328.2182

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda = 95.67041.

Example, using caret

round(m2_enet_cv$results[which(rownames(m2_enet_cv$results) == rownames(m2_enet_cv$bestTune)), ], 3)
  alpha lambda    RMSE Rsquared    MAE RMSESD RsquaredSD  MAESD
9   0.1  95.67 329.851    0.469 230.27 81.634      0.152 37.405
round(m2_enet_cv$finalModel$beta[, which(rownames(m2_enet_cv$results) == rownames(m2_enet_cv$bestTune))], 3)
     Hits     HmRun      Runs       RBI     Walks     Years    CAtBat     CHits 
    0.247     0.000     0.299     0.372     0.470     0.000     0.005     0.024 
   CHmRun     CRuns      CRBI    CWalks   LeagueN DivisionW   PutOuts   Assists 
    0.160     0.053     0.057     0.025     0.000     0.000     0.000     0.000 
   Errors 
    0.000 

Standard errors in regularization

You may have noticed that penalized regression functions are not providing standard errors!

  • For ridge regression

    • There are analytic solutions for the covariance matrix (see Hansen (2022, p. 947)), though many packages do not report them
    • Unclear how to think about confidence intervals when loss function introduces unknown bias, as the target of estimation is not the true value
  • For LASSO, the idea of a confidence interval is even more problematic, because variables are being eliminated entirely

  • You should take Machine Learning so that someone more informed on these topics can give you better advice!

Bayesian regularization

An alternative to penalized regression is a full Bayesian analysis with informative priors for coefficients centered on 0 (a sketch follows below)

  • This has the same regularizing effect of shrinking estimates toward zero
  • SEs and uncertainty bounds are easily calculated
  • Though, again, if those are not your true priors, then it is not entirely clear how to interpret the resulting estimates and intervals…
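One option is a sketch like the following, assuming the rstanarm package is available (the prior scales are illustrative choices, not recommendations):

library(rstanarm)

# ridge-like Bayesian regression: normal priors centered at 0 shrink coefficients toward 0
m1_bayes <- stan_glm(Europe ~ I(age/10) + gender + economic.cond.national + economic.cond.household,
                     data = BEPS,
                     prior = normal(0, 1, autoscale = TRUE),
                     prior_intercept = normal(0, 5, autoscale = TRUE))

# posterior summaries and credible intervals come with no extra work
print(m1_bayes, digits = 3)
posterior_interval(m1_bayes, prob = 0.90)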