library(HDSinRdata)
library(tidyverse)
library(glmnet)
library(L0Learn)
15 Model Selection
In Chapter 13 and Chapter 14, we included one simple method for model selection, stepwise selection. This chapter expands upon our model selection tools in R by focusing on regularized regression. The two packages we cover are glmnet (Friedman, Tibshirani, and Hastie 2010) and L0Learn (Hazimeh, Mazumder, and Nonet 2023). These two packages focus on different types of model regularization.
To demonstrate these packages, we use the same motivating example as in Chapter 13. Recall that the NHANESsample dataset contains lead, blood pressure, BMI, smoking status, alcohol use, and demographic variables from NHANES 1999-2018. Our focus is on the association between blood lead levels and systolic blood pressure. We first create a single systolic blood pressure variable by averaging across all available measurements. We also log-transform lead before dropping the variables we want to exclude from our analysis.
# load in data
data(NHANESsample)

# transform SBP and lead
NHANESsample$SBP <- rowMeans(NHANESsample[c("SBP1", "SBP2", "SBP3", "SBP4")],
                             na.rm = TRUE)
NHANESsample$LEAD <- log(NHANESsample$LEAD)
# remove variables not to include in the model
nhanes <- NHANESsample %>%
  select(-c(ID, HYP, LEAD_QUANTILE, DBP1, DBP2, DBP3, DBP4,
            SBP1, SBP2, SBP3, SBP4, YEAR)) %>%
  na.omit()

# convert to factors
nhanes$SEX <- factor(nhanes$SEX)
nhanes$RACE <- factor(nhanes$RACE)
nhanes$EDUCATION <- factor(nhanes$EDUCATION)
nhanes$BMI_CAT <- factor(nhanes$BMI_CAT)
nhanes$ALC <- factor(nhanes$ALC)
15.1 Regularized Regression
Suppose we have a numeric data matrix \(X \in \mathbb{R}^{n \times p}\) and outcome vector \(y \in \mathbb{R}^n\). We let \(x_i\) denote the vector representing the \(i\)th row of \(X\), which corresponds to the \(i\)th observation. By regularized regression, we mean solving the following optimization problem, which minimizes the average loss plus a penalty term.
\[ \min_{ (\beta_0, \beta) \in \mathbb{R}^{p+1}} \frac{1}{n} \sum_{i=1}^n l(y_i, \beta_0 + \beta^T x_i) + \text{Pen}(\beta) \tag{15.1}\]
The function \(l(y_i, \beta_0 + \beta^T x_i)\) represents the loss function. For linear regression, this corresponds to the squared error \((y_i - \beta_0 - \beta^T x_i)^2\). For logistic regression, this loss corresponds to the logistic loss function.
The penalty terms we implement include the following:
L0 Norm: \(||\beta ||_0 = \sum_{j=1}^p 1(\beta_j \neq 0)\), the number of non-zero coefficients,
L1 Norm: \(||\beta ||_1 = \sum_{j=1}^p |\beta_j|\), the sum of absolute values of the coefficients, and
Squared L2 Norm: \(||\beta ||_2^2 = \sum_{j=1}^p \beta_j^2\), the sum of squared coefficients.
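To make these penalties concrete, the following short sketch computes each norm for a small, purely illustrative coefficient vector (beta_toy is made up for this demonstration and does not come from any fitted model).

# purely illustrative coefficient vector (not from any fitted model)
beta_toy <- c(0.5, 0, -1.2, 0, 0.3)

# L0 norm: number of non-zero coefficients
sum(beta_toy != 0)  # 3

# L1 norm: sum of absolute values of the coefficients
sum(abs(beta_toy))  # 2

# squared L2 norm: sum of squared coefficients
sum(beta_toy^2)     # 1.78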
15.2 Elastic Net
We first consider L1 and L2 regularization. In particular, consider the following penalty term, referred to as elastic net regularization,
\[ \lambda \left[ \alpha ||\beta||_1 + (1-\alpha) ||\beta||^2_2 \right], \]
where \(\lambda\) is a complexity parameter and \(\alpha\) controls the balance between the two norms. A model with only L1 regularization (\(\alpha = 1\)) corresponds to lasso regression while a model with only L2 regularization (\(\alpha=0\)) corresponds to ridge regression. Note that the penalty depends on the scale of \(X\) and we typically assume each column has been standardized.
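As a rough sketch of how \(\alpha\) mixes the two norms, the helper below (enet_penalty is our own illustrative function, not part of glmnet) evaluates the elastic net penalty for the toy vector from above at a few values of \(\alpha\).

# hypothetical helper, not part of glmnet: evaluates the elastic net penalty
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta_toy <- c(0.5, 0, -1.2, 0, 0.3)  # same illustrative vector as before
enet_penalty(beta_toy, lambda = 0.1, alpha = 1)    # lasso penalty: 0.2
enet_penalty(beta_toy, lambda = 0.1, alpha = 0)    # ridge penalty: 0.178
enet_penalty(beta_toy, lambda = 0.1, alpha = 0.5)  # equal mix: 0.189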
The glmnet package implements elastic net regularization. It assumes our data are in the form described previously. Therefore, we first create our numeric data matrix x
and output vector y
. Some of our variables are categorical, so in order to create a numeric matrix we need to one-hot encode them. We can do so using the model.matrix()
function, which takes in a formula and a data frame and creates the corresponding design matrix, converting factor variables to dummy variables and applying any transformations specified in the formula. Note that we drop the first generated column, which corresponds to the intercept. The transformation applied to our outcome does not affect the design matrix.
x <- model.matrix(log(SBP) ~ ., nhanes)[, -1]
head(x)
#> AGE SEXFemale RACEOther Hispanic RACENon-Hispanic White
#> 1 77 0 0 1
#> 2 49 0 0 1
#> 3 37 0 0 1
#> 4 70 0 0 0
#> 5 81 0 0 1
#> 6 38 1 0 1
#> RACENon-Hispanic Black RACEOther Race EDUCATIONHS EDUCATIONMoreThanHS
#> 1 0 0 0 1
#> 2 0 0 0 1
#> 3 0 0 0 1
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 0 0 1
#> INCOME SMOKEQuitSmoke SMOKEStillSmoke LEAD BMI_CAT25<BMI<30
#> 1 5.00 0 0 1.609 0
#> 2 5.00 1 0 0.470 1
#> 3 4.93 0 0 0.875 0
#> 4 1.07 1 0 0.470 1
#> 5 2.67 0 1 1.705 1
#> 6 4.52 0 1 0.405 1
#> BMI_CATBMI>=30 ALCYes
#> 1 0 1
#> 2 0 1
#> 3 1 1
#> 4 0 1
#> 5 0 1
#> 6 0 1
Our outcome vector corresponds to log transformed systolic blood pressure.
y <- log(nhanes$SBP)
The glmnet()
function fits an elastic net regression model. This requires us to specify our input matrix x
and response variable y
. Additionally, we can specify the assumed distribution for y
using the family
argument. In our subsequent example, we fit this model with \(\alpha = 1\) and 25 different values of \(\lambda\). By default, glmnet()
sets \(\alpha\) to 1 and creates a grid of 100 different values of lambda
. It is also the default to standardize x
, which we can turn off by specifying standardize = FALSE
in our function call.
mod_lasso <- glmnet(x, y, family = "gaussian", alpha = 1,
                    nlambda = 25)
If we plot the resulting object, we can see the model coefficients for each resulting model by plotting how the coefficient for each variable changes with the value of \(\lambda\). The plot()
function by default plots these against the penalty term but we can also specify to plot against the \(\lambda\) values on the log scale. The label
argument adds a label to each line, though these are often hard to read. The numbers at the top of the plot indicate how many non-zero coefficients were included in the model for different \(\lambda\) values. Read the documentation ?glmnet
to see the other possible inputs including the penalty.factor
and weights
arguments.
plot(mod_lasso, xvar = "lambda", label = TRUE)
We can also print our results. This prints a matrix with the values of \(\lambda\) used. For each \(\lambda\) value we can also see the number of nonzero coefficients (Df
) and the percent deviance explained (%dev
).
print(mod_lasso)
#>
#> Call: glmnet(x = x, y = y, family = "gaussian", alpha = 1, nlambda = 25)
#>
#> Df %Dev Lambda
#> 1 0 0.0 0.0674
#> 2 1 11.6 0.0459
#> 3 1 17.0 0.0313
#> 4 1 19.5 0.0213
#> 5 4 21.0 0.0145
#> 6 5 23.1 0.0099
#> 7 7 24.3 0.0067
#> 8 8 24.9 0.0046
#> 9 9 25.2 0.0031
#> 10 9 25.4 0.0021
#> 11 12 25.6 0.0014
#> 12 13 25.6 0.0010
#> 13 14 25.7 0.0007
#> 14 14 25.7 0.0005
#> 15 14 25.7 0.0003
#> 16 14 25.7 0.0002
#> 17 14 25.7 0.0001
#> 18 14 25.7 0.0001
#> 19 15 25.7 0.0001
#> 20 15 25.7 0.0000
We can extract the model for a particular value of \(\lambda\) using the coef()
function. The argument s
specifies the value of \(\lambda\). For the particular value of \(\lambda\) chosen in the following code, only age has a non-zero coefficient. Note that the coef()
function returns the coefficients on the original scale.
coef(mod_lasso, s = 0.045920)
#> 16 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 4.75241
#> AGE 0.00121
#> SEXFemale .
#> RACEOther Hispanic .
#> RACENon-Hispanic White .
#> RACENon-Hispanic Black .
#> RACEOther Race .
#> EDUCATIONHS .
#> EDUCATIONMoreThanHS .
#> INCOME .
#> SMOKEQuitSmoke .
#> SMOKEStillSmoke .
#> LEAD .
#> BMI_CAT25<BMI<30 .
#> BMI_CATBMI>=30 .
#> ALCYes .
We can also use the predict()
function to predict blood pressure for this particular model. In the function call, we have specified our value of \(\lambda\) as well as our data matrix x
as the data to predict on.
pred_lasso <- predict(mod_lasso, s = 0.045920, newx = x)
This shows our observed model fit for a fairly high penalty term. In order to choose the best value of \(\lambda\), we use 5-fold cross-validation. First, we randomly assign each observation to one of five folds using the sample()
function. We can see that this splits the data into folds of roughly equal size.
set.seed(1)
foldid <- sample(1:5, size = nrow(x), replace = TRUE)
table(foldid)
#> foldid
#> 1 2 3 4 5
#> 6081 5967 6048 6188 6121
Next, we call the cv.glmnet()
function which implements k-fold cross-validation across a grid of \(\lambda\) values. Similar to before, we specified x
, y
, and alpha = 1
. This time, we also include the measure to use for cross-validation (mse
indicates mean squared error) and provide the fold vector foldid
. If you do not want to provide folds, you can instead use the nfolds
argument to specify the number of folds and the function creates them. Plotting the returned object shows us the estimated mean squared error for different values of \(\lambda\) as well as error bars for each estimate. Similar to before it also shows the number of non-zero coefficients at the top.
cv_lasso <- cv.glmnet(x, y, foldid = foldid, alpha = 1,
                      type.measure = "mse")
plot(cv_lasso)
There are two vertical dashed lines included in the plot. These correspond to two values of \(\lambda\) that are stored in our object. The first is lambda.min
. This corresponds to the \(\lambda\) value with the lowest estimated mean squared error. The other is lambda.1se
. This corresponds to the largest \(\lambda\) value whose estimated mean squared error is within one standard error of the lowest value. As indicated in the plot, this corresponds to a model with fewer included coefficients and higher regularization.
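As a quick check, assuming the cv_lasso object fit above, we can print the two stored \(\lambda\) values and count how many non-zero coefficients each one selects; coef() on a cv.glmnet object accepts s = "lambda.min" or s = "lambda.1se" directly.

# the two lambda values stored by cv.glmnet()
c(lambda.min = cv_lasso$lambda.min, lambda.1se = cv_lasso$lambda.1se)

# number of non-zero coefficients (excluding the intercept) at each value
sum(coef(cv_lasso, s = "lambda.min")[-1] != 0)
sum(coef(cv_lasso, s = "lambda.1se")[-1] != 0)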
We use the lambda.min
value as our chosen \(\lambda\) value. To extract the coefficients corresponding to this \(\lambda\) value we again use the coef()
function. However, this \(\lambda\) might not be one of the initial 25 used for our mod_lasso
object. In this case, the coef()
function uses linear interpolation to get the estimated coefficients. If we want to refit our model for this particular value of \(\lambda\) we can instead specify the argument exact = TRUE
and provide x
and y
.
lasso_coef <- coef(mod_lasso, s = cv_lasso$lambda.min,
                   exact = TRUE, x = x, y = y)
lasso_coef
#> 16 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 4.63458
#> AGE 0.00367
#> SEXFemale -0.02652
#> RACEOther Hispanic -0.00677
#> RACENon-Hispanic White -0.00531
#> RACENon-Hispanic Black 0.03257
#> RACEOther Race 0.00493
#> EDUCATIONHS .
#> EDUCATIONMoreThanHS -0.00942
#> INCOME -0.00396
#> SMOKEQuitSmoke -0.00753
#> SMOKEStillSmoke -0.00397
#> LEAD 0.00880
#> BMI_CAT25<BMI<30 0.01452
#> BMI_CATBMI>=30 0.03688
#> ALCYes 0.00547
We now repeat the same process for a model with \(\alpha = 0\) and \(\alpha = 0.5\). In this case, we call the glmnet()
function with our chosen \(\lambda\) value to find the coefficients. This is equivalent to what we did for our lasso model. Last, we create a data frame with the estimated coefficients for each model. The coef()
function returns a matrix, so we convert the results to numeric vectors before combining them.
# cross validation using same folds
cv_ridge <- cv.glmnet(x, y, foldid = foldid, alpha = 0,
                      type.measure = "mse")
cv_elastic <- cv.glmnet(x, y, foldid = foldid, alpha = 0.5,
                        type.measure = "mse")

# refit models on the full data with the chosen lambdas
mod_ridge <- glmnet(x, y, alpha = 0, lambda = cv_ridge$lambda.min)
mod_elastic <- glmnet(x, y, alpha = 0.5, lambda = cv_elastic$lambda.min)

# extract coefficients for a table
res_coef <- data.frame(name = c("Intercept", colnames(x)),
                       lasso = round(as.numeric(lasso_coef), 3),
                       ridge = round(as.numeric(coef(mod_ridge)), 3),
                       elastic = round(as.numeric(coef(mod_elastic)), 3))
res_coef
#>    name lasso ridge elastic
#> 1 Intercept 4.635 4.646 4.650
#> 2 AGE 0.004 0.003 0.003
#> 3 SEXFemale -0.027 -0.025 -0.020
#> 4 RACEOther Hispanic -0.007 -0.007 0.000
#> 5 RACENon-Hispanic White -0.005 -0.005 -0.002
#> 6 RACENon-Hispanic Black 0.033 0.031 0.027
#> 7 RACEOther Race 0.005 0.004 0.000
#> 8 EDUCATIONHS 0.000 0.000 0.000
#> 9 EDUCATIONMoreThanHS -0.009 -0.010 -0.006
#> 10 INCOME -0.004 -0.004 -0.002
#> 11 SMOKEQuitSmoke -0.008 -0.006 0.000
#> 12 SMOKEStillSmoke -0.004 -0.005 0.000
#> 13 LEAD 0.009 0.011 0.008
#> 14 BMI_CAT25<BMI<30 0.015 0.014 0.000
#> 15 BMI_CATBMI>=30 0.037 0.035 0.022
#> 16 ALCYes 0.005 0.004 0.000
The estimated coefficients are fairly similar across the three models. We can also compare their mean squared errors, which are likewise similar. Since our lasso model was fit on a grid of \(\lambda\) values, we again have to specify the chosen \(\lambda\) value when predicting.
mean((nhanes$SBP - exp(predict(mod_lasso, newx = x,
s = cv_lasso$lambda.min)))^2)
#> [1] 268
mean((nhanes$SBP - exp(predict(mod_ridge, newx = x)))^2)
#> [1] 268
mean((nhanes$SBP - exp(predict(mod_elastic, newx = x)))^2)
#> [1] 270
15.3 Best Subset
The second package we introduce in this chapter is one that allows us to fit models with an L0 penalty term. The package L0Learn considers penalties of the following forms.
L0 only: \(\lambda ||\beta||_0\)
L0L1: \(\lambda ||\beta ||_0 + \gamma ||\beta||_1\)
L0L2: \(\lambda ||\beta||_0 + \gamma ||\beta ||_2^2\)
To fit a model with an L0 penalty term, we use the L0Learn.fit()
function. Similar to glmnet()
, we need to specify our input matrix x
and response vector y
as well as our penalty using the penalty
argument. We can also specify a number of \(\lambda\) values to consider nLambda
and specify the family through the loss function loss
.
mod_l0 <- L0Learn.fit(x, y, penalty = "L0", loss = "SquaredError",
                      nLambda = 20)
Plotting the returned object also shows how the coefficients for each variable change with the penalty term, given by the support size or number of non-zero coefficients in this case. We can also print the returned object to see the different values of \(\lambda\) used and the corresponding number of included variables.
plot(mod_l0)
print(mod_l0)
#> lambda gamma suppSize
#> 1 1.08e-01 0 0
#> 2 1.07e-01 0 1
#> 3 6.51e-03 0 2
#> 4 5.00e-03 0 3
#> 5 4.22e-03 0 4
#> 6 1.75e-03 0 5
#> 7 5.36e-04 0 7
#> 8 3.23e-04 0 8
#> 9 1.92e-04 0 9
#> 10 1.42e-04 0 10
#> 11 1.04e-04 0 11
#> 12 6.80e-05 0 12
#> 13 1.75e-05 0 14
#> 14 4.06e-07 0 15
#> 15 3.94e-07 0 15
#> 16 3.03e-08 0 15
#> 17 5.87e-09 0 15
#> 18 9.43e-10 0 15
#> 19 3.48e-10 0 15
#> 20 2.19e-10 0 15
To extract the model for a particular value, we can use the coef()
function and specify the \(\lambda\) and \(\gamma\) value to use.
coef_l0 <- coef(mod_l0, lambda = 1.75475e-03, gamma = 0)
Unfortunately, this output doesn't include variable names, so we add them manually.
rownames(coef_l0) <- c("Intercept", colnames(x))
coef_l0
#> 16 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept 4.63847
#> AGE 0.00379
#> SEXFemale -0.03130
#> RACEOther Hispanic .
#> RACENon-Hispanic White .
#> RACENon-Hispanic Black 0.03753
#> RACEOther Race .
#> EDUCATIONHS .
#> EDUCATIONMoreThanHS .
#> INCOME -0.00535
#> SMOKEQuitSmoke .
#> SMOKEStillSmoke .
#> LEAD .
#> BMI_CAT25<BMI<30 .
#> BMI_CATBMI>=30 0.02760
#> ALCYes .
If instead we want to include a penalty with both an L0 and an L2 norm term, we can change our penalty argument to penalty = "L0L2" and specify the number of \(\gamma\) values to test.
mod_l0l2 <- L0Learn.fit(x, y, penalty = "L0L2",
                        loss = "SquaredError",
                        nLambda = 20, nGamma = 10)
The L0Learn package also includes a function that uses cross-validation to choose these parameters. Unfortunately, it does not include an option to specify your own folds. Instead, we use the nFolds argument to request 5-fold cross-validation. This function also allows us to specify the number of \(\lambda\) and \(\gamma\) values to consider, or we can use the defaults. Plotting the result shows the estimated mean squared error with error bars for each model and the corresponding support size.
cv_l0l2 <- L0Learn.cvfit(x, y, loss = "SquaredError",
                         nFolds = 5, penalty = "L0L2")
plot(cv_l0l2)
The returned results are stored in cvMeans. This is a list of matrices, one for each value of \(\gamma\). To extract these into a more manageable form, we use the sapply() function to convert each matrix to a numeric vector and create a matrix of results. The columns of this matrix correspond to the 10 \(\gamma\) values used, and the entries within each column correspond to the 100 \(\lambda\) values chosen for that particular value of \(\gamma\). We use the which() function to find which combination has the lowest estimated mean squared error.
cv_res <- sapply(cv_l0l2$cvMeans, as.numeric)
min_ind <- which(cv_res == min(cv_res), arr.ind = TRUE)
min_ind
#>      row col
#> [1,]  11  10
We can then extract the corresponding \(\gamma\) and \(\lambda\) values through the fit object returned in our result.
gamma_min <- cv_l0l2$fit$gamma[[min_ind[2]]]
lambda_min <- cv_l0l2$fit$lambda[[min_ind[2]]][min_ind[1]]
Next, we find the estimated coefficients for this model using the coef()
function on the cross-validation object cv_l0l2
.
cv_coef_l0 <- coef(cv_l0l2, gamma = gamma_min, lambda = lambda_min)
rownames(cv_coef_l0) <- c("Intercept", colnames(x))
cv_coef_l0
#> 16 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept 4.61671
#> AGE 0.00382
#> SEXFemale .
#> RACEOther Hispanic .
#> RACENon-Hispanic White .
#> RACENon-Hispanic Black 0.04216
#> RACEOther Race .
#> EDUCATIONHS .
#> EDUCATIONMoreThanHS .
#> INCOME .
#> SMOKEQuitSmoke .
#> SMOKEStillSmoke .
#> LEAD .
#> BMI_CAT25<BMI<30 .
#> BMI_CATBMI>=30 .
#> ALCYes .
Last, we use the predict()
function on cv_l0l2
to evaluate our resulting model. To do so, we need to specify \(\lambda\) and \(\gamma\) as well as our data x
. This model is much sparser than the ones in the previous section but our mean squared error on our training data is a little higher.
pred_l0l2 <- predict(cv_l0l2, gamma = gamma_min,
                     lambda = lambda_min, newx = x)
mean((nhanes$SBP - exp(pred_l0l2))^2)
#> [1] 275
15.4 Exercises
For these exercises, we use the nyts
data from Chapter 14. Our outcome of interest is whether or not someone uses any tobacco product.
1. Create a new variable tobacco_use representing any tobacco use in the past 30 days (including e-cigarettes, cigarettes, and/or cigars). Then, create a new data frame nyts_sub that contains this new column as well as columns for location, age, sex, race, whether someone identifies with the LGBT community, grades in the past year, perceived cigarette use, perceived e-cigarette use, psych distress, and family affluence.

2. Create an outcome vector y corresponding to the column tobacco_use and a model matrix x containing all other covariates.

3. Fit an L1 (lasso), L2 (ridge), and L0 (best subset) penalized regression model using 5-fold cross-validation. Create a data frame with the corresponding coefficients for all models. Be sure to update the loss function to reflect our binary outcome (a minimal sketch of the binary-outcome setup follows these exercises).

4. Report the AUC and accuracy of these three models on the training data.
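For the model fitting step, the following minimal sketch shows how the calls from this chapter change for a binary outcome; it assumes the x and y objects created in the earlier exercises and is only a starting point, not a full solution.

# sketch only: binary-outcome versions of the earlier calls,
# assuming x (model matrix) and y (tobacco_use) have been created
cv_lasso_bin <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                          nfolds = 5, type.measure = "auc")
cv_l0_bin <- L0Learn.cvfit(x, y, loss = "Logistic", penalty = "L0",
                           nFolds = 5)
# note: check ?L0Learn.fit for the expected coding of a binary outcome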