An open API service indexing awesome lists of open source software.

https://github.com/dmolitor/bolasso

Model consistent Lasso estimation through the bootstrap.
https://github.com/dmolitor/bolasso

bolasso bootstrap lasso rstats variable-selection

Last synced: 6 months ago
JSON representation

Model consistent Lasso estimation through the bootstrap.

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r message=FALSE, warning=FALSE, paged.print=TRUE, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "85%",
dpi = 300
)

set.seed(321) # Reproducible results
```

# bolasso

[![R-CMD-check](https://github.com/dmolitor/bolasso/workflows/R-CMD-check/badge.svg)](https://github.com/dmolitor/bolasso/actions)
[![pkgdown](https://github.com/dmolitor/bolasso/workflows/pkgdown/badge.svg)](https://github.com/dmolitor/bolasso/actions)
[![Codecov test coverage](https://codecov.io/gh/dmolitor/bolasso/branch/main/graph/badge.svg)](https://app.codecov.io/gh/dmolitor/bolasso?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/bolasso)](https://CRAN.R-project.org/package=bolasso)

The goal of bolasso is to implement bootstrap-enhanced Lasso (and more generally,
penalized regression) estimation, as proposed originally in [Bach (2008)](#2)
and extended by [Bunea et al. (2011)](#3) and [Abram et al. (2016)](#1). These methods
focus primarily on variable selection and propose two similar, but slightly
different, variable selection algorithms; the variable inclusion probability
(VIP) algorithm (Bach; Bunea et al.), and the bootstrap
distribution quantile (QNT) algorithm (Abram et al.). Beyond implementing
both these variable selection methods, bolasso also provides utilities for
making bagged predictions, examining coefficient distributions, and plotting.

## Installation

Install bolasso from CRAN:
```r
install.packages("bolasso")
```

Or install the development version from GitHub with:
```r
# install.packages("pak")
pak::pkg_install("dmolitor/bolasso@dev")
```
## Usage

To illustrate the usage of bolasso, we'll use the
[Pima Indians Diabetes dataset](http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html)
to determine which factors are important predictors of testing positive
for diabetes. For a full description of the input variables, see the link above.

### Load requisite packages and data

```{r echo=TRUE, message=FALSE, warning=FALSE}
library(bolasso)
library(ggplot2)
library(tibble)

data(PimaIndiansDiabetes, package = "mlbench")

# Quick overview of the dataset
str(PimaIndiansDiabetes)
```

First, let's create a train/test split of our data, and then run 100-fold
bootstrapped Lasso with `glmnet`.

```{r}
train_idx <- sample(1:nrow(PimaIndiansDiabetes), round(0.7*nrow(PimaIndiansDiabetes)))
train <- PimaIndiansDiabetes[train_idx, ]
test <- PimaIndiansDiabetes[-train_idx, ]

model <- bolasso(
diabetes ~ .,
data = train,
n.boot = 100,
progress = FALSE,
family = "binomial"
)
```

### Variable selection

Next, using a threshold of 0.95 we can extract the selected variables using the
VIP method, which extracts all variables
that were selected (had non-zero coefficients) in >= 95% of the bootstrapped models.
We'll use the regularization parameter `lambda.min` that minimizes
cross-validation error.
```{r}
selected_variables(model, threshold = 0.95, method = "vip", select = "lambda.min")
```

Note that this returned a tibble with the selected variables as columns and the
coefficients for each of the bootstrapped models as rows. If you want to simply
return only the variable names, you can add the `var_names_only` argument:
```{r}
selected_variables(model, 0.95, "vip", var_names_only = TRUE)
```

We can compare the selected variables using the VIP method to the QNT
method, which selects all variables that have
a 95% bootstrap confidence interval that does not contain 0:
```{r}
selected_variables(model, 0.95, "qnt", var_names_only = TRUE)
```

Note that the number of selected variables with QNT will always be <= than
with VIP. The default method for bolasso is `method = "vip"`.

#### Variable selection thresholds

It may be that, instead of selecting variables for a given threshold and
method, we want to see the largest threshold at which each variable
would be selected by both the VIP and QNT methods. We can quickly
visualize this with the `plot_selection_thresholds` function.
```{r}
plot_selection_thresholds(model, select = "lambda.min")
```

You can also get these thresholds in a tibble:
```{r}
selection_thresholds(model, select = "lambda.min")
```

### Coefficients

#### All coefficients

bolasso also supports moving beyond variable selection and understanding
the bootstrapped variable coefficients. We can extract a tidy tibble where
each variable is a column, and each row represents a bootstrap fold, and the
values are the corresponding estimated coefficients.
```{r}
tidy(model, select = "lambda.min")
```

bolasso also allows us to plot the bootstrap distribution of variable
coefficients. Suppose that we want to quickly inspect this distribution for
each of our variables. We can achieve this by simply plotting our model.
```{r}
plot(model, select = "lambda.min")
```

Now, suppose for example we are particularly interested in the coefficient
distributions for the `triceps`, `pressure`, and `glucose` variables. We can
plot the distributions for just these variables:
```{r}
plot(model, covariates = c(glucose, pressure, triceps))
```

Note: If there are more than 30 variables included in our model, then this
will plot the 30 variables with the largest absolute mean coefficients.

#### Selected variable coefficients

If we want to plot the coefficient distributions for only the selected
variables, we can use `plot_selected_variables` which will give us pretty
much the same thing as `plot`.
```{r}
plot_selected_variables(
model,
threshold = 0.95,
method = "vip",
select = "lambda.min"
)
```

Just like `plot` we can also focus on a subset of our selected variables.
```{r}
plot_selected_variables(
model,
covariates = c(pregnant, mass),
threshold = 0.95,
method = "vip",
select = "lambda.min"
)
```

### Predictions

Finally, we can make predictions using our bolasso model on new data. For
example, the following code shows how we would generate predicted probabilites
on our `test` data.
```{r}
as_tibble(predict(model, test, select = "lambda.min", type = "response"))
```

Note that this outputs an (n x p) matrix of predictions where n is the number
of rows in our test set, p is the number of bootstraps, and each column
represents the predictions from one of our bootstrapped models. To combine
these into a single prediction per observation, we could take the average
for each observation across the models:
```{r}
tibble(
predictions = rowMeans(
predict(model, test, select = "lambda.min", type = "response")
)
)
```

### Fast estimation 🏎️💨

For each bootstrapped model, bolasso uses cross-validation to find the optimal
regularization parameter lambda. In glmnet, the default number of cross-validation
folds is 10. This can quickly become computationally expensive and slow,
especially when using many bootstrap replicates. For example, with 1,000
bootstrap replicates, this results in estimating models on 10,000 cross-validation
sets.

To address this, we can activate the `fast = TRUE` argument in bolasso. Instead of
using cross-validation to find the optimal lambda for each bootstrap model,
fast bolasso runs a single cross-validated regression on the full dataset to
identify the optimal lambda. Then each bootstrapped model uses that lambda
as its regularization parameter.

The following comparison shows the computation time of standard bolasso vs fast
bolasso across increasing bootstrap replicates. The plot displays the number of
seconds each algorithm takes to complete.
```{r}
# Compare standard vs. fast bolasso across different bootstrap values
times <- lapply(
c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3),
\(x) {
time_standard <- system.time({
bolasso(
diabetes ~ .,
data = train,
n.boot = x,
progress = FALSE,
family = "binomial"
)
})
time_fast <- system.time({
bolasso(
diabetes ~ .,
data = train,
n.boot = x,
progress = FALSE,
family = "binomial",
fast = TRUE
)
})
return(
tibble::tibble("regular" = time_standard[[3]], "fast" = time_fast[[3]])
)
}
)

# Make a data.frame out of the times
times_df <- do.call(rbind, times) |>
transform(
id = 1:11,
n_bootstrap = c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3)
) |>
reshape(
varying = c("regular", "fast"),
v.names = "time",
times = c("regular", "fast"),
timevar = "algorithm",
idvar = c("id", "n_bootstrap"),
direction = "long"
)

# Plot it!
ggplot(times_df, aes(x = n_bootstrap, y = time, color = factor(algorithm))) +
geom_point() +
geom_line() +
labs(x = "N Bootstraps", y = "Time (seconds)") +
scale_y_continuous(breaks = seq(0, 60, 10)) +
theme_minimal() +
theme(legend.title = element_blank())
```

Fast bolasso clearly achieves some pretty massive speedups over the standard
version! This difference in speed will only be more accentuated when
estimating on larger datasets.

#### What do we lose with standard vs. fast?

There's never a free lunch, so to be clear about the tradeoffs between the
standard and fast versions of bolasso, the following shows the difference
in predictive accuracy on our hold-out test set.

```{r}
model_standard <- bolasso(
diabetes ~ .,
data = train,
n.boot = 100,
progress = FALSE,
family = "binomial"
)
model_fast <- bolasso(
diabetes ~ .,
data = train,
n.boot = 100,
progress = FALSE,
family = "binomial",
fast = TRUE
)

model_standard_preds <- ifelse(
rowMeans(predict(model_standard, test, type = "response")) >= 0.5,
yes = 1,
no = 0
)
model_fast_preds <- ifelse(
rowMeans(predict(model_fast, test, type = "response")) >= 0.5,
yes = 1,
no = 0
)
truth <- as.integer(test$diabetes) - 1
```
```{r, echo = FALSE}
cat(
"Standard Bolasso accuracy:",
round(100*sum(model_standard_preds == truth)/length(truth), 2),
"%\n",
"\rFast Bolasso accuracy:",
round(100*sum(model_fast_preds == truth)/length(truth), 2),
"%\n"
)
```

It's important to note that fast bolasso should be thought of more
as a rough-and-ready algorithm that is better for quick iteration and might
have worse empirical performance than the standard algorithm.

#### Parallelizing bolasso

We can also fit bolasso bootstrap models in parallel via the
[future](https://CRAN.R-project.org/package=future) package. The future package
supports a wide variety of parallelization, from local multi-core to remote
compute clusters. Parallelizing bolasso is as simple as initializing the parallel method
prior to executing the bolasso function. For example, the following setup
will execute bolasso in parallel R sessions.

```{r}
future::plan("multisession")
time_parallel <- system.time({
bolasso(
diabetes ~ .,
data = train,
n.boot = 1000,
progress = FALSE,
family = "binomial"
)
})
future::plan("sequential")

time_sequential <- system.time({
bolasso(
diabetes ~ .,
data = train,
n.boot = 1000,
progress = FALSE,
family = "binomial"
)
})
```
```{r, echo = FALSE}
cat(
"Parallel bolasso time (seconds):",
round(time_parallel[[3]], 3),
"\n\rSequential bolasso time (seconds):",
round(time_sequential[[3]], 3)
)
```

### Beyond the Lasso

bolasso also allows us to fit penalized regression models beyond the Lasso.
For example, suppose we want to fit a bootstrap-enhanced elasticnet model
with a mixing parameter of 0.5 (an even mix of the Ridge and Lasso
regularization terms). We can simply pass the underlying `glmnet::glmnet`
argument `alpha = 0.5` through bolasso. The following code compares selected
variables between the Lasso and elasticnet models.
```{r}
lasso <- bolasso(
diabetes ~ .,
data = train,
n.boot = 100,
progress = FALSE,
family = "binomial"
)

elnet <- bolasso(
diabetes ~ .,
data = train,
n.boot = 100,
progress = FALSE,
family = "binomial",
alpha = 0.5
)
```
```{r, echo = FALSE}
cat(
"Lasso selected variables:",
selected_variables(lasso, 0.95, var_names_only = TRUE),
"\n\rElnet selected variables:",
selected_variables(elnet, 0.95, var_names_only = TRUE)
)
```

## References

[1]Abram, Samantha V et al. “Bootstrap Enhanced Penalized
Regression for Variable Selection with Neuroimaging Data.” Frontiers in
neuroscience vol. 10 344. 28 Jul. 2016, doi:10.3389/fnins.2016.00344

[2]Bach, Francis. “Bolasso: Model Consistent Lasso Estimation
through the Bootstrap.” ArXiv:0804.1302 [Cs, Math, Stat], 2008.
https://arxiv.org/abs/0804.1302.

[3]Bunea, Florentina et al. “Penalized least squares regression
methods and applications to neuroimaging.” NeuroImage vol. 55,4 (2011):
1519-27. doi:10.1016/j.neuroimage.2010.12.028