https://github.com/dmolitor/bolasso

Model consistent Lasso estimation through the bootstrap.
https://github.com/dmolitor/bolasso
bolasso bootstrap lasso rstats variable-selection
Last synced: 6 months ago
JSON representation
Model consistent Lasso estimation through the bootstrap.
Host: GitHub
URL: https://github.com/dmolitor/bolasso
Owner: dmolitor
License: other
Created: 2021-09-17T17:51:40.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2024-12-21T19:20:37.000Z (10 months ago)
Last Synced: 2025-04-18T10:23:36.753Z (6 months ago)
Topics: bolasso, bootstrap, lasso, rstats, variable-selection
Language: R
Homepage: https://dmolitor.github.io/bolasso/
Size: 6.07 MB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project

README

          ---

output: github_document

---

```{r message=FALSE, warning=FALSE, paged.print=TRUE, echo=FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "85%",

  dpi = 300

)

set.seed(321) # Reproducible results

```

# bolasso 

[![R-CMD-check](https://github.com/dmolitor/bolasso/workflows/R-CMD-check/badge.svg)](https://github.com/dmolitor/bolasso/actions)

[![pkgdown](https://github.com/dmolitor/bolasso/workflows/pkgdown/badge.svg)](https://github.com/dmolitor/bolasso/actions)

[![Codecov test coverage](https://codecov.io/gh/dmolitor/bolasso/branch/main/graph/badge.svg)](https://app.codecov.io/gh/dmolitor/bolasso?branch=main)

[![CRAN status](https://www.r-pkg.org/badges/version/bolasso)](https://CRAN.R-project.org/package=bolasso)

The goal of bolasso is to implement bootstrap-enhanced Lasso (and more generally,

penalized regression) estimation, as proposed originally in [Bach (2008)](#2)

and extended by [Bunea et al. (2011)](#3) and [Abram et al. (2016)](#1). These methods

focus primarily on variable selection and propose two similar, but slightly

different, variable selection algorithms; the variable inclusion probability

(VIP) algorithm (Bach; Bunea et al.), and the bootstrap

distribution quantile (QNT) algorithm (Abram et al.). Beyond implementing

both these variable selection methods, bolasso also provides utilities for

making bagged predictions, examining coefficient distributions, and plotting.

## Installation

Install bolasso from CRAN:

```r

install.packages("bolasso")

```

Or install the development version from GitHub with:

```r

# install.packages("pak")

pak::pkg_install("dmolitor/bolasso@dev")

```

## Usage

To illustrate the usage of bolasso, we'll use the 

[Pima Indians Diabetes dataset](http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html)

to determine which factors are important predictors of testing positive

for diabetes. For a full description of the input variables, see the link above.

### Load requisite packages and data

```{r echo=TRUE, message=FALSE, warning=FALSE}

library(bolasso)

library(ggplot2)

library(tibble)

data(PimaIndiansDiabetes, package = "mlbench")

# Quick overview of the dataset

str(PimaIndiansDiabetes)

```

First, let's create a train/test split of our data, and then run 100-fold

bootstrapped Lasso with `glmnet`.

```{r}

train_idx <- sample(1:nrow(PimaIndiansDiabetes), round(0.7*nrow(PimaIndiansDiabetes)))

train <- PimaIndiansDiabetes[train_idx, ]

test <- PimaIndiansDiabetes[-train_idx, ]

model <- bolasso(

  diabetes ~ .,

  data = train,

  n.boot = 100,

  progress = FALSE,

  family = "binomial"

)

```

### Variable selection

Next, using a threshold of 0.95 we can extract the selected variables using the

VIP method, which extracts all variables

that were selected (had non-zero coefficients) in >= 95% of the bootstrapped models.

We'll use the regularization parameter `lambda.min` that minimizes

cross-validation error.

```{r}

selected_variables(model, threshold = 0.95, method = "vip", select = "lambda.min")

```

Note that this returned a tibble with the selected variables as columns and the

coefficients for each of the bootstrapped models as rows. If you want to simply

return only the variable names, you can add the `var_names_only` argument:

```{r}

selected_variables(model, 0.95, "vip", var_names_only = TRUE)

```

We can compare the selected variables using the VIP method to the QNT

method, which selects all variables that have

a 95% bootstrap confidence interval that does not contain 0:

```{r}

selected_variables(model, 0.95, "qnt", var_names_only = TRUE)

```

Note that the number of selected variables with QNT will always be <= than

with VIP. The default method for bolasso is `method = "vip"`.

#### Variable selection thresholds

It may be that, instead of selecting variables for a given threshold and

method, we want to see the largest threshold at which each variable

would be selected by both the VIP and QNT methods. We can quickly

visualize this with the `plot_selection_thresholds` function.

```{r}

plot_selection_thresholds(model, select = "lambda.min")

```

You can also get these thresholds in a tibble:

```{r}

selection_thresholds(model, select = "lambda.min")

```

### Coefficients

#### All coefficients

bolasso also supports moving beyond variable selection and understanding

the bootstrapped variable coefficients. We can extract a tidy tibble where

each variable is a column, and each row represents a bootstrap fold, and the

values are the corresponding estimated coefficients.

```{r}

tidy(model, select = "lambda.min")

```

bolasso also allows us to plot the bootstrap distribution of variable

coefficients. Suppose that we want to quickly inspect this distribution for

each of our variables. We can achieve this by simply plotting our model.

```{r}

plot(model, select = "lambda.min")

```

Now, suppose for example we are particularly interested in the coefficient

distributions for the `triceps`, `pressure`, and `glucose` variables. We can

plot the distributions for just these variables:

```{r}

plot(model, covariates = c(glucose, pressure, triceps))

```

Note: If there are more than 30 variables included in our model, then this

will plot the 30 variables with the largest absolute mean coefficients.

#### Selected variable coefficients

If we want to plot the coefficient distributions for only the selected

variables, we can use `plot_selected_variables` which will give us pretty

much the same thing as `plot`.

```{r}

plot_selected_variables(

  model,

  threshold = 0.95,

  method = "vip",

  select = "lambda.min"

)

```

Just like `plot` we can also focus on a subset of our selected variables.

```{r}

plot_selected_variables(

  model,

  covariates = c(pregnant, mass),

  threshold = 0.95,

  method = "vip",

  select = "lambda.min"

)

```

### Predictions

Finally, we can make predictions using our bolasso model on new data. For

example, the following code shows how we would generate predicted probabilites

on our `test` data.

```{r}

as_tibble(predict(model, test, select = "lambda.min", type = "response"))

```

Note that this outputs an (n x p) matrix of predictions where n is the number

of rows in our test set, p is the number of bootstraps, and each column

represents the predictions from one of our bootstrapped models. To combine

these into a single prediction per observation, we could take the average

for each observation across the models:

```{r}

tibble(

  predictions = rowMeans(

    predict(model, test, select = "lambda.min", type = "response")

  )

)

```

### Fast estimation 🏎️💨

For each bootstrapped model, bolasso uses cross-validation to find the optimal

regularization parameter lambda. In glmnet, the default number of cross-validation

folds is 10. This can quickly become computationally expensive and slow,

especially when using many bootstrap replicates. For example, with 1,000

bootstrap replicates, this results in estimating models on 10,000 cross-validation

sets.

To address this, we can activate the `fast = TRUE` argument in bolasso. Instead of

using cross-validation to find the optimal lambda for each bootstrap model,

fast bolasso runs a single cross-validated regression on the full dataset to

identify the optimal lambda. Then each bootstrapped model uses that lambda

as its regularization parameter.

The following comparison shows the computation time of standard bolasso vs fast

bolasso across increasing bootstrap replicates. The plot displays the number of

seconds each algorithm takes to complete.

```{r}

# Compare standard vs. fast bolasso across different bootstrap values

times <- lapply(

  c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3),

  \(x) {

    time_standard <- system.time({

      bolasso(

        diabetes ~ .,

        data = train,

        n.boot = x,

        progress = FALSE,

        family = "binomial"

      )

    })

    time_fast <- system.time({

      bolasso(

        diabetes ~ .,

        data = train,

        n.boot = x,

        progress = FALSE,

        family = "binomial",

        fast = TRUE

      )

    })

    return(

      tibble::tibble("regular" = time_standard[[3]], "fast" = time_fast[[3]])

    )

  }

)

# Make a data.frame out of the times

times_df <- do.call(rbind, times) |>

  transform(

    id = 1:11,

    n_bootstrap = c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3)

  ) |>

  reshape(

    varying = c("regular", "fast"),

    v.names = "time",

    times = c("regular", "fast"),

    timevar = "algorithm",

    idvar = c("id", "n_bootstrap"),

    direction = "long"

  )

# Plot it!

ggplot(times_df, aes(x = n_bootstrap, y = time, color = factor(algorithm))) +

  geom_point() +

  geom_line() +

  labs(x = "N Bootstraps", y = "Time (seconds)") +

  scale_y_continuous(breaks = seq(0, 60, 10)) +

  theme_minimal() +

  theme(legend.title = element_blank())

```

Fast bolasso clearly achieves some pretty massive speedups over the standard

version! This difference in speed will only be more accentuated when

estimating on larger datasets.

#### What do we lose with standard vs. fast?

There's never a free lunch, so to be clear about the tradeoffs between the

standard and fast versions of bolasso, the following shows the difference

in predictive accuracy on our hold-out test set.

```{r}

model_standard <- bolasso(

  diabetes ~ .,

  data = train,

  n.boot = 100,

  progress = FALSE,

  family = "binomial"

)

model_fast <- bolasso(

  diabetes ~ .,

  data = train,

  n.boot = 100,

  progress = FALSE,

  family = "binomial",

  fast = TRUE

)

model_standard_preds <- ifelse(

  rowMeans(predict(model_standard, test, type = "response")) >= 0.5,

  yes = 1,

  no = 0

)

model_fast_preds <- ifelse(

  rowMeans(predict(model_fast, test, type = "response")) >= 0.5,

  yes = 1,

  no = 0

)

truth <- as.integer(test$diabetes) - 1

```

```{r, echo = FALSE}

cat(

  "Standard Bolasso accuracy:",

  round(100*sum(model_standard_preds == truth)/length(truth), 2),

  "%\n",

  "\rFast Bolasso accuracy:",

  round(100*sum(model_fast_preds == truth)/length(truth), 2),

  "%\n"

)

```

It's important to note that fast bolasso should be thought of more

as a rough-and-ready algorithm that is better for quick iteration and might

have worse empirical performance than the standard algorithm.

#### Parallelizing bolasso

We can also fit bolasso bootstrap models in parallel via the 

[future](https://CRAN.R-project.org/package=future) package. The future package

supports a wide variety of parallelization, from local multi-core to remote

compute clusters. Parallelizing bolasso is as simple as initializing the parallel method

prior to executing the bolasso function. For example, the following setup

will execute bolasso in parallel R sessions.

```{r}

future::plan("multisession")

time_parallel <- system.time({

  bolasso(

    diabetes ~ .,

    data = train,

    n.boot = 1000,

    progress = FALSE,

    family = "binomial"

  )

})

future::plan("sequential")

time_sequential <- system.time({

  bolasso(

    diabetes ~ .,

    data = train,

    n.boot = 1000,

    progress = FALSE,

    family = "binomial"

  )

})

```

```{r, echo = FALSE}

cat(

  "Parallel bolasso time (seconds):",

  round(time_parallel[[3]], 3),

  "\n\rSequential bolasso time (seconds):",

  round(time_sequential[[3]], 3)

)

```

### Beyond the Lasso

bolasso also allows us to fit penalized regression models beyond the Lasso.

For example, suppose we want to fit a bootstrap-enhanced elasticnet model

with a mixing parameter of 0.5 (an even mix of the Ridge and Lasso

regularization terms). We can simply pass the underlying `glmnet::glmnet`

argument `alpha = 0.5` through bolasso. The following code compares selected

variables between the Lasso and elasticnet models.

```{r}

lasso <- bolasso(

  diabetes ~ .,

  data = train,

  n.boot = 100,

  progress = FALSE,

  family = "binomial"

)

elnet <- bolasso(

  diabetes ~ .,

  data = train,

  n.boot = 100,

  progress = FALSE,

  family = "binomial",

  alpha = 0.5

)

```

```{r, echo = FALSE}

cat(

  "Lasso selected variables:",

  selected_variables(lasso, 0.95, var_names_only = TRUE),

  "\n\rElnet selected variables:",

  selected_variables(elnet, 0.95, var_names_only = TRUE)

)

```

## References

[1]Abram, Samantha V et al. “Bootstrap Enhanced Penalized

Regression for Variable Selection with Neuroimaging Data.” Frontiers in

neuroscience vol. 10 344. 28 Jul. 2016, doi:10.3389/fnins.2016.00344

[2]Bach, Francis. “Bolasso: Model Consistent Lasso Estimation 

through the Bootstrap.” ArXiv:0804.1302 [Cs, Math, Stat], 2008. 

https://arxiv.org/abs/0804.1302.

[3]Bunea, Florentina et al. “Penalized least squares regression

methods and applications to neuroimaging.” NeuroImage vol. 55,4 (2011):

1519-27. doi:10.1016/j.neuroimage.2010.12.028
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dmolitor/bolasso

Awesome Lists containing this project

README