Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/dmolitor/bolasso

Model consistent Lasso estimation through the bootstrap.
https://github.com/dmolitor/bolasso

bolasso bootstrap lasso rstats variable-selection

Last synced: 11 days ago
JSON representation

Model consistent Lasso estimation through the bootstrap.

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r message=FALSE, warning=FALSE, paged.print=TRUE, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "75%"
)

set.seed(123) # Reproducible results
```

# bolasso

[![R-CMD-check](https://github.com/dmolitor/bolasso/workflows/R-CMD-check/badge.svg)](https://github.com/dmolitor/bolasso/actions)
[![pkgdown](https://github.com/dmolitor/bolasso/workflows/pkgdown/badge.svg)](https://github.com/dmolitor/bolasso/actions)
[![Codecov test coverage](https://codecov.io/gh/dmolitor/bolasso/branch/main/graph/badge.svg)](https://app.codecov.io/gh/dmolitor/bolasso?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/bolasso)](https://CRAN.R-project.org/package=bolasso)

The goal of bolasso is to implement model-consistent Lasso estimation via the
bootstrap [[1]](#1).

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("dmolitor/bolasso")
```
## Usage

To illustrate the usage of bolasso, we'll use the
[Pima Indians Diabetes dataset](http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html)
to determine which factors are important predictors of testing positive
for diabetes. For a full description of the input variables, see the link above.

### Load requisite packages and data

```{r echo=TRUE, message=FALSE, warning=FALSE}
library(bolasso)

data(PimaIndiansDiabetes, package = "mlbench")

# Quick overview of the dataset
str(PimaIndiansDiabetes)
```

First, we run 100-fold bootstrapped Lasso with the `glmnet` implementation. We
can get a rough estimate of the elapsed time using `system.time()`.

```{r}
system.time({
model <- bolasso(
diabetes ~ .,
data = PimaIndiansDiabetes,
n.boot = 100,
implement = "glmnet",
family = "binomial"
)
})
```

We can get a quick overview of the model by printing the `bolasso` object.
```{r}
model
```

### Extracting selected variables

Next, we can extract all variables that were selected in 90% and 100% of the
bootstrapped Lasso models. We can also pass any relevant arguments to `predict`
on the `cv.glmnet` or `cv.gamlr` model objects. In this case we will use the
lambda value that minimizes OOS error.

```{r}
selected_vars(model, threshold = 0.9, select = "lambda.min")

selected_vars(model, threshold = 1, select = "lambda.min")
```

### Plotting selected variables

We can also quickly plot the selected variables at the 90% and 100% threshold
values.

```{r}
plot(model, threshold = 0.9)

plot(model, threshold = 1)
```

### Parallelizing bolasso

We can execute `bolasso` in parallel via the
[future](https://CRAN.R-project.org/package=future) package. To
do so we can copy the code from above with only one minor tweak shown below.

```{r}
future::plan("multisession")
```

```{r include=FALSE}
# Include a warm-start, otherwise parallel will be slow first time around
future.apply::future_lapply(1:100, function(i) i)
```

We can now run the code from above, unaltered, and it will execute in parallel.

```{r}
system.time({
model <- bolasso(
diabetes ~ .,
data = PimaIndiansDiabetes,
n.boot = 100,
implement = "glmnet",
family = "binomial"
)
})
```

## References

[1] Bach, Francis. “Bolasso: Model Consistent Lasso Estimation
through the Bootstrap.” ArXiv:0804.1302 [Cs, Math, Stat], April 8, 2008.
https://arxiv.org/abs/0804.1302.