{"id":17550845,"url":"https://github.com/dmolitor/bolasso","last_synced_at":"2025-04-24T02:28:20.299Z","repository":{"id":43050657,"uuid":"407633141","full_name":"dmolitor/bolasso","owner":"dmolitor","description":"Model consistent Lasso estimation through the bootstrap.","archived":false,"fork":false,"pushed_at":"2024-12-21T19:20:37.000Z","size":6364,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-18T10:23:36.753Z","etag":null,"topics":["bolasso","bootstrap","lasso","rstats","variable-selection"],"latest_commit_sha":null,"homepage":"https://dmolitor.github.io/bolasso/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmolitor.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-09-17T17:51:40.000Z","updated_at":"2024-12-21T19:17:51.000Z","dependencies_parsed_at":"2024-12-21T20:21:20.829Z","dependency_job_id":"7a08fdcf-abb8-4cd0-acf5-f36fda0629f2","html_url":"https://github.com/dmolitor/bolasso","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmolitor%2Fbolasso","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmolitor%2Fbolasso/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmolitor%2Fbolasso/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmolitor%2Fbolasso/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/dmolitor","download_url":"https://codeload.github.com/dmolitor/bolasso/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250547452,"owners_count":21448494,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bolasso","bootstrap","lasso","rstats","variable-selection"],"created_at":"2024-10-21T04:43:56.459Z","updated_at":"2025-04-24T02:28:20.262Z","avatar_url":"https://github.com/dmolitor.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r message=FALSE, warning=FALSE, paged.print=TRUE, echo=FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"85%\",\n  dpi = 300\n)\n\nset.seed(321) # Reproducible results\n```\n\n# bolasso \u003ca href='https://www.dmolitor.com/bolasso/'\u003e\u003cimg src='man/figures/logo.png' align=\"right\" height=\"139\" /\u003e\u003c/a\u003e\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/dmolitor/bolasso/workflows/R-CMD-check/badge.svg)](https://github.com/dmolitor/bolasso/actions)\n[![pkgdown](https://github.com/dmolitor/bolasso/workflows/pkgdown/badge.svg)](https://github.com/dmolitor/bolasso/actions)\n[![Codecov test coverage](https://codecov.io/gh/dmolitor/bolasso/branch/main/graph/badge.svg)](https://app.codecov.io/gh/dmolitor/bolasso?branch=main)\n[![CRAN status](https://www.r-pkg.org/badges/version/bolasso)](https://CRAN.R-project.org/package=bolasso)\n\u003c!-- badges: end --\u003e\n\nThe goal of bolasso is to implement bootstrap-enhanced Lasso (and more generally,\npenalized regression) estimation, as proposed originally in [Bach (2008)](#2)\nand extended by [Bunea et al. (2011)](#3) and [Abram et al. (2016)](#1). These methods\nfocus primarily on variable selection and propose two similar, but slightly\ndifferent, variable selection algorithms: the variable inclusion probability\n(VIP) algorithm (Bach; Bunea et al.) and the bootstrap\ndistribution quantile (QNT) algorithm (Abram et al.). 
Beyond implementing\nboth these variable selection methods, bolasso also provides utilities for\nmaking bagged predictions, examining coefficient distributions, and plotting.\n\n## Installation\n\nInstall bolasso from CRAN:\n```r\ninstall.packages(\"bolasso\")\n```\n\nOr install the development version from GitHub with:\n```r\n# install.packages(\"pak\")\npak::pkg_install(\"dmolitor/bolasso@dev\")\n```\n## Usage\n\nTo illustrate the usage of bolasso, we'll use the \n[Pima Indians Diabetes dataset](http://math.furman.edu/~dcs/courses/math47/R/library/mlbench/html/PimaIndiansDiabetes.html)\nto determine which factors are important predictors of testing positive\nfor diabetes. For a full description of the input variables, see the link above.\n\n### Load requisite packages and data\n\n```{r echo=TRUE, message=FALSE, warning=FALSE}\nlibrary(bolasso)\nlibrary(ggplot2)\nlibrary(tibble)\n\ndata(PimaIndiansDiabetes, package = \"mlbench\")\n\n# Quick overview of the dataset\nstr(PimaIndiansDiabetes)\n```\n\nFirst, let's create a train/test split of our data, and then run 100-fold\nbootstrapped Lasso with `glmnet`.\n\n```{r}\ntrain_idx \u003c- sample(1:nrow(PimaIndiansDiabetes), round(0.7*nrow(PimaIndiansDiabetes)))\ntrain \u003c- PimaIndiansDiabetes[train_idx, ]\ntest \u003c- PimaIndiansDiabetes[-train_idx, ]\n\nmodel \u003c- bolasso(\n  diabetes ~ .,\n  data = train,\n  n.boot = 100,\n  progress = FALSE,\n  family = \"binomial\"\n)\n```\n\n\n### Variable selection\n\nNext, using a threshold of 0.95 we can extract the selected variables using the\nVIP method, which extracts all variables\nthat were selected (had non-zero coefficients) in \u003e= 95% of the bootstrapped models.\nWe'll use the regularization parameter `lambda.min` that minimizes\ncross-validation error.\n```{r}\nselected_variables(model, threshold = 0.95, method = \"vip\", select = \"lambda.min\")\n```\n\nNote that this returned a tibble with the selected variables as columns and the\ncoefficients for each of 
the bootstrapped models as rows. If you only want the variable names,\nyou can set the `var_names_only` argument to `TRUE`:\n```{r}\nselected_variables(model, 0.95, \"vip\", var_names_only = TRUE)\n```\n\nWe can compare the selected variables using the VIP method to the QNT\nmethod, which selects all variables that have\na 95% bootstrap confidence interval that does not contain 0:\n```{r}\nselected_variables(model, 0.95, \"qnt\", var_names_only = TRUE)\n```\n\nNote that the number of variables selected with QNT will always be \u003c= the\nnumber selected with VIP. The default method for bolasso is `method = \"vip\"`.\n\n#### Variable selection thresholds\n\nIt may be that, instead of selecting variables for a given threshold and\nmethod, we want to see the largest threshold at which each variable\nwould be selected by both the VIP and QNT methods. We can quickly\nvisualize this with the `plot_selection_thresholds` function.\n```{r}\nplot_selection_thresholds(model, select = \"lambda.min\")\n```\n\nYou can also get these thresholds in a tibble:\n```{r}\nselection_thresholds(model, select = \"lambda.min\")\n```\n\n### Coefficients\n\n#### All coefficients\n\nbolasso also supports moving beyond variable selection and understanding\nthe bootstrapped variable coefficients. We can extract a tidy tibble where\neach variable is a column, and each row represents a bootstrap fold, and the\nvalues are the corresponding estimated coefficients.\n```{r}\ntidy(model, select = \"lambda.min\")\n```\n\nbolasso also allows us to plot the bootstrap distribution of variable\ncoefficients. Suppose that we want to quickly inspect this distribution for\neach of our variables. We can achieve this by simply plotting our model.\n```{r}\nplot(model, select = \"lambda.min\")\n```\n\nNow, suppose for example we are particularly interested in the coefficient\ndistributions for the `triceps`, `pressure`, and `glucose` variables. 
We can\nplot the distributions for just these variables:\n```{r}\nplot(model, covariates = c(glucose, pressure, triceps))\n```\n\nNote: If there are more than 30 variables included in our model, then this\nwill plot the 30 variables with the largest absolute mean coefficients.\n\n#### Selected variable coefficients\n\nIf we want to plot the coefficient distributions for only the selected\nvariables, we can use `plot_selected_variables`, which produces the same kind\nof plot as `plot` but restricted to the selected variables.\n```{r}\nplot_selected_variables(\n  model,\n  threshold = 0.95,\n  method = \"vip\",\n  select = \"lambda.min\"\n)\n```\n\nJust like `plot`, we can also focus on a subset of our selected variables.\n```{r}\nplot_selected_variables(\n  model,\n  covariates = c(pregnant, mass),\n  threshold = 0.95,\n  method = \"vip\",\n  select = \"lambda.min\"\n)\n```\n\n### Predictions\n\nFinally, we can make predictions using our bolasso model on new data. For\nexample, the following code shows how we would generate predicted probabilities\non our `test` data.\n```{r}\nas_tibble(predict(model, test, select = \"lambda.min\", type = \"response\"))\n```\n\nNote that this outputs an (n x p) matrix of predictions where n is the number\nof rows in our test set, p is the number of bootstraps, and each column\nrepresents the predictions from one of our bootstrapped models. To combine\nthese into a single prediction per observation, we could take the average\nfor each observation across the models:\n```{r}\ntibble(\n  predictions = rowMeans(\n    predict(model, test, select = \"lambda.min\", type = \"response\")\n  )\n)\n```\n\n### Fast estimation 🏎️💨\n\nFor each bootstrapped model, bolasso uses cross-validation to find the optimal\nregularization parameter lambda. In glmnet, the default number of cross-validation\nfolds is 10. This can quickly become computationally expensive and slow,\nespecially when using many bootstrap replicates. 
For example, with 1,000\nbootstrap replicates, this results in estimating models on 10,000 cross-validation\nsets.\n\nTo address this, we can activate the `fast = TRUE` argument in bolasso. Instead of\nusing cross-validation to find the optimal lambda for each bootstrap model,\nfast bolasso runs a single cross-validated regression on the full dataset to\nidentify the optimal lambda. Then each bootstrapped model uses that lambda\nas its regularization parameter.\n\nThe following comparison shows the computation time of standard bolasso vs fast\nbolasso across increasing bootstrap replicates. The plot displays the number of\nseconds each algorithm takes to complete.\n```{r}\n# Compare standard vs. fast bolasso across different bootstrap values\ntimes \u003c- lapply(\n  c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3),\n  \\(x) {\n    time_standard \u003c- system.time({\n      bolasso(\n        diabetes ~ .,\n        data = train,\n        n.boot = x,\n        progress = FALSE,\n        family = \"binomial\"\n      )\n    })\n    time_fast \u003c- system.time({\n      bolasso(\n        diabetes ~ .,\n        data = train,\n        n.boot = x,\n        progress = FALSE,\n        family = \"binomial\",\n        fast = TRUE\n      )\n    })\n    return(\n      tibble::tibble(\"regular\" = time_standard[[3]], \"fast\" = time_fast[[3]])\n    )\n  }\n)\n\n# Make a data.frame out of the times\ntimes_df \u003c- do.call(rbind, times) |\u003e\n  transform(\n    id = 1:11,\n    n_bootstrap = c(10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1e3)\n  ) |\u003e\n  reshape(\n    varying = c(\"regular\", \"fast\"),\n    v.names = \"time\",\n    times = c(\"regular\", \"fast\"),\n    timevar = \"algorithm\",\n    idvar = c(\"id\", \"n_bootstrap\"),\n    direction = \"long\"\n  )\n\n# Plot it!\nggplot(times_df, aes(x = n_bootstrap, y = time, color = factor(algorithm))) +\n  geom_point() +\n  geom_line() +\n  labs(x = \"N Bootstraps\", y = \"Time (seconds)\") +\n  
scale_y_continuous(breaks = seq(0, 60, 10)) +\n  theme_minimal() +\n  theme(legend.title = element_blank())\n```\n\nFast bolasso clearly achieves some pretty massive speedups over the standard\nversion! This difference in speed will only be more accentuated when\nestimating on larger datasets.\n\n#### What do we lose with standard vs. fast?\n\nThere's never a free lunch, so to be clear about the tradeoffs between the\nstandard and fast versions of bolasso, the following shows the difference\nin predictive accuracy on our hold-out test set.\n\n```{r}\nmodel_standard \u003c- bolasso(\n  diabetes ~ .,\n  data = train,\n  n.boot = 100,\n  progress = FALSE,\n  family = \"binomial\"\n)\nmodel_fast \u003c- bolasso(\n  diabetes ~ .,\n  data = train,\n  n.boot = 100,\n  progress = FALSE,\n  family = \"binomial\",\n  fast = TRUE\n)\n\nmodel_standard_preds \u003c- ifelse(\n  rowMeans(predict(model_standard, test, type = \"response\")) \u003e= 0.5,\n  yes = 1,\n  no = 0\n)\nmodel_fast_preds \u003c- ifelse(\n  rowMeans(predict(model_fast, test, type = \"response\")) \u003e= 0.5,\n  yes = 1,\n  no = 0\n)\ntruth \u003c- as.integer(test$diabetes) - 1\n```\n```{r, echo = FALSE}\ncat(\n  \"Standard Bolasso accuracy:\",\n  round(100*sum(model_standard_preds == truth)/length(truth), 2),\n  \"%\\n\",\n  \"\\rFast Bolasso accuracy:\",\n  round(100*sum(model_fast_preds == truth)/length(truth), 2),\n  \"%\\n\"\n)\n```\n\nIt's important to note that fast bolasso should be thought of more\nas a rough-and-ready algorithm that is better for quick iteration and might\nhave worse empirical performance than the standard algorithm.\n\n#### Parallelizing bolasso\n\nWe can also fit bolasso bootstrap models in parallel via the \n[future](https://CRAN.R-project.org/package=future) package. The future package\nsupports a wide variety of parallelization, from local multi-core to remote\ncompute clusters. 
Parallelizing bolasso is as simple as initializing the parallel method\nprior to executing the bolasso function. For example, the following setup\nwill execute bolasso in parallel R sessions.\n\n```{r}\nfuture::plan(\"multisession\")\ntime_parallel \u003c- system.time({\n  bolasso(\n    diabetes ~ .,\n    data = train,\n    n.boot = 1000,\n    progress = FALSE,\n    family = \"binomial\"\n  )\n})\nfuture::plan(\"sequential\")\n\ntime_sequential \u003c- system.time({\n  bolasso(\n    diabetes ~ .,\n    data = train,\n    n.boot = 1000,\n    progress = FALSE,\n    family = \"binomial\"\n  )\n})\n```\n```{r, echo = FALSE}\ncat(\n  \"Parallel bolasso time (seconds):\",\n  round(time_parallel[[3]], 3),\n  \"\\n\\rSequential bolasso time (seconds):\",\n  round(time_sequential[[3]], 3)\n)\n```\n\n### Beyond the Lasso\n\nbolasso also allows us to fit penalized regression models beyond the Lasso.\nFor example, suppose we want to fit a bootstrap-enhanced elasticnet model\nwith a mixing parameter of 0.5 (an even mix of the Ridge and Lasso\nregularization terms). We can simply pass the underlying `glmnet::glmnet`\nargument `alpha = 0.5` through bolasso. The following code compares selected\nvariables between the Lasso and elasticnet models.\n```{r}\nlasso \u003c- bolasso(\n  diabetes ~ .,\n  data = train,\n  n.boot = 100,\n  progress = FALSE,\n  family = \"binomial\"\n)\n\nelnet \u003c- bolasso(\n  diabetes ~ .,\n  data = train,\n  n.boot = 100,\n  progress = FALSE,\n  family = \"binomial\",\n  alpha = 0.5\n)\n```\n```{r, echo = FALSE}\ncat(\n  \"Lasso selected variables:\",\n  selected_variables(lasso, 0.95, var_names_only = TRUE),\n  \"\\n\\rElnet selected variables:\",\n  selected_variables(elnet, 0.95, var_names_only = TRUE)\n)\n```\n\n## References\n\n\u003ca id=\"1\"\u003e[1]\u003c/a\u003eAbram, Samantha V et al. “Bootstrap Enhanced Penalized\nRegression for Variable Selection with Neuroimaging Data.” Frontiers in\nneuroscience vol. 10 344. 28 Jul. 
2016, doi:10.3389/fnins.2016.00344\n\n\u003ca id=\"2\"\u003e[2]\u003c/a\u003eBach, Francis. “Bolasso: Model Consistent Lasso Estimation \nthrough the Bootstrap.” ArXiv:0804.1302 [Cs, Math, Stat], 2008. \nhttps://arxiv.org/abs/0804.1302.\n\n\u003ca id=\"3\"\u003e[3]\u003c/a\u003eBunea, Florentina et al. “Penalized least squares regression\nmethods and applications to neuroimaging.” NeuroImage vol. 55,4 (2011):\n1519-27. doi:10.1016/j.neuroimage.2010.12.028\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmolitor%2Fbolasso","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmolitor%2Fbolasso","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmolitor%2Fbolasso/lists"}