Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tlverse/sl3
💪 🤔 Modern Super Learning with Machine Learning Pipelines
https://github.com/tlverse/sl3
data-science ensemble-learning ensemble-model machine-learning model-selection r r-package regression stacking statistics
Last synced: 3 months ago
JSON representation
💪 🤔 Modern Super Learning with Machine Learning Pipelines
- Host: GitHub
- URL: https://github.com/tlverse/sl3
- Owner: tlverse
- License: gpl-3.0
- Created: 2017-08-08T21:09:22.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-04-30T05:11:22.000Z (6 months ago)
- Last Synced: 2024-07-28T09:31:51.992Z (3 months ago)
- Topics: data-science, ensemble-learning, ensemble-model, machine-learning, model-selection, r, r-package, regression, stacking, statistics
- Language: R
- Homepage: https://tlverse.org/sl3/
- Size: 11.3 MB
- Stars: 99
- Watchers: 13
- Forks: 40
- Open Issues: 44
-
Metadata Files:
- Readme: README.Rmd
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
---
always_allow_html: true
output:
rmarkdown::github_document
bibliography: "inst/REFERENCES.bib"
---```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```# R/`sl3`: Super Machine Learning with Pipelines
[![R-CMD-check](https://github.com/tlverse/sl3/workflows/R-CMD-check/badge.svg)](https://github.com/tlverse/sl3/actions)
[![Coverage Status](https://codecov.io/gh/tlverse/sl3/branch/master/graph/badge.svg)](https://codecov.io/gh/tlverse/sl3)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1342293.svg)](https://doi.org/10.5281/zenodo.1342293)> A flexible implementation of the Super Learner ensemble machine learning
> system__Authors:__ [Jeremy Coyle](https://github.com/jeremyrcoyle),
[Nima Hejazi](https://nimahejazi.org),
[Ivana Malenica](https://github.com/imalenica),
[Rachael Phillips](https://github.com/rachaelvp), and
[Oleg Sofrygin](https://github.com/osofr)---
## What's `sl3`?
`sl3` is an implementation of the Super Learner ensemble machine learning
algorithm of @vdl2007super. The Super Learner algorithm performs ensemble
learning in one of two fashions:1. The _discrete_ Super Learner can be used to select the best prediction
algorithm from among a supplied library of machine learning algorithms
("learners" in the `sl3` nomenclature) -- that is, the discrete Super Learner
is the single learning algorithm that minimizes the cross-validated risk.
2. The _ensemble_ Super Learner can be used to assign weights to a set of
specified learning algorithms (from a user-supplied library of such
algorithms) so as to create a combination of these learners that minimizes
the cross-validated risk. This notion of weighted combinations has also been
referred to as _stacked regression_ [@breiman1996stacked] and
_stacked generalization_ [@wolpert1992stacked].Looking for long-form documentation or a walkthrough of the `sl3` package?
Don't worry! Just browse [the chapter in our
book](https://tlverse.org/tlverse-handbook/sl3.html).---
## Installation
Install the _most recent version_ from the `master` branch on GitHub via
[`remotes`](https://CRAN.R-project.org/package=remotes):```{r gh-master-installation, eval = FALSE}
remotes::install_github("tlverse/sl3")
```Past stable releases may be located via the
[releases](https://github.com/tlverse/sl3/releases) page on GitHub and may be
installed by including the appropriate major version tag. For example,```{r gh-version-installation, eval = FALSE}
remotes::install_github("tlverse/[email protected]")
```To contribute, check out the `devel` branch and consider submitting a pull
request.---
## Issues
If you encounter any bugs or have any specific feature requests, please [file an
issue](https://github.com/tlverse/sl3/issues).---
## Examples
`sl3` makes the process of applying screening algorithms, learning algorithms,
combining both types of algorithms into a stacked regression model, and
cross-validating this whole process essentially trivial. The best way to
understand this is to see the `sl3` package in action:```{r sl3-simple-example, message=FALSE, warning=FALSE}
set.seed(49753)
library(tidyverse)
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)# load example data set
data(cpp)
cpp <- cpp %>%
dplyr::filter(!is.na(haz)) %>%
mutate_all(~ replace(., is.na(.), 0))# use covariates of intest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
"sexn")
task <- sl3_Task$new(
data = cpp,
covariates = covars,
outcome = "haz"
)# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
head(preds)
```### Parallelization with `future`s
While it's straightforward to fit a stack of learners (as above), it's easy to
take advantage of `sl3`'s built-in parallelization support too. To do this,
you can simply choose a `plan()` from the [`future`
ecosystem](https://CRAN.R-project.org/package=future).```{r sl3-parallel-example, eval=FALSE, message=FALSE, warning=FALSE}
# let's load the future package and set 4 cores for parallelization
library(future)
plan(multicore, workers = 4L)# now, let's re-train our Stack in parallel
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
```### Controlling the number of CV folds
In the above examples, we fit stacks of learners, but didn't create a Super
Learner ensemble, which uses cross-validation (CV) to build the ensemble model.
For the sake of computational expedience, we may be interested in lowering the
number of CV folds (from 10). Let's take a look at how to do both below.```{r sl3-folds-example, eval=FALSE, message=FALSE, warning=FALSE}
# first, let's instantiate some more learners and create a Super Learner
mean_learner <- Lrnr_mean$new()
rf_learner <- Lrnr_ranger$new()
sl <- Lrnr_sl$new(mean_learner, glm_learner, rf_learner)# CV folds are controlled in the sl3_Task object; we can lower the number of
# folds simply by specifying this in creating the Task
task <- sl3_Task$new(
data = cpp,
covariates = covars,
outcome = "haz",
folds = 5L
)# now, let's fit the Super Learner with just 5-fold CV, then get predictions
sl_fit <- sl$train(task)
sl_preds <- sl_fit$predict()
```The `folds` argument to `sl3_Task` supports both integers (for V-fold CV) and
all of the CV schemes supported in the [`origami`
package](https://CRAN.R-project.org/package=origami). To see the full list,
query `?fold_funs` from within `R` or take a look at [`origami`'s online
documentation](https://tlverse.org/origami/reference/).---
## Learner Properties
Properties supported by `sl3` learners are presented in the following table:
```{r sl3-learner-properties, warning=FALSE, echo=FALSE, message=FALSE}
library(sl3)
library(kableExtra)all_properties <- sl3_list_properties()
get_all_learners <- function() {
# search for objects named like sl3 learners
learner_names <- apropos("^Lrnr_")
learners <- mget(learner_names, inherits = TRUE)# verify that learner inherits from Lrnr_base (and is therefore an actual
# sl3 Learner)
is_learner_real <- sapply(learners, `[[`, "inherit") == "Lrnr_base"
return(learners[which(is_learner_real)])
}
all_learners <- get_all_learners()# get a list where each element is a list of properties of a particular learner
get_learner_class_properties <- function(learner_class) {
return(learner_class$private_fields$.properties)
}
learner_property_lst <- lapply(all_learners, get_learner_class_properties)# get a list where each element is a list of logicals indicating whether a
# learner has a property or not
get_learner_class_properties_ownership <- function(properties) {
return(sl3_list_properties() %in% unlist(properties))
}
learner_property_ownership <- lapply(learner_property_lst,
get_learner_class_properties_ownership)get_learner_name <- function(learner_class) {
return(learner_class$classname)
}
learner_names <- lapply(all_learners, get_learner_name)# generate matrix whose columns are properties and rows are learner names
final_mat <- matrix(unlist(learner_property_ownership),
nrow = length(learner_property_ownership),
ncol = length(sl3_list_properties()), byrow = TRUE)
dimnames(final_mat) <- list(unlist(learner_names), sl3_list_properties())
final_mat.df <- as.data.frame(final_mat)char_final_mat.df <- ifelse(final_mat.df == TRUE, "√", "x")
kable(char_final_mat.df) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed",
"responsive")) %>%
scroll_box(width = "100%", height = "200px", fixed_thead = T)
```---
## Contributions
Contributions are very welcome. Interested contributors should consult our
[contribution
guidelines](https://github.com/tlverse/sl3/blob/master/CONTRIBUTING.md) prior to
submitting a pull request.---
## Citation
After using the `sl3` R package, please cite the following:
@software{coyle2021sl3-rpkg,
author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
Phillips, Rachael V and Sofrygin, Oleg},
title = {{sl3}: Modern Pipelines for Machine Learning and {Super
Learning}},
year = {2021},
howpublished = {\url{https://github.com/tlverse/sl3}},
note = {{R} package version 1.4.2},
url = {https://doi.org/10.5281/zenodo.1342293},
doi = {10.5281/zenodo.1342293}
}---
## License
© 2017-2021 [Jeremy R. Coyle](https://github.com/jeremyrcoyle), [Nima S.
Hejazi](https://nimahejazi.org), [Ivana
Malenica](https://github.com/podTockom), [Rachael V.
Phillips](https://github.com/rachaelvp), [Oleg
Sofrygin](https://github.com/osofr)The contents of this repository are distributed under the GPL-3 license. See
file `LICENSE` for details.---
## References