Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tlverse/sl3

💪 🤔 Modern Super Learning with Machine Learning Pipelines
https://github.com/tlverse/sl3

data-science ensemble-learning ensemble-model machine-learning model-selection r r-package regression stacking statistics

Last synced: about 2 months ago
JSON representation

💪 🤔 Modern Super Learning with Machine Learning Pipelines

Awesome Lists containing this project

README

        

---
always_allow_html: true
output:
rmarkdown::github_document
bibliography: "inst/REFERENCES.bib"
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```

# R/`sl3`: Super Machine Learning with Pipelines

[![R-CMD-check](https://github.com/tlverse/sl3/workflows/R-CMD-check/badge.svg)](https://github.com/tlverse/sl3/actions)
[![Coverage Status](https://codecov.io/gh/tlverse/sl3/branch/master/graph/badge.svg)](https://codecov.io/gh/tlverse/sl3)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1342293.svg)](https://doi.org/10.5281/zenodo.1342293)

> A flexible implementation of the Super Learner ensemble machine learning
> system

__Authors:__ [Jeremy Coyle](https://github.com/jeremyrcoyle),
[Nima Hejazi](https://nimahejazi.org),
[Ivana Malenica](https://github.com/imalenica),
[Rachael Phillips](https://github.com/rachaelvp), and
[Oleg Sofrygin](https://github.com/osofr)

---

## What's `sl3`?

`sl3` is an implementation of the Super Learner ensemble machine learning
algorithm of @vdl2007super. The Super Learner algorithm performs ensemble
learning in one of two fashions:

1. The _discrete_ Super Learner can be used to select the best prediction
algorithm from among a supplied library of machine learning algorithms
("learners" in the `sl3` nomenclature) -- that is, the discrete Super Learner
is the single learning algorithm that minimizes the cross-validated risk.
2. The _ensemble_ Super Learner can be used to assign weights to a set of
specified learning algorithms (from a user-supplied library of such
algorithms) so as to create a combination of these learners that minimizes
the cross-validated risk. This notion of weighted combinations has also been
referred to as _stacked regression_ [@breiman1996stacked] and
_stacked generalization_ [@wolpert1992stacked].

Looking for long-form documentation or a walkthrough of the `sl3` package?
Don't worry! Just browse [the chapter in our
book](https://tlverse.org/tlverse-handbook/sl3.html).

---

## Installation

Install the _most recent version_ from the `master` branch on GitHub via
[`remotes`](https://CRAN.R-project.org/package=remotes):

```{r gh-master-installation, eval = FALSE}
remotes::install_github("tlverse/sl3")
```

Past stable releases may be located via the
[releases](https://github.com/tlverse/sl3/releases) page on GitHub and may be
installed by including the appropriate major version tag. For example,

```{r gh-version-installation, eval = FALSE}
remotes::install_github("tlverse/[email protected]")
```

To contribute, check out the `devel` branch and consider submitting a pull
request.

---

## Issues

If you encounter any bugs or have any specific feature requests, please [file an
issue](https://github.com/tlverse/sl3/issues).

---

## Examples

`sl3` makes the process of applying screening algorithms, learning algorithms,
combining both types of algorithms into a stacked regression model, and
cross-validating this whole process essentially trivial. The best way to
understand this is to see the `sl3` package in action:

```{r sl3-simple-example, message=FALSE, warning=FALSE}
set.seed(49753)
library(tidyverse)
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)

# load example data set
data(cpp)
cpp <- cpp %>%
dplyr::filter(!is.na(haz)) %>%
mutate_all(~ replace(., is.na(.), 0))

# use covariates of intest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
"sexn")
task <- sl3_Task$new(
data = cpp,
covariates = covars,
outcome = "haz"
)

# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")

# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
head(preds)
```

### Parallelization with `future`s

While it's straightforward to fit a stack of learners (as above), it's easy to
take advantage of `sl3`'s built-in parallelization support too. To do this,
you can simply choose a `plan()` from the [`future`
ecosystem](https://CRAN.R-project.org/package=future).

```{r sl3-parallel-example, eval=FALSE, message=FALSE, warning=FALSE}
# let's load the future package and set 4 cores for parallelization
library(future)
plan(multicore, workers = 4L)

# now, let's re-train our Stack in parallel
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
```

### Controlling the number of CV folds

In the above examples, we fit stacks of learners, but didn't create a Super
Learner ensemble, which uses cross-validation (CV) to build the ensemble model.
For the sake of computational expedience, we may be interested in lowering the
number of CV folds (from 10). Let's take a look at how to do both below.

```{r sl3-folds-example, eval=FALSE, message=FALSE, warning=FALSE}
# first, let's instantiate some more learners and create a Super Learner
mean_learner <- Lrnr_mean$new()
rf_learner <- Lrnr_ranger$new()
sl <- Lrnr_sl$new(mean_learner, glm_learner, rf_learner)

# CV folds are controlled in the sl3_Task object; we can lower the number of
# folds simply by specifying this in creating the Task
task <- sl3_Task$new(
data = cpp,
covariates = covars,
outcome = "haz",
folds = 5L
)

# now, let's fit the Super Learner with just 5-fold CV, then get predictions
sl_fit <- sl$train(task)
sl_preds <- sl_fit$predict()
```

The `folds` argument to `sl3_Task` supports both integers (for V-fold CV) and
all of the CV schemes supported in the [`origami`
package](https://CRAN.R-project.org/package=origami). To see the full list,
query `?fold_funs` from within `R` or take a look at [`origami`'s online
documentation](https://tlverse.org/origami/reference/).

---

## Learner Properties

Properties supported by `sl3` learners are presented in the following table:

```{r sl3-learner-properties, warning=FALSE, echo=FALSE, message=FALSE}
library(sl3)
library(kableExtra)

all_properties <- sl3_list_properties()

get_all_learners <- function() {
# search for objects named like sl3 learners
learner_names <- apropos("^Lrnr_")
learners <- mget(learner_names, inherits = TRUE)

# verify that learner inherits from Lrnr_base (and is therefore an actual
# sl3 Learner)
is_learner_real <- sapply(learners, `[[`, "inherit") == "Lrnr_base"
return(learners[which(is_learner_real)])
}
all_learners <- get_all_learners()

# get a list where each element is a list of properties of a particular learner
get_learner_class_properties <- function(learner_class) {
return(learner_class$private_fields$.properties)
}
learner_property_lst <- lapply(all_learners, get_learner_class_properties)

# get a list where each element is a list of logicals indicating whether a
# learner has a property or not
get_learner_class_properties_ownership <- function(properties) {
return(sl3_list_properties() %in% unlist(properties))
}
learner_property_ownership <- lapply(learner_property_lst,
get_learner_class_properties_ownership)

get_learner_name <- function(learner_class) {
return(learner_class$classname)
}
learner_names <- lapply(all_learners, get_learner_name)

# generate matrix whose columns are properties and rows are learner names
final_mat <- matrix(unlist(learner_property_ownership),
nrow = length(learner_property_ownership),
ncol = length(sl3_list_properties()), byrow = TRUE)
dimnames(final_mat) <- list(unlist(learner_names), sl3_list_properties())
final_mat.df <- as.data.frame(final_mat)

char_final_mat.df <- ifelse(final_mat.df == TRUE, "√", "x")
kable(char_final_mat.df) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed",
"responsive")) %>%
scroll_box(width = "100%", height = "200px", fixed_thead = T)
```

---

## Contributions

Contributions are very welcome. Interested contributors should consult our
[contribution
guidelines](https://github.com/tlverse/sl3/blob/master/CONTRIBUTING.md) prior to
submitting a pull request.

---

## Citation

After using the `sl3` R package, please cite the following:

@software{coyle2021sl3-rpkg,
author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
Phillips, Rachael V and Sofrygin, Oleg},
title = {{sl3}: Modern Pipelines for Machine Learning and {Super
Learning}},
year = {2021},
howpublished = {\url{https://github.com/tlverse/sl3}},
note = {{R} package version 1.4.2},
url = {https://doi.org/10.5281/zenodo.1342293},
doi = {10.5281/zenodo.1342293}
}

---

## License

© 2017-2021 [Jeremy R. Coyle](https://github.com/jeremyrcoyle), [Nima S.
Hejazi](https://nimahejazi.org), [Ivana
Malenica](https://github.com/podTockom), [Rachael V.
Phillips](https://github.com/rachaelvp), [Oleg
Sofrygin](https://github.com/osofr)

The contents of this repository are distributed under the GPL-3 license. See
file `LICENSE` for details.

---

## References