https://github.com/tlverse/sl3

💪 🤔 Modern Super Learning with Machine Learning Pipelines
data-science ensemble-learning ensemble-model machine-learning model-selection r r-package regression stacking statistics

Host: GitHub
URL: https://github.com/tlverse/sl3
Owner: tlverse
License: gpl-3.0
Created: 2017-08-08T21:09:22.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2024-11-14T19:52:05.000Z (8 months ago)
Last Synced: 2025-03-21T06:51:25.805Z (4 months ago)
Topics: data-science, ensemble-learning, ensemble-model, machine-learning, model-selection, r, r-package, regression, stacking, statistics
Language: R
Homepage: https://tlverse.org/sl3/
Size: 11.2 MB
Stars: 101
Watchers: 12
Forks: 43
Open Issues: 44
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

---
always_allow_html: true
output:
  rmarkdown::github_document
bibliography: "inst/REFERENCES.bib"
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# R/`sl3`: Super Machine Learning with Pipelines

[![R-CMD-check](https://github.com/tlverse/sl3/workflows/R-CMD-check/badge.svg)](https://github.com/tlverse/sl3/actions)

[![Coverage Status](https://codecov.io/gh/tlverse/sl3/branch/master/graph/badge.svg)](https://codecov.io/gh/tlverse/sl3)

[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)

[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1342293.svg)](https://doi.org/10.5281/zenodo.1342293)

> A flexible implementation of the Super Learner ensemble machine learning
> system

__Authors:__ [Jeremy Coyle](https://github.com/jeremyrcoyle),
[Nima Hejazi](https://nimahejazi.org),
[Ivana Malenica](https://github.com/imalenica),
[Rachael Phillips](https://github.com/rachaelvp), and
[Oleg Sofrygin](https://github.com/osofr)

---

## What's `sl3`?

`sl3` is an implementation of the Super Learner ensemble machine learning
algorithm of @vdl2007super. The Super Learner algorithm performs ensemble
learning in one of two fashions:

1. The _discrete_ Super Learner can be used to select the best prediction
   algorithm from among a supplied library of machine learning algorithms
   ("learners" in the `sl3` nomenclature) -- that is, the discrete Super Learner
   is the single learning algorithm that minimizes the cross-validated risk.
2. The _ensemble_ Super Learner can be used to assign weights to a set of
   specified learning algorithms (from a user-supplied library of such
   algorithms) so as to create a combination of these learners that minimizes
   the cross-validated risk. This notion of weighted combinations has also been
   referred to as _stacked regression_ [@breiman1996stacked] and
   _stacked generalization_ [@wolpert1992stacked].
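To make the two flavors concrete, here is a minimal base-R sketch of the idea. It is an illustration under simplifying assumptions, not how `sl3` implements the algorithm: the data are simulated, the two toy learners and the names `learners`, `cv_preds`, and `ensemble_risk` are hypothetical, and a one-dimensional `optimize()` call stands in for a real metalearner.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
dat <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))

# assign each observation to one of V validation folds
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = n))

# two toy candidate learners: fit on `train`, predict on `valid`
learners <- list(
  mean = function(train, valid) rep(mean(train$y), nrow(valid)),
  lm = function(train, valid) predict(lm(y ~ x, data = train), newdata = valid)
)

# cross-validated predictions, one column per learner
cv_preds <- matrix(NA_real_, nrow = n, ncol = length(learners),
                   dimnames = list(NULL, names(learners)))
for (v in seq_len(V)) {
  valid_idx <- fold_id == v
  for (nm in names(learners)) {
    cv_preds[valid_idx, nm] <- learners[[nm]](dat[!valid_idx, ], dat[valid_idx, ])
  }
}

# cross-validated risk (mean squared error) of each learner
cv_risk <- colMeans((dat$y - cv_preds)^2)

# discrete Super Learner: the single learner with the smallest CV risk
discrete_sl <- names(which.min(cv_risk))

# ensemble Super Learner: convex weights minimizing the CV risk of the combination
ensemble_risk <- function(alpha) {
  combined <- alpha * cv_preds[, "mean"] + (1 - alpha) * cv_preds[, "lm"]
  mean((dat$y - combined)^2)
}
alpha_hat <- optimize(ensemble_risk, interval = c(0, 1))$minimum
weights <- c(mean = alpha_hat, lm = 1 - alpha_hat)

print(cv_risk)
print(discrete_sl)
print(round(weights, 3))
```

Because a weight of 0 or 1 reproduces a single learner, the weighted combination can do no worse (up to numerical tolerance) than the discrete choice in terms of cross-validated risk.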

Looking for long-form documentation or a walkthrough of the `sl3` package?
Don't worry! Just browse [the chapter in our
book](https://tlverse.org/tlverse-handbook/sl3.html).

---

## Installation

Install the _most recent version_ from the `master` branch on GitHub via
[`remotes`](https://CRAN.R-project.org/package=remotes):

```{r gh-master-installation, eval = FALSE}
remotes::install_github("tlverse/sl3")
```

Past stable releases may be located via the
[releases](https://github.com/tlverse/sl3/releases) page on GitHub and may be
installed by including the appropriate major version tag. For example,

```{r gh-version-installation, eval = FALSE}
# replace <tag> with a release tag listed on the releases page
remotes::install_github("tlverse/sl3@<tag>")
```

To contribute, check out the `devel` branch and consider submitting a pull
request.

---

## Issues

If you encounter any bugs or have any specific feature requests, please [file an
issue](https://github.com/tlverse/sl3/issues).

---

## Examples

`sl3` makes it straightforward to apply screening algorithms and learning
algorithms, combine both types of algorithms into a stacked regression model,
and cross-validate this whole process. The best way to understand this is to
see the `sl3` package in action:

```{r sl3-simple-example, message=FALSE, warning=FALSE}
set.seed(49753)
library(tidyverse)
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)

# load example data set
data(cpp)
cpp <- cpp %>%
  dplyr::filter(!is.na(haz)) %>%
  mutate_all(~ replace(., is.na(.), 0))

# use covariates of interest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz"
)

# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")

# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
head(preds)
```

### Parallelization with `future`s

Fitting a stack of learners is straightforward (as above), and so is taking
advantage of `sl3`'s built-in parallelization support: choose a `plan()` from
the [`future` ecosystem](https://CRAN.R-project.org/package=future) before
training.

```{r sl3-parallel-example, eval=FALSE, message=FALSE, warning=FALSE}
# let's load the future package and set 4 cores for parallelization
# (forked `multicore` workers are unavailable on Windows; `multisession` is
# the portable alternative)
library(future)
plan(multicore, workers = 4L)

# now, let's re-train our Stack in parallel
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
```

### Controlling the number of CV folds

The examples above fit stacks of learners but did not create a Super Learner
ensemble, which uses cross-validation (CV) to build the ensemble model. For the
sake of computational expedience, we may also want to lower the number of CV
folds from the default of 10. The example below does both.

```{r sl3-folds-example, eval=FALSE, message=FALSE, warning=FALSE}
# first, let's instantiate some more learners and create a Super Learner;
# the candidate learners are passed together through the `learners` argument
mean_learner <- Lrnr_mean$new()
rf_learner <- Lrnr_ranger$new()
sl <- Lrnr_sl$new(learners = list(mean_learner, glm_learner, rf_learner))

# CV folds are controlled in the sl3_Task object; we can lower the number of
# folds simply by specifying this in creating the Task
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz",
  folds = 5L
)

# now, let's fit the Super Learner with just 5-fold CV, then get predictions
sl_fit <- sl$train(task)
sl_preds <- sl_fit$predict()
```

The `folds` argument to `sl3_Task` supports both integers (for V-fold CV) and
all of the CV schemes supported in the [`origami`
package](https://CRAN.R-project.org/package=origami). To see the full list,
query `?fold_funs` from within `R` or take a look at [`origami`'s online
documentation](https://tlverse.org/origami/reference/).
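As a rough illustration of what such a CV scheme looks like, the sketch below builds two fold objects directly with `origami`. It assumes the `origami` package is installed; the sample size `n_obs` and the object names are made up for the example.

```r
library(origami)

set.seed(1)
n_obs <- 100 # hypothetical number of observations

# 5-fold CV: every observation is held out exactly once
folds_v5 <- make_folds(n = n_obs, fold_fun = folds_vfold, V = 5)

# Monte Carlo CV: 10 random splits, each holding out 20% of the observations
folds_mc <- make_folds(n = n_obs, fold_fun = folds_montecarlo, V = 10,
                       pvalidation = 0.2)

# each fold stores the indices of its training and validation sets
str(folds_v5[[1]]$validation_set)

# collect the validation indices across the 5 folds
all_valid <- unlist(lapply(folds_v5, function(fold) fold$validation_set))
length(all_valid)
```

A fold object built this way can then be supplied in place of the integer, e.g. `folds = folds_v5`, when creating the task.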

---

## Learner Properties

Properties supported by `sl3` learners are presented in the following table:

```{r sl3-learner-properties, warning=FALSE, echo=FALSE, message=FALSE}
library(sl3)
library(kableExtra)

all_properties <- sl3_list_properties()

get_all_learners <- function() {
  # search for objects named like sl3 learners
  learner_names <- apropos("^Lrnr_")
  learners <- mget(learner_names, inherits = TRUE)
  # verify that learner inherits from Lrnr_base (and is therefore an actual
  # sl3 Learner)
  is_learner_real <- sapply(learners, `[[`, "inherit") == "Lrnr_base"
  return(learners[which(is_learner_real)])
}
all_learners <- get_all_learners()

# get a list where each element is a list of properties of a particular learner
get_learner_class_properties <- function(learner_class) {
  return(learner_class$private_fields$.properties)
}
learner_property_lst <- lapply(all_learners, get_learner_class_properties)

# get a list where each element is a list of logicals indicating whether a
# learner has a property or not
get_learner_class_properties_ownership <- function(properties) {
  return(sl3_list_properties() %in% unlist(properties))
}
learner_property_ownership <- lapply(learner_property_lst,
                                     get_learner_class_properties_ownership)

get_learner_name <- function(learner_class) {
  return(learner_class$classname)
}
learner_names <- lapply(all_learners, get_learner_name)

# generate matrix whose columns are properties and rows are learner names
final_mat <- matrix(unlist(learner_property_ownership),
                    nrow = length(learner_property_ownership),
                    ncol = length(sl3_list_properties()), byrow = TRUE)
dimnames(final_mat) <- list(unlist(learner_names), sl3_list_properties())
final_mat.df <- as.data.frame(final_mat)
char_final_mat.df <- ifelse(final_mat.df == TRUE, "√", "x")

kable(char_final_mat.df) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed",
                                      "responsive")) %>%
  scroll_box(width = "100%", height = "200px", fixed_thead = TRUE)
```
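The same information can be queried interactively. The short sketch below assumes `sl3` is installed; `sl3_list_properties()` is the call used to build the table, and `sl3_list_learners()` filters learner names by property. The choice of `"continuous"` is just an example property.

```r
library(sl3)

# all property labels that sl3 learners can declare
props <- sl3_list_properties()
print(props)

# names of the learners that declare support for continuous outcomes
continuous_learners <- sl3_list_learners("continuous")
head(continuous_learners)
```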

---

## Contributions

Contributions are very welcome. Interested contributors should consult our
[contribution
guidelines](https://github.com/tlverse/sl3/blob/master/CONTRIBUTING.md) prior to
submitting a pull request.

---

## Citation

After using the `sl3` R package, please cite the following:

        @software{coyle2021sl3-rpkg,
          author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
            Phillips, Rachael V and Sofrygin, Oleg},
          title = {{sl3}: Modern Pipelines for Machine Learning and {Super
            Learning}},
          year = {2021},
          howpublished = {\url{https://github.com/tlverse/sl3}},
          note = {{R} package version 1.4.2},
          url = {https://doi.org/10.5281/zenodo.1342293},
          doi = {10.5281/zenodo.1342293}
        }

---

## License

© 2017-2021 [Jeremy R. Coyle](https://github.com/jeremyrcoyle), [Nima S.
Hejazi](https://nimahejazi.org), [Ivana
Malenica](https://github.com/podTockom), [Rachael V.
Phillips](https://github.com/rachaelvp), [Oleg
Sofrygin](https://github.com/osofr)

The contents of this repository are distributed under the GPL-3 license. See
file `LICENSE` for details.

---

## References
