https://github.com/genentech/modsculpt_example

Host: GitHub
URL: https://github.com/genentech/modsculpt_example
Owner: Genentech
License: mit
Created: 2024-08-29T21:21:03.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-13T23:31:26.000Z (over 1 year ago)
Last Synced: 2025-02-05T03:27:54.963Z (over 1 year ago)
Language: JavaScript
Homepage: https://genentech.github.io/modsculpt_example/
Size: 5.78 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
          
## Important links

- modsculpt package: <https://github.com/Genentech/modsculpt>

- Output examples (GitHub Pages): <https://genentech.github.io/modsculpt_example/>

## Table Of Contents

- [Introduction](#introduction)

- [Overview](#overview)

    - [1. Set-up](#1-set-up)

    - [2. Tuning and training models](#2-tuning-and-training-models)

    - [3. Model sculpting](#3-model-sculpting)

    - [4. Model evaluation](#4-model-evaluation)

- [Details](#details)

    - [Install the packages](#install-the-packages)

    - [Data examples](#data-examples)

    - [Model options](#model-options)

    - [Job submission](#job-submission)

- [Appendix](#appendix)

    - [Continuous endpoint example](#continuous-endpoint-example)

## Introduction

This repository provides a fully worked-out example of model sculpting,
a method to develop interpretable, trustworthy, and high-performing additive
models. The example presented here accompanies the [`modsculpt`](https://github.com/Genentech/modsculpt)
package, which provides a set of tools to perform model sculpting, and
[the slide deck (still in development)](#placeholder) that explains the concept of model sculpting.

Here we provide an example workflow showing how to develop a strong learner with
hyperparameter tuning, perform proper performance evaluation using nested
cross-validation, sculpt the strong learner into an interpretable model,
and evaluate and interpret the sculpted model.

Simpler examples are provided in the original
[`modsculpt` repository](https://genentech.github.io/modsculpt/main/articles/modsculpt.html).

## Overview

### 1. Set-up

1. Clone the repository

1. Open the R project file `modsculpt_example.Rproj` in RStudio

1. [Install the necessary packages](#install-the-packages)

1. Open the R script `R/0_setup.R` and modify the storage folder as necessary

    - By default it is set to the `output/sculpt_results` folder, but you can specify
      any folder, including one outside of this repository
      (e.g. when you run this code on a cluster).

### 2. Tuning and training models

1. Run the R script `R/1_prepare_data.R` to [prepare the example data](#data-examples)

    - [You can define your own data](#data-examples) in `R/define_data.R`

1. Run the R script `R/2_train_models.R` to train the models

    - See below for [data examples](#data-examples) and [model options](#model-options) 

    - You can run `R/2_train_models.R` interactively, or execute it in batch
      mode from the command line.  
      See [job submission](#job-submission) for details

    - The following models need to be trained on the `compas` dataset if you want

      to follow the steps below

    

      ```bash
      # Note that xgb training can take some time (~1 hr on M2 MacBook Air)
      Rscript R/2_train_models.R xgb compas FALSE
      # xgb_1_order is a bit faster (~30 min)
      Rscript R/2_train_models.R xgb_1_order compas FALSE
      # Following models are faster (<1 min)
      Rscript R/2_train_models.R log_elastic compas FALSE
      Rscript R/2_train_models.R log_lasso compas FALSE
      Rscript R/2_train_models.R log_ridge compas FALSE
      ```

1. *[Optional]* Run the R script `R/2_train_models_ncv.R` to perform

   nested cross-validation on the models

    - This script is necessary only if you want to perform nested cross-validation

      for model evaluation

    - You can run this script interactively, or execute it in batch mode from
      the command line.  
      See [job submission](#job-submission) for details
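
The five `compas` training runs listed above can also be generated from a small shell loop. The following is a minimal sketch, not part of the repository: the function name `compas_train_cmds` is hypothetical, and it only prints the commands so that you can inspect them before running anything.

```bash
# Sketch: print the five training commands used in this walkthrough.
# `compas_train_cmds` is a hypothetical helper, not part of the repository.
# Optional first argument: the training script (defaults to R/2_train_models.R).
compas_train_cmds() {
  local script="${1:-R/2_train_models.R}"
  local model
  # Third Rscript argument FALSE = grid search only, no Bayesian optimization
  for model in xgb xgb_1_order log_elastic log_lasso log_ridge; do
    echo "Rscript ${script} ${model} compas FALSE"
  done
}
```

To run the commands one after another from the repository root, pipe the output to a shell, e.g. `compas_train_cmds | sh`. Passing `R/2_train_models_ncv.R` as the first argument prints the equivalent nested cross-validation commands.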

### 3. Model sculpting

1. Run the R script `R/3_model_sculpt_compas.R` to perform model sculpting
   on the `compas` dataset

    - You can modify the script to evaluate the models on the `bike` dataset

1. *[Optional]* Run the R script `R/3_model_sculpt_ncv_compas.R` to perform

   model sculpting on nested cross-validation outputs for sculpted model 

   performance evaluation

### 4. Model evaluation

1. Evaluation/interpretation of sculpted models: `notebooks/2_1_modsculpt_compas.qmd`

1. Performance summary: `notebooks/3_performance_summary.qmd`

1. [Sample outputs](https://genentech.github.io/modsculpt_example/)
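
The notebooks above can be rendered from RStudio. As an alternative, the sketch below lists the notebook paths for one dataset so they can be passed to the Quarto command line. It assumes the Quarto CLI is installed and that the training and sculpting steps have already written their results; `notebooks_for` is a hypothetical helper name, not part of the repository.

```bash
# Sketch: list the evaluation notebooks for one dataset (compas or bike).
# `notebooks_for` is a hypothetical helper, not part of the repository.
notebooks_for() {
  echo "notebooks/2_1_modsculpt_$1.qmd"
  echo "notebooks/3_performance_summary.qmd"
}
```

For example, `notebooks_for compas | while read -r nb; do quarto render "$nb"; done` renders both notebooks in turn.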

## Details

### Install the packages

The following packages are required to run the example in this repository.

```r
# Install necessary packages
install_if_not_available <-
  function(pkg, from_github = FALSE, ref = NULL, min_version = NULL) {

    # Check whether it's installed and the version is ok
    is_installed <- suppressWarnings(suppressPackageStartupMessages(require(pkg, character.only = TRUE)))
    version_ok <- is.null(min_version)
    if (is_installed && !version_ok) version_ok <- packageVersion(pkg) >= min_version

    # Install if necessary
    if (!is_installed || !version_ok) {
      if (from_github) {
        remotes::install_github(paste0("Genentech/", pkg), ref = ref)
      } else {
        install.packages(pkg)
      }
    }
  }
install_if_not_available("dplyr")
install_if_not_available("tidyr")
install_if_not_available("readr")
install_if_not_available("purrr")
install_if_not_available("lubridate")
install_if_not_available("here")
install_if_not_available("DT")
install_if_not_available("data.table")
install_if_not_available("mgcv")
install_if_not_available("tidymodels")
install_if_not_available("glmnet")
install_if_not_available("xgboost")
# devtools::install_version('xgboost', '1.7.7.1') # Version used for generating the outputs
install_if_not_available("tune")
install_if_not_available("parallel")
install_if_not_available("doParallel")
# Install the following packages from GitHub
install_if_not_available("remotes")
install_if_not_available("modsculpt", from_github = TRUE, ref = "v0.1.1")
install_if_not_available("stats4phc", from_github = TRUE, ref = "v0.1.1")
```

### Data examples

Two example datasets are provided, `compas` and `bike`.

- `compas`: A dataset from the COMPAS recidivism risk assessment tool, used in
  the publication ["Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead" by Rudin](https://www.nature.com/articles/s42256-019-0048-x), available
  [here](https://github.com/propublica/compas-analysis). This is an example for
  binary classification.

- `bike`: A dataset from the Capital Bikeshare program in Washington, D.C.,
  available [here](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
  This is an example for regression.

If you are interested in using your own data, you can define it in
`R/define_data.R`. For a binary classification task, the response variable must
be a factor with the first level being "positive"; otherwise parts of the code
will convert it incorrectly.

### Model options

The following `model_types` are currently supported, made available through the
`parsnip` package.
See `R/define_model.R` for the implementation details.

| model_type | task | description | packages |
|------------|------|-------------|----------|
| `xgb` | reg. or class. | XGBoost | `xgboost` |
| `xgb_monotone` | reg. or class. | XGBoost with monotone constraints as defined in `R/define_data.R` | `xgboost` |
| `xgb_*_order` | reg. or class. | XGBoost with `*` order constraints; replace `*` with the number of your choice, e.g. `xgb_1_order` | `xgboost` |
| `xgb_*_order_monotone` | reg. or class. | XGBoost with the `*_order` and the monotone constraints | `xgboost` |
| `lm` / `logistic` | reg. / class. | Linear model / Logistic regression | `stats` / `glm` |
| `lm_elastic` / `log_elastic` | reg. / class. | Elastic net | `glmnet` |
| `lm_lasso` / `log_lasso` | reg. / class. | Lasso | `glmnet` |
| `lm_ridge` / `log_ridge` | reg. / class. | Ridge | `glmnet` |
| `rf` | reg. or class. | Random forest | `ranger` |
| `lgb` | reg. or class. | LightGBM | `bonsai`, `lightgbm` |
| `rpart` | class. | Recursive partitioning | `rpart` |
| `rpart_regression` | reg. | Recursive partitioning | `rpart` |
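
When submitting jobs it can help to check that a model type matches the task of the dataset (`compas` is a classification example, `bike` a regression example). The sketch below encodes the task column of the table above as a shell lookup. The function name `task_for_model` and its return strings are hypothetical and not part of the repository.

```bash
# Sketch: map a model_type from the table above to the task it supports.
# `task_for_model` and its return values are hypothetical, not part of the repository.
task_for_model() {
  case "$1" in
    # The `*` in these patterns is a shell glob, so xgb_1_order, xgb_2_order, ... all match
    xgb|xgb_monotone|xgb_*_order|xgb_*_order_monotone|rf|lgb)
      echo "reg_or_class" ;;
    lm|lm_elastic|lm_lasso|lm_ridge|rpart_regression)
      echo "reg" ;;
    logistic|log_elastic|log_lasso|log_ridge|rpart)
      echo "class" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For instance, `task_for_model log_lasso` prints `class`, so `log_lasso` pairs with `compas`, while `task_for_model lm_elastic` prints `reg`, which pairs with `bike`.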

### Job submission

You can run the training scripts `R/2_train_models.R` and `R/2_train_models_ncv.R`
in batch mode from the command line.
The scripts take three arguments: model type, dataset name, and
whether to use Bayesian optimization in addition to the grid search.
This is particularly useful when you run the code on a cluster.
On your local computer, you should be able to use any terminal;
on a Windows PC, the terminal tab in RStudio might be the easiest.

```bash
Rscript R/2_train_models.R $1 $2 $3
Rscript R/2_train_models_ncv.R $1 $2 $3
```

where

- `$1` is the model type (e.g. `xgb`, see [model options](#model-options) for

  the full list of model types)

- `$2` is the dataset name (e.g. `compas` or `bike`)

- `$3` is whether to use Bayesian optimization (e.g. `TRUE` or `FALSE`)

  in addition to the grid search

    - Bayesian optimization takes a long time to run, so it is recommended to

      set this to "FALSE" for an interactive run

For example,

```bash
Rscript R/2_train_models.R xgb compas FALSE
Rscript R/2_train_models.R lm_elastic bike FALSE
```
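
Because the scripts take positional arguments, a typo in the third argument is easy to miss in a long-running batch job. The following is a minimal sketch of a wrapper that checks the argument count and the `TRUE`/`FALSE` flag before building the command. `build_train_cmd` is a hypothetical helper, not part of the repository, and it deliberately does not restrict the dataset name, since you can define your own data.

```bash
# Sketch: validate the three arguments, then print the Rscript command.
# `build_train_cmd` is a hypothetical helper, not part of the repository.
build_train_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: build_train_cmd MODEL_TYPE DATASET TRUE|FALSE" >&2
    return 2
  fi
  case "$3" in
    TRUE|FALSE) ;;
    *)
      echo "third argument must be TRUE or FALSE, got: $3" >&2
      return 2 ;;
  esac
  echo "Rscript R/2_train_models.R $1 $2 $3"
}
```

The printed command can be run directly, e.g. `build_train_cmd lm_elastic bike FALSE | sh`, or written into a cluster job script.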

## Appendix

### Continuous endpoint example

Example with the `bike` dataset:

1. Run the R scripts `R/2_train_models.R` (& `R/2_train_models_ncv.R`) to train the models

    

    ```bash
    Rscript R/2_train_models.R xgb bike FALSE
    Rscript R/2_train_models.R xgb_1_order bike FALSE
    # bike is a regression example, so use the lm_* (not log_*) model types
    Rscript R/2_train_models.R lm_elastic bike FALSE
    Rscript R/2_train_models.R lm_lasso bike FALSE
    Rscript R/2_train_models.R lm_ridge bike FALSE
    ```

1. Run the R script `R/3_model_sculpt_bike.R` (& `R/3_model_sculpt_ncv_bike.R`) 

   to perform model sculpting (and nested cross-validation performance evaluation)

   on the `bike` dataset

1. Evaluation/interpretation of sculpted models: `notebooks/2_1_modsculpt_bike.qmd`

1. Performance summary: `notebooks/3_performance_summary.qmd`

1. [Sample outputs](https://genentech.github.io/modsculpt_example/)