https://github.com/borgwardtlab/multicenter-sepsis


Host: GitHub
URL: https://github.com/borgwardtlab/multicenter-sepsis
Owner: BorgwardtLab
License: apache-2.0
Created: 2020-02-21T14:05:33.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2023-10-09T18:17:01.000Z (almost 3 years ago)
Last Synced: 2023-10-09T19:25:00.786Z (almost 3 years ago)
Language: Python
Size: 69.1 MB
Stars: 3
Watchers: 7
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

This is the repository for the paper: [Predicting sepsis using deep learning across international sites: a retrospective development and validation study](https://www.thelancet.com/journals/eclinm/article/PIIS2589-5370(23)00301-2/fulltext)

### Reference:

```latex
@article{moor2023predicting,
title={Predicting sepsis using deep learning across international sites: a retrospective development and validation study},
author={Moor, Michael and Bennett, Nicolas and Ple{\v{c}}ko, Drago and Horn, Max and Rieck, Bastian and Meinshausen, Nicolai and B{\"u}hlmann, Peter and Borgwardt, Karsten},
journal={eClinicalMedicine},
volume={62},
pages={102124},
year={2023},
publisher={Elsevier}
}
```

### Disclaimer:

We plan to clean up the following components:

- R code for data loading / harmonization
- Python code for preprocessing (feature extraction), normalization, etc. (assumes a Dask pipeline that can be run on a large CPU server or cluster)

### Acknowledgements:

This project was a massive effort stretching over 4 years and over 1.5K commits.

Code contributors:

[Michael](https://github.com/mi92), [Nicolas](https://github.com/nbenn), [Max](https://github.com/ExpectationMax), [Bastian](https://github.com/Pseudomanifold), and [Drago](https://github.com/dplecko)

## Data setup

In order to set up the datasets, the R package `ricu` (available via CRAN) is required alongside access credentials for [PhysioNet](https://physionet.org) and a download token for [AmsterdamUMCdb](https://amsterdammedicaldatascience.nl/#amsterdamumcdb). This information can then be made available to `ricu` by setting the environment variables `RICU_PHYSIONET_USER`, `RICU_PHYSIONET_PASS` and `RICU_AUMC_TOKEN`.

```r
install.packages("ricu")
Sys.setenv(
  RICU_PHYSIONET_USER = "my-username",
  RICU_PHYSIONET_PASS = "my-password",
  RICU_AUMC_TOKEN = "my-token"
)
```
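
If the R scripts are launched from a Python or shell job, the same variables can instead be set in the calling environment, since R's `Sys.getenv()` reads the process environment. The following is a minimal sketch of a pre-flight check under that assumption; the helper name `missing_ricu_credentials` is hypothetical and not part of this repository.

```python
import os

# Variable names are taken from the text above; the helper is illustrative.
REQUIRED_VARS = ("RICU_PHYSIONET_USER", "RICU_PHYSIONET_PASS", "RICU_AUMC_TOKEN")


def missing_ricu_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_ricu_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All ricu credentials are set.")
```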

Then, sourcing the files in `r/utils` makes the function `export_data()` available. These files require further R packages to be installed (see `r/utils/zzz-demps.R`). Roughly, `export_data()` loads the data specified in `config/features.json` on an hourly grid, applies some patient filtering, and concludes with missingness imputation and feature augmentation steps. The script `r/scripts/create_dataset.R` can be used to carry out these steps.

```r
install.packages(
  c("here", "arrow", "bigmemory", "jsonlite", "data.table", "readr",
    "optparse", "assertthat", "cli", "memuse", "dplyr",
    "biglasso", "ranger", "qs", "lightgbm", "cowplot", "roll")
)

invisible(
  lapply(list.files(here::here("r", "utils"), full.names = TRUE), source)
)

for (x in c("mimic", "eicu", "hirid", "aumc")) {

  if (!is_data_avail(x)) {
    msg("setting up `{x}`\n")
    setup_src_data(x)
  }

  msg("exporting data for `{x}`\n")
  export_data(x)
}
```

If `export_data()` is called with the default `dest_dir` argument, `data_path("export")`, it creates one parquet file per data source under `data-export`. The same procedure can also be run on the PhysioNet demo datasets, which is useful for debugging and for checking that it runs through:

```r
install.packages(
  c("mimic.demo", "eicu.demo"),
  repos = "https://eth-mds.github.io/physionet-demo"
)

for (x in c("mimic_demo", "eicu_demo")) {
  export_data(x)
}
```

## Python pipeline (for the machine learning / modelling side):

For transparency, we include the full list of requirements used throughout this study in
```requirements_full.txt```
However, some individual packages may no longer be supported, so to get started you may want to use
```requirements_minimal.txt```

For example, activate your virtual environment and run:
```pip install -r requirements_minimal.txt```

To set up this project, we ran:
```>pipenv install```
```>pipenv shell```
Hence, feel free to also check out the Pipfile / Pipfile.lock.

### Datasets

Make sure that all exported data is put here:
```datasets/downloads/```
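
As a quick sanity check before preprocessing, one can list what is present in that folder. This is a minimal sketch, assuming the exported files carry a `.parquet` extension and sit directly in the folder; the helper `find_exported_files` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_exported_files(download_dir="datasets/downloads"):
    """Return the sorted parquet files found directly in download_dir."""
    root = Path(download_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} does not exist; export the data first")
    # Assumption: one file per data source, each ending in .parquet.
    return sorted(root.glob("*.parquet"))


if __name__ == "__main__":
    try:
        for path in find_exported_files():
            print(path.name)
    except FileNotFoundError as err:
        print(err)
```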

### Source code

`src`:
- `torch`: pytorch-based pipeline and models (currently an attention model)
TODO: add docu for training a model
- `sklearn`: sklearn-based pipeline for boosted trees baselines

## Preprocessing

### Running the preprocessing
```source scripts/run_preprocessing.sh```

Note that the preprocessed data (as parquet files) contain two different label columns, `sep3` and `utility`, where `sep3` is the sepsis label and `utility` is a regression target derived from the sepsis label, inspired by the PhysioNet 2019 Challenge for sepsis prediction. The utility score is more complex to use, as it cannot be applied directly across different datasets (due to prevalence differences). We have a solution for this (lambda parameters), but it is not part of this paper. Feel free to contact us if interested.

If you are not using our scripts (which automatically take care of this), **make sure not to use either `sep3` or `utility` as a feature for training!**
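
For custom code, the exclusion can be made explicit when selecting model inputs. The following is a minimal sketch, not the repository's own implementation; the column names other than `sep3` and `utility` are hypothetical placeholders.

```python
# Label column names as stated above; they must never be model inputs.
LABEL_COLUMNS = ("sep3", "utility")


def feature_columns(all_columns, extra_excluded=()):
    """Return the columns that are safe to use as model inputs."""
    excluded = set(LABEL_COLUMNS) | set(extra_excluded)
    return [col for col in all_columns if col not in excluded]


# Hypothetical column names, for illustration only.
columns = ["stay_id", "stay_time", "hr", "map", "sep3", "utility"]
print(feature_columns(columns, extra_excluded=("stay_id",)))
```

The `extra_excluded` argument is there for identifier columns, which usually should not be fed to a model either.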

## Training

### Model overview
- src/torch: pytorch-based pipeline and models (currently GRU and attention model)
- src/sklearn: sklearn-based pipeline for lightGBM and LogReg models

### Running the LightGBM hyperparameter search
```>source scripts/run_lgbm.sh ```

### After having run the LightGBM hyperparameter search, run repetitions with:
```>source scripts/run_lgbm_rep.sh ```

### Running the baseline models hyperparameter search + repetitions (in one)
```>source scripts/run_baselines.sh ```

### Deep models / torch pipeline
We currently run these jobs on bs-slurm-02.

First, create a sweep on wandb.ai. Then, using the sweep-id (only the id, not the entire id-path), run:
```>source scripts/wandb/submit_job.sh sweep-id```
In the `submit_job` script you can configure the variable `n_runs`, i.e. how many evaluations should be run (e.g. 25 during the coarse or fine tuning search, or 5 for repetition runs).

Example sweep for hyperparameter search of training an attention model on MIMIC:
```
method: random
metric:
  goal: minimize
  name: online_val/loss
parameters:
  batch_size:
    values:
    - 16
    - 32
    - 64
    - 128
  cost:
    value: 5
  d_model:
    values:
    - 32
    - 64
    - 128
    - 256
  dataset:
    value: MIMIC
  dropout:
    values:
    - 0.3
    - 0.4
    - 0.5
    - 0.6
    - 0.7
  gpus:
    value: -1
  ignore_statics:
    value: "True"
  label_propagation:
    value: 6
  label_propagation_right:
    value: 24
  learning_rate:
    distribution: log_uniform
    max: -7
    min: -9
  max_epochs:
    value: 100
  model:
    value: AttentionModel
  n_layers:
    value: 2
  norm:
    value: rezero
  task:
    value: classification
  weight_decay:
    values:
    - 0.1
    - 0.01
    - 0.001
    - 0.0001
program: src/torch/train_model.py
```

This configuration can be copied directly into Weights & Biases to create a new sweep.
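
The `learning_rate` bounds in the sweep are easy to misread. Assuming W&B's documented behaviour for `log_uniform`, where `min` and `max` are natural-log exponents, the values -9 and -7 correspond to learning rates between roughly 1.2e-4 and 9.1e-4. The sketch below reproduces that sampling with the standard library only; `sample_log_uniform` is an illustrative helper, not a W&B function.

```python
import math
import random


def sample_log_uniform(log_min, log_max, rng=random):
    """Draw exp(u) with u uniform on [log_min, log_max]."""
    return math.exp(rng.uniform(log_min, log_max))


# Bounds of the learning-rate search space implied by min=-9, max=-7.
low, high = math.exp(-9), math.exp(-7)
print(f"learning_rate range: {low:.2e} to {high:.2e}")

# A few draws with a fixed seed, to show typical magnitudes.
rng = random.Random(0)
print([f"{sample_log_uniform(-9, -7, rng):.2e}" for _ in range(3)])
```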

#### Training a single dataset and model
Example command for training an attention model on MIMIC:

```
python src/torch/train_model.py --batch_size=16 --d_model=256 --dataset=MIMIC --dropout=0.5 --gpus=-1 --ignore_statics=True --label_propagation=6 --label_propagation_right=24 --learning_rate=0.0002 --max_epochs=100 --model=AttentionModel --n_layers=2 --norm=rezero --task=classification --weight_decay=0.001
```
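
When launching several such runs from a script, the flags can be kept in a dictionary and expanded into the command line. This is a minimal sketch using only the flags shown above; the helper `build_command` is hypothetical, and the actual launch is left commented out.

```python
import shlex
import sys

# Hyperparameters copied from the example command above.
params = {
    "batch_size": 16,
    "d_model": 256,
    "dataset": "MIMIC",
    "dropout": 0.5,
    "gpus": -1,
    "ignore_statics": True,
    "label_propagation": 6,
    "label_propagation_right": 24,
    "learning_rate": 0.0002,
    "max_epochs": 100,
    "model": "AttentionModel",
    "n_layers": 2,
    "norm": "rezero",
    "task": "classification",
    "weight_decay": 0.001,
}


def build_command(params, script="src/torch/train_model.py"):
    """Expand a dict of hyperparameters into --key=value arguments."""
    args = [sys.executable, script]
    args += [f"--{key}={value}" for key, value in params.items()]
    return args


print(shlex.join(build_command(params)))
# To actually start training from the repository root:
# import subprocess; subprocess.run(build_command(params), check=True)
```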

## Evaluation pipeline

### Shallow models + Baselines

```>source scripts/eval_sklearn.sh ```
Here, the results folder refers to the output folder of the hyperparameter search.
Make sure that the `eval_sklearn` script reads all the methods you wish to evaluate. The script assumes that repetitions are already available.

### Deep models

First, determine the best run of your sweep, which gives you a run-id.
Then apply this model to all datasets:
```>source scripts/wandb/submit_evals.sh run-id```
Once this is completed, the prediction files can be processed in the patient eval:
```>source scripts/eval_torch.sh run-id```

For evaluating a repetition sweep, run (on Slurm):
```>pipenv run python scripts/wandb/get_repetition_runs.py sweep-id1 sweep-id2 ..```
Once completed, run (again on the CPU server):
```>python scripts/wandb/get_repetition_evals.py sweep-id1 sweep-id2 ..```

## Results and plots

For gathering all repetition results, run:
```>python -m scripts.plots.gather_data --input_path results/evaluation_validation/evaluation_output_subsampled --output_path results/evaluation_validation/plots/ ```

For creating ROC plots, run:
```>python scripts/plots/plot_roc.py --input_path results/evaluation/plots/result_data.csv```

For creating precision/earliness plots, run:
```>python -m scripts.plots.plot_scatterplots results/evaluation/plots/result_data.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation/plots/```
For the scatter data, in order to return 50 measures (5 repetition splits, 10 subsamplings), set ```--aggregation micro```

## Pooled predictions

First, we need to create a mapping from experiments (data_train, data_eval, model, etc.) to the prediction files:
```>python scripts/map_model_to_result_files.py --output_path ```
Use `--overwrite` to overwrite an existing mapping JSON.

Next we actually pool the predictions:
```>source scripts/pool_predictions.sh```
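
The `max` segment in the result paths used in this section suggests that pooling takes the maximum score across models; this is an assumption, as the pooling rule is not described here. Under that assumption, a minimal sketch of pooling aligned per-time-step scores for one patient could look as follows. The function `pool_max` and the model keys are hypothetical and do not mirror the repository's code.

```python
def pool_max(predictions_by_model):
    """Element-wise maximum over equally long score sequences."""
    sequences = list(predictions_by_model.values())
    if not sequences:
        raise ValueError("need at least one model's predictions")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all prediction sequences must have the same length")
    # zip(*sequences) groups the scores of all models per time step.
    return [max(scores) for scores in zip(*sequences)]


# Hypothetical scores from two models for one patient's time series.
scores = {
    "trained_on_site_a": [0.10, 0.35, 0.80],
    "trained_on_site_b": [0.20, 0.30, 0.60],
}
print(pool_max(scores))
```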

Then, we evaluate them:
```>source scripts/eval_pooled.sh```
To create plots with the pooled predictions, run:
```>python -m scripts.plots.gather_data --input_path results/evaluation_test/prediction_pooled_subsampled/max/evaluation_output --output_path results/evaluation_test/prediction_pooled_subsampled/max/plots/```
```>python scripts/plots/plot_roc.py --input_path results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv```
For computing precision/earliness, run:
```python -m scripts.plots.plot_scatterplots results/evaluation_test/prediction_pooled_subsampled/max/plots/result_data_subsampled.csv --r 0.80 --point-alpha 0.35 --line-alpha 1.0 --output results/evaluation_test/prediction_pooled_subsampled/max/plots/```
To create the heatmap including the pooled predictions, run:
```>python -m scripts.make_heatmap results/evaluation_test/plots/roc_summary_subsampled.csv --pooled_path results/evaluation_test/prediction_pooled_subsampled/max/plots/roc_summary_subsampled.csv```
