Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/havakv/pycox

Survival analysis with PyTorch
https://github.com/havakv/pycox

deep-learning machine-learning neural-networks python pytorch survival-analysis

Last synced: about 14 hours ago
JSON representation

Survival analysis with PyTorch

Awesome Lists containing this project

README

        


pycox


Time-to-event prediction with PyTorch


GitHub Actions status
PyPI
PyPI
Hei


Get Started
Methods
Evaluation Criteria
Datasets
Installation
References

**pycox** is a python package for survival analysis and time-to-event prediction with [PyTorch](https://pytorch.org), built on the [torchtuples](https://github.com/havakv/torchtuples) package for training PyTorch models. An R version of this package is available at [survivalmodels](https://github.com/RaphaelS1/survivalmodels).

The package contains implementations of various [survival models](#methods), some useful [evaluation metrics](#evaluation-criteria), and a collection of [event-time datasets](#datasets).
In addition, some useful preprocessing tools are available in the `pycox.preprocessing` module.

# Get Started

To get started you first need to install [PyTorch](https://pytorch.org/get-started/locally/).
You can then install **pycox** via pip:
```sh
pip install pycox
```
OR, via conda:
```sh
conda install -c conda-forge pycox
```

We recommend to start with [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb), which explains the general usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure.
The notebook use the `LogisticHazard` method for illustration, but most of the principles generalize to the other methods.

Alternatively, there are many examples listed in the [examples folder](https://nbviewer.jupyter.org/github/havakv/pycox/tree/master/examples), or you can follow the tutorial based on the `LogisticHazard`:

- [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb): General usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure.

- [02_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/02_introduction.ipynb): Quantile based discretization scheme, nested tuples with `tt.tuplefy`, entity embedding of categorical variables, and cyclical learning rates.

- [03_network_architectures.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/03_network_architectures.ipynb):
Extending the framework with custom networks and custom loss functions. The example combines an autoencoder with a survival network, and considers a loss that combines the autoencoder loss with the loss of the `LogisticHazard`.

- [04_mnist_dataloaders_cnn.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/04_mnist_dataloaders_cnn.ipynb):
Using dataloaders and convolutional networks for the MNIST data set. We repeat the [simulations](https://peerj.com/articles/6257/#p-41) of [\[8\]](#references) where each digit defines the scale parameter of an exponential distribution.

# Methods

The following methods are available in the `pycox.methods` module.

## Continuous-Time Models:


Method
Description
Example


CoxTime

Cox-Time is a relative risk model that extends Cox regression beyond the proportional hazards [1].

notebook


CoxCC

Cox-CC is a proportional version of the Cox-Time model [1].

notebook


CoxPH (DeepSurv)

CoxPH is a Cox proportional hazards model also referred to as DeepSurv [2].

notebook


PCHazard

The Piecewise Constant Hazard (PC-Hazard) model [12] assumes that the continuous-time hazard function is constant in predefined intervals.
It is similar to the Piecewise Exponential Models [11] and PEANN [14], but with a softplus activation instead of the exponential function.

notebook

## Discrete-Time Models:


Method
Description
Example


LogisticHazard (Nnet-survival)

The Logistic-Hazard method parametrize the discrete hazards and optimize the survival likelihood [12] [7].
It is also called Partial Logistic Regression [13] and Nnet-survival [8].

notebook



PMF

The PMF method parametrize the probability mass function (PMF) and optimize the survival likelihood [12]. It is the foundation of methods such as DeepHit and MTLR.

notebook



DeepHit, DeepHitSingle

DeepHit is a PMF method with a loss for improved ranking that
can handle competing risks [3].

single
competing


MTLR (N-MTLR)

The (Neural) Multi-Task Logistic Regression is a PMF methods proposed by
[9] and [10].

notebook



BCESurv

A method representing a set of binary classifiers that remove individuals as they are censored [15]. The loss is the binary cross entropy of the survival estimates at a set of discrete times, with targets that are indicators of surviving each time.

bs_example

# Evaluation Criteria

The following evaluation metrics are available with `pycox.evalutation.EvalSurv`.


Metric
Description


concordance_td

The time-dependent concordance index evaluated at the event times [4].



brier_score

The IPCW Brier score (inverse probability of censoring weighted Brier score) [5][6][15].
See Section 3.1.2 of [15] for details.



nbll

The IPCW (negative) binomial log-likelihood [5][1]. I.e., this is minus the binomial log-likelihood and should not be confused with the negative binomial distribution.
The weighting is performed as in Section 3.1.2 of [15] for details.



integrated_brier_score

The integrated IPCW Brier score. Numerical integration of the `brier_score` [5][6].



integrated_nbll

The integrated IPCW (negative) binomial log-likelihood. Numerical integration of the `nbll` [5][1].



brier_score_admin integrated_brier_score_admin

The administrative Brier score [15]. Works well for data with administrative censoring, meaning all censoring times are observed.
See this example notebook.



nbll_admin integrated_nbll_admin

The administrative (negative) binomial log-likelihood [15]. Works well for data with administrative censoring, meaning all censoring times are observed.
See this example notebook.


# Datasets

A collection of datasets are available through the `pycox.datasets` module.
For example, the following code will download the `metabric` dataset and load it in the form of a pandas dataframe
```python
from pycox import datasets
df = datasets.metabric.read_df()
```

The `datasets` module will store datasets under the installation directory by default. You can specify a different directory by setting the `PYCOX_DATA_DIR` environment variable.

## Real Datasets:


Dataset
Size
Dataset
Data source


flchain
6,524

The Assay of Serum Free Light Chain (FLCHAIN) dataset. See
[1] for preprocessing.

source


gbsg
2,232

The Rotterdam & German Breast Cancer Study Group.
See [2] for details.

source


kkbox
2,814,735

A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017 with administrative censoring.
See [1] and [15] for details.
Compared to kkbox_v1, this data set has more covariates and censoring times.
Note: You need
Kaggle credentials to access the dataset.

source


kkbox_v1
2,646,746

A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017.
See [1] for details.
This is not the preferred version of this data set. Use kkbox instead.
Note: You need
Kaggle credentials to access the dataset.

source


metabric
1,904

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC).
See [2] for details.

source


nwtco
4,028

Data from the National Wilm's Tumor (NWTCO).

source


support
8,873

Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT).
See [2] for details.

source

## Simulated Datasets:


Dataset
Size
Dataset
Data source


rr_nl_nph
25,000

Dataset from simulation study in [1].
This is a continuous-time simulation study with event times drawn from a
relative risk non-linear non-proportional hazards model (RRNLNPH).

SimStudyNonLinearNonPH


sac3
100,000

Dataset from simulation study in [12].
This is a discrete time dataset with 1000 possible event-times.

SimStudySACCensorConst


sac_admin5
50,000

Dataset from simulation study in [15].
This is a discrete time dataset with 1000 possible event-times.
Very similar to `sac3`, but with fewer survival covariates and administrative censoring determined by 5 covariates.

SimStudySACAdmin

# Installation

**Note:** *This package is still in its early stages of development, so please don't hesitate to report any problems you may experience.*

The package only works for python 3.6+.

Before installing **pycox**, please install [PyTorch](https://pytorch.org/get-started/locally/) (version >= 1.1).
You can then install the package with
```sh
pip install pycox
```
For the bleeding edge version, you can instead install directly from github (consider adding `--force-reinstall`):
```sh
pip install git+git://github.com/havakv/pycox.git
```

## Install from Source

Installation from source depends on [PyTorch](https://pytorch.org/get-started/locally/), so make sure a it is installed.
Next, clone and install with
```sh
git clone https://github.com/havakv/pycox.git
cd pycox
pip install .
```

# References

\[1\] Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. *Journal of Machine Learning Research*, 20(129):1–30, 2019. \[[paper](http://jmlr.org/papers/v20/18-424.html)\]

\[2\] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. *BMC Medical Research Methodology*, 18(1), 2018. \[[paper](https://doi.org/10.1186/s12874-018-0482-1)\]

\[3\] Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. *In Thirty-Second AAAI Conference on Artificial Intelligence*, 2018. \[[paper](http://medianetlab.ee.ucla.edu/papers/AAAI_2018_DeepHit)\]

\[4\] Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. *Statistics in Medicine*, 24(24):3927–3944, 2005. \[[paper](https://doi.org/10.1002/sim.2427)\]

\[5\] Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. *Statistics in Medicine*, 18(17-18):2529–2545, 1999. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-0258%2819990915/30%2918%3A17/18%3C2529%3A%3AAID-SIM274%3E3.0.CO%3B2-5)\]

\[6\] Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected brier score in general survival models with right-censored event times. *Biometrical Journal*, 48 (6):1029–1040, 2006. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.200610301?sid=nlm%3Apubmed)\]

\[7\] Charles C. Brown. On the use of indicator variables for studying the time-dependence of parameters in a response-time model. *Biometrics*, 31(4):863–872, 1975.
\[[paper](https://www.jstor.org/stable/2529811?seq=1#metadata_info_tab_contents)\]

\[8\] Michael F. Gensheimer and Balasubramanian Narasimhan. A scalable discrete-time survival model for neural networks. *PeerJ*, 7:e6257, 2019.
\[[paper](https://peerj.com/articles/6257/)\]

\[9\] Chun-Nam Yu, Russell Greiner, Hsiu-Chin Lin, and Vickie Baracos. Learning patient- specific cancer survival distributions as a sequence of dependent regressors. *In Advances in Neural Information Processing Systems 24*, pages 1845–1853. Curran Associates, Inc., 2011.
\[[paper](https://papers.nips.cc/paper/4210-learning-patient-specific-cancer-survival-distributions-as-a-sequence-of-dependent-regressors)\]

\[10\] Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. *arXiv preprint arXiv:1801.05512*, 2018.
\[[paper](https://arxiv.org/pdf/1801.05512.pdf)\]

\[11\] Michael Friedman. Piecewise exponential models for survival data with covariates. *The Annals of Statistics*, 10(1):101–113, 1982.
\[[paper](https://projecteuclid.org/euclid.aos/1176345693)\]

\[12\] Håvard Kvamme and Ørnulf Borgan. Continuous and discrete-time survival prediction with neural networks. *arXiv preprint arXiv:1910.06724*, 2019.
\[[paper](https://arxiv.org/pdf/1910.06724.pdf)\]

\[13\] Elia Biganzoli, Patrizia Boracchi, Luigi Mariani, and Ettore Marubini. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. *Statistics in Medicine*, 17(10):1169–1186, 1998.
\[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0258(19980530)17:10%3C1169::AID-SIM796%3E3.0.CO;2-D)\]

\[14\] Marco Fornili, Federico Ambrogi, Patrizia Boracchi, and Elia Biganzoli. Piecewise exponential artificial neural networks (PEANN) for modeling hazard function with right censored data. *Computational Intelligence Methods for Bioinformatics and Biostatistics*, pages 125–136, 2014.
\[[paper](https://link.springer.com/chapter/10.1007%2F978-3-319-09042-9_9)\]

\[15\] Håvard Kvamme and Ørnulf Borgan. The Brier Score under Administrative Censoring: Problems and Solutions. *arXiv preprint arXiv:1912.08581*, 2019.
\[[paper](https://arxiv.org/pdf/1912.08581.pdf)\]