https://github.com/bcebere/elastic-surv

Survival analysis for Big Data

automl bigdata coxph deephit elasticsearch hyperband survival-analysis

Host: GitHub
URL: https://github.com/bcebere/elastic-surv
Owner: bcebere
License: bsd-3-clause
Created: 2021-12-13T10:36:27.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-03-20T07:18:39.000Z (almost 3 years ago)
Last Synced: 2024-10-12T18:56:58.285Z (4 months ago)
Topics: automl, bigdata, coxph, deephit, elasticsearch, hyperband, survival-analysis
Language: Jupyter Notebook
Homepage:
Size: 77.1 KB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# elastic-surv

Survival analysis on Big Data

[![elastic-surv Tests](https://github.com/bcebere/elastic-surv/actions/workflows/test.yml/badge.svg)](https://github.com/bcebere/elastic-surv/actions/workflows/test.yml)

[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://github.com/bcebere/elastic-surv/blob/main/LICENSE)

  



elastic-surv is a library for training risk estimation models on ElasticSearch backends. Potential use cases include user churn prediction and survival probability estimation.

 

- :key: Survival models include CoxPH, DeepHit, and LogisticHazard ([pycox](https://github.com/havakv/pycox)).

- :fire: ElasticSearch support using [eland](https://github.com/elastic/eland).

- :cyclone: Automatic model selection using HyperBand.

 

## Problem formulation

Risk estimation tasks require:

 - A set of covariates/features (`X`).

 - An outcome/event column (`Y`): 0 means the observation is right-censored, 1 means the event occurred.

 - A time-to-event column (`T`): the duration until the event or the censoring occurred.

The output of a risk estimation task is a survival function: for each of N time horizons, the model outputs the probability of "survival" (the event not having occurred) at that horizon.
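To make the roles of `T` and `Y` concrete, here is a minimal sketch that computes a survival function with the classical Kaplan-Meier estimator in plain Python. This is only an illustration of the problem formulation: Kaplan-Meier is not one of the models shipped with this library, it ignores the covariates `X`, and the function name `kaplan_meier` and the toy data are assumptions made for this example.

```python
def kaplan_meier(durations, events, horizons):
    """Estimate the survival probability at each horizon (illustrative helper)."""
    # Distinct times at which at least one event (Y == 1) was observed.
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = []
    for horizon in horizons:
        prob = 1.0
        for t in event_times:
            if t > horizon:
                break
            # Subjects still under observation just before time t.
            at_risk = sum(1 for d in durations if d >= t)
            # Subjects whose event happened exactly at time t.
            observed = sum(
                1 for d, e in zip(durations, events) if d == t and e == 1
            )
            prob *= 1.0 - observed / at_risk
        survival.append(prob)
    return survival


# Toy data: T is the time-to-event column, Y the outcome column (0 = censored).
T = [2, 3, 3, 5, 8, 8, 10]
Y = [1, 1, 0, 1, 0, 1, 0]
horizons = [1, 3, 5, 8]

curve = kaplan_meier(T, Y, horizons)
for horizon, prob in zip(horizons, curve):
    print(f"P(survival at t={horizon}) = {prob:.3f}")
```

Censored rows (`Y == 0`) never count as events, but they still count as "at risk" until their recorded duration, which is why the time column is needed for every row.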

 

## Installation

For configuring the ELK stack, please follow the instructions [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).

The library can be installed from a local checkout of the repository using:

```bash
pip install .
```

## Sample Usage

For each ElasticSearch data backend, we need to specify:

 - the `es_index_pattern` and the `es_client` for the ES connection.

 - which keys in the ES index hold the time-to-event and outcome data.

 - optionally, which features to include from the index.

```python
from elastic_surv.dataset import ESDataset
from elastic_surv.models import CoxPHModel

# Describe where the data lives and which keys hold T and Y.
dataset = ESDataset(
    es_index_pattern="churn-prediction",
    time_column="months_active",
    event_column="churned",
    es_client="localhost",
)

model = CoxPHModel(in_features=dataset.features())

model.train(dataset)
model.score(dataset)
```

For this example, we use a local ES index, `churn-prediction`. It can be generated using the following snippet, which loads the `churn` dataset from `pysurvival` and uploads it with `eland`:

```python
from pysurvival.datasets import Dataset
import eland as ed

raw_dataset = Dataset('churn').load()

ed.pandas_to_eland(
    raw_dataset,
    es_client='localhost',
    es_dest_index='churn-prediction',
    es_if_exists='replace',
    es_dropna=True,
    es_refresh=True,
)
```
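When indexing your own data rather than the sample dataset, it may help to check that the time-to-event and outcome keys follow the problem formulation before uploading. The sketch below is one possible way to do that in plain Python. The helper `validate_survival_records` and the sample records (including the `plan` field) are hypothetical and are not part of elastic-surv.

```python
def validate_survival_records(records, time_column, event_column):
    """Return a list of problems; an empty list means the records look usable.

    Hypothetical helper for illustration, not part of elastic-surv.
    """
    problems = []
    for i, row in enumerate(records):
        if time_column not in row or event_column not in row:
            problems.append(
                f"row {i}: missing '{time_column}' or '{event_column}'"
            )
            continue
        # The outcome must be 0 (right-censored) or 1 (event occurred).
        if row[event_column] not in (0, 1):
            problems.append(
                f"row {i}: event must be 0 or 1, got {row[event_column]!r}"
            )
        # The duration must be a non-negative number.
        duration = row[time_column]
        if not isinstance(duration, (int, float)) or duration < 0:
            problems.append(
                f"row {i}: duration must be non-negative, got {duration!r}"
            )
    return problems


# Made-up records using the column names from the example above.
records = [
    {"months_active": 12, "churned": 1, "plan": "basic"},
    {"months_active": 30, "churned": 0, "plan": "pro"},
    {"months_active": -1, "churned": 2, "plan": "pro"},
]

problems = validate_survival_records(records, "months_active", "churned")
for problem in problems:
    print(problem)
```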

## Tutorials

 - [Tutorial 1: Data backends](tutorials/tutorial_1_data_backends.ipynb)

 - [Tutorial 2: Training a survival model over ElasticSearch](tutorials/tutorial_2_model_training.ipynb)

 - [Tutorial 3: AutoML for survival analysis over ElasticSearch](tutorials/tutorial_3_automl.ipynb)

 

## Tests

Install the testing dependencies using:

```bash
pip install ".[testing]"
```

The tests can be executed using:

```bash
pytest -vsx
```