https://github.com/philipdarke/torchtime
Benchmark time series data sets for PyTorch
- Host: GitHub
- URL: https://github.com/philipdarke/torchtime
- Owner: philipdarke
- License: mit
- Created: 2022-03-28T16:47:59.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-02-14T14:13:38.000Z (11 months ago)
- Last Synced: 2024-04-23T12:38:12.033Z (9 months ago)
- Topics: classification, datasets, physionet, pytorch, supervised-learning, time-series
- Language: Python
- Homepage: https://philipdarke.com/torchtime
- Size: 3.07 MB
- Stars: 32
- Watchers: 3
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-time-series - torchtime
README
# Benchmark time series data sets for PyTorch
[![PyPi](https://img.shields.io/pypi/v/torchtime)](https://pypi.org/project/torchtime)
[![Build status](https://img.shields.io/github/actions/workflow/status/philipdarke/torchtime/build.yml?branch=main)](https://github.com/philipdarke/torchtime/actions/workflows/build.yml)
![Coverage](https://philipdarke.com/torchtime/assets/coverage-badge.svg?dummy=8484744)
[![License](https://img.shields.io/github/license/philipdarke/torchtime.svg)](https://github.com/philipdarke/torchtime/blob/main/LICENSE)
[![DOI](https://img.shields.io/badge/DOI-10.48550%2FarXiv.2207.12503-blue)](https://doi.org/10.48550/arXiv.2207.12503)

PyTorch data sets for supervised time series classification and prediction problems, including:
* All UEA/UCR classification repository data sets
* PhysioNet Challenge 2012 (in-hospital mortality)
* PhysioNet Challenge 2019 (sepsis prediction)
* A binary prediction variant of the 2019 PhysioNet Challenge

## Why use `torchtime`?
1. Saves time. You don't have to write your own PyTorch data classes.
2. Better research. Use common, reproducible implementations of data sets for a level playing field when evaluating models.

## Installation
Install PyTorch followed by `torchtime`:
```bash
$ pip install torchtime
```

or

```bash
$ conda install torchtime -c conda-forge
```

There is currently no Windows build for `conda`. Feedback is welcome from `conda` users in particular.

## Getting started
Data classes have a common API. The `split` argument determines whether training ("*train*"), validation ("*val*") or test ("*test*") data are returned. The sizes of the splits are controlled with the `train_prop` and (optional) `val_prop` arguments.
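
As a rough illustration of how the two proportions might translate into split sizes, here is a minimal sketch in plain Python. The helper `split_sizes` is hypothetical and is not part of `torchtime`. It assumes that when `val_prop` is omitted the data left over after the training split form the validation set, and that when `val_prop` is given the remainder forms the test set. The rounding rule is also an assumption, so the exact counts produced by the package may differ.

```python
def split_sizes(n_samples, train_prop, val_prop=None):
    """Hypothetical helper: number of samples in each split.

    Not torchtime's implementation; rounding behaviour is an assumption.
    """
    if not 0 < train_prop < 1:
        raise ValueError("train_prop must be between 0 and 1")
    n_train = round(n_samples * train_prop)
    if val_prop is None:
        # Two-way split: everything not used for training is validation data.
        return {"train": n_train, "val": n_samples - n_train}
    if val_prop <= 0 or train_prop + val_prop >= 1:
        raise ValueError("val_prop must be positive and leave room for test data")
    n_val = round(n_samples * val_prop)
    # Three-way split: the remainder is held out as test data.
    return {"train": n_train, "val": n_val, "test": n_samples - n_train - n_val}


print(split_sizes(100, train_prop=0.7))                # {'train': 70, 'val': 30}
print(split_sizes(100, train_prop=0.7, val_prop=0.2))  # {'train': 70, 'val': 20, 'test': 10}
```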
### PhysioNet data sets
Three [PhysioNet](https://physionet.org/) data sets are currently supported:
* [`torchtime.data.PhysioNet2012`](https://philipdarke.com/torchtime/api/data.html#torchtime.data.PhysioNet2012) returns the 2012 challenge (in-hospital mortality) [[link]](https://physionet.org/content/challenge-2012/1.0.0/).
* [`torchtime.data.PhysioNet2019`](https://philipdarke.com/torchtime/api/data.html#torchtime.data.PhysioNet2019) returns the 2019 challenge (sepsis prediction) [[link]](https://physionet.org/content/challenge-2019/1.0.0/).
* [`torchtime.data.PhysioNet2019Binary`](https://philipdarke.com/torchtime/api/data.html#torchtime.data.PhysioNet2019Binary) returns a binary prediction variant of the 2019 challenge.

For example, to load training data for the 2012 challenge with a 70/30% training/validation split and create a [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for model training:
```python
from torch.utils.data import DataLoader
from torchtime.data import PhysioNet2012

# 70% of the data for training; the remaining 30% form the validation split
physionet2012 = PhysioNet2012(
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(physionet2012, batch_size=32)
```

### UEA/UCR repository data sets
The [`torchtime.data.UEA`](https://philipdarke.com/torchtime/api/data.html#torchtime.data.UEA) class returns the [UEA/UCR repository](https://www.timeseriesclassification.com/) data set specified by the `dataset` argument, for example:
```python
from torch.utils.data import DataLoader
from torchtime.data import UEA

# Training split (70% of the data) of the ArrowHead data set
arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

### Using the DataLoader
Batches are dictionaries of tensors `X`, `y` and `length`:
* `X` are the time series data. The package follows the *batch first* convention, so `X` has shape (*n*, *s*, *c*) where *n* is batch size, *s* is (longest) trajectory length and *c* is the number of channels. By default, the first channel is a time stamp.
* `y` are one-hot encoded labels of shape (*n*, *l*) where *l* is the number of classes.
* `length` is the length of each trajectory (before padding if sequences are of irregular length), i.e. a tensor of shape (*n*).

For example, ArrowHead is a univariate time series, so `X` has two channels: the time stamp followed by the time series (*c* = 2). Each series has 251 observations (*s* = 251) and there are three classes (*l* = 3). For a batch size of 32:
```python
next_batch = next(iter(dataloader))
next_batch["X"].shape # torch.Size([32, 251, 2])
next_batch["y"].shape # torch.Size([32, 3])
next_batch["length"].shape # torch.Size([32])
```

See [Using DataLoaders](https://philipdarke.com/torchtime/tutorials/getting_started.html#using-dataloaders) for more information.
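
To make the role of `length` concrete, the sketch below pads a batch of irregular-length sequences to the longest trajectory and builds a mask of valid time steps from the recorded lengths. It uses nested Python lists rather than tensors so that it runs without any dependencies. The helpers `pad_batch` and `valid_mask` and the padding value of `0.0` are illustrative assumptions, not part of `torchtime`, which may pad with a different value.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad sequences of shape (s_i, c) to a batch of shape (n, s_max, c).

    Hypothetical helper; pad_value is an assumption, not torchtime's choice.
    Returns the padded batch and the original length of each sequence.
    """
    lengths = [len(seq) for seq in sequences]
    s_max = max(lengths)
    n_channels = len(sequences[0][0])
    padded = [
        [list(step) for step in seq]
        + [[pad_value] * n_channels for _ in range(s_max - len(seq))]
        for seq in sequences
    ]
    return padded, lengths


def valid_mask(lengths, s_max):
    """True for observed time steps, False for padding."""
    return [[t < length for t in range(s_max)] for length in lengths]


# Two sequences with channels (time stamp, value) and lengths 3 and 1
batch = [
    [[0.0, 1.5], [1.0, 2.5], [2.0, 3.5]],
    [[0.0, 4.0]],
]
X, length = pad_batch(batch)
print(length)                 # [3, 1]
print(valid_mask(length, 3))  # [[True, True, True], [True, False, False]]
```

A mask like this is typically used to exclude padded steps when computing a loss or pooling over time.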
## Advanced options
* Missing data can be imputed by setting `impute` to *mean* (replace with training data channel means) or *forward* (replace with previous observation). Alternatively a custom imputation function can be passed to the `impute` argument.
* A time stamp (added by default), missing data mask and the time since previous observation can be appended with the boolean arguments `time`, `mask` and `delta` respectively.
* Time series data are standardised using the `standardise` boolean argument.
* The location of cached data can be changed with the `path` argument, for example to share a single cache location across projects.
* For reproducibility, an optional random `seed` can be specified.
* Missing data can be simulated using the `missing` argument to drop data at random from UEA/UCR data sets.

See the [tutorials](https://philipdarke.com/torchtime/tutorials/) and [API](https://philipdarke.com/torchtime/api/) for more information.
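
For readers unfamiliar with these options, the following sketch shows what forward imputation, a missing data mask and the time since previous observation look like for a single channel. It is plain Python and is not `torchtime`'s implementation. The function names are hypothetical, the mask uses the convention of Che et al, 2018 (1 = observed, 0 = missing), and the time delta follows the definition in that paper. Check the API documentation for the conventions the package actually uses.

```python
import math


def forward_impute(values):
    """Replace each NaN with the most recent observation (leading NaNs stay NaN)."""
    filled, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled


def missing_mask(values):
    """1 where a value was observed, 0 where it is missing (assumed convention)."""
    return [0 if math.isnan(v) else 1 for v in values]


def time_delta(times, mask):
    """Time since the previous observation, as defined in Che et al, 2018."""
    delta = [0.0]
    for t in range(1, len(times)):
        gap = times[t] - times[t - 1]
        # Keep accumulating while the previous step was missing.
        delta.append(gap if mask[t - 1] == 1 else gap + delta[t - 1])
    return delta


times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, math.nan, math.nan, 4.0, math.nan]
mask = missing_mask(values)

print(forward_impute(values))   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(mask)                     # [1, 0, 0, 1, 0]
print(time_delta(times, mask))  # [0.0, 1.0, 2.0, 3.0, 1.0]
```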
## Other resources
If you're looking for the TensorFlow equivalent for PhysioNet data sets try [medical_ts_datasets](https://github.com/ExpectationMax/medical_ts_datasets).
## Acknowledgements
`torchtime` uses some of the data processing ideas in Kidger et al, 2020 [[1]](https://arxiv.org/abs/2005.08926) and Che et al, 2018 [[2]](https://doi.org/10.1038/s41598-018-24271-9).
This work is supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).
## Citing `torchtime`
If you use this software, please cite the [paper](https://doi.org/10.48550/arXiv.2207.12503):
```
@software{darke_torchtime_2022,
  author = {Darke, Philip and Missier, Paolo and Bacardit, Jaume},
  title = "Benchmark time series data sets for {PyTorch} - the torchtime package",
  month = jul,
  year = 2022,
  publisher = {arXiv},
  doi = {10.48550/arXiv.2207.12503},
  url = {https://doi.org/10.48550/arXiv.2207.12503},
}
```

DOIs are also available for each version of the package [here](https://doi.org/10.5281/zenodo.6402406).
## References
1. Kidger, P, Morrill, J, Foster, J, *et al*. Neural Controlled Differential Equations for Irregular Time Series. *arXiv* 2005.08926 (2020). [[arXiv]](https://arxiv.org/abs/2005.08926)
1. Che, Z, Purushotham, S, Cho, K, *et al*. Recurrent Neural Networks for Multivariate Time Series with Missing Values. *Sci Rep* 8, 6085 (2018). [[doi]](https://doi.org/10.1038/s41598-018-24271-9)
1. Silva, I, Moody, G, Scott, DJ, *et al*. Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. *Comput Cardiol* 39: 245-248 (2012). [[hdl]](http://hdl.handle.net/1721.1/93166)
1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. *Critical Care Medicine* 48 (2): 210-217 (2019). [[doi]](https://doi.org/10.1097/CCM.0000000000004145)
1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). *PhysioNet* (2019). [[doi]](https://doi.org/10.13026/v64v-d857)
1. Goldberger, A, Amaral, L, Glass, L, *et al*. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101 (23), pp. e215–e220 (2000). [[doi]](https://doi.org/10.1161/01.cir.101.23.e215)
1. Löning, M, Bagnall, A, Ganesh, S, *et al*. sktime: A Unified Interface for Machine Learning with Time Series. *Workshop on Systems for ML at NeurIPS 2019* (2019). [[doi]](https://doi.org/10.5281/zenodo.3970852)
1. Löning, M, Bagnall, A, Middlehurst, M, *et al*. alan-turing-institute/sktime: v0.10.1 (v0.10.1). *Zenodo* (2022). [[doi]](https://doi.org/10.5281/zenodo.6191159)
## License
Released under the MIT license.