Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
High-performance time series downsampling algorithms for visualization
- Host: GitHub
- URL: https://github.com/predict-idlab/tsdownsample
- Owner: predict-idlab
- License: mit
- Created: 2022-11-22T13:38:41.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-29T17:07:03.000Z (5 months ago)
- Last Synced: 2024-09-29T09:50:00.961Z (4 months ago)
- Topics: aggregation, downsampling, fast, lttb, m4, minmax, performance, python, simd, time-series, visualization
- Language: Jupyter Notebook
- Homepage:
- Size: 622 KB
- Stars: 148
- Watchers: 10
- Forks: 13
- Open Issues: 16
- Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- awesome-time-series - tsdownsample
README
# tsdownsample
[![PyPI Latest Release](https://img.shields.io/pypi/v/tsdownsample.svg)](https://pypi.org/project/tsdownsample/)
[![support-version](https://img.shields.io/pypi/pyversions/tsdownsample)](https://img.shields.io/pypi/pyversions/tsdownsample)
[![Downloads](https://static.pepy.tech/badge/tsdownsample)](https://pepy.tech/project/tsdownsample)
[![CodeQL](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml)
[![Discord](https://img.shields.io/badge/Discord-%235865F2.svg?logo=discord&logoColor=white)](https://discord.gg/k2d59GrxPX)

Extremely fast **time series downsampling 📈** for visualization, written in Rust.
## Features ✨
- **Fast**: written in Rust with PyO3 bindings
- leverages optimized [argminmax](https://github.com/jvdd/argminmax) - which is SIMD accelerated with runtime feature detection
- scales linearly with the number of data points
- multithreaded with Rayon (in Rust)
Why we do not use Python multiprocessing - citing the PyO3 docs on parallelism:

> CPython has the infamous Global Interpreter Lock, which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for CPU-bound tasks and often forces developers to accept the overhead of multiprocessing.

In Rust - which is a compiled language - there is no GIL, so CPU-bound tasks can be parallelized (with Rayon) with little to no overhead.
- **Efficient**: memory efficient
- works on views of the data (no copies)
- no intermediate data structures are created
- **Flexible**: works on any type of data
- supported datatypes are
- for `x`: `f32`, `f64`, `i16`, `i32`, `i64`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`
- for `y`: `f16`, `f32`, `f64`, `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`, `bool`
> 🚀 `f16` [argminmax](https://github.com/jvdd/argminmax) is 200-300x faster than numpy!
>
> In contrast with all the other data types above, `f16` is *not* hardware supported (i.e., there are no `f16` instructions) by most modern CPUs! 🐌
> Programming languages facilitate support for this datatype by either (i) upcasting to `f32` or (ii) using a software implementation. 💡
> As argminmax only needs comparisons - and thus no arithmetic operations - a symmetrical ordinal mapping from `f16` to `i16` is sufficient. This mapping allows using the hardware-supported scalar and SIMD `i16` instructions while not producing any memory overhead 🎉
>
> More details are described in [argminmax PR #1](https://github.com/jvdd/argminmax/pull/1).
- **Easy to use**: simple & flexible API

## Install
```bash
pip install tsdownsample
```

## Usage
```python
from tsdownsample import MinMaxLTTBDownsampler
import numpy as np

# Create a time series
y = np.random.randn(10_000_000)
x = np.arange(len(y))

# Downsample to 1000 points (assuming a constant sampling rate)
s_ds = MinMaxLTTBDownsampler().downsample(y, n_out=1000)

# Select the downsampled data
downsampled_y = y[s_ds]

# Downsample to 1000 points using the (possibly irregularly spaced) x-data
s_ds = MinMaxLTTBDownsampler().downsample(x, y, n_out=1000)

# Select the downsampled data
downsampled_x = x[s_ds]
downsampled_y = y[s_ds]
```

## Downsampling algorithms & API
### Downsampling API 📑
Each downsampling algorithm is implemented as a class that implements a `downsample` method.
The signature of the `downsample` method:

```
downsample([x], y, n_out, **kwargs) -> ndarray[uint64]
```

**Arguments**:
- `x` is optional
- `x` and `y` are both positional arguments
- `n_out` is a mandatory keyword argument that defines the number of output values*
- `**kwargs` are optional keyword arguments *(see [table below](#downsampling-algorithms-📈))*:
- `parallel`: whether to use multi-threading (default: `False`)
❗ The max number of threads can be configured with the `TSDOWNSAMPLE_MAX_THREADS` ENV var (e.g. `os.environ["TSDOWNSAMPLE_MAX_THREADS"] = "4"`)
  - ...

**Returns**: an `ndarray[uint64]` of indices that can be used to index the original data.
\*When there are gaps in the time series, fewer than `n_out` indices may be returned.
### Downsampling algorithms 📈
The following downsampling algorithms (classes) are implemented:
| Downsampler | Description | `**kwargs` |
| ---:| --- |--- |
| `MinMaxDownsampler` | selects the **min and max** value in each bin | `parallel` |
| `M4Downsampler` | selects the [**min, max, first and last**](https://dl.acm.org/doi/pdf/10.14778/2732951.2732953) value in each bin | `parallel` |
| `LTTBDownsampler` | performs the [**Largest Triangle Three Buckets**](https://skemman.is/bitstream/1946/15343/3/SS_MSthesis.pdf) algorithm | `parallel` |
| `MinMaxLTTBDownsampler` | (*new two-step algorithm 🎉*) first selects `n_out` * `minmax_ratio` **min and max** values, then further reduces these to `n_out` values using the **Largest Triangle Three Buckets** algorithm | `parallel`, `minmax_ratio`* |

*The default value for `minmax_ratio` is 4, which has empirically proven to be a good default. More details here: https://arxiv.org/abs/2305.00332
### Handling NaNs
This library supports two `NaN`-policies:
1. Omit `NaN`s (`NaN`s are ignored during downsampling).
2. Return the index of the first `NaN` once there is at least one present in the bin of the considered data.

| Omit `NaN`s | Return `NaN`s |
| ----------------------: | :------------------------- |
| `MinMaxDownsampler` | `NaNMinMaxDownsampler` |
| `M4Downsampler` | `NaNM4Downsampler` |
| `MinMaxLTTBDownsampler` | `NaNMinMaxLTTBDownsampler` |
| `LTTBDownsampler` | |

> Note that `NaN`s are not supported for the `x`-data.
## Limitations & assumptions 🚨
It is assumed that:
1. the `x`-data is (non-strictly) monotonically increasing (i.e., sorted)
2. there are no `NaN`s in the `x`-data

---
👤 Jeroen Van Der Donckt