# tsdownsample

[![PyPI Latest Release](https://img.shields.io/pypi/v/tsdownsample.svg)](https://pypi.org/project/tsdownsample/)
[![support-version](https://img.shields.io/pypi/pyversions/tsdownsample)](https://img.shields.io/pypi/pyversions/tsdownsample)
[![Downloads](https://static.pepy.tech/badge/tsdownsample)](https://pepy.tech/project/tsdownsample)
[![CodeQL](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/codeql.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-downsample_rs.yml)
[![Testing](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml/badge.svg)](https://github.com/predict-idlab/tsdownsample/actions/workflows/ci-tsdownsample.yml)
[![Discord](https://img.shields.io/badge/Discord-%235865F2.svg?logo=discord&logoColor=white)](https://discord.gg/k2d59GrxPX)

Extremely fast **time series downsampling 📈** for visualization, written in Rust.

## Features ✨

- **Fast**: written in Rust with PyO3 bindings
  - leverages the optimized [argminmax](https://github.com/jvdd/argminmax) library - which is SIMD accelerated with runtime feature detection
  - scales linearly with the number of data points
  - multithreaded with Rayon (in Rust)

    *Why we do not use Python multiprocessing* - citing the PyO3 docs on parallelism:

    > CPython has the infamous Global Interpreter Lock, which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for CPU-bound tasks and often forces developers to accept the overhead of multiprocessing.

    In Rust - which is a compiled language - there is no GIL, so CPU-bound tasks can be parallelized (with Rayon) with little to no overhead.

- **Efficient**: memory efficient
  - works on views of the data (no copies)
  - no intermediate data structures are created
- **Flexible**: works on any type of data
  - supported datatypes are:
    - for `x`: `f32`, `f64`, `i16`, `i32`, `i64`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`
    - for `y`: `f16`, `f32`, `f64`, `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `datetime64`, `timedelta64`, `bool`

  🚀 **`f16` argminmax is 200-300x faster than NumPy!**
  In contrast with all other data types above, `f16` is *not* hardware supported by most modern CPUs (i.e., there are no native `f16` instructions).

  🐌 Programming languages typically support this datatype by either (i) upcasting to `f32` or (ii) using a software implementation.

  💡 Since argminmax only needs comparisons - and thus no arithmetic operations - a symmetrical ordinal mapping from `f16` to `i16` is sufficient. This mapping makes it possible to use the hardware-supported scalar and SIMD `i16` instructions, while not producing any memory overhead 🎉 A sketch of this mapping is shown below the feature list.

  More details are described in argminmax PR #1.

- **Easy to use**: simple & flexible API
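
The following NumPy snippet sketches the ordinal-mapping idea mentioned above. It is an illustrative reimplementation for clarity only - the actual mapping lives in the Rust [argminmax](https://github.com/jvdd/argminmax) code and may differ in detail:

```python
import numpy as np

def f16_to_i16_ordinal(arr: np.ndarray) -> np.ndarray:
    """Map f16 bit patterns to i16 values that preserve the float ordering."""
    bits = arr.view(np.int16)  # reinterpret the f16 bits as i16 (no copy)
    # Positive floats already order correctly when read as i16;
    # for negative floats (sign bit set) the lower 15 bits must be flipped.
    return np.where(bits < 0, bits ^ np.int16(0x7FFF), bits)

y = np.random.randn(1_000).astype(np.float16)
# Because the mapping is monotone, argmin/argmax on the mapped i16
# values yield the same indices as on the original f16 values.
assert np.argmin(f16_to_i16_ordinal(y)) == np.argmin(y)
assert np.argmax(f16_to_i16_ordinal(y)) == np.argmax(y)
```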

## Install

```bash
pip install tsdownsample
```

## Usage

```python
from tsdownsample import MinMaxLTTBDownsampler
import numpy as np

# Create a time series
y = np.random.randn(10_000_000)
x = np.arange(len(y))

# Downsample to 1000 points (assuming constant sampling rate)
s_ds = MinMaxLTTBDownsampler().downsample(y, n_out=1000)

# Select downsampled data
downsampled_y = y[s_ds]

# Downsample to 1000 points using the (possible irregularly spaced) x-data
s_ds = MinMaxLTTBDownsampler().downsample(x, y, n_out=1000)

# Select downsampled data
downsampled_x = x[s_ds]
downsampled_y = y[s_ds]
```
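
As noted in the feature list, `datetime64` is among the supported `x` datatypes, so the x-based call also works with timestamps. An illustrative variation of the example above:

```python
import numpy as np
from tsdownsample import MinMaxLTTBDownsampler

y = np.random.randn(10_000_000)
# One sample per second, starting from an arbitrary timestamp
x = np.datetime64("2020-01-01") + np.arange(len(y)).astype("timedelta64[s]")

s_ds = MinMaxLTTBDownsampler().downsample(x, y, n_out=1000)
downsampled_x, downsampled_y = x[s_ds], y[s_ds]
```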

## Downsampling algorithms & API

### Downsampling API 📑

Each downsampling algorithm is implemented as a class that implements a `downsample` method.
The signature of the `downsample` method:

```
downsample([x], y, n_out, **kwargs) -> ndarray[uint64]
```

**Arguments**:

- `x` is optional
- `x` and `y` are both positional arguments
- `n_out` is a mandatory keyword argument that defines the number of output values*
- `**kwargs` are optional keyword arguments *(see [table below](#downsampling-algorithms-📈))*:
  - `parallel`: whether to use multi-threading (default: `False`)
    ❗ The max number of threads can be configured with the `TSDOWNSAMPLE_MAX_THREADS` ENV var (e.g. `os.environ["TSDOWNSAMPLE_MAX_THREADS"] = "4"`) - see the example below
  - ...

**Returns**: a `ndarray[uint64]` of indices that can be used to index the original data.

\*When there are gaps in the time series, fewer than `n_out` indices may be returned.
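
A minimal sketch of using these keyword arguments (the `parallel` flag and the `TSDOWNSAMPLE_MAX_THREADS` ENV var are documented above; the data is just illustrative):

```python
import os
import numpy as np
from tsdownsample import MinMaxDownsampler

# Cap the number of threads used when parallel=True
os.environ["TSDOWNSAMPLE_MAX_THREADS"] = "4"

y = np.random.randn(10_000_000)
s_ds = MinMaxDownsampler().downsample(y, n_out=2000, parallel=True)
print(s_ds.dtype, len(s_ds))  # uint64, at most 2000 indices
```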

### Downsampling algorithms 📈

The following downsampling algorithms (classes) are implemented:

| Downsampler | Description | `**kwargs` |
| ---:| --- |--- |
| `MinMaxDownsampler` | selects the **min and max** value in each bin | `parallel` |
| `M4Downsampler` | selects the [**min, max, first and last**](https://dl.acm.org/doi/pdf/10.14778/2732951.2732953) value in each bin | `parallel` |
| `LTTBDownsampler` | performs the [**Largest Triangle Three Buckets**](https://skemman.is/bitstream/1946/15343/3/SS_MSthesis.pdf) algorithm | `parallel` |
| `MinMaxLTTBDownsampler` | (*new two-step algorithm 🎉*) first selects `n_out` * `minmax_ratio` **min and max** values, then further reduces these to `n_out` values using the **Largest Triangle Three Buckets** algorithm | `parallel`, `minmax_ratio`* |

*Default value for `minmax_ratio` is 4, which is empirically proven to be a good default. More details here: https://arxiv.org/abs/2305.00332
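
To illustrate the two-step idea, a hedged sketch comparing plain LTTB with MinMaxLTTB (class names and kwargs as listed in the table above; the data is illustrative):

```python
import numpy as np
from tsdownsample import LTTBDownsampler, MinMaxLTTBDownsampler

y = np.random.randn(5_000_000)

# Plain LTTB on the full series
s_lttb = LTTBDownsampler().downsample(y, n_out=1000)

# MinMaxLTTB: first preselect n_out * minmax_ratio min/max values,
# then reduce those to n_out values with LTTB
s_mmlttb = MinMaxLTTBDownsampler().downsample(y, n_out=1000, minmax_ratio=4, parallel=True)
```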

### Handling NaNs

This library supports two `NaN`-policies:

1. Omit `NaN`s (`NaN`s are ignored during downsampling).
2. Return the index of the first `NaN` in each bin that contains at least one `NaN`.

| Omit `NaN`s | Return `NaN`s |
| ----------------------: | :------------------------- |
| `MinMaxDownsampler` | `NaNMinMaxDownsampler` |
| `M4Downsampler` | `NaNM4Downsampler` |
| `MinMaxLTTBDownsampler` | `NaNMinMaxLTTBDownsampler` |
| `LTTBDownsampler` | |

> Note that NaNs are not supported for `x`-data.
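
A minimal sketch contrasting the two policies, using one of the class pairs from the table above (the output comments reflect the documented behaviour):

```python
import numpy as np
from tsdownsample import MinMaxDownsampler, NaNMinMaxDownsampler

y = np.random.randn(1_000_000)
y[::1_000] = np.nan  # sprinkle NaNs into the y-data

s_omit = MinMaxDownsampler().downsample(y, n_out=1000)    # NaNs are ignored
s_nan = NaNMinMaxDownsampler().downsample(y, n_out=1000)  # bins with a NaN yield the first NaN's index

print(np.isnan(y[s_omit]).any())  # expected: False
print(np.isnan(y[s_nan]).any())   # expected: True
```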

## Limitations & assumptions 🚨

Assumes:

1. the `x`-data is (non-strictly) monotonically increasing (i.e., sorted)
2. there are no `NaN`s in the `x`-data
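
For numeric `x`-data, a small (hypothetical) helper to verify these assumptions before downsampling could look like this:

```python
import numpy as np

def validate_x(x: np.ndarray) -> None:
    """Check the assumptions above for a numeric x array (hypothetical helper)."""
    if np.isnan(x).any():
        raise ValueError("x-data must not contain NaNs")
    if not np.all(np.diff(x) >= 0):
        raise ValueError("x-data must be (non-strictly) monotonically increasing")

x = np.arange(1_000_000, dtype=np.float64)
validate_x(x)  # passes silently for sorted, NaN-free data
```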

---


👤 Jeroen Van Der Donckt