https://github.com/sgrondin/tdigest
OCaml implementation of the T-Digest algorithm
ocaml statistics tdigest
- Host: GitHub
- URL: https://github.com/sgrondin/tdigest
- Owner: SGrondin
- License: mit
- Created: 2020-06-27T16:56:09.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-02-25T16:12:22.000Z (8 months ago)
- Last Synced: 2024-02-25T17:28:20.049Z (8 months ago)
- Topics: ocaml, statistics, tdigest
- Language: OCaml
- Homepage:
- Size: 104 KB
- Stars: 26
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Tdigest
=======

OCaml implementation of the T-Digest algorithm.
```ocaml
let td =
  Tdigest.create ()
  |> Tdigest.add_list [ 10.0; 11.0; 12.0; 13.0 ]
in

Tdigest.percentiles td [ 0.; 0.25; 0.5; 0.75; 1. ]
(* [ Some 10; Some 10.5; Some 11.5; Some 12.5; Some 13 ] *)

Tdigest.p_ranks td [ 9.; 10.; 11.; 12.; 13.; 14. ]
(* [ Some 0; Some 0.125; Some 0.375; Some 0.625; Some 0.875; Some 1 ] *)
```

The T-Digest is a data structure and algorithm for constructing an approximate distribution for a collection of real numbers presented as a stream.
The T-Digest can estimate percentiles or quantiles extremely accurately even at the tails, while using a fraction of the space of the original data.
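
To build some intuition for the numbers in the example above, here is a toy model written with the standard library only. It is not the library's code: it assumes no compression at all (every sample is its own centroid of weight 1) and assumes that a centroid's rank is the midpoint of its cumulative weight. The names `sorted_of_list`, `percentile` and `p_rank` below are hypothetical helpers for this sketch.

```ocaml
(* Toy model, NOT the Tdigest implementation: each sample is a centroid of
   weight 1, and the centroid at sorted index i sits at cumulative weight
   i + 0.5 (assumed midpoint convention). *)

let sorted_of_list xs =
  let a = Array.of_list xs in
  Array.sort compare a;
  a

(* Value at quantile p (0 <= p <= 1), interpolating between neighbours. *)
let percentile sorted p =
  let n = Array.length sorted in
  if n = 0 then None
  else begin
    let h = p *. float_of_int n in
    if h <= 0.5 then Some sorted.(0)
    else if h >= float_of_int n -. 0.5 then Some sorted.(n - 1)
    else begin
      let i = int_of_float (floor (h -. 0.5)) in
      let frac = h -. 0.5 -. float_of_int i in
      Some (sorted.(i) +. (frac *. (sorted.(i + 1) -. sorted.(i))))
    end
  end

(* Rank of x in [0, 1]: the inverse of [percentile] in this toy model. *)
let p_rank sorted x =
  let n = Array.length sorted in
  if n = 0 then None
  else if x < sorted.(0) then Some 0.0
  else if x > sorted.(n - 1) then Some 1.0
  else begin
    (* Index of the last sample that is <= x. *)
    let i = ref 0 in
    while !i + 1 < n && sorted.(!i + 1) <= x do incr i done;
    let i = !i in
    let lo = sorted.(i) in
    let mid = float_of_int i +. 0.5 in
    if lo = x then Some (mid /. float_of_int n)
    else
      let hi = sorted.(i + 1) in
      Some ((mid +. ((x -. lo) /. (hi -. lo))) /. float_of_int n)
  end

let sorted = sorted_of_list [ 10.0; 11.0; 12.0; 13.0 ]
let quartiles = List.map (percentile sorted) [ 0.; 0.25; 0.5; 0.75; 1. ]
let ranks = List.map (p_rank sorted) [ 9.; 10.; 11.; 12.; 13.; 14. ]
```

Under these assumptions, `quartiles` and `ranks` come out the same as the values shown in the example. A real T-Digest differs in that it merges nearby samples into weighted centroids, which is where the space savings come from.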
A median of medians is not equal to the median of the whole dataset. Percentiles are critical measures, but they are expensive to compute exactly because the entire **sorted** dataset must be present in one place. The T-Digest addresses both problems.
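
The first claim is easy to check by hand. The sketch below uses only the standard library and made-up data: three shards of very different sizes, where combining the per-shard medians gives a wildly different answer from the median of all the values.

```ocaml
(* Exact median of a non-empty list of floats. *)
let median xs =
  let a = Array.of_list xs in
  Array.sort compare a;
  let n = Array.length a in
  if n = 0 then invalid_arg "median: empty list"
  else if n mod 2 = 1 then a.(n / 2)
  else (a.((n / 2) - 1) +. a.(n / 2)) /. 2.0

(* Hypothetical shards, e.g. values collected on three different machines. *)
let shards =
  [ [ 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0 ]; [ 100.0 ]; [ 200.0 ] ]

(* Median of the per-shard medians 4, 100 and 200. *)
let median_of_medians = median (List.map median shards)

(* Median of all nine values taken together. *)
let true_median = median (List.concat shards)
```

Here `median_of_medians` is `100.0` while `true_median` is `5.0`: the per-shard medians throw away how many values each shard held.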
A T-Digest is concatenable, making it a good fit for distributed systems. The internal state of a T-Digest can be exported as a binary string, and the concatenation of any number of those strings can then be imported to form a new T-Digest.
```ocaml
let combined = Tdigest.merge [ td1; td2; td3 ] in
```

A T-Digest's state can be stored in a database `VARCHAR`/`TEXT` column and multiple such states can be merged by concatenating strings:
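```sql
-- Combine multiple states in the database.
-- The empty delimiter keeps the result a plain concatenation
-- (PostgreSQL requires the delimiter argument).
SELECT
  STRING_AGG(M.tdigest_state, '') AS concat_state
FROM my_table AS M;
```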
```ocaml
(* Then load this combined state into a single T-Digest *)
let combined = Tdigest.of_string concat_state in
```

Links:
- [A simple overview of the T-Digest](https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
- [A walkthrough of the algorithm by its creator](https://mapr.com/blog/better-anomaly-detection-t-digest-whiteboard-walkthrough/)
- [The white paper](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)

This library started off as a port of [Will Welch's JavaScript implementation](https://github.com/welch/tdigest), down to the unit tests. However, some modifications have been made to adapt it to OCaml, the most important one being immutability. As such, almost every function in the `Tdigest` module returns a new `Tdigest.t`, including "reading" ones, since they may trigger intermediate computations worth caching.
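
The pattern of a "reading" function that also returns an updated value can be sketched with a toy type. Everything below is hypothetical and uses only the standard library; it is not the `Tdigest` API, only an illustration of why an immutable structure with a cache hands back a new value on reads.

```ocaml
(* Hypothetical toy type: raw samples plus an optional cached sorted copy. *)
type t = { samples : float list; sorted : float array option }

let empty = { samples = []; sorted = None }

(* Writing returns a new value and invalidates the cache. *)
let add x t = { samples = x :: t.samples; sorted = None }

(* Reading may need to sort first. The value returned alongside the result
   carries the cache, so a later read on it does not sort again. *)
let median t =
  let t, arr =
    match t.sorted with
    | Some arr -> (t, arr)
    | None ->
      let arr = Array.of_list t.samples in
      Array.sort compare arr;
      ({ t with sorted = Some arr }, arr)
  in
  let n = Array.length arr in
  let result =
    if n = 0 then None
    else if n mod 2 = 1 then Some arr.(n / 2)
    else Some ((arr.((n / 2) - 1) +. arr.(n / 2)) /. 2.0)
  in
  (t, result)

let t0 = empty |> add 3.0 |> add 1.0 |> add 2.0

(* Keep using t1 afterwards: it holds the cached sorted array. *)
let t1, m = median t0
```

The caller decides whether to keep `t1`. Discarding it is still correct; it only means the intermediate work is repeated on the next read of `t0`.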
## Usage
The API is well documented [here](https://github.com/SGrondin/tdigest/blob/master/src/tdigest.mli).
```sh
opam install tdigest
```

### Marshal
The `Tdigest.t` type cannot be marshalled.
Use the functions in `Tdigest.Marshallable` if your application requires marshalling a T-Digest data structure. Note that `Tdigest.Marshallable.t` is approximately 5 times slower than `Tdigest.t`.
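
For context on what "cannot be marshalled" means in practice: OCaml's standard `Marshal` module rejects some values at run time, closures being the usual example (unless the `Closures` flag is passed). The text above does not say which part of `Tdigest.t` is the obstacle, so the sketch below only demonstrates the general failure mode with the standard library. `can_marshal` is a hypothetical helper, not part of any library.

```ocaml
(* Returns true if the standard Marshal module accepts the value.
   Marshal.to_string raises Invalid_argument on values it cannot
   serialize, such as functions when no flags are given. *)
let can_marshal (v : 'a) : bool =
  match Marshal.to_string v [] with
  | (_ : string) -> true
  | exception Invalid_argument _ -> false

(* Plain data is accepted. *)
let plain_data_ok = can_marshal [ 1.0; 2.0; 3.0 ]

(* A function value is rejected. *)
let closure_ok = can_marshal (fun x -> x + 1)
```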
## Performance
On an ancient 2015 MacBook Pro, this implementation can incorporate 1,000,000 random floating-point numbers in about 770 ms.
Exporting and importing state (`to_string`/`of_string`) is cheap.
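
One way to picture why concatenated states can be imported as a single digest: if a state is a flat sequence of fixed-size centroid records, then joining two states yields another valid sequence that holds both sets of centroids. The sketch below uses a made-up 16-byte record (mean, then weight). The library's real binary format is not described here, so the layout and every name in this block are assumptions for illustration only.

```ocaml
(* Hypothetical toy format, NOT the library's: each centroid is 16 bytes,
   an IEEE-754 mean followed by an IEEE-754 weight, both little-endian. *)
type centroid = { mean : float; weight : float }

let to_string (cs : centroid list) : string =
  let buf = Bytes.create (16 * List.length cs) in
  List.iteri
    (fun i c ->
      Bytes.set_int64_le buf (16 * i) (Int64.bits_of_float c.mean);
      Bytes.set_int64_le buf ((16 * i) + 8) (Int64.bits_of_float c.weight))
    cs;
  Bytes.to_string buf

let of_string (s : string) : centroid list =
  let b = Bytes.of_string s in
  (* Any trailing partial record is ignored in this sketch. *)
  let n = Bytes.length b / 16 in
  List.init n (fun i ->
      { mean = Int64.float_of_bits (Bytes.get_int64_le b (16 * i));
        weight = Int64.float_of_bits (Bytes.get_int64_le b ((16 * i) + 8)) })

let a = [ { mean = 10.5; weight = 2.0 }; { mean = 12.5; weight = 2.0 } ]
let b = [ { mean = 42.0; weight = 1.0 } ]

(* Concatenating two exported states and importing the result. *)
let combined = of_string (to_string a ^ to_string b)
```

In this toy format `combined` holds the centroids of `a` followed by those of `b`. A real importer would also re-sort and re-compress the centroids; that step is left out here.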