{"id":13471583,"url":"https://github.com/tdunning/t-digest","last_synced_at":"2025-05-13T20:05:52.979Z","repository":{"id":11939553,"uuid":"14509169","full_name":"tdunning/t-digest","owner":"tdunning","description":"A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means","archived":false,"fork":false,"pushed_at":"2025-02-17T07:19:38.000Z","size":43585,"stargazers_count":2058,"open_issues_count":43,"forks_count":228,"subscribers_count":65,"default_branch":"main","last_synced_at":"2025-05-06T19:52:09.329Z","etag":null,"topics":["accuracy","online-algorithms","quantile","t-digest"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tdunning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2013-11-19T00:07:15.000Z","updated_at":"2025-05-06T18:30:40.000Z","dependencies_parsed_at":"2024-06-18T18:38:46.845Z","dependency_job_id":"61a7e252-5a76-43eb-90de-10a7e529decb","html_url":"https://github.com/tdunning/t-digest","commit_stats":{"total_commits":354,"total_committers":44,"mean_commits":8.045454545454545,"dds":0.5480225988700564,"last_synced_commit":"e1f4e3a89d947d4c3aed9f6287040c1b5729d265"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdunning%2Ft-digest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdunning%2Ft-digest/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/tdunning%2Ft-digest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdunning%2Ft-digest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tdunning","download_url":"https://codeload.github.com/tdunning/t-digest/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253112072,"owners_count":21856070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accuracy","online-algorithms","quantile","t-digest"],"created_at":"2024-07-31T16:00:46.931Z","updated_at":"2025-05-13T20:05:52.952Z","avatar_url":"https://github.com/tdunning.png","language":"Java","funding_links":[],"categories":["Java","By Language"],"sub_categories":["Java"],"readme":"t-digest  \u0026middot;  [![Java CI](https://github.com/tdunning/t-digest/actions/workflows/maven.yml/badge.svg?branch=main)](https://github.com/tdunning/t-digest/actions/workflows/maven.yml)\n========\n\nA new data structure for accurate online accumulation of rank-based statistics such as quantiles\nand trimmed means.  The t-digest algorithm is also very friendly to parallel programs making it \nuseful in map-reduce and parallel streaming applications implemented using, say, Apache Spark.\n\nThe t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a\nvery compact data structure that allows accurate estimation of quantiles.  
This t-digest data \nstructure can be used to estimate quantiles, compute other rank statistics or even to estimate\nrelated measures like trimmed means.  The advantage of the t-digest over previous digests for \nthis purpose is that the _t_-digest handles data with full floating point resolution.  With small\nchanges, the _t_-digest can handle values from any ordered set for which we can compute something akin to a mean.\nQuantile estimates produced by t-digests can be orders of magnitude more accurate than\nthose produced by alternative digest algorithms in spite of the fact that t-digests are much more \ncompact, particularly when serialized.\n\nIn summary, the particularly interesting characteristics of the t-digest are that it\n\n* has smaller summaries when serialized\n* works on double precision floating point as well as integers\n* provides part per million accuracy for extreme quantiles and typically \u003c1000 ppm accuracy for middle quantiles\n* is very fast (~ 140 ns per add)\n* is very simple (~ 5000 lines of code total, \u003c1000 for the most advanced implementation alone)\n* has a reference implementation that has \u003e 90% test coverage\n* can be used with map-reduce very easily because digests can be merged\n* requires no dynamic allocation after initial creation (`MergingDigest` only)\n* has no runtime dependencies\n\nRecent News\n-----------\nThere is a [new article (open access!)](https://www.sciencedirect.com/science/article/pii/S2665963820300403) in Software Impacts on \nthe t-digest, focussed particularly on this reference implementation.\n \nLots has happened in t-digest lately. Most recently, with the help of people\nposting their observations of subtle misbehavior over the last 2 years, I figured\nout that the sort in the `MergingDigest` really needed to be stable. This helps\nparticularly with repeated values. 
Stabilizing the sort appears to have no \nnegative impact on accuracy and no significant change in speed, but testing is \ncontinuing. As part of introducing this change to the sort, I made the core \nimplementation pickier about enforcing the size invariants, which forced updates\nto a number of tests.\n\nThe basic gist of other recent changes is that the core algorithms have been \nmade much more rigorous and the associated papers in the docs directory have \nbeen updated to match the reality of the most advanced implementations. \nThe general areas of improvement include substantial speedups, a new \nframework for dealing with scale functions, real proofs of size bounds \nand invariants for all current scale functions, much improved interpolation \nalgorithms, better accuracy testing and splitting the entire distribution \ninto parts for the core algorithms, quality testing, benchmarking and \ndocumentation.\n\nI am working on a 4.0 release that incorporates all of these\nimprovements. The remaining punch list for the release is roughly:\n\n* ~~verify all tests are clean and not disabled~~  (done!)\n* ~~integrate all scale functions into AVLTreeDigest~~ (done!)\n* describe accuracy using the quality suite\n* extend benchmarks to include `AVLTreeDigest` as a first-class alternative\n* measure merging performance\n* consider [issue #87](https://github.com/tdunning/t-digest/issues/87)\n* review all outstanding issues (add unit tests if necessary or close if not)\n\nPublication work is now centered around comparisons with the KLL digest \n(spoiler, the t-digest is much smaller and possibly 2 orders of\nmagnitude more accurate than KLL). Potential \nco-authors who could accelerate these submissions are encouraged to \nspeak up! 
In the meantime, an \n[archived pre-print of the main paper is available](https://arxiv.org/abs/1902.04023).\n\nIn research areas, there are some ideas being thrown around about how to bring\nstrict guarantees similar to the GK or KLL algorithms to the t-digest. There is\nsome promise here, but nothing real yet. If you are interested in a research \nproject, this could be an interesting one. \n \n### Scale Functions\n\nThe idea of scale functions is the heart of the t-digest. But things\ndon't quite work the way that we originally thought. Originally, it\nwas presumed that accuracy should be proportional to the square of the\nsize of a cluster. That isn't true in practice. That means that scale\nfunctions need to be much more aggressive about controlling cluster\nsizes near the tails. We now have 4 scale functions supported for both\nmajor digest forms (`MergingDigest` and `AVLTreeDigest`) to allow\ndifferent trade-offs in terms of accuracy.\n\nThese scale functions now have associated proofs that they all\n[preserve the key invariants](https://github.com/tdunning/t-digest/blob/master/docs/proofs/invariant-preservation.pdf) \nnecessary to build an accurate digest and that they all give\n[tight bounds on the size of a digest](https://github.com/tdunning/t-digest/blob/master/docs/proofs/sizing.pdf).\nHaving new scale functions means that we can get much better tail \naccuracy than before without losing much in terms of median accuracy. \nIt also means that insertion into a `MergingDigest` is faster than \nbefore since we have been able to eliminate all fancy functions like \nsqrt, log or sin from the critical path (although sqrt _is_ faster \nthan you might think).\n\nThere are also suggestions that asymmetric scale functions would be useful.\nThese would allow good single-tailed accuracy with (slightly) smaller digests. 
\nA paper has been submitted on this by the developer who came up with the idea,\nand feedback from users about the utility of such scale functions would be \nwelcome. \n \n### Better Interpolation\n\nThe better accuracy achieved by the new scale functions partly comes\nfrom the fact that the most extreme clusters near _q_=0 or _q_=1 are\nlimited to only a single sample. Handling these singletons well makes\na huge difference in the accuracy of tail estimates. Handling the\ntransition to non-singletons is also very important.\n  \nBoth cases are handled much better than before.\n\nThe better interpolation has been fully integrated and tested in both\nthe `MergingDigest` and `AVLTreeDigest` with very good improvements in\naccuracy. The bug detected in the `AVLTreeDigest` that affected data\nwith many repeated values has also been fixed.\n  \n### Two-level Merging\n\nWe now have a trick for the `MergingDigest` that uses a higher value\nof the compression parameter (delta) while we are accumulating a\nt-digest and a lower value when we are about to store or display a\nt-digest.  This two-level merging has a small (negative) effect on\nspeed, but a substantial (positive) effect on accuracy because\nclusters are ordered more strictly. This better ordering of clusters\nmeans that the effects of the improved interpolation are much easier\nto observe.\n\nExtending this to `AVLTreeDigest` is theoretically possible, but it\nisn't clear what effect it would have.\n \n### Repo Reorg\n\nThe t-digest repository is now split into different functional\nareas. 
This is important because it simplifies the code used in\nproduction by extracting the (slow) code that generates data for\naccuracy testing, but also because it lets us avoid any dependencies\non GPL code (notably the jmh benchmarking tools) in the released\nartifacts.\n \nThe major areas are\n \n * core - this is where the t-digest and unit tests live\n * docs - the main paper and auxiliary proofs live here\n * benchmarks - this is the code that tests the speed of the digest algos\n * quality - this is the code that generates and analyzes accuracy information\n \n Within the docs sub-directory, proofs of invariant preservation and size\n bounds are moved to `docs/proofs` and all figures in `docs/t-digest-paper`\n are collected into a single directory to avoid clutter.\n\nLogHistogram and FloatHistogram\n--------------\n\nThis package also has an implementation of `FloatHistogram` which is\nanother way to look at distributions where all measurements are\npositive and where you want relative accuracy in the measurement space\ninstead of accuracy defined in quantiles. This `FloatHistogram` makes\nuse of the floating point hardware to implement variable width bins so\nthat adding data is very fast (5ns/data point in benchmarks) and the\nresulting sketch is small for reasonable accuracy levels. For\ninstance, if you require a dynamic range of a million and are OK with\nbins being about ±10%, then you only need 80 counters.\n\nSince the bins of a `FloatHistogram` are static rather than adaptive,\nthey can be combined very easily. Thus you can store histograms for\nshort periods of time and combine them at query time if you are\nlooking at metrics for your system. You can also reweight histograms\nto avoid errors due to structured omission.\n\nAnother class called `LogHistogram` is also available in\n`t-digest`. 
The `LogHistogram` is very much like the `FloatHistogram`,\nbut it incorporates a clever quadratic update step (thanks to Otmar\nErtl) so that the bucket widths vary more precisely and thus the\nnumber of buckets can be decreased by about 40% while getting the same\naccuracy. This is particularly important when you are maintaining only\nmodest accuracy and want small histograms.\n\nIn the future, I will incorporate some of the interpolation tricks\nfrom the main _t_-digest into the `LogHistogram` implementation.\n\n\nCompile and Test\n================\n\nYou have to have Java 1.8 to compile and run this code.  You will also\nneed maven (3+ preferred) to compile and test this software.  In order\nto build the figures that go into the theory paper, you will need R.\nIn order to format the paper, you will need latex.  Pre-built pdf\nversions of all figures and papers are provided so you won't need latex\nif you don't need to make changes to these documents.\n\nOn Ubuntu, you can get the necessary pre-requisites for compiling the \ncode with the following:\n\n    sudo apt-get install  openjdk-8-jdk git maven\n\nOnce you have these installed, use this to build and test the software:\n\n    cd t-digest; mvn test\n\nMost of the very slow tests are in the `quality` module so if you just run\nthe tests in `core` module, you can save considerable time.\n\nTesting Accuracy and Comparing to Q-digest\n================\n\nThe normal test suite produces a number of diagnostics that describe\nthe scaling and accuracy characteristics of t-digests.  In order to\nproduce nice visualizations of these properties, you need to have many more\nsamples.  To get this enhanced view, run the tests in the `quality` module\nby running the full test suite once or, subsequently, by running just the \ntests in the quality sub-directory. \n\n    cd quality; mvn test\n\nThe data from these tests are stored in a variety of data files in the\n`quality` directory.  
Some of these files are quite large.\n\nI have prepared [detailed instructions on producing all of the figures](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/figure-doc.pdf)\nused in the main paper.\n\nMost of these scripts will complete almost instantaneously; one or two\nwill take a few tens of seconds.\n\nThe output of these scripts is a collection of PDF files that can be\nviewed with any suitable viewer such as Preview on a Mac.  Many of\nthese images are used as figures in the \n[main t-digest paper](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf). \n\nImplementations in Other Languages\n=================\nThe t-digest algorithm has been ported to other languages:\n - Python: [tdigest](https://github.com/CamDavidsonPilon/tdigest), [fastdigest](https://github.com/moritzmucha/fastdigest)\n - Go: [github.com/spenczar/tdigest](https://github.com/spenczar/tdigest), [github.com/influxdata/tdigest](https://github.com/influxdata/tdigest)\n - JavaScript: [tdigest](https://github.com/welch/tdigest)\n - C++: [CPP TDigest](https://github.com/gpichot/cpp-tdigest), [FB's Folly Implementation (high performance)](https://github.com/facebook/folly/blob/master/folly/stats/TDigest.h)\n - C++: [TDigest](https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/tdigest.h) as part of [Apache Arrow](https://arrow.apache.org/)\n - CUDA C++: [tdigest.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/quantiles/tdigest/tdigest.cu) as part of `libcudf` in [RAPIDS](https://rapids.ai/) powering the [`approx_percentile` and `percentile_approx`](https://github.com/NVIDIA/spark-rapids/blob/b35311f7c6950fd5d8f7f6ed66aeffa87c480850/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuApproximatePercentile.scala#L123-L130) expressions in Spark SQL with [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) \n - Rust: [t-digest](https://github.com/MnO2/t-digest) and its modified version in [Apache 
Arrow Datafusion](https://github.com/apache/arrow-datafusion/blob/ca952bd33402816dbb1550debb9b8cac3b13e8f2/datafusion-physical-expr/src/tdigest/mod.rs#L19-L28), [tdigests](https://github.com/andylokandy/tdigests)\n - Scala: [TDigest.scala](https://github.com/stripe-archive/brushfire/blob/master/brushfire-training/src/main/scala/com/stripe/brushfire/TDigest.scala)\n - C: [tdigestc (w/ bindings to Go, Java, Python, JS via wasm)](https://github.com/ajwerner/tdigestc)\n - C: [t-digest-c](https://github.com/RedisBloom/t-digest-c) as part of [RedisBloom](https://redisbloom.io/)\n - Clojure: [t-digest for Clojure](https://github.com/henrygarner/t-digest)\n - C#: [t-digest-csharp (.NET Core)](https://github.com/Cyral/t-digest-csharp)\n - Kotlin multiplatform: [tdigest_kotlin_multiplatform](https://github.com/beyondeye/tdigest_kotlin_multiplatform)\n - OCaml: [tdigest](https://github.com/SGrondin/tdigest). Purely functional, can also compile to JS via js_of_ocaml.\n - Redis: [Redis Stack](https://redis.io/docs/data-types/probabilistic/t-digest/) supports t-digest.\n   \nContinuous Integration\n=================\n\nThe t-digest project makes use of Travis integration with GitHub for testing whenever a change is made.\n\nYou can see the reports at:\n\n    https://travis-ci.org/tdunning/t-digest\n\nInstallation\n===============\n\nThe t-digest library jars are released via the [Maven Central Repository](http://repo1.maven.org/maven2/com/tdunning/).\nThe current version is 3.3.\n\n ```xml\n      \u003cdependency\u003e\n          \u003cgroupId\u003ecom.tdunning\u003c/groupId\u003e\n          \u003cartifactId\u003et-digest\u003c/artifactId\u003e\n          \u003cversion\u003e3.3\u003c/version\u003e\n      \u003c/dependency\u003e\n ```     \n      
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdunning%2Ft-digest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftdunning%2Ft-digest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdunning%2Ft-digest/lists"}