{"id":13706010,"url":"https://github.com/hrbrmstr/tdigest","last_synced_at":"2025-03-16T20:31:11.936Z","repository":{"id":34447303,"uuid":"170871921","full_name":"hrbrmstr/tdigest","owner":"hrbrmstr","description":"Wicked Fast, Accurate Quantiles Using 't-Digests'","archived":false,"fork":false,"pushed_at":"2024-06-19T18:37:42.000Z","size":393,"stargazers_count":37,"open_issues_count":2,"forks_count":8,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-02-27T13:18:57.537Z","etag":null,"topics":["quantile","r","rstats","t-digest"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-15T13:48:04.000Z","updated_at":"2024-06-19T18:37:46.000Z","dependencies_parsed_at":"2024-06-20T06:40:05.654Z","dependency_job_id":"d587dd7b-4505-4563-b656-a0761bc4b066","html_url":"https://github.com/hrbrmstr/tdigest","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Ftdigest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Ftdigest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Ftdigest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Ftdigest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.gith
ub.com/hrbrmstr/tdigest/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243830912,"owners_count":20354848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["quantile","r","rstats","t-digest"],"created_at":"2024-08-02T22:00:51.233Z","updated_at":"2025-03-16T20:31:11.468Z","avatar_url":"https://github.com/hrbrmstr.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"\n\n[![Project Status: Active – The project has reached a stable, usable\nstate and is being actively\ndeveloped.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n[![Signed\nby](https://img.shields.io/badge/Keybase-Verified-brightgreen.svg)](https://keybase.io/hrbrmstr)\n![Signed commit\n%](https://img.shields.io/badge/Signed_Commits-3%25-lightgrey.svg)\n\n[![cran\nchecks](https://cranchecks.info/badges/worst/tdigest.png)](https://cranchecks.info/pkgs/tdigest)\n[![CRAN\nstatus](https://www.r-pkg.org/badges/version/tdigest.png)](https://www.r-pkg.org/pkg/tdigest)\n![Minimal R\nVersion](https://img.shields.io/badge/R%3E%3D-3.5.0-blue.svg)\n![License](https://img.shields.io/badge/License-MIT-blue.svg)\n\n# tdigest\n\nWicked Fast, Accurate Quantiles Using ‘t-Digests’\n\n## Description\n\nThe t-Digest construction algorithm uses a variant of 1-dimensional\nk-means clustering to produce a very compact data structure that allows\naccurate estimation of quantiles. 
This t-Digest data structure can be\nused to estimate quantiles, compute other rank statistics or even to\nestimate related measures like trimmed means. The advantage of the\nt-Digest over previous digests for this purpose is that the t-Digest\nhandles data with full floating-point resolution. Quantile\nestimates produced by t-Digests can be orders of magnitude more\naccurate than those produced by previous digest algorithms. Methods are\nprovided to create and update t-Digests and retrieve quantiles from the\naccumulated distributions.\n\nSee [the original paper by Ted Dunning \u0026 Otmar\nErtl](https://arxiv.org/abs/1902.04023) for more details on t-Digests.\n\n## What’s Inside The Tin\n\nThe following functions are implemented:\n\n- `as.list.tdigest`: Serialize a tdigest object to an R list or\n  unserialize a serialized tdigest list back into a tdigest object\n- `td_add`: Add a value to the t-Digest with the specified count\n- `td_create`: Allocate a new histogram\n- `td_merge`: Merge one t-Digest into another\n- `td_quantile_of`: Return the quantile of the value\n- `td_total_count`: Total items contained in the t-Digest\n- `td_value_at`: Return the value at the specified quantile\n- `tquantile`: Calculate sample quantiles from a t-Digest\n\n## Installation\n\n``` r\ninstall.packages(\"tdigest\") # NOTE: CRAN version is 0.4.1\n# or\nremotes::install_gitlab(\"hrbrmstr/tdigest\")\n```\n\nNOTE: To use the ‘remotes’ install options you will need to have the\n[{remotes} package](https://github.com/r-lib/remotes) installed.\n\n## Usage\n\n``` r\nlibrary(tdigest)\n\n# current version\npackageVersion(\"tdigest\")\n## [1] '0.4.2'\n```\n\n### Basic (Low-level interface)\n\n``` r\ntd \u003c- td_create(10)\n\ntd\n## \u003ctdigest; size=0; compression=10; cap=70\u003e\n\ntd_total_count(td)\n## [1] 0\n\ntd_add(td, 0, 1) %\u003e% \n  td_add(10, 1)\n## \u003ctdigest; size=2; compression=10; cap=70\u003e\n\ntd_total_count(td)\n## [1] 2\n\ntd_value_at(td, 0.1) 
== 0\n## [1] TRUE\ntd_value_at(td, 0.5) == 5\n## [1] TRUE\n\nquantile(td)\n## [1]  0  0  5 10 10\n```\n\n#### Bigger (and Vectorised)\n\n``` r\ntd \u003c- tdigest(c(0, 10), 10)\n\nis_tdigest(td)\n## [1] TRUE\n\ntd_value_at(td, 0.1) == 0\n## [1] TRUE\ntd_value_at(td, 0.5) == 5\n## [1] TRUE\n\nset.seed(1492)\nx \u003c- sample(0:100, 1000000, replace = TRUE)\ntd \u003c- tdigest(x, 1000)\n\ntd_total_count(td)\n## [1] 1e+06\n\ntquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))\n##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574\n## [10]  80.3090454  90.2594642  99.4269454 100.0000000\n\nquantile(td)\n## [1]   0.00000  24.74751  49.99666  75.24783 100.00000\n```\n\n#### Serialization\n\nThese \\[de\\]serialization functions make it possible to create \u0026\npopulate a tdigest, serialize it out, read it in at a later time and\ncontinue populating it enabling compact distribution accumulation \u0026\nstorage for large, “continuous” datasets.\n\n``` r\nset.seed(1492)\nx \u003c- sample(0:100, 1000000, replace = TRUE)\ntd \u003c- tdigest(x, 1000)\n\ntquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))\n##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574\n## [10]  80.3090454  90.2594642  99.4269454 100.0000000\n\nstr(in_r \u003c- as.list(td), 1)\n## List of 7\n##  $ compression   : num 1000\n##  $ cap           : int 6010\n##  $ merged_nodes  : int 226\n##  $ unmerged_nodes: int 0\n##  $ merged_count  : num 1e+06\n##  $ unmerged_count: num 0\n##  $ nodes         :List of 2\n##  - attr(*, \"class\")= chr [1:2] \"tdigest_list\" \"list\"\n\ntd2 \u003c- as_tdigest(in_r)\ntquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))\n##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574\n## [10]  80.3090454  90.2594642  
99.4269454 100.0000000\n\nidentical(in_r, as.list(td2))\n## [1] TRUE\n```\n\n#### ALTREP-aware\n\n``` r\nN \u003c- 1000000\nx.altrep \u003c- seq_len(N) # this is an ALTREP in R version \u003e= 3.5.0\n\ntd \u003c- tdigest(x.altrep)\ntd[0.1]\n## [1] 93051\ntd[0.5]\n## [1] 491472.5\nlength(td)\n## [1] 1000000\n```\n\n#### Proof it’s faster\n\n``` r\nmicrobenchmark::microbenchmark(\n  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),\n  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))\n)\n## Unit: microseconds\n##        expr       min        lq        mean     median        uq     max neval\n##     tdigest     3.198     3.731     7.79369     4.4895    12.792    16.4   100\n##  r_quantile 39197.353 39445.444 40069.38938 39584.8030 40062.945 43613.3   100\n```\n\n## tdigest Metrics\n\n| Lang         | \\# Files |  (%) | LoC |  (%) | Blank lines |  (%) | \\# Lines |  (%) |\n|:-------------|---------:|-----:|----:|-----:|------------:|-----:|---------:|-----:|\n| C            |        3 | 0.15 | 499 | 0.36 |          71 | 0.29 |       45 | 0.10 |\n| R            |        6 | 0.30 | 161 | 0.12 |          35 | 0.14 |      156 | 0.34 |\n| C/C++ Header |        1 | 0.05 |  24 | 0.02 |          16 | 0.07 |       30 | 0.06 |\n| SUM          |       10 | 0.50 | 684 | 0.50 |         122 | 0.50 |      231 | 0.50 |\n\n{cloc} 📦 metrics for tdigest\n\n## Code of Conduct\n\nPlease note that this project is released with a Contributor Code of\nConduct. By participating in this project you agree to abide by its\nterms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Ftdigest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Ftdigest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Ftdigest/lists"}