# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)

This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed to compute accurate estimates, such as percentiles, quantiles, and trimmed means, from either streaming or distributed data.
Two t-digests can be added together, which makes the data structure ideal for map-reduce settings, and a digest can be serialized into much less than 10 kB (instead of storing the entire list of data points).

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)

### Installation

*tdigest* is compatible with both Python 2 and Python 3.

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, since 0.15 is the 15th percentile of the Uniform(0, 1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest
sum_digest.percentile(30)  # about 0.3
```

#### Serializing a digest to a dict or JSON

You can use the `to_dict()` method to turn a TDigest object into a standard Python dictionary:

```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```

Or you can get only the list of centroids with `centroids_to_list()`:

```
digest.centroids_to_list()
```

Similarly, you can restore a serialized dictionary of digest values with `update_from_dict()`.
Centroids are merged with any existing ones in the digest. For example, make a fresh digest and restore values from a Python dictionary:

```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```

The `K` and `delta` values are optional; alternatively, you can provide only a list of centroids with `update_centroids_from_list()`:

```
digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```

If you want to serialize with other tools like JSON, you can first convert the digest with `to_dict()`:

```
json.dumps(digest.to_dict())
```

Alternatively, write a custom encoder function that the standard `json` module can use:

```
def encoder(digest_obj):
    return digest_obj.to_dict()
```

Then pass the encoder function as the `default` parameter:

```
json.dumps(digest, default=encoder)
```

### API

`TDigest.`

 - `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
 - `batch_update(x, w=1)`: update the tdigest with the values in array `x`, each with weight `w`.
 - `compress()`: perform a compression on the underlying data structure that shrinks its memory footprint without hurting accuracy. Good to perform after adding many values.
 - `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
 - `cdf(x)`: return the value of the CDF at `x`.
 - `trimmed_mean(p1, p2)`: return the mean of the data set, excluding values below the `p1`th percentile and above the `p2`th percentile.
 - `to_dict()`: return a Python dictionary of the TDigest and its internal Centroid values.
 - `update_from_dict(dict_values)`: update the TDigest object from a serialized dictionary of values.
 - `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
 - `update_centroids_from_list(list_values)`: update Centroids from a Python list.