{"id":13532765,"url":"https://github.com/ekzhu/datasketch","last_synced_at":"2025-05-13T21:04:49.349Z","repository":{"id":29028116,"uuid":"32555448","full_name":"ekzhu/datasketch","owner":"ekzhu","description":"MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW","archived":false,"fork":false,"pushed_at":"2024-06-04T00:43:43.000Z","size":5960,"stargazers_count":2681,"open_issues_count":55,"forks_count":296,"subscribers_count":48,"default_branch":"master","last_synced_at":"2025-04-28T11:55:35.783Z","etag":null,"topics":["data-sketches","data-summary","hnsw","hyperloglog","jaccard-similarity","locality-sensitive-hashing","lsh","lsh-ensemble","lsh-forest","minhash","python","search","top-k","weighted-quantiles"],"latest_commit_sha":null,"homepage":"https://ekzhu.github.io/datasketch","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ekzhu.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-03-20T01:21:46.000Z","updated_at":"2025-04-25T21:46:49.000Z","dependencies_parsed_at":"2024-01-03T04:40:57.722Z","dependency_job_id":"a4f09611-1de4-45c3-b3b8-c0978fb3953a","html_url":"https://github.com/ekzhu/datasketch","commit_stats":{"total_commits":225,"total_committers":31,"mean_commits":7.258064516129032,"dds":0.6977777777777778,"last_synced_commit":"1ce3f6926172027d0d0df810dbb52ec3a5232741"},"previous_names":[],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekzhu%2Fdatasketch","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekzhu%2Fdatasketch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekzhu%2Fdatasketch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekzhu%2Fdatasketch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ekzhu","download_url":"https://codeload.github.com/ekzhu/datasketch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251311332,"owners_count":21569008,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-sketches","data-summary","hnsw","hyperloglog","jaccard-similarity","locality-sensitive-hashing","lsh","lsh-ensemble","lsh-forest","minhash","python","search","top-k","weighted-quantiles"],"created_at":"2024-08-01T07:01:13.578Z","updated_at":"2025-04-28T11:56:07.501Z","avatar_url":"https://github.com/ekzhu.png","language":"Python","readme":"datasketch: Big Data Looks Small\n================================\n\n.. image:: https://static.pepy.tech/badge/datasketch/month\n    :target: https://pepy.tech/project/datasketch\n\n.. 
image:: https://zenodo.org/badge/DOI/10.5281/zenodo.598238.svg\n   :target: https://zenodo.org/doi/10.5281/zenodo.598238\n\ndatasketch gives you probabilistic data structures that can process and\nsearch very large amounts of data super fast, with little loss of\naccuracy.\n\nThis package contains the following data sketches:\n\n+-------------------------+-----------------------------------------------+\n| Data Sketch             | Usage                                         |\n+=========================+===============================================+\n| `MinHash`_              | estimate Jaccard similarity and cardinality   |\n+-------------------------+-----------------------------------------------+\n| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog`_          | estimate cardinality                          |\n+-------------------------+-----------------------------------------------+\n| `HyperLogLog++`_        | estimate cardinality                          |\n+-------------------------+-----------------------------------------------+\n\nThe following indexes for data sketches are provided to support\nsub-linear query time:\n\n+---------------------------+-----------------------------+------------------------+\n| Index                     | For Data Sketch             | Supported Query Type   |\n+===========================+=============================+========================+\n| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |\n+---------------------------+-----------------------------+------------------------+\n| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  
|\n+---------------------------+-----------------------------+------------------------+\n| `HNSW`_                   | Any                         | Custom Metric Top-K    |\n+---------------------------+-----------------------------+------------------------+\n\ndatasketch requires Python 3.7 or above, NumPy 1.11 or above, and SciPy.\n\nNote that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and Cassandra\nstorage layers (see `MinHash LSH at Scale`_).\n\nInstall\n-------\n\nTo install datasketch using ``pip``:\n\n::\n\n    pip install datasketch\n\nThis will also install NumPy as a dependency.\n\nTo install with the Redis dependency:\n\n::\n\n    pip install datasketch[redis]\n\nTo install with the Cassandra dependency:\n\n::\n\n    pip install datasketch[cassandra]\n\n\n.. _`MinHash`: https://ekzhu.github.io/datasketch/minhash.html\n.. _`Weighted MinHash`: https://ekzhu.github.io/datasketch/weightedminhash.html\n.. _`HyperLogLog`: https://ekzhu.github.io/datasketch/hyperloglog.html\n.. _`HyperLogLog++`: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus\n.. _`MinHash LSH`: https://ekzhu.github.io/datasketch/lsh.html\n.. _`MinHash LSH Forest`: https://ekzhu.github.io/datasketch/lshforest.html\n.. _`MinHash LSH Ensemble`: https://ekzhu.github.io/datasketch/lshensemble.html\n.. _`MinHash LSH at Scale`: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale\n.. _`HNSW`: https://ekzhu.github.io/datasketch/documentation.html#hnsw\n","funding_links":[],"categories":["数据 Data","Python","Data Processing","数据容器和结构","**Programming (learning)**","Data Containers \u0026 Dataframes"],"sub_categories":["Data Management","**Developer\\'s Tools**"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekzhu%2Fdatasketch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fekzhu%2Fdatasketch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekzhu%2Fdatasketch/lists"}