Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ekzhu/datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
- Host: GitHub
- URL: https://github.com/ekzhu/datasketch
- Owner: ekzhu
- License: mit
- Created: 2015-03-20T01:21:46.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2024-06-04T00:43:43.000Z (8 months ago)
- Last Synced: 2025-01-14T00:08:07.028Z (13 days ago)
- Topics: data-sketches, data-summary, hnsw, hyperloglog, jaccard-similarity, locality-sensitive-hashing, lsh, lsh-ensemble, lsh-forest, minhash, python, search, top-k, weighted-quantiles
- Language: Python
- Homepage: https://ekzhu.github.io/datasketch
- Size: 5.68 MB
- Stars: 2,633
- Watchers: 49
- Forks: 296
- Open Issues: 54
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-LLM-resourses - datasketch
- best-of-python - GitHub - 30% open · ⏱️ 26.03.2024 (Data Containers & Dataframes)
- awesome-list - datasketch - Gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy. (Data Processing / Data Management)
- awesome-python-machine-learning-resources - GitHub - 25% open · ⏱️ 19.08.2022 (Data Containers and Structures)
README
datasketch: Big Data Looks Small
================================

.. image:: https://static.pepy.tech/badge/datasketch/month
    :target: https://pepy.tech/project/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.598238.svg
    :target: https://zenodo.org/doi/10.5281/zenodo.598238

datasketch gives you probabilistic data structures that can process and
search very large amounts of data super fast, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+
| `HNSW`_                   | Any                         | Custom Metric Top-K    |
+---------------------------+-----------------------------+------------------------+

datasketch must be used with Python 3.7 or above, NumPy 1.11 or above, and SciPy.
Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and Cassandra
storage layers (see `MinHash LSH at Scale`_).

Install
-------

To install datasketch using ``pip``:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

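To make the "estimate Jaccard similarity" entry in the data sketch table more concrete, the following is a minimal, standard-library-only sketch of the idea behind MinHash. It is an illustration under stated assumptions, not datasketch's implementation or API: the helper names (``seeded_hash``, ``minhash_signature``, ``estimate_jaccard``, ``exact_jaccard``), the use of salted BLAKE2b as the hash family, and the choice of 256 hash functions are all hypothetical choices made for this example.

```python
import hashlib


def seeded_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token; the seed selects the hash function.

    Assumption for this sketch: salted BLAKE2b stands in for a family of
    independent hash functions.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes: int = 256):
    """For each hash function, keep the smallest hash value over all tokens.

    `tokens` must be non-empty.
    """
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


set_a = {"minhash", "is", "a", "probabilistic", "data", "structure",
         "for", "estimating", "the", "similarity", "between", "datasets"}
set_b = {"minhash", "is", "a", "probability", "data", "structure",
         "for", "estimating", "the", "similarity", "between", "documents"}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
print("exact Jaccard:    ", exact_jaccard(set_a, set_b))
```

Each signature has a fixed length regardless of how large the input set is, which is why sketches of this kind can be compared and indexed cheaply. Using more hash functions reduces the variance of the estimate at the cost of a longer signature.
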
.. _`MinHash`: https://ekzhu.github.io/datasketch/minhash.html
.. _`Weighted MinHash`: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _`HyperLogLog`: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _`HyperLogLog++`: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _`MinHash LSH`: https://ekzhu.github.io/datasketch/lsh.html
.. _`MinHash LSH Forest`: https://ekzhu.github.io/datasketch/lshforest.html
.. _`MinHash LSH Ensemble`: https://ekzhu.github.io/datasketch/lshensemble.html
.. _`Minhash LSH at Scale`: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
.. _`HNSW`: https://ekzhu.github.io/datasketch/documentation.html#hnsw