{"id":24143860,"url":"https://github.com/dnbaker/sketch","last_synced_at":"2025-07-18T22:08:02.331Z","repository":{"id":39614473,"uuid":"74399506","full_name":"dnbaker/sketch","owner":"dnbaker","description":"C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings","archived":false,"fork":false,"pushed_at":"2024-07-23T23:39:01.000Z","size":4648,"stargazers_count":155,"open_issues_count":6,"forks_count":14,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-06-28T05:34:49.716Z","etag":null,"topics":["bloom-filter","count-min-sketch","hll","hyperloglog","minhash","sketch-data-structures"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnbaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-21T19:41:32.000Z","updated_at":"2025-06-21T11:32:10.000Z","dependencies_parsed_at":"2023-01-25T14:46:36.062Z","dependency_job_id":"59d091da-a617-4232-a4b0-7cf28977f274","html_url":"https://github.com/dnbaker/sketch","commit_stats":null,"previous_names":[],"tags_count":34,"template":false,"template_full_name":null,"purl":"pkg:github/dnbaker/sketch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fsketch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fsketch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fsketch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2
Fsketch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnbaker","download_url":"https://codeload.github.com/dnbaker/sketch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fsketch/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265845035,"owners_count":23837708,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bloom-filter","count-min-sketch","hll","hyperloglog","minhash","sketch-data-structures"],"created_at":"2025-01-12T05:45:43.777Z","updated_at":"2025-07-18T22:08:02.312Z","avatar_url":"https://github.com/dnbaker.png","language":"C++","readme":"# sketch [![Build Status](https://travis-ci.com/dnbaker/sketch.svg?branch=master)](https://travis-ci.com/dnbaker/sketch) [![Documentation Status](https://readthedocs.org/projects/sketch/badge/?version=latest)](https://sketch.readthedocs.io/en/latest/?badge=latest)\nsketch is a generic, header-only library providing implementations of a variety of sketch data structures for scalable and streaming applications.\nAll have been accelerated with SIMD parallelism where possible, most are composable, and many are threadsafe unless `-DNOT_THREADSAFE` is passed as a compilation flag.\n\n\n## Python documentation\n\nDocumentation for the Python interface is available [here](https://sketch.readthedocs.io/en/latest/).\n\n## Dependencies\n\nWe directly include blaze-lib, libpopcnt, [compact_vector](https://github.com/gmarcais/compact_vector), ska::flat\\_hash\\_map, and xxHash for various utilities.\nWe also have 
two submodules:\n\n* pybind11, only used for python bindings.\n* SLEEF for vectorized math, incorporated with vec.h. It's optionally used (disabled by defining `-DNO_SLEEF=1/#define NO_SLEEF 1`) and only applicable to using rnla.h through blaze-lib.\n\nYou can ignore both for most use cases.\n\n## Contents\n1. HyperLogLog Implementation [hll.h]\n    1. `hll_t`/`hllbase_t\u003cHashStruct\u003e`\n    2. Estimates the cardinality of a set using log(log(cardinality)) bits.\n    3. Threadsafe unless `-DNOT_THREADSAFE` is passed.\n    4. Currently, `hll` is the only structure for which python bindings are available, but we intend to extend this in the future.\n2. HyperBitBit [hbb.h]\n    1. Better per-bit accuracy than HyperLogLogs, but, at least currently, limited to 128 bits/16 bytes in sketch size.\n3. Bloom Filter [bf.h]\n    1. `bf_t`/`bfbase_t\u003cHashStruct\u003e`\n    2. Naive bloom filter\n    3. Currently *not* threadsafe.\n4. Count-Min and Count Sketches\n    1. ccm.h: `ccmbase_t\u003cUpdatePolicy=Increment\u003e`/`ccm_t` (use `pccm_t` for Approximate Counting or `cs_t` for a count sketch).\n    2. The Count sketch is threadsafe if `-DNOT_THREADSAFE` is not passed or if an atomic container is used. Count-Min sketches are currently not threadsafe due to the use of minimal updates.\n    3. Count-min sketches can support concept drift if `realccm_t` from mult.h is used.\n5. MinHash sketches\n    1. mh.h (`RangeMinHash` is the currently verified implementation.) We recommend you build the sketch and then convert to a linear container (e.g., a `std::vector`) using `to_container\u003cContainerType\u003e()` or `.finalize()` for faster comparisons.\n        1. BottomKHasher is an alternative that uses more space to reduce runtime; calling `.finalize()` on it yields the same structure.\n    2. 
CountingRangeMinHash performs the same operations as RangeMinHash, but provides multiplicities, which facilitates `histogram_similarity`, a generalization of Jaccard with multiplicities.\n    3. Both CountingRangeMinHash and RangeMinHash can be finalized into containers for fast comparisons with `.finalize()`.\n    4. A draft HyperMinHash implementation is available as well, but it has not been thoroughly vetted.\n    5. Range MinHash implementations are *not* threadsafe.\n    6. HyperMinHash implementation is threa\n6. B-Bit MinHash\n    1. bbmh.h\n    2. One-permutation (partition) bbit minhash\n        1. Threadsafe, bit-packed and fully SIMD-accelerated\n        2. Power-of-two partition counts are supported in BBitMinHasher, which is finalized into a FinalBBitMinHash sketch. This is faster than the alternative.\n        3. We also support arbitrary divisions using fastmod64 with DivBBitMinHasher and its corresponding final sketch, FinalDivBBitMinHash.\n    3. One-permutation counting bbit minhash\n        1. Not threadsafe.\n7. ModHash sketches\n    1. mod.h\n    2. Estimates both containment and Jaccard index, but takes (potentially) unbounded space.\n    3. This returns a FinalRMinHash sketch, reusing the infrastructure for minhash sketches,\n       but which calculates Jaccard index and containment knowing that it was generated via mod, not min.\n8. HeavyKeeper\n    1. hk.h\n    2. Reference: https://www.usenix.org/conference/atc18/presentation/gong\n    3. A seemingly unilateral improvement over count-min sketches.\n        1. One drawback is the inability to delete items, which makes it unsuitable for sliding windows.\n        2. It shares this characteristic with the Count-Min sketch with conservative update and the Count-Min Mean sketch.\n9. ntcard\n    1. mult.h\n    2. Threadsafe\n    3. Reference: https://www.ncbi.nlm.nih.gov/pubmed/28453674\n    4. Not SIMD-accelerated, but also general, supporting any arbitrary coverage level\n10. PCSA\n    1. 
pc.h\n    2. The HLL is more performant and better-optimized, but this is included for completeness.\n    3. Not threadsafe.\n    4. Reference: https://dl.acm.org/doi/10.1016/0022-0000%2885%2990041-8\n       Journal of Computer and System Sciences.\n       September 1985 https://doi.org/10.1016/0022-0000(85)90041-8\n11. SetSketch\n    1. See setsketch.h for continuous and discretized versions of the SetSketch.\n    2. This also includes parameter-setting code.\n\n### Test case\nTo build and run the hll test case:\n\n```bash\nmake test \u0026\u0026 ./test\n```\n\n### Example\nTo use as a header-only library:\n\n```c++\n#include \u003ccstdint\u003e\n#include \u003ccstdio\u003e\n#include \u003csketch/hll.h\u003e\nusing namespace sketch;\nint main() {\n    hll::hll_t hll(14); // Use 2**14 bytes for this structure\n    // Add hashed values for each element to the structure.\n    for(uint64_t i(0); i \u003c 10000000ull; ++i) hll.addh(i);\n    std::fprintf(stderr, \"Elements estimated: %lf. Error bounds: %lf.\\n\", hll.report(), hll.est_err());\n}\n```\n\n\nThe other structures work with a similar interface. See the type constructors for more information or view [10xdash](https://github.com/dnbaker/10xdash) for examples on using the\nsame interface for a variety of data structures.\n\nSimply `#include sketch/\u003cheader_name\u003e`, or, for a single include, `#include \u003csketch/sketch.h\u003e`,\nwhich allows you to write `sketch::bf_t` and `sketch::hll_t` without the subnamespaces.\n\nWe use inline namespaces for individual types of sketches, e.g., `sketch::minhash` or `sketch::hll` can be used for clarity, or `sketch::hll_t` can be used, omitting the `hll` namespace.\n\n### OSX Installation\nClang on OSX may fail to compile in AVX512 mode. 
We recommend using homebrew's gcc:\n\n```\nbrew upgrade gcc || brew install gcc\n```\nand either setting the environment variables CXX and CC to the most recent g++/gcc or providing them as Makefile arguments.\nAt the time of writing, this is `g++-10` and `gcc-10`.\n\n### Multithreading\nBy default, updates to the HyperLogLog structure occur using atomic operations, though threading should be handled by the calling code. To disable atomic updates, pass the flag `-DNOT_THREADSAFE`. The cost of atomic updates is relatively minor, but in single-threaded situations passing the flag is preferable.\n\n## Python bindings\nPython bindings are available via pybind11. Simply `cd python \u0026\u0026 python setup.py install`.\n\nThe package has been renamed to `sketch_ds` as of v0.19.\n\nUtilities include:\n    1. Sketching/serialization for sketch data structures\n        1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll\n    2. shs\\_isz, which computes the intersection size of sorted hash sets.\n        1. Supported: {uint,int}{32,64}, float32, float64\n    3. fastmod/fastdiv, which uses the fast modulo reduction to do faster division/mod than numpy.\n        1. Supported: {uint,int}{32,64}\n    4. matrix generation functions - taking a list of sketches and creating the similarity function matrix.\n        1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll\n        2. Types: \"jaccard_matrix\", \"intersection_matrix\", \"containment_matrix\", \"union_size_matrix\", \"symmetric_containment_matrix\"\n        3. Returns a compressed distance matrix.\n    5. ccount\\_eq, pcount\\_eq compute the number of identical registers between integral registers.\n        1. Inspired by cdist and pdist from scipy.spatial.distance\n        2. ccount\\_eq computes the number of identical registers between all pairs of rows between two matrices A and B.\n            1. 
Size of returned matrix: (A.shape[0], B.shape[0])\n        3. pcount\\_eq computes the number of identical registers between all pairs of rows in a single matrix A.\n            1. Size of returned matrix: (A.shape[0] * (A.shape[0] - 1)) / 2\n        4. pcount\\_eq output can be transformed from similarities to distances via -np.log(distmat / A.shape[1]).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fsketch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnbaker%2Fsketch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fsketch/lists"}