{"id":13591696,"url":"https://github.com/google/cuckoo-index","last_synced_at":"2025-04-08T17:32:19.456Z","repository":{"id":45648955,"uuid":"255549123","full_name":"google/cuckoo-index","owner":"google","description":"Cuckoo Index: A Lightweight Secondary Index Structure","archived":true,"fork":false,"pushed_at":"2021-12-02T18:31:31.000Z","size":282,"stargazers_count":130,"open_issues_count":0,"forks_count":17,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-22T00:41:26.954Z","etag":null,"topics":["bitmap-index","cloud-databases","cuckoo-filter","secondary-index"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-14T08:11:28.000Z","updated_at":"2025-03-16T17:56:19.000Z","dependencies_parsed_at":"2022-09-03T23:12:01.429Z","dependency_job_id":null,"html_url":"https://github.com/google/cuckoo-index","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fcuckoo-index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fcuckoo-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fcuckoo-index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fcuckoo-index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google","download_url":"https://codeload.github.com/google/cuckoo-index/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247892666,"owners_count":21013756,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bitmap-index","cloud-databases","cuckoo-filter","secondary-index"],"created_at":"2024-08-01T16:01:00.776Z","updated_at":"2025-04-08T17:32:14.433Z","avatar_url":"https://github.com/google.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"**NOTE** This is not an officially supported Google product.\n\n# Cuckoo Index\n\n## Overview\n\n[Cuckoo Index](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a [Cuckoo filter](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) with compressed bitmaps indicating qualifying partitions.\n\n## What Problem Does It Solve?\n\nThe problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition:\n\n```\nPartition 0:\nA, B =\u003e Bloom filter 0\n\nPartition 1:\nB, C =\u003e Bloom filter 1\n...\n```\n\nTo identify all partitions containing a key, we need to probe all per-partition filters (which could be many). Since a Bloom filter may return false positives, there is a chance (of e.g. 
1%) that we accidentally identify a negative partition as positive. In the above example, a lookup for key A may return Partition 0 (true positive) and 1 (false positive). Depending on the storage medium, a false positive partition can be very expensive (e.g., many milliseconds on disk).\n\nFurthermore, secondary columns typically contain many duplicates (also across partitions). With the per-partition filter design, these duplicates may be indexed in multiple filters (in the worst case, in all filters). In the above example, the key B is redundantly indexed in Bloom filter 0 and 1.\n\nCuckoo Index addresses both of these drawbacks of per-partition filters.\n\n## Features\n\n*   100% correct results for lookups with occurring keys (as opposed to per-partition filters).\n*   Configurable scan rate (ratio of false positive partitions) for lookups with non-occurring keys.\n*   Much smaller footprint size than full-fledged indexes that store full-sized keys.\n*   Smaller footprint size than per-partition filters for low-to-medium cardinality columns.\n\n## Limitations\n\n*   Requires access to all keys at build time.\n*   Relatively high build time (in O(n) but with a high constant factor) compared to e.g. per-partition Bloom filters.\n*   Once built, CI is immutable but fast to query (it uses a [rank support structure](https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf) for efficient rank calls).\n\n## Running Experiments\n\nPrepare a dataset in a CSV format that you are going to use. 
One of the datasets we used was DMV [Vehicle, Snowmobile, and Boat Registrations](https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations).\n\n```\nwget -c https://data.ny.gov/api/views/w4pv-hbkt/rows.csv -O Vehicle__Snowmobile__and_Boat_Registrations.csv\n```\n\nAdd the file to the `data` dependencies in the `BUILD.bazel` file.\n\n```\ndata = [\n    # Put your csv files here\n    \"Vehicle__Snowmobile__and_Boat_Registrations.csv\"\n],\n```\n\nFor footprint experiments, run the following command, specifying the path to the data file, columns to test, and the tests to run.\n\n```\nbazel run -c opt --cxxopt=\"-std=c++17\" :evaluate -- \\\n  --input_csv_path=\"Vehicle__Snowmobile__and_Boat_Registrations.csv\" \\\n  --columns_to_test=\"City,Zip,Color\" \\\n  --test_cases=\"positive_uniform,positive_distinct,positive_zipf,negative,mixed\" \\\n  --output_csv_path=\"results.csv\"\n```\n\nFor lookup performance experiments, run the following command, specifying the path to the data file, and columns to test.\n\n**NOTE** You might want to use fewer rows for lookup experiments as the benchmarks are quite time-consuming.\n\n```\nbazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off :lookup_benchmark -- \\\n  --input_csv_path=\"Vehicle__Snowmobile__and_Boat_Registrations.csv\" \\\n  --columns_to_test=\"City,Zip,Color\"\n```\n\n## CMake support\n\n**NOTE** CMake support is community-based. 
The maintainers do not use CMake internally.\n\nFor further information have a look at the [cmake README](cmake/README.md).\n\n## Code Organization\n\n#### Evaluation Framework\n\n*   Evaluate (evaluate.h): *Entry point (binary) into our evaluation framework with instantiations of all indexes.*\n*   Evaluator (evaluator.h): *Evaluation framework.*\n*   Table/Column (data.h): *Integer columns that we run the benchmarks on (string columns are dict-encoded).*\n*   IndexStructure (index_structure.h): *Interface shared among all indexes.*\n\n#### Cuckoo Index\n\n*   CuckooIndex (cuckoo_index.h): *Main class of Cuckoo Index.*\n*   CuckooKicker (cuckoo_kicker.h): *A heuristic that finds a close-to-optimal assignment of keys to buckets (in terms of the ratio of items residing in primary buckets).*\n*   FingerprintStore (fingerprint_store.h): *Stores variable-sized fingerprints in bitpacked format.*\n*   RleBitmap (rle_bitmap.h): *An RLE-based (bitwise, unaligned) bitmap representation (for sparse bitmaps we use position lists).*\n*   BitPackedReader (bit_packing.h): *A helper class for storing \u0026 retrieving bitpacked data.*\n\n## Cite\n\nPlease cite our [VLDB 2020 paper](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) if you use this code in your own work:\n\n```\n@article{cuckoo-index,\nauthor = {Kipf, Andreas and Chromejko, Damian and Hall, Alexander and Boncz, Peter and Andersen, David},\ntitle = {Cuckoo Index: A Lightweight Secondary Index Structure},\nyear = {2020},\nissue_date = {September 2020},\npublisher = {VLDB Endowment},\nvolume = {13},\nnumber = {13},\nissn = {2150-8097},\nurl = {https://doi.org/10.14778/3424573.3424577},\ndoi = {10.14778/3424573.3424577},\njournal = {Proc. 
VLDB Endow.},\nmonth = sep,\npages = {3559-3572},\nnumpages = {14}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fcuckoo-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle%2Fcuckoo-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fcuckoo-index/lists"}