{"id":13477975,"url":"https://github.com/KenanHanke/rbloom","last_synced_at":"2025-03-27T07:30:35.366Z","repository":{"id":65338761,"uuid":"590172097","full_name":"KenanHanke/rbloom","owner":"KenanHanke","description":"A fast, simple and lightweight Bloom filter library for Python, implemented in Rust.","archived":false,"fork":false,"pushed_at":"2024-10-01T11:57:54.000Z","size":107,"stargazers_count":251,"open_issues_count":3,"forks_count":18,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-27T06:08:27.253Z","etag":null,"topics":["bloom","bloom-filter","python","python-library","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KenanHanke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-01-17T20:08:30.000Z","updated_at":"2025-03-24T00:31:28.000Z","dependencies_parsed_at":"2024-01-16T06:18:15.006Z","dependency_job_id":"0788d29d-a69d-4d4a-b22a-c9b7517cdddd","html_url":"https://github.com/KenanHanke/rbloom","commit_stats":null,"previous_names":["kenbyte/rbloom"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenanHanke%2Frbloom","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenanHanke%2Frbloom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenanHanke%2Frbloom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenanHanke%2Frbloom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KenanHanke","download_url":"https://codeload.github.com/KenanHanke/rbloom/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245791971,"owners_count":20672671,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bloom","bloom-filter","python","python-library","rust"],"created_at":"2024-07-31T16:01:50.754Z","updated_at":"2025-03-27T07:30:35.007Z","avatar_url":"https://github.com/KenanHanke.png","language":"Rust","funding_links":[],"categories":["Rust","Performance \u0026 Caching"],"sub_categories":[],"readme":"# rBloom\n\n[![PyPI](https://img.shields.io/pypi/v/rbloom)](https://pypi.org/project/rbloom/)\n[![license](https://img.shields.io/github/license/KenanHanke/rbloom)](https://github.com/KenanHanke/rbloom/blob/main/LICENSE)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13868469.svg)](https://doi.org/10.5281/zenodo.13868469)\n[![build](https://img.shields.io/github/actions/workflow/status/KenanHanke/rbloom/CI.yml)](https://github.com/KenanHanke/rbloom/actions)\n\nA fast, simple and lightweight\n[Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) library for\nPython, implemented in Rust. It's designed to be as pythonic as possible,\nmimicking the built-in `set` type where it can, and works with any\nhashable object. While it's a new library (this project was started in\n2023), it's currently the fastest option for Python by\na long shot (see the section\n[Benchmarks](#benchmarks)). Releases are published on\n[PyPI](https://pypi.org/project/rbloom/).\n\n## Quickstart\n\nThis library defines only one class, which can be used as follows:\n\n```python\n\u003e\u003e\u003e from rbloom import Bloom\n\u003e\u003e\u003e bf = Bloom(200, 0.01)  # 200 items max, false positive rate of 1%\n\u003e\u003e\u003e bf.add(\"hello\")\n\u003e\u003e\u003e \"hello\" in bf\nTrue\n\u003e\u003e\u003e \"world\" in bf\nFalse\n\u003e\u003e\u003e bf.update([\"hello\", \"world\"])  # \"hello\" and \"world\" now in bf\n\u003e\u003e\u003e other_bf = Bloom(200, 0.01)\n\n### add some items to other_bf\n\n\u003e\u003e\u003e third_bf = bf | other_bf    # third_bf now contains all items in\n                                # bf and other_bf\n\u003e\u003e\u003e third_bf = bf.copy()\n... third_bf.update(other_bf)   # same as above\n\u003e\u003e\u003e bf.issubset(third_bf)    # bf \u003c= third_bf also works\nTrue\n```\n\nFor the full API, see the section [Documentation](#documentation).\n\n## Installation\n\nOn almost all platforms, simply run:\n\n```sh\npip install rbloom\n```\n\nIf you're on an uncommon platform, this may cause pip to build the library\nfrom source, which requires the Rust\n[toolchain](https://www.rust-lang.org/tools/install). You can also build\n`rbloom` by cloning this repository and running\n[maturin](https://github.com/PyO3/maturin):\n\n```sh\nmaturin build --release\n```\n\nThis will create a wheel in the `target/wheels/` directory, which can\nsubsequently also be passed to pip.\n\n## Why rBloom?\n\nWhy should you use this library instead of one of the other\nBloom filter libraries on PyPI?\n\n- **Simple:** Almost all important methods work exactly like their\n  counterparts in the built-in `set`\n  [type](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset).\n- **Fast:** `rbloom` is implemented in Rust, which makes it\n  blazingly fast. See section [Benchmarks](#benchmarks) for more\n  information.\n- **Lightweight:** `rbloom` has no dependencies of its own.\n- **Maintainable:** This library is very concise, and it's written\n  in idiomatic Rust. Even if I were to stop maintaining `rbloom` (which I\n  don't intend to), it would be trivially easy for you to fork it and keep\n  it working for you.\n\nI started `rbloom` because I was looking for a simple Bloom filter\ndependency for a project, and the only sufficiently fast option\n(`pybloomfiltermmap3`) was segfaulting on recent Python versions. `rbloom`\nended up being twice as fast and has grown to encompass a more complete\nAPI (e.g. with set comparisons like `issubset`). Do note that it doesn't\nuse mmapped files, however. This shouldn't be an issue in most cases, as\nthe random access heavy nature of a Bloom filter negates the benefits of\nmmap after very few operations, but it is something to keep in mind for\nedge cases.\n\n## Benchmarks\n\nThe following simple benchmark was implemented in the respective API of\neach library (see the [comparison benchmarks](benchmarks/compare.py)):\n\n```python\nbf = Bloom(10_000_000, 0.01)\n\nfor i in range(10_000_000):\n    bf.add(i + 0.5)  # floats because ints are hashed as themselves\n\nfor i in range(10_000_000):\n    assert i + 0.5 in bf\n```\n\nThis resulted in the following average runtimes on an M1 Pro (confirmed to be proportional to runtimes on an Intel machine):\n\n| Library                                                            | Time    | Notes                                 |\n| ------------------------------------------------------------------ | ------- | ------------------------------------- |\n| [rBloom](https://pypi.org/project/rbloom/)                         | 2.52s   | works out-of-the-box                  |\n| [pybloomfiltermmap3](https://pypi.org/project/pybloomfiltermmap3/) | 4.78s   | unreliable [1]                        |\n| [pybloom3](https://pypi.org/project/pybloom3/)                     | 46.76s  | works out-of-the-box                  |\n| [Flor](https://pypi.org/project/Flor/)                             | 76.94s  | doesn't work on arbitrary objects [2] |\n| [bloom-filter2](https://pypi.org/project/bloom-filter2/)           | 165.54s | doesn't work on arbitrary objects [2] |\n\n[1] The official package failed to install on Python 3.11 and kept segfaulting on 3.10 (Linux, January 2023). It seems to be fine for now (October 2024).\n[2] I was forced to convert to a byte representation, which is bad default behavior as it presents\nthe problems mentioned below in the section \"Cryptographic security\".\n\nAlso note that `rbloom` is compiled against a stable ABI for\nportability, and that you can get a small but measurable speedup by\nremoving the `\"abi3-py37\"` flag from `Cargo.toml` and building\nit yourself.\n\n## Documentation\n\nThis library defines only one class, the signature of which should be\nthought of as follows. Note that only the first few methods differ from\nthe built-in `set` type:\n\n```python\nclass Bloom:\n\n    # expected_items:  max number of items to be added to the filter\n    # false_positive_rate:  max false positive rate of the filter\n    # hash_func:  optional argument, see section \"Cryptographic security\"\n    def __init__(self, expected_items: int, false_positive_rate: float,\n                 hash_func=__builtins__.hash)\n\n    @property\n    def size_in_bits(self) -\u003e int      # number of buckets in the filter\n\n    @property\n    def hash_func(self) -\u003e Callable[[Any], int]   # retrieve the hash_func\n                                                  # given to __init__\n\n    @property\n    def approx_items(self) -\u003e float    # estimated number of items in\n                                       # the filter\n\n    # see section \"Persistence\" for more information on these four methods\n    @classmethod\n    def load(cls, filepath: str, hash_func) -\u003e Bloom\n    def save(self, filepath: str)\n    @classmethod\n    def load_bytes(cls, data: bytes, hash_func) -\u003e Bloom\n    def save_bytes(self) -\u003e bytes\n\n    #####################################################################\n    #                    ALL SUBSEQUENT METHODS ARE                     #\n    #              EQUIVALENT TO THE CORRESPONDING METHODS              #\n    #                     OF THE BUILT-IN SET TYPE                      #\n    #####################################################################\n\n    def add(self, obj)                            # add obj to self\n    def __contains__(self, obj) -\u003e bool           # check if obj in self\n    def __bool__(self) -\u003e bool                    # False if empty\n    def __repr__(self) -\u003e str                     # basic info\n\n    def __or__(self, other: Bloom) -\u003e Bloom       # self | other\n    def __ior__(self, other: Bloom)               # self |= other\n    def __and__(self, other: Bloom) -\u003e Bloom      # self \u0026 other\n    def __iand__(self, other: Bloom)              # self \u0026= other\n\n    # these extend the functionality of __or__, __ior__, __and__, __iand__\n    def union(self, *others: Union[Iterable, Bloom]) -\u003e Bloom        # __or__\n    def update(self, *others: Union[Iterable, Bloom])                # __ior__\n    def intersection(self, *others: Union[Iterable, Bloom]) -\u003e Bloom # __and__\n    def intersection_update(self, *others: Union[Iterable, Bloom])   # __iand__\n\n    # these implement \u003c, \u003e, \u003c=, \u003e=, ==, !=\n    def __lt__, __gt__, __le__, __ge__, __eq__, __ne__(self,\n                                                       other: Bloom) -\u003e bool\n    def issubset(self, other: Bloom) -\u003e bool      # self \u003c= other\n    def issuperset(self, other: Bloom) -\u003e bool    # self \u003e= other\n\n    def clear(self)                               # remove all items\n    def copy(self) -\u003e Bloom                       # duplicate self\n```\n\nTo prevent death and destruction, the bitwise set operations only work on\nfilters where all parameters are equal (including the hash functions being\nthe exact same object). Because this is a Bloom filter, the `__contains__`\nand `approx_items` methods are probabilistic, as are all the methods that\ncompare two filters (such as `__le__` and `issubset`).\n\n## Cryptographic security\n\nPython's built-in hash function is designed to be fast, not maximally\ncollision-resistant, so if your program depends on the false positive rate\nbeing perfectly correct, you may want to supply your own hash function.\nThis is especially the case when working with very large filters (more\nthan a few tens of millions of items) or when false positives are very\ncostly and could be exploited by an adversary. Just make sure that your\nhash function returns an integer between -2^127 and 2^127 - 1. Feel free\nto use the following example in your own code:\n\n```python\nfrom rbloom import Bloom\nfrom hashlib import sha256\nfrom pickle import dumps\n\ndef hash_func(obj):\n    h = sha256(dumps(obj)).digest()\n    # use sys.byteorder instead of \"big\" for a small speedup when\n    # reproducibility across machines isn't a concern\n    return int.from_bytes(h[:16], \"big\", signed=True)\n\nbf = Bloom(100_000_000, 0.01, hash_func)\n```\n\nWhen you throw away Python's built-in hash function and start hashing\nserialized representations of objects, however, you open up a breach into\nthe scary realm of the unpythonic:\n\n- Numbers like `1`, `1.0`, `1 + 0j` and `True` will suddenly no longer\n  be equal.\n- Instances of classes with custom hashing logic (e.g. to stop\n  caches inside instances from affecting their hashes) will suddenly\n  display undefined behavior.\n- Objects that can't be serialized simply won't be hashable at all.\n\nMaking you supply your own hash function in this case is a deliberate\ndesign decision intended to show you what you're doing and prevent\nyou from shooting yourself in the foot.\n\nAlso note that using a custom hash will incur a performance penalty over\nusing the built-in hash.\n\n## Persistence\n\nThe `save` and `load` methods, along with their byte-oriented counterparts\n`save_bytes` and `load_bytes`, allow you to save and load filters to and\nfrom disk/Python `bytes` objects. However, as the built-in hash function's\nsalt changes between invocations of Python, they only work on filters with\ncustom hash functions. Note that it is your responsibility to ensure that\nthe hash function you supply to the loading functions is the same as the\none originally used by the filter you're loading!\n\n```python\nbf = Bloom(10_000, 0.01, some_hash_func)\nbf.add(\"hello\")\nbf.add(\"world\")\n\n# saving to a file\nbf.save(\"bf.bloom\")\n\n# loading from a file\nloaded_bf = Bloom.load(\"bf.bloom\", some_hash_func)\nassert loaded_bf == bf\n\n# saving to bytes\nbf_bytes = bf.save_bytes()\n\n# loading from bytes\nloaded_bf_from_bytes = Bloom.load_bytes(bf_bytes, some_hash_func)\nassert loaded_bf_from_bytes == bf\n```\n\nThe size of the file is `bf.size_in_bits / 8 + 8` bytes.\n\n---\n\n**Statement of attribution:** Bloom filters were originally proposed in\n[(Bloom, 1970)](https://doi.org/10.1145/362686.362692). Furthermore, this\nimplementation makes use of a constant recommended by\n[(L'Ecuyer, 1999)](https://doi.org/10.1090/S0025-5718-99-00996-5) for\nredistributing the entropy of a single hash over multiple integers using a\nlinear congruential generator.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FKenanHanke%2Frbloom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FKenanHanke%2Frbloom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FKenanHanke%2Frbloom/lists"}