{"id":13458225,"url":"https://github.com/bitly/dablooms","last_synced_at":"2025-12-23T02:45:11.876Z","repository":{"id":4077688,"uuid":"5183008","full_name":"bitly/dablooms","owner":"bitly","description":"scaling, counting, bloom filter library","archived":false,"fork":false,"pushed_at":"2019-10-26T22:07:55.000Z","size":211,"stargazers_count":967,"open_issues_count":9,"forks_count":118,"subscribers_count":77,"default_branch":"master","last_synced_at":"2024-10-29T03:32:40.501Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitly.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-07-25T18:41:03.000Z","updated_at":"2024-08-22T02:54:16.000Z","dependencies_parsed_at":"2022-09-17T04:01:39.630Z","dependency_job_id":null,"html_url":"https://github.com/bitly/dablooms","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdablooms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdablooms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdablooms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitly%2Fdablooms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitly","download_url":"https://codeload.github.com/bitly/dablooms/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245297948,"owners_count":20592508,"icon_url":"https://githu
b.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T09:00:47.891Z","updated_at":"2025-12-23T02:45:11.803Z","avatar_url":"https://github.com/bitly.png","language":"C","funding_links":[],"categories":["Awesome Algorithms"],"sub_categories":["bloom - Bloom Filter (布隆过滤器)"],"readme":"Dablooms: A Scalable, Counting, Bloom Filter\n----------------------------------\n\n_Note_: this project has been mostly unmaintained for a while.\n\n### Overview\nThis project aims to demonstrate a novel Bloom filter implementation that can\nscale, and provide not only the addition of new members, but reliable removal\nof existing members.\n\nBloom filters are probabilistic data structures that provide space-efficient\nstorage of elements at the cost of possible false positives on membership\nqueries.\n\n**dablooms** implements such a structure that takes additional metadata to classify\nelements in order to make an intelligent decision as to which Bloom filter an element\nshould belong.\n\n### Features\n**dablooms**, in addition to the above, has several features.\n\n* Implemented as a static C library\n* Memory mapped\n* 4 bit counters\n* Sequence counters for clean/dirty checks\n* Python wrapper\n\nFor performance, the low-level operations are implemented in C.  It is also\nmemory mapped, which provides async flushing and persistence at low cost.\nIn an effort to maintain memory efficiency, rather than using integers, or\neven whole bytes as counters, we use only four bit counters. These four bit\ncounters allow up to 15 items to share a counter in the map. 
If more than a\nsmall handful are sharing said counter, the Bloom filter would be overloaded\n(resulting in excessive false positives anyway) at any sane error rate, so\nthere is no benefit in supporting larger counters.\n\nThe Bloom filter also employs change sequence numbers to track operations performed\non the Bloom filter. These allow the application to determine if a write might have\nonly partially completed (perhaps due to a crash), leaving the filter in an\ninconsistent state. The application can thus determine if a filter is ok or needs\nto be recreated. The sequence number can be used to determine what a consistent but\nout-of-date filter missed, and bring it up-to-date.\n\nThere are two sequence numbers (and helper functions to get them): \"mem_seqnum\" and\n\"disk_seqnum\". The \"mem\" variant is useful if the user is sure the OS didn't crash,\nand the \"disk\" variant is useful if the OS might have crashed since the Bloom filter\nwas last changed. Both values could be \"0\", meaning the filter is possibly\ninconsistent from their point of view, or a non-zero sequence number that the filter\nis consistent with. The \"mem\" variant is often non-zero, but the \"disk\" variant only\nbecomes non-zero right after a (manual) flush. This can be expensive (it's an fsync),\nso the value can be ignored if not relevant for the application. For example, if the\nBloom file exists in a directory which is cleared at boot (like `/tmp`), then the\napplication can safely assume that any existing file was not affected by an OS crash,\nand never bother to flush or check disk_seqnum. 
Schemes involving batching up changes\nare also possible.\n\nThe dablooms library is not inherently thread safe; thread safety is the client's responsibility.\nBindings are also not thread safe, unless they state otherwise.\n\n### Installing\nClone the repo, or download and extract a tarball of a tagged version\n[from GitHub](https://github.com/bitly/dablooms/tags).\nIn the source tree, run `make`, then `make install` (`sudo` may be needed).\nThis will only install static and dynamic versions of the C dablooms library \"libdablooms\".\n\nTo use a specific build directory, install prefix, or destination directory for packaging,\nspecify `BLDDIR`, `prefix`, or `DESTDIR` to make. For example:\n`make install BLDDIR=/tmp/dablooms/bld DESTDIR=/tmp/dablooms/pkg prefix=/usr`\n\nLook at the output of `make help` for more options and targets.\n\nAlso available are bindings for various other languages:\n\n#### Python (pydablooms)\nTo install the Python bindings \"pydablooms\" (currently only compatible with Python 2.x)\nrun `make pydablooms`, then `make install_pydablooms` (`sudo` may be needed).\n\nTo use and install for a specific version of Python installed on your system,\nuse the `PYTHON` option to make. For example: `make install_pydablooms PYTHON=python2.7`.\nYou can override the module install location with the `PY_MOD_DIR` option to make,\nand the `BLDDIR` and `DESTDIR` options also affect pydablooms.\n\nThe Makefile attempts to determine the Python module location `PY_MOD_DIR`\nautomatically. It prefers a location in `/usr/local`, but you can specify\n`PY_MOD_DIR_ARG=--user` to try to use the location which `pip install --user`\nwould use in your HOME dir. 
You can instead specify `PY_MOD_DIR_ARG=--system`\nto prefer the normal/central system Python module directory.\n\nSee pydablooms/README.md for more info.\n\n#### Go (godablooms)\nThe Go bindings \"godablooms\" are not integrated into the Makefile.\nInstall libdablooms first, then look at `godablooms/README.md`.\n\n### Contributing\nIf you make changes to C portions of dablooms which you would like merged into the\nupstream repository, it would help to have your code match our C coding style. We use\n[astyle](http://astyle.sourceforge.net/), svn rev 353 or later, on our code, with the\nfollowing options:\n\n    astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \\\n           --fill-empty-lines --pad-header --indent-switches           \\\n           --align-pointer=name --align-reference=name --pad-oper -n     \u003cfiles\u003e\n\n### Testing\nTo run a quick and dirty test, type `make test`. This test uses a list of words\nand defaults to `/usr/share/dict/words`. If your path differs, you can use the\n`WORDS` option to specify its location, such as `make test WORDS=/usr/dict/words`.\n\nThis will run a simple test that iterates through a word list and\nadds each word to dablooms. It iterates again, removing every fifth\nelement. Lastly, it saves the file, opens a new filter, and iterates a third time\nchecking the existence of each word. It prints the counts of true negatives,\nfalse positives, true positives, and false negatives, and the false positive rate.\n\nThe false positive rate is calculated by \"false positives / (false positives + true negatives)\".\nThat is, the fraction of real negatives that are reported as positives. This is the interesting\nstatistic because the rate of false negatives should always be zero.\n\nThe test uses a maximum error rate of 0.05 (5%) and an initial capacity of 100k. 
If\nthe dictionary is near 500k, we should have created 4 new filters in order to scale to size.\n\nA second test adds every other word in the list, and removes no words, causing each\nused filter to stay at maximum capacity, which is the worst case for accuracy.\n\nCheck out the performance yourself, and check out the size of the resulting file!\n\n## Bloom Filter Basics\nBloom filters are probabilistic data structures that provide\nspace-efficient storage of elements at the cost of occasional false positives on\nmembership queries, i.e. a Bloom filter may state true on query when it in fact does\nnot contain said element. A Bloom filter is traditionally implemented as an array of\n`M` bits, where `M` is the size of the Bloom filter. On initialization all bits are\nset to zero. A filter is also parameterized by a constant `k` that defines the number\nof hash functions used to set and test bits in the filter.  Each hash function should\noutput one index into the `M`-bit array.  When inserting an element `x` into the filter, the bits\nat the `k` indices `h1(x), h2(x), ..., hk(x)` are set.\n\nIn order to query a Bloom filter, say for element `x`, it suffices to verify if\nall bits at indices `h1(x), h2(x), ..., hk(x)` are set. If one or more of these\nbits is not set, then the queried element is definitely not present in the\nfilter. However, if all these bits are set, then the element is considered to\nbe in the filter. Given this procedure, an error probability exists for positive\nmatches, since the tested indices might have been set by the insertion of other\nelements.\n\n### Counting Bloom Filters: Solving Removals\nThe same property that results in false positives *also* makes it\ndifficult to remove an element from the filter as there is no\neasy means of discerning if another element is hashed to the same bit.\nUnsetting a bit that is hashed by multiple elements can cause **false\nnegatives**.  
Using a counter, instead of a bit, can circumvent this issue.\nThe counter can be incremented when an element is hashed to a\ngiven location, and decremented upon removal.  Membership queries rely on whether a\ngiven counter is greater than zero.  The cost is a reduction in the exceptional\nspace-efficiency provided by the standard Bloom filter.\n\n### Scalable Bloom Filters: Solving Scale\nAnother important property of a Bloom filter is its linear relationship between size\nand storage capacity. If the maximum allowable error probability and the number of elements to store\nare both known, it is relatively straightforward to dimension an appropriate\nfilter. However, it is not always possible to know how many elements\nwill need to be stored a priori. There is a trade-off between over-dimensioning the filter and\nsuffering from a ballooning error probability as it fills.\n\nAlmeida, Baquero, Preguiça, and Hutchison published a paper in 2006 on\n[Scalable Bloom Filters](http://www.sciencedirect.com/science/article/pii/S0020019006003127),\nwhich suggested a means of scalable Bloom filters by creating essentially\na list of Bloom filters that act as one large Bloom filter. When greater\ncapacity is desired, a new filter is added to the list.\n\nMembership queries are conducted on each filter, and the result is\npositive if the element is found in any one of the filters.  Naively, this\nleads to an increasing compounding error probability since the error probability\nof the combined structure, with the product taken over all filters, evaluates to:\n\n    1 - Π(1 - P)\n\nIt is possible to bound this error probability by adding a tightening\nratio, `r`, that reduces the error probability of each successive filter. As a result, the bounded error probability is represented as:\n\n    1 - Π(1 - P0 * r^i) where r is chosen as 0 \u003c r \u003c 1\n\nSince size is simply a function of an error probability and capacity, any\narray of growth functions can be applied to scale the size of the Bloom filter\nas necessary.  
We found it sufficient to pick .9 for `r`.\n\n## Problems with Mixing Scalable and Counting Bloom Filters\nScalable Bloom filters do not allow for the removal of elements from the filter.\nIn addition, simply converting each Bloom filter in a scalable Bloom filter into\na counting filter also poses problems. Since an element can be in any filter, and\nBloom filters inherently allow for false positives, a given element may appear to\nbe in two or more filters. If an element is inadvertently removed from a filter\nwhich did not contain it, it would introduce the possibility of **false negatives**.\n\nIf however, an element can be removed from the correct filter, it maintains\nthe integrity of said filter, i.e. prevents the possibility of false negatives. Thus,\na scaling, counting, Bloom filter is possible if upon additions and deletions\none can correctly decide which Bloom filter contains the element.\n\nThere are several advantages to using a Bloom filter. A Bloom filter gives the\napplication cheap, memory efficient set operations, with no actual data stored\nabout the given element. Rather, Bloom filters allow the application to test,\nwith some given error probability, the membership of an item. This leads to the\nconclusion that the majority of operations performed on Bloom filters are the\nqueries of membership, rather than the addition and removal of elements. Thus,\nfor a scaling, counting, Bloom filter, we can optimize for membership queries at\nthe expense of additions and removals. This expense comes not in performance,\nbut in the addition of more metadata concerning an element and its relation to\nthe Bloom filter.  
With the addition of some sort of identification of an\nelement, which does not need to be unique as long as it is fairly distributed, it\nis possible to correctly determine which filter an element belongs to, and thereby\nto maintain the integrity of a given Bloom filter with accurate additions\nand removals.\n\n## Enter dablooms\ndablooms is one such implementation of a scaling, counting, Bloom filter that takes\nadditional metadata during additions and deletions in the form of a (generally)\nmonotonically increasing integer (possibly a timestamp) to classify elements.\nThis is used during additions/removals to easily determine the correct Bloom filter\nfor an element (each filter is assigned a range). Checking an item against the Bloom\nfilter, which is assumed to be the dominant activity, does not use the id (it works\nlike a normal scaling Bloom filter).\n\ndablooms is designed to scale itself using these identifiers and the given capacity.\nWhen a Bloom filter is at capacity, dablooms will create a new Bloom filter which\nstarts at the next id after the greatest id of the previous Bloom filter. Given the\nfact that the identifiers monotonically increase, new elements will be added to the\nnewest Bloom filter. Note, in theory and as implemented, nothing prevents one from\nadding an element to any \"older\" filter. You just run the increasing risk of the\nerror probability growing beyond the bound as it becomes \"overfilled\".\n\nYou can then remove any element from any Bloom filter using the identifier to intelligently\npick which Bloom filter to remove from.  Consequently, as you continue to remove elements\nfrom Bloom filters that you are not continuing to add to, these Bloom filters will become\nmore accurate.\n\nThe \"id\" of an element does not need to be known to check the Bloom filter, but does need\nto be known when the element is removed (and must be the same as when it was added). 
This might\nbe convenient if the item already has an appropriate id (almost always increasing for new\nitems) associated with it.\n\n### Example use case\nThere is a database with a collection of entries. There is a series of items, each of which\nyou want to look up in the database; most will have no entry in the database, but some\nwill. Perhaps it's a database of spam links. If you use dablooms in front of the database,\nyou can avoid needing to check the database for almost all items which won't be found in\nit anyway, and save a lot of time and effort. It's also much easier to distribute the\nBloom filter than the entire database. But to make it work, you need to determine an \"id\"\nwhenever you add to or remove from the Bloom filter. You could store the timestamp when\nyou add the item to the database as another column in the database, and give it to\n`scaling_bloom_add()` as well. When you remove the item, you look it up in the database\nfirst and pass the timestamp stored there to `scaling_bloom_remove()`. The timestamps for\nnew items will be equal or greater, and definitely greater over time. Instead of\ntimestamps, you could also use an auto-incrementing index. Checks against the Bloom\nfilter don't need to know the id and should be quick. If a check comes back negative, you can be\nsure the item isn't in the database, and skip that query completely. If a check comes\nback positive, you have to query the database, because there's a slight chance that the\nitem isn't actually in there.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitly%2Fdablooms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitly%2Fdablooms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitly%2Fdablooms/lists"}