{"id":27253439,"url":"https://github.com/marselester/bloom","last_synced_at":"2025-07-04T11:02:13.384Z","repository":{"id":57586694,"uuid":"138955300","full_name":"marselester/bloom","owner":"marselester","description":"Bloom filter (space-efficient probabilistic data structure).","archived":false,"fork":false,"pushed_at":"2018-08-13T05:18:35.000Z","size":15,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-06-20T09:23:56.108Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marselester.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-28T02:19:27.000Z","updated_at":"2022-08-26T12:15:46.000Z","dependencies_parsed_at":"2022-09-26T19:32:35.240Z","dependency_job_id":null,"html_url":"https://github.com/marselester/bloom","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marselester%2Fbloom","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marselester%2Fbloom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marselester%2Fbloom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marselester%2Fbloom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marselester","download_url":"https://codeload.github.com/marselester/bloom/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_cou
nt":248324233,"owners_count":21084670,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-11T01:27:38.957Z","updated_at":"2025-04-11T01:27:39.789Z","avatar_url":"https://github.com/marselester.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bloom Filter\n\n[![Documentation](https://godoc.org/github.com/marselester/bloom?status.svg)](https://godoc.org/github.com/marselester/bloom)\n[![Go Report Card](https://goreportcard.com/badge/github.com/marselester/bloom)](https://goreportcard.com/report/github.com/marselester/bloom)\n\nThis is a Bloom filter implementation written just for fun. Databases use Bloom filters to reduce disk lookups for non-existent rows or columns.\nAvoiding costly disk lookups considerably increases the performance of a database query operation.\nFor instance, we can avoid unnecessary key lookups in segment files if a Bloom filter is leveraged by\n[RascalDB](https://github.com/marselester/rascaldb).\n\nA [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a space-efficient probabilistic data structure\nthat is used to test whether an element is a member of a set.\nFalse positive matches are possible, but false negatives are not – in other words, a query returns either \"possibly in set\" or\n\"definitely not in set\". 
Elements can be added to the set, but not removed; the more elements that are added to the set,\nthe larger the probability of false positives.\n\nA Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements (up to 4,294,967,295 in this implementation); adding an element never fails due to the data structure \"filling up\".\n\n## Usage Example\n\nFor more details, please refer to the [documentation](https://godoc.org/github.com/marselester/bloom).\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"log\"\n\n    \"github.com/marselester/bloom\"\n)\n\nfunc main() {\n    const maxEmails = 100\n    const prob = 0.01\n    bf, err := bloom.New(maxEmails, prob)\n    if err != nil {\n        log.Fatalf(\"Bloom filter is not created: %v\", err)\n    }\n\n    email := []byte(\"alice@example.com\")\n    // Must operations panic on an error caused by the hash function, e.g., a failure to convert hex to decimal\n    // (that shouldn't happen normally). You can use Add/Has, which return errors instead.\n    bf.MustAdd(email)\n    if bf.MustHave(email) {\n        fmt.Print(\"Alice's email possibly is in the set.\")\n    } else {\n        fmt.Print(\"Alice's email definitely is not in the set.\")\n    }\n}\n```\n\n## Algorithm\n\nThe idea is to \"convert\" an element into several indexes of a bit array (\"coordinates\" or positions).\nFor example, the string `\"test\"` is transformed into offsets `7, 36, 32, 37` of a bit array.\nTo add an element, set the bits at those positions to 1.\nAn element is possibly a member of the set only if all bits at those positions are 1.\n\n```\n 1  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 0 0 0 1 0 0 0 0 0 0\n36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0\n```\n\nThe element was transformed into indexes by applying 4 hash functions. 
A hash function (in this package)\ntakes the SHA-256 digest (hex) and converts it into a decimal number (from base 16 to base 10).\nSince we need 4 distinct hash functions, we can append a different number to the element for each one.\nNote, a cryptographic hash function is used here to achieve the best uniformity and keep the code simple\n(it depends only on the standard library). There are faster hash functions for the job, for instance,\n[Murmur3](https://github.com/bitly/dablooms/pull/19).\n\n```\nhex2dec(sha256(\"test0\")) == 7\nhex2dec(sha256(\"test1\")) == 36\nhex2dec(sha256(\"test2\")) == 32\nhex2dec(sha256(\"test3\")) == 37\n```\n\nBased on the desired probability of an error (false positive) and the number of elements you intend to add,\nit's possible to calculate the optimal number of hash functions and the length of the bit array.\nFor example, a set of 1,000,000 elements with a 0.01 error rate requires 9,585,059 bits (1.198 MB) of storage.\n\n```\nmax_elements = 1000000\ndesired_prob = 0.01\nhash_qty = -ln(desired_prob) / ln(2)\nbit_array_len = -max_elements * ln(desired_prob) / (ln(2) * ln(2))\n```\n\n## Tests\n\nInstall go-fuzz.\n\n```sh\n$ go get github.com/dvyukov/go-fuzz/go-fuzz\n$ go get github.com/dvyukov/go-fuzz/go-fuzz-build\n```\n\nStart fuzzing and check whether there are any crashers.\n\n```sh\n$ go-fuzz-build github.com/marselester/bloom\n$ go-fuzz -bin=bloom-fuzz.zip -workdir=fuzz\n```\n\n## Benchmarks\n\nThis Bloom filter implementation certainly has room for improvement, e.g., by using a faster hash function,\nreducing the number of memory allocations, or hashing in parallel. The objective here, though, is to keep the code simple.\n\n```sh\n$ go test -bench=. 
-benchmem\nBenchmarkFilter_Add/1.198MB-4         \t  300000\t      4254 ns/op\t    1888 B/op\t      30 allocs/op\nBenchmarkFilter_Add/2.573GB-4         \t  300000\t      5198 ns/op\t    1888 B/op\t      30 allocs/op\nBenchmarkFilter_Has/1.198MB-4         \t  300000\t      4269 ns/op\t    1888 B/op\t      30 allocs/op\nBenchmarkFilter_Has/2.573GB-4         \t  300000\t      4285 ns/op\t    1888 B/op\t      30 allocs/op\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarselester%2Fbloom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarselester%2Fbloom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarselester%2Fbloom/lists"}