https://github.com/toniphan21/go-bf
A raw, simple and effective golang implementation of a Bloom Filter without any dependencies
- Host: GitHub
- URL: https://github.com/toniphan21/go-bf
- Owner: toniphan21
- License: mit
- Created: 2024-10-06T17:41:34.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-14T10:21:11.000Z (3 months ago)
- Last Synced: 2024-10-25T03:53:56.310Z (3 months ago)
- Topics: bloom-filter, golang, no-dependencies
- Language: Go
- Homepage:
- Size: 35.2 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## Go Bloom Filter
[![codecov](https://codecov.io/github/toniphan21/go-bf/graph/badge.svg?token=20ILOD9CPG)](https://codecov.io/github/toniphan21/go-bf)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/b80cab5415324a4e91ad5cd6cdec1fb0)](https://app.codacy.com/gh/toniphan21/go-bf/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)
[![Go Report Card](https://goreportcard.com/badge/github.com/toniphan21/go-bf)](https://goreportcard.com/report/github.com/toniphan21/go-bf)
[![Go Reference](https://pkg.go.dev/badge/github.com/toniphan21/go-bf?status.svg)](https://pkg.go.dev/github.com/toniphan21/go-bf?tab=doc)

A [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter) written in GoLang with flexible options and no dependencies.
### Usage
Initialize a Bloom Filter with the `WithAccuracy` configuration:
```golang
package main

import "github.com/toniphan21/go-bf"

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	filter := bf.Must(bf.WithAccuracy(errorRate, numberOfItems))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true, it MAY be correct. Remember to check for false positives.")
	}
}
```
Initialize a Bloom Filter with the `WithCapacity` configuration:
```golang
package main

import "github.com/toniphan21/go-bf"

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	filter := bf.Must(bf.WithCapacity(capacityInBits, numberOfHashFunctions))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true, it MAY be correct. Remember to check for false positives.")
	}
}
```

### APIs
#### Constructors
- `New(Config, ...Options) (BloomFilter, error)` initializes a new instance
- `Must(Config, ...Options) BloomFilter` initializes a new instance and panics if any error occurs

#### BloomFilter interface
The `BloomFilter` interface has 8 main methods:
| Method | Description |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------|
| `Add([]byte)` | Add an item to the filter |
| `Exists([]byte) bool` | Check whether an item exists in the filter |
| `Count() int` | Get the number of items added to the filter. Returns -1 if not sure (for example after using `Intersect()` or `Union()`) |
| `Clone() (BloomFilter, error)` | Create a new BloomFilter instance with the same storage, hasher and data |
| `Intersect(BloomFilter) error` | Intersect with the given filter. Both filters must use the same Storage and Hasher. Only the current filter's data is modified; the given filter's data is not |
| `Union(BloomFilter) error` | Union with the given filter. Both filters must use the same Storage and Hasher. Only the current filter's data is modified; the given filter's data is not |
| `Storage() Storage` | Get the filter's Storage |
| `Hasher() Hasher` | Get the filter's Hasher |

#### Options
There are 4 option functions that can be passed after the config to `bf.New(Config, ...OptionFunc)`:
| Signature | | Description |
|---------------------------------|-----------|-----------------------------------------------------|
| `WithSHA()` | _default_ | Use split SHA hashing strategy (more uniform hash) |
| `WithFNV()` | | Use split FNV hashing strategy (better performance) |
| `WithHasher(f HasherFactory)` | | Customize the hashing strategy with a HasherFactory |
| `WithStorage(f StorageFactory)` | | Customize the storage strategy with a StorageFactory |

### Implementation Details
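As background for the details that follow, here is a minimal sketch of the kind of bit storage that `Union()` and `Intersect()` from the API table act on. The type `bitSet` and its word layout are hypothetical names chosen for illustration; this is not the library's storage code, only one common way to implement the same operations.

```go
package main

import "fmt"

// bitSet is a hypothetical in-memory bit storage (not the library's type).
type bitSet struct {
	words    []uint64
	capacity uint32
}

func newBitSet(capacity uint32) *bitSet {
	return &bitSet{words: make([]uint64, (uint64(capacity)+63)/64), capacity: capacity}
}

func (b *bitSet) Capacity() uint32  { return b.capacity }
func (b *bitSet) Set(i uint32)      { b.words[i/64] |= uint64(1) << (i % 64) }
func (b *bitSet) Clear(i uint32)    { b.words[i/64] &^= uint64(1) << (i % 64) }
func (b *bitSet) Get(i uint32) bool { return b.words[i/64]&(uint64(1)<<(i%64)) != 0 }

// Union ORs the other set into the receiver. Assumes equal capacity.
// Only the receiver changes.
func (b *bitSet) Union(o *bitSet) {
	for i := range b.words {
		b.words[i] |= o.words[i]
	}
}

// Intersect ANDs the other set into the receiver. Assumes equal capacity.
// Only the receiver changes.
func (b *bitSet) Intersect(o *bitSet) {
	for i := range b.words {
		b.words[i] &= o.words[i]
	}
}

func main() {
	a, b := newBitSet(128), newBitSet(128)
	a.Set(1)
	a.Set(70)
	b.Set(70)
	b.Set(100)

	a.Union(b)
	fmt.Println(a.Get(1), a.Get(70), a.Get(100)) // true true true
	fmt.Println(b.Get(1))                        // false: b is untouched

	a.Intersect(b)
	fmt.Println(a.Get(1), a.Get(70), a.Get(100)) // false true true
}
```

After a merge like this, the bits alone no longer reveal how many distinct items were added, which is consistent with `Count()` returning -1 after `Union()` or `Intersect()`.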
#### Error rate, number of hash functions calculation
_This is just a summary of how to estimate the error rate of a Bloom Filter; for the proof, please check an
academic paper or the Wikipedia page._

Given:
- `n` estimated number of items in a Bloom Filter
- `e` false positive error rate
- `k` number of hash functions needed
- `m` bits of memory

If you have `n` and `e` (`WithAccuracy` config):
```math
\huge k \approx -\log_2 e
```

```math
\huge m \approx -1.44 n \log_2 e
```

`WithAccuracy` has a built-in `.Info()` that returns the calculated `m`, `k` and their rounded values.
```golang
package main

import (
	"fmt"

	"github.com/toniphan21/go-bf"
)

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	config := bf.WithAccuracy(errorRate, numberOfItems)
	fmt.Println(config.Info())
}

/** OUTPUT:
Config WithAccuracy()
- Requested error rate: 0.10000%
- Expected number of items: 10000000
- Bits per item: 14.351
- Number of hash functions: 10
- Size in bits of each has function: 28
- Storage capacity: 143507294 bits = 17938412 bytes = 17517.98KB = 17.11MB
- Estimated error rate: 0.10130%
*/
```

If you have `m` and `k` (`WithCapacity` config), the estimated error rate when `n` items are added into a Bloom Filter is:
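```math
\huge \epsilon \approx \left(1 - e^{-\frac{k n}{m}}\right)^k
```

In this formula ε is the estimated false positive rate and `e` is Euler's number, not the requested error rate `e` defined above. The `WithCapacity` configuration has a built-in `.Info()` that displays the estimated error rate when configured by `m` and `k`.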
```golang
package main

import (
	"fmt"

	"github.com/toniphan21/go-bf"
)

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	config := bf.WithCapacity(capacityInBits, numberOfHashFunctions)
	fmt.Println(config.Info())
}

/** OUTPUT:
Config WithCapacity()
- Storage capacity: 65536 bits = 8192 bytes = 8.00KB = 0.01MB
- Number of hash functions: 5
- Size in bits of each has function: 16
- Estimated error rate by n - number of added items:
n=   100; estimated error rate: 0.00000%
n=   200; estimated error rate: 0.00000%
n=   500; estimated error rate: 0.00001%
n=  1000; estimated error rate: 0.00021%
n=  2000; estimated error rate: 0.00568%
n=  5000; estimated error rate: 0.32083%
n= 10000; estimated error rate: 4.33023%
n= 20000; estimated error rate: 29.35056%
n= 50000; estimated error rate: 89.45317%
n=100000; estimated error rate: 99.75726%
n=200000; estimated error rate: 99.99988%
n=500000; estimated error rate: 100.00000%
*/
```

> If you spot something wrong with the calculation, no worries - you can [write your own config](https://github.com/toniphan21/go-bf?tab=readme-ov-file#write-your-own-config).
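The formulas in this section can be checked independently with a few lines of standard-library Go. The sketch below is not the library's code; the function names `optimalParams` and `estimatedErrorRate` are made up for illustration, and the constant 1.44 is taken from the formula for `m` above.

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams returns the unrounded number of hash functions k and the
// number of bits m for n expected items at false positive rate e:
//   k ≈ -log2(e),  m ≈ -1.44 * n * log2(e)
func optimalParams(e float64, n uint32) (k, m float64) {
	k = -math.Log2(e)
	m = 1.44 * float64(n) * k
	return k, m
}

// estimatedErrorRate returns (1 - exp(-k*n/m))^k, the estimated false
// positive rate after n items were added to m bits with k hash functions.
func estimatedErrorRate(m, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	// Same inputs as the WithAccuracy example.
	k, m := optimalParams(0.001, 10_000_000)
	fmt.Printf("k ≈ %.3f, m ≈ %.0f bits\n", k, m)

	// Same inputs as the WithCapacity example, for n = 10000.
	rate := estimatedErrorRate(65_536, 5, 10_000)
	fmt.Printf("estimated error rate: %.5f%%\n", 100*rate)
}
```

With these inputs `k` comes out just under 10 and `m` near 143.5 million bits, in line with the `WithAccuracy` output shown earlier, and the second value is about 4.33%, matching the `n= 10000` row of the `WithCapacity` output.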
#### Hashing strategy
This library has two built-in hashers that share the same strategy:
- From the config we know `keySize`, the minimum key size (in bits), and `keyCount`, the number of hash functions needed.
- When BloomFilter calls `Hasher.Hash()` it passes `count`, the number of key arrays needed.
- Use `SHA-256` or `FNV-128` to generate hash bytes from the input. If `count * keySize * keyCount` > `256` when using
  SHA (or `128` when using FNV), the input is hashed multiple times, with a byte prefixed to the input for each additional hash.
- Pick `count * keyCount * keySize` bits from the hash bytes of the previous step and discard all remaining bits.

Example 1: `count = 1`, `keySize = 25`, `keyCount = 10`, use `SHA-256`:
- Because `1*25*10 = 250 bits`, we only need to hash 1 time
- hash(256 bits) = `sha_hash(input)`
- pick key 0 = bit 0-24
- pick key 1 = bit 25-49
- ...
- pick key 9 = bit 225-249
- bit 250-255 is discarded
- return `[1][10]uint32{ {key0, key1...key9} }`

Example 2: `count = 1`, `keySize = 25`, `keyCount = 10`, use `FNV-128`:
- Because `1*25*10 > 128`, we hash input 2 times
- hash(256 bits) = `fnv_128(input)` + `fnv_128(byte(0) + input)`
- pick key 0 = bit 0-24
- pick key 1 = bit 25-49
- ...
- pick key 9 = bit 225-249
- bit 250-255 is discarded
- return `[1][10]uint32{ {key0, key1...key9} }`

Example 3: `count = 2`, `keySize = 25`, `keyCount = 10`, use `FNV-128`:
- Because `2*25*10 = 500 > 128` bits, we need to hash 4 times
- hash(512 bits) = `fnv_128(input)` + `fnv_128(byte(0) + input)` + `fnv_128(byte(1) + input)`+ `fnv_128(byte(2) + input)`
- pick key 0 = bit 0-24
- pick key 1 = bit 25-49
- ...
- pick key 9 = bit 225-249
- ...
- pick key 19 = bit 475-499
- bit 500-511 is discarded
- return `[2][10]uint32{ {key0, key1...key9}, {key10...key19} }`

This strategy guarantees that the first n keys are always the same, independent of the `count` passed to
`Hasher.Hash()`. A classic BloomFilter always uses `count=1`, but other variants may use more than one
key collection.

### Customization
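As a starting point for a custom hasher, the splitting strategy described under "Hashing strategy" can be sketched with the standard library alone. This is an illustration, not the library's implementation: the function names `hashBits`, `splitKeys` and `hashKeys` are hypothetical, and the bit order (most significant bit first) is an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashBits returns at least nbits bits of hash material: SHA-256 of the
// input, then SHA-256 of byte(0)+input, byte(1)+input, ... as needed.
func hashBits(input []byte, nbits int) []byte {
	sum := sha256.Sum256(input)
	out := append([]byte{}, sum[:]...)
	for prefix := byte(0); len(out)*8 < nbits; prefix++ {
		sum = sha256.Sum256(append([]byte{prefix}, input...))
		out = append(out, sum[:]...)
	}
	return out
}

// splitKeys cuts the first count*keyCount*keySize bits into keys of keySize
// bits each and groups them into count collections of keyCount keys.
// Assumes keySize <= 32 and MSB-first bit order.
func splitKeys(bits []byte, count, keyCount, keySize int) [][]uint32 {
	keys := make([][]uint32, count)
	pos := 0
	for i := range keys {
		keys[i] = make([]uint32, keyCount)
		for j := range keys[i] {
			var key uint32
			for b := 0; b < keySize; b++ {
				bit := (bits[pos/8] >> (7 - pos%8)) & 1
				key = key<<1 | uint32(bit)
				pos++
			}
			keys[i][j] = key
		}
	}
	return keys
}

func hashKeys(input []byte, count, keyCount, keySize int) [][]uint32 {
	return splitKeys(hashBits(input, count*keyCount*keySize), count, keyCount, keySize)
}

func main() {
	one := hashKeys([]byte("anything"), 1, 10, 25) // 250 bits: one hash
	two := hashKeys([]byte("anything"), 2, 10, 25) // 500 bits: two hashes
	fmt.Println(len(one), len(one[0]), len(two), len(two[0])) // 1 10 2 10
	// The first key collection does not depend on count.
	fmt.Println(fmt.Sprint(one[0]) == fmt.Sprint(two[0])) // true
}
```

Because extra hash material is only ever appended, the first key collection is identical for `count = 1` and `count = 2`, which is the guarantee stated at the end of the hashing section.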
#### Write your own hashing strategy
By default, a `SHAHasher` is used. You can customize the hashing by implementing the `Hasher` and
`HasherFactory` interfaces:
```golang
package main

import "github.com/toniphan21/go-bf"

type YourHasher struct {
	keyCount byte
	keySize  byte
}

func (y *YourHasher) Equals(other bf.Hasher) bool {
	o, ok := other.(*YourHasher)
	if !ok {
		return false
	}
	// check other params
	return y.keyCount == o.keyCount && y.keySize == o.keySize
}

func (y *YourHasher) Hash(input []byte, count int) [][]bf.Key {
	// return an array of hash arrays for the given input bytes.
	//   - the length of the array is count
	//   - each subarray has length = keyCount - number of hash functions
	//   - each hash needs to be >= keySize - minimum size of a hash in bits
	// For example: given keyCount = 5, keySize = 16
	//   - count = 1 requires you to return:
	//     [][]Key{
	//       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
	//     }
	//   - count = 2 requires you to return:
	//     [][]Key{
	//       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
	//       { "key5: at least 16 bits long", "key6:...", "key7:...", "key8:...", "key9:..."},
	//     }
	return [][]bf.Key{}
}

type YourHasherFactory struct{}

func (y *YourHasherFactory) Make(numberOfHashFunctions, hashSizeInBits byte) bf.Hasher {
	return &YourHasher{
		keyCount: numberOfHashFunctions,
		keySize:  hashSizeInBits,
	}
}

func main() {
	config := bf.WithAccuracy(0.01, 1_000_000)
	filter := bf.Must(config, bf.WithHasher(&YourHasherFactory{}))

	filter.Add([]byte("anything"))
	// ...
}
```

#### Write your own custom storage
By default, all data is stored in memory. You can customize the storage by implementing the `Storage` and
`StorageFactory` interfaces:
```golang
package main

import "github.com/toniphan21/go-bf"

type FileStorage struct {
	capacity uint32
}

func (f *FileStorage) Intersect(other bf.Storage) {
	// implement BatchIntersect to perform Intersect operator faster
}

func (f *FileStorage) Union(other bf.Storage) {
	// implement BatchUnion to perform Union operator faster
}

func (f *FileStorage) Equals(other bf.Storage) bool {
	o, ok := other.(*FileStorage)
	if !ok {
		return false
	}
	// check other params
	return o.capacity == f.capacity
}

func (f *FileStorage) Set(index uint32) {
	// set the bit at the given index to true
}

func (f *FileStorage) Clear(index uint32) {
	// clear the bit at the given index (set it to false)
}

func (f *FileStorage) Get(index uint32) bool {
	// return the boolean value of the bit at the given index
	return false
}

func (f *FileStorage) Capacity() uint32 {
	// return the capacity of the storage in bits
	return f.capacity
}

type FileStorageFactory struct{}

func (f *FileStorageFactory) Make(capacity uint32) (bf.Storage, error) {
	return &FileStorage{capacity}, nil
}

func main() {
	config := bf.WithAccuracy(0.01, 1_000_000)
	filter := bf.Must(config, bf.WithStorage(&FileStorageFactory{}))

	filter.Add([]byte("anything"))
	// ...
}
```

#### Write your own config
If you don't like `WithCapacity()` or `WithAccuracy()` configuration, you can write your own:
```golang
package main

import "github.com/toniphan21/go-bf"

type YourConfig struct {
}

func (y *YourConfig) Info() string {
	return "info about your config"
}

func (y *YourConfig) NumberOfHashFunctions() byte {
	return 5
}

func (y *YourConfig) StorageCapacity() uint32 {
	return 1_000_000
}

func (y *YourConfig) KeySize() byte {
	return 32
}

func main() {
	config := &YourConfig{}
	filter, err := bf.New(config)
	if err != nil {
		panic("Something went wrong")
	}

	filter.Add([]byte("anything"))
	// ...
}
```

### Benchmark
```
BenchmarkBloomFilter_WithSHA_Add
BenchmarkBloomFilter_WithSHA_Add-12 1027905 1152 ns/op
BenchmarkBloomFilter_WithFNV_Add
BenchmarkBloomFilter_WithFNV_Add-12 2113863 573.3 ns/op
BenchmarkBloomFilter_WithSHA_Exists
BenchmarkBloomFilter_WithSHA_Exists-12 1000000 1132 ns/op
BenchmarkBloomFilter_WithFNV_Exists
BenchmarkBloomFilter_WithFNV_Exists-12 2175315 549.9 ns/op
BenchmarkBloomFilter_Intersect
BenchmarkBloomFilter_Intersect/bench
BenchmarkBloomFilter_Intersect/bench-12 10000 106422 ns/op
BenchmarkBloomFilter_Union
BenchmarkBloomFilter_Union/bench
BenchmarkBloomFilter_Union/bench-12 10000 106651 ns/op
BenchmarkBloomFilter_Clone
BenchmarkBloomFilter_Clone/bench
BenchmarkBloomFilter_Clone/bench-12 6856 168797 ns/op
```