https://github.com/igrigorik/bloomfilter-rb

BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters

Host: GitHub
URL: https://github.com/igrigorik/bloomfilter-rb
Owner: igrigorik
Created: 2008-12-27T18:07:07.000Z (over 16 years ago)
Default Branch: master
Last Pushed: 2024-03-26T22:22:14.000Z (over 1 year ago)
Last Synced: 2025-04-15T01:59:18.600Z (3 months ago)
Language: C
Homepage: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
Size: 99.6 KB
Stars: 474
Watchers: 15
Forks: 59
Open Issues: 7
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-ruby - bloomfilter-rb - BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters. (Scientific)

README

# BloomFilter(s) in Ruby

- Native (MRI/C) counting bloom filter

- Redis-backed getbit/setbit non-counting bloom filter

- Redis-backed set-based counting (+TTL) bloom filter

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail, see the [Wikipedia article](http://en.wikipedia.org/wiki/Bloom_filter). Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, ..., k-1). Whether this gives a good distribution depends on the data.
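The seeding idea can be sketched in plain Ruby. This is a minimal illustration, not the gem's C code: it assumes Ruby's standard `Zlib.crc32`, which accepts a starting CRC value as its second argument, and `bucket_indexes` is a hypothetical helper name.

```ruby
require 'zlib'

# Hypothetical helper (not part of bloomfilter-rb): derive k bucket indexes
# for `key` in a filter of m buckets by seeding CRC32 with 0, 1, ..., k-1.
def bucket_indexes(key, m, k)
  (0...k).map { |seed| Zlib.crc32(key, seed) % m }
end

# The same key always maps to the same k indexes.
puts bucket_indexes("test", 100, 2).inspect
```

Because every index comes from the same underlying hash, keys that interact badly with CRC32 can cluster, which is why the distribution depends on the data.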

Performance of the Bloom filter depends on a number of variables:

- size of the bit array

- size of the counter bucket

- number of hash functions
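To get a feel for how these variables interact, the standard textbook approximation for the false positive rate, p = (1 - e^(-kn/m))^k, can be evaluated directly. The sketch below is only that approximation; the helper names are hypothetical, and the counter bucket size does not appear in the formula.

```ruby
# Textbook approximation of the false positive rate for a filter with
# m buckets, n inserted elements and k hash functions.
def false_positive_rate(m, n, k)
  (1 - Math.exp(-k * n.to_f / m))**k
end

# Number of hash functions that minimises the rate for given m and n.
def optimal_hashes(m, n)
  (m.to_f / n * Math.log(2)).round
end

puts format("%.2f%%", false_positive_rate(10, 2, 2) * 100)  # => 10.87%
puts optimal_hashes(1000, 100)                               # => 7
```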

## Resources

- Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)

- Applications of and motivation for Bloom filters: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)

***

## MRI/C API Example

An MRI/C implementation that creates an in-memory filter, which can be saved to and reloaded from disk.

```ruby
require 'bloomfilter-rb'

bf = BloomFilter::Native.new(:size => 100, :hashes => 2, :seed => 1, :bucket => 3, :raise => false)
bf.insert("test")
bf.include?("test")     # => true
bf.include?("blah")     # => false

bf.delete("test")
bf.include?("test")     # => false

# Hash with a bloom filter!
bf["test2"] = "bar"
bf["test2"]             # => true
bf["test3"]             # => false

# Sample output; the figures shown are consistent with m = 10,
# not with the :size => 100 used above.
bf.stats
# => Number of filter bits (m): 10
# => Number of filter elements (n): 2
# => Number of filter hashes (k) : 2
# => Predicted false positive rate = 10.87%
```
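What makes `delete` possible is that each bucket holds a small counter rather than a single bit. The following pure-Ruby sketch illustrates that idea only; it is not the gem's C implementation. The class name `TinyCountingFilter` and the saturating-counter behaviour are assumptions made for the example.

```ruby
require 'zlib'

# Illustrative counting Bloom filter (hypothetical, not the gem's code).
class TinyCountingFilter
  def initialize(size:, hashes:, bucket_bits: 3)
    @size = size
    @hashes = hashes
    @max = (1 << bucket_bits) - 1      # largest value a counter can hold
    @counters = Array.new(size, 0)
  end

  # Increment each of the key's counters, saturating at @max.
  def insert(key)
    indexes(key).each { |i| @counters[i] += 1 if @counters[i] < @max }
  end

  # Decrement the key's counters, but only if the key appears to be present.
  def delete(key)
    return unless include?(key)
    indexes(key).each { |i| @counters[i] -= 1 if @counters[i] > 0 }
  end

  # A key may be present only if all of its counters are non-zero.
  def include?(key)
    indexes(key).all? { |i| @counters[i] > 0 }
  end

  private

  # Assumption: CRC32 seeded with 0..k-1, via Ruby's Zlib.
  def indexes(key)
    (0...@hashes).map { |seed| Zlib.crc32(key, seed) % @size }
  end
end

f = TinyCountingFilter.new(size: 100, hashes: 2)
f.insert("test")
puts f.include?("test")   # => true
f.delete("test")
puts f.include?("test")   # => false
```

A wider bucket lets a counter absorb more overlapping insertions before it saturates, at the cost of more memory per bucket.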

***

## Redis-backed setbit/getbit bloom filter

Uses [getbit](http://redis.io/commands/getbit)/[setbit](http://redis.io/commands/setbit) on Redis strings: efficient, fast, and shareable across multiple concurrent processes.

```ruby
require 'bloomfilter-rb'

bf = BloomFilter::Redis.new
bf.insert('test')
bf.include?('test')     # => true
bf.include?('blah')     # => false

bf.delete('test')
bf.include?('test')     # => false
```
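The way a filter maps onto `SETBIT`/`GETBIT` can be sketched as follows. This is a rough illustration under stated assumptions, not the gem's implementation. `BitFilterSketch` and `FakeBitStore` are hypothetical names. The fake store stands in for a Redis client so the example runs without a server; it mimics only the `setbit(key, offset, value)` and `getbit(key, offset)` calls that the redis gem's client also provides.

```ruby
require 'zlib'

# In-memory stand-in for a Redis client (hypothetical). With the redis gem
# you would pass a real client such as Redis.new instead.
class FakeBitStore
  def initialize
    @bits = Hash.new { |h, k| h[k] = {} }
  end

  # Like SETBIT: store the bit and return its previous value.
  def setbit(key, offset, value)
    previous = @bits[key].fetch(offset, 0)
    @bits[key][offset] = value
    previous
  end

  # Like GETBIT: unset offsets read as 0.
  def getbit(key, offset)
    @bits[key].fetch(offset, 0)
  end
end

# Non-counting filter stored as one bit string under a single key.
class BitFilterSketch
  def initialize(client, key: "bloom", size: 1000, hashes: 4)
    @client = client
    @key = key
    @size = size
    @hashes = hashes
  end

  def insert(item)
    offsets(item).each { |o| @client.setbit(@key, o, 1) }
  end

  def include?(item)
    offsets(item).all? { |o| @client.getbit(@key, o) == 1 }
  end

  private

  # Assumption: CRC32 seeded with 0..k-1, via Ruby's Zlib.
  def offsets(item)
    (0...@hashes).map { |seed| Zlib.crc32(item, seed) % @size }
  end
end

bf = BitFilterSketch.new(FakeBitStore.new)
bf.insert("test")
puts bf.include?("test")   # => true
```

The sketch leaves out `delete` on purpose: clearing bits in a non-counting filter can also clear bits shared with other items.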

### Memory footprint

- 1.0% error rate for 1M items, 10 bits per item: *2.5 MB*

- 1.0% error rate for 150M items, 10 bits per item: *358.52 MB*

- 0.1% error rate for 150M items, 15 bits per item: *537.33 MB*
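For comparison, the raw size of the bit array alone is simple arithmetic: items times bits per item, divided by 8 for bytes. The helper below is a hypothetical back-of-the-envelope calculation and does not model how Redis stores the string.

```ruby
# Raw bit-array size in MiB for a given item count and bits per item.
def raw_filter_mib(items, bits_per_item)
  items * bits_per_item / 8.0 / (1024 * 1024)
end

[[1_000_000, 10], [150_000_000, 10], [150_000_000, 15]].each do |items, bits|
  puts format("%d items x %d bits: %.2f MiB", items, bits, raw_filter_mib(items, bits))
end
```

This gives roughly 1.19, 178.81 and 268.22 MiB. The figures listed above are larger than these raw sizes, so treat the calculation as a lower bound rather than a prediction.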

***

## Redis-backed counting bloom filter with TTLs

Uses regular Redis get/set counters to implement a counting filter with optional TTL expiry. Because each "bit" requires its own key in Redis, this approach incurs a much larger memory overhead.

```ruby
require 'bloomfilter-rb'

bf = BloomFilter::CountingRedis.new(:ttl => 2)
bf.insert('test')
bf.include?('test')     # => true

sleep(2)
bf.include?('test')     # => false
```
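The expiry behaviour can be illustrated without Redis by keeping one counter per bucket together with an expiry time. This is a sketch under assumptions, not the gem's implementation: a Ruby hash stands in for the per-bucket Redis keys, the clock is injectable so the example does not need `sleep`, and each insert is assumed to refresh the bucket's TTL. `TtlCountingSketch` is a hypothetical name.

```ruby
require 'zlib'

# Counting filter whose buckets expire after `ttl` seconds (hypothetical).
class TtlCountingSketch
  def initialize(size: 1000, hashes: 4, ttl: nil, clock: -> { Time.now.to_f })
    @size = size
    @hashes = hashes
    @ttl = ttl
    @clock = clock
    @counters = {}          # bucket index => [count, expires_at or nil]
  end

  # Bump each bucket and (assumption) reset its expiry on every insert.
  def insert(item)
    now = @clock.call
    indexes(item).each do |i|
      @counters[i] = [live_count(i, now) + 1, @ttl && now + @ttl]
    end
  end

  def include?(item)
    now = @clock.call
    indexes(item).all? { |i| live_count(i, now) > 0 }
  end

  private

  # An expired or missing bucket counts as zero.
  def live_count(i, now)
    count, expires_at = @counters[i]
    return 0 if count.nil?
    return 0 if expires_at && now >= expires_at
    count
  end

  # Assumption: CRC32 seeded with 0..k-1, via Ruby's Zlib.
  def indexes(item)
    (0...@hashes).map { |seed| Zlib.crc32(item, seed) % @size }
  end
end

now = 0.0
bf = TtlCountingSketch.new(ttl: 2, clock: -> { now })
bf.insert("test")
puts bf.include?("test")   # => true
now += 2                   # simulate two seconds passing
puts bf.include?("test")   # => false
```

Storing a count and an expiry per bucket is what drives the larger memory overhead compared with a single bit string.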

## Credits

Tatsuya Mori  (Original C implementation: http://vald.x0.com/sb/)

## License

MIT License - Copyright (c) 2011 Ilya Grigorik
