https://github.com/gabrielspmoreira/cross_bloomfilter
Pure Python and Java compatible implementions of Bloom Filter
https://github.com/gabrielspmoreira/cross_bloomfilter
Last synced: 7 months ago
JSON representation
Pure Python and Java compatible implementions of Bloom Filter
- Host: GitHub
- URL: https://github.com/gabrielspmoreira/cross_bloomfilter
- Owner: gabrielspmoreira
- License: mit
- Created: 2016-07-08T17:20:37.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-07-08T20:43:28.000Z (over 9 years ago)
- Last Synced: 2025-02-01T04:41:47.611Z (9 months ago)
- Language: Java
- Homepage:
- Size: 276 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
## cross_bloomfilter
_cross_bloomfilter_ - Pure Python 3 and Java Bloom Filter compatible implementations
(https://en.wikipedia.org/wiki/Bloom_filter).## What's a Bloom Filter?
A Bloom filter is space-efficient Probabilistic Data Structure used to test whether an element is a member of a large set.
It can have false positives but never false negatives, which means a query returns either "possibly in set" or "definitely not in set".You can tune a Bloom filter to the desired error rate, it's basically a tradeoff between size and accuracy (See example: http://hur.st/bloomfilter). For example, a filter for about 100,000 keys, configured with a 0.5% false positive error rate can be compressed to 137 KB bytes, which is about 8% of the whole list of keys (with average length of 18 characters) (1,788 KB). With 5% error rate it can be compressed to just 18 KB.
## Why Cross Language compatible implementation?
The motivation for this implementation was the need to populate Bloom Filters in Python, serialize its bytes and load it within a Java Web Application.
You can dump/load the Bloom Filter Bytes from/to Java or Python 3.
p.s. the Java implementation was inspired in [inbloom](http://github.com/EverythingMe/inbloom) project.## Dump headers
This implementation provides utilities for serializing / deserializing Bloom filters so they can be stored in files/databases or sent over the network. It supports dumping options like gzip and base64 formats.
When you create a Bloom filter, you need to initialize it with parameters of expected cardinality and false positive rates. Those parameters are stored as a header when serializing the filter. It uses a 16 bit checksum as part of the header, to allow its consistency validation during deserialization.### Serialized filter structure:
| Field | Type | bits |
| ------------- |:-------------:| -----:|
| checksum | ushort | 16 |
| gzipped | byte | 8 |
| errorRate (1/N)| ushort | 16 |
| capacity | int | 32 |
| data | byte[] | ? |### Example Usage
#### Python
```python
from bloom_filter import BloomFilter#Populating Bloom Filter
bf = BloomFilter(capacity=100000, error=0.005)
bf.add('key_abc')
bf.add('key_def')#Check if keys are in the set
assert('key_abc' in bf)
assert('key_def' in bf)
assert('key_ghi' not in bf)#Serializing and deserializing Bloom Filter
dump_bytes = bloom_filter.dump(gzipped=True)
bf2 = BloomFilter.load(dump_bytes)#Check if keys are in the set
assert('key_abc' in bf2)
assert('key_def' in bf2)
assert('key_ghi' not in bf2)# Serializing to base64 (string) for storage in database or sending over an HTTP API
dump_base64 = bloom_filter.dump_to_base64_str(gzipped=True);```
#### Java
```java
import cross_bloomfilter.bloomfilter.impl.BloomFilter;
import cross_bloomfilter.bloomfilter.impl.BloomFilterFactory;// Popularing Bloom Filter
BloomFilter bloomFilter = new BloomFilterFactoryImpl().create(100000, 0.005);
bloomFilter.add('key_abc');
bloomFilter.add('key_def');// Check if keys are in the set
assert(bloomFilter.contains('key_abc'))
assert(bloomFilter.contains('key_def'))
assert(!bloomFilter.contains('key_ghi'))// Serializing and deserializing
byte[] dumpData = bloomFilter.dump(true);
BloomFilter bloomFilter2 = new BloomFilterFactoryImpl().load(dumpData);// Check if keys are in the set
assert(bloomFilter2.contains('key_abc'))
assert(bloomFilter2.contains('key_def'))
assert(!bloomFilter2.contains('key_ghi'))// Serializing to base64 (string) for storage in database or sending over an HTTP API
String dumpBase64 = bloomFilter.dumpToBase64(true);
```