https://github.com/ozgrakkurt/filterz
Probabilistic filter implementations. Ribbon, bloom, xor filters.
https://github.com/ozgrakkurt/filterz
bloom-filter probabilistic-data-structures ribbon xor zig ziglang
Last synced: about 1 year ago
JSON representation
Probabilistic filter implementations. Ribbon, bloom, xor filters.
- Host: GitHub
- URL: https://github.com/ozgrakkurt/filterz
- Owner: ozgrakkurt
- License: apache-2.0
- Created: 2024-10-28T18:49:48.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-02T22:28:08.000Z (over 1 year ago)
- Last Synced: 2025-03-28T03:32:42.196Z (about 1 year ago)
- Topics: bloom-filter, probabilistic-data-structures, ribbon, xor, zig, ziglang
- Language: Zig
- Homepage:
- Size: 79.1 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE
Awesome Lists containing this project
README
# filterz
Implementations of some probabilistic filter structures. Implemented with `build once, use many times` use case in mind.
All filters export a `Filter` interface that can be used like this:
```zig
const Filter = filterz.ribbon.Filter(u10);
var my_filter = try Filter.init(alloc, hashes);
defer my_filter.deinit();
for (hashes) |h| {
try std.testing.expect(filter.check(h));
}
```
Each filter also exports a lower level API that can be used to implement more advanced use cases like:
- Reducing bits-per-key while the program is running to meet some memory usage criteria.
- Loading only a part of a filter from disk and using it to query.
Requires Zig 0.14.0-dev release.
## Filters
### Split-Block-Bloom-Filter
Speed optimized version of a bloom filter.
As described in https://github.com/apache/parquet-format/blob/master/BloomFilter.md
### Xor (BinaryFuse) filter
As described in https://arxiv.org/abs/2201.01174
Construction is a bit janky but constructed filters reach slightly higher space efficiency. This is better in cases where construction is one time and the filter is used for a much longer time.
### Ribbon filter
As described in https://arxiv.org/abs/2103.02515
The implementation corresponds to the standard ribbon filter with "smash" as described in the paper.
## Benchmarks
1. Download Benchmark Data - instructions [here](bench-data/README.md)
2. Run benchmarks with:
```bash
make benchmark
```
[Example results](./bench_result_low_hit.txt)
NOTE: Cost estimate stat in the benchmark output is calculated by assuming every hit generates a disk read, which is priced at 50 microseconds.
Ribbon filter seems to be the best option when on a memory budget. Bloom filter is the king when there is no memory budget.
### TODO
- Implement [frayed ribbon based filter](https://github.com/bitwiseshiftleft/compressed_map)
- Implement [bumped ribbon filter](https://github.com/lorenzhs/BuRR)
- Implement interleaved columnar storage for ribbon filter as described in the paper.
- Implement mixing column sizes in ribbon filter storage to support fractional bits-per-key configurations e.g. 6.6 bits per key instead of 6 or 7.
- Improve Xor (Binary Fuse) filter construction speed by pre-sorting the hashes before filter construction as described in the paper.
- Improve Xor filter parameter selection, currently it is done by trial-error at construction time.