https://github.com/pmcgleenon/heavykeeper-py
Heavykeeper algorithm for Top-K elephant flows - python
https://github.com/pmcgleenon/heavykeeper-py
heavykeeper probabilistic-data-structures python rust sketch top-k
Last synced: 3 months ago
JSON representation
Heavykeeper algorithm for Top-K elephant flows - python
- Host: GitHub
- URL: https://github.com/pmcgleenon/heavykeeper-py
- Owner: pmcgleenon
- License: mit
- Created: 2025-06-06T18:25:22.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-15T05:17:10.000Z (6 months ago)
- Last Synced: 2025-12-16T08:13:00.074Z (6 months ago)
- Topics: heavykeeper, probabilistic-data-structures, python, rust, sketch, top-k
- Language: Python
- Homepage:
- Size: 1.23 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# heavykeeper
[](https://opensource.org/licenses/MIT)
[](https://www.python.org/downloads/)
Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.
## Overview
HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.
### Key Features
- ๐ **High Performance**: Rust-based implementation with Python bindings via PyO3
- ๐พ **Memory Efficient**: Uses probabilistic sketching to track millions of items with minimal memory
- ๐ฏ **Top-K Tracking**: Efficiently maintains the K most frequent items
- ๐ **Stream Processing**: Designed for continuous data streams
- ๐ **Approximate Counts**: Provides estimated frequencies with high accuracy
- ๐งช **Battle Tested**: Includes comprehensive benchmarks and tests
### Use Cases
- **Log Analysis**: Find the most frequent IP addresses, user agents, or error messages
- **Text Processing**: Identify the most common words in large documents
- **Network Monitoring**: Track heavy hitters in network traffic
- **Clickstream Analysis**: Find the most popular pages or user actions
- **Time Series Data**: Monitor frequently occurring events or anomalies
## Installation
### From Source (Development)
```bash
# Clone the repository
git clone https://github.com/pmcgleen/heavykeeper-py.git
cd heavykeeper-py
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build and install the Python package
maturin develop
# Or build a wheel
maturin build --release
```
### Requirements
- Python 3.11+
- Rust toolchain (for building from source)
## Quick Start
```python
from heavykeeper import HeavyKeeper
# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)
# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
hk.add(item)
# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")
# Get all top-K items
top_items = hk.list() # Returns list of (item, count) tuples
print("Top items:", top_items)
# Get as dictionary
top_dict = hk.get_topk() # Returns {item: count} dictionary
print("Top items dict:", top_dict)
```
## API Reference
### `HeavyKeeper(k, width, depth, decay)`
Creates a new HeavyKeeper instance.
**Parameters:**
- `k` (int): Number of top items to track
- `width` (int): Width of the sketch (number of buckets)
- `depth` (int): Depth of the sketch (number of hash functions)
- `decay` (float): Decay factor for aging items (between 0.0 and 1.0)
### Methods
#### `add(item: str) -> None`
Add an item to the sketch.
#### `query(item: str) -> bool`
Check if an item is being tracked in the top-K list.
#### `count(item: str) -> int`
Get the estimated count for an item (returns 0 if not tracked).
#### `list() -> List[Tuple[str, int]]`
Get the top-K items as a list of (item, count) tuples, sorted by count.
#### `get_topk() -> Dict[str, int]`
Get the top-K items as a dictionary mapping items to counts.
#### `len() -> int`
Get the current number of items being tracked.
#### `is_empty() -> bool`
Check if the sketch is empty.
## Benchmarking
The repository includes a simnple script for performance testing:
### Word Count Benchmark
```bash
# Basic benchmark with a text file
python benchmark_wordcount.py -k 10 -f data/war_and_peace.txt --time
```
## Parameter Tuning
### Choosing Parameters
- **k**: Set to the number of top items you need
- **width**: Larger values improve accuracy but use more memory (try 1024-8192)
- **depth**: More hash functions improve accuracy (try 4-16)
- **decay**: Controls how quickly old items are forgotten (0.8-0.99)
### Memory Usage
Approximate memory usage: `width ร depth ร 16 bytes + k ร (item_size + 16 bytes)`
For typical usage (width=2048, depth=8, k=100):
- Sketch: ~262 KB
- Top-K storage: ~depends on item sizes
### Accuracy vs Performance
- Higher `width` and `depth` โ better accuracy, more memory
- Lower `decay` โ faster adaptation to changes, less stability
- Higher `k` โ more items tracked, slightly more overhead
## Development
### Building
```bash
# Development build
maturin develop
# Release build
maturin build --release
# Build with debugging
maturin develop --debug
```
### Testing
```bash
# Run the test suite
python test_heavykeeper.py
# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txt
```
### Project Structure
```
heavykeeper-py/
โโโ src/
โ โโโ lib.rs # Rust implementation and Python bindings
โโโ benchmark_*.py # Performance benchmarks
โโโ test_heavykeeper.py # Test suite
โโโ Cargo.toml # Rust dependencies
โโโ pyproject.toml # Python package configuration
โโโ README.md # This file
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Based on the [HeavyKeeper](https://www.usenix.org/system/files/conference/atc18/atc18-gong.pdf) algorithm
- Built with [PyO3](https://pyo3.rs/) for Rust-Python interoperability
- Uses [Maturin](https://github.com/PyO3/maturin) for building Python extensions