{"id":34070911,"url":"https://github.com/pmcgleenon/heavykeeper-py","last_synced_at":"2026-03-07T12:08:15.456Z","repository":{"id":297675548,"uuid":"997557093","full_name":"pmcgleenon/heavykeeper-py","owner":"pmcgleenon","description":"Heavykeeper algorithm for Top-K elephant flows - python","archived":false,"fork":false,"pushed_at":"2025-12-15T05:17:10.000Z","size":1289,"stargazers_count":2,"open_issues_count":3,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-16T08:13:00.074Z","etag":null,"topics":["heavykeeper","probabilistic-data-structures","python","rust","sketch","top-k"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pmcgleenon.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-06T18:25:22.000Z","updated_at":"2025-12-11T19:13:52.000Z","dependencies_parsed_at":"2025-06-18T07:25:19.823Z","dependency_job_id":"af702819-5b06-47d6-b347-3ae584ab1f1f","html_url":"https://github.com/pmcgleenon/heavykeeper-py","commit_stats":null,"previous_names":["pmcgleenon/heavykeeper-py"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/pmcgleenon/heavykeeper-py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmcgleenon%2Fheavykeeper-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmcgleenon%2Fheavykeeper-py/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/pmcgleenon%2Fheavykeeper-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmcgleenon%2Fheavykeeper-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pmcgleenon","download_url":"https://codeload.github.com/pmcgleenon/heavykeeper-py/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmcgleenon%2Fheavykeeper-py/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30212506,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T09:02:10.694Z","status":"ssl_error","status_checked_at":"2026-03-07T09:02:08.429Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["heavykeeper","probabilistic-data-structures","python","rust","sketch","top-k"],"created_at":"2025-12-14T07:39:59.506Z","updated_at":"2026-03-07T12:08:15.419Z","avatar_url":"https://github.com/pmcgleenon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# heavykeeper\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n\nPython bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K 
most frequent items in data streams.\n\n## Overview\n\nHeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.\n\n### Key Features\n\n- 🚀 **High Performance**: Rust-based implementation with Python bindings via PyO3\n- 💾 **Memory Efficient**: Uses probabilistic sketching to track millions of items with minimal memory\n- 🎯 **Top-K Tracking**: Efficiently maintains the K most frequent items\n- 🔄 **Stream Processing**: Designed for continuous data streams\n- 📊 **Approximate Counts**: Provides estimated frequencies with high accuracy\n- 🧪 **Battle Tested**: Includes comprehensive benchmarks and tests\n\n### Use Cases\n\n- **Log Analysis**: Find the most frequent IP addresses, user agents, or error messages\n- **Text Processing**: Identify the most common words in large documents\n- **Network Monitoring**: Track heavy hitters in network traffic\n- **Clickstream Analysis**: Find the most popular pages or user actions\n- **Time Series Data**: Monitor frequently occurring events or anomalies\n\n## Installation\n\n### From Source (Development)\n\n```bash\n# Clone the repository\ngit clone https://github.com/pmcgleenon/heavykeeper-py.git\ncd heavykeeper-py\n\n# Install Rust (if not already installed)\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n\n# Build and install the Python package\nmaturin develop\n\n# Or build a wheel\nmaturin build --release\n```\n\n### Requirements\n\n- Python 3.11+\n- Rust toolchain (for building from source)\n\n## Quick Start\n\n```python\nfrom heavykeeper import HeavyKeeper\n\n# Create a HeavyKeeper instance\n# k=100: track top 100 items\n# width=2048: sketch width (affects accuracy)\n# depth=8: number of hash functions (affects accuracy)\n# decay=0.9: aging factor for old items\nhk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)\n\n# Add items to the 
stream\nitems = [\"apple\", \"banana\", \"apple\", \"cherry\", \"apple\", \"banana\"]\nfor item in items:\n    hk.add(item)\n\n# Query individual items\nprint(f\"Is 'apple' in top-K? {hk.query('apple')}\")\nprint(f\"Estimated count for 'apple': {hk.count('apple')}\")\n\n# Get all top-K items\ntop_items = hk.list()  # Returns list of (item, count) tuples\nprint(\"Top items:\", top_items)\n\n# Get as dictionary\ntop_dict = hk.get_topk()  # Returns {item: count} dictionary\nprint(\"Top items dict:\", top_dict)\n```\n\n## API Reference\n\n### `HeavyKeeper(k, width, depth, decay)`\n\nCreates a new HeavyKeeper instance.\n\n**Parameters:**\n- `k` (int): Number of top items to track\n- `width` (int): Width of the sketch (number of buckets)\n- `depth` (int): Depth of the sketch (number of hash functions)  \n- `decay` (float): Decay factor for aging items (between 0.0 and 1.0)\n\n### Methods\n\n#### `add(item: str) -\u003e None`\nAdd an item to the sketch.\n\n#### `query(item: str) -\u003e bool`\nCheck if an item is being tracked in the top-K list.\n\n#### `count(item: str) -\u003e int`\nGet the estimated count for an item (returns 0 if not tracked).\n\n#### `list() -\u003e List[Tuple[str, int]]`\nGet the top-K items as a list of (item, count) tuples, sorted by count.\n\n#### `get_topk() -\u003e Dict[str, int]`  \nGet the top-K items as a dictionary mapping items to counts.\n\n#### `len() -\u003e int`\nGet the current number of items being tracked.\n\n#### `is_empty() -\u003e bool`\nCheck if the sketch is empty.\n\n## Benchmarking\n\nThe repository includes a simple script for performance testing:\n\n### Word Count Benchmark\n\n```bash\n# Basic benchmark with a text file\npython benchmark_wordcount.py -k 10 -f data/war_and_peace.txt --time\n```\n\n## Parameter Tuning\n\n### Choosing Parameters\n\n- **k**: Set to the number of top items you need\n- **width**: Larger values improve accuracy but use more memory (try 1024-8192)\n- **depth**: More hash functions improve accuracy 
(try 4-16)  \n- **decay**: Controls how quickly old items are forgotten (0.8-0.99)\n\n### Memory Usage\n\nApproximate memory usage: `width × depth × 16 bytes + k × (item_size + 16 bytes)`\n\nFor typical usage (width=2048, depth=8, k=100):\n- Sketch: ~262 KB\n- Top-K storage: depends on item sizes\n\n### Accuracy vs Performance\n\n- Higher `width` and `depth` → better accuracy, more memory\n- Lower `decay` → faster adaptation to changes, less stability\n- Higher `k` → more items tracked, slightly more overhead\n\n## Development\n\n### Building\n\n```bash\n# Development build\nmaturin develop\n\n# Release build  \nmaturin build --release\n\n# Build with debugging\nmaturin develop --debug\n```\n\n### Testing\n\n```bash\n# Run the test suite\npython test_heavykeeper.py\n\n# Run benchmarks\npython benchmark_wordcount.py -k 10 -f test_file.txt\n```\n\n### Project Structure\n\n```\nheavykeeper-py/\n├── src/\n│   └── lib.rs          # Rust implementation and Python bindings\n├── benchmark_*.py      # Performance benchmarks\n├── test_heavykeeper.py # Test suite\n├── Cargo.toml          # Rust dependencies\n├── pyproject.toml      # Python package configuration\n└── README.md           # This file\n```\n\n## Contributing\n\nContributions are welcome! 
Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- Based on the [HeavyKeeper](https://www.usenix.org/system/files/conference/atc18/atc18-gong.pdf) algorithm \n- Built with [PyO3](https://pyo3.rs/) for Rust-Python interoperability\n- Uses [Maturin](https://github.com/PyO3/maturin) for building Python extensions\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmcgleenon%2Fheavykeeper-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmcgleenon%2Fheavykeeper-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmcgleenon%2Fheavykeeper-py/lists"}