{"id":33925987,"url":"https://github.com/bitswired/kiru","last_synced_at":"2026-03-17T23:38:17.391Z","repository":{"id":320437337,"uuid":"1081867627","full_name":"bitswired/kiru","owner":"bitswired","description":null,"archived":false,"fork":false,"pushed_at":"2025-11-11T22:08:03.000Z","size":2320,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-13T16:34:27.835Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitswired.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-23T12:04:24.000Z","updated_at":"2025-11-13T13:25:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"9dc22656-49ea-41e9-9b5d-969b233d0a0a","html_url":"https://github.com/bitswired/kiru","commit_stats":null,"previous_names":["bitswired/kiru"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/bitswired/kiru","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitswired%2Fkiru","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitswired%2Fkiru/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitswired%2Fkiru/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitswired%2Fkiru/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitswired","do
wnload_url":"https://codeload.github.com/bitswired/kiru/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitswired%2Fkiru/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30635282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-17T22:38:22.569Z","status":"ssl_error","status_checked_at":"2026-03-17T22:38:11.804Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-12T10:02:44.007Z","updated_at":"2026-03-17T23:38:17.353Z","avatar_url":"https://github.com/bitswired.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# kiru ⚡🗡️\n\n\u003e **Cut through text at the speed of light**\n\nThe fastest text chunking library for RAG applications. Available for both Rust and Python.\n\n[![Crates.io](https://img.shields.io/crates/v/kiru.svg)](https://crates.io/crates/kiru)\n[![PyPI](https://img.shields.io/pypi/v/kiru.svg)](https://pypi.org/project/kiru/)\n[![Documentation](https://docs.rs/kiru/badge.svg)](https://docs.rs/kiru)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n## What is kiru?\n\nkiru is a high-performance text chunking library designed for modern RAG (Retrieval-Augmented Generation) systems. 
When you need to split millions of documents for vector databases or process streaming data in real-time, kiru delivers unmatched speed without sacrificing correctness.\n\n### Key Features\n\n- **⚡ Blazing Fast (Python)**: 1000+ MB/s throughput for bytes, 300+ MB/s for characters\n- **🎯 UTF-8 Safe**: Never breaks multi-byte characters or emoji\n- **💾 Memory Efficient**: Stream gigabyte files with constant memory usage\n- **🚀 Parallel Processing**: Utilize all CPU cores automatically\n- **🔌 Multiple Sources**: Files, URLs, strings, and glob patterns\n- **🛠️ Flexible Strategies**: Chunk by bytes or characters\n- **🦀 Rust Core**: Rust performance and memory safety\n- **🐍 Python Bindings**: Pythonic API for ease of use\n\n## Performance\n\n**Benchmarked on 1MB text file, 1MB chunks, 1KB overlap:**\n\n| Implementation    | Strategy | Source | Time (ms) | Memory (MB) | Throughput (MB/s) |\n|-------------------|----------|--------|-----------|-------------|-------------------|\n| **kiru (Rust)**   | bytes    | string | 0.23      | -           | **4,370**         |\n| **kiru (Python)** | bytes    | string | 0.71      | 2.9         | **1,408**         |\n| **kiru (Python)** | chars    | string | 3.13      | 2.9         | **319**           |\n| LangChain         | chars    | string | 2,982     | 18.6        | 0.34              |\n\n**kiru is 4,000x faster than LangChain for byte chunking and 940x faster for character chunking!**\n\nKey insights:\n- **Rust native performance**: Up to 4,370 MB/s for byte chunking\n- **Python bindings overhead**: Still 1,400+ MB/s, beating all pure Python alternatives\n- **Character-aware chunking**: 300+ MB/s while respecting grapheme boundaries\n- **Memory efficient**: Uses 6x less memory than LangChain\n\n---\n\n## Quick Start\n\n### Python 🐍\n\n```bash\npip install kiru\n```\n\n```python\nfrom kiru import Chunker\n\n# Create a chunker\nchunker = Chunker.by_bytes(\n    chunk_size=1024,  # 1KB chunks\n    overlap=128       # 128 bytes 
overlap\n)\n\n# Chunk text\nchunks = chunker.on_string(\"Your text here...\").all()\n\n# Chunk files in parallel\nsources = [\"file://doc1.txt\", \"https://example.com/page\", \"glob://*.md\"]\nfor chunk in chunker.on_sources_par(sources):\n    process(chunk)\n```\n\n### Rust 🦀\n\nAdd to your `Cargo.toml`:\n\n```toml\n[dependencies]\nkiru = \"0.1\"\n```\n\n```rust\nuse kiru::{BytesChunker, Chunker};\n\n// Create a chunker\nlet chunker = BytesChunker::new(1024, 128)?;\n\n// Chunk text\nlet chunks: Vec\u003cString\u003e = chunker\n    .chunk_string(\"Your text here...\".to_string())\n    .collect();\n\n// Stream large files\nuse kiru::{Source, StreamType};\nlet stream = StreamType::from_source(\u0026Source::File(\"huge.txt\".to_string()))?;\nfor chunk in chunker.chunk_stream(stream) {\n    process(chunk);\n}\n```\n\n---\n\n## Use Cases\n\n### Building RAG Systems\n\n```python\n# Perfect for vector database ingestion\nchunker = Chunker.by_bytes(512, 50)  # Tuned for embedding models\n\ndocuments = [\"glob://knowledge_base/**/*.md\"]\nchunks = chunker.on_sources_par(documents, channel_size=10000)\n\nfor chunk in chunks:\n    embedding = model.encode(chunk)\n    vector_db.insert(chunk, embedding)\n```\n\n### Real-time Processing\n\n```python\n# Stream processing without memory overhead\nfor chunk in chunker.on_file(\"10GB_file.txt\"):\n    # Each chunk generated on-demand\n    send_to_queue(chunk)\n```\n\n### Parallel Document Processing\n\n```rust\n// Process hundreds of documents concurrently\nuse kiru::{ChunkerBuilder, ChunkerEnum};\n\nlet chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {\n    chunk_size: 4096,\n    overlap: 512,\n});\n\nlet sources = vec![\"glob://docs/**/*.txt\"];\nlet chunks = chunker.on_sources_par_stream(sources, 1000)?;\n```\n\n---\n\n## Chunking Strategies\n\n### Bytes Chunking\n- Splits on byte boundaries while respecting UTF-8\n- Fastest performance (4,000+ MB/s in Rust, 1,400+ MB/s in Python)\n- Ideal for token-limited models and 
consistent memory usage\n\n### Characters Chunking  \n- Splits on character (grapheme) boundaries\n- Ensures exact character counts regardless of byte representation\n- Perfect for character-limited APIs (300+ MB/s in Python)\n\n---\n\n## API Reference\n\n### Python API\n\n#### Creating Chunkers\n\n```python\nfrom kiru import Chunker\n\n# Byte-based chunking\nchunker = Chunker.by_bytes(chunk_size=1024, overlap=128)\n\n# Character-based chunking\nchunker = Chunker.by_characters(chunk_size=1000, overlap=100)\n```\n\n#### Input Sources\n\n```python\n# Single string\nchunks = chunker.on_string(\"text...\").all()\n\n# Single file\nchunks = chunker.on_file(\"/path/to/file.txt\").all()\n\n# HTTP/HTTPS URL\nchunks = chunker.on_http(\"https://example.com/page\").all()\n\n# Multiple sources (serial)\nsources = [\"file://doc1.txt\", \"https://example.com/page\", \"glob://*.md\"]\nchunks = chunker.on_sources(sources).all()\n\n# Multiple sources (parallel)\nchunks = chunker.on_sources_par(sources, channel_size=1000).all()\n\n# Or iterate lazily\nfor chunk in chunker.on_sources_par(sources):\n    process(chunk)\n```\n\n#### Source Prefixes\n\n- `file://path/to/file.txt` - Local files\n- `http://example.com` or `https://example.com` - URLs\n- `text://Inline text content` - Raw text strings\n- `glob://*.md` - Glob patterns\n- No prefix - Treated as raw text\n\n### Rust API\n\n#### Creating Chunkers\n\n```rust\nuse kiru::{BytesChunker, CharactersChunker, Chunker};\n\n// Byte-based chunking\nlet chunker = BytesChunker::new(1024, 128)?;\n\n// Character-based chunking\nlet chunker = CharactersChunker::new(1000, 100)?;\n```\n\n#### Basic Usage\n\n```rust\nuse kiru::Chunker;\n\n// Chunk a string\nlet chunks: Vec\u003cString\u003e = chunker\n    .chunk_string(\"Your text here\".to_string())\n    .collect();\n\n// Stream a file\nuse kiru::{Source, StreamType};\nlet stream = StreamType::from_source(\u0026Source::File(\"file.txt\".to_string()))?;\nfor chunk in chunker.chunk_stream(stream) 
{\n    // Process chunk\n}\n```\n\n#### Advanced Usage\n\n```rust\nuse kiru::{ChunkerBuilder, ChunkerEnum, Source, HigherOrderSource, SourceGenerator};\n\n// Create chunker with builder pattern\nlet chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {\n    chunk_size: 4096,\n    overlap: 512,\n});\n\n// Single source\nlet chunks = chunker.on_source(Source::File(\"doc.txt\".to_string()))?;\n\n// Multiple sources (serial)\nlet sources = vec![\n    Source::File(\"doc1.txt\".to_string()),\n    Source::Http(\"https://example.com\".to_string()),\n];\nlet chunks = chunker.on_sources(sources)?;\n\n// Multiple sources (parallel) - returns Vec\nlet chunks: Vec\u003cString\u003e = chunker.on_sources_par(sources)?;\n\n// Multiple sources (parallel streaming) - returns iterator\nlet chunks = chunker.on_sources_par_stream(sources, 1000)?;\nfor chunk in chunks {\n    // Process as they arrive\n}\n\n// Using glob patterns\nlet sources = vec![HigherOrderSource::SourceGenerator(\n    SourceGenerator::Glob(\"**/*.md\".to_string())\n)];\nlet flattened = HigherOrderSource::into_flattened_sources(sources)?;\n```\n\n---\n\n## Architecture\n\n```\n┌─────────────────────────────────────────┐\n│           Application Layer              │\n│     (Python or Rust Application)        │\n├─────────────────────────────────────────┤\n│          kiru-py (PyO3 Bindings)        │\n│              [Python only]               │\n├─────────────────────────────────────────┤\n│         kiru-core (Rust Library)        │\n│                                          │\n│        ┌──────────┬───────────┐         │\n│        │ Chunkers │ Streaming │         │  \n│        │  Engine  │   Engine  │         │\n│        └──────────┴───────────┘         │\n└─────────────────────────────────────────┘\n```\n\n---\n\n## Project Structure\n\n```\nkiru/\n├── README.md              # This file (shared documentation)\n├── kiru-core/             # Rust implementation\n│   ├── src/               # Core chunking algorithms\n│ 
  │   ├── bytes_chunker.rs\n│   │   ├── characters_chunker.rs\n│   │   ├── chunker.rs     # Builder pattern \u0026 parallel processing\n│   │   └── stream.rs      # File/HTTP streaming\n│   ├── benches/           # Criterion benchmarks\n│   └── tests/             # Property-based tests\n├── kiru-py/               # Python bindings (PyO3)\n│   ├── src/lib.rs         # Python wrapper\n│   └── python/            # Python tests \u0026 benchmarks\n└── utils/                 # Version management scripts\n```\n\n---\n\n## Streaming \u0026 Memory Efficiency\n\n**kiru's killer feature: true streaming with constant memory usage.**\n\nUnlike traditional chunkers that load entire files into memory, kiru processes data as it arrives using an intelligent buffering system. This means you can chunk **gigabyte-sized files** with minimal RAM usage.\n\n### How Streaming Works\n\n```\nFile/HTTP Source → Read Blocks (8KB) → UTF-8 Buffer → Chunk Iterator → Your Code\n                      ↓                      ↓\n                 As needed              Constant size\n```\n\n**Key advantages:**\n\n1. **Constant Memory**: Process 10GB files with ~10MB RAM\n2. **Immediate Results**: First chunks available instantly, no waiting for full file load\n3. **Works Everywhere**: Local files, HTTP/HTTPS streams, any data source\n4. 
**UTF-8 Safe**: Buffer maintains character boundaries automatically\n\n### Python Examples\n\n```python\nfrom kiru import Chunker\n\nchunker = Chunker.by_bytes(chunk_size=4096, overlap=512)\n\n# ⚡ Stream a 10GB file - uses only ~10MB RAM\nfor chunk in chunker.on_file(\"huge_dataset.txt\"):\n    # Process chunk immediately as it arrives\n    vector_db.insert(chunk)\n    # No waiting, no memory explosion!\n\n# ⚡ Stream from HTTP - process as data downloads\nfor chunk in chunker.on_http(\"https://example.com/large_document.txt\"):\n    process(chunk)\n    # Chunks ready while download continues\n\n# ⚡ Stream multiple sources in parallel\nsources = [\n    \"file://10gb_file1.txt\",\n    \"https://example.com/doc.txt\",\n    \"file://10gb_file2.txt\"\n]\nfor chunk in chunker.on_sources_par(sources, channel_size=1000):\n    # All sources stream in parallel\n    # Memory stays constant regardless of file sizes\n    send_to_queue(chunk)\n```\n\n### Rust Examples\n\n```rust\nuse kiru::{BytesChunker, Chunker, Source, StreamType};\n\nlet chunker = BytesChunker::new(4096, 512)?;\n\n// ⚡ Stream a massive file with constant memory\nlet stream = StreamType::from_source(\u0026Source::File(\"10gb_file.txt\".to_string()))?;\nfor chunk in chunker.chunk_stream(stream) {\n    // Process immediately, no memory buildup\n    vector_db.insert(chunk);\n}\n\n// ⚡ Stream from HTTP as data arrives\nlet stream = StreamType::from_source(\u0026Source::Http(\"https://example.com/doc.txt\".to_string()))?;\nfor chunk in chunker.chunk_stream(stream) {\n    process(chunk);\n}\n```\n\n### Memory Comparison\n\nProcessing a **1GB file** with 4KB chunks:\n\n| Library    | Memory Usage | Loads Full File? | Streaming? 
|\n|------------|--------------|------------------|------------|\n| **kiru**   | **~10 MB**   | ❌ No            | ✅ Yes     |\n| LangChain  | **1000+ MB** | ✅ Yes           | ❌ No      |\n| tiktoken   | **1000+ MB** | ✅ Yes           | ❌ No      |\n\n**Result**: kiru uses **100x less memory** while being **4,000x faster**!\n\n---\n\n## Development\n\n### Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/bitswired/kiru.git\ncd kiru\n\n# Run all tests\ncargo test --workspace\n\n# Run Rust benchmarks\ncd kiru-core\ncargo bench\n\n# Build Python package\ncd ../kiru-py\npip install maturin\nmaturin develop --release\n\n# Run Python tests\npip install pytest hypothesis\npytest python/test.py\n\n# Run Python benchmarks\npython python/bench.py\n```\n\n### Running Benchmarks\n\n```bash\n# Rust benchmarks\ncd kiru-core\ncargo bench\n\n# Python benchmarks\ncd ../kiru-py\npython python/bench.py\n```\n\n---\n\n## Performance Tips\n\n1. **Use byte chunking** for maximum throughput (1000+ MB/s)\n2. **Use character chunking** when exact character counts matter (300+ MB/s)\n3. **Enable parallel processing** with `on_sources_par()` for multiple files\n4. **Tune chunk size** based on your embedding model's context window\n5. **Adjust overlap** to balance context preservation and storage\n6. **Stream large files** to maintain constant memory usage\n\n---\n\n## Why \"kiru\"?\n\n\"Kiru\" (切る) is Japanese for \"to cut\" - reflecting the library's purpose of cutting text into chunks at lightning speed ⚡🗡️\n\n---\n\n## Contributing\n\nWe welcome contributions! 
Please check out our [Contributing Guide](CONTRIBUTING.md) for guidelines.\n\n## License\n\nMIT License - see [LICENSE](LICENSE) for details.\n\n---\n\n## Credits\n\nBuilt with:\n- [PyO3](https://pyo3.rs) - Rust bindings for Python\n- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism for Rust\n- [maturin](https://www.maturin.rs) - Build and publish Rust Python extensions\n\n---\n\n**Ready to cut through text at the speed of light?**\n\n- 🐍 **Python**: `pip install kiru`\n- 🦀 **Rust**: Add `kiru = \"0.1\"` to Cargo.toml\n\nGet started with [PyPI](https://pypi.org/project/kiru/) | [Crates.io](https://crates.io/crates/kiru) | [Documentation](https://docs.rs/kiru)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitswired%2Fkiru","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitswired%2Fkiru","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitswired%2Fkiru/lists"}