An open API service indexing awesome lists of open source software.

https://github.com/bitswired/kiru


https://github.com/bitswired/kiru

Last synced: about 2 months ago
JSON representation

Awesome Lists containing this project

README

          

# kiru ⚡🗡️

> **Cut through text at the speed of light**

The fastest text chunking library for RAG applications. Available for both Rust and Python.

[![Crates.io](https://img.shields.io/crates/v/kiru.svg)](https://crates.io/crates/kiru)
[![PyPI](https://img.shields.io/pypi/v/kiru.svg)](https://pypi.org/project/kiru/)
[![Documentation](https://docs.rs/kiru/badge.svg)](https://docs.rs/kiru)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## What is kiru?

kiru is a high-performance text chunking library designed for modern RAG (Retrieval-Augmented Generation) systems. When you need to split millions of documents for vector databases or process streaming data in real-time, kiru delivers unmatched speed without sacrificing correctness.

### Key Features

- **⚡ Blazing Fast (Python)**: 1000+ MB/s throughput for bytes, 300+ MB/s for characters
- **🎯 UTF-8 Safe**: Never breaks multi-byte characters or emoji
- **💾 Memory Efficient**: Stream gigabyte files with constant memory usage
- **🚀 Parallel Processing**: Utilize all CPU cores automatically
- **🔌 Multiple Sources**: Files, URLs, strings, and glob patterns
- **🛠️ Flexible Strategies**: Chunk by bytes or characters
- **🦀 Rust Core**: Rust performance and memory safety
- **🐍 Python Bindings**: Pythonic API for ease of use

## Performance

**Benchmarked on 1MB text file, 1MB chunks, 1KB overlap:**

| Implementation | Strategy | Source | Time (ms) | Memory (MB) | Throughput (MB/s) |
|-------------------|----------|--------|-----------|-------------|-------------------|
| **kiru (Rust)** | bytes | string | 0.23 | - | **4,370** |
| **kiru (Python)** | bytes | string | 0.71 | 2.9 | **1,408** |
| **kiru (Python)** | chars | string | 3.13 | 2.9 | **319** |
| LangChain | chars | string | 2,982 | 18.6 | 0.34 |

**kiru is 4,000x faster than LangChain for byte chunking and 940x faster for character chunking!**

Key insights:
- **Rust native performance**: Up to 4,370 MB/s for byte chunking
- **Python bindings overhead**: Still 1,400+ MB/s, beating all pure Python alternatives
- **Character-aware chunking**: 300+ MB/s while respecting grapheme boundaries
- **Memory efficient**: Uses 6x less memory than LangChain

---

## Quick Start

### Python 🐍

```bash
pip install kiru
```

```python
from kiru import Chunker

# Create a chunker
chunker = Chunker.by_bytes(
chunk_size=1024, # 1KB chunks
overlap=128 # 128 bytes overlap
)

# Chunk text
chunks = chunker.on_string("Your text here...").all()

# Chunk files in parallel
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
for chunk in chunker.on_sources_par(sources):
process(chunk)
```

### Rust 🦀

Add to your `Cargo.toml`:

```toml
[dependencies]
kiru = "0.1"
```

```rust
use kiru::{BytesChunker, Chunker};

// Create a chunker
let chunker = BytesChunker::new(1024, 128)?;

// Chunk text
let chunks: Vec = chunker
.chunk_string("Your text here...".to_string())
.collect();

// Stream large files
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("huge.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
process(chunk);
}
```

---

## Use Cases

### Building RAG Systems

```python
# Perfect for vector database ingestion
chunker = Chunker.by_bytes(512, 50) # Tuned for embedding models

documents = ["glob://knowledge_base/**/*.md"]
chunks = chunker.on_sources_par(documents, channel_size=10000)

for chunk in chunks:
embedding = model.encode(chunk)
vector_db.insert(chunk, embedding)
```

### Real-time Processing

```python
# Stream processing without memory overhead
for chunk in chunker.on_file("10GB_file.txt"):
# Each chunk generated on-demand
send_to_queue(chunk)
```

### Parallel Document Processing

```rust
// Process hundreds of documents concurrently
use kiru::{ChunkerBuilder, ChunkerEnum};

let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
chunk_size: 4096,
overlap: 512,
});

let sources = vec!["glob://docs/**/*.txt"];
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
```

---

## Chunking Strategies

### Bytes Chunking
- Splits on byte boundaries while respecting UTF-8
- Fastest performance (1000+ MB/s in Rust, 1400+ MB/s in Python)
- Ideal for token-limited models and consistent memory usage

### Characters Chunking
- Splits on character (grapheme) boundaries
- Ensures exact character counts regardless of byte representation
- Perfect for character-limited APIs (300+ MB/s in Python)

---

## API Reference

### Python API

#### Creating Chunkers

```python
from kiru import Chunker

# Byte-based chunking
chunker = Chunker.by_bytes(chunk_size=1024, overlap=128)

# Character-based chunking
chunker = Chunker.by_characters(chunk_size=1000, overlap=100)
```

#### Input Sources

```python
# Single string
chunks = chunker.on_string("text...").all()

# Single file
chunks = chunker.on_file("/path/to/file.txt").all()

# HTTP/HTTPS URL
chunks = chunker.on_http("https://example.com/page").all()

# Multiple sources (serial)
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
chunks = chunker.on_sources(sources).all()

# Multiple sources (parallel)
chunks = chunker.on_sources_par(sources, channel_size=1000).all()

# Or iterate lazily
for chunk in chunker.on_sources_par(sources):
process(chunk)
```

#### Source Prefixes

- `file://path/to/file.txt` - Local files
- `http://example.com` or `https://example.com` - URLs
- `text://Inline text content` - Raw text strings
- `glob://*.md` - Glob patterns
- No prefix - Treated as raw text

### Rust API

#### Creating Chunkers

```rust
use kiru::{BytesChunker, CharactersChunker, Chunker};

// Byte-based chunking
let chunker = BytesChunker::new(1024, 128)?;

// Character-based chunking
let chunker = CharactersChunker::new(1000, 100)?;
```

#### Basic Usage

```rust
use kiru::Chunker;

// Chunk a string
let chunks: Vec = chunker
.chunk_string("Your text here".to_string())
.collect();

// Stream a file
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
// Process chunk
}
```

#### Advanced Usage

```rust
use kiru::{ChunkerBuilder, ChunkerEnum, Source, HigherOrderSource, SourceGenerator};

// Create chunker with builder pattern
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
chunk_size: 4096,
overlap: 512,
});

// Single source
let chunks = chunker.on_source(Source::File("doc.txt".to_string()))?;

// Multiple sources (serial)
let sources = vec![
Source::File("doc1.txt".to_string()),
Source::Http("https://example.com".to_string()),
];
let chunks = chunker.on_sources(sources)?;

// Multiple sources (parallel) - returns Vec
let chunks: Vec = chunker.on_sources_par(sources)?;

// Multiple sources (parallel streaming) - returns iterator
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
for chunk in chunks {
// Process as they arrive
}

// Using glob patterns
let sources = vec![HigherOrderSource::SourceGenerator(
SourceGenerator::Glob("**/*.md".to_string())
)];
let flattened = HigherOrderSource::into_flattened_sources(sources)?;
```

---

## Architecture

```
┌─────────────────────────────────────────┐
│ Application Layer │
│ (Python or Rust Application) │
├─────────────────────────────────────────┤
│ kiru-py (PyO3 Bindings) │
│ [Python only] │
├─────────────────────────────────────────┤
│ kiru-core (Rust Library) │
│ │
│ ┌──────────┬───────────┐ │
│ │ Chunkers │ Streaming │ │
│ │ Engine │ Engine │ │
│ └──────────┴───────────┘ │
└─────────────────────────────────────────┘
```

---

## Project Structure

```
kiru/
├── README.md # This file (shared documentation)
├── kiru-core/ # Rust implementation
│ ├── src/ # Core chunking algorithms
│ │ ├── bytes_chunker.rs
│ │ ├── characters_chunker.rs
│ │ ├── chunker.rs # Builder pattern & parallel processing
│ │ └── stream.rs # File/HTTP streaming
│ ├── benches/ # Criterion benchmarks
│ └── tests/ # Property-based tests
├── kiru-py/ # Python bindings (PyO3)
│ ├── src/lib.rs # Python wrapper
│ └── python/ # Python tests & benchmarks
└── utils/ # Version management scripts
```

---

## Streaming & Memory Efficiency

**kiru's killer feature: true streaming with constant memory usage.**

Unlike traditional chunkers that load entire files into memory, kiru processes data as it arrives using an intelligent buffering system. This means you can chunk **gigabyte-sized files** with minimal RAM usage.

### How Streaming Works

```
File/HTTP Source → Read Blocks (8KB) → UTF-8 Buffer → Chunk Iterator → Your Code
↓ ↓
As needed Constant size
```

**Key advantages:**

1. **Constant Memory**: Process 10GB files with ~10MB RAM
2. **Immediate Results**: First chunks available instantly, no waiting for full file load
3. **Works Everywhere**: Local files, HTTP/HTTPS streams, any data source
4. **UTF-8 Safe**: Buffer maintains character boundaries automatically

### Python Examples

```python
from kiru import Chunker

chunker = Chunker.by_bytes(chunk_size=4096, overlap=512)

# ⚡ Stream a 10GB file - uses only ~10MB RAM
for chunk in chunker.on_file("huge_dataset.txt"):
# Process chunk immediately as it arrives
vector_db.insert(chunk)
# No waiting, no memory explosion!

# ⚡ Stream from HTTP - process as data downloads
for chunk in chunker.on_http("https://example.com/large_document.txt"):
process(chunk)
# Chunks ready while download continues

# ⚡ Stream multiple sources in parallel
sources = [
"file://10gb_file1.txt",
"https://example.com/doc.txt",
"file://10gb_file2.txt"
]
for chunk in chunker.on_sources_par(sources, channel_size=1000):
# All sources stream in parallel
# Memory stays constant regardless of file sizes
send_to_queue(chunk)
```

### Rust Examples

```rust
use kiru::{BytesChunker, Chunker, Source, StreamType};

let chunker = BytesChunker::new(4096, 512)?;

// ⚡ Stream a massive file with constant memory
let stream = StreamType::from_source(&Source::File("10gb_file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
// Process immediately, no memory buildup
vector_db.insert(chunk);
}

// ⚡ Stream from HTTP as data arrives
let stream = StreamType::from_source(&Source::Http("https://example.com/doc.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
process(chunk);
}
```

### Memory Comparison

Processing a **1GB file** with 4KB chunks:

| Library | Memory Usage | Loads Full File? | Streaming? |
|------------|--------------|------------------|------------|
| **kiru** | **~10 MB** | ❌ No | ✅ Yes |
| LangChain | **1000+ MB** | ✅ Yes | ❌ No |
| tiktoken | **1000+ MB** | ✅ Yes | ❌ No |

**Result**: kiru uses **100x less memory** while being **4,000x faster**!

---

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/yourusername/kiru.git
cd kiru

# Run all tests
cargo test --workspace

# Run Rust benchmarks
cd kiru-core
cargo bench

# Build Python package
cd ../kiru-py
pip install maturin
maturin develop --release

# Run Python tests
pip install pytest hypothesis
pytest python/test.py

# Run Python benchmarks
python python/bench.py
```

### Running Benchmarks

```bash
# Rust benchmarks
cd kiru-core
cargo bench

# Python benchmarks
cd kiru-py
python python/bench.py
```

---

## Performance Tips

1. **Use byte chunking** for maximum throughput (1000+ MB/s)
2. **Use character chunking** when exact character counts matter (300+ MB/s)
3. **Enable parallel processing** with `on_sources_par()` for multiple files
4. **Tune chunk size** based on your embedding model's context window
5. **Adjust overlap** to balance context preservation and storage
6. **Stream large files** to maintain constant memory usage

---

## Why "kiru"?

"Kiru" (切る) is Japanese for "to cut" - reflecting the library's purpose of cutting text into chunks at lightning speed ⚡🗡️

---

## Contributing

We welcome contributions! Please check out our [Contributing Guide](CONTRIBUTING.md) for guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Credits

Built with:
- [PyO3](https://pyo3.rs) - Rust bindings for Python
- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism for Rust
- [maturin](https://www.maturin.rs) - Build and publish Rust Python extensions

---

**Ready to cut through text at the speed of light?**

- 🐍 **Python**: `pip install kiru`
- 🦀 **Rust**: Add `kiru = "0.1"` to Cargo.toml

Get started with [PyPI](https://pypi.org/project/kiru/) | [Crates.io](https://crates.io/crates/kiru) | [Documentation](https://docs.rs/kiru)