https://github.com/bitswired/kiru

Host: GitHub
URL: https://github.com/bitswired/kiru
Owner: bitswired
Created: 2025-10-23T12:04:24.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-11-11T22:08:03.000Z (7 months ago)
Last Synced: 2025-12-13T16:34:27.835Z (6 months ago)
Language: Rust
Size: 2.21 MB
Stars: 8
Watchers: 0
Forks: 0
Open Issues: 0

# kiru ⚡🗡️

> **Cut through text at the speed of light**

The fastest text chunking library for RAG applications. Available for both Rust and Python.

[![Crates.io](https://img.shields.io/crates/v/kiru.svg)](https://crates.io/crates/kiru)

[![PyPI](https://img.shields.io/pypi/v/kiru.svg)](https://pypi.org/project/kiru/)

[![Documentation](https://docs.rs/kiru/badge.svg)](https://docs.rs/kiru)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## What is kiru?

kiru is a high-performance text chunking library designed for modern RAG (Retrieval-Augmented Generation) systems. When you need to split millions of documents for vector databases or process streaming data in real-time, kiru delivers unmatched speed without sacrificing correctness.

### Key Features

- **⚡ Blazing Fast (Python)**: 1000+ MB/s throughput for bytes, 300+ MB/s for characters

- **🎯 UTF-8 Safe**: Never breaks multi-byte characters or emoji

- **💾 Memory Efficient**: Stream gigabyte files with constant memory usage

- **🚀 Parallel Processing**: Utilize all CPU cores automatically

- **🔌 Multiple Sources**: Files, URLs, strings, and glob patterns

- **🛠️ Flexible Strategies**: Chunk by bytes or characters

- **🦀 Rust Core**: Rust performance and memory safety

- **🐍 Python Bindings**: Pythonic API for ease of use

## Performance

**Benchmarked on 1MB text file, 1MB chunks, 1KB overlap:**

| Implementation    | Strategy | Source | Time (ms) | Memory (MB) | Throughput (MB/s) |
|-------------------|----------|--------|-----------|-------------|-------------------|
| **kiru (Rust)**   | bytes    | string | 0.23      | -           | **4,370**         |
| **kiru (Python)** | bytes    | string | 0.71      | 2.9         | **1,408**         |
| **kiru (Python)** | chars    | string | 3.13      | 2.9         | **319**           |
| LangChain         | chars    | string | 2,982     | 18.6        | 0.34              |

**kiru's Python bindings are roughly 4,000x faster than LangChain's character splitter when chunking by bytes, and about 940x faster when chunking by characters!**

Key insights:

- **Rust native performance**: Up to 4,370 MB/s for byte chunking

- **Python bindings overhead**: Still 1,400+ MB/s, beating all pure Python alternatives

- **Character-aware chunking**: 300+ MB/s while respecting grapheme boundaries

- **Memory efficient**: Uses 6x less memory than LangChain
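The ratios quoted in this section follow from the throughput and memory columns of the benchmark table. A small sketch that recomputes them; the dictionary keys are illustrative labels chosen here, not part of kiru:

```python
# Figures copied from the benchmark table (throughput in MB/s, memory in MB).
throughput = {
    "kiru_rust_bytes": 4370.0,
    "kiru_python_bytes": 1408.0,
    "kiru_python_chars": 319.0,
    "langchain_chars": 0.34,
}
memory_mb = {"kiru_python": 2.9, "langchain": 18.6}


def speedup(fast: str, slow: str) -> float:
    """Ratio of two throughput figures from the table."""
    return throughput[fast] / throughput[slow]


bytes_speedup = speedup("kiru_python_bytes", "langchain_chars")
chars_speedup = speedup("kiru_python_chars", "langchain_chars")
memory_ratio = memory_mb["langchain"] / memory_mb["kiru_python"]

print(f"bytes vs LangChain: {bytes_speedup:,.0f}x")   # 4,141x
print(f"chars vs LangChain: {chars_speedup:,.0f}x")   # 938x
print(f"memory ratio:       {memory_ratio:.1f}x")     # 6.4x
```

Note that the only LangChain row in the table uses character chunking, so both speedups are measured against that single baseline.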

---

## Quick Start

### Python 🐍

```bash
pip install kiru
```

```python
from kiru import Chunker

# Create a chunker
chunker = Chunker.by_bytes(
    chunk_size=1024,  # 1KB chunks
    overlap=128       # 128 bytes overlap
)

# Chunk text
chunks = chunker.on_string("Your text here...").all()

# Chunk files in parallel
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
for chunk in chunker.on_sources_par(sources):
    process(chunk)
```

### Rust 🦀

Add to your `Cargo.toml`:

```toml
[dependencies]
kiru = "0.1"
```

```rust
use kiru::{BytesChunker, Chunker};

// Create a chunker
let chunker = BytesChunker::new(1024, 128)?;

// Chunk text (element type is inferred from the iterator)
let chunks: Vec<_> = chunker
    .chunk_string("Your text here...".to_string())
    .collect();

// Stream large files
use kiru::{Source, StreamType};

let stream = StreamType::from_source(&Source::File("huge.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    process(chunk);
}
```

---

## Use Cases

### Building RAG Systems

```python
# Perfect for vector database ingestion
chunker = Chunker.by_bytes(512, 50)  # Tuned for embedding models

documents = ["glob://knowledge_base/**/*.md"]
chunks = chunker.on_sources_par(documents, channel_size=10000)

for chunk in chunks:
    embedding = model.encode(chunk)
    vector_db.insert(chunk, embedding)
```

### Real-time Processing

```python
# Stream processing without memory overhead
for chunk in chunker.on_file("10GB_file.txt"):
    # Each chunk generated on-demand
    send_to_queue(chunk)
```

### Parallel Document Processing

```rust
// Process hundreds of documents concurrently
use kiru::{ChunkerBuilder, ChunkerEnum};

let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 4096,
    overlap: 512,
});

let sources = vec!["glob://docs/**/*.txt"];
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
```

---

## Chunking Strategies

### Bytes Chunking

- Splits on byte boundaries while respecting UTF-8

- Fastest performance (4,000+ MB/s in Rust, 1,400+ MB/s in Python)

- Ideal for token-limited models and consistent memory usage
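To make "splits on byte boundaries while respecting UTF-8" concrete, here is a minimal pure-Python sketch of the idea. It is not kiru's implementation: the function name `chunk_bytes`, the back-off rule, and the requirement `chunk_size >= 4` are assumptions made for this illustration. The key observation is that UTF-8 continuation bytes always match the bit pattern `10xxxxxx`, so a cut point can be moved until it no longer lands on one.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """True if index i does not fall inside a multi-byte UTF-8 sequence."""
    return i == len(data) or (data[i] & 0xC0) != 0x80


def chunk_bytes(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size bytes (illustrative only)."""
    if chunk_size < 4 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 4 and 0 <= overlap < chunk_size")
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        while not is_boundary(data, end):
            end -= 1                      # back off so no character is split
        chunks.append(data[start:end].decode("utf-8"))
        if end == len(data):
            break
        start = max(end - overlap, start + 1)
        while not is_boundary(data, start):
            start += 1                    # the overlap must also start on a boundary
    return chunks


print(chunk_bytes("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Because the cut point only ever moves backward by at most three bytes, a chunk can be slightly shorter than `chunk_size` but never longer, and every chunk decodes cleanly.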

### Characters Chunking  

- Splits on character (grapheme) boundaries

- Ensures exact character counts regardless of byte representation

- Perfect for character-limited APIs (300+ MB/s in Python)
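The difference from byte chunking is easiest to see on non-ASCII text. The sketch below is a simplified stand-in, not kiru's algorithm: it counts Python code points, whereas the list above describes grapheme boundaries, which would need a Unicode segmentation library. The name `chunk_chars` is hypothetical.

```python
def chunk_chars(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of exactly chunk_size code points (last may be shorter)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                         # this chunk already reached the end
        start += stride
    return chunks


japanese = "日本語テキスト"               # 7 characters, 21 bytes in UTF-8
for chunk in chunk_chars(japanese, chunk_size=3, overlap=0):
    print(chunk, len(chunk), "chars,", len(chunk.encode("utf-8")), "bytes")
```

Each full chunk here holds three characters but nine bytes, which is why a character limit and a byte limit give different results on the same input.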

---

## API Reference

### Python API

#### Creating Chunkers

```python
from kiru import Chunker

# Byte-based chunking
chunker = Chunker.by_bytes(chunk_size=1024, overlap=128)

# Character-based chunking
chunker = Chunker.by_characters(chunk_size=1000, overlap=100)
```

#### Input Sources

```python
# Single string
chunks = chunker.on_string("text...").all()

# Single file
chunks = chunker.on_file("/path/to/file.txt").all()

# HTTP/HTTPS URL
chunks = chunker.on_http("https://example.com/page").all()

# Multiple sources (serial)
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
chunks = chunker.on_sources(sources).all()

# Multiple sources (parallel)
chunks = chunker.on_sources_par(sources, channel_size=1000).all()

# Or iterate lazily
for chunk in chunker.on_sources_par(sources):
    process(chunk)
```

#### Source Prefixes

- `file://path/to/file.txt` - Local files

- `http://example.com` or `https://example.com` - URLs

- `text://Inline text content` - Raw text strings

- `glob://*.md` - Glob patterns

- No prefix - Treated as raw text
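The prefix rules above can be read as a small dispatch table. The following sketch mirrors them in plain Python; `classify_source` and the returned labels are hypothetical names for illustration, and kiru's actual parsing may differ in details such as how the path after `file://` is normalized.

```python
def classify_source(spec: str) -> tuple[str, str]:
    """Map a source string to (kind, value) following the documented prefixes."""
    for prefix, kind in (("file://", "file"), ("glob://", "glob"), ("text://", "text")):
        if spec.startswith(prefix):
            return kind, spec[len(prefix):]
    if spec.startswith(("http://", "https://")):
        return "http", spec               # keep the full URL, scheme included
    return "text", spec                   # no prefix: treated as raw text


for spec in ["file://doc1.txt", "https://example.com/page", "glob://*.md", "plain words"]:
    print(spec, "->", classify_source(spec))
```

One practical consequence of the last rule: a mistyped prefix is not an error, the string is simply chunked as literal text.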

### Rust API

#### Creating Chunkers

```rust
use kiru::{BytesChunker, CharactersChunker, Chunker};

// Byte-based chunking
let chunker = BytesChunker::new(1024, 128)?;

// Character-based chunking
let chunker = CharactersChunker::new(1000, 100)?;
```

#### Basic Usage

```rust
use kiru::Chunker;

// Chunk a string (element type is inferred from the iterator)
let chunks: Vec<_> = chunker
    .chunk_string("Your text here".to_string())
    .collect();

// Stream a file
use kiru::{Source, StreamType};

let stream = StreamType::from_source(&Source::File("file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    // Process chunk
}
```

#### Advanced Usage

```rust
use kiru::{ChunkerBuilder, ChunkerEnum, Source, HigherOrderSource, SourceGenerator};

// Create chunker with builder pattern
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 4096,
    overlap: 512,
});

// Single source
let chunks = chunker.on_source(Source::File("doc.txt".to_string()))?;

// Multiple sources (serial)
let sources = vec![
    Source::File("doc1.txt".to_string()),
    Source::Http("https://example.com".to_string()),
];
let chunks = chunker.on_sources(sources)?;

// The two calls below are alternatives to the serial call above.
// If these methods take `sources` by value, rebuild the vector before each call.

// Multiple sources (parallel) - returns Vec
let chunks: Vec<_> = chunker.on_sources_par(sources)?;

// Multiple sources (parallel streaming) - returns iterator
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
for chunk in chunks {
    // Process as they arrive
}

// Using glob patterns
let sources = vec![HigherOrderSource::SourceGenerator(
    SourceGenerator::Glob("**/*.md".to_string())
)];
let flattened = HigherOrderSource::into_flattened_sources(sources)?;
```

---

## Architecture

```
┌─────────────────────────────────────────┐
│            Application Layer            │
│      (Python or Rust Application)       │
├─────────────────────────────────────────┤
│         kiru-py (PyO3 Bindings)         │
│              [Python only]              │
├─────────────────────────────────────────┤
│        kiru-core (Rust Library)         │
│                                         │
│        ┌──────────┬───────────┐         │
│        │ Chunkers │ Streaming │         │
│        │  Engine  │   Engine  │         │
│        └──────────┴───────────┘         │
└─────────────────────────────────────────┘
```

---

## Project Structure

```
kiru/
├── README.md              # This file (shared documentation)
├── kiru-core/             # Rust implementation
│   ├── src/               # Core chunking algorithms
│   │   ├── bytes_chunker.rs
│   │   ├── characters_chunker.rs
│   │   ├── chunker.rs     # Builder pattern & parallel processing
│   │   └── stream.rs      # File/HTTP streaming
│   ├── benches/           # Criterion benchmarks
│   └── tests/             # Property-based tests
├── kiru-py/               # Python bindings (PyO3)
│   ├── src/lib.rs         # Python wrapper
│   └── python/            # Python tests & benchmarks
└── utils/                 # Version management scripts
```

---

## Streaming & Memory Efficiency

**kiru's killer feature: true streaming with constant memory usage.**

Unlike traditional chunkers that load entire files into memory, kiru processes data as it arrives using an intelligent buffering system. This means you can chunk **gigabyte-sized files** with minimal RAM usage.

### How Streaming Works

```
File/HTTP Source → Read Blocks (8KB) → UTF-8 Buffer → Chunk Iterator → Your Code
                      ↓                      ↓
                 As needed              Constant size
```

**Key advantages:**

1. **Constant Memory**: Process 10GB files with ~10MB RAM

2. **Immediate Results**: First chunks available instantly, no waiting for full file load

3. **Works Everywhere**: Local files, HTTP/HTTPS streams, any data source

4. **UTF-8 Safe**: Buffer maintains character boundaries automatically
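The pipeline described above (fixed-size block reads feeding a bounded UTF-8 buffer that a chunk iterator drains) can be sketched in pure Python. This is an illustration of the pattern under stated assumptions, not kiru's Rust code: `stream_chunks` is a hypothetical name, the 8 KB default comes from the diagram, and the buffer policy (hold at most one chunk plus one block) is a choice made for this example.

```python
import io
from typing import BinaryIO, Iterator


def stream_chunks(reader: BinaryIO, chunk_size: int, overlap: int,
                  block_size: int = 8192) -> Iterator[str]:
    """Yield UTF-8-safe chunks while buffering at most chunk_size + block_size bytes."""
    if chunk_size < 4 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 4 and 0 <= overlap < chunk_size")
    buf = bytearray()
    eof = False
    while True:
        # Read only as much as needed to cut the next chunk.
        while not eof and len(buf) <= chunk_size:
            block = reader.read(block_size)
            if block:
                buf += block
            else:
                eof = True
        if len(buf) <= chunk_size:        # only reachable at end of input
            if buf:
                yield bytes(buf).decode("utf-8")
            return
        end = chunk_size
        while (buf[end] & 0xC0) == 0x80:  # never cut inside a multi-byte character
            end -= 1
        yield bytes(buf[:end]).decode("utf-8")
        keep = max(end - overlap, 1)      # retain the overlap for the next chunk
        while (buf[keep] & 0xC0) == 0x80:
            keep += 1
        del buf[:keep]                    # drop consumed bytes; memory stays bounded


source = io.BytesIO("abcdefgh".encode("utf-8"))
print(list(stream_chunks(source, chunk_size=4, overlap=1, block_size=3)))
# ['abcd', 'defg', 'gh']
```

Since consumed bytes are deleted as soon as a chunk is emitted, peak buffer size depends on `chunk_size` and `block_size` only, not on the size of the input.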

### Python Examples

```python
from kiru import Chunker

chunker = Chunker.by_bytes(chunk_size=4096, overlap=512)

# ⚡ Stream a 10GB file - uses only ~10MB RAM
for chunk in chunker.on_file("huge_dataset.txt"):
    # Process chunk immediately as it arrives
    vector_db.insert(chunk)
    # No waiting, no memory explosion!

# ⚡ Stream from HTTP - process as data downloads
for chunk in chunker.on_http("https://example.com/large_document.txt"):
    process(chunk)
    # Chunks ready while download continues

# ⚡ Stream multiple sources in parallel
sources = [
    "file://10gb_file1.txt",
    "https://example.com/doc.txt",
    "file://10gb_file2.txt"
]

for chunk in chunker.on_sources_par(sources, channel_size=1000):
    # All sources stream in parallel
    # Memory stays constant regardless of file sizes
    send_to_queue(chunk)
```
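The `channel_size` argument suggests a bounded channel between producers and the consumer, which is a common way to keep memory flat when several sources stream at once. The sketch below shows that general pattern with the standard library; it is an assumption about the design, not a description of kiru's internals, and `chunk_sources_parallel` is a hypothetical name. In CPython the threads here illustrate backpressure rather than true CPU parallelism.

```python
import queue
import threading
from typing import Iterator


def chunk_sources_parallel(texts: list[str], chunk_size: int,
                           channel_size: int = 4) -> Iterator[str]:
    """Chunk several in-memory texts concurrently through one bounded queue."""
    channel: queue.Queue = queue.Queue(maxsize=channel_size)
    done = object()                       # sentinel marking the end of one source

    def worker(text: str) -> None:
        for i in range(0, len(text), chunk_size):
            channel.put(text[i:i + chunk_size])  # blocks while the channel is full
        channel.put(done)

    threads = [threading.Thread(target=worker, args=(t,), daemon=True) for t in texts]
    for thread in threads:
        thread.start()

    remaining = len(threads)
    while remaining:
        item = channel.get()
        if item is done:
            remaining -= 1
        else:
            yield item


# Chunks from different sources may interleave, so sort for a stable display.
print(sorted(chunk_sources_parallel(["aaaa bbbb", "cccc"], chunk_size=4)))
```

A small channel caps how many chunks can be waiting at once; a larger one lets fast producers run further ahead of a slow consumer at the cost of more buffered data.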

### Rust Examples

```rust
use kiru::{BytesChunker, Chunker, Source, StreamType};

let chunker = BytesChunker::new(4096, 512)?;

// ⚡ Stream a massive file with constant memory
let stream = StreamType::from_source(&Source::File("10gb_file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    // Process immediately, no memory buildup
    vector_db.insert(chunk);
}

// ⚡ Stream from HTTP as data arrives
let stream = StreamType::from_source(&Source::Http("https://example.com/doc.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    process(chunk);
}
```

### Memory Comparison

Processing a **1GB file** with 4KB chunks:

| Library    | Memory Usage | Loads Full File? | Streaming? |
|------------|--------------|------------------|------------|
| **kiru**   | **~10 MB**   | ❌ No            | ✅ Yes     |
| LangChain  | **1000+ MB** | ✅ Yes           | ❌ No      |
| tiktoken   | **1000+ MB** | ✅ Yes           | ❌ No      |

**Result**: kiru uses **100x less memory** while being **4,000x faster**!

---

## Development

### Setup

```bash
# Clone repository
git clone https://github.com/bitswired/kiru.git
cd kiru

# Run all tests
cargo test --workspace

# Run Rust benchmarks
cd kiru-core
cargo bench

# Build Python package
cd ../kiru-py
pip install maturin
maturin develop --release

# Run Python tests
pip install pytest hypothesis
pytest python/test.py

# Run Python benchmarks
python python/bench.py
```

### Running Benchmarks

```bash
# Rust benchmarks
cd kiru-core
cargo bench

# Python benchmarks
cd kiru-py
python python/bench.py
```

---

## Performance Tips

1. **Use byte chunking** for maximum throughput (1000+ MB/s)

2. **Use character chunking** when exact character counts matter (300+ MB/s)

3. **Enable parallel processing** with `on_sources_par()` for multiple files

4. **Tune chunk size** based on your embedding model's context window

5. **Adjust overlap** to balance context preservation and storage

6. **Stream large files** to maintain constant memory usage
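Tips 4 and 5 trade off against each other: a larger overlap preserves more context but produces more chunks and more stored bytes. A rough way to estimate the cost, assuming a fixed stride of `chunk_size - overlap` (UTF-8 boundary adjustments will shift real counts slightly; the helper names are hypothetical):

```python
import math


def estimate_chunks(total_size: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of chunks for an input of total_size units."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    if total_size <= chunk_size:
        return 1 if total_size > 0 else 0
    stride = chunk_size - overlap
    return math.ceil((total_size - overlap) / stride)


def storage_overhead(chunk_size: int, overlap: int) -> float:
    """Approximate factor by which overlap inflates the total stored bytes."""
    return chunk_size / (chunk_size - overlap)


# 1 MiB of text with the Quick Start settings (1024-byte chunks, 128-byte overlap)
print(estimate_chunks(1_048_576, 1024, 128))      # 1171
print(round(storage_overhead(1024, 128), 3))      # 1.143
```

With these settings the overlap adds roughly 14% to the number of bytes sent to the embedding model and stored in the vector database.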

---

## Why "kiru"?

"Kiru" (切る) is Japanese for "to cut" - reflecting the library's purpose of cutting text into chunks at lightning speed ⚡🗡️

---

## Contributing

We welcome contributions! Please check out our [Contributing Guide](CONTRIBUTING.md) for guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Credits

Built with:

- [PyO3](https://pyo3.rs) - Rust bindings for Python

- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism for Rust

- [maturin](https://www.maturin.rs) - Build and publish Rust Python extensions

---

**Ready to cut through text at the speed of light?**

- 🐍 **Python**: `pip install kiru`

- 🦀 **Rust**: Add `kiru = "0.1"` to Cargo.toml

Get started with [PyPI](https://pypi.org/project/kiru/) | [Crates.io](https://crates.io/crates/kiru) | [Documentation](https://docs.rs/kiru)