https://github.com/speedyk-005/chunklet-py
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder - built for LLMs, RAG pipelines, and beyond.
- Host: GitHub
- URL: https://github.com/speedyk-005/chunklet-py
- Owner: speedyk-005
- License: MIT
- Created: 2025-07-22T19:43:10.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-02-21T04:50:27.000Z (24 days ago)
- Last Synced: 2026-02-21T11:35:33.931Z (24 days ago)
- Topics: ai, chunking, chunks-algorithm, chunks-processing, code-chunking, code-structure, document-chunking, natural-language-processing, nlp, rag, text-splitting, visualization
- Language: Python
- Homepage: https://speedyk-005.github.io/chunklet-py/latest
- Size: 17.7 MB
- Stars: 62
- Watchers: 3
- Forks: 2
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Support: docs/supported-languages.md
README
# Chunklet-py
*"One library to split them all: Sentence, Code, Docs"*
> [!WARNING]
> **Quick heads up!** Version 2 has some breaking changes. No worries though - check our [Migration Guide](https://speedyk-005.github.io/chunklet-py/latest/migration/) for a smooth upgrade!
Hey! Welcome. Let's make some text chunking magic happen.
[Python](https://www.python.org/downloads/) · [PyPI](https://pypi.org/project/chunklet-py) · [Downloads](https://pepy.tech/projects/chunklet-py) · [Coverage](https://coveralls.io/github/speedyk-005/chunklet-py?branch=main) · [GitHub](https://github.com/speedyk-005/chunklet-py) · [License](https://opensource.org/licenses/MIT) · [CI](https://github.com/speedyk-005/chunklet-py/actions) · [CodeFactor](https://www.codefactor.io/repository/github/speedyk-005/chunklet-py) · [DeepWiki](https://deepwiki.com/speedyk-005/chunklet-py)
## Why Smart Chunking? (Or: Why Not Just Split on Character Count?)
You could split your text by character count or random line breaks. But that's like trying to cut a wedding cake with a chainsaw.
Dumb splitting causes problems:
- **Mid-sentence surprises:** Sentences get chopped midway, losing their meaning
- **Language confusion:** Non-English text and code structures get treated the same
- **Lost context:** Each chunk forgets what came before
Smart chunking solves this by:
- **Smart limits** - Respects both natural boundaries (sentences, paragraphs, sections) AND configurable limits (tokens, lines, functions)
- **Language-aware** - Detects the language automatically and applies the right rules (50+ languages supported)
- **Context preservation** - Overlap between chunks, plus rich metadata (source, span, document structure)
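The overlap idea is easy to picture. Here's a minimal, library-independent sketch (plain Python, not Chunklet-py's actual implementation) of how repeating a trailing sentence at the start of the next chunk gives each chunk some memory of what came before:

```python
def chunk_with_overlap(sentences, max_sentences=3, overlap=1):
    """Group sentences into chunks, repeating the last `overlap`
    sentences of each chunk at the start of the next one."""
    chunks, start = [], 0
    while start < len(sentences):
        chunks.append(sentences[start:start + max_sentences])
        if start + max_sentences >= len(sentences):
            break
        start += max_sentences - overlap  # step back to create the overlap
    return chunks

sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(chunk_with_overlap(sents))
# [['S1.', 'S2.', 'S3.'], ['S3.', 'S4.', 'S5.']]
```

Note how `S3.` appears in both chunks: a retriever that surfaces the second chunk still sees the sentence that introduced it.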
## So What's Chunklet-py Anyway? (And Why Should You Care?)
**Chunklet-py** is a developer-friendly text-splitting library designed to be the most versatile chunking solution around - for devs, researchers, and AI engineers. It goes way beyond basic character counting: it intelligently chunks text, documents, and code into meaningful, context-aware pieces, perfect for RAG pipelines and LLM applications. I built it because I was tired of the terrible chunking options out there.
Key features:
- **Composable constraints** - Mix and match limits (sentences, tokens, sections) to get exactly the chunks you need
- **Pluggable architecture** - Swap in custom tokenizers, sentence splitters, or processors
- **Rich metadata** - Every chunk comes with source references, spans, and structural info
- **Multi-format support** - PDF, DOCX, EPUB, Markdown, HTML, LaTeX, ODT, CSV, Excel, and plain text
Available tools:
- `SentenceSplitter` - Lightweight sentence tokenization
- `DocumentChunker` - Natural language with semantic boundaries
- `CodeChunker` - Language-aware code chunking
- `ChunkVisualizer` - Interactive web-based exploration
Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.
| Feature | Why it's awesome |
| :--- | :--- |
| **Blazingly Fast** | Leverages efficient parallel processing to chunk large volumes of content with remarkable speed. |
| **Featherlight Footprint** | Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead. |
| **Rich Metadata for RAG** | Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications. |
| **Infinitely Customizable** | Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors. |
| **Multilingual Mastery** | Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms. |
| **Code-Aware Intelligence** | Language-agnostic code chunking that understands and preserves the structural integrity of your source code. |
| **Precision Chunking** | Flexible chunking with configurable limits based on sentences, tokens, sections, lines, and functions. |
| **Document Format Mastery** | Processes a wide array of document formats including `.pdf`, `.docx`, `.epub`, `.txt`, `.tex`, `.html`, `.hml`, `.md`, `.rst`, `.rtf`, `.odt`, `.csv`, and `.xlsx`. |
| **Triple Interface: CLI, Library & Web** | Use it as a command-line tool, import it as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning. |
And that's just the start - there's plenty more to explore!
> [!NOTE]
> For the full documentation experience, check out our [documentation site](https://speedyk-005.github.io/chunklet-py/latest).
---
## Installation
Ready to get Chunklet-py running? Awesome! Let's get you set up quickly and painlessly.
> [!NOTE]
> **chunklet-py (aka chunklet)** - The old `chunklet` package is no longer maintained. Use `chunklet-py` to get the latest version.
### The Quick & Easy Way
The simplest way to get started is with pip:
```bash
# Install and check it's working
pip install chunklet-py
chunklet --version
```
That's it! You're all set to start chunking.
### Extra Features (Optional)
Want to unlock more Chunklet-py superpowers? Add these optional dependencies based on what you need:
* **Document Processing:** For handling `.pdf`, `.docx`, `.epub`, and other document formats:
```bash
pip install "chunklet-py[structured-document]"
```
* **Code Chunking:** For advanced code analysis and chunking features:
```bash
pip install "chunklet-py[code]"
```
* **Visualization:** For the interactive web-based chunk visualizer:
```bash
pip install "chunklet-py[visualization]"
```
* **All Extras:** To install all optional dependencies:
```bash
pip install "chunklet-py[all]"
```
### The From-Source Way
Prefer building from source? You can clone and install manually for full control:
```bash
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
pip install ".[all]"  # quotes keep shells like zsh from globbing the brackets
```
(But honestly, the pip way is usually way easier!)
### Want to Help Make Chunklet-py Even Better?
That's awesome! We'd love to have you contribute. Check out our [**Contributing Guide**](https://github.com/speedyk-005/chunklet-py/blob/main/CONTRIBUTING.md) first, then set up your development environment:
```bash
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
# For basic development (testing, linting)
pip install -e ".[dev]"
# For documentation development
pip install -e ".[docs]"
# For comprehensive development (including all optional features like document and code chunking + docs dependencies)
pip install -e ".[dev-all]"
```
These install Chunklet-py in "editable" mode so your code changes take effect immediately. The different options give you just the dependencies you need.
Go forth and code! (And remember, good developers write tests. We appreciate excellent code examples!)
---
## Quick Reference
> [!NOTE]
> For the exhaustive details that I know you're probably avoiding, check the [official docs](https://speedyk-005.github.io/chunklet-py/latest/).
### The Constraint-Based Logic
Chunklet-py is basically a "choose your own adventure" for data. It's constraint-based, meaning you can swap, combine, or ignore the limits below as you see fit.
**The Golden Rule:** You must provide at least one constraint, or the chunker has no idea when to stop.
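To make the constraint logic concrete, here's a toy sketch (plain Python, not the library's code) where a chunk closes as soon as *any* active limit would be exceeded; drop or add limits and the loop still works, which is the whole point of composable constraints:

```python
def constrained_chunks(sentences, max_sentences=None, max_tokens=None):
    """Close the current chunk as soon as ANY active limit would be
    exceeded. Tokens are approximated by whitespace word counts here."""
    if max_sentences is None and max_tokens is None:
        raise ValueError("Provide at least one constraint.")  # the Golden Rule
    chunks, current, tokens = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        over_sents = max_sentences is not None and len(current) >= max_sentences
        over_tokens = max_tokens is not None and current and tokens + n > max_tokens
        if over_sents or over_tokens:
            chunks.append(current)
            current, tokens = [], 0
        current.append(sent)
        tokens += n
    if current:
        chunks.append(current)
    return chunks
```

With `max_sentences=2` alone you get fixed-size groups; add `max_tokens` and whichever limit trips first ends the chunk, exactly the "mix and match" behavior described above.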
### Core Imports
Pick your weapon based on whatever data mess you're currently cleaning up.
```python
from chunklet import DocumentChunker # For PDFs, DOCX, and general text chaos
from chunklet import CodeChunker # For source code (it actually respects brackets)
from chunklet import SentenceSplitter # For when you just need to split sentences
from chunklet import visualizer # Web-based chunk visualizer
```
### Configuration & Limits
These tools don't share arguments, so don't try to use `max_functions` on a PDF unless you want to see a very confused Python interpreter.
**DocumentChunker (Text & Docs)**
Perfect for natural language where you don't want to cut someone off mid-sentence.
```python
chunker = DocumentChunker()
# Feel free to mix and match these
chunks = chunker.chunk_text(
text,
max_sentences=3, # Stop after X sentences
max_tokens=500, # Don't blow up the LLM context
max_section_breaks=2, # Respect the Markdown headers
overlap_percent=20, # Give it some "memory" of the last chunk
offset=0 # Skip the first N sentences if you're feeling adventurous
)
```
**CodeChunker (Source Code)**
Logic-aware. It doesn't do "overlap" because duplicate code is a hallucination waiting to happen.
```python
chunker = CodeChunker()
# Again, use whichever constraints make sense for your file
chunks = chunker.chunk_text(
text,
max_lines=50, # Height limit
max_tokens=512, # Width limit
max_functions=1, # One function per chunk (keeps things tidy)
strict=True # True: Crash on big blocks; False: Slice 'em up anyway
)
```
### The Output Object
The chunkers return a list (or generator) of `Chunk` objects. These are `Box` instances, so you can use dot notation like a civilized developer.
```python
for chunk in chunks:
print(chunk.content) # The actual text/code
print(chunk.metadata) # Chunk metadata
print() # Because whitespace is free
```
### Input Methods (Chunkers Only)
These helper methods are for the DocumentChunker and CodeChunker. The SentenceSplitter is a simple soul and only takes strings.
| Method | Input | Return Type |
|--------|-------|-------------|
| `chunk_text(text)` | str | List[Chunk] |
| `chunk_file(path)` | Path or str | List[Chunk] |
| `chunk_texts(list)` | List[str] | Generator[Chunk] |
| `chunk_files(list)` | List[Path] | Generator[Chunk] |
### Specialized Tools
**SentenceSplitter**
The "lite" version for when you just need sentences and no fancy metadata.
```python
splitter = SentenceSplitter()
# 'auto' usually guesses right, but you can specify 'en', 'es', etc.
sentences = splitter.split_text(text, lang="auto")
```
**CLI (Command Line Interface)**
If you prefer the terminal to an IDE, the CLI is packed with features. Just ask for help.
```bash
chunklet --help
chunklet split --help
chunklet chunk --help
chunklet visualize --help
chunklet [COMMAND] [OPTIONS*]
```
---
## Features & Roadmap
- [x] CLI interface
- [x] Document chunking with metadata
- [x] Code chunking based on points of interest
- [x] Interactive chunk visualizer (web interface)
- [x] Extended file format support:
- [x] ODT files
- [x] CSV and Excel files
---
## How Chunklet-py Compares
While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:
| Library | Key Differentiator | Focus |
| :--- | :--- | :--- |
| **chunklet-py** | **All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms.** | **Text, Code, Docs** |
| [LangChain](https://github.com/langchain-ai/langchain) | Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs. | Full Stack |
| [Chonkie](https://github.com/chonkie-inc/chonkie) | All-in-one pipeline (chunking + embeddings + vector DB). Uses `tree-sitter` for code. Multilingual. | Pipelines |
| [Semchunk](https://github.com/isaacus-dev/semchunk) | Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives. | Text |
| [CintraAI Code Chunker](https://github.com/CintraAI/code-chunker) | Code-specific, uses `tree-sitter`. Initially supports Python, JS, CSS only. | Code |
Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline: it handles text, documents, and code without heavy dependencies, keeping your project lightweight.
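As a rough sketch of what "drop-in" means here: a typical RAG ingestion loop only needs its chunking step to yield text pieces, so swapping chunkers touches one function. Everything in this example is a hypothetical stand-in (the `embed` function and the naive baseline chunker are not part of any real API):

```python
def embed(text):
    # Hypothetical stand-in for a real embedding model.
    return [float(len(text))]

def ingest(documents, chunk_fn):
    """Chunk each document, embed each piece, and collect records
    for a vector store. `chunk_fn` is the swappable chunking step."""
    records = []
    for doc_id, text in documents.items():
        for i, piece in enumerate(chunk_fn(text)):
            records.append({"doc": doc_id, "chunk": i,
                            "text": piece, "vector": embed(piece)})
    return records

# Naive fixed-width baseline; swap in a smarter chunker without touching `ingest`.
naive = lambda text: [text[i:i + 40] for i in range(0, len(text), 40)]
records = ingest({"a.txt": "x" * 100}, naive)
```

Replacing `naive` with a call into a smarter, boundary-aware chunker changes nothing else in the pipeline, which is why the chunking step is the easiest place to upgrade.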
---
## Contributors & Thanks
A huge thank you to the awesome people who helped shape Chunklet-py:
- [@jmbernabotto](https://github.com/jmbernabotto) - for helping mostly on the CLI part, suggesting fixes, features, and design improvements.
- [@arnoldfranz](https://github.com/arnoldfranz) - for reporting the CLI path validation bug (#6) that helped improve error handling.
---
## License
Check out the [LICENSE](https://github.com/speedyk-005/chunklet-py/blob/main/LICENSE) file for all the details.
> MIT License. Use freely, modify boldly, and credit appropriately! (We're not that legendary... yet.)