{"id":46144699,"url":"https://github.com/speedyk-005/chunklet-py","last_synced_at":"2026-03-02T07:02:50.725Z","repository":{"id":305957917,"uuid":"1024496408","full_name":"speedyk-005/chunklet-py","owner":"speedyk-005","description":"One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.","archived":false,"fork":false,"pushed_at":"2026-02-21T04:50:27.000Z","size":18519,"stargazers_count":62,"open_issues_count":6,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-02-21T11:35:33.931Z","etag":null,"topics":["ai","chunking","chunks-algorithm","chunks-processing","code-chunking","code-structure","document-chunking","natural-language-processing","nlp","rag","text-splitting","visualization"],"latest_commit_sha":null,"homepage":"https://speedyk-005.github.io/chunklet-py/latest","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/speedyk-005.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/supported-languages.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-22T19:43:10.000Z","updated_at":"2026-02-17T17:22:39.000Z","dependencies_parsed_at":"2025-08-28T11:59:46.777Z","dependency_job_id":"e4243707-9423-4222-9faa-7bde9a2e0d5f","html_url":"https://github.com/speedyk-005/chunklet-py","commit_stats":null,"previous_names":["speed40/chunklet","speedyk-005/chunklet","speedyk-005/chunklet-py"],"tags_count":15,"template":false,"template_full_name":nul
l,"purl":"pkg:github/speedyk-005/chunklet-py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/speedyk-005%2Fchunklet-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/speedyk-005%2Fchunklet-py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/speedyk-005%2Fchunklet-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/speedyk-005%2Fchunklet-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/speedyk-005","download_url":"https://codeload.github.com/speedyk-005/chunklet-py/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/speedyk-005%2Fchunklet-py/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29994619,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T01:47:34.672Z","status":"online","status_checked_at":"2026-03-02T02:00:07.342Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chunking","chunks-algorithm","chunks-processing","code-chunking","code-structure","document-chunking","natural-language-processing","nlp","rag","text-splitting","visualization"],"created_at":"2026-03-02T07:02:39.285Z","updated_at":"2026-03-02T07:02:50.716Z","avatar_url":"https://github.com/speedyk-005.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme
":"# 🧩 Chunklet-py\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/speedyk-005/chunklet-py/blob/main/logo_with_tagline.png?raw=true\" alt=\"Chunklet-py Logo\" width=\"300\"/\u003e\n\u003c/p\u003e\n\n“One library to split them all: Sentence, Code, Docs”\n\n\u003e [!WARNING]\n\u003e **Quick heads up!** Version 2 has some breaking changes. No worries though - check our [Migration Guide](https://speedyk-005.github.io/chunklet-py/latest/migration/) for a smooth upgrade!\n\nHey! Welcome. Let's make some text chunking magic happen.\n\n[![Python Version](https://img.shields.io/badge/Python-3.10%20--%203.14-blue)](https://www.python.org/downloads/)\n[![PyPI](https://img.shields.io/pypi/v/chunklet-py)](https://pypi.org/project/chunklet-py)\n[![PyPI Downloads](https://static.pepy.tech/personalized-badge/chunklet-py?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=BLUE\u0026left_text=downloads)](https://pepy.tech/projects/chunklet-py)\n[![Coverage Status](https://coveralls.io/repos/github/speedyk-005/chunklet-py/badge.svg?branch=main)](https://coveralls.io/github/speedyk-005/chunklet-py?branch=main)\n[![Stability](https://img.shields.io/badge/stability-stable-brightgreen)](https://github.com/speedyk-005/chunklet-py)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![Tests](https://img.shields.io/badge/tests-passing-brightgreen)](https://github.com/speedyk-005/chunklet-py/actions)\n[![CodeFactor](https://www.codefactor.io/repository/github/speedyk-005/chunklet-py/badge)](https://www.codefactor.io/repository/github/speedyk-005/chunklet-py)\n[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/speedyk-005/chunklet-py)\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://speedyk-005.github.io/chunklet-py/latest\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n    -- documentation site --\n  \u003c/a\u003e\n\u003c/p\u003e\n\n## Why 
Smart Chunking? (Or: Why Not Just Split on Character Count?)\n\nYou could split your text by character count or random line breaks. But that's like trying to cut a wedding cake with a chainsaw. 🎂\n\nDumb splitting causes problems:\n\n- **Mid-sentence surprises:** Your thoughts get chopped mid-way, losing all meaning\n- **Language confusion:** Non-English text and code structures get treated the same\n- **Lost context:** Each chunk forgets what came before\n\nSmart chunking solves this by:\n\n- **Smart limits** — Respects both natural boundaries (sentences, paragraphs, sections) AND configurable limits (tokens, lines, functions)\n- **Language-aware** — Detects language automatically and applies the right rules (50+ languages supported)\n- **Context preservation** — Overlap between chunks, rich metadata (source, span, document structure)\n\n## 🤔 So What's Chunklet-py Anyway? (And Why Should You Care?)\n\n**Chunklet-py** is a developer-friendly text splitting library designed to be the most versatile chunking solution — for devs, researchers, and AI engineers. It goes way beyond basic character counting. I built this because I was tired of terrible chunking options. 
Chunklet-py intelligently chunks text, documents, and code into meaningful, context-aware pieces — perfect for RAG pipelines and LLM applications.\n\nKey features:\n\n- **Composable constraints** — Mix and match limits (sentences, tokens, sections) to get exactly the chunks you need\n- **Pluggable architecture** — Swap in custom tokenizers, sentence splitters, or processors\n- **Rich metadata** — Every chunk comes with source references, spans, and structural info\n- **Multi-format support** — PDF, DOCX, EPUB, Markdown, HTML, LaTeX, ODT, CSV, Excel, and plain text\n\nAvailable tools:\n\n- `SentenceSplitter` — Lightweight sentence tokenization\n- `DocumentChunker` — Natural language with semantic boundaries\n- `CodeChunker` — Language-aware code chunking\n- `ChunkVisualizer` — Interactive web-based exploration\n\nPerfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.\n\n| Feature | Why it's awesome |\n| :--- | :--- |\n| 🚀 **Blazingly Fast** | Leverages efficient parallel processing to chunk large volumes of content with remarkable speed. |\n| 🪶 **Featherlight Footprint** | Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead. |\n| 🗂️ **Rich Metadata for RAG** | Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications. |\n| 🔧 **Infinitely Customizable** | Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors. |\n| 🌐 **Multilingual Mastery** | Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms. |\n| 🧑‍💻 **Code-Aware Intelligence** | Language-agnostic code chunking that understands and preserves the structural integrity of your source code. 
|\n| 🎯 **Precision Chunking** | Flexible chunking with configurable limits based on sentences, tokens, sections, lines, and functions. |\n| 📄 **Document Format Mastery** | Processes a wide array of document formats including `.pdf`, `.docx`, `.epub`, `.txt`, `.tex`, `.html`, `.hml`, `.md`, `.rst`, `.rtf`, `.odt`, `.csv`, and `.xlsx`. |\n| 💻 **Triple Interface: CLI, Library \u0026 Web** | Use it as a command-line tool, import as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning. |\n\n\nAnd that's just the start - there's plenty more to explore!\n\n\u003e [!NOTE]\n\u003e For the full documentation experience, check out our [documentation site](https://speedyk-005.github.io/chunklet-py/latest).\n\n---\n\n## 📦 Installation\n\nReady to get Chunklet-py running? Awesome! Let's get you set up quickly and painlessly.\n\n\u003e [!NOTE]\n\u003e **chunklet-py (aka chunklet)** — The old `chunklet` package is no longer maintained. Use `chunklet-py` to get the latest version.\n    \n### The Quick \u0026 Easy Way\n\nThe simplest way to get started is with pip:\n\n```bash\n# Install and check it's working\npip install chunklet-py\nchunklet --version\n```\n\nThat's it! You're all set to start chunking.\n\n### Extra Features (Optional)\n\nWant to unlock more Chunklet-py superpowers? 
Add these optional dependencies based on what you need:\n\n*   **Document Processing:** For handling `.pdf`, `.docx`, `.epub`, and other document formats:\n    ```bash\n    pip install \"chunklet-py[structured-document]\"\n    ```\n*   **Code Chunking:** For advanced code analysis and chunking features:\n    ```bash\n    pip install \"chunklet-py[code]\"\n    ```\n*   **Visualization:** For the interactive web-based chunk visualizer:\n    ```bash\n    pip install \"chunklet-py[visualization]\"\n    ```\n*   **All Extras:** To install all optional dependencies:\n    ```bash\n    pip install \"chunklet-py[all]\"\n    ```\n\n### The From-Source Way\n\nPrefer building from source? You can clone and install manually for full control:\n\n```bash\ngit clone https://github.com/speedyk-005/chunklet-py.git\ncd chunklet-py\n# Quote the extras so shells like zsh don't treat the brackets as a glob\npip install \".[all]\"\n```\n\n(But honestly, the pip way is usually way easier!)\n\n### Want to Help Make Chunklet-py Even Better?\n\nThat's awesome! We'd love to have you contribute. Check out our [**Contributing Guide**](https://github.com/speedyk-005/chunklet-py/blob/main/CONTRIBUTING.md) first, then set up your development environment:\n\n```bash\ngit clone https://github.com/speedyk-005/chunklet-py.git\ncd chunklet-py\n# For basic development (testing, linting)\npip install -e \".[dev]\"\n# For documentation development\npip install -e \".[docs]\"\n# For comprehensive development (including all optional features like document and code chunking + docs dependencies)\npip install -e \".[dev-all]\"\n```\n\nThese install Chunklet-py in \"editable\" mode so your code changes take effect immediately. The different options give you just the dependencies you need.\n\nGo forth and code! (And remember, good developers write tests. 
We appreciate excellent code examples!)\n\n---\n\n## Quick Reference 🛠️\n\n\u003e [!NOTE]\n\u003e For the exhaustive details that I know you're probably avoiding, check the [official docs](https://speedyk-005.github.io/chunklet-py/latest/).\n\n### The Constraint-Based Logic\n\nChunklet-py is basically a \"choose your own adventure\" for data. It's constraint-based, meaning you can swap, combine, or ignore the limits below as you see fit.\n\n**The Golden Rule:** You must provide at least one constraint, or the chunker has no idea when to stop.\n\n### Core Imports\n\nPick your weapon based on whatever data mess you're currently cleaning up.\n\n```python\nfrom chunklet import DocumentChunker   # For PDFs, DOCX, and general text chaos\nfrom chunklet import CodeChunker       # For source code (it actually respects brackets)\nfrom chunklet import SentenceSplitter  # For when you just need to split sentences\nfrom chunklet import visualizer        # Web-based chunk visualizer\n```\n\n### Configuration \u0026 Limits\n\nThese tools don't share arguments, so don't try to use `max_functions` on a PDF unless you want to see a very confused Python interpreter.\n\n**DocumentChunker (Text \u0026 Docs)**\n\nPerfect for natural language where you don't want to cut someone off mid-sentence.\n\n```python\nchunker = DocumentChunker()\n\n# Feel free to mix and match these\nchunks = chunker.chunk_text(\n    text,\n    max_sentences=3,       # Stop after X sentences\n    max_tokens=500,        # Don't blow up the LLM context\n    max_section_breaks=2,  # Respect the Markdown headers\n    overlap_percent=20,    # Give it some \"memory\" of the last chunk\n    offset=0               # Skip the first N sentences if you're feeling adventurous\n)\n```\n\n**CodeChunker (Source Code)**\n\nLogic-aware. 
It doesn't do \"overlap\" because duplicate code is a hallucination waiting to happen.\n\n```python\nchunker = CodeChunker()\n\n# Again, use whichever constraints make sense for your file\nchunks = chunker.chunk_text(\n    text,\n    max_lines=50,          # Height limit\n    max_tokens=512,        # Width limit\n    max_functions=1,       # One function per chunk (keeps things tidy)\n    strict=True            # True: Crash on big blocks; False: Slice 'em up anyway\n)\n```\n\n### The Output Object\n\nThe chunkers return a list (or generator) of Chunk objects. These are Box instances, so you can use dot notation like a civilized developer.\n\n```python\nfor chunk in chunks:\n    print(chunk.content)   # The actual text/code\n    print(chunk.metadata)  # Chunk metadata\n    print()                # Because whitespace is free\n```\n\n### Input Methods (Chunkers Only)\n\nThese helper methods are for the DocumentChunker and CodeChunker. The SentenceSplitter is a simple soul and only takes strings.\n\n| Method | Input | Return Type |\n|--------|-------|-------------|\n| `chunk_text(text)` | str | List[Chunk] |\n| `chunk_file(path)` | Path or str | List[Chunk] |\n| `chunk_texts(list)` | List[str] | Generator[Chunk] |\n| `chunk_files(list)` | List[Path] | Generator[Chunk] |\n\n### Specialized Tools\n\n**SentenceSplitter**\n\nThe \"lite\" version for when you just need sentences and no fancy metadata.\n\n```python\nsplitter = SentenceSplitter()\n\n# 'auto' usually guesses right, but you can specify 'en', 'es', etc.\nsentences = splitter.split_text(text, lang=\"auto\")\n```\n\n**CLI (Command Line Interface)**\n\nIf you prefer the terminal to an IDE, the CLI is packed with features. 
Just ask for help.\n\n```bash\nchunklet --help\nchunklet split --help\nchunklet chunk --help\nchunklet visualize --help\nchunklet [COMMAND] [OPTIONS*]\n```\n\n---\n\n## 🗺 Features \u0026 Roadmap\n\n- [x] CLI interface\n- [x] Documents chunking with metadata\n- [x] Code chunking based on interest point\n- [x] Interactive chunk visualizer (web interface)\n- [x] Extended file format support:\n  - [x] ODT files\n  - [x] CSV and Excel files\n\n---\n\n## How Chunklet-py Compares\n\nWhile there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:\n\n| Library | Key Differentiator | Focus |\n| :--- | :--- | :--- |\n| **chunklet-py** | **All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms.** | **Text, Code, Docs** |\n| [LangChain](https://github.com/langchain-ai/langchain) | Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs. | Full Stack |\n| [Chonkie](https://github.com/chonkie-inc/chonkie) | All-in-one pipeline (chunking + embeddings + vector DB). Uses `tree-sitter` for code. Multilingual. | Pipelines |\n| [Semchunk](https://github.com/isaacus-dev/semchunk) | Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives. | Text |\n| [CintraAI Code Chunker](https://github.com/CintraAI/code-chunker) | Code-specific, uses `tree-sitter`. Initially supports Python, JS, CSS only. | Code |\n\nChunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. 
It handles text, documents, and code without heavy dependencies, while keeping your project lightweight.\n\n---\n\n## 🙌 Contributors \u0026 Thanks\n\nA huge thank you to the awesome people who helped shape Chunklet-py:\n\n- [@jmbernabotto](https://github.com/jmbernabotto) — for helping mostly on the CLI part, suggesting fixes, features, and design improvements.\n- [@arnoldfranz](https://github.com/arnoldfranz) — for reporting the CLI Path Validation Bug (#6) that helped improve error handling.\n\n---\n\n## 📜 License\n\nCheck out the [LICENSE](https://github.com/speedyk-005/chunklet-py/blob/main/LICENSE) file for all the details.\n\n\u003e MIT License. Use freely, modify boldly, and credit appropriately! (We're not that legendary... yet 😉)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspeedyk-005%2Fchunklet-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspeedyk-005%2Fchunklet-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspeedyk-005%2Fchunklet-py/lists"}