https://github.com/ayush585/smartchunk
SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.
agentic-workflow chunking chunking-algorithm cli llm nlp package pip rag semantic
- Host: GitHub
- URL: https://github.com/ayush585/smartchunk
- Owner: ayush585
- License: mit
- Created: 2025-09-06T05:31:29.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2025-09-07T04:05:27.000Z (5 months ago)
- Last Synced: 2025-09-07T04:26:00.248Z (5 months ago)
- Topics: agentic-workflow, chunking, chunking-algorithm, cli, llm, nlp, package, pip, rag, semantic
- Language: Python
- Homepage: https://test.pypi.org/project/smartchunk/
- Size: 38.1 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SmartChunk 🧩
**Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)**
SmartChunk is a **Python package + CLI** that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk **respects structure and meaning**: no more chopped sentences, broken code blocks, or messy lists.
The result?
* Better retrieval quality
* Lower token costs
* Chunks your LLM can actually understand
---
## ✨ Why SmartChunk?
Naive splitters cut text every N tokens. That causes:
* Broken headings, lists, or tables
* Incoherent fragments across paragraphs
* Duplicate/boilerplate content bloating your index
**SmartChunk fixes this** by combining structure awareness + semantic similarity.
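To make the failure mode concrete, here is a minimal sketch of a naive fixed-size splitter. It is not SmartChunk code: `naive_split` and the sample document are hypothetical, and whitespace-separated words stand in for real tokens.

```python
def naive_split(text: str, max_tokens: int) -> list[str]:
    """Cut text every `max_tokens` whitespace-separated words, ignoring structure."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


doc = (
    "# Install\n"
    "Run the installer, then verify the version.\n"
    "# Usage\n"
    "Call the chunker on a Markdown file."
)

chunks = naive_split(doc, max_tokens=8)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk starts mid-sentence with `version.` and buries the `# Usage` heading in its middle, which is the kind of fragment a structure-aware splitter is meant to avoid.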
---
## Key Features
* **Structure-Aware Splitting**: Never slices through a heading, list, table, or fenced code block.
* **Semantic Boundary Detection**: Uses embeddings to find natural breakpoints between topics.
* **Noise & Duplication Guard**: Strips headers/footers, removes near-duplicates, normalizes whitespace.
* **Flexible & Tunable**: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
* **End-to-End Ready**: From URL → parsed → deduped → JSONL chunks in one command.
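The structure-aware idea can be sketched in a few lines of Python. This is an illustration under assumptions, not SmartChunk's implementation: `split_by_headings` is a hypothetical helper that splits Markdown at headings, skips `#` lines inside fenced code, and records a heading trail joined with ` / ` in the style of SmartChunk's `header_path` field.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker, built programmatically
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown at headings, ignoring '#' lines inside fenced code."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading trail, e.g. ["Guide", "Install"]
    body: list[str] = []   # lines collected for the current section
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"header_path": " / ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence      # toggle on each opening or closing fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                      # emit the previous section first
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Guide",
    "Intro text.",
    "## Install",
    FENCE + "bash",
    "# this comment is not a heading",
    "pip install smartchunk",
    FENCE,
    "## Usage",
    "Run the CLI.",
])

sections = split_by_headings(doc)
for section in sections:
    print(section["header_path"], "->", len(section["text"].splitlines()), "lines")
```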
---
## ⚡ Quickstart
### 1. Install
For hackathon/demo (TestPyPI):
```bash
pip install -i https://test.pypi.org/simple/ smartchunk
```
Once it is published to PyPI:
```bash
pip install smartchunk
```
---
### 2. Chunk a Document
```bash
smartchunk chunk docs/README.md \
  --mode markdown \
  --max-tokens 500 \
  --overlap 100 \
  --dedupe \
  --out chunks.jsonl
```
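How a maximum chunk size and an overlap interact is easiest to see as a sliding window over token positions. The sketch below assumes the usual meaning of these two settings and is not SmartChunk's actual logic, which also takes document structure into account; `window_spans` is a hypothetical helper that works on token indices only.

```python
def window_spans(n_tokens: int, max_tokens: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for overlapping fixed-size windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap      # each window advances by this many tokens
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break                    # the last window reached the end of the text
        start += step
    return spans


spans = window_spans(n_tokens=1200, max_tokens=500, overlap=100)
print(spans)  # [(0, 500), (400, 900), (800, 1200)]
```

With 500-token windows and a 100-token overlap, each window begins 400 tokens after the previous one, so a larger overlap produces more windows and more tokens in the index.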
---
### 3. Fetch & Chunk a URL
```bash
smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
  --semantic --dedupe --format table
```
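This README does not document how semantic mode works internally. As a rough illustration of embedding-based boundary detection in general, the sketch below starts a new chunk wherever adjacent sentences fall below a similarity threshold. Everything here is an assumption: `bow`, `cosine`, and `semantic_breaks` are hypothetical names, and word counts stand in for real embeddings.

```python
import math
import re
from collections import Counter


def bow(sentence: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_breaks(sentences: list[str], threshold: float) -> list[int]:
    """Return each index i where a new chunk should start at sentences[i]."""
    vectors = [bow(s) for s in sentences]
    return [
        i for i in range(1, len(sentences))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


sentences = [
    "The chunker reads the markdown file.",
    "The chunker splits the markdown file at headings.",
    "Bananas grow in tropical climates.",
]
breaks = semantic_breaks(sentences, threshold=0.2)
print(breaks)  # [2]: the topic changes before the third sentence
```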
---
### 4. Compare with a Naive Splitter
```bash
smartchunk compare docs/README.md --mode markdown --out report.html
```
Generates an **HTML report** showing naive and SmartChunk output side by side.
---
## 📦 Example Output
Each line in the `.jsonl` output is a coherent chunk with rich metadata:
```json
{
  "id": "c0033",
  "text": "###### Opening\n\n \n [\n\n \n edit\n\n \n ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
  "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
  "start_line": 709,
  "end_line": 727
}
```
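Because the output is JSONL, it can be loaded with the standard library alone. The sketch below writes one sample record to a temporary file and reads it back. The field names come from the example above, while the sample values are made up and the line-count formula assumes `start_line` and `end_line` are inclusive.

```python
import json
import tempfile
from pathlib import Path

# One sample record using the field names from the example output; values are made up.
sample = {
    "id": "c0001",
    "text": "## Install\n\npip install smartchunk",
    "header_path": "Quickstart / Install",
    "start_line": 10,
    "end_line": 14,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunks.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")

    # JSONL: parse each non-empty line as an independent JSON object.
    with path.open(encoding="utf-8") as handle:
        chunks = [json.loads(line) for line in handle if line.strip()]

for chunk in chunks:
    # Assumes an inclusive line range, which the README does not confirm.
    n_lines = chunk["end_line"] - chunk["start_line"] + 1
    print(chunk["id"], chunk["header_path"], f"{n_lines} source lines")
```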
---
## 💻 CLI Overview
* `fetch`: Fetch, parse & chunk a URL in one go
* `chunk`: Chunk a local file
* `compare`: Compare SmartChunk vs naive splitter (HTML report)
* `stream`: Stream chunks from STDIN in real time
Run `smartchunk --help` for full options.
---
## License
MIT License. Free to use, modify, and share.
---
## (In Simple Words)
SmartChunk = **"Don't let your RAG cut sentences in half."**
It's the **first step** for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.