{"id":30861855,"url":"https://github.com/ayush585/smartchunk","last_synced_at":"2026-06-14T21:31:42.034Z","repository":{"id":313481615,"uuid":"1051494170","full_name":"ayush585/SmartChunk","owner":"ayush585","description":"SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.","archived":false,"fork":false,"pushed_at":"2026-02-06T08:10:38.000Z","size":98,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-02-06T16:25:36.688Z","etag":null,"topics":["agentic-workflow","chunking","chunking-algorithm","cli","llm","nlp","package","pip","rag","semantic"],"latest_commit_sha":null,"homepage":"https://test.pypi.org/project/smartchunk/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ayush585.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-06T05:31:29.000Z","updated_at":"2026-02-06T08:10:43.000Z","dependencies_parsed_at":"2025-09-06T11:37:26.014Z","dependency_job_id":"51446cd5-a2a7-4485-beea-136725935101","html_url":"https://github.com/ayush585/SmartChunk","commit_stats":null,"previous_names":["kai-972/smartchunk","ayush585/smart
chunk"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/ayush585/SmartChunk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush585%2FSmartChunk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush585%2FSmartChunk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush585%2FSmartChunk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush585%2FSmartChunk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ayush585","download_url":"https://codeload.github.com/ayush585/SmartChunk/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush585%2FSmartChunk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34339194,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-workflow","chunking","chunking-algorithm","cli","llm","nlp","package","pip","rag","semantic"],"created_at":"2025-09-07T17:09:32.358Z","updated_at":"2026-06-14T21:31:42.029Z","avatar_url":"https://github.com/ayush585.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SmartChunk 🧩\n\n**Structure-aware semantic 
chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)**\n\nSmartChunk is a **Python package + CLI** that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk **respects structure and meaning** — no more chopped sentences, broken code blocks, or messy lists.\n\nThe result?\n👉 Better retrieval quality\n👉 Lower token costs\n👉 Chunks your LLM can actually understand\n\n---\n\n## ✨ Why SmartChunk?\n\nNaive splitters cut text every N tokens. That causes:\n\n* ❌ Broken headings, lists, or tables\n* ❌ Incoherent fragments across paragraphs\n* ❌ Duplicate/boilerplate content bloating your index\n\n**SmartChunk fixes this** by combining structure awareness + semantic similarity.\n\n---\n\n## 🧠 Key Features\n\n* **Structure-Aware Splitting**: Never slices through a heading, list, table, or fenced code block.\n* **Semantic Boundary Detection**: Uses embeddings to find natural breakpoints between topics.\n* **Noise \u0026 Duplication Guard**: Strips headers/footers, removes near-duplicates, normalizes whitespace.\n* **Flexible \u0026 Tunable**: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.\n* **End-to-End Ready**: From URL → parsed → cleaned → JSONL chunks in one command.\n\n---\n\n## ⚡ Quickstart\n\n### 1. Install\n\nFor hackathon/demo (TestPyPI):\n\n```bash\npip install -i https://test.pypi.org/simple/ smartchunk\n```\n\nOnce it is published to PyPI:\n\n```bash\npip install smartchunk\n```\n\n---\n\n### 2. Chunk a Document\n\n```bash\nsmartchunk chunk README.md \\\n  --mode markdown \\\n  --max-tokens 500 \\\n  --overlap 100 \\\n  --semantic \\\n  --semantic-model all-MiniLM-L6-v2 \\\n  --format jsonl \\\n  --output chunks.jsonl\n```\n\n---\n\n### 3. Fetch \u0026 Chunk a URL\n\n```bash\nsmartchunk fetch \"https://en.wikipedia.org/wiki/Crayon_Shin-chan\" \\\n  --semantic \\\n  --semantic-model all-MiniLM-L6-v2 \\\n  --format table\n```\n\n---\n\n### 4. 
Compare with a Naive Splitter\n\n```bash\nsmartchunk compare README.md --mode markdown --max-chars 800\n```\n\nPrints a **terminal table** comparing naive vs SmartChunk side-by-side.\n\n---\n\n## 📦 Example Output\n\nEach line in the `.jsonl` output is a coherent chunk with rich metadata:\n\n```json\n{\n    \"id\": \"c0033\",\n    \"text\": \"###### Opening\\n\\n \\n        [\\n\\n \\n         edit\\n\\n \\n        ]\\n\\n* Footage from Japanese opening 8 (\\\"PLEASURE\\\") but with \ncompletely different lyrics, to the melody of a techno remix of Japanese opening 3 (\\\"Ora wa Ninkimono\\\").Musical Director, Producer and \nEnglish Director: World Worm Studios composerGary Gibbons\",\n    \"header_path\": \"Media / Anime / Music / LUK Internacional dub / Opening\",\n    \"start_line\": 709,\n    \"end_line\": 727\n  },\n```\n\n---\n\n## 💻 CLI Overview\n\n* `fetch` → Fetch, parse \u0026 chunk a URL in one go\n* `chunk` → Chunk a local file\n* `compare` → Compare SmartChunk vs naive splitter (HTML report)\n* `stream` → Stream chunks from STDIN in real-time\n\nRun `smartchunk --help` for full options.\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. By participating, you agree to follow our [Code of Conduct](CODE_OF_CONDUCT.md).\n\n---\n\n## 🔑 License\n\nMIT License. Free to use, modify, and share.\n\n---\n\n## (In Simple Words) 📝\n\nSmartChunk = **“Don’t let your RAG cut sentences in half.”**\nIt’s the **first step** for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayush585%2Fsmartchunk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fayush585%2Fsmartchunk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayush585%2Fsmartchunk/lists"}