{"id":50433465,"url":"https://github.com/ekimetrics/adaptive-chunking","last_synced_at":"2026-06-02T14:00:40.984Z","repository":{"id":347024132,"uuid":"1192490705","full_name":"ekimetrics/adaptive-chunking","owner":"ekimetrics","description":"Adaptive Chunking: automatically select the best chunking method per document for RAG. Accepted at LREC 2026.","archived":false,"fork":false,"pushed_at":"2026-03-26T11:01:37.000Z","size":4027,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-27T04:25:03.636Z","etag":null,"topics":["chunking","information-retrieval","llm","nlp","rag","text-splitting"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ekimetrics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-26T09:14:21.000Z","updated_at":"2026-03-26T11:01:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ekimetrics/adaptive-chunking","commit_stats":null,"previous_names":["ekimetrics/adaptive-chunking"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ekimetrics/adaptive-chunking","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekimetrics%2Fadaptive-chunking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekimetrics%2Fadaptive-chunking/tags","releases_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekimetrics%2Fadaptive-chunking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekimetrics%2Fadaptive-chunking/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ekimetrics","download_url":"https://codeload.github.com/ekimetrics/adaptive-chunking/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekimetrics%2Fadaptive-chunking/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33824902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","information-retrieval","llm","nlp","rag","text-splitting"],"created_at":"2026-05-31T16:00:17.335Z","updated_at":"2026-06-02T14:00:40.977Z","avatar_url":"https://github.com/ekimetrics.png","language":"Python","funding_links":[],"categories":["🛠️ Techniques"],"sub_categories":["Chunking"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Adaptive Chunking\n\n**Selecting the Best Chunking Strategy per Document for RAG**\n\n[![arXiv](https://img.shields.io/badge/arXiv-2603.25333-b31b1b.svg)](https://arxiv.org/abs/2603.25333)\n[![Conference](https://img.shields.io/badge/LREC-2026-blue)](https://lrec2026.info/)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Python 3.11+](https://img.shields.io/badge/Python-3.11%2B-blue.svg)](https://www.python.org/downloads/)\n\u003c!-- [![PyPI](https://img.shields.io/pypi/v/adaptive-chunking)](https://pypi.org/project/adaptive-chunking/) --\u003e\n\n\u003c!-- \u003cimg src=\"docs/banner_concept2.svg\" alt=\"Adaptive Chunking banner\" width=\"800\"\u003e --\u003e\n\n\u003cimg src=\"docs/architecture.svg\" alt=\"Adaptive Chunking architecture\" width=\"800\"\u003e\n\n\u003c/div\u003e\n\n---\n\n\u003e **Official implementation** of *\"Adaptive Chunking: Optimizing Chunking-Method Selection for RAG\"*, accepted at **[LREC 2026](https://lrec2026.info/)**.\n\u003e\n\u003e **Authors:** Paulo Roberto de Moura Junior, Jean Lelong, Annabelle Blangero — [Ekimetrics](https://www.ekimetrics.com/)\n\n## What is Adaptive Chunking?\n\nNo single chunking method works best for every document in a RAG pipeline. **Adaptive Chunking** solves this by evaluating multiple chunking strategies against a set of intrinsic quality metrics and automatically selecting the best one for each document. 
Both dimensions are modular: you can plug in your own chunking methods and define your own evaluation metrics, making the framework easy to extend to new domains and use cases.\n\n## Key Results\n\nAdaptive Chunking selects the best splitting strategy per document using five intrinsic quality metrics, evaluated on 33 documents across 3 domains (~1.18M tokens).\n\n**RAG evaluation** (Table 5 — Wilcoxon p \u003c 0.05 for Retrieval Completeness):\n\n| Metric | Adaptive Chunking | LangChain recursive | Page splitting |\n|--------|---:|---:|---:|\n| Retrieval Completeness | **67.7** | 58.1 | 59.1 |\n| Answer Correctness | **78.0** | 70.1 | 73.3 |\n| Answered queries | **65/99** | 49/99 | 49/99 |\n\n**Intrinsic metrics** (Table 3 — mean % across all domains, Wilcoxon p \u003c 0.001 vs all methods):\n\n| Method | RC | ICC | DCC | BI | SC | **Mean** |\n|--------|---:|----:|----:|---:|---:|--------:|\n| **Adaptive Chunking** | 99.0 | 68.2 | 88.8 | 99.4 | 99.9 | **91.07** |\n| LLM regex (GPT) | 98.0 | 70.9 | 82.4 | 98.1 | 99.6 | 89.80 |\n| LangChain recursive | 96.1 | 65.6 | 88.8 | 95.0 | 97.7 | 88.62 |\n| Semantic | 97.5 | 69.3 | 76.3 | 91.3 | 48.1 | 76.49 |\n| Sentence | 86.3 | 78.4 | 72.5 | 61.9 | 67.2 | 73.26 |\n\n## Evaluation Metrics\n\nFive intrinsic metrics score each chunking output without requiring ground-truth answers:\n\n| Metric | What it measures |\n|--------|-----------------|\n| **Size Compliance (SC)** | Fraction of chunks within target token-count bounds |\n| **Intrachunk Cohesion (ICC)** | Semantic similarity between a chunk's sentences and its overall embedding |\n| **Contextual Coherence (DCC)** | Similarity of each chunk to its surrounding context window |\n| **Block Integrity (BI)** | Proportion of structural blocks (paragraphs, tables, lists) kept intact |\n| **Filtered Missing Reference Error (RC)** | Coreference chains (entity–pronoun pairs) not broken across chunk boundaries |\n\nAll metrics are implemented in 
[`metrics.py`](src/adaptive_chunking/metrics.py) and can be extended with custom scoring functions.\n\n## Default Chunking Methods\n\n| Method | Description |\n|--------|-------------|\n| **Recursive (s=1100)** | Split-then-merge recursive splitter with a target chunk size of 1100 tokens |\n| **Recursive (s=600)** | Same splitter with a smaller 600-token target, producing more granular chunks |\n| **Page** | Splits on page breaks, with post-processing to enforce size constraints |\n| **LLM Regex** | Asks an LLM to generate document-specific regex split patterns |\n\nThe recursive splitter lives in [`splitters.py`](src/adaptive_chunking/splitters.py); the others are in [`paper/splitters.py`](src/adaptive_chunking/paper/splitters.py). You can register any callable that takes text and returns a list of chunks.\n\n## Features\n\n- **Adaptive recursive splitter** — configurable separators, merge modes, overlap, and automatic per-document strategy selection\n- **5 intrinsic quality metrics** — size compliance, intrachunk cohesion, contextual coherence, block integrity, and filtered missing reference error\n- **3 PDF parsing backends** — Docling (open-source, default), PyMuPDF (lightweight), Azure Document Intelligence (cloud), plus Excel support\n- **RAG evaluation pipeline** — hybrid retrieval with custom retrieval completeness metric and answer correctness scoring\n- **Multi-domain evaluation** — tested on technical, legal, and sustainability reporting documents\n\n## Installation\n\n```bash\ngit clone https://github.com/ekimetrics/adaptive-chunking.git\ncd adaptive-chunking\npip install -e \".[dev]\"\n```\n\nOr install only what you need:\n\n```bash\n# Core package (splitter + metrics)\npip install -e .\n\n# With coreference resolution (CC BY-NC-SA 4.0 — non-commercial)\npip install -e \".[coref]\"\n\n# With PDF/Excel parsing backends\npip install -e \".[parsing]\"\n\n# Paper reproduction (all dependencies)\npip install -e \".[paper]\"\n```\n\nSome metrics require 
spaCy models:\n\n```bash\npython -m spacy download en_core_web_sm\n```\n\n## Quick Start\n\n```python\nfrom adaptive_chunking import chunk_files\n\n# Parse PDFs and chunk in one step (requires pip install -e \".[parsing]\")\nchunks = chunk_files(\"path/to/pdfs/\", chunk_size=600, chunk_overlap=50)\n\n# Each chunk is a dict with: doc_name, chunk_index, chunk_text, chunk_pages, titles_context, chunk_len\nfor chunk in chunks:\n    print(chunk[\"doc_name\"], chunk[\"chunk_index\"], chunk[\"chunk_len\"])\n```\n\nWorks with a single file too:\n\n```python\nfrom adaptive_chunking import chunk_files\nchunks = chunk_files(\"path/to/report.pdf\")\n```\n\nChoose a different parser:\n\n```python\nfrom adaptive_chunking import chunk_files\nfrom adaptive_chunking.parsing import PyMuPDFParser\n\nchunks = chunk_files(\"path/to/pdfs/\", parser=PyMuPDFParser())\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eUsing the splitter and metrics separately\u003c/b\u003e\u003c/summary\u003e\n\n```python\nfrom adaptive_chunking.splitters import RecursiveSplitter\n\nsplitter = RecursiveSplitter(\n    chunk_size=600,\n    chunk_overlap=50,\n    separators=[\"\\n\\n\", \"\\n\", \" \", \"\"],\n    merging=\"small_only\",\n    min_chunk_tokens=100,\n)\n\nchunks = splitter.split_text(document_text)\n```\n\n```python\nfrom adaptive_chunking.metrics import (\n    compute_size_compliance,\n    compute_intrachunk_cohesion,\n    compute_block_integrity,\n)\n\nscore = compute_size_compliance(chunks, min_tokens=100, max_tokens=1100)\ncohesion = compute_intrachunk_cohesion(chunks, embedder)\nintegrity = compute_block_integrity(chunks, source_text, split_points)\n```\n\n\u003c/details\u003e\n\n## Reproducing Paper Results\n\nThe `data/clair/` directory contains 33 pre-parsed documents and pre-computed coreference mentions from the CLAIR corpus used in the paper:\n\n```\ndata/clair/\n├── adi_parsed/   # 33 parsed JSON documents\n└── mentions/     # Pre-computed maverick-coref clusters (no GPU needed)\n```\n\nTo replicate the chunking evaluation (Tables 1–3, Figure 
1):\n\n```bash\npip install -e \".[paper]\"\npython -m spacy download en_core_web_sm\n\n# Full Table 3 reproduction (GPU recommended for chunking; mentions are pre-computed)\npython -m adaptive_chunking.paper.replicate \\\n    --data-dir data/clair/ \\\n    --output-dir results/ \\\n    --steps chunking metrics raw_metrics analysis table3 \\\n    --device cuda:0\n```\n\n**Steps explained:**\n\n| Step | What it does | Notes |\n|------|-------------|-------|\n| `chunking` | Split 33 docs × 8 methods + postprocessing | GPU needed for semantic chunker |\n| `mentions` | Extract coreference mentions | Pre-computed in `data/clair/mentions/` — skip unless regenerating |\n| `metrics` | Score post-processed chunks (Table 3 `*` rows) | ~9h local · ~30 min with `JINA_API_KEY` |\n| `raw_metrics` | Score raw chunks for baseline methods (Table 3 `†` rows) | Same cost as `metrics` |\n| `analysis` | Print Tables 1–2, Figure 1 | |\n| `table3` | Print full Table 3 with locally-computed vs paper values | Requires both `metrics` + `raw_metrics` |\n| `rag` | RAG evaluation (Tables 4–5) | Expensive: hundreds of OpenAI calls + GPU |\n\nThe `metrics` and `raw_metrics` steps are resumable — if interrupted, rerun the same command and already-computed documents are skipped.\n\nTo skip the LLM regex splitter (requires `OPENAI_API_KEY`) or the semantic chunker (requires GPU + flash-attention):\n\n```bash\npython -m adaptive_chunking.paper.replicate ... 
--steps chunking --skip-llm-regex --skip-semantic\n```\n\nThe RAG evaluation (Tables 4–5) is optional and expensive (hundreds of OpenAI API calls + GPU for embeddings):\n\n```bash\npython -m adaptive_chunking.paper.replicate --data-dir data/clair/ --output-dir results/ --steps rag --device cuda:0\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePackage Structure\u003c/b\u003e\u003c/summary\u003e\n\n| Module | Description |\n|--------|-------------|\n| `adaptive_chunking.splitters` | `RecursiveSplitter` — adaptive recursive chunking with configurable separators, merge modes, and overlap |\n| `adaptive_chunking.metrics` | Quality metrics: size compliance, intrachunk cohesion, contextual coherence, block integrity, missing reference error |\n| `adaptive_chunking.parsing` | Document parsers: `DoclingParser` (default), `PyMuPDFParser` (lightweight), `AzureDIParser` (cloud), `ExcelParser` |\n| `adaptive_chunking.postprocessing` | Gap detection/repair, page/title metadata, chunk location |\n| `adaptive_chunking.compute_metrics` | Orchestrates metric computation across documents |\n| `adaptive_chunking.split_documents` | Orchestrates chunking across documents in a directory |\n| `adaptive_chunking.extract_mentions` | Coreference resolution for the missing reference error metric |\n| `adaptive_chunking.paper.*` | Paper reproduction: baseline splitters, RAG evaluation, visualization, results analysis |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eEnvironment Variables\u003c/b\u003e\u003c/summary\u003e\n\nFor full functionality, create a `.env` file:\n\n```bash\n# Azure Document Intelligence (only for AzureDIParser)\nADI_ENDPOINT=...\nADI_KEY=...\n\n# OpenAI (for LLM regex chunker and RAG evaluation)\nOPENAI_API_KEY=...\n\n# Jina AI (optional but recommended for the metrics steps)\n# If set, uses the Jina REST API instead of loading jina-embeddings-v3 locally.\n# ~30 min vs ~9 hours on RTX 4090. 
Get a key at https://jina.ai/\nJINA_API_KEY=...\n```\n\n\u003c/details\u003e\n\n## Development\n\nA [`REPLICATE_GUIDELINES.md`](REPLICATE_GUIDELINES.md) file documents the environment setup, stability constraints, key files, and detailed notes for reproducing paper results.\n\nAn [`LLM.md`](LLM.md) file provides project context for LLM-based coding assistants (architecture, key patterns, adding new components).\n\n## Testing\n\n```bash\npytest\n```\n\n## Citation\n\nIf you use this work, please cite our LREC 2026 paper:\n\n```bibtex\n@inproceedings{demoura2026adaptive,\n    title={Adaptive Chunking: Optimizing Chunking-Method Selection for RAG},\n    author={de Moura Junior, Paulo Roberto and Lelong, Jean and Blangero, Annabelle},\n    booktitle={Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)},\n    year={2026},\n    url={https://arxiv.org/abs/2603.25333},\n}\n```\n\n## License\n\nThis project is licensed under the [MIT License](LICENSE).\n\nSome **optional** extras carry different licenses:\n\n- **`[coref]`** — `maverick-coref` is CC BY-NC-SA 4.0 (non-commercial, share-alike)\n- **`[parsing]`** — `pymupdf4llm` is AGPL-3.0 or Artifex Commercial\n\nThese are not installed by default. See [NOTICE](NOTICE) and [SBOM](SBOM.md) for full details.\n\nWe are actively working on replacing these dependencies with permissively-licensed alternatives so that all scoring metrics (including coreference-based ones) can be used without copyleft restrictions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekimetrics%2Fadaptive-chunking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fekimetrics%2Fadaptive-chunking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekimetrics%2Fadaptive-chunking/lists"}