{"id":28717443,"url":"https://github.com/gazelle93/various-chunking-methods","last_synced_at":"2026-05-07T13:04:45.913Z","repository":{"id":298551138,"uuid":"1000294677","full_name":"gazelle93/Various-Chunking-Methods","owner":"gazelle93","description":"Exploring and benchmarking chunking methods for Retrieval-Augmented Generation (RAG), including fixed-size, recursive, sliding, semantic, and hybrid chunking strategies.","archived":false,"fork":false,"pushed_at":"2025-06-11T18:02:28.000Z","size":22,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-11T19:03:18.553Z","etag":null,"topics":["chunking","gensim","information-retrieval","natural-language-processing","nlp","nltk","rag","retrieval-augmented-generation","semantic-search","sentence-transformers","spacy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gazelle93.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-11T14:56:19.000Z","updated_at":"2025-06-11T18:02:31.000Z","dependencies_parsed_at":"2025-06-11T19:03:21.353Z","dependency_job_id":"3193292b-b735-4fda-841e-d807091daa49","html_url":"https://github.com/gazelle93/Various-Chunking-Methods","commit_stats":null,"previous_names":["gazelle93/various-chunking-methods"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gazelle93/Various-Chunking-Methods","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gazelle93%2FVarious-Ch
unking-Methods","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gazelle93%2FVarious-Chunking-Methods/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gazelle93%2FVarious-Chunking-Methods/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gazelle93%2FVarious-Chunking-Methods/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gazelle93","download_url":"https://codeload.github.com/gazelle93/Various-Chunking-Methods/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gazelle93%2FVarious-Chunking-Methods/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274411319,"owners_count":25280108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","gensim","information-retrieval","natural-language-processing","nlp","nltk","rag","retrieval-augmented-generation","semantic-search","sentence-transformers","spacy"],"created_at":"2025-06-15T04:00:31.876Z","updated_at":"2026-05-07T13:04:40.877Z","avatar_url":"https://github.com/gazelle93.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\r\nThis repository explores various chunking strategies for improving the 
efficiency and effectiveness of Retrieval-Augmented Generation (RAG) pipelines. Chunking determines how source documents are segmented before being embedded and retrieved, which can significantly affect retrieval quality and latency.\r\n\r\n## Motivation\r\n\r\nChunking plays a critical role in balancing context preservation, retrieval precision, and inference cost. This project compares common and advanced methods under a controlled evaluation framework.\r\n\r\n## Repository Structure\r\n\r\n- `chunking_mehtods.py`: Contains implementations of chunking strategies such as:\r\n  - Fixed-size chunking\r\n  - Recursive chunking\r\n  - Sliding chunking\r\n  - Topic-based chunking\r\n  - Semantic chunking\r\n  - Hybrid chunking\r\n- `utils.py`: Utility functions shared across modules.\r\n\r\n## Methods Compared\r\n\r\n| Chunking Method      | Strategy                             | Pros                            | Cons                             |\r\n|----------------------|--------------------------------------|----------------------------------|----------------------------------|\r\n| Fixed-size           | Uniform length split                 | Simple, fast                    | Can break semantic units         |\r\n| Recursive            | Uses hierarchical splitting rules    | Maintains structure             | Slower, heuristic-based          |\r\n| Sliding window       | Overlapping segments                 | High recall                     | Increases redundancy             |\r\n| Topic-based          | Clusters sentences by semantic similarity | Groups text by meaningful topics | Requires embedding + clustering; variable chunk sizes |\r\n| Semantic             | Embedding-based or topic-aware       | Semantic coherence              | More complex to implement        |\r\n| Hybrid               | Text-structure + semantic similarity | Balanced, readable and coherent | More complex logic and slower    |\r\n\r\n\r\n\r\n## Prerequisites\r\n- spacy\r\n- nltk\r\n- 
sentence-transformers\r\n- numpy\r\n- scikit-learn","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgazelle93%2Fvarious-chunking-methods","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgazelle93%2Fvarious-chunking-methods","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgazelle93%2Fvarious-chunking-methods/lists"}