https://github.com/gazelle93/various-chunking-methods
Exploring and benchmarking chunking methods for Retrieval-Augmented Generation (RAG), including fixed-size, recursive, sliding, semantic, and hybrid chunking strategies.
chunking gensim information-retrieval natural-language-processing nlp nltk rag retrieval-augmented-generation semantic-search sentence-transformers spacy
- Host: GitHub
- URL: https://github.com/gazelle93/various-chunking-methods
- Owner: gazelle93
- Created: 2025-06-11T14:56:19.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-11T18:02:28.000Z (about 1 year ago)
- Last Synced: 2025-06-11T19:03:18.553Z (about 1 year ago)
- Topics: chunking, gensim, information-retrieval, natural-language-processing, nlp, nltk, rag, retrieval-augmented-generation, semantic-search, sentence-transformers, spacy
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Overview
This repository explores various chunking strategies for improving the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) pipelines. Chunking determines how source documents are segmented before being embedded and retrieved, which can significantly affect retrieval quality and latency.
## Motivation
Chunking plays a critical role in balancing context preservation, retrieval precision, and inference cost. This project compares common and advanced methods under a controlled evaluation framework.
## Repository Structure
- `chunking_mehtods.py`: Implementations of the chunking strategies:
  - Fixed-size chunking
  - Recursive chunking
  - Sliding-window chunking
  - Topic-based chunking
  - Semantic chunking
  - Hybrid chunking
- `utils.py`: Utility functions shared across modules.
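
To make the three structural strategies above concrete, here is a minimal character-based sketch. It is not the repository's code: the function names, the `size` and `overlap` parameters, and the separator order are assumptions chosen for illustration, and the recursive variant drops separators and does not merge small pieces back together.

```python
from typing import List, Sequence


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Fixed-size windows where each chunk repeats the last `overlap`
    characters of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


def recursive_chunks(
    text: str,
    size: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first; pieces that are still longer
    than `size` are split again with the next, finer separator."""
    if len(text) <= size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard fixed-size split.
        return fixed_size_chunks(text, size)
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(sep):
        chunks.extend(recursive_chunks(piece, size, rest))
    return chunks
```

For example, `sliding_window_chunks("abcdefghij", 4, 2)` yields four windows that each share two characters with their neighbor, which illustrates the redundancy cost of overlapping segments.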
## Methods Compared
| Chunking Method | Strategy | Pros | Cons |
|----------------------|--------------------------------------|----------------------------------|----------------------------------|
| Fixed-size | Uniform length split | Simple, fast | Can break semantic units |
| Recursive | Uses hierarchical splitting rules | Maintains structure | Slower, heuristic-based |
| Sliding window | Overlapping segments | High recall | Increases redundancy |
| Topic-based | Clusters sentences by semantic similarity | Groups text by meaningful topics | Requires embedding + clustering; variable chunk sizes |
| Semantic | Embedding-based or topic-aware | Semantic coherence | More complex to implement |
| Hybrid | Text-structure + semantic similarity | Balanced, readable and coherent | More complex logic and slower |
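
As a rough illustration of the semantic row in the table, the sketch below starts a new chunk whenever two adjacent sentences fall below a similarity threshold. Everything here is an assumption for illustration, not the repository's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence encoder, and the `threshold` default is arbitrary.

```python
import re
from collections import Counter
from math import sqrt
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Hypothetical stand-in for a sentence encoder: lowercase word counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,
) -> List[str]:
    """Group consecutive sentences; break where adjacent similarity < threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            groups.append([sentence])    # topic shift: start a new chunk
        else:
            groups[-1].append(sentence)  # still similar: extend current chunk
        prev = cur
    return [" ".join(group) for group in groups]
```

Because chunk boundaries depend on the text rather than on a length budget, chunk sizes vary, which matches the trade-off noted in the table. A learned encoder would replace `toy_embed` and `cosine` with dense-vector equivalents.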
## Prerequisites
- spacy
- nltk
- sentence-transformers
- numpy
- scikit-learn