https://github.com/gazelle93/various-chunking-methods

Exploring and benchmarking chunking methods for Retrieval-Augmented Generation (RAG), including fixed-size, recursive, sliding, semantic, and hybrid chunking strategies.
chunking gensim information-retrieval natural-language-processing nlp nltk rag retrieval-augmented-generation semantic-search sentence-transformers spacy


Host: GitHub
URL: https://github.com/gazelle93/various-chunking-methods
Owner: gazelle93
Created: 2025-06-11T14:56:19.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-11T18:02:28.000Z (about 1 year ago)
Last Synced: 2025-06-11T19:03:18.553Z (about 1 year ago)
Topics: chunking, gensim, information-retrieval, natural-language-processing, nlp, nltk, rag, retrieval-augmented-generation, semantic-search, sentence-transformers, spacy
Language: Python
Homepage:
Size: 21.5 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Overview
This repository explores various chunking strategies for improving the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) pipelines. Chunking determines how source documents are segmented before being embedded and retrieved, which can significantly affect retrieval quality and latency.

## Motivation

Chunking plays a critical role in balancing context preservation, retrieval precision, and inference cost. This project compares common and advanced methods under a controlled evaluation framework.

## Repository Structure

- `chunking_mehtods.py`: Contains implementations of chunking strategies such as:
  - Fixed-size chunking
  - Recursive chunking
  - Sliding chunking
  - Topic-based chunking
  - Semantic chunking
  - Hybrid chunking
- `utils.py`: Utility functions shared across modules.
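
As an illustration of the two simplest strategies listed above, the following is a minimal sketch of fixed-size and sliding chunking over whitespace-separated words. The function names `fixed_size_chunks` and `sliding_window_chunks` and the parameters `chunk_size` and `overlap` are hypothetical and are not taken from `chunking_mehtods.py`; the repository may split on characters or tokens instead.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word windows that share `overlap` words with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

With `overlap=0` the sliding variant reduces to fixed-size chunking; a larger overlap repeats more context across neighbouring chunks, which is the redundancy cost noted for this method.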

## Methods Compared

| Chunking Method | Strategy | Pros | Cons |
|----------------------|--------------------------------------|----------------------------------|----------------------------------|
| Fixed-size | Uniform length split | Simple, fast | Can break semantic units |
| Recursive | Uses hierarchical splitting rules | Maintains structure | Slower, heuristic-based |
| Sliding window | Overlapping segments | High recall | Increases redundancy |
| Topic-based | Clusters sentences by semantic similarity | Groups text by meaningful topics | Requires embedding + clustering; variable chunk sizes |
| Semantic | Embedding-based or topic-aware | Semantic coherence | More complex to implement |
| Hybrid | Text-structure + semantic similarity | Balanced, readable and coherent | More complex logic and slower |
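
To make the "hierarchical splitting rules" of the recursive row more concrete, here is a hedged sketch of one common way to do it: try the coarsest separator first (paragraphs), and fall back to finer ones (lines, sentences, words) only for pieces that are still too long. The separator order, the `max_chars` limit, and the name `recursive_chunks` are assumptions made for illustration, not the repository's actual implementation.

```python
from typing import List, Sequence

# Assumed hierarchy: paragraphs, then lines, then sentences, then words.
DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_chunks(
    text: str,
    max_chars: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most `max_chars` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator does not divide the text; try the next, finer one.
        return recursive_chunks(text, max_chars, finer)

    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep merging small pieces into one chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
    if buffer:
        chunks.append(buffer)
    return chunks
```

Note that in this sketch the separator at a chunk boundary is dropped, so a chunk that ends at a sentence split loses its trailing period. Whether to keep separators is a design choice that depends on how the chunks are displayed or embedded.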

## Prerequisites
- spacy
- nltk
- sentence-transformers
- numpy
- scikit-learn
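
Of the prerequisites above, sentence-transformers would typically supply the sentence embeddings that semantic chunking relies on. The sketch below shows one possible form of that idea: start a new chunk whenever the cosine similarity between consecutive sentence embeddings drops below a threshold. The names `semantic_chunks` and `split_sentences`, the `threshold` default, and the regex sentence splitter are all assumptions for illustration; the embedding function is passed in as a parameter so the sketch runs without any third-party package.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(
    text: str,
    embed: Callable[[str], Vector],
    threshold: float = 0.5,
) -> List[str]:
    """Group consecutive sentences whose embeddings are similar enough."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, vec) >= threshold:
            groups[-1].append(sentence)  # same topic: extend the current chunk
        else:
            groups.append([sentence])  # similarity dropped: start a new chunk
    return [" ".join(group) for group in groups]
```

In practice `embed` could be the `encode` method of a `SentenceTransformer` model, and a sentence splitter from spaCy or NLTK would likely be more robust than the regex used here.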
