https://github.com/dnth/rag-datakit
End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.
- Host: GitHub
- URL: https://github.com/dnth/rag-datakit
- Owner: dnth
- License: apache-2.0
- Created: 2025-08-11T08:20:37.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-09-24T03:25:38.000Z (9 months ago)
- Last Synced: 2025-11-11T11:32:28.648Z (8 months ago)
- Language: Jupyter Notebook
- Size: 1.69 MB
- Stars: 0
- Watchers: 0
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# rag-datakit
End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.
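The four stages named above (cleaning, chunking, deduplication, quality filtering) can be illustrated with a minimal standard-library sketch. This is not the `rag-datakit` API: every function name, parameter, and threshold below (`clean`, `chunk`, `deduplicate`, `quality_filter`, `curate`, `max_words`, `overlap`, `min_words`) is a hypothetical stand-in chosen for illustration, and the exact-match hashing is only one simple way to deduplicate.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# part of the rag-datakit package.
import hashlib
import re


def clean(text: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def deduplicate(chunks: list[str]) -> list[str]:
    """Drop exact duplicates (case-insensitive), keeping first occurrences."""
    seen = set()
    unique = []
    for c in chunks:
        digest = hashlib.sha256(c.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(c)
    return unique


def quality_filter(chunks: list[str], min_words: int = 5) -> list[str]:
    """Keep only chunks with at least min_words words (assumed threshold)."""
    return [c for c in chunks if len(c.split()) >= min_words]


def curate(documents: list[str]) -> list[str]:
    """Run clean -> chunk -> deduplicate -> quality_filter over raw documents."""
    chunks = []
    for doc in documents:
        chunks.extend(chunk(clean(doc)))
    return quality_filter(deduplicate(chunks))
```

Calling `curate` on a list of raw strings returns the surviving chunks in their original order. A real pipeline would likely swap the exact-match hash for near-duplicate detection and the word-count threshold for a learned or heuristic quality score.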
## Installation
To install, run:
```bash
pip install git+https://github.com/dnth/rag-datakit.git
```
## Roadmap
### Week 1
- distilabel
- Train a custom embedding model
### Week 2
Create golden set
### Week 3
Data cleaning
### Week 4
Benchmarking with existing models