https://github.com/dnth/rag-datakit

End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.


Host: GitHub
URL: https://github.com/dnth/rag-datakit
Owner: dnth
License: apache-2.0
Created: 2025-08-11T08:20:37.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-09-24T03:25:38.000Z (9 months ago)
Last Synced: 2025-11-11T11:32:28.648Z (8 months ago)
Language: Jupyter Notebook
Size: 1.69 MB
Stars: 0
Watchers: 0
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# rag-datakit
End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.
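
As a rough illustration of the four stages named above (cleaning, chunking, deduplication, quality filtering), here is a minimal standard-library sketch. It is not the rag-datakit API: every function name, parameter, and threshold below is hypothetical and chosen only for illustration, and the deduplication shown is exact-match hashing rather than whatever method the project uses.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def deduplicate(chunks: list[str]) -> list[str]:
    """Drop exact duplicates (case-insensitive), keeping the first occurrence."""
    seen: set[str] = set()
    unique = []
    for c in chunks:
        key = hashlib.sha256(c.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique


def quality_filter(chunks: list[str], min_words: int = 3) -> list[str]:
    """Keep chunks with at least min_words words (an assumed, simplistic criterion)."""
    return [c for c in chunks if len(c.split()) >= min_words]


def curate(docs: list[str], max_words: int = 50, min_words: int = 3) -> list[str]:
    """Run clean -> chunk -> deduplicate -> quality_filter over raw documents."""
    chunks = [piece for doc in docs for piece in chunk(clean(doc), max_words)]
    return quality_filter(deduplicate(chunks), min_words)
```

Calling `curate(list_of_raw_strings)` returns the surviving chunks in their original order. A real pipeline would likely swap in near-duplicate detection and a learned or heuristic quality score, but the stage ordering stays the same.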

## Installation
To install, run:

```bash
pip install git+https://github.com/dnth/rag-datakit.git
```

## Roadmap

### Week 1
- distilabel
- Training a custom embedding model

### Week 2
Create golden set

### Week 3
Data cleaning

### Week 4
Benchmarking with existing models
