https://github.com/dnth/rag-datakit

End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.

# rag-datakit
End-to-end data curation pipeline for building training datasets that produce better embedding models for RAG applications. Includes data cleaning, chunking, deduplication, and quality filtering tools.
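The four stages named above can be illustrated with a minimal, standard-library-only sketch. This is not the rag-datakit API: every function name here (`clean`, `chunk`, `deduplicate`, `is_high_quality`, `curate`) and every threshold is a hypothetical choice made for illustration, and the deduplication shown is exact-match only (near-duplicate detection would need something like MinHash).

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of max_words that overlap by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    # Stop early enough that the last window is never fully contained in the previous one.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + max_words]) for i in starts]


def deduplicate(chunks: list[str]) -> list[str]:
    """Drop exact duplicates (case-insensitive), keeping the first occurrence."""
    seen: set[str] = set()
    unique: list[str] = []
    for c in chunks:
        key = hashlib.sha256(c.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique


def is_high_quality(c: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: enough words, and mostly letters or spaces."""
    if len(c.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in c)
    return alpha / len(c) >= min_alpha_ratio


def curate(docs: list[str], max_words: int = 50, overlap: int = 10) -> list[str]:
    """Run clean -> chunk -> deduplicate -> quality filter over raw documents."""
    chunks = [c for d in docs for c in chunk(clean(d), max_words, overlap)]
    return [c for c in deduplicate(chunks) if is_high_quality(c)]
```

Calling `curate(list_of_raw_strings)` returns the surviving chunks in their original order. Keeping each stage as a separate function makes it easy to swap one heuristic for another without touching the rest.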

## Installation
To install, run:

```bash
pip install git+https://github.com/dnth/rag-datakit.git
```

## Roadmap

### Week 1
- distilabel
- Train a custom embedding model

### Week 2
Create a golden set

### Week 3
Data cleaning

### Week 4
Benchmark against existing models
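
One common way to compare embedding models against a golden set is recall@k: the fraction of queries whose known-relevant document appears in the top k results. The sketch below is an assumption about how such a benchmark might look, not code from this repository. The helper names (`cosine`, `recall_at_k`), the one-relevant-document-per-query layout of `golden`, and the toy vectors in the usage note are all hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    query_vecs: dict[str, list[float]],
    doc_vecs: dict[str, list[float]],
    golden: dict[str, str],
    k: int = 1,
) -> float:
    """Fraction of queries whose golden document ranks in the top k by cosine similarity.

    `golden` maps each query id to the id of its single relevant document.
    """
    if not query_vecs:
        raise ValueError("query_vecs must not be empty")
    hits = 0
    for qid, qvec in query_vecs.items():
        # Rank all document ids from most to least similar to this query.
        ranked = sorted(doc_vecs, key=lambda did: cosine(qvec, doc_vecs[did]), reverse=True)
        if golden[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

To compare models, embed the same queries and documents with each one and call `recall_at_k` on the resulting vectors. For example, with `doc_vecs = {"d1": [1, 0], "d2": [0, 1], "d3": [1, 1]}`, `query_vecs = {"q1": [0.9, 0.1], "q2": [0.2, 0.8]}`, and `golden = {"q1": "d1", "q2": "d3"}`, recall@1 is 0.5 and recall@2 is 1.0.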