https://github.com/goswamipronnoy/rag-data-ingestion-pipeline
This repository contains a scalable and modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications.
- Host: GitHub
- URL: https://github.com/goswamipronnoy/rag-data-ingestion-pipeline
- Owner: goswamipronnoy
- License: mit
- Created: 2025-03-07T03:54:47.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-07T04:03:29.000Z (3 months ago)
- Last Synced: 2025-03-07T05:18:38.814Z (3 months ago)
- Topics: ai, ai-infrastructure, distributed-computing, distributed-systems, etl-pipeline, llm, machine-learning, mlops, natural-language-processing, open-search, postgresql, rag, ray
- Language: Python
- Homepage:
- Size: 3.91 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# RAG Data Ingestion Pipeline
## Overview
This project implements a Retrieval-Augmented Generation (RAG) data ingestion pipeline using Ray, OpenSearch, and PostgreSQL with pgvector for large-scale ML workloads.

## Project Structure
```
rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py            # Converts JSONL to Parquet
│   ├── embeddings.py         # Handles embedding generation with Ray
│   ├── opensearch_store.py   # Stores embeddings in OpenSearch
│   ├── pgvector_store.py     # Stores embeddings in PostgreSQL
│   └── pipeline.py           # Main script to run the ingestion pipeline
├── requirements.txt          # Python dependencies
└── README.md
```

## Features
- Efficient embedding generation using distributed processing
- Storage in OpenSearch for ANN retrieval
- Storage in PostgreSQL with pgvector for k-NN searches

## Setup
### 1. Install dependencies
```
pip install -r requirements.txt
```
### 2. Convert Data
```
python src/convert.py
```
### 3. Run Ingestion Pipeline
```
python src/pipeline.py
```
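`pipeline.py` itself is not shown above; the sketch below illustrates, under stated assumptions, the backend-independent pieces such a pipeline needs: batching rows so each batch can be embedded by one distributed worker, formatting vectors as pgvector text literals, and shaping documents into OpenSearch `_bulk` actions. All names here (`batched`, `to_pgvector_literal`, `build_bulk_actions`) are hypothetical, not the repository's API; the Ray embedding call and the database clients are indicated only in comments.

```python
# Illustrative helpers only -- the names are hypothetical, not the
# repository's actual API. The Ray embedding call and the
# OpenSearch/PostgreSQL clients are sketched in comments below.

def batched(rows, size=256):
    """Yield fixed-size batches; each batch would go to one embedding worker."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_pgvector_literal(vec):
    """pgvector accepts vectors as text literals of the form '[v1,v2,...]'."""
    return "[" + ",".join(repr(float(v)) for v in vec) + "]"


def build_bulk_actions(index, docs):
    """Shape docs into the action/source line pairs the OpenSearch
    _bulk API expects: one metadata line, then one document line."""
    actions = []
    for doc in docs:
        actions.append({"index": {"_index": index, "_id": doc["id"]}})
        actions.append({"text": doc["text"], "embedding": doc["embedding"]})
    return actions


# In the real pipeline, each batch would be embedded in parallel (e.g. via
# ray.remote tasks or Ray Data), then written with opensearch-py and a
# PostgreSQL driver, roughly:
#   INSERT INTO documents (id, text, embedding)
#   VALUES (%s, %s, %s::vector)
```

Keeping the batching and serialization logic pure like this makes it easy to unit-test the pipeline without a running Ray cluster or database.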