https://github.com/goswamipronnoy/rag-data-ingestion-pipeline
This repository contains a scalable and modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications.
- Host: GitHub
- URL: https://github.com/goswamipronnoy/rag-data-ingestion-pipeline
- Owner: goswamipronnoy
- License: mit
- Created: 2025-03-07T03:54:47.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-07T04:03:29.000Z (3 months ago)
- Last Synced: 2025-03-07T05:18:38.814Z (3 months ago)
- Topics: ai, ai-infrastructure, distributed-computing, distributed-systems, etl-pipeline, llm, machine-learning, mlops, natural-language-processing, open-search, postgresql, rag, ray
- Language: Python
- Homepage:
- Size: 3.91 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# RAG Data Ingestion Pipeline
## Overview
This project implements a Retrieval-Augmented Generation (RAG) data ingestion pipeline using Ray, OpenSearch, and PostgreSQL with pgvector for large-scale ML workloads.

## Project Structure
```
rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py            # Converts JSONL to Parquet
│   ├── embeddings.py         # Handles embedding generation with Ray
│   ├── opensearch_store.py   # Stores embeddings in OpenSearch
│   ├── pgvector_store.py     # Stores embeddings in PostgreSQL
│   └── pipeline.py           # Main script to run the ingestion pipeline
├── requirements.txt          # Python dependencies
└── README.md
```

## Features
- Efficient embedding generation using distributed processing
- Storage in OpenSearch for ANN retrieval
- Storage in PostgreSQL with pgvector for k-NN searches

## Setup
### 1. Install dependencies
```
pip install -r requirements.txt
```
### 2. Convert Data
```
python src/convert.py
```
### 3. Run Ingestion Pipeline
```
python src/pipeline.py
```
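`pipeline.py` itself is not shown above; the sketch below illustrates, under stated assumptions, the backend-independent pieces such a pipeline needs: batching rows so each batch can be embedded by one distributed worker, formatting vectors as pgvector text literals, and shaping documents into OpenSearch `_bulk` actions. All names here (`batched`, `to_pgvector_literal`, `build_bulk_actions`) are hypothetical, not the repository's API; the Ray embedding call and the database clients are indicated only in comments.

```python
# Illustrative helpers only -- the names are hypothetical, not the
# repository's actual API. The Ray embedding call and the
# OpenSearch/PostgreSQL clients are sketched in comments below.

def batched(rows, size=256):
    """Yield fixed-size batches; each batch would go to one embedding worker."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_pgvector_literal(vec):
    """pgvector accepts vectors as text literals of the form '[v1,v2,...]'."""
    return "[" + ",".join(repr(float(v)) for v in vec) + "]"


def build_bulk_actions(index, docs):
    """Shape docs into the action/source line pairs the OpenSearch
    _bulk API expects: one metadata line, then one document line."""
    actions = []
    for doc in docs:
        actions.append({"index": {"_index": index, "_id": doc["id"]}})
        actions.append({"text": doc["text"], "embedding": doc["embedding"]})
    return actions


# In the real pipeline, each batch would be embedded in parallel (e.g. via
# ray.remote tasks or Ray Data), then written with opensearch-py and a
# PostgreSQL driver, roughly:
#   INSERT INTO documents (id, text, embedding)
#   VALUES (%s, %s, %s::vector)
```

Keeping the batching and serialization logic pure like this makes it easy to unit-test the pipeline without a running Ray cluster or database.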