https://github.com/FareedKhan-dev/scalable-rag-pipeline
A scalable RAG platform combining LangGraph agents, hybrid retrieval (Vector+Graph), and Ray orchestration on Kubernetes.
https://github.com/FareedKhan-dev/scalable-rag-pipeline
ai-agents aws langchain multi-agent-systems python rag-pipeline
Last synced: 7 days ago
JSON representation
A scalable RAG platform combining LangGraph agents, hybrid retrieval (Vector+Graph), and Ray orchestration on Kubernetes.
- Host: GitHub
- URL: https://github.com/FareedKhan-dev/scalable-rag-pipeline
- Owner: FareedKhan-dev
- License: mit
- Created: 2025-12-27T10:54:35.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-27T18:14:56.000Z (6 months ago)
- Last Synced: 2026-03-21T23:32:10.433Z (3 months ago)
- Topics: ai-agents, aws, langchain, multi-agent-systems, python, rag-pipeline
- Language: Python
- Homepage: https://medium.com/@fareedkhandev/1168dcd36260
- Size: 2.19 MB
- Stars: 117
- Watchers: 2
- Forks: 50
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: docs/security.md
Awesome Lists containing this project
- awesome-ai-devtools - Scalable Agentic RAG Platform (FareedKhan-dev) - grade agentic RAG platform with Kubernetes, Ray, LangGraph, and multi-plane architecture. | CLI · Self-hosted | [Docs](https://github.com/FareedKhan-dev/scalable-rag-pipeline#readme) / [Repo](https://github.com/FareedKhan-dev/scalable-rag-pipeline) | (Browse The Shelves / Self-hosted AI dev stacks)
README
# Enterprise Agentic RAG Platform




## Table of Contents
1. [System Overview](#1-system-overview)
2. [RAG Methodologies & Agentic Logic](#2-rag-methodologies--agentic-logic)
3. [Prerequisites & Tooling](#3-prerequisites--tooling)
4. [Phase 1: Infrastructure Initialization (Terraform)](#4-phase-1-infrastructure-initialization-terraform)
5. [Phase 2: Cluster Bootstrapping (Kubernetes)](#5-phase-2-cluster-bootstrapping-kubernetes)
6. [Phase 3: The Data Plane (Ray & Databases)](#6-phase-3-the-data-plane-ray--databases)
7. [Phase 4: The Control Plane (API Deployment)](#7-phase-4-the-control-plane-api-deployment)
8. [Phase 5: Data Ingestion Pipeline](#8-phase-5-data-ingestion-pipeline)
9. [Validation & Testing](#9-validation--testing)
10. [Cost Optimization & Scaling](#10-cost-optimization--scaling)
11. [Troubleshooting](#11-troubleshooting)
## 1. System Overview
This repository contains the source code and Infrastructure-as-Code (IaC) definitions for a production-grade **Retrieval-Augmented Generation (RAG)** system. Unlike standard RAG implementations, this platform utilizes an **Agentic Architecture** (via LangGraph) to perform multi-step reasoning, query expansion, and hybrid retrieval (Vector + Knowledge Graph).
### High-Level Architecture
The system is decoupled into two primary processing planes:
1. **Control Plane (The Brain):** Handles HTTP requests, state management, agent orchestration, and business logic. Runs on low-cost CPU nodes.
2. **Data Plane (The Muscle):** Handles heavy compute tasks including LLM Inference, Embedding generation, and Graph Extraction. Runs on autoscaling GPU nodes via **Ray**.

---
## 2. RAG Methodologies & Agentic Logic
This platform implements advanced RAG techniques to solve common failure modes (hallucination, retrieval misses).
### 2.1. The Planning Agent (`services/api/app/agents/`)
Instead of a linear chain, we use **LangGraph** to model the RAG process as a state machine.
* **Planner Node:** Analyzes user intent. Decides whether to perform a direct answer, a retrieval, or use a tool (Code Interpreter).
* **Query Rewriter:** Uses an LLM to rewrite the user's query, resolving coreferences (e.g., changing "How much does *it* cost?" to "How much does *Kubernetes* cost?").
* **HyDE (Hypothetical Document Embeddings):** Generates a fake "ideal" answer, embeds it, and uses that vector to find real documents. This bridges the semantic gap between a question and a declarative statement.
### 2.2. Hybrid Retrieval
* **Vector Search (Qdrant):** Uses **BGE-M3** embeddings (dense retrieval) to find semantically similar text chunks.
* **Graph Search (Neo4j):** Executes Cypher queries to find entities and their relationships (e.g., `(Entity A)-[RELATED_TO]->(Entity B)`). This captures structural knowledge that vector search misses.
---
## 3. Prerequisites & Tooling
Ensure the following tools are installed on your workstation (Bastion Host):
* **AWS CLI (`v2.x`):** Configured with AdministratorAccess.
* **Terraform (`v1.5+`):** For infrastructure provisioning.
* **Kubectl (`v1.29+`):** For Kubernetes interaction.
* **Helm (`v3.x`):** For chart management.
* **Python (`3.10+`)**: For local scripting.
* **Docker:** For building container images.
---
## 4. Phase 1: Infrastructure Initialization (Terraform)
We use Terraform to provision the "Hardware" layer: VPC, EKS Control Plane, S3, and RDS.
### 4.1. Remote State Setup (Manual Step)
Terraform requires a backend to store the state file safely.
1. Log in to the **AWS Console**.
2. Navigate to **S3** and create a bucket named: `rag-platform-terraform-state-prod-001` (Must be unique globally).
3. Navigate to **DynamoDB** and create a table named `terraform-state-lock`.
* **Partition Key:** `LockID` (String).
### 4.2. Provisioning Resources
Navigate to the infrastructure directory:
```bash
cd infra/terraform
```
Initialize the backend and providers:
```bash
terraform init
```
Review the execution plan. This will show creation of:
* **VPC:** 10.0.0.0/16 with 3 Public, 3 Private, and 3 Database subnets.
* **EKS:** Cluster named `rag-platform-cluster` (Version 1.29).
* **RDS:** Aurora Postgres Serverless v2.
* **IAM:** OIDC Providers and IRSA roles.
```bash
terraform plan -var="db_password=YourStrongPassword#123" -out=tfplan
```
Apply the infrastructure (Estimated time: 20 minutes):
```bash
terraform apply tfplan
```
### 4.3. Connection Configuration
Once Terraform completes, configure `kubectl` to communicate with the new cluster:
```bash
aws eks update-kubeconfig --region us-east-1 --name rag-platform-cluster
```
Verify connectivity:
```bash
kubectl get nodes
# Expected Output: ip-10-0-x-x.ec2.internal Ready m6i.large
```
---
## 5. Phase 2: Cluster Bootstrapping (Kubernetes)
The EKS cluster is currently empty. We need to install the core system controllers.
### 5.1. Run Bootstrap Script
Execute the helper script to install **Karpenter** (Autoscaler), **KubeRay Operator**, **External Secrets**, and **Ingress Controller**.
```bash
cd scripts
chmod +x bootstrap_cluster.sh
./bootstrap_cluster.sh
```
### 5.2. Configure Karpenter Provisioners
Karpenter is responsible for analyzing unschedulable pods and spinning up EC2 instances dynamically.
**Apply the CPU Provisioner (For API & System pods):**
```bash
kubectl apply -f infra/karpenter/provisioner-cpu.yaml
```
* *Technical Detail:* This targets `m6i`, `c6i` instances and uses Spot pricing where available.
**Apply the GPU Provisioner (For AI Inference):**
```bash
kubectl apply -f infra/karpenter/provisioner-gpu.yaml
```
* *Technical Detail:* This targets `g5` (Nvidia A10G) instances. It creates a taint `nvidia.com/gpu=true:NoSchedule` to prevent non-AI pods from accidentally using expensive nodes.
---
## 6. Phase 3: The Data Plane (Ray & Databases)
We now deploy the "Muscle" of the system.
### 6.1. Deploy Vector & Graph Databases
In a full production environment, you might use Terraform managed services (AWS Neptune / Qdrant Cloud), but for this setup, we deploy HA clusters inside K8s.
```bash
# Deploy Qdrant
helm upgrade --install qdrant deploy/helm/qdrant --namespace default
# Deploy Neo4j
helm upgrade --install neo4j deploy/helm/neo4j --namespace default
```
### 6.2. Deploy Ray Cluster
The Ray Cluster consists of a Head Node (orchestrator) and Worker Groups.
```bash
kubectl apply -f deploy/ray/ray-cluster.yaml
```
*Verification:*
```bash
kubectl get pods -l ray.io/cluster=rag-ray-cluster
# Wait until the 'ray-head' pod is Running.
```
### 6.3. Deploy Model Services (Ray Serve)
We deploy two separate Ray Services. These utilize the `ServeConfigV2` specification.
**A. Embedding Service (BGE-M3):**
```bash
kubectl apply -f deploy/ray/ray-serve-embed.yaml
```
**B. LLM Service (vLLM / Llama-3-70B):**
This is the most resource-intensive step.
```bash
kubectl apply -f deploy/ray/ray-serve-llm.yaml
```
**What happens technically:**
1. The `RayService` CRD submits a request to the Ray Head.
2. Ray realizes it needs 1 GPU (`nvidia.com/gpu: 1` resource request).
3. The Ray Worker pod goes into `Pending` state.
4. **Karpenter** detects the pending pod, calls AWS Fleet API, and provisions a `g5.xlarge` instance.
5. Once the node joins (approx. 90s), the pod starts, downloads the weights (approx. 40GB) from HuggingFace, and initializes the vLLM engine (PagedAttention).
---
## 7. Phase 4: The Control Plane (API Deployment)
### 7.1. Secret Management
Create the Kubernetes Secret containing database credentials and keys.
```bash
kubectl create secret generic app-env-secret \
--from-literal=DATABASE_URL="postgresql+asyncpg://ragadmin:YourStrongPassword#123@rag-platform-cluster-postgres.cluster-xxxx.us-east-1.rds.amazonaws.com:5432/rag_db" \
--from-literal=REDIS_URL="redis://rag-redis-prod.xxxx.ng.0001.use1.cache.amazonaws.com:6379/0" \
--from-literal=NEO4J_PASSWORD="password" \
--from-literal=JWT_SECRET_KEY="$(openssl rand -hex 32)" \
--from-literal=QDRANT_HOST="qdrant" \
--from-literal=RAY_LLM_ENDPOINT="http://llm-service:8000/llm" \
--from-literal=RAY_EMBED_ENDPOINT="http://embed-service:8000/embed"
```
### 7.2. Deploy the API
Deploy the FastAPI application using Helm.
```bash
helm upgrade --install api deploy/helm/api
```
### 7.3. Apply Ingress
Configure the Load Balancer to route traffic to the API.
```bash
kubectl apply -f deploy/ingress/nginx.yaml
```
* *Note:* Get your Load Balancer DNS name via `kubectl get ingress`. Map your domain (CNAME) to this DNS.
---
## 8. Phase 5: Data Ingestion Pipeline
The system requires data to function. The ingestion pipeline is an asynchronous, distributed Ray Job.
### 8.1. Bulk Upload to S3
Upload your dataset (PDF, DOCX, HTML) to the S3 bucket created by Terraform.
```bash
# Retrieve bucket name
BUCKET_NAME=$(cd infra/terraform && terraform output -raw s3_documents_bucket_name)
# Run bulk uploader script
python scripts/bulk_upload_s3.py ./data/finance_reports $BUCKET_NAME
```
### 8.2. Trigger Ingestion Job
Normally triggered by S3 Events, we can manually submit the job to the Ray Cluster.
1. **Port Forward Ray Dashboard:**
```bash
kubectl port-forward service/rag-ray-cluster-head-svc 8265:8265
```
2. **Submit Job via Python SDK:**
```bash
python -m pipelines.jobs.s3_event_handler
```
**Technical Workflow:**
1. **Ray Data** reads binaries from S3 lazily.
2. **MapBatches (CPU):** `unstructured` library parses PDFs (OCR via Tesseract if needed) and chunks text (512 tokens).
3. **MapBatches (GPU - Embed):** Chunks are sent to the `embed-service` Actor.
4. **MapBatches (GPU - Graph):** Chunks are sent to the `llm-service` to extract `(Subject, Predicate, Object)` tuples.
5. **Write:**
* Vectors -> Qdrant (Upsert).
* Nodes/Edges -> Neo4j (MERGE queries).
---
## 9. Validation & Testing
### 9.1. Health Checks
Verify the API connects to all subsystems.
```bash
curl https:///health/readiness
# Expected: {"redis": "up", "neo4j": "up"}
```
### 9.2. End-to-End Chat Test
Perform a request to verify the Agentic flow (Authentication required).
1. **Obtain Token (Dev Mode):** Use the `jwt.py` utility or disable auth in `config.py` temporarily for testing.
2. **Curl Request:**
```bash
curl -X POST https:///api/v1/chat/stream \
-H "Content-Type: application/json" \
-d '{
"message": "Analyze the financial risks mentioned in the Q3 report.",
"session_id": "test-session-1"
}'
```
---
## 10. Cost Optimization & Scaling
The system uses aggressive scaling policies to minimize costs:
1. **Spot Instances:** `provisioner-gpu.yaml` is configured to request Spot instances (`karpenter.sh/capacity-type: spot`). This reduces GPU costs by ~70%.
2. **Scale-to-Zero:**
* The **Ray Autoscaler** is configured in `ray-serve-llm.yaml` with `min_replicas: 1` (can be 0 for dev).
* If `min_replicas` is 0 and no requests arrive, Ray kills the Pod.
* **Karpenter** sees the node is empty (TTL 30s) and terminates the EC2 instance.
---
## 11. Troubleshooting
* **Pod Pending (Insufficient CPU/Mem):** Check `kubectl describe pod `. If it says `FailedScheduling`, check if Karpenter logs show `launching node`.
* **Ray Actor Death:** Check Ray Dashboard `http://localhost:8265`. Common issue is OOM (Out Of Memory) on the GPU. Decrease `max_num_seqs` in `llama-70b.yaml`.
* **Database Connection Refused:** Ensure Security Groups in `infra/terraform/vpc.tf` allow traffic on ports 5432 (Postgres), 6333 (Qdrant), and 7687 (Neo4j) from the EKS Subnet CIDR.
---
## 12. Contributing
1. Create a feature branch (`git checkout -b feature/amazing-feature`).
2. Commit your changes.
3. Run tests (`make test`).
4. Push to the branch.
5. Open a Pull Request.
---
## License
Distributed under the MIT License. See `LICENSE` for more information.