{"id":47967874,"url":"https://github.com/amitgambhir/rag-auditor","last_synced_at":"2026-04-04T10:39:28.940Z","repository":{"id":343987845,"uuid":"1179990776","full_name":"amitgambhir/rag-auditor","owner":"amitgambhir","description":"Open source RAG evaluation platform — automatically score faithfulness, relevancy, and hallucination risk","archived":false,"fork":false,"pushed_at":"2026-03-12T21:37:16.000Z","size":112,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-12T21:51:08.715Z","etag":null,"topics":["fastapi","generative-ai","hallucination-detection","llm-evaluation","python","rag","ragas"],"latest_commit_sha":null,"homepage":"https://amitgambhir.com","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amitgambhir.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-12T15:32:45.000Z","updated_at":"2026-03-12T21:37:19.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/amitgambhir/rag-auditor","commit_stats":null,"previous_names":["amitgambhir/rag-auditor"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/amitgambhir/rag-auditor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amitgambhir%2Frag-auditor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amitgambhir%2Frag-auditor/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amitgambhir%2Frag-auditor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amitgambhir%2Frag-auditor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amitgambhir","download_url":"https://codeload.github.com/amitgambhir/rag-auditor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amitgambhir%2Frag-auditor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31397055,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","generative-ai","hallucination-detection","llm-evaluation","python","rag","ragas"],"created_at":"2026-04-04T10:39:28.392Z","updated_at":"2026-04-04T10:39:28.933Z","avatar_url":"https://github.com/amitgambhir.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🔍 RAG Auditor\n\n**Know if your RAG is production-ready before your users find out it isn't.**\n\n[![MIT License](https://img.shields.io/badge/license-MIT-00d4aa.svg)](LICENSE)\n[![Python 
3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://python.org)\n[![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-009688.svg)](https://fastapi.tiangolo.com)\n[![Powered by RAGAS](https://img.shields.io/badge/eval-RAGAS-7c3aed.svg)](https://docs.ragas.io)\n[![Powered by Claude](https://img.shields.io/badge/judge-Claude%20AI-d97706.svg)](https://anthropic.com)\n\n\u003c/div\u003e\n\n---\n\n## Table of Contents\n\n1. [The Problem](#the-problem)\n2. [What It Does](#what-it-does)\n3. [Demo](#demo)\n4. [Key Features](#key-features)\n5. [Built On](#built-on)\n6. [Quickstart](#quickstart)\n7. [Architecture](#architecture)\n8. [Step-by-Step Testing Guide](#step-by-step-testing-guide)\n   - [Step 1 — Set Up Your Environment](#step-1--set-up-your-environment)\n   - [Step 2 — Start the Backend](#step-2--start-the-backend)\n   - [Step 3 — Generate a Synthetic Golden Dataset](#step-3--generate-a-synthetic-golden-dataset)\n   - [Step 4 — Evaluate a Single RAG Response](#step-4--evaluate-a-single-rag-response)\n   - [Step 5 — Evaluate a Batch](#step-5--evaluate-a-batch)\n   - [Step 6 — Compare Two Evaluations](#step-6--compare-two-evaluations)\n   - [Step 7 — Run the Automated Test Suite](#step-7--run-the-automated-test-suite)\n9. [Understanding Verdicts](#understanding-verdicts)\n10. [RAGAS Metrics Explained](#ragas-metrics-explained)\n11. [Interpreting Recommendations](#interpreting-recommendations)\n12. [How LLM-as-Judge Works](#how-llm-as-judge-works)\n13. [Integration: Evaluating multi-llm-rag-agent-chat](#integration-evaluating-multi-llm-rag-agent-chat)\n14. [Key Design Decisions](#key-design-decisions)\n15. [Extending the System](#extending-the-system)\n16. [API Reference](#api-reference)\n17. [Key Files](#key-files)\n18. [Configuration](#configuration)\n19. 
[Contributing](#contributing)\n\n---\n\n## The Problem\n\nMost RAG systems ship broken.\n\nNot broken in obvious ways — broken in the ways that matter: answers that sound confident\nbut contradict the source documents, retrieved chunks that miss the point entirely,\nprompts that quietly hallucinate under edge cases.\n\nThe teams building these systems usually know something's off. They just have no way to\n*measure* it systematically — so they eyeball outputs, cross their fingers, and ship.\n\n**RAG Auditor fixes that.**\n\n---\n\n## What It Does\n\nRAG Auditor is an open source evaluation platform that automatically scores your RAG\npipeline across the four dimensions that predict real-world quality — then tells you\nexactly what to fix and how.\n\n```\nInput:  Your question  +  Retrieved context chunks  +  RAG-generated answer\nOutput: A production-readiness verdict with scored diagnostics and fix recommendations\n```\n\n### Evaluation Dimensions\n\n| Metric | What It Measures | Why It Matters |\n|--------|-----------------|----------------|\n| **Faithfulness** | Is the answer grounded in the retrieved context? | Catches hallucinations |\n| **Answer Relevancy** | Does the answer actually address the question? | Catches non-answers |\n| **Context Precision** | Are retrieved chunks signal or noise? | Catches retrieval bloat |\n| **Context Recall** | Did retrieval surface the right information? | Catches retrieval gaps |\n| **Hallucination Risk** | `LOW` / `MEDIUM` / `HIGH` classification | Human-readable safety check |\n\nEach score comes with a plain-English explanation and a specific, actionable recommendation.\n\n---\n\n## Demo\n\n\u003e ⚡ *Paste a question, context, and answer. 
Get a verdict in ~10 seconds.*\n\n```\nQuestion:    \"What is our refund policy for digital products?\"\nContext:     [3 retrieved chunks from your knowledge base]\nAnswer:      \"You can get a refund within 30 days, no questions asked.\"\n\n──────────────────────────────────────────────\n  Overall Score      0.84    ✓ READY\n──────────────────────────────────────────────\n  Faithfulness       0.91    ✓ Strong\n  Answer Relevancy   0.88    ✓ Strong\n  Context Precision  0.67    ⚠ Warning\n  Context Recall     0.79    ● Review\n  Hallucination Risk  LOW    ✓ Safe\n──────────────────────────────────────────────\n  Top Issue: Context precision is low — your retriever is pulling in\n  irrelevant chunks alongside the relevant ones. Try reducing top-k\n  from 5 to 3, or add a reranking step before generation.\n──────────────────────────────────────────────\n```\n\n---\n\n## Key Features\n\n- **Single-sample evaluator** — paste one Q/context/answer, get instant scores\n- **Batch evaluation** — upload CSV/JSON for bulk pipeline testing\n- **Golden dataset generator** — paste your source docs, auto-generate synthetic Q\u0026A\n  test pairs (the #1 reason teams skip evals is they have no dataset — this removes\n  that blocker entirely)\n- **Compare mode** — run before/after evals when you change chunking, top-k, or prompts;\n  see exact delta per metric\n- **Evaluation history** — save results in the current browser session and restore them\n  into the evaluator to compare runs or keep iterating\n- **RAG trace visualization** — see scores annotated at each stage:\n  Query → Retrieval → Prompt Construction → Generation → Answer\n- **LLM-as-judge** — Claude evaluates hallucination risk with reasoning, not just a number\n- **Fix recommendations** — every low score maps to a specific, actionable suggestion\n\n---\n\n## Built On\n\nRAG Auditor is a product layer built on top of battle-tested open source infrastructure:\n\n- **[RAGAS](https://docs.ragas.io)** — the leading 
RAG evaluation framework, providing\n  the core faithfulness, relevancy, precision, and recall metrics\n- **[Claude](https://anthropic.com)** (`claude-sonnet-4-6`) — LLM-as-judge for\n  hallucination detection and plain-English explanations\n- **[FastAPI](https://fastapi.tiangolo.com)** — async Python backend with SSE streaming\n- **[React](https://react.dev) + [Recharts](https://recharts.org)** — dashboard UI\n\n\u003e RAGAS provides the scoring science. Claude provides the reasoning layer.\n\u003e RAG Auditor provides the product experience that makes both usable without a PhD.\n\n---\n\n## Quickstart\n\n**Docker (recommended):**\n```bash\ngit clone https://github.com/amitgambhir/rag-auditor\ncd rag-auditor\ncp .env.example .env          # Add your ANTHROPIC_API_KEY\ndocker-compose up\n```\n\nOpen `http://localhost:3000` — that's it.\n\n**Local development:**\n```bash\n# Backend\ncd backend\npython -m venv venv \u0026\u0026 source venv/bin/activate\npip install -r requirements.txt\ncp .env.example .env\nuvicorn main:app --reload --port 8000\n\n# Frontend (new terminal)\ncd frontend \u0026\u0026 npm install \u0026\u0026 npm run dev\n```\n\n---\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────┐\n│                  React UI                   │\n│  Evaluator · Batch · Generator · Compare    │\n└──────────────────┬──────────────────────────┘\n                   │ REST + SSE\n┌──────────────────▼──────────────────────────┐\n│               FastAPI Backend               │\n│                                             │\n│  ┌─────────────┐    ┌────────────────────┐  │\n│  │    RAGAS    │    │   Claude (Judge)   │  │\n│  │  Evaluator  │    │  Hallucination     │  │\n│  │             │    │  Detection +       │  │\n│  │ Faithfulness│    │  Recommendations   │  │\n│  │ Relevancy   │    │  Plain-English     │  │\n│  │ Precision   │    │  Explanations      │  │\n│  │ Recall      │    └────────────────────┘  │\n│  └─────────────┘                       
     │\n│         └──────────── asyncio.gather() ─────┘\n│                                             │\n│  ┌──────────────────────────────────────┐   │\n│  │     Recommendation Engine            │   │\n│  │  Score → Root Cause → Specific Fix   │   │\n│  └──────────────────────────────────────┘   │\n└─────────────────────────────────────────────┘\n```\n\n---\n\n## Step-by-Step Testing Guide\n\nThis section walks you through the full workflow: generating a dataset, evaluating it, and running automated tests.\n\n### Prerequisites\n\n- Python 3.11+\n- Node.js 18+\n- An [Anthropic API key](https://console.anthropic.com/)\n\n---\n\n### Step 1 — Set Up Your Environment\n\n```bash\ncd backend\npython -m venv venv\nsource venv/bin/activate          # Windows: venv\\Scripts\\activate\npip install -r requirements.txt\ncp .env.example .env\n```\n\nEdit `.env` and set:\n```\nANTHROPIC_API_KEY=sk-ant-...\nCORS_ORIGINS=http://localhost:3000\n```\n\n---\n\n### Step 2 — Start the Backend\n\n```bash\ncd backend\nuvicorn main:app --reload --port 8000\n```\n\nVerify it's running:\n```bash\ncurl http://localhost:8000/health\n# → {\"status\": \"ok\"}\n```\n\nInteractive API docs are available at http://localhost:8000/docs\n\n---\n\n### Step 3 — Generate a Synthetic Golden Dataset\n\nBefore you can evaluate your RAG system, you need Q\u0026A pairs with ground truth. Use the Dataset Generator to create them from your own source documents.\n\n#### Via the UI\n\n1. Open http://localhost:3000\n2. Click **Dataset Generator** in the nav\n3. Paste 1–5 source documents (e.g., product docs, knowledge-base articles)\n4. Set the number of Q\u0026A pairs (1–100)\n5. Click **Generate**\n6. 
Download as JSON or CSV\n\nCSV exports store `contexts` as a JSON array string so they round-trip cleanly into the Batch Evaluator.\nThe Batch Evaluator also accepts the older pipe-delimited `contexts` format for compatibility.\n\n#### Via curl\n\n```bash\ncurl -X POST http://localhost:8000/generate-dataset \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"documents\": [\n      \"RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models. The retrieval step fetches relevant documents from a knowledge base. The generation step uses the retrieved documents as context to produce an answer.\",\n      \"Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 1.0 means the answer is fully grounded in the provided documents.\"\n    ],\n    \"num_questions\": 5\n  }'\n```\n\n**Response structure:**\n```json\n{\n  \"pairs\": [\n    {\n      \"question\": \"What does RAG stand for?\",\n      \"answer\": \"RAG stands for Retrieval-Augmented Generation.\",\n      \"ground_truth\": \"RAG stands for Retrieval-Augmented Generation.\",\n      \"contexts\": [\"RAG stands for Retrieval-Augmented Generation...\"],\n      \"evolution_type\": \"simple\"\n    }\n  ],\n  \"total\": 5,\n  \"source_documents\": 2\n}\n```\n\nThe generator first attempts RAGAS `TestsetGenerator` (which creates diverse question types: simple, reasoning, multi-context). If RAGAS is unavailable it falls back to Claude directly. 
Either path produces the same output format.\n\n#### Save the dataset for reuse\n\n```bash\ncurl -X POST http://localhost:8000/generate-dataset \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"documents\": [\"...your text...\"], \"num_questions\": 10}' \\\n  -o my_dataset.json\n```\n\n#### Dataset evolution types\n\n| Type | Description | Distribution |\n|---|---|---|\n| `simple` | Direct factual questions from one document | 50% |\n| `reasoning` | Questions requiring inference or multi-step thinking | 30% |\n| `multi_context` | Questions that require combining multiple documents | 20% |\n\nEach pair contains:\n- `question` — the user query to pose to your RAG system\n- `answer` — a reference answer (useful for comparison)\n- `ground_truth` — the canonical correct answer (used for context recall)\n- `contexts` — the source passages (use these as your \"retrieved chunks\" in evaluation)\n- `evolution_type` — question category\n\n\u003e **Tip:** To evaluate your own RAG system with the generated dataset, replace `contexts` with the chunks your retriever actually returns, and replace `answer` with what your LLM generates. Keep `ground_truth` as-is.\n\n---\n\n### Step 4 — Evaluate a Single RAG Response\n\nUse a Q\u0026A pair from your dataset (or write one manually) and evaluate it.\n\n#### Via the UI\n\n1. Open http://localhost:3000\n2. Fill in **Question**, **Answer**, **Retrieved Contexts** (one per line), and optionally **Ground Truth**\n3. Choose mode: **Full** (all metrics) or **Quick** (skips context recall even if Ground Truth is provided, faster)\n4. Click **Evaluate**\n5. View per-metric scores, hallucination badge, trace visualization, and recommendations\n\n#### Via curl (non-streaming)\n\n```bash\ncurl -X POST http://localhost:8000/evaluate \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"What is RAG?\",\n    \"answer\": \"RAG stands for Retrieval-Augmented Generation. 
It retrieves relevant documents and uses them to generate answers.\",\n    \"contexts\": [\n      \"RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models.\",\n      \"The retrieval step fetches relevant documents. The generation step uses them as context.\"\n    ],\n    \"ground_truth\": \"RAG retrieves relevant documents from a knowledge base and uses them as context to generate accurate answers.\",\n    \"mode\": \"full\"\n  }'\n```\n\n**Response structure:**\n```json\n{\n  \"overall_score\": 0.87,\n  \"scores\": {\n    \"faithfulness\": 0.95,\n    \"answer_relevancy\": 0.88,\n    \"context_precision\": 0.80,\n    \"context_recall\": 0.75,\n    \"hallucination_risk\": \"low\"\n  },\n  \"trace\": {\n    \"retrieval_stage\": {\"score\": 0.775, \"issues\": []},\n    \"generation_stage\": {\"score\": 0.915, \"issues\": []}\n  },\n  \"recommendations\": [...],\n  \"verdict\": \"READY\",\n  \"explanation\": \"Your RAG pipeline is well-grounded and relevant...\"\n}\n```\n\n#### Via curl (streaming SSE)\n\nThe streaming endpoint yields real-time progress events followed by the final result:\n\n```bash\ncurl -N -X POST http://localhost:8000/evaluate/stream \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"What is RAG?\",\n    \"answer\": \"RAG stands for Retrieval-Augmented Generation.\",\n    \"contexts\": [\"RAG stands for Retrieval-Augmented Generation...\"],\n    \"ground_truth\": \"RAG retrieves documents and generates answers.\",\n    \"mode\": \"full\"\n  }'\n```\n\nEvents emitted (one per line, `data: {...}`):\n- `{\"type\": \"progress\", \"message\": \"Initializing evaluation engine...\", \"step\": 0, \"total\": 5}`\n- `{\"type\": \"progress\", \"message\": \"Checking answer faithfulness...\", \"step\": 1, \"total\": 5}`\n- `{\"type\": \"progress\", \"message\": \"Running hallucination check...\"}`\n- `{\"type\": \"scores\", \"scores\": {...}}`\n- `{\"type\": 
\"result\", \"data\": {...full EvaluationResponse...}}`\n- `[DONE]`\n\n---\n\n### Step 5 — Evaluate a Batch\n\nUse this when you have a full dataset and want aggregate statistics.\n\nIn the UI, batch upload accepts JSON files directly and CSV files with `contexts` stored either as a JSON array string or as a legacy pipe-delimited field.\n\n```bash\ncurl -X POST http://localhost:8000/evaluate/batch \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"samples\": [\n      {\n        \"question\": \"What is RAG?\",\n        \"answer\": \"RAG is Retrieval-Augmented Generation.\",\n        \"contexts\": [\"RAG combines retrieval with generation.\"],\n        \"ground_truth\": \"RAG retrieves documents to generate answers.\",\n        \"mode\": \"full\"\n      },\n      {\n        \"question\": \"What does faithfulness measure?\",\n        \"answer\": \"Faithfulness measures if claims trace back to context.\",\n        \"contexts\": [\"Faithfulness measures whether every claim in the answer can be traced back to context.\"],\n        \"ground_truth\": \"Faithfulness measures grounding of the answer in retrieved context.\",\n        \"mode\": \"full\"\n      }\n    ]\n  }'\n```\n\n**Response includes:**\n```json\n{\n  \"aggregate\": {\n    \"faithfulness\": 0.91,\n    \"answer_relevancy\": 0.85,\n    \"context_precision\": 0.78,\n    \"context_recall\": 0.72,\n    \"overall_score\": 0.85\n  },\n  \"verdict_distribution\": {\"READY\": 2},\n  \"total_samples\": 2,\n  \"successful\": 2,\n  \"failed\": 0,\n  \"results\": [...],\n  \"errors\": []\n}\n```\n\n#### Using a generated dataset directly\n\n```bash\n# Generate\ncurl -X POST http://localhost:8000/generate-dataset \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"documents\": [\"...\"], \"num_questions\": 5}' \\\n  -o dataset.json\n\n# Transform to batch format with jq\ncat dataset.json | jq '{samples: [.pairs[] | {question, answer, contexts, ground_truth, mode: \"full\"}]}' \\\n  \u003e 
batch_input.json\n\n# Evaluate\ncurl -X POST http://localhost:8000/evaluate/batch \\\n  -H \"Content-Type: application/json\" \\\n  -d @batch_input.json\n```\n\n---\n\n### Step 6 — Compare Two Evaluations\n\nAfter changing your RAG pipeline, compare before and after:\n\n```bash\n# Save your baseline result\ncurl -X POST http://localhost:8000/evaluate \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"question\": \"...\", \"answer\": \"old answer\", \"contexts\": [...], \"mode\": \"full\"}' \\\n  -o baseline.json\n\n# Save your candidate result\ncurl -X POST http://localhost:8000/evaluate \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"question\": \"...\", \"answer\": \"improved answer\", \"contexts\": [...], \"mode\": \"full\"}' \\\n  -o candidate.json\n\n# Compare\ncurl -X POST http://localhost:8000/evaluate/compare \\\n  -H \"Content-Type: application/json\" \\\n  -d \"{\\\"baseline\\\": $(cat baseline.json), \\\"candidate\\\": $(cat candidate.json)}\"\n```\n\n**Response:**\n```json\n{\n  \"deltas\": [\n    {\"metric\": \"faithfulness\", \"baseline\": 0.70, \"candidate\": 0.90, \"delta\": 0.20, \"direction\": \"improved\"},\n    {\"metric\": \"answer_relevancy\", \"baseline\": 0.85, \"candidate\": 0.82, \"delta\": -0.03, \"direction\": \"regressed\"}\n  ],\n  \"summary\": \"Mixed results: faithfulness improved by 20.0% but answer_relevancy regressed by 3.0%.\",\n  \"overall_direction\": \"mixed\"\n}\n```\n\n---\n\n### Step 7 — Run the Automated Test Suite\n\nUnit tests cover formatters, trace analysis, and recommendation logic — no API key required.\n\n```bash\ncd backend\nsource venv/bin/activate\npytest tests/ -v\n```\n\nExpected output:\n```\ntests/test_evaluate.py::TestFormatters::test_clamp_score_valid PASSED\ntests/test_evaluate.py::TestFormatters::test_compute_overall_score_all_metrics PASSED\ntests/test_evaluate.py::TestFormatters::test_verdict_ready PASSED\ntests/test_evaluate.py::TestFormatters::test_verdict_not_ready_high_hallucination 
PASSED\ntests/test_evaluate.py::TestRecommendations::test_critical_faithfulness PASSED\ntests/test_evaluate.py::TestRecommendations::test_recommendations_sorted_by_severity PASSED\ntests/test_evaluate.py::TestTraceAnalyzer::test_analyze_trace_with_all_scores PASSED\n...\n```\n\nRun frontend tests:\n```bash\ncd frontend\nnpm test\n```\n\n---\n\n## Understanding Verdicts\n\n| Verdict | Condition |\n|---|---|\n| **READY** | Overall score ≥ 0.80 AND hallucination risk is not `high` |\n| **NEEDS_WORK** | Overall score ≥ 0.60 (and hallucination not `high`) |\n| **NOT_READY** | Overall score \u003c 0.60, OR hallucination risk is `high` |\n\nThe **overall score** is a weighted average:\n\n| Metric | Weight |\n|---|---|\n| Faithfulness | 35% |\n| Answer Relevancy | 30% |\n| Context Precision | 20% |\n| Context Recall | 15% |\n\n---\n\n## RAGAS Metrics Explained\n\n| Metric | What it measures | Requires ground truth? |\n|---|---|---|\n| **Faithfulness** | Does every claim in the answer trace back to the retrieved context? A score of 1.0 means the answer is fully grounded. | No |\n| **Answer Relevancy** | Does the answer actually address the question asked? Low scores mean the answer is off-topic. | No |\n| **Context Precision** | What fraction of the retrieved chunks are actually relevant? Low scores mean your retriever is returning noise. | No |\n| **Context Recall** | Was all the relevant information present in the retrieved chunks? Low scores mean your retriever missed important content. 
| Yes |\n\n**Hallucination Risk** is an additional LLM-as-judge assessment (not from RAGAS) that classifies whether the answer introduces information not in the context: `low`, `medium`, or `high`.\n\n---\n\n## Interpreting Recommendations\n\nEvery evaluation returns prioritized recommendations sorted by severity:\n\n| Severity | Score range | Example fix |\n|---|---|---|\n| **critical** | \u003c 0.50 | Rewrite your generation prompt to forbid external knowledge |\n| **warning** | 0.50–0.70 | Tighten top-k, add reranking |\n| **info** | ≥ 0.80 | Monitor production metrics |\n\n---\n\n## How LLM-as-Judge Works\n\nRAG Auditor uses Claude (`claude-sonnet-4-6`) for three purposes:\n\n1. **RAGAS metrics** — Claude is the judge LLM for all RAGAS computations (faithfulness, answer relevancy, context precision, context recall)\n2. **Hallucination detection** — A custom Claude prompt analyzes whether the answer introduces unsupported claims, returning `risk_level`, `confidence`, `unsupported_claims`, and `rationale`\n3. **Plain-English explanation** — Claude synthesizes scores into a 2–3 sentence summary of what to fix\n\n---\n\n## Integration: Evaluating multi-llm-rag-agent-chat\n\n[multi-llm-rag-agent-chat](https://github.com/amitgambhir/multi-llm-rag-agent-chat) is a production RAG chatbot with dual-LLM routing (GPT-4o / Gemini Flash), ChromaDB vector storage, HuggingFace embeddings (`all-MiniLM-L6-v2`), and an RLHF feedback loop. 
RAG Auditor is the ideal complement — it gives you objective metric scores for every dimension of that pipeline.\n\n```\nmulti-llm-rag-agent-chat          RAG Auditor\n─────────────────────────         ─────────────────────────────────\nUpload documents          ──►     Generate golden dataset from same docs\nAsk question              ──►     Capture question + answer + contexts\nChromaDB retrieval (top 6)──►     Evaluate context_precision / context_recall\nGPT-4o or Gemini answer   ──►     Evaluate faithfulness / answer_relevancy\nRLHF re-ranking active    ──►     Re-run batch eval, compare delta scores\n```\n\n### Step 1 — Generate a Golden Dataset from Your Documents\n\nUse the same documents you uploaded to the chatbot to generate ground-truth Q\u0026A pairs:\n\n```bash\ncurl -X POST http://localhost:8000/generate-dataset \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"documents\": [\n      \"paste the text content of one of your uploaded PDFs or web pages here\",\n      \"paste a second document here\"\n    ],\n    \"num_questions\": 20\n  }' \\\n  -o golden_dataset.json\n```\n\nThese Q\u0026A pairs become your evaluation harness. The `ground_truth` field is what you use to score context recall.\n\n### Step 2 — Capture Live Responses from the Chatbot\n\nFor each question in your golden dataset, query the chatbot and capture the full response including the retrieved source chunks. The chatbot's chat endpoint accepts `{ \"query\": \"...\", \"session_id\": \"...\" }` and returns `answer` + `sources`.\n\n\u003e **Note on content truncation:** The chatbot's `sources[].content` field is currently truncated to 300 characters (`doc.page_content[:300]` in `chat.py`). This is fine for UI display but too short for RAGAS to compute accurate faithfulness and recall scores. 
See [Changes needed to multi-llm-rag-agent-chat](#changes-needed-to-multi-llm-rag-agent-chat) below for the one-line fix.\n\n```python\nimport httpx, json, uuid\n\ngolden = json.load(open(\"golden_dataset.json\"))\nsamples = []\nsession_id = str(uuid.uuid4())\n\nfor pair in golden[\"pairs\"]:\n    # Query the chatbot backend — field is \"query\", not \"message\"\n    resp = httpx.post(\n        \"http://localhost:8001/chat\",\n        json={\"query\": pair[\"question\"], \"session_id\": session_id},\n        timeout=60,\n    )\n    data = resp.json()\n\n    samples.append({\n        \"question\": pair[\"question\"],\n        \"answer\": data[\"answer\"],\n        # sources[].content is truncated to 300 chars by default — apply the\n        # full_content fix (see below) to get meaningful RAGAS scores\n        \"contexts\": [s[\"content\"] for s in data[\"sources\"]],\n        \"ground_truth\": pair[\"ground_truth\"],\n        \"mode\": \"full\",\n    })\n\njson.dump({\"samples\": samples}, open(\"batch_input.json\", \"w\"))\n```\n\n#### Changes needed to multi-llm-rag-agent-chat\n\nOnly one change is required in the chatbot to make it fully compatible with RAG Auditor evaluation.\n\n**Problem:** `backend/routers/chat.py` truncates source content to 300 chars:\n```python\n# current — too short for RAGAS\nSource(content=doc.page_content[:300], ...)\n```\n\n**Fix:** Return the full chunk content (or add a `full_content` field alongside the truncated preview):\n```python\n# option A — return full content (evaluation-friendly, slightly larger payload)\nSource(content=doc.page_content, ...)\n\n# option B — keep the 300-char preview for UI, add full content for eval\nSource(\n    content=doc.page_content[:300],   # UI display\n    full_content=doc.page_content,    # evaluation use\n    ...\n)\n```\n\nIf you go with option B, update the integration script to use `s[\"full_content\"]` instead of `s[\"content\"]`.\n\nNo other changes are required — the chatbot's API shape 
(`query`, `answer`, `sources`, `chunk_ids`, `llm_used`, `complexity_score`) maps cleanly to RAG Auditor's evaluation input.\n\n### Step 3 — Run Batch Evaluation\n\n```bash\ncurl -X POST http://localhost:8000/evaluate/batch \\\n  -H \"Content-Type: application/json\" \\\n  -d @batch_input.json \\\n  -o batch_results.json\n\n# Quick summary\ncat batch_results.json | jq '.aggregate'\n```\n\nThis gives you aggregate scores across your whole document corpus — exactly what you need to decide if the pipeline is production-ready.\n\n### Step 4 — Compare GPT-4o vs Gemini Routing\n\nThe chatbot routes queries above complexity threshold 0.4 to GPT-4o and below to Gemini Flash. Use RAG Auditor's compare mode to measure whether the routing decision actually improves quality:\n\n```python\nimport httpx\n\n# An evaluation can take 10-30 seconds; httpx's default timeout is 5 seconds\nEVAL_TIMEOUT = 60\n\nquestion = \"What are the key architectural trade-offs in microservices?\"  # high complexity\ncontexts = [\"...retrieved chunks...\"]\nground_truth = \"...\"\n\n# Force GPT-4o answer (or capture from a high-complexity query)\ngpt4o_eval = httpx.post(\"http://localhost:8000/evaluate\", json={\n    \"question\": question,\n    \"answer\": \"GPT-4o generated answer here\",\n    \"contexts\": contexts,\n    \"ground_truth\": ground_truth,\n    \"mode\": \"full\"\n}, timeout=EVAL_TIMEOUT).json()\n\n# Capture Gemini answer (low-complexity routing)\ngemini_eval = httpx.post(\"http://localhost:8000/evaluate\", json={\n    \"question\": question,\n    \"answer\": \"Gemini Flash generated answer here\",\n    \"contexts\": contexts,\n    \"ground_truth\": ground_truth,\n    \"mode\": \"full\"\n}, timeout=EVAL_TIMEOUT).json()\n\n# Compare\ncompare = httpx.post(\"http://localhost:8000/evaluate/compare\", json={\n    \"baseline\": gemini_eval,\n    \"candidate\": gpt4o_eval\n}, timeout=EVAL_TIMEOUT).json()\n\nprint(compare[\"summary\"])\n# e.g. 
\"GPT-4o improved faithfulness by 12.0% and answer_relevancy by 8.0%.\"\n```\n\nThis tells you whether the routing threshold (0.4) is correctly placed — if GPT-4o isn't consistently outscoring Gemini on hard questions, you may need to adjust the threshold.\n\n### Step 5 — Measure RLHF Improvement Over Time\n\nThe chatbot's RLHF loop re-ranks ChromaDB results based on user thumbs up/down. To measure whether feedback is actually improving retrieval quality:\n\n```bash\n# Baseline: evaluate before users have given feedback\ncurl -X POST http://localhost:8000/evaluate/batch \\\n  -H \"Content-Type: application/json\" \\\n  -d @batch_input.json \\\n  -o before_rlhf.json\n\n# ... let users interact with the chatbot and submit feedback ...\n\n# Re-run: same questions, same contexts, re-capture from chatbot\ncurl -X POST http://localhost:8000/evaluate/batch \\\n  -H \"Content-Type: application/json\" \\\n  -d @batch_input_after.json \\\n  -o after_rlhf.json\n\n# Compare aggregate context_precision scores\ncat before_rlhf.json | jq '.aggregate.context_precision'\ncat after_rlhf.json  | jq '.aggregate.context_precision'\n```\n\nAn increase in `context_precision` after RLHF feedback confirms that re-ranking is surfacing higher-signal chunks. 
An increase in `context_recall` confirms fewer relevant chunks are being missed.\n\n### What to Watch For\n\n| Metric | What it reveals about the chatbot |\n|---|---|\n| **context_precision** | Whether ChromaDB's cosine similarity retrieval (top 6 → top 3) is pulling in noise |\n| **context_recall** | Whether `all-MiniLM-L6-v2` embeddings capture the semantic meaning of your domain |\n| **faithfulness** | Whether GPT-4o / Gemini is staying grounded or hallucinating beyond retrieved chunks |\n| **answer_relevancy** | Whether the complexity router is selecting the right LLM for each query type |\n| **hallucination_risk** | Claude's independent assessment — useful as a cross-check on the routing decision |\n\n\u003e **Expected baseline:** `all-MiniLM-L6-v2` is a lightweight embedding model optimized for speed, not domain accuracy. If `context_recall` scores below 0.70 consistently, consider upgrading to a larger embedding model (e.g. `BAAI/bge-large-en-v1.5`) and re-running the batch eval to measure the improvement.\n\n---\n\n## Key Design Decisions\n\n### 1. RAGAS + Claude in combination, not either/or\nRAGAS provides statistically rigorous, reproducible metrics based on dataset science. Claude provides contextual reasoning that RAGAS cannot — specifically hallucination detection and plain-English explanations. Running both in parallel via `asyncio.gather()` means neither adds latency to the other.\n\n### 2. Weighted overall score, not a simple average\nFaithfulness (35%) is weighted highest because hallucinating content is the most damaging RAG failure mode. Answer relevancy (30%) is second because an off-topic answer is equally useless regardless of how well it's grounded. Context metrics are weighted lower (20%/15%) because they diagnose the retriever, which is fixable without touching the LLM.\n\n### 3. SSE streaming over polling\nThe `/evaluate/stream` endpoint emits progress events per metric so the UI can update in real time as each RAGAS metric completes. 
This avoids a blank \"loading\" state during what can be a 10–30 second evaluation.\n\n### 4. Three-tier verdict, not a score\n`READY` / `NEEDS_WORK` / `NOT_READY` gives developers and stakeholders a clear go/no-go signal without needing to interpret a float. Hallucination risk overrides the score: even a 0.95 overall score is `NOT_READY` if Claude classifies hallucination as `high`.\n\n### 5. RAGAS → Claude fallback for dataset generation\nThe dataset generator first attempts RAGAS `TestsetGenerator` (which produces richer, more diverse question types using multi-hop reasoning). If RAGAS is unavailable or fails, it falls back to a direct Claude prompt that produces the same JSON schema. The caller never needs to know which path ran.\n\n### 6. Recommendations sorted by severity, not metric\nCritical issues (`score \u003c 0.50`) surface first regardless of which metric produced them. This matches how a developer would triage — fix the worst thing first, then warnings, then informational.\n\n---\n\n## Extending the System\n\n### Swap the LLM Judge\n\nAll Claude calls are isolated in `backend/services/llm_judge.py`. 
To use a different model, change the `model` parameter:\n\n```python\n# llm_judge.py\nresponse = await client.messages.create(\n    model=\"claude-sonnet-4-6\",   # change this\n    ...\n)\n```\n\nTo use a different provider entirely, replace the `anthropic.AsyncAnthropic` client with any async client that accepts the same prompt structure.\n\n### Add a New RAGAS Metric\n\nIn `backend/services/ragas_evaluator.py`, add your metric to `metrics_config` in `stream_ragas_evaluation()` and to `_run_ragas_sync()`:\n\n```python\nfrom ragas.metrics import answer_correctness   # example new metric\n\nmetrics_config = [\n    ...\n    (\"answer_correctness\", \"Checking answer correctness...\"),\n]\n```\n\nThen add it to the `Scores` model in `backend/models/evaluation.py` and the weighting dict in `backend/utils/formatters.py`.\n\n### Change the Verdict Thresholds\n\nEdit `score_to_verdict()` in `backend/utils/formatters.py`:\n\n```python\ndef score_to_verdict(overall_score: float, hallucination_risk) -\u003e str:\n    if hallucination_risk == \"high\":\n        return \"NOT_READY\"\n    if overall_score \u003e= 0.85:   # raise the bar\n        return \"READY\"\n    if overall_score \u003e= 0.65:\n        return \"NEEDS_WORK\"\n    return \"NOT_READY\"\n```\n\n### Change the Score Weights\n\nEdit the `weights` dict in `compute_overall_score()` in `backend/utils/formatters.py`. 
Weights are automatically re-normalized if a metric is absent, so you can adjust without breaking missing-metric cases.\n\n### Add a New Evaluation Endpoint\n\nAdd a router file in `backend/routers/` and register it in `backend/main.py`:\n\n```python\nfrom routers.my_endpoint import router as my_router\napp.include_router(my_router)\n```\n\n---\n\n## API Reference\n\nInteractive docs available at http://localhost:8000/docs\n\n| Method | Path | Description |\n|---|---|---|\n| `GET` | `/health` | Health check |\n| `POST` | `/evaluate` | Single evaluation (blocking) |\n| `POST` | `/evaluate/stream` | Single evaluation with SSE progress |\n| `POST` | `/evaluate/batch` | Batch evaluation with aggregates |\n| `POST` | `/evaluate/compare` | Compare baseline vs candidate |\n| `POST` | `/generate-dataset` | Generate synthetic golden dataset |\n\n---\n\n## Project Structure\n\n```\nrag-auditor/\n│\n├── .env.example                          # Template — copy to .env and add ANTHROPIC_API_KEY\n├── docker-compose.yml                    # Orchestrates backend + frontend\n│\n├── backend/\n│   ├── Dockerfile\n│   ├── requirements.txt                  # ragas==0.1.21, anthropic, langchain-anthropic, fastapi\n│   ├── main.py                           # FastAPI app, CORS middleware, router registration\n│   │\n│   ├── models/\n│   │   ├── evaluation.py                 # EvaluationRequest/Response, Scores, Trace, Recommendations\n│   │   └── dataset.py                    # GenerateDatasetRequest/Response, QAPair\n│   │\n│   ├── routers/\n│   │   ├── evaluate.py                   # POST /evaluate, /evaluate/stream, /evaluate/batch, /evaluate/compare\n│   │   ├── generate_dataset.py           # POST /generate-dataset\n│   │   └── health.py                     # GET /health\n│   │\n│   ├── services/\n│   │   ├── ragas_evaluator.py            # RAGAS metric runner — sync executor + async SSE streaming\n│   │   ├── llm_judge.py                  # Claude hallucination detector + 
plain-English explanation\n│   │   ├── trace_analyzer.py             # Maps scores → retrieval/generation stage issues + recommendations\n│   │   └── dataset_generator.py          # RAGAS TestsetGenerator with Claude fallback\n│   │\n│   ├── utils/\n│   │   └── formatters.py                 # clamp_score(), compute_overall_score(), score_to_verdict()\n│   │\n│   └── tests/\n│       └── test_evaluate.py              # Unit tests for formatters, trace, recommendations (no API key)\n│\n└── frontend/\n    ├── Dockerfile\n    ├── vite.config.js                    # Dev proxy → localhost:8000\n    ├── src/\n    │   ├── App.jsx                       # Top-level layout + tab routing\n    │   ├── components/\n    │   │   ├── EvaluatorForm.jsx         # Single-sample input form\n    │   │   ├── ResultsDashboard.jsx      # Score cards, verdict, explanation\n    │   │   ├── TraceVisualizer.jsx       # Retrieval + generation stage breakdown\n    │   │   ├── HallucinationBadge.jsx    # LOW / MEDIUM / HIGH risk badge\n    │   │   ├── RecommendationsPanel.jsx  # Sorted fix recommendations\n    │   │   ├── BatchEvaluator.jsx        # CSV/JSON upload + aggregate results\n    │   │   ├── DatasetGenerator.jsx      # Doc input + dataset download\n    │   │   ├── CompareMode.jsx           # Baseline vs candidate delta view\n    │   │   ├── HistoryPanel.jsx          # In-memory session history + restore\n    │   │   └── ScoreCard.jsx             # Reusable per-metric score component\n    │   ├── hooks/\n    │   │   ├── useEvaluate.js            # SSE streaming hook for /evaluate/stream\n    │   │   └── useHistory.js             # In-memory session history state\n    │   ├── api/\n    │   │   └── client.js                 # Axios wrappers for all backend endpoints\n    │   └── utils/\n    │       └── scoreHelpers.js           # Color/label helpers for score display\n    └── src/utils/\n        └── scoreHelpers.test.js          # Frontend unit tests\n```\n\n---\n\n## Configuration\n\nAll 
settings are loaded from `.env` (copy from `.env.example`):\n\n| Variable | Default | Required | Description |\n|---|---|---|---|\n| `ANTHROPIC_API_KEY` | — | Yes | Anthropic API key — used for RAGAS judge LLM, hallucination detection, and explanations |\n| `RAGAS_APP_TOKEN` | — | No | RAGAS Cloud token for dashboard and experiment tracking |\n| `CORS_ORIGINS` | `http://localhost:3000` | No | Comma-separated list of allowed frontend origins |\n\n---\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for how to submit issues, PRs, and run tests.\n\n## License\n\nMIT — see [LICENSE](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famitgambhir%2Frag-auditor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famitgambhir%2Frag-auditor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famitgambhir%2Frag-auditor/lists"}