Projects in Awesome Lists tagged with rag-evaluation

https://github.com/giskard-ai/giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation llm-security llmops ml-testing ml-validation mlops rag-evaluation red-team-tools responsible-ai trustworthy-ai

Last synced: 14 May 2025

https://github.com/marker-inc-korea/autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation

Last synced: 03 Apr 2026

https://github.com/agenta-ai/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

agents evaluation llm-as-a-judge llm-evaluation llm-framework llm-monitoring llm-observability llm-platform llm-playground llm-tools llmops observability prompt-engineering prompt-management rag-evaluation

Last synced: 11 Mar 2026

https://github.com/Agenta-AI/agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

human-annotation langchain large-language-models llama-index llm llm-evaluation llm-framework llm-tools llmops llms prompt-engineering prompt-management prompt-toolkit rag rag-evaluation

Last synced: 13 Mar 2025

https://github.com/vectara/open-rag-eval

Open source RAG evaluation package

evaluation-metrics metrics rag rag-evaluation retrieval-augmented-generation vectara

Last synced: 22 Apr 2025

https://github.com/LLAMATOR-Core/llamator

Framework for testing vulnerabilities of large language models (LLM).

agent ai ai-security attack hallucinations jailbreak llm llm-read-team llm-security llm-testing misinformation nlp owasp python rag rag-evaluation red-team red-team-tools security-tools vulnerability

Last synced: 10 May 2025

https://github.com/llamator-core/llamator

Framework for testing vulnerabilities of large language models (LLM).

ai ai-security attack hallucinations jailbreak llm llm-read-team llm-security llm-testing misinformation nlp owasp python rag rag-evaluation red-team red-team-tools red-teaming security-tools vulnerability-assessment

Last synced: 17 Jan 2026

https://github.com/romiconez/llamator

Framework for testing vulnerabilities of large language models (LLM).

ai ai-security attack hallucinations jailbreak llm llm-read-team llm-security llm-testing misinformation nlp owasp python rag rag-evaluation red-team red-team-tools red-teaming security-tools vulnerability-assessment

Last synced: 22 Mar 2025

https://github.com/dokimos-dev/dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.

agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation

Last synced: 09 Jun 2026

https://github.com/mts-ai/rurage

information-retrieval llm-evaluation question-answering rag rag-evaluation

Last synced: 10 Nov 2025

https://github.com/vero-labs-ai/vero-eval

Open source framework for evaluating AI Agents

dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona

Last synced: 07 Apr 2026

https://github.com/evaliphy/evaliphy

The E2E AI testing tool | No ML Overhead

ai ai-test-automation ai-testing ai-testing-tool end-to-end-testing llm-evaluation llm-evaluation-framework llm-evaluation-toolkit llm-testing rag rag-evaluation rag-pipeline test-automation test-automation-framework testing-tools

Last synced: 09 Jun 2026

https://github.com/oztrkoguz/rag-framework-evaluation

This project aims to compare different Retrieval-Augmented Generation (RAG) frameworks in terms of speed and performance.

autogen autogen-rag crewai crewai-rag langchain langchain-rag llamaindex llamaindex-rag rag rag-evaluation swarms swarms-rag

Last synced: 20 Mar 2025

https://github.com/lizhiyao/oh-my-knowledge

Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

agent-evaluation ai benchmark bootstrap-ci claude claude-code evaluation-as-code evaluation-framework knowledge-engineering krippendorff-alpha llm llm-evaluation llm-judge multi-judge-ensemble prompt-engineering prompt-testing rag-evaluation skill-evaluation

Last synced: 15 Jun 2026

https://github.com/xmpuspus/kb-arena

Benchmark 7 retrieval strategies on your own docs — naive vector, contextual, QnA pairs, knowledge graph, RAPTOR, PageIndex, and hybrid. Find which KB architecture fits your data.

benchmark chromadb cli document-retrieval evaluation graphrag hybrid-search knowledge-graph llm neo4j python rag rag-evaluation retrieval retrieval-augmented-generation vector-search

Last synced: 02 May 2026

https://github.com/simranjeet97/learn_rag_from_scratch_llm

Learn Retrieval-Augmented Generation (RAG) from scratch using LLMs from Hugging Face and LangChain or Python

artificial-intelligence datascience-machinelearning genai-domain genai-usecase generative-ai llm-apps llm-evaluation llm-framework llm-training rag rag-application rag-chatbot rag-embeddings rag-evaluation rag-implementation rag-llm rag-model rag-pipeline retrieval-augmented-generation

Last synced: 31 Jul 2025

https://github.com/hallengray/rag-forge

Production-grade RAG pipelines with evaluation baked in

cli embeddings llm llm-evaluation mcp observability python rag rag-evaluation rag-pipeline ragas retrieval-augmented-generation vector-database

Last synced: 18 Apr 2026

https://github.com/shaadclt/evalrag

A comprehensive evaluation toolkit for assessing Retrieval-Augmented Generation (RAG) outputs using linguistic, semantic, and fairness metrics

rag rag-evaluation

Last synced: 22 Jul 2025

https://github.com/fkapsahili/entrag

EntRAG - Enterprise RAG Benchmark

benchmark dataset evaluation evaluations generative-ai knowledge-graph llm llm-evaluation rag rag-evaluation retrieval retrieval-augmented-generation

Last synced: 09 Mar 2026

https://github.com/kaos599/betterrag

BetterRAG: Powerful RAG evaluation toolkit for LLMs. Measure, analyze, and optimize how your AI processes text chunks with precision metrics. Perfect for RAG systems, document processing, and embedding quality assessment.

chunking-optimization embeddings embeddings-extraction embeddings-optimization evaluation evaluation-framework optimization rag rag-application rag-evaluation rag-optimization

Last synced: 05 May 2026

https://github.com/anasaber/mlflow_with_rag

Using MLflow to deploy your RAG pipeline, built with LlamaIndex, LangChain, and Ollama, Hugging Face LLMs, or Groq

cicd deployment evaluation-metrics llamaindex llamaindex-rag mlflow mlflow-deployement mlflow-projects mlflow-tracking mlflow-tracking-server mlflow-ui mlops mlops-project mlops-template rag rag-evaluation rag-pipeline

Last synced: 06 Feb 2026

https://github.com/unshdee/proofrag

Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge & retrieval scorecard, in one command.

agent-skills ci claude claude-code codex evaluation llm llm-as-judge python rag rag-evaluation retrieval

Last synced: 01 Jun 2026

https://github.com/keitabroadwater/llm-eval-lab

A web sandbox for hands-on learning of LLM and RAG Evaluation

evaluation-framework fastapi gpt4 llm-evaluation llmops nextjs rag-evaluation ragas

Last synced: 19 Apr 2026

https://github.com/alexmartin1722/mirage

An evaluation framework for any-modality-to-text generation and multimodal RAG.

multimodal multimodal-rag multimodal-summarization rag rag-evaluation

Last synced: 14 May 2026

https://github.com/zhjai/groundcheck

Single-agent, evidence-grounded claim verification to catch LLM hallucinations — a pluggable fact-gate for agent-arena and any multi-agent system (CrewAI, AutoGen, LangGraph).

agent-skill claim-verification claude-code-skill fact-checking groundedness hallucination-detection llm rag-evaluation

Last synced: 12 Jun 2026

https://github.com/jhaayush2004/rag-evaluation

Different approaches to evaluating RAG

bert-score giskard hallucination-detection langchain rag rag-evaluation ragas vectara wandb

Last synced: 16 Oct 2025
