Projects in Awesome Lists tagged with llm-evaluation-framework
A curated list of projects in awesome lists tagged with llm-evaluation-framework.
https://github.com/confident-ai/deepeval
The LLM Evaluation Framework
evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics
Last synced: 13 May 2025
https://github.com/promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners
Last synced: 03 Mar 2026
https://github.com/mr-gpt/deepeval
The LLM Evaluation Framework
evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics
Last synced: 12 Jan 2026
https://github.com/msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
agent-framework agent-security ai-red-team llm-evaluation llm-evaluation-framework llm-fuzzer llm-fuzzer-aggregator llm-fuzzing llm-guardrails llm-jailbreaks llm-scanner llm-security llm-vulnerabilities prompt-testing
Last synced: 06 Sep 2025
https://cvs-health.github.io/langfair/
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
ai ai-safety artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics python responsible-ai
Last synced: 16 Jun 2026
https://github.com/JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval
Last synced: 14 Sep 2025
https://github.com/cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
ai ai-safety artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics python responsible-ai
Last synced: 13 Oct 2025
https://github.com/zli12321/qa_metrics
An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.
exact-matching llm llm-evaluation llm-evaluation-framework llm-evaluation-toolkit qa-automation-test reward-modeling rl-training
Last synced: 14 Jan 2026
https://github.com/dokimos-dev/dokimos
LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.
agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation
Last synced: 09 Jun 2026
https://github.com/zhuohaoyu/kieval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
acl2024 explainable-ai llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics llm-evaluation-toolkit machine-learning
Last synced: 18 Sep 2025
https://github.com/vero-labs-ai/vero-eval
Open source framework for evaluating AI Agents
dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona
Last synced: 07 Apr 2026
https://github.com/honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
ai aiengineering alignment evaluation llm-eval llm-evaluation llm-evaluation-framework llmops llms prompt-engineering rag red-teaming simulation
Last synced: 20 Jan 2026
https://github.com/evaliphy/evaliphy
The E2E AI testing tool | No ML Overhead
ai ai-test-automation ai-testing ai-testing-tool end-to-end-testing llm-evaluation llm-evaluation-framework llm-evaluation-toolkit llm-testing rag rag-evaluation rag-pipeline test-automation test-automation-framework testing-tools
Last synced: 09 Jun 2026
https://github.com/networks-learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets ranking-algorithm
Last synced: 23 Apr 2025
https://github.com/pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
llm llm-eval llm-evals llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-test llm-testing llmops llms workshop
Last synced: 12 May 2025
https://github.com/stair-lab/melt
Multilingual Evaluation Toolkits
llm-evaluation-framework llms-benchmarking multilingual
Last synced: 05 Mar 2026
https://github.com/yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
ai evaluation evaluation-framework firebase genkit genkit-plugin genkitx llm llm-eval llm-evaluation llm-evaluation-framework llmops plugin prompt prompt-testing promptfoo prompts testing
Last synced: 27 Jul 2025
https://github.com/artefactop/promptdev
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
ci-cd evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework prompt prompt-engineering prompt-toolkit red-team testing
Last synced: 30 Oct 2025
https://github.com/Aysnc-Labs/llm-eval
A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.
llm llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-toolkit php
Last synced: 16 Jun 2026
https://github.com/yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing
Last synced: 25 Feb 2026
https://github.com/homemade-software-inc/completion-kit
Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.
anthropic evaluation-framework evaluation-metrics llm llm-as-judge llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-metrics llmops mcp ollama openai prompt-engineering prompt-testing rails rails-engine ruby ruby-on-rails
Last synced: 09 Jun 2026
https://github.com/bluewave-labs/plugin-marketplace
VerifyWise AI Governance Plugin Marketplace
ai ai-governance ai-governance-model llm-eval llm-evaluation llm-evaluation-framework
Last synced: 20 Jan 2026
https://github.com/yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
evaluation evaluation-framework genkit llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing
Last synced: 15 Aug 2025