Projects in Awesome Lists tagged with llm-evaluation-framework

https://github.com/confident-ai/deepeval

The LLM Evaluation Framework

evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Last synced: 13 May 2025

https://github.com/promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Last synced: 03 Mar 2026

https://github.com/mr-gpt/deepeval

The LLM Evaluation Framework

evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Last synced: 12 Jan 2026

https://github.com/msoedov/agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

agent-framework agent-security ai-red-team llm-evaluation llm-evaluation-framework llm-fuzzer llm-fuzzer-aggregator llm-fuzzing llm-guardrails llm-jailbreaks llm-scanner llm-security llm-vulnerabilities prompt-testing

Last synced: 06 Sep 2025

https://cvs-health.github.io/langfair/

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

ai ai-safety artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics python responsible-ai

Last synced: 16 Jun 2026

https://github.com/JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval

Last synced: 14 Sep 2025

https://github.com/cvs-health/langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

ai ai-safety artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics python responsible-ai

Last synced: 13 Oct 2025

https://github.com/zli12321/qa_metrics

An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

exact-matching llm llm-evaluation llm-evaluation-framework llm-evaluation-toolkit qa-automation-test reward-modeling rl-training

Last synced: 14 Jan 2026

https://github.com/dokimos-dev/dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.

agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation

Last synced: 09 Jun 2026

https://github.com/zhuohaoyu/kieval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

acl2024 explainable-ai llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics llm-evaluation-toolkit machine-learning

Last synced: 18 Sep 2025

https://github.com/vero-labs-ai/vero-eval

Open source framework for evaluating AI Agents

dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona

Last synced: 07 Apr 2026

https://github.com/honeyhiveai/realign

Realign is a testing and simulation framework for AI applications.

ai aiengineering alignment evaluation llm-eval llm-evaluation llm-evaluation-framework llmops llms prompt-engineering rag red-teaming simulation

Last synced: 20 Jan 2026

https://github.com/evaliphy/evaliphy

The E2E AI testing tool | No ML Overhead

ai ai-test-automation ai-testing ai-testing-tool end-to-end-testing llm-evaluation llm-evaluation-framework llm-evaluation-toolkit llm-testing rag rag-evaluation rag-pipeline test-automation test-automation-framework testing-tools

Last synced: 09 Jun 2026

https://github.com/networks-learning/prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets ranking-algorithm

Last synced: 23 Apr 2025

https://github.com/pyladiesams/eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

llm llm-eval llm-evals llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-test llm-testing llmops llms workshop

Last synced: 12 May 2025

https://github.com/stair-lab/melt

Multilingual Evaluation Toolkits

llm-evaluation-framework llms-benchmarking multilingual

Last synced: 05 Mar 2026

https://github.com/yukinagae/genkitx-promptfoo

Community Plugin for Genkit to use Promptfoo

ai evaluation evaluation-framework firebase genkit genkit-plugin genkitx llm llm-eval llm-evaluation llm-evaluation-framework llmops plugin prompt prompt-testing promptfoo prompts testing

Last synced: 27 Jul 2025

https://github.com/artefactop/promptdev

A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.

ci-cd evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework prompt prompt-engineering prompt-toolkit red-team testing

Last synced: 30 Oct 2025

https://github.com/Aysnc-Labs/llm-eval

A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.

llm llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-toolkit php

Last synced: 16 Jun 2026

https://github.com/yukinagae/promptfoo-sample

A sample project that demonstrates how to use Promptfoo, a test framework for evaluating the output of generative AI models.

evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing

Last synced: 25 Feb 2026

https://github.com/homemade-software-inc/completion-kit

Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.

anthropic evaluation-framework evaluation-metrics llm llm-as-judge llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-metrics llmops mcp ollama openai prompt-engineering prompt-testing rails rails-engine ruby ruby-on-rails