An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with llm-evaluation-framework

A curated list of projects in awesome lists tagged with llm-evaluation-framework.

https://github.com/promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Last synced: 03 Mar 2026

https://github.com/zli12321/qa_metrics

An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

exact-matching llm llm-evaluation llm-evaluation-framework llm-evaluation-toolkit qa-automation-test reward-modeling rl-training

Last synced: 14 Jan 2026

https://github.com/zhuohaoyu/kieval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

acl2024 explainable-ai llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics llm-evaluation-toolkit machine-learning

Last synced: 18 Sep 2025

https://github.com/networks-learning/prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets ranking-algorithm

Last synced: 23 Apr 2025

https://github.com/pyladiesams/eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

llm llm-eval llm-evals llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-test llm-testing llmops llms workshop

Last synced: 12 May 2025

https://github.com/stair-lab/melt

Multilingual Evaluation Toolkits

llm-evaluation-framework llms-benchmarking multilingual

Last synced: 05 Mar 2026

https://github.com/artefactop/promptdev

A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.

ci-cd evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework prompt prompt-engineering prompt-toolkit red-team testing

Last synced: 30 Oct 2025

https://github.com/Aysnc-Labs/llm-eval

A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.

llm llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-toolkit php

Last synced: 16 Jun 2026

https://github.com/yukinagae/promptfoo-sample

A sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models.

evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing

Last synced: 25 Feb 2026

https://github.com/homemade-software-inc/completion-kit

Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and compare runs to see what got better.

anthropic evaluation-framework evaluation-metrics llm llm-as-judge llm-eval llm-evaluation llm-evaluation-framework llm-evaluation-metrics llmops mcp ollama openai prompt-engineering prompt-testing rails rails-engine ruby ruby-on-rails

Last synced: 09 Jun 2026