Projects in Awesome Lists tagged with evaluation-framework
A curated list of projects in awesome lists tagged with evaluation-framework.
https://github.com/eleutherai/lm-evaluation-harness
A framework for few-shot evaluation of language models.
evaluation-framework language-model transformer
Last synced: 09 Sep 2025
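The few-shot evaluation that lm-evaluation-harness performs can be illustrated with a minimal sketch: build a k-shot prompt from solved examples and score exact-match accuracy. This is illustrative Python only, not the harness's actual API; `build_few_shot_prompt` and `accuracy` are hypothetical names.

```python
def build_few_shot_prompt(examples, query, k=2):
    """Prepend k solved examples to the query, as in k-shot evaluation."""
    shots = examples[:k]
    lines = [f"Q: {q}\nA: {a}" for q, a in shots]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

def accuracy(predictions, references):
    """Fraction of exact matches between model outputs and gold answers."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

examples = [("2+2?", "4"), ("3+3?", "6")]
print(build_few_shot_prompt(examples, "5+5?", k=2))
print(accuracy(["4", "7"], ["4", "6"]))  # 0.5
```

The real harness additionally handles tokenization, log-likelihood scoring, and hundreds of task configs; the prompt-building and scoring loop above is only the core idea.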
https://github.com/confident-ai/deepeval
The LLM Evaluation Framework
evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics
Last synced: 13 May 2025
https://github.com/promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners
Last synced: 03 Mar 2026
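The declarative test-case idea behind promptfoo can be sketched in Python; promptfoo itself is configured with YAML and run from a Node CLI, so `run_prompt_tests` and `toy_model` here are hypothetical stand-ins.

```python
def run_prompt_tests(model_fn, cases):
    """Run each test case through the model and check its assertion."""
    results = []
    for case in cases:
        output = model_fn(case["prompt"])
        results.append({
            "prompt": case["prompt"],
            "output": output,
            "passed": case["assert"](output),
        })
    return results

def toy_model(prompt):
    # Stand-in for a real LLM call.
    return "Paris" if "capital of France" in prompt else "unknown"

cases = [
    {"prompt": "What is the capital of France?", "assert": lambda o: "Paris" in o},
    {"prompt": "What is the capital of Atlantis?", "assert": lambda o: o != ""},
]
results = run_prompt_tests(toy_model, cases)
print(sum(r["passed"] for r in results), "of", len(results), "passed")  # 2 of 2
```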
https://github.com/huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
evaluation evaluation-framework evaluation-metrics huggingface
Last synced: 14 Oct 2025
https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation
This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.
bpr bprmf bprslim collaborative-filtering content-based-recommendation deep-learning evaluation-framework funksvd hybrid-recommender-system hyperparameters knn matrix-completion matrix-factorization neural-network recommendation-algorithms recommendation-system recommender-system reproducibility reproducible-research slimelasticnet
Last synced: 11 May 2025
https://github.com/relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
evaluation-framework evaluation-metrics information-retrieval llm-evaluation llmops rag retrieval-augmented-generation
Last synced: 05 Apr 2025
https://github.com/servicenow/agentlab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
agent agents benchmark evaluation-framework lab llm llm-agents prompting web-agents
Last synced: 25 Sep 2025
https://github.com/aiverify-foundation/moonshot
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
benchmarking evaluation-framework llm red-teaming trustworthy-ai
Last synced: 05 Feb 2026
https://github.com/athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-evaluation-toolkit llm-ops llmops
Last synced: 29 Dec 2025
https://github.com/JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval
Last synced: 14 Sep 2025
https://github.com/TonicAI/tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
evaluation-framework evaluation-metrics large-language-models llm llmops llms rag retrieval-augmented-generation
Last synced: 04 Apr 2025
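Retrieval-quality metrics of the kind such RAG evaluators report reduce, in the simplest case, to set overlap between retrieved and relevant chunks. A minimal sketch under that assumption, not the tonic_validate API:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["doc1", "doc2", "doc3"]
relevant = ["doc1", "doc4"]
print(context_precision(retrieved, relevant))  # ≈0.333
print(context_recall(retrieved, relevant))     # 0.5
```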
https://github.com/zeno-ml/zeno
AI Data Management & Evaluation Platform
ai data-science evaluation evaluation-framework machine-learning python
Last synced: 18 Apr 2025
https://github.com/lartpang/PySODEvalToolkit
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
camouflaged-object-detection co-saliency co-salient-object-detection e-measure evaluation evaluation-framework evaluation-metrics evaluator f-measure fm-curve latex mae metrics metrics-visualization pr-curve python3 s-measure saliency saliency-detection salient-object-detection
Last synced: 21 Nov 2025
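Two of the standard salient-object-detection metrics such a toolbox computes, MAE and the F-measure, can be sketched in plain Python. The toolkit's real implementation is array-based and far more complete; this only shows the definitions, with the common beta^2 = 0.3 weighting.

```python
def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and ground truth (values in [0, 1])."""
    flat_p = [v for row in pred for v in row]
    flat_g = [v for row in gt for v in row]
    return sum(abs(p - g) for p, g in zip(flat_p, flat_g)) / len(flat_g)

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure after binarizing the prediction at a threshold."""
    tp = fp = fn = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            b = p >= threshold
            if b and g:
                tp += 1
            elif b:
                fp += 1
            elif g:
                fn += 1
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return (1 + beta2) * prec * rec / (beta2 * prec + rec)

pred = [[0.9, 0.1], [0.8, 0.2]]
gt = [[1, 0], [1, 0]]
print(mae(pred, gt))        # 0.15
print(f_measure(pred, gt))  # 1.0
```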
https://github.com/bijington/expressive
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
cross-platform evaluation evaluation-framework expression-evaluator expression-parser hacktoberfest netstandard parsing xamarin
Last synced: 31 Mar 2025
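Expressive targets .NET, but the parse-then-evaluate idea behind such engines is easy to sketch. Here is a toy Python analogue that safely evaluates arithmetic by walking Python's own AST instead of writing a parser; it is not related to Expressive's actual API.

```python
import ast
import operator

# Supported operators for a small, safe arithmetic subset.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def evaluate(expr: str):
    """Parse an arithmetic expression and evaluate it by walking the AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression node")
    return walk(ast.parse(expr, mode="eval"))

print(evaluate("2 * (3 + 4)"))  # 14
```

Restricting evaluation to a whitelist of node types is what makes this safe, unlike a bare `eval()`.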
https://github.com/nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
datasets evaluation evaluation-datasets evaluation-framework language-model large-language-models multilingual natural-language-processing nlp
Last synced: 02 Aug 2025
https://github.com/AI21Labs/lm-evaluation
Evaluation suite for large-scale language models.
evaluation-framework language-model
Last synced: 23 Apr 2025
https://github.com/tsenst/crowdflow
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
benchmark-suite computer-vision crowd-analysis crowd-counting dataset evaluation-framework motion-estimation multi-object-tracking optical-flow synthetic-images tracking tracking-by-detection trajectories tub-crowdflow-dataset video-analytics video-processing video-surveillance
Last synced: 06 Mar 2026
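Optical-flow benchmarks like this one are typically scored with average endpoint error (AEPE), the mean Euclidean distance between predicted and ground-truth flow vectors; a minimal sketch of that metric:

```python
import math

def average_endpoint_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth (u, v) flow vectors."""
    errs = [math.hypot(pu - gu, pv - gv)
            for (pu, pv), (gu, gv) in zip(flow_pred, flow_gt)]
    return sum(errs) / len(errs)

pred = [(1.0, 0.0), (0.0, 2.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
print(average_endpoint_error(pred, gt))  # (1 + 2) / 2 = 1.5
```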
https://github.com/alibaba-damo-academy/MedEvalKit
MedEvalKit: A Unified Medical Evaluation Framework
evaluation-framework llm medicalai multimodal
Last synced: 28 Jul 2025
https://github.com/microsoft/eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
ai artificial-intelligence evaluation-framework llm machine-learning mllm
Last synced: 05 Apr 2025
https://github.com/x-plug/writingbench
WritingBench: A Comprehensive Benchmark for Generative Writing
ai benchmark evaluation-framework huggingface llm long-context long-text nlp text-generation writing
Last synced: 01 Sep 2025
https://github.com/kaiko-ai/eva
Evaluation framework for oncology foundation models (FMs)
evaluation-framework foundation-models machine-learning oncology
Last synced: 24 Dec 2025
https://github.com/codefuse-ai/codefuse-evaluation
Industrial-level evaluation benchmarks for coding LLMs across the full life-cycle of AI-native software development; an enterprise-grade code-LLM evaluation suite, being opened up continuously.
code-evaluation codecommenteval codefuse codetranseval evaluation-framework lcc repository-eval
Last synced: 07 Apr 2025
https://github.com/bmw-innovationlab/sordi-ai-evaluation-gui
This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.
ai bmw computer-vision dataset deeplearning docker evaluation evaluation-framework no-code python rest-api sordi synthetic-data tensorflow
Last synced: 02 Jul 2025
https://github.com/nouhadziri/DialogEntailment
The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
bert dialogue-evaluation evaluation-framework natural-language-inference
Last synced: 02 Apr 2025
https://github.com/pentoai/vectory
Vectory provides a collection of tools to track and compare embedding versions.
deep-learning deep-neural-networks embedding-python embedding-vectors embeddings-similarity evaluation-framework
Last synced: 18 Feb 2026
https://github.com/jinzhuoran/RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
adversarial-attacks benchmark evaluation-framework forgetting large-language-models membership-inference-attack natural-language-processing privacy-protection right-to-be-forgotten unlearning
Last synced: 24 Mar 2025
https://github.com/letta-ai/letta-evals
Evaluation kit for testing stateful agents
agentevals agents evaluation-framework language-model letta letta-agents
Last synced: 26 Feb 2026
https://github.com/powerflows/powerflows-dmn
Power Flows DMN - Powerful decisions and rules engine
decision-engine decision-tables dmn dmn-engine dmn-model evaluation evaluation-framework feel groovy java javascript kotlin kotlin-dsl mvel rule-engine rules rules-engine xml yaml
Last synced: 09 Apr 2025
https://github.com/cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created using the Gemini 2.5 Pro through a technique called "vibe coding"
evaluation-framework flight-simulator game-development gemini-2-5-pro-exp llm-evaluation vibe-check vibe-coding
Last synced: 05 May 2025
https://github.com/gair-nlp/scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
evaluation-framework generative-ai llm nlp
Last synced: 23 Jun 2025
https://github.com/adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
evaluation-framework llm-evaluation llms
Last synced: 03 Aug 2025
https://github.com/tohtsky/irspack
Train, evaluate, and optimize implicit feedback-based recommender systems.
eigen evaluation-framework hyperparameter-optimization knn-algorithm matrix-factorization optuna pybind11 recommender-systems
Last synced: 30 Apr 2025
https://github.com/vero-labs-ai/vero-eval
Open source framework for evaluating AI Agents
dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona
Last synced: 07 Apr 2026
https://github.com/astrabert/sentrev
Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs
embedders evaluation-framework python python-package qdrant semantic-search sentence-transformers text-embedding vector-database
Last synced: 16 Apr 2025
https://github.com/davidheineman/thresh
Universal, customizable and deployable fine-grained evaluation for text generation.
annotation-tool evaluation-framework natural-language-processing nlp thresh
Last synced: 16 Jan 2026
https://github.com/vinid/quica
quica is a tool to run inter-coder agreement pipelines in an easy and effective way. Multiple measures are run and the results are collected in a single table that can be easily exported to LaTeX.
evaluation-framework evaluation-metrics inter-coder-agreement inter-rater-agreement python
Last synced: 15 May 2025
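A representative measure from this family is Cohen's kappa, which corrects raw agreement between two coders for chance; a self-contained sketch of the standard formula (quica's own API differs):

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    labels = set(coder_a) | set(coder_b)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label probabilities.
    expected = sum(
        (coder_a.count(l) / n) * (coder_b.count(l) / n) for l in labels
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["yes", "yes", "no", "no"],
                   ["yes", "no", "no", "no"]))  # 0.5
```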
https://github.com/ad-freiburg/elevant
Entity linking evaluation and analysis tool
entity-disambiguation entity-linking evaluation-framework
Last synced: 29 Oct 2025
https://github.com/AstraBert/diRAGnosis
Diagnose the performance of your RAG.
docker evaluation-framework fastapi gradio llamaindex llm python-package qdrant rag retrieval synthetic-dataset-generation vector-database
Last synced: 14 Mar 2025
https://github.com/hpai-bsc/turtle
A Unified Evaluation of LLMs for RTL Generation (MLCAD 2025)
Last synced: 18 Jul 2025
https://github.com/ma7555/evalify
Evaluate your biometric verification models literally in seconds.
evaluation evaluation-framework evaluation-metrics face-recognition face-verification python
Last synced: 07 May 2025
https://github.com/liaad/tieval
An Evaluation Framework for Temporal Information Extraction Systems
evaluation-framework information-extraction nlp temporal-relations
Last synced: 25 Apr 2025
https://github.com/hlt-mt/subsonar
Evaluate the quality of SRT files using the multilingual multimodal SONAR model.
evaluation-framework evaluation-metrics subtitles subtitling
Last synced: 16 Jan 2026
https://github.com/borgwardtlab/ggme
Official repository for the ICLR 2022 paper "Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions" https://openreview.net/forum?id=tBtoZYKd9n
evaluation-framework evaluation-metrics generative-model graph-learning machine-learning
Last synced: 11 Jul 2025
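A simple instance of the graph-generative-model comparison the paper studies is the distance between degree distributions; the sketch below uses total variation distance, whereas the paper's proposed metrics are MMD-based and more involved. `degree_distribution` and `total_variation` are illustrative names.

```python
from collections import Counter

def degree_distribution(edges, n_nodes):
    """Normalized degree histogram of a simple undirected graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    counts = Counter(deg.get(i, 0) for i in range(n_nodes))
    return {d: c / n_nodes for d, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two degree distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

path = degree_distribution([(0, 1), (1, 2)], 3)          # a 3-node path
triangle = degree_distribution([(0, 1), (1, 2), (0, 2)], 3)
print(total_variation(path, triangle))  # ≈0.667
```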
https://github.com/eduardogr/evalytics
HR tool to orchestrate a company's employee performance review cycle.
company evaluation-cycle evaluation-framework human-resources performance-evaluation python python-3
Last synced: 07 Jul 2025
https://github.com/GiovanniBaccichet/DNCS-HTTP3
Docker-based virtualized framework for analysing HTTP/3+QUIC performance and comparing it to HTTP/2 and TCP.
docker evaluation-framework http3 performace performance-evaluation quic ssl tcp vagrant video-streaming
Last synced: 07 Apr 2025
https://github.com/vectara/mirage-bench
Repository for multilingual generation, RAG evaluations, and surrogate judge training for the Arena RAG leaderboard (NAACL'25)
anyscale-endpoint arena azure-api claude-api cohere-api evaluation-framework gemini-api llm-inference openai-api rag retrieval-augmented-generation vllm
Last synced: 27 Feb 2026
https://github.com/rosinality/halite
Acceleration framework for Human Alignment Learning
evaluation-framework inference large-language-models proximal-policy-optimization reinforcement-learning reinforcement-learning-from-human-feedback transformers
Last synced: 28 Jul 2025
https://github.com/aigc-apps/PertEval
This is the accompanying repo of the NeurIPS '24 D&B Spotlight paper, PertEval, including code, data, and main results.
evaluation-framework evaluation-metrics large-language-models llm-evaluation machine-learning trustworthy-ai
Last synced: 09 Jul 2025
https://github.com/maximhq/maxim-cookbooks
Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.
evaluation evaluation-framework genai observability
Last synced: 03 Mar 2026
https://github.com/googlecloudplatform/evalbench
EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database specific tasks.
databases eval evaluation-framework nl2sql text2sql
Last synced: 22 Jun 2025
https://github.com/feup-infolab/army-ant
An experimental information retrieval framework and a workbench for innovation in entity-oriented search.
ant evaluation-framework information-retrieval research
Last synced: 13 Jul 2025
https://github.com/jimmc414/claudecode_n_codex_swebench
Toolkit for measuring Claude Code and Codex performance over time against a baseline, using the SWE-bench Lite dataset **No API key required for Max subscribers**
claude-code claudecode eval evaluation-framework swebench
Last synced: 17 Sep 2025
https://github.com/seblemaguer/replikant
A flexible evaluation platform to enable researchers to conduct replicable subjective evaluation
evaluation evaluation-framework listening-test replicability
Last synced: 06 Sep 2025
https://github.com/vcerqueira/modelradar
Aspect-based Forecasting Accuracy
deep-learning evaluation-framework forecasting machine-learning time-series
Last synced: 22 Jan 2026
https://github.com/feup-infolab/army-ant-install
Army ANT installation via Docker Compose.
ant docker-compose-files evaluation-framework information-retrieval research
Last synced: 19 Mar 2026
https://github.com/stack-rs/mitosis
Mitosis: A Unified Transport Evaluation Framework
cli distributed distributed-systems evaluation evaluation-framework library rust transport-layer
Last synced: 04 Mar 2026
https://github.com/maastrichtu-ids/fair-enough-metrics
βοΈ API to publish FAIR metrics tests written in python
evaluation-framework evaluation-metrics fair-data
Last synced: 15 Jun 2025
https://github.com/dongli/esmdiag
This is a diagnostic package for earth system modeling.
earth-science evaluation-framework
Last synced: 04 Apr 2026
https://github.com/aidos-lab/rings
Relevant Information in Node features and Graph Structure
data-centric evaluation-framework geometric-deep-learning graph-learning icml-2025
Last synced: 05 Feb 2026
https://github.com/yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
ai evaluation evaluation-framework firebase genkit genkit-plugin genkitx llm llm-eval llm-evaluation llm-evaluation-framework llmops plugin prompt prompt-testing promptfoo prompts testing
Last synced: 27 Jul 2025
https://github.com/cmry/amica
Repository for the experiments described in "Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity" submitted as pre-print to arXiv.
cyberbullying cyberbullying-detection cybersecurity evaluation evaluation-framework machine-learning reproduction text-mining
Last synced: 23 Apr 2025
https://github.com/sap-samples/llm-round-trip-correctness
This repo provides code for evaluating LLM round-trip correctness between text and process models (text to process model and back again)
benchmarking business evaluation-framework genai processes round-trip-correctness
Last synced: 13 Apr 2025
https://github.com/artefactop/promptdev
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
ci-cd evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework prompt prompt-engineering prompt-toolkit red-team testing
Last synced: 30 Oct 2025
https://github.com/teilomillet/kushim
eval creator
dataset dataset-generation eval evaluation evaluation-framework llm openai
Last synced: 15 Mar 2026
https://github.com/bassrehab/spark-llm-eval
Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration
databricks evaluation-framework llm-evaluation- machine-learning mlflow mlops nlp pyspark python
Last synced: 14 Jan 2026
https://github.com/leo310/rag-chunking-evaluation
Assess the effectiveness of chunking strategies in RAG systems via a custom evaluation framework.
chunking evaluation-framework retrieval retrieval-augmented-generation
Last synced: 22 Jan 2026
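The baseline strategy such chunking evaluations usually start from is fixed-size chunks with overlap; a minimal sketch (`chunk_text` is a hypothetical helper, not this repo's API):

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into fixed-size character chunks, each overlapping the previous one."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # advance by the non-overlapping stride
    return chunks

chunks = chunk_text("a" * 500, size=200, overlap=50)
print(len(chunks))  # 4 chunks at starts 0, 150, 300, 450
```

Evaluations then compare such baselines against semantic or structure-aware splitting by measuring downstream retrieval quality.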
https://github.com/iaar-shanghai/guessarena
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
benchmark chatgpt deepseek domain-specific-eval evaluation-framework gamearena guessarena knowledge-evaluation large-language-models llm-eval openai qwen reasoning-evaluation reliable-evaluation
Last synced: 28 Jun 2025
https://github.com/kaos599/betterrag
BetterRAG: Powerful RAG evaluation toolkit for LLMs. Measure, analyze, and optimize how your AI processes text chunks with precision metrics. Perfect for RAG systems, document processing, and embedding quality assessment.
chunking-optimization embeddings embeddings-extraction embeddings-optimization evaluation evaluation-framework optimization rag rag-application rag-evaluation rag-optimization
Last synced: 27 Mar 2025
https://github.com/pedrodevog/synthecg
The first systematic evaluation framework for synthetic 10-second 12-lead ECGs from diagnostic class-conditioned generative models
deep-learning diffusion-models ecg electrocardiogram evaluation-framework gan generative-ai medical-ai ptb-xl python pytorch state-space-model synthetic-data time-series
Last synced: 17 Jul 2025
https://github.com/yukinagae/promptfoo-sample
Sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models
evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing
Last synced: 25 Feb 2026
https://github.com/parthapray/llm_evaluation_metrics_localized
This repo contains code for localized LLM evaluation metrics via a framework using Ollama and edge resources, along with novel derived metrics
evaluation evaluation-framework evaluation-metrics evaluations flask large-language-models metrics ollama-api restful-api
Last synced: 25 Aug 2025
https://github.com/arclabs561/anno
Information extraction for Rust: NER, coreference resolution, and evaluation
bert candle coreference-resolution entity-extraction evaluation-framework gliner information-extraction ner nlp onnx rust
Last synced: 13 Jan 2026
https://github.com/ksm26/improving-accuracy-of-llm-applications
The course equips developers with techniques to enhance the reliability of LLMs, focusing on evaluation, prompt engineering, and fine-tuning. Learn to systematically improve model accuracy through hands-on projects, including building a text-to-SQL agent and applying advanced fine-tuning methods.
evaluation-framework instruction-fine-tuning iterative-fine-tuning llama-models llm-accuracy lora memory-tuning model-reliability mome performance-optimization prompt-engineering self-reflection text-to-sql
Last synced: 28 Mar 2025
https://github.com/yukinagae/genkit-promptfoo-sample
Sample implementation demonstrating how to use Firebase Genkit with Promptfoo
evaluation evaluation-framework genkit llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing
Last synced: 15 Aug 2025
https://github.com/keitabroadwater/llm-eval-lab
A web sandbox for hands-on learning of LLM and RAG Evaluation
evaluation-framework fastapi gpt4 llm-evaluation llmops nextjs rag-evaluation ragas
Last synced: 14 May 2025
https://github.com/aiflowml/hyperparams
HyperParams: A Decentralized Framework for AI Agent Assessment and Certification
agent agents evaluation evaluation-framework evaluation-functions evaluation-kit evaluation-metrics evaluation-test ml ml-engineering
Last synced: 31 Oct 2025
https://github.com/jplane/llm-function-call-eval
Demonstrates a workflow for LLM function calling evaluation. Uses GitHub Copilot to generate synthetic eval data and Azure AI Foundry for handling results.
azure-ai-foundry evaluation-framework function-calling llm synthetic-dataset-generation tool-use vscode
Last synced: 04 Mar 2025
https://github.com/theaiautomators/deepeval-wrapper
REST API wrapper for DeepEval Python library with authentication
evaluation evaluation-framework evaluation-metrics
Last synced: 18 Jan 2026
https://github.com/amadlaorg/judge
Judge verifies that system settings meet required configurations and resource specifications
auditing evaluation evaluation-framework
Last synced: 18 Jan 2026
https://github.com/syed-m-hussain/recap
RECAP (Review Engine for Critiquing and Advising Pitches) is an LLM-powered agentic system designed to help founders and entrepreneurs receive actionable, multi-perspective, and structured feedback on their startup pitch presentations
evaluation-framework langchain langgraph-agents
Last synced: 20 Jun 2025
https://github.com/szegedai/hun_ner_checklist
CHECKLIST-style test cases and the testing of three Hungarian Named Entity Recognition tools.
evaluation-framework hungarian-language ner nlp
Last synced: 01 Feb 2026