An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with evaluation-framework

A curated list of projects in awesome lists tagged with evaluation-framework.

https://github.com/eleutherai/lm-evaluation-harness

A framework for few-shot evaluation of language models.

evaluation-framework language-model transformer

Last synced: 09 Sep 2025

https://github.com/promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Last synced: 03 Mar 2026

https://github.com/huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-framework evaluation-metrics huggingface

Last synced: 14 Oct 2025

https://github.com/servicenow/agentlab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

agent agents benchmark evaluation-framework lab llm llm-agents prompting web-agents

Last synced: 25 Sep 2025

https://github.com/aiverify-foundation/moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

benchmarking evaluation-framework llm red-teaming trustworthy-ai

Last synced: 05 Feb 2026

https://github.com/TonicAI/tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

evaluation-framework evaluation-metrics large-language-models llm llmops llms rag retrieval-augmented-generation

Last synced: 04 Apr 2025
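
Tools like tonic_validate score RAG answers against reference responses. As a rough illustration of what such an answer-similarity metric does (this is a minimal stdlib sketch, not tonic_validate's actual API), token-overlap F1 compares the generated answer to a reference:

```python
def token_overlap_f1(answer: str, reference: str) -> float:
    """F1 over whitespace tokens: a crude stand-in for the
    answer-similarity scoring that RAG evaluation toolkits provide."""
    a = answer.lower().split()
    r = reference.lower().split()
    # Count tokens shared between answer and reference (with multiplicity).
    common = sum(min(a.count(t), r.count(t)) for t in set(a))
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(r)
    return 2 * precision * recall / (precision + recall)
```

Real RAG evaluators typically add LLM-judged or embedding-based similarity on top of lexical overlap like this.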

https://github.com/zeno-ml/zeno

AI Data Management & Evaluation Platform

ai data-science evaluation evaluation-framework machine-learning python

Last synced: 18 Apr 2025

https://github.com/bijington/expressive

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

cross-platform evaluation evaluation-framework expression-evaluator expression-parser hacktoberfest netstandard parsing xamarin

Last synced: 31 Mar 2025

https://github.com/AI21Labs/lm-evaluation

Evaluation suite for large-scale language models.

evaluation-framework language-model

Last synced: 23 Apr 2025

https://github.com/alibaba-damo-academy/MedEvalKit

MedEvalKit: A Unified Medical Evaluation Framework

evaluation-framework llm medicalai multimodal

Last synced: 28 Jul 2025

https://github.com/microsoft/eureka-ml-insights

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

ai artificial-intelligence evaluation-framework llm machine-learning mllm

Last synced: 05 Apr 2025

https://github.com/x-plug/writingbench

WritingBench: A Comprehensive Benchmark for Generative Writing

ai benchmark evaluation-framework huggingface llm long-context long-text nlp text-generation writing

Last synced: 01 Sep 2025

https://github.com/kaiko-ai/eva

Evaluation framework for oncology foundation models (FMs)

evaluation-framework foundation-models machine-learning oncology

Last synced: 24 Dec 2025

https://github.com/codefuse-ai/codefuse-evaluation

Industrial-level evaluation benchmarks for coding LLMs across the full life-cycle of AI-native software development (enterprise-grade code LLM evaluation suite, under continuous release).

code-evaluation codecommenteval codefuse codetranseval evaluation-framework lcc repository-eval

Last synced: 07 Apr 2025

https://github.com/bmw-innovationlab/sordi-ai-evaluation-gui

This repository allows you to evaluate a trained computer vision model and get general information and evaluation metrics with little configuration.

ai bmw computer-vision dataset deeplearning docker evaluation evaluation-framework no-code python rest-api sordi synthetic-data tensorflow

Last synced: 02 Jul 2025

https://github.com/nouhadziri/DialogEntailment

The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"

bert dialogue-evaluation evaluation-framework natural-language-inference

Last synced: 02 Apr 2025

https://github.com/pentoai/vectory

Vectory provides a collection of tools to track and compare embedding versions.

deep-learning deep-neural-networks embedding-python embedding-vectors embeddings-similarity evaluation-framework

Last synced: 18 Feb 2026

https://github.com/letta-ai/letta-evals

Evaluation kit for testing stateful agents

agentevals agents evaluation-framework language-model letta letta-agents

Last synced: 26 Feb 2026

https://github.com/cedrickchee/vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created with Gemini 2.5 Pro using a technique called "vibe coding"

evaluation-framework flight-simulator game-development gemini-2-5-pro-exp llm-evaluation vibe-check vibe-coding

Last synced: 05 May 2025

https://github.com/gair-nlp/scaleeval

Scalable Meta-Evaluation of LLMs as Evaluators

evaluation-framework generative-ai llm nlp

Last synced: 23 Jun 2025

https://github.com/adithya-s-k/indic_eval

A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks

evaluation-framework llm-evaluation llms

Last synced: 03 Aug 2025

https://github.com/tohtsky/irspack

Train, evaluate, and optimize implicit feedback-based recommender systems.

eigen evaluation-framework hyperparameter-optimization knn-algorithm matrix-factorization optuna pybind11 recommender-systems

Last synced: 30 Apr 2025

https://github.com/astrabert/sentrev

Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs

embedders evaluation-framework python python-package qdrant semantic-search sentence-transformers text-embedding vector-database

Last synced: 16 Apr 2025

https://github.com/davidheineman/thresh

🌾 Universal, customizable and deployable fine-grained evaluation for text generation.

annotation-tool evaluation-framework natural-language-processing nlp thresh

Last synced: 16 Jan 2026

https://github.com/vinid/quica

quica is a tool to run inter-coder agreement pipelines in an easy and effective way. Multiple measures are run and the results are collected in a single table that can be easily exported to LaTeX.

evaluation-framework evaluation-metrics inter-coder-agreement inter-rater-agreement python

Last synced: 15 May 2025
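
Cohen's kappa is one of the standard inter-coder agreement measures that pipelines like quica compute. A minimal stdlib sketch (illustrative only, not quica's API) shows the chance-corrected agreement between two coders:

```python
from collections import Counter

def cohens_kappa(coder_a: list, coder_b: list) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance given each coder's label frequencies."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1.0 for perfect agreement and 0.0 when agreement is no better than chance; agreement toolkits report several such measures (e.g. Krippendorff's alpha, Fleiss' kappa) side by side.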

https://github.com/ad-freiburg/elevant

Entity linking evaluation and analysis tool

entity-disambiguation entity-linking evaluation-framework

Last synced: 29 Oct 2025

https://github.com/hpai-bsc/turtle

A Unified Evaluation of LLMs for RTL Generation 🐒 (MLCAD 2025)

evaluation-framework rtl

Last synced: 18 Jul 2025

https://github.com/ma7555/evalify

Evaluate your biometric verification models literally in seconds.

evaluation evaluation-framework evaluation-metrics face-recognition face-verification python

Last synced: 07 May 2025
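
Biometric verification evaluators such as evalify report error rates like FAR and FRR over similarity scores. A minimal sketch of those two metrics (an illustration of the standard definitions, not evalify's API):

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """At a given similarity threshold:
    FAR = fraction of impostor comparisons wrongly accepted (score >= threshold)
    FRR = fraction of genuine comparisons wrongly rejected (score < threshold)"""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr
```

Sweeping the threshold over all scores traces the ROC/DET curve; the point where FAR equals FRR is the equal error rate (EER) commonly quoted for face and speaker verification models.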

https://github.com/liaad/tieval

An Evaluation Framework for Temporal Information Extraction Systems

evaluation-framework information-extraction nlp temporal-relations

Last synced: 25 Apr 2025

https://github.com/hlt-mt/subsonar

Evaluate the quality of SRT files using the multilingual multimodal SONAR model.

evaluation-framework evaluation-metrics subtitles subtitling

Last synced: 16 Jan 2026

https://github.com/borgwardtlab/ggme

Official repository for the ICLR 2022 paper "Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions" https://openreview.net/forum?id=tBtoZYKd9n

evaluation-framework evaluation-metrics generative-model graph-learning machine-learning

Last synced: 11 Jul 2025
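
The metrics the GGME paper analyses compare distributions of graph descriptors via Maximum Mean Discrepancy (MMD). As a rough illustration of the statistic (a stdlib sketch on 1-D samples with a Gaussian kernel; the paper's setting uses graph descriptor vectors and studies kernel/parameter pitfalls):

```python
import math

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two 1-D samples
    under a Gaussian kernel with bandwidth sigma."""
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    def mean_kernel(us, vs):
        # Average kernel value over all cross pairs.
        return sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

MMD is zero when the two samples come from the same distribution and grows as they diverge; the paper's point is that conclusions can hinge on kernel and bandwidth choices like `sigma` here.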

https://github.com/eduardogr/evalytics

HR tool to orchestrate the Performance Review Cycle of the employees of a company.

company evaluation-cycle evaluation-framework human-resources performance-evaluation python python-3

Last synced: 07 Jul 2025

https://github.com/GiovanniBaccichet/DNCS-HTTP3

Docker-based virtualized framework for analysing HTTP/3+QUIC performance and comparing it to HTTP/2 and TCP.

docker evaluation-framework http3 performace performance-evaluation quic ssl tcp vagrant video-streaming

Last synced: 07 Apr 2025

https://github.com/vectara/mirage-bench

Repository for multilingual generation, RAG evaluation, and surrogate judge training for the Arena RAG leaderboard (NAACL '25)

anyscale-endpoint arena azure-api claude-api cohere-api evaluation-framework gemini-api llm-inference openai-api rag retrieval-augmented-generation vllm

Last synced: 27 Feb 2026

https://github.com/aigc-apps/PertEval

This is the accompanying repo of the NeurIPS '24 D&B Spotlight paper, PertEval, including code, data, and main results.

evaluation-framework evaluation-metrics large-language-models llm-evaluation machine-learning trustworthy-ai

Last synced: 09 Jul 2025

https://github.com/maximhq/maxim-cookbooks

Maxim is an end-to-end AI evaluation and observability platform that empowers modern AI teams to ship agents with quality, reliability, and speed.

evaluation evaluation-framework genai observability

Last synced: 03 Mar 2026

https://github.com/googlecloudplatform/evalbench

EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database-specific tasks.

databases eval evaluation-framework nl2sql text2sql

Last synced: 22 Jun 2025

https://github.com/feup-infolab/army-ant

An experimental information retrieval framework and a workbench for innovation in entity-oriented search.

ant evaluation-framework information-retrieval research

Last synced: 13 Jul 2025

https://github.com/jimmc414/claudecode_n_codex_swebench

Toolkit for measuring Claude Code and Codex performance over time against a baseline using the SWE-bench Lite dataset. **No API key required for Max subscribers**

claude-code claudecode eval evaluation-framework swebench

Last synced: 17 Sep 2025

https://github.com/seblemaguer/replikant

A flexible evaluation platform to enable researchers to conduct replicable subjective evaluation

evaluation evaluation-framework listening-test replicability

Last synced: 06 Sep 2025

https://github.com/stack-rs/mitosis

Mitosis: A Unified Transport Evaluation Framework

cli distributed distributed-systems evaluation evaluation-framework library rust transport-layer

Last synced: 04 Mar 2026

https://github.com/maastrichtu-ids/fair-enough-metrics

β˜‘οΈ API to publish FAIR metrics tests written in python

evaluation-framework evaluation-metrics fair-data

Last synced: 15 Jun 2025

https://github.com/dongli/esmdiag

This is a diagnostic package for earth system modeling.

earth-science evaluation-framework

Last synced: 04 Apr 2026

https://github.com/aidos-lab/rings

Relevant Information in Node features and Graph Structure

data-centric evaluation-framework geometric-deep-learning graph-learning icml-2025

Last synced: 05 Feb 2026

https://github.com/cmry/amica

Repository for the experiments described in "Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity" submitted as pre-print to arXiv.

cyberbullying cyberbullying-detection cybersecurity evaluation evaluation-framework machine-learning reproduction text-mining

Last synced: 23 Apr 2025

https://github.com/sap-samples/llm-round-trip-correctness

This repo provides code for evaluating LLM round-trip correctness when converting text to process models and vice versa.

benchmarking business evaluation-framework genai processes round-trip-correctness

Last synced: 13 Apr 2025

https://github.com/artefactop/promptdev

A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.

ci-cd evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework prompt prompt-engineering prompt-toolkit red-team testing

Last synced: 30 Oct 2025

https://github.com/bassrehab/spark-llm-eval

Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration

databricks evaluation-framework llm-evaluation- machine-learning mlflow mlops nlp pyspark python

Last synced: 14 Jan 2026

https://github.com/leo310/rag-chunking-evaluation

Assess the effectiveness of chunking strategies in RAG systems via a custom evaluation framework.

chunking evaluation-framework retrieval retrieval-augmented-generation

Last synced: 22 Jan 2026

https://github.com/iaar-shanghai/guessarena

[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

benchmark chatgpt deepseek domain-specific-eval evaluation-framework gamearena guessarena knowledge-evaluation large-language-models llm-eval openai qwen reasoning-evaluation reliable-evaluation

Last synced: 28 Jun 2025

https://github.com/kaos599/betterrag

BetterRAG: Powerful RAG evaluation toolkit for LLMs. Measure, analyze, and optimize how your AI processes text chunks with precision metrics. Perfect for RAG systems, document processing, and embedding quality assessment.

chunking-optimization embeddings embeddings-extraction embeddings-optimization evaluation evaluation-framework optimization rag rag-application rag-evaluation rag-optimization

Last synced: 27 Mar 2025

https://github.com/pedrodevog/synthecg

The first systematic evaluation framework for synthetic 10-second 12-lead ECGs from diagnostic class-conditioned generative models

deep-learning diffusion-models ecg electrocardiogram evaluation-framework gan generative-ai medical-ai ptb-xl python pytorch state-space-model synthetic-data time-series

Last synced: 17 Jul 2025

https://github.com/yukinagae/promptfoo-sample

A sample project demonstrating how to use Promptfoo, a test framework for evaluating the output of generative AI models

evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops prompt-testing promptfoo prompts testing

Last synced: 25 Feb 2026

https://github.com/parthapray/llm_evaluation_metrics_localized

This repo contains code for localized LLM evaluation metrics via a framework using Ollama and edge resources, along with novel derived metrics.

evaluation evaluation-framework evaluation-metrics evaluations flask large-language-models metrics ollama-api restful-api

Last synced: 25 Aug 2025

https://github.com/arclabs561/anno

Information extraction for Rust: NER, coreference resolution, and evaluation

bert candle coreference-resolution entity-extraction evaluation-framework gliner information-extraction ner nlp onnx rust

Last synced: 13 Jan 2026

https://github.com/ksm26/improving-accuracy-of-llm-applications

The course equips developers with techniques to enhance the reliability of LLMs, focusing on evaluation, prompt engineering, and fine-tuning. Learn to systematically improve model accuracy through hands-on projects, including building a text-to-SQL agent and applying advanced fine-tuning methods.

evaluation-framework instruction-fine-tuning iterative-fine-tuning llama-models llm-accuracy lora memory-tuning model-reliability mome performance-optimization prompt-engineering self-reflection text-to-sql

Last synced: 28 Mar 2025

https://github.com/keitabroadwater/llm-eval-lab

A web sandbox for hands-on learning of LLM and RAG Evaluation

evaluation-framework fastapi gpt4 llm-evaluation llmops nextjs rag-evaluation ragas

Last synced: 14 May 2025

https://github.com/aiflowml/hyperparams

HyperParams: A Decentralized Framework for AI Agent Assessment and Certification

agent agents evaluation evaluation-framework evaluation-functions evaluation-kit evaluation-metrics evaluation-test ml ml-engineering

Last synced: 31 Oct 2025

https://github.com/jplane/llm-function-call-eval

Demonstrates a workflow for LLM function calling evaluation. Uses GitHub Copilot to generate synthetic eval data and Azure AI Foundry for handling results.

azure-ai-foundry evaluation-framework function-calling llm synthetic-dataset-generation tool-use vscode

Last synced: 04 Mar 2025

https://github.com/theaiautomators/deepeval-wrapper

REST API wrapper for DeepEval Python library with authentication

evaluation evaluation-framework evaluation-metrics

Last synced: 18 Jan 2026

https://github.com/amadlaorg/judge

πŸ§‘β€βš–οΈ Judge verifies that system settings meet required configurations and resource specifications πŸ§‘β€βš–οΈ

auditing evaluation evaluation-framework

Last synced: 18 Jan 2026

https://github.com/syed-m-hussain/recap

RECAP (Review Engine for Critiquing and Advising Pitches) is an LLM-powered agentic system designed to help founders and entrepreneurs receive actionable, multi-perspective, and structured feedback on their startup pitch presentations

evaluation-framework langchain langgraph-agents

Last synced: 20 Jun 2025

https://github.com/szegedai/hun_ner_checklist

CHECKLIST-style test cases and the testing of three Hungarian Named Entity Recognition tools.

evaluation-framework hungarian-language ner nlp

Last synced: 01 Feb 2026