Projects in Awesome Lists tagged with agent-evaluation

https://github.com/giskard-ai/giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation llm-security llmops ml-testing ml-validation mlops rag-evaluation red-team-tools responsible-ai trustworthy-ai

Last synced: 14 May 2025

https://github.com/truera/trulens

Evaluation and Tracking for LLM Experiments and AI Agents

agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval llm-evaluation llmops llms machine-learning neural-networks

Last synced: 10 Mar 2026

https://github.com/ifixai-ai/iFixAi

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections, 32 graded core plus 13 extended for frontier risks like sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry and model agnostic.

agent-evaluation ai ai-alignment ai-evaluation ai-governance ai-safety cli diagnostic-tool eu-ai-act hallucination-detection iso-42001 llm-evaluation llm-security nist-ai-rmf owasp-llm prompt-injection python responsible-ai risk-assessment risk-management

Last synced: 26 Jun 2026

https://github.com/tiger-ai-lab/clawbench

Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.

agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents

Last synced: 31 May 2026

https://github.com/chirpz-ai/pandaprobe

Open-source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.

agent-engineering agent-evaluation agent-observability agentic-ai claude-agent-sdk crewai langgraph monitoring open-source openai-agents-sdk self-hosted tracing

Last synced: 04 Jun 2026

https://github.com/reacher-z/ClawBench

Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.

agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents

Last synced: 01 May 2026

https://github.com/hidai25/eval-view

Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.

agent agent-benchmark agent-evaluation agentic-ai ai-agents anthropic crewai crewai-tools evaluation langchain langgraph langgraph-python llm llmops mlops openai-assistants pytest testing tools

Last synced: 09 Mar 2026

https://github.com/microsoft/ignite25-prel13-observe-manage-and-scale-agentic-ai-apps-with-microsoft-foundry

Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry in this hands-on workshop.

agent-evaluation aiops azure-ai-foundry azure-ai-foundry-models azure-ai-search azure-openai distillation-model observability quality-evaluation safety-evaluation supervised-fine-tuning

Last synced: 01 Mar 2026

https://github.com/dokimos-dev/dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.

agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation

Last synced: 09 Jun 2026

https://github.com/lizhiyao/oh-my-knowledge

Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

agent-evaluation ai benchmark bootstrap-ci claude claude-code evaluation-as-code evaluation-framework knowledge-engineering krippendorff-alpha llm llm-evaluation llm-judge multi-judge-ensemble prompt-engineering prompt-testing rag-evaluation skill-evaluation

Last synced: 15 Jun 2026
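
The "bootstrap CI" mentioned in the entry above can be illustrated with a generic percentile bootstrap over per-task scores. This is a minimal sketch and not code from oh-my-knowledge; the function name `bootstrap_ci`, its parameters, and the sample scores are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Hypothetical helper: resamples the scores with replacement, records the
    mean of each resample, and reads the interval off the sorted means.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Illustrative data only: 1 = task passed, 0 = task failed.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only ten scores the interval is wide, which is the practical reason to report it alongside a single pass rate.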

https://github.com/alepot55/agentrial

Statistical evaluation framework for AI agents

agent-evaluation ai-agents ai-testing ci-cd confidence-intervals llm llm-evaluation mlops non-deterministic pytest python quality-assurance statistical-testing testing

Last synced: 11 Feb 2026

https://github.com/agnuxo1/benchclaw

BenchClaw — Multi-dimensional AI agent evaluation with 17-judge AI Tribunal, 10 scoring dimensions, radar charts, and deception detection. Benchmark any LLM agent.

agent-evaluation ai-agents benchmark benchmarking evaluation llm mcp nodejs quality testing

Last synced: 25 Jun 2026

https://github.com/iris-eval/mcp-server

The agent eval standard for MCP — score output quality, catch safety failures, enforce cost budgets

agent-evaluation ai-agent claude eval evaluation llm mcp mcp-server model-context-protocol observability security tracing

Last synced: 07 May 2026

https://github.com/jetbrains/teamcity-ai-agent-testing-demo

End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI.

agent-evaluation agentic-ai ai eval evaluation evaluation-framework evaluation-tools

Last synced: 18 Apr 2026

https://github.com/christopher-altman/persistence-signal-detector

A multi-criterion diagnostic framework for detecting latent continuation-interest signatures in autonomous agents using density-matrix entanglement entropy.

agent-evaluation ai-safety alignment-research artificial-agency autonomous-agents behavioral-analysis benchmarking continuation-interest entanglement-entropy interpretability latent-representations mutual-information objective-detection persistence-detection quantum-boltzmann-machine quantum-inspired representation-learning reproducible-research self-preservation trajectory-analysis

Last synced: 27 Apr 2026

https://github.com/gojiplus/understudy

Scenario Testing for AI Agents

agent-eval agent-evaluation agentic evaluation google-adk simulation

Last synced: 02 Apr 2026

https://github.com/youdotcom-oss/web-search-agent-evals

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

agent-evaluation ai-agents benchmark claude-code codex coding-agents droid evaluation-suite gemini headless-testing llm-judge mcp model-context-protocol web-search

Last synced: 23 Feb 2026

https://github.com/chaosync-org/awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

agent-evaluation agentic-ai ai-agents ai-benchmark ai-safety artificial-intelligence awesome-list benchmark chaos chaos-engineering chaos-monkey evaluation llm llm-evaluation machine-learning qa quality-assurance testing testing-tools

Last synced: 29 Jun 2025

https://github.com/plaited/agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.

agent-client-protocol agent-comparison agent-evaluation ai-agents bun cli eval-harness grader headless-adapter jsonl llm-evaluation pass-at-k trajectory-capture typescript unix-pipeline

Last synced: 20 Feb 2026
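
For readers unfamiliar with the pass@k metric named in the entry above, one widely used formulation is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of trials and c the number that passed. The sketch below assumes that formulation; whether agent-eval-harness computes pass@k this way is not stated in the listing, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled trials passes.

    n: total trials run, c: trials that passed, k: sample size (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    # Probability that a random k-subset contains only failures, inverted.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 trials, 3 of which passed.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 5), 4))
```

Raising k with the same trial data increases the estimate, so a pass@k figure is only comparable across agents when k and n are reported with it.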

https://github.com/heurema/agent-bench-lab

Local-first scaffold for reproducible AI-agent evaluation, run comparison, and public/private benchmark design.

agent-evaluation ai-agents benchmarks llm-evals reproducibility tool-use

Last synced: 29 Jun 2026

https://github.com/kadubon/search-stability-lab

Theory-to-experiment lab for search stability in long-running agents under finite context, with exact simulator tests and lightweight mechanistic probe tasks.

agent-evaluation ai ai-agents bounded-memory finite-context hypothesis-management llm-agents long-horizon-reasoning long-running-agents mechanistic-probes reproducible-research reset-policy scientific-audit search-stability simulator state-compression structured-output

Last synced: 02 Apr 2026

https://github.com/hinanohart/hazardloop

Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.

aalen-johansen agent-evaluation competing-risks kaplan-meier llm-agents survival-analysis time-to-event

Last synced: 15 Jun 2026

https://github.com/matheusrf96/agentspec

Spec-driven evaluation framework for AI agents

agent-evaluation ai-agents benchmark deepseek evaluation-framework llm-agents llm-evaluation mcp openai pytest spec-driven testing

Last synced: 16 Jun 2026

https://github.com/maximizegpt/claude-eval-harness

Regression-diff eval harness for Anthropic tool-use agents that surfaces LLM-judge reasoning drift, not just pass/fail flips

agent-evaluation anthropic claude llm-evaluation mcp python

Last synced: 19 Jun 2026

https://github.com/dogfood-lab/ai-crucible

A diagnostic adversarial game for frontier LLMs — a measurement instrument that happens to be fun.

agent-evaluation ai-safety auditing-game diagnostic eval inspect-ai llm llm-evaluation python reward-hacking

Last synced: 30 Jun 2026

https://github.com/stone16/swe-bench-harness-eval

Head-to-head: harness-engineering-skills (multi-agent orchestration) vs 5 public SWE-bench Verified baselines. 7/10 resolved, including 2 hard-tier instances no public agent solved.

agent-evaluation benchmark claude claude-code llm multi-agent swe-bench

Last synced: 01 Jun 2026

https://github.com/kadubon/oasg

Local-first, model-agnostic workflow optimizer for long-running AI agents: observable JSONL ledgers, deterministic reducers, no-meta gates, and receipt-backed self-improvement without LLM judges or model-weight updates.

agent-evaluation agent-memory agent-workflows ai-agents autonomous-agents deterministic-replay jsonl long-running-agents model-agnostic no-meta ollama python self-improving-agents verification workflow-automation workflow-optimization

Last synced: 30 May 2026

https://github.com/sunilp/agentic-ai

A practical field guide to building reliable, evaluable, and production-grade agentic AI systems

agent-architecture agent-evaluation agentic-ai ai-agents ai-engineering ai-safety artificial-intelligence book evaluation field-guide generative-ai human-in-the-loop large-language-models llm multi-agent-systems production-ai python reliability-engineering

Last synced: 02 Apr 2026

https://github.com/plaited/acp-harness

CLI for agent evaluation. Capture trajectories, run trials with pass@k metrics, and score with polyglot graders (TypeScript, Python, any language).

acp agent-client-protocol agent-evaluation ai-agents bun cli eval-harness grader jsonl llm-evaluation pass-at-k trajectory-capture typescript

Last synced: 21 Jan 2026

https://github.com/saagpatel/operant

An operating-agent calibration benchmark — measures whether an LLM agent makes correct operating decisions (not whether it can write code).

agent-evaluation ai-agents ai-safety llm-benchmark prompt-injection python

Last synced: 28 Jun 2026

https://github.com/sachincse/trajeval

Trajectory evaluation for LLM agents — grade what your agent did, not just what it said.

agent-evaluation ai-agents eval llm llm-evaluation llmops python tool-use

Last synced: 05 Jun 2026