Projects in Awesome Lists tagged with agent-evaluation
A curated list of projects in awesome lists tagged with agent-evaluation.
https://github.com/giskard-ai/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
agent-evaluation ai-red-team ai-security ai-testing fairness-ai llm llm-eval llm-evaluation llm-security llmops ml-testing ml-validation mlops rag-evaluation red-team-tools responsible-ai trustworthy-ai
Last synced: 14 May 2025
https://github.com/truera/trulens
Evaluation and Tracking for LLM Experiments and AI Agents
agent-evaluation agentops ai-agents ai-monitoring ai-observability evals explainable-ml llm-eval llm-evaluation llmops llms machine-learning neural-networks
Last synced: 10 Mar 2026
https://github.com/ifixai-ai/iFixAi
Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections: 32 graded core plus 13 extended for frontier risks like sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry and model agnostic.
agent-evaluation ai ai-alignment ai-evaluation ai-governance ai-safety cli diagnostic-tool eu-ai-act hallucination-detection iso-42001 llm-evaluation llm-security nist-ai-rmf owasp-llm prompt-injection python responsible-ai risk-assessment risk-management
Last synced: 26 Jun 2026
https://github.com/tiger-ai-lab/clawbench
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents
Last synced: 31 May 2026
https://github.com/chirpz-ai/pandaprobe
Open-source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.
agent-engineering agent-evaluation agent-observability agentic-ai claude-agent-sdk crewai langgraph monitoring open-source openai-agents-sdk self-hosted tracing
Last synced: 04 Jun 2026
https://github.com/reacher-z/ClawBench
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.
agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents
Last synced: 01 May 2026
https://github.com/hidai25/eval-view
Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.
agent agent-benchmark agent-evaluation agentic-ai ai-agents anthropic crewai crewai-tools evaluation langchain langgraph langgraph-python llm llmops mlops openai-assistants pytest testing tools
Last synced: 09 Mar 2026
https://github.com/microsoft/ignite25-prel13-observe-manage-and-scale-agentic-ai-apps-with-microsoft-foundry
Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry with this hands-on workshop.
agent-evaluation aiops azure-ai-foundry azure-ai-foundry-models azure-ai-search azure-openai distillation-model observability quality-evaluation safety-evaluation supervised-fine-tuning
Last synced: 01 Mar 2026
https://github.com/dokimos-dev/dokimos
LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog.
agent-evaluation agentic-ai evaluation evaluation-framework evaluation-metrics java junit junit-extension koog kotlin langchain4j llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics rag rag-evaluation retrieval-augmented-generation spring-ai spring-ai-evaluation
Last synced: 09 Jun 2026
https://github.com/lizhiyao/oh-my-knowledge
Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.
agent-evaluation ai benchmark bootstrap-ci claude claude-code evaluation-as-code evaluation-framework knowledge-engineering krippendorff-alpha llm llm-evaluation llm-judge multi-judge-ensemble prompt-engineering prompt-testing rag-evaluation skill-evaluation
Last synced: 15 Jun 2026
https://github.com/alepot55/agentrial
Statistical evaluation framework for AI agents
agent-evaluation ai-agents ai-testing ci-cd confidence-intervals llm llm-evaluation mlops non-deterministic pytest python quality-assurance statistical-testing testing
Last synced: 11 Feb 2026
https://github.com/agnuxo1/benchclaw
BenchClaw — Multi-dimensional AI agent evaluation with 17-judge AI Tribunal, 10 scoring dimensions, radar charts, and deception detection. Benchmark any LLM agent.
agent-evaluation ai-agents benchmark benchmarking evaluation llm mcp nodejs quality testing
Last synced: 25 Jun 2026
https://github.com/iris-eval/mcp-server
The agent eval standard for MCP — score output quality, catch safety failures, enforce cost budgets
agent-evaluation ai-agent claude eval evaluation llm mcp mcp-server model-context-protocol observability security tracing
Last synced: 07 May 2026
https://github.com/jetbrains/teamcity-ai-agent-testing-demo
End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI.
agent-evaluation agentic-ai ai eval evaluation evaluation-framework evaluation-tools
Last synced: 18 Apr 2026
https://github.com/christopher-altman/persistence-signal-detector
A multi-criterion diagnostic framework for detecting latent continuation-interest signatures in autonomous agents using density-matrix entanglement entropy.
agent-evaluation ai-safety alignment-research artificial-agency autonomous-agents behavioral-analysis benchmarking continuation-interest entanglement-entropy interpretability latent-representations mutual-information objective-detection persistence-detection quantum-boltzmann-machine quantum-inspired representation-learning reproducible-research self-preservation trajectory-analysis
Last synced: 27 Apr 2026
https://github.com/gojiplus/understudy
Scenario Testing for AI Agents
agent-eval agent-evaluation agentic evaluation google-adk simulation
Last synced: 02 Apr 2026
https://github.com/chaosync-org/awesome-ai-agent-testing
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
agent-evaluation agentic-ai ai-agents ai-benchmark ai-safety artificial-intelligence awesome-list benchmark chaos chaos-engineering chaos-monkey evaluation llm llm-evaluation machine-learning qa quality-assurance testing testing-tools
Last synced: 29 Jun 2025
https://github.com/plaited/agent-eval-harness
Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
agent-client-protocol agent-comparison agent-evaluation ai-agents bun cli eval-harness grader headless-adapter jsonl llm-evaluation pass-at-k trajectory-capture typescript unix-pipeline
Last synced: 20 Feb 2026
https://github.com/youdotcom-oss/web-search-agent-evals
Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.
agent-evaluation ai-agents benchmark claude-code codex coding-agents droid evaluation-suite gemini headless-testing llm-judge mcp model-context-protocol web-search
Last synced: 23 Feb 2026
https://github.com/sunilp/agentic-ai
A practical field guide to building reliable, evaluable, and production-grade agentic AI systems
agent-architecture agent-evaluation agentic-ai ai-agents ai-engineering ai-safety artificial-intelligence book evaluation field-guide generative-ai human-in-the-loop large-language-models llm multi-agent-systems production-ai python reliability-engineering
Last synced: 02 Apr 2026
https://github.com/stone16/swe-bench-harness-eval
Head-to-head: harness-engineering-skills (multi-agent orchestration) vs 5 public SWE-bench Verified baselines. 7/10 resolved, including 2 hard-tier instances no public agent solved.
agent-evaluation benchmark claude claude-code llm multi-agent swe-bench
Last synced: 01 Jun 2026
https://github.com/kadubon/oasg
Local-first, model-agnostic workflow optimizer for long-running AI agents: observable JSONL ledgers, deterministic reducers, no-meta gates, and receipt-backed self-improvement without LLM judges or model-weight updates.
agent-evaluation agent-memory agent-workflows ai-agents autonomous-agents deterministic-replay jsonl long-running-agents model-agnostic no-meta ollama python self-improving-agents verification workflow-automation workflow-optimization
Last synced: 30 May 2026
https://github.com/hinanohart/hazardloop
Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.
aalen-johansen agent-evaluation competing-risks kaplan-meier llm-agents survival-analysis time-to-event
Last synced: 15 Jun 2026
https://github.com/kadubon/search-stability-lab
Theory-to-experiment lab for search stability in long-running agents under finite context, with exact simulator tests and lightweight mechanistic probe tasks.
agent-evaluation ai ai-agents bounded-memory finite-context hypothesis-management llm-agents long-horizon-reasoning long-running-agents mechanistic-probes reproducible-research reset-policy scientific-audit search-stability simulator state-compression structured-output
Last synced: 02 Apr 2026
https://github.com/matheusrf96/agentspec
Spec-driven evaluation framework for AI agents
agent-evaluation ai-agents benchmark deepseek evaluation-framework llm-agents llm-evaluation mcp openai pytest spec-driven testing
Last synced: 16 Jun 2026
https://github.com/maximizegpt/claude-eval-harness
Regression-diff eval harness for Anthropic tool-use agents that surfaces LLM-judge reasoning drift, not just pass/fail flips
agent-evaluation anthropic claude llm-evaluation mcp python
Last synced: 19 Jun 2026
https://github.com/plaited/acp-harness
CLI for agent evaluation. Capture trajectories, run trials with pass@k metrics, and score with polyglot graders (TypeScript, Python, any language).
acp agent-client-protocol agent-evaluation ai-agents bun cli eval-harness grader jsonl llm-evaluation pass-at-k trajectory-capture typescript
Last synced: 21 Jan 2026
https://github.com/sachincse/trajeval
Trajectory evaluation for LLM agents — grade what your agent did, not just what it said.
agent-evaluation ai-agents eval llm llm-evaluation llmops python tool-use
Last synced: 05 Jun 2026