An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with agent-evaluation

A curated list of projects in awesome lists tagged with agent-evaluation.

https://github.com/ifixai-ai/iFixAi

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections: 32 graded core inspections plus 13 extended ones for frontier risks like sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry and model agnostic.

agent-evaluation ai ai-alignment ai-evaluation ai-governance ai-safety cli diagnostic-tool eu-ai-act hallucination-detection iso-42001 llm-evaluation llm-security nist-ai-rmf owasp-llm prompt-injection python responsible-ai risk-assessment risk-management

Last synced: 26 Jun 2026

https://github.com/tiger-ai-lab/clawbench

Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.

agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents

Last synced: 31 May 2026

https://github.com/chirpz-ai/pandaprobe

Open-source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.

agent-engineering agent-evaluation agent-observability agentic-ai claude-agent-sdk crewai langgraph monitoring open-source openai-agents-sdk self-hosted tracing

Last synced: 04 Jun 2026

https://github.com/reacher-z/ClawBench

Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.

agent-evaluation agentic-ai ai-agent-benchmark ai-agents benchmark browser-agent browser-automation browser-use chrome-agent chrome-extension computer-use dataset evaluation everyday-tasks llm llm-evaluation online-tasks real-world-benchmark web-agent web-agents

Last synced: 01 May 2026

https://github.com/hidai25/eval-view

Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.

agent agent-benchmark agent-evaluation agentic-ai ai-agents anthropic crewai crewai-tools evaluation langchain langgraph langgraph-python llm llmops mlops openai-assistants pytest testing tools

Last synced: 09 Mar 2026

https://github.com/lizhiyao/oh-my-knowledge

Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

agent-evaluation ai benchmark bootstrap-ci claude claude-code evaluation-as-code evaluation-framework knowledge-engineering krippendorff-alpha llm llm-evaluation llm-judge multi-judge-ensemble prompt-engineering prompt-testing rag-evaluation skill-evaluation

Last synced: 15 Jun 2026

https://github.com/agnuxo1/benchclaw

BenchClaw — Multi-dimensional AI agent evaluation with 17-judge AI Tribunal, 10 scoring dimensions, radar charts, and deception detection. Benchmark any LLM agent.

agent-evaluation ai-agents benchmark benchmarking evaluation llm mcp nodejs quality testing

Last synced: 25 Jun 2026

https://github.com/iris-eval/mcp-server

The agent eval standard for MCP: score output quality, catch safety failures, enforce cost budgets.

agent-evaluation ai-agent claude eval evaluation llm mcp mcp-server model-context-protocol observability security tracing

Last synced: 07 May 2026

https://github.com/jetbrains/teamcity-ai-agent-testing-demo

End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. The demo uses Junie and Google Gemini CLI as examples.

agent-evaluation agentic-ai ai eval evaluation evaluation-framework evaluation-tools

Last synced: 18 Apr 2026

https://github.com/Agnuxo1/BenchClaw

BenchClaw — Multi-dimensional AI agent evaluation with 17-judge AI Tribunal, 10 scoring dimensions, radar charts, and deception detection. Benchmark any LLM agent.

agent-evaluation ai-agents benchmark benchmarking evaluation llm mcp nodejs quality testing

Last synced: 05 Jun 2026

https://github.com/chaosync-org/awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

agent-evaluation agentic-ai ai-agents ai-benchmark ai-safety artificial-intelligence awesome-list benchmark chaos chaos-engineering chaos-monkey evaluation llm llm-evaluation machine-learning qa quality-assurance testing testing-tools

Last synced: 29 Jun 2025

https://github.com/plaited/agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.

agent-client-protocol agent-comparison agent-evaluation ai-agents bun cli eval-harness grader headless-adapter jsonl llm-evaluation pass-at-k trajectory-capture typescript unix-pipeline

Last synced: 20 Feb 2026

https://github.com/youdotcom-oss/web-search-agent-evals

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

agent-evaluation ai-agents benchmark claude-code codex coding-agents droid evaluation-suite gemini headless-testing llm-judge mcp model-context-protocol web-search

Last synced: 23 Feb 2026

https://github.com/stone16/swe-bench-harness-eval

Head-to-head: harness-engineering-skills (multi-agent orchestration) vs 5 public SWE-bench Verified baselines. 7/10 resolved, including 2 hard-tier instances no public agent solved.

agent-evaluation benchmark claude claude-code llm multi-agent swe-bench

Last synced: 01 Jun 2026

https://github.com/kadubon/oasg

Local-first, model-agnostic workflow optimizer for long-running AI agents: observable JSONL ledgers, deterministic reducers, no-meta gates, and receipt-backed self-improvement without LLM judges or model-weight updates.

agent-evaluation agent-memory agent-workflows ai-agents autonomous-agents deterministic-replay jsonl long-running-agents model-agnostic no-meta ollama python self-improving-agents verification workflow-automation workflow-optimization

Last synced: 30 May 2026

https://github.com/hinanohart/hazardloop

Censoring-aware competing-risk survival analysis for long-horizon LLM-agent trajectories (Kaplan-Meier / Nelson-Aalen / Aalen-Johansen CIF / Weibull AFT + offline-replay). CPU-only.

aalen-johansen agent-evaluation competing-risks kaplan-meier llm-agents survival-analysis time-to-event

Last synced: 15 Jun 2026

https://github.com/kadubon/search-stability-lab

Theory-to-experiment lab for search stability in long-running agents under finite context, with exact simulator tests and lightweight mechanistic probe tasks.

agent-evaluation ai ai-agents bounded-memory finite-context hypothesis-management llm-agents long-horizon-reasoning long-running-agents mechanistic-probes reproducible-research reset-policy scientific-audit search-stability simulator state-compression structured-output

Last synced: 02 Apr 2026

https://github.com/maximizegpt/claude-eval-harness

Regression-diff eval harness for Anthropic tool-use agents that surfaces LLM-judge reasoning drift, not just pass/fail flips.

agent-evaluation anthropic claude llm-evaluation mcp python

Last synced: 19 Jun 2026

https://github.com/plaited/acp-harness

CLI for agent evaluation. Capture trajectories, run trials with pass@k metrics, and score with polyglot graders (TypeScript, Python, any language).

acp agent-client-protocol agent-evaluation ai-agents bun cli eval-harness grader jsonl llm-evaluation pass-at-k trajectory-capture typescript

Last synced: 21 Jan 2026

https://github.com/sachincse/trajeval

Trajectory evaluation for LLM agents — grade what your agent did, not just what it said.

agent-evaluation ai-agents eval llm llm-evaluation llmops python tool-use

Last synced: 05 Jun 2026
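
Two entries above (plaited/agent-eval-harness and plaited/acp-harness) mention pass@k metrics. As a rough illustration of what such a metric measures, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of trials run on a task and c is the number that passed. The function name `pass_at_k` is hypothetical and this is not taken from either project's code; how those tools actually compute the metric is not stated in this listing.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled trials passes.

    n: total trials run for a task (assumed independent)
    c: number of those trials that passed
    k: sample size being asked about (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failures exist, so any k-subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 10 trials, 3 passes.
    for k in (1, 3, 5):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Averaging this value over all tasks in a suite gives a single pass@k score for an agent; comparing pass@1 against a larger k shows how much of an agent's success depends on retries.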