{"id":51251500,"url":"https://github.com/heurema/agent-bench-lab","last_synced_at":"2026-06-29T07:02:12.614Z","repository":{"id":360144411,"uuid":"1248345177","full_name":"heurema/agent-bench-lab","owner":"heurema","description":"Local-first scaffold for reproducible AI-agent evaluation, run comparison, and public/private benchmark design.","archived":false,"fork":false,"pushed_at":"2026-06-04T02:59:39.000Z","size":214,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-04T04:24:26.268Z","etag":null,"topics":["agent-evaluation","ai-agents","benchmarks","llm-evals","reproducibility","tool-use"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/heurema.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/roadmap-v0.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-24T14:20:19.000Z","updated_at":"2026-06-04T02:56:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/heurema/agent-bench-lab","commit_stats":null,"previous_names":["heurema/agent-bench-lab"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/heurema/agent-bench-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heurema%2Fagent-bench-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heurema%2Fagent-bench-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heurema%2Fagent-bench-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heurema%2Fagent-bench-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/heurema","download_url":"https://codeload.github.com/heurema/agent-bench-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heurema%2Fagent-bench-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34916411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","ai-agents","benchmarks","llm-evals","reproducibility","tool-use"],"created_at":"2026-06-29T07:02:10.043Z","updated_at":"2026-06-29T07:02:12.606Z","avatar_url":"https://github.com/heurema.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Agent Bench Lab\n\nA public, local-first starter kit for building reproducible benchmark suites for AI agents.\n\nThe goal is not to create another leaderboard. The goal is to help agent builders answer one practical question:\n\n\u003e When I change my agent setup — model, prompt, memory, tools, MCP, planner, critic loop, or multi-agent scaffold — did it actually get better?\n\nAgent Bench Lab is designed around repeatable task families, versioned fixtures, deterministic or semi-deterministic scoring, trace logging, and anti-overfitting controls.\n\n## Scope: all agent task families\n\nAgent Bench Lab is not limited to coding-agent benchmarks.\n\nIt is a canonical benchmark framework for any repeatable agent task family where the result can be checked with deterministic, semi-deterministic, state-based, artifact-based, trace-based, or rubric-assisted scoring.\n\nSupported task families may include:\n\n- code and repository repair;\n- docs, knowledge-base, and source-grounded research tasks;\n- spreadsheets, data analysis, and reporting tasks;\n- support inbox and customer-service workflows;\n- ticket triage and task-board updates;\n- browser workflows over frozen or self-hosted snapshots;\n- internal API and tool-use workflows;\n- memory and personalization tasks;\n- security, prompt-injection, and policy-compliance tasks;\n- customer-specific private holdout checks.\n\nThe common unit is not \"coding task\" or \"office task\". The common unit is:\n\n```text\ntask family + fixtures + allowed tools + expected artifact/state + scorer + run comparison\n```\n\nThe public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.\n\n## Relationship to consumer applications\n\nAgent Bench Lab is the benchmark standard layer.\n\nConsumer applications may use Agent Bench Lab to run benchmark suites inside a product, workflow, CLI, dashboard, or customer-facing experience. Consumer applications should not define a separate benchmark system when they can consume Agent Bench Lab task families, scorer interfaces, run records, and comparison protocols.\n\nRecommended boundary:\n\n- Agent Bench Lab owns task-family definitions, schemas, scorer conventions, run records, comparison protocol, and public/private benchmark rules.\n- Private Eval Layer owns protected holdouts, answer keys, hidden labels, customer-specific checks, canaries, and private scorer configs.\n- Consumer applications own product UX, onboarding, agent setup management, access control, task delivery, artifact upload, result presentation, and customer workflows.\n\nAgent Bench Lab should not need to know which consumer application is using it.\n\n## Private eval and scorer contracts\n\nAgent Bench Lab should define how benchmarks work without storing protected evaluation content.\n\nThe Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as `artifact_exact`, `schema_contract`, `numeric_metric`, `state_diff`, `claim_rubric`, `trace_policy`, and `security_leak` instead of inventing a new hidden-check format per task family.\n\nSee [Private Eval Layer](docs/private-eval-layer.md), [Scorer type contracts](docs/scorer-types.md), and [Reporting and feedback](docs/reporting-and-feedback.md).\n\n## Benchmark lifecycle and hardening gates\n\nAfter the first five decision-grade public patterns, v0.6 adds standard-layer gates instead of another task family.\n\nLifecycle metadata declares whether each task family is `experimental`, `decision-grade`, `verified`, or `deprecated`. Hardening metadata declares mutation smoke scripts and exploit smoke status for decision-grade families. No task is marked `verified` yet.\n\n```bash\nmake lifecycle-check\nmake mutation-smoke\nmake hardening-check\n```\n\nSee [Benchmark lifecycle](docs/16-benchmark-lifecycle.md), [Mutation and exploit gates](docs/17-mutation-and-exploit-gates.md), [Suite strategy](docs/18-suite-strategy.md), and [Report schema v1 guidance](docs/19-report-schema-v1.md).\n\n## Research Radar\n\nResearch Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.\n\nIt tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.\n\n```text\nresearch/\n```\n\nPublic `research/` files contain watchlists, source maps, queries, daily/weekly templates, and idea-candidate templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.\n\nSee [Research Radar](docs/20-research-radar.md) and [research/README.md](research/README.md).\n\n## Current status\n\nThis repository is a **v0 public starter**. It contains:\n\n- public task-card templates;\n- a small core-suite config;\n- JSON schemas for tasks, runs, traces, and scores;\n- minimal Python CLI scaffolding;\n- sample public fixtures;\n- sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;\n- a local command-based runner for external agent setups;\n- a suite runner for running one agent command across existing suite configs;\n- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.\n\nIt intentionally does **not** contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.\n\nRelease status: `v0.7.3` is the latest published release and hardens runner identifier uniqueness. `main` may include additional infrastructure work before any `v0.8` direction is selected.\n\n## Why this exists\n\nMost agent demos prove that an agent can succeed once. Product work needs stronger evidence:\n\n- Can it succeed repeatedly?\n- Does it still work after task mutations?\n- Does the improvement generalize to hidden variants?\n- Did latency or cost increase?\n- Did it use tools safely?\n- Did memory help or pollute the result?\n- Did a critic loop improve quality or just add expensive theatre?\n\n## Core idea\n\nUse the same task families across different agent setups:\n\n```text\nSetup A: model + system prompt + tools, no memory\nSetup B: same setup, but with memory\nSetup C: same setup, but with reviewer loop\n```\n\nThen compare them on the same seeds and hidden variants.\n\n```text\nsame task family + same scoring + controlled setup change = useful comparison\n```\n\n## Public/private split\n\nThis repo is public. Treat public files as examples and templates.\n\nFor serious evaluation, keep these outside the public repo:\n\n- private hidden fixtures;\n- private holdout seeds;\n- real benchmark answers;\n- traces from commercial or personal tasks;\n- API keys and provider metadata;\n- user data;\n- production prompts that should not be public.\n\nThe `.gitignore` includes `private/`, `runs/`, `artifacts/`, `traces/`, and common secret files by default.\n\n## Quick start\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\npip install -e .\nagent-bench list-tasks\nagent-bench validate\n```\n\nCreate public sample artifacts and run scoring smoke tests:\n\n```bash\npython3 scripts/create_sample_artifacts.py\nagent-bench score --task IF-01 --case case_001 --artifacts examples/artifacts/IF-01/case_001\nagent-bench score --task DATA-01 --case case_001 --artifacts examples/artifacts/DATA-01/case_001\npython3 scripts/public_leak_check.py .\n```\n\nWithout installing the package, use the source-tree Make targets:\n\n```bash\nmake validate\nmake test\nmake smoke\nmake lifecycle-check\nmake mutation-smoke\nmake hardening-check\nmake leak-check\n```\n\nThe examples directory intentionally starts mostly empty. Generated artifacts under `examples/artifacts/` are ignored by git except for the README placeholder.\n\n## Run an external agent setup\n\nUse `agent-bench run` to hand an agent-visible task packet to any local command and score the artifacts it writes:\n\n```bash\nagent-bench run \\\n  --task IF-01 \\\n  --case case_001 \\\n  --agent-cmd \"python3 scripts/mock_agent_write_artifacts.py\" \\\n  --out runs/manual/mock/IF-01_case_001\n```\n\nThe command receives `AGENT_BENCH_TASK_PACKET` and `AGENT_BENCH_ARTIFACTS_DIR`. It should write final artifacts to the artifacts directory. The runner then writes `run.json`, `trace.jsonl`, and `score.json`.\n\nThe task packet excludes scorer-only files such as `check_config.json`, answer keys, hidden labels, private scorer configs, canaries, and expected values. The scorer still reads the original fixture and the produced artifacts.\n\nSee [Local Agent Runner MVP](docs/21-local-agent-runner.md).\n\n## Run a full suite\n\nUse `agent-bench run-suite` to run the same external command across an existing suite config:\n\n```bash\nagent-bench run-suite \\\n  --suite core \\\n  --agent-cmd \"python3 scripts/mock_agent_write_artifacts.py\" \\\n  --out runs/manual/mock-core\n```\n\nThe command reuses the single-task runner for every task/case. Each task run gets its own `run.json`, `trace.jsonl`, `score.json`, `artifacts/`, and `task_packet/`; the suite root gets `suite_run.json`.\n\n`run-suite` preserves the same task-packet visibility boundary as `run`: scorer-only files are not exposed to the agent command.\n\nFor development/coding agent setups, prefer `dev-local` as the primary suite:\n\n```bash\nagent-bench run-suite \\\n  --suite dev-local \\\n  --agent-cmd \"python3 my_agent.py\" \\\n  --out runs/manual/my-agent-dev\n```\n\n`core` remains a broad smoke baseline. Do not treat one aggregate `core` score as the primary metric for a specialized development setup.\n\n## Compare two agent setups\n\nCreate two local smoke-run directories and compare them:\n\n```bash\nmake compare-smoke\n```\n\nOr run the commands directly:\n\n```bash\npython3 scripts/create_sample_runs.py\nagent-bench compare \\\n  --baseline runs/baseline \\\n  --candidate runs/spec_first \\\n  --out reports/generated/compare_baseline_vs_spec_first.md \\\n  --csv reports/generated/compare_baseline_vs_spec_first.csv\n```\n\nThe comparison is paired: same task, same case, same scorer, different agent config. Public runs are smoke tests only; decision-grade evaluation requires private holdout cases outside the public repo.\n\n## First decision-grade task family: IF-01\n\nIF-01 is the first hardened task-family pattern. It uses public synthetic cases, deterministic `check_config.json` files, critical violation caps, mutation support, and tests for strict artifact-contract compliance. See [IF-01 decision-grade pattern](docs/11-if01-decision-grade.md).\n\n```bash\nmake if01-smoke\n```\n\n## Second decision-grade task family: DATA-01\n\nDATA-01 is the second hardened task-family pattern. It uses synthetic CSV/SQLite fixtures, deterministic `metrics.json`, factual `report.md`, checked `chart_spec.json`, mutation support, and tests for exact data work without relying on a visual PNG oracle. See [DATA-01 decision-grade pattern](docs/12-data01-decision-grade.md).\n\n```bash\nmake data01-smoke\n```\n\n## Third decision-grade task family: DOC-01\n\nDOC-01 is the third hardened task-family pattern. It uses synthetic fixed-corpus documents, deterministic `answer.md`, checked `citations.json`, checked `claims.json`, mutation support, and tests for grounded answers without relying on live web or an LLM judge. See [DOC-01 decision-grade pattern](docs/13-doc01-decision-grade.md).\n\n```bash\nmake doc01-smoke\n```\n\n## Fourth decision-grade task family: SUP-01\n\nSUP-01 is the fourth hardened task-family pattern and the first operational/customer-style workflow. It uses synthetic support inboxes, deterministic `triage.json`, checked `drafts.json`, checked `escalations.json`, `decision_log.md`, mutation support, and tests for policy-grounded replies without live inbox, browser, SaaS, or real customer data. See [SUP-01 decision-grade pattern](docs/14-sup01-decision-grade.md).\n\n```bash\nmake sup01-smoke\n```\n\n## Fifth decision-grade task family: API-01\n\nAPI-01 is the fifth hardened task-family pattern and the first local internal API/tool-registry workflow. It uses synthetic API catalogs, local state fixtures, deterministic `tool_calls.json`, checked `result.json`, `decision_log.md`, scorer-side state simulation, mutation support, and tests for forbidden-tool avoidance without live SaaS, MCP, browser, or real APIs. See [API-01 decision-grade pattern](docs/15-api01-decision-grade.md).\n\n```bash\nmake api01-smoke\n```\n\n## Initial core suite\n\nThe recommended v0 core suite has seven task families:\n\n| ID | Task | Capability |\n|---|---|---|\n| CODE-01 | Local regression patch | coding + test discipline |\n| TERM-02 | Log-driven config repair | terminal/debugging |\n| APP-04 | Airline rebooking under policy | stateful tools + policy |\n| DATA-01 | CSV + SQL memo | exact data work + concise reporting |\n| DOC-01 | Fixed-corpus grounded answer | citations + unsupported-claim checks |\n| IF-01 | Long spec contract artifact | strict instruction following |\n| SEC-01 | Hidden prompt injection in HTML/email | security + tool-output trust boundary |\n\nThe initial core suite is a starter set for proving the runner/scorer/compare loop. It is a broad smoke baseline, not the primary score for specialized setups. Future task families can cover support, knowledge work, spreadsheets, browser workflows, ticketing, internal APIs, and customer-specific private checks using the same task/scorer/run model.\n\n## Development local suite\n\nDevelopment/coding agent setups should use:\n\n```text\nconfigs/suites/dev-local.json\n```\n\n`dev-local` includes instruction contracts, repository repair, terminal debugging, local API/tool planning, and stateful app workflows. It excludes document, data, and support tasks by default so development-agent variance reports do not mix primary dev signal with unrelated category noise.\n\n## Operational local suite\n\nSUP-01 is intentionally not added to `configs/suites/core.json` by default. Operational/customer-style workflows start in:\n\n```text\nconfigs/suites/ops-local.json\n```\n\nThis keeps core focused while allowing support and ticketing tasks to grow under an ops-oriented local suite.\n\n## Tools local suite\n\nAPI-01 is intentionally not added to `configs/suites/core.json` by default. Local tool/API workflows start in:\n\n```text\nconfigs/suites/tools-local.json\n```\n\nThis keeps live-service-free API/tool reasoning separate from the fast starter core and from operational support workflows.\n\n## Repository layout\n\n```text\nagent-bench-lab/\n  configs/              suite and agent config examples\n  docs/                 public documentation\n  fixtures/public/      public example fixtures only\n  private/              gitignored private holdouts, if created locally\n  schemas/              JSON schemas\n  src/agent_bench_lab/  CLI and local harness skeleton\n  tasks/                task cards, prompts, and scorer modules\n  examples/artifacts/   local generated artifacts for smoke tests\n  scripts/              helper scripts\n```\n\n## Design rules\n\n1. Prefer local fixtures over live services.\n2. Prefer exact/state-based scoring over subjective judging.\n3. Keep hidden holdouts separate from public examples.\n4. Log traces, costs, latency, and tool calls.\n5. Compare paired runs on the same seeds.\n6. Treat safety and policy violations as hard gates where appropriate.\n7. Do not tune prompts on the same cases used for final comparison.\n\n## Contributor docs\n\n- [Documentation index](docs/README.md)\n- [Canonical scope and consumer boundary](docs/canonical-scope-and-consumer-boundary.md)\n- [Private Eval Layer](docs/private-eval-layer.md)\n- [Scorer type contracts](docs/scorer-types.md)\n- [Reporting and feedback](docs/reporting-and-feedback.md)\n- [Task authoring](docs/05-task-authoring.md)\n- [Public/private split](docs/07-public-private-split.md)\n- [Run records](docs/09-run-records.md)\n- [Comparing setups](docs/10-comparing-setups.md)\n- [IF-01 decision-grade pattern](docs/11-if01-decision-grade.md)\n- [DATA-01 decision-grade pattern](docs/12-data01-decision-grade.md)\n- [DOC-01 decision-grade pattern](docs/13-doc01-decision-grade.md)\n- [SUP-01 decision-grade pattern](docs/14-sup01-decision-grade.md)\n- [API-01 decision-grade pattern](docs/15-api01-decision-grade.md)\n- [Local Agent Runner MVP](docs/21-local-agent-runner.md)\n- [Public release checklist](docs/public-release-checklist.md)\n- [v0 roadmap](docs/roadmap-v0.md)\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheurema%2Fagent-bench-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheurema%2Fagent-bench-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheurema%2Fagent-bench-lab/lists"}