{"id":51146340,"url":"https://github.com/jfrog/agent-belt","last_synced_at":"2026-06-26T03:01:30.444Z","repository":{"id":363096091,"uuid":"1228745539","full_name":"jfrog/agent-belt","owner":"jfrog","description":"Reproducible evaluation for AI coding agents. Multi-turn scenarios against Claude Code, Codex, Copilot, Cursor, Gemini CLI, Goose, OpenCode, or any custom agent you plug in; verify behavior with rule checks, workspace diffs, multi-judge LLM consensus; pin reliability with pass^k variance across trials. Git worktrees, optional Docker sandbox.","archived":false,"fork":false,"pushed_at":"2026-06-07T12:56:36.000Z","size":1144,"stargazers_count":15,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T13:13:57.141Z","etag":null,"topics":["agents","ai","ai-agents","benchmark","claude-code","cli","codex","coding-agents","cursor","evaluation-framework","gemini-cli","github-copilot","goose","jfrog","llm","llm-evaluation","opencode","testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfrog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-04T10:32:34.000Z","updated_at":"2026-06-07T12:53:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jfrog/agent-belt","commit_stats":null,"previous_names":["jfrog/agent-belt"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/jfrog/agent-belt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfrog%2Fagent-belt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfrog%2Fagent-belt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfrog%2Fagent-belt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfrog%2Fagent-belt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jfrog","download_url":"https://codeload.github.com/jfrog/agent-belt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfrog%2Fagent-belt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34801014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-26T02:00:06.560Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","ai","ai-agents","benchmark","claude-code","cli","codex","coding-agents","cursor","evaluation-framework","gemini-cli","github-copilot","goose","jfrog","llm","llm-evaluation","opencode","testing"],"created_at":"2026-06-26T03:01:29.868Z","updated_at":"2026-06-26T03:01:30.432Z","avatar_url":"https://github.com/jfrog.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# agent-belt\n\nA seat belt for the agents you ship. Run reproducible multi-turn scenarios\nagainst the *binary your users actually run* - Claude Code, Cursor, Codex,\nGemini, Copilot, opencode, Goose, or your own CLI. Score with rule checks,\nworkspace diffs, and a separate LLM judge. Pin variance down with\nrepeat trials, paraphrase families, and multi-judge consensus - so a\ngreen check actually means something on a stochastic stack.\n\n## Disclaimer\n\nThis tool is provided as-is with no warranty. Running untrusted agents\nor scenarios can cause real, irreversible damage to your system. Each\nscenario drives a real agent CLI that runs as your user - a malicious\nscenario can modify dotfiles, SSH config, or git hooks; run destructive\nor data-exfiltrating commands; or reach the network to pull\ninstructions, upload source, or push to external repos. Only run\nscenarios you trust. We take no responsibility for damages.\n\n## Before You Begin\n\nYou need **Python 3.13** (3.11+ supported) and at least one\nCLI agent installed and authenticated (run it once interactively to sign\nin). One agent is enough to get started.\n\n## Install\n\n```bash\npip install agent-belt\nbelt doctor   # verify agents, auth, and LLM scoring providers\n```\n\nFor development setup, see [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## Quick Start\n\n```bash\nbelt quickstart              # auto-detects first available agent\nbelt quickstart claude-code  # or specify one\n```\n\nThis validates the agent, runs a single scenario with rules-only scoring\n(no API key needed), and prints next steps.\n\nFor more control over the bundled showcase (no source clone needed):\n\n```bash\nbelt eval --bundled showcase --modes rules \\\n  --tags real-runnable --allow-external-working-dir                # whole runnable showcase\nbelt eval --bundled showcase --scorer-arg model=openai/gpt-5.4-mini  # + LLM judge (cloud)\nbelt eval --bundled showcase --scorer-arg model=ollama/gemma4        # + LLM judge (local, no API key)\nbelt eval --bundled showcase --modes rules --workers 3               # parallel\nbelt eval --bundled showcase --progress live --workers 3             # live TUI\n```\n\n`--bundled \u003cNAME\u003e` resolves to the scenarios shipped inside the wheel\n(`belt/_bundled_examples/scenarios/\u003cNAME\u003e/`), so the same one-liner\nworks after `pip install agent-belt` without cloning the source repo.\nUse `belt eval --bundled` to target the whole bundled tree, or\n`belt eval \u003cPATH\u003e` to point at your own scenarios directory.\n\nThe showcase is a schema-coverage reference, not a green-CI baseline -\nthe unfiltered command intentionally includes `dry-run-only` scenarios\nthat document fields no generic CLI agent surfaces. See\n[`examples/scenarios/showcase/README.md`](examples/scenarios/showcase/README.md)\nfor the full first-run guide and the per-group index.\n\nSee the [CLI guide](docs/glossary/CLI.md) for the subcommand index and\ncommon workflows; `belt \u003ccmd\u003e --help` is the canonical flag reference.\n\nFor a real-world example, see [`examples/`](examples/README.md) - fixture\nrepos in Python, TypeScript, and Go with scenarios across L1-L4 difficulty\nlevels, including cross-agent comparison data.\n\n## Why agent-belt\n\nThe eval ecosystem is real, and most of it solves a different problem.\nGeneric LLM-eval frameworks score a model's output on a single prompt.\nObservability platforms (LangSmith, Langfuse, Braintrust) eval the\nfunction you wrap with their SDK - not the binary your users\n`npm install`'d. Trajectory frameworks build the agent loop themselves\nand score it from the inside. Coding-agent benchmarks freeze a curated\ndataset and score the field on someone else's repo.\n\nagent-belt closes the obvious gap: take the binary your users actually\nrun, point it at a real workspace with your team's skills and MCP servers\nwired up, hand it scenarios that mirror the use cases they drive it\nthrough, repeat each one to measure reliability instead of luck, and\nship a verdict.\n\nThree things are distinctive:\n\n1. **The black-box CLI is the unit.** Not the model. Not a wrapper. Not\n   a solver class. The thing you `pip install`'d, `brew install`'d, or\n   `curl | bash`'d. agent-belt runs it as a subprocess; your MCP servers,\n   skills, and auth stay untouched. The same harness drives any\n   CLI agent - same scenarios, same scoring, same verdict format.\n2. **Variance gets pinned down on three axes.** `--trials N` runs the\n   same scenario k times and reports `pass^k` for k = 1, 3, 8. Tag a\n   *family* of paraphrased scenarios and the aggregator gives you\n   family-level pass rate. A multi-judge `--scorer-config` requires\n   consensus before a verdict counts. Each axis on its own is a partial\n   signal - together they're a regression detector for stochastic stacks.\n3. **Your judge runs separately.** You pick its provider, model, persona,\n   dimensions, and per-scenario instructions. It runs out-of-process from\n   the agent under test - never as another skill inside the same Claude\n   Code or Cursor session that's doing the work. OpenAI, Anthropic, Azure,\n   `ollama/llama3.3` on your laptop, vLLM on your own GPUs. The plumbing\n   is in the box.\n\n## How agent-belt compares\n\n| You want to evaluate | Use |\n|---|---|\n| A single LLM prompt's output (model-level, no agent loop) | DeepEval, Promptfoo, lm-evaluation-harness, OpenAI Evals |\n| A function in your app you've wrapped with a vendor SDK | LangSmith, Langfuse, Phoenix, Braintrust |\n| A research agent loop you've built inside a framework | Inspect AI |\n| The field of coding agents on a frozen public benchmark | SWE-bench, Aider Polyglot, Terminal-Bench |\n| **The CLI agent your users actually run, on your repo, against your scenarios** | **agent-belt** |\n\n## Evaluate Your Own Use Cases\n\nagent-belt sends inputs to your agent headlessly. The agent CLI must be\ninstalled and authenticated on the host running agent-belt. Beyond that,\nwhat the agent can do per scenario depends on where the agent discovers\nits configuration:\n\n- **Discoverable from the working directory** - skills, slash commands,\n  plugins, and MCP servers that the agent auto-loads from the worktree\n  root (e.g. Claude Code's `.claude/` and `.mcp.json`) can ship with the\n  fixture. See\n  [`examples/fixtures/folio/`](examples/fixtures/folio/) for a worked\n  example exercising all four surfaces.\n- **User-level only** - configuration the agent reads from a fixed user\n  path (e.g. `~/.cursor/`, `~/.claude/`, OS keychains) must be set up\n  out of band on the host running agent-belt.\n- **Repository access** - the agent must be able to read and write the\n  worktree agent-belt creates per scenario (default behavior).\n\n### Create a Scenario\n\nScenarios live in a **group directory**. Each group has a `_config.json`\nthat holds shared settings - which agent to use, workspace isolation,\ncustom scoring dimensions - so individual scenarios stay focused on the\ntest case itself. This matters when you have many scenarios: instead of\nrepeating `\"agent\": \"claude-code\"` and `\"working_dir\": \"../../my-repo\"`\nin every file, you set it once in `_config.json`.\n\n```text\nmy-scenarios/\n├── _config.json          # shared: agent, workspace, tags\n├── fix_bug.json          # scenario 1\n└── explain_arch.json     # scenario 2\n```\n\n`_config.json` - shared group configuration\n([full schema](docs/glossary/SCENARIOS.md#2-group-config-_configjson)):\n\n```json\n{\n  \"agent\": \"claude-code\",\n  \"working_dir\": \"../../path/to/your/repo\",\n  \"default_tags\": [\"my-project\"]\n}\n```\n\nFor a fully worked group with MCP servers, default skills, plugin\nskills, plugin commands, and slash commands all wired into one fixture,\nsee\n[`examples/scenarios/experience/programmatic-setup-claude/`](examples/scenarios/experience/programmatic-setup-claude/) -\nseven runnable scenarios over a single `_config.json`. Run it with:\n\n```bash\nbelt eval examples/scenarios/experience/programmatic-setup-claude --modes rules\n```\n\nSee [Scenarios](docs/glossary/SCENARIOS.md) for the full authoring guide\nand the field-by-field reference appendix.\n\n## Documentation\n\nThe `docs/glossary/` directory holds the entire reference. One topic per\nfile, no duplication:\n\n| Topic | Link |\n|---|---|\n| Architecture, design principles, where-things-live | [ARCHITECTURE.md](docs/glossary/ARCHITECTURE.md) |\n| Scenario authoring + JSON schema reference | [SCENARIOS.md](docs/glossary/SCENARIOS.md) |\n| Scoring (rules + LLM judges, multi-judge, thresholds) | [SCORING.md](docs/glossary/SCORING.md) |\n| Layered configuration, env vars, security toggles | [CONFIGURATION.md](docs/glossary/CONFIGURATION.md) |\n| Sandbox providers, kernel invariants, network policy | [SANDBOXING.md](docs/glossary/SANDBOXING.md) |\n| Threat model, trust boundaries, controls, residual risk | [SECURITY-MODEL.md](docs/glossary/SECURITY-MODEL.md) |\n| CLI subcommand index, workflows, progress modes | [CLI.md](docs/glossary/CLI.md) |\n| CI integration, threshold gating, agent install/auth recipes | [CI.md](docs/glossary/CI.md) |\n| Plugin architecture (agents, scorers, exporters) | [PLUGGABILITY.md](docs/glossary/PLUGGABILITY.md) |\n| On-disk artifacts, schema versioning, benchmark card | [OUTCOMES.md](docs/glossary/OUTCOMES.md) |\n| Built-in agent feature matrix | [AGENT-FEATURES.md](docs/glossary/AGENT-FEATURES.md) |\n\nagent-belt also ships a `SKILL.md` inside the wheel so an AI coding\nagent on your machine can author scenarios, run `belt eval`, and\ninterpret a verdict without leaving your IDE. One-time symlink setup is\nin [CONTRIBUTING.md](CONTRIBUTING.md#enable-the-bundled-skillmd-for-your-ai-coding-agent).\n\n## Not for\n\n- Benchmarking the LLMs themselves.\n- Wrapping or replacing the agent loop. agent-belt invokes your\n  *already-installed, already-authenticated* agent CLI as a subprocess.\n- IDE integration. agent-belt is a headless CI-friendly harness; the\n  agents it drives have their own IDE plugins.\n\n## Development\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for setup, the test matrix, plugin\nauthoring, and what gates a PR.\n\n## Release Notes\n\nThe release notes are available [here](https://github.com/jfrog/agent-belt/releases).\n\n## License\n\nApache 2.0 - see [LICENSE](LICENSE) and [NOTICE](NOTICE) for third-party\nattribution. Contributions require signing the\n[JFrog CLA](https://jfrog.com/cla/); see [CONTRIBUTING.md](CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfrog%2Fagent-belt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfrog%2Fagent-belt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfrog%2Fagent-belt/lists"}