{"id":51139434,"url":"https://github.com/eserlxl/external-agents","last_synced_at":"2026-06-25T21:00:54.253Z","repository":{"id":365229074,"uuid":"1271114734","full_name":"eserlxl/external-agents","owner":"eserlxl","description":"Run external coding-agent CLIs (agy, codex, optionally claude) as autonomous sub-agents from inside Claude Code.","archived":false,"fork":false,"pushed_at":"2026-06-16T11:35:40.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-16T13:23:17.608Z","etag":null,"topics":["agents","agy","claude-code","claude-code-plugin","codex","delegation","orchestration","sub-agents"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eserlxl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-16T10:52:24.000Z","updated_at":"2026-06-16T11:35:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/eserlxl/external-agents","commit_stats":null,"previous_names":["eserlxl/external-agents"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/eserlxl/external-agents","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eserlxl%2Fexternal-agents","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eserlxl%2Fexternal-agents/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/eserlxl%2Fexternal-agents/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eserlxl%2Fexternal-agents/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eserlxl","download_url":"https://codeload.github.com/eserlxl/external-agents/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eserlxl%2Fexternal-agents/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34792207,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","agy","claude-code","claude-code-plugin","codex","delegation","orchestration","sub-agents"],"created_at":"2026-06-25T21:00:53.269Z","updated_at":"2026-06-25T21:00:54.241Z","avatar_url":"https://github.com/eserlxl.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# external-agents\n\n\u003cdiv align=\"center\"\u003e\n\n[![CI](https://github.com/eserlxl/external-agents/actions/workflows/ci.yml/badge.svg)](https://github.com/eserlxl/external-agents/actions/workflows/ci.yml)\n[![Claude Code plugin](https://img.shields.io/badge/Claude%20Code-plugin-8A2BE2.svg)](https://docs.claude.com/en/docs/claude-code/overview)\n[![Run external agents as 
sub-agents](https://img.shields.io/badge/dispatch-agy%20%C2%B7%20codex%20%C2%B7%20claude%20%C2%B7%20cursor%20%C2%B7%20claude--api%20%C2%B7%20openai%20%C2%B7%20gemini%20%C2%B7%20openrouter-0A7BBB.svg)](#what-it-drives)\n[![version](https://img.shields.io/badge/version-0.10.1-informational.svg)](.claude-plugin/plugin.json)\n[![License: GPL-3.0-or-later](https://img.shields.io/badge/license-GPL--3.0--or--later-blue.svg)](LICENSE)\n\n\u003c/div\u003e\n\nA Claude Code plugin that runs **external coding-agent CLIs** — `agy`, `codex`, and\n`cursor` (and optionally `claude`) — as autonomous sub-agents from inside a Claude Code\nsession. Hand a task to one external agent or fan one prompt out to all of them in parallel,\nthen collect every response inline. The `cursor` agent runs Cursor's own models (Composer 2.5).\nIt also drives **read-only cloud API advisors** — `claude-api`, `openai`, `gemini`, and\n`openrouter` — direct provider endpoints used for repository analysis (see\n[Cloud API advisors](#cloud-api-advisors-read-only)).\n\nIt is a *delegation* tool: an arbitrary task, your choice of agent, and **read-write by\ndefault** so the CLI agents can actually do work (use `--read-only` for analysis-only runs).\nThe cloud API advisors are **always read-only** — a stateless completion call has no filesystem\naccess — and ship **disabled by default**.\n\n## What it drives\n\n| agent  | cli           | read-write (default)                                            | read-only (`--read-only`)                       |\n|--------|---------------|-----------------------------------------------------------------|-------------------------------------------------|\n| agy    | `agy`         | `-p P --add-dir DIR --dangerously-skip-permissions [--model M]` | `-p P --sandbox --add-dir DIR [--model M]`      |\n| codex  | `codex exec`  | `-s workspace-write -C DIR --skip-git-repo-check [-m M] [-c model_reasoning_effort=\"E\"] P` | `-s read-only -C DIR --skip-git-repo-check ... 
P` |\n| claude | `claude`      | `-p P --permission-mode acceptEdits [--model M] [--effort E]`   | `-p P --allowedTools Read Grep Glob [--model M] [--effort E]` |\n| cursor | `cursor-agent`| `-p --force --trust --workspace DIR [--model M] -- P`           | `-p --mode plan --trust --workspace DIR [--model M] -- P` |\n\nThe prompt is always passed as a single argv element (or via stdin), never `eval`'d or\nword-split, so it is injection-safe regardless of content.\n\nA few CLI caveats worth knowing:\n\n- **agy read-only is best-effort.** `agy --sandbox` restricts the terminal but does **not**\n  hard-block agy's file-edit tools, so an agy read-only run *can* still mutate the tree.\n  For an enforced read-only guarantee use `codex`/`claude`/`cursor` (their read-only modes are\n  hard), or point `--target` at a throwaway copy. The script warns whenever agy runs read-only.\n- **claude write + shell.** `--permission-mode acceptEdits` (the default) auto-accepts file\n  edits but **denies** other Bash non-interactively — a \"do X and run the tests\" task will\n  edit and then silently skip the tests at rc=0. For tasks that must build/test/run commands,\n  pass `--claude-perm bypassPermissions`.\n- **cursor needs auth, and its binary is `cursor-agent`.** The `cursor` agent calls\n  `cursor-agent` (the headless Cursor CLI), **not** `cursor` (the IDE). It must be signed in\n  first — `cursor-agent login` (or set `CURSOR_API_KEY`) — and shares auth/config with the\n  desktop app. Its read-only mode (`--mode plan`) is **enforced** (analyze/plan, no edits),\n  so it is a hard guarantee like codex/claude; write mode uses `--force` to auto-approve\n  edits and shell.\n\n## Cloud API advisors (read-only)\n\nBeyond the agentic CLIs, the plugin drives four **cloud API advisors** — direct provider\ncompletion endpoints, used for repository analysis (e.g. 
council/conclave-style panels):\n\n| agent | provider / endpoint | API key env var (names a [`pass`](#api-keys-via-pass) entry) |\n|-------|---------------------|--------------------------------------------------------------|\n| `claude-api` | Anthropic Messages API (`/v1/messages`) | `ANTHROPIC_API_KEY` |\n| `openai`     | OpenAI Chat Completions (`/v1/chat/completions`) | `OPENAI_API_KEY` |\n| `gemini`     | Google Generative Language (`:generateContent`) | `GEMINI_API_KEY` (else `GOOGLE_API_KEY`) |\n| `openrouter` | OpenRouter (OpenAI-compatible gateway → any model) | `OPENROUTER_API_KEY` |\n\nThese differ from the CLI agents in three ways:\n\n- **Always read-only, hard.** A stateless completion call has **no filesystem access**, so an API\n  advisor can never read or write the target tree. It receives only the prompt you give it — assemble\n  any repo context *into* the prompt (the advisor cannot open files itself). An API-only run defaults\n  to `--read-only`; `--write` is accepted but has no effect on an advisor.\n- **Disabled by default.** They ship `\"enabled\": false` in [`agents.json`](#configuration--agentsjson)\n  (like the `claude` CLI agent), so `--agent all` does not include them until you opt in. 
Run a named\n  one directly (`--agent gemini`), or set `\"enabled\": true` to add it to the fan-out.\n- **Stdlib client, no new deps.** They are driven by a single bundled\n  [`scripts/api-client.py`](scripts/api-client.py) (Python standard library only — `urllib`+`json`),\n  invoked as `python3 scripts/api-client.py --provider \u003cp\u003e --model M [--effort E] --prompt P`.\n\nThe per-tier `model` strings in `agents.json` are **passed through verbatim** to the provider API —\nedit them to your account's exact model ids (run the provider's model-list to see what you can use).\n`openrouter` is the escape hatch for \"any model\": its tier models are `vendor/model` slugs\n(`anthropic/claude-opus-4-8`, `openai/gpt-5-mini`, `google/gemini-3.1-pro`, …), so you can reach a model\nthat has no dedicated agent without touching code.\n\n```bash\n# one advisor, read-only (the default for an API-only run)\nbash scripts/run-agent.sh --agent claude-api --effort high --prompt 'Review src/client.py for races; cite line numbers.'\n\n# fan a question out to several advisors + CLIs (enable them in agents.json first), read-only\nbash scripts/run-agent.sh --agent all --read-only --consensus --prompt \"$(cat analysis-prompt.txt)\"\n\n# preview the exact API call without running it (the key is never shown)\nbash scripts/run-agent.sh --agent gemini --dry-run --prompt '...'\n```\n\n### API keys via `pass`\n\nAPI keys resolve through the Unix password manager [`pass`](https://www.passwordstore.org/), so the\nsecret stays GPG-encrypted at rest — it is **never** stored in `agents.json`, an env var value, the\nargv, or a transcript. 
The provider's env var holds a **`pass` entry name**, not the raw key:\n\n```bash\n# the value is a pass ENTRY NAME, not the secret:\nexport ANTHROPIC_API_KEY=api/anthropic     # -\u003e resolved at run time via `pass show api/anthropic`\nexport OPENAI_API_KEY=api/openai\nexport GEMINI_API_KEY=api/gemini            # GOOGLE_API_KEY is also accepted\nexport OPENROUTER_API_KEY=api/openrouter\n```\n\nResolution per provider, in order:\n\n1. If `pass` is on `PATH` (and `EXTERNAL_AGENTS_NO_PASS` is unset), the env var's value is treated as a\n   `pass` entry name and the key is read via `pass show \u003centry\u003e` (first line). The secret therefore\n   never appears in argv — only the entry name, which is not sensitive.\n2. If `pass` is **not** installed (e.g. CI), or `EXTERNAL_AGENTS_NO_PASS=1` is set, the env var's value\n   is used as the **literal key**.\n3. Missing/unresolvable → the run fails fast with an `auth`-classified error **before any network call**.\n\n`pass`/`gpg-agent` must be **unlocked non-interactively** when an advisor runs (the same one-time\nreadiness step the CLIs need). The offline `--check` preflight reports only whether a key source is\n*configured* (env var set, `pass` present) — it **never** runs `pass show`, so it never decrypts a\nsecret or triggers a gpg prompt. The trust-boundary details are in\n[docs/threat-model.md → API key custody](docs/threat-model.md#api-key-custody-api-advisors).\n\n\u003e **External providers.** Like `agy`/`codex`/`cursor`, the API advisors send your prompt to an external\n\u003e service — never put a secret or private IP into a prompt you fan out to them.\n\n## Install\n\nLoad this directory into Claude Code as a local plugin (e.g. `/plugin` → add a local\ndirectory pointing here), or symlink it into your plugins directory. 
Once loaded you get:\n\n- the **`external-agents`** skill — triggers on natural language (\"ask codex to …\",\n  \"have the external agents review …\", \"delegate this to agy\");\n- the **`/external-agents`** slash command — explicit dispatch.\n\nRequires `agy`, `codex`, the `cursor-agent` CLI (for the `cursor` agent; run\n`cursor-agent login` once), and — for the `claude` agent — `claude` on `PATH`, plus `jq`\n**or** `python3` to read `agents.json`. Optionally `antigravity-usage`\n(`npm i -g antigravity-usage`) enables agy's [quota-aware fallback](#agy-quota-aware-fallback-antigravity).\nVerify the install with the **`--check` preflight** — it reports the JSON reader (`jq`/`python3`) and\nwhether each agent CLI is on `PATH`:\n\n```bash\nbash scripts/run-agent.sh --check\n```\n\n`--check` prints one line per check and a final tally:\n\n- `ok   \u003cname\u003e \u003cpath\u003e` — the JSON reader (`jq`/`python3`) or an agent CLI is present on `PATH`.\n- `MISS \u003cname\u003e …` — a required piece is missing (an agent CLI not installed, or the `cursor-agent`\n  binary the `cursor` agent needs).\n- `info agy-qta …` — optional only: whether `antigravity-usage` (agy's quota-fallback helper) is on\n  `PATH`; this is never counted as missing.\n\nIt ends with `external-agents: \u003cN\u003e missing` and **exits non-zero when `N \u003e 0`**. A clean install of the\nagents you intend to use therefore reports **`0 missing` and exits `0`** — that is the pass criterion.\nScope the check to a single agent with `--agent \u003cname\u003e` if you don't use all four:\n\n```bash\nbash scripts/run-agent.sh --agent codex --check \u0026\u0026 echo \"codex ready\"\n```\n\n**What `--check` does and does not prove.** `--check` is a **presence** preflight: it confirms the\nconfig reader (`jq`/`python3`) is available, `agents.json` is readable, and each agent CLI is on\n`PATH`. 
It does **not** invoke any CLI, verify authentication, or exercise read-only enforcement — a\nCLI can be present yet unauthenticated or misconfigured, so a passing `--check` is necessary but not\nsufficient for a working run. Proving an agent actually round-trips (and that the enforced read-only\nguarantee holds) is the job of the opt-in [live smoke harness](#live-smoke-opt-in), never `--check`.\n\nThere is intentionally **no built-in auth/health probe** on `--check`: a meaningful check (is this\nCLI signed in and able to respond?) needs an authenticated, networked call that is neither free nor\nside-effect-free, and wiring it into `--check` would couple the offline gate to live CLIs — so it is\ndeliberately deferred. **Authentication is your responsibility:** sign each agent in once (e.g.\n`cursor-agent login`, plus the sign-in `codex` / `agy` / `claude` each need), then confirm a real\nround-trip with the opt-in [live smoke harness](#live-smoke-opt-in). The exact per-agent sign-in step\nand what \"ready\" means for each agent are consolidated in the\n[per-agent auth prerequisites](#per-agent-auth-prerequisites-readiness) reference below.\n\n**Presence vs. readiness, end to end.** Three checks, three guarantees, in order of strength: the\n[release tag-gate](RELEASING.md) check confirms a published tag equals the lockstep version;\n`--check` (offline, exits non-zero on a missing CLI) proves **presence**; and the opt-in live smoke\nplus the [per-agent auth prerequisites](#per-agent-auth-prerequisites-readiness) prove **readiness**\n(an authenticated agent that actually round-trips). 
The offline gate performs the presence check and\nthe tag check, but **never** the auth step — readiness is always an explicit, opt-in act.\n\n**Verifying a packaged release.** Beyond `--check`, an opt-in install/upgrade smoke\n([`tests/install-smoke.sh`](tests/install-smoke.sh)) proves the package installs and upgrades **from a\npublished tag** (not a branch): armed with `EXTERNAL_AGENTS_LIVE=1` it clones the repo at a tag into a\nthrowaway tree and confirms the loaded driver answers `--check` (presence) and `--version` (the\nlockstep version), with the upgrade variant asserting `--version` advances across tags. Without the\narming switch it is a clean no-op, exactly like the live smoke harness — the procedure is in\n[RELEASING.md](RELEASING.md).\n\n### Per-agent auth prerequisites (readiness)\n\n`--check` proves an agent's CLI is **present**; this reference is the **readiness** contract — the\nexact one-time auth step each agent needs and what \"ready\" means. It is keyed to the driver's agent\nnames and the binaries `agent_bin` resolves in `scripts/run-agent.sh`. Do the auth step once, then\nconfirm a real round-trip with the opt-in [live smoke](#live-smoke-opt-in). These are **auth steps\nonly** — no key or token value belongs in any committed file.\n\n| Agent | Binary on `PATH` | One-time auth step | \"Ready\" means |\n|-------|------------------|--------------------|---------------|\n| `codex` | `codex` | Sign in to the `codex` CLI (its own login). | `codex` present **and** an armed live smoke round-trips. |\n| `agy` | `agy` | Sign in to Antigravity / `agy` (its own login). | `agy` present **and** armed live smoke round-trips; for quota-aware fallback also install `antigravity-usage` (below). |\n| `cursor` | `cursor-agent` | Run `cursor-agent login`, **or** set the `CURSOR_API_KEY` environment variable. | `cursor-agent` present **and** armed live smoke round-trips. 
|\n| `claude` | `claude` | Authenticate the `claude` CLI (Claude Code sign-in / API key in your environment). | `claude` present **and** armed live smoke round-trips. |\n| `antigravity-usage` *(optional)* | `antigravity-usage` | `npm i -g antigravity-usage`, then `antigravity-usage login` (or keep the Antigravity IDE open). | Present on `PATH` so agy's [quota-aware fallback](#agy-quota-aware-fallback-antigravity) is active. **Never required** — its absence just means high/xhigh always use the Gemini fallback. |\n\nReadiness is always an explicit, opt-in act: the offline gate verifies **presence** only (via\n`--check`) and never performs any of the auth steps above.\n\n### Confirm an agent is usable: presence → ready\n\nTwo tiers, run in order, to know an agent will actually work:\n\n1. **Presence (offline).** `bash scripts/run-agent.sh --agent \u003cname\u003e --check` — exits **non-zero** if\n   the CLI (or the `cursor-agent` binary the `cursor` agent needs) is missing, `0` when present. This\n   never authenticates or makes a network call.\n2. **Readiness (opt-in).** After the one-time auth step above, arm the live smoke and read the\n   per-agent verdict:\n\n   ```bash\n   EXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh --agent \u003cname\u003e\n   cat \"${EXTERNAL_AGENTS_OUT:-$HOME/.external-agents/logs}\"/live-smoke/status.txt\n   ```\n\n   A `\u003cname\u003e  live-verified` line means the agent round-tripped (ready); `failed` or\n   `skipped-not-reachable` mean it is not. 
The [Live smoke](#live-smoke-opt-in) section defines the\n   full status vocabulary.\n\nThe required offline gate runs **only** step 1's presence check and **never** performs step 2's auth\nstep — readiness is always an explicit, opt-in act.\n\n## Usage\n\nNatural language (skill):\n\n\u003e ask codex to add input validation to the login handler\n\u003e have the external agents review the changes on this branch — read only\n\u003e delegate the failing test fix to agy\n\nSlash command:\n\n```\n/external-agents codex implement the retry logic in src/client.py\n/external-agents all read-only audit this repo for security issues\n/external-agents fix the flaky test in tests/test_cache.py\n```\n\nDirect script (what the skill/command call under the hood):\n\n```bash\n# single agent, read-write (default)\nbash scripts/run-agent.sh --agent codex --target \"$PWD\" --prompt-file - \u003c\u003c'PROMPT'\nImplement X. Run the tests when done.\nPROMPT\n\n# fan out to every enabled agent, read-only\nbash scripts/run-agent.sh --agent all --read-only --prompt \"Review this branch; cite file:line.\"\n\n# pick an effort tier — each agent maps it to its own model + native effort\nbash scripts/run-agent.sh --agent all --effort high --prompt \"Refactor the parser.\"\n\n# preview the exact command without running it\nbash scripts/run-agent.sh --agent agy --dry-run --prompt \"...\"\n```\n\n### Options\n\n| flag | meaning |\n|------|---------|\n| `--agent A` | `agy` \\| `codex` \\| `claude` \\| `cursor` \\| `all` (all enabled in `agents.json`) |\n| `--prompt P` / `--prompt-file F` / `--prompt-file -` | the task (literal, file, or stdin) |\n| `--target DIR` | where the agents work (default: cwd) |\n| `--write` / `--read-only` | read-write (default) vs analysis-only (mutually exclusive) |\n| `--yes` / `-y` | confirm a write run whose `--target` is not the current directory |\n| `--effort TIER` | effort level / model tier: `low` \\| `medium` \\| `high` \\| `xhigh`. 
Maps, per agent, to the model + native effort in `agents.json` (optional; blank = config `default_tier`) |\n| `--model M` | model override — wins over the tier's model; native effort still comes from the tier |\n| `--claude-perm MODE` | claude write-mode permission mode (default `acceptEdits`; use `bypassPermissions` for shell) |\n| `--timeout S` | per-agent timeout, seconds (default 1800) |\n| `--out DIR` | transcript dir (default `~/.external-agents/logs/\u003cproject\u003e`; the base dir is overridable with the `EXTERNAL_AGENTS_OUT` env var) |\n| `--conf FILE` | agent config JSON (default `agents.json`) |\n| `--list` / `--check` / `--discover` / `--dry-run` / `--version` | inspect config / preflight reader + CLIs / machine-readable agent reachability / preview argv / print the plugin version |\n| `--json` | also emit a machine-readable JSON run summary (opt-in; default output unchanged) |\n\n## Configuration — `agents.json`\n\nThe caller picks **one effort level** — `low`, `medium`, `high`, or `xhigh` — and\n`agents.json` maps that tier to the right **model + native effort for each agent**. 
So a\nsingle `--effort high` resolves, per agent, to:\n\n| `--effort` | agy (tier baked into model) | codex | claude | cursor (tier baked into model) |\n|------------|-----------------------------|-------|--------|--------------------------------|\n| `low`      | `Gemini 3.5 Flash (Low)`        | `gpt-5.5` effort `low`    | `claude-haiku-4-5`  | `gpt-5-mini` |\n| `medium`   | `Gemini 3.5 Flash (Medium)`     | `gpt-5.5` effort `medium` | `claude-sonnet-4-6` | `composer-2.5` |\n| `high`     | `Claude Sonnet 4.6 (Thinking)`† | `gpt-5.5` effort `high`   | `claude-opus-4-8` effort `high`  | `gpt-5.5-high` |\n| `xhigh`    | `Claude Opus 4.6 (Thinking)`†   | `gpt-5.5` effort `xhigh`  | `claude-opus-4-8` effort `xhigh` | `claude-opus-4-8-thinking-high` |\n\n† agy `high`/`xhigh` are **quota-aware**: the limited 3rd-party primary runs only when `antigravity-usage`\nconfirms remaining quota; otherwise they fall back to a larger-limit Gemini — `Gemini 3.5 Flash (High)`\nfor `high`, `Gemini 3.1 Pro (High)` for `xhigh` (see below).\n\n`agy` and `cursor` bake the tier into the model name and ignore a separate effort;\n`codex`/`claude` take model and effort separately. The cursor tiers form a cheap→premium\nladder — `gpt-5-mini` (low), `composer-2.5` (medium, Cursor's agentic workhorse),\n`gpt-5.5-high` (high), `claude-opus-4-8-thinking-high` (xhigh). All are **non-fast**,\nZDR-respecting ids on purpose: Cursor's CLI default is a `-fast` variant (e.g.\n`composer-2.5-fast`), which spends fast/priority requests — keep the plain ids, or switch a\ntier to a `-fast` model if you do want priority routing. (Avoid the `claude-fable-5-*` ids:\nthey are marked **NO ZDR**.) Run `cursor-agent models` (once signed in) to see every id your\naccount exposes. 
With no `--effort`, the config's `default_tier` is used (ships as `medium`).\nA per-run `--model M` overrides only the resolved model — the native effort still comes from the tier.\n\n`enabled: true` agents run under `--agent all`; a named `--agent agy|codex|claude|cursor` runs\neven if disabled. Ships with `agy` + `codex` + `cursor` enabled and `claude` disabled:\n\n```json\n{\n  \"default_tier\": \"medium\",\n  \"agents\": {\n    \"agy\": {\n      \"enabled\": true,\n      \"tiers\": {\n        \"low\":   { \"model\": \"Gemini 3.5 Flash (Low)\" },\n        \"xhigh\": { \"model\": \"Claude Opus 4.6 (Thinking)\", \"fallback\": \"Gemini 3.1 Pro (High)\" }\n      }\n    },\n    \"codex\":  { \"enabled\": true,  \"tiers\": { \"high\":   { \"model\": \"gpt-5.5\", \"effort\": \"high\" } } },\n    \"claude\": { \"enabled\": false, \"tiers\": { \"low\":    { \"model\": \"claude-haiku-4-5\" } } },\n    \"cursor\": { \"enabled\": true,  \"tiers\": { \"medium\": { \"model\": \"composer-2.5\" } } }\n  }\n}\n```\n\nRun `bash scripts/run-agent.sh --list` to print the full resolved table, enabled status,\nand `default_tier`. Reading the JSON needs `jq` (preferred) or `python3` on `PATH`.\n\n### Config schema and validation\n\n`schema/agents.schema.json` is a JSON Schema (draft-07) for `agents.json`. 
Each schema key maps to\nthe `cfg` query op the driver reads it with — the same op surface in both the `jq` and `python3`\nbackends — so the schema, the config, and both readers stay aligned:\n\n| schema key | `cfg` op | meaning |\n|------------|----------|---------|\n| `default_tier` | `default_tier` | tier used when `--effort` is omitted |\n| `agents` | `agents` | the set of configured agents |\n| `agents.\u003ca\u003e.enabled` | `enabled` | whether `\u003ca\u003e` runs under `--agent all` |\n| `agents.\u003ca\u003e.tiers` | `tiers` | the tier map for `\u003ca\u003e` |\n| `…tiers.\u003ct\u003e.model` | `model` | model id for tier `\u003ct\u003e` |\n| `…tiers.\u003ct\u003e.effort` | `effort` | native effort (codex/claude) |\n| `…tiers.\u003ct\u003e.fallback` | `fallback` | agy-only quota-fallback model |\n\nOnly the `agy` agent's tiers may carry `fallback` (the schema enforces this). Validate the shipped\nconfig against the schema with `python3` and the third-party `jsonschema` package:\n\n```bash\npython3 -c \"import json,jsonschema; jsonschema.validate(json.load(open('agents.json')), json.load(open('schema/agents.schema.json')))\"\n```\n\nThe contract: `default_tier` and `agents` are required; each agent requires `enabled` and `tiers`;\neach tier requires `model`, with `effort` and `fallback` optional (and `fallback` accepted **only**\non `agy`). `tests/run.sh` validates the shipped config against the schema and asserts that bad\nfixtures (missing `model`, wrong-typed `default_tier`, non-object `tiers`, non-agy `fallback`) are\nrejected; CI runs the same validation on every push/PR, so a contract violation fails the build.\n\n### agy quota-aware fallback (Antigravity)\n\n`agy` is Google's Antigravity, which gives **large Gemini limits** but **small, precious\n3rd-party limits** (Claude Opus/Sonnet, GPT-OSS). 
To spend the scarce ones deliberately, an\n`agy` tier may carry a `fallback`:\n\n```json\n\"high\":  { \"model\": \"Claude Sonnet 4.6 (Thinking)\", \"fallback\": \"Gemini 3.5 Flash (High)\" },\n\"xhigh\": { \"model\": \"Claude Opus 4.6 (Thinking)\",   \"fallback\": \"Gemini 3.1 Pro (High)\" }\n```\n\nBefore launching such a tier, `run-agent.sh` consults the free **`antigravity-usage --json`**\nCLI (`npm i -g antigravity-usage`) for the primary model's remaining quota and uses the\nprimary **only when quota is positively confirmed available**. If the primary is exhausted\n**or the quota is unconfirmable**, it uses the Gemini `fallback` instead — so scarce 3rd-party\n/ Opus quota is **never spent without a check**. This applies to **agy only**; a per-run\n`--model M` is explicit intent and skips the check.\n\n\u003e **You must open the Antigravity IDE** (or run `antigravity-usage login`) for the quota\n\u003e check to see data. While it can't (IDE closed / `antigravity-usage` not installed), `agy`'s\n\u003e 3rd-party tiers transparently fall back to Gemini. Tune with\n\u003e `EXTERNAL_AGENTS_AGY_MIN_REMAINING` (% remaining below which to fall back; default `5`),\n\u003e `EXTERNAL_AGENTS_AGY_QUOTA_TIMEOUT` (seconds to wait for `antigravity-usage`; default `20`),\n\u003e or `EXTERNAL_AGENTS_AGY_QUOTA_CMD` (override the quota command).\n\nThis quota-fallback is **verified against the real CLI** by the opt-in\n[live harness](#live-smoke-opt-in): it feeds the real `antigravity-usage` output through the\ndriver's resolver and asserts the decision matches the quota state (primary **iff** available, else\nthe Gemini fallback). The probe is **read-only** — it never calls the quota-spending `wakeup` — and\nwhen the quota is unconfirmable (IDE closed / not logged in) it degrades to the fallback, **never**\nspending the unconfirmed primary. 
It also records the real output's keys/types (no values) as a\nschema-drift detector.\n\n**Decision contract.** The quota check is a strict, read-only protocol:\n\n1. **Read-only.** The driver runs only `antigravity-usage --json` (read-only); it **never** calls the\n   quota-spending `wakeup`.\n2. **Primary iff confirmed-available.** The limited 3rd-party primary runs **only** when the quota CLI\n   positively reports it available.\n3. **Fallback otherwise.** On `exhausted`, or an `unknown`/unconfirmable result (IDE closed, CLI\n   absent, timeout, or unparseable output), the larger-limit Gemini `fallback` runs instead — the\n   unconfirmed primary is **never** spent.\n4. **Keys read.** The decision reads only these `antigravity-usage --json` fields: the top-level\n   `models` array, and each model's `label`, `remainingPercentage`, and `isExhausted`. No other field,\n   value, or account identifier is read into the run or written to any recorded evidence.\n\nThis contract is pinned offline (`tests/run.sh`: the fallback model-pick, graceful-degradation,\nnever-spend-`wakeup`, and sanitised quota-schema oracles) and verified against the real CLI by the\nlive harness above.\n\n## Fan-out observability\n\n\u003e Beyond parallel fan-out, two further orchestration patterns are specified in\n\u003e [`docs/orchestration.md`](docs/orchestration.md): a **sequential pipeline** (each stage's redacted\n\u003e output seeds the next) and a deterministic **consensus** verdict over a fan-out.\n\nA `--agent all` fan-out records, per agent, a **result record** built entirely from **control-plane\nfacts** — values the driver already resolved at launch/collect time, **never parsed from the\ntranscript**:\n\n| field | source |\n|-------|--------|\n| `agent` | the agent name |\n| `model` | the **resolved** model (post-fallback for agy) |\n| `tier` | the effort tier (`--effort`, else `default_tier`) |\n| `effort` | the tier's native effort (codex/claude) |\n| `mode` | `read-only` / 
`write` |\n| `rc` | the agent process exit code |\n| `sec` | wall-clock seconds |\n| `bytes` | redacted transcript size |\n| `fallback` | `1` iff the agy quota fallback swapped the primary model |\n\n**Contract.** Every field is a **resolved fact**, never inferred from the transcript text: `model` and\n`fallback` come from what the driver actually resolved at launch (`run_one`'s `.model`/`.fallback`\nsidecars), `tier`/`effort`/`mode` from the config and run mode, and `rc`/`sec`/`bytes` from the run\nfiles. So `fallback=1` means the agy quota fallback *actually swapped* the primary (not merely that\nquota was checked), and the record never depends on parsing an agent's free-text output. The record\nis the single source the cross-agent summary and the opt-in JSON output (below) both render from.\n\n### Cross-agent summary\n\nA `--agent all` fan-out prints a compact **summary block** after the transcripts — one row per agent\nfrom the records above:\n\n```\n===== fan-out summary =====\n  agent   rc  model                      tier      sec    bytes fallback\n  agy     0   Gemini 3.5 Flash (High)    high        0        5 1\n  codex   0   gpt-5.5                    high        1        5 0\n  cursor  0   gpt-5.5-high               high        0        5 0\n```\n\nHow to read it: the summary is a **digest, not a replacement** — the full, verbatim (redacted)\ntranscripts are still printed above it, and the `ok/failed` tally still goes to stderr. Use the\nsummary to compare agents at a glance — who succeeded (`rc=0`), which model each resolved to (note\nagy's `fallback`), and how they differ in time (`sec`) and output size (`bytes`) — then scroll up to\nthe matching `===== \u003cagent\u003e … =====` transcript for the actual content. On a **write** fan-out the\nsummary adds a note that the post-write `git changes after write` block is **target-wide** (all\nagents share one tree), so a change can't be attributed to a single agent. 
Single-agent runs print no\nsummary.\n\nThe summary closes with a deterministic **agreement** line derived from the success tally —\n`all-ok`, `mixed`, or `all-fail` — so a caller gets an at-a-glance outcome signal for the fan-out.\nFor a **read-only** fan-out on a git target it also adds a **no-mutation** line stating whether all\nagents left the shared tree unchanged (every read-only mode should — agy is best-effort).\n\nThese agreement signals are **deterministic and outcome-based** — they come from the per-agent\nexit-code tally and the target tree's git state, never from interpreting the agents' answers. The\ndeterministic **consensus** verdict (a majority/quorum over the per-agent success tally) extends this\nsame outcome-based signal — see [`docs/orchestration.md`](docs/orchestration.md#consensus).\n**Semantic content agreement** (do the agents actually agree on *what they said*?) is **not provided**:\nthat needs a human-confirmed cross-agent summary schema and live-run evidence, so it is deliberately\ndeferred. 
Read the verbatim transcripts to compare the agents' actual content.\n\n### JSON run summary (`--json`)\n\n`--json` also emits one machine-readable JSON document, in addition to the unchanged human output.\nIt carries run-level fields, one object per agent, and the agreement signal, all of them\ncontrol-plane facts (no transcript content). The `agents` array is shown with a single entry for brevity:\n\n```json\n{\n  \"mode\": \"readonly\", \"tier\": \"high\", \"ok\": 3, \"fail\": 0, \"count\": 3, \"agreement\": \"all-ok\",\n  \"agents\": [\n    { \"agent\": \"agy\", \"model\": \"Gemini 3.5 Flash (High)\", \"tier\": \"high\", \"effort\": \"(none)\",\n      \"mode\": \"readonly\", \"rc\": 0, \"sec\": 0, \"bytes\": 23, \"fallback\": true }\n  ]\n}\n```\n\nThe offline suite validates that the emitted document is well-formed and carries these required keys.\n\nThe document is written to **stdout** alongside the human output — `--json` itself is a streaming\nsummary you can pipe (`… --json | jq …`), not a stored artifact. For **durable, stable on-disk\npersistence**, every run also writes a [per-run metadata record](#per-run-metadata-record) per agent\n(below); a cross-run index over those records is described in [Run index](#run-index).\n\n## Run metadata and history\n\n### Per-run metadata record\n\nBeyond the in-memory [fan-out record](#fan-out-observability), **every run — single agent *or*\n`--agent all` fan-out — writes one structured JSON metadata record per agent** next to that agent's\ntranscript, capturing the run's **control-plane truth** so you never have to re-parse a transcript to\nknow how a run resolved. 
Each field is a **post-fallback resolved** value (what was *actually* used,\nnot what was requested):\n\n| field | meaning (post-fallback resolved truth) | source |\n|-------|----------------------------------------|--------|\n| `agent` | the agent name | the run request |\n| `model` | the **resolved** model actually used (post-fallback for agy) | `run_one`'s `.model` sidecar |\n| `tier` | the effort tier (`--effort`, else `default_tier`) | the run request / config |\n| `effort` | the tier's native effort (codex/claude; `(none)` if unset) | `agents.json` |\n| `mode` | `readonly` / `write` | the run mode |\n| `target` | the resolved directory the agent worked in | resolved `--target` |\n| `rc` | the agent process exit code | the run |\n| `sec` | wall-clock seconds | the run |\n| `bytes` | redacted transcript size | the run |\n| `fallback` | `true` iff the agy quota fallback swapped the primary model | `run_one`'s `.fallback` sidecar |\n| `timestamp` | run launch time, UTC ISO-8601 (e.g. `2026-06-20T12:00:00Z`) | the run |\n\n`model`, `fallback`, `tier`, `effort`, and `mode` are exactly the resolved facts the fan-out record\ncarries; the per-run record additionally pins the **`target`** the agent worked in and the\n**`timestamp`** it launched, and — unlike the fan-out summary, which is fan-out only — is written\ndurably to disk for *every* run. The post-fallback rule is decisive: for an agy quota fallback,\n`model` is the **Gemini fallback actually used** and `fallback` is `true` (never the unconfirmed\nprimary). Like the fan-out record, it is built **only** from values resolved at launch/collect time —\n**never parsed from the transcript** — so it carries no agent free-text and no prompt.\n\nThe record's field set and JSON types are published as a draft-07 contract in\n[`schema/run-record.schema.json`](schema/run-record.schema.json), which validates both this\n`meta.json` record and the [run index](#run-index) row. 
The field-by-field reference and the\n**additive-only stability policy** (which changes require a major version bump) are in\n[`docs/run-record-contract.md`](docs/run-record-contract.md).\n\n### Error classification\n\nEvery run is classified into **one** of a closed set of error classes, so a caller can tell a\nrecoverable failure from a permanent one (recorded as the per-run record's `error_class` field):\n\n| class | meaning | retryable? |\n|-------|---------|------------|\n| `ok` | the run succeeded (`rc` 0) | n/a (success) |\n| `safety-refusal` | a driver pre-launch gate refused — containment, the non-cwd `--yes` confirmation, `--read-only`/`--write` exclusion, or an invalid `--timeout` | **never** (a deliberate guard, not a transient state) |\n| `timeout` | the run exceeded `--timeout` | retryable (opt-in) |\n| `transient` | a recoverable external failure (network blip, provider 5xx / rate-limit shaped) | retryable |\n| `auth` | the agent is unauthenticated or its credentials were rejected | no (re-authenticate first) |\n| `contract` | the agent broke the expected contract (malformed/empty output) | no |\n| `unknown` | an unclassified non-zero exit | no (conservative default) |\n\nThe **retryable subset** is `transient` (always) and `timeout` (opt-in). A `safety-refusal` is\n**never** retried — retrying would only repeat the same deliberate refusal. This taxonomy is the\ncanonical contract; the driver comment in `scripts/run-agent.sh` (above `run_one`) and the\n[threat model](docs/threat-model.md#error-classification-and-retry-safety) restate the same closed set.\n\n### Bounded retry\n\nRetryable failures can be re-attempted within **explicit bounds**, so a delegation never silently\namplifies cost or data egress through uncontrolled re-runs. 
Retry is **opt-in and off by default**:\n\n| env var | default | effect |\n|---------|---------|--------|\n| `EXTERNAL_AGENTS_RETRY_MAX` | `0` | maximum retries of a **retryable** outcome (`transient`; `timeout` only if enabled below). `0` = never retry. |\n| `EXTERNAL_AGENTS_RETRY_BACKOFF` | `1` | seconds to wait before each retry. |\n| `EXTERNAL_AGENTS_RETRY_ON_TIMEOUT` | `0` | set to `1` to also retry a `timeout` outcome. |\n\nOnly `transient` (and, when enabled, `timeout`) outcomes are retried — a `safety-refusal`, `auth`,\n`contract`, or `unknown` failure is **never** retried. Each run records two additive fields: `attempts` (total\nlaunch attempts = `1` + retries) and `retried` (`true` iff `attempts` \u003e 1). With retry disabled\n(the default) every run records `attempts: 1`, `retried: false`.\n\n**Where it lives.** The [per-run metadata record](#per-run-metadata-record) is written to\n`\u003ctranscript-dir\u003e/\u003cagent\u003e.meta.json`, the *same*\nper-project directory as that agent's transcript (default `~/.external-agents/logs/\u003cproject\u003e`,\noverridable via [`--out`](#options) or the `EXTERNAL_AGENTS_OUT` base). There is one file per agent per run,\nand it is overwritten by the next run that targets the same directory. Inspect or aggregate it with any JSON\ntool, e.g.:\n\n```bash\njq . \"${EXTERNAL_AGENTS_OUT:-$HOME/.external-agents/logs}\"/\u003cproject\u003e/codex.meta.json\n```\n\n**Secret-free by construction.** Because the record holds only the control-plane facts the driver\nresolved — never the transcript, never the prompt — no agent free-text, prompt text, or\nsecret-shaped token can reach it (the transcript itself is separately [redacted](#safety)). 
So\n`*.meta.json` is safe to collect, ship, or index for run history.\n\n### Run index\n\nAlongside the per-run records, the driver maintains a single durable **run index** — an append-only\n[JSON Lines](https://jsonlines.org/) file that accrues **one row per agent per run** across *all*\nprojects, so you get a chronological run history without walking the per-project transcript tree.\n\n**Where it lives.** `\u003cbase\u003e/index.jsonl`, where `\u003cbase\u003e` is the transcript root\n(`EXTERNAL_AGENTS_OUT`, default `~/.external-agents/logs`); with an explicit `--out DIR` the index is\nwritten inside that directory. It is **append-only** — never rewritten — so each row is a permanent\nrecord of one agent run and concurrent runs simply append. It grows unbounded with use; the\nend-to-end run-history lifecycle — retention, rotation (no rows lost), backup, and restore, all under\n`EXTERNAL_AGENTS_OUT` and never the repo — is documented in [RUNBOOK.md](RUNBOOK.md), which also\ncarries the Phase 8 resilience readiness statement (what is enforced vs best-effort).\n\n**Row fields.** Each line is one JSON object: a run id and timestamp that group a fan-out, the\nproject namespace, and the same post-fallback resolved fields the\n[per-run record](#per-run-metadata-record) carries:\n\n| field | meaning |\n|-------|---------|\n| `run_id` | groups the agents of one invocation (a `--agent all` fan-out shares one `run_id`) |\n| `timestamp` | run launch time, UTC ISO-8601 |\n| `project` | the `\u003cproject\u003e` transcript namespace (repo / repo/subdir / leaf) |\n| `agent`, `model`, `tier`, `effort`, `mode`, `target`, `rc`, `sec`, `bytes`, `fallback` | the per-run [resolved fields](#per-run-metadata-record) (post-fallback, control-plane only) |\n\nLike the per-run record, every row is **control-plane only** — built from values resolved at\nlaunch/collect, never the transcript — so the index never accrues agent free-text, prompt text, or\nsecrets.\n\n**Inspecting it.** The 
index is plain JSON Lines — **no special subcommand** — so you read it with\nstandard tools (`tail`, `jq`, `grep`, `python3`). The base resolves to\n`${EXTERNAL_AGENTS_OUT:-$HOME/.external-agents/logs}/index.jsonl`:\n\n```bash\nIDX=\"${EXTERNAL_AGENTS_OUT:-$HOME/.external-agents/logs}/index.jsonl\"\n\n# the 10 most recent agent runs, as a table\ntail -n 10 \"$IDX\" | jq -r '[.timestamp, .project, .agent, .rc, .model, \"\\(.sec)s\"] | @tsv'\n\n# only the runs that failed (rc != 0)\njq -c 'select(.rc != 0)' \"$IDX\"\n\n# every run where the agy quota fallback fired\njq -c 'select(.fallback)' \"$IDX\"\n```\n\nWithout `jq`, the same rows read fine in `python3` (`for l in open(IDX): json.loads(l)`) or as raw\nlines with `tail`/`grep`.\n\n### Cost, latency, and quality signals\n\nSome agent CLIs print machine-readable **cost / token signals** in their output (e.g. a token count\nor a dollar cost). When present, the driver lifts them — **best-effort** — into an optional `signals`\nobject on the [per-run record](#per-run-metadata-record) and [index row](#run-index); when a signal\nis **absent or unrecognized**, that field is the explicit string `\"unavailable\"` — **never** a\nguessed or fabricated number, and never a silent gap.\n\n**What is recognized.** Extraction is **capability-aware per agent** and deliberately\n**conservative** — it matches only tightly-anchored shapes, so ordinary transcript prose is never\nmistaken for a metric:\n\n| signal | recognized shape (case-insensitive, anchored) |\n|--------|-----------------------------------------------|\n| `tokens` | a `tokens used: N` / `total tokens: N` / `N tokens` line |\n| `cost` | a `cost: $X` / `total cost: $X` line |\n\n**Status — best-effort, not yet live-confirmed.** These recognizers are derived from the CLIs'\n*expected* output shapes; **no signal has been confirmed against a live run in this repo's offline\nwork** (the offline CI never launches a real CLI — see [Testing](#testing)). 
So in practice a field\nreads `\"unavailable\"` unless a CLI's real output happens to match a recognized shape. Confirming\n*which* agents actually emit *which* signals — and their exact live shapes — is left to the opt-in\n[live/E2E harness](#live-smoke-opt-in); the offline suite proves only that a recognized shape **is**\nextracted and an absent one yields `\"unavailable\"` (via committed fixtures).\n\n**Reading the signals (consumer caveat).** Treat `\"unavailable\"` as *no data* — filter it out before\naggregating (e.g. `jq 'select(.signals.tokens != \"unavailable\")'`), never as a zero. A present value\nis the **CLI's own self-report**, lifted verbatim and best-effort — not independently measured — so\nuse it for rough comparison, not billing. For latency and output size, prefer the\n**driver-measured** [`sec` and `bytes`](#per-run-metadata-record) fields, recorded for *every* run\nregardless of what the CLI prints; `signals.tokens` / `signals.cost` are the optional, CLI-dependent\nextras layered on top.\n\n### Run-history analytics\n\nThe append-only [run index](#run-index) accumulates one row per agent per run; the read-only\nanalytics surface ([`scripts/run-history-report.sh`](scripts/run-history-report.sh)) aggregates it\ninto cross-run trends **without ever mutating it**. It derives this metric set:\n\n| metric | meaning |\n|--------|---------|\n| `runs` | total rows (one per agent per run) |\n| `ok` / `failed` | rows whose outcome is `ok` (`error_class == \"ok\"`, or `rc == 0` for pre-8.2 rows) vs not |\n| `success_rate` | `ok / runs` (0..1) |\n| `error_class` | the per-class distribution, e.g. 
`{\"ok\": 12, \"transient\": 3, \"auth\": 1}` |\n| `fallback` | `{ \"count\": n, \"rate\": r }` — rows where the agy quota fallback fired |\n| `sec` / `bytes` | `{ \"min\", \"max\", \"mean\" }` over the **driver-measured** latency / size |\n| `tokens` | `{ \"counted\", \"unavailable\", \"sum\", \"mean\" }` over numeric `signals.tokens` only |\n| `cost` | `{ \"counted\", \"unavailable\", \"sum\" }` over numeric `signals.cost` only |\n\n**The `unavailable`-exclusion rule is decisive:** `signals.tokens` / `signals.cost` aggregates count\n**only present (numeric) values**; an `unavailable` row is **excluded from the denominator, never\ncounted as zero** (which would understate the true average). Each aggregate reports both the\n`counted` and `unavailable` row counts so the exclusion is auditable.\n\nThe machine-readable JSON output shape is one document:\n\n```json\n{\n  \"runs\": 16,\n  \"ok\": 12,\n  \"failed\": 4,\n  \"success_rate\": 0.75,\n  \"error_class\": { \"ok\": 12, \"transient\": 3, \"auth\": 1 },\n  \"fallback\": { \"count\": 5, \"rate\": 0.3125 },\n  \"sec\": { \"min\": 1, \"max\": 42, \"mean\": 8.5 },\n  \"bytes\": { \"min\": 10, \"max\": 4096, \"mean\": 512.0 },\n  \"tokens\": { \"counted\": 9, \"unavailable\": 7, \"sum\": 81000, \"mean\": 9000.0 },\n  \"cost\": { \"counted\": 9, \"unavailable\": 7, \"sum\": 1.23 }\n}\n```\n\nA human-readable table is also available. The surface is **strictly read-only** over the append-only\nindex, and the field contract it consumes is in\n[`docs/run-record-contract.md`](docs/run-record-contract.md).\n\n**Scoping the report.** By default every row is aggregated. Four optional, AND-composed filters narrow\n*which* rows the metrics are computed over — `--agent A`, `--project P`, `--since TS`, and `--until TS`\n(timestamps are the per-row UTC ISO-8601 `timestamp`, compared lexicographically; an absent or empty\nfilter selects everything). 
They scope the **input**, not the metric set, so the output shape and the\njq/python3 value-equivalence are unchanged — e.g.\n`run-history-report.sh --agent codex --since 2026-06-01T00:00:00Z` reports just codex's trend since\nJune, replacing a hand-written `jq 'select(.agent==\"codex\")'` pre-filter.\n\n**Consumer caveat.** The `tokens` and `cost` aggregates are built from **best-effort,\nCLI-self-reported** signals — they are **not billing-grade** and must not be treated as authoritative\ncost (see [Cost, latency, and quality signals](#cost-latency-and-quality-signals)). For latency and\noutput size, trust the **driver-measured** `sec` and `bytes` aggregates (recorded for *every* run),\nnot an agent's self-reported numbers. A trend summarises what the CLIs reported, not an audited ledger.\n\n## Safety\n\nFor the full trust-boundary analysis and the per-CLI enforcement matrix, see\n[docs/threat-model.md](docs/threat-model.md); to report a vulnerability privately, see\n[SECURITY.md](SECURITY.md).\n\n- **Read-write is the default.** The agents can modify anything under `--target`. Point it\n  at the tree you actually want changed; use `--read-only` when you only want analysis.\n- **Non-cwd writes need `--yes`.** A write run whose `--target` is not the current working\n  directory is refused unless you pass `--yes`, so a wrong/misremembered target can't launch\n  a writing agent silently.\n- **Plugin self-protection (both directions).** The script refuses to write inside its own\n  plugin tree *and* refuses a `--target` that contains the plugin (e.g. the monorepo root,\n  which would also expose sibling repos). 
Paths are resolved physically (`pwd -P`) so\n  symlinks can't slip past the check.\n- **External providers.** `agy`, `codex`, and `cursor` send the target tree to external\n  services, so never point them at private intellectual property or secrets.\n- **Transcript redaction (best-effort).** Before a transcript is persisted or echoed, the driver\n  masks secret-shaped tokens (`sk-`/`pk-`, `gh*_`/`github_pat_`, `xox*-`, `AKIA…`, `Bearer`\n  tokens, `KEY=`/`TOKEN=`/`SECRET=`/`PASSWORD=` assignments, and long high-entropy runs) as\n  `\u003cREDACTED\u003e`. This is **length-bounded and best-effort, not a guarantee** of total secret\n  removal — short or unusual secrets can slip through and long non-secret strings can be\n  over-masked, so still treat transcripts as sensitive. See\n  [docs/threat-model.md](docs/threat-model.md) for the threat-model entry.\n- **Enforcement is uneven across CLIs** — see the caveats above: agy read-only is\n  best-effort; claude write needs `bypassPermissions` for shell; cursor needs prior auth\n  (`cursor-agent login`) and its read-only mode (`--mode plan`) is enforced. The canonical\n  per-CLI enforcement matrix lives in\n  [docs/threat-model.md](docs/threat-model.md#per-cli-read-only-enforcement-matrix); it is\n  asserted against the driver's resolved argv by the offline test suite and **observed\n  end-to-end** by the opt-in live harness (enforced agents leave a sandbox byte-identical;\n  agy is reported best-effort — see [Live smoke](#live-smoke-opt-in)).\n- **Verification is produced, not just advised.** After a write run on a git target the\n  script prints `git status` / `git diff --stat`; if the target isn't a git repo it warns\n  that there is no baseline to diff or revert. Still review the diff before trusting the\n  agent's self-report.\n\n## Testing\n\nOffline, dependency-light tests live in `tests/`. 
They exercise `run-agent.sh` only\nthrough `--dry-run`, `--list`, and its early-exit error paths, so **no external agent\nCLI is launched** — they assert the `agents.json` tier→argv mapping, jq/python3 config\nparity, the effort and write-mode safety gates, and that the version string stays in\nlockstep across `plugin.json`, the skill frontmatter, and the README badge:\n\n```bash\nbash tests/run.sh\n```\n\nEvery `cfg` config op (`default_tier`, `agents`, `enabled`, `tiers`, `model`, `effort`, `fallback`)\nis **parity-gated**: the `jq` and `python3` backends must produce byte-identical output (the suite\nchecks `--list` and per-agent `--dry-run` across both, including malformed-config degradation). When\nyou add or change a config query type, update **both** backends in `scripts/run-agent.sh` and the\nparity block in `tests/run.sh` together.\n\n### Live smoke (opt-in)\n\n`tests/run.sh` never launches a real CLI. To verify the driver actually round-trips against the\n**real** agents, an opt-in harness lives in `tests/live-smoke.sh`. It is gated behind a single\narming switch — the `EXTERNAL_AGENTS_LIVE` environment variable (matching the existing\n`EXTERNAL_AGENTS_*` convention) — so it is a no-op by default and is **never** part of the offline\nCI gate:\n\n```bash\nbash tests/live-smoke.sh                 # unset/0 -\u003e skips every live step, exits 0\nEXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh   # 1 -\u003e arms the harness against reachable CLIs\n```\n\nWhen armed, the harness scopes itself to the agents actually installed (via the driver's\n`run-agent.sh --discover` surface): each **reachable** agent is queued for live checks, and each\n**unreachable** one is skipped with a clear per-agent line. 
Absence of a CLI is never a failure — an\nenvironment with no agents on `PATH` reports every agent skipped and still exits 0.\n\nThe **required CI gate is offline by design** (`shellcheck` + `tests/run.sh`) and never runs the live\nharness — launching real CLIs costs money and ships the tree to third-party providers. Any live\nverification must be a **separate, non-required, manual or scheduled** job, never a step added to the\nrequired check job.\n\n**To opt in:** install (and, for the agents that need it, sign in to) the CLIs you want to verify —\ne.g. `cursor-agent login` — then run `EXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh`. Each line the\nharness prints names exactly why a step ran or was skipped:\n\n- `live smoke skipped (set EXTERNAL_AGENTS_LIVE=1)` — the harness is **not armed** (the default); no\n  live work runs.\n- `\u003cagent\u003e skipped (not reachable on PATH)` — the agent is armed-for but its CLI is not installed, so\n  that agent is skipped (never a failure).\n- `live smoke: reachable agents: …` — the agents that are installed and will be live-verified.\n\n**Secret discipline:** the harness only ever **detects** auth (is the CLI present / does a cheap,\nread-only probe succeed) — it never **captures, prints, or stores** tokens, API keys, or credentials.\nNo secret is read into a variable or written to any recorded evidence.\n\n**Argv equivalence.** For each reachable agent the harness captures the **exact launch argv** the\ndriver builds for a real run (`run-agent.sh` records it to `$OUT/\u003cagent\u003e.argv` with the prompt masked\nto `\u003cPROMPT\u003e`, so the record holds no prompt text or secret) and asserts it is **byte-identical** to\nthe `--dry-run` argv — across both modes (`--read-only`/`--write`) and both prompt sources\n(`--prompt`/`--prompt-file`), covering every resolution path. This turns \"the launch command is\ncorrect\" from an offline claim into a live-verified fact. 
Run it for one agent or all:\n\n```bash\nEXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh --agent codex   # one agent\nEXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh                 # every reachable agent\n```\n\n**Non-mutation.** Each read-only run targets a disposable, git-backed sandbox (never the real\nrepo), and the harness snapshots it before/after. For the **enforced** agents\n(`codex`/`claude`/`cursor`) any change to the tree is a **hard failure** — independent proof the\nread-only guarantee held. For **agy**, whose read-only mode is **best-effort** (`--sandbox` is not a\nhard write barrier), the harness only **reports** whether the tree changed, explicitly labelled\nbest-effort, and **never fails** on a change — agy is never over-claimed as enforced, matching the\n[per-CLI enforcement matrix](docs/threat-model.md#per-cli-read-only-enforcement-matrix).\n\n**Recorded evidence.** Each armed run composes one end-to-end read-only run per reachable agent\n(argv-match + non-mutation + a successful transcript) and writes a deterministic per-agent status\nrecord to `$EXTERNAL_AGENTS_OUT/live-smoke/status.txt` — one `\u003cagent\u003e  \u003cstatus\u003e` line (the full\n**status vocabulary** is defined below) — so which agents are live-verified in the current\nenvironment is auditable. The driver's per-agent\ntranscripts and masked argv records land under the same transcript dir (default\n`~/.external-agents/logs/\u003cproject\u003e`, overridable with `EXTERNAL_AGENTS_OUT`). To reproduce:\n\n```bash\nEXTERNAL_AGENTS_LIVE=1 bash tests/live-smoke.sh        # run, then inspect:\ncat \"${EXTERNAL_AGENTS_OUT:-$HOME/.external-agents/logs}\"/live-smoke/status.txt\n```\n\n**Status vocabulary.** Each `\u003cagent\u003e  \u003cstatus\u003e` line carries exactly one token. 
What each asserts —\nand, crucially, what it does **not**:\n\n- `live-verified` — the agent round-tripped **at record time**: its launch argv matched `--dry-run`,\n  the read-only run left the disposable sandbox unchanged (best-effort for `agy`), and it produced a\n  successful (`rc=0`, non-empty) transcript. It does **not** assert the agent always works — a later\n  auth, quota, or CLI change can break a previously verified agent, so re-run the harness to refresh.\n- `failed` — the agent was reachable and checked, but a check failed (argv mismatch, an *enforced*\n  agent mutated the tree, or a non-zero/empty transcript). A real, actionable failure.\n- `reachable` — the CLI was found on `PATH` but no terminal verdict was assigned (an intermediate\n  state, normally overwritten by `live-verified` / `failed` / `skipped-scoped-out`).\n- `skipped-not-reachable` — armed for, but the CLI is not installed on `PATH`; skipped, never a failure.\n- `skipped-scoped-out` — reachable, but excluded by an `--agent \u003cname\u003e` scope this run; not verified.\n- `skipped-not-opted-in` — the harness was **not armed** (`EXTERNAL_AGENTS_LIVE` unset/`0`); no live\n  work ran for any agent.\n- `unknown` — no status was assigned for a known agent (a default guard); it should not appear in a\n  normal armed run.\n\nThe honest boundary: a green offline `tests/run.sh` proves the plumbing; only a `live-verified` line\nproves a *real* agent round-tripped — and only as of when that line was written, never \"always works\".\n\n**Record stability (decision).** `status.txt` and `provenance.txt` are **best-effort, human-auditable\nevidence — not a version-stable machine contract.** The `\u003cagent\u003e  \u003cstatus\u003e` line shape and the\nprovenance `key: value` lines are intended to stay stable, and the status vocabulary grows only\n**additively** (a consumer should treat an unrecognized token leniently rather than fail), but the\nrecord carries **no formal backward-compatibility 
guarantee** and is *not* version-gated. For machine\nconsumption prefer the driver's structured per-run records (`\u003cagent\u003e.meta.json` / `index.jsonl`); this\nfile exists to audit which agents were live-verified, not as a parsed API.\n\nThat transcript dir is **outside the repository** — raw transcripts (which can carry free-text or\nPII) are **never committed**; only the offline, content-free tests live in the repo.\n\n### End-to-end recipes (opt-in)\n\nBeyond the smoke harness, reproducible per-agent **delegation recipes** live under `tests/e2e/` —\nread-only review, read-write edit, and a non-git write — each driving a real agent against a\ndisposable git fixture and capturing uniform before/after evidence. They share the same\n`EXTERNAL_AGENTS_LIVE` opt-in and skip-when-absent behavior, and are never part of the offline CI\ngate. The read-only recipes assert the enforced agents (`codex`/`claude`/`cursor`) leave the fixture\nunchanged, while **agy** read-only is captured as **best-effort** (observed, never asserted as\nenforced). The shared contract and per-recipe steps are documented in\n[docs/e2e-recipe.md](docs/e2e-recipe.md).\n\n```bash\nEXTERNAL_AGENTS_LIVE=1 bash tests/e2e/run-e2e.sh                 # all recipes, all reachable agents\nEXTERNAL_AGENTS_LIVE=1 bash tests/e2e/review-readonly.sh codex   # one recipe, one agent\n```\n\nThe read-write edit recipe confines all writes to the **throwaway fixture** (a non-cwd temp tree\noutside the plugin tree, passed with `--yes`) and asserts the driver's post-write verification block\nnames the changed file — nothing in the real repo is ever touched.\n\nA green offline `tests/run.sh` proves the recipe **plumbing** (fixtures, masked-argv capture, the\nskip-when-unarmed path); only an armed run against an installed, authenticated CLI proves a given\nagent actually round-trips. 
See [docs/e2e-recipe.md](docs/e2e-recipe.md#local-vs-live-readiness).\n\n## Files\n\n```\nexternal-agents/\n├── .claude-plugin/plugin.json      # plugin manifest\n├── agents.json                     # enabled agents + per-tier model/effort map\n├── commands/external-agents.md     # /external-agents slash command (thin)\n├── skills/external-agents/SKILL.md # the natural-language brain\n├── scripts/run-agent.sh            # the deterministic driver (all the logic)\n├── scripts/bump-version.sh         # lockstep version bumper (plugin.json · SKILL.md · README badge · CHANGELOG)\n├── tests/run.sh                    # offline test suite (run-agent + bump-version)\n├── tests/live-smoke.sh             # opt-in live smoke harness (EXTERNAL_AGENTS_LIVE)\n├── tests/e2e/                      # opt-in end-to-end delegation recipes:\n│   ├── run-e2e.sh                  #   gated entry point (discovers agents, runs all recipes)\n│   ├── review-readonly.sh          #   read-only review (enforced no-mutation; agy best-effort)\n│   ├── edit-readwrite.sh           #   read-write edit on a git fixture (post-write verification)\n│   ├── edit-non-git.sh             #   read-write on a non-git dir (no-baseline warning)\n│   └── lib/                        #   fixture.sh + capture.sh (deterministic fixture + evidence)\n├── docs/e2e-recipe.md              # the shared E2E recipe contract + per-recipe steps\n├── RELEASING.md                    # release runbook (bump → tag → push)\n└── .github/workflows/ci.yml        # CI: shellcheck + tests on push/PR\n```\n\n## Releasing\n\nCutting a release is a documented, repeatable procedure: run the lockstep\n[`scripts/bump-version.sh`](scripts/bump-version.sh), then create and push the matching annotated\ntag (the step the bumper deliberately leaves out). The full flow — clean tree, dry-run preview, real\nbump, diff review, commit, `vX.Y.Z` tag, push — is in [RELEASING.md](RELEASING.md). 
The\n`tag == v\u003cversion\u003e` contract from that runbook is the single source: it is regression-pinned by the\noffline tag-gate oracle in [`tests/run.sh`](tests/run.sh) (run `bash tests/run.sh`), which exercises\nthe match, mismatch, and no-tag cases against a throwaway repo, so the documented check cannot\nsilently drift.\n\n## Contributing\n\nContributions are welcome. [CONTRIBUTING.md](CONTRIBUTING.md) covers local setup, the CI-mirroring\nvalidation loop (`shellcheck` + the `agents.json` schema check + `bash tests/run.sh`), the\ninterface/driver/adapter architecture and the safety invariants, the test expectations (extend the\noffline suite; keep jq/python3 parity), and version/changelog discipline via\n[`scripts/bump-version.sh`](scripts/bump-version.sh) (release flow in [RELEASING.md](RELEASING.md)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feserlxl%2Fexternal-agents","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feserlxl%2Fexternal-agents","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feserlxl%2Fexternal-agents/lists"}