{"id":51095560,"url":"https://github.com/jahala/copeca","last_synced_at":"2026-06-24T06:01:30.740Z","repository":{"id":366572861,"uuid":"1261882263","full_name":"jahala/copeca","owner":"jahala","description":"Cost per correct answer — a neutral, reproducible, verifiable benchmark for CLI-based coding agents","archived":false,"fork":false,"pushed_at":"2026-06-22T12:30:36.000Z","size":3623,"stargazers_count":1,"open_issues_count":4,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-06-22T13:05:38.606Z","etag":null,"topics":["agent","ai-coding","benchmark","claude-code","codex","coding-agents","cost-efficiency","evaluation","llm","mcp"],"latest_commit_sha":null,"homepage":"https://jahala.github.io/copeca/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jahala.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"buy_me_a_coffee":"jahala"}},"created_at":"2026-06-07T09:27:36.000Z","updated_at":"2026-06-22T11:06:04.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jahala/copeca","commit_stats":null,"previous_names":["jahala/copeca"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/jahala/copeca","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jahala%2Fcopeca","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/jahala%2Fcopeca/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jahala%2Fcopeca/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jahala%2Fcopeca/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jahala","download_url":"https://codeload.github.com/jahala/copeca/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jahala%2Fcopeca/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34719307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-24T02:00:07.484Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","ai-coding","benchmark","claude-code","codex","coding-agents","cost-efficiency","evaluation","llm","mcp"],"created_at":"2026-06-24T06:01:29.608Z","updated_at":"2026-06-24T06:01:30.721Z","avatar_url":"https://github.com/jahala.png","language":"Python","funding_links":["https://buymeacoffee.com/jahala"],"categories":[],"sub_categories":[],"readme":"# copeca \u0026middot; cost per correct answer\n\n[![Live site](https://img.shields.io/badge/live_site-1F8A7B?logo=githubpages\u0026logoColor=white)](https://jahala.github.io/copeca/)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Build](https://img.shields.io/github/actions/workflow/status/jahala/copeca/ci.yml?branch=master)](https://github.com/jahala/copeca/actions)\n\n🌱 **[What is copeca? →](https://jahala.github.io/copeca/)** \u0026nbsp;·\u0026nbsp; the visual overview\n\nA neutral, reproducible, verifiable benchmark for CLI-based coding agents.  \nCopeca measures **cost per correct answer** — the expected dollar cost before\ngetting a right answer — to A/B-compare MCP servers, context compressors,\nhooks, and harness improvements against a clean baseline.\n\n```\ncost_per_correct = total_spend / correct_count\n```\n\n**Why this metric.** \"-90% tokens removed!\" is a marketing number if it ignores\nwhether the answer was *right*. A tool that saves 90% of tokens but makes\n20% more mistakes has worse cost-per-correct. Copeca adjusts every savings\nclaim for accuracy, so the number you get is the number that actually matters.\n\n**Why a separate benchmark.** Every tool in the ~45-tool agent-efficiency space\nreports savings on its own methodology against its own baseline — the numbers\nare literally incomparable. 
Copeca holds the agent and model fixed and varies\n*one tool*, answering \"did my tool help, and what did it cost?\" No existing\nbenchmark occupies that lane.\n\n---\n\n## Quick start\n\n```bash\ngit clone https://github.com/jahala/copeca \u0026\u0026 cd copeca\npip install -e .\ncopeca init ./my-benchmark\ncopeca run --task scenarios/my-scenario.yaml --runner claude\ncopeca analyze results/bench.jsonl\n```\n\nA scenario file defines what to measure:\n\n```yaml\nname: my-tool-vs-baseline\ntasks:\n  include: [\"rg_*\", \"fastapi_*\"]\nmodes: [baseline, my-tool]\nmodels: [claude-sonnet-4-6]\nmodel_runner_map:\n  claude-sonnet-4-6: claude\nrepetitions: 5\nbudget_usd: 1.00\n```\n\nThe report leads with the cost-per-correct delta between your tool and the\nbaseline, with 95% bootstrapped confidence intervals, per-task and\n**per-capability** breakdowns (locate / trace / fix / debug — *where* the tool\nhelps, not just an overall number), a **control delta** on tool-neutral tasks (so a\nwin can't be a regression or mere specialization in disguise), and adversarial flags\nthat catch token snowballing and expensive failures.\n\n**Current corpus: 52 tasks** across four real repos (ripgrep, gin, express,\nfastapi — Rust, Go, JavaScript, Python), each tagged by capability so the report\nshows where a tool helps — plus a 6-task non-regression **control set** (tool-neutral\ntasks where a tool should not help, used to catch regression or over-specialization).\nBroader coverage is on the roadmap; small N still means\nwide confidence intervals — see\n[docs/known-limitations.md](docs/known-limitations.md).\n\n---\n\n## What copeca measures\n\n| Dimension | How |\n|---|---|\n| **Cost** | The vendor's billed cost when the runner reports it (the real bill — reflects cache TTL/tier/discounts; frozen into the artifact at run time). copeca also records a reproducible, provider-neutral cross-check: `computed_cost_usd = Σ tokens × runner.pricing[model]`. 
Token counts are read from the agent CLI and not re-tokenized — see known-limitations. |\n| **Correctness** | Case-insensitive substring matching (comprehension tasks) or test-command exit codes (edit tasks). Substring matching is gameable on single tasks; see known-limitations. |\n| **Completeness** | `all_of` field verifies the agent listed *everything* — not just *something* |\n| **Futility** | Adversarial flags: token snowball, talkative failure, tool storm, budget exhaustion, timeout |\n| **Integrity** | Each result is packaged with an integrity manifest — a SHA-256 hash of every file in the artifact. `copeca verify ARTIFACT` recomputes these to detect accidental corruption. The manifest alone is **not tamper-proof**: anyone who rewrites the zip can recompute it. For real tamper-evidence, sign artifacts with `copeca run … --artifacts --sign-key \u003cprivate.pem\u003e` — this writes a detached **Ed25519** signature over the content hash, and `copeca verify ARTIFACT --pubkey \u003cpublic.pem\u003e` rejects any artifact a holder of the private key did not sign (so a tampered-and-recomputed artifact fails). Unsigned artifacts get corruption detection only and are reported as unsigned. External transparency-log anchoring is a further planned option. |\n\n---\n\n## Who copeca is for\n\n**Tool builders** — MCP/server authors, context compressor developers, code-search\ntool maintainers. You ship a tool and need a number that isn't marketing. Copeca\ngives you cost-per-correct with a delta and CI, and a `.copeca` zip anyone can\nverify.\n\n**Platform builders** — CLI agent authors (Codex, OpenCode, Gemini CLI style).\nYou need to validate that your pricing model is accurate before customers depend\non it. Copeca normalizes cost across providers and warns when pricing data is\nstale.\n\n**Skeptical evaluators** — Researchers, reviewers, procurement leads. You've\nbeen burned by contaminated benchmarks and selectively reported results. 
Copeca's\nartifact model lets you verify any individual result; batch completeness verification\n(`copeca verify --batch --scenario \u003cpath\u003e`) confirms all expected runs are present\nand names any specific missing runs.\n\n---\n\n## How copeca works\n\nCopeca launches a CLI coding agent as a subprocess against a real open-source\nrepo pinned at a known commit. The agent answers a question or fixes a bug.\nCopeca parses the agent's output, checks correctness, computes cost from token\ncounts, and writes a JSONL record. A scenario runs the matrix of tasks × modes\n× models × repetitions with parallel git-worktree-isolated workers. A validity\ngate confirms the experimental arm actually used its tool before its result\ncounts — so a win can't be credited to a tool that never ran.\n\n**Modes** express the *one variable* that changes between baseline and\nexperimental. They cover all five integration types real tools use:\n\n| Integration | Mode field | Example |\n|---|---|---|\n| MCP server | `mcp_config` | any MCP server |\n| API proxy (env) | `env` | `ANTHROPIC_BASE_URL` proxy |\n| Config-dir hook | `agent_config` | PreToolUse hook via settings overlay |\n| Process wrapper | `wrapper` | `[\"your-wrapper-tool\", \"wrap\"]` |\n| Pre-run index | `setup` | per-worktree indexing command |\n\nCopeca provisions each arm with its own config directory and an allow-listed\nenvironment. The baseline arm receives only a minimal set of host vars (infra,\nlocale, and provider credentials); all ambient hooks, `CLAUDE_*` vars, and\n`MCP_*` vars are excluded. Experimental modes may declare additional vars via\n`mode.env`, which are merged on top.\n\n---\n\n## Task corpus\n\nTasks are YAML data — no embedded code, no Docker per task. They target real\nopen-source repos pinned at exact commits (per task, so one repo can serve\nseveral code states). 
The corpus is **52 tasks** across ripgrep, gin, express,\nand fastapi — drawn from six public source families plus a set migrated from the\ntilth benchmark (MIT); each carries a `source:` field with provenance and a\n`category` (locate / trace / fix / debug). Tasks are **tool-agnostic** — they name\nthe information required, never the method, so no tool is privileged; `copeca\nvalidate` lints for it. Every edit task is verified by `copeca check-task`: the\ntest must pass on clean code and fail on mutated code, proving the mutation\nactually bites. See [docs/task-taxonomy.md](docs/task-taxonomy.md).\n\n**Contamination defense:** `copeca validate` checks every task's `source:`\nfield against a blocklist of known-contaminated source benchmarks (SWE-bench\nVerified, RepoBench, ClassEval, DevEval, CoderEval). A task from any of\nthese sources is rejected before it can enter the corpus. This is a static\nprovenance check — no model calls, no network. A planned authoring-time\noption (requires an API key) will also probe a live model with the task ID\nand exclude it if the model reproduces the gold solution from memory; that\nfeature is not shipped yet.\n\n---\n\n## Runners\n\nThe runner interface is **config-driven**: a runner is a YAML file in\n`defaults/runners/` declaring the CLI binary, its argument mapping, its config-dir\nenv var, and which output parser to use — plus a pricing table. Copeca builds the\nsubprocess invocation from that YAML, so adding an agent CLI means writing a YAML,\nnot editing copeca's code. See\n[docs/runner-configuration.md](docs/runner-configuration.md).\n\nTo compute cost, copeca requires the *minimum* from the agent's output: token\ncounts. From those it derives `computed_cost_usd` — a reproducible, provider-neutral\ncross-check; when the runner also reports its own billed cost, that vendor figure is\nthe headline. 
Duration and completion are derived from the output too.\n\n```jsonl\n{\"type\": \"turn\", \"input_tokens\": 5000, \"output_tokens\": 200,\n \"cache_creation_tokens\": 3500, \"cache_read_tokens\": 3000}\n{\"type\": \"assistant_message\", \"text\": \"...\", \"turn\": 2}\n{\"type\": \"result\", \"total_cost_usd\": 0.0734, \"duration_ms\": 45230}\n```\n\nTwo runners ship today: **Claude Code** (`stream_json` parser) and **OpenAI\nCodex** (`codex_json` parser) — each added as a YAML plus a parser, with no\nchanges to copeca's core. A CLI with a different output format needs a matching\nparser, and a runner YAML naming an unbuilt parser fails loudly rather than\nsilently miscounting.\n\n---\n\n## Install\n\nA built wheel bundles its runtime data (`schemas/`, `tasks/`, `defaults/`, and\n`repos.yaml`), so a pip install is fully functional — `copeca init`, `validate`,\nand `run` work off the packaged corpus. Copeca is **not** published on PyPI yet,\nso install from git or a source checkout:\n\n```bash\npip install git+https://github.com/jahala/copeca\n```\n\nOr from a clone (use `-e` for development):\n\n```bash\ngit clone https://github.com/jahala/copeca\ncd copeca\npip install .\n```\n\nRequires Python ≥ 3.11. The Claude Code and Codex runners ship ready to use; the\nrunner interface is config-driven, so other CLIs are added by writing a YAML (and,\nif their output format differs, a parser). 
See\n[docs/runner-configuration.md](docs/runner-configuration.md).\n\n---\n\n## Documentation\n\n- [Task authoring guide](docs/task-authoring.md) — write comprehensions and edits\n- [Runner configuration](docs/runner-configuration.md) — output contract, pricing\n- [Metrics \u0026 methodology](docs/metrics.md) — cost-per-correct math, delta-not-absolute\n- [Known limitations](docs/known-limitations.md) — string matching, bootstrap CIs, modeled cost\n\n---\n\n## Support\n\n[![\"Buy Me A Coffee\"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://buymeacoffee.com/jahala)\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\nCopeca's bundled task corpus is derived from independent benchmark sources\nunder permissive licenses (Apache-2.0, MIT, CC BY 4.0). Each task carries a\n`source:` field with provenance. Tasks from NonCommercial, ShareAlike, or\nno-license sources are explicitly excluded.\n\n---\n\n## Related\n\nCopeca is part of the [plotplot](https://github.com/plotplot-ai) garden of small,\nsharp tools for building with AI. Siblings:\n[tilth](https://github.com/jahala/tilth) (AST-aware code intelligence),\n[umbel](https://github.com/jahala/umbel) (drive many agent CLIs from one session),\n[pleach](https://github.com/jahala/pleach) (conduct agent work in isolated worktrees),\n[petals](https://github.com/jahala/petals) (brand intelligence),\n[tend](https://github.com/jahala/tend) (feature mapping across sessions).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjahala%2Fcopeca","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjahala%2Fcopeca","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjahala%2Fcopeca/lists"}