{"id":50960193,"url":"https://github.com/raiyanyahya/ensemble","last_synced_at":"2026-06-18T12:32:33.311Z","repository":{"id":362574986,"uuid":"1259794548","full_name":"raiyanyahya/ensemble","owner":"raiyanyahya","description":"Multi-model consensus debate via the filesystem. LLMs propose, peer-review, rebut, vote and synthesize a group-confirmed answer. CLI + MCP.","archived":false,"fork":false,"pushed_at":"2026-06-04T21:58:57.000Z","size":125,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-06-04T23:13:58.872Z","etag":null,"topics":["agent","ai","anthropic","borda-count","claude","cli","consensus","debate","deepseek","ensemble","llm","llm-agents","llm-council","llm-evaluation","mcp","model-context-protocol","multi-agent","openai","peer-review","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raiyanyahya.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-04T21:38:38.000Z","updated_at":"2026-06-04T21:59:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/raiyanyahya/ensemble","commit_stats":null,"previous_names":["raiyanyahya/ensemble"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/raiyanyahya/ensemble","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raiyanyahya%2Fensemble","tag
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raiyanyahya%2Fensemble/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raiyanyahya%2Fensemble/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raiyanyahya%2Fensemble/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raiyanyahya","download_url":"https://codeload.github.com/raiyanyahya/ensemble/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raiyanyahya%2Fensemble/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34491228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","ai","anthropic","borda-count","claude","cli","consensus","debate","deepseek","ensemble","llm","llm-agents","llm-council","llm-evaluation","mcp","model-context-protocol","multi-agent","openai","peer-review","python"],"created_at":"2026-06-18T12:32:31.416Z","updated_at":"2026-06-18T12:32:33.281Z","avatar_url":"https://github.com/raiyanyahya.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🗳️ Ensemble\n\n*Multi-model consensus debate via the filesystem — LLMs propose, peer-review,\nrebut, vote, and 
synthesize a group-confirmed answer. CLI + MCP.*\n\n![Python](https://img.shields.io/badge/python-3.10%2B-blue)\n![Tests](https://img.shields.io/badge/tests-98%20passing-brightgreen)\n![Lint](https://img.shields.io/badge/lint-ruff-261230)\n![MCP](https://img.shields.io/badge/MCP-compatible-orange)\n![Providers](https://img.shields.io/badge/providers-OpenAI%20%C2%B7%20Anthropic%20%C2%B7%20DeepSeek-555)\n\n\u003e **Multi-round: propose → peer-review → rebut → vote → synthesize → converge.**\n\u003e Not a one-shot poll — an auditable debate that runs until the models agree on\n\u003e a *specific answer* (or provably can't), with every step left on disk.\n\nMulti-model consensus debate via the filesystem. Several top LLMs (OpenAI,\nAnthropic, DeepSeek) independently **propose**, **review** each other, **rebut**\nthe critiques of their own proposal, and **vote** — and, once a majority agrees,\n**synthesize** a single merged answer that the group **confirms**. They never\ntalk to each other directly: every contribution is a file in a shared folder,\nand a coordinator advances the debate phase by phase. 
Participants are\n**anonymized** to each other (shown only as \"Participant A/B/C\"), so they judge\narguments on merit, not on brand.\n\n```mermaid\n%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'22px'}, 'flowchart': {'nodeSpacing': 55, 'rankSpacing': 70, 'padding': 16}}}%%\nflowchart LR\n    P[PROPOSING] --\u003e R[REVIEWING] --\u003e B[REBUTTAL] --\u003e V{VOTING}\n    V -- \"revise / split\u003cbr/\u003e(positions still moving)\" --\u003e P\n    V -- \"stable disagreement\u003cbr/\u003eor safety fuse\" --\u003e D[\"Best-effort answer\u003cbr/\u003e(plurality, Borda-broken tie)\"]\n    V -- \"majority finalize\" --\u003e S[\"SYNTHESIS\u003cbr/\u003eendorsed author merges\u003cbr/\u003e(minority views kept)\"]\n    S --\u003e C{\"CONFIRM\u003cbr/\u003eAPPROVE majority?\"}\n    C -- yes --\u003e A1[\"Synthesis = consensus answer\"]\n    C -- \"no / stall / error\" --\u003e A2[\"Verbatim winning proposal\u003cbr/\u003e(today's behaviour)\"]\n    A1 --\u003e F[final.md]\n    A2 --\u003e F\n    D --\u003e F\n```\n\n\u003e Phases `PROPOSING → REVIEWING → REBUTTAL → VOTING` run every round.\n\u003e `SYNTHESIS` and `CONFIRM` run **only after a majority finalizes**; deadlocks\n\u003e skip them. 
Alongside its vote, each model may emit a **Borda ranking** of all\n\u003e proposals — a recorded signal that only ever decides a *plurality tie on a\n\u003e deadlock*, never a real majority.\n\n## Highlights\n\n- 🗳️ **Real consensus, not a poll** — convergence means a majority endorses the\n  *same* proposal; otherwise the debate keeps going or provably deadlocks.\n- 🎭 **Anonymized peer review** — models see each other only as \"Participant A/B/C\",\n  so arguments win on merit, not on brand.\n- 🔀 **Rebuttal phase** — each model answers the critiques of *its own* proposal\n  before anyone votes, so minds can actually change.\n- 🧬 **Group-confirmed synthesis** — on consensus the endorsed author merges the\n  best points (minority views kept) and the group ratifies it; any failure falls\n  back to the verbatim winner, so the worst case is never worse than today.\n- 📊 **Borda ranking** — a richer per-model signal that breaks deadlock ties\n  deterministically (adapted, with synthesis, from [karpathy/llm-council](https://github.com/karpathy/llm-council)).\n- 🗂️ **Everything on disk** — every proposal, review, rebuttal, vote, and synthesis\n  is a Markdown file; debates are inspectable and fully **resumable**.\n- 💸 **Cost-aware** — per-model token + USD accounting, prompt caching, and a hard\n  `--budget` cap.\n- 🌐 **Grounding \u0026 roles** — optional web-search citations and anti-groupthink\n  stances (`--ground`, `--roles diverse`).\n- 🔌 **CLI + MCP** — a rich terminal UI *and* a one-tool MCP server for Claude Code,\n  Cursor, Cline, Kilo, Continue, and friends.\n- 🧪 **Measured, not asserted** — a real eval harness (`ensemble-eval`) with a\n  strong-model baseline and per-question audit logs.\n\n## Why filesystem?\n\nEach model only reads and writes Markdown files. That makes every step of the\ndebate a durable, inspectable artifact: you can open any round and read exactly\nwhat each model proposed, how it critiqued the others, and how it voted. 
A\ndebate is fully resumable from disk.\n\n## Requirements\n\n- **Python ≥ 3.10**\n- API keys for **at least two** of three providers (below)\n- *(optional)* a `TAVILY_API_KEY` for web-search grounding\n\n## Install\n\nEnsemble isn't on PyPI yet, so install it from a clone. The `[mcp]` extra also\ninstalls the server used by the editor plugins — include it so you get\neverything in one go.\n\n```bash\ngit clone https://github.com/raiyanyahya/ensemble.git\ncd ensemble\n\npython -m venv .venv\nsource .venv/bin/activate           # Windows: .venv\\Scripts\\activate\n\npip install -e \".[mcp]\"             # CLI + MCP server\n#   \".[mcp,dev]\"  also installs pytest + ruff for development\n```\n\nThis puts two commands on your `PATH`:\n\n| Command        | What it is                                              |\n| -------------- | ------------------------------------------------------- |\n| `ensemble`     | the CLI (`chat`, `debate`, `list`, `status`, `show`, `resume`) |\n| `ensemble-mcp` | the stdio MCP server that editors/agents call           |\n\nVerify:\n\n```bash\nensemble --help\npython -c \"import mcp\"    # no output = the [mcp] extra is installed\n```\n\n(Running `ensemble-mcp` launches the stdio server, which then waits for an MCP\nclient on stdin — that's expected; press Ctrl-C to exit. Editors start it for\nyou.)\n\n## Configure API keys\n\nSet environment variables for the providers you have (any **two** is enough):\n\n| Provider | Env var             | Default model               |\n| -------- | ------------------- | --------------------------- |\n| gpt4o    | `OPENAI_API_KEY`    | `gpt-4o-mini`               |\n| claude   | `ANTHROPIC_API_KEY` | `claude-haiku-4-5-20251001`  |\n| deepseek | `DEEPSEEK_API_KEY`  | `deepseek-chat`             |\n\n```bash\nexport OPENAI_API_KEY=...\nexport ANTHROPIC_API_KEY=...\nexport DEEPSEEK_API_KEY=...   # any two of the three is enough\nexport TAVILY_API_KEY=...     
# optional — enables web-search grounding (--ground)\n```\n\n\u003e Put these in your shell profile (`~/.bashrc`, `~/.zshenv`) so they persist.\n\u003e Keys are read at call time and never written to disk or logged.\n\n## Quickstart\n\n```bash\nensemble chat                                   # interactive; type a question\n# or one-shot:\nensemble debate \"Postgres or DynamoDB for a write-heavy event store?\" --quick\n```\n\n## Use the CLI\n\n### Interactive (`chat`)\n\nThe quickest way in — an interactive session where the council debates each\nquestion you type, with a live progress panel:\n\n```bash\nensemble chat                 # quick mode by default (1 round, low latency)\nensemble chat --deep          # full multi-round debates by default\n```\n\nIn-session commands: `/quick` · `/deep` · `/rounds N` · `/list` · `/help` · `/exit`.\n\n### One-shot (`debate`)\n\n```bash\n# Run a single question to consensus (or deadlock)\nensemble debate \"Is P equal to NP? Give your best honest assessment.\"\n\nensemble debate \"...\" --quick                 # single round, fast\nensemble debate \"...\" --rounds 3 --stall-timeout 180 -v\nensemble debate \"...\" -m claude=claude-sonnet-4-6   # override a model id\n\n# Inspect\nensemble list                 # all debates\nensemble status \u003cdebate-id\u003e   # current round/phase + who has contributed\nensemble show \u003cdebate-id\u003e     # render the final consensus document\nensemble resume \u003cdebate-id\u003e   # continue an interrupted debate\n```\n\n### Controls: cost, grounding, roles\n\n```bash\n# Cost \u0026 budget — every debate reports per-model tokens + estimated $ (prompt\n# caching is on, so cached tokens are billed at a discount). Cap the spend:\nensemble debate \"...\" --budget 0.05            # stop once est. 
spend hits $0.05\n\n# Grounding \u0026 citations — web-search the prompt first; models cite [n], and the\n# sources are listed in the final document (needs TAVILY_API_KEY):\nensemble debate \"Latest on \u003ctopic\u003e?\" --ground\n\n# Roles / stances — fight groupthink by assigning perspectives:\nensemble debate \"...\" --roles diverse          # skeptic / advocate / pragmatist\nensemble debate \"...\" --roles redteam          # one advocate, the rest skeptics\nensemble debate \"...\" --role gpt4o=skeptic --role claude=\"a security auditor\"\n```\n\nAll of these work in `ensemble chat` and via the MCP tool too (`ground`,\n`budget` arguments). Cost, sources, and votes all land in `final.md`.\n\n## Use it from your editor / agent (MCP)\n\n`ensemble-mcp` (installed by the `[mcp]` extra above) is a stdio MCP server that\nexposes one tool — `ensemble_debate(prompt, quick=true, rounds=5, models=…,\nground=false, budget=null)` — to any MCP client. Make sure your provider keys\nare set in the environment the client launches it from.\n\n### Claude Code\n\nInstall the bundled plugin (adds the `/ensemble` command **and** the tool):\n\n```text\n/plugin marketplace add /absolute/path/to/ensemble     # this repo (or raiyanyahya/ensemble on GitHub)\n/plugin install ensemble@ensemble\n```\n\nRestart Claude Code, then:\n\n```text\n/ensemble Should we shard this table now or wait until 1B rows?\n```\n\nOr just ask Claude to \"get the council's opinion on …\" and it will call the\ntool. (The plugin's `.mcp.json` forwards your `*_API_KEY` env vars to the\nserver.) Details: [`plugins/claude-code/`](plugins/claude-code/).\n\n### Kilo Code\n\nCopy [`plugins/kilo/kilo.jsonc`](plugins/kilo/kilo.jsonc) to\n`~/.config/kilo/kilo.jsonc` (global) or `.kilo/kilo.jsonc` (this project), fill\nin your keys, and **raise the timeout** — Kilo's 10s default aborts a debate.\nOr add it via the UI: Settings → MCP → Add Server → Local (stdio), command\n`ensemble-mcp`. 
Details: [`plugins/kilo/`](plugins/kilo/).\n\n### Cursor / Cline / Roo / Continue / VS Code Copilot\n\nAny MCP client takes the same stdio server — add it in that client's MCP config:\n\n```json\n{\n  \"mcpServers\": {\n    \"ensemble\": {\n      \"command\": \"ensemble-mcp\",\n      \"env\": {\n        \"OPENAI_API_KEY\": \"...\",\n        \"ANTHROPIC_API_KEY\": \"...\",\n        \"DEEPSEEK_API_KEY\": \"...\"\n      }\n    }\n  }\n}\n```\n\n\u003e A debate is much slower than a single call, so prefer `quick` for interactive\n\u003e use and reserve a deep debate (`quick: false`) for high-stakes decisions.\n\n## How consensus is decided\n\nConsensus means agreement on a **specific proposal**, not just willingness to\nstop. Each voting round, every active participant casts one vote:\n\n- **FINALIZE: \\\u003cparticipant\\\u003e** — endorse the single best proposal *by its label*.\n- **REVISE: \\\u003cfocus\\\u003e** — go another round, with a stated focus.\n- **SPLIT: \\\u003creason\\\u003e** — fundamental disagreement.\n\nThe coordinator resolves each FINALIZE to the proposal it endorses and tallies\nendorsements (`majority = n // 2 + 1`). **There is no fixed round count** — the\ndebate runs until the participants settle it:\n\n1. **Finalize** — a majority endorses the **same** proposal → it becomes the\n   consensus answer (terminal). Three FINALIZE votes for three *different*\n   proposals is **not** consensus.\n2. **Stable disagreement** — if a round's votes and endorsements are identical to\n   the previous round's, the participants have stopped moving → the debate\n   **deadlocks**, writing the *plurality* proposal as a best-effort answer.\n3. Otherwise a **revise** majority (or an unsettled **split**) starts another\n   round — for as long as positions keep changing.\n\nTwo backstops bound a debate that never settles: an optional **`--budget`** cap\non spend, and a high **safety fuse** (`--rounds`, default 50) that's almost never\nthe actual terminator. 
If a provider becomes unresponsive mid-debate it's\n**dropped** (as long as ≥2 live participants remain) so the debate finishes\ninstead of hanging; the drop is noted in `final.md`.\n\n### Synthesis \u0026 ranking\n\nBoth ideas here are adapted from Andrej Karpathy's\n[llm-council](https://github.com/karpathy/llm-council) — its **anonymous peer\nranking** and **chairman synthesis** — reworked for Ensemble's multi-round,\nconsensus-by-vote, filesystem model: the ranking is additive (it only breaks a\ndeadlock tie, never overrides a majority), and the synthesis is a *candidate the\ngroup ratifies by vote* rather than a single chairman's verdict.\n\nTwo signals refine the outcome **without changing the rules above**:\n\n- **Ranking (Borda).** Alongside its vote, each participant may rank all\n  proposals best-to-worst (`B \u003e C \u003e A`). The coordinator tallies Borda points and\n  records them in `final.md`. The ranking only ever *decides* anything in the one\n  case the old logic left arbitrary — breaking a **plurality tie on a deadlock**;\n  a real majority is always unique, so the finalize path is untouched.\n- **Synthesis-as-candidate.** Once a majority finalizes, the endorsed author\n  drafts a single **merged** answer that folds in the strongest points (and\n  preserves minority views). Every participant then **confirms** it (APPROVE /\n  REJECT). A majority APPROVE ships the synthesis as the consensus answer;\n  anything else — a reject, the author erroring, or a stall — **falls back to the\n  verbatim winning proposal**, i.e. exactly the previous behaviour. The verbatim\n  proposals are always kept in `final.md` below the synthesis for audit. 
This is\n  *not* a single \"chairman\": the merge is a candidate the group ratifies, and it\n  runs only on consensus (deadlocks are unchanged).\n\n#### Two outcomes, same prompt — both paths in the wild\n\nTwo live runs on the classic *\"Which is larger, 9.11 or 9.9?\"* trap landed on the\nsame correct answer (**9.9**) by two different legitimate routes — a neat tour of\nthe new machinery. (The route differs run-to-run from sampling, not from a flag.)\n\n**Run A — cyclic endorsement → deadlock → Borda tiebreak.** All three voted\nFINALIZE, but each endorsed a *different* peer, a perfect cycle:\n\n```\nGPT-4o  → endorsed DeepSeek\nClaude  → endorsed GPT-4o\nDeepSeek → endorsed Claude\n```\n\nEvery proposal drew exactly **1/3** endorsements: agreement on the *answer*,\ndisagreement on whose *articulation* was best, and no majority to settle it. The\ndebate **deadlocked**, and the 1-1-1 tie for the best-effort answer was broken\n**by Borda score** (previously arbitrary) — Claude 4 ▸ DeepSeek 3 ▸ GPT-4o 2.\nSynthesis correctly did **not** run (it's finalize-only). Cost **$0.0125**.\n\n**Run B — clean finalize → synthesis → confirm.** This time the endorsements\naligned **3/3 on DeepSeek**, so the debate finalized and the full post-consensus\npath ran:\n\n```\nVOTING → FINALIZE (3/3 → DeepSeek)\n  → SYNTHESIS  (DeepSeek, the winner, drafts the merge)\n  → CONFIRM    {APPROVE: 3, REJECT: 0} → synthesis ACCEPTED\n```\n\n`final.md` led with the **group-confirmed synthesis** (ending `Final answer: 9.9`,\ncrediting each participant's strongest point), kept the verbatim proposals below\nit, and ranked Borda **DeepSeek 6 ▸ Claude 3 ▸ GPT-4o 0**. The winner made 6 calls\n(it authored the synthesis), the others 5; cost **$0.0183**.\n\nSame question, same answer — one run exercised the **deadlock + Borda tiebreak**,\nthe other the **synthesis + confirm** path, and both handled it correctly.\n\n## A real debate, end to end\n\nHere's an actual run (not a mock-up). 
Prompt:\n\n\u003e *Should frontier AI labs be legally required to open-source their model weights?\n\u003e Give a yes or no and your single strongest reason.*\n\nThree models, anonymized to each other as Participant A/B/C (**A = GPT-4o Mini,\nB = Claude Haiku 4.5, C = DeepSeek** — the models never saw these names):\n\n1. **They genuinely disagreed.** In PROPOSING, GPT-4o argued **Yes** (transparency\n   and accountability); Claude and DeepSeek both argued **No** (irreversible\n   misuse/weaponization risk that audits and regulation can address instead). A\n   real 1-Yes / 2-No split, not three models nodding along.\n2. **The rebuttal phase changed a mind.** After reading the critiques of its own\n   proposal, GPT-4o conceded the security argument and floated a middle ground —\n   then, in VOTING, **endorsed Claude's \"No\" proposal** outright, citing the\n   asymmetric-risk reasoning it found persuasive. The lone dissenter was won over\n   by the argument — while still blind to *whose* argument it was.\n3. **Consensus, by endorsement.** Final tally: Claude's proposal endorsed **2/3**\n   (by GPT-4o and DeepSeek); DeepSeek's endorsed 1/3 (by Claude). DeepSeek rated\n   Claude's articulation above its own. **Consensus answer: No** — with the\n   minority \"Yes\" still preserved in the record.\n\nRun twice, the verdict reproduced exactly (same winner, same 2/3, same GPT-4o\nflip) even at `temperature=0.7` — the prose differed each time, the *decision*\ndidn't. Cost of the run:\n\n| Model | Calls | Input | Output | Cached | Est. cost |\n|---|---|---|---|---|---|\n| GPT-4o Mini (OpenAI)        | 4 | 6 749 | 1 028 | 0     | $0.0016 |\n| Claude Haiku 4.5 (Anthropic)| 4 | 7 051 | 2 091 | 0     | $0.0175 |\n| DeepSeek Chat               | 4 | 5 821 | 1 693 | 1 536 | $0.0031 |\n| **Total** | | | | | **$0.0222** |\n\n(Four calls each = propose + review + rebut + vote, one round — they converged\nwithout needing a second. 
Claude dominates the cost: it is priced at $1/$5 per 1M input/output tokens and\nproduced the longest outputs.) **Note:** this run predates the synthesis step; a converged\ndebate now adds a synthesis call (endorsed author) plus one short confirm call\nper participant; the file tree in the next section shows the files they write.\n\n## Artifacts on disk\n\nDebates are stored under `~/.ensemble/debates/\u003cdebate-id\u003e/`:\n\n```\n\u003cdebate-id\u003e/\n├── prompt.md            # the question\n├── state.json           # full debate state (atomic, resumable) — incl. votes,\n│                        #   Borda scores, synthesis_used, confirm tally\n├── round-001/\n│   ├── gpt4o.proposal.md   gpt4o.review.md   gpt4o.rebuttal.md   gpt4o.vote.md\n│   ├── claude.proposal.md  claude.review.md  claude.rebuttal.md  claude.vote.md\n│   ├── deepseek.proposal.md  ...   (+ \u003cmodel\u003e.\u003cphase\u003e.failed if a provider gave up)\n│   ├── \u003cwinner\u003e.synthesis.md       # only on a finalize: the endorsed author's merge\n│   └── \u003cmodel\u003e.confirm.md          # each participant's APPROVE / REJECT of the synthesis\n├── round-002/ ...\n└── final.md             # the consensus (or best-effort) answer\n```\n\nEach phase writes a **separate** file, so contributions accumulate across\nphases rather than overwriting one another. A vote file may carry a `## Ranking`\nline (`B \u003e C \u003e A`); the synthesis and confirm files appear only on the finalize\npath.\n\n## Evaluation\n\nDoes the debate actually beat a single model? 
`ensemble-eval` puts numbers on it:\neach question is answered by every model *solo* and by the *ensemble*, graded by\nextracting the model's final answer (the concluding line plus any explicit\n`Final answer:` line — not a whole-text substring match, to avoid favouring\nlonger outputs), and tallied for accuracy and cost.\n\nThe honest verdict: **debate matches a strong single model and lifts unreliable\ncheap models to that level — but it does not beat a model that is already\nreliable, and it costs far more.** The runs below build to that conclusion.\n\n### Latest validated run (15 traps, with synthesis + ranking, 2026-06-04)\n\nAfter adding the post-consensus **synthesis** step and the **Borda** ranking\nsignal, we ran `evals/hard.jsonl` — 15 classic single-model traps (9.11 vs 9.9,\nthe bat-and-ball, the algae lake, \"all but 9 die\") where cheap models are\nindividually error-prone. Three cheap models as the ensemble, Claude Sonnet 4.6\nas the strong baseline, one round each (`--quick`):\n\n```\nCondition            Score  Accuracy       Cost     $/correct\n-------------------------------------------------------------\ngpt-4o-mini          8/15     53.3%   $ 0.0001    ~$0.00001\nclaude-haiku-4.5    14/15     93.3%   $ 0.0013    ~$0.0001\ndeepseek-chat       15/15    100.0%   $ 0.0002    ~$0.00001\n-------------------------------------------------------------\nBASELINE (sonnet)   15/15    100.0%   $ 0.0058    ~$0.0004\n-------------------------------------------------------------\nENSEMBLE            15/15    100.0%   $ 0.2923    ~$0.0195\n```\n\n- **The mechanism works.** On *Monday + 100 days → ?*, gpt-4o-mini said Thursday\n  and Claude said Friday (both wrong); only DeepSeek had Wednesday. **Two of three\n  cheap models were individually wrong, yet the ensemble landed on Wednesday** —\n  and the endorsed proposal was *Claude's*, which **revised to the correct answer**\n  through review→rebuttal before the vote. 
Cross-examination corrected an\n  individual error; the wrong majority didn't win.\n- **Synthesis verbosity vs. graders (found and fixed).** In the first pass the\n  ensemble scored 14/15: the bat-and-ball debate reached *unanimous-correct*\n  consensus (\"the ball costs 5 cents\"), but the verbose synthesis *ended* on a\n  caveat about the wrong intuitive answer (\"…totaling $1.20\"), so the last-line\n  extractor missed it. The fix instructs the synthesis to close with a\n  `Final answer: \u003cvalue\u003e` line in the requested format — gradable, and clearer for\n  a human. The re-run scored **15/15**.\n- **The honest caveat.** DeepSeek alone already went 15/15 here, so the ensemble\n  **tied** the best cheap single and the strong baseline rather than beating\n  them — at **~50× the baseline's cost**. Debate buys *reliability*, not a higher\n  ceiling, and only earns its keep when no single available model is already\n  reliable. (N = 15, single pass; gpt-4o-mini drifted 10→8 between passes on the\n  traps, a reminder these are noisy small-sample numbers.)\n\n### Earlier: the harder run (72 objective questions, single pass)\n\n*This is the run that motivated the work above — kept for the full story.*\n\n`evals/harder.jsonl` is 72 auto-gradeable questions across six categories\n(multi-step math, logic, counting/strings, factual edge cases, traps,\narithmetic). Every computable answer is re-derived and asserted in\n`evals/build_harder.py`, so a typo'd key fails at build time. 
We added a **strong\nsingle-model baseline** (Claude Sonnet 4.6) as the comparison that actually\nmatters — \"three cheap models debating\" vs \"one strong model answering once.\"\n\n```\nCondition            Score  Accuracy       Cost     $/correct\n-------------------------------------------------------------\ngpt-4o-mini         65/72     90.3%   $ 0.0008    ~$0.00001\nclaude-haiku-4.5    64/72     88.9%   $ 0.0073    ~$0.0001\ndeepseek-chat       67/72     93.1%   $ 0.0011    ~$0.00002\n-------------------------------------------------------------\nBASELINE (sonnet)   70/72     97.2%   $ 0.0247    ~$0.0004\n-------------------------------------------------------------\nENSEMBLE            30/72     41.7%   $ 0.6893    ~$0.023\n```\n\nTaken at face value the ensemble is a disaster — last place, at ~28× the cost of\nthe strong baseline. **But that headline is an artifact of one failure mode, not\nof bad reasoning:**\n\n- **40 of 72 debates stalled** (38 in voting, 2 in reviewing) and hit the 120 s\n  timeout, returning a \"no consensus\" placeholder that scores wrong. Stalled\n  debates were **2.4 %** correct; that single bucket *is* the 41.7 %.\n- **On the 31 debates that did converge, the ensemble scored 93.5 %** — and on\n  that same subset the cheap singles scored *lower* (gpt-4o 83.9 %, haiku 77.4 %,\n  deepseek 87.1 %), while Sonnet also scored 93.5 %. So when the debate actually\n  runs, it lifts three cheap models to strong-model accuracy.\n\n### What we can and can't conclude\n\n- We **cannot** yet claim debate beats (or loses to) a single model, because\n  this run was gated by a **vote-parsing bug** (since fixed — see below). 
The\n  41.7 % is not a measure of debate quality.\n- The \"converged\" subset is **selection-biased** (questions where models readily\n  agree) and small (N = 31), so its 93.5 % is suggestive, not a verdict.\n- These questions are **easier than intended**: modern cheap models already clear\n  ~90 %, leaving little headroom for debate to demonstrate value. A genuinely\n  hard, low-baseline set is needed to see the effect cleanly.\n\nThe encouraging signal (debate ≈ strong model, \u003e cheap singles *when it\nconverges*) only becomes a real claim once convergence is reliable.\n\n### Root cause of the stalls (found and fixed)\n\nAuditing the 41 non-consensus debates via the per-question log pinned the cause\nprecisely: **45 of 138 vote files contained a valid directive but no `## Vote`\nheader.** Models obey the instruction \"your vote MUST be the first line\" and emit\n`FINALIZE: Participant B` directly, sometimes dropping the `## Vote` wrapper. The\nparser only harvested a vote from a `## Vote` section, so those votes were\nsilently lost — and because the agent's API call *succeeded*, it wrote no failure\nsentinel, leaving the coordinator to wait for a vote that was physically present\nbut invisible until the 120 s timeout.\n\nThe fix makes vote parsing tolerant of a missing/garbled header (recovering the\nunwrapped directive line) while `for_phase` still prevents a stray directive in a\nnon-voting phase from being counted early. Re-parsing the recorded run with the\nfix, **all 137 of those vote files now parse, and 45/46 stalled debates would\nhave reached a vote.** A clean full re-run is the immediate next step before\nmaking any debate-vs-model claim.\n\n### Earlier: when does debate actually add value?\n\nWith convergence fixed, we went looking for the case that would justify the cost:\na question where the cheap models are individually unreliable, so debate has\nsomething to correct. 
Probing all three cheap models (gpt-4o-mini, Haiku,\nDeepSeek) on **30 hard, objective problems** turned up a striking fact: **not one\nproblem stumped all three.** Their errors are *uncorrelated* — each fails on\ndifferent questions — so for every problem at least one model was right. (This\nalso bounds the upside: debate can't invent an answer no member can reach.)\n\nThe sharp test, then, is what happens when the lone correct model is *outvoted*\nby confidently-wrong peers. On three such problems (a factorial sum, a\nsquares-or-cubes count, and a cryptarithm), run 3× each:\n\n```\nCondition            Score   where ≥2/3 cheap models were individually wrong\n----------------------------------------------------------------------------\ngpt-4o-mini          0/9\nclaude               8/9     Ensemble stayed correct in 7/7 such debates.\ndeepseek             3/9\nBASELINE (sonnet)    9/9\nENSEMBLE             9/9     (+11 pts over the best cheap single; ties Sonnet)\n```\n\nThe ensemble went **9/9, beating the best cheap single** — and the per-question\nlog shows *why*: on the squares-or-cubes problem only Haiku could solve it solo,\nyet in the debate the other two (wrong on their own) read its work and **endorsed\nthe correct answer**; on the cryptarithm, models that failed solo produced\n*correct* proposals once reasoning through propose → review → rebuttal. A wrong\nmajority did **not** drag the group to a wrong answer in any of the 9 debates.\nSo debate's value is real and mechanistic: cross-examination corrects individual\nerrors, not just tallies votes.\n\n**The honest caveat:** a single strong model (Sonnet) also went 9/9, at **~1/6th\nthe ensemble's cost** ($0.0032 vs $0.018 per correct answer). Debate *matched* the\nstrong model but never beat it. 
The defensible conclusion:\n\n- **Debate \u003e best single _cheap_ model** on hard, error-prone problems — genuine,\n  mechanism-backed value.\n- **Debate ≈ single _strong_ model** on accuracy, at ~6× the cost.\n- So debate earns its keep as a way to get **strong-model reliability out of weak\n  or diverse models** — not as a way to beat a strong model you could just call\n  directly.\n\n(Sample size here is small — 9 debates over 3 questions — a clean signal with a\nvisible mechanism, but a ≥30-question \"cheap-models-unreliable\" set is needed to\nmake it a firm claim.)\n\n### Reproduce\n\n```bash\npip install -e .\nexport OPENAI_API_KEY=... ANTHROPIC_API_KEY=... DEEPSEEK_API_KEY=...\n\n# the latest validated run (15 single-model traps):\nensemble-eval --dataset evals/hard.jsonl --models gpt4o,claude,deepseek \\\n  --baseline sonnet --delay 2 --stall-timeout 120 --log run.jsonl\n\n# or the larger 72-question set:\nensemble-eval --dataset evals/harder.jsonl --models gpt4o,claude,deepseek \\\n  --baseline sonnet --delay 2 --stall-timeout 120 --log run.jsonl\n```\n\n`--log` writes one JSONL record per question (every condition's answer, outcome,\ncost, and the debate's end status + reason) so any result can be audited and the\nstalls inspected. 
`--baseline` accepts any provider key; `sonnet` is registered\npurely as an eval baseline and never joins the default ensemble.\n\n## Development\n\n```bash\npip install -e \".[dev]\"\npytest            # unit + end-to-end (no network; providers are stubbed)\nruff check .\n```\n\nThe end-to-end test in `tests/test_flow.py` drives the real coordinator and\nagent loops with fake providers and asserts the full debate converges with all\nproposal content preserved.\n\n## Robustness notes\n\n- **Atomic writes** — `state.json` and contribution files are written to a temp\n  file and `os.replace`d, so a polling reader never sees a torn file.\n- **Retries** — provider calls retry transient failures (429 / 5xx / network)\n  with exponential backoff, honoring `Retry-After`.\n- **No infinite hangs** — if a phase makes no progress within `--stall-timeout`\n  seconds (e.g. a provider is down), the debate ends in a graceful deadlock.\n- **Tolerant vote parsing** — a vote is recovered even when the model omits the\n  `## Vote` header and emits a bare `FINALIZE: …` / `REVISE: …` / `SPLIT: …`\n  line, so a present-but-unwrapped vote can't silently stall the debate. The same\n  tolerance covers a bare `APPROVE` / `REJECT` in the confirm phase.\n- **Synthesis never undoes consensus** — once a majority finalizes, any failure,\n  stall, or rejection in the `SYNTHESIS`/`CONFIRM` phases falls back to the\n  verbatim winning proposal. The worst case equals the pre-synthesis behaviour;\n  the merged answer is strictly an upside the group can decline.\n- **Prompt caching** — the stable system prompt is marked as an Anthropic cache\n  breakpoint; OpenAI and DeepSeek cache prefixes automatically. Cached tokens\n  are billed at a discount and counted separately in the cost report.\n- **Cost accounting** — token usage is captured per call into `*.usage.json`\n  sidecars, tallied into `state.json`, and summarized (with estimated $) in\n  `final.md`. 
`--budget` stops the debate before the next round once the budget is exceeded.\n\n## Acknowledgments\n\nThe **synthesis** and **peer-ranking** steps are adapted from Andrej Karpathy's\n[llm-council](https://github.com/karpathy/llm-council), which pioneered the\npattern of multiple LLMs answering, ranking each other *anonymously*, and a\nchairman synthesizing a final response. Ensemble reworks those ideas into a\nmulti-round, consensus-by-vote debate on the filesystem: ranking is an additive\nBorda signal (deadlock tiebreak only), and the synthesis is a group-confirmed\ncandidate rather than a single chairman's call.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraiyanyahya%2Fensemble","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraiyanyahya%2Fensemble","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraiyanyahya%2Fensemble/lists"}