{"id":51242430,"url":"https://github.com/scasella/multi-model","last_synced_at":"2026-06-29T01:04:41.381Z","repository":{"id":354055880,"uuid":"1221946224","full_name":"scasella/multi-model","owner":"scasella","description":"Panel-of-experts scaffold vs \u003cthink\u003e on Qwen3-30B-A3B — diversity finding + olympiad RLVR follow-up","archived":false,"fork":false,"pushed_at":"2026-04-26T23:48:22.000Z","size":353,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-27T00:22:16.240Z","etag":null,"topics":["aime","chain-of-thought","gsm8k","lora","mathematical-reasoning","mixture-of-experts","olympiad","panel-of-experts","pass-at-k","qwen3","reasoning","reinforcement-learning","rlvr","tinker"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scasella.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-26T22:02:08.000Z","updated_at":"2026-04-26T22:40:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/scasella/multi-model","commit_stats":null,"previous_names":["scasella/multi-model"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/scasella/multi-model","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scasella%2Fmulti-model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sca
sella%2Fmulti-model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scasella%2Fmulti-model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scasella%2Fmulti-model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scasella","download_url":"https://codeload.github.com/scasella/multi-model/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scasella%2Fmulti-model/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34909168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aime","chain-of-thought","gsm8k","lora","mathematical-reasoning","mixture-of-experts","olympiad","panel-of-experts","pass-at-k","qwen3","reasoning","reinforcement-learning","rlvr","tinker"],"created_at":"2026-06-29T01:04:38.938Z","updated_at":"2026-06-29T01:04:41.376Z","avatar_url":"https://github.com/scasella.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Panel-of-experts scaffold vs `\u003cthink\u003e` on Qwen3-30B-A3B\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Backbone: 
Qwen3-30B-A3B](https://img.shields.io/badge/backbone-Qwen3--30B--A3B-ff6b6b.svg)](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)\n[![Model on HF](https://img.shields.io/badge/%F0%9F%A4%97%20Model-multipersona--debate--lora-FFD21E.svg)](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)\n[![RL: Tinker](https://img.shields.io/badge/RL-Tinker-blueviolet.svg)](https://tinker-console.thinkingmachines.ai/)\n\nA LoRA rank-32 fine-tune of `Qwen/Qwen3-30B-A3B-Base` that replaces the usual\n`\u003cthink\u003e…\u003c/think\u003e` monologue with a three-persona debate\n(`\u003cmutipersonaDebate\u003e…\u003c/mutipersonaDebate\u003e`) before answering, scored against\nQwen's native thinking model on the same backbone.\n\n**Headline finding #1 — wider search per sample.** Panel reasoning traces\nsit +78.2% further apart on MATH-500 and +75.6% further apart on AIME 24 + 25\n(mean pairwise cosine distance, all-mpnet-base-v2 embeddings). On AIME the\npaired pass@k gap to Qwen3-thinking closes by +15.6pp from k = 1 to k = 16,\nmonotone in k.\n\n**Headline finding #2 — and the deployment-cost consequence.** On the **374\npaired draws** across MATH-500 L5 and AIME 24+25 where both arms produced a\ncorrect answer, thinking burns **5.9–6.9× more tokens (median)** than the\npanel — and is longer than the panel in **99.6–99.7%** of all paired draws,\nregardless of correctness. At the cost-per-correct-answer level — total\ntokens divided by correct samples — thinking is **2.5–5× more expensive**\nthan the panel. Per-sample accuracy still favors thinking; per-token\nefficiency reverses the comparison once compute is in the denominator.\n(Wilcoxon signed-rank: *p* = 6×10⁻⁵² on MATH, *p* = 8×10⁻¹³ on AIME. Run\n`python scripts/analyze_token_efficiency.py` to reproduce.)\n\n**Follow-up — Apr 25.** The diversity shows up where it matters for RL. 
On a\nfresh 877-problem olympiad-math pool the panel has **1.83× more variance-band\nproblems** than Qwen3-thinking (382 vs 209) — the only regime where\ngroup-relative RLVR generates non-zero gradient. 100 RL steps on that band\ncarry the panel from **14% → 29%** on a shared held-out, with per-source\ngains scaling with training-pool representation.\n\nThe full pipeline is a LoRA rank-32 adapter on the frozen base. The\npost-training compute is roughly four to five orders of magnitude under\nQwen's investment in the thinking baseline.\n\n**Adapter on Hugging Face:** [`scasella91/qwen3-30b-a3b-multipersona-debate-lora`](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)\n\n---\n\n## Repo layout\n\n```\n.\n├── envs/               four RL environments: panel + think × math + gsm8k\n├── scripts/            training, eval, and analysis drivers (see scripts/README.md)\n├── reports/            eval + analysis outputs (summary.json tracked, rollouts.jsonl ignored)\n│   └── blog_post/      the writeup\n├── data/               stratified problem slices (SAMPLE.jsonl + schema.json only)\n├── RECIPE.md           concise reproduction guide\n├── LICENSE             MIT\n└── archive/            pre-pivot history (local only, gitignored)\n```\n\n## Quickstart\n\n```bash\n# 1. set up\nuv venv .venv \u0026\u0026 source .venv/bin/activate\nuv pip install -e .   # or install tinker_cookbook deps directly\ncp .env.example .env  # fill in TINKER_API_KEY + HF_TOKEN\n\n# 2. stage 1: GSM8K warmup (80 RL steps, ~2 h on Tinker)\nbash scripts/rl_multipersona_gsm8k.sh\n\n# Export the resulting checkpoint URI (printed by the run) before stage 2:\nexport PANEL_GSM8K_CHECKPOINT=tinker://\u003cyour-session\u003e:train:0/weights/final\n\n# 3. 
stage 2: MATH continuation (128 RL steps)\nbash scripts/rl_multipersona_math.sh\n\n# Export the post-MATH URIs (used by olympiad RL + post-MATH eval sweeps):\nexport PANEL_MATH_CHECKPOINT=tinker://\u003cyour-session\u003e:train:0/weights/final\nexport PANEL_MATH_CHECKPOINT_SAMPLER=tinker://\u003cyour-session\u003e:train:0/sampler_weights/final\n\n# 4. evaluate\npython scripts/eval_math500_vibecheck.py tag=panel_l5_full   variant=panel\npython scripts/eval_aime_vibecheck.py    tag=aime_panel_n16  n_samples=16\n\n# 5. analyze\npython scripts/analyze_diversity.py\npython scripts/pass_at_k_aime.py\n```\n\nFor the olympiad RLVR experiment (variance-band filter + per-arm RL +\nhill-climbing eval), see [§6 of `RECIPE.md`](RECIPE.md#6-olympiad-rlvr-hill-climbing-experiment).\nFor a per-script index, see [`scripts/README.md`](scripts/README.md).\n\n## Key numbers at a glance\n\n| | panel (ours) | thinking (Qwen3-30B-A3B native) |\n|---|---:|---:|\n| MATH-500 L5 avg hit rate | 0.58 | 0.75 |\n| MATH-500 L5 pass@4 | 0.90 | 1.00 |\n| AIME 24 + 25 pass@1 | 0.23 | 0.73 |\n| AIME 24 + 25 pass@16 | 0.55 | 0.75 |\n| Mean pairwise cos dist (MATH) | 0.095 | 0.053 |\n| Mean pairwise cos dist (AIME) | 0.119 | 0.068 |\n| **MATH-500 L5 tokens / correct** | **1,652** | **8,376** *(5.07× more)* |\n| **AIME 24 + 25 tokens / correct** | **6,257** | **15,491** *(2.48× more)* |\n| Median token ratio, both-correct (MATH / AIME) | — | **6.91× / 5.89×** |\n\nThe panel starts behind on pass@1 and **closes the gap sharply with more samples** —\nthe pattern you'd expect from wider, less redundant search. Per-sample\naccuracy favors thinking; per-token cost-per-correct reverses the comparison.\n\n## Bounds on the claim\n\nThe panel does **not** outperform the thinking baseline at pass@1. What the\ndata actually supports:\n\n1. Per-sample semantic spread is wider on both benchmarks (paired t = 5.91 / 4.72).\n2. That spread tightens pass@k against thinking on three independent slices.\n3. 
**At fixed correctness, the panel uses 5.9–6.9× fewer tokens than thinking\n   (median, both-correct subset; n = 374 paired draws). At the cost-per-correct\n   level, thinking is 2.5–5× more expensive. This is the deployment-cost\n   reading of the same diversity mechanism.**\n4. On an 877-problem olympiad pool, that same spread yields 1.83× more\n   variance-band problems than thinking — and 100 LoRA RL steps on those\n   problems lift the panel's shared held-out from 14% to 29%.\n5. The compute behind all of this is ~10⁴–10⁵× under Qwen's post-training stack.\n\nSee the blog post for the full caveat set: embedding-based diversity metric,\nsmall AIME slice, no token-budget-matched baseline (the cost-per-correct\nfinding partly addresses this — at fixed compute, panel wins; at fixed sample\ncount, thinking does), no matched thinking-arm RL run.\n\n## Reproducing this work\n\n| stage | wall-time on Tinker | output |\n|---|---|---|\n| Stage 1: GSM8K warmup (80 steps) | ~2 h | `PANEL_GSM8K_CHECKPOINT` |\n| Stage 2: MATH continuation (128 steps) | ~6 h | `PANEL_MATH_CHECKPOINT` (the panel adapter cited in the blog as eval session 44722365) |\n| Diversity + pass@k evals | ~3 h | the +78% / +75.6% / +15.6pp numbers |\n| Olympiad pool build + variance-band filter (G=8, both arms, parallelizable) | ~7 h | `data/olympiad_pool/{panel,thinking}_train.jsonl` |\n| Panel olympiad RL (100 steps) | ~3 h | the 14% → 29% hill-climbing curve |\n| Thinking olympiad RL (100 steps, matched) | ~37 h | not yet run — the open follow-up |\n\nThe first five rows fit comfortably in a single afternoon of Tinker spend.\nRow six is the highest-cost open item; we hit a billing wall before it\nfinished and never restarted. Stage URIs are read from `.env` (see\n`.env.example`). 
The only Tinker session URI hard-coded as a default in\nthe scripts is the publicly-cited `44722365-…` panel-MATH checkpoint\n(`build_case_study_transcripts.py`, `chat_panel.py`); both are\noverridable via `PANEL_MATH_CHECKPOINT_SAMPLER`.\n\n## Try the model\n\nThe post-MATH-RL adapter is published as a standalone PEFT LoRA at\n[`scasella91/qwen3-30b-a3b-multipersona-debate-lora`](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)\n(MIT, 3.4 GB bf16). It loads on top of `Qwen/Qwen3-30B-A3B-Base` via\n`transformers + peft` — no Tinker account required. You'll want ~60 GB of\nGPU memory for the base model; the adapter itself adds negligible runtime\noverhead.\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom peft import PeftModel\n\nbase_id = \"Qwen/Qwen3-30B-A3B-Base\"\nadapter_id = \"scasella91/qwen3-30b-a3b-multipersona-debate-lora\"\n\ntok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    base_id, torch_dtype=torch.bfloat16, device_map=\"auto\"\n)\nmodel = PeftModel.from_pretrained(model, adapter_id)\nmodel.eval()\n\nPROMPT = (\n    \"A conversation between User and Multi-Persona Panel of Experts. \"\n    \"The user asks a question, and the Multi-Persona Panel of Experts solves it. \"\n    \"The Multi-Persona Panel of Experts first deliberates and debates the reasoning \"\n    \"process with each other and then provides the user with the answer. \"\n    \"The deliberation process and answer are enclosed within \"\n    \"\u003cmutipersonaDebate\u003e...\u003c/mutipersonaDebate\u003e and \u003canswer\u003e...\u003c/answer\u003e tags, \"\n    \"respectively, i.e., \u003cmutipersonaDebate\u003e deliberation process here \"\n    \"\u003c/mutipersonaDebate\u003e \u003canswer\u003eanswer here \u003c/answer\u003e. \"\n    \"User: {problem}. 
Assistant: \"\n)\n\ninputs = tok(PROMPT.format(problem=\"If 2x + 3 = 11, what is x?\"),\n             return_tensors=\"pt\").to(model.device)\nout = model.generate(**inputs, max_new_tokens=1024, temperature=1.0, do_sample=True)\nprint(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))\n```\n\nThe model card on Hugging Face has the full caveat list, including the known\ngap on per-sample accuracy and the experimental status of MoE expert LoRA\nserving in vLLM/SGLang.\n\n## License\n\nMIT — see [`LICENSE`](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscasella%2Fmulti-model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscasella%2Fmulti-model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscasella%2Fmulti-model/lists"}