https://github.com/scasella/multi-model
Panel-of-experts scaffold vs <think> on Qwen3-30B-A3B — diversity finding + olympiad RLVR follow-up
https://github.com/scasella/multi-model
aime chain-of-thought gsm8k lora mathematical-reasoning mixture-of-experts olympiad panel-of-experts pass-at-k qwen3 reasoning reinforcement-learning rlvr tinker
Last synced: 6 days ago
JSON representation
Panel-of-experts scaffold vs <think> on Qwen3-30B-A3B — diversity finding + olympiad RLVR follow-up
- Host: GitHub
- URL: https://github.com/scasella/multi-model
- Owner: scasella
- License: mit
- Created: 2026-04-26T22:02:08.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-26T23:48:22.000Z (2 months ago)
- Last Synced: 2026-04-27T00:22:16.240Z (2 months ago)
- Topics: aime, chain-of-thought, gsm8k, lora, mathematical-reasoning, mixture-of-experts, olympiad, panel-of-experts, pass-at-k, qwen3, reasoning, reinforcement-learning, rlvr, tinker
- Language: Python
- Size: 345 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Panel-of-experts scaffold vs `` on Qwen3-30B-A3B
[](LICENSE)
[](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)
[](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)
[](https://tinker-console.thinkingmachines.ai/)
A LoRA rank-32 fine-tune of `Qwen/Qwen3-30B-A3B-Base` that replaces the usual
`…` monologue with a three-persona debate
(`…`) before answering, scored against
Qwen's native thinking model on the same backbone.
**Headline finding #1 — wider search per sample.** Panel reasoning traces
sit +78.2% further apart on MATH-500 and +75.6% further apart on AIME 24 + 25
(mean pairwise cosine distance, all-mpnet-base-v2 embeddings). On AIME the
paired pass@k gap to Qwen3-thinking closes by +15.6pp from k = 1 to k = 16,
monotone in k.
**Headline finding #2 — and the deployment-cost consequence.** On the **374
paired draws** across MATH-500 L5 and AIME 24+25 where both arms produced a
correct answer, thinking burns **5.9–6.9× more tokens (median)** than the
panel — and is longer than the panel in **99.6–99.7%** of all paired draws,
regardless of correctness. At the cost-per-correct-answer level — total
tokens divided by correct samples — thinking is **2.5–5× more expensive**
than the panel. Per-sample accuracy still favors thinking; per-token
efficiency reverses the comparison once compute is in the denominator.
(Wilcoxon signed-rank: *p* = 6×10⁻⁵² on MATH, *p* = 8×10⁻¹³ on AIME. Run
`python scripts/analyze_token_efficiency.py` to reproduce.)
**Follow-up — Apr 25.** The diversity shows up where it matters for RL. On a
fresh 877-problem olympiad-math pool the panel has **1.83× more variance-band
problems** than Qwen3-thinking (382 vs 209) — the only regime where
group-relative RLVR generates non-zero gradient. 100 RL steps on that band
carry the panel from **14% → 29%** on a shared held-out, with per-source
gains scaling with training-pool representation.
The full pipeline is a LoRA rank-32 adapter on the frozen base. The
post-training compute is roughly four to five orders of magnitude under
Qwen's investment in the thinking baseline.
**Adapter on Hugging Face:** [`scasella91/qwen3-30b-a3b-multipersona-debate-lora`](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)
---
## Repo layout
```
.
├── envs/ four RL environments: panel + think × math + gsm8k
├── scripts/ training, eval, and analysis drivers (see scripts/README.md)
├── reports/ eval + analysis outputs (summary.json tracked, rollouts.jsonl ignored)
│ └── blog_post/ the writeup
├── data/ stratified problem slices (SAMPLE.jsonl + schema.json only)
├── RECIPE.md concise reproduction guide
├── LICENSE MIT
└── archive/ pre-pivot history (local only, gitignored)
```
## Quickstart
```bash
# 1. set up
uv venv .venv && source .venv/bin/activate
uv pip install -e . # or install tinker_cookbook deps directly
cp .env.example .env # fill in TINKER_API_KEY + HF_TOKEN
# 2. stage 1: GSM8K warmup (80 RL steps, ~2 h on Tinker)
bash scripts/rl_multipersona_gsm8k.sh
# Export the resulting checkpoint URI (printed by the run) before stage 2:
export PANEL_GSM8K_CHECKPOINT=tinker://:train:0/weights/final
# 3. stage 2: MATH continuation (128 RL steps)
bash scripts/rl_multipersona_math.sh
# Export the post-MATH URIs (used by olympiad RL + post-MATH eval sweeps):
export PANEL_MATH_CHECKPOINT=tinker://:train:0/weights/final
export PANEL_MATH_CHECKPOINT_SAMPLER=tinker://:train:0/sampler_weights/final
# 4. evaluate
python scripts/eval_math500_vibecheck.py tag=panel_l5_full variant=panel
python scripts/eval_aime_vibecheck.py tag=aime_panel_n16 n_samples=16
# 5. analyze
python scripts/analyze_diversity.py
python scripts/pass_at_k_aime.py
```
For the olympiad RLVR experiment (variance-band filter + per-arm RL +
hill-climbing eval), see [§6 of `RECIPE.md`](RECIPE.md#6-olympiad-rlvr-hill-climbing-experiment).
For a per-script index, see [`scripts/README.md`](scripts/README.md).
## Key numbers at a glance
| | panel (ours) | thinking (Qwen3-30B-A3B native) |
|---|---:|---:|
| MATH-500 L5 avg hit rate | 0.58 | 0.75 |
| MATH-500 L5 pass@4 | 0.90 | 1.00 |
| AIME 24 + 25 pass@1 | 0.23 | 0.73 |
| AIME 24 + 25 pass@16 | 0.55 | 0.75 |
| Mean pairwise cos dist (MATH) | 0.095 | 0.053 |
| Mean pairwise cos dist (AIME) | 0.119 | 0.068 |
| **MATH-500 L5 tokens / correct** | **1,652** | **8,376** *(5.07× more)* |
| **AIME 24 + 25 tokens / correct** | **6,257** | **15,491** *(2.48× more)* |
| Median token ratio, both-correct (MATH / AIME) | — | **6.91× / 5.89×** |
The panel starts behind on pass@1 and **closes the gap sharply with more samples** —
the pattern you'd expect from wider, less redundant search. Per-sample
accuracy favors thinking; per-token cost-per-correct reverses the comparison.
## Bounds on the claim
The panel does **not** outperform the thinking baseline at pass@1. What the
data actually supports:
1. Per-sample semantic spread is wider on both benchmarks (paired t = 5.91 / 4.72).
2. That spread tightens pass@k against thinking on three independent slices.
3. **At fixed correctness, the panel uses 5.9–6.9× fewer tokens than thinking
(median, both-correct subset; n = 374 paired draws). At the cost-per-correct
level, thinking is 2.5–5× more expensive. This is the deployment-cost
reading of the same diversity mechanism.**
4. On a 877-problem olympiad pool, that same spread yields 1.83× more
variance-band problems than thinking — and 100 LoRA RL steps on those
problems lift the panel's shared held-out from 14% to 29%.
5. The compute behind all of this is ~10⁴–10⁵× under Qwen's post-training stack.
See the blog post for the full caveat set: embedding-based diversity metric,
small AIME slice, no token-budget-matched baseline (the cost-per-correct
finding partly addresses this — at fixed compute, panel wins; at fixed sample
count, thinking does), no matched thinking-arm RL run.
## Reproducing this work
| stage | wall-time on Tinker | output |
|---|---|---|
| Stage 1: GSM8K warmup (80 steps) | ~2 h | `PANEL_GSM8K_CHECKPOINT` |
| Stage 2: MATH continuation (128 steps) | ~6 h | `PANEL_MATH_CHECKPOINT` (the panel adapter cited in the blog as eval session 44722365) |
| Diversity + pass@k evals | ~3 h | the +78% / +75.6% / +15.6pp numbers |
| Olympiad pool build + variance-band filter (G=8, both arms, parallelizable) | ~7 h | `data/olympiad_pool/{panel,thinking}_train.jsonl` |
| Panel olympiad RL (100 steps) | ~3 h | the 14% → 29% hill-climbing curve |
| Thinking olympiad RL (100 steps, matched) | ~37 h | not yet run — the open follow-up |
The first five rows fit comfortably in a single afternoon of Tinker spend.
Row six is the highest-cost open item; we hit a billing wall before it
finished and never restarted. Stage URIs are read from `.env` (see
`.env.example`). The only Tinker session URI hard-coded as a default in
the scripts is the publicly-cited `44722365-…` panel-MATH checkpoint
(`build_case_study_transcripts.py`, `chat_panel.py`); both are
overridable via `PANEL_MATH_CHECKPOINT_SAMPLER`.
## Try the model
The post-MATH-RL adapter is published as a standalone PEFT LoRA at
[`scasella91/qwen3-30b-a3b-multipersona-debate-lora`](https://huggingface.co/scasella91/qwen3-30b-a3b-multipersona-debate-lora)
(MIT, 3.4 GB bf16). It loads on top of `Qwen/Qwen3-30B-A3B-Base` via
`transformers + peft` — no Tinker account required. You'll want ~60 GB of
GPU memory for the base model; the adapter itself adds negligible runtime
overhead.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_id = "Qwen/Qwen3-30B-A3B-Base"
adapter_id = "scasella91/qwen3-30b-a3b-multipersona-debate-lora"
tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
PROMPT = (
"A conversation between User and Multi-Persona Panel of Experts. "
"The user asks a question, and the Multi-Persona Panel of Experts solves it. "
"The Multi-Persona Panel of Experts first deliberates and debates the reasoning "
"process with each other and then provides the user with the answer. "
"The deliberation process and answer are enclosed within "
"... and ... tags, "
"respectively, i.e., deliberation process here "
" answer here . "
"User: {problem}. Assistant: "
)
inputs = tok(PROMPT.format(problem="If 2x + 3 = 11, what is x?"),
return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024, temperature=1.0, do_sample=True)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
The model card on Hugging Face has the full caveat list, including the known
gap on per-sample accuracy and the experimental status of MoE expert LoRA
serving in vLLM/SGLang.
## License
MIT — see [`LICENSE`](LICENSE).