awesomeopd

Awesome List for On-Policy Distillation
https://github.com/thinkwee/awesomeopd


🤖 Agent & Embodied OPD (by application)
  - LLM4Teach | 2023.11 (updated 2025) | ZJ Lab AMMI | [arXiv 2311.13373](https://arxiv.org/abs/2311.13373) | LLM4Teach: small-RL agent guided by LLM
  - RPD | 2025.03 | TUM / Freiburg | [arXiv 2503.05833](https://arxiv.org/abs/2503.05833) · [project](https://refined-policy-distillation.github.io/) | Refined Policy Distillation, VLA (IROS 2026)
  - easydistill | 2025.09 | Alibaba ModelScope | [SCoRe arXiv 2509.14257](https://arxiv.org/abs/2509.14257) | `/projects/SCoRe`
  - VLA-OPD | 2026.03 | HKUST (Guangzhou), IRPN Lab | [arXiv 2603.26666](https://arxiv.org/abs/2603.26666) · [project](https://irpn-lab.github.io/VLA-OPD/) | **VLA-OPD**: bridging offline SFT & online RL for VLA via OPD (code coming soon)
  - Skill-SD | 2026.04 | Vivo | [arXiv 2604.10674](https://arxiv.org/abs/2604.10674) | Skill-SD: skill-conditioned self-distillation for multi-turn LLM agents
  - TCOD | 2026.04 | Tongyi Lab, Alibaba / CUHK | [arXiv 2604.24005](https://arxiv.org/abs/2604.24005) | TCOD: temporal curriculum OPD for multi-turn agents; F2B & B2F schedules
  - Healthcare AI GYM | 2026.05 | Upstage AI / Korea University | [arXiv 2605.02943](https://arxiv.org/abs/2605.02943) | Healthcare AI GYM: medical agent RL environment + turn-level truncated OPD
  - HyperEyes | 2026.05 | Xiaohongshu / Cambridge | [arXiv 2605.07177](https://arxiv.org/abs/2605.07177) | HyperEyes: parallel multimodal search agent with dual-grained efficiency-aware RL (TRACE + OPD)
  - SDAR | 2026.05 | ZJU-REAL / Meituan | [arXiv 2605.15155](https://arxiv.org/abs/2605.15155) | **SDAR: Self-Distilled Agentic RL**; a multi-turn agent rolls out, and a privileged-context self-teacher gives gated token-level OPSD as an auxiliary objective while RL stays the primary backbone (ALFWorld, WebShop, Search-QA)
🌟 Curator's Picks — where to start
🛠️ Frameworks & Toolkits
  - trl | 2019.11 | Hugging Face | `trl/experimental/{gkd,gold,minillm,sdft,self_distillation,sdpo,nash_md,xpo,online_dpo}/` | TRL: **the most diverse OPD trainer collection**
  - LLaMA-Factory | 2023.05 | hiyouga | n/a | LLaMA-Factory: OPD only via TRL integration; not native
  - ms-swift | 2024 | Alibaba ModelScope | `examples/train/rlhf/gkd/`, multimodal/megatron variants | ms-swift: wraps TRL `GKDTrainer`
  - verl | 2024.10 | ByteDance Seed | `recipe/on_policy_distill/`; [Async OPD doc](https://verl.readthedocs.io/en/latest/advance/async-on-policy-distill.html) | verl
  - rllm | 2025.01 | UC Berkeley Sky | `examples/math_distill/` (incl. `opsd/` self-distill); `rllm/trainer/distill/` | rllm
  - SkyRL | 2025.04 | UC Berkeley NovaSky | `skyrl-train/examples/on_policy_distillation/`; [blog](https://novasky-ai.notion.site/on-policy-distillation) | SkyRL
  - ROLL | 2025.06 | Alibaba | `roll/pipeline/distill/` | ROLL: VLM support and a library of divergences
  - AReaL | 2025.06 | AntGroup / Tsinghua | `examples/distillation/gsm8k_grpo_distill.yaml` | AReaL
  - slime | 2025.06 | Tsinghua THUDM | `examples/on_policy_distillation/` | slime: RL framework behind GLM-4.5/4.6/4.7
  - RL | 2026.01 | NVIDIA | `nemo_rl/algorithms/distillation.py` | NeMo-RL: native OPD with student rollouts
  - KDFlow | 2026.03 | BJTU (Songming Zhang et al.) | `examples/on_policy_kd/` (LLM + Qwen3-VL); [arXiv 2603.01875](https://arxiv.org/abs/2603.01875) | KDFlow: **KD-first framework**; decoupled SGLang teacher + FSDP2 student; native cross-tokenizer & VLM support
🏭 Industrial / Production Model Reports
  - gemma | 2024.07 | Google DeepMind | [arXiv 2408.00118](https://arxiv.org/abs/2408.00118) | **Gemma 2** (explicit OPD)
  - Qwen3 | 2025.05 | Alibaba Qwen | [arXiv 2505.09388](https://arxiv.org/abs/2505.09388) | **Qwen3** (canonical OPD recipe)
  - GLM-4.5 | 2025.08 | Zhipu / Z.ai | [arXiv 2508.06471](https://arxiv.org/abs/2508.06471) | **GLM-4.5 / 4.6**
  - HY-MT | 2025.12 | Tencent Hunyuan | [arXiv 2512.24092](https://arxiv.org/abs/2512.24092) · [HF 1.8B](https://huggingface.co/tencent/HY-MT1.5-1.8B) · [HF 7B](https://huggingface.co/tencent/HY-MT1.5-7B) | strong-to-weak distillation for MT
  - MiMo-V2-Flash | 2026.01 | Xiaomi | [arXiv 2601.02780](https://arxiv.org/abs/2601.02780) | **MiMo-V2-Flash** (MOPD)
  - Typhoon-S | 2026.01 | Typhoon AI & SCB 10X | [arXiv 2601.18129](https://arxiv.org/pdf/2601.18129) | GAD-style OPD: full logits greatly outperform top-K in Thai
  - Baichuan-M3-235B | 2026.02 | Baichuan | [arXiv 2602.06570](https://arxiv.org/abs/2602.06570) · [HF Collection](https://huggingface.co/collections/baichuan-inc/baichuan-m3) | Baichuan-M3 (learn critically from multi-teacher OPD)
  - GLM-5 | 2026.02 | Zhipu / Z.ai | [arXiv 2602.15763](https://arxiv.org/abs/2602.15763) | **GLM-5** (cross-stage OPD)
  - Nemotron Cascade 2 | 2026.03 | NVIDIA | [arXiv 2603.19220](https://arxiv.org/abs/2603.19220) · [HF Collection](https://huggingface.co/collections/nvidia/nemotron-cascade-2) · [project](https://research.nvidia.com/labs/nemotron/nemotron-cascade-2/) | **Nemotron Cascade 2** (multi-domain OPD; "we sample y∼π_inf(·\|x)"); HF-only release
  - Qwen3-Coder | 2026.03 | Alibaba Qwen | Tech report | Qwen3-Coder
  - KAT-Coder-V2 | 2026.03 | Kuaishou KwaiKAT | [arXiv 2603.27703](https://arxiv.org/abs/2603.27703) | step-level OPD for agentic coding
  - HY-Embodied | 2026.04 | Tencent Hunyuan | [arXiv 2604.07430](https://arxiv.org/abs/2604.07430) | **HY-Embodied-0.5** (FKL embodied distillation)
  - DeepSeek-V4 | 2026.04 | DeepSeek-AI | [Tech Report](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) · [V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) · [V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | **DeepSeek-V4** (multi-teacher OPD replaces unified mixed-RL stage)
  - **Qwen3.5-Omni** (cross-modal OPD for audio reasoning)
  - RLSD
🖼️ Multimodal OPD (VLM, Video, Audio, Image)
  - piFlow | 2025.10 | Multi-org | [arXiv 2510.14974](https://arxiv.org/abs/2510.14974) | π-Flow: image / flow OPD (ICLR 2026)
  - VOLD | 2025.10 | INRIA / Goethe Univ. | [arXiv 2510.23497](https://arxiv.org/abs/2510.23497) · [project page](https://walidbousselham.com/VOLD/) | VOLD (LLM→VLM OPD); repo placeholder; ICLR 2026
  - Step-Audio-R1 | 2025.11 | StepFun | [arXiv 2511.15848](https://arxiv.org/abs/2511.15848) | Step-Audio-R1
  - CORD | 2026.01 | Baidu Ernie | [arXiv 2601.16547](https://arxiv.org/abs/2601.16547) | Reasoning: Text ➡️ Audio
  - Video-OPD | 2026.02 | Industrial | [arXiv 2602.02994](https://arxiv.org/abs/2602.02994) | Video-OPD
  - X-OPD | 2026.03 | Tencent Hunyuan / ZJU | [arXiv 2603.24596](https://arxiv.org/abs/2603.24596) | X-OPD (Speech LLM)
  - Uni-OPD | 2026.05 | Multi-org | [arXiv 2605.03677](https://arxiv.org/abs/2605.03677) | **Uni-OPD**: unified OPD across LLMs & MLLMs via dual-perspective recipe
  - Flow-OPD | 2026.05 | Multi-org | [arXiv 2605.08063](https://arxiv.org/abs/2605.08063) | **Flow-OPD**: first to integrate OPD into Flow-Matching text-to-image models; consolidates multiple single-reward GRPO expert teachers into one student via on-policy sampling + reverse-KL (SD-3.5-Medium)
  - Decomposed-OPD | 2026.06 | KAIST / Microsoft Research Asia | [arXiv 2606.00564](https://arxiv.org/abs/2606.00564) | **Decomposed-OPD / VGS**: decomposes the OPD gradient into (near-orthogonal) language-prior vs. visual-grounding components; *Visual Gradient Steering* reorients updates toward the visual subspace for VLM reasoning
🤝 OPD-RL Hybrids — Inside-RL OPD
  - BOND | 2024.07 | Google DeepMind | [arXiv 2407.14622](https://arxiv.org/abs/2407.14622) | BOND (Best-of-N Distillation)
  - Faster WIND | 2024.10 | CMU / Google | [arXiv 2410.20727](https://arxiv.org/abs/2410.20727) | Faster WIND (iterative BoN), AISTATS 2025
  - AlignDistil | 2025.03 | BJTU / Tencent | [arXiv 2503.02832](https://arxiv.org/abs/2503.02832) | AlignDistil: RLHF-equivalent KD (ACL 2025)
  - LUFFY | 2025.04 | Westlake U. | [arXiv 2504.14945](https://arxiv.org/abs/2504.14945) | LUFFY: mixed-policy GRPO
  - KETCHUP | 2025.04 | U. Alberta | [arXiv 2504.19024](https://arxiv.org/abs/2504.19024) | KETCHUP (k-step RL-KD)
  - KDRL | 2025.06 | HIT / Huawei | [arXiv 2506.02208](https://arxiv.org/abs/2506.02208) | KDRL (Joint KD + RL)
  - SDPO | 2026.01 | ETH / MIT | [arXiv 2601.20802](https://arxiv.org/abs/2601.20802) · [project](https://self-distillation.github.io/SDPO) | SDPO: RL via Self-Distillation
  - KEPO | 2026.01 | Industrial | [arXiv 2602.00400](https://arxiv.org/abs/2602.00400) | KEPO
  - Open-AgentRL | 2026.02 | Gen-Verse | n/a | Open-AgentRL: RLAnything / DemyAgent multi-domain
  - Towards-On-Policy-SFT | 2026.02 | MSRA / Shopee | [arXiv 2602.12222](https://arxiv.org/abs/2602.12222) | DDT: on-policy SFT theory
  - 𝒳-KD | 2026.02 | BUPT | [arXiv 2602.12674](https://arxiv.org/abs/2602.12674) | 𝒳-KD (IRL-style)
  - RLAD | 2026.02 | AWS | [arXiv 2602.22495](https://arxiv.org/abs/2602.22495) | RLAD (Reinforcement-aware KD)
  - OpenClaw-RL | 2026.03 | Gen-Verse | [arXiv 2603.10165](https://arxiv.org/abs/2603.10165) | OpenClaw-RL: combines GRPO + OPD
  - ExGRPO | 2026.03 | UNC / ASU | [arXiv 2603.19266](https://arxiv.org/abs/2603.19266) | Probing-to-Refine / EI / EXGRPO
  - HDPO | 2026.03 | NVIDIA | [arXiv 2603.23871](https://arxiv.org/abs/2603.23871) | HDPO (Hybrid Distillation PO)
  - RLSD | 2026.04 | Multi-org | [arXiv 2604.03128](https://arxiv.org/abs/2604.03128) | Self-Distilled RLVR (RLSD)
  - NPO | 2026.04 | IIE CAS / UCAS / JD.COM | [arXiv 2604.20733](https://arxiv.org/abs/2604.20733) | NPO / AutoNPO: mixed-policy GRPO with **near-future self** as teacher
  - ROSD | 2026.05 | PolyU / Baidu | [arXiv 2605.28014](https://arxiv.org/abs/2605.28014) | ROSD: reflective error-localized self-distillation
  - TGPO | 2026.05 | NEU NLP / Meituan | [arXiv 2605.13230](https://arxiv.org/abs/2605.13230) | **TGPO: Teacher-Guided Policy Optimization**; teacher directly guides token-level generation conditioned on student contexts, fused with RLVR trajectory rewards; targets *large* teacher–student divergence
  - OPD+ | 2026.06 | Columbia / Capital One | [arXiv 2606.01039](https://arxiv.org/abs/2606.01039) | **OPD+: Rethinking the Advantage Design**; formulates OPD as RL with an f-divergence reward, proves the common stop-gradient advantage estimator is biased & gives a corrected estimator
🎭 OPD with Black-Box / Outcome-Based Teachers
- ORPO-Distill | 2025.09 | Industrial | [arXiv 2509.25100](https://arxiv.org/abs/2509.25100) | ORPO-Distill
- LMOps `/gad` | 2025.11 | Microsoft Research | [arXiv 2511.10643](https://arxiv.org/abs/2511.10643) · [project](https://ytianzhu.github.io/Generative-Adversarial-Distillation/) | GAD: Black-Box OPD
- OVD | 2026.01 | HKU / Huawei | [arXiv 2601.21968](https://arxiv.org/abs/2601.21968) | OVD (On-policy Verbal Distillation); project page `OVD.github.io` 404s
- SODA | 2026.04 | Academic | [arXiv 2604.03873](https://arxiv.org/pdf/2604.03873) | SODA: Semi On-Policy Black-Box Distillation
- SPoT | 2026.03 | Visual-AI | [arXiv 2603.01683](https://arxiv.org/abs/2603.01683) | **SPOT: Surgical Post-Training**; black-box oracle edits student failures into proximal rollouts
- ROPD | 2026.05 | NUS / USTC / Tencent | [arXiv 2605.07396](https://arxiv.org/abs/2605.07396) | **ROPD: Rubric-based On-Policy Distillation**; induces prompt-specific rubrics from teacher–student contrasts, then scores student rollouts by those rubrics (logit-free / black-box); up to 10× sample efficiency
🔬 OPD with Larger External Teachers — White-Box
- LMOps `/minillm` | 2023.06 | Microsoft / Tsinghua | [arXiv 2306.08543](https://arxiv.org/abs/2306.08543) | MiniLLM (ICLR 2024)
- distillm | 2024.02 | KAIST / Microsoft | [arXiv 2402.03898](https://arxiv.org/abs/2402.03898) | DistiLLM (ICML 2024)
- google-research `/speculative_kd` | 2024.10 | UCSB / Google | [arXiv 2410.11325](https://arxiv.org/abs/2410.11325) | Speculative KD (ICLR 2025)
- distillm-2 | 2025.03 | KAIST / Microsoft | [arXiv 2503.07067](https://arxiv.org/abs/2503.07067) | DistiLLM-2 (ICML 2025 Oral)
- DSKDv2 | 2025.04 | BJTU | [arXiv 2504.11426](https://arxiv.org/abs/2504.11426) | DSKDv2: cross-tokenizer; supports on-policy mode
- Constrained OPD | 2025.09 | Huawei Noah's Ark | [arXiv 2509.22921](https://arxiv.org/abs/2509.22921) | Constrained OPD (CMDP)
- AdaSwitch | 2025.10 | RUC / Baidu | [arXiv 2510.07842](https://arxiv.org/abs/2510.07842) | AdaSwitch (on-/off-policy switching)
- Veto | 2026.01 | SNU | [arXiv 2601.07155](https://arxiv.org/abs/2601.07155) | Veto (Stable OPD), ACL 2026 Findings
- G-OPD | 2026.02 | RUC / Tencent | [arXiv 2602.12125](https://arxiv.org/abs/2602.12125) | G-OPD
- Fast OPD | 2026.02 | Industrial | [arXiv 2602.15260](https://arxiv.org/abs/2602.15260) | Fast OPD (prefix-truncated)
- Entropy-Aware OPD | 2026.03 | KAIST / IBM | [arXiv 2603.07079](https://arxiv.org/abs/2603.07079) | Entropy-Aware OPD
- REOPOLD | 2026.03 | KAIST / Microsoft | [arXiv 2603.11137](https://arxiv.org/abs/2603.11137) | REOPOLD (Relaxed OPD); code coming soon
- OPSD_OnPolicyDistillation | 2026.03 | LinkedIn | [arXiv 2603.11178](https://arxiv.org/abs/2603.11178) | PACED: frontier curriculum self-distill
- TSD-KD | 2026.03 | Korea Univ. | [arXiv 2603.13260](https://arxiv.org/abs/2603.13260) | TSD-KD: token-selective dual KD (ICLR 2026)
- SCOPE | 2026.04 | USTC / Meituan / Fudan | [arXiv 2604.10688](https://arxiv.org/abs/2604.10688) | SCOPE: signal-calibrated dual-path
- Hybrid-Policy-Distillation | 2026.04 | zwhong714 | [arXiv 2604.20244](https://arxiv.org/abs/2604.20244) | HPD: Hybrid Policy Distillation; LlamaFactory + verl backends
- trd | 2026.06 | McGill / Mila / UT Austin (Jiang et al.) | [arXiv 2606.08432](https://arxiv.org/abs/2606.08432) | **TRD: Trajectory-Refined Distillation**; diagnoses *prefix failure* of dense per-token OPD, refines student rollouts at trajectory level before distilling; verl-based, also applies to OPSD
- BRTS | 2026.05 | JHU (Patel group) | [arXiv 2605.09725](https://arxiv.org/abs/2605.09725) | **BRTS: Best-of-N Teacher Rollout Selection**; augments student-context OPD with a curated teacher-context branch (correctness-first, then student-alignment) to cut single-rollout teacher variance
- FiRe-OPD | 2026.06 | THU / HKUST / Meituan (Li et al.) | [arXiv 2606.02684](https://arxiv.org/abs/2606.02684) | **FiRe-OPD: Filter, then Reweight**; decouples *optimization granularity* into **hard** trajectory-level filtering (drop bottom-p% rollouts by teacher log-prob) + **soft** token-level reweighting (teacher-confidence × student-confusion), arguing soft weighting beats hard token selection (cf. TIP); verl-based, with a multi-teacher math+code variant
- OPRD | 2026.06 | ZJU / Ant Group | [arXiv 2606.06021](https://arxiv.org/abs/2606.06021) | **OPRD: On-Policy Representation Distillation**; first OPD to supervise in *hidden-state space* (aligns teacher/student representations across layers on student rollouts, bypassing the LM head) rather than logits; built on the THUNLP OPD stack
🧠 Reasoning OPD (by application)
  - OPD-AVMP | 2026.04 | Academic | [arXiv 2604.07944](https://arxiv.org/abs/2604.07944) | OPD for Autonomous Vehicle Motion Planning
♻️ Self-Distillation with Privileged Context — OPSD
- 🔁 Iterative Self-Bootstrapping
  - SPIN | 2024.01 | UCLA | [arXiv 2401.01335](https://arxiv.org/abs/2401.01335) | SPIN: Self-Play Fine-Tuning (ICML 2024)
  - rStar | 2025.01 | Microsoft Research | [rStar-Math 2501.04519](https://arxiv.org/abs/2501.04519) · [rStar2-Agent 2508.20722](https://arxiv.org/abs/2508.20722) | rStar / rStar-Math / rStar2-Agent
  - OPSD | 2026.01 | UCLA / Meta FAIR | [arXiv 2601.18734](https://arxiv.org/abs/2601.18734) · [blog](https://siyan-zhao.github.io/blog/2026/opsd/) | OPSD: Self-Distilled Reasoner
  - Self-Distillation | 2026.01 | MIT / ETH | [arXiv 2601.19897](https://arxiv.org/abs/2601.19897) | SDFT-Continual
  - mtp-lm | 2026.02 | UMD / LLNL | [arXiv 2602.06019](https://arxiv.org/abs/2602.06019) | MTP Self-Distill
  - GATES | 2026.02 | UMD | [arXiv 2602.20574](https://arxiv.org/abs/2602.20574) | GATES (Self-Distillation under Privileged Context)
  - CRISP_Reasoning_Compression | 2026.03 | LinkedIn | [arXiv 2603.05433](https://arxiv.org/abs/2603.05433) | OPSDC / CRISP
  - self-distillation-analysis | 2026.03 | MSR / KAIST / SNU | [arXiv 2603.24472](https://arxiv.org/abs/2603.24472) | **Why Does Self-Distillation (Sometimes) Degrade Reasoning?** Diagnostic study of OPSD failure modes
  - ml-ssd | 2026.04 | Apple MLR | [arXiv 2604.01193](https://arxiv.org/abs/2604.01193) | Apple: Embarrassingly Simple Self-Distillation
  - Skill-SD | 2026.04 | UCAS / CUHK / USTC / vivo AI Lab | [arXiv 2604.10674](https://arxiv.org/abs/2604.10674) | **Skill-SD**: skill-conditioned OPSD for multi-turn LLM agents
  - SD-Zero | 2026.04 | Princeton / Toronto / CMU | [arXiv 2604.12002](https://arxiv.org/abs/2604.12002) | **SD-Zero**: Self-Revision turns binary rewards into dense supervision
  - π-Play | 2026.04 | CASIA / UCAS / Meituan | [arXiv 2604.14054](https://arxiv.org/abs/2604.14054) | **π-Play**: multi-agent self-play turns the question-construction path into privileged context for OPSD on search agents
  - OPSDL | 2026.04 | Baidu | [arXiv 2604.17535](https://arxiv.org/abs/2604.17535) | OPSDL (Long-Context Self-Distillation)
  - MSD | 2026.05 | Tongji / Shanghai AI Lab | [arXiv 2605.02971](https://arxiv.org/abs/2605.02971) | **MSD**: multilingual safety OPSD; teacher conditioned on English query translation + CoT instruction; DPSW weights safety-critical tokens
  - COPSD | 2026.05 | LMU Munich / MCML | [arXiv 2605.09548](https://arxiv.org/abs/2605.09548) | **COPSD**: crosslingual OPSD; teacher sees English problem translation + reference solution, student rolls out in low-resource language (17 African languages)
  - EMPO² | 2026.02 | Microsoft Research | [arXiv 2602.23008](https://arxiv.org/abs/2602.23008) · [code](https://github.com/microsoft/agent-lightning/tree/main/contrib/recipes/envs) · [blog](https://agent-lightning.github.io/posts/empo2/) | **EMPO²**: memory-tip-conditioned online self-distillation for exploratory LLM agents (ICLR 2026; cross-listed into Agent)
  - SGSD | 2026.05 | THU | [arXiv 2605.28791](https://arxiv.org/pdf/2605.28791) | **SGSD**: Skill-Conditional Gated SD
  - CODE | 2026.05 | USTC | [arXiv 2605.28303](https://arxiv.org/pdf/2605.28303v1) | **CODE**: OPSD on Knowledge Editing + Casual Editing
  - SSOPD | 2026.05 | THU / Beihang | [arXiv 2605.17497](https://arxiv.org/abs/2605.17497) | **SSOPD: Self-Supervised OPSD**; privileged context is the model's *own shortest correct completion* within a GRPO group (no external traces), distilled into prefixes of the longest wrong completion
  - RLCSD | 2026.06 | THU (BPM) / Alibaba Tongyi | [arXiv 2606.11709](https://arxiv.org/abs/2606.11709) | **RLCSD: Contrastive OPSD**; cancels *privilege-induced style drift* by contrasting the teacher–student gap under a correct hint vs. a wrong hint; verl-based
  - d-OPSD | 2026.06 | THU / TUM / NTU / UT Austin | [arXiv 2606.18195](https://arxiv.org/abs/2606.18195) | **d-OPSD: first OPSD for diffusion LLMs**; self-generated answers as *suffix* conditioning ("self future-experience"); step-level (not token-level) divergence aligned to the denoising process
⚡ Speculative-Decoding Distillation
  - OSD | 2023.10 | UCB / NVIDIA | [arXiv 2310.07177](https://arxiv.org/abs/2310.07177) | Online Speculative Decoding
  - DistillSpec | 2023.10 | Google DeepMind | [arXiv 2310.08461](https://arxiv.org/abs/2310.08461) | DistillSpec (ICLR 2024)
  - HASS | 2024.08 | Academic | [arXiv 2408.15766](https://arxiv.org/abs/2408.15766) | HASS
  - Falcon | 2024.12 | Bestpay | [arXiv 2412.12639](https://arxiv.org/abs/2412.12639) | Falcon
  - CORAL | 2025.02 | Academic | [arXiv 2502.16880](https://arxiv.org/abs/2502.16880) | CORAL (Cross-Step Representation Alignment), ACL 2025
  - EAGLE | 2025.03 | PKU / Microsoft | [EAGLE-3](https://arxiv.org/abs/2503.01840) | EAGLE-3: on-policy multi-step TTT
  - MASSV | 2025.05 | Cerebras | [arXiv 2505.10526](https://arxiv.org/abs/2505.10526) | MASSV (multimodal SD draft)
  - DVI | 2025.10 | Academic | [arXiv 2510.05421](https://arxiv.org/abs/2510.05421) | DVI (Draft-Verify-Improve, online RL)
  - SpecKD | 2025.10 | XJTU (Haiduo Huang et al.) | [arXiv 2510.24021](https://arxiv.org/abs/2510.24021) | SpecKD / SelecTKD (verification-gated KD; v1=SpecKD, v2 retitled SelecTKD)
  - ReSpec | 2025.10 | Academic | [arXiv 2510.26475](https://arxiv.org/abs/2510.26475) | ReSpec (RL drafter evolution)
  - SpecForge | 2026.03 | SGLang | [LMSYS blog](https://www.lmsys.org/blog/2025-07-25-spec-forge/) | SpecForge: open EAGLE-3 training framework
  - Draft-OPD | 2026.05 | Shanghai AI Lab / Fudan / SJTU | [arXiv 2605.29343](https://arxiv.org/abs/2605.29343) | **Draft-OPD**: on-policy distillation for speculative draft models; target supervises the drafter on *draft-induced states*, replaying from verification-exposed error positions; ~5× lossless speedup, beats EAGLE-3/DFlash
📚 Surveys, Foundations & Position Papers
- GKD | 2023.06 | Google DeepMind (Agarwal et al.) | [arXiv 2306.13649](https://arxiv.org/abs/2306.13649), implemented in [TRL `GKDTrainer`](https://github.com/huggingface/trl/blob/main/trl/experimental/gkd/gkd_trainer.py) | **GKD (On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes)** (Seminal · ICLR 2024) |
- Blog | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | [Blog](https://thinkingmachines.ai/blog/on-policy-distillation/) · [tinker-cookbook](https://github.com/thinking-machines-lab/tinker-cookbook) | **Thinking Machines Lab: On-Policy Distillation (blog)** |
- tinker-cookbook | 2025.10 | Thinking Machines Lab | N/A | Reference impl. of the OPD recipe on the Tinker SDK |
- revisiting_opd | 2026.03 | CASIA (Fu et al.) | [arXiv 2603.25562](https://arxiv.org/abs/2603.25562) | Revisiting OPD: Failure Modes & Simple Fixes |
- Tencent OPD Survey | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | [arXiv 2604.00626](https://arxiv.org/abs/2604.00626) | **A Survey of On-Policy Distillation for LLMs** |
- OPD | 2026.04 | Tsinghua THUNLP | [arXiv 2604.13016](https://arxiv.org/abs/2604.13016) | **Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe** |
- Lightning OPD | 2026.04 | Wu, Han, Cai | [arXiv 2604.13010](https://arxiv.org/abs/2604.13010) | **Lightning OPD: Efficient Post-Training with Offline OPD** |
- OPSD Survey | 2026.05 | Academic | [arXiv 2605.18141](https://arxiv.org/abs/2605.18141) | **A Brief Overview: On-Policy Self-Distillation in LLMs** |
- Blog | 2026.06 | Li Jiang | [Blog](https://louieworth.github.io/blog/opd_reflection/) · [arXiv 2606.08432](https://arxiv.org/abs/2606.08432) | **On-Policy Distillation: Promise, Pitfalls, and Prospects**, a reflection on OPD's promise, three failure mechanisms (local teacher noise, coverage decay, myopic gradients) & prospects; companion to TRD |
- Many Faces of OPD | 2026.05 | UIUC (Ge Liu's ULab) | [arXiv 2605.11182](https://arxiv.org/abs/2605.11182) | **The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes**, which diagnoses when OPD/OPSD succeed or fail (distribution mismatch, optimization instability, PI-free policy limits) & proposes fixes; companion to Revisiting OPD & THUNLP Rethinking |
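
For readers new to the topic, the idea shared by the foundational entries above (the student is trained on its own samples, with the teacher scoring each visited prefix) can be sketched in a few lines. This is a toy illustration only, not the code of GKD, TRL, or tinker-cookbook: the models `toy_student` and `toy_teacher`, the 3-token vocabulary, and the choice of per-token reverse KL as the divergence are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def on_policy_reverse_kl(student_logits_fn, teacher_logits_fn, prompt, num_tokens, rng):
    """Roll out the student, then average the per-token reverse KL.

    The prefixes are sampled from the student itself (the "on-policy" part),
    and the teacher is only queried on those student-visited prefixes.
    """
    tokens = list(prompt)
    per_token_kl = []
    for _ in range(num_tokens):
        p_student = softmax(student_logits_fn(tokens))
        p_teacher = softmax(teacher_logits_fn(tokens))
        per_token_kl.append(reverse_kl(p_student, p_teacher))
        # Sample the next token from the student, not the teacher.
        next_token = rng.choices(range(len(p_student)), weights=p_student, k=1)[0]
        tokens.append(next_token)
    return tokens, sum(per_token_kl) / len(per_token_kl)


# Hypothetical toy models over a 3-token vocabulary.
def toy_teacher(tokens):
    """Teacher prefers the token that follows the last one cyclically."""
    last = tokens[-1]
    return [2.0 if i == (last + 1) % 3 else 0.0 for i in range(3)]


def toy_student(tokens):
    """Untrained student: uniform over the vocabulary."""
    return [0.0, 0.0, 0.0]


rng = random.Random(0)
tokens, loss = on_policy_reverse_kl(toy_student, toy_teacher, [0], 5, rng)
print(tokens, round(loss, 4))
```

In a real trainer, the averaged divergence would be differentiated with respect to the student's parameters; the papers listed above differ mainly in which divergence they use, how much of the data is on-policy, and where the teacher signal comes from.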

Programming Languages

Python: 48 · Jupyter Notebook: 3 · Shell: 1
