awesomeopd
Awesome List for On-Policy Distillation
https://github.com/thinkwee/awesomeopd
Agent & Embodied OPD (by application)
- LLM4Teach | 2023.11 (updated 2025) | ZJ Lab AMMI | [arXiv 2311.13373](https://arxiv.org/abs/2311.13373) | LLM4Teach: small RL agent guided by LLM
- RPD | 2025.03 | TUM / Freiburg | [arXiv 2503.05833](https://arxiv.org/abs/2503.05833) · [project](https://refined-policy-distillation.github.io/) | Refined Policy Distillation, VLA (IROS 2026)
- easydistill | 2025.09 | Alibaba ModelScope | [SCoRe arXiv 2509.14257](https://arxiv.org/abs/2509.14257) | `/projects/SCoRe`
- VLA-OPD | 2026.03 | HKUST (Guangzhou), IRPN Lab | [arXiv 2603.26666](https://arxiv.org/abs/2603.26666) · [project](https://irpn-lab.github.io/VLA-OPD/) | **VLA-OPD**: bridging offline SFT & online RL for VLA via OPD (code coming soon)
- Skill-SD | 2026.04 | Vivo | [arXiv 2604.10674](https://arxiv.org/abs/2604.10674) | Skill-SD: skill-conditioned self-distillation for multi-turn LLM agents
- TCOD | 2026.04 | Tongyi Lab, Alibaba / CUHK | [arXiv 2604.24005](https://arxiv.org/abs/2604.24005) | TCOD: temporal curriculum OPD for multi-turn agents; F2B & B2F schedules
- Healthcare AI GYM | 2026.05 | Upstage AI / Korea University | [arXiv 2605.02943](https://arxiv.org/abs/2605.02943) | Healthcare AI GYM: medical agent RL environment + turn-level truncated OPD
- HyperEyes | 2026.05 | Xiaohongshu / Cambridge | [arXiv 2605.07177](https://arxiv.org/abs/2605.07177) | HyperEyes: parallel multimodal search agent with dual-grained efficiency-aware RL (TRACE + OPD)
Curator's Picks: where to start
Frameworks & Toolkits
- trl | 2019.11 | Hugging Face | `trl/experimental/{gkd,gold,minillm,sdft,self_distillation,sdpo,nash_md,xpo,online_dpo}/` | TRL: **the most diverse OPD trainer collection**
- LLaMA-Factory | 2023.05 | hiyouga | - | LLaMA-Factory: OPD only via TRL integration; not native
- ms-swift | 2024 | Alibaba ModelScope | `examples/train/rlhf/gkd/`, multimodal/megatron variants | ms-swift: wraps TRL `GKDTrainer`
- verl | 2024.10 | ByteDance Seed | `recipe/on_policy_distill/`; [Async OPD doc](https://verl.readthedocs.io/en/latest/advance/async-on-policy-distill.html) | verl
- rllm | 2025.01 | UC Berkeley Sky | `examples/math_distill/` (incl. `opsd/` self-distill); `rllm/trainer/distill/` | rllm
- SkyRL | 2025.04 | UC Berkeley NovaSky | `skyrl-train/examples/on_policy_distillation/`; [blog](https://novasky-ai.notion.site/on-policy-distillation) | SkyRL
- ROLL | 2025.06 | Alibaba | `roll/pipeline/distill/` | ROLL: with VLM support and various-divergence library
- AReaL | 2025.06 | AntGroup / Tsinghua | `examples/distillation/gsm8k_grpo_distill.yaml` | AReaL
- slime | 2025.06 | Tsinghua THUDM | `examples/on_policy_distillation/` | slime: RL framework behind GLM-4.5/4.6/4.7
- RL | 2026.01 | NVIDIA | `nemo_rl/algorithms/distillation.py` | NeMo-RL: native OPD with student rollouts
- KDFlow | 2026.03 | BJTU (Songming Zhang et al.) | `examples/on_policy_kd/` (LLM + Qwen3-VL); [arXiv 2603.01875](https://arxiv.org/abs/2603.01875) | KDFlow: **KD-first framework**; SGLang teacher + FSDP2 student decoupled; cross-tokenizer & VLM native
Industrial / Production Model Reports
- gemma | 2024.07 | Google DeepMind | [arXiv 2408.00118](https://arxiv.org/abs/2408.00118) | **Gemma 2** (explicit OPD)
- Qwen3 | 2025.05 | Alibaba Qwen | [arXiv 2505.09388](https://arxiv.org/abs/2505.09388) | **Qwen3** (canonical OPD recipe)
- GLM-4.5 | 2025.08 | Zhipu / Z.ai | [arXiv 2508.06471](https://arxiv.org/abs/2508.06471) | **GLM-4.5 / 4.6**
- HY-MT | 2025.12 | Tencent Hunyuan | [arXiv 2512.24092](https://arxiv.org/abs/2512.24092) · [HF 1.8B](https://huggingface.co/tencent/HY-MT1.5-1.8B) · [HF 7B](https://huggingface.co/tencent/HY-MT1.5-7B) | strong-to-weak distillation for MT
- MiMo-V2-Flash | 2026.01 | Xiaomi | [arXiv 2601.02780](https://arxiv.org/abs/2601.02780) | **MiMo-V2-Flash** (MOPD)
- Typhoon-S | 2026.01 | Typhoon AI & SCB 10X | [arXiv 2601.18129](https://arxiv.org/pdf/2601.18129) | GAD-style OPD: full logits greatly outperform Top-K in Thai
- Baichuan-M3-235B | 2026.02 | Baichuan | [arXiv 2602.06570](https://arxiv.org/abs/2602.06570) · [HF Collection](https://huggingface.co/collections/baichuan-inc/baichuan-m3) | Baichuan-M3 (learn critically from multi-teacher OPD)
- GLM-5 | 2026.02 | Zhipu / Z.ai | [arXiv 2602.15763](https://arxiv.org/abs/2602.15763) | **GLM-5** (cross-stage OPD)
- Nemotron Cascade 2 | 2026.03 | NVIDIA | [arXiv 2603.19220](https://arxiv.org/abs/2603.19220) · [HF Collection](https://huggingface.co/collections/nvidia/nemotron-cascade-2) · [project](https://research.nvidia.com/labs/nemotron/nemotron-cascade-2/) | **Nemotron Cascade 2** (multi-domain OPD; "we sample y∼π_inf(·\|x)"); HF-only release
- Qwen3-Coder | 2026.03 | Alibaba Qwen | Tech report | Qwen3-Coder
- KAT-Coder-V2 | 2026.03 | Kuaishou KwaiKAT | [arXiv 2603.27703](https://arxiv.org/abs/2603.27703) | step-level OPD for agentic coding
- HY-Embodied | 2026.04 | Tencent Hunyuan | [arXiv 2604.07430](https://arxiv.org/abs/2604.07430) | **HY-Embodied-0.5** (FKL embodied distillation)
- DeepSeek-V4 | 2026.04 | DeepSeek-AI | [Tech Report](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf) · [V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) · [V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | **DeepSeek-V4** (multi-teacher OPD replaces unified mixed-RL stage)
Multimodal OPD (VLM, Video, Audio, Image)
- π-Flow | image / flow OPD (ICLR 2026)
- VOLD | 2025.10 | INRIA / Goethe Univ. | [arXiv 2510.23497](https://arxiv.org/abs/2510.23497) · [project page](https://walidbousselham.com/VOLD/) | VOLD (LLM→VLM OPD); repo placeholder; ICLR 2026
- Step-Audio-R1 | 2025.11 | StepFun | [arXiv 2511.15848](https://arxiv.org/abs/2511.15848) | Step-Audio-R1
- CORD | 2026.01 | Baidu Ernie | [arXiv 2601.16547](https://arxiv.org/abs/2601.16547) | Reasoning: Text ➡️ Audio
- Video-OPD | 2026.02 | Industrial | [arXiv 2602.02994](https://arxiv.org/abs/2602.02994) | Video-OPD
- X-OPD | 2026.03 | Tencent Hunyuan / ZJU | [arXiv 2603.24596](https://arxiv.org/abs/2603.24596) | X-OPD (Speech LLM)
- Uni-OPD | 2026.05 | Multi-org | [arXiv 2605.03677](https://arxiv.org/abs/2605.03677) | **Uni-OPD**: unified OPD across LLMs & MLLMs via dual-perspective recipe
OPD-RL Hybrids: Inside-RL OPD
- BOND | 2024.07 | Google DeepMind | [arXiv 2407.14622](https://arxiv.org/abs/2407.14622) | BOND (Best-of-N Distillation)
- Faster WIND | 2024.10 | CMU / Google | [arXiv 2410.20727](https://arxiv.org/abs/2410.20727) | Faster WIND (iterative BoN), AISTATS 2025
- AlignDistil | 2025.03 | BJTU / Tencent | [arXiv 2503.02832](https://arxiv.org/abs/2503.02832) | AlignDistil: RLHF-equivalent KD (ACL 2025)
- LUFFY | 2025.04 | Westlake U. | [arXiv 2504.14945](https://arxiv.org/abs/2504.14945) | LUFFY: mixed-policy GRPO
- KETCHUP | 2025.04 | U. Alberta | [arXiv 2504.19024](https://arxiv.org/abs/2504.19024) | KETCHUP (k-step RL-KD)
- KDRL | 2025.06 | HIT / Huawei | [arXiv 2506.02208](https://arxiv.org/abs/2506.02208) | KDRL (Joint KD + RL)
- SDPO | 2026.01 | ETH / MIT | [arXiv 2601.20802](https://arxiv.org/abs/2601.20802) · [project](https://self-distillation.github.io/SDPO) | SDPO: RL via Self-Distillation
- KEPO | 2026.01 | Industrial | [arXiv 2602.00400](https://arxiv.org/abs/2602.00400) | KEPO
- Open-AgentRL | 2026.02 | Gen-Verse | - | Open-AgentRL: RLAnything / DemyAgent multi-domain
- Towards-On-Policy-SFT | 2026.02 | MSRA / Shopee | [arXiv 2602.12222](https://arxiv.org/abs/2602.12222) | DDT: on-policy SFT theory
- ๐ณ-KD | 2026.02 | BUPT | [arXiv 2602.12674](https://arxiv.org/abs/2602.12674) | ๐ณ-KD (IRL-style)
- RLAD | 2026.02 | AWS | [arXiv 2602.22495](https://arxiv.org/abs/2602.22495) | RLAD (Reinforcement-aware KD)
- OpenClaw-RL | 2026.03 | Gen-Verse | [arXiv 2603.10165](https://arxiv.org/abs/2603.10165) | OpenClaw-RL: combines GRPO + OPD
- ExGRPO | 2026.03 | UNC / ASU | [arXiv 2603.19266](https://arxiv.org/abs/2603.19266) | Probing-to-Refine / EI / EXGRPO
- HDPO | 2026.03 | NVIDIA | [arXiv 2603.23871](https://arxiv.org/abs/2603.23871) | HDPO (Hybrid Distillation PO)
- RLSD | 2026.04 | Multi-org | [arXiv 2604.03128](https://arxiv.org/abs/2604.03128) | Self-Distilled RLVR (RLSD)
- NPO | 2026.04 | IIE CAS / UCAS / JD.COM | [arXiv 2604.20733](https://arxiv.org/abs/2604.20733) | NPO / AutoNPO: mixed-policy GRPO with **near-future self** as teacher
OPD with Black-Box / Outcome-Based Teachers
- ORPO-Distill | 2025.09 | Industrial | [arXiv 2509.25100](https://arxiv.org/abs/2509.25100) | ORPO-Distill
- LMOps `/gad` | 2025.11 | Microsoft Research | [arXiv 2511.10643](https://arxiv.org/abs/2511.10643) · [project](https://ytianzhu.github.io/Generative-Adversarial-Distillation/) | GAD: Black-Box OPD
- OVD | 2026.01 | HKU / Huawei | [arXiv 2601.21968](https://arxiv.org/abs/2601.21968) | OVD (On-policy Verbal Distillation); project page `OVD.github.io` returns 404
- SODA | 2026.04 | Academic | [arXiv 2604.03873](https://arxiv.org/pdf/2604.03873) | SODA: Semi On-Policy Black-Box Distillation
OPD with Larger External Teachers: White-Box
- LMOps `/minillm` | 2023.06 | Microsoft / Tsinghua | [arXiv 2306.08543](https://arxiv.org/abs/2306.08543) | MiniLLM (ICLR 2024)
- distillm | 2024.02 | KAIST / Microsoft | [arXiv 2402.03898](https://arxiv.org/abs/2402.03898) | DistiLLM (ICML 2024)
- google-research `/speculative_kd` | 2024.10 | UCSB / Google | [arXiv 2410.11325](https://arxiv.org/abs/2410.11325) | Speculative KD (ICLR 2025)
- distillm-2 | 2025.03 | KAIST / Microsoft | [arXiv 2503.07067](https://arxiv.org/abs/2503.07067) | DistiLLM-2 (ICML 2025 Oral)
- DSKDv2 | 2025.04 | BJTU | [arXiv 2504.11426](https://arxiv.org/abs/2504.11426) | DSKDv2: cross-tokenizer; supports on-policy mode
- Constrained OPD | 2025.09 | Huawei Noah's Ark | [arXiv 2509.22921](https://arxiv.org/abs/2509.22921) | Constrained OPD (CMDP)
- AdaSwitch | 2025.10 | RUC / Baidu | [arXiv 2510.07842](https://arxiv.org/abs/2510.07842) | AdaSwitch (on-/off-policy switching)
- Veto | 2026.01 | SNU | [arXiv 2601.07155](https://arxiv.org/abs/2601.07155) | Veto (Stable OPD), ACL 2026 Findings
- G-OPD | 2026.02 | RUC / Tencent | [arXiv 2602.12125](https://arxiv.org/abs/2602.12125) | G-OPD
- Fast OPD | 2026.02 | Industrial | [arXiv 2602.15260](https://arxiv.org/abs/2602.15260) | Fast OPD (prefix-truncated)
- Entropy-Aware OPD | 2026.03 | KAIST / IBM | [arXiv 2603.07079](https://arxiv.org/abs/2603.07079) | Entropy-Aware OPD
- REOPOLD | 2026.03 | KAIST / Microsoft | [arXiv 2603.11137](https://arxiv.org/abs/2603.11137) | REOPOLD (Relaxed OPD); code coming soon
- OPSD_OnPolicyDistillation | 2026.03 | LinkedIn | [arXiv 2603.11178](https://arxiv.org/abs/2603.11178) | PACED: frontier curriculum self-distill
- TSD-KD | 2026.03 | Korea Univ. | [arXiv 2603.13260](https://arxiv.org/abs/2603.13260) | TSD-KD: token-selective dual KD (ICLR 2026)
- SCOPE | 2026.04 | USTC / Meituan / Fudan | [arXiv 2604.10688](https://arxiv.org/abs/2604.10688) | SCOPE: signal-calibrated dual-path
- Hybrid-Policy-Distillation | 2026.04 | zwhong714 | [arXiv 2604.20244](https://arxiv.org/abs/2604.20244) | HPD: Hybrid Policy Distillation; LlamaFactory + verl backends
Reasoning OPD (by application)
- OPD-AVMP | 2026.04 | Academic | [arXiv 2604.07944](https://arxiv.org/abs/2604.07944) | OPD for Autonomous Vehicle Motion Planning
Self-Distillation with Privileged Context: OPSD
Iterative Self-Bootstrapping
- SPIN | 2024.01 | UCLA | [arXiv 2401.01335](https://arxiv.org/abs/2401.01335) | SPIN: Self-Play Fine-Tuning (ICML 2024)
- rStar | 2025.01 | Microsoft Research | [rStar-Math 2501.04519](https://arxiv.org/abs/2501.04519) · [rStar2-Agent 2508.20722](https://arxiv.org/abs/2508.20722) | rStar / rStar-Math / rStar2-Agent
-
- OPSD | 2026.01 | UCLA / Meta FAIR | [arXiv 2601.18734](https://arxiv.org/abs/2601.18734) · [blog](https://siyan-zhao.github.io/blog/2026/opsd/) | OPSD: Self-Distilled Reasoner
- Self-Distillation | 2026.01 | MIT / ETH | [arXiv 2601.19897](https://arxiv.org/abs/2601.19897) | SDFT-Continual
- mtp-lm | 2026.02 | UMD / LLNL | [arXiv 2602.06019](https://arxiv.org/abs/2602.06019) | MTP Self-Distill
- GATES | 2026.02 | UMD | [arXiv 2602.20574](https://arxiv.org/abs/2602.20574) | GATES (Self-Distillation under Privileged Context)
- CRISP_Reasoning_Compression | 2026.03 | LinkedIn | [arXiv 2603.05433](https://arxiv.org/abs/2603.05433) | OPSDC / CRISP
- self-distillation-analysis | 2026.03 | MSR / KAIST / SNU | [arXiv 2603.24472](https://arxiv.org/abs/2603.24472) | **Why Does Self-Distillation (Sometimes) Degrade Reasoning?**, a diagnostic study of OPSD failure modes
- ml-ssd | 2026.04 | Apple MLR | [arXiv 2604.01193](https://arxiv.org/abs/2604.01193) | Apple: Embarrassingly Simple Self-Distillation
- Skill-SD | 2026.04 | UCAS / CUHK / USTC / vivo AI Lab | [arXiv 2604.10674](https://arxiv.org/abs/2604.10674) | **Skill-SD**: skill-conditioned OPSD for multi-turn LLM agents
- SD-Zero | 2026.04 | Princeton / Toronto / CMU | [arXiv 2604.12002](https://arxiv.org/abs/2604.12002) | **SD-Zero**: Self-Revision turns binary rewards into dense supervision
- π-Play | 2026.04 | CASIA / UCAS / Meituan | [arXiv 2604.14054](https://arxiv.org/abs/2604.14054) | **π-Play**: multi-agent self-play turns the question-construction path into privileged context for OPSD on search agents
- OPSDL | 2026.04 | Baidu | [arXiv 2604.17535](https://arxiv.org/abs/2604.17535) | OPSDL (Long-Context Self-Distillation)
- MSD | 2026.05 | Tongji / Shanghai AI Lab | [arXiv 2605.02971](https://arxiv.org/abs/2605.02971) | **MSD**: multilingual safety OPSD; teacher conditioned on English query translation + CoT instruction; DPSW weights safety-critical tokens
- COPSD | 2026.05 | LMU Munich / MCML | [arXiv 2605.09548](https://arxiv.org/abs/2605.09548) | **COPSD**: crosslingual OPSD; teacher sees English problem translation + reference solution, student rolls out in low-resource language (17 African languages)
Speculative-Decoding Distillation
- OSD | 2023.10 | UCB / NVIDIA | [arXiv 2310.07177](https://arxiv.org/abs/2310.07177) | Online Speculative Decoding
- DistillSpec | 2023.10 | Google DeepMind | [arXiv 2310.08461](https://arxiv.org/abs/2310.08461) | DistillSpec (ICLR 2024)
- HASS | 2024.08 | Academic | [arXiv 2408.15766](https://arxiv.org/abs/2408.15766) | HASS
- Falcon | 2024.12 | Bestpay | [arXiv 2412.12639](https://arxiv.org/abs/2412.12639) | Falcon
- CORAL | 2025.02 | Academic | [arXiv 2502.16880](https://arxiv.org/abs/2502.16880) | CORAL (Cross-Step Representation Alignment), ACL 2025
- EAGLE | 2025.03 | PKU / Microsoft | [EAGLE-3](https://arxiv.org/abs/2503.01840) | EAGLE-3: on-policy multi-step TTT
- MASSV | 2025.05 | Cerebras | [arXiv 2505.10526](https://arxiv.org/abs/2505.10526) | MASSV (multimodal SD draft)
- DVI | 2025.10 | Academic | [arXiv 2510.05421](https://arxiv.org/abs/2510.05421) | DVI (Draft-Verify-Improve, online RL)
- SpecKD | 2025.10 | XJTU (Haiduo Huang et al.) | [arXiv 2510.24021](https://arxiv.org/abs/2510.24021) | SpecKD / SelecTKD (verification-gated KD; v1=SpecKD, v2 retitled SelecTKD)
- ReSpec | 2025.10 | Academic | [arXiv 2510.26475](https://arxiv.org/abs/2510.26475) | ReSpec (RL drafter evolution)
- SpecForge | 2026.03 | SGLang | [LMSYS blog](https://www.lmsys.org/blog/2025-07-25-spec-forge/) | SpecForge: open EAGLE-3 training framework
Star History
- GKD | 2023.06 | Google DeepMind (Agarwal et al.) | [arXiv 2306.13649](https://arxiv.org/abs/2306.13649), implemented in [TRL `GKDTrainer`](https://github.com/huggingface/trl/blob/main/trl/experimental/gkd/gkd_trainer.py) | **GKD: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes** (Seminal · ICLR 2024)
- Blog | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | [Blog](https://thinkingmachines.ai/blog/on-policy-distillation/) · [tinker-cookbook](https://github.com/thinking-machines-lab/tinker-cookbook) | **Thinking Machines Lab: On-Policy Distillation (blog)**
- tinker-cookbook | 2025.10 | Thinking Machines Lab | - | Reference impl. of the OPD recipe on the Tinker SDK
- revisiting_opd | 2026.03 | CASIA (Fu et al.) | [arXiv 2603.25562](https://arxiv.org/abs/2603.25562) | Revisiting OPD: Failure Modes & Simple Fixes
- Tencent OPD Survey | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | [arXiv 2604.00626](https://arxiv.org/abs/2604.00626) | **A Survey of On-Policy Distillation for LLMs**
- OPD | 2026.04 | Tsinghua THUNLP | [arXiv 2604.13016](https://arxiv.org/abs/2604.13016) | **Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe**
- Lightning OPD | 2026.04 | Wu, Han, Cai | [arXiv 2604.13010](https://arxiv.org/abs/2604.13010) | **Lightning OPD: Efficient Post-Training with Offline OPD**
- OPSD Survey | 2026.05 | Academic | [arXiv 2605.18141](https://arxiv.org/abs/2605.18141) | **A Brief Overview: On-Policy Self-Distillation in LLMs**
Categories
- OPD-RL Hybrids: Inside-RL OPD | 17
- OPD with Larger External Teachers: White-Box | 16
- Self-Distillation with Privileged Context: OPSD | 15
- Industrial / Production Model Reports | 15
- Speculative-Decoding Distillation | 11
- Frameworks & Toolkits | 11
- Curator's Picks: where to start | 8
- Surveys, Foundations & Position Papers | 8
- Agent & Embodied OPD (by application) | 8
- Multimodal OPD (VLM, Video, Audio, Image) | 7
- OPD with Black-Box / Outcome-Based Teachers | 4
- Reasoning OPD (by application) | 1
- Star History | 1
Keywords
llm (5), large-language-models (3), rlhf (3), lora (2), gpt (2), language-model (2), llama (2), grpo (2), moe (2), fine-tuning (2), coding (2), peft (2), mistral (1), qlora (1), quantization (1), qwen (1), transformers (1), llm-inference (1), speculative-decoding (1), deep-learning (1), self-play (1), llama3 (1), instruction-tuning (1), chatglm (1), ai (1), agent (1), x-prompt (1), promptist (1), prompt (1), pretraining (1), nlp (1), lmops (1), lm (1), rl (1), reasoning (1), distillation (1), tinker (1), slime (1), skill-learning (1), sglang (1), openclaw-skills (1), open-claw (1), on-policy-distillation (1), memory-systems (1), gui-application (1), async (1), glm (1), agentic-ai (1), rlvr (1), agentic (1)