{"id":50510414,"url":"https://github.com/WenjinHou/Uni-OPD","last_synced_at":"2026-06-19T14:00:37.115Z","repository":{"id":357578657,"uuid":"1237481249","full_name":"WenjinHou/Uni-OPD","owner":"WenjinHou","description":"Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe","archived":false,"fork":false,"pushed_at":"2026-05-13T11:11:18.000Z","size":4068,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-13T12:25:35.456Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WenjinHou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T08:14:26.000Z","updated_at":"2026-05-13T11:11:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/WenjinHou/Uni-OPD","commit_stats":null,"previous_names":["wenjinhou/uni-opd"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/WenjinHou/Uni-OPD","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WenjinHou%2FUni-OPD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WenjinHou%2FUni-OPD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WenjinHou%2FUni-OPD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WenjinHou%2FUni-OPD/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WenjinHou","download_url":"https://codeload.github.com/WenjinHou/Uni-OPD/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WenjinHou%2FUni-OPD/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34534278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-02T20:00:26.252Z","updated_at":"2026-06-19T14:00:37.110Z","avatar_url":"https://github.com/WenjinHou.png","language":"Python","funding_links":[],"categories":["🖼️ Multimodal OPD (VLM, Video, Audio, Image)"],"sub_categories":["🔁 Iterative Self-Bootstrapping"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe \u003c!-- omit in toc --\u003e\n\n\u003ca href='https://arxiv.org/abs/2605.03677'\u003e\n\u003cimg src='https://img.shields.io/badge/Paper-Arxiv-purple'\u003e\u003c/a\u003e\n\u003ca href='https://github.com/WenjinHou/Uni-OPD/blob/main/LICENSE'\u003e\n\u003cimg src='https://img.shields.io/badge/LICENSE-Apache_2.0-yellow'\u003e\u003c/a\u003e\n\n\u003ca href=\"/docs/README_zh.md\"\u003e中文\u003c/a\u003e | \u003cb\u003eEnglish\u003c/b\u003e\n\n\u003c!--\n**Wenjin 
Hou\\*\u003csup\u003e1\u003c/sup\u003e**,\u0026emsp;\n**Shangpin Peng\\*\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Weinong Wang\u003csup\u003e3,†\u003c/sup\u003e**,\u0026emsp;\n**Zheng Ruan\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Yue Zhang\u003csup\u003e1\u003c/sup\u003e**,\u0026emsp;\n**Zhenglin Zhou\u003csup\u003e1\u003c/sup\u003e**\n\n**Mingqi Gao\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Yifei Chen\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Kaiqi Wang\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Hongming Yang\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Chengquan Zhang\u003csup\u003e3\u003c/sup\u003e**,\u0026emsp;\n**Zhuotao Tian\u003csup\u003e2\u003c/sup\u003e**\n\n**Han Hu\u003csup\u003e3,‡\u003c/sup\u003e**,\u0026emsp;\n**Yi Yang\u003csup\u003e1\u003c/sup\u003e**,\u0026emsp;\n**Fei Wu\u003csup\u003e1\u003c/sup\u003e**,\u0026emsp;\n**Hehe Fan\u003csup\u003e1,✉️\u003c/sup\u003e**\n\n\u003csup\u003e1\u003c/sup\u003eZhejiang University\u0026emsp;\n\u003csup\u003e2\u003c/sup\u003eShenzhen Loop Area Institute\u0026emsp;\n\u003csup\u003e3\u003c/sup\u003eLLM Department, Tencent\n\n\\*Equal contribution\u0026emsp;\u003csup\u003e†\u003c/sup\u003eProject lead\u0026emsp;\u003csup\u003e‡\u003c/sup\u003eAdvisor\u0026emsp;\u003csup\u003e✉️\u003c/sup\u003eCorresponding author\n--\u003e\n\n\u003c/div\u003e\n\n## 🎊 News \u003c!-- omit in toc --\u003e\n\n- [2026.05.13] 🚀 We open-source the code and training scripts for OPD.\n- [2026.05.05] 📖 We release our paper on [ArXiv](https://arxiv.org/abs/2605.03677).\n\n## 🚀 Overview \u003c!-- omit in toc --\u003e\n\n**Uni-OPD** is a unified On-Policy Distillation (OPD) framework that consolidates the capabilities of specialized expert teachers into a single student model, generalizing across **LLMs and MLLMs**. We identify two fundamental bottlenecks that limit effective OPD:\n\n1. **Insufficient exploration of informative student-generated states**, and\n2. 
**Unreliable teacher supervision for student rollouts**.\n\nTo address them, Uni-OPD introduces a **dual-perspective optimization recipe** that jointly improves student exploration (via offline difficulty-aware and online correctness-aware data balancing) and teacher reliability (via an outcome-guided margin calibration mechanism). Extensive experiments on **5 domains and 16 benchmarks**, covering single-/multi-teacher, strong-to-weak, and cross-modal distillation, verify the effectiveness and versatility of Uni-OPD.\n\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/teaser.png\" width=\"80%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n\n## 📌 Contents \u003c!-- omit in toc --\u003e\n\n- [🔑 Key Features](#-key-features)\n- [📚 Dataset](#-dataset)\n- [💻 Environment Setup](#-environment-setup)\n- [⚙️ Training](#️-training)\n- [📈 Evaluation](#-evaluation)\n- [📝 Citation](#-citation)\n\n## 🔑 Key Features\n\n- **A unified OPD framework across LLMs and MLLMs.** Uni-OPD consolidates knowledge from one or several expert teachers into a single student model and works seamlessly across single-teacher, multi-teacher, strong-to-weak, and cross-modal (text + multimodal) distillation settings.\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/framework.png\" width=\"85%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n\n- **Student-perspective 1: offline difficulty-aware data balancing.** We selectively upsample medium-difficulty prompts to reshape the training corpus into a more balanced difficulty distribution while preserving data diversity. 
This enables the student to generate more informative trajectories and explore a broader solution space.\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/offline_data_balancing.png\" width=\"80%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n\n- **Student-perspective 2: online correctness-aware data balancing.** During training, we dynamically filter and reshape rollout batches to maintain a balanced ratio between correct and incorrect trajectories, preventing the student from collapsing onto trivially correct samples or being overwhelmed by uniformly failed ones.\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/online_data_balancing.png\" width=\"60%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n\n- **Teacher-perspective: outcome-guided margin calibration.** We show that reliable token-level teacher supervision largely depends on whether its trajectory-level aggregation remains _order-consistent_ with the outcome reward. Uni-OPD uses the outcome reward as a global anchor to calibrate the teacher's per-token margins, restoring order consistency between correct and incorrect trajectories.\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/margin_calibration.png\" width=\"85%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n\n\u003c!--\n- **Stable training dynamics and strong empirical results.** The dual-perspective recipe yields smoother training curves and consistent gains over strong OPD/RL baselines across math, code, chart, and general multimodal reasoning benchmarks. 
See the paper for the full set of results.\n\u003ctable align=\"center\"\u003e\n    \u003cp align=\"center\"\u003e\n      \u003cimg src=\"/docs/figures/train_dynamics.png\" width=\"80%\" /\u003e\n    \u003c/p\u003e\n\u003c/table\u003e\n--\u003e\n\n## 📚 Dataset\n\nThe dataset we use for training and evaluation in Uni-OPD is a combination of publicly available resources:\n\n- **Text training data (Math + Code).** We use the same training data as [G-OPD](https://github.com/RUCBM/G-OPD), available at [🤗 Keven16/G-OPD-Training-Data](https://huggingface.co/datasets/Keven16/G-OPD-Training-Data).\n  - The math part is sourced from the **DeepMath** dataset.\n  - The code part is sourced from the **code subset of the Eurus-2-RL** dataset.\n\n- **Multimodal training data.** We use a mixture of:\n  - [🤗 OpenMMReasoner/OpenMMReasoner-RL-74K](https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-RL-74K),\n  - [🤗 HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA), and\n  - [InfographicVQA](https://www.docvqa.org).\n\n\u003c!--\n## 📦 Model Weights\n\n--\u003e\n\n## 💻 Environment Setup\n\nWe provide step-by-step instructions for both the training and evaluation environments:\n\n- **Training environment** — see [docs/build_env.md](docs/build_env.md). It walks through preparing the conda env (`Uni-OPD`, Python 3.12), installing required packages, and applying the SGLang \u0026 Megatron patches shipped under [`miles/docker/patch`](miles/docker/patch).\n- **Evaluation environment** — see [docs/build_eval_env.md](docs/build_eval_env.md). 
It covers two separate conda envs:\n  - `Uni-OPD-LLM-Eval` for text evaluation (built on top of [G-OPD](https://github.com/RUCBM/G-OPD)), and\n  - `Uni-OPD-LMMS-Eval` for multimodal evaluation (built on top of [lmms-eval](https://github.com/evolvinglmms-lab/lmms-eval)).\n\nA typical post-setup layout looks like:\n\n```text\n- Uni-OPD/                  # this repository\n  - miles/                  # RL / OPD training framework\n  - Megatron-LM/            # training backend\n  - sglang/                 # inference / rollout backend\n  - G-OPD/                  # text-side evaluation (cloned for eval env)\n  - lmms-eval/              # multimodal evaluation (cloned for eval env)\n```\n\n## ⚙️ Training\n\nAll training and implementation in Uni-OPD is built on top of the [miles](https://github.com/radixark/miles) framework. For a summary of the modifications we made to miles, see [docs/miles_modifications.md](docs/miles_modifications.md).\n\nWe release the full set of training scripts used in the paper under [`exps/scripts/OPD`](exps/scripts/OPD), grouped by distillation setting:\n\n| Setting        | Path                                                                 | Description                                                                |\n| -------------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------- |\n| Single-teacher | [`exps/scripts/OPD/single_teacher`](exps/scripts/OPD/single_teacher) | Math / Code distillation with Qwen3-1.7B \u0026 Qwen3-4B students.              |\n| Multi-teacher  | [`exps/scripts/OPD/multi_teacher`](exps/scripts/OPD/multi_teacher)   | Joint Math + Code distillation from multiple expert teachers.              |\n| Strong-to-weak | [`exps/scripts/OPD/strong_to_weak`](exps/scripts/OPD/strong_to_weak) | Distilling a stronger teacher (Qwen3-A3B-Instruct) into a smaller student. 
|\n\nA minimal launch command looks like:\n\n```bash\n# Activate the training conda env built via docs/build_env.md\nconda activate Uni-OPD\n\n# Example: single-teacher Math distillation, 4B student\nbash exps/scripts/OPD/single_teacher/0413/Qwen3_Stu_4B_Math_Uni_OPD.sh \\\n    --rollout-batch-size 64 \\\n    --sample-n 16 \\\n    --lr 1e-6\n```\n\n\u003e Before running, please:\n\u003e\n\u003e 1. Update the model / data paths at the top of the script (and inside the corresponding YAML under `configs/`) to point to your local checkpoints and dataset files.\n\u003e 2. Launch the teacher server(s) using `miles/Uni_OPD_utils/scripts/server/run_sglang_server.sh` and put the relevant addresses in `miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json`.\n\n## 📈 Evaluation\n\nEvaluation is performed in the dedicated evaluation environments described in [docs/build_eval_env.md](docs/build_eval_env.md):\n\n- **LLM benchmarks** (math \u0026 code) follow the [G-OPD](https://github.com/RUCBM/G-OPD) evaluation pipeline.\n- **MLLM benchmarks** (ChartQA, InfographicVQA, MathVision, LogicVista, etc.) 
follow the [lmms-eval](https://github.com/evolvinglmms-lab/lmms-eval) pipeline.\n\nPlease refer to the upstream repositories for the per-benchmark commands.\n\n## 📝 Citation\n\nIf you find our paper / code helpful, please consider citing our work 📝 and starring this repository ⭐️!\n\n```bibtex\n@article{hou2026uni,\n  title   = {{Uni-OPD}: Unifying On-Policy Distillation with a Dual-Perspective Recipe},\n  author  = {Hou, Wenjin and Peng, Shangpin and Wang, Weinong and Ruan, Zheng and Zhang, Yue and Zhou, Zhenglin and Gao, Mingqi and Chen, Yifei and Wang, Kaiqi and Yang, Hongming and Zhang, Chengquan and Tian, Zhuotao and Hu, Han and Yang, Yi and Wu, Fei and Fan, Hehe},\n  journal = {arXiv preprint arXiv:2605.03677},\n  year    = {2026}\n}\n```\n\n## 🙏 Acknowledgement \u003c!-- omit in toc --\u003e\n\n- [G-OPD](https://github.com/RUCBM/G-OPD): an excellent open-source project on on-policy distillation; we reuse its text-side training data and evaluation pipeline.\n- [miles](https://github.com/radixark/miles): the powerful RL training framework on top of which we build Uni-OPD.\n- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [SGLang](https://github.com/sgl-project/sglang): the training and rollout backends used throughout this project.\n- [lmms-eval](https://github.com/evolvinglmms-lab/lmms-eval): the multimodal evaluation framework we adopt for MLLM benchmarks.\n\n## 📧 Contact us \u003c!-- omit in toc --\u003e\n\nIf you have any questions, comments, or suggestions, please feel free to open an issue or PR. Contributions and discussions that help advance research in this area are very welcome!\n\n## License \u003c!-- omit in toc --\u003e\n\n[Apache License 2.0](/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWenjinHou%2FUni-OPD","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FWenjinHou%2FUni-OPD","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FWenjinHou%2FUni-OPD/lists"}