{"id":49689515,"url":"https://github.com/machine981/SCOPE","last_synced_at":"2026-06-02T20:00:36.944Z","repository":{"id":350987463,"uuid":"1208162710","full_name":"machine981/SCOPE","owner":"machine981","description":"SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting","archived":false,"fork":false,"pushed_at":"2026-06-01T06:44:04.000Z","size":2728,"stargazers_count":22,"open_issues_count":7,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-01T08:23:38.856Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/machine981.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-11T22:47:50.000Z","updated_at":"2026-06-01T06:44:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/machine981/SCOPE","commit_stats":null,"previous_names":["machine981/scope"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/machine981/SCOPE","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine981%2FSCOPE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine981%2FSCOPE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine981%2FSCOPE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine981%2FSCOPE/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/machine981","download_url":"https://codeload.github.com/machine981/SCOPE/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machine981%2FSCOPE/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33834011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-07T13:00:27.702Z","updated_at":"2026-06-02T20:00:36.931Z","avatar_url":"https://github.com/machine981.png","language":"Python","funding_links":[],"categories":["Frameworks, Tools, and Implementations","🔬 OPD with Larger External Teachers — White-Box"],"sub_categories":["Implementations"],"readme":"\u003ch1 align=\"center\"\u003e\nSCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting\n\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href='https://arxiv.org/pdf/2604.10688'\u003e\u003cimg src='https://img.shields.io/badge/arXiv-2604.10688-red?logo=arXiv'\u003e\u003c/a\u003e  \u0026nbsp;\n  \u003ca href=\"https://huggingface.co/papers/2604.10688\"\u003e\u003cimg src='https://img.shields.io/badge/HuggingFace-SCOPE-FF9B9E?logo=huggingface'\u003e\u003c/img\u003e\u003c/a\u003e  \u0026nbsp;\n  
\u003ca href=\"https://github.com/machine981/SCOPE\"\u003e\u003cimg src=\"https://img.shields.io/badge/GitHub-SCOPE-94c320?logo=github\"\u003e\u003c/a\u003e \u0026nbsp;\n  \u003cbr\u003e\n\u003c/div\u003e\n\n![SCOPE Overview](image/SCOPE_Overview.png)\n\n---\n\n## 🔥 News\n\n- **[2026.04]** SCOPE model weights released on Hugging Face!\n  - 🤗 [SCOPE-Qwen3-1.7B](https://huggingface.co/Machine981/SCOPE-Qwen3-1.7B)\n  - 🤗 [SCOPE-Deepseek-R1-Distill-Qwen-1.5B](https://huggingface.co/Machine981/SCOPE-Deepseek-R1-Distill-Qwen-1.5B)\n- **[2026.04]** Paper released on arXiv!\n\n## 📖 Overview\n\nOn-Policy Distillation (OPD) alleviates alignment gaps by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality.\n\n**Existing OPD Limitations:**\n- **Diversity degradation**: Correct paths are reinforced equally, reducing exploration at the capability boundary\n- **Rectification inefficiency**: Noisy teacher signals mislead incorrect trajectories\n\n**SCOPE Solution:**\nWe propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths.\n\n---\n\n## 📝 Abstract\n\nOn-Policy Distillation (OPD) alleviates alignment gaps by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. 
For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.\n\n---\n\n## 🏆 Key Contributions\n\n- **Empirical analysis of signal quality heterogeneity in OPD:** Uncovers that teacher and student perplexity reliably predict corrective capability on incorrect trajectories and capability-boundary samples on correct ones.\n\n- **The SCOPE dual-path adaptive framework:** Routes rollouts by correctness, directing incorrect trajectories to teacher-perplexity-weighted OPD and correct trajectories to student-perplexity-weighted MLE.\n\n- **Extensive experimental validation:** Achieves 11.42% Avg@32 and 7.30% Pass@32 relative improvement over baselines on six reasoning benchmarks.\n\n---\n\n## 📖 Method\n\n### SCOPE Framework\n\nSCOPE is a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths:\n\n| Path | Trajectories | Method | Objective |\n|------|-------------|--------|-----------|\n| **Student Path** | Correct (Ω_c) | Perplexity-weighted MLE | Reinforce unconventional valid paths at capability boundary |\n| **Teacher Path** | Incorrect (Ω_w) | Perplexity-weighted KL distillation | Filter out context-induced noise, prioritize reliable guidance |\n\n### Weight Formulation\n\n**Student-guided weight (for correct trajectories Ω_c):**\n\n$$w_i^{stu} = \\frac{\\text{PPL}_S(y_i|x)^{1/\\tau}}{\\sum_{j \\in \\Omega_c} \\text{PPL}_S(y_j|x)^{1/\\tau}}$$\n\nAmplifies \"unconventional valid paths\" at the 
capability boundary using perplexity-based weighting.\n\n**Teacher-guided weight (for incorrect trajectories Ω_w):**\n\n$$w_i^{tea} = \\frac{\\text{PPL}_T(y_i|x)^{-1/\\tau}}{\\sum_{j \\in \\Omega_w} \\text{PPL}_T(y_j|x)^{-1/\\tau}}$$\n\nFilters \"context-induced noise\" by down-weighting high teacher perplexity instances.\n\n### Key Insight\n\nWithin each prompt's trajectory group, SCOPE applies **group-level perplexity-based normalization** to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts.\n\n### Overall Objective\n\nThe combined SCOPE objective jointly optimizes:\n\n$$\\mathcal{L}_{SCOPE} = \\sum_{i \\in \\Omega_c} w_i^{stu} \\cdot \\mathcal{L}_{MLE} + \\sum_{i \\in \\Omega_w} w_i^{tea} \\cdot \\mathcal{L}_{OPD}$$\n\n---\n\n## 📊 Main Results\n\n### Mathematical Reasoning (Teacher: Skywork-OR1-Math-7B → Student: DeepSeek-R1-Distill-Qwen-1.5B)\n\n| Benchmark | Avg@32 | Pass@32 | vs OPD |\n| --------- | ------ | ------- | ------ |\n| AIME24 | 42.7 | 77.9 | +6.22% |\n| AIME25 | 30.4 | 50.9 | +5.19% |\n| AMC23 | 80.9 | 97.2 | +6.59% |\n| MATH500 | 89.8 | 97.9 | +0.90% |\n| Minerva | 37.8 | 55.1 | +8.31% |\n| Olympiad | 49.7 | 70.9 | +10.69% |\n\n**Key findings:**\n- **11.42%** relative improvement in Avg@32\n- **7.30%** relative improvement in Pass@32\n- **+5.54%** average improvement over standard OPD across benchmarks\n\n---\n\n## ⚡ Quick Start\n\n### 1. Install Dependencies\n\n```bash\npip install -r requirements.txt\npip install -e .  # install verl itself\n```\n\n### 2. Deploy VLLM Service\n\n```bash\nbash deploy_vllm.sh\n```\n\n**Key configurations in `deploy_vllm.sh`**:\n\n| Parameter | Description | Default |\n| --------- | ----------- | ------- |\n| `model_name_or_path` | Model path | `./Models/Skywork-OR1-7B` |\n| `served_model_name` | Model name in API | `Skywork-OR1-7B` |\n| `--api-key` | API authentication key | `xxx` (must match `verl/utils/api_interface.py`) |\n\n### 3. 
Configure Experiment Scripts\n\nSet the following in `run_experiment_distill_1_5b.sh`:\n\n```bash\nTEACHER_MODEL_NAME=Skywork-OR1-7B  # Must match served_model_name in deploy_vllm.sh\nIP_POOL=\"['xx.xxx.x.xx','...']\"    # VLLM service node IP list\n```\n\n**API Key Consistency**: The `--api-key` in `deploy_vllm.sh` must match the `api_key` in `verl/utils/api_interface.py`.\n\n### 4. Run Training\n\n```bash\nbash run_experiment_distill_1_5b.sh\n```\n\n---\n\n## 🔧 Training Parameters\n\n### Model Configuration\n\n| Parameter | Description | Default |\n| --------- | ----------- | ------- |\n| `POLICY_MODEL_PATH` | Student model path | `DeepSeek-R1-Distill-Qwen-1.5B` |\n| `TEACHER_MODEL_NAME` | Teacher model name (as registered in VLLM) | `Skywork-OR1-7B` |\n| `IP_POOL` | VLLM service node IP list | `['xx.xxx.x.xx','...']` |\n\n### Data Configuration\n\n| Parameter | Description | Default |\n| --------- | ----------- | ------- |\n| `TRAIN_DATA` | Training data path | `./verl-distillation-ori/data/deepmath_new/deepmath_new_train.parquet` |\n| `VAL_DATA` | Validation data path | `./verl-distillation-ori/data/aime/test.parquet` |\n| `MAX_PROMPT_LENGTH` | Max prompt length | `2048` |\n| `MAX_RESPONSE_LENGTH` | Max response length | `12288` |\n\n### SCOPE Dual-Path Configuration\n\n| Parameter | Description | Default |\n| --------- | ----------- | ------- |\n| `USE_SCOPE_DUAL_PATH_WEIGHTING` | Enable SCOPE dual-path weighting | `True` |\n| `SCOPE_TAU` | Weight temperature parameter | `1` |\n| `SCOPE_USE_SEQ_WEIGHTS` | Use sequence-level weights | `True` |\n| `USE_STUDENT_PATH_WEIGHTS` | Use student path weights | `True` |\n| `USE_TEACHER_PATH_WEIGHTS` | Use teacher path weights | `True` |\n| `STUDENT_PATH_PPL_POSITIVE` | Student path: higher PPL → higher weight | `True` |\n| `TEACHER_PATH_PPL_POSITIVE` | Teacher path: higher PPL → lower weight | `False` |\n\n---\n\n## 🤝 Acknowledgements\n\nThis work builds upon [verl](https://github.com/volcengine/verl) and the on-policy 
distillation paradigm, with appreciation for their contributions to the research community.\n\n## 🔗 Citation\n\nIf you find our work useful, please consider citing:\n\n```bibtex\n@article{scope2026,\n  title={SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting},\n  author={Zheng, Binbin and Ma, Xing and Liang, Yiheng and Ruan, Jingqing and Fu, Xiaoliang and Lin, Kepeng and Zhu, Benchang and Zeng, Ke and Cai, Xunliang},\n  journal={arXiv preprint arXiv:2604.10688},\n  year={2026}\n}\n```\n\n## 📝 License\n\nThis project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachine981%2FSCOPE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmachine981%2FSCOPE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachine981%2FSCOPE/lists"}