{"id":49689517,"url":"https://github.com/HJSang/OPSD_OnPolicyDistillation","last_synced_at":"2026-06-02T20:00:36.312Z","repository":{"id":350391933,"uuid":"1202211835","full_name":"HJSang/OPSD_OnPolicyDistillation","owner":"HJSang","description":"On Policy Distillation Build on top of Verl","archived":false,"fork":false,"pushed_at":"2026-04-10T05:17:40.000Z","size":105,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-10T07:31:46.034Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HJSang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-05T18:41:12.000Z","updated_at":"2026-04-10T05:17:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/HJSang/OPSD_OnPolicyDistillation","commit_stats":null,"previous_names":["hjsang/opsd_onpolicydistillation"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/HJSang/OPSD_OnPolicyDistillation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HJSang%2FOPSD_OnPolicyDistillation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HJSang%2FOPSD_OnPolicyDistillation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HJSang%2FOPSD_OnPolicyDistillation/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/HJSang%2FOPSD_OnPolicyDistillation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HJSang","download_url":"https://codeload.github.com/HJSang/OPSD_OnPolicyDistillation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HJSang%2FOPSD_OnPolicyDistillation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33834011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-07T13:00:27.702Z","updated_at":"2026-06-02T20:00:36.303Z","avatar_url":"https://github.com/HJSang.png","language":"Python","funding_links":[],"categories":["Frameworks, Tools, and Implementations","🔬 OPD with Larger External Teachers — White-Box"],"sub_categories":["Implementations"],"readme":"# Memory Efficient On-Policy Distillation Training\n\nMinimal training repo for on-policy distillation experiments built on top of `verl`.\n\n## Papers\n\nThis repository is related to the following papers:\n\n- [TIP: Token Importance in On-Policy Distillation](https://arxiv.org/abs/2604.14084) ([PDF](https://arxiv.org/pdf/2604.14084))\n  - Studies which token positions carry the most useful learning signal in OPD.\n  - Introduces the TIP view of 
token importance based on student entropy and teacher-student divergence.\n\n- [PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence](https://arxiv.org/abs/2603.11178) ([PDF](https://arxiv.org/pdf/2603.11178))\n  - Studies sample importance for distillation and self-distillation at the problem level.\n  - Proposes weighting problems by student empirical pass rate, emphasizing the frontier of student competence.\n  - A two-stage forward-then-reverse KL schedule leads to the best performance.\n\n- [Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training](https://arxiv.org/abs/2605.12483) ([PDF](https://arxiv.org/pdf/2605.12483))\n  - Use RL on a strong teacher model to explore high-reward reasoning behaviors.\n  - Distill the RL-trained teacher into a smaller student with dense token-level supervision (FKL-OPD two-stage pipeline).\n  - This teacher-RL + distillation setup outperforms directly training small models with GRPO/RL.\n\n## OPD: On-Policy Distillation with Separate Teacher\n\nA separate (typically bigger) teacher model and a trainable student model see the same input sequences. 
The teacher produces better distributions naturally; no ground-truth injection is needed.\n\n- Entry point: `python -m opd.main_opd`\n- Requires `TEACHER_MODEL_PATH` environment variable\n- Batch construction: `build_opd_batch` (trainer entry point) prefers pre-tokenized `batch[\"prompts\"]` + `response_mask` so training matches rollout inputs; falls back to `raw_prompt` + chat template only if prompts are absent\n- `build_opd_batch_multiturn` / `build_opd_batch_from_verl_batch` remain as thin aliases for the prompts-only and raw-prompt-only paths\n- Supports reward-weighted distillation via `opd.reward_beta` config\n\n## Multi-turn Agent-loop Support\n\nOPD supports multi-turn agent-loop rollouts where the response contains interleaved LLM-generated tokens and tool/environment tokens:\n\n- The trainer preserves the agent-loop `response_mask` (1=LLM, 0=tool) instead of recomputing it\n- The batch builder uses `response_mask` as the per-token loss mask so distillation only targets LLM-generated spans\n- `build_opd_batch` uses pre-tokenized prompt IDs from `batch[\"prompts\"]` when present for exact prompt matching\n\nMulti-turn diagnostics are logged: `tool_mask/llm_tokens`, `tool_mask/tool_tokens`, `tool_mask/tool_ratio`, `num_turns/*`.\n\n## Layout\n\n```text\nscripts/\n  eval/\n  grpo/\n  opd/          # OPD training scripts (separate teacher)\n  utils/\nsrc/\n  common/       # Shared batch builder\n  data/\n  opd/          # OPD module (separate teacher model)\n  rewards/\n```\n\n## Environment Assumptions\n\nThe scripts assume a GPU machine with:\n\n- Python 3\n- CUDA and `nvidia-smi`\n- `verl`\n- `torch`\n- `transformers`\n- `ray`\n- `hydra`\n- `tensordict`\n\nThe setup scripts under `scripts/*/setup_*.sh` only do lightweight verification plus `pip install tensordict`; they do not create a full environment from scratch.\n\n## Tested Environment\n\nThe current testing environment is:\n\n```text\nverl         0.7.0.7\ntorch        2.9.1.7\ntransformers 
4.57.1\ntorchao      0.9.0\ntorchaudio   2.9.1.1\ntorchvision  0.24.1.10\n```\n\n## Data Layout\n\nBy default, training and eval scripts look for data under:\n\n```text\n\u003crepo\u003e/data\n```\n\nExpected raw inputs:\n\n```text\ndata/\n  DAPO-Math-17k-dedup/distinct-prompts-with-rewards.parquet\n  AIME_2024/aime_2024_problems.parquet\n  AIME_2025/train.jsonl\n  MATH-500/test.jsonl\n```\n\nGenerated files:\n\n- `data/grpo_processed/*.parquet` from `src/data/prepare_grpo_data.py`\n- `data/eval_processed/\u003cvariant\u003e/*.parquet` from `src/data/process_eval_data.py`\n\n## Memory Efficiency\n\nThe training code uses several mechanisms to keep memory usage manageable on long-context math runs:\n\n- FSDP parameter and optimizer offload. The launch scripts enable `actor.fsdp_config.param_offload=True`, `actor.fsdp_config.optimizer_offload=True`, and `ref.fsdp_config.param_offload=True` so model weights and optimizer state can be moved off GPU when inactive.\n- Remove-padding execution. Training scripts set `actor_rollout_ref.model.use_remove_padding=True`, and the OPD worker uses unpadded sequence paths so compute and memory scale with real token count instead of padded sequence length.\n- Two-phase teacher/student execution for distillation. OPD does not keep both teacher and student workloads active on GPU at the same time. The worker first runs teacher-side computation, moves cached teacher statistics or logits to CPU, offloads the teacher, and only then runs the student update step.\n- Chunked divergence computation. OPD divergence losses in `src/opd/losses.py` process tokens in chunks instead of materializing full-vocabulary probability tensors for the whole batch at once.\n- Micro-batching in the worker. OPD splits batches using `ppo_micro_batch_size_per_gpu` and accumulates gradients across micro-batches to bound activation and logits memory.\n- Dynamic batch sizing for GRPO. 
The main GRPO script enables `actor.use_dynamic_bsz` and caps per-GPU token counts with `ppo_max_token_len_per_gpu` and `log_prob_max_token_len_per_gpu`, which is useful when response lengths vary a lot.\n- Rollout memory controls. The scripts enable `rollout.free_cache_engine=True` and expose `GPU_MEMORY_UTIL` so KV-cache usage can be bounded during generation.\n\nIn practice, the biggest repo-specific savings come from the OPD two-phase worker design, chunked loss computation, and remove-padding execution.\n\n## Distillation Implementation\n\nOPD (`src/opd/opd_worker.py`) uses a two-phase update:\n\n1. **Phase 1 (Teacher):** Load the teacher (`ref`) model, run teacher forwards for all micro-batches, cache teacher logits on CPU, offload teacher.\n2. **Phase 2 (Student):** Load the student (`actor`) model and optimizer, run student forward + divergence loss + backward using cached teacher logits.\n\nThis avoids keeping both teacher and student compute active on GPU at the same time during the update step.\n\nOPD supports three divergence types (`reverse_kl`, `forward_kl`, `jsd`), chunk-wise loss computation, and per-sample reward weighting.\n\n## Main Entry Points\n\nGRPO:\n\n```bash\nbash scripts/grpo/setup_grpo.sh\nMODEL_PATH=/path/to/model \\\nMODEL_NAME=my-model \\\nbash scripts/grpo/train_grpo.sh\n```\n\nNative GRPO with KL:\n\n```bash\nMODEL_PATH=/path/to/model \\\nMODEL_NAME=my-model \\\nbash scripts/grpo/train_grpo_native.sh\n```\n\nNative GRPO without KL:\n\n```bash\nMODEL_PATH=/path/to/model \\\nMODEL_NAME=my-model \\\nbash scripts/grpo/train_grpo_native_no_kl.sh\n```\n\nOPD (separate teacher, single-turn math):\n\n```bash\nbash scripts/opd/setup_opd.sh\nMODEL_PATH=/path/to/student_model \\\nTEACHER_MODEL_PATH=/path/to/teacher_model \\\nMODEL_NAME=my-model \\\nbash scripts/opd/train_opd.sh\n```\n\nOPD (separate teacher, multi-turn agent with tool calls):\n\n```bash\nbash scripts/opd/setup_opd.sh\nMODEL_PATH=/path/to/student_model 
\\\nTEACHER_MODEL_PATH=/path/to/teacher_model \\\nDATABASE_DIR=/path/to/tool/database \\\nMODEL_NAME=my-model \\\nbash scripts/opd/train_opd_agent.sh\n```\n\nEvaluation:\n\n```bash\nMODEL_PATH=/path/to/model \\\nMODEL_NAME=my-model \\\nINSTRUCTION_VARIANT=boxed \\\nREWARD_FUNCTION=math_reward \\\nbash scripts/eval/eval_math.sh\n```\n\nCheckpoint conversion:\n\n```bash\nCHECKPOINT_PATH=/path/to/global_step_54/actor \\\nbash scripts/utils/convert_checkpoint.sh\n```\n\n## Useful Environment Variables\n\nMost training scripts accept overrides through environment variables, including:\n\n- `MODEL_PATH`\n- `MODEL_NAME`\n- `DATA_DIR`\n- `TRAIN_BATCH_SIZE`\n- `PPO_MINI_BATCH_SIZE`\n- `PPO_MICRO_BATCH_SIZE_PER_GPU`\n- `LEARNING_RATE`\n- `TOTAL_EPOCHS`\n- `MAX_PROMPT_LENGTH`\n- `MAX_RESPONSE_LENGTH`\n- `ROLLOUT_N`\n- `TP_SIZE`\n- `GPU_MEMORY_UTIL`\n\nOPD-specific variables:\n\n- `TEACHER_MODEL_PATH` (required)\n- `OPD_LOSS_TYPE`\n- `OPD_CHUNK_SIZE`\n- `OPD_MAX_LENGTH`\n- `OPD_REWARD_BETA`\n- `ENABLE_THINKING`\n\nOPD agent additional variables:\n\n- `ENABLE_TOOLS`\n- `MAX_ASSISTANT_TURNS`\n- `MAX_TOOL_RESPONSE_LENGTH`\n- `TOOL_FORMAT`\n- `AGENT_NUM_WORKERS`\n- `DATABASE_DIR`\n\nEval variables:\n\n- `INSTRUCTION_VARIANT`\n- `REWARD_FUNCTION`\n- `VAL_TEMPERATURE`\n- `VAL_TOP_P`\n- `VAL_TOP_K`\n- `VAL_N`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHJSang%2FOPSD_OnPolicyDistillation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHJSang%2FOPSD_OnPolicyDistillation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHJSang%2FOPSD_OnPolicyDistillation/lists"}