{"id":50339367,"url":"https://github.com/lasgroup/SDPO","last_synced_at":"2026-06-02T20:00:37.354Z","repository":{"id":336119490,"uuid":"1141366737","full_name":"lasgroup/SDPO","owner":"lasgroup","description":"Reinforcement Learning via Self-Distillation (SDPO)","archived":false,"fork":false,"pushed_at":"2026-02-02T21:55:48.000Z","size":4483,"stargazers_count":179,"open_issues_count":3,"forks_count":14,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-03T11:09:20.774Z","etag":null,"topics":["distillation","llm","reasoning","rl"],"latest_commit_sha":null,"homepage":"https://self-distillation.github.io/SDPO","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lasgroup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-24T18:21:12.000Z","updated_at":"2026-02-03T10:20:29.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lasgroup/SDPO","commit_stats":null,"previous_names":["lasgroup/sdpo"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lasgroup/SDPO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lasgroup%2FSDPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lasgroup%2FSDPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lasgroup%2FSDPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lasgrou
p%2FSDPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lasgroup","download_url":"https://codeload.github.com/lasgroup/SDPO/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lasgroup%2FSDPO/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33834011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distillation","llm","reasoning","rl"],"created_at":"2026-05-29T16:00:23.035Z","updated_at":"2026-06-02T20:00:37.348Z","avatar_url":"https://github.com/lasgroup.png","language":"Python","funding_links":[],"categories":["🤝 OPD-RL Hybrids — Inside-RL OPD","Papers"],"sub_categories":["🔁 Iterative Self-Bootstrapping","2026"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Reinforcement Learning via Self-Distillation (SDPO)\n\n[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge\u0026logo=arxiv\u0026logoColor=white)](https://arxiv.org/abs/2601.20802)  [![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge\u0026logo=github\u0026logoColor=000\u0026logoColor=white)](https://github.com/lasgroup/SDPO) [![W\u0026B 
Logs](https://img.shields.io/badge/WandB%20Logs-%2300B4AB?style=for-the-badge\u0026logo=weightsandbiases\u0026logoColor=white\u0026labelColor=000000)](https://wandb.ai/jonhue/SDPO?nw=mgotcx6kk7)\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\" style=\"font-family: Arial, sans-serif;\"\u003e\n  \u003cp\u003e\n    \u003ca href=\"#-introduction\" style=\"text-decoration: none; font-weight: bold;\"\u003e📖 Introduction\u003c/a\u003e •\n    \u003ca href=\"#-main-results\" style=\"text-decoration: none; font-weight: bold;\"\u003e📊 Main Results\u003c/a\u003e •\n    \u003ca href=\"#-getting-started\" style=\"text-decoration: none; font-weight: bold;\"\u003e🚀 Getting Started\u003c/a\u003e\n  \u003c/p\u003e\n  \u003cp\u003e\n    \u003ca href=\"#usage-documentation\" style=\"text-decoration: none; font-weight: bold;\"\u003eUsage Documentation\u003c/a\u003e •\n    \u003ca href=\"#citation\" style=\"text-decoration: none; font-weight: bold;\"\u003eCitation\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\n## 📖 Introduction\n\nLarge language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain *why* an attempt failed. 
We formalize this setting as *Reinforcement Learning with Rich Feedback* (RLRF):\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/sdpo-fig-training-loop.png\" alt=\"Reinforcement Learning from Rich Feedback\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n**We propose Self-Distilled Policy Optimization (SDPO)**, a reinforcement learning framework that augments on-policy optimization with self-distillation from the model’s own high-reward trajectories.\n\nSDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/sdpo-fig.png\" alt=\"SDPO\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n---\n\n## 📊 Main Results\n\n### Learning without Rich Environment Feedback\n\nWhen environment feedback is sparse or rule-based, standard reinforcement learning methods struggle to propagate learning signals efficiently. SDPO addresses this by reusing high-reward rollouts as implicit feedback, providing dense supervision even in the absence of rich environment feedback.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/chemistry-accuracy-response.png\" alt=\"SDPO Performance vs. Training Steps\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n*Training progression of Olmo3-7B-Instruct on Chemistry. We report the average accuracy across 16 samples per question and a rolling average of response lengths over 5 steps. We report GRPO with the optimal hyperparameters for this model and task. 
We run each configuration for 3 seeds and report standard errors as shaded areas.*\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/table-no-rich-feedback.png\" alt=\"SDPO Performance without Rich Environment Feedback\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n***Comparison of SDPO and GRPO on reasoning-related benchmarks.** We report the highest achieved avg@16 within 1 hour and 5 hours of wall-clock training time, respectively.\nBoth SDPO and on-policy GRPO perform one gradient step per generation batch, while GRPO performs 4 off-policy mini batch steps. We select optimal hyperparameters for SDPO and baselines based on 5h accuracy. Each run is performed on a node with 4 NVIDIA GH200 GPUs. Together with initialization and validation, each run takes approximately 6 hours.*\n\n---\n\n### Learning with Rich Environment Feedback\n\nIn settings where environments provide structured or textual feedback, SDPO naturally incorporates this information into self-distillation. By conditioning future attempts on both successful demonstrations and feedback from failed attempts, SDPO achieves faster convergence and more stable training.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/lcbv6-accuracy.png\" alt=\"SDPO Performance with Rich Environment Feedback\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n***SDPO with rich environment feedback.**\nLeft: SDPO benefits from denser credit assignment (logit \u003e token \u003e sequence-level) and consistently outperforms GRPO when rich feedback is available.\nRight: The self-teacher improves throughout training, and the final student substantially surpasses the initial teacher. Error bars show variability across seeds.*\n\n---\n\n### Solving Hard Questions via Test-Time Self-Distillation\n\nSDPO also enables **test-time self-distillation**. By generating multiple candidate solutions, identifying high-quality responses, and reusing them as demonstrations, the model can iteratively refine its outputs at inference time.  
This leads to substantial gains on hard reasoning tasks without additional training.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/very-hard-questions.png\" alt=\"Test-Time Self-Distillation\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n***Test-time self-distillation on hard coding problems.**\nSDPO solves questions that neither the base model nor multi-turn interaction can solve, achieving higher solution discovery rates across generation budgets.*\n\n---\n\n## 🚀 Getting Started\n\n### System Requirements\n*   **Operating System:** Linux (Tested on SLES 15 SP5 and Ubuntu 22.04)\n*   **Hardware:** NVIDIA GPUs (CUDA compatible)\n*   **Python:** 3.12 (Tested on 3.12.3)\n*   **CUDA Driver:** Compatible with the PyTorch version installed (see below).\n\n---\n\n### Installation\n\n#### Option 1: Docker (Recommended for HPC/GH200 Clusters)\n\nFor NVIDIA GH200 (aarch64) clusters with CUDA 13.1, we provide a pre-configured Dockerfile based on the NGC vLLM container.\n\n**Build and deploy:**\n```bash\n# Build the image\npodman build . -f Dockerfile.gh200 -t sdpo-gh200\n\n# Export for cluster use (enroot/squashfs)\nenroot import -x mount -o sdpo-gh200.sqsh podman://localhost/sdpo-gh200:latest\n```\n\n\u003e [!NOTE]\n\u003e The Docker images use `requirements-gh200.txt` which contains pinned versions from `requirements-full.txt`, excluding packages pre-installed in the NGC vLLM container (torch, vllm, flash-attn, xformers, triton).\n\n---\n\n#### Option 2: Local Installation\n\n1. **Install PyTorch:**\n\n*   **For Ampere/Hopper (RTX 30/40, H100):**\n    ```bash\n    pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124\n    ```\n\n*   **For Blackwell (RTX 50, RTX PRO 2000 Blackwell):**\n    ```bash\n    pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128\n    ```\n\n2. 
**Install SDPO and Dependencies:**\n```bash\n# Install core dependencies (pinned versions)\npip install -r requirements.txt\n\n# Install SDPO (verl) in editable mode\npip install -e .\n\n# Install Flash Attention 2 (compiled from source)\npip install flash-attn --no-build-isolation\n```\n\n3. **Optional: Install SGLang/vLLM for high-throughput inference:**\n```bash\npip install -r requirements_sglang.txt\n```\n\n---\n\n### Requirement Files\n\n| File | Description |\n|------|-------------|\n| `requirements.txt` | Core dependencies with pinned versions |\n| `requirements-gh200.txt` | For NGC vLLM container (excludes pre-installed packages) |\n| `requirements-full.txt` | Complete pip freeze from working environment |\n| `requirements_sglang.txt` | SGLang/vLLM stack for local inference |\n| `requirements-cuda.txt` | Flash Attention (for non-Docker installs) |\n\n**vLLM Version Note:**\n```\n# vllm==0.8.4       # GH200 cluster\n# vllm\u003e=0.12.0      # Blackwell (RTX 50 series, B100/B200) - NOT FULLY TESTED\n```\n\n\u003e [!WARNING]\n\u003e Blackwell architecture support (RTX 50 series, B100/B200) has not been fully tested.\n\n\u003e [!TIP]\n\u003e For reproducibility, use `requirements-full.txt` which contains the exact versions from a tested environment.\n\n\u003e [!NOTE]\n\u003e For more specific instructions on `verl` architecture and advanced configuration, refer to the [official verl repository](https://github.com/volcengine/verl).\n\n---\n\n### Data Preparation\n\nThe data is already loaded and split into train and test sets in the `datasets` directory. 
You can proceed to **preprocessing** the data.\n\nIf you want to load and process the data yourself, follow the steps below.\n\n#### Data Loading\nThe detailed instructions for loading the data are provided in `data/README.md`.\n\nOne example is provided below:\n```bash\npython data/load_dataset.py \\\n    --dataset_name Chemistry \\\n    --output_path datasets/sciknoweval/chemistry.json\n```\n\nTo split the data into train and test sets, run the following command:\n```bash\npython data/split_tasks.py \\\n    --json_path datasets/sciknoweval/chemistry.json \\\n    --output_dir datasets/sciknoweval/chemistry \\\n    --test_ratio 0.1 \\\n    --seed 42\n```\n\nFor `LiveCodeBenchv6`, split the _unit tests_ into train and test sets with the following command:\n```bash\npython data/split_tests.py \\\n    --json_path datasets/lcb_v6.json \\\n    --output_dir datasets/lcb_v6\n```\n\n\n#### Data Preprocessing\nOur implementation uses the `parquet` format for the data. To preprocess the data, run the following command:\n\n```bash\npython data/preprocess.py \\\n    --data_source DATASET_PATH\n```\n`DATASET_PATH` should contain the `train.json` and `test.json` files.\n\n---\n\n### Configuration\nBefore running experiments, adapt the paths in `verl/trainer/config/user.yaml` to your environment:\n\n```yaml\nvars:\n  dir: /path/to/your/SDPO              # Path to the SDPO repository\n  log_dir: /path/to/your/logs          # Directory for logs\n  ckpt_dir: /path/to/your/checkpoints  # Directory for model checkpoints\n```\n\n---\n\n### Training\n\n#### Reproducing Results (Without Rich Environment Feedback)\n\nRun the following commands to reproduce the results without rich environment feedback.\n\n**GRPO baseline:**\n```bash\nbash experiments/generalization/run_baseline_grpo_all.sh\n```\n\n**SDPO:**\n```bash\nbash experiments/generalization/run_sdpo_all.sh\n```\n\n#### Reproducing Results (With Rich Environment Feedback)\nRun the following commands to reproduce the 
results with rich environment feedback.\n\n**GRPO baseline:**\n```bash\nbash experiments/rich_feedback/run_baseline_grpo.sh\n```\n\n**SDPO:**\n```bash\nbash experiments/rich_feedback/run_sdpo.sh\n```\n\n---\n\n### Multi-turn Baseline of Section 5\n\nPrepare the data by splitting it into individual tasks:\n\n```bash\nexport MY_DATA_SPLITS_DIR=lcb_v6\nexport MY_DATA_SINGLES_DIR=lcb_v6_singles\nbash data/prepare_data_splits.sh datasets/lcb_v6.json\n```\n\nRun the multi-turn baseline for, e.g., question 120:\n\n```bash\npython baseline_multiturn/multiturn.py --data-dir=lcb_v6_singles/q_120 --run-name multiturn_q120\n```\n\nOr, for all hard questions:\n\n```bash\nbash experiments/ttt/run_multiturn_all.sh\n```\n\n---\n\n## Usage Documentation\n\nThis section documents the configuration options added by SDPO on top of the base verl framework.\n\n### Policy Loss Configuration\n\nLocated at `actor.policy_loss` in the config.\n\n- **loss_mode** (str, default: `\"vanilla\"`): Loss function mode. Set to `\"sdpo\"` to enable self-distillation. Options: `vanilla`, `sdpo`.\n\n### Self-Distillation Configuration\n\nLocated at `actor.self_distillation` in the config. Only active when `actor.policy_loss.loss_mode = \"sdpo\"`.\n\n#### Core Settings\n\n- **full_logit_distillation** (bool, default: `True`): Whether to use full-logit KL distillation.\n\n- **alpha** (float, default: `0.5`): KL interpolation coefficient. `0.0` = forward KL, `1.0` = reverse KL, `0.5` = JSD.\n\n- **success_reward_threshold** (float, default: `1.0`): Minimum sequence reward to be considered a successful demonstration.\n\n- **teacher_regularization** (str, default: `\"ema\"`): Teacher regularization mode. Options: `ema`, `trust-region`. Note: if `ema` is used, the model on the `RefWorker` is updated as an exponential moving average. 
`trust-region` requires `use_fused_kernels = False`.\n\n- **teacher_update_rate** (float, default: `0.05`): EMA update rate for teacher weights, or trust-region mixing coefficient.\n\n- **distillation_topk** (int | None, default: `100`): If set, use top-k logits for distillation instead of full distribution.\n\n- **distillation_add_tail** (bool, default: `True`): Whether to add a tail bucket for top-k distillation.\n\n- **is_clip** (float | None, default: `2.0`): Clip value for importance sampling ratio. `None` disables IS weighting.\n\n#### Reprompting Settings\n\n- **max_reprompt_len** (int, default: `10240`): Maximum token length of the reprompted prompt.\n\n- **reprompt_truncation** (str, default: `\"right\"`): Truncation method for reprompted prompts. Options: `left`, `right`, `error`.\n\n- **dont_reprompt_on_self_success** (bool, default: `True`): If `True`, don't use a sample's own successful response as demonstration.\n\n- **remove_thinking_from_demonstration** (bool, default: `True`): Whether to remove `\u003cthink\u003e...\u003c/think\u003e` tags from demonstrations.\n\n#### Template Settings\n\n- **reprompt_template** (str): Main template for reprompting. Placeholders: `{prompt}`, `{solution}`, `{feedback}`.\n\n- **solution_template** (str): Template for the solution section. Placeholder: `{successful_previous_attempt}`.\n\n- **feedback_template** (str): Template for the feedback section. 
Placeholder: `{feedback_raw}`.\n\n#### Feedback Settings\n\n- **include_environment_feedback** (bool, default: `True`): Whether to include environment feedback (e.g., test errors) in reprompting.\n\n- **environment_feedback_only_without_solution** (bool, default: `True`): If `True`, only use feedback when no successful solution is available.\n\n---\n\n## Citation\nIf you find this work helpful, please cite us.\n\n```bibtex\n@article{hubotter2026reinforcement,\n  title = {Reinforcement Learning via Self-Distillation},\n  author = {Hübotter, Jonas and Lübeck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Kleine Buening, Thomas and Guestrin, Carlos and Krause, Andreas},\n  year = {2026},\n  journal = {arXiv preprint arXiv:2601.20802},\n}\n```\n\n## Attribution\n\nOur implementation is based on a recent version of [verl](https://github.com/verl-project/verl).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flasgroup%2FSDPO","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flasgroup%2FSDPO","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flasgroup%2FSDPO/lists"}