{"id":31786482,"url":"https://github.com/rlhflow/reinforce-ada","last_synced_at":"2025-10-11T13:01:38.355Z","repository":{"id":318359866,"uuid":"1069802891","full_name":"RLHFlow/Reinforce-Ada","owner":"RLHFlow","description":"An adaptive sampling framework for Reinforce-style LLM post training.","archived":false,"fork":false,"pushed_at":"2025-10-06T18:22:40.000Z","size":3013,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-06T19:27:00.990Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RLHFlow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-04T16:49:10.000Z","updated_at":"2025-10-06T18:25:50.000Z","dependencies_parsed_at":"2025-10-06T19:27:10.984Z","dependency_job_id":"142aba29-97ec-4288-b4bf-a80fcba2240e","html_url":"https://github.com/RLHFlow/Reinforce-Ada","commit_stats":null,"previous_names":["rlhflow/reinforce-ada"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/RLHFlow/Reinforce-Ada","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RLHFlow%2FReinforce-Ada","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RLHFlow%2FReinforce-Ada/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RLHFlow%2FReinforce-Ada/releases","manifests_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RLHFlow%2FReinforce-Ada/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RLHFlow","download_url":"https://codeload.github.com/RLHFlow/Reinforce-Ada/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RLHFlow%2FReinforce-Ada/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279003897,"owners_count":26083641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-10T12:58:17.349Z","updated_at":"2025-10-10T12:58:19.005Z","avatar_url":"https://github.com/RLHFlow.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training\n[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge\u0026logo=arxiv\u0026logoColor=white)](https://arxiv.org/abs/2510.04996) \n[![Github](https://img.shields.io/badge/Reinforce--Ada-000000?style=for-the-badge\u0026logo=github\u0026logoColor=000\u0026logoColor=white)](https://github.com/RLHFlow/Reinforce-Ada)\n[![Model on 
HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-sm.svg)](https://huggingface.co/collections/RLHFlow/reinforce-ada-68e3a8a10fc69dc56d9d86fe)\n[![Dataset on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-sm.svg)](https://huggingface.co/collections/RLHFlow/reinforce-ada-68e3a8a10fc69dc56d9d86fe)\n\u003c/div\u003e\n\n\n\n## 📢 Introduction\nThis repository contains the official implementation of Reinforce-Ada, an adaptive sampling framework designed to resolve the \"signal collapse\" problem in Reinforce-style algorithms with group baselines such as GRPO, making training more efficient and effective.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/result.png\" width=\"99%\" /\u003e\n\u003c/p\u003e\n\u003ci\u003e\u003cb\u003eFigure 1:\u003c/b\u003e Left: Adaptive sampling can be used with a one-line swap of the generation API in verl. Right: Reinforce-Ada significantly improves training efficiency and final performance compared to standard GRPO.\u003c/i\u003e\n\u003c/p\u003e\n\n\n### 🧐 The Challenge: Signal Collapse in GRPO\nGroup Relative Policy Optimization (GRPO) is a widely used algorithm in Reinforcement Learning from Verifiable Reward (RLVR). It computes each response's advantage by normalizing rewards within a group of n responses, where $\\bar{r}$ and $\\sigma_r$ are the mean and standard deviation of the group's rewards, which gives the per-response gradient:\n$$g_\\theta(x,a_i) =  \\frac{r_i - \\bar{r}}{\\sigma_r + \\varepsilon} \\cdot \\nabla_\\theta \\log \\pi_\\theta(a_i|x).$$\n\nWhile effective, GRPO suffers from a critical flaw in practice: **signal collapse**. 
When all n samples for a prompt yield the same reward (e.g., all correct or all incorrect), every advantage $r_i - \\bar{r}$ is zero, so **the gradient is zero** for every response and the prompt contributes no learning signal.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/demo_grpo_ratio.png\" width=\"67%\" /\u003e\n\u003c/p\u003e\n\u003ci\u003e\u003cb\u003eFigure 2:\u003c/b\u003e The proportion of prompts with zero gradient (uniform rewards) remains high during training.\u003c/i\u003e\n\nThis isn't a minor issue. It frequently occurs early in training (when models fail on hard prompts) and later in training (when models master easy ones). Crucially, this is a **statistical artifact of undersampling**, not a sign that the prompts are useless. A larger sample size n would often reveal a mix of correct and incorrect answers, unlocking a valid learning signal. For instance, the RL-trained model exhibits 35.3% all-correct groups at n=4, but only 10.2% at n=256. These results demonstrate that the missing signal is often recoverable with larger n, confirming that uniform-reward collapse is a sampling artifact rather than a model limitation.  \n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/passk.png\" width=\"83%\" /\u003e\n\u003c/p\u003e\n\n\u003ci\u003e\u003cb\u003eFigure 3:\u003c/b\u003e Increasing sample size (pass@k) reveals the model's true capability, confirming that signals are often recoverable.\u003c/i\u003e\n\u003c/p\u003e\n\nHowever, uniformly increasing n for all prompts is computationally prohibitive. Seminal works like DeepSeek-R1 show that a small group size (e.g., n=16) is sufficient for an effective gradient update. This reveals a gap between the large inference budget needed to find a signal and the smaller update budget needed to learn from it.\n\n\n### ✨ Our Solution Reinforce-Ada: Reinforce with Adaptive Sampling\nTo bridge this gap, we introduce Reinforce-Ada, an adaptive sampling framework that intelligently allocates the inference budget. 
Instead of a fixed n, our algorithm samples in rounds, deactivating prompts once a sufficient learning signal is found. This frees up computation, allowing difficult prompts to be sampled more deeply until a useful signal emerges.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/algo_reinforce_ada.png\" width=\"83%\" /\u003e\n\u003c/p\u003e\n\n\u003ci\u003e\u003cb\u003eAlgorithm 1:\u003c/b\u003e The Reinforce-Ada framework.\u003c/i\u003e\n\u003c/p\u003e\n\nOur framework consists of three core ideas:\n\n1. **Adaptive Sampling**: A successive elimination process that eliminates prompts with sufficient learning signals and keeps sampling the unsolved prompts.\n2. **Principled Exit Conditions**: Flexible rules (Reinforce-Ada-pos, Reinforce-Ada-balance) to determine when a prompt is resolved, balancing signal diversity and sampling efficiency.\n3. **Robust Advantage Calculation**: We compute the advantage baseline $(r_i-\\bar{r})$ using statistics from the entire pool of responses generated for a prompt, not just the final down-sampled batch, leading to more stable estimates.\n\n### Key Results\nOur experiments show that Reinforce-Ada consistently improves sample efficiency and final model performance across various models and benchmarks.\n\n| Model | Algorithm | **Math500** | **Minerva Math** | **Olympiad Bench** | **AIME-like** | **Weighted Average** |\n| :--- | :--- | :--- | :--- | :--- | :--- | :--- |\n| *Qwen2.5-Math-1.5B* | GRPO | 74.2 | 34.4 | 38.4 | 16.2 | 45.3 |\n| *Qwen2.5-Math-1.5B* | Reinforce-Ada-pos | 75.8 | 35.7 | 38.6 | 16.5 | 46.1 |\n| *Qwen2.5-Math-1.5B* | **Reinforce-Ada-balance** | 77.4 | 36.5 | 40.5 | 17.5 | **47.6 (+2.3)** |\n|:---|:---|:---|:---|:---|:---|:---|\n| *Qwen2.5-Math-1.5B (hard)* | GRPO | 71.0 | 31.8 | 34.3 | 13.8 | 41.9 |\n| *Qwen2.5-Math-1.5B (hard)* | Reinforce-Ada-pos | 73.9 | 33.1 | 36.4 | 16.4 | 44.6 |\n| *Qwen2.5-Math-1.5B (hard)* | **Reinforce-Ada-balance** | 74.7 | 33.7 | 38.7 | 17.6 | **45.5 (+3.6)** 
|\n|:---|:---|:---|:---|:---|:---|:---|\n| *Qwen2.5-Math-7B* | GRPO | 82.2 | 44.7 | 45.6 | 23.2 | 53.3 |\n| *Qwen2.5-Math-7B* | Reinforce-Ada-pos | 82.7 | 45.1 | 46.7 | 23.7 | 54.2 |\n| *Qwen2.5-Math-7B* | **Reinforce-Ada-balance** | 84.0 | 45.2 | 47.1 | 23.7 | **54.6 (+1.3)** |\n|:---|:---|:---|:---|:---|:---|:---|\n| *Qwen2.5-Math-7B (hard)* | GRPO | 80.7 | 42.8 | 42.9 | 21.8 | 51.3 |\n| *Qwen2.5-Math-7B (hard)* | Reinforce-Ada-pos | 82.4 | 43.1 | 45.0 | 22.2 | 52.8 |\n| *Qwen2.5-Math-7B (hard)* | **Reinforce-Ada-balance** | 83.1 | 43.4 | 46.4 | 24.9 | **53.9 (+2.6)** |\n|:---|:---|:---|:---|:---|:---|:---|\n| *LLaMA-3.2-3B-instruct* | GRPO | 51.7 | 20.5 | 20.4 | 7.2 | 27.9 |\n| *LLaMA-3.2-3B-instruct* | Reinforce-Ada-pos | 52.6 | 22.2 | 21.0 | 7.5 | 28.8 |\n| *LLaMA-3.2-3B-instruct* | **Reinforce-Ada-balance** | 53.2 | 22.4 | 21.2 | 8.0 | **29.1 (+1.2)** |\n|:---|:---|:---|:---|:---|:---|:---|\n| *Qwen3-4B-instruct* | GRPO | 90.4 | 51.2 | 64.9 | 38.5 | 66.5 |\n| *Qwen3-4B-instruct* | Reinforce-Ada-pos | 91.6 | 50.4 | 66.3 | 38.8 | 67.4 |\n| *Qwen3-4B-instruct* | **Reinforce-Ada-balance** | 91.7 | 53.0 | 65.7 | 38.8 | **67.6 (+1.1)** |\n\n\u003e **Table Notes**: The value `(+X.X)` indicates the improvement in Weighted Average score over the GRPO baseline for each model group.\n**Table 1**:\n\u003e Performance comparison of GRPO and Reinforce-Ada. We report average@32 accuracy with a sampling temperature of 1.0 and a maximum generation length of 4096 tokens. The weighted average score is computed according to the number of prompts in each benchmark. \"Hard\" indicates training on a more challenging prompt set, with details provided in the paper.\n\n\n## 🌍 Environment Setup\n1. 
Create a new environment.\n   ```bash\n   python -m venv ~/.python/reinforce_ada\n   source ~/.python/reinforce_ada/bin/activate\n\n   # You can also use conda \n   #conda create -n reinforce_ada python==3.10\n   #conda activate reinforce_ada\n   ```\n2. Install dependencies.\n   ```bash\n   pip install pip --upgrade\n   pip install uv\n   python -m uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124\n   python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation\n   git clone https://github.com/RLHFlow/Reinforce-Ada.git\n   cd ./Reinforce-Ada\n   python -m uv pip install -r requirements.txt\n   python -m uv pip install -e .\n   python -m uv pip install vllm==0.10.1\n   ```\n\n## 🧪 Experiment Running\n1. Prepare the training and test datasets.\n    ```bash\n    # adjust pass_rate to 0.125 and 0.313 for hard and easy prompt selection, respectively.\n    bash scripts/prepare_data.py \n    ```\n    You can skip this step by using our open-sourced training sets listed below.\n2. Start the training.\n   ```bash\n   # Check this file for more details\n   bash scripts/run_reinforce_ada.sh \n   ```\n   The key hyperparameters from Reinforce-Ada are:\n   - ``multiround_adaptive_downsampling=True``: Use adaptive sampling.\n   - ``reinforce_ada_choice=balanced``: How to balance the positive and negative prompts within a batch; one of [balanced, positive-focused].\n   - ``global_stat_est=True``: Use global statistics to calculate the mean and std.\n\n   For ``multi_round_adaptive_downsampling``, check [**verl/trainer/ppo/ray_trainer.py**](verl/trainer/ppo/ray_trainer.py)\n   \n   For GRPO with global statistics, check [**verl/trainer/ppo/core_algos.py**](verl/trainer/ppo/core_algos.py)\n\n3. 
Evaluate\n   ```bash\n   # Check this file for more details\n   bash scripts/eval_model.sh\n   ```\n   You can also evaluate our open-sourced checkpoints listed below.\n\n## 🤗 Processed Training Sets and Checkpoints\nWe also offer the processed/selected training prompts and trained models on [Hugging Face](https://huggingface.co/collections/RLHFlow/reinforce-ada-68e3a8a10fc69dc56d9d86fe). \n\nYou only need to run the following reformatting commands for verl training.\n  ```bash\n  # Convert to verl training format\n  echo \"Converting to verl training format...\"\n  python3 data_process/reformat.py \\\n      --local_dir ${output_dir} \\\n      --model_name_or_path ${model_name} \\\n      --data_source ${data_name}\n\n  # Generate validation set\n  echo \"Generating validation set...\"\n  python3 data_process/get_validation_set.py \\\n      --local_dir ${output_dir} \\\n      --model_name_or_path ${model_name} \n  ```\n\n\n  | Training set | Which model to train? |\n  | --- | --- |\n  |  [```RLHFlow/reinforce_ada_hard_prompt```](https://huggingface.co/datasets/RLHFlow/reinforce_ada_hard_prompt) | ```Qwen/Qwen2.5-Math-7B```, ```Qwen/Qwen3-4B-Instruct-2507``` |\n  | TBD | ```Qwen/Qwen2.5-Math-1.5B``` |\n  | TBD | ```meta-llama/Llama-3.2-3B-Instruct``` |\n\n  | Model | Prompt level | Algorithm | Checkpoint |\n  | --- | --- | --- | --- |\n  | ```Qwen/Qwen2.5-Math-1.5B``` | easy | Reinforce-Ada-Balance | TBD |\n  | ```Qwen/Qwen2.5-Math-1.5B``` | hard | Reinforce-Ada-Balance | TBD |\n  | ```Qwen/Qwen2.5-Math-7B``` | easy | Reinforce-Ada-Balance | TBD |\n  | ```Qwen/Qwen2.5-Math-7B``` | hard | Reinforce-Ada-Balance | TBD |\n  | ```Qwen/Qwen3-4B-Instruct-2507``` | hard | Reinforce-Ada-Balance | TBD |\n  | ```meta-llama/Llama-3.2-3B-Instruct``` | easy | Reinforce-Ada-Balance | TBD |\n\n\n## 🙏 Acknowledgement\nWe thank [verl](https://github.com/volcengine/verl) for providing the awesome training codebase, and [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) for 
its robust grader.\n\n## 📝 Citation\nIf you find our paper or code helpful, feel free to give us a citation.\n```bibtex\n@misc{xiong2025reinforceada,\n      title={Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training}, \n      author={Wei Xiong and Chenlu Ye and Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian and Nan Jiang and Tong Zhang},\n      year={2025},\n      eprint={2510.04996},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2510.04996}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frlhflow%2Freinforce-ada","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frlhflow%2Freinforce-ada","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frlhflow%2Freinforce-ada/lists"}