{"id":27233063,"url":"https://github.com/knoveleng/open-rs","last_synced_at":"2025-04-10T14:11:21.690Z","repository":{"id":283485089,"uuid":"950374312","full_name":"knoveleng/open-rs","owner":"knoveleng","description":"Official repo for paper: \"Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't\"","archived":false,"fork":false,"pushed_at":"2025-03-20T13:11:22.000Z","size":1303,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T14:22:38.005Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/knoveleng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-18T04:07:48.000Z","updated_at":"2025-03-20T13:03:35.000Z","dependencies_parsed_at":"2025-03-20T14:32:45.503Z","dependency_job_id":null,"html_url":"https://github.com/knoveleng/open-rs","commit_stats":null,"previous_names":["knoveleng/open-rs"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knoveleng%2Fopen-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knoveleng%2Fopen-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knoveleng%2Fopen-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knoveleng%2Fopen-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/knoveleng","download_url":
"https://codeload.github.com/knoveleng/open-rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248232420,"owners_count":21069487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-10T14:11:18.912Z","updated_at":"2025-04-10T14:11:21.671Z","avatar_url":"https://github.com/knoveleng.png","language":"Python","funding_links":[],"categories":["Projects","A01_文本生成_文本对话"],"sub_categories":["Large Language Models","大语言对话模型及数据"],"readme":"# Open RS\n\nThis repository hosts the code and datasets for the **Open RS** project, accompanying the paper [*Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t*](https://arxiv.org/abs/2503.16219). The project explores enhancing reasoning capabilities in small large language models (LLMs) using reinforcement learning (RL) under resource-constrained conditions.\n\nWe focus on a 1.5-billion-parameter model, `DeepSeek-R1-Distill-Qwen-1.5B`, trained on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. By adapting the Group Relative Policy Optimization (GRPO) algorithm and leveraging a curated, compact mathematical reasoning dataset, we conducted three experiments to assess performance and behavior. 
Key findings include:\n\n- Significant reasoning improvements, e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, outperforming `o1-preview`.\n- Efficient training with just 7,000 samples at a cost of $42, compared to thousands of dollars for baseline models.\n- Challenges like optimization instability and length constraints with extended training.\n\nThese results showcase RL-based fine-tuning as a cost-effective approach for small LLMs, making reasoning capabilities accessible in resource-limited settings. We open-source our code, models, and datasets to support further research.\n\n![Performance Metrics](assets/overall.png)\n\n## Resources\n\n### Models\n- [Open-RS1](https://huggingface.co/knoveleng/Open-RS1)\n- [Open-RS2](https://huggingface.co/knoveleng/Open-RS2)\n- [Open-RS3](https://huggingface.co/knoveleng/Open-RS3)\n- Additional models in training: [knoveleng/OpenRS-GRPO](https://huggingface.co/knoveleng/OpenRS-GRPO/commits/main), [quyanh/OpenRS-GRPO](https://huggingface.co/quyanh/OpenRS-GRPO/commits/main)\n\n### Datasets\n- [open-s1](https://huggingface.co/datasets/knoveleng/open-s1)\n- [open-deepscaler](https://huggingface.co/datasets/knoveleng/open-deepscaler)\n- [open-rs](https://huggingface.co/datasets/knoveleng/open-rs) (used in Experiments 2 and 3)\n\n### Collection\n- [Open-RS Collection](https://huggingface.co/collections/knoveleng/open-rs-67d940abc201a7e7f252ca4e)\n\n## Installation\n\n### Prerequisites\nInstall `uv` for managing virtual environments:\n```bash\ncurl -LsSf https://astral.sh/uv/install.sh | sh\nexport PATH=\"$HOME/.local/bin:$PATH\"\n```\n\nSet up a virtual environment with Python 3.11:\n```bash\nuv venv openr1 --python 3.11\nsource openr1/bin/activate\nuv pip install --upgrade pip\nexport UV_LINK_MODE=copy\n```\n\n### Dependencies\nInstall `vLLM` and `FlashAttention`:\n```bash\nuv pip install vllm==0.7.2\nuv pip install setuptools\nuv pip install flash-attn --no-build-isolation\n```\n\n\u003e **Note**: This 
installs PyTorch `v2.5.1`, which is required for `vLLM` compatibility. Using a different version may cause issues.\n\nInstall the project in editable mode with its development extras:\n```bash\nGIT_LFS_SKIP_SMUDGE=1 uv pip install -e \".[dev]\"\n```\n\n### Authentication\nLog in to Hugging Face and Weights \u0026 Biases:\n```bash\nhuggingface-cli login\nwandb login\n```\n\n### Git LFS\nEnsure Git LFS is installed for model/dataset management:\n```bash\ngit-lfs --version\n```\nIf it is not installed (Debian/Ubuntu):\n```bash\nsudo apt-get install git-lfs\n```\n\n## Training\n\nTrain models using a YAML config. On 4 GPUs, set `num_processes=3`; as in the upstream open-r1 setup, one GPU is left free for vLLM generation:\n```bash\nACCELERATE_LOG_LEVEL=info accelerate launch \\\n  --config_file recipes/accelerate_configs/zero2.yaml \\\n  --num_processes=3 \\\n  src/open_r1/grpo.py \\\n  --config recipes/grpo.yaml\n```\n\nFor Experiment 3, add the `cosine_max_len` parameter:\n```bash\nACCELERATE_LOG_LEVEL=info accelerate launch \\\n  --config_file recipes/accelerate_configs/zero2.yaml \\\n  --num_processes=3 \\\n  src/open_r1/grpo.py \\\n  --config recipes/grpo.yaml \\\n  --cosine_max_len 3584\n```\n\n## Evaluation\n\nEvaluate models using `lighteval` with custom tasks in `src/open_r1/evaluate.py`. 
For single-GPU setups:\n```bash\nMODEL=knoveleng/Open-RS3\nMODEL_ARGS=\"pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\n# Example: AIME 2024\nTASK=aime24\nlighteval vllm \"$MODEL_ARGS\" \"custom|$TASK|0|0\" \\\n  --custom-tasks src/open_r1/evaluate.py \\\n  --use-chat-template \\\n  --output-dir \"$OUTPUT_DIR\"\n```\n\n\u003e **Important**: Set `max_model_length=32768` to match `max_new_tokens`, or `lighteval` will fail.\n\nFor multi-GPU evaluation with data parallelism:\n```bash\nNUM_GPUS=4\nMODEL=knoveleng/Open-RS3\nMODEL_ARGS=\"pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nTASK=aime24\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm \"$MODEL_ARGS\" \"custom|$TASK|0|0\" \\\n  --custom-tasks src/open_r1/evaluate.py \\\n  --use-chat-template \\\n  --output-dir \"$OUTPUT_DIR\"\n```\n\nAlternatively, use the evaluation script:\n```bash\nsh eval.sh\n```\nModify tasks in `eval.sh` (line 8) as needed.\n\n### Performance Highlights\n- **Open-RS1**: 53.0% avg. score\n- **Open-RS2**: 55.7% avg. score, 80.0% on AMC23\n- **Open-RS3**: 56.3% avg. 
score, 46.7% on AIME24 (outperforms `o1-preview` at 44.6%)\n- Competitive MATH-500 scores; Minerva lags behind 7B models.\n\n![Performance Metrics](assets/performances.png)\n\n### Cost Efficiency\nOur approach uses 7,000 samples (42,000 total outputs) and costs ~$42 on 4x A40 GPUs in 24 hours, compared to:\n- 7B models: `Qwen2.5-7B-SimpleRL` ($1,633), `Eurus-2-7B-PRIME` ($1,088)\n- 1.5B models: `DeepScaleR-1.5B-Preview` ($3,629), `Still-3-1.5B-Preview` ($2,268)\n\n![7B Model Costs](assets/costs-7b.png)  \n![1.5B Model Costs](assets/costs-1.5b.png)\n\n## Acknowledgements\nThanks to the Hugging Face team for their [open-r1](https://github.com/huggingface/open-r1) project.\n\n## Citation\nIf this project aids your work, please cite it as:\n```\n@misc{dang2025reinforcementlearningreasoningsmall,\n      title={Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't}, \n      author={Quy-Anh Dang and Chris Ngo},\n      year={2025},\n      eprint={2503.16219},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2503.16219}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknoveleng%2Fopen-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fknoveleng%2Fopen-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknoveleng%2Fopen-rs/lists"}