https://github.com/ganler/code-r1
Reproducing R1 for Code with Reliable Rewards
- Host: GitHub
- URL: https://github.com/ganler/code-r1
- Owner: ganler
- License: apache-2.0
- Created: 2025-02-15T23:33:23.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-02T05:33:34.000Z (about 1 year ago)
- Last Synced: 2025-03-02T06:23:38.699Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 2.38 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - ganler/code-r1 - The Code-R1 project aims to reproduce and improve DeepSeek's R1 model so that it performs better on code-generation tasks. The project focuses on improving code-generation performance through reliable reward mechanisms, avoiding noise in the reward signal. The core idea is to use more precise test cases and code-verification methods to give the model more reliable feedback, guiding it toward generating more correct code. The project likely includes modules for dataset processing, model training, reward-function design, and evaluation metrics. By improving the reward mechanism, Code-R1 aims to address problems R1-style models encounter in code generation, such as limited generalization and incorrect code. The project likely uses techniques such as reinforcement learning or imitation learning to train the model, and evaluates code quality via execution results or static analysis. Implementation details and experimental results can be found in the project's repository. The overall goal is to build a stronger, more reliable code-generation model. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Code-R1: Reproducing R1 for Code with Reliable Rewards
This repository includes implementations to reproduce the R1 pipeline for code generation:
* **Result:** 2K Code-R1 samples + `Qwen2.5-7B-Instruct-1M` beats `Qwen2.5-Coder-7B-Instruct` (even better w/ 12K samples).
* **Finding:** Quality of rewards matters. False positives in datasets and execution confuse the model.
* **Implementation:** A reliable, scalable, and sandboxed pipeline to minimize reward false positives in datasets and execution.
More results and findings to come...
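The core reward idea: a rollout earns full credit only when its extracted program passes every test inside the sandbox; anything else is penalized. Below is a minimal sketch of that all-or-nothing shape (the extraction format, penalty values, and function names are illustrative, not the repo's exact implementation; `run_in_sandbox` is sketched in the Sandboxing section):
```python
# Minimal sketch of an all-or-nothing code reward: full credit only if the
# extracted program passes every test; penalty values are illustrative.
import re
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, spelled out to keep this README block intact

def extract_code(response: str) -> Optional[str]:
    """Pull the last fenced Python block out of a model response (assumed format)."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(response)
    return blocks[-1] if blocks else None

def compute_reward(response: str, tests: list[str], run_in_sandbox) -> float:
    """`run_in_sandbox(program) -> bool` executes the program in isolation."""
    code = extract_code(response)
    if code is None:          # malformed output: no code block to grade
        return -1.0
    program = code + "\n" + "\n".join(tests)  # e.g., assert-style test lines
    return 1.0 if run_in_sandbox(program) else -0.5
```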
## Setup
### Environment
```bash
# For training
pip install -e .
pip install vllm==0.7.3
pip install flash-attn --no-build-isolation
pip install wandb IPython matplotlib gpustat # utility
```
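After installing, a quick import check (an optional sanity step, not part of the repo) catches a broken `flash-attn` or `vllm` build before you launch training:
```python
# Optional sanity check: all three imports must succeed on the training machine.
import torch, vllm, flash_attn  # noqa: F401

print("torch", torch.__version__, "| vllm", vllm.__version__,
      "| CUDA available:", torch.cuda.is_available())
```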
### Sandboxing
I tried multiple approaches to sandboxing, including code-execution servers, dockerized Python, paid services, etc.
`firejail` is the approach I found that meets all three requirements:
1. Reliability -- false positives arise when "autograders" hit concurrency issues (timeouts), violate OS limits (`OSError`), etc.
2. Scalability -- dockerized Python, for example, is generally reliable but too slow (~20 samples/s on 192 cores).
3. Security -- ... otherwise the school IT will email you and shut down your server...
```bash
sudo add-apt-repository ppa:deki/firejail
sudo apt-get update
sudo apt-get install firejail firejail-profiles
```
### Datasets
The current version has 12K RL samples (prompt + tests) at [🤗 ganler/code-r1-12k](https://huggingface.co/datasets/ganler/code-r1-12k):
* [2K LeetCode data](https://github.com/newfacade/LeetCodeDataset) where the tests are generally reliable
* 10K verified data filtered from 26K [TACO](https://huggingface.co/datasets/BAAI/TACO) data.
In general, it is suggested to validate the data and sandbox on every dataset and environment before training code RL.
Directly using noisy data and mismatched environments can lead to reward false positives, confusing the model.
This noise can come from (i) wrong tests, (ii) unsolvable prompts (e.g., image tags), and (iii) execution-environment mismatches.
To produce locally validated RL data:
```bash
python examples/data_preprocess/coder1.py
```
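The validation idea is simple: keep a sample only if its reference solution passes the sample's own tests inside the same sandbox used at training time. A minimal sketch (the field names are hypothetical, not the script's actual schema):
```python
# Minimal sketch of dataset validation: drop samples whose reference solution
# cannot pass its own tests in the training-time sandbox.
def validate_samples(samples):
    kept = []
    for s in samples:  # s: {"solution": str, "tests": [str, ...]} (hypothetical schema)
        program = s["solution"] + "\n" + "\n".join(s["tests"])
        if run_in_sandbox(program):   # same runner as in the Sandboxing section
            kept.append(s)
    return kept
```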
### Run!
```bash
bash main_grpo.sh
```
> [!NOTE]
>
> The script is optimized for a single-node 8x H200 setup. You may need to customize the settings for your own workstation.
## Code-R1 Zero based on 7B models
We trained two models from Qwen2.5-7B-Instruct-1M using pure R1-Zero-style RL:
* [🤗 CodeR1-Zero-Qwen2.5-7B-12k-832](https://huggingface.co/ganler/CodeR1-Zero-Qwen2.5-7B-12k-832): using 12K RL samples trained in 832 steps ([training logs](https://api.wandb.ai/links/llm4code/y13vs8d9)).
* [🤗 CodeR1-Zero-Qwen2.5-7B-LC2k-1088](https://huggingface.co/ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088): using 2K RL samples from LeetCode, trained in 1088 steps ([training logs](https://api.wandb.ai/links/llm4code/k8q6zu51)).
| Model | LCB (v5) | HumanEval+ | MBPP+ | **Average** |
|------------------------------------------------|---------------|----------------|-------------|------------:|
| Qwen2.5-7B-Instruct-1M | 24.0 | 80.5 | 66.7 | 57.1 |
| + Code-R1-Zero (2k - 1088s GRPO) | 28.6 | 84.8 | 70.1 | 61.2 |
| + Code-R1-Zero (12k - 832s GRPO) | 29.7 | 83.5 | 74.3 | 🌟**62.5** |
* 2K LeetCode training samples already show promising results, without any additional SFT or distillation.
* Extending the data to 12K samples (10K more verified samples from TACO) further improves performance.
Some existing models:
| Model | LCB (v5) | HumanEval+ | MBPP+ | **Average** |
|------------------------------------------------|---------------|----------------|-------------|------------:|
| Qwen2.5-Coder-7B-Instruct | 31.1 | 82.3 | 69.6 | 61.0 |
| Eurus-2-7B-PRIME | 23.8 | 65.9 | 29.9 | 39.9 |
| Sky-T1-7B | 21.3 | 54.3 | 50.8 | 42.1 |
* Qwen2.5-Coder-7B-Instruct, despite being released months earlier, remains the strongest baseline, though it is unclear where its improvement comes from.
* Eurus-2-7B-PRIME starts from Qwen2.5-Math-7B-Instruct and is RL-only. Its training data includes extensive (unfiltered) coding datasets, including APPS, CodeContests, TACO, and Codeforces. Code-R1-Zero outperforms it significantly despite using less data, likely because we use validated datasets and sandboxes.
* Sky-T1-7B uses a combination of RL and SFT/distillation steps. Its RL stage partially uses PRIME, but its training data does not seem to include coding datasets.
## Citation
If you find this work helpful...
```bibtex
@article{code-r1,
title={Code-R1: Reproducing R1 for Code with Reliable Rewards},
author={Liu, Jiawei and Zhang, Lingming},
howpublished={\url{https://github.com/ganler/code-r1}},
year={2025}
}
```
## Acknowledgements
* [Verl](https://github.com/volcengine/verl)
* [Logic-RL](https://github.com/Unakar/Logic-RL)
## License
Apache-2.0. See [LICENSE.code-r1](LICENSE.code-r1) for more details.