https://github.com/CJReinforce/PURE
Official code for the paper, "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning"
llm mathematics o1 r1 reasoning reinforcement-finetuning reinforcement-learning rl
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/CJReinforce/PURE
- Owner: CJReinforce
- Created: 2025-02-09T10:56:28.000Z (4 months ago)
- Default Branch: verl
- Last Pushed: 2025-05-06T09:55:25.000Z (about 2 months ago)
- Last Synced: 2025-05-06T10:54:07.140Z (about 2 months ago)
- Topics: llm, mathematics, o1, r1, reasoning, reinforcement-finetuning, reinforcement-learning, rl
- Language: Python
- Homepage:
- Size: 31.7 MB
- Stars: 112
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
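- StarryDivineSky - CJReinforce/PURE - Official code for the paper "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning". The project tackles the credit-assignment problem in reinforcement learning and proposes min-form credit assignment ("stop summation"). Its core idea is to train reasoning with a process reward model whose step-level rewards are combined in min form: the value of a step is the minimum of the rewards from that step onward, not their sum. The key is that PURE avoids complex reward-function design and learns from process reward signals. The method performs well on a range of reasoning tasks, which supports its effectiveness at simplifying credit assignment. The code provides the components needed to reproduce the paper's experiments, making it easier for researchers to explore and apply the method. PURE's strengths are simplicity and efficiency; it offers a new approach for reinforcement learning, particularly for tasks that require complex reasoning. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)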
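The contrast between summed and min-form credit assignment can be illustrated with a minimal Python sketch. It assumes a single trajectory of per-step process rewards. The helper names `sum_form_returns` and `min_form_values` and the example rewards are hypothetical and do not come from the PURE repository. This illustrates the idea named in the paper's title and is not the repository's implementation.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Conventional credit assignment: the return at step t is the
    (optionally discounted) sum of the rewards from step t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_values(rewards: List[float]) -> List[float]:
    """Min-form credit assignment (sketch): the value at step t is the
    minimum of the rewards from step t onward, so one bad step caps
    the value of every step that precedes it."""
    values = [0.0] * len(rewards)
    running = float("inf")
    # Walk backwards keeping a running minimum instead of a running sum.
    for t in range(len(rewards) - 1, -1, -1):
        running = min(running, rewards[t])
        values[t] = running
    return values


if __name__ == "__main__":
    # Hypothetical process rewards for a 4-step solution; step 3 is wrong.
    step_rewards = [0.5, 0.25, -1.0, 0.75]
    print(sum_form_returns(step_rewards))  # [0.5, 0.0, -0.25, 0.75]
    print(min_form_values(step_rewards))   # [-1.0, -1.0, -1.0, 0.75]
```

With these rewards the summed return at the first step is still positive (0.5) despite the erroneous third step. The min-form value pulls every step up to and including the error down to -1.0, so credit is not inflated by the other high-reward steps.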