{"id":30861854,"url":"https://github.com/niutrans/gram","last_synced_at":"2025-09-07T17:09:31.280Z","repository":{"id":313133296,"uuid":"1003415331","full_name":"NiuTrans/GRAM","owner":"NiuTrans","description":"Code for ICML 2025 paper \"GRAM: A Generative Foundation Reward Model for Reward Generalization\"","archived":false,"fork":false,"pushed_at":"2025-09-04T03:47:47.000Z","size":17909,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-04T05:49:26.169Z","etag":null,"topics":["generalization","generative","reward-model","rlhf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NiuTrans.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-17T05:47:53.000Z","updated_at":"2025-09-04T03:47:52.000Z","dependencies_parsed_at":"2025-09-04T05:49:38.239Z","dependency_job_id":"8bfb2f6d-c214-43e7-8fea-0656aa6f2c82","html_url":"https://github.com/NiuTrans/GRAM","commit_stats":null,"previous_names":["niutrans/gram"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/NiuTrans/GRAM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NiuTrans%2FGRAM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NiuTrans%2FGRAM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NiuTrans%2FGRAM/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NiuTrans%2FGRAM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NiuTrans","download_url":"https://codeload.github.com/NiuTrans/GRAM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NiuTrans%2FGRAM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274066370,"owners_count":25216447,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generalization","generative","reward-model","rlhf"],"created_at":"2025-09-07T17:09:30.172Z","updated_at":"2025-09-07T17:09:31.271Z","avatar_url":"https://github.com/NiuTrans.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# A Generative Foundation Reward Model (GRAM)\n\nThis repository contains the code and released models for our paper [GRAM: A Generative Foundation Reward Model for Reward Generalization 📝](https://arxiv.org/abs/2506.14175). We propose a more effective approach to reward model training by combining both labeled and unlabeled data. Our method introduces a generative reward model that first learns from a large corpus of unlabeled data and is then fine-tuned with supervised data. 
Please find all the released model checkpoints at [this link 🤗](https://huggingface.co/collections/wangclnlp/gram-68452f737e53feeef4202d9b). To develop a reward model tailored to a specific task or domain, we recommend fine-tuning the released GRAM model using task-specific preference data. This strategy reduces the dependence on large-scale human annotations while maintaining strong performance on the target task.\n\n\n\u003cimg src=\"./gram.png\" width=\"1000px\"\u003e\u003c/img\u003e\n\n\n## 🆕 Changelog\n- [2025/9/1] We released the second-generation generative foundation reward model, GRAM-R^2, which incorporates reward reasoning and generates rationales before predicting preference labels. With LLaMA-3.1-8B-Instruct as the backbone, GRAM-R^2 achieves 85.7% accuracy on RM-Bench. Test-time scaling strategies (e.g., majority voting) can be integrated to further improve performance. More details are provided in this [README](./extensions/GRAM-RR/README.md) and our [paper](https://arxiv.org/pdf/2509.02492). \n- [2025/6/17] We trained GRAM using Qwen3 and achieved a score of 71.4 on JudgeBench with Qwen3-14B, significantly outperforming two strong open-source baselines: Llama-3.1-Nemotron-70B-Reward and Skywork-Reward-Gemma-2-27B-v0.2. \n- [2025/6/1] We performed additional data cleaning, such as the removal of overly long or corrupted samples, to help GRAM achieve better performance. The processed dataset is available at this [link](https://huggingface.co/datasets/wangclnlp/GRAM-pre-training-566k).\n- [2025/5/1] Our paper has been accepted by ICML 2025! 
\n\n\n## 🔗 Quick Links\n* [GRAM: A Generative Foundation Reward Model for Reward Generalization](#a-generative-foundation-reward-model-gram)\n\n  * [Changelog](#-changelog)\n  * [Released Models](#released-models)\n  * [Installation Guide](#installation-guide)\n  * [Preparing Models and Datasets](#preparing-models-and-datasets)\n  * [Training Scripts](#training-scripts)\n  * [Using GRAM in RLHF](#using-gram-in-rlhf)\n  * [Citation](#citation)\n  * [Acknowledgement](#acknowledgement)\n\n---\n\n## Released Models\n\nCheck out our GRAM series below. The models were first pre-trained on the dataset available [here](https://huggingface.co/datasets/wangclnlp/GRAM-pre-training-566k), and then fine-tuned on the dataset available [here](https://huggingface.co/datasets/wangclnlp/GRAM-fine-tuning-65k).\n\n- We evaluate our reward model on [JudgeBench](https://huggingface.co/datasets/ScalerLab/JudgeBench), a benchmark for evaluating LLM-as-a-Judge (i.e., generative reward model) applications, and present the results as follows:\n\n| Model | Param. | Knowl. | Reason. | Math | Coding | Avg. 
| \n|:-|-:|:-:|:-:|:-:|:-:|:-:|\n|[GRAM-Qwen3-14B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-14B-RewardModel) |14B|63.0|64.3|89.3|69.1|71.4|\n|[GRAM-LLaMA3.2-3B-RewardBench](https://huggingface.co/wangclnlp/GRAM-LLaMA3.2-3B-RewardModel) |3B|59.7|64.3|84.0|71.4|69.9|\n|[GRAM-Qwen3-8B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-8B-RewardModel) |8B|62.3|64.3|80.4|64.3|67.8|\n|nvidia/Llama-3.1-Nemotron-70B-Reward|70B|62.3|72.5|76.8|57.1|67.2|\n|[GRAM-Qwen3-4B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-4B-RewardModel) |4B|59.7|59.2|80.4|64.3|65.9|\n|[GRAM-Qwen3-1.7B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-1.7B-RewardModel)   |1.7B|60.4|65.3|78.6|57.1|65.4|\n|Skywork/Skywork-Reward-Gemma-2-27B-v0.2|27B|59.7|66.3|83.9|50.0|65.0|\n|Skywork/Skywork-Reward-Llama-3.1-8B-v0.2|8B|59.1|64.3|76.8|50.0|62.6|\n|internlm/internlm2-20b-reward|20B|62.3|69.4|66.1|50.0|62.0|\n\n- We also evaluate our reward model on the recently introduced [RM-Bench](https://github.com/THU-KEG/RM-Bench), a challenging benchmark for reward models, and present the results as follows:\n\n\n| Model | Param. | Chat |\tCode |\tMath |\tSafety |\tAvg. 
| \n|:-|-:|:-:|:-:|:-:|:-:|:-:|\n|nvidia/Llama-3.1-Nemotron-70B-Reward| 70B | 70.7|57.4|64.3|90.3|70.7|\n|Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 27B |71.8|56.6|59.2|94.3|70.5|\n|[GRAM-Qwen3-14B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-14B-RewardModel) |14B|67.4|55.2|62.8|94.3|69.9|\n|nvidia/Nemotron-340B-Reward| 340B| 71.2|59.4|59.8|87.5|69.5|\n|[GRAM-Qwen3-8B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-8B-RewardModel) |8B|63.5|53.9|62.9|92.8|68.3|\n|internlm/internlm2-20b-reward|20B|63.1|56.7|66.8|86.5|68.3|\n|[GRAM-Qwen3-4B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-4B-RewardModel) |4B|61.1|54.7|61.6|92.9|67.6|\n|[GRAM-Qwen3-1.7B-RewardBench](https://huggingface.co/wangclnlp/GRAM-Qwen3-1.7B-RewardModel)   |1.7B|59.6|53.6|59.6|91.8|66.2|\n|[GRAM-LLaMA3.2-3B-RewardBench](https://huggingface.co/wangclnlp/GRAM-LLaMA3.2-3B-RewardModel) |3B|56.8|50.0|56.3|88.7|63.0|\n\n## Installation Guide\n\nThe code of this repo is modified from [hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the LLaMA-Factory [issues](https://github.com/hiyouga/LLaMA-Factory/issues) for potential solutions. 
If the problem persists, please feel free to submit an issue in this repository.\n\n```bash\ngit clone git@github.com:NiuTrans/GRAM.git\ncd GRAM\npip install -e \".[torch,metrics]\" --no-build-isolation\n```\n\n## Preparing Models and Datasets\n\n### Datasets\n\n#### Pre-Training\n\nEach item of the dataset for GRAM pre-training should include the following three keys:\n\n- `instruction`: a prompt in the following template:\n  ```text\n  [User Question]\n  {your prompt here}\n  ```\n- `input`: the input for the prompt above; may be empty if there is none.\n- `output`: two responses in the following template:\n  ```text\n  [The Start of Assistant A's Answer]\n  {answer of assistant A}\n  [The End of Assistant A's Answer]\n\n  [The Start of Assistant B's Answer]\n  {answer of assistant B}\n  [The End of Assistant B's Answer]\n  ```\n\nAn example in JSON format:\n\n```json\n[\n  {\n    \"instruction\": \"[User Question]\\nCan dogs get covid?\\n\\n\",\n    \"input\": \"\",\n    \"output\": \"[The Start of Assistant A's Answer]\\nYes, indeed. ... [The End of Assistant A's Answer]\\n\\n[The Start of Assistant B's Answer]\\nMany of the symptoms are similar, including fever, coughing, loss of smell, etc. ...\\n[The End of Assistant B's Answer]\"\n  },\n  ...\n]\n```\n\n\n#### Fine-Tuning\n\nEach item of the dataset for GRAM fine-tuning should include the following three keys:\n\n- `instruction`: a prompt together with its two candidate responses, in the following template:\n  ```text\n  Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better.\n  Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 
Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.\n  Please directly output your final verdict by strictly following this format: \"A\" if assistant A is better, \"B\" if assistant B is better.\n\n  [User Question]\n  {your prompt here}\n\n  [The Start of Assistant A's Answer]\n  {answer of assistant A}\n  [The End of Assistant A's Answer]\n\n  [The Start of Assistant B's Answer]\n  {answer of assistant B}\n  [The End of Assistant B's Answer]\n\n  #Preferred:\n  ```\n- `input`: leave it empty.\n- `output`: the correct option, \"A\" or \"B\".\n\nAn example in json format:\n\n```json\n[\n  {\n    \"instruction\": \"Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. ... [User Question]... [The Start of Assistant A's Answer] ... [The Start of Assistant B's Answer] ...\",\n    \"input\": \"\",\n    \"output\": \"B\"\n  },\n  ...\n]\n```\n\n\n#### Note\n\u003e After pre-processing the datasets as above, DO NOT forget to register the dataset in [`data/dataset_info.json`](data/dataset_info.json).\n\u003e Example:\n\u003e ```json\n\u003e  {\n\u003e    ...,\n\u003e    \"dataset_name\": {\n\u003e      \"file_name\": \"/path/to/your/dataset\"\n\u003e    },\n\u003e    ...,\n\u003e  },\n\u003e ```\n\n\n## Training Scripts\n\n### Pre-Training\n\n```bash\nllamafactory-cli train examples/train_full/qwen3_pre_training_rm.yaml\n```\n\n### Fine-Tuning\n```bash\nllamafactory-cli train examples/train_full/qwen3_fine_tuning_rm.yaml\n```\n\nAt this stage, the released models can be directly fine-tuned on your own labeled, task- or domain-specific preference data to obtain a reward model that is well-adapted to the target application.\n\n\n\n### Evaluation\n\nThe evaluation scripts are in the subdirectory `evaluation/`:\n```bash\ncd evaluation/\nckpt_path=/path/to/your/model\n```\n\n- Evaluation with 
RewardBench\n\n  ```bash\n  python gram_eval.py -i allenai_reward_bench/filtered.json -m $ckpt_path -o $ckpt_path/reward-bench.res\n  echo -e \"RewardBench Evaluation Summary:\\n\"\n  python get_reward_bench_score.py $ckpt_path/reward-bench.res\n  ```\n\n- Evaluation with JudgeBench\n  ```bash\n  python gram_eval.py -i scalerlab_judgebench/gpt.json -m $ckpt_path -o $ckpt_path/judge-bench.res\n  echo -e \"JudgeBench Evaluation Summary:\\n\"\n  python get_judgebench_score.py $ckpt_path/judge-bench.res\n  ```\n\n- Evaluation with RM-Bench\n  ```bash\n  python gram_eval.py -i thu_keg_rm_bench/total_dataset.json -m $ckpt_path -o $ckpt_path/rm-bench.res\n  echo -e \"RM-Bench Evaluation Summary:\\n\"\n  python thu_keg_rm_bench/compute_accuracy.py $ckpt_path/rm-bench.res\n  ```\n\n## Using GRAM in RLHF\n\n### Computing Rewards for a Pair of Samples\n\n```python\ndef compute_pair_rewards(response_a, response_b):\n    # Compute rewards for response_a and response_b as in the demo `evaluation/gram_demo.py`\n    ...\n    return reward_response_a, reward_response_b\n```\n\n### PPO\n\nWhen applying GRAM to PPO training, we first generate a **reference response** and then compute a reward score using GRAM, which quantifies **how much better the sampled response is compared to the reference**. This score serves as the reward signal during PPO training. The basic idea, using the difference between the sampled and reference responses as the reward, has been shown to be effective in prior baseline methods. Additionally, inspired by [*ReMax*](https://arxiv.org/abs/2310.10505), we can use **greedy search** to construct the reference response. 
The detailed procedure is described below:\n\n```python\ndef ppo():\n    # Init dataset and model\n    ...\n    ref_model, policy_model, reward_model, value_model = ...\n    # Sample from policy model: generate with greedy search first, then sample as normal with top-p/top-k\n    response_greedy_search = greedy_search(policy_model, query)\n    response_normal = generate(policy_model, query, top_p=..., top_k=...)\n    # Reward of the sampled response, measured against the greedy-search reference\n    _, reward_response_normal = compute_pair_rewards(response_greedy_search, response_normal)\n    # Compute logits from ref_model, values from value_model and update with PPO loss\n    ...\n```\n\n### List-wise Response Ranking\nA common use case for list-wise response ranking is best-of-n sampling, where the goal is to select the single best response from a list. This can be accomplished using GRAM with a linear search approach, as illustrated below. To support parallel computation and improve efficiency, we also incorporate optimization strategies such as divide-and-conquer.\n\n```python\ndef list_wise_response_ranking():\n    # Init dataset and model\n    ...\n    # Generate from model\n    responses = [response0, response1, response2, ...]\n    # Compare pairwise and keep the response with the higher score\n    best_response = response0\n    for response in responses[1:]:\n        score_a, score_b = compute_pair_rewards(best_response, response)\n        if score_a \u003c score_b:\n            best_response = response\n\n    return best_response\n```\n\n## Citation\n```\n@inproceedings{wang2025gram,\ntitle={{GRAM}: A Generative Foundation Reward Model for Reward Generalization},\nauthor={Chenglong Wang and Yang Gan and Yifu Huo and Yongyu Mu and Qiaozhi He and MuRun Yang and Bei Li and Tong Xiao and Chunliang Zhang and Tongran Liu and JingBo Zhu},\nbooktitle={Forty-second International Conference on Machine Learning},\nyear={2025}\n}\n\n@misc{wang2025gramr2,\n      title={GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning}, \n      
author={Chenglong Wang and Yongyu Mu and Hang Zhou and Yifu Huo and Ziming Zhu and Jiali Zeng and Murun Yang and Bei Li and Tong Xiao and Xiaoyang Hao and Chunliang Zhang and Fandong Meng and Jingbo Zhu},\n      year={2025},\n      eprint={2509.02492},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2509.02492}, \n}\n```\n\n\n## Acknowledgement\nThis project builds on the excellent codebase of [hiyouga/LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) 🌹🌹🌹.\n\nWe would like to thank [Hang Zhou](https://github.com/stceum) for his help in open-sourcing the GRAM model series.\n\nWe also acknowledge the following papers:\n```text\n[1] Lambert, Nathan, et al. \"RewardBench: Evaluating reward models for language modeling.\" arXiv preprint arXiv:2403.13787 (2024).\n[2] Liu, Yantao, et al. \"RM-Bench: Benchmarking reward models of language models with subtlety and style.\" arXiv preprint arXiv:2410.16184 (2024).\n[3] Tan, Sijun, et al. \"JudgeBench: A benchmark for evaluating LLM-based judges.\" arXiv preprint arXiv:2410.12784 (2024).\n[4] Grattafiori, Aaron, et al. \"The Llama 3 herd of models.\" arXiv preprint arXiv:2407.21783 (2024).\n[5] Yang, An, et al. \"Qwen3 technical report.\" arXiv preprint arXiv:2505.09388 (2025).\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fniutrans%2Fgram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fniutrans%2Fgram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fniutrans%2Fgram/lists"}