{"id":31756568,"url":"https://github.com/nvlabs/rlp","last_synced_at":"2025-10-09T19:19:24.274Z","repository":{"id":317127654,"uuid":"1064849471","full_name":"NVlabs/RLP","owner":"NVlabs","description":"RLP: Reinforcement as a Pretraining Objective","archived":false,"fork":false,"pushed_at":"2025-09-29T02:19:56.000Z","size":20,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-29T03:23:45.121Z","etag":null,"topics":["grpo","language-modeling","large-language-models","policy-gradient","pretraining","reasoning","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-26T16:41:24.000Z","updated_at":"2025-09-29T02:19:59.000Z","dependencies_parsed_at":"2025-09-29T03:23:46.514Z","dependency_job_id":"1d81d654-847c-4671-abf3-715dd58abd2e","html_url":"https://github.com/NVlabs/RLP","commit_stats":null,"previous_names":["nvlabs/rlp"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/NVlabs/RLP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FRLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FRLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FRLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FRLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVlabs","download_url":"https://codeload.github.com/NVlabs/RLP/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVlabs%2FRLP/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001981,"owners_count":26083243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["grpo","language-modeling","large-language-models","policy-gradient","pretraining","reasoning","reinforcement-learning"],"created_at":"2025-10-09T19:19:21.194Z","updated_at":"2025-10-09T19:19:24.268Z","avatar_url":"https://github.com/NVlabs.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# RLP: Reinforcement as a Pretraining Objective \n\n[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/RLP.svg?style=social)](https://github.com/NVlabs/RLP/stargazers)\n\nOfficial repository of [**RLP: Reinforcement as a Pretraining Objective**](https://arxiv.org/abs/2510.01265). \n\n\n_A verifier‑free, information‑gain objective that teaches models to “think before predicting” during pre‑training._\n\n[![Paper](https://img.shields.io/badge/Paper-arXiv-TBD)](https://arxiv.org/abs/2510.01265)\n\n[Ali Hatamizadeh[^1]](https://research.nvidia.com/person/ali-hatamizadeh),\n[Syeda Nahida Akter[^1]](https://snat1505027.github.io/),\n[Shrimai Prabhumoye[^1]](https://shrimai.github.io/),\n[Jan Kautz](https://jankautz.com/),\n[Mostofa Patwary](https://sites.google.com/view/mostofa-patwary),\n[Mohammad Shoeybi](https://developer.nvidia.com/blog/author/mshoeybi/),\n[Bryan Catanzaro](https://developer.nvidia.com/blog/author/bcatanzaro/),\n[Yejin Choi](https://yejinc.github.io/).\n\n[^1]: Equal Contribution\n\n**Teach models to think *during* pretraining, not just after.**\n\n\u003cimg width=\"1829\" height=\"433\" alt=\"framework\" src=\"https://github.com/user-attachments/assets/db9bec5f-0912-464f-accb-f27e4967983e\" /\u003e\n\n\u003e We introduce **RLP (Reinforcement Learning Pre‑training)**: treat chain‑of‑thought (CoT) as an *action* taken before next‑token prediction, and reward it by the **information gain** it provides on the observed next token. This yields a **verifier‑free, dense** reward that can be applied to ordinary pre‑training text. On **Qwen3‑1.7B‑Base**, RLP improves the overall math+science average by **≈ +19%** over the base model and **≈ +17%** over compute‑matched continuous pre‑training; after identical post‑training the gains **compound**. On a **12B hybrid Mamba‑Transformer (NeMo‑12B)**, the overall average rises from **42.81 → 61.32** (+18.51 points), with large science reasoning gains.\n\n---\n\n## Next token prediction comparison \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/a57a0b88-6687-4f0d-9cf0-bd83dd56eb49\" width=62% height=62% \nclass=\"center\"\u003e\n\u003c/p\u003e\n\n\n## Key results\n\n### 🔹 Qwen3 1.7B Base\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/77a75776-8cfc-4e45-ad53-1900e4ea8fa9\" width=62% height=62% \nclass=\"center\"\u003e\n\u003c/p\u003e\n\n\n* **Setup:**\n\n  * We compare **RLP** against both the base model (**BASE**) and a compute matched **Continuous Pretraining (CPT)** baseline.\n  * All models use the same **SFT + RLVR post training** pipeline for a fair comparison.\n\n* **Pretraining Gains:**\n\n  * **RLP outperforms BASE by +19%** and **CPT by +17%** on average across math and science benchmarks.\n  * These improvements come **without extra compute**, showing the gains are from methodology rather than raw FLOPs.\n\n* **Post Training Synergy:**\n\n  * After identical SFT + RLVR, **RLP compounds its advantage**, achieving:\n\n    * **+8% relative over BASE+Post**\n    * **+7% relative over CPT+Post**\n  * This shows that **RLP builds durable reasoning foundations** that are strengthened, not erased, by downstream alignment.\n\n* **Takeaway:**\n\n  * Unlike next token prediction or continuous pretraining, **RLP instills reasoning during pretraining itself**.\n  * These early advantages persist through post training, giving models **stronger and more robust reasoning capabilities**.\n\n\n### 🔹 Nemotron Nano 12B v2 Base\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/8f343919-0e9a-4a11-9250-a8cdb99321d8\" width=62% height=62% \nclass=\"center\"\u003e\n\u003c/p\u003e\n\n* **Setup:**\n\n  * We compare an intermediate checkpoint of **Nemotron-Nano-12B-v2-Base** trained on **19.8T tokens** with **RLP applied for only 250M tokens**.\n  * The **BASE** model, in contrast, is trained fully on **20T tokens**.\n\n* **Pretraining Gains:**\n\n  * **RLP substantially outperforms BASE across all domains** despite using **~200B fewer tokens**.\n  * On average, **RLP is +35% better than BASE**, highlighting both efficiency and scalability.\n\n* **Domain Specific Improvements:**\n\n  * **Math performance** improves moderately.\n  * The largest gains are in **science reasoning**, where **Science Avg improves by +23 absolute points**.\n\n* **Takeaway:**\n\n  * The benefits of **RLP not only persist but amplify** at larger model scales.\n  * RLP generalizes effectively across architectures, yielding robust reasoning improvements even in hybrid models like Nemotron.\n\n\n## Citation\n\nIf you find RLP to be useful for your work, please consider citing our paper: \n\n```\n@article{hatamizadeh2025rlp,\n  title={RLP: Reinforcement as a Pretraining Objective},\n  author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},\n  journal={arXiv preprint arXiv:2510.01265},\n  year={2025}\n}\n```\n\n## Star History\n\n[![Stargazers repo roster for @NVlabs/RLP](https://bytecrank.com/nastyox/reporoster/php/stargazersSVG.php?user=NVlabs\u0026repo=RLP)](https://github.com/NVlabs/RLP/stargazers)\n\n[![Star History Chart](https://api.star-history.com/svg?repos=NVlabs/RLP\u0026type=Date)](https://star-history.com/#NVlabs/RLP\u0026Date)\n\n\n## Licenses\n\nCopyright © 2025, NVIDIA Corporation. All rights reserved.\n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvlabs%2Frlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvlabs%2Frlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvlabs%2Frlp/lists"}