https://github.com/nvlabs/rlp

RLP: Reinforcement as a Pretraining Objective

grpo language-modeling large-language-models policy-gradient pretraining reasoning reinforcement-learning

Host: GitHub
URL: https://github.com/nvlabs/rlp
Owner: NVlabs
License: other
Created: 2025-09-26T16:41:24.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-09-29T02:19:56.000Z (10 months ago)
Last Synced: 2025-09-29T03:23:45.121Z (10 months ago)
Topics: grpo, language-modeling, large-language-models, policy-gradient, pretraining, reasoning, reinforcement-learning
Homepage:
Size: 19.5 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# RLP: Reinforcement as a Pretraining Objective

[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/RLP.svg?style=social)](https://github.com/NVlabs/RLP/stargazers)

Official repository of [**RLP: Reinforcement as a Pretraining Objective**](https://arxiv.org/abs/2510.01265).

_A verifier‑free, information‑gain objective that teaches models to “think before predicting” during pre‑training._

[![Paper](https://img.shields.io/badge/Paper-arXiv-TBD)](https://arxiv.org/abs/2510.01265)

[Ali Hatamizadeh[^1]](https://research.nvidia.com/person/ali-hatamizadeh),
[Syeda Nahida Akter[^1]](https://snat1505027.github.io/),
[Shrimai Prabhumoye[^1]](https://shrimai.github.io/),
[Jan Kautz](https://jankautz.com/),
[Mostofa Patwary](https://sites.google.com/view/mostofa-patwary),
[Mohammad Shoeybi](https://developer.nvidia.com/blog/author/mshoeybi/),
[Bryan Catanzaro](https://developer.nvidia.com/blog/author/bcatanzaro/),
[Yejin Choi](https://yejinc.github.io/).

[^1]: Equal Contribution

**Teach models to think *during* pretraining, not just after.**

_[Figure: framework]_

> We introduce **RLP (Reinforcement Learning Pre‑training)**, which treats chain‑of‑thought (CoT) as an *action* taken before next‑token prediction and rewards it by the **information gain** it provides on the observed next token. This yields a **verifier‑free, dense** reward that can be applied to ordinary pre‑training text. On **Qwen3‑1.7B‑Base**, RLP improves the overall math+science average by **≈ +19%** over the base model and **≈ +17%** over compute‑matched continuous pre‑training; after identical post‑training the gains **compound**. On a **12B hybrid Mamba‑Transformer (Nemotron‑Nano‑12B‑v2‑Base)**, the overall average rises from **42.81 → 61.32** (+18.51 points), with large science reasoning gains.
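
The information-gain reward described above can be illustrated with a minimal sketch. It assumes the reward is the log-likelihood of the observed next token when the model conditions on its CoT, minus the log-likelihood of the same token without the CoT. The function names and the toy probabilities are hypothetical and are not taken from the RLP codebase; how the no-thought baseline is actually obtained is not specified here.

```python
import math


def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood ratio (in nats) of the observed next token.

    Assumption: both arguments are the probability assigned to the *observed*
    next token, once with the CoT in context and once without it.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


def dense_rewards(probs_with, probs_without):
    """One reward per position, so every token of ordinary text yields a signal."""
    return [information_gain_reward(a, b) for a, b in zip(probs_with, probs_without)]


# Toy values (made up for illustration): probability of the observed token
# at three positions, with and without a preceding thought.
probs_with = [0.60, 0.10, 0.30]
probs_without = [0.30, 0.20, 0.30]
rewards = dense_rewards(probs_with, probs_without)
```

In this toy case the first thought doubles the probability of the observed token and earns a reward of ln 2 (about 0.693 nats), the second halves it and is penalized by the same amount, and the third changes nothing and earns zero. No external verifier is needed because the observed text itself supplies the signal.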

---

## Next token prediction comparison

## Key results

### 🔹 Qwen3 1.7B Base

* **Setup:**

  * We compare **RLP** against both the base model (**BASE**) and a compute-matched **Continuous Pretraining (CPT)** baseline.
  * All models use the same **SFT + RLVR post-training** pipeline for a fair comparison.

* **Pretraining Gains:**

  * **RLP outperforms BASE by +19%** and **CPT by +17%** on average across math and science benchmarks.
  * The improvement over CPT comes **without extra compute**, showing the gains come from methodology rather than raw FLOPs.

* **Post-Training Synergy:**

  * After identical SFT + RLVR, **RLP compounds its advantage**, achieving:

    * **+8% relative over BASE+Post**
    * **+7% relative over CPT+Post**
  * This shows that **RLP builds durable reasoning foundations** that are strengthened, not erased, by downstream alignment.

* **Takeaway:**

  * Unlike next-token prediction or continuous pretraining, **RLP instills reasoning during pretraining itself**.
  * These early advantages persist through post-training, giving models **stronger and more robust reasoning capabilities**.

### 🔹 Nemotron Nano 12B v2 Base

* **Setup:**

  * We take an intermediate checkpoint of **Nemotron-Nano-12B-v2-Base** trained on **19.8T tokens** and apply **RLP for only 250M tokens**.
  * The **BASE** model, in contrast, is trained on the full **20T tokens**.

* **Pretraining Gains:**

  * **RLP substantially outperforms BASE across all domains** despite using **~200B fewer tokens**.
  * On average, **RLP is +35% better than BASE**, highlighting both efficiency and scalability.

* **Domain-Specific Improvements:**

  * **Math performance** improves moderately.
  * The largest gains are in **science reasoning**, where **Science Avg improves by +23 absolute points**.

* **Takeaway:**

  * The benefits of **RLP not only persist but amplify** at larger model scales.
  * RLP generalizes effectively across architectures, yielding robust reasoning improvements even in hybrid Mamba-Transformer models like Nemotron.
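
The token budgets quoted in the setup above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 250M RLP tokens are counted on top of the 19.8T-token checkpoint, and the constant and function names are hypothetical.

```python
# Token counts quoted in the Setup bullets, written as exact integers.
BASE_TOKENS = 20_000_000_000_000        # 20T tokens for the BASE model
CHECKPOINT_TOKENS = 19_800_000_000_000  # 19.8T tokens for the intermediate checkpoint
RLP_TOKENS = 250_000_000                # 250M tokens of RLP training


def token_budget(checkpoint: int, rlp: int, base: int) -> dict:
    """Compare the RLP run's total token count against the BASE run."""
    total = checkpoint + rlp  # assumption: RLP tokens are added to the checkpoint's budget
    return {
        "rlp_run_total": total,
        "fewer_than_base": base - total,
        "rlp_share": rlp / total,
    }


budget = token_budget(CHECKPOINT_TOKENS, RLP_TOKENS, BASE_TOKENS)
```

Under that assumption the RLP run sees 199.75B fewer tokens than BASE, which matches the "~200B fewer tokens" figure, and the RLP phase accounts for well under 0.01% of the run's total tokens.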

## Citation

If you find RLP useful for your work, please consider citing our paper:

```
@article{hatamizadeh2025rlp,
  title={RLP: Reinforcement as a Pretraining Objective},
  author={Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin},
  journal={arXiv preprint arXiv:2510.01265},
  year={2025}
}
```

## Star History

[![Stargazers repo roster for @NVlabs/RLP](https://bytecrank.com/nastyox/reporoster/php/stargazersSVG.php?user=NVlabs&repo=RLP)](https://github.com/NVlabs/RLP/stargazers)

[![Star History Chart](https://api.star-history.com/svg?repos=NVlabs/RLP&type=Date)](https://star-history.com/#NVlabs/RLP&Date)

## Licenses
