https://github.com/sail-sg/oat
🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.
- Host: GitHub
- URL: https://github.com/sail-sg/oat
- Owner: sail-sg
- License: apache-2.0
- Created: 2024-10-15T05:53:45.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-05-06T06:23:38.000Z (5 months ago)
- Last Synced: 2025-05-06T07:42:22.035Z (5 months ago)
- Topics: alignment, distributed-rl, distributed-training, dpo, dueling-bandits, grpo, llm, llm-aligment, llm-exploration, online-alignment, online-rl, ppo, r1-zero, reasoning, rlhf, thompson-sampling
- Language: Python
- Homepage:
- Size: 2.29 MB
- Stars: 338
- Watchers: 6
- Forks: 23
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - sail-sg/oat
README
[PyPI](https://pypi.org/project/oat-llm) | [License](https://github.com/sail-sg/oat/blob/main/LICENSE) | [Paper](https://arxiv.org/abs/2411.01493)

[Installation](#installation) | [Usage](#usage) | [Examples](./examples/) | [Citation](#citation)
---
## Updates
* 21/03/2025: We incorporate [Dr. GRPO](https://github.com/sail-sg/understand-r1-zero), which fixes the optimization bias in GRPO.
* 26/01/2025: We support reinforcement learning with verifiable rewards (RLVR) for math reasoning.
* 20/10/2024: We open-sourced Oat, an online LLM alignment framework developed during a research project on online LLM exploration ([sample-efficient alignment](https://arxiv.org/pdf/2411.01493)).
## Introduction

Oat 🌾 is a simple yet efficient framework for running **online** LLM alignment algorithms. Its key features include:
* **High Efficiency**: Oat implements a distributed *Actor-Learner-Oracle* architecture, with each component being optimized using state-of-the-art tools:
* `Actor`: Utilizes [vLLM](https://github.com/vllm-project/vllm) for accelerated online response sampling.
* `Learner`: Leverages [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO strategies to enhance memory efficiency.
* `Oracle`: Serves model-based oracles remotely via [Mosec](https://github.com/mosecorg/mosec), supporting dynamic batching, data parallelism, and pipeline parallelism.
* **Simplified Workflow**: Oat simplifies the experimental pipeline of LLM alignment. With an `Oracle` served online, we can flexibly query it for preference data labeling as well as anytime model evaluation. All you need to do is launch experiments and monitor real-time learning curves (e.g., win rate) on wandb (see [reproduced results](https://wandb.ai/lkevinzc/oat-llm)); there is no need to manually manage training, checkpointing, and loading models for evaluation.
* **Oracle Simulation**: Oat provides a diverse set of oracles to simulate preference/reward/verification feedback.
* Verifiable rewards are supported using rule-based functions (a minimal example is sketched after this list).
* Lightweight reward models run within the actor's process, enabling quick testing on as few as two GPUs.
* Larger and more capable reward models can be served remotely, harnessing additional compute and memory resources.
* LLM-as-a-judge is supported by querying the OpenAI API for model-based pairwise ranking (also sketched after this list).
* **Ease of Use**: Oat's modular structure allows researchers to easily inherit and modify existing classes, enabling rapid prototyping and experimentation with new algorithms.
* **Cutting-Edge Algorithms**: Oat implements state-of-the-art online algorithms, fostering innovation and fair benchmarking.
* PPO/Dr.GRPO (online RL) for math reasoning.
* Online DPO/SimPO/IPO for online preference learning.
* Online exploration (active alignment) algorithms, including [SEA](https://arxiv.org/abs/2411.01493), APL, and XPO.
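To make the rule-based verifiable rewards above concrete, here is a minimal sketch of a math-answer verifier. It is illustrative only and not oat's actual oracle interface; the function name and the `\boxed{...}` answer convention are assumptions made for the example.

```python
import re

def verifiable_math_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based verifier: reward 1.0 if the final boxed answer matches
    the reference exactly, else 0.0. Real oracles typically add numeric or
    symbolic equivalence checks and answer normalization."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Example: only the final boxed answer matters to this verifier.
print(verifiable_math_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```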
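Similarly, an LLM-as-a-judge oracle boils down to a pairwise comparison query against the OpenAI API. The snippet below is a simplified stand-in under an assumed prompt format and model choice, not oat's built-in judge.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_pairwise(prompt: str, response_a: str, response_b: str) -> int:
    """Ask an LLM judge which response is better; return 0 for A, 1 for B.
    A production judge would also randomize response order and handle ties."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for illustration
        messages=[
            {"role": "system", "content": "You are a strict judge. Answer with 'A' or 'B' only."},
            {"role": "user", "content": (
                f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\n"
                f"Response B:\n{response_b}\n\nWhich response is better?"
            )},
        ],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return 0 if verdict.startswith("A") else 1
```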
## Installation

In a Python environment with a supported version (we recommend `3.10`), you can install oat via PyPI:
```shell
pip install vllm==0.8.4 && pip install -U oat-llm
```
Alternatively, you can install oat in editable mode for local development:
```shell
git clone git@github.com:sail-sg/oat.git
cd oat
pip install vllm==0.8.4 && pip install -e .
```

## Usage
Please refer to [this file](https://github.com/sail-sg/understand-r1-zero/blob/main/train_zero_math.py) for a self-contained example showing how to implement Dr. GRPO for R1-Zero-like training with oat 🌾; a rough sketch of how the Dr. GRPO objective differs from GRPO follows below. Additionally, we provide a guide on [online preference learning with active exploration](./docs/alignment_as_cdb.md).
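As a rough, didactic sketch of the objective difference (based on the Dr. GRPO paper, not the training code from the linked example): GRPO divides the group-centered reward by the group standard deviation and normalizes the token loss by each response's length, while Dr. GRPO removes both normalizations. The function names below are illustrative.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # GRPO: center by the group mean AND divide by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # Dr. GRPO: only center by the group mean (no std division).
    return rewards - rewards.mean()

# Toy group of 4 sampled responses for one prompt, with 0/1 verifiable rewards.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))     # std-normalized advantages (GRPO)
print(dr_grpo_advantages(rewards))  # unnormalized advantages (Dr. GRPO)

# The other bias Dr. GRPO removes is on the loss side: GRPO averages per-token
# losses over each response's own length, which under-penalizes long incorrect
# responses; Dr. GRPO uses a constant normalizer instead.
```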
## Citation
If you find this codebase useful for your research, please consider citing:

- LLM online alignment framework:
```bibtex
@misc{liu2024oat,
title={OAT: A research-friendly framework for LLM online alignment},
author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},
year={2024},
howpublished={\url{https://github.com/sail-sg/oat}},
}
```

- Online exploration method:
```bibtex
@article{liu2024sea,
title={Sample-Efficient Alignment for LLMs},
author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},
journal={arXiv preprint arXiv:2411.01493},
year={2024}
}
```

## License
`oat` is distributed under the terms of the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
## Acknowledgement
We thank the following awesome projects that have contributed to the development of oat:
* [vLLM](https://github.com/vllm-project/vllm)
* [DeepSpeed](https://github.com/microsoft/DeepSpeed)
* [Mosec](https://github.com/mosecorg/mosec)
* [launchpad](https://github.com/google-deepmind/launchpad)
* [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)

## Disclaimer
This is not an official Sea Limited or Garena Online Private Limited product.