{"id":27984401,"url":"https://github.com/sail-sg/oat","last_synced_at":"2025-05-08T05:01:55.629Z","repository":{"id":261158041,"uuid":"872813089","full_name":"sail-sg/oat","owner":"sail-sg","description":"🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.","archived":false,"fork":false,"pushed_at":"2025-05-06T06:23:38.000Z","size":2406,"stargazers_count":338,"open_issues_count":7,"forks_count":23,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-06T07:42:22.035Z","etag":null,"topics":["alignment","distributed-rl","distributed-training","dpo","dueling-bandits","grpo","llm","llm-aligment","llm-exploration","online-alignment","online-rl","ppo","r1-zero","reasoning","rlhf","thompson-sampling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sail-sg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-15T05:53:45.000Z","updated_at":"2025-05-01T15:36:54.000Z","dependencies_parsed_at":"2024-11-05T02:29:34.243Z","dependency_job_id":"978069da-e3ed-4b32-bc41-e3828abb474b","html_url":"https://github.com/sail-sg/oat","commit_stats":null,"previous_names":["sail-sg/oat"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Foat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Foat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/sail-sg%2Foat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Foat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sail-sg","download_url":"https://codeload.github.com/sail-sg/oat/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253002856,"owners_count":21838640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","distributed-rl","distributed-training","dpo","dueling-bandits","grpo","llm","llm-aligment","llm-exploration","online-alignment","online-rl","ppo","r1-zero","reasoning","rlhf","thompson-sampling"],"created_at":"2025-05-08T05:01:49.706Z","updated_at":"2025-05-08T05:01:55.618Z","avatar_url":"https://github.com/sail-sg.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./docs/new_logo.png\" width=90% alt=\"OAT\" /\u003e\n\u003c/p\u003e\n\n[![PyPI - Version](https://img.shields.io/pypi/v/oat-llm.svg)](https://pypi.org/project/oat-llm)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/oat-llm.svg)](https://pypi.org/project/oat-llm)\n[![License](https://img.shields.io/github/license/sail-sg/oat)](https://github.com/sail-sg/oat/blob/main/LICENSE)\n[![arXiv](https://img.shields.io/badge/arXiv-2411.01493-b31b1b.svg)](https://arxiv.org/abs/2411.01493)\n\n[Installation](#installation) | [Usage](#usage) | [Examples](./examples/) | [Citation](#citation)\n\n---\n\n## Updates\n* 21/03/2025: We incorporate [Dr. 
GRPO](https://github.com/sail-sg/understand-r1-zero), which fixes the optimization bias in GRPO.\n* 26/01/2025: We support reinforcement learning with verifiable rewards (RLVR) for math reasoning.\n* 20/10/2024: We open-source Oat, an online LLM alignment framework developed during a research project on online LLM exploration ([sample-efficient alignment](https://arxiv.org/pdf/2411.01493)).\n\n## Introduction\n\nOat 🌾 is a simple yet efficient framework for running **online** LLM alignment algorithms. Its key features include:\n\n* **High Efficiency**: Oat implements a distributed *Actor-Learner-Oracle* architecture, with each component optimized using state-of-the-art tools:\n  * `Actor`: Utilizes [vLLM](https://github.com/vllm-project/vllm) for accelerated online response sampling.\n  * `Learner`: Leverages [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO strategies to enhance memory efficiency.\n  * `Oracle`: Serves model-based oracles remotely via [Mosec](https://github.com/mosecorg/mosec), supporting dynamic batching, data parallelism and pipeline parallelism.\n* **Simplified Workflow**: Oat simplifies the experimental pipeline of LLM alignment. With an `Oracle` served online, we can flexibly query it for preference data labeling as well as anytime model evaluation. 
All you need to do is launch experiments and monitor real-time learning curves (e.g., win rate) on wandb (see [reproduced results](https://wandb.ai/lkevinzc/oat-llm)) — no need for manual training, checkpointing, or loading for evaluation.\n* **Oracle Simulation**: Oat provides a diverse set of oracles to simulate preference/reward/verification feedback.\n  * Verifiable rewards are supported via rule-based functions.\n  * Lightweight reward models run within the actor's process, enabling quick testing on as few as two GPUs.\n  * Larger and more capable reward models can be served remotely, harnessing additional compute and memory resources.\n  * LLM-as-a-judge is supported by querying the OpenAI API for model-based pairwise ranking.\n* **Ease of Use**: Oat's modular structure allows researchers to easily inherit and modify existing classes, enabling rapid prototyping and experimentation with new algorithms.\n* **Cutting-Edge Algorithms**: Oat implements state-of-the-art online algorithms, fostering innovation and fair benchmarking.\n  * PPO/Dr.GRPO (online RL) for math reasoning.\n  * Online DPO/SimPO/IPO for online preference learning.\n  * Online exploration (active alignment) algorithms, including [SEA](https://arxiv.org/abs/2411.01493), APL and XPO.\n\n## Installation\nIn a Python environment with a supported version (we recommend `3.10`), you can install oat via PyPI:\n```shell\npip install vllm==0.8.4 \u0026\u0026 pip install -U oat-llm\n```\nOr you can install in \"editable\" mode for local development:\n```shell\ngit clone git@github.com:sail-sg/oat.git\ncd oat\npip install vllm==0.8.4 \u0026\u0026 pip install -e .\n```\n\n## Usage\nPlease refer to [this file](https://github.com/sail-sg/understand-r1-zero/blob/main/train_zero_math.py) for a self-contained example showing how to implement Dr. 
GRPO for R1-Zero-like training with oat 🌾.\n\nWe also provide a guide on [online preference learning with active exploration](./docs/alignment_as_cdb.md).\n\n\u003c!-- ## Benchmarking\nThe benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** the computational efficiency of trl 🤗.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/system_configs.png\" width=97%/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/bench_results.png\" width=65% /\u003e\n\u003c/p\u003e\n\nPlease refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results. 
--\u003e\n\n## Citation\nIf you find this codebase useful for your research, please consider citing:\n\n- LLM online alignment framework:\n  ```bibtex\n  @misc{liu2024oat,\n    title={OAT: A research-friendly framework for LLM online alignment},\n    author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},\n    year={2024},\n    howpublished={\\url{https://github.com/sail-sg/oat}},\n  }\n  ```\n\n- Online exploration method:\n  ```bibtex\n  @article{liu2024sea,\n    title={Sample-Efficient Alignment for LLMs},\n    author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},\n    journal={arXiv preprint arXiv:2411.01493},\n    year={2024}\n  }\n  ```\n\n## License\n\n`oat` is distributed under the terms of the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.\n\n## Acknowledgement\nWe thank the following awesome projects that have contributed to the development of oat:\n* [vLLM](https://github.com/vllm-project/vllm)\n* [DeepSpeed](https://github.com/microsoft/DeepSpeed)\n* [Mosec](https://github.com/mosecorg/mosec)\n* [launchpad](https://github.com/google-deepmind/launchpad)\n* [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)\n\n## Disclaimer\n\nThis is not an official Sea Limited or Garena Online Private Limited product.\n","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Foat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsail-sg%2Foat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Foat/lists"}