{"id":13754054,"url":"https://github.com/liziniu/ReMax","last_synced_at":"2025-05-09T22:30:37.596Z","repository":{"id":204576294,"uuid":"705994984","full_name":"liziniu/ReMax","owner":"liziniu","description":"Code for Paper (ReMax: A Simple, Efficient and Effective Reinforcement Learning Method for Aligning Large Language Models)","archived":false,"fork":false,"pushed_at":"2023-12-16T12:38:36.000Z","size":1849,"stargazers_count":135,"open_issues_count":0,"forks_count":12,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-03T09:06:45.988Z","etag":null,"topics":["large-language-models","policy-gradient","reinforcement-learning","rlhf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liziniu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-10-17T05:25:36.000Z","updated_at":"2024-08-01T02:21:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"d2f82299-9185-437d-bc2f-ae906f6b4498","html_url":"https://github.com/liziniu/ReMax","commit_stats":{"total_commits":3,"total_committers":1,"mean_commits":3.0,"dds":0.0,"last_synced_commit":"536ba49cc2f9a1a76941045388a44757e2533e75"},"previous_names":["liziniu/remax"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liziniu%2FReMax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liziniu%2FReMax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liziniu%2FReMax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/liziniu%2FReMax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liziniu","download_url":"https://codeload.github.com/liziniu/ReMax/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224884615,"owners_count":17386121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","policy-gradient","reinforcement-learning","rlhf"],"created_at":"2024-08-03T09:01:38.037Z","updated_at":"2024-11-16T06:31:28.024Z","avatar_url":"https://github.com/liziniu.png","language":"Python","readme":"# ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models\n\n## Overview\n\nReMax is a reinforcement learning method tailored for reward maximization in RLHF.\n\n\u003cimg src='./images/framework.png' width='600'\u003e\n\n#### Simple Implementation\n\nReMax is easy to implement (with **6 lines of code**). We provide an implementation based on the DeepSpeed framework in this repository.\n\n\u003cimg src='./images/algorithm.png' width='600'\u003e\n\n#### Memory Efficient\n\nReMax is memory-efficient. 
Compared with PPO, ReMax saves about 50% of GPU memory, which can instead be used for a **1.3x larger batch size**.\n\n\u003cdetails\u003e\n\u003csummary\u003eResults of tuning Llama2-7B with A100-80GB GPUs\u003c/summary\u003e\n\n| GPUs | Offload | Method | Maximum Batch Size |\n| ---- | ------- | ------ | ------------------ |\n| 4    | False   | PPO    | ❌ (OOM)            |\n| 4    | False   | ReMax  | **4x26=104**       |\n| 4    | True    | PPO    | 4x30=120           |\n| 4    | True    | ReMax  | **4x40=160**       |\n| 1    | True    | PPO    | 1x32=32            |\n| 1    | True    | ReMax  | **1x42=42**        |\n\n*: Gradient checkpointing and ZeRO-2 are used for the LLM.\n\n*: ZeRO-3 and offload are used for the reward model and the reference model.\n\n\u003c/details\u003e\n\n#### Fast Training\n\nReMax runs fast. It does not need to train a value model and requires less computation. Usually, it achieves about a **2x training speed-up**.\n\n\u003cdetails\u003e\n\u003csummary\u003eResults of tuning Llama2-7B with A100-80GB GPUs\u003c/summary\u003e\n\n| GPUs | Offload | Method | Total Training Time |\n| ---- | ------- | ------ | ------------------- |\n| 4    | False   | PPO    | ❌ (OOM)             |\n| 4    | False   | ReMax  | **2.4h**            |\n| 4    | True    | PPO    | 6.0h                |\n| 4    | True    | ReMax  | **2.8h**            |\n| 1    | True    | PPO    | 22.0h               |\n| 1    | True    | ReMax  | **10.2h**           |\n\n*: Gradient checkpointing and ZeRO-2 are used for the LLM.\n\n*: ZeRO-3 and offload are used for the reward model and the reference model.\n\n*: Measurement is based on 45k training samples (with 1 epoch) from the full-hh-rlhf dataset.\n\n\u003c/details\u003e\n\n#### Easy to Tune\n\nReMax is easy to tune for good performance. 
On the AlpacaEval benchmark, when judged by GPT-4, ReMax achieves win rates of 84.22%, 75.28%, and 63.60% over SFT, DPO, and PPO, respectively.\n\n\u003cimg src='./images/alpacaeval_result.png' width='600'\u003e\n\n\n\n## Change Log\n\n- [2023-12-16] Add response samples of trained models and evaluation results of training speed.\n- [2023-10-18] Release the initial code.\n\n\n## How to use\n\n\n### Prepare\n\n\nThe Python environment can be set up using Anaconda with the provided `environment.yml` file.\n\n```\nconda env create -f environment.yml\nconda activate llm\n```\n\n\n### Step 1: SFT\n\n```\ncd step1_supervised_finetuning\n\n# OPT (1.3B)\nbash training_scripts/opt/run_opt_1.3b.sh\n\n# Llama2 (7B)\nbash training_scripts/llama2/run_llama2_1.3b.sh\n```\n\n### Step 2: Reward Learning\n\n```\ncd step2_reward_model_finetuning\n\n# OPT (1.3B)\nbash training_scripts/opt/run_opt_1.3b.sh\n\n# Llama2 (7B)\nbash training_scripts/llama2/run_llama2_1.3b.sh\n```\n\n### Step 3: RLHF\n\n```\ncd step3_rlhf_finetuning\n\n# OPT (1.3B)\nbash training_scripts/opt/run_opt_1.3b.sh\n\n# Llama2 (7B)\nbash training_scripts/llama2/run_llama2_1.3b.sh\n```\n\n\n## Acknowledgements\n\nOur code is heavily based on [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). 
Please follow the detailed instructions from DeepSpeed-Chat.\n\n\n## Bibtex\n\nIf you find this code helpful, please cite our paper in the following format.\n\n```\n@article{li2023remax,\n  title   = {ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models},\n  author  = {Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},\n  journal = {arXiv preprint arXiv:2310.10505},\n  year    = {2023},\n}\n```","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliziniu%2FReMax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliziniu%2FReMax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliziniu%2FReMax/lists"}