{"id":13573451,"url":"https://github.com/OpenRLHF/OpenRLHF","last_synced_at":"2025-04-04T12:31:02.328Z","repository":{"id":184754818,"uuid":"672415139","full_name":"OpenRLHF/OpenRLHF","owner":"OpenRLHF","description":"An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning \u0026 Iterative DPO \u0026 LoRA \u0026 RingAttention \u0026 RFT)","archived":false,"fork":false,"pushed_at":"2025-04-02T13:52:11.000Z","size":2329,"stargazers_count":6059,"open_issues_count":215,"forks_count":599,"subscribers_count":35,"default_branch":"main","last_synced_at":"2025-04-02T14:03:09.328Z","etag":null,"topics":["large-language-models","openai-o1","proximal-policy-optimization","raylib","reinforcement-learning","reinforcement-learning-from-human-feedback","transformers","vllm"],"latest_commit_sha":null,"homepage":"https://openrlhf.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenRLHF.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-30T02:20:13.000Z","updated_at":"2025-04-02T13:57:12.000Z","dependencies_parsed_at":"2024-05-05T08:25:00.096Z","dependency_job_id":"45efc570-b18d-4853-aaba-093a9f897e43","html_url":"https://github.com/OpenRLHF/OpenRLHF","commit_stats":{"total_commits":852,"total_committers":41,"mean_commits":20.78048780487805,"dds":0.5140845070422535,"last_synced_commit":"43297e5e89324b5e095dab825798b4859a9b39b4"},"previous_names":["openllmai/openllama2","openllmai/openrlhf","openrlhf/openrlhf"],"tags_count":63,"template":false,"template_full_name"
:null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenRLHF%2FOpenRLHF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenRLHF%2FOpenRLHF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenRLHF%2FOpenRLHF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenRLHF%2FOpenRLHF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenRLHF","download_url":"https://codeload.github.com/OpenRLHF/OpenRLHF/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247179542,"owners_count":20897062,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","openai-o1","proximal-policy-optimization","raylib","reinforcement-learning","reinforcement-learning-from-human-feedback","transformers","vllm"],"created_at":"2024-08-01T15:00:35.957Z","updated_at":"2025-04-04T12:30:57.318Z","avatar_url":"https://github.com/OpenRLHF.png","language":"Python","funding_links":["https://opencollective.com/OpenRLHF"],"categories":["Open Source Software/Implementations","Industry Strength Reinforcement Learning","Codebases","Open-source","A01_文本生成_文本对话","Python","LLM Training Frameworks","Uncategorized","Alignment \u0026 Training","7. Training \u0026 Fine-tuning Ecosystem","Fine-tuning \u0026 Quantization (18)","9. 
Fine-Tuning","Tools"],"sub_categories":["Reports","2020 and before","Codebase","大语言对话模型及数据","Uncategorized","RLHF \u0026 Preference Optimization","Training Frameworks","Training and Fine-tuning"],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cimg alt=\"OpenRLHF logo\" src=\"./docs/logo.png\" style=\"height: 140px;\" /\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cp align=\"center\"\u003e\n      \u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/graphs/contributors\"\u003e\n        \u003cimg alt=\"GitHub Contributors\" src=\"https://img.shields.io/github/contributors/OpenRLHF/OpenRLHF\" /\u003e\n      \u003c/a\u003e\n      \u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/issues\"\u003e\n        \u003cimg alt=\"Issues\" src=\"https://img.shields.io/github/issues/OpenRLHF/OpenRLHF?color=0088ff\" /\u003e\n      \u003c/a\u003e\n      \u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/discussions\"\u003e\n        \u003cimg alt=\"Issues\" src=\"https://img.shields.io/github/discussions/OpenRLHF/OpenRLHF?color=0088ff\" /\u003e\n      \u003c/a\u003e\n      \u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/pulls\"\u003e\n        \u003cimg alt=\"GitHub pull requests\" src=\"https://img.shields.io/github/issues-pr/OpenRLHF/OpenRLHF?color=0088ff\" /\u003e\n      \u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/stargazers\"\u003e\n        \u003cimg alt=\"GitHub stars\" src=\"https://img.shields.io/github/stars/OpenRLHF/OpenRLHF?color=ccf\" /\u003e\n      \u003c/a\u003e\n      \u003cbr\u003e\n      \u003cem\u003eOpen-source / Comprehensive / Lightweight / Easy-to-use\u003c/em\u003e\n    \u003c/p\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\n\u003chr\u003e\n\n\u003cspan\u003e[ English | \u003ca href=\"README_zh.md\"\u003e中文\u003c/a\u003e ]\u003c/span\u003e\n\nOpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:\n\n- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries 
currently available, and is seamlessly compatible with Huggingface models and datasets.\n- **High performance**: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Adam Offload (Pinned Memory) and vLLM generation acceleration, the performance of OpenRLHF is 2x+ that of Optimized DeepSpeedChat with Hybrid Engine.\n- **Distributed RLHF**: OpenRLHF distributes the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs.\n- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://difficult-link-dd7.notion.site/eb7b2d1891f44b3a84e7396d19d39e6f?v=01bcb084210149488d730064cbabc99f).\n\nMore details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/)\n\n\n## Features\n\n- Distributed [PPO based on Ray](./examples/scripts/train_ppo_llama_ray.sh).\n- Support full RLHF fine-tuning of models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh).\n- Support vLLM generation acceleration in RLHF (--vllm_num_engines).\n- Support multiple reward models (--reward_pretrain model1,model2...) 
and remote reward model(--remote_rm_url).\n- Support [DPO (direct-preference-optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh).\n- Support [Kahneman-Tversky optimization (KTO)](./examples/scripts/train_kto_llama.sh).\n- Support [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh).\n- Support [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) (https://github.com/RLHFlow/Online-RLHF).\n- Support [Conditional SFT](./examples/scripts/train_conditional_llama.sh) (https://arxiv.org/abs/2308.12050).\n- Support [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) (https://github.com/microsoft/LMOps/tree/main/minillm).\n- Support [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh).\n- Support SFT/DPO/RM/PRM/PPO training samples packing (--packing_samples).\n- Support [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (--ring_attn_size, --ring_head_stride)\n- Support [MoE](./examples/test_scripts/train_sft_mixtral_lora.sh) (--aux_loss_coef)\n- Support FlashAttention2 (--flash_attn).\n- Support QLoRA (--load_in_4bit), [LoRA (--lora_rank, --target_modules)](./examples/scripts/train_sft_mixtral_lora.sh).\n- Support HuggingFace `tokenizer.apply_chat_template` in datasets (--apply_chat_template and --input_key).\n- Support Wandb log (--use_wandb) and tensorboard (--use_tensorboard).\n- Support for recovering from checkpoint (--load_checkpoint and --save_steps).\n- Multi-nodes [training scripts](./examples/scripts/train_llama_slurm.sh) for Slurm.\n\n### PPO Support Matrix\n\n| Feature | OpenRLHF | DSChat | CAIChat | TRL |\n| ------------- |:-------------:| :-------------:| :-------------:| :-------------:|\n| 70B+ Full Tuning with 16 A100-80GB      | ✅ | ❌ | ❌ | ❌ |\n| 7B Full Tuning with 4 RTX4090 | ✅      |    ❌ | ❌ | ❌ |\n| 34B DPO Full Tuning with 8 A100-80GB | ✅      |    ❌ | ❌ | ❌ |  \n| Inference Engine in PPO | ✅      |    ✅ | ❌ | ❌ |  \n| PPO Implementation Tricks | ✅      
|    ❌ | ❌ | ✅ |\n| Support QLoRA | ✅      |    ❌ | ❌ | ✅ | \n| Support Mixtral 8*7b | ✅      |    ❌ | ❌ | ❌ |  \n| Support Unmerged Actor-Critic | ✅     |   ✅ | ✅ | ❌ | \n| Support Multiple Reward Models | ✅      |    ❌ | ❌ | ❌ |   \n| Support Huggingface Models | ✅      |    ✅ | ✅ | ✅ | \n| Easy-to-use | ✅      |   ❌ (HybridEngine bugs) | ✅ | ✅ | \n\n\n## Quick Start\n\n### Installation\n\nTo use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container:\n\n```bash\n# Launch the docker container\ndocker run --runtime=nvidia -it --rm --shm-size=\"10g\" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.02-py3 bash\nsudo pip uninstall xgboost transformer_engine flash_attn -y\n\n# pip install\npip install openrlhf\n\n# If you want to use vLLM acceleration (To install vLLM 0.4.2)\npip install openrlhf[vllm]\n# latest vLLM is also supported (Please use `--vllm_sync_backend gloo` or `export NCCL_P2P_DISABLE=1`)\npip install openrlhf[vllm_latest]\n\n# pip install the latest version\npip install git+https://github.com/OpenRLHF/OpenRLHF.git\n\n# Or git clone\ngit clone https://github.com/OpenRLHF/OpenRLHF.git\ncd OpenRLHF\npip install -e .\n```\n\n\u003e [!NOTE]\n\u003eWe recommend using vLLM 0.4.2, as the 0.4.3+ versions currently require synchronizing weights via Gloo (`--vllm_sync_backend gloo`) or disabling P2P communication (`export NCCL_P2P_DISABLE=1`).\n\u003eWe also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh).\n\n### Prepare Datasets\nOpenRLHF provides multiple data processing methods in our dataset classes.\nSuch as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6):\n\n```python\ndef preprocess_data(data, input_template=None, input_key=\"input\", apply_chat_template=None) -\u003e str:\n    if apply_chat_template:\n        prompt 
= apply_chat_template(data[input_key], tokenize=False, add_generation_prompt=True)\n    else:\n        prompt = data[input_key]\n        if input_template:\n            prompt = input_template.format(prompt)\n    return prompt\n```\n\n- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating).\n- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance.\n- OpenRLHF also supports mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.\n\nHow Chat Templating Works:\n\n```python\ndataset = [{\"input_key\": [\n  {\"role\": \"user\", \"content\": \"Hello, how are you?\"},\n  {\"role\": \"assistant\", \"content\": \"I'm doing great. How can I help you today?\"},\n  {\"role\": \"user\", \"content\": \"I'd like to show off how chat templating works!\"},\n]}]\n\ntokenizer.apply_chat_template(dataset[0][\"input_key\"], tokenize=False)\n\n\"\u003cs\u003e[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?\u003c/s\u003e [INST] I'd like to show off how chat templating works! [/INST]\"\n```\n\nHow to specify training and test datasets?\n\nYou can specify them using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`.\n\n```\ndata\n├── test.jsonl\n└── train.jsonl\n```\n\n\u003e [!NOTE]\n\u003e By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface.\n\u003e The ``JSON key`` options depend on the specific dataset. 
See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9)\n\n### Supervised Fine-tuning\n\nOpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain  {name or path}`, `--reward_pretrain  {name or path}` and `--critic_pretrain  {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF).\n\nThen you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands.\n\n```bash \ndeepspeed --module openrlhf.cli.train_sft \\\n   --max_len 4096 \\\n   --dataset Open-Orca/OpenOrca \\\n   --input_key question \\\n   --output_key response \\\n   --input_template 'User: {}\\nAssistant: ' \\\n   --train_batch_size 256 \\\n   --micro_train_batch_size 2 \\\n   --max_samples 500000 \\\n   --pretrain meta-llama/Meta-Llama-3-8B \\\n   --save_path ./checkpoint/llama3-8b-sft \\\n   --save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --zero_stage 2 \\\n   --max_epochs 1 \\\n   --bf16 \\\n   --flash_attn \\\n   --learning_rate 5e-6 \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n\n# Support HF tokenizer.apply_chat_template\n# --apply_chat_template \n# --input_key {JSON Key}\n# --tokenizer_chat_template {HF Chat Template}\n\n# Support samples packing\n# --packing_samples\n\n# Can also be used for continued pre-training\n# --pretrain_mode\n```\n\n\u003e [!NOTE]\n\u003e OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)\n\n\n### Reward Model Training\n```bash\ndeepspeed --module openrlhf.cli.train_rm \\\n   --save_path ./checkpoint/llama3-8b-rm \\\n   
--save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --train_batch_size 256 \\\n   --micro_train_batch_size 1 \\\n   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \\\n   --bf16 \\\n   --max_epochs 1 \\\n   --max_len 8192 \\\n   --zero_stage 3 \\\n   --learning_rate 9e-6 \\\n   --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \\\n   --apply_chat_template \\\n   --chosen_key chosen \\\n   --rejected_key rejected \\\n   --flash_attn \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n\n# Support samples packing\n# --packing_samples\n```\n\n### PPO without Ray\n\n```bash\ndeepspeed --module openrlhf.cli.train_ppo \\\n  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \\\n  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \\\n  --save_path ./checkpoint/llama-3-8b-rlhf \\\n  --save_steps -1 \\\n  --logging_steps 1 \\\n  --eval_steps -1 \\\n  --micro_train_batch_size 2 \\\n  --train_batch_size 128 \\\n  --micro_rollout_batch_size 4 \\\n  --rollout_batch_size 1024 \\\n  --max_epochs 1 \\\n  --prompt_max_len 1024 \\\n  --generate_max_len 1024 \\\n  --zero_stage 2 \\\n  --bf16 \\\n  --actor_learning_rate 5e-7 \\\n  --critic_learning_rate 9e-6 \\\n  --init_kl_coef 0.01 \\\n  --prompt_data OpenRLHF/prompt-collection-v0.1 \\\n  --input_key context_messages \\\n  --apply_chat_template \\\n  --max_samples 100000 \\\n  --normalize_reward \\\n  --adam_offload \\\n  --flash_attn \\\n  --gradient_checkpointing \\\n  --use_wandb {wandb_token}\n\n# Support remote reward model (HTTP)\n# --remote_rm_url http://localhost:5000/get_reward\n```\n\n### PPO with Ray and vLLM\n\nTo improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration\n\n```bash\n# launch the master node of ray in container\nray start --head --node-ip-address 0.0.0.0 --num-gpus 8\n\n# if you want to launch ray on more nodes, use\nray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8\n\nray job submit --address=\"http://127.0.0.1:8265\" \\\n  
--runtime-env-json='{\"working_dir\": \"/openrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  --ref_num_nodes 1 \\\n  --ref_num_gpus_per_node 2 \\\n  --reward_num_nodes 1 \\\n  --reward_num_gpus_per_node 2 \\\n  --critic_num_nodes 1 \\\n  --critic_num_gpus_per_node 2 \\\n  --actor_num_nodes 1 \\\n  --actor_num_gpus_per_node 2 \\\n  --vllm_num_engines 2 \\\n  --vllm_tensor_parallel_size 2 \\\n  --colocate_critic_reward \\\n  --colocate_actor_ref \\\n  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \\\n  --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \\\n  --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \\\n  --micro_train_batch_size 8 \\\n  --train_batch_size 128 \\\n  --micro_rollout_batch_size 16 \\\n  --rollout_batch_size 1024 \\\n  --max_samples 100000 \\\n  --max_epochs 1 \\\n  --prompt_max_len 1024 \\\n  --generate_max_len 1024 \\\n  --zero_stage 3 \\\n  --bf16 \\\n  --actor_learning_rate 5e-7 \\\n  --critic_learning_rate 9e-6 \\\n  --init_kl_coef 0.01 \\\n  --prompt_data OpenRLHF/prompt-collection-v0.1 \\\n  --input_key context_messages \\\n  --apply_chat_template \\\n  --normalize_reward \\\n  --adam_offload \\\n  --flash_attn \\\n  --gradient_checkpointing \\\n  --use_wandb {wandb_token}\n\n# Support samples packing\n# --packing_samples (Recommended)\n\n# Support remote reward model (HTTP)\n# --remote_rm_url http://localhost:5000/get_reward\n```\n\n\u003e [!NOTE]\n\u003e If `--vllm_num_engines` is not set, the vLLM engine is not used.\n\u003e You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{\"setup_commands\": [\"pip install openrlhf[vllm]\"]}'`.\n\nThe launch scripts and documentation for supported algorithms are in [examples/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html).\n\n## Performance\n\nWe optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along 
with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:\n\n| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** |\n| :---: | :---: | :---: | :---: | :---: |\n| 7B | 16 | 855.09 | 471.11 | 1.82x |\n| 13B | 32 | 1528.93 | 608.93 | 2.5x |\n| 34B | 32 | 3634.98 | 1526.4 | 2.4x |\n| 70B | 32 | 10407.0 | 4488.53 | 2.3x |\n\n\u003e [!NOTE]\n\u003e The data is outdated; please refer to the performance tuning section for re-testing.\n\n### Performance Tuning Guide\n\nTo achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate more than 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward`, `--colocate_actor_ref`, or (optionally) `--ref_reward_offload` options to merge nodes. Finally, you should increase `--micro_rollout_batch_size` (and minimize the TP size of the vLLM engine) as much as possible, and use `--packing_samples` to avoid out-of-memory (OOM) errors. During the training phase, a larger `--micro_train_batch_size` is better. Enable `enable_prefix_caching` in vLLM generation when `n_samples_per_prompt \u003e 1`.\n\n## Companies and Organizations using OpenRLHF\n\n- ByteDance\n- NexusFlow\n- Baidu\n- Jülich Supercomputing Centre (JSC)\n- Berkeley Starling Team\n- Tencent\n- Alibaba\n- Google\n- China Telecom\n- ...\n\n## Join Us\n\n**How to Join?**\n\n1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). 
Please include the following details:\n   - Your name\n   - Your GitHub username\n   - Your areas of interest\n   - Your skills and experience related to NLP and/or AI\n1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.\n\n**What can you do?**\n\n1. Join the team and participate in the development of the OpenRLHF project.\n1. Contribute to the project by submitting pull requests.\n1. Help improve documentation, fix bugs, or create new features.\n1. Share the project and help us grow the community.\n\n## Sponsor Us\n\nYour sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF).\n\n## Starchart\n\n[![Star History Chart](https://api.star-history.com/svg?repos=OpenRLHF/OpenRLHF\u0026type=Date)](https://star-history.com/#OpenRLHF/OpenRLHF\u0026Date)\n\n## Contributors\n\nA big thank you to all our contributors! 
If you want to contribute, feel free to make a pull request or create an issue.\n\n\u003ca href=\"https://github.com/OpenRLHF/OpenRLHF/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=OpenRLHF/OpenRLHF\" /\u003e\n\u003c/a\u003e\n\n## References \u0026 Acknowledgements\n\nWe would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:\n\n- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers)\n- [OpenAI GPT ↗](https://github.com/openai/gpt-3)\n- [LLaMA ↗](https://llama.meta.com/)\n- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed)\n- [Ray ↗](https://github.com/ray-project/ray)\n\nOur project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design. \n\n(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.\n\n## Citation\n```\n@article{hu2024openrlhf,\n  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},\n  author={Jian Hu and Xibin Wu and Weixun Wang and Xianyu and Dehao Zhang and Yu Cao},\n  journal={arXiv preprint arXiv:2405.11143},\n  year={2024}\n}\n```\n\n______________________________________________________________________\n\n*OpenRLHF © 2024 OpenRLHF. All Rights Reserved.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenRLHF%2FOpenRLHF","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenRLHF%2FOpenRLHF","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenRLHF%2FOpenRLHF/lists"}