{"id":13754204,"url":"https://github.com/PKU-Alignment/safe-rlhf","last_synced_at":"2025-05-09T22:31:32.862Z","repository":{"id":165616429,"uuid":"640914148","full_name":"PKU-Alignment/safe-rlhf","owner":"PKU-Alignment","description":"Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback","archived":false,"fork":false,"pushed_at":"2024-06-13T10:06:45.000Z","size":4181,"stargazers_count":1454,"open_issues_count":16,"forks_count":120,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-30T02:04:42.837Z","etag":null,"topics":["ai-safety","alpaca","beaver","datasets","deepspeed","gpt","large-language-models","llama","llm","llms","reinforcement-learning","reinforcement-learning-from-human-feedback","rlhf","safe-reinforcement-learning","safe-reinforcement-learning-from-human-feedback","safe-rlhf","safety","transformer","transformers","vicuna"],"latest_commit_sha":null,"homepage":"https://pku-beaver.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PKU-Alignment.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-15T11:47:08.000Z","updated_at":"2025-04-29T12:26:02.000Z","dependencies_parsed_at":"2023-10-10T17:14:48.010Z","dependency_job_id":"cacf5197-a7a4-4374-8fb8-a3ac2e7ef687","html_url":"https://github.com/PKU-Alignment/safe-rlhf","commit_stats":{"total_commits":111,"total_committers":4,"mean_commits":27.75,"dds":0.2072072072072072,"last_synced_commit":"e8cca16665ef2340ac92c6514f05519310251581"},"previous_names":[],"tags_cou
nt":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fsafe-rlhf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fsafe-rlhf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fsafe-rlhf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PKU-Alignment%2Fsafe-rlhf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PKU-Alignment","download_url":"https://codeload.github.com/PKU-Alignment/safe-rlhf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335685,"owners_count":21892713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-safety","alpaca","beaver","datasets","deepspeed","gpt","large-language-models","llama","llm","llms","reinforcement-learning","reinforcement-learning-from-human-feedback","rlhf","safe-reinforcement-learning","safe-reinforcement-learning-from-human-feedback","safe-rlhf","safety","transformer","transformers","vicuna"],"created_at":"2024-08-03T09:01:49.581Z","updated_at":"2025-05-09T22:31:29.611Z","avatar_url":"https://github.com/PKU-Alignment.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python","Codebases","Safety \u0026 Governance","Training","10. 
AI Safety, Alignment \u0026 Interpretability"],"sub_categories":["大语言对话模型及数据","2020 and before","Sandboxing \u0026 Execution","RLHF"],"readme":"\u003c!-- markdownlint-disable first-line-h1 --\u003e\n\u003c!-- markdownlint-disable html --\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/PKU-Beaver-logo-wide.svg\" width=\"80%\"/\u003e\n\u003c/div\u003e\n\n\u003ch1 align=\"center\"\u003eConstrained Value-Aligned LLM via Safe RLHF\u003c/h1\u003e\n\nBeaver is a highly modular open-source RLHF framework developed by the PKU-Alignment team at Peking University.\nIt aims to provide training data and a reproducible code pipeline for alignment research, especially constrained alignment LLM research via Safe RLHF methods.\n\nThe key features of Beaver are:\n\n- Support **SFT**, **RLHF** and **Safe RLHF** training for popular pre-trained models: [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai), [OPT](https://arxiv.org/abs/2205.01068), [Baichuan](https://huggingface.co/baichuan-inc/Baichuan-7B), etc.\n- Provide a large human-labeled dataset **(up to 1M pairs)** including both helpful and harmless preferences to support reproducible RLHF research.\n- Support training for Reward Model \u0026 Cost Model, and provide pre-trained checkpoints.\n- Support customized parameters and datasets for SFT and RLHF.\n- Provide multi-scale metrics for safety constraints verification, e.g., BIG-bench, GPT-4 Evaluation.\n\n## **🦫 What's New?**  \u003c!-- omit in toc --\u003e\n\n- **🎉 `2024/06/13`:** We are pleased to announce the open-sourcing of our PKU-SafeRLHF dataset version 1.0. This release advances over the initial beta version by incorporating human-AI joint annotations, expanding the scope of harm categories, and introducing detailed severity level labels. 
For further details and access, please visit our dataset page on 🤗 Hugging Face: [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF).\n- **🎉 `2024/01/16`:** Our method [**Safe RLHF**](https://openreview.net/forum?id=TyFrPOKYXw) has been accepted by ICLR 2024 Spotlight.\n- **📄 `2023/10/19`:** We've released our [**Safe RLHF paper**](https://arxiv.org/abs/2310.12773) on arXiv, detailing our new safe alignment algorithm and its implementation.\n- **🚀 `2023/07/10`:** We're delighted to announce the open-sourcing of **Beaver-7B** [v1](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0) / [v2](https://huggingface.co/PKU-Alignment/beaver-7b-v2.0) / [v3](https://huggingface.co/PKU-Alignment/beaver-7b-v3.0) models as the first milestone of the Safe RLHF training series, complemented by the corresponding **Reward Models** [v1](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward) / [v2](https://huggingface.co/PKU-Alignment/beaver-7b-v2.0-reward) / [v3](https://huggingface.co/PKU-Alignment/beaver-7b-v3.0-reward) / [unified](https://huggingface.co/PKU-Alignment/beaver-7b-unified-reward) and **Cost Models** [v1](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost) / [v2](https://huggingface.co/PKU-Alignment/beaver-7b-v2.0-cost) / [v3](https://huggingface.co/PKU-Alignment/beaver-7b-v3.0-cost) / [unified](https://huggingface.co/PKU-Alignment/beaver-7b-unified-cost) checkpoints on 🤗 Hugging Face.\n- **🔥 `2023/07/10`:** We extend the open-source safety preference dataset, [**PKU-Alignment/PKU-SafeRLHF**](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF), which now contains over 300k examples. (See also section [PKU-SafeRLHF-Dataset](#pku-saferlhf-dataset))\n- **⚙ `2023/07/05`:** We enhanced our support for Chinese pre-training models and incorporated additional open-source Chinese datasets. 
(See also sections [Chinese Support (中文支持)](#chinese-support-中文支持) and [Custom Datasets (自定义数据集)](#custom-datasets))\n- **⭐️ `2023/05/15`:** First release of the Safe RLHF pipeline, evaluation results, and training code.\n\n### Table of Contents  \u003c!-- omit in toc --\u003e\n\n- [Constrained Value Alignment via Safe RLHF](#constrained-value-alignment-via-safe-rlhf)\n- [Comparison with Other RLHF Libraries](#comparison-with-other-rlhf-libraries)\n- [PKU-SafeRLHF-Dataset](#pku-saferlhf-dataset)\n  - [PKU-SafeRLHF-10K](#pku-saferlhf-10k)\n  - [PKU-SafeRLHF-1M](#pku-saferlhf-1m)\n- [Why \"Beaver\"](#why-beaver)\n- [Beaver vs. Alpaca](#beaver-vs-alpaca)\n- [Installation](#installation)\n- [Training](#training)\n- [Custom Datasets](#custom-datasets)\n- [Inference](#inference)\n  - [Interactive CLI Demo](#interactive-cli-demo)\n  - [Interactive Arena](#interactive-arena)\n- [Chinese Support (中文支持)](#chinese-support-中文支持)\n- [Benchmark and Evaluation](#benchmark-and-evaluation)\n  - [Arena via Reward and Cost Models](#arena-via-reward-and-cost-models)\n  - [BIG-bench](#big-bench)\n  - [GPT-4 Evaluation](#gpt-4-evaluation)\n- [Future Plans](#future-plans)\n- [Citation](#citation)\n- [PKU-Alignment Team](#pku-alignment-team)\n- [License](#license)\n\n## Constrained Value Alignment via Safe RLHF\n\nReinforcement Learning from Human Feedback: reward maximization via preference learning\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/rl-formulation.svg\" width=\"40%\"/\u003e\n\u003c/div\u003e\n\nSafe Reinforcement Learning from Human Feedback: constrained reward maximization via preference learning\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/safe-rl-formulation.svg\" width=\"55%\"/\u003e\n\u003c/div\u003e\n\nwhere $R (\\cdot)$ and $C (\\cdot)$ are reward and cost functions respectively. 
They are neural networks, trained on human preferences, that serve as proxies for human judgment.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/preference-learning.svg\" width=\"90%\"/\u003e\n\u003c/div\u003e\n\nThe ultimate goal is to find a model $\\pi_{\\theta}$ that is both helpful (high reward) and harmless (low cost).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/safe-alignment.png\" width=\"90%\"/\u003e\n\u003c/div\u003e\n\n## Comparison with Other RLHF Libraries\n\nCompared with other frameworks supporting RLHF, `safe-rlhf` is the first framework to support all stages from SFT to RLHF and Evaluation.\nIn addition, `safe-rlhf` is the first framework that takes safety preferences into consideration during the RLHF stage.\nIt offers a stronger theoretical guarantee for constrained parameter search in the policy space.\n\n|                                                                                                        |             SFT              | Preference Model\u003csup\u003e[1](#perference-model)\u003c/sup\u003e Training | RLHF  | Safe RLHF | PTX Loss | Evaluation |      Backend      |\n| :----------------------------------------------------------------------------------------------------: | :--------------------------: | :--------------------------------------------------------: | :---: | :-------: | :------: | :--------: | :---------------: |\n|                  [Beaver](https://github.com/PKU-Alignment/safe-rlhf)\u003cbr/\u003e(Safe-RLHF)                  |              ✔️               |                             ✔️                              |   ✔️   |     ✔️     |    ✔️     |     ✔️      |     DeepSpeed     |\n|                                [trlX](https://github.com/CarperAI/trlx)                                |              ✔️               |             ❌\u003csup\u003e[2](#trlx-rm-example)\u003c/sup\u003e              |   ✔️   |     ❌     |    ❌     |     ❌      | Accelerate / NeMo |\n| 
[DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/HEAD/applications/DeepSpeed-Chat) |              ✔️               |                             ✔️                              |   ✔️   |     ❌     |    ✔️     |     ❌      |     DeepSpeed     |\n|                         [Colossal-AI](https://github.com/hpcaitech/ColossalAI)                         |              ✔️               |                             ✔️                              |   ✔️   |     ❌     |    ✔️     |     ❌      |    ColossalAI     |\n|                         [AlpacaFarm](https://github.com/tatsu-lab/alpaca_farm)                         | ❌\u003csup\u003e[3](#alpaca-sft)\u003c/sup\u003e |                             ✔️                              |   ✔️   |     ❌     |    ❌     |     ✔️      |    Accelerate     |\n\n\u003csup\u003e\n  \u003ca name=\"perference-model\"\u003e1.\u003c/a\u003e In the context of RLHF, the \"Preference Model\" is identified as the \"Reward Model\". And the \"Preference Model\" refers to both the \"Reward Model\" and the \"Cost Model\" in Safe RLHF.\n\u003c/sup\u003e\u003cbr/\u003e\n\u003csup\u003e\n  \u003ca name=\"trlx-rm-example\"\u003e2.\u003c/a\u003e There is an example for reward model training in the examples directory in the trlX repository. 
However, it is not officially supported and is not integrated into the trlX library.\n\u003c/sup\u003e\u003cbr/\u003e\n\u003csup\u003e\n  \u003ca name=\"alpaca-sft\"\u003e3.\u003c/a\u003e The supervised fine-tuning support for Alpaca is provided in the \u003ca href=\"https://github.com/tatsu-lab/stanford_alpaca\"\u003etatsu-lab/stanford_alpaca\u003c/a\u003e repository.\n\u003c/sup\u003e\n\n## PKU-SafeRLHF-Dataset\n\nThe `PKU-SafeRLHF` dataset is a human-labeled dataset containing both performance and safety preferences.\nIt includes constraints in over ten dimensions, such as insults, immorality, crime, emotional harm, and privacy.\nThese constraints are designed for fine-grained value alignment in RLHF.\n\nTo facilitate multi-round fine-tuning, we will release the initial parameter weights, required datasets, and training parameters for each round.\nThis ensures reproducibility in scientific and academic research.\nThe dataset will be released gradually through rolling updates.\n\nThe dataset is available on Hugging Face: [PKU-Alignment/PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF).\n\n### PKU-SafeRLHF-10K\n\n`PKU-SafeRLHF-10K` is a subset of `PKU-SafeRLHF` that contains the first round of Safe RLHF training data with 10K instances, including safety preferences.\nYou can find it on Hugging Face: [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K).\n\n### PKU-SafeRLHF-1M\n\nWe will gradually release the full Safe-RLHF datasets, which include **1M _human-labeled_ pairs** for both helpful and harmless preferences.\n\n## Why \"Beaver\"\n\nBeaver is a large language model based on [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai), trained using `safe-rlhf`.\nIt is built on the foundation of the [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) model, by collecting human preference data related to helpfulness and harmlessness and 
employing the Safe RLHF technique for training.\nWhile maintaining the helpful performance of Alpaca, Beaver significantly improves its harmlessness.\n\n\u003e Beavers are known as the \"natural dam engineers\" as they are adept at using branches, shrubs, rocks, and soil to build dams and small wooden lodges, creating wetland environments suitable for other creatures to inhabit, making them an indispensable part of the ecosystem. To ensure the safety and reliability of Large Language Models (LLMs) while accommodating a wide range of values across different populations, the Peking University team has named their open-source model \"Beaver\" and aims to build a dam for LLMs through the Constrained Value Alignment (CVA) technology. This technology enables fine-grained labeling of information and, combined with safe reinforcement learning methods, significantly reduces model bias and discrimination, thereby enhancing the model's safety. Analogous to the role of beavers in the ecosystem, the Beaver model will provide crucial support for the development of large language models and make positive contributions to the sustainable development of artificial intelligence technology.\n\n## Beaver vs. Alpaca\n\nFollowing the evaluation methodology of the [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna) model, we used GPT-4 to evaluate Beaver. 
The results indicate that, compared to Alpaca, Beaver exhibits significant improvements in multiple dimensions related to safety.\n\n![Arena-Demo](images/arena-demo.gif)\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/evaluation.png\" width=\"60%\"/\u003e\n\u003c/div\u003e\n\nThe distribution of safety preferences shifts significantly after applying the Safe RLHF pipeline to the Alpaca-7B model.\n\n\u003ctable width=\"100%\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n  \u003ctr align=\"center\" valign=\"middle\"\u003e\n    \u003ctd width=\"50%\"\u003e\n      \u003cimg src=\"images/reward-distribution.png\" width=\"100%\"/\u003e\n    \u003c/td\u003e\n    \u003ctd width=\"50%\"\u003e\n      \u003cimg src=\"images/cost-distribution.png\" width=\"100%\"/\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## Installation\n\nClone the source code from GitHub:\n\n```bash\ngit clone https://github.com/PKU-Alignment/safe-rlhf.git\ncd safe-rlhf\n```\n\n**Native Runner:** Set up a conda environment using [`conda`](https://github.com/conda/conda) / [`mamba`](https://github.com/mamba-org/mamba):\n\n```bash\nconda env create --file conda-recipe.yaml  # or `mamba env create --file conda-recipe.yaml`\n```\n\nThis will automatically set up all dependencies.\n\n**Containerized Runner:** As an alternative to running natively with conda isolation, you can use Docker images to configure the environment.\n\nFirst, follow [NVIDIA Container Toolkit: Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) and [NVIDIA Docker: Installation Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) to set up `nvidia-docker`.\nThen you can run:\n\n```bash\nmake docker-run\n```\n\nThis command builds and starts a Docker container with the required dependencies installed.\nThe host path `/` will be mapped to `/host` and the current working directory will 
be mapped to `/workspace` inside the container.\n\n## Training\n\n`safe-rlhf` supports a complete pipeline from Supervised Fine-Tuning (SFT) to preference model training to RLHF alignment training.\n\n0. Follow the instructions in section [Installation](#installation) to set up the training environment properly.\n\n```bash\nconda activate safe-rlhf\nexport WANDB_API_KEY=\"...\"  # your W\u0026B API key here\n```\n\nor\n\n```bash\nmake docker-run\nexport WANDB_API_KEY=\"...\"  # your W\u0026B API key here\n```\n\n1. Supervised Fine-Tuning (SFT)\n\n```bash\nbash scripts/sft.sh \\\n    --model_name_or_path \u003cyour-model-name-or-checkpoint-path\u003e \\\n    --output_dir output/sft\n```\n\nNOTE: You may need to update some of the parameters in the script according to your machine setup, such as the number of GPUs for training, the training batch size, etc.\n\n2. Value Models (reward model \u0026 cost model)\n\n```bash\nbash scripts/reward-model.sh \\\n    --model_name_or_path output/sft \\\n    --output_dir output/rm\n```\n\n```bash\nbash scripts/cost-model.sh \\\n    --model_name_or_path output/sft \\\n    --output_dir output/cm\n```\n\n3. RLHF (Optional)\n\n```bash\nbash scripts/ppo.sh \\\n    --actor_model_name_or_path output/sft \\\n    --reward_model_name_or_path output/rm \\\n    --output_dir output/ppo\n```\n\n4. 
Safe-RLHF\n\n```bash\nbash scripts/ppo-lag.sh \\\n    --actor_model_name_or_path output/sft \\\n    --reward_model_name_or_path output/rm \\\n    --cost_model_name_or_path output/cm \\\n    --output_dir output/ppo-lag\n```\n\nAn example of commands to run the whole pipeline with [LLaMA-7B](https://ai.facebook.com/blog/large-language-model-llama-meta-ai):\n\n```bash\nconda activate safe-rlhf\nbash scripts/sft.sh --model_name_or_path ~/models/llama-7b --output_dir output/sft\nbash scripts/reward-model.sh --model_name_or_path output/sft --output_dir output/rm\nbash scripts/cost-model.sh --model_name_or_path output/sft --output_dir output/cm\nbash scripts/ppo-lag.sh \\\n    --actor_model_name_or_path output/sft \\\n    --reward_model_name_or_path output/rm \\\n    --cost_model_name_or_path output/cm \\\n    --output_dir output/ppo-lag\n```\n\n#### Computational Requirements  \u003c!-- omit in toc --\u003e\n\nAll training processes listed above are tested with [LLaMA-7B](https://ai.facebook.com/blog/large-language-model-llama-meta-ai) on a cloud server with 8 x NVIDIA A800-80GB GPUs.\n\nUsers who do not have enough GPU memory can enable [DeepSpeed ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload) to reduce peak GPU memory usage.\n\nAll training scripts accept an extra option `--offload` (defaults to `none`, i.e., ZeRO-Offload disabled) to offload the tensors (parameters and/or optimizer states) to CPU. For example:\n\n```bash\nbash scripts/sft.sh \\\n    --model_name_or_path ~/models/llama-7b \\\n    --output_dir output/sft \\\n    --offload all  # or `parameter` or `optimizer`\n```\n\nFor multi-node settings, users can refer to the [DeepSpeed: Resource Configuration (multi-node)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) documentation for more details. 
Here is an example to start the training process on 4 nodes (each has 8 GPUs):\n\n```text\n# myhostfile\nworker-1 slots=8\nworker-2 slots=8\nworker-3 slots=8\nworker-4 slots=8\n```\n\nThen launch the training scripts with:\n\n```bash\nbash scripts/sft.sh \\\n    --hostfile myhostfile \\\n    --model_name_or_path ~/models/llama-7b \\\n    --output_dir output/sft\n```\n\n## Custom Datasets\n\n`safe-rlhf` provides an abstraction to create datasets for all of the Supervised Fine-Tuning, preference model training, and RL training stages.\n\n```python\nclass RawSample(TypedDict, total=False):\n    \"\"\"Raw sample type.\n\n    For SupervisedDataset, should provide (input, answer) or (dialogue).\n    For PreferenceDataset, should provide (input, answer, other_answer, better).\n    For SafetyPreferenceDataset, should provide (input, answer, other_answer, safer, is_safe, is_other_safe).\n    For PromptOnlyDataset, should provide (input).\n    \"\"\"\n\n    # Texts\n    input: NotRequired[str]  # either `input` or `dialogue` should be provided\n    \"\"\"User input text.\"\"\"\n    answer: NotRequired[str]\n    \"\"\"Assistant answer text.\"\"\"\n    other_answer: NotRequired[str]\n    \"\"\"Other assistant answer text via resampling.\"\"\"\n    dialogue: NotRequired[list[str]]  # either `input` or `dialogue` should be provided\n    \"\"\"Dialogue history.\"\"\"\n\n    # Flags\n    better: NotRequired[bool]\n    \"\"\"Whether ``answer`` is better than ``other_answer``.\"\"\"\n    safer: NotRequired[bool]\n    \"\"\"Whether ``answer`` is safer than ``other_answer``.\"\"\"\n    is_safe: NotRequired[bool]\n    \"\"\"Whether ``answer`` is safe.\"\"\"\n    is_other_safe: NotRequired[bool]\n    \"\"\"Whether ``other_answer`` is safe.\"\"\"\n```\n\nHere is an example to implement a custom dataset (see [safe_rlhf/datasets/raw](safe_rlhf/datasets/raw) for more examples):\n\n```python\nimport argparse\nfrom datasets import load_dataset\nfrom safe_rlhf.datasets import RawDataset, 
RawSample, parse_dataset\n\n\nclass MyRawDataset(RawDataset):\n    NAME = 'my-dataset-name'\n\n    def __init__(self, path=None) -\u003e None:\n        # Load a dataset from Hugging Face\n        self.data = load_dataset(path or 'my-organization/my-dataset')['train']\n\n    def __getitem__(self, index: int) -\u003e RawSample:\n        data = self.data[index]\n        # Construct a `RawSample` dictionary from your custom dataset item\n        return RawSample(\n            input=data['col1'],\n            answer=data['col2'],\n            other_answer=data['col3'],\n            better=float(data['col4']) \u003e float(data['col5']),\n            ...\n        )\n\n    def __len__(self) -\u003e int:\n        return len(self.data)  # dataset size\n\n\ndef parse_arguments():\n    parser = argparse.ArgumentParser(...)\n    parser.add_argument(\n        '--datasets',\n        type=parse_dataset,\n        nargs='+',\n        metavar='DATASET[:PROPORTION[:PATH]]',\n    )\n    ...\n    return parser.parse_args()\n\n\ndef main():\n    args = parse_arguments()\n    ...\n\n\nif __name__ == '__main__':\n    main()\n```\n\nThen you can pass this dataset to the training scripts as:\n\n```bash\npython3 train.py --datasets my-dataset-name\n```\n\nYou may also pass multiple datasets with optionally additional dataset proportions (separated by a colon `:`). 
For example:\n\n```bash\npython3 train.py --datasets alpaca:0.75 my-dataset-name:0.5\n```\n\nThis will use a randomly sampled 75% of the Stanford Alpaca dataset and 50% of your custom dataset.\n\nIn addition, the dataset argument can be followed by a local path (separated by a colon `:`) if you have already cloned the dataset repository from Hugging Face.\n\n```bash\ngit lfs install\ngit clone https://huggingface.co/datasets/my-organization/my-dataset ~/path/to/my-dataset/repository\npython3 train.py --datasets alpaca:0.75 my-dataset-name:0.5:~/path/to/my-dataset/repository\n```\n\nNOTE: The dataset class must be imported before the training script begins to parse the command line arguments.\n\n## Inference\n\n### Interactive CLI Demo\n\n```bash\npython3 -m safe_rlhf.serve.cli --model_name_or_path output/sft  # or output/ppo-lag\n```\n\n### Interactive Arena\n\n```bash\npython3 -m safe_rlhf.serve.arena --red_corner_model_name_or_path output/sft --blue_corner_model_name_or_path output/ppo-lag\n```\n\n![Arena-Demo](images/arena-demo.gif)\n\n## Chinese Support (中文支持)\n\nThe Safe-RLHF pipeline supports not only the LLaMA model family but also other pre-trained models such as [Baichuan](https://huggingface.co/baichuan-inc/Baichuan-7B), [InternLM](https://huggingface.co/internlm/internlm-7b), etc. that offer better support for Chinese. 
You just need to update the path to the pre-trained model in the training and inference code.\n\n\u003e Safe-RLHF 管道不仅仅支持 LLaMA 系列模型，它也支持其他一些对中文支持更好的预训练模型，例如 [Baichuan](https://huggingface.co/baichuan-inc/Baichuan-7B) 和 [InternLM](https://huggingface.co/internlm/internlm-7b) 等。你只需要在训练和推理的代码中更新预训练模型的路径即可。\n\n```bash\n# SFT training\nbash scripts/sft.sh --model_name_or_path baichuan-inc/Baichuan-7B --output_dir output/baichuan-sft\n\n# Inference\npython3 -m safe_rlhf.serve.cli --model_name_or_path output/baichuan-sft\n```\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"images/chinese-support.png\" width=\"100%\"/\u003e\n\u003c/div\u003e\n\nIn the meantime, we've added support for Chinese datasets such as the [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) and [MOSS series](https://huggingface.co/datasets/fnlp/moss-003-sft-data) to our [raw-datasets](safe_rlhf/datasets/raw). You only need to change the dataset path in the training code to use the corresponding dataset for fine-tuning the Chinese pre-training model:\n\n\u003e 同时，我们也在 [raw-datasets](safe_rlhf/datasets/raw) 中增加了支持一些中文数据集，例如 [Firefly](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) 和 [MOSS 系列](https://huggingface.co/datasets/fnlp/moss-003-sft-data)等。在训练代码中更改数据集路径，你就可以使用相应的数据集来微调中文预训练模型：\n\n```diff\n# scripts/sft.sh\n-\t--train_datasets alpaca \\\n+\t--train_datasets firefly \\\n```\n\nFor instructions on how to add custom datasets, please refer to section [Custom Datasets](#custom-datasets).\n\n\u003e 关于如何添加自定义数据集的方法，请参阅章节 [Custom Datasets (自定义数据集)](#custom-datasets)。\n\n## Benchmark and Evaluation\n\n### Arena via Reward and Cost Models\n\n```bash\nscripts/arena-evaluation.sh \\\n    --red_corner_model_name_or_path output/sft \\\n    --blue_corner_model_name_or_path output/ppo-lag \\\n    --reward_model_name_or_path output/rm \\\n    --cost_model_name_or_path output/cm \\\n    --output_dir output/arena-evaluation\n```\n\n### BIG-bench\n\n```bash\n# Install 
BIG-bench\ngit clone https://github.com/google/BIG-bench.git\n(\n    cd BIG-bench\n    python3 setup.py sdist\n    python3 -m pip install -e .\n)\n\n# BIG-bench evaluation\npython3 -m safe_rlhf.evaluate.bigbench \\\n    --model_name_or_path output/ppo-lag \\\n    --task_name \u003cBIG-bench-task-name\u003e\n```\n\n### GPT-4 Evaluation\n\n```bash\n# Install OpenAI Python API\npip3 install openai\nexport OPENAI_API_KEY=\"...\"  # your OpenAI API key here\n\n# GPT-4 evaluation\npython3 -m safe_rlhf.evaluate.gpt4 \\\n    --red_corner_model_name_or_path output/sft \\\n    --blue_corner_model_name_or_path output/ppo-lag\n```\n\n## Future Plans\n\n- [X] Beaver-7B checkpoint is released on Hugging Face.\n- [X] Release Safe RLHF paper preprint.\n- [ ] We will gradually release the full Safe-RLHF datasets.\n- [ ] Train Larger LLM with Safe-RLHF.\n- [ ] Support memory-efficient training, such as LoRA, PEFT, etc.\n\n## Citation\n\nIf you find Safe-RLHF useful or use Safe-RLHF (model, code, dataset, etc.) 
in your research, please consider citing the following works in your publications.\n\n```bibtex\n@inproceedings{safe-rlhf,\n  title={Safe RLHF: Safe Reinforcement Learning from Human Feedback},\n  author={Josef Dai and Xuehai Pan and Ruiyang Sun and Jiaming Ji and Xinbo Xu and Mickel Liu and Yizhou Wang and Yaodong Yang},\n  booktitle={The Twelfth International Conference on Learning Representations},\n  year={2024},\n  url={https://openreview.net/forum?id=TyFrPOKYXw}\n}\n@inproceedings{beavertails,\n  title={BeaverTails: Towards Improved Safety Alignment of {LLM} via a Human-Preference Dataset},\n  author={Jiaming Ji and Mickel Liu and Juntao Dai and Xuehai Pan and Chi Zhang and Ce Bian and Boyuan Chen and Ruiyang Sun and Yizhou Wang and Yaodong Yang},\n  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},\n  year={2023},\n  url={https://openreview.net/forum?id=g0QovXbFw3}\n}\n```\n\n## PKU-Alignment Team\n\nAll students below contributed equally, and they are listed in alphabetical order:\n\n- [Juntao Dai](https://github.com/calico-1226)\n- [Jiaming Ji](https://github.com/zmsn-2077)\n- [Xuehai Pan](https://github.com/XuehaiPan)\n- [Ruiyang Sun](https://github.com/rockmagma02)\n\nAll advised by [Yizhou Wang](https://cfcs.pku.edu.cn/english/people/faculty/yizhouwang/index.htm) and [Yaodong Yang](https://www.yangyaodong.com/).\nAcknowledgment: We thank [Ms. 
Yi Qu](https://www.xiaohongshu.com/user/profile/58ee23c96a6a695050dcf276) for designing the Beaver logo.\n\n### Acknowledgment  \u003c!-- omit in toc --\u003e\n\nThis repository benefits from [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai), [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/HEAD/applications/DeepSpeed-Chat).\nWe thank their authors for their wonderful work and their efforts to democratize LLM research.\nSafe-RLHF and its related assets are built and open-sourced with love 🤗❤️.\n\nThis work is supported and funded by Peking University.\n\n\u003ctable width=\"100%\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n  \u003ctr align=\"center\" valign=\"middle\"\u003e\n    \u003ctd width=\"40%\"\u003e\n      \u003ca href=\"https://www.ai.pku.edu.cn/\"\u003e\n        \u003cimg src=\"images/pku-ai.png\" width=\"100%\"/\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd width=\"60%\"\u003e\n      \u003ca href=\"https://cfcs.pku.edu.cn/\"\u003e\n        \u003cimg src=\"images/pku-cfcs.png\" width=\"100%\"/\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## License\n\nSafe-RLHF is released under Apache License 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPKU-Alignment%2Fsafe-rlhf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPKU-Alignment%2Fsafe-rlhf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPKU-Alignment%2Fsafe-rlhf/lists"}