https://github.com/wangclnlp/deepspeed-chat-extension
This repo contains some extensions of deepspeed-chat for fine-tuning LLMs (SFT+RLHF).
- Host: GitHub
- URL: https://github.com/wangclnlp/deepspeed-chat-extension
- Owner: wangclnlp
- License: apache-2.0
- Created: 2023-12-15T07:10:36.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-02T15:30:57.000Z (over 1 year ago)
- Last Synced: 2025-04-04T15:21:25.982Z (6 months ago)
- Topics: deepspeed, llama, llm, rlhf, sft
- Language: Python
- Size: 11.7 MB
- Stars: 18
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
We have edited the code of the [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) project to support many new features, as shown below.
# Our New Features🎉🎉🎉
- We propose hybrid alignment training to improve the LLM ([./examples/hybrid_alignment_training](https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/main/examples/hybrid_alignment_training)).
- Add extra loss terms to RLHF in step 3, such as an SFT loss and a pre-training loss; see the sketch after this list ([./examples/add_extra_loss_for_rlhf](./examples/add_extra_loss_for_rlhf)).
- Support [DPO](https://arxiv.org/abs/2305.18290) as an alternative to step 2 ([./examples/dpo](./examples/dpo)).
- Implement [ESRL](https://arxiv.org/abs/2308.02223) features for efficient training in step 3 ([./examples/esrl](./examples/esrl)).
- Support COMET model(s) as reward model(s) in step 3 RLHF ([./examples/rlhf_with_comet_reward](./examples/rlhf_with_comet_reward)).
- Support training reward models directly from scores instead of pairwise data only ([./examples/training_reward_with_scores](./examples/training_reward_with_scores)).

More details in [./examples](./examples).
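As a rough illustration of the extra-loss feature above, the sketch below mixes a PPO actor loss with auxiliary SFT and pre-training losses. This is a minimal sketch, not the repository's implementation; the coefficients `sft_coef` and `ptx_coef` and all tensor names are hypothetical.
```python
import torch
import torch.nn.functional as F

def actor_loss_with_aux(ppo_actor_loss: torch.Tensor,
                        sft_logits: torch.Tensor,   # [batch, seq, vocab] on SFT data
                        sft_labels: torch.Tensor,   # [batch, seq] token ids (already shifted), -100 = ignore
                        ptx_logits: torch.Tensor,   # [batch, seq, vocab] on pre-training data
                        ptx_labels: torch.Tensor,
                        sft_coef: float = 0.1,      # hypothetical weighting coefficients
                        ptx_coef: float = 0.1) -> torch.Tensor:
    """Combine the PPO actor loss with auxiliary SFT and pre-training losses."""
    sft_loss = F.cross_entropy(sft_logits.flatten(0, 1), sft_labels.flatten(), ignore_index=-100)
    ptx_loss = F.cross_entropy(ptx_logits.flatten(0, 1), ptx_labels.flatten(), ignore_index=-100)
    return ppo_actor_loss + sft_coef * sft_loss + ptx_coef * ptx_loss
```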
# Installation
You can use Anaconda/Miniconda to install the packages needed for this project.
```bash
conda env create -f conda-env.yml
conda activate dschat
pip install -r requirements.txt
```

# Training Models
## Step1 Supervised Fine-tuning (SFT)
```bash
bash scripts/sft.sh
```

## Step2 Reward Model Fine-tuning
```bash
bash scripts/reward.sh
```

## Step2 Direct Preference Optimization (DPO)
```bash
bash examples/dpo/train.sh
```

## Step3 Reinforcement Learning from Human Feedback (RLHF)
```bash
bash scripts/rlhf.sh
```

# Supported Models
| Model | Model size |
|:---:|:---:|
| Baichuan | 7B/13B |
| Baichuan2 | 7B/13B |
| LLaMA | 7B/13B/33B/65B |
| LLaMA-2 | 7B/13B/70B |
| Yi | 6B/34B |

# Format of the Dataset
## SFT
The SFT dataset should consist of `txt` files named `train.txt` and `test.txt`, with `sft` somewhere in the path (e.g. `/your/path/to/sft_dataset/train.txt`). Each line contains a JSON string, as in the example below.
Example:
```
{"instruction": "User: Your task is to ... \nAssistant: ", "input": "...", "output": "..."}
...
```
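To make the expected format concrete, here is a small sketch that loads and sanity-checks such a file. It is only an illustration of the format described above; the loading code shipped with this repository may differ.
```python
import json

def load_sft_file(path: str):
    """Read an SFT txt file where each line is a JSON object with
    'instruction', 'input', and 'output' keys."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            missing = {"instruction", "input", "output"} - sample.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {missing}")
            samples.append(sample)
    return samples

# e.g. samples = load_sft_file("/your/path/to/sft_dataset/train.txt")
```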
## SFT with Multi-turn History

We also support SFT training with multi-turn dialogues. The corresponding dataset also contains one JSON string per line, as shown in the example below.
Example:
```
{
"instruction": "User: Your task is to ... \nAssistant: ",
"input": "...",
"output": "...",
"history": [
["user instruction in the first round (optional)", "model response in the first round (optional)"],
["user instruction in the second round (optional)", "model response in the second round (optional)"],
...
]
}
...
```
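For illustration, the sketch below flattens such a multi-turn sample into a single prompt string. The exact prompt template used by this repository is not specified here, so the `User:`/`Assistant:` layout is an assumption based on the examples above.
```python
def build_prompt(sample: dict) -> str:
    """Concatenate optional multi-turn history with the current instruction.
    Assumes the 'User: ... \nAssistant: ' layout shown in the examples above."""
    parts = []
    for user_turn, model_turn in sample.get("history", []):
        parts.append(f"User: {user_turn}\nAssistant: {model_turn}\n")
    parts.append(sample["instruction"] + sample["input"])
    return "".join(parts)

# Example:
# sample = {"instruction": "User: Summarize the text.\nAssistant: ", "input": "...",
#           "output": "...", "history": [["Hi", "Hello! How can I help?"]]}
# print(build_prompt(sample))
```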
## Reward/DPO

The Reward/DPO dataset should consist of parquet files named `train.parquet` and `test.parquet`, with `reward` somewhere in the path (e.g. `/your/path/to/reward_dataset/train.parquet`). Each entry contains four keys, as in the example below.
Example:
| prompt | response | chosen | rejected |
|:---:|:---:|:---:|:---:|
| User: What are some of the challenges with usi... | Some of the challenges with using machine lear... | Some of the challenges with using machine lear... | Machine learning is a very powerful tool. |
| User: Looking for an essay by a contemporary m... | I believe you're thinking of Bernard-Henri Lévy. | I believe you're thinking of Bernard-Henri Lévy. | Laclau maybe? |
| ... | ... | ... | ... |
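As a concrete illustration, the snippet below writes a toy Reward/DPO dataset with those four columns using pandas. It is only a sketch of the expected layout; all values are placeholders.
```python
import pandas as pd

# Toy Reward/DPO dataset with the four required columns (placeholder values).
df = pd.DataFrame({
    "prompt":   ["User: <your prompt here>\nAssistant: "],
    "response": ["<a model response>"],
    "chosen":   ["<the preferred response>"],
    "rejected": ["<the dispreferred response>"],
})

# The path should contain `reward`, e.g. train.parquet under a reward_dataset directory.
# (Writing parquet requires pyarrow or fastparquet to be installed.)
df.to_parquet("train.parquet", index=False)
```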
## RLHF

Same format as SFT, except that the path should contain `rlhf`, e.g. `/your/path/to/rlhf_dataset/train.txt`.
# Inference
You can use [this](rlhf_llama/deepspeed_chat/training/step1_supervised_finetuning/predict.py) Python script for inference, as shown in [`./scripts/predict.sh`](./scripts/predict.sh). Each input line should be in the format `{Input} ||| {None/Reference}`, and each output line will be `{Input} ||| {ModelOutput} ||| {None/Reference}`, as in the example below.
Example:
input.txt
```
User: What are the names of some famous actors ...\nAssistant: ||| Some famous ...
User: ... ||| None
... ||| ...
```

output.txt
```
User: What are the names of some famous actors ...\nAssistant: ||| 1. Denzel Washington ... ||| Some famous ...
User: ... ||| ... ||| None
... ||| ... ||| ...
```
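For convenience, here is a small sketch that parses such an output file back into (input, model output, reference) triples. It illustrates the ` ||| `-separated format above and is not a utility shipped with this repository.
```python
def read_predictions(path: str):
    """Parse an output file where each line is
    '{Input} ||| {ModelOutput} ||| {None/Reference}'."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            source, model_output, reference = line.split(" ||| ", maxsplit=2)
            triples.append((source, model_output,
                            None if reference == "None" else reference))
    return triples

# e.g. for src, hyp, ref in read_predictions("output.txt"): ...
```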
# Last but Not Least

Thanks to the [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) project and its contributors❤️❤️❤️!