https://github.com/wangclnlp/deepspeed-chat-extension
This repo contains some extensions of deepspeed-chat for fine-tuning LLMs (SFT+RLHF).
- Host: GitHub
- URL: https://github.com/wangclnlp/deepspeed-chat-extension
- Owner: wangclnlp
- License: apache-2.0
- Created: 2023-12-15T07:10:36.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-02T15:30:57.000Z (over 1 year ago)
- Last Synced: 2025-04-04T15:21:25.982Z (6 months ago)
- Topics: deepspeed, llama, llm, rlhf, sft
- Language: Python
- Size: 11.7 MB
- Stars: 18
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
We have edited the code of the [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) project to support many new features, as shown below.
# Our New Features🎉🎉🎉
- We propose hybrid alignment training to improve the LLM ([./examples/hybrid_alignment_training](https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/main/examples/hybrid_alignment_training)).
- Add extra loss terms to RLHF in step 3, such as an SFT loss and a pre-training loss; see the sketch after this list ([./examples/add_extra_loss_for_rlhf](./examples/add_extra_loss_for_rlhf)).
- Support [DPO](https://arxiv.org/abs/2305.18290) as an alternative to step 2 ([./examples/dpo](./examples/dpo)).
- Implement [ESRL](https://arxiv.org/abs/2308.02223) features for efficient training in step 3 ([./examples/esrl](./examples/esrl)).
- Support COMET model(s) as reward model(s) in step 3 RLHF ([./examples/rlhf_with_comet_reward](./examples/rlhf_with_comet_reward)).
- Support training reward models directly from scores instead of pairwise data only ([./examples/training_reward_with_scores](./examples/training_reward_with_scores)).

More details in [./examples](./examples).
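As a rough illustration of the extra-loss feature above, the sketch below mixes a PPO actor loss with auxiliary SFT and pre-training losses. This is a minimal sketch, not the repository's implementation; the coefficients `sft_coef` and `ptx_coef` and all tensor names are hypothetical.
```python
import torch
import torch.nn.functional as F

def actor_loss_with_aux(ppo_actor_loss: torch.Tensor,
                        sft_logits: torch.Tensor,   # [batch, seq, vocab] on SFT data
                        sft_labels: torch.Tensor,   # [batch, seq] token ids (already shifted), -100 = ignore
                        ptx_logits: torch.Tensor,   # [batch, seq, vocab] on pre-training data
                        ptx_labels: torch.Tensor,
                        sft_coef: float = 0.1,      # hypothetical weighting coefficients
                        ptx_coef: float = 0.1) -> torch.Tensor:
    """Combine the PPO actor loss with auxiliary SFT and pre-training losses."""
    sft_loss = F.cross_entropy(sft_logits.flatten(0, 1), sft_labels.flatten(), ignore_index=-100)
    ptx_loss = F.cross_entropy(ptx_logits.flatten(0, 1), ptx_labels.flatten(), ignore_index=-100)
    return ppo_actor_loss + sft_coef * sft_loss + ptx_coef * ptx_loss
```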
# Installation
You can use Anaconda/Miniconda to install the packages needed for this project.
```bash
conda env create -f conda-env.yml
conda activate dschat
pip install -r requirements.txt
```

# Training Models
## Step1 Supervised Fine-tuning (SFT)
```bash
bash scripts/sft.sh
```

## Step2 Reward Model Fine-tuning
```bash
bash scripts/reward.sh
```

## Step2 Direct Preference Optimization (DPO)
```bash
bash examples/dpo/train.sh
```

## Step3 Reinforcement Learning from Human Feedback (RLHF)
```bash
bash scripts/rlhf.sh
```

# Supported Models
| Model | Model size |
|:---:|:---:|
| Baichuan | 7B/13B |
| Baichuan2 | 7B/13B |
| LLaMA | 7B/13B/33B/65B |
| LLaMA-2 | 7B/13B/70B |
| Yi | 6B/34B |

# Format of the Dataset
## SFT
The SFT dataset should consist of `txt` files named `train.txt` and `test.txt`, with `sft` somewhere in the path (e.g. `/your/path/to/sft_dataset/train.txt`). Each line contains a JSON string, as in the example below.
Example:
```
{"instruction": "User: Your task is to ... \nAssistant: ", "input": "...", "output": "..."}
...
```
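To make the expected format concrete, here is a small sketch that loads and sanity-checks such a file. It is only an illustration of the format described above; the loading code shipped with this repository may differ.
```python
import json

def load_sft_file(path: str):
    """Read an SFT txt file where each line is a JSON object with
    'instruction', 'input', and 'output' keys."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            missing = {"instruction", "input", "output"} - sample.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {missing}")
            samples.append(sample)
    return samples

# e.g. samples = load_sft_file("/your/path/to/sft_dataset/train.txt")
```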
## SFT with Multi-turn History

We also support SFT training with multi-turn dialogues. The corresponding dataset also contains one JSON string per line, as shown in the example below.
Example:
```
{
"instruction": "User: Your task is to ... \nAssistant: ",
"input": "...",
"output": "...",
"history": [
["user instruction in the first round (optional)", "model response in the first round (optional)"],
["user instruction in the second round (optional)", "model response in the second round (optional)"],
...
]
}
...
```
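For illustration, the sketch below flattens such a multi-turn sample into a single prompt string. The exact prompt template used by this repository is not specified here, so the `User:`/`Assistant:` layout is an assumption based on the examples above.
```python
def build_prompt(sample: dict) -> str:
    """Concatenate optional multi-turn history with the current instruction.
    Assumes the 'User: ... \nAssistant: ' layout shown in the examples above."""
    parts = []
    for user_turn, model_turn in sample.get("history", []):
        parts.append(f"User: {user_turn}\nAssistant: {model_turn}\n")
    parts.append(sample["instruction"] + sample["input"])
    return "".join(parts)

# Example:
# sample = {"instruction": "User: Summarize the text.\nAssistant: ", "input": "...",
#           "output": "...", "history": [["Hi", "Hello! How can I help?"]]}
# print(build_prompt(sample))
```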
## Reward/DPO

The Reward/DPO dataset should consist of parquet files named `train.parquet` and `test.parquet`, with `reward` somewhere in the path (e.g. `/your/path/to/reward_dataset/train.parquet`). Each entry contains four keys, as in the example below.
Example:
| prompt | response | chosen | rejected |
|:---:|:---:|:---:|:---:|
| User: What are some of the challenges with usi... | Some of the challenges with using machine lear... | Some of the challenges with using machine lear... | Machine learning is a very powerful tool. |
| User: Looking for an essay by a contemporary m... | I believe you're thinking of Bernard-Henri Lévy. | I believe you're thinking of Bernard-Henri Lévy. | Laclau maybe? |
| ... | ... | ... | ... |
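As a concrete illustration, the snippet below writes a toy Reward/DPO dataset with those four columns using pandas. It is only a sketch of the expected layout; all values are placeholders.
```python
import pandas as pd

# Toy Reward/DPO dataset with the four required columns (placeholder values).
df = pd.DataFrame({
    "prompt":   ["User: <your prompt here>\nAssistant: "],
    "response": ["<a model response>"],
    "chosen":   ["<the preferred response>"],
    "rejected": ["<the dispreferred response>"],
})

# The path should contain `reward`, e.g. train.parquet under a reward_dataset directory.
# (Writing parquet requires pyarrow or fastparquet to be installed.)
df.to_parquet("train.parquet", index=False)
```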
## RLHF

Same format as SFT, except that the path should contain `rlhf`, e.g. `/your/path/to/rlhf_dataset/train.txt`.
# Inference
You can use [this](rlhf_llama/deepspeed_chat/training/step1_supervised_finetuning/predict.py) Python script for inference, as shown in [`./scripts/predict.sh`](./scripts/predict.sh). Each input line should be in the format `{Input} ||| {None/Reference}`, and each output line will be `{Input} ||| {ModelOutput} ||| {None/Reference}`, as in the example below.
Example:
input.txt
```
User: What are the names of some famous actors ...\nAssistant: ||| Some famous ...
User: ... ||| None
... ||| ...
```

output.txt
```
User: What are the names of some famous actors ...\nAssistant: ||| 1. Denzel Washington ... ||| Some famous ...
User: ... ||| ... ||| None
... ||| ... ||| ...
```
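For convenience, here is a small sketch that parses such an output file back into (input, model output, reference) triples. It illustrates the ` ||| `-separated format above and is not a utility shipped with this repository.
```python
def read_predictions(path: str):
    """Parse an output file where each line is
    '{Input} ||| {ModelOutput} ||| {None/Reference}'."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            source, model_output, reference = line.split(" ||| ", maxsplit=2)
            triples.append((source, model_output,
                            None if reference == "None" else reference))
    return triples

# e.g. for src, hyp, ref in read_predictions("output.txt"): ...
```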
# Last but Not Least

Thanks to the [DeepSpeed-Chat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat) project and its contributors❤️❤️❤️!