https://github.com/dvlab-research/step-dpo
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
- Host: GitHub
- URL: https://github.com/dvlab-research/step-dpo
- Owner: dvlab-research
- Created: 2024-06-24T05:15:20.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-15T14:43:51.000Z (over 1 year ago)
- Last Synced: 2024-07-15T17:53:25.597Z (over 1 year ago)
- Topics: dpo, llm, math, reasoning
- Language: Python
- Homepage:
- Size: 6.46 MB
- Stars: 155
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: licenses/DATA_LICENSE
README

# Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
[Xin Lai](https://scholar.google.com/citations?user=tqNDPA4AAAAJ&hl),
[Zhuotao Tian](https://scholar.google.com/citations?user=mEjhz-IAAAAJ&hl),
[Yukang Chen](https://scholar.google.com/citations?user=6p0ygKUAAAAJ&hl),
[Senqiao Yang](https://scholar.google.com/citations?user=NcJc-RwAAAAJ&hl),
Xiangru Peng,
[Jiaya Jia](https://scholar.google.com/citations?user=XPAkzTEAAAAJ&hl=en)
[🤗 Models](https://huggingface.co/collections/xinlai/step-dpo-6682e12dfbbb2917c8161df7) | [🤗 Dataset](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) | [Paper (arXiv)](https://arxiv.org/pdf/2406.18629) | [Demo](http://103.170.5.190:7870/) | [Code License](licenses/LICENSE) | [Data License](licenses/DATA_LICENSE) | [Weight License](licenses/WEIGHT_LICENSE)

This repo provides the implementation of **Step-DPO**, a simple, effective, and data-efficient method for boosting the long-chain reasoning ability of LLMs, with **a data construction pipeline** that yields a **high-quality dataset** containing 10K step-wise preference pairs.
Notably, **Step-DPO** boosts the performance of **Qwen2-7B-Instruct** from **53.0%** to **58.6%** on MATH, and **85.5%** to **87.9%** on GSM8K, with as few as **10K data** and **hundreds of training steps**!
Moreover, **Step-DPO**, when applied to **Qwen2-72B-Instruct**, achieves scores of **70.8%** and **94.0%** on the test sets of **MATH** and **GSM8K**, respectively, **surpassing a series of closed-source models** without bells and whistles, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
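For readers who want the objective in code before touching the training scripts, here is a minimal PyTorch sketch of the step-wise DPO loss. It is an illustration only: the function name and tensor layout are ours, not the repo's (the actual training code builds on the alignment-handbook stack, see Acknowledgement), and `beta=0.4` simply mirrors the default used in the training command below.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.4) -> torch.Tensor:
    """Step-wise DPO on step-level preference pairs (illustrative sketch).

    Each tensor holds, per example, the summed log-probability of the chosen
    (corrected) or rejected (erroneous) reasoning step, conditioned on the
    problem plus the preceding correct steps. Shapes: (batch,).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the corrected step above the erroneous one in log-probability.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    b = 4  # toy batch of random log-probabilities
    print(step_dpo_loss(torch.randn(b), torch.randn(b),
                        torch.randn(b), torch.randn(b)).item())
```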

## TABLE OF CONTENTS
1. [News](#news)
2. [Datasets](#datasets)
3. [Models](#models)
4. [Installation](#installation)
5. [Training](#training)
6. [Evaluation](#evaluation)
7. [Data Construction Pipeline](#data-construction-pipeline)
8. [Deployment](#deployment)
9. [Examples](#examples)
10. [Acknowledgement](#acknowledgement)
11. [Citation](#citation)

## News
- [x] [2024.7.7] We release the scripts for the [Data Construction Pipeline](#data-construction-pipeline)! You can construct the dataset on your own with these scripts!
- [x] [2024.7.1] We release the demo of the model [Qwen2-7B-Instruct-Step-DPO](https://huggingface.co/xinlai/Qwen2-7B-Instruct-Step-DPO). Welcome to try it on [Demo](http://103.170.5.190:7870/)!
- [x] [2024.6.28] We release the pre-print of [Step-DPO](https://arxiv.org/pdf/2406.18629) and this GitHub repo, including training/evaluation scripts, pre-trained models, and data.

## Datasets

We build a 10K math preference dataset for Step-DPO, which can be downloaded from the following link.
| Dataset | Size | Link |
| ------------------------ | ------ | ------------------------------------------------------------ |
| xinlai/Math-Step-DPO-10K | 10,795 | 🤗 [Hugging Face](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) |
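To take a quick look at the data before training, it can be loaded with the 🤗 `datasets` library. The snippet below is a minimal sketch; the `split="train"` argument and the field descriptions in the comments are assumptions, so inspect `column_names` for the exact schema.

```python
from datasets import load_dataset

# Each record pairs a math problem (plus the reasoning steps produced so far)
# with a preferred next step and a dispreferred one. Assumed split name: "train".
ds = load_dataset("xinlai/Math-Step-DPO-10K", split="train")

print(len(ds))          # expected ~10,795 step-wise preference pairs
print(ds.column_names)  # check the exact field names here
print(ds[0])            # one preference pair
```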
## Models

The model **Qwen2-72B-Instruct + Step-DPO** achieves **70.8%** and **94.0%** on the MATH and GSM8K test sets, respectively. Step-DPO also brings considerable improvements to a range of models, as shown below. Feel free to download and use them.
| Models | Size | MATH | GSM8K | Odyssey-MATH | Link |
| :------------------------------ | :--: | :----: | :---: | :---: | :----------------------------------------------------------: |
| Qwen2-7B-Instruct | 7B | 53.0 | 85.5 | - | - |
| **Qwen2-7B-Instruct + Step-DPO** | 7B | **58.6 (+5.6)** | **87.9 (+2.4)** | - | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-7B-Instruct-Step-DPO) |
| DeepSeekMath-RL | 7B | 51.7 | 88.2 | - | - |
| **DeepSeekMath-RL + Step-DPO** | 7B | **53.2 (+1.5)** | **88.7 (+0.5)** | - | 🤗 [HF](https://huggingface.co/xinlai/DeepSeekMath-RL-Step-DPO) |
| Qwen2-7B-SFT | 7B | 54.8 | 88.2 | - | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-7B-SFT) |
| **Qwen2-7B-SFT + Step-DPO** | 7B | **55.8 (+1.0)** | **88.5 (+0.3)** | - |🤗 [HF](https://huggingface.co/xinlai/Qwen2-7B-SFT-Step-DPO) |
| Qwen1.5-32B-SFT | 32B | 54.9 | 90.0 | - | 🤗 [HF](https://huggingface.co/xinlai/Qwen1.5-32B-SFT) |
| **Qwen1.5-32B-SFT + Step-DPO** | 32B | **56.9 (+2.0)** | **90.9 (+0.9)** | - |🤗 [HF](https://huggingface.co/xinlai/Qwen1.5-32B-SFT-Step-DPO) |
| Qwen2-57B-A14B-SFT | 57B | 54.6 | 89.8 | - | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-57B-A14B-SFT) |
| **Qwen2-57B-A14B-SFT + Step-DPO** | 57B | **56.5 (+1.9)** | **90.0 (+0.2)** | - |🤗 [HF](https://huggingface.co/xinlai/Qwen2-57B-A14B-SFT-Step-DPO) |
| Llama-3-70B-SFT | 70B | 56.9 | 92.2 | - | 🤗 [HF](https://huggingface.co/xinlai/Llama-3-70B-SFT) |
| **Llama-3-70B-SFT + Step-DPO** | 70B | **59.5 (+2.6)** | **93.3 (+1.1)** | - |🤗 [HF](https://huggingface.co/xinlai/Llama-3-70B-SFT-Step-DPO) |
| Qwen2-72B-SFT | 72B | 61.7 | 92.9 | 44.2 | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-72B-SFT) |
| **Qwen2-72B-SFT + Step-DPO** | 72B | **64.7 (+3.0)** | **93.9 (+1.0)** | **47.0 (+2.8)** | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-72B-SFT-Step-DPO) |
| Qwen2-72B-Instruct | 72B | 69.4 | 92.4 | 47.0 | - |
| **Qwen2-72B-Instruct + Step-DPO** | 72B | **70.8 (+1.4)** | **94.0 (+1.6)** | **50.1 (+3.1)** | 🤗 [HF](https://huggingface.co/xinlai/Qwen2-72B-Instruct-Step-DPO) |

Note: **Odyssey-MATH** contains competition-level math problems.
## Installation
```
conda create -n step_dpo python=3.10
conda activate step_dpo
pip install -r requirements.txt
```
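After installing, a quick sanity check that the GPU stack is visible can save time before launching a large run. This assumes the usual PyTorch/Transformers dependencies from `requirements.txt`:

```python
# Sanity check for the training environment (assumes torch + transformers are installed).
import torch
import transformers

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("visible gpus:", torch.cuda.device_count())
```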
## Training

### Pre-trained weights
We use Qwen2, Qwen1.5, Llama-3, and DeepSeekMath models as the pre-trained weights and fine-tune them with Step-DPO. Download them based on your choice.

| Pre-trained weights |
|:---------------------------------------------------------------------------|
| [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |
| [deepseek-ai/deepseek-math-7b-rl](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) |
| [xinlai/Qwen2-7B-SFT](https://huggingface.co/xinlai/Qwen2-7B-SFT) |
| [xinlai/Qwen1.5-32B-SFT](https://huggingface.co/xinlai/Qwen1.5-32B-SFT) |
| [xinlai/Qwen2-57B-A14B-SFT](https://huggingface.co/xinlai/Qwen2-57B-A14B-SFT) |
| [xinlai/Llama-3-70B-SFT](https://huggingface.co/xinlai/Llama-3-70B-SFT) |
| [xinlai/Qwen2-72B-SFT](https://huggingface.co/xinlai/Qwen2-72B-SFT) |
| [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) |

**Note**: models with '-SFT' are supervised fine-tuned from open-source base models on our 299K SFT data. You can perform Step-DPO on either our SFT models or existing open-source instruct models.
Here is a script example to perform Step-DPO on `Qwen/Qwen2-72B-Instruct`:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3_cpu.yaml --mixed_precision bf16 \
--num_processes 8 \
train.py configs/config_full.yaml \
--model_name_or_path="Qwen/Qwen2-72B-Instruct" \
--data_path="xinlai/Math-Step-DPO-10K" \
--per_device_train_batch_size=2 \
--gradient_accumulation_steps=8 \
--torch_dtype=bfloat16 \
--bf16=True \
--beta=0.4 \
--num_train_epochs=4 \
--save_strategy='steps' \
--save_steps=200 \
--save_total_limit=1 \
--output_dir=outputs/qwen2-72b-instruct-step-dpo \
--hub_model_id=qwen2-72b-instruct-step-dpo \
--prompt=qwen2-boxed
```
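With the flags above, the effective batch size is `num_processes (8) × per_device_train_batch_size (2) × gradient_accumulation_steps (8) = 128` preference pairs per optimizer step. On the 10,795-pair dataset, that is roughly 84 steps per epoch, so 4 epochs amount to a few hundred optimizer steps, matching the "hundreds of training steps" mentioned above. If your hardware differs, adjust `per_device_train_batch_size` and `gradient_accumulation_steps` together to keep a comparable effective batch size.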
## Evaluation

Here are script examples to evaluate fine-tuned models on the GSM8K and MATH test sets:
```
python eval_math.py \
--model outputs/qwen2-72b-instruct-step-dpo \
--data_file ./data/test/GSM8K_test_data.jsonl \
--save_path 'eval_results/gsm8k/qwen2-72b-instruct-step-dpo.json' \
--prompt 'qwen2-boxed' \
--tensor_parallel_size 8
```

```
python eval_math.py \
--model outputs/qwen2-72b-instruct-step-dpo \
--data_file ./data/test/MATH_test_data.jsonl \
--save_path 'eval_results/math/qwen2-72b-instruct-step-dpo.json' \
--prompt 'qwen2-boxed' \
--tensor_parallel_size 8
```
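In both commands, `--tensor_parallel_size` controls how many GPUs the model is sharded across for inference; 8 matches a 72B model on an 8-GPU node, and smaller checkpoints such as the 7B models can usually be evaluated with a smaller value. The `--prompt 'qwen2-boxed'` template should match the `--prompt` flag used during training.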
## Data Construction Pipeline

We release the scripts used to construct the Step-DPO data in the `data_pipeline/` directory. Please follow the instructions below.
```
cd Step-DPO

# Step 1: Error Collection
# Before executing, please set MODEL_PATH, PRED_PATH, and EVAL_PROMPT
bash data_pipeline/step1.sh

# Step 2: Locate the erroneous step with GPT-4o
# Before executing, please set OPENAI_BASE_URL and OPENAI_API_KEY
bash data_pipeline/step2.sh

# Step 3: Rectify with the model itself
# Before executing, please set MODEL_PATH, EVAL_PROMPT, JSON_FILE, PRED_PATH, and SAVE_PATH
bash data_pipeline/step3.sh

# Finally, merge to get the resulting dataset
# Before executing, please set EVAL_PROMPT, JSON_FILE, PRED_PATH, and SAVE_PATH
bash data_pipeline/merge.sh
```
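Purely as an illustration of what Step 2 does conceptually, the sketch below asks GPT-4o to identify the first erroneous step in a wrong solution. The function name, prompt wording, and response parsing are hypothetical; the real prompts and parsing live in the `data_pipeline/` scripts.

```python
# Illustrative sketch only -- NOT the repo's actual step2 implementation.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["OPENAI_BASE_URL"],
                api_key=os.environ["OPENAI_API_KEY"])

def locate_first_error(question: str, steps: list[str]) -> str:
    """Ask GPT-4o which reasoning step is the first erroneous one (hypothetical helper)."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = (f"Question: {question}\n{numbered}\n"
              "The final answer is incorrect. Reply with only the number of "
              "the first erroneous step.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```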
## Deployment

For deployment, use the following command:
```
python3 app.py --model_path_or_name xinlai/Qwen2-7B-Instruct-Step-DPO
```

## Examples




## Acknowledgement
This repository is based on [alignment-handbook](https://github.com/huggingface/alignment-handbook), [DeepSeekMath](https://github.com/deepseek-ai/DeepSeek-Math), and [MetaMath](https://github.com/meta-math/MetaMath).
Many thanks for their efforts!
## Citation
If you find this project useful in your research, please consider citing us:

```
@article{lai2024stepdpo,
  title={Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs},
  author={Xin Lai and Zhuotao Tian and Yukang Chen and Senqiao Yang and Xiangru Peng and Jiaya Jia},
  journal={arXiv preprint arXiv:2406.18629},
  year={2024}
}
```