# DPO-Finetuning
A repository dedicated to training DPO (Direct Preference Optimization) models.

---

Data: https://pan.baidu.com/s/1L01fhb40jJprlCmRKVq2ig?pwd=aspy (Baidu Netdisk, extraction code: aspy)
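
As a quick sanity check, you can peek at the first record of the downloaded files. This is a minimal sketch, assuming the files have been placed under `data/CValues-Comparison/` as described in step 1 below; no particular field layout is assumed.

```python
import json

# Assumes train.jsonl has already been placed under data/CValues-Comparison/.
with open("data/CValues-Comparison/train.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

# Print the available fields without assuming a particular schema.
print(list(first_record.keys()))
```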

# General steps

1. Download `test.jsonl` and `train.jsonl` into `data/CValues-Comparison/`.

2. Run `download_modelscope.py` under `model_hub/` to download the pretrained weights (see the download sketch after the prediction example below).

3. Train: `sh train.sh` (a minimal sketch of the underlying DPO training call also follows the prediction example).

4. Predict: `python predict.py`
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


base_model_path = "./model_hub/Qwen2-0.5B-Instruct/"
dpo_model_path = "./output/qwen2-0.5B-Instruct-DPO"


device = "cuda"
quantization_config = None

# Load the base model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(base_model_path,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model.eval()

# Load the DPO-finetuned model.
dpo_model = AutoModelForCausalLM.from_pretrained(dpo_model_path,
                                                 device_map="auto",
                                                 trust_remote_code=True,
                                                 quantization_config=quantization_config)
dpo_model.eval()


def get_result(model_inputs, model):
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.get_vocab()["<|im_end|>"]
    )
    # Drop the prompt tokens so only the newly generated tokens remain.
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response


while True:
    prompt = input(">>>")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # print(text)

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    base_model_response = get_result(model_inputs, model)
    model_inputs = tokenizer([text], return_tensors="pt").to(dpo_model.device)
    dpo_model_response = get_result(model_inputs, dpo_model)
    print("Base model:", base_model_response)
    print("DPO model:", dpo_model_response)
    print("=" * 100)
```
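
For step 2, `model_hub/download_modelscope.py` presumably wraps ModelScope's `snapshot_download`; a minimal sketch is shown below. The model id `Qwen/Qwen2-0.5B-Instruct` and the target directory are assumptions based on the paths used above, and the directory layout produced by `cache_dir` may need adjusting to match `./model_hub/Qwen2-0.5B-Instruct/`.

```python
from modelscope import snapshot_download

# Assumed model id; adjust if the repository downloads a different checkpoint.
model_dir = snapshot_download("Qwen/Qwen2-0.5B-Instruct", cache_dir="./model_hub")
print("Weights downloaded to:", model_dir)
```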
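
For step 3, `train.sh` presumably invokes `finetune_dpo.py`, which builds on `trl`'s `DPOTrainer`. The sketch below is not the repository's actual script: the hyperparameters are placeholders, and the `DPOTrainer` signature shown here matches older `trl` releases (newer releases move `beta`, `max_length`, etc. into a `DPOConfig`).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_model_path = "./model_hub/Qwen2-0.5B-Instruct/"

# Policy model and frozen reference model start from the same weights.
model = AutoModelForCausalLM.from_pretrained(base_model_path, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(base_model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# DPOTrainer expects "prompt", "chosen" and "rejected" columns; the raw records
# still need to be mapped through apply_chat_template (see the next section), e.g.
# train_dataset = train_dataset.map(apply_chat_template)
train_dataset = load_dataset(
    "json", data_files="data/CValues-Comparison/train.jsonl"
)["train"]

training_args = TrainingArguments(
    output_dir="./output/qwen2-0.5B-Instruct-DPO",
    per_device_train_batch_size=2,       # placeholder hyperparameters
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    beta=0.1,                             # DPO temperature
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
trainer.save_model("./output/qwen2-0.5B-Instruct-DPO")
```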

# How to train other models

The key piece is `apply_chat_template` in `finetune_dpo.py`:

```python
def apply_chat_template(example,
                        system="You are a helpful assistant."):
    # print(example)
    messages = example["messages"]
    chosen_message = ""
    rejected_message = ""
    chosen_score = 0.99
    rejected_score = 0.01

    # DPOTrainer prepends a BOS token itself, so the leading <|im_start|> is omitted here.
    # prompt = "<|im_start|>system\n{}<|im_end|>\n".format(system)
    prompt = "system\n{}<|im_end|>\n".format(system)
    for i, message in enumerate(messages):
        role = message["role"]
        # print(message)
        if role == "user":
            value = message["value"]
            _input = "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n".format(value)
            prompt += _input
        else:
            chosen_value = message["chosen_value"]
            rejected_value = message["rejected_value"]
            if i != len(messages) - 1:
                # In a multi-turn dialogue, the chosen and rejected responses of the
                # earlier turns should be identical, so only the chosen one is kept.
                prompt += chosen_value + "<|im_end|>\n"
            else:
                # No trailing <|im_end|> on the final turn; DPOTrainer appends the EOS token itself.
                chosen_message = chosen_value
                rejected_message += rejected_value
                chosen_score = message["chosen_score"]
                rejected_score = message["rejected_score"]

    example["prompt"] = prompt
    example["chosen"] = chosen_message
    example["rejected"] = rejected_message
    example["reference_chosen_logps"] = chosen_score
    example["reference_rejected_logps"] = rejected_score
    return example
```
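
To make the expected record layout concrete, here is a hypothetical single-turn example of what `apply_chat_template` consumes and produces (the field names follow the function above; the content strings and scores are made up):

```python
example = {
    "messages": [
        {"role": "user", "value": "Hello"},
        {
            "role": "assistant",
            "chosen_value": "Hello! How can I help you today?",
            "rejected_value": "I don't feel like answering.",
            "chosen_score": 0.9,
            "rejected_score": 0.1,
        },
    ]
}

result = apply_chat_template(example)
# result["prompt"]   == "system\nYou are a helpful assistant.<|im_end|>\n"
#                       "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
# result["chosen"]   == "Hello! How can I help you today?"
# result["rejected"] == "I don't feel like answering."
# result["reference_chosen_logps"]   == 0.9
# result["reference_rejected_logps"] == 0.1
```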

The data needs to be converted into the input format each model expects, and scores for the chosen and rejected responses must be provided. A hypothetical model-agnostic variant is sketched below.
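
As an illustration, such a variant could let the tokenizer's own chat template build the prompt instead of hard-coding Qwen's special tokens. This is only a sketch: depending on the model and the `trl` version, you may still need to trim a leading BOS or trailing EOS the way the Qwen-specific version above does.

```python
def apply_chat_template_generic(example, tokenizer,
                                system="You are a helpful assistant."):
    """Hypothetical variant that delegates prompt construction to the tokenizer."""
    messages = example["messages"]
    last = messages[-1]  # the final assistant turn carries chosen/rejected

    # Rebuild the conversation in the standard {role, content} form,
    # using the chosen answer for any earlier assistant turns.
    chat = [{"role": "system", "content": system}]
    for message in messages[:-1]:
        if message["role"] == "user":
            chat.append({"role": "user", "content": message["value"]})
        else:
            chat.append({"role": "assistant", "content": message["chosen_value"]})

    example["prompt"] = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    example["chosen"] = last["chosen_value"]
    example["rejected"] = last["rejected_value"]
    example["reference_chosen_logps"] = last["chosen_score"]
    example["reference_rejected_logps"] = last["rejected_score"]
    return example
```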

# References

> https://github.com/huggingface/alignment-handbook