https://github.com/taishan1994/dpo-finetuning
A repository dedicated to training models with DPO (Direct Preference Optimization).
- Host: GitHub
- URL: https://github.com/taishan1994/dpo-finetuning
- Owner: taishan1994
- Created: 2024-08-06T11:23:35.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-07T02:42:18.000Z (about 1 year ago)
- Last Synced: 2025-04-08T22:31:37.240Z (6 months ago)
- Language: Python
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# DPO-Finetuning
A repository dedicated to training models with DPO (Direct Preference Optimization).

---

Data download: https://pan.baidu.com/s/1L01fhb40jJprlCmRKVq2ig?pwd=aspy (extraction code: aspy)
# General Steps

1. Download test.jsonl and train.jsonl into data/CValues-Comparison/.
2. Run download_modelscope.py under model_hub/ to download the pretrained weights (see the sketch after this list).
3. Train: `sh train.sh`
4. Predict: `python predict.py`
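The contents of download_modelscope.py are not reproduced here; as a rough idea, a download script based on ModelScope's `snapshot_download` could look like the sketch below. The model id and target directory are assumptions and may not match the actual script in the repository.

```python
# Hypothetical sketch of a ModelScope download script (the real
# model_hub/download_modelscope.py may differ).
from modelscope import snapshot_download

model_dir = snapshot_download(
    "qwen/Qwen2-0.5B-Instruct",  # assumed ModelScope model id
    cache_dir="./model_hub/",
)
# predict.py expects the weights under ./model_hub/Qwen2-0.5B-Instruct/,
# so the downloaded folder may need to be moved or renamed accordingly.
print("Weights downloaded to:", model_dir)
```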
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model_path = "./model_hub/Qwen2-0.5B-Instruct/"
dpo_model_path = "./output/qwen2-0.5B-Instruct-DPO"
device = "cuda"
quantization_config = None

# Load the base (reference) model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model.eval()

# Load the DPO-finetuned model with the same settings.
dpo_model = AutoModelForCausalLM.from_pretrained(
    dpo_model_path,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
dpo_model.eval()


def get_result(model_inputs, model):
    # Generate a response and strip the prompt tokens from the output.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.get_vocab()["<|im_end|>"],
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response


while True:
    prompt = input(">>>")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # print(text)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    base_model_response = get_result(model_inputs, model)
    model_inputs = tokenizer([text], return_tensors="pt").to(dpo_model.device)
    dpo_model_response = get_result(model_inputs, dpo_model)
    print("Base model:", base_model_response)
    print("DPO model:", dpo_model_response)
    print("=" * 100)
```
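In the script above, `quantization_config` is left as `None`. If GPU memory is tight, a `BitsAndBytesConfig` could be passed instead; this is not part of the original script, only a hedged option:

```python
# Optional 4-bit quantization for the two from_pretrained calls above
# (an add-on suggestion, not something the repository configures).
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
```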
# How to Train Other Models

The key piece is the apply_chat_template function in finetune_dpo.py:
```python
def apply_chat_template(example,
                        system="You are a helpful assistant."):
    # print(example)
    messages = example["messages"]
    chosen_message = ""
    rejected_message = ""
    chosen_score = 0.99
    rejected_score = 0.01
    # DPOTrainer prepends a BOS token itself, so the prompt starts without "<|im_start|>".
    # prompt = "<|im_start|>system\n{}<|im_end|>\n".format(system)
    prompt = "system\n{}<|im_end|>\n".format(system)
    for i, message in enumerate(messages):
        role = message["role"]
        # print(message)
        if role == "user":
            value = message["value"]
            _input = "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n".format(value)
            prompt += _input
        else:
            chosen_value = message["chosen_value"]
            rejected_value = message["rejected_value"]
            if i != len(messages) - 1:
                # For multi-turn dialogues, the chosen and rejected responses of
                # earlier turns should be identical, so only the chosen value is kept.
                prompt += chosen_value + "<|im_end|>\n"
            else:
                # Do not append a final "<|im_end|>"; DPOTrainer adds an EOS token itself.
                chosen_message = chosen_value
                rejected_message += rejected_value
                chosen_score = message["chosen_score"]
                rejected_score = message["rejected_score"]
    example["prompt"] = prompt
    example["chosen"] = chosen_message
    example["rejected"] = rejected_message
    example["reference_chosen_logps"] = chosen_score
    example["reference_rejected_logps"] = rejected_score
    return example
```

The template needs to be adapted to the input format of whichever model you are training. You also need to provide scores for the chosen and rejected responses.
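For reference, here is a minimal sketch of converting a single example; the field names follow exactly what apply_chat_template reads above, but the concrete schema of the CValues-Comparison jsonl files and the way finetune_dpo.py maps the dataset are assumptions.

```python
# A hypothetical single-turn example; the real jsonl schema may differ,
# but these are the fields apply_chat_template reads.
example = {
    "messages": [
        {"role": "user", "value": "What is DPO?"},
        {
            "role": "assistant",
            "chosen_value": "DPO (Direct Preference Optimization) finetunes a model directly on preference pairs.",
            "rejected_value": "No idea.",
            "chosen_score": 0.99,
            "rejected_score": 0.01,
        },
    ]
}

processed = apply_chat_template(example)
print(processed["prompt"])    # system + user turns rendered in the Qwen chat format
print(processed["chosen"])    # preferred response, without a trailing <|im_end|>
print(processed["rejected"])  # dispreferred response

# To build the full training set, the whole jsonl file could be mapped, e.g.:
# from datasets import load_dataset
# dataset = load_dataset("json", data_files="data/CValues-Comparison/train.jsonl", split="train")
# dataset = dataset.map(apply_chat_template)
```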
# References
> https://github.com/huggingface/alignment-handbook