https://github.com/songmzhang/AlignDistil
Code for ACL 2025 Paper "AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation"
- Host: GitHub
- URL: https://github.com/songmzhang/AlignDistil
- Owner: songmzhang
- Created: 2025-06-19T13:24:45.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-08-26T13:47:53.000Z (10 months ago)
- Last Synced: 2026-03-18T10:18:38.954Z (4 months ago)
- Language: Python
- Size: 1.92 MB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesomeopd - AlignDistil | 2025.03 | BJTU / Tencent | [arXiv 2503.02832](https://arxiv.org/abs/2503.02832) | AlignDistil: RLHF-equivalent KD (ACL 2025) | (🤝 OPD-RL Hybrids: Inside-RL OPD / 🔁 Iterative Self-Bootstrapping)
README
# AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation (ACL 2025)
[Songming Zhang](https://songmzhang.github.io/), Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen*, Jinan Xu
Our code is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) v0.5.2.post2. We are also updating our codebase to the latest OpenRLHF.
## Why use AlignDistil?
- **Token**-level reward optimization
- Solid performance
- Stable optimization and faster convergence
- Supports both on-policy (like RL) and off-policy (like DPO) training
## Method Framework
AlignDistil is easy to use and consists of three steps:
- Train a DPO model on your preference data
- Train a reverse DPO model on your reversed preference data (swapping `chosen` and `rejected`)
- AlignDistil: compose a synthetic distribution from these two models and distill it into the current policy model.
This last step can be run on your preference data (off-policy) or on model-sampled data (on-policy).
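The reversed preference data used for the reverse DPO model can be produced with a simple field swap. The snippet below is a minimal sketch, not code from this repo: it assumes each record is a dict with `chosen` and `rejected` keys stored as JSON Lines, and the `prompt` key and the helper names `reverse_preference` and `reverse_jsonl` are hypothetical. The data format expected by the training scripts may differ.

```python
import json
from typing import Iterable, Iterator


def reverse_preference(record: dict) -> dict:
    """Return a copy of `record` with `chosen` and `rejected` swapped."""
    swapped = dict(record)  # shallow copy so the input record is left untouched
    swapped["chosen"], swapped["rejected"] = record["rejected"], record["chosen"]
    return swapped


def reverse_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Swap `chosen` and `rejected` in every non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.dumps(reverse_preference(json.loads(line)), ensure_ascii=False)


# Illustrative record; the `prompt` field is an assumption about the data layout.
example = {"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"}
reversed_example = reverse_preference(example)
```

All other fields are copied unchanged, so the reversed dataset can be passed to the same DPO training setup as the original one.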
## Prepare
Please run the following commands to install the OpenRLHF version included in our repo:
```shell
git clone https://github.com/songmzhang/AlignDistil
cd AlignDistil/OpenRLHF
pip install -e ./
```
You also need to install vLLM to run on-policy AlignDistil.
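Before launching training, it can help to confirm that both packages are visible to the Python interpreter. The check below is a small sketch; it assumes the import names are `openrlhf` and `vllm`, and the helper `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the packages from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `openrlhf` and `vllm` are the assumed top-level import names.
missing = missing_packages(["openrlhf", "vllm"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("openrlhf and vllm are importable.")
```

This only checks that the packages can be located; it does not verify versions or GPU support.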
## Start
DPO training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/dpo_01.sh
```
Reverse DPO training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/reverse_dpo_01.sh
```
Off-policy AlignDistil training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_off_policy.sh
```
On-policy AlignDistil training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_on_policy.sh
```
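The scripts above follow the three steps of the method, so they can be chained in order. The driver below is a hedged sketch rather than part of this repo: `build_pipeline` and `run_pipeline` are hypothetical helpers, and it assumes the scripts are launched from the directory that contains `train_scripts` and that the AlignDistil scripts are already configured to find the DPO and reverse DPO checkpoints.

```python
import subprocess
from pathlib import Path

# Script directory taken from the examples above.
SCRIPT_DIR = Path("./train_scripts/ultrafeedback/qwen2.5-1.5b")


def build_pipeline(on_policy: bool = False) -> list:
    """Return the bash commands for DPO, reverse DPO, then AlignDistil."""
    final = "aligndistil_on_policy.sh" if on_policy else "aligndistil_off_policy.sh"
    names = ["dpo_01.sh", "reverse_dpo_01.sh", final]
    return [["bash", str(SCRIPT_DIR / name)] for name in names]


def run_pipeline(on_policy: bool = False, cwd: str = ".") -> None:
    """Run each stage in order and stop at the first failing script."""
    for command in build_pipeline(on_policy):
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_pipeline(on_policy=True)` would run the on-policy variant as the last stage; with `check=True`, a non-zero exit code from any script raises `subprocess.CalledProcessError` and later stages are skipped.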
## Citation
If you find this repo helpful, please cite our paper:
```text
@article{zhang2025aligndistil,
title={Aligndistil: Token-level language model alignment as adaptive policy distillation},
author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},
journal={arXiv preprint arXiv:2503.02832},
year={2025}
}
```