https://github.com/songmzhang/AlignDistil

Code for ACL 2025 Paper "AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation"

Host: GitHub
URL: https://github.com/songmzhang/AlignDistil
Owner: songmzhang
Created: 2025-06-19T13:24:45.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-08-26T13:47:53.000Z (11 months ago)
Last Synced: 2026-03-18T10:18:38.954Z (4 months ago)
Language: Python
Size: 1.92 MB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesomeopd - AlignDistil | 2025.03 | BJTU / Tencent | [arXiv 2503.02832](https://arxiv.org/abs/2503.02832) | AlignDistil: RLHF-equivalent KD (ACL 2025) | (🤝 OPD-RL Hybrids: Inside-RL OPD / 🔁 Iterative Self-Bootstrapping)

# AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation (ACL2025)
[Songming Zhang](https://songmzhang.github.io/), Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen*, Jinan Xu

Our code is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) v0.5.2.post2. We are also updating our codebase to the latest OpenRLHF.

## Why use AlignDistil?
- **Token**-level reward optimization
- Solid performance
- Stable optimization and faster convergence
- Supports both on-policy (like RL) and off-policy (like DPO) training

## Method Framework

AlignDistil is easy to use and consists of three steps:
- Train a DPO model on your preference data.
- Train a reverse DPO model on your reversed preference data (swapping `chosen` and `rejected`).
- AlignDistil: compose a synthetic distribution from these two models and distill it into the current policy model. Training can use your preference data (off-policy) or model-sampled data (on-policy).
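
To make the third step more concrete, here is a minimal sketch of composing a teacher distribution for a single token position and measuring how far the current policy is from it. It is not the implementation in this repo. The composition rule (shifting some `base_logits` along the direction of DPO logits minus reverse DPO logits), the fixed weight `alpha`, the choice of base logits, and the direction of the KL divergence are all assumptions made for illustration; see the paper for the exact, token-adaptive formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def compose_teacher_logits(base_logits, dpo_logits, rdpo_logits, alpha=1.0):
    """Assumed composition rule (hypothetical, for illustration only):
    move the base logits along (DPO - reverse DPO), scaled by alpha."""
    return [
        b + alpha * (d - r)
        for b, d, r in zip(base_logits, dpo_logits, rdpo_logits)
    ]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


# Toy vocabulary of three tokens (made-up numbers).
base = [0.0, 0.0, 0.0]   # hypothetical base logits, e.g. from a reference model
dpo = [1.0, 0.0, 0.0]    # DPO model prefers token 0
rdpo = [0.0, 1.0, 0.0]   # reverse DPO model prefers token 1

teacher = compose_teacher_logits(base, dpo, rdpo, alpha=1.0)
student = [0.0, 0.0, 0.0]  # current policy logits
loss = kl_divergence(teacher, student)  # distillation loss for this token
```

In this toy setup the teacher puts more mass on the token that the DPO model prefers and the reverse DPO model disfavors, and the distillation loss pulls the policy toward that distribution.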

## Prepare
Run the following commands to install the OpenRLHF version bundled in our repo:
```shell
git clone https://github.com/songmzhang/AlignDistil
cd AlignDistil/OpenRLHF
pip install -e ./
```
To run on-policy AlignDistil, you also need to install vLLM.
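
Before launching an on-policy run, it can help to confirm that the dependency is importable in the active environment. The snippet below is a small convenience sketch, not part of this repo; the helper name `is_installed` is hypothetical, and it only checks that the top-level `vllm` module can be found, not that its version is compatible.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("vllm"):
    print("vllm not found: install it before running on-policy AlignDistil")
```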

## Start
DPO training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/dpo_01.sh
```

Reverse DPO training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/reverse_dpo_01.sh
```
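
Reverse DPO is trained on preference data with `chosen` and `rejected` swapped. If you need to build such a dataset yourself, the swap could look like the sketch below. It assumes each record is a dict with `chosen` and `rejected` keys; the `prompt` field and the helper name `reverse_preferences` are hypothetical, and the README does not state whether the script above performs this swap internally.

```python
def reverse_preferences(records):
    """Return new records with the `chosen` and `rejected` fields swapped."""
    reversed_records = []
    for rec in records:
        swapped = dict(rec)  # shallow copy so the input is not mutated
        swapped["chosen"] = rec["rejected"]
        swapped["rejected"] = rec["chosen"]
        reversed_records.append(swapped)
    return reversed_records


# Made-up example record.
data = [{"prompt": "Say hi.", "chosen": "Hello!", "rejected": "Go away."}]
reversed_data = reverse_preferences(data)
```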

Off-policy AlignDistil training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_off_policy.sh
```

On-policy AlignDistil training example:
```shell
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_on_policy.sh
```

## Citation
If you find this repo helpful, please cite our paper:
```bibtex
@article{zhang2025aligndistil,
title={{AlignDistil}: Token-Level Language Model Alignment as Adaptive Policy Distillation},
author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},
journal={arXiv preprint arXiv:2503.02832},
year={2025}
}
```
