https://github.com/yifan123/flow_grpo
An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
- Host: GitHub
- URL: https://github.com/yifan123/flow_grpo
- Owner: yifan123
- License: MIT
- Created: 2025-05-08T13:07:56.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-05-09T11:39:15.000Z (12 months ago)
- Last Synced: 2025-05-09T12:31:07.946Z (12 months ago)
- Language: Python
- Size: 6.8 MB
- Stars: 65
- Watchers: 6
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - yifan123/flow_grpo - Flow-GRPO is an official implementation of training Flow Matching models via online reinforcement learning. The project proposes a novel training method that optimizes Flow Matching models through online RL, aiming to improve the performance of generative models. Its core idea is to cast the training of a Flow Matching model as a reinforcement learning problem and solve it with the GRPO (Group Relative Policy Optimization) algorithm. The codebase includes training scripts, model definitions, and evaluation tools, making it easy for researchers to reproduce the results and conduct further research. The method can be applied to a variety of generation tasks, such as image generation and text generation, and the project provides detailed experimental settings and parameter configurations for customized training. Flow-GRPO aims to address the limitations of conventional Flow Matching training methods and to explore the potential of reinforcement learning in training generative models, offering researchers in the field a new perspective and toolset. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Flow-GRPO
This is an official implementation of Flow-GRPO: Training Flow Matching Models via Online RL.
## 🔔 News
**[Update]** We showcase image examples from three tasks and their training evolution at https://gongyeliu.github.io/Flow-GRPO. Check them out!
**[Update]** We now provide an online demo for all three tasks at https://huggingface.co/spaces/jieliu/SD3.5-M-Flow-GRPO. You're welcome to try it out!
## ✅ TODO
- [x] Provide a Gradio demo
- [x] Provide a web demo showcasing a wide range of generation examples for GenEval, OCR, and PickScore.
- [x] Provide a web visualization of image evolution during training for all three tasks.
## Model
| Task | Model |
| -------- | -------- |
| GenEval | [🤗GenEval](https://huggingface.co/jieliu/SD3.5M-FlowGRPO-GenEval) |
| Text Rendering | [🤗Text](https://huggingface.co/jieliu/SD3.5M-FlowGRPO-Text) |
| Human Preference Alignment | [🤗PickScore](https://huggingface.co/jieliu/SD3.5M-FlowGRPO-PickScore) |
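If these checkpoints are published as full SD3.5-Medium pipelines (an assumption on our part; they could also ship as adapter/LoRA weights), they can be loaded with `diffusers` roughly as follows:
```python
import torch
from diffusers import StableDiffusion3Pipeline

# Assumes the Hugging Face repo contains a complete SD3.5-Medium pipeline;
# adjust the loading code if the checkpoint ships as LoRA/adapter weights.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "jieliu/SD3.5M-FlowGRPO-GenEval", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe("a photo of three red apples on a wooden table").images[0]
image.save("sample.png")
```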
## Installation
```bash
git clone https://github.com/yifan123/flow_grpo.git
cd flow_grpo
conda create -n flow_grpo python=3.10.16
conda activate flow_grpo
pip install -e .
```
## Reward
The steps above install only the current repository. RL training, however, requires several reward models, some of which depend on older pre-trained models; placing all of them in a single Conda environment without version conflicts is difficult. Therefore, following the ddpo-pytorch implementation, we serve some rewards from a remote server.
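To illustrate the remote-server pattern, here is a minimal client sketch. The URL, route, and payload format are illustrative assumptions, not the actual API exposed by the reward server in this repo:
```python
import base64
import io

import requests
from PIL import Image

def remote_reward(image: Image.Image, prompt: str,
                  url: str = "http://localhost:8085/score") -> float:
    """Send an image/prompt pair to a hypothetical reward server for scoring."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    payload = {
        "image": base64.b64encode(buf.getvalue()).decode("utf-8"),
        "prompt": prompt,
    }
    # The reward model runs in its own environment behind this endpoint,
    # so its dependencies never have to coexist with the training env.
    return requests.post(url, json=payload, timeout=60).json()["reward"]
```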
### OCR
Please install PaddleOCR:
```bash
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein
```
Then pre-download the model weights from a Python shell:
```python
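# Instantiating PaddleOCR once downloads and caches the detection/recognition weights.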
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
```
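As a sketch of how an OCR reward can be computed (illustrative, not necessarily this repo's exact scoring rule), one can compare the recognized text against the target string with a normalized Levenshtein distance:
```python
import Levenshtein
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

def ocr_reward(image_path: str, target_text: str) -> float:
    """Reward = 1 - normalized edit distance between OCR output and target."""
    result = ocr.ocr(image_path, cls=False)
    # One entry per input image; an entry is None when nothing is detected.
    recognized = " ".join(
        line[1][0] for page in (result or []) if page for line in page
    )
    dist = Levenshtein.distance(recognized.lower(), target_text.lower())
    return max(0.0, 1.0 - dist / max(len(target_text), 1))
```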
### GenEval
Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in [reward-server](https://github.com/yifan123/reward-server).
## Usage
Single-node training:
```bash
bash scripts/single_node/main.sh
```
Multi-node training:
```bash
# Master node
bash scripts/multi_node/main.sh
# Other nodes
bash scripts/multi_node/main1.sh
bash scripts/multi_node/main2.sh
```
### Multi Reward Training
For multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight.
For example:
```python
{
"pickscore": 0.5,
"ocr": 0.2,
"aesthetic": 0.3
}
```
This means the final reward is a weighted sum of the individual rewards.
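A minimal sketch of that combination (the function name is ours; the repo's actual plumbing may differ):
```python
weights = {"pickscore": 0.5, "ocr": 0.2, "aesthetic": 0.3}

def combined_reward(rewards: dict) -> float:
    # Weighted sum of the individual reward scores, keyed by reward name.
    return sum(weights[name] * rewards[name] for name in weights)

print(combined_reward({"pickscore": 0.8, "ocr": 1.0, "aesthetic": 0.6}))  # 0.78
```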
The following reward models are currently supported:
* **GenEval** evaluates T2I models on complex compositional prompts.
* **OCR** provides an OCR-based reward.
* **PickScore** is a general-purpose T2I reward model trained on human preferences.
* **[DeQA](https://github.com/zhiyuanyou/DeQA-Score)** is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.
* **ImageReward** is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.
* **QwenVL** is an experimental reward model using prompt engineering.
* **Aesthetic** is a CLIP-based linear regressor predicting image aesthetic scores.
* **JPEG\_Compressibility** measures the JPEG-compressed file size of an image as a simple proxy reward.
* **UnifiedReward** is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.
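For reference, the public PickScore checkpoint can be queried through `transformers` roughly as follows; this mirrors the standard usage of the released model and is not necessarily how this repo wires it into training:
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

@torch.no_grad()
def pick_score(prompt: str, image: Image.Image) -> float:
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt")
    img = model.get_image_features(**image_inputs)
    txt = model.get_text_features(**text_inputs)
    # CLIP-style scaled cosine similarity, as in the PickScore reference code.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (model.logit_scale.exp() * (txt @ img.T)).item()
```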
## Important Hyperparameters
You can adjust the parameters in `config/dgx.py` to tune the hyperparameters. An empirical finding is that setting `config.sample.train_batch_size * num_gpu / config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48` works well, i.e., `group_number=48` with `group_size=24`.
Additionally, setting `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch // 2` also yields good performance.
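To make these relationships concrete, here is one assignment that satisfies both (the specific numbers are illustrative; only the constraints matter):
```python
# Illustrative values; the field names follow config/dgx.py as described above.
num_gpu = 8
train_batch_size = 9        # config.sample.train_batch_size (per GPU)
num_image_per_prompt = 24   # config.sample.num_image_per_prompt (group_size)
num_batches_per_epoch = 16  # config.sample.num_batches_per_epoch

group_number = train_batch_size * num_gpu // num_image_per_prompt * num_batches_per_epoch
assert group_number == 48   # 9 * 8 / 24 * 16 = 48

gradient_accumulation_steps = num_batches_per_epoch // 2  # = 8
```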
## Acknowledgement
This repo is based on [ddpo-pytorch](https://github.com/kvablack/ddpo-pytorch) and [diffusers](https://github.com/huggingface/diffusers). We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent *ddpo-pytorch* repo.
## Citation
```
@misc{liu2025flowgrpo,
title={Flow-GRPO: Training Flow Matching Models via Online RL},
author={Jie Liu and Gongye Liu and Jiajun Liang and Yangguang Li and Jiaheng Liu and Xintao Wang and Pengfei Wan and Di Zhang and Wanli Ouyang},
year={2025},
eprint={2505.05470},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05470},
}
```