https://github.com/thudm/relaydiffusion
The official implementation of "Relay Diffusion: Unifying diffusion process across resolutions for image synthesis" [ICLR 2024 Spotlight]
https://github.com/thudm/relaydiffusion
diffusion-models generative-model image-synthesis machine-learning
Last synced: about 1 year ago
JSON representation
The official implementation of "Relay Diffusion: Unifying diffusion process across resolutions for image synthesis" [ICLR 2024 Spotlight]
- Host: GitHub
- URL: https://github.com/thudm/relaydiffusion
- Owner: THUDM
- License: apache-2.0
- Created: 2023-09-04T14:28:18.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-29T09:29:51.000Z (about 2 years ago)
- Last Synced: 2025-03-30T21:09:59.145Z (about 1 year ago)
- Topics: diffusion-models, generative-model, image-synthesis, machine-learning
- Language: Python
- Homepage:
- Size: 20.2 MB
- Stars: 296
- Watchers: 11
- Forks: 19
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
## Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Official Pytorch Implementation 🌐[[WiseModel]](https://www.wisemodel.cn/models/ZhipuAI/RelayDiffsuon/intro) 🌐[[Model Scope]](https://www.modelscope.cn/models/ZhipuAI/RelayDiffusion/summary)
🎉**News!** The paper of RelayDiffusion has been accepted by ICLR 2024 (**Spotlight**)!

We propose ***Relay Diffusion Model (RDM)*** as a better framework for diffusion generation. ***RDM*** transfers a low-resolution image or noise into an equivalent high-resolution one via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning.
RDM achieved **state-of-the-art** FID on CelebA-HQ and sFID ImageNet-256 (FID=1.87)!
For a formal introduction, Read our paper: [Relay Diffusion: Unifying diffusion process across resolutions for image synthesis](https://arxiv.org/abs/2309.03350).
## Setup
### Environment
Download the repo and setup the environment with:
```bash
git clone https://github.com/THUDM/RelayDiffusion.git
cd RelayDiffusion
conda env create -f environment.yml
conda activate rdm
```
We enable `xformers.ops.memory_efficient_attention` to reduce about 15% training cost. If there is no need you can also remove `xformers` from `environment.yml`.
Linux servers with Nvidia A100s are recommended. However, by setting smaller `--batch-gpu` (batch size on a single gpu), you can still run the inference and training scripts on less powerful GPUs.
### Dataset
We preprocess and implement datasets with the same format as [EDM](https://github.com/NVlabs/edm). For CelebA-HQ, follow [*Progressive Growing of GANs for Improved Quality, Stability, and Variation*](https://github.com/tkarras/progressive_growing_of_gans) to construct the high-quality subset of CelebA. For ImageNet, download data from the [official site](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description).
To convert the original data to organized data ready for training at $64\times 64$ or $256\times 256$ resolution, run command:
```bash
python dataset_tool.py \
--source=/path/to/original/data \
--dest=/path/to/output/data.zip \
--transform=center-crop \
--resolution=64x64 # or --resolution=256x256
```
## Inference & Evaluation
### Sample Generation
To generate samples from RDM models, run command:
```bash
torchrun --standalone --nproc_per_node=1 generate.py --sampler_stages=both --outdir=/path/to/output/dir/ \
--network_first=/path/to/1st/ckpt --network_second=/path/to/2nd/ckpt
```
To generate $N$ images, set `--seed=[K]-[K+N-1]` with a randomly-picked $K$. You can assign `--nproc_per_node=N` to enable parallel generation of multiple GPUs.
If you want to generate final samples from first-stage results (only use the second stage model), set `--sampler_stages=second` and assign input directory of first-stage results by `--indir`.
Besides, arguments for configurations of the first stage are:
- `num_steps_first`: number of sampling steps.
- `sigma_min_first` & `sigma_max_first`: lowest & highest noise level.
- `rho_first`: time step exponent.
- `cfg_scale_first`: scale of classifier-free guidance.
- `S_churn`: stochasticity strength.
- `S_min` & `S_max`: min & max noise level.
- `S_noise`: noise inflation.
Arguments for configurations of the second stage are:
- `num_steps_second`: number of sampling steps.
- `sigma_min_second` & `sigma_max_second`: lowest & highest noise level.
- `blur_sigma_max_second`: maximum sigma of blurring schedule.
- `rho_second`: time step exponent.
- `cfg_scale_second`: scale of classifier-free guidance.
- `up_scale_second`: scale of upsampling.
- `truncation_sigma_second` & `truncation_t_second`: truncation point of noise & time schedule.
- `s_block_second`: strength of block noise addition.
- `s_noise_second`: strength of stochasticity.
### Evaluation Metrics
We quantitatively measure the sample quality by metrics including **Fréchet inception distance (FID)**, **spatial FID (sFID)**, **Inception Score (IS)**, **Precision** and **Recall**. For sFID, IS, Precision and Recall, we reformat the calculation pipeline based on the formulation in `tensorflow` from [ADM](https://github.com/openai/guided-diffusion).
First, run the following command to generate activation data file from samples and dataset:
```bash
torchrun --standalone --nproc_per_node=1 evaluate.py activations --data=/sample/dir/ --dest=eval-refs/activations_sample.npz --batch=64 # build sample activations
torchrun --standalone --nproc_per_node=1 evaluate.py activations --data=/path/to/dataset.zip --dest=eval-refs/activations_ref.npz --batch=64 # build reference activations
```
Then calculate metrics based on pre-built activations, run command:
```bash
torchrun --standalone --nproc_per_node=1 evaluate.py calc --batch=64 \
--activations_sample=eval-refs/activations_sample.npz \
--activations_ref=eval-refs/activations_ref.npz \
[-m fid] [-m sfid] [-m is] [-m pr] \ # assign metrics to be calculated
```
### Performance Reproduction
RDM achieves competitive results in comparison with previous SoTA models:
| Dataset | Resolution | Training Samples | FID | sFID | IS | Precision | Recall |
| --------- | ---------- | ---------------- | :--: | :--: | :----: | :-------: | :----: |
| CelebA-HQ | 256x256 | 47M | 3.15 | - | - | 0.77 | 0.55 |
| ImageNet | 256x256 | 1250M | 1.87 | 3.97 | 278.75 | 0.81 | 0.59 |
We provide best pre-trained checkpoints of RDM and their sampler settings for reproducing performance:
- CelebA-HQ $256\times 256$:
Download checkpoints of [first stage](https://cloud.tsinghua.edu.cn/f/8e8e4b2743fe4447b497/?dl=1) and [second stage](https://cloud.tsinghua.edu.cn/f/b8cd559a0e9f4b9abd39/?dl=1), place them in `ckpts/`, generate samples and their activations by commands:
```bash
torchrun --standalone --nproc_per_node=8 generate_celebahq.py --outdir=generations/celebahq_samples/ \
--network_first=ckpts/celebahq_first_stage.pt \
--network_second=ckpts/celebahq_second_stage.pt
torchrun --standalone --nproc_per_node=1 evaluate.py activations \
--data=generations/celebahq_samples/ --dest=eval-refs/celebahq_act_sample.npz
```
Generate activation data from CelebA-HQ zip or download our version from [here](https://cloud.tsinghua.edu.cn/f/a26f714e36304c3e948d/?dl=1):
```bash
torchrun --standalone --nproc_per_node=1 evaluate.py activations \
--data=datasets/celebahq-256x256.zip --dest=eval-refs/celebahq_act_ref.npz
```
Calculate metrics by command:
```bash
python evaluate.py calc -m fid -m pr \
--activations_sample=eval-refs/celebahq_act_sample.npz \
--activations_ref=eval-refs/celebahq_act_ref.npz
```
- ImageNet $256\times 256$:
Download checkpoints of [first stage](https://cloud.tsinghua.edu.cn/f/c9a0ab6341704ed0be55/?dl=1) and [second stage](https://cloud.tsinghua.edu.cn/f/b5915d0b7d994e86b4bb/?dl=1), place them in `ckpts/`, generate samples and their activations by commands:
```bash
torchrun --standalone --nproc_per_node=8 generate_imagenet.py --outdir=generations/imagenet_samples/ \
--network_first=ckpts/imagenet_first_stage.pkl \
--network_second=ckpts/imagenet_second_stage.pt
torchrun --standalone --nproc_per_node=1 evaluate.py activations \
--data=generations/imagenet_samples/ --dest=eval-refs/imagenet_act_sample.npz
```
Generate activation data from ImageNet zip: (The activation data of 128w images is up to 40GB, which is too big to upload. We only upload mu and sigma of [reference](https://cloud.tsinghua.edu.cn/f/8b98f215ae3a4977978c/?dl=1) and [samples](https://cloud.tsinghua.edu.cn/f/2b0f6fc577744fa19f1b/?dl=1) for calculating FID here. For more metrics, you need to generate activation by yourself.)
```bash
torchrun --standalone --nproc_per_node=1 evaluate.py activations \
--data=datasets/imagenet-256x256.zip --dest=eval-refs/imagenet_act_ref.npz
```
Calculate FID, sFID and IS by command:
```bash
python evaluate.py calc -m fid -m sfid -m is \
--activations_sample=eval-refs/imagenet_act_sample.npz \
--activations_ref=eval-refs/imagenet_act_ref.npz
```
For the calculation of Precision and Recall on ImageNet, we follow [ADM](https://github.com/openai/guided-diffusion) to use 1w reference samples. You can download the activation data we produced from [here](https://cloud.tsinghua.edu.cn/f/924f9878ddc340bcb09c/?dl=1). Then run the following command:
```bash
python evaluate.py calc -m pr \
--activations_sample=eval-refs/imagenet_act_sample.npz \
--activations_ref=eval-refs/imagenet_act_1w_ref.npz
```
## Training
you can follow the instruction of [EDM](https://github.com/NVlabs/edm) to train a new model of the first stage (standard diffusion). Using ImageNet for example, run command:
```bash
torchrun --standalone --nproc_per_node=8 train.py --outdir=training-runs --data=datasets/imagenet-64x64.zip --eff-attn=True \
--cond=1 --batch=4096 --batch-gpu=32 --lr=1e-4 --ema=50 --dropout=0.1 --fp16=1 --ls=25 \
--arch=adm --precond=edm
```
If you want to train a second stage model (blurring diffusion), set argument `--precond=blur` and other arguments for the configuration of blurring diffusion. The command will be:
```bash
torchrun --standalone --nproc_per_node=8 train.py --outdir=training-runs --data=datasets/imagenet-256x256.zip --eff-attn=True \
--cond=1 --batch=4096 --batch-gpu=8 --lr=1e-4 --dropout=0.1 --fp16=1 --ls=1 \
--arch=adm --precond=blur --up-scale=4 --block-scale=0.15 --prob-length=0.93 --blur-sigma-max=3.0
```
As for CelebA-HQ, train a first stage model with:
```bash
torchrun --standalone --nproc_per_node=8 train.py --outdir=training-runs --data=datasets/CelebA-HQ-64x64.zip --eff-attn=True \
--cond=0 --batch=1024 --batch-gpu=32 --lr=1e-4 --dropout=0.15 --augment=0.2 --ls=1 \
--arch=adm --precond=edm
```
And for training a second stage model:
```bash
torchrun --standalone --nproc_per_node=8 train.py --outdir=training-runs --data=datasets/CelebA-HQ-256x256.zip --eff-attn=True \
--cond=0 --batch=1024 --batch-gpu=8 --lr=1e-4 --dropout=0.2 --augment=0.2 --fp16=1 --ls=1 \
--arch=adm --precond=blur --up-scale=4 --block-scale=0.15 --prob-length=0.89 --blur-sigma-max=2.0
```
## Citation
```
@article{teng2023relay,
title={Relay Diffusion: Unifying diffusion process across resolutions for image synthesis},
author={Teng, Jiayan and Zheng, Wendi and Ding, Ming and Hong, Wenyi and Wangni, Jianqiao and Yang, Zhuoyi and Tang, Jie},
journal={arXiv preprint arXiv:2309.03350},
year={2023}
}
```
## Acknowledgements
This implementation is based on https://github.com/NVlabs/edm (codebase of EDM). Thanks a lot!