
# [🚀 Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation](https://yhzhai.github.io/mcm/)

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing)

[Yuanhao Zhai](https://www.yhzhai.com/)<sup>1</sup>, [Kevin Lin](https://sites.google.com/site/kevinlin311tw/)<sup>2</sup>, [Zhengyuan Yang](https://zyang-ur.github.io)<sup>2</sup>, [Linjie Li](https://scholar.google.com/citations?hl=en&user=WR875gYAAAAJ)<sup>2</sup>, [Jianfeng Wang](http://jianfengwang.me)<sup>2</sup>, [Chung-Ching Lin](https://scholar.google.com/citations?hl=en&user=legkbM0AAAAJ)<sup>2</sup>, [David Doermann](https://cse.buffalo.edu/~doermann/)<sup>1</sup>, [Junsong Yuan](https://cse.buffalo.edu/~jsyuan/)<sup>1</sup>, [Lijuan Wang](https://scholar.google.com/citations?hl=en&user=cDcWXuIAAAAJ)<sup>2</sup>

**<sup>1</sup>State University of New York at Buffalo   |   <sup>2</sup>Microsoft**

**NeurIPS 2024**

**TL;DR**: Our motion consistency model not only accelerates the sampling process of text-to-video diffusion models, but can also leverage an additional high-quality image dataset to improve the frame quality of generated videos.

![Our motion consistency model not only distills the motion prior from the teacher to accelerate sampling, but can also leverage an additional high-quality image dataset to improve the frame quality of generated videos.](static/images/illustration.png)

## 🔥 News

**[09/2024]** MCM was accepted to NeurIPS 2024!

**[07/2024]** Released the learnable head parameters at [this box link](https://buffalo.box.com/s/cnc9oltyerlk1id0xis0hgqrd1fz3clc).

**[06/2024]** Our MCM achieves strong performance (using 4 sampling steps) on the [ChronoMagic-Bench](https://pku-yuangroup.github.io/ChronoMagic-Bench/)! Check out the leaderboard [here](https://huggingface.co/spaces/BestWishYsh/ChronoMagic-Bench).

**[06/2024]** Released the training code, [pre-trained checkpoint](https://huggingface.co/yhzhai/mcm), [Gradio demo](https://huggingface.co/spaces/yhzhai/mcm), and [Colab demo](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing).

**[06/2024]** Released the [paper](https://arxiv.org/abs/2406.06890) and [project page](https://yhzhai.github.io/mcm/).

## Contents

- [Getting started](#getting-started)
  - [Environment setup](#environment-setup)
  - [Data preparation](#data-preparation)
  - [DINOv2 and CLIP checkpoint download](#dinov2-and-clip-checkpoint-download)
  - [Wandb integration](#wandb-integration)
- [Training](#training)
- [Inference](#inference)
- [MCM weights](#mcm-weights)
- [Acknowledgement](#acknowledgement)
- [Citation](#citation)

## Getting started

### Environment setup

Instead of installing [diffusers](https://github.com/huggingface/diffusers), [peft](https://github.com/huggingface/peft), and [open_clip](https://github.com/mlfoundations/open_clip) from the official repos, we use our modified versions specified in the requirements.txt file.
This is particularly important for diffusers and open_clip: the former currently offers limited support for loading LoRA weights into video diffusion models, and the latter has a distributed training dependency.

To set up the environment, run the following commands:

```shell
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 # adjust the CUDA version to match your environment
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
```
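As a quick sanity check after installation (this snippet is our own addition, not part of the repo), you can verify that the pinned torch build matches the CUDA wheels above and can see your GPU:

```python
# Sanity check for the environment installed above.
import torch

print(torch.__version__)          # expect 2.1.2
print(torch.version.cuda)         # expect 11.8 for the cu118 wheels (or your chosen version)
print(torch.cuda.is_available())  # should print True on a GPU machine
```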

### Data preparation

Please prepare the video dataset, and optionally an image dataset, in the [webdataset](https://github.com/webdataset/webdataset) format.

Specifically, wrap the video/image files and their corresponding .json metadata into .tar files.
Here is an example structure of a video .tar file:

```
.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4
```

The .json files contain video/image captions in key-value pairs, for example: `{"caption": "World map in gray - world map with animated circles and binary numbers"}`.
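For illustration, the sketch below packs such video/caption pairs into a shard using webdataset's `TarWriter`. The file names, caption, and output shard name are placeholders, not conventions required by this repo:

```python
# Minimal sketch: pack (video, caption) pairs into a webdataset-style .tar shard.
import json
from pathlib import Path

import webdataset as wds

samples = [
    ("video_0.mp4", "World map in gray - world map with animated circles and binary numbers"),
    # ... more (video path, caption) pairs
]

with wds.TarWriter("shard-000000.tar") as writer:
    for path, caption in samples:
        writer.write({
            "__key__": Path(path).stem,  # shared key pairs video_0.mp4 with video_0.json
            "mp4": Path(path).read_bytes(),
            "json": json.dumps({"caption": caption}).encode("utf-8"),
        })
```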

We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon).
Due to dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.

### DINOv2 and CLIP checkpoint download

We provide the script `scripts/download.py` to download the DINOv2 and CLIP checkpoints.

```shell
python scripts/download.py
```

### Wandb integration

Please set your wandb API key in `utils/wandb.py` to enable wandb logging.
If you do not use wandb, please remove `wandb` from the `--report_to` argument in the training command.
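For reference, programmatic wandb authentication usually amounts to a single `wandb.login` call, as in the sketch below; the actual hook in `utils/wandb.py` may be organized differently:

```python
# Sketch of programmatic wandb authentication; the layout of utils/wandb.py may differ.
import wandb

wandb.login(key="your-api-key-here")  # find your key at https://wandb.ai/authorize
```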

## Training

We leverage [accelerate](https://github.com/huggingface/accelerate) for distributed training, and we support two base text-to-video diffusion models: [ModelScopeT2V](https://arxiv.org/abs/2308.06571) and [AnimateDiff](https://arxiv.org/abs/2307.04725). For both models, we train LoRA weights instead of fine-tuning all parameters.

### ModelScopeT2V

For ModelScopeT2V, our code supports both pure video diffusion distillation training and frame quality improvement training.

By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the `--train_batch_size` argument accordingly for different GPU memory sizes.

Before running the scripts, please modify the data path in the environment variables defined at the top of each script.

**Diffusion distillation**

We provide the training script in `scripts/modelscopet2v_distillation.sh`.

```shell
bash scripts/modelscopet2v_distillation.sh
```

**Frame quality improvement**

We provide the training script in `scripts/modelscopet2v_improvement.sh`. Before running, please assign the `IMAGE_DATA_PATH` in the script.

```shell
bash scripts/modelscopet2v_improvement.sh
```

### AnimateDiff

Due to its higher resolution requirement, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.

We provide the diffusion distillation training script in `scripts/animatediff_distillation.sh`.

```shell
bash scripts/animatediff_distillation.sh
```

## Inference

We provide our pre-trained checkpoint [here](https://huggingface.co/yhzhai/mcm), a Gradio demo [here](https://huggingface.co/spaces/yhzhai/mcm), and a Colab demo [here](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing). `demo.py` showcases how to run our MCM on a local machine.
Feel free to try out our MCM!
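For reference, a few-step sampling call with the ModelScopeT2V base model might look like the sketch below. It assumes the modified diffusers pinned in `requirements.txt`; the LoRA-loading call and generation arguments are our assumptions, so treat `demo.py` as the authoritative version:

```python
# Hedged sketch of 4-step MCM sampling on the ModelScopeT2V base model.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
# Consistency-model sampling uses the LCM scheduler with very few steps.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("yhzhai/mcm")  # assumed entry point; see demo.py for the exact call
pipe.to("cuda")

result = pipe(
    "A spaceship flying through a colorful nebula",
    num_frames=16,
    num_inference_steps=4,  # MCM targets few-step (e.g. 4-step) sampling
    guidance_scale=1.0,
)
frames = result.frames  # can be written out with diffusers.utils.export_to_video
```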

## MCM weights

We provide our pre-trained checkpoint [here](https://huggingface.co/yhzhai/mcm).

For research/debugging purposes, we also provide intermediate parameters and states at [this box link](https://buffalo.box.com/s/cnc9oltyerlk1id0xis0hgqrd1fz3clc). The folder (~1.12GB) includes the model weights, discriminator weights, scheduler states, optimizer states, and learnable head weights.

## Acknowledgement

Some of our implementations are borrowed from the great repos below.

1. [Diffusers](https://github.com/huggingface/diffusers)
2. [StyleGAN-T](https://github.com/autonomousvision/stylegan-t)
3. [GMFlow](https://github.com/haofeixu/gmflow)

## Citation

```bibtex
@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}
```