{"id":17714291,"url":"https://yhzhai.github.io/mcm/","last_synced_at":"2025-03-13T22:32:19.998Z","repository":{"id":243553874,"uuid":"804631163","full_name":"yhZhai/mcm","owner":"yhZhai","description":"[NeurIPS 2024] Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation","archived":false,"fork":false,"pushed_at":"2024-10-24T00:47:55.000Z","size":33515,"stargazers_count":45,"open_issues_count":0,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-10-24T15:24:38.494Z","etag":null,"topics":["aigc","consistency-models","fast-sampling","text-to-video","video-diffusion","video-diffusion-model","video-generation"],"latest_commit_sha":null,"homepage":"https://yhzhai.github.io/mcm/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yhZhai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-23T00:49:43.000Z","updated_at":"2024-10-21T13:21:56.000Z","dependencies_parsed_at":"2024-07-09T17:14:08.530Z","dependency_job_id":"4be33eb1-733a-4000-be5d-6b36feb871a2","html_url":"https://github.com/yhZhai/mcm","commit_stats":null,"previous_names":["yhzhai/mcm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yhZhai%2Fmcm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yhZhai%2Fmcm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yhZhai%2Fmcm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/yhZhai%2Fmcm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yhZhai","download_url":"https://codeload.github.com/yhZhai/mcm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243494493,"owners_count":20299823,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aigc","consistency-models","fast-sampling","text-to-video","video-diffusion","video-diffusion-model","video-generation"],"created_at":"2024-10-25T11:02:20.545Z","updated_at":"2025-03-13T22:32:19.993Z","avatar_url":"https://github.com/yhZhai.png","language":"Python","funding_links":[],"categories":["Accelerate"],"sub_categories":[],"readme":"# [🚀 Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation](https://yhzhai.github.io/mcm/)\n\n\u003ca href='https://yhzhai.github.io/mcm/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e \u003ca href='https://arxiv.org/abs/2406.06890'\u003e\u003cimg src='https://img.shields.io/badge/Paper-arXiv-red'\u003e\u003c/a\u003e \u003ca href='https://huggingface.co/yhzhai/mcm'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-checkpoint-yellow'\u003e\u003c/a\u003e \u003ca href='https://huggingface.co/spaces/yhzhai/mcm'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-demo-yellow'\u003e\u003c/a\u003e 
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing)\n\n[Yuanhao Zhai](https://www.yhzhai.com/)\u003csup\u003e1\u003c/sup\u003e, [Kevin Lin](https://sites.google.com/site/kevinlin311tw/)\u003csup\u003e2\u003c/sup\u003e, [Zhengyuan Yang](https://zyang-ur.github.io)\u003csup\u003e2\u003c/sup\u003e, [Linjie Li](https://scholar.google.com/citations?hl=en\u0026user=WR875gYAAAAJ)\u003csup\u003e2\u003c/sup\u003e, [Jianfeng Wang](http://jianfengwang.me)\u003csup\u003e2\u003c/sup\u003e, [Chung-Ching Lin](https://scholar.google.com/citations?hl=en\u0026user=legkbM0AAAAJ)\u003csup\u003e2\u003c/sup\u003e, [David Doermann](https://cse.buffalo.edu/~doermann/)\u003csup\u003e1\u003c/sup\u003e, [Junsong Yuan](https://cse.buffalo.edu/~jsyuan/)\u003csup\u003e1\u003c/sup\u003e, [Lijuan Wang](https://scholar.google.com/citations?hl=en\u0026user=cDcWXuIAAAAJ)\u003csup\u003e2\u003c/sup\u003e\n\n**\u003csup\u003e1\u003c/sup\u003eState University of New York at Buffalo  \u0026nbsp; | \u0026nbsp;  \u003csup\u003e2\u003c/sup\u003eMicrosoft**\n\n**NeurIPS 2024**\n\n**TL;DR**: Our motion consistency model not only accelerates the sampling process of text2video diffusion models, but can also benefit from an additional high-quality image dataset to improve the frame quality of generated videos.\n\n![Our motion consistency model not only distills the motion prior from the teacher to accelerate sampling, but can also benefit from an additional high-quality image dataset to improve the frame quality of generated videos.](static/images/illustration.png)\n\n\u003c!-- **All training, inference, and evaluation code, as well as model checkpoints will be released in the coming two weeks. 
Please stay tuned!** --\u003e\n\n## 🔥 News\n\n**[09/2024]** MCM was accepted to NeurIPS 2024!\n\n**[07/2024]** Released the learnable head parameters at [this box link](https://buffalo.box.com/s/cnc9oltyerlk1id0xis0hgqrd1fz3clc).\n\n**[06/2024]** Our MCM achieves strong performance (using 4 sampling steps) on the [ChronoMagic-Bench](https://pku-yuangroup.github.io/ChronoMagic-Bench/)! Check out the leaderboard [here](https://huggingface.co/spaces/BestWishYsh/ChronoMagic-Bench).\n\n**[06/2024]** Training code, [pre-trained checkpoint](https://huggingface.co/yhzhai/mcm), [Gradio demo](https://huggingface.co/spaces/yhzhai/mcm), and [Colab demo](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing) released.\n\n**[06/2024]** [Paper](https://arxiv.org/abs/2406.06890) and [project page](https://yhzhai.github.io/mcm/) released.\n\n## Contents\n\n  - [Getting started](#-getting-started-)\n    - [Environment setup](#environment-setup-)\n    - [Data preparation](#data-preparation-)\n    - [DINOv2 and CLIP checkpoint download](#download)\n    - [Wandb integration](#wandb-)\n  - [Training](#training-)\n  - [Inference](#inference-)\n  - [MCM weights](#mcm-weights-)\n  - [Acknowledgement](#acknowledgement-)\n  - [Citation](#citation-)\n\n## Getting started \u003ca name=\"getting-started\"\u003e\u003c/a\u003e\n\n### Environment setup \u003ca name=\"env-setup\"\u003e\u003c/a\u003e\n\nInstead of installing [diffusers](https://github.com/huggingface/diffusers), [peft](https://github.com/huggingface/peft), and [open_clip](https://github.com/mlfoundations/open_clip) from the official repos, we use our modified versions specified in the `requirements.txt` file.\nThis is particularly important for diffusers and open_clip: the former currently has limited support for loading video diffusion model LoRA weights, and the latter has a distributed training dependency.\n\nTo set up the environment, run the following commands:\n\n```shell\npip install torch==2.1.2 
torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118  # please modify the cuda version according to your env \npip install -r requirements.txt\npip install scipy==1.11.1\npip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl\n```\n\n\n\n### Data preparation \u003ca name=\"data\"\u003e\u003c/a\u003e\n\nPlease prepare the video and optional image datasets in the [webdataset](https://github.com/webdataset/webdataset) format.\n\nSpecifically, please wrap the video/image files and their corresponding .json metadata files into .tar files.\nHere is an example structure of the video .tar file:\n\n```\n.\n├── video_0.json\n├── video_0.mp4\n...\n├── video_n.json\n└── video_n.mp4\n```\n\nThe .json files contain video/image captions in key-value pairs, for example: `{\"caption\": \"World map in gray - world map with animated circles and binary numbers\"}`.\n\nWe provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon).\nDue to the dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.\n\n### DINOv2 and CLIP checkpoint download \u003ca name=\"download\"\u003e\u003c/a\u003e\n\nWe provide a script `scripts/download.py` to download the DINOv2 and CLIP checkpoints.\n\n```shell\npython scripts/download.py\n```\n\n### Wandb integration \u003ca name=\"wandb\"\u003e\u003c/a\u003e\n\nPlease input your wandb API key in `utils/wandb.py` to enable wandb logging.\nIf you do not use wandb, please remove `wandb` from the `--report_to` argument in the training command.\n\n\n## Training \u003ca name=\"train\"\u003e\u003c/a\u003e\n\n\nWe leverage [accelerate](https://github.com/huggingface/accelerate) for distributed training, and we support two different base text2video diffusion models: [ModelScopeT2V](https://arxiv.org/abs/2308.06571) and [AnimateDiff](https://arxiv.org/abs/2307.04725). 
For both models, we train LoRA instead of fine-tuning all parameters.\n\n### ModelScopeT2V\n\nFor ModelScopeT2V, our code supports pure video diffusion distillation training, and frame quality improvement training.\n\nBy default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the `--train_batch_size` argument accordingly for different GPU memory sizes.\n\nBefore running the scripts, please modify the data path in the environment variables defined at the top of each script.\n\n**Diffusion distillation**\n\nWe provide the training script in `scripts/modelscopet2v_distillation.sh`.\n\n```shell\nbash scripts/modelscopet2v_distillation.sh\n```\n\n**Frame quality improvement**\n\nWe provide the training script in `scripts/modelscopet2v_improvement.sh`. Before running, please set `IMAGE_DATA_PATH` in the script.\n\n```shell\nbash scripts/modelscopet2v_improvement.sh\n```\n\n### AnimateDiff\n\nDue to the higher resolution requirement, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.\n\nWe provide the diffusion distillation training script in `scripts/animatediff_distillation.sh`.\n\n```shell\nbash scripts/animatediff_distillation.sh\n```\n\n## Inference \u003ca name=\"infer\"\u003e\u003c/a\u003e\n\nWe provide our pre-trained checkpoint [here](https://huggingface.co/yhzhai/mcm), Gradio demo [here](https://huggingface.co/spaces/yhzhai/mcm), and Colab demo [here](https://colab.research.google.com/drive/1ouGbIZA5092hF9ZMHO-AchCr_L3algTL?usp=sharing). 
`demo.py` showcases how to run our MCM on a local machine.\nFeel free to try out our MCM!\n\n## MCM weights \u003ca name=\"weight\"\u003e\u003c/a\u003e\n\nWe provide our pre-trained checkpoint [here](https://huggingface.co/yhzhai/mcm).\n\nFor research and debugging purposes, we also provide intermediate parameters and states at [this box link](https://buffalo.box.com/s/cnc9oltyerlk1id0xis0hgqrd1fz3clc). The folder (~1.12GB) includes the model weights, discriminator weights, scheduler states, optimizer states, and learnable head weights.\n\n## Acknowledgement \u003ca name=\"ack\"\u003e\u003c/a\u003e\n\nSome of our implementations are borrowed from the great repos below.\n\n1. [Diffusers](https://github.com/huggingface/diffusers)\n2. [StyleGAN-T](https://github.com/autonomousvision/stylegan-t)\n3. [GMFlow](https://github.com/haofeixu/gmflow)\n\n## Citation \u003ca name=\"cite\"\u003e\u003c/a\u003e\n\n```\n@article{zhai2024motion,\n  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled\n  Motion-Appearance Distillation},\n  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},\n  year={2024},\n  journal={arXiv preprint arXiv:2406.06890},\n  website={https://yhzhai.github.io/mcm/},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/yhzhai.github.io%2Fmcm%2F","html_url":"https://awesome.ecosyste.ms/projects/yhzhai.github.io%2Fmcm%2F","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/yhzhai.github.io%2Fmcm%2F/lists"}