{"id":13488901,"url":"https://github.com/mihirp1998/VADER","last_synced_at":"2025-03-28T02:31:25.321Z","repository":{"id":247523700,"uuid":"819132711","full_name":"mihirp1998/VADER","owner":"mihirp1998","description":"Video Diffusion Alignment via Reward Gradients. We improve a variety of video diffusion models such as VideoCrafter, OpenSora, ModelScope and StableVideoDiffusion by finetuning them using various reward models such as HPS, PickScore, VideoMAE, VJEPA, YOLO, Aesthetics etc. ","archived":false,"fork":false,"pushed_at":"2024-08-19T06:13:20.000Z","size":171419,"stargazers_count":220,"open_issues_count":9,"forks_count":14,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-12-14T00:27:23.214Z","etag":null,"topics":["alignment","diffusion","reinforcement-learning","reinforcement-learning-human-feedback","rl","rlhf","vader","video-diffusion","video-diffusion-alignment"],"latest_commit_sha":null,"homepage":"https://vader-vid.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mihirp1998.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-23T21:44:57.000Z","updated_at":"2024-12-12T17:05:20.000Z","dependencies_parsed_at":"2024-10-31T01:40:58.402Z","dependency_job_id":null,"html_url":"https://github.com/mihirp1998/VADER","commit_stats":null,"previous_names":["mihirp1998/vader"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mihirp1998%2FVADER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/mihirp1998%2FVADER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mihirp1998%2FVADER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mihirp1998%2FVADER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mihirp1998","download_url":"https://codeload.github.com/mihirp1998/VADER/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245957681,"owners_count":20700316,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","diffusion","reinforcement-learning","reinforcement-learning-human-feedback","rl","rlhf","vader","video-diffusion","video-diffusion-alignment"],"created_at":"2024-07-31T18:01:23.841Z","updated_at":"2025-03-28T02:31:20.294Z","avatar_url":"https://github.com/mihirp1998.png","language":"Python","funding_links":[],"categories":["Video Generation","Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003c!-- TITLE --\u003e\n# **Video Diffusion Alignment via Reward Gradients**\n![VADER](assets/vader_method.png)\n\n[![arXiv](https://img.shields.io/badge/cs.CV-arXiv:2407.08737-b31b1b.svg)](https://arxiv.org/abs/2407.08737)\n[![Website](https://img.shields.io/badge/🌎-Website-blue.svg)](http://vader-vid.github.io)\n[![Demo](https://img.shields.io/badge/%F0%9F%A4%97-Demo-yellow)](https://huggingface.co/spaces/zheyangqin/VADER)\n\u003c/div\u003e\n\nThis is the official implementation of our paper [Video Diffusion Alignment via Reward Gradients](https://vader-vid.github.io/) by 
\n\nMihir Prabhudesai*, Russell Mendonca*, Zheyang Qin*, Katerina Fragkiadaki, Deepak Pathak.\n\n\n\u003c!-- DESCRIPTION --\u003e\n## Abstract\nWe have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt these models to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These models contain dense gradient information with respect to generated RGB pixels, which is critical to be able to learn efficiently in complex search spaces, such as videos. We show that our approach can enable alignment of video diffusion for aesthetic generations, similarity between text context and video, as well as long horizon video generations that are 3X longer than the training sequence length. 
We show our approach can learn much more efficiently in terms of reward queries and compute than previous gradient-free approaches for video generation.\n\n\n## Features\n- [x] Adaptation of VideoCrafter2 Text-to-Video Model\n- [x] Adaptation of Open-Sora V1.2 Text-to-Video Model\n- [x] Adaptation of ModelScope Text-to-Video Model\n- [ ] Adaptation of Stable Video Diffusion Image2Video Model\n- [ ] Movie generation code\n\n\n## Demo\n|         |          |       |\n| --- | --- | --- |\n| \u003cimg src=\"assets/videos/8.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/5.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/7.gif\" width=\"\"\u003e |\n| \u003cimg src=\"assets/videos/10.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/3.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/4.gif\" width=\"\"\u003e |\n| \u003cimg src=\"assets/videos/9.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/1.gif\" width=\"\"\u003e | \u003cimg src=\"assets/videos/11.gif\" width=\"\"\u003e |\n\n\n\n\n## 🌟 VADER-VideoCrafter\n\nWe **highly recommend** starting with the VADER-VideoCrafter model, which performs better.\n\n### ⚙️ Installation\nAssuming you are in the `VADER/` directory, you can create a Conda environment for VADER-VideoCrafter with the following commands:\n```bash\ncd VADER-VideoCrafter\nconda create -n vader_videocrafter python=3.10\nconda activate vader_videocrafter\nconda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia\nconda 
install xformers -c xformers\npip install -r requirements.txt\ngit clone https://github.com/tgxs002/HPSv2.git\ncd HPSv2/\npip install -e .\ncd ..\n```\n\n\n- We use the pretrained text-to-video [VideoCrafter2](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt) model from Hugging Face. If the model is not downloaded automatically when you run the inference or training script, download it manually and place it at `VADER/VADER-VideoCrafter/checkpoints/base_512_v2/model.ckpt`.\n\n- We provide pretrained LoRA weights on [HuggingFace](https://huggingface.co/papers/2407.08737). [`vader_videocrafter_pickscore.pt`](https://huggingface.co/zheyangqin/VADER_VideoCrafter_PickScore) was fine-tuned with the PickScore reward function on chatgpt_custom_animal.txt with a LoRA rank of 16, while [`vader_videocrafter_hps_aesthetic.pt`](https://huggingface.co/zheyangqin/VADER_VideoCrafter_HPS_Aesthetic) was fine-tuned with a combination of the HPSv2.1 and Aesthetic reward functions on chatgpt_custom_instruments.txt with a LoRA rank of 8.\n\n\n### 📺 Inference\nPlease run `accelerate config` as the first step to configure accelerator settings. If you are not familiar with the accelerator configuration, you can refer to the VADER-VideoCrafter [documentation](documentation/VADER-VideoCrafter.md).\n\nAssuming you are in the `VADER/` directory, you can run inference with the following commands:\n```bash\ncd VADER-VideoCrafter\nsh scripts/run_text2video_inference.sh\n```\n- We have tested on PyTorch 2.3.0 and CUDA 12.1. The inference script works on a single GPU with 16 GB of VRAM when `val_batch_size=1` and `fp16` mixed precision are used. It should also work with recent PyTorch and CUDA versions.\n- `VADER/VADER-VideoCrafter/scripts/main/train_t2v_lora.py` is the script for running inference with VideoCrafter2 using VADER via LoRA.\n    - Most of the arguments are the same as in the training process. 
The main difference is that `--inference_only` should be set to `True`.\n    - `--lora_ckpt_path` must be set to the path of the pretrained LoRA model. In particular, if `lora_ckpt_path` is set to `'huggingface-pickscore'` or `'huggingface-hps-aesthetic'`, the script downloads the pretrained LoRA model from the corresponding HuggingFace model hub, [VADER_VideoCrafter_PickScore](https://huggingface.co/zheyangqin/VADER_VideoCrafter_PickScore) or [VADER_VideoCrafter_HPS_Aesthetic](https://huggingface.co/zheyangqin/VADER_VideoCrafter_HPS_Aesthetic). Otherwise, it loads the pretrained LoRA model from the path you provide. If you do not provide a `lora_ckpt_path`, the original VideoCrafter2 model is used for inference. Note that if you use `'huggingface-pickscore'` you need to set `--lora_rank 16`, whereas if you use `'huggingface-hps-aesthetic'` you need to set `--lora_rank 8`.\n\n### 🔧 Training\nPlease run `accelerate config` as the first step to configure accelerator settings. If you are not familiar with the accelerator configuration, you can refer to the VADER-VideoCrafter [documentation](documentation/VADER-VideoCrafter.md).\n\nAssuming you are in the `VADER/` directory, you can train the model with the following commands:\n\n```bash\ncd VADER-VideoCrafter\nsh scripts/run_text2video_train.sh\n```\n- Our experiments were conducted on PyTorch 2.3.0 and CUDA 12.1 using 4 A6000 GPUs (48 GB of VRAM each). It should also work with recent PyTorch and CUDA versions. 
The training script has been tested on a single GPU with 16 GB of VRAM, with `train_batch_size=1 val_batch_size=1` and `fp16` mixed precision.\n- `VADER/VADER-VideoCrafter/scripts/main/train_t2v_lora.py` is also the script for fine-tuning VideoCrafter2 using VADER via LoRA.\n    - You can read the VADER-VideoCrafter [documentation](documentation/VADER-VideoCrafter.md) to understand the usage of arguments.\n\n\n## 🎬 VADER-Open-Sora\n### ⚙️ Installation\nAssuming you are in the `VADER/` directory, you can create a Conda environment for VADER-Open-Sora with the following commands:\n```bash\ncd VADER-Open-Sora\nconda create -n vader_opensora python=3.10\nconda activate vader_opensora\nconda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia\nconda install xformers -c xformers\npip install -v -e .\ngit clone https://github.com/tgxs002/HPSv2.git\ncd HPSv2/\npip install -e .\ncd ..\n```\n\n### 📺 Inference\nPlease run `accelerate config` as the first step to configure accelerator settings. If you are not familiar with the accelerator configuration, you can refer to the VADER-Open-Sora [documentation](documentation/VADER-Open-Sora.md).\n\nAssuming you are in the `VADER/` directory, you can run inference with the following commands:\n```bash\ncd VADER-Open-Sora\nsh scripts/run_text2video_inference.sh\n```\n- We have tested on PyTorch 2.3.0 and CUDA 12.1. If `resolution` is set to `360p`, a GPU with 40 GB of VRAM is required when `val_batch_size=1` and `bf16` mixed precision are used. It should also work with recent PyTorch and CUDA versions. 
Please refer to the original [Open-Sora](https://github.com/hpcaitech/Open-Sora) repository for more details about the GPU requirements and the model settings.\n- `VADER/VADER-Open-Sora/scripts/train_t2v_lora.py` is the script for running inference with Open-Sora 1.2 using VADER.\n    - `--num-frames`, `--resolution`, `fps` and `aspect-ratio` are inherited from the original Open-Sora model. In short, you can set `--num-frames` to `'2s'`, `'4s'`, `'8s'`, or `'16s'`. Available values for `--resolution` are `'240p'`, `'360p'`, `'480p'`, and `'720p'`. The default value of `fps` is `24` and of `aspect-ratio` is `3:4`. Please refer to the original [Open-Sora](https://github.com/hpcaitech/Open-Sora) repository for more details. One thing to keep in mind, for instance, is that if you set `--num-frames` to `'2s'` and `--resolution` to `'240p'`, it is better to use `bf16` mixed precision instead of `fp16`. Otherwise, the model may generate videos that are pure noise.\n    - `--prompt-path` is the path of the prompt file. Unlike VideoCrafter, we do not provide a prompt function for Open-Sora. Instead, you provide a prompt file, which contains a list of prompts.\n    - `--num-processes` is the number of processes for Accelerator. It is recommended to set it to the number of GPUs.\n- `VADER/VADER-Open-Sora/configs/opensora-v1-2/vader/vader_inferece.py` is the configuration file for inference. You can modify the configuration file to change the inference settings following the guidance in the [documentation](documentation/VADER-Open-Sora.md).\n    - The main difference from training is that `is_vader_training` should be set to `False`. `--lora_ckpt_path` should be set to the path of the pretrained LoRA model. Otherwise, the original Open-Sora model will be used for inference.\n\n\n### 🔧 Training\nPlease run `accelerate config` as the first step to configure accelerator settings. 
If you are not familiar with the accelerator configuration, you can refer to VADER-Open-Sora [documentation](documentation/VADER-Open-Sora.md).\n\nAssuming you are in the `VADER/` directory, you can train the model with the following commands:\n\n```bash\ncd VADER-Open-Sora\nsh scripts/run_text2video_train.sh\n```\n- Our experiments were conducted on PyTorch 2.3.0 and CUDA 12.1 using 4 A6000 GPUs (48 GB of VRAM each). It should also work with recent PyTorch and CUDA versions. Fine-tuning the model requires a GPU with 48 GB of VRAM when using `bf16` mixed precision with `resolution` set to `360p` and `num_frames` set to `2s`.\n- `VADER/VADER-Open-Sora/scripts/train_t2v_lora.py` is the script for fine-tuning Open-Sora 1.2 using VADER via LoRA.\n    - The arguments are the same as in the inference process above.\n- `VADER/VADER-Open-Sora/configs/opensora-v1-2/vader/vader_train.py` is the configuration file for training. You can modify the configuration file to change the training settings.\n    - You can read the VADER-Open-Sora [documentation](documentation/VADER-Open-Sora.md) to understand the usage of arguments.\n\n\n## 🎥 ModelScope\n### ⚙️ Installation\nAssuming you are in the `VADER/` directory, you can create a Conda environment for VADER-ModelScope with the following commands:\n```bash\ncd VADER-ModelScope\nconda create -n vader_modelscope python=3.10\nconda activate vader_modelscope\nconda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia\nconda install xformers -c xformers\npip install -r requirements.txt\ngit clone https://github.com/tgxs002/HPSv2.git\ncd HPSv2/\npip install -e .\ncd ..\n```\n\n### 📺 Inference\nPlease run `accelerate config` as the first step to configure accelerator settings. 
If you are not familiar with the accelerator configuration, you can refer to VADER-ModelScope [documentation](documentation/VADER-ModelScope.md).\n\nAssuming you are in the `VADER/` directory, you can run inference with the following commands:\n```bash\ncd VADER-ModelScope\nsh run_text2video_inference.sh\n```\n- The current code works on a single GPU with more than 14 GB of VRAM.\n- Note: we do not set `lora_path` in the original inference script. You can set `lora_path` to the path of the pretrained LoRA model if you have one.\n\n### 🔧 Training\nPlease run `accelerate config` as the first step to configure accelerator settings. If you are not familiar with the accelerator configuration, you can refer to the VADER-ModelScope [documentation](documentation/VADER-ModelScope.md).\n\nAssuming you are in the `VADER/` directory, you can train the model with the following commands:\n```bash\ncd VADER-ModelScope\nsh run_text2video_train.sh\n```\n- The current code works on a single GPU with more than 14 GB of VRAM. The code can be further optimized to work with even less VRAM using DeepSpeed and CPU offloading. For our experiments, we used 4 A100 GPUs (40 GB each).\n- `VADER/VADER-ModelScope/train_t2v_lora.py` is the script for fine-tuning ModelScope using VADER via LoRA.\n    - `gradient_accumulation_steps` can be increased while reducing the `--num_processes` of the accelerator to alleviate the bottleneck caused by a limited number of GPUs. We tested with `gradient_accumulation_steps=4` and `--num_processes=4` on 4 A100 GPUs (40 GB each).\n    - `prompt_fn` is the prompt function, which can be the name of any function in Core/prompts.py, like `'chatgpt_custom_instruments'`, `'chatgpt_custom_animal_technology'`, `'chatgpt_custom_ice'`, `'nouns_activities'`, etc. 
Note: If you set `--prompt_fn 'nouns_activities'`, you have to provide `--nouns_file` and a matching activities file; the prompt function randomly selects a noun and an activity from these files and combines them into a single sentence as a prompt.\n    - `reward_fn` is the reward function, which can be selected from `'aesthetic'`, `'hps'`, and `'actpred'`.\n- `VADER/VADER-ModelScope/config_t2v/config.yaml` is the configuration file for training. You can modify the configuration file to change the training settings following the comments in that file.\n\n\n## 💡 Tutorial\nThis section provides a tutorial on how to implement the VADER method on VideoCrafter and Open-Sora yourself. The step-by-step guides below explain the modification details, so that you can easily adapt the VADER method to later versions of VideoCrafter.\n- Please refer to the [VideoCrafter tutorial](/VADER-VideoCrafter/readme.md)\n- Please refer to the [Open-Sora tutorial](/VADER-Open-Sora/readme.md)\n\n\n## Acknowledgement\n\nOur codebase is directly built on top of [VideoCrafter](https://github.com/AILab-CVC/VideoCrafter), [Open-Sora](https://github.com/hpcaitech/Open-Sora), and [Animate Anything](https://github.com/alibaba/animate-anything/). 
We would like to thank the authors for open-sourcing their code.\n\n## Citation\n\nIf you find this work useful in your research, please cite:\n\n```bibtex\n@misc{prabhudesai2024videodiffusionalignmentreward,\n      title={Video Diffusion Alignment via Reward Gradients}, \n      author={Mihir Prabhudesai and Russell Mendonca and Zheyang Qin and Katerina Fragkiadaki and Deepak Pathak},\n      year={2024},\n      eprint={2407.08737},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2407.08737}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmihirp1998%2FVADER","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmihirp1998%2FVADER","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmihirp1998%2FVADER/lists"}