https://github.com/amazon-science/instruct-video-to-video
- Host: GitHub
- URL: https://github.com/amazon-science/instruct-video-to-video
- Owner: amazon-science
- License: mit-0
- Created: 2023-11-25T06:00:08.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-13T12:04:31.000Z (over 2 years ago)
- Last Synced: 2024-02-13T14:23:45.477Z (over 2 years ago)
- Language: Python
- Size: 167 MB
- Stars: 43
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
## This is the code release for the ICLR 2024 paper [Consistent Video-to-Video Transfer Using Synthetic Dataset](https://arxiv.org/abs/2311.00213).

## Quick Links
* [Installation](#installation)
* [Video Editing](#video-editing) 🔥
* [Synthetic Video Prompt-to-Prompt Dataset](#synthetic-video-prompt-to-prompt-dataset)
* [Training](#training)
* [Create Synthetic Video Dataset](#create-synthetic-video-dataset)
## Updates
* 2024/02/13: The official synthetic data and model will not be released due to Amazon policy, but we provide a third-party reproduction of the synthetic data and model weights. Please refer to this [GitHub repo](https://github.com/cplusx/INSV2V-3rd-pty-reprod).
* 2023/11/29: We have updated the paper with more comparisons to recent baseline methods and updated the [comparison video](#visual-comparison-to-other-methods). The Gradio demo code has been uploaded.
## Installation
```bash
git clone https://github.com/amazon-science/instruct-video-to-video.git
cd instruct-video-to-video
pip install -r requirements.txt
```
NOTE: The code is tested on PyTorch 2.1.0+cu11.8 with the matching xformers version. Any PyTorch version newer than 2.0 should work, provided you install the xformers build that corresponds to it.
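Before picking an xformers build, it can help to confirm which PyTorch version is actually installed. The following is a minimal sketch that uses only the Python standard library; the helper names `parse_version` and `installed_versions` are illustrative and not part of this repository.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '2.1.0+cu118' into (2, 1, 0).

    The local build tag after '+' is ignored, and only the leading digits
    of each of the first three components are kept.
    """
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def installed_versions(packages=("torch", "xformers")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    torch_version = versions["torch"]
    if torch_version and parse_version(torch_version) < (2, 0, 0):
        print("PyTorch is older than 2.0; a newer version is recommended.")
```

The script only reports what is installed. Choosing the xformers release that matches a given PyTorch build still has to be done by hand from the xformers release notes.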
## Video Editing
The official model weights will not be released (see [Updates](#updates)). Download the third-party [InsV2V model weights](https://github.com/cplusx/INSV2V-3rd-pty-reprod) and change the checkpoint path in the following notebook.
✨🚀 This [notebook](video_edit.ipynb) provides sample code for text-based video editing.
### Download LOVEU Dataset for Testing
Follow the instructions on the [LOVEU dataset page](https://sites.google.com/view/loveucvpr23/track4) to download the dataset. Then use this [script](insv2v_run_loveu_tgve.py) to run editing on the LOVEU dataset:
```bash
# --with_optical_flow enables motion compensation
python insv2v_run_loveu_tgve.py \
--config configs/instruct_v2v.yaml \
--ckpt-path [PATH TO THE CHECKPOINT] \
--data-dir [PATH TO THE LOVEU DATASET] \
--with_optical_flow \
--text-cfg 7.5 10 \
--video-cfg 1.2 1.5 \
--image-size 256 384
```
Note: You may need to try different combinations of image resolution and video/text classifier-free guidance scales to find the best editing results.
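One way to explore these combinations systematically is to generate one command per setting and run them in turn. The sketch below is a hypothetical helper, not part of this repository. It assumes that each of `--text-cfg`, `--video-cfg`, and `--image-size` also accepts a single value, which should be checked against the script's argument parser. The checkpoint and dataset paths are placeholders.

```python
import itertools
import shlex


def build_commands(ckpt_path, data_dir, text_cfgs, video_cfgs, image_sizes):
    """Return one shell command string per (text cfg, video cfg, size) combination."""
    commands = []
    for text_cfg, video_cfg, size in itertools.product(
        text_cfgs, video_cfgs, image_sizes
    ):
        args = [
            "python", "insv2v_run_loveu_tgve.py",
            "--config", "configs/instruct_v2v.yaml",
            "--ckpt-path", ckpt_path,
            "--data-dir", data_dir,
            "--with_optical_flow",
            # Assumption: each flag below accepts a single value.
            "--text-cfg", str(text_cfg),
            "--video-cfg", str(video_cfg),
            "--image-size", str(size),
        ]
        # shlex.join quotes arguments that contain spaces or shell metacharacters.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Placeholder paths; replace them with your own checkpoint and dataset.
    for command in build_commands(
        "path/to/checkpoint", "path/to/loveu",
        text_cfgs=[7.5, 10], video_cfgs=[1.2, 1.5], image_sizes=[256, 384],
    ):
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be piped to a shell or a job scheduler once the settings look right.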
Example results of editing LOVEU-TGVE Dataset:
## Synthetic Video Prompt-to-Prompt Dataset
Generation pipeline of the synthetic video dataset:

Examples of the synthetic video dataset:
## Training
### Download Foundational Models
[Download](https://drive.google.com/file/d/1R9sWsnGZUa5P8IB5DDfD9eU-T9SQLsFw/view?usp=sharing) the foundational models and place them in the `pretrained_models` folder.
### Download Synthetic Video Dataset
[See download link in the third party reproduction](https://github.com/cplusx/INSV2V-3rd-pty-reprod)
### Train the Model
Put the synthetic video dataset in the `video_ptp` folder.
Run the following command to train the model:
```bash
python main.py --config configs/instruct_v2v.yaml -r # add -r to resume training if the training is interrupted
```
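A missing folder is an easy mistake to make before a long training run. The following sketch checks that the inputs named above are in place. It assumes `pretrained_models` and `video_ptp` sit at the repository root next to `main.py`, which the README does not state explicitly, and the function name `missing_training_inputs` is illustrative.

```python
from pathlib import Path


def missing_training_inputs(repo_root="."):
    """Return the expected training inputs that are absent or are empty folders."""
    root = Path(repo_root)
    required = [
        root / "configs" / "instruct_v2v.yaml",
        root / "pretrained_models",  # assumed location of the foundational models
        root / "video_ptp",          # assumed location of the synthetic dataset
    ]
    missing = []
    for path in required:
        if not path.exists():
            missing.append(path)
        elif path.is_dir() and not any(path.iterdir()):
            # An empty folder most likely means the download step was skipped.
            missing.append(path)
    return missing


if __name__ == "__main__":
    problems = missing_training_inputs(".")
    for path in problems:
        print(f"missing or empty: {path}")
    if not problems:
        print("All expected inputs found.")
```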
## Create Synthetic Video Dataset
To create your own synthetic video dataset, follow these steps:
* Download the ModelScope VAE, UNet, and text encoder weights from [here](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main).
* Replace the model paths in the [`video_prompt_to_prompt.py`](video_prompt_to_prompt.py) file:
```python
vae_ckpt = 'VAE_PATH'
unet_ckpt = 'UNet_PATH'
text_model_ckpt = 'Text_MODEL_PATH'
```
* Download the edit prompt file from [Instruct Pix2Pix](https://github.com/timothybrooks/instruct-pix2pix). The file should be named `gpt-generated-prompts.jsonl`; update its path in `video_prompt_to_prompt.py` accordingly. Alternatively, use the WebVid prompt edit file proposed in our paper (to be released).
* Run the command to generate the synthetic video dataset:
```bash
python video_prompt_to_prompt.py \
--start [START INDEX] \
--end [END INDEX] \
--prompt_source [ip2p or webvid] \
--num_sample_each_prompt [NUM SAMPLES FOR EACH PROMPT]
```
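Because the script takes a start and end index, generation can be split across several workers. The sketch below divides a prompt count into contiguous ranges and prints one command per range. It assumes `--end` is exclusive, which should be checked against `video_prompt_to_prompt.py`. The helper names and the default of one sample per prompt are illustrative.

```python
def shard_ranges(total, num_shards):
    """Split [0, total) into contiguous (start, end) pairs of near-equal size."""
    base, extra = divmod(total, num_shards)
    ranges = []
    start = 0
    for index in range(num_shards):
        # The first `extra` shards each take one additional item.
        end = start + base + (1 if index < extra else 0)
        if end > start:  # skip empty shards when total < num_shards
            ranges.append((start, end))
        start = end
    return ranges


def shard_commands(total, num_shards, prompt_source="ip2p", samples_per_prompt=1):
    """Build one generation command per shard (assumes --end is exclusive)."""
    return [
        "python video_prompt_to_prompt.py"
        f" --start {start} --end {end}"
        f" --prompt_source {prompt_source}"
        f" --num_sample_each_prompt {samples_per_prompt}"
        for start, end in shard_ranges(total, num_shards)
    ]


if __name__ == "__main__":
    # Example: 10 prompts spread over 3 workers.
    for command in shard_commands(10, 3):
        print(command)
```

If `--end` turns out to be inclusive, subtract one from each `end` value before building the commands so that neighbouring shards do not overlap.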
## Visual Comparison to Other Methods
https://github.com/amazon-science/instruct-video-to-video/assets/20940184/d3619652-dd75-41a0-92b4-345bbf57de40
Links to the baselines used in the video:
* [Tune-A-Video](https://github.com/showlab/Tune-A-Video)
* [ControlVideo](https://github.com/thu-ml/controlvideo)
* [Vid2Vid Zero](https://github.com/baaivision/vid2vid-zero)
* [Video P2P](https://github.com/ShaoTengLiu/Video-P2P)
* [TokenFlow](https://github.com/omerbt/TokenFlow)
* [Rerender A Video](https://github.com/williamyang1991/Rerender_A_Video)
* [Pix2Video](https://github.com/duyguceylan/pix2video)
## Credit
The code was implemented by [Jiaxin Cheng](https://github.com/cplusx) during his internship at the AWS Shanghai Lablet.
## References
Part of the code and the foundational models are adapted from the following works:
* [Instruct Pix2Pix](https://github.com/timothybrooks/instruct-pix2pix)
* [AnimateDiff](https://github.com/guoyww/animatediff/)