# vid2vid-zero for Zero-Shot Video Editing

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

[Wen Wang](https://scholar.google.com/citations?user=1ks0R04AAAAJ&hl=zh-CN)<sup>1*</sup>,   [Kangyang Xie](https://github.com/felix-ky)<sup>1*</sup>,   [Zide Liu](https://github.com/zideliu)<sup>1*</sup>,   [Hao Chen](https://scholar.google.com.au/citations?user=FaOqRpcAAAAJ&hl=en)<sup>1</sup>,   [Yue Cao](http://yue-cao.me/)<sup>2</sup>,   [Xinlong Wang](https://www.xloong.wang/)<sup>2</sup>,   [Chunhua Shen](https://cshen.github.io/)<sup>1</sup>

<sup>1</sup>[ZJU](https://www.zju.edu.cn/english/),   <sup>2</sup>[BAAI](https://www.baai.ac.cn/english.html)


[![Hugging Face Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/BAAI/vid2vid-zero)


We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models and doesn't require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
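
The cross-frame modeling idea can be pictured as letting each frame's attention queries see keys and values gathered from every frame in the clip. The snippet below is only a conceptual sketch under that reading; the tensor shapes and function name are illustrative assumptions, not the repository's implementation:

```python
import torch

def cross_frame_attention(q, k, v):
    """Conceptual sketch of bi-directional cross-frame attention.

    q, k, v: (frames, tokens, dim) features from an image diffusion UNet;
    shapes and naming here are assumptions for illustration only.
    """
    f, n, d = k.shape
    # Share keys/values across the temporal axis so every frame attends to
    # every other frame (bi-directional temporal modeling at test time).
    k_all = k.reshape(1, f * n, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * n, d).expand(f, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_all  # (frames, tokens, dim)
```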

## Highlights

- Video editing with off-the-shelf image diffusion models.

- No training on any video.

- Promising results in editing attributes, subjects, places, etc., in real-world videos.

## News
* [2023.4.12] Online Gradio Demo is available [here](https://huggingface.co/spaces/BAAI/vid2vid-zero).
* [2023.4.11] Added Gradio demo (runs locally).
* [2023.4.9] Code released!

## Installation
### Requirements

```shell
pip install -r requirements.txt
```
Installing [xformers](https://github.com/facebookresearch/xformers) is highly recommended for improved efficiency and speed on GPUs.
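
For reference, with a diffusers-style Stable Diffusion pipeline, xformers attention is typically switched on as shown below; this is a general illustration, and the repo's own scripts may enable it differently:

```python
# Illustration only: enabling xformers memory-efficient attention on a
# diffusers pipeline; requires xformers and a CUDA-capable GPU.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe.enable_xformers_memory_efficient_attention()
```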

### Weights

**[Stable Diffusion]** [Stable Diffusion](https://arxiv.org/abs/2112.10752) is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from [🤗 Hugging Face](https://huggingface.co) (e.g., [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), [v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1)). We use Stable Diffusion v1-4 by default.
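
If you prefer to pre-download the weights instead of letting them be fetched on first use, one option (an illustration, not a step required by the repository's scripts) is `snapshot_download` from `huggingface_hub`:

```python
# Illustration only: pre-fetch the Stable Diffusion v1-4 weights into the
# local Hugging Face cache so later runs can load them without re-downloading.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")
```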

## Zero-shot testing

Simply run:

```bash
accelerate launch test_vid2vid_zero.py --config path/to/config
```

For example:
```bash
accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
```

## Gradio Demo
Launch the local demo built with [gradio](https://gradio.app/):
```bash
python app.py
```

Alternatively, you can use our online Gradio demo [here](https://huggingface.co/spaces/BAAI/vid2vid-zero).

Note that we disable Null-text Inversion and enable fp16 for faster demo responses.
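
For reference, half precision is usually requested at load time in diffusers-style code; a minimal sketch of that pattern (not the demo's exact code) is:

```python
# Illustration only: loading the model in fp16 to speed up inference,
# mirroring the faster settings used for the hosted demo.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
```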

## Examples

| Input Video | Output Video | Input Video | Output Video |
| :---: | :---: | :---: | :---: |
| "A car is moving on the road" | "A Porsche car is moving on the desert" | "A car is moving on the road" | "A jeep car is moving on the snow" |
| "A man is running" | "Stephen Curry is running in Time Square" | "A man is running" | "A man is running in New York City" |
| "A child is riding a bike on the road" | "a child is riding a bike on the flooded road" | "A child is riding a bike on the road" | "a lego child is riding a bike on the road" |
| "A car is moving on the road" | "A car is moving on the snow" | "A car is moving on the road" | "A jeep car is moving on the desert" |


## Citation

```bibtex
@article{vid2vid-zero,
title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
journal={arXiv preprint arXiv:2303.17599},
year={2023}
}
```

## Acknowledgement
[Tune-A-Video](https://github.com/showlab/Tune-A-Video), [diffusers](https://github.com/huggingface/diffusers), [prompt-to-prompt](https://github.com/google/prompt-to-prompt).

## Contact

**We are hiring** at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns.
If you are interested in working with us on **foundation models, visual perception, and multimodal learning**, please contact [Xinlong Wang](https://www.xloong.wang/) (`[email protected]`) and [Yue Cao](http://yue-cao.me/) (`[email protected]`).