# vid2vid-zero for Zero-Shot Video Editing

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

[Wen Wang](https://scholar.google.com/citations?user=1ks0R04AAAAJ&hl=zh-CN)<sup>1*</sup>,   [Kangyang Xie](https://github.com/felix-ky)<sup>1*</sup>,   [Zide Liu](https://github.com/zideliu)<sup>1*</sup>,   [Hao Chen](https://scholar.google.com.au/citations?user=FaOqRpcAAAAJ&hl=en)<sup>1</sup>,   [Yue Cao](http://yue-cao.me/)<sup>2</sup>,   [Xinlong Wang](https://www.xloong.wang/)<sup>2</sup>,   [Chunhua Shen](https://cshen.github.io/)<sup>1</sup>

<sup>1</sup>[ZJU](https://www.zju.edu.cn/english/),   <sup>2</sup>[BAAI](https://www.baai.ac.cn/english.html)


[![Hugging Face Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/BAAI/vid2vid-zero)


We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models and doesn't require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
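
The cross-frame modeling idea can be pictured as letting each frame's attention queries see keys and values gathered from every frame in the clip. The snippet below is only a conceptual sketch under that reading; the tensor shapes and function name are illustrative assumptions, not the repository's implementation:

```python
import torch

def cross_frame_attention(q, k, v):
    """Conceptual sketch of bi-directional cross-frame attention.

    q, k, v: (frames, tokens, dim) features from an image diffusion UNet;
    shapes and naming here are assumptions for illustration only.
    """
    f, n, d = k.shape
    # Share keys/values across the temporal axis so every frame attends to
    # every other frame (bi-directional temporal modeling at test time).
    k_all = k.reshape(1, f * n, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * n, d).expand(f, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_all  # (frames, tokens, dim)
```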

## Highlights

- Video editing with off-the-shelf image diffusion models.

- No training on any video.

- Promising results in editing attributes, subjects, places, etc., in real-world videos.

## News
* [2023.4.12] Online Gradio Demo is available [here](https://huggingface.co/spaces/BAAI/vid2vid-zero).
* [2023.4.11] Added Gradio demo (runs locally).
* [2023.4.9] Code released!

## Installation
### Requirements

```shell
pip install -r requirements.txt
```
Installing [xformers](https://github.com/facebookresearch/xformers) is highly recommended for improved efficiency and speed on GPUs.
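
For reference, with a diffusers-style Stable Diffusion pipeline, xformers attention is typically switched on as shown below; this is a general illustration, and the repo's own scripts may enable it differently:

```python
# Illustration only: enabling xformers memory-efficient attention on a
# diffusers pipeline; requires xformers and a CUDA-capable GPU.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe.enable_xformers_memory_efficient_attention()
```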

### Weights

**[Stable Diffusion]** [Stable Diffusion](https://arxiv.org/abs/2112.10752) is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from [🤗 Hugging Face](https://huggingface.co) (e.g., [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), [v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1)). We use Stable Diffusion v1-4 by default.
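
If you prefer to pre-download the weights instead of letting them be fetched on first use, one option (an illustration, not a step required by the repository's scripts) is `snapshot_download` from `huggingface_hub`:

```python
# Illustration only: pre-fetch the Stable Diffusion v1-4 weights into the
# local Hugging Face cache so later runs can load them without re-downloading.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")
```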

## Zero-shot testing

Simply run:

```bash
accelerate launch test_vid2vid_zero.py --config path/to/config
```

For example:
```bash
accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
```

## Gradio Demo
Launch the local demo built with [gradio](https://gradio.app/):
```bash
python app.py
```

Alternatively, you can use our online Gradio demo [here](https://huggingface.co/spaces/BAAI/vid2vid-zero).

Note that we disable Null-text Inversion and enable fp16 for faster demo responses.
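
For reference, half precision is usually requested at load time in diffusers-style code; a minimal sketch of that pattern (not the demo's exact code) is:

```python
# Illustration only: loading the model in fp16 to speed up inference,
# mirroring the faster settings used for the hosted demo.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
```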

## Examples

| Input Video | Output Video | Input Video | Output Video |
| :---: | :---: | :---: | :---: |
| "A car is moving on the road" | "A Porsche car is moving on the desert" | "A car is moving on the road" | "A jeep car is moving on the snow" |
| "A man is running" | "Stephen Curry is running in Time Square" | "A man is running" | "A man is running in New York City" |
| "A child is riding a bike on the road" | "a child is riding a bike on the flooded road" | "A child is riding a bike on the road" | "a lego child is riding a bike on the road" |
| "A car is moving on the road" | "A car is moving on the snow" | "A car is moving on the road" | "A jeep car is moving on the desert" |


## Citation

```bibtex
@article{vid2vid-zero,
title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
journal={arXiv preprint arXiv:2303.17599},
year={2023}
}
```

## Acknowledgement
[Tune-A-Video](https://github.com/showlab/Tune-A-Video), [diffusers](https://github.com/huggingface/diffusers), [prompt-to-prompt](https://github.com/google/prompt-to-prompt).

## Contact

**We are hiring** at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns.
If you are interested in working with us on **foundation models, visual perception, and multimodal learning**, please contact [Xinlong Wang](https://www.xloong.wang/) (`[email protected]`) and [Yue Cao](http://yue-cao.me/) (`[email protected]`).