https://github.com/Vchitect/VideoBooth

[CVPR2024] VideoBooth: Diffusion-based Video Generation with Image Prompts

Host: GitHub
URL: https://github.com/Vchitect/VideoBooth
Owner: Vchitect
Created: 2023-11-30T11:49:01.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-06-09T14:51:10.000Z (over 1 year ago)
Last Synced: 2025-03-22T23:49:28.641Z (9 months ago)
Language: Python
Homepage:
Size: 13 MB
Stars: 292
Watchers: 20
Forks: 11
Open Issues: 6
Metadata Files:
- Readme: README.md

# VideoBooth

[![Paper](https://img.shields.io/badge/cs.CV-Paper-b31b1b?logo=arxiv&logoColor=red)](xxxx)

[![Project Page](https://img.shields.io/badge/VideoBooth-Website-green?logo=googlechrome&logoColor=green)](https://vchitect.github.io/VideoBooth-project/)

[![Video](https://img.shields.io/badge/YouTube-Video-c4302b?logo=youtube&logoColor=red)](https://youtu.be/10DxH1JETzI)

[![Visitor](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FVchitect%2FVideoBooth&count_bg=%23FFA500&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)

This repository contains the implementation of the following paper:

> **VideoBooth: Diffusion-based Video Generation with Image Prompts**


> [Yuming Jiang](https://yumingj.github.io/), [Tianxing Wu](https://tianxingwu.github.io/), [Shuai Yang](https://williamyang1991.github.io/), [Chenyang Si](https://chenyangsi.top/), [Dahua Lin](http://dahua.site/), [Yu Qiao](https://scholar.google.com.sg/citations?user=gFtI-8QAAAAJ&hl=en), [Chen Change Loy](https://www.mmlab-ntu.com/person/ccloy/), [Ziwei Liu](https://liuziwei7.github.io/)


From [MMLab@NTU](https://www.mmlab-ntu.com/) affiliated with S-Lab, Nanyang Technological University and Shanghai AI Laboratory.

## Overview

Our VideoBooth generates videos with the subjects specified in the image prompts.

![overall_structure](./assets/teaser.png)

## Installation

1. Clone the repository.

```shell
git clone https://github.com/Vchitect/VideoBooth.git
cd VideoBooth
```

2. Install the environment.

```shell
conda env create -f environment.yml
conda activate videobooth
```

3. Download pretrained models ([Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4), [VideoBooth](https://huggingface.co/yumingj/VideoBooth_models/tree/main)), and put them under the folder `./pretrained_models/`.
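
As a hedged sketch of step 3, the two checkpoints could also be fetched with `huggingface_hub.snapshot_download` instead of by hand. The repository IDs are taken from the links above. The sub-folder names under `./pretrained_models/` and the helper names are assumptions for illustration, so check the paths in the sample config before relying on this layout.

```python
from pathlib import Path

# Repository IDs come from the links in step 3; the local folder names
# are assumptions, not a layout confirmed by the VideoBooth code.
MODEL_REPOS = {
    "CompVis/stable-diffusion-v1-4": "stable-diffusion-v1-4",
    "yumingj/VideoBooth_models": "VideoBooth_models",
}


def planned_downloads(root="./pretrained_models"):
    """Map each Hugging Face repo ID to the local directory it would use."""
    return {repo_id: Path(root) / name for repo_id, name in MODEL_REPOS.items()}


def download_all(root="./pretrained_models"):
    """Download every repo in MODEL_REPOS (needs `pip install huggingface_hub`)."""
    # Imported lazily so the planning helper works without the package.
    from huggingface_hub import snapshot_download

    for repo_id, target in planned_downloads(root).items():
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```

Calling `download_all()` from the repository root would place both downloads under `./pretrained_models/`; `planned_downloads()` can be used first to inspect the target paths without touching the network.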

## Inference

We provide one example configuration for running inference.

```shell
python sample_scripts/sample.py --config sample_scripts/configs/panda.yaml
```
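
To run the same script over several configs, a thin Python wrapper could look like the sketch below. It only reuses the command shown above; the helper names are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_sample_command(config_path, python=sys.executable):
    """Return the argv list for one inference run with the given config."""
    return [python, "sample_scripts/sample.py", "--config", config_path]


def run_configs(config_paths):
    """Run inference once per config, stopping at the first failure."""
    for config_path in config_paths:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_sample_command(config_path), check=True)
```

For example, `run_configs(["sample_scripts/configs/panda.yaml"])` repeats the command above. It must be launched from the repository root so that the relative paths resolve.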

If you want to use your own image, you need to segment the object first. We use [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) to segment the subject from images.

## Training

VideoBooth is trained in a coarse-to-fine manner.

### Stage 1: Coarse Stage Training

```shell
srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage1.py \
--model TAVU \
--num-frames 16 \
--dataset WebVideoImageStage1 \
--frame-interval 4 \
--ckpt-every 1000 \
--clip-max-norm 0.1 \
--global-batch-size 16 \
--reg-text-weight 0 \
--results-dir ./results \
--pretrained-t2v-model path-to-t2v-model \
--global-mapper-path path-to-elite-global-model
```

### Stage 2: Fine Stage Training

```shell
srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage2.py \
--model TAVU \
--num-frames 16 \
--dataset WebVideoImageStage2 \
--frame-interval 4 \
--ckpt-every 1000 \
--clip-max-norm 0.1 \
--global-batch-size 16 \
--reg-text-weight 0 \
--results-dir ./results \
--pretrained-t2v-model path-to-t2v-model \
--global-mapper-path path-to-stage1-model
```
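
The flags shared by both stage commands imply some simple arithmetic, sketched below. This assumes that `--global-batch-size` is split evenly across the `--nproc_per_node` processes and that `--frame-interval` is the stride between sampled frames. Neither assumption is confirmed here, and the function names are illustrative.

```python
def per_process_batch_size(global_batch_size, nnodes, nproc_per_node):
    """Batch size seen by each process if the global batch is split evenly."""
    world_size = nnodes * nproc_per_node
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


def clip_span_in_frames(num_frames, frame_interval):
    """Source frames covered when sampling num_frames at a fixed stride."""
    return (num_frames - 1) * frame_interval + 1


# Values from the commands above: 1 node, 8 processes, global batch 16.
print(per_process_batch_size(16, nnodes=1, nproc_per_node=8))  # 2
# 16 frames sampled every 4th frame span 61 frames of the source video.
print(clip_span_in_frames(16, 4))  # 61
```

Under these assumptions, changing `--nproc_per_node` without adjusting `--global-batch-size` changes the per-process batch size, which is worth checking against available GPU memory.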

## Dataset Preparation

You can download our proposed dataset from [HuggingFace](https://huggingface.co/datasets/yumingj/VideoBoothDataset).

```shell
# merge the split zip files into a single archive
zip -F webvid_parsing_2M_split.zip --out single-archive.zip
# extract it, then replace path-to-webvid-parsing with the extracted path
unzip single-archive.zip
# extract the subset, then replace path-to-videobooth-subset with the extracted path
unzip webvid_parsing_videobooth_subset.zip
```
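
If `unzip` is not available, the extraction steps (though not the `zip -F` merge) can be done with Python's standard `zipfile` module. This is a sketch only; the function name and the destination directory in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a single (already merged) zip archive and return the target dir."""
    dest = Path(dest_dir)
    # Create the destination first so a fresh path can be passed in.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    return dest
```

For example, `extract_archive("single-archive.zip", "data/webvid_parsing")` would unpack the merged archive into a directory of your choice, which can then be substituted for path-to-webvid-parsing.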

## Citation

If you find our repo useful for your research, please consider citing our paper:

```bibtex
@article{jiang2023videobooth,
    author = {Jiang, Yuming and Wu, Tianxing and Yang, Shuai and Si, Chenyang and Lin, Dahua and Qiao, Yu and Loy, Chen Change and Liu, Ziwei},
    title = {VideoBooth: Diffusion-based Video Generation with Image Prompts},
    year = {2023}
}
```
