Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/YangLing0818/VideoTetris
VideoTetris: Towards Compositional Text-To-Video Generation
- Host: GitHub
- URL: https://github.com/YangLing0818/VideoTetris
- Owner: YangLing0818
- Created: 2024-06-06T11:54:20.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-08-01T05:40:20.000Z (5 months ago)
- Last Synced: 2024-08-01T18:41:25.419Z (5 months ago)
- Topics: diffusion-models, large-language-models, text-to-video-generation
- Language: Python
- Homepage: https://arxiv.org/abs/2406.04277
- Size: 23.8 MB
- Stars: 185
- Watchers: 18
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-diffusion-categorized
README
## ___***VideoTetris: Towards Compositional Text-To-Video Generation***___
This repo contains the official implementation of our [VideoTetris](https://arxiv.org/abs/2406.04277) (**NeurIPS 2024**).
> [**VideoTetris: Towards Compositional Text-To-Video Generation**](https://arxiv.org/abs/2406.04277)
> [Ye Tian](https://tyfeld.github.io/),
> [Ling Yang*](https://yangling0818.github.io),
> [Haotian Yang](https://scholar.google.com/citations?user=LH71RGkAAAAJ&hl=en),
> [Yuan Gao](https://videotetris.github.io/),
> [Yufan Deng](https://videotetris.github.io/),
> [Jingmin Chen](https://videotetris.github.io/),
> [Xintao Wang](https://xinntao.github.io),
> [Zhaochen Yu](https://videotetris.github.io/),
> [Pengfei Wan](https://scholar.google.com/citations?user=P6MraaYAAAAJ&hl=en),
> [Di Zhang](https://openreview.net/profile?id=~Di_ZHANG3),
> [Bin Cui](https://cuibinpku.github.io/cuibin_cn.html)
> (* Equal Contribution and Corresponding Author)
>
Peking University, Kuaishou Technology

## Introduction
VideoTetris is a novel framework that enables **compositional T2V generation**. Specifically, we propose **spatio-temporal compositional diffusion**, which precisely follows complex textual semantics by manipulating and composing the attention maps of the denoising network spatially and temporally. Moreover, we introduce an enhanced video data preprocessing pipeline that improves the training data in terms of motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Our demonstrations include successful examples of **videos spanning 10s, 30s, and up to 2 minutes**, and can be extended for even longer durations.
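To make the spatial-composition idea concrete, here is a minimal illustrative sketch, not the repository's actual code: all function names and tensor shapes are assumptions. It blends per-object cross-attention maps inside their normalized layout boxes and falls back to the mean map elsewhere:

```python
# Illustrative sketch of spatially composing cross-attention maps.
# Names, shapes, and the blending rule are assumptions, not the repo's API.
import torch

def box_to_mask(box, h, w):
    """Convert a normalized [x0, y0, x1, y1] box to a binary (h, w) mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def compose_attention(attn_maps, boxes):
    """Blend per-object attention maps (each (h, w)) inside their regions.

    Pixels covered by no box fall back to the mean of all maps.
    """
    h, w = attn_maps[0].shape
    masks = [box_to_mask(b, h, w) for b in boxes]
    coverage = torch.stack(masks).sum(dim=0).clamp(min=1.0)  # avoid /0, average overlaps
    composed = sum(m * a for m, a in zip(masks, attn_maps)) / coverage
    covered = torch.stack(masks).sum(dim=0).clamp(max=1.0)   # binary union of boxes
    background = torch.stack(attn_maps).mean(dim=0)
    return covered * composed + (1 - covered) * background

# Example: cat on the left half, dog on the right half of a 32x32 latent.
cat_attn, dog_attn = torch.rand(32, 32), torch.rand(32, 32)
out = compose_attention([cat_attn, dog_attn], [[0, 0, 0.5, 1], [0.5, 0, 1, 1]])
print(out.shape)  # torch.Size([32, 32])
```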
## Training and Inference
### Compositional Text-to-Video Generation
We provide the inference code of VideoTetris for compositional video generation based on VideoCrafter2. You can download the pretrained model from [Hugging Face](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt) and put it at `checkpoints/base_512_v2/model.ckpt` before following the steps below.
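If you prefer to fetch the checkpoint programmatically, here is a minimal sketch using `huggingface_hub`; the repo id and filename come from the link above, and the target path is the one the scripts expect:

```python
# Sketch: download the VideoCrafter2 base checkpoint into the expected path.
import os
import shutil
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(repo_id="VideoCrafter/VideoCrafter2", filename="model.ckpt")
os.makedirs("checkpoints/base_512_v2", exist_ok=True)
shutil.copy(ckpt, "checkpoints/base_512_v2/model.ckpt")
```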
#### 1. Install Environment via Anaconda (Recommended)
```bash
cd short
conda create -n videocrafter python=3.8.5
conda activate videocrafter
pip install -r requirements.txt
```
#### 2. Region Planning
You can then plan the regions for the different sub-objects in a JSON file like `prompts/demo_videotetris.json`. The regions are defined by the top-left and bottom-right coordinates of the bounding box; see `prompts/demo_videotetris.json` for an example. The final planning JSON should look like:
```json
[
  {
    "basic_prompt": "A cat on the left and a dog on the right are napping in the sun.",
    "sub_objects": [
      "A cute orange cat.",
      "A cute dog."
    ],
    "layout_boxes": [
      [0, 0, 0.5, 1],
      [0.5, 0, 1, 1]
    ]
  }
]
```
In this case, we first define the basic prompt, then specify the sub-objects and their corresponding regions, resulting in a video with a cat on the left and a dog on the right.
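Before running inference, it can help to sanity-check a planning file. The following snippet is illustrative and not part of the repo; it verifies that each sub-object has a well-formed normalized box:

```python
# Hypothetical sanity check for a planning JSON: one box per sub-object,
# each box a normalized [x0, y0, x1, y1] with top-left before bottom-right.
import json

def validate_plan(path):
    with open(path) as f:
        plans = json.load(f)
    for plan in plans:
        objs, boxes = plan["sub_objects"], plan["layout_boxes"]
        assert len(objs) == len(boxes), "one box per sub-object"
        for x0, y0, x1, y1 in boxes:
            assert 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1, f"bad box {[x0, y0, x1, y1]}"
    print(f"{path}: {len(plans)} plan(s) OK")

validate_plan("prompts/demo_videotetris.json")
```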
#### 3. Inference of VideoTetris
```bash
sh scripts/run_text2video_from_layout.sh
```
You can specify the input JSON file in the `run_text2video_from_layout.sh` script.

### Long Video Generation with Progressive Compositional Prompts
#### 1. Install Environment via Anaconda (Recommended)
```bash
cd long
conda create -n st2v python=3.10
conda activate st2v
pip install -r requirements.txt
```
#### 2. Download the Checkpoint
We provide our VideoTetris-long model, finetuned on our filtered dataset, on [Hugging Face](https://huggingface.co/tyfeld/VideoTetris-long). You can download the weights into the working directory with:
```bash
wget https://huggingface.co/tyfeld/VideoTetris-long/resolve/main/model-step=6000-v1.ckpt
```
#### 3. Region Planning
You can then plan the regions for the different sub-objects in a JSON file like `prompts/prompt.json`, specifying the video chunk index, prompt, sub-objects, and layout boxes for each video chunk.
> Video chunk meaning: the long video is generated autoregressively, 8 frames per chunk, so a video with 80 frames is generated in (80 - 8) / 8 = 9 rounds after the initial chunk. Each chunk refers to the 8 new frames generated in one round; the sketch after the JSON example below makes this arithmetic concrete.

The regions are defined by the top-left and bottom-right coordinates of the bounding box; see `prompts/prompt.json` for an example. The final planning JSON should look like:
```json
[
  {
    "video_chunk_index": 0,
    "prompt": "A cute brown squirrel in Antarctica, on a pile of hazelnuts, cinematic.",
    "sub_objects": [
      "A cute brown squirrel in Antarctica, on a pile of hazelnuts, cinematic."
    ],
    "layout_boxes": [
      [0, 0, 1, 1]
    ]
  },
  {
    "video_chunk_index": 4,
    "prompt": "A cute brown squirrel and a cute white squirrel in Antarctica, on a pile of hazelnuts, cinematic.",
    "sub_objects": [
      "A cute brown squirrel in Antarctica, on a pile of hazelnuts, cinematic.",
      "A cute white squirrel in Antarctica, on a pile of hazelnuts, cinematic."
    ],
    "layout_boxes": [
      [0.5, 0, 1, 1],
      [0, 0, 0.5, 1]
    ]
  }
]
```
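As referenced above, the following helper is a hypothetical illustration, not part of the repo; it maps each planned `video_chunk_index` to the frames it introduces under the 8-frames-per-chunk scheme:

```python
# Hypothetical helper: map planned chunk indices to frame ranges, assuming
# chunk i covers frames [i*8, i*8+7] under the autoregressive scheme above.
import json

CHUNK = 8  # frames generated per autoregressive round

def chunk_schedule(path, num_frames):
    with open(path) as f:
        plans = {p["video_chunk_index"]: p["prompt"] for p in json.load(f)}
    rounds = (num_frames - CHUNK) // CHUNK  # e.g. (80 - 8) // 8 = 9
    print(f"{num_frames} frames -> 1 initial chunk + {rounds} rounds")
    for idx in sorted(plans):
        start = idx * CHUNK
        print(f"chunk {idx}: frames {start}-{start + CHUNK - 1}: {plans[idx]!r}")

chunk_schedule("prompts/prompt.json", num_frames=80)
```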
#### 4. Inference of VideoTetris-long
```bash
cd t2v_enhanced
python inference_videotetris.py --num_frames 80
```
## Example Results
We only provide some example results here; more detailed results can be found on the [project page](https://videotetris.github.io/).
- "A cute brown dog on the left and a sleepy cat on the right are napping in the sun." (16 frames)
- "A cheerful farmer and a hardworking blacksmith are building a barn." (16 frames)
- "One cute brown squirrel, on a pile of hazelnuts, cinematic." transitions to "Two cute brown squirrels, on a pile of hazelnuts, cinematic.", then "Three cute brown squirrels, on a pile of hazelnuts, cinematic.", then "Four cute brown squirrels, on a pile of hazelnuts, cinematic." (80 frames)
- "A cute brown squirrel, on a pile of hazelnuts, cinematic." transitions to "A cute brown squirrel and a cute white squirrel, on a pile of hazelnuts, cinematic." (240 frames)
## Citation
```bibtex
@article{tian2024videotetris,
title={VideoTetris: Towards Compositional Text-to-Video Generation},
author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},
journal={arXiv preprint arXiv:2406.04277},
year={2024}
}
```