https://yingqinghe.github.io/LVDM/

LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation

Last synced: 8 months ago

Host: GitHub
URL: https://yingqinghe.github.io/LVDM/
Owner: YingqingHe
License: mit
Created: 2022-11-22T14:12:01.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-11-12T11:31:52.000Z (about 1 year ago)
Last Synced: 2024-11-12T12:27:43.757Z (about 1 year ago)
Language: Python
Homepage: https://yingqinghe.github.io/LVDM/
Size: 1010 KB
Stars: 452
Watchers: 28
Forks: 17
Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-diffusion-categorized - [Project
ai-game-devtools - LVDM - Fidelity Long Video Generation. [arXiv](https://arxiv.org/abs/2211.13221) (Video / LLM (LLM & Tool))
awesome-conditional-content-generation - Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

README

LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He ¹
Tianyu Yang ²
Yong Zhang ²
Ying Shan ²
Qifeng Chen ¹

¹ The Hong Kong University of Science and Technology ² Tencent AI Lab

TL;DR: An efficient video diffusion model that can:
1️⃣ conditionally generate videos based on input text;
2️⃣ unconditionally generate videos with thousands of frames.

## 🍻 Results
### ☝️ Text-to-Video Generation

"A corgi is swimming fastly"
"astronaut riding a horse"
"A glass bead falling into water with a huge splash. Sunset in the background"
"A beautiful sunrise on mars. High definition, timelapse, dramaticcolors."
"A bear dancing and jumping to upbeat music, moving his whole body."
"An iron man surfing in the sea. cartoon style"

### ✌️ Unconditional Long Video Generation (40 seconds)

## ⏳ TODO
- [x] Release pretrained text-to-video generation models and inference code
- [x] Release unconditional video generation models
- [x] Release training code
- [ ] Update training and sampling for long video generation

---
## ⚙️ Setup

### Install Environment via Anaconda
```bash
conda create -n lvdm python=3.8.5
conda activate lvdm
pip install -r requirements.txt
```
### Pretrained Models and Used Datasets

Download via Linux commands:
```
mkdir -p models/ae
mkdir -p models/lvdm_short
mkdir -p models/t2v

# sky timelapse
wget -O models/ae/ae_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_sky.ckpt
wget -O models/lvdm_short/short_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_sky.ckpt

# taichi
wget -O models/ae/ae_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_taichi.ckpt
wget -O models/lvdm_short/short_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_taichi.ckpt

# text2video
wget -O models/t2v/model.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/t2v.ckpt
```
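
To check which of the checkpoints above are still missing before sampling, a small Python helper can be used. This is a minimal sketch and not part of the repository: `CHECKPOINTS` and `missing_checkpoints` are hypothetical names, and the destinations and URLs are copied from the `wget` commands above.

```python
from pathlib import Path

BASE_URL = "https://huggingface.co/Yingqing/LVDM/resolve/main"

# Local destination -> path inside the Hugging Face repo,
# copied from the wget commands in this README.
CHECKPOINTS = {
    "models/ae/ae_sky.ckpt": "ae/ae_sky.ckpt",
    "models/lvdm_short/short_sky.ckpt": "lvdm_short/short_sky.ckpt",
    "models/ae/ae_taichi.ckpt": "ae/ae_taichi.ckpt",
    "models/lvdm_short/short_taichi.ckpt": "lvdm_short/short_taichi.ckpt",
    "models/t2v/model.ckpt": "lvdm_short/t2v.ckpt",
}


def missing_checkpoints(root="."):
    """Return (destination, url) pairs for checkpoints not found under root."""
    root = Path(root)
    return [
        (dest, f"{BASE_URL}/{remote}")
        for dest, remote in CHECKPOINTS.items()
        if not (root / dest).is_file()
    ]


if __name__ == "__main__":
    # Print the wget command for every checkpoint that is not on disk yet.
    for dest, url in missing_checkpoints():
        print(f"wget -O {dest} {url}")
```

Run it from the project root; it only prints the commands and does not download anything itself.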

Download manually:
- Sky Timelapse: [VideoAE](https://huggingface.co/Yingqing/LVDM/blob/main/ae/ae_sky.ckpt), [LVDM_short](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/short_sky.ckpt), [LVDM_pred](TBD), [LVDM_interp](TBD), [dataset](https://github.com/weixiong-ur/mdgan)
- Taichi: [VideoAE](https://huggingface.co/Yingqing/LVDM/blob/main/ae/ae_taichi.ckpt), [LVDM_short](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/short_taichi.ckpt), [dataset](https://github.com/AliaksandrSiarohin/first-order-model/blob/master/data/taichi-loading/README.md)
- Text2Video: [model](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/t2v.ckpt)

---
## 💫 Inference
### Sample Short Videos
- unconditional generation

```
bash shellscripts/sample_lvdm_short.sh
```
- text to video generation
```
bash shellscripts/sample_lvdm_text2video.sh
```

### Sample Long Videos
```
bash shellscripts/sample_lvdm_long.sh
```

---
## 💫 Training

### Train video autoencoder
```
bash shellscripts/train_lvdm_videoae.sh
```
- Remember to set `PROJ_ROOT`, `EXPNAME`, `DATADIR`, and `CONFIG` before running the script.

### Train unconditional lvdm for short video generation
```
bash shellscripts/train_lvdm_short.sh
```
- Remember to set `PROJ_ROOT`, `EXPNAME`, `DATADIR`, `AEPATH`, and `CONFIG` before running the script.
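
A quick pre-flight check can catch a forgotten variable before a long training run starts. The sketch below is an illustration only: `REQUIRED` and `unset_variables` are hypothetical names, and it assumes the variables are exported as environment variables, which this README does not confirm (they may instead need to be edited inside the shell scripts).

```python
import os

# Variable names are taken from the notes in this README. Treating them as
# environment variables is an assumption of this sketch.
REQUIRED = {
    "shellscripts/train_lvdm_videoae.sh": [
        "PROJ_ROOT", "EXPNAME", "DATADIR", "CONFIG",
    ],
    "shellscripts/train_lvdm_short.sh": [
        "PROJ_ROOT", "EXPNAME", "DATADIR", "AEPATH", "CONFIG",
    ],
}


def unset_variables(script, env=None):
    """Return the required variable names that are missing or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[script] if not env.get(name)]


if __name__ == "__main__":
    # Report, per training script, which variables still need a value.
    for script in REQUIRED:
        print(script, "unset:", unset_variables(script))
```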

### Train unconditional lvdm for long video generation
```
# TBD
```

---
## 💫 Evaluation
```
bash shellscripts/eval_lvdm_short.sh
```
- Remember to set `DATACONFIG`, `FAKEPATH`, `REALPATH`, and `RESDIR` before running the script.

---

## 📃 Abstract
AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
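
The abstract names unconditional guidance but this README does not give its formula. As a rough illustration, the sketch below blends a conditional and an unconditional noise prediction in the style of classifier-free guidance. The function name `guided_noise`, the `scale` parameter, and the exact blending rule are assumptions, not the repository's implementation.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Blend two noise predictions element-wise.

    Assumed rule (classifier-free-guidance style, not confirmed by the README):
        eps = eps_uncond + scale * (eps_cond - eps_uncond)
    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates beyond the conditional prediction.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    eps_c = [1.0, 2.0]   # toy conditional prediction
    eps_u = [0.0, 1.0]   # toy unconditional prediction
    print(guided_noise(eps_c, eps_u, scale=2.0))
```

In a real sampler the two predictions would be tensors produced by the denoising network at each step; plain lists are used here only to keep the example dependency-free.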

## 🔮 Pipeline

---
## 😉 Citation

```
@article{he2022lvdm,
title={Latent Video Diffusion Models for High-Fidelity Long Video Generation},
author={Yingqing He and Tianyu Yang and Yong Zhang and Ying Shan and Qifeng Chen},
year={2022},
eprint={2211.13221},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

## 🤗 Acknowledgements
We built our code partially on [latent diffusion models](https://github.com/CompVis/latent-diffusion) and [TATS](https://github.com/SongweiGe/TATS). Thanks to the authors for sharing their awesome codebases! We also adopt Xintao Wang's [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) for upscaling our text-to-video generation results. Thanks for their wonderful work!
