{"id":13488949,"url":"https://github.com/YingqingHe/LVDM","last_synced_at":"2025-03-28T02:31:34.121Z","repository":{"id":127264188,"uuid":"569297988","full_name":"YingqingHe/LVDM","owner":"YingqingHe","description":"LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation","archived":false,"fork":false,"pushed_at":"2024-11-12T11:31:52.000Z","size":1032,"stargazers_count":452,"open_issues_count":17,"forks_count":17,"subscribers_count":28,"default_branch":"main","last_synced_at":"2024-11-12T12:27:43.757Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://yingqinghe.github.io/LVDM/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YingqingHe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-11-22T14:12:01.000Z","updated_at":"2024-11-12T11:31:56.000Z","dependencies_parsed_at":"2023-11-12T19:25:48.744Z","dependency_job_id":"dc3d98bc-7caf-45f4-a911-ac5afed3c0b4","html_url":"https://github.com/YingqingHe/LVDM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YingqingHe%2FLVDM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YingqingHe%2FLVDM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YingqingHe%2FLVDM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YingqingHe%2FLVDM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YingqingHe","download_url":"https://codeload.github.com/YingqingHe/LVDM/tar.gz/refs/heads/ma
in","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245957707,"owners_count":20700319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T18:01:24.774Z","updated_at":"2025-03-28T02:31:29.084Z","avatar_url":"https://github.com/YingqingHe.png","language":"Python","funding_links":[],"categories":["Video Generation","视频生成_补帧_摘要","Papers"],"sub_categories":["资源传输下载","Text-Video Generation"],"readme":"\n\u003cdiv align=\"center\"\u003e\n\n\u003ch2\u003e LVDM: \u003cspan style=\"font-size:12px\"\u003eLatent Video Diffusion Models for High-Fidelity Long Video Generation \u003c/span\u003e \u003c/h2\u003e \n\n  \u003ca href='https://arxiv.org/abs/2211.13221'\u003e\u003cimg src='https://img.shields.io/badge/ArXiv-2211.13221-red'\u003e\u003c/a\u003e \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u003ca href='https://yingqinghe.github.io/LVDM/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-Green'\u003e\u003c/a\u003e\n\n\n\u003cdiv\u003e\n    \u003ca href='https://github.com/YingqingHe' target='_blank'\u003eYingqing He \u003csup\u003e1\u003c/sup\u003e \u003c/a\u003e\u0026emsp;\n    \u003ca href='https://tianyu-yang.com/' target='_blank'\u003eTianyu Yang \u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://yzhang2016.github.io/' target='_blank'\u003eYong Zhang \u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://scholar.google.com/citations?hl=en\u0026user=4oXBp9UAAAAJ\u0026view_op=list_works\u0026sortby=pubdate' target='_blank'\u003eYing Shan 
\u003csup\u003e2\u003c/sup\u003e\u003c/a\u003e\u0026emsp;\n    \u003ca href='https://cqf.io/' target='_blank'\u003eQifeng Chen \u003csup\u003e1\u003c/sup\u003e\u003c/a\u003e\u0026emsp; \u003c/br\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\u003cdiv\u003e\n    \u003csup\u003e1\u003c/sup\u003e The Hong Kong University of Science and Technology \u0026emsp; \u003csup\u003e2\u003c/sup\u003e Tencent AI Lab \u0026emsp;\n\u003c/div\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\n\u003cb\u003eTL;DR: An efficient video diffusion model that can:\u003c/b\u003e  \n1️⃣ conditionally generate videos based on input text;  \n2️⃣ unconditionally generate videos with thousands of frames.\n\n\u003cbr\u003e\n\n\u003c/div\u003e\n\n\n## 🍻 Results\n### ☝️ Text-to-Video Generation\n\n\u003ctable class=\"center\"\u003e\n  \u003c!-- \u003ctd style=\"text-align:center;\" width=\"50\"\u003eInput Text\u003c/td\u003e --\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"A corgi is swimming fastly\"\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"astronaut riding a horse\"\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"A glass bead falling into water with a huge splash. Sunset in the background\"\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"A beautiful sunrise on mars. High definition, timelapse, dramaticcolors.\"\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"A bear dancing and jumping to upbeat music, moving his whole body.\"\u003c/td\u003e\n  \u003ctd style=\"text-align:center;\" width=\"170\"\u003e\"An iron man surfing in the sea. 
cartoon style\"\u003c/td\u003e\n  \u003ctr\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-001.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-002.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-003.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-007.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-005.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/t2v-004.gif width=\"170\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table \u003e\n\n### ✌️ Unconditional Long Video Generation (40 seconds)\n\u003ctable class=\"center\"\u003e\n  \u003ctd\u003e\u003cimg src=assets/sky-long-001.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/sky-long-002.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/sky-long-003.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/ucf-long-001.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/ucf-long-002.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctd\u003e\u003cimg src=assets/ucf-long-003.gif width=\"170\"\u003e\u003c/td\u003e\n  \u003ctr\u003e\n\u003c/tr\u003e\n\u003c/table \u003e\n\n## ⏳ TODO\n- [x] Release pretrained text-to-video generation models and inference code\n- [x] Release unconditional video generation models\n- [x] Release training code\n- [ ] Update training and sampling for long video generation\n\u003cbr\u003e\n\n---\n## ⚙️ Setup\n\n### Install Environment via Anaconda\n```bash\nconda create -n lvdm python=3.8.5\nconda activate lvdm\npip install -r requirements.txt\n```\n### Pretrained Models and Used Datasets\n\n\u003c!-- \u003cdiv style=\"text-indent:25px\"\u003e --\u003e\n\u003c!-- \u003cdetails\u003e\u003csummary\u003e\u003c/summary\u003e --\u003e\nDownload via linux commands:\n```\nmkdir -p models/ae\nmkdir -p models/lvdm_short\nmkdir -p models/t2v\n\n# sky 
timelapse\nwget -O models/ae/ae_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_sky.ckpt\nwget -O models/lvdm_short/short_sky.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_sky.ckpt  \n\n# taichi\nwget -O models/ae/ae_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/ae/ae_taichi.ckpt\nwget -O models/lvdm_short/short_taichi.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/short_taichi.ckpt\n\n# text2video\nwget -O models/t2v/model.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/t2v.ckpt\n```\n\u003c!-- \u003c/details\u003e\n\u003c/div\u003e --\u003e\n\u003c!-- - UCF-101: [dataset](https://www.crcv.ucf.edu/data/UCF101.php) --\u003e\n\u003c!-- [samples_short](TBD), [samples_long](TBD) --\u003e\n\nDownload manually:\n- Sky Timelapse: [VideoAE](https://huggingface.co/Yingqing/LVDM/blob/main/ae/ae_sky.ckpt), [LVDM_short](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/short_sky.ckpt), [LVDM_pred](TBD), [LVDM_interp](TBD), [dataset](https://github.com/weixiong-ur/mdgan)\n- Taichi: [VideoAE](https://huggingface.co/Yingqing/LVDM/blob/main/ae/ae_taichi.ckpt), [LVDM_short](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/short_taichi.ckpt), [dataset](https://github.com/AliaksandrSiarohin/first-order-model/blob/master/data/taichi-loading/README.md)\n- Text2Video: [model](https://huggingface.co/Yingqing/LVDM/blob/main/lvdm_short/t2v.ckpt)\n\n---\n## 💫 Inference \n### Sample Short Videos \n- unconditional generation\n\n```\nbash shellscripts/sample_lvdm_short.sh\n```\n- text to video generation\n```\nbash shellscripts/sample_lvdm_text2video.sh\n```\n\n### Sample Long Videos \n```\nbash shellscripts/sample_lvdm_long.sh\n```\n\n---\n## 💫 Training\n\u003c!-- tar -zxvf dataset/sky_timelapse.tar.gz -C /dataset/sky_timelapse --\u003e\n### Train video autoencoder\n```\nbash shellscripts/train_lvdm_videoae.sh \n```\n- remember to set `PROJ_ROOT`, `EXPNAME`, `DATADIR`, and 
`CONFIG`.\n\n### Train unconditional lvdm for short video generation\n```\nbash shellscripts/train_lvdm_short.sh\n```\n- remember to set `PROJ_ROOT`, `EXPNAME`, `DATADIR`, `AEPATH` and `CONFIG`.\n\n### Train unconditional lvdm for long video generation\n```\n# TBD\n```\n\n---\n## 💫 Evaluation\n```\nbash shellscripts/eval_lvdm_short.sh\n```\n- remember to set `DATACONFIG`, `FAKEPATH`, `REALPATH`, and `RESDIR`.\n---\n\n## 📃 Abstract\nAI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. 
Our code and models will be made publicly available.\n\u003cbr\u003e\n\n## 🔮 Pipeline\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=assets/framework.jpg /\u003e\n\u003c/p\u003e\n\n---\n## 😉 Citation\n\n```\n@article{he2022lvdm,\n      title={Latent Video Diffusion Models for High-Fidelity Long Video Generation}, \n      author={Yingqing He and Tianyu Yang and Yong Zhang and Ying Shan and Qifeng Chen},\n      year={2022},\n      eprint={2211.13221},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n## 🤗 Acknowledgements\nWe built our code partially based on [latent diffusion models](https://github.com/CompVis/latent-diffusion) and [TATS](https://github.com/SongweiGe/TATS). Thanks to the authors for sharing their awesome codebases! We also adopt Xintao Wang's [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) for upscaling our text-to-video generation results. Thanks for their wonderful work!","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYingqingHe%2FLVDM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYingqingHe%2FLVDM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYingqingHe%2FLVDM/lists"}