https://github.com/showlab/Awesome-Video-Diffusion
A curated list of recent diffusion models for video generation, editing, restoration, understanding, etc.
- Host: GitHub
- URL: https://github.com/showlab/Awesome-Video-Diffusion
- Owner: showlab
- Created: 2023-04-17T10:34:17.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-30T14:56:39.000Z (about 1 year ago)
- Last Synced: 2024-05-23T01:06:03.368Z (12 months ago)
- Topics: awesome, diffusion-models, text-to-motion, text-to-video, video-editing, video-generation, video-restoration, video-understanding
- Homepage:
- Size: 222 KB
- Stars: 2,554
- Watchers: 119
- Forks: 155
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-awesome-artificial-intelligence - Awesome Video Diffusion (Computer Vision)
- ultimate-awesome - Awesome-Video-Diffusion - A curated list of recent diffusion models for video generation, editing, and various other applications. (Other Lists / Julia Lists)
README
# Awesome Video Diffusion [](https://github.com/sindresorhus/awesome)
A curated list of recent diffusion models for video generation, editing, restoration, understanding, nerf, etc.
(Source: Make-A-Video, Tune-A-Video, and Fate/Zero.)

## Table of Contents
- [Open-source Toolboxes and Foundation Models](#open-source-toolboxes-and-foundation-models)
- [Evaluation Benchmarks and Metrics](#evaluation-benchmarks-and-metrics)
- [Commercial Product](#commercial-product)
- [Video Generation](#video-generation)
- [Efficient Video Generation](#efficient-video-generation)
- [Controllable Video Generation](#controllable-video-generation)
- [Character Customization](#character-customization)
- [Motion Customization](#motion-customization)
- [Long Video / Film Generation](#long-video--film-generation)
- [Video Generation with 3D/Physical Prior](#video-generation-with-3dphysical-prior)
- [Video Editing](#video-editing)
- [Human or Subject Motion](#human-or-subject-motion)
- [Video Enhancement and Restoration](#video-enhancement-and-restoration)
- [Audio Synthesis for Video](#audio-synthesis-for-video)
- [Talking Head Generation](#talking-head-generation)
- [Human/AI Feedback for Video Generation](#humanai-feedback-for-video-generation)
- [Policy Learning](#policy-learning)
- [Virtual Try-On](#virtual-try-on)
- [3D](#3d)
- [4D](#4d)
- [Game Generation](#game-generation)
- [AI Safety](#ai-safety)
- [Rendering with Virtual Engine](#rendering-with-virtual-engine)
- [Open-World Model](#open-world-model)
- [Video Understanding](#video-understanding)
- [Healthcare and Biology](#healthcare-and-biology)
- [Other Applications](#other-applications)

### Open-source Toolboxes and Foundation Models
+ [Wan-Video](https://github.com/Wan-Video/Wan2.1)
[](https://github.com/Wan-Video/Wan2.1)
[](https://wanxai.com/)+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio)
[](https://github.com/modelscope/DiffSynth-Studio)+ [Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model](https://arxiv.org/abs/2502.10248)
[](https://github.com/stepfun-ai/Step-Video-T2V)
[](https://arxiv.org/abs/2502.10248)
[](https://yuewen.cn/videos)+ [Cosmos](https://github.com/NVIDIA/Cosmos)
[](https://github.com/NVIDIA/Cosmos)
[](https://arxiv.org/abs/2501.03575)
[](https://research.nvidia.com/labs/dir/cosmos1/)+ [LTX-Video](https://github.com/Lightricks/LTX-Video)
[](https://github.com/Lightricks/LTX-Video)+ [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://arxiv.org/abs/2412.03603)
[](https://github.com/Tencent/HunyuanVideo)+ [VideoTuna](https://videoverses.github.io/videotuna/)
[](https://github.com/VideoVerses/VideoTuna)
[](https://videoverses.github.io/videotuna/)+ [Allegro](https://rhymes.ai/blog-details/allegro-advanced-video-generation-model)
[](https://github.com/rhymes-ai/Allegro)
[](https://arxiv.org/abs/2410.15458)
[](https://rhymes.ai/blog-details/allegro-advanced-video-generation-model)+ [Mochi 1](https://www.genmo.ai/blog)
[](https://github.com/genmoai/mochi)
[](https://www.genmo.ai/blog)+ [Movie Gen: A Cast of Media Foundation Models](https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/)
[](https://github.com/facebookresearch/MovieGenBench)
[](https://arxiv.org/pdf/2410.13720)
[](https://www.youtube.com/playlist?list=PL86eLlsPNfyi27GSizYjinpYxp7gEl5K8)+ [Pyramidal Flow Matching for Efficient Video Generative Modeling](https://pyramid-flow.github.io/)
[](https://github.com/jy0205/Pyramid-Flow)
[](https://pyramid-flow.github.io/)+ [Open-Sora-Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan)
[](https://github.com/PKU-YuanGroup/Open-Sora-Plan)
[](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.0.0.md)+ [Open-Sora](https://github.com/hpcaitech/Open-Sora)
[](https://github.com/hpcaitech/Open-Sora)
[](https://github.com/hpcaitech/Open-Sora/blob/main/docs/zh_CN/README.md)+ [Stable Video Diffusion](https://github.com/Stability-AI/generative-models)
[](https://github.com/Stability-AI/generative-models)
[](https://stability.ai/news/stable-video-diffusion-open-ai-video-model)+ [Show-1](https://github.com/showlab/Show-1)
[](https://github.com/showlab/Show-1)
[](https://showlab.github.io/Show-1/)+ [Hotshot-XL (text-to-GIF)](https://github.com/hotshotco/Hotshot-XL)
[](https://github.com/hotshotco/Hotshot-XL)+ [zeroscope_v2](https://huggingface.co/cerspense/zeroscope_v2_576w)
[](https://huggingface.co/cerspense/zeroscope_v2_576w)
[](https://huggingface.co/cerspense/zeroscope_v2_XL)+ [I2VGen-XL (image-to-video / video-to-video)](https://modelscope.cn/models/damo/Image-to-Video/summary)
[-9cf)](https://modelscope.cn/models/damo/Image-to-Video/summary)
[-9cf)](https://modelscope.cn/models/damo/Video-to-Video/summary)+ [text-to-video-synthesis-colab](https://github.com/camenduru/text-to-video-synthesis-colab)
[](https://github.com/camenduru/text-to-video-synthesis-colab)+ [VideoCrafter: A Toolkit for Text-to-Video Generation and Editing](https://github.com/VideoCrafter/VideoCrafter)
[](https://github.com/VideoCrafter/VideoCrafter)+ [ModelScope (Text-to-video synthesis)](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
[](https://github.com/modelscope/modelscope)+ [Diffusers (Text-to-video synthesis)](https://huggingface.co/docs/diffusers/main/en/api/pipelines/text_to_video#texttovideo-synthesis) (see the usage sketch after this list)
[](https://github.com/huggingface/diffusers)+ [Wunjo CE (Video Generation and Editing)](https://github.com/wladradchenko/wunjo.wladradchenko.ru)
[](https://github.com/wladradchenko/wunjo.wladradchenko.ru)
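
For a quick start with the toolboxes above, the sketch below shows minimal text-to-video sampling through the Diffusers pipeline listed above. It is a hedged example, not official reference code: it assumes a CUDA GPU, the ModelScope checkpoint `damo-vilab/text-to-video-ms-1.7b` on Hugging Face, and a recent `diffusers` release (older releases expose the frames as `result.frames` rather than `result.frames[0]`).

```python
# Minimal text-to-video sampling sketch with Hugging Face Diffusers.
# Assumptions: a CUDA GPU, `pip install diffusers transformers accelerate`,
# and the ModelScope T2V checkpoint "damo-vilab/text-to-video-ms-1.7b".
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "An astronaut riding a horse on Mars, cinematic lighting"
result = pipe(prompt, num_inference_steps=25, num_frames=16)
frames = result.frames[0]  # older diffusers releases return the frames directly

export_to_video(frames, "astronaut.mp4", fps=8)  # writes a short clip to disk
```

Most of the other open-source toolboxes above (e.g., VideoCrafter, CogVideoX, Open-Sora) follow the same prompt-in, frames-out pattern with their own loaders and configs; see each repo for the exact entry point.
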
### Evaluation Benchmarks and Metrics

+ [VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness](https://arxiv.org/abs/2503.21755) (Mar., 2025)
[](https://github.com/Vchitect/VBench)
[](https://arxiv.org/abs/2503.21755)
[](https://vchitect.github.io/VBench-2.0-project/)+ [Impossible Videos](https://arxiv.org/abs/2503.14378) (Mar., 2025)
[](https://github.com/showlab/Impossible-Videos)
[](https://arxiv.org/abs/2503.14378)
[](https://showlab.github.io/Impossible-Videos/)
+ [MEt3R: Measuring Multi-View Consistency in Generated Images](https://geometric-rl.mpi-inf.mpg.de/met3r/static/assets/met3r.pdf) (Jan., 2025)
[](https://github.com/mohammadasim98/MEt3R)
[](https://geometric-rl.mpi-inf.mpg.de/met3r/static/assets/met3r.pdf)
[](https://geometric-rl.mpi-inf.mpg.de/met3r/)+ [Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation](https://arxiv.org/abs/2412.16211) (Dec., 2024)
[](https://github.com/ypwang61/StoryEval/tree/main)
[](https://arxiv.org/abs/2412.16211)
[](https://ypwang61.github.io/project/StoryEval/)+ [Evaluation Agent, Efficient and Promptable Evaluation Framework for Visual Generative Models](https://arxiv.org/abs/2412.09645) (Dec., 2024)
[](https://github.com/Vchitect/Evaluation-Agent)
[](https://arxiv.org/abs/2412.09645)
[](https://vchitect.github.io/Evaluation-Agent-project/)+ [Frechet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos](https://arxiv.org/pdf/2407.16124) (Jun., 2024)
[](https://github.com/DSL-Lab/FVMD-frechet-video-motion-distance)
[](https://arxiv.org/pdf/2407.16124)
[](https://pypi.org/project/fvmd/1.0.0/)
+ [T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation](https://arxiv.org/abs/2407.14505) (Jun., 2024)
[](https://github.com/KaiyueSun98/T2V-CompBench)
[](https://arxiv.org/abs/2407.14505)
[](https://t2v-compbench.github.io/)
+ [ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation](https://arxiv.org/abs/2406.18522) (NeurIPS, 2024)
[](https://github.com/PKU-YuanGroup/ChronoMagic-Bench)
[](https://arxiv.org/abs/2406.18522)
[](https://pku-yuangroup.github.io/ChronoMagic-Bench/)+ [PEEKABOO: Interactive Video Generation via Masked-Diffusion](https://arxiv.org/abs/2312.07509) (CVPR, 2024)
[](https://github.com/microsoft/Peekaboo)
[](https://arxiv.org/abs/2312.07509)
[](https://jinga-lala.github.io/projects/Peekaboo/)+ [T2VScore: Towards A Better Metric for Text-to-Video Generation](https://arxiv.org/abs/2401.07781) (Jan., 2024)
[](https://github.com/showlab/T2VScore)
[](https://arxiv.org/abs/2401.07781)
[](https://showlab.github.io/T2VScore/)+ [StoryBench: A Multifaceted Benchmark for Continuous Story Visualization](https://arxiv.org/abs/2308.11606) (NeurIPS, 2023)
[](https://github.com/google/storybench)
[](https://arxiv.org/abs/2308.11606)+ [VBench: Comprehensive Benchmark Suite for Video Generative Models](https://arxiv.org/abs/2311.17982) (Nov., 2023)
[](https://github.com/Vchitect/VBench?tab=readme-ov-file)
[](https://arxiv.org/abs/2311.17982)
[](https://vchitect.github.io/VBench-project/)+ [FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation](https://arxiv.org/abs/2311.01813) (Nov., 2023)
[](https://github.com/llyx97/FETV)
[](https://arxiv.org/abs/2311.01813)+ [EvalCrafter: Benchmarking and Evaluating Large Video Generation Models](https://arxiv.org/abs/2310.11440) (Oct., 2023)
[](https://github.com/EvalCrafter/EvalCrafter)
[](https://arxiv.org/abs/2310.11440)
[](https://evalcrafter.github.io/)
[](https://huggingface.co/datasets/RaphaelLiu/EvalCrafter_T2V_Dataset)+ [Evaluation of Text-to-Video Generation Models: A Dynamics Perspective](https://arxiv.org/pdf/2407.01094) (Jul., 2024)
[](https://github.com/MingXiangL/DEVIL)
[](https://arxiv.org/pdf/2407.01094)+ [VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models](https://arxiv.org/abs/2403.06098) (May., 2024)
[](https://github.com/WangWenhao0716/VidProM)
[](https://arxiv.org/abs/2403.06098)
[](https://github.com/WangWenhao0716/VidProM)
[](https://vidprom.github.io/)
+ [Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers](https://arxiv.org/abs/2402.19479) (CVPR, 2024)
[](https://github.com/snap-research/Panda-70M)
[](https://arxiv.org/abs/2402.19479)
[](https://github.com/snap-research/Panda-70M)
[](https://snap-research.github.io/Panda-70M/)+ [ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects](https://openaccess.thecvf.com/content/CVPR2023/html/Toschi_ReLight_My_NeRF_A_Dataset_for_Novel_View_Synthesis_and_CVPR_2023_paper.html) (CVPR, 2023)
[](https://github.com/eyecan-ai/rene)
[](https://openaccess.thecvf.com/content/CVPR2023/html/Toschi_ReLight_My_NeRF_A_Dataset_for_Novel_View_Synthesis_and_CVPR_2023_paper.html)
[](https://eyecan-ai.github.io/rene/)
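
Many of the benchmarks above include some flavor of CLIP-based text-video alignment among their metrics. The sketch below is a generic, illustrative version of that idea (it is not the official code of VBench, EvalCrafter, T2VScore, or any other benchmark listed): sample frames from a generated video, encode frames and prompt with CLIP, and average the per-frame cosine similarity. The model name and function name are assumptions for illustration only.

```python
# Generic CLIP-based text-frame alignment score (illustrative sketch only;
# not the official implementation of any benchmark above).
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_text_video_score(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """frames: list of PIL.Image sampled from a generated video."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (num_frames, d)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (1, d)
    # Average cosine similarity between the prompt and every sampled frame.
    return (img @ txt.T).mean().item()
```

The full benchmarks above go well beyond this single number (motion quality, temporal consistency, compositionality, human preference, etc.); this is only the simplest common building block.

### Commercial Product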
+ [Veo 2](https://deepmind.google/technologies/veo/veo-2/) ([Google](https://deepmind.google/technologies/veo/veo-2/))
[](https://deepmind.google/technologies/veo/veo-2/)+ [Kling](https://klingai.com/) ([KuaiShou](https://www.kuaishou.com/en))
[](https://klingai.com/)+ [Gen 3](https://app.runwayml.com/login) ([Runway](https://runwayml.com/))
[](https://app.runwayml.com/login)+ [Dream Machine](https://lumalabs.ai/dream-machine) ([Luma AI](https://lumalabs.ai/dream-machine))
[](https://lumalabs.ai/dream-machine)+ [Sora](https://sora.com/) ([OpenAI](https://openai.com/))
[](https://sora.com/)+ [Wunjo](https://wunjo.online/) ([WR](https://wladradchenko.ru/en#products))
[](https://wunjo.online/)

### Video Generation
+ [Aligning Text-to-Video Generation Models with Prompt Optimization](https://arxiv.org/abs/2503.20491) (Mar., 2025)
[](https://github.com/thu-coai/VPO/tree/main)
[](https://arxiv.org/abs/2503.20491)+ [Target-Aware Video Diffusion Models](https://arxiv.org/abs/2503.18950) (Mar., 2025)
[](https://github.com/taeksuu/tavid)
[](https://arxiv.org/abs/2503.18950)
[](https://taeksuu.github.io/tavid/)+ [MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation](https://arxiv.org/abs/2503.14428) (Mar., 2025)
[](https://github.com/Hong-yu-Zhang/MagicComp)
[](https://arxiv.org/pdf/2503.14428)
[](https://hong-yu-zhang.github.io/MagicComp-Page/)
+ [Video-T1: Test-Time Scaling for Video Generation](https://arxiv.org/abs/2503.18942) (Mar., 2025)
[](https://github.com/liuff19/Video-T1)
[](https://arxiv.org/pdf/2503.18942)
[](https://liuff19.github.io/Video-T1/)+ [Temporal Regularization Makes Your Video Generator Stronger](https://arxiv.org/abs/2503.15417) (Mar., 2025)
[](https://arxiv.org/pdf/2503.15417)
[](https://haroldchen19.github.io/FluxFlow/)
+ [VACE: All-in-One Video Creation and Editing](https://arxiv.org/pdf/2503.07598) (Mar., 2025)
[](https://github.com/ali-vilab/VACE)
[](https://arxiv.org/pdf/2503.07598)
[](https://ali-vilab.github.io/VACE-Page/)+ [RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers](https://arxiv.org/abs/2502.15894) (Feb., 2025)
[](https://github.com/thu-ml/RIFLEx)
[](https://arxiv.org/abs/2502.15894)
[](https://riflex-video.github.io/)+ [DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation](https://arxiv.org/abs/2502.11897) (Feb., 2025)
[](https://arxiv.org/abs/2502.11897)+ [Magic 1-For-1: Generating One Minute Video Clips within One Minute](https://arxiv.org/abs/2502.07701) (Feb., 2025)
[](https://github.com/DA-Group-PKU/Magic-1-For-1)
[](https://arxiv.org/abs/2502.07701)
[](https://magic-141.github.io/Magic-141/)+ [Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT](https://arxiv.org/abs/2502.06782) (Feb., 2025)
[](https://github.com/Alpha-VLLM/Lumina-Video)
[](https://arxiv.org/abs/2502.06782)+ [RepVideo: Rethinking Cross-Layer Representation for Video Generation](https://arxiv.org/pdf/2501.08994) (Jan., 2025)
[](https://github.com/Vchitect/RepVideo)
[](https://arxiv.org/pdf/2501.08994)
[](https://vchitect.github.io/RepVid-Webpage/)+ [Large Motion Video Autoencoding with Cross-modal Video VAE](https://arxiv.org/abs/2412.17805) (Dec., 2024)
[](https://arxiv.org/abs/2412.17805)
[](https://github.com/VideoVerses/VideoVAEPlus)+ [MotiF: Making Text Count in Image Animation with Motion Focal Loss](https://arxiv.org/abs/2412.16153) (Dec., 2024)
[](https://arxiv.org/abs/2412.16153)
[](https://wang-sj16.github.io/motif/#BibTeX)+ [VideoDPO: Omni-Preference Alignment for Video Diffusion Generation](https://arxiv.org/pdf/2412.14167) (Dec., 2024)
[](https://arxiv.org/pdf/2412.14167)
[](https://videodpo.github.io/)+ [Autoregressive Video Generation without Vector Quantization](https://arxiv.org/abs/2412.14169) (Dec., 2024)
[](https://github.com/baaivision/NOVA)
[](https://arxiv.org/abs/2412.14169)+ [AniDoc: Animation Creation Made Easier](https://arxiv.org/pdf/2412.14173) (Dec., 2024)
[](https://github.com/yihao-meng/AniDoc)
[](https://arxiv.org/pdf/2412.14173)
[](https://yihao-meng.github.io/AniDoc_demo/)+ [Video Diffusion Transformers are In-Context Learners](https://arxiv.org/abs/2412.10783) (Dec., 2024)
[](https://github.com/feizc/Video-In-Context)
[](https://arxiv.org/abs/2412.10783)
[](https://huggingface.co/feizhengcong/Video-In-Context)+ [Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation](https://arxiv.org/abs/2412.06016) (Dec., 2024 | CVPR 2025)
[](https://arxiv.org/abs/2412.06016)
[](https://hyeonho99.github.io/track4gen/)+ [Instructional Video Generation](https://arxiv.org/abs/2412.04189) (Dec., 2024)
[](https://arxiv.org/abs/2412.04189)
[](https://excitedbutter.github.io/Instructional-Video-Generation/)+ [Mimir: Improving Video Diffusion Models for Precise Text Understanding](https://arxiv.org/abs/2412.03085) (Dec., 2024)
[](https://arxiv.org/abs/2412.03085)
[](https://lucaria-academy.github.io/Mimir/)+ [Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling](https://arxiv.org/abs/2411.18664) (Dec., 2024)
[](https://github.com/junhahyung/STGuidance)
[](https://arxiv.org/abs/2411.18664)
[](https://junhahyung.github.io/STGuidance)+ [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) (Nov., 2024)
[](https://github.com/PKU-YuanGroup/ConsisID)
[](https://arxiv.org/abs/2411.17440)
[](https://pku-yuangroup.github.io/ConsisID/)
[](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)+ [WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model](https://arxiv.org/abs/2411.17459) (Nov., 2024)
[](https://github.com/PKU-YuanGroup/WF-VAE)
[](https://arxiv.org/abs/2411.17459)
[](https://video-repair.github.io/)+ [VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement](https://arxiv.org/pdf/2411.15115) (Nov., 2024)
[](https://github.com/daeunni/VideoRepair)
[](https://arxiv.org/pdf/2411.15115)
[](https://video-repair.github.io/)+ [Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning](https://arxiv.org/abs/2410.24219) (Oct., 2024 | NeurIPS 2024)
[](https://github.com/PR-Ryan/DEMO)
[](https://arxiv.org/abs/2410.24219)
[](https://pr-ryan.github.io/DEMO-project/)+ [Improved Video VAE for Latent Video Diffusion Model](https://arxiv.org/abs/2411.06449) (Oct., 2024)
[](https://arxiv.org/abs/2411.06449)
[](https://wpy1999.github.io/IV-VAE/)+ [VideoAgent: Self-Improving Video Generation](https://arxiv.org/abs/2410.10076) (Oct., 2024)
[](https://arxiv.org/abs/2410.10076)+ [T2V-Turbo-v2: Enhancing Video Generation Model Post-Training Through Data, Reward, and Conditional Guidance Design](https://arxiv.org/pdf/2410.05677) (Oct, 2024)
[](https://github.com/Ji4chenLi/t2v-turbo)
[](https://arxiv.org/pdf/2410.05677)
[](https://t2v-turbo-v2.github.io/)+ [Progressive Autoregressive Video Diffusion Models](https://arxiv.org/abs/2410.08151) (Oct., 2024)
[](https://github.com/desaixie/pa_vdm)
[](https://arxiv.org/abs/2410.08151)
[](https://desaixie.github.io/pa-vdm/)+ [Real-Time Video Generation with Pyramid Attention Broadcast](https://arxiv.org/abs/2408.12588) (Aug., 2024)
[](https://github.com/NUS-HPC-AI-Lab/VideoSys)
[](https://arxiv.org/abs/2408.12588)+ [xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations](https://arxiv.org/abs/2408.12590) (Aug., 2024)
[](https://github.com/SalesforceAIResearch/xgen-videosyn)
[](https://arxiv.org/abs/2408.12590)+ [CogVideoX: Text-to-video generation](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) (Aug., 2024)
[](https://github.com/THUDM/CogVideo)
[](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf)+ [FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention](https://arxiv.org/abs/2407.19918) (Aug., 2024)
[](https://arxiv.org/abs/2407.19918)
[](https://yulu.net.cn/freelong/)
+ [VEnhancer: Generative Space-Time Enhancement for Video Generation](https://arxiv.org/abs/2407.07667) (Jul., 2024)
[](https://github.com/Vchitect/VEnhancer)
[](https://arxiv.org/abs/2407.07667)
[](https://vchitect.github.io/VEnhancer-project/)+ [Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models](https://arxiv.org/abs/2407.08701) (Jul., 2024)
[](https://github.com/open-mmlab/Live2Diff)
[](https://arxiv.org/abs/2407.08701)
[](https://live2diff.github.io/)+ [Video Diffusion Alignment via Reward Gradient](https://arxiv.org/abs/2407.08737) (Jul., 2024)
[](https://github.com/mihirp1998/VADER)
[](https://arxiv.org/abs/2407.08737)
[](https://vader-vid.github.io/)+ [ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning](https://arxiv.org/abs/2406.14130) (Jun., 2024)
[](https://arxiv.org/abs/2406.14130)+ [MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance](https://arxiv.org/abs/2406.19680) (Jul., 2024)
[](https://github.com/Tencent/MimicMotion)
[](https://arxiv.org/abs/2406.19680)
[](https://tencent.github.io/MimicMotion/)+ [Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model](https://arxiv.org/abs/2406.15735) (Jun., 2024)
[](https://github.com/thu-ml/cond-image-leakage/tree/main?tab=readme-ov-file)
[](https://arxiv.org/abs/2406.15735)
[](https://cond-image-leak.github.io/)+ [Video-Infinity: Distributed Long Video Generation](https://arxiv.org/abs/2406.16260) (Jun., 2024)
[](https://arxiv.org/abs/2406.16260)
[](https://video-infinity.tanzhenxiong.com/)+ [MotionBooth: Motion-Aware Customized Text-to-Video Generation](https://arxiv.org/abs/2406.17758) (Jun., 2024)
[](https://arxiv.org/abs/2406.17758)
[](https://jianzongwu.github.io/projects/motionbooth/)+ [Text-Animator: Controllable Visual Text Video Generation](https://arxiv.org/abs/2406.17777) (Jun., 2024)
[](https://arxiv.org/abs/2406.17777)
[](https://laulampaul.github.io/text-animator.html)+ [UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation](https://arxiv.org/abs/2406.01188) (Jun., 2024)
[](https://arxiv.org/abs/2406.01188)
[](https://unianimate.github.io/)+ [T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback](https://arxiv.org/abs/2405.18750) (May, 2024)
[](https://arxiv.org/abs/2405.18750)
[](https://t2v-turbo.github.io/)+ [Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control](https://arxiv.org/abs/2405.17414) (May, 2024)
[](https://arxiv.org/abs/2405.17414)
[](https://collaborativevideodiffusion.github.io/)+ [Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer](https://arxiv.org/abs/2405.17405) (May, 2024)
[](https://arxiv.org/abs/2405.17405)
[](https://human4dit.github.io/)+ [FIFO-Diffusion: Generating Infinite Videos from Text without Training](https://arxiv.org/abs/2405.11473) (May, 2024)
[](https://github.com/jjihwan/FIFO-Diffusion_public)
[](https://arxiv.org/abs/2405.11473)
[](https://jjihwan.github.io/projects/FIFO-Diffusion)+ [Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models](https://arxiv.org/abs/2405.04233) (May, 2024)
[](https://arxiv.org/abs/2405.04233)
[](https://www.shengshu-ai.com/vidu)+ [Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://arxiv.org/abs/2405.05945) (May, 2024)
[](https://github.com/Alpha-VLLM/Lumina-T2X)
[](https://arxiv.org/abs/2405.05945)+ [StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation](https://arxiv.org/abs/2405.01434) (May, 2024)
[](https://github.com/HVision-NKU/StoryDiffusion)
[](https://arxiv.org/abs/2405.01434)
[](https://storydiffusion.github.io/)+ [TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models](https://arxiv.org/abs/2404.16306) (CVPR 2024)
[](https://github.com/merlresearch/TI2V-Zero)
[](https://arxiv.org/abs/2404.16306)
[](https://merl.com/research/highlights/TI2V-Zero)+ [ID-Animator: Zero-Shot Identity-Preserving Human Video Generation](https://arxiv.org/abs/2404.15275) (Apr., 2024)
[](https://github.com/ID-Animator/ID-Animator)
[](https://arxiv.org/abs/2404.15275)
[](https://id-animator.github.io/)+ [AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment](https://arxiv.org/abs/2404.04946) (Apr., 2024)
[](https://github.com/JustinXu0/AnimateZoo)
[](https://arxiv.org/abs/2404.04946)
[](https://justinxu0.github.io/AnimateZoo/)+ [MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators](https://arxiv.org/abs/2404.05014) (Apr., 2024)
[](https://github.com/PKU-YuanGroup/MagicTime)
[](https://arxiv.org/abs/2404.05014)
[](https://pku-yuangroup.github.io/MagicTime/)
[](https://huggingface.co/datasets/BestWishYsh/ChronoMagic)+ [TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models](https://arxiv.org/abs/2403.17005) (CVPR 2024)
[](https://arxiv.org/abs/2403.17005)
[](https://trip-i2v.github.io/TRIP/)+ [VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis](https://arxiv.org/abs/2403.13501) (Mar., 2024)
[](https://arxiv.org/abs/2403.13501)
[](https://yumengli007.github.io/VSTAR/)+ [StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text](https://arxiv.org/abs/2403.14773) (Mar., 2024)
[](https://github.com/Picsart-AI-Research/StreamingT2V)
[](https://arxiv.org/abs/2403.14773)
[](https://streamingt2v.github.io/)+ [Intention-driven Ego-to-Exo Video Generation](https://arxiv.org/abs/2403.09194) (Mar., 2024)
[](https://arxiv.org/abs/2403.09194)+ [VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models](https://arxiv.org/abs/2403.05438) (Mar., 2024)
[](https://github.com/YBYBZhang/VideoElevator)
[](https://arxiv.org/abs/2403.05438)
[](https://videoelevator.github.io/)+ [Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis](https://arxiv.org/abs/2402.14797) (Feb., 2024)
[](https://arxiv.org/abs/2402.14797)
[](https://snap-research.github.io/snapvideo/)+ [One-Shot Motion Customization of Text-to-Video Diffusion Models](https://arxiv.org/abs/2402.14780) (Feb., 2024)
[](https://arxiv.org/abs/2402.14780)
[](https://anonymous-314.github.io/)+ [Magic-Me: Identity-Specific Video Customized Diffusion](https://arxiv.org/abs/2402.09368) (Feb., 2024)
[](https://github.com/Zhen-Dong/Magic-Me)
[](https://arxiv.org/abs/2402.09368)
[](https://magic-me-webpage.github.io/)+ [ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation](https://arxiv.org/abs/2402.04324) (Feb., 2024)
[](https://github.com/TIGER-AI-Lab/ConsistI2V)
[](https://arxiv.org/abs/2402.04324)
[](https://tiger-ai-lab.github.io/ConsistI2V/)+ [Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion](https://arxiv.org/abs/2402.03162) (Feb., 2024)
[](https://arxiv.org/abs/2402.03162)
[](https://direct-a-video.github.io/)+ [Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization](https://arxiv.org/abs/2402.03161) (Feb., 2024)
[](https://github.com/jy0205/LaVIT)
[](https://arxiv.org/abs/2402.03161)
[](https://video-lavit.github.io/)+ [Boximator: Generating Rich and Controllable Motions for Video Synthesis](https://arxiv.org/abs/2402.01566) (Feb., 2024)
[](https://arxiv.org/abs/2402.01566)
[](https://boximator.github.io/)+ [Lumiere: A Space-Time Diffusion Model for Video Generation](https://arxiv.org/abs/2401.12945) (Jan., 2024)
[](https://arxiv.org/abs/2401.12945)
[](https://lumiere-video.github.io/)+ [ActAnywhere: Subject-Aware Video Background Generation](https://arxiv.org/abs/2401.10822) (Jan., 2024)
[](https://arxiv.org/abs/2401.10822)
[](https://actanywhere.github.io/)+ [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985) (Jan., 2024)
[](https://github.com/JeffWang987/WorldDreamer)
[](https://arxiv.org/abs/2401.09985)
[](https://world-dreamer.github.io/)+ [CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects](https://arxiv.org/abs/2401.09962) (Jan., 2024)
[](https://arxiv.org/abs/2401.09962)
[](https://kyfafyd.wang/projects/customvideo/)+ [UniVG: Towards UNIfied-modal Video Generation](https://arxiv.org/abs/2401.09084) (Jan., 2024)
[](https://arxiv.org/abs/2401.09084)
[](https://univg-baidu.github.io/)+ [VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models](https://arxiv.org/abs/2401.09047) (Jan., 2024)
[](https://github.com/AILab-CVC/VideoCrafter)
[](https://arxiv.org/abs/2401.09047)
[](https://ailab-cvc.github.io/videocrafter2/)+ [360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model](https://arxiv.org/abs/2401.06578) (Jan., 2024)
[](https://arxiv.org/abs/2401.06578)
[](https://akaneqwq.github.io/360DVD/)+ [RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks](https://arxiv.org/abs/2401.06035) (Jan., 2024)
[](https://arxiv.org/abs/2401.06035)+ [Latte: Latent Diffusion Transformer for Video Generation](https://arxiv.org/abs/2401.03048) (Jan., 2024)
[](https://github.com/Vchitect/Latte)
[](https://arxiv.org/abs/2401.03048)
[](https://maxin-cn.github.io/latte_project/)+ [MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation](https://arxiv.org/abs/2401.04468) (Jan., 2024)
[](https://arxiv.org/abs/2401.04468)
[](https://magicvideov2.github.io/)+ [VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM](https://arxiv.org/abs/2401.01256) (Jan., 2024)
[](https://arxiv.org/abs/2401.01256)
[](https://videodrafter.github.io/)+ [FlashVideo: A Framework for Swift Inference in Text-to-Video Generation](https://arxiv.org/abs/2401.00869) (Dec., 2023)
[](https://arxiv.org/abs/2401.00869)+ [I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models](https://arxiv.org/abs/2312.16693) (Dec., 2023)
[](https://arxiv.org/abs/2312.16693)+ [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos](https://arxiv.org/abs/2312.15770) (Dec., 2023)
[](https://arxiv.org/abs/2312.15770)
[](https://tf-t2v.github.io/)+ [PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) (Dec., 2023)
[](https://github.com/open-mmlab/PIA)
[](https://arxiv.org/abs/2312.13964)
[](https://pi-animator.github.io/)+ [VideoPoet: A Large Language Model for Zero-Shot Video Generation](https://arxiv.org/abs/2312.14125) (Dec., 2023)
[](https://arxiv.org/abs/2312.14125)
[](https://sites.research.google/videopoet/)+ [InstructVideo: Instructing Video Diffusion Models with Human Feedback](https://arxiv.org/abs/2312.12490) (Dec., 2023)
[](https://github.com/damo-vilab/i2vgen-xl/blob/main/doc/InstructVideo.md)
[](https://arxiv.org/abs/2312.12490)
[](https://instructvideo.github.io/)+ [VideoLCM: Video Latent Consistency Model](https://arxiv.org/abs/2312.09109) (Dec., 2023)
[](https://arxiv.org/abs/2312.09109)+ [PEEKABOO: Interactive Video Generation via Masked-Diffusion](https://arxiv.org/abs/2312.07509) (Dec., 2023)
[](https://github.com/microsoft/Peekaboo)
[](https://arxiv.org/abs/2312.07509)
[](https://jinga-lala.github.io/projects/Peekaboo/)+ [FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) (Dec., 2023)
[](https://github.com/TianxingWu/FreeInit)
[](https://arxiv.org/abs/2312.07537)
[](https://tianxingwu.github.io/pages/FreeInit/)+ [Photorealistic Video Generation with Diffusion Models](https://arxiv.org/abs/2312.06662) (Dec., 2023)
[](https://arxiv.org/abs/2312.06662)
[](https://walt-video-diffusion.github.io/)+ [Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution](https://arxiv.org/abs/2312.06640) (Dec., 2023)
[](https://github.com/sczhou/Upscale-A-Video)
[](https://arxiv.org/abs/2312.06640)
[](https://shangchenzhou.com/projects/upscale-a-video/)+ [DreaMoving: A Human Video Generation Framework based on Diffusion Models](https://arxiv.org/abs/2312.05107) (Dec., 2023)
[](https://github.com/dreamoving/dreamoving-project)
[](https://arxiv.org/abs/2312.05107)
[](https://dreamoving.github.io/dreamoving/)+ [MotionCrafter: One-Shot Motion Customization of Diffusion Models](https://arxiv.org/abs/2312.05288) (Dec., 2023)
[](https://github.com/zyxElsa/MotionCrafter)
[](https://arxiv.org/abs/2312.05288)+ [AnimateZero: Video Diffusion Models are Zero-Shot Image Animators](https://arxiv.org/abs/2312.03793) (Dec., 2023)
[](https://github.com/vvictoryuki/AnimateZero)
[](https://arxiv.org/abs/2312.03793)
[](https://vvictoryuki.github.io/animatezero.github.io/)+ [AVID: Any-Length Video Inpainting with Diffusion Model](https://arxiv.org/abs/2312.03816) (Dec., 2023)
[](https://github.com/zhang-zx/AVID)
[](https://arxiv.org/abs/2312.03816)
[](https://zhang-zx.github.io/AVID/)+ [MTVG : Multi-text Video Generation with Text-to-Video Models](https://arxiv.org/abs/2312.04086) (Dec., 2023)
[](https://arxiv.org/abs/2312.04086)
[](https://kuai-lab.github.io/mtvg-page)+ [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://arxiv.org/abs/2312.04433) (Dec., 2023)
[](https://github.com/damo-vilab/i2vgen-xl)
[](https://arxiv.org/abs/2312.04433)
[](https://dreamvideo-t2v.github.io/)+ [Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation](https://arxiv.org/abs/2312.04483) (Dec., 2023)
[](https://github.com/damo-vilab/i2vgen-xl)
[](https://arxiv.org/abs/2312.04483)
[](https://higen-t2v.github.io/)+ [GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation](https://arxiv.org/abs/2312.04557) (CVPR 2024)
[](https://arxiv.org/abs/2312.04557)
[](https://www.shoufachen.com/gentron_website/)+ [GenDeF: Learning Generative Deformation Field for Video Generation](https://arxiv.org/abs/2312.04561) (Dec., 2023)
[](https://github.com/aim-uofa/GenDeF)
[](https://arxiv.org/abs/2312.04561)
[](https://aim-uofa.github.io/GenDeF/)+ [F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis](https://arxiv.org/abs/2312.03459) (Dec., 2023)
[](https://arxiv.org/abs/2312.03459)+ [DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance](https://arxiv.org/abs/2312.03018) (Dec., 2023)
[](https://github.com/anonymous0769/DreamVideo)
[](https://arxiv.org/abs/2312.03018)
[](https://anonymous0769.github.io/DreamVideo/)+ [LivePhoto: Real Image Animation with Text-guided Motion Control](https://arxiv.org/abs/2312.02928) (Dec., 2023)
[](https://github.com/XavierCHEN34/LivePhoto)
[](https://arxiv.org/abs/2312.02928)
[](https://xavierchen34.github.io/LivePhoto-Page/)+ [Fine-grained Controllable Video Generation via Object Appearance and Context](https://arxiv.org/abs/2312.02919) (Dec., 2023)
[](https://arxiv.org/abs/2312.02919)
[](https://hhsinping.github.io/factor/)+ [VideoBooth: Diffusion-based Video Generation with Image Prompts](https://arxiv.org/abs/2312.00777) (Dec., 2023)
[](https://github.com/Vchitect/VideoBooth)
[](https://arxiv.org/abs/2312.00777)
[](https://vchitect.github.io/VideoBooth-project/)+ [StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter](https://arxiv.org/abs/2312.00330) (Dec., 2023)
[](https://github.com/GongyeLiu/StyleCrafter)
[](https://arxiv.org/abs/2312.00330)
[](https://gongyeliu.github.io/StyleCrafter.github.io/)+ [MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation](https://arxiv.org/abs/2311.18829) (Nov., 2023)
[](https://arxiv.org/abs/2311.18829)
[](https://wangyanhui666.github.io/MicroCinema.github.io/)+ [ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models](https://arxiv.org/abs/2311.18834) (Nov., 2023)
[](https://github.com/WarranWeng/ART.V)
[](https://arxiv.org/abs/2311.18834)
[](https://warranweng.github.io/art.v/)+ [Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning](https://arxiv.org/abs/2311.17536) (Nov., 2023)
[](https://github.com/SPengLiang/SmoothVideo)
[](https://arxiv.org/abs/2311.17536)+ [VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model](https://arxiv.org/abs/2311.17338) (Nov., 2023)
[](https://arxiv.org/abs/2311.17338)
[](https://videoassembler.github.io/videoassembler/)+ [MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation](https://arxiv.org/abs/2311.16635) (Nov., 2023)
[](https://arxiv.org/abs/2311.16635)+ [MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model](https://arxiv.org/abs/2311.16498) (Nov., 2023)
[](https://github.com/magic-research/magic-animate)
[](https://arxiv.org/abs/2311.16498)
[](https://showlab.github.io/magicanimate)+ [FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax](https://arxiv.org/abs/2311.15813) (Nov., 2023)
[](https://github.com/aniki-ly/FlowZero)
[](https://arxiv.org/abs/2311.15813)
[](https://flowzero-video.github.io/)+ [Sketch Video Synthesis](https://arxiv.org/abs/2311.15306) (Nov., 2023)
[](https://github.com/yudianzheng/SketchVideo)
[](https://arxiv.org/abs/2311.15306)
[](https://sketchvideo.github.io/)+ [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://arxiv.org/abs/2311.15127) (Nov., 2023)
[](https://github.com/Stability-AI/generative-models)
[](https://arxiv.org/abs/2311.15127)
[](https://stability.ai/news/stable-video-diffusion-open-ai-video-model)+ [Decouple Content and Motion for Conditional Image-to-Video Generation](https://arxiv.org/abs/2311.14294) (Nov., 2023)
[](https://arxiv.org/abs/2311.14294)+ [FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline](https://arxiv.org/abs/2311.13073) (Nov., 2023)
[](https://github.com/ai-forever/KandinskyVideo)
[](https://arxiv.org/abs/2311.13073)
[](https://ai-forever.github.io/kandinsky-video/)+ [Fine-Grained Open Domain Image Animation with Motion Guidance](https://arxiv.org/abs/2311.12886) (Nov., 2023)
[](https://github.com/alibaba/animate-anything)
[](https://arxiv.org/abs/2311.12886)
[](https://animationai.github.io/AnimateAnything/)+ [GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning](https://arxiv.org/abs/2311.12631) (Nov., 2023)
[](https://github.com/jiaxilv/GPT4Motion)
[](https://arxiv.org/abs/2311.12631)
[](https://gpt4motion.github.io/)+ [MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer](https://arxiv.org/abs/2311.12052) (Nov., 2023)
[](https://github.com/Boese0601/MagicDance)
[](https://arxiv.org/abs/2311.12052)
[](https://boese0601.github.io/magicdance/)+ [MoVideo: Motion-Aware Video Generation with Diffusion Models](https://arxiv.org/abs/2311.11325) (Nov., 2023)
[](https://arxiv.org/abs/2311.11325)
[](https://jingyunliang.github.io/MoVideo/)+ [Make Pixels Dance: High-Dynamic Video Generation](https://arxiv.org/abs/2311.10982) (Nov., 2023)
[](https://arxiv.org/abs/2311.10982)
[](https://makepixelsdance.github.io/)+ [Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning](https://arxiv.org/abs/2311.10709) (Nov., 2023)
[](https://arxiv.org/abs/2311.10709)
[](https://emu-video.metademolab.com/)
+ [Optimal Noise pursuit for Augmenting Text-to-Video Generation](https://arxiv.org/abs/2311.00949) (Nov., 2023)
[](https://arxiv.org/abs/2311.00949)+ [VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning](https://arxiv.org/abs/2311.00990) (Nov., 2023)
[](https://github.com/videodreamer23/videodreamer23.github.io)
[](https://arxiv.org/abs/2311.00990)
[](https://videodreamer23.github.io/)+ [SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction](https://arxiv.org/abs/2310.20700) (Oct., 2023)
[](https://github.com/Vchitect/SEINE)
[](https://arxiv.org/abs/2310.20700)
[](https://vchitect.github.io/SEINE-project/)+ [FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling](https://arxiv.org/abs/2310.15169) (Oct., 2023)
[](https://github.com/arthur-qiu/LongerCrafter)
[](https://arxiv.org/abs/2310.15169)
[](http://haonanqiu.com/projects/FreeNoise.html)+ [DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors](https://arxiv.org/abs/2310.12190) (Oct., 2023)
[](https://github.com/Doubiiu/DynamiCrafter)
[](https://arxiv.org/abs/2310.12190)
[](https://doubiiu.github.io/projects/DynamiCrafter/)
+ [LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation](https://arxiv.org/abs/2310.10769) (Oct., 2023)
[](https://github.com/RQ-Wu/LAMP)
[](https://arxiv.org/abs/2310.10769)
[](https://rq-wu.github.io/projects/LAMP/)+ [Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2309.15818) (Sep., 2023)
[](https://github.com/showlab/Show-1)
[](https://arxiv.org/abs/2309.15818)
[](https://showlab.github.io/Show-1/)+ [MotionDirector: Motion Customization of Text-to-Video Diffusion Models](https://arxiv.org/abs/2310.08465) (Sep., 2023)
[](https://github.com/showlab/MotionDirector)
[](https://arxiv.org/abs/2310.08465)
[](https://showlab.github.io/MotionDirector/)+ [LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models](https://arxiv.org/abs/2309.15103) (Sep., 2023)
[](https://github.com/Vchitect/LaVie)
[](https://arxiv.org/abs/2309.15103)
[](https://vchitect.github.io/LaVie-project/)+ [Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator](https://arxiv.org/abs/2309.14494) (Sep., 2023)
[](https://github.com/SooLab/Free-Bloom)
[](https://arxiv.org/abs/2309.14494)+ [Hierarchical Masked 3D Diffusion Model for Video Outpainting](https://arxiv.org/abs/2309.02119) (Sep., 2023)
[](https://arxiv.org/abs/2309.02119)
[](https://fanfanda.github.io/M3DDM/)+ [Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation](https://arxiv.org/abs/2309.03549) (Sep., 2023)
[](https://github.com/anonymous0x233/ReuseAndDiffuse)
[](https://arxiv.org/abs/2309.03549)
[](https://anonymous0x233.github.io/ReuseAndDiffuse/)+ [VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation](https://arxiv.org/abs/2309.00398) (Sep., 2023)
[](https://arxiv.org/abs/2309.00398)+ [MagicAvatar: Multimodal Avatar Generation and Animation](https://arxiv.org/abs/2308.14748) (Aug., 2023)
[](https://github.com/magic-research/magic-avatar)
[](https://arxiv.org/abs/2308.14748)
[](https://magic-avatar.github.io/)+ [Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models](https://arxiv.org/abs/2308.13812) (Aug., 2023)
[](https://github.com/scofield7419/Dysen)
[](https://arxiv.org/abs/2308.13812)
[](https://haofei.vip/Dysen-VDM/)+ [SimDA: Simple Diffusion Adapter for Efficient Video Generation](https://arxiv.org/abs/2308.09710) (Aug., 2023)
[](https://github.com/ChenHsing/SimDA)
[](https://arxiv.org/abs/2308.09710)
[](https://chenhsing.github.io/SimDA/)+ [ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) (Aug., 2023)
[](https://arxiv.org/abs/2308.06571)
[](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)+ [Dual-Stream Diffusion Net for Text-to-Video Generation](https://arxiv.org/abs/2308.08316) (Aug., 2023)
[](https://arxiv.org/abs/2308.08316)
[](https://anonymous.4open.science/r/Private-C3E8)+ [InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation](https://arxiv.org/abs/2307.06942) (Jul., 2023)
[](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid)
[](https://arxiv.org/abs/2307.06942)+ [Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation](https://arxiv.org/abs/2307.06940) (Jul., 2023)
[](https://github.com/VideoCrafter/Animate-A-Story)
[](https://arxiv.org/abs/2307.06940)
[](https://videocrafter.github.io/Animate-A-Story/)+ [AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) (Jul., 2023)
[](https://github.com/guoyww/animatediff/)
[](https://arxiv.org/abs/2307.04725)
[](https://animatediff.github.io/)+ [DisCo: Disentangled Control for Referring Human Dance Generation in Real World](https://arxiv.org/abs/2307.00040) (Jul., 2023)
[](https://github.com/Wangt-CN/DisCo)
[](https://arxiv.org/abs/2307.00040)
[](https://disco-dance.github.io/)+ [Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation](https://arxiv.org/abs/2306.03988) (Jun., 2023)
[](https://github.com/araachie/yoda)
[](https://arxiv.org/abs/2306.03988)
[](https://araachie.github.io/yoda/)+ [VideoComposer: Compositional Video Synthesis with Motion Controllability](https://arxiv.org/abs/2306.02018) (Jun., 2023)
[](https://github.com/damo-vilab/videocomposer)
[](https://arxiv.org/abs/2306.02018)
[](https://videocomposer.github.io/)+ [Probabilistic Adaptation of Text-to-Video Models](https://arxiv.org/abs/2306.01872) (Jun., 2023)
[](https://arxiv.org/abs/2306.01872)
[](https://video-adapter.github.io/video-adapter/)+ [Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance](https://arxiv.org/abs/2306.00943) (Jun., 2023)
[](https://github.com/VideoCrafter/Make-Your-Video)
[](https://arxiv.org/abs/2306.00943)
[](https://doubiiu.github.io/projects/Make-Your-Video/)+ [Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising](https://arxiv.org/abs/2305.18264) (May, 2023)
[](https://github.com/G-U-N/Gen-L-Video)
[](https://arxiv.org/abs/2305.18264)
[](https://g-u-n.github.io/projects/gen-long-video/index.html)+ [Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity](https://arxiv.org/abs/2305.11675) (May, 2023)
[](https://github.com/jqin4749/MindVideo)
[](https://arxiv.org/abs/2305.11675)
[](https://mind-video.com/)+ [Any-to-Any Generation via Composable Diffusion](https://arxiv.org/abs/2305.11846) (May, 2023)
[](https://github.com/microsoft/i-Code/tree/main/i-Code-V3)
[](https://arxiv.org/abs/2305.11846)
[](https://codi-gen.github.io/)+ [VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation](https://arxiv.org/abs/2305.10874) (May, 2023)
[](https://arxiv.org/abs/2305.10874)+ [Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models](https://arxiv.org/abs/2305.10474) (May, 2023)
[](https://arxiv.org/abs/2305.10474)
[](https://research.nvidia.com/labs/dir/pyoco/)+ [LaMD: Latent Motion Diffusion for Video Generation](https://arxiv.org/abs/2304.11603) (Apr., 2023)
[](https://arxiv.org/abs/2304.11603)+ [Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2304.08818) (CVPR 2023)
[](https://arxiv.org/abs/2304.08818)
[](https://research.nvidia.com/labs/toronto-ai/VideoLDM/)+ [Text2Performer: Text-Driven Human Video Generation](https://arxiv.org/abs/2304.08483) (Apr., 2023)
[](https://github.com/yumingj/Text2Performer)
[](https://arxiv.org/abs/2304.08483)
[](https://yumingj.github.io/projects/Text2Performer)+ [Generative Disco: Text-to-Video Generation for Music Visualization](https://arxiv.org/abs/2304.08551) (Apr., 2023)
[](https://arxiv.org/abs/2304.08551)+ [Latent-Shift: Latent Diffusion with Temporal Shift](https://arxiv.org/abs/2304.08477) (Apr., 2023)
[](https://arxiv.org/abs/2304.08477)
[](https://latent-shift.github.io/)
+ [DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion](https://arxiv.org/abs/2304.06025) (Apr., 2023)
[](https://github.com/johannakarras/DreamPose)
[](https://arxiv.org/abs/2304.06025)
[](https://grail.cs.washington.edu/projects/dreampose/)+ [Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos](https://arxiv.org/abs/2304.01186) (Apr., 2023)
[](https://github.com/mayuelala/FollowYourPose)
[](https://arxiv.org/abs/2304.01186)
[](https://follow-your-pose.github.io/)+ [Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos](https://arxiv.org/abs/2303.16897) (CVPR 2023)
[](https://github.com/sukun1045/video-physics-sound-diffusion)
[](https://arxiv.org/abs/2303.16897)
[](https://sukun1045.github.io/video-physics-sound-diffusion/)+ [Seer: Language Instructed Video Prediction with Latent Diffusion Models](https://arxiv.org/abs/2303.14897) (Mar., 2023)
[](https://arxiv.org/abs/2303.14897)
[](https://seervideodiffusion.github.io/)+ [Text2video-Zero: Text-to-Image Diffusion Models Are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) (Mar., 2023)
[](https://github.com/Picsart-AI-Research/Text2Video-Zero)
[](https://arxiv.org/abs/2303.13439)
[](https://text2video-zero.github.io/)+ [Conditional Image-to-Video Generation with Latent Flow Diffusion Models](https://arxiv.org/abs/2303.13744) (CVPR 2023)
[](https://github.com/nihaomiao/CVPR23_LFDM)
[](https://arxiv.org/abs/2303.13744)+ [Decomposed Diffusion Models for High-Quality Video Generation](https://arxiv.org/abs/2303.08320) (CVPR 2023)
[](https://arxiv.org/abs/2303.08320)
[](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)+ [Video Probabilistic Diffusion Models in Projected Latent Space](https://arxiv.org/abs/2302.07685) (CVPR 2023)
[](https://github.com/sihyun-yu/PVDM)
[](https://arxiv.org/abs/2302.07685)
[](https://sihyun.me/PVDM/)+ [Learning 3D Photography Videos via Self-supervised Diffusion on Single Images](https://arxiv.org/abs/2302.10781) (Feb., 2023)
[](https://arxiv.org/abs/2302.10781)+ [Structure and Content-Guided Video Synthesis With Diffusion Models](https://arxiv.org/abs/2302.03011) (Feb., 2023)
[](https://arxiv.org/abs/2302.03011)
[](https://research.runwayml.com/gen2)+ [Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565) (ICCV 2023)
[](https://github.com/showlab/Tune-A-Video)
[](https://arxiv.org/abs/2212.11565)
[](https://tuneavideo.github.io/)+ [Mm-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation](https://arxiv.org/abs/2212.09478) (CVPR 2023)
[](https://github.com/researchmm/MM-Diffusion)
[](https://arxiv.org/abs/2212.09478)+ [Magvit: Masked Generative Video Transformer](https://arxiv.org/abs/2212.05199) (Dec., 2022)
[](https://github.com/MAGVIT/magvit)
[](https://arxiv.org/abs/2212.05199)
[](https://magvit.cs.cmu.edu/)+ [VIDM: Video Implicit Diffusion Models](https://arxiv.org/abs/2212.00235) (AAAI 2023)
[](https://github.com/MKFMIKU/VIDM)
[](https://arxiv.org/abs/2212.00235)
[](https://kfmei.page/vidm/)+ [Efficient Video Prediction via Sparsely Conditioned Flow Matching](https://arxiv.org/abs/2211.14575) (Nov., 2022)
[](https://github.com/araachie/river)
[](https://arxiv.org/abs/2211.14575)
[](https://araachie.github.io/river/)+ [Latent Video Diffusion Models for High-Fidelity Video Generation With Arbitrary Lengths](https://arxiv.org/abs/2211.13221) (Nov., 2022)
[](https://github.com/YingqingHe/LVDM)
[](https://arxiv.org/abs/2211.13221)
[](https://yingqinghe.github.io/LVDM/)+ [SinFusion: Training Diffusion Models on a Single Image or Video](https://arxiv.org/abs/2211.11743) (Nov., 2022)
[](https://github.com/yanivnik/sinfusion-code)
[](https://arxiv.org/abs/2211.11743)
[](https://yanivnik.github.io/sinfusion/)+ [MagicVideo: Efficient Video Generation With Latent Diffusion Models](https://arxiv.org/abs/2211.11018) (Nov., 2022)
[](https://arxiv.org/abs/2211.11018)
[](https://magicvideo.github.io/#)+ [Imagen Video: High Definition Video Generation With Diffusion Models](https://arxiv.org/abs/2210.02303) (Oct., 2022)
[](https://arxiv.org/abs/2210.02303)
[](https://imagen.research.google/video/)+ [Make-A-Video: Text-to-Video Generation without Text-Video Data](https://openreview.net/forum?id=nJfylDvgzlq) (ICLR 2023)
[](https://openreview.net/forum?id=nJfylDvgzlq)
[](https://makeavideo.studio)+ [Diffusion Models for Video Prediction and Infilling](https://arxiv.org/abs/2206.07696) (TMLR 2022)
[](https://github.com/Tobi-r9/RaMViD)
[](https://arxiv.org/abs/2206.07696)
[](https://sites.google.com/view/video-diffusion-prediction)+ [McVd: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation](https://arxiv.org/abs/2205.09853) (NeurIPS 2022)
[](https://github.com/voletiv/mcvd-pytorch)
[](https://arxiv.org/abs/2205.09853)
[](https://mask-cond-video-diffusion.github.io)+ [Video Diffusion Models](https://arxiv.org/abs/2204.03458) (Apr., 2022)
[](https://arxiv.org/abs/2204.03458)
[](https://video-diffusion.github.io/)+ [Diffusion Probabilistic Modeling for Video Generation](https://arxiv.org/abs/2203.09481) (Mar., 2022)
[](https://github.com/buggyyang/RVD)
[](https://arxiv.org/abs/2203.09481)

### Efficient Video Generation
+ [SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference](https://arxiv.org/abs/2502.18137) (Feb., 2025)
[](https://github.com/thu-ml/SpargeAttn)
[](https://arxiv.org/abs/2502.18137)+ [SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization](https://arxiv.org/abs/2411.10958) (Feb., 2025)
[](https://github.com/thu-ml/SageAttention)
[](https://arxiv.org/abs/2411.10958)+ [FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation](https://arxiv.org/abs/2502.05179) (Feb., 2025)
[](https://github.com/FoundationVision/FlashVideo)
[](https://arxiv.org/abs/2502.05179)+ [Fast Video Generation with Sliding Tile Attention](https://arxiv.org/abs/2502.04507) (Feb, 2025)
[](https://arxiv.org/abs/2502.04507)+ [Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity](https://arxiv.org/abs/2502.01776) (Feb, 2025)
[](https://arxiv.org/abs/2502.01776)+ [Diffusion Adversarial Post-Training for One-Step Video Generation](https://arxiv.org/abs/2501.08316) (Jan, 2025)
[](https://arxiv.org/abs/2501.08316)+ [From Slow Bidirectional to Fast Causal Video Generators](https://arxiv.org/pdf/2412.07772) (Dec., 2024)
[](https://arxiv.org/pdf/2412.07772)
[](https://causvid.github.io/)+ [SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device](https://arxiv.org/pdf/2412.10494) (Dec., 2024)
[](https://arxiv.org/pdf/2412.10494)
[](https://snap-research.github.io/snapgen-v/)+ [Mobile Video Diffusion](https://arxiv.org/abs/2412.07583) (Dec., 2024)
[](https://arxiv.org/abs/2412.07583)
[](https://qualcomm-ai-research.github.io/mobile-video-diffusion/)
+ [MoViE: Mobile Diffusion for Video Editing](https://arxiv.org/abs/2412.06578) (Dec., 2024)
[](https://arxiv.org/abs/2412.06578)
[](https://qualcomm-ai-research.github.io/mobile-video-editing/)+ [Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models](https://arxiv.org/abs/2411.18375) (Nov., 2024)
[](https://arxiv.org/abs/2411.18375)+ [Adaptive Caching for Faster Video Generation with Diffusion Transformers](https://arxiv.org/pdf/2411.02397) (Nov., 2024)
[](https://github.com/AdaCache-DiT/AdaCache)
[](https://arxiv.org/pdf/2411.02397)
[](https://adacache-dit.github.io/)+ [Fast and Memory-Efficient Video Diffusion Using Streamlined Inference](https://arxiv.org/abs/2411.01171) (Nov., 2024)
[](https://arxiv.org/abs/2411.01171)+ [SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration](https://arxiv.org/abs/2410.02367) (Oct., 2024)
[](https://github.com/thu-ml/SageAttention)
[](https://arxiv.org/abs/2410.02367)
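
Several of the works above (e.g., SpargeAttn, Sparse VideoGen, Fast Video Generation with Sliding Tile Attention) accelerate video diffusion transformers by restricting attention to the most relevant tiles or blocks of the long spatio-temporal token sequence. The toy sketch below illustrates that general idea in plain PyTorch; it is not the code of any specific paper, and the block size, mean-pooling, and keep ratio are illustrative assumptions.

```python
# Toy block-sparse attention sketch (illustrative of the tile/block-sparsity
# idea used by several efficiency papers above; not any paper's actual code).
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """q, k, v: (batch, heads, seq, dim); seq is assumed divisible by `block`."""
    b, h, n, d = q.shape
    nb = n // block
    # Pool tokens into coarse tile descriptors and score tile pairs.
    q_tiles = q.view(b, h, nb, block, d).mean(dim=3)
    k_tiles = k.view(b, h, nb, block, d).mean(dim=3)
    tile_scores = q_tiles @ k_tiles.transpose(-1, -2) / d**0.5  # (b, h, nb, nb)
    # Keep only the top-scoring key tiles for each query tile.
    keep = max(1, int(nb * keep_ratio))
    top = tile_scores.topk(keep, dim=-1).indices
    tile_mask = torch.zeros(b, h, nb, nb, device=q.device)
    tile_mask.scatter_(-1, top, 1.0)
    # Expand the tile mask back to token resolution and run masked attention.
    token_mask = tile_mask.bool().repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=token_mask)

# Example: 1024 spatio-temporal tokens, 8 heads, 64-dim heads.
q = k = v = torch.randn(1, 8, 1024, 64)
out = block_sparse_attention(q, k, v)  # same shape as q: (1, 8, 1024, 64)
```

In the real systems the sparsity pattern is chosen far more carefully (per-layer calibration, sliding tiles, quantized kernels), but the speedup comes from the same place: skipping most query-key tile pairs.

### Controllable Video Generation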
+ [SketchVideo: Sketch-based Video Generation and Editing](https://arxiv.org/abs/2503.23284) (Apr., 2025)
[](https://github.com/IGLICT/SketchVideo)
[](https://arxiv.org/abs/2503.23284)
[](http://geometrylearning.com/SketchVideo/)
+ [Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation](https://arxiv.org/abs/2503.24379) (Apr., 2025)
[](https://arxiv.org/abs/2503.24379)
[](https://sqwu.top/Any2Cap/)
+ [Reangle-A-Video: 4D Video Generation as Video-to-Video Translation](https://arxiv.org/abs/2503.09151) (Mar., 2025)
[](https://github.com/HyeonHo99/Reangle-Video)
[](https://arxiv.org/abs/2503.09151)
[](https://hyeonho99.github.io/reangle-a-video/)+ [DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation](https://arxiv.org/abs/2503.21246) (Mar., 2025)
[](https://github.com/gulucaptain/DynamiCtrl)
[](https://arxiv.org/abs/2503.21246)
[](https://gulucaptain.github.io/DynamiCtrl/)+ [HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation](https://arxiv.org/pdf/2503.18860) (Mar., 2025)
[](https://github.com/kkakkkka/HunyuanPortrait)
[](https://arxiv.org/pdf/2503.18860)
[](https://kkakkkka.github.io/HunyuanPortrait/)+ [Enabling Versatile Controls for Video Diffusion Models](https://arxiv.org/abs/2503.16983) (Mar., 2025)
[](https://arxiv.org/abs/2503.16983)
+ [MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance](https://arxiv.org/abs/2503.16421) (Mar., 2025)
[](https://github.com/quanhaol/MagicMotion)
[](https://arxiv.org/abs/2503.16421)
[](https://quanhaol.github.io/magicmotion-site/)+ [MusicInfuser: Making Video Diffusion Listen and Dance](https://arxiv.org/abs/2503.14505) (Mar., 2025)
[](https://github.com/SusungHong/MusicInfuser)
[](https://arxiv.org/abs/2503.14505)
[](https://susunghong.github.io/MusicInfuser)
+ [ReCamMaster: Camera-Controlled Generative Rendering from A Single Video](https://arxiv.org/abs/2503.11647) (Mar., 2025)
[](https://github.com/KwaiVGI/ReCamMaster)
[](https://arxiv.org/abs/2503.11647)
[](https://jianhongbai.github.io/ReCamMaster/)+ [CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models](https://arxiv.org/abs/2503.10592) (Mar., 2025)
[](https://arxiv.org/abs/2503.10592)
[](https://hehao13.github.io/Projects-CameraCtrl-II/)+ [GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control](https://arxiv.org/abs/2503.03751) (Mar., 2025 | CVPR 2025)
[](https://github.com/nv-tlabs/GEN3C)
[](https://arxiv.org/abs/2503.03751)
[](https://research.nvidia.com/labs/toronto-ai/GEN3C/)+ [C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation](https://arxiv.org/abs/2502.19868) (Feb., 2025)
[](https://github.com/WesLee88524/C-Drag-Official-Repo)
[](https://arxiv.org/abs/2502.19868)+ [X-Dancer: Expressive Music to Human Dance Video Generation](https://arxiv.org/pdf/2502.17414) (Feb., 2025)
[](https://arxiv.org/pdf/2502.17414)+ [CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation](https://arxiv.org/abs/2502.08639) (Feb., 2025)
[](https://arxiv.org/abs/2502.08639)
[](https://cinemaster-dev.github.io/)+ [RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control](https://arxiv.org/abs/2502.10059) (Feb., 2025)
[](https://arxiv.org/abs/2502.10059)
[](https://zgctroy.github.io/RealCam-I2V/)+ [AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance](https://arxiv.org/abs/2502.08189) (Feb., 2025)
[](https://github.com/AnyCharV/AnyCharV)
[](https://arxiv.org/abs/2502.08189)
[](https://anycharv.github.io/)+ [VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation](https://arxiv.org/pdf/2502.07531) (Feb., 2025)
[](https://arxiv.org/pdf/2502.07531)+ [FloVD: Optical Flow Meets Video Diffusion Model for Camera-Controlled Video Synthesis](https://jinwonjoon.github.io/flovd_site/FloVD_files/main.pdf) (Feb., 2025)
[](https://github.com/JinWonjoon/FloVD/)
[](https://jinwonjoon.github.io/flovd_site/FloVD_files/main.pdf)
[](https://jinwonjoon.github.io/flovd_site/)+ [Light-A-Video: Training-free Video Relighting via Progressive Light Fusion](https://arxiv.org/abs/2502.08590) (Feb., 2025)
[](https://github.com/bcmi/Light-A-Video/)
[](https://arxiv.org/abs/2502.08590)
[](https://bujiazi.github.io/light-a-video.github.io/)+ [MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation](https://arxiv.org/abs/2502.04299) (Feb., 2025)
[](https://arxiv.org/abs/2502.04299)
[](https://motion-canvas25.github.io/)
+ [DynVFX: Augmenting Real Videos with Dynamic Content](https://arxiv.org/pdf/2502.03621) (Feb., 2025)
[](https://arxiv.org/pdf/2502.03621)+ [MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent](https://arxiv.org/pdf/2502.03207) (Feb., 2025)
[](https://arxiv.org/pdf/2502.03207)+ [RelightVid: Temporal-Consistent Diffusion Model for Video Relighting](https://arxiv.org/abs/2501.16330) (Feb., 2025)
[](https://arxiv.org/abs/2501.16330)+ [Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation](https://arxiv.org/pdf/2501.05020) (Jan., 2025)
[](https://arxiv.org/pdf/2501.05020)
[](https://chen-yingjie.github.io/projects/Perception-as-Control/)
[](https://github.com/chen-yingjie/Perception-as-Control)+ [BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations](https://arxiv.org/abs/2501.07647) (Jan., 2025)
[](https://arxiv.org/abs/2501.07647)
[](https://blobgen-vid2.github.io/)+ [On Unifying Video Generation and Camera Pose Estimation](https://arxiv.org/abs/2501.01409) (Jan., 2025)
[](https://arxiv.org/abs/2501.01409)+ [VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control](https://arxiv.org/pdf/2501.01427) (Jan., 2025)
[](https://arxiv.org/pdf/2501.01427)
[](https://videoanydoor.github.io/)+ [DirectorLLM for Human-Centric Video Generation](https://arxiv.org/pdf/2412.14484) (Dec., 2024)
[](https://arxiv.org/pdf/2412.14484)+ [Consistent Human Image and Video Generation with Spatially Conditioned Diffusion](https://arxiv.org/abs/2412.14531) (Dec., 2024)
[](https://github.com/ljzycmd/SCD)
[](https://arxiv.org/abs/2412.14531)+ [Generative Inbetweening through Frame-wise Conditions-Driven Video Generation](https://arxiv.org/abs/2412.11755) (Dec., 2024)
[](https://github.com/Tian-one/FCVG)
[](https://arxiv.org/abs/2412.11755)
[](https://fcvg-inbetween.github.io/)+ [InterDyn: Controllable Interactive Dynamics with Video Diffusion Models](https://interdyn.is.tue.mpg.de/media/upload/interdyn_video.pdf) (Dec., 2024)
[](https://interdyn.is.tue.mpg.de/media/upload/interdyn_video.pdf)
[](https://interdyn.is.tue.mpg.de/)+ [OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation](https://arxiv.org/pdf/2412.09623) (Dec., 2024)
[](https://arxiv.org/pdf/2412.09623)
[](https://lwq20020127.github.io/OmniDrag/)+ [SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints](https://arxiv.org/abs/2412.07760) (Dec., 2024)
[](https://github.com/KwaiVGI/SynCamMaster)
[](https://arxiv.org/abs/2412.07760)
[](https://jianhongbai.github.io/SynCamMaster/)+ [3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation](https://drive.google.com/file/d/111Z5CMJZupkmg-xWpV4Tl4Nb7SRFcoWx/view) (Dec., 2024)
[](https://github.com/KwaiVGI/3DTrajMaster)
[](https://drive.google.com/file/d/111Z5CMJZupkmg-xWpV4Tl4Nb7SRFcoWx/view)
+ [ObjCtrl-2.5D: Training-free Object Control with Camera Poses](https://arxiv.org/pdf/2412.07721) (Dec., 2024)
[](https://github.com/wzhouxiff/ObjCtrl-2.5D)
[](https://arxiv.org/pdf/2412.07721)
[](https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/)+ [Motion Prompting: Controlling Video Generation with Motion Trajectories](https://arxiv.org/abs/2412.02700) (Nov., 2024)
[](https://arxiv.org/abs/2412.02700)
[](https://motion-prompting.github.io/)+ [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) (Nov., 2024)
[](https://github.com/PKU-YuanGroup/ConsisID)
[](https://arxiv.org/abs/2411.17440)
[](https://pku-yuangroup.github.io/ConsisID/)
[](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)+ [FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations](https://arxiv.org/pdf/2411.10818) (Nov., 2024)
[](https://github.com/hmrishavbandy/FlipSketch)
[](https://arxiv.org/pdf/2411.10818)+ [AnimateAnything: Consistent and Controllable Animation for video generation](https://arxiv.org/pdf/2411.10836) (Nov., 2024)
[](https://github.com/yu-shaonian/AnimateAnything)
[](https://arxiv.org/pdf/2411.10836)
[](https://yu-shaonian.github.io/Animate_Anything/)+ [MVideo: Motion Control for Enhanced Complex Action Video Generation](https://arxiv.org/pdf/2411.08328) (Nov., 2024)
[](https://arxiv.org/pdf/2411.08328)
[](https://mvideo-v1.github.io/)+ [ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning](https://arxiv.org/abs/2411.05003) (Nov., 2024 | CVPR 2025)
[](https://arxiv.org/abs/2411.05003)
[](https://generative-video-camera-controls.github.io/)+ [SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation](https://arxiv.org/pdf/2411.04989) (Nov., 2024)
[](https://arxiv.org/pdf/2411.04989)
[](https://kmcode1.github.io/Projects/SG-I2V/)
[](https://github.com/Kmcode1/SG-I2V)+ [X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention](https://arxiv.org/abs/2403.15931) (Nov., 2024)
[](https://arxiv.org/abs/2403.15931)
[](https://byteaigc.github.io/x-portrait/)
[](https://github.com/bytedance/X-Portrait)+ [LumiSculpt: A Consistency Lighting Control Network for Video Generation](https://arxiv.org/pdf/2410.22979) (Nov., 2024)
[](https://arxiv.org/pdf/2410.22979)+ [Framer: Interactive Frame Interpolation](https://arxiv.org/pdf/2410.18978) (Oct., 2024)
[](https://github.com/aim-uofa/Framer)
[](https://arxiv.org/pdf/2410.18978)
[](https://aim-uofa.github.io/Framer/)+ [CamI2V: Camera-Controlled Image-to-Video Diffusion Model](https://arxiv.org/pdf/2410.15957) (Oct., 2024)
[](https://github.com/ZGCTroy/CamI2V)
[](https://arxiv.org/pdf/2410.15957)
[](https://zgctroy.github.io/CamI2V/)+ [Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention](https://arxiv.org/abs/2410.10774) (Oct., 2024)
[](https://arxiv.org/abs/2410.10774)
[](https://ir1d.github.io/Cavia/)
+ [Animate Your Motion: Turning Still Images into Dynamic Videos](https://arxiv.org/abs/2403.10179) (Mar., 2024 | ECCV 2024)
[](https://arxiv.org/abs/2403.10179)
[](https://mingxiao-li.github.io/smcd/)+ [EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation](https://arxiv.org/abs/2408.13005) (Aug., 2024)
[](https://arxiv.org/abs/2408.13005)+ [ControlNeXt: Powerful and Efficient Control for Image and Video Generation](https://arxiv.org/pdf/2408.06070) (Aug., 2024)
[](https://github.com/dvlab-research/ControlNeXt)
[](https://arxiv.org/pdf/2408.06070)
[](https://pbihao.github.io/projects/controlnext/index.html)+ [TrackGo: A Flexible and Efficient Method for Controllable Video Generation](https://arxiv.org/abs/2408.11475) (Aug., 2024)
[](https://arxiv.org/abs/2408.11475)
[](https://zhtjtcz.github.io/TrackGo-Page/#)+ [Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics](https://arxiv.org/abs/2408.04631) (Aug., 2024)
[](https://arxiv.org/abs/2408.04631)
[](https://vgg-puppetmaster.github.io/)+ [Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches](https://arxiv.org/abs/2408.04567) (Aug., 2024)
[](https://arxiv.org/abs/2408.04567)
[](https://xrvisionlabs.github.io/Sketch2Scene/)+ [Expressive Whole-Body 3D Gaussian Avatar](https://arxiv.org/abs/2407.21686) (Aug., 2024)
[](https://arxiv.org/abs/2407.21686)
[](https://mks0601.github.io/ExAvatar/)
+ [Tora: Trajectory-oriented Diffusion Transformer for Video Generation](https://arxiv.org/abs/2407.21705) (Jul., 2024 | CVPR 2025)
[](https://github.com/alibaba/Tora)
[](https://arxiv.org/abs/2407.21705)
[](https://ali-videoai.github.io/tora_video/)+ [HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation](https://arxiv.org/pdf/2407.17438) (Jul., 2024)
[](https://github.com/zhenzhiwang/HumanVid)
[](https://arxiv.org/pdf/2407.17438)
[](https://humanvid.github.io/#)+ [Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models](https://arxiv.org/abs/2407.15642) (Jul., 2024)
[](https://github.com/maxin-cn/Cinemo)
[](https://arxiv.org/abs/2407.15642)
[](https://maxin-cn.github.io/cinemo_project/)+ [VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control](https://arxiv.org/abs/2407.12781) (Jul., 2024)
[](https://arxiv.org/abs/2407.12781)
[](https://snap-research.github.io/vd3d/)+ [Still-Moving: Customized Video Generation without Customized Video Data](https://arxiv.org/abs/2407.08674) (Jul., 2024)
[](https://arxiv.org/abs/2407.08674)
[](https://still-moving.github.io/)+ [LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/abs/2407.03168) (Jul., 2024)
[](https://github.com/KwaiVGI/LivePortrait)
[](https://arxiv.org/abs/2407.03168)
[](https://github.com/KwaiVGI/LivePortrait)+ [Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model](https://arxiv.org/abs/2406.15735) (Jun., 2024 | NeurIPS 2024)
[](https://github.com/thu-ml/cond-image-leakage/tree/main?tab=readme-ov-file)
[](https://arxiv.org/abs/2406.15735)
[](https://cond-image-leak.github.io/)
+ [Image Conductor: Precision Control for Interactive Video Synthesis](https://arxiv.org/pdf/2406.15339) (Jun., 2024)
[](https://github.com/liyaowei-stu/ImageConductor)
[](https://arxiv.org/pdf/2406.15339)
[](https://liyaowei-stu.github.io/project/ImageConductor/)+ [MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance](https://arxiv.org/abs/2406.19680) (Jun., 2024)
[](https://arxiv.org/abs/2406.19680)+ [FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models](https://arxiv.org/abs/2406.16863) (Jun., 2024)
[](https://github.com/arthur-qiu/FreeTraj)
[](https://arxiv.org/abs/2406.16863)
[](http://haonanqiu.com/projects/FreeTraj.html)+ [MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model](https://arxiv.org/abs/2405.20222) (Jun., 2024)
[](https://github.com/MyNiuuu/MOFA-Video)
[](https://arxiv.org/abs/2405.20222)
[](https://myniuuu.github.io/MOFA_Video/)+ [Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance](https://arxiv.org/abs/2403.14781) (Mar., 2024)
[](https://github.com/fudan-generative-vision/champ)
[](https://arxiv.org/abs/2403.14781)
[](https://fudan-generative-vision.github.io/champ/)+ [TrailBlazer: Trajectory Control for Diffusion-Based Video Generation](https://arxiv.org/abs/2401.00896) (Jan., 2024)
[](https://github.com/hohonu-vicml/Trailblazer)
[](https://arxiv.org/abs/2401.00896)
[](https://hohonu-vicml.github.io/Trailblazer.Page/)+ [Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation](https://arxiv.org/abs/2401.10150) (Jan., 2024)
[](https://arxiv.org/abs/2401.10150)+ [Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions](https://arxiv.org/abs/2401.01827) (Jan., 2024)
[](https://arxiv.org/abs/2401.01827)
[](https://showlab.github.io/Moonshot/)+ [MotionCtrl: A Unified and Flexible Motion Controller for Video Generation](https://arxiv.org/abs/2312.03641) (Dec., 2023)
[](https://github.com/TencentARC/MotionCtrl)
[](https://arxiv.org/abs/2312.03641)
[](https://wzhouxiff.github.io/projects/MotionCtrl/)+ [Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation](https://arxiv.org/abs/2311.17117) (Nov., 2023)
[](https://github.com/HumanAIGC/AnimateAnyone)
[](https://arxiv.org/abs/2311.17117)
[](https://humanaigc.github.io/animate-anyone/)+ [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) (Nov., 2023)
[](https://arxiv.org/abs/2311.16933)
[](https://guoyww.github.io/projects/SparseCtrl/)+ [Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models](https://arxiv.org/abs/2305.13840) (May, 2023)
[](https://github.com/Weifeng-Chen/control-a-video)
[](https://arxiv.org/abs/2305.13840)
[](https://controlavideo.github.io/)+ [Motion-Conditioned Diffusion Model for Controllable Video Synthesis](https://arxiv.org/abs/2304.14404) (Apr., 2023)
[](https://arxiv.org/abs/2304.14404)
[](https://tsaishien-chen.github.io/MCDiff/)+ [ControlVideo: Training-free Controllable Text-to-Video Generation](https://arxiv.org/abs/2305.13077) (May, 2023)
[](https://github.com/YBYBZhang/ControlVideo)
[](https://arxiv.org/abs/2305.13077)
+ [DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory](https://arxiv.org/abs/2308.08089) (Aug., 2023)
[](https://arxiv.org/abs/2308.08089)
[](https://www.microsoft.com/en-us/research/project/dragnuwa/)+ [DragAnything: Motion Control for Anything using Entity Representation](https://arxiv.org/abs/2403.07420) (ECCV, 2024)
[](https://github.com/showlab/DragAnything)
[](https://arxiv.org/abs/2403.07420)
[](https://weijiawu.github.io/draganything_page/)+ [CameraCtrl: Enabling Camera Control for Video Diffusion Models](https://arxiv.org/abs/2404.02101) (Apr., 2024)
[](https://github.com/hehao13/CameraCtrl)
[](https://arxiv.org/abs/2404.02101)
[](https://hehao13.github.io/projects-CameraCtrl/)+ [Training-free Camera Control for Video Generation](https://arxiv.org/pdf/2406.10126) (Jun., 2024)
[](https://arxiv.org/pdf/2406.10126)
[](https://lifedecoder.github.io/CamTrol/)+ [Customizing Motion in Text-to-Video Diffusion Models](https://arxiv.org/abs/2312.04966) (Dec., 2023)
[](https://arxiv.org/abs/2312.04966)
[](https://joaanna.github.io/customizing_motion/)+ [MotionClone: Training-Free Motion Cloning for Controllable Video Generation](https://arxiv.org/abs/2406.05338) (Jun., 2024)
[](https://arxiv.org/abs/2406.05338)
[](https://bujiazi.github.io/motionclone.github.io/)

### Character Customization
+ [VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models](https://arxiv.org/abs/2503.21781) (Mar., 2025)
[](https://arxiv.org/abs/2503.21781)
[](https://jasper0314-huang.github.io/videomage-customization/)+ [MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization](https://arxiv.org/abs/2503.12689) (Mar., 2025)
[](https://github.com/EchoPluto/MagicID)
[](https://arxiv.org/abs/2503.12689)
[](https://echopluto.github.io/MagicID-project/)
+ [Concat-ID: Towards Universal Identity-Preserving Video Synthesis](https://arxiv.org/abs/2503.14151) (Mar., 2025)
[](https://github.com/ML-GSAI/Concat-ID)
[](https://arxiv.org/abs/2503.14151)
[](https://ml-gsai.github.io/Concat-ID-demo/)
+ [CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance](https://arxiv.org/abs/2503.10391) (Mar., 2025)
[](https://arxiv.org/abs/2503.10391)+ [FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation](https://arxiv.org/abs/2502.13995) (Feb., 2025)
[](https://arxiv.org/abs/2502.13995)+ [Dynamic Concepts Personalization from Single Videos](https://arxiv.org/abs/2502.14844) (Feb., 2025)
[](https://arxiv.org/abs/2502.14844)
[](https://snap-research.github.io/dynamic_concepts/)+ [Phantom: Subject-consistent video generation via cross-modal alignment](https://arxiv.org/pdf/2502.11079) (Feb., 2025)
[](https://github.com/Phantom-video/Phantom)
[](https://arxiv.org/pdf/2502.11079)
[](https://phantom-video.github.io/Phantom/)+ [Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts](https://arxiv.org/abs/2502.07802) (Feb., 2025)
[](https://arxiv.org/abs/2502.07802)
[](https://jeff-liangf.github.io/projects/movieweaver/)+ [Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance](https://arxiv.org/pdf/2502.06145) (Feb., 2025)
[](https://arxiv.org/pdf/2502.06145)
[](https://humanaigc.github.io/animate-anyone-2/)+ [Multi-subject Open-set Personalization in Video Generation](https://arxiv.org/abs/2501.06187) (Jan., 2025)
[](https://arxiv.org/abs/2501.06187)
[](https://snap-research.github.io/open-set-video-personalization/)+ [Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers](https://arxiv.org/abs/2501.03931) (Jan., 2025)
[](https://github.com/dvlab-research/MagicMirror/)
[](https://arxiv.org/abs/2501.03931)
[](https://julianjuaner.github.io/projects/MagicMirror/)+ [ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning](https://arxiv.org/abs/2501.04698) (Jan., 2025)
[](https://arxiv.org/abs/2501.04698)
[](https://yuzhou914.github.io/ConceptMaster/)+ [VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models](https://arxiv.org/pdf/2412.19645) (Dec., 2024)
[](https://github.com/WuTao-CS/VideoMaker)
[](https://arxiv.org/pdf/2412.19645)
[](https://wutao-cs.github.io/VideoMaker/)+ [PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation](https://arxiv.org/pdf/2411.17048) (Nov., 2024)
[](https://arxiv.org/pdf/2411.17048)
[](https://personalvideo.github.io/)+ [DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control](https://arxiv.org/abs/2410.13830) (Oct., 2024)
[](https://arxiv.org/abs/2410.13830)
[](https://dreamvideo2.github.io/)+ [CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities](https://arxiv.org/abs/2408.13239) (Aug., 2024)
[](https://github.com/WuTao-CS/CustomCrafter)
[](https://arxiv.org/abs/2408.13239)
[](https://customcrafter.github.io/)

### Motion Customization
+ [Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models](https://arxiv.org/abs/2501.16714) (Feb., 2025)
[](https://arxiv.org/abs/2501.16714)+ [Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise](https://arxiv.org/pdf/2501.08331) (Jan., 2025)
[](https://github.com/VGenAI-Netflix-Eyeline-Research/Go-with-the-Flow)
[](https://arxiv.org/pdf/2501.08331)
[](https://vgenai-netflix-eyeline-research.github.io/Go-with-the-Flow/)+ [Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss](https://arxiv.org/abs/2501.07563v1) (Jan., 2025)
[](https://arxiv.org/abs/2501.07563v1)
[](https://zhangxinyu-xyz.github.io/SimulateMotion.github.io/)+ [Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions](https://arxiv.org/abs/2501.01425) (Jan., 2025)
[](https://arxiv.org/abs/2501.01425)
[](https://henghuiding.github.io/SynFMC/)+ [CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training](https://arxiv.org/pdf/2412.15646) (Dec., 2024)
[](https://arxiv.org/pdf/2412.15646)+ [MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance](https://motionshop-diffusion.github.io/MotionShop.pdf) (Dec., 2024)
[](https://github.com/gemlab-vt/motionshop)
[](https://motionshop-diffusion.github.io/MotionShop.pdf)
[](https://motionshop-diffusion.github.io/)+ [Video Motion Transfer with Diffusion Transformers](https://arxiv.org/abs/2412.07776) (Dec., 2024)
[](https://github.com/ditflow/ditflow)
[](https://arxiv.org/abs/2412.07776)
[](https://ditflow.github.io/)+ [Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training](https://arxiv.org/abs/2412.06029) (Dec., 2024)
[](https://arxiv.org/abs/2412.06029)
[](https://latent-reframe.github.io/)+ [Motion Modes: What Could Happen Next?](https://motionmodes.github.io/resources/MotionModes.pdf) (Dec., 2024)
[](https://motionmodes.github.io/resources/MotionModes.pdf)
[](https://motionmodes.github.io/)+ [MoTrans: Customized Motion Transfer with Text-driven Video](https://arxiv.org/abs/2412.01343) (Dec., 2024)
[](https://arxiv.org/abs/2412.01343)+ [AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers](https://arxiv.org/abs/2411.18673) (Dec., 2024)
[](https://arxiv.org/abs/2411.18673)
[](https://snap-research.github.io/ac3d/)+ [Trajectory Attention For Fine-grained Video Motion Control](https://arxiv.org/abs/2411.14208) (Dec., 2024)
[](https://github.com/xizaoqu/TrajectoryAttntion)
[](https://arxiv.org/abs/2411.19324)
[](https://xizaoqu.github.io/trajattn/)+ [ViewExtrapolator: Novel View Extrapolation with Video Diffusion Priors](https://arxiv.org/abs/2411.14208) (Nov., 2024)
[](https://github.com/Kunhao-Liu/ViewExtrapolator)
[](https://arxiv.org/abs/2411.14208)
[](https://kunhao-liu.github.io/ViewExtrapolator/)+ [I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength](https://arxiv.org/pdf/2411.06525) (Nov., 2024)
[](https://arxiv.org/pdf/2411.06525)+ [MotionDirector: Motion Customization of Text-to-Video Diffusion Models](https://arxiv.org/abs/2310.08465) (Sep., 2023 | ECCV 2024)
[](https://github.com/showlab/MotionDirector)
[](https://arxiv.org/abs/2310.08465)
[](https://showlab.github.io/MotionDirector/)
+ [LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation](https://arxiv.org/abs/2310.10769) (Oct., 2023 | CVPR 2024)
[](https://github.com/RQ-Wu/LAMP)
[](https://arxiv.org/abs/2310.10769)
[](https://rq-wu.github.io/projects/LAMP/)
+ [VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models](https://arxiv.org/abs/2312.00845) (Dec., 2023 | CVPR 2024)
[](https://github.com/HyeonHo99/Video-Motion-Customization)
[](https://arxiv.org/abs/2312.00845)
[](https://video-motion-customization.github.io/)
+ [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://arxiv.org/abs/2312.04433) (Dec., 2023 | CVPR 2024)
[](https://github.com/ali-vilab/VGen)
[](https://arxiv.org/abs/2312.04433)
[](https://dreamvideo-t2v.github.io/)+ [MotionCtrl: A Unified and Flexible Motion Controller for Video Generation](https://arxiv.org/abs/2312.03641) (Dec., 2023 | SIGGRAPH 2024)
[](https://github.com/TencentARC/MotionCtrl)
[](https://arxiv.org/abs/2312.03641)
[](https://wzhouxiff.github.io/projects/MotionCtrl/)+ [Customizing Motion in Text-to-Video Diffusion Models](https://arxiv.org/abs/2312.04966) (Dec., 2023)
[](https://arxiv.org/abs/2312.04966)
[](https://joaanna.github.io/customizing_motion/)+ [Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion](https://arxiv.org/abs/2402.03162) (Feb., 2024)
[](https://github.com/ysy31415/direct_a_video)
[](https://arxiv.org/abs/2402.03162)
[](https://direct-a-video.github.io/)+ [Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models](https://arxiv.org/abs/2402.14780) (Feb., 2024)
[](https://arxiv.org/abs/2402.14780)+ [DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing](https://arxiv.org/abs/2403.12002) (Mar., 2024 | ECCV 2024)
[](https://arxiv.org/abs/2403.12002)
[](https://hyeonho99.github.io/dreammotion/)+ [DragAnything: Motion Control for Anything using Entity Representation](https://arxiv.org/abs/2403.07420) (Mar., 2024 | ECCV 2024)
[](https://github.com/showlab/DragAnything)
[](https://arxiv.org/abs/2403.07420)
[](https://weijiawu.github.io/draganything_page/)+ [Spectral Motion Alignment for Video Motion Transfer using Diffusion Models](https://arxiv.org/abs/2403.15249) (Mar., 2024)
[](https://github.com/geonyeong-park/Spectral-Motion-Alignment)
[](https://arxiv.org/abs/2403.15249)
[](https://geonyeong-park.github.io/spectral-motion-alignment/)+ [Motion Inversion for Video Customization](https://arxiv.org/abs/2403.20193) (Mar., 2024)
[](https://github.com/EnVision-Research/MotionInversion)
[](https://arxiv.org/abs/2403.20193)
[](https://wileewang.github.io/MotionInversion/)+ [Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing](https://arxiv.org/abs/2405.04496) (May, 2024)
[](https://arxiv.org/abs/2405.04496)+ [Video Diffusion Models are Training-free Motion Interpreter and Controller](https://arxiv.org/abs/2405.14864) (May., 2024)
[](https://arxiv.org/abs/2405.14864)
[](https://xizaoqu.github.io/moft/)+ [Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control](https://arxiv.org/abs/2405.17414) (May., 2024)
[](https://github.com/CollaborativeVideoDiffusion/CVD)
[](https://arxiv.org/abs/2405.17414)
[](https://collaborativevideodiffusion.github.io/)+ [MotionClone: Training-Free Motion Cloning for Controllable Video Generation](https://arxiv.org/abs/2406.05338) (Jun., 2024)
[](https://github.com/Bujiazi/MotionClone/)
[](https://arxiv.org/abs/2406.05338)
[](https://bujiazi.github.io/motionclone.github.io/)+ [FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models](https://arxiv.org/abs/2406.16863) (Jun., 2024)
[](https://github.com/arthur-qiu/FreeTraj)
[](https://arxiv.org/abs/2406.16863)
[](http://haonanqiu.com/projects/FreeTraj.html)+ [Zero-Shot Controllable Image-to-Video Animation via Motion Decomposition](https://www.amazon.science/publications/zero-shot-controllable-image-to-video-animation-via-motion-decomposition) (Jul., 2024 | ACM MM 2024)
[](https://img2vidanim-0.github.io/)+ [Tora: Trajectory-oriented Diffusion Transformer for Video Generation](https://arxiv.org/abs/2407.21705) (Jul., 2024)
[](https://github.com/alibaba/Tora)
[](https://arxiv.org/abs/2407.21705)
[](https://ali-videoai.github.io/tora_video/)+ [Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion](https://arxiv.org/abs/2408.00458) (Aug., 2024)
[](https://arxiv.org/abs/2408.00458)

### Long Video / Film Generation
+ [Long-Context Autoregressive Video Modeling with Next-Frame Prediction](https://arxiv.org/abs/2503.19325) (Mar., 2025)
[](https://arxiv.org/abs/2503.19325)
[](https://farlongctx.github.io/)
[](https://github.com/showlab/FAR)
+ [MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning](https://arxiv.org/abs/2503.07314) (Mar., 2025)
[](https://arxiv.org/abs/2503.07314)
[](https://weijiawu.github.io/MovieAgent/)
[](https://github.com/showlab/MovieAgent)+ [Long Context Tuning for Video Generation](https://arxiv.org/pdf/2503.10589) (Mar., 2025)
[](https://arxiv.org/pdf/2503.10589)
[](https://guoyww.github.io/projects/long-context-video/)+ [RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers](https://arxiv.org/abs/2502.15894) (Feb., 2025)
[](https://github.com/thu-ml/RIFLEx)
[](https://arxiv.org/abs/2502.15894)
[](https://riflex-video.github.io/)+ [VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos](https://arxiv.org/abs/2502.01549) (Feb., 2025)
[](https://github.com/HKUDS/VideoRAG)
[](https://arxiv.org/abs/2502.01549)+ [Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion](https://arxiv.org/pdf/2501.09019) (Jan., 2025)
[](https://arxiv.org/pdf/2501.09019)+ [VideoAuteur: Towards Long Narrative Video Generation](https://arxiv.org/abs/2501.06173) (Jan., 2025)
[](https://github.com/lambert-x/VideoAuteur)
[](https://arxiv.org/abs/2501.06173)
[](https://videoauteur.github.io/)+ [DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](https://arxiv.org/abs/2412.18597) (Dec., 2024)
[](https://github.com/TencentARC/DiTCtrl)
[](https://arxiv.org/abs/2412.18597)
[](https://onevfall.github.io/project_page/ditctrl/)+ [LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity](https://arxiv.org/abs/2412.09856) (Dec., 2024)
[](https://arxiv.org/abs/2412.09856)
[](https://lineargen.github.io/)+ [Owl-1: Omni World Model for Consistent Long Video Generation](https://arxiv.org/abs/2412.09600) (Dec., 2024)
[](https://github.com/huang-yh/Owl)
[](https://arxiv.org/abs/2412.09600)+ [Video Storyboarding: Multi-Shot Character Consistency for Text-to-Video Generation](https://arxiv.org/pdf/2412.07750) (Dec., 2024)
[](https://arxiv.org/pdf/2412.07750)
[](https://research.nvidia.com/labs/par/video_storyboarding/)+ [Mind the Time: Temporally-Controlled Multi-Event Video Generation](https://arxiv.org/abs/2412.05263) (Dec., 2024)
[](https://arxiv.org/abs/2412.05263)
[](https://mint-video.github.io/)+ [GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration](https://arxiv.org/abs/2412.04440) (Dec., 2024)
[](https://github.com/Karine-Huang/GenMAC)
[](https://arxiv.org/abs/2412.04440)
[](https://karine-h.github.io/GenMAC/)+ [Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation](https://arxiv.org/pdf/2412.01316) (Dec., 2024)
[](https://arxiv.org/pdf/2412.01316)
[](https://presto-video.github.io/)+ [VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation](https://arxiv.org/abs/2412.02259) (Dec., 2024)
[](https://arxiv.org/abs/2412.02259)
[](https://cheliosoops.github.io/VGoT/)+ [MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation](https://arxiv.org/pdf/2411.18281) (Nov., 2024)
[](https://arxiv.org/pdf/2411.18281)
[](https://motioncharacter.github.io/)+ [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) (Nov., 2024)
[](https://github.com/PKU-YuanGroup/ConsisID)
[](https://arxiv.org/abs/2411.17440)
[](https://pku-yuangroup.github.io/ConsisID/)
[](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)+ [MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation](https://arxiv.org/abs/2411.15262) (CVPR 2025)
[](https://arxiv.org/abs/2411.15262)
[](https://weijiawu.github.io/MovieBench/)
[](https://github.com/showlab/MovieBecnh)+ [MotionPrompt: Optical-Flow Guided Prompt Optimization for Coherent Video Generation](https://arxiv.org/pdf/2411.15540) (Nov., 2024)
[](https://arxiv.org/pdf/2411.15540)
[](https://motionprompt.github.io/)
[](https://github.com/HyelinNAM/MotionPrompt)+ [DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation](https://arxiv.org/pdf/2411.16657) (Nov., 2024)
[](https://arxiv.org/pdf/2411.16657)
[](https://dreamrunner-story2video.github.io/)
[](https://github.com/wz0919/DreamRunner)+ [StoryMaker: Towards consistent characters in text-to-image generation](https://arxiv.org/abs/2409.12576) (Nov., 2024)
[](https://arxiv.org/abs/2409.12576)
[](https://github.com/RedAIGC/StoryMaker)+ [Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection](https://arxiv.org/pdf/2409.19624) (Nov., 2024)
[](https://arxiv.org/pdf/2409.19624)+ [ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction](https://arxiv.org/abs/2410.04721) (Nov., 2024)
[](https://arxiv.org/abs/2410.04721)
[](https://acdc2025.github.io/)+ [Story-Adapter: A Training-free Iterative Framework for Long Story Visualization](https://arxiv.org/abs/2410.06244) (Nov., 2024)
[](https://arxiv.org/abs/2410.06244)
[](https://jwmao1.github.io/storyadapter/)
[](https://github.com/jwmao1/story-adapter)+ [In-Context LoRA for Diffusion Transformers](https://arxiv.org/abs/2410.23775) (Aug., 2024)
[](https://arxiv.org/abs/2410.23775)
[](https://ali-vilab.github.io/In-Context-LoRA-Page/)
[](https://github.com/ali-vilab/In-Context-LoRA)+ [SEED-Story: Multimodal Long Story Generation with Large Language Model](https://arxiv.org/abs/2407.08683) (Jul., 2024)
[](https://arxiv.org/abs/2407.08683)
[](https://github.com/TencentARC/SEED-Story)+ [StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration](https://arxiv.org/pdf/2411.04925) (Nov., 2024)
[](https://arxiv.org/pdf/2411.04925)+ [ARLON: Boosting Diffusion Transformers With Autoregressive Models for Long Video Generation](https://arxiv.org/abs/2410.20502) (Oct., 2024)
[](https://arxiv.org/abs/2410.20502)
[](https://arlont2v.github.io/)+ [Unbounded: A Generative Infinite Game of Character Life Simulation](https://arxiv.org/abs/2410.18975) (Oct., 2024)
[](https://arxiv.org/abs/2410.18975)
[](https://generative-infinite-game.github.io/)+ [Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach](https://arxiv.org/abs/2410.03160) (Oct., 2024)
[](https://github.com/Yaofang-Liu/FVDM)
[](https://arxiv.org/abs/2410.03160)+ [Loong: Generating Minute-level Long Videos with Autoregressive Language Models](https://arxiv.org/abs/2410.02757) (Oct., 2024)
[](https://arxiv.org/abs/2410.02757)
[](https://epiphqny.github.io/Loong-video/)+ [CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion](https://arxiv.org/pdf/2408.17424) (Aug., 2024)
[](https://arxiv.org/pdf/2408.17424)+ [DreamCinema: Cinematic Transfer with Free Camera and 3D Character](https://arxiv.org/abs/2408.12601) (Aug., 2024)
[](https://github.com/chen-wl20/DreamCinema?tab=readme-ov-file)
[](https://arxiv.org/abs/2408.12601)
[](https://liuff19.github.io/DreamCinema/)+ [SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama](https://arxiv.org/pdf/2408.09333) (Aug., 2024)
[](https://github.com/vaew/SkyScript-100M)
[](https://arxiv.org/pdf/2408.09333)+ [Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation](https://arxiv.org/abs/2408.09787) (Aug., 2024)
[](https://github.com/HITsz-TMG/Anim-Director)
[](https://arxiv.org/abs/2408.09787)
+ [Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation](https://arxiv.org/pdf/2408.10453) (Aug., 2024)
[](https://arxiv.org/pdf/2408.10453)+ [DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework](https://arxiv.org/abs/2408.11788) (Jul, 2024)
[](https://arxiv.org/abs/2408.11788)+ [MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence](https://arxiv.org/abs/2407.16655) (Jul, 2024)
[](https://github.com/aim-uofa/MovieDreamer)
[](https://arxiv.org/abs/2407.16655)
[](https://aim-uofa.github.io/MovieDreamer/)+ [Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion](https://arxiv.org/pdf/2407.13759) (Jul., 2024)
[](https://arxiv.org/pdf/2407.13759)
[](https://boyangdeng.com/streetscapes/)
+ [AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description](https://arxiv.org/abs/2407.15850) (Jul, 2024)
[](https://github.com/Jyxarthur/AutoAD-Zero)
[](https://arxiv.org/abs/2407.15850)
[](https://www.robots.ox.ac.uk/~vgg/research/autoad-zero/)+ [AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production](https://arxiv.org/abs/2403.07952) (Jul, 2024)
[](https://arxiv.org/abs/2403.07952)
[](https://aesopai.github.io/)+ [TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation](https://arxiv.org/abs/2404.18919) (Jul, 2024)
[](https://github.com/donahowe/Theatergen)
[](https://arxiv.org/abs/2404.18919)
[](https://howe140.github.io/theatergen.io/)+ [AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation](https://github.com/donahowe/AutoStudio) (Jul, 2024)
[](https://github.com/donahowe/AutoStudio)
[](https://arxiv.org/abs/2406.01388)
[](https://github.com/donahowe/AutoStudio)+ [DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion](https://arxiv.org/abs/2407.12899) (Jul, 2024)
[](https://arxiv.org/abs/2407.12899)
[](https://dream-xyz.github.io/dreamstory)+ [VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning](https://arxiv.org/abs/2309.15091) (Jul, 2024)
[](https://github.com/HL-hanlin/VideoDirectorGPT)
[](https://arxiv.org/abs/2309.15091)
[](https://videodirectorgpt.github.io/)+ [MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation](https://arxiv.org/abs/2205.09853) (NeurIPS 2022)
[](https://github.com/voletiv/mcvd-pytorch)
[](https://arxiv.org/abs/2205.09853)
[](https://mask-cond-video-diffusion.github.io)+ [NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation](https://arxiv.org/abs/2303.12346) (Mar., 2023)
[](https://arxiv.org/abs/2303.12346)
[](https://msra-nuwa.azurewebsites.net/#/)+ [Flexible Diffusion Modeling of Long Videos](https://arxiv.org/abs/2205.11495) (May, 2022)
[](https://github.com/plai-group/flexible-video-diffusion-modeling)
[](https://arxiv.org/abs/2205.11495)
[](https://fdmolv.github.io/)

### Video Generation with 3D/Physical Prior
+ [SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction](https://arxiv.org/abs/2503.18933) (CVPR 2025)
[](https://github.com/PallottaEnrico/SyncVP)
[](https://arxiv.org/abs/2503.18933)
[](https://syncvp.github.io/)+ [DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models](https://arxiv.org/abs/2501.18590) (Feb, 2025)
[](https://arxiv.org/abs/2501.18590)
[](https://research.nvidia.com/labs/toronto-ai/DiffusionRenderer/)+ [Generative Physical AI in Vision: A Survey](https://arxiv.org/abs/2501.10928) (Jan, 2025)
[](https://github.com/BestJunYu/Awesome-Physics-aware-Generation)
[](https://arxiv.org/abs/2501.10928)
[](https://github.com/BestJunYu/Awesome-Physics-aware-Generation)+ [Do generative video models learn physical principles from watching videos?](https://arxiv.org/pdf/2501.09038) (Jan, 2025)
[](https://arxiv.org/pdf/2501.09038)
[](https://physics-iq.github.io/)+ [Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control](https://arxiv.org/pdf/2501.03847) (Jan, 2025)
[](https://github.com/IGL-HKUST/DiffusionAsShader)
[](https://arxiv.org/pdf/2501.03847)
[](https://igl-hkust.github.io/das/)+ [Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning](https://arxiv.org/pdf/2412.00547) (Nov, 2024)
[](https://arxiv.org/pdf/2412.00547)+ [PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation](https://arxiv.org/abs/2412.00596) (Nov, 2024)
[](https://github.com/pittisl/PhyT2V)
[](https://arxiv.org/abs/2412.00596)+ [Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image](https://arxiv.org/pdf/2411.16800) (Nov, 2024)
[](https://arxiv.org/pdf/2411.16800)+ [PhysMotion: Physics-Grounded Dynamics From a Single Image](https://arxiv.org/abs/2411.17189) (Nov, 2024)
[](https://arxiv.org/abs/2411.17189)
[](https://supertan0204.github.io/physmotion_website/)+ [AutoVFX: Physically Realistic Video Editing from Natural Language Instructions](https://arxiv.org/pdf/2411.02385) (Nov, 2024)
[](https://github.com/haoyuhsu/autovfx)
[](https://haoyuhsu.github.io/autovfx-website/)
[](https://haoyuhsu.github.io/autovfx-website/)+ [How Far is Video Generation from World Model: A Physical Law Perspective](https://arxiv.org/pdf/2411.02385) (Oct, 2024)
[](https://github.com/phyworld/phyworld)
[](https://arxiv.org/pdf/2411.02385)
[](https://phyworld.github.io/)+ [Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models](https://arxiv.org/pdf/2410.10821) (Oct, 2024)
[](https://github.com/ZqlwMatt/Tex4D)
[](https://arxiv.org/pdf/2410.10821)
[](https://tex4d.github.io/)+ [PhyGenBench: Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation](https://arxiv.org/pdf/2410.05363) (Oct, 2024)
[](https://github.com/OpenGVLab/PhyGenBench)
[](https://arxiv.org/pdf/2410.05363)
[](https://phygenbench123.github.io/)+ [PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation](https://arxiv.org/abs/2409.18964) (Oct, 2024)
[](https://github.com/stevenlsw/physgen)
[](https://arxiv.org/abs/2409.18964)
[](https://stevenlsw.github.io/physgen/)+ [StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos](https://arxiv.org/pdf/2409.07447) (Oct, 2024)
[](https://arxiv.org/pdf/2409.07447)
[](http://stereocrafter.github.io/)+ [ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis](https://arxiv.org/abs/2409.02048) (Sep, 2024)
[](https://github.com/Drexubery/ViewCrafter)
[](https://arxiv.org/abs/2409.02048)
[](https://drexubery.github.io/ViewCrafter/)+ [Compositional 3D-aware Video Generation with LLM Director](https://www.microsoft.com/en-us/research/project/compositional-3d-aware-video-generation/) (Aug, 2024)
[](https://arxiv.org/pdf/2409.00558)
[](https://www.microsoft.com/en-us/research/project/compositional-3d-aware-video-generation/)
+ [IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation](https://arxiv.org/abs/2407.10937) (Jul, 2024)
[](https://github.com/yhZhai/idol)
[](https://arxiv.org/abs/2407.10937)
+ [PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation](https://arxiv.org/abs/2404.13026) (ECCV 2024)
[](https://github.com/a1600012888/PhysDreamer)
[](https://arxiv.org/abs/2404.13026)
[](https://physdreamer.github.io/)

### Video Editing
+ [MTV-Inpaint: Multi-Task Long Video Inpainting](https://arxiv.org/abs/2503.11412) (Mar., 2025)
[](https://arxiv.org/abs/2503.11412)
[](https://mtv-inpaint.github.io/)+ [Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists](https://arxiv.org/abs/2502.06734) (Mar., 2025)
[](https://github.com/zibojia/SENORITA)
[](https://arxiv.org/abs/2502.06734)
[](https://senorita-2m-dataset.github.io/)+ [VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing](https://arxiv.org/abs/2502.17258) (Feb., 2025 | ICLR 2025)
[](https://github.com/knightyxp/VideoGrain)
[](https://arxiv.org/abs/2502.17258)
[](https://knightyxp.github.io/VideoGrain_project_page/)+ [Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation](https://arxiv.org/pdf/2501.05020) (Jan., 2025)
[](https://github.com/chen-yingjie/Perception-as-Control)
[](https://arxiv.org/pdf/2501.05020)
[](https://chen-yingjie.github.io/projects/Perception-as-Control/)+ [Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning](https://arxiv.org/abs/2501.06438) (Jan., 2025)
[](https://arxiv.org/abs/2501.06438)+ [MIVE: New Design and Benchmark for Multi-Instance Video Editing](https://arxiv.org/abs/2412.12877) (Dec., 2024)
[](https://arxiv.org/abs/2412.12877)
[](https://kaist-viclab.github.io/mive-site/)+ [VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping](https://arxiv.org/abs/2403.16999) (Dec., 2024)
[](https://github.com/deepcs233/VividFace)
[](https://arxiv.org/abs/2403.16999)
[](https://hao-shao.com/projects/vividface.html)+ [MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models](https://arxiv.org/abs/2412.05275) (Dec., 2024)
[](https://arxiv.org/abs/2412.05275)
[](https://motionflow-diffusion.github.io/)+ [DIVE: Taming DINO for Subject-Driven Video Editing](https://arxiv.org/abs/2412.03347) (Dec., 2024)
[](https://arxiv.org/abs/2412.03347)
[](https://dino-video-editing.github.io/)+ [AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction](https://arxiv.org/pdf/2412.02684) (Dec., 2024)
[](https://github.com/aigc3d/AniGS)
[](https://arxiv.org/pdf/2412.02684)
[](https://lingtengqiu.github.io/2024/AniGS/)+ [StableV2V: Stablizing Shape Consistency in Video-to-Video Editing](https://arxiv.org/pdf/2411.11045) (Nov, 2024)
[](https://github.com/AlonzoLeeeooo/StableV2V)
[](https://arxiv.org/pdf/2411.11045)
[](https://alonzoleeeooo.github.io/StableV2V/)+ [Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection](https://arxiv.org/abs/2405.16823) (May, 2024)
[](https://arxiv.org/abs/2405.16823)
[](https://unifyediting.github.io/)+ [I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models](https://arxiv.org/abs/2405.16537) (May, 2024)
[](https://arxiv.org/abs/2405.16537)
[](https://i2vedit.github.io/)+ [Looking Backward: Streaming Video-to-Video Translation with Feature Banks](https://arxiv.org/abs/2405.15757) (May, 2024)
[](https://github.com/Jeff-LiangF/streamv2v)
[](https://arxiv.org/abs/2405.15757)
[](https://jeff-liangf.github.io/projects/streamv2v/)+ [ReVideo: Remake a Video with Motion and Content Control](https://arxiv.org/abs/2405.13865) (May, 2024)
[](https://github.com/MC-E/ReVideo)
[](https://arxiv.org/abs/2405.13865)
[](https://mc-e.github.io/project/ReVideo/)+ [Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices](https://arxiv.org/abs/2405.12211) (May, 2024)
[](https://github.com/fallenshock/Slicedit)
[](https://arxiv.org/abs/2405.12211)
[](https://matankleiner.github.io/slicedit/)+ [ViViD: Video Virtual Try-on using Diffusion Models](https://arxiv.org/abs/2405.11794) (May, 2024)
[](https://github.com/BecauseImBatman0/ViViD)
[](https://arxiv.org/abs/2405.11794)
[](https://becauseimbatman0.github.io/ViViD)+ [Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing](https://arxiv.org/abs/2405.04496) (May, 2024)
[](https://arxiv.org/abs/2405.04496)+ [GenVideo: One-shot target-image and shape aware video editing using T2I diffusion models](https://arxiv.org/abs/2404.12541) (Apr., 2024)
[](https://arxiv.org/abs/2404.12541)+ [EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing](https://arxiv.org/abs/2403.16111) (Mar., 2024)
[](https://github.com/knightyxp/EVA_Video_Edit)
[](https://arxiv.org/abs/2403.16111)
[](https://knightyxp.github.io/EVA/)+ [Spectral Motion Alignment for Video Motion Transfer using Diffusion Models](https://arxiv.org/abs/2403.15249) (Mar., 2024)
[](https://arxiv.org/abs/2403.15249)
[](https://geonyeong-park.github.io/spectral-motion-alignment/)+ [AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks](https://arxiv.org/abs/2403.14468) (Mar., 2024)
[](https://github.com/TIGER-AI-Lab/AnyV2V)
[](https://arxiv.org/abs/2403.14468)
[](https://tiger-ai-lab.github.io/AnyV2V/)+ [CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility](https://arxiv.org/abs/2403.12035) (Mar., 2024)
[](https://arxiv.org/abs/2403.12035)
[](https://cococozibojia.github.io/)
[](https://github.com/zibojia/COCOCO)+ [DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing](https://arxiv.org/abs/2403.12002) (Mar., 2024)
[](https://arxiv.org/abs/2403.12002)
[](https://hyeonho99.github.io/dreammotion/)+ [Video Editing via Factorized Diffusion Distillation](https://arxiv.org/abs/2403.09334) (Mar., 2024)
[](https://arxiv.org/abs/2403.09334)
[](https://fdd-video-edit.github.io/)+ [FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing](https://arxiv.org/abs/2403.06269) (Mar., 2024)
[](https://arxiv.org/abs/2403.06269)+ [UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing](https://arxiv.org/abs/2402.13185) (Feb., 2024)
[](https://github.com/JianhongBai/UniEdit)
[](https://arxiv.org/abs/2402.13185)
[](https://jianhongbai.github.io/UniEdit/)
+ [Object-Centric Diffusion for Efficient Video Editing](https://arxiv.org/abs/2401.05735) (Jan., 2024)
[](https://arxiv.org/abs/2401.05735)+ [VASE: Object-Centric Shape and Appearance Manipulation of Real Videos](https://arxiv.org/abs/2401.02473) (Jan., 2024)
[](https://github.com/helia95/VASE)
[](https://arxiv.org/abs/2401.02473)
[](https://helia95.github.io/vase-website/)+ [FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis](https://arxiv.org/abs/2312.17681) (Dec., 2023)
[](https://github.com/Jeff-LiangF/FlowVid)
[](https://arxiv.org/abs/2312.17681)
[](https://jeff-liangf.github.io/projects/flowvid/)+ [Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis](https://arxiv.org/abs/2312.13834) (Dec., 2023)
[](https://arxiv.org/abs/2312.13834)
[](https://fairy-video2video.github.io/)+ [RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing](https://arxiv.org/abs/2312.12635) (Dec., 2023)
[](https://arxiv.org/abs/2312.12635)+ [MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers](https://arxiv.org/abs/2312.12468) (Dec., 2023)
[](https://arxiv.org/abs/2312.12468)
[](https://maskint.github.io/)+ [VidToMe: Video Token Merging for Zero-Shot Video Editing](https://arxiv.org/abs/2312.10656) (Dec., 2023)
[](https://github.com/lixirui142/VidToMe)
[](https://arxiv.org/abs/2312.10656)
[](https://vidtome-diffusion.github.io/)+ [A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing](https://arxiv.org/abs/2312.05856) (Dec., 2023)
[](https://github.com/STEM-Inv/stem-inv)
[](https://arxiv.org/abs/2312.05856)
[](https://stem-inv.github.io/page/)+ [Neutral Editing Framework for Diffusion-based Video Editing](https://arxiv.org/abs/2312.06708) (Dec., 2023)
[](https://arxiv.org/abs/2312.06708)
[](https://neuedit.github.io/)
+ [DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing](https://arxiv.org/abs/2312.03772) (Dec., 2023)
[](https://arxiv.org/abs/2312.03772)+ [RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models](https://arxiv.org/abs/2312.04524) (Dec., 2023)
[](https://github.com/rehg-lab/RAVE)
[](https://arxiv.org/abs/2312.04524)
[](https://rave-video.github.io/)+ [SAVE: Protagonist Diversification with Structure Agnostic Video Editing](https://arxiv.org/abs/2312.02503) (Dec., 2023)
[](https://arxiv.org/abs/2312.02503)
[](https://ldynx.github.io/SAVE/)+ [MagicStick: Controllable Video Editing via Control Handle Transformations](https://arxiv.org/abs/2312.03047) (Dec., 2023)
[](https://github.com/mayuelala/MagicStick)
[](https://arxiv.org/abs/2312.03047)
[](https://magic-stick-edit.github.io/)+ [VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence](https://arxiv.org/abs/2312.02087) (CVPR 2024)
[](https://github.com/showlab/VideoSwap)
[](https://arxiv.org/abs/2312.02087)
[](https://videoswap.github.io/)+ [DragVideo: Interactive Drag-style Video Editing](https://arxiv.org/abs/2312.02216) (Dec., 2023)
[](https://github.com/RickySkywalker/DragVideo-Official)
[](https://arxiv.org/abs/2312.02216)+ [Drag-A-Video: Non-rigid Video Editing with Point-based Interaction](https://arxiv.org/abs/2312.02936) (Dec., 2023)
[](https://arxiv.org/abs/2312.02936)
[](https://drag-a-video.github.io/)+ [BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models](https://arxiv.org/abs/2312.02813) (Dec., 2023)
[](https://arxiv.org/abs/2312.02813)
[](https://bivdiff.github.io/)+ [VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models](https://arxiv.org/abs/2312.00845) (CVPR 2024)
[](https://github.com/HyeonHo99/Video-Motion-Customization)
[](https://arxiv.org/abs/2312.00845)
[](https://video-motion-customization.github.io)+ [FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing](https://arxiv.org/abs/2310.05922) (ICLR 2024)
[](https://github.com/yrcong/flatten)
[](https://arxiv.org/abs/2310.05922)
[](https://flatten-video-editing.github.io)+ [MotionEditor: Editing Video Motion via Content-Aware Diffusion](https://arxiv.org/abs/2311.18830) (Nov., 2023)
[](https://github.com/Francis-Rings/MotionEditor)
[](https://arxiv.org/abs/2311.18830)
[](https://francis-rings.github.io/MotionEditor/)+ [Motion-Conditioned Image Animation for Video Editing](https://arxiv.org/abs/2311.18827) (Nov., 2023)
[](https://arxiv.org/abs/2311.18827)
[](https://facebookresearch.github.io/MoCA/)+ [Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer](https://arxiv.org/abs/2311.17009) (CVPR 2024)
[](https://arxiv.org/abs/2311.17009)
[](https://diffusion-motion-transfer.github.io/)+ [Cut-and-Paste: Subject-Driven Video Editing with Attention Control](https://arxiv.org/abs/2311.11697) (Nov., 2023)
[](https://arxiv.org/abs/2311.11697)+ [LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation](https://arxiv.org/abs/2311.00353) (Nov., 2023)
[](https://arxiv.org/abs/2311.00353)
+ [Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models](https://arxiv.org/abs/2310.16400) (Oct., 2023)
[](https://arxiv.org/abs/2310.16400)+ [DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing](https://arxiv.org/abs/2310.10624) (Oct., 2023)
[](https://arxiv.org/abs/2310.10624)
[](https://showlab.github.io/DynVideo-E/)+ [Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models](https://arxiv.org/abs/2310.01107) (ICLR 2024)
[](https://github.com/Ground-A-Video/Ground-A-Video)
[](https://arxiv.org/abs/2310.01107)
[](https://ground-a-video.github.io/)+ [CCEdit: Creative and Controllable Video Editing via Diffusion Models](https://arxiv.org/abs/2309.16496) (Sep., 2023)
[](https://arxiv.org/abs/2309.16496)+ [MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation](https://arxiv.org/abs/2309.00908) (Sep., 2023)
[](https://arxiv.org/abs/2309.00908)+ [MagicEdit: High-Fidelity and Temporally Coherent Video Editing](https://arxiv.org/abs/2308.14749) (Aug., 2023)
[](https://github.com/magic-research/magic-edit)
[](https://arxiv.org/abs/2308.14749)
[](https://magic-edit.github.io/)+ [StableVideo: Text-driven Consistency-aware Diffusion Video Editing](https://arxiv.org/abs/2308.09592) (ICCV 2023)
[](https://github.com/rese1f/StableVideo)
[](https://arxiv.org/abs/2308.09592)
[](https://rese1f.github.io/StableVideo/)+ [CoDeF: Content Deformation Fields for Temporally Consistent Video Processing](https://arxiv.org/abs/2308.07926) (CVPR 2024)
[](https://github.com/qiuyu96/CoDeF)
[](https://arxiv.org/abs/2308.07926)
[](https://qiuyu96.github.io/CoDeF/)+ [TokenFlow: Consistent Diffusion Features for Consistent Video Editing](https://arxiv.org/abs/2307.10373) (ICLR 2024)
[](https://github.com/omerbt/TokenFlow)
[](https://arxiv.org/abs/2307.10373)
[](https://diffusion-tokenflow.github.io/)+ [INVE: Interactive Neural Video Editing](https://arxiv.org/abs/2307.07663) (Jul., 2023)
[](https://arxiv.org/abs/2307.07663)
[](https://gabriel-huang.github.io/inve/)+ [VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing](https://arxiv.org/abs/2306.08707) (Jun., 2023)
[](https://arxiv.org/abs/2306.08707)
[](https://videdit.github.io/)+ [Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation](https://arxiv.org/abs/2306.07954) (SIGGRAPH Asia 2023)
[](https://github.com/williamyang1991/Rerender_A_Video)
[](https://arxiv.org/abs/2306.07954)
[](https://www.mmlab-ntu.com/project/rerender/)+ [ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing](https://arxiv.org/abs/2305.17098) (May, 2023)
[](https://github.com/thu-ml/controlvideo)
[](https://arxiv.org/abs/2305.17098)
[](https://ml.cs.tsinghua.edu.cn/controlvideo/)+ [Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts](https://arxiv.org/abs/2305.08850) (May, 2023)
[](https://github.com/Make-A-Protagonist/Make-A-Protagonist)
[](https://arxiv.org/abs/2305.08850)
[](https://make-a-protagonist.github.io/)+ [Soundini: Sound-Guided Diffusion for Natural Video Editing](https://arxiv.org/abs/2304.06818) (Apr., 2023)
[](https://github.com/kuai-lab/soundini-official)
[](https://arxiv.org/abs/2304.06818)
[](https://kuai-lab.github.io/soundini-gallery/)+ [Zero-Shot Video Editing Using Off-the-Shelf Image Diffusion Models](https://arxiv.org/abs/2303.17599) (Mar., 2023)
[](https://github.com/baaivision/vid2vid-zero)
[](https://arxiv.org/abs/2303.17599)
[](https://huggingface.co/spaces/BAAI/vid2vid-zero)+ [Edit-A-Video: Single Video Editing with Object-Aware Consistency](https://arxiv.org/abs/2303.07945) (Mar., 2023)
[](https://arxiv.org/abs/2303.07945)
[](https://edit-a-video.github.io/)+ [FateZero: Fusing Attentions for Zero-shot Text-based Video Editing](https://arxiv.org/abs/2303.09535) (Mar., 2023)
[](https://github.com/ChenyangQiQi/FateZero)
[](https://arxiv.org/abs/2303.09535)
[](https://fate-zero-edit.github.io/)+ [Pix2video: Video Editing Using Image Diffusion](https://arxiv.org/abs/2303.12688) (Mar., 2023)
[](https://arxiv.org/abs/2303.12688)
[](https://duyguceylan.github.io/pix2video.github.io/)+ [Video-P2P: Video Editing with Cross-attention Control](https://arxiv.org/abs/2303.04761) (Mar., 2023)
[](https://github.com/ShaoTengLiu/Video-P2P)
[](https://arxiv.org/abs/2303.04761)
[](https://video-p2p.github.io/)+ [Dreamix: Video Diffusion Models Are General Video Editors](https://arxiv.org/abs/2302.01329) (Feb., 2023)
[](https://arxiv.org/abs/2302.01329)
[](https://dreamix-video-editing.github.io/)+ [Shape-Aware Text-Driven Layered Video Editing](https://arxiv.org/abs/2301.13173) (Jan., 2023)
[](https://arxiv.org/abs/2301.13173)
[](https://text-video-edit.github.io/)+ [Speech Driven Video Editing via an Audio-Conditioned Diffusion Model](https://arxiv.org/abs/2301.04474) (Jan., 2023)
[](https://github.com/DanBigioi/DiffusionVideoEditing)
[](https://arxiv.org/abs/2301.04474)
[](https://danbigioi.github.io/DiffusionVideoEditing/)+ [Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding](https://arxiv.org/abs/2212.02802) (CVPR 2023)
[](https://github.com/man805/Diffusion-Video-Autoencoders)
[](https://arxiv.org/abs/2212.02802)
[](https://diff-video-ae.github.io/)

### Human or Subject Motion
+ [AnyTop: Character Animation Diffusion with Any Topology](https://arxiv.org/abs/2502.17327) (Feb., 2025)
[](https://github.com/Anytop2025/Anytop)
[](https://arxiv.org/abs/2502.17327)
[](https://anytop2025.github.io/Anytop-page/)+ [HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation](https://arxiv.org/abs/2502.04847) (Feb., 2025)
[](https://arxiv.org/abs/2502.04847)
[](https://agnjason.github.io/HumanDiT-page/)+ [VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models](https://hila-chefer.github.io/videojam-paper.github.io/VideoJAM_arxiv.pdf) (Feb., 2025)
[](https://hila-chefer.github.io/videojam-paper.github.io/VideoJAM_arxiv.pdf)
[](https://hila-chefer.github.io/videojam-paper.github.io/)+ [OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models](https://arxiv.org/abs/2502.01061) (Feb., 2025)
[](https://arxiv.org/abs/2502.01061)
[](https://omnihuman-lab.github.io/)+ [AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation](https://arxiv.org/pdf/2411.17383) (Nov., 2024)
[](https://github.com/cangcz/AnchorCrafter)
[](https://arxiv.org/pdf/2411.17383)
[](https://cangcz.github.io/Anchor-Crafter/)+ [KMM: Key Frame Mask Mamba for Extended Motion Generation](https://arxiv.org/abs/2411.06481) (Nov., 2024)
[](https://github.com/steve-zeyu-zhang/KMM)
[](https://arxiv.org/abs/2411.06481)
[](https://steve-zeyu-zhang.github.io/KMM/)+ [DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction](https://arxiv.org/abs/2411.04646) (Nov., 2024)
[](https://arxiv.org/abs/2411.04646)
[](https://th-mlab.github.io/DanceFusion/)+ [Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning](https://arxiv.org/abs/2410.24219) (Nov., 2024)
[](https://github.com/PR-Ryan/DEMO)
[](https://arxiv.org/abs/2410.24219)
[](https://pr-ryan.github.io/DEMO-project/)+ [A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights](https://arxiv.org/abs/2407.08428) (Jul., 2024)
[](https://arxiv.org/abs/2407.08428)
+ [OccFusion: Rendering Occluded Humans with Generative Diffusion Priors](https://arxiv.org/pdf/2407.00316) (Jul., 2024)
[](https://arxiv.org/pdf/2407.00316)
[](https://cs.stanford.edu/~xtiange/projects/occfusion/)+ [EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions](https://arxiv.org/abs/2407.08136) (Jul., 2024)
[](https://github.com/BadToBest/EchoMimic)
[](https://arxiv.org/pdf/2407.08136)
[](https://badtobest.github.io/echomimic.html)+ [DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation](https://arxiv.org/abs/2401.04747) (CVPR 2024)
[](https://github.com/JeremyCJM/DiffSHEG)
[](https://arxiv.org/abs/2401.04747)
[](https://jeremycjm.github.io/proj/DiffSHEG/)+ [Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model](https://arxiv.org/abs/2304.08577) (CVPR 2023)
[](https://github.com/facebookresearch/AGRoL)
[](https://arxiv.org/abs/2304.08577)
[](https://dulucas.github.io/agrol/)+ [InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions](https://arxiv.org/abs/2304.05684) (Apr., 2023)
[](https://github.com/tr3e/InterGen)
[](https://arxiv.org/abs/2304.05684)+ [ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model](https://arxiv.org/abs/2304.01116) (Apr., 2023)
[](https://github.com/mingyuan-zhang/ReMoDiffuse)
[](https://arxiv.org/abs/2304.01116)
[](https://mingyuan-zhang.github.io/projects/ReMoDiffuse.html)+ [Human Motion Diffusion as a Generative Prior](https://arxiv.org/abs/2303.01418) (Mar., 2023)
[](https://github.com/priorMDM/priorMDM)
[](https://arxiv.org/abs/2303.01418)
[](https://priormdm.github.io/priorMDM-page/)+ [Can We Use Diffusion Probabilistic Models for 3d Motion Prediction?](https://arxiv.org/abs/2302.14503) (Feb., 2023)
[](https://github.com/cotton-ahn/diffusion-motion-prediction)
[](https://arxiv.org/abs/2302.14503)
[](https://sites.google.com/view/diffusion-motion-prediction)+ [Single Motion Diffusion](https://arxiv.org/abs/2302.05905) (Feb., 2023)
[](https://github.com/SinMDM/SinMDM)
[](https://arxiv.org/abs/2302.05905)
[](https://sinmdm.github.io/SinMDM-page/)+ [HumanMAC: Masked Motion Completion for Human Motion Prediction](https://arxiv.org/abs/2302.03665) (Feb., 2023)
[](https://github.com/LinghaoChan/HumanMAC)
[](https://arxiv.org/abs/2302.03665)
[](https://lhchen.top/Human-MAC/)+ [DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model](https://arxiv.org/abs/2301.10047) (Jan., 2023)
[](https://arxiv.org/abs/2301.10047)+ [Modiff: Action-Conditioned 3d Motion Generation With Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2301.03949) (Jan., 2023)
[](https://arxiv.org/abs/2301.03949)+ [Unifying Human Motion Synthesis and Style Transfer With Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2212.08526) (GRAPP 2023)
[](https://github.com/mrzzy2021/styledmotionsynthesis)
[](https://arxiv.org/abs/2212.08526)+ [Executing Your Commands via Motion Diffusion in Latent Space](https://arxiv.org/abs/2212.04048) (CVPR 2023)
[](https://github.com/ChenFengYe/motion-latent-diffusion)
[](https://arxiv.org/abs/2212.04048)
[](https://chenxin.tech/mld/)+ [Pretrained Diffusion Models for Unified Human Motion Synthesis](https://arxiv.org/abs/2212.02837) (Dec., 2022)
[](https://arxiv.org/abs/2212.02837)
[](https://ofa-sys.github.io/MoFusion/)
+ [PhysDiff: Physics-Guided Human Motion Diffusion Model](https://arxiv.org/abs/2212.02500) (Dec., 2022)
[](https://arxiv.org/abs/2212.02500)
[](https://nvlabs.github.io/PhysDiff/)+ [BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction](https://arxiv.org/abs/2211.14304) (Dec., 2022)
[](https://github.com/BarqueroGerman/BeLFusion)
[](https://arxiv.org/abs/2211.14304)
[](https://barquerogerman.github.io/BeLFusion/)+ [Diffusion Motion: Generate Text-Guided 3d Human Motion by Diffusion Model](https://arxiv.org/abs/2210.12315) (ICASSP 2023)
[](https://arxiv.org/abs/2210.12315)+ [Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction](https://arxiv.org/abs/2210.05976) (Oct., 2022)
[](https://arxiv.org/abs/2210.05976)+ [Human Motion Diffusion Model](https://arxiv.org/abs/2209.14916) (ICLR 2023)
[](https://github.com/GuyTevet/motion-diffusion-model)
[](https://arxiv.org/abs/2209.14916)
[](https://guytevet.github.io/mdm-page/)+ [FLAME: Free-form Language-based Motion Synthesis & Editing](https://arxiv.org/abs/2209.00349) (AAAI 2023)
[](https://github.com/kakaobrain/flame)
[](https://arxiv.org/abs/2209.00349)
[](https://kakaobrain.github.io/flame/)+ [MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model](https://arxiv.org/abs/2208.15001) (Aug., 2022)
[](https://github.com/mingyuan-zhang/MotionDiffuse)
[](https://arxiv.org/abs/2208.15001)
[](https://mingyuan-zhang.github.io/projects/MotionDiffuse.html)+ [Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion](https://arxiv.org/abs/2203.13777) (CVPR 2022)
[](https://github.com/gutianpei/MID)
[](https://arxiv.org/abs/2203.13777)

### Video Enhancement and Restoration
+ [Enhance-A-Video: Better Generated Video for Free](https://arxiv.org/abs/2502.07508) (Feb., 2025)
[](https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video)
[](https://arxiv.org/abs/2502.07508)
[](https://oahzxl.github.io/Enhance_A_Video/)+ [SVFR: A Unified Framework for Generalized Video Face Restoration](https://arxiv.org/abs/2501.01235) (Jan., 2025)
[](https://arxiv.org/abs/2501.01235)+ [Disentangled Motion Modeling for Video Frame Interpolation](https://arxiv.org/abs/2406.17256) (Jun, 2024)
[](https://github.com/JHLew/MoMo)
[](https://arxiv.org/abs/2406.17256)
[](https://github.com/JHLew/MoMo)+ [DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models](https://arxiv.org/abs/2407.01519) (Jul., 2024)
[](https://arxiv.org/abs/2407.01519)
[](https://jimmycv07.github.io/DiffIR2VR_web/)+ [LDMVFI: Video Frame Interpolation with Latent Diffusion Models](https://arxiv.org/abs/2303.09508) (Mar., 2023)
[](https://arxiv.org/abs/2303.09508)+ [CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming](https://arxiv.org/abs/2211.08428) (Nov., 2022)
[](https://arxiv.org/abs/2211.08428)

### Audio Synthesis for Video
+ [AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation](https://arxiv.org/pdf/2406.07686) (Feb., 2025)
[](https://arxiv.org/pdf/2406.07686)+ [UniForm: A Unified Diffusion Transformer for Audio-Video Generation](https://arxiv.org/abs/2502.03897) (Feb., 2025)
[](https://arxiv.org/abs/2502.03897)
[](https://uniform-t2av.github.io/)+ [AGAV-Rater: Enhancing LMM for AI-Generated Audio-Visual Quality Assessment](https://arxiv.org/abs/2501.18314) (Jan., 2025)
[](https://arxiv.org/abs/2501.18314)
[](https://agav-rater.github.io/)+ [XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework](https://arxiv.org/abs/2501.08809) (Jan., 2025)
[](https://arxiv.org/abs/2501.08809)
[](https://xmusic-project.github.io/)+ [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://arxiv.org/abs/2412.15322) (Dec., 2024)
[](https://github.com/hkchengrex/MMAudio)
[](https://arxiv.org/abs/2412.15322)
[](https://hkchengrex.com/MMAudio/)+ [Stable-V2A: Synthesis of Synchronized Audio Effects with Temporal and Semantic Controls](https://arxiv.org/abs/2412.15023) (Dec., 2024)
[](https://github.com/ispamm/Stable-V2A)
[](https://arxiv.org/abs/2412.15023)
[](https://ispamm.github.io/Stable-V2A/)+ [AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation](https://arxiv.org/abs/2412.15191) (Dec., 2024)
[](https://github.com/snap-research/AVLink)
[](https://arxiv.org/abs/2412.15191)
[](https://snap-research.github.io/AVLink/)+ [VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation](https://arxiv.org/abs/2412.10768) (Nov., 2024)
[](https://arxiv.org/abs/2412.10768)
[](https://www.youtube.com/watch?v=QmqWhUjPkJI)+ [YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls](https://arxiv.org/abs/2412.09551) (Nov., 2024)
[](https://arxiv.org/abs/2412.09551)
[](https://giantailab.github.io/yingsound/)+ [Video-Guided Foley Sound Generation with Multimodal Controls](https://arxiv.org/pdf/2411.17698) (Nov., 2024)
[](https://arxiv.org/pdf/2411.17698)
[](https://ificl.github.io/MultiFoley/)+ [MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization](https://arxiv.org/abs/2410.12957) (Oct., 2024)
[](https://arxiv.org/abs/2410.12957)+ [Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation](https://arxiv.org/abs/2309.16429) (Sep., 2023)
[](https://github.com/guyyariv/TempoTokens)
[](https://arxiv.org/abs/2309.16429)
[](https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/)+ [VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos](https://www.arxiv.org/abs/2409.07450) (Oct., 2024)
[](https://www.arxiv.org/abs/2409.07450)
[](https://genjib.github.io/project_page/VMAs/index.html)+ [STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment](https://arxiv.org/pdf/2409.08601) (Oct., 2024)
[](https://github.com/y-ren16/STAV2A)
[](https://arxiv.org/pdf/2409.08601)
[](https://y-ren16.github.io/STAV2A/)+ [Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis](https://arxiv.org/pdf/2409.06135) (Sep., 2024)
[](https://github.com/yannqi/Draw-an-Audio-Code)
[](https://arxiv.org/pdf/2409.06135)
[](https://yannqi.github.io/Draw-an-Audio/)+ [Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming](https://arxiv.org/abs/2408.16725) (Jul., 2024)
[](https://github.com/gpt-omni/mini-omni)
[](https://arxiv.org/abs/2408.16725)
[](https://github.com/gpt-omni/mini-omni)+ [Speech To Speech: an effort for an open-sourced and modular GPT4-o](https://github.com/huggingface/speech-to-speech) (Aug., 2024)
[](https://github.com/huggingface/speech-to-speech)
[](https://github.com/huggingface/speech-to-speech)+ [Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound](https://arxiv.org/abs/2408.11915) (Aug., 2024)
[](https://github.com/jnwnlee/video-foley)
[](https://arxiv.org/abs/2408.11915)
[](https://jnwnlee.github.io/video-foley-demo/)+ [Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity](https://arxiv.org/abs/2407.10387) (Jul., 2024)
[](https://arxiv.org/abs/2407.10387)
[](https://maskvat.github.io/)+ [Video-to-Audio Generation with Hidden Alignment](https://arxiv.org/abs/2407.07464) (Jul., 2024)
[](https://github.com/ariesssxu/vta-ldm)
[](https://arxiv.org/abs/2407.07464)
[](https://sites.google.com/view/vta-ldm)
+ [Read, Watch and Scream! Sound Generation from Text and Video](https://arxiv.org/abs/2407.05551) (Jul., 2024)
[](https://github.com/naver-ai/rewas)
[](https://arxiv.org/abs/2407.05551)
[](https://naver-ai.github.io/rewas/)+ [FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds](https://arxiv.org/abs/2407.01494) (July, 2024)
[](https://arxiv.org/abs/2407.01494)
[](https://foleycrafter.github.io/)+ [Network Bending of Diffusion Models for Audio-Visual Generation](https://arxiv.org/abs/2406.19589) (CVPR, 2024)
[](https://github.com/dzluke/DAFX2024)
[](https://arxiv.org/abs/2406.19589)

### Talking Head Generation
+ [Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modelling for Natural Talking Head Generation](https://harlanhong.github.io/publications/actalker/index.html) (Apr., 2025)
[](https://arxiv.org/abs/2504.02542)
[](https://harlanhong.github.io/publications/actalker/index.html)
[](https://github.com/harlanhong/ACTalker)+ [MoCha: Towards Movie-Grade Talking Character Synthesis](https://arxiv.org/abs/2503.23307) (Apr., 2025)
[](https://arxiv.org/abs/2503.23307)
[](https://congwei1230.github.io/MoCha/)
+ [SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion](https://arxiv.org/pdf/2502.11515) (Feb., 2025)
[](https://arxiv.org/pdf/2502.11515)+ [VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization](https://arxiv.org/abs/2412.09892) (Dec., 2024)
[](https://arxiv.org/abs/2412.09892)
[](https://x-lance.github.io/VQTalker/)+ [IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation](https://arxiv.org/abs/2412.04000) (Dec., 2024)
[](https://arxiv.org/abs/2412.04000)
[](http://ec2-3-25-102-128.ap-southeast-2.compute.amazonaws.com/IF-MDM/ifmdm_supplementary/index.html)+ [INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations](https://arxiv.org/pdf/2412.04037) (Dec., 2024)
[](https://arxiv.org/pdf/2412.04037)
[](https://grisoon.github.io/INFP/)+ [MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation](https://arxiv.org/abs/2412.04448) (Dec., 2024)
[](https://arxiv.org/abs/2412.04448)
[](https://memoavatar.github.io/)
[](https://github.com/memoavatar/memo)+ [SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model](https://arxiv.org/pdf/2412.03430) (Dec., 2024)
[](https://arxiv.org/pdf/2412.03430)
[](https://yl4467.github.io/)+ [Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation](https://arxiv.org/abs/2412.00719) (Nov., 2024)
[](https://github.com/ShaelynZ/synergize-motion-appearance)
[](https://arxiv.org/abs/2412.00719)
[](https://shaelynz.github.io/synergize-motion-appearance/)+ [Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks](https://arxiv.org/pdf/2412.00733) (Nov., 2024)
[](https://arxiv.org/pdf/2412.00733)
[](https://github.com/fudan-generative-vision/hallo3)+ [FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait](https://arxiv.org/abs/2412.01064) (Nov., 2024)
[](https://arxiv.org/abs/2412.01064)
[](https://deepbrainai-research.github.io/float/)+ [EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion](https://arxiv.org/pdf/2411.16726) (Nov., 2024)
[](https://arxiv.org/pdf/2411.16726)+ [LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis](https://arxiv.org/pdf/2411.16748) (Nov., 2024)
[](https://arxiv.org/pdf/2411.16748)+ [Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://arxiv.org/pdf/2409.02634) (Nov., 2024)
[](https://arxiv.org/pdf/2409.02634)
[](https://loopyavatar.github.io/)+ [HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models](https://arxiv.org/abs/2410.22901) (Oct., 2024)
[](https://github.com/HelloVision/HelloMeme)
[](https://arxiv.org/abs/2410.22901)
[](https://songkey.github.io/hellomeme/)+ [PersonaTalk: Bring Attention to Your Persona in Visual Dubbing](https://arxiv.org/pdf/2409.05379) (Oct., 2024)
[](https://arxiv.org/pdf/2409.05379)
[](https://grisoon.github.io/PersonaTalk/)+ [Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis](https://openaccess.thecvf.com/content_ICCV_2019/papers/Lee_Talking_With_Hands_16.2M_A_Large-Scale_Dataset_of_Synchronized_Body-Finger_ICCV_2019_paper.pdf) (ICCV 2019)
[](https://openaccess.thecvf.com/content_ICCV_2019/papers/Lee_Talking_With_Hands_16.2M_A_Large-Scale_Dataset_of_Synchronized_Body-Finger_ICCV_2019_paper.pdf)+ [Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion](https://arxiv.org/abs/2204.08451) (CVPR 2022)
[](https://github.com/evonneng/learning2listen)
[](https://arxiv.org/abs/2204.08451)
[](https://evonneng.github.io/learning2listen/)+ [GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents](https://arxiv.org/abs/2303.14613) (SIGGRAPH 2023)
[](https://arxiv.org/abs/2303.14613)
[](https://pku-mocca.github.io/GestureDiffuCLIP-Page/)+ [From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations](https://evonneng.github.io/projects/audio2photoreal/static/CCA.pdf) (CVPR 2024)
[](https://evonneng.github.io/projects/audio2photoreal/static/CCA.pdf)
[](https://evonneng.github.io/projects/audio2photoreal/index.html)+ [Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization](https://arxiv.org/pdf/2410.14283) (Oct., 2024)
[](https://arxiv.org/pdf/2410.14283)
[](https://everest-ai.github.io/takinada/)+ [DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation](https://arxiv.org/abs/2410.13726) (Oct., 2024)
[](https://github.com/Hanbo-Cheng/DAWN-pytorch)
[](https://arxiv.org/abs/2410.13726)
[](https://hanbo-cheng.github.io/DAWN/)+ [MimicTalk: Mimicking a personalized and expressive 3D talking face in few minutes](https://arxiv.org/abs/2410.06734) (Oct., 2024)
[](https://github.com/yerfor/MimicTalk)
[](https://arxiv.org/abs/2410.06734)
[](https://mimictalk.github.io/)+ [Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation](https://arxiv.org/pdf/2410.07718) (Oct., 2024)
[](https://github.com/fudan-generative-vision/hallo2)
[](https://arxiv.org/pdf/2410.07718)
[](https://fudan-generative-vision.github.io/hallo2/#/)+ [Listen, Denoise, Action! Audio-Driven Motion Synthesis With Diffusion Models](https://arxiv.org/abs/2211.09707) (Nov. 2022)
[](https://arxiv.org/abs/2211.09707)
[](https://www.speech.kth.se/research/listen-denoise-action/)
+ [TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation](https://arxiv.org/pdf/2410.04221) (Oct., 2024)
[](https://arxiv.org/pdf/2410.04221)
[](https://pantomatrix.github.io/TANGO/)+ [Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation](https://arxiv.org/pdf/2406.08801) (Jun., 2024)
[](https://github.com/fudan-generative-vision/hallo)
[](https://arxiv.org/pdf/2406.08801)
[](https://fudan-generative-vision.github.io/hallo/#/)

### Human/AI Feedback for Video Generation
+ [LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment](https://arxiv.org/pdf/2412.04814) (Dec., 2024)
[](https://github.com/CodeGoat24/LiFT)
[](https://arxiv.org/pdf/2412.04814)
[](https://codegoat24.github.io/LiFT/)+ [Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback](https://arxiv.org/abs/2412.02617) (Nov., 2024)
[](https://arxiv.org/abs/2412.02617)
[](https://sites.google.com/view/aif-dynamic-t2v/)+ [VIDEOSCORE: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation](https://arxiv.org/pdf/2406.15252) (July, 2024)
[](https://github.com/TIGER-AI-Lab/VideoScore/)
[](https://arxiv.org/abs/2406.15252)
[](https://tiger-ai-lab.github.io/VideoScore/)

### Policy Learning
+ [Object-Centric Image to Video Generation with Language Guidance](https://arxiv.org/abs/2502.11655) (Feb, 2025)
[](https://github.com/angelvillar96/TextOCVP)
[](https://arxiv.org/abs/2502.11655)
[](https://play-slot.github.io/TextOCVP/)+ [Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations](https://arxiv.org/abs/2412.14803) (Dec, 2024)
[](https://arxiv.org/abs/2412.14803)
[](https://video-prediction-policy.github.io/)+ [Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning](https://portal-cornell.github.io/motion_track_policy/) (Dec, 2024)
[](https://portal-cornell.github.io/motion_track_policy/)
[](https://portal-cornell.github.io/motion_track_policy/)+ [Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model](https://arxiv.org/abs/2412.05280) (Dec, 2024)
[](https://github.com/wzzheng/Stag)
[](https://arxiv.org/abs/2412.05280)
[](https://wzzheng.net/Stag/)+ [RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches](https://arxiv.org/abs/2403.02709) (Dec, 2024)
[](https://arxiv.org/abs/2403.02709)
[](https://rt-sketch.github.io/)+ [EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation](https://arxiv.org/pdf/2411.08380) (Nov, 2024)
[](https://github.com/JeffWang987/EgoVid)
[](https://arxiv.org/pdf/2411.08380)
[](https://egovid.github.io/)+ [GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy](https://arxiv.org/abs/2408.14368) (July, 2024)
[](https://github.com/bytedance/GR-MG)
[](https://arxiv.org/abs/2408.14368)
[](https://gr-mg.github.io/)+ [Any-point Trajectory Modeling for Policy Learning](https://arxiv.org/abs/2401.00025) (July, 2024)
[](https://github.com/Large-Trajectory-Model/ATM)
[](https://arxiv.org/abs/2401.00025)
[](https://xingyu-lin.github.io/atm/)+ [This&That: Language-Gesture Controlled Video Generation for Robot Planning](https://arxiv.org/abs/2407.05530) (Jun, 2024)
[](https://github.com/cfeng16/this-and-that)
[](https://arxiv.org/abs/2407.05530)
[](https://cfeng16.github.io/this-and-that/)+ [Dreamitate: Real-World Visuomotor Policy Learning via Video Generation](https://arxiv.org/abs/2406.16862) (Jun, 2024)
[](https://github.com/cvlab-columbia/dreamitate)
[](https://arxiv.org/abs/2406.16862)
[](https://dreamitate.cs.columbia.edu/)

### Virtual Try-On
+ [1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On](https://arxiv.org/abs/2501.05369) (Jan., 2025)
[](https://github.com/ningshuliang/1-2-1-MNVTON)
[](https://arxiv.org/abs/2501.05369)
[](https://ningshuliang.github.io/2023/Arxiv/index.html)+ [Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism](https://arxiv.org/abs/2412.09822) (Dec., 2024)
[](https://arxiv.org/abs/2412.09822)
[](https://zhengjun-ai.github.io/dynamic-tryon-page/)+ [Fashion-VDM: Video Diffusion Model for Virtual Try-On](https://arxiv.org/abs/2411.00225) (Nov., 2024)
[](https://arxiv.org/abs/2411.00225)
[](https://johannakarras.github.io/Fashion-VDM/)

### 3D
+ [Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774) (Mar., 2025 | CVPR 2025)
[](https://arxiv.org/abs/2503.01774)
[](https://research.nvidia.com/labs/toronto-ai/difix3d/)+ [Wonderland: Navigating 3D Scenes from a Single Image](https://arxiv.org/abs/2412.12091) (Dec., 2024)
[](https://arxiv.org/abs/2412.12091)
[](https://snap-research.github.io/wonderland/)+ [GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation](https://arxiv.org/abs/2401.04092) (Jan., 2024)
[](https://arxiv.org/abs/2401.04092)
[](https://gpteval3d.github.io/)
[](https://github.com/3DTopia/GPTEval3D)+ [MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion](https://arxiv.org/abs/2410.03825) (Oct., 2024)
[](https://arxiv.org/abs/2410.03825)
[](https://monst3r-project.github.io/)
[](https://github.com/Junyi42/monst3r)+ [L3DG: Latent 3D Gaussian Diffusion](https://arxiv.org/abs/2112.03288) (Oct., 2024)
[](https://arxiv.org/abs/2112.03288)
[](https://barbararoessle.github.io/l3dg/)+ [ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model](https://arxiv.org/abs/2410.07155) (Oct., 2024)
[](https://github.com/YangLing0818/Trans4D)
[](https://arxiv.org/abs/2410.07155)
[](https://github.com/YangLing0818/Trans4D)+ [Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models](https://arxiv.org/pdf/2409.07452) (Oct., 2024)
[](https://github.com/yanghb22-fdu/Hi3D-Official)
[](https://arxiv.org/pdf/2409.07452)+ [ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model](https://arxiv.org/abs/2408.16767) (Aug., 2024)
[](https://github.com/liuff19/ReconX)
[](https://arxiv.org/abs/2408.16767)
[](https://liuff19.github.io/ReconX/)+ [SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency](https://arxiv.org/abs/2407.17470) (Jul., 2024)
[](https://github.com/Stability-AI/generative-models)
[](https://arxiv.org/abs/2407.17470)
[](https://sv4d.github.io/)+ [Shape of Motion: 4D Reconstruction from a Single Video](https://arxiv.org/abs/2407.13764) (Jul., 2024)
[](https://github.com/vye16/shape-of-motion/)
[](https://arxiv.org/abs/2407.13764)
[](https://shape-of-motion.github.io/)+ [WonderWorld: Interactive 3D Scene Generation from a Single Image](https://arxiv.org/abs/2406.09394) (Jun., 2024)
[](https://arxiv.org/abs/2406.09394)
[](https://wonderworld-2024.github.io/)+ [WonderJourney: Going from Anywhere to Everywhere](https://arxiv.org/pdf/2312.03884) (CVPR 2024)
[](https://github.com/KovenYu/WonderJourney)
[](https://arxiv.org/pdf/2312.03884)
[](https://kovenyu.com/wonderjourney/)+ [MultiDiff: Consistent Novel View Synthesis from a Single Image](https://sirwyver.github.io/MultiDiff/static/assets/MultiDiff.pdf) (CVPR, 2024)
[](https://sirwyver.github.io/MultiDiff/static/assets/MultiDiff.pdf)
[](https://sirwyver.github.io/MultiDiff/)+ [Vivid-ZOO: Multi-View Video Generation with Diffusion Model](https://arxiv.org/pdf/2406.08659v1) (Jun, 2024)
[](https://github.com/hi-zhengcheng/vividzoo)
[](https://arxiv.org/pdf/2406.08659v1)
[](https://hi-zhengcheng.github.io/vividzoo/)+ [Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text](https://arxiv.org/pdf/2406.17601) (June, 2024)
[](https://github.com/imlixinyang/director3d)
[](https://arxiv.org/pdf/2406.17601)
[](https://imlixinyang.github.io/director3d-page/)
+ [YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals](https://arxiv.org/abs/2406.16273v1) (June, 2024)
[](https://github.com/YouDream3D/YouDream/)
[](https://arxiv.org/abs/2406.16273v1)
[](https://youdream3d.github.io/)+ [Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields](https://arxiv.org/abs/2305.11588) (May, 2023)
[](https://arxiv.org/abs/2305.11588)
[](https://eckertzhang.github.io/Text2NeRF.github.io/)+ [RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture](https://arxiv.org/abs/2305.11337) (May, 2023)
[](https://arxiv.org/abs/2305.11337)+ [NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models](https://arxiv.org/abs/2304.09787) (CVPR 2023)
[](https://arxiv.org/abs/2304.09787)
[](https://research.nvidia.com/labs/toronto-ai/NFLDM/)+ [Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction](https://arxiv.org/abs/2304.06714) (Apr., 2023)
[](https://github.com/Lakonik/SSDNeRF)
[](https://arxiv.org/abs/2304.06714)
[](https://lakonik.github.io/ssdnerf/)+ [Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions](https://arxiv.org/abs/2303.12789) (Mar., 2023)
[](https://github.com/ayaanzhaque/instruct-nerf2nerf)
[](https://arxiv.org/abs/2303.12789)
[](https://instruct-nerf2nerf.github.io/)+ [DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models](https://arxiv.org/abs/2302.12231) (Feb., 2023)
[](https://github.com/nianticlabs/diffusionerf)
[](https://arxiv.org/abs/2302.12231)+ [NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion](https://arxiv.org/abs/2302.10109) (Feb., 2023)
[](https://arxiv.org/abs/2302.10109)
[](https://jiataogu.me/nerfdiff/)+ [DiffRF: Rendering-guided 3D Radiance Field Diffusion](https://arxiv.org/abs/2212.01206) (CVPR 2023)
[](https://arxiv.org/abs/2212.01206)
[](https://sirwyver.github.io/DiffRF/)

### 4D
+ [AvatarArtist: Open-Domain 4D Avatarization](https://arxiv.org/abs/2503.19906) (Apr., 2025)
[](https://github.com/ant-research/AvatarArtist)
[](https://arxiv.org/abs/2503.19906)
[](https://kumapowerliu.github.io/AvatarArtist/)
+ [Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features](https://arxiv.org/abs/2502.08377) (Feb., 2025)
[](https://github.com/paintscene4d/paintscene4d.github.io)
[](https://arxiv.org/abs/2502.08377)
[](https://paintscene4d.github.io/)+ [DreamDrive: Generative 4D Scene Modeling from Street View Images](https://arxiv.org/pdf/2501.00601) (Jan., 2025)
[](https://arxiv.org/pdf/2501.00601)+ [Stereo4D Learning How Things Move in 3D from Internet Stereo Videos](https://arxiv.org/pdf/2412.09621) (Dec., 2024)
[](https://arxiv.org/pdf/2412.09621)
[](https://stereo4d.github.io/)+ [4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion](https://arxiv.org/abs/2412.04462) (Dec., 2024)
[](https://arxiv.org/abs/2412.04462)
[](https://snap-research.github.io/4Real-Video/)+ [PaintScene4D: Consistent 4D Scene Generation from Text Prompts](https://arxiv.org/abs/2412.04471) (Dec., 2024)
[](https://github.com/paintscene4d/paintscene4d.github.io)
[](https://arxiv.org/abs/2412.04471)
[](https://paintscene4d.github.io/)+ [CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models](https://arxiv.org/abs/2411.18613) (Nov., 2024)
[](https://arxiv.org/abs/2411.18613)
[](https://cat-4d.github.io/)+ [DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion](https://arxiv.org/pdf/2411.04928) (Nov., 2024)
[](https://github.com/wenqsun/DimensionX)
[](https://arxiv.org/pdf/2411.04928)
[](https://chenshuo20.github.io/DimensionX/)

### Game Generation
+ [Playable Game Generation](https://arxiv.org/pdf/2412.00887) (Nov., 2024)
[](https://arxiv.org/pdf/2412.00887)
[](http://124.156.151.207)

### AI Safety
+ [What Matters in Detecting AI-Generated Videos like Sora?](https://arxiv.org/abs/2406.19568) (Jun., 2024)
[](https://arxiv.org/abs/2406.19568)
[](https://justin-crchang.github.io/3DCNNDetection.github.io/)

### Rendering with Virtual Engine
+ [UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI](https://arxiv.org/abs/2412.20977) (Jan., 2025)
[](https://github.com/UnrealZoo/unrealzoo-gym)
[](https://arxiv.org/abs/2412.20977)
[](http://unrealzoo.site/)+ [Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation](https://arxiv.org/abs/2406.11824) (CVPR 2024)
[](https://arxiv.org/abs/2406.11824)
[](https://infinigen.org/)+ [Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop](https://arxiv.org/abs/2411.18644) (Dec., 2024)
[](https://arxiv.org/abs/2411.18644)
[](https://abolfazl-sh.github.io/Scene_co-pilot_site/)

### Open-World Model
+ [Aether: Geometric-Aware Unified World Modeling](https://arxiv.org/pdf/2503.18945) (Mar., 2025)
[](https://github.com/OpenRobotLab/Aether)
[](https://arxiv.org/pdf/2503.18945)
[](https://aether-world.github.io/#team)+ [Pre-Trained Video Generative Models as World Simulators](https://arxiv.org/abs/2502.07825) (Feb., 2025)
[](https://arxiv.org/abs/2502.07825)+ [VideoWorld: Exploring Knowledge Learning from Unlabeled Videos](https://arxiv.org/abs/2501.09781) (Jan., 2025)
[](https://github.com/bytedance/VideoWorld)
[](https://arxiv.org/abs/2501.09781)
[](https://maverickren.github.io/VideoWorld.github.io/)+ [GameFactory: Creating New Games with Generative Interactive Videos](https://arxiv.org/abs/2501.08325) (Jan., 2025)
[](https://github.com/KwaiVGI/GameFactory)
[](https://arxiv.org/abs/2501.08325)
[](https://vvictoryuki.github.io/gamefactory/)+ [Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation](https://arxiv.org/abs/2501.06693) (Jan., 2025)
[](https://arxiv.org/abs/2501.06693)
[](https://metadriverse.github.io/vid2sim/)+ [GenEx: Generating an Explorable World](https://arxiv.org/abs/2412.09624) (Dec., 2024)
[](https://arxiv.org/abs/2412.09624)
[](https://genex.world/)+ [The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control](https://arxiv.org/pdf/2412.03568) (Dec., 2024)
[](https://arxiv.org/pdf/2412.03568)
[](https://thematrix1999.github.io/)+ [Navigation World Models](https://arxiv.org/abs/2412.03572) (Dec., 2024)
[](https://arxiv.org/abs/2412.03572)
[](https://www.amirbar.net/nwm/)+ [Genie 2: A large-scale foundation world model](https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/) (Dec., 2024)
[](https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/)+ [Understanding World or Predicting Future? A Comprehensive Survey of World Models](https://arxiv.org/abs/2411.14499) (Nov., 2024)
[](https://arxiv.org/abs/2411.14499)+ [AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https://arxiv.org/abs/2407.18901) (Nov., 2024)
[](https://github.com/stonybrooknlp/appworld/)
[](https://arxiv.org/abs/2407.18901)
[](https://appworld.dev/)+ [Oasis: A Universe in a Transformer](https://www.decart.ai/articles/oasis-interactive-ai-video-game-model) (Nov., 2024)
[](https://github.com/etched-ai/open-oasis)
[](https://www.decart.ai/articles/oasis-interactive-ai-video-game-model)
[](https://www.decart.ai/articles/oasis-interactive-ai-video-game-model)+ [Digital Life Project: Autonomous 3D Characters with Social Intelligence](https://arxiv.org/abs/2312.04547) (CVPR 2024)
[](https://github.com/caizhongang/Digital_Life_Project)
[](https://arxiv.org/abs/2312.04547)
[](https://digital-life-project.com/)+ [3D-VLA: A 3D Vision-Language-Action Generative World Model](https://arxiv.org/abs/2403.09631) (ICML 2024)
[](https://github.com/UMass-Foundation-Model/3D-VLA)
[](https://arxiv.org/abs/2403.09631)
[](https://vis-www.cs.umass.edu/3dvla/)

### Video Understanding
+ [UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics](https://arxiv.org/pdf/2412.07774) (Dec., 2024)
[](https://arxiv.org/pdf/2412.07774)
[](https://xavierchen34.github.io/UniReal-Page/)+ [Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation](https://arxiv.org/abs/2412.04432) (Nov., 2024)
[](https://github.com/TencentARC/Divot)
[](https://arxiv.org/abs/2412.04432)+ [VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding](https://arxiv.org/pdf/2410.13860) (Oct., 2024)
[](https://github.com/OpenRobotLab/VLM-Grounder)
[](https://arxiv.org/pdf/2410.13860)+ [Exploring Diffusion Models for Unsupervised Video Anomaly Detection](https://arxiv.org/abs/2304.05841) (Apr., 2023)
[](https://arxiv.org/abs/2304.05841)+ [PDPP: Projected Diffusion for Procedure Planning in Instructional Videos](https://arxiv.org/abs/2303.14676) (CVPR 2023)
[](https://github.com/MCG-NJU/PDPP)
[](https://arxiv.org/abs/2303.14676)+ [DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion](https://arxiv.org/abs/2303.14863) (Mar., 2023)
[](https://github.com/sauradip/DiffusionTAD)
[](https://arxiv.org/abs/2303.14863)+ [Diffusion Action Segmentation](https://arxiv.org/abs/2303.17959) (ICCV 2023)
[](https://github.com/Finspire13/DiffAct)
[](https://arxiv.org/abs/2303.17959)
[](https://daochang.site/DiffAct-Project-Page/)+ [DiffusionRet: Generative Text-Video Retrieval with Diffusion Model](https://arxiv.org/abs/2303.09867) (ICCV 2023)
[](https://github.com/jpthu17/DiffusionRet)
[](https://arxiv.org/abs/2303.09867)+ [Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning](https://arxiv.org/abs/2211.15076) (Nov., 2022)
[](https://github.com/lzp870/RSFD)
[](https://arxiv.org/abs/2211.15076)+ [A Generalist Framework for Panoptic Segmentation of Images and Videos](https://arxiv.org/abs/2210.06366) (Oct., 2022)
[](https://github.com/google-research/pix2seq)
[](https://arxiv.org/abs/2210.06366)

### Healthcare and Biology
+ [Medical Video Generation for Disease Progression Simulation](https://arxiv.org/abs/2411.11943) (Nov., 2024)
[](https://arxiv.org/abs/2411.11943)+ [Artificial Intelligence for Biomedical Video Generation](https://arxiv.org/pdf/2411.07619) (Nov., 2024)
[](https://arxiv.org/pdf/2411.07619)+ [Exploring Variational Autoencoders for Medical Image Generation: A Comprehensive Study](https://arxiv.org/abs/2411.07348) (Nov., 2024)
[](https://arxiv.org/abs/2411.07348)+ [MedSora: Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation](https://arxiv.org/abs/2411.01647) (Nov., 2024)
[](https://arxiv.org/abs/2411.01647)
[](https://wongzbb.github.io/MedSora/)+ [Annealed Score-Based Diffusion Model for Mr Motion Artifact Reduction](https://arxiv.org/abs/2301.03027) (Jan., 2023)
[](https://arxiv.org/abs/2301.03027)+ [Feature-Conditioned Cascaded Video Diffusion Models for Precise Echocardiogram Synthesis](https://arxiv.org/abs/2303.12644) (Mar., 2023)
[](https://arxiv.org/abs/2303.12644)+ [Neural Cell Video Synthesis via Optical-Flow Diffusion](https://arxiv.org/abs/2212.03250) (Dec., 2022)
[](https://arxiv.org/abs/2212.03250)

### Other Applications
+ [History-Guided Video Diffusion](https://arxiv.org/abs/2502.06764) (Feb., 2025)
[](https://github.com/kwsong0113/diffusion-forcing-transformer)
[](https://arxiv.org/abs/2502.06764)
[](https://boyuan.space/history-guidance/)+ [VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control](https://arxiv.org/pdf/2502.01101) (Feb., 2025)
[](https://arxiv.org/pdf/2502.01101)+ [VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization](https://arxiv.org/pdf/2501.09499) (Jan., 2025)
[](https://github.com/BecauseImBatman0/VanGogh)
[](https://arxiv.org/pdf/2501.09499)
[](https://becauseimbatman0.github.io/VanGogh)+ [PhysAnimator: Physics-Guided Generative Cartoon Animation](https://arxiv.org/pdf/2501.16550) (Feb., 2025)
[](https://arxiv.org/pdf/2501.16550)