Awesome-Video-Diffusion
A curated list of recent diffusion models for video generation, editing, restoration, understanding, etc.
https://github.com/showlab/Awesome-Video-Diffusion
Table of Contents <!-- omit in toc -->
- Video Generation
- Evaluation Benchmarks and Metrics
- Video Editing
- Human or Subject Motion
- Talking Head Generation
- Video Enhancement and Restoration
- 3D
- Video Understanding
- Healthcare and Biology
- Long Video / Film Generation
- Human/AI Feedback for Video Generation
- Open-World Model
- Motion Customization
- Character Customization
- 4D
- Audio Synthesis for Video
- Policy Learning
- Efficient Video Generation
- Video Generation with 3D/Physical Prior
- Rendering with Virtual Engine
- Other Applications
- Commercial Product
- AI Safety
- Virtual Try-On
- Unified Model for Generation and Understanding
- Game Generation

Video Generation
- ModelScope (Text-to-video synthesis)
- Diffusers (Text-to-video synthesis; see the minimal usage sketch after this list)
- Open-Sora-Plan
- Open-Sora
- Stable Video Diffusion
- Show-1
- text-to-video-synthesis-colab
- VideoTuna
- Movie Gen: A Cast of Media Foundation Models
- Hotshot-XL (text-to-GIF)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
- Cosmos
- zeroscope_v2
- Mochi 1
- Pyramidal Flow Matching for Efficient Video Generative Modeling
- LTX-Video
- HunyuanVideo: A Systematic Framework For Large Video Generative Models
- Allegro
- Wan-Video
- DiffSynth-Studio
- Wunjo CE (Video Generation and Editing)
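Most entries above are papers, but Diffusers is a library; as a quick orientation, below is a minimal sketch of its text-to-video pipeline using the ModelScope weights (`damo-vilab/text-to-video-ms-1.7b`). It follows the pattern in the Hugging Face documentation; the exact output field (`.frames` vs `.frames[0]`) has varied across diffusers releases, so treat the indexing as an assumption.

```python
# Minimal text-to-video sketch with Hugging Face Diffusers (ModelScope weights).
# Assumes a recent diffusers release and a CUDA GPU.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

result = pipe("a panda surfing a wave at sunset", num_inference_steps=25)
frames = result.frames[0]  # first (and only) video; indexing varies by version

export_to_video(frames, "panda.mp4")
```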
Evaluation Benchmarks and Metrics
- T2VScore: Towards A Better Metric for Text-to-Video Generation
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
- Frechet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos (see the Fréchet-distance sketch after this list)
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
- FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
- Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
- ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects
- ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
- Impossible Videos
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
- MEt3R: Measuring Multi-View Consistency in Generated Images
- Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
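Several metrics above (FVD-style scores, including the Frechet Video Motion Distance entry) reduce to the Fréchet distance between two Gaussians fitted to features of real and generated videos: d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}). A minimal NumPy/SciPy sketch of that final step is below; the feature extractor (I3D for FVD, motion features for FVMD) is assumed to have run already and is out of scope here.

```python
# Fréchet distance between Gaussians fitted to two feature sets, the core of
# FVD-style video metrics. Inputs are (num_samples, feature_dim) arrays that a
# feature extractor (I3D, motion features, ...) is assumed to have produced.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real: np.ndarray, fake: np.ndarray) -> float:
    mu_r, mu_g = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy check with random features standing in for extractor outputs.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 64)),
                       rng.normal(0.5, 1.0, size=(256, 64))))
```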
Video Editing
- Object-Centric Diffusion for Efficient Video Editing
- VASE: Object-Centric Shape and Appearance Manipulation of Real Videos
- FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
- RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing
- VidToMe: Video Token Merging for Zero-Shot Video Editing
- A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- Neutral Editing Framework for Diffusion-based Video Editing
- DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- MagicStick: Controllable Video Editing via Control Handle Transformations
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
- DragVideo: Interactive Drag-style Video Editing
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction
- Motion-Conditioned Image Animation for Video Editing
- Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- Cut-and-Paste: Subject-Driven Video Editing with Attention Control
- LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation
- Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
- DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
- MotionEditor: Editing Video Motion via Content-Aware Diffusion
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
- MagicEdit: High-Fidelity and Temporally Coherent Video Editing
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- INVE: Interactive Neural Video Editing
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
- ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing
- Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
- Soundini: Sound-Guided Diffusion for Natural Video Editing
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
- Pix2Video: Video Editing Using Image Diffusion
- Video-P2P: Video Editing with Cross-attention Control
- Dreamix: Video Diffusion Models Are General Video Editors
- Shape-Aware Text-Driven Layered Video Editing
- Speech Driven Video Editing via an Audio-Conditioned Diffusion Model
- Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
- EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
- Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
- Video Editing via Factorized Diffusion Distillation
- FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
- CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
- Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models
- Looking Backward: Streaming Video-to-Video Translation with Feature Banks
- Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices
- ReVideo: Remake a Video with Motion and Content Control
- ViViD: Video Virtual Try-on using Diffusion Models
- GenVideo: One-shot target-image and shape aware video editing using T2I diffusion models
- MTV-Inpaint: Multi-Task Long Video Inpainting
- Edit-A-Video: Single Video Editing with Object-Aware Consistency
- AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
- MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models
- DIVE: Taming DINO for Subject-Driven Video Editing
- AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
- StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
- MIVE: New Design and Benchmark for Multi-Instance Video Editing
- VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
- Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning
- VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Human or Subject Motion
- Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
- InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions
- ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
- Human Motion Diffusion as a Generative Prior
- Can We Use Diffusion Probabilistic Models for 3D Motion Prediction?
- Single Motion Diffusion
- HumanMAC: Masked Motion Completion for Human Motion Prediction
- DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
- Modiff: Action-Conditioned 3D Motion Generation With Denoising Diffusion Probabilistic Models
- PhysDiff: Physics-Guided Human Motion Diffusion Model
- BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction
- Unifying Human Motion Synthesis and Style Transfer With Denoising Diffusion Probabilistic Models
- Executing Your Commands via Motion Diffusion in Latent Space
- Pretrained Diffusion Models for Unified Human Motion Synthesis
- Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model
- Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction
- Human Motion Diffusion Model
- FLAME: Free-form Language-based Motion Synthesis & Editing
- MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
- Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion
- EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
- KMM: Key Frame Mask Mamba for Extended Motion Generation
- DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
- AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation
- OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
- A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- AnyTop: Character Animation Diffusion with Any Topology
- HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation
Talking Head Generation
- Listen, Denoise, Action! Audio-Driven Motion Synthesis With Diffusion Models
- DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
- HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
- GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization
- PersonaTalk: Bring Attention to Your Persona in Visual Dubbing
- Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
- Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
- MimicTalk: Mimicking a personalized and expressive 3D talking face in few minutes
- IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
- SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
- Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
- Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
- LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis
- VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion
Video Enhancement and Restoration
- LDMVFI: Video Frame Interpolation with Latent Diffusion Models
- CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming
- DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
- Disentangled Motion Modeling for Video Frame Interpolation
- SVFR: A Unified Framework for Generalized Video Face Restoration
- Enhance-A-Video: Better Generated Video for Free
3D
- Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
- Shape of Motion: 4D Reconstruction from a Single Video
- L3DG: Latent 3D Gaussian Diffusion
- Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
- Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
- MultiDiff: Consistent Novel View Synthesis from a Single Image
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model
- DiffRF: Rendering-guided 3D Radiance Field Diffusion
- Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text
- Wonderland: Navigating 3D Scenes from a Single Image
- YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
- RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
- NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models
- Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction
- Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
- SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
- WonderWorld: Interactive 3D Scene Generation from a Single Image
- WonderJourney: Going from Anywhere to Everywhere
- DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models
- NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
Video Understanding
- Exploring Diffusion Models for Unsupervised Video Anomaly Detection
- PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
- Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning
- A Generalist Framework for Panoptic Segmentation of Images and Videos
- Diffusion Action Segmentation
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Healthcare and Biology
- Annealed Score-Based Diffusion Model for MR Motion Artifact Reduction
- Feature-Conditioned Cascaded Video Diffusion Models for Precise Echocardiogram Synthesis
- Neural Cell Video Synthesis via Optical-Flow Diffusion
- Artificial Intelligence for Biomedical Video Generation
- Exploring Variational Autoencoders for Medical Image Generation: A Comprehensive Study
- MedSora: Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation
- Medical Video Generation for Disease Progression Simulation
Long Video / Film Generation
- One-Minute Video Generation with Test-Time Training
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
- AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
- TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation
- AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation
- DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
- Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
- Long Context Tuning for Video Generation
- Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
- StoryMaker: Towards consistent characters in text-to-image generation
- Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection
- ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction
- In-Context LoRA for Diffusion Transformers
- SEED-Story: Multimodal Long Story Generation with Large Language Model
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration
- DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework
- Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
- DreamCinema: Cinematic Transfer with Free Camera and 3D Character
- SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama
- Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
- CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion
- ARLON: Boosting Diffusion Transformers With Autoregressive Models for Long Video Generation
- Unbounded: A Generative Infinite Game of Character Life Simulation
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models
- Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach
- Video Storyboarding: Multi-Shot Character Consistency for Text-to-Video Generation
- Mind the Time: Temporally-Controlled Multi-Event Video Generation
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation
- MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
- MotionPrompt: Optical-Flow Guided Prompt Optimization for Coherent Video Generation
- DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- Owl-1: Omni World Model for Consistent Long Video Generation
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
- MovieAgent: Automated Movie Generation via Multi-Agent CoT Planning
- Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
- VideoAuteur: Towards Long Narrative Video Generation
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
- Flexible Diffusion Modeling of Long Videos
- RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
Human/AI Feedback for Video Generation
Open-World Model
- Digital Life Project: Autonomous 3D Characters with Social Intelligence
- Oasis: A Universe in a Transformer
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
- Pre-Trained Video Generative Models as World Simulators
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
- The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control
- Navigation World Models
- Genie 2: A large-scale foundation world model
- Understanding World or Predicting Future? A Comprehensive Survey of World Models
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- GenEx: Generating an Explorable World
- Aether: Geometric-Aware Unified World Modeling
- Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
- GameFactory: Creating New Games with Generative Interactive Videos
Motion Customization
- MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations
- Motion Inversion for Video Customization
- Video Diffusion Models are Training-free Motion Interpreter and Controller
- Zero-Shot Controllable Image-to-Video Animation via Motion Decomposition
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
- I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- Customizing Motion in Text-to-Video Diffusion Models
- DragAnything: Motion Control for Anything using Entity Representation
- Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
- DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models
- Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing
- MotionClone: Training-Free Motion Cloning for Controllable Video Generation
- Video Motion Transfer with Diffusion Transformers
- Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training
- Motion Modes: What Could Happen Next?
- MoTrans: Customized Motion Transfer with Text-driven Video
- AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
- Tora: Trajectory-oriented Diffusion Transformer for Video Generation
- ViewExtrapolator: Novel View Extrapolation with Video Diffusion Priors
- FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
- Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions
- MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance
- MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training
- Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
- Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
- Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss
- Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models
Character Customization
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
- Phantom: Subject-consistent video generation via cross-modal alignment
- Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
- Concat-ID: Towards Universal Identity-Preserving Video Synthesis
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
- Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
- ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
- DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
- FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation
- Dynamic Concepts Personalization from Single Videos
- VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
- Multi-subject Open-set Personalization in Video Generation
- Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
- VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
4D
- DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
- Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
- 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
- PaintScene4D: Consistent 4D Scene Generation from Text Prompts
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
- DreamDrive: Generative 4D Scene Modeling from Street View Images
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Audio Synthesis for Video
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
- Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
- Speech To Speech: an effort for an open-sourced and modular GPT4-o
- VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos
- STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
- Video-Guided Foley Sound Generation with Multimodal Controls
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
- Network Bending of Diffusion Models for Audio-Visual Generation
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
- Video-to-Audio Generation with Hidden Alignment
- Read, Watch and Scream! Sound Generation from Text and Video
- VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
- YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
- Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
- Stable-V2A: Synthesis of Synchronized Audio Effects with Temporal and Semantic Controls
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
- XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
- AGAV-Rater: Enhancing LMM for AI-Generated Audio-Visual Quality Assessment
- AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
- UniForm: A Unified Diffusion Transformer for Audio-Video Generation
Policy Learning
- GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy
- Object-Centric Image to Video Generation with Language Guidance
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
- Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning
- Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model
- RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
- This&That: Language-Gesture Controlled Video Generation for Robot Planning
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
- Any-point Trajectory Modeling for Policy Learning
Efficient Video Generation
- Mobile Video Diffusion
- MoViE: Mobile Diffusion for Video Editing
- From Slow Bidirectional to Fast Causal Video Generators
- Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models
- SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration (see the drop-in sketch after this list)
- SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
- SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
- Diffusion Adversarial Post-Training for One-Step Video Generation
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- Adaptive Caching for Faster Video Generation with Diffusion Transformers
- Fast and Memory-Efficient Video Diffusion Using Streamlined Inference
- FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- Fast Video Generation with Sliding Tile Attention
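SageAttention (listed above) is distributed as a pip package whose point is to be a drop-in replacement for PyTorch's scaled-dot-product attention. The sketch below follows the usage shown in its README; the `sageattn` signature, including the `tensor_layout` argument, is an assumption based on that README and may differ between releases.

```python
# Drop-in sketch: route attention through SageAttention's quantized kernel
# instead of torch.nn.functional.scaled_dot_product_attention.
# Assumes `pip install sageattention`; q/k/v are (batch, heads, seq, head_dim),
# which the "HND" layout is assumed to denote, per the project README.
import torch
import torch.nn.functional as F
from sageattention import sageattn

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    if q.is_cuda:
        return sageattn(q, k, v, tensor_layout="HND", is_causal=False)
    # CPU fallback: the stock PyTorch kernel.
    return F.scaled_dot_product_attention(q, k, v)
```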
Video Generation with 3D/Physical Prior
- Generative Physical AI in Vision: A Survey
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
- Compositional 3D-aware Video Generation with LLM Director
- StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
- Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
- How Far is Video Generation from World Model: A Physical Law Perspective
- PhyGenBench: Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
- Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning
- PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image
- PhysMotion: Physics-Grounded Dynamics From a Single Image
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
- Do generative video models learn physical principles from watching videos?
- DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
- PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
- SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Rendering with Virtual Engine
Other Applications
Commercial Product
- Kling
- KuaiShou Company
- Gen 3
- Runway Company
- Dream Machine
- Sora
- Wunjo
AI Safety
Virtual Try-On
Unified Model for Generation and Understanding
Game Generation