# Awesome Video Prediction [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of awesome video prediction papers with brief summaries.

## Table of Contents

* [Blogs](#Blogs)
* [Surveys](#Surveys)
* [Papers](#Papers)
* [Minimal Sketches](#Minimal-Sketches)

## Blogs

...

## Surveys

* ★ [A Review on Deep Learning Techniques for Video Prediction](https://arxiv.org/abs/2004.05214) | TPAMI 2020
* [Deep Learning for Vision-based Prediction: A Survey](https://arxiv.org/abs/2007.00095) | Arxiv 2020

## Papers

![Family Tree](assets/tree1.png)
![Family Tree](assets/tree2.png)
![Family Tree](assets/tree3.png)
![Family Tree](assets/tree4.png)

* **Baseline Video Language Modeling** (**BVLM**) | [Video (language) modeling: a baseline for generative models of natural videos](https://arxiv.org/abs/1412.6604) | Arxiv 2014 FAIR NYU
  * first video prediction | patch-level language model, CNN+RNN | no inductive bias, raw pixels
* **LSTM Encoder-Decoder** (**LSTM-ED**) | [Unsupervised Learning of Video Representations using LSTMs](https://arxiv.org/abs/1502.04681) | ICML 2015
  * unsupervised representation learning | LSTM encoder into representation and LSTM decoder to reconstruct, FC-LSTM | no inductive bias, raw pixels
* ★ **Convolutional LSTM** (**ConvLSTM**) | [Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting](https://arxiv.org/abs/1506.04214) | NeurIPS 2015 HKUST
  * model spatial correlations well | LSTM-ED with FC-LSTM replaced by convLSTM, convLSTM | no inductive bias, raw pixels (minimal cell sketch after the list)
* **Predictive Generative Network** (**PGN**) | [Unsupervised learning of visual structure using predictive generative networks](https://arxiv.org/abs/1511.06380) | Arxiv 2015 Harvard
  * unsupervised representation learning | CNN-LSTM-deCNN with MSE + adversarial loss, CNN+LSTM+GAN | no inductive bias, raw pixels
* **Predictive Coding Network** (**PredNet**) | [Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning](https://arxiv.org/abs/1605.08104) | Arxiv 2016 Harvard
  * unsupervised representation learning | stacked multi-level encode-representation-and-decode-reconstruction variant, convLSTM | no inductive bias, raw pixels
* **Predictive Recurrent Neural Network** (**PredRNN**) | [PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning](https://arxiv.org/abs/2103.09504) | NeurIPS 2017 TPAMI 2022 Tsinghua (Yunbo Wang)
  * solve several problems in the design of convLSTM for spatiotemporal predictive learning | spatiotemporal memory flow + spatiotemporal LSTM + reverse scheduled sampling curriculum learning, convLSTM | no inductive bias, raw pixels
* **Improved Predictive Recurrent Neural Network** (**PredRNN++**) | [PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning](https://arxiv.org/abs/1804.06300) | ICML 2018 Tsinghua (Yunbo Wang)
  * go deeper in time and resolve the vanishing gradients of deep-in-time RNNs | causal LSTM + gradient highway unit, convLSTM | no inductive bias, raw pixels
* ★ **Convolutional Dynamic Neural Advection** (**CDNA**) | [Unsupervised Learning for Physical Interaction through Video Prediction](https://arxiv.org/abs/1605.07157) | NeurIPS 2016 UCBerkeley (Chelsea Finn, Ian Goodfellow, Sergey Levine)
  * first long-range prediction on real-world video | explicitly model pixel motion then merge the previous frame, convLSTM | kernel-based transformation (kernel-transformation sketch after the list)
* **Object-centric Transformation** (**ObjectTransformation**) | [Learning Object-Centric Transformation for Video Prediction](https://dl.acm.org/doi/10.1145/3123266.3123349) | ACM-MM 2017 PKU
  * different objects have different motions | attention to object patches and predict transformation kernels, CNN+RNN | kernel-based transformation
* **Spatially-Displaced Convolution Network** (**SDC-Net**) | [SDC-Net: Video prediction using spatially-displaced convolution](https://arxiv.org/abs/1811.00684) | ECCV 2018 Nvidia
  * high-resolution video prediction | combine vector-based and kernel-based transformation, 3D CNN | vector-based transformation + kernel-based transformation
* ★ **Motion-Content Network** (**MCnet**) | [Decomposing Motion and Content for Natural Video Sequence Prediction](https://arxiv.org/abs/1706.08033) | ICLR 2017
  * first to decompose motion and content | motion encoder + content encoder + combination decoder, CNN+convLSTM | motion and content separation
* **Decompositional Disentangled Predictive Auto-Encoder** (**DDPAE**) | [Learning to Decompose and Disentangle Representations for Video Prediction](https://arxiv.org/abs/1806.04166) | NeurIPS 2018 Stanford (Li Fei-Fei)
  * deal with high dimensionality | decompose the frame into components and disentangle each component into time-invariant content and low-dimensional pose, CNN+RNN+VAE | vector-based transformation + motion and content separation
* ★ **Spatial-Temporal Multi-Frequency Analysis Network** (**STMFANet**) | [Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction](https://arxiv.org/abs/2002.09905) | CVPR 2020 CAS
  * deal with image distortion and temporal inconsistency | merge multi-level spatial and temporal wavelet analysis into prediction, CNN+LSTM+wavelet | add traditional CV, raw pixels
* ★ **Stochastic Variational Video Prediction** (**SV2P**) | [Stochastic Variational Video Prediction](https://arxiv.org/abs/1710.11252) | ICLR 2018 UIUC (Chelsea Finn, Sergey Levine)
  * first to introduce stochasticity | VAE noise as stochastic condition for CDNA, 3D CNN+convLSTM+VAE | kernel-based transformation + VAE stochastic
* **Stochastic Video Generation with a Learned Prior** (**SVG-LP**) | [Stochastic Video Generation with a Learned Prior](https://arxiv.org/abs/1802.07687) | ICML 2018 NYU
  * "learned prior as uncertainty predictive model" | learned prior for VAE, convLSTM+VAE | VAE stochastic (learned-prior sketch after the list)
* **Stochastic Adversarial Video Prediction** (**SAVP**) | [Stochastic Adversarial Video Prediction](https://arxiv.org/abs/1804.01523) | ICLR 2019 UCBerkeley (Chelsea Finn, Sergey Levine)
  * bring together stochastic and realistic | VAE-GAN for SV2P, 3D CNN+convLSTM+VAE+GAN | kernel-based transformation + VAE stochastic
* **Hierarchical VRNN** (**Hierarchical-VRNN**) | [Improved Conditional VRNNs for Video Prediction](https://arxiv.org/abs/1904.12165) | ICCV 2019
  * "still blurry, due to underfitting" | hierarchical levels of latents to increase expressiveness, CNN+RNN+VAE | VAE hierarchical stochastic
* **Greedy Hierarchical Variational Auto-Encoders** (**GHVAE**) | [Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction](https://arxiv.org/abs/2103.04174) | CVPR 2021 Stanford (Li Fei-Fei, Chelsea Finn)
  * deal with the memory constraints and optimization instability of hierarchical VAEs | greedy and modular optimization, CNN+RNN+VAE | VAE hierarchical stochastic
* **Beyond Mean Square Error** (**BeyondMSE**) | [Deep multi-scale video prediction beyond mean square error](https://arxiv.org/abs/1511.05440) | ICLR 2016 FAIR NYU (Yann LeCun)
  * deal with blur | adversarial loss + gradient difference loss, CNN+GAN | no inductive bias, raw pixels (gradient difference loss sketch after the list)
* **Eidetic 3D LSTM** (**E3D-LSTM**) | [Eidetic 3D LSTM: A Model for Video Prediction and Beyond](https://openreview.net/forum?id=B1lKS2AqtX) | ICLR 2019 Tsinghua (Yunbo Wang, Li Fei-Fei)
  * learn well for both short-term and long-term | 3D CNN for local dynamics and recurrent modeling for temporal dependencies, 3D CNN+LSTM | no inductive bias, raw pixels
* ★ **Simple Video Prediction** (**SimVP**) | [SimVP: Simpler yet Better Video Prediction](https://arxiv.org/abs/2206.05099) | CVPR 2022
  * investigate simple CNN techniques for video prediction | pure 2D CNN and only MSE loss, CNN | no inductive bias, raw pixels
* **Video Diffusion Models** (**VDM**) | [Video Diffusion Models](https://arxiv.org/abs/2204.03458) | NeurIPS 2022 Google (Jonathan Ho)
  * first video diffusion model, primarily for unconditional video generation | diffusion model with 3D U-Net, 3D CNN+diffusion | no inductive bias, raw pixels
* ★ **Masked Conditional Video Diffusion** (**MCVD**) | [MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation](https://arxiv.org/abs/2205.09853) | NeurIPS 2022
  * general-purpose prediction/generation/interpolation | U-Net conditioned on masked past or future frames, CNN+diffusion | no inductive bias, raw pixels (masked-conditioning sketch after the list)
* **Residual Video Diffusion** (**RVD**) | [Diffusion Probabilistic Modeling for Video Generation](https://arxiv.org/abs/2203.09481) | Arxiv 2022
  * "residual errors are easier to model than future observations" | MAF for the mean + diffusion for the residual, CNN+RNN+diffusion | no inductive bias, raw pixels
* **Flexible Diffusion Model** (**FDM**) | [Flexible Diffusion Modeling of Long Videos](https://arxiv.org/abs/2205.11495) | Arxiv 2022
  * deal with coherent prediction over long durations | train by randomly sampling which frames to condition on and generate, 3D CNN+diffusion | no inductive bias, raw pixels
* **Video Transformer** (**VideoTransformer**) | [Scaling Autoregressive Video Models](https://arxiv.org/abs/1906.02634) | ICLR 2020 Google
  * first Transformer in video prediction | block-local self-attention and spatiotemporal subscaling to reduce memory, Transformer | no inductive bias, raw pixels
* ★ **Latent Video Transformer** (**LVT**) | [Latent Video Transformer](https://arxiv.org/abs/2006.10704) | Arxiv 2020
  * reduce computational requirements | VQ-VAE encodes pixels into a discrete latent space and a VideoTransformer operates in that space, Transformer | discrete latent space (latent-token sketch after the list)
* **Convolutional Transformer** (**ConvTransformer**) | [ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis](https://arxiv.org/abs/2011.10185) | Arxiv 2021
  * combine CNN and Transformer in video prediction | multi-head convolutional self-attention, Transformer+CNN | no inductive bias, raw pixels
* **Video Generative Pre-Training** (**VideoGPT**) | [VideoGPT: Video Generation using VQ-VAE and Transformers](https://arxiv.org/abs/2104.10157) | Arxiv 2021 UCBerkeley (Pieter Abbeel)
  * bring GPT-style generative pre-training to video | VQ-VAE encodes pixels into a discrete latent space and a VideoTransformer operates in that space, Transformer | discrete latent space
* **Video Prediction Transformer** (**VPTR**) | [Video Prediction by Efficient Transformers](https://arxiv.org/abs/2212.06026) | ICPR 2022 IVC 2022
  * reduce computational requirements, with extensive experiments on Transformer autoregressive formats | Pix2Pix autoencoder and VidHRFormer attention, Transformer | latent space
* **Masked Video Transformer** (**MaskViT**) | [MaskViT: Masked Visual Pre-Training for Video Prediction](https://arxiv.org/abs/2206.11894) | ICLR 2023 Stanford (Jiajun Wu, Li Fei-Fei)
  * masked visual modeling pre-training for video | VQ-GAN quantizes frames and masked visual modeling trains the predictor, Transformer | discrete latent space (masked token modeling sketch after the list)
* **MAsked Generative VIdeo Transformer** (**MAGVIT**) | [MAGVIT: Masked Generative Video Transformer](https://arxiv.org/abs/2212.05199) | CVPR 2023 CMU Google
  * single model for multiple video synthesis tasks | 3D-VQ quantizes the video and multi-task masked token modeling trains it, Transformer | discrete latent space
* **MOtion Scene and Object** (**MOSO**) | [MOSO: Decomposing MOtion, Scene and Object for Video Prediction](https://arxiv.org/abs/2303.03684) | CVPR 2023 CAS
  * decompose motion, scene and object | separate VQ-VAE quantization and Transformer prediction, Transformer | discrete latent space + motion and content separation
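
## Minimal Sketches

Hedged PyTorch sketches of a few techniques that recur in the list above; all class names, sizes, and hyperparameters are illustrative assumptions, not taken from the papers.

Many of the recurrent models above (ConvLSTM, PredNet, PredRNN, CDNA, SV2P) build on the convolutional LSTM cell, which replaces the fully connected gate transforms of FC-LSTM with convolutions so that the hidden and cell states keep their spatial layout. A minimal single-cell sketch:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One convolutional LSTM cell: all four gates come from a single
    convolution over the concatenated [input, hidden] feature maps."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)  # states keep their (H, W) layout
        return h, c

# Roll the cell over a toy frame sequence of shape (B, T, C, H, W).
cell = ConvLSTMCell(in_ch=1, hid_ch=16)
frames = torch.randn(2, 10, 1, 64, 64)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros_like(h)
for t in range(frames.shape[1]):
    h, c = cell(frames[:, t], (h, c))
print(h.shape)  # torch.Size([2, 16, 64, 64])
```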
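CDNA-style kernel-based transformation does not generate pixels from scratch: the network predicts a few normalized kernels that move pixels of the previous frame, plus per-pixel masks that composite the candidates. A simplified sketch assuming the kernels and masks are already predicted (the paper produces them with a convLSTM and also keeps a static background channel; both are omitted here):

```python
import torch
import torch.nn.functional as F

def kernel_transform(prev_frame, kernels, masks):
    """Move pixels of the previous frame with N predicted kernels, then
    composite the N candidate frames with per-pixel masks (CDNA-style).

    prev_frame: (B, C, H, W)
    kernels:    (B, N, k, k), each kernel softmax-normalized over k*k
    masks:      (B, N, H, W), softmax-normalized over N at every pixel
    """
    out = torch.zeros_like(prev_frame)
    N, k = kernels.shape[1], kernels.shape[-1]
    for b in range(prev_frame.shape[0]):  # kernels differ per sample
        w = kernels[b].view(N, 1, k, k)
        # Apply each of the N kernels to every channel: result is (C, N, H, W).
        cand = F.conv2d(prev_frame[b].unsqueeze(1), w, padding=k // 2)
        out[b] = (cand * masks[b].unsqueeze(0)).sum(dim=1)  # mask-weighted sum
    return out

B, C, H, W, N, k = 2, 3, 32, 32, 4, 5
prev = torch.rand(B, C, H, W)
kernels = torch.softmax(torch.randn(B, N, k * k), dim=-1).view(B, N, k, k)
masks = torch.softmax(torch.randn(B, N, H, W), dim=1)
print(kernel_transform(prev, kernels, masks).shape)  # torch.Size([2, 3, 32, 32])
```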
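SVG-LP's "learned prior" means the VAE prior over the stochastic latent is itself predicted from past frames, so the KL term trains it to anticipate the uncertainty of the next frame. A sketch of the latent path only, with hypothetical linear posterior/prior heads over precomputed frame features:

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """SVG-LP-style stochastic latent (sketch): the posterior net sees the
    target frame at train time, the learned prior net sees only the past;
    KL(posterior || prior) trains the prior to predict next-frame uncertainty."""
    def __init__(self, feat_dim=128, z_dim=16):
        super().__init__()
        self.posterior = nn.Linear(feat_dim, 2 * z_dim)  # from target features
        self.prior = nn.Linear(feat_dim, 2 * z_dim)      # from past features

    def forward(self, past_feat, target_feat):
        mu_q, logvar_q = self.posterior(target_feat).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(past_feat).chunk(2, dim=-1)
        # Reparameterized sample from the posterior (used at train time).
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
        # KL(q || p) between two diagonal Gaussians, summed over z dims.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(dim=-1)
        return z, kl

sampler = LatentSampler()
past, target = torch.randn(4, 128), torch.randn(4, 128)
z, kl = sampler(past, target)
print(z.shape, kl.shape)  # torch.Size([4, 16]) torch.Size([4])
```

At test time the sample would come from the prior instead, which is what lets the model roll out diverse futures from past frames alone.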
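BeyondMSE's gradient difference loss penalizes the mismatch between the finite-difference image gradients of the prediction and the target, which sharpens the edges that plain MSE smears. A direct sketch of that loss:

```python
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """Gradient difference loss (BeyondMSE, sketch): compare the absolute
    finite differences of prediction and target along height and width.
    pred/target: (B, C, H, W)."""
    def grads(x):
        return ((x[..., 1:, :] - x[..., :-1, :]).abs(),   # vertical differences
                (x[..., :, 1:] - x[..., :, :-1]).abs())   # horizontal differences
    gy_p, gx_p = grads(pred)
    gy_t, gx_t = grads(target)
    return ((gy_p - gy_t).abs() ** alpha).mean() + ((gx_p - gx_t).abs() ** alpha).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(gradient_difference_loss(pred, target))
```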
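MCVD trains one denoising diffusion model whose network is conditioned on past and/or future frames, and randomly masks that conditioning so the same model handles prediction, generation, and interpolation. A heavily simplified single training step, with a toy convnet standing in for the U-Net and the timestep embedding omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # standard DDPM noise schedule

denoiser = nn.Sequential(                          # toy stand-in for the U-Net
    nn.Conv2d(2 * 3, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 3, 3, padding=1))

context = torch.rand(4, 3, 32, 32)                 # past/future conditioning frame
target = torch.rand(4, 3, 32, 32)                  # frame being diffused
if torch.rand(()) < 0.5:                           # randomly drop ("mask") context
    context = torch.zeros_like(context)

t = torch.randint(0, T, (4,))
a = alpha_bar[t].view(-1, 1, 1, 1)
noise = torch.randn_like(target)
noisy = a.sqrt() * target + (1 - a).sqrt() * noise  # forward process q(x_t | x_0)

pred = denoiser(torch.cat([noisy, context], dim=1))  # condition by concatenation
loss = F.mse_loss(pred, noise)                       # simple epsilon-prediction loss
print(loss)
```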
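LVT and VideoGPT share a two-stage recipe: a VQ-VAE maps frames to discrete codebook indices, and an autoregressive Transformer models the token sequence with next-token cross-entropy. A sketch of the quantization lookup and the token model (the VQ encoder/decoder and positional embeddings are omitted; sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CODES, DIM = 512, 64
codebook = nn.Embedding(N_CODES, DIM)  # stands in for a trained VQ-VAE codebook

def quantize(feat):
    """Nearest-codebook lookup: (B, L, DIM) features -> (B, L) token ids."""
    dists = ((feat.unsqueeze(2) - codebook.weight) ** 2).sum(-1)  # (B, L, N_CODES)
    return dists.argmin(dim=-1)

class TokenPredictor(nn.Module):
    """Autoregressive Transformer over token ids (positional embeddings omitted)."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, 4, dim_feedforward=4 * DIM,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)
        self.embed = nn.Embedding(N_CODES, DIM)
        self.head = nn.Linear(DIM, N_CODES)

    def forward(self, tokens):
        L = tokens.shape[1]
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
        return self.head(self.encoder(self.embed(tokens), mask=causal))

feat = torch.randn(2, 16, DIM)   # e.g. 16 latent positions from the VQ encoder
tokens = quantize(feat)
logits = TokenPredictor()(tokens)
# Next-token cross-entropy, exactly as in autoregressive language modeling.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, N_CODES), tokens[:, 1:].reshape(-1))
print(loss)
```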
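MaskViT and MAGVIT train on the same kind of tokens bidirectionally instead of autoregressively: a random subset of token positions is replaced by a [MASK] id, and the Transformer is supervised only on those positions, which later allows many tokens to be filled in per decoding step. A sketch under the same assumptions as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CODES, MASK_ID, DIM = 512, 512, 64    # reserve one extra id for [MASK]
embed = nn.Embedding(N_CODES + 1, DIM)  # positional embeddings omitted for brevity
layer = nn.TransformerEncoderLayer(DIM, 4, dim_feedforward=4 * DIM, batch_first=True)
encoder = nn.TransformerEncoder(layer, 2)
head = nn.Linear(DIM, N_CODES)

tokens = torch.randint(0, N_CODES, (2, 16))         # ids from a (3D-)VQ encoder
mask = torch.rand(2, 16) < 0.5                      # hide roughly half the positions
inputs = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(inputs)))               # no causal mask: bidirectional
loss = F.cross_entropy(logits[mask], tokens[mask])  # supervise masked positions only
print(loss)
```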