



# Awesome-Multimodal-Sequence-Learning
Reading list for multimodal sequence learning

## Table of Contents

* [Survey Papers](#survey-papers)
* [Research Areas](#research-areas)
  * [Representation Learning](#representation-learning)
  * [Multimodal Fusion](#multimodal-fusion)
  * [Analysis of Multimodal Models](#analysis-of-multimodal-models)
  * [Multimodal Pretraining](#multimodal-pretraining)
  * [Self-supervised Learning](#self-supervised-learning)
  * [Generative Multimodal Models](#generative-multimodal-models)
  * [Multimodal Adversarial Attacks](#multimodal-adversarial-attacks)
  * [Multimodal Reasoning](#multimodal-reasoning)
* [Research Tasks](#research-tasks)
  * [Sentiment and Emotion Analysis](#sentiment-and-emotion-analysis)
  * [Trajectory and Motion Forecasting](#trajectory-and-motion-forecasting)
* [Datasets](#datasets)
* [Tutorials and blogs](#tutorials-and-blogs)

## Survey Papers

[Multimodal Machine Learning: A Survey and Taxonomy](, TPAMI 2019

[Multimodal Intelligence: Representation Learning, Information Fusion, and Applications](, arXiv 2019

[Deep Multimodal Representation Learning: A Survey](, arXiv 2019

[Representation Learning: A Review and New Perspectives](, TPAMI 2013

## Research Areas

### Representation Learning

[Robustness in Multimodal Learning under Train-Test Modality Mismatch](

[Calibrating Multimodal Learning](, ICML 2023

[Learning Multimodal Data Augmentation in Feature Space](, ICLR 2023

[Multimodal Federated Learning via Contrastive Representation Ensemble](, ICLR 2023

[MultiBench: Multiscale Benchmarks for Multimodal Representation Learning](, NeurIPS 2021, [[code]](

[CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations](, ICCV 2021

[Multimodal Contrastive Training for Visual Representation Learning](, CVPR 2021

[Parameter Efficient Multimodal Transformers for Video Representation Learning](, ICLR 2021

[Viewmaker Networks: Learning Views for Unsupervised Representation Learning](, ICLR 2021, [[code]](//

[Representation Learning for Sequence Data with Deep Autoencoding Predictive Components](, ICLR 2021

[Improving Transformation Invariance in Contrastive Representation Learning](, ICLR 2021

[Active Contrastive Learning of Audio-Visual Video Representations](, ICLR 2021

[i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning](, ICLR 2021

[Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections](, ICLR 2021

[Adaptive Transformers for Learning Multimodal Representations](, ACL 2020

[Learning Transferable Visual Models From Natural Language Supervision](, arXiv 2021 [[blog]]( [[code]](

[12-in-1: Multi-Task Vision and Language Representation Learning](, CVPR 2020 [[code]](

[Watching the World Go By: Representation Learning from Unlabeled Videos](, arXiv 2020

[Contrastive Multiview Coding](, ECCV 2020 [[code]](

[Representation Learning with Contrastive Predictive Coding](, arXiv 2019 [[code]](

[Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations](, EMNLP 2019

[Visual Concept-Metaconcept Learning](, NeurIPS 2019 [[code]](

[ViCo: Word Embeddings from Visual Co-occurrences](, ICCV 2019 [[code]](

[Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations](, CVPR 2019

[Multi-Task Learning of Hierarchical Vision-Language Representation](, CVPR 2019

[Learning Factorized Multimodal Representations](, ICLR 2019 [[code]](

[Learning Video Representations using Contrastive Bidirectional Transformer](, arXiv 2019

[OmniNet: A Unified Architecture for Multi-modal Multi-task Learning](, arXiv 2019 [[code]](

[Learning Representations by Maximizing Mutual Information Across Views](, arXiv 2019 [[code]](

[A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks](, ICML 2018

[Do Neural Network Cross-Modal Mappings Really Bridge Modalities?](, ACL 2018

[Learning Robust Visual-Semantic Embeddings](, ICCV 2017

[Deep Multimodal Representation Learning from Temporal Data](, CVPR 2017

[Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations](, COLING 2016

[Combining Language and Vision with a Multimodal Skip-gram Model](, NAACL 2015

[Learning Grounded Meaning Representations with Autoencoders](, ACL 2014

[Deep Fragment Embeddings for Bidirectional Image Sentence Mapping](, NIPS 2014

[Multimodal Learning with Deep Boltzmann Machines](, JMLR 2014

[DeViSE: A Deep Visual-Semantic Embedding Model](, NeurIPS 2013

[Multimodal Deep Learning](http//, ICML 2011

### Multimodal Fusion

[Provable Dynamic Fusion for Low-Quality Multimodal Data](, ICML 2023, [[code]](

[Deep multimodal sequence fusion by regularized expressive representation distillation](, TMM 2022, [[code]](

[Pace-adaptive and Noise-resistant Contrastive Learning for Multimodal Feature Fusion](, TMM 2023

[Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion](, EMNLP 2021, [[code]](

[Attention Bottlenecks for Multimodal Fusion](, arXiv 2021

[Contrastive Multimodal Fusion with TupleInfoNCE](, arXiv 2021

[Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning](, ICLR 2021, [[e]](

[Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization](, ICLR 2021

[MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering](, EMNLP 2020

[VolTAGE: Volatility Forecasting via Text Audio Fusion with Graph Convolution Networks for Earnings Calls](, EMNLP 2020

[Dual Low-Rank Multimodal Fusion](, EMNLP Findings 2020

[Trusted Multi-View Classification](, ICLR 2021 [[code]](

[Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis](, ICDM 2020

[Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies](, NeurIPS 2020 [[code]](

[Deep Multimodal Fusion by Channel Exchanging](, NeurIPS 2020 [[code]](

[What Makes Training Multi-Modal Classification Networks Hard?](, CVPR 2020

[DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis](, IJCAI 2019 [[code]](

[Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling](, NeurIPS 2019

[XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification](, IEEE TNNLS 2019 [[code]](

[MFAS: Multimodal Fusion Architecture Search](, CVPR 2019

[The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision](, ICLR 2019 [[code]](

[Dynamic Fusion for Multimodal Data](, arXiv 2019

[Unifying and merging well-trained deep neural networks for inference stage](, IJCAI 2018 [[code]](

[Efficient Low-rank Multimodal Fusion with Modality-Specific Factors](, ACL 2018 [[code]](

[Memory Fusion Network for Multi-view Sequential Learning](, AAAI 2018 [[code]](

[Tensor Fusion Network for Multimodal Sentiment Analysis](, EMNLP 2017 [[code]](

[Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework](, AAAI 2015

### Analysis of Multimodal Models

[The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation](, ICLR 2023 [[code]](

[Post-hoc Concept Bottleneck Models](, ICLR 2023, [[code]](

[CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks](, ICLR 2023, [[code]](

[Identifiability Results for Multimodal Contrastive Learning](, ICLR 2023 [[code]](

[MultiViz: Towards Visualizing and Understanding Multimodal Models](, ICLR 2023 [[code]](

[Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers](, ICCV 2021, [[code]](

[Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!](, EMNLP 2020

[Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers](, TACL 2021

[Blindfold Baselines for Embodied QA](, NIPS 2018 Visually-Grounded Interaction and Language Workshop

[Analyzing the Behavior of Visual Question Answering Models](, EMNLP 2016

### Multimodal Pretraining

[PaLI: A Jointly-Scaled Multilingual Language-Image Model](, ICLR 2023

[HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention](, ICLR 2023 [[code]](

[Composing Ensembles of Pre-trained Models via Iterative Consensus](

[Multi-stage Pre-training over Simplified Multimodal Pre-training Models](, ACL 2021

[Integrating Multimodal Information in Large Pretrained Transformers](, ACL 2020

[Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](, CVPR 2021 [[code]](

[Large-Scale Adversarial Training for Vision-and-Language Representation Learning](, NeurIPS 2020 [[code]](

[Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision](, EMNLP 2020 [[code]](

[Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer](, arXiv 2021

[ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks](, NeurIPS 2019 [[code]](

[LXMERT: Learning Cross-Modality Encoder Representations from Transformers](, EMNLP 2019 [[code]](

[VideoBERT: A Joint Model for Video and Language Representation Learning](, ICCV 2019

[Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training](, arXiv 2019

[M-BERT: Injecting Multimodal Information in the BERT Structure](, arXiv 2019

[VL-BERT: Pre-training of Generic Visual-Linguistic Representations](, arXiv 2019 [[code]](

[VisualBERT: A Simple and Performant Baseline for Vision and Language](, arXiv 2019 [[code]](

### Self-supervised Learning

[VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text](, arXiv 2021, [[code]](

[Self-supervised Representation Learning with Relative Predictive Coding](, ICLR 2021

[Exploring Balanced Feature Spaces for Representation Learning](, ICLR 2021

[There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge](, CVPR 2021, [[code]](, [[homepage]](

[Self-Supervised Learning by Cross-Modal Audio-Video Clustering](, NeurIPS 2020 [[code]](

[Self-Supervised MultiModal Versatile Networks](, NeurIPS 2020 [[code]](

[Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision](, NeurIPS 2020 [[code]](

[Self-Supervised Learning from Web Data for Multimodal Retrieval](, arXiv 2019

[Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces](, CVPR 2017

[Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems](, 2016

[Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](, NeurIPS 2020, [[code]](

[Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere](, PMLR 2020, [[code]](

[Improving Contrastive Learning by Visualizing Feature Transformation](, ICCV 2021, [[code]](

### Generative Multimodal Models

[Grounding Language Models to Images for Multimodal Inputs and Outputs](, ICML 2023 [[code]](

[Retrieval-Augmented Multimodal Language Modeling](, ICML 2023 [[webpage]](

[Make-A-Video: Text-to-Video Generation without Text-Video Data](, ICLR 2023 [[website]](

[Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation](, ICLR 2023 [[code]](

[Unified Discrete Diffusion for Simultaneous Vision-Language Generation](, ICLR 2023 [[code]](

[MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises](, ICLR 2023

[MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation](, CVPR 2023 [[code]](

[Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models](, ICLR 2021

[Generalized Multimodal ELBO](, ICLR 2021

[UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning](, ACL 2021

[Few-shot Video-to-Video Synthesis](, NeurIPS 2019 [[code]](

[Multimodal Generative Models for Scalable Weakly-Supervised Learning](, NeurIPS 2018 [[code1]]( [[code2]](

[Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models](, CVPR 2018

[The Multi-Entity Variational Autoencoder](, NeurIPS 2017

### Multimodal Adversarial Attacks

[Data Poisoning Attacks Against Multimodal Encoders](, ICML 2023 [[code]](

[Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models](, NeurIPS Workshop on Visually Grounded Interaction and Language 2018

[Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning](, ACL 2018 [[code]](

[Fooling Vision and Language Models Despite Localization and Attention Mechanism](, CVPR 2018

### Multimodal Reasoning

[Multimodal Analogical Reasoning over Knowledge Graphs](, ICLR 2023 [[code]](

[Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language](, ICLR 2023 [[code]](

## Research Tasks

### Sentiment and Emotion Analysis

[MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation](, ACL 2021

[Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks](), ACL 2021

[Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis](), ACL 2021

[Multi-Label Few-Shot Learning for Aspect Category Detection](, ACL 2021

[Directed Acyclic Graph Network for Conversational Emotion Recognition](, ACL 2021

[CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network](), ACL 2021

[Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data](, ACL 2021

[A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis](), ACL Findings 2021

### Trajectory and Motion Forecasting

[HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents](, ICLR 2021

[Multimodal Motion Prediction with Stacked Transformers](, CVPR 2021 [[code]](

[Social NCE: Contrastive Learning of Socially-aware Motion Representations](, ICCV 2021, [[code]](

[The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction](, ECCV 2020, [[code]](

## Datasets

[An Extensible Multi-modal Multi-task Object Dataset with Materials](, ICLR 2023 [[download]](

[comment]: <> ([MultiMET: A Multimodal Dataset for Metaphor Understanding](, ACL 2021 [[download]])

[A Large-Scale Chinese Multimodal NER Dataset with Speech Clues](), ACL 2021

[A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks](, ACL 2020

[CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality](, ACL 2020, [[code]](

[CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French](, EMNLP 2020 [[download]](

YouTube-8: [Predicting Emotions in User-Generated Videos](, [[download]](, [[webpage]](


## Tutorials and blogs

[Deep learning 2021 - NYU](


[SSL-paper list](