# awesome-multimodal-sequence-learning

Reading list for multimodal sequence learning.

https://github.com/Redaimao/awesome-multimodal-sequence-learning
## Research Areas

### Multimodal Fusion
- Pace-adaptive and Noise-resistant Contrastive Learning for Multimodal Feature Fusion
- Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion
- Deep multimodal sequence fusion by regularized expressive representation distillation
- Attention Bottlenecks for Multimodal Fusion
- Contrastive Multimodal Fusion with TupleInfoNCE
- Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning
- Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization
- MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
- Provable Dynamic Fusion for Low-Quality Multimodal Data
- VolTAGE: Volatility Forecasting via Text Audio Fusion with Graph Convolution Networks for Earnings Calls
- Dual Low-Rank Multimodal Fusion
- Trusted Multi-View Classification
- Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis
- Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
- Deep Multimodal Fusion by Channel Exchanging
- What Makes Training Multi-Modal Classification Networks Hard?
- DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis
- XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
- MFAS: Multimodal Fusion Architecture Search
- The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
- Dynamic Fusion for Multimodal Data
- Unifying and merging well-trained deep neural networks for inference stage
- Efficient Low-rank Multimodal Fusion with Modality-Specific Factors (see the fusion sketch at the end of this list)
- Memory Fusion Network for Multi-view Sequential Learning
- Tensor Fusion Network for Multimodal Sentiment Analysis
- Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
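Several entries in this list revolve around a small set of fusion operators. As a point of reference, below is a minimal, hypothetical sketch, not taken from any listed paper's released code, of outer-product tensor fusion in the spirit of "Tensor Fusion Network for Multimodal Sentiment Analysis" and its low-rank factorization in the spirit of "Efficient Low-rank Multimodal Fusion with Modality-Specific Factors"; all class names, dimensions, and the rank value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Fuse two modality vectors via the outer product of their 1-padded embeddings."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        # The padded outer product has (dim_a + 1) * (dim_b + 1) entries.
        self.proj = nn.Linear((dim_a + 1) * (dim_b + 1), dim_out)

    def forward(self, a, b):
        ones = a.new_ones(a.size(0), 1)
        a1 = torch.cat([a, ones], dim=1)                      # (B, dim_a + 1)
        b1 = torch.cat([b, ones], dim=1)                      # (B, dim_b + 1)
        outer = torch.bmm(a1.unsqueeze(2), b1.unsqueeze(1))   # (B, dim_a + 1, dim_b + 1)
        return self.proj(outer.flatten(1))

class LowRankFusion(nn.Module):
    """Approximate the full fusion tensor with rank-r modality-specific factors."""
    def __init__(self, dim_a, dim_b, dim_out, rank=4):
        super().__init__()
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a + 1, dim_out) * 0.1)
        self.factor_b = nn.Parameter(torch.randn(rank, dim_b + 1, dim_out) * 0.1)

    def forward(self, a, b):
        ones = a.new_ones(a.size(0), 1)
        a1 = torch.cat([a, ones], dim=1)
        b1 = torch.cat([b, ones], dim=1)
        # Project each padded modality through its rank factors, combine them
        # multiplicatively, and sum over the rank dimension.
        fa = torch.einsum('bi,rio->bro', a1, self.factor_a)
        fb = torch.einsum('bj,rjo->bro', b1, self.factor_b)
        return (fa * fb).sum(dim=1)

if __name__ == "__main__":
    a, b = torch.randn(8, 32), torch.randn(8, 64)
    print(TensorFusion(32, 64, 16)(a, b).shape)    # torch.Size([8, 16])
    print(LowRankFusion(32, 64, 16)(a, b).shape)   # torch.Size([8, 16])
```

The low-rank variant avoids materializing the full (dim_a + 1) x (dim_b + 1) outer product, which is the main practical motivation for the low-rank fusion line of work.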

### Multimodal Pretraining
- VideoBERT: A Joint Model for Video and Language Representation Learning
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
- Composing Ensembles of Pre-trained Models via Iterative Consensus
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models
- Integrating Multimodal Information in Large Pretrained Transformers
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- VisualBERT: A Simple and Performant Baseline for Vision and Language
- M-BERT: Injecting Multimodal Information in the BERT Structure

### Representation Learning
- Robustness in Multimodal Learning under Train-Test Modality Mismatch
- Calibrating Multimodal Learning
- Learning Multimodal Data Augmentation in Feature Space
- Multimodal Federated Learning via Contrastive Representation Ensemble
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
- CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations
- Multimodal Contrastive Training for Visual Representation Learning
- Parameter Efficient Multimodal Transformers for Video Representation Learning
- Viewmaker Networks: Learning Views for Unsupervised Representation Learning
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components
- Improving Transformation Invariance in Contrastive Representation Learning
- Active Contrastive Learning of Audio-Visual Video Representations
- i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning
- Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections
- Adaptive Transformers for Learning Multimodal Representations
- Learning Transferable Visual Models From Natural Language Supervision
- 12-in-1: Multi-Task Vision and Language Representation Learning
- Watching the World Go By: Representation Learning from Unlabeled Videos
- Contrastive Multiview Coding
- Representation Learning with Contrastive Predictive Coding (see the contrastive-loss sketch at the end of this list)
- Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations
- ViCo: Word Embeddings from Visual Co-occurrences
- Multi-Task Learning of Hierarchical Vision-Language Representation
- Learning Factorized Multimodal Representations
- Learning Video Representations using Contrastive Bidirectional Transformer
- OmniNet: A Unified Architecture for Multi-modal Multi-task Learning
- Learning Representations by Maximizing Mutual Information Across Views
- A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks
- Learning Robust Visual-Semantic Embeddings
- Deep Multimodal Representation Learning from Temporal Data
- Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
- Multimodal Deep Learning
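Many of the representation-learning entries above (Contrastive Predictive Coding, Contrastive Multiview Coding, CLIP, CrossCLR) share an InfoNCE-style contrastive objective. The snippet below is a minimal, hedged sketch of a symmetric cross-modal InfoNCE loss, assuming paired embeddings from two modalities; the function name and temperature are illustrative choices, not any paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, dim) embeddings of paired samples from two modalities."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matched pairs lie on the diagonal; every other pair serves as a negative.
    loss_a = F.cross_entropy(logits, targets)        # modality A -> modality B
    loss_b = F.cross_entropy(logits.t(), targets)    # modality B -> modality A
    return 0.5 * (loss_a + loss_b)

if __name__ == "__main__":
    za, zb = torch.randn(16, 128), torch.randn(16, 128)
    print(cross_modal_info_nce(za, zb).item())
```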

### Analysis of Multimodal Models
- The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation
- Post-hoc Concept Bottleneck Models
- CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
- Identifiability Results for Multimodal Contrastive Learning
- MultiViz: Towards Visualizing and Understanding Multimodal Models
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
- Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
- Blindfold Baselines for Embodied QA (Visually Grounded Interaction and Language Workshop)
- Analyzing the Behavior of Visual Question Answering Models
- Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!

### Self-supervised Learning
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
- Self-supervised Representation Learning with Relative Predictive Coding
- Exploring Balanced Feature Spaces for Representation Learning
- There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge [[homepage]](http://rl.uni-freiburg.de/research/multimodal-distill)
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering
- Self-Supervised MultiModal Versatile Networks
- Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision
- Self-Supervised Learning from Web Data for Multimodal Retrieval
- Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
- Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
- Improving Contrastive Learning by Visualizing Feature Transformation

### Generative Multimodal Models
- Grounding Language Models to Images for Multimodal Inputs and Outputs
- Retrieval-Augmented Multimodal Language Modeling
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation
- MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models
- Generalized Multimodal ELBO
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
- Few-shot Video-to-Video Synthesis
- Multimodal Generative Models for Scalable Weakly-Supervised Learning [[code]](https://github.com/panpan2/Multimodal-Variational-Autoencoder) (see the product-of-experts sketch after this list)
- Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
- The Multi-Entity Variational Autoencoder
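Several of the generative entries above fuse per-modality inference networks into a joint posterior. Below is a minimal, hypothetical sketch of the precision-weighted product-of-experts combination of Gaussian posteriors, in the spirit of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (MVAE); the function name is an assumption, and the code is an illustration rather than that paper's released implementation.

```python
import torch

def product_of_experts(mus, logvars):
    """Combine per-modality Gaussian experts N(mu_i, var_i) with a standard-normal prior expert."""
    # Stack the prior expert N(0, I) together with the modality experts.
    mus = torch.stack([torch.zeros_like(mus[0])] + list(mus))
    logvars = torch.stack([torch.zeros_like(logvars[0])] + list(logvars))
    precisions = torch.exp(-logvars)                     # 1 / var_i
    joint_var = 1.0 / precisions.sum(dim=0)              # joint precision is the sum of precisions
    joint_mu = joint_var * (mus * precisions).sum(dim=0) # precision-weighted mean
    return joint_mu, joint_var.log()

if __name__ == "__main__":
    mu_img, lv_img = torch.zeros(4, 8), torch.zeros(4, 8)
    mu_txt, lv_txt = torch.ones(4, 8), torch.zeros(4, 8)
    mu, logvar = product_of_experts([mu_img, mu_txt], [lv_img, lv_txt])
    print(mu.shape, logvar.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```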

### Multimodal Adversarial Attacks
- Data Poisoning Attacks Against Multimodal Encoders
- Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models
- Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning
- Fooling Vision and Language Models Despite Localization and Attention Mechanism

### Multimodal Reasoning

## Survey Papers

## Research Tasks

### Sentiment and Emotion Analysis
- MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation
- Multi-Label Few-Shot Learning for Aspect Category Detection
- Directed Acyclic Graph Network for Conversational Emotion Recognition
- Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data

### Trajectory and Motion Forecasting
- HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents
- Multimodal Motion Prediction with Stacked Transformers
- Social NCE: Contrastive Learning of Socially-aware Motion Representations
- The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction

## Datasets
- An Extensible Multi-modal Multi-task Object Dataset with Materials
- A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
- CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality
- CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French
- Predicting Emotions in User-Generated Videos
- LAION-400M