awesome-multimodal-ml
Reading list for research topics in multimodal machine learning
https://github.com/pliang279/awesome-multimodal-ml
-
Course content + workshops
- Tutorials on Multimodal Machine Learning (multicomp-lab.github.io/mmml-tutorial/schedule/)
- 11-877 Advanced Topics in Multimodal Machine Learning - a discussion-based course; discussion probes, relevant papers, and summarized discussion highlights are posted every week on the course website.
- 11-777 Multimodal Machine Learning
-
Survey Papers
- Multimodal Learning with Transformers: A Survey
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
- Experience Grounds Language
- A Survey of Reinforcement Learning Informed by Natural Language
- Multimodal Machine Learning: A Survey and Taxonomy
- Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
- Deep Multimodal Representation Learning: A Survey
- Guest Editorial: Image and Language Understanding
- Representation Learning: A Review and New Perspectives
- A Survey of Socially Interactive Robots
- Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
-
Core Areas
-
Multimodal Representations
- Identifiability Results for Multimodal Contrastive Learning
- Unpaired Vision-Language Pre-training via Cross-Modal CutMix
- Balanced Multimodal Learning via On-the-fly Gradient Modulation
- Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
- Learning Transferable Visual Models From Natural Language Supervision
- VinVL: Revisiting Visual Representations in Vision-Language Models [[code]](https://github.com/pzzhang/VinVL)
- 12-in-1: Multi-Task Vision and Language Representation Learning
- Watching the World Go By: Representation Learning from Unlabeled Videos
- Learning Video Representations using Contrastive Bidirectional Transformer
- Visual Concept-Metaconcept Learning
- OmniNet: A Unified Architecture for Multi-modal Multi-task Learning
- Learning Representations by Maximizing Mutual Information Across Views
- ViCo: Word Embeddings from Visual Co-occurrences
- Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
- Multi-Task Learning of Hierarchical Vision-Language Representation
- Learning Factorized Multimodal Representations
- A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks
- Do Neural Network Cross-Modal Mappings Really Bridge Modalities?
- Learning Robust Visual-Semantic Embeddings
- Deep Multimodal Representation Learning from Temporal Data
- Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations
- Combining Language and Vision with a Multimodal Skip-gram Model
- Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
- Learning Grounded Meaning Representations with Autoencoders
- DeViSE: A Deep Visual-Semantic Embedding Model
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
- Multimodal Deep Learning
- FLAVA: A Foundational Language And Vision Alignment Model
- Perceiver: General Perception with Iterative Attention
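
Several entries above (e.g., CLIP-style "Learning Transferable Visual Models From Natural Language Supervision" and FLAVA) learn a shared image-text embedding space with a symmetric contrastive objective. A minimal PyTorch sketch of that objective, assuming a batch of paired image/text embeddings from two separate encoders; the function and variable names are illustrative, not taken from any listed codebase:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors; matching pairs share a row index.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

In practice this loss is usually computed with a learnable temperature and with negatives gathered across a large batch, but the structure is as above.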
-
Multimodal Fusion
- Robust Contrastive Learning against Noisy Views
- Cooperative Learning for Multi-view Analysis
- What Makes Multi-modal Learning Better than Single (Provably)
- Attention Bottlenecks for Multimodal Fusion
- VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization
- Trusted Multi-View Classification
- Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis
- Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
- Deep Multimodal Fusion by Channel Exchanging
- What Makes Training Multi-Modal Classification Networks Hard?
- Dynamic Fusion for Multimodal Data
- DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis
- Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling
- XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
- MFAS: Multimodal Fusion Architecture Search
- The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
- Unifying and merging well-trained deep neural networks for inference stage
- Efficient Low-rank Multimodal Fusion with Modality-Specific Factors
- Memory Fusion Network for Multi-view Sequential Learning
- Tensor Fusion Network for Multimodal Sentiment Analysis
- Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
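
Many of the fusion papers above differ mainly in how unimodal features are combined. As a reference point, a sketch of outer-product fusion in the spirit of the Tensor Fusion Network listed above; the names and dimensions are illustrative:

```python
import torch

def tensor_fusion(audio, text, visual):
    """Outer-product fusion of three unimodal embeddings.

    Each input is a (batch, d_m) tensor; appending a constant 1 keeps unimodal
    and bimodal interactions alongside the trimodal ones in the fused tensor.
    """
    def with_one(x):
        ones = torch.ones(x.size(0), 1, device=x.device, dtype=x.dtype)
        return torch.cat([x, ones], dim=1)

    a, t, v = with_one(audio), with_one(text), with_one(visual)
    fused = torch.einsum('bi,bj,bk->bijk', a, t, v)   # (batch, da+1, dt+1, dv+1)
    return fused.flatten(start_dim=1)                  # fed to a prediction head
```

The low-rank fusion paper listed above factorizes this fused tensor to avoid its multiplicative growth in dimensionality.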
-
Multimodal Alignment
- Reconsidering Representation Alignment for Multi-view Clustering
- CoMIR: Contrastive Multimodal Image Representation for Registration
- Multimodal Transformer for Unaligned Multimodal Language Sequences
- Temporal Cycle-Consistency Learning
- See, Hear, and Read: Deep Aligned Representations
- On Deep Multi-View Representation Learning
- Deep Canonical Correlation Analysis
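
Deep CCA and related alignment methods above generalize classical canonical correlation analysis. A compact NumPy sketch of the linear case, assuming two views of the same samples; the helper names are illustrative:

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    # Symmetric inverse square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def linear_cca(X, Y, k=2, reg=1e-4):
    """Classical CCA: projections of two views that maximize correlation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :k]   # projection matrix for view X
    B = inv_sqrt(Syy) @ Vt[:k].T   # projection matrix for view Y
    return A, B, s[:k]             # s[:k] are the canonical correlations
```

Deep CCA replaces X and Y with the outputs of learned neural encoders and maximizes the same correlation objective.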
-
Multimodal Pretraining
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Integrating Multimodal Information in Large Pretrained Transformers
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- VisualBERT: A Simple and Performant Baseline for Vision and Language
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers
- VideoBERT: A Joint Model for Video and Language Representation Learning
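
Models such as ViLBERT and LXMERT above build on cross-modal (co-)attention, where tokens from one modality query tokens from the other. A simplified PyTorch sketch of one such block, with illustrative dimensions rather than the exact architectures of the listed papers:

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens (one direction of co-attention)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens, visual_tokens):
        # Queries come from the text stream, keys/values from the visual stream.
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = self.norm1(text_tokens + attended)
        return self.norm2(x + self.ffn(x))
```

The papers above stack such blocks in both directions and pretrain them with objectives such as masked language modeling and image-text matching.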
-
Multimodal Translation
- Zero-Shot Text-to-Image Generation
- Translate-to-Recognize Networks for RGB-D Scene Recognition
- Language2Pose: Natural Language Grounded Pose Forecasting
- Reconstructing Faces from Voices
- Speech2Face: Learning the Face Behind a Voice
- Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
-
Crossmodal Retrieval
- Learning with Noisy Correspondence for Cross-modal Matching
- MURAL: Multimodal, Multitask Retrieval Across Languages
- Self-Supervised Learning from Web Data for Multimodal Retrieval
- Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
- Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
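
The retrieval papers above are typically evaluated by ranking one modality's embeddings against the other's and reporting Recall@K. A small NumPy sketch of that evaluation, assuming the embeddings are already computed; names are illustrative:

```python
import numpy as np

def rank_gallery(query_emb, gallery_emb):
    """Rank gallery items for each query by cosine similarity (descending)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)   # shape: (num_queries, num_gallery)

def recall_at_k(ranking, ground_truth, k=5):
    """Fraction of queries whose correct gallery index appears in the top-k."""
    return float(np.mean([gt in row[:k] for row, gt in zip(ranking, ground_truth)]))
```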
-
Multimodal Co-learning
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
- Foundations of Multimodal Co-learning
- Vokenization: Improving Language Understanding via Contextualized, Visually-Grounded Supervision
-
Missing or Imperfect Modalities
- A Variational Information Bottleneck Approach to Multi-Omics Data Integration
- SMIL: Multimodal Learning with Severely Missing Modality
- Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series
- Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization
- Multimodal Deep Learning for Robust RGB-D Object Recognition
-
Analysis of Multimodal Models
- M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis
- Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
- Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
- Blindfold Baselines for Embodied QA, Visually-Grounded Interaction and Language Workshop
- Analyzing the Behavior of Visual Question Answering Models
-
Knowledge Graphs and Knowledge Bases
- MMKG: Multi-Modal Knowledge Graphs
- Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs
- Embedding Multimodal Relational Data for Knowledge Base Completion
- A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning
- Order-Embeddings of Images and Language
- Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries
-
Interpretable Learning
- Multimodal Explanations by Predicting Counterfactuality in Videos
- Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
- Do Explanations make VQA Models more Predictable to a Human?
- Towards Transparent AI Systems: Interpreting Visual Question Answering Models
-
Generative Learning
- MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises
- On the Limitations of Multimodal VAEs
- Generalized Multimodal ELBO
- Multimodal Generative Learning Utilizing Jensen-Shannon-Divergence
- Self-supervised Disentanglement of Modality-specific and Shared Factors Improves Multimodal Generative Models
- Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models
- Few-shot Video-to-Video Synthesis
- Multimodal Generative Models for Scalable Weakly-Supervised Learning [[code]](https://github.com/panpan2/Multimodal-Variational-Autoencoder)
- The Multi-Entity Variational Autoencoder
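
Much of this section revolves around multimodal evidence lower bounds. In the product-of-experts formulation of "Multimodal Generative Models for Scalable Weakly-Supervised Learning" (MVAE), for modalities x_1, ..., x_M with a shared latent z, the objective takes roughly the form below (the weights λ_m and β are hyperparameters):

```latex
\mathcal{L}(x_{1:M}) \;=\;
\mathbb{E}_{q_\phi(z \mid x_{1:M})}\!\left[\sum_{m=1}^{M} \lambda_m \log p_\theta(x_m \mid z)\right]
\;-\; \beta\, \mathrm{KL}\!\left(q_\phi(z \mid x_{1:M}) \,\|\, p(z)\right),
\qquad
q_\phi(z \mid x_{1:M}) \;\propto\; p(z)\prod_{m=1}^{M} q_\phi(z \mid x_m).
```

The mixture-of-experts variants listed above replace the product-of-experts posterior with a mixture over unimodal posteriors.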
-
Semi-supervised Learning
-
Self-supervised Learning
- DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering
- Self-Supervised MultiModal Versatile Networks
- Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision
- Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
-
Language Models
- Neural Language Modeling with Visual Features
- Learning Multi-Modal Word Representation Grounded in Visual Context
- Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes
- Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
-
Adversarial Attacks
-
Few-Shot Learning
-
Bias and Fairness
- Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models
- Towards Debiasing Sentence Representations
- FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment
- Model Cards for Model Reporting
- Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings
- Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
- Datasheets for Datasets
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
-
Human in the Loop Learning
-
-
Architectures
-
Multimodal Transformers
- Pretrained Transformers As Universal Computation Engines
- PolyViT: Co-training Vision Transformers on Images, Videos and Audio
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
- Parameter Efficient Multimodal Transformers for Video Representation Learning
-
Multimodal Memory
- Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
- History Aware Multimodal Transformer for Vision-and-Language Navigation
- Episodic Memory in Lifelong Language Learning
- ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection
- Multimodal Memory Modelling for Video Captioning
- Dynamic Memory Networks for Visual and Textual Question Answering
-
-
Applications and Datasets
-
Language and Visual QA
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios
- SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
- MultiModalQA: complex question answering over text, tables and images
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs
- Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
- Interactive Language Learning by Question Answering
- Fusion of Detected Objects in Text for Visual Question Answering
- RUBi: Reducing Unimodal Biases in Visual Question Answering
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- MUREL: Multimodal Relational Reasoning for Visual Question Answering
- Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence
- Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
- Learning to Count Objects in Natural Images for Visual Question Answering
- Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
- Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
- RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
- TVQA: Localized, Compositional Video Question Answering
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
- Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
- Stacked Latent Attention for Multimodal Reasoning
- Learning to Reason: End-to-End Module Networks for Visual Question Answering
- CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning [[dataset generation]](https://github.com/facebookresearch/clevr-dataset-gen)
- Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
- Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
- MovieQA: Understanding Stories in Movies through Question-Answering
- VQA: Visual Question Answering
-
Language Grounding in Vision
- Core Challenges in Embodied Vision-Language Planning
- MaRVL: Multicultural Reasoning over Vision and Language
- Grounding 'Grounding' in NLP
- The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
- What Does BERT with Vision Look At?
- Visual Grounding in Video for Unsupervised Word Translation
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
- Grounded Video Description
- Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
- Multilevel Language and Vision Integration for Text-to-Clip Retrieval
- Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding
- Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos
- SCAN: Learning Hierarchical Compositional Visual Concepts
- Visual Coreference Resolution in Visual Dialog using Neural Module Networks
- Gated-Attention Architectures for Task-Oriented Language Grounding
- Using Syntax to Ground Referring Expressions in Natural Images
- Grounding language acquisition by training semantic parsers using captioned videos
- Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
- Localizing Moments in Video with Natural Language
- What are you talking about? Text-to-Image Coreference
- Grounded Language Learning from Video Described with Sentences
- Grounded Compositional Semantics for Finding and Describing Images with Sentences
-
Language Grounding in Navigation
- ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
- Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation [[video]](https://www.youtube.com/watch?v=y16x9n_zP_4), [[project page]](https://zubair-irshad.github.io/projects/robo-vln.html)
- Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
- VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering
- Vision-and-Dialog Navigation
- Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
- Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
- Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
- Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
- The Regretful Navigation Agent for Vision-and-Language Navigation
- Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
- Multi-modal Discriminative Model for Vision-and-Language Navigation, RoboNLP Workshop 2019
- Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
- From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following
- Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
- Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
- Attention Based Natural Language Grounding by Navigating Virtual Environment
- Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
- Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
- Embodied Question Answering
- Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
-
Multimodal Machine Translation
- Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting
- Multimodal Transformer for Multimodal Machine Translation
- Neural Machine Translation with Universal Visual Representation
- Visual Agreement Regularized Training for Multi-Modal Machine Translation
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
- Latent Variable Model for Multi-modal Translation
- Distilling Translations with Visual Awareness
- Probing the Need for Visual Context in Multimodal Machine Translation
- Emergent Translation in Multi-Agent Communication
- Zero-Resource Neural Machine Translation with Multi-Agent Communication Game
- Learning Translations via Images with a Massively Multilingual Image Dataset
- A Visual Attention Grounding Neural Model for Multimodal Machine Translation
- Adversarial Evaluation of Multimodal Machine Translation
- Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
- An empirical study on the effectiveness of images in Multimodal Neural Machine Translation
- Incorporating Global Visual Features into Attention-based Neural Machine Translation
- Multimodal Pivots for Image Caption Translation
- Multi30K: Multilingual English-German Image Descriptions
- Does Multimodality Help Human and Machine for Translation and Image Captioning?
-
Multi-agent Communication
- Multi-agent Communication meets Natural Language: Synergies between Functional and Structural Language Learning
- Emergence of Compositional Language with Deep Generational Transmission
- On the Pitfalls of Measuring Emergent Communication
- Emergent Translation in Multi-Agent Communication
- Emergent Communication in a Multi-Modal, Multi-Step Referential Game
- Emergence of Linguistic Communication From Referential Games with Symbolic and Pixel Input
- Emergent Communication through Negotiation
- Emergence of Grounded Compositional Language in Multi-Agent Populations
- Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols
- Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog [[code]](https://github.com/kdexd/lang-emerge-parlai)
- Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
- Multi-agent Cooperation and the Emergence of (natural) Language
- Learning to Communicate with Deep Multi-agent Reinforcement Learning
- Learning multiagent communication with backpropagation
- The Emergence of Compositional Structures in Perceptually Grounded Language Games
-
Commonsense Reasoning
- Adventures in Flatland: Perceiving Social Interactions Under Physical Dynamics
- A Logical Model for Supporting Social Commonsense Knowledge Acquisition
- Heterogeneous Graph Learning for Visual Commonsense Reasoning
- SocialIQA: Commonsense Reasoning about Social Interactions
- From Recognition to Cognition: Visual Commonsense Reasoning
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
-
Multimodal Reinforcement Learning
- MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research
- Imitating Interactive Intelligence
- Grounded Language Learning Fast and Slow
- RTFM: Generalising to Novel Environment Dynamics via Reading
- Embodied Multimodal Multitask Learning
- Learning to Speak and Act in a Fantasy Text Adventure Game
- Language as an Abstraction for Hierarchical Deep Reinforcement Learning
- Habitat: A Platform for Embodied AI Research
- Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog
- Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
- Reinforcement Learning for Mapping Instructions to Actions
- Hierarchical Decision Making by Generating and Following Natural Language Instructions
-
Multimodal Dialog
- Two Causal Principles for Improving Visual Dialog
- MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
- CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
- Talk the Walk: Navigating New York City through Grounded Dialogue
- Dialog-based Interactive Image Retrieval
- Towards Building Large Scale Multimodal Domain-Aware Conversation Systems
- Visual Dialog
-
Language and Audio
- Lattice Transformer for Speech Translation
- Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
- Audio Caption: Listen and Tell
- Audio-Linguistic Embeddings for Spoken Sentences
- From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings
- From Audio to Semantics: Approaches To End-to-end Spoken Language Understanding
- Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech
- Deep Voice: Real-time Neural Text-to-Speech
- Text-to-Speech Synthesis
- Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions
-
Audio and Visual
- Music Gesture for Visual Sound Separation
- Co-Compressing and Unifying Deep CNN Models for Efficient Human Face and Speaker Recognition
- Learning Individual Styles of Conversational Gesture
- Capture, Learning, and Synthesis of 3D Speaking Styles
- Disjoint Mapping Network for Cross-modal Matching of Voices and Faces
- Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks
- Learning Affective Correspondence between Music and Image
- Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
- Seeing Voices and Hearing Faces: Cross-modal Biometric Matching
- Learning to Separate Object Sounds by Watching Unlabeled Video
- Deep Audio-Visual Speech Recognition
- Look, Listen and Learn
- Unsupervised Learning of Spoken Language with Visual Context
- SoundNet: Learning Sound Representations from Unlabeled Video
-
Visual, IMU and Wireless
-
Media Description
- Towards Unsupervised Image Captioning with Shared Multimodal Embeddings
- Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph
- Joint Event Detection and Description in Continuous Video Streams
- Learning to Compose and Reason with Language Tree Structures for Visual Grounding
- Neural Baby Talk
- Grounding Referring Expressions in Images by Variational Context
- Video Captioning via Hierarchical Reinforcement Learning
- Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
- Neural Motifs: Scene Graph Parsing with Global Context
- No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
- Generating Descriptions with Grounded and Co-Referenced People
- DenseCap: Fully Convolutional Localization Networks for Dense Captioning
- Review Networks for Caption Generation
- Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
- Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Deep Visual-Semantic Alignments for Generating Image Descriptions
- Show and Tell: A Neural Image Caption Generator
- A Dataset for Movie Description
- What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
- Microsoft COCO: Common Objects in Context
-
Video Generation from Text
-
Affect Recognition and Multimodal Language
- End-to-end Facial and Physiological Model for Affective Computing and Applications
- Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey
- Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)
- Multi-modal Approach for Affective Computing
- Multimodal Language Analysis with Recurrent Multistage Fusion
- Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
- Multi-attention Recurrent Network for Human Communication Comprehension
- End-to-End Multimodal Emotion Recognition using Deep Neural Networks
- AMHUSE - A Multimodal dataset for HUmor SEnsing
- Decoding Children’s Social Behavior
- Collecting Large, Richly Annotated Facial-Expression Databases from Movies
- The Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database
-
Healthcare
- Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images
- PET-Guided Attention Network for Segmentation of Lung Tumors from PET/CT Images
- Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis
- Leveraging Medical Visual Question Answering with Supporting Facts
- Unsupervised Multimodal Representation Learning across Medical Images and Reports
- Multimodal Medical Image Retrieval based on Latent Topic Modeling
- Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning
- Knowledge-driven Generative Subspaces for Modeling Multi-view Dependencies in Medical Data
- Multimodal Depression Detection: Fusion Analysis of Paralinguistic, Head Pose and Eye Gaze Behaviors
- Learning the Joint Representation of Heterogeneous Temporal Events for Clinical Endpoint Prediction
- Understanding Coagulopathy using Multi-view Data in the Presence of Sub-Cohorts: A Hierarchical Subspace Approach
- Machine Learning in Multimodal Medical Imaging
- Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data
- SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support
- Dyadic Behavior Analysis in Depression Severity Assessment Interviews
- Audiovisual Behavior Descriptors for Depression Assessment
-
Robotics
- Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors
- Multimodal sensor fusion with differentiable filters
- Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations
- See, Feel, Act: Hierarchical Learning for Complex Manipulation Skills with Multi-sensory Fusion
- Early Fusion for Goal Directed Robotic Vision
- Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup
- Probabilistic Multimodal Modeling for Human-Robot Interaction Tasks
- Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
- Multi-modal Predicate Identification using Dynamically Learned Robot Controllers
- Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction
- Perching and Vertical Climbing: Design of a Multimodal Robot
- Multi-Modal Scene Understanding for Robotic Grasping
- Strategies for Multi-Modal Scene Exploration
-
Autonomous Driving
-
Finance
- A Multimodal Event-driven LSTM Model for Stock Prediction Using Online News
- Multimodal Deep Learning for Finance: Integrating and Forecasting International Stock Markets
- Multimodal deep learning for short-term stock volatility prediction
- Self-Supervised Learning in Event Sequences: A Comparative Study and Hybrid Approach of Generative Modeling and Contrastive Learning
-
Human AI Interaction
- Multimodal Human Computer Interaction: A Survey
- Affective multimodal human-computer interaction
- Building a multimodal human-robot interface
-
Multimodal Content Generation
- Non-Linear Consumption of Videos Using a Sequence of Personalized Multimodal Fragments
- Generating Need-Adapted Multimodal Fragments
- Multimodal KDD 2023: International Workshop on Multimodal Learning
- Social Intelligence in Humans and Robots
- LANTERN 2021: Beyond Vision and Language: inTEgrating Real-world kNowledge @ EACL 2021
- MAI-Workshop, [ViGIL](https://vigilworkshop.github.io/)
- Embodied Multimodal Learning
- Wordplay: When Language Meets Games
- The Large Scale Movie Description Challenge (LSMDC)
- EVAL and [MVA](https://sites.google.com/view/multimodalvideo-v2)
- Advances in Language and Vision Research
- Visually Grounded Interaction and Language
- Emergent Communication: Towards Natural Language
- Workshop on Multimodal Understanding and Learning for Embodied Applications
- Beyond Vision and Language: Integrating Real-World Knowledge
- Visual Question Answering and Dialog
- Multi-modal Learning from Videos
- Habitat: Embodied Agents Challenge and Workshop
- Closing the Loop Between Vision and Language & LSMD Challenge
- Multi-modal Video Analysis and Moments in Time Challenge
- Cross-Modal Learning in Real World
- Spatial Language Understanding and Grounded Communication for Robotics
- YouTube-8M Large-Scale Video Understanding
- Language and Vision Workshop
- Sight and Sound
- Wordplay: Reinforcement and Language Learning in Text-based Games
- Interpretability and Robustness in Audio, Speech, and Language
- Multimodal Robot Perception
- WMT18: Shared Task on Multimodal Machine Translation
- Shortcomings in Vision and Language
- Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (NAACL-HLT 2016, EMNLP 2015, ACL 2014, NAACL-HLT 2013)
- Visual Understanding Across Modalities
- International Workshop on Computer Vision for Audio-Visual Media
- Language Grounding for Robotics
- Computer Vision for Audio-visual Media
- Language and Vision
- Tutorial on MultiModal Machine Learning
- Connecting Language and Vision to Actions
- Machine Learning for Clinicians: Advances for Multi-Modal Health Data
- Multimodal Machine Learning
- Vision and Language: Bridging Vision and Language with Deep Learning
- CMU 11-777 Multimodal Machine Learning
- CMU 11-877 Advanced Topics in Multimodal Machine Learning
- CMU 05-618, Human-AI Interaction
- CMU 11-777, Advanced Multimodal Machine Learning
- Stanford CS422: Interactive and Embodied Learning
- CMU 16-785, Integrated Intelligence in Robotics: Vision, Language, and Planning
- CMU 10-808, Language Grounding to Vision and Control
- CMU 11-775, Large-Scale Multimedia Analysis
- MIT 6.882, Embodied Intelligence
- Georgia Tech CS 8803, Vision and Language
- Virginia Tech CS 6501-004, Vision & Language
- Grand Challenge and Workshop on Human Multimodal Language
- Machine Learning Career: A Comprehensive Guide
- Multimodal Learning and Applications Workshop
-
Sub Categories (entry counts)
- Multimodal Content Generation: 55
- Human AI Interaction: 50
- Multimodal Representations: 32
- Language and Visual QA: 28
- Language Grounding in Vision: 24
- Language Grounding in Navigation: 22
- Multimodal Machine Translation: 22
- Multimodal Fusion: 21
- Media Description: 21
- Healthcare: 16
- Multi-agent Communication: 15
- Robotics: 14
- Audio and Visual: 14
- Multimodal Reinforcement Learning: 12
- Language and Audio: 12
- Affect Recognition and Multimodal Language: 12
- Multimodal Pretraining: 10
- Generative Learning: 10
- Bias and Fairness: 9
- Multimodal Dialog: 7
- Multimodal Alignment: 7
- Knowledge Graphs and Knowledge Bases: 6
- Commonsense Reasoning: 6
- Multimodal Memory: 6
- Multimodal Translation: 6
- Crossmodal Retrieval: 5
- Analysis of Multimodal Models: 5
- Self-supervised Learning: 5
- Missing or Imperfect Modalities: 5
- Semi-supervised Learning: 4
- Finance: 4
- Multimodal Co-learning: 4
- Few-Shot Learning: 4
- Multimodal Transformers: 4
- Interpretable Learning: 4
- Language Models: 4
- Human in the Loop Learning: 4
- Video Generation from Text: 3
- Autonomous Driving: 3
- Adversarial Attacks: 3
- Visual, IMU and Wireless: 2