Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-vision-language-pretraining-papers
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers
Other Analysis
- Deep Multimodal Neural Architecture Search
- 12-in-1: Multi-Task Vision and Language Representation Learning
- Unifying Vision-and-Language Tasks via Text Generation
- Measuring Social Biases in Grounded Vision and Language Embeddings
- Are we pretraining it right? Digging deeper into visio-linguistic pretraining
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Adaptive Transformers for Learning Multimodal Representations
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- VideoBERT: A Joint Model for Video and Language Representation Learning
- Learning Video Representations Using Contrastive Bidirectional Transformers
- M-BERT: Injecting Multimodal Information in the BERT Structure
- BERT for Large-scale Video Segment Classification with Test-time Augmentation
- Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
- Learning Spatiotemporal Features via Video and Text Pair Discrimination
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
- ActBERT: Learning Global-Local Video-Text Representations
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- Video-Grounded Dialogues with Pretrained Generation Language Models
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
- Multimodal Pretraining for Dense Video Captioning
- Parameter Efficient Multimodal Transformers for Video Representation Learning
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
- Understanding Semantics from Speech Through Pre-training
- SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- Effectiveness of self-supervised pre-training for speech recognition
- Multi-Modality Cross Attention Network for Image and Sentence Matching
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
- History for Visual Dialog: Do we really need it?
- Cross-Modality Relevance for Reasoning on Language and Vision
- Pre-trained Models for Natural Language Processing: A Survey
- A Survey on Contextual Embeddings
- Deep Multimodal Representation Learning: A Survey
- Multimodal Machine Learning: A Survey and Taxonomy
- A Comprehensive Survey of Deep Learning for Image Captioning
- Pre-trained Language Model Papers from THU-NLP
- BERT-related Papers
- Reading List for Topics in Multimodal Machine Learning
Representation Learning
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- VisualBERT: A Simple and Performant Baseline for Vision and Language
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
- Unified Vision-Language Pre-Training for Image Captioning and VQA
- UNITER: Learning Universal Image-text Representations
- Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graph
- DeVLBert: Learning Deconfounded Visio-Linguistic Representations
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
- CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
- Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs
- LAMP: Label Augmented Multimodal Pretraining
- Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
- X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
- VinVL: Revisiting Visual Representations in Vision-Language Models
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
- Learning Transferable Visual Models From Natural Language Supervision (**CLIP**; see the usage sketch at the end of this list)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- Florence: A New Foundation Model for Computer Vision
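As a quick illustration of what these image-text pre-trained models enable, below is a minimal sketch of zero-shot image-text matching with a publicly released CLIP checkpoint via the Hugging Face `transformers` library; the checkpoint name, image URL, and candidate captions are illustrative assumptions, not prescribed by this list.

```python
# Minimal sketch: zero-shot image-text matching with a pretrained CLIP checkpoint.
# Assumes `transformers`, `torch`, `Pillow`, and `requests` are installed;
# the checkpoint name and image URL are illustrative choices.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog", "a diagram of a transformer"]

# The processor tokenizes the texts and preprocesses the image into model inputs.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for text, p in zip(texts, probs.tolist()):
    print(f"{p:.3f}  {text}")
```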
Task-specific
- Fusion of Detected Objects in Text for Visual Question Answering (**B2T2**)
- Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
- VD-BERT: A Unified Vision and Dialog Transformer with BERT (**VD-BERT**)
- Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline (**VisDial-BERT**)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
- ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data
- XGPT: Cross-modal Generative Pre-Training for Image Captioning
- BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations
- Cross-Probe BERT for Efficient and Effective Cross-Modal Search
- STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
- VisualMRC: Machine Reading Comprehension on Document Images
- Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations
Keywords
representation-learning (2), bert (1), iclr2020 (1), pre-training (1), pytorch (1), self-supervised-learning (1), vision-and-language (1), vl-bert (1), computer-vision (1), deep-learning (1), healthcare (1), machine-learning (1), multimodal-learning (1), natural-language-processing (1), reading-list (1), reinforcement-learning (1), robotics (1), speech-processing (1)