awesome-Vision-and-Language-Pre-training
Recent Advances in Vision and Language Pre-training (VLP)
https://github.com/phellonchen/awesome-Vision-and-Language-Pre-training
Representation Learning
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (a usage sketch follows this list)
- Language Is Not All You Need: Aligning Perception with Language Models
- OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
- Learning Transferable Visual Models From Natural Language Supervision, CLIP
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- VisualBERT: A Simple and Performant Baseline for Vision and Language
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
- Unified Vision-Language Pre-Training for Image Captioning and VQA
- UNITER: Learning Universal Image-text Representations
- Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
- DeVLBert: Learning Deconfounded Visio-Linguistic Representations
- X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
- CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
- Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs
- LAMP: Label Augmented Multimodal Pretraining
- Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
- VinVL: Revisiting Visual Representations in Vision-Language Models
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
- How Much Can CLIP Benefit Vision-and-Language Tasks?
- Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
- Vision-Language Pre-Training with Triple Contrastive Learning
- Unpaired Vision-Language Pre-training via Cross-Modal CutMix
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- GIT: A Generative Image-to-text Transformer for Vision and Language
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- Unifying Vision-Language Representation Space with Single-tower Transformer
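The entries above are paper links rather than a single toolkit, but as a rough illustration of how one of the generative models listed here (BLIP-2) is typically queried, the sketch below uses the Hugging Face transformers implementation; the checkpoint name Salesforce/blip2-opt-2.7b and the example COCO image URL are assumptions for illustration, not taken from this list.

```python
# Minimal, illustrative sketch (not from the list): generating a caption with a
# frozen-encoder VLP model such as BLIP-2 via the Hugging Face transformers API.
# The checkpoint name and the example image URL are assumptions.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Image-only input -> unconditional caption generation
inputs = processor(images=image, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Passing a text prompt such as "Question: what is in the picture? Answer:" together with the image turns the same generate call into zero-shot visual question answering.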
Task-specific
Text-Image Retrieval
- Learning Relation Alignment for Calibrated Cross-modal Retrieval
- ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data
- Cross-Probe BERT for Efficient and Effective Cross-Modal Search
- Dynamic Contrastive Distillation for Image-Text Retrieval
- Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval
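The retrieval papers above all score image-text pairs in a shared embedding space. As a hedged illustration of that basic setup (using CLIP from the Representation Learning section rather than any model in this sub-list), the sketch below ranks a few candidate captions against one image with the Hugging Face transformers CLIP classes; the checkpoint openai/clip-vit-base-patch32 and the example image and captions are assumptions.

```python
# Illustrative sketch of embedding-based text-image retrieval scoring with CLIP
# (listed under Representation Learning above), via Hugging Face transformers.
# Checkpoint name and example data are assumptions, not taken from this list.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
captions = ["two cats sleeping on a couch", "a plate of food", "a city street at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-caption similarities; rank captions by it.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for caption, score in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {caption}")
```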
Image Caption
VQA
- Fusion of Detected Objects in Text for Visual Question Answering (B2T2)
- Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
- BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
- STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
Visual Dialog
Vision-and-Language Navigation
Visual Machine Reading Comprehension
Other Tasks
Other Analysis
Other Resources
- Pre-trained Language Model Papers from THU-NLP
- BERT-related Papers
- Reading List for Topics in Multimodal Machine Learning
- A repository of vision and language papers
- Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Adaptive Transformers for Learning Multimodal Representations
- Deep Multimodal Neural Architecture Search
- 12-in-1: Multi-Task Vision and Language Representation Learning
- Measuring Social Biases in Grounded Vision and Language Embeddings
- Are we pretraining it right? Digging deeper into visio-linguistic pretraining
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Video-based VLP
- VideoBERT: A Joint Model for Video and Language Representation Learning
- Learning Video Representations Using Contrastive Bidirectional Transformers
- M-BERT: Injecting Multimodal Information in the BERT Structure
- BERT for Large-scale Video Segment Classification with Test-time Augmentation
- Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
- Learning Spatiotemporal Features via Video and Text Pair Discrimination
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- Video-Grounded Dialogues with Pretrained Generation Language Models
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
- Multimodal Pretraining for Dense Video Captioning
- Parameter Efficient Multimodal Transformers for Video Representation Learning
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Speech-based VLP
- Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
- Understanding Semantics from Speech Through Pre-training
- SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- Effectiveness of self-supervised pre-training for speech recognition
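The speech entries above (e.g., vq-wav2vec and the pre-training effectiveness study) center on self-supervised speech representations that are later fine-tuned for recognition. As a hedged sketch of that downstream use, the snippet below runs CTC-based recognition with wav2vec 2.0 (the successor to vq-wav2vec, not a model from this list) through Hugging Face transformers; the checkpoint facebook/wav2vec2-base-960h and the silent dummy waveform are assumptions for illustration.

```python
# Illustrative sketch: speech recognition on top of a self-supervised speech
# model (wav2vec 2.0) via Hugging Face transformers. Checkpoint name and the
# dummy waveform are assumptions; replace the array with real 16 kHz mono audio.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# One second of silence stands in for a real 16 kHz mono recording.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding of the frame-level predictions into text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```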
Other Transformer-based multimodal networks
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
- History for Visual Dialog: Do we really need it?
- Cross-Modality Relevance for Reasoning on Language and Vision
Surveys
- Pre-trained Models for Natural Language Processing: A Survey
- A Survey on Contextual Embeddings
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
- Deep Multimodal Representation Learning: A Survey
- Multimodal Machine Learning: A Survey and Taxonomy
- A Comprehensive Survey of Deep Learning for Image Captioning
- Unifying Vision-and-Language Tasks via Text Generation