https://github.com/DerekDLP/VQA-papers
A list of recent papers on visual (image) question answering (mainly from arxiv.org)
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/DerekDLP/VQA-papers
- Owner: DerekDLP
- Created: 2019-02-23T03:59:10.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-03-06T08:05:50.000Z (almost 6 years ago)
- Last Synced: 2024-08-03T02:05:55.330Z (5 months ago)
- Homepage:
- Size: 23.4 KB
- Stars: 14
- Watchers: 4
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-Paper-List - Visual Question Answering
README
# Visual (Image) Question Answering - Papers
A reading list of resources dedicated to visual (image) question answering (mainly from arxiv.org)
# Bookmarks
* [2015 Papers](#2015-papers)
* [2016 Papers](#2016-papers)
* [2017 Papers](#2017-papers)
* [2018 Papers](#2018-papers)
* [2019 Papers](#2019-papers)
* TODO List
* [Other Papers]()
* [Related-Field Papers]()
* [Supplementary Notes]()

## 2015 Papers
| ID | Title | Original Date | Latest Date | Notes | Published (Incomplete Statistics) |
| :-: | :-: | :-: | :-: | - | :-: |
| 1 | [VQA: Visual Question Answering](https://arxiv.org/pdf/1505.00468) | 2015.05.03 | 2016.10.26 | [[Data](https://visualqa.org/)] [[code](https://github.com/JamesChuanggg/VQA-tensorflow)] | ICCV 2015 |
| 2 | [Ask Your Neurons: A Neural-based Approach to Answering Questions about Images](https://arxiv.org/pdf/1505.01121) | 2015.05.05 | 2015.10.01 | | ICCV 2015 |
| 3 | [Exploring Models and Data for Image Question Answering](https://arxiv.org/pdf/1505.02074) | 2015.05.08 | 2015.11.29 | | NIPS 2015 |
| 4 | [Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering](https://arxiv.org/pdf/1505.05612) | 2015.05.21 | 2015.11.02 | [[Data](http://research.baidu.com/Downloads)] | NIPS 2015 |
| 5 | [Visual Madlibs: Fill in the blank Image Generation and Question Answering](https://arxiv.org/pdf/1506.00278) | 2015.05.31 | | | |
| 6 | [What value do explicit high level concepts have in vision to language problems?](https://arxiv.org/pdf/1506.01144) | 2015.06.03 | 2016.04.29 | | CVPR 2016 |
| 7 | [Semantic Amodal Segmentation](https://arxiv.org/pdf/1509.01329) | 2015.09.03 | 2016.12.14 | | |
| 8 | [VISALOGY: Answering Visual Analogy Questions](https://arxiv.org/pdf/1510.08973) | 2015.10.30 | | | NIPS 2015 |
| 9 | [Stacked Attention Networks for Image Question Answering](https://arxiv.org/pdf/1511.02274) | 2015.11.06 | 2016.01.26 | [[code1](https://github.com/abhshkdz/neural-vqa-attention)] [[code2](https://github.com/JamesChuanggg/san-torch)] | CVPR 2016 |
| 10 | [Explicit Knowledge-based Reasoning for Visual Question Answering](https://arxiv.org/pdf/1511.02570) | 2015.11.09 | 2015.11.11 | | |
| 11 | [Neural Module Networks](https://arxiv.org/pdf/1511.02799) | 2015.11.09 | 2017.07.24 | | |
| 12 | [Visual7W: Grounded Question Answering in Images](https://arxiv.org/pdf/1511.03416) | 2015.11.11 | 2016.04.09 | | CVPR 2016 |
| 13 | [Yin and Yang: Balancing and Answering Binary Visual Questions](https://arxiv.org/pdf/1511.05099) | 2015.11.16 | 2016.04.19 | | |
| 14 | [Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering](https://arxiv.org/pdf/1511.05234) | 2015.11.16 | 2016.03.18 | | |
| 15 | [Compositional Memory for Visual Question Answering](https://arxiv.org/pdf/1511.05676) | 2015.11.18 | | | |
| 16 | [ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering](https://arxiv.org/pdf/1511.05960) | 2015.11.18 | 2016.04.03 | | |
| 17 | [Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources](https://arxiv.org/pdf/1511.06973) | 2015.11.22 | 2016.04.14 | | CVPR |
| 18 | [Where To Look: Focus Regions for Visual Question Answering](https://arxiv.org/pdf/1511.07394) | 2015.11.23 | 2016.01.10 | | Submitted to CVPR 2016 |
| 19 | [Simple Baseline for Visual Question Answering](https://arxiv.org/pdf/1512.02167) | 2015.12.07 | 2015.12.15 | | |

## 2016 Papers
| ID | Title | Original Date | Latest Date | Notes | Published (Incomplete Statistics) |
| :-: | :-: | :-: | :-: | - | :-: |
| 1 | [Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations](https://arxiv.org/pdf/1602.07332) | 2016.02.23 | | | |
| 2 | [Dynamic Memory Networks for Visual and Textual Question Answering](https://arxiv.org/pdf/1603.01417) | 2016.03.04 | | | |
| 3 | [Image Captioning and Visual Question Answering Based on Attributes and External Knowledge](https://arxiv.org/pdf/1603.02814) | 2016.03.09 | 2016.12.16 | [Overlap(2015`14)] | |
| 4 | [Generating Natural Questions About an Image](https://arxiv.org/pdf/1603.06059) | 2016.03.19 | 2016.06.08 | | Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics |
| 5 | [A Focused Dynamic Attention Model for Visual Question Answering](https://arxiv.org/pdf/1604.01485) | 2016.04.06 | | | Submitted to ECCV 2016 |
| 6 | [Counting Everyday Objects in Everyday Scenes](https://arxiv.org/pdf/1604.03505) | 2016.04.12 | 2017.05.08 | | |
| 7 | [Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering](https://arxiv.org/pdf/1604.04808) | 2016.04.16 | 2016.07.28 | | |
| 8 | [Leveraging Visual Question Answering for Image-Caption Ranking](https://arxiv.org/pdf/1605.01379) | 2016.05.04 | 2016.08.31 | | |
| 9 | [Ask Your Neurons: A Deep Learning Approach to Visual Question Answering](https://arxiv.org/pdf/1605.02697) | 2016.05.09 | 2016.11.24 | | |
| 10 | [Hierarchical Question-Image Co-Attention for Visual Question Answering](https://arxiv.org/pdf/1606.00061) | 2016.05.31 | 2017.01.19 | [[code](https://github.com/jiasenlu/HieCoAttenVQA)] | NIPS 2016 |
| 11 | [Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding](https://arxiv.org/pdf/1606.01847) | 2016.06.06 | 2016.09.23 | [[code](https://github.com/akirafukui/vqa-mcb)] | EMNLP 2016 |
| 12 | [Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?](https://arxiv.org/pdf/1606.03556) | 2016.06.11 | 2016.06.17 | | EMNLP 2016 |
| 13 | [Training Recurrent Answering Units with Joint Loss Minimization for VQA](https://arxiv.org/pdf/1606.03647) | 2016.06.11 | 2016.09.29 | | |
| 14 | [FVQA: Fact-based Visual Question Answering](https://arxiv.org/pdf/1606.05433) | 2016.06.17 | 2016.08.08 | | |
| 15 | [Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?](https://arxiv.org/pdf/1606.05589) | 2016.06.17 | | | 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY. |
| 16 | [DualNet: Domain-Invariant Network for Visual Question Answering](https://arxiv.org/pdf/1606.06108) | 2016.06.20 | 2017.05.04 | | ICME 2017 |
| 17 | [Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions](https://arxiv.org/pdf/1606.06622) | 2016.06.21 | 2016.09.26 | | EMNLP 2016 |
| 18 | [Analyzing the Behavior of Visual Question Answering Models](https://arxiv.org/pdf/1606.07356) | 2016.06.23 | 2016.09.27 | | EMNLP 2016 |
| 19 | [Revisiting Visual Question Answering Baselines](https://arxiv.org/pdf/1606.08390) | 2016.06.27 | 2016.11.22 | | European Conference on Computer Vision |
| 20 | [Visual Question Answering: A Survey of Methods and Datasets](https://arxiv.org/pdf/1607.05910) | 2016.07.20 | | [Survey] | |
| 21 | [Solving Visual Madlibs with Multiple Cues](https://arxiv.org/pdf/1608.03410) | 2016.08.11 | | | BMVC 2016 |
| 22 | [Visual Question: Predicting If a Crowd Will Agree on the Answer](https://arxiv.org/pdf/1608.08188) | 2016.08.29 | | | |
| 23 | [Measuring Machine Intelligence Through Visual Question Answering](https://arxiv.org/pdf/1608.08716) | 2016.08.30 | | | AI Magazine, 2016 |
| 24 | [Towards Transparent AI Systems: Interpreting Visual Question Answering Models](https://arxiv.org/pdf/1608.08974) | 2016.08.31 | 2016.09.09 | | |
| 25 | [Graph-Structured Representations for Visual Question Answering](https://arxiv.org/pdf/1609.05600) | 2016.09.19 | 2017.03.30 | | |
| 26 | [The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)](https://arxiv.org/pdf/1609.06657) | 2016.09.21 | | | |
| 27 | [Tutorial on Answering Questions about Images with Deep Learning](https://arxiv.org/pdf/1610.01076) | 2016.10.04 | | [Tutorial] | '2nd Summer School on Integrating Vision and Language: Deep Learning' in Malta, 2016 |
| 28 | [Visual Question Answering: Datasets, Algorithms, and Future Challenges](https://arxiv.org/pdf/1610.01465) | 2016.10.05 | 2017.06.14 | | |
| 29 | [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391) | 2016.10.07 | 2017.03.21 | [[code](https://github.com/ramprs/grad-ca)] [[demo1](http://gradcam.cloudcv.org)] [[demo2](https://youtu.be/COjUB9Izk6E)] | |
| 30 | [Open-Ended Visual Question-Answering](https://arxiv.org/pdf/1610.02692) | 2016.06. | 2016.10.09 | [[web](http://imatge-upc.github.io/vqa-2016-cvprw/)] [[code](https://github.com/imatge-upc/vqa-2016-cvprw)] | Bachelor thesis report graded with A with honours at ETSETB Telecom BCN school, Universitat Politècnica de Catalunya (UPC). June 2016. |
| 31 | [Hadamard Product for Low-rank Bilinear Pooling](https://arxiv.org/pdf/1610.04325) | 2016.10.14 | 2017.03.26 | | ICLR 2017 |
| 32 | [Proposing Plausible Answers for Open-ended Visual Question Answering](https://arxiv.org/pdf/1610.06620) | 2016.10.20 | 2016.10.23 | | |
| 33 | [Combining Multiple Cues for Visual Madlibs Question Answering](https://arxiv.org/pdf/1611.00393) | 2016.11.01 | 2018.02.07 | | submitted to IJCV |
| 34 | [Dual Attention Networks for Multimodal Reasoning and Matching](https://arxiv.org/pdf/1611.00471) | 2016.11.02 | 2017.03.21 | | |
| 35 | [Zero-Shot Visual Question Answering](https://arxiv.org/pdf/1611.05546) | 2016.11.16 | 2016.11.20 | | |
| 36 | [Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic](https://arxiv.org/pdf/1611.05896) | 2016.11.17 | | | |
| 37 | [Grad-CAM: Why did you say that?](https://arxiv.org/pdf/1611.07450) | 2016.11.22 | 2017.01.25 | | NIPS 2016 |
| 38 | [Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering](https://arxiv.org/pdf/1612.00837) | 2016.12.02 | 2017.05.15 | | |
| 39 | [Contextual Visual Similarity](https://arxiv.org/pdf/1612.02534) | 2016.12.08 | | | Submitted to CVPR 2017 |
| 40 | [VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering](https://arxiv.org/pdf/1612.03628) | 2016.12.12 | | | submitted to IbPRIA 2017 |
| 41 | [Attentive Explanations: Justifying Decisions and Pointing to the Evidence](https://arxiv.org/pdf/1612.04757) | 2016.12.14 | 2017.07.25 | | |
| 42 | [The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions](https://arxiv.org/pdf/1612.05386) | 2016.12.16 | | | |
| 43 | [Automatic Generation of Grounded Visual Questions](https://arxiv.org/pdf/1612.06530) | 2016.12.20 | 2017.05.29 | | IJCAI 2017 |
| 44 | [CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning](https://arxiv.org/pdf/1612.06890) | 2016.12.20 | | | |

## 2017 Papers
| ID | Title | Original Date | Latest Date | Notes | Published (Incomplete Statistics) |
| :-: | :-: | :-: | :-: | - | :-: |
| 1 | [Task-driven Visual Saliency and Attention-based Visual Question Answering](https://arxiv.org/pdf/1702.06700) | 2017.02.22 | | | |
| 2 | [Tree Memory Networks for Modelling Long-term Temporal Dependencies](https://arxiv.org/pdf/1703.04706) | 2017.03.12 | 2018.05.20 | | Neurocomputing, Volume 304, 23 August 2018, Pages 64-81 |
| 3 | [VQABQ: Visual Question Answering by Basic Questions](https://arxiv.org/pdf/1703.06492) | 2017.03.19 | 2017.08.28 | | CVPR 2017 VQA Challenge Workshop |
| 4 | [Recurrent and Contextual Models for Visual Question Answering](https://arxiv.org/pdf/1703.08120) | 2017.03.23 | | | |
| 5 | [An Analysis of Visual Question Answering Algorithms](https://arxiv.org/pdf/1703.09684) | 2017.03.28 | 2017.09.13 | [[data](http://kushalkafle.com/projects/tdiuc)] | ICCV 2017 |
| 6 | [Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks](https://arxiv.org/pdf/1704.00260) | 2017.04.02 | 2017.10.16 | | ICCV 2017 |
| 7 | [It Takes Two to Tango: Towards Theory of AI's Mind](https://arxiv.org/pdf/1704.00717) | 2017.04.03 | 2017.10.02 | | |
| 8 | [An Empirical Evaluation of Visual Question Answering for Novel Objects](https://arxiv.org/pdf/1704.02516) | 2017.04.08 | | | CVPR 2017 |
| 9 | [Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering](https://arxiv.org/pdf/1704.03162) | 2017.04.11 | 2017.04.12 | [[code](https://github.com/Cyanogenoid/pytorch-vqa)] | |
| 10 | [What's in a Question: Using Visual Questions as a Form of Supervision](https://arxiv.org/pdf/1704.03895) | 2017.04.12 | | | CVPR 2017 |
| 11 | [TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering](https://arxiv.org/pdf/1704.04497) | 2017.04.14 | 2017.12.02 | | CVPR 2017 |
| 12 | [Learning to Reason: End-to-End Module Networks for Visual Question Answering](https://arxiv.org/pdf/1704.05526) | 2017.04.18 | 2017.09.11 | | |
| 13 | [Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets](https://arxiv.org/pdf/1704.07121) | 2017.04.24 | 2018.06.10 | | NAACL-HLT 2018 |
| 14 | [C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset](https://arxiv.org/pdf/1704.08243) | 2017.04.26 | | | |
| 15 | [Speech-Based Visual Question Answering](https://arxiv.org/pdf/1705.00464) | 2017.05.01 | 2017.09.15 | | |
| 16 | [The Promise of Premise: Harnessing Question Premises in Visual Question Answering](https://arxiv.org/pdf/1705.00601) | 2017.05.01 | 2017.08.17 | | EMNLP 2017 |
| 17 | [Survey of Visual Question Answering: Datasets and Techniques](https://arxiv.org/pdf/1705.03865) | 2017.05.10 | 2017.05.11 | [Survey] | |
| 18 | [ParlAI: A Dialog Research Software Platform](https://arxiv.org/pdf/1705.06476) | 2017.05.18 | 2018.03.08 | | |
| 19 | [MUTAN: Multimodal Tucker Fusion for Visual Question Answering](https://arxiv.org/pdf/1705.06676) | 2017.05.18 | | [[code](https://github.com/Cadene/vqa.pytorch)] | |
| 20 | [Learning Convolutional Text Representations for Visual Question Answering](https://arxiv.org/pdf/1705.06824) | 2017.05.18 | 2018.04.18 | [[code](https://github.com/divelab/svae)] | SDM 2018; Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 594-602 |
| 21 | [Deep learning evaluation using deep linguistic processing](https://arxiv.org/pdf/1706.01322) | 2017.06.05 | 2018.05.12 | | |
| 22 | [A simple neural network module for relational reasoning](https://arxiv.org/pdf/1706.01427) | 2017.06.05 | | | |
| 23 | [Compact Tensor Pooling for Visual Question Answering](https://arxiv.org/pdf/1706.06706) | 2017.06.20 | | | |
| 24 | [Sampling Matters in Deep Embedding Learning](https://arxiv.org/pdf/1706.07567) | 2017.06.23 | 2018.01.16 | | ICCV 2017 |
| 25 | [Modulating early visual processing by language](https://arxiv.org/pdf/1707.00683) | 2017.07.02 | 2017.12.18 | | NIPS 2017 |
| 26 | [Effective Approaches to Batch Parallelization for Dynamic Neural Network Architectures](https://arxiv.org/pdf/1707.02402) | 2017.07.08 | | [[code](https://github.com/jsuarez5341/Efficient-Dynamic-Batching)] | |
| 27 | [Visual Question Answering with Memory-Augmented Networks](https://arxiv.org/pdf/1707.04968) | 2017.07.16 | | | CVPR 2018 |
| 28 | [Improved Bilinear Pooling with CNNs](https://arxiv.org/pdf/1707.06772) | 2017.07.21 | | | |
| 29 | [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/pdf/1707.07998) | 2017.07.25 | 2018.03.14 | [[code](https://github.com/hengyuan-hu/bottom-up-attention-vqa)] | CVPR 2018; winner of the 2017 VQA Challenge |
| 30 | [A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models](https://arxiv.org/pdf/1708.00584) | 2017.08.01 | | | CVPR 2017 |
| 31 | [MemexQA: Visual Memex Question Answering](https://arxiv.org/pdf/1708.01336) | 2017.08.03 | | [[Web](https://memexqa.cs.cmu.edu/)] | |
| 32 | [Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering](https://arxiv.org/pdf/1708.01471) | 2017.08.04 | | | ICCV 2017 |
| 33 | [Structured Attentions for Visual Question Answering](https://arxiv.org/pdf/1708.02071) | 2017.08.07 | | | ICCV 2017 |
| 34 | [Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge](https://arxiv.org/pdf/1708.02711) | 2017.08.09 | | | Winner of the 2017 Visual Question Answering (VQA) Challenge at CVPR 2017 |
| 35 | [Learning to Disambiguate by Asking Discriminative Questions](https://arxiv.org/pdf/1708.02760) | 2017.08.09 | | | ICCV 2017 |
| 36 | [Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering](https://arxiv.org/pdf/1708.03619) | 2017.08.10 | | [Overlap(2017`32)] | |
| 37 | [VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation](https://arxiv.org/pdf/1708.04686) | 2017.08.17 | | | ICCV 2017 |
| 38 | [Robustness Analysis of Visual QA Models by Basic Questions](https://arxiv.org/pdf/1709.04625) | 2017.09.14 | 2018.05.26 | | CVPR 2018 |
| 39 | [Exploring Human-like Attention Supervision in Visual Question Answering](https://arxiv.org/pdf/1709.06308) | 2017.09.19 | | | |
| 40 | [Visual Question Generation as Dual Task of Visual Question Answering](https://arxiv.org/pdf/1709.07192) | 2017.09.21 | | | |
| 41 | [Survey of Recent Advances in Visual Question Answering](https://arxiv.org/pdf/1709.08203) | 2017.09.24 | | [Survey] | |
| 42 | [Fooling Vision and Language Models Despite Localization and Attention Mechanism](https://arxiv.org/pdf/1709.08693) | 2017.09.25 | 2018.04.05 | | CVPR 2018 |
| 43 | [iVQA: Inverse Visual Question Answering](https://arxiv.org/pdf/1710.03370) | 2017.10.09 | 2018.03.16 | | CVPR 2018 |
| 44 | [Active Learning for Visual Question Answering: An Empirical Study](https://arxiv.org/pdf/1711.01732) | 2017.11.06 | | | |
| 45 | [High-Order Attention Models for Visual Question Answering](https://arxiv.org/pdf/1711.04323) | 2017.11.12 | | | NIPS 2017 |
| 46 | [A Novel Framework for Robustness Analysis of Visual QA Models](https://arxiv.org/pdf/1711.06232) | 2017.11.16 | 2018.12.24 | | AAAI 2019 |
| 47 | [Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering](https://arxiv.org/pdf/1711.06794) | 2017.11.17 | 2017.12.12 | | AAAI 2018 |
| 48 | [Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)](https://arxiv.org/pdf/1711.07373) | 2017.11.17 | | [Overlap(2016`41)] | |
| 49 | [Visual Question Answering as a Meta Learning Task](https://arxiv.org/pdf/1711.08105) | 2017.11.21 | | | |
| 50 | [Locally Smoothed Neural Networks](https://arxiv.org/pdf/1711.08132) | 2017.11.22 | | | ACML 2017 |
| 51 | [Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end](https://arxiv.org/pdf/1711.10185) | 2017.11.28 | | | |
| 52 | [Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering](https://arxiv.org/pdf/1712.00377) | 2017.12.01 | 2018.06.03 | | CVPR 2018 |
| 53 | [Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks](https://arxiv.org/pdf/1712.00733) | 2017.12.03 | | | |
| 54 | [Learning by Asking Questions](https://arxiv.org/pdf/1712.01238) | 2017.12.04 | | | |
| 55 | [IQA: Visual Question Answering in Interactive Environments](https://arxiv.org/pdf/1712.03316) | 2017.12.08 | 2018.09.06 | | CVPR 2018 |
| 56 | [Visual Explanations from Hadamard Product in Multimodal Deep Networks](https://arxiv.org/pdf/1712.06228) | 2017.12.17 | | | NIPS 2017 |
| 57 | [Interpretable Counting for Visual Question Answering](https://arxiv.org/pdf/1712.08697) | 2017.12.22 | 2018.03.01 | | ICLR 2018 |

## 2018 Papers
| ID | Title | Original Date | Latest Date | Notes | Published (Incomplete Statistics) |
| :-: | :-: | :-: | :-: | - | :-: |
| 1 | [Benchmark Visual Question Answer Models by using Focus Map](https://arxiv.org/pdf/1801.05302) | 2018.01.13 | | [Overlap([2017](https://arxiv.org/pdf/1705.03633))] | course CS348 |
| 2 | [Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering](https://arxiv.org/pdf/1801.07853) | 2018.01.23 | | | |
| 3 | [DVQA: Understanding Data Visualizations via Question Answering](https://arxiv.org/pdf/1801.08163) | 2018.01.24 | 2018.03.29 | | CVPR 2018 |
| 4 | [Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions](https://arxiv.org/pdf/1801.09041) | 2018.01.27 | | | |
| 5 | [Object-based reasoning in VQA](https://arxiv.org/pdf/1801.09718) | 2018.01.29 | | | WACV 2018 |
| 6 | [Dual Recurrent Attention Units for Visual Question Answering](https://arxiv.org/pdf/1802.00209) | 2018.02.01 | 2018.11.07 | | |
| 7 | [Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog](https://arxiv.org/pdf/1802.03881) | 2018.02.11 | 2018.11.28 | | NIPS 2018 |
| 8 | [Learning to Count Objects in Natural Images for Visual Question Answering](https://arxiv.org/pdf/1802.05766) | 2018.02.15 | | [[code](https://github.com/Cyanogenoid/vqa-counting)] | ICLR 2018 |
| 9 | [Multimodal Explanations: Justifying Decisions and Pointing to the Evidence](https://arxiv.org/pdf/1802.08129) | 2018.02.15 | | [Overlap(2016`41)] | |
| 10 | [VizWiz Grand Challenge: Answering Visual Questions from Blind People](https://arxiv.org/pdf/1802.08218) | 2018.02.22 | 2018.05.09 | | |
| 11 | [Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool](https://arxiv.org/pdf/1803.06936) | 2018.03.16 | | [Overlap(2017`43)] | |
| 12 | [VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions](https://arxiv.org/pdf/1803.07464) | 2018.03.20 | 2018.08.25 | | ECCV 2018 |
| 13 | [Attention on Attention: Architectures for Visual Question Answering (VQA)](https://arxiv.org/pdf/1803.07724) | 2018.03.20 | | [[code](https://github.com/SinghJasdeep/Attention-on-Attention-for-VQA)] | |
| 14 | [Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering](https://arxiv.org/pdf/1803.08896) | 2018.03.23 | | | AAAI 2018 |
| 15 | [Generalized Hadamard-Product Fusion Operators for Visual Question Answering](https://arxiv.org/pdf/1803.09374) | 2018.03.25 | 2018.04.06 | | CRV, 2018, 15th Canadian Conference on Computer and Robot Vision |
| 16 | [DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer](https://arxiv.org/pdf/1803.11361) | 2018.03.30 | | | |
| 17 | [Differential Attention for Visual Question Answering](https://arxiv.org/pdf/1804.00298) | 2018.03.30 | | [[Web](https://badripatro.github.io/DVQA/)] | CVPR 2018 |
| 18 | [Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering](https://arxiv.org/pdf/1804.00775) | 2018.04.02 | 2018.12.01 | | CVPR 2018 |
| 19 | [Question Type Guided Attention in Visual Question Answering](https://arxiv.org/pdf/1804.02088) | 2018.04.05 | 2018.07.18 | | |
| 20 | [Reciprocal Attention Fusion for Visual Question Answering](https://arxiv.org/pdf/1805.04247) | 2018.05.11 | 2018.07.22 | | the British Machine Vision Conference (BMVC), September 2018 |
| 21 | [Did the Model Understand the Question?](https://arxiv.org/pdf/1805.05492) | 2018.05.14 | | | ACL 2018 |
| 22 | [Bilinear Attention Networks](https://arxiv.org/pdf/1805.07932) | 2018.05.21 | 2018.10.19 | | NIPS 2018 |
| 23 | [Reproducibility Report for ](https://arxiv.org/pdf/1805.08174) | 2018.05.21 | | | Reproducibility in ML Workshop, ICML 2018 |
| 24 | [Joint Image Captioning and Question Answering](https://arxiv.org/pdf/1805.08389) | 2018.05.22 | | | |
| 25 | [R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering](https://arxiv.org/pdf/1805.09701) | 2018.05.24 | 2018.07.19 | [[data](https://github.com/lupantech/rvqa)] | SIGKDD 2018 |
| 26 | [On the Flip Side: Identifying Counterexamples in Visual Question Answering](https://arxiv.org/pdf/1806.00857) | 2018.06.03 | 2018.07.24 | [[framework](https://github.com/Cadene/vqa.pytorch)] | KDD 2018 |
| 27 | [CS-VQA: Visual Question Answering with Compressively Sensed Images](https://arxiv.org/pdf/1806.03379) | 2018.06.08 | | | ICIP 2018 |
| 28 | [Learning Answer Embeddings for Visual Question Answering](https://arxiv.org/pdf/1806.03724) | 2018.06.10 | | | CVPR 2018 |
| 29 | [Cross-Dataset Adaptation for Visual Question Answering](https://arxiv.org/pdf/1806.03726) | 2018.06.10 | | | CVPR 2018 |
| 30 | [Learning Visual Knowledge Memory Networks for Visual Question Answering](https://arxiv.org/pdf/1806.04860) | 2018.06.13 | | | CVPR 2018 |
| 31 | [Learning Conditioned Graph Structures for Interpretable Visual Question Answering](https://arxiv.org/pdf/1806.07243) | 2018.06.19 | 2018.11.01 | [[code](https://github.com/aimbrain/vqa-project)] | NIPS 2018 |
| 32 | [Question Relevance in Visual Question Answering](https://arxiv.org/pdf/1807.08435) | 2018.07.23 | | [[code](https://github.com/nitish-kulkarni/Question-Relevance-in-VQA)] | |
| 33 | [Pythia v0.1: the Winning Entry to the VQA Challenge 2018](https://arxiv.org/pdf/1807.09956) | 2018.07.26 | 2018.07.27 | [[code](https://github.com/facebookresearch/pythia)] | winner of 2018 VQA challenge |
| 34 | [Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining](https://arxiv.org/pdf/1808.00265) | 2018.08.01 | | | |
| 35 | [Learning Visual Question Answering by Bootstrapping Hard Attention](https://arxiv.org/pdf/1808.00300) | 2018.08.01 | | | ECCV 2018 |
| 36 | [Question-Guided Hybrid Convolution for Visual Question Answering](https://arxiv.org/pdf/1808.02632) | 2018.08.08 | | | ECCV 2018 |
| 37 | [Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering](https://arxiv.org/pdf/1809.01124) | 2018.09.04 | | | ECCV 2018 |
| 38 | [Interpretable Visual Question Answering by Reasoning on Dependency Trees](https://arxiv.org/pdf/1809.01810) | 2018.09.08 | | | |
| 39 | [Faithful Multimodal Explanation for Visual Question Answering](https://arxiv.org/pdf/1809.02805) | 2018.09.08 | | | AAAI 2019 |
| 40 | [The Wisdom of MaSSeS: Majority, Subjectivity, and Semantic Similarity in the Evaluation of VQA](https://arxiv.org/pdf/1809.04344) | 2018.09.12 | | | |
| 41 | [The Visual QA Devil in the Details: The Impact of Early Fusion and Batch Norm on CLEVR](https://arxiv.org/pdf/1809.04482) | 2018.09.11 | | | ECCV 2018 |
| 42 | [Textually Enriched Neural Module Networks for Visual Question Answering](https://arxiv.org/pdf/1809.08697) | 2018.09.23 | | [Overlap([2018, CVPR 2018](https://arxiv.org/pdf/1804.00105))] | IEEE Transactions on Pattern Analysis and Machine Intelligence |
| 43 | [Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding](https://arxiv.org/pdf/1810.02338) | 2018.10.04 | 2019.01.14 | [[Web](http://nsvqa.csail.mit.edu/)] [[code](https://github.com/kexinyi/ns-vqa)] | NIPS 2018 |
| 44 | [Transfer Learning via Unsupervised Task Discovery for Visual Question Answering](https://arxiv.org/pdf/1810.02358) | 2018.10.03 | | | |
| 45 | [Overcoming Language Priors in Visual Question Answering with Adversarial Regularization](https://arxiv.org/pdf/1810.03649) | 2018.10.08 | 2018.11.08 | | NIPS 2018 |
| 46 | [Knowing Where to Look? Analysis on Attention of Visual Question Answering System](https://arxiv.org/pdf/1810.03821) | 2018.10.09 | | | ECCV SiVL Workshop paper |
| 47 | [Understand, Compose and Respond - Answering Visual Questions by a Composition of Abstract Procedures](https://arxiv.org/pdf/1810.10656) | 2018.10.24 | | | |
| 48 | [Do Explanations make VQA Models more Predictable to a Human?](https://arxiv.org/pdf/1810.12366) | 2018.10.29 | | | EMNLP 2018 |
| 49 | [TallyQA: Answering Complex Counting Questions](https://arxiv.org/pdf/1810.12440) | 2018.10.31 | | [[data](http://www.manojacharya.com/)] | AAAI 2019 |
| 50 | [Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering](https://arxiv.org/pdf/1811.00538) | 2018.11.01 | | | NIPS 2018 |
| 51 | [Zero-Shot Transfer VQA Dataset](https://arxiv.org/pdf/1811.00692) | 2018.11.01 | | | |
| 52 | [Explicit Bias Discovery in Visual Question Answering Models](https://arxiv.org/pdf/1811.07789) | 2018.11.19 | | | |
| 53 | [VQA with no questions-answers training](https://arxiv.org/pdf/1811.08481) | 2018.11.20 | | | |
| 54 | [Visual Entailment Task for Visually-Grounded Language Learning](https://arxiv.org/pdf/1811.10582) | 2018.11.20 | | | NeurIPS 2018 |
| 55 | [Visual Question Answering as Reading Comprehension](https://arxiv.org/pdf/1811.11903) | 2018.11.28 | | | |
| 56 | [From Known to the Unknown: Transferring Knowledge to Answer Questions about Novel Visual and Semantic Concepts](https://arxiv.org/pdf/1811.12772) | 2018.11.29 | | | |
| 57 | [Systematic Generalization: What Is Required and Can It Be Learned?](https://arxiv.org/pdf/1811.12889) | 2018.11.30 | | | Work in progress |
| 58 | [Learning Representations of Sets through Optimized Permutations](https://arxiv.org/pdf/1812.03928) | 2018.12.10 | 2019.01.14 | | ICLR 2019 |
| 59 | [Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering](https://arxiv.org/pdf/1812.05252) | 2018.12.12 | | | report |
| 60 | [Multi-modal Learning with Prior Visual Relation Reasoning](https://arxiv.org/pdf/1812.09681) | 2018.12.23 | | | |
| 61 | [The meaning of "most" for visual question answering models](https://arxiv.org/abs/1812.11737) | 2018.12.31 | | | |

## 2019 Papers
| ID | Title | Original Date | Latest Date | Notes | Published (Incomplete Statistics) |
| :-: | :-: | :-: | :-: | - | :-: |
| 1 | [Visual Entailment: A Novel Task for Fine-Grained Image Understanding](https://arxiv.org/abs/1901.06706) | 2019.01.20 | | | |
| 2 | [BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection](https://arxiv.org/abs/1902.00038) | 2019.01.31 | | | |
| 3 | [Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded](https://arxiv.org/abs/1902.03751) | 2019.02.11 | | | Technical Report |
| 4 | [Cycle-Consistency for Robust Visual Question Answering](https://arxiv.org/abs/1902.05660) | 2019.02.14 | | | |
| 5 | [Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention](https://arxiv.org/abs/1902.05715) | 2019.02.15 | | | |
| 6 | [Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering](https://arxiv.org/pdf/1902.07864) | 2019.02.20 | | | |
| 7 | [MUREL: Multimodal Relational Reasoning for Visual Question Answering](https://arxiv.org/pdf/1902.09487) | 2019.02.25 | | | CVPR 2019 |
| 8 | [GQA: a new dataset for compositional question answering over real-world images](https://arxiv.org/pdf/1902.09506) | 2019.02.25 | | | |
| 9 | [Answer Them All! Toward Universal Visual Question Answering Models](https://arxiv.org/pdf/1903.00366) | 2019.03.01 | | | |