Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/forence/Awesome-Visual-Captioning
This repository focuses on Image Captioning & Video Captioning & Seq-to-Seq Learning & NLP.
List: Awesome-Visual-Captioning
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/forence/Awesome-Visual-Captioning
- Owner: forence
- Created: 2019-01-14T09:12:59.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-11-14T03:24:29.000Z (about 2 years ago)
- Last Synced: 2024-05-23T06:01:17.854Z (7 months ago)
- Homepage:
- Size: 1.19 MB
- Stars: 414
- Watchers: 20
- Forks: 51
- Open Issues: 1
Metadata Files:
- Readme: README.md
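For reference, here is a minimal sketch of how the JSON representation above might be fetched programmatically. The endpoint URL and field names below are assumptions for illustration only (consult the ecosyste.ms API documentation for the real routes); only the `requests` usage itself is standard:

```python
# Hypothetical sketch: fetch this list's JSON representation from ecosyste.ms.
# The URL below is an ASSUMED shape, not a documented endpoint.
import requests

url = "https://awesome.ecosyste.ms/api/v1/lists/awesome-visual-captioning"  # assumption

resp = requests.get(url, timeout=10)
resp.raise_for_status()   # fail loudly on HTTP errors
data = resp.json()        # parse the JSON body

# Field names are assumptions mirroring the metadata shown above.
print(data.get("name"), data.get("stars"), data.get("last_synced_at"))
```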
Awesome Lists containing this project
- awesome-vision-language-pretraining - image captioning
- Awesome-Paper-List - Image Captioning
- ultimate-awesome - Awesome-Visual-Captioning - This repository focus on Image Captioning & Video Captioning & Seq-to-Seq Learning & NLP. (Other Lists / PowerShell Lists)
README
# Awesome-Visual-Captioning [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
## Table of Contents
- [ECCV-2022](#ECCV-2022)
- [CVPR-2022](#CVPR-2022)
- [AAAI-2022](#AAAI-2022)
- [IJCAI-2022](#IJCAI-2022)
- [NeurIPS-2021](#NeurIPS-2021)
- [ACMMM-2021](#ACMMM-2021)
- [ICCV-2021](#ICCV-2021)
- [ACL-2021](#ACL-2021)
- [CVPR-2021](#CVPR-2021)
- [AAAI-2021](#AAAI-2021)
- [ACMMM-2020](#ACMMM-2020)
- [NeurIPS-2020](#NeurIPS-2020)
- [ECCV-2020](#ECCV-2020)
- [CVPR-2020](#CVPR-2020)
- [ACL-2020](#ACL-2020)
- [AAAI-2020](#AAAI-2020)
- [ACL-2019](#ACL-2019)
- [NeurIPS-2019](#NeurIPS-2019)
- [ACMMM-2019](#ACMMM-2019)
- [ICCV-2019](#ICCV-2019)
- [CVPR-2019](#CVPR-2019)
- [AAAI-2019](#AAAI-2019)

## Paper Roadmap
### ECCV-2022
- ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-Verified Image-Caption Associations for MS-COCO
- Object-Centric Unsupervised Image Captioning
- D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
- StyleBabel: Artistic Style Tagging and Captioning
- MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
- GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
- Explicit Image Caption Editing
- GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
- Unifying Event Detection and Captioning as Sequence Generation via Pre-training

### CVPR-2022
**Image Captioning**
- DeeCap: Dynamic Early Exiting for Efficient Image Captioning
- Injecting Visual Concepts into End-to-End Image Captioning
- DIFNet: Boosting Visual Information Flow for Image Captioning
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
- Quantifying Societal Bias Amplification in Image Captioning
- Show, Deconfound and Tell: Image Captioning with Causal Inference
- Scaling Up Vision-Language Pretraining for Image Captioning
- VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
- Comprehending and Ordering Semantics for Image Captioning
- Alleviating Emotional bias in Affective Image Captioning by Contrastive Data Collection
- NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge
- NICGSlowDown: Evaluating the Efficiency Robustness of Neural Image Caption Generation Models

**Video Captioning**
- End-to-end Generative Pretraining for Multimodal Video Captioning
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
- Hierarchical Modular Network for Video Captioning

### AAAI-2022
**Image Captioning**
- Image Difference Captioning with Pre-Training and Contrastive Learning
- Attention-Aligned Transformer for Image Captioning
- Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models
- MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-Based Image Captioning
- UNISON: Unpaired Cross-Lingual Image Captioning
- End-to-End Transformer Based Model for Image Captioning

### IJCAI-2022
**Image Captioning**
- ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning
- S2 Transformer for Image Captioning

**Video Captioning**
- GL-RG: Global-Local Representation Granularity for Video Captioning

### NeurIPS-2021
**Video Captioning**
- Multi-modal Dependency Tree for Video Captioning [[paper]](https://openreview.net/pdf?id=sW40wkwfsZp)

### ACMMM-2021
**Image Captioning**
- Distributed Attention for Grounded Image Captioning
- Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning
- Semi-Autoregressive Image Captioning
- Question-controlled Text-aware Image Captioning
- Triangle-Reward Reinforcement Learning: A Visual-Linguistic Semantic Alignment for Image Captioning
- Group-based Distinctive Image Captioning with Memory Attention
- Direction Relation Transformer for Image Captioning
- Scene Graph with 3D Information for Change Captioning
- Similar Scenes Arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

**Video Captioning**
- State-aware Video Procedural Captioning
- Discriminative Latent Semantic Graph for Video Captioning
- Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
- Multi-Perspective Video Captioning
- Hybrid Reasoning Network for Video-based Commonsense Captioning

### ICCV-2021
**Image Captioning**
- Partial Off-Policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Shi_Partial_Off-Policy_Learning_Balance_Accuracy_and_Diversity_for_Human-Oriented_Image_ICCV_2021_paper.pdf)
- Viewpoint-Agnostic Change Captioning With Cycle Consistency [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Viewpoint-Agnostic_Change_Captioning_With_Cycle_Consistency_ICCV_2021_paper.pdf)
- Understanding and Evaluating Racial Biases in Image Captioning [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Understanding_and_Evaluating_Racial_Biases_in_Image_Captioning_ICCV_2021_paper.pdf)
- Auto-Parsing Network for Image Captioning and Visual Question Answering [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Auto-Parsing_Network_for_Image_Captioning_and_Visual_Question_Answering_ICCV_2021_paper.pdf)
- In Defense of Scene Graphs for Image Captioning [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Nguyen_In_Defense_of_Scene_Graphs_for_Image_Captioning_ICCV_2021_paper.pdf)
- Describing and Localizing Multiple Changes With Transformers [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Qiu_Describing_and_Localizing_Multiple_Changes_With_Transformers_ICCV_2021_paper.pdf)
- Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Bai_Explain_Me_the_Painting_Multi-Topic_Knowledgeable_Art_Description_Generation_ICCV_2021_paper.pdf)

**Video Captioning**
- Motion Guided Region Message Passing for Video Captioning [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Motion_Guided_Region_Message_Passing_for_Video_Captioning_ICCV_2021_paper.pdf)
- End-to-End Dense Video Captioning With Parallel Decoding [[paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_End-to-End_Dense_Video_Captioning_With_Parallel_Decoding_ICCV_2021_paper.pdf)

### ACL-2021
**Image Captioning**
- Control Image Captioning Spatially and Temporally [[paper]](https://aclanthology.org/2021.acl-long.157.pdf)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [[paper]](https://arxiv.org/pdf/2106.01444.pdf) [[code]](https://github.com/JoshuaFeinglass/SMURF)
- Enhancing Descriptive Image Captioning with Natural Language Inference [[paper]](https://aclanthology.org/2021.acl-short.36/)
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning [[paper]](https://arxiv.org/pdf/2106.14019.pdf)
- Semantic Relation-aware Difference Representation Learning for Change Captioning [[paper]](https://aclanthology.org/2021.findings-acl.6/)

**Video Captioning**
- Hierarchical Context-aware Network for Dense Video Event Captioning [[paper]](https://aclanthology.org/2021.acl-long.156.pdf)
- Video Paragraph Captioning as a Text Summarization Task [[paper]](https://aclanthology.org/2021.acl-short.9.pdf)
- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning [[paper]](https://aclanthology.org/2021.findings-acl.24.pdf)

### CVPR-2021
**Image Captioning**
- Connecting What to Say With Where to Look by Modeling Human Attention Traces. [[paper]](https://arxiv.org/pdf/2105.05964.pdf) [[code]](https://github.com/facebookresearch/connect-caption-and-trace)
- Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. [[paper]](https://arxiv.org/pdf/2103.05121.pdf)
- Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Improving_OCR-Based_Image_Captioning_by_Incorporating_Geometrical_Relationship_CVPR_2021_paper.pdf)
- Image Change Captioning by Learning From an Auxiliary Task. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Hosseinzadeh_Image_Change_Captioning_by_Learning_From_an_Auxiliary_Task_CVPR_2021_paper.pdf)
- Scan2Cap: Context-aware Dense Captioning in RGB-D Scans. [[paper]](https://arxiv.org/pdf/2012.02206.pdf) [[code]](https://github.com/daveredrum/Scan2Cap)
- Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Towards_Bridging_Event_Captioner_and_Sentence_Localizer_for_Weakly_Supervised_CVPR_2021_paper.pdf)
- TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption. [[paper]](https://arxiv.org/pdf/2012.04638.pdf)
- Towards Accurate Text-Based Image Captioning With Content Diversity Exploration. [[paper]](https://arxiv.org/pdf/2105.03236.pdf)
- FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_FAIEr_Fidelity_and_Adequacy_Ensured_Image_Caption_Evaluation_CVPR_2021_paper.pdf)
- RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.pdf)
- Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles. [[paper]](https://arxiv.org/pdf/2103.12204.pdf)

**Video Captioning**
- Open-Book Video Captioning With Retrieve-Copy-Generate Network. [[paper]](https://arxiv.org/pdf/2103.05284.pdf)
- Towards Diverse Paragraph Captioning for Untrimmed Videos. [[paper]](https://openaccess.thecvf.com/content/CVPR2021/papers/Song_Towards_Diverse_Paragraph_Captioning_for_Untrimmed_Videos_CVPR_2021_paper.pdf)

### AAAI-2021
**Image Captioning**
- Partially Non-Autoregressive Image Captioning. [[code]](https://github.com/feizc/PNAIC/tree/master)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. [[paper]](https://arxiv.org/pdf/2012.07061.pdf)
- Object Relation Attention for Image Paragraph Captioning [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16423)
- Dual-Level Collaborative Transformer for Image Captioning. [[paper]](https://arxiv.org/pdf/2101.06462.pdf) [[code]](https://github.com/luo3300612/image-captioning-DLCT)
- Memory-Augmented Image Captioning [[paper]](https://ojs.aaai.org/index.php/AAAI/article/view/16220)
- Image Captioning with Context-Aware Auxiliary Guidance. [[paper]](https://arxiv.org/pdf/2012.05545.pdf)
- Consensus Graph Representation Learning for Better Grounded Image Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-3680.ZhangW.pdf)
- FixMyPose: Pose Correctional Captioning and Retrieval. [[paper]](https://arxiv.org/pdf/2104.01703.pdf) [[code]](https://github.com/hyounghk/FixMyPose) [[website]](https://fixmypose-unc.github.io/)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [[paper]](https://arxiv.org/pdf/2009.13682)

**Video Captioning**
- Non-Autoregressive Coarse-to-Fine Video Captioning. [[paper]](https://arxiv.org/pdf/1911.12018.pdf)
- Semantic Grouping Network for Video Captioning. [[paper]](https://arxiv.org/pdf/2102.00831.pdf) [[code]](https://github.com/hobincar/SGN)
- Augmented Partial Mutual Learning with Frame Masking for Video Captioning. [[paper]](https://www.aaai.org/AAAI21Papers/AAAI-9714.LinK.pdf)

### ACMMM-2020
**Image Captioning**
- Structural Semantic Adversarial Active Learning for Image Captioning. `oral` [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413885)
- Iterative Back Modification for Faster Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413901)
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414004)
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413859)
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413877)
- ICECAP: Information Concentrated Entity-aware Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413576)
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3414009)
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413753)

**Video Captioning**
- Controllable Video Captioning with an Exemplar Sentence. `oral` [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413908)
- Poet: Product-oriented Video Captioner for E-commerce. `oral` [[paper]](https://arxiv.org/abs/2008.06880)
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413498)
- Relational Graph Learning for Grounded Video Description Generation. [[paper]](https://dl.acm.org/doi/pdf/10.1145/3394171.3413746)

### NeurIPS-2020
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/13fe9d84310e77f13a6d184dbf1232f3-Paper.pdf)
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [[paper]](https://proceedings.neurips.cc/paper/2020/file/c2964caac096f26db222cb325aa267cb-Paper.pdf)
- Diverse Image Captioning with Context-Object Split Latent Spaces. [[paper]](https://papers.nips.cc/paper/2020/file/24bea84d52e6a1f8025e313c2ffff50a-Paper.pdf)

### ECCV-2020
**Image Captioning**
- Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets. `oral` [[paper]](https://arxiv.org/pdf/2007.06877.pdf)
- In-Home Daily-Life Captioning Using Radio Signals. `oral` [[paper]](https://arxiv.org/pdf/2008.10966.pdf) [[website]](http://rf-diary.csail.mit.edu/)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension. `oral` [[paper]](https://arxiv.org/pdf/2003.12462.pdf) [[website]](https://textvqa.org/textcaps) [[code]](https://github.com/facebookresearch/mmf/tree/master/projects/m4c_captioner)
- SODA: Story Oriented Dense Video Captioning Evaluation Framework. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123510511.pdf)
- Towards Unique and Informative Captioning of Images. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123520613.pdf)
- Learning Visual Representations with Caption Annotations. [[paper]](https://arxiv.org/pdf/2008.01392.pdf) [[website]](https://europe.naverlabs.com/research/computer-vision-research-naver-labs-europe/icmlm/)
- Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. [[paper]](https://arxiv.org/pdf/2008.02693.pdf) [[code]](https://github.com/xuewyang/Fashion_Captioning)
- Length Controllable Image Captioning. [[paper]](https://arxiv.org/pdf/2007.09580.pdf) [[code]](https://github.com/bearcatt/LaBERT)
- Comprehensive Image Captioning via Scene Graph Decomposition. [[paper]](https://arxiv.org/pdf/2007.11731.pdf) [[website]](http://pages.cs.wisc.edu/~yiwuzhong/Sub-GC.html)
- Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590562.pdf)
- Captioning Images Taken by People Who Are Blind. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123620409.pdf)
- Learning to Generate Grounded Visual Captions without Localization Supervision. [[paper]](https://arxiv.org/pdf/1906.00283.pdf) [[code]](https://github.com/chihyaoma/cyclical-visual-captioning)

**Video Captioning**
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos. `Spotlight` [[paper]](https://arxiv.org/pdf/2007.14164.pdf)
- Character Grounding and Re-Identification in Story of Videos and Text Descriptions. `Spotlight` [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123500528.pdf) [[code]](https://github.com/yj-yu/CiSIN/)
- Identity-Aware Multi-Sentence Video Description. [[paper]](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660358.pdf)

### CVPR-2020
**Image Captioning**
- Context-Aware Group Captioning via Self-Attention and Contrastive Features [[paper]](https://arxiv.org/abs/2004.03708)
Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, Alan L. Yuille
- More Grounded Image Captioning by Distilling Image-Text Matching Model [[paper]](https://arxiv.org/abs/2004.00390v1) [[code]](https://github.com/YuanEZhou/Grounded-Image-Captioning)
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
- Show, Edit and Tell: A Framework for Editing Image Captions [[paper]](https://arxiv.org/abs/2003.03107) [[code]](https://github.com/fawazsammani/show-edit-tell)
Fawaz Sammani, Luke Melas-Kyriazi
- Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs [[paper]](https://arxiv.org/abs/2003.00387) [[code]](https://github.com/cshizhe/asg2cap)
Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
- Normalized and Geometry-Aware Self-Attention Network for Image Captioning [[paper]](https://arxiv.org/abs/2003.08897)
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu
- Meshed-Memory Transformer for Image Captioning [[paper]](https://arxiv.org/abs/1912.08226) [[code]](https://github.com/aimagelab/meshed-memory-transformer)
Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
- X-Linear Attention Networks for Image Captioning [[paper]](https://arxiv.org/abs/2003.14080) [[code]](https://github.com/JDAI-CV/image-captioning)
Yingwei Pan, Ting Yao, Yehao Li, Tao Mei
- Transform and Tell: Entity-Aware News Image Captioning [[paper]](https://arxiv.org/abs/2004.08070) [[code]](https://github.com/alasdairtran/transform-and-tell) [[website]](https://transform-and-tell.ml/)
Alasdair Tran, Alexander Mathews, Lexing Xie

**Video Captioning**
- Object Relational Graph With Teacher-Recommended Learning for Video Captioning [[paper]](https://arxiv.org/abs/2002.11566)
Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zheng-Jun Zha
- Spatio-Temporal Graph for Video Captioning With Knowledge Distillation [[paper]](https://arxiv.org/abs/2003.13942?context=cs) [[code]](https://github.com/StanfordVL/STGraph)
Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles
- Better Captioning With Sequence-Level Exploration [[paper]](https://arxiv.org/abs/2003.03749)
Jia Chen, Qin Jin
- Syntax-Aware Action Targeting for Video Captioning [[code]](https://github.com/SydCaption/SAAT)
Qi Zheng, Chaoyue Wang, Dacheng Tao

### ACL-2020
**Image Captioning**
- Clue: Cross-modal Coherence Modeling for Caption Generation [[paper]](https://arxiv.org/abs/2005.00908)
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut and Matthew Stone
- Improving Image Captioning Evaluation by Considering Inter References Variance [[paper]](https://www.aclweb.org/anthology/2020.acl-main.93.pdf)
Yanzhi Yi, Hangyu Deng and Jinglu Hu
- Improving Image Captioning with Better Use of Caption [[paper]](https://www.aclweb.org/anthology/2020.acl-main.664.pdf) [[code]](https://github.com/Gitsamshi/WeakVRD-Captioning)
Zhan Shi, Xu Zhou, Xipeng Qiu and Xiaodan Zhu

**Video Captioning**
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning [[paper]](https://arxiv.org/abs/2005.05402) [[code]](https://github.com/jayleicn/recurrent-transformer)
Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara Berg and Mohit Bansal

### AAAI-2020
**Image Captioning**
- **Unified VLP**: Unified Vision-Language Pre-Training for Image Captioning and VQA [[paper]](https://arxiv.org/abs/1909.11059)
*Luowei Zhou (University of Michigan); Hamid Palangi (Microsoft Research); Lei Zhang (Microsoft); Houdong Hu (Microsoft AI and Research); Jason Corso (University of Michigan); Jianfeng Gao (Microsoft Research)*
- **OffPG**: Reinforcing an Image Caption Generator using Off-line Human Feedback [[paper]](https://arxiv.org/abs/1911.09753)
*Paul Hongsuck Seo (POSTECH); Piyush Sharma (Google Research); Tomer Levinboim (Google); Bohyung Han (Seoul National University); Radu Soricut (Google)*
- **MemCap**: Memorizing Style Knowledge for Image Captioning [[paper]](https://wuxinxiao.github.io/assets/papers/2020/MemCap.pdf)
*Wentian Zhao (Beijing Institute of Technology); Xinxiao Wu (Beijing Institute of Technology); Xiaoxun Zhang (Alibaba Group)*
- **C-R Reasoning**: Joint Commonsense and Relation Reasoning for Image and Video Captioning [[paper]](https://wuxinxiao.github.io/assets/papers/2020/C-R_reasoning.pdf)
*Jingyi Hou (Beijing Institute of Technology); Xinxiao Wu (Beijing Institute of Technology); Xiaoxun Zhang (Alibaba Group); Yayun Qi (Beijing Institute of Technology); Yunde Jia (Beijing Institute of Technology); Jiebo Luo (University of Rochester)*
- **MHTN**: Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption [[paper]](https://weizhangltt.github.io/paper/zhang-aaai20.pdf)
*Wei Zhang (East China Normal University); Yue Ying (East China Normal University); Pan Lu (The University of California, Los Angeles); Hongyuan Zha (Georgia Tech)*
- Show, Recall, and Tell: Image Captioning with Recall Mechanism [[paper]](https://arxiv.org/abs/2001.05876)
*Li Wang (MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China); Zechen Bai (Institute of Software, Chinese Academy of Sciences, China); Yonghua Zhang (Bytedance); Hongtao Lu (Shanghai Jiao Tong University)*
- Interactive Dual Generative Adversarial Networks for Image Captioning
*Junhao Liu (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences); Kai Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences); Chunpu Xu (Huazhong University of Science and Technology); Zhou Zhao (Zhejiang University); Ruifeng Xu (Harbin Institute of Technology (Shenzhen)); Ying Shen (Peking University Shenzhen Graduate School); Min Yang (Chinese Academy of Sciences)*
- **FDM-net**: Feature Deformation Meta-Networks in Image Captioning of Novel Objects [[paper]](https://www.aaai.org/Papers/AAAI/2020GB/AAAI-CaoT.4566.pdf)
*Tingjia Cao (Fudan University); Ke Han (Fudan University); Xiaomei Wang (Fudan University); Lin Ma (Tencent AI Lab); Yanwei Fu (Fudan University); Yu-Gang Jiang (Fudan University); Xiangyang Xue (Fudan University)*

**Video Captioning**
- An Efficient Framework for Dense Video Captioning
*Maitreya Suin (Indian Institute of Technology Madras); Rajagopalan Ambasamudram (Indian Institute of Technology Madras)*

### ACMMM-2019
**Image Captioning**
- Aligning Linguistic Words and Visual Semantic Units for Image Captioning
- Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards
- MUCH: Mutual Coupling Enhancement of Scene Recognition and Dense Captioning
- Generating Captions for Images of Ancient Artworks

**Video Captioning**
- Hierarchical Global-Local Temporal Modeling for Video Captioning [[paper]](https://dl.acm.org/doi/pdf/10.1145/3343031.3351072)
- Attention-based Densely Connected LSTM for Video Captioning [[paper]](https://dl.acm.org/doi/pdf/10.1145/3343031.3350932)
- Critic-based Attention Network for Event-based Video Captioning [[paper]](https://dl.acm.org/doi/pdf/10.1145/3343031.3351037)
- Watch It Twice: Video Captioning with a Refocused Video Encoder [[paper]](https://dl.acm.org/doi/pdf/10.1145/3343031.3351060)

### ACL-2019
- Informative Image Captioning with External Sources of Information [[paper]](https://www.aclweb.org/anthology/P19-1650.pdf)
*Sanqiang Zhao, Piyush Sharma, Tomer Levinboim and Radu Soricut*
- Dense Procedure Captioning in Narrated Instructional Videos [[paper]](https://www.aclweb.org/anthology/P19-1641.pdf)
*Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu and Ming Zhou*
- Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning [[paper]](https://www.aclweb.org/anthology/P19-1652.pdf)
*Zhihao Fan, Zhongyu Wei, Siyuan Wang and Xuanjing Huang*
- Generating Question Relevant Captions to Aid Visual Question Answering [[paper]](https://www.aclweb.org/anthology/P19-1348.pdf)
*Jialin Wu, Zeyuan Hu and Raymond Mooney*
### NeurIPS-2019
**Image Captioning**
- **AAT**: Adaptively Aligned Image Captioning via Adaptive Attention Time [[paper]](http://papers.nips.cc/paper/by-source-2019-4799) [[code]](https://github.com/husthuaan/AAT)
*Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen*
- **ObjRel Transf**: Image Captioning: Transforming Objects into Words [[paper]](http://papers.nips.cc/paper/by-source-2019-5963) [[code]](https://github.com/yahoo/object_relation_transformer)
*Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares*
- **VSSI-cap**: Variational Structured Semantic Inference for Diverse Image Captioning [[paper]](http://papers.nips.cc/paper/by-source-2019-1113)
*Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang*

### ICCV-2019
**Video Captioning**
- **VATEX**: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf) [[challenge]](https://vatex.org/main/index.html)
*Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang*
`ICCV 2019 Oral`
- **POS+CG**: Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Controllable_Video_Captioning_With_POS_Sequence_Guidance_Based_on_Gated_ICCV_2019_paper.pdf)
*Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu*
- **POS**: Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Hou_Joint_Syntax_Representation_Learning_and_Visual_Cue_Translation_for_Video_ICCV_2019_paper.pdf)
*Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia*
**Image Captioning**
- **DUDA**: Robust Change Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Park_Robust_Change_Captioning_ICCV_2019_paper.pdf)
*Dong Huk Park, Trevor Darrell, Anna Rohrbach*
`ICCV 2019 Oral`
- **AoANet**: Attention on Attention for Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Huang_Attention_on_Attention_for_Image_Captioning_ICCV_2019_paper.pdf)
*Lun Huang, Wenmin Wang, Jie Chen, Xiao-Yong Wei*
`ICCV 2019 Oral`
- **MaBi-LSTMs**: Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ge_Exploring_Overall_Contextual_Information_for_Image_Captioning_in_Human-Like_Cognitive_ICCV_2019_paper.pdf)
*Hongwei Ge, Zehang Yan, Kai Zhang, Mingde Zhao, Liang Sun*
- **Align2Ground**: Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Datta_Align2Ground_Weakly_Supervised_Phrase_Grounding_Guided_by_Image-Caption_Alignment_ICCV_2019_paper.pdf)
*Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran*
- **GCN-LSTM+HIP**: Hierarchy Parsing for Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yao_Hierarchy_Parsing_for_Image_Captioning_ICCV_2019_paper.pdf)
*Ting Yao, Yingwei Pan, Yehao Li, Tao Mei*
- **IR+Tdiv**: Generating Diverse and Descriptive Image Captions Using Visual Paraphrases [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_Generating_Diverse_and_Descriptive_Image_Captions_Using_Visual_Paraphrases_ICCV_2019_paper.pdf)
*Lixin Liu, Jiajun Tang, Xiaojun Wan, Zongming Guo*
- **CNM+SGAE**: Learning to Collocate Neural Modules for Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Yang_Learning_to_Collocate_Neural_Modules_for_Image_Captioning_ICCV_2019_paper.pdf)
*Xu Yang, Hanwang Zhang, Jianfei Cai*
- **Seq-CVAE**: Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Aneja_Sequential_Latent_Spaces_for_Modeling_the_Intention_During_Diverse_Image_ICCV_2019_paper.pdf)
*Jyoti Aneja, Harsh Agrawal, Dhruv Batra, Alexander Schwing*
- Towards Unsupervised Image Captioning With Shared Multimodal Embeddings [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Laina_Towards_Unsupervised_Image_Captioning_With_Shared_Multimodal_Embeddings_ICCV_2019_paper.pdf)
*Iro Laina, Christian Rupprecht, Nassir Navab*
- Human Attention in Image Captioning: Dataset and Analysis [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/He_Human_Attention_in_Image_Captioning_Dataset_and_Analysis_ICCV_2019_paper.pdf)
*Sen He, Hamed R. Tavakoli, Ali Borji, Nicolas Pugeault*
- **RDN**: Reflective Decoding Network for Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ke_Reflective_Decoding_Network_for_Image_Captioning_ICCV_2019_paper.pdf)
*Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, Yu-Wing Tai*
- **PSST**: Joint Optimization for Cooperative Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Vered_Joint_Optimization_for_Cooperative_Image_Captioning_ICCV_2019_paper.pdf)
*Gilad Vered, Gal Oren, Yuval Atzmon, Gal Chechik*
- **MUTAN**: Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Rahman_Watch_Listen_and_Tell_Multi-Modal_Weakly_Supervised_Dense_Event_Captioning_ICCV_2019_paper.pdf)
*Tanzila Rahman, Bicheng Xu, Leonid Sigal*
- **ETA**: Entangled Transformer for Image Captioning [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.pdf)
*Guang Li, Linchao Zhu, Ping Liu, Yi Yang*
- **nocaps**: novel object captioning at scale [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Agrawal_nocaps_novel_object_captioning_at_scale_ICCV_2019_paper.pdf)
*Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson*
- **Cap2Det**: Learning to Amplify Weak Caption Supervision for Object Detection [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ye_Cap2Det_Learning_to_Amplify_Weak_Caption_Supervision_for_Object_Detection_ICCV_2019_paper.pdf)
*Keren Ye, Mingda Zhang, Adriana Kovashka, Wei Li, Danfeng Qin, Jesse Berent*
- **Graph-Align**: Unpaired Image Captioning via Scene Graph Alignments [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Gu_Unpaired_Image_Captioning_via_Scene_Graph_Alignments_ICCV_2019_paper.pdf)
*Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, Gang Wang*
- Learning to Caption Images Through a Lifetime by Asking Questions [[paper]](http://openaccess.thecvf.com/content_ICCV_2019/papers/Shen_Learning_to_Caption_Images_Through_a_Lifetime_by_Asking_Questions_ICCV_2019_paper.pdf)
*Tingke Shen, Amlan Kar, Sanja Fidler*

### CVPR-2019
**Image Captioning**
- **SGAE**: Auto-Encoding Scene Graphs for Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/yangxuntu/SGAE)
*XU YANG (Nanyang Technological University); Kaihua Tang (Nanyang Technological University); Hanwang Zhang (Nanyang Technological University); Jianfei Cai (Nanyang Technological University)*
`CVPR 2019 Oral`
- **POS**: Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Deshpande_Fast_Diverse_and_Accurate_Image_Captioning_Guided_by_Part-Of-Speech_CVPR_2019_paper.pdf)
*Aditya Deshpande (University of Illinois at Urbana-Champaign); Jyoti Aneja (University of Illinois, Urbana-Champaign); Liwei Wang (Tencent AI Lab); Alexander Schwing (UIUC); David Forsyth (University of Illinois at Urbana-Champaign)*
`CVPR 2019 Oral`
- Unsupervised Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Feng_Unsupervised_Image_Captioning_CVPR_2019_paper.pdf) [[code]](https://github.com/fengyang0317/unsupervised_captioning)
*Yang Feng (University of Rochester); Lin Ma (Tencent AI Lab); Wei Liu (Tencent); Jiebo Luo (U. Rochester)*
- Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Xu_Exact_Adversarial_Attack_to_Image_Captioning_via_Structured_Output_Learning_CVPR_2019_paper.pdf)
*Yan Xu (UESTC); Baoyuan Wu (Tencent AI Lab); Fumin Shen (UESTC); Yanbo Fan (Tencent AI Lab); Yong Zhang (Tencent AI Lab); Heng Tao Shen (University of Electronic Science and Technology of China (UESTC)); Wei Liu (Tencent)*
- Describing like Humans: On Diversity in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Describing_Like_Humans_On_Diversity_in_Image_Captioning_CVPR_2019_paper.pdf)
*Qingzhong Wang (Department of Computer Science, City University of Hong Kong); Antoni Chan (City University of Hong Kong, Hong Kong)*
- **MSCap**: Multi-Style Image Captioning With Unpaired Stylized Text [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Guo_MSCap_Multi-Style_Image_Captioning_With_Unpaired_Stylized_Text_CVPR_2019_paper.pdf)
*Longteng Guo (Institute of Automation, Chinese Academy of Sciences); Jing Liu (National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences); Peng Yao (University of Science and Technology Beijing); Jiangwei Li (Huawei); Hanqing Lu (NLPR, Institute of Automation, CAS)*
- **CapSal**: Leveraging Captioning to Boost Semantics for Salient Object Detection [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_CapSal_Leveraging_Captioning_to_Boost_Semantics_for_Salient_Object_Detection_CVPR_2019_paper.pdf) [[code]](https://github.com/zhangludl/code-and-dataset-for-CapSal)
*Lu Zhang (Dalian University of Technology); Huchuan Lu (Dalian University of Technology); Zhe Lin (Adobe Research); Jianming Zhang (Adobe Research); You He (Naval Aviation University)*
- Context and Attribute Grounded Dense Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Yin_Context_and_Attribute_Grounded_Dense_Captioning_CVPR_2019_paper.pdf)
*Guojun Yin (University of Science and Technology of China); Lu Sheng (The Chinese University of Hong Kong); Bin Liu (University of Science and Technology of China); Nenghai Yu (University of Science and Technology of China); Xiaogang Wang (Chinese University of Hong Kong, Hong Kong); Jing Shao (Sensetime)*
- Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-Based_Captioning_CVPR_2019_paper.pdf)
*Dong-Jin Kim (KAIST); Jinsoo Choi (KAIST); Tae-Hyun Oh (MIT CSAIL); In So Kweon (KAIST)*
- **Show, Control and Tell**: A Framework for Generating Controllable and Grounded Captions [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Cornia_Show_Control_and_Tell_A_Framework_for_Generating_Controllable_and_CVPR_2019_paper.pdf)
*Marcella Cornia (University of Modena and Reggio Emilia); Lorenzo Baraldi (University of Modena and Reggio Emilia); Rita Cucchiara (Universita Di Modena E Reggio Emilia)*
- Self-Critical N-step Training for Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_Self-Critical_N-Step_Training_for_Image_Captioning_CVPR_2019_paper.pdf)
*Junlong Gao (Peking University Shenzhen Graduate School); Shiqi Wang (CityU); Shanshe Wang (Peking University); Siwei Ma (Peking University, China); Wen Gao (PKU)*
- Look Back and Predict Forward in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.pdf)
*Yu Qin (Shanghai Jiao Tong University); Jiajun Du (Shanghai Jiao Tong University); Hongtao Lu (Shanghai Jiao Tong University); Yonghua Zhang (Bytedance)*
- Intention Oriented Image Captions with Guiding Objects [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Intention_Oriented_Image_Captions_With_Guiding_Objects_CVPR_2019_paper.pdf)
*Yue Zheng (Tsinghua University); Ya-Li Li (THU); Shengjin Wang (Tsinghua University)*
- Adversarial Semantic Alignment for Improved Image Captions [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Dognin_Adversarial_Semantic_Alignment_for_Improved_Image_Captions_CVPR_2019_paper.pdf)
*Pierre Dognin (IBM); Igor Melnyk (IBM); Youssef Mroueh (IBM Research); Jarret Ross (IBM); Tom Sercu (IBM Research AI)*
- Good News, Everyone! Context driven entity-aware captioning for news images [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Biten_Good_News_Everyone_Context_Driven_Entity-Aware_Captioning_for_News_Images_CVPR_2019_paper.pdf) [[code]](https://github.com/furkanbiten/GoodNews)
*Ali Furkan Biten (Computer Vision Center); Lluis Gomez (Universitat Autónoma de Barcelona); Marçal Rusiñol (Computer Vision Center, UAB); Dimosthenis Karatzas (Computer Vision Centre)*
- Pointing Novel Objects in Image Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Pointing_Novel_Objects_in_Image_Captioning_CVPR_2019_paper.pdf)
*Yehao Li (Sun Yat-Sen University); Ting Yao (JD AI Research); Yingwei Pan (JD AI Research); Hongyang Chao (Sun Yat-sen University); Tao Mei (AI Research of JD.com)*
- Engaging Image Captioning via Personality [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Shuster_Engaging_Image_Captioning_via_Personality_CVPR_2019_paper.pdf)
*Kurt Shuster (Facebook); Samuel Humeau (Facebook); Hexiang Hu (USC); Antoine Bordes (Facebook); Jason Weston (FAIR)*
**Video Captioning**
- **SDVC**: Streamlined Dense Video Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Mun_Streamlined_Dense_Video_Captioning_CVPR_2019_paper.pdf)
*Jonghwan Mun (POSTECH); Linjie Yang (ByteDance AI Lab); Zhou Ren (Snap Inc.); Ning Xu (Snap); Bohyung Han (Seoul National University)*
`CVPR 2019 Oral`
- **GVD**: Grounded Video Description [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhou_Grounded_Video_Description_CVPR_2019_paper.pdf)
*Luowei Zhou (University of Michigan); Yannis Kalantidis (Facebook Research); Xinlei Chen (Facebook AI Research); Jason J Corso (University of Michigan); Marcus Rohrbach (Facebook AI Research)*
`CVPR 2019 Oral`
- **HybridDis**: Adversarial Inference for Multi-Sentence Video Description [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Adversarial_Inference_for_Multi-Sentence_Video_Description_CVPR_2019_paper.pdf)
*Jae Sung Park (UC Berkeley); Marcus Rohrbach (Facebook AI Research); Trevor Darrell (UC Berkeley); Anna Rohrbach (UC Berkeley)*
`CVPR 2019 Oral`
- **OA-BTG**: Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Object-Aware_Aggregation_With_Bidirectional_Temporal_Graph_for_Video_Captioning_CVPR_2019_paper.pdf)
*Junchao Zhang (Peking University); Yuxin Peng (Peking University)*
- **MARN**: Memory-Attended Recurrent Network for Video Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Pei_Memory-Attended_Recurrent_Network_for_Video_Captioning_CVPR_2019_paper.pdf)
*Wenjie Pei (Tencent); Jiyuan Zhang (Tencent YouTu); Xiangrong Wang (Delft University of Technology); Lei Ke (Tencent); Xiaoyong Shen (Tencent); Yu-Wing Tai (Tencent)*
- **GRU-EVE**: Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning [[paper]](http://openaccess.thecvf.com/content_CVPR_2019/papers/Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf)
*Nayyer Aafaq (The University of Western Australia); Naveed Akhtar (The University of Western Australia); Wei Liu (University of Western Australia); Syed Zulqarnain Gilani (The University of Western Australia); Ajmal Mian (University of Western Australia)*

### AAAI-2019
**Image Captioning**
- Improving Image Captioning with Conditional Generative Adversarial Nets [[paper]](https://arxiv.org/pdf/1805.07112.pdf)
*Chen Chen (Tencent); Shuai Mu (Tencent); Wanpeng Xiao (Tencent); Zexiong Ye (Tencent); Liesi Wu (Tencent); Qi Ju (Tencent)*
`AAAI 2019 Oral`
- **PAGNet**: Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4916)
*Lingyun Song (Xi'an Jiaotong University); Jun Liu (Xi'an Jiaotong University); Buyue Qian (Xi'an Jiaotong University); Yihe Chen (University of Toronto)*
`AAAI 2019 Oral`
- Meta Learning for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4883)
*Nannan Li (Wuhan University); Zhenzhong Chen (WHU); Shan Liu (Tencent America)*
- **DA**: Deliberate Residual based Attention Network for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4845/4718)
*Lianli Gao (The University of Electronic Science and Technology of China); Kaixuan Fan (University of Electronic Science and Technology of China); Jingkuan Song (UESTC); Xianglong Liu (Beihang University); Xing Xu (University of Electronic Science and Technology of China); Heng Tao Shen (University of Electronic Science and Technology of China (UESTC))*
- **HAN**: Hierarchical Attention Network for Image Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4924)
*Weixuan Wang (School of Electronic and Information Engineering, Sun Yat-sen University); Zhihong Chen (School of Electronic and Information Engineering, Sun Yat-sen University); Haifeng Hu (School of Electronic and Information Engineering, Sun Yat-sen University)*
- **COCG**: Learning Object Context for Dense Captioning [[paper]](https://www.aaai.org/ojs/index.php/AAAI/article/view/4886)
*Xiangyang Li (Institute of Computing Technology, Chinese Academy of Sciences); Shuqiang Jiang (ICT, Chinese Academy of Sciences); Jungong Han (Lancaster University)*

**Video Captioning**
- **TAMoE**: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning [[code]](https://github.com/eric-xw/Zero-Shot-Video-Captioning) [[paper]](https://arxiv.org/pdf/1811.02765.pdf)
*Xin Wang (University of California, Santa Barbara); Jiawei Wu (University of California, Santa Barbara); Da Zhang (UC Santa Barbara); Yu Su (OSU); William Wang (UC Santa Barbara)*
`AAAI 2019 Oral`
- **TDConvED**: Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning [[paper]](https://arxiv.org/pdf/1905.01077v1.pdf)
*Jingwen Chen (Sun Yat-sen University); Yingwei Pan (JD AI Research); Yehao Li (Sun Yat-Sen University); Ting Yao (JD AI Research); Hongyang Chao (Sun Yat-sen University); Tao Mei (AI Research of JD.com)*
`AAAI 2019 Oral`
- **FCVC-CF&IA**: Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention [[paper]](https://aaai.org/ojs/index.php/AAAI/article/view/4839)
*Kuncheng Fang (Fudan University); Lian Zhou (Fudan University); Cheng Jin (Fudan University); Yuejie Zhang (Fudan University); Kangnian Weng (Shanghai University of Finance and Economics); Tao Zhang (Shanghai University of Finance and Economics); Weiguo Fan (University of Iowa)*
- **MGSA**: Motion Guided Spatial Attention for Video Captioning [[paper]](http://yugangjiang.info/publication/19AAAI-vidcaptioning.pdf)
*Shaoxiang Chen (Fudan University); Yu-Gang Jiang (Fudan University)*