# Awesome-Video-Captioning
A curated list of research papers in video captioning (from 2015 to 2020). Links to code and project websites are included where available.

# Contents
- [2015](#2015)
- [2016](#2016)
- [2017](#2017)
- [2018](#2018)
- [2019](#2019)
- [2020](#2020)
- [Dense Captioning](#dense-captioning)
- [Grounded Captioning](#grounded-captioning)

# Paper List
## 2015
1. **LSTM-P**: [Translating Videos to Natural Language Using Deep Recurrent Neural Networks](https://www.cs.utexas.edu/users/ml/papers/venugopalan.naacl15.pdf)

*Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko*

NAACL, 2015.[[caffe-code]](https://gist.github.com/vsubhashini/3761b9ad43f60db9ac3d)

2. **LRCN**: [Long-term Recurrent Convolutional Networks for Visual Recognition and Description](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Donahue_Long-Term_Recurrent_Convolutional_2015_CVPR_paper.pdf)

*Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell*

CVPR, 2015.[[website]](http://jeffdonahue.com/lrcn/)

3. **S2VT**: [Sequence to Sequence – Video to Text](https://www.cs.utexas.edu/users/ml/papers/venugopalan.iccv15.pdf)

*Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko*

ICCV, 2015.[[caffe-code]](https://gist.github.com/vsubhashini/38d087e140854fee4b14)

4. **SA**: [Describing Videos by Exploiting Temporal Structure](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yao_Describing_Videos_by_ICCV_2015_paper.pdf)

*Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville*

ICCV, 2015.[[theano-code]](https://github.com/yaoli/arctic-capgen-vid) [[tf-code]](https://github.com/tsenghungchen/SA-tensorflow)

## 2016
1. **LSTM-E**: [Jointly Modeling Embedding and Translation to Bridge Video and Language](http://openaccess.thecvf.com/content_cvpr_2016/papers/Pan_Jointly_Modeling_Embedding_CVPR_2016_paper.pdf)

*Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui*

CVPR, 2016.

2. **HRNE**: [Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning](http://zhongwen.ai/pdf/HRNE.pdf)

*Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang*

CVPR, 2016.

3. **h-RNN**: [Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks](https://arxiv.org/pdf/1510.07712)

*Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu*

CVPR, 2016.

4. **MSR-VTT**: [MSR-VTT: A Large Video Description Dataset for Bridging Video and Language](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf)

*Jun Xu, Tao Mei, Ting Yao and Yong Rui*

CVPR, 2016.[[website]](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/)

5. **BiLSTM**: [Video Description using Bidirectional Recurrent Neural Networks](https://arxiv.org/pdf/1604.03390)

*Álvaro Peris, Marc Bolaños, Petia Radeva, Francisco Casacuberta*

ICANN, 2016.

## 2017
1. **DenseVidCap**: [Weakly Supervised Dense Video Captioning](https://arxiv.org/pdf/1704.01502)

*Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue*

CVPR, 2017.[[tf-code]](https://github.com/SCLinDennis/Weakly-Supervised-Dense-Video-Captioning)

2. **LSTM-TSA**: [Video Captioning with Transferred Semantic Attributes](https://arxiv.org/pdf/1611.07675)

*Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei*

CVPR, 2017.

3. **SCN**: [Semantic Compositional Networks for Visual Captioning](https://arxiv.org/pdf/1611.08002)

*Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng*

CVPR, 2017.[[theano-code]](https://github.com/zhegan27/Semantic_Compositional_Nets)

4. **StyleNet**: [StyleNet: Generating Attractive Visual Captions with Styles](http://openaccess.thecvf.com/content_cvpr_2017/papers/Gan_StyleNet_Generating_Attractive_CVPR_2017_paper.pdf)

*Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng*

CVPR, 2017.[[pytorch-code]](https://github.com/kacky24/stylenet)

5. **CT-SAN**: [End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering](https://zpascal.net/cvpr2017/Yu_End-To-End_Concept_Word_CVPR_2017_paper.pdf)

*Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim*

CVPR, 2017.[[tf-code]](https://gitlab.com/fodrh1201/CT-SAN/tree/master)

6. **CGVS**: [Top-down Visual Saliency Guided by Captions](http://zpascal.net/cvpr2017/Ramanishka_Top-Down_Visual_Saliency_CVPR_2017_paper.pdf)

*Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko*

CVPR, 2017.[[tf-code]](https://github.com/VisionLearningGroup/caption-guided-saliency)

7. **HBA**: [Hierarchical Boundary-Aware Neural Encoder for Video Captioning](http://openaccess.thecvf.com/content_cvpr_2017/papers/Baraldi_Hierarchical_Boundary-Aware_Neural_CVPR_2017_paper.pdf)

*Lorenzo Baraldi, Costantino Grana, Rita Cucchiara*

CVPR, 2017.[[pytorch-code]](https://github.com/Yugnaynehc/banet)

8. **TDDF**: [Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description](https://www.zpascal.net/cvpr2017/Zhang_Task-Driven_Dynamic_Fusion_CVPR_2017_paper.pdf)

*Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li and Qi Tian*

CVPR, 2017.

9. **GEAN**: [Supervising Neural Attention Models for Video Captioning by Human Gaze Data](http://zpascal.net/cvpr2017/Yu_Supervising_Neural_Attention_CVPR_2017_paper.pdf)

*Youngjae Yu, Jongwook Choi, Yeonhwa Kim, Kyung Yoo, Sang-Hun Lee, Gunhee Kim*

CVPR, 2017.[[tf-code]](https://github.com/yj-yu/Recurrent_Gaze_Prediction)

10. **MM-Att**: [Attention-Based Multimodal Fusion for Video Description](http://openaccess.thecvf.com/content_ICCV_2017/papers/Hori_Attention-Based_Multimodal_Fusion_ICCV_2017_paper.pdf)

*Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks*

ICCV, 2017.

11. **Tessellation**: [Temporal Tessellation: A Unified Approach for Video Analysis](http://openaccess.thecvf.com/content_ICCV_2017/papers/Kaufman_Temporal_Tessellation_A_ICCV_2017_paper.pdf)

*Dotan Kaufman, Gil Levi, Tal Hassner, Lior Wolf*

ICCV, 2017.[[tf-code]](https://github.com/dot27/temporal-tessellation)

12. **MTEG**: [Multi-Task Video Captioning with Video and Entailment Generation](https://arxiv.org/pdf/1704.07489)

*Ramakanth Pasunuru, Mohit Bansal*

ACL, 2017.

13. **MAM-RNN**: [MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning](https://www.ijcai.org/proceedings/2017/0307.pdf)

*Xuelong Li, Bin Zhao, Xiaoqiang Lu*

IJCAI, 2017.

14. **hLSTMat**: [Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning](https://www.ijcai.org/proceedings/2017/0381.pdf)

*Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, Heng Tao Shen*

IJCAI, 2017.[[theano-code]](https://github.com/zhaoluffy/hLSTMat)

## 2018
1. **Survey**: [Study of Video Captioning Problem](https://www.cs.princeton.edu/courses/archive/spring18/cos598B/public/projects/LiteratureReview/COS598B_spr2018_VideoCaptioning.pdf)

*Jiaqi Su*

COS 598B course report (Princeton), 2018.

2. [Fine-grained Video Captioning for Sports Narrative](http://openaccess.thecvf.com/content_cvpr_2018/papers/Yu_Fine-Grained_Video_Captioning_CVPR_2018_paper.pdf)

*Huanyu Yu, Shuo Cheng, Bingbing Ni, Minsi Wang, Jian Zhang, Xiaokang Yang*

CVPR, 2018.

3. **TSA-ED**: [Interpretable Video Captioning via Trajectory Structured Localization](http://openaccess.thecvf.com/content_cvpr_2018/papers/Wu_Interpretable_Video_Captioning_CVPR_2018_paper.pdf)

*Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, Liang Lin*

CVPR, 2018.

4. **RecNet**: [Reconstruction Network for Video Captioning](https://www.zpascal.net/cvpr2018/Wang_Reconstruction_Network_for_CVPR_2018_paper.pdf)

*Bairui Wang, Lin Ma, Wei Zhang, Wei Liu*

CVPR, 2018.[[pytorch-code]](https://github.com/hobincar/reconstruction-network-for-video-captioning)

5. **M3**: [M3: Multimodal Memory Modelling for Video Captioning](http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf)

*Junbo Wang, Wei Wang, Yan Huang, Liang Wang, Tieniu Tan*

CVPR, 2018.

6. **PickNet**: [Less Is More: Picking Informative Frames for Video Captioning](https://eccv2018.org/openaccess/content_ECCV_2018/papers/Yangyu_Chen_Less_is_More_ECCV_2018_paper.pdf)

*Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang*

ECCV, 2018.

7. **ECO-SCN**: [ECO: Efficient Convolutional Network for Online Video Understanding](http://openaccess.thecvf.com/content_ECCV_2018/papers/Mohammadreza_Zolfaghari_ECO_Efficient_Convolutional_ECCV_2018_paper.pdf)

*Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox*

ECCV, 2018.[[caffe-code]](https://github.com/mzolfaghari/ECO-efficient-video-understanding) [[pytorch-code]](https://github.com/zhang-can/ECO-pytorch)

8. **SibNet**: [SibNet: Sibling Convolutional Encoder for Video Captioning](https://cse.buffalo.edu/~jsyuan/papers/2018/SibNet__Sibling_Convolutional_Encoder_for_Video_Captioning.pdf)

*Sheng Liu, Zhou Ren, Junsong Yuan*

ACM MM, 2018.

9. **TubeNet**: [Video Captioning with Tube Features](https://www.ijcai.org/proceedings/2018/0164.pdf)

*Bin Zhao, Xuelong Li, Xiaoqiang Lu*

IJCAI, 2018.

## 2019
1. **Survey**: [Video Description: A Survey of Methods, Datasets and Evaluation Metrics](https://arxiv.org/pdf/1806.00186)

*Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, Mubarak Shah*

ACM Computing Surveys, 2019.

2. **GRU-EVE**: [Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning](https://zpascal.net/cvpr2019/Aafaq_Spatio-Temporal_Dynamics_and_Semantic_Attribute_Enriched_Visual_Encoding_for_Video_CVPR_2019_paper.pdf)

*Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian*

CVPR, 2019.

3. **MARN**: [Memory-Attended Recurrent Network for Video Captioning](https://arxiv.org/pdf/1905.03966)

*Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, Yu-Wing Tai*

CVPR, 2019.

4. **OA-BTG**: [Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Object-Aware_Aggregation_With_Bidirectional_Temporal_Graph_for_Video_Captioning_CVPR_2019_paper.pdf)

*Junchao Zhang, Yuxin Peng*

CVPR, 2019.

5. **VATEX**: [VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf)

*Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang*

ICCV, 2019.[[website]](https://vatex.org/main/index.html#)

6. **POS**: [Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning](http://openaccess.thecvf.com/content_ICCV_2019/papers/Hou_Joint_Syntax_Representation_Learning_and_Visual_Cue_Translation_for_Video_ICCV_2019_paper.pdf)

*Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia*

ICCV, 2019.

7. **POS-CG**: [Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Controllable_Video_Captioning_With_POS_Sequence_Guidance_Based_on_Gated_ICCV_2019_paper.pdf)

*Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu*

ICCV, 2019.[[pytorch-code]](https://github.com/vsislab/Controllable_XGating)

8. **WIT**: [Watch It Twice: Video Captioning with a Refocused Video Encoder](https://arxiv.org/pdf/1907.12905)

*Xiangxi Shi, Jianfei Cai, Shafiq Joty, Jiuxiang Gu*

ACM MM, 2019.

9. **MGSA**: [Motion Guided Spatial Attention for Video Captioning](https://www.aaai.org/ojs/index.php/AAAI/article/view/4829/4702)

*Shaoxiang Chen and Yu-Gang Jiang*

AAAI, 2019.

10. **TDConvED**: [Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning](https://arxiv.org/pdf/1905.01077)

*Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei*

AAAI, 2019.

11. **FCVC-CF&IA**: [Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention](https://aaai.org/ojs/index.php/AAAI/article/view/4839)

*Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, Weiguo Fan*

AAAI, 2019.

12. **TAMoE**: [Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning](https://arxiv.org/pdf/1811.02765)

*Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang*

AAAI, 2019.[[code]](https://github.com/eric-xw/Zero-Shot-Video-Captioning)

13. **VIC**: [Video Interactive Captioning with Human Prompts](https://www.ijcai.org/proceedings/2019/0135.pdf)

*Aming Wu, Yahong Han and Yi Yang*

IJCAI, 2019.[[code]](https://github.com/ViCap01/ViCap)

## 2020
1. [Spatio-Temporal Graph for Video Captioning with Knowledge Distillation](https://openaccess.thecvf.com/content_CVPR_2020/papers/Pan_Spatio-Temporal_Graph_for_Video_Captioning_With_Knowledge_Distillation_CVPR_2020_paper.pdf)

*Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles*

CVPR, 2020.

2. **SAAT**: [Syntax-Aware Action Targeting for Video Captioning](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zheng_Syntax-Aware_Action_Targeting_for_Video_Captioning_CVPR_2020_paper.pdf)

*Qi Zheng, Chaoyue Wang, Dacheng Tao*

CVPR, 2020.[[pytorch-code]](https://github.com/SydCaption/SAAT)

3. **ORG-TRL**: [Object Relational Graph with Teacher-Recommended Learning for Video Captioning](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Object_Relational_Graph_With_Teacher-Recommended_Learning_for_Video_Captioning_CVPR_2020_paper.pdf)

*Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zheng-Jun Zha*

CVPR, 2020.

4. **PMI-CAP**: [Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos](https://arxiv.org/pdf/2007.14164.pdf)

*Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang*

ECCV, 2020.

5. **RMN**: [Learning to Discretely Compose Reasoning Module Networks for Video Captioning](https://www.ijcai.org/Proceedings/2020/0104.pdf)

*Ganchao Tan, Daqing Liu, Meng Wang and Zheng-Jun Zha*

IJCAI, 2020.[[pytorch-code]](https://github.com/tgc1997/RMN)

6. **SBAT**: [SBAT: Video Captioning with Sparse Boundary-Aware Transformer](https://www.ijcai.org/Proceedings/2020/0088.pdf)

*Tao Jin, Siyu Huang, Yingming Li, Zhongfei Zhang, Ming Chen*

IJCAI, 2020.

7. [Joint Commonsense and Relation Reasoning for Image and Video Captioning](https://wuxinxiao.github.io/assets/papers/2020/C-R_reasoning.pdf)

*Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo*

AAAI, 2020.

8. **SMCG**: [Controllable Video Captioning with an Exemplar Sentence](https://dl.acm.org/doi/abs/10.1145/3394171.3413908)

*Yitian Yuan, Lin Ma, Jingwen Wang, Wenwu Zhu*

ACM MM, 2020.

9. **Poet**: [Poet: Product-oriented Video Captioner for E-commerce](https://arxiv.org/pdf/2008.06880.pdf)

*Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu*

ACM MM, 2020.

10. [Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning](https://dl.acm.org/doi/abs/10.1145/3394171.3413498)

*Botian Shi, Lei Ji, Zhendong Niu, Nan Duan, Ming Zhou, Xilin Chen*

ACM MM, 2020.

## Dense-Captioning
1. [Dense-Captioning Events in Videos](http://openaccess.thecvf.com/content_ICCV_2017/papers/Krishna_Dense-Captioning_Events_in_ICCV_2017_paper.pdf)

*Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles*

ICCV, 2017.[[code]](https://github.com/ranjaykrishna/densevid_eval) [[website]](https://cs.stanford.edu/people/ranjaykrishna/densevid/)

2. [End-to-End Dense Video Captioning with Masked Transformer](http://openaccess.thecvf.com/content_cvpr_2018/papers/Zhou_End-to-End_Dense_Video_CVPR_2018_paper.pdf)

*Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong*

CVPR, 2018.[[pytorch-code]](https://github.com/salesforce/densecap)

3. [Attend and Interact: Higher-Order Object Interactions for Video Understanding](http://openaccess.thecvf.com/content_cvpr_2018/CameraReady/0330.pdf)

*Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf*

CVPR, 2018.

4. [Jointly Localizing and Describing Events for Dense Video Captioning](http://openaccess.thecvf.com/content_cvpr_2018/papers/Li_Jointly_Localizing_and_CVPR_2018_paper.pdf)

*Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei*

CVPR, 2018.

5. [Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning](http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Bidirectional_Attentive_Fusion_CVPR_2018_paper.pdf)

*Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu*

CVPR, 2018.[[tf-code]](https://github.com/JaywongWang/DenseVideoCaptioning)

6. [Move Forward and Tell: A Progressive Generator of Video Descriptions](http://openaccess.thecvf.com/content_ECCV_2018/papers/Yilei_Xiong_Move_Forward_and_ECCV_2018_paper.pdf)

*Yilei Xiong, Bo Dai, Dahua Lin*

ECCV, 2018.

7. [Adversarial Inference for Multi-sentence Video Description](http://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Adversarial_Inference_for_Multi-Sentence_Video_Description_CVPR_2019_paper.pdf)

*Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach*

CVPR, 2019.[[pytorch-code]](https://github.com/jamespark3922/adv-inf)

8. [Dense Relational Captioning: Triple-stream Networks for Relationship-based Captioning](http://openaccess.thecvf.com/content_CVPR_2019/papers/Kim_Dense_Relational_Captioning_Triple-Stream_Networks_for_Relationship-Based_Captioning_CVPR_2019_paper.pdf)

*Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon*

CVPR, 2019.[[torch-code]](https://github.com/Dong-JinKim/DenseRelationalCaptioning)

9. [Streamlined Dense Video Captioning](http://openaccess.thecvf.com/content_CVPR_2019/papers/Mun_Streamlined_Dense_Video_Captioning_CVPR_2019_paper.pdf)

*Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han*

CVPR, 2019.

10. [Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning](http://openaccess.thecvf.com/content_ICCV_2019/papers/Rahman_Watch_Listen_and_Tell_Multi-Modal_Weakly_Supervised_Dense_Event_Captioning_ICCV_2019_paper.pdf)

*Tanzila Rahman, Bicheng Xu, Leonid Sigal*

ICCV, 2019.

11. [An Efficient Framework for Dense Video Captioning](https://www.aaai.org/Papers/AAAI/2020GB/AAAI-SuinM.7561.pdf)

*Maitreya Suin, A. N. Rajagopalan*

AAAI, 2020.

12. [MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/pdf/2005.05402.pdf)

*Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal*

ACL, 2020. [[pytorch-code]](https://github.com/jayleicn/recurrent-transformer)

13. [Identity-Aware Multi-Sentence Video Description](http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660358.pdf)

*Jae Sung Park, Trevor Darrell, Anna Rohrbach*

ECCV, 2020.

## Grounded-Captioning
1. **GVD**: [Grounded Video Description](http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhou_Grounded_Video_Description_CVPR_2019_paper.pdf)

*Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach*

CVPR, 2019.[[pytorch-code]](https://github.com/facebookresearch/grounded-video-description)

2. [Relational Graph Learning for Grounded Video Description Generation](https://dl.acm.org/doi/abs/10.1145/3394171.3413746)

*Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haochen Shi, Jun Xiao, Yueting Zhuang, William Yang Wang*

ACM MM, 2020.