# Multimodal Conversational AI

A paper reading list on Multimodal Conversational AI that I keep for my own research purposes. :innocent: I will tidy it up and re-organize it along the way.
> PS: I would appreciate paper suggestions if you have any! 😸

------

## :bookmark_tabs: Research Papers

### 💬 Conversational AI Surveys

- Katharina Kann, et al. 2022. [**Open-domain Dialogue Generation: What We Can Do, Cannot Do, And Should Do Next**](https://aclanthology.org/2022.nlp4convai-1.13). _NLP4ConvAI, ACL_.
- Somil Gupta, Bhanu Pratap Singh Rawat, Hong Yu. 2020. [**Conversational Machine Comprehension: a Literature Review**](https://aclanthology.org/2020.coling-main.247/). _COLING_.

### :nerd_face: Multimodal Surveys

- Anirudh Sundar, Larry Heck. 2022. [**Multimodal Conversational AI: A Survey of Datasets and Approaches**](https://aclanthology.org/2022.nlp4convai-1.12/). _NLP4ConvAI, ACL_.
- Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow. 2021. [**Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods**](https://arxiv.org/abs/1907.09358). _JAIR_.
- Jabeen Summaira, et al. 2021. [**Recent Advances and Trends in Multimodal Deep Learning: A Review**](https://arxiv.org/abs/2105.11087). _arXiv_.
- Chao Zhang, et al. 2020. [**Multimodal Intelligence: Representation Learning, Information Fusion, and Applications**](https://arxiv.org/abs/1911.03977). _IEEE Journal of Selected Topics in Signal Processing_.
- Yonatan Bisk, et al. 2020. [**Experience Grounds Language**](https://aclanthology.org/2020.emnlp-main.703/). _EMNLP_.
- Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency. 2018. [**Multimodal Machine Learning: A Survey and Taxonomy**](https://ieeexplore.ieee.org/iel7/34/8605394/08269806.pdf). _IEEE Transactions on Pattern Analysis and Machine Intelligence_.
- Dhanesh Ramachandram, Graham W. Taylor. 2017. [**Deep Multimodal Learning: A Survey on Recent Advances and Trends**](https://ieeexplore.ieee.org/document/8103116). _IEEE Signal Processing Magazine_.

### :monocle_face: Multimodal Machine Learning

#### Multimodal Representation

- Wenzhong Guo, Jianwen Wang, Shiping Wang. 2019. [**Deep Multimodal Representation Learning: A Survey**](https://ieeexplore.ieee.org/abstract/document/8715409). _IEEE Access_.

#### Multimodal Fusion

- Yiqun Yao, Rada Mihalcea. 2022. [**Modality-specific Learning Rates for Effective Multimodal Additive Late-fusion**](https://aclanthology.org/2022.findings-acl.143/). _Findings, ACL_.
- Jing Gao, et al. 2020. [**A survey on deep learning for multimodal data fusion**](https://direct.mit.edu/neco/article-pdf/32/5/829/1865303/neco_a_01273.pdf). _Neural Computation, MIT Press_.
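
Below is a minimal, hypothetical PyTorch sketch of the additive late-fusion setup studied by Yao & Mihalcea above: one unimodal head per modality, logits summed, and a separate learning rate per head. Module names, dimensions, and the rates themselves are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: additive late fusion with modality-specific
# learning rates. All names, dimensions, and rates are illustrative.
import torch
import torch.nn as nn

class AdditiveLateFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, num_classes=7):
        super().__init__()
        # One unimodal head per modality; their logits are summed (late fusion).
        self.text_head = nn.Linear(text_dim, num_classes)
        self.audio_head = nn.Linear(audio_dim, num_classes)

    def forward(self, text_feat, audio_feat):
        return self.text_head(text_feat) + self.audio_head(audio_feat)

model = AdditiveLateFusion()
# Per-modality learning rates, so a fast-converging modality
# does not dominate the shared objective.
optimizer = torch.optim.Adam([
    {"params": model.text_head.parameters(), "lr": 1e-4},
    {"params": model.audio_head.parameters(), "lr": 1e-3},
])
```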

#### Multimodal Translation

- Bei Li, et al. 2022. [**On Vision Features in Multimodal Machine Translation**](https://aclanthology.org/2022.acl-long.438/). _ACL_.
- Umut Sulubacak, et al. 2020. [**Multimodal machine translation through visuals and speech**](https://link.springer.com/article/10.1007/s10590-020-09250-0). _Machine Translation, Springer_.
- Shaowei Yao, Xiaojun Wan. 2020. [**Multimodal Transformer for Multimodal Machine Translation**](https://aclanthology.org/2020.acl-main.400/). _ACL_. (Image, Text)
- Ozan Caglayan, et al. 2019. [**Probing the Need for Visual Context in Multimodal Machine Translation**](https://aclanthology.org/N19-1422/). _NAACL_.
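
To make the "visual context" these papers probe concrete, here is a hypothetical sketch of one common pattern in multimodal MT: project a global image feature and blend it into the text encoder states through a learned gate. Shapes, names, and the gating choice are my assumptions rather than any listed paper's exact architecture.

```python
# Hypothetical sketch: gated injection of a global image feature into
# text encoder states, a common multimodal-MT pattern. Shapes and names
# are illustrative assumptions.
import torch
import torch.nn as nn

class GatedVisualContext(nn.Module):
    def __init__(self, hidden_dim=512, image_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, enc_states, image_feat):
        # enc_states: (batch, seq_len, hidden); image_feat: (batch, image_dim)
        img = self.img_proj(image_feat).unsqueeze(1).expand_as(enc_states)
        gate = torch.sigmoid(self.gate(torch.cat([enc_states, img], dim=-1)))
        # The gate decides, per position, how much visual context to mix in.
        return enc_states + gate * img
```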

#### Multimodal Alignment

#### Multimodal Pre-training and Models

- Xichen Pan, et al. 2022. [**Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition**](https://aclanthology.org/2022.acl-long.308/). _ACL_.
- Wenliang Dai, et al. 2022. [**Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation**](https://aclanthology.org/2022.findings-acl.187/). _Findings, ACL_.
- Hui Su, et al. 2022. [**RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining**](https://aclanthology.org/2022.acl-long.65/). _ACL_.
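
Several entries above build on CLIP-style contrastive pre-training, whose objective is compact enough to sketch. Below is the symmetric in-batch contrastive (InfoNCE) loss as I understand it; the encoders are stubbed out and the temperature value is illustrative.

```python
# Sketch of a CLIP-style symmetric contrastive loss over in-batch
# image-text pairs. Encoders are omitted; temperature is illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every image to every text in the batch.
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; average both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```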

#### Multimodal Co-learning

- Anil Rahate, et al. 2022. [**Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions**](https://arxiv.org/abs/2107.13782). _Information Fusion, Elsevier_.

### :face_in_clouds: ConvAI Tasks

#### Multimodal Dialogue

##### Visual Dialogue

- Zhiyuan Ma, et al. 2022. [**UniTranSeR: A Unified Transformer Semantic Representation Framework for Multimodal Task-Oriented Dialog System**](https://aclanthology.org/2022.acl-long.9/). _ACL_.
- Qingfeng Sun, et al. 2022. [**Multimodal Dialogue Response Generation**](https://aclanthology.org/2022.acl-long.204/). _ACL_.
- Jiaxin Qi, et al. 2020. [**Two Causal Principles for Improving Visual Dialog**](https://arxiv.org/abs/1911.10496). _CVPR_.
- Hardik Chauhan, et al. 2019. [**Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System**](https://aclanthology.org/P19-1540/). _ACL_.
- Lizi Liao, et al. 2018. [**Knowledge-aware Multimodal Dialogue Systems**](https://dl.acm.org/doi/pdf/10.1145/3240508.3240605). _MM, ACM_.
- Shubham Agarwal, et al. 2018. [**A Knowledge-Grounded Multimodal Search-Based Conversational Agent**](https://aclanthology.org/W18-5709/). _Workshop on Search-Oriented Conversational AI, EMNLP_.
- Shubham Agarwal, et al. 2018. [**Improving Context Modelling in Multimodal Dialogue Generation**](https://arxiv.org/abs/1810.11955). _INLG_.
- Xiaoxiao Guo, et al. 2018. [**Dialog-based Interactive Image Retrieval**](https://proceedings.neurips.cc/paper/2018/file/a01a0380ca3c61428c26a231f0e49a09-Paper.pdf). _NeurIPS_. [\[GitHub\]](https://github.com/XiaoxiaoGuo/fashion-retrieval)
- Abhishek Das, et al. 2017. [**Visual Dialog**](https://arxiv.org/abs/1611.08669). _CVPR_. [\[GitHub\]](https://github.com/batra-mlp-lab/visdial)

##### Spoken Dialogue

- Tom Young, et al. 2020. [**Dialogue systems with audio context**](https://www.sciencedirect.com/science/article/pii/S0925231220300758). _Neurocomputing, Elsevier_.
- Tatsuya Kawahara. 2019. [**Spoken Dialogue System for a Human-like Conversational Robot ERICA**](https://colips.org/conferences/iwsds2018/wp/wp-content/uploads/2018/06/IWSDS18-kawahara.pdf). _IWSDS_.

##### Audio-Visual Dialogue

- Zekang Li, et al. 2020. [**Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog**](https://arxiv.org/abs/2002.00163). _Dialog System Technology Challenge, AAAI_. [\[GitHub\]](https://github.com/ictnlp/DSTC8-AVSD)
- Xiangyang Mou, et al. 2020. [**Multimodal Dialogue State Tracking By QA Approach with Data Augmentation**](https://arxiv.org/abs/2007.09903). _Dialog System Technology Challenge, AAAI_.
- Yun-Wei Chu, et al. 2020. [**Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System**](https://arxiv.org/abs/2001.06206). _Dialog System Technology Challenge, AAAI_.
- Huang Le, et al. 2019. [**Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems**](https://aclanthology.org/P19-1564/). _ACL_. [\[GitHub\]](https://github.com/henryhungle/MTN)

#### Multimodal Reasoning

- Qingxiu Dong, et al. 2022. [**Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues**](https://aclanthology.org/2022.acl-long.66/). _ACL_.
- Weifeng Zhang, et al. 2021. [**DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation**](https://www.sciencedirect.com/science/article/pii/S1566253521000208). _Information Fusion, Elsevier_.
- Remi Cadene, et al. 2019. [**MUREL: Multimodal Relational Reasoning for Visual Question Answering**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cadene_MUREL_Multimodal_Relational_Reasoning_for_Visual_Question_Answering_CVPR_2019_paper.pdf). _CVPR_.

#### Visual QA

- Sruthy Manmadhan, Binsu C. Kovoor. 2020. [**Visual question answering: a state-of-the-art review**](https://link.springer.com/article/10.1007/s10462-020-09832-7). _Artificial Intelligence Review, Springer_.
- Remi Cadene, et al. 2019. [**RUBi: Reducing Unimodal Biases for Visual Question Answering**](https://proceedings.neurips.cc/paper/2019/file/51d92be1c60d1db1d2e5e7a07da55b26-Paper.pdf). _NeurIPS_.
- Remi Cadene, et al. 2019. [**MUREL: Multimodal Relational Reasoning for Visual Question Answering**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Cadene_MUREL_Multimodal_Relational_Reasoning_for_Visual_Question_Answering_CVPR_2019_paper.pdf). _CVPR_.
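
For orientation, here is a toy joint-embedding VQA baseline of the kind the survey above covers: encode the question, project a global image feature, fuse them elementwise, and classify over a fixed answer vocabulary. Every choice here (bag-of-words encoder, dimensions, elementwise product) is an illustrative assumption.

```python
# Toy joint-embedding VQA baseline. All components and dimensions are
# illustrative assumptions, not any listed paper's architecture.
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, q_vocab=10000, img_dim=2048, hidden=512, n_answers=3000):
        super().__init__()
        self.q_encoder = nn.EmbeddingBag(q_vocab, hidden)  # mean-pooled bag of words
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, question_ids, image_feat):
        q = self.q_encoder(question_ids)            # (batch, hidden)
        v = torch.tanh(self.img_proj(image_feat))   # (batch, hidden)
        return self.classifier(q * v)               # elementwise fusion -> answer logits
```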

#### Affect Recognition and Multimodal Language

- Yan Ling, Jianfei Yu, Rui Xia. 2022. [**Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis**](https://aclanthology.org/2022.acl-long.152/). _ACL_.
- Yang Wu, et al. 2022. [**Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors**](https://aclanthology.org/2022.findings-acl.109/). _Findings, ACL_.
- Jiquan Wang, et al. 2022. [**Multimodal Sarcasm Target Identification in Tweets**](https://aclanthology.org/2022.acl-long.562/). _ACL_.
- Huisheng Mao, et al. 2022. [**M-SENA: An Integrated Platform for Multimodal Sentiment Analysis**](https://aclanthology.org/2022.acl-demo.20/). _System Demonstrations, ACL_.
- Wenliang Dai, et al. 2021. [**Weakly-supervised Multi-task Learning for Multimodal Affect Recognition**](https://arxiv.org/abs/2104.11560). _arXiv_.
- Trisha Mittal, et al. 2020. [**M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues**](https://ojs.aaai.org/index.php/AAAI/article/view/5492). _AAAI_.

### :100: Evaluation

- Paul Pu Liang, et al. 2021. [**MultiBench: Multiscale Benchmarks for Multimodal Representation Learning**](https://arxiv.org/abs/2107.07502). _NeurIPS_.
- Jan Deriu, et al. 2020. [**Survey on evaluation methods for dialogue systems**](https://link.springer.com/article/10.1007/s10462-020-09866-x). _Artificial Intelligence Review, Springer_.
- Masahiro Araki, et al. 2018. [**Collection of Multimodal Dialog Data and Analysis of the Result of Annotation of Users’ Interest Level**](https://aclanthology.org/L18-1250/). _LREC_. (Manual annotation)

### :card_file_box: Dataset and Challenges

- Mauajama Firdaus, et al. 2022. [**EmoSen: Generating Sentiment and Emotion Controlled Responses in a Multimodal Dialogue System**](https://ieeexplore.ieee.org/abstract/document/9165162/). _IEEE Transactions on Affective Computing_.
- Yunlong Liang, et al. 2022. [**MSCTD: A Multimodal Sentiment Chat Translation Dataset**](https://aclanthology.org/2022.acl-long.186). _ACL_.
- Yirong Chen, et al. 2022. [**CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI**](https://arxiv.org/abs/2205.14727). _arXiv_. (Chinese)
- Zhengcong Fei, et al. 2021. [**Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark**](https://arxiv.org/abs/2109.01839). _Dialog System Technology Challenge, AAAI_. [\[GitHub\]](https://github.com/lizekang/DSTC10-MOD) (Chinese)
- Deeksha Varshney, Asif Ekbal, Anushkha Singh. 2021. [**Knowledge Grounded Multimodal Dialog Generation in Task-oriented Settings**](https://aclanthology.org/2021.paclic-1.45.pdf). _PACLIC_.
- Satwik Kottur, et al. 2021. [**SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations**](https://aclanthology.org/2021.emnlp-main.401/). _EMNLP_.
- Kübra Bodur, et al. 2021. [**ChiCo: A Multimodal Corpus for the Study of Child Conversation**](https://dl.acm.org/doi/pdf/10.1145/3461615.3485399). _ICMI_.
- Mauajama Firdaus, et al. 2020. [**MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations**](https://aclanthology.org/2020.coling-main.393/). _COLING_.
- Seungwhan Moon, et al. 2020. [**Situated and Interactive Multimodal Conversations**](https://aclanthology.org/2020.coling-main.96/). _COLING_.
- Darryl Hannan, Akshay Jain, Mohit Bansal. 2020. [**ManyModalQA: Modality Disambiguation and QA over Diverse Inputs**](https://ojs.aaai.org/index.php/AAAI/article/view/6294). _AAAI_.
- Santiago Castro, et al. 2019. [**Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)**](https://aclanthology.org/P19-1455/). _ACL_.
- Satwik Kottur, et al. 2019. [**CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog**](https://aclanthology.org/N19-1058/). _NAACL_. [\[GitHub\]](https://github.com/satwikkottur/clevr-dialog)
- Soujanya Poria, et al. 2019. [**MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations**](https://aclanthology.org/P19-1050/). _ACL_. [\[Homepage\]](https://affective-meld.github.io/)
- Asma Ben Abacha, et al. 2019. [**VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019**](https://web.archive.org/web/20220120143058id_/http://ceur-ws.org/Vol-2380/paper_272.pdf). _CLEF_.
- Amrita Saha, Mitesh Khapra, Karthik Sankaranarayanan. 2018. [**Towards Building Large Scale Multimodal Domain-Aware Conversation Systems**](https://ojs.aaai.org/index.php/AAAI/article/view/11331). _AAAI_. [\[Homepage\]](https://amritasaha1812.github.io/MMD/)
- Harm de Vries, et al. 2018. [**Talk the Walk: Navigating New York City through Grounded Dialogue**](https://arxiv.org/abs/1807.03367). _arXiv_. [\[GitHub\]](https://github.com/facebookresearch/talkthewalk)

### :mag: Analysis

- Jialu Wang, Yang Liu, Xin Wang. 2022. [**Assessing Multilingual Fairness in Pre-trained Multimodal Representations**](https://aclanthology.org/2022.findings-acl.211/). _Findings, ACL_.
- Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens. 2022. [**Finding Structural Knowledge in Multimodal-BERT**](https://aclanthology.org/2022.acl-long.388/). _ACL_. [\[GitHub\]](https://github.com/VSJMilewski/multimodal-probes)

### :robot: Interface, Experience, and Interaction

- Liu Yang, Catherine Achard, Catherine Pelachaud. 2022. [**Multimodal Analysis of Interruptions**](https://www.researchgate.net/profile/Catherine-Pelachaud/publication/361335318_Multimodal_Analysis_of_Interruptions/links/62e3e4499d410c5ff36d55b5/Multimodal-Analysis-of-Interruptions.pdf). _International Conference on Human-Computer Interaction, Springer_.
- Delphine Potdevin, Céline Clavel, Nicolas Sabouret. 2020. [**Virtual intimacy in human-embodied conversational agent interactions: the influence of multimodality on its perception**](https://link.springer.com/article/10.1007/s12193-020-00337-9). _Journal on Multimodal User Interfaces, Springer_.
- Stefan Schaffer, Norbert Reithinger. 2019. [**Conversation is Multimodal: Thus Conversational User Interfaces should be as well**](https://www.dfki.de/fileadmin/user_upload/import/10581_a12-schaffer.pdf). _Conversational User Interfaces (CUI), ACM_.
- Stephen C. Levinson, Judith Holler. 2014. [**The origin of human multi-modal communication**](https://royalsocietypublishing.org/doi/10.1098/rstb.2013.0302). _Philosophical Transactions of the Royal Society B_.

## :bookmark: Articles, Tutorials, and Presentations

- Mireille Fares. 2020. [**Towards Multimodal Human-Like Characteristics and Expressive Visual Prosody in Virtual Agents**](https://dl.acm.org/doi/pdf/10.1145/3382507.3421155). _Doctoral Consortium Paper, ICMI_.
- Jianfeng Gao, Michel Galley, Lihong Li. 2018. [**Neural Approaches to Conversational AI**](https://dl.acm.org/doi/pdf/10.1145/3209978.3210183). _Tutorial, SIGIR, ACM_.
- Louis-Philippe Morency, Tadas Baltrušaitis. 2017. [**Multimodal Machine Learning: Integrating Language, Vision and Speech**](https://aclanthology.org/P17-5002/). _Tutorial Abstracts, ACL_.
- Louis-Philippe Morency, Tadas Baltrušaitis. 2017. [**Multimodal Machine Learning**](https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf). _Tutorial, ACL_.
- Margaret Mitchell, John C. Platt, Kate Saenko. 2017. [**Guest Editorial: Image and Language Understanding**](https://link.springer.com/article/10.1007/s11263-017-0993-y). _International Journal of Computer Vision, Springer_.
- Desmond Elliott, Douwe Kiela, Angeliki Lazaridou. 2016. [**Multimodal Learning and Reasoning**](http://multimodalnlp.github.io/mlr_tutorial.pdf). _Tutorial, ACL_.