# Awesome Audio-Visual: [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of papers and datasets for various audio-visual tasks, inspired by [awesome-computer-vision](https://github.com/jbhuang0604/awesome-computer-vision).

## Contents
- [Audio-Visual Localization](#audio-visual-localization)
- [Audio-Visual Separation](#audio-visual-separation)
- [Audio-Visual Representation/Classification/Retrieval](#audio-visual-representationclassificationretrieval)
- [Audio-Visual Action Recognition](#audio-visual-action-recognition)
- [Audio-Visual Spatial/Depth](#audio-visual-spatialdepth)
- [Audio-Visual RIR](#audio-visual-rir)
- [Audio-Visual Highlight Detection](#audio-visual-highlight-detection)
- [Audio-Visual Deepfake/Robustness](#audio-visual-deepfakerobustness)
- [Lightweight Audio-Visual Model](#lightweight-audio-visual-model)
- [Audio-Visual Navigation/RL](#audio-visual-navigationrl)
- [Audio-Visual Faces/Speech](#audio-visual-facesspeech)
- [Audio-Visual Learning of Scene Acoustics](#audio-visual-learning-of-scene-acoustics)
- [Audio-Visual Question Answering](#audio-visual-question-answering)
- [Cross-modal Generation (Audio-Video / Video-Audio)](#cross-modal-generation-audio-video--video-audio)
- [Audio-Visual Stylization/Generation](#audio-visual-stylizationgeneration)
- [Multi-modal Architectures](#multi-modal-architectures)
- [Uncategorized Papers](#uncategorized-papers)

#### Audio-Visual Localization
* [Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline](https://openaccess.thecvf.com/content/CVPR2023/papers/Geng_Dense-Localizing_Audio-Visual_Events_in_Untrimmed_Videos_A_Large-Scale_Benchmark_and_CVPR_2023_paper.pdf) - Geng, T., Wang, T., Duan, J., Cong, R., & Zheng, F. (CVPR 2023)
* [Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning](https://openaccess.thecvf.com/content/CVPR2023/papers/Sun_Learning_Audio-Visual_Source_Localization_via_False_Negative_Aware_Contrastive_Learning_CVPR_2023_paper.pdf) - Sun, W., Zhang, J., Wang, J., Liu, Z., Zhong, Y., Feng, T., ... & Barnes, N. (CVPR 2023) [[code]](https://github.com/OpenNLPLab/FNAC_AVL)
* [Dual Perspective Network for Audio Visual Event Localization](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136940676.pdf) - Rao, V., Khalil, M. I., Li, H., Dai, P., & Lu, J. (ECCV 2022)
* [A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos](https://openaccess.thecvf.com/content/CVPR2022/papers/Xuan_A_Proposal-Based_Paradigm_for_Self-Supervised_Sound_Source_Localization_in_Videos_CVPR_2022_paper.pdf) - Xuan, H., Wu, Z., Yang, J., Yan, Y., & Alameda-Pineda, X. (CVPR 2022)
* [Mix and Localize: Localizing Sound Sources in Mixtures](https://openaccess.thecvf.com/content/CVPR2022/papers/Hu_Mix_and_Localize_Localizing_Sound_Sources_in_Mixtures_CVPR_2022_paper.pdf) - Hu, X., Chen, Z., & Owens, A. (CVPR 2022) [[project page]](https://hxixixh.github.io/mix-and-localize/) [[code]](https://github.com/hxixixh/mix-and-localize)
* [Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks](https://openaccess.thecvf.com/content/CVPR2022/papers/Pan_Wnet_Audio-Guided_Video_Object_Segmentation_via_Wavelet-Based_Cross-Modal_Denoising_Networks_CVPR_2022_paper.pdf) - Pan, W., Shi, H., Zhao, Z., Zhu, J., He, X., Pan, Z., ... & Tian, Q. (CVPR 2022) [[code]](https://github.com/asudahkzj/Wnet)
* [Cross-Modal Background Suppression for Audio-Visual Event Localization](https://openaccess.thecvf.com/content/CVPR2022/papers/Xia_Cross-Modal_Background_Suppression_for_Audio-Visual_Event_Localization_CVPR_2022_paper.pdf) - Xia, Y., & Zhao, Z. (CVPR 2022) [[code]](https://github.com/marmot-xy/CMBS)
* [Audio-Visual Grouping Network for Sound Localization from Mixtures](https://arxiv.org/pdf/2303.17056.pdf) - Mo S., Tian Y. (CVPR 2023) [[code]](https://github.com/stoneMo/AVGN)
* [Egocentric Audio-Visual Object Localization](https://arxiv.org/pdf/2303.13471.pdf) - C. Huang, Y. Tian, A. Kumar, C. Xu (CVPR 2023) [[code]](https://github.com/WikiChao/Ego-AV-Loc)
* [A Closer Look at Weakly-Supervised Audio-Visual Source Localization](https://papers.nips.cc/paper_files/paper/2022/file/f3f2ff9579ba6deeb89caa2fe1f0b99c-Paper-Conference.pdf) - Mo, S., & Morgado, P. (NeurIPS 2022) [[code]](https://github.com/stoneMo/SLAVC)
* [Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing](https://openreview.net/pdf?id=zfo2LqFEVY) - Mo S., Tian Y. (NeurIPS 2022) [[code]](https://github.com/stoneMo/MGN)
* [Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing](https://proceedings.neurips.cc/paper/2021/file/5f93f983524def3dca464469d2cf9f3e-Paper.pdf) - Lin, Y. B., Tseng, H. Y., Lee, H. Y., Lin, Y. Y., & Yang, M. H. (NeurIPS 2021)
* [Localizing Visual Sounds the Hard Way](https://arxiv.org/pdf/2104.02691.pdf) - Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (CVPR 2021) [[code]](https://github.com/hche11/Localizing-Visual-Sounds-the-Hard-Way) [[project page]](https://www.robots.ox.ac.uk/~vgg/research/lvs/)
* [Positive Sample Propagation along the Audio-Visual Event Line](https://arxiv.org/pdf/2104.00239.pdf) - Zhou, J., Zheng, L., Zhong, Y., Hao, S., & Wang, M. (CVPR 2021) [[code]](https://github.com/jasongief/PSP_CVPR_2021)
* [Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing](https://yu-wu.net/pdf/CVPR21_audio.pdf) - Wu Y., Yang Y. (CVPR 2021) [[code]](https://github.com/Yu-Wu/Modaily-Aware-Audio-Visual-Video-Parsing)
* [Audio-Visual Localization by Synthetic Acoustic Image Generation]() - Sanguineti V., Morerio P., Del Bue A., Murino V.(AAAI 2021)
* [Binaural Audio-Visual Localization](https://cse.sc.edu/~songwang/document/aaai21b.pdf) - Wu, X., Wu, Z., Ju L., Wang S. (AAAI 2021) [[dataset]](https://github.com/W-zx-Y/Binaural-Audio-Visual-Localization)
* [Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching](https://arxiv.org/pdf/2010.05466.pdf) - Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., Dou, D. (NeurIPS 2020) [[code]](https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization) [[dataset]](https://zenodo.org/record/4079386#.X4PFodozbb2) [[demo]](https://www.youtube.com/watch?v=XRU-R32t6rU&feature=youtu.be)
* [Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision](https://arxiv.org/pdf/2007.04687.pdf) - Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., & Yang, Z. (ECCV 2020)[[project page/dataset]](https://roc-ng.github.io/XD-Violence/)
* [Do We Need Sound for Sound Source Localization?](https://arxiv.org/pdf/2007.05722.pdf) - Oya, T., Iwase, S., Natsume, R., Itazuri, T., Yamaguchi, S., & Morishima, S. (arXiv 2020)
* [Multiple Sound Sources Localization from Coarse to Fine](https://arxiv.org/pdf/2007.06355.pdf) - Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., & Lin, W. (ECCV 2020) [[code]](https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization)
* [Learning Differentiable Sparse and Low Rank Networks for Audio-Visual Object Localization](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9054280) - Pu, J., Panagakis, Y., & Pantic, M. (ICASSP 2020)
* [What Makes the Sound?: A Dual-Modality Interacting Network for Audio-Visual Event Localization](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9053895) - Ramaswamy, J. (ICASSP 2020)
* [Self-supervised learning for audio-visual speaker diarization](https://arxiv.org/pdf/2002.05314.pdf) - Ding, Y., Xu, Y., Zhang, S. X., Cong, Y., & Wang, L. (ICASSP 2020)
* [See the Sound, Hear the Pixels](http://openaccess.thecvf.com/content_WACV_2020/papers/Ramaswamy_See_the_Sound_Hear_the_Pixels_WACV_2020_paper.pdf) - Ramaswamy, J., & Das, S. (WACV 2020)
* [Dual Attention Matching for Audio-Visual Event Localization](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wu_Dual_Attention_Matching_for_Audio-Visual_Event_Localization_ICCV_2019_paper.pdf) - Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (ICCV 2019)
* [Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events](https://arxiv.org/pdf/1804.07345.pdf) - Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (arXiv 2018) [[CVPRW 2018]](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w49/Parekh_Weakly_Supervised_Representation_CVPR_2018_paper.pdf)
* [Learning to Localize Sound Source in Visual Scenes](http://openaccess.thecvf.com/content_cvpr_2018/papers/Senocak_Learning_to_Localize_CVPR_2018_paper.pdf) - Senocak, A., Oh, T. H., Kim, J., Yang, M. H., & Kweon, I. S. (CVPR 2018)
* [Objects that Sound](https://arxiv.org/pdf/1712.06651.pdf) - Arandjelovic, R., & Zisserman, A. (ECCV 2018)
* [Audio-Visual Event Localization in Unconstrained Videos](http://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf) - Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (ECCV 2018) [[project page]](https://sites.google.com/view/audiovisualresearch) [[code]](https://github.com/YapengTian/AVE-ECCV18)
* [Audio-visual object localization and separation using low-rank and sparsity](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7952687) - Pu, J., Panagakis, Y., Petridis, S., & Pantic, M. (ICASSP 2017)
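Most of the self-supervised localization papers above share a simple core idea: score how well a clip-level audio embedding matches each spatial position of a visual feature map, and read the sounding region off the resulting heatmap. Below is a minimal PyTorch sketch of that cosine-similarity heatmap; the encoders, feature shapes, and function names are illustrative assumptions, not any listed paper's exact code.

```python
# Minimal sketch (illustrative only): an audio-visual similarity heatmap,
# the core building block of many self-supervised localization methods above.
import torch
import torch.nn.functional as F

def audio_visual_heatmap(visual_feats: torch.Tensor,
                         audio_embed: torch.Tensor) -> torch.Tensor:
    """visual_feats: (B, C, H, W) frame features; audio_embed: (B, C) audio features.
    Returns a (B, H, W) cosine-similarity map in [-1, 1]."""
    v = F.normalize(visual_feats, dim=1)   # unit-norm per spatial location
    a = F.normalize(audio_embed, dim=1)    # unit-norm audio vector
    # Cosine similarity between the audio vector and every H x W location.
    return torch.einsum("bchw,bc->bhw", v, a)

if __name__ == "__main__":
    B, C, H, W = 2, 512, 14, 14
    heatmap = audio_visual_heatmap(torch.randn(B, C, H, W), torch.randn(B, C))
    print(heatmap.shape)                   # torch.Size([2, 14, 14])
    # Upsample to image resolution for visualization.
    full = F.interpolate(heatmap.unsqueeze(1), size=(224, 224),
                         mode="bilinear", align_corners=False)
    print(full.shape)                      # torch.Size([2, 1, 224, 224])
```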

#### Audio-Visual Separation
* [iQuery: Instruments As Queries for Audio-Visual Sound Separation](https://openaccess.thecvf.com/content/CVPR2023/papers/Chen_iQuery_Instruments_As_Queries_for_Audio-Visual_Sound_Separation_CVPR_2023_paper.pdf) - Chen, J., Zhang, R., Lian, D., Yang, J., Zeng, Z., & Shi, J. (CVPR 2023) [[code]](https://github.com/JiabenChen/iQuery)
* [Language-Guided Audio-Visual Source Separation via Trimodal Consistency](https://openaccess.thecvf.com/content/CVPR2023/papers/Tan_Language-Guided_Audio-Visual_Source_Separation_via_Trimodal_Consistency_CVPR_2023_paper.pdf) - Tan, R., Ray, A., Burns, A., Plummer, B. A., Salamon, J., Nieto, O., ... & Saenko, K. (CVPR 2023) [[code]](https://github.com/rxtan2/AVSeT)
* [Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation](https://openreview.net/pdf?id=fiB2RjmgwQ6) - Cheng, H., Liu, Z., Wu, W., & Wang, L. (ICLR 2023)
* [AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136970360.pdf) - Tzinis, E., Wisdom, S., Remez, T., & Hershey, J. R. (ECCV 2022)
* [VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer](http://arxiv.org/abs/2203.04099) - Montesinos, J. F., Kadandale, V. S., & Haro, G. (ECCV 2022) [[project page]](https://ipcv.github.io/VoViT/) [[code]](https://github.com/JuanFMontesinos/VoViT)
* [Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation](https://papers.nips.cc/paper_files/paper/2022/file/6c92839f0f9cddc96c694712a7143b09-Paper-Conference.pdf) - Chatterjee, M., Ahuja, N., & Cherian, A. (NeurIPS 2022)
* [Active Audio-Visual Separation of Dynamic Sound Sources](https://arxiv.org/pdf/2202.00850.pdf) - Majumder, S. & Grauman, K. (ECCV 2022) [[code]](https://github.com/SAGNIKMJR/active-AV-dynamic-separation) [[project page]](http://vision.cs.utexas.edu/projects/active-av-dynamic-separation/)
* [TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation](https://arxiv.org/pdf/2110.13412.pdf) - Rahman, T., Yang, M., & Sigal, L. (NeurIPS 2021) [[code]](https://github.com/ubc-vision/tribert)
* [Move2Hear: Active Audio-Visual Source Separation](https://openaccess.thecvf.com/content/ICCV2021/papers/Majumder_Move2Hear_Active_Audio-Visual_Source_Separation_ICCV_2021_paper.pdf) - Majumder, S., Al-Halah, Z., & Grauman, K. (ICCV 2021) [[code]](https://github.com/SAGNIKMJR/move2hear-active-AV-separation) [[project page]](http://vision.cs.utexas.edu/projects/move2hear/)
* [Visual Scene Graphs for Audio Source Separation](https://www.merl.com/publications/docs/TR2021-095.pdf) - Chatterjee, M., Le Roux, J., Ahuja, N., & Cherian, A. (ICCV 2021) [[code]](https://www.dropbox.com/s/cjfoklgozcamjns/avsgs.zip?dl=0) [[project page]](https://sites.google.com/site/metrosmiles/research/research-projects/avsgs)
* [VisualVoice: Audio-Visual Speech Separation With Cross-Modal Consistency]() - Gao, R., & Grauman, K. (CVPR 2021) [[code]](https://github.com/facebookresearch/VisualVoice) [[project page]](http://vision.cs.utexas.edu/projects/VisualVoice/)
* [Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation](https://arxiv.org/pdf/2104.02026.pdf) - Tian, Y., Hu, D., & Xu, C. (CVPR 2021) [[code]](https://github.com/YapengTian/CCOL-CVPR21)
* [Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation](https://arxiv.org/pdf/2104.02775.pdf) - Lee, J., Chung, S. W., Kim, S., Kang, H. G., & Sohn, K. (CVPR 2021) [[project page]](https://caffnet.github.io/)
* [Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds](https://openreview.net/pdf?id=MDsQkFP1Aw) - Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D.P. and Hershey, J.R. (ICLR 2021) [[project page]](https://audioscope.github.io/)
* [Sep-stereo: Visually guided stereophonic audio generation by associating source separation](https://arxiv.org/pdf/2007.09902.pdf) - Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (ECCV 2020) [[project page]](https://hangz-nju-cuhk.github.io/projects/Sep-Stereo) [[code]](https://github.com/SheldonTsui/SepStereo_ECCV2020)
* [Visually Guided Sound Source Separation using Cascaded Opponent Filter Network](https://arxiv.org/pdf/2006.03028.pdf) - Zhu, L., & Rahtu, E. (arXiv 2020) [[project page]](https://ly-zhu.github.io/cof-net)
* [Music Gesture for Visual Sound Separation](https://arxiv.org/pdf/2004.09476.pdf) - Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (CVPR 2020) [[project page]](http://music-gesture.csail.mit.edu/) [[code]](http://music-gesture.csail.mit.edu/#code)
* [Recursive Visual Sound Separation Using Minus-Plus Net](http://openaccess.thecvf.com/content_ICCV_2019/papers/Xu_Recursive_Visual_Sound_Separation_Using_Minus-Plus_Net_ICCV_2019_paper.pdf) - Xudong Xu, Bo Dai, Dahua Lin (ICCV 2019)
* [Co-Separating Sounds of Visual Objects](https://arxiv.org/pdf/1904.07750.pdf) - Gao, R. & Grauman, K. (ICCV 2019) [[project page]](http://vision.cs.utexas.edu/projects/coseparation/)
* [The Sound of Motions](https://arxiv.org/pdf/1904.05979.pdf) - Zhao, H., Gan, C., Ma, W. & Torralba, A. (ICCV 2019)
* [Learning to Separate Object Sounds by Watching Unlabeled Video](http://vision.cs.utexas.edu/projects/separating_object_sounds/sound-sep-eccv2018.pdf) - Gao, R., Feris, R., & Grauman, K. (ECCV 2018 (Oral)) [[project page]](http://vision.cs.utexas.edu/projects/separating_object_sounds/) [[code]](https://github.com/rhgao/Deep-MIML-Network) [[dataset]](http://vision.cs.utexas.edu/projects/separating_object_sounds/#data)
* [The Sound of Pixels](https://arxiv.org/pdf/1804.03160.pdf) - Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (ECCV 2018) [[project page]](http://sound-of-pixels.csail.mit.edu/) [[code]](https://github.com/hangzhaomit/Sound-of-Pixels) [[dataset]](https://github.com/roudimit/MUSIC_dataset)
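A recurring training recipe in this section is "mix-and-separate" (popularized by The Sound of Pixels): mix the audio of two videos, then train a network to recover each original spectrogram via a mask conditioned on that video's visual feature. The sketch below is a hedged, minimal version of that loop; the module sizes, names, and loss are illustrative assumptions, not any listed paper's architecture.

```python
# Minimal "mix-and-separate" training sketch (illustrative assumptions only).
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, vis_dim=512):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.vis_proj = nn.Linear(vis_dim, 32)
        self.out = nn.Conv2d(32, 1, 1)

    def forward(self, mix_spec, vis_feat):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        # vis_feat: (B, vis_dim) global visual feature of the target video
        h = self.audio_net(mix_spec)
        w = self.vis_proj(vis_feat)[:, :, None, None]       # (B, 32, 1, 1)
        return torch.sigmoid(self.out(h * w))                # (B, 1, F, T) mask in [0, 1]

if __name__ == "__main__":
    B, F_, T = 4, 256, 64
    spec_a, spec_b = torch.rand(B, 1, F_, T), torch.rand(B, 1, F_, T)
    vis_a, vis_b = torch.randn(B, 512), torch.randn(B, 512)
    mix = spec_a + spec_b                                    # synthetic mixture
    model = MaskHead()
    loss = 0.0
    for spec, vis in [(spec_a, vis_a), (spec_b, vis_b)]:
        est = model(mix, vis) * mix                          # masked mixture
        loss = loss + nn.functional.l1_loss(est, spec)       # recover each source
    loss.backward()
    print(float(loss))
```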

#### Audio-Visual Representation/Classification/Retrieval
* [Vision Transformers Are Parameter-Efficient Audio-Visual Learners](https://openaccess.thecvf.com/content/CVPR2023/papers/Lin_Vision_Transformers_Are_Parameter-Efficient_Audio-Visual_Learners_CVPR_2023_paper.pdf) - Lin, Y. B., Sung, Y. L., Lei, J., Bansal, M., & Bertasius, G. (CVPR 2023) [[code]](https://genjib.github.io/project_page/LAVISH/)
* [Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception](https://openaccess.thecvf.com/content/CVPR2023/papers/Gao_Collecting_Cross-Modal_Presence-Absence_Evidence_for_Weakly-Supervised_Audio-Visual_Event_Perception_CVPR_2023_paper.pdf) - Gao, J., Chen, M., & Xu, C. (CVPR 2023) [[code]](https://github.com/MengyuanChen21/CVPR2023-CMPAE)
* [Contrastive Audio-Visual Masked Autoencoder](https://openreview.net/pdf?id=QPtMRyk5rb) - Gong, Y., Rouditchenko, A., Liu, A. H., Harwath, D., Karlinsky, L., Kuehne, H., & Glass, J. R. (ICLR 2023) [[code]](https://github.com/yuangongnd/cav-mae)
* [Audio-Visual Segmentation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136970378.pdf) - Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., ... & Zhong, Y. (ECCV 2022) [[code]](https://github.com/OpenNLPLab/AVSBench)
* [Temporal and cross-modal attention for audio-visual zero-shot learning](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136800474.pdf) - Mercea, O. B., Hummel, T., Koepke, A. S., & Akata, Z. (ECCV 2022) [[code]](https://github.com/ExplainableML/TCAF-GZSL)
* [Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136740484.pdf) - Lee, S., Park, S., & Ro, Y. M. (ECCV 2022)
* [Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136940424.pdf) - Cheng, H., Liu, Z., Zhou, H., Qian, C., Wu, W., & Wang, L. (ECCV 2022) [[code]](https://github.com/MCG-NJU/JoMoLD)
* [MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound](https://openaccess.thecvf.com/content/CVPR2022/papers/Zellers_MERLOT_Reserve_Neural_Script_Knowledge_Through_Vision_and_Language_and_CVPR_2022_paper.pdf) - Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., ... & Choi, Y. (CVPR 2022) [[project page]](https://rowanzellers.com/merlotreserve/) [[code]](https://github.com/rowanz/merlot_reserve)
* [Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory](https://openaccess.thecvf.com/content/CVPR2022/papers/Lee_Weakly_Paired_Associative_Learning_for_Sound_and_Image_Representations_via_CVPR_2022_paper.pdf) - Lee, S., Kim, H. I., & Ro, Y. M. (CVPR 2022)
* [Sound and Visual Representation Learning With Multiple Pretraining Tasks](https://openaccess.thecvf.com/content/CVPR2022/papers/Vasudevan_Sound_and_Visual_Representation_Learning_With_Multiple_Pretraining_Tasks_CVPR_2022_paper.pdf) - Vasudevan, A. B., Dai, D., & Van Gool, L. (CVPR 2022)
* [Self-Supervised Object Detection From Audio-Visual Correspondence](https://openaccess.thecvf.com/content/CVPR2022/papers/Afouras_Self-Supervised_Object_Detection_From_Audio-Visual_Correspondence_CVPR_2022_paper.pdf) - Afouras, T., Asano, Y. M., Fagan, F., Vedaldi, A., & Metze, F. (CVPR 2022)
* [Audio-Visual Generalised Zero-Shot Learning With Cross-Modal Attention and Language](https://openaccess.thecvf.com/content/CVPR2022/papers/Mercea_Audio-Visual_Generalised_Zero-Shot_Learning_With_Cross-Modal_Attention_and_Language_CVPR_2022_paper.pdf) - Mercea, O. B., Riesch, L., Koepke, A. S., & Akata, Z. (CVPR 2022) [[project page]](https://www.eml-unitue.de/publication/audio-visual-zsl) [[code]](https://github.com/ExplainableML/AVCA-GZSL)
* [Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing](https://papers.nips.cc/paper_files/paper/2022/file/e095c0a3717629aa5497601985bfcf0e-Paper-Conference.pdf) - Mo, S., & Tian, Y. (NeurIPS 2022) [[code]](https://github.com/stoneMo/MGN)
* [Learning State-Aware Visual Representations from Audible Interactions](https://arxiv.org/abs/2209.13583) - Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta. (NeurIPS 2022) [[code]](https://github.com/HimangiM/RepLAI)
* [ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning](https://openaccess.thecvf.com/content/ICCV2021/papers/Lee_ACAV100M_Automatic_Curation_of_Large-Scale_Datasets_for_Audio-Visual_Video_Representation_ICCV_2021_paper.pdf) - Lee, S., Chung, J., Yu, Y., Kim, G., Breuel, T., Chechik, G., & Song, Y. (ICCV 2021) [[code]](https://github.com/sangho-vision/acav100m) [[project page]](https://acav100m.github.io/)
* [Spoken moments: Learning joint audio-visual representations from video descriptions](https://arxiv.org/pdf/2105.04489.pdf) - Monfort, M., Jin, S., Liu, A., Harwath, D., Feris, R., Glass, J., & Oliva, A. (CVPR 2021) [[project page/dataset]](http://moments.csail.mit.edu/spoken.html)
* [Robust Audio-Visual Instance Discrimination](https://arxiv.org/pdf/2103.15916.pdf) - Morgado, P., Misra, I., & Vasconcelos, N. (CVPR 2021)
* [Distilling Audio-Visual Knowledge by Compositional Contrastive Learning](https://arxiv.org/pdf/2104.10955.pdf) - Chen, Y., Xian, Y., Koepke, A., Shan, Y., & Akata, Z. (CVPR 2021) [[code]](https://github.com/yanbeic/CCL)
* [Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning](https://www.aaai.org/AAAI21Papers/AAAI-6067.ZhangJ.pdf) - Zhang, J., Xu, X., Shen, F., Lu, H., Liu, X., & Shen, H. T. (AAAI 2021)
* [Active Contrastive Learning of Audio-Visual Video Representations](https://openreview.net/pdf?id=OMizHuea_HB) - Ma, S., Zeng, Z., McDuff, D., & Song, Y. (ICLR 2021) [[code]](https://github.com/yunyikristy/CM-ACC)
* [Labelling unlabelled videos from scratch with multi-modal self-supervision](https://arxiv.org/pdf/2006.13662.pdf) - Asano, Y., Patrick, M., Rupprecht, C., & Vedaldi, A. (NeurIPS 2020) [[project page]](https://www.robots.ox.ac.uk/~vgg/research/selavi/)
* [Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation learning](https://arxiv.org/pdf/2008.05789.pdf) - Cheng, Y., Wang, R., Pan, Z., Feng, R., & Zhang, Y. (ACM MM 2020)
* [Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition](https://arxiv.org/pdf/2005.08449.pdf) - Di Hu, X. L., Mou, L., Jin, P., Chen, D., Jing, L., Zhu, X., & Dou, D. (ECCV 2020) [[code]](https://github.com/DTaoo/Multimodal-Aerial-Scene-Recognition)
* [Leveraging Acoustic Images for Effective Self-Supervised Audio Representation Learning](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123670120.pdf) - Sanguineti, V., Morerio, P., Pozzetti, N., Greco, D., Cristani, M., & Murino, V. (ECCV 2020) [[code]](https://github.com/IIT-PAVIS/acoustic-images-self-supervision)
* [Self-Supervised Learning of Audio-Visual Objects from Video](https://arxiv.org/pdf/2008.04237.pdf) - Afouras, T., Owens, A., Chung, J. S., & Zisserman, A. (ECCV 2020) [[project page]](http://www.robots.ox.ac.uk/~vgg/research/avobjects/)
* [Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing](https://arxiv.org/pdf/2007.10558.pdf) - Tian, Y., Li, D., & Xu, C. (ECCV 2020)
* [Audio-Visual Instance Discrimination with Cross-Modal Agreement](https://arxiv.org/abs/2004.12943) - Morgado, P., Vasconcelos, N., & Misra, I. (CVPR 2021)
* [Vggsound: A Large-Scale Audio-Visual Dataset](https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf) - Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. (ICASSP 2020) [[project page/dataset]](http://www.robots.ox.ac.uk/~vgg/data/vggsound/) [[code]](https://github.com/hche11/VGGSound)
* [Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data](https://arxiv.org/pdf/2006.01595.pdf) - Fayek, H. M., & Kumar, A. (IJCAI 2020)
* [Multi-modal Self-Supervision from Generalized Data Transformations](https://arxiv.org/pdf/2003.04298.pdf) - Patrick, M., Asano, Y. M., Fong, R., Henriques, J. F., Zweig, G., & Vedaldi, A. (arXiv 2020)
* [Curriculum Audiovisual Learning](https://arxiv.org/pdf/2001.09414.pdf) - Hu, D., Wang, Z., Xiong, H., Wang, D., Nie, F., & Dou, D. (arXiv 2020)
* [Audio-visual model distillation using acoustic images]() - Perez, A., Sanguineti, V., Morerio, P., & Murino, V. (WACV 2020) [[code]](https://github.com/afperezm/acoustic-images-distillation) [[dataset]](https://pavis.iit.it/datasets/audio-visually-indicated-actions-dataset)
* [Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos](http://openaccess.thecvf.com/content_WACV_2020/papers/Parida_Coordinated_Joint_Multimodal_Embeddings_for_Generalized_Audio-Visual_Zero-shot_Classification_and_WACV_2020_paper.pdf) - Parida, K., Matiyali, N., Guha, T., & Sharma, G. (WACV 2020) [[project page]](https://www.cse.iitk.ac.in/users/kranti/avzsl.html)[[Dataset]](https://github.com/krantiparida/AudioSetZSL)
* [Self-Supervised Learning by Cross-Modal Audio-Video Clustering](https://arxiv.org/pdf/1911.12667.pdf) - Alwassel, H., Mahajan, D., Torresani, L., Ghanem, B., & Tran, D. (NeurIPS 2020)
* [Look, listen, and learn more: Design choices for deep audio embeddings](http://www.justinsalamon.com/uploads/4/3/9/4/4394963/cramer_looklistenlearnmore_icassp_2019.pdf) - Cramer, J., Wu, H. H., Salamon, J., & Bello, J. P. (ICASSP 2019) [[code]](https://github.com/marl/openl3) [[L3-embedding]](https://github.com/marl/l3embedding)
* [Self-supervised audio-visual co-segmentation](https://arxiv.org/pdf/1904.09013.pdf) - Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (ICASSP 2019)
* [Deep Multimodal Clustering for Unsupervised Audiovisual Learning](https://arxiv.org/pdf/1807.03094.pdf) - Hu, D., Nie, F., & Li, X. (CVPR 2019)
* [Cooperative learning of audio and video models from self-supervised synchronization](http://papers.nips.cc/paper/8002-cooperative-learning-of-audio-and-video-models-from-self-supervised-synchronization.pdf) - Korbar, B., Tran, D., & Torresani, L. (NeurIPS 2018) [[project page]](http://vlg.cs.dartmouth.edu/projects/avts/)[[trained model 1]](http://vlg.cs.dartmouth.edu/projects/avts/model_mc3_as.pt)[[trained model 2]](http://vlg.cs.dartmouth.edu/projects/avts/model_mc2_as.pt)
* [Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w49/Hori_Multimodal_Attention_for_CVPR_2018_paper.pdf) - Hori, C., Hori, T., Wichern, G., Wang, J., Lee, T. Y., Cherian, A., & Marks, T. K. (CVPRW 2018)
* [Audio-Visual Scene Analysis with Self-Supervised Multisensory Features](https://arxiv.org/pdf/1804.03641.pdf) - Owens, A., & Efros, A. A. (ECCV 2018 (Oral)) [[project page]](http://andrewowens.com/multisensory/) [[code]](https://github.com/andrewowens/multisensory)
* [Look, listen and learn](https://arxiv.org/pdf/1705.08168.pdf) - Arandjelovic, R., & Zisserman, A. (ICCV 2017) [[Keras-code]](https://github.com/Kajiyu/LLLNet)
* [Ambient Sound Provides Supervision for Visual Learning](https://arxiv.org/pdf/1608.07017.pdf) - Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (ECCV 2016(Oral)) [[journal version]](https://arxiv.org/pdf/1712.07271.pdf) [[project page]](http://andrewowens.com/ambient/index.html)
* [Soundnet: Learning sound representations from unlabeled video](http://www.cs.columbia.edu/~vondrick/soundnet.pdf) - Aytar, Y., Vondrick, C., & Torralba, A. (NIPS 2016) [[project page]](http://projects.csail.mit.edu/soundnet/) [[code]](https://github.com/cvondrick/soundnet)
* [See, hear, and read: Deep aligned representations](https://people.csail.mit.edu/yusuf/publications/2017/Aytar17/aytar17.pdf) - Aytar, Y., Vondrick, C., & Torralba, A. (arXiv 2017) [[project page]](https://people.csail.mit.edu/yusuf/see-hear-read/)
* [Cross-Modal Embeddings for Video and Audio Retrieval](https://arxiv.org/pdf/1801.02200.pdf) - Surís, D., Duarte, A., Salvador, A., Torres, J., & Giró-i-Nieto, X. (ECCVW 2018)
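Many of the self-supervised works above (audio-visual correspondence, instance discrimination, cross-modal agreement) reduce to a contrastive objective between paired audio and video embeddings: pull together the two views of the same clip, push apart all other pairings in the batch. A minimal sketch, assuming two generic encoders have already produced (B, D) features; the temperature and dimensions are placeholders, not any paper's settings.

```python
# Minimal audio-visual InfoNCE sketch (illustrative assumptions only).
import torch
import torch.nn.functional as F

def av_nce_loss(audio_embed: torch.Tensor, video_embed: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """audio_embed, video_embed: (B, D) features from the two encoders."""
    a = F.normalize(audio_embed, dim=1)
    v = F.normalize(video_embed, dim=1)
    logits = a @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs on the diagonal
    # Symmetric loss: audio->video and video->audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    B, D = 8, 128
    loss = av_nce_loss(torch.randn(B, D), torch.randn(B, D))
    print(float(loss))
```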

#### Audio-Visual Action Recognition
* [Audio-Adaptive Activity Recognition Across Video Domains](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhang_Audio-Adaptive_Activity_Recognition_Across_Video_Domains_CVPR_2022_paper.pdf) - Zhang, Y., Doughty, H., Shao, L., & Snoek, C. G. (CVPR 2022) [[project page]](https://xiaobai1217.github.io/DomainAdaptation/) [[code]](https://github.com/xiaobai1217/DomainAdaptation)
* [Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization](https://openreview.net/pdf?id=hWr3e3r-oH5) - Lee, J., Jain, M., Park, H., & Yun, S. (ICLR 2021)
* [Speech2Action: Cross-modal Supervision for Action Recognition](http://www.robots.ox.ac.uk/~vgg/publications/2020/Nagrani20/nagrani20.pdf) - Nagrani, A., Sun, C., Ross, D., Sukthankar, R., Schmid, C., & Zisserman, A. (CVPR 2020) [[project page/dataset]](https://www.robots.ox.ac.uk/~vgg/research/speech2action/)
* [Listen to Look: Action Recognition by Previewing Audio](https://arxiv.org/pdf/1912.04487.pdf) - Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani (CVPR 2020) [[project page]](http://vision.cs.utexas.edu/projects/listen_to_look/)
* [EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition](http://openaccess.thecvf.com/content_ICCV_2019/papers/Kazakos_EPIC-Fusion_Audio-Visual_Temporal_Binding_for_Egocentric_Action_Recognition_ICCV_2019_paper.pdf) - Kazakos, E., Nagrani, A., Zisserman, A., & Damen, D. (ICCV 2019) [[project page]](https://ekazakos.github.io/TBN/) [[code]](https://github.com/ekazakos/temporal-binding-network)
* [Uncertainty-aware Audiovisual Activity Recognition using Deep Bayesian Variational Inference](http://openaccess.thecvf.com/content_ICCV_2019/papers/Subedar_Uncertainty-Aware_Audiovisual_Activity_Recognition_Using_Deep_Bayesian_Variational_Inference_ICCV_2019_paper.pdf) - Subedar, M., Krishnan, R., Meyer, P. L., Tickoo, O., & Huang, J. (ICCV 2019)
* [Seeing and Hearing Egocentric Actions: How Much Can We Learn?](http://openaccess.thecvf.com/content_ICCVW_2019/papers/EPIC/Cartas_Seeing_and_Hearing_Egocentric_Actions_How_Much_Can_We_Learn_ICCVW_2019_paper.pdf) - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (ICCVW 2019)
* [How Much Does Audio Matter to Recognize Egocentric Object Interactions?](https://arxiv.org/pdf/1906.00634.pdf) - Cartas, A., Luque, J., Radeva, P., Segura, C., & Dimiccoli, M. (EPIC CVPRW 2019)

#### Audio-Visual Spatial/Depth
* [Camera Pose Estimation and Localization with Active Audio Sensing](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136970266.pdf) - Yang, K., Firman, M., Brachmann, E., & Godard, C. (ECCV 2022)
* [Few-Shot Audio-Visual Learning of Environment Acoustics](https://proceedings.neurips.cc/paper_files/paper/2022/file/113ae3a9762ca2168f860a8501d6ae25-Paper-Conference.pdf) - Majumder, S., Chen, C., Al-Halah, Z., & Grauman, K. (NeurIPS 2022) [[code]](https://github.com/SAGNIKMJR/few-shot-rir)
* [Localize to Binauralize: Audio Spatialization From Visual Sound Source Localization](https://openaccess.thecvf.com/content/ICCV2021/papers/Rachavarapu_Localize_to_Binauralize_Audio_Spatialization_From_Visual_Sound_Source_Localization_ICCV_2021_paper.pdf) - Rachavarapu, K. K., Sundaresha, V., & Rajagopalan, A. N. (ICCV 2021)
* [Visually Informed Binaural Audio Generation without Binaural Audios](https://arxiv.org/pdf/2104.06162.pdf) - Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (CVPR 2021) [[code]](https://github.com/SheldonTsui/PseudoBinaural_CVPR2021)
* [Beyond image to depth: Improving depth prediction using echoes](https://arxiv.org/pdf/2103.08468.pdf) - Parida, K. K., Srivastava, S., & Sharma, G. (CVPR 2021) [[code]](https://github.com/krantiparida/beyond-image-to-depth) [[project page]](https://krantiparida.github.io/projects/bimgdepth.html)
* [Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation](https://yanbo.ml/papers/aaai21-9767.pdf) - Lin, Yan-Bo and Wang, Yu-Chiang Frank, (AAAI 2021)
* [Learning Representations from Audio-Visual Spatial Alignment](https://pedro-morgado.github.io/assets/publications/2020-sptalign/paper.pdf) - Morgado, P., Li, Y., & Vasconcelos, N. (NeurIPS 2020) [[code]](https://github.com/pedro-morgado/AVSpatialAlignment)
* [VisualEchoes: Spatial Image Representation Learning through Echolocation](https://arxiv.org/pdf/2005.01616.pdf) - Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (ECCV 2020)
* [BatVision with GCC-PHAT Features for Better Sound to Vision Predictions](http://sightsound.org/papers/2020/Jesper_Haahr_Christensen_BatVision_with_GCC-PHAT_Features_for_Better_Sound_to_Vision_Predictions.pdf) - Christensen, J. H., Hornauer, S., & Yu, S. (CVPRW 2020)
* [BatVision: Learning to See 3D Spatial Layout with Two Ears](https://arxiv.org/pdf/1912.07011.pdf) - Christensen, J. H., Hornauer, S., & Yu, S. (ICRA 2020) [[dataset/code]](https://github.com/SaschaHornauer/Batvision)
* [Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds](https://arxiv.org/pdf/2003.04210.pdf) - Vasudevan, A. B., Dai, D., & Van Gool, L. (arXiv 2020) [[project page]](https://www.trace.ethz.ch/publications/2020/sound_perception/index.html)
* [Audio-Visual SfM towards 4D reconstruction under dynamic scenes](http://sightsound.org/papers/2020/Takashi_Konno_Audio-Visual_SfM_towards_4D_reconstruction_under_dynamic_scenes.pdf) - Konno, A., Nishida K., Itoyama K., Nakadai K. (CVPRW 2020)
* [Telling Left From Right: Learning Spatial Correspondence of Sight and Sound](http://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_Telling_Left_From_Right_Learning_Spatial_Correspondence_of_Sight_and_CVPR_2020_paper.pdf) - Yang, K., Russell, B., & Salamon, J. (CVPR 2020) [[project page / dataset]](https://karreny.github.io/telling-left-from-right/)
* [2.5D Visual Sound](http://openaccess.thecvf.com/content_CVPR_2019/papers/Gao_2.5D_Visual_Sound_CVPR_2019_paper.pdf) - Gao, R., & Grauman, K. (CVPR 2019) [[project page]](http://vision.cs.utexas.edu/projects/2.5D_visual_sound/) [[dataset]](https://github.com/facebookresearch/FAIR-Play) [[code]](https://github.com/facebookresearch/2.5D-Visual-Sound)
* [Self-supervised generation of spatial audio for 360 video](https://papers.nips.cc/paper/7319-self-supervised-generation-of-spatial-audio-for-360-video.pdf) - Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (NeurIPS 2018) [[project page]](https://pedro-morgado.github.io/spatialaudiogen/) [[code/dataset]](https://github.com/pedro-morgado/spatialaudiogen)
* [Self-supervised audio spatialization with correspondence classifier](https://ieeexplore.ieee.org/abstract/document/8803494?casa_token=rUiDsxJS6u0AAAAA:sLIrrVSiI-mgs5dJXroOslT5sh1nWX1dvlK-iYwV4CVVaJyqJCTWQ3Gc9BhLdBNEPVsqPLIW) - Lu, Y. D., Lee, H. Y., Tseng, H. Y., & Yang, M. H. (ICIP 2019)

#### Audio-Visual RIR
* [Self-Supervised Visual Acoustic Matching](https://arxiv.org/pdf/2307.15064.pdf) - Somayazulu, A., Chen, C., & Grauman, K. (NeurIPS 2023) [[project page]](https://vision.cs.utexas.edu/projects/ss_vam/)
* [Novel-View Acoustic Synthesis](https://openaccess.thecvf.com/content/CVPR2023/papers/Chen_Novel-View_Acoustic_Synthesis_CVPR_2023_paper.pdf) - Chen, C., Richard, A., Shapovalov, R., Ithapu, V. K., Neverova, N., Grauman, K., & Vedaldi, A. (CVPR 2023) [[code]](https://github.com/facebookresearch/novel-view-acoustic-synthesis)
* [Few-shot audio-visual learning of environment acoustics](https://proceedings.neurips.cc/paper_files/paper/2022/file/113ae3a9762ca2168f860a8501d6ae25-Paper-Conference.pdf) - Majumder, S., Chen, C., Al-Halah, Z., & Grauman, K. (NeurIPS 2022)
* [Learning Neural Acoustic Fields](https://arxiv.org/pdf/2204.00628.pdf) - Luo, A., Du, Y., Tarr, M., Tenenbaum, J., Torralba, A., & Gan, C. (NeurIPS 2022) [[code]](https://github.com/aluo-x/Learning_Neural_Acoustic_Fields)
* [Learning Audio-Visual Dereverberation](https://arxiv.org/pdf/2106.07732.pdf) - Chen, C., Sun, W., Harwath, D., & Grauman, K. (ICASSP 2023) [[code]](https://github.com/facebookresearch/learning-audio-visual-dereverberation)
* [Visual acoustic matching](https://openaccess.thecvf.com/content/CVPR2022/papers/Chen_Visual_Acoustic_Matching_CVPR_2022_paper.pdf) - Chen, C., Gao, R., Calamia, P., & Grauman, K. (CVPR 2022) [[code]](https://github.com/facebookresearch/visual-acoustic-matching)
* [Image2reverb: Cross-modal reverb impulse response synthesis](https://openaccess.thecvf.com/content/ICCV2021/papers/Singh_Image2Reverb_Cross-Modal_Reverb_Impulse_Response_Synthesis_ICCV_2021_paper.pdf) - Singh, N., Mentch, J., Ng, J., Beveridge, M., & Drori, I. (ICCV 2021). [[code]](https://github.com/nikhilsinghmus/image2reverb)
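For context on the RIR and acoustic-matching papers above: once a room impulse response has been estimated (from images, echoes, or a learned acoustic field), re-rendering a dry signal in that space is a convolution with the RIR. A toy NumPy/SciPy sketch with a synthetic impulse response follows; the sample rate, decay, and signal are purely illustrative.

```python
# Toy RIR convolution sketch (illustrative only).
import numpy as np
from scipy.signal import fftconvolve

sr = 16000
dry = np.random.randn(sr)                        # 1 s of "dry" source audio
# Toy RIR: direct path plus an exponentially decaying reverberant tail.
rir = np.zeros(int(0.3 * sr))
rir[0] = 1.0
tail = np.random.randn(rir.size) * np.exp(-np.arange(rir.size) / (0.05 * sr))
rir += 0.3 * tail

wet = fftconvolve(dry, rir)[: dry.size]          # reverberant version of the source
wet /= np.max(np.abs(wet)) + 1e-8                # normalize to avoid clipping
print(dry.shape, wet.shape)
```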

#### Audio-Visual Highlight Detection
* [Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion](https://openaccess.thecvf.com/content/ICCV2021/papers/Ye_Temporal_Cue_Guided_Video_Highlight_Detection_With_Low-Rank_Audio-Visual_Fusion_ICCV_2021_paper.pdf) - Ye, Q., Shen, X., Gao, Y., Wang, Z., Bi, Q., Li, P., & Yang, G. (ICCV 2021)
* [Joint Visual and Audio Learning for Video Highlight Detection](https://openaccess.thecvf.com/content/ICCV2021/papers/Badamdorj_Joint_Visual_and_Audio_Learning_for_Video_Highlight_Detection_ICCV_2021_paper.pdf) - Badamdorj, T., Rochan, M., Wang, Y., & Cheng, L. (ICCV 2021)

#### Audio-Visual Deepfake/Robustness
* [Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection](https://arxiv.org/abs/2210.00753) - Chen, X., et al. (SLT 2022) [[Demos]](https://xjchen.tech/Push-Pull/index.html)
* [Joint Audio-Visual Deepfake Detection](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_Joint_Audio-Visual_Deepfake_Detection_ICCV_2021_paper.pdf) - Zhou, Y., & Lim, S. N. (ICCV 2021)
* [Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?](https://arxiv.org/pdf/2104.02000.pdf) - Tian, Y., & Xu, C. (CVPR 2021) [[code]](https://github.com/YapengTian/AV-Robustness-CVPR21)

#### Lightweight Audio-Visual Model
* [Multimodal Transformer Distillation for Audio-Visual Synchronization](https://arxiv.org/abs/2210.15563) - Chen, X., et al. (ICASSP 2024) [[Code]](https://github.com/xjchenGit/MTDVocaLiST)

#### Audio-Visual Navigation/RL
* [Sound Adversarial Audio-Visual Navigation](https://openreview.net/pdf?id=NkZq4OEYN-) - Yu, Y., Huang, W., Sun, F., Chen, C., Wang, Y., & Liu, X. (ICLR 2022) [[project page]](https://yyf17.github.io/SAAVN/) [[code]](https://github.com/yyf17/SAAVN/tree/main)
* [AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments](https://papers.nips.cc/paper_files/paper/2022/hash/28f699175783a2c828ae74d53dd3da20-Abstract-Conference.html) - Paul, S., Roy-Chowdhury, A., & Cherian, A. (NeurIPS 2022)
* [Semantic Audio-Visual Navigation](https://arxiv.org/pdf/2012.11583.pdf) - Chen, C., Al-Halah, Z., & Grauman, K. (CVPR 2021) [[project page]](http://vision.cs.utexas.edu/projects/semantic-audio-visual-navigation/) [[code]](https://github.com/facebookresearch/sound-spaces/tree/master/ss_baselines/savi)
* [Learning to set waypoints for audio-visual navigation](https://arxiv.org/pdf/2008.09622.pdf) - Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (ICLR 2021) [[project page]](http://vision.cs.utexas.edu/projects/audio_visual_waypoints/) [[code]](https://github.com/facebookresearch/sound-spaces/tree/master/ss_baselines/av_wan)
* [See, hear, explore: Curiosity via audio-visual association](https://arxiv.org/pdf/2007.03669.pdf) - Dean, V., Tulsiani, S., & Gupta, A. (arXiv 2020) [[project page]](https://vdean.github.io/audio-curiosity.html) [[code]](https://github.com/vdean/audio-curiosity)
* [Audio-Visual Embodied Navigation](https://arxiv.org/pdf/1912.11474.pdf) - Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson P., Grauman, K. (ECCV 2020) [[project page]](http://vision.cs.utexas.edu/projects/audio_visual_navigation/)
* [Look, listen, and act: Towards audio-visual embodied navigation](https://arxiv.org/pdf/1912.11684.pdf) - Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (ICRA 2020) [[project page/dataset]](http://avn.csail.mit.edu/)

#### Audio-Visual Faces/Speech
* [DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation](https://openaccess.thecvf.com/content/CVPR2023/papers/Shen_DiffTalk_Crafting_Diffusion_Models_for_Generalized_Audio-Driven_Portraits_Animation_CVPR_2023_paper.pdf) - Shen, S., Zhao, W., Meng, Z., Li, W., Zhu, Z., Zhou, J., & Lu, J. (CVPR 2023) [[code]](https://github.com/sstzal/DiffTalk)
* [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_SadTalker_Learning_Realistic_3D_Motion_Coefficients_for_Stylized_Audio-Driven_Single_CVPR_2023_paper.pdf) - Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., ... & Wang, F. (CVPR 2023) [[project page]](https://sadtalker.github.io/) [[code]](https://github.com/OpenTalker/SadTalker)
* [Parametric Implicit Face Representation for Audio-Driven Facial Reenactment](https://openaccess.thecvf.com/content/CVPR2023/papers/Huang_Parametric_Implicit_Face_Representation_for_Audio-Driven_Facial_Reenactment_CVPR_2023_paper.pdf) - Huang, R., Lai, P., Qin, Y., & Li, G. (CVPR 2023)
* [Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhu_Taming_Diffusion_Models_for_Audio-Driven_Co-Speech_Gesture_Generation_CVPR_2023_paper.pdf) - Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., & Yu, L. (CVPR 2023) [[code]](https://github.com/Advocate99/DiffGesture)
* [Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring](https://openaccess.thecvf.com/content/CVPR2023/papers/Hong_Watch_or_Listen_Robust_Audio-Visual_Speech_Recognition_With_Visual_Corruption_CVPR_2023_paper.pdf) - Hong, J., Kim, M., Choi, J., & Ro, Y. M. (CVPR 2023) [[code]](https://github.com/joannahong/AV-RelScore)
* [AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction](https://openaccess.thecvf.com/content/CVPR2023/papers/Chatziagapi_AVFace_Towards_Detailed_Audio-Visual_4D_Face_Reconstruction_CVPR_2023_paper.pdf) - Chatziagapi, A., & Samaras, D. (CVPR 2023)
* [GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis](https://openreview.net/pdf?id=YfwMIDhPccD) - Ye, Z., Jiang, Z., Ren, Y., Liu, J., He, J., & Zhao, Z. (ICLR 2023) [[code]](https://geneface.github.io/)
* [Jointly Learning Visual and Auditory Speech Representations from Raw Data](https://openreview.net/pdf?id=BPwIgvf5iQ) - Haliassos, A., Ma, P., Mira, R., Petridis, S., & Pantic, M. (ICLR 2023) [[code]](https://github.com/ahaliassos/raven)
* [Audio-Driven Stylized Gesture Generation with Flow-Based Model](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136650701.pdf) - Ye, S., Wen, Y. H., Sun, Y., He, Y., Zhang, Z., Wang, Y., ... & Liu, Y. J. (ECCV 2022) [[code]](https://github.com/yesheng-THU/GFGE)
* [Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136970105.pdf) - Liu, X., Xu, Y., Wu, Q., Zhou, H., Wu, W., & Zhou, B. (ECCV 2022) [[project page]](https://alvinliu0.github.io/projects/SSP-NeRF) [[code]](https://github.com/alvinliu0/SSP-NeRF)
* [Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction](https://openreview.net/pdf?id=Z1Qlm11uOM) - Shi, B., Hsu, W. N., Lakhotia, K., & Mohamed, A. (ICLR 2022) [[code]](https://github.com/facebookresearch/av_hubert)
* [PoseKernelLifter: Metric Lifting of 3D Human Pose Using Sound](https://openaccess.thecvf.com/content/CVPR2022/papers/Yang_PoseKernelLifter_Metric_Lifting_of_3D_Human_Pose_Using_Sound_CVPR_2022_paper.pdf) - Yang, Z., Fan, X., Isler, V., & Park, H. S. (CVPR 2022)
* [Audio-Driven Neural Gesture Reenactment With Video Motion Graphs](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhou_Audio-Driven_Neural_Gesture_Reenactment_With_Video_Motion_Graphs_CVPR_2022_paper.pdf) - Zhou, Y., Yang, J., Li, D., Saito, J., Aneja, D., & Kalogerakis, E. (CVPR 2022) [[code]](https://github.com/yzhou359/vid-reenact)
* [Expressive Talking Head Generation With Granular Audio-Visual Control](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expressive_Talking_Head_Generation_With_Granular_Audio-Visual_Control_CVPR_2022_paper.pdf) - Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., ... & Wang, J. (CVPR 2022)
* [Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization](https://openaccess.thecvf.com/content/CVPR2022/papers/Jiang_Egocentric_Deep_Multi-Channel_Audio-Visual_Active_Speaker_Localization_CVPR_2022_paper.pdf) - Jiang, H., Murdock, C., & Ithapu, V. K. (CVPR 2022)
* [Audio-Driven Co-Speech Gesture Video Generation](https://papers.nips.cc/paper_files/paper/2022/file/8667f264f88c7938a73a53ab01eb1327-Paper-Conference.pdf) - Liu, X., Wu, Q., Zhou, H., Du, Y., Wu, W., Lin, D., & Liu, Z. (NeurIPS 2022) [[project page]](https://alvinliu0.github.io/projects/ANGIE) [[code]](https://github.com/alvinliu0/ANGIE)
* [Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis](https://openaccess.thecvf.com/content/CVPR2022/papers/Yang_Audio-Visual_Speech_Codecs_Rethinking_Audio-Visual_Speech_Enhancement_by_Re-Synthesis_CVPR_2022_paper.pdf) - Yang, K., Marković, D., Krenn, S., Agrawal, V., & Richard, A. (CVPR 2022) [[video]](https://www.youtube.com/watch?v=3lQ-ImYnLhc)
* [Audio2Gestures: Generating Diverse Gestures From Speech Audio With Conditional Variational Autoencoders](https://openaccess.thecvf.com/content/ICCV2021/papers/Li_Audio2Gestures_Generating_Diverse_Gestures_From_Speech_Audio_With_Conditional_Variational_ICCV_2021_paper.pdf) - Li, J., Kang, D., Pei, W., Zhe, X., Zhang, Y., He, Z., & Bao, L. (ICCV 2021) [[code]](https://github.com/JingLi513/Audio2Gestures) [[project page]](https://jingli513.github.io/audio2gestures/)
* [Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association](https://arxiv.org/pdf/2103.07293.pdf) - Wen, P., Xu, Q., Jiang, Y., Yang, Z., He, Y., & Huang, Q. (CVPR 2021) [[code]](https://github.com/KID-7391/seeking-the-shape-of-sound)
* [Audio-Driven Emotional Video Portraits](https://arxiv.org/pdf/2104.07452.pdf) - Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C. C., Cao, X., & Xu, F. (CVPR 2021) [[project page]](https://jixinya.github.io/projects/evp/) [[code]](https://github.com/jixinya/EVP/)
* [Pose-controllable talking face generation by implicitly modularized audio-visual representation](https://arxiv.org/pdf/2104.11116.pdf) - Zhou, H., Sun, Y., Wu, W., Loy, C. C., Wang, X., & Liu, Z. (CVPR 2021) [[project page]](https://hangz-nju-cuhk.github.io/projects/PC-AVS) [[code]](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS)
* [One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing](https://nvlabs.github.io/face-vid2vid/main.pdf) - Wang, T. C., Mallya, A., & Liu, M. Y. (CVPR 2021) [[project page]](https://nvlabs.github.io/face-vid2vid/)
* [Unsupervised audiovisual synthesis via exemplar autoencoders](https://openreview.net/pdf?id=43VKWxg_Sqr) - Deng, K., Bansal, A., & Ramanan, D. (ICLR 2021) [[project page]](https://dunbar12138.github.io/projectpage/Audiovisual/)
* [Mead: A large-scale audio-visual dataset for emotional talking-face generation](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123660698.pdf) - Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao Y., Loy, C. C. (ECCV 2020) [[project page/dataset]](https://wywu.github.io/projects/MEAD/MEAD.html)
* [Discriminative Multi-modality Speech Recognition](http://openaccess.thecvf.com/content_CVPR_2020/papers/Xu_Discriminative_Multi-Modality_Speech_Recognition_CVPR_2020_paper.pdf) - Xu, B., Lu, C., Guo, Y., & Wang, J. (CVPR 2020)
* [Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis](http://openaccess.thecvf.com/content_CVPR_2020/papers/Prajwal_Learning_Individual_Speaking_Styles_for_Accurate_Lip_to_Speech_Synthesis_CVPR_2020_paper.pdf) - Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (CVPR 2020) [[project page/dataset]](http://cvit.iiit.ac.in/research/projects/cvit-projects/speaking-by-observing-lip-movements#) [[code]](https://github.com/Rudrabha/Lip2Wav)
* [DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads](http://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_DAVD-Net_Deep_Audio-Aided_Video_Decompression_of_Talking_Heads_CVPR_2020_paper.pdf) - Zhang, X., Wu, X., Zhai, X., Ben, X., & Tu, C. (CVPR 2020)
* [Learning to Have an Ear for Face Super-Resolution](http://openaccess.thecvf.com/content_CVPR_2020/papers/Meishvili_Learning_to_Have_an_Ear_for_Face_Super-Resolution_CVPR_2020_paper.pdf) - Meishvili, G., Jenni, S., & Favaro, P. (CVPR 2020) [[project page]](https://gmeishvili.github.io/ear_for_face_super_resolution/index.html) [[code]](https://github.com/gmeishvili/ear_for_face_super_resolution)
* [ASR is all you need: Cross-modal distillation for lip reading](https://arxiv.org/pdf/1911.12747.pdf) - Afouras, T., Chung, J. S., & Zisserman, A. (ICASSP 2020)
* [Visually guided self supervised learning of speech representations](https://arxiv.org/pdf/2001.04316.pdf) - Shukla, A., Vougioukas, K., Ma, P., Petridis, S., & Pantic, M. (ICASSP 2020)
* [Disentangled Speech Embeddings using Cross-modal Self-supervision](https://arxiv.org/pdf/2002.08742.pdf) - Nagrani, A., Chung, J. S., Albanie, S., & Zisserman, A. (ICASSP 2020)
* [Animating Face using Disentangled Audio Representations](http://openaccess.thecvf.com/content_WACV_2020/papers/Mittal_Animating_Face_using_Disentangled_Audio_Representations_WACV_2020_paper.pdf) - Mittal, G., & Wang, B. (WACV 2020)
* [Deep Audio-Visual Speech Recognition](http://www.robots.ox.ac.uk/~vgg/publications/2019/Afouras19/afouras18c.pdf) - T. Afouras, J.S. Chung*, A. Senior, O. Vinyals, A. Zisserman (TPAMI 2019)
* [Reconstructing faces from voices](https://arxiv.org/pdf/1905.10604.pdf) - Yandong Wen, Rita Singh, Bhiksha Raj (NeurIPS 2019) [[project page]](https://github.com/cmu-mlsp/reconstructing_faces_from_voices)
* [Learning Individual Styles of Conversational Gesture](https://arxiv.org/pdf/1906.04160.pdf) - Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (CVPR 2019) [[project page]](http://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/) [[dataset]](https://drive.google.com/drive/folders/1qvvnfGwas8DUBrwD4DoBnvj8anjSLldZ)
* [Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss](https://www.cs.rochester.edu/u/lchen63/cvpr2019.pdf) - Chen, L., Maddox, R. K., Duan, Z., & Xu, C. (CVPR 2019)[[project page]](https://github.com/lelechen63/ATVGnet)
* [Speech2Face: Learning the Face Behind a Voice](https://arxiv.org/pdf/1905.09773.pdf) - Oh, T. H., Dekel, T., Kim, C., Mosseri, I., Freeman, W. T., Rubinstein, M., & Matusik, W. (CVPR 2019)[[project page]](https://speech2face.github.io/)
* [My lips are concealed: Audio-visual speech enhancement through obstructions](https://arxiv.org/pdf/1907.04975.pdf) - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2019) [[project page]](http://www.robots.ox.ac.uk/~vgg/research/concealed/)
* [Talking Face Generation by Adversarially Disentangled Audio-Visual Representation](https://arxiv.org/pdf/1807.07860.pdf) - Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang (AAAI 2019) [[project page]](https://liuziwei7.github.io/projects/TalkingFace) [[code]](https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS)
* [Disjoint mapping network for cross-modal matching of voices and faces](https://openreview.net/pdf?id=B1exrnCcF7) - Wen, Y., Ismail, M. A., Liu, W., Raj, B., & Singh, R. (ICLR 2019)[[project page]](https://github.com/ydwen/DIMNet)
* [X2Face: A network for controlling face generation using images, audio, and pose codes](http://openaccess.thecvf.com/content_ECCV_2018/papers/Olivia_Wiles_X2Face_A_network_ECCV_2018_paper.pdf) - Wiles, O., Sophia Koepke, A., & Zisserman, A. (ECCV 2018)[[project page]](http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html)[[code]](https://github.com/oawiles/X2Face)
* [Learnable PINs: Cross-Modal Embeddings for Person Identity](https://arxiv.org/pdf/1805.00833.pdf) - Nagrani, A., Albanie, S., & Zisserman, A. (ECCV 2018)[[project page]](http://www.robots.ox.ac.uk/~vgg/research/LearnablePins/)
* [Seeing voices and hearing faces: Cross-modal biometric matching](http://www.robots.ox.ac.uk/~vgg/publications/2018/Nagrani18a/nagrani18a.pdf) - Nagrani, A., Albanie, S., & Zisserman, A. (CVPR 2018) [[project page]](http://www.robots.ox.ac.uk/~vgg/research/CMBiometrics/)[[code]](https://github.com/a-nagrani/SVHF-Net) (trained model only)
* [Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation](https://arxiv.org/pdf/1804.03619.pdf) - Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T. and Rubinstein, M., (SIGGRAPH 2018) [[project page]](https://looking-to-listen.github.io/)
* [The Conversation: Deep Audio-Visual Speech Enhancement](http://www.robots.ox.ac.uk/~vgg/demo/theconversation) - Afouras, T., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [[project page]](http://www.robots.ox.ac.uk/~vgg/demo/theconversation/)
* [VoxCeleb2: Deep Speaker Recognition](https://arxiv.org/pdf/1806.05622.pdf) - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2018) [[dataset]](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
* [You said that?](http://www.robots.ox.ac.uk/~vgg/publications/2017/Chung17b/chung17b.pdf) - Son Chung, J., Jamaludin, A., & Zisserman, A. (BMVC 2017) [[project page]](http://www.robots.ox.ac.uk/~vgg/software/yousaidthat/) [[code]](https://github.com/joonson/yousaidthat)(trained model, evaluation code)
* [VoxCeleb: a large-scale speaker identification dataset](http://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf) - Nagrani, A., Chung, J. S., & Zisserman, A. (INTERSPEECH 2017) [[project page]](http://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/)[[code]](https://github.com/a-nagrani/VGGVox) [[dataset]](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
* [Out of time: automated lip sync in the wild](https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf) - J.S. Chung & A. Zisserman (ACCVW 2016)
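A basic building block behind the lip-sync work listed here (e.g., "Out of time: automated lip sync in the wild") is scoring audio-visual synchrony by sliding audio features against video features over candidate temporal offsets and picking the best match. The sketch below assumes per-frame embeddings from two encoders already exist (mocked with random tensors); the offset range and dimensions are illustrative assumptions.

```python
# SyncNet-style offset scoring sketch (illustrative assumptions only).
import torch
import torch.nn.functional as F

def best_av_offset(video_feats: torch.Tensor, audio_feats: torch.Tensor,
                   max_offset: int = 10):
    """video_feats, audio_feats: (T, D) per-frame embeddings on a shared clock."""
    v = F.normalize(video_feats, dim=1)
    a = F.normalize(audio_feats, dim=1)
    T = v.size(0)
    scores = {}
    for off in range(-max_offset, max_offset + 1):
        # Overlapping region of the two sequences at this offset.
        v_slice = v[max(0, -off): T - max(0, off)]
        a_slice = a[max(0, off): T - max(0, -off)]
        scores[off] = float((v_slice * a_slice).sum(dim=1).mean())
    best = max(scores, key=scores.get)
    return best, scores[best]

if __name__ == "__main__":
    T, D = 100, 256
    off, score = best_av_offset(torch.randn(T, D), torch.randn(T, D))
    print(off, score)
```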

#### Audio-Visual Learning of Scene Acoustics
* [INRAS: Implicit Neural Representations of Audio Scenes](https://openreview.net/pdf?id=7KBzV5IL7W) - Su, K.\*, Chen, M.\*, Shlizerman, E. (NeurIPS 2022)
* [Learning Neural Acoustic Fields](https://openreview.net/pdf?id=lkQ7meEa-qv) - Luo, A., Du, Y., Tarr, M., Tenenbaum, J., Torralba, A., & Gan, C. (NeurIPS 2022) [[code]](https://github.com/aluo-x/Learning_Neural_Acoustic_Fields) [[project page]](https://www.andrew.cmu.edu/user/afluo/Neural_Acoustic_Fields/)

#### Audio-Visual Question Answering
* [PACS: A Dataset for Physical Audiovisual CommonSense Reasoning](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136970286.pdf) - Yu, S., Wu, P., Liang, P. P., Salakhutdinov, R., & Morency, L. P. (ECCV 2022) [[code]](https://github.com/samuelyu2002/PACS)
* [Learning To Answer Questions in Dynamic Audio-Visual Scenarios](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Learning_To_Answer_Questions_in_Dynamic_Audio-Visual_Scenarios_CVPR_2022_paper.pdf) - Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J. R., & Hu, D. (CVPR 2022) [[project page]](https://gewu-lab.github.io/MUSIC-AVQA/) [[code]](https://github.com/GeWu-Lab/MUSIC-AVQA)

#### Cross-modal Generation (Audio-Video / Video-Audio)
* [Conditional Generation of Audio From Video via Foley Analogies](https://openaccess.thecvf.com/content/CVPR2023/papers/Du_Conditional_Generation_of_Audio_From_Video_via_Foley_Analogies_CVPR_2023_paper.pdf) - Du, Y., Chen, Z., Salamon, J., Russell, B., & Owens, A. (CVPR 2023) [[project page]](https://xypb.github.io/CondFoleyGen/)
* [Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment](https://openaccess.thecvf.com/content/CVPR2023/papers/Sung-Bin_Sound_to_Visual_Scene_Generation_by_Audio-to-Visual_Latent_Alignment_CVPR_2023_paper.pdf) - Sung-Bin, K., Senocak, A., Ha, H., Owens, A., & Oh, T. H. (CVPR 2023) [[project page]](https://sound2scene.github.io/)
* [How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos](https://proceedings.neurips.cc/paper/2021/file/f4e369c0a468d3aeeda0593ba90b5e55-Paper.pdf) - Su, K., Liu, X., & Shlizerman, E. (NeurIPS 2021)
* [AI Choreographer: Music Conditioned 3D Dance Generation with AIST++](https://google.github.io/aichoreographer/) - Li, R., Yang, S., Ross, D. A., & Kanazawa, A. (ICCV 2021) [[code]](https://github.com/google-research/mint) [[project page]](https://google.github.io/aichoreographer/) [[dataset]](https://google.github.io/aistplusplus_dataset)
* [Sound2Sight: Generating Visual Dynamics from Sound and Context](https://arxiv.org/pdf/2007.12130.pdf) - Cherian, A., Chatterjee, M., & Ahuja, N. (ECCV 2020)
* [Generating Visually Aligned Sound from Videos](https://ieeexplore.ieee.org/document/9151258) - Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (IEEE Transactions on Image Processing 2020)
* [Audeo: Audio Generation for a Silent Performance Video](https://arxiv.org/pdf/2006.14348.pdf) - Su, K., Liu, X., & Shlizerman, E. (NeurIPS 2020)
* [Foley Music: Learning to Generate Music from Videos](https://arxiv.org/pdf/2007.10984.pdf) - Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (ECCV 2020) [[project page]](http://foley-music.csail.mit.edu/)
* [Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9052918) - Tan, H., Wu, G., Zhao, P., & Chen, Y. (ICASSP 2020)
* [Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck](http://openaccess.thecvf.com/content_ICCV_2019/papers/Ma_Unpaired_Image-to-Speech_Synthesis_With_Multimodal_Information_Bottleneck_ICCV_2019_paper.pdf) - Ma, S., McDuff, D., & Song, Y. (ICCV 2019) [[code]](https://github.com/yunyikristy/skipNet)
* [Listen to the Image](https://arxiv.org/pdf/1904.09115.pdf) - Hu, D., Wang, D., Li, X., Nie, F., & Wang, Q. (CVPR 2019)
* [Cascade attention guided residue learning GAN for cross-modal translation](https://arxiv.org/pdf/1907.01826.pdf) - Duan, B., Wang, W., Tang, H., Latapie, H., & Yan, Y. (arXiv 2019) [[code]](https://github.com/tuffr5/CAR-GAN)
* [Visual to Sound: Generating Natural Sound for Videos in the Wild](http://openaccess.thecvf.com/content_cvpr_2018/papers/Zhou_Visual_to_Sound_CVPR_2018_paper.pdf) - Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (CVPR 2018) [[project page]](http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.html)
* [Image generation associated with music data](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w49/Qiu_Image_Generation_Associated_CVPR_2018_paper.pdf) - Qiu, Y., & Kataoka, H. (CVPRW 2018)
* [CMCGAN: A uniform framework for cross-modal visual-audio mutual generation](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17153/16274) - Hao, W., Zhang, Z., & Guan, H. (AAAI 2018)

#### Audio-Visual Stylization/Generation
* [MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation](https://openaccess.thecvf.com/content/CVPR2023/papers/Ruan_MM-Diffusion_Learning_Multi-Modal_Diffusion_Models_for_Joint_Audio_and_Video_CVPR_2023_paper.pdf) - Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., ... & Guo, B. (CVPR 2023) [[code]](https://github.com/researchmm/MM-Diffusion)
* [MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration](https://arxiv.org/pdf/2204.08058.pdf) - Hayes, T., Zhang, S., Yin, X., Pang, G., Sheng, S., Yang, H., ... & Parikh, D. (ECCV 2022) [[project page]](https://mugen-org.github.io/) [[code]](https://github.com/mugen-org/MUGEN_baseline)
* [Learning visual styles from audio-visual associations](https://arxiv.org/pdf/2205.05072.pdf) - Li, T., Liu, Y., Owens, A., & Zhao, H. (ECCV 2022) [[project page]](https://tinglok.netlify.app/files/avstyle/) [[code]](https://github.com/Tinglok/avstyle)
* [Sound-Guided Semantic Image Manipulation](https://openaccess.thecvf.com/content/CVPR2022/papers/Lee_Sound-Guided_Semantic_Image_Manipulation_CVPR_2022_paper.pdf) - Lee, S. H., Roh, W., Byeon, W., Yoon, S. H., Kim, C., Kim, J., & Kim, S. (CVPR 2022) [[project page]](https://kuai-lab.github.io/cvpr2022sound/) [[code]](https://github.com/kuai-lab/sound-guided-semantic-image-manipulation)

#### Multi-modal Architectures
* [What Makes Training Multi-Modal Networks Hard?](https://arxiv.org/pdf/1905.12681.pdf) - Wang, W., Tran, D., & Feiszli, M. (arXiv 2019)
* [MFAS: Multimodal Fusion Architecture Search](http://openaccess.thecvf.com/content_CVPR_2019/papers/Perez-Rua_MFAS_Multimodal_Fusion_Architecture_Search_CVPR_2019_paper.pdf) - Pérez-Rúa, J. M., Vielzeuf, V., Pateux, S., Baccouche, M., & Jurie, F. (CVPR 2019)

#### Uncategorized Papers
* [CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective](https://openaccess.thecvf.com/content/CVPR2023/papers/Xiong_CASP-Net_Rethinking_Video_Saliency_Prediction_From_an_Audio-Visual_Consistency_Perceptual_CVPR_2023_paper.pdf) - Xiong, J., Wang, G., Zhang, P., Huang, W., Zha, Y., & Zhai, G. (CVPR 2023)
* [Self-Supervised Video Forensics by Audio-Visual Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2023/papers/Feng_Self-Supervised_Video_Forensics_by_Audio-Visual_Anomaly_Detection_CVPR_2023_paper.pdf) - Feng, C., Chen, Z., & Owens, A. (CVPR 2023) [[code]](https://github.com/cfeng16/audio-visual-forensics)
* [Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136680262.pdf) - Van Horn, G., Qian, R., Wilber, K., Adam, H., Mac Aodha, O., & Belongie, S. (ECCV 2022) [[code]](https://github.com/visipedia/ssw60)
* [Learning Audio-Video Modalities from Image Captions](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136740396.pdf) - Nagrani, A., Seo, P. H., Seybold, B., Hauth, A., Manen, S., Sun, C., & Schmid, C. (ECCV 2022) [[project page]](https://a-nagrani.github.io/videocc.html) [[dataset]](https://github.com/google-research-datasets/videoCC-data)
* [MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions](https://openaccess.thecvf.com/content/CVPR2022/papers/Soldan_MAD_A_Scalable_Dataset_for_Language_Grounding_in_Videos_From_CVPR_2022_paper.pdf) - Soldan, M., Pardo, A., Alcázar, J. L., Caba, F., Zhao, C., Giancola, S., & Ghanem, B. (CVPR 2022) [[code]](https://github.com/Soldelli/MAD)
* [Finding Fallen Objects via Asynchronous Audio-Visual Integration](https://openaccess.thecvf.com/content/CVPR2022/papers/Gan_Finding_Fallen_Objects_via_Asynchronous_Audio-Visual_Integration_CVPR_2022_paper.pdf) - Gan, C., Gu, Y., Zhou, S., Schwartz, J., Alter, S., Traer, J., ... & Torralba, A. (CVPR 2022) [[code]](https://github.com/chuangg/find_fallen_objects)
* [Audio-Visual Floorplan Reconstruction](https://www.cs.cmu.edu/~spurushw/publication/avmap/) - S. Purushwalkam, S. V. A. Gari, V. K. Ithapu, C. Schissler, P. Robinson, A. Gupta, K. Grauman (ICCV 2021) [[code]](https://github.com/senthilps8/avmap) [[project page]](https://www.cs.cmu.edu/~spurushw/publication/avmap/)
* [GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition]() - (CVPR 2021)
* [There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge](http://multimodal-distill.cs.uni-freiburg.de/) - Valverde, F. R., Hurtado, J. V., & Valada, A. (CVPR 2021) [[code]](https://github.com/robot-learning-freiburg/MM-DistillNet) [[project page/dataset]](http://multimodal-distill.cs.uni-freiburg.de/)
* [Sight to sound: An end-to-end approach for visual piano transcription](http://www.robots.ox.ac.uk/~vgg/publications/2020/Koepke20/koepke20.pdf) - Koepke, A. S., Wiles, O., Moses, Y., & Zisserman, A. (ICASSP 2020) [[project page/dataset]](https://www.robots.ox.ac.uk/~vgg/research/sighttosound/)
* [Solos: A Dataset for Audio-Visual Music Analysis](https://arxiv.org/pdf/2006.07931.pdf) - Montesinos, J. F., Slizovskaia, O., & Haro, G. (arXiv 2020) [[project page]](https://www.juanmontesinos.com/Solos/) [[dataset]](https://github.com/JuanFMontesinos/Solos)
* [Cross-Task Transfer for Multimodal Aerial Scene Recognition](https://arxiv.org/pdf/2005.08449.pdf) - Hu, D., Li, X., Mou, L., Jin, P., Chen, D., Jing, L., ... & Dou, D. (arXiv 2020) [[code]](https://github.com/DTaoo/Multimodal-Aerial-Scene-Recognition) [[dataset]](https://zenodo.org/record/3828124)
* [STAViS: Spatio-Temporal AudioVisual Saliency Network](http://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf) - Tsiami, A., Koutras, P., & Maragos, P. (CVPR 2020) [[code]](https://github.com/atsiami/STAViS)
* [AlignNet: A Unifying Approach to Audio-Visual Alignment](http://openaccess.thecvf.com/content_WACV_2020/papers/Wang_AlignNet_A_Unifying_Approach_to_Audio-Visual_Alignment_WACV_2020_paper.pdf) - Wang, J., Fang, Z., & Zhao, H. (WACV 2020) [[project page]](https://jianrenw.github.io/AlignNet/) [[code]](https://github.com/zfang399/AlignNet)
* [Self-supervised Moving Vehicle Tracking with Stereo Sound](http://openaccess.thecvf.com/content_ICCV_2019/papers/Gan_Self-Supervised_Moving_Vehicle_Tracking_With_Stereo_Sound_ICCV_2019_paper.pdf) - Gan, C., Zhao, H., Chen, P., Cox, D., & Torralba, A. (ICCV 2019) [[project page/dataset]](http://sound-track.csail.mit.edu/)
* [Vision-Infused Deep Audio Inpainting](http://openaccess.thecvf.com/content_ICCV_2019/papers/Zhou_Vision-Infused_Deep_Audio_Inpainting_ICCV_2019_paper.pdf) - Zhou, H., Liu, Z., Xu, X., Luo, P., & Wang, X. (ICCV 2019) [[project page]](https://hangz-nju-cuhk.github.io/projects/AudioInpainting) [[code]](https://github.com/Hangz-nju-cuhk/Vision-Infused-Audio-Inpainter-VIAI)
* [ISNN: Impact Sound Neural Network for Audio-Visual Object Classification](http://openaccess.thecvf.com/content_ECCV_2018/papers/Auston_Sterling_ISNN_-_Impact_ECCV_2018_paper.pdf) - Sterling, A., Wilson, J., Lowe, S., & Lin, M. C. (ECCV 2018) [[project page]](http://gamma.cs.unc.edu/ISNN/) [[dataset1]](https://drive.google.com/drive/folders/1lVYYMTOLItozaU-vAa0OoXQr_4VfmqVg?usp=sharing)[[dataset2]](https://drive.google.com/drive/folders/1p0HAlrsoydYFuWginPuhiDQjncAvSQon?usp=sharing) [[model]](http://gamma.cs.unc.edu/ISNN/ModelNet_URLs_WAV_PLY.pdf)
* [Audio to Body Dynamics](http://openaccess.thecvf.com/content_cvpr_2018/papers/Shlizerman_Audio_to_Body_CVPR_2018_paper.pdf) - Shlizerman, E., Dery, L., Schoen, H., & Kemelmacher-Shlizerman, I. (CVPR 2018) [[project page]](https://arviolin.github.io/AudioBodyDynamics/)[[code]](https://github.com/facebookresearch/Audio2BodyDynamics)
* [A Multimodal Approach to Mapping Soundscapes](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w49/Salem_A_Multimodal_Approach_CVPR_2018_paper.pdf) - Salem, T., Zhai, M., Workman, S., & Jacobs, N. (CVPRW 2018) [[project page]](http://cs.uky.edu/~salem/audio-mapping/)
* [Shape and material from sound](https://papers.nips.cc/paper/6727-shape-and-material-from-sound.pdf) - Zhang, Z., Li, Q., Huang, Z., Wu, J., Tenenbaum, J., & Freeman, B. (NeurIPS 2017)

## Datasets
#### General Audio-Visual Tasks
* [AudioSet](https://research.google.com/audioset/) - Audio-Visual Classification
* [MUSIC](https://github.com/roudimit/MUSIC_dataset) - Audio-Visual Source Separation
* [AudioSetZSL](https://github.com/krantiparida/AudioSetZSL) - Audio-Visual Zero-shot Learning
* [Visually Engaged and Grounded AudioSet (VEGAS)](http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.html) - Sound generation from video
* [SoundNet-Flickr](http://soundnet.csail.mit.edu/) - Image-Audio pair for cross-modal learning
* [Audio-Visual Event (AVE)](https://sites.google.com/view/audiovisualresearch) - Audio-Visual Event Localization
* [AudioSet Single Source](http://vision.cs.utexas.edu/projects/separating_object_sounds/#data) - Subset of AudioSet videos containing only a single sounding object
* [Kinetics-Sounds](https://arxiv.org/pdf/1705.08168.pdf) - Subset of Kinetics dataset
* [EPIC-Kitchens](https://epic-kitchens.github.io/2019) - Egocentric Audio-Visual Action Recognition
* [Audio-Visually Indicated Actions Dataset](https://pavis.iit.it/datasets/audio-visually-indicated-actions-dataset) - Multimodal dataset (RGB, acoustic data as raw audio) acquired using an acoustic-optical camera
* [IMSDb dataset](https://www.robots.ox.ac.uk/~vgg/research/speech2action/) - Movie scripts downloaded from [The Internet Movie Script Database](https://www.imsdb.com)
* [YOUTUBE-ASMR-300K dataset](https://karreny.github.io/telling-left-from-right/) - ASMR videos collected from YouTube that contain stereo audio
* [FAIR-Play](https://github.com/facebookresearch/FAIR-Play) - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
* [VGG-Sound](http://www.robots.ox.ac.uk/~vgg/data/vggsound/) - Audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
* [XD-Violence](https://roc-ng.github.io/XD-Violence/) - weakly annotated dataset for audio-visual violence detection
* [AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE)](https://akchen.github.io/ADVANCE-DATASET/) - Geotagged aerial images and sounds, classified into 13 scene classes
* [auDIoviSual Crowd cOunting dataset (DISCO)](https://zenodo.org/record/3828468) - 1,935 images and corresponding audio clips from various typical scenes, with a total of 170,270 instances annotated with head locations.
* [MUSIC-Synthetic dataset](https://zenodo.org/record/4079386#.X4PFodozbb2) - Category-balanced multi-source videos created by artificially synthesizing solo videos from the [MUSIC](https://github.com/roudimit/MUSIC_dataset) dataset, to facilitate the learning and evaluation of multiple-sound-source localization in the cocktail-party scenario.
* [ACAV100M](https://acav100m.github.io/) - 100 million 10-second clips (31 years of content) with high audio-visual correspondence, automatically curated from 140 million full-length videos (total duration 1,030 years).
* [AIST++](https://google.github.io/aichoreographer/) - A large-scale 3D human dance motion dataset containing a wide variety of 3D motion paired with music. It is built upon the AIST Dance Database, an uncalibrated multi-view collection of dance videos.
* [VideoCC](https://github.com/google-research-datasets/videoCC-data) - A dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
* [ssw60](https://github.com/visipedia/ssw60) - A dataset for research on audiovisual fine-grained categorization. It covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets along with new, expert-curated audio and video data.
* [PACS](https://github.com/samuelyu2002/PACS) - A dataset designed to help create and evaluate a new generation of AI algorithms able to reason about physical commonsense using both audio and visual modalities.
* [AVSBench](http://www.avlbench.opennlplab.cn/download) - A dataset for the audio-visual pixel-wise segmentation task.
* [UnAV-100](https://unav100.github.io/) - More than 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video often contains multiple concurrent events of varying duration, as in real-life audio-visual scenes.

#### Face-Voice Dataset
* [VoxCeleb](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) - Audio-Visual Speaker Identification, available in two versions (VoxCeleb1 and VoxCeleb2)
* [EmoVoxCeleb](http://www.robots.ox.ac.uk/~vgg/research/cross-modal-emotions/)
* [Speech2Gesture](http://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/) - Gesture prediction from speech
* [AVSpeech](https://looking-to-listen.github.io/avspeech/)
* [LRW Dataset](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html)
* [LRS2](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html), [LRS3](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html), [LRS3 Language](http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3-lang.html) - Lip Reading Datasets

## Licenses

[![CC0](http://i.creativecommons.org/p/zero/1.0/88x31.png)](http://creativecommons.org/publicdomain/zero/1.0/)

To the extent possible under law, [Kranti Kumar Parida](https://krantiparida.github.io/) has waived all copyright and related or neighboring rights to this work.

## Contributing
Please feel free to send me [pull requests](https://github.com/krantiparida/awesome-audio-visual/pulls) or email ([email protected]) to add links, fix incorrect ones, or report broken links.