https://github.com/EmulationAI/awesome-large-audio-models

Collection of resources on the applications of Large Language Models (LLMs) in Audio AI.
List: awesome-large-audio-models

audio-ai audio-processing automatic-speech-recognition foundational-models large-audio-models large-language-model-speech large-language-models music-ai music-information-retrieval music-processing speech-ai speech-llms speech-to-text


[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/EmulationAI/awesome-large-audio-models/graphs/commit-activity)
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)
[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

![](LLMs_AudioAI.jpg)


This repo supplements our survey paper: [Sparks of Large Audio Models: A Survey and Outlook](https://arxiv.org/abs/2308.12792).

Authors: [Siddique Latif](https://scholar.google.com/citations?user=Scq5ADcAAAAJ), [Moazzam Shoukat](https://scholar.google.com/citations?user=uU550yYAAAAJ&hl=en), [Fahad Shamshad](https://scholar.google.com.pk/citations?user=d7QL4wkAAAAJ&hl=en), [Muhammad Usama](https://scholar.google.com/citations?user=unGWVYMAAAAJ&hl=en), [Yi Ren](https://scholar.google.com/citations?user=4FA6C0AAAAAJ&hl=zh-CN), [Heriberto Cuayahuitl](https://scholar.google.com/citations?user=zDlQNDgAAAAJ&hl=en), [Xulong Zhang](https://scholar.google.com/citations?hl=en&user=1XKLPoAAAAAJ), [Roberto Togneri](https://scholar.google.com.au/citations?user=uPELUScAAAAJ&hl=en), [Wenwu Wang](https://scholar.google.co.uk/citations?user=JQFnV5IAAAAJ&hl=en), [Bjorn Schuller](https://scholar.google.com/citations?user=TxKNCSoAAAAJ&hl=en).

> **Abstract:** *This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, recently these Foundational Audio Models, like SeamlessM4T, have started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems.*

![](timeline_audioAI.png)


# Awesome Large Language Models in Audio AI

![](LLMS_TimeLine.jpg)

A curated list of large AI models for audio signal processing, inspired by other awesome-list initiatives. We intend to update this page regularly with the latest relevant papers and their open-source implementations.

## Overview
- [Survey Papers](#survey-papers)
- [Popular Large Audio Models](#popular-large-audio-models)
- [Automatic Speech Recognition (ASR)](#automatic-speech-recognition-asr)
- [Neural Speech Synthesis](#neural-speech-synthesis)
- [Speech Translation (ST)](#speech-translation-st)
- [Other Speech Applications](#other-speech-applications)
- [Large Audio Models in Music](#large-audio-models-in-music)
- [Audio Datasets](#audio-datasets)

# Survey Papers

**A review of deep learning techniques for speech processing** [2023].
*Ambuj Mehrish, Navonil Majumder, Rishabh Bharadwaj, Rada Mihalcea, Soujanya Poria*
[[PDF](https://www.sciencedirect.com/science/article/abs/pii/S1566253523001859)]

**A survey on deep reinforcement learning for audio-based applications** [2023].
*Latif, Siddique and Cuayáhuitl, Heriberto and Pervez, Farrukh and Shamshad, Fahad and Ali, Hafiz Shehbaz and Cambria, Erik*
[[PDF](https://link.springer.com/article/10.1007/s10462-022-10224-2)]

**A Survey of Large Language Models** [2023].
*Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen*
[[PDF](https://arxiv.org/abs/2303.18223)]

**A survey on evaluation of large language models** [2023].
*Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie*
[[PDF](https://arxiv.org/abs/2307.03109)]

**Challenges and Applications of Large Language Models** [2023].
*Kaddour, Jean and Harris, Joshua and Mozes, Maximilian and Bradley, Herbie and Raileanu, Roberta and McHardy, Robert*
[[PDF](https://arxiv.org/abs/2307.10169)]

**Aligning Large Language Models with Human: A Survey** [2023].
*Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu*
[[PDF](https://arxiv.org/abs/2307.12966)]

**A Comprehensive Survey on Segment Anything Model for Vision and Beyond** [2023].
*Zhang, Chunhui and Liu, Li and Cui, Yawen and Huang, Guanjie and Lin, Weilin and Yang, Yiqian and Hu, Yuehong*
[[PDF](https://arxiv.org/abs/2305.08196)]

**Vision-language models for vision tasks: A survey** [2023].
*Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian*
[[PDF](https://arxiv.org/abs/2304.00685)]

**Foundational Models Defining a New Era in Vision: A Survey and Outlook** [2023].
*Awais, Muhammad and Naseer, Muzammal and Khan, Salman and Anwer, Rao Muhammad and Cholakkal, Hisham and Shah, Mubarak and Yang, Ming-Hsuan and Khan, Fahad Shahbaz*
[[PDF](https://arxiv.org/abs/2307.13721)]

**Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models** [2023].
*Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, Victor Tseng*
[[PDF](https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000198)]

**Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education** [2023].
*Junaid Qadir*
[[PDF](https://ieeexplore.ieee.org/abstract/document/10125121)]

**ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?** [2023].
*Jürgen Rudolph, Samson Tan, Shannon Tan*
[[PDF](https://journals.sfu.ca/jalt/index.php/jalt/article/download/689/539/3059)]

**Foundation models for generalist medical artificial intelligence** [2023].
*Moor, Michael and Banerjee, Oishi and Abad, Zahra Shakeri Hossein and Krumholz, Harlan M and Leskovec, Jure and Topol, Eric J and Rajpurkar, Pranav*
[[PDF](https://www.nature.com/articles/s41586-023-05881-4)]

**Large AI models in health informatics: Applications, challenges, and the future** [2023].
*Jianing Qiu, Lin Li, Jiankai Sun, Jiachuan Peng, Peilun Shi, Ruiyang Zhang, Yinzhao Dong, Kyle Lam, Frank P.-W. Lo, Bo Xiao, Wu Yuan, Dong Xu, Benny Lo*
[[PDF](https://arxiv.org/abs/2303.11568)]

**The shaky foundations of large language models and foundation models for electronic health records** [2023].
*Michael Wornow, Yizhe Xu, Rahul Thapa, Birju Patel, Ethan Steinberg, Scott Fleming, Michael A. Pfeffer, Jason Fries & Nigam H. Shah*
[[PDF](https://www.nature.com/articles/s41746-023-00879-8)]

**On the Challenges and Perspectives of Foundation Models for Medical Image Analysis** [2023].
*Shaoting Zhang, Dimitris Metaxas*
[[PDF](https://arxiv.org/abs/2306.05705)]

**Survey of Protein Sequence Embedding Models** [2023].
*Chau Tran, Siddharth Khadkikar, Aleksey Porollo*
[[PDF](https://www.mdpi.com/1422-0067/24/4/3775)]

**A Short Survey of Viewing Large Language Models in Legal Aspect** [2023].
*Zhongxiang Sun*
[[PDF](https://arxiv.org/abs/2303.09136)]

**Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence** [2023].
*John J. Nay, David Karamardian, Sarah B. Lawsky, Wenting Tao, Meghana Bhat, Raghav Jain, Aaron Travis Lee, Jonathan H. Choi, Jungo Kasai*
[[PDF](https://arxiv.org/abs/2306.07075)]

**Foundation Models for Decision Making: Problems, Methods, and Opportunities** [2023].
*Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, Dale Schuurmans*
[[PDF](https://arxiv.org/abs/2303.04129)]

**Transformers in speech processing: A survey** [2023].
*Siddique Latif, Aun Zaidi, Heriberto Cuayahuitl, Fahad Shamshad, Moazzam Shoukat, Junaid Qadir*
[[PDF](https://arxiv.org/abs/2303.11607)]

**On the opportunities and risks of foundation models** [2022].
*Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, et al.*
[[PDF](https://arxiv.org/abs/2108.07258)]

**Vision-language pre-training: Basics, recent advances, and future trends** [2022].
*Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao*
[[PDF](https://www.nowpublishers.com/article/Details/CGV-105)]

**ChatGPT for good? On opportunities and challenges of large language models for education** [2022].
*Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn, Gjergji Kasneci*
[[PDF](https://www.sciencedirect.com/science/article/pii/S1041608023000195)]

**Protein language models and structure prediction: Connection and progression** [2022].
*Bozhen Hu, Jun Xia, Jiangbin Zheng, Cheng Tan, Yufei Huang, Yongjie Xu, Stan Z. Li*
[[PDF](https://arxiv.org/abs/2211.16742)]

**A human being wrote this law review article: GPT-3 and the practice of law** [2022].
*Amy B. Cyphert*
[[PDF](https://researchrepository.wvu.edu/cgi/viewcontent.cgi?article=1099&context=law_faculty)]

**A comparative study on transformer vs rnn in speech applications** [2019].
*Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang*
[[PDF](https://ieeexplore.ieee.org/abstract/document/9003750)]

## Popular Large Audio Models
**SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities** [2023].
*Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu*
[[PDF](https://arxiv.org/pdf/2305.11000.pdf)]

**AudioPaLM: A Large Language Model That Can Speak and Listen** [2023].
*Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, et al.*
[[PDF](https://arxiv.org/abs/2306.12925)]

**AudioLM: A Language Modeling Approach to Audio Generation** [2023].
*Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour*
[[PDF](https://ieeexplore.ieee.org/abstract/document/10158503)]

**Listen, Think, and Understand** [2023].
*Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass*
[[PDF](https://arxiv.org/abs/2305.10790)]

**VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation** [2023].
*Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei*
[[PDF](https://arxiv.org/abs/2305.16107)]

**AudioGen: Textually Guided Audio Generation** [2022].
*Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi*
[[PDF](https://arxiv.org/abs/2209.15352)]

**Simple and Controllable Music Generation** [2023].
*Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez*
[[PDF](https://arxiv.org/abs/2306.05284)]

**MusicLM: Generating Music From Text** [2023].
*Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank*
[[PDF](https://arxiv.org/abs/2301.11325)]

**SeamlessM4T—Massively Multilingual & Multimodal Machine Translation** [2023].
*Seamless Communication, Loic Barrault, Andy Chung, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, et al.*
[[PDF](https://ai.meta.com/research/publications/seamless-m4t/)]

**SALMONN: Towards Generic Hearing Abilities for Large Language Models** [2023].
*Changli Tang, Wenyi Yu, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2310.13289)][[Github](https://github.com/bytedance/SALMONN)]


## Automatic Speech Recognition (ASR)

**On decoder-only architecture for speech-to-text and large language model integration** [2023].
*Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu*
[[PDF](https://arxiv.org/abs/2307.03917)]

**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages** [2023].
*Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu*
[[PDF](https://arxiv.org/abs/2305.04160)][[Github](https://github.com/phellonchen/X-LLM)]

**Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition** [2023].
*Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng*
[[PDF](https://arxiv.org/abs/2307.08234)]

**Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR** [2023].
*W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath*
[[PDF](https://arxiv.org/abs/2305.18419)]

**Prompting Large Language Models with Speech Recognition Abilities** [2023].
*Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer*
[[PDF](https://arxiv.org/abs/2307.11795)]

**Connecting Speech Encoder and Large Language Model for ASR** [2023].
*Wenyi Yu, Changli Tang, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2309.13963)]

**SALMONN: Towards Generic Hearing Abilities for Large Language Models** [2023].
*Changli Tang, Wenyi Yu, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2310.13289)][[Github](https://github.com/bytedance/SALMONN)]



## Neural Speech Synthesis

**Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody** [2023].
*Sofoklis Kakouros, Juraj Šimko, Martti Vainio, Antti Suni*
[[PDF](https://arxiv.org/abs/2306.09814)]

**Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** [2023].
*Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei*
[[PDF](https://arxiv.org/abs/2301.02111)]

**Speak, read and prompt: High-fidelity text-to-speech with minimal supervision** [2023].
*Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour*
[[PDF](https://arxiv.org/abs/2302.03540)]

**SpeechLMScore: Evaluating Speech Generation Using Speech Language Model** [2023].
*Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe*
[[PDF](https://ieeexplore.ieee.org/abstract/document/10095710)]

**LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models** [2023].
*Zhichao Wang, Yuanzhe Chen, Lei Xie, Qiao Tian, Yuping Wang*
[[PDF](https://arxiv.org/abs/2306.10521)]

**Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models** [2023].
*Zhiyi Wang, Shaoguang Mao, Wenshan Wu, Yan Xia, Yan Deng, Jonathan Tien*
[[PDF](https://arxiv.org/abs/2306.04980)]


## Speech Translation (ST)

**SeamlessM4T—Massively Multilingual & Multimodal Machine Translation** [2023].
*Seamless Communication, Loic Barrault, Andy Chung, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Peng-Jen Chen, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Abinesh Ramakrishnan, Alexandre Mourachko, Amanda Kallet, Ann Lee, Anna Sun, Bapi Akula, Benjamin Peloquin, Bernie Huang, Bokai Yu, Brian Ellis, Can Balioglu, Carleigh Wood, Changhan Wang, Christophe Ropers, Cynthia Gao, Daniel Li, Elahe Kalbassi, Ethan Ye, Gabriel Mejia Gonzalez, Hirofumi Inaguma, Holger Schwenk, Igor Tufanov, Ilia Kulikov, Janice Lam, Jeff Wang, Juan Pino, Justin Haaheim, Justine Kao, Prangthip Hasanti, Kevin Tran, Maha Elbayad, Marta R. Costa-jussa, Mohamed Ramadan, Naji El Hachem, Onur Çelebi, Paco Guzmán, Paden Tomasello, Pengwei Li, Pierre Andrews, Ruslan Mavlyutov, Russ Howes, Safiyyah Saleem, Skyler Wang, Somya Jain, Sravya Popuri, Tuan Tran, Vish Vogeti, Xutai Ma, Yilin Yang*
[[PDF](https://ai.meta.com/research/publications/seamless-m4t/)]

**PolyVoice: Language Models for Speech to Speech Translation** [2023].
*Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang*
[[PDF](https://arxiv.org/abs/2306.02982)]

**AudioPaLM: A Large Language Model That Can Speak and Listen** [2023].
*Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank*
[[PDF](https://arxiv.org/abs/2306.12925)]

**SALMONN: Towards Generic Hearing Abilities for Large Language Models** [2023].
*Changli Tang, Wenyi Yu, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2310.13289)][[Github](https://github.com/bytedance/SALMONN)]


## Other Speech Applications

**SpeechX: Neural Codec Language Model as a Versatile Speech Transformer** [2023].
*Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka*
[[PDF](https://arxiv.org/abs/2308.06873)]

**AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** [2023].
*Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe*
[[PDF](https://arxiv.org/abs/2304.12995)]

**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages** [2023].
*Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu*
[[PDF](https://arxiv.org/abs/2305.04160)][[Github](https://github.com/phellonchen/X-LLM)]

**Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers** [2023].
*Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, Björn W. Schuller*
[[PDF](https://arxiv.org/abs/2307.06090)]

**LLaSM: Large Language and Speech Model** [2023].
*Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi*
[[PDF](https://arxiv.org/abs/2308.15930)]

**SALMONN: Towards Generic Hearing Abilities for Large Language Models** [2023].
*Changli Tang, Wenyi Yu, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2310.13289)][[Github](https://github.com/bytedance/SALMONN)]


## Large Audio Models in Music

**MusicGen: Simple and Controllable Music Generation** [2023].
*Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez*
[[PDF](https://arxiv.org/abs/2306.05284)]

**JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models** [2023].
*Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, Alex Wang*
[[PDF](https://arxiv.org/abs/2308.04729)]

**VampNet: Music Generation via Masked Acoustic Token Modeling** [2023].
*Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, Bryan Pardo*
[[PDF](https://arxiv.org/abs/2307.04686)]

**Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model** [2023].
*Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria*
[[PDF](https://arxiv.org/abs/2304.13731)]

**WavJourney: Compositional Audio Creation with Large Language Models** [2023].
*Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang*
[[PDF](https://arxiv.org/abs/2307.14335)]

**MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies** [2023].
*Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov*
[[PDF](https://arxiv.org/abs/2308.01546)]

**Exploring the efficacy of pre-trained checkpoints in text-to-music generation task** [2022].
*Shangda Wu, Maosong Sun*
[[PDF](https://arxiv.org/abs/2211.11216)]

**SingSong: Generating musical accompaniments from singing** [2023].
*Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel*
[[PDF](https://arxiv.org/abs/2301.12662)]

**LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation** [2023].
*Longshen Ou, Xichu Ma, Ye Wang*
[[PDF](https://arxiv.org/abs/2307.02146)]

**Efficient Neural Music Generation** [2023].
*Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang*
[[PDF](https://arxiv.org/abs/2305.15719)]

**MuseCoco: Generating Symbolic Music from Text** [2023].
*Peiling Lu, Xin Xu, Chenfei Kang, Botao Yu, Chengyi Xing, Xu Tan, Jiang Bian*
[[PDF](https://arxiv.org/abs/2306.00110)]

**LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad** [2023].
*Siting Xu, Yunlong Tang, Feng Zheng*
[[PDF](https://arxiv.org/abs/2307.04827)]

**Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning** [2023].
*Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan*
[[PDF](https://arxiv.org/abs/2308.11276)][[Github](https://github.com/crypto-code/MU-LLaMA)]

**SALMONN: Towards Generic Hearing Abilities for Large Language Models** [2023].
*Changli Tang, Wenyi Yu, Guangzhi Sun, Xiaozhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang*
[[PDF](https://arxiv.org/abs/2310.13289)][[Github](https://github.com/bytedance/SALMONN)]

**Mustango: Toward Controllable Text-to-Music Generation** [2023].
*Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, Soujanya Poria*
[[PDF](https://arxiv.org/pdf/2311.08355.pdf)][[Github](https://github.com/AMAAI-Lab/mustango)]


## Audio Datasets
| Dataset | Paper or Full Name | Size | Link |
| -------- | -------- | -------- | -------- |
| CommonVoice 11 | CommonVoice: A Massively Multilingual Speech Corpus | 2508 hours from 58250 voices | [Download](https://voice.mozilla.org/en/datasets) |
| Libri-Light | Libri-Light: A Benchmark for ASR with Limited or No Supervision | 60000 Hours | [Download](https://ai.facebook.com/tools/libri-light) |
| WenetSpeech | WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition | 10000 Hours | [Download](https://github.com/wenet-e2e/WenetSpeech) |
| GigaSpeech | GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio | 10000 Hours (transcribed) | [Download](https://github.com/SpeechColab/GigaSpeech) |
| MuST-C | MuST-C: a Multilingual Speech Translation Corpus | 3600 Hours | [Download](https://aclanthology.org/N19-1202.pdf) |
| VoxPopuli | VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation | 400k Hours | [Download](https://github.com/facebookresearch/voxpopuli) |
| CoVoST | CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus | 2280 Hours | [Download](https://github.com/facebookresearch/covost) |
| CVSS | CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus | 3909 Hours | [Download](https://github.com/google-research-datasets/cvss) |
| EMIME | The EMIME bilingual database | - | [Download](https://www.emime.org/participate/emime-bilingual-database.html) |
| Audiocaps | Audiocaps: Generating captions for audios in the wild | 46000 Audios | [Download](https://github.com/cdjkim/audiocaps) |
| Clotho | Clotho: An audio captioning dataset | 4981 audios 24905 captions | [Download](https://zenodo.org/record/3490684) |
| AudioSet | Audio Set: An ontology and human-labeled dataset for audio events | 5.8k hours | [Download](https://g.co/audioset) |
| Emopia | Emopia: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation | 387 piano solo sounds | [Download](https://zenodo.org/record/5090631) |
| MetaMIDI | Building the MetaMIDI Dataset: Linking Symbolic and Audio Musical Data | 436631 MIDI files | [Download](#) |
| DALI2 | Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes | 7756 Songs | [Download](https://github.com/gabolsgabs/DALI) |
| MillionMIDI | Million MIDI Dataset (MMD) | 100k Songs | [Download](#) |
| Vggsound | Vggsound: A Large-Scale Audio-Visual Dataset | 200k Videos | [Download](https://www.robots.ox.ac.uk/~vgg/data/vggsound/) |
| FSD50K | FSD50K: An Open Dataset of Human-Labeled Sound Events | 51197 Sound Clips | [Download](https://zenodo.org/record/4060432) |
| Symphony | Symphony generation with permutation invariant language model | 46359 MIDI files | [Download](https://symphonynet.github.io/) |
| MusicCaps | MusicLM: Generating Music From Text | 5521 music-text pairs | [Download](https://www.kaggle.com/datasets/googleai/musiccaps) |
| Jamendo | The MTG-Jamendo dataset for automatic music tagging | 55525 Tracks | [Download](https://github.com/MTG/mtg-jamendo-dataset) |
| MusicBench | Mustango: Toward Controllable Text-to-Music Generation | 53168 Tracks | [Download](https://huggingface.co/datasets/amaai-lab/MusicBench) |

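The Size column mixes units (hours, clips, MIDI files, songs), so comparing speech corpora by duration takes a little parsing. The following is a minimal sketch, not part of the survey or this repository: the names `parse_hours` and `datasets_by_hours` are hypothetical, the sample rows are copied from the table above, and a `k` suffix is assumed to mean thousands. Sizes are taken as listed here and are not independently verified.

```python
import re

# Rows copied from the table above; any row with the same four columns works.
SAMPLE_ROWS = """\
| Libri-Light | Libri-Light: A Benchmark for ASR with Limited or No Supervision | 60000 Hours | [Download](https://ai.facebook.com/tools/libri-light) |
| MuST-C | MuST-C: a Multilingual Speech Translation Corpus | 3600 Hours | [Download](https://aclanthology.org/N19-1202.pdf) |
| VoxPopuli | VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation | 400k Hours | [Download](https://github.com/facebookresearch/voxpopuli) |
| CoVoST | CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus | 2280 Hours | [Download](https://github.com/facebookresearch/covost) |
| MetaMIDI | Building the MetaMIDI Dataset: Linking Symbolic and Audio Musical Data | 436631 MIDI files | [Download](#) |
"""

# Matches "60000 Hours", "400k Hours", "5.8k hours", and similar.
HOURS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(k?)\s*hours", re.IGNORECASE)


def parse_hours(size):
    """Return the size in hours, or None when no hour figure is present."""
    match = HOURS_PATTERN.search(size)
    if match is None:
        return None
    value = float(match.group(1))
    # Assumption: a "k" suffix denotes thousands of hours.
    return value * 1000 if match.group(2) else value


def datasets_by_hours(markdown):
    """Return (name, hours) pairs for rows sized in hours, largest first."""
    found = []
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # not a four-column table row
        hours = parse_hours(cells[2])
        if hours is not None:
            found.append((cells[0], hours))
    return sorted(found, key=lambda item: item[1], reverse=True)


for name, hours in datasets_by_hours(SAMPLE_ROWS):
    print(f"{name}: {hours:,.0f} hours")
```

Rows whose size is not given in hours, such as MetaMIDI, are skipped rather than converted, because the table gives no basis for turning file or clip counts into durations.
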
# Citation

If you find the listing and survey useful for your work, please cite the paper:

```
@article{latif2023sparks,
title={Sparks of Large Audio Models: A Survey and Outlook},
author={Latif, Siddique and Shoukat, Moazzam and Shamshad, Fahad and Usama, Muhammad and Cuay{\'a}huitl, Heriberto and Schuller, Bj{\"o}rn W},
journal={arXiv preprint arXiv:2308.12792},
year={2023}
}
```