https://github.com/liusongxiang/Large-Audio-Models

Keep track of big models in the audio domain, including speech, singing, music, etc.

# Large-Audio-Models

We keep track of large models in the audio domain, including speech, singing, music, etc.

## Contents

- [Spoken Language Models](#Spoken-Language-Models)
- [Prompt-based Audio Synthesis](#Prompt-based-Audio-Synthesis)
- [Audio Language Models](#Audio-Language-Models)
- [Audio SSL and UL models](#Audio-SSL-and-UL-models)

### Spoken Language Models
- **Moshi: a speech-text foundation model for real-time dialogue** (2024.9) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. [[PDF]](https://kyutai.org/Moshi.pdf)[[Code]](https://github.com/kyutai-labs/moshi)

- **LLaMA-Omni: Seamless Speech Interaction with Large Language Models** (2024.9) by Qingkai Fang et al. [[PDF]](https://arxiv.org/pdf/2409.06666)[[Code]](https://github.com/ictnlp/LLaMA-Omni)

- **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** (2024.8) by Zhifei Xie et al. [[PDF]](https://arxiv.org/pdf/2408.16725)[[Code]](https://github.com/gpt-omni/mini-omni)

- **SpeechGPT: Speech Large Language Models** (2023.5) by Dong Zhang et al. [[PDF]](https://arxiv.org/pdf/2305.11000)[[Code]](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt)

### Prompt-based Audio Synthesis

- **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (2023), Atin Sakkeer Hussain et al. [[PDF]](https://arxiv.org/pdf/2311.11255.pdf)
- **SpeechX: Neural Codec Language Model as a Versatile Speech Transformer** (2023), Xiaofei Wang et al. [[PDF]](https://arxiv.org/pdf/2308.06873.pdf)
- **TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model** (2023), Deepanway Ghosal et al. [[PDF]](https://openreview.net/pdf?id=1Sn2WqLku1e)
- **Diverse and Vivid Sound Generation from Text Descriptions** (2023), Guangwei Li et al. [[PDF]](https://arxiv.org/pdf/2305.01980.pdf)
- **NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers** (2023), Kai Shen et al. [[PDF]](https://arxiv.org/pdf/2304.09116.pdf)
- **AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models** (2023), Yuancheng Wang et al. [[PDF]](https://arxiv.org/pdf/2304.00830.pdf)
- **Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos** (2023), Kun Su et al. [[PDF]](https://arxiv.org/pdf/2303.16897.pdf)
- **FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model** (2023), Ruiqing Xue et al. [[PDF]](https://arxiv.org/pdf/2303.02939v3.pdf)
- **VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling** (2023), Ziqiang Zhang et al. [[PDF]](https://arxiv.org/pdf/2303.03926.pdf)
- **Simple and Controllable Music Generation** (2023), Jade Copet et al. [[PDF]](https://arxiv.org/pdf/2306.05284.pdf)
- **Efficient Neural Music Generation** (2023), Max W. Y. Lam et al. [[PDF]](https://arxiv.org/pdf/2305.15719.pdf)
- **ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models** (2023), Pengfei Zhu et al. [[PDF]](https://arxiv.org/pdf/2302.04456.pdf)
- **Noise2Music: Text-conditioned Music Generation with Diffusion Models** (2023), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2302.03917)
- **Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision** (2023), Eugene Kharitonov et al. [[PDF]](https://arxiv.org/abs/2302.03540)
- **SingSong: Generating musical accompaniments from singing** (2023), Chris Donahue et al. [[PDF]](https://arxiv.org/pdf/2301.12662.pdf)
- **MusicLM: Generating Music From Text** (2023), Andrea Agostinelli et al. [[PDF]](https://arxiv.org/pdf/2301.11325)
- **InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2301.13662.pdf)
- **Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation** (2023), Jiawei Huang et al. [[PDF]](https://arxiv.org/pdf/2305.18474.pdf)
- **AudioLDM: Text-to-Audio Generation with Latent Diffusion Models** (2023), Haohe Liu et al. [[PDF]](https://arxiv.org/pdf/2301.12503)
- **Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion** (2023), Flavio Schneider et al. [[PDF]](https://arxiv.org/pdf/2301.11757)
- **Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models** (2023), Rongjie Huang et al. [[PDF]](https://text-to-audio.github.io/paper.pdf)
- **ArchiSound: Audio Generation with Diffusion** (2023), Flavio Schneider. [[PDF]](https://arxiv.org/ftp/arxiv/papers/2301/2301.13267.pdf)
- **VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** (2023), Chengyi Wang et al. [[PDF]](https://arxiv.org/pdf/2301.02111.pdf)
- **PromptTTS: Controllable Text-to-Speech with Text Descriptions** (2022), Zhifang Guo et al. [[PDF]](https://arxiv.org/pdf/2211.12171.pdf)
- **Diffsound: Discrete Diffusion Model for Text-to-sound Generation** (2022), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2207.09983v1.pdf)

### Audio Language Models

- **Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models** (2023), Yunfei Chu et al. [[PDF]](https://arxiv.org/pdf/2311.07919v1.pdf)
- **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2310.00704.pdf)
- **SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models** (2023), Xin Zhang et al. [[PDF]](https://arxiv.org/pdf/2308.16692.pdf)
- **SoundStorm: Efficient Parallel Audio Generation** (2023), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2305.09636.pdf)
- **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2304.12995.pdf)
- **AudioPaLM: A Large Language Model That Can Speak and Listen** (2023), Paul K. Rubenstein et al. [[PDF]](https://arxiv.org/pdf/2306.12925.pdf)
- **Pengi: An Audio Language Model for Audio Tasks** (2023), Soham Deshmukh et al. [[PDF]](https://arxiv.org/pdf/2305.11834)
- **AudioLM: a Language Modeling Approach to Audio Generation** (2022), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2209.03143)

### Audio SSL and UL models

- **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (2019), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/1910.05453.pdf)
- **wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations** (2020), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2006.11477.pdf)
- **W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training** (2021), Yu-An Chung et al. [[PDF]](https://arxiv.org/pdf/2108.06209.pdf)
- **HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units** (2021), Wei-Ning Hsu et al. [[PDF]](https://arxiv.org/pdf/2106.07447.pdf)
- **Data2vec: A general framework for self-supervised learning in speech, vision and language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2202.03555.pdf)
- **MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets** (2022), Ziyang Ma et al. [[PDF]](https://arxiv.org/pdf/2211.07321.pdf)
- **ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers** (2022), Kaizhi Qian et al. [[PDF]](https://arxiv.org/pdf/2204.09224.pdf)
- **Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2212.07525.pdf)
- **MuLan: A Joint Embedding of Music Audio and Natural Language** (2022), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2208.12415.pdf)