https://github.com/liusongxiang/Large-Audio-Models

Keep track of big models in audio domain, including speech, singing, music etc.
Last synced: 7 months ago
Host: GitHub
URL: https://github.com/liusongxiang/Large-Audio-Models
Owner: liusongxiang
Created: 2023-03-12T05:41:21.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-09-26T06:09:50.000Z (about 1 year ago)
Last Synced: 2024-10-27T14:45:30.026Z (12 months ago)
Size: 44.9 KB
Stars: 455
Watchers: 46
Forks: 27
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# Large-Audio-Models

We keep track of big models in the audio domain, including speech, singing, and music.

## Contents

- [Spoken Language Models](#Spoken-Language-Models)

- [Prompt-based Audio Synthesis](#Prompt-based-Audio-Synthesis)

- [Audio Language Models](#Audio-Language-Models)

- [Audio SSL and UL models](#Audio-SSL-and-UL-models)

### Spoken Language Models

- **Moshi: a speech-text foundation model for real-time dialogue** (2024.9) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. [[PDF]](https://kyutai.org/Moshi.pdf)[[Code]](https://github.com/kyutai-labs/moshi)

- **LLaMA-Omni: Seamless Speech Interaction with Large Language Models** (2024.9) by Qingkai Fang et al. [[PDF]](https://arxiv.org/pdf/2409.06666)[[Code]](https://github.com/ictnlp/LLaMA-Omni)

- **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** (2024.8) by Zhifei Xie et al. [[PDF]](https://arxiv.org/pdf/2408.16725)[[Code]](https://github.com/gpt-omni/mini-omni)

- **SpeechGPT: Speech Large Language Models** (2023.5) by Dong Zhang et al. [[PDF]](https://arxiv.org/pdf/2305.11000)[[Code]](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt)

### Prompt-based Audio Synthesis

- **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (2023), Atin Sakkeer Hussain et al. [[PDF]](https://arxiv.org/pdf/2311.11255.pdf)

- **SpeechX: Neural Codec Language Model as a Versatile Speech Transformer** (2023), Xiaofei Wang et al. [[PDF]](https://arxiv.org/pdf/2308.06873.pdf)

- **TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model** (2023), Deepanway Ghosal et al. [[PDF]](https://openreview.net/pdf?id=1Sn2WqLku1e)

- **Diverse and Vivid Sound Generation from Text Descriptions** (2023), Guangwei Li et al. [[PDF]](https://arxiv.org/pdf/2305.01980.pdf)

- **NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers** (2023), Kai Shen et al. [[PDF]](https://arxiv.org/pdf/2304.09116.pdf)

- **AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models** (2023), Yuancheng Wang et al. [[PDF]](https://arxiv.org/pdf/2304.00830.pdf)

- **Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos** (2023), Kun Su et al. [[PDF]](https://arxiv.org/pdf/2303.16897.pdf)

- **FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model** (2023), Ruiqing Xue et al. [[PDF]](https://arxiv.org/pdf/2303.02939v3.pdf)

- **VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling** (2023), Ziqiang Zhang et al. [[PDF]](https://arxiv.org/pdf/2303.03926.pdf)

- **Simple and Controllable Music Generation** (2023), Jade Copet et al. [[PDF]](https://arxiv.org/pdf/2306.05284.pdf)

- **Efficient Neural Music Generation** (2023), Max W. Y. Lam et al. [[PDF]](https://arxiv.org/pdf/2305.15719.pdf)

- **ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models** (2023), Pengfei Zhu et al. [[PDF]](https://arxiv.org/pdf/2302.04456.pdf)

- **Noise2Music: Text-conditioned Music Generation with Diffusion Models** (2023), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2302.03917)

- **Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision** (2023), Eugene Kharitonov et al. [[PDF]](https://arxiv.org/abs/2302.03540)

- **SingSong: Generating musical accompaniments from singing** (2023), Chris Donahue et al. [[PDF]](https://arxiv.org/pdf/2301.12662.pdf)

- **MusicLM: Generating Music From Text** (2023), Andrea Agostinelli et al. [[PDF]](https://arxiv.org/pdf/2301.11325)

- **InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2301.13662.pdf)

- **Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2305.18474.pdf)

- **AudioLDM: Text-to-Audio Generation with Latent Diffusion Models** (2023), Haohe Liu et al. [[PDF]](https://arxiv.org/pdf/2301.12503)

- **Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion** (2023), Flavio Schneider et al. [[PDF]](https://arxiv.org/pdf/2301.11757)

- **Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models** (2023), Jiawei Huang et al. [[PDF]](https://text-to-audio.github.io/paper.pdf)

- **ArchiSound: Audio Generation with Diffusion** (2023), Flavio Schneider. [[PDF]](https://arxiv.org/ftp/arxiv/papers/2301/2301.13267.pdf)

- **VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** (2023), Chengyi Wang et al. [[PDF]](https://arxiv.org/pdf/2301.02111.pdf)

- **PromptTTS: Controllable Text-to-Speech with Text Descriptions** (2022), Zhifang Guo et al. [[PDF]](https://arxiv.org/pdf/2211.12171.pdf)

- **Diffsound: Discrete Diffusion Model for Text-to-sound Generation** (2022), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2207.09983v1.pdf)

### Audio Language Models

- **Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models** (2023), Yunfei Chu et al. [[PDF]](https://arxiv.org/pdf/2311.07919v1.pdf)

- **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2310.00704.pdf)

- **SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models** (2023), Xin Zhang et al. [[PDF]](https://arxiv.org/pdf/2308.16692.pdf)

- **SoundStorm: Efficient Parallel Audio Generation** (2023), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2305.09636.pdf)

- **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2304.12995.pdf)

- **AudioPaLM: A Large Language Model That Can Speak and Listen** (2023), Paul K. Rubenstein et al. [[PDF]](https://arxiv.org/pdf/2306.12925.pdf)

- **Pengi: An Audio Language Model for Audio Tasks** (2023), Soham Deshmukh et al. [[PDF]](https://arxiv.org/pdf/2305.11834)

- **AudioLM: a Language Modeling Approach to Audio Generation** (2022), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2209.03143)

### Audio SSL and UL models

- **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (2019), Alexei Baevski et al. [[PDF]](https://arxiv.org/abs/1910.05453.pdf)

- **wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations** (2020), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2006.11477.pdf)

- **W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training** (2021), Yu-An Chung et al. [[PDF]](https://arxiv.org/pdf/2108.06209.pdf)

- **HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units** (2021), Wei-Ning Hsu et al. [[PDF]](https://arxiv.org/pdf/2106.07447.pdf)

- **Data2vec: A general framework for self-supervised learning in speech, vision and language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/abs/2202.03555.pdf)

- **MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets** (2022), Ziyang Ma et al. [[PDF]](https://arxiv.org/abs/2211.07321.pdf)

- **ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers** (2022), Kaizhi Qian et al. [[PDF]](https://arxiv.org/pdf/2204.09224.pdf)

- **Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/abs/2212.07525.pdf)

- **MuLan: A Joint Embedding of Music Audio and Natural Language** (2022), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2208.12415.pdf)