https://github.com/ga642381/speech-trident

Awesome speech/audio LLMs, representation learning, and codec models
Host: GitHub
URL: https://github.com/ga642381/speech-trident
Owner: ga642381
Created: 2024-04-08T09:49:58.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2024-11-12T19:34:57.000Z (6 months ago)
Last Synced: 2024-11-12T20:27:59.051Z (6 months ago)
Size: 2.76 MB
Stars: 689
Watchers: 41
Forks: 34
Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

awesome-speech-language-model - Speech Trident - `Github` (Survey / Benchmark)
awesome_ai_agents - Speech-Trident - Awesome speech/audio LLMs, representation learning, and codec models (Building / LLM Models)
README

# :trident: Speech Trident - Awesome Speech LM

In this repository, we survey three crucial areas: (1) representation learning, (2) neural codec, and (3) language models that contribute to speech/audio large language models.

1.⚡ **Speech Representation Models:** These models focus on learning structural speech representations, which can then be quantized into discrete speech tokens, often referred to as **semantic tokens**.

2.⚡ **Speech Neural Codec Models:** These models are designed to learn discrete speech and audio tokens, often referred to as **acoustic tokens**, while maintaining reconstruction ability and a low bitrate.

3.⚡ **Speech Large Language Models:** These models are trained on top of speech and acoustic tokens with a language modeling approach. They demonstrate proficiency in speech understanding and speech generation tasks.
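
To make the "quantized into discrete speech tokens" step concrete, here is a minimal sketch of nearest-centroid quantization, one common way continuous frame-level features are turned into token IDs. It is an illustration only: the function name `quantize_frames`, the toy 2-D features, and the hand-written codebook are all hypothetical, and real systems typically learn the codebook (for example with k-means) over features from a pretrained representation model.

```python
import numpy as np


def quantize_frames(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame vector to the index of its nearest codebook entry.

    features: (num_frames, dim) continuous frame-level representations.
    codebook: (num_codes, dim) centroids (hypothetical; normally learned).
    Returns a (num_frames,) array of integer token IDs.
    """
    # Squared Euclidean distance between every frame and every centroid.
    diffs = features[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)
    # The token for a frame is the index of the closest centroid.
    return np.argmin(distances, axis=1)


# Toy data: three centroids and four frames in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]])

tokens = quantize_frames(frames, codebook)
print(tokens.tolist())  # [0, 1, 2, 0]
```

The resulting integer sequence is the kind of input a speech language model can be trained on with a standard next-token objective.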

## :trident: Contributors

  

    

      

        

        


- Kai-Wei Chang
- Haibin Wu
- Wei-Cheng Tseng
- Kehan Lu
- Chun-Yi Kuan
- Hung-yi Lee

      

    

  

## :trident: Speech/Audio Language Models

| Date | Model Name | Paper Title | Link |
| ------- | -------------- | --------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
| 2024-11 | -- | Building a Taiwanese Mandarin Spoken Language Model: A First Attempt | [Paper](https://arxiv.org/abs/2411.07111) |
| 2024-11 | Ultravox | Ultravox: An open-weight alternative to GPT-4o Realtime | [Blog](https://www.ultravox.ai/blog/ultravox-an-open-weight-alternative-to-gpt-4o-realtime) |
| 2024-11 | hertz-dev | [blog](https://si.inc/hertz-dev/) | [GitHub](https://github.com/Standard-Intelligence/hertz-dev) |
| 2024-11 | Freeze-Omni | Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | [paper](https://arxiv.org/abs/2411.00774) |
| 2024-11 | Align-SLM | Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback | [paper](https://arxiv.org/pdf/2411.01834) |
| 2024-10 | Ichigo | Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | [paper](https://arxiv.org/abs/2410.15316), [code](https://github.com/homebrewltd/ichigo) |
| 2024-10 | OmniFlatten | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | [paper](https://arxiv.org/abs/2410.17799v1) |
| 2024-10 | GPT-4o | GPT-4o System Card | [paper](https://arxiv.org/pdf/2410.21276) |
| 2024-10 | Baichuan-OMNI | Baichuan-Omni Technical Report | [paper](https://arxiv.org/abs/2410.08565) |
| 2024-10 | GLM-4-Voice | GLM-4-Voice | [GitHub](https://github.com/THUDM/GLM-4-Voice) |
| 2024-10 | -- | Roadmap towards Superhuman Speech Understanding using Large Language Models | [paper](https://arxiv.org/abs/2410.13268) |
| 2024-10 | SALMONN-OMNI | SALMONN-OMNI: A SPEECH UNDERSTANDING AND GENERATION LLM IN A CODEC-FREE FULL-DUPLEX FRAMEWORK | [paper](https://openreview.net/attachment?id=eJpI20hzWf&name=pdf) |
| 2024-10 | Mini-Omni 2 | Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | [paper](https://arxiv.org/abs/2410.11190) |
| 2024-10 | HALL-E | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | [paper](https://openreview.net/forum?id=868masI331) |
| 2024-10 | SyllableLM | SyllableLM: Learning Coarse Semantic Units for Speech Language Models | [paper](https://arxiv.org/html/2410.04029v1) |
| 2024-09 | Moshi | Moshi: a speech-text foundation model for real-time dialogue | [paper](https://kyutai.org/Moshi.pdf) |
| 2024-09 | Takin AudioLLM | Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models | [paper](https://arxiv.org/abs/2409.12139) |
| 2024-09 | FireRedTTS | FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications | [paper](https://arxiv.org/html/2409.03283v1) |
| 2024-09 | LLaMA-Omni | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | [paper](https://arxiv.org/abs/2409.06666) |
| 2024-09 | MaskGCT | MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | [paper](https://arxiv.org/abs/2409.00750v1) |
| 2024-09 | SSR-Speech | SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | [paper](https://arxiv.org/abs/2409.07556) |
| 2024-09 | MoWE-Audio | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | [paper](https://arxiv.org/pdf/2409.06635) |
| 2024-08 | Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | [paper](https://arxiv.org/abs/2408.16725) |
| 2024-08 | Make-A-Voice 2 | Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner | [paper](https://aclanthology.org/2024.acl-long.589/) |
| 2024-08 | LSLM | Language Model Can Listen While Speaking | [paper](https://arxiv.org/abs/2408.02622) |
| 2024-06 | SimpleSpeech | SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models | [paper](https://arxiv.org/abs/2406.02328) |
| 2024-06 | UniAudio 1.5 | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | [paper](https://arxiv.org/abs/2406.10056) |
| 2024-06 | VALL-E R | VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | [paper](https://arxiv.org/abs/2406.07855) |
| 2024-06 | VALL-E 2 | VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | [paper](https://arxiv.org/abs/2406.05370) |
| 2024-06 | GPST | Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer | [paper](https://arxiv.org/abs/2406.00976) |
| 2024-04 | CLaM-TTS | CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | [paper](https://arxiv.org/abs/2404.02781) |
| 2024-04 | RALL-E | RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | [paper](https://arxiv.org/abs/2404.03204) |
| 2024-04 | WavLLM | WavLLM: Towards Robust and Adaptive Speech Large Language Model | [paper](https://arxiv.org/abs/2404.00656) |
| 2024-02 | MobileSpeech | MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | [paper](https://arxiv.org/abs/2402.09378) |
| 2024-02 | SLAM-ASR | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | [paper](https://arxiv.org/abs/2402.08846) |
| 2024-02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | [paper](https://arxiv.org/abs/2402.12226) |
| 2024-02 | SpiRit-LM | SpiRit-LM: Interleaved Spoken and Written Language Model | [paper](https://arxiv.org/abs/2402.05755) |
| 2024-02 | USDM | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | [paper](https://arxiv.org/abs/2402.05706) |
| 2024-02 | BAT | BAT: Learning to Reason about Spatial Sounds with Large Language Models | [paper](https://arxiv.org/abs/2402.01591) |
| 2024-02 | Audio Flamingo | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | [paper](https://arxiv.org/abs/2402.01831) |
| 2024-02 | Text Description to speech | Natural language guidance of high-fidelity text-to-speech with synthetic annotations | [paper](https://arxiv.org/abs/2402.01912) |
| 2024-02 | GenTranslate | GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | [paper](https://arxiv.org/abs/2402.06894) |
| 2024-02 | Base-TTS | BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | [paper](https://arxiv.org/abs/2402.08093) |
| 2024-02 | -- | It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | [paper](https://arxiv.org/abs/2402.05457) |
| 2024-01 | -- | Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | [paper](https://arxiv.org/abs/2401.10446) |
| 2024-01 | ELLA-V | ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering | [paper](https://arxiv.org/abs/2401.07333) |
| 2023-12 | Seamless | Seamless: Multilingual Expressive and Streaming Speech Translation | [paper](https://arxiv.org/abs/2312.05187) |
| 2023-11 | Qwen-Audio | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | [paper](https://arxiv.org/abs/2311.07919) |
| 2023-10 | LauraGPT | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | [paper](https://arxiv.org/abs/2310.04673) |
| 2023-10 | SALMONN | SALMONN: Towards Generic Hearing Abilities for Large Language Models | [paper](https://arxiv.org/abs/2310.13289) |
| 2023-10 | UniAudio | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | [paper](https://arxiv.org/abs/2310.00704) |
| 2023-10 | Whispering LLaMA | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition | [paper](https://arxiv.org/abs/2310.06434) |
| 2023-09 | VoxtLM | Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks | [paper](https://arxiv.org/abs/2309.07937) |
| 2023-09 | LTU-AS | Joint Audio and Speech Understanding | [paper](https://arxiv.org/abs/2309.14405) |
| 2023-09 | SLM | SLM: Bridge the thin gap between speech and text foundation models | [paper](https://arxiv.org/abs/2310.00230) |
| 2023-09 | -- | Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting | [paper](https://arxiv.org/abs/2309.15649) |
| 2023-08 | SpeechGen | SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts | [paper](https://arxiv.org/abs/2306.02207) |
| 2023-08 | SpeechX | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | [paper](https://arxiv.org/abs/2308.06873) |
| 2023-08 | LLaSM | Large Language and Speech Model | [paper](https://arxiv.org/abs/2308.15930) |
| 2023-08 | SeamlessM4T | Massively Multilingual & Multimodal Machine Translation | [paper](https://arxiv.org/abs/2308.11596) |
| 2023-07 | Speech-LLaMA | On decoder-only architecture for speech-to-text and large language model integration | [paper](https://arxiv.org/abs/2307.03917) |
| 2023-07 | LLM-ASR(temp.) | Prompting Large Language Models with Speech Recognition Abilities | [paper](https://arxiv.org/abs/2307.11795) |
| 2023-06 | AudioPaLM | AudioPaLM: A Large Language Model That Can Speak and Listen | [paper](https://arxiv.org/abs/2306.12925) |
| 2023-05 | Make-A-Voice | Make-A-Voice: Unified Voice Synthesis With Discrete Representation | [paper](https://arxiv.org/abs/2305.19269) |
| 2023-05 | Spectron | Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | [paper](https://arxiv.org/abs/2305.15255) |
| 2023-05 | TWIST | Textually Pretrained Speech Language Models | [paper](https://arxiv.org/abs/2305.13009) |
| 2023-05 | Pengi | Pengi: An Audio Language Model for Audio Tasks | [paper](https://arxiv.org/abs/2305.11834) |
| 2023-05 | SoundStorm | Efficient Parallel Audio Generation | [paper](https://arxiv.org/abs/2305.09636) |
| 2023-05 | LTU | Listen, Think, and Understand | [paper](https://arxiv.org/abs/2305.10790) |
| 2023-05 | SpeechGPT | Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | [paper](https://arxiv.org/abs/2305.11000) |
| 2023-05 | VioLA | Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | [paper](https://arxiv.org/abs/2305.16107) |
| 2023-05 | X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | [paper](https://arxiv.org/abs/2305.04160) |
| 2023-03 | Google USM | Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | [paper](https://arxiv.org/abs/2303.01037) |
| 2023-03 | VALL-E X | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | [paper](https://arxiv.org/abs/2303.03926) |
| 2023-02 | SPEAR-TTS | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | [paper](https://arxiv.org/abs/2302.03540) |
| 2023-01 | VALL-E | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | [paper](https://arxiv.org/abs/2301.02111) |
| 2022-12 | Whisper | Robust Speech Recognition via Large-Scale Weak Supervision | [paper](https://arxiv.org/abs/2212.04356) |
| 2022-10 | AudioGen | AudioGen: Textually Guided Audio Generation | [paper](https://arxiv.org/abs/2209.15352) |
| 2022-09 | AudioLM | AudioLM: a Language Modeling Approach to Audio Generation | [paper](https://arxiv.org/abs/2209.03143) |
| 2022-05 | Wav2Seq | Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages | [paper](https://arxiv.org/abs/2205.01086) |
| 2022-04 | Unit mBART | Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation | [paper](https://arxiv.org/abs/2204.02967) |
| 2022-03 | d-GSLM | Generative Spoken Dialogue Language Modeling | [paper](https://arxiv.org/abs/2203.16502) |
| 2021-10 | SLAM | SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training | [paper](https://arxiv.org/abs/2110.10329) |
| 2021-09 | p-GSLM | Text-Free Prosody-Aware Generative Spoken Language Modeling | [paper](https://arxiv.org/abs/2109.03264) |
| 2021-02 | GSLM | Generative Spoken Language Modeling from Raw Audio | [paper](https://arxiv.org/abs/2102.01192) |

## :trident: Speech/Audio Codec Models

| Date | Model Name | Paper Title | Link |
| ------- | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
| 2024-11 | PyramidCodec | PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain | [paper](https://aclanthology.org/2024.findings-emnlp.246.pdf) |
| 2024-11 | UniCodec | Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations | [paper](https://ieeexplore.ieee.org/abstract/document/10738376?casa_token=eWtmSXEr4AEAAAAA:FzYuQIESJ2LXwl9smJQe3RakpDUFuJ-AS0d39ZDlhsI0tBVX_8P7hu4a59yZezz7hpYd3VomUDo) |
| 2024-11 | SimVQ | Addressing Representation Collapse in Vector Quantized Models with One Linear Layer | [paper](https://arxiv.org/pdf/2411.02038) |
| 2024-11 | MDCTCodec | MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios | [paper](https://arxiv.org/pdf/2411.00464) |
| 2024-10 | APCodec+ | APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm | [paper](https://arxiv.org/pdf/2410.22807) |
| 2024-10 | - | A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation | [paper](https://arxiv.org/pdf/2410.22448) |
| 2024-10 | SNAC | SNAC: Multi-Scale Neural Audio Codec | [paper](https://arxiv.org/pdf/2410.14411) |
| 2024-10 | LSCodec | LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec | [paper](https://arxiv.org/abs/2410.15764) |
| 2024-10 | Co-design for codec and codec-LM | TOWARDS CODEC-LM CO-DESIGN FOR NEURAL CODEC LANGUAGE MODELS | [paper](https://openreview.net/pdf?id=KCVv3tICvp) |
| 2024-10 | VChangeCodec | VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | [paper](https://openreview.net/forum?id=qDSfOQBrOD) |
| 2024-10 | DC-Spin | DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models | [paper](https://openreview.net/forum?id=OW332Wh9S5) |
| 2024-10 | TAAE | Scaling Transformers for Low-Bitrate High-Quality Speech Coding | [paper](https://openreview.net/pdf?id=4YpMrGfldX) |
| 2024-10 | DM-Codec | DM-Codec: Distilling Multimodal Representations for Speech Tokenization | [paper](https://openreview.net/forum?id=UFwefiypla) |
| 2024-09 | Mimi | Moshi: a speech-text foundation model for real-time dialogue | [paper](https://kyutai.org/Moshi.pdf) |
| 2024-09 | NDVQ | NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization | [paper](https://arxiv.org/pdf/2409.12717) |
| 2024-09 | SoCodec | SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis | [paper](https://arxiv.org/pdf/2409.00933) |
| 2024-09 | BigCodec | BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec | [paper](https://arxiv.org/abs/2409.05377) |
| 2024-08 | X-Codec | Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | [paper](https://arxiv.org/pdf/2408.17175) |
| 2024-08 | WavTokenizer | WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | [paper](https://arxiv.org/abs/2408.16532) |
| 2024-07 | Super-Codec | SuperCodec: A Neural Speech Codec with Selective Back-Projection Network | [paper](https://arxiv.org/abs/2407.20530) |
| 2024-07 | dMel | dMel: Speech Tokenization made Simple | [paper](https://arxiv.org/abs/2407.15835) |
| 2024-06 | CodecFake | CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems | [paper](https://arxiv.org/abs/2406.07237) |
| 2024-06 | Single-Codec | Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation | [paper](https://www.arxiv.org/abs/2406.07422) |
| 2024-06 | SQ-Codec | SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models | [paper](https://arxiv.org/abs/2406.02328) |
| 2024-06 | PQ-VAE | Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder | [paper](https://arxiv.org/abs/2406.02940) |
| 2024-06 | LLM-Codec | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | [paper](https://arxiv.org/abs/2406.10056) |
| 2024-05 | HILCodec | HILCodec: High Fidelity and Lightweight Neural Audio Codec | [paper](https://arxiv.org/abs/2405.04752) |
| 2024-04 | SemantiCodec | SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound | [paper](https://arxiv.org/abs/2405.00233) |
| 2024-04 | PromptCodec | PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders | [paper](https://arxiv.org/abs/2404.02702) |
| 2024-04 | ESC | ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers | [paper](https://arxiv.org/abs/2404.19441) |
| 2024-03 | FACodec | NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | [paper](https://arxiv.org/abs/2403.03100) |
| 2024-02 | AP-Codec | APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding | [paper](https://arxiv.org/abs/2402.10533) |
| 2024-02 | Language-Codec | Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models | [paper](https://arxiv.org/abs/2402.12208) |
| 2024-01 | ScoreDec | ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter | [paper](https://arxiv.org/abs/2401.12160) |
| 2023-11 | HierSpeech++ | HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis | [paper](https://arxiv.org/abs/2311.12454) |
| 2023-10 | TiCodec | FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES | [paper](https://arxiv.org/pdf/2310.00014) |
| 2023-09 | RepCodec | RepCodec: A Speech Representation Codec for Speech Tokenization | [paper](https://arxiv.org/abs/2309.00169) |
| 2023-09 | FunCodec | FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | [paper](https://arxiv.org/abs/2309.07405) |
| 2023-08 | SpeechTokenizer | Speechtokenizer: Unified speech tokenizer for speech large language models | [paper](https://arxiv.org/abs/2308.16692) |
| 2023-06 | VOCOS | VOCOS: CLOSING THE GAP BETWEEN TIME-DOMAIN AND FOURIER-BASED NEURAL VOCODERS FOR HIGH-QUALITY AUDIO SYNTHESIS | [paper](https://arxiv.org/pdf/2306.00814) |
| 2023-06 | Descript-audio-codec | High-Fidelity Audio Compression with Improved RVQGAN | [paper](https://arxiv.org/abs/2306.06546) |
| 2023-05 | AudioDec | Audiodec: An open-source streaming highfidelity neural audio codec | [paper](https://arxiv.org/abs/2305.16608) |
| 2023-05 | HiFi-Codec | Hifi-codec: Group-residual vector quantization for high fidelity audio codec | [paper](https://arxiv.org/abs/2305.02765) |
| 2023-03 | LMCodec | LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models | [paper](https://arxiv.org/abs/2303.12984) |
| 2022-11 | Disen-TF-Codec | Disentangled Feature Learning for Real-Time Neural Speech Coding | [paper](https://arxiv.org/abs/2211.11960) |
| 2022-10 | EnCodec | High fidelity neural audio compression | [paper](https://arxiv.org/abs/2210.13438) |
| 2022-07 | S-TFNet | Cross-Scale Vector Quantization for Scalable Neural Speech Coding | [paper](https://arxiv.org/abs/2207.03067) |
| 2022-01 | TFNet | End-to-End Neural Speech Coding for Real-Time Communications | [paper](https://arxiv.org/abs/2201.09429) |
| 2021-07 | SoundStream | SoundStream: An End-to-End Neural Audio Codec | [paper](https://arxiv.org/abs/2107.03312) |

## :trident: Speech/Audio Representation Models

| Date | Model Name | Paper Title | Link |
| ------- | ------------ | ------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
| 2024-09 | NEST-RQ | NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | [paper](https://arxiv.org/pdf/2409.08680) |
| 2024-01 | EAT | Self-Supervised Pre-Training with Efficient Audio Transformer | [paper](https://arxiv.org/abs/2401.03497) |
| 2023-10 | MR-HuBERT | Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | [paper](https://arxiv.org/abs/2310.02720) |
| 2023-10 | SpeechFlow | Generative Pre-training for Speech with Flow Matching | [paper](https://arxiv.org/abs/2310.16338) |
| 2023-09 | WavLabLM | Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning | [paper](https://arxiv.org/abs/2309.15317) |
| 2023-08 | W2v-BERT 2.0 | Massively Multilingual & Multimodal Machine Translation | [paper](https://arxiv.org/abs/2308.11596) |
| 2023-07 | Whisper-AT | Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers | [paper](https://arxiv.org/abs/2307.03183) |
| 2023-06 | ATST | Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks | [paper](https://arxiv.org/abs/2306.04186) |
| 2023-05 | SPIN | Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering | [paper](https://arxiv.org/abs/2305.11072) |
| 2023-05 | DinoSR | Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning | [paper](https://arxiv.org/abs/2305.10005) |
| 2023-05 | NFA | Self-supervised neural factor analysis for disentangling utterance-level speech representations | [paper](https://arxiv.org/abs/2305.08099) |
| 2022-12 | Data2vec 2.0 | Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | [paper](https://arxiv.org/abs/2212.07525) |
| 2022-12 | BEATs | Audio Pre-Training with Acoustic Tokenizers | [paper](https://arxiv.org/abs/2212.09058) |
| 2022-11 | MT4SSL | MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets | [paper](https://arxiv.org/abs/2211.07321) |
| 2022-08 | DINO | Non-contrastive self-supervised learning of utterance-level speech representations | [paper](https://arxiv.org/abs/2208.05413) |
| 2022-07 | Audio-MAE | Masked Autoencoders that Listen | [paper](https://arxiv.org/abs/2207.06405) |
| 2022-04 | MAESTRO | Matched Speech Text Representations through Modality Matching | [paper](https://arxiv.org/abs/2204.03409) |
| 2022-03 | MAE-AST | Masked Autoencoding Audio Spectrogram Transformer | [paper](https://arxiv.org/abs/2203.16691) |
| 2022-03 | LightHuBERT | Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT | [paper](https://arxiv.org/abs/2203.15610) |
| 2022-02 | Data2vec | A General Framework for Self-supervised Learning in Speech, Vision and Language | [paper](https://arxiv.org/abs/2202.03555) |
| 2021-10 | WavLM | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | [paper](https://arxiv.org/abs/2110.13900) |
| 2021-08 | W2v-BERT | Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training | [paper](https://arxiv.org/abs/2108.06209) |
| 2021-07 | mHuBERT | Direct speech-to-speech translation with discrete units | [paper](https://arxiv.org/abs/2107.05604) |
| 2021-06 | HuBERT | Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | [paper](https://arxiv.org/abs/2106.07447) |
| 2021-03 | BYOL-A | Self-Supervised Learning for General-Purpose Audio Representation | [paper](https://arxiv.org/abs/2103.06695) |
| 2020-12 | DeCoAR2.0 | DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization | [paper](https://arxiv.org/abs/2012.06659) |
| 2020-07 | TERA | TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | [paper](https://arxiv.org/abs/2007.06028) |
| 2020-06 | Wav2vec2.0 | wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | [paper](https://arxiv.org/abs/2006.11477) |
| 2019-10 | APC | Generative Pre-Training for Speech with Autoregressive Predictive Coding | [paper](https://arxiv.org/abs/1910.12607) |
| 2018-07 | CPC | Representation Learning with Contrastive Predictive Coding | [paper](https://arxiv.org/abs/1807.03748) |

## :trident: SLT 2024 Codec-SUPERB challenge (upcoming)

[Webpage](https://codecsuperb.github.io/). The challenge will cover today's neural audio codecs and speech/audio language models. Agenda: To be determined.

## :trident: Interspeech 2024 Survey Talk

Professor Hung-Yi Lee will be giving a talk as part of the [Interspeech 2024 survey talk](https://drive.google.com/file/d/1gPjnjGKxeCF72gisPVuQlDvogXQCtNk4/view) titled **Challenges in Developing Spoken Language Models**. The talk will cover today's speech/audio large language models.

## :trident: ICASSP 2024 Tutorial Information

I (Kai-Wei Chang) will be giving a talk as part of the [ICASSP 2024 tutorial](https://cmsworkshops.com/ICASSP2024/tutorials.php#tut32) titled **Parameter-Efficient and Prompt Learning for Speech and Language Foundation Models**. The talk will cover today's speech/audio large language models. The slides from my presentation are available at https://kwchang.org/talks/. Please feel free to reach out to me for any discussions.

## :trident: Related Repository

- https://github.com/liusongxiang/Large-Audio-Models

- https://github.com/kuan2jiu99/Awesome-Speech-Generation

- https://github.com/ga642381/Speech-Prompts-Adapters

- https://github.com/voidful/Codec-SUPERB

- https://github.com/huckiyang/awesome-neural-reprogramming-prompting

## Citation

If you find this repository useful, please consider citing the following papers.

```
@article{wu2024codec,
  title={Codec-SUPERB@ SLT 2024: A lightweight benchmark for neural audio codec models},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kaiwei and Du, Jiawei and Lu, Ke-Han and Liu, Alexander H and Chung, Ho-Lam and Wu, Yuan-Kuei and Yang, Dongchao and others},
  journal={arXiv preprint arXiv:2409.14085},
  year={2024}
}
```

```
@inproceedings{wu-etal-2024-codec,
    title = "Codec-{SUPERB}: An In-Depth Analysis of Sound Codec Models",
    author = "Wu, Haibin  and
      Chung, Ho-Lam  and
      Lin, Yi-Cheng  and
      Wu, Yuan-Kuei  and
      Chen, Xuanjun  and
      Pai, Yu-Chi  and
      Wang, Hsiu-Hsuan  and
      Chang, Kai-Wei  and
      Liu, Alexander  and
      Lee, Hung-yi",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.616",
    doi = "10.18653/v1/2024.findings-acl.616",
    pages = "10330--10348",
}
```

```
@article{wu2023speechgen,
  title={Speechgen: Unlocking the generative power of speech language models with prompts},
  author={Wu, Haibin and Chang, Kai-Wei and Wu, Yuan-Kuei and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2306.02207},
  year={2023}
}
```

```
@article{wu2024towards,
  title={Towards audio language modeling-an overview},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kai-wei and Chung, Ho-Lam and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13236},
  year={2024}
}
```