# awesome-speech-language-model
Paper, Code and Resources for Speech Language Model and End2End Speech Dialogue System.
https://github.com/ddlbojack/awesome-speech-language-model

## Survey
- Recent Advances in Speech Language Models: A Survey - `arXiv 2024`
- A Survey on Speech Large Language Models - `arXiv 2024`
- Speech Trident - `GitHub`
- Towards audio language modeling -- an overview - `arXiv 2024`

## Universal Speech, Audio and Music Understanding

### Model
- Listen, Think, and Understand - `ICLR 2024`
- SALMONN: Towards Generic Hearing Abilities for Large Language Models - `ICLR 2024`
- Joint Audio and Speech Understanding - `ASRU 2023`
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - `arXiv 2023`
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - `ICML 2024`
- Qwen2-Audio Technical Report - `arXiv 2024`
- WavLLM: Towards Robust and Adaptive Speech Large Language Model - `EMNLP 2024`
- Distilling an End-to-End Voice Assistant Without Instruction Training Data - `arXiv 2024`

### Benchmark
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - `ICASSP 2024`
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - `arXiv 2024`
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - `ICLR 2025 open review`
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - `arXiv 2024`
- AudioBench: A Universal Benchmark for Audio Large Language Models - `arXiv 2024`
- A Suite for Acoustic Language Model Evaluation - `arXiv 2024`
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - `ACL 2024`

## Full Duplex Modeling

### Model
- A Full-duplex Speech Dialogue Scheme Based On Large Language Models - `NeurIPS 2024`
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models - `EMNLP 2024`
- Language Model Can Listen While Speaking - `arXiv 2024`
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - `arXiv 2024`
- Enabling Real-Time Conversations with Minimal Training Costs - `arXiv 2024`

## End2End Speech Dialogue System

### Model
- Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM - `arXiv 2024`
- Hertz-dev - `GitHub 2024`
- Fish Agent - `GitHub 2024`
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions - `arXiv 2024`
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - `EMNLP 2023`
- GPT-4o Voice Mode - `API 2024`
- VITA: Towards Open-Source Interactive Omni Multimodal LLM - `arXiv 2024`
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - `arXiv 2024`
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models - `arXiv 2024`
- Moshi: a speech-text foundation model for real-time dialogue - `arXiv 2024`
- Westlake-Omni - `GitHub 2024`
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities - `arXiv 2024`
- MooER-omni - `GitHub 2024`
- GLM-4-Voice - `GitHub 2024`
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities - `arXiv 2024`

### Benchmark
- VoiceBench: Benchmarking LLM-Based Voice Assistants - `arXiv 2024`