# awesome-speech-language-model
Paper, Code and Resources for Speech Language Model and End2End Speech Dialogue System.
https://github.com/ddlbojack/awesome-speech-language-model

## Survey
- Recent Advances in Speech Language Models: A Survey - `arXiv 2024`
- A Survey on Speech Large Language Models - `arXiv 2024`
- Speech Trident - `GitHub`
- Towards audio language modeling -- an overview - `arXiv 2024`

## Universal Speech, Audio and Music Understanding

### Model
- Listen, Think, and Understand - `ICLR 2024`
- SALMONN: Towards Generic Hearing Abilities for Large Language Models - `ICLR 2024`
- Joint Audio and Speech Understanding - `ASRU 2023`
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - `arXiv 2023`
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - `ICML 2024`
- Qwen2-Audio Technical Report - `arXiv 2024`
- WavLLM: Towards Robust and Adaptive Speech Large Language Model - `EMNLP 2024`
- Distilling an End-to-End Voice Assistant Without Instruction Training Data - `arXiv 2024`

### Benchmark
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - `ICASSP 2024`
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - `arXiv 2024`
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - `ICLR 2025 open review`
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - `arXiv 2024`
- AudioBench: A Universal Benchmark for Audio Large Language Models - `arXiv 2024`
- A Suite for Acoustic Language Model Evaluation - `arXiv 2024`
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - `ACL 2024`

## Full Duplex Modeling

### Model
- A Full-duplex Speech Dialogue Scheme Based On Large Language Models - `NeurIPS 2024`
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models - `EMNLP 2024`
- Language Model Can Listen While Speaking - `arXiv 2024`
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - `arXiv 2024`
- Enabling Real-Time Conversations with Minimal Training Costs - `arXiv 2024`

## End2End Speech Dialogue System

### Model
- Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM - `arXiv 2024`
- Hertz-dev - `GitHub 2024`
- Fish Agent - `GitHub 2024`
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions - `arXiv 2024`
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - `EMNLP 2023`
- GPT-4o Voice Mode - `API 2024`
- VITA: Towards Open-Source Interactive Omni Multimodal LLM - `arXiv 2024`
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - `arXiv 2024`
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models - `arXiv 2024`
- Moshi: a speech-text foundation model for real-time dialogue - `arXiv 2024`
- Westlake-Omni - `GitHub 2024`
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities - `arXiv 2024`
- MooER-omni - `GitHub 2024`
- GLM-4-Voice - `GitHub 2024`
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities - `arXiv 2024`

### Benchmark
- VoiceBench: Benchmarking LLM-Based Voice Assistants - `arXiv 2024`