* [Awesome-Speech-Language-Model](#awesome-speech-language-model)
  * [Universal Speech, Audio and Music Understanding](#universal-speech-audio-and-music-understanding)
    * [Model](#model)
    * [Benchmark](#benchmark)
  * [End2End Speech Dialogue System](#end2end-speech-dialogue-system)
    * [Model](#model-1)
    * [Benchmark](#benchmark-1)
  * [Full Duplex Modeling](#full-duplex-modeling)
  * [Survey](#survey)

# Awesome-Speech-Language-Model
Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

## Universal Speech, Audio and Music Understanding

### Model
- LTU: [Listen, Think, and Understand](https://arxiv.org/abs/2305.10790) - `ICLR 2024`
- [SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289) - `ICLR 2024`
- LTU-AS: [Joint Audio and Speech Understanding](https://arxiv.org/abs/2309.14405) - `ASRU 2023`
- [Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/abs/2311.07919) - `arXiv 2023`
- [Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities](https://arxiv.org/abs/2402.01831) - `ICML 2024`
- [Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759) - `arXiv 2024`
- [WavLLM: Towards Robust and Adaptive Speech Large Language Model](https://arxiv.org/abs/2404.00656) - `EMNLP 2024`
- DiVA: [Distilling an End-to-End Voice Assistant Without Instruction Training Data](https://arxiv.org/abs/2410.02678) - `arXiv 2024`

### Benchmark
- [Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech](https://arxiv.org/abs/2309.09510) - `ICASSP 2024`
- [AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension](https://arxiv.org/abs/2402.07729) - `ACL 2024`
- [SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words](https://arxiv.org/abs/2406.13340) - `arXiv 2024`
- [AudioBench: A Universal Benchmark for Audio Large Language Models](https://arxiv.org/abs/2406.16020) - `arXiv 2024`
- SALMon: [A Suite for Acoustic Language Model Evaluation](https://arxiv.org/abs/2409.07437) - `arXiv 2024`
- [MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark](https://arxiv.org/abs/2410.19168) - `arXiv 2024`
- [Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks](https://openreview.net/forum?id=s7lzZpAW7T) - `ICLR 2025 OpenReview`

## End2End Speech Dialogue System

### Model
- [SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities](https://arxiv.org/abs/2305.11000) - `EMNLP 2023`
- [GPT-4o Voice Mode](https://openai.com/index/hello-gpt-4o/) - `API 2024`
- [PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems](https://arxiv.org/abs/2406.12428) - `EMNLP 2024`
- [VITA: Towards Open-Source Interactive Omni Multimodal LLM](https://arxiv.org/abs/2408.05211) - `arXiv 2024`
- [Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming](https://arxiv.org/abs/2408.16725) - `arXiv 2024`
- [LLaMA-Omni: Seamless Speech Interaction with Large Language Models](https://arxiv.org/abs/2409.06666) - `arXiv 2024`
- [Moshi: a speech-text foundation model for real-time dialogue](https://arxiv.org/abs/2410.00037) - `arXiv 2024`
- [Westlake-Omni](https://github.com/xinchen-ai/Westlake-Omni) - `GitHub 2024`
- [EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions](https://arxiv.org/abs/2409.18042) - `arXiv 2024`
- [IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities](https://arxiv.org/abs/2410.08035) - `arXiv 2024`
- [Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities](https://arxiv.org/abs/2410.11190) - `arXiv 2024`
- [MooER-omni](https://github.com/MooreThreads/MooER) - `GitHub 2024`
- [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) - `GitHub 2024`
- [Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM](https://arxiv.org/abs/2411.00774) - `arXiv 2024`
- [Hertz-dev](https://github.com/Standard-Intelligence/hertz-dev) - `GitHub 2024`
- [Fish Agent](https://github.com/fishaudio/fish-speech) - `GitHub 2024`

### Benchmark
- [VoiceBench: Benchmarking LLM-Based Voice Assistants](https://arxiv.org/abs/2410.17196) - `arXiv 2024`

## Full Duplex Modeling
- [A Full-duplex Speech Dialogue Scheme Based On Large Language Models](https://arxiv.org/abs/2405.19487) - `NeurIPS 2024`
- MiniCPM-duplex: [Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models](https://arxiv.org/abs/2406.15718) - `EMNLP 2024`
- LSLM: [Language Model Can Listen While Speaking](https://arxiv.org/abs/2408.02622) - `arXiv 2024`
- SyncLLM: [Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents](https://arxiv.org/abs/2409.15594) - `arXiv 2024`
- [Enabling Real-Time Conversations with Minimal Training Costs](https://arxiv.org/abs/2409.11727) - `arXiv 2024`

## Survey
- [Towards audio language modeling -- an overview](https://arxiv.org/abs/2402.13236) - `arXiv 2024`
- [Recent Advances in Speech Language Models: A Survey](https://arxiv.org/abs/2410.03751) - `arXiv 2024`
- [A Survey on Speech Large Language Models](https://arxiv.org/abs/2410.18908) - `arXiv 2024`
- [Speech Trident](https://github.com/ga642381/speech-trident) - `GitHub`
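For readers who want to track these references programmatically (e.g. to build a local reading list), the entries above follow a regular `- [Title](url) - ` `` `Venue Year` `` pattern, optionally prefixed with a model alias such as `LTU:`. A minimal standard-library Python sketch (the helper name `parse_entries` and the regex are illustrative, not part of this list) could extract them like so:

```python
import re

# Matches list entries of the form: - [Title](url) - `Venue Year`
# An optional "Alias: " prefix before the link (e.g. "LTU: ") is
# captured separately.
ENTRY = re.compile(
    r"^- (?:(?P<alias>[\w\-./ ]+): )?"       # optional alias before the link
    r"\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)"  # [Title](url)
    r"\s*-\s*`(?P<venue>[^`]+)`",            # - `Venue Year`
    re.MULTILINE,
)

def parse_entries(markdown: str) -> list[dict]:
    """Return one dict per entry with title, url, venue, and optional alias."""
    return [m.groupdict() for m in ENTRY.finditer(markdown)]

sample = (
    "- LTU: [Listen, Think, and Understand](https://arxiv.org/abs/2305.10790) - `ICLR 2024`\n"
    "- [VoiceBench: Benchmarking LLM-Based Voice Assistants](https://arxiv.org/abs/2410.17196) - `arXiv 2024`\n"
)
for entry in parse_entries(sample):
    print(entry["url"], "-", entry["venue"])
```

Plain GitHub-repo entries without a trailing venue tag (e.g. the Speech Trident line) would need a second, looser pattern; the one above deliberately matches only the paper-style entries.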