* [Awesome-Speech-Language-Model](#awesome-speech-language-model)
  * [Universal Speech, Audio and Music Understanding](#universal-speech-audio-and-music-understanding)
    * [Model](#model)
    * [Benchmark](#benchmark)
  * [End2End Speech Dialogue System](#end2end-speech-dialogue-system)
    * [Model](#model-1)
    * [Benchmark](#benchmark-1)
  * [Full Duplex Modeling](#full-duplex-modeling)
  * [Survey](#survey)

# Awesome-Speech-Language-Model
Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

## Universal Speech, Audio and Music Understanding

### Model
- LTU: [Listen, Think, and Understand](https://arxiv.org/abs/2305.10790) - `ICLR 2024`
- [SALMONN: Towards Generic Hearing Abilities for Large Language Models](https://arxiv.org/abs/2310.13289) - `ICLR 2024`
- LTU-AS: [Joint Audio and Speech Understanding](https://arxiv.org/abs/2309.14405) - `ASRU 2023`
- [Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models](https://arxiv.org/abs/2311.07919) - `arXiv 2023`
- [Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities](https://arxiv.org/abs/2402.01831) - `ICML 2024`
- [Qwen2-Audio Technical Report](https://arxiv.org/abs/2407.10759) - `arXiv 2024`
- [WavLLM: Towards Robust and Adaptive Speech Large Language Model](https://arxiv.org/abs/2404.00656) - `EMNLP 2024`
- DiVA: [Distilling an End-to-End Voice Assistant Without Instruction Training Data](https://arxiv.org/abs/2410.02678) - `arXiv 2024`

### Benchmark
- [Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech](https://arxiv.org/abs/2309.09510) - `ICASSP 2024`
- [AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension](https://arxiv.org/abs/2402.07729) - `ACL 2024`
- [SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words](https://arxiv.org/abs/2406.13340) - `arXiv 2024`
- [AudioBench: A Universal Benchmark for Audio Large Language Models](https://arxiv.org/abs/2406.16020) - `arXiv 2024`
- SALMon: [A Suite for Acoustic Language Model Evaluation](https://arxiv.org/abs/2409.07437) - `arXiv 2024`
- [MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark](https://arxiv.org/abs/2410.19168) - `arXiv 2024`
- [Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks](https://openreview.net/forum?id=s7lzZpAW7T) - `ICLR 2025 OpenReview`

## End2End Speech Dialogue System

### Model
- [SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities](https://arxiv.org/abs/2305.11000) - `EMNLP 2023`
- [GPT-4o Voice Mode](https://openai.com/index/hello-gpt-4o/) - `API 2024`
- [PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems](https://arxiv.org/abs/2406.12428) - `EMNLP 2024`
- [VITA: Towards Open-Source Interactive Omni Multimodal LLM](https://arxiv.org/abs/2408.05211) - `arXiv 2024`
- [Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming](https://arxiv.org/abs/2408.16725) - `arXiv 2024`
- [LLaMA-Omni: Seamless Speech Interaction with Large Language Models](https://arxiv.org/abs/2409.06666) - `arXiv 2024`
- [Moshi: a speech-text foundation model for real-time dialogue](https://arxiv.org/abs/2410.00037) - `arXiv 2024`
- [Westlake-Omni](https://github.com/xinchen-ai/Westlake-Omni) - `GitHub 2024`
- [EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions](https://arxiv.org/abs/2409.18042) - `arXiv 2024`
- [IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities](https://arxiv.org/abs/2410.08035) - `arXiv 2024`
- [Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities](https://arxiv.org/abs/2410.11190) - `arXiv 2024`
- [MooER-omni](https://github.com/MooreThreads/MooER) - `GitHub 2024`
- [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) - `GitHub 2024`
- [Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM](https://arxiv.org/abs/2411.00774) - `arXiv 2024`
- [Hertz-dev](https://github.com/Standard-Intelligence/hertz-dev) - `GitHub 2024`
- [Fish Agent](https://github.com/fishaudio/fish-speech) - `GitHub 2024`

### Benchmark
- [VoiceBench: Benchmarking LLM-Based Voice Assistants](https://arxiv.org/abs/2410.17196) - `arXiv 2024`

## Full Duplex Modeling
- [A Full-duplex Speech Dialogue Scheme Based On Large Language Models](https://arxiv.org/abs/2405.19487) - `NeurIPS 2024`
- MiniCPM-duplex: [Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models](https://arxiv.org/abs/2406.15718) - `EMNLP 2024`
- LSLM: [Language Model Can Listen While Speaking](https://arxiv.org/abs/2408.02622) - `arXiv 2024`
- SyncLLM: [Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents](https://arxiv.org/abs/2409.15594) - `arXiv 2024`
- [Enabling Real-Time Conversations with Minimal Training Costs](https://arxiv.org/abs/2409.11727) - `arXiv 2024`

## Survey
- [Towards audio language modeling -- an overview](https://arxiv.org/abs/2402.13236) - `arXiv 2024`
- [Recent Advances in Speech Language Models: A Survey](https://arxiv.org/abs/2410.03751) - `arXiv 2024`
- [A Survey on Speech Large Language Models](https://arxiv.org/abs/2410.18908) - `arXiv 2024`
- [Speech Trident](https://github.com/ga642381/speech-trident) - `GitHub`
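For readers who want to track these references programmatically (e.g. to build a local reading list), the entries above follow a regular `- [Title](url) - ` `` `Venue Year` `` pattern, optionally prefixed with a model alias such as `LTU:`. A minimal standard-library Python sketch (the helper name `parse_entries` and the regex are illustrative, not part of this list) could extract them like so:

```python
import re

# Matches list entries of the form: - [Title](url) - `Venue Year`
# An optional "Alias: " prefix before the link (e.g. "LTU: ") is
# captured separately.
ENTRY = re.compile(
    r"^- (?:(?P<alias>[\w\-./ ]+): )?"       # optional alias before the link
    r"\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)"  # [Title](url)
    r"\s*-\s*`(?P<venue>[^`]+)`",            # - `Venue Year`
    re.MULTILINE,
)

def parse_entries(markdown: str) -> list[dict]:
    """Return one dict per entry with title, url, venue, and optional alias."""
    return [m.groupdict() for m in ENTRY.finditer(markdown)]

sample = (
    "- LTU: [Listen, Think, and Understand](https://arxiv.org/abs/2305.10790) - `ICLR 2024`\n"
    "- [VoiceBench: Benchmarking LLM-Based Voice Assistants](https://arxiv.org/abs/2410.17196) - `arXiv 2024`\n"
)
for entry in parse_entries(sample):
    print(entry["url"], "-", entry["venue"])
```

Plain GitHub-repo entries without a trailing venue tag (e.g. the Speech Trident line) would need a second, looser pattern; the one above deliberately matches only the paper-style entries.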