https://github.com/freedomintelligence/echox

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
https://github.com/freedomintelligence/echox

ai artificial-intelligence dialogue-systems llm s2s speech-to-speech

Last synced: 9 months ago
JSON representation

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Host: GitHub
URL: https://github.com/freedomintelligence/echox
Owner: FreedomIntelligence
License: apache-2.0
Created: 2025-04-09T13:00:12.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-09-09T16:36:33.000Z (9 months ago)
Last Synced: 2025-09-09T19:55:25.303Z (9 months ago)
Topics: ai, artificial-intelligence, dialogue-systems, llm, s2s, speech-to-speech
Language: Python
Homepage:
Size: 5.47 MB
Stars: 3
Watchers: 12
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs



  

  

  





   📄 Paper |

   📦 Model | 

   🚀 HF Space | 

   🌐 Web Demo | 

   📊 EchoX-Dialougues | 

   📊 EchoX-Dialogues-Plus



## Contents

- [EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs](#echox-towards-mitigating-acoustic-semantic-gap-via-echo-training-for-speech-to-speech-llms)

  - [Contents](#contents)

  - [Key Features](#key-features)

  - [Performance](#performance)

  - [Datasets and Models](#datasets-and-models)

    - [Dataset](#dataset)

    - [Model](#model)

  - [Quickstart](#quickstart)

    - [Environment Setup](#environment-setup)

    - [Model Download](#model-download)

    - [Inference](#inference)

  - [Citation](#citation)

  - [License](#license)

## Key Features

- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs

- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)

- Trained on Only 6k Hours of Curated Data, Ensuring Efficiency

- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks

- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks

## Performance



  



EchoX demonstrates exceptional performance on knowledge-based question-answering tasks. The model achieves superior results with minimal training data, establishing a new benchmark for efficiency in speech-to-speech language models.

## Datasets and Models

### Dataset

EchoX is trained on carefully curated datasets for each stage of the pipeline, ensuring optimal performance across ASR, TTS, and SQA tasks. The datasets used are as follows:

| Task      | Data                | Size        | Duration(H) | Stage  | Download                                                                    |

| :-------- | :------------------ | :---------- | :---------- | :----- | :-------------------------------------------------------------------------- |

| ASR       | LibriSpeech         | 281,241     | 960         | I      | -                                                                           |

| ASR       | MLS                 | 723,636     | 3,000       | I      | -                                                                           |

| TTS       | AudioQA-1M          | 178,576     | 989         | II     | -                                                                           |

| TTS       | SpeechInstruct      | 31,563      | 84          | II     | -                                                                           |

| TTS       | HH-RLHF-Speech      | 124,945     | 656         | II     | -                                                                           |

| SQA       | sharechatx          | 43,223      | 178         | I, III | [Link](https://huggingface.co/datasets/KurtDu/EchoX-Dialogues) |

| SQA       | Magpie-Pro-Speech+   | 117,000     | 327         | I, III | [Link](https://huggingface.co/datasets/KurtDu/EchoX-Dialogues) |

| **Total** |                     | **1,500,184** | **6,194**   |        |                                                                             |

### Model

The following pre-trained models are available for download:

| Model        | Parameters | Training Data | Download Link                                      |

| ------------ | ---------- | ------------- | -------------------------------------------------- |

| **EchoX-3B** | 3 billion  | 6k hours     | [EchoX-3B Model](https://huggingface.co/FreedomIntelligence/EchoX-3B) |

| **EchoX-8B** | 8 billion  | 6k hours     | [EchoX-8B Model](https://huggingface.co/FreedomIntelligence/EchoX-8B) |

## Quickstart

### Environment Setup

To set up your environment, follow these steps:

```bash

git clone https://github.com/FreedomIntelligence/EchoX.git

cd EchoX

conda create -n echox python=3.10 pip=24.0

conda activate echox

pip install -r requirements.txt

```

### Model Download

Download the models to this repository directory using the following commands:

```bash

pip install -U huggingface_hub

hf download --resume-download FreedomIntelligence/EchoX-8B --local-dir EchoX-8B

hf download --resume-download openai/whisper-large-v3 --local-dir whisper-large-v3

```

**Note**: If the models are downloaded to a different location, please update the model directory paths in [inference/echox_stream.py](inference/echox_stream.py) accordingly.

### Inference

Run inference on a test case:

```bash

python demo.py

```

Alternatively, start the Gradio web interface:

```bash

python app.py

```

To use a specific GPU:

```bash

CUDA_VISIBLE_DEVICES=1 python app.py

```

## Citation

If you use EchoX in your research or projects, please cite our paper:

```bibtex

@misc{zhang2025echoxmitigatingacousticsemanticgap,

      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs}, 

      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},

      year={2025},

      eprint={2509.09174},

      archivePrefix={arXiv},

      primaryClass={cs.CL},

      url={https://arxiv.org/abs/2509.09174}, 

}

```

## License

This project is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/freedomintelligence/echox

Awesome Lists containing this project

README