https://github.com/yangdongchao/UniAudio2
The open-source code of UniAudio2.0
- Host: GitHub
- URL: https://github.com/yangdongchao/UniAudio2
- Owner: yangdongchao
- Created: 2025-09-01T16:26:31.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-09-01T17:45:49.000Z (2 months ago)
- Last Synced: 2025-09-01T19:06:14.299Z (2 months ago)
- Language: Python
- Size: 831 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
- ai-game-devtools - UniAudio 2.0 - Multi-task Audio Foundation Model with Reasoning-Augmented Audio Tokenization. (Speech / LLM & Tool)
README
# UniAudio 2.0: A Multi-task Audio Foundation Model with Reasoning-Augmented Audio Tokenization

## Abstract
In this work, we present UniAudio 2.0, a multi-task audio foundation model that unifies speech, sound, and music understanding and generation within a single framework. A key component of UniAudio 2.0 is our proposed ReasoningCodec, which tokenizes audio into reasoning tokens and semantic tokens. Reasoning tokens capture descriptive, interpretable attributes (e.g., linguistic content, emotion, style, acoustic scene), while semantic tokens encode structural and fine-grained acoustic details necessary for faithful reconstruction.
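As a concrete picture of the two streams, the toy container below mirrors the described split between reasoning tokens and semantic tokens. The class, field names, shapes, and the commented `encode`/`decode` calls are illustrative assumptions for this sketch, not the repository's actual data structures or API.

```python
from dataclasses import dataclass
import torch

@dataclass
class CodecOutput:
    """Toy container mirroring the paper's two token streams.

    Field names and shapes are illustrative assumptions, not the
    repository's actual data structures.
    """
    # Interpretable attributes: linguistic content, emotion, style, scene.
    reasoning_tokens: torch.LongTensor  # (num_reasoning_tokens,)
    # Structural and fine-grained acoustic detail for reconstruction.
    semantic_tokens: torch.LongTensor   # (num_frames, num_codebooks)

# A hypothetical round trip would populate both streams from one waveform
# and consume both to reconstruct it (method names are assumed):
# out = codec.encode(wav)
# recon = codec.decode(out.reasoning_tokens, out.semantic_tokens)
```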
For generation tasks, UniAudio 2.0 adopts a reasoning-first prediction strategy: the model first predicts reasoning tokens, providing human-interpretable descriptions that enhance interpretability and predictive accuracy, and then generates semantic tokens to synthesize high-fidelity audio. This design enables UniAudio 2.0 to move beyond transcription-centric modeling, incorporating paralinguistic information (emotion, timbre, tone), environmental context, and non-linguistic sounds, which are often neglected in prior text–audio foundation models.
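The reasoning-first ordering can be illustrated as a single autoregressive sequence in which reasoning tokens are emitted before semantic tokens. The sampling loop below is a generic sketch under that assumption, given a `model` callable that returns per-position next-token logits; it is not UniAudio 2.0's actual decoding code.

```python
import torch

def reasoning_first_generate(model, prompt_ids, num_reasoning, num_semantic):
    """Generic two-phase autoregressive sketch (assumed, not the real decoder).

    Phase 1 samples reasoning tokens (an interpretable description of the
    target audio); phase 2 continues the same sequence with semantic tokens,
    conditioned on the prompt *and* the sampled reasoning tokens.
    """
    seq = prompt_ids  # (1, prompt_len) token IDs encoding the task prompt
    for _ in range(num_reasoning + num_semantic):
        logits = model(seq)[:, -1, :]           # next-token distribution
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1)   # sample one token
        seq = torch.cat([seq, next_id], dim=1)
    start = prompt_ids.size(1)
    reasoning = seq[:, start:start + num_reasoning]
    semantic = seq[:, start + num_reasoning:]
    return reasoning, semantic  # semantic tokens then feed the codec decoder
```

Ordering the streams this way means the interpretable description is fixed before any acoustic detail is committed, which is what lets the reasoning tokens act as an inspectable intermediate plan.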
By integrating ReasoningCodec into a unified audio foundation model, UniAudio 2.0 establishes a comprehensive framework for multi-domain audio tasks. Experimental results show that UniAudio 2.0 improves both understanding and generation across diverse modalities, advancing the goal of general-purpose audio foundation models.
## Planning
This project is a work in progress. The **ReasoningCodec** is already available; for the training details of UniAudio 2.0, please refer to our research paper.
- [x] Release the training code and the checkpoints of ReasoningCodec
- [ ] Release the research paper of UniAudio 2.0
- [ ] Release the checkpoints of UniAudio 2.0
## Environment
```bash
conda create -n uniaudio2 python=3.10
conda activate uniaudio2  # activate the environment before installing packages
pip install pip==24.0
pip install fairseq==0.12.2
pip install torch==2.4.1
pip install torchaudio==2.4.1
pip install torchtitan==0.0.2
pip install langsegment==0.3.5
pip install typeguard==2.13.3
pip install lightning==2.4.0
pip install nnAudio==0.3.3
pip install omegaconf==2.0.6
pip install torchtune==0.4.0
pip install torchao==0.9.0
pip install tensorboard
pip install humanfriendly
```
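After installing, a quick import check can confirm that the pinned versions resolved correctly. The script below only assumes the packages listed above.

```python
# sanity_check.py -- verify the pinned environment imports cleanly
import torch
import torchaudio

print("torch:", torch.__version__)             # expect 2.4.1
print("torchaudio:", torchaudio.__version__)   # expect 2.4.1
print("CUDA available:", torch.cuda.is_available())
```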
## Acknowledgement
Part of the code is adapted from MuCodec (https://github.com/tencent-ailab/MuCodec).