https://github.com/funaudiollm/thinksound
PyTorch implementation of [ThinkSound], a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.
- Host: GitHub
- URL: https://github.com/funaudiollm/thinksound
- Owner: FunAudioLLM
- Created: 2025-06-27T02:27:00.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2025-07-03T08:14:20.000Z (7 months ago)
- Last Synced: 2025-07-03T08:44:17.588Z (7 months ago)
- Language: Python
- Size: 1.56 MB
- Stars: 143
- Watchers: 0
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.md
README
# ThinkSound
If you find this project useful,
a star ⭐ on GitHub would be greatly appreciated!
---
**ThinkSound** is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.
PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).

---
## 📰 News
- **2025.07** 🔧 Major update: the model has been made lighter, with optimized memory and GPU usage, and now supports high-throughput audio generation at scale!
- **2025.07** 🔥 Online demo on [Hugging Face Spaces](https://huggingface.co/spaces/FunAudioLLM/ThinkSound) and [ModelScope](https://modelscope.cn/studios/iic/ThinkSound) for an interactive experience!
- **2025.07** 🔥 Released inference scripts and a web interface!
- **2025.06** 🔥 [ThinkSound paper](https://arxiv.org/pdf/2506.21448) released on arXiv!
- **2025.06** 🔥 [Online Demo](http://thinksound-project.github.io/) is live - try it now!
---
## 🚀 Features
- **Any2Audio**: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
- **Video-to-Audio SOTA**: Achieves state-of-the-art results on multiple V2A benchmarks.
- **CoT-Driven Reasoning**: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
- **Interactive Object-centric Editing**: Refine or edit specific sound events by clicking on visual objects or using text instructions.
- **Unified Framework**: One foundation model supports generation, editing, and interactive workflows.
---
## ✨ Method Overview
ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:
1. **Foley Generation:** Generate foundational, semantically and temporally aligned soundscapes from video.
2. **Object-Centric Refinement:** Refine or add sounds for user-specified objects via clicks or regions in the video.
3. **Targeted Audio Editing:** Modify generated audio using high-level natural language instructions.
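
Conceptually, the three stages form a single CoT-conditioned loop. The sketch below is illustrative only: the callables (`reason`, `generate`, `refine`, `edit`) are hypothetical stand-ins for the MLLM and flow-matching components, not this repository's actual API.

```python
# Illustrative wiring of the three CoT-guided stages (hypothetical API, not the repo's).
from typing import Any, Callable, Optional

def think_sound_pipeline(
    video: Any,
    reason: Callable[..., str],    # MLLM Chain-of-Thought reasoning step (hypothetical)
    generate: Callable[..., Any],  # flow-matching foley generator (hypothetical)
    refine: Callable[..., Any],    # object-centric refinement model (hypothetical)
    edit: Callable[..., Any],      # instruction-driven audio editor (hypothetical)
    click_regions: Optional[list] = None,
    edit_instruction: Optional[str] = None,
) -> Any:
    # Stage 1 (Foley generation): a video-level CoT description conditions the
    # audio generator so the soundscape is semantically and temporally aligned.
    cot = reason(video, task="describe_soundscape")
    audio = generate(video, cot)

    # Stage 2 (Object-centric refinement): clicks or regions select objects, and
    # an object-focused CoT refines or adds the corresponding sounds.
    if click_regions is not None:
        object_cot = reason(video, regions=click_regions, task="object_sounds")
        audio = refine(audio, video, object_cot)

    # Stage 3 (Targeted editing): a natural-language instruction is expanded
    # into an editing CoT that drives the final editing pass.
    if edit_instruction is not None:
        edit_cot = reason(video, instruction=edit_instruction, task="plan_edit")
        audio = edit(audio, edit_cot)

    return audio
```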

---
## ⚡ Quick Start
**Environment Preparation:**
```bash
git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
pip install -r requirements.txt
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory
# The model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
```
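As an optional alternative to the `git lfs` clone above, the same checkpoint repository can typically be fetched with the `huggingface_hub` Python package (assuming `pip install huggingface_hub`); this is a convenience sketch, not part of the official setup steps.
```python
# Optional alternative to the git-lfs clone above (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

# Download the pretrained weights from the Hugging Face repo into ckpts/.
snapshot_download(repo_id="liuhuadai/ThinkSound", local_dir="ckpts")
```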
**Make the demo script executable:**
```bash
chmod +x scripts/demo.sh
```
**Run the script:**
```bash
./scripts/demo.sh [use-half]
```
Add `use-half` at the end to enable half-precision inference, which reduces GPU memory usage.
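For reference, `use-half` corresponds to standard half-precision (fp16) inference in PyTorch. The snippet below is a generic illustration of why this roughly halves the memory taken by model weights; it is not ThinkSound's inference code.
```python
# Generic fp16 illustration (not ThinkSound's inference code): casting a model's
# parameters to half precision roughly halves the memory they occupy.
import torch

model = torch.nn.Linear(4096, 4096)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model = model.half()  # cast parameters to float16
fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(f"fp32: {fp32_bytes / 2**20:.1f} MiB, fp16: {fp16_bytes / 2**20:.1f} MiB")
# Expected output: roughly "fp32: 64.0 MiB, fp16: 32.0 MiB"
```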
Use the `eval_batch.sh` script to extract features from a batch of videos and run inference to generate audio outputs.
```bash
chmod +x scripts/eval_batch.sh
./scripts/eval_batch.sh [use-half]
```
The script takes the following arguments:
* **Video directory**: path to the root directory containing the video files.
  * **Requirement**: all videos must be in `.mp4` format.
  * **Assumption**: all videos have **equal duration**.
* **CSV file**: path to the CSV file containing text descriptions (e.g., captions, CoT prompts) for each video.
  * The format should be similar to `demo_test.csv`, where each row corresponds to a video and includes at least the filename (without extension) and the associated text.
* **Output directory** (optional): directory where the generated audio files will be saved.
  * Defaults to `results/features` if not provided.
* `[use-half]` (optional): enables half-precision inference to reduce GPU memory usage.
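
For reference, a per-video description CSV along the lines of `demo_test.csv` could be generated as below. The column names (`id`, `caption`) and the output filename are hypothetical; check `demo_test.csv` in the repository for the exact header the script expects.
```python
# Hypothetical example of a per-video description CSV; the real column names and
# ordering should be taken from demo_test.csv in the repository.
import csv

rows = [
    # filename without the .mp4 extension, plus the associated text (caption / CoT prompt)
    {"id": "dog_barking", "caption": "A dog barks twice, then a car passes by."},
    {"id": "ocean_waves", "caption": "Gentle waves break on a pebble beach."},
]

with open("my_descriptions.csv", "w", newline="", encoding="utf-8") as f:  # example filename
    writer = csv.DictWriter(f, fieldnames=["id", "caption"])  # hypothetical headers
    writer.writeheader()
    writer.writerows(rows)
```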
### Web Interface Usage
For an interactive experience, launch the Gradio web interface:
```bash
python app.py
```
---
## 📝 TODO
- ☐ Release training scripts for ThinkSound models
- ☐ Open-source AudioCoT dataset and automated pipeline
- ☐ Provide detailed documentation and API reference
- ☐ Add support for additional modalities and downstream tasks
---
## 📄 License
This project is released under the [Apache 2.0 License](LICENSE).
> **Note:**
> The code, models, and dataset are **for research and educational purposes only**.
> **Commercial use is NOT permitted.**
>
> For commercial licensing, please contact the authors.
---
## 📖 Citation
If you find ThinkSound useful in your research or work, please cite our paper:
```bibtex
@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
      year={2025},
      eprint={2506.21448},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.21448},
}
```
---
## 📬 Contact
✨ Feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues) or contact us via email ([liuhuadai@zju.edu.cn](mailto:liuhuadai@zju.edu.cn)) if you have any questions or suggestions!