{"id":29358266,"url":"https://github.com/funaudiollm/thinksound","last_synced_at":"2025-07-09T06:09:51.701Z","repository":{"id":301474241,"uuid":"1009362171","full_name":"FunAudioLLM/ThinkSound","owner":"FunAudioLLM","description":"PyTorch implementation of [ThinkSound], a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.","archived":false,"fork":false,"pushed_at":"2025-07-03T08:14:20.000Z","size":1633,"stargazers_count":143,"open_issues_count":4,"forks_count":3,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-07-03T08:44:17.588Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FunAudioLLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-27T02:27:00.000Z","updated_at":"2025-07-03T08:30:48.000Z","dependencies_parsed_at":"2025-07-03T08:44:20.688Z","dependency_job_id":null,"html_url":"https://github.com/FunAudioLLM/ThinkSound","commit_stats":null,"previous_names":["liuhuadai/thinksound","funaudiollm/thinksound"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FunAudioLLM/ThinkSound","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FunAudioLLM%2FThinkSound","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FunAudioLLM%2FThinkSound/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FunAudioLLM%2FThinkSound/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/FunAudioLLM%2FThinkSound/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FunAudioLLM","download_url":"https://codeload.github.com/FunAudioLLM/ThinkSound/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FunAudioLLM%2FThinkSound/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264403827,"owners_count":23602623,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-09T06:09:48.230Z","updated_at":"2025-07-09T06:09:51.695Z","avatar_url":"https://github.com/FunAudioLLM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eThinkSound\u003c/h1\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n  🌐\r\n  \u003ca href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=en\"\u003eEnglish\u003c/a\u003e |\r\n  \u003ca href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=zh-CN\"\u003e简体中文\u003c/a\u003e |\r\n  \u003ca href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=zh-TW\"\u003e繁體中文\u003c/a\u003e |\r\n  \u003ca href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=es\"\u003eEspañol\u003c/a\u003e |\r\n  \u003ca href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=fr\"\u003eFrançais\u003c/a\u003e |\r\n  \u003ca 
href=\"https://openaitx.github.io/view.html?user=FunAudioLLM\u0026project=ThinkSound\u0026lang=ja\"\u003e日本語\u003c/a\u003e\r\n  \r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003ca href=\"https://arxiv.org/pdf/2506.21448\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/arXiv-2506.21448-b31b1b.svg\" alt=\"arXiv\"/\u003e\r\n  \u003c/a\u003e\r\n  \u0026nbsp;\r\n  \u003ca href=\"https://thinksound-project.github.io/\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/Online%20Demo-🌐-blue\" alt=\"Online Demo\"/\u003e\r\n  \u003c/a\u003e\r\n  \u0026nbsp;\r\n  \u003ca href=\"https://huggingface.co/spaces/FunAudioLLM/ThinkSound\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/HuggingFace-Spaces-orange?logo=huggingface\" alt=\"Hugging Face\"/\u003e\r\n  \u003c/a\u003e\r\n  \u0026nbsp;\r\n  \u003ca href=\"https://modelscope.cn/studios/iic/ThinkSound\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/ModelScope-在线体验-green\" alt=\"ModelScope\"/\u003e\r\n  \u003c/a\u003e\r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n  If you find this project useful,\u003cbr\u003e\r\n  a star ⭐ on GitHub would be greatly appreciated!\r\n\u003c/p\u003e\r\n\r\n---\r\n\r\n**ThinkSound** is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.\r\n\r\nPyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).\r\n\r\n![Teaser](assets/figs/fig1_teaser.png)\r\n---\r\n\r\n## 📰 News\r\n- **2025.07** \u0026nbsp;  🔧 Major update: model lightweighted and optimized memory and GPU usage, now supports high-throughput audio generation at scale!\r\n- **2025.07** \u0026nbsp; 🔥Online demo on [Hugging Face Spaces](https://huggingface.co/spaces/FunAudioLLM/ThinkSound) and [ModelScope](https://modelscope.cn/studios/iic/ThinkSound) for interactive 
experience!\r\n- **2025.07** \u0026nbsp; 🔥Released inference scripts and web interface.\r\n- **2025.06** \u0026nbsp; 🔥[ThinkSound paper](https://arxiv.org/pdf/2506.21448) released on arXiv!\r\n- **2025.06** \u0026nbsp; 🔥[Online Demo](http://thinksound-project.github.io/) is live - try it now!\r\n\r\n---\r\n\r\n\r\n## 🚀 Features\r\n\r\n- **Any2Audio**: Generate audio from arbitrary modalities: video, text, audio, or their combinations.\r\n- **Video-to-Audio SOTA**: Achieves state-of-the-art results on multiple V2A benchmarks.\r\n- **CoT-Driven Reasoning**: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.\r\n- **Interactive Object-centric Editing**: Refine or edit specific sound events by clicking on visual objects or using text instructions.\r\n- **Unified Framework**: One foundation model supports generation, editing, and interactive workflows.\r\n\r\n---\r\n\r\n## ✨ Method Overview\r\n\r\nThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:\r\n\r\n1. **Foley Generation:** Generate foundational, semantically and temporally aligned soundscapes from video.\r\n2. **Object-Centric Refinement:** Refine or add sounds for user-specified objects via clicks or regions in the video.\r\n3. 
**Targeted Audio Editing:** Modify generated audio using high-level natural language instructions.\r\n\r\n![ThinkSound Overview](assets/figs/fig3_model.png)\r\n\u003c!-- A large-scale CoT-annotated dataset (**AudioCoT**) is used to train both the reasoning module and the unified audio foundation model.\r\n![AudioCoT Pipeline](assets/figs/fig2_dataset.png) --\u003e\r\n\r\n---\r\n\r\n## ⚡ Quick Start\r\n\r\n**Environment Preparation:**\r\n```bash\r\ngit clone https://github.com/liuhuadai/ThinkSound.git\r\ncd ThinkSound\r\npip install -r requirements.txt\r\nconda install -y -c conda-forge 'ffmpeg\u003c7'\r\n# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory\r\n# Model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound\r\ngit lfs install\r\ngit clone https://huggingface.co/liuhuadai/ThinkSound ckpts\r\n```\r\n\r\n**Make it executable**\r\n```bash\r\nchmod +x scripts/demo.sh\r\n```\r\n\r\n**Run the script**\r\n```bash\r\n./scripts/demo.sh \u003cvideo_path\u003e \u003ctitle\u003e \u003cCoT description\u003e [use-half]\r\n```\r\nAdd `use-half` at the end to enable half-precision inference, which reduces GPU memory usage.\r\n\r\nUse the `eval_batch.sh` script to extract features from a batch of videos and run inference to generate audio outputs.\r\n\r\n```bash\r\nchmod +x scripts/eval_batch.sh\r\n./scripts/eval_batch.sh \u003cvideo_path\u003e \u003ccsv_path\u003e \u003csave_path (optional)\u003e [use-half]\r\n```\r\n\r\n`\u003cvideo_path\u003e`: Path to the root directory containing video files.\r\n  * **Requirement**: All videos should be in `.mp4` format.\r\n  * **Assumption**: All videos have **equal duration**.\r\n\r\n`\u003ccsv_path\u003e`: Path to the CSV file containing text descriptions (e.g., captions, CoT prompts) for each video.\r\n  * Format should be similar to `demo_test.csv`, where each row corresponds to a video and includes at least the filename (without extension) and associated 
text.\r\n\r\n`\u003csave_path\u003e` (optional):\r\n  Directory where the generated audio files will be saved.\r\n  * Defaults to `results/features` if not provided.\r\n\r\n`[use-half]` (optional):\r\n  Add `use-half` at the end to enable half-precision inference, which reduces GPU memory usage.\r\n\r\n### Web Interface Usage\r\n\r\nFor an interactive experience, launch the Gradio web interface:\r\n\r\n```bash\r\npython app.py\r\n```\r\n\r\n---\r\n\r\n## 📝 TODO\r\n\r\n- ☐ Release training scripts for ThinkSound models\r\n- ☐ Open-source AudioCoT dataset and automated pipeline\r\n- ☐ Provide detailed documentation and API reference\r\n- ☐ Add support for additional modalities and downstream tasks\r\n\r\n---\r\n\r\n## 📄 License\r\n\r\nThis project is released under the [Apache 2.0 License](LICENSE).\r\n\r\n\u003e **Note:**  \r\n\u003e The code, models, and dataset are **for research and educational purposes only**.  \r\n\u003e **Commercial use is NOT permitted.**\r\n\u003e\r\n\u003e For commercial licensing, please contact the authors.\r\n\r\n---\r\n\r\n## 📖 Citation\r\n\r\nIf you find ThinkSound useful in your research or work, please cite our paper:\r\n\r\n```bibtex\r\n@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,\r\n      title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing}, \r\n      author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},\r\n      year={2025},\r\n      eprint={2506.21448},\r\n      archivePrefix={arXiv},\r\n      primaryClass={eess.AS},\r\n      url={https://arxiv.org/abs/2506.21448}, \r\n}\r\n```\r\n\r\n---\r\n\r\n## 📬 Contact\r\n\r\n✨ Feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues) or contact us via email ([liuhuadai@zju.edu.cn](mailto:liuhuadai@zju.edu.cn)) if you have any questions or 
suggestions!\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffunaudiollm%2Fthinksound","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffunaudiollm%2Fthinksound","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffunaudiollm%2Fthinksound/lists"}