{"id":13456505,"url":"https://github.com/2noise/ChatTTS","last_synced_at":"2025-03-24T10:32:48.267Z","repository":{"id":241384537,"uuid":"806709826","full_name":"2noise/ChatTTS","owner":"2noise","description":"A generative speech model for daily dialogue.","archived":false,"fork":false,"pushed_at":"2025-03-14T03:34:46.000Z","size":10056,"stargazers_count":35147,"open_issues_count":66,"forks_count":3798,"subscribers_count":192,"default_branch":"main","last_synced_at":"2025-03-17T21:15:39.148Z","etag":null,"topics":["agent","chat","chatgpt","chattts","chinese","chinese-language","english","english-language","gpt","llm","llm-agent","natural-language-inference","python","text-to-speech","torch","torchaudio","tts"],"latest_commit_sha":null,"homepage":"https://2noise.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/2noise.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-27T18:26:49.000Z","updated_at":"2025-03-17T20:56:47.000Z","dependencies_parsed_at":"2024-05-29T04:59:59.149Z","dependency_job_id":"c587f427-85ae-4ee2-8e69-5164153e6c28","html_url":"https://github.com/2noise/ChatTTS","commit_stats":{"total_commits":374,"total_committers":44,"mean_commits":8.5,"dds":0.4144385026737968,"last_synced_commit":"8fcc0cd6ae162ff8f2d65a2b355aaafb47d7e9e8"},"previous_names":["2noise/chattts"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2noise%2FChatTTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2
noise%2FChatTTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2noise%2FChatTTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2noise%2FChatTTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/2noise","download_url":"https://codeload.github.com/2noise/ChatTTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245252517,"owners_count":20585084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","chat","chatgpt","chattts","chinese","chinese-language","english","english-language","gpt","llm","llm-agent","natural-language-inference","python","text-to-speech","torch","torchaudio","tts"],"created_at":"2024-07-31T08:01:23.196Z","updated_at":"2025-03-24T10:32:48.260Z","avatar_url":"https://github.com/2noise.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://trendshift.io/repositories/10489\" target=\"_blank\"\u003e\u003cimg src=\"https://trendshift.io/api/badge/repositories/10489\" alt=\"2noise%2FChatTTS | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/\u003e\u003c/a\u003e\n\n# ChatTTS\nA generative speech model for daily 
dialogue.\n\n[![Licence](https://img.shields.io/github/license/2noise/ChatTTS?style=for-the-badge)](https://github.com/2noise/ChatTTS/blob/main/LICENSE)\n[![PyPI](https://img.shields.io/pypi/v/ChatTTS.svg?style=for-the-badge\u0026color=green)](https://pypi.org/project/ChatTTS)\n\n[![Huggingface](https://img.shields.io/badge/🤗%20-Models-yellow.svg?style=for-the-badge)](https://huggingface.co/2Noise/ChatTTS)\n[![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge\u0026logo=googlecolab\u0026color=525252)](https://colab.research.google.com/github/2noise/ChatTTS/blob/main/examples/ipynb/colab.ipynb)\n[![Discord](https://img.shields.io/badge/Discord-7289DA?style=for-the-badge\u0026logo=discord\u0026logoColor=white)](https://discord.gg/Ud5Jxgx5yD)\n\n**English** | [**简体中文**](docs/cn/README.md) | [**日本語**](docs/jp/README.md) | [**Русский**](docs/ru/README.md) | [**Español**](docs/es/README.md) | [**Français**](docs/fr/README.md) | [**한국어**](docs/kr/README.md)\n\n\u003c/div\u003e\n\n## Introduction\n\u003e [!Note]\n\u003e This repo contains the algorithm infrastructure and some simple examples.\n\n\u003e [!Tip]\n\u003e For extended end-user products, please refer to the index repo [Awesome-ChatTTS](https://github.com/libukai/Awesome-ChatTTS/tree/en) maintained by the community.\n\nChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.\n\n### Supported Languages\n- [x] English\n- [x] Chinese\n- [ ] Coming Soon...\n\n### Highlights\n\u003e You can refer to **[this video on Bilibili](https://www.bilibili.com/video/BV1zn4y1o7iV)** for a detailed description.\n\n1. **Conversational TTS**: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.\n2. **Fine-grained Control**: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.\n3. 
**Better Prosody**: ChatTTS surpasses most open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.\n\n### Dataset \u0026 Model\n\u003e [!Important]\n\u003e The released model is for academic purposes only.\n\n- The main model is trained on 100,000+ hours of Chinese and English audio data.\n- The open-source version on **[HuggingFace](https://huggingface.co/2Noise/ChatTTS)** is a 40,000-hour pre-trained model without SFT.\n\n### Roadmap\n- [x] Open-source the 40k-hour base model and spk_stats file.\n- [x] Streaming audio generation.\n- [x] Open-source the DVAE encoder and zero-shot inference code.\n- [ ] Multi-emotion control.\n- [ ] ChatTTS.cpp (a new repo in the `2noise` org is welcome)\n\n### Licenses\n\n#### The Code\n\nThe code is published under the `AGPLv3+` license.\n\n#### The Model\n\nThe model is published under the `CC BY-NC 4.0` license. It is intended for educational and research use, and should not be used for any commercial or illegal purposes. The authors do not guarantee the accuracy, completeness, or reliability of the information. The information and data used in this repo are for academic and research purposes only. The data were obtained from publicly available sources, and the authors do not claim any ownership or copyright over the data.\n\n### Disclaimer\n\nChatTTS is a powerful text-to-speech system. However, it is very important to use this technology responsibly and ethically. To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model and compressed the audio quality as much as possible using the MP3 format, to prevent malicious actors from potentially using it for criminal purposes. 
At the same time, we have internally trained a detection model and plan to open-source it in the future.\n\n### Contact\n\u003e GitHub issues/PRs are always welcome.\n\n#### Formal Inquiries\nFor formal inquiries about the model and roadmap, please contact us at **open-source@2noise.com**.\n\n#### Online Chat\n##### 1. QQ Group (Chinese social app)\n- **Group 1**, 808364215\n- **Group 2**, 230696694\n- **Group 3**, 933639842\n- **Group 4**, 608667975\n\n##### 2. Discord Server\nJoin by clicking [here](https://discord.gg/Ud5Jxgx5yD).\n\n## Get Started\n### Clone Repo\n```bash\ngit clone https://github.com/2noise/ChatTTS\ncd ChatTTS\n```\n\n### Install Requirements\n#### 1. Install Directly\n```bash\npip install --upgrade -r requirements.txt\n```\n\n#### 2. Install with conda\n```bash\nconda create -n chattts python=3.11\nconda activate chattts\npip install -r requirements.txt\n```\n\n#### Optional: Install vLLM (Linux only)\n```bash\npip install safetensors vllm==0.2.7 torchaudio\n```\n\n#### Optional (not recommended): Install TransformerEngine if using an NVIDIA GPU (Linux only)\n\u003e [!Warning]\n\u003e DO NOT INSTALL!\n\u003e The adaptation of TransformerEngine is currently under development and CANNOT run properly yet.\n\u003e Only install it for development purposes. See #672 and #676 for more details.\n\n\u003e [!Note]\n\u003e The installation process is very slow.\n\n```bash\npip install git+https://github.com/NVIDIA/TransformerEngine.git@stable\n```\n\n#### Optional (not recommended): Install FlashAttention-2 (mainly NVIDIA GPU)\n\u003e [!Warning]\n\u003e DO NOT INSTALL!\n\u003e FlashAttention-2 currently slows down generation speed, according to [this issue](https://github.com/huggingface/transformers/issues/26990). 
\n\u003e Only install it for development purposes.\n\n\u003e [!Note]\n\u003e See supported devices in the [Hugging Face Doc](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).\n\n```bash\npip install flash-attn --no-build-isolation\n```\n\n### Quick Start\n\u003e Make sure you are in the project root directory when executing the commands below.\n\n#### 1. Launch WebUI\n```bash\npython examples/web/webui.py\n```\n\n#### 2. Infer by Command Line\n\u003e It will save audio to `./output_audio_n.mp3`\n\n```bash\npython examples/cmd/run.py \"Your text 1.\" \"Your text 2.\"\n```\n\n## Installation\n\n1. Install the stable version from PyPI\n```bash\npip install ChatTTS\n```\n\n2. Install the latest version from GitHub\n```bash\npip install git+https://github.com/2noise/ChatTTS\n```\n\n3. Install from a local directory in dev mode\n```bash\npip install -e .\n```\n\n### Basic Usage\n\n```python\nimport ChatTTS\nimport torch\nimport torchaudio\n\nchat = ChatTTS.Chat()\nchat.load(compile=False) # Set to True for better performance\n\ntexts = [\"PUT YOUR 1st TEXT HERE\", \"PUT YOUR 2nd TEXT HERE\"]\n\nwavs = chat.infer(texts)\n\nfor i in range(len(wavs)):\n    # Depending on the torchaudio version, the waveform may need an extra\n    # channel dimension; try the unsqueezed form first and fall back otherwise.\n    try:\n        torchaudio.save(f\"basic_output{i}.wav\", torch.from_numpy(wavs[i]).unsqueeze(0), 24000)\n    except Exception:\n        torchaudio.save(f\"basic_output{i}.wav\", torch.from_numpy(wavs[i]), 24000)\n```\n\n### Advanced Usage\n\n```python\n###################################\n# Sample a speaker from a Gaussian distribution.\n\nrand_spk = chat.sample_random_speaker()\nprint(rand_spk) # save it for later timbre recovery\n\nparams_infer_code = ChatTTS.Chat.InferCodeParams(\n    spk_emb = rand_spk, # add the sampled speaker\n    temperature = .3,   # use a custom temperature\n    top_P = 0.7,        # top-P decoding\n    top_K = 20,         # top-K 
decoding\n)\n\n###################################\n# For sentence-level manual control.\n\n# Use oral_(0-9), laugh_(0-2), and break_(0-7)\n# to generate special tokens in the text to synthesize.\nparams_refine_text = ChatTTS.Chat.RefineTextParams(\n    prompt='[oral_2][laugh_0][break_6]',\n)\n\nwavs = chat.infer(\n    texts,\n    params_refine_text=params_refine_text,\n    params_infer_code=params_infer_code,\n)\n\n###################################\n# For word-level manual control.\n\ntext = 'What is [uv_break]your favorite english food?[laugh][lbreak]'\nwavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)\n# Depending on the torchaudio version, the waveform may need an extra\n# channel dimension; try the unsqueezed form first and fall back otherwise.\ntry:\n    torchaudio.save(\"word_level_output.wav\", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)\nexcept Exception:\n    torchaudio.save(\"word_level_output.wav\", torch.from_numpy(wavs[0]), 24000)\n```\n\n\u003cdetails open\u003e\n  \u003csummary\u003e\u003ch4\u003eExample: self-introduction\u003c/h4\u003e\u003c/summary\u003e\n\n```python\ninputs_en = \"\"\"\nchat T T S is a text to speech model designed for dialogue applications. \n[uv_break]it supports mixed language input [uv_break]and offers multi speaker \ncapabilities with precise control over prosodic elements like \n[uv_break]laughter[uv_break][laugh], [uv_break]pauses, [uv_break]and intonation. 
\n[uv_break]it delivers natural and expressive speech,[uv_break]so please\n[uv_break] use the project responsibly at your own risk.[uv_break]\n\"\"\".replace('\\n', '') # English is still experimental.\n\nparams_refine_text = ChatTTS.Chat.RefineTextParams(\n    prompt='[oral_2][laugh_0][break_4]',\n)\n\naudio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)\ntorchaudio.save(\"self_introduction_output.wav\", torch.from_numpy(audio_array_en[0]), 24000)\n```\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n**male speaker**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**female speaker**\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n[male speaker](https://github.com/2noise/ChatTTS/assets/130631963/e0f51251-db7f-4d39-a0e9-3e095bb65de1)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[female speaker](https://github.com/2noise/ChatTTS/assets/130631963/f5dcdd01-1091-47c5-8241-c4f6aaaa8bbd)\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c/details\u003e\n\n## FAQ\n\n#### 1. How much VRAM do I need? What about inference speed?\nFor a 30-second audio clip, at least 4 GB of GPU memory is required. An RTX 4090 can generate audio corresponding to approximately 7 semantic tokens per second, with a Real-Time Factor (RTF) of around 0.3.\n\n#### 2. Model stability is not good enough, with issues such as multiple speakers or poor audio quality.\n\nThis is a problem that typically occurs with autoregressive models (as with bark and valle) and is generally difficult to avoid. You can try multiple samples to find a suitable result.\n\n#### 3. Besides laughter, can we control anything else? Can we control other emotions?\n\nIn the currently released model, the only token-level control units are `[laugh]`, `[uv_break]`, and `[lbreak]`. 
In future versions, we may open-source models with additional emotional control capabilities.\n\n## Acknowledgements\n- [bark](https://github.com/suno-ai/bark), [XTTSv2](https://github.com/coqui-ai/TTS), and [valle](https://arxiv.org/abs/2301.02111) demonstrate remarkable TTS results with autoregressive-style systems.\n- [fish-speech](https://github.com/fishaudio/fish-speech) demonstrates the capability of GVQ as an audio tokenizer for LLM modeling.\n- [vocos](https://github.com/gemelo-ai/vocos), which is used as the pretrained vocoder.\n\n## Special Appreciation\n- [wlu-audio lab](https://audio.westlake.edu.cn/) for early algorithm experiments.\n\n## Thanks to all contributors for their efforts\n[![contributors](https://contrib.rocks/image?repo=2noise/ChatTTS)](https://github.com/2noise/ChatTTS/graphs/contributors)\n\n\u003cdiv align=\"center\"\u003e\n\n  ![counter](https://counter.seku.su/cmoe?name=chattts\u0026theme=mbs)\n\n\u003c/div\u003e\n","funding_links":[],"categories":["Python","Projekte","Azure Cognitive Search \u0026 OpenAI","精选文章","\u003cspan id=\"speech\"\u003eSpeech\u003c/span\u003e","语音合成","Tools \u0026 Frameworks","Audio \u0026 Voice Assistants","Repos","4. 机器学习项目 | ML","Open Source Projects","UIs","🧠 AI Applications \u0026 Platforms"],"sub_categories":["🗣️ Voice","文字转语音","\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","网络服务_其他","Open-source projects","Human-in-the-Loop Agents","Feedback Generation","Command-line(shell) interface","Tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2noise%2FChatTTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F2noise%2FChatTTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2noise%2FChatTTS/lists"}