{"id":16264969,"url":"https://github.com/lenml/speech-ai-forge","last_synced_at":"2025-05-15T05:06:28.810Z","repository":{"id":242515016,"uuid":"809025386","full_name":"lenML/Speech-AI-Forge","owner":"lenML","description":"🍦 Speech-AI-Forge is a project developed around TTS generation model, implementing an API Server and a Gradio-based WebUI.","archived":false,"fork":false,"pushed_at":"2025-05-14T18:06:38.000Z","size":11212,"stargazers_count":1217,"open_issues_count":52,"forks_count":162,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-05-14T18:53:15.389Z","etag":null,"topics":["agent","asr","chattts","chattts-forge","chinese","colab","cosy-voice","cosyvoice","english","firered","fireredtts","fish-speech","gpt","llama","llm","ssml","stt","text-to-speech","tts","whisper"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/lenML/ChatTTS-Forge","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lenML.png","metadata":{"files":{"readme":"README.en.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-01T13:24:45.000Z","updated_at":"2025-05-14T18:06:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"ee96e40d-4fac-4171-8c78-762272c44b68","html_url":"https://github.com/lenML/Speech-AI-Forge","commit_stats":{"total_commits":507,"total_committers":7,"mean_commits":72.42857142857143,"dds":"0.023668639053254448","last_synced_commit":"44ccc49fa28336287f620055a9e15910df0a8574"},"previous_names":["lenml/chattts-forge"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenML%2FSpeech-AI-Forge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenML%2FSpeech-AI-Forge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenML%2FSpeech-AI-Forge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lenML%2FSpeech-AI-Forge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lenML","download_url":"https://codeload.github.com/lenML/Speech-AI-Forge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254276447,"owners_count":22043867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","asr","chattts","chattts-forge","chinese","colab","cosy-voice","cosyvoice","english","firered","fireredtts","fish-speech","gpt","llama","llm","ssml","stt","text-to-speech","tts","whisper"],"created_at":"2024-10-10T17:04:54.407Z","updated_at":"2025-05-15T05:06:23.789Z","avatar_url":"https://github.com/lenML.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[cn](./README.md) | [en](./README.en.md) | [Discord Server](https://discord.gg/9XnXUhAy3t)\n\n# 🍦 Speech-AI-Forge\n\nSpeech-AI-Forge is a project developed around TTS generation model, implementing an API Server and a Gradio-based WebUI.\n\n![banner](./docs/banner.png)\n\nYou can experience and deploy Speech-AI-Forge through the following methods:\n\n| -                        | Description                             | Link                              
                                                                                                                                    |\n| ------------------------ | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **Online Demo**          | Deployed on HuggingFace                 | [HuggingFace Spaces](https://huggingface.co/spaces/lenML/Speech-AI-Forge)                                                                                             |\n| **One-Click Start**      | Click the button to start Colab         | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lenML/Speech-AI-Forge/blob/main/colab.en.ipynb) |\n| **Container Deployment** | See the docker section                  | [Docker](#docker)                                                                                                                                                     |\n| **Local Deployment**     | See the environment preparation section | [Local Deployment](#installation-and-running)                                                                                                                         |\n\n## Installation and Running\n\nFirst, ensure that the [relevant dependencies](./docs/dependencies.md) have been correctly installed.\n\nStart the application:\n\n```\npython webui.py\n```\n\n### Web UI Features\n\n[Click here for detailed documentation with images](./docs/webui_features.md)\n\n- **TTS (Text-to-Speech)**: Powerful TTS capabilities\n  - **Speaker Switch**: Switch between different voices\n    - **Built-in Voices**: Multiple built-in voices available, including `27 ChatTTS` / `7 CosyVoice` voices + `1 Reference Voice`\n    - **Custom Voice Upload**: Support for uploading custom voice files and performing real-time inference\n    - 
**Reference Voice**: Upload reference audio/text and perform TTS inference based on the reference audio\n  - **Style Control**: Built-in style control options to adjust the voice tone\n  - **Long Text Processing**: Support for long text inference with automatic text segmentation\n    - **Batch Size**: Configure `Batch size` to speed up inference for models that support batch processing\n  - **Refiner**: Native text `refiner` for `ChatTTS`, supports inference of unlimited-length text\n  - **Splitter Settings**: Fine-tune splitter configuration, control splitter `eos` (end of sentence) and splitting thresholds\n  - **Adjuster**: Control speech parameters like `speed`, `pitch`, and `volume`, with additional `loudness normalization` for improved output quality\n  - **Voice Enhancer**: Use the `Enhancer` model to improve TTS output quality, delivering better sound\n  - **Generation History**: Store the last three generated results for easy comparison\n  - **Multi-model Support**: Support for multiple TTS models, including `ChatTTS`, `CosyVoice`, `FishSpeech`, `GPT-SoVITS`, and `F5-TTS`\n\n- **SSML (Speech Synthesis Markup Language)**: Advanced TTS synthesis control\n  - **Splitter**: Fine control over text segmentation for long-form content\n  - **PodCast**: A tool for creating `long-form` and `multi-character` audio, ideal for blogs or scripted voice synthesis\n  - **From Subtitle**: Create SSML scripts directly from subtitle files for easy TTS generation\n  - **Script Editor**: New SSML script editor that allows users to export and edit SSML scripts from the Splitter (PodCast, From Subtitle) for further refinement\n\n- **Voice Management**:\n  - **Builder**: Create custom voices from ChatTTS seeds or by using reference audio\n  - **Test Voice**: Upload and test custom voice files quickly\n  - **ChatTTS Debugging Tools**: Specific tools for debugging `ChatTTS` voices\n    - **Random Seed**: Generate random voices using a random seed to create unique sound profiles\n    
- **Voice Blending**: Blend voices generated from different seeds to create a new voice\n  - **Voice Hub**: Select and download voices from our voice library to your local machine. Access the voice repository at [Speech-AI-Forge-spks](https://github.com/lenML/Speech-AI-Forge-spks)\n\n- **ASR (Automatic Speech Recognition)**:\n  - **Whisper**: Use the Whisper model for high-quality speech-to-text (ASR)\n  - **SenseVoice**: ASR model in development, coming soon\n\n- **Tools**:\n  - **Post Process**: Post-processing tools for audio clipping, adjustment, and enhancement to optimize speech output\n\n### `launch.py`: API Server\n\nIf you do not need the WebUI, or you need higher API throughput, you can start a standalone API service with this script.\n\nTo start:\n\n```bash\npython launch.py\n```\n\nOnce launched, you can open `http://localhost:7870/docs` to see which API endpoints are available.\n\nMore help:\n\n- Use `python launch.py -h` to view script parameters\n- Check out the [API Documentation](./docs/api.md)\n\n## Docker\n\n### Image\n\nWIP (Under development)\n\n### Manual Build\n\nDownload models: `python -m scripts.download_models --source modelscope`\n\n\u003e This script will download the `chat-tts` and `enhancer` models. 
If you need to download other models, please refer to the `Model Download` section below.\n\n- For the WebUI: `docker-compose -f ./docker-compose.webui.yml up -d`\n- For the API: `docker-compose -f ./docker-compose.api.yml up -d`\n\nEnvironment variable configuration:\n\n- WebUI: [.env.webui](./.env.webui)\n- API: [.env.api](./.env.api)\n\n\n## Model Support\n\n| Model Category   | Model Name                                                                                  | Streaming Level | Multi-Language Support       | Status                  |\n| ---------------- | ------------------------------------------------------------------------------------------- | --------------- | ---------------------------- | ----------------------- |\n| **TTS**          | [ChatTTS](https://github.com/2noise/ChatTTS)                                                | token-level     | en, zh                       | ✅                       |\n|                  | [FishSpeech](https://github.com/fishaudio/fish-speech)                                       | sentence-level  | en, zh, jp, ko               | ✅ (1.4)                 |\n|                  | [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)                                        | sentence-level  | en, zh, jp, yue, ko          | ✅ (v2)                  |\n|                  | [FireRedTTS](https://github.com/FireRedTeam/FireRedTTS)                                      | sentence-level  | en, zh                       | ✅                       |\n|                  | [F5-TTS](https://github.com/SWivid/F5-TTS)                                                  | sentence-level  | en, zh                       | ✅                       |\n|                  | GPT-SoVITS                                                                                   | sentence-level  |                              | 🚧                       |\n| **ASR**          | [Whisper](https://github.com/openai/whisper)                                                | 🚧    
          | ✅                           | ✅                       |\n|                  | [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)                                      | 🚧              | ✅                           | 🚧                       |\n| **Voice Clone**  | [OpenVoice](https://github.com/myshell-ai/OpenVoice)                                        |                 |                              | ✅                       |\n|                  | [RVC](https://github.com/svc-develop-team/RVC)                                              |                 |                              | 🚧                       |\n| **Enhancer**     | [ResembleEnhance](https://github.com/resemble-ai/resemble-enhance)                          |                 |                              | ✅                       |\n\n## Model Download\n\nSince Forge primarily focuses on API functionality development, automatic download logic has not yet been implemented. To download models, you need to manually invoke the download scripts, which can be found in the `./scripts` directory.\n\n### Download Script\n\n| Function     | Model          | Download Command                                                          |\n| ------------ | -------------- | ------------------------------------------------------------------------- |\n| **TTS**      | ChatTTS        | `python -m scripts.dl_chattts --source huggingface`                       |\n|              | FishSpeech(1.4)     | `python -m scripts.downloader.fish_speech_1_4 --source huggingface`    |\n|              | CosyVoice(v2)      | `python -m scripts.downloader.cosyvoice2 --source huggingface` |\n|              | FireRedTTS     | `python -m scripts.downloader.fire_red_tts --source huggingface`          |\n| **ASR**      | Whisper        | `python -m scripts.downloader.faster_whisper --source huggingface`        |\n| **CV**       | OpenVoice      | `python -m scripts.downloader.open_voice --source huggingface`            
|\n| **Enhancer** | Enhancer Model | `python -m scripts.dl_enhance --source huggingface`                       |\n\n\u003e **Note**: If you need to use ModelScope to download models, use `--source modelscope`. Some models may not be available for download using ModelScope.\n\n## FAQ\n\n### How to perform voice cloning?\n\nVoice cloning is currently supported by several models, and reference audio stored in the `spkv1` format is also supported. Here are a few ways to use voice cloning:\n\n1. **In the WebUI**: You can upload reference audio in the voice selection section, which is the simplest way to use the voice cloning feature.\n2. **Using the API**: When using the API, you need to use a voice (i.e., a speaker) for voice cloning. First, create a speaker file (e.g., `.spkv1.json`) with the required voice; then, when calling the API, set the `spk` parameter to the speaker's name to enable cloning.\n3. **Voice Clone**: The system now also supports voice cloning using the voice clone model. When using the API, configure the appropriate `reference` to use this feature. (Currently, only OpenVoice is supported for voice cloning, so there’s no need to specify the model name.)\n\nFor related discussions, see issue #118.\n\n### Why is the output generated with a reference audio `spk` file full of noise?\n\nThis is likely caused by an issue with the uploaded audio configuration. You can try the following solutions:\n\n1. **Update**: Update the code and dependency versions. Most importantly, update Gradio (it's recommended to use the latest version if possible).\n2. **Process the audio**: Use ffmpeg or other software to edit the audio, convert it to mono, and then upload it. You can also try converting it to WAV format.\n3. **Check the text**: Make sure there are no unsupported characters in the reference text. It's also recommended to end the reference text with a `\"。\"` (this is a quirk of the model 😂).\n4. 
**Create with Colab**: Consider using the Colab environment to create the `spk` file to minimize environment-related issues.\n5. **TTS Test**: Currently, in the WebUI TTS page, you can upload reference audio directly. You can first test the audio and text, make adjustments, and then generate the `spk` file.\n\n### Can I train models?\n\nNot at the moment. This repository mainly provides a framework for inference services. There are plans to add some training-related features, but they are not a priority.\n\n### How can I optimize inference speed?\n\nThis repository focuses on integrating and developing engineering solutions, so model inference optimizations largely depend on upstream repositories or community implementations. If you have good optimization ideas, feel free to submit an issue or PR.\n\nFor now, the most practical optimization is to enable multiple workers. When running the `launch.py` script, you can start with the `--workers N` option to increase service throughput.\n\nThere are also other potential speed-up optimizations that are not yet fully implemented. If interested, feel free to explore:\n\n1. **Compile**: Models support compile acceleration, which can provide around a 30% speed increase, but the compilation process is slow.\n2. **Flash Attention**: Flash attention acceleration is supported (using the `--flash_attn` option), but it is still not perfect.\n3. **vllm**: Not yet implemented, pending updates from upstream repositories.\n\n### What are Prompt1 and Prompt2?\n\n\u003e Only for ChatTTS\n\nBoth Prompt1 and Prompt2 are system prompts, but the difference lies in their insertion points. 
Testing showed that the current model is very sensitive to the first `[Stts]` token, so two prompts are required:\n\n- Prompt1 is inserted before the first `[Stts]`.\n- Prompt2 is inserted after the first `[Stts]`.\n\n### What is Prefix?\n\n\u003e Only for ChatTTS\n\nPrefix is mainly used to control the model's generation capabilities, similar to the refine prompts in the official examples. The prefix should only include special non-lexical tokens, such as `[laugh_0]`, `[oral_0]`, `[speed_0]`, `[break_0]`, etc.\n\n### What is the difference between styles with and without `_p`?\n\nIn the Style settings, those with `_p` use both prompt + prefix, while those without `_p` use only the prefix.\n\n### Why is it so slow when `--compile` is enabled?\n\nSince inference padding has not yet been implemented, a change in input shape between inferences may trigger torch to recompile.\n\n\u003e For now, it’s not recommended to enable this option.\n\n### Why is it so slow in Colab, only 2 it/s?\n\nPlease ensure that you are using a GPU instead of a CPU.\n\n- Click on the menu bar **Edit**.\n- Select **Notebook Settings**.\n- Choose **Hardware Accelerator** =\u003e T4 GPU.\n\n# Documents\n\nFind more documentation [here](./docs/readme.md).\n\n# Contributing\n\nTo contribute, clone the repository, make your changes, commit and push to your clone, and submit a pull request.\n\n# References\n\n- ChatTTS: https://github.com/2noise/ChatTTS\n- PaddleSpeech: https://github.com/PaddlePaddle/PaddleSpeech\n- resemble-enhance: https://github.com/resemble-ai/resemble-enhance\n- OpenVoice: https://github.com/myshell-ai/OpenVoice\n- FishSpeech: https://github.com/fishaudio/fish-speech\n- SenseVoice: https://github.com/FunAudioLLM/SenseVoice\n- CosyVoice: https://github.com/FunAudioLLM/CosyVoice\n- Whisper: https://github.com/openai/whisper\n\n- ChatTTS default speaker: 
https://github.com/2noise/ChatTTS/issues/238\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flenml%2Fspeech-ai-forge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flenml%2Fspeech-ai-forge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flenml%2Fspeech-ai-forge/lists"}