{"id":28255621,"url":"https://github.com/uppermoon0/tts-provider","last_synced_at":"2026-03-12T14:17:33.451Z","repository":{"id":283655274,"uuid":"916441239","full_name":"UpperMoon0/TTS-Provider","owner":"UpperMoon0","description":"WebSocket-based Text-to-Speech service","archived":false,"fork":false,"pushed_at":"2025-08-31T06:32:26.000Z","size":22935,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-31T08:27:16.302Z","etag":null,"topics":["tts","tts-api"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UpperMoon0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-14T05:14:26.000Z","updated_at":"2025-08-31T06:32:00.000Z","dependencies_parsed_at":"2025-08-31T08:22:33.016Z","dependency_job_id":"869fe85c-9444-408a-aa47-6824bcc75b70","html_url":"https://github.com/UpperMoon0/TTS-Provider","commit_stats":null,"previous_names":["uppermoon0/tts-provider"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/UpperMoon0/TTS-Provider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UpperMoon0%2FTTS-Provider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UpperMoon0%2FTTS-Provider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UpperMoon0%2FTTS-Provider/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/UpperMoon0%2FTTS-Provider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UpperMoon0","download_url":"https://codeload.github.com/UpperMoon0/TTS-Provider/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UpperMoon0%2FTTS-Provider/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278272335,"owners_count":25959524,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["tts","tts-api"],"created_at":"2025-05-19T22:14:10.281Z","updated_at":"2025-10-04T05:58:02.111Z","avatar_url":"https://github.com/UpperMoon0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TTS Provider Server\n\nA flexible WebSocket-based Text-to-Speech service that supports multiple TTS backends. Currently supports:\n\n- Microsoft Edge TTS (default)\n- Zonos TTS\n\n## Installation\n\n1. Clone this repository\n2. Install the required packages:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n### Installing Zonos TTS Model\n\nTo use the Zonos TTS model, you'll need to:\n\n1. **Install System Dependencies:**\n    Zonos requires `espeak-ng`. 
On Debian/Ubuntu, you can install it with:\n\n    ```bash\n    sudo apt-get update \u0026\u0026 sudo apt-get install -y espeak-ng\n    ```\n\n    (Note: The `Dockerfile` already includes this step.)\n\n2. **Python Dependencies:**\n    The Zonos library will be installed automatically via `pip install -r requirements.txt`. It uses a specific fork (`UpperMoon0/nstut-zonos-fork`) which includes packaging fixes to ensure all submodules are correctly installed.\n\n3. **Reference Audio:**\n    Zonos performs voice cloning using reference audio files. You need to place your `.wav` reference audio files in the `tts_models/zonos_reference_audio/` directory. For example:\n    - `0.wav` or `default_speaker.wav` for speaker ID 0.\n    - `1.wav` for speaker ID 1, etc.\n    - `speaker_X.wav` can also be used for speaker ID X.\n    Refer to `tts_models/zonos_tts.py` for more details on speaker mapping and file naming. An example reference audio file (`1.wav`) is provided.\n\n## Running the Server\n\n```bash\n# Default command (uses Edge TTS by default if no model is specified in the request)\npython -m main\n```\n\nNote: TTS models (Edge, Zonos) are loaded lazily. The server initializes, but the actual model weights are loaded into memory only when the first request requiring that specific model is received, or if preloading is triggered. This approach minimizes startup time and initial memory footprint.\n\n## Running with Docker\n\nYou can also run the TTS Provider server using Docker. The Docker image has been optimized to reduce size while maintaining all functionality.\n\n1. **Pull the Docker Image:**\n\n    ```bash\n    docker pull nstut/tts-provider\n    ```\n\n2. 
**Run the Docker Container:**\n\n    ```bash\n    docker run --rm -itd --gpus all --name TTS-Provider -p 9000:9000 -e HF_TOKEN=\u003cYOUR_HF_TOKEN\u003e nstut/tts-provider\n    ```\n\n    **Explanation of the command:**\n    - `--rm`: Automatically remove the container when it exits.\n    - `-itd`: Run in interactive, TTY, and detached (background) mode.\n    - `--gpus all`: (Optional) If you have NVIDIA GPUs and want to use them for models like Zonos, this flag enables GPU access. Remove if you don't have GPUs or don't need GPU support.\n    - `--name TTS-Provider`: Assigns a name to the container for easier management.\n    - `-p 9000:9000`: Maps port 9000 on your host to port 9000 in the container.\n    - `-e HF_TOKEN=\u003cYOUR_HF_TOKEN\u003e`: Sets the Hugging Face token as an environment variable. **Replace `\u003cYOUR_HF_TOKEN\u003e` with your actual Hugging Face token.** This may be required if you plan to use models like Zonos that need to be downloaded from Hugging Face.\n    - `nstut/tts-provider`: The name of the Docker image to run.\n\n    The server will then be accessible at `ws://localhost:9000`.\n\n### Persisting Downloaded Models with Docker Volumes\n\nBy default, when a Docker container is removed, any data written inside it (like downloaded Hugging Face models) is lost. To prevent re-downloading models every time you start a new container, you should use a Docker volume to persist the Hugging Face cache.\n\nThe `Dockerfile` is configured to use `/app/huggingface_cache` as the Hugging Face home directory (`HF_HOME`). You can mount a volume to this location:\n\n**1. 
Using a Named Volume (Recommended):**\n\nFirst, create a named volume if you haven't already:\n\n```bash\ndocker volume create tts_provider_hf_cache\n```\n\nThen, run your container, mounting this volume:\n\n```bash\ndocker run --rm -itd --gpus all --name TTS-Provider \\\n  -p 9000:9000 \\\n  -e HF_TOKEN=\u003cYOUR_HF_TOKEN\u003e \\\n  -v tts_provider_hf_cache:/app/huggingface_cache \\\n  nstut/tts-provider\n```\n\nThis will store the downloaded models in the `tts_provider_hf_cache` volume, and they will be available to subsequent containers that mount the same volume.\n\n**2. Using a Host Directory (Bind Mount):**\n\nAlternatively, you can map a directory from your host machine into the container:\n\n```bash\ndocker run --rm -itd --gpus all --name TTS-Provider \\\n  -p 9000:9000 \\\n  -e HF_TOKEN=\u003cYOUR_HF_TOKEN\u003e \\\n  -v /path/on/your/host/hf_cache:/app/huggingface_cache \\\n  nstut/tts-provider\n```\n\nReplace `/path/on/your/host/hf_cache` with an actual directory path on your computer.\n\nUsing either of these methods will ensure that models downloaded by Hugging Face (e.g., for Zonos) are cached persistently.\n\n## Docker Image Optimization\n\nThe Docker image has been optimized to reduce size while maintaining all functionality:\n\n- Production dependencies separated from development dependencies\n- Layer optimization for better caching\n- Maintained GPU support for Zonos TTS\n\nSee [DOCKER_OPTIMIZATION.md](DOCKER_OPTIMIZATION.md) for details on the optimizations made.\n\n## Client Usage\n\nClients can connect to the server via WebSocket. 
See `tts_client.py` for a complete client implementation.\n\n### Basic Example\n\n```python\nimport asyncio\nfrom tts_client import TTSClient\n\nasync def main():\n    client = TTSClient(host=\"localhost\", port=9000)\n    \n    try:\n        await client.connect()\n        \n        # Get server information\n        info = await client.get_server_info()\n        print(f\"Available models: {info.get('available_models')}\")\n        \n        # Generate speech with default model (Edge TTS)\n        await client.generate_speech(\n            text=\"Hello, this is a test.\",\n            output_path=\"output.wav\"\n        )\n        \n        # Generate speech with Edge TTS\n        await client.generate_speech(\n            text=\"Hello, this is Edge TTS speaking.\",\n            output_path=\"output_edge.wav\",\n            model=\"edge\",\n            speaker=2  # Use the Davis voice\n        )\n        \n    finally:\n        await client.disconnect()\n\nasyncio.run(main())\n```\n\n## Speaker ID Mapping\n\nThe TTS Provider supports a speaker ID system. For EdgeTTS, you can use integer speaker IDs (0-3). For Zonos, speaker IDs correspond to your reference audio files.\n\n- **Simple usage**: Just provide a speaker ID as an integer.\n\n### Speaker ID Reference Table\n\n| ID | General Description | Edge TTS | Zonos TTS |\n|----|---------------------|----------|-----------|\n| 0  | Primary/Default     | US Male (Guy) | Cloned (e.g., `0.wav`/`default_speaker.wav`) |\n| 1  | Secondary           | US Female (Jenny) | Cloned (e.g., `1.wav`) |\n| 2  | Tertiary            | US Male (Davis) | Cloned (e.g., `2.wav`) |\n| 3  | Quaternary          | UK Female (Sonia) | Cloned (e.g., `3.wav`) |\n| ...| Additional Voices   | N/A      | Cloned (e.g., `X.wav`/`speaker_X.wav`) |\n\n*Note for Zonos TTS*: Speaker IDs correspond to user-provided `.wav` files in the `tts_models/zonos_reference_audio/` directory (e.g., speaker ID `X` typically maps to `X.wav` or `speaker_X.wav`). 
The voice characteristics are determined by these reference files. The system can support many such cloned speakers.\n\n## Selecting Models\n\nClients can select which model to use in each request by including a `model` parameter:\n\n- `edge` (or `edge-tts`) - Use Microsoft Edge TTS\n- `zonos` - Use Zonos TTS model\n\n## API Documentation\n\n### WebSocket Request Format\n\nBasic request format:\n\n```json\n{\n  \"text\": \"Text to convert to speech\",\n  \"speaker\": 0,\n  \"sample_rate\": 24000,\n  \"model\": \"edge\",  // Optional. Specifies model type (e.g., \"edge\", \"zonos\"). Defaults to \"edge\" if not provided.\n  \"lang\": \"en-US\"   // Optional. Specifies language. Defaults to \"en-US\".\n}\n```\n\n#### Language Support (`lang` parameter)\n\nClients **should** specify the language for TTS generation using standard IETF language tags (e.g., `en-US`, `ja-JP`, `es-ES`) in the `lang` parameter of the WebSocket request.\n\n- **Default**: If the `lang` parameter is not provided, `en-US` is generally assumed by most models, though specific model behavior can vary.\n- **Server-Side Language Code Mapping**:\n  - The server-side TTS models (`edge`, `zonos`) are responsible for mapping these standard input language codes to the specific formats required by their underlying TTS engines. This mapping is handled by a `_map_language_code` method within each model.\n  - While models *may* attempt to normalize and map common variations (e.g., \"en\", \"english\" to \"en-US\"), relying on this is discouraged for client implementations.\n- **Error Handling**:\n  - If a model cannot map the provided `lang` parameter to a supported language code (even after its internal normalization attempts), the server will return an error, and speech generation will fail. This indicates the language is not supported by the chosen model.\n- **Model-Specific Support**:\n  - **EdgeTTS**: Accepts standard codes like \"en-US\", \"ja-JP\". 
See its `VOICE_MAPPINGS` for explicitly configured target languages.\n  - **ZonosTTS**: Accepts standard codes and maps them to its wide range of supported languages (e.g., \"en-US\" might map to \"en-us\", \"ja-JP\" to \"ja\"). It uses a comprehensive mapping (see `PREFERRED_ZONOS_LANG_MAP` in `zonos_tts.py`) and checks against dynamically available Zonos language codes.\n- **Client Recommendation**: For maximum compatibility and predictability, clients **should** send well-formed IETF language tags (e.g., `en-US`, `ja-JP`).\n\n**Example (Japanese with EdgeTTS):**\n\n```json\n{\n  \"text\": \"こんにちは、これはテストです。\",\n  \"speaker\": 1,\n  \"model\": \"edge\",\n  \"lang\": \"ja-JP\" // Client sends standard ja-JP\n}\n```\n\n**Example (English with Zonos):**\n\n```json\n{\n  \"text\": \"Hello, this is Zonos.\",\n  \"speaker\": 0, \n  \"model\": \"zonos\",\n  \"lang\": \"en-US\" // Client sends standard en-US; Zonos maps to \"en-us\" or similar\n}\n```\n\n**Important Note:** For Edge TTS, voice modification parameters like `rate`, `volume`, and `pitch` are not supported and will be ignored. Edge TTS will always use the default voice characteristics.\n\n### Server Information Request\n\nTo get information about the server and available models:\n\n```json\n{\n  \"command\": \"info\"\n}\n```\n\nThe response includes the available speaker mappings to help you select the appropriate voice.\n\n## Model Loading Behavior\n\nAll TTS models are loaded lazily to optimize startup time and resource usage:\n\n- The server initializes with a default model configuration (currently \"edge\").\n- However, the actual loading of any model's weights and resources into memory occurs only when:\n    1. The first WebSocket request that requires that specific model is received.\n    2. 
An explicit preload operation is triggered (e.g., during server startup if configured, or via a specific command if implemented).\n- If a request comes in for a model that isn't loaded yet, the request is queued, and the model loading process begins. Once loaded, queued requests for that model are processed.\n- This ensures that only necessary models consume resources, and the server starts quickly.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuppermoon0%2Ftts-provider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuppermoon0%2Ftts-provider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuppermoon0%2Ftts-provider/lists"}