{"id":27923329,"url":"https://github.com/devnen/dia-tts-server","last_synced_at":"2025-05-06T22:32:32.045Z","repository":{"id":289324260,"uuid":"970873979","full_name":"devnen/Dia-TTS-Server","owner":"devnen","description":"Self-host the powerful Dia TTS model. This server offers a user-friendly Web UI, flexible API endpoints (incl. OpenAI compatible), support for SafeTensors/BF16, voice cloning, dialogue generation, and GPU/CPU execution.","archived":false,"fork":false,"pushed_at":"2025-05-04T14:24:19.000Z","size":32745,"stargazers_count":136,"open_issues_count":5,"forks_count":24,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-04T14:25:42.663Z","etag":null,"topics":["ai","api-server","audio-generation","cuda","dia","dia-tts","dialogue-tts","fastapi","huggingface","openai-api","python","pytorch","speech-synthesis","speech-synthesis-api","text-to-speech","tts","tts-api","voice-cloning","web-ui"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devnen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-22T17:03:37.000Z","updated_at":"2025-05-04T14:24:22.000Z","dependencies_parsed_at":"2025-04-22T18:36:22.227Z","dependency_job_id":"7b7b8f89-0722-4943-8e5f-30c3ad4c8b1c","html_url":"https://github.com/devnen/Dia-TTS-Server","commit_stats":null,"previous_names":["devnen/dia-tts-server"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2FDia-TTS-Se
rver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2FDia-TTS-Server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2FDia-TTS-Server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2FDia-TTS-Server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devnen","download_url":"https://codeload.github.com/devnen/Dia-TTS-Server/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252779564,"owners_count":21802973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","api-server","audio-generation","cuda","dia","dia-tts","dialogue-tts","fastapi","huggingface","openai-api","python","pytorch","speech-synthesis","speech-synthesis-api","text-to-speech","tts","tts-api","voice-cloning","web-ui"],"created_at":"2025-05-06T22:32:04.972Z","updated_at":"2025-05-06T22:32:32.003Z","avatar_url":"https://github.com/devnen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dia TTS Server: OpenAI-Compatible API with Web UI, Large Text Handling \u0026 Built-in Voices\n\n**Self-host the powerful [Nari Labs Dia TTS model](https://github.com/nari-labs/dia) with this enhanced FastAPI server! 
Features an intuitive Web UI, flexible API endpoints (including OpenAI-compatible `/v1/audio/speech`), support for realistic dialogue (`[S1]`/`[S2]`), improved voice cloning, large text processing via intelligent chunking, and consistent, reproducible output via 43 built-in, ready-to-use voices and generation seeds.**\n\nNow with improved speed and reduced VRAM usage. Defaults to efficient BF16 SafeTensors for reduced VRAM and faster inference, with support for original `.pth` weights. Runs accelerated on NVIDIA GPUs (CUDA) with CPU fallback.\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](LICENSE)\n[![Python Version](https://img.shields.io/badge/Python-3.10+-blue.svg?style=for-the-badge)](https://www.python.org/downloads/)\n[![Framework](https://img.shields.io/badge/Framework-FastAPI-green.svg?style=for-the-badge)](https://fastapi.tiangolo.com/)\n[![Model Format](https://img.shields.io/badge/Weights-SafeTensors%20/%20pth-orange.svg?style=for-the-badge)](https://github.com/huggingface/safetensors)\n[![Docker](https://img.shields.io/badge/Docker-Supported-blue.svg?style=for-the-badge)](https://www.docker.com/)\n[![Web UI](https://img.shields.io/badge/Web_UI-Included-4285F4?style=for-the-badge\u0026logo=googlechrome\u0026logoColor=white)](#)\n[![CUDA Compatible](https://img.shields.io/badge/CUDA-Compatible-76B900?style=for-the-badge\u0026logo=nvidia\u0026logoColor=white)](https://developer.nvidia.com/cuda-zone)\n[![API](https://img.shields.io/badge/OpenAI_Compatible_API-Ready-000000?style=for-the-badge\u0026logo=openai\u0026logoColor=white)](https://platform.openai.com/docs/api-reference)\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"static/screenshot-d.png\" alt=\"Dia TTS Server Web UI - Dark Mode\" width=\"33%\" /\u003e\n  \u003cimg src=\"static/screenshot-l.png\" alt=\"Dia TTS Server Web UI - Light Mode\" width=\"33%\" /\u003e\n\u003c/div\u003e\n\n---\n\n## 🗣️ Overview: Enhanced Dia TTS Access\n\nThe 
original [Dia 1.6B TTS model by Nari Labs](https://github.com/nari-labs/dia) provides incredible capabilities for generating realistic dialogue, complete with speaker turns and non-verbal sounds like `(laughs)` or `(sighs)`. This project builds upon that foundation by providing a robust **[FastAPI](https://fastapi.tiangolo.com/) server** that makes Dia significantly easier to use and integrate.\n\nWe solve the complexity of setting up and running the model by offering:\n\n*   An **OpenAI-compatible API endpoint**, allowing you to use Dia TTS with tools expecting OpenAI's API structure.\n*   A **modern Web UI** for easy experimentation, preset loading, reference audio management, and generation parameter tuning. The interface design draws inspiration from **[Lex-au's Orpheus-FastAPI project](https://github.com/Lex-au/Orpheus-FastAPI)**, adapting its intuitive layout and user experience for Dia TTS.\n*   **Large Text Handling:** Intelligently splits long text inputs into manageable chunks based on sentence structure and speaker tags, processes them sequentially, and seamlessly concatenates the audio.\n*   **Predefined Voices:** Select from 43 curated, ready-to-use synthetic voices for consistent and reliable output without cloning setup.\n*   **Improved Voice Cloning:** Enhanced pipeline with automatic audio processing and transcript handling (local `.txt` file or experimental Whisper fallback).\n*   **Consistent Generation:** Achieve consistent voice output across multiple generations or text chunks by using the \"Predefined Voices\" or \"Voice Cloning\" modes, optionally combined with a fixed integer **Seed**.\n*   Support for both original `.pth` weights and modern, secure **[SafeTensors](https://github.com/huggingface/safetensors)**, defaulting to a **BF16 SafeTensors** version which uses roughly half the VRAM and offers improved speed.\n*   Automatic **GPU (CUDA) acceleration** detection with fallback to CPU.\n*   Configuration primarily via `config.yaml`, with 
`.env` used for initial setup/reset.\n*   **Docker support** for easy containerized deployment with [Docker](https://www.docker.com/).\n\nThis server is your gateway to leveraging Dia's advanced TTS capabilities seamlessly, now with enhanced stability, voice consistency, and large text support.\n\n## ✨ What's New (v1.4.0 vs v1.0.0)\n\nThis version introduces significant improvements and new features:\n\n**🚀 New Features:**\n\n*   **Large Text Processing (Chunking):**\n    *   Automatically handles long text inputs by intelligently splitting them into smaller chunks based on sentence boundaries and speaker tags (`[S1]`/`[S2]`).\n    *   Processes each chunk individually and seamlessly concatenates the resulting audio, overcoming previous generation limits.\n    *   Configurable via UI toggle (\"Split text into chunks\") and chunk size slider.\n*   **Predefined Voices:**\n    *   Added support for using 43 curated, ready-to-use synthetic voices stored in the `./voices` directory.\n    *   Selectable via UI dropdown (\"Predefined Voices\" mode). Server automatically uses required transcripts.\n    *   Provides reliable voice output without manual cloning setup and avoids potential licensing issues.\n*   **Enhanced Voice Cloning:**\n    *   Improved backend pipeline for robustness.\n    *   Automatic reference audio processing: mono conversion, resampling to 44.1kHz, truncation (~20s).\n    *   Automatic transcript handling: Prioritizes local `.txt` file (recommended for accuracy) -\u003e **experimental Whisper generation** if `.txt` is missing. Backend handles transcript prepending automatically.\n    *   Robust reference file finding handles case-insensitivity and extensions.\n*   **Whisper Integration:** Added `openai-whisper` for automatic transcript generation as an experimental fallback during cloning. 
Configurable model (`WHISPER_MODEL_NAME` in `config.yaml`).\n*   **API Enhancements:**\n    *   `/tts` endpoint now supports `transcript` (for explicit clone transcript), `split_text`, `chunk_size`, and `seed`.\n    *   `/v1/audio/speech` endpoint now supports `seed`.\n*   **Generation Seed:** Added `seed` parameter to UI and API for influencing generation results. Using a fixed integer seed *in combination with* Predefined Voices or Voice Cloning helps maintain consistency across chunks or separate generations. Use -1 for random variation.\n*   **Terminal Progress:** Generation of long text (using chunking) now displays a `tqdm` progress bar in the server's terminal window.\n*   **UI Configuration Management:** Added UI section to view/edit `config.yaml` settings and save generation defaults.\n*   **Configuration System:** Migrated to `config.yaml` for primary runtime configuration, managed via `config.py`. `.env` is now used mainly for initial seeding or resetting defaults.\n\n**🔧 Fixes \u0026 Enhancements:**\n\n*   **VRAM Usage Fixed \u0026 Optimized:** Resolved memory leaks during inference and significantly reduced VRAM usage (approx. 
14GB+ down to ~7GB) through code optimizations and the BF16 default.\n*   **Performance:** Significant speed improvements reported (approaching 95% real-time on tested hardware: AMD Ryzen 9 9950X3D + NVIDIA RTX 3090).\n*   **Audio Post-Processing:** Automatically applies silence trimming (leading/trailing), internal silence reduction, and unvoiced segment removal (using Parselmouth) to improve audio quality and remove artifacts.\n*   **UI State Persistence:** Web UI now saves/restores text input, voice mode selection, file selections, and generation parameters (seed, chunking, sliders) in `config.yaml`.\n*   **UI Improvements:** Better loading indicators (shows chunk processing), refined chunking controls, seed input field, theme toggle, dynamic preset loading from `ui/presets.yaml`, warning modals for chunking/generation quality.\n*   **Cloning Workflow:** Backend now handles transcript prepending automatically. UI workflow simplified (user selects file, enters target text).\n*   **Dependency Management:** Added `tqdm`, `PyYAML`, `openai-whisper`, `parselmouth` to `requirements.txt`.\n*   **Code Refactoring:** Aligned internal engine code with refactored `dia` library structure. 
Updated `config.py` to use `YamlConfigManager`.\n\n## ✅ Features\n\n*   **Core Dia Capabilities (via [Nari Labs Dia](https://github.com/nari-labs/dia)):**\n    *   🗣️ Generate multi-speaker dialogue using `[S1]` / `[S2]` tags.\n    *   😂 Include non-verbal sounds like `(laughs)`, `(sighs)`, `(clears throat)`.\n    *   🎭 Perform voice cloning using reference audio prompts.\n*   **Enhanced Server \u0026 API:**\n    *   ⚡ Built with the high-performance **[FastAPI](https://fastapi.tiangolo.com/)** framework.\n    *   🤖 **OpenAI-Compatible API Endpoint** (`/v1/audio/speech`) for easy integration (now includes `seed`).\n    *   ⚙️ **Custom API Endpoint** (`/tts`) exposing all Dia generation parameters (now includes `seed`, `split_text`, `chunk_size`, `transcript`).\n    *   📄 Interactive API documentation via Swagger UI (`/docs`).\n    *   🩺 Health check endpoint (`/health`).\n*   **Advanced Generation Features:**\n    *   📚 **Large Text Handling:** Intelligently splits long inputs into chunks based on sentences and speaker tags, generates audio for each, and concatenates the results seamlessly. Configurable via `split_text` and `chunk_size`.\n    *   🎤 **Predefined Voices:** Select from 43 curated, ready-to-use synthetic voices in the `./voices` directory for consistent output without cloning setup.\n    *   ✨ **Improved Voice Cloning:** Robust pipeline with automatic audio processing and transcript handling (local `.txt` or Whisper fallback). 
Backend handles transcript prepending.\n    *   🌱 **Consistent Generation:** Use Predefined Voices or Voice Cloning modes, optionally with a fixed integer **Seed**, for consistent voice output across chunks or multiple requests.\n    *   🔇 **Audio Post-Processing:** Automatic steps to trim silence, fix internal pauses, and remove long unvoiced segments/artifacts.\n*   **Intuitive Web User Interface:**\n    *   🖱️ Modern, easy-to-use interface inspired by **[Lex-au's Orpheus-FastAPI project](https://github.com/Lex-au/Orpheus-FastAPI)**.\n    *   💡 **Presets:** Load example text and settings dynamically from `ui/presets.yaml`. Customize by editing the file.\n    *   🎤 **Reference Audio Upload:** Easily upload `.wav`/`.mp3` files for voice cloning.\n    *   🗣️ **Voice Mode Selection:** Choose between Predefined Voices, Voice Cloning, or Random/Dialogue modes.\n    *   🎛️ **Parameter Control:** Adjust generation settings (CFG Scale, Temperature, Speed, Seed, etc.) via sliders and inputs.\n    *   💾 **Configuration Management:** View and save server settings (`config.yaml`) and default generation parameters directly in the UI.\n    *   💾 **Session Persistence:** Remembers your last used settings via `config.yaml`.\n    *   ✂️ **Chunking Controls:** Enable/disable text splitting and adjust approximate chunk size.\n    *   ⚠️ **Warning Modals:** Optional warnings for chunking voice consistency and general generation quality.\n    *   🌓 **Light/Dark Mode:** Toggle between themes with preference saved locally.\n    *   🔊 **Audio Player:** Integrated waveform player ([WaveSurfer.js](https://wavesurfer.xyz/)) for generated audio with download option.\n    *   ⏳ **Loading Indicator:** Shows status, including chunk processing information.\n*   **Flexible \u0026 Efficient Model Handling:**\n    *   ☁️ Downloads models automatically from [Hugging Face Hub](https://huggingface.co/).\n    *   🔒 Supports loading secure **`.safetensors`** weights (default).\n    *   💾 Supports 
loading original **`.pth`** weights.\n    *   🚀 Defaults to **BF16 SafeTensors** for reduced memory footprint (~half size) and potentially faster inference. (Credit: [ttj/dia-1.6b-safetensors](https://huggingface.co/ttj/dia-1.6b-safetensors))\n    *   🔄 Easily switch between model formats/versions via `config.yaml`.\n*   **Performance \u0026 Configuration:**\n    *   💻 **GPU Acceleration:** Automatically uses NVIDIA CUDA if available, falls back to CPU. Optimized VRAM usage (~7GB typical).\n    *   📊 **Terminal Progress:** Displays `tqdm` progress bar when processing text chunks.\n    *   ⚙️ Primary configuration via `config.yaml`, initial seeding via `.env`.\n    *   📦 Uses standard Python virtual environments.\n*   **Docker Support:**\n    *   🐳 Containerized deployment via [Docker](https://www.docker.com/) and Docker Compose.\n    *   🔌 NVIDIA GPU acceleration with Container Toolkit integration.\n    *   💾 Persistent volumes for models, reference audio, predefined voices, outputs, and config.\n    *   🚀 One-command setup and deployment (`docker compose up -d`).\n\n## 🔩 System Prerequisites\n\n*   **Operating System:** Windows 10/11 (64-bit) or Linux (Debian/Ubuntu recommended).\n*   **Python:** Version 3.10 or later ([Download](https://www.python.org/downloads/)).\n*   **Git:** For cloning the repository ([Download](https://git-scm.com/downloads)).\n*   **Internet:** For downloading dependencies and models.\n*   **(Optional but HIGHLY Recommended for Performance):**\n    *   **NVIDIA GPU:** CUDA-compatible (Maxwell architecture or newer). Check [NVIDIA CUDA GPUs](https://developer.nvidia.com/cuda-gpus). Optimized VRAM usage (~7GB typical), but more helps.\n    *   **NVIDIA Drivers:** Latest version for your GPU/OS ([Download](https://www.nvidia.com/Download/index.aspx)).\n    *   **CUDA Toolkit:** Compatible version (e.g., 11.8, 12.1) matching the PyTorch build you install.\n*   **(Linux Only):**\n    *   `libsndfile1`: Audio library needed by `soundfile`. 
Install via package manager (e.g., `sudo apt install libsndfile1`).\n    *   `ffmpeg`: Required by `openai-whisper`. Install via package manager (e.g., `sudo apt install ffmpeg`).\n\n## 💻 Installation and Setup\n\nFollow these steps carefully to get the server running.\n\n**1. Clone the Repository**\n```bash\ngit clone https://github.com/devnen/dia-tts-server.git\ncd dia-tts-server\n```\n\n**2. Set up Python Virtual Environment**\n\nUsing a virtual environment is crucial!\n\n*   **Windows (PowerShell):**\n    ```powershell\n    # In the dia-tts-server directory\n    python -m venv venv\n    .\\venv\\Scripts\\activate\n    # Your prompt should now start with (venv)\n    ```\n\n*   **Linux (Bash - Debian/Ubuntu Example):**\n    ```bash\n    # Ensure prerequisites are installed\n    sudo apt update \u0026\u0026 sudo apt install python3 python3-venv python3-pip libsndfile1 ffmpeg -y\n\n    # In the dia-tts-server directory\n    python3 -m venv venv\n    source venv/bin/activate\n    # Your prompt should now start with (venv)\n    ```\n\n**3. Install Dependencies**\n\nMake sure your virtual environment is activated (`(venv)` prefix visible).\n\n```bash\n# Upgrade pip (recommended)\npip install --upgrade pip\n\n# Install project requirements (includes tqdm, yaml, parselmouth etc.)\npip install -r requirements.txt\n```\n⭐ **Note:** This installation includes large libraries like PyTorch. The download and installation process may take some time depending on your internet speed and system performance.\n\n⭐ **Important:** This installs the *CPU-only* version of PyTorch by default. If you have an NVIDIA GPU, proceed to Step 4 **before** running the server for GPU acceleration.\n\n**4. NVIDIA Driver and CUDA Setup (for GPU Acceleration)**\n\nSkip this step if you only have a CPU.\n\n*   **Step 4a: Check/Install NVIDIA Drivers**\n    *   Run `nvidia-smi` in your terminal/command prompt.\n    *   If it works, note the **CUDA Version** listed (e.g., 12.1, 11.8). 
This is the *maximum* your driver supports.\n    *   If it fails, download and install the latest drivers from [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx) and **reboot**. Verify with `nvidia-smi` again.\n\n*   **Step 4b: Install PyTorch with CUDA Support**\n    *   Go to the [Official PyTorch Website](https://pytorch.org/get-started/locally/).\n    *   Use the configuration tool: Select **Stable**, **Windows/Linux**, **Pip**, **Python**, and the **CUDA version** that is **equal to or lower** than the one shown by `nvidia-smi` (e.g., if `nvidia-smi` shows 12.4, choose CUDA 12.1).\n    *   Copy the generated command (it will include `--index-url https://download.pytorch.org/whl/cuXXX`).\n    *   **In your activated `(venv)`:**\n        ```bash\n        # Uninstall the CPU version first!\n        pip uninstall torch torchvision torchaudio -y\n\n        # Paste and run the command copied from the PyTorch website\n        # Example (replace with your actual command):\n        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121\n        ```\n\n*   **Step 4c: Verify PyTorch CUDA Installation**\n    *   In your activated `(venv)`, run `python` and execute the following single line:\n        ```python\n        import torch; print(f\"PyTorch version: {torch.__version__}\"); print(f\"CUDA available: {torch.cuda.is_available()}\"); print(f\"Device name: {torch.cuda.get_device_name(0)}\") if torch.cuda.is_available() else None; exit()\n        ```\n    *   If `CUDA available:` shows `True`, the setup was successful. If `False`, double-check driver installation and the PyTorch install command.\n\n## ⚙️ Configuration\n\nThe server now primarily uses `config.yaml` for runtime configuration.\n\n*   **`config.yaml`:** Located in the project root. This file stores all server settings, model paths, generation defaults, and UI state. It is created automatically on the first run if it doesn't exist. 
**This is the main file to edit for persistent configuration changes.**\n*   **`.env` File:** Used **only** for the *initial creation* of `config.yaml` if it's missing, or when using the \"Reset All Settings\" button in the UI. Values in `.env` override hardcoded defaults during this initial seeding/reset process. It is **not** read during normal server operation once `config.yaml` exists.\n*   **UI Configuration:** The \"Server Configuration\" and \"Generation Parameters\" sections in the Web UI allow direct editing and saving of values *into* `config.yaml`.\n\n**Key Configuration Areas (in `config.yaml` or UI):**\n\n*   `server`: `host`, `port`\n*   `model`: `repo_id`, `config_filename`, `weights_filename`, `whisper_model_name`\n*   `paths`: `model_cache`, `reference_audio`, `output`, `voices` (for predefined)\n*   `generation_defaults`: Default values for sliders/seed in the UI (`speed_factor`, `cfg_scale`, `temperature`, `top_p`, `cfg_filter_top_k`, `seed`, `split_text`, `chunk_size`).\n*   `ui_state`: Stores the last used text, voice mode, file selections, etc., for UI persistence.\n\n⭐ **Remember:** Changes made to `server`, `model`, or `paths` sections in `config.yaml` (or via the UI) **require a server restart** to take effect. Changes to `generation_defaults` or `ui_state` are applied dynamically or on the next page load.\n\n## ▶️ Running the Server\n\n**Note on Model Downloads:**\nThe first time you run the server (or after changing model settings in `config.yaml`), it will download the required Dia and Whisper model files (~3-7GB depending on selection). Monitor the terminal logs for progress. The server starts fully *after* downloads complete.\n\n1.  **Activate the virtual environment (if not activated):**\n    *   Linux/macOS: `source venv/bin/activate`\n    *   Windows: `.\\venv\\Scripts\\activate`\n2.  **Run the server:**\n    ```bash\n    python server.py\n    ```\n3.  
**Access the UI:** The server should automatically attempt to open the Web UI in your default browser after startup. If it doesn't for any reason, manually navigate to `http://localhost:PORT` (e.g., `http://localhost:8003`).\n4.  **Access API Docs:** Open `http://localhost:PORT/docs`.\n5.  **Stop the server:** Press `CTRL+C` in the terminal.\n\n---\n\n## 🐳 Docker Installation\n\nRun Dia TTS Server easily using Docker. The recommended method uses Docker Compose with pre-built images from GitHub Container Registry (GHCR).\n\n### Prerequisites\n\n*   [Docker](https://docs.docker.com/get-docker/) installed.\n*   [Docker Compose](https://docs.docker.com/compose/install/) installed (usually included with Docker Desktop).\n*   (Optional but Recommended for GPU) NVIDIA GPU with up-to-date drivers and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed.\n\n### Option 1: Using Docker Compose (Recommended)\n\nThis method uses `docker-compose.yml` to manage the container, volumes, and configuration easily. It leverages pre-built images hosted on GHCR.\n\n1.  **Clone the repository:** (You only need the `docker-compose.yml` and `env.example.txt` files from it)\n    ```bash\n    git clone https://github.com/devnen/dia-tts-server.git\n    cd dia-tts-server\n    ```\n\n2.  
**(Optional) Initial Configuration via `.env`:**\n    *   If this is your very first time running the container and you want to override the default settings *before* `config.yaml` is created inside the container, copy the example environment file:\n        ```bash\n        cp env.example.txt .env\n        ```\n    *   Edit the `.env` file with your desired initial settings (e.g., `PORT`, model filenames).\n    *   **Note:** This `.env` file is *only* used to seed the *initial* `config.yaml` on the very first container start if `/app/config.yaml` doesn't already exist inside the container's volume (which it won't initially). Subsequent configuration changes should be made via the UI or by editing `config.yaml` directly (see Configuration Note below).\n\n3.  **Review `docker-compose.yml`:**\n    *   The repository includes a `docker-compose.yml` file configured to use the pre-built image and recommended settings. Ensure it looks similar to this:\n\n        ```yaml\n        # docker-compose.yml\n        version: '3.8'\n\n        services:\n          dia-tts-server:\n            # Use the pre-built image from GitHub Container Registry\n            image: ghcr.io/devnen/dia-tts-server:latest\n            # Alternatively, to build locally (e.g., for development):\n            # build:\n            #   context: .\n            #   dockerfile: Dockerfile\n            ports:\n              # Map host port (default 8003) to container port 8003\n              # You can change the host port via .env (e.g., PORT=8004)\n              - \"${PORT:-8003}:8003\"\n            volumes:\n              # Mount local directories into the container for persistent data\n              - ./model_cache:/app/model_cache\n              - ./reference_audio:/app/reference_audio\n              - ./outputs:/app/outputs\n              - ./voices:/app/voices\n              # DO NOT mount config.yaml - let the app create it inside\n\n            # --- GPU Access ---\n            # Modern method 
(Recommended for newer Docker/NVIDIA setups)\n            devices:\n              - nvidia.com/gpu=all\n            device_cgroup_rules:\n              - \"c 195:* rmw\" # Needed for some NVIDIA container toolkit versions\n              - \"c 236:* rmw\" # Needed for some NVIDIA container toolkit versions\n\n            # Legacy method (Alternative for older Docker/NVIDIA setups)\n            # If the 'devices' block above doesn't work, comment it out and uncomment\n            # the 'deploy' block below. Do not use both simultaneously.\n            # deploy:\n            #   resources:\n            #     reservations:\n            #       devices:\n            #         - driver: nvidia\n            #           count: 1 # Or specify specific GPUs e.g., \"device=0,1\"\n            #           capabilities: [gpu]\n            # --- End GPU Access ---\n\n            restart: unless-stopped\n            env_file:\n              # Load environment variables from .env file for initial config seeding\n              - .env\n            environment:\n              # Enable faster Hugging Face downloads inside the container\n              - HF_HUB_ENABLE_HF_TRANSFER=1\n              # Pass GPU capabilities (may be needed for legacy method if uncommented)\n              - NVIDIA_VISIBLE_DEVICES=all\n              - NVIDIA_DRIVER_CAPABILITIES=compute,utility\n\n        # Optional: Define named volumes if you prefer them over host mounts\n        # volumes:\n        #   model_cache:\n        #   reference_audio:\n        #   outputs:\n        #   voices:\n        ```\n\n4.  
**Start the container:**\n    ```bash\n    docker compose up -d\n    ```\n    *   This command will:\n        *   Pull the latest `ghcr.io/devnen/dia-tts-server:latest` image.\n        *   Create the local directories (`model_cache`, `reference_audio`, `outputs`, `voices`) if they don't exist.\n        *   Start the container in detached mode (`-d`).\n    *   The first time you run this, it will download the TTS models into `./model_cache`, which may take some time depending on your internet speed.\n\n5.  **Access the UI:**\n    Open your web browser to `http://localhost:8003` (or the host port you configured in `.env`).\n\n6.  **View logs:**\n    ```bash\n    docker compose logs -f\n    ```\n\n7.  **Stop the container:**\n    ```bash\n    docker compose down\n    ```\n\n### Option 2: Using `docker run` (Alternative)\n\nThis method runs the container directly without Docker Compose, requiring manual specification of ports, volumes, and GPU flags.\n\n```bash\n# Ensure local directories exist first:\n# mkdir -p model_cache reference_audio outputs voices\n\ndocker run -d \\\n  --name dia-tts-server \\\n  -p 8003:8003 \\\n  -v ./model_cache:/app/model_cache \\\n  -v ./reference_audio:/app/reference_audio \\\n  -v ./outputs:/app/outputs \\\n  -v ./voices:/app/voices \\\n  --env HF_HUB_ENABLE_HF_TRANSFER=1 \\\n  --gpus all \\\n  ghcr.io/devnen/dia-tts-server:latest\n```\n\n*   Replace `8003:8003` with `\u003cyour_host_port\u003e:8003` if needed.\n*   `--gpus all` enables GPU access; consult NVIDIA Container Toolkit documentation for alternatives if needed.\n*   Initial configuration relies on model defaults unless you pass environment variables using multiple `-e VAR=VALUE` flags (more complex than using `.env` with Compose).\n\n### Configuration Note\n\n*   The server uses `config.yaml` inside the container (`/app/config.yaml`) for its settings.\n*   On the *very first start*, if `/app/config.yaml` doesn't exist, the server creates it using defaults from the code, 
potentially overridden by variables in the `.env` file (if using Docker Compose and `.env` exists).\n*   **After the first start,** changes should be made by:\n    *   Using the Web UI's settings page (if available).\n    *   Editing the `config.yaml` file *inside* the container (e.g., `docker compose exec dia-tts-server nano /app/config.yaml`). Changes require a container restart (`docker compose restart dia-tts-server`) to take effect for server/model/path settings. UI state changes are saved live.\n\n### Performance Optimizations\n\n*   **Faster Model Downloads**: `hf-transfer` is enabled by default in the provided `docker-compose.yml` and image, significantly speeding up initial model downloads from Hugging Face.\n*   **GPU Acceleration**: The `docker-compose.yml` and `docker run` examples include flags (`devices` or `--gpus`) to enable NVIDIA GPU acceleration if available. The Docker image uses a CUDA runtime base for efficiency.\n\n### Docker Volumes\n\nPersistent data is stored on your host machine via volume mounts:\n\n*   `./model_cache:/app/model_cache` (Downloaded TTS and Whisper models)\n*   `./reference_audio:/app/reference_audio` (Your uploaded reference audio files for cloning)\n*   `./outputs:/app/outputs` (Generated audio files)\n*   `./voices:/app/voices` (Predefined voice audio files)\n\n### Available Images\n\n*   **GitHub Container Registry**: `ghcr.io/devnen/dia-tts-server:latest` (Automatically built from the `main` branch)\n\n---\n\n## 💡 Usage\n\n### Web UI (`http://localhost:PORT`)\n\nThe most intuitive way to use the server:\n\n*   **Text Input:** Enter your script. Use `[S1]`/`[S2]` for dialogue and non-verbals like `(laughs)`. Content is saved automatically.\n*   **Generate Button \u0026 Chunking:** Click \"Generate Speech\". Below the text box:\n    *   **Split text into chunks:** Toggle checkbox (enabled by default). 
Enables splitting for long text (\u003e ~2x chunk size).\n    *   **Chunk Size:** Adjust the slider (visible when splitting is possible) for approximate chunk character length (default 120).\n*   **Voice Mode:** Choose:\n    *   `Predefined Voices`: Select a curated, ready-to-use synthetic voice from the `./voices` directory.\n    *   `Voice Cloning`: Select an uploaded reference file from `./reference_audio`. Requires a corresponding `.txt` transcript (recommended) or relies on experimental Whisper fallback. Backend handles transcript automatically.\n    *   `Random Single / Dialogue`: Uses `[S1]`/`[S2]` tags or generates a random voice if no tags. Use a fixed Seed for consistency.\n*   **Presets:** Click buttons (loaded from `ui/presets.yaml`) to populate text and parameters. Customize by editing the YAML file.\n*   **Reference Audio (Clone Mode):** Select an existing `.wav`/`.mp3` or click \"Import\" to upload new files to `./reference_audio`.\n*   **Generation Parameters:** Adjust sliders/inputs for Speed, CFG, Temperature, Top P, Top K, and **Seed**. Settings are saved automatically. Click \"Save Generation Parameters\" to update the defaults in `config.yaml`. 
Use -1 seed for random, integer for specific results.\n*   **Server Configuration:** View/edit `config.yaml` settings (requires server restart for some changes).\n*   **Loading Overlay:** Appears during generation, showing chunk progress if applicable.\n*   **Audio Player:** Appears on success with waveform, playback controls, download link, and generation info.\n*   **Theme Toggle:** Switch between light/dark modes.\n\n### API Endpoints (`/docs` for details)\n\n*   **`/v1/audio/speech` (POST):** OpenAI-compatible.\n    *   `input`: Text.\n    *   `voice`: 'S1', 'S2', 'dialogue', 'predefined_voice_filename.wav', or 'reference_filename.wav'.\n    *   `response_format`: 'opus' or 'wav'.\n    *   `speed`: Playback speed factor (0.5-2.0).\n    *   `seed`: (Optional) Integer seed, -1 for random.\n*   **`/tts` (POST):** Custom endpoint with full control.\n    *   `text`: Target text.\n    *   `voice_mode`: 'dialogue', 'single_s1', 'single_s2', 'clone', 'predefined'.\n    *   `clone_reference_filename`: Filename in `./reference_audio` (for clone) or `./voices` (for predefined).\n    *   `transcript`: (Optional, Clone Mode Only) Explicit transcript text to override file/Whisper lookup.\n    *   `output_format`: 'opus' or 'wav'.\n    *   `max_tokens`: (Optional) Max tokens *per chunk*.\n    *   `cfg_scale`, `temperature`, `top_p`, `cfg_filter_top_k`: Generation parameters.\n    *   `speed_factor`: Playback speed factor (0.5-2.0).\n    *   `seed`: (Optional) Integer seed, -1 for random.\n    *   `split_text`: (Optional) Boolean, enable/disable chunking (default: True).\n    *   `chunk_size`: (Optional) Integer, target chunk size (default: 120).\n\n## 🔍 Troubleshooting\n\n*   **CUDA Not Available / Slow:** Check NVIDIA drivers (`nvidia-smi`), ensure correct CUDA-enabled PyTorch is installed (Installation Step 4).\n*   **VRAM Out of Memory (OOM):**\n    *   Ensure you are using the BF16 model (`dia-v0_1_bf16.safetensors` in `config.yaml`) if VRAM is limited (~7GB needed).\n    
*   Close other GPU-intensive applications. VRAM optimizations and leak fixes have significantly reduced requirements.\n    *   If processing very long text even with chunking, try reducing `chunk_size` (e.g., 100).\n*   **CUDA Out of Memory (OOM) During Startup:** This can happen due to temporary overhead. The server loads weights to CPU first to mitigate this. If it persists, check VRAM usage (`nvidia-smi`), ensure BF16 model is used, or try setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` environment variable before starting.\n*   **Import Errors (`dac`, `tqdm`, `yaml`, `whisper`, `parselmouth`):** Activate venv, run `pip install -r requirements.txt`. Ensure `descript-audio-codec` installed correctly.\n*   **`libsndfile` / `ffmpeg` Error (Linux):** Run `sudo apt install libsndfile1 ffmpeg`.\n*   **Model Download Fails (Dia or Whisper):** Check internet, `config.yaml` settings (`model.repo_id`, `model.weights_filename`, `model.whisper_model_name`), Hugging Face status, cache path permissions (`paths.model_cache`).\n*   **Voice Cloning Fails / Poor Quality:**\n    *   **Ensure accurate `.txt` transcript exists** alongside the reference audio in `./reference_audio`. Format: `[S1] text...` or `[S1] text... [S2] text...`. This is the most reliable method.\n    *   Whisper fallback is experimental and may be inaccurate.\n    *   Use clean, clear reference audio (5-20s).\n    *   Check server logs for specific errors during `_prepare_cloning_inputs`.\n*   **Permission Errors (Saving Files/Config):** Check write permissions for `paths.output`, `paths.reference_audio`, `paths.voices`, `paths.model_cache` (for Whisper transcript saves), and `config.yaml`.\n*   **UI Issues / Settings Not Saving:** Clear browser cache/local storage. Check developer console (F12) for JS errors. Ensure `config.yaml` is writable by the server process.\n*   **Inconsistent Voice with Chunking:** Use \"Predefined Voices\" or \"Voice Cloning\" mode. 
If using \"Random/Dialogue\" mode with splitting, use a fixed integer `seed` (not -1) for consistency across chunks. The UI provides a warning otherwise.\n*   **Port Conflict (`Address already in use` / `Errno 98`):** Another process is using the port (default 8003). Stop the other process or change the `server.port` in `config.yaml` (requires restart).\n    *   **Explanation:** This usually happens if a previous server instance didn't shut down cleanly or another application is bound to the same port.\n    *   **Linux:** Find/kill process: `sudo lsof -i:PORT | grep LISTEN | awk '{print $2}' | xargs kill -9` (Replace PORT, e.g., 8003).\n    *   **Windows:** Find/kill process: `for /f \"tokens=5\" %i in ('netstat -ano ^| findstr :PORT') do taskkill /F /PID %i` (Replace PORT, e.g., 8003). Use with caution.\n*   **Generation Cancel Button:** This is a \"UI Cancel\" - it stops the *frontend* from waiting but doesn't instantly halt ongoing backend model inference. Clicking Generate again cancels the previous UI wait.\n\n### Selecting GPUs on Multi-GPU Systems\n\nSet the `CUDA_VISIBLE_DEVICES` environment variable **before** running `python server.py` to specify which GPU(s) PyTorch should see. 
The server uses the first visible one (`cuda:0`).\n\n*   **Example (Use only physical GPU 1):**\n    *   Linux/macOS: `CUDA_VISIBLE_DEVICES=\"1\" python server.py`\n    *   Windows CMD: `set \"CUDA_VISIBLE_DEVICES=1\" \u0026\u0026 python server.py`\n    *   Windows PowerShell: `$env:CUDA_VISIBLE_DEVICES=\"1\"; python server.py`\n\n*   **Example (Use physical GPUs 6 and 7 - server uses GPU 6):**\n    *   Linux/macOS: `CUDA_VISIBLE_DEVICES=\"6,7\" python server.py`\n    *   Windows CMD: `set \"CUDA_VISIBLE_DEVICES=6,7\" \u0026\u0026 python server.py`\n    *   Windows PowerShell: `$env:CUDA_VISIBLE_DEVICES=\"6,7\"; python server.py`\n\n**Note:** `CUDA_VISIBLE_DEVICES` selects GPUs; it does **not** fix OOM errors if the chosen GPU lacks sufficient memory.\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to open an issue to report bugs or suggest features, or submit a Pull Request for improvements.\n\n## 📜 License\n\nThis project is licensed under the **MIT License**.\n\nYou can find it here: [https://opensource.org/licenses/MIT](https://opensource.org/licenses/MIT)\n\n## 🙏 Acknowledgements\n\n*   **Core Model:** This project heavily relies on the excellent **[Dia TTS model](https://github.com/nari-labs/dia)** developed by **[Nari Labs](https://github.com/nari-labs)**. 
Their work in creating and open-sourcing the model is greatly appreciated.\n*   **UI Inspiration:** Special thanks to **[Lex-au](https://github.com/Lex-au)** whose **[Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI)** project served as inspiration for the web interface design of this project.\n*   **SafeTensors Conversion:** Thank you to user **[ttj on Hugging Face](https://huggingface.co/ttj)** for providing the converted **[SafeTensors weights](https://huggingface.co/ttj/dia-1.6b-safetensors)** used as the default in this server.\n*   **Containerization Technologies:** [Docker](https://www.docker.com/) and [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) for enabling consistent deployment environments.\n*   **Core Libraries:**\n    *   [FastAPI](https://fastapi.tiangolo.com/)\n    *   [Uvicorn](https://www.uvicorn.org/)\n    *   [PyTorch](https://pytorch.org/)\n    *   [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/index) \u0026 [SafeTensors](https://github.com/huggingface/safetensors)\n    *   [Descript Audio Codec (DAC)](https://github.com/descriptinc/descript-audio-codec)\n    *   [SoundFile](https://python-soundfile.readthedocs.io/) \u0026 [libsndfile](http://www.mega-nerd.com/libsndfile/)\n    *   [Jinja2](https://jinja.palletsprojects.com/)\n    *   [WaveSurfer.js](https://wavesurfer.xyz/)\n    *   [Tailwind CSS](https://tailwindcss.com/) (via CDN)\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevnen%2Fdia-tts-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevnen%2Fdia-tts-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevnen%2Fdia-tts-server/lists"}