{"id":24579229,"url":"https://github.com/remsky/Kokoro-FastAPI","last_synced_at":"2025-10-05T05:31:11.707Z","repository":{"id":270340342,"uuid":"910061041","full_name":"remsky/Kokoro-FastAPI","owner":"remsky","description":"Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching","archived":false,"fork":false,"pushed_at":"2025-01-21T20:05:57.000Z","size":32628,"stargazers_count":803,"open_issues_count":18,"forks_count":100,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-01-21T21:21:03.620Z","etag":null,"topics":["fastapi","huggingface-spaces","kokoro","kokoro-tts","onnx","onnxruntime","pytorch","tts","tts-api"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/remsky.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-30T11:53:42.000Z","updated_at":"2025-01-21T21:05:37.000Z","dependencies_parsed_at":"2025-01-14T07:31:44.540Z","dependency_job_id":"18191eec-627d-4ce8-8441-a4cbf78eefa8","html_url":"https://github.com/remsky/Kokoro-FastAPI","commit_stats":null,"previous_names":["remsky/kokoro-fastapi"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/remsky","download_url":"https://codeload.github.com/remsky/Kokoro-FastAPI/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235104743,"owners_count":18936412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","huggingface-spaces","kokoro","kokoro-tts","onnx","onnxruntime","pytorch","tts","tts-api"],"created_at":"2025-01-24T00:01:49.928Z","updated_at":"2025-10-05T05:31:06.644Z","avatar_url":"https://github.com/remsky.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"githubbanner.png\" alt=\"Kokoro TTS Banner\"\u003e\n\u003c/p\u003e\n\n# \u003csub\u003e\u003csub\u003e_`FastKoko`_ \u003c/sub\u003e\u003c/sub\u003e\n[![Tests](https://img.shields.io/badge/tests-117%20passed-darkgreen)]()\n[![Coverage](https://img.shields.io/badge/coverage-60%25-grey)]()\n[![Tested at Model 
## Quick Start

The service can be accessed through either the API endpoints or the Gradio web interface.

1. Install prerequisites and start the service using Docker Compose (full setup including UI):
   - Install [Docker Desktop](https://www.docker.com/products/docker-desktop/)
   - Clone the repository:
        ```bash
        git clone https://github.com/remsky/Kokoro-FastAPI.git
        cd Kokoro-FastAPI

        #   * Switch to the stable branch if you hit any issues *
        git checkout v0.0.5post1-stable

        cd docker/gpu # OR
        # cd docker/cpu # Run this or the above
        docker compose up --build
        ```

      Once started:
     - The API will be available at http://localhost:8880
     - The UI can be accessed at http://localhost:7860

   __Or__ run the API alone using Docker (model + voice packs baked in; most recent):

   ```bash
   docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.1.0post1 # CPU
   docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:v0.1.0post1 # Nvidia GPU
   ```
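Once the container is up, a quick way to sanity-check the API is to list the loaded voices via the same `/v1/audio/voices` route used throughout the examples below; a 200 response with a non-empty list confirms the service and voicepacks are ready:

```python
import requests

# List the voices the server has loaded.
response = requests.get("http://localhost:8880/v1/audio/voices")
response.raise_for_status()
print(response.json()["voices"])
```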
2. Run locally as an OpenAI-compatible Speech endpoint:
    ```python
    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8880/v1",
        api_key="not-needed"
        )

    with client.audio.speech.with_streaming_response.create(
        model="kokoro",
        voice="af_sky+af_bella", # a single voicepack or a multi-voicepack combo
        input="Hello world!",
        response_format="mp3"
    ) as response:
        response.stream_to_file("output.mp3")
    ```

    or visit http://localhost:7860
    <p align="center">
    <img src="ui/GradioScreenShot.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
    </p>
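## Features
<details>
<summary>OpenAI-Compatible Speech Endpoint</summary>

```python
# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  # Not used but required for compatibility; also accepts library defaults
    voice="af_bella+af_sky",
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")
```
Or via requests:
```python
import requests

# List available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",  # Not used but required for compatibility
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)
```

Quick tests (run from another terminal):
```bash
python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voices
```
</details>

<details>
<summary>Voice Combination</summary>

- Averages the model weights of any existing voicepacks (a sketch of the idea follows this section)
- Saves generated voicepacks for future use
- (new) Available through any endpoint; simply concatenate the desired packs with "+"

Combine voices and generate audio:
```python
import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Create a combined voice (saved locally on the server)
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json=[voices[0], voices[1]]
)
combined_voice = response.json()["voice"]

# Generate audio with the combined voice (or simply pass multiple voices directly with `+`)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": combined_voice, # or skip the above step with f"{voices[0]}+{voices[1]}"
        "response_format": "mp3"
    }
)
```
<p align="center">
  <img src="assets/voice_analysis.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
</p>
</details>

Under the hood, combining voices is an average of voicepack weights. Here is a minimal sketch of that idea, assuming each voicepack loads as a single PyTorch tensor of identical shape; the repo's actual file layout and loader may differ, and `combine_voicepacks` is a hypothetical helper:

```python
import torch

def combine_voicepacks(path_a: str, path_b: str, out_path: str) -> None:
    """Average two voicepack tensors elementwise and save the result.

    Hypothetical helper: assumes each .pt file holds one tensor of the
    same shape, which may not match the repo's real storage format.
    """
    pack_a = torch.load(path_a)
    pack_b = torch.load(path_b)
    torch.save((pack_a + pack_b) / 2, out_path)  # simple unweighted mean

combine_voicepacks("af_bella.pt", "af_sky.pt", "af_bella+af_sky.pt")
```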
<details>
<summary>Multiple Output Audio Formats</summary>

- mp3
- wav
- opus
- flac
- aac
- pcm

<p align="center">
<img src="assets/format_comparison.png" width="80%" alt="Audio Format Comparison" style="border: 2px solid #333; padding: 10px;">
</p>

</details>

<details>
<summary>Gradio Web Utility</summary>

Access the interactive web UI at http://localhost:7860 after starting the service. Features include:
- Voice/format/speed selection
- Audio playback and download
- Text file or direct input

If you only want the API, just comment out everything in the docker-compose.yml under and including `gradio-ui`.

Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added to the UI.

Running the UI Docker service
   - If you only want to run the Gradio web interface separately and connect it to an existing API service:
      ```bash
      docker run -p 7860:7860 \
        -e API_HOST=<api-hostname-or-ip> \
        -e API_PORT=8880 \
        ghcr.io/remsky/kokoro-fastapi-ui:v0.1.0
      ```

     - Replace `<api-hostname-or-ip>` with:
       - `kokoro-tts` if the UI container is running in the same Docker Compose setup.
       - `localhost` if the API is running on your local machine.

### Disabling Local Saving

You can disable local saving of audio files and hide the file view in the UI by setting the `DISABLE_LOCAL_SAVING` environment variable to `true`. This is useful when running the service on a server where you don't want to store generated audio files locally.

When using Docker Compose:
```yaml
environment:
  - DISABLE_LOCAL_SAVING=true
```

When running the Docker image directly:
```bash
docker run -p 7860:7860 -e DISABLE_LOCAL_SAVING=true ghcr.io/remsky/kokoro-fastapi-ui:latest
```
</details>

<details>
<summary>Streaming Support</summary>

```python
# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16,
    channels=1,
    rate=24000,
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)
```

Or via requests:
```python
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks (see the WAV-buffering sketch below)
        pass
```
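The `pass` above is where you would consume the raw PCM. As one sketch, you can buffer the stream into a playable WAV file using only the standard library, assuming the same 24 kHz, 16-bit mono output used in the speaker example above:

```python
import wave
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={"input": "Hello world!", "voice": "af_bella", "response_format": "pcm"},
    stream=True,
)

# Wrap the raw 16-bit mono PCM stream in a WAV container as chunks arrive.
with wave.open("output.wav", "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(24000)  # matches the PyAudio rate above
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            wav_file.writeframes(chunk)
```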
<p align="center">
  <img src="assets/gpu_first_token_timeline_openai.png" width="45%" alt="GPU First Token Timeline" style="border: 2px solid #333; padding: 10px; margin-right: 1%;">
  <img src="assets/cpu_first_token_timeline_stream_openai.png" width="45%" alt="CPU First Token Timeline" style="border: 2px solid #333; padding: 10px;">
</p>

Key streaming metrics:
- First-token latency @ chunk size
    - ~300ms  (GPU) @ 400
    - ~3500ms (CPU) @ 200 (older i7)
    - <1s     (CPU) @ 200 (M3 Pro)
- Adjustable chunking settings for real-time playback

*Note: Artifacts in intonation can increase with smaller chunks*
</details>

## Processing Details
<details>
<summary>Performance Benchmarks</summary>

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours of output), measuring processing time and realtime factor. Tests were run on:
- Windows 11 Home w/ WSL2
- NVIDIA 4060Ti 16GB GPU @ CUDA 12.1
- 11th Gen i7-11700 @ 2.5GHz
- 64GB RAM
- WAV native output
- H.G. Wells - The Time Machine (full text)

<p align="center">
  <img src="assets/gpu_processing_time.png" width="45%" alt="Processing Time" style="border: 2px solid #333; padding: 10px; margin-right: 1%;">
  <img src="assets/gpu_realtime_factor.png" width="45%" alt="Realtime Factor" style="border: 2px solid #333; padding: 10px;">
</p>

Key performance metrics:
- Realtime speed: ranges between 25x-50x (output audio length relative to generation time)
- Average processing rate: 137.67 tokens/second (cl100k_base)
</details>
<details>
<summary>GPU Vs. CPU</summary>

```bash
# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x realtime speed)
docker compose up --build

# CPU: ONNX-optimized inference (~2.4x realtime speed)
docker compose -f docker-compose.cpu.yml up --build
```
*Note: Overall speed may have dropped somewhat with the structural changes to accommodate streaming. Looking into it.*
</details>

<details>
<summary>Natural Boundary Detection</summary>

- Automatically splits and stitches at sentence boundaries (a rough sketch of the idea follows this section)
- Helps to reduce artifacts and allows long-form processing, as the base model is currently only configured for approximately 30s of output
</details>
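As a rough sketch of the split-and-stitch idea, not the repo's actual implementation: break the text on sentence-ending punctuation, synthesize each piece within the model's ~30s window, and concatenate the audio. The splitter below is a deliberately naive stand-in:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter for illustration; the repo's real
    boundary detection is more robust than this regex."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("It was the best of times. It was the worst of times!")
# Each chunk would be synthesized separately and the resulting audio
# segments stitched together into one long-form output.
print(chunks)
```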
<details>
<summary>Phoneme &amp; Token Routes</summary>

Convert text to phonemes and/or generate audio directly from phonemes:
```python
import requests

# Convert text to phonemes
response = requests.post(
    "http://localhost:8880/dev/phonemize",
    json={
        "text": "Hello world!",
        "language": "a"  # "a" for American English
    }
)
result = response.json()
phonemes = result["phonemes"]  # Phoneme string, e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst
tokens = result["tokens"]      # Token IDs including start/end tokens

# Generate audio from phonemes
response = requests.post(
    "http://localhost:8880/dev/generate_from_phonemes",
    json={
        "phonemes": phonemes,
        "voice": "af_bella",
        "speed": 1.0
    }
)

# Save WAV audio
with open("speech.wav", "wb") as f:
    f.write(response.content)
```

See `examples/phoneme_examples/generate_phonemes.py` for a sample script.
</details>

## Known Issues

<details>
<summary>Versioning &amp; Development</summary>

I'm doing what I can to keep things stable, but we are in an early and rapid set of build cycles here. If you run into trouble, you may have to roll back to an earlier release tag, or build from source and/or troubleshoot and submit a PR. This branch marks the last known stable point:

`v0.0.5post1`

Free and open source is a community effort, and I love working on this project, though there are only so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs, feature requests, or other issues you find during use.

  <a href="https://www.buymeacoffee.com/remsky" target="_blank">
    <img
      src="https://cdn.buymeacoffee.com/buttons/v2/default-violet.png"
      alt="Buy Me A Coffee"
      style="height: 30px !important;width: 110px !important;"
    >
  </a>

</details>

<details>
<summary>Linux GPU Permissions</summary>

Some Linux users may encounter GPU permission issues when running as non-root.
No guarantees, but here are some common solutions; consider your security requirements carefully.

### Option 1: Container Groups (likely the best option)
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"
```

### Option 2: Host System Groups
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"
```
Note: May require adding the host user to groups (`sudo usermod -aG docker,video $USER`) and a system restart.

### Option 3: Device Permissions (use with caution)
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm
```
⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.

Visit [NVIDIA Container Toolkit installation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) for more detailed information.

</details>

## Model and License

<details open>
<summary>Model</summary>

This API uses the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.
</details>
<details>
<summary>License</summary>
This project is licensed under the Apache License 2.0 - see below for details:

- The Kokoro model weights are licensed under Apache 2.0 (see [model page](https://huggingface.co/hexgrad/Kokoro-82M))
- The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
- The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
</details>