{"id":24534446,"url":"https://github.com/remsky/kokoro-fastapi","last_synced_at":"2025-05-14T05:10:32.394Z","repository":{"id":270340342,"uuid":"910061041","full_name":"remsky/Kokoro-FastAPI","owner":"remsky","description":"Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w/CPU ONNX and NVIDIA GPU PyTorch support, handling, and auto-stitching","archived":false,"fork":false,"pushed_at":"2025-05-06T02:03:52.000Z","size":62228,"stargazers_count":2550,"open_issues_count":68,"forks_count":356,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-05-06T03:20:19.154Z","etag":null,"topics":["fastapi","huggingface-spaces","kokoro","kokoro-tts","onnx","onnxruntime","openai-compatible-api","openwebui","pytorch","sillytavern","tts","tts-api","uv"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/remsky.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":"remsky","thanks_dev":null,"custom":null}},"created_at":"2024-12-30T11:53:42.000Z","updated_at":"2025-05-06T02:03:55.000Z","dependencies_parsed_at":"2025-01-14T07:31:44.540Z","dependency_job_id":"18191eec-627d-4ce8-8441-a4cbf78eefa8","html_url":"https://github.com/remsky/Kokoro-FastAPI","commit_stats":null,"previous_names":["remsky/kokoro-fastapi"],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/remsky%2FKokoro-FastAPI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/remsky","download_url":"https://codeload.github.com/remsky/Kokoro-FastAPI/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254076850,"owners_count":22010611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","huggingface-spaces","kokoro","kokoro-tts","onnx","onnxruntime","openai-compatible-api","openwebui","pytorch","sillytavern","tts","tts-api","uv"],"created_at":"2025-01-22T11:17:18.525Z","updated_at":"2025-05-14T05:10:32.358Z","avatar_url":"https://github.com/remsky.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  
\u003cimg src=\"githubbanner.png\" alt=\"Kokoro TTS Banner\"\u003e\n\u003c/p\u003e\n\n# \u003csub\u003e\u003csub\u003e_`FastKoko`_ \u003c/sub\u003e\u003c/sub\u003e\n[![Tests](https://img.shields.io/badge/tests-69-darkgreen)]()\n[![Coverage](https://img.shields.io/badge/coverage-54%25-tan)]()\n[![Try on Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Try%20on-Spaces-blue)](https://huggingface.co/spaces/Remsky/Kokoro-TTS-Zero)\n\n[![Kokoro](https://img.shields.io/badge/kokoro-0.9.2-BB5420)](https://github.com/hexgrad/kokoro)\n[![Misaki](https://img.shields.io/badge/misaki-0.9.3-B8860B)](https://github.com/hexgrad/misaki)\n\n[![Tested at Model Commit](https://img.shields.io/badge/last--tested--model--commit-1.0::9901c2b-blue)](https://huggingface.co/hexgrad/Kokoro-82M/commit/9901c2b79161b6e898b7ea857ae5298f47b8b0d6)\n\nDockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model\n- Multi-language support (English, Japanese, Korean, Chinese, _Vietnamese soon_)\n- OpenAI-compatible Speech endpoint, NVIDIA GPU accelerated or CPU inference with PyTorch \n- ONNX support coming soon, see v0.1.5 and earlier for legacy ONNX support in the interim\n- Debug endpoints for monitoring system stats, integrated web UI on localhost:8880/web\n- Phoneme-based audio generation, phoneme generation\n- Per-word timestamped caption generation\n- Voice mixing with weighted combinations\n\n### Integration Guides\n [![Helm Chart](https://img.shields.io/badge/Helm%20Chart-black?style=flat\u0026logo=helm\u0026logoColor=white)](https://github.com/remsky/Kokoro-FastAPI/wiki/Setup-Kubernetes) [![DigitalOcean](https://img.shields.io/badge/DigitalOcean-black?style=flat\u0026logo=digitalocean\u0026logoColor=white)](https://github.com/remsky/Kokoro-FastAPI/wiki/Integrations-DigitalOcean) [![SillyTavern](https://img.shields.io/badge/SillyTavern-black?style=flat\u0026color=red)](https://github.com/remsky/Kokoro-FastAPI/wiki/Integrations-SillyTavern)\n[![OpenWebUI](https://img.shields.io/badge/OpenWebUI-black?style=flat\u0026color=white)](https://github.com/remsky/Kokoro-FastAPI/wiki/Integrations-OpenWebUi)\n## Get Started\n\n\u003cdetails\u003e\n\u003csummary\u003eQuickest Start (docker run)\u003c/summary\u003e\n\n\nPre built images are available to run, with arm/multi-arch support, and baked in models\nRefer to the core/config.py file for a full list of variables which can be managed via the environment\n\n```bash\n# the `latest` tag can be used, though it may have some unexpected bonus features which impact stability.\n Named versions should be pinned for your regular usage.\n Feedback/testing is always welcome\n\ndocker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest # CPU, or:\ndocker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest  #NVIDIA GPU\n```\n\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003eQuick Start (docker compose) \u003c/summary\u003e\n\n1. Install prerequisites, and start the service using Docker Compose (Full setup including UI):\n   - Install [Docker](https://www.docker.com/products/docker-desktop/)\n   - Clone the repository:\n        ```bash\n        git clone https://github.com/remsky/Kokoro-FastAPI.git\n        cd Kokoro-FastAPI\n\n        cd docker/gpu  # For GPU support\n        # or cd docker/cpu  # For CPU support\n        docker compose up --build\n\n        # *Note for Apple Silicon (M1/M2) users:\n        # The current GPU build relies on CUDA, which is not supported on Apple Silicon.  

</details>

<details>
<summary>Quick Start (docker compose)</summary>

1. Install prerequisites, and start the service using Docker Compose (full setup including UI):
   - Install [Docker](https://www.docker.com/products/docker-desktop/)
   - Clone the repository:
        ```bash
        git clone https://github.com/remsky/Kokoro-FastAPI.git
        cd Kokoro-FastAPI

        cd docker/gpu  # For GPU support
        # or cd docker/cpu  # For CPU support
        docker compose up --build

        # Note for Apple Silicon (M1/M2) users:
        # The current GPU build relies on CUDA, which is not supported on Apple Silicon.
        # If you are on an M1/M2/M3 Mac, please use the `docker/cpu` setup.
        # MPS (Apple's GPU acceleration) support is planned but not yet available.

        # Models will auto-download, but if needed you can manually download:
        python docker/scripts/download_model.py --output api/src/models/v1_0

        # Or run directly via UV:
        ./start-gpu.sh  # For GPU support
        ./start-cpu.sh  # For CPU support
        ```
</details>
<details>
<summary>Direct Run (via uv)</summary>

1. Install prerequisites:
   - Install [astral-uv](https://docs.astral.sh/uv/)
   - Install [espeak-ng](https://github.com/espeak-ng/espeak-ng) on your system if you want it available as a fallback for unknown words/sounds. The upstream libraries may attempt to handle this, but results have varied.
   - Clone the repository:
        ```bash
        git clone https://github.com/remsky/Kokoro-FastAPI.git
        cd Kokoro-FastAPI
        ```

        Run the [model download script](https://github.com/remsky/Kokoro-FastAPI/blob/master/docker/scripts/download_model.py) if you haven't already.

        Start directly via UV (with hot-reload):

        Linux and macOS
        ```bash
        ./start-cpu.sh  # or:
        ./start-gpu.sh
        ```

        Windows
        ```powershell
        .\start-cpu.ps1  # or:
        .\start-gpu.ps1
        ```

</details>

<details open>
<summary>Up and Running?</summary>

Run locally as an OpenAI-compatible Speech endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed"
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_sky+af_bella",  # single voice or a voicepack combination
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")
```

- The API will be available at http://localhost:8880
- API Documentation: http://localhost:8880/docs
- Web Interface: http://localhost:8880/web

<div align="center" style="display: flex; justify-content: center; gap: 10px;">
  <img src="assets/docs-screenshot.png" width="42%" alt="API Documentation" style="border: 2px solid #333; padding: 10px;">
  <img src="assets/webui-screenshot.png" width="42%" alt="Web UI Screenshot" style="border: 2px solid #333; padding: 10px;">
</div>

</details>

## Features

<details>
<summary>OpenAI-Compatible Speech Endpoint</summary>

```python
# Using OpenAI's Python library
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",
    voice="af_bella+af_sky",  # see /api/src/core/openai_mappings.json to customize
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")
```
Or via requests:
```python
import requests

# List available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)
```

Quick tests (run from another terminal):
```bash
python examples/assorted_checks/test_openai/test_openai_tts.py  # Test OpenAI compatibility
python examples/assorted_checks/test_voices/test_all_voices.py  # Test all available voices
```
</details>

<details>
<summary>Voice Combination</summary>

- Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for a 67%/33% mix)
- Ratios are automatically normalized to sum to 100% (see the sketch below)
- Available through any endpoint by adding weights in parentheses
- Generated voicepacks are saved for future use
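
The normalization is simply each weight divided by the total; a minimal sketch of the arithmetic (illustrative only, not the server's actual code):

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale raw voice weights so they sum to 1.0 (i.e. 100%)."""
    total = sum(weights.values())
    return {voice: w / total for voice, w in weights.items()}

# "af_bella(2)+af_sky(1)" -> {'af_bella': 0.667, 'af_sky': 0.333}, i.e. a 67%/33% mix
print(normalize_weights({"af_bella": 2, "af_sky": 1}))
```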

Combine voices and generate audio:
```python
import requests

# List available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Example 1: Simple voice combination (50%/50% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella+af_sky",  # Equal weights
        "response_format": "mp3"
    }
)

# Example 2: Weighted voice combination (67%/33% mix)
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella(2)+af_sky(1)",  # 2:1 ratio = 67%/33%
        "response_format": "mp3"
    }
)

# Example 3: Download combined voice as a .pt file
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json="af_bella(2)+af_sky(1)"  # 2:1 ratio = 67%/33%
)

# Save the .pt file
with open("combined_voice.pt", "wb") as f:
    f.write(response.content)

# Use the downloaded voice file
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "combined_voice",  # Use the saved voice file
        "response_format": "mp3"
    }
)
```
<p align="center">
  <img src="assets/voice_analysis.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
</p>
</details>

<details>
<summary>Multiple Output Audio Formats</summary>

- mp3
- wav
- opus
- flac
- m4a
- pcm

<p align="center">
<img src="assets/format_comparison.png" width="80%" alt="Audio Format Comparison" style="border: 2px solid #333; padding: 10px;">
</p>

</details>

<details>
<summary>Streaming Support</summary>

```python
# OpenAI-compatible streaming
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16,
    channels=1,
    rate=24000,
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)
```
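
To capture the raw PCM stream to a standard WAV file instead of playing it, something like the following sketch should work (it assumes the 24 kHz, 16-bit, mono format used in the PyAudio example above; the file name is arbitrary):

```python
import wave

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

# Wrap the streamed PCM in a WAV container (24 kHz, 16-bit, mono, as above)
with wave.open("output.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 16-bit samples
    wav_file.setframerate(24000)
    with client.audio.speech.with_streaming_response.create(
        model="kokoro",
        voice="af_bella",
        response_format="pcm",
        input="Hello world!"
    ) as response:
        for chunk in response.iter_bytes(chunk_size=1024):
            wav_file.writeframes(chunk)
```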

Or via requests:
```python
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks (e.g. buffer, play, or write them)
        pass
```

<p align="center">
  <img src="assets/gpu_first_token_timeline_openai.png" width="45%" alt="GPU First Token Timeline" style="border: 2px solid #333; padding: 10px; margin-right: 1%;">
  <img src="assets/cpu_first_token_timeline_stream_openai.png" width="45%" alt="CPU First Token Timeline" style="border: 2px solid #333; padding: 10px;">
</p>

Key Streaming Metrics:
- First-token latency @ chunk size
    - ~300ms  (GPU) @ 400
    - ~3500ms (CPU) @ 200 (older i7)
    - ~<1s    (CPU) @ 200 (M3 Pro)
- Adjustable chunking settings for real-time playback

*Note: Intonation artifacts can increase with smaller chunks.*
</details>

## Processing Details
<details>
<summary>Performance Benchmarks</summary>

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours of output), measuring processing time and realtime factor. Tests were run on:
- Windows 11 Home w/ WSL2
- NVIDIA 4060Ti 16GB GPU @ CUDA 12.1
- 11th Gen i7-11700 @ 2.5GHz
- 64GB RAM
- WAV native output
- H.G. Wells - The Time Machine (full text)

<p align="center">
  <img src="assets/gpu_processing_time.png" width="45%" alt="Processing Time" style="border: 2px solid #333; padding: 10px; margin-right: 1%;">
  <img src="assets/gpu_realtime_factor.png" width="45%" alt="Realtime Factor" style="border: 2px solid #333; padding: 10px;">
</p>

Key Performance Metrics:
- Realtime speed: ranges between 35x-100x (generation time vs. output audio length)
- Average processing rate: 137.67 tokens/second (cl100k_base)
</details>
<details>
<summary>GPU vs. CPU</summary>

```bash
# GPU: Requires an NVIDIA GPU with CUDA 12.8 support (~35x-100x realtime speed)
cd docker/gpu
docker compose up --build

# CPU: PyTorch CPU inference
cd docker/cpu
docker compose up --build
```
*Note: Overall speed may have decreased somewhat with the structural changes made to accommodate streaming; this is being looked into.*
</details>

<details>
<summary>Natural Boundary Detection</summary>

- Automatically splits and stitches at sentence boundaries
- Helps reduce artifacts and enables long-form processing, since the base model is currently configured for only about 30 seconds of output

The model can process up to 510 phonemized tokens per chunk; however, this often leads to "rushed" speech or other artifacts. The server therefore applies an additional layer of chunking that builds flexible chunks governed by `TARGET_MIN_TOKENS`, `TARGET_MAX_TOKENS`, and `ABSOLUTE_MAX_TOKENS`, which are configurable via environment variables and default to 175, 250, and 450 respectively, roughly as sketched below.
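
A minimal sketch of this behaviour (illustrative only, not the server's actual implementation): sentences are accumulated until the target range is filled, and no chunk is allowed past the absolute cap.

```python
# Illustrative sketch of target-based chunking at sentence boundaries
TARGET_MIN_TOKENS = 175
TARGET_MAX_TOKENS = 250
ABSOLUTE_MAX_TOKENS = 450

def chunk_sentences(sentences: list[tuple[str, int]]) -> list[list[str]]:
    """Group (sentence, token_count) pairs into chunks within the target range."""
    chunks, current, count = [], [], 0
    for sentence, tokens in sentences:
        over_target = count >= TARGET_MIN_TOKENS and count + tokens > TARGET_MAX_TOKENS
        over_cap = count + tokens > ABSOLUTE_MAX_TOKENS
        if current and (over_target or over_cap):
            # Close the current chunk at a sentence boundary before overshooting
            chunks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += tokens
    if current:
        chunks.append(current)
    return chunks
```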

</details>

<details>
<summary>Timestamped Captions &amp; Phonemes</summary>

Generate audio with word-level timestamps, without streaming:
```python
import base64
import json

import requests

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False
)

with open("output.mp3", "wb") as f:
    audio_json = json.loads(response.content)

    # Decode the base64-encoded audio to bytes
    chunk_audio = base64.b64decode(audio_json["audio"].encode("utf-8"))

    # Write the decoded audio
    f.write(chunk_audio)

    # Print word-level timestamps
    print(audio_json["timestamps"])
```

Generate audio with word-level timestamps, with streaming:
```python
import base64
import json

import requests

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True
)

with open("output.mp3", "wb") as f:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            chunk_json = json.loads(chunk)

            # Decode the base64-encoded audio to bytes
            chunk_audio = base64.b64decode(chunk_json["audio"].encode("utf-8"))

            # Append the decoded chunk
            f.write(chunk_audio)

            # Print word-level timestamps for this chunk
            print(chunk_json["timestamps"])
```
</details>

<details>
<summary>Phoneme &amp; Token Routes</summary>

Convert text to phonemes and/or generate audio directly from phonemes:
```python
import requests

def get_phonemes(text: str, language: str = "a"):
    """Get phonemes and tokens for input text"""
    response = requests.post(
        "http://localhost:8880/dev/phonemize",
        json={"text": text, "language": language}  # "a" for American English
    )
    response.raise_for_status()
    result = response.json()
    return result["phonemes"], result["tokens"]

def generate_audio_from_phonemes(phonemes: str, voice: str = "af_bella"):
    """Generate audio from phonemes"""
    response = requests.post(
        "http://localhost:8880/dev/generate_from_phonemes",
        json={"phonemes": phonemes, "voice": voice},
        headers={"Accept": "audio/wav"}
    )
    if response.status_code != 200:
        print(f"Error: {response.text}")
        return None
    return response.content

# Example usage
text = "Hello world!"
try:
    # Convert text to phonemes
    phonemes, tokens = get_phonemes(text)
    print(f"Phonemes: {phonemes}")  # IPA string, e.g. ðɪs ɪz ˈoʊnli ɐ tˈɛst (for "This is only a test")
    print(f"Tokens: {tokens}")      # Token IDs including start/end tokens

    # Generate and save audio
    if audio_bytes := generate_audio_from_phonemes(phonemes):
        with open("speech.wav", "wb") as f:
            f.write(audio_bytes)
        print(f"Generated {len(audio_bytes)} bytes of audio")
except Exception as e:
    print(f"Error: {e}")
```

See `examples/phoneme_examples/generate_phonemes.py` for a sample script.
</details>

<details>
<summary>Debug Endpoints</summary>

Monitor system state and resource usage with these endpoints:

- `/debug/threads` - Get thread information and stack traces
- `/debug/storage` - Monitor temp file and output directory usage
- `/debug/system` - Get system information (CPU, memory, GPU)
- `/debug/session_pools` - View ONNX session and CUDA stream status

Useful for debugging resource exhaustion or performance issues.
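
These are plain GET endpoints; for example (the response fields aren't documented here, so treat them as illustrative):

```python
import requests

# Check CPU / memory / GPU usage, e.g. while a long generation is running
response = requests.get("http://localhost:8880/debug/system")
print(response.json())
```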

</details>

## Known Issues &amp; Troubleshooting

<details>
<summary>Missing words &amp; missing timestamps</summary>

The API automatically applies text normalization to input text, which may incorrectly remove or change some phrases. This can be disabled by adding `"normalization_options": {"normalize": false}` to your request JSON:
```python
import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_heart",
        "response_format": "pcm",
        "normalization_options": {
            "normalize": False
        }
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass
```

</details>

<details>
<summary>Versioning &amp; Development</summary>

**Branching Strategy:**
*   **`release` branch:** Contains the latest stable build, recommended for production use. Docker images tagged with specific versions (e.g., `v0.3.0`) are built from this branch.
*   **`master` branch:** Used for active development. It may contain experimental features, ongoing changes, or fixes not yet in a stable release. Use this branch if you want the absolute latest code, but be aware it might be less stable. The `latest` Docker tag often points to builds from this branch.

Note: This is a *development*-focused project at its core.

If you run into trouble, you may have to roll back to an earlier release tag, or build from source and/or troubleshoot and submit a PR.

Free and open source is a community effort, and there are only so many hours in a day. If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs/features you find during use.

  <a href="https://www.buymeacoffee.com/remsky" target="_blank">
    <img
      src="https://cdn.buymeacoffee.com/buttons/v2/default-violet.png"
      alt="Buy Me A Coffee"
      style="height: 30px !important;width: 110px !important;"
    >
  </a>

</details>

<details>
<summary>Linux GPU Permissions</summary>

Some Linux users may encounter GPU permission issues when running as non-root.
No guarantees, but here are some common solutions; consider your security requirements carefully.

### Option 1: Container Groups (likely the best option)
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"
```

### Option 2: Host System Groups
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"
```
Note: May require adding the host user to the relevant groups (`sudo usermod -aG docker,video $USER`) and a system restart.

### Option 3: Device Permissions (use with caution)
```yaml
services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm
```
⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and the container toolkit must be properly configured.

See the [NVIDIA Container Toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) for more detailed information.

</details>

## Model and License

<details open>
<summary>Model</summary>

This API uses the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with their work, and produced this wrapper for ease of use and personal projects.
</details>
<details>
<summary>License</summary>

This project is licensed under the Apache License 2.0 - see below for details:

- The Kokoro model weights are licensed under Apache 2.0 (see the [model page](https://huggingface.co/hexgrad/Kokoro-82M))
- The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
- The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
</details>