{"id":49006454,"url":"https://github.com/uttera/uttera-stt-hotcold","last_synced_at":"2026-04-18T20:12:59.790Z","repository":{"id":340896161,"uuid":"1168056801","full_name":"uttera/uttera-stt-hotcold","owner":"uttera","description":"High-performance Whisper STT API server with a hybrid \"Hot/Cold\" worker architecture.","archived":false,"fork":false,"pushed_at":"2026-04-15T19:51:33.000Z","size":1018,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-04-15T20:33:52.456Z","etag":null,"topics":["fastapi","faster-whisper","local-ai","open-webui","openai-whisper","openclaw","self-hosted","speech-to-text","stt","uttera","whisper"],"latest_commit_sha":null,"homepage":"https://uttera.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uttera.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-27T01:00:58.000Z","updated_at":"2026-04-15T19:51:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/uttera/uttera-stt-hotcold","commit_stats":null,"previous_names":["fakehec/whisper-stt-local-server"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/uttera/uttera-stt-hotcold","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uttera%2Futtera-stt-hotcold","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uttera%2Futtera-stt
-hotcold/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uttera%2Futtera-stt-hotcold/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uttera%2Futtera-stt-hotcold/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uttera","download_url":"https://codeload.github.com/uttera/uttera-stt-hotcold/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uttera%2Futtera-stt-hotcold/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31982836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T17:30:12.329Z","status":"ssl_error","status_checked_at":"2026-04-18T17:29:59.069Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","faster-whisper","local-ai","open-webui","openai-whisper","openclaw","self-hosted","speech-to-text","stt","uttera","whisper"],"created_at":"2026-04-18T20:12:58.425Z","updated_at":"2026-04-18T20:12:59.777Z","avatar_url":"https://github.com/uttera.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# uttera-stt-hotcold\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://uttera.ai\"\u003e\n    \u003cimg src=\"docs/img/banner.png\" alt=\"uttera.ai — The voice layer for your AI\" width=\"800\"\u003e\n  
\u003c/a\u003e\n\u003c/p\u003e\n\nHigh-performance Whisper STT API server with a hybrid \"Hot/Cold\" worker architecture.\n\n**Ideal for locally running installations of agents like OpenClaw or Open-WebUI, where the media should not leave the private local domain.**\n\n\u003e **Created and maintained by [Hugo L. Espuny](https://github.com/fakehec).**\n\u003e Part of the [Uttera](https://uttera.ai) voice stack.\n\u003e Licensed under the [Apache License 2.0](LICENSE).\n\u003e See [NOTICE](NOTICE) for third-party attributions.\n\n## 📢 Project history: renamed and transferred\n\nThis repository has been **renamed** from `whisper-stt-local-server` to\n**`uttera-stt-hotcold`** and **transferred** from its original creator's\npersonal page ([@fakehec](https://github.com/fakehec)) to the\n[Uttera GitHub organization](https://github.com/uttera).\n\nGitHub redirects old URLs automatically, so any existing clones, forks,\nbookmarks, and links keep working. If you still have\n`fakehec/whisper-stt-local-server` as your `origin`, consider updating:\n\n```bash\ngit remote set-url origin https://github.com/uttera/uttera-stt-hotcold.git\n```\n\n## Positioning\n\n| Use case | This repo | Sibling repo |\n|---|---|---|\n| Home-lab, personal, small/mid GPU (8–16 GB) | ✅ [uttera-stt-hotcold](https://github.com/uttera/uttera-stt-hotcold) | — |\n| Cloud, multi-tenant, large GPU (≥24 GB) | — | [uttera-stt-vllm](https://github.com/uttera/uttera-stt-vllm) |\n\n**Choose `uttera-stt-hotcold` when**:\n- You have consumer GPUs (RTX 4070, 4080) and transcribe occasionally.\n- Personal or single-user deployment.\n- You want to share the GPU with other workloads.\n- **You have 8–24 GB of VRAM.** vLLM does not fit comfortably in this\n  range: at 8–16 GB the KV cache is too small for continuous batching\n  to beat hotcold; at 16–24 GB vLLM works but reserves 11–22 GB\n  permanently, wasting the co-location flexibility that is hotcold's\n  reason to exist on mid-sized GPUs.\n\n**Choose 
`uttera-stt-vllm` when**:\n- You transcribe hours of audio per day across many concurrent streams.\n- You want continuous batching to maximise GPU utilisation.\n- You have large-VRAM GPUs dedicated to inference.\n- **You have 32 GB+ of VRAM** (vLLM reserves ~22–29 GB at startup\n  depending on `gpu_memory_utilization`; below 32 GB total you either\n  run out of headroom or lose the batching advantage that justifies\n  the reservation).\n\nSee [`uttera-benchmarks`](https://github.com/uttera/uttera-benchmarks)\nfor reproducible head-to-head numbers across four load profiles\n(latency, burst up to N=1024, sustained) and two corpora (LibriSpeech\ntest-clean and an internal Spanish WAV corpus).\n\n## 🚀 Key Features\n\n- **Hybrid Concurrency:**\n  - **Hot Worker:** Keeps a Whisper model resident in VRAM for sub-second (~0.2s) inference.\n  - **Cold Workers:** Spawns on-demand subprocesses when the GPU is busy, ensuring long audio files don't block quick voice commands.\n- **GPU Accelerated:** Native support for NVIDIA CUDA, ensuring ultra-fast inference.\n- **OpenAI Compatible:** Implements the standard OpenAI STT API (`/v1/audio/transcriptions`, `/v1/audio/translations`). Includes `GET /v1/models` for client autodiscovery.\n- **Translation (v2.1.0+):** `POST /v1/audio/translations` supports arbitrary target languages via a Whisper-transcribe → LibreTranslate pipeline when `LIBRETRANSLATE_URL` is set (request field `to_language`, default `\"en\"`). Without `LIBRETRANSLATE_URL`, falls back to Whisper's native translate task (English only; works poorly on models like `turbo` that were not trained for it).\n- **Multilingual:** Supports all languages covered by Whisper (99 languages). Auto-detects language if not specified.\n- **Health Endpoint:** `GET /health` exposes server version, model name, and hot worker status for proxies and Docker healthchecks.\n- **Privacy First:** 100% local execution. 
Your audio never leaves your infrastructure.\n\n## 🧠 Available Models\n\n| Model | Params | VRAM (fp16) | Speed | Languages | Best for |\n| :--- | :--- | :--- | :--- | :--- | :--- |\n| `tiny` | 39M | ~1 GB | Fastest | 99 | Testing, low-resource |\n| `tiny.en` | 39M | ~1 GB | Fastest | English only | English-only, low-resource |\n| `base` | 74M | ~1 GB | Fast | 99 | Light workloads |\n| `base.en` | 74M | ~1 GB | Fast | English only | Light English-only |\n| `small` | 244M | ~2 GB | Moderate | 99 | Good accuracy/speed balance |\n| `small.en` | 244M | ~2 GB | Moderate | English only | English-only balanced |\n| `medium` | 769M | ~5 GB | Slow | 99 | **Default.** High accuracy |\n| `medium.en` | 769M | ~5 GB | Slow | English only | English-only high accuracy |\n| `large` | 1550M | ~10 GB | Slowest | 99 | Maximum accuracy (v1) |\n| `large-v2` | 1550M | ~10 GB | Slowest | 99 | Improved large |\n| `large-v3` | 1550M | ~10 GB | Slowest | 99 | Best accuracy overall |\n| `turbo` | 809M | ~6 GB | Fast | 99 | **Recommended.** large-v3 distilled, best quality/speed |\n\nSet the model via `WHISPER_MODEL` in `.env`. To download all models at once for offline use:\n\n```bash\nsource venv/bin/activate\npython3 -c \"\nimport whisper\nfor m in ['tiny','tiny.en','base','base.en','small','small.en',\n          'medium','medium.en','large','large-v2','large-v3','turbo']:\n    print(f'Downloading {m}...')\n    whisper.load_model(m, download_root='assets/models/whisper')\n    print(f'  Done: {m}')\n\"\n```\n\n## 📦 Installation \u0026 Setup\n\n### 1. Prerequisites (Debian/Ubuntu)\nInstall the following system dependencies first:\n```bash\nsudo apt update \u0026\u0026 sudo apt install -y ffmpeg python3 python3-venv\n```\n\n\u003e **Python version:** `setup.sh` uses the system default `python3` (3.12+ recommended). torch is pinned to `\u003e=2.9.0,\u003c2.10.0` to avoid CUDA 13 NPP dependency issues with newer versions.\n\n### 2. 
Unified Installation\n```bash\ngit clone https://github.com/uttera/uttera-stt-hotcold.git\ncd uttera-stt-hotcold\nchmod +x setup.sh\n./setup.sh\n```\n\n`setup.sh` creates the virtual environment, installs all dependencies, and downloads the configured Whisper model into `assets/models/`. It is safe to re-run.\n\n### 3. User Permissions \u0026 Hardware Acceleration\nTo run the server without `sudo` privileges and enable GPU acceleration, the user must belong to the `video` and `render` groups:\n```bash\nsudo usermod -aG video $USER\nsudo usermod -aG render $USER\n```\n*Note: Restart your session for changes to take effect.*\n\n### 4. Network Permissions\nThe server listens on port `5000` by default. Ensure the user has permissions to open sockets on this port (standard for ports \u003e1024).\n\n## 📡 API Endpoints\n\n| Method | Path | Description |\n| :--- | :--- | :--- |\n| `GET` | `/health` | Server liveness, version, and hot worker status. |\n| `GET` | `/v1/models` | OpenAI-compatible model list (`whisper-1`). |\n| `POST` | `/v1/audio/transcriptions` | Transcribe audio to text (Hot or Cold Lane). |\n| `POST` | `/v1/audio/translations` | Transcribe + translate to `to_language` (default `en`). With `LIBRETRANSLATE_URL`: any target language. Without: English only (Whisper native). |\n\n## 🛠 Execution\n\nThe server uses direct **Uvicorn** execution for maximum ASGI performance.\n\n### Manual Execution (Console)\n```bash\nsource venv/bin/activate\n\n# Localhost only\nuvicorn main_stt:app --host 127.0.0.1 --port 5000\n\n# Expose to local network\nuvicorn main_stt:app --host 0.0.0.0 --port 5000\n```\n\n### ⚙️ Environment Variables \u0026 .env\n\nCopy `.env.example` to `.env` and adjust as needed. All variables are optional.\n\n| Variable | Default | Description |\n| :--- | :--- | :--- |\n| `WHISPER_MODEL` | `medium` | Model to load: `tiny`, `base`, `small`, `medium`, `large`. |\n| `WHISPER_FP16` | `1` | fp16+LayerNorm-fp32 (halves VRAM). Set to `0` for fp32. 
|\n| `COLD_POOL_SIZE` | `10` | Max concurrent cold workers (safety cap). |\n| `COLD_WORKER_IDLE_TIMEOUT` | `60` | Seconds before idle cold worker exits. |\n| `COLD_WORKER_IDLE_STAGGER` | `10` | Stagger per worker slot to avoid mass die-off. |\n| `MIN_COLD_VRAM_GB` | `4.0` | Min free VRAM to spawn a cold worker (0=disable). |\n| `COLD_LANE_TIMEOUT_SECONDS` | `300` | Max seconds to wait for a Cold Lane subprocess before HTTP 500. |\n| `ROUTING_DRAIN_CAP_SECONDS` | `120` | Queue drain time considered 100% load. |\n| `REDIS_URL` | *(empty)* | Redis URL for node self-registration (opt-in). |\n| `NODE_HOST` | `localhost` | Host advertised to Redis for Gatekeeper routing. |\n| `NODE_PORT` | `5000` | Port advertised to Redis for Gatekeeper routing. |\n| `DEBUG` | `false` | Set to `true` to enable worker routing and subprocess traces. |\n| `VENV_PYTHON` | *(auto-detected)* | Path to venv Python. Auto-detected from `venv/bin/python`. |\n\n*See `.env.example` for the full list of variables and their defaults.*\n\n### User Service (systemd --user)\n1. Create directory if it doesn't exist: `mkdir -p ~/.config/systemd/user`\n2. Create: `~/.config/systemd/user/uttera-stt.service`\n3. Configuration (environment variables are loaded from your `.env` file):\n\n```ini\n[Unit]\nDescription=Uttera STT Hot/Cold Server\nAfter=network.target\n\n[Service]\nType=simple\nWorkingDirectory=%h/uttera-stt-hotcold\nExecStart=%h/uttera-stt-hotcold/venv/bin/uvicorn main_stt:app --host 127.0.0.1 --port 5000\nRestart=always\nRestartSec=5\n\n[Install]\nWantedBy=default.target\n```\n\n4. Enable and start:\n```bash\nsystemctl --user daemon-reload\nsystemctl --user enable --now uttera-stt.service\n```\n\n## 🔧 Troubleshooting\n\n### Cold Lane fails with `No such file or directory`\nIf concurrent requests return HTTP 500 with a path error, the Cold Lane cannot find the Whisper CLI or Python binary. The server auto-detects `venv/bin/python` and `venv/bin/whisper` relative to the project directory. 
If running from a non-standard location, set the paths explicitly in `.env`:\n```env\nVENV_PYTHON=/absolute/path/to/venv/bin/python\nWHISPER_SCRIPT=/absolute/path/to/venv/bin/whisper\n```\n\n### `PermissionError` on startup\nThe server defaults to `assets/models/whisper` inside the project directory, so no root access is required. If you see a permission error on a path like `/opt/...`, an old `XDG_CACHE_HOME` env var is being inherited from the shell. Either unset it or override it in `.env`:\n```env\nXDG_CACHE_HOME=assets/models\n```\n\n### Cold Lane subprocess times out\nIf transcription of long audio hangs and eventually returns HTTP 500, increase the timeout in `.env`:\n```env\nCOLD_LANE_TIMEOUT_SECONDS=600\n```\n\n## 🐳 Docker\n\n### Host Prerequisites (one-time setup)\n\nBefore running `docker compose up` for the first time, the host machine requires the one-time configuration steps below to enable GPU passthrough via the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) CDI mode.\n\n\u003e These steps are required because Docker's default legacy GPU mode relies on BPF cgroup device filters, which are not available in cgroup v2 environments (Ubuntu 22.04+). CDI solves this cleanly.\n\n**1. Add the NVIDIA package repository:**\n```bash\ncurl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \\\n  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg\n\ncurl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \\\n  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \\\n  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list\n```\n\n**2. Install the toolkit:**\n```bash\nsudo apt update \u0026\u0026 sudo apt install -y nvidia-container-toolkit\n```\n\n**3. 
Generate the CDI spec** (exposes the GPU to containers via a stable device descriptor):\n```bash\nsudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml\n```\n\n**4. Enable CDI in the Docker daemon:**\n```bash\nsudo tee /etc/docker/daemon.json \u003c\u003c'EOF'\n{\n  \"features\": {\n    \"cdi\": true\n  }\n}\nEOF\nsudo systemctl restart docker\n```\n\n**5. Verify it works:**\n```bash\ndocker run --rm --device nvidia.com/gpu=all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi\n```\n\n\u003e **Note:** Step 3 must be re-run if the NVIDIA driver is updated (`sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml`).\n\n### Running with Docker Compose\n\n```bash\n# Build and start\ndocker compose up -d\n\n# Check server is ready\ncurl http://localhost:5000/health\n\n# View logs\ndocker compose logs -f\n\n# Stop\ndocker compose down\n```\n\nThe model is persisted in `assets/models/whisper/` (host volume), so it only downloads once.\n\n## 🔒 Security \u0026 Network Note\nBy default, the server binds to **`127.0.0.1`** on port **`5000`**.\n- To allow external network access, change `--host` to `0.0.0.0`.\n- **WARNING**: This API **does not have authentication**. Exposing it to the network via `0.0.0.0` represents a security risk. Ensure the server is protected by a firewall or operating within a secure VPN/local network.\n\n## 📊 Performance (NVIDIA RTX 5090, fp16, medium model)\n\n| Task | Latency |\n| :--- | :--- |\n| Short command (2s audio, Hot Lane) | **~0.2s** |\n| Long audio (30s, Hot Lane) | **~0.7s** |\n| 160 concurrent (Hot + Cold Pool) | Target ~21s total, 0 failures |\n\n## 🛡 License\n\n**Server source code**: [Apache License 2.0](LICENSE). Commercial use permitted.\n\n**Whisper model weights** (OpenAI): released under the MIT License —\ncommercial use permitted, no restrictions. See [NOTICE](NOTICE) for full\nattributions.\n\nCreated and maintained by [Hugo L. 
Espuny](https://github.com/fakehec),\nwith contributions acknowledged in [AUTHORS.md](AUTHORS.md).\n\n## ☕ Community\n\nIf you want to follow the project or get involved:\n\n- ⭐ Star this repo to help discoverability.\n- 🐛 Report issues via the [issue tracker](../../issues).\n- 💬 Join the conversation in [Discussions](../../discussions).\n- 📰 Technical posts at [blog.uttera.ai](https://blog.uttera.ai).\n- 🌐 Uttera Cloud: [https://uttera.ai](https://uttera.ai) (EU-hosted,\n  solar-powered, subscription flat-rate).\n\n---\n\n*Uttera /ˈʌt.ər.ə/: from the English verb \"to utter\" (to speak aloud, to\npronounce, to give audible expression to). Formally, the name is a backronym\nof **U**niversal **T**ext **T**ransformer **E**ngine for **R**ealtime **A**udio,\nreflecting the project's origin as an STT/TTS server and its underlying\nTransformer architecture.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futtera%2Futtera-stt-hotcold","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Futtera%2Futtera-stt-hotcold","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futtera%2Futtera-stt-hotcold/lists"}