{"id":43659278,"url":"https://github.com/rtk-ai/vox","last_synced_at":"2026-04-01T17:25:04.178Z","repository":{"id":336041707,"uuid":"1147951881","full_name":"rtk-ai/vox","owner":"rtk-ai","description":"A universal AI toolkit for high-performance Speech-to-Text (STT) and Text-to-Speech (TTS) processing, designed for low-latency and easy model integration.","archived":false,"fork":false,"pushed_at":"2026-03-25T10:32:37.000Z","size":368,"stargazers_count":51,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-26T13:40:59.859Z","etag":null,"topics":["ai","audio-processing","deep-learning","experimental","real-time","speech-to-text","text-to-speech"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rtk-ai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-02T12:00:17.000Z","updated_at":"2026-03-26T05:38:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rtk-ai/vox","commit_stats":null,"previous_names":["rtk-ai/vox"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/rtk-ai/vox","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtk-ai%2Fvox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtk-ai%2Fvox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtk-ai%2Fvox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtk-ai%2Fvox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rtk-ai","download_url":"https://codeload.github.com/rtk-ai/vox/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rtk-ai%2Fvox/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","audio-processing","deep-learning","experimental","real-time","speech-to-text","text-to-speech"],"created_at":"2026-02-04T21:29:25.209Z","updated_at":"2026-04-01T17:25:04.171Z","avatar_url":"https://github.com/rtk-ai.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# vox\n\nCross-platform TTS CLI with five backends and MCP server for AI assistants.\n\n```\n                              vox\n                               |\n       +--------+--------+----+----+--------+-----------+\n       |        |        |         |        |           |\n     say      qwen    qwen-native kokoro  voxtream    (TUI)\n   (macOS)  (MLX/Py)  (Rust/candle) (ONNX) (zero-shot) vox setup\n   native   Apple Si.  CPU/Metal  CPU/GPU  CUDA/MPS\n                        /CUDA\n                          |\n                        rodio (audio playback)\n```\n\n## Backends\n\n| Backend | Engine | Voice cloning | Latency (cold) | Latency (warm) | GPU | Platform |\n|---------|--------|:---:|---:|---:|:---:|----------|\n| `say` | macOS native | No | **3s** | **3s** | No | macOS |\n| `kokoro` | ONNX via Python | No | **10s** | **10s** | No | All |\n| `qwen-native` | Candle (Rust) | Yes | **11m33s** | ~3s | Metal/CUDA | All |\n| `voxtream` | PyTorch 0.5B | Yes | **68s** | ~8s | CUDA/MPS | All |\n| `qwen` | MLX-Audio (Python) | Yes | ~15s | ~2s | Apple Neural | macOS |\n\n### Benchmark — single sentence (~50 chars)\n\nAll times measured end-to-end (model loading + inference + audio playback). Cold = first CLI call.\n\n| Backend | M2 Pro (CPU) | RTX 4070 Ti SUPER | Voice cloning | Quality |\n|---------|-------------:|-------------------------:|:---:|---------|\n| **`say`** | **3s** | macOS only | No | System voices |\n| **`kokoro`** | **10s** | ~10s | No | Good |\n| **`voxtream`** (VoXtream2, 0.5B) | **68s** / 40s warm | **23s** / **19s** warm | Yes (zero-shot) | Excellent |\n| **`qwen-native`** (Qwen3-TTS, 0.6B) | **11m33s** / 3s warm | **48s** (CPU) | Yes | Excellent |\n| **`qwen`** (MLX-Audio) | ~15s / 2s warm | macOS only | Yes | Excellent |\n\n**With daemon** (`vox daemon start` — keeps model server warm):\n\n| Backend | M2 Pro (CPU) | Notes |\n|---------|-------------:|-------|\n| **`voxtream`** | **32s** | Inference CPU-bound (~25s). On CUDA: paper reports 74ms first-packet |\n| **`qwen-native`** | **~3s** | Model stays in RAM via global Mutex |\n\n\u003e All CUDA benchmarks measured on RTX 4070 Ti SUPER (16GB). qwen-native CUDA not yet supported (requires cudarc update for CUDA 13.2).\n\u003e For lowest latency: `say` (macOS) or `kokoro`. For best quality + cloning: `voxtream` on CUDA with daemon.\n\n## Install\n\n```bash\n# Quick install (macOS / Linux / WSL)\ncurl -fsSL https://raw.githubusercontent.com/rtk-ai/vox/main/install.sh | sh\n\n# From source\ncargo install --path .\n\n# With GPU acceleration\ncargo install --path . --features metal  # macOS Apple Silicon\ncargo install --path . --features cuda   # Linux NVIDIA\n```\n\n### VoXtream backend (optional)\n\n```bash\nbrew install espeak-ng                              # macOS (or apt install espeak-ng on Linux)\nuv venv ~/.local/venvs/voxtream --python 3.11\nuv pip install --python ~/.local/venvs/voxtream/bin/python \"voxtream\u003e=0.2\"\n# Copy config files\ngit clone --depth 1 https://github.com/herimor/voxtream.git /tmp/voxtream-repo\ncp /tmp/voxtream-repo/configs/*.json \"$(vox config show 2\u003e/dev/null | head -1 | grep -v backend || echo ~/.config/vox)/voxtream/\"\n```\n\n| Platform | Default backend | GPU |\n|----------|----------------|-----|\n| macOS | `say` | `--features metal` |\n| Linux / WSL | `kokoro` | `--features cuda` |\n\nLinux requires `sudo apt install libasound2-dev`.\n\n## Quick start\n\n```bash\nvox \"Hello, world.\"                     # Speak with default backend\nvox -b voxtream \"Zero-shot TTS.\"        # VoXtream2 (fastest neural)\nvox -b kokoro -l fr \"Bonjour\"           # Kokoro with language\necho \"Piped text\" | vox                 # Read from stdin\nvox --list-voices                       # List available voices\nvox setup                               # Interactive TUI configuration\n```\n\n## Interactive setup (TUI)\n\nFor humans — choose backend, voice, language, and style interactively:\n\n```bash\nvox setup\n```\n\n```\n┌ Backend ──┐┌ Voice ─────┐┌ Lang ┐┌ Style ────┐┌ Config ──────┐\n│\u003e say      ││\u003e Samantha  ││\u003e en  ││\u003e (default)││ Backend: say │\n│  kokoro   ││  Thomas    ││  fr  ││  calm     ││ Voice: ...   │\n│  qwen-nat ││  Amelie    ││  es  ││  warm     ││ Lang:  en    │\n│  voxtream ││           ││  de  ││  cheerful ││              │\n│  qwen     ││           ││  ja  ││          ││ [T]est [S]ave│\n└───────────┘└────────────┘└──────┘└──────────┘└──────────────┘\n```\n\nNavigate with arrow keys / hjkl, Tab to switch panel, T to test, S to save, Q to quit.\n\nAI agents use CLI flags instead: `vox -b voxtream -l fr \"text\"`\n\n## AI assistant integration\n\nOne command configures **14 AI tools** (Claude Code, Cursor, VS Code, Zed, Codex, Gemini, Amazon Q, and more):\n\n```bash\nvox init                # MCP server (default) — all AI tools\nvox init -m cli         # CLAUDE.md + Stop hook (recommended)\nvox init -m skill       # /speak slash command\nvox init -m all         # all of the above\n```\n\nRunning `vox init` again is safe — it skips files that are already configured.\n\n### CLI mode vs MCP mode\n\n**CLI mode is recommended** for AI coding agents. Benchmarks show CLI tools are [10-32x cheaper and 100% reliable vs 72% for MCP](https://mariozechner.at/posts/2025-08-15-mcp-vs-cli/) due to MCP's TCP timeout overhead and JSON schema cost per call.\n\n| Mode | Reliability | Token cost | Best for |\n|------|------------|------------|----------|\n| **CLI** (`vox init -m cli`) | 100% | Low (Bash call) | Claude Code, Codex, terminal agents |\n| **MCP** (`vox init`) | ~72% | Higher (JSON schema) | Cursor, VS Code, GUI-based tools |\n\n## Voice cloning\n\n```bash\nvox clone add patrick --audio ~/voice.wav --text \"Transcription\"\nvox clone record myvoice --duration 10\nvox -v patrick \"This speaks with your voice.\"\nvox clone list\nvox clone remove patrick\n```\n\nWorks with `qwen`, `qwen-native`, and `voxtream` backends. VoXtream2 uses zero-shot cloning (3-10s audio prompt, no training needed).\n\n## Preferences\n\n```bash\nvox config show\nvox config set backend voxtream\nvox config set lang fr\nvox config set voice Chelsie\nvox config set gender feminine\nvox config set style warm\nvox config reset\n```\n\n## Sound packs\n\n```bash\nvox pack install peon              # Install a pack\nvox pack set peon                  # Activate it\nvox pack play greeting             # Play a sound\nvox pack list                      # List available packs\n```\n\n## Voice conversation (macOS)\n\n```bash\nexport ANTHROPIC_API_KEY=sk-...\nvox chat -l fr                     # Talk with Claude\nvox hear -l fr                     # Speech-to-text only\n```\n\n## Data\n\nAll state is stored locally — no data sent to external servers (except `vox chat` which uses Claude API).\n\n```\n~/.config/vox/           # or ~/Library/Application Support/vox/ on macOS\n  vox.db                 # SQLite: preferences, voice clones, usage logs\n  clones/                # Audio files for voice clones\n  packs/                 # Installed sound packs\n  voxtream/              # VoXtream2 config files\n```\n\n| Env var | Description |\n|---------|-------------|\n| `VOX_CONFIG_DIR` | Override config directory |\n| `VOX_DB_PATH` | Override database path |\n\n## Documentation\n\n| Document | Description |\n|----------|-------------|\n| [Architecture](docs/ARCHITECTURE.md) | Technical architecture, backends, DB schema, MCP protocol, security |\n| [Features](docs/FEATURES.md) | All commands and features documented |\n| [Guide](docs/GUIDE.md) | Installation, quick start, troubleshooting |\n\n## License\n\n[Apache-2.0](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frtk-ai%2Fvox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frtk-ai%2Fvox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frtk-ai%2Fvox/lists"}