https://github.com/jamiepine/voicebox
The open-source voice synthesis studio powered by Qwen3-TTS.
- Host: GitHub
- URL: https://github.com/jamiepine/voicebox
- Owner: jamiepine
- License: MIT
- Created: 2026-01-25T12:27:03.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-01-31T15:44:31.000Z (3 months ago)
- Last Synced: 2026-01-31T17:13:19.724Z (3 months ago)
- Topics: ai, qwen3-tts, voice-ai, voice-clone, whisper
- Language: TypeScript
- Homepage: https://voicebox.sh
- Size: 89.1 MB
- Stars: 144
- Watchers: 4
- Forks: 16
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
README
Voicebox
The open-source voice synthesis studio.
Clone voices. Generate speech. Apply effects. Build voice-powered apps.
All running locally on your machine.
voicebox.sh • Docs • Download • Features • API • Troubleshooting
Watch the demo video on [voicebox.sh](https://voicebox.sh).
## What is Voicebox?
Voicebox is a **local-first voice cloning studio** — a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio or pick from 50+ preset voices, generate speech in 23 languages across 7 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.
- **Complete privacy** — models and voice data stay on your machine
- **7 TTS engines** — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
- **Cloning and preset voices** — zero-shot cloning from a reference sample, or curated preset voices via Kokoro (50 voices) and Qwen CustomVoice (9 voices)
- **23 languages** — from English to Arabic, Japanese, Hindi, Swahili, and more
- **Post-processing effects** — pitch shift, reverb, delay, chorus, compression, and filters
- **Expressive speech** — paralinguistic tags like `[laugh]`, `[sigh]`, `[gasp]` via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
- **Unlimited length** — auto-chunking with crossfade for scripts, articles, and chapters
- **Stories editor** — multi-track timeline for conversations, podcasts, and narratives
- **API-first** — REST API for integrating voice synthesis into your own projects
- **Native performance** — built with Tauri (Rust), not Electron
- **Runs everywhere** — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker
---
## Download
| Platform | Download |
| --------------------- | ------------------------------------------------------ |
| macOS (Apple Silicon) | [Download DMG](https://voicebox.sh/download/mac-arm) |
| macOS (Intel) | [Download DMG](https://voicebox.sh/download/mac-intel) |
| Windows | [Download MSI](https://voicebox.sh/download/windows) |
| Docker | `docker compose up` |
> **[View all binaries →](https://github.com/jamiepine/voicebox/releases/latest)**
> **Linux** — Pre-built binaries are not yet available. See [voicebox.sh/linux-install](https://voicebox.sh/linux-install) for build-from-source instructions.
> **Having trouble?** See the [Troubleshooting Guide](docs/content/docs/overview/troubleshooting.mdx) for common install, generation, model-download, and GPU issues.
---
## Features
### Multi-Engine Voice Cloning
Seven TTS engines with different strengths, switchable per-generation:
| Engine | Languages | Strengths |
| --------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **Qwen3-TTS** (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| **Qwen CustomVoice** | 10 | 9 curated preset voices with natural-language delivery control — no reference audio required |
| **LuxTTS** | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| **Chatterbox Multilingual** | 23 | Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
| **Chatterbox Turbo** | English | Fast 350M model with paralinguistic emotion/sound tags |
| **TADA** (1B / 3B) | 10 | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment |
| **Kokoro** | 8 | 50 curated preset voices, tiny 82M model, fast CPU inference |
### Emotions & Paralinguistic Tags
Only **Chatterbox Turbo** interprets paralinguistic tags like `[laugh]` and
`[sigh]`. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them
literally as text.
With **Chatterbox Turbo** selected, type `/` in the text input to open the tag
inserter and add expressive tags inline with speech:
`[laugh]` `[chuckle]` `[gasp]` `[cough]` `[sigh]` `[groan]` `[sniff]` `[shush]` `[clear throat]`
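Because only Chatterbox Turbo interprets these tags, a client might strip them before sending text to any other engine so they aren't read aloud. A minimal sketch, assuming a hypothetical helper (the function name and tag set below come from the list above, not from Voicebox's actual implementation):

```python
import re

# Tags recognized by Chatterbox Turbo (from the list above).
PARALINGUISTIC_TAGS = {
    "laugh", "chuckle", "gasp", "cough", "sigh",
    "groan", "sniff", "shush", "clear throat",
}

def strip_paralinguistic_tags(text: str) -> str:
    """Remove known [tag] markers so non-Turbo engines don't read them literally."""
    def drop_known(match: re.Match) -> str:
        # Keep unknown bracketed text (e.g. citations) untouched.
        return "" if match.group(1).lower() in PARALINGUISTIC_TAGS else match.group(0)
    stripped = re.sub(r"\[([^\]]+)\]", drop_known, text)
    # Collapse doubled spaces left behind after removal.
    return re.sub(r"  +", " ", stripped).strip()
```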
### Post-Processing Effects
8 audio effects powered by Spotify's `pedalboard` library. Apply after generation, preview in real time, build reusable presets.
| Effect | Description |
| ---------------- | --------------------------------------------- |
| Pitch Shift | Up or down by up to 12 semitones |
| Reverb | Configurable room size, damping, wet/dry mix |
| Delay | Echo with adjustable time, feedback, and mix |
| Chorus / Flanger | Modulated delay for metallic or lush textures |
| Compressor | Dynamic range compression |
| Gain | Volume adjustment (-40 to +40 dB) |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |
Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
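The dB range on the Gain effect maps to a linear amplitude factor of 10^(dB/20), so +6 dB roughly doubles amplitude and −40 dB scales it to 1%. A stdlib-only illustration of that math (not Voicebox's pedalboard-backed implementation):

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a decibel gain to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)

def apply_gain(samples: list[float], gain_db: float) -> list[float]:
    """Scale audio samples by a dB gain, e.g. +20 dB multiplies amplitude by 10."""
    factor = db_to_linear(gain_db)
    return [s * factor for s in samples]
```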
### Unlimited Generation Length
Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.
- Configurable auto-chunking limit (100–5,000 chars)
- Crossfade slider (0–200ms) for smooth transitions
- Max text length: 50,000 characters
- Smart splitting respects abbreviations, CJK punctuation, and `[tags]`
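The pipeline above — split at sentence boundaries, generate per chunk, crossfade the seams — can be sketched roughly as follows. The simple regex and linear fade are illustrative only; the real splitter also handles abbreviations, CJK punctuation, and `[tags]`:

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Linearly fade the tail of `a` into the head of `b` over `overlap` samples."""
    head, tail = a[:-overlap], a[-overlap:]
    mixed = [
        tail[i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + mixed + b[overlap:]
```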
### Generation Versions
Every generation supports multiple versions with provenance tracking:
- **Original** — clean TTS output, always preserved
- **Effects versions** — apply different effects chains from any source version
- **Takes** — regenerate with a new seed for variation
- **Source tracking** — each version records its lineage
- **Favorites** — star generations for quick access
### Async Generation Queue
Generation is non-blocking. Submit and immediately start typing the next one.
- Serial execution queue prevents GPU contention
- Real-time SSE status streaming
- Failed generations can be retried
- Stale generations from crashes auto-recover on startup
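A serial execution queue of this kind boils down to a single worker pulling jobs in submission order, so only one generation touches the GPU at a time. A stdlib sketch under that assumption (Voicebox's actual queue also persists job state and streams SSE updates):

```python
import queue
import threading

def run_serial_queue(jobs, worker):
    """Run worker(job) over jobs one at a time on a single background thread."""
    q: queue.Queue = queue.Queue()
    results = []

    def loop():
        while True:
            job = q.get()
            if job is None:  # sentinel: shut down the worker
                break
            results.append(worker(job))

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    for job in jobs:
        q.put(job)       # callers return immediately after enqueueing
    q.put(None)
    t.join()
    return results
```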
### Voice Profile Management
- Create profiles from audio files or record directly in-app
- Import/export profiles to share or back up
- Multi-sample support for higher quality cloning
- Per-profile default effects chains
- Organize with descriptions and language tags
### Stories Editor
Multi-voice timeline editor for conversations, podcasts, and narratives.
- Multi-track composition with drag-and-drop
- Inline audio trimming and splitting
- Auto-playback with synchronized playhead
- Version pinning per track clip
### Recording & Transcription
- In-app recording with waveform visualization
- System audio capture (macOS and Windows)
- Automatic transcription powered by Whisper (including Whisper Turbo)
- Export recordings in multiple formats
### Model Management
- Per-model unload to free GPU memory without deleting downloads
- Custom models directory via `VOICEBOX_MODELS_DIR`
- Model folder migration with progress tracking
- Download cancel/clear UI
### GPU Support
| Platform | Backend | Notes |
| ------------------------ | -------------- | ---------------------------------------------- |
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |
---
## API
Voicebox exposes a full REST API for integrating voice synthesis into your own apps.
```bash
# Generate speech
curl -X POST http://localhost:17493/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'
# List voice profiles
curl http://localhost:17493/profiles
# Create a profile
curl -X POST http://localhost:17493/profiles \
-H "Content-Type: application/json" \
-d '{"name": "My Voice", "language": "en"}'
```
**Use cases:** game dialogue, podcast production, accessibility tools, voice assistants, content automation.
Full API documentation available at `http://localhost:17493/docs`.
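A client wrapper can be as thin as building requests against the endpoints in the curl examples above. A hedged Python sketch using only the stdlib — the payload shape mirrors the examples, but response fields are not documented here and should be checked against `/docs`:

```python
import json
import urllib.request

BASE_URL = "http://localhost:17493"

def build_generate_request(text: str, profile_id: str,
                           language: str = "en") -> urllib.request.Request:
    """Build the POST /generate request shown in the curl example above."""
    body = json.dumps({"text": text, "profile_id": profile_id, "language": language})
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(text: str, profile_id: str, language: str = "en") -> bytes:
    """Send the request; requires a running Voicebox backend on port 17493."""
    req = build_generate_request(text, profile_id, language)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```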
---
## Tech Stack
| Layer | Technology |
| ------------- | ------------------------------------------------- |
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo (PyTorch or MLX) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |
---
## Roadmap
| Feature | Description |
| ----------------------- | ---------------------------------------------- |
| **Real-time Streaming** | Stream audio as it generates, word by word |
| **Voice Design** | Create new voices from text descriptions |
| **More Models** | XTTS, Bark, and other open-source voice models |
| **Plugin Architecture** | Extend with custom models and effects |
| **Mobile Companion** | Control Voicebox from your phone |
For the **full engineering status, open-issue triage, and prioritized work queue**, see [`docs/PROJECT_STATUS.md`](docs/PROJECT_STATUS.md) — a living document that tracks what's shipped, what's in-flight, candidate TTS engines under evaluation, and why we've accepted or backlogged specific integrations.
---
## Development
See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed setup and contribution guidelines.
### Quick Start
```bash
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup # creates Python venv, installs all deps
just dev # starts backend + desktop app
```
Install [just](https://github.com/casey/just): `brew install just` or `cargo install just`. Run `just --list` to see all commands.
**Prerequisites:** [Bun](https://bun.sh), [Rust](https://rustup.rs), [Python 3.11+](https://python.org), [Tauri Prerequisites](https://v2.tauri.app/start/prerequisites/), and [Xcode](https://developer.apple.com/xcode/) on macOS.
### Building Locally
```bash
just build # Build CPU server binary + Tauri app
just build-local # (Windows) Build CPU + CUDA server binaries + Tauri app
```
### Adding New Voice Models
The multi-engine architecture makes adding new TTS engines straightforward. A [step-by-step guide](docs/content/docs/developer/tts-engines.mdx) covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.
The guide is optimized for AI coding agents. An [agent skill](.agents/skills/add-tts-engine/SKILL.md) can pick up a model name and handle the entire integration autonomously — you just test the build locally.
### Project Structure
```
voicebox/
├── app/ # Shared React frontend
├── tauri/ # Desktop app (Tauri + Rust)
├── web/ # Web deployment
├── backend/ # Python FastAPI server
├── landing/ # Marketing website
└── scripts/ # Build & release scripts
```
---
## Contributing
Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
1. Fork the repo
2. Create a feature branch
3. Make your changes
4. Submit a PR
## Security
Found a security vulnerability? Please report it responsibly. See [SECURITY.md](SECURITY.md) for details.
---
## License
MIT License — see [LICENSE](LICENSE) for details.
---