https://github.com/debpalash/omnivoice-studio

The open-source ElevenLabs alternative: a desktop app for local voice cloning, voice design and generation, dubbing, and dictation

Host: GitHub
URL: https://github.com/debpalash/omnivoice-studio
Owner: debpalash
License: other
Created: 2026-04-09T21:40:26.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-05-27T05:09:47.000Z (about 1 month ago)
Last Synced: 2026-05-27T07:09:26.299Z (about 1 month ago)
Topics: asr, dubbing, dubbing-ai, elevenlabs, local-ai, omnivoice, omnivoice-studio, self-hosted, speech-recognition, speech-to-text, text-to-speech, transcription, tts, video-editing, voice-ai, voice-cloning, voice-generation
Language: Python
Homepage: https://palash.dev/omnivoice
Size: 15.6 MB
Stars: 4,770
Watchers: 29
Forks: 723
Open Issues: 26
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Roadmap: docs/ROADMAP.md

README

          


  

# OmniVoice Studio

The open-source ElevenLabs alternative.

Real-time dictation, zero-shot voice cloning, and cinematic video dubbing, all on your desktop.
Open-source, no API keys, fully local. 646 languages.


  


    

    

    

    

    

  


  


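[Quickstart](#quickstart) · [Features](#features) · [Why OmniVoice Studio?](#why-omnivoice-studio) · [TTS Engines](#tts-engines) · [ASR Engines](#asr-engines) · [Contributing](#contributing) · [Discord](https://discord.gg/bzQavDfVV9) · 简体中文 (Simplified Chinese)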

  


  


    

    

    

    

  








  



> [!WARNING]
> **OmniVoice Studio is in active beta.** Things may break between releases. For the latest features and fixes, clone the repo and run from source rather than using pre-built installers. Bug reports and PRs are very welcome: [open an issue](https://github.com/debpalash/OmniVoice-Studio/issues) or [join Discord](https://discord.gg/bzQavDfVV9).



  


  

  


Get setup help · Share your dubs · Vote on the roadmap · Early access to new engines

  







## Features

  

    
- 🎙️ **Voice Cloning**: 3-second clip → mirror any voice. 646 languages, zero-shot.
- 🎨 **Voice Design**: Gender, age, accent, pitch, speed, emotion, dialect. Dial it in.
- 🎬 **Video Dubbing**: YouTube URL or file → transcribe → translate → re-voice → MP4.
- ⌨️ **Dictation Widget**: `⌘+⇧+Space` from any app. Transcribes, auto-pastes, disappears.
- 🔊 **Vocal Isolation**: Demucs-powered. Splits speech from music, keeps the background.
- 👥 **Speaker Diarization**: Pyannote + WhisperX. Auto-identifies who said what.
- 📦 **Batch Queue**: Drop 50 videos, walk away. Progress bars per job.
- 🤖 **MCP Server**: Use OmniVoice from Claude, Cursor, or any MCP client.
- 🛡️ **AI Watermark**: AudioSeal (Meta). Invisible, survives compression.
- 🔐 **100% Local**: No keys, no cloud, no accounts. Your machine only.
- ⚡ **GPU Auto-Detect**: CUDA · MPS · ROCm · CPU. ≤8 GB? Auto-offloads.
- 🧩 **Extensible**: Subclass `TTSBackend`, add any engine in ~50 lines.

  

---

## Quickstart

Per-OS install guides — pick yours and follow it end-to-end:

- **macOS** — [docs/install/macos.md](docs/install/macos.md)

- **Windows** — [docs/install/windows.md](docs/install/windows.md)

- **Linux** — [docs/install/linux.md](docs/install/linux.md)

- **Docker** — [docs/install/docker.md](docs/install/docker.md)

Stuck? See [docs/install/troubleshooting.md](docs/install/troubleshooting.md) for the top 10 install errors. The in-app error UI deep-links to those entries when something breaks at runtime.

For Hugging Face token setup, see [docs/setup/huggingface-token.md](docs/setup/huggingface-token.md). For diarization-specific gating, see [docs/features/diarization.md](docs/features/diarization.md).

## Screenshots

  

    

      

      
- **Voice Clone**: Drop a 3-second clip → mirror any voice. 646 languages, zero-shot.
- **Voice Design**: Build new voices from scratch (gender, age, accent, pitch, style).
- **Video Dubbing**: Upload or paste a YouTube URL. Transcribe, translate, re-voice, export.
- **Voice Gallery**: Search YouTube, browse categories, download clips, build your library.
- **Settings → Models**: 15 models. One-click install. Auto-detects your platform (CUDA / MPS / CPU).
- **Projects**: Dub projects, voice profiles, generation history, and exports, all searchable.
- **Settings → Logs**: Live backend, frontend, and Tauri runtime logs. Filter, refresh, clear.

    

  

---

## Why OmniVoice Studio?

ElevenLabs charges **$5–$330/mo** and processes your audio on their servers. OmniVoice Studio runs **on your hardware, with no usage limits.**

| | **ElevenLabs** | **OmniVoice Studio** |
|---|---|---|
| **Pricing** | $5–$330/mo, per-character billing | Free for personal use · [Commercial license](#license) for business |
| **Voice Cloning** | ✅ 3s clip | ✅ 3s clip, zero-shot |
| **Voice Design** | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| **Languages** | 32 | **646** |
| **Video Dubbing** | ✅ Cloud-only | ✅ Fully local |
| **Data Privacy** | Audio sent to cloud | **Nothing leaves your machine** |
| **API Keys** | Required | Not needed |
| **GPU Support** | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| **Desktop App** | ❌ | ✅ macOS · Windows · Linux |
| **Customizable** | ❌ Closed | ✅ Fork it, extend it, ship it |

OmniVoice Studio gives you professional-grade AI tools without the subscription or the cloud.



  


  Convinced? Come build with us.


  

  





---

## System Requirements

| | **Minimum** | **Recommended** |
|---|---|---|
| **OS** | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| **RAM** | 8 GB | 16 GB+ |
| **VRAM (GPU)** | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| **Disk** | 10 GB free (models + cache) | 20 GB+ SSD |
| **Python** | 3.10+ (managed by `uv`) | 3.11–3.12 |
| **GPU** | Optional (CPU works) | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |

> [!TIP]
> On GPUs with **≤8 GB VRAM**, OmniVoice automatically offloads TTS to CPU during transcription, with no config needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).

### TTS Engines

OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in **Settings → TTS Engine** or via the `OMNIVOICE_TTS_BACKEND` env var.

| Engine | Languages | Clone | Instruct | Linux | macOS ARM | Windows | License |
|--------|:---------:|:-----:|:--------:|:-----:|:---------:|:-------:|:-------:|
| **OmniVoice** (default) | 600+ | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Built-in |
| **CosyVoice 3** | 9 + 18 dialects | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| **MLX-Audio** (Kokoro, Qwen3-TTS, CSM, Dia, …) | Multi | Varies | Varies | ❌ | ✅ Native | ❌ | Varies |
| **VoxCPM2** | 30 | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| **MOSS-TTS-Nano** | 20 | ✅ | ❌ | ✅ CUDA/CPU | ✅ CPU | ✅ CUDA/CPU | Apache-2.0 |
| **KittenTTS** | English | ❌ | ❌ | ✅ CPU | ✅ CPU | ✅ CPU | MIT |

> **CUDA** = GPU-accelerated · **MPS** = Apple Silicon Metal · **CPU** = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only.

### ASR Engines

OmniVoice ships a multi-engine ASR (speech-to-text) backend that powers dictation, video dubbing, and subtitle generation — all fully local. **WhisperX** is the cross-platform default; the rest are opt-in and auto-detected. Switch in **Settings → ASR Engine** or via the `OMNIVOICE_ASR_BACKEND` env var.
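
As a rough illustration of the environment-variable route, the sketch below resolves `OMNIVOICE_ASR_BACKEND` against the backend identifiers listed in this section's table and falls back to `whisperx` when the variable is unset. The function name `resolve_asr_backend`, the case-insensitive matching, and the choice to raise on an unknown value are assumptions made for the example; the application's own lookup may behave differently.

```python
import os
from typing import Mapping, Optional

# Identifiers copied from the OMNIVOICE_ASR_BACKEND column of the ASR engine table.
ASR_BACKENDS = (
    "whisperx", "faster-whisper", "mlx-whisper", "pytorch-whisper",
    "nemo-parakeet", "moonshine", "funasr",
)
DEFAULT_ASR_BACKEND = "whisperx"  # documented cross-platform default


def resolve_asr_backend(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend named by OMNIVOICE_ASR_BACKEND, or the default."""
    env = os.environ if environ is None else environ
    requested = env.get("OMNIVOICE_ASR_BACKEND", "").strip().lower()
    if not requested:
        return DEFAULT_ASR_BACKEND
    if requested not in ASR_BACKENDS:
        # Assumption: fail loudly on typos instead of silently falling back.
        raise ValueError(
            f"unknown ASR backend {requested!r}; expected one of {ASR_BACKENDS}"
        )
    return requested


if __name__ == "__main__":
    print(resolve_asr_backend())
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.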

| Engine | `OMNIVOICE_ASR_BACKEND` | Languages | Best for |
|--------|-------------------------|:---------:|----------|
| **WhisperX** (default) | `whisperx` | ~100 | Dubbing & subtitles: word-level timing via wav2vec2 forced alignment |
| **Faster-Whisper** | `faster-whisper` | ~100 | Fast transcription on Linux / macOS / Windows (CTranslate2) |
| **MLX Whisper** | `mlx-whisper` | ~100 | Native Apple Silicon speed (Apple MLX / Metal) |
| **PyTorch Whisper** | `pytorch-whisper` | ~100 | CUDA / CPU fallback via 🤗 Transformers |
| **Parakeet TDT** | `nemo-parakeet` | English + 25 EU | SOTA English accuracy, auto language detection (NVIDIA NeMo, GPU only) |
| **Moonshine** | `moonshine` | English | Edge / low-latency, ONNX |
| **FunASR** | `funasr` | 50+ | All-in-one multilingual: built-in VAD + inline speaker diarization (SenseVoice) |

> Whisper-family engines cover ~100 languages; **FunASR / SenseVoice** adds an all-in-one multilingual path with built-in voice-activity detection and inline speaker diarization. Every engine runs on-device — no API keys, no cloud.

---

## Architecture

```
┌─────────────────────────────────────────────────┐
│                 Frontend (React)                │
│  DubTab · VoicePreview · BatchQueue · Gallery   │
├─────────────────────────────────────────────────┤
│                Backend (FastAPI)                │
│  97 API endpoints · SSE streaming · SQLite      │
├──────────┬──────────┬──────────┬────────────────┤
│ WhisperX │  Demucs  │OmniVoice │   Pyannote     │
│   ASR    │  Source  │   TTS    │  Diarization   │
│          │  Sep.    │          │                │
└──────────┴──────────┴──────────┴────────────────┘
        CUDA / MPS / ROCm / CPU (auto-detected)
```
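
The auto-detection step at the bottom of the diagram could look roughly like this, assuming a PyTorch-based stack. This is a sketch, not the project's code: `detect_device` is a hypothetical name, and the priority order (CUDA or ROCm first, then MPS, then CPU) is an assumption. ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API, so the sketch tells them apart via `torch.version.hip`.

```python
def detect_device() -> str:
    """Return one of "cuda", "rocm", "mps", or "cpu" for the current machine."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to accelerate with.
        return "cpu"

    if torch.cuda.is_available():
        # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"

    mps = getattr(torch.backends, "mps", None)  # absent on very old PyTorch releases
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

When passing the result to PyTorch, note that a ROCm build still expects the device string `"cuda"`; the separate `"rocm"` label here is only for reporting.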

---

## Roadmap

### ✅ Shipped

| Category | Features |
|----------|----------|
| **Dubbing** | Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS |
| **Voice** | Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags |
| **Audio** | Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export |
| **Multi-Lang** | Multi-language batch picker, batch dubbing queue with sequential GPU execution |
| **Diarization** | Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment |
| **Infra** | Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading |
| **AI Provenance** | AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API |
| **UX** | Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system |
| **Real-time Events** | WebSocket event bus: instant sidebar refresh on data mutations, exponential backoff reconnect |
| **State Management** | Zustand store migration: `uiSlice`, `pillSlice`, `dubSlice`, `generateSlice`, `prefsSlice`, `glossarySlice` |
| **Desktop** | Cross-platform Tauri installers (macOS DMG, Windows MSI, Linux deb/AppImage), auto-update infrastructure |
| **Windows Hardening** | Cross-platform log paths, Triton workaround, HF symlink bypass, 300s health check timeout |
| **Dictation** | Global system-wide hotkey (`⌘+⇧+Space`), frameless floating widget, streaming ASR via WebSocket, auto-paste |
| **Batch Pipeline** | Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking |
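
The Real-time Events row mentions reconnecting with exponential backoff. A minimal sketch of such a delay schedule follows; the base delay, multiplier, cap, and optional jitter are illustrative values rather than the ones the app uses, and `backoff_delays` is a hypothetical helper.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 8,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one wait time (in seconds) per reconnect attempt.

    The delay grows geometrically from `base` and is clamped to `cap`.
    `jitter` adds up to that fraction of the delay at random, which spreads
    out clients that all lost their connection at the same moment.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        yield delay + rng.uniform(0.0, jitter * delay)


if __name__ == "__main__":
    for seconds in backoff_delays():
        print(f"retry in {seconds:.1f}s")
```

A client would sleep for each yielded delay before its next reconnect attempt and start a fresh schedule after a successful connection.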

### 🔜 Up Next

- 🎬 **Lip-sync v2** — visual speech timing with wav2lip

- 📖 **Audiobook Editor** — chapter-aware long-form narration

- 🌐 **Hosted Demo** — try OmniVoice without installing anything

- 🔌 **Plugin Marketplace** — community-contributed TTS engines and effects

---

## Community



  






| Channel | What happens there |
|---------|--------------------|
| `#showcase` | Members share their dubs, clones, and voice designs |
| `#help` | Setup issues, GPU troubleshooting, model questions |
| `#feature-requests` | Vote on what gets built next |
| `#dev` | Architecture discussions, PR reviews, engine integrations |
| `#announcements` | Release notes, breaking changes, early access |

**[→ Join the Discord](https://discord.gg/bzQavDfVV9)** — we respond to setup questions within hours, not days.

---

## Contributing

We welcome contributions of all kinds — bug fixes, new TTS engine adapters, UI improvements, docs, and translations.

- 📖 Read the **[Contributing Guide](CONTRIBUTING.md)** for setup, code style, and PR workflow

- 🐛 Browse [good first issues](https://github.com/debpalash/OmniVoice-Studio/labels/good%20first%20issue)

- 💬 Join our [Discord](https://discord.gg/bzQavDfVV9) to discuss ideas or ask for help

---

## FAQ

**Is this really as good as ElevenLabs?**

For voice cloning and dubbing, yes. OmniVoice uses a state-of-the-art diffusion TTS model covering 646 languages (ElevenLabs supports 32), and quality is comparable for most use cases. ElevenLabs wins on its polished cloud API and pre-made voice library. OmniVoice wins on privacy, cost, language coverage, and customizability.

**Does it work on Apple Silicon (M1/M2/M3/M4)?**

Yes. MPS acceleration is auto-detected. MLX-optimized Whisper models are available for faster transcription on Apple hardware.

**How much VRAM do I need?**

4 GB minimum. With ≤8 GB, the TTS model is automatically offloaded to CPU during transcription. With more than 8 GB, everything runs on the GPU simultaneously. No GPU at all? CPU mode works, just slower (roughly 3× for TTS).
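
The offloading rule in this answer can be written down as a small decision function. This is only a sketch of the rule as stated, not the app's implementation: `plan_devices` is a hypothetical name, exactly 8 GB is treated as the offloading case (following the "≤8 GB" wording), and the 4 GB minimum is not modeled.

```python
from typing import Dict, Optional

# Taken from the "≤8 GB" wording; exactly 8 GB counts as the offload case here.
OFFLOAD_THRESHOLD_GB = 8.0


def plan_devices(vram_gb: Optional[float]) -> Dict[str, str]:
    """Map available VRAM to model placement while transcription is running.

    Pass None when the machine has no usable GPU.
    """
    if vram_gb is None:
        # CPU-only mode: the whole pipeline runs on CPU, just slower.
        return {"asr": "cpu", "tts": "cpu"}
    if vram_gb <= OFFLOAD_THRESHOLD_GB:
        # Small GPU: keep transcription on the GPU and move TTS to CPU.
        return {"asr": "gpu", "tts": "cpu"}
    # Larger GPU: both models stay resident on the GPU.
    return {"asr": "gpu", "tts": "gpu"}


if __name__ == "__main__":
    for vram in (None, 6.0, 8.0, 12.0):
        print(vram, plan_devices(vram))
```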

**Can I use this commercially?**

Personal, educational, internal-team, and non-commercial use is free under FSL-1.1-ALv2. Building a competing product or service on top of OmniVoice Studio requires a commercial license; see [License](#license). Pricing tiers are coming soon. Each release converts to Apache 2.0 two years after publication.

**What languages are supported?**

646 languages for TTS via the OmniVoice model. Transcription (WhisperX) supports 99 languages. Translation coverage depends on the target language pair.

**Can I add my own TTS engine?**

Yes. OmniVoice uses a built-in backend registry. To add an engine in ~50 lines, subclass `TTSBackend` in `backend/services/tts_backend.py` and add it to the `_REGISTRY` dictionary at the bottom. Six engines are built in: OmniVoice, CosyVoice, MLX-Audio (14+ sub-engines), VoxCPM2, MOSS-TTS-Nano, and KittenTTS. See the [TTS Engines](#tts-engines) section for details.

---

## License

OmniVoice Studio is source-available under the [**Functional Source License (FSL-1.1-ALv2)**](https://fsl.software/).

**Free** for personal, educational, research, internal team, and non-commercial use. Each release **converts to Apache 2.0 automatically two years after publication**.

**Business / enterprise** users building a competing product or service on top of OmniVoice Studio need a commercial license. **Pricing tiers coming soon.** For inquiries in the meantime, reach out at **OmniVoice@palash.dev**.

See [`LICENSE`](LICENSE) for the full terms.

---

## Acknowledgments

OmniVoice Studio is built on the shoulders of exceptional open-source work:

| Project | Role |
|---------|------|
| [**OmniVoice (k2-fsa)**](https://github.com/k2-fsa/OmniVoice) | Zero-shot diffusion TTS engine: the core voice synthesis model |
| [**WhisperX**](https://github.com/m-bain/whisperX) | Word-level speech recognition and alignment |
| [**Demucs (Meta)**](https://github.com/facebookresearch/demucs) | Music source separation for vocal isolation |
| [**Pyannote**](https://github.com/pyannote/pyannote-audio) | Speaker diarization (who said what) |
| [**CTranslate2**](https://github.com/OpenNMT/CTranslate2) | Optimized Transformer inference on CPU and GPU |
| [**AudioSeal (Meta)**](https://github.com/facebookresearch/audioseal) | Invisible neural audio watermarking for AI provenance |
| [**Tauri**](https://tauri.app) | Native desktop app framework |

---






If you read this far, you're our kind of person.


**[⭐ Star this repo](https://github.com/debpalash/OmniVoice-Studio)** so others can find it too.


**[💬 Join the Discord](https://discord.gg/bzQavDfVV9)** to share what you build.