https://github.com/debpalash/omnivoice-studio
A Cinematic audio dubbing, Cloning and voice generation studio
https://github.com/debpalash/omnivoice-studio
600-langs ai cuda mlx omnivoice tts voice-ai voice-cloning voice-generation whisper youtube-video
Last synced: about 1 month ago
JSON representation
A Cinematic audio dubbing, Cloning and voice generation studio
- Host: GitHub
- URL: https://github.com/debpalash/omnivoice-studio
- Owner: debpalash
- License: apache-2.0
- Created: 2026-04-09T21:40:26.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-21T13:05:20.000Z (about 1 month ago)
- Last Synced: 2026-04-21T15:10:45.595Z (about 1 month ago)
- Topics: 600-langs, ai, cuda, mlx, omnivoice, tts, voice-ai, voice-cloning, voice-generation, whisper, youtube-video
- Language: Python
- Homepage:
- Size: 11.6 MB
- Stars: 390
- Watchers: 5
- Forks: 26
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
OmniVoice Studio
Your Local Cinematic AI Dubbing Studio
Features β’
Getting Started β’
Roadmap β’
Changelog
The timeline-based cinematic dubbing and workspace UI.
---
Local, full-stack voice generation and cinematic dubbing. **No API keys. No cloud. Just run it.** Built on the open-source [OmniVoice](https://github.com/k2-fsa/OmniVoice) 600-language zero-shot diffusion model.
## β¨ Features
- π¬ **Video Dubbing** β transcribe, translate, re-voice, and mux back into MP4 with selective track export.
- π§ **Vocal Isolation** β built-in `demucs` automatically splits speech from music, keeping original background audio perfectly preserved.
- 𧬠**Voice Cloning & Design** β Clone specific voices from just a 3-second audio clip, or design completely new studio profiles with tags like `female, british accent, excited`.
- β‘ **Cross-Platform Native Execution** β Auto-detects and accelerates inference using Apple Silicon (MPS), NVIDIA (CUDA), AMD (ROCm), or standard CPU.
- π **Per-Segment Mixing** β Fine-grained volume/gain control per dubbed segment (0β200%) for broadcast-quality audio balancing.
- β¨οΈ **Keyboard-Driven Workflow** β `β+Enter` to generate, `β+S` to save, `β+Z`/`β+Shift+Z` for undo/redo.
- π‘ **Live Model Telemetry** β Real-time CPU/RAM/VRAM stats + model warm-up indicator (idle β loading β ready).
## π Getting Started
The easiest way to run OmniVoice Studio locally or on a cloud VM is via Docker. Our environment utilizes an optimized `pytorch/pytorch` configuration which seamlessly enables zero-config GPU passthrough if your host supports it.
### Option 1: One-Click Docker (Recommended)
```bash
git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
docker compose up --build -d
```
That's it! Open [http://localhost:8000](http://localhost:8000) in your browser.
> [!TIP]
> **Windows/WSL Users:** Make sure your NVIDIA drivers are up to date. Docker Desktop automatically passes GPU capabilities to this container!
> **Cloud VMs (AWS, RunPod):** The image inherently supports CUDA 12.1. As long as `nvidia-container-toolkit` is installed on your host, `--gpus all` binds natively.
### Option 2: Local Development Setup
Quickly get OmniVoice Studio running natively on your hardware if you want to develop or modify code.
**Prerequisites:** Ensure `ffmpeg` is installed on your system.
Install standard modern web tooling: [Bun](https://bun.sh/) and [uv](https://docs.astral.sh/uv/getting-started/installation/).
```bash
git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
# Boot the Backend
uv sync
uv run uvicorn backend.main:app
# Boot the Frontend (in a separate terminal)
bun install
bun run dev
```
OmniVoice Studio launches exactly two micro-services:
| Service | Protocol | Details |
|---|---|---|
| **Frontend** | `http://localhost:5173` | The real-time React UI β spanning cloning, design, and audio workspace. |
| **Backend** | `http://localhost:8000` | The FastAPI server handling model inference, translation pipelines, transcriber tasks. |
> [!NOTE]
> **First run optimization:** Model weights (approx. 1.2 GB) automatically download from HuggingFace the first time you execute a generation sequence. Subsequent launches trigger instantly from cache. *(Tip: Set `HF_TOKEN` in your environment for faster, authenticated downloads!)*
---
## πΊοΈ Roadmap
The studio is highly functional today, but we are aggressively expanding. Watch the roadmap to see what's shipping next:
### π Completed Milestones
- [x] Zero-shot voice cloning & complex voice design.
- [x] Full video cinematic dubbing pipeline (transcribe β translate β synthesize β mux).
- [x] Vocal isolation utilizing demucs alongside background audio retention.
- [x] Embedded waveform timeline editor for micro-segment-level audio manipulation.
- [x] Live system telemetry tracking (CPU, RAM, GPU VRAM usage).
- [x] Targeted multi-speaker diarization β auto-assign unique voice profiles per active speaker.
- [x] Studio project persistence β save, load, and cache multi-track projects seamlessly via local SQLite.
- [x] Production SRT/VTT subtitle export packaged alongside the dubbed `.mp4` video output.
- [x] Selective track export β choose exactly which language tracks (Original, DE, ES, etc.) to include in final MP4.
- [x] Per-segment volume/gain control with real-time mixing (0β200%).
- [x] Undo/redo system for all segment edits with 50-action history depth.
- [x] Keyboard shortcuts: `β+Enter` generate, `β+S` save, `β+Z`/`β+Shift+Z` undo/redo.
- [x] Drag-and-drop file uploads for both video and clone audio sources.
- [x] Model warm-up indicator with live status pill (idle/loading/ready).
- [x] Confirmation dialogs for all destructive actions (delete project/history/profile).
- [x] UI preferences persistence (sidebar state, zoom, active tab) across sessions.
- [x] Polished glassmorphism design system with micro-animations, focus rings, and custom scrollbars.
### π¨ Upcoming Features
- [x] **Real Speaker Diarization** β ML-based diarization via pyannote.audio for true multi-speaker identification.
- [x] **A/B Voice Comparison** β Side-by-side voice audition for casting decisions.
- [x] **Scene-Aware Dubbing** β FFmpeg scene detection to auto-split segments at visual cuts.
- [x] **Lip-Sync Scoring** β Analyze dubbed audio duration against original speaker timing with color-coded badges.
- [x] **Batch Processing** β Centralized async task queue ensuring sequential GPU execution with reconnectable SSE streams.
- [x] **Advanced Export Suite** β VTT subtitles, per-segment WAV ZIP, compressed MP3, and stem export (vocals + background separate).
- [x] **Streaming TTS** β Chunked WAV streaming with progressive download and auto-playback.
- [ ] **Native Desktop Applications** β Dedicated client apps for macOS, Windows, and Linux.
- [x] **One-Click Deployment** β Docker image packages engineered for zero-config GPU passthrough.
---
## π Changelog
### v1.2.0 β The Production Polish Update
- **Selective Track Export:** Choose exactly which audio tracks to include in the final MP4. Uncheck Original, keep only German β get a single-track export. Full per-track checkbox UI with dynamic FFmpeg stream index remapping.
- **Undo/Redo System:** Full `β+Z` / `β+Shift+Z` undo/redo for all segment edits (text, voice, volume, delete). 50-action deep history stack.
- **Per-Segment Volume Control:** Inline gain slider (0β200%) per segment row in the dub table. Backend applies gain during audio assembly with safe clamping.
- **Keyboard Shortcuts:** `β+Enter` to generate, `β+S` to save project. Browser default overrides prevented.
- **Model Status Indicator:** Live status pill in the header showing model warm-up state (idle β loading β ready). New `/model/status` backend endpoint.
- **Drag-and-Drop Everywhere:** Video upload already supported drop β now clone audio upload does too, with pink highlight on hover.
- **Confirmation Dialogs:** All destructive actions (delete project, profile, history item, clear all history) now require confirmation.
- **Session Persistence:** Sidebar collapsed state, active tab, and zoom level now persist across browser sessions via localStorage.
- **CSS Design System Overhaul:** Anti-aliased text, input focus glow rings, button hover shimmer, progress bar shimmer animation, fade-in on history items, selection color branding, Firefox scrollbar support, `tabular-nums` for timestamp columns.
- **AudioContext Pooling:** `playPing()` synthesis notification reuses a single AudioContext instead of creating one per call (browsers cap at ~6).
### v1.1.0 β The Cinematic Studio Update
- **The Cinematic Studio Interface:** Exhaustively re-engineered the UI to prioritize a high-density, real-estate optimized workflow featuring a dynamic UI zoom scalar (`Small`, `Normal`, `Max`). We minimized dead space and overhauled the widget layout keeping crucial tuning metrics immediately accessible.
- **Multi-Track Timeline:** Deeply integrated a multi-layered waveform sequence interface supporting precision audio segment positioning, unmuted live preview playback, localized track timing, and unconstrained draggable positioning manipulation.
- **Persistent Local Projects:** Put a complete stop to ephemeral state loss. All workspace metrics are successfully wrapped into `Projects` logged directly within a native embedded `SQLite` database. Workflows reliably survive browser shutdowns or server API reboots.
- **AI Cast Diarization:** Dropped in an offline `Pyannote` + `WhisperX` fusion pipeline evaluating multi-speaker metadata and categorizing overlapping, distinct speakers. Rapidly "cast" clone overrides seamlessly over complex dialogue tracks.
- **Polishing & Asset Control:** Cleaned cross-stack filename parsing and exported media rendering via `ffmpeg`, stabilizing codec dependencies, and deployed a unified custom `OmniVoice Studio` scalable aesthetic asset system.
## β Star History
Contributions and conceptual ideas are greatly appreciated β open an issue or submit a PR.