https://github.com/tianqbu/doppelvoice
Real-time Chinese↔English speech translation with zero-shot voice cloning · End-to-end real-time speech translation + zero-shot voice cloning · Powered by Doubao Seed LiveInterpret 2.0
- Host: GitHub
- URL: https://github.com/tianqbu/doppelvoice
- Owner: TianqBu
- License: mit
- Created: 2026-04-25T20:14:53.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-26T17:01:47.000Z (about 2 months ago)
- Last Synced: 2026-04-28T10:35:10.953Z (about 2 months ago)
- Topics: chinese-english, doubao, protobuf, pyside6, real-time, simultaneous-interpretation, speech-translation, voice-cloning, websocket, windows
- Language: Python
- Size: 363 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Doppelvoice
> **Your voice, in any language.**
> Real-time speech-to-speech translation with zero-shot voice cloning across **9 languages**
> (Chinese / English / Japanese / Indonesian / Spanish / Portuguese / German / French + bilingual ZH⇄EN auto).
> The other party hears **the target language in your own voice** through any meeting app —
> Zoom, Teams, WeChat, Google Meet, OBS, anything that takes a microphone.
>
> _Powered by ByteDance Doubao Seed LiveInterpret 2.0._
[中文](README.zh-CN.md) · [Architecture](docs/en/ARCHITECTURE.md) · [Setup](docs/en/SETUP.md) · [Troubleshooting](docs/en/TROUBLESHOOTING.md)
---
## What it does
```
You speak ─► Doppelvoice ─► Peer hears (in your voice)
┌──────────────────┐ ┌─────────┐ ┌──────────────────────────────┐
│ your mic │ ────► │ Doubao │ ────► │ virtual mic → Zoom / Teams … │
└──────────────────┘ │ AST 2.0 │ └──────────────────────────────┘
└─────────┘
```
Pick any of the 8 single-language codes (`zh / en / ja / id / es / pt / de / fr`) as source and target,
or use `zhen` on both sides for bilingual ZH⇄EN auto-detection.
End-to-end latency ≈ 2.5–3 s. Subtitles stream token-by-token; voice is cloned zero-shot from your speech as you talk.
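
The language rule above (eight single-language codes, or `zhen` on both sides) can be checked before a session starts. The following is a minimal sketch, not the project's actual code: `validate_lang_pair` is a hypothetical helper, and only the codes themselves come from this README. Whether identical source and target codes (for example `zh` to `zh`) are accepted by the service is not stated here, so the sketch does not reject them.

```python
# Hypothetical helper: the language codes come from this README, the names do not.
SINGLE_LANGS = {"zh", "en", "ja", "id", "es", "pt", "de", "fr"}
BILINGUAL = "zhen"


def validate_lang_pair(source: str, target: str) -> tuple[str, str]:
    """Return the normalized (source, target) pair or raise ValueError."""
    src, tgt = source.strip().lower(), target.strip().lower()
    for code in (src, tgt):
        if code not in SINGLE_LANGS and code != BILINGUAL:
            raise ValueError(f"unsupported language code: {code!r}")
    # `zhen` is only valid when it is set on both sides.
    if (src == BILINGUAL) != (tgt == BILINGUAL):
        raise ValueError("`zhen` must be used for both source and target")
    return src, tgt
```

Calling `validate_lang_pair("en", "zh")` returns the pair unchanged, while `validate_lang_pair("zhen", "en")` raises `ValueError`.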
## Features
- 🎙 **End-to-end speech-to-speech** — no separate STT / MT / TTS plumbing
- 🗣 **Zero-shot voice cloning** — the model captures your voice on the fly;
`denoise=false` is the default, so breath and resonance details are retained
- 🌐 **9 languages** — `zh / en / ja / id / es / pt / de / fr / zhen` (the
last one is the bilingual ZH⇄EN auto mode)
- ⚡ **~2.5–3 s end-to-end latency** — close to the model's ≈ 2.5 s floor
- 🪟 **Native Windows GUI** (PySide6) with live bilingual subtitles
- 🔌 **Universal compatibility** — anything that accepts a microphone works
- 🔁 **Automatic reconnect** with exponential backoff and fatal-error classification
- 🔒 **Privacy-first defaults** — translated audio and subtitles never persist
to disk unless you opt in; logs auto-redact API keys and bearer tokens
- 🧹 **Clean device picker** — one entry per physical device (host-API
duplicates collapsed; MME 31-char name truncation handled)
- 🛠 **Configurable** — sample rate, jitter buffer, RMS gate, denoise toggle,
speaker_id, all tweakable
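
To illustrate the "automatic reconnect with exponential backoff and fatal-error classification" item, here is a minimal sketch of the general technique. It is not the project's implementation: `FatalError`, `backoff_delays`, `run_with_reconnect`, and every numeric default are assumptions chosen for the example.

```python
import asyncio
import random


class FatalError(Exception):
    """Hypothetical marker for errors a retry cannot fix (e.g. rejected credentials)."""


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0):
    """Yield delays that grow by `factor` each step and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


async def run_with_reconnect(session, max_attempts: int = 5, sleep=asyncio.sleep):
    """Await `session()` until it returns; retry transient errors, re-raise fatal ones."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return await session()
        except FatalError:
            raise  # classified as fatal: retrying would not help
        except (OSError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            # Random jitter keeps many clients from reconnecting in lockstep.
            await sleep(next(delays) * random.uniform(0.8, 1.2))
```

`ConnectionError` is a subclass of `OSError`, so dropped sockets fall into the retry branch. The `sleep` parameter exists only so the delay can be replaced in tests.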
## Quick start
Two ways to install. **Option A** is the fastest (no Python needed).
### Option A — Pre-built Windows binary (recommended)
1. Install [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) → run installer as admin → reboot.
2. Download the latest **`Doppelvoice-vX.Y.Z-win64.zip`** from the [Releases page](https://github.com/TianqBu/Doppelvoice/releases/latest).
3. Unzip anywhere, then inside the folder: copy `.env.example` → `.env`, fill in `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` (get them from the [Volcengine Console](https://console.volcengine.com/speech/app)).
4. Double-click `Doppelvoice.exe`. The GUI opens.
5. In your meeting app, set the microphone to **`CABLE Output (VB-Audio Virtual Cable)`**.
### Option B — From source (for developers)
```cmd
git clone https://github.com/TianqBu/Doppelvoice.git
cd Doppelvoice
python -m venv .venv
:: install from pyproject.toml
.venv\Scripts\pip install -e .
:: or: .venv\Scripts\pip install -r requirements.txt
copy .env.example .env
:: fill in DOUBAO_APP_KEY / DOUBAO_ACCESS_KEY
notepad .env
:: verify devices + API connectivity + StartSession
check.bat
:: launch the GUI
gui.bat
:: or run in CLI mode
run.bat
```
In your meeting app: pick **`CABLE Output (VB-Audio Virtual Cable)`** as the microphone.
## CLI
```cmd
:: start translation (CLI)
run.bat
:: launch GUI
run.bat --gui
:: self-check
run.bat --check
:: list audio devices
run.bat --list-devices
:: reverse direction
run.bat --source en --target zh
:: lower latency (more underrun risk)
run.bat --jitter-ms 80
:: verbose logs
run.bat --log-level DEBUG
```
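
The device list is described in Features as having one entry per physical device, with host-API duplicates collapsed and MME's 31-character name truncation handled. The sketch below shows one way that collapsing could work on plain device names. `collapse_devices` is a hypothetical function and the prefix heuristic is an assumption, not the project's actual logic.

```python
MME_NAME_LIMIT = 31  # MME reports device names cut off at 31 characters


def collapse_devices(names: list[str]) -> list[str]:
    """Collapse host-API duplicates, keeping the longest spelling of each name."""
    unique: list[str] = []
    # Longest first, so a full name is always seen before its truncated MME twin.
    for name in sorted(set(names), key=len, reverse=True):
        truncated_twin = len(name) == MME_NAME_LIMIT and any(
            full.startswith(name) for full in unique
        )
        if not truncated_twin:
            unique.append(name)
    return sorted(unique)
```

In practice the names could come from the `"name"` field of each entry returned by `sounddevice.query_devices()`, filtered separately for inputs and outputs. The heuristic would wrongly merge a device whose real name is exactly 31 characters and happens to be a prefix of another device's name.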
## Configuration
All settings have sensible defaults. Override via `.env` or CLI flags.
| Variable | Default | Notes |
|---|---|---|
| `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` | _required_ | from Volcengine console |
| `DOUBAO_RESOURCE_ID` | `volc.service_type.10053` | AST 2.0 resource ID |
| `SOURCE_LANG` / `TARGET_LANG` | `zh` / `en` | one of `zh / en / ja / id / es / pt / de / fr / zhen`. Use `zhen` on **both** sides for bilingual ZH⇄EN auto mode. |
| `MODE` | `s2s` | `s2s` (speech→speech) or `s2t` (speech→text) |
| `DENOISE` | `0` | `1` = server-side denoise on (cleaner input but flatter voice clone). `0` keeps breath / resonance for better cloning. |
| `SPEAKER_ID` | _empty_ | Doubao `ReqParams.speaker_id` — empty = clone the speaker; set to a preset like `zh_female_vv_uranus_bigtts` to use a stock voice instead |
| `INPUT_DEVICE` / `OUTPUT_DEVICE` | _auto_ | substring of device name (host API hidden; one entry per physical device) |
| `LOG_LEVEL` | `INFO` | `DEBUG` for verbose |
| `DUMP_AUDIO` | `false` | persist per-sentence ogg blobs (debug only) |
| `LOG_SUBTITLE` | `false` | persist subtitle text in logs (debug only) |
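
The table above can be read as a typed settings object. The sketch below is illustrative only: `Settings`, `env_bool`, and `load_settings` are hypothetical names, and it assumes the `.env` file has already been loaded into the process environment by some other step. Variable names and defaults are taken from the table. Device settings are left out for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], key: str, default: bool = False) -> bool:
    """Accept both `1`/`0` and `true`/`false`, since this README uses both spellings."""
    raw = env.get(key, "").strip().lower()
    return default if not raw else raw in _TRUTHY


@dataclass(frozen=True)
class Settings:
    app_key: str
    access_key: str
    resource_id: str
    source_lang: str
    target_lang: str
    mode: str
    denoise: bool
    speaker_id: str
    dump_audio: bool
    log_subtitle: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    missing = [k for k in ("DOUBAO_APP_KEY", "DOUBAO_ACCESS_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    mode = env.get("MODE", "s2s")
    if mode not in ("s2s", "s2t"):
        raise ValueError(f"MODE must be 's2s' or 's2t', got {mode!r}")
    return Settings(
        app_key=env["DOUBAO_APP_KEY"],
        access_key=env["DOUBAO_ACCESS_KEY"],
        resource_id=env.get("DOUBAO_RESOURCE_ID", "volc.service_type.10053"),
        source_lang=env.get("SOURCE_LANG", "zh"),
        target_lang=env.get("TARGET_LANG", "en"),
        mode=mode,
        denoise=env_bool(env, "DENOISE"),
        speaker_id=env.get("SPEAKER_ID", ""),
        dump_audio=env_bool(env, "DUMP_AUDIO"),
        log_subtitle=env_bool(env, "LOG_SUBTITLE"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.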
## Architecture
```
src/doppelvoice/
├── engine/ # Doubao AST 2.0 protobuf WebSocket client
├── audio/ # PortAudio (sounddevice) capture + playback + ogg/opus decoder
├── pipeline/ # asyncio orchestration: capture → ws → decode → playback
├── gui/ # PySide6 + qasync
├── cli.py
└── config.py
```
See [docs/en/ARCHITECTURE.md](docs/en/ARCHITECTURE.md) for the full protocol details.
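
The `capture → ws → decode → playback` flow in `pipeline/` can be pictured as asyncio stages joined by bounded queues. This is a toy sketch under stated assumptions, not the real module: every function name is hypothetical, `engine` stands in for the WebSocket round trip plus ogg/opus decoding, and the microphone is a plain iterable instead of a PortAudio stream.

```python
import asyncio

_DONE = object()  # sentinel that tells the next stage to stop


async def capture(mic_chunks, to_engine: asyncio.Queue) -> None:
    """Stage 1: push microphone chunks toward the engine."""
    for chunk in mic_chunks:
        await to_engine.put(chunk)  # blocks when the queue is full (backpressure)
    await to_engine.put(_DONE)


async def translate(to_engine: asyncio.Queue, to_player: asyncio.Queue, engine) -> None:
    """Stage 2: stand-in for the WebSocket exchange and audio decoding."""
    while (chunk := await to_engine.get()) is not _DONE:
        await to_player.put(await engine(chunk))
    await to_player.put(_DONE)


async def playback(to_player: asyncio.Queue, played: list) -> None:
    """Stage 3: stand-in for writing decoded audio to the virtual cable."""
    while (audio := await to_player.get()) is not _DONE:
        played.append(audio)


async def run_pipeline(mic_chunks, engine, maxsize: int = 8) -> list:
    """Run all three stages concurrently and return what was 'played'."""
    to_engine, to_player = asyncio.Queue(maxsize), asyncio.Queue(maxsize)
    played: list = []
    await asyncio.gather(
        capture(mic_chunks, to_engine),
        translate(to_engine, to_player, engine),
        playback(to_player, played),
    )
    return played
```

Bounded queues are one simple way to keep a slow stage from letting audio pile up without limit. A real-time pipeline might instead drop stale chunks, which this sketch does not attempt.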
## Tested with
- Windows 10 / 11 x64
- Python 3.10–3.12
- VB-Audio Virtual Cable 1.0.4 (Driver Pack 43)
- Zoom, Tencent Meeting, WeChat calls, Google Meet (Chrome), OBS
## Known limitations
1. **Voice cloning quality varies** with mic and clarity. AirPods over Bluetooth
HFP (16 kHz narrowband phone mode) gives mediocre results — a wired/USB mic
or laptop built-in mic is recommended. The default `denoise=false` already
tells the server to keep your voice's unique characteristics; toggling it
on in Settings would flatten the clone further.
2. **End-to-end latency floor ≈ 2.5 s** is the model's hard limit per the
[Seed LiveInterpret 2.0 paper](https://arxiv.org/abs/2507.17527); local
processing adds <500 ms.
3. **Voice expressiveness** of the public AST API is good but not as lively
as the Volcengine Console demo (which goes through a different BFF endpoint).
4. **Per-sentence audio decoding** (ogg_opus) adds ~500 ms latency compared with raw
PCM, a format the API does not currently honor when requested.
5. **Use headphones, not speakers.** With external speakers the meeting
audio gets re-captured by your mic, re-translated, and sent back to the
peer as their own translated voice — a textbook acoustic feedback loop.
See [Troubleshooting](docs/en/TROUBLESHOOTING.md#feedback-loop-when-using-speakers).
## Privacy
- API keys live only in `.env` (gitignored).
- Translated audio and subtitle text are **not persisted** to disk by default.
- Set `DUMP_AUDIO=1` / `LOG_SUBTITLE=1` for debugging only.
- All audio is sent through ByteDance's Doubao API. Review their [Terms of Service](https://www.volcengine.com/docs/82379/1394617) before use with sensitive content.
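
Features states that logs auto-redact API keys and bearer tokens. A minimal sketch of that idea using the standard `logging` module follows. The regular expressions and the names `redact` and `RedactingFilter` are assumptions for illustration, and the patterns deliberately err on the side of over-redaction.

```python
import logging
import re

# Matches "Bearer <token>" in any letter case.
_BEARER = re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)
# Matches "<anything ending in key>=value" or "...key: value",
# e.g. DOUBAO_ACCESS_KEY=... or X-Api-Access-Key: ...
_KEY_VALUE = re.compile(r"([\w-]*key)\s*[=:]\s*\S+", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace bearer tokens and key values with a fixed placeholder."""
    text = _BEARER.sub("Bearer [REDACTED]", text)
    return _KEY_VALUE.sub(r"\1=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record's final message so secrets never reach a handler's output."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())  # format first, then redact
        record.args = None  # the message is already fully formatted
        return True
```

Attaching the filter to each handler, for example `handler.addFilter(RedactingFilter())`, covers records from every logger that reaches that handler. A filter attached to a logger only sees records logged directly on that logger.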
## Contributing
PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).
## License
[MIT](LICENSE).
## Acknowledgements
- [ByteDance Seed LiveInterpret 2.0](https://seed.bytedance.com/en/seed_liveinterpret) — the underlying translation model
- [kizuna-ai-lab/sokuji](https://github.com/kizuna-ai-lab/sokuji) — protobuf reverse-engineering reference
- [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) — virtual audio routing on Windows