https://github.com/tianqbu/doppelvoice

Real-time Chinese↔English speech translation with zero-shot voice cloning · End-to-end real-time speech translation + zero-shot voice cloning · Powered by Doubao Seed LiveInterpret 2.0
chinese-english doubao protobuf pyside6 real-time simultaneous-interpretation speech-translation voice-cloning websocket windows

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/tianqbu/doppelvoice
Owner: TianqBu
License: mit
Created: 2026-04-25T20:14:53.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-04-26T17:01:47.000Z (about 2 months ago)
Last Synced: 2026-04-28T10:35:10.953Z (about 2 months ago)
Topics: chinese-english, doubao, protobuf, pyside6, real-time, simultaneous-interpretation, speech-translation, voice-cloning, websocket, windows
Language: Python
Size: 363 KB
Stars: 2
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

# Doppelvoice

> **Your voice, in any language.**
>
> Real-time speech-to-speech translation with zero-shot voice cloning across **8 languages**
> (Chinese / English / Japanese / Indonesian / Spanish / Portuguese / German / French) plus a bilingual ZH⇄EN auto mode.
> The other party hears **the target language in your own voice** through any meeting app:
> Zoom, Teams, WeChat, Google Meet, OBS, anything that takes a microphone.
>
> _Powered by ByteDance Doubao Seed LiveInterpret 2.0._

[中文](README.zh-CN.md) · [Architecture](docs/en/ARCHITECTURE.md) · [Setup](docs/en/SETUP.md) · [Troubleshooting](docs/en/TROUBLESHOOTING.md)

[![tests](https://github.com/TianqBu/Doppelvoice/actions/workflows/tests.yml/badge.svg)](https://github.com/TianqBu/Doppelvoice/actions/workflows/tests.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Platform](https://img.shields.io/badge/platform-Windows-lightgrey.svg)]()
[![Release](https://img.shields.io/github/v/release/TianqBu/Doppelvoice)](https://github.com/TianqBu/Doppelvoice/releases/latest)

---

## What it does

```
You speak   ─►  Doppelvoice  ─►  Peer hears  (in your voice)

   ┌──────────────────┐       ┌─────────┐       ┌──────────────────────────────┐
   │     your mic     │ ────► │ Doubao  │ ────► │ virtual mic → Zoom / Teams … │
   └──────────────────┘       │ AST 2.0 │       └──────────────────────────────┘
                              └─────────┘
```

Pick any of the 8 single-language codes (`zh / en / ja / id / es / pt / de / fr`) as source and target,
or use `zhen` on both sides for bilingual ZH⇄EN auto-detection.

End-to-end latency ≈ 2.5–3 s. Subtitles stream token-by-token; voice is cloned zero-shot from your speech as you talk.

## Features

- 🎙 **End-to-end speech-to-speech**: no separate STT / MT / TTS plumbing
- 🗣 **Zero-shot voice cloning**: the model captures your voice on the fly; denoising
  is off by default (`denoise=false`) to retain breath and resonance details
- 🌐 **9 language codes**: `zh / en / ja / id / es / pt / de / fr / zhen` (eight
  languages, plus `zhen` for the bilingual ZH⇄EN auto mode)
- ⚡ **~2.5–3 s end-to-end latency**: the model's floor is about 2.5 s and local
  processing adds under 500 ms
- 🪟 **Native Windows GUI** (PySide6) with live bilingual subtitles
- 🔌 **Universal compatibility**: anything that accepts a microphone works
- 🔁 **Automatic reconnect** with exponential backoff and fatal-error classification
- 🔒 **Privacy-first defaults**: translated audio and subtitles never persist
  to disk unless you opt in; logs auto-redact API keys and bearer tokens
- 🧹 **Clean device picker**: one entry per physical device (host-API
  duplicates collapsed; MME 31-char name truncation handled)
- 🛠 **Configurable**: sample rate, jitter buffer, RMS gate, denoise toggle, and
  `speaker_id` are all tweakable
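
The automatic-reconnect feature above could look roughly like the sketch below. It is an illustration only, not the project's code: `FatalError`, `backoff_delays`, `run_with_reconnect`, and every numeric default are hypothetical names and values chosen for the example. The idea is that transient network errors are retried with capped exponential delays, while errors classified as fatal (for example rejected credentials) are raised immediately.

```python
import asyncio
import random


class FatalError(Exception):
    """Hypothetical marker for errors that retrying cannot fix (e.g. bad credentials)."""


def backoff_delays(base=0.5, factor=2.0, cap=30.0):
    """Yield an endless sequence of exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


async def run_with_reconnect(connect, max_attempts=5, sleep=asyncio.sleep, jitter=0.1):
    """Await `connect()` until it succeeds, hits a FatalError, or runs out of attempts."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect()
        except FatalError:
            raise  # classified as fatal: retrying would not help
        except (ConnectionError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # transient, but the retry budget is spent
            delay = next(delays)
            # A little random jitter keeps many clients from retrying in lockstep.
            await sleep(delay + random.uniform(0, jitter * delay))
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.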

## Demo

![Doppelvoice GUI](docs/images/screenshot.png)

## Quick start

Two ways to install. **Option A** is the fastest (no Python needed).

### Option A — Pre-built Windows binary (recommended)

1. Install [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) → run installer as admin → reboot.

2. Download the latest **`Doppelvoice-vX.Y.Z-win64.zip`** from the [Releases page](https://github.com/TianqBu/Doppelvoice/releases/latest).

3. Unzip anywhere, then inside the folder: copy `.env.example` → `.env`, fill in `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` (get them from the [Volcengine Console](https://console.volcengine.com/speech/app)).

4. Double-click `Doppelvoice.exe`. The GUI opens.

5. In your meeting app, set the microphone to **`CABLE Output (VB-Audio Virtual Cable)`**.

### Option B — From source (for developers)

```cmd
git clone https://github.com/TianqBu/Doppelvoice.git
cd Doppelvoice
python -m venv .venv

:: install from pyproject.toml (or: .venv\Scripts\pip install -r requirements.txt)
.venv\Scripts\pip install -e .

:: create .env, then fill in DOUBAO_APP_KEY / DOUBAO_ACCESS_KEY
copy .env.example .env
notepad .env

:: verify devices + API connectivity + StartSession
check.bat

:: launch the GUI
gui.bat

:: or run in CLI mode
run.bat
```

In your meeting app: pick **`CABLE Output (VB-Audio Virtual Cable)`** as the microphone.

## CLI

```cmd
:: start translation (CLI)
run.bat
:: launch GUI
run.bat --gui
:: self-check
run.bat --check
:: list audio devices
run.bat --list-devices
:: reverse direction
run.bat --source en --target zh
:: lower latency (more underrun risk)
run.bat --jitter-ms 80
:: verbose logs
run.bat --log-level DEBUG
```
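
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch that mirrors them. It is an assumption about shape, not the actual `cli.py`: the `build_parser` name, the help strings, and the `None` default for `--jitter-ms` are invented for illustration, while the `zh` / `en` defaults follow the configuration table.

```python
import argparse

# Language codes as listed in this README.
LANG_CODES = ["zh", "en", "ja", "id", "es", "pt", "de", "fr", "zhen"]


def build_parser():
    """Build a parser that accepts the flags shown in the CLI section (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="run.bat")
    parser.add_argument("--gui", action="store_true", help="launch the GUI")
    parser.add_argument("--check", action="store_true", help="run the self-check")
    parser.add_argument("--list-devices", action="store_true", help="list audio devices")
    parser.add_argument("--source", choices=LANG_CODES, default="zh", help="source language code")
    parser.add_argument("--target", choices=LANG_CODES, default="en", help="target language code")
    # The real default jitter buffer size is not documented here, so None means "use the config value".
    parser.add_argument("--jitter-ms", type=int, default=None, help="jitter buffer in milliseconds")
    parser.add_argument("--log-level", default="INFO", help="e.g. INFO or DEBUG")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, `--source en --target zh --jitter-ms 80` parses to `source="en"`, `target="zh"`, `jitter_ms=80`, and an unknown language code is rejected before any audio device is opened.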

## Configuration

All settings have sensible defaults. Override via `.env` or CLI flags.

| Variable | Default | Notes |
|---|---|---|
| `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` | _required_ | from Volcengine console |
| `DOUBAO_RESOURCE_ID` | `volc.service_type.10053` | AST 2.0 resource ID |
| `SOURCE_LANG` / `TARGET_LANG` | `zh` / `en` | one of `zh / en / ja / id / es / pt / de / fr / zhen`. Use `zhen` on **both** sides for bilingual ZH⇄EN auto mode. |
| `MODE` | `s2s` | `s2s` (speech→speech) or `s2t` (speech→text) |
| `DENOISE` | `0` | `1` = server-side denoise on (cleaner input but flatter voice clone). `0` keeps breath / resonance for better cloning. |
| `SPEAKER_ID` | _empty_ | Doubao `ReqParams.speaker_id`; empty = clone the speaker; set to a preset like `zh_female_vv_uranus_bigtts` to use a stock voice instead |
| `INPUT_DEVICE` / `OUTPUT_DEVICE` | _auto_ | substring of device name (host API hidden; one entry per physical device) |
| `LOG_LEVEL` | `INFO` | `DEBUG` for verbose |
| `DUMP_AUDIO` | `false` | persist per-sentence ogg blobs (debug only) |
| `LOG_SUBTITLE` | `false` | persist subtitle text in logs (debug only) |
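
As a rough illustration of how the table above maps onto code, the sketch below reads these variables from an environment mapping, applies the documented defaults, and checks the `zhen` rule. It is not the project's `config.py`: `Settings`, `load_settings`, `parse_bool`, and the set of accepted truthy strings are assumptions made for the example, and whether the real loader validates inputs this way is not stated in this README.

```python
import os
from dataclasses import dataclass

LANG_CODES = {"zh", "en", "ja", "id", "es", "pt", "de", "fr", "zhen"}
TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed set of truthy spellings


def parse_bool(raw, default=False):
    """Interpret strings such as '1' or 'true' as booleans; empty or missing means default."""
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES


@dataclass(frozen=True)
class Settings:
    app_key: str
    access_key: str
    resource_id: str
    source_lang: str
    target_lang: str
    mode: str
    denoise: bool
    speaker_id: str
    dump_audio: bool
    log_subtitle: bool


def load_settings(env=None):
    """Build Settings from an environment mapping, using the defaults from the table."""
    env = os.environ if env is None else env
    missing = [key for key in ("DOUBAO_APP_KEY", "DOUBAO_ACCESS_KEY") if not env.get(key)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    source = env.get("SOURCE_LANG", "zh")
    target = env.get("TARGET_LANG", "en")
    for code in (source, target):
        if code not in LANG_CODES:
            raise ValueError(f"unsupported language code: {code!r}")
    if (source == "zhen") != (target == "zhen"):
        raise ValueError("zhen must be set on both SOURCE_LANG and TARGET_LANG")
    mode = env.get("MODE", "s2s")
    if mode not in ("s2s", "s2t"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return Settings(
        app_key=env["DOUBAO_APP_KEY"],
        access_key=env["DOUBAO_ACCESS_KEY"],
        resource_id=env.get("DOUBAO_RESOURCE_ID", "volc.service_type.10053"),
        source_lang=source,
        target_lang=target,
        mode=mode,
        denoise=parse_bool(env.get("DENOISE")),
        speaker_id=env.get("SPEAKER_ID", ""),
        dump_audio=parse_bool(env.get("DUMP_AUDIO")),
        log_subtitle=parse_bool(env.get("LOG_SUBTITLE")),
    )
```

Accepting both `1` and `true` covers the two spellings that appear in this README (`DENOISE` defaults to `0`, `DUMP_AUDIO` to `false`).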

## Architecture

```
src/doppelvoice/
├── engine/        # Doubao AST 2.0 protobuf WebSocket client
├── audio/         # PortAudio (sounddevice) capture + playback + ogg/opus decoder
├── pipeline/      # asyncio orchestration: capture → ws → decode → playback
├── gui/           # PySide6 + qasync
├── cli.py
└── config.py
```

See [docs/en/ARCHITECTURE.md](docs/en/ARCHITECTURE.md) for the full protocol details.
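
The `capture → ws → decode → playback` flow named in the tree can be pictured as a chain of `asyncio` tasks joined by bounded queues. The sketch below is a simplified stand-in, not the real `pipeline/` package: `run_pipeline`, `stage`, and the plain callables used in place of the WebSocket round trip and the ogg/opus decoder are all hypothetical.

```python
import asyncio

_DONE = object()  # sentinel passed down the chain to signal end of stream


async def stage(inbox, outbox, transform):
    """Move items from inbox to outbox, applying transform to each one."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)
            return
        await outbox.put(transform(item))


async def run_pipeline(frames, translate, decode, maxsize=8):
    """Run capture -> translate -> decode -> playback and return what was 'played'."""
    # Bounded queues apply backpressure so a slow stage cannot pile up unbounded audio.
    captured, received, decoded = (asyncio.Queue(maxsize) for _ in range(3))
    played = []

    async def capture():
        for frame in frames:  # stands in for microphone frames
            await captured.put(frame)
        await captured.put(_DONE)

    async def playback():
        while (chunk := await decoded.get()) is not _DONE:
            played.append(chunk)  # stands in for writing to the virtual microphone

    await asyncio.gather(
        capture(),
        stage(captured, received, translate),  # stands in for the WebSocket round trip
        stage(received, decoded, decode),      # stands in for ogg/opus decoding
        playback(),
    )
    return played


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"a", b"b"], bytes.upper, bytes.decode)))
```

Each stage has a single consumer, so items leave the chain in the order they entered it.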

## Tested with

- Windows 10 / 11 x64

- Python 3.10–3.12

- VB-Audio Virtual Cable 1.0.4 (Driver Pack 43)

- Zoom, Tencent Meeting, WeChat calls, Google Meet (Chrome), OBS

## Known limitations

1. **Voice cloning quality varies** with mic and clarity. AirPods over Bluetooth
   HFP (16 kHz narrowband phone mode) give mediocre results; a wired/USB mic
   or laptop built-in mic is recommended. The default `denoise=false` already
   tells the server to keep your voice's unique characteristics; toggling it
   on in Settings would flatten the clone further.
2. **End-to-end latency floor ≈ 2.5 s** is the model's hard limit per the
   [Seed LiveInterpret 2.0 paper](https://arxiv.org/abs/2507.17527); local
   processing adds <500 ms.
3. **Voice expressiveness** of the public AST API is good but not as lively
   as the Volcengine Console demo (which goes through a different BFF endpoint).
4. **Per-sentence audio decoding** (ogg_opus) adds ~500 ms latency vs raw
   PCM (which the API does not currently honor).
5. **Use headphones, not speakers.** With external speakers the meeting
   audio gets re-captured by your mic, re-translated, and sent back to the
   peer as their own translated voice: a textbook acoustic feedback loop.
   See [Troubleshooting](docs/en/TROUBLESHOOTING.md#feedback-loop-when-using-speakers).

## Privacy

- API keys live only in `.env` (gitignored).

- Translated audio and subtitle text are **not persisted** to disk by default.

- Set `DUMP_AUDIO=1` / `LOG_SUBTITLE=1` for debugging only.

- All audio is sent through ByteDance's Doubao API. Review their [Terms of Service](https://www.volcengine.com/docs/82379/1394617) before use with sensitive content.
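
Log redaction of API keys and bearer tokens is typically done with a `logging.Filter`. The sketch below shows one way it could work; the regular expressions, the `[REDACTED]` placeholder, and the `redact` / `RedactingFilter` names are assumptions for illustration and may differ from what Doppelvoice actually ships.

```python
import logging
import re

# Assumed patterns: "Bearer <token>" and "<something>app_key / access_key / api_key = <value>".
_PATTERNS = [
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+"), r"\1[REDACTED]"),
    (re.compile(r"(?i)((?:app|access|api)[_-]?key\s*[=:]\s*)\S+"), r"\1[REDACTED]"),
]


def redact(text):
    """Replace anything that looks like a key or bearer token with a placeholder."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrite each record's message so secrets never reach a handler's output."""

    def filter(self, record):
        # Format first, so secrets passed as %-style arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactingFilter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.info("connecting with access_key=%s", "not-a-real-key")
```

Attaching the filter to the handler rather than to a single logger means records from every module pass through it before being written.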

## Contributing

PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

[MIT](LICENSE).

## Acknowledgements

- [ByteDance Seed LiveInterpret 2.0](https://seed.bytedance.com/en/seed_liveinterpret) — the underlying translation model

- [kizuna-ai-lab/sokuji](https://github.com/kizuna-ai-lab/sokuji) — protobuf reverse-engineering reference

- [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) — virtual audio routing on Windows
