An open API service indexing awesome lists of open source software.

https://github.com/tianqbu/doppelvoice

Real-time Chinese↔English speech translation with zero-shot voice cloning · 端到端实时语音翻译 + 0样本音色克隆 · Powered by Doubao Seed LiveInterpret 2.0
https://github.com/tianqbu/doppelvoice

chinese-english doubao protobuf pyside6 real-time simultaneous-interpretation speech-translation voice-cloning websocket windows

Last synced: about 2 months ago
JSON representation

Real-time Chinese↔English speech translation with zero-shot voice cloning · 端到端实时语音翻译 + 0样本音色克隆 · Powered by Doubao Seed LiveInterpret 2.0

Awesome Lists containing this project

README

          

# Doppelvoice

> **Your voice, in any language.**
> Real-time speech-to-speech translation with zero-shot voice cloning across **9 languages**
> (Chinese / English / Japanese / Indonesian / Spanish / Portuguese / German / French + bilingual ZH⇄EN auto).
> The other party hears **the target language in your own voice** through any meeting app —
> Zoom, Teams, WeChat, Google Meet, OBS, anything that takes a microphone.
>
> _Powered by ByteDance Doubao Seed LiveInterpret 2.0._

[中文](README.zh-CN.md) · [Architecture](docs/en/ARCHITECTURE.md) · [Setup](docs/en/SETUP.md) · [Troubleshooting](docs/en/TROUBLESHOOTING.md)

[![tests](https://github.com/TianqBu/Doppelvoice/actions/workflows/tests.yml/badge.svg)](https://github.com/TianqBu/Doppelvoice/actions/workflows/tests.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Platform](https://img.shields.io/badge/platform-Windows-lightgrey.svg)]()
[![Release](https://img.shields.io/github/v/release/TianqBu/Doppelvoice)](https://github.com/TianqBu/Doppelvoice/releases/latest)

---

## What it does

```
You speak ─► Doppelvoice ─► Peer hears (in your voice)
┌──────────────────┐ ┌─────────┐ ┌──────────────────────────────┐
│ your mic │ ────► │ Doubao │ ────► │ virtual mic → Zoom / Teams … │
└──────────────────┘ │ AST 2.0 │ └──────────────────────────────┘
└─────────┘
```

Pick any of 9 source/target language codes (`zh / en / ja / id / es / pt / de / fr`)
or use `zhen` on both sides for bilingual ZH⇄EN auto-detection.

End-to-end latency ≈ 2.5–3 s. Subtitles stream token-by-token; voice is cloned zero-shot from your speech as you talk.

## Features

- 🎙 **End-to-end speech-to-speech** — no separate STT / MT / TTS plumbing
- 🗣 **Zero-shot voice cloning** — model captures your voice on the fly; explicit
`denoise=false` to retain breath / resonance details
- 🌐 **9 languages** — `zh / en / ja / id / es / pt / de / fr / zhen` (the
last one is the bilingual ZH⇄EN auto mode)
- ⚡ **~2.5 s latency** — production-grade real-time
- 🪟 **Native Windows GUI** (PySide6) with live bilingual subtitles
- 🔌 **Universal compatibility** — anything that accepts a microphone works
- 🔁 **Automatic reconnect** with exponential backoff and fatal-error classification
- 🔒 **Privacy-first defaults** — translated audio and subtitles never persist
to disk unless you opt in; logs auto-redact API keys and bearer tokens
- 🧹 **Clean device picker** — one entry per physical device (host-API
duplicates collapsed; MME 31-char name truncation handled)
- 🛠 **Configurable** — sample rate, jitter buffer, RMS gate, denoise toggle,
speaker_id, all tweakable

## Demo

![Doppelvoice GUI](docs/images/screenshot.png)

## Quick start

Two ways to install. **Option A** is the fastest (no Python needed).

### Option A — Pre-built Windows binary (recommended)

1. Install [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) → run installer as admin → reboot.
2. Download the latest **`Doppelvoice-vX.Y.Z-win64.zip`** from the [Releases page](https://github.com/TianqBu/Doppelvoice/releases/latest).
3. Unzip anywhere, then inside the folder: copy `.env.example` → `.env`, fill in `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` (get them from the [Volcengine Console](https://console.volcengine.com/speech/app)).
4. Double-click `Doppelvoice.exe`. The GUI opens.
5. In your meeting app, set the microphone to **`CABLE Output (VB-Audio Virtual Cable)`**.

### Option B — From source (for developers)

```cmd
git clone https://github.com/TianqBu/Doppelvoice.git
cd Doppelvoice
python -m venv .venv
.venv\Scripts\pip install -e . :: installs from pyproject.toml
:: or: .venv\Scripts\pip install -r requirements.txt

copy .env.example .env
notepad .env :: fill in DOUBAO_APP_KEY / DOUBAO_ACCESS_KEY

check.bat :: verifies devices + API connectivity + StartSession
gui.bat :: launches the GUI
run.bat :: CLI mode
```

In your meeting app: pick **`CABLE Output (VB-Audio Virtual Cable)`** as the microphone.

## CLI

```cmd
run.bat :: start translation (CLI)
run.bat --gui :: launch GUI
run.bat --check :: self-check
run.bat --list-devices :: list audio devices
run.bat --source en --target zh :: reverse direction
run.bat --jitter-ms 80 :: lower latency (more underrun risk)
run.bat --log-level DEBUG :: verbose logs
```

## Configuration

All settings have sensible defaults. Override via `.env` or CLI flags.

| Variable | Default | Notes |
|---|---|---|
| `DOUBAO_APP_KEY` / `DOUBAO_ACCESS_KEY` | _required_ | from Volcengine console |
| `DOUBAO_RESOURCE_ID` | `volc.service_type.10053` | AST 2.0 resource ID |
| `SOURCE_LANG` / `TARGET_LANG` | `zh` / `en` | one of `zh / en / ja / id / es / pt / de / fr / zhen`. Use `zhen` on **both** sides for bilingual ZH⇄EN auto mode. |
| `MODE` | `s2s` | `s2s` (speech→speech) or `s2t` (speech→text) |
| `DENOISE` | `0` | `1` = server-side denoise on (cleaner input but flatter voice clone). `0` keeps breath / resonance for better cloning. |
| `SPEAKER_ID` | _empty_ | Doubao `ReqParams.speaker_id` — empty = clone the speaker; set to a preset like `zh_female_vv_uranus_bigtts` to use a stock voice instead |
| `INPUT_DEVICE` / `OUTPUT_DEVICE` | _auto_ | substring of device name (host API hidden; one entry per physical device) |
| `LOG_LEVEL` | `INFO` | `DEBUG` for verbose |
| `DUMP_AUDIO` | `false` | persist per-sentence ogg blobs (debug only) |
| `LOG_SUBTITLE` | `false` | persist subtitle text in logs (debug only) |

## Architecture

```
src/doppelvoice/
├── engine/ # Doubao AST 2.0 protobuf WebSocket client
├── audio/ # PortAudio (sounddevice) capture + playback + ogg/opus decoder
├── pipeline/ # asyncio orchestration: capture → ws → decode → playback
├── gui/ # PySide6 + qasync
├── cli.py
└── config.py
```

See [docs/en/ARCHITECTURE.md](docs/en/ARCHITECTURE.md) for the full protocol details.

## Tested with

- Windows 10 / 11 x64
- Python 3.10–3.12
- VB-Audio Virtual Cable 1.0.4 (Driver Pack 43)
- Zoom, 腾讯会议, 微信电话, Google Meet (Chrome), OBS

## Known limitations

1. **Voice cloning quality varies** with mic and clarity. AirPods over Bluetooth
HFP (16 kHz narrowband phone mode) gives mediocre results — a wired/USB mic
or laptop built-in mic is recommended. The default `denoise=false` already
tells the server to keep your voice's unique characteristics; toggling it
on in Settings would flatten the clone further.
2. **End-to-end latency floor ≈ 2.5 s** is the model's hard limit per the
[Seed LiveInterpret 2.0 paper](https://arxiv.org/abs/2507.17527); local
processing adds <500 ms.
3. **Voice expressiveness** of the public AST API is good but not as lively
as the Volcengine Console demo (which goes through a different BFF endpoint).
4. **Per-sentence audio decoding** (ogg_opus) adds ~500 ms latency vs raw
PCM (which the API does not currently honor).
5. **Use headphones, not speakers.** With external speakers the meeting
audio gets re-captured by your mic, re-translated, and sent back to the
peer as their own translated voice — a textbook acoustic feedback loop.
See [Troubleshooting](docs/en/TROUBLESHOOTING.md#feedback-loop-when-using-speakers).

## Privacy

- API keys live only in `.env` (gitignored).
- Translated audio and subtitle text are **not persisted** to disk by default.
- Set `DUMP_AUDIO=1` / `LOG_SUBTITLE=1` for debugging only.
- All audio is sent through ByteDance's Doubao API. Review their [Terms of Service](https://www.volcengine.com/docs/82379/1394617) before use with sensitive content.

## Contributing

PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

[MIT](LICENSE).

## Acknowledgements

- [ByteDance Seed LiveInterpret 2.0](https://seed.bytedance.com/en/seed_liveinterpret) — the underlying translation model
- [kizuna-ai-lab/sokuji](https://github.com/kizuna-ai-lab/sokuji) — protobuf reverse-engineering reference
- [VB-Audio Virtual Cable](https://vb-audio.com/Cable/) — virtual audio routing on Windows