https://github.com/second-state/qwen3_tts_rs

A Rust implementation of the Qwen3 Text-to-Speech (TTS) model inference.
https://github.com/second-state/qwen3_tts_rs
Last synced: about 2 months ago
JSON representation
A Rust implementation of the Qwen3 Text-to-Speech (TTS) model inference.
Host: GitHub
URL: https://github.com/second-state/qwen3_tts_rs
Owner: second-state
Created: 2026-01-25T00:00:55.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-28T08:31:44.000Z (2 months ago)
Last Synced: 2026-03-28T12:25:44.871Z (2 months ago)
Language: Rust
Homepage:
Size: 3.28 MB
Stars: 194
Watchers: 1
Forks: 28
Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # Qwen3 TTS - Rust CLI tools

[![Crates.io](https://img.shields.io/crates/v/qwen3-tts-rs.svg)](https://crates.io/crates/qwen3-tts-rs)

[![License](https://img.shields.io/crates/l/qwen3-tts-rs.svg)](https://github.com/second-state/qwen3_tts_rs/blob/main/LICENSE)

A Rust implementation of the Qwen3 Text-to-Speech (TTS) model inference. Provides three cross-platform CLI tools suitable for agentic skills for AI agents and bots.

- **tts** — generate speech from text with named speaker voices

- **voice_clone** — clone a voice from reference audio

- **api_server** — OpenAI-compatible HTTP API server

Supports two backends: **libtorch** (via the `tch` crate, cross-platform with optional CUDA) and **MLX** (Apple Silicon native via Metal GPU).

Learn more:

* [A Rust implementation / CLI](https://github.com/second-state/qwen3_asr_rs) for Qwen3's ASR (Automatic Speech Recognition or Speech-to-Text) models

* An OpenAI compatible [API server for audio / speech](https://github.com/second-state/qwen3_audio_api/tree/main/rust)

* An OpenClaw SKILL for voice generation. Copy and Paste to your lobster to [install it](https://raw.githubusercontent.com/second-state/qwen3_tts_rs/refs/heads/main/skills/install.md)

## Quick Start

Install binaries, models, and reference audio for your platform:

```bash

curl -sSf https://raw.githubusercontent.com/second-state/qwen3_tts_rs/main/install.sh | bash

cd qwen3_tts_rs

```

The installer detects your OS, CPU, and NVIDIA GPU (if present), then sets up everything in `./qwen3_tts_rs/`.

### Text-to-Speech

Generate speech with a named speaker using the CustomVoice model:

```bash

./tts models/Qwen3-TTS-12Hz-0.6B-CustomVoice "Hello world, this is a test." Vivian english

# Output: output.wav (24 kHz)

```

### Voice Cloning

Clone a voice from reference audio using the Base model (ICL mode):

```bash

./voice_clone models/Qwen3-TTS-12Hz-0.6B-Base reference_audio/trump.wav \

  "Hello, this is a voice cloning test." english \

  "Angered and appalled millions of Americans across the political spectrum"

# Output: output_voice_clone.wav (24 kHz)

```

### API Server

Start the OpenAI-compatible API server with the CustomVoice model:

```bash

./api_server models/Qwen3-TTS-12Hz-0.6B-CustomVoice --port 8080

```

Then call the endpoint:

```bash

curl -X POST http://localhost:8080/v1/audio/speech \

  -H "Content-Type: application/json" \

  -d '{"input": "Hello world!", "voice": "alloy"}' \

  -o output.wav

```

## Reference

### `tts` — Text-to-Speech

```

tts  [text] [speaker] [language] [instruction]

```

| Argument | Default | Description |

|----------|---------|-------------|

| `model_path` | (required) | Path to model directory |

| `text` | "Hello! This is a test..." | Text to synthesize (max ~4096 chars) |

| `speaker` | Vivian | Speaker name (see below) |

| `language` | english | Language: `english`, `chinese`, `japanese`, `korean` |

| `instruction` | (empty) | Voice style instruction (1.7B models only) |

Output: `output.wav` (24 kHz, 16-bit PCM)

**Available speakers** (CustomVoice models): Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan

**Instruction examples** (1.7B CustomVoice only):

- `"Speak in an urgent and excited voice"`

- `"Speak happily and joyfully"`

- `"Speak slowly and calmly"`

- `"Speak in a whisper"`

### `voice_clone` — Voice Cloning

```

voice_clone   [text] [language] [ref_text]

```

| Argument | Default | Description |

|----------|---------|-------------|

| `model_path` | (required) | Path to model directory |

| `ref_audio` | (required) | Path to reference WAV file |

| `text` | "Hello! This is a test..." | Text to synthesize |

| `language` | english | Language |

| `ref_text` | (none) | Transcript of reference audio (enables ICL mode, higher quality) |

Output: `output_voice_clone.wav` (24 kHz, 16-bit PCM)

**Preparing reference audio:** Must be mono 24 kHz 16-bit WAV. Convert with ffmpeg:

```bash

ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav

```

**Example:**

```bash

./voice_clone models/Qwen3-TTS-12Hz-0.6B-Base reference_audio/trump.wav \

  "Hello, this is a voice cloning test." english \

  "Angered and appalled millions of Americans across the political spectrum"

```

### `api_server` — OpenAI-Compatible API

```

api_server  [--host 127.0.0.1] [--port 8080]

```

| Option | Default | Description |

|--------|---------|-------------|

| `model_path` | (required) | Path to model directory |

| `--host` | 127.0.0.1 | Bind address |

| `--port` | 8080 | Listen port |

**Endpoints:**

| Method | Path | Description |

|--------|------|-------------|

| POST | `/v1/audio/speech` | Generate speech (OpenAI-compatible) |

| GET | `/v1/models` | List available models |

| GET | `/health` | Health check |

### API Request: `POST /v1/audio/speech`

```json

{

  "input": "Text to synthesize",

  "voice": "alloy",

  "model": "qwen3-tts",

  "response_format": "wav",

  "speed": 1.0,

  "stream": false,

  "language": "english",

  "instructions": "Speak urgently",

  "audio_sample": "",

  "audio_sample_text": "Transcript of the reference audio"

}

```

| Field | Type | Default | Description |

|-------|------|---------|-------------|

| `input` | string | (required) | Text to synthesize (max 4096 chars) |

| `voice` | string | `"alloy"` | OpenAI name or Qwen3 speaker name (see mapping below) |

| `model` | string | — | Accepted for compatibility, ignored |

| `response_format` | string | `"wav"` | `"wav"`, `"pcm"`, `"mp3"`, `"flac"`, `"ogg"`, or `"opus"` |

| `speed` | float | 1.0 | Speed multiplier (0.25–4.0) |

| `stream` | bool | false | Enable SSE streaming (requires `"pcm"`) |

| `language` | string | `"english"` | `english`, `chinese`, `japanese`, `korean`, `auto` |

| `instructions` | string | — | Voice style instruction (1.7B models only) |

| `audio_sample` | string | — | Base64-encoded reference WAV for voice cloning |

| `audio_sample_text` | string | — | Transcript of reference audio (required with `audio_sample`) |

**Voice name mapping** (OpenAI → Qwen3):

| OpenAI | Qwen3 |

|--------|-------|

| alloy | serena |

| echo | ryan |

| fable | vivian |

| onyx | eric |

| nova | ono_anna |

| shimmer | sohee |

You can also pass Qwen3 speaker names directly (e.g., `"voice": "vivian"`).

**Streaming:** When `stream: true` and `response_format: "pcm"`, the server returns Server-Sent Events with base64-encoded PCM chunks:

```

data: {"type":"speech.audio.delta","delta":""}

data: {"type":"speech.audio.done"}

```

**Voice cloning via API:**

```bash

# Encode reference audio as base64

REF_B64=$(base64 < reference.wav)

curl -X POST http://localhost:8080/v1/audio/speech \

  -H "Content-Type: application/json" \

  -d "{

    \"input\": \"Hello from a cloned voice.\",

    \"voice\": \"alloy\",

    \"audio_sample\": \"$REF_B64\",

    \"audio_sample_text\": \"Transcript of the reference audio\"

  }" -o cloned.wav

```

## Build from Source

### macOS (MLX backend)

Requires Apple Silicon Mac, Xcode, and CMake.

```bash

brew install cmake

git clone https://github.com/second-state/qwen3_tts_rs.git

cd qwen3_tts_rs

git submodule update --init --recursive

cargo build --release --no-default-features --features mlx

```

### Linux (libtorch backend)

**1. Download libtorch** from [libtorch-releases](https://github.com/second-state/libtorch-releases/releases/tag/v2.7.1):

```bash

# Linux x86_64 (CPU)

curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-2.7.1.tar.gz

tar xzf libtorch-cxx11-abi-x86_64-2.7.1.tar.gz

# Linux x86_64 (CUDA 12.6)

curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz

tar xzf libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz

# Linux ARM64 (CPU)

curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz

tar xzf libtorch-cxx11-abi-aarch64-2.7.1.tar.gz

# Linux ARM64 (CUDA 12.6 / Jetson)

curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz

tar xzf libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz

```

**2. Set environment and build:**

```bash

export LIBTORCH=$(pwd)/libtorch

export LIBTORCH_BYPASS_VERSION_CHECK=1

git clone https://github.com/second-state/qwen3_tts_rs.git

cd qwen3_tts_rs

cargo build --release

```

Alternatively, use pip-installed PyTorch instead of downloading libtorch:

```bash

pip install torch==2.7.1

export LIBTORCH_USE_PYTORCH=1

export LD_LIBRARY_PATH=$(python3 -c "import torch; print(torch.__path__[0])")/lib:$LD_LIBRARY_PATH

```

### Download models and generate tokenizer

After building, download models and generate `tokenizer.json` for each:

```bash

pip install huggingface_hub transformers

huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-0.6B-CustomVoice

python3 -c "

from transformers import AutoTokenizer

for model in ['Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen3-TTS-12Hz-0.6B-Base', 'Qwen3-TTS-12Hz-1.7B-CustomVoice']:

    path = f'models/{model}'

    try:

        tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

        tok.backend_tokenizer.save(f'{path}/tokenizer.json')

        print(f'Saved {path}/tokenizer.json')

    except Exception as e:

        print(f'Skipped {model}: {e}')

"

```

Binaries are in `target/release/`: `tts`, `voice_clone`, `api_server`.

### Rust library usage

Add to your `Cargo.toml`:

```toml

[dependencies]

qwen3-tts-rs = "0.2"

# Or for MLX backend:

# qwen3-tts-rs = { version = "0.2", default-features = false, features = ["mlx"] }

```

See the [API documentation on docs.rs](https://docs.rs/qwen3-tts-rs) for library usage examples.

## Performance (Apple M4 Mac, MLX backend)

Test sentences: ~15–20 words in English ("The quick brown fox..." / "Scientists have discovered...") and Chinese.

**RTF** = Real-Time Factor (wall time / audio duration). Lower is better; < 1.0 means faster than real-time.

### 0.6B CustomVoice

#### CLI (`tts` / `voice_clone`)

| Test | Speaker | Language | Audio | Wall Time | RTF |

|------|---------|----------|-------|-----------|-----|

| Preset voice | Vivian | English | 5.92s | 10.93s | 1.85x |

| Preset voice | Ryan | English | 8.16s | 14.15s | 1.73x |

| Preset voice | Vivian | Chinese | 6.64s | 11.26s | 1.70x |

| Voice clone (ICL) | ref audio | English | 7.04s | 16.77s | 2.38x |

#### API server (after warmup)

| Test | Voice | Mode | Audio | Wall Time | RTF |

|------|-------|------|-------|-----------|-----|

| Non-streaming WAV | alloy (serena) | full | 6.40s | 9.90s | 1.55x |

| Non-streaming WAV | echo (ryan) | full | 8.00s | 12.70s | 1.59x |

| Streaming PCM | alloy (serena) | stream | ~6.4s | 10.22s | ~1.60x |

| Voice clone WAV | alloy + ref | full | 9.04s | 20.11s | 2.22x |

### 1.7B CustomVoice

#### CLI (`tts`)

| Test | Speaker | Language | Audio | Wall Time | RTF |

|------|---------|----------|-------|-----------|-----|

| Preset voice | Vivian | English | 6.24s | 18.92s | 3.03x |

| Preset voice | Ryan | English | 8.64s | 27.50s | 3.18x |

| Preset voice | Vivian | Chinese | 6.24s | 18.31s | 2.93x |

| Preset + instruction | Vivian | English | 8.80s | 29.97s | 3.41x |

#### API server (after warmup)

| Test | Voice | Mode | Audio | Wall Time | RTF |

|------|-------|------|-------|-----------|-----|

| Non-streaming WAV | alloy (serena) | full | 5.52s | 15.14s | 2.74x |

| Non-streaming WAV | echo (ryan) | full | 8.64s | 26.31s | 3.05x |

| Streaming PCM | alloy (serena) | stream | 8.16s | 19.79s | 2.43x |

### 0.6B vs 1.7B comparison

| Metric | 0.6B avg RTF | 1.7B avg RTF | Slowdown |

|--------|-------------|-------------|----------|

| CLI preset voice | 1.76x | 3.05x | ~1.7x |

| API non-streaming | 1.57x | 2.90x | ~1.8x |

## Architecture

```

Text → Tokenizer → Dual-stream Embeddings → TalkerModel (28-layer Transformer)

                                                    ↓

                                              codec_head → Code 0

                                                    ↓

                                        CodePredictor (5-layer Transformer) → Codes 1-15

                                                    ↓

                                              Vocoder → 24kHz Waveform

```

## License

Apache-2.0

## Credits

Based on the original Python implementation by the Alibaba Qwen team.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/second-state/qwen3_tts_rs

Awesome Lists containing this project

README