https://github.com/langurmonkey/qwensay
CLI tool for text-to-speech synthesis using Qwen3-TTS
- Host: GitHub
- URL: https://github.com/langurmonkey/qwensay
- Owner: langurmonkey
- License: gpl-3.0
- Created: 2026-04-27T17:15:42.000Z (2 months ago)
- Default Branch: master
- Last Pushed: 2026-04-28T08:08:10.000Z (2 months ago)
- Last Synced: 2026-04-28T08:30:48.062Z (2 months ago)
- Language: Python
- Size: 119 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# QwenSay
A command-line tool for text-to-speech synthesis using [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS). It supports free-form voice design, preset speakers, and voice cloning. Models are downloaded automatically from Hugging Face on first run.
## Requirements
- Python 3.12+
- [`uv`](https://docs.astral.sh/uv/getting-started/installation/) for dependency management
- A CUDA-capable GPU (recommended; CPU works but is slow)
- [SoX](https://github.com/rbouqueau/SoX)
- [`mpv`](https://mpv.io) or `aplay` (optional, for automatic playback)
## Setup
```bash
# Clone or download the project, then enter the directory
cd qwensay
# Install default dependencies
uv sync
```
For Pascal GPUs (GTX 10-series), sync with the Pascal-compatible PyTorch group instead of the default one:
```bash
uv sync --group torch-pascal --no-group torch-default
```
On modern RTX cards, install the `gpu` extra to enable FlashAttention 2:
```bash
uv sync --extra gpu
```
> FlashAttention 2 is only used when the model is loaded in `bfloat16` on a CUDA device. The script enables it automatically if the package is present.
## Usage
```
uv run qwensay.py --text "Hello world" [OPTIONS]
```
### Required arguments
| Argument | Short | Description |
|---|---|---|
| `--text` | `-t` | The text to synthesize into speech |
### Optional arguments
| Argument | Short | Default | Description |
|---|---|---|---|
| `--instruct` | `-i` | (none) | Natural-language description (e.g., "A raspy old man") or style (e.g., "Whispering") |
| `--voice` | `-v` | `Ryan` | Preset speaker name (used for `1.7b-custom` and `0.6b` models) |
| `--language` | `-l` | `English` | Language: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |
| `--output` | `-o` | `outputs/` | Save path. If omitted, plays audio and saves to outputs/ with a timestamp |
| `--model` | `-m` | *(interactive)* | `1.7b` (Voice Design), `1.7b-custom` (Presets), `0.6b` (Lightweight) |
| `--device` | | *(auto)* | PyTorch device, e.g. `cuda:0` or `cpu` |
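
For readers who want to see the two tables as code, here is a minimal `argparse` sketch that mirrors the flags, short forms, and defaults listed above. It is not taken from `qwensay.py`: `build_parser` is a hypothetical name, and using `None` to mean "not given" for `--output`, `--model`, and `--device` is an assumption about how the interactive prompt and auto-detection might be triggered.

```python
import argparse

LANGUAGES = [
    "English", "Chinese", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
]
MODEL_KEYS = ["1.7b", "1.7b-custom", "0.6b"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="qwensay.py")
    parser.add_argument("-t", "--text", required=True,
                        help="text to synthesize into speech")
    parser.add_argument("-i", "--instruct", default=None,
                        help="natural-language voice description or style")
    parser.add_argument("-v", "--voice", default="Ryan",
                        help="preset speaker name")
    parser.add_argument("-l", "--language", default="English",
                        choices=LANGUAGES)
    # None here stands for "not given" (assumption, see lead-in).
    parser.add_argument("-o", "--output", default=None, help="save path")
    parser.add_argument("-m", "--model", default=None, choices=MODEL_KEYS)
    parser.add_argument("--device", default=None,
                        help="PyTorch device, e.g. cuda:0 or cpu")
    return parser
```

Calling `build_parser().parse_args(["--text", "Hello world"])` yields `voice="Ryan"` and `language="English"`, matching the defaults in the table.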
### Choosing a model
If `--model` is omitted, the script presents an interactive prompt; pass `--model` to skip it.
| Key | Hugging Face ID | Best for | VRAM |
|---|---|---|---|
| `1.7b` *(default)* | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | Free-form voice descriptions | ~6 GB |
| `1.7b-custom` | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | Preset speakers + style instructions | ~6 GB |
| `0.6b` | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Voice cloning from reference audio | ~2 GB |
Models are downloaded to the Hugging Face cache (`~/.cache/huggingface/`) on first use and reused on subsequent runs.
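
The key-to-model mapping in the table can be expressed as a plain lookup. The sketch below is illustrative: `MODEL_IDS` and `resolve_model_id` are hypothetical names, and only the keys and Hugging Face IDs come from the table above.

```python
MODEL_IDS = {
    "1.7b": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "1.7b-custom": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "0.6b": "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
}


def resolve_model_id(key: str) -> str:
    """Map a short --model key to its Hugging Face repository ID."""
    try:
        # Accept keys regardless of case, e.g. "0.6B".
        return MODEL_IDS[key.lower()]
    except KeyError:
        valid = ", ".join(MODEL_IDS)
        raise ValueError(
            f"Unknown model key {key!r}; expected one of: {valid}"
        ) from None
```

Failing early on an unknown key avoids starting a multi-gigabyte download for a mistyped model name.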
#### Available preset speakers (`--voice`)
When using `1.7b-custom` or `0.6b`, you can choose from these official high-quality timbres:
- Female: `Vivian`, `Serena`, `Ono_Anna`, `Sohee`
- Male: `Ryan`, `Aiden`, `Uncle_Fu`, `Dylan`, `Eric`
## Examples
Voice design. Create a voice entirely from a description using the default 1.7B model.
```bash
uv run qwensay.py \
--text "Good morning! Today is going to be a great day." \
--instruct "A cheerful, energetic young woman with a clear American accent"
```
Using a premium preset. Use a high-quality preset speaker with a specific emotional instruction.
```bash
uv run qwensay.py \
--text "The quick brown fox jumps over the lazy dog." \
--voice Ryan \
--instruct "Deep, authoritative narration, slow and deliberate pace" \
--model 1.7b-custom
```
Multilingual. German text with a matching voice.
```bash
uv run qwensay.py \
--text "Guten Morgen! Wie geht es Ihnen heute?" \
--instruct "Warm, professional male voice with a standard German accent" \
--language German
```
## Notes
- **Output:** If no `--output` is specified, files are saved to the `./outputs/` folder with a timestamp and played automatically via `mpv` or `aplay`.
- **GTX 10-series users:** Do not use FlashAttention 2. The script automatically falls back to "eager" mode, which works well on older Pascal cards.
- **VRAM:** If you hit "Out of Memory" errors on 6 GB or 8 GB cards, make sure no other GPU-heavy applications are running, or switch to the `0.6b` model.
- The **VoiceDesign** model (`1.7b`) gives the most expressive results for arbitrary voice descriptions. Write descriptions in natural language: age, gender, accent, pace, and emotion all influence the output.
- The **0.6B Base** model is optimized for cloning a voice from a short reference WAV. Passing an `--instruct` description to it works on a best-effort basis.
- Output is always a 16-bit PCM WAV file at the sample rate returned by the model (typically 24 kHz).
## Alternatives
If you need a lighter model, you can try Kitten TTS through [puss-say](https://github.com/Mic92/puss-say). It runs on the CPU with minimal resources, and the configuration is straightforward. The speech quality is acceptable.
## License
Qwen3-TTS model weights are released under the [Apache 2.0 license](https://github.com/QwenLM/Qwen3-TTS/blob/main/LICENSE) by Alibaba Cloud / Qwen Team.
This project is licensed under GPL-3.0.