https://github.com/cskwork/supertonic-tts
Local Supertonic 3 text-to-speech: web app (WebGPU/WASM, EN/KO/JA) + cross-platform CLI (supertts) with auto-play
Last synced: 21 days ago
- Host: GitHub
- URL: https://github.com/cskwork/supertonic-tts
- Owner: cskwork
- Created: 2026-05-16T15:52:53.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-17T15:02:55.000Z (about 1 month ago)
- Last Synced: 2026-05-17T17:26:22.765Z (about 1 month ago)
- Language: JavaScript
- Homepage: https://cskwork.github.io/supertonic-tts/
- Size: 71.3 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Supertonic TTS — Web App + CLI
A clean, beginner-friendly text-to-speech project built on
[Supertonic 3](https://github.com/supertone-inc/supertonic). Two ways to use it:
- **Web app** — Three ready-made UI languages (English, Korean, Japanese), six
preset voices with one-tap preview, paste-or-upload input (`.txt` / `.docx`),
instant WAV download. Runs entirely in your browser via WebGPU/WebAssembly.
- **CLI** — `supertonic-tts "hello"` from any terminal on macOS, Windows, or
Linux. Installed globally with `npm`, native ONNX runtime, no GPU required.
No accounts, no API keys, and no cloud round-trips for synthesis. The only network traffic is downloading the model files from Hugging Face.
## Features
- **3 UI languages**: English, Korean, Japanese
- **32 TTS language tags** available in the underlying Supertonic text processor
- **6 voice styles** with click-to-preview
- **Paste or upload**: drop in `.txt` or `.docx`
- **Sample text presets** per language
- **One-tap "Speak"** with autoplay + transcript view
- **WAV download** of any generated audio
- **WebGPU acceleration** with automatic WASM fallback
- **Fully local**: text never leaves the browser
## Supported TTS options
The app has two language layers:
- **Current UI choices**: English (`en`), Korean (`ko`), Japanese (`ja`).
These are the languages with ready-made sample text, preview text, and UI
tabs in `app/main.js`.
- **Underlying Supertonic language tags**: `en`, `ko`, `ja`, `ar`, `bg`, `cs`,
`da`, `de`, `el`, `es`, `et`, `fi`, `fr`, `hi`, `hr`, `hu`, `id`, `it`,
`lt`, `lv`, `nl`, `pl`, `pt`, `ro`, `ru`, `sk`, `sl`, `sv`, `tr`, `uk`,
`vi`, `na`. These are accepted by the text processor in `app/helper.js`.
To expose another language in the UI, add an entry to `LANGS` in `app/main.js`
with preview and preset text, then add or render the matching language tab.
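As a rough illustration of the two layers, the sketch below validates a tag against the list above before building a UI entry. The helper `makeLangEntry` and the field names `label`, `preview`, and `presets` are hypothetical; the real shape of `LANGS` in `app/main.js` may differ.

```javascript
// Language tags listed above as accepted by the text processor.
const SUPPORTED_TAGS = new Set([
  "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et", "fi",
  "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl", "pt", "ro",
  "ru", "sk", "sl", "sv", "tr", "uk", "vi", "na",
]);

// Hypothetical helper: builds a UI language entry and rejects unknown tags.
// The field names (label, preview, presets) are illustrative assumptions.
function makeLangEntry(tag, { label, preview, presets = [] }) {
  if (!SUPPORTED_TAGS.has(tag)) {
    throw new Error(`Unsupported TTS language tag: ${tag}`);
  }
  if (!preview) {
    throw new Error(`Language ${tag} needs preview text`);
  }
  return { tag, label, preview, presets };
}

// Example: a German entry that could be merged into the UI language table.
const german = makeLangEntry("de", {
  label: "Deutsch",
  preview: "Hallo, dies ist eine kurze Probe.",
  presets: ["Guten Morgen!"],
});
```

Checking the tag up front keeps a typo from surfacing later as a synthesis failure.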
### Voice styles
Every voice style can be used with every supported TTS language tag:
| ID | Display name | Type | Style file |
| --- | --- | --- | --- |
| `F1` | Mina | Female | `voice_styles/F1.json` |
| `F2` | Sora | Female | `voice_styles/F2.json` |
| `F3` | Yuna | Female | `voice_styles/F3.json` |
| `M1` | Aiden | Male | `voice_styles/M1.json` |
| `M2` | Hiro | Male | `voice_styles/M2.json` |
| `M3` | Leo | Male | `voice_styles/M3.json` |
`F1` / Mina is the default voice. Voice styles come from the
`Supertone/supertonic-3` Hugging Face repository. They are loaded on demand from
`assets/voice_styles/` in development, or from the Hugging Face CDN in production.
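The voice table can be expressed as a small lookup that falls back to the default voice. This is only a sketch: `resolveVoice` is a hypothetical helper, not a function from `app/main.js`, and the case-insensitive matching is an assumption.

```javascript
// Voice catalog copied from the table above.
const VOICES = {
  F1: { name: "Mina", type: "Female" },
  F2: { name: "Sora", type: "Female" },
  F3: { name: "Yuna", type: "Female" },
  M1: { name: "Aiden", type: "Male" },
  M2: { name: "Hiro", type: "Male" },
  M3: { name: "Leo", type: "Male" },
};
const DEFAULT_VOICE = "F1";

// Hypothetical helper: returns the catalog entry plus its style file path.
// A missing id falls back to the default voice; an unknown id is an error.
function resolveVoice(id) {
  const key = String(id ?? DEFAULT_VOICE).toUpperCase();
  if (!Object.hasOwn(VOICES, key)) {
    throw new Error(`Unknown voice id: ${id}`);
  }
  return { id: key, ...VOICES[key], file: `voice_styles/${key}.json` };
}
```

For example, `resolveVoice("m2")` yields the Hiro entry with `voice_styles/M2.json` as its style file.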
### Model/runtime options
- **TTS model family**: Supertonic 3 from `Supertone/supertonic-3`.
- **ONNX model files**: `duration_predictor.onnx`, `text_encoder.onnx`,
`vector_estimator.onnx`, `vocoder.onnx`.
- **Runtime**: WebGPU first, then WebAssembly fallback.
- **Generation controls**: quality steps from 4 to 16, and speed from 0.7 to
1.8.
- **Output**: mono 44.1 kHz, 16-bit PCM WAV generated locally in the browser.
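The generation controls have fixed ranges, so user input is easy to normalise before synthesis. The sketch below clamps out-of-range values; whether the app clamps or rejects them is not stated here, and the defaults (8 steps, speed 1.05) are borrowed from the CLI flags table rather than from the web UI.

```javascript
// Ranges from the list above: quality steps 4-16, speed 0.7-1.8.
const LIMITS = {
  steps: { min: 4, max: 16 },
  speed: { min: 0.7, max: 1.8 },
};

function clamp(value, { min, max }) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: rounds steps to an integer and clamps both controls.
function normalizeControls({ steps = 8, speed = 1.05 } = {}) {
  return {
    steps: clamp(Math.round(steps), LIMITS.steps),
    speed: clamp(speed, LIMITS.speed),
  };
}
```

With this helper, a request for 40 steps at speed 0.1 would be normalised to 16 steps at speed 0.7.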
## CLI
A standalone Node CLI ships in this package. Install once and run from any
directory. Two equivalent commands are exposed: short (`supertts`) and full
(`supertonic-tts`).
```bash
# global install — Windows, macOS, Linux
npm install -g supertonic-tts
# simplest form — positional text, auto-detects KO/JA/EN
supertts "Hello from Supertonic!"
supertts "안녕하세요"
supertts "こんにちは" --voice M1
# explicit flags
supertts -t "Hi there" -o hi.wav --voice F2
supertts -f input.txt --lang ko -o out.wav
echo "piped text" | supertts -o piped.wav
```
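How the CLI detects the language is not documented here. One plausible approach, sketched below, checks for script-specific Unicode ranges. Treating Han-only text as Japanese is an assumption made because only `ko`, `ja`, and `en` are auto-detected.

```javascript
// Hangul Jamo, compatibility Jamo, and Hangul syllables.
const HANGUL = /[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]/;
// Hiragana, Katakana, and CJK unified ideographs (assumed to mean Japanese).
const JAPANESE = /[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]/;

// Hypothetical detector: Korean wins if any Hangul is present, then Japanese,
// otherwise English is used as the fallback.
function detectLang(text) {
  if (HANGUL.test(text)) return "ko";
  if (JAPANESE.test(text)) return "ja";
  return "en";
}
```

Any other language would still need an explicit `--lang` value, since a script check like this cannot tell, say, German from English.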
On the first synthesis run, model assets (~380 MB) are downloaded automatically
from Hugging Face into a platform-appropriate user cache:
| Platform | Default assets directory |
| --- | --- |
| Windows | `%LOCALAPPDATA%\supertonic-tts\assets` |
| macOS | `~/Library/Caches/supertonic-tts/assets` |
| Linux | `$XDG_CACHE_HOME/supertonic-tts/assets` (or `~/.cache/...`) |
Override with `--assets ` or the `SUPERTONIC_ASSETS` env var. Pre-fetch
without synthesizing via `supertonic-tts --download`.
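The directory rules above can be sketched as a pure function. Two things are assumptions here: the precedence (flag, then environment variable, then platform default) and the Linux fallback suffix, which the table abbreviates and which is assumed to mirror the `$XDG_CACHE_HOME` layout.

```javascript
// Hypothetical resolver for the assets directory.
// `flag` is the value passed to --assets, if any.
function resolveAssetsDir({ flag, platform, env, home }) {
  if (flag) return flag;                                   // assumed to win
  if (env.SUPERTONIC_ASSETS) return env.SUPERTONIC_ASSETS; // env override
  if (platform === "win32") {
    return `${env.LOCALAPPDATA}\\supertonic-tts\\assets`;
  }
  if (platform === "darwin") {
    return `${home}/Library/Caches/supertonic-tts/assets`;
  }
  // Linux and other Unix-likes: honour XDG_CACHE_HOME, else ~/.cache.
  // The suffix after ~/.cache is an assumption (the table elides it).
  const cacheRoot = env.XDG_CACHE_HOME || `${home}/.cache`;
  return `${cacheRoot}/supertonic-tts/assets`;
}
```

In a real script, `platform`, `env`, and `home` would come from `process.platform`, `process.env`, and `os.homedir()`. Passing them in keeps the function easy to test on any machine.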
### CLI flags
| Flag | Default | Description |
| --- | --- | --- |
| `-t, --text ` | — | inline text |
| `-f, --file ` | — | read text from a `.txt` file |
| `-o, --out ` | `./out-.wav` | output WAV path |
| `-l, --lang ` | auto | language tag (auto-detects ko/ja/en; see `--list-langs`) |
| `-v, --voice ` | `F1` | voice id: `F1`–`F3`, `M1`–`M3` |
| `-s, --speed ` | `1.05` | 0.7 – 1.8 |
| `--steps ` | `8` | quality steps 4 – 16 |
| `--silence ` | `0.3` | inter-chunk pause (sec) |
| `--assets ` | auto | override assets directory |
| `--download` | — | only fetch / verify assets |
| `--no-play` | — | don't auto-play the generated WAV |
| `--list-voices` | — | print voice catalog |
| `--list-langs` | — | print supported language tags |
| `-q, --quiet` | — | suppress progress logs |
| `-h, --help` | — | show help |
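To make the `--silence` flag concrete: at 44.1 kHz, the default 0.3 s pause is 0.3 × 44100 = 13230 zero-valued samples between chunks. The sketch below joins audio chunks that way. `joinChunks` is a hypothetical helper, and how the CLI actually splits text into chunks is not described here.

```javascript
const SAMPLE_RATE = 44100;

// Hypothetical helper: concatenates Float32Array chunks with a silent gap
// between consecutive chunks (no gap before the first or after the last).
function joinChunks(chunks, silenceSec = 0.3, sampleRate = SAMPLE_RATE) {
  const gap = Math.round(silenceSec * sampleRate);
  const audioLength = chunks.reduce((n, chunk) => n + chunk.length, 0);
  const total = audioLength + gap * Math.max(0, chunks.length - 1);
  const out = new Float32Array(total); // zero-filled, so gaps are silent
  let offset = 0;
  chunks.forEach((chunk, i) => {
    out.set(chunk, offset);
    offset += chunk.length + (i < chunks.length - 1 ? gap : 0);
  });
  return out;
}
```

Because a new `Float32Array` starts zero-filled, the gaps need no explicit writes.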
By default the generated WAV plays back immediately
(macOS `afplay`, Windows `Media.SoundPlayer`, Linux `paplay`/`aplay`/`play`/
`ffplay`). Playback is blocking — the command returns once the audio has
finished. Pass `--no-play` for batch / scripted usage.
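The per-platform players named above could be selected along these lines. This is a sketch only: the exact arguments the CLI passes are not documented here, so the PowerShell invocation and the `ffplay` flags are assumptions, and `playbackCandidates` is a hypothetical name.

```javascript
// Hypothetical helper: returns [command, args] pairs to try in order.
function playbackCandidates(platform, wavPath) {
  switch (platform) {
    case "darwin":
      return [["afplay", [wavPath]]];
    case "win32": {
      // Single quotes are doubled to stay inside the PowerShell string.
      const quoted = wavPath.replace(/'/g, "''");
      const script = `(New-Object Media.SoundPlayer '${quoted}').PlaySync()`;
      return [["powershell", ["-NoProfile", "-Command", script]]];
    }
    default:
      // Linux: try each player until one is installed.
      return [
        ["paplay", [wavPath]],
        ["aplay", [wavPath]],
        ["play", [wavPath]],
        ["ffplay", ["-nodisp", "-autoexit", wavPath]],
      ];
  }
}
```

Running the first available pair with a synchronous spawn would give the blocking behaviour described above.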
The CLI prints the output path on `stdout` (one line, easy to pipe). All
progress / status messages go to `stderr`.
```bash
# capture the output path without playback
OUT=$(supertts "audio test" --quiet --no-play)
echo "wrote $OUT"
```
## Web app quick start
Requires only Node.js 18+. Model assets (~380 MB) are streamed directly from
Hugging Face, so `git-lfs` is not needed.
```bash
# Install + auto-download the model assets
npm install
# Start the dev server (opens http://localhost:3000)
npm run dev
```
If the asset download was interrupted, just re-run it; existing files are
skipped automatically:
```bash
npm run assets
```
## Production build
```bash
npm run build # outputs to ./dist
npm start # serves ./dist on http://localhost:3000
```
In production builds, the app **fetches model weights directly from the
Hugging Face CDN at runtime** (`huggingface.co/Supertone/supertonic-3`),
so deployments don't have to ship the ~380 MB of `.onnx` files. The CDN sets
proper CORS headers and long cache lifetimes.
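The switch between local and CDN assets might look like the sketch below. The `resolve/main` URL form is Hugging Face's standard file endpoint. The `onnx/` subfolder on the Hub and the `/assets` local base are assumptions that mirror the local project layout, and both function names are hypothetical.

```javascript
const HF_REPO = "Supertone/supertonic-3";
const ONNX_FILES = [
  "duration_predictor.onnx",
  "text_encoder.onnx",
  "vector_estimator.onnx",
  "vocoder.onnx",
];

// Hypothetical helper: picks the base URL for model assets.
function assetBase({ production, localBase = "/assets" }) {
  return production
    ? `https://huggingface.co/${HF_REPO}/resolve/main`
    : localBase;
}

// Builds one URL per ONNX model file (the onnx/ subfolder is assumed).
function modelUrls(options) {
  const base = assetBase(options);
  return ONNX_FILES.map((file) => `${base}/onnx/${file}`);
}
```

In a Vite app, the `production` flag would typically come from `import.meta.env.PROD`.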
## Deploying
### GitHub Pages (zero-config)
A workflow at `.github/workflows/deploy.yml` builds and publishes on every
push to `main`.
1. Push the repo to GitHub
2. In repo settings → Pages → Build and deployment → Source: **GitHub Actions**
3. Push to `main` (or trigger the workflow manually)
4. App is live at `https://.github.io//`
The workflow sets `VITE_BASE=//` so all relative URLs resolve under
the subpath. No model files are uploaded to Pages.
### Vercel
```bash
vercel --prod
```
`vercel.json` is already configured with:
- `Cross-Origin-Opener-Policy: same-origin`
- `Cross-Origin-Embedder-Policy: credentialless` (enables faster
multi-threaded WASM where supported)
- Long-cache headers for `/assets/*`
- A separate `.vercelignore` file excludes the local `assets/` directory from upload
### Self-hosting
`npm run build` emits a fully static `./dist` directory — serve it with any
static host (nginx, Caddy, Cloudflare Pages, S3 + CloudFront, etc.). If you
also want multi-threaded WASM acceleration, send these response headers:
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: credentialless
```
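To check whether those headers took effect, a page can inspect `crossOriginIsolated` and `SharedArrayBuffer`, both of which multi-threaded WASM relies on. The helper name `threadSupport` is hypothetical; the app may not perform this check itself.

```javascript
// Hypothetical helper: reports whether threaded WASM is possible in `scope`
// (a window or worker global). Defaults to the current global object.
function threadSupport(scope = globalThis) {
  const isolated = scope.crossOriginIsolated === true;
  const hasSharedMemory = typeof scope.SharedArrayBuffer === "function";
  return { isolated, hasSharedMemory, threads: isolated && hasSharedMemory };
}

// In a browser console, after deploying with the headers above:
// threadSupport().threads should be true on browsers that support them.
```

If `isolated` is false, the headers are missing or were overridden somewhere in the hosting chain, and WASM runs single-threaded.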
## Project layout
```
.
├── app/ # Vite project root (the web app)
│ ├── index.html
│ ├── main.js # UI + synthesis orchestration
│ ├── helper.js # Supertonic ONNX runtime helpers
│ └── style.css
├── assets/ # Model weights & voice styles (downloaded)
│ ├── onnx/*.onnx
│ ├── onnx/tts.json
│ ├── onnx/unicode_indexer.json
│ └── voice_styles/*.json
├── scripts/
│ └── download-assets.mjs
├── vite.config.js
└── package.json
```
## How it works
1. The browser loads four ONNX models (duration predictor, text encoder,
vector estimator, vocoder) and a voice style tensor.
2. Your text is preprocessed (NFKD-normalised, emoji-stripped, wrapped with
the language tag) and converted to token IDs.
3. A short diffusion loop denoises a latent audio representation.
4. The vocoder synthesises 44.1 kHz, 16-bit PCM. The WAV file is built
client-side and offered for playback / download.
Every step runs locally — your text and the generated audio never leave
the device.
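Step 4 can be illustrated with a minimal WAV writer for mono 16-bit PCM. This is a generic sketch of the standard 44-byte RIFF/WAVE header, not the code in `app/helper.js`; it assumes the vocoder output is a `Float32Array` of samples in the range -1 to 1.

```javascript
// Encodes float samples (-1..1) as a mono, 16-bit PCM WAV file.
function encodeWav(samples, sampleRate = 44100) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));     // clip to [-1, 1]
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(44 + i * 2, Math.round(scaled), true);
  }
  return new Uint8Array(buffer);
}
```

In the browser, the returned bytes can be wrapped as `new Blob([bytes], { type: "audio/wav" })` and passed to `URL.createObjectURL` for playback or download.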
## Troubleshooting
- **"Loading model" stays forever**: open DevTools → Network. If the model
files (`.onnx`) 404, run `npm run assets` again.
- **WebGPU disabled**: only modern Chrome / Edge / Safari Tech Preview
support WebGPU. The app silently falls back to WebAssembly — slower but
works everywhere.
- **DOCX upload fails**: complex DOCX files with embedded objects may not
parse cleanly. Save as plain `.txt` as a fallback.
- **Korean / Japanese sound rushed**: drop "Speed" in Advanced options
to ~0.95.
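The WebGPU-to-WebAssembly fallback mentioned above can be sketched as a provider list plus a retry loop. Both helpers are hypothetical, and `createSession` stands in for whatever session factory the runtime provides; it is not an API from this project.

```javascript
// Prefers WebGPU when the browser exposes navigator.gpu, else WASM only.
function providerOrder(nav) {
  const hasWebGpu = Boolean(nav && "gpu" in nav && nav.gpu);
  return hasWebGpu ? ["webgpu", "wasm"] : ["wasm"];
}

// Tries each provider in order; `createSession(provider)` is a
// caller-supplied async factory (an assumption for this sketch).
async function createWithFallback(providers, createSession) {
  let lastError;
  for (const provider of providers) {
    try {
      return { provider, session: await createSession(provider) };
    } catch (err) {
      lastError = err; // remember the failure and try the next provider
    }
  }
  throw lastError ?? new Error("No execution provider available");
}
```

Logging the chosen `provider` once at startup makes it easy to see in DevTools whether the slower WASM path is in use.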
## License
App code: MIT. Supertonic model weights are subject to
[Supertone's license](https://huggingface.co/Supertone/supertonic-3).