https://github.com/cskwork/supertonic-tts

Local Supertonic 3 text-to-speech: web app (WebGPU/WASM, EN/KO/JA) + cross-platform CLI (supertts) with auto-play

Host: GitHub
URL: https://github.com/cskwork/supertonic-tts
Owner: cskwork
Created: 2026-05-16T15:52:53.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-17T15:02:55.000Z (about 1 month ago)
Last Synced: 2026-05-17T17:26:22.765Z (about 1 month ago)
Language: JavaScript
Homepage: https://cskwork.github.io/supertonic-tts/
Size: 71.3 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Supertonic TTS — Web App + CLI

A clean, beginner-friendly text-to-speech project built on
[Supertonic 3](https://github.com/supertone-inc/supertonic). Two ways to use it:

- **Web app** — Three ready-made UI languages (English, Korean, Japanese), six
preset voices with one-tap preview, paste-or-upload input (`.txt` / `.docx`),
instant WAV download. Runs entirely in your browser via WebGPU/WebAssembly.
- **CLI** — `supertonic-tts "hello"` from any terminal on macOS, Windows, or
Linux. Installed globally with `npm`, native ONNX runtime, no GPU required.

No accounts, no API keys, no cloud round-trips.

## Features

- **3 UI languages**: English, Korean, Japanese
- **32 TTS language tags** available in the underlying Supertonic text processor
- **6 voice styles** with click-to-preview
- **Paste or upload**: drop in `.txt` or `.docx`
- **Sample text presets** per language
- **One-tap "Speak"** with autoplay + transcript view
- **WAV download** of any generated audio
- **WebGPU acceleration** with automatic WASM fallback
- **Fully local**: text never leaves the browser

## Supported TTS options

The app has two language layers:

- **Current UI choices**: English (`en`), Korean (`ko`), Japanese (`ja`).
These are the languages with ready-made sample text, preview text, and UI
tabs in `app/main.js`.
- **Underlying Supertonic language tags**: `en`, `ko`, `ja`, `ar`, `bg`, `cs`,
`da`, `de`, `el`, `es`, `et`, `fi`, `fr`, `hi`, `hr`, `hu`, `id`, `it`,
`lt`, `lv`, `nl`, `pl`, `pt`, `ro`, `ru`, `sk`, `sl`, `sv`, `tr`, `uk`,
`vi`, `na`. These are accepted by the text processor in `app/helper.js`.

To expose another language in the UI, add an entry to `LANGS` in `app/main.js`
with preview and preset text, then add or render the matching language tab.
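As a minimal sketch of the two layers, the snippet below keeps the full tag list next to the UI subset and validates a user-supplied tag. The tag values are copied from the list above; `SUPPORTED_TAGS`, `UI_TAGS`, and `resolveTag` are hypothetical names for illustration, and the actual checks in `app/helper.js` may be structured differently.

```javascript
// Hypothetical sketch: the tag values come from the README, but these
// names are illustrative and are not taken from app/helper.js.
const SUPPORTED_TAGS = new Set([
  "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et",
  "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl",
  "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi", "na",
]);

// Languages that currently have a UI tab, sample text, and preview text.
const UI_TAGS = ["en", "ko", "ja"];

// Normalise a user-supplied tag and reject anything outside the list.
function resolveTag(tag) {
  const normalised = String(tag).trim().toLowerCase();
  if (!SUPPORTED_TAGS.has(normalised)) {
    throw new RangeError(`Unsupported TTS language tag: ${tag}`);
  }
  return normalised;
}
```

Keeping the UI subset separate from the full set means a new UI language only needs an entry in the smaller list, while validation keeps using the complete one.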

### Voice styles

Every voice style can be used with every supported TTS language tag:

| ID | Display name | Type | Style file |
| --- | --- | --- | --- |
| `F1` | Mina | Female | `voice_styles/F1.json` |
| `F2` | Sora | Female | `voice_styles/F2.json` |
| `F3` | Yuna | Female | `voice_styles/F3.json` |
| `M1` | Aiden | Male | `voice_styles/M1.json` |
| `M2` | Hiro | Male | `voice_styles/M2.json` |
| `M3` | Leo | Male | `voice_styles/M3.json` |

`F1` / Mina is the default voice. Voice styles are downloaded from
`Supertone/supertonic-3` and loaded on demand from `assets/voice_styles/` in
development, or from the Hugging Face CDN in production.
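The voice table can be sketched as a small lookup with `F1` as the fallback. The IDs, display names, and file paths are taken from the table above; `VOICES`, `DEFAULT_VOICE`, and `styleFileFor` are hypothetical names and are not confirmed to exist in the app.

```javascript
// Voice data copied from the README table. The identifiers below are
// hypothetical and only illustrate the ID-to-style-file mapping.
const VOICES = {
  F1: { name: "Mina", type: "Female" },
  F2: { name: "Sora", type: "Female" },
  F3: { name: "Yuna", type: "Female" },
  M1: { name: "Aiden", type: "Male" },
  M2: { name: "Hiro", type: "Male" },
  M3: { name: "Leo", type: "Male" },
};

const DEFAULT_VOICE = "F1";

// Return the relative style file for a voice ID, defaulting to F1 / Mina.
function styleFileFor(voiceId = DEFAULT_VOICE) {
  const id = String(voiceId).toUpperCase();
  if (!Object.hasOwn(VOICES, id)) {
    throw new RangeError(`Unknown voice style: ${voiceId}`);
  }
  return `voice_styles/${id}.json`;
}
```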

### Model/runtime options

- **TTS model family**: Supertonic 3 from `Supertone/supertonic-3`.
- **ONNX model files**: `duration_predictor.onnx`, `text_encoder.onnx`,
`vector_estimator.onnx`, `vocoder.onnx`.
- **Runtime**: WebGPU first, then WebAssembly fallback.
- **Generation controls**: quality steps from 4 to 16, and speed from 0.7 to
1.8.
- **Output**: mono 44.1 kHz, 16-bit PCM WAV generated locally in the browser.
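To illustrate the generation controls, here is a small sketch that keeps user input inside the documented ranges. The ranges come from the list above; `LIMITS`, `clamp`, and `normaliseControls` are hypothetical helpers, and rounding the step count to an integer is an assumption.

```javascript
// Ranges from the README; helper names are hypothetical.
const LIMITS = {
  steps: { min: 4, max: 16 },
  speed: { min: 0.7, max: 1.8 },
};

// Restrict a number to the inclusive range [min, max].
function clamp(value, { min, max }) {
  return Math.min(max, Math.max(min, value));
}

// Assumption: quality steps are whole numbers, so they are rounded.
function normaliseControls({ steps, speed }) {
  return {
    steps: Math.round(clamp(steps, LIMITS.steps)),
    speed: clamp(speed, LIMITS.speed),
  };
}
```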

## CLI

A standalone Node CLI ships in this package. Install once and run from any
directory. Two equivalent commands are exposed: short (`supertts`) and full
(`supertonic-tts`).

```bash
# global install — Windows, macOS, Linux
npm install -g supertonic-tts

# simplest form — positional text, auto-detects KO/JA/EN
supertts "Hello from Supertonic!"
supertts "안녕하세요"
supertts "こんにちは" --voice M1

# explicit flags
supertts -t "Hi there" -o hi.wav --voice F2
supertts -f input.txt --lang ko -o out.wav
echo "piped text" | supertts -o piped.wav
```
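The README does not show how the KO/JA/EN auto-detection is implemented. One plausible approach is a Unicode script check, sketched below. `detectLang` is a hypothetical function, and treating Han-only text as Japanese is an assumption of this sketch, not documented CLI behaviour.

```javascript
// Hypothetical sketch of script-based language detection.
// Uses Unicode property escapes, which require the `u` flag.
const HANGUL = /\p{Script=Hangul}/u;
const JAPANESE = /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/u;

function detectLang(text) {
  if (HANGUL.test(text)) return "ko";
  // Assumption: Han characters without Hangul are treated as Japanese.
  if (JAPANESE.test(text)) return "ja";
  return "en";
}
```

When detection would be ambiguous, passing `--lang` explicitly, as in the examples above, avoids relying on any heuristic.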

On the first synthesis run, model assets (~380 MB) are auto-downloaded from
Hugging Face into a platform-appropriate user cache:

| Platform | Default assets directory |
| --- | --- |
| Windows | `%LOCALAPPDATA%\supertonic-tts\assets` |
| macOS | `~/Library/Caches/supertonic-tts/assets` |
| Linux | `$XDG_CACHE_HOME/supertonic-tts/assets` (or `~/.cache/...`) |

Override with the `--assets` option or the `SUPERTONIC_ASSETS` env var.
Pre-fetch without synthesizing via `supertonic-tts --download`.
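The directory lookup can be sketched as a pure function. The per-platform locations follow the table above. The precedence (flag first, then env var, then default) is an assumption, as is the Linux fallback, which applies the XDG default of `~/.cache` with the same `supertonic-tts/assets` suffix. `defaultAssetsDir` and `resolveAssetsDir` are hypothetical names.

```javascript
// Hypothetical sketch. `platform` uses Node's process.platform values.
function defaultAssetsDir({ platform, env, home }) {
  switch (platform) {
    case "win32":
      return `${env.LOCALAPPDATA}\\supertonic-tts\\assets`;
    case "darwin":
      return `${home}/Library/Caches/supertonic-tts/assets`;
    default: {
      // Assumption: fall back to the XDG default cache root (~/.cache).
      const cacheRoot = env.XDG_CACHE_HOME || `${home}/.cache`;
      return `${cacheRoot}/supertonic-tts/assets`;
    }
  }
}

// Assumption: an explicit flag wins over the env var, which wins over
// the platform default. The README does not state the actual order.
function resolveAssetsDir({ flag, platform, env, home }) {
  return flag || env.SUPERTONIC_ASSETS || defaultAssetsDir({ platform, env, home });
}
```

In a real Node script the inputs would typically be `process.platform`, `process.env`, and `os.homedir()`.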

### CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| `-t, --text` | — | inline text |
| `-f, --file` | — | read text from a `.txt` file |
| `-o, --out` | … | … |

By default the generated WAV plays back immediately
(macOS `afplay`, Windows `Media.SoundPlayer`, Linux `paplay`/`aplay`/`play`/
`ffplay`). Playback is blocking — the command returns once the audio has
finished. Pass `--no-play` for batch / scripted usage.
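A rough sketch of how a per-platform player could be chosen is shown below. The player names come from the paragraph above. The exact arguments (the PowerShell one-liner and the `ffplay` flags) are assumptions, and `playerCandidates` is a hypothetical helper, not the CLI's actual code.

```javascript
// Hypothetical sketch: returns [command, args] pairs to try in order.
function playerCandidates(platform, wavPath) {
  if (platform === "darwin") {
    return [["afplay", [wavPath]]];
  }
  if (platform === "win32") {
    // Escape single quotes for a PowerShell single-quoted string.
    const quoted = wavPath.replace(/'/g, "''");
    const script = `(New-Object Media.SoundPlayer '${quoted}').PlaySync()`;
    return [["powershell", ["-NoProfile", "-Command", script]]];
  }
  // Linux: try each player in the order listed in the README.
  return [
    ["paplay", [wavPath]],
    ["aplay", [wavPath]],
    ["play", [wavPath]],
    ["ffplay", ["-nodisp", "-autoexit", wavPath]],
  ];
}
```

Passing each pair to `spawnSync` from `node:child_process` until one exits successfully would give the blocking behaviour described above.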

The CLI prints the output path on `stdout` (one line, easy to pipe). All
progress / status messages go to `stderr`.

```bash
# capture the output path without playback
OUT=$(supertts "audio test" --quiet --no-play)
echo "wrote $OUT"
```

## Web app quick start

Requires only Node.js 18 or later. Model assets (~380 MB) are streamed
directly from Hugging Face, so `git-lfs` is not needed.

```bash
# Install + auto-download the model assets
npm install

# Start the dev server (opens http://localhost:3000)
npm run dev
```

If the asset download was interrupted, just re-run it; existing files are
skipped automatically:

```bash
npm run assets
```

## Production build

```bash
npm run build # outputs to ./dist
npm start # serves ./dist on http://localhost:3000
```

In production builds, the app **fetches model weights directly from the
Hugging Face CDN at runtime** (`huggingface.co/Supertone/supertonic-3`),
so deployments don't have to ship the 380 MB of `.onnx` files. The CDN sets
proper CORS headers and long cache lifetimes.
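As an illustration of runtime fetching, the sketch below builds download URLs using the Hugging Face `resolve/<revision>/<path>` pattern. It assumes the model repository mirrors the local `assets/` layout (for example `onnx/vocoder.onnx`), which the README does not confirm. `hfAssetUrl` is a hypothetical helper.

```javascript
// Hypothetical sketch; the repo ID comes from the README.
const HF_REPO = "Supertone/supertonic-3";

// Build a direct-download URL for a file inside the model repository.
// Assumption: files are addressed as onnx/<name> and voice_styles/<name>.
function hfAssetUrl(relativePath, revision = "main") {
  const encodedPath = relativePath
    .split("/")
    .map(encodeURIComponent)
    .join("/");
  return `https://huggingface.co/${HF_REPO}/resolve/${encodeURIComponent(revision)}/${encodedPath}`;
}
```

A production build could then pass the result to `fetch()` and hand the resulting bytes to the ONNX runtime.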

## Deploying

### GitHub Pages (zero-config)

A workflow at `.github/workflows/deploy.yml` builds and publishes on every
push to `main`.

1. Push the repo to GitHub
2. In repo settings → Pages → Build and deployment → Source: **GitHub Actions**
3. Push to `main` (or trigger the workflow manually)
4. The app is live at your GitHub Pages URL (for this repository, `https://cskwork.github.io/supertonic-tts/`)

The workflow sets `VITE_BASE` to the repository's subpath so all relative
URLs resolve under it. No model files are uploaded to Pages.

### Vercel

```bash
vercel --prod
```

`vercel.json` is already configured with:

- `Cross-Origin-Opener-Policy: same-origin`
- `Cross-Origin-Embedder-Policy: credentialless` (enables faster
multi-threaded WASM where supported)
- Long-cache headers for `/assets/*`

A separate `.vercelignore` file excludes the local `assets/` directory from
upload.

### Self-hosting

`npm run build` emits a fully static `./dist` directory — serve it with any
static host (nginx, Caddy, Cloudflare Pages, S3 + CloudFront, etc.). If you
also want multi-threaded WASM acceleration, send these response headers:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: credentialless
```
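These headers matter because browsers expose `SharedArrayBuffer`, which threaded WASM needs, only on cross-origin isolated pages. The sketch below shows one way a page could decide how many threads to request. `wasmThreadCount` is a hypothetical helper and the cap of 4 threads is an arbitrary assumption.

```javascript
// Hypothetical sketch: `scope` is the global object (globalThis in a
// browser), passed in so the logic can be exercised outside a browser.
function wasmThreadCount(scope) {
  const isolated =
    scope.crossOriginIsolated === true &&
    typeof scope.SharedArrayBuffer === "function";
  if (!isolated) return 1; // single-threaded WASM still works

  const cores = scope.navigator?.hardwareConcurrency ?? 1;
  // Assumption: cap at 4 threads; the app may choose differently.
  return Math.max(1, Math.min(cores, 4));
}
```

In a browser this would be called as `wasmThreadCount(globalThis)` before the runtime is initialised.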

## Project layout

```
.
├── app/ # Vite project root (the web app)
│ ├── index.html
│ ├── main.js # UI + synthesis orchestration
│ ├── helper.js # Supertonic ONNX runtime helpers
│ └── style.css
├── assets/ # Model weights & voice styles (downloaded)
│ ├── onnx/*.onnx
│ ├── onnx/tts.json
│ ├── onnx/unicode_indexer.json
│ └── voice_styles/*.json
├── scripts/
│ └── download-assets.mjs
├── vite.config.js
└── package.json
```

## How it works

1. The browser loads four ONNX models (duration predictor, text encoder,
vector estimator, vocoder) and a voice style tensor.
2. Your text is preprocessed (NFKD-normalised, emoji-stripped, wrapped with
the language tag) and converted to token IDs.
3. A short diffusion loop denoises a latent audio representation.
4. The vocoder synthesises 44.1 kHz, 16-bit PCM. The WAV file is built
client-side and offered for playback / download.

Every step runs locally — your text and the generated audio never leave
the device.
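The last step, building the WAV file client-side, can be sketched as follows. This is a generic encoder for mono 16-bit PCM with the standard 44-byte RIFF header, not the app's actual code. `encodeWav` is a hypothetical name, and the input is assumed to be float samples in the range -1 to 1.

```javascript
// Hypothetical sketch: encode mono float samples as a 16-bit PCM WAV.
function encodeWav(samples, sampleRate = 44100) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample, then scale to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  }
  return buffer;
}
```

In a browser, wrapping the result in `new Blob([buffer], { type: "audio/wav" })` and passing it to `URL.createObjectURL` yields a URL that works for both playback and download.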

## Troubleshooting

- **"Loading model" stays forever**: open DevTools → Network. If the model
files (`.onnx`) 404, run `npm run assets` again.
- **WebGPU disabled**: WebGPU is available only in recent Chrome, Edge, and
Safari Technology Preview builds. The app silently falls back to
WebAssembly, which is slower but works everywhere.
- **DOCX upload fails**: complex DOCX files with embedded objects may not
parse cleanly. Save as plain `.txt` as a fallback.
- **Korean / Japanese sound rushed**: drop "Speed" in Advanced options
to ~0.95.
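To check which backend a browser can use, a probe like the one below can be pasted into the DevTools console. It uses the standard `navigator.gpu.requestAdapter()` call. `pickBackend` is a hypothetical helper, and the app's own fallback logic may differ.

```javascript
// Hypothetical sketch: returns "webgpu" when an adapter is available,
// otherwise "wasm". `nav` is normally the browser's `navigator` object.
async function pickBackend(nav) {
  if (nav && nav.gpu && typeof nav.gpu.requestAdapter === "function") {
    try {
      // requestAdapter() resolves to null when no suitable GPU exists.
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // Treat any failure as "WebGPU unavailable" and fall through.
    }
  }
  return "wasm";
}
```

Running `pickBackend(navigator).then(console.log)` in the console prints the backend this sketch would select.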

## License

App code: MIT. Supertonic model weights are subject to
[Supertone's license](https://huggingface.co/Supertone/supertonic-3).
