https://github.com/itamaker/kitten-tts-go

Go implementation of KittenTTS — an ultra-lightweight ONNX-based text-to-speech engine. Self-contained binaries, no Python dependency.
Host: GitHub
URL: https://github.com/itamaker/kitten-tts-go
Owner: itamaker
License: mit
Created: 2026-05-29T13:24:33.000Z (25 days ago)
Default Branch: main
Last Pushed: 2026-06-08T08:04:28.000Z (15 days ago)
Last Synced: 2026-06-08T09:22:37.685Z (15 days ago)
Language: Go
Size: 79.1 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

# kitten-tts-go 🐱🐹

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/itamaker/kitten-tts-go/blob/main/examples/kitten-tts-go-colab.ipynb)

Go implementation of [KittenTTS](https://github.com/KittenML/KittenTTS) — an ultra-lightweight ONNX-based text-to-speech engine. Self-contained binaries with no Python dependency.

> **Try it now:** the [Colab notebook](examples/kitten-tts-go-colab.ipynb) builds the project and synthesizes speech in three clicks — no local setup, no GPU.

It produces two binaries:

* `kitten-tts` — a CLI tool for one-off speech generation. Ideally suited for AI agent skills.

* `kitten-tts-server` — an OpenAI-compatible API server with [SSE streaming](#sse-streaming) support.

> **Adapted from:** [KittenML/KittenTTS](https://github.com/KittenML/KittenTTS) (Apache-2.0). All model weights are from the original project.

## Key Features

- **Ultra-lightweight** — 15M to 80M parameter models; smallest is just 25 MB (int8)

- **CPU-optimized** — ONNX-based inference runs efficiently without a GPU

- **8 built-in voices** — Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, and Leo

- **Adjustable speech speed** — control playback rate via `-speed`

- **Text normalization** — built-in pipeline handles numbers and currencies

- **24 kHz output** — high-quality audio at a standard sample rate

- **Multiple audio formats** — MP3, FLAC, WAV, and PCM (pure Go) plus OGG Opus (via libopus)

## Dependencies

This port relies on three system dependencies:

### 1. ONNX Runtime shared library

Go inference uses [`yalue/onnxruntime_go`](https://github.com/yalue/onnxruntime_go), which loads the ONNX Runtime shared library dynamically at runtime.

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian: download a release from
# https://github.com/microsoft/onnxruntime/releases and place
# libonnxruntime.so on your library path
```

The library is auto-detected at common locations (`/usr/local/lib`, `/opt/homebrew/lib`, `/usr/lib`, …). To point at a specific file, set the `ONNXRUNTIME_LIB_PATH` environment variable:

```bash
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib
```
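The lookup order described above (explicit override first, then well-known directories) can be sketched as follows. This is a minimal illustration, not the project's actual code: `resolveLib`, `libFileName`, and the injected `exists` callback are hypothetical names, and the directory list is just the three locations named above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
)

// libFileName returns the conventional ONNX Runtime file name for an OS.
func libFileName(goos string) string {
	if goos == "darwin" {
		return "libonnxruntime.dylib"
	}
	return "libonnxruntime.so"
}

// resolveLib returns override when it is set; otherwise it returns the first
// candidate path for which exists reports true. exists is injected so the
// search can be tested without touching the file system.
func resolveLib(override, goos string, dirs []string, exists func(string) bool) (string, error) {
	if override != "" {
		return override, nil
	}
	name := libFileName(goos)
	for _, dir := range dirs {
		candidate := filepath.Join(dir, name)
		if exists(candidate) {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("%s not found in %v; set ONNXRUNTIME_LIB_PATH", name, dirs)
}

// fileExists reports whether a path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	dirs := []string{"/usr/local/lib", "/opt/homebrew/lib", "/usr/lib"}
	path, err := resolveLib(os.Getenv("ONNXRUNTIME_LIB_PATH"), runtime.GOOS, dirs, fileExists)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("using", path)
}
```

Checking the environment variable before any directory scan keeps the override predictable even when several copies of the library are installed.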

### 2. espeak-ng (for phonemization)

```bash
# macOS
brew install espeak-ng

# Ubuntu/Debian
sudo apt-get install -y espeak-ng
```

### 3. libopus + libopusfile (for Opus encoding)

Opus output is encoded with libopus via cgo. The [`hraban/opus`](https://gopkg.in/hraban/opus.v2) binding links both **libopus** and **libopusfile** (`#cgo pkg-config: opus opusfile`), so both are **build-time** dependencies (and must be on the library path at runtime). Building therefore requires a C compiler and `CGO_ENABLED=1` (the default for native builds).

```bash
# macOS
brew install opus opusfile pkg-config

# Ubuntu/Debian
sudo apt-get install -y libopus-dev libopusfile-dev pkg-config
```

> The libopus/libopusfile/pkg-config packages above are only needed when **building from source**. The **released binaries statically link** libopus and libopusfile, so the target machine needs only the ONNX Runtime shared library (`dlopen`'d at runtime) and espeak-ng; no Opus libraries are required.

## Available Models

| Model | Parameters | Size | Download |
|---|---|---|---|
| kitten-tts-mini | 80M | 80 MB | [KittenML/kitten-tts-mini-0.8](https://huggingface.co/KittenML/kitten-tts-mini-0.8) |
| kitten-tts-micro | 40M | 41 MB | [KittenML/kitten-tts-micro-0.8](https://huggingface.co/KittenML/kitten-tts-micro-0.8) |
| kitten-tts-nano | 15M | 56 MB | [KittenML/kitten-tts-nano-0.8](https://huggingface.co/KittenML/kitten-tts-nano-0.8-fp32) |
| kitten-tts-nano (int8) | 15M | 25 MB | [KittenML/kitten-tts-nano-0.8-int8](https://huggingface.co/KittenML/kitten-tts-nano-0.8-int8) |

### Downloading a model

Models are not vendored in this repository. Fetch one into `./models` with the helper script:

```bash
scripts/fetch_model.sh nano-int8   # also: nano, micro, mini (default: nano-int8)
```

It reads the ONNX/voices filenames from the model's `config.json`, so it works for every model above. (`./models` is git-ignored.) Equivalent manual download:

```bash
mkdir -p models/kitten-tts-nano-int8

for FILE in config.json kitten_tts_nano_v0_8.onnx voices.npz; do
  curl -L -o "models/kitten-tts-nano-int8/$FILE" \
    "https://huggingface.co/KittenML/kitten-tts-nano-0.8-int8/resolve/main/$FILE"
done
```

## Build

```bash
go build -o bin/ ./...
# Binaries at: bin/kitten-tts and bin/kitten-tts-server
```

> Building requires a C compiler and libopus/libopusfile (see [Dependencies](#dependencies)), since Opus encoding is compiled in via cgo.

### Releases

Pushing a `v*` tag triggers [`.github/workflows/release.yml`](.github/workflows/release.yml), which builds the binaries with `go build` on GitHub runners and publishes a GitHub Release with one `.tar.gz` per platform plus `checksums.txt`:

```bash
git tag v0.1.1
git push origin v0.1.1
```

libopus/libopusfile are statically linked from source, so each target is built on its own native runner (Linux amd64/arm64 and macOS arm64). The exception is darwin/amd64, which is cross-compiled on the Apple Silicon runner because Apple's clang can target both macOS architectures. To build locally:

```bash
go build -o bin/ ./...
```

### End-to-end smoke test

[`scripts/smoke_test.sh`](scripts/smoke_test.sh) builds the binaries and exercises every audio format plus SSE streaming, fully offline, against a local model:

```bash
# Pass a model dir (or set KITTEN_MODEL_DIR, or place one at models/kitten-tts-nano-int8)
scripts/smoke_test.sh /path/to/models/kitten-tts-nano-int8

# Test prebuilt/release binaries instead of building:
KITTEN_BIN_DIR=dist/... scripts/smoke_test.sh /path/to/model
```

It checks the CLI (wav/mp3/flac/opus/pcm + `-list-voices`) and the server (`/health`, `/v1/models`, all formats, streaming, and 400 validation), printing a pass/fail summary and exiting non-zero on failure. It needs espeak-ng and the ONNX Runtime library, and uses `ffprobe` for codec checks when available.

## Generate Speech (CLI)

Following Go convention, **flags come before the positional arguments** (`<model-dir> <text> [voice]`):

```bash
# Basic usage (outputs output.wav)
./bin/kitten-tts ./models/kitten-tts-nano-int8 'Hello, world!' Bruno

# Specify voice, speed, and output (flags first)
./bin/kitten-tts -voice Luna -speed 1.2 -output hello.wav ./models/kitten-tts-nano-int8 'Hello, world!'

# Encode directly to another format with -format
./bin/kitten-tts -format mp3 -output hello.mp3 ./models/kitten-tts-nano-int8 'Hello, world!'

# List available voices (no model directory needed)
./bin/kitten-tts -list-voices
```

CLI flags (single or double dash both work):

| Flag | Default | Description |
|---|---|---|
| `-voice`, `-v` | `Bruno` | Voice name (overrides the positional voice) |
| `-speed`, `-s` | `1.0` | Speech speed multiplier |
| `-output`, `-o` | `output.wav` | Output file path |
| `-format` | `wav` | Output format: `wav`, `mp3`, `flac`, `opus`, `pcm` |
| `-no-clean` | | Disable text normalization (numbers, currency) |
| `-list-voices` | | List available voices and exit (no model directory needed) |

> Because the CLI uses Go's standard `flag` package, any flag placed *after* a positional argument is treated as a positional. Put flags first.
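The standard-library behavior behind this rule is easy to reproduce. The sketch below is not the CLI's source; `parse` and its single `-voice` flag are stand-ins chosen for illustration. It shows that `flag` stops parsing at the first non-flag argument.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parse mimics a CLI with a single -voice flag. It returns the flag value
// and whatever the flag package left over as positional arguments.
func parse(args []string) (voice string, positionals []string, err error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.StringVar(&voice, "voice", "Bruno", "voice name")
	if err = fs.Parse(args); err != nil {
		return "", nil, err
	}
	return voice, fs.Args(), nil
}

func main() {
	// Flags first: -voice is parsed and two positionals remain.
	voice, rest, _ := parse([]string{"-voice", "Luna", "model-dir", "Hello"})
	fmt.Println(voice, rest)

	// Flags last: parsing stops at "model-dir", so -voice stays a positional
	// and the default voice is kept.
	voice, rest, _ = parse([]string{"model-dir", "Hello", "-voice", "Luna"})
	fmt.Println(voice, rest)
}
```

The same package also explains why single and double dashes both work: `flag` accepts `-voice` and `--voice` interchangeably.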

## Run the API Server

Flags first, then the model directory:

```bash
./bin/kitten-tts-server -host 0.0.0.0 -port 8080 ./models/kitten-tts-nano-int8
```

The server exposes an OpenAI-compatible `/v1/audio/speech` endpoint:

```bash
curl -X POST http://localhost:8080/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "kitten-tts",
    "input": "Hello, world! This is KittenTTS running as an API server.",
    "voice": "alloy"
  }' \
  --output speech.mp3
```
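The same call can be made from Go with only the standard library. This is a rough client sketch: the `speechRequest` struct and `newSpeechRequest` helper are illustrative names, not part of this repository, and the base URL assumes the server started with the command above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// speechRequest carries the JSON fields used in the curl example.
type speechRequest struct {
	Model          string `json:"model,omitempty"`
	Input          string `json:"input"`
	Voice          string `json:"voice"`
	ResponseFormat string `json:"response_format,omitempty"`
}

// newSpeechRequest builds a POST request for the /v1/audio/speech endpoint.
func newSpeechRequest(baseURL string, body speechRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newSpeechRequest("http://localhost:8080", speechRequest{
		Model: "kitten-tts",
		Input: "Hello, world!",
		Voice: "alloy",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	// To send it: resp, err := http.DefaultClient.Do(req); check
	// resp.StatusCode, then copy resp.Body to a file such as speech.mp3.
}
```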

**Request body:**

| Field | Type | Default | Description |
|---|---|---|---|
| `input` | string | *(required)* | Text to synthesize |
| `voice` | string | *(required)* | Voice name (OpenAI or KittenTTS names) |
| `model` | string | `""` | Accepted for compatibility; ignored |
| `response_format` | string | `"mp3"` | Output audio format (see below) |
| `speed` | float | `1.0` | Speech speed multiplier (0.25–4.0) |
| `stream` | bool | `false` | Enable SSE streaming (requires `"pcm"` format) |

Request bodies are capped at 1 MiB. Invalid requests (bad JSON, empty `input`, an unknown `voice` or `response_format`, or `speed` out of range) return HTTP 400 with an OpenAI-style JSON error body.
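As an illustration of these rules, here is a hedged validation sketch. The `SpeechRequest` struct and `validate` function are hypothetical, not the server's code. Two points are assumptions: a zero `speed` is treated as "unset" and replaced by the default, and `stream` with a non-PCM format is rejected rather than silently ignored. Voice-name lookup is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// SpeechRequest mirrors the documented request fields.
type SpeechRequest struct {
	Input          string  `json:"input"`
	Voice          string  `json:"voice"`
	ResponseFormat string  `json:"response_format"`
	Speed          float64 `json:"speed"`
	Stream         bool    `json:"stream"`
}

// formats lists the response formats from the table of supported formats.
var formats = map[string]bool{"mp3": true, "flac": true, "wav": true, "pcm": true, "opus": true}

// validate fills in the documented defaults, then applies the documented
// checks. A non-nil error would map to an HTTP 400 response.
func validate(r *SpeechRequest) error {
	if r.ResponseFormat == "" {
		r.ResponseFormat = "mp3"
	}
	if r.Speed == 0 { // assumption: zero means the field was omitted
		r.Speed = 1.0
	}
	switch {
	case r.Input == "":
		return errors.New("input is required")
	case r.Voice == "":
		return errors.New("voice is required")
	case !formats[r.ResponseFormat]:
		return fmt.Errorf("unsupported response_format %q", r.ResponseFormat)
	case r.Speed < 0.25 || r.Speed > 4.0:
		return errors.New("speed must be between 0.25 and 4.0")
	case r.Stream && r.ResponseFormat != "pcm":
		return errors.New("stream requires response_format \"pcm\"")
	}
	return nil
}

func main() {
	req := SpeechRequest{Input: "Hello", Voice: "alloy", Speed: 9}
	fmt.Println(validate(&req)) // rejected: speed out of range

	req.Speed = 1.2
	fmt.Println(validate(&req), req.ResponseFormat) // accepted, format defaulted
}
```

In an `net/http` handler, the 1 MiB cap can be enforced with `http.MaxBytesReader` before decoding the body.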

**Supported audio formats:**

| Format | Content-Type | Description |
|---|---|---|
| `mp3` | `audio/mpeg` | MP3 (resampled to 44.1 kHz, pure-Go [shine](https://github.com/braheezy/shine-mp3) encoder) |
| `flac` | `audio/flac` | FLAC lossless (24 kHz native, pure-Go [mewkiz/flac](https://github.com/mewkiz/flac)) |
| `wav` | `audio/wav` | WAV 16-bit PCM (24 kHz native) |
| `pcm` | `audio/pcm` | Raw 16-bit signed little-endian PCM (24 kHz) |
| `opus` | `audio/ogg` | Opus in OGG container (resampled to 48 kHz) |
| `aac` | n/a | Not supported (returns error) |
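To make the `pcm` row concrete, the sketch below converts floating-point samples to raw 16-bit signed little-endian bytes. It assumes the samples are `float32` values in the range [-1, 1], which this README does not state; `floatToPCM16` and `durationSeconds` are illustrative helpers, not the project's encoder.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// floatToPCM16 clamps each sample to [-1, 1], scales it to the int16 range,
// and writes it as two little-endian bytes.
func floatToPCM16(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		v := int16(math.Round(float64(s) * 32767))
		binary.LittleEndian.PutUint16(out[2*i:], uint16(v))
	}
	return out
}

// durationSeconds derives the playback length of mono 16-bit PCM.
func durationSeconds(pcm []byte, sampleRate int) float64 {
	return float64(len(pcm)/2) / float64(sampleRate)
}

func main() {
	pcm := floatToPCM16([]float32{0, 0.5, -0.5, 1})
	fmt.Printf("% x\n", pcm)
	fmt.Println(durationSeconds(make([]byte, 48000), 24000), "seconds")
}
```

Because raw PCM has no header, a player needs the sample rate (24 kHz), sample width (16-bit), and channel count supplied out of band.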

> **Note on Opus:** there is no pure-Go Opus encoder, so Opus uses [libopus](https://opus-codec.org/) via cgo ([`hraban/opus`](https://gopkg.in/hraban/opus.v2)) plus a hand-written RFC 7845 OGG writer. libopus and libopusfile must be installed at build time (see [Dependencies](#dependencies)).

**API endpoints:**

| Method | Path | Description |
|---|---|---|
| `POST` | `/v1/audio/speech` | Generate speech from text |
| `GET` | `/v1/models` | List loaded model |
| `GET` | `/health` | Health check |

**Voice mapping (OpenAI → KittenTTS):**

| OpenAI | KittenTTS | Gender |
|---|---|---|
| alloy | Bella | Female |
| echo | Jasper | Male |
| fable | Luna | Female |
| onyx | Bruno | Male |
| nova | Rosie | Female |
| shimmer | Hugo | Male |

All 8 KittenTTS voices (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo) can also be used directly by name.
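The mapping above amounts to a two-step lookup: try the OpenAI alias table, then fall back to the native voice names. The sketch below is one way to write it; `resolveVoice` is a hypothetical helper, and case-insensitive matching is an assumption rather than documented server behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// openAIAliases maps OpenAI voice names to KittenTTS voices (table above).
var openAIAliases = map[string]string{
	"alloy": "Bella", "echo": "Jasper", "fable": "Luna",
	"onyx": "Bruno", "nova": "Rosie", "shimmer": "Hugo",
}

// kittenVoices lists the eight native voices.
var kittenVoices = []string{"Bella", "Jasper", "Luna", "Bruno", "Rosie", "Hugo", "Kiki", "Leo"}

// resolveVoice returns the KittenTTS voice for an OpenAI alias or a native
// name, or an error for anything else.
func resolveVoice(name string) (string, error) {
	if v, ok := openAIAliases[strings.ToLower(name)]; ok {
		return v, nil
	}
	for _, v := range kittenVoices {
		if strings.EqualFold(v, name) {
			return v, nil
		}
	}
	return "", fmt.Errorf("unknown voice %q", name)
}

func main() {
	for _, name := range []string{"alloy", "Kiki", "robot"} {
		v, err := resolveVoice(name)
		fmt.Println(name, "->", v, err)
	}
}
```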

### SSE Streaming

For lower time-to-first-audio on longer texts, set `"stream": true` with `"response_format": "pcm"`. The server returns [Server-Sent Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events) with base64-encoded PCM audio chunks, compatible with the OpenAI streaming TTS format:

```bash
curl -N -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kitten-tts",
    "input": "Hello, this is a streaming test. Each sentence is sent as a separate audio chunk.",
    "voice": "alloy",
    "response_format": "pcm",
    "stream": true
  }'
```

Each event is a JSON object on a `data:` line:

```
data: {"type":"speech.audio.delta","delta":"<base64-encoded PCM>"}

data: {"type":"speech.audio.delta","delta":"<base64-encoded PCM>"}

data: {"type":"speech.audio.done"}
```

The `delta` field contains 16-bit signed little-endian PCM at 24 kHz, base64-encoded. The first chunk is split at the earliest clause boundary for fast initial playback.
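A client can consume this stream with a line scanner. The following is a sketch under two assumptions: each event fits on one `data:` line as shown above, and the payload uses standard (not URL-safe) base64. `readPCM` and `decodeStream` are illustrative names.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// event holds the two fields used by the stream.
type event struct {
	Type  string `json:"type"`
	Delta string `json:"delta"`
}

// readPCM reads SSE lines until speech.audio.done and returns the
// concatenated PCM bytes from all speech.audio.delta events.
func readPCM(r io.Reader) ([]byte, error) {
	var pcm []byte
	sc := bufio.NewScanner(r)
	// Audio chunks can exceed the scanner's default 64 KiB line limit.
	sc.Buffer(make([]byte, 0, 64*1024), 8*1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "data:") {
			continue // blank separator lines and other SSE fields
		}
		var ev event
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if err := json.Unmarshal([]byte(payload), &ev); err != nil {
			return nil, fmt.Errorf("bad event: %w", err)
		}
		switch ev.Type {
		case "speech.audio.delta":
			chunk, err := base64.StdEncoding.DecodeString(ev.Delta)
			if err != nil {
				return nil, fmt.Errorf("bad delta: %w", err)
			}
			pcm = append(pcm, chunk...)
		case "speech.audio.done":
			return pcm, nil
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return pcm, io.ErrUnexpectedEOF // stream ended without a done event
}

// decodeStream is a convenience wrapper for in-memory input.
func decodeStream(s string) ([]byte, error) { return readPCM(strings.NewReader(s)) }

func main() {
	delta := base64.StdEncoding.EncodeToString([]byte{0x01, 0x00, 0xFF, 0x7F})
	stream := `data: {"type":"speech.audio.delta","delta":"` + delta + `"}` + "\n\n" +
		`data: {"type":"speech.audio.done"}` + "\n\n"
	pcm, err := decodeStream(stream)
	fmt.Println(len(pcm), "bytes", err)
}
```

With a live server, pass `resp.Body` from the streaming request to `readPCM` instead of an in-memory string.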

## Use as a Library

The `tts` package is a self-contained engine. `New` returns a `*tts.Model`; encoders live in the `audio` package behind a small interface.

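```go
package main

import (
	"os"

	"github.com/itamaker/kitten-tts-go/audio"
	"github.com/itamaker/kitten-tts-go/tts"
)

func main() {
	// Load the model from its directory.
	model, err := tts.New("models/kitten-tts-nano-int8")
	if err != nil {
		panic(err)
	}
	defer model.Close()

	// Synthesize audio samples for the given text and voice at 1.0x speed.
	samples, err := model.Generate("Hello, world!", "Bruno", 1.0, true)
	if err != nil {
		panic(err)
	}

	// Pick an encoder by format name and write the encoded bytes to disk.
	enc, err := audio.NewEncoder("mp3") // or wav, flac, opus, pcm
	if err != nil {
		panic(err)
	}
	data, err := enc.Encode(samples)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("hello.mp3", data, 0o644); err != nil {
		panic(err)
	}
}
```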

The phonemizer is an interface, so you can swap espeak-ng for your own (handy in tests) via a functional option:

```go
model, err := tts.New(dir, tts.WithPhonemizer(myPhonemizer))
// myPhonemizer implements tts.Phonemizer: Phonemize(string) (string, error)
```
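A test double for this interface can be very small. The sketch below declares a local `Phonemizer` interface with the method set quoted above so that it runs on its own; `mapPhonemizer`, `phonemeCount`, and the canned IPA string are made up for illustration.

```go
package main

import "fmt"

// Phonemizer mirrors the method set documented for tts.Phonemizer.
type Phonemizer interface {
	Phonemize(text string) (string, error)
}

// mapPhonemizer returns canned phonemes, which keeps tests independent of an
// espeak-ng installation.
type mapPhonemizer map[string]string

func (m mapPhonemizer) Phonemize(text string) (string, error) {
	if ipa, ok := m[text]; ok {
		return ipa, nil
	}
	return "", fmt.Errorf("no canned phonemes for %q", text)
}

// phonemeCount stands in for any code that depends on a Phonemizer.
func phonemeCount(p Phonemizer, text string) (int, error) {
	ipa, err := p.Phonemize(text)
	if err != nil {
		return 0, err
	}
	return len([]rune(ipa)), nil
}

func main() {
	stub := mapPhonemizer{"hello": "həlˈoʊ"} // hypothetical canned value
	n, err := phonemeCount(stub, "hello")
	fmt.Println(n, err)
}
```

A value like `stub` could then be passed to `tts.WithPhonemizer`, assuming the real interface has exactly this method.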

## Architecture

```
kitten-tts-go/
├── tts/                 # Core TTS engine (importable library)
│   ├── tts.go           # Model type, Generate/GenerateChunk, options
│   ├── load.go          # New: read config.json and load a model directory
│   ├── onnx.go          # ONNX session, isolated behind one file
│   ├── phonemes.go      # Phonemizer interface, espeak-ng impl, token IDs
│   ├── normalize.go     # Number/currency/whitespace normalization
│   ├── chunk.go         # Sentence/streaming text chunking
│   └── voices.go        # NPZ voice embedding loader
├── audio/               # Encoder interface + one file per format
│   ├── audio.go         # Encoder interface, registry, NewEncoder, resampling
│   ├── wav.go           # WAV + PCM
│   ├── mp3.go           # MP3 (shine)
│   ├── flac.go          # FLAC (mewkiz)
│   └── opus.go          # OGG Opus (libopus) + hand-written OGG container
└── cmd/
    ├── kitten-tts/        # CLI (stdlib flag)
    └── kitten-tts-server/ # OpenAI-compatible API server (net/http)
```

### How It Works

1. **Normalization** — Expands numbers ("42" → "forty-two"), currencies ("$10.50" → "ten dollars and fifty cents"), and collapses whitespace

2. **Phonemization** — Converts English text to IPA phonemes via a `Phonemizer` (espeak-ng by default)

3. **Token encoding** — Maps IPA phonemes to integer token IDs using a symbol table matching the original Python implementation

4. **Voice selection** — Loads style embeddings from the NPZ voice file

5. **ONNX inference** — Runs the model with input tokens, voice style, and speed parameters

6. **Audio encoding**: An `audio.Encoder` produces MP3 (the server default), WAV (the CLI default), FLAC, Opus, or raw PCM
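To give a feel for step 1, here is a deliberately tiny normalization sketch. It only expands standalone integers from 0 to 99 and collapses whitespace; currencies, decimals, and larger numbers are out of scope, and the real pipeline in `normalize.go` may work quite differently. `words` and `normalize` are illustrative names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var ones = []string{
	"zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
	"nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
	"sixteen", "seventeen", "eighteen", "nineteen",
}

var tens = []string{"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}

// words spells out an integer in the range 0..99.
func words(n int) string {
	switch {
	case n < 20:
		return ones[n]
	case n%10 == 0:
		return tens[n/10]
	default:
		return tens[n/10] + "-" + ones[n%10]
	}
}

var (
	smallNumber = regexp.MustCompile(`\b\d{1,2}\b`) // standalone 1-2 digit runs
	whitespace  = regexp.MustCompile(`\s+`)
)

// normalize expands small integers into words and collapses whitespace.
func normalize(text string) string {
	text = smallNumber.ReplaceAllStringFunc(text, func(m string) string {
		n, err := strconv.Atoi(m)
		if err != nil {
			return m
		}
		return words(n)
	})
	return strings.TrimSpace(whitespace.ReplaceAllString(text, " "))
}

func main() {
	fmt.Println(normalize("I have 42   cats "))
}
```

Spelling numbers out before phonemization means the phonemizer only ever sees ordinary words.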

## Design Notes

A few idiomatic-Go choices worth knowing:

- **`tts.Model`** is the core type, constructed with **`tts.New`**.

- **Audio formats** are an **`Encoder` interface** resolved by name from a registry (`audio.NewEncoder`), one file per format — not a single switch statement.

- **Phonemization** is the **`Phonemizer` interface**; the default is espeak-ng and `tts.WithPhonemizer` lets you swap it (e.g. in tests).

- The **CLI** uses the standard-library `flag` package (flags before positionals); the **server** is plain `net/http`.

- **ONNX inference** uses [`yalue/onnxruntime_go`](https://github.com/yalue/onnxruntime_go), which `dlopen`s the runtime — there is no cgo link to onnxruntime.

- **Encoders**: MP3 ([`shine-mp3`](https://github.com/braheezy/shine-mp3)) and FLAC ([`mewkiz/flac`](https://github.com/mewkiz/flac)) are pure Go; Opus uses [`hraban/opus`](https://gopkg.in/hraban/opus.v2) (libopus via cgo). AAC is not supported.
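The registry pattern mentioned above can be sketched in a few lines. This is a self-contained approximation, not the `audio` package itself: the `Encoder` method signature, the `Register` and `Formats` helpers, and the `[]float32` sample type are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// Encoder turns audio samples into encoded bytes.
type Encoder interface {
	Encode(samples []float32) ([]byte, error)
}

// registry maps a format name to a constructor for its encoder.
var registry = map[string]func() Encoder{}

// Register adds a format; each format file would call this from init.
func Register(name string, factory func() Encoder) { registry[name] = factory }

// Formats returns the registered format names in sorted order.
func Formats() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// NewEncoder resolves an encoder by name instead of switching on the format.
func NewEncoder(name string) (Encoder, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unsupported format %q (have %v)", name, Formats())
	}
	return factory(), nil
}

// pcmEncoder writes 16-bit little-endian samples; it assumes inputs are
// already within [-1, 1].
type pcmEncoder struct{}

func (pcmEncoder) Encode(samples []float32) ([]byte, error) {
	out := make([]byte, 0, 2*len(samples))
	for _, s := range samples {
		v := int16(s * 32767)
		out = append(out, byte(v), byte(v>>8))
	}
	return out, nil
}

func init() { Register("pcm", func() Encoder { return pcmEncoder{} }) }

func main() {
	enc, err := NewEncoder("pcm")
	fmt.Println(enc != nil, err)
	_, err = NewEncoder("aac")
	fmt.Println(err)
}
```

Adding a format then means adding one file with its own `init` registration, with no central switch to edit.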

## License

MIT — see [LICENSE](LICENSE).

This is an independent Go implementation. The upstream KittenTTS project and its model weights are Apache-2.0 (see [Acknowledgments](#acknowledgments)); those weights are downloaded separately and are not part of this repository.

## Acknowledgments

- [KittenML](https://kittenml.com) for the original KittenTTS models and Python library

- [yalue/onnxruntime_go](https://github.com/yalue/onnxruntime_go) for the Go ONNX Runtime bindings

- [braheezy/shine-mp3](https://github.com/braheezy/shine-mp3) and [mewkiz/flac](https://github.com/mewkiz/flac) for pure-Go audio encoding

- [hraban/opus](https://gopkg.in/hraban/opus.v2) and [libopus](https://opus-codec.org/) for Opus encoding

- [espeak-ng](https://github.com/espeak-ng/espeak-ng) for phonemization