{"id":50751958,"url":"https://github.com/itamaker/kitten-tts-go","last_synced_at":"2026-06-11T02:01:14.500Z","repository":{"id":361369555,"uuid":"1253448611","full_name":"itamaker/kitten-tts-go","owner":"itamaker","description":"Go implementation of KittenTTS — an ultra-lightweight ONNX-based text-to-speech engine. Self-contained binaries, no Python dependency.","archived":false,"fork":false,"pushed_at":"2026-06-08T08:04:28.000Z","size":81,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-08T09:22:37.685Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/itamaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"itamaker","buy_me_a_coffee":"amaker"}},"created_at":"2026-05-29T13:24:33.000Z","updated_at":"2026-06-08T08:04:29.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/itamaker/kitten-tts-go","commit_stats":null,"previous_names":["itamaker/kitten-tts-go"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/itamaker/kitten-tts-go","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamaker%2Fkitten-tts-go","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamaker%2Fkitten-tts-go/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamaker%2Fkitten-tts-go/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamaker%2Fkitten-tts-go/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/itamaker","download_url":"https://codeload.github.com/itamaker/kitten-tts-go/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itamaker%2Fkitten-tts-go/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34178819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-11T02:00:26.964Z","updated_at":"2026-06-11T02:01:14.494Z","avatar_url":"https://github.com/itamaker.png","language":"Go","funding_links":["https://github.com/sponsors/itamaker","https://buymeacoffee.com/amaker"],"categories":[],"sub_categories":[],"readme":"# kitten-tts-go 🐱🐹\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/itamaker/kitten-tts-go/blob/main/examples/kitten-tts-go-colab.ipynb)\n\nGo implementation of [KittenTTS](https://github.com/KittenML/KittenTTS) — an ultra-lightweight ONNX-based text-to-speech engine. Self-contained binaries with no Python dependency.\n\n\u003e **Try it now:** the [Colab notebook](examples/kitten-tts-go-colab.ipynb) builds the project and synthesizes speech in three clicks — no local setup, no GPU.\n\nIt produces two binaries:\n\n* `kitten-tts` — a CLI tool for one-off speech generation. Ideally suited for AI agent skills.\n* `kitten-tts-server` — an OpenAI-compatible API server with [SSE streaming](#sse-streaming) support.\n\n\u003e **Adapted from:** [KittenML/KittenTTS](https://github.com/KittenML/KittenTTS) (Apache-2.0). All model weights are from the original project.\n\n## Key Features\n\n- **Ultra-lightweight** — 15M to 80M parameter models; smallest is just 25 MB (int8)\n- **CPU-optimized** — ONNX-based inference runs efficiently without a GPU\n- **8 built-in voices** — Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, and Leo\n- **Adjustable speech speed** — control playback rate via `-speed`\n- **Text normalization** — built-in pipeline handles numbers and currencies\n- **24 kHz output** — high-quality audio at a standard sample rate\n- **Multiple audio formats** — MP3, FLAC, WAV, and PCM (pure Go) plus OGG Opus (via libopus)\n\n## Dependencies\n\nThis port relies on three system dependencies:\n\n### 1. ONNX Runtime shared library\n\nGo inference uses [`yalue/onnxruntime_go`](https://github.com/yalue/onnxruntime_go), which loads the ONNX Runtime shared library dynamically at runtime.\n\n```bash\n# macOS\nbrew install onnxruntime\n\n# Ubuntu/Debian — download a release from\n# https://github.com/microsoft/onnxruntime/releases and place\n# libonnxruntime.so on your library path\n```\n\nThe library is auto-detected at common locations (`/usr/local/lib`, `/opt/homebrew/lib`, `/usr/lib`, …). To point at a specific file, set the `ONNXRUNTIME_LIB_PATH` environment variable:\n\n```bash\nexport ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib\n```\n\n### 2. espeak-ng (for phonemization)\n\n```bash\n# macOS\nbrew install espeak-ng\n\n# Ubuntu/Debian\nsudo apt-get install -y espeak-ng\n```\n\n### 3. libopus + libopusfile (for Opus encoding)\n\nOpus output is encoded with libopus via cgo. The [`hraban/opus`](https://gopkg.in/hraban/opus.v2) binding links both **libopus** and **libopusfile** (`#cgo pkg-config: opus opusfile`), so both are **build-time** dependencies (and must be on the library path at runtime). Building therefore requires a C compiler and `CGO_ENABLED=1` (the default for native builds).\n\n```bash\n# macOS\nbrew install opus opusfile pkg-config\n\n# Ubuntu/Debian\nsudo apt-get install -y libopus-dev libopusfile-dev pkg-config\n```\n\n\u003e The libopus/libopusfile/pkg-config packages above are only needed when\n\u003e **building from source**. The **released binaries statically link** libopus and\n\u003e libopusfile, so the target machine needs only the ONNX Runtime shared library\n\u003e (`dlopen`'d at runtime) and espeak-ng — no opus libraries required.\n\n## Available Models\n\n| Model | Parameters | Size | Download |\n|---|---|---|---|\n| kitten-tts-mini | 80M | 80 MB | [KittenML/kitten-tts-mini-0.8](https://huggingface.co/KittenML/kitten-tts-mini-0.8) |\n| kitten-tts-micro | 40M | 41 MB | [KittenML/kitten-tts-micro-0.8](https://huggingface.co/KittenML/kitten-tts-micro-0.8) |\n| kitten-tts-nano | 15M | 56 MB | [KittenML/kitten-tts-nano-0.8](https://huggingface.co/KittenML/kitten-tts-nano-0.8-fp32) |\n| kitten-tts-nano (int8) | 15M | 25 MB | [KittenML/kitten-tts-nano-0.8-int8](https://huggingface.co/KittenML/kitten-tts-nano-0.8-int8) |\n\n### Downloading a model\n\nModels are not vendored in this repository. Fetch one into `./models` with the\nhelper script:\n\n```bash\nscripts/fetch_model.sh nano-int8   # also: nano, micro, mini (default: nano-int8)\n```\n\nIt reads the ONNX/voices filenames from the model's `config.json`, so it works\nfor every model above. (`./models` is git-ignored.) Equivalent manual download:\n\n```bash\nmkdir -p models/kitten-tts-nano-int8\nfor FILE in config.json kitten_tts_nano_v0_8.onnx voices.npz; do\n  curl -L -o \"models/kitten-tts-nano-int8/$FILE\" \\\n    \"https://huggingface.co/KittenML/kitten-tts-nano-0.8-int8/resolve/main/$FILE\"\ndone\n```\n\n## Build\n\n```bash\ngo build -o bin/ ./...\n# Binaries at: bin/kitten-tts and bin/kitten-tts-server\n```\n\n\u003e Building requires a C compiler and libopus/libopusfile (see [Dependencies](#dependencies)),\n\u003e since Opus encoding is compiled in via cgo.\n\n### Releases\n\nPushing a `v*` tag triggers [`.github/workflows/release.yml`](.github/workflows/release.yml),\nwhich builds the binaries with `go build` on GitHub runners and publishes a\nGitHub Release with one `.tar.gz` per platform plus `checksums.txt`:\n\n```bash\ngit tag v0.1.1\ngit push origin v0.1.1\n```\n\nlibopus/libopusfile are statically linked from source, so each target is built on\nits own native runner — Linux amd64/arm64 and macOS arm64 — except darwin/amd64,\nwhich is cross-compiled on the Apple Silicon runner (clang is a universal\ntoolchain). To build locally:\n\n```bash\ngo build -o bin/ ./...\n```\n\n### End-to-end smoke test\n\n[`scripts/smoke_test.sh`](scripts/smoke_test.sh) builds the binaries and exercises\nevery audio format plus SSE streaming, fully offline, against a local model:\n\n```bash\n# Pass a model dir (or set KITTEN_MODEL_DIR, or place one at models/kitten-tts-nano-int8)\nscripts/smoke_test.sh /path/to/models/kitten-tts-nano-int8\n\n# Test prebuilt/release binaries instead of building:\nKITTEN_BIN_DIR=dist/... scripts/smoke_test.sh /path/to/model\n```\n\nIt checks the CLI (wav/mp3/flac/opus/pcm + `-list-voices`) and the server\n(`/health`, `/v1/models`, all formats, streaming, and 400 validation), printing a\npass/fail summary and exiting non-zero on failure. Needs espeak-ng and the ONNX\nRuntime library; uses `ffprobe` for codec checks when available.\n\n## Generate Speech (CLI)\n\nFollowing Go convention, **flags come before the positional arguments**\n(`\u003cmodel_dir\u003e \u003ctext\u003e [voice]`):\n\n```bash\n# Basic usage (outputs output.wav)\n./bin/kitten-tts ./models/kitten-tts-nano-int8 'Hello, world!' Bruno\n\n# Specify voice, speed, and output (flags first)\n./bin/kitten-tts -voice Luna -speed 1.2 -output hello.wav ./models/kitten-tts-nano-int8 'Hello, world!'\n\n# Encode directly to another format with -format\n./bin/kitten-tts -format mp3 -output hello.mp3 ./models/kitten-tts-nano-int8 'Hello, world!'\n\n# List available voices (no model directory needed)\n./bin/kitten-tts -list-voices\n```\n\nCLI flags (single or double dash both work):\n\n| Flag | Default | Description |\n|---|---|---|\n| `-voice`, `-v` | `Bruno` | Voice name (overrides the positional voice) |\n| `-speed`, `-s` | `1.0` | Speech speed multiplier |\n| `-output`, `-o` | `output.wav` | Output file path |\n| `-format` | `wav` | Output format: `wav`, `mp3`, `flac`, `opus`, `pcm` |\n| `-no-clean` | | Disable text normalization (numbers, currency) |\n| `-list-voices` | | List available voices and exit (no model directory needed) |\n\n\u003e Because the CLI uses Go's standard `flag` package, any flag placed *after* a\n\u003e positional argument is treated as a positional. Put flags first.\n\n## Run the API Server\n\nFlags first, then the model directory:\n\n```bash\n./bin/kitten-tts-server -host 0.0.0.0 -port 8080 ./models/kitten-tts-nano-int8\n```\n\nThe server exposes an OpenAI-compatible `/v1/audio/speech` endpoint:\n\n```bash\ncurl -X POST http://localhost:8080/v1/audio/speech \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"kitten-tts\",\n    \"input\": \"Hello, world! This is KittenTTS running as an API server.\",\n    \"voice\": \"alloy\"\n  }' \\\n  --output speech.mp3\n```\n\n**Request body:**\n\n| Field | Type | Default | Description |\n|---|---|---|---|\n| `input` | string | *(required)* | Text to synthesize |\n| `voice` | string | *(required)* | Voice name (OpenAI or KittenTTS names) |\n| `model` | string | `\"\"` | Accepted for compatibility; ignored |\n| `response_format` | string | `\"mp3\"` | Output audio format (see below) |\n| `speed` | float | `1.0` | Speech speed multiplier (0.25–4.0) |\n| `stream` | bool | `false` | Enable SSE streaming (requires `\"pcm\"` format) |\n\nRequest bodies are capped at 1 MiB. Invalid requests — bad JSON, empty `input`,\nan unknown `voice` or `response_format`, or `speed` out of range — return HTTP\n400 with an OpenAI-style JSON error body.\n\n**Supported audio formats:**\n\n| Format | Content-Type | Description |\n|---|---|---|\n| `mp3` | `audio/mpeg` | MP3 (resampled to 44.1 kHz, pure-Go [shine](https://github.com/braheezy/shine-mp3) encoder) |\n| `flac` | `audio/flac` | FLAC lossless (24 kHz native, pure-Go [mewkiz/flac](https://github.com/mewkiz/flac)) |\n| `wav` | `audio/wav` | WAV 16-bit PCM (24 kHz native) |\n| `pcm` | `audio/pcm` | Raw 16-bit signed little-endian PCM (24 kHz) |\n| `opus` | `audio/ogg` | Opus in OGG container (resampled to 48 kHz) |\n| `aac` | — | Not supported (returns error) |\n\n\u003e **Note on Opus:** there is no pure-Go Opus encoder, so Opus uses [libopus](https://opus-codec.org/)\n\u003e via cgo ([`hraban/opus`](https://gopkg.in/hraban/opus.v2)) plus a hand-written RFC 7845 OGG\n\u003e writer. libopus and libopusfile must be installed at build time (see [Dependencies](#dependencies)).\n\n**API endpoints:**\n\n| Method | Path | Description |\n|---|---|---|\n| `POST` | `/v1/audio/speech` | Generate speech from text |\n| `GET` | `/v1/models` | List loaded model |\n| `GET` | `/health` | Health check |\n\n**Voice mapping (OpenAI → KittenTTS):**\n\n| OpenAI | KittenTTS | Gender |\n|---|---|---|\n| alloy | Bella | Female |\n| echo | Jasper | Male |\n| fable | Luna | Female |\n| onyx | Bruno | Male |\n| nova | Rosie | Female |\n| shimmer | Hugo | Male |\n\nAll 8 KittenTTS voices (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo) can also be used directly by name.\n\n### SSE Streaming\n\nFor lower time-to-first-audio on longer texts, set `\"stream\": true` with `\"response_format\": \"pcm\"`. The server returns [Server-Sent Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events) with base64-encoded PCM audio chunks, compatible with the OpenAI streaming TTS format:\n\n```bash\ncurl -N -X POST http://localhost:8080/v1/audio/speech \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"kitten-tts\",\n    \"input\": \"Hello, this is a streaming test. Each sentence is sent as a separate audio chunk.\",\n    \"voice\": \"alloy\",\n    \"response_format\": \"pcm\",\n    \"stream\": true\n  }'\n```\n\nEach event is a JSON object on a `data:` line:\n\n```\ndata: {\"type\":\"speech.audio.delta\",\"delta\":\"\u003cbase64-encoded-pcm\u003e\"}\ndata: {\"type\":\"speech.audio.delta\",\"delta\":\"\u003cbase64-encoded-pcm\u003e\"}\ndata: {\"type\":\"speech.audio.done\"}\n```\n\nThe `delta` field contains 16-bit signed little-endian PCM at 24 kHz, base64-encoded. The first chunk is split at the earliest clause boundary for fast initial playback.\n\n## Use as a Library\n\nThe `tts` package is a self-contained engine. `New` returns a `*tts.Model`;\nencoders live in the `audio` package behind a small interface.\n\n```go\npackage main\n\nimport (\n\t\"os\"\n\n\t\"github.com/itamaker/kitten-tts-go/audio\"\n\t\"github.com/itamaker/kitten-tts-go/tts\"\n)\n\nfunc main() {\n\tmodel, err := tts.New(\"models/kitten-tts-nano-int8\")\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tdefer model.Close()\n\n\tsamples, err := model.Generate(\"Hello, world!\", \"Bruno\", 1.0, true)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\tenc, _ := audio.NewEncoder(\"mp3\") // or wav, flac, opus, pcm\n\tdata, _ := enc.Encode(samples)\n\tos.WriteFile(\"hello.mp3\", data, 0o644)\n}\n```\n\nThe phonemizer is an interface, so you can swap espeak-ng for your own (handy in\ntests) via a functional option:\n\n```go\nmodel, err := tts.New(dir, tts.WithPhonemizer(myPhonemizer))\n// myPhonemizer implements tts.Phonemizer: Phonemize(string) (string, error)\n```\n\n## Architecture\n\n```\nkitten-tts-go/\n├── tts/                 # Core TTS engine (importable library)\n│   ├── tts.go           # Model type, Generate/GenerateChunk, options\n│   ├── load.go          # New: read config.json and load a model directory\n│   ├── onnx.go          # ONNX session, isolated behind one file\n│   ├── phonemes.go      # Phonemizer interface, espeak-ng impl, token IDs\n│   ├── normalize.go     # Number/currency/whitespace normalization\n│   ├── chunk.go         # Sentence/streaming text chunking\n│   └── voices.go        # NPZ voice embedding loader\n├── audio/               # Encoder interface + one file per format\n│   ├── audio.go         # Encoder interface, registry, NewEncoder, resampling\n│   ├── wav.go           # WAV + PCM\n│   ├── mp3.go           # MP3 (shine)\n│   ├── flac.go          # FLAC (mewkiz)\n│   └── opus.go          # OGG Opus (libopus) + hand-written OGG container\n└── cmd/\n    ├── kitten-tts/        # CLI (stdlib flag)\n    └── kitten-tts-server/ # OpenAI-compatible API server (net/http)\n```\n\n### How It Works\n\n1. **Normalization** — Expands numbers (\"42\" → \"forty-two\"), currencies (\"$10.50\" → \"ten dollars and fifty cents\"), and collapses whitespace\n2. **Phonemization** — Converts English text to IPA phonemes via a `Phonemizer` (espeak-ng by default)\n3. **Token encoding** — Maps IPA phonemes to integer token IDs using a symbol table matching the original Python implementation\n4. **Voice selection** — Loads style embeddings from the NPZ voice file\n5. **ONNX inference** — Runs the model with input tokens, voice style, and speed parameters\n6. **Audio encoding** — An `audio.Encoder` produces MP3 (default), FLAC, WAV, Opus, or raw PCM\n\n## Design Notes\n\nA few idiomatic-Go choices worth knowing:\n\n- **`tts.Model`** is the core type, constructed with **`tts.New`**.\n- **Audio formats** are an **`Encoder` interface** resolved by name from a registry (`audio.NewEncoder`), one file per format — not a single switch statement.\n- **Phonemization** is the **`Phonemizer` interface**; the default is espeak-ng and `tts.WithPhonemizer` lets you swap it (e.g. in tests).\n- The **CLI** uses the standard-library `flag` package (flags before positionals); the **server** is plain `net/http`.\n- **ONNX inference** uses [`yalue/onnxruntime_go`](https://github.com/yalue/onnxruntime_go), which `dlopen`s the runtime — there is no cgo link to onnxruntime.\n- **Encoders**: MP3 ([`shine-mp3`](https://github.com/braheezy/shine-mp3)) and FLAC ([`mewkiz/flac`](https://github.com/mewkiz/flac)) are pure Go; Opus uses [`hraban/opus`](https://gopkg.in/hraban/opus.v2) (libopus via cgo). AAC is not supported.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\nThis is an independent Go implementation. The upstream KittenTTS project and its\nmodel weights are Apache-2.0 (see [Acknowledgments](#acknowledgments)); those\nweights are downloaded separately and are not part of this repository.\n\n## Acknowledgments\n\n- [KittenML](https://kittenml.com) for the original KittenTTS models and Python library\n- [yalue/onnxruntime_go](https://github.com/yalue/onnxruntime_go) for the Go ONNX Runtime bindings\n- [braheezy/shine-mp3](https://github.com/braheezy/shine-mp3) and [mewkiz/flac](https://github.com/mewkiz/flac) for pure-Go audio encoding\n- [hraban/opus](https://gopkg.in/hraban/opus.v2) and [libopus](https://opus-codec.org/) for Opus encoding\n- [espeak-ng](https://github.com/espeak-ng/espeak-ng) for phonemization\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitamaker%2Fkitten-tts-go","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fitamaker%2Fkitten-tts-go","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitamaker%2Fkitten-tts-go/lists"}