{"id":47774176,"url":"https://github.com/met4citizen/headtts","last_synced_at":"2026-04-03T11:01:08.978Z","repository":{"id":289114476,"uuid":"970108189","full_name":"met4citizen/HeadTTS","owner":"met4citizen","description":"HeadTTS: Free neural text-to-speech (Kokoro) with timestamps and visemes for lip-sync. Runs in-browser (WebGPU/WASM) or on local Node.js WebSocket/REST server (CPU).","archived":false,"fork":false,"pushed_at":"2025-09-12T16:52:31.000Z","size":8198,"stargazers_count":43,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-12T18:55:22.434Z","etag":null,"topics":["kokoro","lip-sync","talkinghead","text-to-speech","timestamps","visemes"],"latest_commit_sha":null,"homepage":"https://met4citizen.github.io/HeadTTS/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/met4citizen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-04-21T13:38:52.000Z","updated_at":"2025-09-12T16:49:42.000Z","dependencies_parsed_at":"2025-07-26T17:18:01.617Z","dependency_job_id":"a6262217-4570-47d5-b45c-522263599b31","html_url":"https://github.com/met4citizen/HeadTTS","commit_stats":null,"previous_names":["met4citizen/headtts"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/met4citizen/HeadTTS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/met4citizen%2FHeadTTS","tags_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/met4citizen%2FHeadTTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/met4citizen%2FHeadTTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/met4citizen%2FHeadTTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/met4citizen","download_url":"https://codeload.github.com/met4citizen/HeadTTS/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/met4citizen%2FHeadTTS/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31347183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T08:03:20.796Z","status":"ssl_error","status_checked_at":"2026-04-03T08:00:37.834Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kokoro","lip-sync","talkinghead","text-to-speech","timestamps","visemes"],"created_at":"2026-04-03T11:01:01.964Z","updated_at":"2026-04-03T11:01:08.955Z","avatar_url":"https://github.com/met4citizen.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"logo.png\" width=\"100\"/\u003e\u0026nbsp; HeadTTS\n\n**HeadTTS** is a free JavaScript text-to-speech (TTS) solution that\nprovides phoneme-level timestamps and Oculus visemes for lip-sync, 
in addition\nto audio output (WAV/PCM). It uses the\n[Kokoro](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped)\nneural model and voices, and inference can run entirely in\nthe browser (WebGPU or WASM), or alternatively\non a Node.js WebSocket/RESTful server (WebGPU or CPU).\n\n- **Pros**: Free. Doesn't require a server in in-browser mode.\nWebGPU support. Uses neural voices with a StyleTTS 2 model.\nGreat for lip-sync use cases and fully compatible with\n[TalkingHead](https://github.com/met4citizen/TalkingHead).\nMIT licensed, doesn't use eSpeak or any other GPL-licensed\nmodule.\n\n- **Cons**: Only the latest desktop browsers have\n[WebGPU support](https://caniuse.com/webgpu) enabled by default,\nand the WASM fallback is much slower.\nKokoro is a lightweight model, but it still takes time to\nload the first time and consumes a lot of memory.\nEnglish is currently the only supported language.\n\n**👉 If you're using a desktop browser, check out the\n[IN-BROWSER DEMO](https://met4citizen.github.io/HeadTTS/)!** - If\nyour browser doesn't have WebGPU support enabled,\nthe demo app uses WASM as a fallback.\n\nThe project uses [websockets/ws](https://github.com/websockets/ws) (MIT License),\n[huggingface/transformers.js (with ONNX Runtime)](https://github.com/huggingface/transformers.js/)\n(Apache 2.0 License) and\n[onnx-community/Kokoro-82M-v1.0-ONNX-timestamped](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped)\n(Apache 2.0 License) as runtime dependencies. For information on\nlanguage modules and dictionaries, see Appendix B.\n[Jest](https://jestjs.io) is used for testing.\n\nYou can find the list of supported English voices and voice samples\n[here](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped#voicessamples).\n\n---\n\n# In-browser Module: `headtts.mjs`\n\nThe HeadTTS JavaScript module enables in-browser text-to-speech\nusing Module Web Workers and WebGPU/WASM inference. 
Alternatively, it can\nconnect to and use the HeadTTS Node.js WebSocket/RESTful server.\n\nCreate a new `HeadTTS` class instance:\n\n```javascript\nimport { HeadTTS } from \"./modules/headtts.mjs\";\n\nconst headtts = new HeadTTS({\n  endpoints: [\"ws://127.0.0.1:8882\", \"webgpu\"], // Endpoints in order of priority\n  languages: ['en-us'], // Language modules to pre-load (in-browser)\n  voices: [\"af_bella\", \"am_fenrir\"] // Voices to pre-load (in-browser)\n});\n```\n\nBeware that if you import the HeadTTS module from a CDN, you may need to\nset the `workerModule` and `dictionaryURL` options explicitly,\nas the default relative paths will likely not work:\n\n```javascript\nimport { HeadTTS } from \"https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/+esm\";\n\nconst headtts = new HeadTTS({\n  /* ... */\n  workerModule: \"https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/modules/worker-tts.mjs\",\n  dictionaryURL: \"https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/dictionaries/\"\n});\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see all the OPTIONS.\u003c/summary\u003e\n\nOption | Description | Default value\n--- | --- | ---\n`endpoints` | List of WebSocket/RESTful servers or backends `webgpu` or `wasm`, in order of priority. If one fails, the next is used.  | `[\"webgpu\",`\u003cbr\u003e` \"wasm\"]`\n`audioCtx` | Audio context for creating audio buffers. If `null`, a new one is created. | `null`\n`workerModule` | URL of the HeadTTS Web Worker module. Enables use from a CDN. If set to `null`, the relative path/file `./worker-tts.mjs` is used. | `null`\n`transformersModule` | URL of the `transformers.js` module to load. | `\"https://cdn.jsdelivr.net/npm/`\u003cbr\u003e`@huggingface/transformers@4.0.0`\u003cbr\u003e`/dist/transformers.min.js\"`\n`model` | Kokoro text-to-speech ONNX model (timestamped) used for in-browser inference. 
| `\"onnx-community/`\u003cbr\u003e`Kokoro-82M-v1.0-ONNX-timestamped\"`\n`dtypeWebgpu` | Data type precision for WebGPU inference: `\"fp32\"` (recommended), `\"fp16\"`, `\"q8\"`, `\"q4\"`, or `\"q4f16\"`.  | `\"fp32\"`\n`dtypeWasm` | Data type precision for WASM inference: `\"fp32\"`, `\"fp16\"`, `\"q8\"`, `\"q4\"`, or `\"q4f16\"`. | `\"q4\"`\n`styleDim` | Style embedding dimension for inference. | `256`\n`audioSampleRate` | Audio sample rate in Hz for inference. | `24000`\n`frameRate` | Frame rate in FPS for inference. | `40`\n`languages` | Language modules to be pre-loaded. | [`\"en-us\"`]\n`dictionaryURL` | URL to language dictionaries. Set to `null` to disable dictionaries. | `\"../dictionaries\"`\n`voiceURL` | URL for loading voices. If the given value is a relative URL, it should be relative to the worker file location. | `\"https://huggingface.co/`\u003cbr\u003e`onnx-community/`\u003cbr\u003e`Kokoro-82M-v1.0-ONNX/`\u003cbr\u003e`resolve/main/voices\"`\n`voices` | Voices to preload (e.g., `[\"af_bella\", \"am_fenrir\"]`).  | `[]`\n`splitSentences` | Whether to split text into sentences. | `true`\n`splitLength` | Maximum length (in characters) of each text chunk. | `500`\n`deltaStart` | Adjustment (in ms) to viseme start times. | `-10`\n`deltaEnd` | Adjustment (in ms) to viseme end times. | `10`\n`defaultVoice` | Default voice to use. | `\"af_bella\"`\n`defaultLanguage` | Default language to use. | `\"en-us\"`\n`defaultSpeed` | Speaking speed. Range: 0.25–4. | `1`\n`defaultAudioEncoding` | Default audio format: `\"wav\"` or `\"pcm\"` (PCM 16-bit LE). 
| `\"wav\"`\n`trace` | Bitmask for debugging subsystems (`0`=none, `255`=all):\u003cbr\u003e\u003cul\u003e\u003cli\u003eBit 0 (1): Connection\u003c/li\u003e\u003cli\u003eBit 1 (2): Messages\u003c/li\u003e\u003cli\u003eBit 2 (4): Events\u003c/li\u003e\u003cli\u003eBit 3 (8): G2P\u003c/li\u003e\u003cli\u003eBit 4 (16): Language modules\u003c/li\u003e\u003c/ul\u003e | `0`\n\nNote: Model related options apply only to in-browser inference.\nIf inference is performed on a server, server-specific\nsettings will apply instead.\n\n\u003c/details\u003e\n\nConnect to the first supported/available endpoint:\n\n```javascript\ntry {\n  await headtts.connect();\n} catch(error) {\n  console.error(error);\n}\n```\n\nMake an `onmessage` event handler to handle response messages. In this\nexample, we use\n[TalkingHead](https://github.com/met4citizen/TalkingHead) instance `head`\nto play the incoming audio and lip-sync data:\n\n```javascript\n// Speak and lipsync\nheadtts.onmessage = (message) =\u003e {\n  if ( message.type === \"audio\" ) {\n    try {\n      head.speakAudio( message.data, {}, (word) =\u003e {\n        console.log(word);\n      });\n    } catch(error) {\n      console.error(error);\n    }\n  } else if ( message.type === \"custom\" ) {\n    console.log(\"Received custom message, data=\", message.data);\n  } else if ( message.type === \"error\" ) {\n    console.error(\"Received error message, error=\", message.data.error);\n  }\n}\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see all the available class EVENTS.\u003c/summary\u003e\n  \nEvent handler | Description\n--- | ---\n`onstart` | Triggered when the first message is added and all message queues were previously empty.\n`onmessage` | Handles incoming messages of type `audio`, `error` and `custom`. For details, see the API section.\n`onend` | Triggered when all message queues become empty.\n`onerror` | Handles system or class-level errors. 
If this handler is not set, such errors are thrown as exceptions. **Note:** Errors related to TTS conversion are sent to the `onmessage` handler (if defined) as messages of type `error`.\n\n\u003c/details\u003e\n\nSet up the voice:\n\n```javascript\nheadtts.setup({\n  voice: \"af_bella\",\n  language: \"en-us\",\n  speed: 1,\n  audioEncoding: \"wav\"\n});\n```\n\nThe HeadTTS client is stateful, so you don't need to call setup again\nunless you want to change a setting. For example, if you want to increase\nthe speed, simply call `headtts.setup({ speed: 1.5 })`.\n\nSynthesize speech using the current voice setup:\n\n```javascript\nheadtts.synthesize({\n  input: \"Test sentence.\"\n});\n```\n\nThe above approach relies on the `onmessage` event handler to\nreceive and handle response messages, and it is the recommended\napproach for real-time use cases. An alternative approach is to\n`await` all the related audio messages:\n\n```javascript\ntry {\n  const messages = await headtts.synthesize({\n    input: \"Some long text...\"\n  });\n  console.log(messages); // [{type: 'audio', data: {…}, ref: 1}, {…}, ...]\n} catch(error) {\n  console.error(error);\n}\n```\n\nThe `input` property can be a string or, alternatively, an array\nof strings or input items.\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see the available input ITEM TYPES.\u003c/summary\u003e\n\nType | Description | Example\n---|---|---\n`text` |  Speak the text in `value`. This is equivalent to giving a plain string input. | \u003cpre\u003e{\u003cbr\u003e  type: \"text\",\u003cbr\u003e  value: \"This is an example.\"\u003cbr\u003e}\u003c/pre\u003e\n`speech` |  Speak the text in `value` with corresponding subtitles in `subtitles` (optional). This type allows the spoken words to be different from the subtitles. 
| \u003cpre\u003e{\u003cbr\u003e  type: \"speech\",\u003cbr\u003e  value: \"One two three\",\u003cbr\u003e  subtitles: \"123\"\u003cbr\u003e}\u003c/pre\u003e\n`phonetic` | Speak the model-specific phonetic symbols in `value` with corresponding `subtitles` (optional). | \u003cpre\u003e{\u003cbr\u003e  type: \"phonetic\",\u003cbr\u003e  value: \"mˈɜɹʧəndˌIz\",\u003cbr\u003e  subtitles: \"merchandise\"\u003cbr\u003e}\u003c/pre\u003e\n`characters` | Speak the `value` character-by-character with corresponding `subtitles` (optional). Also supports numbers, which are read digit-by-digit. | \u003cpre\u003e{\u003cbr\u003e  type: \"characters\",\u003cbr\u003e  value: \"ABC-123-8\",\u003cbr\u003e  subtitles: \"ABC-123-8\"\u003cbr\u003e}\u003c/pre\u003e\n`number` | Speak the number in `value` with corresponding `subtitles` (optional). The number should be presented as a string. | \u003cpre\u003e{\u003cbr\u003e  type: \"number\",\u003cbr\u003e  value: \"123.5\",\u003cbr\u003e  subtitles: \"123.5\"\u003cbr\u003e}\u003c/pre\u003e\n`date` | Speak the date in `value` with corresponding `subtitles` (optional). The date is presented as milliseconds from epoch. | \u003cpre\u003e{\u003cbr\u003e  type: \"date\",\u003cbr\u003e  value: Date.now(),\u003cbr\u003e  subtitles: \"02/05/2025\"\u003cbr\u003e}\u003c/pre\u003e\n`time` | Speak the time in `value` with corresponding `subtitles` (optional). The time is presented as milliseconds from epoch. | \u003cpre\u003e{\u003cbr\u003e  type: \"time\",\u003cbr\u003e  value: Date.now(),\u003cbr\u003e  subtitles: \"6:45 PM\"\u003cbr\u003e}\u003c/pre\u003e\n`break` | Pause for the number of milliseconds given in `value`, with corresponding `subtitles` (optional). 
| \u003cpre\u003e{\u003cbr\u003e  type: \"break\",\u003cbr\u003e  value: 2000,\u003cbr\u003e  subtitles: \"...\"\u003cbr\u003e}\u003c/pre\u003e\n\nAn example using an array of input items:\n\n```javascript\n{\n  type: \"synthesize\",\n  id: 14, // Unique request identifier.\n  data: {\n    input: [\n      \"There were \",\n      { type: \"speech\", value: \"over two hundred \", subtitles: \"\u003e200 \" },\n      \"items of \",\n      { type: \"phonetic\", value: \"mˈɜɹʧəndˌIz \", subtitles: \"merchandise \" },\n      \"on sale.\"\n    ]\n  }\n}\n```\n\n\u003c/details\u003e\n\nYou can add a custom message to the message queue using\nthe `custom` method:\n\n```javascript\nheadtts.custom({\n  emoji: \"😀\"\n});\n```\n\nCustom messages can be used, for example, to synchronize\nspeech with animations, emojis, facial expressions, poses,\nand/or gestures. You need to implement the custom\nfunctionality yourself within the message handler.\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see all the class METHODS.\u003c/summary\u003e\n\nMethod | Description\n--- | ---\n`connect( settings=null, onprogress=null, onerror=null )` | Connects to the set of `endpoints` specified in the constructor or in the optional `settings` object. If the `settings` parameter is provided, it forces a reconnection. The `onprogress` callback handles `ProgressEvent` events, while the `onerror` callback handles system-level error events. Returns a promise. **Note:** When connecting to a RESTful server, the method sends a hello message and considers the connection established only if a text response starting with `HeadTTS` is received.\n`clear()` | Clears all work queues and resolves all promises.\n`setup( data, onerror=null )` | Adds a new setup request to the work queue. See the API section for the supported `data` properties. Returns a promise.\n`synthesize( data, onmessage=null, onerror=null )` | Adds a new synthesis request to the work queue. 
The `data` object supports the `input` and `userData` properties. The `userData` property is returned in the output as `message.userData` unchanged. If event handlers are provided, they override the default handlers. Returns a promise that resolves with a sorted array of related messages of type `\"audio\"` or `\"error\"`.\n`custom( data, onmessage=null, onerror=null )` | Adds a new custom message to the work queue. If event handlers are provided, they override other handlers. Returns a promise that resolves with the related message of the type `\"custom\"`.\n\n\u003c/details\u003e\n\n---\n\n# NodeJS WebSocket/RESTful Server: `headtts-node.mjs`\n\nInstall (requires Node.js v20+):\n\n```bash\ngit clone https://github.com/met4citizen/HeadTTS\ncd HeadTTS\nnpm install\n```\n\nStart the server:\n\n```bash\nnpm start\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see the COMMAND LINE OPTIONS.\u003c/summary\u003e\n\nOption|Description|Default\n---|---|---\n`--config [file]` | JSON configuration file name. | `./headtts-node.json`\n`--trace [0-255]` | Bitmask for debugging subsystems (`0`=none, `255`=all):\u003cbr\u003e\u003cul\u003e\u003cli\u003eBit 0 (1): Connection\u003c/li\u003e\u003cli\u003eBit 1 (2): Messages\u003c/li\u003e\u003cli\u003eBit 2 (4): Events\u003c/li\u003e\u003cli\u003eBit 3 (8): G2P\u003c/li\u003e\u003cli\u003eBit 4 (16): Language modules\u003c/li\u003e\u003c/ul\u003e | `0`\n\nAn example:\n\n```bash\nnode ./modules/headtts-node.mjs --trace 16\n```\n\n\u003c/details\u003e\n\nBy default, the server uses the `./headtts-node.json` configuration file.\n\n\u003cdetails\u003e\n  \u003csummary\u003eCLICK HERE to see the configurable PROPERTIES.\u003c/summary\u003e\n\nProperty|Description|Default\n---|---|---\n`server.port` | The port number the server listens on. | `8882`\n`server.certFile` | Path to the certificate file. | `null`\n`server.keyFile` | Path to the certificate key file. | `null`\n`server.websocket` | Enable the WebSocket server. 
| `true`\n`server.rest` | Enable the RESTful API server. | `true`\n`server.connectionTimeout` | Timeout duration for idle connections in milliseconds. | `20000`\n`server.corsOrigin` | Value for the `Access-Control-Allow-Origin` header. If `null`, CORS will not be enabled. | `*`\n`tts.threads` | Number of text-to-speech worker threads, ranging from 1 to the number of CPU cores. | `1`\n`tts.transformersModule` | Name of the transformers.js module to use. | `\"@huggingface/transformers\"`\n`tts.model` | The timestamped Kokoro TTS ONNX model. | `\"onnx-community/`\u003cbr\u003e`Kokoro-82M-v1.0-ONNX-timestamped\"`\n`tts.dtype` | The data type precision used for inference. Available options: `\"fp32\"`, `\"fp16\"`, `\"q8\"`, `\"q4\"`, or `\"q4f16\"`.  | `\"fp32\"`\n`tts.device` | Computation backend to use: `\"webgpu\"` or `\"cpu\"`. NOTE: Node.js WebGPU implementation in Transformers.js is not thread safe, so we can only have one thread for WebGPU. Others will be automatically started as `\"cpu\"`. | `\"webgpu\"`\n`tts.styleDim` | The embedding dimension for style. | `256`\n`tts.audioSampleRate` | Audio sample rate in Hertz (Hz). | `24000`\n`tts.frameRate` | Frame rate in frames per second (FPS). | `40`\n`tts.languages` | A list of languages to preload. | [`\"en-us\"`]\n`tts.dictionaryPath` | Path to the language modules. If `null`, dictionaries will not be used. | `\"./dictionaries\"`\n`tts.voicePath` | Path to the voice files. | `\"./voices\"`\n`tts.voices` | Array of voices to preload, e.g., `[\"af_bella\",\"am_fenrir\"]`. | `[]`\n`tts.deltaStart` | Adjustment (in ms) to viseme start times. | `-10`\n`tts.deltaEnd` | Adjustment (in ms) to viseme end times. | `10`\n`tts.defaults.voice` | Default voice to use. | `\"af_bella\"`\n`tts.defaults.language` | Default language to use. Supported options: `\"en-us\"`. | `\"en-us\"`\n`tts.defaults.speed` | Speaking speed. Range: 0.25–4. | `1`\n`tts.defaults.audioEncoding` | Default audio encoding format. 
Supported options are `\"wav\"` and `\"pcm\"` (PCM 16bit LE). | `\"wav\"`\n`trace` | Bitmask for debugging subsystems (`0`=none, `255`=all):\u003cbr\u003e\u003cul\u003e\u003cli\u003eBit 0 (1): Connection\u003c/li\u003e\u003cli\u003eBit 1 (2): Messages\u003c/li\u003e\u003cli\u003eBit 2 (4): Events\u003c/li\u003e\u003cli\u003eBit 3 (8): G2P\u003c/li\u003e\u003cli\u003eBit 4 (16): Language modules\u003c/li\u003e\u003c/ul\u003e  | `0`\n\n\u003c/details\u003e\n\n---\n\n# Appendix A: Server API reference\n\n## WebSocket API\n\nEvery WebSocket request must have a unique identifier, `id`. The server uses\na Web Worker thread pool, and because work is done in parallel,\nthe order of responses may vary. Therefore, each response includes\na `ref` property that identifies the original request, allowing\nthe order to be restored if necessary. The JS client class handles this\nautomatically.\n\n### Request: `setup`\n\n```javascript\n{\n  type: \"setup\",\n  id: 12, // Unique request identifier.\n  data: {\n    voice: \"af_bella\", // Voice name (optional)\n    language: \"en-us\", // Language (optional)\n    speed: 1, // Speed (optional)\n    audioEncoding: 'wav' // \"wav\" or \"pcm\" (PCM 16bit LE) (optional)\n  }\n}\n```\n\n### Request: `synthesize`\n\n```javascript\n{\n  type: \"synthesize\",\n  id: 13, // Unique request identifier.\n  data: {\n    input: \"This is an example.\" // String or array of input items\n  }\n}\n```\n\nThe response message for a `synthesize` request is either `error` or `audio`.\n\n### Response: `error`\n\n```javascript\n{\n  type: \"error\",\n  ref: 13, // Original request id\n  data: {\n    error: \"Error loading voice 'af_bella'.\"\n  }\n}\n```\n\n### Response: `audio`\n\nReturns audio object metadata that can be passed to the TalkingHead\n`speakAudio` method once the audio content itself has been added.\n\n```javascript\n{\n  type: \"audio\",\n  ref: 13,\n  data: {\n    words: ['This ', 'is ', 'an ', 'example.'],\n    wtimes: [440, 656, 876, 
1050],\n    wdurations: [236, 240, 194, 1035],\n    visemes: ['TH', 'I', 'SS', 'I', 'SS', 'aa', 'nn', 'I', 'kk', 'SS', 'aa', 'PP', 'PP', 'E', 'RR'],\n    vtimes: [440, 472, 562, 656, 753, 876, 993, 1050, 1097, 1149, 1200, 1322, 1372, 1423, 1499],\n    vdurations: [52, 110, 74, 117, 75, 137, 47, 67, 72, 71, 142, 70, 71, 96, 399],\n    phonemes: ['ð', 'ɪ', 's', 'ɪ', 'z', 'æ', 'n', 'ɪ', 'ɡ', 'z', 'æ', 'm', 'p', 'ə', 'l'],  \n    audioEncoding: \"wav\"\n  }\n}\n```\n\nThe actual audio content will be delivered after this message as\nbinary data (see the next response message).\n\n### Response: Binary (ArrayBuffer)\n\nBinary data as an [ArrayBuffer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer)\nrelated to the previous `audio` message. Depending on the configured audio encoding,\nthis is either a WAV file (`wav`) or a chunk of raw PCM 16bit LE samples (`pcm`).\n\n## RESTful API\n\nThe RESTful server API is a simpler alternative to the WebSocket API.\nThe REST server is stateless, so voice parameters must be included\nin each POST message. If you are using the HeadTTS client class,\nit handles this internally.\n\n### POST `/v1/synthesize`\n\nJSON | Description\n---|---\n`input` | Input to synthesize. String or an array of input items. 
For a string of text, maximum 500 characters.\n`voice` | Voice name.\n`language` | Language code.\n`speed` | Speed of speech.\n`audioEncoding` | Either \"wav\" for WAV file or \"pcm\" for raw PCM 16bit LE audio.\n\nOK response:\n\nJSON|Description\n---|---\n`audio` | Base64 encoded WAV data for `\"wav\"` or raw PCM 16bit LE samples for `\"pcm\"` audio encoding.\n`words` | Array of words.\n`wtimes` | Array of word starting times for `words` in milliseconds.\n`wdurations` | Array of word durations for `words` in milliseconds.\n`visemes` | Array of Oculus viseme IDs: `'aa'`, `'E'`, `'I'`, `'O'`, `'U'`, `'PP'`, `'SS'`, `'TH'`, `'CH'`, `'FF'`, `'kk'`, `'nn'`, `'RR'`, `'DD'`, `'sil'`.\n`vtimes` | Array of viseme starting times for `visemes` in milliseconds.\n`vdurations` | Array of viseme durations for `visemes` in milliseconds.\n`phonemes` | Array of phonemes corresponding to the array of visemes.\n`audioEncoding` | Audio encoding: `\"wav\"` or `\"pcm\"`.\n\nError response:\n\nJSON|Description\n---|---\n`error` | Error message string\n\n---\n\n# Appendix B: Language modules and dictionaries\n\n### American English, `en-us`\n\nThe American English language module is based on the\n[CMU Pronunciation Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)\nfrom Carnegie Mellon University, containing over 134,000 words and their\npronunciations. 
The original dataset is provided under a simplified\nBSD license, allowing free use for any research or commercial purpose.\n\nIn the [Kokoro](https://github.com/hexgrad/kokoro) TTS model,\nthe American English language data was trained using the\n[Misaki](https://github.com/hexgrad/misaki) G2P engine (en).\nTherefore, the original [ARPAbet](https://en.wikipedia.org/wiki/ARPABET)\nphonemes in the CMU dictionary have been converted to\n[IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet)\nand then to Misaki-compatible phonemes by applying the following mapping:\n\n- `ɚ` → [ `ɜ`, `ɹ` ], `ˈɝ` → [ `ˈɜ`, `ɹ` ], `ˌɝ` → [ `ˌɜ`, `ɹ` ]\n- `tʃ` → [ `ʧ` ], `dʒ` → [ `ʤ` ]\n- `eɪ` → [ `A` ], `ˈeɪ` → [ `ˈA` ], `ˌeɪ` → [ `ˌA` ]\n- `aɪ` → [ `I` ], `ˈaɪ` → [ `ˈI` ], `ˌaɪ` → [ `ˌI` ]\n- `aʊ` → [ `W` ], `ˈaʊ` → [ `ˈW` ], `ˌaʊ` → [ `ˌW` ]\n- `ɔɪ` → [ `Y` ], `ˈɔɪ` → [ `ˈY` ], `ˌɔɪ` → [ `ˌY` ]\n- `oʊ` → [ `O` ], `ˈoʊ` → [ `ˈO` ], `ˌoʊ` → [ `ˌO` ]\n- `əʊ` → [ `Q` ], `ˈəʊ` → [ `ˈQ` ], `ˌəʊ` → [ `ˌQ` ]\n\nThe final dictionary is a plain text file with around 125,000 lines (2.8 MB).\nLines starting with `;;;` are comments. Each remaining line represents\none word and its pronunciations. The word and its different possible\npronunciations are separated by a tab character `\\t`. An example entry:\n\n```text\nMERCHANDISE\tmˈɜɹʧəndˌIz\n```\n\nOut-of-dictionary (OOD) words are converted using a rule-based algorithm based\non NRL Report 7948, *Automatic Translation of English Text to Phonetics\nby Means of Letter-to-Sound Rules* (Elovitz et al., 1976). The report is\navailable [here](https://apps.dtic.mil/sti/pdfs/ADA021929.pdf).\n\n\n### Finnish, `fi`\n\n\u003e [!IMPORTANT]  \n\u003e As of now, the Finnish language is not supported by the Kokoro model.\nYou can use the `fi` language code with the English voices, but\nthe pronunciation will sound rather weird.\n\nThe phonemization of the Finnish language module is done by\na built-in algorithm. 
The algorithm doesn't require a pronunciation\ndictionary, but it uses a compound word dictionary to get the secondary\nstress marks right for compound words.\n\nThe dictionary used for compound words is based on\n[The Dictionary of Contemporary Finnish](https://en.kotus.fi/dictionaries/#Dictionary-of-Contemporary-Finnish)\nmaintained by the Institute for the Languages of Finland. The original\ndataset contains more than 100,000 entries and is open-sourced\nunder the CC BY 4.0 license.\n\nThe pre-processed compound word dictionary is a plain text file with\naround 50,000 entries in 10,000 lines (~350kB). Lines starting\nwith `;;;` are comments. Each other line represents the first part\nof a compound word and the first four letters of all possible\nnext words, all separated by a tab character `\\t`. An example entry:\n\n```text\nALUMIINI\tFOLI\tKATT\tOKSI\tPAPE\tSEOS\tVENE\tVUOK\n```\n\n---\n\n# Appendix C: Latency\n\nIn-browser TTS using WebGPU runs approximately 3x faster than real time\nand about 10x faster than WASM. CPU-based inference on a Node.js server\nperforms surprisingly well. However, increasing the thread pool size\ndegrades performance. WebGPU inference on a Node.js server is slightly\nfaster than CPU inference, but it supports only a single dedicated WebGPU\nthread. 
On Metal, the fastest HeadTTS configuration is WebGPU with two\nthreads where the second thread automatically falls back to CPU execution.\n\nI recommend using 32-bit floating point precision (fp32) for the best\naudio quality unless memory consumption becomes a concern.\n\nUnofficial latency results using my own\n[latency test app](https://github.com/met4citizen/HeadTTS/blob/main/tests/latency.html):\n\nTTS Engine/Setup |`FIL`\u003csup\u003e\\[1]\u003c/sup\u003e|`FBL`\u003csup\u003e\\[2]\u003c/sup\u003e|`RTF`\u003csup\u003e\\[3]\u003c/sup\u003e\n---|---|---|---\nHeadTTS, Chrome, WebGPU/fp32 | 8.6s | 852ms | **0.27**\nHeadTTS, Edge, WebGPU/fp32 | 8.8s | 858ms | 0.28\nHeadTTS, Safari, WebGPU/fp32 | 25.8s | 2437ms | 0.82\nHeadTTS, Chrome, WASM/q4 | 45.4s | 4404ms | 1.45\nHeadTTS, Edge, WASM/q4 | 45.5s | 4392ms | 1.45\nHeadTTS, Safari, WASM/q4 | 45.6s | 4447ms | 1.46\nHeadTTS, WebSocket, WebGPU/fp32, 1 thread | 6.6s | 719ms | 0.21\nHeadTTS, WebSocket, WebGPU/fp32, 2 threads | 3.8s | 742ms | **0.12**\nHeadTTS, WebSocket, CPU/fp32, 1 thread | 6.8s | 712ms | 0.22\nHeadTTS, WebSocket, CPU/fp32, 4 threads | 6.0s | 2341ms | 0.20\nHeadTTS, REST, WebGPU/fp32, 1 thread | 6.7s | 713ms | 0.21\nHeadTTS, REST, WebGPU/fp32, 2 threads | 3.6s | 717ms | **0.11**\nHeadTTS, REST, CPU/fp32, 1 thread | 7.0s | 793ms | 0.23\nHeadTTS, REST, CPU/fp32, 4 threads | 6.5s | 2638ms | 0.21\nElevenLabs, WebSocket | 4.8s | 977ms | 0.20\nElevenLabs, REST | 11.3s | 1097ms | 0.46\nElevenLabs, REST, Flash_v2_5 | 4.8s | 581ms | 0.22\nMicrosoft TTS, WebSocket (Speech SDK) | 1.1s | 274ms | 0.04\nGoogle TTS, REST | 0.79s | 67ms | 0.03\n\n\n\u003csup\u003e\\[1]\u003c/sup\u003e *Finish latency*: Total time from sending text input to receiving\nthe full audio.\n\n\u003csup\u003e\\[2]\u003c/sup\u003e *First byte/part/sentence latency*: Time from sending the text input\nto receiving the first playable byte/part/sentence of audio.\nNote: This measure is not comparable across all models, since some\nsolutions use 
streaming and some do not.\n\n\u003csup\u003e\\[3]\u003c/sup\u003e *Real-time factor* = Time to generate full audio / Duration of the full\naudio. If RTF \u003c 1, synthesis is faster than real-time (i.e., good).\n\n**Test setup**: MacBook Air M2 laptop, 8 cores, 16GB memory,\nmacOS Tahoe 26.0, Metal2 GPU 10 cores, 300/50 Mbit/s internet connection.\nThe latest Google Chrome, Edge, and Safari desktop browsers.\n\nAll test cases use WAV or raw PCM 16bit LE format and \"List 1\" of the\n[Harvard Sentences](https://www.cs.columbia.edu/~hgs/audio/harvard.html):\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmet4citizen%2Fheadtts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmet4citizen%2Fheadtts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmet4citizen%2Fheadtts/lists"}