{"id":16271128,"url":"https://github.com/hans00/phonemize","last_synced_at":"2026-01-27T03:01:37.191Z","repository":{"id":217339468,"uuid":"743567662","full_name":"hans00/phonemize","owner":"hans00","description":"Pure JS fast phonemizer with rule-based G2P prediction","archived":false,"fork":false,"pushed_at":"2025-08-20T08:04:59.000Z","size":373,"stargazers_count":15,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-28T03:08:30.787Z","etag":null,"topics":["chinese","english","g2p","ipa","phonemizer","tts"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hans00.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-15T14:09:44.000Z","updated_at":"2025-10-12T23:55:22.000Z","dependencies_parsed_at":"2024-10-27T21:38:47.251Z","dependency_job_id":"80a6f4d6-c732-4b3c-a444-d04f8c439ba0","html_url":"https://github.com/hans00/phonemize","commit_stats":null,"previous_names":["hans00/phonemize"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/hans00/phonemize","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hans00%2Fphonemize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hans00%2Fphonemize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hans00%2Fphonemize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hans00%2Fphonemize/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hans00","download_url":"https://codeload.github.com/hans00/phonemize/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hans00%2Fphonemize/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28798596,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-27T01:07:07.743Z","status":"online","status_checked_at":"2026-01-27T02:00:07.755Z","response_time":168,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","english","g2p","ipa","phonemizer","tts"],"created_at":"2024-10-10T18:12:34.469Z","updated_at":"2026-01-27T03:01:37.182Z","avatar_url":"https://github.com/hans00.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Phonemize\n\n[![CI](https://github.com/hans00/phonemize/workflows/CI/badge.svg)](https://github.com/hans00/phonemize/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/hans00/phonemize/branch/main/graph/badge.svg)](https://codecov.io/gh/hans00/phonemize)\n[![npm version](https://badge.fury.io/js/phonemize.svg)](https://badge.fury.io/js/phonemize)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Node.js Version](https://img.shields.io/node/v/phonemize.svg)](https://nodejs.org/)\n\nFast phonemizer with rule-based G2P (Grapheme-to-Phoneme) prediction.\nPure 
JavaScript implementation with no native dependencies.\n\nInspired by [ttstokenizer](https://github.com/neuml/ttstokenizer)\n\n## Features\n\n- ⚡ **Lightning fast** - Pure rule-based processing, no ML overhead\n- 🎯 **Intelligent compound word support** - Automatic decomposition of complex words\n- 📚 **Comprehensive dictionary** - 125,000+ word pronunciations\n- 🧠 **Smart rule-based G2P** - Advanced phonetic rules for unknown words\n- 🌍 **Multiple formats** - IPA, ARPABET, and Zhuyin output\n- 🌐 **Modular multilingual support** - G2P models load as separate modules\n- 💻 **Pure JavaScript** - No native dependencies, works everywhere\n- 🔧 **Simple API** - Easy to integrate and use\n\n## Installation\n\n```bash\nnpm install phonemize\n```\n\n## Quick Start\n\n```javascript\nimport { phonemize, toIPA, toARPABET } from 'phonemize'\n\n// Default IPA output\nconsole.log(phonemize('Hello world!'))\n// Output: həˈɫoʊ ˈwɝɫd!\n\n// ARPABET format\nconsole.log(toARPABET('Hello world!'))\n// Output: HH AX EL1 OW W1 ER EL D!\n```\n\n### Presets\n\nFor different language support needs, you can use preset modules:\n\n```javascript\n// Default: English only\nimport { phonemize } from 'phonemize'\n\n// Chinese + English\nimport { phonemize } from 'phonemize/zh'\n\n// All languages (English + Chinese + Japanese + Korean + Russian)\nimport { phonemize } from 'phonemize/all'\n\n// Clean\n```\n\n## API Reference\n\n### Basic Functions\n\n#### `phonemize(text, options?)`\nConvert text to phonemes.\n\n```javascript\nphonemize('Hello world!')                    // IPA string\nphonemize('Hello world!', { returnArray: true })  // IPA array\n```\n\n**Options:**\n- `returnArray` (boolean): Return array instead of string\n- `format` ('ipa' | 'arpabet'): Output format\n- `stripStress` (boolean): Remove stress markers\n- `separator` (string): Phoneme separator (default: ' ')\n- `anyAscii` (boolean): Enable multilingual support via anyAscii transliteration\n\n#### `toIPA(text, options?)`\nConvert text 
to IPA phonemes.\n\n```javascript\ntoIPA('Hello world!')  // \"həˈɫoʊ ˈwɝɫd!\"\n```\n\n#### `toARPABET(text, options?)`\nConvert text to ARPABET phonemes.\n\n```javascript\ntoARPABET('Hello world!')  // \"HH AX L OW1 W ER1 L D!\"\n```\n\n#### `toZhuyin(text, options?)`\nConvert text to Zhuyin (Bopomofo / 注音) format.\n\nThis function is specifically designed for Chinese text. Non-Chinese text will be phonemized to IPA as a fallback.\n\n**Note:** The output format is `Zhuyin + tone number` (e.g., `ㄓㄨㄥ1 ㄨㄣ2`), which is optimized for **Kokoro**.\n\n```javascript\nimport { toZhuyin } from 'phonemize';\n\ntoZhuyin('中文'); // \"ㄓㄨㄥ1 ㄨㄣ2\"\ntoZhuyin('你好世界'); // \"ㄋㄧ3 ㄏㄠ3 ㄕ4 ㄐㄧㄝ4\"\ntoZhuyin('中文 and English'); // \"ㄓㄨㄥ1 ㄨㄣ2 ænd ˈɪŋɡlɪʃ\"\n```\n\n#### `useG2P(processor)`\nRegister a G2P processor for multilingual support.\n\n```javascript\nimport { useG2P } from 'phonemize'\nimport ChineseG2P from 'phonemize/zh-g2p'\nimport JapaneseG2P from 'phonemize/ja-g2p'\n\n// Register G2P processors\nuseG2P(new ChineseG2P())\nuseG2P(new JapaneseG2P())\n\n// Now phonemize can handle Chinese and Japanese text\nphonemize('你好')  // → ni˧˥ xɑʊ˨˩˦\nphonemize('こんにちは', { anyAscii: true })  // → konnitɕiwa\n```\n\n### Custom Pronunciations\n\n```javascript\nimport { addPronunciation } from 'phonemize'\n\n// Add custom word pronunciation\naddPronunciation('myword', 'ˈmaɪwərd') // Can be IPA or ARPABET\nconsole.log(phonemize('myword'))  // \"ˈmaɪwərd\"\n```\n\n### Advanced Tokenization\n\n```javascript\nimport { Tokenizer, createTokenizer } from 'phonemize'\n\n// Create custom tokenizer\nconst tokenizer = createTokenizer({\n  format: 'ipa',\n  stripStress: true,\n  separator: '-'\n})\n\n// Tokenize with detailed info\nconst tokens = tokenizer.tokenizeToTokens('Hello world!')\n// [\n//   { phoneme: \"həɫoʊ\", word: \"Hello\", position: 0 },\n//   { phoneme: \"wɝɫd\", word: \"world\", position: 6 }\n// ]\n```\n\n## Text Processing Features\n\n### Number Expansion\nNumbers are automatically converted 
to words:\n\n```javascript\nphonemize('I have 123 apples')\n// \"ˈaɪ ˈhæv ˈwən ˈhəndɝd ˈtwɛni ˈθɹi ˈæpəɫz\"\n```\n\n### Abbreviation Expansion\nCommon abbreviations are expanded:\n\n```javascript\nphonemize('Dr. Smith and Mr. Johnson')\n// \"ˈdɑktɝ ˈsmɪθ ˈænd ˈmɪstɝ ˈdʒɑnsən\"\n```\n\n### Currency and Dates\nSpecial handling for currency and dates:\n\n```javascript\nphonemize('15 dollars in 2023')\n// \"ˈfɪfˈtin ˈdɑɫɝz ˈɪn ˈtwɛni ˈtwɛni ˈθɹi\"\n```\n\n## Performance\n\n- **Dictionary lookup**: O(1) - Instant for known words\n- **Rule-based processing**: Extremely fast, no model loading\n- **Compound decomposition**: Efficient balanced search algorithm\n- **Memory efficient**: Compressed JSON dictionaries only\n- **Zero startup time**: No model initialization required\n\nTypical performance: **\u003e10000 words/second** on modern hardware.\n\n## Processing Pipeline\n\n1. **Language Detection** - Detect language before anyAscii conversion (if enabled)\n2. **anyAscii Transliteration** - Convert non-Latin scripts to ASCII (if enabled)\n3. **Dictionary Lookup** - Check for exact word match\n4. **Multilingual Processing** - Handle Chinese, Japanese, Korean, etc.\n5. **Compound Detection** - Intelligent decomposition of compound words\n6. **Multi-Compound Handling** - Special processing for very long compounds\n7. **Rule-Based G2P** - Apply phonetic rules for unknown words\n\nNote: The rule-based G2P is LLM-generated and may produce errors. 
The best practice is to add custom pronunciations for unknown words.\n\n## Supported Phoneme Sets\n\n### IPA (International Phonetic Alphabet)\nStandard IPA symbols for English phonemes with stress marks.\n\n### ARPABET  \nCMU ARPABET phoneme set with stress numbers (0,1,2).\n\n## Building from Source\n\n```bash\n# Install dependencies\nyarn\n\n# Compile TypeScript and dictionaries\nyarn build\n\n# Run tests\nyarn test\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhans00%2Fphonemize","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhans00%2Fphonemize","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhans00%2Fphonemize/lists"}