{"id":47927216,"url":"https://github.com/tigregotico/orthography2ipa","last_synced_at":"2026-06-11T13:00:17.699Z","repository":{"id":348321687,"uuid":"1161999361","full_name":"TigreGotico/orthography2ipa","owner":"TigreGotico","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-10T15:46:42.000Z","size":3142,"stargazers_count":0,"open_issues_count":4,"forks_count":1,"subscribers_count":0,"default_branch":"dev","last_synced_at":"2026-06-10T17:20:59.803Z","etag":null,"topics":["dialects","grapheme-clusters","grapheme-to-phoneme","graphemes","ipa","phonemes","phonemizer","phonetics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TigreGotico.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-19T18:59:45.000Z","updated_at":"2026-06-10T15:49:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/TigreGotico/orthography2ipa","commit_stats":null,"previous_names":["tigregotico/orthography2ipa"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/TigreGotico/orthography2ipa","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TigreGotico%2Forthography2ipa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TigreGotico%2Forthography2ipa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TigreGotico%2Forthography2ipa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TigreGotico%2Forthography2ipa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TigreGotico","download_url":"https://codeload.github.com/TigreGotico/orthography2ipa/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TigreGotico%2Forthography2ipa/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34199516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dialects","grapheme-clusters","grapheme-to-phoneme","graphemes","ipa","phonemes","phonemizer","phonetics"],"created_at":"2026-04-04T06:53:12.366Z","updated_at":"2026-06-11T13:00:17.654Z","avatar_url":"https://github.com/TigreGotico.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# orthography2ipa\n\nLinguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.\n\nOnly mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.\n\n## Why two maps\n\nThe central distinction the package enforces:\n\n- A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.\n- An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.\n\nKeeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.\n\n## What each language carries\n\nEvery `LanguageSpec` provides:\n\n1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.\n2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.\n3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).\n4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.\n5. **Sandhi rules** — cross-word phonological processes.\n6. **Tone inventory** — tone marks → labels, where applicable.\n7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.\n\nRegional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.\n\n## Installation\n\n```bash\npip install orthography2ipa\n```\n\nFor richer language-specific pipelines, install a downstream engine\nbuilt on this library: [arbtok](https://github.com/TigreGotico/arbtok)\nfor Arabic, [tugaphone](https://github.com/TigreGotico/tugaphone) for\nPortuguese.\n\n## Quick start\n\n### Transcribe text to IPA\n\n```python\nimport orthography2ipa\n\northography2ipa.transcribe(\"olá mundo\", \"pt\")        # 'oˈla ˈmundo'\northography2ipa.transcribe(\"hello world\", \"en\")       # 'hɛllɒ wɔːɹld'\northography2ipa.transcribe(\"bona nuèit\", \"oc\")        # 'buna nyɛjt'\n\n# Beam search keeps ranked alternatives per word\nfrom orthography2ipa import G2P\n\nengine = G2P(\"pt-PT\")\nresult = engine.transcribe_detailed(\"um café\", search=\"beam\", beam_width=4)\nresult.ipa                          # 'ˈum kaˈfɛ'\nresult.words[1].candidates          # ranked IPAPath alternatives\n\n# The engine pipeline: normalize → tokenize → greedy/beam per word →\n# stress marks (when the spec declares stress rules) → sandhi →\n# dialect transform. Downstream engines (arbtok for Arabic, tugaphone\n# for Portuguese) build on this library for richer language-specific\n# pipelines.\n```\n\n### Language specs\n\n```python\nimport orthography2ipa\n\n# Get a language spec\nen = orthography2ipa.get(\"en-GB\")\n\n# Grapheme → IPA candidates\nen.graphemes[\"th\"]    # ['θ', 'ð']\n\n# Allophone map: how /t/ surfaces\nen.allophones[\"t\"]    # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']\n\n# Metadata\nen.name               # 'British English (RP)'\nen.family             # 'Germanic'\nen.script             # 'Latin'\n\n# Regional variants share ancestry but diverge where pronunciation does\npt_br = orthography2ipa.get(\"pt-BR\")\npt_br.graphemes[\"t\"]  # ['t', 't͡ʃ']   — palatalisation before /i/\n\n# Bare tags, ISO 639-3 aliases and near matches all resolve\northography2ipa.get(\"eng\").name   # 'British English (RP)'\northography2ipa.resolve(\"pt\")     # 'pt-PT' — reference variety\northography2ipa.resolve(\"en-NZ\")  # 'en-GB' — nearest registered\n\n# Discover what's available\northography2ipa.available_codes()\northography2ipa.available_families()\n```\n\n### IPA tokenizer\n\n`PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:\n\n```python\nfrom orthography2ipa import get\nfrom orthography2ipa.phonetok import PhonetokTokenizer\n\ntok = PhonetokTokenizer(get(\"en-GB\"))\n\ntok.ipa_best(\"through\")              # 'θɹɔː'\nfor path in tok.ipa_beam(\"through\", beam_width=8):\n    print(path.ipa, path.score)      # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …\n```\n\n### Distance metrics\n\nCompare two languages across inventory, grapheme, allophone, and ancestry dimensions:\n\n```python\nfrom orthography2ipa import get\nfrom orthography2ipa.distance import phonological_distance\n\nd = phonological_distance(get(\"pt-BR\"), get(\"pt-PT\"))\nd.combined                    # 0.04 — near-identical\nd.inventory.feature_mean      # phoneme-inventory distance\nd.grapheme.mean_ipa_distance  # grapheme-mapping divergence\nd.allophone_sim               # allophone-overlap similarity\n```\n\nScript-level distance and feature vectors are available via `script_distance.py` and `feats.py`.\n\n## Command-line interface\n\nAfter installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.\n\n```bash\n# List languages and families\northography2ipa list\northography2ipa list --families\northography2ipa list --family Romance\n\n# Inspect a language\northography2ipa info pt-BR\northography2ipa info pt-BR --graphemes\northography2ipa info pt-BR --json\n\n# Transcribe text to IPA\northography2ipa transcribe pt \"olá mundo\"\northography2ipa transcribe en-GB \"through\" --search beam --beam-width 8\n\n# Phonological distance between two languages\northography2ipa distance pt-BR pt-PT\northography2ipa distance es-ES it-IT --json\n```\n\n## Languages\n\n| Family     | Examples |\n|------------|----------|\n| Romance    | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |\n| Germanic   | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |\n| Slavic     | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |\n| Celtic     | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |\n| Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |\n| Semitic    | `arb`, `he-IL`, `mt` |\n| Turkic     | `tr-TR`, `az`, `kk`, `uz` |\n| Hellenic   | `el-GR` |\n| Uralic     | `fi-FI`, `hu-HU`, `et-EE` |\n| Japonic    | `ja` |\n| Sinitic    | `zh` |\n| Koreanic   | `ko` |\n\n350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.\n\n## Data structure\n\n```python\n@dataclass(frozen=True)\nclass LanguageSpec:\n    code: str                              # 'pt-BR'\n    name: str                              # 'Brazilian Portuguese'\n    family: str                            # 'Romance'\n    script: str                            # 'Latin'\n    graphemes: Dict[str, List[str]]        # 'th' → ['θ', 'ð']\n    allophones: Dict[str, List[str]]       # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']\n    positional_graphemes: Dict[...]        # context-sensitive overrides\n    parent: Optional[str]                  # primary parent code\n    ancestors: Tuple[Ancestor, ...]        # weighted multi-ancestor lineage\n    quality: QualityTier                   # stub | skeleton | research | production\n    script_type: ScriptType                # alphabet | abjad | abugida | ...\n    sandhi_rules: Tuple[SandhiRule, ...]   # cross-word rules\n    tone_inventory: Optional[Dict]         # tone marks → labels\n    sources: Tuple[LinguisticSource, ...]  # bibliographic references\n```\n\nWhen a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.\n\n## Design principles\n\n- **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.\n- **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.\n- **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.\n- **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.\n- **Pure data, self-contained logic** — mappings are declarative JSON; the engine never loads external G2P implementations.\n\n## Building engines on top\n\n`G2PPlugin` and `WordContext` are exported as the base types for richer language-specific engines built **on** this library — [arbtok](https://github.com/TigreGotico/arbtok) (Arabic: contextual rule cascade + tashkeel diacritization) and [tugaphone](https://github.com/TigreGotico/tugaphone) (Portuguese: lexicon, POS and regional-accent layers). They consume the spec data, tokenizer and stress machinery and own their own pipelines.\n\nComponent plugins that slot into the bundled engine's own logic use dedicated entry-point groups: per-language syllabifiers register under `orthography2ipa.syllabify` (e.g. `silabificador` for Portuguese) and are honoured by stress detection automatically.\n\n## Contributing\n\nTo add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftigregotico%2Forthography2ipa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftigregotico%2Forthography2ipa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftigregotico%2Forthography2ipa/lists"}