https://github.com/tigregotico/orthography2ipa

Host: GitHub
URL: https://github.com/tigregotico/orthography2ipa
Owner: TigreGotico
Created: 2026-02-19T18:59:45.000Z (4 months ago)
Default Branch: dev
Last Pushed: 2026-06-10T15:46:42.000Z (7 days ago)
Last Synced: 2026-06-10T17:20:59.803Z (7 days ago)
Topics: dialects, grapheme-clusters, grapheme-to-phoneme, graphemes, ipa, phonemes, phonemizer, phonetics
Language: Python
Homepage:
Size: 3 MB
Stars: 0
Watchers: 0
Forks: 1
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Agents: AGENTS.md

# orthography2ipa

Linguistically motivated **grapheme→IPA** and **allophone** mappings for **350+ language codes** across 20+ language families. The package ships pure data, a maximal-munch IPA tokenizer, and a family of phonological and script distance metrics, with no trained weights.

Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.

## Why two maps

The central distinction the package enforces:

- A **grapheme map** tells you which phonemes a spelling *can* represent. English ⟨th⟩ → `['θ', 'ð']`.

- An **allophone map** tells you how a phoneme *surfaces* in context. English /t/ → `['t', 'tʰ', 'ɾ', 'ʔ', 't̚']`.

Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
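The two-step flow can be sketched with plain dictionaries. This is a minimal illustration, not the package's API: the data is copied from the English examples above, and the function names `phoneme_candidates` and `surface_forms` are hypothetical.

```python
# Toy data copied from the examples above; not loaded from orthography2ipa.
GRAPHEMES = {"th": ["θ", "ð"]}
ALLOPHONES = {"t": ["t", "tʰ", "ɾ", "ʔ", "t̚"]}


def phoneme_candidates(grapheme: str) -> list[str]:
    """Spelling -> phonemes the spelling can represent (transcription)."""
    return GRAPHEMES.get(grapheme, [])


def surface_forms(phoneme: str) -> list[str]:
    """Phoneme -> contextual realisations (pronunciation modelling).

    Falls back to the phoneme itself when no allophones are listed.
    """
    return ALLOPHONES.get(phoneme, [phoneme])


print(phoneme_candidates("th"))  # ['θ', 'ð']
print(surface_forms("t"))        # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
print(surface_forms("θ"))        # ['θ']
```

Because the two lookups never share a table, a spelling ambiguity (⟨th⟩) and a pronunciation variation (/t/) cannot be confused with one another.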

## What each language carries

Every `LanguageSpec` provides:

1. **Graphemes** — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.

2. **Allophones** — each phoneme mapped to its positional/contextual surface realisations.

3. **Positional graphemes** — context-sensitive overrides (word-initial, intervocalic, before /i/, …).

4. **Ancestry** — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.

5. **Sandhi rules** — cross-word phonological processes.

6. **Tone inventory** — tone marks → labels, where applicable.

7. **Provenance** — `QualityTier` (stub → skeleton → research → production), `ScriptType`, and bibliographic sources.
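As an illustration of item 3, a positional override can be pictured as a lookup that takes precedence over the default grapheme map when its context matches. The sketch below is an assumption about the idea only: the `POSITIONAL` shape, the `"before_i"` label, and the `candidates` function are hypothetical, and the check uses the following letter ⟨i⟩ as a stand-in for the phoneme /i/.

```python
from typing import Optional

# Hypothetical shapes; the real positional_graphemes structure may differ.
GRAPHEMES = {"t": ["t", "t͡ʃ"]}
POSITIONAL = {("t", "before_i"): ["t͡ʃ"]}


def candidates(grapheme: str, next_grapheme: Optional[str]) -> list[str]:
    """Return the positional override when its context matches, else the default list."""
    if next_grapheme == "i" and (grapheme, "before_i") in POSITIONAL:
        return POSITIONAL[(grapheme, "before_i")]
    return GRAPHEMES.get(grapheme, [])


print(candidates("t", "i"))  # ['t͡ʃ']
print(candidates("t", "a"))  # ['t', 't͡ʃ']
```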

Regional varieties get their own `LanguageSpec` objects linked through ancestry, and JSON data files support `graphemes_base`/`allophones_base` inheritance so a dialect only declares what differs from its parent.
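One way to picture this inheritance is a recursive merge in which the dialect's entries overwrite the parent's. The sketch below assumes `graphemes_base` holds the parent's code as a string, which the README does not confirm; the function `resolve_graphemes` and the codes `x-parent` and `x-dialect` are invented for illustration.

```python
def resolve_graphemes(code: str, data: dict[str, dict]) -> dict[str, list[str]]:
    """Merge a dialect's grapheme map over its base, following the chain upward."""
    entry = data[code]
    base_code = entry.get("graphemes_base")
    merged = resolve_graphemes(base_code, data) if base_code else {}
    merged.update(entry.get("graphemes", {}))  # the child's entries win
    return merged


# Toy data; the layout and the language codes are invented.
DATA = {
    "x-parent": {"graphemes": {"t": ["t"], "lh": ["ʎ"]}},
    "x-dialect": {"graphemes_base": "x-parent", "graphemes": {"t": ["t", "t͡ʃ"]}},
}

print(resolve_graphemes("x-dialect", DATA))
# {'t': ['t', 't͡ʃ'], 'lh': ['ʎ']}
```

The dialect file stays small: it lists only ⟨t⟩, and ⟨lh⟩ is picked up from the parent.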

## Installation

```bash
pip install orthography2ipa
```

For richer language-specific pipelines, install a downstream engine built on this library: [arbtok](https://github.com/TigreGotico/arbtok) for Arabic, [tugaphone](https://github.com/TigreGotico/tugaphone) for Portuguese.

## Quick start

### Transcribe text to IPA

```python
import orthography2ipa
from orthography2ipa import G2P

orthography2ipa.transcribe("olá mundo", "pt")        # 'oˈla ˈmundo'
orthography2ipa.transcribe("hello world", "en")       # 'hɛllɒ wɔːɹld'
orthography2ipa.transcribe("bona nuèit", "oc")        # 'buna nyɛjt'

# Beam search keeps ranked alternatives per word
engine = G2P("pt-PT")
result = engine.transcribe_detailed("um café", search="beam", beam_width=4)
result.ipa                          # 'ˈum kaˈfɛ'
result.words[1].candidates          # ranked IPAPath alternatives

# The engine pipeline: normalize → tokenize → greedy/beam per word →
# stress marks (when the spec declares stress rules) → sandhi →
# dialect transform. Downstream engines (arbtok for Arabic, tugaphone
# for Portuguese) build on this library for richer language-specific
# pipelines.
```
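To make the beam-search step concrete, here is a small self-contained sketch of the general technique. It is not the engine's implementation: the scoring rule (a candidate's position in its list, lower is better) is an assumption chosen for illustration, and `beam_search` and the `slots` data are hypothetical.

```python
def beam_search(slots: list[list[str]], beam_width: int = 4) -> list[tuple[str, float]]:
    """Expand candidates slot by slot, keeping the beam_width cheapest partial paths.

    Cost is an assumption for illustration: a candidate's index in its list,
    so first-listed phonemes are preferred. Lower is better.
    """
    beam = [("", 0.0)]
    for options in slots:
        expanded = [
            (ipa + phone, cost + rank)
            for ipa, cost in beam
            for rank, phone in enumerate(options)
        ]
        expanded.sort(key=lambda path: path[1])  # stable sort: ties keep their order
        beam = expanded[:beam_width]
    return beam


# Toy candidate lists for three graphemes; values are illustrative only.
slots = [["θ", "ð"], ["ɹ"], ["uː", "oʊ"]]
for ipa, cost in beam_search(slots, beam_width=3):
    print(ipa, cost)
# θɹuː 0.0
# θɹoʊ 1.0
# ðɹuː 1.0
```

Pruning after every slot keeps the work proportional to `beam_width` times the number of candidates per slot, instead of the full product of all candidate lists.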

### Language specs

```python
import orthography2ipa

# Get a language spec
en = orthography2ipa.get("en-GB")

# Grapheme → IPA candidates
en.graphemes["th"]    # ['θ', 'ð']

# Allophone map: how /t/ surfaces
en.allophones["t"]    # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']

# Metadata
en.name               # 'British English (RP)'
en.family             # 'Germanic'
en.script             # 'Latin'

# Regional variants share ancestry but diverge where pronunciation does
pt_br = orthography2ipa.get("pt-BR")
pt_br.graphemes["t"]  # ['t', 't͡ʃ'] (palatalisation before /i/)

# Bare tags, ISO 639-3 aliases and near matches all resolve
orthography2ipa.get("eng").name   # 'British English (RP)'
orthography2ipa.resolve("pt")     # 'pt-PT' (reference variety)
orthography2ipa.resolve("en-NZ")  # 'en-GB' (nearest registered)

# Discover what's available
orthography2ipa.available_codes()
orthography2ipa.available_families()
```

### IPA tokenizer

`PhonetokTokenizer` performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:

```python
from orthography2ipa import get
from orthography2ipa.phonetok import PhonetokTokenizer

tok = PhonetokTokenizer(get("en-GB"))
tok.ipa_best("through")              # 'θɹɔː'

for path in tok.ipa_beam("through", beam_width=8):
    print(path.ipa, path.score)      # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
```
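For readers unfamiliar with maximal munch, the sketch below shows the general technique in isolation: at each position, take the longest substring that is a known grapheme. It is independent of `PhonetokTokenizer`; the function `maximal_munch` and the toy inventory are hypothetical.

```python
def maximal_munch(word: str, graphemes: set[str]) -> list[str]:
    """Split word into graphemes, always taking the longest match at each position."""
    longest = max(map(len, graphemes), default=1)
    tokens, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in graphemes or size == 1:
                # Unknown single characters pass through unchanged.
                tokens.append(chunk)
                i += size
                break
    return tokens


# Toy inventory; a real spec would supply many more graphemes.
inventory = {"t", "h", "th", "r", "o", "u", "g", "ough", "s", "c", "sch"}
print(maximal_munch("through", inventory))  # ['th', 'r', 'ough']
```

Preferring the longest match is what keeps ⟨th⟩ from being read as ⟨t⟩ followed by ⟨h⟩.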

### Distance metrics

Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:

```python
from orthography2ipa import get
from orthography2ipa.distance import phonological_distance

d = phonological_distance(get("pt-BR"), get("pt-PT"))
d.combined                    # 0.04 (near-identical)
d.inventory.feature_mean      # phoneme-inventory distance
d.grapheme.mean_ipa_distance  # grapheme-mapping divergence
d.allophone_sim               # allophone-overlap similarity
```

Script-level distance and feature vectors are available via `script_distance.py` and `feats.py`.
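As a rough intuition for an inventory-level comparison, the sketch below computes a Jaccard distance between the phoneme sets of two toy grapheme maps. This is a swapped-in, simpler measure chosen for illustration, not the formula behind `phonological_distance`; the functions `inventory` and `jaccard_distance` and the sample data are hypothetical.

```python
def inventory(graphemes: dict[str, list[str]]) -> set[str]:
    """Collect every phoneme that any grapheme can produce."""
    return {phone for phones in graphemes.values() for phone in phones}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 for identical sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy grapheme maps, not real language data.
lang_a = {"t": ["t"], "d": ["d"], "s": ["s", "z"]}
lang_b = {"t": ["t", "t͡ʃ"], "d": ["d"], "s": ["s"]}
print(jaccard_distance(inventory(lang_a), inventory(lang_b)))  # 0.4
```

Here the two inventories share three phonemes out of five in total, giving a distance of 0.4.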

## Command-line interface

After installation the `orthography2ipa` command is available. Every subcommand accepts `--json` for machine-readable output.

```bash
# List languages and families
orthography2ipa list
orthography2ipa list --families
orthography2ipa list --family Romance

# Inspect a language
orthography2ipa info pt-BR
orthography2ipa info pt-BR --graphemes
orthography2ipa info pt-BR --json

# Transcribe text to IPA
orthography2ipa transcribe pt "olá mundo"
orthography2ipa transcribe en-GB "through" --search beam --beam-width 8

# Phonological distance between two languages
orthography2ipa distance pt-BR pt-PT
orthography2ipa distance es-ES it-IT --json
```

## Languages

| Family     | Examples |
|------------|----------|
| Romance    | `pt-PT`, `pt-BR`, `es-ES`, `es-AR`, `ca`, `fr-FR`, `it-IT`, `ro-RO`, `gl`, `oc`, `sc`, `an` |
| Germanic   | `en-GB`, `de-DE`, `nl-NL`, `sv-SE`, `da-DK`, `no-NO`, `af` |
| Slavic     | `ru-RU`, `uk-UA`, `pl-PL`, `cs-CZ`, `sr-RS`, `hr-HR`, `bg-BG` |
| Celtic     | `cy`, `ga`, `gd`, `br`, `kw`, `gv` |
| Indo-Aryan | `hi-IN`, `bn-BD`, `ur-PK`, `ne-NP`, `pa`, `gu`, `mr` |
| Semitic    | `arb`, `he-IL`, `mt` |
| Turkic     | `tr-TR`, `az`, `kk`, `uz` |
| Hellenic   | `el-GR` |
| Uralic     | `fi-FI`, `hu-HU`, `et-EE` |
| Japonic    | `ja` |
| Sinitic    | `zh` |
| Koreanic   | `ko` |

350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.

## Data structure

```python
@dataclass(frozen=True)
class LanguageSpec:
    code: str                              # 'pt-BR'
    name: str                              # 'Brazilian Portuguese'
    family: str                            # 'Romance'
    script: str                            # 'Latin'
    graphemes: Dict[str, List[str]]        # 'th' → ['θ', 'ð']
    allophones: Dict[str, List[str]]       # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
    positional_graphemes: Dict[...]        # context-sensitive overrides
    parent: Optional[str]                  # primary parent code
    ancestors: Tuple[Ancestor, ...]        # weighted multi-ancestor lineage
    quality: QualityTier                   # stub | skeleton | research | production
    script_type: ScriptType                # alphabet | abjad | abugida | ...
    sandhi_rules: Tuple[SandhiRule, ...]   # cross-word rules
    tone_inventory: Optional[Dict]         # tone marks → labels
    sources: Tuple[LinguisticSource, ...]  # bibliographic references
```

When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
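The derivation described above can be sketched in a few lines. This is an illustration of the rule as stated, not the package's code; the function name `identity_allophones` is hypothetical.

```python
def identity_allophones(graphemes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map every phoneme reachable from a grapheme to a list containing only itself."""
    allophones: dict[str, list[str]] = {}
    for phones in graphemes.values():
        for phone in phones:
            allophones.setdefault(phone, [phone])
    return allophones


print(identity_allophones({"th": ["θ", "ð"], "t": ["t"]}))
# {'θ': ['θ'], 'ð': ['ð'], 't': ['t']}
```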

## Design principles

- **Linguistically motivated only** — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.

- **Graphemes ≠ allophones** — spelling-to-phoneme and phoneme-to-surface are modelled separately.

- **Regional variants** — where pronunciation diverges systematically, a separate `LanguageSpec` is provided with ancestry links.

- **Multi-ancestor inheritance** — `graphemes_base`/`allophones_base` let dialect trees declare only their differences.

- **Pure data, self-contained logic** — mappings are declarative JSON; the engine never loads external G2P implementations.

## Building engines on top

`G2PPlugin` and `WordContext` are exported as the base types for richer language-specific engines built **on** this library — [arbtok](https://github.com/TigreGotico/arbtok) (Arabic: contextual rule cascade + tashkeel diacritization) and [tugaphone](https://github.com/TigreGotico/tugaphone) (Portuguese: lexicon, POS and regional-accent layers). They consume the spec data, tokenizer and stress machinery and own their own pipelines.

Component plugins that slot into the bundled engine's own logic use dedicated entry-point groups: per-language syllabifiers register under `orthography2ipa.syllabify` (e.g. `silabificador` for Portuguese) and are honoured by stress detection automatically.
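For context, Python entry-point groups are discovered through the standard library's `importlib.metadata`. The sketch below lists whatever is registered under the `orthography2ipa.syllabify` group named above. It shows the generic discovery mechanism only; how the package itself loads plugins, and how entry-point names map to languages, is not documented here, and `discover_syllabifiers` is a hypothetical helper.

```python
from importlib.metadata import entry_points

GROUP = "orthography2ipa.syllabify"  # group name taken from the text above


def discover_syllabifiers() -> dict[str, object]:
    """Load every object registered under GROUP, keyed by entry-point name."""
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a plain dict of groups.
    group = eps.select(group=GROUP) if hasattr(eps, "select") else eps.get(GROUP, [])
    return {ep.name: ep.load() for ep in group}


print(sorted(discover_syllabifiers()))  # [] when no plugin is installed
```

A plugin package would advertise itself by declaring an entry point in this group in its own packaging metadata.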

## Contributing

To add a language, create `orthography2ipa/data/{code}.json` following `orthography2ipa/data/SCHEMA.md`. For dialects, use `graphemes_base`/`allophones_base` to inherit from the parent.

## License

Apache 2.0