{"id":30697689,"url":"https://github.com/tabahi/bournemouth-forced-aligner","last_synced_at":"2026-02-24T02:09:12.803Z","repository":{"id":310668351,"uuid":"1040425657","full_name":"tabahi/bournemouth-forced-aligner","owner":"tabahi","description":"Extract phoneme-level timestamps from speech audio.","archived":false,"fork":false,"pushed_at":"2026-02-16T04:02:17.000Z","size":7713,"stargazers_count":116,"open_issues_count":0,"forks_count":12,"subscribers_count":6,"default_branch":"main","last_synced_at":"2026-02-16T11:15:37.193Z","etag":null,"topics":["alignment","forced-alignment","phonemes","speech","speech-processing","speech-recognition","text-to-speech","timestamps","tts","tts-dataset","word"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/bournemouth-forced-aligner/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tabahi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-19T00:53:15.000Z","updated_at":"2026-02-16T04:02:21.000Z","dependencies_parsed_at":"2025-09-10T04:24:18.940Z","dependency_job_id":"83f9d227-5059-4c9f-b2a3-d56e507ff70d","html_url":"https://github.com/tabahi/bournemouth-forced-aligner","commit_stats":null,"previous_names":["tabahi/bournemouth-forced-aligner"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/tabahi/bournemouth-forced-aligner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabahi%2Fbournem
outh-forced-aligner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabahi%2Fbournemouth-forced-aligner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabahi%2Fbournemouth-forced-aligner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabahi%2Fbournemouth-forced-aligner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tabahi","download_url":"https://codeload.github.com/tabahi/bournemouth-forced-aligner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabahi%2Fbournemouth-forced-aligner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29768842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-24T01:40:24.820Z","status":"online","status_checked_at":"2026-02-24T02:00:07.497Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","forced-alignment","phonemes","speech","speech-processing","speech-recognition","text-to-speech","timestamps","tts","tts-dataset","word"],"created_at":"2025-09-02T09:38:43.545Z","updated_at":"2026-02-24T02:09:12.796Z","avatar_url":"https://github.com/tabahi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bournemouth Forced Aligner (BFA)\n\n\u003cdiv align=\"center\"\u003e\n\n[![Python 
3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![PyPI version](https://badge.fury.io/py/bournemouth-forced-aligner.svg)](https://badge.fury.io/py/bournemouth-forced-aligner)\n[![License: GPLv3](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![GitHub stars](https://img.shields.io/github/stars/tabahi/bournemouth-forced-aligner.svg)](https://github.com/tabahi/bournemouth-forced-aligner/stargazers)\n\n**Automatically label which phoneme is spoken at which millisecond in an audio recording.**\n\n[🚀 Quick Start](#getting-started) • [🔧 Installation](#-installation) • [🌍 Languages](#-language-presets) • [💻 CLI](#-command-line-interface-cli) • [🐛 Issues](https://github.com/tabahi/bournemouth-forced-aligner/issues)\n\n\u003c/div\u003e\n\n---\n\n## What does it do?\n\n**Forced alignment** is the process of automatically finding *exactly when* each word or sound occurs in an audio recording, given a transcript of what was said.\n\nBFA takes two inputs — an **audio file** and its **text transcript** — and produces a detailed output with millisecond timestamps for every phoneme (the individual sounds that make up words).\n\n```\nInput:  🎵 audio.wav  +  📄 \"butterfly\"\n\nOutput:\n  b  →  33 ms – 50 ms   (confidence: 0.99)\n  ʌ  →  100 ms – 117 ms (confidence: 0.84)\n  ɾ  →  134 ms – 151 ms (confidence: 0.29)\n  ɚ  →  285 ms – 302 ms (confidence: 0.74)\n  f  →  352 ms – 403 ms (confidence: 0.99)\n  l  →  520 ms – 554 ms (confidence: 0.92)\n  aɪ →  604 ms – 621 ms (confidence: 0.41)\n```\n\nThe output can be exported as a **Praat TextGrid** file for visual inspection, or as **JSON** for use in scripts and TTS pipelines.\n\n### Key Facts\n\n| | |\n|---|---|\n| ⚡ Speed | ~0.2 seconds to align 10 seconds of audio (on CPU) |\n| 🌍 Languages | 80+ languages — English, German, French, Spanish, Hindi, Arabic, and many more |\n| 📄 Output formats | JSON timestamps, Praat TextGrid, phoneme 
embeddings |\n| 🖥️ Runs on | Linux, macOS, Windows — CPU or GPU |\n\n---\n\n## *What's new in v1.1.0* (February 2026)\n\n- **Better long-sentence handling** — silence anchoring now splits long segments at pauses (like commas and full stops), making alignment significantly more accurate for full sentences.\n- **Original IPA labels preserved** — the output now includes both the standardised ph66 label and the original espeak-ng IPA form, which is useful for fine-grained phonetic analysis and TTS training.\n- **Improved quality checks** — `coverage_analysis` in the output now runs multiple alignment quality tests. If `\"bad_confidence\": true`, that segment is likely unreliable and should be discarded.\n- **⚠️ Breaking change** — v1.1.0 refactors the API for batch processing. Code written for v1.0.x will need minor updates (see [CHANGELOG](CHANGELOG.md)).\n\n---\n\n**Words and phonemes aligned to a Mel-spectrogram:**\n\n![Aligned mel-spectrum plot](examples/samples/images/LJ02_mel_words.png)\n![Aligned mel-spectrum plot](examples/samples/images/LJ01_mel_phonemes.png)\n\n*See [mel_spectrum_alignment.py](examples/mel_spectrum_alignment.py) for a full example.*\n\n---\n\n## 🚀 Installation\n\n\u003e **Prerequisites:** Python 3.8 or newer. 
If you don't have Python yet, download it from [python.org](https://www.python.org/downloads/).\n\n### Step 1 — Install system dependencies\n\n**Linux (Ubuntu / Debian):**\n```bash\nsudo apt-get install espeak-ng ffmpeg\n```\n\n**macOS:**\n```bash\nbrew install espeak-ng ffmpeg\n```\n\n**Windows:**  \nDownload and install [eSpeak NG](https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md) and [ffmpeg](https://ffmpeg.org/download.html), then add both to your system PATH.\n\n### Step 2 — Install BFA\n\n```bash\npip install bournemouth-forced-aligner\n```\n\n### Step 3 — Verify it works\n\n```bash\npython -c \"from bournemouth_aligner import PhonemeTimestampAligner; print('Installation successful!')\"\n```\n\n---\n\n## Getting Started\n\n### Quickstart — command line\n\nIf you just want to try BFA on a file right away, no Python needed:\n\n```bash\nbalign butterfly.wav \"butterfly\" --preset=en-us --mel-path=mel_spectrum.png\n```\n\nThis aligns the word *butterfly* in [`butterfly.wav`](examples/samples/audio/109867__timkahn__butterfly.wav), saves the phoneme timestamps to `butterfly.vs.json` (same folder as the audio), and also saves a mel-spectrogram image to [`mel_spectrum.png`](mel_spectrum_butterfly.png).\n\n\n---\n\n\n![Butterfly mel-spectrum plot](mel_spectrum_butterfly.png)\n\n\u003e **Tip:** Replace `\"butterfly\"` with any sentence spoken in the audio, and `--preset=en-us` with your language code (`de`, `fr`, `hi`, `ar`, …). Run `balign --help` to see more options.\n\n\n### Python example\n\n```python\nfrom bournemouth_aligner import PhonemeTimestampAligner\n\n# 1. Create the aligner — it will automatically download the right model\n#    Change \"en-us\" to your language code, e.g. \"de\", \"fr\", \"es\", \"hi\"\naligner = PhonemeTimestampAligner(preset=\"en-us\")\n\n# 2. Load your audio file (WAV, MP3, FLAC, etc.)\naudio = aligner.load_audio(\"butterfly.wav\")\n\n# 3. 
Run alignment\n#    Provide the transcript exactly as spoken in the audio\nresult = aligner.process_sentence(\"butterfly\", audio)\n\n# 4. Print the phoneme timestamps\nfor phoneme in result[\"segments\"][0][\"phoneme_ts\"]:\n    print(f\"{phoneme['ipa_label']:\u003e5}  {phoneme['start_ms']:.0f} ms – {phoneme['end_ms']:.0f} ms  (confidence: {phoneme['confidence']:.2f})\")\n```\n\nThat's it. No configuration files, no dictionary downloads, no corpus setup.\n\n\u003e **Note:** The first run will automatically download the model from HuggingFace (~50 MB). Subsequent runs use the cached copy.\n\n### Export to Praat TextGrid\n\n```python\n# Save a TextGrid file that you can open directly in Praat\naligner.convert_to_textgrid(result, output_file=\"my_recording.TextGrid\")\n```\n\n### Multi-language examples\n\n```python\n# German\naligner_de = PhonemeTimestampAligner(preset=\"de\")\n\n# French\naligner_fr = PhonemeTimestampAligner(preset=\"fr\")\n\n# Hindi\naligner_hi = PhonemeTimestampAligner(preset=\"hi\")\n\n# Spanish\naligner_es = PhonemeTimestampAligner(preset=\"es\")\n```\n\nSee the [full preset list](#-language-presets) for all 80+ supported languages.\n\n---\n\n### 📊 Sample Output\n\n\u003cdetails\u003e\n\u003csummary\u003e📋 Click to see the full JSON output for \"butterfly\"\u003c/summary\u003e\n\n```json\n\n{\n    \"segments\": [\n        {\n            \"start\": 0.0,\n            \"end\": 1.2588125,\n            \"text\": \"butterfly\",\n            \"ph66\": [\n                29,\n                10,\n                58,\n                9,\n                43,\n                56,\n                23\n            ],\n            \"pg16\": [\n                7,\n                2,\n                14,\n                2,\n                8,\n                13,\n                5\n            ],\n            \"coverage_analysis\": {\n                \"target_count\": 7,\n                \"aligned_count\": 7,\n                \"missing_count\": 0,\n          
      \"extra_count\": 0,\n                \"coverage_ratio\": 1.0,\n                \"missing_phonemes\": [],\n                \"extra_phonemes\": []\n            },\n            \"ipa\": [\n                \"b\",\n                \"ʌ\",\n                \"ɾ\",\n                \"ɚ\",\n                \"f\",\n                \"l\",\n                \"aɪ\"\n            ],\n            \"word_num\": [\n                0,\n                0,\n                0,\n                0,\n                0,\n                0,\n                0\n            ],\n            \"words\": [\n                \"butterfly\"\n            ],\n            \"phoneme_ts\": [\n                {\n                    \"phoneme_id\": 29,\n                    \"phoneme_label\": \"b\",\n                    \"ipa_label\": \"b\",\n                    \"start_ms\": 33.56833267211914,\n                    \"end_ms\": 50.35249710083008,\n                    \"confidence\": 0.9970603585243225,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 0,\n                    \"index\": 0\n                },\n                {\n                    \"phoneme_id\": 10,\n                    \"phoneme_label\": \"ʌ\",\n                    \"ipa_label\": \"ʌ\",\n                    \"start_ms\": 100.70499420166016,\n                    \"end_ms\": 117.48916625976562,\n                    \"confidence\": 0.8809734582901001,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 1,\n                    \"index\": 1\n                },\n                {\n                    \"phoneme_id\": 58,\n                    \"phoneme_label\": \"ɾ\",\n                    \"ipa_label\": \"ɾ\",\n                    \"start_ms\": 134.27333068847656,\n                    \"end_ms\": 151.0574951171875,\n                    \"confidence\": 0.07298105955123901,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 2,\n               
     \"index\": 2\n                },\n                {\n                    \"phoneme_id\": 9,\n                    \"phoneme_label\": \"ɚ\",\n                    \"ipa_label\": \"ɚ\",\n                    \"start_ms\": 285.3308410644531,\n                    \"end_ms\": 302.114990234375,\n                    \"confidence\": 0.18375618755817413,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 3,\n                    \"index\": 3\n                },\n                {\n                    \"phoneme_id\": 43,\n                    \"phoneme_label\": \"f\",\n                    \"ipa_label\": \"f\",\n                    \"start_ms\": 369.2516784667969,\n                    \"end_ms\": 402.8199768066406,\n                    \"confidence\": 0.952548086643219,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 4,\n                    \"index\": 4\n                },\n                {\n                    \"phoneme_id\": 56,\n                    \"phoneme_label\": \"l\",\n                    \"ipa_label\": \"l\",\n                    \"start_ms\": 520.3091430664062,\n                    \"end_ms\": 553.8775024414062,\n                    \"confidence\": 0.9023684859275818,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 5,\n                    \"index\": 5\n                },\n                {\n                    \"phoneme_id\": 23,\n                    \"phoneme_label\": \"aɪ\",\n                    \"ipa_label\": \"aɪ\",\n                    \"start_ms\": 604.22998046875,\n                    \"end_ms\": 621.01416015625,\n                    \"confidence\": 0.11104730516672134,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 6,\n                    \"index\": 6\n                }\n            ],\n            \"group_ts\": [\n                {\n                    \"group_id\": 7,\n                    
\"group_label\": \"voiced_stops\",\n                    \"start_ms\": 33.56833267211914,\n                    \"end_ms\": 50.35249710083008,\n                    \"confidence\": 0.9979244470596313,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 0,\n                    \"index\": 0\n                },\n                {\n                    \"group_id\": 2,\n                    \"group_label\": \"central_vowels\",\n                    \"start_ms\": 100.70499420166016,\n                    \"end_ms\": 117.48916625976562,\n                    \"confidence\": 0.9000658392906189,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 1,\n                    \"index\": 1\n                },\n                {\n                    \"group_id\": 14,\n                    \"group_label\": \"rhotics\",\n                    \"start_ms\": 117.48916625976562,\n                    \"end_ms\": 151.0574951171875,\n                    \"confidence\": 0.0318431481719017,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 2,\n                    \"index\": 2\n                },\n                {\n                    \"group_id\": 2,\n                    \"group_label\": \"central_vowels\",\n                    \"start_ms\": 285.3308410644531,\n                    \"end_ms\": 302.114990234375,\n                    \"confidence\": 0.5893039703369141,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 3,\n                    \"index\": 3\n                },\n                {\n                    \"group_id\": 8,\n                    \"group_label\": \"voiceless_fricatives\",\n                    \"start_ms\": 352.4674987792969,\n                    \"end_ms\": 402.8199768066406,\n                    \"confidence\": 0.9883034229278564,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 4,\n                    
\"index\": 4\n                },\n                {\n                    \"group_id\": 13,\n                    \"group_label\": \"laterals\",\n                    \"start_ms\": 520.3091430664062,\n                    \"end_ms\": 553.8775024414062,\n                    \"confidence\": 0.8932946920394897,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 5,\n                    \"index\": 5\n                },\n                {\n                    \"group_id\": 5,\n                    \"group_label\": \"diphthongs\",\n                    \"start_ms\": 604.22998046875,\n                    \"end_ms\": 621.01416015625,\n                    \"confidence\": 0.42801225185394287,\n                    \"is_estimated\": false,\n                    \"target_seq_idx\": 6,\n                    \"index\": 6\n                }\n            ],\n            \"words_ts\": [\n                {\n                    \"word\": \"butterfly\",\n                    \"start_ms\": 33.56833267211914,\n                    \"end_ms\": 621.01416015625,\n                    \"confidence\": 0.585819277380194,\n                    \"ph66\": [\n                        29,\n                        10,\n                        58,\n                        9,\n                        43,\n                        56,\n                        23\n                    ],\n                    \"ipa\": [\n                        \"b\",\n                        \"ʌ\",\n                        \"ɾ\",\n                        \"ɚ\",\n                        \"f\",\n                        \"l\",\n                        \"aɪ\"\n                    ]\n                }\n            ]\n        }\n    ]\n}\n```\n\n\u003c/details\u003e\n\n### Understanding the Output\n\nEvery segment in the output contains the following keys:\n\n| Key | What it contains | Useful for |\n|-----|-----------------|------------|\n| `phoneme_ts` | Per-phoneme timestamps: `ipa_label`, `start_ms`, 
`end_ms`, `confidence`, `is_estimated` | Phonetic annotation, TTS training |\n| `group_ts` | Same timestamps but grouped into 16 broad phoneme categories (vowels, stops, fricatives…) | Coarser acoustic analysis |\n| `words_ts` | Word-level timestamps derived from the phonemes | Word-level annotation |\n| `ipa` | IPA sequence for the segment (from espeak-ng) | Phonemic transcription reference |\n| `ph66` | Numeric indices into BFA's 66-class phoneme set | Model-internal representation |\n| `pg16` | Numeric indices into 16 phoneme group categories | Model-internal representation |\n| `words` | List of words in the segment | Word boundary reference |\n| `word_num` | Word index for each phoneme (parallel to `ph66`) | Maps each phoneme to its parent word |\n| `coverage_analysis` | Quality report: how many phonemes were aligned, missing, or estimated | Filtering out bad segments |\n\n**Reading `confidence`:** A value close to `1.0` means the model was very sure this phoneme occurred at this time. Values below `0.2` should be treated with caution. If `is_estimated` is `true`, the phoneme was placed by the recovery mechanism (not found by the primary Viterbi pass) — filter these out for high-precision work.\n\n---\n\n## 🛠️ API Reference\n\n### 🌍 Language Presets\n\nBFA supports **80+ languages**. Just pass the language code as `preset` and the right model is selected automatically. 
Phonemisation is handled by [espeak-ng](https://github.com/espeak-ng/espeak-ng), which is free and works offline.\n\n**Supported language families:** Indo-European (Germanic, Romance, Slavic, Indic, Iranian…), Turkic, Semitic, Dravidian, and several others.\n\n**Not supported:** Tonal languages — Chinese (Mandarin, Cantonese), Vietnamese, Thai, Burmese — and languages from unrelated families such as Japanese and Korean, because the underlying acoustic model was not trained on them.\n\n```python\naligner = PhonemeTimestampAligner(preset=\"de\")   # German\naligner = PhonemeTimestampAligner(preset=\"hi\")   # Hindi\naligner = PhonemeTimestampAligner(preset=\"fr\")   # French\naligner = PhonemeTimestampAligner(preset=\"ar\")   # Arabic\naligner = PhonemeTimestampAligner(preset=\"ru\")   # Russian\n```\n\n#### 📋 Complete Preset Table\n\n\u003cdetails\u003e\n\u003csummary\u003e🔍 Click to view all 80+ supported language presets\u003c/summary\u003e\n\n| **Language** | **Preset Code** | **Model Used** | **Language Family** |\n|--------------|-----------------|----------------|-------------------|\n| **🇺🇸 ENGLISH VARIANTS** | | |\n| English (US) | `en-us`, `en` | English Model | West Germanic |\n| English (UK) | `en-gb` | English Model | West Germanic |\n| English (Caribbean) | `en-029` | English Model | West Germanic |\n| English (Lancastrian) | `en-gb-x-gbclan` | English Model | West Germanic |\n| English (RP) | `en-gb-x-rp` | English Model | West Germanic |\n| English (Scottish) | `en-gb-scotland` | English Model | West Germanic |\n| English (West Midlands) | `en-gb-x-gbcwmd` | English Model | West Germanic |\n| **🇪🇺 EUROPEAN LANGUAGES (MLS8)** | | |\n| German | `de` | MLS8 Model | West Germanic |\n| French | `fr` | MLS8 Model | Romance |\n| French (Belgium) | `fr-be` | MLS8 Model | Romance |\n| French (Switzerland) | `fr-ch` | MLS8 Model | Romance |\n| Spanish | `es` | MLS8 Model | Romance |\n| Spanish (Latin America) | `es-419` | MLS8 Model | Romance |\n| Italian | `it` 
| MLS8 Model | Romance |\n| Portuguese | `pt` | MLS8 Model | Romance |\n| Portuguese (Brazil) | `pt-br` | MLS8 Model | Romance |\n| Polish | `pl` | MLS8 Model | West Slavic |\n| Dutch | `nl` | MLS8 Model | West Germanic |\n| Danish | `da` | MLS8 Model | North Germanic |\n| Swedish | `sv` | MLS8 Model | North Germanic |\n| Norwegian Bokmål | `nb` | MLS8 Model | North Germanic |\n| Icelandic | `is` | MLS8 Model | North Germanic |\n| Czech | `cs` | MLS8 Model | West Slavic |\n| Slovak | `sk` | MLS8 Model | West Slavic |\n| Slovenian | `sl` | MLS8 Model | South Slavic |\n| Croatian | `hr` | MLS8 Model | South Slavic |\n| Bosnian | `bs` | MLS8 Model | South Slavic |\n| Serbian | `sr` | MLS8 Model | South Slavic |\n| Macedonian | `mk` | MLS8 Model | South Slavic |\n| Bulgarian | `bg` | MLS8 Model | South Slavic |\n| Romanian | `ro` | MLS8 Model | Romance |\n| Hungarian | `hu` | MLS8 Model | Uralic |\n| Estonian | `et` | MLS8 Model | Uralic |\n| Latvian | `lv` | MLS8 Model | Baltic |\n| Lithuanian | `lt` | MLS8 Model | Baltic |\n| Catalan | `ca` | MLS8 Model | Romance |\n| Aragonese | `an` | MLS8 Model | Romance |\n| Papiamento | `pap` | MLS8 Model | Romance |\n| Haitian Creole | `ht` | MLS8 Model | Romance |\n| Afrikaans | `af` | MLS8 Model | West Germanic |\n| Luxembourgish | `lb` | MLS8 Model | West Germanic |\n| Irish Gaelic | `ga` | MLS8 Model | Celtic |\n| Scottish Gaelic | `gd` | MLS8 Model | Celtic |\n| Welsh | `cy` | MLS8 Model | Celtic |\n| **🌏 INDO-EUROPEAN LANGUAGES (Universal)** | | |\n| Russian | `ru` | Universal Model | East Slavic |\n| Russian (Latvia) | `ru-lv` | Universal Model | East Slavic |\n| Ukrainian | `uk` | Universal Model | East Slavic |\n| Belarusian | `be` | Universal Model | East Slavic |\n| Hindi | `hi` | Universal Model | Indic |\n| Bengali | `bn` | Universal Model | Indic |\n| Urdu | `ur` | Universal Model | Indic |\n| Punjabi | `pa` | Universal Model | Indic |\n| Gujarati | `gu` | Universal Model | Indic |\n| Marathi | `mr` | Universal 
Model | Indic |\n| Nepali | `ne` | Universal Model | Indic |\n| Assamese | `as` | Universal Model | Indic |\n| Oriya | `or` | Universal Model | Indic |\n| Sinhala | `si` | Universal Model | Indic |\n| Konkani | `kok` | Universal Model | Indic |\n| Bishnupriya Manipuri | `bpy` | Universal Model | Indic |\n| Sindhi | `sd` | Universal Model | Indic |\n| Persian | `fa` | Universal Model | Iranian |\n| Persian (Latin) | `fa-latn` | Universal Model | Iranian |\n| Kurdish | `ku` | Universal Model | Iranian |\n| Greek (Modern) | `el` | Universal Model | Greek |\n| Greek (Ancient) | `grc` | Universal Model | Greek |\n| Armenian (East) | `hy` | Universal Model | Indo-European |\n| Armenian (West) | `hyw` | Universal Model | Indo-European |\n| Albanian | `sq` | Universal Model | Indo-European |\n| Latin | `la` | Universal Model | Italic |\n| **🇹🇷 TURKIC LANGUAGES (Universal)** | | |\n| Turkish | `tr` | Universal Model | Turkic |\n| Azerbaijani | `az` | Universal Model | Turkic |\n| Kazakh | `kk` | Universal Model | Turkic |\n| Kyrgyz | `ky` | Universal Model | Turkic |\n| Uzbek | `uz` | Universal Model | Turkic |\n| Tatar | `tt` | Universal Model | Turkic |\n| Turkmen | `tk` | Universal Model | Turkic |\n| Uyghur | `ug` | Universal Model | Turkic |\n| Bashkir | `ba` | Universal Model | Turkic |\n| Chuvash | `cu` | Universal Model | Turkic |\n| Nogai | `nog` | Universal Model | Turkic |\n| **🇫🇮 URALIC LANGUAGES (Universal)** | | |\n| Finnish | `fi` | Universal Model | Uralic |\n| Lule Saami | `smj` | Universal Model | Uralic |\n| **🕌 SEMITIC LANGUAGES (Universal)** | | |\n| Arabic | `ar` | Universal Model | Semitic |\n| Hebrew | `he` | Universal Model | Semitic |\n| Amharic | `am` | Universal Model | Semitic |\n| Maltese | `mt` | Universal Model | Semitic |\n| **🏝️ MALAYO-POLYNESIAN LANGUAGES (Universal)** | | |\n| Indonesian | `id` | Universal Model | Malayo-Polynesian |\n| Malay | `ms` | Universal Model | Malayo-Polynesian |\n| **🇮🇳 DRAVIDIAN LANGUAGES (Universal)** | | |\n| 
Tamil | `ta` | Universal Model | Dravidian |\n| Telugu | `te` | Universal Model | Dravidian |\n| Kannada | `kn` | Universal Model | Dravidian |\n| Malayalam | `ml` | Universal Model | Dravidian |\n| **🇬🇪 SOUTH CAUCASIAN LANGUAGES (Universal)** | | |\n| Georgian | `ka` | Universal Model | South Caucasian |\n| **🗾 LANGUAGE ISOLATES \u0026 OTHERS (Universal)** | | |\n| Basque | `eu` | Universal Model | Language Isolate |\n| Quechua | `qu` | Universal Model | Quechuan |\n| **🛸 CONSTRUCTED LANGUAGES (Universal)** | | |\n| Esperanto | `eo` | Universal Model | Constructed |\n| Interlingua | `ia` | Universal Model | Constructed |\n| Ido | `io` | Universal Model | Constructed |\n| Lingua Franca Nova | `lfn` | Universal Model | Constructed |\n| Lojban | `jbo` | Universal Model | Constructed |\n| Pyash | `py` | Universal Model | Constructed |\n| Lang Belta | `qdb` | Universal Model | Constructed |\n| Quenya | `qya` | Universal Model | Constructed |\n| Klingon | `piqd` | Universal Model | Constructed |\n| Sindarin | `sjn` | Universal Model | Constructed |\n\n\u003c/details\u003e\n\n#### 🔧 Model Selection Guide\n\n| **Model** | **Languages** | **Use Case** | **Performance** |\n|-----------|---------------|--------------|-----------------|\n| **English Model** | English variants | Best for English | Highest accuracy for English |\n| **MLS8 Model** | 8 European + similar | European languages | High accuracy for European |\n| **Universal Model** | 60+ Indo-European + related | Other supported languages | Good for Indo-European families |\n\n**⚠️ Unsupported Language Types:**\n- **Tonal languages**: Chinese (Mandarin, Cantonese), Vietnamese, Thai, Burmese\n- **Distant families**: Japanese, Korean, most African languages (Swahili, etc.)\n- **Indigenous languages**: Most Native American, Polynesian (except Indonesian/Malay)\n- **Recommendation**: For unsupported languages, use explicit `model_name` parameter with caution\n\n### Initialization\n\n```python\nPhonemeTimestampAligner(\n  
  preset=\"en-us\",               # Language code — selects model and phonemiser automatically\n    model_name=None,              # Override the model by name (optional)\n    cupe_ckpt_path=None,          # Override with a local model file path (optional)\n    lang=\"en-us\",                 # espeak-ng language code for phonemisation\n    duration_max=10,              # Max segment length in seconds (used for padding)\n    device=\"auto\",                # \"auto\" | \"cpu\" | \"cuda\" | \"mps\"\n    silence_anchors=0,            # \u003e0 enables silence-anchored splitting (try 3 for long sentences)\n    boost_targets=True,           # Boost acoustic probability of target phonemes before alignment\n    enforce_minimum=True,         # Prevent phonemes from being completely zeroed out\n    enforce_all_targets=True,     # Guarantee every phoneme in the transcript gets a timestamp\n    ignore_noise=True,            # Skip predicted noise frames in output\n    extend_soft_boundaries=True,  # Extend phoneme boundaries into adjacent low-confidence frames\n    boundary_softness=7,          # How far to extend (2=tight cores only, 7=generous)\n    bad_confidence_threshold=0.6  # Flag segments where \u003e60% of phonemes are low-confidence\n)\n```\n\n**For most users, only `preset` and `duration_max` need to be changed.** The alignment defaults are tuned for clean read speech.\n\n\u003cdetails\u003e\n\u003csummary\u003eParameter details\u003c/summary\u003e\n\n| Parameter | Default | What it does |\n|-----------|---------|-------------|\n| `preset` | `\"en-us\"` | Language code. Automatically picks the right model and espeak-ng language. |\n| `model_name` | `None` | Name of a specific CUPE model. Overrides `preset`. Downloaded from HuggingFace if not cached. |\n| `cupe_ckpt_path` | `None` | Path to a local model `.ckpt` file. Highest priority. |\n| `lang` | `\"en-us\"` | espeak-ng language code for phonemisation. 
See [all codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md). |\n| `duration_max` | `10` | Maximum segment duration in seconds. Longer audio is truncated. Keep ≤30 s for best results. |\n| `device` | `\"auto\"` | `\"auto\"` detects CUDA/MPS/CPU automatically. |\n| `silence_anchors` | `0` | When \u003e0, uses detected silences as anchor points to improve long-segment alignment. Try `3`. |\n| `boost_targets` | `True` | Increases the acoustic probability of expected phonemes before Viterbi decoding. |\n| `enforce_minimum` | `True` | Prevents any target phoneme from being completely zeroed out by the model. |\n| `enforce_all_targets` | `True` | After decoding, inserts any missing phonemes at their best estimated position. Set `False` for strictly probabilistic output. |\n| `ignore_noise` | `True` | Drops predicted noise/silence frames from output. Set `False` to include them as `\"noise\"` entries. |\n| `extend_soft_boundaries` | `True` | Extends phoneme boundaries into adjacent frames that still carry some acoustic evidence. |\n| `boundary_softness` | `7` | Controls how far boundaries extend. `2`–`3` = tight phoneme cores; `7` = generous boundaries. |\n| `bad_confidence_threshold` | `0.6` | Ratio of low-confidence phonemes that triggers a `bad_alignment` warning on a segment. 
|\n\n**Model priority (highest → lowest):** `cupe_ckpt_path` → `model_name` → `preset` → defaults.\n\n\u003c/details\u003e\n\n---\n\n**Available models** (downloaded automatically on first use):\n\n| Model name | Best for |\n|------------|----------|\n| `en_libri1000_ua01c_e4_val_GER=0.2186.ckpt` | English (recommended) |\n| `multi_MLS8_uh02_e36_val_GER=0.2334.ckpt` | German, French, Spanish, Italian, Portuguese, Polish, Dutch |\n| `large_multi_mswc38_ua02g_e03_val_GER=0.5133.ckpt` | All other supported languages (large) |\n| `multi_mswc38_ug20_e59_val_GER=0.5611.ckpt` | All other supported languages (faster/smaller) |\n\nAll models are hosted on [HuggingFace → Tabahi/CUPE-2i](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt).\n\n\n### `process_srt_file` — align a whole recording from a transcript file\n\n**When to use:** You have a full audio file and a Whisper-style JSON transcript (segments with `start`, `end`, `text`). This is the most common entry point for corpus annotation.\n\n```python\ntimestamps = aligner.process_srt_file(\n    srt_path,           # path to transcript JSON (Whisper output format)\n    audio_path,         # path to audio file\n    ts_out_path=None,   # optional: save results to this JSON file\n    extract_embeddings=False,\n    vspt_path=None,     # optional: save phoneme embeddings to this .pt file\n    do_groups=False,    # set True to also return phoneme group timestamps\n    debug=True\n)\n```\n\nReturns a dict with a `\"segments\"` key. 
See [example_advanced.py](examples/example_advanced.py).\n\n---\n\n### `process_sentence` — align one sentence\n\n**When to use:** You have a single sentence and its audio clip (already loaded).\n\n```python\nresult = aligner.process_sentence(\n    text,               # transcript of the audio\n    audio_wav,          # audio waveform tensor from load_audio()\n    extract_embeddings=False,\n    do_groups=False,    # set True to also return phoneme group timestamps\n    debug=False\n)\n# Returns a dict with \"segments\" key (one segment)\n# If extract_embeddings=True, returns (result, phoneme_embeddings, group_embeddings)\n```\n\nSee [basic_usage.py](examples/basic_usage.py).\n\n---\n\n### `process_sentences_batch` — align many sentences at once\n\n**When to use:** You have a list of sentences and corresponding audio clips and want to process them all efficiently in one call.\n\n```python\nresults = aligner.process_sentences_batch(\n    texts,       # list of transcript strings\n    audio_wavs,  # list of audio waveform tensors (one per text)\n    do_groups=False,\n    debug=False\n)\n# Returns a list of result dicts, one per input\n```\n\nSee [batch_aligment.py](examples/batch_aligment.py).\n\n---\n\n### `process_segments` — batch alignment from segmented transcripts\n\n**When to use:** You have multiple audio files each with multiple time-stamped segments (e.g. output from Whisper over a corpus). 
This is the most efficient entry point for large-scale annotation.\n\n```python\n# Each item in srt_data corresponds to one audio file\nsrt_data = [\n    {\"segments\": [{\"start\": 0.0, \"end\": 3.5, \"text\": \"hello world\"}, ...]},  # file 1\n    {\"segments\": [{\"start\": 0.0, \"end\": 5.0, \"text\": \"another recording\"}, ...]},  # file 2\n]\naudio_wavs = [wav1, wav2]  # one waveform per file\n\nbatch_results = aligner.process_segments(\n    srt_data,\n    audio_wavs,\n    do_groups=False,\n    debug=True\n)\n# Returns a list of dicts (one per file), each with a \"segments\" key\n\n# With phoneme embeddings:\nbatch_results, phoneme_embds, group_embds = aligner.process_segments(\n    srt_data, audio_wavs, extract_embeddings=True\n)\n# phoneme_embds[file_idx][segment_idx] = embedding tensor\n```\n\nSee [batch_aligment.py](examples/batch_aligment.py) for a complete working example.\n\n---\n### `phonemize_sentence` — convert text to IPA and phoneme indices\n\nUseful for inspecting how BFA will interpret your text before running alignment.\n\n```python\nresult = aligner.phonemize_sentence(\"butterfly\")\n\nprint(result[\"eipa\"])     # ['b', 'ʌ', 'ɾ', 'ɚ', 'f', 'l', 'aɪ']  — original espeak-ng IPA\nprint(result[\"mipa\"])     # ['b', 'ʌ', 'ɾ', 'ɚ', 'f', 'l', 'aɪ']  — mapped IPA (after ph66 reduction)\nprint(result[\"ph66\"])     # [29, 10, 58, 9, 43, 56, 23]             — numeric indices for the model\nprint(result[\"pg16\"])     # [7, 2, 14, 2, 8, 13, 5]                 — phoneme group indices\nprint(result[\"words\"])    # ['butterfly']\nprint(result[\"word_num\"]) # [0, 0, 0, 0, 0, 0, 0]  — which word each phoneme belongs to\n```\n\nTo change the espeak-ng language after initialisation:\n```python\naligner.phonemizer.set_backend(language='de')  # switch to German phonemisation\n```\n\n#### About ph66 and the phoneme alphabet\n\nBFA uses a reduced alphabet of **66 phoneme classes (ph66)** that maps the full IPA inventory of any supported language onto a shared 
set of symbols. This is what the acoustic model was trained on.\n\n- `phoneme_label` in the output is the ph66 symbol (e.g. `\"a:\"`, `\"b\"`).\n- `ipa_label` is the original espeak-ng IPA for that position (e.g. `\"ɑː\"`, `\"b\"`).\n- For compound phonemes like /ɑːɹ/, BFA splits them into two entries. The second half gets `ipa_label: \"-\"` to indicate continuation.\n\n\u003cdetails\u003e\n\u003csummary\u003eExample: how a compound phoneme (ɑːɹ) appears in the output JSON\u003c/summary\u003e\n\n```json\n{\n    \"phoneme_id\": 19,\n    \"phoneme_label\": \"a:\",\n    \"ipa_label\": \"ɑːɹ\",\n    \"start_ms\": 2478.1,\n    \"end_ms\": 2606.8,\n    \"confidence\": 0.937,\n    \"is_estimated\": false\n},\n{\n    \"phoneme_id\": 59,\n    \"phoneme_label\": \"ɹ\",\n    \"ipa_label\": \"-\",\n    \"start_ms\": 2606.8,\n    \"end_ms\": 2639.0,\n    \"confidence\": 0.665,\n    \"is_estimated\": false\n}\n```\n\nThe original IPA /ɑːɹ/ is assigned to the first part; the second part carries `\"-\"` to signal it is a continuation.\n\n\u003c/details\u003e\n\n\n---\n\n### `extract_timestamps_from_segment_batch` — low-level batch inference\n\nThis is the internal method that `process_segments` calls. Most users will not need it directly. It takes raw audio tensors and phoneme index sequences and runs the CUPE model + Viterbi decoder.\n\nSee [run_simplipied_pipeline.py](examples/run_simplipied_pipeline.py) for a working example of the simplified variant (`extract_timestamps_from_segment_simplified`).\n\n---\n\n### `convert_to_textgrid` — export to Praat\n\nConverts the alignment result to a [Praat TextGrid](https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_format.html) file with separate tiers for phonemes, phoneme groups, and words.\n\n```python\n# Save to file\naligner.convert_to_textgrid(result, output_file=\"recording.TextGrid\")\n\n# Or get the TextGrid content as a string (e.g. 
to embed in your own pipeline)\ntextgrid_str = aligner.convert_to_textgrid(result, output_file=None)\n\n# Include confidence scores in the tier labels\naligner.convert_to_textgrid(result, output_file=\"recording.TextGrid\", include_confidence=True)\n```\n\n\n\n---\n\n\n\n## 🔧 Advanced Usage\n\n\n### 🎙️ Mel-Spectrogram Alignment\n\nFor TTS and speech synthesis workflows, BFA can produce **frame-wise phoneme labels** aligned to a mel-spectrogram. This makes it straightforward to create duration labels for [HiFi-GAN](https://github.com/jik876/hifi-gan) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) vocoders.\n\nSee the full working example: [mel_spectrum_alignment.py](examples/mel_spectrum_alignment.py).\n\n#### Extract Mel Spectrogram\n\n```python\nPhonemeTimestampAligner.extract_mel_spectrum(\n    wav,\n    wav_sample_rate,\n    vocoder_config={'num_mels': 80, 'num_freq': 1025, 'n_fft': 1024, 'hop_size': 256, 'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000, 'model': 'whatever_22khz_80band_fmax8k_256x'}\n)\n```\n\n**Description:**  \nExtracts mel spectrogram from audio with vocoder compatibility.\n\n**Parameters:**\n- `wav`: Input waveform tensor of shape `(1, T)`\n- `wav_sample_rate`: Sample rate of the input waveform\n- `vocoder_config`: Configuration dictionary for HiFiGAN/BigVGAN vocoder compatibility.\n\n**Returns:**  \n- `mel`: Mel spectrogram tensor of shape `(frames, mel_bins)` - transposed for easy frame-wise processing\n\n#### Frame-wise Assortment\n\n```python\nPhonemeTimestampAligner.framewise_assortment(\n    aligned_ts,\n    total_frames,\n    frames_per_second,\n    gap_contraction=5,\n    select_key=\"phoneme_id\"\n)\n```\n\n**Description:**  \nConverts timestamp-based phoneme alignment to frame-wise labels matching mel-spectrogram frames.\n\n**Parameters:**\n- `aligned_ts`: List of timestamp dictionaries (from `phoneme_ts`, `group_ts`, or `word_ts`)\n- `total_frames`: Total number of frames in the mel spectrogram\n- 
`frames_per_second`: Frame rate of the mel spectrogram\n- `gap_contraction`: Number of frames to fill silent gaps on either side of segments (default: 5)\n- `select_key`: Key to extract from timestamps (`\"phoneme_id\"`, `\"group_id\"`, etc.)\n\n**Returns:**  \n- List of frame labels with length `total_frames`\n\n#### Frame Compression\n\n```python\nPhonemeTimestampAligner.compress_frames(frames_list)\n```\n\n**Description:**  \nCompresses consecutive identical frame values into run-length encoded format.\n\n**Example:**\n```python\nframes = [0,0,0,0,1,1,1,1,3,4,5,4,5,2,2,2]\ncompressed = PhonemeTimestampAligner.compress_frames(frames)\n# Returns: [(0,4), (1,4), (3,1), (4,1), (5,1), (4,1), (5,1), (2,3)]\n```\n\n**Returns:**  \n- List of `(frame_value, count)` tuples\n\n#### Frame Decompression\n\n```python\nPhonemeTimestampAligner.decompress_frames(compressed_frames)\n```\n\n**Description:**  \nDecompresses run-length encoded frames back to the full frame sequence.\n\n**Parameters:**\n- `compressed_frames`: List of `(frame_value, count)` tuples, as returned by `compress_frames`\n\n**Returns:**  \n- Decompressed list of frame labels\n\n\u003cdetails\u003e\n\u003csummary\u003e📊 Complete mel-spectrum alignment example\u003c/summary\u003e\n\n```python\n# pip install librosa\nimport torch\nfrom bournemouth_aligner import PhonemeTimestampAligner\n\n# Initialize aligner\nextractor = PhonemeTimestampAligner(model_name=\"en_libri1000_ua01c_e4_val_GER=0.2186.ckpt\", \n                                  lang='en-us', duration_max=10, device='auto')\n\n# Process audio and get timestamps\naudio_wav = extractor.load_audio(\"examples/samples/audio/109867__timkahn__butterfly.wav\")\ntimestamps = extractor.process_sentence(\"butterfly\", audio_wav)\n\n# Extract mel spectrogram with vocoder compatibility\nvocoder_config = {'num_mels': 80, 'hop_size': 256, 'sampling_rate': 22050}\nsegment_wav = audio_wav[:, :int(timestamps['segments'][0]['end'] * extractor.resampler_sample_rate)]\nmel_spec = extractor.extract_mel_spectrum(segment_wav, 
extractor.resampler_sample_rate, vocoder_config)\n\n# Create frame-wise phoneme alignment\ntotal_frames = mel_spec.shape[0]\nframes_per_second = total_frames / timestamps['segments'][0]['end']\nframes_assorted = extractor.framewise_assortment(\n    aligned_ts=timestamps['segments'][0]['phoneme_ts'], \n    total_frames=total_frames, \n    frames_per_second=frames_per_second\n)\n\n# Compress and visualize\ncompressed_frames = extractor.compress_frames(frames_assorted)\n# Use provided plot_mel_phonemes() function to visualize\n```\n\n\u003c/details\u003e\n\n\n\n### Integration Examples\n\n\u003cdetails\u003e\n\u003csummary\u003e🎙️ Whisper Integration\u003c/summary\u003e\n\n```python\n# pip install git+https://github.com/openai/whisper.git \nimport whisper, json\nfrom bournemouth_aligner import PhonemeTimestampAligner\n\n# Transcribe with Whisper and save the transcript as JSON\nmodel = whisper.load_model(\"turbo\")\nresult = model.transcribe(\"audio.wav\")\nwith open(\"whisper_output.srt.json\", \"w\") as f:\n    json.dump(result, f)\n\n# Process with BFA\nextractor = PhonemeTimestampAligner(model_name=\"en_libri1000_ua01c_e4_val_GER=0.2186.ckpt\")\ntimestamps = extractor.process_srt_file(\"whisper_output.srt.json\", \"audio.wav\", \"timestamps.json\")\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e🔬 Manual pipeline — load audio and align in a few lines\u003c/summary\u003e\n\n```python\nfrom bournemouth_aligner import PhonemeTimestampAligner\n\n# Initialize and process\nextractor = PhonemeTimestampAligner(model_name=\"en_libri1000_ua01c_e4_val_GER=0.2186.ckpt\")\naudio_wav = extractor.load_audio(\"audio.wav\")  # Handles resampling and normalization\ntimestamps = extractor.process_sentence(\"your text here\", audio_wav)\n\n# Export to Praat\nextractor.convert_to_textgrid(timestamps, \"output.TextGrid\")\n```\n\n\u003c/details\u003e\n\n### 🤖 Machine Learning: Phoneme Embeddings\n\nBFA can optionally return **per-phoneme embeddings** (512-dimensional vectors) from 
the CUPE model. These can be used as phoneme-level acoustic features in downstream ML models. See [read_embeddings.py](examples/read_embeddings.py) for how to load and use them.\n\n---\n\n## 💻 Command Line Interface (CLI)\n\nBFA installs a `balign` command so you can align audio without writing any Python.\n\n### Syntax\n\n```\nbalign AUDIO_PATH TEXT_OR_SRT [OUTPUT_PATH] [OPTIONS]\n```\n\n`TEXT_OR_SRT` is either:\n- A **quoted sentence** — the text spoken in the audio (text mode, the default)\n- A **path to a JSON transcript file** — auto-detected when the argument points to an existing file (SRT mode)\n\n`OUTPUT_PATH` is optional. When omitted, the result is saved next to the audio as `\u003cname\u003e.vs.json`.\n\n---\n\n### Examples\n\n```bash\n# Align a single word — text mode (simplest)\nbalign butterfly.wav \"butterfly\" --preset=en-us\n\n# Save a mel-spectrogram image at the same time\nbalign butterfly.wav \"butterfly\" --preset=en-us --mel-path=mel_spectrum.png\n\n# Align a full sentence, save TextGrid for Praat\nbalign speech.wav \"hello world\" result.json --preset=en-us --textgrid=result.TextGrid\n\n# Align from a Whisper transcript file (SRT mode)\nbalign interview.wav transcript.srt.json aligned.json --preset=en-us\n\n# German\nbalign speech.wav \"hallo welt\" --preset=de\n\n# Hindi with GPU\nbalign speech.wav \"नमस्ते\" out.json --preset=hi --device=cuda\n\n# Save phoneme embeddings\nbalign speech.wav \"hello world\" out.json --preset=en-us --embeddings=emb.pt\n\n# Batch-align all wav files in a folder (SRT mode)\nfor f in *.wav; do balign \"$f\" \"${f%.wav}.srt.json\" \"${f%.wav}.vs.json\" --preset=en-us; done\n```\n\n---\n\n### Options\n\n| Option | Default | What it does |\n|--------|---------|-------------|\n| `--preset LANG` | `en-us` | Language code — picks the right model automatically. Use `--list-presets` to see all. |\n| `--mel-path PATH` | — | Save mel spectrogram as `.png` (image) or `.pt` (raw tensor). 
|\n| `--textgrid PATH` | — | Save a Praat TextGrid file with phoneme, group, and word tiers. |\n| `--embeddings PATH` | — | Save per-phoneme CUPE embeddings as a `.pt` tensor. |\n| `--device auto\\|cpu\\|cuda` | `auto` | Inference device. |\n| `--lang CODE` | *(from preset)* | Override the espeak-ng language code only. |\n| `--model NAME` | *(from preset)* | Override the CUPE model name (advanced). |\n| `--duration-max FLOAT` | `10.0` | Max segment length in seconds for windowed processing. |\n| `--boost-targets / --no-boost-targets` | on | Boost expected phoneme probabilities before Viterbi. |\n| `--debug` | off | Print segment-by-segment progress. |\n| `--list-presets` | — | Print all supported language codes and exit. |\n| `--version` | — | Print version and exit. |\n\n---\n\n### Transcript file format (SRT mode)\n\nWhen aligning a full recording, provide a JSON file in Whisper's output format:\n\n```json\n{\n  \"segments\": [\n    {\"start\": 0.0, \"end\": 3.5, \"text\": \"hello world this is a test\"},\n    {\"start\": 3.5, \"end\": 7.2, \"text\": \"another segment of speech\"}\n  ]\n}\n```\n\nGenerate this automatically with [Whisper](#integration-examples) or write it by hand for small files.\n\n---\n\n## 🧠 How It Works\n\nBFA is built on the **CUPE (Contextless Universal Phoneme Encoder)** model. 
Unlike HMM-based aligners like MFA, CUPE is a neural network that processes audio frame-by-frame independently (no context window), which is what makes it fast.\n\nRead the full paper: [BFA: Real-Time Multilingual Text-to-Speech Forced Alignment (arXiv 2509.23147)](https://arxiv.org/pdf/2509.23147)\n\n### Alignment pipeline\n\n```mermaid\ngraph TD\n    A[Audio Input] --\u003e B[RMS Normalization]\n    B --\u003e C[Audio Windowing]\n    C --\u003e D[CUPE Model Inference]\n    D --\u003e E[Phoneme/Group Probabilities]\n    E --\u003e F[Text Phonemization]\n    F --\u003e G[Target Boosting]\n    G --\u003e H[Viterbi Forced Alignment]\n    H --\u003e I[Missing Phoneme Recovery]\n    I --\u003e J[Confidence Calculation]\n    J --\u003e K[Frame-to-Time Conversion]\n    K --\u003e L[Output Generation]\n```\n\n**Step by step:**\n\n1. **Audio preprocessing** — RMS normalisation + sliding window (120 ms windows, 80 ms stride)\n2. **CUPE inference** — neural network assigns a probability to each of 66 phoneme classes for every 10 ms frame\n3. **Text phonemisation** — espeak-ng converts the transcript to a sequence of ph66 indices\n4. **Target boosting** — optionally increases the probability of phonemes that are expected to appear\n5. **Viterbi forced alignment** — CTC-style decoding finds the globally optimal path through the phoneme sequence\n6. **Recovery** — any phoneme missing from the Viterbi output is inserted at its most probable position\n7. **Confidence scoring** — each phoneme receives a score based on the average frame probability over its duration\n8. 
**Timestamp conversion** — frame indices are converted to milliseconds, accounting for any segment offset\n\n**CTC transition rules (for reference):**\n- *Stay* — repeat the current phoneme (or blank frame)\n- *Advance* — move to the next phoneme in the sequence\n- *Skip* — jump over a blank to the next phoneme (when adjacent phonemes differ)\n\n### Key alignment parameters\n\nBFA exposes several parameters not available in traditional aligners like MFA:\n\n#### 🎯 `boost_targets` (Default: `True`)\nIncreases log-probabilities of expected phonemes by a fixed boost factor (typically +5.0) before Viterbi decoding. If the sentence is very long or contains every possible phoneme, then boosting them all equally doesn't have much effect—because no phoneme stands out more than the others.\n\n**When it helps:**\n- **Cross-lingual scenarios**: Using English models on other languages where some phonemes are underrepresented\n- **Noisy audio**: When target phonemes have very low confidence but should be present\n- **Domain mismatch**: When model training data differs significantly from your audio\n\n**Important caveat:** For monolingual sentences, boosting affects ALL phonemes in the target sequence equally, making it equivalent to no boosting. 
The real benefit comes when using multilingual models or when certain phonemes are systematically underrepresented.\n\n#### 🛡️ `enforce_minimum` (Default: `True`) \nEnsures every target phoneme has at least a minimum probability (default: 1e-8) at each frame, preventing complete elimination during alignment.\n\n**Why this matters:**\n- Prevents target phonemes from being \"zeroed out\" by the model\n- Guarantees that even very quiet or unclear phonemes can be aligned\n- Helps with highly noisy audio in which all phonemes, not just targets, have extremely low probabilities.\n\n#### 🔒 `enforce_all_targets` (Default: `True`)\n**This is BFA's key differentiator from MFA.** After Viterbi decoding, BFA applies post-processing to guarantee that every target phoneme is present in the final alignment, even those with low acoustic probability. However, **downstream tasks can filter out these \"forced\" phonemes using their confidence scores**. For practical use, consider applying a confidence threshold (e.g., `timestamps[\"phoneme_ts\"][p][\"confidence\"] \u003c 0.05`) to exclude phonemes that were aligned with little to no acoustic evidence.\n\n**Recovery mechanism:**\n1. Identifies any missing target phonemes after initial alignment\n2. Finds frames with highest probability for each missing phoneme\n3. 
Strategically inserts missing phonemes by:\n   - Replacing blank frames when possible\n   - Searching nearby frames within a small radius\n   - Force-replacing frames as last resort\n\n**Use cases:**\n- **Guaranteed coverage**: When you need every phoneme to be timestamped\n- **Noisy environments**: Where some phonemes might be completely missed by standard Viterbi\n- **Research applications**: When completeness is more important than probabilistic accuracy\n\n#### ⚖️ Parameter Interaction Effects\n\n| Scenario | Recommended Settings | Outcome |\n|----------|---------------------|---------|\n| **Clean monolingual audio** | All defaults | Standard high-quality alignment |\n| **Cross-lingual/noisy** | `boost_targets=True` | Better phoneme recovery |\n| **Research/completeness** | `enforce_all_targets=True` | 100% phoneme coverage |\n| **Probabilistically strict** | `enforce_all_targets=False` | Only high-confidence alignments |\n\n**Technical Details:**\n\n- **Audio Processing**: 16kHz sampling, sliding window approach for long audio\n- **Model Architecture**: Pre-trained CUPE-2i models from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i)  \n- **Alignment Strategy**: CTC path construction with blank tokens between phonemes\n- **Quality Assurance**: Post-processing ensures 100% target phoneme coverage (when enabled)\n\n\u003e **Performance Note**: CPU-optimized implementation. The iterative Viterbi algorithm and windowing operations are designed for single-threaded efficiency. 
Most operations are vectorized where possible, so batch processing should be faster on GPUs.\n\n--- \n\n\n\n\n### Accuracy benchmark\n\n**Alignment error on [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1):**\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"examples/samples/images/BFA_vs_MFA_errors_on_TIMIT.png\" alt=\"Alignment error histogram comparing BFA and MFA on the TIMIT dataset\" width=\"600\"/\u003e\n\u003c/div\u003e\n\n- Most phoneme boundaries are within **±30 ms** of the hand-annotated ground truth.\n- Errors above **100 ms** are rare and typically caused by noisy or ambiguous segments.\n\n\u003e ⚠️ **Recommended maximum segment length: 30 seconds.** For longer recordings, segment the audio first using Whisper or a VAD tool. Segments above 60 seconds can degrade Viterbi accuracy.\n\n---\n\n## 🔬 Comparison with MFA (Montreal Forced Aligner)\n\nBFA and [MFA](https://montreal-forced-aligner.readthedocs.io/) are both forced aligners, but they work very differently and have complementary strengths.\n\n\u003cdiv align=\"center\"\u003e\n\n| Metric | BFA | MFA |\n|--------|-----|-----|\n| **Speed** | ~0.2 s per 10 s of audio | ~10 s per 2 s of audio |\n| **No dictionary needed** | ✅ espeak-ng generates phonemes on the fly | ❌ Requires a pronunciation dictionary |\n| **Real-time capable** | ✅ Yes (contextless frame processing) | ❌ No |\n| **Stop consonants (t, d, p, k)** | ✅ More precise boundaries | ⚠️ Tends to extend too far |\n| **Tail endings** | ⚠️ Occasionally misses | ❌ Often missed |\n| **Breathy sounds (h)** | ⚠️ Sometimes misses | ✅ Usually captures |\n| **Pause/silence handling** | ✅ Silence-aware (punctuation) | ❌ No punctuation awareness |\n| **Language coverage** | 80+ languages via espeak-ng | Limited to available dictionaries |\n\n\u003c/div\u003e\n\n### Praat TextGrid visualisation\n\n**\"In being comparatively modern\"** — LJ Speech sample\n\n![Praat TextGrid alignment visualisation for LJ Speech 
sample](examples/samples/images/LJ02_praat.png)\n\n---\n\n## Contribute\n\n- **ONNX port** is in progress — see [`bournemouth_aligner/cpp_onnx/main.cpp`](bournemouth_aligner/cpp_onnx/main.cpp) if you want to help.\n- **Mandarin and other tonal languages** — a new phoneme dictionary would be needed. Start a [discussion](https://github.com/tabahi/bournemouth-forced-aligner/issues) if you’d like to help.\n\n## Frequently Asked Questions\n\n**Q: Do I need a pronunciation dictionary?**  \nNo. BFA uses [espeak-ng](https://github.com/espeak-ng/espeak-ng) to generate IPA phonemisations automatically for all supported languages.\n\n**Q: Does it work on noisy or spontaneous speech?**  \nYes, but accuracy is lower than on clean read speech. Use `silence_anchors=3` and check `coverage_analysis[\"bad_confidence\"]` to filter out unreliable segments.\n\n**Q: What is the difference between `phoneme_label` and `ipa_label`?**  \n`phoneme_label` is BFA’s internal ph66 symbol (e.g. `\"a:\"`); `ipa_label` is the original espeak-ng IPA (e.g. `\"ˈɑːɹ\"`). Use `ipa_label` for phonetic analysis and `phoneme_label` for model-internal comparisons.\n\n**Q: Why is confidence low for some phonemes?**  \nThe CUPE model assigns low probability when the acoustic evidence is weak — common for short stops, reduced vowels, and phonemes at segment boundaries. Low confidence does not necessarily mean the timestamp is wrong; check visually in Praat if needed.\n\n**Q: Can I align a language that is not in the preset list?**  \nYou can try using `model_name` with the Universal model and a custom `lang` code. Results will vary. 
See [espeak-ng language codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md).\n\n**Q: What audio format does it accept?**  \nAny format supported by `torchaudio` / ffmpeg — WAV, MP3, FLAC, OGG, M4A, and others.\n\n## Citation\n\nIf you use BFA in research, please cite:\n\n```bibtex\n@misc{rehman2025bfa,\n      title={BFA: Real-time Multilingual Text-to-speech Forced Alignment}, \n      author={Abdul Rehman and Jingyao Cai and Jian-Jun Zhang and Xiaosong Yang},\n      year={2025},\n      eprint={2509.23147},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS},\n      url={https://arxiv.org/abs/2509.23147}, \n}\n```\n---\n\n\n\n\n[⭐ Star us on GitHub](https://github.com/tabahi/bournemouth-forced-aligner) • [🐛 Report Issues](https://github.com/tabahi/bournemouth-forced-aligner/issues)\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabahi%2Fbournemouth-forced-aligner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftabahi%2Fbournemouth-forced-aligner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabahi%2Fbournemouth-forced-aligner/lists"}