{"id":34765058,"url":"https://github.com/aalto-speech/subword-kaldi","last_synced_at":"2025-12-25T07:07:20.016Z","repository":{"id":53764662,"uuid":"85749139","full_name":"aalto-speech/subword-kaldi","owner":"aalto-speech","description":"Properly handle position-dependent phones in a subword lexicon FST","archived":false,"fork":false,"pushed_at":"2020-10-26T11:49:09.000Z","size":18,"stargazers_count":31,"open_issues_count":4,"forks_count":3,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-04-17T22:14:23.721Z","etag":null,"topics":["kaldi","subword-units"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aalto-speech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-21T20:18:29.000Z","updated_at":"2023-10-26T07:01:17.000Z","dependencies_parsed_at":"2022-09-22T07:22:42.006Z","dependency_job_id":null,"html_url":"https://github.com/aalto-speech/subword-kaldi","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aalto-speech/subword-kaldi","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aalto-speech%2Fsubword-kaldi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aalto-speech%2Fsubword-kaldi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aalto-speech%2Fsubword-kaldi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aalto-speech%2Fsubword-kaldi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aalto-speech","download_url":"https://code
load.github.com/aalto-speech/subword-kaldi/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aalto-speech%2Fsubword-kaldi/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28022940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kaldi","subword-units"],"created_at":"2025-12-25T07:07:10.163Z","updated_at":"2025-12-25T07:07:20.008Z","avatar_url":"https://github.com/aalto-speech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Create a subword Lexicon FST for Kaldi\n\nThis is the code accompanying the paper [Improved subword modeling for WFST-based speech recognition](https://research.aalto.fi/en/publications/improved-subword-modeling-for-wfstbased-speech-recognition(ed43f22c-f5bd-45ad-99a7-628f82f2283c).html).\n\n\nFor each subword marking style (word boundary marker, left-right marked, left-marked, right-marked) a separate script exists in `local/` that creates an `L.fst`.\n\nThe standard way to use these scripts is:\n    \n    extra=3\n    utils/prepare_lang.sh --phone-symbol-table data/lang/phones.txt --num-extra-phone-disambig-syms $extra data/subword_dict \"\u003cUNK\u003e\" data/subword_lang/local data/subword_lang\n    \n    
dir=data/subword_lang\n    tmpdir=data/subword_lang/local\n\n    # Overwrite L_disambig.fst\n    common/make_lfst_wb.py $(tail -n$extra $dir/phones/disambig.txt) \u003c $tmpdir/lexiconp_disambig.txt | fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt --keep_isymbols=false --keep_osymbols=false | fstaddselfloops  $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | fstarcsort --sort_type=olabel \u003e $dir/L_disambig.fst \n\nFor the other scripts (l/r/lr-marked), the number of extra disambiguation symbols can be reduced to 1.\n\n## Which marking style is the best?\n\nThis unfortunately depends on your language and dataset. We have seen different marking styles perform best for different datasets and languages.\n\n## Limitations\n\n - The lexicon files are not updated in the lang directory, so lexicon-based alignment of lattices will not work (fix in progress).\n - At the moment, all pronunciations have probability 1 (which is common anyway for grapheme-based systems). If custom probabilities are required, the `local/make_lfst_*.py` files should be updated to include them.\n\n\n## Help\n\nFeel free to open an issue or send me an email at peter.smit@aalto.fi if you have trouble getting these scripts to work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faalto-speech%2Fsubword-kaldi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faalto-speech%2Fsubword-kaldi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faalto-speech%2Fsubword-kaldi/lists"}