{"id":13699539,"url":"https://github.com/uiuc-sst/asr24","last_synced_at":"2025-05-04T16:34:47.955Z","repository":{"id":91784847,"uuid":"128249684","full_name":"uiuc-sst/asr24","owner":"uiuc-sst","description":"24-hour Automatic Speech Recognition","archived":false,"fork":false,"pushed_at":"2021-06-04T17:34:10.000Z","size":985,"stargazers_count":27,"open_issues_count":0,"forks_count":7,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-08-03T20:04:32.633Z","etag":null,"topics":["asr","g2p","kaldi","language-model","transcription"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uiuc-sst.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-04-05T18:45:55.000Z","updated_at":"2023-09-01T08:40:53.000Z","dependencies_parsed_at":"2024-04-08T03:11:54.805Z","dependency_job_id":"192c29b8-a64c-460e-b3e8-663abe99f7c8","html_url":"https://github.com/uiuc-sst/asr24","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uiuc-sst%2Fasr24","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uiuc-sst%2Fasr24/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uiuc-sst%2Fasr24/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uiuc-sst%2Fasr24/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uiuc-sst","download_url":"https://codeload.github.com/uiuc-sst/asr24/tar.gz/refs/head
s/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224398825,"owners_count":17304661,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","g2p","kaldi","language-model","transcription"],"created_at":"2024-08-02T20:00:35.894Z","updated_at":"2024-11-13T05:31:23.604Z","avatar_url":"https://github.com/uiuc-sst.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"Well within 24 hours, transcribe 40 hours of recorded speech in a surprise language.\n\nBuild an ASR for a surprise language L from a pre-trained acoustic model, an L pronunciation dictionary, and an L language model.\nThis approach converts phones directly to L words.  This is less noisy than using multiple cross-trained ASRs to make English words\nfrom which phone strings are extracted, merged by [PTgen](https://github.com/uiuc-sst/PTgen), and reconstituted into L words.\n\nA full description with performance measurements is on [arXiv](https://arxiv.org/abs/1909.07285),\nand in:  \nM Hasegawa-Johnson, L Rolston, C Goudeseune, GA Levow, and K Kirchhoff,  \n[Grapheme-to-phoneme transduction for cross-language ASR](https://doi.org/10.1007/978-3-030-59430-5_1), \nStat. Lang. 
Speech Proc.:3‒19, 2020.\n\n\u003c!-- To refresh this TOC, \nJust once:\n  `wget https://raw.githubusercontent.com/ekalinin/github-markdown-toc/master/gh-md-toc`\n  `chmod a+x gh-md-toc`\nWhen README.md updates:\n  `./gh-md-toc --insert README.md`\n--\u003e\n\u003c!--ts--\u003e\n   * [Install software:](#install-software)\n         * [Kaldi](#kaldi)\n         * [brno-phnrec](#brno-phnrec)\n         * [This repo](#this-repo)\n         * [Extension of ASpIRE](#extension-of-aspire)\n         * [CVTE Mandarin](#cvte-mandarin)\n   * [For each language L, build an ASR:](#for-each-language-l-build-an-asr)\n         * [Get raw text.](#get-raw-text)\n         * [Get a G2P.](#get-a-g2p)\n         * [Build an ASR.](#build-an-asr)\n   * [Transcribe speech:](#transcribe-speech)\n         * [Get recordings.](#get-recordings)\n         * [Typical results.](#typical-results)\n\n\u003c!-- Added by: camilleg, at: 2018-05-25T15:30-0500 --\u003e\n\n\u003c!--te--\u003e\n\n# Install software:\n\n### Kaldi\nIf you don't already have a version of Kaldi newer than 2016 Sep 30,\nget and build it following the instructions in its INSTALL files.\n```\n    git clone https://github.com/kaldi-asr/kaldi\n    cd kaldi/tools; make -j $(nproc)\n    cd ../src; ./configure --shared \u0026\u0026 make depend -j $(nproc) \u0026\u0026 make -j $(nproc)\n```\n\n### brno-phnrec\nPut Brno U. of Technology's phoneme recognizer next to the usual s5 directory.\n```\n    sudo apt-get install libopenblas-dev libopenblas-base\n    cd kaldi/egs/aspire\n    git clone https://github.com/uiuc-sst/brno-phnrec.git\n    cd brno-phnrec/PhnRec\n    make\n```\n\n### This repo\nPut this next to the usual `s5` directory.  
\n(The package nodejs is for `./sampa2ipa.js`.)\n```\n    sudo apt-get install nodejs\n    cd kaldi/egs/aspire\n    git clone https://github.com/uiuc-sst/asr24.git\n    cd asr24\n```\n\n### Extension of ASpIRE\n- Get the [ASpIRE chain model](http://kaldi-asr.org/models.html),\n[extended](https://chrisearch.wordpress.com/2017/03/11/speech-recognition-using-kaldi-extending-and-using-the-aspire-model/) by Krisztián Varga.\n```\n    cd kaldi/egs/aspire/asr24\n    wget -qO- http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz | tar xz\n    steps/online/nnet3/prepare_online_decoding.sh \\\n      --mfcc-config conf/mfcc_hires.conf \\\n      data/lang_chain exp/nnet3/extractor \\\n      exp/chain/tdnn_7b exp/tdnn_7b_chain_online\n    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \\\n      exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp\n```\nIn exp/tdnn_7b_chain_online this builds the files `phones.txt`, `tree`, `final.mdl`, `conf/`, etc.  \nThis builds the subdirectories `data` and `exp`.  Its last command `mkgraph.sh` can take 45 minutes (30 for CVTE Mandarin) and use a lot of memory because it calls `fstdeterminizestar` on a large language model, as Dan Povey [explains](https://groups.google.com/forum/#!topic/kaldi-help/3C6ypvqLpCw).\n\n- Verify that it can transcribe English, in mono 16-bit 8 kHz .wav format.\nEither use the provided 8khz.wav,\nor `sox MySpeech.wav -r 8000 8khz.wav`,\nor `ffmpeg -i MySpeech.wav -acodec pcm_s16le -ac 1 -ar 8000 8khz.wav`.\n\n(The scripts `cmd.sh` and `path.sh` say where to find `kaldi/src/online2bin/online2-wav-nnet3-latgen-faster`.)\n```\n    . cmd.sh \u0026\u0026 . 
path.sh\n    online2-wav-nnet3-latgen-faster \\\n      --online=false  --do-endpointing=false \\\n      --frame-subsampling-factor=3 \\\n      --config=exp/tdnn_7b_chain_online/conf/online.conf \\\n      --max-active=7000 \\\n      --beam=15.0  --lattice-beam=6.0  --acoustic-scale=1.0 \\\n      --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \\\n      exp/tdnn_7b_chain_online/final.mdl \\\n      exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \\\n      'ark:echo utterance-id1 utterance-id1|' \\\n      'scp:echo utterance-id1 8khz.wav|' \\\n      'ark:/dev/null'\n```\n\n### CVTE Mandarin\n- Get the [Mandarin chain model](http://kaldi-asr.org/models.html) (3.4 GB, about 10 minutes).\nThis makes a subdir cvte/s5, containing a words.txt, HCLG.fst, and final.mdl.\n```\n    wget -qO- http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz | tar xz\n    steps/online/nnet3/prepare_online_decoding.sh \\\n      --mfcc-config conf/mfcc_hires.conf \\\n      data/lang_chain exp/nnet3/extractor \\\n      exp/chain/tdnn_7b cvte/s5/exp/chain/tdnn\n    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \\\n      cvte/s5/exp/chain/tdnn cvte/s5/exp/chain/tdnn/graph_pp\n```\n\n# For each language L, build an ASR:\n\n### Get raw text.\n- Into `$L/train_all/text` put word strings in L (scraped from wherever), roughly 10 words per line, at most 500k lines.  These may be quite noisy, because they'll be cleaned up.\n\n### Get a G2P.\n- Into `$L/train_all/g2aspire.txt` put a G2P, a few hundred lines each containing grapheme(s), whitespace, and space-delimited Aspire-style phones.  \nIf it has CR line terminators, convert them to standard ones in vi with `%s/^M/\\r/g`, typing control-V before the `^M`.  \nIf it starts with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark), remove it: `vi -b g2aspire.txt`, and just `x` that character away.  
\n\n- If you need to build the G2P, `./g2ipa2asr.py ${L}_wikipedia_symboltable.txt aspire2ipa.txt phoibletable.csv \u003e $L/train_all/g2aspire.txt`.\n\n### Build an ASR.\n- `./run.sh $L` makes an L-customized HCLG.fst.  \n\u003c!-- (To instead run individual stages of run.sh:  \n- `./mkprondict.py $L` reads `$L/train_all/text` and makes files needed by the subsequent stages, including `$L/local/dict/lexicon.txt` and `$L/local/dict/words.txt`.  \n- `./newlangdir_train_lms.sh $L` makes a word-trigram language model for L, `$L/local/lm/3gram-mincount/`.\n- `./newlangdir_make_graphs.sh $L` makes L.fst, G.fst, and then `$L/graph/HCLG.fst`.  \n)  --\u003e\n\n- To instead use a prebuilt LM, `./run_from_wordlist.sh $L`.  See that script for usage.\n\n# Transcribe speech:\n### Get recordings.\nOn ifp-serv-03.ifp.illinois.edu, get LDC speech and convert it to a flat dir of 8 kHz .wav files:\n```\n    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Russian/LDC2016E111/RUS_20160930\n    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Tamil/TAM_EVAL_20170601/TAM_EVAL_20170601\n    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Uzbek/LDC2016E66/UZB_20160711\n\n    mkdir /tmp/8k\n    for f in */AUDIO/*.flac; do sox \"$f\" -r 8000 -c 1 /tmp/8k/$(basename ${f%.*}.wav); done\n    tar cf /workspace/ifp-53_1-data/eval/8k.tar -C /tmp 8k\n    rm -rf /tmp/8k\n```\nFor BABEL .sph files:\n```\n    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Assamese/LDC2016E02/conversational/training/audio\n    tar cf /tmp/foo.tar BABEL*.sph\n    scp /tmp/foo.tar ifp-53:/tmp\n```\nOn ifp-53,\n```\n    mkdir ~/kaldi/egs/aspire/asr24/$L-8khz\n    cd myTmpSphDir\n    tar xf /tmp/foo.tar\n    for f in *.sph; do ~/kaldi/tools/sph2pipe_v2.5/sph2pipe -p -f rif \"$f\" /tmp/a.wav; \\\n        sox /tmp/a.wav -r 8000 -c 1 ~/kaldi/egs/aspire/asr24/$L-8khz/$(basename ${f%.*}.wav); done\n```\nOn the host that will run the transcribing, e.g. 
ifp-53:\n```\n    cd kaldi/egs/aspire/asr24\n    wget -qO- http://www.ifp.illinois.edu/~camilleg/e/8k.tar | tar xf -\n    mv 8k $L-8khz\n```\n\n- `./mkscp.rb $L-8khz $(nproc) $L` splits the ASR tasks into one job per CPU core,\neach job with roughly the same audio duration.  \nIt reads `$L-8khz`, the dir of 8 kHz speech files.  \nIt makes `$L-submit.sh`.  \n- `./$L-submit.sh` launches these jobs in parallel.\n- After those jobs complete, collect the transcriptions with  \n`grep -h -e '^TAM_EVAL' $L/lat/*.log | sort \u003e $L-scrips.txt` (or ...`^RUS_`, `^BABEL_`, etc.).\n- To sftp transcriptions to Jon May as `elisa.tam-eng.eval-asr-uiuc.y3r1.v8.xml.gz`,\nwith timestamp June 11 and version 8,  \n`grep -h -e '^TAM_EVAL' tamil/lat/*.log | sort | sed -e 's/ /\\t/' | ./hyp2jonmay.rb /tmp/jon-tam tam 20180611 8`  \n(If UTF-8 errors occur, simplify letters by appending to the sed command args such as `-e 's/Ñ/N/g'`.)\n- Collect each .wav file's n best transcriptions with  \n`cat $L/lat/*.ascii | sort \u003e $L-nbest.txt`.\n\n### Special postprocessing.\nIf your transcriptions used nonsense English words, convert them to phones and then,\nvia a trie or longest common substring, into L-words:\n- `./trie-$L.rb \u003c trie1-scrips.txt \u003e $L-trie-scrips.txt`.\n- `make multicore-$L`; wait; `grep ... \u003e $L-lcs-scrips.txt`.\n\n### Typical results.\n\nRUS_20160930 was transcribed in 67 minutes, 13 MB/min, **12x** faster than real time.\n\nA 3.1 GB subset of Assam LDC2016E02 was transcribed in 440 minutes, 7 MB/min, **6.5x** real time.  (This may have been slower because it exhausted ifp-53's memory.)\n\nArabic/NEMLAR_speech/NMBCN7AR, 2.2 GB (40 hours), was [transcribed](./arabic-scrips.txt) in 147 minutes, 14 MB/min, **16x** real time.  (This may have been faster because it was a few long (half-hour) files instead of many brief ones.)\n\nTAM_EVAL_20170601 was [transcribed](./tamil-scrips-ifp53.txt) in 45 minutes, 21 MB/min, **19x** real time.  
\n\nGenerating lattices `$L/lat/*` took 1.04x longer for Russian, 0.93x longer(!) for Arabic, 1.7x longer for Tamil.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuiuc-sst%2Fasr24","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuiuc-sst%2Fasr24","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuiuc-sst%2Fasr24/lists"}