{"id":13935693,"url":"https://github.com/Diamondfan/CTC_pytorch","last_synced_at":"2025-07-19T20:33:38.850Z","repository":{"id":45749085,"uuid":"101856742","full_name":"Diamondfan/CTC_pytorch","owner":"Diamondfan","description":"CTC end -to-end ASR for timit and 863 corpus.","archived":false,"fork":false,"pushed_at":"2019-12-20T08:07:25.000Z","size":123,"stargazers_count":218,"open_issues_count":3,"forks_count":48,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-26T02:51:31.467Z","etag":null,"topics":["ctc","decoder","kaldi","pytorch","timit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Diamondfan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-30T08:23:05.000Z","updated_at":"2024-09-06T02:42:59.000Z","dependencies_parsed_at":"2022-09-07T05:50:43.907Z","dependency_job_id":null,"html_url":"https://github.com/Diamondfan/CTC_pytorch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diamondfan%2FCTC_pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diamondfan%2FCTC_pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diamondfan%2FCTC_pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Diamondfan%2FCTC_pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Diamondfan","download_url":"https://codeload.github.com/Diamondfan/CTC_pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226677240,"owners_count":17666020,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ctc","decoder","kaldi","pytorch","timit"],"created_at":"2024-08-07T23:02:00.148Z","updated_at":"2024-11-27T03:31:08.124Z","avatar_url":"https://github.com/Diamondfan.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"## Update\nUpdated to PyTorch 1.2 and Python 3.\n\n# CTC-based Automatic Speech Recognition\nThis is a CTC-based speech recognition system built with PyTorch.\n\nAt present, the system only supports phoneme recognition.
\n\nWord-level recognition is also possible, but expect a high error rate.\n\nAnother option is to decode with a lexicon and a word-level language model using a WFST, which is not included in this system.\n\n## Data\nEnglish Corpus: TIMIT\n- Training set: 3696 sentences (excluding SA utterances)\n- Dev set: 400 sentences\n- Test set: 192 sentences\n\nChinese Corpus: 863 Corpus\n- Training set:\n\n| Speaker  | UtterId                  | Utterances per speaker |\n| :-:      | :-:                      | :-:                    |\n| M50, F50 | A1-A521, AW1-AW129       | 650 sentences          |\n| M54, F54 | B522-B1040, BW130-BW259  | 649 sentences          |\n| M60, F60 | C1041-C1560, CW260-CW388 | 649 sentences          |\n| M64, F64 | D1-D625                  | 625 sentences          |\n| All      |                          | 5146 sentences (total) |\n\n- Test set:\n\n| Speaker  | UtterId     | Utterances per speaker |\n| :-:      | :-:         | :-:                    |\n| M51, F51 | A1-A100     | 100 sentences          |\n| M55, F55 | B522-B621   | 100 sentences          |\n| M61, F61 | C1041-C1140 | 100 sentences          |\n| M63, F63 | D1-D100     | 100 sentences          |\n| All      |             | 800 sentences (total)  |\n\n## Install\n- Install [PyTorch](http://pytorch.org/).\n- ~~Install [warp-ctc](https://github.com/SeanNaren/warp-ctc) and bind it to PyTorch.~~  \n    ~~Notice: if you use Python 2, reinstall PyTorch from source instead of pip.~~\n    The built-in CTC loss of PyTorch 1.2 (nn.CTCLoss) is now used instead.\n- Install [Kaldi](https://github.com/kaldi-asr/kaldi), which is used to extract MFCC and fbank features.\n- Install [torchaudio](https://github.com/pytorch/audio.git) (needed only when raw waveforms are used as input).\n- ~~Install [KenLM](https://github.com/kpu/kenlm) to train an n-gram language model if needed.~~\n    IRSTLM from the Kaldi tools is used instead.\n- Install and start Visdom:\n```bash\npip3 install visdom\npython3 -m visdom.server\n```\n- Install the other Python packages:\n```bash\npip3 install -r requirements.txt\n```\n\n## Usage\n1. 
Install all packages listed in the Install section.\n2. Edit the top-level script run.sh.\n3. Edit the config file to set the hyperparameters.\n4. Run the top-level script in one of four modes:\n```bash\nbash run.sh      # data preparation + AM training + LM training + testing\nbash run.sh 1    # AM training + LM training + testing\nbash run.sh 2    # LM training + testing\nbash run.sh 3    # testing\n```\nRNN LM training is not implemented yet; it is on the ToDo list.\n\n## Data Preparation\n1. Extract 39-dimensional MFCC and 40-dimensional fbank features with Kaldi.\n2. Run compute-cmvn-stats on the training data to get the global mean and variance, then use apply-cmvn to normalize the features.\n3. Implement custom Dataset and DataLoader classes (based on torch.utils.data) to prepare data for training. They are in steps/dataloader.py.\n\n## Model\n- RNN + DNN + CTC  \n    The RNN layer can be swapped for nn.LSTM or nn.GRU.\n- CNN + RNN + DNN + CTC  \n    The CNN is used to reduce spectral variation caused by speaker and environment differences.\n- How to choose  \n    Use add_cnn to choose between the two models. If add_cnn is True, CNN + RNN + DNN + CTC is used.\n\n## Training\n- initial-lr = 0.001\n- decay = 0.5\n- weight-decay = 0.005\n\nThe learning rate is decayed when the dev loss stays around the same value ten times.  \nThe maximum number of learning-rate adjustments is 8; it can be changed in steps/train_ctc.py (line 367).  \nThe optimizer is torch.optim.Adam with weight decay 0.005.\n\n## Decoder\n### Greedy decoder\nTake the most probable output at each frame to get the best path, then collapse it into a label sequence by merging repeated labels and removing blanks, as in standard CTC best-path decoding.  \nWER and CER are calculated with the functions of the decoder class.\n### Beam decoder\nImplemented in Python, based on the [original code](https://github.com/githubharald/CTCDecoder).  \nI modified it to support phoneme-level batch decoding.  \nBeam search improves phoneme accuracy by about 0.2%.  \nA phoneme-level language model is now integrated into the beam search decoder.
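\n\nThe greedy decoding rule from the Decoder section, together with an edit-distance error rate, can be sketched in plain Python as follows. This is a minimal illustration, not the repository's decoder: the function names are hypothetical, the blank label is assumed to be index 0, and per-frame probabilities are assumed to arrive as nested lists rather than tensors.\n\n```python\nfrom typing import List, Optional, Sequence\n\nBLANK = 0  # assumption: index 0 is the CTC blank label\n\n\ndef greedy_ctc_decode(frame_probs: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:\n    # Best path: the most probable label in every frame\n    best_path = [max(range(len(p)), key=lambda k: p[k]) for p in frame_probs]\n    # Collapse the path: merge consecutive repeats, then drop blanks\n    labels: List[int] = []\n    prev: Optional[int] = None\n    for k in best_path:\n        if k != prev and k != blank:\n            labels.append(k)\n        prev = k\n    return labels\n\n\ndef edit_distance(ref: Sequence[int], hyp: Sequence[int]) -> int:\n    # Levenshtein distance computed with one rolling row\n    row = list(range(len(hyp) + 1))\n    for i, r in enumerate(ref, start=1):\n        diag, row[0] = row[0], i\n        for j, h in enumerate(hyp, start=1):\n            cur = min(row[j] + 1,       # deletion\n                      row[j - 1] + 1,   # insertion\n                      diag + (r != h))  # substitution or match\n            diag, row[j] = row[j], cur\n    return row[-1]\n\n\ndef error_rate(refs: Sequence[Sequence[int]], hyps: Sequence[Sequence[int]]) -> float:\n    # Total edits over total reference length (PER, CER or WER style)\n    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))\n    return errors / sum(len(r) for r in refs)\n\n\n# Usage: 5 frames over 3 labels (0 = blank)\nprobs = [[0.1, 0.8, 0.1],    # -> 1\n         [0.1, 0.7, 0.2],    # -> 1 (repeat, merged)\n         [0.9, 0.05, 0.05],  # -> blank\n         [0.2, 0.7, 0.1],    # -> 1 (new token after the blank)\n         [0.1, 0.1, 0.8]]    # -> 2\nhyp = greedy_ctc_decode(probs)\nprint(hyp)                          # [1, 1, 2]\nprint(error_rate([[1, 2]], [hyp]))  # 0.5\n```\n\nThe error rate pools edits over the whole set instead of averaging per-utterance rates, so longer utterances weigh more; the same function gives PER, CER or WER depending on whether the sequences hold phonemes, characters or words.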
\n\n## ToDo\n- Combine with RNN-LM\n- Beam search with RNN-LM\n- Clean up the code in 863_corpus, which is currently a mess.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDiamondfan%2FCTC_pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDiamondfan%2FCTC_pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDiamondfan%2FCTC_pytorch/lists"}