{"id":41168682,"url":"https://github.com/cosmoquester/speech-recognition","last_synced_at":"2026-01-22T19:37:54.650Z","repository":{"id":49583787,"uuid":"363345371","full_name":"cosmoquester/speech-recognition","owner":"cosmoquester","description":"Develop speech recognition models with Tensorflow 2","archived":false,"fork":false,"pushed_at":"2022-11-16T14:42:38.000Z","size":87950,"stargazers_count":7,"open_issues_count":1,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-05T01:40:04.973Z","etag":null,"topics":["deepspeech","listen-attend-and-spell","speech-recognition","tensorflow","tensorflow2"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cosmoquester.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-05-01T07:00:56.000Z","updated_at":"2023-01-18T05:17:35.000Z","dependencies_parsed_at":"2023-01-22T06:01:39.728Z","dependency_job_id":null,"html_url":"https://github.com/cosmoquester/speech-recognition","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":"cosmoquester/tf2-keras-template","purl":"pkg:github/cosmoquester/speech-recognition","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Fspeech-recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Fspeech-recognition/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Fspeech-recognition/re
leases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Fspeech-recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cosmoquester","download_url":"https://codeload.github.com/cosmoquester/speech-recognition/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Fspeech-recognition/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28669392,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T19:36:09.361Z","status":"ssl_error","status_checked_at":"2026-01-22T19:36:05.567Z","response_time":144,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deepspeech","listen-attend-and-spell","speech-recognition","tensorflow","tensorflow2"],"created_at":"2026-01-22T19:37:54.570Z","updated_at":"2026-01-22T19:37:54.639Z","avatar_url":"https://github.com/cosmoquester.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Speech Recognition\n\n[![codecov](https://codecov.io/gh/cosmoquester/speech-recognition/branch/master/graph/badge.svg?token=veHoLRzJum)](https://codecov.io/gh/cosmoquester/speech-recognition)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Imports: 
isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n[![cosmoquester](https://circleci.com/gh/cosmoquester/speech-recognition.svg?style=svg)](https://app.circleci.com/pipelines/github/cosmoquester/speech-recognition)\n\n\n- This repository provides speech recognition models along with training, evaluation, and inference scripts based on TensorFlow 2.\n- You can run the script examples described below with the bundled test data.\n- The `resources/configs` directory contains default configs for datasets (LibriSpeech, KsponSpeech, Clovacall) and models (LAS, DeepSpeech2).\n- The `resources/sp-models` directory contains a default SentencePiece tokenizer for each dataset.\n\n- I trained the LAS [small](https://github.com/cosmoquester/speech-recognition/blob/master/resources/configs/las_small.yml) model on the LibriSpeech dataset. You can download the pretrained model from the [release page](https://github.com/cosmoquester/speech-recognition/releases/tag/v0.0.1).\n\nThe performance of the trained model is shown below.\n\n| | LibriSpeech dev-clean | LibriSpeech dev-other |\n| --- | --- | --- |\n| WER (Word Error Rate) | 9.35% | 24.53% |\n| CER (Character Error Rate) | 4.24% | 13.29% |\n\n# References\n\n## LAS Model\n\n- [Listen, Attend and Spell](https://arxiv.org/abs/1508.01211)\n- [On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition](https://arxiv.org/abs/1902.01955)\n- [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779v3)\n\n## DeepSpeech2 Model\n\n- [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595)\n\n# Dataset Format\n\n- The dataset file is in TSV (tab-separated values) format.\n- The dataset file must have a **header line**.\n- The 1st column is the **audio file path**, relative to the directory that contains the dataset TSV file.\n- The 2nd column is the **recognized text**.\n- See `tests/data/dataset.tsv` for an example.\n\nFilePath | 
Text\n---|---\naudio/001.wav | 안녕하세요\naudio/002.wav | 반갑습니다\naudio/003.wav | 근데 이름이 어떻게 되세요?\n... | ...\n\n- The table above shows an example TSV file.\n\n# Train\n\n## Example\n\nYou can start training by running the script as in the example below.\n```sh\n$ python -m speech_recognition.run.train \\\n    --data-config resources/configs/libri_config.yml \\\n    --model-config resources/configs/las_small.yml \\\n    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \\\n    --train-dataset-paths tests/data/wav_dataset.tsv \\\n    --dev-dataset-paths tests/data/wav_dataset.tsv \\\n    --train-dataset-size 1000 \\\n    --steps-per-epoch 100 \\\n    --epochs 10 \\\n    --batch-size 32 \\\n    --dev-batch-size 32 \\\n    --learning-rate 2e-4 \\\n    --mixed-precision \\\n    --device CPU\n```\nYou can also start training from a training configuration file using the `--from-file` parameter.\n\n```sh\n$ python -m speech_recognition.run.train --from-file resources/configs/train_config_sample.yml\n```\n\nYou can also override parameters from the file with command-line arguments, as shown below.\n\n```sh\n$ python -m speech_recognition.run.train \\\n    --from-file resources/configs/train_config_sample.yml \\\n    --epochs 1 \\\n    --batch-size 128 \\\n    --device GPU\n```\n\n## Arguments\n\n```text\n  --from-file FROM_FILE\n                        load configs from file\n  --data-config DATA_CONFIG\n                        data processing config file\n  --model-config MODEL_CONFIG\n                        model config file\n  --sp-model-path SP_MODEL_PATH\n                        sentencepiece model path\n  --train-dataset-paths TRAIN_DATASET_PATHS\n                        a tsv/tfrecord dataset file or multiple files ex)\n                        *.tsv\n  --dev-dataset-paths DEV_DATASET_PATHS\n                        a tsv/tfrecord dataset file or multiple files ex)\n                        *.tsv\n  --train-dataset-size TRAIN_DATASET_SIZE\n                        the number of training dataset 
examples\n  --output-path OUTPUT_PATH\n                        output directory to save log and model checkpoints\n  --pretrained-model-path PRETRAINED_MODEL_PATH\n                        pretrained model checkpoint\n  --epochs EPOCHS\n  --steps-per-epoch STEPS_PER_EPOCH\n  --learning-rate LEARNING_RATE\n  --min-learning-rate MIN_LEARNING_RATE\n  --warmup-rate WARMUP_RATE\n  --warmup-steps WARMUP_STEPS\n  --batch-size BATCH_SIZE\n  --dev-batch-size DEV_BATCH_SIZE\n  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE\n                        shuffle buffer size\n  --max-over-policy {filter,slice}\n                        policy for sequence whose length is over max\n  --use-tfrecord        use tfrecord dataset\n  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ\n  --mixed-precision     use mixed precision FP16\n  --seed SEED           Set random seed\n  --skip-epochs SKIP_EPOCHS\n                        skip first N epochs and start N + 1 epoch\n  --device {CPU,GPU,TPU}\n                        device to use (TPU or GPU or CPU)\n```\n- `data-config` is the config file path for data processing. An example config is `resources/configs/libri_config.yml`.\n- `model-config` is the model config file path used to initialize the model. The default config is `resources/configs/las_small.yml`.\n- `sp-model-path` is the SentencePiece model path used to tokenize the target text.\n- `pretrained-model-path` is the pretrained model checkpoint path, for continuing training from a pretrained model.\n- `warmup-rate` or `warmup-steps` specifies the number of warmup steps. The default is zero. `warmup-steps` is used if both parameters are provided.\n- The `max-over-policy` option applies to sequences longer than the maximum sequence length. 
You can either filter out longer examples or slice them to fit the maximum length.\n- The `use-tfrecord` option must be provided when using a TFRecord-format dataset.\n- The `mixed-precision` option enables FP16 mixed precision.\n\n# Evaluate\n\n## Example\n\nYou can evaluate your trained model using the `evaluate.py` script.\nThe evaluation reports CER and WER, as in the example below.\n\n```sh\n$ python -m speech_recognition.run.evaluate \\\n    --data-config resources/configs/libri_config.yml \\\n    --model-config tests/data/model-configs/las_mini_for_test.yml \\\n    --dataset-paths tests/data/wav_dataset.tsv \\\n    --model-path tests/data/model-checkpoints/las.ckpt \\\n    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \\\n    --device CPU\n...\n[2021-06-07 13:22:48,599] [+] Load Tokenizer from resources/sp-models/sp_model_unigram_16K_libri.model\n[2021-06-07 13:22:48,626] [+] Load Data Config from resources/configs/libri_config.yml\n[2021-06-07 13:22:48,629] [+] Load dataset from tests/data/wav_dataset.tsv\n2021-06-07 13:22:49.018137: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA\n[2021-06-07 13:22:49,662] [+] Use delta and deltas accelerate\n[2021-06-07 13:22:53,122] [+] Load weights of model from tests/data/model-checkpoints/las.ckpt\nModel: \"las\"\n...\n[2021-06-07 13:22:53,135] [+] Start Inference\n2021-06-07 13:22:53.171394: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)\n2021-06-07 13:22:53.188758: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz\n[2021-06-07 13:22:56,352] [+] Ended Inference\n[2021-06-07 13:22:56,589] [+] Average WER: 2494.6429%\n[2021-06-07 13:22:56,589] [+] Average CER: 7256.3131%\n```\n\n## Arguments\n\n```text\n  --data-config DATA_CONFIG\n                        data processing config file\n  --model-config 
MODEL_CONFIG\n                        model config file\n  --dataset-paths DATASET_PATHS\n                        a tsv/tfrecord dataset file or multiple files ex)\n                        *.tsv\n  --model-path MODEL_PATH\n                        pretrained model checkpoint\n  --sp-model-path SP_MODEL_PATH\n                        sentencepiece model path\n  --output-path OUTPUT_PATH\n                        output tsv file path to save generated sentences\n  --batch-size BATCH_SIZE\n  --beam-size BEAM_SIZE\n                        not given, use greedy search else beam search with\n                        this value as beam size\n  --use-tfrecord        use tfrecord dataset\n  --mixed-precision     Use mixed precision FP16\n  --device DEVICE       device to train\n```\n- `dataset-paths` takes the same form as the dataset paths in the train script.\n- If you pass the `output-path` argument, the recognized text, the real target text, and the distance metric are exported in TSV format.\n- You can select CER or WER as your metric by passing the `metric` argument.\n\n# Inference\n\n## Example\n\nYou can run inference on your audio files with a trained model, as in the example below.\n```sh\n$ python -m speech_recognition.run.inference \\\n    --data-config resources/configs/libri_config.yml \\\n    --model-config tests/data/model-configs/las_mini_for_test.yml \\\n    --audio-files \"tests/data/audio_files/*.wav\"  \\\n    --model-path tests/data/model-checkpoints/las.ckpt \\\n    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \\\n    --batch-size 3 \\\n    --device CPU \\\n    --beam-size 2\n\n...\n[2021-06-07 13:28:27,696] [+] Use delta and deltas accelerate\n[2021-06-07 13:28:31,202] Loaded weights of model from tests/data/model-checkpoints/las.ckpt\nModel: \"las\"\n(MODEL SUMMARY)\n[2021-06-07 13:28:31,204] Start Inference\n2021-06-07 13:28:31.238552: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)\n2021-06-07 
13:28:31.256769: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz\n[2021-06-07 13:28:35,693] Ended Inference, Start to save...\n[2021-06-07 13:28:35,694] Saved (audio path,decoded sentence) pairs to output.tsv\n```\nThe inference results are then saved to the output path.\n\n## Arguments\n\n```text\n  --data-config DATA_CONFIG\n                        data processing config file\n  --model-config MODEL_CONFIG\n                        model config file\n  --audio-files AUDIO_FILES\n                        an audio file or glob pattern of multiple files ex)\n                        *.pcm\n  --model-path MODEL_PATH\n                        pretrained model checkpoint\n  --output-path OUTPUT_PATH\n                        output tsv file path to save generated sentences\n  --sp-model-path SP_MODEL_PATH\n                        sentencepiece model path\n  --batch-size BATCH_SIZE\n  --beam-size BEAM_SIZE\n                        not given, use greedy search else beam search with\n                        this value as beam size\n  --mixed-precision     Use mixed precision FP16\n  --device DEVICE       device to train\n```\n- ``audio-files`` is a glob pattern matching audio files. 
e.g. \"*.pcm\", \"data[0-9]+.wav\"\n- ``model-path`` is the TensorFlow model checkpoint path.\n\n# Make TFRecord\n\n## Example\n\nYou can convert a dataset into TFRecord format as in the example below.\n```sh\n$ python -m speech_recognition.run.make_tfrecord \\\n    --data-config resources/configs/libri_config.yml \\\n    --dataset-paths tests/data/wav_dataset.tsv \\\n    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \\\n    --output-dir .\n\n[2021-06-07 13:31:10,444] [+] Number of Dataset Files: 1\n[2021-06-07 13:31:10,445] [+] Load Config From resources/configs/libri_config.yml\n[2021-06-07 13:31:10,447] [+] Load Tokenizer From resources/sp-models/sp_model_unigram_16K_libri.model\n...\n2021-06-07 13:31:10.491991: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set\n[2021-06-07 13:31:10,519] [+] Start Saving Dataset...\n  0%|          | 0/1 [00:00\u003c?, ?it/s]2021-06-07 13:31:10.848397: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA\n2021-06-07 13:31:11.530043: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)\n2021-06-07 13:31:11.548833: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz\n100%|█| 1/1 [00:01\u003c00:00,  1.35s/it]\n[2021-06-07 13:31:11,867] [+] Done\n```\n\n## Arguments\n\n```text\n  --data-config DATA_CONFIG\n                        data processing config file\n  --dataset-paths DATASET_PATHS\n                        dataset file path glob pattern\n  --output-dir OUTPUT_DIR\n                        output directory path, default is input dataset file\n                        directory\n  
--sp-model-path SP_MODEL_PATH\n                        sentencepiece model path\n```\n- The arguments are the same as the corresponding train script arguments.\n- The output TFRecord file contains pre-processed audio tensors and tokenized tensors, so you can train with only the TFRecord file, without the TSV or audio files.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmoquester%2Fspeech-recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcosmoquester%2Fspeech-recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmoquester%2Fspeech-recognition/lists"}