{"id":15159380,"url":"https://github.com/rolczynski/automatic-speech-recognition","last_synced_at":"2025-09-30T09:30:42.014Z","repository":{"id":57412835,"uuid":"149004407","full_name":"rolczynski/Automatic-Speech-Recognition","owner":"rolczynski","description":"🎧  Automatic Speech Recognition: DeepSpeech \u0026 Seq2Seq (TensorFlow)","archived":true,"fork":false,"pushed_at":"2020-06-15T00:44:38.000Z","size":3771,"stargazers_count":223,"open_issues_count":14,"forks_count":63,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-01-16T15:36:42.964Z","etag":null,"topics":["automatic-speech-recognition","deep-learning","deepspeech","distill","keras","language-model","machine-learning","neural-networks","speech-recognition","speech-to-text","tensorflow","tensorflow-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rolczynski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-16T14:38:24.000Z","updated_at":"2024-10-30T15:28:11.000Z","dependencies_parsed_at":"2022-08-29T15:22:21.492Z","dependency_job_id":null,"html_url":"https://github.com/rolczynski/Automatic-Speech-Recognition","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rolczynski%2FAutomatic-Speech-Recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rolczynski%2FAutomatic-Speech-Recognition/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rolczynski%2FAutomatic-Speech-Recognition/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rolczynski%2FAutomatic-Speech-Recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rolczynski","download_url":"https://codeload.github.com/rolczynski/Automatic-Speech-Recognition/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234722055,"owners_count":18876896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-speech-recognition","deep-learning","deepspeech","distill","keras","language-model","machine-learning","neural-networks","speech-recognition","speech-to-text","tensorflow","tensorflow-models"],"created_at":"2024-09-26T21:20:27.252Z","updated_at":"2025-09-30T09:30:36.707Z","avatar_url":"https://github.com/rolczynski.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n### Automatic Speech Recognition\n\nThe project aims to distill Automatic Speech Recognition research.\nTo get started, you can load a ready-to-use pipeline with a pre-trained model.\nBenefit from eager execution in `TensorFlow 2.0` and freely monitor model weights, activations, or gradients.\n\n```python\nimport automatic_speech_recognition as asr\n\nfile = 'to/test/sample.wav'  # sample rate 16 kHz, 16-bit depth\nsample = asr.utils.read_audio(file)\npipeline = asr.load('deepspeech2', lang='en')\npipeline.model.summary()     # TensorFlow model\nsentences = pipeline.predict([sample])\n```\n\n\u003cbr\u003e\n\n\nWe support English (thanks to [Open 
Seq2Seq](https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#speech-recognition)).\nThe evaluation results on the English benchmark LibriSpeech dev-clean are in the table.\nFor reference, DeepSpeech (Mozilla) achieves around 7.5% WER, whereas the state of the art (RWTH Aachen University) reaches 2.3% WER\n(recent evaluation results can be found [here](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean)).\nBoth use an external language model to boost results.\nBy comparison, _humans_ achieve 5.83% WER on LibriSpeech dev-clean (reported [here](https://arxiv.org/abs/1512.02595v1)).\n\n| Model Name    | Decoder | WER-dev (%) |\n| :---          |  :---:  |    :---:    |\n| `deepspeech2` | greedy  |    6.71     |\n\n\u003cbr\u003e\n\n\nYou will soon find that you need to adjust the pipeline a little.\nTake a look at the [CTC Pipeline](automatic_speech_recognition/pipeline/ctc_pipeline.py).\nThe pipeline is responsible for connecting a neural network model\nwith all non-differentiable transformations (feature extraction or prediction decoding).\nPipeline components are independent.\nYou can adjust them to your needs, e.g. 
use more sophisticated feature extraction,\ndifferent data augmentation, or add a language model decoder (static n-grams or huge transformers).\nYou can do much more, such as distributing the training using a [Strategy](https://www.tensorflow.org/guide/distributed_training),\nor experimenting with a [mixed precision](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/Policy) policy.\n\n\u003cbr\u003e\n\n\n```python\nimport numpy as np\nimport tensorflow as tf\nimport automatic_speech_recognition as asr\n\n# Training and validation data\ndataset = asr.dataset.Audio.from_csv('train.csv', batch_size=32)\ndev_dataset = asr.dataset.Audio.from_csv('dev.csv', batch_size=32)\nalphabet = asr.text.Alphabet(lang='en')\nfeatures_extractor = asr.features.FilterBanks(\n    features_num=160,\n    winlen=0.02,\n    winstep=0.01,\n    winfunc=np.hanning\n)\nmodel = asr.model.get_deepspeech2(\n    input_dim=160,  # same value as features_num above\n    output_dim=29,\n    rnn_units=800,\n    is_mixed_precision=False\n)\noptimizer = tf.optimizers.Adam(\n    learning_rate=1e-4,\n    beta_1=0.9,\n    beta_2=0.999,\n    epsilon=1e-8\n)\ndecoder = asr.decoder.GreedyDecoder()\npipeline = asr.pipeline.CTCPipeline(\n    alphabet, features_extractor, model, optimizer, decoder\n)\n# Train, then save a checkpoint\npipeline.fit(dataset, dev_dataset, epochs=25)\npipeline.save('/checkpoint')\n\n# Evaluate on the test data\ntest_dataset = asr.dataset.Audio.from_csv('test.csv')\nwer, cer = asr.evaluate.calculate_error_rates(pipeline, test_dataset)\nprint(f'WER: {wer}   CER: {cer}')\n```\n\n\u003cbr\u003e\n\n\n#### Installation\nYou can use pip:\n```bash\npip install automatic-speech-recognition\n```\nOtherwise, clone the code and create a new environment via [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#):\n```bash\ngit clone https://github.com/rolczynski/Automatic-Speech-Recognition.git\ncd Automatic-Speech-Recognition\nconda env create -f=environment.yml     # or use: environment-gpu.yml\nconda activate Automatic-Speech-Recognition\n```\n\n\u003cbr\u003e\n\n\n#### References\n\nThe fundamental 
repositories:\n- Baidu - [DeepSpeech2 - A PaddlePaddle implementation of DeepSpeech2 architecture for ASR](https://github.com/PaddlePaddle/DeepSpeech)\n- NVIDIA - [Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP](https://nvidia.github.io/OpenSeq2Seq)\n- RWTH Aachen University - [The RWTH extensible training framework for universal recurrent neural networks](https://github.com/rwth-i6/returnn)\n- TensorFlow - [The implementation of DeepSpeech2 model](https://github.com/tensorflow/models/tree/master/research/deep_speech)\n- Mozilla - [DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture](https://github.com/mozilla/DeepSpeech)\n- ESPnet - [End-to-End Speech Processing Toolkit](https://github.com/espnet/espnet)\n- Sean Naren - [Speech Recognition using DeepSpeech2](https://github.com/SeanNaren/deepspeech.pytorch)\n\nYou can also search GitHub using key phrases like `ASR`, `DeepSpeech`, or `Speech-To-Text`.\nThe list [wer_are_we](https://github.com/syhw/wer_are_we), an attempt at tracking the state of the art,\ncan be helpful too.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frolczynski%2Fautomatic-speech-recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frolczynski%2Fautomatic-speech-recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frolczynski%2Fautomatic-speech-recognition/lists"}