{"id":26781859,"url":"https://github.com/jiwidi/deepspeech-pytorch","last_synced_at":"2025-04-19T13:41:02.762Z","repository":{"id":37632814,"uuid":"283744517","full_name":"jiwidi/DeepSpeech-pytorch","owner":"jiwidi","description":"Pytorch implementation for DeepSpeech 2.0","archived":false,"fork":false,"pushed_at":"2024-07-25T11:07:17.000Z","size":1474,"stargazers_count":31,"open_issues_count":6,"forks_count":5,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-24T00:52:04.702Z","etag":null,"topics":["asr","deep-learning","e2e-asr","librispeech-dataset","machine-learning","pytorch","speech-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jiwidi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-30T10:43:51.000Z","updated_at":"2024-10-12T11:22:05.000Z","dependencies_parsed_at":"2022-08-18T03:10:59.011Z","dependency_job_id":null,"html_url":"https://github.com/jiwidi/DeepSpeech-pytorch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiwidi%2FDeepSpeech-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiwidi%2FDeepSpeech-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiwidi%2FDeepSpeech-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiwidi%2FDeepSpeech-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jiwidi","download_url":"https://codeload.gi
thub.com/jiwidi/DeepSpeech-pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246156403,"owners_count":20732397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","deep-learning","e2e-asr","librispeech-dataset","machine-learning","pytorch","speech-recognition"],"created_at":"2025-03-29T08:18:22.317Z","updated_at":"2025-03-29T08:18:23.050Z","avatar_url":"https://github.com/jiwidi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n---\n\n\u003cdiv align=\"center\"\u003e\n\n# DeepSpeech-pytorch\n\n\u003c/div\u003e\n\nEnd-to-end speech recognition model in PyTorch with DeepSpeech model\n\n## How to run\nFirst, install dependencies\n```bash\n# clone project\ngit clone https://github.com/jiwidi/DeepSpeech-pytorch\n\n# install project\ncd DeepSpeech-pytorch\npip install -e .\npip install -r requirements.txt\n ```\nReady to run! 
Execute:\n```bash\npython train.py  # Runs with default parameters and downloads the datasets to the local directory\n```\n\nTensorBoard logs will be saved under the `runs/` folder.\n\n## The model\nThe model is a variation of DeepSpeech 2 from the team at [AssemblyAI](https://www.assemblyai.com/)\n\n```py\nDeepSpeech(\n  (cnn): Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n  (rescnn_layers): Sequential(\n    (0): ResidualCNN(\n      (cnn1): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n      (cnn2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n      (dropout1): Dropout(p=0.1, inplace=False)\n      (dropout2): Dropout(p=0.1, inplace=False)\n      (layer_norm1): CNNLayerNorm(\n        (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)\n      )\n      (layer_norm2): CNNLayerNorm(\n        (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)\n      )\n    )\n    (1): ResidualCNN(\n      (cnn1): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n      (cnn2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n      (dropout1): Dropout(p=0.1, inplace=False)\n      (dropout2): Dropout(p=0.1, inplace=False)\n      (layer_norm1): CNNLayerNorm(\n        (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)\n      )\n      (layer_norm2): CNNLayerNorm(\n        (layer_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)\n      )\n    )\n  )\n  (fully_connected): Linear(in_features=2048, out_features=512, bias=True)\n  (birnn_layers): Sequential(\n    (0): BidirectionalGRU(\n      (BiGRU): GRU(512, 512, batch_first=True, bidirectional=True)\n      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)\n      (dropout): Dropout(p=0.1, inplace=False)\n    )\n    (1): BidirectionalGRU(\n      (BiGRU): GRU(1024, 512, bidirectional=True)\n      (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n      
(dropout): Dropout(p=0.1, inplace=False)\n    )\n    (2): BidirectionalGRU(\n      (BiGRU): GRU(1024, 512, bidirectional=True)\n      (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)\n      (dropout): Dropout(p=0.1, inplace=False)\n    )\n  )\n  (classifier): Sequential(\n    (0): Linear(in_features=1024, out_features=512, bias=True)\n    (1): GELU()\n    (2): Dropout(p=0.1, inplace=False)\n    (3): Linear(in_features=512, out_features=29, bias=True)\n  )\n)\nNum Model Parameters 14233053\n```\nWith the following architecture:\n![model_architecture](images/model_architecture.png)\n\n## Results\nResults after training for 10 epochs show great potential. I would like to spend more time fine-tuning the model and training for more epochs, but that requires purchasing cloud compute, which is out of scope for me right now.\n\nLoss\n-----\n\n|        Training data        |          Test data           |\n| :-------------------------: | :--------------------------: |\n| ![tr](images/trainloss.png) | ![test](images/testloss.png) |\n\nMetrics on `test-clean`\n-----\n\n| Character error rate CER |  Word error rate WER   |\n| :----------------------: | :--------------------: |\n|  ![CER](images/cer.png)  | ![WER](images/wer.png) |\n\n### Data pipeline\n\nFor testing the model we used the LibriSpeech dataset. Audio is converted with a MelSpectrogram transform, followed by FrequencyMasking to mask along the frequency dimension and TimeMasking to mask along the time dimension.\n\n```py\ntrain_audio_transforms = nn.Sequential(\n    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128),\n    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),\n    
torchaudio.transforms.TimeMasking(time_mask_param=35)\n)\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiwidi%2Fdeepspeech-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjiwidi%2Fdeepspeech-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiwidi%2Fdeepspeech-pytorch/lists"}