{"id":22488493,"url":"https://github.com/SeanNaren/deepspeech.pytorch","last_synced_at":"2025-08-02T21:30:58.416Z","repository":{"id":41562703,"uuid":"78508757","full_name":"SeanNaren/deepspeech.pytorch","owner":"SeanNaren","description":"Speech Recognition using DeepSpeech2.","archived":false,"fork":false,"pushed_at":"2022-12-13T15:05:51.000Z","size":742,"stargazers_count":2105,"open_issues_count":7,"forks_count":620,"subscribers_count":52,"default_branch":"master","last_synced_at":"2024-11-30T00:02:37.369Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SeanNaren.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-01-10T07:28:18.000Z","updated_at":"2024-11-25T04:18:51.000Z","dependencies_parsed_at":"2023-01-28T13:01:21.247Z","dependency_job_id":null,"html_url":"https://github.com/SeanNaren/deepspeech.pytorch","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanNaren%2Fdeepspeech.pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanNaren%2Fdeepspeech.pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanNaren%2Fdeepspeech.pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanNaren%2Fdeepspeech.pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SeanNaren","download_url":"https://codeload.github.com/SeanNaren/deepspeech.pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":228500216,"owners_count":17930015,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-06T17:17:49.395Z","updated_at":"2024-12-06T17:20:14.508Z","avatar_url":"https://github.com/SeanNaren.png","language":"Python","funding_links":[],"categories":["常见论文实现","Python","PyTorch Models","Paper implementations｜论文实现","Paper implementations"],"sub_categories":["Audio Processing","Other libraries｜其他库:","Other libraries:"],"readme":"# deepspeech.pytorch\n![Tests](https://github.com/SeanNaren/deepspeech.pytorch/actions/workflows/ci-test.yml/badge.svg)\n\nImplementation of DeepSpeech2 for PyTorch using [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning). The repo supports training/testing and inference using the [DeepSpeech2](http://arxiv.org/pdf/1512.02595v1.pdf) model. Optionally a [kenlm](https://github.com/kpu/kenlm) language model can be used at inference time.\n\n## Install\n\nSeveral libraries are needed to be installed for training to work. I will assume that everything is being installed in\nan Anaconda installation on Ubuntu, with PyTorch installed.\n\nInstall [PyTorch](https://github.com/pytorch/pytorch#installation) if you haven't already.\n\nIf you want decoding to support beam search with an optional language model, install ctcdecode:\n```\ngit clone --recursive https://github.com/parlance/ctcdecode.git\ncd ctcdecode \u0026\u0026 pip install .\n```\n\nFinally clone this repo and run this within the repo:\n```\npip install -r requirements.txt\npip install -e . 
# Dev install\n```\n\nIf you plan to use multi-node training, you'll need etcd. Below is the command to install it on Ubuntu.\n```\nsudo apt-get install etcd\n```\n\n### Docker\n\nTo use the image with a GPU you'll need to have [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) installed.\n\n```bash\nsudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --tmpfs /tmp -p 8888:8888 --net=host --ipc=host seannaren/deepspeech.pytorch:latest # Opens a Jupyter notebook, mounting the /data drive in the container\n```\n\nOptionally you can use the command line by changing the entrypoint:\n\n```bash\nsudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --tmpfs /tmp --entrypoint=/bin/bash --net=host --ipc=host seannaren/deepspeech.pytorch:latest\n```\n\n## Training\n\n### Datasets\n\nThe repo currently supports [AN4](http://www.speech.cs.cmu.edu/databases/an4/), [TEDLIUM](https://www.openslr.org/51/), [Voxforge](http://www.voxforge.org/), [Common Voice](https://commonvoice.mozilla.org/en/datasets) and [LibriSpeech](https://www.openslr.org/12). Scripts in the data/ folder set up each dataset and create the manifest files used in data-loading. Many of the scripts also let you download the raw datasets separately if you prefer.\n\n### Training Commands\n\n##### AN4\n\n```bash\ncd data/ \u0026\u0026 python an4.py \u0026\u0026 cd ..\n\npython train.py +configs=an4\n```\n\n##### LibriSpeech\n\n```bash\ncd data/ \u0026\u0026 python librispeech.py \u0026\u0026 cd ..\n\npython train.py +configs=librispeech\n```\n\n##### Common Voice\n\n```bash\ncd data/ \u0026\u0026 python common_voice.py \u0026\u0026 cd ..\n\npython train.py +configs=commonvoice\n```\n\n##### TEDLIUM\n\n```bash\ncd data/ \u0026\u0026 python ted.py \u0026\u0026 cd ..\n\npython train.py +configs=tedlium\n```\n\n#### Custom Dataset\n\nTo create a custom dataset you must create a JSON file containing the locations of the training/testing data. 
This has to be in the format of:\n```json\n{\n  \"root_path\":\"path/to\",\n  \"samples\":[\n    {\"wav_path\":\"audio.wav\",\"transcript_path\":\"text.txt\"},\n    {\"wav_path\":\"audio2.wav\",\"transcript_path\":\"text2.txt\"},\n    ...\n  ]\n}\n```\nHere `root_path` is the root directory, `wav_path` is the path to the audio file, and `transcript_path` is the path to a text file containing the transcript on one line. This can then be used as stated below.\n\n##### Note on CSV files ...\nUp until release [V2.1](https://github.com/SeanNaren/deepspeech.pytorch/releases/tag/V2.1), deepspeech.pytorch used CSV manifest files instead of JSON.\nThese manifest files were formatted as a two-column table:\n```\n/path/to/audio.wav,/path/to/text.txt\n/path/to/audio2.wav,/path/to/text2.txt\n...\n```\nNote that this format is incompatible from [V3.0](https://github.com/SeanNaren/deepspeech.pytorch/releases/tag/V3.0) onwards.\n\n#### Merging multiple manifest files\n\nTo create bigger manifest files (to train/test on multiple datasets at once), you can merge manifest files together as shown below.\n\n```\ncd data/\npython merge_manifests.py manifest_1.json manifest_2.json --out new_manifest_dir\n```\n\n### Modifying Training Configs\n\nConfiguration is done via [Hydra](https://github.com/facebookresearch/hydra).\n\nDefaults can be seen in [train_config.py](deepspeech_pytorch/configs/train_config.py). 
Below is how you can override values that are already set:\n\n```\npython train.py data.train_path=data/train_manifest.json data.val_path=data/val_manifest.json\n```\n\nUse `python train.py --help` for all parameters and options.\n\nYou can also keep parameters in a YAML config file, like so:\n\nCreate the folder `experiment/` and the file `experiment/an4.yaml`:\n```yaml\ndata:\n  train_path: data/an4_train_manifest.json\n  val_path: data/an4_val_manifest.json\n```\n\n```\npython train.py +experiment=an4\n```\n\nTo see the options available, check [here](./deepspeech_pytorch/configs/train_config.py).\n\n### Multi-GPU Training\n\nWe support single-machine multi-GPU training via [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning).\n\nBelow is an example command for training on a machine with 4 local GPUs:\n\n```\npython train.py +configs=an4 trainer.gpus=4\n```\n\n### Multi-Node Training\n\nMulti-machine training is also supported using TorchElastic. This requires a node to act as an explicit etcd host (which could be one of the GPU nodes, although this isn't recommended), a shared mount across your cluster to load/save checkpoints, and network communication between the nodes.\n\nIn the example below we've set one of our GPU nodes as the etcd host. If you're scaling up, we suggest running etcd on a separate instance from your GPU nodes, as the etcd host is a single point of failure.\n\nThe example assumes a shared drive called /share where checkpoints and data are stored.\n\nRun on the etcd host:\n```\nPUBLIC_HOST_NAME=127.0.0.1 # Change to public host name for all nodes to connect\netcd --enable-v2 \\\n     --listen-client-urls http://$PUBLIC_HOST_NAME:4377 \\\n     --advertise-client-urls http://$PUBLIC_HOST_NAME:4377 \\\n     --listen-peer-urls http://$PUBLIC_HOST_NAME:4379\n```\n\nRun on each GPU node:\n```\npython -m torchelastic.distributed.launch \\\n        --nnodes=2 \\\n        --nproc_per_node=4 \\\n       
 --rdzv_id=123 \\\n        --rdzv_backend=etcd \\\n        --rdzv_endpoint=$PUBLIC_HOST_NAME:4377 \\\n        train.py data.train_path=/share/data/an4_train_manifest.json \\\n                 data.val_path=/share/data/an4_val_manifest.json model.precision=half \\\n                 data.num_workers=8 checkpoint.save_folder=/share/checkpoints/ \\\n                 checkpoint.checkpoint=true checkpoint.load_auto_checkpoint=true checkpointing.save_n_recent_models=3 \\\n                 data.batch_size=8 trainer.max_epochs=70 \\\n                 trainer.accelerator=ddp trainer.gpus=4 trainer.num_nodes=2\n```\n\nUsing the `load_auto_checkpoint=true` flag we can resume training from the latest saved checkpoint.\n\nCurrently it is expected that there is an NFS drive/shared mount across all nodes within the cluster to load the latest checkpoint from.\n\n### Augmentation\n\nThere is support for three different types of augmentation: SpecAugment, noise injection and random tempo/gain perturbations.\n\n#### SpecAugment\n\nApplies simple spectral augmentation techniques directly on Mel spectrogram features to make the model more robust to variations in input data. To enable SpecAugment, use the `--spec-augment` flag when training.\n\nThe SpecAugment implementation was adapted from [this](https://github.com/DemisEom/SpecAugment) project.\n\n#### Noise Injection\n\nDynamically adds noise into the training data to increase robustness. To use it, first fill a directory with all the noise files you want to sample from.\nThe dataloader will randomly pick samples from this directory.\n\nTo enable noise injection, use the `--noise-dir /path/to/noise/dir/` flag to specify where your noise files are. 
There are a few noise parameters to tweak, such as\n`--noise_prob` to determine the probability that noise is added, and the `--noise-min`, `--noise-max` parameters to determine the minimum and maximum noise to add in training.\n\nIncluded is a script to inject noise into an audio file to hear what different noise levels/files would sound like. This is useful for curating the noise dataset.\n\n```\npython noise_inject.py --input-path /path/to/input.wav --noise-path /path/to/noise.wav --output-path /path/to/input_injected.wav --noise-level 0.5 # higher levels mean more noise\n```\n\n#### Tempo/Gain Perturbation\n\nApplies small changes to the tempo and gain when loading audio to increase robustness. To enable it, use the `--speed-volume-perturb` flag when training.\n\n### Checkpoints\n\nTypically checkpoints are stored in `lightning_logs/` in the current working directory of the script.\n\nThis can be adjusted:\n\n```\npython train.py checkpoint.file_path=save_dir/\n```\n\nTo load a previously saved checkpoint:\n\n```\npython train.py trainer.resume_from_checkpoint=lightning_logs/deepspeech_checkpoint_epoch_N_iter_N.ckpt\n```\n\nThis continues from the same training state.\n\n## Testing/Inference\n\nTo evaluate a trained model on a test set (which must be in the same format as the training set):\n\n```\npython test.py model.model_path=models/deepspeech.pth test_path=/path/to/test_manifest.json\n```\n\nAn example script to output a transcription has been provided:\n\n```\npython transcribe.py \\\n       model.model_path=models/deepspeech.pth \\\n       model.cuda=True \\\n       chunk_size_seconds=-1 \\\n       audio_path=/path/to/audio.wav\n```\n\nIf you used mixed-precision or half precision when training the model, you can set `model.precision=half` for a speed/memory benefit. 
If you want to transcribe a long audio file that does not fit in GPU memory, set `chunk_size_seconds` to a positive number: the length, in seconds, of the chunks into which the audio file is split.\n\n## Inference Server\n\nIncluded is a basic server script that accepts POST requests to transcribe files.\n\n```\npython server.py --host 0.0.0.0 --port 8000 # Run in one window\n\ncurl -X POST http://0.0.0.0:8000/transcribe -H \"Content-type: multipart/form-data\" -F \"file=@/path/to/input.wav\"\n```\n\n## Using an ARPA LM\n\nWe support using KenLM-based LMs. Below are instructions on how to take the LibriSpeech LMs found [here](http://www.openslr.org/11/) and tune the decoding parameters to get the best results on LibriSpeech.\n\n### Tuning the LibriSpeech LMs\n\nFirst ensure you've set up the LibriSpeech datasets from the data/ folder.\nIn addition, download the latest pre-trained LibriSpeech model from the releases page, as well as the ARPA model you want to tune from [here](http://www.openslr.org/11/). For the steps below we use the 3-gram ARPA model (3e-7 prune).\n\nFirst we need to generate the acoustic output to be used to evaluate the model on LibriSpeech val.\n```\npython test.py data.test_path=data/librispeech_val_manifest.json model.model_path=librispeech_pretrained_v2.pth save_output=librispeech_val_output.npy\n```\n\nWe use a beam width of 128, which gives reasonable results. We suggest using a node with plenty of CPU capacity to carry out the grid search.\n\n```\npython search_lm_params.py --num-workers 16 --saved-output librispeech_val_output.npy --output-path libri_tune_output.json --lm-alpha-from 0 --lm-alpha-to 5 --lm-beta-from 0 --lm-beta-to 3 --lm-path 3-gram.pruned.3e-7.arpa --model-path librispeech_pretrained_v2.pth --beam-width 128 --lm-workers 16\n```\n\nThis will run a grid search across the alpha/beta parameters using a beam width of 128. 
Use the script below to find the best alpha/beta params:\n\n```\npython select_lm_params.py --input-path libri_tune_output.json\n```\n\nUse the resulting alpha/beta parameters when using the beam decoder.\n\n### Building your own LM\n\nTo build your own LM you need to use the KenLM repo found [here](https://github.com/kpu/kenlm). Have a read of the documentation to get a sense of how to train your own LM. Once the LM is trained, the steps above can be used to find the appropriate parameters.\n\n### Alternate Decoders\nBy default, `test.py` and `transcribe.py` use a `GreedyDecoder` which picks the highest-likelihood output label at each timestep. Repeated and blank symbols are then filtered to give the final output.\n\nA beam search decoder can optionally be used by installing the `ctcdecode` library as described in the Install section. The `test` and `transcribe` scripts have an `lm` config. To use the beam decoder, add `lm.decoder_type=beam`. The beam decoder enables additional decoding parameters:\n- **lm.beam_width**: how many beams to consider at each timestep\n- **lm.lm_path**: optional binary KenLM language model to use for decoding\n- **lm.alpha**: weight for the language model\n- **lm.beta**: bonus weight for words\n\n### Time offsets\n\nUse the `offsets=true` flag to get positional information for each character in the transcription when using the `transcribe.py` script. 
The offsets are based on the size\nof the output tensor, so you need to convert them into the format you require.\nFor example, based on default parameters you could multiply the offsets by a scalar (duration of file in seconds / size of output) to get the offsets in seconds.\n\n## Pre-trained models\n\nPre-trained models can be found under releases [here](https://github.com/SeanNaren/deepspeech.pytorch/releases).\n\n## Acknowledgements\n\nThanks to [Egor](https://github.com/EgorLakomkin) and [Ryan](https://github.com/ryanleary) for their contributions!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSeanNaren%2Fdeepspeech.pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSeanNaren%2Fdeepspeech.pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSeanNaren%2Fdeepspeech.pytorch/lists"}