{"id":22683769,"url":"https://github.com/dialpad/mucs_2021_dialpad","last_synced_at":"2025-04-12T18:30:29.052Z","repository":{"id":47928329,"uuid":"391137891","full_name":"dialpad/mucs_2021_dialpad","owner":"dialpad","description":"Dialpad team's submission to the MUCS 2021 workshop","archived":false,"fork":false,"pushed_at":"2023-05-26T10:20:54.000Z","size":174076,"stargazers_count":5,"open_issues_count":0,"forks_count":5,"subscribers_count":5,"default_branch":"shree/mucs2021-dialpad","last_synced_at":"2025-03-26T12:47:02.905Z","etag":null,"topics":["asr","end-to-end","indian-language","multilingual"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dialpad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-30T17:03:31.000Z","updated_at":"2022-09-17T10:32:25.000Z","dependencies_parsed_at":"2022-08-12T14:21:30.118Z","dependency_job_id":null,"html_url":"https://github.com/dialpad/mucs_2021_dialpad","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dialpad%2Fmucs_2021_dialpad","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dialpad%2Fmucs_2021_dialpad/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dialpad%2Fmucs_2021_dialpad/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dialpad%2Fmucs_2021_dialpad/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dialpad","download_url":"https://codeload.github.com/dialp
ad/mucs_2021_dialpad/tar.gz/refs/heads/shree/mucs2021-dialpad","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248613067,"owners_count":21133433,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","end-to-end","indian-language","multilingual"],"created_at":"2024-12-09T21:13:47.549Z","updated_at":"2025-04-12T18:30:29.029Z","avatar_url":"https://github.com/dialpad.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Dialpad ASR System for the Multilingual ASR Challenge for Low Resource Indian Languages 2021\n### Shreekantha Nadig, Riqiang Wang, Wang Yau Li, Jeffrey Michael, Frédéric Mailhot, Simon Vandieken, Jonas Robertson\n#### Dialpad, Inc.\n\nThis paper describes the multilingual ASR systems developed at Dialpad, Inc. for the Multilingual ASR challenge for low resource Indian languages at Interspeech 2021. We participated in Sub-task 1, where the systems are trained on data of six Indic languages provided by the organizers. On this task, we experimented with both hybrid HMM-DNN and end-to-end ASR architectures and studied how fine-tuning techniques can help in this multilingual scenario. We also experimented with both multilingual and language-specific decoders by using a pre-trained encoder, as well as the use of appropriate RNN and n-gram language models. Furthermore, we present novel studies on transliteration-based pre-training of the encoder, and a joint LID and ASR architecture. 
We show that the multilingual end-to-end ASR models outperform both hybrid model and monolingual baselines. Also, we demonstrate that current methods of joint LID-ASR fail when there are confounding channel characteristics. We conducted studies and propose ideas on how to mitigate the effect of some of the channel characteristics on the task of language recognition. Our best submission to the challenge achieved an average WER of $22.95\\%$ on the development set and $31.87\\%$ on the held-out test set and contains language-specific decoders fine-tuned on the multilingual encoder, along with the use of language-specific RNNLMs and n-gram LMs.\n\n\n## Video Presentation\n\n[![Dialpad video presentation for MUCS 2021 workshop](https://img.youtube.com/vi/_ZGWXh3UMiI/0.jpg)](https://youtu.be/_ZGWXh3UMiI)\n\n## Models\nAll the end-to-end models in this work are trained using the ESPnet toolkit. Hence, the inference also follows the standard format of the toolkit.\nThe features are extracted using `torchaudio` (as opposed to `kaldi` binaries) in the toolkit. 
We provide the feature extraction code as well.\nAll of the models in this work can be used with the standard ESPnet decoding scripts as mentioned in the ESPnet toolkit: https://github.com/espnet/interspeech2019-tutorial\n\nWe make available the following pre-trained models for this work:\n\n| Name      | Description |\n| ----------- | ----------- |\n| B0      | Baseline encoder-decoder with combined vocabulary |\n| B1   | B0's encoder + monolingual decoder (Encoder frozen from B0) |\n| B1 (unfreeze)   | B0's encoder + monolingual decoder (Fine-tune after un-freezing Encoder) |\n| B3   | B0 but with transliterated latin script |\n| C0   | B0 + explicit LID subtask |\n| C1   | B3's encoder + explicit LID decoder |\n| L0   | LID trained from scratch |\n| L1   | LID with transliterated Encoder from B3 |\n| \"lang\"\\_RNNLM | Byte-level RNNLM for each language |\n\nYou can find the pre-trained models in this Google Drive link: https://drive.google.com/drive/folders/1QlEZgzscznfPaVv_B62Ipz0grXdeDNIr?usp=sharing\n\nOur data preparation recipe and inference scripts are under [egs/mucs_2021/task1/](egs/mucs_2021/task1/)\n\n## Extracting features for inference\nFor all experiments, we extracted 80-dimensional log Mel filterbank features with a window size of 25 ms computed at every 10 ms.\nThe features are extracted using `torchaudio.compliance.kaldi.fbank`\n```python\nlmspc = torchaudio.compliance.kaldi.fbank(\n            waveform=torch.unsqueeze(torch.tensor(signal), axis=0),\n            sample_frequency=8000,\n            dither=1e-32,\n            energy_floor=0,\n            num_mel_bins=80,\n        )\n```\n\n## Performing inference with the pre-trained models\nWe give an example ipython notebook (`inference_example.ipynb`) to perform inference with various models with features extracted using `torchaudio` and with an appropriate RNNLM.\n\n---\n\u003cdiv align=\"left\"\u003e\u003cimg src=\"doc/image/espnet_logo1.png\" width=\"550\"/\u003e\u003c/div\u003e\n# ESPnet: 
end-to-end speech processing toolkit\n\n|system/pytorch ver.|1.3.1|1.4.0|1.5.1|1.6.0|1.7.1|1.8.1|1.9.0|\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n|ubuntu20/python3.8/pip|||||||[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|\n|ubuntu18/python3.7/pip|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|\n|debian9/python3.6/conda|||||||[![debian9](https://github.com/espnet/espnet/workflows/debian9/badge.svg)](https://github.com/espnet/espnet/actions?query=workflow%3Adebian9)|\n|centos7/python3.6/conda||||||[![centos7](https://github.com/espnet/espnet/workflows/centos7/badge.svg)](https://github.com/espnet/espnet/actions?query=workflow%3Acentos7)||\n|doc/python3.8|||||||[![doc](https://github.com/espnet/espnet/workflows/doc/badge.svg)](https://github.com/espnet/espnet/actions?query=workflow%3Adoc)|\n\n[![PyPI version](https://badge.fury.io/py/espnet.svg)](https://badge.fury.io/py/espnet)\n[![Python Versions](https://img.shields.io/pypi/pyversions/espnet.svg)](https://pypi.org/project/espnet/)\n[![Downloads](https://pepy.tech/badge/espnet)](https://pepy.tech/project/espnet)\n[![GitHub 
license](https://img.shields.io/github/license/espnet/espnet.svg)](https://github.com/espnet/espnet)\n[![codecov](https://codecov.io/gh/espnet/espnet/branch/master/graph/badge.svg)](https://codecov.io/gh/espnet/espnet)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Mergify Status](https://img.shields.io/endpoint.svg?url=https://gh.mergify.io/badges/espnet/espnet\u0026style=flat)](https://mergify.io)\n[![Gitter](https://badges.gitter.im/espnet-en/community.svg)](https://gitter.im/espnet-en/community?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge)\n\n[**Docs**](https://espnet.github.io/espnet/)\n| [**Example**](https://github.com/espnet/espnet/tree/master/egs)\n| [**Example (ESPnet2)**](https://github.com/espnet/espnet/tree/master/egs2)\n| [**Docker**](https://github.com/espnet/espnet/tree/master/docker)\n| [**Notebook**](https://github.com/espnet/notebook)\n| [**Tutorial (2019)**](https://github.com/espnet/interspeech2019-tutorial)\n\nESPnet is an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech.\nESPnet uses [chainer](https://chainer.org/) and [pytorch](http://pytorch.org/) as its main deep learning engines,\nand also follows [Kaldi](http://kaldi-asr.org/) style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.\n\n## Key Features\n\n### Kaldi style complete recipe\n- Support a number of `ASR` recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)\n- Support a number of `TTS` recipes in a similar manner to the ASR recipe (LJSpeech, LibriTTS, M-AILABS, etc.)\n- Support a number of `ST` recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)\n- Support a number of `MT` recipes (IWSLT'16, the above ST recipes etc.)\n- Support speech separation and 
recognition recipe (WSJ-2mix)\n- Support voice conversion recipe (VCC2020 baseline) (new!)\n\n\n### ASR: Automatic Speech Recognition\n- **State-of-the-art performance** in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)\n- **Hybrid CTC/attention** based end-to-end ASR\n  - Fast/accurate training with CTC/attention multitask training\n  - CTC/attention joint decoding to boost monotonic alignment decoding\n  - Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer\n- Attention: Dot product, location-aware attention, variants of multihead\n- Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data\n- Batch GPU decoding\n- **Transducer** based end-to-end ASR\n  - Available: RNN-based encoder/decoder or custom encoder/decoder with support for Transformer, Conformer, TDNN (encoder) and causal conv1d (decoder) blocks.\n  - Also supported: mixed RNN/Custom encoder-decoder, VGG2L (RNN/Custom encoder) and various decoding algorithms.\n  \u003e Please refer to the [tutorial page](https://espnet.github.io/espnet/tutorial.html#transducer) for complete documentation.\n- CTC segmentation\n- Non-autoregressive model based on Mask-CTC\n- ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)\n- Wav2Vec2.0 pretrained model as Encoder, imported from [FairSeq](https://github.com/pytorch/fairseq/tree/master/fairseq).\n\nDemonstration\n- Real-time ASR demo with ESPnet2  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_asr_realtime_demo.ipynb)\n\n### TTS: Text-to-speech\n- Tacotron2\n- Transformer-TTS\n- FastSpeech\n- FastSpeech2 (in ESPnet2)\n- Conformer-based FastSpeech \u0026 FastSpeech2 (in ESPnet2)\n- Multi-speaker model with pretrained speaker embedding\n- Multi-speaker model with GST (in ESPnet2)\n- Phoneme-based training (En, 
Jp, and Zn)\n- Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)\n\nDemonstration\n- Real-time TTS demo with ESPnet2  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)\n- Real-time TTS demo with ESPnet1  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)\n\nTo train the neural vocoder, please check the following repositories:\n- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)\n- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)\n\n\u003e **NOTE**:\n\u003e - We are moving to ESPnet2-based development for TTS.\n\u003e - If you are a beginner, we recommend using [ESPnet2-TTS](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1).\n\n### SE: Speech enhancement (and separation)\n\n- Single-speaker speech enhancement\n- Multi-speaker speech separation\n- Unified encoder-separator-decoder structure for time-domain and frequency-domain models\n  - Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution\n  - Separators: BLSTM, Transformer, Conformer, DPRNN, Neural Beamformers, etc.\n- Flexible ASR integration: working as an individual task or as the ASR frontend\n- Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid)\n  - Both the pre-trained models from Asteroid and the specific configuration are supported.\n\nDemonstration\n- Interactive SE demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)\n\n\n### ST: Speech Translation \u0026 MT: Machine Translation\n- **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT)\n- Transformer based 
end-to-end ST (new!)\n- Transformer based end-to-end MT (new!)\n\n### VC: Voice conversion\n- Transformer and Tacotron2 based parallel VC using melspectrogram (new!)\n- End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)\n\n### DNN Framework\n- Flexible network architecture thanks to chainer and pytorch\n- Flexible front-end processing thanks to [kaldiio](https://github.com/nttcslab-sp/kaldiio) and HDF5 support\n- Tensorboard based monitoring\n\n### ESPnet2\nSee [ESPnet2](https://espnet.github.io/espnet/espnet2_tutorial.html).\n\n- Independent of Kaldi/Chainer, unlike ESPnet1\n- On the fly feature extraction and text processing when training\n- Supporting both DistributedDataParallel and DataParallel\n- Supporting multi-node training and integrated with [Slurm](https://slurm.schedmd.com/) or MPI\n- Supporting Sharded Training provided by [fairscale](https://github.com/facebookresearch/fairscale)\n- A template recipe which can be applied to all corpora\n- Possible to train on a corpus of any size without CPU memory errors\n- [ESPnet Model Zoo](https://github.com/espnet/espnet_model_zoo)\n- Integrated with [wandb](https://espnet.github.io/espnet/espnet2_training_option.html#weights-biases-integration)\n\n## Installation\n- If you intend to do full experiments including DNN training, then see [Installation](https://espnet.github.io/espnet/installation.html).\n- If you only need the Python module:\n    ```sh\n    pip install espnet\n    # To install latest\n    # pip install git+https://github.com/espnet/espnet\n    ```\n\n    You also need to install some packages.\n\n    ```sh\n    pip install torch\n    pip install chainer==6.0.0 cupy==6.0.0    # [Option] If you'll use ESPnet1\n    pip install torchaudio                    # [Option] If you'll use enhancement task\n    pip install torch_optimizer               # [Option] If you'll use additional optimizers in ESPnet2\n    ```\n\n    There are some required packages 
depending on each task other than above. If you encounter an ImportError, please install them at that time.\n- (ESPnet2) Once installed, run `wandb login` and set `--use_wandb true` to enable tracking runs using W\u0026B.\n\n## Usage\nSee [Usage](https://espnet.github.io/espnet/tutorial.html).\n\n## Docker Container\n\nGo to [docker/](docker/) and follow the [instructions](https://espnet.github.io/espnet/docker.html).\n\n## Contribution\nThank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome; feel free to post questions or requests in [issues](https://github.com/espnet/espnet/issues).\nIf this is your first contribution to ESPnet, please follow the [contribution guide](CONTRIBUTING.md).\n\n## Results and demo\n\nYou can find useful tutorials and demos in [Interspeech 2019 Tutorial](https://github.com/espnet/interspeech2019-tutorial)\n\n### ASR results\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\n\nWe list the character error rate (CER) and word error rate (WER) of major ASR tasks.\n\n| Task                   | CER (%) | WER (%) | Pretrained model|\n| -----------            | :----:  | :----:  | :----:                                                                                                                                                                |\n| Aishell dev/test            | 4.6/5.1    | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/aishell/asr1/RESULTS.md#conformer-kernel-size--15--specaugment--lm-weight--00-result) |\n| **ESPnet2** Aishell dev/test            | 4.4/4.7    | N/A     | [link](https://github.com/espnet/espnet/tree/master/egs2/aishell/asr1#conformer--specaug--speed-perturbation-featsraw-n_fft512-hop_length128) |\n| Common Voice dev/test       | 1.7/1.8     | 2.2/2.3     | [link](https://github.com/espnet/espnet/blob/master/egs/commonvoice/asr1/RESULTS.md#first-results-default-pytorch-transformer-setting-with-bpe-100-epochs-single-gpu) |\n| CSJ 
eval1/eval2/eval3              | 5.7/3.8/4.2     | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/csj/asr1/RESULTS.md#pytorch-backend-transformer-without-any-hyperparameter-tuning)                            |\n| **ESPnet2** CSJ eval1/eval2/eval3              | 4.5/3.3/3.6     | N/A     | [link](https://github.com/espnet/espnet/tree/master/egs2/csj/asr1#initial-conformer-results)                            |\n| HKUST dev              | 23.5    | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/hkust/asr1/RESULTS.md#transformer-only-20-epochs)                                                             |\n|  **ESPnet2** HKUST dev              | 21.2    | N/A     | [link](https://github.com/espnet/espnet/tree/master/egs2/hkust/asr1#transformer-asr--transformer-lm)                                                             |\n| Librispeech dev_clean/dev_other/test_clean/test_other  | N/A     | 1.9/4.9/2.1/4.9     | [link](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-conformer-with-specaug--speed-perturbation-8-gpus--transformer-lm-4-gpus)             |\n| **ESPnet2** Librispeech dev_clean/dev_other/test_clean/test_other  | 0.7/2.2/0.7/2.1    | 1.9/4.6/2.1/4.7     | [link](https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1#with-transformer-lm)             |\n| Switchboard (eval2000) callhm/swbd           | N/A     | 14.0/6.8     | [link](https://github.com/espnet/espnet/blob/master/egs/swbd/asr1/RESULTS.md#conformer-with-bpe-2000-specaug-speed-perturbation-transformer-lm-decoding)   |\n| TEDLIUM2 dev/test           | N/A     | 8.6/7.2     | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium2/asr1/RESULTS.md#conformer-large-model--specaug--speed-perturbation--rnnlm)   |\n| TEDLIUM3 dev/test           | N/A     | 9.6/7.6     | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium3/asr1/RESULTS.md)                   |\n| WSJ dev93/eval92              | 
3.2/2.1     | 7.0/4.7     | N/A |\n|  **ESPnet2** WSJ dev93/eval92              | 2.7/1.8     | 6.6/4.6     | [link](https://github.com/espnet/espnet/tree/master/egs2/wsj/asr1#using-transformer-lm-asr-model-is-same-as-the-above-lm_weight12-ctc_weight03-beam_size20) |\n\nNote that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and large subword units if necessary reported by [RWTH](https://arxiv.org/pdf/1805.03294.pdf).\n\nIf you want to check the results of the other recipes, please check `egs/\u003cname_of_recipe\u003e/asr1/RESULTS.md`.\n\n\u003c/div\u003e\u003c/details\u003e\n\n\n### ASR demo\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\nYou can recognize speech in a WAV file using pretrained models.\nGo to a recipe directory and run `utils/recog_wav.sh` as follows:\n```sh\n# go to recipe directory and source path of espnet tools\ncd egs/tedlium2/asr1 \u0026\u0026 . ./path.sh\n# let's recognize speech!\nrecog_wav.sh --models tedlium2.transformer.v1 example.wav\n```\nwhere `example.wav` is a WAV file to be recognized.\nThe sampling rate must be consistent with that of data used in training.\n\nAvailable pretrained models in the demo script are listed as below.\n\n| Model                                                                                            | Notes                                                      |\n| :------                                                                                          | :------                                                    |\n| [tedlium2.rnn.v1](https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe)            | Streaming decoding based on CTC-based VAD                  |\n| [tedlium2.rnn.v2](https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf)            | Streaming decoding based on CTC-based VAD (batch decoding) |\n| 
[tedlium2.transformer.v1](https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow)    | Joint-CTC attention Transformer trained on Tedlium 2       |\n| [tedlium3.transformer.v1](https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU)    | Joint-CTC attention Transformer trained on Tedlium 3       |\n| [librispeech.transformer.v1](https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6) | Joint-CTC attention Transformer trained on Librispeech     |\n| [commonvoice.transformer.v1](https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh) | Joint-CTC attention Transformer trained on CommonVoice     |\n| [csj.transformer.v1](https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF)         | Joint-CTC attention Transformer trained on CSJ             |\n| [csj.rnn.v1](https://drive.google.com/open?id=1ALvD4nHan9VDJlYJwNurVr7H7OV0j2X9)                 | Joint-CTC attention VGGBLSTM trained on CSJ                |\n\n\u003c/div\u003e\u003c/details\u003e\n\n### SE results\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\nWe list results from three different models on WSJ0-2mix, which is one of the most widely used benchmark datasets for speech separation.\n\n|Model|STOI|SAR|SDR|SIR|\n|---|---|---|---|---|\n|[TF Masking](https://zenodo.org/record/4498554)|0.89|11.40|10.24|18.04|\n|[Conv-Tasnet](https://zenodo.org/record/4498562)|0.95|16.62|15.94|25.90|\n|[DPRNN-Tasnet](https://zenodo.org/record/4688000)|0.96|18.82|18.29|28.92|\n\n\u003c/div\u003e\u003c/details\u003e\n\n### SE demos\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\nYou can try the interactive demo with Google Colab. Please click the following button to get access to the demos.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)\n\n\nIt is based on ESPnet2. 
Pretrained models are available for both speech enhancement and speech separation tasks.\n\n\u003c/div\u003e\u003c/details\u003e\n\n### ST results\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\nWe list 4-gram BLEU of major ST tasks.\n\n#### end-to-end system\n| Task | BLEU | Pretrained model |\n| ---- | :----: | :----: |\n| Fisher-CallHome Spanish fisher_test (Es-\u003eEn)      | 51.03 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |\n| Fisher-CallHome Spanish callhome_evltest (Es-\u003eEn) | 20.44 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |\n| Libri-trans test (En-\u003eFr)                         | 16.70 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/st1/RESULTS.md#train_spfr_lc_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans-1) |\n| How2 dev5 (En-\u003ePt)                                | 45.68 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/st1/RESULTS.md#trainpt_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans-1) |\n| Must-C tst-COMMON (En-\u003eDe)                        | 22.91 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/st1/RESULTS.md#train_spen-dede_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans) |\n| Mboshi-French dev (Fr-\u003eMboshi)                    | 6.18  | N/A  |\n\n#### cascaded system\n| Task | BLEU | Pretrained model |\n| ---- | :----: | :----: |\n| Fisher-CallHome Spanish fisher_test (Es-\u003eEn)      | 42.16 | N/A  |\n| Fisher-CallHome Spanish callhome_evltest (Es-\u003eEn) | 19.82 | N/A  |\n| Libri-trans test (En-\u003eFr)                         | 
16.96 | N/A  |\n| How2 dev5 (En-\u003ePt)                                | 44.90 | N/A  |\n| Must-C tst-COMMON (En-\u003eDe)                        | 23.65 | N/A  |\n\nIf you want to check the results of the other recipes, please check `egs/\u003cname_of_recipe\u003e/st1/RESULTS.md`.\n\n\u003c/div\u003e\u003c/details\u003e\n\n\n### ST demo\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\n(**New!**) We made a new real-time E2E-ST + TTS demonstration in Google Colab.\nPlease access the notebook from the following button and enjoy the real-time speech-to-speech translation!\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb)\n\n---\n\nYou can translate speech in a WAV file using pretrained models.\nGo to a recipe directory and run `utils/translate_wav.sh` as follows:\n```sh\n# go to recipe directory and source path of espnet tools\ncd egs/fisher_callhome_spanish/st1 \u0026\u0026 . 
./path.sh\n# download example wav file\nwget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -\n# let's translate speech!\ntranslate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav\n```\nwhere `test.wav` is a WAV file to be translated.\nThe sampling rate must be consistent with that of data used in training.\n\nAvailable pretrained models in the demo script are listed as below.\n\n| Model                                                                                            | Notes                                                      |\n| :------                                                                                          | :------                                                    |\n| [fisher_callhome_spanish.transformer.v1](https://drive.google.com/open?id=1hawp5ZLw4_SIHIT3edglxbKIIkPVe8n3)            | Transformer-ST trained on Fisher-CallHome Spanish Es-\u003eEn                  |\n\n\u003c/div\u003e\u003c/details\u003e\n\n\n### MT results\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\n| Task | BLEU | Pretrained model |\n| ---- | :----: | :----: |\n| Fisher-CallHome Spanish fisher_test (Es-\u003eEn)      | 61.45 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |\n| Fisher-CallHome Spanish callhome_evltest (Es-\u003eEn) | 29.86 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |\n| Libri-trans test (En-\u003eFr)                         | 18.09 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/mt1/RESULTS.md#trainfr_lcrm_tc_pytorch_train_pytorch_transformer_bpe1000) |\n| How2 dev5 (En-\u003ePt)                                | 58.61 | 
[link](https://github.com/espnet/espnet/blob/master/egs/how2/mt1/RESULTS.md#trainpt_tc_tc_pytorch_train_pytorch_transformer_bpe8000) |\n| Must-C tst-COMMON (En-\u003eDe)                        | 27.63 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/mt1/RESULTS.md#summary-4-gram-bleu) |\n| IWSLT'14 test2014 (En-\u003eDe)                        | 24.70 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |\n| IWSLT'14 test2014 (De-\u003eEn)                        | 29.22 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |\n| IWSLT'16 test2014 (En-\u003eDe)                        | 24.05 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |\n| IWSLT'16 test2014 (De-\u003eEn)                        | 29.13 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |\n\n\u003c/div\u003e\u003c/details\u003e\n\n### TTS results\n\n\u003cdetails\u003e\u003csummary\u003eESPnet2\u003c/summary\u003e\u003cdiv\u003e\n\nYou can listen to the generated samples in the following url.\n- [ESPnet2 TTS generated samples](https://drive.google.com/drive/folders/1H3fnlBbWMEkQUfrHqosKN_ZX_WjO29ma?usp=sharing)\n\n\u003e Note that in the generation we use Griffin-Lim (`wav/`) and [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) (`wav_pwg/`).\n\nYou can download pretrained models via `espnet_model_zoo`.\n- [ESPnet model zoo](https://github.com/espnet/espnet_model_zoo)\n- [Pretrained model list](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)\n\nYou can download pretrained vocoders via `kan-bayashi/ParallelWaveGAN`.\n- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)\n- [Pretrained vocoder 
list](https://github.com/kan-bayashi/ParallelWaveGAN#results)\n\n\u003c/div\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eESPnet1\u003c/summary\u003e\u003cdiv\u003e\n\n\u003e NOTE: We are moving on ESPnet2-based development for TTS. Please check the latest results in the above ESPnet2 results.\n\nYou can listen to our samples in demo HP [espnet-tts-sample](https://espnet.github.io/espnet-tts-sample/).\nHere we list some notable ones:\n\n- [Single English speaker Tacotron2](https://drive.google.com/open?id=18JgsOCWiP_JkhONasTplnHS7yaF_konr)\n- [Single Japanese speaker Tacotron2](https://drive.google.com/open?id=1fEgS4-K4dtgVxwI4Pr7uOA1h4PE-zN7f)\n- [Single other language speaker Tacotron2](https://drive.google.com/open?id=1q_66kyxVZGU99g8Xb5a0Q8yZ1YVm2tN0)\n- [Multi English speaker Tacotron2](https://drive.google.com/open?id=18S_B8Ogogij34rIfJOeNF8D--uG7amz2)\n- [Single English speaker Transformer](https://drive.google.com/open?id=14EboYVsMVcAq__dFP1p6lyoZtdobIL1X)\n- [Single English speaker FastSpeech](https://drive.google.com/open?id=1PSxs1VauIndwi8d5hJmZlppGRVu2zuy5)\n- [Multi English speaker Transformer](https://drive.google.com/open?id=1_vrdqjM43DdN1Qz7HJkvMQ6lCMmWLeGp)\n- [Single Italian speaker FastSpeech](https://drive.google.com/open?id=13I5V2w7deYFX4DlVk1-0JfaXmUR2rNOv)\n- [Single Mandarin speaker Transformer](https://drive.google.com/open?id=1mEnZfBKqA4eT6Bn0eRZuP6lNzL-IL3VD)\n- [Single Mandarin speaker FastSpeech](https://drive.google.com/open?id=1Ol_048Tuy6BgvYm1RpjhOX4HfhUeBqdK)\n- [Multi Japanese speaker Transformer](https://drive.google.com/open?id=1fFMQDF6NV5Ysz48QLFYE8fEvbAxCsMBw)\n- [Single English speaker models with Parallel WaveGAN](https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx)\n- [Single English speaker knowledge distillation-based FastSpeech](https://drive.google.com/open?id=1wG-Y0itVYalxuLAHdkAHO7w1CWFfRPF4)\n\nYou can download all of the pretrained models and generated samples:\n- [All of the 
pretrained E2E-TTS models](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF)\n- [All of the generated samples](https://drive.google.com/open?id=1bQGuqH92xuxOX__reWLP4-cif0cbpMLX)\n\nNote that in the generated samples we use the following vocoders: Griffin-Lim (**GL**), WaveNet vocoder (**WaveNet**), Parallel WaveGAN (**ParallelWaveGAN**), and MelGAN (**MelGAN**).\nThe neural vocoders are based on the following repositories:\n- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN): Parallel WaveGAN / MelGAN / Multi-band MelGAN\n- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder): 16-bit mixture-of-logistics WaveNet vocoder\n- [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder): 8-bit softmax WaveNet vocoder with noise shaping\n\nIf you want to build your own neural vocoder, please check the above repositories.\n[kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) provides [a manual](https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features) on how to decode ESPnet-TTS model features with neural vocoders.\n\nHere we list all of the pretrained neural vocoders. 
Please download and enjoy the generation of high quality speech!\n\n| Model link                                                                                           | Lang  | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type                                                              |\n| :------                                                                                              | :---: | :----:  | :--------:     | :---------------:      | :------                                                                 |\n| [ljspeech.wavenet.softmax.ns.v1](https://drive.google.com/open?id=1eA1VcRS9jzFa-DovyTgJLQ_jmwOLIi8L) | EN    | 22.05k  | None           | 1024 / 256 / None      | [Softmax WaveNet](https://github.com/kan-bayashi/PytorchWaveNetVocoder) |\n| [ljspeech.wavenet.mol.v1](https://drive.google.com/open?id=1sY7gEUg39QaO1szuN62-Llst9TrFno2t)        | EN    | 22.05k  | None           | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |\n| [ljspeech.parallel_wavegan.v1](https://drive.google.com/open?id=1tv9GKyRT4CDsvUWKwH3s_OfXkiTi0gw7)   | EN    | 22.05k  | None           | 1024 / 256 / None      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |\n| [ljspeech.wavenet.mol.v2](https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr)        | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |\n| [ljspeech.parallel_wavegan.v2](https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB)   | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |\n| [ljspeech.melgan.v1](https://drive.google.com/open?id=1ipPWYl8FBNRlBFaKj1-i23eQpW_W_YcR)             | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN)                |\n| 
[ljspeech.melgan.v3](https://drive.google.com/open?id=1_a8faVA5OGCzIcJNw4blQYjfG4oA9VEt)             | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN)                |\n| [libritts.wavenet.mol.v1](https://drive.google.com/open?id=1jHUUmQFjWiQGyDd7ZeiCThSjjpbF_B4h)        | EN    | 24k     | None           | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |\n| [jsut.wavenet.mol.v1](https://drive.google.com/open?id=187xvyNbmJVZ0EZ1XHCdyjZHTXK9EcfkK)            | JP    | 24k     | 80-7600        | 2048 / 300 / 1200      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |\n| [jsut.parallel_wavegan.v1](https://drive.google.com/open?id=1OwrUQzAmvjj1x9cDhnZPp6dqtsEqGEJM)       | JP    | 24k     | 80-7600        | 2048 / 300 / 1200      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |\n| [csmsc.wavenet.mol.v1](https://drive.google.com/open?id=1PsjFRV5eUP0HHwBaRYya9smKy5ghXKzj)           | ZH    | 24k     | 80-7600        | 2048 / 300 / 1200      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |\n| [csmsc.parallel_wavegan.v1](https://drive.google.com/open?id=10M6H88jEUGbRWBmU1Ff2VaTmOAeL8CEy)      | ZH    | 24k     | 80-7600        | 2048 / 300 / 1200      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |\n\nIf you want to use the above pretrained vocoders, please exactly match the feature setting with them.\n\n\u003c/div\u003e\u003c/details\u003e\n\n### TTS demo\n\n\u003cdetails\u003e\u003csummary\u003eESPnet2\u003c/summary\u003e\u003cdiv\u003e\n\nYou can try the real-time demo in Google Colab.\nPlease access the notebook from the following button and enjoy the real-time synthesis!\n\n- Real-time TTS demo with ESPnet2  [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)\n\nEnglish, Japanese, and Mandarin models are available in the demo.\n\n\u003c/div\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eESPnet1\u003c/summary\u003e\u003cdiv\u003e\n\n\u003e NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 demo above.\n\nYou can try the real-time demo in Google Colab.\nPlease access the notebook from the following button and enjoy the real-time synthesis.\n\n- Real-time TTS demo with ESPnet1  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)\n\nWe also provide a shell script to perform synthesis.\nGo to a recipe directory and run `utils/synth_wav.sh` as follows:\n\n```sh\n# go to recipe directory and source path of espnet tools\ncd egs/ljspeech/tts1 \u0026\u0026 . 
./path.sh\n# we use upper-case char sequence for the default model.\necho \"THIS IS A DEMONSTRATION OF TEXT TO SPEECH.\" \u003e example.txt\n# let's synthesize speech!\nsynth_wav.sh example.txt\n\n# you can also use multiple sentences\necho \"THIS IS A DEMONSTRATION OF TEXT TO SPEECH.\" \u003e example_multi.txt\necho \"TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH.\" \u003e\u003e example_multi.txt\nsynth_wav.sh example_multi.txt\n```\n\nYou can change the pretrained model as follows:\n\n```sh\nsynth_wav.sh --models ljspeech.fastspeech.v1 example.txt\n```\n\nWaveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN).\nYou can change the pretrained vocoder model as follows:\n\n```sh\nsynth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt\n```\n\nThe WaveNet vocoder provides very high-quality speech, but generation takes time.\n\nSee more details or available models via `--help`.\n\n```sh\nsynth_wav.sh --help\n```\n\n\u003c/div\u003e\u003c/details\u003e\n\n### VC results\n\n\u003cdetails\u003e\u003csummary\u003eexpand\u003c/summary\u003e\u003cdiv\u003e\n\n- Transformer and Tacotron2 based VC\n\nYou can listen to some samples on the [demo webpage](https://unilight.github.io/Publication-Demos/publications/transformer-vc/).\n\n- Cascade ASR+TTS as one of the baseline systems of VCC2020\n\nThe [Voice Conversion Challenge 2020](http://www.vc-challenge.org/) (VCC2020) adopts ESPnet to build an end-to-end baseline system.\nIn VCC2020, the objective is intra-/cross-lingual nonparallel VC.\nYou can download converted samples of the cascade ASR+TTS baseline system [here](https://drive.google.com/drive/folders/1oeZo83GrOgtqxGwF7KagzIrfjr8X59Ue?usp=sharing).\n\n\u003c/div\u003e\u003c/details\u003e\n\n### CTC Segmentation demo\n\n\u003cdetails\u003e\u003csummary\u003eESPnet1\u003c/summary\u003e\u003cdiv\u003e\n\n[CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within 
audio files.\nAligned utterance segments constitute the labels of speech datasets.\n\nAs a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`, using the example script `utils/asr_align_wav.sh`.\nFor preparation, set up a data directory:\n\n```sh\ncd egs/tedlium2/align1/\n# data directory\nalign_dir=data/demo\nmkdir -p ${align_dir}\n# wav file\nbase=ctc_align_test\nwav=../../../test_utils/${base}.wav\n# recipe files\necho \"batchsize: 0\" \u003e ${align_dir}/align.yaml\n\ncat \u003c\u003c EOF \u003e ${align_dir}/utt_text\n${base} THE SALE OF THE HOTELS\n${base} IS PART OF HOLIDAY'S STRATEGY\n${base} TO SELL OFF ASSETS\n${base} AND CONCENTRATE\n${base} ON PROPERTY MANAGEMENT\nEOF\n```\n\nHere, `utt_text` is the file containing the list of utterances.\nChoose a pre-trained ASR model that includes a CTC layer to find utterance segments:\n\n```sh\n# pre-trained ASR model\nmodel=wsj.transformer_small.v1\nmkdir -p ./conf \u0026\u0026 cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf\n\n../../../utils/asr_align_wav.sh \\\n    --models ${model} \\\n    --align_dir ${align_dir} \\\n    --align_config ${align_dir}/align.yaml \\\n    ${wav} ${align_dir}/utt_text\n```\n\nSegments are written to `aligned_segments` as a list of file/utterance name, utterance start and end times in seconds, and a confidence score.\nThe confidence score is a probability in log space that indicates how well the utterance was aligned. 
If needed, remove bad utterances:\n\n```sh\nmin_confidence_score=-5\nawk -v ms=${min_confidence_score} '{ if ($5 \u003e ms) {print} }' ${align_dir}/aligned_segments\n```\n\nThe demo script `utils/asr_align_wav.sh` uses a pretrained ASR model (see list above for more models).\nFor aligning large audio files, it is recommended to use models with RNN-based encoders (such as BLSTMP)\nrather than Transformer models, which have a high memory consumption on longer audio data.\nThe sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.\nA full example recipe is in `egs/tedlium2/align1/`.\n\n\u003c/div\u003e\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003eESPnet2\u003c/summary\u003e\u003cdiv\u003e\n\n[CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within audio files.\nAligned utterance segments constitute the labels of speech datasets.\n\nAs a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`.\nThis can be done either directly from the Python command line or using the script `espnet2/bin/asr_align.py`.\n\nFrom the Python command line interface:\n\n```python\n# load a model with character tokens\nfrom espnet_model_zoo.downloader import ModelDownloader\nd = ModelDownloader(cachedir=\"./modelcache\")\nwsjmodel = d.download_and_unpack(\"kamo-naoyuki/wsj\")\n# load the example file included in the ESPnet repository\nimport soundfile\nspeech, rate = soundfile.read(\"./test_utils/ctc_align_test.wav\")\n# CTC segmentation\nfrom espnet2.bin.asr_align import CTCSegmentation\naligner = CTCSegmentation(**wsjmodel, fs=rate)\ntext = \"\"\"\nutt1 THE SALE OF THE HOTELS\nutt2 IS PART OF HOLIDAY'S STRATEGY\nutt3 TO SELL OFF ASSETS\nutt4 AND CONCENTRATE ON PROPERTY MANAGEMENT\n\"\"\"\nsegments = aligner(speech, text)\nprint(segments)\n# utt1 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS\n# utt2 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S 
STRATEGY\n# utt3 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS\n# utt4 utt 4.20 6.10 -0.4899 AND CONCENTRATE ON PROPERTY MANAGEMENT\n```\n\nAligning also works with fragments of the text.\nFor this, set the `gratis_blank` option that allows skipping unrelated audio sections without penalty.\nIt's also possible to omit the utterance names at the beginning of each line, by setting `kaldi_style_text` to False.\n\n```python\naligner.set_config( gratis_blank=True, kaldi_style_text=False )\ntext = [\"SALE OF THE HOTELS\", \"PROPERTY MANAGEMENT\"]\nsegments = aligner(speech, text)\nprint(segments)\n# utt_0000 utt 0.37 1.72 -2.0651 SALE OF THE HOTELS\n# utt_0001 utt 4.70 6.10 -5.0566 PROPERTY MANAGEMENT\n```\n\nThe script `espnet2/bin/asr_align.py` uses a similar interface. To align utterances:\n\n```sh\n# ASR model and config files from pretrained model (e.g. from cachedir):\nasr_config=\u003cpath-to-model\u003e/config.yaml\nasr_model=\u003cpath-to-model\u003e/valid.*best.pth\n# prepare the text file\nwav=\"test_utils/ctc_align_test.wav\"\ntext=\"test_utils/ctc_align_text.txt\"\ncat \u003c\u003c EOF \u003e ${text}\nutt1 THE SALE OF THE HOTELS\nutt2 IS PART OF HOLIDAY'S STRATEGY\nutt3 TO SELL OFF ASSETS\nutt4 AND CONCENTRATE\nutt5 ON PROPERTY MANAGEMENT\nEOF\n# obtain alignments:\npython espnet2/bin/asr_align.py --asr_train_config ${asr_config} --asr_model_file ${asr_model} --audio ${wav} --text ${text}\n# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS\n# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY\n# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS\n# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE\n# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT\n```\n\nThe output of the script can be redirected to a `segments` file by adding the argument `--output segments`.\nEach line contains file/utterance name, utterance start and end times in seconds and a confidence score; optionally also the utterance text.\nThe 
confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:\n\n```sh\nmin_confidence_score=-7\n# here, we assume that the output was written to the file `segments`\nawk -v ms=${min_confidence_score} '{ if ($5 \u003e ms) {print} }' segments\n```\n\nSee the module documentation for more information.\nFor aligning large audio files, it is recommended to use models with RNN-based encoders (such as BLSTMP)\nrather than Transformer models, which have a high memory consumption on longer audio data.\nThe sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.\n\n\u003c/div\u003e\u003c/details\u003e\n\n## References\n\n[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, \"ESPnet: End-to-End Speech Processing Toolkit,\" *Proc. Interspeech'18*, pp. 2207-2211 (2018)\n\n[2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, \"Joint CTC-attention based end-to-end speech recognition using multi-task learning,\" *Proc. ICASSP'17*, pp. 4835-4839 (2017)\n\n[3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, \"Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,\" *IEEE Journal of Selected Topics in Signal Processing*, vol. 11, no. 8, pp. 1240-1253, Dec. 
2017\n\n## Citations\n\n```\n@inproceedings{watanabe2018espnet,\n  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},\n  title={{ESPnet}: End-to-End Speech Processing Toolkit},\n  year={2018},\n  booktitle={Proceedings of Interspeech},\n  pages={2207--2211},\n  doi={10.21437/Interspeech.2018-1456},\n  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}\n}\n@inproceedings{hayashi2020espnet,\n  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},\n  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},\n  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  pages={7654--7658},\n  year={2020},\n  organization={IEEE}\n}\n@inproceedings{inaguma-etal-2020-espnet,\n    title = \"{ESP}net-{ST}: All-in-One Speech Translation Toolkit\",\n    author = \"Inaguma, Hirofumi  and\n      Kiyono, Shun  and\n      Duh, Kevin  and\n      Karita, Shigeki  and\n      Yalta, Nelson  and\n      Hayashi, Tomoki  and\n      Watanabe, Shinji\",\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations\",\n    month = jul,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.acl-demos.34\",\n    pages = \"302--311\",\n}\n@inproceedings{li2020espnet,\n  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},\n  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira 
and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},\n  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},\n  pages={785--792},\n  year={2021},\n  organization={IEEE},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdialpad%2Fmucs_2021_dialpad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdialpad%2Fmucs_2021_dialpad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdialpad%2Fmucs_2021_dialpad/lists"}