Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
hirofumi0810/neural_sp: End-to-end ASR/LM implementation with PyTorch
https://github.com/hirofumi0810/neural_sp
- Host: GitHub
- URL: https://github.com/hirofumi0810/neural_sp
- Owner: hirofumi0810
- License: apache-2.0
- Created: 2017-09-10T06:33:28.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-08-30T12:06:20.000Z (about 3 years ago)
- Last Synced: 2024-06-21T19:56:55.251Z (5 months ago)
- Topics: asr, attention, attention-mechanism, automatic-speech-recognition, ctc, language-model, language-modeling, pytorch, rnn-transducer, seq2seq, sequence-to-sequence, speech, speech-recognition, streaming, transformer, transformer-xl
- Language: Python
- Size: 8.66 MB
- Stars: 586
- Watchers: 33
- Forks: 140
- Open Issues: 45
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - hirofumi0810/neural_sp
README
[![Build Status](https://travis-ci.org/hirofumi0810/neural_sp.svg?branch=master)](https://travis-ci.org/hirofumi0810/neural_sp)
[![codecov](https://codecov.io/gh/hirofumi0810/neural_sp/branch/master/graph/badge.svg?token=wy0VD7e3bH)](https://codecov.io/gh/hirofumi0810/neural_sp)

# NeuralSP: Neural network based Speech Processing
## How to install
```bash
cd tools
# KALDI points to an existing Kaldi installation;
# TOOL is the directory where the remaining tools will be installed.
make KALDI=/path/to/kaldi TOOL=/path/to/save/tools
```

## Key features
### Corpus
- ASR
- AISHELL-1
- AISHELL-2
- AMI
- CSJ
- LaboroTVSpeech
- Librispeech
- Switchboard (+Fisher)
- TEDLIUM2/TEDLIUM3
- TIMIT
- WSJ
- LM
- Penn Tree Bank
- WikiText2

### Front-end
- Frame stacking
- Sequence summary network [[link](https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1438.html)]
- SpecAugment [[link](https://arxiv.org/abs/1904.08779)]
- Adaptive SpecAugment [[link](https://arxiv.org/abs/1912.05533)]
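
SpecAugment regularizes training by zeroing out random frequency bands and time spans of the input log-mel features. A minimal sketch of that masking policy, assuming a `(n_frames, n_mels)` feature tensor (the function name and mask sizes are illustrative, not neural_sp's exact settings):

```python
import torch

def spec_augment(x, n_freq_masks=2, max_freq_width=27,
                 n_time_masks=2, max_time_width=100):
    """Mask a log-mel feature tensor x of shape (n_frames, n_mels) in place."""
    n_frames, n_mels = x.shape
    for _ in range(n_freq_masks):  # mask random frequency bands
        w = int(torch.randint(0, max_freq_width + 1, (1,)))
        f0 = int(torch.randint(0, max(1, n_mels - w), (1,)))
        x[:, f0:f0 + w] = 0.0
    for _ in range(n_time_masks):  # mask random time spans
        w = int(torch.randint(0, max_time_width + 1, (1,)))
        t0 = int(torch.randint(0, max(1, n_frames - w), (1,)))
        x[t0:t0 + w, :] = 0.0
    return x
```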
### Encoder

- RNN encoder
- (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BLGRU, (CNN-)LGRU
- Latency-controlled BRNN [[link](https://arxiv.org/abs/1510.08983)]
- Random state passing (RSP) [[link](https://arxiv.org/abs/1910.11455)]
- Transformer encoder [[link](https://arxiv.org/abs/1706.03762)]
- Chunk hopping mechanism [[link](https://arxiv.org/abs/1902.06450)]
- Relative positional encoding [[link](https://arxiv.org/abs/1901.02860)]
- Causal mask
- Conformer encoder [[link](https://arxiv.org/abs/2005.08100)]
- Time-depth separable (TDS) convolution encoder [[link](https://arxiv.org/abs/1904.02619)] [[link](https://arxiv.org/abs/2001.09727)]
- Gated CNN encoder (GLU) [[link](https://openreview.net/forum?id=Hyig0zb0Z)]
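
The causal mask and the chunk hopping mechanism both come down to constraining the encoder's self-attention pattern. A hedged sketch of how such boolean masks can be built (helper names and the chunking scheme are illustrative, not neural_sp's exact code):

```python
import torch

def causal_mask(n_frames: int) -> torch.Tensor:
    """True where a frame may attend: itself and all past frames."""
    return torch.tril(torch.ones(n_frames, n_frames, dtype=torch.bool))

def chunk_mask(n_frames: int, chunk_size: int) -> torch.Tensor:
    """Restrict attention to fixed-size chunks, as in chunk hopping."""
    chunk_id = torch.arange(n_frames) // chunk_size
    return chunk_id.unsqueeze(0) == chunk_id.unsqueeze(1)
```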
### Connectionist Temporal Classification (CTC) decoder

- Beam search
- Shallow fusion
- Forced alignment
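
For reference, besides beam search the CTC output admits a simple greedy (best path) decode: take the argmax per frame, collapse repeated labels, then drop blanks. A minimal sketch (blank id 0 is an assumption):

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (n_frames, vocab_size) CTC posteriors for one utterance."""
    best_path = log_probs.argmax(dim=-1).tolist()
    hyp, prev = [], blank
    for token in best_path:
        if token != blank and token != prev:  # collapse repeats, skip blanks
            hyp.append(token)
        prev = token
    return hyp
```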
### RNN-Transducer (RNN-T) decoder [[link](https://arxiv.org/abs/1211.3711)]

- Beam search
- Shallow fusion
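
Shallow fusion simply adds a weighted external-LM log-probability to the decoder's score at every beam-search expansion. A sketch of the score combination (the function name and weight value are illustrative):

```python
import torch

def shallow_fusion_step(asr_logp: torch.Tensor,
                        lm_logp: torch.Tensor,
                        lm_weight: float = 0.3) -> torch.Tensor:
    """Combine per-token log-probs from the ASR decoder and an external LM."""
    return asr_logp + lm_weight * lm_logp

# Each beam hypothesis is then extended with the top-k combined scores:
# scores, tokens = shallow_fusion_step(asr_logp, lm_logp).topk(beam_width)
```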
### Attention-based decoder

- RNN decoder
- Shallow fusion
- Cold fusion [[link](https://arxiv.org/abs/1708.06426)]
- Deep fusion [[link](https://arxiv.org/abs/1503.03535)]
- Forward-backward attention decoding [[link](https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1160.html)]
- Ensemble decoding
- internal LM estimation [[link](https://arxiv.org/abs/2011.01991)]
- Attention type
- location-based
- content-based
- dot-product
- GMM attention
- Streaming RNN decoder specific
- Hard monotonic attention [[link](https://arxiv.org/abs/1704.00784)]
- Monotonic chunkwise attention (MoChA) [[link](https://arxiv.org/abs/1712.05382)]
- Delay constrained training (DeCoT) [[link](https://arxiv.org/abs/2004.05009)]
- Minimum latency training (MinLT) [[link](https://arxiv.org/abs/2004.05009)]
- CTC-synchronous training (CTC-ST) [[link](https://arxiv.org/abs/2005.04712)]
- Transformer decoder [[link](https://arxiv.org/abs/1706.03762)]
- Streaming Transformer decoder specific
- Monotonic Multihead Attention [[link](https://arxiv.org/abs/1909.12406)] [[link](https://arxiv.org/abs/2005.09394)]
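
The attention types listed above differ only in how alignment scores between the decoder state and the encoder frames are computed; dot-product attention is the simplest. A sketch:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query: torch.Tensor,
                          keys: torch.Tensor,
                          values: torch.Tensor) -> torch.Tensor:
    """query: (d,); keys, values: (n_frames, d). Returns a context vector."""
    scores = keys @ query / keys.size(-1) ** 0.5  # scaled dot-product scores
    weights = F.softmax(scores, dim=0)            # attention distribution
    return weights @ values                       # weighted sum of encoder states
```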
### Language model (LM)

- RNNLM (recurrent neural network language model)
- Gated convolutional LM [[link](https://arxiv.org/abs/1612.08083)]
- Transformer LM
- Transformer-XL LM [[link](https://arxiv.org/abs/1901.02860)]
- Adaptive softmax [[link](https://arxiv.org/abs/1609.04309)]
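
Adaptive softmax cuts the output-layer cost of large-vocabulary LMs by splitting the vocabulary into frequency-sorted clusters; PyTorch ships it as `nn.AdaptiveLogSoftmaxWithLoss`. A usage sketch (dimensions and cutoffs are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 100_000, 512
criterion = nn.AdaptiveLogSoftmaxWithLoss(
    hidden_size, vocab_size, cutoffs=[2_000, 10_000], div_value=4.0)

hidden_states = torch.randn(32, hidden_size)       # LM hidden states
targets = torch.randint(0, vocab_size, (32,))      # next-token ids
output, loss = criterion(hidden_states, targets)   # target log-probs and NLL
```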
### Output units

- Phoneme
- Grapheme
- Wordpiece (BPE, sentencepiece)
- Word
- Word-char mix
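
Wordpiece units come from a BPE or sentencepiece model trained on the transcripts. A minimal sentencepiece example (file names and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a BPE model on a text corpus with one sentence per line.
spm.SentencePieceTrainer.train(
    input='train.txt', model_prefix='bpe', vocab_size=8000, model_type='bpe')

sp = spm.SentencePieceProcessor(model_file='bpe.model')
print(sp.encode('hello world', out_type=str))  # e.g. ['▁hello', '▁world']
```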
### Multi-task learning (MTL)

Multi-task learning (MTL) with different units is supported to alleviate data sparseness.
- Hybrid CTC/attention [[link](https://www.merl.com/publications/docs/TR2017-190.pdf)]
- Hierarchical Attention (e.g., word attention + character attention) [[link](http://sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/INA-SLT18.pdf)]
- Hierarchical CTC (e.g., word CTC + character CTC) [[link](https://arxiv.org/abs/1711.10136)]
- Hierarchical CTC+Attention (e.g., word attention + character CTC) [[link](http://www.sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/UEN-ICASSP18.pdf)]
- Forward-backward attention [[link](https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1160.html)]
- LM objective
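
Hybrid CTC/attention attaches both objectives to a shared encoder and trains on an interpolated loss; the hierarchical variants above follow the same pattern. A sketch (the 0.3 weight is a common choice, assumed here rather than taken from neural_sp):

```python
def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate the CTC loss and the attention cross-entropy loss
    computed by the two decoder branches on a shared encoder."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```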
## ASR Performance

### AISHELL-1 (CER)
| Model | dev | test |
| ----------- | --- | ---- |
| Conformer LAS | 4.1 | 4.5 |
| Transformer | 5.0 | 5.4 |
| Streaming MMA | 5.5 | 6.1 |

### AISHELL-2 (CER)
| Model | test_android | test_ios | test_mic |
| ----------- | ------------ | -------- | -------- |
| Conformer LAS | 6.1 | 5.5 | 5.9 |

### CSJ (WER)
| Model | eval1 | eval2 | eval3 |
| -------------- | ----- | ----- | ----- |
| Conformer LAS | 5.7 | 4.4 | 4.9 |
| BLSTM LAS | 6.5 | 5.1 | 5.6 |
| LC-BLSTM MoChA | 7.4 | 5.6 | 6.4 |

### Switchboard 300h (WER)
| Model | SWB | CH |
| --------- | ---- | ---- |
| BLSTM LAS | 9.1 | 18.8 |

### Switchboard+Fisher 2000h (WER)
| Model | SWB | CH |
| --------- | ---- | ---- |
| BLSTM LAS | 7.8 | 13.8 |

### LaboroTVSpeech (CER)
| Model | dev_4k | dev | tedx-jp-10k |
| -------------- | ----- | ----- | ----- |
| Conformer LAS | 7.8 | 10.1 | 12.4 |

### Librispeech (WER)
| Model | dev-clean | dev-other | test-clean | test-other |
| -------------- | --------- | --------- | ---------- | ---------- |
| Conformer LAS | 1.9 | 4.6 | 2.1 | 4.9 |
| Transformer | 2.1 | 5.3 | 2.4 | 5.7 |
| BLSTM LAS | 2.5 | 7.2 | 2.6 | 7.5 |
| BLSTM RNN-T | 2.9 | 8.5 | 3.2 | 9.0 |
| UniLSTM RNN-T | 3.7 | 11.7 | 4.0 | 11.6 |
| UniLSTM MoChA | 4.1 | 11.0 | 4.2 | 11.2 |
| LC-BLSTM RNN-T | 3.3 | 9.8 | 3.5 | 10.2 |
| LC-BLSTM MoChA | 3.3 | 8.8 | 3.5 | 9.1 |
| Streaming MMA | 2.5 | 6.9 | 2.7 | 7.1 |

### TEDLIUM2 (WER)
| Model | dev | test |
| -------------- | ---- | ---- |
| Conformer LAS | 7.0 | 6.8 |
| BLSTM LAS | 8.1 | 7.5 |
| LC-BLSTM RNN-T | 8.0 | 7.7 |
| LC-BLSTM MoChA | 10.3 | 8.6 |
| UniLSTM RNN-T | 10.7 | 10.7 |
| UniLSTM MoChA | 13.5 | 11.6 |

### WSJ (WER)
| Model | test_dev93 | test_eval92 |
| --------- | ---------- | ----------- |
| BLSTM LAS | 8.8 | 6.2 |

## LM Performance
### Penn Tree Bank (PPL)
| Model | valid | test |
| ----------- | ----- | ----- |
| RNNLM | 87.99 | 86.06 |
| + cache=100 | 79.58 | 79.12 |
| + cache=500 | 77.36 | 76.94 |

### WikiText2 (PPL)
| Model | valid | test |
| ------------ | ------ | ----- |
| RNNLM | 104.53 | 98.73 |
| + cache=100 | 90.86 | 85.87 |
| + cache=2000 | 76.10 | 72.77 |
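
Perplexity (PPL) in these tables is the exponentiated average negative log-likelihood per token; a sketch of the computation:

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """total_nll: summed negative log-likelihood (natural log) over the eval set."""
    return math.exp(total_nll / n_tokens)

print(perplexity(total_nll=450_000.0, n_tokens=100_000))  # toy numbers: ~90.0
```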
## Reference

- https://github.com/kaldi-asr/kaldi
- https://github.com/espnet/espnet
- https://github.com/awni/speech
- https://github.com/HawkAaron/E2E-ASR

## Dependency
- https://github.com/SeanNaren/warp-ctc
- https://github.com/HawkAaron/warp-transducer
- https://github.com/1ytic/warp-rnnt