* [Speech Recognition Papers](#speech-recognition-papers)
* [Streaming ASR](#streaming-asr)
* [RNA based](#rna-based)
* [RNN-T based](#rnn-t-based)
* [Attention based](#attention-based)
* [Unified Streaming/Non-streaming models](#unified-streamingnon-streaming-models)
* [Non-autoregressive (NAR) ASR](#non-autoregressive-nar-asr)
* [ASR Rescoring / Spelling Correction (2-pass decoding)](#asr-rescoring--spelling-correction-2-pass-decoding)
* [On-device ASR](#on-device-asr)
* [Noisy Student Training (Self Training)](#noisy-student-training-self-training)
* [Self Supervised Learning (SSL)](#self-supervised-learning-ssl)
* [APC (Autoregressive Predictive Coding)](#apc-autoregressive-predictive-coding)
* [CPC (Contrastive Predictive Coding)](#cpc-contrastive-predictive-coding)

# Speech Recognition Papers
List of hot directions in industrial speech recognition, i.e., [**Streaming ASR**](https://github.com/xingchensong/speech-recognition-papers/blob/main/README.md#streaming-asr) ([RNA-based](https://github.com/xingchensong/speech-recognition-papers#rna-based) || [RNN-T based](https://github.com/xingchensong/speech-recognition-papers#rnn-t-based) || [Attention based](https://github.com/xingchensong/speech-recognition-papers#attention-based) || [unified streaming/non-streaming](https://github.com/xingchensong/speech-recognition-papers#unified-streamingnon-streaming-models)) / [**Non-autoregressive ASR**](https://github.com/xingchensong/speech-recognition-papers#non-autoregressive-nar-asr) ...

If you are interested in this repo, any [pull request](https://github.com/xingchensong/speech-recognition-papers/pulls) is welcome.

## Streaming ASR

### RNA based
- Standard RNA: [Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping](https://pdfs.semanticscholar.org/7703/a2c5468ecbee5b62c048339a03358ed5fe19.pdf) (Interspeech 2017)
- Extended RNA: [Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin](https://arxiv.org/pdf/1806.06342.pdf) (Interspeech 2018)
- Transformer equipped RNA: [Self-attention Aligner: A Latency-control End-to-end Model for ASR Using Self-attention Network and Chunk-hopping](https://arxiv.org/pdf/1902.06450.pdf) (ICASSP 2019)
- CIF: [CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition](https://arxiv.org/pdf/1905.11235.pdf) (ICASSP 2020)
- CIF: [A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition](https://arxiv.org/pdf/2005.10113.pdf) (Interspeech 2020)
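
To make the "integrate-and-fire" idea above concrete, here is a minimal NumPy sketch of the CIF firing rule: per-frame weights are accumulated until they cross a threshold, at which point a label-level embedding is emitted and the leftover weight seeds the next label. Variable names and the threshold are illustrative, not taken from any paper's code.

```python
# Minimal sketch of the CIF (Continuous Integrate-and-Fire) firing rule,
# assuming per-frame weights `alpha` in (0, 1) (e.g. from a sigmoid head)
# and frame encodings `h`. Illustrative only, not the authors' code.
import numpy as np

def cif_fire(h: np.ndarray, alpha: np.ndarray, threshold: float = 1.0):
    """h: (T, D) frame encodings; alpha: (T,) non-negative weights."""
    fired, acc_w, acc_v = [], 0.0, np.zeros(h.shape[1])
    for t in range(len(alpha)):
        if acc_w + alpha[t] < threshold:       # keep integrating
            acc_w += alpha[t]
            acc_v += alpha[t] * h[t]
        else:                                  # fire a label boundary
            used = threshold - acc_w           # weight needed to reach threshold
            fired.append(acc_v + used * h[t])
            acc_w = alpha[t] - used            # remainder seeds the next label
            acc_v = acc_w * h[t]
    return np.stack(fired) if fired else np.empty((0, h.shape[1]))

tokens = cif_fire(np.random.randn(50, 8), np.random.rand(50) * 0.3)
print(tokens.shape)  # (number_of_fired_labels, 8)
```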

### RNN-T based
- Standard RNN-T: [Streaming E2E Speech Recognition For Mobile Devices](https://arxiv.org/pdf/1811.06621.pdf) (ICASSP 2019)
- Latency Controlled RNN-T: [RNN-T For Latency Controlled ASR With Improved Beam Search](https://arxiv.org/pdf/1911.01629.pdf) (arXiv 2019)
- Transformer equipped RNN-T: [Self-Attention Transducers for End-to-End Speech Recognition](https://arxiv.org/pdf/1909.13037.pdf) (Interspeech 2019)
- Transformer equipped RNN-T: [Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss](https://arxiv.org/pdf/2002.02562.pdf) (ICASSP 2020)
- Transformer equipped RNN-T: [A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency](https://arxiv.org/pdf/2003.12710.pdf) (ICASSP 2020)
- Tricks for RNN-T Training: [Towards Fast And Accurate Streaming E2E ASR](https://arxiv.org/pdf/2004.11544.pdf) (ICASSP 2020)
- Knowledge Distillation for RNN-T: Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-end Speech Recognition (Interspeech 2020)
- Transfer Learning for RNN-T: [Transfer Learning Approaches for Streaming End-to-End Speech Recognition System](https://arxiv.org/pdf/2008.05086.pdf) (Interspeech 2020)
- Exploration on RNN-T: [Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer](https://arxiv.org/pdf/2006.01416.pdf) (Interspeech 2020)
- Sequence-level Emission Regularization for RNN-T: [FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization](https://arxiv.org/pdf/2010.11148.pdf) (arXiv 2020, submitted to ICASSP 2021)
- Model Distillation for RNN-T: [Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data](https://arxiv.org/pdf/2010.12096.pdf) (arXiv 2020, submitted to ICASSP 2021)
- LM Fusion for RNN-T: [Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer](https://arxiv.org/pdf/2010.13878.pdf) (arXiv 2020, submitted to ICASSP 2021)
- Normalized jointer network: [Improving RNN transducer with normalized jointer network](https://arxiv.org/pdf/2011.01576.pdf) (arXiv 2020)
- Benchmark on RNN-T CTC LF-MMI: [Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR](https://arxiv.org/pdf/2011.04785.pdf) (SLT 2021)
- Alignment Restricted RNN-T: [Alignment Restricted Streaming Recurrent Neural Network Transducer](https://arxiv.org/pdf/2011.03072.pdf) (SLT 2021)
- Conformer equipped RNN-T (with Cascaded Encoder and 2nd-pass beam search): [A Better and Faster End-to-End Model for Streaming ASR](https://arxiv.org/pdf/2011.10798.pdf) (arXiv 2020, submitted to ICASSP 2021)
- Multi-Speaker RNN-T: [Streaming end-to-end multi-talker speech recognition](https://arxiv.org/pdf/2011.13148.pdf) (arXiv 2020)
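
For orientation, here is a toy PyTorch skeleton of the transducer architecture shared by the papers above: an acoustic encoder, a label-history prediction network, and a joint network that scores every (t, u) pair. All dimensions are assumed for illustration; real systems use deeper LSTM/Transformer encoders and train the output lattice with a transducer loss (e.g. `torchaudio.functional.rnnt_loss`).

```python
# A minimal RNN-T skeleton (encoder + prediction network + joint network),
# sketched in PyTorch under assumed dimensions. Not any paper's exact model.
import torch
import torch.nn as nn

class TinyTransducer(nn.Module):
    def __init__(self, feat_dim=80, vocab=100, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # acoustic
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)   # label history
        self.embed = nn.Embedding(vocab, hidden)
        self.joiner = nn.Linear(hidden * 2, vocab + 1)                # +1 for blank

    def forward(self, feats, labels):
        f, _ = self.encoder(feats)                 # (B, T, H)
        g, _ = self.predictor(self.embed(labels))  # (B, U, H)
        # joint over every (t, u) pair -> (B, T, U, vocab+1) lattice
        joint = torch.cat(
            [f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
             g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1)
        return self.joiner(torch.tanh(joint))

logits = TinyTransducer()(torch.randn(2, 30, 80), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 30, 5, 101])
```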

### Attention based
- Monotonic Attention: [Monotonic Chunkwise Attention](https://arxiv.org/pdf/1712.05382.pdf) (ICLR 2018)
- Enhanced Monotonic Attention: [Enhancing Monotonic Multihead Attention for Streaming ASR](https://arxiv.org/pdf/2005.09394.pdf) (Interspeech 2020)
- Minimum Latency Training based on Monotonic Attention: [Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR](https://arxiv.org/pdf/2004.05009.pdf) (ICASSP 2020)
- Triggered Attention: [Triggered Attention for End-to-End Speech Recognition](https://www.merl.com/publications/docs/TR2019-015.pdf) (ICASSP 2019)
- Triggered Attention for Transformer: [Streaming Automatic Speech Recognition With The Transformer Model](https://arxiv.org/pdf/2001.02674.pdf) (ICASSP 2020)
- Block-synchronous: [Streaming Transformer ASR with Blockwise Synchronous Inference](https://arxiv.org/pdf/2006.14941.pdf) (arXiv 2020)
- Block-synchronous with chunk reuse: [Transformer Online CTC/Attention E2E Speech Recognition Architecture](https://arxiv.org/pdf/2001.08290.pdf) (ICASSP 2020)
- Block-synchronous with RNN-T like decoding rule: [Synchronous Transformers For E2E Speech Recognition](https://arxiv.org/pdf/1912.02958.pdf) (ICASSP 2020)
- Scout-synchronous: [Low Latency End-to-End Streaming Speech Recognition with a Scout Network](https://arxiv.org/pdf/2003.10369.pdf) (Interspeech 2020)
- CTC-synchronous: [CTC-synchronous Training for Monotonic Attention Model](https://arxiv.org/pdf/2005.04712.pdf) (Interspeech 2020)
- Memory Augmented Attention: [Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory](https://arxiv.org/pdf/2005.08042.pdf) (Interspeech 2020)
- Memory Augmented Attention: [Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition](https://arxiv.org/pdf/2006.01712.pdf) (Interspeech 2020)
- Optimized Beam Search: [High Performance Sequence-to-Sequence Model for Streaming Speech Recognition](https://arxiv.org/pdf/2003.10022.pdf) (Interspeech 2020)
- Memory Augmented Attention: [Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition](https://arxiv.org/pdf/2010.10759.pdf) (arXiv 2020, submitted to ICASSP 2021)
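
Many of the streaming Transformer papers above restrict self-attention to the current chunk plus limited left context. Below is a small sketch of such a chunk-wise attention mask; chunk size and context length are assumed hyperparameters, not values from any specific paper.

```python
# Illustrative chunk-wise attention mask of the kind used by block-synchronous
# streaming Transformers: each frame may attend to its own chunk plus a fixed
# number of left-context chunks.
import numpy as np

def chunk_attention_mask(num_frames: int, chunk: int, left_chunks: int) -> np.ndarray:
    """True = attention allowed. Shape (num_frames, num_frames)."""
    q = np.arange(num_frames) // chunk             # chunk index of each query frame
    k = np.arange(num_frames) // chunk             # chunk index of each key frame
    # allow keys whose chunk lies within [query_chunk - left_chunks, query_chunk]
    return (k[None, :] <= q[:, None]) & (k[None, :] >= q[:, None] - left_chunks)

print(chunk_attention_mask(8, chunk=2, left_chunks=1).astype(int))
```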

### Unified Streaming/Non-streaming models
- [Transformer Transducer: One Model Unifying Streaming And Non-Streaming Speech Recognition](https://arxiv.org/pdf/2010.03192.pdf) (arXiv 2020)
- [Universal ASR: Unify And Improve Streaming ASR With Full-Context Modeling](https://openreview.net/pdf?id=Pz_dcqfcKW8) (ICLR 2021 submission, under double-blind review)
- [Cascaded encoders for unifying streaming and non-streaming ASR](https://arxiv.org/pdf/2010.14606.pdf) (arXiv 2020)
- Asynchronous Revision for non-streaming ASR: [Dynamic latency speech recognition with asynchronous revision](https://arxiv.org/pdf/2011.01570.pdf) (arXiv 2020, submitted to ICASSP 2021)
- 2-pass unifying (1st Streaming CTC, 2nd Attention Rescore): [Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition](https://arxiv.org/pdf/2012.05481.pdf) (arXiv 2020)
- 2-pass unifying (1st Streaming CTC, 2nd Attention Rescore): [One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition](https://arxiv.org/pdf/2010.14791.pdf) (arXiv 2020)
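
The 2-pass recipe in the last two entries can be summarized in a few lines: a streaming CTC pass proposes an n-best list, an attention decoder rescores each hypothesis, and the two log-probabilities are interpolated. `ctc_nbest` and `attention_score` below are hypothetical stand-ins for real model calls.

```python
# Schematic of "streaming CTC first, attention rescore second" 2-pass decoding.
# The interpolation weight is an assumed hyperparameter.
def two_pass_decode(ctc_nbest, attention_score, weight: float = 0.5):
    """ctc_nbest: list of (hypothesis, ctc_log_prob) from the streaming pass."""
    rescored = [
        (hyp, weight * ctc_lp + (1.0 - weight) * attention_score(hyp))
        for hyp, ctc_lp in ctc_nbest
    ]
    return max(rescored, key=lambda pair: pair[1])[0]

# toy usage with fake scores
nbest = [("hello world", -4.2), ("hollow world", -4.0)]
best = two_pass_decode(nbest, attention_score=lambda h: -1.0 if "hello" in h else -5.0)
print(best)  # "hello world" wins after rescoring
```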

## Non-autoregressive (NAR) ASR
- MASK-Predict: [Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition](https://arxiv.org/pdf/1911.04908.pdf) (arXiv 2019)
- Imputer: [Imputer: Sequence modelling via imputation and dynamic programming](https://arxiv.org/pdf/2002.08926.pdf) (arXiv 2020)
- Insertion-based: [Insertion-Based Modeling for End-to-End Automatic Speech Recognition](https://arxiv.org/pdf/2005.13211.pdf) (arXiv 2020)
- MASK-CTC: [Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict](https://arxiv.org/pdf/2005.08700.pdf) (Interspeech 2020)
- Spike Triggered: [Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition](https://arxiv.org/pdf/2005.07903.pdf) (Interspeech 2020)
- Similar to MASK-Predict: [Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition](https://arxiv.org/pdf/2005.04862.pdf) (Interspeech 2020)
- Improved MASK-CTC: [Improved Mask-CTC for Non-Autoregressive End-to-End ASR](https://arxiv.org/pdf/2010.13270.pdf) (arXiv 2020, submitted to ICASSP 2021)
- Refine CTC Alignments over Latent Space: [Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment](https://arxiv.org/pdf/2010.14233.pdf) (arXiv 2020)
- Also Refine CTC Alignments over Latent Space: [CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition](https://arxiv.org/pdf/2010.14725.pdf) (arXiv 2020, submitted to ICASSP 2021)
- Refine CTC Alignments over Output Space: [Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input](https://arxiv.org/pdf/2010.15025.pdf) (arXiv 2020, submitted to ICASSP 2021)
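
A sketch of the Mask-CTC-style iterative refinement shared by several entries above: start from the CTC output, mask low-confidence tokens, and let a conditional masked LM fill them in, easiest first, over a few passes. `mlm_predict` is a hypothetical model call returning a (token, confidence) pair per position.

```python
# Easy-first iterative decoding in the spirit of Mask-CTC. Thresholds and the
# fill schedule are illustrative, not the papers' exact settings.
MASK = "<mask>"

def mask_ctc_decode(ctc_tokens, ctc_confidences, mlm_predict,
                    threshold=0.9, iters=3):
    """mlm_predict(tokens) -> list of (token, confidence), one per position."""
    tokens = [t if c >= threshold else MASK
              for t, c in zip(ctc_tokens, ctc_confidences)]
    for step in range(iters):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = mlm_predict(tokens)
        # commit the most confident slice this pass; the last pass fills the rest
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: max(1, len(masked) // (iters - step))]:
            tokens[i] = preds[i][0]
    return tokens
```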

## ASR Rescoring / Spelling Correction (2-pass decoding)
- Review: [Automatic Speech Recognition Errors Detection and Correction: A Review](https://www.sciencedirect.com/science/article/pii/S1877050918302187) (Procedia Computer Science 2018)
- LAS based: [A Spelling Correction Model For E2E Speech Recognition](https://arxiv.org/pdf/1902.07178.pdf) (ICASSP 2019)
- Transformer based: [An Empirical Study Of Efficient ASR Rescoring With Transformers](https://arxiv.org/pdf/1910.11450.pdf) (arXiv 2019)
- Transformer based: [Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition](https://arxiv.org/pdf/1904.10045.pdf) (Interspeech 2019)
- Transformer based: [Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model](https://arxiv.org/pdf/1910.10697.pdf) (ICASSP 2020)
- BERT based: [Effective Sentence Scoring Method Using BERT for Speech Recognition](http://proceedings.mlr.press/v101/shin19a/shin19a.pdf) (ACML 2019)
- BERT based: [Spelling Error Correction with Soft-Masked BERT](https://arxiv.org/pdf/2005.07421.pdf) (ACL 2020)
- Parallel Rescoring: [Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition](https://arxiv.org/pdf/2008.13093.pdf) (Interspeech 2020)
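
As a concrete example of BERT-based rescoring (the ACML 2019 entry), a hypothesis can be scored by its pseudo log-likelihood: mask each position in turn and sum the log-probability the masked LM assigns to the true token. `masked_lm_logprob` below is a hypothetical wrapper around a real masked LM.

```python
# Pseudo log-likelihood scoring for n-best rescoring with a masked LM.
# Sketch only; batching the masked copies is essential for real latency.
def pseudo_log_likelihood(tokens, masked_lm_logprob):
    """Higher is better; use it to re-rank an ASR n-best list."""
    score = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        score += masked_lm_logprob(ctx, position=i, target=tok)
    return score
```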

## On-device ASR
- Review: [A review of on-device fully neural end-to-end automatic speech recognition algorithms](https://arxiv.org/pdf/2012.07974.pdf) (arXiv 2020)
- Lightweight Low-Rank transformer: [Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer](https://arxiv.org/pdf/1910.13923.pdf) (ICASSP 2020)
- Attention replacement: [How Much Self-Attention Do We Need? Trading Attention for Feed-Forward Layers](https://www-i6.informatik.rwth-aachen.de/publications/download/1126/IrieKazukiGerstenbergerAlexerSchl%FCterRalfNeyHermann--HowMuchSelf-AttentionDoWeNeed%3FTradingAttentionforFeed-ForwardLayers--2020.pdf) (ICASSP 2020)
- Lightweight transducer with WFST based decoding: [Tiny Transducer: A Highly-efficient Speech Recognition Model on Edge Devices](https://arxiv.org/pdf/2101.06856.pdf) (ICASSP 2021)
- Cascade transducer: [Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter](https://arxiv.org/pdf/2011.08469.pdf) (SLT 2021)
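
A minimal sketch of the low-rank factorization behind "Low-Rank Transformer"-style on-device compression: one wide projection is replaced by two thin ones through a rank-r bottleneck, shrinking parameters from d_in·d_out to roughly r·(d_in + d_out). Dimensions here are illustrative, not the paper's configuration.

```python
# Low-rank replacement for a dense projection layer (PyTorch sketch).
import torch.nn as nn

def low_rank_linear(d_in: int, d_out: int, rank: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_in, rank, bias=False),  # down-project
                         nn.Linear(rank, d_out))             # up-project

full = nn.Linear(512, 512)                 # 512*512 + 512  = 262,656 params
slim = low_rank_linear(512, 512, rank=64)  # 512*64 + 64*512 + 512 = 66,048 params
print(sum(p.numel() for p in slim.parameters()))
```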

## Noisy Student Training (Self Training)
- Self training with filtering and ensembles: [Self-training for end-to-end speech recognition](https://arxiv.org/pdf/1909.09116.pdf) (ICASSP 2020)
- Improved Noisy Student Training by gradational filtering: [Improved Noisy Student Training for Automatic Speech Recognition](https://arxiv.org/pdf/2005.09629.pdf) (Interspeech 2020)
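
Schematically, one round of the noisy student recipe described above: the teacher pseudo-labels unlabeled audio, low-confidence utterances are filtered out, and a student trains on augmented data. All function names here are hypothetical placeholders, not any paper's API.

```python
# One self-training round; the confidence threshold is an assumed value.
def noisy_student_round(teacher, student, labeled, unlabeled,
                        transcribe, train, augment, min_confidence=0.8):
    pseudo = []
    for utt in unlabeled:
        text, conf = transcribe(teacher, utt)     # decode + confidence score
        if conf >= min_confidence:                # filtering step
            pseudo.append((utt, text))
    # the student sees noise (e.g. SpecAugment) that the teacher did not
    train(student, [(augment(u), t) for u, t in labeled + pseudo])
    return student                                # becomes next round's teacher
```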

## Self Supervised Learning (SSL)

### APC (Autoregressive Predictive Coding)
- [An Unsupervised Autoregressive Model for Speech Representation Learning](https://arxiv.org/pdf/1904.03240.pdf) (Interspeech 2019)
- [Generative Pre-Training for Speech with Autoregressive Predictive Coding](https://arxiv.org/pdf/1910.12607.pdf) (ICASSP 2020)
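
The APC objective in both papers reduces to: read frames x_1..x_t autoregressively and predict the frame n steps ahead under an L1 loss. A toy PyTorch version with an assumed shift of n = 3 (the real work uses an RNN/Transformer stack, not a single linear layer):

```python
# APC training signal: predict the log-mel frame n steps in the future.
import torch
import torch.nn as nn

def apc_loss(model: nn.Module, feats: torch.Tensor, n: int = 3) -> torch.Tensor:
    """feats: (B, T, D) log-mel frames; model maps (B, T, D) -> (B, T, D)."""
    pred = model(feats[:, :-n])    # predictions computed from the past
    target = feats[:, n:]          # frames n steps in the future
    return nn.functional.l1_loss(pred, target)

toy_model = nn.Sequential(nn.Linear(80, 80))   # stand-in for the real encoder
print(apc_loss(toy_model, torch.randn(4, 100, 80)).item())
```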

### CPC (Contrastive Predictive Coding)

- [wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/pdf/1904.05862.pdf) (Schneider et al., 2019)
- [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations](https://arxiv.org/pdf/1910.05453.pdf) (Baevski et al., 2019)
- [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/pdf/2006.11477.pdf) (Baevski et al., 2020)
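
At the core of the CPC/wav2vec line is an InfoNCE objective: a context vector must identify the true future latent among sampled negatives via dot-product scores. A bare-bones version with illustrative shapes (real systems add step-specific projection heads and quantization, as in vq-wav2vec / wav2vec 2.0):

```python
# Bare-bones InfoNCE contrastive loss in the spirit of CPC / wav2vec.
import torch
import torch.nn.functional as F

def info_nce(context: torch.Tensor, positives: torch.Tensor,
             negatives: torch.Tensor) -> torch.Tensor:
    """context, positives: (B, D); negatives: (B, N, D)."""
    pos = (context * positives).sum(-1, keepdim=True)          # (B, 1)
    neg = torch.einsum("bd,bnd->bn", context, negatives)       # (B, N)
    logits = torch.cat([pos, neg], dim=1)                      # true class = 0
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 10, 256))
print(loss.item())
```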