# awesome-speech-pretraining

Papers, code, and statistics for self-supervised learning and pre-training on speech.

https://github.com/ddlbojack/awesome-speech-pretraining
## Papers

### SSL Model Distillation, Compression and Acceleration

- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - *H Chang et al*, `ICASSP 2022`
- FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - *Y Lee et al*, `INTERSPEECH 2022`
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - *R Wang et al*, `INTERSPEECH 2022`
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - *T Ashihara et al*, `INTERSPEECH 2022`
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - *Y Wang et al*, `arXiv 2022`
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - *G Yang et al*, `ASRU 2023`
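
As a rough illustration of the layer-wise objective used by DistilHuBERT above: small prediction heads on the student regress selected hidden layers of a frozen teacher with an L1 term plus a cosine-similarity term. A hedged sketch (the layer choice and weighting below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_preds, teacher_layers, lam=1.0):
    """student_preds, teacher_layers: matching lists of (batch, time, dim)
    tensors, one pair per distilled teacher layer (e.g. layers 4/8/12)."""
    total = 0.0
    for pred, target in zip(student_preds, teacher_layers):
        l1 = F.l1_loss(pred, target)                      # frame-wise regression
        cos = F.cosine_similarity(pred, target, dim=-1)   # (batch, time)
        total = total + l1 - lam * F.logsigmoid(cos).mean()
    return total
```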

### 2018

- Representation Learning with Contrastive Predictive Coding - *A Oord et al*, `arXiv 2018`
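
CPC above trains a context network to tell the true future latent apart from negatives via the InfoNCE loss. A simplified in-batch sketch (the paper scores pairs with a log-bilinear model per prediction step; cosine scoring with a temperature is an assumption here):

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, temperature=0.1):
    """context: (batch, dim) summaries c_t; future: (batch, dim) encodings
    z_{t+k}. Row i's positive is future[i]; other rows serve as negatives."""
    c = F.normalize(context, dim=-1)
    z = F.normalize(future, dim=-1)
    logits = c @ z.t() / temperature                   # (batch, batch) similarities
    labels = torch.arange(c.size(0), device=c.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```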

### 2019

- An Unsupervised Autoregressive Model for Speech Representation Learning - *YA Chung et al*, `INTERSPEECH 2019`
- wav2vec: Unsupervised Pre-training for Speech Recognition - *S Schneider et al*, `INTERSPEECH 2019`
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - *A Baevski et al*, `arXiv 2019, ICLR 2020`
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - *D Jiang et al*, `arXiv 2019`
- Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - *S Pascual et al*, `INTERSPEECH 2019`

### 2020

- Unsupervised pretraining transfers well across languages - *M Riviere et al*, `ICASSP 2020`
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - *AT Liu et al*, `ICASSP 2020`
- Learning robust and multilingual speech representations - *K Kawakami et al*, `EMNLP 2020`
- Improved speech representations with multi-target autoregressive predictive coding - *YA Chung et al*, `ACL 2020`
- Effectiveness of self-supervised pre-training for ASR - *A Baevski et al*, `ICASSP 2020`
- Deep contextualized acoustic representations for semi-supervised speech recognition - *S Ling et al*, `ICASSP 2020`
- Improved noisy student training for automatic speech recognition - *DS Park et al*, `INTERSPEECH 2020`
- wav2vec 2.0: A framework for self-supervised learning of speech representations - *A Baevski et al*, `NeurIPS 2020`
- Unsupervised cross-lingual representation learning for speech recognition - *A Conneau et al*, `arXiv 2020`
- Self-training and Pre-training are Complementary for Speech Recognition - *Q Xu et al*, `arXiv 2020, ICASSP 2021`
- DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization - *S Ling et al*, `arXiv 2020`
- Pushing the limits of semi-supervised learning for automatic speech recognition - *Y Zhang et al*, `arXiv 2020, NeurIPS Workshop 2020`
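
Several 2020 entries, most prominently wav2vec 2.0, train by identifying the true quantized latent of a masked time step among distractors. A minimal sketch of such a masked contrastive objective, assuming negatives are drawn uniformly from other masked positions in the batch (the paper samples them from the same utterance) and omitting the quantizer and diversity loss:

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, quantized, mask, num_negatives=10, temperature=0.1):
    """context, quantized: (batch, time, dim); mask: (batch, time) bool flags
    for masked steps. Each context vector must pick out its own quantized
    latent among sampled distractors."""
    c, q = context[mask], quantized[mask]               # (n_masked, dim) each
    n = c.size(0)
    neg_idx = torch.randint(0, n, (n, num_negatives), device=c.device)
    negs = q[neg_idx]                                   # (n_masked, K, dim)
    pos = F.cosine_similarity(c, q, dim=-1).unsqueeze(1)         # (n, 1)
    neg = F.cosine_similarity(c.unsqueeze(1), negs, dim=-1)      # (n, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(n, dtype=torch.long, device=c.device)   # index 0 = positive
    return F.cross_entropy(logits, labels)
```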

### 2021

- Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - *Q Xu et al*, `arXiv 2021`
- Unsupervised Speech Recognition - *A Baevski et al*, `NeurIPS 2021`
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - *WN Hsu et al*, `TASLP 2021`
- SUPERB: Speech processing Universal PERformance Benchmark - *S Yang et al*, `INTERSPEECH 2021`
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - *G Zheng et al*, `EMNLP 2021`
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision - *C Wang et al*, `arXiv 2021, ICASSP 2022`
- WavLM: Large-scale self-supervised pre-training for full stack speech processing - *S Chen et al*, `arXiv 2021, JSTSP 2022`
- BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - *Y Zhang et al*, `arXiv 2021, JSTSP 2022`
- SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing - *J Ao et al*, `arXiv 2021, ACL 2022`
- UniSpeech: Unified speech representation learning with labeled and unlabeled data - *C Wang et al*, `ACL 2021`
- TERA: Self-supervised learning of transformer encoder representation for speech - *AT Liu et al*, `TASLP 2021`
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - *WN Hsu et al*, `INTERSPEECH 2021`
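
HuBERT above replaces contrastive learning with masked prediction of discrete targets: frames are pseudo-labeled offline by k-means clustering (first over MFCCs, then over the model's own features), and the transformer classifies the cluster ID of masked frames. A minimal sketch of the masked cross-entropy term:

```python
import torch.nn.functional as F

def hubert_masked_prediction_loss(logits, units, mask):
    """logits: (batch, time, num_clusters) model outputs; units: (batch, time)
    long tensor of k-means cluster IDs used as pseudo-labels; mask: (batch,
    time) bool. Only masked frames contribute, so the model must infer the
    hidden units from surrounding context."""
    return F.cross_entropy(logits[mask], units[mask])
```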

### 2022

- data2vec: A general framework for self-supervised learning in speech, vision and language - *A Baevski et al*, `ICML 2022`
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition - *CC Chiu et al*, `ICML 2022`
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - *HS Tsai et al*, `ACL 2022`
- Towards End-to-end Unsupervised Speech Recognition - *AH Liu et al*, `SLT 2022`
- Contrastive Siamese Network for Semi-Supervised Speech Recognition - *S Khorram et al*, `ICASSP 2022`
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - *J Ao et al*, `INTERSPEECH 2022`
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - *W Huang et al*, `ICLR 2022`
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - *F Wu et al*, `arXiv 2022, ICASSP 2023`
- Speech Pre-training with Acoustic Piece - *S Ren et al*, `INTERSPEECH 2022`
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - *Z Ma et al*, `arXiv 2022, INTERSPEECH 2023`
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - *C Wang et al*, `INTERSPEECH 2022`
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - *A Baevski et al*, `arXiv 2022`
- CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - *C Meng et al*, `arXiv 2022, INTERSPEECH 2023`
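
data2vec above predicts latent targets produced by an exponential-moving-average (EMA) teacher: the regression target at masked positions is the normalized average of the teacher's top-K block outputs on the unmasked input. A hedged sketch of the EMA update and regression loss (the normalization and smooth-L1 settings here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Teacher weights track the student as an exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec_loss(student_out, teacher_layers, mask):
    """student_out: (batch, time, dim); teacher_layers: list of (batch, time,
    dim) outputs from the EMA teacher's top-K blocks. The target is their
    per-layer-normalized average, regressed at masked steps only."""
    target = torch.stack(
        [F.layer_norm(h, h.shape[-1:]) for h in teacher_layers]
    ).mean(dim=0)
    return F.smooth_l1_loss(student_out[mask], target[mask])
```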

### 2023

- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - *Z Ma et al*, `INTERSPEECH 2023`
- MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - *JW Yoon et al*, `INTERSPEECH 2023`
- CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - *R Fan et al*, `ICASSP 2023`
- data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - *VS Lodagala et al*, `ICASSP 2023`

### Speech + Text

- A general multi-task learning framework to leverage text data for speech to text tasks - *Y Tang et al*, `ICASSP 2021`
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - *A Bapna et al*, `arXiv 2021`
- mSLAM: Massively multilingual joint pre-training for speech and text - *A Bapna et al*, `arXiv 2022`
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - *W Wang et al*, `INTERSPEECH 2022`
- Unified Speech-Text Pre-training for Speech Translation and Recognition - *Y Tang et al*, `ACL 2022`
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - *Y Kang et al*, `AAAI 2022`
- Distilling a Pretrained Language Model to a Multilingual ASR Model - *K Choi et al*, `INTERSPEECH 2022`
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - *Z Zhang et al*, `EMNLP 2022`
- TESSP: Text-Enhanced Self-Supervised Speech Pre-training - *Z Yao et al*, `arXiv 2022`
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - *Z Zhang et al*, `arXiv 2022`
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - *X Yue et al*, `ICASSP 2023`

### SSL for Audio

- BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - *D Niizumi et al*, `IJCNN 2021`
- Masked Autoencoders that Listen - *PY Huang et al*, `NeurIPS 2022`
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - *A Baade et al*, `INTERSPEECH 2022`
- BEATs: Audio Pre-Training with Acoustic Tokenizers - *S Chen et al*, `ICML 2023`
- Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - *X Li et al*, `arXiv 2023`
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - *W Chen et al*, `arXiv 2024`
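
Masked Autoencoders that Listen and MAE-AST above apply MAE-style pre-training to spectrogram patches: a large fraction of patches is masked, only visible patches are encoded, and a light decoder reconstructs the rest. A minimal sketch of the random patch-masking step (the 0.8 ratio is illustrative):

```python
import torch

def random_patch_mask(patches, mask_ratio=0.8):
    """patches: (batch, num_patches, dim) spectrogram patch embeddings.
    Returns the visible subset (encoder input) and a bool mask where True
    marks patches the decoder must reconstruct."""
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_keep]
    visible = patches.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, keep_idx, False)   # False = visible, True = masked
    return visible, mask
```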

### SSL for TTS

- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - *R Eloff et al*, `INTERSPEECH 2019`
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - *H Zhang et al*, `INTERSPEECH 2020`
- Towards Unsupervised Speech Synthesis - *AH Liu et al*, `NAACL 2022`

## Resources

- **S**peech processing **U**niversal **PER**formance **B**enchmark (SUPERB)
- **S**elf-**S**upervised **S**peech **P**re-training and **R**epresentation **L**earning (**S3PRL**)
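
For quick experimentation with the models above, a hedged sketch of frame-level feature extraction using the Hugging Face `transformers` port of wav2vec 2.0 (the checkpoint name and output shape are assumptions about a public release; S3PRL exposes the same family of models behind a unified upstream interface):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, ~49, 768) frame-level features
```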