# awesome-speech-pretraining

Papers, code, and statistics for self-supervised learning and pre-training on speech.

https://github.com/ddlbojack/awesome-speech-pretraining
## Papers

### SSL Model Distillation, Compression and Acceleration

- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT - *H Chang et al*, `ICASSP 2022`
- FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning - *Y Lee et al*, `INTERSPEECH 2022`
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT - *R Wang et al*, `INTERSPEECH 2022`
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models - *T Ashihara et al*, `INTERSPEECH 2022`
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition - *Y Wang et al*, `arXiv 2022`
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning - *G Yang et al*, `ASRU 2023`
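
As a rough illustration of the layer-wise objective used by DistilHuBERT above: small prediction heads on the student regress selected hidden layers of a frozen teacher with an L1 term plus a cosine-similarity term. A hedged sketch (the layer choice and weighting below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_preds, teacher_layers, lam=1.0):
    """student_preds, teacher_layers: matching lists of (batch, time, dim)
    tensors, one pair per distilled teacher layer (e.g. layers 4/8/12)."""
    total = 0.0
    for pred, target in zip(student_preds, teacher_layers):
        l1 = F.l1_loss(pred, target)                      # frame-wise regression
        cos = F.cosine_similarity(pred, target, dim=-1)   # (batch, time)
        total = total + l1 - lam * F.logsigmoid(cos).mean()
    return total
```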

### 2018

- Representation Learning with Contrastive Predictive Coding - *A Oord et al*, `arXiv 2018`
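
CPC above trains a context network to tell the true future latent apart from negatives via the InfoNCE loss. A simplified in-batch sketch (the paper scores pairs with a log-bilinear model per prediction step; cosine scoring with a temperature is an assumption here):

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, temperature=0.1):
    """context: (batch, dim) summaries c_t; future: (batch, dim) encodings
    z_{t+k}. Row i's positive is future[i]; other rows serve as negatives."""
    c = F.normalize(context, dim=-1)
    z = F.normalize(future, dim=-1)
    logits = c @ z.t() / temperature                   # (batch, batch) similarities
    labels = torch.arange(c.size(0), device=c.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```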

### 2019

- An Unsupervised Autoregressive Model for Speech Representation Learning - *YA Chung et al*, `INTERSPEECH 2019`
- wav2vec: Unsupervised Pre-training for Speech Recognition - *S Schneider et al*, `INTERSPEECH 2019`
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations - *A Baevski et al*, `arXiv 2019, ICLR 2020`
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training - *D Jiang et al*, `arXiv 2019`
- Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks - *S Pascual et al*, `INTERSPEECH 2019`

### 2020

- Unsupervised pretraining transfers well across languages - *M Riviere et al*, `ICASSP 2020`
- Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders - *AT Liu et al*, `ICASSP 2020`
- Learning robust and multilingual speech representations - *K Kawakami et al*, `EMNLP 2020`
- Improved speech representations with multi-target autoregressive predictive coding - *YA Chung et al*, `ACL 2020`
- Effectiveness of self-supervised pre-training for ASR - *A Baevski et al*, `ICASSP 2020`
- Deep contextualized acoustic representations for semi-supervised speech recognition - *S Ling et al*, `ICASSP 2020`
- Improved noisy student training for automatic speech recognition - *DS Park et al*, `INTERSPEECH 2020`
- wav2vec 2.0: A framework for self-supervised learning of speech representations - *A Baevski et al*, `NeurIPS 2020`
- Unsupervised cross-lingual representation learning for speech recognition - *A Conneau et al*, `arXiv 2020`
- Self-training and Pre-training are Complementary for Speech Recognition - *Q Xu et al*, `arXiv 2020, ICASSP 2021`
- DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization - *S Ling et al*, `arXiv 2020`
- Pushing the limits of semi-supervised learning for automatic speech recognition - *Y Zhang et al*, `arXiv 2020, NeurIPS Workshop 2020`
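
Several 2020 entries, most prominently wav2vec 2.0, train by identifying the true quantized latent of a masked time step among distractors. A minimal sketch of such a masked contrastive objective, assuming negatives are drawn uniformly from other masked positions in the batch (the paper samples them from the same utterance) and omitting the quantizer and diversity loss:

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, quantized, mask, num_negatives=10, temperature=0.1):
    """context, quantized: (batch, time, dim); mask: (batch, time) bool flags
    for masked steps. Each context vector must pick out its own quantized
    latent among sampled distractors."""
    c, q = context[mask], quantized[mask]               # (n_masked, dim) each
    n = c.size(0)
    neg_idx = torch.randint(0, n, (n, num_negatives), device=c.device)
    negs = q[neg_idx]                                   # (n_masked, K, dim)
    pos = F.cosine_similarity(c, q, dim=-1).unsqueeze(1)         # (n, 1)
    neg = F.cosine_similarity(c.unsqueeze(1), negs, dim=-1)      # (n, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(n, dtype=torch.long, device=c.device)   # index 0 = positive
    return F.cross_entropy(logits, labels)
```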

### 2021

- Simple and Effective Zero-shot Cross-lingual Phoneme Recognition - *Q Xu et al*, `arXiv 2021`
- Unsupervised Speech Recognition - *A Baevski et al*, `NeurIPS 2021`
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units - *WN Hsu et al*, `TASLP 2021`
- SUPERB: Speech processing Universal PERformance Benchmark - *S Yang et al*, `INTERSPEECH 2021`
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition - *G Zheng et al*, `EMNLP 2021`
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision - *C Wang et al*, `arXiv 2021, ICASSP 2022`
- WavLM: Large-scale self-supervised pre-training for full stack speech processing - *S Chen et al*, `arXiv 2021, JSTSP 2022`
- BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition - *Y Zhang et al*, `arXiv 2021, JSTSP 2022`
- SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing - *J Ao et al*, `arXiv 2021, ACL 2022`
- UniSpeech: Unified speech representation learning with labeled and unlabeled data - *C Wang et al*, `ACL 2021`
- TERA: Self-supervised learning of transformer encoder representation for speech - *AT Liu et al*, `TASLP 2021`
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training - *WN Hsu et al*, `INTERSPEECH 2021`
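
HuBERT above replaces contrastive learning with masked prediction of discrete targets: frames are pseudo-labeled offline by k-means clustering (first over MFCCs, then over the model's own features), and the transformer classifies the cluster ID of masked frames. A minimal sketch of the masked cross-entropy term:

```python
import torch.nn.functional as F

def hubert_masked_prediction_loss(logits, units, mask):
    """logits: (batch, time, num_clusters) model outputs; units: (batch, time)
    long tensor of k-means cluster IDs used as pseudo-labels; mask: (batch,
    time) bool. Only masked frames contribute, so the model must infer the
    hidden units from surrounding context."""
    return F.cross_entropy(logits[mask], units[mask])
```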

### 2022

- data2vec: A general framework for self-supervised learning in speech, vision and language - *A Baevski et al*, `ICML 2022`
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition - *CC Chiu et al*, `ICML 2022`
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities - *HS Tsai et al*, `ACL 2022`
- Towards End-to-end Unsupervised Speech Recognition - *AH Liu et al*, `SLT 2022`
- Contrastive Siamese Network for Semi-Supervised Speech Recognition - *S Khorram et al*, `ICASSP 2022`
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data - *J Ao et al*, `INTERSPEECH 2022`
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training - *W Huang et al*, `ICLR 2022`
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages - *F Wu et al*, `arXiv 2022, ICASSP 2023`
- Speech Pre-training with Acoustic Piece - *S Ren et al*, `INTERSPEECH 2022`
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets - *Z Ma et al*, `arXiv 2022, INTERSPEECH 2023`
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training - *C Wang et al*, `INTERSPEECH 2022`
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - *A Baevski et al*, `arXiv 2022`
- CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning - *C Meng et al*, `arXiv 2022, INTERSPEECH 2023`
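
data2vec above predicts latent targets produced by an exponential-moving-average (EMA) teacher: the regression target at masked positions is the normalized average of the teacher's top-K block outputs on the unmasked input. A hedged sketch of the EMA update and regression loss (the normalization and smooth-L1 settings here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Teacher weights track the student as an exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec_loss(student_out, teacher_layers, mask):
    """student_out: (batch, time, dim); teacher_layers: list of (batch, time,
    dim) outputs from the EMA teacher's top-K blocks. The target is their
    per-layer-normalized average, regressed at masked steps only."""
    target = torch.stack(
        [F.layer_norm(h, h.shape[-1:]) for h in teacher_layers]
    ).mean(dim=0)
    return F.smooth_l1_loss(student_out[mask], target[mask])
```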

### 2023

- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation - *Z Ma et al*, `INTERSPEECH 2023`
- MCR-Data2vec 2.0: Improving Self-supervised Speech Pre-training via Model-level Consistency Regularization - *JW Yoon et al*, `INTERSPEECH 2023`
- CTCBERT: Advancing Hidden-unit BERT with CTC Objectives - *R Fan et al*, `ICASSP 2023`
- data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup - *VS Lodagala et al*, `ICASSP 2023`

### Speech + Text

- A general multi-task learning framework to leverage text data for speech to text tasks - *Y Tang et al*, `ICASSP 2021`
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training - *A Bapna et al*, `arXiv 2021`
- mSLAM: Massively multilingual joint pre-training for speech and text - *A Bapna et al*, `arXiv 2022`
- Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding - *W Wang et al*, `INTERSPEECH 2022`
- Unified Speech-Text Pre-training for Speech Translation and Recognition - *Y Tang et al*, `ACL 2022`
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data - *Y Kang et al*, `AAAI 2022`
- Distilling a Pretrained Language Model to a Multilingual ASR Model - *K Choi et al*, `INTERSPEECH 2022`
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training - *Z Zhang et al*, `EMNLP 2022`
- TESSP: Text-Enhanced Self-Supervised Speech Pre-training - *Z Yao et al*, `arXiv 2022`
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data - *Z Zhang et al*, `arXiv 2022`
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text - *X Yue et al*, `ICASSP 2023`

### SSL for Audio

- BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation - *D Niizumi et al*, `IJCNN 2021`
- Masked Autoencoders that Listen - *PY Huang et al*, `NeurIPS 2022`
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - *A Baade et al*, `INTERSPEECH 2022`
- BEATs: Audio Pre-Training with Acoustic Tokenizers - *S Chen et al*, `ICML 2023`
- Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks - *X Li et al*, `arXiv 2023`
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer - *W Chen et al*, `arXiv 2024`
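
Masked Autoencoders that Listen and MAE-AST above apply MAE-style pre-training to spectrogram patches: a large fraction of patches is masked, only visible patches are encoded, and a light decoder reconstructs the rest. A minimal sketch of the random patch-masking step (the 0.8 ratio is illustrative):

```python
import torch

def random_patch_mask(patches, mask_ratio=0.8):
    """patches: (batch, num_patches, dim) spectrogram patch embeddings.
    Returns the visible subset (encoder input) and a bool mask where True
    marks patches the decoder must reconstruct."""
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :num_keep]
    visible = patches.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, keep_idx, False)   # False = visible, True = masked
    return visible, mask
```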

### SSL for TTS

- Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks - *R Eloff et al*, `INTERSPEECH 2019`
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages - *H Zhang et al*, `INTERSPEECH 2020`
- Towards Unsupervised Speech Synthesis - *AH Liu et al*, `NAACL 2022`

## Resources

- **S**peech processing **U**niversal **PER**formance **B**enchmark (SUPERB)
- **S**elf-**S**upervised **S**peech **P**re-training and **R**epresentation **L**earning (**S3PRL**)
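
For quick experimentation with the models above, a hedged sketch of frame-level feature extraction using the Hugging Face `transformers` port of wav2vec 2.0 (the checkpoint name and output shape are assumptions about a public release; S3PRL exposes the same family of models behind a unified upstream interface):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, ~49, 768) frame-level features
```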