{"id":22366361,"url":"https://github.com/microsoft/unispeech","last_synced_at":"2025-04-04T20:11:03.855Z","repository":{"id":37631083,"uuid":"385814531","full_name":"microsoft/UniSpeech","owner":"microsoft","description":"UniSpeech  - Large Scale Self-Supervised Learning for Speech","archived":false,"fork":false,"pushed_at":"2024-04-05T13:14:48.000Z","size":75897,"stargazers_count":453,"open_issues_count":22,"forks_count":74,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-03-28T19:08:19.217Z","etag":null,"topics":["diarization","pytorch","speaker-verification","speech","speech-diarization","speech-processing","speech-recognition","speech-separation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-14T04:36:40.000Z","updated_at":"2025-03-26T23:47:20.000Z","dependencies_parsed_at":"2025-01-05T02:19:31.330Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/UniSpeech","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniSpeech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniSpeech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniSpeech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FUniSpeech/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/UniSpeech/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242678,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diarization","pytorch","speaker-verification","speech","speech-diarization","speech-processing","speech-recognition","speech-separation"],"created_at":"2024-12-04T18:10:26.963Z","updated_at":"2025-04-04T20:11:03.816Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# UniSpeech\n\n\u003c!--**Pre-trained models for speech related tasks**--\u003e\n\nThe family of UniSpeech:\n\u003e [**WavLM**](https://arxiv.org/pdf/2110.13900.pdf) (```arXiv```): **WavLM: Large-Scale Self-Supervised  Pre-training   for Full Stack Speech Processing**\n\n\u003e [**UniSpeech**](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech) (```ICML 2021```): **Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR**\n\n\u003e [**UniSpeech-SAT**](https://arxiv.org/pdf/2110.05752.pdf) (```ICASSP 2022 Submission```): **Universal Speech Representation Learning with  Speaker Aware Pre-Training**\n\n\u003e [**ILS-SSL**](https://arxiv.org/pdf/2112.08778.pdf) (```ICASSP 2022 Submission```): **Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision**\n\nModel introductions, evaluation results, and model inference instructions are 
located in their corresponding folders. The source code is here [https://github.com/microsoft/UniSpeech/tree/main/src].\n\n## Update\n- [HuggingFace Integration] Dec 23, 2021: [**WavLM**](https://huggingface.co/models?other=wavlm) models are on [HuggingFace](https://huggingface.co/models?other=wavlm).\n- [HuggingFace Integration] October 26, 2021: [**UniSpeech-SAT**](https://huggingface.co/microsoft/unispeech-sat-large) models are on [HuggingFace](https://huggingface.co/models?other=unispeech-sat).\n- [Model Release] October 13, 2021: [**UniSpeech-SAT**](https://arxiv.org/pdf/2110.05752.pdf) models are released.\n- [HuggingFace Integration] October 11, 2021: [**UniSpeech**](https://huggingface.co/microsoft/unispeech-large-1500h-cv) models are on [HuggingFace](https://huggingface.co/models?other=unispeech).\n- [Model Release] June, 2021: [**UniSpeech v1**](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech) models are released.\n## Pre-trained models\nWe strongly recommend our UniSpeech-SAT model for speaker-related tasks, since it performs strongly on various speaker-related benchmarks.\nModel | Pretraining Dataset | Finetuning Dataset | Download\n|---|---|---|-----\nUniSpeech Large EN |  [Labeled: 1350 hrs en](https://commonvoice.mozilla.org/) | - |  [download](https://releasemodel.blob.core.windows.net/models/CommonVoicePretrainedModel/CommonVoiceEnglishPretrainedModel/checkpoint_best.pt?sv=2019-12-12\u0026st=2021-07-14T09%3A00%3A07Z\u0026se=2022-07-15T09%3A00%3A00Z\u0026sr=b\u0026sp=r\u0026sig=5sxvEwVRoGtkazNQYkOuFLlPYau8nl5Ng%2FfRJa0Vnc4%3D)\nUniSpeech Large Multilingual |  [Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it](https://commonvoice.mozilla.org/) | - | 
[download](https://releasemodel.blob.core.windows.net/models/CommonVoicePretrainedModel/CommonVoiceMultilingualPretrainedModel/checkpoint_best.pt?sv=2019-12-12\u0026st=2021-07-14T09%3A00%3A39Z\u0026se=2022-07-15T09%3A00%3A00Z\u0026sr=b\u0026sp=r\u0026sig=y%2Fd3rqtbyqW0ZCwR7Czho5any90khA%2Ft3w9PTZ6N9vU%3D)\nUniSpeech Large+ | [Labeled: 1350 hrs en, Unlabeled: 353 hrs fr](https://commonvoice.mozilla.org/) | - | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/pt_fr353.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A44%3A54Z\u0026se=2023-10-26T06%3A44%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=7tYuYMxVFfM2Vgi%2BoqUh%2ByJXD4hSuoafHgBP5VZApw0%3D)\nUniSpeech Large+ | [Labeled: 1350 hrs en, Unlabeled: 168 hrs es](https://commonvoice.mozilla.org/) | - | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/pt_es168.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A39%3A37Z\u0026se=2023-10-26T06%3A39%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=T2B5%2BlOI6v64TNdLSe9rdp3R%2B9Q2E35taUOigGW0nsQ%3D)\nUniSpeech Large+ | [Labeled: 1350 hrs en, Unlabeled: 90 hrs it](https://commonvoice.mozilla.org/) | - | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/pt_it90.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A52%3A08Z\u0026se=2023-10-26T06%3A52%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=kXsSJXK9r8UEYlUr2LaJxtPf8m9J2G23MfG725k2DBk%3D)\nUniSpeech Large Multilingual |  [Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky](https://commonvoice.mozilla.org/) | - | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/pt_ky17.large.many2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A53%3A00Z\u0026se=2022-10-26T06%3A53%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=oCQecalXzC5daaurLLJGQdFNtfYwsBM6pNQrDAsf5i0%3D)\nUniSpeech Large+ | [Labeled: 
1350 hrs en, Unlabeled: 353 hrs fr](https://commonvoice.mozilla.org/) | 1 hr fr | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/ft_fr-pt_fr353.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A27%3A53Z\u0026se=2023-10-26T06%3A27%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=9vEa3xqzWu7SYkACn9TQqDtcm%2BKmUcOHhabjbjZuPys%3D)\nUniSpeech Large+ | [Labeled: 1350 hrs en, Unlabeled: 168 hrs es](https://commonvoice.mozilla.org/) | 1 hr es | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/ft_es-pt_es168.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A21%3A34Z\u0026se=2024-10-26T06%3A21%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=G%2B0RddgOh653UzXG95Ljuwv7aG3tu9gXtPXn1ixCiug%3D)\nUniSpeech Large+ | [Labeled: 1350 hrs en, Unlabeled: 90 hrs it](https://commonvoice.mozilla.org/) | 1 hr it | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/ft_it-pt_it90.large.one2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A36%3A17Z\u0026se=2023-10-26T06%3A36%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=e1WD9uOCo9sCAdH%2FPZQ4wCD30aCDpZvvu43kJrqq2HE%3D)\nUniSpeech Large Multilingual |  [Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky](https://commonvoice.mozilla.org/) | 1 hr ky | [download](https://msranlcmtteamdrive.blob.core.windows.net/teamdrive/v-chengw/models/pt_ky17.large.many2one_unispeech/checkpoint_best.pt?st=2021-10-25T06%3A54%3A04Z\u0026se=2023-10-26T06%3A54%3A00Z\u0026sp=rl\u0026sv=2018-03-28\u0026sr=b\u0026sig=2K3VjMcsbKfBkLVyDlqGhVpIX%2B2ZcA5DTlMhjdkXo3g%3D)\nUniSpeech-SAT Base |  [960 hrs LibriSpeech](http://www.openslr.org/12) | - | 
[download](https://valle.blob.core.windows.net/share/unispeech-sat/unispeech_repo/UniSpeech-SAT-Base.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A26%3A06Z\u0026se=2094-01-31T06%3A26%3A00Z\u0026sr=b\u0026sp=r\u0026sig=Ts8al%2FPc%2BksI%2BY4tKvDVZhmyw02c9pMhFDxLrPntSd0%3D)\nUniSpeech-SAT Base+ | [60k hrs Libri-Light](https://github.com/facebookresearch/libri-light) + [10k hrs GigaSpeech](https://github.com/SpeechColab/GigaSpeech) + [24k hrs VoxPopuli](https://github.com/facebookresearch/voxpopuli/tree/main) | - | [download](https://valle.blob.core.windows.net/share/unispeech-sat/unispeech_repo/UniSpeech-SAT-Base+.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A26%3A25Z\u0026se=2094-01-31T06%3A26%3A00Z\u0026sr=b\u0026sp=r\u0026sig=m6XAIXsC4rVNDW%2FFwXPNKjX2A%2BV9zBwmWAV93vAXcvc%3D)\nUniSpeech-SAT Large | [60k hrs Libri-Light](https://github.com/facebookresearch/libri-light) + [10k hrs GigaSpeech](https://github.com/SpeechColab/GigaSpeech) + [24k hrs VoxPopuli](https://github.com/facebookresearch/voxpopuli/tree/main) | - | [download](https://valle.blob.core.windows.net/share/unispeech-sat/unispeech_repo/UniSpeech-SAT-Large.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A26%3A43Z\u0026se=2094-01-31T06%3A26%3A00Z\u0026sr=b\u0026sp=r\u0026sig=TJXNIfJzB%2Bsja3uh2xbCUxdbvp8gQP0zzlmZvU2si1I%3D)\nWavLM Base |  [960 hrs LibriSpeech](http://www.openslr.org/12)| -  | [download](https://valle.blob.core.windows.net/share/wavlm/unispeech_repo/WavLM-Base.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A27%3A08Z\u0026se=2094-01-31T06%3A27%3A00Z\u0026sr=b\u0026sp=r\u0026sig=ThlNPycn578KFcON1NTJ7hzkpNLZR%2B3D4ImTgXQR%2B9E%3D)\nWavLM Base+ | [60k hrs Libri-Light](https://github.com/facebookresearch/libri-light) + [10k hrs GigaSpeech](https://github.com/SpeechColab/GigaSpeech) + [24k hrs VoxPopuli](https://github.com/facebookresearch/voxpopuli/tree/main)| -  |  
[download](https://valle.blob.core.windows.net/share/wavlm/unispeech_repo/WavLM-Base+.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A27%3A23Z\u0026se=2094-01-31T06%3A27%3A00Z\u0026sr=b\u0026sp=r\u0026sig=6XiFiDMiKNRLRYzNNAL9UWm0dAAFuRweRLFZ2h9IYzg%3D) \nWavLM Large | [60k hrs Libri-Light](https://github.com/facebookresearch/libri-light) + [10k hrs GigaSpeech](https://github.com/SpeechColab/GigaSpeech) + [24k hrs VoxPopuli](https://github.com/facebookresearch/voxpopuli/tree/main)| -  | [download](https://valle.blob.core.windows.net/share/wavlm/unispeech_repo/WavLM-Large.pt?sv=2021-10-04\u0026st=2024-01-30T06%3A27%3A39Z\u0026se=2094-01-31T06%3A27%3A00Z\u0026sr=b\u0026sp=r\u0026sig=eIFLuFlZxmBR7a642KnnLen7yoSzv465iLHLKokO7VM%3D) \n\n## Universal Representation Evaluation on SUPERB \n![alt text](WavLM/WavLM_SUPERB_Results.png)\n\n## Downstream Task Performance \nWe also evaluate our models on typical speaker related benchmarks.\n### Speaker Verification\nFinetune the model with VoxCeleb2 dev data, and evaluate it on the [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/#:~:text=VoxCeleb%20is%20an%20audio%2Dvisual,interview%20videos%20uploaded%20to%20YouTube)\n| Model         |Fix pre-train| Vox1-O | Vox1-E     | Vox1-H         |\n| ------------- |------------- | ---------- | ---------- | ---------- |\n| ECAPA-TDNN   | - | 0.87     | 1.12  | 2.12   |\n| HuBERT large  | Yes|  0.888\t|0.912|\t1.853 |\n| Wav2Vec2.0 (XLSR)| Yes | 0.915|\t0.945\t|1.895|\n| UniSpeech-SAT large | Yes | 0.771\t| 0.781|\t1.669|\n| WavLM large | Yes | 0.59\t| 0.65|\t1.328|\n| WavLM large | No | 0.505\t| 0.579|\t1.176|\n|+Large Margin Finetune and Score Calibration|\n| HuBERT large | No| 0.585|\t0.654\t|1.342|   \n| Wav2Vec2.0 (XLSR) | No| 0.564|\t0.605\t|1.23|   \n| UniSpeech-SAT large | No | 0.564 | 0.561| 1.23 |\n| **WavLM large (New)** | No | **0.33** | **0.477**| **0.984** |\n\n[Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker 
Verification](https://arxiv.org/pdf/2110.05777.pdf)\n\n\n\n### Speech Separation\n\nEvaluation on [LibriCSS](https://github.com/chenzhuo1011/libri_css)\n| Model | 0S | 0L | OV10 | OV20 | OV30 | OV40 |\n| ---------------- | ------ | ------ | ------ | ------ | ------ | ------ |\n| [Conformer](https://ieeexplore.ieee.org/abstract/document/9413423/) (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6 |\n| UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5 |\n| UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8 |\n| WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9 |\n| **WavLM large** | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5 |\n\n\n### Speaker Diarization\n\nEvaluation on CALLHOME\n| Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all |\n| ---------------- | ------ | ------ | ------ | ------ | ------ | ------ |\n| [EEND-vector clustering](https://arxiv.org/pdf/2105.09040.pdf) | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49 |\n| [EEND-EDA clustering](https://arxiv.org/abs/2107.01545) (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84 |\n| UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92 |\n| WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75 |\n| **WavLM large** | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35 |\n\n\n## License\nThis project is licensed under the license found in the LICENSE file in the root directory of this source tree.\nPortions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) project.\n\n[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)\n\n\n### Reference\nIf you find our work useful in your research, please cite the following papers:\n``` latex\n@inproceedings{Wang2021UniSpeech,\n  author    = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},\n  editor    = {Marina Meila and Tong Zhang},\n  title     = {UniSpeech: Unified Speech Representation Learning with Labeled 
and\n               Unlabeled Data},\n  booktitle = {Proceedings of the 38th International Conference on Machine Learning,\n               {ICML} 2021, 18-24 July 2021, Virtual Event},\n  series    = {Proceedings of Machine Learning Research},\n  volume    = {139},\n  pages     = {10937--10947},\n  publisher = {{PMLR}},\n  year      = {2021},\n  url       = {http://proceedings.mlr.press/v139/wang21y.html},\n  timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},\n  biburl    = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n``` latex\n@article{Chen2021WavLM,\n  title   = {WavLM: Large-Scale Self-Supervised  Pre-training   for Full Stack Speech Processing},\n  author  = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},\n  eprint={2110.13900},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL},\n  year={2021}\n}\n```\n\n``` latex\n@article{Chen2021UniSpeechSAT,\n  title   = {UniSpeech-SAT: Universal Speech Representation Learning with  Speaker Aware Pre-Training},\n  author  = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and   Jian Wu and Yao Qian and Furu Wei and Jinyu Li and  Xiangzhan Yu},\n  eprint={2110.05752},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL},\n  year={2021}\n}\n```\n\n\n### Contact Information\n\nFor help or issues using UniSpeech models, please submit a GitHub issue.\n\nFor other communications related to UniSpeech, please contact Yu Wu 
(`yuwu1@microsoft.com`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Funispeech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Funispeech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Funispeech/lists"}