# awesome-speaker-embedding
A curated list of speaker embedding/verification resources.

Please let me know if your code/repo is not listed here (ranchlai at 163.com).

## Must-read papers
- \[01\] [Deep Speaker: an End-to-End Neural Speaker Embedding System](https://arxiv.org/abs/1705.02304), Baidu Inc., 2017
- \[02\] [Text-Independent Speaker Verification Using 3D Convolutional Neural Networks](https://arxiv.org/abs/1705.09422), 2017
- \[03\] [Speaker Recognition from Raw Waveform with SincNet](https://arxiv.org/abs/1808.00158), Bengio team, raw waveform input, 2018
- \[04\] [VoxCeleb2: Deep Speaker Recognition](https://arxiv.org/abs/1806.05622), VGG group, Interspeech 2018
- \[05\] [Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467), Google, ICASSP 2018
- \[06\] [VoxCeleb: Large-scale speaker verification in the wild](https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf), VGG group, 2019
- \[07\] [Deep neural network embeddings for text-independent speaker verification](http://danielpovey.com/files/2017_interspeech_embeddings.pdf), Interspeech 2017, the original <b>TDNN</b>-embedding paper from Johns Hopkins: MFCC input, frame-level time-delay layers, multi-class training with softmax + cross-entropy loss
- \[08\] [Robust DNN Embeddings for Speaker Recognition](https://arxiv.org/pdf/1803.09153v1.pdf), ICASSP 2018, the <b>x-vector</b> paper from Johns Hopkins; builds on the TDNN embeddings and improves robustness with noise and reverberation data augmentation
- \[09\] [Front-end factor analysis for speaker verification](http://groups.csail.mit.edu/sls/archives/root/publications/2010/Dehak_IEEE_Transactions.pdf), IEEE TASLP, 2011, the <b>i-vector</b> paper
- \[10\] [TDNN-UBM: Time delay deep neural network-based universal background models for speaker recognition](https://www.danielpovey.com/files/2015_asru_tdnn_ubm.pdf), ASRU 2015
- \[11\] [Deep neural networks for small footprint text-dependent speaker verification](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf), ICASSP 2014, the <b>d-vector</b> paper from Google
- \[12\] [Analysis of Score Normalization in Multilingual Speaker Recognition](http://www.fit.vutbr.cz/research/groups/speech/publi/2017/matejka_interspeech2017_IS170803.pdf), Interspeech 2017, the S-norm paper, a useful reference for score normalization


## Benchmarks (may be incomplete or inaccurate)

Equal error rates (EER) reported by the authors on [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), [VoxCeleb1-E](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/list_test_all2.txt) and [VoxCeleb1-H](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/list_test_hard2.txt).

VoxCeleb1 public results (continuously updated):

| Name | Model(s) | VoxCeleb1 EER | VoxCeleb1-E EER | VoxCeleb1-H EER | Link | Affiliation | Year |
| ---- | -------- | ------------- | --------------- | --------------- | ---- | ----------- | ---- |
| X205 | DPN68, Res2Net50 | 0.7712% | 0.8968% | 1.637% | [report](https://arxiv.org/pdf/2011.00200.pdf) | AISpeech | 2020 |
| Veridas | ResNet152 | 1.08% | - | - | [report](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2020/veridas.pdf) | das-nano | 2020 |
| DKU-DukeECE | ResNet, ECAPA-TDNN | 0.888% | 1.133% | 2.008% | [report](https://arxiv.org/pdf/2010.12731.pdf) | Duke University | 2020 |
| IDLAB | ResNet, ECAPA-TDNN | - | - | - | [report](https://arxiv.org/pdf/2010.12468.pdf) | Ghent University | 2020 |
| SpeechBrain | ECAPA-TDNN | 0.69% | - | - | [link](https://github.com/speechbrain/speechbrain/tree/develop/recipes/VoxCeleb/SpeakerRec) | - | 2021 |

## Must-read technical reports

[VoxSRC 2019 report](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/files/VoxSRC19.pdf)

## Datasets
Commonly used speaker datasets:
- [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1): a small corpus for speaker recognition and ASR, non-free
- [Free ST](https://www.openslr.org/38/): Mandarin speech corpus for speaker recognition and ASR, free
- [NIST SRE](https://sre.nist.gov/): NIST Speaker Recognition Evaluation, non-free
- [AIShell-1](https://www.openslr.org/33/): Mandarin speech corpus with train/dev/test splits, free
- [AIShell-2](http://www.aishelltech.com/aishell_2): free for education, non-free for commercial use
- [AIShell-3](https://www.openslr.org/93/): free, for speaker recognition, ASR and TTS
- [AIShell-4](https://arxiv.org/abs/2104.03603): will be released soon
- [HI-MIA](https://www.openslr.org/85/): free, for far-field text-dependent speaker verification and keyword spotting
- [SITW](http://www.speech.sri.com/projects/sitw/): Speakers in the Wild
- [VoxCeleb 1&2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/): celebrity interview video/audio extracted from YouTube
- [CN-Celeb 1&2](https://www.openslr.org/82/): multi-genre speaker dataset in the wild; utterances are from Chinese celebrities

## Challenges
- [VoxCeleb Speaker Recognition Challenge (VoxSRC 2019)](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2019.html), [report](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/files/VoxSRC19.pdf)
- [VoxCeleb Speaker Recognition Challenge (VoxSRC 2020)](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2020.html)
- [VoxCeleb Speaker Recognition Challenge (VoxSRC 2021)](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2021.html)
- [Short-duration Speaker Verification (SdSV) Challenge 2020](https://sdsvc.github.io/2020/)
- [Short-duration Speaker Verification (SdSV) Challenge 2021](https://sdsvc.github.io/)
- [CTS Speaker Recognition Challenge 2020](https://sre.nist.gov/cts-challenge)
- [Far-Field Speaker Verification Challenge (FFSVC 2020)](http://2020.ffsvc.org/)

## Great Talks / Tutorials
- [X-vectors: Neural Speech Embeddings for Speaker Recognition](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2020/keynote/daniel_talk.mp4), Daniel Garcia-Romero, 2020
- [2020 Academic Symposium on Voiceprint Recognition Research and Applications (2020声纹识别研究与应用学术讨论会)](https://hub.baai.ac.cn/view/4289)

## Code/Tools/Frameworks/Libraries
- [VGGVox](https://github.com/a-nagrani/VGGVox): the first baseline system for the VoxCeleb dataset, originally implemented in MATLAB.
- [DeepSpeaker](https://github.com/philipperemy/deep-speaker): an end-to-end neural speaker embedding system.
- [SincNet](https://github.com/mravanelli/SincNet), also available in [SpeechBrain](https://github.com/speechbrain/speechbrain)
- [3D CNN](https://github.com/astorfi/3D-convolutional-speaker-recognition): TensorFlow implementation of 3D convolutional neural networks for speaker verification
- [GE2E](https://github.com/HarryVolek/PyTorch_Speaker_Verification): PyTorch implementation; a [TensorFlow](https://github.com/Janghyun1230/Speaker_Verification) implementation is also available
- [asv-subtools](https://github.com/Snowdar/asv-subtools): an open-source toolkit based on PyTorch and Kaldi for speaker recognition and language identification, from XMU Speech Lab.
- [Resemblyzer](https://github.com/resemble-ai/Resemblyzer): derives a high-level representation of a voice with a deep learning model (referred to as the voice encoder).
- [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/): audio-visual dataset of short clips of human speech, extracted from interview videos uploaded to YouTube.
- [Triplet-loss](https://omoindrot.github.io/triplet-loss): Triplet Loss and Online Triplet Mining in TensorFlow.
- [Res2Net](https://github.com/Res2Net/Res2Net-PretrainedModels): the Res2Net architecture, commonly used in the VoxCeleb Speaker Recognition Challenge.
- [voxceleb_trainer](https://github.com/clovaai/voxceleb_trainer): a speaker recognition framework written in PyTorch, with pretrained models.
- [SpeechBrain](https://github.com/speechbrain/speechbrain/tree/develop/recipes/VoxCeleb/SpeakerRec): VoxCeleb recipe.
- [Kaldi](https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb): Kaldi recipe for VoxCeleb.
- [pytorch_xvectors](https://github.com/manojpamk/pytorch_xvectors): PyTorch implementation of x-vectors.

## More recent papers
- [Attention Back-end](https://arxiv.org/pdf/2104.01541.pdf): compares PLDA and cosine scoring with the proposed attention back-end; models: TDNN, ResNet; data: CN-Celeb

## Winning solutions of challenges

### VoxSRC 2019
- Rank 1: FBank features, "r-vectors" using ResNet, AAM loss. From Brno University of Technology, [REPORT](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop/BUT_Zeinali_VoxSRC.pdf)
- Rank 2: 80-dim FBank features, E-TDNN/F-TDNN models, various classification losses including softmax/AM-softmax/PLDA-softmax. From Johns Hopkins University, [REPORT](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop/JHU-HLTCOE_VoxSRC.pdf)
- Rank 3: FBank features, ResNet + attentive pooling + phonetic attention, BLSTM + ResNet, loss not clear from the report. From Microsoft, [REPORT](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop/VoxSRC_TZ_microsoft.pdf)

### VoxSRC 2020
- Rank 1: 60-dim log-FBank, ECAPA-TDNN/SE-ResNet34, S-Norm, AAM-Softmax. From IDLab, [REPORT](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2020/participants/JTBD.pdf)
- Rank 2: 40-dim mean-normalized FBank, no VAD, ResNet/Res2Net, S-Norm, CM-Softmax. From AISpeech, [REPORT](https://arxiv.org/pdf/2011.00200.pdf), Kaldi [recipe](https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb) for data augmentation
- Rank 3: report not available
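Both top VoxSRC 2020 systems list S-Norm in their back-end. The sketch below illustrates the idea; it is not taken from either report. It scores an enrollment/test pair with cosine similarity. It then computes the mean and standard deviation of each side's scores against a cohort of impostor embeddings. The final score is the average of the two z-normalized raw scores. Passing `top_n` keeps only the most similar cohort scores, which gives the adaptive variant. The function names, the toy 3-dim vectors and the pure-Python code are illustrative assumptions. Real systems use embeddings with hundreds of dimensions, large cohorts and vectorized code.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cohort_stats(emb: Vector, cohort: Sequence[Vector],
                 top_n: Optional[int] = None) -> Tuple[float, float]:
    """Mean and standard deviation of emb's cosine scores against the cohort.

    If top_n is set, only the top_n highest scores are kept (adaptive S-norm).
    """
    scores: List[float] = sorted((cosine(emb, c) for c in cohort), reverse=True)
    if top_n is not None:
        scores = scores[:top_n]
    # Floor the std so identical cohort scores do not cause division by zero.
    return mean(scores), max(pstdev(scores), 1e-12)


def s_norm(enroll: Vector, test: Vector, cohort: Sequence[Vector],
           top_n: Optional[int] = None) -> float:
    """Symmetric (S-norm) normalization of one cosine-scored trial."""
    raw = cosine(enroll, test)
    mu_e, sd_e = cohort_stats(enroll, cohort, top_n)
    mu_t, sd_t = cohort_stats(test, cohort, top_n)
    # Average of the raw score z-normalized by each side's cohort statistics.
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


if __name__ == "__main__":
    # Hypothetical toy embeddings and cohort, for illustration only.
    cohort = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
    enroll = [1.0, 0.1, 0.0]
    print("same speaker:", s_norm(enroll, [0.9, 0.2, 0.0], cohort))
    print("different speaker:", s_norm(enroll, [0.0, 0.1, 1.0], cohort))
```

Because the two normalization terms are averaged, the score does not depend on which utterance is treated as enrollment and which as test. This symmetry is the main reason S-norm is often preferred over Z-norm or T-norm alone.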