{"id":19882958,"url":"https://github.com/modelscope/3d-speaker","last_synced_at":"2025-05-14T19:02:33.100Z","repository":{"id":153283027,"uuid":"610089108","full_name":"modelscope/3D-Speaker","owner":"modelscope","description":"A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization","archived":false,"fork":false,"pushed_at":"2024-10-29T09:36:22.000Z","size":3257,"stargazers_count":1187,"open_issues_count":3,"forks_count":101,"subscribers_count":17,"default_branch":"main","last_synced_at":"2024-10-29T11:44:02.288Z","etag":null,"topics":["3d-speaker","campplus","cnceleb","eres2net","language-identification","modelscope","rdino","speaker-diarization","speaker-verification","voxceleb"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/modelscope.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-06T04:05:27.000Z","updated_at":"2024-10-29T10:37:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"329469cc-d823-4bb5-b14a-12f67cb67159","html_url":"https://github.com/modelscope/3D-Speaker","commit_stats":null,"previous_names":["modelscope/3d-speaker"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2F3D-Speaker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2F3D-Speaker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2F3D-Speaker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2F3D-Speaker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/modelscope","download_url":"https://codeload.github.com/modelscope/3D-Speaker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248710363,"owners_count":21149185,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-speaker","campplus","cnceleb","eres2net","language-identification","modelscope","rdino","speaker-diarization","speaker-verification","voxceleb"],"created_at":"2024-11-12T17:19:02.282Z","updated_at":"2025-04-13T11:45:11.996Z","avatar_url":"https://github.com/modelscope.png","language":"Python","readme":"\n\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"docs/images/3D-Speaker-logo.png\" width=\"400\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n    \n\u003cdiv align=\"center\"\u003e\n    \n\u003c!-- [![Documentation Status](https://readthedocs.org/projects/easy-cv/badge/?version=latest)](https://easy-cv.readthedocs.io/en/latest/) --\u003e\n![license](https://img.shields.io/github/license/modelscope/modelscope.svg)\n\u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/OS-Linux-orange.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-\u003e=3.8-aff.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/Pytorch-\u003e=1.10-blue\"\u003e\u003c/a\u003e\n    \n\u003c/div\u003e\n    \n\u003cstrong\u003e3D-Speaker\u003c/strong\u003e is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on [ModelScope](https://www.modelscope.cn/models?page=1\u0026tasks=speaker-verification\u0026type=audio). Furthermore, we present a large-scale speech corpus also called [3D-Speaker-Dataset](https://3dspeaker.github.io/) to facilitate the research of speech representation disentanglement.\n\n## Benchmark\nThe EER results on VoxCeleb, CNCeleb and 3D-Speaker datasets for fully-supervised speaker verification.\n| Model | Params | VoxCeleb1-O | CNCeleb | 3D-Speaker |\n|:-----:|:------:| :------:|:------:|:------:|\n| Res2Net | 4.03 M | 1.56% | 7.96% | 8.03% |\n| ResNet34 | 6.34 M | 1.05% | 6.92% | 7.29% |\n| ECAPA-TDNN | 20.8 M | 0.86% | 8.01% | 8.87% |\n| ERes2Net-base | 6.61 M | 0.84% | 6.69% | 7.21% |\n| CAM++ | 7.2 M | 0.65% | 6.78% | 7.75% |\n| ERes2NetV2 | 17.8M | 0.61% | **6.14%** | 6.52% |\n| ERes2Net-large | 22.46 M | **0.52%** | 6.17% | **6.34%** |\n\nThe DER results on public and internal multi-speaker datasets for speaker diarization.\n| Test | 3D-Speaker | [pyannote.audio](https://github.com/pyannote/pyannote-audio) | [DiariZen_WavLM](https://github.com/BUTSpeechFIT/DiariZen) | \n|:-----:|:------:|:------:|:------:|\n|[Aishell-4](https://arxiv.org/abs/2104.03603)|**10.30%**|12.2%|11.7%|\n|[Alimeeting](https://www.openslr.org/119/)|19.73%|24.4%|**17.6%**|\n|[AMI_SDM](https://groups.inf.ed.ac.uk/ami/corpus/)|21.76%|22.4%|**15.4%**|\n|[VoxConverse](https://github.com/joonson/voxconverse)|11.75%|**11.3%**|28.39%|\n|Meeting-CN_ZH-1|**18.91%**|22.37%|32.66%|\n|Meeting-CN_ZH-2|**12.78%**|17.86%|18%|\n\n\n## Quickstart\n### Install 3D-Speaker\n``` sh\ngit clone https://github.com/modelscope/3D-Speaker.git \u0026\u0026 cd 3D-Speaker\nconda create -n 3D-Speaker python=3.8\nconda activate 3D-Speaker\npip install -r requirements.txt\n```\n### Running experiments\n``` sh\n# Speaker verification: ERes2NetV2 on 3D-Speaker dataset\ncd egs/3dspeaker/sv-eres2netv2/\nbash run.sh\n# Speaker verification: CAM++ on 3D-Speaker dataset\ncd egs/3dspeaker/sv-cam++/\nbash run.sh\n# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset\ncd egs/3dspeaker/sv-ecapa/\nbash run.sh\n# Self-supervised speaker verification: SDPN on VoxCeleb dataset\ncd egs/voxceleb/sv-sdpn/\nbash run.sh\n# Audio and multimodal Speaker diarization:\ncd egs/3dspeaker/speaker-diarization/\nbash run_audio.sh\nbash run_video.sh\n# Language identification\ncd egs/3dspeaker/language-idenitfication\nbash run.sh\n```\n### Inference using pretrained models from Modelscope\nAll pretrained models are released on [Modelscope](https://www.modelscope.cn/models?page=1\u0026tasks=speaker-verification\u0026type=audio).\n\n``` sh\n# Install modelscope\npip install modelscope\n# ERes2Net trained on 200k labeled speakers\nmodel_id=iic/speech_eres2net_sv_zh-cn_16k-common\n# ERes2NetV2 trained on 200k labeled speakers\nmodel_id=iic/speech_eres2netv2_sv_zh-cn_16k-common\n# CAM++ trained on 200k labeled speakers\nmodel_id=iic/speech_campplus_sv_zh-cn_16k-common\n# Run CAM++ or ERes2Net inference\npython speakerlab/bin/infer_sv.py --model_id $model_id\n# Run batch inference\npython speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list\n\n# SDPN trained on VoxCeleb\nmodel_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k\n# Run SDPN inference\npython speakerlab/bin/infer_sv_ssl.py --model_id $model_id\n\n# Run diarization inference\npython speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir\n# Enable overlap detection\npython speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token\n```\n\n## Overview of Content\n\n- **Supervised Speaker Verification**\n  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-res2net) training recipes on [3D-Speaker](https://3dspeaker.github.io/).\n\n  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-res2net) training recipes on [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/). \n\n  - [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-cam%2B%2B), [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-eres2net), [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-eres2netv2), [ECAPA-TDNN](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-ecapa), [ResNet](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-res2net) training recipes on [CN-Celeb](http://cnceleb.org/).\n\n- **Self-supervised Speaker Verification**\n  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-rdino) and [SDPN](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-sdpn) training recipes on [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)\n    \n  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-rdino) training recipes on [3D-Speaker](https://3dspeaker.github.io/).\n\n  - [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-rdino) training recipes on [CN-Celeb](http://cnceleb.org/).\n\n- **Speaker Diarization**\n  - [Speaker diarization](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization) inference recipes which comprise multiple modules, including overlap detection[optional], voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering. \n\n- **Language Identification**\n  - [Language identification](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/language-identification) training recipes on [3D-Speaker](https://3dspeaker.github.io/).\n\n- **3D-Speaker Dataset**\n  - Dataset introduction and download address: [3D-Speaker](https://3dspeaker.github.io/) \u003cbr\u003e\n  - Related paper address: [3D-Speaker](https://arxiv.org/pdf/2306.15354.pdf)\n\n\n## What‘s new :fire:\n- [2024.12] Update [diarization](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization) recipes and add results on multiple diarization benchmarks.\n- [2024.8] Releasing [ERes2NetV2](https://modelscope.cn/models/iic/speech_eres2netv2_sv_zh-cn_16k-common) and [ERes2NetV2_w24s4ep4](https://modelscope.cn/models/iic/speech_eres2netv2w24s4ep4_sv_zh-cn_16k-common) pretrained models trained on 200k-speaker datasets.\n- [2024.5] Releasing [SDPN](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-sdpn) model and [X-vector](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-xvector) model training and inference recipes for VoxCeleb.\n- [2024.5] Releasing [visual module](https://github.com/modelscope/3D-Speaker/tree/main/egs/ava-asd/talknet) and [semantic module](https://github.com/modelscope/3D-Speaker/tree/main/egs/semantic_speaker/bert) training recipes.\n- [2024.4] Releasing [ONNX Runtime](https://github.com/modelscope/3D-Speaker/tree/main/runtime/onnxruntime) and the relevant scripts for inference.\n- [2024.4] Releasing [ERes2NetV2](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-eres2netv2) model with lower parameters and faster inference speed on VoxCeleb datasets.\n- [2024.2] Releasing [language identification](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/language-identification) integrating phonetic information recipes for more higher recognition accuracy.\n- [2024.2] Releasing [multimodal diarization](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization) recipes which fuses audio and video image input to produce more accurate results.\n- [2024.1] Releasing [ResNet34](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-resnet) and [Res2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-res2net) model training and inference recipes for 3D-Speaker, VoxCeleb and CN-Celeb datasets.\n- [2024.1] Releasing [large-margin finetune recipes](https://github.com/modelscope/3D-Speaker/blob/main/egs/voxceleb/sv-eres2net/run.sh) in speaker verification and adding [diarization recipes](https://github.com/modelscope/3D-Speaker/blob/main/egs/3dspeaker/speaker-diarization/run.sh). \n- [2023.11] [ERes2Net-base](https://modelscope.cn/models/damo/speech_eres2net_base_200k_sv_zh-cn_16k-common/summary) pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.\n- [2023.10] Releasing [ECAPA model](https://github.com/modelscope/3D-Speaker/tree/main/egs/voxceleb/sv-ecapa) training and inference recipes for three datasets.\n- [2023.9] Releasing [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/cnceleb/sv-rdino) model training and inference recipes for [CN-Celeb](http://cnceleb.org/).\n- [2023.8] Releasing [CAM++](https://modelscope.cn/models/damo/speech_campplus_sv_cn_cnceleb_16k/summary), [ERes2Net-Base](https://modelscope.cn/models/damo/speech_eres2net_base_sv_zh-cn_cnceleb_16k/summary) and [ERes2Net-Large](https://modelscope.cn/models/damo/speech_eres2net_large_sv_zh-cn_cnceleb_16k/summary) benchmarks in [CN-Celeb](http://cnceleb.org/).\n- [2023.8] Releasing [ERes2Net](https://modelscope.cn/models/damo/speech_eres2net_base_lre_en-cn_16k/summary) annd [CAM++](https://modelscope.cn/models/damo/speech_campplus_lre_en-cn_16k/summary) in language identification for Mandarin and English. \n- [2023.7] Releasing [CAM++](https://modelscope.cn/models/damo/speech_campplus_sv_zh-cn_3dspeaker_16k/summary), [ERes2Net-Base](https://modelscope.cn/models/damo/speech_eres2net_base_sv_zh-cn_3dspeaker_16k/summary), [ERes2Net-Large](https://modelscope.cn/models/damo/speech_eres2net_large_sv_zh-cn_3dspeaker_16k/summary) pretrained models trained on [3D-Speaker](https://3dspeaker.github.io/).\n- [2023.7] Releasing [Dialogue Detection](https://modelscope.cn/models/damo/speech_bert_dialogue-detetction_speaker-diarization_chinese/summary) and [Semantic Speaker Change Detection](https://modelscope.cn/models/damo/speech_bert_semantic-spk-turn-detection-punc_speaker-diarization_chinese/summary) in speaker diarization.\n- [2023.7] Releasing [CAM++](https://modelscope.cn/models/damo/speech_campplus_lre_en-cn_16k/summary) in language identification for Mandarin and English.\n- [2023.6] Releasing [3D-Speaker](https://3dspeaker.github.io/) dataset and its corresponding benchmarks including [ERes2Net](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-eres2net), [CAM++](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-cam%2B%2B) and [RDINO](https://github.com/modelscope/3D-Speaker/tree/main/egs/3dspeaker/sv-rdino).\n- [2023.5] [ERes2Net](https://modelscope.cn/models/damo/speech_eres2net_sv_zh-cn_16k-common/summary) and [CAM++](https://www.modelscope.cn/models/damo/speech_campplus_sv_zh-cn_16k-common/summary) pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.\n\n## Contact\nIf you have any comment or question about 3D-Speaker, please contact us by\n- email: {yfchen97, wanghuii}@mail.ustc.edu.cn, {dengchong.d, zsq174630, shuli.cly}@alibaba-inc.com\n\n## License\n3D-Speaker is released under the [Apache License 2.0](LICENSE).\n\n## Acknowledge\n3D-Speaker contains third-party components and code modified from some open-source repos, including: \u003cbr\u003e\n[Speechbrain](https://github.com/speechbrain/speechbrain), [Wespeaker](https://github.com/wenet-e2e/wespeaker), [D-TDNN](https://github.com/yuyq96/D-TDNN), [DINO](https://github.com/facebookresearch/dino), [Vicreg](https://github.com/facebookresearch/vicreg), [TalkNet-ASD\n](https://github.com/TaoRuijie/TalkNet-ASD), [Ultra-Light-Fast-Generic-Face-Detector-1MB](https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB), [pyannote.audio](https://github.com/pyannote/pyannote-audio)\n\n\n## Citations\nIf you find this repository useful, please consider giving a star :star: and citation :t-rex::\n```BibTeX\n@article{chen20243d,\n  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},\n  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},\n  booktitle={ICASSP},\n  year={2025}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodelscope%2F3d-speaker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmodelscope%2F3d-speaker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodelscope%2F3d-speaker/lists"}