{"id":15636425,"url":"https://github.com/astorfi/3d-convolutional-speaker-recognition-pytorch","last_synced_at":"2025-06-23T15:33:51.723Z","repository":{"id":108139560,"uuid":"162761435","full_name":"astorfi/3D-convolutional-speaker-recognition-pytorch","owner":"astorfi","description":":speaker: Deep Learning \u0026 3D Convolutional Neural Networks for Speaker Verification","archived":false,"fork":false,"pushed_at":"2019-01-08T20:07:06.000Z","size":32848,"stargazers_count":123,"open_issues_count":9,"forks_count":25,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-15T11:54:01.328Z","etag":null,"topics":["convolutional-neural-networks","python","pytorch","speaker-verification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/astorfi.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-21T22:04:00.000Z","updated_at":"2024-03-25T06:57:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"52028112-370c-40ee-a028-29890335483f","html_url":"https://github.com/astorfi/3D-convolutional-speaker-recognition-pytorch","commit_stats":{"total_commits":13,"total_committers":1,"mean_commits":13.0,"dds":0.0,"last_synced_commit":"655d11d65c383ebb571682daf6202834892c02fb"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/astorfi/3D-convolutional-speaker-recognition-pytorch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2F3D-convolutional-
speaker-recognition-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2F3D-convolutional-speaker-recognition-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2F3D-convolutional-speaker-recognition-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2F3D-convolutional-speaker-recognition-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/astorfi","download_url":"https://codeload.github.com/astorfi/3D-convolutional-speaker-recognition-pytorch/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2F3D-convolutional-speaker-recognition-pytorch/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261504854,"owners_count":23168915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolutional-neural-networks","python","pytorch","speaker-verification"],"created_at":"2024-10-03T11:03:49.601Z","updated_at":"2025-06-23T15:33:51.665Z","avatar_url":"https://github.com/astorfi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. 
image:: readme_images/follow-twitter.gif\n   :height: 100px\n   :width: 200 px\n   :scale: 50 %\n   :alt: alternate text\n   :align: right\n   :target: https://twitter.com/amirsinatorfi\n\n=============================================================================================\n3D Convolutional Neural Networks for Speaker Verification - `Official Project Page`_\n=============================================================================================\n\n.. image:: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat\n    :target: https://github.com/astorfi/3D-convolutional-speaker-recognition/pulls\n.. image:: https://badges.frapsoft.com/os/v2/open-source.svg?v=102\n    :target: https://github.com/ellerbrock/open-source-badge/\n.. image:: https://img.shields.io/twitter/follow/amirsinatorfi.svg?label=Follow\u0026style=social\n      :target: https://twitter.com/amirsinatorfi\n    \n==============================\nTable of Contents\n==============================\n.. contents::\n  :local:\n  :depth: 4\n\n\nThis repository contains the PyTorch code release for our paper titled *\"Text-Independent\nSpeaker Verification Using 3D Convolutional Neural Networks\"*. The link to the paper_ is\nprovided as well.\n\n\n.. _Official Project Page: https://codeocean.com/2017/08/01/3d-convolutional-neural-networks-for-speaker-recognition/code\n\n.. _paper: https://arxiv.org/abs/1705.09422\n.. _Pytorch: https://pytorch.org\n\nThe code has been developed using Pytorch_. The input pipeline must be prepared by the users.\nThis code aims to provide an implementation of Speaker Verification (SV) using 3D convolutional neural networks\nfollowing the SV protocol.\n\n.. image:: readme_images/conv_gif.gif\n    :target: https://github.com/astorfi/3D-convolutional-speaker-recognition/blob/master/_images/conv_gif.gif\n\n------------\nCitation\n------------\n\nIf you use this code, please consider citing the following paper:\n\n.. 
code:: shell\n\n    @article{torfi2017text,\n      title={Text-independent speaker verification using 3d convolutional neural networks},\n      author={Torfi, Amirsina and Nasrabadi, Nasser M and Dawson, Jeremy},\n      journal={arXiv preprint arXiv:1705.09422},\n      year={2017}\n    }\n\n--------------\nGeneral View\n--------------\n\nWe leveraged a 3D convolutional architecture to create the speaker model, simultaneously\ncapturing the speech-related and temporal information from the speakers' utterances.\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nSpeaker Verification Protocol (SVP)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn this work, a 3D Convolutional Neural Network (3D-CNN)\narchitecture has been utilized for text-independent speaker\nverification in three phases.\n\n     1. In the **development phase**, a CNN is trained\n        to classify speakers at the utterance level.\n\n     2. In the **enrollment phase**, the trained network is utilized to directly create a\n        speaker model for each speaker based on the extracted features.\n\n     3. Finally, in the **evaluation phase**, the extracted features\n        from the test utterance are compared to the stored speaker\n        model to verify the claimed identity.\n\nThe aforementioned three phases are usually referred to as the SV protocol. One of the main\nchallenges is the creation of the speaker models. 
Previously reported approaches create\nspeaker models based on averaging the extracted features from utterances of the speaker,\nwhich is known as the d-vector system.\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nHow to leverage 3D Convolutional Neural Networks?\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn our paper, we propose using 3D-CNNs for direct speaker model creation,\nin which, for both the development and enrollment phases, an identical number of\nspeaker utterances is fed to the network to represent the spoken utterances\nand create the speaker model. This leads to simultaneously capturing the\nspeaker-related information and building a more robust system to cope with\nwithin-speaker variation. We demonstrate that the proposed method significantly\noutperforms the d-vector verification system.\n\n--------------------\nDataset\n--------------------\n\nUnlike the `Original Implementation \u003chttps://github.com/astorfi/3D-convolutional-speaker-recognition\u003e`_, here we use the publicly available `VoxCeleb \u003chttp://www.robots.ox.ac.uk/~vgg/data/voxceleb/\u003e`_ dataset. The dataset contains annotated audio files. For speaker verification, however, the parts of the audio associated with the subject of interest must be extracted from the raw audio files.\n\nThree steps should be taken to prepare the data after downloading the dataset files.\n\n  1. Extract the specific part of the audio in which the subject of interest is speaking. [`extract_audio.py \u003chttps://github.com/astorfi/3D-convolutional-speaker-recognition-pytorch/blob/master/code/0-data_preparation/0-extract_audio/extract_audio.py\u003e`_]\n  2. Create the train/test phases. [`create_phases.py \u003chttps://github.com/astorfi/3D-convolutional-speaker-recognition-pytorch/blob/master/code/0-data_preparation/2-create_phases/create_phases.py\u003e`_]\n  3. Apply Voice Activity Detection to remove the silence. 
[`vad.py \u003chttps://github.com/astorfi/3D-convolutional-speaker-recognition-pytorch/blob/master/code/0-data_preparation/3-VAD/vad.py\u003e`_]\n  \n\nThe dataset object, together with the necessary preprocessing and feature extraction, is created in the following data class:\n\n.. code:: python\n\n    # Assumed imports (not shown in the original snippet); wav._read_riff_chunk is a private SciPy helper.\n    import os\n    import scipy.io.wavfile as wav\n\n    class AudioDataset():\n        \"\"\"Audio dataset.\"\"\"\n\n        def __init__(self, files_path, audio_dir, transform=None):\n            \"\"\"\n            Args:\n                files_path (string): Path to the .txt file that lists the audio file paths.\n                audio_dir (string): Directory with all the audio files.\n                transform (callable, optional): Optional transform to be applied\n                    on a sample.\n            \"\"\"\n\n            # self.sound_files = [x.strip() for x in content]\n            self.audio_dir = audio_dir\n            self.transform = transform\n\n            # Open the .txt file and create a list from each line.\n            with open(files_path, 'r') as f:\n                content = f.readlines()\n            # you may also want to remove whitespace characters like `\\n` at the end of each line\n            list_files = []\n            for x in content:\n                # The second whitespace-separated field of each line is the relative file path.\n                sound_file_path = os.path.join(self.audio_dir, x.strip().split()[1])\n                try:\n                    with open(sound_file_path, 'rb') as f:\n                        riff_size, _ = wav._read_riff_chunk(f)\n                        file_size = os.path.getsize(sound_file_path)\n\n                    # Reject files whose RIFF size does not match the size on disk or that are too small.\n                    assert riff_size == file_size and os.path.getsize(sound_file_path) \u003e 1000, \"Bad file!\"\n\n                    # Add to list if file is OK!\n                    list_files.append(x.strip())\n                except Exception:\n                    print('file %s is corrupted!' 
% sound_file_path)\n\n            # Save the correct and healthy sound files to a list.\n            self.sound_files = list_files\n\n        def __len__(self):\n            return len(self.sound_files)\n\n        def __getitem__(self, idx):\n            # Get the sound file path\n            sound_file_path = os.path.join(self.audio_dir, self.sound_files[idx].split()[1])\n            # (the rest of this method is not shown in this README)\n\n\n--------------------\nCode Implementation\n--------------------\n\nThe input pipeline must be provided by the user. Please refer to ``code/0-input/input_feature.py`` to get an idea of how the input pipeline works.\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nInput Pipeline for this work\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. image:: readme_images/Speech_GIF.gif\n    :target: https://github.com/astorfi/3D-convolutional-speaker-recognition/blob/master/_images/Speech_GIF.gif\n\nThe MFCC features can be used as the data representation of the spoken utterances at the frame level. However, a\ndrawback is their non-local characteristics due to the final DCT operation used to generate MFCCs. This operation disturbs the locality property and conflicts with the local characteristics of convolutional operations. The approach employed in this work is to use the log-energies, which we\ncall MFECs. The extraction of MFECs is similar to that of MFCCs,\nexcept that the DCT operation is discarded. The temporal features are\noverlapping 20 ms windows with a stride of 10 ms, which are\nused for the generation of spectrum features. From a 0.8-second\nsound sample, 80 temporal feature sets (each consisting of\n40 MFEC features) can be obtained, which form the input\nspeech feature map. 
Each input feature map has a dimensionality\nof ζ × 80 × 40, which is formed from 80 input\nframes and their corresponding spectral features, where ζ is\nthe number of utterances used in modeling the speaker during\nthe development and enrollment stages.\n\n\n\nThe **speech features** have been extracted using the [SpeechPy]_ package.\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nImplementation of 3D Convolutional Operation\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe following script has been used for our\nimplementation:\n\n.. code:: python\n\n        self.conv11 = nn.Conv3d(1, 16, (4, 9, 9), stride=(1, 2, 1))\n        self.conv11_bn = nn.BatchNorm3d(16)\n        self.conv11_activation = torch.nn.PReLU()\n        self.conv12 = nn.Conv3d(16, 16, (4, 9, 9), stride=(1, 1, 1))\n        self.conv12_bn = nn.BatchNorm3d(16)\n        self.conv12_activation = torch.nn.PReLU()\n        self.conv21 = nn.Conv3d(16, 32, (3, 7, 7), stride=(1, 1, 1))\n        self.conv21_bn = nn.BatchNorm3d(32)\n        self.conv21_activation = torch.nn.PReLU()\n        self.conv22 = nn.Conv3d(32, 32, (3, 7, 7), stride=(1, 1, 1))\n        self.conv22_bn = nn.BatchNorm3d(32)\n        self.conv22_activation = torch.nn.PReLU()\n        self.conv31 = nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 1, 1))\n        self.conv31_bn = nn.BatchNorm3d(64)\n        self.conv31_activation = torch.nn.PReLU()\n        self.conv32 = nn.Conv3d(64, 64, (3, 5, 5), stride=(1, 1, 1))\n        self.conv32_bn = nn.BatchNorm3d(64)\n        self.conv32_activation = torch.nn.PReLU()\n        self.conv41 = nn.Conv3d(64, 128, (3, 3, 3), stride=(1, 1, 1))\n        self.conv41_bn = nn.BatchNorm3d(128)\n        self.conv41_activation = torch.nn.PReLU()\n\n\nAs can be seen, ``nn.Conv3d`` is used, with the 3D kernel size and the stride each passed as a\nthree-element tuple, e.g. ``(4, 9, 9)`` and ``stride=(1, 2, 1)``. Each convolution is followed by\n``nn.BatchNorm3d`` and a ``PReLU`` activation. 
Please refer to the official Documentation_ for further details.\n\n.. _Documentation: https://pytorch.org/docs/stable/nn.html\n\n\n--------\nLicense\n--------\n\nThe license is as follows:\n\n.. code:: shell\n\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"{}\"\n      replaced with your own identifying information. (Don't include the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright {2017} {Amirsina Torfi}\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n\n\nPlease refer to the LICENSE_ file for further details.\n\n.. _LICENSE: https://github.com/astorfi/3D-convolutional-speaker-recognition/blob/master/LICENSE\n\n\n-------------\nContribution\n-------------\n\nWe look forward to your feedback. Please help us improve the code and make\nour work better. To contribute, please create a pull request and we will review it promptly.\nOnce again, we appreciate your feedback and code inspections.\n\n\n.. rubric:: references\n\n.. [SpeechPy] Amirsina Torfi. 2017. 
astorfi/speech_feature_extraction: SpeechPy. Zenodo. doi:10.5281/zenodo.810392.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastorfi%2F3d-convolutional-speaker-recognition-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastorfi%2F3d-convolutional-speaker-recognition-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastorfi%2F3d-convolutional-speaker-recognition-pytorch/lists"}