https://github.com/manojpamk/pytorch_xvectors
Deep speaker embeddings in PyTorch, including x-vectors. Code used in this work: https://arxiv.org/abs/2007.16196
speaker-diarization speaker-embeddings speaker-recognition speaker-verification
- Host: GitHub
- URL: https://github.com/manojpamk/pytorch_xvectors
- Owner: manojpamk
- License: mit
- Created: 2020-03-25T06:34:48.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-11T15:58:11.000Z (about 4 years ago)
- Last Synced: 2024-08-02T08:06:57.740Z (6 months ago)
- Topics: speaker-diarization, speaker-embeddings, speaker-recognition, speaker-verification
- Language: Python
- Homepage:
- Size: 356 KB
- Stars: 302
- Watchers: 8
- Forks: 65
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-speaker-embedding - pytorch_xvectors - vectors. (Code/Tools/Frameworks/Libraries)
README
## Deep speaker embeddings in PyTorch

* [Requirements:](#requirements)
    * [Other Tools:](#other-tools)
* [Installation:](#installation)
* [Data preparation](#data-preparation)
    * [Training data preparation](#training-data-preparation)
    * [Dataset for data augmentation](#dataset-for-data-augmentation)
* [Training](#training)
* [Embedding extraction](#embedding-extraction)
* [Pretrained model](#pretrained-model)
    * [Downloading](#downloading)
    * [Speaker Verification](#speaker-verification)
    * [Speaker Diarization](#speaker-diarization)
* [Results](#results)
    * [1. Speaker Verification (%EER)](#1-speaker-verification-eer)
    * [2. Speaker Diarization (%DER)](#2-speaker-diarization-der)

This repository contains code and models for training an x-vector speaker recognition model, using Kaldi for feature preparation and PyTorch for DNN model training. The MFCC feature configuration and TDNN model architecture follow the Voxceleb recipe in Kaldi (commit hash `9b4dc93c9`). Training procedures, including the optimizer and step count, are similar to, but not exactly the same as, those in Kaldi.
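For readers new to x-vectors, the step that turns a variable-length sequence of frame-level TDNN outputs into a single fixed-length vector is statistics pooling. The snippet below is a minimal pure-Python sketch of that idea, not the code in this repository; the function name `statistics_pooling` and the toy values are assumptions made for illustration, and the population (rather than sample) standard deviation is an assumed choice.

```python
import math


def statistics_pooling(frames):
    """Collapse a list of frame vectors into one fixed-length vector.

    frames: list of equal-length lists, one per frame.
    Returns the per-dimension mean followed by the per-dimension
    population standard deviation, so the output has 2 * dim entries.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Three 2-dimensional frame-level activations (toy values).
pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
print(pooled)
```

The output length depends only on the feature dimension, not on the number of frames, which is what lets utterances of different durations share the same segment-level layers.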
Additionally, code for training meta-learning embeddings is available in [train_proto.py](train_proto.py) and [train_relation.py](train_relation.py). An overview of these models is available at [https://arxiv.org/abs/2007.16196](https://arxiv.org/abs/2007.16196) and in the figure below:
![Overview: Meta Learning Models](figs/meta_learning_arch.png)
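As a rough illustration of the prototypical-network idea behind [train_proto.py](train_proto.py), the sketch below averages a few support embeddings per speaker into a prototype and classifies a query by its distance to each prototype. It follows the generic formulation of prototypical networks and is not taken from this repository; the squared Euclidean distance, the function names and the toy embeddings are all assumptions.

```python
import math


def prototypes(support):
    """Average the support embeddings of each speaker into one prototype."""
    protos = {}
    for speaker, embeddings in support.items():
        dim = len(embeddings[0])
        protos[speaker] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return protos


def class_posteriors(query, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    logits = {
        spk: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for spk, proto in protos.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {spk: math.exp(v - peak) for spk, v in logits.items()}
    total = sum(exps.values())
    return {spk: v / total for spk, v in exps.items()}


# Toy 2-dimensional embeddings: two support examples per speaker.
support = {
    "spk_a": [[0.0, 0.0], [0.0, 2.0]],
    "spk_b": [[4.0, 4.0], [6.0, 4.0]],
}
protos = prototypes(support)
posteriors = class_posteriors([1.0, 1.0], protos)
print(posteriors)
```

During training, the negative log of the posterior for the query's true speaker would serve as the loss for each episode.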
### Citation
If you found this toolkit useful in your research, consider citing the following:
```
@misc{kumar2020designing,
title={Designing Neural Speaker Embeddings with Meta Learning},
author={Manoj Kumar and Tae Jin-Park and Somer Bishop and Catherine Lord and Shrikanth Narayanan},
year={2020},
eprint={2007.16196},
archivePrefix={arXiv}
}
```

### Requirements:
Python Libraries
```
python==3.6.10
torch==1.4.0
kaldiio==2.15.1
kaldi-python-io==1.0.4
```

##### Other Tools:
* Spectral Clustering using normalized maximum eigengap [GitHub](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering)
    * Used for speaker clustering during diarization
* Diarization scoring tool [GitHub](https://github.com/nryant/dscore)
    * Used for computing diarization error rate (DER)

### Installation:
* Install the Python libraries listed in [Requirements](#requirements).
* Install the [Kaldi toolkit](https://github.com/kaldi-asr/kaldi/blob/master/INSTALL).
    * This repository is tested with commit hash `9b4dc93c9` of the above [Kaldi repository](https://github.com/kaldi-asr/kaldi/blob/master/INSTALL).
    * Installing Kaldi in `$HOME/kaldi` is recommended.
* Download this repository. NOTE: The destination need not be inside the Kaldi installation.
* Set the `voxcelebDir` variable inside [pytorch_run.sh](pytorch_run.sh).
* (Optional) Install the Other Tools listed in [Requirements](#requirements).

### Data preparation
#### Training data preparation
* Training features are expected in Kaldi nnet3 egs format, and read using the `nnet3EgsDL` class defined in [train_utils.py](train_utils.py).
    * The voxceleb recipe is provided in [pytorch_run.sh](pytorch_run.sh) to prepare them.
* Extracted embeddings are written in Kaldi vector format, similar to `xvector.ark`.

#### Dataset for data augmentation
The [pytorch_run.sh](pytorch_run.sh) script augments the training data using the following two datasets.
* Download [MUSAN](https://openslr.org/17/) and extract to `./musan`.
* Download [RIRS_NOISES](https://openslr.org/28/) and extract to `./RIRS_NOISES`.

### Training
```
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 train_xent.py
```
```
usage: train_xent.py [-h] [--local_rank LOCAL_RANK] [-modelType MODELTYPE]
[-featDim FEATDIM] [-resumeTraining RESUMETRAINING]
[-resumeModelDir RESUMEMODELDIR]
[-numArchives NUMARCHIVES] [-numSpkrs NUMSPKRS]
[-logStepSize LOGSTEPSIZE] [-batchSize BATCHSIZE]
[-numEgsPerArk NUMEGSPERARK]
[-preFetchRatio PREFETCHRATIO]
[-optimMomentum OPTIMMOMENTUM] [-baseLR BASELR]
[-maxLR MAXLR] [-numEpochs NUMEPOCHS]
[-noiseEps NOISEEPS] [-pDropMax PDROPMAX]
[-stepFrac STEPFRAC]
egsDir

positional arguments:
egsDir Directory with training archives

optional arguments:
-h, --help show this help message and exit
--local_rank LOCAL_RANK
-modelType MODELTYPE Refer train_utils.py
-featDim FEATDIM Frame-level feature dimension
-resumeTraining RESUMETRAINING
(1) Resume training, or (0) Train from scratch
-resumeModelDir RESUMEMODELDIR
Path containing training checkpoints
-numArchives NUMARCHIVES
Number of egs.*.ark files
-numSpkrs NUMSPKRS Number of output labels
-logStepSize LOGSTEPSIZE
Iterations per log
-batchSize BATCHSIZE Batch size
-numEgsPerArk NUMEGSPERARK
Number of training examples per egs file
-preFetchRatio PREFETCHRATIO
xbatchSize to fetch from dataloader
-optimMomentum OPTIMMOMENTUM
Optimizer momentum
-baseLR BASELR Initial LR
-maxLR MAXLR Maximum LR
-numEpochs NUMEPOCHS Number of training epochs
-noiseEps NOISEEPS Noise strength before pooling
-pDropMax PDROPMAX Maximum dropout probability
-stepFrac STEPFRAC Training iteration when dropout = pDropMax
```
`egsDir` contains the nnet3 egs files.

### Embedding extraction
```
usage: extract.py [-h] [-modelType MODELTYPE] [-numSpkrs NUMSPKRS]
modelDirectory featDir embeddingDir

positional arguments:
modelDirectory Directory containing the model checkpoints
featDir Directory containing features ready for extraction
embeddingDir Output directory

optional arguments:
-h, --help show this help message and exit
-modelType MODELTYPE Refer train_utils.py
-numSpkrs NUMSPKRS Number of output labels for model
```
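Because the extracted embeddings are stored as a Kaldi vector archive, they can be read back in Python with `kaldiio` (listed in Requirements) and compared directly. The following is a minimal sketch rather than part of this repository: `load_embeddings` and `cosine_score` are hypothetical helper names, the archive path in the comment is an assumption, and cosine scoring is used here only as a simple stand-in for the PLDA scoring that the recipe performs.

```python
import math


def load_embeddings(ark_path):
    """Read a Kaldi vector archive into a {utterance_id: vector} dict.

    kaldiio is imported lazily so that cosine_score below can be used
    without it being installed.
    """
    import kaldiio  # listed in Requirements

    return {utt: vec for utt, vec in kaldiio.load_ark(ark_path)}


def cosine_score(x, y):
    """Cosine similarity between two embeddings of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical usage with an extracted archive (path is an assumption):
# embeddings = load_embeddings("xvectors/xvec_preTrained/test/xvector.ark")

# Toy vectors standing in for two embeddings.
print(cosine_score([1.0, 0.0], [1.0, 1.0]))
```

Higher scores indicate that the two utterances are more likely to come from the same speaker; a decision threshold would still need to be tuned on held-out trials.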
The script [pytorch_run.sh](pytorch_run.sh) can be used to train embeddings with the voxceleb recipe end to end.

### Pretrained model
#### Downloading
Two ways to download the pre-trained model:
1. Google Drive [link](https://drive.google.com/file/d/1gbAWDdWN_pkOim4rWVXUlfuYjfyJqUHZ/view?usp=sharing) *(or)*
2. Command line ([reference](https://medium.com/@acpanjan/download-google-drive-files-using-wget-3c2c025a8b99))
```
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gbAWDdWN_pkOim4rWVXUlfuYjfyJqUHZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gbAWDdWN_pkOim4rWVXUlfuYjfyJqUHZ" -O preTrainedModel.zip && rm -rf /tmp/cookies.txt
```

#### Speaker Verification
To reproduce the voxceleb EER results with the pretrained model, follow the steps below.

NOTE: The voxceleb features must be prepared using `prepare_feats_for_egs.sh` prior to evaluation.

1) Extract `models/` and `xvectors/` from the pre-trained archive into the installation directory
2) Set the following variables in [pytorch_run.sh](pytorch_run.sh):
```
modelDir=models/xvec_preTrained
trainFeatDir=data/train_combined_no_sil
trainXvecDir=xvectors/xvec_preTrained/train
testFeatDir=data/voxceleb1_test_no_sil
testXvecDir=xvectors/xvec_preTrained/test
```
3) Extract embeddings and compute EER, minDCF. Set `stage=7` in [pytorch_run.sh](pytorch_run.sh) and execute:
```
bash pytorch_run.sh
```
4) Alternatively, a pretrained PLDA model is available inside the `xvectors/train` directory. Set `stage=9` in [pytorch_run.sh](pytorch_run.sh) and execute:
```
bash pytorch_run.sh
```
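For reference, the equal error rate (EER) reported by these steps is the operating point at which the false-rejection and false-acceptance rates coincide. The sketch below computes it from two lists of trial scores by sweeping a threshold; it is a simplified illustration and not the tool used by the recipe, and the function name and toy scores are assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction by sweeping a threshold over all scores.

    A trial is accepted when its score is >= the threshold. The EER is
    read off at the threshold where the false-rejection and
    false-acceptance rates are closest, and reported as their average.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # Fraction of same-speaker trials wrongly rejected.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # Fraction of different-speaker trials wrongly accepted.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer


# Toy scores: four same-speaker trials and four different-speaker trials.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer_percent = 100 * equal_error_rate(targets, nontargets)
print(eer_percent)
```

With these toy scores, one of four target trials falls below the best threshold and one of four non-target trials lies at or above it, so both error rates are 25%.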
#### Speaker Diarization
```
cd egs/
```
Place the audio files to diarize and their corresponding RTTM files in `demo_wav/` and `demo_rttm/` directories. Execute:
```
bash diarize.sh
```

### Results
#### 1. Speaker Verification (%EER)
| | Kaldi | pytorch_xvectors |
|:-------------|:-------------:|:-----:|
| Vox1-test | 3.13 | 2.82 |
| VOICES-dev | 10.30 | 8.59 |

#### 2. Speaker Diarization (%DER)
NOTE: Clustering is performed using [https://github.com/tango4j/Auto-Tuning-Spectral-Clustering](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering)

| | Kaldi | pytorch_xvectors |
|:-------------|:-------------:|:-----:|
| DIHARD2 dev (no collar, oracle #spk) | 26.97 | 27.50 |
| DIHARD2 dev (no collar, est #spk) | 24.49 | 24.66 |
| AMI dev+test (26 meetings, collar, oracle #spk) | 6.39 | 6.30 |
| AMI dev+test (26 meetings, collar, est #spk) | 7.29 | 10.14 |
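
To make the metric in the table concrete, diarization error rate (DER) is the sum of missed speech, false-alarm speech and speaker confusion, divided by the total reference speech. The frame-level sketch below is a simplification and not the dscore tool used for these results: it assumes one speaker per frame (no overlap), hypothesis labels already mapped to reference labels, and no collar; the function name and toy labels are illustrative.

```python
def frame_level_der(reference, hypothesis):
    """DER in percent from per-frame speaker labels (None means non-speech)."""
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1  # reference speech that the system left unlabeled
            elif hyp != ref:
                confusion += 1  # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1  # system speech where the reference is silent
    return 100.0 * (missed + false_alarm + confusion) / speech


# Ten toy frames: 7 reference speech frames, with 1 miss, 1 false alarm
# and 1 confusion in the hypothesis.
ref = ["A", "A", "A", None, "B", "B", "B", "B", None, None]
hyp = ["A", "A", None, "A", "B", "A", "B", "B", None, None]
der = frame_level_der(ref, hyp)
print(der)
```

A full scorer additionally finds the optimal mapping between hypothesis and reference speakers and handles overlapping speech, which this sketch leaves out.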