https://github.com/jefflai108/pytorch-kaldi-neural-speaker-embeddings
A lightweight neural speaker embedding extraction pipeline based on Kaldi and PyTorch.
kaldi learnable-dictionary-encoding pytorch speaker-identification speaker-recognition speaker-verification speech-processing
- Host: GitHub
- URL: https://github.com/jefflai108/pytorch-kaldi-neural-speaker-embeddings
- Owner: jefflai108
- License: bsd-3-clause
- Created: 2019-11-08T05:23:07.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-27T22:29:25.000Z (over 6 years ago)
- Last Synced: 2024-08-08T23:21:02.747Z (almost 2 years ago)
- Topics: kaldi, learnable-dictionary-encoding, pytorch, speaker-identification, speaker-recognition, speaker-verification, speech-processing
- Language: Perl
- Size: 9.35 MB
- Stars: 135
- Watchers: 8
- Forks: 34
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# pytorch-kaldi-neural-speaker-embeddings
**A lightweight neural speaker embedding extraction pipeline based on Kaldi and PyTorch.** \
The repository serves as a starting point for users to reproduce and experiment with several recent advances in the speaker recognition literature.
Kaldi is used for pre-processing and post-processing and PyTorch is used for training the neural speaker embeddings.
**I want to note that this repo is not meant to keep track of the state of the art in speaker recognition, and the models will most likely be considered outdated in a few months (or sooner :().**
This repository contains a PyTorch+Kaldi pipeline to reproduce the core results for:
* [Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System](https://arxiv.org/pdf/1804.05160.pdf)
* [A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8462025)
With some modifications, you can easily adapt the pipeline for:
* [Angular softmax for short-duration text-independent speaker verification](https://pdfs.semanticscholar.org/93f9/24fa3d9920edc10469b04c90960c03dce856.pdf)
* [Utterance-level aggregation for speaker recognition in the wild](https://arxiv.org/pdf/1902.10107.pdf)
* [Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition](https://arxiv.org/pdf/1906.07317.pdf)
* [Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification](https://arxiv.org/pdf/1906.08333.pdf)
If one wants to go further, take a look at our recent work on [multi-speaker text-to-speech](https://arxiv.org/pdf/1910.10838.pdf), where the same speaker embeddings are employed to model speaker characteristics in a text-to-speech system.
Lastly, kindly cite our paper(s) if you find this repository useful. Cite both if you are kind enough!
```
@article{villalba2019state,
title={State-of-the-art speaker recognition with neural network embeddings in nist sre18 and speakers in the wild evaluations},
author={Villalba, Jes{\'u}s and Chen, Nanxin and Snyder, David and Garcia-Romero, Daniel and McCree, Alan and Sell, Gregory and Borgstrom, Jonas and Garc{\'\i}a-Perera, Leibny Paola and Richardson, Fred and Dehak, R{\'e}da and others},
journal={Computer Speech \& Language},
pages={101026},
year={2019},
publisher={Elsevier}
}
```
```
@article{cooper2019zero,
title={Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings},
author={Cooper, Erica and Lai, Cheng-I and Yasuda, Yusuke and Fang, Fuming and Wang, Xin and Chen, Nanxin and Yamagishi, Junichi},
journal={arXiv preprint arXiv:1910.10838},
year={2019}
}
```
One should also check out the very nicely written [TensorFlow version](https://github.com/mycrazycracy/tf-kaldi-speaker) by [Yi Liu](http://yiliu.org.cn).
# Overview
Neural speaker embeddings: Encoder --> Pooling --> Classification \
LDE pooling method illustration:
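
The Encoder --> Pooling --> Classification structure can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the repository's model code: the class names `LDEPooling` and `SpeakerNet`, the small convolutional encoder (a stand-in for the deeper encoders used in the cited papers), and every layer size except the 512-dimensional embedding are hypothetical choices made to keep the example short. The pooling layer follows the general idea of learnable dictionary encoding: each frame is softly assigned to learnable dictionary components, and the weighted residuals are averaged over time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LDEPooling(nn.Module):
    """Illustrative learnable dictionary encoding over the time axis."""

    def __init__(self, feat_dim: int, num_components: int):
        super().__init__()
        # Learnable dictionary centers and per-component smoothing factors.
        self.centers = nn.Parameter(torch.randn(num_components, feat_dim))
        self.scales = nn.Parameter(torch.ones(num_components))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim)
        residuals = x.unsqueeze(2) - self.centers            # (B, T, C, D)
        sq_dist = residuals.pow(2).sum(dim=-1)               # (B, T, C)
        # Soft assignment of each frame to the dictionary components.
        weights = F.softmax(-self.scales * sq_dist, dim=-1)  # (B, T, C)
        # Weighted mean residual per component, aggregated over time.
        weighted = (weights.unsqueeze(-1) * residuals).sum(dim=1)  # (B, C, D)
        norm = weights.sum(dim=1).unsqueeze(-1) + 1e-8              # (B, C, 1)
        return (weighted / norm).flatten(start_dim=1)               # (B, C*D)


class SpeakerNet(nn.Module):
    """Encoder -> Pooling -> Classification, with hypothetical layer sizes."""

    def __init__(self, feat_dim=30, hidden_dim=64, num_components=8,
                 embed_dim=512, num_speakers=10):
        super().__init__()
        # Frame-level encoder (assumption: a tiny 1-D conv stack).
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.pooling = LDEPooling(hidden_dim, num_components)
        self.embedding = nn.Linear(hidden_dim * num_components, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, feat_dim); Conv1d expects (batch, channels, time).
        h = self.encoder(feats.transpose(1, 2)).transpose(1, 2)
        emb = self.embedding(self.pooling(h))  # utterance-level embedding
        return emb, self.classifier(emb)       # embedding and speaker logits


torch.manual_seed(0)
model = SpeakerNet()
feats = torch.randn(4, 200, 30)  # 4 utterances, 200 frames, 30-dim features
emb, logits = model(feats)
print(emb.shape, logits.shape)  # torch.Size([4, 512]) torch.Size([4, 10])
```

During training, the logits would feed a softmax-style loss over speaker identities; at test time only the embedding is kept and the classification layer is discarded.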
# Requirements
`pip install -r requirements.txt`
Please also [download and properly set up Kaldi](https://kaldi-asr.org/doc/install.html). If you are stuck in this phase, this repository is likely not for you.
# Getting Started
The bash file `pipeline.sh` contains the 12-stage speaker recognition pipeline, including feature extraction, neural model training, and decoding/evaluation. Each step is described in more detail inside `pipeline.sh`. To get started, simply run:
`./pipeline.sh`
# Datasets
The models are trained on [VoxCeleb I+II](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/), which is freely available for download (the trial lists are also there). One can easily adapt `pipeline.sh` for different datasets.
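
When adapting to another dataset, a common first step is to produce a Kaldi-style data directory. The sketch below assumes that the pipeline consumes standard Kaldi data directories (as Kaldi feature extraction normally does) and that the new corpus is laid out as `<root>/<speaker>/<utterance>.wav`; the function name `write_kaldi_data_dir` and that layout are hypothetical, so adjust them to your data.

```python
from pathlib import Path


def write_kaldi_data_dir(corpus_root, out_dir):
    """Write wav.scp and utt2spk for an assumed <root>/<speaker>/<utt>.wav layout."""
    corpus_root, out_dir = Path(corpus_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    entries = []
    for wav in corpus_root.glob("*/*.wav"):
        speaker = wav.parent.name
        # Prefixing the utterance ID with the speaker ID keeps the
        # utterance and speaker sort orders consistent, as Kaldi expects.
        utt_id = f"{speaker}-{wav.stem}"
        entries.append((utt_id, speaker, wav.resolve()))

    # Kaldi expects these files sorted by utterance ID (C-locale order).
    entries.sort(key=lambda e: e[0])

    with open(out_dir / "wav.scp", "w") as scp, open(out_dir / "utt2spk", "w") as u2s:
        for utt_id, speaker, path in entries:
            scp.write(f"{utt_id} {path}\n")
            u2s.write(f"{utt_id} {speaker}\n")
    return len(entries)


# Example call with hypothetical paths:
# write_kaldi_data_dir("/path/to/my_corpus", "data/my_corpus_train")
```

Kaldi's `utils/utt2spk_to_spk2utt.pl` and `utils/fix_data_dir.sh` can then be used to generate `spk2utt` and validate the directory before pointing the pipeline at it.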
# Pre-Trained Models
Due to YouTube's privacy policy, I am unfortunately not allowed to upload pre-trained models for VoxCeleb I+II.
# Benchmarking Speaker Verification EERs
| Embedding name | dimension | normalization | pooling type | train objective | EER (%) | minDCF (p=0.01) |
| :-----------------------:| :--------: | :----------: | :------------: | :----------------: | :---------: | :---------: |
| i-vectors | 400 | no | mean | EM | 5.329 | 0.493 |
| x-vectors | 512 | no | mean, std | Softmax | 3.298 | 0.343 |
| x-vectorsN | 512 | yes | mean, std | Softmax | 3.213 | 0.342 |
| LDE-1 | 512 | no | mean | Softmax | 3.415 | 0.366 |
| LDE-1N | 512 | yes | mean | Softmax | 3.446 | 0.365 |
| LDE-2 | 512 | no | mean | ASoftmax (m=2)| 3.674 | 0.364 |
| LDE-2N | 512 | yes | mean | ASoftmax (m=2)| 3.664 | 0.386 |
| LDE-3 | 512 | no | mean | ASoftmax (m=3)| **3.033** |**0.314** |
| LDE-3N | 512 | yes | mean | ASoftmax (m=3)| 3.171 | 0.327 |
| LDE-4 | 512 | no | mean | ASoftmax (m=4)| 3.112 | 0.315 |
| LDE-4N | 512 | yes | mean | ASoftmax (m=4)| 3.271 | 0.327 |
| LDE-5 | 256 | no | mean | ASoftmax (m=2)| 3.287 | 0.343 |
| LDE-5N | 256 | yes | mean | ASoftmax (m=2)| 3.367 | 0.351 |
| LDE-6 | 200 | no | mean | ASoftmax (m=2)| 3.266 | 0.396 |
| LDE-6N | 200 | yes | mean | ASoftmax (m=2)| 3.266 | 0.396 |
| LDE-7 | 512 | no | mean, std | ASoftmax (m=2)| **3.091** |**0.303** |
| LDE-7N | 512 | yes | mean, std | ASoftmax (m=2)| 3.171 | 0.328 |
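
For readers unfamiliar with the two metrics in the table above, the following sketch shows how an equal error rate and a minimum detection cost can be computed from verification trial scores. It is an illustration only, not the scoring code used to produce the table: the function names are hypothetical, and the cost parameters (target prior 0.01, unit costs for misses and false alarms, normalized cost) are assumptions based on the usual NIST-style definition.

```python
def error_rates(target_scores, nontarget_scores):
    """Yield (miss_rate, false_alarm_rate) for every candidate threshold."""
    # Every observed score is a candidate threshold; +inf rejects all trials.
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    for thr in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        yield miss, fa


def compute_eer(target_scores, nontarget_scores):
    """Equal error rate as a fraction (multiply by 100 for a percentage)."""
    # Pick the threshold where the miss and false-alarm rates are closest.
    miss, fa = min(error_rates(target_scores, nontarget_scores),
                   key=lambda r: abs(r[0] - r[1]))
    return (miss + fa) / 2


def compute_min_dcf(target_scores, nontarget_scores,
                    p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    # Normalize by the cost of the better trivial system (accept or reject all).
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min(
        (c_miss * miss * p_target + c_fa * fa * (1.0 - p_target)) / default_cost
        for miss, fa in error_rates(target_scores, nontarget_scores)
    )


# Toy scores: higher means "more likely the same speaker".
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(compute_eer(targets, nontargets))                # 0.25
print(round(compute_min_dcf(targets, nontargets), 4))  # 0.25
```

With the low target prior, false alarms dominate the detection cost, which is why two systems with similar EERs can still differ noticeably in minimum DCF.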
# Using Speaker Embeddings for Tacotron2 Speaker Adaptation
# Speaker Embedding Space Visualization (cluster by speakers)
**i-vectors (baseline)**
**LDE**
# Benchmarking TTS MOS scores
| Embedding name | Naturalness dev | Naturalness test | Similarity dev | Similarity test |
| :-----------------------:| :-------------: | :--------------: | :------------: | :-------------: |
| vocoded | 3.41 | 3.55 | 2.79 | 2.82 |
| x-vectorsN | 3.19 | 3.19 | 1.86 | 2.37 |
| LDE-1 | 3.16 | 3.21 | **2.05** | 2.34 |
| LDE-1N | 3.13 | 3.46 | 1.97 | 2.45 |
| LDE-2 | 3.28 | 3.35 | 2.00 | 2.37 |
| LDE-2N | 3.19 | 3.33 | 2.00 | 2.35 |
| LDE-3 | 3.24 | **3.48** | 1.88 | **2.46** |
| LDE-3N | 3.16 | 3.33 | 2.00 | 2.37 |
| LDE-4 | 3.10 | 3.29 | 2.00 | 2.31 |
| LDE-4N | 3.20 | 3.29 | 1.98 | 2.39 |
| LDE-5 | 3.26 | 3.40 | 1.99 | 2.45 |
| LDE-5N | 3.07 | 3.37 | 2.02 | 2.41 |
| LDE-6 | 3.25 | 3.33 | 1.95 | 2.43 |
| LDE-6N | **3.29** | 3.23 | 1.94 | 2.39 |
| LDE-7 | 3.03 | 3.18 | 1.86 | 2.28 |
| LDE-7N | 3.02 | 3.24 | 2.02 | 2.42 |
# Credits
Base code written by [Nanxin Chen](https://github.com/bobchennan), Johns Hopkins University \
Experiments done by [Cheng-I Lai](http://people.csail.mit.edu/clai24/), MIT