https://github.com/k2kobayashi/crank
A toolkit for non-parallel voice conversion based on vector-quantized variational autoencoder
- Host: GitHub
- URL: https://github.com/k2kobayashi/crank
- Owner: k2kobayashi
- License: MIT
- Created: 2020-06-01T02:32:42.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-25T11:09:32.000Z (10 months ago)
- Last Synced: 2025-03-30T19:11:30.475Z (about 2 months ago)
- Topics: adversarial-learning, cyclic-constraints, speech-synthesis, vocoder, voice-conversion, vqvae
- Language: Python
- Size: 12.4 MB
- Stars: 171
- Watchers: 9
- Forks: 31
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE.txt
README
[CI](https://github.com/k2kobayashi/crank/actions/workflows/ci.yml)
[PyPI](https://badge.fury.io/py/crank-vc)

# crank
Non-parallel voice conversion based on vector-quantized variational autoencoder with adversarial learning
## Setup
- Install Python dependencies
```sh
$ git clone https://github.com/k2kobayashi/crank.git
$ cd crank/tools
$ make
```

- Install the MOSNet dependency
```sh
$ sudo apt install ffmpeg # mosnet dependency
```
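If you only need the Python library rather than the recipes, the PyPI badge above points to a published package; a minimal sketch, assuming the ```crank-vc``` release on PyPI is up to date:

```sh
# assumption: this installs the library only; recipes still require the git checkout
$ pip install crank-vc
```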
## Recipes

- English
  - VCC2020
  - VCC2018 (thanks to [@unilight](https://github.com/unilight))
- Japanese
  - jvs_ver1

### Conversion samples
Several converted audio samples from the VCC 2018 dataset are available [here](https://k2kobayashi.github.io/crankSamples/).
- [vcc2020v1](https://drive.google.com/file/d/1uInvCwggpBYmpplYxuIOidvJkPmav8kE/view?usp=sharing)
- [vcc2018v1](https://drive.google.com/file/d/1-Z_Y9pahPQcKR0rqdhu4elI6Hz686qX6/view?usp=sharing)
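To fetch these archives from a terminal, the third-party gdown tool can download from Google Drive by file ID (gdown is an assumption here, not part of crank; the output filename is arbitrary):

```sh
$ pip install gdown
# file ID taken from the vcc2020v1 link above; adjust the name to the actual file
$ gdown 1uInvCwggpBYmpplYxuIOidvJkPmav8kE -O vcc2020v1_samples.zip
```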
## Run VCC2020 recipe

crank provides a recipe for the Voice Conversion Challenge 2020 (VCC2020).
The recipe implements non-parallel voice conversion in eight stages (0-7); a full run is sketched after this list.

- stage 0
  - download dataset
- stage 1
  - initialization
  - generate scp files and figures used to determine speaker-dependent parameters
- stage 2
  - feature extraction
  - extract mlfb and mcep features
- stage 3
  - training
- stage 4
  - reconstruction
  - generate reconstructed features for fine-tuning a neural vocoder
- stage 5
  - evaluation
  - convert the evaluation waveforms
- stage 6
  - synthesis
  - synthesize waveforms with a pre-trained [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
  - synthesize waveforms with Griffin-Lim
- stage 7
  - objective evaluation
  - mel-cepstrum distortion
  - MOSNet
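A full run of all eight stages looks like the following (the job count is illustrative; for VCC2020 you actually start from stage 2, as described below):

```sh
$ cd egs/vaevc/vcc2020v1
$ ./run.sh --n_jobs 10 --stage 0 --stop_stage 7
```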
### Put dataset to downloads

Note that the dataset is only released to challenge participants (2020/05/26).
```sh
$ cd egs/vaevc/vcc2020v1
$ mkdir downloads && cd downloads
# adjust the source path to wherever you saved the distributed archives
$ mv /path/to/vcc2020_{training,evaluation}.zip .
$ unzip vcc2020_training.zip
$ unzip vcc2020_evaluation.zip
```

### Run feature extraction and model training
Because the challenge defines its own training and evaluation sets, configuration files are already prepared, so you need to run from stage 2.

```sh
$ ./run.sh --n_jobs 10 --stage 2 --stop_stage 5
```

where ```n_jobs``` indicates the number of CPU cores used in the training.
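Because ```--stage``` and ```--stop_stage``` bound the pipeline, any contiguous subset of stages can be rerun; for example, to repeat only training and reconstruction (stages 3-4) after a configuration change:

```sh
$ ./run.sh --n_jobs 10 --stage 3 --stop_stage 4
```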
## Configuration
Configurations are defined in ```conf/mlfb_vqvae.yml```.
The following are explanations of representative parameters.

- feature
When you create your own recipe, be careful to set the feature extraction parameters such as ```fs```, ```fftl```, ```hop_size```, ```framems```, ```shiftms```, and ```mcep_alpha```; these parameters depend on the sampling frequency (a quick check is sketched after this parameter list).
- feat_type
You can set ```feat_type``` to either ```mlfb``` or ```mcep```.
If you choose ```mlfb```, the converted waveforms are generated by either the Griffin-Lim vocoder or the ParallelWaveGAN vocoder.
If you choose ```mcep```, the converted waveforms are generated by the WORLD vocoder (i.e., excitation generation and MLSA filtering).

- trainer_type
We support training with ```vqvae```, ```lsgan```, ```cyclegan```, and ```stargan``` using the same generator network.
- ```vqvae```: default vqvae setting
- ```lsgan```: vqvae with adversarial learning
- ```cyclegan```: vqvae with adversarial learning and cyclic constraints
- ```stargan```: vqvae with adversarial learning similar to cyclegan
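To check the sampling frequency of your data before setting the feature parameters above, any audio inspector works; for example, with sox installed (an external tool, and an illustrative path):

```sh
$ soxi downloads/wav/spkr1/utt001.wav   # prints sample rate, channels, and duration
```

Switching ```trainer_type``` is a one-line edit of the config; a sketch, assuming the key appears as ```trainer_type:``` at the top level of ```conf/mlfb_vqvae.yml```:

```sh
$ sed -i 's/^trainer_type: .*/trainer_type: lsgan/' conf/mlfb_vqvae.yml
```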
## Create your recipe

### Copy recipe template
Copy the template directory to start creating your recipe.
```sh
# "your_recipe" is a placeholder for your new recipe's name
$ cp -r egs/vaevc/template egs/vaevc/your_recipe
$ cd egs/vaevc/your_recipe
```

### Put .wav files
You need to put the wav files in an appropriate directory.
You can either modify ```local/download.sh``` or place the wav files yourself.
In either case, the wav files should be grouped per speaker, as in
```downloads/wav/{spkr1, spkr2, ..., spkrN}/*.wav```.

If you modify ```local/download.sh```:
```sh
$ vim local/download.sh
```

If you put the wav files yourself:
```sh
$ mkdir -p downloads/wav
# "/path/to/your_wavs" is a placeholder for wherever your per-speaker wav directories live
$ mv /path/to/your_wavs/* downloads/wav/
$ touch downloads/.done
```
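Before running the recipe, you can sanity-check the layout (a plain shell illustration, not a crank command):

```sh
$ find downloads/wav -name '*.wav' | head   # should list files under per-speaker directories
```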
### Run initialization

The initialization process generates Kaldi-like scp files.
```sh
$ ./run.sh --stage 0 --stop_stage 1
```

Then modify the speaker-dependent parameters in ```conf/spkr.yml``` using the generated figures.
Pages 20-22 of the [slides](https://www.slideshare.net/NU_I_TODALAB/hands-on-voice-conversion) explain how to set these parameters.

### Run feature extraction, training, reconstruction, and evaluation
After preparing the configuration, run the remaining stages.
```sh
$ ./run.sh --stage 2 --stop_stage 7
```

## Citation
Please cite this paper when you use crank.
```
K. Kobayashi, W-C. Huang, Y-C. Wu, P.L. Tobing, T. Hayashi, T. Toda,
"crank: an open-source software for nonparallel voice conversion based on vector-quantized variational autoencoder",
Proc. ICASSP, 2021.
```

## Acknowledgements
Thanks to [@kan-bayashi](https://github.com/kan-bayashi) for many contributions and continuous encouragement.
## Who we are
- Kazuhiro Kobayashi [@k2kobayashi](https://github.com/k2kobayashi) [maintainer, design and development]
- Wen-Chin Huang [@unilight](https://github.com/unilight) [maintainer, design and development]
- [Tomoki Toda](https://sites.google.com/site/tomokitoda/) [advisor]