https://github.com/maum-ai/assem-vc
Official Code for Assem-VC @ICASSP2022
- Host: GitHub
- URL: https://github.com/maum-ai/assem-vc
- Owner: maum-ai
- License: bsd-3-clause
- Created: 2021-04-01T07:43:14.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-05-16T02:44:30.000Z (over 2 years ago)
- Last Synced: 2024-06-07T14:29:28.934Z (5 months ago)
- Topics: deep-learning, pytorch, speech-synthesis, voice-conversion
- Language: Jupyter Notebook
- Homepage: https://mindslab-ai.github.io/assem-vc/
- Size: 80.9 MB
- Stars: 262
- Watchers: 18
- Forks: 38
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Assem-VC — Official PyTorch Implementation
![](./docs/images/overall.png)
#### **Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques**
Kang-wook Kim, Seung-won Park, Junhyeok Lee, Myun-chul Joe @ [MINDsLab Inc.](https://maum.ai/), SNU
**Accepted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022**
Paper: https://arxiv.org/abs/2104.00931
Audio Samples: https://mindslab-ai.github.io/assem-vc/

Update: Enjoy our pre-trained model with [Google Colab notebook](https://colab.research.google.com/drive/1rj0d2Xfl0s9TmtHSrJt-8J-eVT6eOgS5?usp=sharing)!
Abstract: *In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models.*
---
![](./docs/images/singing_overall.png)

#### **Controllable and Interpretable Singing Voice Decomposition via Assem-VC**
Kang-wook Kim, Junhyeok Lee @ [MINDsLab Inc.](https://maum.ai/), SNU
**Accepted to NeurIPS Workshop on ML for Creativity and Design 2021 (Oral)**
Paper: https://arxiv.org/abs/2110.12676
Audio Samples: https://mindslab-ai.github.io/assem-vc/singer/

Abstract: *We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With decomposed speaker-independent information and the target speaker's embedding, we could synthesize the singing voice of the target speaker. In conclusion, we made a perfectly synced duet with the user's singing voice and the target singer's converted singing voice.*
## Requirements
This repository was tested with the following environment:
- Python 3.6.8
- [PyTorch](https://pytorch.org/) 1.4.0
- [PyTorch Lightning](https://github.com/PytorchLightning/pytorch-lightning) 1.0.3
- The requirements are listed in [requirements.txt](./requirements.txt).

## Clone our Repository
```bash
git clone --recursive https://github.com/mindslab-ai/assem-vc
cd assem-vc
```

## Datasets
### Preparing Data
- To reproduce the results from our paper, you need to download:
  - LibriTTS train-clean-100 split [tar.gz link](http://www.openslr.org/resources/60/train-clean-100.tar.gz)
  - [VCTK dataset (Version 0.80)](https://datashare.ed.ac.uk/handle/10283/2651)
- Unzip each archive and place the contents under `datasets/`.
- Resample them to 22.05 kHz using `datasets/resample.py`.
```bash
python datasets/resample.py
```
Note that `datasets/resample.py` is hard-coded to delete the original wav files in `datasets/` and replace them with resampled ones,
renaming each `*.wav` to `*-22k.wav` (a minimal resampling sketch is shown after this list).
- You can use `datasets/resample_delete.sh` instead of `datasets/resample.py`; it serves the same purpose.
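For reference, the resampling step conceptually amounts to the Python sketch below. This is an illustration only: it assumes `librosa` and `soundfile` are available, and the actual `datasets/resample.py` may differ in how it discovers, converts, and deletes files.

```python
# Conceptual resampling sketch; the real datasets/resample.py may differ in details
# such as file discovery and parallelism.
import glob
import os

import librosa
import soundfile as sf

for path in glob.glob("datasets/**/*.wav", recursive=True):
    audio, _ = librosa.load(path, sr=22050)                    # load and resample to 22.05 kHz
    sf.write(path.replace(".wav", "-22k.wav"), audio, 22050)   # write the *-22k.wav version
    os.remove(path)                                            # original file is removed, as noted above
```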
### Preparing Metadata

Following the format from [NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2), the metadata should be formatted like:
```
path_to_wav|transcription|speaker_id
path_to_wav|transcription|speaker_id
...
```

When you want to train and run inference with phonemes, the transcriptions should contain only unstressed [ARPABET](https://en.wikipedia.org/wiki/ARPABET) symbols.
Metadata containing ARPABET for LibriTTS train-clean-100 split and VCTK corpus are already prepared at `datasets/metadata`.
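For illustration, a metadata file in this format can be read with a few lines of Python. This is only a sketch: the function name and the example path below are placeholders, and the repository's own loading logic lives in `datasets/text_mel_dataset.py`.

```python
# Sketch of reading pipe-separated metadata (path_to_wav|transcription|speaker_id).
from pathlib import Path

def read_metadata(path):
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        wav_path, transcription, speaker_id = line.split("|")
        entries.append({"wav": wav_path, "text": transcription, "speaker": speaker_id})
    return entries

# e.g. entries = read_metadata("datasets/metadata/your_metadata.txt")  # hypothetical path
```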
If you wish to use custom data, you need to prepare the metadata as shown above.

To convert the transcriptions of your metadata into ARPABET, you can use `datasets/g2p.py`:
```bash
python datasets/g2p.py -i <input_metadata_path> -o <output_metadata_path>
```
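As a rough illustration of what this conversion does, here is a sketch based on the `g2p_en` package (Kyubyong's g2pE, credited in the References section). Treat it as an assumption: `datasets/g2p.py` may handle tokenization, punctuation, and stress stripping differently.

```python
# Sketch: grapheme-to-phoneme conversion with g2p_en, with stress digits removed.
from g2p_en import G2p

g2p = G2p()
phones = g2p("The boy was there when the sun rose.")  # ARPABET symbols with stress digits, plus spaces/punctuation
unstressed = [p.rstrip("012") for p in phones]        # e.g. 'AH0' -> 'AH'; spaces and punctuation pass through
print(" ".join(unstressed))
```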
### Preparing Configuration Files

Training our VC system consists of two steps: (1) training Cotatron, and (2) training the VC decoder on top of Cotatron.
There are three `yaml` files in the `config` folder, which are configuration templates for each model.
They **must** be edited to match your training requirements (dataset, metadata, etc.).

```bash
cp config/global/default.yaml config/global/config.yaml
cp config/cota/default.yaml config/cota/config.yaml
cp config/vc/default.yaml config/vc/config.yaml
```

Here, all files with names other than `default.yaml` will be ignored by git (see `.gitignore`).
- `config/global`: Global configs used for training both Cotatron and the VC decoder.
  - Fill in the blanks of: `speakers`, `train_dir`, `train_meta`, `val_dir`, `val_meta`, `f0s_list_path`.
  - An example speaker id list is shown in `datasets/metadata/libritts_vctk_speaker_list.txt`.
  - When replicating the two-stage training process from our paper (training with LibriTTS and then LibriTTS+VCTK), put the speaker ids of both LibriTTS and VCTK in the global config.
  - `f0s_list_path` is set to `f0s.txt` by default.
- `config/cota`: Configs for training Cotatron.
  - You may want to change `batch_size` for GPUs other than a 32GB V100, or change `chkpt_dir` to save checkpoints on a different disk.
  - You can also set `use_attn_loss`, which controls whether guided attention loss is used.
- `config/vc`: Configs for training the VC decoder.
  - Fill in the blank: `cotatron_path`.

### Extracting Pitch Range of Speakers
Before you train the VC decoder, you should extract the pitch range of each speaker:
```bash
python preprocess.py -c <path_to_global_config_yaml>
```
The result will be saved at `f0s.txt`.
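Conceptually, a speaker's pitch range can be estimated as in the sketch below, here with `librosa.pyin`. This is an assumption for illustration only; the actual `preprocess.py` may use a different F0 estimator and a different output format for `f0s.txt`.

```python
# Rough sketch of per-speaker pitch-range extraction; preprocess.py may differ
# in the F0 estimator, frame settings, and the format written to f0s.txt.
import numpy as np
import librosa

def pitch_range(wav_paths):
    voiced_f0 = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=22050)
        f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                     fmax=librosa.note_to_hz("C7"), sr=sr)
        voiced_f0.append(f0[voiced])                 # keep voiced frames only
    voiced_f0 = np.concatenate(voiced_f0)
    return float(np.min(voiced_f0)), float(np.max(voiced_f0))
```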
## Training

Currently, training with a multi-GPU setting may be slow due to a version issue in PyTorch Lightning.
If you want to train faster, see [this issue](https://github.com/mindslab-ai/assem-vc/issues/13).
### 1. Training Cotatron
To train Cotatron, run this command:

```bash
python cotatron_trainer.py -c <path_to_global_config_yaml> <path_to_cotatron_config_yaml> \
                           -g <gpus> -n <run_name>
```

Here are some example commands that might help you understand the arguments:
```bash
# train from scratch with name "my_runname"
python cotatron_trainer.py -c config/global/config.yaml config/cota/config.yaml \
                           -g 0 -n my_runname
```

Optionally, you can resume training from a previously saved checkpoint by adding the `-p <checkpoint_path>` argument.
### 2. Training VC decoder
After the Cotatron is sufficiently trained (i.e., producing stable alignment + converged loss),
the VC decoder can be trained on top of it.

```bash
python synthesizer_trainer.py -c <path_to_global_config_yaml> <path_to_vc_config_yaml> \
                              -g <gpus> -n <run_name>
```

The optional checkpoint argument is also available for the VC decoder.
### 3. GTA finetuning HiFi-GAN
Once the VC decoder is trained, finetune the HiFi-GAN with GTA finetuning.
First, you should extract GTA mel-spectrograms from the VC decoder.
```bash
python gta_extractor.py -c <path_to_global_config_yaml> <path_to_vc_config_yaml> \
                        -p <checkpoint_path>
```
The GTA mel-spectrograms calculated from the audio files will be saved as `**.wav.gta` on the first run, and then loaded from disk afterwards.

Train/validation metadata for the GTA mels will be saved in `datasets/gta_metadata/gta_.txt`.
You should use those metadata files when finetuning HiFi-GAN.

After extracting the GTA mels, go into `hifi-gan` and follow the manual in [hifi-gan/README.md](https://github.com/wookladin/hifi-gan/blob/master/README.md):
```bash
cd hifi-gan
```

### Monitoring via Tensorboard
Training progress, including loss values and validation outputs, can be monitored with TensorBoard.
By default, logs are stored at `logs/cota` or `logs/vc`, which can be changed by editing the `log.log_dir` parameter in the config yaml file.

```bash
tensorboard --logdir logs/cota --bind_all   # Cotatron - Scalars, Images, Hparams, Projector will be shown.
tensorboard --logdir logs/vc --bind_all     # VC decoder - Scalars, Images, Hparams will be shown.
```

## Pre-trained Weight

We provide a pretrained Assem-VC model and a GTA-finetuned HiFi-GAN generator weight.
Assem-VC was trained with VCTK and LibriTTS, and HiFi-GAN was finetuned with VCTK.

1. Download our published [models and configurations](https://drive.google.com/drive/folders/1aIl8ObHxsmsFLXBz-y05jMBN4LrpQejm?usp=sharing).
2. Place `global/config.yaml` at `config/global/config.yaml`, and `vc/config.yaml` at `config/vc/config.yaml`.
3. Download `f0s.txt` and write its relative path at `hp.data.f0s_list_path` (the default path is `f0s.txt`).
4. Write the paths of the pretrained Assem-VC and HiFi-GAN models in [inference.ipynb](./inference.ipynb).

## Inference
After the VC decoder and HiFi-GAN are trained, you can use an arbitrary speaker's speech as the source
and convert it to any speaker contained in the training set, i.e., any-to-many voice conversion.
1. Add your source audio (`.wav`) to `datasets/inference_source`.
2. Add the following line to `datasets/inference_source/metadata_origin.txt`:
```
your_audio.wav|transcription|speaker_id
```
Note that `speaker_id` has no effect here, whether or not it is in the training set.
3. Convert `datasets/inference_source/metadata_origin.txt` into ARPABET.
```bash
python datasets/g2p.py -i datasets/inference_source/metadata_origin.txt \
-o datasets/inference_source/metadata_g2p.txt
```
4. Run [inference.ipynb](./inference.ipynb).

We provide three source audio samples, including a single TTS sample from the [VITS demo page](https://jaywalnut310.github.io/vits-demo/index.html).
**Note that the source speech should be clean and its volume should not be too low.**
## Results
![](./docs/images/results.png)
*Disclaimer: We used an open-source g2p system in this repository, which is different from the proprietary g2p mentioned in the paper.
Hence, the quality of the results may differ from the paper.*

## Implementation details
Here are some noteworthy implementation details that could not be included in our paper due to the lack of space:
- Guided attention loss
We applied guided attention loss proposed in [DC-TTS](https://arxiv.org/abs/1710.08969).
It made Cotatron's alignment learning more stable and convergence faster.
See [modules/alignment_loss.py](./modules/alignment_loss.py).
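For intuition, the DC-TTS guided attention weight can be sketched as follows. This is a simplified illustration; `modules/alignment_loss.py` may differ in masking, padding, and reduction details.

```python
# Sketch of the DC-TTS guided attention weight W, which penalizes attention that
# strays far from the diagonal alignment between text and mel-frame positions.
import torch

def guided_attention_weight(text_len, mel_len, g=0.2):
    n = torch.arange(text_len, dtype=torch.float32) / text_len   # normalized text positions
    t = torch.arange(mel_len, dtype=torch.float32) / mel_len     # normalized mel positions
    # W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2))
    return 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g ** 2))

# The guided attention loss is then the mean of W * A over valid (unpadded) positions,
# where A is the soft attention matrix produced by Cotatron's TTS decoder.
```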
## License

BSD 3-Clause License.
## Citation & Contact
```bibtex
@INPROCEEDINGS{kim2021assem,
title={ASSEM-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques},
author={Kim, Kang-Wook and Park, Seung-Won and Lee, Junhyeok and Joe, Myun-Chul},
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2022},
volume={},
number={},
pages={6997-7001},
doi={10.1109/ICASSP43922.2022.9746139}}
```

---
```bibtex
@article{kim2021controllable,
title={Controllable and Interpretable Singing Voice Decomposition via Assem-VC},
author={Kim, Kang-wook and Lee, Junhyeok},
journal={NeurIPS 2021 Workshop on Machine Learning for Creativity and Design},
year={2021}
}
```

If you have a question or any kind of inquiry, please contact Kang-wook Kim at [[email protected]](mailto:[email protected]).
## Repository structure
```
.
├── LICENSE
├── README.md
├── cotatron.py
├── cotatron_trainer.py # Trainer file for Cotatron
├── gta_extractor.py # GTA mel spectrogram extractor
├── inference.ipynb
├── preprocess.py # Extracting speakers' pitch range
├── requirements.txt
├── synthesizer.py
├── synthesizer_trainer.py # Trainer file for VC decoder (named as "synthesizer")
├── config
│ ├── cota
│ │ └── default.yaml # configuration template for Cotatron
│ ├── global
│ │ └── default.yaml # configuration template for both Cotatron and VC decoder
│ └── vc
│ └── default.yaml # configuration template for VC decoder
├── datasets # TextMelDataset and text preprocessor
│ ├── __init__.py
│ ├── g2p.py # Using G2P to convert metadata's transcription into ARPABET
│ ├── resample.py # Python file for audio resampling
│   ├── text_mel_dataset.py
│   ├── inference_source
│   │   (omitted)           # custom source speeches and transcriptions for inference.ipynb
│   ├── inference_target
│   │   (omitted)           # target speeches and transcriptions of VCTK for inference.ipynb
│   ├── metadata
│   │   (omitted)           # Refer to README.md within the folder.
│ └── text
│ ├── __init__.py
│ ├── cleaners.py
│ ├── cmudict.py
│ ├── numbers.py
│ └── symbols.py
├── docs # Audio samples and code for https://mindslab-ai.github.io/assem-vc/
│ (omitted)
├── hifi-gan # Modified HiFi-GAN vocoder (https://github.com/wookladin/hifi-gan)
│ (omitted)
├── modules # All modules that compose model, including mel.py
│ ├── __init__.py
│ ├── alignment_loss.py # Guided attention loss
│ ├── attention.py # Implementation of DCA (https://arxiv.org/abs/1910.10288)
│ ├── classifier.py
│ ├── cond_bn.py
│ ├── encoder.py
│ ├── f0_encoder.py
│ ├── mel.py # Code for calculating mel-spectrogram from raw audio
│ ├── tts_decoder.py
│ ├── vc_decoder.py
│ └── zoneout.py # Zoneout LSTM
└── utils # Misc. code snippets, usually for logging
├── loggers.py
├── plotting.py
└── utils.py
```

## References
This implementation uses code from the following repositories:
- [Keith Ito's Tacotron implementation](https://github.com/keithito/tacotron/)
- [NVIDIA's Tacotron2 implementation](https://github.com/NVIDIA/tacotron2)
- [Official Mellotron implementation](https://github.com/NVIDIA/mellotron)
- [Official HiFi-GAN implementation](https://github.com/jik876/hifi-gan)
- [Official Cotatron implementation](https://github.com/mindslab-ai/cotatron)
- [Kyubyong's g2pE implementation](https://github.com/Kyubyong/g2p)
- [Tomiinek's Multilingual TTS implementation](https://github.com/Tomiinek/Multilingual_Text_to_Speech)

This README was inspired by:
- [Tips for Publishing Research Code](https://github.com/paperswithcode/releasing-research-code)

The audio samples on [the demo page of Assem-VC](https://mindslab-ai.github.io/assem-vc/) and [the demo page of Assem-Singer](https://mindslab-ai.github.io/assem-vc/singer/) are partially derived from:
- [LibriTTS](https://arxiv.org/abs/1904.02882): Dataset for multispeaker TTS, derived from LibriSpeech.
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2651): 46 hours of English speech from 108 speakers.
- [KSS](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset): Korean Single Speaker Speech Dataset.
- [CSD](https://zenodo.org/record/4785016): Children's Song Dataset for Singing Voice Research
- [NUS-48E](https://smcnus.comp.nus.edu.sg/nus-48e-sung-and-spoken-lyrics-corpus/): NUS-48E Sung and Spoken Lyrics Corpus