# GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

#### Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao | Zhejiang University, Sea AI Lab

PyTorch Implementation of [GenerSpeech (NeurIPS'22)](https://arxiv.org/abs/2205.07211): a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.

[![arXiv](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2205.07211)
[![GitHub Stars](https://img.shields.io/github/stars/Rongjiehuang/GenerSpeech?style=social)](https://github.com/Rongjiehuang/GenerSpeech)

We provide our implementation and pretrained models in this repository.

Visit our [demo page](https://generspeech.github.io/) for audio samples.

## News
- December 2022: **[GenerSpeech](https://arxiv.org/abs/2205.07211) (NeurIPS 2022)** released on GitHub.

## Key Features
- **Multi-level Style Transfer** for expressive text-to-speech.
- **Enhanced model generalization** to out-of-distribution (OOD) style reference.

## Quick Start
We provide an example of how to generate high-fidelity samples with GenerSpeech.

To try it on your own dataset, clone this repo to a machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.
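For example, to fetch the code (the directory name below simply follows the repository URL):

```bash
# Clone the repository and move into it
git clone https://github.com/Rongjiehuang/GenerSpeech.git
cd GenerSpeech
```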

### Supported Datasets and Pretrained Models
You can use the pretrained models we provide [here](https://huggingface.co/spaces/Rongjiehuang/GenerSpeech/tree/main/checkpoints) and the data [here](https://huggingface.co/spaces/Rongjiehuang/GenerSpeech/tree/main/data/binary/training_set). Details of each folder are as follows:

| Model | Dataset (16 kHz) | Description |
|-------------|------------------|--------------------------------------------------------------------------|
| GenerSpeech | LibriTTS, ESD | Acoustic model [(config)](modules/GenerSpeech/config/generspeech.yaml) |
| HIFI-GAN | LibriTTS, ESD | Neural vocoder |
| Encoder | / | Emotion encoder |

More supported datasets are coming soon.

### Dependencies

A suitable [conda](https://conda.io/) environment named `generspeech` can be created
and activated with:

```bash
conda env create -f environment.yaml
conda activate generspeech
```
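As a quick sanity check (a minimal sketch, assuming the environment above is active), you can confirm that the installed PyTorch build sees your GPU:

```bash
# Prints the PyTorch version and True if a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```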

### Multi-GPU
By default, this implementation trains on all GPUs reported by `torch.cuda.device_count()` in parallel.
You can restrict which GPUs are used by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module, as shown below.
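For example, to restrict training to the first two GPUs (the device indices below are illustrative):

```bash
# Expose only GPUs 0 and 1 to the training module
export CUDA_VISIBLE_DEVICES=0,1
python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```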

## Inference (Zero-shot TTS)
Here we provide a speech synthesis pipeline using GenerSpeech.

1. Prepare **GenerSpeech** (acoustic model): download the checkpoint and place it at `checkpoints/GenerSpeech`.
2. Prepare **HIFI-GAN** (neural vocoder): download the checkpoint and place it at `checkpoints/trainset_hifigan`.
3. Prepare the **Emotion Encoder**: download the checkpoint and place it at `checkpoints/Emotion_encoder.pt`.
4. Prepare the **dataset**: download the statistical files and place them at `data/binary/training_set`.
5. Prepare **path/to/reference_audio (16 kHz)**: by default, GenerSpeech uses **[ASR](https://huggingface.co/facebook/wav2vec2-base-960h) + [MFA](https://montreal-forced-aligner.readthedocs.io/)** to obtain the text-speech alignment from the reference audio.
```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"
```

Generated wav files are saved in `infer_out` by default.
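If inference fails because a file cannot be found, a quick check (using the paths from steps 1-4 above) is:

```bash
# Verify that the checkpoints and statistical files are in the expected locations
ls checkpoints/GenerSpeech checkpoints/trainset_hifigan checkpoints/Emotion_encoder.pt data/binary/training_set
```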

## Train your own model

### Data Preparation and Configuration
1. Set `raw_data_dir`, `processed_data_dir`, and `binary_data_dir` in the config file, and download the dataset to `raw_data_dir`.
2. Check `preprocess_cls` in the config file. The dataset structure must match what the `preprocess_cls` processor expects, or you can rewrite the processor to fit your dataset. We provide a LibriTTS processor as an example in `modules/GenerSpeech/config/generspeech.yaml`.
3. Download the global emotion encoder to `emotion_encoder_path`. For more details, please refer to [this branch](https://github.com/Rongjiehuang/GenerSpeech/tree/encoder).
4. Preprocess the dataset:
```bash
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
```
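For instance, using the provided LibriTTS config as `$path/to/config` (the GPU index below is illustrative):

```bash
# Run the three preprocessing stages against the example config
CONFIG=modules/GenerSpeech/config/generspeech.yaml
python data_gen/tts/bin/preprocess.py --config $CONFIG
python data_gen/tts/bin/train_mfa_align.py --config $CONFIG
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config $CONFIG
```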

You can also build a dataset via [NATSpeech](https://github.com/NATSpeech/NATSpeech), which shares the same MFA data-processing procedure.
We also provide our processed dataset (16 kHz LibriTTS + ESD).

### Training GenerSpeech
```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```

### Inference using GenerSpeech

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --infer
```

## Acknowledgements
This implementation uses parts of the code from the following GitHub repos:
[FastDiff](https://github.com/Rongjiehuang/FastDiff),
[NATSpeech](https://github.com/NATSpeech/NATSpeech),
as described in our code.

## Citations
If you find this code useful in your research, please cite our work:
```bib
@inproceedings{huanggenerspeech,
title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
booktitle={Advances in Neural Information Processing Systems}
}
```

## Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.