https://github.com/maxmax2016/grad-tts-vocos
Grad-TTS-Vocos
- Host: GitHub
- URL: https://github.com/maxmax2016/grad-tts-vocos
- Owner: MaxMax2016
- License: mit
- Created: 2023-09-15T03:42:21.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-15T07:07:40.000Z (over 2 years ago)
- Last Synced: 2025-04-05T13:38:10.292Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 305 KB
- Stars: 7
- Watchers: 0
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
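# Bert-Grad-Vocos-TTS: based on Huawei Grad-TTS for Chinese, with BERT integrated for prosody and Vocos integrated as the vocoder
#### This is a TTS algorithm project intended for learning. If you are looking for a TTS system that can be used directly in production, this project is not for you!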


Bert-Grad Framework
## Acoustic Model
### Install and Test
Download [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz) from [charactr-platform/vocos](https://github.com/charactr-platform/vocos).
Download [prosody_model](https://github.com/Executedone/Chinese-FastSpeech2) from [Executedone/Chinese-FastSpeech2](https://github.com/Executedone/Chinese-FastSpeech2).
Download [grad_tts.pt](https://github.com/PlayVoice/Bert-Grad-Vocos-TTS/releases/tag/release) from the release page.
Put `pytorch_model.bin` at `./vocos-mel-24khz/pytorch_model.bin`.
**Rename `best_model.pt` to `prosody_model.pt`.**
Put `prosody_model.pt` at `./bert/prosody_model.pt`.
Put `grad_tts.pt` at `./grad_tts.pt`.
> pip install -r requirements.txt
```
> cd ./grad/monotonic_align
> python setup.py build_ext --inplace
> cd -
```
> python inference.py --file test.txt --checkpoint grad_tts.pt --diffusion 1 --timesteps 4 --temperature 1.15
The inferred waveforms are saved in `./inference_out`.
`--diffusion`: set to 1 to use the diffusion decoder during inference, or 0 to skip it.
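For orientation, here is a minimal sketch of what `--timesteps` and `--temperature` plausibly control. It assumes this fork keeps the reverse-diffusion loop of the upstream Grad-TTS code, where the solver takes `timesteps` equal steps from t = 1 towards t = 0, evaluates the model at the midpoint of each step, and divides the starting noise by the temperature. The function names are hypothetical and are not part of this repository.

```python
import random

def reverse_time_grid(timesteps):
    """Midpoint times visited by a fixed-step reverse-diffusion solver.

    Assumption: mirrors upstream Grad-TTS (step h = 1 / timesteps,
    t_i = 1 - (i + 0.5) * h); this fork may differ.
    """
    if timesteps < 1:
        raise ValueError("timesteps must be >= 1")
    h = 1.0 / timesteps
    return [1.0 - (i + 0.5) * h for i in range(timesteps)]

def sample_terminal_noise(mean, temperature, rng=random):
    """Draw the starting point: mean + N(0, 1) / temperature per element."""
    return [m + rng.gauss(0.0, 1.0) / temperature for m in mean]

print(reverse_time_grid(4))
# [0.875, 0.625, 0.375, 0.125]
```

Under these assumptions, more timesteps means smaller solver steps and slower inference, and a temperature above 1 shrinks the starting noise.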
### Data
Download the [baker](https://aistudio.baidu.com/datasetdetail/36741) data: https://www.data-baker.com/data/index/TNtts/
Put `Waves` in `./data/Waves`.
Put `000001-010000.txt` at `./data/000001-010000.txt`.
1. Resample
> python tools/preprocess_a.py -w ./data/Waves/ -o ./data/wavs -s 24000
2. Extract mel spectrograms
> python tools/preprocess_m.py --wav data/wavs/ --out data/mels/
3. Extract BERT features and, in the same pass, generate the training file lists
> python tools/preprocess_b.py
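The output contains `data/berts/` and `data/files`.
Note: the printed messages come from removing entries with `儿化音` (erhua). This project is an algorithm demo and is not meant for production.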
Raw label
``` c
000001 卡尔普#2陪外孙#1玩滑梯#4。
ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002 假语村言#2别再#1拥抱我#4。
jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
```
Cleaned label
``` c
000001 卡尔普陪外孙玩滑梯。
ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000002 假语村言别再拥抱我。
jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
```
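Comparing the raw and cleaned text lines above, the visible change to the text is that the prosodic break marks (`#1` to `#4`) are removed. A minimal sketch of that step follows. The helper name is hypothetical and this is not necessarily how `tools/preprocess_b.py` does it.

```python
import re

# Prosodic break marks in the raw labels look like "#1" ... "#4".
_BREAK_MARK = re.compile(r"#\d")

def strip_break_marks(line):
    """Split 'ID text' on whitespace and remove the '#N' marks from the text."""
    utt_id, text = line.strip().split(maxsplit=1)
    return utt_id, _BREAK_MARK.sub("", text)

print(strip_break_marks("000001 卡尔普#2陪外孙#1玩滑梯#4。"))
# ('000001', '卡尔普陪外孙玩滑梯。')
```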
Train files
```
./data/wavs/000001.wav|./data/mels/000001.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
./data/wavs/000002.wav|./data/mels/000002.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
```
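Each line of the train files above holds four `|`-separated fields: the wav path, the mel path, the BERT feature path, and the space-separated phoneme sequence. A minimal sketch of reading one such line follows. The `TrainEntry` type and its field names are hypothetical, chosen for illustration, and do not come from the repository.

```python
from typing import List, NamedTuple

class TrainEntry(NamedTuple):
    wav_path: str
    mel_path: str
    bert_path: str
    phonemes: List[str]

def parse_train_line(line):
    """Split one 'wav|mel|bert|phonemes' line into its four fields."""
    wav, mel, bert, phones = line.rstrip("\n").split("|")
    return TrainEntry(wav, mel, bert, phones.split())

entry = parse_train_line(
    "./data/wavs/000001.wav|./data/mels/000001.pt|./data/berts/000001.npy|"
    "sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil"
)
print(entry.mel_path)
# ./data/mels/000001.pt
```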
Error
```
002365 这图#2难不成#2是#1P过的#4?
zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5
```
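The label above appears to be a problem because its pinyin line contains English pronunciation tokens (`P IY1`) instead of toned pinyin syllables. A minimal sketch of a check that flags such lines follows, assuming every valid token is lowercase letters followed by a tone digit from 1 to 5. The helper name is hypothetical, and the repository's preprocessing may detect these cases differently.

```python
import re

# Assumed token shape: lowercase pinyin letters plus a tone digit 1-5.
_PINYIN = re.compile(r"[a-z]+[1-5]")

def invalid_pinyin_tokens(pinyin_line):
    """Return the tokens that do not look like toned pinyin syllables."""
    return [tok for tok in pinyin_line.split() if not _PINYIN.fullmatch(tok)]

print(invalid_pinyin_tokens("zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5"))
# ['P', 'IY1']
```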
### Train
Debug training
> python tools/preprocess_d.py
Start training
> python train.py
Resume training
> python train.py -p logs/new_exp/grad_tts_***.pt
### Inference
> python inference.py --file test.txt --checkpoint ./logs/new_exp/grad_tts_***.pt --diffusion 1 --timesteps 20 --temperature 1.15
### Code sources and references
https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
https://github.com/thuhcsi/LightGrad
https://github.com/Executedone/Chinese-FastSpeech2
https://github.com/PlayVoice/vits_chinese
https://github.com/reppy4620/grad_tts
# Raw Grad-TTS information
Official implementation of the Grad-TTS model based on Diffusion Probabilistic Modelling. For all details check out our paper accepted to ICML 2021 via [this](https://arxiv.org/abs/2105.06337) link.
**Authors**: Vadim Popov\*, Ivan Vovk\*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.
\*Equal contribution.
## Abstract
**Demo page** with voiced abstract: [link](https://grad-tts.github.io/).
Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.
## References
* HiFi-GAN model is used as vocoder, official github repository: [link](https://github.com/jik876/hifi-gan).
* Monotonic Alignment Search algorithm is used for unsupervised duration modelling, official github repository: [link](https://github.com/jaywalnut310/glow-tts).
* Phonemization utilizes CMUdict, official github repository: [link](https://github.com/cmusphinx/cmudict).
## Vocoder Model
project link: https://github.com/charactr-platform/vocos
### Infer Test
Download the pretrained model from https://huggingface.co/charactr/vocos-mel-24khz
> python vocos/inference.py --wav test.wav
The output file is `vocos_save.wav` in the current directory.