{"id":19684159,"url":"https://github.com/maxmax2016/grad-tts-vocos","last_synced_at":"2025-04-29T05:32:10.047Z","repository":{"id":194944785,"uuid":"691861366","full_name":"MaxMax2016/Grad-TTS-Vocos","owner":"MaxMax2016","description":"Grad-TTS-Vocos","archived":false,"fork":false,"pushed_at":"2023-09-15T07:07:40.000Z","size":312,"stargazers_count":7,"open_issues_count":0,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-05T13:38:10.292Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaxMax2016.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-15T03:42:21.000Z","updated_at":"2024-05-11T13:03:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"0a2d7625-c335-428a-bd09-2d767d3b1e83","html_url":"https://github.com/MaxMax2016/Grad-TTS-Vocos","commit_stats":null,"previous_names":["playvoice/bert-grad-vocos-tts","yuchendd/grad-tts-vocos","maxmax2016/grad-tts-vocos"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxMax2016%2FGrad-TTS-Vocos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxMax2016%2FGrad-TTS-Vocos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxMax2016%2FGrad-TTS-Vocos/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaxMax2016%2FGrad-TTS-Vocos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaxMax2016","download_url":"https://codeload.github.com/MaxM
ax2016/Grad-TTS-Vocos/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251444693,"owners_count":21590557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T18:16:59.509Z","updated_at":"2025-04-29T05:32:09.704Z","avatar_url":"https://github.com/MaxMax2016.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bert-Grad-Vocos-TTS: a Chinese TTS based on Huawei Grad-TTS, with BERT integrated for prosody and Vocos integrated as the vocoder\n#### A TTS algorithm project intended for learning. If you are looking for a TTS that can be used directly in production, this project is not for you!\n\u003cdiv align=\"center\"\u003e\n\n![grad_tts](assets/grad_tts.jpg)\n\n![bert_grad_tts](assets/bert_grad_tts.jpg)\nBert-Grad Framework\n\u003c/div\u003e\n\n## Acoustic Model\n\n### Install and Test\n\ndownload [vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz) from [charactr-platform/vocos](https://github.com/charactr-platform/vocos)\n\ndownload [prosody_model](https://github.com/Executedone/Chinese-FastSpeech2) from [Executedone/Chinese-FastSpeech2](https://github.com/Executedone/Chinese-FastSpeech2)\n\ndownload [grad_tts.pt](https://github.com/PlayVoice/Bert-Grad-Vocos-TTS/releases/tag/release) from the release page\n\nput [pytorch_model.bin]() to ./vocos-mel-24khz/pytorch_model.bin\n\n**rename best_model.pt to prosody_model.pt**\n\nput [prosody_model.pt]() to ./bert/prosody_model.pt\n\nput [grad_tts.pt]() to ./grad_tts.pt\n\n\u003e pip install -r requirements.txt\n\n```\n\u003e cd ./grad/monotonic_align\n\u003e python setup.py build_ext --inplace\n\u003e cd 
-\n```\n\n\u003e python inference.py --file test.txt --checkpoint grad_tts.pt --diffusion 1 --timesteps 4 --temperature 1.15\n\nthe inferred waveforms will be saved in `./inference_out`\n\n--diffusion : 1 to use the diffusion decoder during inference, 0 to skip it\n\n### Data\n\ndownload [baker](https://aistudio.baidu.com/datasetdetail/36741) data: https://www.data-baker.com/data/index/TNtts/\n\nput `Waves` to ./data/Waves\n\nput `000001-010000.txt` to ./data/000001-010000.txt\n\n1, resample\n\n\u003e python tools/preprocess_a.py -w ./data/Wave/ -o ./data/wavs -s `24000`\n\n2, extract mel\n\n\u003e python tools/preprocess_m.py --wav data/wavs/ --out data/mels/\n\n3, extract bert features, and generate the train files along the way\n\n\u003e python tools/preprocess_b.py\n\nthe output contains `data/berts/` and `data/files`\n\nNote: the messages printed during this step report `erhua` (儿化音) syllables being removed (this project is an algorithm demo, not meant for production)\n\nRaw label\n``` c\n000001\t卡尔普#2陪外孙#1玩滑梯#4。\n\tka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1\n000002\t假语村言#2别再#1拥抱我#4。\n\tjia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3\n```\nCleaned label\n``` c\n000001\t卡尔普陪外孙玩滑梯。\n\tka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1\n\tsil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil\n000002\t假语村言别再拥抱我。\n\tjia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3\n\tsil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil\n```\nTrain files\n```\n./data/wavs/000001.wav|./data/mels/000001.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil\n./data/wavs/000002.wav|./data/mels/000002.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil\n```\nError\n```\n002365\t这图#2难不成#2是#1Ｐ过的#4？\n\tzhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5\n```\n### Train\n\ndebug train\n\n\u003e python tools/preprocess_d.py\n\nstart train\n\n\u003e python train.py\n\nresume train\n\n\u003e python train.py -p logs/new_exp/grad_tts_***.pt\n\n### Inference\n\n\u003e python inference.py --file test.txt --checkpoint ./logs/new_exp/grad_tts_***.pt --diffusion 1 
--timesteps 20 --temperature 1.15\n\n### Code sources and references\n\nhttps://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS\n\nhttps://github.com/thuhcsi/LightGrad\n\nhttps://github.com/Executedone/Chinese-FastSpeech2\n\nhttps://github.com/PlayVoice/vits_chinese\n\nhttps://github.com/reppy4620/grad_tts\n\n# Raw Grad-TTS information\n\nOfficial implementation of the Grad-TTS model based on Diffusion Probabilistic Modelling. For all details check out our paper accepted to ICML 2021 via [this](https://arxiv.org/abs/2105.06337) link.\n\n**Authors**: Vadim Popov\\*, Ivan Vovk\\*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.\n\n\u003csup\u003e\\*Equal contribution.\u003c/sup\u003e\n\n## Abstract\n\n**Demo page** with voiced abstract: [link](https://grad-tts.github.io/).\n\nRecently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. 
Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.\n\n## References\n\n* HiFi-GAN model is used as vocoder, official github repository: [link](https://github.com/jik876/hifi-gan).\n* Monotonic Alignment Search algorithm is used for unsupervised duration modelling, official github repository: [link](https://github.com/jaywalnut310/glow-tts).\n* Phonemization utilizes CMUdict, official github repository: [link](https://github.com/cmusphinx/cmudict).\n\n\n## Vocoder Model\n\nproject link: https://github.com/charactr-platform/vocos\n\n### Inference Test\n\ndownload the pretrained model https://huggingface.co/charactr/vocos-mel-24khz\n\n\u003e python vocos/inference.py --wav test.wav\n\nthe output file `vocos_save.wav` is written to the current directory\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxmax2016%2Fgrad-tts-vocos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxmax2016%2Fgrad-tts-vocos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxmax2016%2Fgrad-tts-vocos/lists"}