{"id":13958550,"url":"https://github.com/Rongjiehuang/ProDiff","last_synced_at":"2025-07-21T00:31:12.693Z","repository":{"id":58653341,"uuid":"524092789","full_name":"Rongjiehuang/ProDiff","owner":"Rongjiehuang","description":"PyTorch Implementation of ProDiff (ACM-MM'22) with a Extremely-Fast diffusion speech synthesis pipeline","archived":false,"fork":false,"pushed_at":"2023-04-19T10:08:22.000Z","size":186,"stargazers_count":433,"open_issues_count":18,"forks_count":55,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-25T17:04:54.302Z","etag":null,"topics":["diffusion-models","speech-synthesis","text-to-speech"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rongjiehuang.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-12T13:10:44.000Z","updated_at":"2025-05-18T19:33:56.000Z","dependencies_parsed_at":"2024-11-28T02:32:25.013Z","dependency_job_id":"349ea39d-987c-4150-b7fb-570b22af36ab","html_url":"https://github.com/Rongjiehuang/ProDiff","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Rongjiehuang/ProDiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FProDiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FProDiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FProDiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/Rongjiehuang%2FProDiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rongjiehuang","download_url":"https://codeload.github.com/Rongjiehuang/ProDiff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FProDiff/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221259,"owners_count":23894965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","speech-synthesis","text-to-speech"],"created_at":"2024-08-08T13:01:43.536Z","updated_at":"2025-07-21T00:31:11.927Z","avatar_url":"https://github.com/Rongjiehuang.png","language":"Python","funding_links":[],"categories":["语音合成"],"sub_categories":["网络服务_其他"],"readme":"# ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech\n\n#### Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren\n\nPyTorch Implementation of [ProDiff (ACM Multimedia'22)](https://arxiv.org/abs/2207.06389): a conditional diffusion probabilistic model capable of generating high fidelity speech efficiently.\n\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2207.06389)\n[![GitHub Stars](https://img.shields.io/github/stars/Rongjiehuang/ProDiff?style=social)](https://github.com/Rongjiehuang/ProDiff)\n![visitors](https://visitor-badge.glitch.me/badge?page_id=Rongjiehuang/ProDiff)\n[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/Rongjiehuang/ProDiff)\n\nWe provide our 
implementation and pretrained models as open source in this repository.\n\nVisit our [demo page](https://prodiff.github.io/) for audio samples.\n\n## News\n- April, 2022: Our previous work **[FastDiff](https://arxiv.org/abs/2204.09934) (IJCAI 2022)** released on [GitHub](https://github.com/Rongjiehuang/FastDiff). \n- September, 2022: **[ProDiff](https://arxiv.org/abs/2207.06389) (ACM Multimedia 2022)** released on GitHub.\n\n## Key Features\n- **Extremely-Fast** diffusion text-to-speech synthesis pipeline for potential **industrial deployment**.\n- **Tutorial and code base** for speech diffusion models.\n- More **supported diffusion mechanisms** (e.g., guided diffusion) will be available.\n\n## Quick Start\nWe provide an example of how you can generate high-fidelity samples using ProDiff.\n\nTo try it on your own dataset, clone this repo on a local machine with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.\n\n### Supported Datasets and Pretrained Models\n\nRun the following command to download the weights:\n```python\n  from huggingface_hub import snapshot_download \n  downloaded_path = snapshot_download(repo_id=\"Rongjiehuang/ProDiff\")\n```\n\nThen move the downloaded checkpoints to `checkpoints/$Model/model_ckpt_steps_*.ckpt`:\n```bash\n   mv ${downloaded_path}/checkpoints/  checkpoints/\n```\n\nDetails of each folder are as follows:\n\n| Model             | Dataset     | Config                                          | \n|-------------------|-------------|-------------------------------------------------|\n| ProDiff Teacher   | LJSpeech    | `modules/ProDiff/config/prodiff_teacher.yaml`   | \n| ProDiff           | LJSpeech    | `modules/ProDiff/config/prodiff.yaml`           | \n\n\nMore supported datasets are coming soon.\n\n\n\n### Dependencies\nSee requirements in `requirement.txt`:\n- [pytorch](https://github.com/pytorch/pytorch)\n- [librosa](https://github.com/librosa/librosa)\n- 
[NATSpeech](https://github.com/NATSpeech/NATSpeech)\n\n### Multi-GPU\nBy default, this implementation uses as many GPUs in parallel as `torch.cuda.device_count()` returns. \nYou can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.\n\n## Extremely-Fast Text-to-Speech with diffusion probabilistic models \n\nHere we provide a speech synthesis pipeline using diffusion probabilistic models: ProDiff (acoustic model) + FastDiff (neural vocoder). [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/Rongjiehuang/ProDiff)\n\n1. Prepare the acoustic model (ProDiff or ProDiff Teacher): Download the LJSpeech checkpoint and put it in `checkpoints/ProDiff` or `checkpoints/ProDiff_Teacher`\n2. Prepare the neural vocoder (FastDiff): Download the LJSpeech checkpoint and put it in `checkpoints/FastDiff`\n\n3. Specify the input text `$txt` and set `N`, the number of reverse sampling iterations in the neural vocoder, which trades off quality against speed. \n4. Run the following command for extremely fast synthesis `(2-iter ProDiff + 4-iter FastDiff)`:\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff.py --config modules/ProDiff/config/prodiff.yaml --exp_name ProDiff --hparams=\"N=4,text='$txt'\" --reset\n```\nGenerated wav files are saved in `infer_out` by default.\u003cbr\u003e\nNote: For better quality, it's recommended to finetune the FastDiff neural vocoder [here](https://github.com/Rongjiehuang/FastDiff).\n\n5. For a different speed-quality trade-off, run `(4-iter ProDiff Teacher + 6-iter FastDiff)`:\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python inference/ProDiff_teacher.py --config modules/ProDiff/config/prodiff_teacher.yaml --exp_name ProDiff_Teacher --hparams=\"N=6,text='$txt'\" --reset\n```\n\n# Train your own model\n\n### Data Preparation and Configuration ##\n1. Set `raw_data_dir`, `processed_data_dir`, `binary_data_dir` in the config file\n2. Download dataset to `raw_data_dir`. 
Note: the dataset structure needs to follow `egs/datasets/audio/*/pre_align.py`, or you could rewrite `pre_align.py` according to your dataset.\n3. Preprocess Dataset \n```bash\n# Preprocess step: unify the file structure.\npython data_gen/tts/bin/pre_align.py --config $path/to/config\n# Align step: MFA alignment.\npython data_gen/tts/runs/train_mfa_align.py --config $CONFIG_NAME\n# Binarization step: Binarize data for fast IO.\nCUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config\n```\n\nYou could also build a dataset via [NATSpeech](https://github.com/NATSpeech/NATSpeech), which shares a common MFA data-processing procedure.\nWe also provide our processed LJSpeech dataset [here](https://zjueducn-my.sharepoint.com/:f:/g/personal/rongjiehuang_zju_edu_cn/Eo7r83WZPK1GmlwvFhhIKeQBABZpYW3ec9c8WZoUV5HhbA?e=9QoWnf).\n\n\n### Training Teacher of ProDiff \n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml  --exp_name ProDiff_Teacher --reset\n```\n\n### Training ProDiff\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml  --exp_name ProDiff --reset\n```\n\n### Inference using ProDiff Teacher\n\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff_teacher.yaml  --exp_name ProDiff_Teacher --infer\n```\n\n### Inference using ProDiff\n\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/ProDiff/config/prodiff.yaml  --exp_name ProDiff --infer\n```\n\n## Acknowledgements\nThis implementation uses parts of the code from the following Github repos:\n[FastDiff](https://github.com/Rongjiehuang/FastDiff),\n[DiffSinger](https://github.com/MoonInTheRiver/DiffSinger),\n[NATSpeech](https://github.com/NATSpeech/NATSpeech),\nas described in our code.\n\n## Citations ##\nIf you find this code useful in your research, please cite our work:\n```bib\n@inproceedings{huang2022prodiff,\n  
title={ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech},\n  author={Huang, Rongjie and Zhao, Zhou and Liu, Huadai and Liu, Jinglin and Cui, Chenye and Ren, Yi},\n  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},\n  year={2022}\n}\n\n@inproceedings{huang2022fastdiff,\n  title={FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis},\n  author={Huang, Rongjie and Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong and Ren, Yi and Zhao, Zhou},\n  booktitle = {Proceedings of the Thirty-First International Joint Conference on\n               Artificial Intelligence, {IJCAI-22}},\n  publisher = {International Joint Conferences on Artificial Intelligence Organization},\n  year={2022}\n}\n```\n\n## Disclaimer ##\nAny organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRongjiehuang%2FProDiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRongjiehuang%2FProDiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRongjiehuang%2FProDiff/lists"}