{"id":13958564,"url":"https://github.com/Rongjiehuang/FastDiff","last_synced_at":"2025-07-21T00:31:14.605Z","repository":{"id":37489160,"uuid":"448758519","full_name":"Rongjiehuang/FastDiff","owner":"Rongjiehuang","description":"PyTorch Implementation of FastDiff (IJCAI'22)","archived":false,"fork":false,"pushed_at":"2024-06-20T00:50:37.000Z","size":3127,"stargazers_count":410,"open_issues_count":16,"forks_count":61,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-05-25T12:07:37.392Z","etag":null,"topics":["ijcai2022","neural-vocoder","speech-synthesis","text-to-speech","vocoder"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rongjiehuang.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-17T04:56:17.000Z","updated_at":"2025-03-29T06:32:23.000Z","dependencies_parsed_at":"2024-10-29T15:21:02.698Z","dependency_job_id":null,"html_url":"https://github.com/Rongjiehuang/FastDiff","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Rongjiehuang/FastDiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FFastDiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FFastDiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FFastDiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FFastDiff/manifests","owner_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/owners/Rongjiehuang","download_url":"https://codeload.github.com/Rongjiehuang/FastDiff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rongjiehuang%2FFastDiff/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221259,"owners_count":23894965,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ijcai2022","neural-vocoder","speech-synthesis","text-to-speech","vocoder"],"created_at":"2024-08-08T13:01:44.076Z","updated_at":"2025-07-21T00:31:11.950Z","avatar_url":"https://github.com/Rongjiehuang.png","language":"Python","funding_links":[],"categories":["语音合成"],"sub_categories":["网络服务_其他"],"readme":"# FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis\n\n\u003cdiv align=center\u003e \u003cimg src=\"assets/Demo.gif\" alt=\"drawing\" style=\"width:250px; \"/\u003e \u003c/div\u003e\n\n\n#### Rongjie Huang, Max W. Y. 
Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao\n\nPyTorch Implementation of [FastDiff (IJCAI'22)](https://arxiv.org/abs/2204.09934): a conditional diffusion probabilistic model capable of generating high-fidelity speech efficiently.\n\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2204.09934)\n[![GitHub Stars](https://img.shields.io/github/stars/Rongjiehuang/FastDiff?style=social)](https://github.com/Rongjiehuang/FastDiff)\n![visitors](https://visitor-badge.glitch.me/badge?page_id=Rongjiehuang/FastDiff)\n[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/Rongjiehuang/ProDiff)\n\nWe provide our implementation and pretrained models as open source in this repository.\n\nVisit our [demo page](https://fastdiff.github.io/) for audio samples.\n\nOur follow-up work might also interest you: [ProDiff (ACM Multimedia'22)](https://arxiv.org/abs/2207.06389) on [GitHub](https://github.com/Rongjiehuang/ProDiff)\n\n## News\n- April.22, 2021: **FastDiff** accepted by IJCAI 2022.\n- June.21, 2022: The LJSpeech checkpoint and demo code are provided.\n- August.12, 2022: The VCTK/LibriTTS checkpoints are provided.\n- August.25, 2022: **FastDiff (tacotron)** is provided.\n- September, 2022: We release follow-up work [ProDiff (ACM Multimedia'22)](https://arxiv.org/abs/2207.06389) on [GitHub](https://github.com/Rongjiehuang/ProDiff), where we further optimized the speed-and-quality trade-off.\n\n# Quick Start\nWe provide an example of how you can generate high-fidelity samples using FastDiff.\n\nTo try it on your own dataset, clone this repo onto a local machine equipped with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.\n\n## Supported Datasets and Pretrained Models\n\nYou can also use the pretrained models we provide [here](https://huggingface.co/Rongjiehuang/FastDiff).\nDetails of each folder are as follows:\n\n| Dataset            | Config                  
                         | \n|--------------------|--------------------------------------------------|\n| LJSpeech           | `modules/FastDiff/config/FastDiff.yaml`          | \n| LibriTTS           | `modules/FastDiff/config/FastDiff_libritts.yaml` | \n| VCTK               | `modules/FastDiff/config/FastDiff_vctk.yaml`     |    \n| LJSpeech(Tacotron) | `modules/FastDiff/config/FastDiff_tacotron.yaml` |    \n\nMore supported datasets are coming soon.\n\nPut the checkpoints in `checkpoints/$your_experiment_name/model_ckpt_steps_*.ckpt`\n\n## Dependencies\nSee requirements in `requirement.txt`:\n- [pytorch](https://github.com/pytorch/pytorch)\n- [librosa](https://github.com/librosa/librosa)\n- [NATSpeech](https://github.com/NATSpeech/NATSpeech)\n\n## Multi-GPU\nBy default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. \nYou can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.\n\n## Inference for text-to-speech synthesis\n\n### Using ProDiff\nWe provide a more efficient and stable pipeline in [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/Rongjiehuang/ProDiff) and [GitHub](https://github.com/Rongjiehuang/ProDiff)\n\n### Using Tacotron\nDownload the LJSpeech checkpoint for neural vocoding of Tacotron output [here](https://zjueducn-my.sharepoint.com/:f:/g/personal/rongjiehuang_zju_edu_cn/Epia7La6O7FHsKPTHZXZpoMBF7PoDcjWeKgC-7jtpVkCOQ?e=b8vPiA).\nWe provide a demo in `egs/demo_tacotron.ipynb`. \n\n### Using PortaSpeech, DiffSpeech, FastSpeech 2\n\n1. Download the LJSpeech checkpoint and put it in `checkpoints/FastDiff/model_ckpt_steps_*.ckpt`\n2. Specify the input `$text` and an integer index `$model_index` to choose the TTS model: `0` (PortaSpeech, Ren et al), `1` (FastSpeech 2, Ren et al), or `2` (DiffSpeech, Liu et al).\n3. 
Set `N` for reverse sampling, which is a trade-off between quality and speed. \n4. Run the following command.\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python egs/demo_tts.py --N $N --text \"$text\" --model $model_index \n```\nGenerated wav files are saved in `checkpoints/FastDiff/` by default.\u003cbr\u003e\nNote: For better quality, it's recommended to finetune the FastDiff model.\n\n## Inference from wav file\n1. Create a `wavs` directory and copy your wav files into it.\n2. Set `N` for reverse sampling, which is a trade-off between quality and speed. \n3. Run the following command.\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --infer --hparams=\"test_input_dir=wavs,N=$N\"\n```\n\nGenerated wav files are saved in `checkpoints/$your_experiment_name/` by default.\u003cbr\u003e\n\n## Inference for end-to-end speech synthesis\n1. Create a `mels` directory and copy the generated mel-spectrogram files into it.\u003cbr\u003e\nYou can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2), \n[Glow-TTS](https://github.com/jaywalnut310/glow-tts) and so forth.\n2. Set `N` for reverse sampling, which is a trade-off between quality and speed. \n3. Run the following command.\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config --exp_name $your_experiment_name --infer --hparams=\"test_mel_dir=mels,use_wav=False,N=$N\"\n```\nGenerated wav files are saved in `checkpoints/$your_experiment_name/` by default.\u003cbr\u003e\n\nNote: If the output wav sounds noisy, the likely cause is a mismatch in mel-spectrogram preprocessing between the acoustic model and the vocoder.\n\n# Train your own model\n\n### Data Preparation and Configuration ##\n1. Set `raw_data_dir`, `processed_data_dir`, `binary_data_dir` in the config file. For a custom dataset, please specify the audio preprocessing configuration in `modules/FastDiff/config/base.yaml`\n2. Download the dataset to `raw_data_dir`. 
Note: the dataset structure needs to follow `egs/datasets/audio/*/pre_align.py`, or you can rewrite `pre_align.py` to match your dataset.\n3. Preprocess the dataset \n```bash\n# Preprocess step: unify the file structure.\npython data_gen/tts/bin/pre_align.py --config $path/to/config\n# Binarization step: Binarize data for fast IO.\nCUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config\n```\n\nWe also provide our processed LJSpeech dataset [here](https://zjueducn-my.sharepoint.com/:f:/g/personal/rongjiehuang_zju_edu_cn/Eo7r83WZPK1GmlwvFhhIKeQBABZpYW3ec9c8WZoUV5HhbA?e=9QoWnf).\n\n### Training the Refinement Network\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --reset\n```\n\n### Training the Noise Predictor Network (Optional)\nRefer to [Bilateral Denoising Diffusion Models (BDDMs)](https://github.com/tencent-ailab/bddm).\n\n### Noise Scheduling (Optional)\nFor now, you can use our pre-derived noise schedule, or refer to [Bilateral Denoising Diffusion Models (BDDMs)](https://github.com/tencent-ailab/bddm).\n\n### Inference\n\n```bash\nCUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config $path/to/config  --exp_name $your_experiment_name --infer\n```\n\n\n## Acknowledgements\nThis implementation uses parts of the code from the following GitHub repos:\n[NATSpeech](https://github.com/NATSpeech/NATSpeech),\n[Tacotron2](https://github.com/NVIDIA/tacotron2), and\n[DiffWave-Vocoder](https://github.com/philsyn/DiffWave-Vocoder)\nas described in our code.\n\n## Citations ##\nIf you find this code useful in your research, please consider citing:\n```\n@inproceedings{huang2022fastdiff,\n  title={FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis},\n  author={Huang, Rongjie and Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong and Ren, Yi and Zhao, Zhou},\n  booktitle = {Proceedings of the Thirty-First International Joint Conference on\n              
 Artificial Intelligence, {IJCAI-22}},\n  publisher = {International Joint Conferences on Artificial Intelligence Organization},\n  year={2022}\n}\n```\n\n## Disclaimer ##\n- This is not an officially supported Tencent product.\n\n- Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRongjiehuang%2FFastDiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRongjiehuang%2FFastDiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRongjiehuang%2FFastDiff/lists"}