{"id":26527911,"url":"https://github.com/lmnt-com/diffwave","last_synced_at":"2025-10-03T13:07:49.690Z","repository":{"id":37662767,"uuid":"297845441","full_name":"lmnt-com/diffwave","owner":"lmnt-com","description":"DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.","archived":false,"fork":false,"pushed_at":"2024-03-26T12:31:24.000Z","size":21,"stargazers_count":823,"open_issues_count":14,"forks_count":116,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-04-19T05:01:10.669Z","etag":null,"topics":["deep-learning","diffwave","machine-learning","neural-network","paper","pretrained-models","pytorch","speech","speech-synthesis","text-to-speech","tts","vocoder"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lmnt-com.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-23T03:50:07.000Z","updated_at":"2025-04-16T15:07:28.000Z","dependencies_parsed_at":"2025-03-21T15:51:06.629Z","dependency_job_id":null,"html_url":"https://github.com/lmnt-com/diffwave","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmnt-com%2Fdiffwave","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmnt-com%2Fdiffwave/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmnt-com%2Fdiffwave/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmnt-c
om%2Fdiffwave/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lmnt-com","download_url":"https://codeload.github.com/lmnt-com/diffwave/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254436944,"owners_count":22070946,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","diffwave","machine-learning","neural-network","paper","pretrained-models","pytorch","speech","speech-synthesis","text-to-speech","tts","vocoder"],"created_at":"2025-03-21T15:36:22.493Z","updated_at":"2025-10-03T13:07:49.620Z","avatar_url":"https://github.com/lmnt-com.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DiffWave\n![PyPI Release](https://img.shields.io/pypi/v/diffwave?label=release) [![License](https://img.shields.io/github/license/lmnt-com/diffwave)](https://github.com/lmnt-com/diffwave/blob/master/LICENSE)\n\n**We're hiring!**\nIf you like what we're building here, [come join us at LMNT](https://explore.lmnt.com).\n\nDiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). 
The model and architecture details are described in [DiffWave: A Versatile Diffusion Model for Audio Synthesis](https://arxiv.org/pdf/2009.09761.pdf).\n\n## What's new (2021-11-09)\n- unconditional waveform synthesis (thanks to [Andrechang](https://github.com/Andrechang)!)\n\n## What's new (2021-04-01)\n- fast sampling algorithm based on v3 of the DiffWave paper\n\n## What's new (2020-10-14)\n- new pretrained model trained for 1M steps\n- updated audio samples with output from new model\n\n## Status (2021-11-09)\n- [x] fast inference procedure\n- [x] stable training\n- [x] high-quality synthesis\n- [x] mixed-precision training\n- [x] multi-GPU training\n- [x] command-line inference\n- [x] programmatic inference API\n- [x] PyPI package\n- [x] audio samples\n- [x] pretrained models\n- [x] unconditional waveform synthesis\n\nBig thanks to [Zhifeng Kong](https://github.com/FengNiMa) (lead author of DiffWave) for pointers and bug fixes.\n\n## Audio samples\n[22.05 kHz audio samples](https://lmnt.com/assets/diffwave)\n\n## Pretrained models\n[22.05 kHz pretrained model](https://lmnt.com/assets/diffwave/diffwave-ljspeech-22kHz-1000578.pt) (31 MB, SHA256: `d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8`)\n\nThis pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).\n\n### Pre-trained model details\n- trained on 4x 1080Ti\n- default parameters\n- single precision floating point (FP32)\n- trained on LJSpeech dataset excluding LJ001\u0026ast; and LJ002\u0026ast;\n- trained for 1000578 steps (1273 epochs)\n\n## Install\n\nInstall using pip:\n```\npip install diffwave\n```\n\nor from GitHub:\n```\ngit clone https://github.com/lmnt-com/diffwave.git\ncd diffwave\npip install .\n```\n\n### Training\nBefore you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. 
[LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [VCTK](https://pytorch.org/audio/_modules/torchaudio/datasets/vctk.html)). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit [params.py](https://github.com/lmnt-com/diffwave/blob/master/src/diffwave/params.py).\n\n```\npython -m diffwave.preprocess /path/to/dir/containing/wavs\npython -m diffwave /path/to/model/dir /path/to/dir/containing/wavs\n\n# in another shell to monitor training progress:\ntensorboard --logdir /path/to/model/dir --bind_all\n```\n\nYou should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).\n\n#### Multi-GPU training\nBy default, this implementation uses as many GPUs in parallel as returned by [`torch.cuda.device_count()`](https://pytorch.org/docs/stable/cuda.html#torch.cuda.device_count). You can specify which GPUs to use by setting the [`CUDA_VISIBLE_DEVICES`](https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) environment variable before running the training module.\n\n### Inference API\nBasic usage:\n\n```python\nfrom diffwave.inference import predict as diffwave_predict\n\nmodel_dir = '/path/to/model/dir'\nspectrogram = ...  # placeholder: supply a spectrogram tensor in [N,C,W] format\naudio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)\n\n# audio is a GPU tensor in [N,T] format.\n```\n\n### Inference CLI\n```\npython -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav\n```\n\n## References\n- [DiffWave: A Versatile Diffusion Model for Audio Synthesis](https://arxiv.org/pdf/2009.09761.pdf)\n- [Denoising Diffusion Probabilistic Models](https://arxiv.org/pdf/2006.11239.pdf)\n- [Code for Denoising Diffusion Probabilistic 
Models](https://github.com/hojonathanho/diffusion)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmnt-com%2Fdiffwave","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flmnt-com%2Fdiffwave","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmnt-com%2Fdiffwave/lists"}