{"id":19556288,"url":"https://github.com/maum-ai/wavegrad2","last_synced_at":"2025-08-02T12:36:04.802Z","repository":{"id":48783053,"uuid":"384387750","full_name":"maum-ai/wavegrad2","owner":"maum-ai","description":"Unofficial Pytorch Implementation of WaveGrad2","archived":false,"fork":false,"pushed_at":"2021-08-18T09:20:33.000Z","size":14139,"stargazers_count":112,"open_issues_count":1,"forks_count":16,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-06-03T03:13:04.629Z","etag":null,"topics":["deep-generative-model","deep-learning","end-to-end","speech-synthesis","text-to-speech","tts"],"latest_commit_sha":null,"homepage":"https://mindslab-ai.github.io/wavegrad2/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maum-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-09T09:28:57.000Z","updated_at":"2025-02-25T06:20:21.000Z","dependencies_parsed_at":"2022-08-30T14:31:46.557Z","dependency_job_id":null,"html_url":"https://github.com/maum-ai/wavegrad2","commit_stats":null,"previous_names":["maum-ai/wavegrad2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fwavegrad2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fwavegrad2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fwavegrad2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fwavegrad2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maum-ai","download_url":"http
s://codeload.github.com/maum-ai/wavegrad2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fwavegrad2/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259280300,"owners_count":22833424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-generative-model","deep-learning","end-to-end","speech-synthesis","text-to-speech","tts"],"created_at":"2024-11-11T04:37:27.136Z","updated_at":"2025-06-11T14:09:58.282Z","avatar_url":"https://github.com/maum-ai.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WaveGrad 2 \u0026mdash; Unofficial PyTorch Implementation\n\n**WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis**\u003cbr\u003e\nUnofficial PyTorch+[Lightning](https://github.com/PyTorchLightning/pytorch-lightning) Implementation of **Chen *et al.*(JHU, Google Brain), [WaveGrad2](https://arxiv.org/abs/2106.09660)**.\u003cbr\u003e\n\n[![arXiv](https://img.shields.io/badge/arXiv-2106.09660-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2106.09660) [![githubio](https://img.shields.io/static/v1?message=Audio%20Samples\u0026logo=Github\u0026labelColor=grey\u0026color=blue\u0026logoColor=white\u0026label=%20\u0026style=flat-square)](https://mindslab-ai.github.io/wavegrad2/) 
[![Colab](https://img.shields.io/static/v1?message=Open%20in%20Colab\u0026logo=googlecolab\u0026labelColor=grey\u0026color=yellow\u0026logoColor=white\u0026label=%20\u0026style=flat-square)](https://colab.research.google.com/drive/1AK3AI3lS_rXacTIYHpf0mYV4NdU56Hn6?usp=sharing)\n\n![](./docs/sampling.gif)\n\n**Update: Enjoy our pre-trained model with [Google Colab notebook](https://colab.research.google.com/drive/1AK3AI3lS_rXacTIYHpf0mYV4NdU56Hn6?usp=sharing)!**\n\n## TODO\n- [x] More training for WaveGrad-Base setup\n- [x] Checkpoint release for Base\n- [x] WaveGrad-Large Decoder\n- [x] Checkpoint release for Large\n- [ ] Inference by reduced sampling steps\n\n## Requirements\n- [Pytorch](https://pytorch.org/) \n- [Pytorch-Lightning](https://github.com/PyTorchLightning/pytorch-lightning)==1.2.10\n- The requirements are highlighted in [requirements.txt](./requirements.txt).\u003cbr\u003e\n- We also provide docker setup [Dockerfile](./Dockerfile).\u003cbr\u003e\n\n## Datasets\nThe supported datasets are\n- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.\n- [AISHELL-3](http://www.aishelltech.com/aishell_3): a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.\n- etc.\n\nWe take LJSpeech as an example hereafter.\n## Preprocessing\n- Adjust `preprocess.yaml`, especially `path` section.\n```yaml\npath:\n  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path\n  lexicon_path: 'lexicon/librispeech-lexicon.txt'\n  raw_path: './raw_data/LJSpeech'\n  preprocessed_path: './preprocessed_data/LJSpeech'\n``` \n\n- run `prepare_align.py` for some preparations. 
\n```shell script\npython prepare_align.py -c preprocess.yaml\n```\n\n- [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.\nAlignments for the LJSpeech and AISHELL-3 datasets are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing).\nUnzip the files into ``preprocessed_data/LJSpeech/TextGrid/``.\n\n- After that, run `preprocess.py`.\n```shell script\npython preprocess.py -c preprocess.yaml\n```\n\n- Alternatively, you can align the corpus yourself. \n- Download the official MFA package and run it to align the corpus.\n```shell script\n./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech\n```\nor\n```shell script\n./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech\n```\n\n- Then run `preprocess.py`.\n```shell script\npython preprocess.py -c preprocess.yaml\n```\n\n## Training\n- Adjust `hparameter.yaml`, especially the `train` section.\n```yaml\ntrain:\n  batch_size: 12 # Dependent on GPU memory size\n  adam:\n    lr: 3e-4\n    weight_decay: 1e-6\n  decay:\n    rate: 0.05\n    start: 25000\n    end: 100000\n  num_workers: 16 # Dependent on CPU cores\n  gpus: 2 # number of GPUs\n  loss_rate:\n    dur: 1.0\n```\n\n- To train with another dataset, adjust the `data` section in `hparameter.yaml`.\n```yaml\ndata:\n  lang: 'eng'\n  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners\n  speakers: ['LJSpeech']\n  train_dir: 'preprocessed_data/LJSpeech'\n  train_meta: 'train.txt'  # relative path of metadata file from train_dir\n  val_dir: 'preprocessed_data/LJSpeech'\n  val_meta: 'val.txt'  # relative path of metadata file from val_dir\n  lexicon_path: 'lexicon/librispeech-lexicon.txt'\n```\n\n- run 
`trainer.py`\n```shell script\npython trainer.py\n```\n\n- To resume training from a checkpoint, check the argument parser.\n```python\nparser = argparse.ArgumentParser()\nparser.add_argument('-r', '--resume_from', type =int,\\\n\trequired = False, help = \"Resume Checkpoint epoch number\")\nparser.add_argument('-s', '--restart', action = \"store_true\",\\\n\trequired = False, help = \"Significant change occured, use this\")\nparser.add_argument('-e', '--ema', action = \"store_true\",\n\trequired = False, help = \"Start from ema checkpoint\")\nargs = parser.parse_args()\n```\n\n- During training, the TensorBoard logger records loss, spectrograms, and audio.\n```shell script\ntensorboard --logdir=./tensorboard --bind_all\n```\n![](./docs/tb.png)\n\n## Inference\n- run `inference.py`\n```shell script\npython inference.py -c \u003ccheckpoint_path\u003e --text \u003c'text'\u003e\n```\n\nWe provide a Jupyter notebook with inference code and visualizations of the resulting audio.\n- [Colab notebook](https://colab.research.google.com/drive/1AK3AI3lS_rXacTIYHpf0mYV4NdU56Hn6?usp=sharing) \nThis notebook provides pre-trained weights for WaveGrad 2, which you can download via the URL inside (checkpoints for both the `WaveGrad-Base` and `WaveGrad-Large` decoders).\n\n## Large Decoder\nWe implemented the `WaveGrad-Large` decoder for higher-MOS output.\u003cbr\u003e\n**Note: it may differ from Google's implementation, since the number of parameters differs from the value reported in the paper.**\u003cbr\u003e\n- To train with the Large model, modify `hparameter.yaml`.\n```yaml\nwavegrad:\n  is_large: True #if False, Base\n  ...\n  dilations: [[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,4,8]] #dilations for Large\n  #dilations: [[1,2,4,8],[1,2,4,8],[1,2,4,8],[1,2,1,2],[1,2,1,2]] dilations for Base\n```\n- Then go back to the [Training section](#training).\n\n## Note\nSince this repo is an unofficial implementation and the WaveGrad 2 paper does not provide several details, slight 
differences from the paper may exist.  \nWe list our modifications and arbitrary choices below:\n- A normal LSTM without ZoneOut is used for the encoder. \n- [g2p\\_en](https://github.com/Kyubyong/g2p) is used instead of Google's undisclosed G2P.\n- Trained with the LJSpeech dataset instead of Google's proprietary dataset.\n  - Because of the dataset replacement, the output audio's sampling rate is 22.05 kHz instead of 24 kHz.\n- MT + SpecAug are not implemented.\n- The WaveGrad decoder shares the same issues as [ivanvovk's WaveGrad implementation](https://github.com/ivanvovk/WaveGrad).\n  - e.g. https://github.com/ivanvovk/WaveGrad/issues/24#issue-943985027\n- The `WaveGrad-Large` decoder's architecture may differ from Google's implementation.\n- Hyperparameters\n  - `train.batch_size: 12` for Base and `train.batch_size: 6` for Large, trained with 2 V100 (32GB) GPUs\n  - `train.adam.lr: 3e-4` and `train.adam.weight_decay: 1e-6`\n  - `train.decay` learning rate decay is applied during training\n  - `train.loss_rate: 1` as `total_loss = 1 * L1_loss + 1 * duration_loss`\n  - `ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)`\n  - `encoder.channel` is reduced to 512 from 1024 or 2048\n- *TODO* items.\n\n## Tree\n```\n.\n├── Dockerfile\n├── README.md\n├── dataloader.py\n├── docs\n│   ├── spec.png\n│   ├── tb.png\n│   └── tblogger.png\n├── hparameter.yaml\n├── inference.py\n├── lexicon\n│   ├── librispeech-lexicon.txt\n│   └── pinyin-lexicon-r.txt\n├── lightning_model.py\n├── model\n│   ├── base.py\n│   ├── downsampling.py\n│   ├── encoder.py\n│   ├── gaussian_upsampling.py\n│   ├── interpolation.py\n│   ├── layers.py\n│   ├── linear_modulation.py\n│   ├── nn.py\n│   ├── resampling.py\n│   ├── upsampling.py\n│   └── window.py\n├── prepare_align.py\n├── preprocess.py\n├── preprocess.yaml\n├── preprocessor\n│   ├── ljspeech.py\n│   └── preprocessor.py\n├── text\n│   ├── __init__.py\n│   ├── cleaners.py\n│   ├── cmudict.py\n│   ├── numbers.py\n│   └── symbols.py\n├── 
trainer.py\n├── utils\n│   ├── mel.py\n│   ├── stft.py\n│   ├── tblogger.py\n│   └── utils.py\n└── wavegrad2_tester.ipynb\n```\n\n## Author\nThis code is implemented by\n- [Seungu Han](https://github.com/Seungwoo0326) at MINDs Lab [hansw0326@mindslab.ai](mailto:hansw0326@mindslab.ai)\n- [Junhyeok Lee](https://github.com/junjun3518) at MINDs Lab [jun3518@mindslab.ai](mailto:jun3518@mindslab.ai)\n\nSpecial thanks to \n- [Kang-wook Kim](https://github.com/wookladin) at MINDs Lab \n- [Wonbin Jung](https://github.com/Wonbin-Jung) at MINDs Lab\n- [Sang Hoon Woo](https://github.com/tonyswoo) at MINDs Lab\n\n## References\n- Chen *et al.*, [WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis](https://arxiv.org/abs/2106.09660)\n- Chen *et al.*, [WaveGrad: Estimating Gradients for Waveform Generation](https://arxiv.org/abs/2009.00713)\n- Ho *et al.*, [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)\n- Shen *et al.*, [Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling](https://arxiv.org/abs/2010.04301)\n\nThis implementation uses code from the following repositories:\n- [J.Ho's Official DDPM Implementation](https://github.com/hojonathanho/diffusion)\n- [lucidrains' DDPM Pytorch Implementation](https://github.com/lucidrains/denoising-diffusion-pytorch)\n- [ivanvovk's WaveGrad Pytorch Implementation](https://github.com/ivanvovk/WaveGrad)\n- [lmnt-com's DiffWave Pytorch Implementation](https://github.com/lmnt-com/diffwave)\n- [ming024's FastSpeech2 Pytorch Implementation](https://github.com/ming024/FastSpeech2)\n- [yanggeng1995's EATS Pytorch Implementation](https://github.com/yanggeng1995/EATS)\n- [Kyubyong's g2p\\_en](https://github.com/Kyubyong/g2p)\n- [mindslab's NU-Wave](https://github.com/mindslab-ai/nuwave)\n- [Keith Ito's Tacotron implementation](https://github.com/keithito/tacotron)\n- [NVIDIA's Tacotron2 implementation](https://github.com/NVIDIA/tacotron2)\n\nThe webpage for 
the audio samples uses a template from:\n- [WaveGrad2 Official Github.io](https://wavegrad.github.io/v2/)\n\nThe audio samples on our webpage are partially derived from:\n- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.\n- [WaveGrad2 Official Github.io](https://wavegrad.github.io/v2/)\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaum-ai%2Fwavegrad2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaum-ai%2Fwavegrad2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaum-ai%2Fwavegrad2/lists"}