{"id":19085035,"url":"https://github.com/jaywalnut310/waveglow-vqvae","last_synced_at":"2025-04-30T09:26:00.761Z","repository":{"id":85847256,"uuid":"191395787","full_name":"jaywalnut310/waveglow-vqvae","owner":"jaywalnut310","description":"WaveGlow vocoder with VQVAE","archived":false,"fork":false,"pushed_at":"2019-06-18T05:23:30.000Z","size":3698,"stargazers_count":61,"open_issues_count":4,"forks_count":7,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-18T23:59:53.181Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaywalnut310.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-06-11T15:05:37.000Z","updated_at":"2025-02-07T18:03:50.000Z","dependencies_parsed_at":"2023-03-08T16:45:41.051Z","dependency_job_id":null,"html_url":"https://github.com/jaywalnut310/waveglow-vqvae","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fwaveglow-vqvae","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fwaveglow-vqvae/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fwaveglow-vqvae/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fwaveglow-vqvae/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaywalnut310","download_url"
:"https://codeload.github.com/jaywalnut310/waveglow-vqvae/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251675746,"owners_count":21625882,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T02:53:37.286Z","updated_at":"2025-04-30T09:26:00.755Z","avatar_url":"https://github.com/jaywalnut310.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WaveGlow vocoder with VQVAE\n\nTensorFlow implementation of [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002)\nand [Neural Discrete Representation Learning](https://arxiv.org/abs/1711.00937).\n\nThis implementation includes **multi-GPU** and **mixed precision** (still unstable) support.\nIt is heavily based on existing GitHub repositories such as [waveglow](https://github.com/NVIDIA/waveglow).\nThe data used here are the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/) and the [VCTK Corpus](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html).\n\nYou can choose either mel-spectrograms or vector-quantized representations as the local condition, and choose whether to use speaker identity as a global condition.\nAs additional options, Polyak averaging, FiLM, and weight normalization are implemented.\n\n\n## Audio Samples\n### LJ dataset\nMel spectrogram condition (original WaveGlow): https://drive.google.com/open?id=1HuV51fnhEZG_6vGubXVrer6lAtZK7py9\n\nVQVAE condition: https://drive.google.com/open?id=1xcGSelMycn2g-72noZH4vPiPpG0d7pZq\n\n### VCTK Corpus (Voice 
conversion)\nIt does not work well yet :(\n\nSource (360): https://drive.google.com/open?id=1CfEvnQS_dVYRhsvj8NDqogOJlzK7npTd\n\nTarget (303): https://drive.google.com/open?id=1-kcSglimKgJrRjLDfPbD7s5KxZuFRY-i\n\n\n## My Humble Contribution\nI slightly modify the original VQVAE optimization technique to make it more robust to hyperparameter choices and to diversify latent code usage, avoiding index collapse.\nThat is:\n- The original technique consists of 1) finding the nearest latent code for each encoded vector and 2) updating latent codes according to their matching encoded vectors.\n- I change these steps to 1) computing a distribution over latent codes given the encoded vectors and 2) updating latent codes to increase the likelihood under the distribution of matching encoded vectors.\n- Replacing EMA with gradient descent gives latent codes additional gradient signals that reduce the reconstruction loss, which is impossible in the EMA setting.\n\nIt closely resembles the Soft-EM method. The difference from Soft-EM is that the closed-form maximization step is replaced with gradient descent.\nFor more information, please see em_toy.ipynb or contact me (jaywalnut310@gmail.com).\n\nAs I haven't investigated this method thoroughly, I cannot say that it is better than previous methods in every case.\nBut I found that it works well in all of my experimental settings (no index collapse).\n\n\n## Pre-requisites\n1. TensorFlow 1.12 (1.13 should work with some deprecation warnings)\n2. (If fp16 training is needed) Volta GPUs\n\n\n## Setup\n```sh\n# 1. Create dataset folder\nmkdir datasets\ncd datasets\n\n# 2. Download and extract datasets\nwget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2\ntar -jxvf LJSpeech-1.1.tar.bz2\n\n# Additionally, download VCTK Corpus\nwget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz\ntar -zxvf VCTK-Corpus.tar.gz\ncd ../filelists\npython resample_vctk.py # Change sample rate\n\n# 3. 
Create TFRecords\npython generate_data.py\n\n# Additionally, create VCTK TFRecords\npython generate_data.py -c tfr_dir=datasets/vctk tfr_prefix=vctk train_files=filelists/vctk_sid_audio_text_train_filelist.txt eval_files=filelists/vctk_sid_audio_text_eval_filelist.txt\n```\n\n\n## Training\n```sh\n# 1. Create log directory\nmkdir ~/your-log-dir\n\n# 2. (Optional) Copy configs\ncp ./config.yml ~/your-log-dir\n\n# 3. Run training\npython train.py -m ~/your-log-dir\n```\n\nTo change hparams, use one of two options:\n* Modify config.yml\n* Add arguments as below:\n  ```sh\n  python train.py -m ~/your-log-dir --c hidden_size=512 num_heads=8\n  ```\n\nExample configs:\n- fp32 training: `python train.py -m ~/your-log-dir --c ftype=float32 loss_scale=1`\n- mel condition: `python train.py -m ~/your-log-dir --c local_condition=mel use_vq=false`\n- remove FiLM layers: `python train.py -m ~/your-log-dir --c use_film=false`\n\n\n## Pre-trained models\nCompressed model directories with pretrained weights: WILL BE UPLOADED SOON!\n\nYou can generate samples with those models in inference.ipynb.\n\nYou may have to change tfr_dir and model_dir to match your setup.\n\n\n## Disclaimer\n- With fp16 settings, training 1M steps takes 1 week on 4 V100 GPUs.\n- I haven't tried fp32 training, so there may be issues in training high-quality models with it.\n- As fp16 training is not yet robust enough, I usually train a FiLM-enabled model and a FiLM-disabled model consecutively and choose whichever survives.\n- For a single-speaker dataset (LJ Speech), the vocoding quality of the trained model is good enough compared to the mel-spectrogram conditioned one.\n- For a multi-speaker dataset (VCTK Corpus), disentangling speaker identity from the local condition does not work well yet. 
I am still investigating the reasons.\n- The next step would be training a Text-to-LatentCodes model (such as a Transformer) so that full TTS is possible.\n- If you're interested in this project, please help me improve the models!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaywalnut310%2Fwaveglow-vqvae","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaywalnut310%2Fwaveglow-vqvae","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaywalnut310%2Fwaveglow-vqvae/lists"}