{"id":13454828,"url":"https://github.com/ibab/tensorflow-wavenet","last_synced_at":"2025-05-14T09:08:24.882Z","repository":{"id":41407105,"uuid":"68013171","full_name":"ibab/tensorflow-wavenet","owner":"ibab","description":"A TensorFlow implementation of DeepMind's WaveNet paper","archived":false,"fork":false,"pushed_at":"2023-07-12T06:15:53.000Z","size":335,"stargazers_count":5439,"open_issues_count":176,"forks_count":1289,"subscribers_count":260,"default_branch":"master","last_synced_at":"2025-04-04T14:11:12.333Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ibab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-09-12T13:50:45.000Z","updated_at":"2025-04-03T06:07:07.000Z","dependencies_parsed_at":"2023-01-19T22:58:50.526Z","dependency_job_id":"8099bca7-b566-485a-b2d6-496cd100f582","html_url":"https://github.com/ibab/tensorflow-wavenet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibab%2Ftensorflow-wavenet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibab%2Ftensorflow-wavenet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibab%2Ftensorflow-wavenet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ibab%2Ftensorflow-wavenet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ibab","download_url":"https://codeload.github.com/ibab/tensorflow-wavenet/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248469121,"owners_count":21108960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T08:00:58.289Z","updated_at":"2025-04-11T19:45:16.823Z","avatar_url":"https://github.com/ibab.png","language":"Python","funding_links":[],"categories":["Models/Projects","Speech synthesizer or Text-to-Speech(TTS)","Python","TensorFlow Models","模型项目","Deep Learning Projects"],"sub_categories":["Audio Processing","微信群"],"readme":"# A TensorFlow implementation of DeepMind's WaveNet paper\n\n[![Build Status](https://travis-ci.org/ibab/tensorflow-wavenet.svg?branch=master)](https://travis-ci.org/ibab/tensorflow-wavenet)\n\nThis is a TensorFlow implementation of the [WaveNet generative neural\nnetwork architecture](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) for audio generation.\n\n\u003ctable style=\"border-collapse: collapse\"\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\u003cp\u003e\nThe WaveNet neural network architecture directly generates a raw audio waveform,\nshowing excellent results in text-to-speech and general audio generation (see the\nDeepMind blog post and paper for details).\n\u003c/p\u003e\n\u003cp\u003e\nThe network models the conditional probability to generate the next\nsample in the audio waveform, given all previous samples and possibly\nadditional parameters.\n\u003c/p\u003e\n\u003cp\u003e\nAfter an audio preprocessing step, the input waveform is quantized to a fixed integer range.\nThe integer amplitudes are then one-hot encoded to produce a 
tensor of shape \u003ccode\u003e(num_samples, num_channels)\u003c/code\u003e.\n\u003c/p\u003e\n\u003cp\u003e\nA convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.\n\u003c/p\u003e\n\u003cp\u003e\nThe core of the network is constructed as a stack of \u003cem\u003ecausal dilated layers\u003c/em\u003e, each of which is a\ndilated convolution (convolution with holes), which only accesses the current and past audio samples.\n\u003c/p\u003e\n\u003cp\u003e\nThe outputs of all layers are combined and extended back to the original number\nof channels by a series of dense postprocessing layers, followed by a softmax\nfunction to transform the outputs into a categorical distribution.\n\u003c/p\u003e\n\u003cp\u003e\nThe loss function is the cross-entropy between the output for each timestep and the input at the next timestep.\n\u003c/p\u003e\n\u003cp\u003e\nIn this repository, the network implementation can be found in \u003ca href=\"./wavenet/model.py\"\u003emodel.py\u003c/a\u003e.\n\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd width=\"300\"\u003e\n\u003cimg src=\"images/network.png\" width=\"300\"\u003e\u003c/img\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Requirements\n\nTensorFlow needs to be installed before running the training script.\nCode is tested on TensorFlow version 1.0.1 for Python 2.7 and Python 3.5.\n\nIn addition, [librosa](https://github.com/librosa/librosa) must be installed for reading and writing audio.\n\nTo install the required python packages, run\n```bash\npip install -r requirements.txt\n```\n\nFor GPU support, use\n```bash\npip install -r requirements_gpu.txt\n```\n\n## Training the network\n\nYou can use any corpus containing `.wav` files.\nWe've mainly used the [VCTK corpus](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) (around 10.4GB, [Alternative host](http://www.udialogue.org/download/cstr-vctk-corpus.html)) so far.\n\nIn order to train the network, 
execute\n```bash\npython train.py --data_dir=corpus\n```\nwhere `corpus` is a directory containing `.wav` files.\nThe script will recursively collect all `.wav` files in the directory.\n\nYou can see documentation on each of the training settings by running\n```bash\npython train.py --help\n```\n\nYou can find the configuration of the model parameters in [`wavenet_params.json`](./wavenet_params.json).\nThese need to stay the same between training and generation.\n\n### Global Conditioning\nGlobal conditioning refers to modifying the model such that an id drawn from a set of mutually exclusive categories is specified during training and during generation of a `.wav` file.\nIn the case of the VCTK corpus, this id is the integer id of the speaker, of which there are over a hundred.\nThis allows (indeed requires) that a speaker id be specified at generation time to select which of the speakers the model should mimic. For more details see the paper or source code.\n\n### Training with Global Conditioning\nThe instructions above for training refer to training without global conditioning. 
To train with global conditioning, specify command-line arguments as follows:\n```\npython train.py --data_dir=corpus --gc_channels=32\n```\nThe `--gc_channels` argument does two things:\n* It tells the `train.py` script that\nit should build a model that includes global conditioning.\n* It specifies the\nsize of the embedding vector that is looked up based on the id of the speaker.\n\nThe global conditioning logic in `train.py` and `audio_reader.py` is currently \"hard-wired\" to the VCTK corpus: it expects to determine the speaker id from the file-naming pattern used in VCTK, but it can easily be modified.\n\n## Generating audio\n\n[Example output](https://soundcloud.com/user-731806733/tensorflow-wavenet-500-msec-88k-train-steps)\ngenerated by @jyegerlehner based on speaker 280 from the VCTK corpus.\n\nYou can use the `generate.py` script to generate audio using a previously trained model.\n\n### Generating without Global Conditioning\nRun\n```\npython generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000\n```\nwhere `logdir/train/2017-02-13T16-45-34/model.ckpt-80000` needs to be a path to a previously saved model (without extension).\nThe `--samples` parameter specifies how many audio samples you would like to generate (16000 corresponds to 1 second by default).\n\nThe generated waveform can be played back using TensorBoard, or stored as a\n`.wav` file by using the `--wav_out_path` parameter:\n```\npython generate.py --wav_out_path=generated.wav --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000\n```\n\nPassing `--save_every` in addition to `--wav_out_path` will save the in-progress `.wav` file every n samples, where n is the value given to `--save_every`.\n```\npython generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000\n```\n\nFast generation is enabled by default.\nIt uses the implementation from the [Fast Wavenet](https://github.com/tomlepaine/fast-wavenet) repository.\nYou can 
follow the link for an explanation of how it works.\nFast generation reduces the time needed to generate samples to a few minutes.\n\nTo disable fast generation:\n```\npython generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000 --fast_generation=false\n```\n\n### Generating with Global Conditioning\nGenerate from a model incorporating global conditioning as follows:\n```\npython generate.py --samples 16000 --wav_out_path speaker311.wav --gc_channels=32 --gc_cardinality=377 --gc_id=311 logdir/train/2017-02-13T16-45-34/model.ckpt-80000\n```\nWhere:\n\n`--gc_channels=32` specifies that the embedding vector has size 32, and\nmust match what was specified when training.\n\n`--gc_cardinality=377` is required\nbecause 376 is the largest speaker id in the VCTK corpus. If some other corpus\nis used, then this number should match the value that is automatically determined and\nprinted out by the `train.py` script at training time.\n\n`--gc_id=311` specifies the id of the speaker (here, speaker 311) for which a sample is\nto be generated.\n\n## Running tests\n\nInstall the test requirements\n```\npip install -r requirements_test.txt\n```\n\nRun the test suite\n```\n./ci/test.sh\n```\n\n## Missing features\n\nLocal conditioning on extra information, which would allow\ncontext stacks or control over what speech is generated, is not yet implemented.\n\n\n## Related projects\n\n- [tex-wavenet](https://github.com/Zeta36/tensorflow-tex-wavenet), a WaveNet for text generation.\n- [image-wavenet](https://github.com/Zeta36/tensorflow-image-wavenet), a WaveNet for image generation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibab%2Ftensorflow-wavenet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fibab%2Ftensorflow-wavenet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibab%2Ftensorflow-wavenet/lists"}