{"id":13585878,"url":"https://github.com/keithito/tacotron","last_synced_at":"2025-10-07T14:27:29.695Z","repository":{"id":38629547,"uuid":"96632503","full_name":"keithito/tacotron","owner":"keithito","description":"A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)","archived":false,"fork":false,"pushed_at":"2023-07-06T21:12:04.000Z","size":113,"stargazers_count":2975,"open_issues_count":137,"forks_count":955,"subscribers_count":148,"default_branch":"master","last_synced_at":"2025-05-08T02:42:07.857Z","etag":null,"topics":["machine-learning","python","speech-synthesis","tacotron","tensorflow","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/keithito.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-07-08T17:03:31.000Z","updated_at":"2025-04-11T04:37:37.000Z","dependencies_parsed_at":"2022-07-13T05:50:29.449Z","dependency_job_id":"2fd2b08c-52c1-4f76-bd9a-5afa761bd79d","html_url":"https://github.com/keithito/tacotron","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keithito%2Ftacotron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keithito%2Ftacotron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keithito%2Ftacotron/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keithito%2Ftacotron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/keithito","download_url":"https://codeload.github.com/keithito/tacotron/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254136729,"owners_count":22020771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","python","speech-synthesis","tacotron","tensorflow","tts"],"created_at":"2024-08-01T15:05:12.004Z","updated_at":"2025-10-07T14:27:24.642Z","avatar_url":"https://github.com/keithito.png","language":"Python","funding_links":[],"categories":["Python","Speech synthesizer or Text-to-Speech(TTS)","语音合成","Tools \u0026 Frameworks","Deepfake Voices"],"sub_categories":["网络服务_其他","Open-source projects","Codes Mainly on Generation"],"readme":"# Tacotron\n\nAn implementation of Tacotron speech synthesis in TensorFlow.\n\n\n### Audio Samples\n\n  * **[Audio Samples](https://keithito.github.io/audio-samples/)** from models trained using this repo.\n    * The first set was trained for 441K steps on the [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)\n      * Speech started to become intelligible around 20K steps.\n    * The second set was trained by [@MXGray](https://github.com/MXGray) for 140K steps on the [Nancy Corpus](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/).\n\n\n### Recent Updates\n\n1. @npuichigo [fixed](https://github.com/keithito/tacotron/pull/205) a bug where dropout was not being applied in the prenet.\n\n2. @begeekmyfriend created a [fork](https://github.com/begeekmyfriend/tacotron) that adds location-sensitive attention and the stop token from the [Tacotron 2](https://arxiv.org/abs/1712.05884) paper. This can greatly reduce the amount of data required to train a model.\n\n\n## Background\n\nIn April 2017, Google published a paper, [Tacotron: Towards End-to-End Speech Synthesis](https://arxiv.org/pdf/1703.10135.pdf),\nwhere they present a neural text-to-speech model that learns to synthesize speech directly from\n(text, audio) pairs. However, they didn't release their source code or training data. This is an\nindependent attempt to provide an open-source implementation of the model described in their paper.\n\nThe quality isn't as good as Google's demo yet, but hopefully it will get there someday :-).\nPull requests are welcome!\n\n\n\n## Quick Start\n\n### Installing dependencies\n\n1. Install Python 3.\n\n2. Install the latest version of [TensorFlow](https://www.tensorflow.org/install/) for your platform. For better\n   performance, install with GPU support if it's available. This code works with TensorFlow 1.3 and later.\n\n3. Install requirements:\n   ```\n   pip install -r requirements.txt\n   ```\n\n\n### Using a pre-trained model\n\n1. **Download and unpack a model**:\n   ```\n   curl https://data.keithito.com/data/speech/tacotron-20180906.tar.gz | tar xzC /tmp\n   ```\n\n2. **Run the demo server**:\n   ```\n   python3 demo_server.py --checkpoint /tmp/tacotron-20180906/model.ckpt\n   ```\n\n3. **Point your browser at localhost:9000**\n   * Type what you want to synthesize\n\n\n\n### Training\n\n*Note: you need at least 40GB of free disk space to train a model.*\n\n1. **Download a speech dataset.**\n\n   The following are supported out of the box:\n    * [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) (Public Domain)\n    * [Blizzard 2012](http://www.cstr.ed.ac.uk/projects/blizzard/2012/phase_one) (Creative Commons Attribution Share-Alike)\n\n   You can use other datasets if you convert them to the right format. See [TRAINING_DATA.md](TRAINING_DATA.md) for more info.\n\n\n2. **Unpack the dataset into `~/tacotron`**\n\n   After unpacking, your tree should look like this for LJ Speech:\n   ```\n   tacotron\n     |- LJSpeech-1.1\n         |- metadata.csv\n         |- wavs\n   ```\n\n   or like this for Blizzard 2012:\n   ```\n   tacotron\n     |- Blizzard2012\n         |- ATrampAbroad\n         |   |- sentence_index.txt\n         |   |- lab\n         |   |- wav\n         |- TheManThatCorruptedHadleyburg\n             |- sentence_index.txt\n             |- lab\n             |- wav\n   ```\n\n3. **Preprocess the data**\n   ```\n   python3 preprocess.py --dataset ljspeech\n   ```\n     * Use `--dataset blizzard` for Blizzard data\n\n4. **Train a model**\n   ```\n   python3 train.py\n   ```\n\n   Tunable hyperparameters are found in [hparams.py](hparams.py). You can adjust these at the command\n   line using the `--hparams` flag, for example `--hparams=\"batch_size=16,outputs_per_step=2\"`.\n   Hyperparameters should generally be set to the same values at both training and eval time.\n   The default hyperparameters are recommended for LJ Speech and other English-language data.\n   See [TRAINING_DATA.md](TRAINING_DATA.md) for other languages.\n\n\n5. **Monitor with Tensorboard** (optional)\n   ```\n   tensorboard --logdir ~/tacotron/logs-tacotron\n   ```\n\n   The trainer dumps audio and alignments every 1000 steps. You can find these in\n   `~/tacotron/logs-tacotron`.\n\n6. **Synthesize from a checkpoint**\n   ```\n   python3 demo_server.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000\n   ```\n   Replace \"185000\" with the checkpoint number that you want to use, then open a browser\n   to `localhost:9000` and type what you want to speak. Alternately, you can\n   run [eval.py](eval.py) at the command line:\n   ```\n   python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000\n   ```\n   If you set the `--hparams` flag when training, set the same value here.\n\n\n## Notes and Common Issues\n\n  * [TCMalloc](http://goog-perftools.sourceforge.net/doc/tcmalloc.html) seems to improve\n    training speed and avoids occasional slowdowns seen with the default allocator. You\n    can enable it by installing it and setting `LD_PRELOAD=/usr/lib/libtcmalloc.so`. With TCMalloc,\n    you can get around 1.1 sec/step on a GTX 1080Ti.\n\n  * You can train with [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) by downloading the\n    dictionary to ~/tacotron/training and then passing the flag `--hparams=\"use_cmudict=True\"` to\n    train.py. This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval\n    time to force a particular pronunciation, e.g. `Turn left on {HH AW1 S S T AH0 N} Street.`\n\n  * If you pass a Slack incoming webhook URL as the `--slack_url` flag to train.py, it will send\n    you progress updates every 1000 steps.\n\n  * Occasionally, you may see a spike in loss and the model will forget how to attend (the\n    alignments will no longer make sense). Although it will recover eventually, it may\n    save time to restart at a checkpoint prior to the spike by passing the\n    `--restore_step=150000` flag to train.py (replacing 150000 with a step number prior to the\n    spike). **Update**: a recent [fix](https://github.com/keithito/tacotron/pull/7) to gradient\n    clipping by @candlewill may have fixed this.\n    \n  * During eval and training, audio length is limited to `max_iters * outputs_per_step * frame_shift_ms`\n    milliseconds. With the defaults (max_iters=200, outputs_per_step=5, frame_shift_ms=12.5), this is\n    12.5 seconds.\n    \n    If your training examples are longer, you will see an error like this:\n    `Incompatible shapes: [32,1340,80] vs. [32,1000,80]`\n    \n    To fix this, you can set a larger value of `max_iters` by passing `--hparams=\"max_iters=300\"` to\n    train.py (replace \"300\" with a value based on how long your audio is and the formula above).\n    \n  * Here is the expected loss curve when training on LJ Speech with the default hyperparameters:\n    ![Loss curve](https://user-images.githubusercontent.com/1945356/36077599-c0513e4a-0f21-11e8-8525-07347847720c.png)\n\n\n## Other Implementations\n  * By Alex Barron: https://github.com/barronalex/Tacotron\n  * By Kyubyong Park: https://github.com/Kyubyong/tacotron\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeithito%2Ftacotron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkeithito%2Ftacotron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeithito%2Ftacotron/lists"}