<p align="center">
    <br>
    <img src="https://raw.githubusercontent.com/as-ideas/TransformerTTS/master/docs/transformer_logo.png" width="400"/>
    <br>
</p>

<h2 align="center">
<p>A Text-to-Speech Transformer in TensorFlow 2</p>
</h2>

Implementation of a non-autoregressive Transformer-based neural network for Text-to-Speech (TTS).
This repo builds on, among others, the following papers:
- [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
- [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
- [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://fastpitch.github.io/)

Our pre-trained LJSpeech model is compatible with the pre-trained vocoders:
- [MelGAN](https://github.com/seungwonpark/melgan)
- [HiFiGAN](https://github.com/jik876/hifi-gan)

(older versions are also available for [WaveRNN](https://github.com/fatchord/WaveRNN))

For quick inference with these vocoders, check out the [Vocoding branch](https://github.com/as-ideas/TransformerTTS/tree/vocoding).

#### Non-Autoregressive
Being non-autoregressive, this Transformer model is:
- Robust: no repeats or failed attention modes on challenging sentences.
- Fast: without autoregression, predictions take a fraction of the time.
- Controllable: the speed and pitch of the generated utterance can be controlled.

## 🔈 Samples

[Can be found here.](https://as-ideas.github.io/TransformerTTS/)

These samples' spectrograms are converted using the pre-trained [MelGAN](https://github.com/seungwonpark/melgan) vocoder.<br>

Try it out on Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/as-ideas/TransformerTTS/blob/main/notebooks/synthesize_forward_melgan.ipynb)

## Updates
- 06/20: Added normalisation and pre-trained models compatible with the faster [MelGAN](https://github.com/seungwonpark/melgan) vocoder.
- 11/20: Added pitch prediction. The autoregressive model is now specialised as an aligner, and Forward is now the only TTS model. Changed model architectures, discontinued WaveRNN support and improved duration extraction with the Dijkstra algorithm.
- 03/20: Vocoding branch.

## 📖 Contents
- [Installation](#installation)
- [API](#pre-trained-ljspeech-api)
- [Dataset](#dataset)
- [Training](#training)
    - [Aligner](#train-aligner-model)
    - [TTS](#train-tts-model)
- [Prediction](#prediction)
- [Model Weights](#model-weights)

## Installation

Make sure you have:

* Python >= 3.6

Install espeak as the phonemizer backend (on macOS, use brew):
```
sudo apt-get install espeak
```

Then install the rest with pip:
```
pip install -r requirements.txt
```

Read the individual scripts for more command line arguments.

## Pre-Trained LJSpeech API
Use our pre-trained model (with Griffin-Lim) from the command line with
```commandline
python predict_tts.py -t "Please, say something."
```
Or in a Python script
```python
from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert the mel spectrogram to a waveform (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```

You can specify the model step with the `--step` flag (command line) or the `step` parameter (script).<br>
Steps from 60000 to 100000 are available at a frequency of 5K steps (60000, 65000, ..., 95000, 100000).

<b>IMPORTANT:</b> make sure to check out the correct repository version to use the API.<br>
Currently 493be6345341af0df3ae829de79c2793c9afd0ec
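The `reconstruct_waveform` call above returns a float waveform; to listen to it you still need to write it to disk. A minimal, stdlib-only sketch of that last step (the `write_wav` helper and the hard-coded 22050 Hz sample rate are illustrative assumptions — in practice the rate comes from the model's audio config, and the sine tone stands in for the model output):

```python
import math
import wave

def write_wav(path, samples, sample_rate=22050):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 16-bit
        f.setframerate(sample_rate)
        pcm = b"".join(
            int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
            for s in samples
        )
        f.writeframes(pcm)

# Stand-in for the model's reconstructed waveform: a 0.1 s 440 Hz tone
sr = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(int(0.1 * sr))]
write_wav("sample.wav", tone, sr)
```

Libraries such as `soundfile` or `scipy.io.wavfile` do the same in one call if they are already installed.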
## Dataset
You can directly use [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) to create the training dataset.

#### Configuration
* If training on LJSpeech, or if unsure, simply use ```config/training_config.yaml``` to create [MelGAN](https://github.com/seungwonpark/melgan)- or [HiFiGAN](https://github.com/jik876/hifi-gan)-compatible models
    * swap in the content of ```data_config_wavernn.yaml``` in ```config/training_config.yaml``` to create models compatible with [WaveRNN](https://github.com/fatchord/WaveRNN)
* **EDIT PATHS**: in `config/training_config.yaml`, edit the paths to point at your dataset and log folders

#### Custom dataset
Prepare a folder containing your metadata and wav files, for instance
```
|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
```
If `metadata.csv` has the format
```
wav_file_name|transcription
```
you can use the ljspeech preprocessor in ```data/metadata_readers.py```; otherwise, add your own reader to the same file.

Make sure that:
- the metadata reader function name matches the ```data_name``` field in ```training_config.yaml```;
- the metadata file (which can be anything) is specified under ```metadata_path``` in ```training_config.yaml```.
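A custom reader only has to map each metadata line to a filename/transcription pair. A minimal sketch of an ljspeech-style reader for the pipe-separated format above (the function name and return type here are illustrative, not the repo's exact interface — match your reader's name to the `data_name` field as described):

```python
import os
import tempfile

def read_pipe_metadata(metadata_path):
    """Parse lines of the form `wav_file_name|transcription` into a dict."""
    samples = {}
    with open(metadata_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first pipe only, so transcriptions may contain '|'
            wav_name, transcription = line.split('|', 1)
            samples[wav_name] = transcription
    return samples

# Demo on a throwaway metadata file
path = os.path.join(tempfile.mkdtemp(), 'metadata.csv')
with open(path, 'w', encoding='utf-8') as f:
    f.write('file1|Hello there.\nfile2|General Kenobi!\n')
samples = read_pipe_metadata(path)
```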
## Training
Change the ```--config``` argument based on the configuration of your choice.

### Train Aligner Model
#### Create training dataset
```bash
python create_training_data.py --config config/training_config.yaml
```
This will populate the training data directory (default `transformer_tts_data.ljspeech`).

#### Training
```bash
python train_aligner.py --config config/training_config.yaml
```

### Train TTS Model
#### Compute alignment dataset
First use the aligner model to create the durations dataset
```bash
python extract_durations.py --config config/training_config.yaml
```
This will add the `durations.<session name>` folder as well as the char-wise pitch folders to the training data directory.

#### Training
```bash
python train_tts.py --config config/training_config.yaml
```

#### Training & Model configuration
- Training and model settings can be configured in `training_config.yaml`

#### Resume or restart training
- To resume training, simply use the same configuration files
- To restart training, delete the weights and/or the logs from the logs folder with the training flag `--reset_dir` (both) or `--reset_logs`, `--reset_weights`

#### Monitor training
```bash
tensorboard --logdir /logs/directory/
```

![Tensorboard Demo](https://raw.githubusercontent.com/as-ideas/TransformerTTS/master/docs/tboard_demo.gif)
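The duration-extraction step above reduces the aligner's soft attention to hard per-character durations by finding a monotonic path through the alignment matrix. A toy sketch of that idea (a simple dynamic-programming best path, not the repo's actual Dijkstra-based implementation; `att` is a hypothetical mel-frames × characters weight matrix):

```python
def extract_durations(att):
    """att[t][c]: alignment weight of mel frame t to character c.
    Finds the best monotonic path (each frame either stays on the same
    character or advances by one) and counts mel frames per character."""
    T, C = len(att), len(att[0])
    NEG = float('-inf')
    best = [[NEG] * C for _ in range(T)]   # best path score ending at (t, c)
    back = [[0] * C for _ in range(T)]     # character index at frame t - 1
    best[0][0] = att[0][0]                 # path must start on the first character
    for t in range(1, T):
        for c in range(C):
            stay = best[t - 1][c]
            move = best[t - 1][c - 1] if c > 0 else NEG
            if move > stay:
                best[t][c] = move + att[t][c]
                back[t][c] = c - 1
            else:
                best[t][c] = stay + att[t][c]
                back[t][c] = c
    # Backtrack from the last character at the last frame, counting frames per char
    durations = [0] * C
    c = C - 1
    for t in range(T - 1, -1, -1):
        durations[c] += 1
        c = back[t][c]
    return durations

# Demo: 4 mel frames cleanly attending to 2 characters, 2 frames each
durations = extract_durations([[1, 0], [1, 0], [0, 1], [0, 1]])
```

The durations always sum to the number of mel frames, which is what the forward model needs as supervision.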
## Prediction
### With model weights
From the command line with
```commandline
python predict_tts.py -t "Please, say something." -p /path/to/weights/
```
Or in a Python script
```python
from model.models import ForwardTransformer
from data.audio import Audio

model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert the mel spectrogram to a waveform (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```

## Model Weights
Access the pre-trained models with the API call.

Old weights:

| Model URL | Commit | Vocoder Commit |
|---|---|---|
|[ljspeech_tts_model](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/ljspeech_weights_tts.zip)| 0cd7d33 | aca5990 |
|[ljspeech_melgan_forward_model](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_melgan_forward_transformer.zip)| 1c1cb03 | aca5990 |
|[ljspeech_melgan_autoregressive_model_v2](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_melgan_autoregressive_transformer.zip)| 1c1cb03 | aca5990 |
|[ljspeech_wavernn_forward_model](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_wavernn_forward_transformer.zip)| 1c1cb03 | 3595219 |
|[ljspeech_wavernn_autoregressive_model_v2](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_wavernn_autoregressive_transformer.zip)| 1c1cb03 | 3595219 |
|[ljspeech_wavernn_forward_model](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_forward_transformer.zip)| d9ccee6 | 3595219 |
|[ljspeech_wavernn_autoregressive_model_v2](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/ljspeech_autoregressive_transformer.zip)| d9ccee6 | 3595219 |
|[ljspeech_wavernn_autoregressive_model_v1](https://github.com/as-ideas/tts_model_outputs/tree/master/ljspeech_transformertts)| 2f3a1b5 | 3595219 |

## Maintainers
* Francesco Cardinale, github: [cfrancesco](https://github.com/cfrancesco)

## Special thanks
[MelGAN](https://github.com/seungwonpark/melgan) and [WaveRNN](https://github.com/fatchord/WaveRNN): data normalisation and the samples' vocoders are from these repos.

[Erogol](https://github.com/erogol) and the Mozilla TTS team for the lively exchange on the topic.

## Copyright
See [LICENSE](LICENSE) for details.