{"id":13724199,"url":"https://github.com/jaywalnut310/glow-tts","last_synced_at":"2025-04-04T15:09:59.879Z","repository":{"id":41067857,"uuid":"265200146","full_name":"jaywalnut310/glow-tts","owner":"jaywalnut310","description":"A Generative Flow for Text-to-Speech via Monotonic Alignment Search","archived":false,"fork":false,"pushed_at":"2022-07-12T07:12:57.000Z","size":2261,"stargazers_count":683,"open_issues_count":47,"forks_count":151,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-28T14:09:01.129Z","etag":null,"topics":["deep-learning","pytorch","speech-synthesis","text-to-speech","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaywalnut310.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-19T09:12:35.000Z","updated_at":"2025-03-27T02:44:33.000Z","dependencies_parsed_at":"2022-08-10T01:29:45.965Z","dependency_job_id":null,"html_url":"https://github.com/jaywalnut310/glow-tts","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fglow-tts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fglow-tts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fglow-tts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaywalnut310%2Fglow-tts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaywalnut310","download_url":"https://codeload.github.com/jaywalnut310/
glow-tts/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247198463,"owners_count":20900080,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","pytorch","speech-synthesis","text-to-speech","tts"],"created_at":"2024-08-03T01:01:51.930Z","updated_at":"2025-04-04T15:09:59.856Z","avatar_url":"https://github.com/jaywalnut310.png","language":"Python","funding_links":[],"categories":["\u003cspan id=\"speech\"\u003eSpeech\u003c/span\u003e","Python"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"# Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search\n\n### Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon\n\nIn our recent [paper](https://arxiv.org/abs/2005.11129), we propose Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search.\n\nRecently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. 
We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.\n\nVisit our [demo](https://jaywalnut310.github.io/glow-tts-demo/index.html) for audio samples.\n\nWe also provide the [pretrained model](https://drive.google.com/open?id=1JiCMBVTG4BMREK8cT3MYck1MgYvwASL0).\n\n\u003ctable style=\"width:100%\"\u003e\n  \u003ctr\u003e\n    \u003cth\u003eGlow-TTS at training\u003c/th\u003e\n    \u003cth\u003eGlow-TTS at inference\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"resources/fig_1a.png\" alt=\"Glow-TTS at training\" height=\"400\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"resources/fig_1b.png\" alt=\"Glow-TTS at inference\" height=\"400\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n## Update Notes*\n\nThis result was not included in the paper. Recently, we found that two modifications improve the synthesis quality of Glow-TTS: 1) switching the vocoder to [HiFi-GAN](https://arxiv.org/abs/2010.05646) to reduce noise, and 2) inserting a blank token between any two input tokens to improve pronunciation. Specifically, we used a vocoder fine-tuned with Tacotron 2, which is provided as a pretrained model in the [HiFi-GAN repo](https://github.com/jik876/hifi-gan). If you're interested, please listen to the samples in our [demo](https://jaywalnut310.github.io/glow-tts-demo/index.html).\n\nFor the blank-token setting, we provide a [config file](./configs/base_blank.json) and a [pretrained model](https://drive.google.com/open?id=1RxR6JWg6WVBZYb-pIw58hi1XLNb5aHEi). We also provide an inference example, [inference_hifigan.ipynb](./inference_hifigan.ipynb). 
You may need to initialize the HiFi-GAN submodule: `git submodule init; git submodule update`\n\n\n## 1. Environments we use\n\n* Python 3.6.9\n* PyTorch 1.2.0\n* Cython 0.29.12\n* librosa 0.7.1\n* NumPy 1.16.4\n* SciPy 1.3.0\n\nFor mixed-precision training, we use [apex](https://github.com/NVIDIA/apex) (commit 37cdaf4).\n\n\n## 2. Pre-requisites\n\na) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then rename or create a link to the dataset folder: `ln -s /path/to/LJSpeech-1.1/wavs DUMMY`\n\nb) Initialize the WaveGlow submodule: `git submodule init; git submodule update`\n\nDon't forget to download the pretrained WaveGlow model and place it in the waveglow folder.\n\nc) Build the Monotonic Alignment Search code (Cython): `cd monotonic_align; python setup.py build_ext --inplace`\n\n\n## 3. Training Example\n\n```sh\nsh train_ddi.sh configs/base.json base\n```\n\n## 4. Inference Example\n\nSee [inference.ipynb](./inference.ipynb)\n\n\n## Acknowledgements\n\nOur implementation is hugely influenced by the following repos:\n* [WaveGlow](https://github.com/NVIDIA/waveglow)\n* [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor)\n* [Mellotron](https://github.com/NVIDIA/mellotron)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaywalnut310%2Fglow-tts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaywalnut310%2Fglow-tts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaywalnut310%2Fglow-tts/lists"}