{"id":19900382,"url":"https://github.com/x-lance/voiceflow-tts","last_synced_at":"2025-04-06T07:15:01.948Z","repository":{"id":194912007,"uuid":"691851615","full_name":"X-LANCE/VoiceFlow-TTS","owner":"X-LANCE","description":"[ICASSP 2024] This is the official code for \"VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching\"","archived":false,"fork":false,"pushed_at":"2024-09-03T05:41:17.000Z","size":889,"stargazers_count":338,"open_issues_count":8,"forks_count":22,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-03-30T06:08:09.387Z","etag":null,"topics":["conditional-flow-matching","generative-models","probabilistic-models","rectified-flow-matching","speech-synthesis","text-to-speech","tts"],"latest_commit_sha":null,"homepage":"https://cantabile-kwok.github.io/VoiceFlow/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/X-LANCE.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-15T03:01:48.000Z","updated_at":"2025-03-28T21:10:52.000Z","dependencies_parsed_at":"2023-09-15T18:40:00.587Z","dependency_job_id":"6c6b3009-7718-496e-ab54-4f752da3ba2f","html_url":"https://github.com/X-LANCE/VoiceFlow-TTS","commit_stats":null,"previous_names":["cantabile-kwok/voiceflow-tts","x-lance/voiceflow-tts"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FVoiceFlow-TTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FVoiceFlow-TTS/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FVoiceFlow-TTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FVoiceFlow-TTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/X-LANCE","download_url":"https://codeload.github.com/X-LANCE/VoiceFlow-TTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247445681,"owners_count":20939961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conditional-flow-matching","generative-models","probabilistic-models","rectified-flow-matching","speech-synthesis","text-to-speech","tts"],"created_at":"2024-11-12T20:12:07.328Z","updated_at":"2025-04-06T07:15:01.904Z","avatar_url":"https://github.com/X-LANCE.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching\n\u003e This is the official implementation of our ICASSP 2024 paper [VoiceFlow](https://arxiv.org/abs/2309.05027).\n\n![traj](resources/traj.png)\n\n## Environment Setup\nThis repo is tested on **python 3.9** on Linux. 
You can set up the environment with conda\n```shell\n# Install required packages\nconda create -n vflow python==3.9  # or any name you like\nconda activate vflow\npip install -r requirements.txt\n\n# Then, set PATH\nsource path.sh  # change the env name in it if you don't use \"vflow\"\n\n# Install monotonic_align for MAS\ncd model/monotonic_align\npython setup.py build_ext --inplace\n```\nNote that to avoid the trouble of installing [torchdyn](https://github.com/DiffEqML/torchdyn), we directly copy the torchdyn 1.0.6 version here locally at `torchdyn/`.\n\nThe following process may also need `bash` and `perl` commands in your environment.\n\n## Data Preparation\nThis repo relies on Kaldi-style data organization.\nAll data description files should be put in subdirectories in `data/`.\nSee `data/ljspeech/example` for a basic example. \nIn this example, the following plain text files are necessary:\n1. `wav.scp`: organized as `utt /path/to/wav`.\n2. `utts.list`: every line specifies an utterance. This can be obtained by `cut -d ' ' -f 1 wav.scp \u003e utts.list`.\n3. `utt2spk`: organized as `utt spk_name`.\n4. `text` and `phn_duration`: specifies the phoneme sequence and the corresponding integer durations (in frames).\nAlso, there is a `data/ljspeech/phones.txt` file to specify all the phones together with their indexes in dictionary.\n\nFor LJSpeech, we provide the processed file [online](https://huggingface.co/datasets/cantabile-kwok/ljspeech-1024-256-dur/resolve/main/ljspeech-1024-256.zip).\nYou can download it and unzip to `data/ljspeech/{train,val}`.\nIf you want to train on your own dataset, you might have to create these files yourself (or change the data loading strategy).\n\nAfter having these manifest files, please do the following to extract mel-spectrogram for training:\n```shell\nbash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16\n# nj: number of parallel jobs. 
\n# Have a look into the script if you need to change something\n# Bash variables before \"parse_options.sh\" can be passed by CLI, e.g. \"--key value\".\n```\nNote that the script assumes **16kHz** data by default.\nThis will create `feats/fbank` and `feats/normed_fbank`, where Kaldi-style scp and ark files store the mel-spectrogram data. \nThe normed features will be used for training.\n\nIf you want to use speaker-IDs (like LJSpeech, instead of using pretrained speaker embeddings such as xvectors) for training, please run:\n```shell\nmake_utt2spk_id.py data/ljspeech/train/utt2spk data/ljspeech/val/utt2spk\n# You can add more files in CLI. Will write utt2num_frames in the same directory to these files.\n```\n\n## Training\nTraining configurations are stored as yaml files in `configs/`.\nData manifests and features for the training and validation sets are specified in those yaml files.\nYou will need to change the double-quoted file paths there if you want to train on your own data.\n\nThen, training is performed by \n```shell\npython train.py -c configs/${your_yaml} -m ${model_name}\n# e.g. python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur\n```\nIt will create `logs/${model_name}` for logging and checkpointing.\n\nSeveral notes:\n* By default, the program performs EMA to average weights. Weights with and without EMA will both be saved. \n* By default, the program will try to find the latest checkpoint for resuming. EMA checkpoints take priority over non-EMA checkpoints.\n* You can set `use_gt_dur` to `false` to turn on the MAS algorithm. 
In this setting, it is better to set `add_blank` to `true`.\n\n## Generate Data for ReFlow and Perform Reflow\nOnce the model has been trained to a reasonable degree, it is ready for the flow rectification process.\nFlow rectification requires generating data with the trained model and then using the resulting (noise, data) pairs to train the model again.\nAs this process should always involve the whole training dataset, it is recommended to run on multiple GPUs for parallel decoding.\nWe provide a script to do this:\n```shell\n# Set CUDA_VISIBLE_DEVICES, or the program will use all available GPUs.\npython generate_for_reflow.py -c configs/${your_yaml} -m ${model_name} \\\n                              --EMA --max-utt-num 100000000 \\\n                              --dataset train \\\n                              --solver euler -t 10 \\\n                              --gt-dur\n# --EMA loads the latest EMA checkpoint\n# --max-utt-num sets the number of utterances to decode (in this case, arbitrarily high)\n# --solver euler -t 10 specifies the solver and timesteps. Could be adaptive solvers like dopri5.\n# --gt-dur forces the model to use ground truth duration for decoding.\n```\nThis will create `synthetic_wav/${model_name}/generate_for_reflow/train` for storage. 
`noise.scp` together with `feats.scp` will be stored there.\nAfter decoding the training set, you can also decode the validation set with `--dataset val`.\n\nThen, specify the paths to these `feats.scp` and `noise.scp` in a new configuration yaml, as in `lj_16k_gt_dur_reflow.yaml`:\n```yaml\nperform_reflow: true\n...\ndata:\n    train:\n        feats_scp: \"synthetic_wav/lj_16k_gt_dur/train/feats.scp\"\n        noise_scp: \"synthetic_wav/lj_16k_gt_dur/train/noise.scp\"\n...\n```\n\nThe model is now ready to be trained again with ReFlow, using the same training script but the new yaml config file.\nFeel free to copy a trained model to the new log dir for resuming.\nAlso, it is possible to change the model structure and train from scratch on the reflow data.\n\n## Inference\nSimilar to \"generate data for reflow\", model inference can be done by\n```shell\npython inference_dataset.py -c configs/${your_yaml} -m ${model_name} --EMA \\\n                          --solver euler -t 10\n```\nThis will synthesize mel-spectrograms for the validation set in your config, storing them at `synthetic_wav/${model_name}/tts_gt_spk/feats.scp`.\nSpeaker, speed and temperature can be specified; see the `tools.get_hparams_decode()` function for the complete set of options.\n\nVocoder inference can then be done in the `hifigan/` directory. 
Please refer to the [README](hifigan/README.md) there.\n\n## Acknowledgement\nDuring development, the following repositories were referred to:\n* [Kaldi](https://github.com/kaldi-asr/kaldi) and [UniCATS-CTX-vec2wav](https://github.com/cantabile-kwok/UniCATS-CTX-vec2wav) for most utility scripts in `utils/`.\n* [GradTTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS), from which most of the model architecture and training pipeline are adopted.\n* [VITS](https://github.com/jaywalnut310/vits), whose distributed bucket sampler is used.\n* [CFM](https://github.com/atong01/conditional-flow-matching), for the ODE samplers.\n\n## 💡Easter Eggs \u0026 Citation\nThis repository also contains some experimental features. ⚠️Warning: not guaranteed to be correct!\n* **Voice conversion**. As GlowTTS can perform voice conversion via the disentangling property of normalizing flows, it is reasonable to expect that flow matching can do so too. The method `model.tts.GradTTS.voice_conversion` is a preliminary attempt.\n\n* **Likelihood estimation**. Differential equation-based generative models have the ability to estimate data likelihoods by the instantaneous change-of-variable formula\n```math\n\\log p_0(\\boldsymbol x(0)) = \\log p_1(\\boldsymbol  x(1)) + \\int _0^1 \\nabla_{\\boldsymbol x} \\cdot {\\boldsymbol v}(\\boldsymbol x(t), t)\\mathrm d t\n```\n  In practice, the integral is replaced by a summation, and the divergence is estimated with the Skilling-Hutchinson trace estimator. See Appendix D.2 in [Song et al.](https://arxiv.org/abs/2011.13456) for theoretical details. I implemented this in `model.tts.GradTTS.compute_likelihood`. \n* **Optimal transport**. The conditional flow matching used in this paper is not a **marginally** optimal transport path but only a **conditionally** optimal path. For the marginal optimal transport, [Tong et 
al](https://arxiv.org/abs/2302.00482) propose sampling $x_0,x_1$ jointly from the optimal transport distribution $\\pi(x_0,x_1)$. I tried this in `model.cfm.OTCFM`, though it does not work very well for now.\n* **Different estimator architectures**. You can specify an estimator other than `GradLogPEstimator2d` via the `model.fm_net_type` configuration. Currently, [DiffSinger](https://ojs.aaai.org/index.php/AAAI/article/view/21350)'s estimator architecture is also supported. You can add more, e.g. the one introduced in [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).\n* **Better alignment learning**. This repo supports supervised duration modeling together with monotonic alignment search, as in GradTTS. However, there might be a better way to do MAS in flow-matching TTS. `model.tts.GradTTS.forward` now supports a beta-binomial prior for alignment maps; and if you want, you can change the variable `MAS_target` to something else, e.g. flow-transformed noise!\n\nFeel free to cite this work if it helps 😄\n\n```\n@INPROCEEDINGS{guo2024voiceflow,\n  author={Guo, Yiwei and Du, Chenpeng and Ma, Ziyang and Chen, Xie and Yu, Kai},\n  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, \n  title={{VoiceFlow}: Efficient Text-To-Speech with Rectified Flow Matching}, \n  year={2024},\n  volume={},\n  number={},\n  pages={11121-11125},\n  keywords={Signal processing algorithms;Signal processing;Acoustics;Mathematical models;Vectors;Trajectory;Speech processing;Text-to-speech;flow matching;rectified flow;efficiency;speed-quality tradeoff},\n  doi={10.1109/ICASSP48485.2024.10445948}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-lance%2Fvoiceflow-tts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-lance%2Fvoiceflow-tts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-lance%2Fvoiceflow-tts/lists"}