{"id":13724204,"url":"https://github.com/shivammehta25/Matcha-TTS","last_synced_at":"2025-05-07T17:33:44.424Z","repository":{"id":195362075,"uuid":"687727674","full_name":"shivammehta25/Matcha-TTS","owner":"shivammehta25","description":"[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching","archived":false,"fork":false,"pushed_at":"2024-05-20T22:03:08.000Z","size":65583,"stargazers_count":404,"open_issues_count":7,"forks_count":53,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-05-21T00:43:18.617Z","etag":null,"topics":["deep-learning","diffusion-model","diffusion-models","flow-matching","machine-learning","non-autoregressive","probabilistic","probabilistic-machine-learning","text-to-speech","tts","tts-api","tts-engines"],"latest_commit_sha":null,"homepage":"https://shivammehta25.github.io/Matcha-TTS/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shivammehta25.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-05T22:08:15.000Z","updated_at":"2024-08-09T17:40:18.458Z","dependencies_parsed_at":"2023-09-17T17:01:33.880Z","dependency_job_id":"5d913be3-4881-40bb-a58d-a1f704fd5be8","html_url":"https://github.com/shivammehta25/Matcha-TTS","commit_stats":null,"previous_names":["shivammehta25/matcha-tts"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shivammehta25%2FMatcha-TTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shivammehta25%2FM
atcha-TTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shivammehta25%2FMatcha-TTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shivammehta25%2FMatcha-TTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shivammehta25","download_url":"https://codeload.github.com/shivammehta25/Matcha-TTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224628441,"owners_count":17343340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","diffusion-model","diffusion-models","flow-matching","machine-learning","non-autoregressive","probabilistic","probabilistic-machine-learning","text-to-speech","tts","tts-api","tts-engines"],"created_at":"2024-08-03T01:01:52.016Z","updated_at":"2025-05-07T17:33:44.410Z","avatar_url":"https://github.com/shivammehta25.png","language":"Jupyter Notebook","funding_links":[],"categories":["\u003cspan id=\"speech\"\u003eSpeech\u003c/span\u003e"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching\n\n### [Shivam Mehta](https://www.kth.se/profile/smehta), [Ruibo Tu](https://www.kth.se/profile/ruibo), [Jonas Beskow](https://www.kth.se/profile/beskow), [Éva Székely](https://www.kth.se/profile/szekely), and [Gustav Eje 
Henter](https://people.kth.se/~ghe/)\n\n[![python](https://img.shields.io/badge/-Python_3.10-blue?logo=python\u0026logoColor=white)](https://www.python.org/downloads/release/python-3100/)\n[![pytorch](https://img.shields.io/badge/PyTorch_2.0+-ee4c2c?logo=pytorch\u0026logoColor=white)](https://pytorch.org/get-started/locally/)\n[![lightning](https://img.shields.io/badge/-Lightning_2.0+-792ee5?logo=pytorchlightning\u0026logoColor=white)](https://pytorchlightning.ai/)\n[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)\n[![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)\n[![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n\n\u003cp style=\"text-align: center;\"\u003e\n  \u003cimg src=\"https://shivammehta25.github.io/Matcha-TTS/images/logo.png\" height=\"128\"/\u003e\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n\u003e This is the official code implementation of 🍵 Matcha-TTS [ICASSP 2024].\n\nWe propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS, that uses [conditional flow matching](https://arxiv.org/abs/2210.02747) (similar to [rectified flows](https://arxiv.org/abs/2209.03003)) to speed up ODE-based speech synthesis. 
Our method:\n\n- Is probabilistic\n- Has a compact memory footprint\n- Sounds highly natural\n- Is very fast to synthesise from\n\nCheck out our [demo page](https://shivammehta25.github.io/Matcha-TTS) and read [our ICASSP 2024 paper](https://arxiv.org/abs/2309.03199) for more details.\n\n[Pre-trained models](https://drive.google.com/drive/folders/17C_gYgEHOxI5ZypcfE_k1piKCtyR0isJ?usp=sharing) will be automatically downloaded with the CLI or gradio interface.\n\nYou can also [try 🍵 Matcha-TTS in your browser on HuggingFace 🤗 spaces](https://huggingface.co/spaces/shivammehta25/Matcha-TTS).\n\n## Teaser video\n\n[![Watch the video](https://img.youtube.com/vi/xmvJkz3bqw0/hqdefault.jpg)](https://youtu.be/xmvJkz3bqw0)\n\n## Installation\n\n1. Create an environment (suggested but optional)\n\n```\nconda create -n matcha-tts python=3.10 -y\nconda activate matcha-tts\n```\n\n2. Install Matcha TTS using pip or from source\n\n```bash\npip install matcha-tts\n```\n\nor from source\n\n```bash\n# Install directly from GitHub\npip install git+https://github.com/shivammehta25/Matcha-TTS.git\n# Or clone the repository for an editable install\ngit clone https://github.com/shivammehta25/Matcha-TTS.git\ncd Matcha-TTS\npip install -e .\n```\n\n3. 
Run CLI / gradio app / jupyter notebook\n\n```bash\n# This will download the required models\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\"\n```\n\nor\n\n```bash\nmatcha-tts-app\n```\n\nor open `synthesis.ipynb` in Jupyter Notebook\n\n### CLI Arguments\n\n- To synthesise from given text, run:\n\n```bash\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\"\n```\n\n- To synthesise from a file, run:\n\n```bash\nmatcha-tts --file \u003cPATH TO FILE\u003e\n```\n\n- To batch synthesise from a file, run:\n\n```bash\nmatcha-tts --file \u003cPATH TO FILE\u003e --batched\n```\n\nAdditional arguments\n\n- Speaking rate\n\n```bash\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\" --speaking_rate 1.0\n```\n\n- Sampling temperature\n\n```bash\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\" --temperature 0.667\n```\n\n- Euler ODE solver steps\n\n```bash\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\" --steps 10\n```\n\n## Train with your own dataset\n\nLet's assume we are training with LJ Speech.\n\n1. Download the dataset from [here](https://keithito.com/LJ-Speech-Dataset/), extract it to `data/LJSpeech-1.1`, and prepare the file lists to point to the extracted data, as in [item 5 in the setup of the NVIDIA Tacotron 2 repo](https://github.com/NVIDIA/tacotron2#setup).\n\n2. Clone and enter the Matcha-TTS repository\n\n```bash\ngit clone https://github.com/shivammehta25/Matcha-TTS.git\ncd Matcha-TTS\n```\n\n3. Install the package from source\n\n```bash\npip install -e .\n```\n\n4. Go to `configs/data/ljspeech.yaml` and change the following entries\n\n```yaml\ntrain_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt\nvalid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt\n```\n\n5. 
Generate normalisation statistics using the dataset configuration YAML file\n\n```bash\nmatcha-data-stats -i ljspeech.yaml\n# Output:\n#{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}\n```\n\nUpdate these values in `configs/data/ljspeech.yaml` under the `data_statistics` key.\n\n```yaml\ndata_statistics:  # Computed for ljspeech dataset\n  mel_mean: -5.536622\n  mel_std: 2.116101\n```\n\nAlso make sure that `train_filelist_path` and `valid_filelist_path` in the same file point to the paths of your train and validation filelists.\n\n6. Run the training script\n\n```bash\nmake train-ljspeech\n```\n\nor\n\n```bash\npython matcha/train.py experiment=ljspeech\n```\n\n- for a minimum memory run\n\n```bash\npython matcha/train.py experiment=ljspeech_min_memory\n```\n\n- for multi-gpu training, run\n\n```bash\npython matcha/train.py experiment=ljspeech trainer.devices=[0,1]\n```\n\n7. Synthesise from the custom trained model\n\n```bash\nmatcha-tts --text \"\u003cINPUT TEXT\u003e\" --checkpoint_path \u003cPATH TO CHECKPOINT\u003e\n```\n\n## ONNX support\n\n\u003e Special thanks to [@mush42](https://github.com/mush42) for implementing ONNX export and inference support.\n\nIt is possible to export Matcha checkpoints to [ONNX](https://onnx.ai/), and run inference on the exported ONNX graph.\n\n### ONNX export\n\nTo export a checkpoint to ONNX, first install ONNX with\n\n```bash\npip install onnx\n```\n\nthen run the following:\n\n```bash\npython3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5\n```\n\nOptionally, the ONNX exporter accepts **vocoder-name** and **vocoder-checkpoint** arguments. This enables you to embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).\n\n**Note** that `n_timesteps` is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). 
If not specified, `n_timesteps` is set to **5**.\n\n**Important**: for now, torch\u003e=2.1.0 is needed for export since the `scaled_dot_product_attention` operator is not exportable in older versions. Until the final version is released, those who want to export their models must install torch\u003e=2.1.0 manually as a pre-release.\n\n### ONNX Inference\n\nTo run inference on the exported model, first install `onnxruntime` using\n\n```bash\npip install onnxruntime\npip install onnxruntime-gpu  # for GPU inference\n```\n\nthen use the following:\n\n```bash\npython3 -m matcha.onnx.infer model.onnx --text \"hey\" --output-dir ./outputs\n```\n\nYou can also control synthesis parameters:\n\n```bash\npython3 -m matcha.onnx.infer model.onnx --text \"hey\" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0\n```\n\nTo run inference on **GPU**, make sure to install the **onnxruntime-gpu** package, and then pass `--gpu` to the inference command:\n\n```bash\npython3 -m matcha.onnx.infer model.onnx --text \"hey\" --output-dir ./outputs --gpu\n```\n\nIf you exported only Matcha to ONNX, this will write mel-spectrograms as graphs and `numpy` arrays to the output directory.\nIf you embedded the vocoder in the exported graph, this will write `.wav` audio files to the output directory.\n\nIf you exported only Matcha to ONNX, and you want to run a full TTS pipeline, you can pass a path to a vocoder model in `ONNX` format:\n\n```bash\npython3 -m matcha.onnx.infer model.onnx --text \"hey\" --output-dir ./outputs --vocoder hifigan.small.onnx\n```\n\nThis will write `.wav` audio files to the output directory.\n\n## Extract phoneme alignments from Matcha-TTS\n\nIf the dataset is structured as\n\n```bash\ndata/\n└── LJSpeech-1.1\n    ├── metadata.csv\n    ├── README\n    ├── test.txt\n    ├── train.txt\n    ├── val.txt\n    └── wavs\n```\nThen you can extract the phoneme-level alignments from a trained Matcha-TTS model using:\n```bash\npython  
matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c \u003ccheckpoint\u003e\n```\nExample:\n```bash\npython matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt\n```\nor simply:\n```bash\nmatcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt\n```\n---\n## Train using extracted alignments\n\nIn the dataset config, turn on `load_durations`.\nExample: `ljspeech.yaml`\n```\nload_durations: True\n```\nor see an example in `configs/experiment/ljspeech_from_durations.yaml`\n\n\n## Citation information\n\nIf you use our code or otherwise find this work useful, please cite our paper:\n\n```text\n@inproceedings{mehta2024matcha,\n  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},\n  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\\'e}kely, {\\'E}va and Henter, Gustav Eje},\n  booktitle={Proc. ICASSP},\n  year={2024}\n}\n```\n\n## Acknowledgements\n\nSince this code uses [Lightning-Hydra-Template](https://github.com/ashleve/lightning-hydra-template), you have all the powers that come with it.\n\nOther source code we would like to acknowledge:\n\n- [Coqui-TTS](https://github.com/coqui-ai/TTS/tree/dev): For helping me figure out how to make cython binaries pip installable, and for the encouragement\n- [Hugging Face Diffusers](https://huggingface.co/): For their awesome diffusers library and its components\n- [Grad-TTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS): For the monotonic alignment search source code\n- [torchdyn](https://github.com/DiffEqML/torchdyn): Useful for trying other ODE solvers during research and development\n- [labml.ai](https://nn.labml.ai/transformers/rope/index.html): For the RoPE 
implementation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshivammehta25%2FMatcha-TTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshivammehta25%2FMatcha-TTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshivammehta25%2FMatcha-TTS/lists"}