{"id":28704095,"url":"https://github.com/digitalphonetics/ims-toucan","last_synced_at":"2025-06-26T14:02:58.573Z","repository":{"id":37498216,"uuid":"392995634","full_name":"DigitalPhonetics/IMS-Toucan","owner":"DigitalPhonetics","description":"Controllable and fast Text-to-Speech for over 7000 languages!","archived":false,"fork":false,"pushed_at":"2025-06-25T18:50:54.000Z","size":22419,"stargazers_count":1617,"open_issues_count":6,"forks_count":184,"subscribers_count":22,"default_branch":"MassiveScaleToucan","last_synced_at":"2025-06-25T19:36:38.526Z","etag":null,"topics":["deep-learning","pytorch","speech","speech-processing","speech-synthesis","text-to-speech","toolkit","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DigitalPhonetics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["Flux9665"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2021-08-05T10:12:38.000Z","updated_at":"2025-06-24T12:34:25.000Z","dependencies_parsed_at":"2023-10-17T01:17:34.629Z","dependency_job_id":"62c95c5f-2629-40d9-897f-007bb832f6e1","html_url":"https://github.com/DigitalPhonetics/IMS-Toucan","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/DigitalPhonetics/IMS-Toucan","repository_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalPhonetics%2FIMS-Toucan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalPhonetics%2FIMS-Toucan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalPhonetics%2FIMS-Toucan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalPhonetics%2FIMS-Toucan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DigitalPhonetics","download_url":"https://codeload.github.com/DigitalPhonetics/IMS-Toucan/tar.gz/refs/heads/MassiveScaleToucan","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DigitalPhonetics%2FIMS-Toucan/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262081074,"owners_count":23255657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","pytorch","speech","speech-processing","speech-synthesis","text-to-speech","toolkit","tts"],"created_at":"2025-06-14T14:01:20.088Z","updated_at":"2025-06-26T14:02:58.566Z","avatar_url":"https://github.com/DigitalPhonetics.png","language":"Python","funding_links":["https://github.com/sponsors/Flux9665"],"categories":["语音合成"],"sub_categories":["资源传输下载"],"readme":"\u003cp align=\"right\"\u003e\n\u003cimg alt=\"GitHub Repo stars\" src=\"https://img.shields.io/github/stars/DigitalPhonetics/IMS-Toucan\"\u003e\n\u003cimg alt=\"GitHub Repo Downloads\" src=\"https://img.shields.io/github/downloads/DigitalPhonetics/IMS-Toucan/total\"\u003e\n\u003cimg alt=\"GitHub Release\" 
src=\"https://img.shields.io/github/v/release/DigitalPhonetics/IMS-Toucan\"\u003e\n\u003ca href=https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS\u003e\u003cimg alt=\"Demo Link\" src=\"https://img.shields.io/badge/DEMO-\u003cCOLOR\u003e.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n# Text-to-Speech for over 7000 Languages\n\nIMS Toucan is a toolkit for training, using, and teaching state-of-the-art Text-to-Speech Synthesis, developed at the\n**Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**, official home of the massively\nmultilingual ToucanTTS system. Our system is fast, controllable, and doesn't require a ton of compute.\n\n\u003cbr\u003e\n\n![image](Utility/toucan.png)\n\n\u003cbr\u003e\n\nIf you find this repo useful, consider giving it a star. ⭐ Large numbers make me happy, and they are very motivating. If\nyou want to motivate me even more, you can even\nconsider [sponsoring this toolkit](https://github.com/sponsors/Flux9665). We only use GitHub Sponsors for this, there\nare scammers on other platforms that pretend to be the creator. Don't let them fool you. 
The code and the models are\nabsolutely free, and thanks to the generous support of Hugging Face🤗, we even have\nan [instance of the model running on GPU](https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS) free for\nanyone to use.\n\n--- \n\u003cbr\u003e\n\n## Links 🦚\n\n### Interactive Demo\n\n[Check out our interactive massively multilingual demo on Hugging Face🤗](https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS)\n\n### Dataset\n\n[We have also published a massively multilingual TTS dataset on Hugging Face🤗](https://huggingface.co/datasets/Flux9665/BibleMMS)\n\n### Languages\n\n[A list of supported languages can be found here](https://github.com/DigitalPhonetics/IMS-Toucan/blob/MassiveScaleToucan/Utility/language_list.md)\n\n--- \n\u003cbr\u003e\n\n## Installation 🦉\n\n#### Basic Requirements\n\nPython 3.10 is the recommended version.\n\nTo install this toolkit, clone it onto the machine you want to use it on\n(it should have at least one CUDA-enabled GPU if you intend to train models on that machine; for inference, you don't need\na GPU).\n\nIf you're using Linux, you should have the following packages installed, or install them with apt-get if you haven't (on\nmost distributions they come pre-installed):\n\n```\nlibsndfile1\nespeak-ng\nffmpeg\nlibasound-dev\nlibportaudio2\nlibsqlite3-dev\n```\n\nNavigate to the directory you have cloned. We recommend creating and activating a\n[virtual environment](https://docs.python.org/3/library/venv.html)\nto install the basic requirements into. The commands below summarize everything you need to do under Linux. 
If you are\nrunning Windows, the second line needs to be changed; please have a look at\nthe [venv documentation](https://docs.python.org/3/library/venv.html).\n\n```\npython -m venv \u003cpath_to_where_you_want_your_env_to_be\u003e\n\nsource \u003cpath_to_where_you_want_your_env_to_be\u003e/bin/activate\n\npip install --no-cache-dir -r requirements.txt\n```\n\nRun the second line every time you start using the tool again to reactivate the virtual environment, e.g. if you\nlogged out in the meantime. To make use of a GPU, you don't need to do anything else on a Linux machine. On a Windows\nmachine, have a look at [the official PyTorch website](https://pytorch.org/) for the install command that enables GPU\nsupport.\n\n#### Storage configuration\n\nIf you don't want the pretrained and trained models as well as the cache files resulting from preprocessing your\ndatasets to be stored in the default subfolders, you can set corresponding directories globally by\nediting `Utility/storage_config.py` to suit your needs (the path can be relative to the repository root directory or\nabsolute).\n\n#### Pretrained Models\n\nYou don't need to use pretrained models, but it can speed things up tremendously. They will be downloaded on the fly\nautomatically when they are needed, thanks to Hugging Face🤗 and [VB](https://github.com/Vaibhavs10) in particular.\n\n#### \\[optional] eSpeak-NG\n\neSpeak-NG is an optional requirement that handles lots of special cases in many languages, so it's good to have.\n\nOn most **Linux** environments it will be installed already, and if it is not, and you have sufficient rights, you\ncan install it by simply running\n\n```\napt-get install espeak-ng\n```\n\nFor **Windows**, they provide a convenient .msi installer file\n[on their GitHub release page](https://github.com/espeak-ng/espeak-ng/releases). 
After installation on non-linux\nsystems, you'll also need to tell the phonemizer library where to find your espeak installation by setting the\n`PHONEMIZER_ESPEAK_LIBRARY` environment variable, which is discussed in\n[this issue](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1008449718).\n\nFor **Mac** it's unfortunately a lot more complicated. Thanks to Sang Hyun Park, here is a guide for installing it on\nMac:\nFor M1 Macs, the most convenient method to install espeak-ng onto your system is via a\n[MacPorts port of espeak-ng](https://ports.macports.org/port/espeak-ng/). MacPorts itself can be installed from the\n[MacPorts website](https://www.macports.org/install.php), which also requires Apple's\n[XCode](https://developer.apple.com/xcode/). Once XCode and MacPorts have been installed, you can install the port of\nespeak-ng via\n\n```\nsudo port install espeak-ng\n```\n\nAs stated in the Windows install instructions, the espeak-ng installation will need to be set as a variable for the\nphonemizer library. The environment variable is `PHONEMIZER_ESPEAK_LIBRARY` as given in the\n[GitHub thread](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1008449718) linked above.\nHowever, the espeak-ng installation file you need to set this variable to is a .dylib file rather than a .dll file on\nMac. In order to locate the espeak-ng library file, you can run `port contents espeak-ng`. The specific file you are\nlooking for is named `libespeak-ng.dylib`.\n\n--- \n\u003cbr\u003e\n\n## Inference 🦢\n\nYou can load your trained models, or the pretrained provided one, using the `InferenceInterfaces/ToucanTTSInterface.py`.\nSimply create an object from it with the proper directory handle\nidentifying the model you want to use. The rest should work out in the background. You might want to set a language\nembedding or a speaker embedding using the *set_language* and *set_speaker_embedding* functions. 
Most things should be\nself-explanatory.\n\nAn *InferenceInterface* contains two methods to create audio from text. They are\n*read_to_file* and\n*read_aloud*.\n\n- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list,\n  concatenate them with a short pause in between, and write them to the filepath you supply as the other argument.\n\n- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's\n  speakers. If you set the optional argument\n  *view* to\n  *True*, a visualization will pop up that you need to close for the program to continue.\n\nTheir use is demonstrated in\n*run_interactive_demo.py* and\n*run_text_to_file_reader.py*.\n\nThere are simple scaling parameters to control the duration, the variance of the pitch curve and the variance of the\nenergy curve. You can either change them in the code when using the interactive demo or the reader, or you can simply\npass them to the interface when you use it in your own code.\n\nTo change the language of the model and see which languages are available in our pretrained model,\n[have a look at the list linked here](https://github.com/DigitalPhonetics/IMS-Toucan/blob/feb573ca630823974e6ced22591ab41cdfb93674/Utility/language_list.md).\n\n--- \n\u003cbr\u003e\n\n## Creating a new Recipe (Training Pipeline) 🐣\n\nIn the directory called\n*Utility* there is a file called\n`path_to_transcript_dicts.py`. In this file you should write a function that returns a dictionary whose keys are the\nabsolute paths to the audio files in your dataset, as strings, and whose values are the textual transcriptions of the\ncorresponding audios.\n\nThen go to the directory\n*TrainingInterfaces/Recipes*. 
In there, make a copy of the `finetuning_example_simple.py` file if you just want to\nfinetune on a single dataset or `finetuning_example_multilingual.py` if you want to finetune on multiple datasets,\npotentially even multiple languages. We will use this copy\nas reference and only make the necessary changes to use the new dataset. Find the call(s) to the *prepare_tts_corpus*\nfunction. Replace the path_to_transcript_dict used there with the one(s) you just created. Then change the name of the\ncorresponding cache directory to something that makes sense for the dataset.\nAlso look out for the variable *save_dir*, which is where the checkpoints will be saved to. This is a default value, you\ncan overwrite it when calling\nthe pipeline later using a command line argument, in case you want to fine-tune from a checkpoint and thus save into a\ndifferent directory. Finally, change the\n*lang* argument in the creation of the dataset and in the call to the train loop function to the ISO 639-3 language ID\nthat\nmatches your data.\n\nThe arguments that are given to the train loop in the finetuning examples are meant for the case of finetuning from a\npretrained model. If you want\nto train from scratch, have a look at a different pipeline that has ToucanTTS in its name and look at the arguments\nused there.\n\nOnce this is complete, we are almost done, now we just need to make it available to the\n`run_training_pipeline.py` file in the top level. In said file, import the\n*run* function from the pipeline you just created and give it a meaningful name. 
Now in the\n*pipeline_dict*, add your imported function as value and use as key a shorthand that makes sense.\n\n--- \n\u003cbr\u003e\n\n## Training a Model 🦜\n\nOnce you have a recipe built, training is super easy:\n\n```\npython run_training_pipeline.py \u003cshorthand of the pipeline\u003e\n```\n\nYou can supply any of the following arguments, but don't have to (although for training you should definitely specify at\nleast a GPU ID).\n\n```\n--gpu_id \u003cID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu. If multiple GPUs are provided (comma separated), then distributed training will be used, but the script has to be started with torchrun.\u003e \n\n--resume_checkpoint \u003cpath to a checkpoint to load\u003e\n\n--resume (if this is present, the furthest checkpoint available will be loaded automatically)\n\n--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline)\n\n--model_save_dir \u003cpath to a directory where the checkpoints should be saved\u003e\n\n--wandb (if this is present, the logs will be synchronized to your weights\u0026biases account, if you are logged in on the command line)\n\n--wandb_resume_id \u003cthe id of the run you want to resume, if you are using weights\u0026biases (you can find the id in the URL of the run)\u003e\n```\n\nFor multi-GPU training, you have to supply multiple GPU ids (comma separated) and start the script with torchrun. You\nalso have to specify the number of GPUs. This has to match the number of IDs that you supply. Careful: torchrun is\nincompatible with nohup! 
Use tmux instead to keep the script running after you log out of the shell.\n\n```\ntorchrun --standalone --nproc_per_node=4 --nnodes=1 run_training_pipeline.py \u003cshorthand of the pipeline\u003e --gpu_id \"0,1,2,3\"\n```\n\nAfter every epoch (or alternatively after certain step counts), some logs will be written to the console and to the\nWeights and Biases website, if you are logged in and set the flag. If you get CUDA out-of-memory errors, you need to\ndecrease\nthe batch size in the arguments of the call to the training_loop in the pipeline you are running. Try decreasing the\nbatch size in small steps until you get no more CUDA out-of-memory errors.\n\nIn the directory you specified for saving, checkpoint files and spectrogram visualization\ndata will appear. Since the checkpoints are quite big, only the five most recent ones will be kept. The number of\ntraining steps highly depends on the data you are using and whether you're finetuning from a pretrained checkpoint or\ntraining from scratch. The less data you have, the fewer steps you should take to prevent a possible collapse. If\nyou want to stop earlier, just kill the process. Since everything is daemonic, all the child processes should die with\nit. In case there are some ghost processes left behind, you can use the following command to find them and kill them\nmanually.\n\n```\nfuser -v /dev/nvidia*\n```\n\nWhenever a checkpoint is saved, a compressed version that can be used for inference is also created, which is named\n_best.pt_.\n\n--- \n\u003cbr\u003e\n\n## FAQ 🐓\n\nHere are a few points that were brought up by users:\n\n- How can I figure out if my data has outliers or similar problems? 
-- There is a scorer that can find and even remove\n  samples from your dataset cache that have extraordinarily high loss values, have a look at `run_scorer.py`.\n- My error message shows GPU0, even though I specified a different GPU -- The way GPU selection works is that the\n  specified GPU is set as the only visible device, in order to avoid backend stuff running accidentally on different\n  GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually\n  running on the GPU you specified.\n- read_to_file produces strange outputs -- Check if you're passing a list to the method or a string. Since strings can\n  be\n  iterated over, it might not throw an error, but a list of strings is expected.\n- `UserWarning: Detected call of lr_scheduler.step() before optimizer.step().` -- We use a custom scheduler, and torch\n  incorrectly thinks that we call the scheduler and the optimizer in the wrong order. Just ignore this warning, it is\n  completely meaningless.\n- `WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. [...]` -- Another meaningless warning. We actually don't\n  use xFormers ourselves, it is just part of the dependencies of one of our dependencies, but it is not used at any\n  place.\n- `The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows. [...]` -- Just\n  happens under Windows and doesn't affect anything.\n- `WARNING:phonemizer:words count mismatch on 200.0% of the lines (2/1) [...]` -- We have no idea why espeak started\n  giving out this warning, however it doesn't seem to affect anything, so it seems safe to ignore.\n- Loss turns to `NaN` -- The default learning rates work on clean data. If your data is less clean, try using the scorer\n  to find problematic samples, or reduce the learning rate. The most common problem is there being pauses in the speech,\n  but nothing that hints at them in the text. 
That's why ASR corpora, which leave out punctuation, are usually difficult\n  to use for TTS.\n\n--- \n\u003cbr\u003e\n\n## Acknowledgements 🦆\n\nThe basic PyTorch modules of FastSpeech 2 and GST are taken from\n[ESPnet](https://github.com/espnet/espnet), the PyTorch modules of\nHiFi-GAN are taken from the [ParallelWaveGAN repository](https://github.com/kan-bayashi/ParallelWaveGAN).\nSome modules related to the ConditionalFlowMatching based PostNet as outlined in MatchaTTS are taken\nfrom the [official MatchaTTS codebase](https://github.com/shivammehta25/Matcha-TTS) and some are taken\nfrom [the StableTTS codebase](https://github.com/KdaiP/StableTTS).\nFor grapheme-to-phoneme conversion, we rely on the aforementioned eSpeak-NG as\nwell as [transphone](https://github.com/xinjli/transphone). We\nuse [encodec, a neural audio codec](https://github.com/yangdongchao/AcademiCodec) as intermediate representation\nfor caching the train data to save space.\n\n## Citation 🐧\n\n\u003ca href=\"https://star-history.com/#DigitalPhonetics/IMS-Toucan\u0026Date\"\u003e\n \u003cpicture\u003e\n   \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=DigitalPhonetics/IMS-Toucan\u0026type=Date\u0026theme=dark\" /\u003e\n   \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=DigitalPhonetics/IMS-Toucan\u0026type=Date\" /\u003e\n   \u003cimg alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=DigitalPhonetics/IMS-Toucan\u0026type=Date\" /\u003e\n \u003c/picture\u003e\n\u003c/a\u003e\n\n### Introduction of the Toolkit [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v1.0)\n\n```\n@inproceedings{lux2021toucan,\n  year         = 2021,\n  title        = {{The IMS Toucan system for the Blizzard Challenge 2021}},\n  author       = {Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge 
Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### Adding Articulatory Features and Meta-Learning Pretraining [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v1.1)\n\n```\n@inproceedings{lux2022laml,\n  year         = 2022,\n  title        = {{Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features}},\n  author       = {Florian Lux and Ngoc Thang Vu},\n  booktitle    = {ACL}\n}\n```\n\n### Adding Exact Prosody-Cloning Capabilities [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v2.2)\n\n```\n@inproceedings{lux2022cloning,\n  year         = 2022,\n  title        = {{Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech}},\n  author       = {Florian Lux and Julia Koch and Ngoc Thang Vu},\n  booktitle    = {SLT},\n  publisher    = {IEEE}\n}\n```\n\n### Adding Language Embeddings and Word Boundaries [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v2.2)\n\n```\n@inproceedings{lux2022lrms,\n  year         = 2022,\n  title        = {{Low-Resource Multilingual and Zero-Shot Multispeaker TTS}},\n  author       = {Florian Lux and Julia Koch and Ngoc Thang Vu},\n  booktitle    = {AACL}\n}\n```\n\n### Adding Controllable Speaker Embedding Generation [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v2.3)\n\n```\n@inproceedings{lux2023controllable,\n  year         = 2023,\n  title        = {{Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions}},\n  author       = {Florian Lux and Pascal Tilli and Sarina Meyer and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n\n### Our Contribution to the Blizzard Challenge 2023 [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v2.b)\n\n```\n@inproceedings{lux2023blizzard,\n  year         = 2023,\n  title        = {{The IMS Toucan 
System for the Blizzard Challenge 2023}},\n  author       = {Florian Lux and Julia Koch and Sarina Meyer and Thomas Bott and Nadja Schauffler and Pavel Denisov and Antje Schweitzer and Ngoc Thang Vu},\n  booktitle    = {Blizzard Challenge Workshop},\n  publisher    = {ISCA Speech Synthesis SIG}\n}\n```\n\n### Introducing the first TTS System in over 7000 languages [[associated code and models]](https://github.com/DigitalPhonetics/IMS-Toucan/releases/tag/v3.0)\n\n```\n@inproceedings{lux2024massive,\n  year         = 2024,\n  title        = {{Meta Learning Text-to-Speech Synthesis in over 7000 Languages}},\n  author       = {Florian Lux and Sarina Meyer and Lyonel Behringer and Frank Zalkow and Phat Do and Matt Coler and Emanuël A. P. Habets and Ngoc Thang Vu},\n  booktitle    = {Interspeech},\n  publisher    = {ISCA}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdigitalphonetics%2Fims-toucan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdigitalphonetics%2Fims-toucan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdigitalphonetics%2Fims-toucan/lists"}