{"id":17186942,"url":"https://github.com/SWivid/F5-TTS","last_synced_at":"2025-02-23T22:32:14.457Z","repository":{"id":257826500,"uuid":"869549554","full_name":"SWivid/F5-TTS","owner":"SWivid","description":"Official code for \"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching\"","archived":false,"fork":false,"pushed_at":"2025-02-18T10:42:12.000Z","size":1843,"stargazers_count":9680,"open_issues_count":42,"forks_count":1303,"subscribers_count":95,"default_branch":"main","last_synced_at":"2025-02-18T11:26:04.448Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2410.06885","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SWivid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-08T13:36:55.000Z","updated_at":"2025-02-18T10:42:17.000Z","dependencies_parsed_at":"2024-12-13T09:17:59.177Z","dependency_job_id":"9d8afde7-9365-4925-93fb-7fff0fcaf431","html_url":"https://github.com/SWivid/F5-TTS","commit_stats":null,"previous_names":["swivid/f5-tts"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWivid%2FF5-TTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWivid%2FF5-TTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWivid%2FF5-TTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SWivid%2FF5-TTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/SWivid","download_url":"https://codeload.github.com/SWivid/F5-TTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240390495,"owners_count":19793779,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T01:01:16.485Z","updated_at":"2025-02-23T22:32:14.423Z","avatar_url":"https://github.com/SWivid.png","language":"Python","funding_links":[],"categories":["Text-to-Speech (TTS)","others","Python","🎙 Voice \u0026 Audio Tools","排行榜 [2025-03-18]","AI \u0026 Machine Learning for CG","2. Open Foundation Models","4. Text-to-speech (TTS)"],"sub_categories":["Open-Source Models \u0026 Libraries","Text-to-Speech","AI Audio \u0026 Music","Open source"],"readme":"# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow 
Matching\n\n[![python](https://img.shields.io/badge/Python-3.10-brightgreen)](https://github.com/SWivid/F5-TTS)\n[![arXiv](https://img.shields.io/badge/arXiv-2410.06885-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.06885)\n[![demo](https://img.shields.io/badge/GitHub-Demo%20page-orange.svg)](https://swivid.github.io/F5-TTS/)\n[![hfspace](https://img.shields.io/badge/🤗-Space%20demo-yellow)](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)\n[![msspace](https://img.shields.io/badge/🤖-Space%20demo-blue)](https://modelscope.cn/studios/modelscope/E2-F5-TTS)\n[![lab](https://img.shields.io/badge/X--LANCE-Lab-grey?labelColor=lightgrey)](https://x-lance.sjtu.edu.cn/)\n[![lab](https://img.shields.io/badge/Peng%20Cheng-Lab-grey?labelColor=lightgrey)](https://www.pcl.ac.cn)\n\u003c!-- \u003cimg src=\"https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670\" alt=\"Watermark\" style=\"width: 40px; height: auto\"\u003e --\u003e\n\n**F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster training and inference.\n\n**E2 TTS**: Flat-UNet Transformer, closest reproduction of the [paper](https://arxiv.org/abs/2406.18009).\n\n**Sway Sampling**: Inference-time flow step sampling strategy, greatly improves performance.\n\n### Thanks to all the contributors!\n\n## News\n- **2024/10/08**: F5-TTS \u0026 E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).\n\n## Installation\n\n### Create a separate environment if needed\n\n```bash\n# Create a python 3.10 conda env (you could also use virtualenv)\nconda create -n f5-tts python=3.10\nconda activate f5-tts\n```\n\n### Install PyTorch for your device\n\n\u003cdetails\u003e\n\u003csummary\u003eNVIDIA GPU\u003c/summary\u003e\n\n\u003e ```bash\n\u003e # Install pytorch with your CUDA version, e.g.\n\u003e pip install torch==2.3.0+cu118 
torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118\n\u003e ```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAMD GPU\u003c/summary\u003e\n\n\u003e ```bash\n\u003e # Install pytorch with your ROCm version (Linux only), e.g.\n\u003e pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2\n\u003e ```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eIntel GPU\u003c/summary\u003e\n\n\u003e ```bash\n\u003e # Install pytorch with your XPU version, e.g.\n\u003e # Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed\n\u003e pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu\n\u003e \n\u003e # Intel GPU support is also available through IPEX (Intel® Extension for PyTorch)\n\u003e # IPEX does not require the Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit\n\u003e # See: https://pytorch-extension.intel.com/installation?request=platform\n\u003e ```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eApple Silicon\u003c/summary\u003e\n\n\u003e ```bash\n\u003e # Install the stable pytorch, e.g.\n\u003e pip install torch torchaudio\n\u003e ```\n\n\u003c/details\u003e\n\n### Then choose one of the options below:\n\n\u003e ### 1. As a pip package (if you only need inference)\n\u003e \n\u003e ```bash\n\u003e pip install git+https://github.com/SWivid/F5-TTS.git\n\u003e ```\n\u003e \n\u003e ### 2. Local editable install (if you also want to train or finetune)\n\u003e \n\u003e ```bash\n\u003e git clone https://github.com/SWivid/F5-TTS.git\n\u003e cd F5-TTS\n\u003e # git submodule update --init --recursive  # (optional, if you need bigvgan)\n\u003e pip install -e .\n\u003e ```\n\n### Docker usage is also available\n```bash\n# Build from Dockerfile\ndocker build -t f5tts:v1 .\n\n# Or pull from GitHub Container Registry\ndocker pull ghcr.io/swivid/f5-tts:main\n```\n\n\n## Inference\n\n### 1. 
Gradio App\n\nCurrently supported features:\n\n- Basic TTS with Chunk Inference\n- Multi-Style / Multi-Speaker Generation\n- Voice Chat powered by Qwen2.5-3B-Instruct\n- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)\n\n```bash\n# Launch a Gradio app (web interface)\nf5-tts_infer-gradio\n\n# Specify the port/host\nf5-tts_infer-gradio --port 7860 --host 0.0.0.0\n\n# Launch a share link\nf5-tts_infer-gradio --share\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eNVIDIA device docker compose file example\u003c/summary\u003e\n\n```yaml\nservices:\n  f5-tts:\n    image: ghcr.io/swivid/f5-tts:main\n    ports:\n      - \"7860:7860\"\n    environment:\n      GRADIO_SERVER_PORT: 7860\n    entrypoint: [\"f5-tts_infer-gradio\", \"--port\", \"7860\", \"--host\", \"0.0.0.0\"]\n    deploy:\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              count: 1\n              capabilities: [gpu]\n\nvolumes:\n  f5-tts:\n    driver: local\n```\n\n\u003c/details\u003e\n\n### 2. CLI Inference\n\n```bash\n# Run with flags\n# Leaving --ref_text \"\" empty makes an ASR model transcribe the reference audio (extra GPU memory usage)\nf5-tts_infer-cli \\\n--model \"F5-TTS\" \\\n--ref_audio \"ref_audio.wav\" \\\n--ref_text \"The content, subtitle or transcription of reference audio.\" \\\n--gen_text \"Some text you want the TTS model to generate for you.\"\n\n# Run with the default settings: src/f5_tts/infer/examples/basic/basic.toml\nf5-tts_infer-cli\n# Or with your own .toml file\nf5-tts_infer-cli -c custom.toml\n\n# Multi voice. See src/f5_tts/infer/README.md\nf5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml\n```\n\n### 3. More instructions\n\n- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).\n- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) are very useful; please search them for keywords from the problem you encountered. 
If no answer is found, feel free to open an issue.\n\n\n## Training\n\n### 1. Gradio App\n\nRead the [training \u0026 finetuning guidance](src/f5_tts/train) for more instructions.\n\n```bash\n# Quick start with Gradio web interface\nf5-tts_finetune-gradio\n```\n\n\n## [Evaluation](src/f5_tts/eval)\n\n\n## Development\n\nUse pre-commit to ensure code quality (it runs linters and formatters automatically):\n\n```bash\npip install pre-commit\npre-commit install\n```\n\nWhen making a pull request, before each commit, run:\n\n```bash\npre-commit run --all-files\n```\n\nNote: Some model components have linting exceptions for E722 to accommodate tensor notation.\n\n\n## Acknowledgements\n\n- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective\n- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets\n- [lucidrains](https://github.com/lucidrains) initial CFM structure, also [bfs18](https://github.com/bfs18) for discussion\n- [SD3](https://arxiv.org/abs/2403.03206) \u0026 [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure\n- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoder\n- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools\n- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test\n- [mrfakename](https://x.com/realmrfakename) Hugging Face Space demo\n- [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) implementation with the MLX framework by [Lucas 
Newman](https://github.com/lucasnewman)\n- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)\n\n## Citation\nIf our work and codebase are useful to you, please cite as:\n```\n@article{chen-etal-2024-f5tts,\n      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, \n      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},\n      journal={arXiv preprint arXiv:2410.06885},\n      year={2024},\n}\n```\n## License\n\nOur code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license because of the training data, Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSWivid%2FF5-TTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSWivid%2FF5-TTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSWivid%2FF5-TTS/lists"}