{"id":20563608,"url":"https://github.com/gooofy/zerovox","last_synced_at":"2025-10-12T10:03:07.266Z","repository":{"id":236499287,"uuid":"792729650","full_name":"gooofy/zerovox","owner":"gooofy","description":"zero-shot realtime TTS system, fully offline, free and open source","archived":false,"fork":false,"pushed_at":"2025-04-18T16:43:28.000Z","size":40823,"stargazers_count":44,"open_issues_count":2,"forks_count":6,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-23T16:26:24.469Z","etag":null,"topics":["deep-learning","hifigan","melgan","multi-speaker-tts","python","pytorch","speaker-encoder","speaker-encodings","speech","speech-synthesis","text-to-speech","tts","tts-model","voice-cloning","voice-synthesis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gooofy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-27T12:10:26.000Z","updated_at":"2025-09-19T04:49:31.000Z","dependencies_parsed_at":"2024-08-24T10:33:08.708Z","dependency_job_id":"7a071e2f-9834-4424-b74b-1691f548ac14","html_url":"https://github.com/gooofy/zerovox","commit_stats":null,"previous_names":["gooofy/zerovox"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/gooofy/zerovox","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzerovox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzerovox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/gooofy%2Fzerovox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzerovox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gooofy","download_url":"https://codeload.github.com/gooofy/zerovox/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gooofy%2Fzerovox/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279011043,"owners_count":26084863,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","hifigan","melgan","multi-speaker-tts","python","pytorch","speaker-encoder","speaker-encodings","speech","speech-synthesis","text-to-speech","tts","tts-model","voice-cloning","voice-synthesis"],"created_at":"2024-11-16T04:19:46.843Z","updated_at":"2025-10-12T10:03:07.249Z","avatar_url":"https://github.com/gooofy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"ZeroVOX: A zero-shot realtime TTS system, fully offline, free and open source\n=============================================================================\n\nZeroVOX is a text-to-speech (TTS) system built for real-time and embedded use.\n\nZeroVox runs entirely offline, ensuring privacy and independence from cloud services. 
It's completely free and open source, inviting community contributions and suggestions.\n\nModeled after FastSpeech2, ZeroVOX goes a step further with zero-shot speaker cloning based on speaker embeddings. The system supports both English and German speech generation from a single model, trained on an extensive dataset. ZeroVOX is phoneme-based and relies on pronunciation dictionaries for accurate word articulation: the CMU dictionary for English and a custom German dictionary from the ZamiaSpeech project, which is also the source of the phoneme set.\n\nZeroVOX can serve as a TTS backend for LLMs, enabling real-time interactions, and as an easy-to-install TTS system for home automation systems like Home Assistant. Because it is non-autoregressive, like FastSpeech2, its output is generally predictable and easy to control.\n\nLicense: ZeroVOX is licensed under Apache 2.0; many parts are borrowed from other projects (see the Credits section below) under the MIT license.\n\nDemo\n====\n\nPlease note: the model is still in alpha stage and still being trained.\n\nhttps://huggingface.co/spaces/goooofy/zerovox-demo\n\nAudio Corpus Stats\n==================\n\nCurrent ZeroVOX training corpus stats:\n\n    german  audio corpus: 16679 speakers, 475.3 hours audio\n    english audio corpus: 19899 speakers, 358.7 hours audio\n\nZeroVOX Model Training\n======================\n\nSet the ZEROVOX_PREPROCESSED_DATA_PATH environment variable to the directory where you want to store preprocessed data, e.g.\n\n    export ZEROVOX_PREPROCESSED_DATA_PATH=\"/mnt/data1/preprocessed_data\"\n\nData Preparation\n----------------\n\n(1/2) Prepare corpus YAMLs:\n\n    pushd configs/corpora/cv_de_100\n    ./gen_cv.sh\n    popd\n\n(2/2) Preprocess:\n\n    utils/preprocess.py configs/tts_medium_styledec.yaml configs/corpora/de_hui configs/corpora/cv_de_100 ...\n\nTTS Model Training\n------------------\n\n    utils/train_tts.py \\\n        -c configs/tts_medium_styledec.yaml \\\n        --accelerator=gpu \\\n        
--threads=24 \\\n        --batch-size=20 \\\n        --max-epochs=100 \\\n        --out-folder=models/tts_de_zerovox_medium_1 \\\n        configs/corpora/cv_de_100 \\\n        configs/corpora/de_hui \\\n        configs/corpora/de_thorsten.yaml\n\nCredits\n=======\n\nThe training setup is originally based on EfficientSpeech by Rowel Atienza:\n\nhttps://github.com/roatienza/efficientspeech\n\n    @inproceedings{atienza2023efficientspeech,\n      title={EfficientSpeech: An On-Device Text to Speech Model},\n      author={Atienza, Rowel},\n      booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n      pages={1--5},\n      year={2023},\n      organization={IEEE}\n    }\n\nThe FastSpeech2 encoder and decoder are borrowed (under MIT license) from Chung-Ming Chien's implementation of FastSpeech2:\n\nhttps://github.com/ming024/FastSpeech2\n\n    @misc{ren2022fastspeech2fasthighquality,\n        title={FastSpeech 2: Fast and High-Quality End-to-End Text to Speech}, \n        author={Yi Ren and Chenxu Hu and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},\n        year={2022},\n        eprint={2006.04558},\n        archivePrefix={arXiv},\n        primaryClass={eess.AS},\n        url={https://arxiv.org/abs/2006.04558}, \n    }\n\nThe StyleTTS encoder is borrowed (under MIT license) from Aaron (Yinghao) Li's implementation of StyleTTS:\n\nhttps://github.com/yl4579/StyleTTS\n\n    @misc{li2023stylettsstylebasedgenerativemodel,\n        title={StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis}, \n        author={Yinghao Aaron Li and Cong Han and Nima Mesgarani},\n        year={2023},\n        eprint={2205.15439},\n        archivePrefix={arXiv},\n        primaryClass={eess.AS},\n        url={https://arxiv.org/abs/2205.15439}, \n    }\n\nThe HiFi-GAN MEL decoder implementation is borrowed (under MIT license) from Jungil Kong's hifi-gan 
project:\n\nhttps://github.com/jik876/hifi-gan\n\n    @misc{kong2020hifigangenerativeadversarialnetworks,\n        title={HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis}, \n        author={Jungil Kong and Jaehyeon Kim and Jaekyoung Bae},\n        year={2020},\n        eprint={2010.05646},\n        archivePrefix={arXiv},\n        primaryClass={cs.SD},\n        url={https://arxiv.org/abs/2010.05646}, \n    }\n\nThe zero-shot ResNet-based speaker encoding is borrowed (under MIT license) from voxceleb_trainer by Clova AI Research:\n\nhttps://github.com/clovaai/voxceleb_trainer\n\n    @inproceedings{chung2020in,\n    title={In defence of metric learning for speaker recognition},\n    author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},\n    booktitle={Proc. Interspeech},\n    year={2020}\n    }\n\n    @inproceedings{he2016deep,\n    title={Deep residual learning for image recognition},\n    author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},\n    booktitle={IEEE Conference on Computer Vision and Pattern Recognition},\n    pages={770--778},\n    year={2016}\n    }\n\nSpeaker Conditional Layer Normalization (SCLN) is borrowed (under MIT license) from Keon Lee's Cross-Speaker-Emotion-Transfer project:\n\nhttps://github.com/keonlee9420/Cross-Speaker-Emotion-Transfer\n\n    @misc{wu2021crossspeakeremotiontransferbased,\n        title={Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech}, \n        author={Pengfei Wu and Junjie Pan and Chenchang Xu and Junhui Zhang and Lin Wu and Xiang Yin and Zejun Ma},\n        year={2021},\n        eprint={2110.04153},\n        archivePrefix={arXiv},\n        primaryClass={eess.AS},\n        url={https://arxiv.org/abs/2110.04153}, \n    
}\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooofy%2Fzerovox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooofy%2Fzerovox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooofy%2Fzerovox/lists"}