{"id":13441535,"url":"https://github.com/lucidrains/voicebox-pytorch","last_synced_at":"2025-05-15T18:08:27.930Z","repository":{"id":185335606,"uuid":"673372387","full_name":"lucidrains/voicebox-pytorch","owner":"lucidrains","description":"Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch","archived":false,"fork":false,"pushed_at":"2024-10-01T16:26:02.000Z","size":287,"stargazers_count":648,"open_issues_count":11,"forks_count":53,"subscribers_count":46,"default_branch":"main","last_synced_at":"2025-04-20T04:14:04.386Z","etag":null,"topics":["artificial-intelligence","deep-learning","text-to-speech"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-01T13:30:54.000Z","updated_at":"2025-04-19T12:02:51.000Z","dependencies_parsed_at":"2024-01-16T02:47:23.596Z","dependency_job_id":"7d2fa81f-f40f-402f-8886-4a3d92853295","html_url":"https://github.com/lucidrains/voicebox-pytorch","commit_stats":{"total_commits":131,"total_committers":7,"mean_commits":"18.714285714285715","dds":"0.17557251908396942","last_synced_commit":"d115a997452f278190a2634be500a3db0da5db15"},"previous_names":["lucidrains/voicebox-pytorch"],"tags_count":65,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fvoicebox-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fvoicebox-pytorch/tags","rel
eases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fvoicebox-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fvoicebox-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/voicebox-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254394722,"owners_count":22063984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","text-to-speech"],"created_at":"2024-07-31T03:01:35.178Z","updated_at":"2025-05-15T18:08:27.894Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"./images/voicebox.png\" width=\"400px\"\u003e\u003c/img\u003e\n\n## Voicebox - Pytorch\n\nImplementation of \u003ca href=\"https://arxiv.org/abs/2306.15687\"\u003eVoicebox\u003c/a\u003e, new SOTA Text-to-Speech model from MetaAI, in Pytorch. \u003ca href=\"https://about.fb.com/news/2023/06/introducing-voicebox-ai-for-speech-generation/\"\u003ePress release\u003c/a\u003e\n\nIn this work, we will use rotary embeddings. The authors seem unaware that ALiBi cannot be straightforwardly used for bidirectional models.\n\nThe paper also addresses the issue with time embedding incorrectly subjected to relative distances (they concat the time embedding along the frame dimension of the audio tokens). 
This repository will use adaptive normalization, as applied successfully in \u003ca href=\"https://arxiv.org/abs/2211.07292\"\u003ePaella\u003c/a\u003e\n\nUpdate: Recommend you just use \u003ca href=\"https://github.com/lucidrains/e2-tts-pytorch\"\u003eE2 TTS\u003c/a\u003e instead of this work\n\n## Appreciation\n\n- \u003ca href=\"https://translated.com\"\u003e\u003cimg style=\"vertical-align: middle;\" src=\"./images/translated.png\" height=\"20px\" alt=\"Translated\"\u003e\u003cimg\u003e\u003c/a\u003e for awarding me the \u003ca href=\"https://imminent.translated.com/research-grants-ceremony-innovations-in-language-technology\"\u003eImminent Grant\u003c/a\u003e to advance the state of open sourced text-to-speech solutions. This project was started and will be completed under this grant.\n\n- \u003ca href=\"https://stability.ai/\"\u003eStabilityAI\u003c/a\u003e for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.\n\n- \u003ca href=\"https://github.com/b-chiang\"\u003eBryan Chiang\u003c/a\u003e for the ongoing code review, sharing his expertise on TTS, and pointing me to \u003ca href=\"https://github.com/atong01/conditional-flow-matching\"\u003ean open sourced implementation\u003c/a\u003e of conditional flow matching\n\n- \u003ca href=\"https://github.com/manmay-nakhashi\"\u003eManmay\u003c/a\u003e for getting the repository started with the alignment code\n\n- \u003ca href=\"https://github.com/chenht2010\"\u003e@chenht2010\u003c/a\u003e for finding a bug with rotary positions, and for validating that the code in the repository converges\n\n- \u003ca href=\"https://github.com/lucasnewman\"\u003eLucas Newman\u003c/a\u003e for (yet again) pull requesting all the training code for Spear-TTS conditioned Voicebox training!\n\n- \u003ca href=\"https://github.com/lucasnewman\"\u003eLucas Newman\u003c/a\u003e has demonstrated that the whole system works with Spear-TTS conditioning. 
Training converges even better than \u003ca href=\"https://github.com/lucidrains/soundstorm-pytorch\"\u003eSoundstorm\u003c/a\u003e\n\n## Install\n\n```bash\n$ pip install voicebox-pytorch\n```\n\n## Usage\n\nTraining and sampling with `TextToSemantic` module from \u003ca href=\"https://github.com/lucidrains/spear-tts-pytorch\"\u003eSpearTTS\u003c/a\u003e\n\n```python\nimport torch\n\nfrom voicebox_pytorch import (\n    VoiceBox,\n    EncodecVoco,\n    ConditionalFlowMatcherWrapper,\n    HubertWithKmeans,\n    TextToSemantic\n)\n\n# https://github.com/facebookresearch/fairseq/tree/main/examples/hubert\n\nwav2vec = HubertWithKmeans(\n    checkpoint_path = '/path/to/hubert/checkpoint.pt',\n    kmeans_path = '/path/to/hubert/kmeans.bin'\n)\n\ntext_to_semantic = TextToSemantic(\n    wav2vec = wav2vec,\n    dim = 512,\n    source_depth = 1,\n    target_depth = 1,\n    use_openai_tokenizer = True\n)\n\ntext_to_semantic.load('/path/to/trained/spear-tts/model.pt')\n\nmodel = VoiceBox(\n    dim = 512,\n    audio_enc_dec = EncodecVoco(),\n    num_cond_tokens = 500,\n    depth = 2,\n    dim_head = 64,\n    heads = 16\n)\n\ncfm_wrapper = ConditionalFlowMatcherWrapper(\n    voicebox = model,\n    text_to_semantic = text_to_semantic\n)\n\n# mock data\n\naudio = torch.randn(2, 12000)\n\n# train\n\nloss = cfm_wrapper(audio)\nloss.backward()\n\n# after much training\n\ntexts = [\n    'the rain in spain falls mainly in the plains',\n    'she sells sea shells by the seashore'\n]\n\ncond = torch.randn(2, 12000)\nsampled = cfm_wrapper.sample(cond = cond, texts = texts) # (2, 1, \u003caudio length\u003e)\n```\n\nFor unconditional training, `condition_on_text` on `VoiceBox` must be set to `False`\n\n```python\nimport torch\nfrom voicebox_pytorch import (\n    VoiceBox,\n    ConditionalFlowMatcherWrapper\n)\n\nmodel = VoiceBox(\n    dim = 512,\n    num_cond_tokens = 500,\n    depth = 2,\n    dim_head = 64,\n    heads = 16,\n    condition_on_text = False\n)\n\ncfm_wrapper = 
ConditionalFlowMatcherWrapper(\n    voicebox = model\n)\n\n# mock data\n\nx = torch.randn(2, 1024, 512)\n\n# train\n\nloss = cfm_wrapper(x)\n\nloss.backward()\n\n# after much training\n\ncond = torch.randn(2, 1024, 512)\n\nsampled = cfm_wrapper.sample(cond = cond) # (2, 1024, 512)\n```\n\n## Todo\n\n- [x] read and internalize original flow matching paper\n    - [x] basic loss\n    - [x] get neural ode working with torchdyn\n- [x] get basic mask generation logic with the p_drop of 0.2-0.3 for ICL\n- [x] take care of p_drop, which differs between voicebox and duration model\n- [x] support torchdiffeq and torchode\n- [x] switch to adaptive rmsnorm for time conditioning\n- [x] add encodec / voco for starters\n- [x] setup training and sampling with raw audio, if `audio_enc_dec` is passed in\n- [x] integrate with log mel spec / encodec - vocos\n- [x] spear-tts-integration\n- [x] basic accelerate trainer - thanks to @lucasnewman!\n\n- [ ] cleanup NS2 aligner class and then setup duration predictor training\n- [ ] figure out the correct settings for `MelVoco` encode, as the reconstructed audio is longer in length\n- [ ] calculate how many seconds correspond to each frame and add as property on `AudioEncoderDecoder` - when sampling, allow for specifying in seconds\n\n## Citations\n\n```bibtex\n@article{Le2023VoiceboxTM,\n    title   = {Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale},\n    author  = {Matt Le and Apoorv Vyas and Bowen Shi and Brian Karrer and Leda Sari and Rashel Moritz and Mary Williamson and Vimal Manohar and Yossi Adi and Jay Mahadeokar and Wei-Ning Hsu},\n    journal = {ArXiv},\n    year    = {2023},\n    volume  = {abs/2306.15687},\n    url     = {https://api.semanticscholar.org/CorpusID:259275061}\n}\n```\n\n```bibtex\n@inproceedings{dao2022flashattention,\n    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},\n    author  = {Dao, Tri and Fu, Daniel Y. 
and Ermon, Stefano and Rudra, Atri and R{\\'e}, Christopher},\n    booktitle = {Advances in Neural Information Processing Systems},\n    year    = {2022}\n}\n```\n\n```bibtex\n@misc{torchdiffeq,\n    author  = {Chen, Ricky T. Q.},\n    title   = {torchdiffeq},\n    year    = {2018},\n    url     = {https://github.com/rtqichen/torchdiffeq},\n}\n```\n\n```bibtex\n@inproceedings{lienen2022torchode,\n    title     = {torchode: A Parallel {ODE} Solver for PyTorch},\n    author    = {Marten Lienen and Stephan G{\\\"u}nnemann},\n    booktitle = {The Symbiosis of Deep Learning and Differential Equations II, NeurIPS},\n    year      = {2022},\n    url       = {https://openreview.net/forum?id=uiKVKTiUYB0}\n}\n```\n\n```bibtex\n@article{siuzdak2023vocos,\n    title   = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},\n    author  = {Siuzdak, Hubert},\n    journal = {arXiv preprint arXiv:2306.00814},\n    year    = {2023}\n}\n```\n\n```bibtex\n@misc{darcet2023vision,\n    title   = {Vision Transformers Need Registers},\n    author  = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},\n    year    = {2023},\n    eprint  = {2309.16588},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n```bibtex\n@inproceedings{Dehghani2023ScalingVT,\n    title   = {Scaling Vision Transformers to 22 Billion Parameters},\n    author  = {Mostafa Dehghani and Josip Djolonga and Basil Mustafa and Piotr Padlewski and Jonathan Heek and Justin Gilmer and Andreas Steiner and Mathilde Caron and Robert Geirhos and Ibrahim M. Alabdulmohsin and Rodolphe Jenatton and Lucas Beyer and Michael Tschannen and Anurag Arnab and Xiao Wang and Carlos Riquelme and Matthias Minderer and Joan Puigcerver and Utku Evci and Manoj Kumar and Sjoerd van Steenkiste and Gamaleldin F. Elsayed and Aravindh Mahendran and Fisher Yu and Avital Oliver and Fantine Huot and Jasmijn Bastings and Mark Collier and Alexey A. 
Gritsenko and Vighnesh Birodkar and Cristina Nader Vasconcelos and Yi Tay and Thomas Mensink and Alexander Kolesnikov and Filip Pavetić and Dustin Tran and Thomas Kipf and Mario Lučić and Xiaohua Zhai and Daniel Keysers and Jeremiah Harmsen and Neil Houlsby},\n    booktitle = {International Conference on Machine Learning},\n    year    = {2023},\n    url     = {https://api.semanticscholar.org/CorpusID:256808367}\n}\n```\n\n```bibtex\n@inproceedings{Katsch2023GateLoopFD,\n    title   = {GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling},\n    author  = {Tobias Katsch},\n    year    = {2023},\n    url     = {https://api.semanticscholar.org/CorpusID:265018962}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fvoicebox-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fvoicebox-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fvoicebox-pytorch/lists"}