{"id":13449153,"url":"https://github.com/lifeiteng/vall-e","last_synced_at":"2025-05-15T10:00:52.166Z","repository":{"id":65580115,"uuid":"593927328","full_name":"lifeiteng/vall-e","owner":"lifeiteng","description":"PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html","archived":false,"fork":false,"pushed_at":"2023-11-14T12:35:46.000Z","size":97399,"stargazers_count":2115,"open_issues_count":33,"forks_count":324,"subscribers_count":48,"default_branch":"main","last_synced_at":"2025-04-14T15:56:54.597Z","etag":null,"topics":["chatgpt","in-context-learning","large-language-models","text-to-speech","tts","vall-e","valle"],"latest_commit_sha":null,"homepage":"https://lifeiteng.github.io/valle/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lifeiteng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null},"funding":{"github":"lifeiteng","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":["https://github.com/lifeiteng/SoundStorm/blob/master/.github/sponsor.jpg"]}},"created_at":"2023-01-27T06:56:47.000Z","updated_at":"2025-04-13T23:09:44.000Z","dependencies_parsed_at":"2023-10-29T08:24:20.344Z","dependency_job_id":null,"html_url":"https://github.com/lifeiteng/vall-e","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeiteng%2Fvall-e","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/lifeiteng%2Fvall-e/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeiteng%2Fvall-e/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lifeiteng%2Fvall-e/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lifeiteng","download_url":"https://codeload.github.com/lifeiteng/vall-e/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254319715,"owners_count":22051072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","in-context-learning","large-language-models","text-to-speech","tts","vall-e","valle"],"created_at":"2024-07-31T06:00:32.346Z","updated_at":"2025-05-15T10:00:51.397Z","avatar_url":"https://github.com/lifeiteng.png","language":"Python","funding_links":["https://github.com/sponsors/lifeiteng","https://github.com/lifeiteng/SoundStorm/blob/master/.github/sponsor.jpg","https://www.buymeacoffee.com/feiteng"],"categories":["Python","Reimplementations","Audio models","SDK, Libraries, Frameworks"],"sub_categories":["Python library, sdk or frameworks"],"readme":"Language : 🇺🇸 | [🇨🇳](./README.zh-CN.md)\n\nAn unofficial PyTorch implementation of VALL-E([Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)).\n\nWe can train the VALL-E model on one GPU.\n\n![model](./docs/images/Overview.jpg)\n\n## Demo\n\n* [official demo](https://valle-demo.github.io/)\n* [reproduced demo](https://lifeiteng.github.io/valle/index.html)\n\n\u003ca 
href=\"https://www.buymeacoffee.com/feiteng\" target=\"_blank\"\u003e\u003cimg src=\"https://cdn.buymeacoffee.com/buttons/v2/default-blue.png\" alt=\"Buy Me A Coffee\" style=\"height: 40px !important;width: 145px !important;\" \u003e\u003c/a\u003e\n\n\u003cimg src=\"./docs/images/vallf.png\" width=\"500\" height=\"400\"\u003e\n\n\n## Broader impacts\n\n\u003e Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.\n\nTo avoid abuse, well-trained models and services will not be provided.\n\n## Install Deps\n\nTo get up and running quickly, follow the steps below:\n\n```\n# PyTorch\npip install torch==1.13.1 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116\npip install torchmetrics==0.11.1\n# fbank\npip install librosa==0.8.1\n\n# phonemizer pypinyin\napt-get install espeak-ng\n## OSX: brew install espeak\npip install phonemizer==3.2.1 pypinyin==0.48.0\n\n# lhotse update to newest version\n# https://github.com/lhotse-speech/lhotse/pull/956\n# https://github.com/lhotse-speech/lhotse/pull/960\npip uninstall lhotse\npip uninstall lhotse\npip install git+https://github.com/lhotse-speech/lhotse\n\n# k2\n# find the right version in https://huggingface.co/csukuangfj/k2\npip install https://huggingface.co/csukuangfj/k2/resolve/main/cuda/k2-1.23.4.dev20230224+cuda11.6.torch1.13.1-cp310-cp310-linux_x86_64.whl\n\n# icefall\ngit clone https://github.com/k2-fsa/icefall\ncd icefall\npip install -r requirements.txt\nexport PYTHONPATH=`pwd`/../icefall:$PYTHONPATH\necho \"export PYTHONPATH=`pwd`/../icefall:\\$PYTHONPATH\" \u003e\u003e ~/.zshrc\necho \"export PYTHONPATH=`pwd`/../icefall:\\$PYTHONPATH\" \u003e\u003e ~/.bashrc\ncd -\nsource ~/.zshrc\n\n# valle\ngit clone https://github.com/lifeiteng/vall-e.git\ncd vall-e\npip install -e .\n```\n\n\n## Training\u0026Inference\n* #### English example 
[examples/libritts/README.md](egs/libritts/README.md)\n* #### Chinese example [examples/aishell1/README.md](egs/aishell1/README.md)\n* ### Prefix Mode 0 1 2 4 for NAR Decoder\n  **Paper Chapter 5.1** \"The average length of the waveform in LibriLight is 60 seconds. During training, we randomly crop the waveform to a random length between 10 seconds and 20 seconds. For the NAR acoustic prompt tokens, we select a random segment waveform of 3 seconds from the same utterance.\"\n  * **0**: no acoustic prompt tokens\n  * **1**: random prefix of current batched utterances **(This is recommended)**\n  * **2**: random segment of current batched utterances\n  * **4**: same as the paper (because the paper randomly crops each long waveform into multiple utterances, \"the same utterance\" here means the preceding or following utterance from the same long waveform)\n    ```\n    # To train the NAR decoder with prefix mode 4\n    python3 bin/trainer.py --prefix-mode 4 --dataset libritts --input-strategy PromptedPrecomputedFeatures ...\n    ```\n\n#### [LibriTTS demo](https://lifeiteng.github.io/valle/index.html) Trained on one GPU with 24 GB of memory\n\n```\ncd examples/libritts\n\n# step1 prepare dataset\nbash prepare.sh --stage -1 --stop-stage 3\n\n# step2 train the model on one GPU with 24GB memory\nexp_dir=exp/valle\n\n## Train AR model\npython3 bin/trainer.py --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \\\n      --num-buckets 6 --dtype \"bfloat16\" --save-every-n 10000 --valid-interval 20000 \\\n      --model-name valle --share-embedding true --norm-first true --add-prenet false \\\n      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \\\n      --base-lr 0.05 --warmup-steps 200 --average-period 0 \\\n      --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \\\n      --exp-dir ${exp_dir}\n\n## Train NAR model\ncp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1\npython3 bin/trainer.py --max-duration 40 
--filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \\\n      --num-buckets 6 --dtype \"float32\" --save-every-n 10000 --valid-interval 20000 \\\n      --model-name valle --share-embedding true --norm-first true --add-prenet false \\\n      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \\\n      --base-lr 0.05 --warmup-steps 200 --average-period 0 \\\n      --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \\\n      --exp-dir ${exp_dir}\n\n# step3 inference\npython3 bin/infer.py --output-dir infer/demos \\\n    --checkpoint=${exp_dir}/best-valid-loss.pt \\\n    --text-prompts \"KNOT one point one five miles per hour.\" \\\n    --audio-prompts ./prompts/8463_294825_000043_000000.wav \\\n    --text \"To get up and running quickly just follow the steps below.\"\n\n# Demo Inference\n# https://github.com/lifeiteng/lifeiteng.github.com/blob/main/valle/run.sh#L68\n```\n![train](./docs/images/train.png)\n\n#### Troubleshooting\n\n* **SummaryWriter segmentation fault (core dumped)**\n   * LINE `tb_writer = SummaryWriter(log_dir=f\"{params.exp_dir}/tensorboard\")`\n   * FIX  [https://github.com/tensorflow/tensorboard/pull/6135/files](https://github.com/tensorflow/tensorboard/pull/6135/files)\n   ```\n   file=`python  -c 'import site; print(f\"{site.getsitepackages()[0]}/tensorboard/summary/writer/event_file_writer.py\")'`\n   sed -i 's/import tf/import tensorflow_stub as tf/g' \"$file\"\n   ```\n\n#### Training on a custom dataset?\n* Prepare the dataset as `lhotse` manifests\n  * Plenty of reference recipes are available in [lhotse/recipes](https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes)\n* `python3 bin/tokenizer.py ...`\n* `python3 bin/trainer.py ...`\n\n## Contributing\n\n* Parallelize bin/tokenizer.py across multiple GPUs\n* \u003ca href=\"https://www.buymeacoffee.com/feiteng\" target=\"_blank\"\u003e\u003cimg src=\"https://cdn.buymeacoffee.com/buttons/v2/default-blue.png\" alt=\"Buy Me A Coffee\" 
style=\"height: 40px !important;width: 145px !important;\" \u003e\u003c/a\u003e\n\n## Citing\n\nTo cite this repository:\n\n```bibtex\n@misc{valle,\n  author={Feiteng Li},\n  title={VALL-E: A neural codec language model},\n  year={2023},\n  url={http://github.com/lifeiteng/vall-e}\n}\n```\n\n```bibtex\n@article{VALL-E,\n  title     = {Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},\n  author    = {Chengyi Wang, Sanyuan Chen, Yu Wu,\n               Ziqiang Zhang, Long Zhou, Shujie Liu,\n               Zhuo Chen, Yanqing Liu, Huaming Wang,\n               Jinyu Li, Lei He, Sheng Zhao, Furu Wei},\n  year      = {2023},\n  eprint    = {2301.02111},\n  archivePrefix = {arXiv},\n  volume    = {abs/2301.02111},\n  url       = {http://arxiv.org/abs/2301.02111},\n}\n```\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=lifeiteng/vall-e\u0026type=Date)](https://star-history.com/#lifeiteng/vall-e\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flifeiteng%2Fvall-e","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flifeiteng%2Fvall-e","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flifeiteng%2Fvall-e/lists"}