{"id":27073264,"url":"https://github.com/facebookresearch/loop","last_synced_at":"2025-04-06T00:01:57.192Z","repository":{"id":66233089,"uuid":"100496936","full_name":"facebookarchive/loop","owner":"facebookarchive","description":"A method to generate speech across multiple speakers","archived":true,"fork":false,"pushed_at":"2019-03-21T19:07:42.000Z","size":146,"stargazers_count":870,"open_issues_count":18,"forks_count":159,"subscribers_count":68,"default_branch":"master","last_synced_at":"2024-08-12T21:27:13.239Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookarchive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-08-16T14:14:27.000Z","updated_at":"2024-08-12T19:32:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"8f923c71-0db3-468b-8d10-1a73171203f8","html_url":"https://github.com/facebookarchive/loop","commit_stats":null,"previous_names":["facebookresearch/loop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookarchive%2Floop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookarchive%2Floop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookarchive%2Floop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookarchive%2Floop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookarchive","download_url":"https://codeload.github.com/facebookarchive/loop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415933,"owners_count":20935388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-06T00:00:50.043Z","updated_at":"2025-04-06T00:01:57.087Z","avatar_url":"https://github.com/facebookarchive.png","language":"Python","readme":"# VoiceLoop\nPyTorch implementation of the method described in the paper [VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop](https://arxiv.org/abs/1707.06588).\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"70%\" src=\"img/method.png\" /\u003e\u003c/p\u003e\n\nVoiceLoop is a neural text-to-speech (TTS) that is able to transform text to speech in voices that are sampled\nin the wild. Some demo samples can be [found here](https://ytaigman.github.io/loop/site/).\n\n## Quick Links\n- [Demo Samples](https://ytaigman.github.io/loop/site/) \n- [Quick Start](#quick-start)\n- [Setup](#setup)\n- [Training](#training)\n\n## Quick Start\nFollow the instructions in [Setup](#setup) and then simply execute:\n ```bash\n python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth\n ```\n Results will be placed in ```models/vctk/results```. It will generate 2 samples: \n  * The [generated sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_10.wav) will be saved with the gen_10.wav extension.\n  * Its [ground-truth (test) sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.orig.wav) is also generated and is saved with the orig.wav extension.\n  \nYou can also generate the same text but with a different speaker, specifically:\n ```bash\n python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth\n ```\nWhich will generate the following [sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_14.wav). \n\nHere is the corresponding attention plot: \n\n\u003cp align=\"center\"\u003e\u003cimg width=\"50%\" src=\"img/attn_10.png\" /\u003e\u003cimg width=\"50%\" src=\"img/attn_14.png\" /\u003e\u003c/p\u003e\n\nLegend: X-axis is output time (acoustic samples) Y-axis is input (text/phonemes). Left figure is speaker 10, right is speaker 14. \n\nFinally, free text is also supported:\n ```bash\npython generate.py  --text \"hello world\" --spkr 1 --checkpoint models/vctk/bestmodel.pth\n```\n\n## Setup\nRequirements: Linux/OSX, Python2.7 and [PyTorch 0.1.12](http://pytorch.org/). Generation requires installing [phonemizer](https://github.com/bootphon/phonemizer), follow the setup instructions there. \nThe current version of the code requires CUDA support for training. Generation can be done on the CPU.\n\n```bash\ngit clone https://github.com/facebookresearch/loop.git\ncd loop\npip install -r scripts/requirements.txt\n```\n\n### Data\nThe data used to train the models in the paper can be downloaded via:\n```bash\nbash scripts/download_data.sh\n```\n\nThe script downloads and preprocesses a subset of [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). This subset contains speakers with american accent.  \n\nThe dataset was preprocessed using [Merlin](http://www.cstr.ed.ac.uk/projects/merlin/) - from each audio clip we extracted vocoder features using the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder. After downloading, the dataset will be located under subfolder ```data``` as follows:\n\n```\nloop\n├── data\n    └── vctk\n        ├── norm_info\n        │   ├── norm.dat\n        ├── numpy_feautres\n        │   ├── p294_001.npz\n        │   ├── p294_002.npz\n        │   └── ...\n        └── numpy_features_valid\n```\n\nThe preprocess pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.\n\n### Pretrained Models\nPretrainde models can be downloaded via:\n```bash\nbash scripts/download_models.sh\n```\nAfter downloading, the models will be located under subfolder ```models``` as follows:\n\n```\nloop\n├── data\n├── models\n    ├── blizzard\n    ├── vctk\n    │   ├── args.pth\n    │   └── bestmodel.pth\n    └── vctk_alt\n```\n\n**Update 10/25/2017:** Single speaker model available in models/blizzard/\n\n### SPTK and WORLD\nFinally, speech generation requires [SPTK3.9](http://sp-tk.sourceforge.net/) and [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder as done in Merlin. To download the executables: \n```bash\nbash scripts/download_tools.sh\n```\nWhich results the following sub directories:\n```\nloop\n├── data\n├── models\n├── tools\n    ├── SPTK-3.9\n    └── WORLD\n```\n \n## Training\n\n### Single-Speaker\nSingle speaker model is trained on [blizzard 2011](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/). Data should be downloaded and prepared as described above. Once the data is ready, run:\n```bash\npython train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10\n```\nThen, continue training the model with :\n```bash\npython train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90\n```\n### Multi-Speaker\nTraining a new model on vctk, first train the model using noise level of 4 and input sequence length of 100:\n```bash\npython train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90\n```\nThen, continue training the model using noise level of 2, on full sequences:\n```bash\npython train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90\n```\n\n## Citation\nIf you find this code useful in your research then please cite:\n\n```\n@article{taigman2017voice,\n  title           = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop},\n  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},\n  journal         = {ArXiv e-prints},\n  archivePrefix   = \"arXiv\",\n  eprinttype      = {arxiv},\n  eprint          = {1705.03122},\n  primaryClass    = \"cs.CL\",\n  year            = {2017}\n  month           = October,\n}\n```\n\n## License\nLoop has a CC-BY-NC license.\n","funding_links":[],"categories":["Synthesis","Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Pytorch \u0026 related libraries","Table of Contents"],"sub_categories":["NLP \u0026 Speech Processing｜自然语言处理 \u0026 语音处理:","NLP \u0026 Speech Processing:"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Floop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Floop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Floop/lists"}