{"id":13572087,"url":"https://github.com/gemelo-ai/vocos","last_synced_at":"2025-04-04T09:31:41.158Z","repository":{"id":171744960,"uuid":"648152119","full_name":"gemelo-ai/vocos","owner":"gemelo-ai","description":"Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis","archived":false,"fork":false,"pushed_at":"2024-08-07T11:07:23.000Z","size":18167,"stargazers_count":887,"open_issues_count":47,"forks_count":105,"subscribers_count":31,"default_branch":"main","last_synced_at":"2025-03-05T15:51:24.089Z","etag":null,"topics":["vocoder","vocos"],"latest_commit_sha":null,"homepage":"https://gemelo-ai.github.io/vocos/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gemelo-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-01T10:20:23.000Z","updated_at":"2025-03-05T11:11:53.000Z","dependencies_parsed_at":"2024-06-19T02:58:33.437Z","dependency_job_id":"72009a97-02c3-421e-8a0f-5157719e15f9","html_url":"https://github.com/gemelo-ai/vocos","commit_stats":null,"previous_names":["charactr-platform/vocos","gemelo-ai/vocos"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gemelo-ai%2Fvocos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gemelo-ai%2Fvocos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gemelo-ai%2Fvocos/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gemelo-ai%2Fvocos/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gemelo-ai","download_url":"https://codeload.github.com/gemelo-ai/vocos/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247153356,"owners_count":20892641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["vocoder","vocos"],"created_at":"2024-08-01T14:01:12.883Z","updated_at":"2025-04-04T09:31:36.146Z","avatar_url":"https://github.com/gemelo-ai.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis\n\n[Audio samples](https://gemelo-ai.github.io/vocos/) |\nPaper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf)\n\nVocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative\nAdversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical\nGAN-based vocoders, Vocos does not model audio samples in the time domain. 
Instead, it generates spectral\ncoefficients, facilitating rapid audio reconstruction through inverse Fourier transform.\n\n## Installation\n\nTo use Vocos only in inference mode, install it using:\n\n```bash\npip install vocos\n```\n\nIf you wish to train the model, install it with additional dependencies:\n\n```bash\npip install vocos[train]\n```\n\n## Usage\n\n### Reconstruct audio from mel-spectrogram\n\n```python\nimport torch\n\nfrom vocos import Vocos\n\nvocos = Vocos.from_pretrained(\"charactr/vocos-mel-24khz\")\n\nmel = torch.randn(1, 100, 256)  # B, C, T\naudio = vocos.decode(mel)\n```\n\nCopy-synthesis from a file:\n\n```python\nimport torchaudio\n\ny, sr = torchaudio.load(YOUR_AUDIO_FILE)\nif y.size(0) \u003e 1:  # mix to mono\n    y = y.mean(dim=0, keepdim=True)\ny = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)\ny_hat = vocos(y)\n```\n\n### Reconstruct audio from EnCodec tokens\n\nYou also need to provide a `bandwidth_id`: the index of the bandwidth embedding to use, taken from the\nlist `[1.5, 3.0, 6.0, 12.0]` (in kbps).\n\n```python\nvocos = Vocos.from_pretrained(\"charactr/vocos-encodec-24khz\")\n\naudio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames\nfeatures = vocos.codes_to_features(audio_tokens)\nbandwidth_id = torch.tensor([2])  # 6 kbps\n\naudio = vocos.decode(features, bandwidth_id=bandwidth_id)\n```\n\nCopy-synthesis from a file. This extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a\nsingle forward pass.\n\n```python\ny, sr = torchaudio.load(YOUR_AUDIO_FILE)\nif y.size(0) \u003e 1:  # mix to mono\n    y = y.mean(dim=0, keepdim=True)\ny = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)\n\ny_hat = vocos(y, bandwidth_id=bandwidth_id)\n```\n\n### Integrate with 🐶 [Bark](https://github.com/suno-ai/bark) text-to-audio model\n\nSee [example notebook](notebooks%2FBark%2BVocos.ipynb).\n\n## Pre-trained models\n\n| Model Name                          
                                                | Dataset       | Training Iterations | Parameters |\n|-------------------------------------------------------------------------------------|---------------|---------------------|------------|\n| [charactr/vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)         | LibriTTS      | 1M                  | 13.5M      |\n| [charactr/vocos-encodec-24khz](https://huggingface.co/charactr/vocos-encodec-24khz) | DNS Challenge | 2M                  | 7.9M       |\n\n## Training\n\nPrepare a filelist of audio files for the training and validation sets:\n\n```bash\nfind \"$TRAIN_DATASET_DIR\" -name \"*.wav\" \u003e filelist.train\nfind \"$VAL_DATASET_DIR\" -name \"*.wav\" \u003e filelist.val\n```\n\nFill in a config file, e.g. [vocos.yaml](configs%2Fvocos.yaml), with your filelist paths and start training with:\n\n```bash\npython train.py -c configs/vocos.yaml\n```\n\nRefer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details about customizing the\ntraining pipeline.\n\n## Citation\n\nIf this code contributes to your research, please cite our work:\n\n```\n@article{siuzdak2023vocos,\n  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},\n  author={Siuzdak, Hubert},\n  journal={arXiv preprint arXiv:2306.00814},\n  year={2023}\n}\n```\n\n## License\n\nThe code in this repository is released under the MIT license as found in the\n[LICENSE](LICENSE) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgemelo-ai%2Fvocos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgemelo-ai%2Fvocos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgemelo-ai%2Fvocos/lists"}