{"id":13545340,"url":"https://github.com/bshall/hubert","last_synced_at":"2025-04-09T06:11:02.791Z","repository":{"id":42518372,"uuid":"417578841","full_name":"bshall/hubert","owner":"bshall","description":"HuBERT content encoders for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion","archived":false,"fork":false,"pushed_at":"2024-10-01T10:08:22.000Z","size":468,"stargazers_count":350,"open_issues_count":11,"forks_count":55,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-01T19:19:34.184Z","etag":null,"topics":["pytorch","representation-learning","speech","voice-conversion"],"latest_commit_sha":null,"homepage":"https://bshall.github.io/soft-vc/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bshall.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-15T17:13:11.000Z","updated_at":"2025-03-25T10:05:56.000Z","dependencies_parsed_at":"2024-11-13T09:31:28.258Z","dependency_job_id":null,"html_url":"https://github.com/bshall/hubert","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fhubert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fhubert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fhubert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fhubert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/bshall","download_url":"https://codeload.github.com/bshall/hubert/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247987285,"owners_count":21028895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pytorch","representation-learning","speech","voice-conversion"],"created_at":"2024-08-01T11:01:01.166Z","updated_at":"2025-04-09T06:11:02.754Z","avatar_url":"https://github.com/bshall.png","language":"Python","funding_links":[],"categories":["Python","Modified"],"sub_categories":["SoftVC"],"readme":"# HuBERT\n\n[![arXiv](https://img.shields.io/badge/arXiv-Paper-\u003cCOLOR\u003e.svg)](https://arxiv.org/abs/2111.02392)\n[![demo](https://img.shields.io/static/v1?message=Audio%20Samples\u0026logo=Github\u0026labelColor=grey\u0026color=blue\u0026logoColor=white\u0026label=%20\u0026style=flat)](https://bshall.github.io/soft-vc/)\n[![colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb)\n\nTraining and inference scripts for the HuBERT content encoders in [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484).\nFor more details see [soft-vc](https://github.com/bshall/soft-vc). Audio samples can be found [here](https://bshall.github.io/soft-vc/). 
A Colab demo can be found [here](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb).\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg width=\"100%\" alt=\"Soft-VC\"\n      src=\"https://raw.githubusercontent.com/bshall/hubert/main/content-encoder.png\"\u003e\n\u003c/div\u003e\n\u003cdiv\u003e\n  \u003csup\u003e\n    \u003cstrong\u003eFig 1:\u003c/strong\u003e Architecture of the voice conversion system. a) The \u003cstrong\u003ediscrete\u003c/strong\u003e content encoder clusters audio features to produce a sequence of discrete speech units. b) The \u003cstrong\u003esoft\u003c/strong\u003e content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.\n  \u003c/sup\u003e\n\u003c/div\u003e\n\n## Example Usage\n\n### Programmatic Usage\n\n```python\nimport torch, torchaudio\n\n# Load checkpoint (either hubert_soft or hubert_discrete)\nhubert = torch.hub.load(\"bshall/hubert:main\", \"hubert_soft\", trust_repo=True).cuda()\n\n# Load audio\nwav, sr = torchaudio.load(\"path/to/wav\")\nassert sr == 16000\nwav = wav.unsqueeze(0).cuda()\n\n# Extract speech units\nunits = hubert.units(wav)\n```\n\n### Script-Based Usage\n\n```\nusage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir\n\nEncode an audio dataset.\n\npositional arguments:\n  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)\n  in-dir                path to the dataset directory.\n  out-dir               path to the output directory.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --extension EXTENSION\n                        extension of the audio files (defaults to .flac).\n```\n\n## Training\n\n### Step 1: Dataset Preparation\n\nDownload and extract the [LibriSpeech](https://www.openslr.org/12) corpus. 
The training script expects the following tree structure for the dataset directory:\n\n```\n│   lengths.json\n│\n└───wavs\n    ├───dev-*\n    │   ├───84\n    │   ├───...\n    │   └───8842\n    └───train-*\n        ├───19\n        ├───...\n        └───8975\n```\n\nThe `train-*` and `dev-*` directories should contain the training and validation splits, respectively. Note that there can be multiple `train` and `dev` folders, e.g. `train-clean-100`, `train-other-500`, etc. Finally, the `lengths.json` file should contain key-value pairs mapping each file path to its number of samples:\n\n```json\n{\n    \"dev-clean/1272/128104/1272-128104-0000\": 93680,\n    \"dev-clean/1272/128104/1272-128104-0001\": 77040\n}\n```\n\n### Step 2: Extract Discrete Speech Units\n\nEncode LibriSpeech using the HuBERT-Discrete model and `encode.py` script:\n\n```\nusage: encode.py [-h] [--extension EXTENSION] {soft,discrete} in-dir out-dir\n\nEncode an audio dataset.\n\npositional arguments:\n  {soft,discrete}       available models (HuBERT-Soft or HuBERT-Discrete)\n  in-dir                path to the dataset directory.\n  out-dir               path to the output directory.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --extension EXTENSION\n                        extension of the audio files (defaults to .flac).\n```\n\nfor example:\n\n```\npython encode.py discrete path/to/LibriSpeech/wavs path/to/LibriSpeech/discrete\n```\n\nAt this point the directory tree should look like:\n\n```\n│   lengths.json\n│\n├───discrete\n│   ├───...\n└───wavs\n    ├───...\n```\n\n### Step 3: Train the HuBERT-Soft Content Encoder\n\n```\nusage: train.py [-h] [--resume RESUME] [--warmstart] [--mask] [--alpha ALPHA] dataset-dir checkpoint-dir\n\nTrain HuBERT soft content encoder.\n\npositional arguments:\n  dataset-dir      path to the data directory.\n  checkpoint-dir   path to the checkpoint directory.\n\noptional arguments:\n  -h, --help       show this help message and exit\n  
--resume RESUME  path to the checkpoint to resume from.\n  --warmstart      whether to initialize from the fairseq HuBERT checkpoint.\n  --mask           whether to use input masking.\n  --alpha ALPHA    weight for the masked loss.\n```\n\n## Links\n\n- [Soft-VC repo](https://github.com/bshall/soft-vc)\n- [Soft-VC paper](https://ieeexplore.ieee.org/abstract/document/9746484)\n- [Official HuBERT repo](https://github.com/pytorch/fairseq)\n- [HuBERT paper](https://arxiv.org/abs/2106.07447)\n\n## Citation\n\nIf you found this work helpful please consider citing our paper:\n\n```\n@inproceedings{\n    soft-vc-2022,\n    author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},\n    booktitle={ICASSP}, \n    title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion}, \n    year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbshall%2Fhubert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbshall%2Fhubert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbshall%2Fhubert/lists"}