{"id":19450773,"url":"https://github.com/plachtaa/facodec","last_synced_at":"2025-04-03T03:13:27.561Z","repository":{"id":236988982,"uuid":"793573955","full_name":"Plachtaa/FAcodec","owner":"Plachtaa","description":"Training code for FAcodec presented in NaturalSpeech3","archived":false,"fork":false,"pushed_at":"2024-08-26T14:02:02.000Z","size":20466,"stargazers_count":198,"open_issues_count":18,"forks_count":22,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-02T21:44:42.613Z","etag":null,"topics":["audio-codec","voice-conversion","zero-shot-voice-conversion"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Plachtaa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-29T13:29:24.000Z","updated_at":"2025-04-02T11:21:09.000Z","dependencies_parsed_at":"2024-12-22T12:12:30.200Z","dependency_job_id":"331a9f6e-44da-4b5d-96ff-9ea87e3cb282","html_url":"https://github.com/Plachtaa/FAcodec","commit_stats":null,"previous_names":["plachtaa/facodec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Plachtaa%2FFAcodec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Plachtaa%2FFAcodec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Plachtaa%2FFAcodec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Plachtaa%2FFAcodec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/Plachtaa","download_url":"https://codeload.github.com/Plachtaa/FAcodec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927839,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-codec","voice-conversion","zero-shot-voice-conversion"],"created_at":"2024-11-10T16:38:51.325Z","updated_at":"2025-04-03T03:13:27.440Z","avatar_url":"https://github.com/Plachtaa.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FAcodec\n\nThis project is supported by [Amphion](https://github.com/open-mmlab/Amphion).\n\nPyTorch implementation for training FAcodec, which was proposed in the paper [NaturalSpeech 3: Zero-Shot Speech Synthesis\nwith Factorized Codec and Diffusion Models](https://arxiv.org/pdf/2403.03100).  \n\nThis implementation makes some key improvements to the training pipeline that eliminate the need for annotations of any kind, including \ntranscripts, phoneme alignments, and speaker labels. All you need is raw speech files.  \nWith the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.  
\nWe release the code for training and inference, along with a checkpoint pretrained on 50k hours of speech data from over 1 million speakers.\n## Requirements\n- Python 3.10\n\n## Installation\n```bash\ngit clone https://github.com/Plachtaa/FAcodec.git\ncd FAcodec\npip install -r requirements.txt\n```\nIf you want to train the model yourself, install the following packages:\n```bash\npip install \"nemo_toolkit[all]\"\npip install descript-audio-codec\n```\n\n## Model storage\nWe provide checkpoints pretrained on 50k hours of speech data.  \n\n| Model type        | Link                                                                                                                                   |\n|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|\n| FAcodec           | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAcodec-blue)](https://huggingface.co/Plachta/FAcodec)               |\n| FAcodec redecoder | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAredecoder-blue)](https://huggingface.co/Plachta/FAcodec-redecoder) |\n\n## Demo\nTry our model on [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/FAcodecV2)!\n\n## Training\n```bash\naccelerate launch train.py --config ./configs/config.yaml\n```\nBefore you run the command above, replace the `PseudoDataset` class in `meldataset.py` with your own dataset.\nSimply load your own wave files in the same format.  
\nTo train the redecoder (the voice conversion model), run:\n```bash\naccelerate launch train_redecoder.py --config ./configs/config_redecoder.yaml\n```\nRemember to fill in the checkpoint path of a pretrained FAcodec model in the config file.\n\n## Usage\n\n### Encode \u0026 reconstruct\n```bash\npython reconstruct.py --source \u003csource_wav\u003e --ckpt-path \u003cckpt_path\u003e --config-path \u003cconfig_path\u003e\n```\nIf `--ckpt-path` or `--config-path` is not specified, the model weights are downloaded automatically from Hugging Face.  \nUsers in mainland China can set an additional environment variable to specify the Hugging Face endpoint:\n```bash\nHF_ENDPOINT=https://hf-mirror.com python reconstruct_redecoder.py --source \u003csource_wav\u003e --target \u003ctarget_wav\u003e\n```\n\n### Extracting representations\n```python\nimport yaml\nfrom modules.commons import build_model, recursive_munch\nfrom hf_utils import load_custom_model_from_hf\nimport torch\nimport torchaudio\nimport librosa\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nckpt_path, config_path = load_custom_model_from_hf(\"Plachta/FAcodec\")\nmodel = build_model(yaml.safe_load(open(config_path))['model_params'])\nckpt_params = torch.load(ckpt_path, map_location=\"cpu\")\n\nfor key in ckpt_params:\n    model[key].load_state_dict(ckpt_params[key])\n\n_ = [model[key].eval() for key in model]\n_ = [model[key].to(device) for key in model]\n\nwith torch.no_grad():\n    source = \"path/to/source.wav\"\n    source_audio = librosa.load(source, sr=24000)[0]\n    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)\n    z = model.encoder(source_audio[None, ...].to(device).float())\n    z, quantized, _, _, timbre, codes = model.quantizer(z, source_audio[None, ...].to(device).float(), return_codes=True)\n```\nwhere:  \n`timbre` is the timbre representation, a single vector per utterance.  
\n`codes[0]` is the prosody representation.  \n`codes[1]` is the content representation.\n\n### Zero-shot voice conversion\n```bash\npython reconstruct_redecoder.py \\\n    --source \u003csource_wav\u003e \\\n    --target \u003ctarget_wav\u003e \\\n    --codec-ckpt-path \u003ccodec_ckpt_path\u003e \\\n    --redecoder-ckpt-path \u003credecoder_ckpt_path\u003e \\\n    --codec-config-path \u003ccodec_config_path\u003e \\\n    --redecoder-config-path \u003credecoder_config_path\u003e\n```\nAs above, if no checkpoint path or config path is specified, the model weights are downloaded automatically from Hugging Face.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplachtaa%2Ffacodec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplachtaa%2Ffacodec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplachtaa%2Ffacodec/lists"}