{"id":15136543,"url":"https://github.com/playvoice/lora-svc","last_synced_at":"2025-04-04T10:09:42.836Z","repository":{"id":58833167,"uuid":"534018600","full_name":"PlayVoice/lora-svc","owner":"PlayVoice","description":"singing voice change based on whisper, and lora for singing voice clone","archived":false,"fork":false,"pushed_at":"2023-11-03T04:08:23.000Z","size":18425,"stargazers_count":634,"open_issues_count":31,"forks_count":77,"subscribers_count":24,"default_branch":"main","last_synced_at":"2025-04-04T10:09:37.619Z","etag":null,"topics":["lora","singing-voice-conversion","speech-to-sing","uni-svc","vits","vits-svc","voice-change","voice-cloning","voice-conversion","whisper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PlayVoice.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-09-08T02:22:42.000Z","updated_at":"2025-03-28T09:53:55.000Z","dependencies_parsed_at":"2023-02-19T17:00:48.751Z","dependency_job_id":"2eb973c2-1d91-4174-b93e-47b246cc9379","html_url":"https://github.com/PlayVoice/lora-svc","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayVoice%2Flora-svc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayVoice%2Flora-svc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayVoice%2Flora-svc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayVoice%2Flora-svc/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PlayVoice","download_url":"https://codeload.github.com/PlayVoice/lora-svc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247157283,"owners_count":20893220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lora","singing-voice-conversion","speech-to-sing","uni-svc","vits","vits-svc","voice-change","voice-cloning","voice-conversion","whisper"],"created_at":"2024-09-26T06:22:41.930Z","updated_at":"2025-04-04T10:09:42.805Z","avatar_url":"https://github.com/PlayVoice.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e Singing Voice Conversion based on Whisper \u0026 neural source-filter BigVGAN \u003c/h1\u003e\n\n\u003cimg alt=\"GitHub Repo stars\" src=\"https://img.shields.io/github/stars/PlayVoice/lora-svc\"\u003e\n\u003cimg alt=\"GitHub forks\" src=\"https://img.shields.io/github/forks/PlayVoice/lora-svc\"\u003e\n\u003cimg alt=\"GitHub issues\" src=\"https://img.shields.io/github/issues/PlayVoice/lora-svc\"\u003e\n\u003cimg alt=\"GitHub\" src=\"https://img.shields.io/github/license/PlayVoice/lora-svc\"\u003e\n\u003c/div\u003e\n\n```\nBlack technology based on the three giants of artificial intelligence:\n\nOpenAI's whisper, 680,000 hours in multiple languages\n\nNvidia's bigvgan, anti-aliasing for speech generation\n\nMicrosoft's adapter, high-efficiency for fine-tuning\n```\n\n**LoRA is not fully implemented in this project**, but it can be found here: [LoRA 
TTS](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/adapters.py) \u0026 [paper](https://arxiv.org/abs/2211.00585)\n\nuse pretrain model to fine tune\n\nhttps://user-images.githubusercontent.com/16432329/231021007-6e34cbb4-e256-491d-8ab6-5ce4e822da21.mp4\n\n\n## Dataset preparation\n\nNecessary pre-processing:\n- 1 accompaniment separation, [UVR](https://github.com/Anjok07/ultimatevocalremovergui)\n- 2 cut audio, less than 30 seconds for whisper, [slicer](https://github.com/flutydeer/audio-slicer)\n\nthen put the dataset into the data_raw directory according to the following file structure\n```shell\ndata_raw\n├───speaker0\n│   ├───000001.wav\n│   ├───...\n│   └───000xxx.wav\n└───speaker1\n    ├───000001.wav\n    ├───...\n    └───000xxx.wav\n```\n\n## Install dependencies\n\n- 1 software dependency\n\n  \u003e pip install -r requirements.txt\n\n- 2 download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3), put `best_model.pth.tar`  into `speaker_pretrain/`\n\n- 3 download whisper model [multiple language medium model](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt), Make sure to download `medium.pt`，put it into `whisper_pretrain/`\n\n    **Tip: whisper is built-in, do not install it additionally, it will conflict and report an error**\n\n- 4 download pretrain model [maxgan_pretrain_32K.pth](https://github.com/PlayVoice/lora-svc/releases/download/v_final/maxgan_pretrain_32K.pth), and do test\n\n    \u003e python svc_inference.py --config configs/maxgan.yaml --model maxgan_pretrain_32K.pth --spk ./configs/singers/singer0001.npy --wave test.wav\n\n## Data preprocessing\nuse this command if you want to automate this:\n\n\u003e python3 prepare/easyprocess.py\n\nor step by step, as follows:\n\n- 1， re-sampling\n\n    generate audio with a sampling rate of 16000Hz\n\n    \u003e python 
prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-16k -s 16000\n\n    generate audio with a sampling rate of 32000Hz\n\n    \u003e python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-32k -s 32000\n\n- 2， use 16K audio to extract pitch\n\n    \u003e python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch\n\n- 3， use 16K audio to extract ppg\n\n    \u003e python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper\n\n- 4， use 16k audio to extract timbre code\n\n    \u003e python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker\n\n- 5， extract the singer code for inference\n\n    \u003e python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer\n\n- 6， use 32k audio to generate training index\n\n    \u003e python prepare/preprocess_train.py\n\n- 7， training file debugging\n\n    \u003e python prepare/preprocess_zzz.py -c configs/maxgan.yaml\n\n```shell\ndata_svc/\n└── waves-16k\n│    └── speaker0\n│    │      ├── 000001.wav\n│    │      └── 000xxx.wav\n│    └── speaker1\n│           ├── 000001.wav\n│           └── 000xxx.wav\n└── waves-32k\n│    └── speaker0\n│    │      ├── 000001.wav\n│    │      └── 000xxx.wav\n│    └── speaker1\n│           ├── 000001.wav\n│           └── 000xxx.wav\n└── pitch\n│    └── speaker0\n│    │      ├── 000001.pit.npy\n│    │      └── 000xxx.pit.npy\n│    └── speaker1\n│           ├── 000001.pit.npy\n│           └── 000xxx.pit.npy\n└── whisper\n│    └── speaker0\n│    │      ├── 000001.ppg.npy\n│    │      └── 000xxx.ppg.npy\n│    └── speaker1\n│           ├── 000001.ppg.npy\n│           └── 000xxx.ppg.npy\n└── speaker\n│    └── speaker0\n│    │      ├── 000001.spk.npy\n│    │      └── 000xxx.spk.npy\n│    └── speaker1\n│           ├── 000001.spk.npy\n│           └── 000xxx.spk.npy\n└── singer\n    ├── speaker0.spk.npy\n    └── speaker1.spk.npy\n```\n\n## Train\n- 0， if fine-tuning based on the pre-trained model, you need to download the pre-trained 
model: [maxgan_pretrain_32K.pth](https://github.com/PlayVoice/lora-svc/releases/download/v_final/maxgan_pretrain_32K.pth)\n\n    \u003e set pretrain: \"./maxgan_pretrain_32K.pth\" in configs/maxgan.yaml, and adjust the learning rate appropriately, e.g. 1e-5\n\n- 1， start training\n\n    \u003e python svc_trainer.py -c configs/maxgan.yaml -n svc\n\n- 2， resume training\n\n    \u003e python svc_trainer.py -c configs/maxgan.yaml -n svc -p chkpt/svc/***.pth\n\n- 3， view log\n\n    \u003e tensorboard --logdir logs/\n\n![final_model_loss](https://github.com/PlayVoice/lora-svc/assets/16432329/60b6f141-e20e-4a13-ac98-669efbf10472)\n\n## Inference\n\nuse this command if you want a GUI that does all the commands below:\n\n\u003e python3 svc_gui.py\n\nor step by step, as follows:\n\n- 1， export inference model\n\n    \u003e python svc_export.py --config configs/maxgan.yaml --checkpoint_path chkpt/svc/***.pt\n\n- 2， use whisper to extract the content encoding as a separate step, instead of one-click inference, to reduce GPU memory usage\n\n    \u003e python whisper/inference.py -w test.wav -p test.ppg.npy\n\n- 3， extract the F0 parameter to a csv text file\n\n    \u003e python pitch/inference.py -w test.wav -p test.csv\n\n- 4， specify parameters and infer\n\n    \u003e python svc_inference.py --config configs/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/singers/your_singer.npy --wave test.wav --ppg test.ppg.npy --pit test.csv\n\n    when --ppg is specified, repeated extraction of the audio content code is avoided when the same audio is inferred multiple times; if it is not specified, it is extracted automatically;\n\n    when --pit is specified, the manually tuned F0 parameter is loaded; if it is not specified, it is extracted automatically;\n\n    the output file is generated in the current directory: svc_out.wav\n\n    | args | --config | --model | --spk | --wave | --ppg | --pit | --shift |\n    | :---:  | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n    | name | config path 
| model path | speaker | wave input | wave ppg | wave pitch | pitch shift |\n\n- 5, post-processing with VAD\n\n    \u003e python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_post.wav\n\n## Source of code and References\n[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585)\n\n[AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/pdf/2103.00993.pdf)\n\nhttps://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf\n\nhttps://github.com/mindslab-ai/univnet [[paper]](https://arxiv.org/abs/2106.07889)\n\nhttps://github.com/openai/whisper/ [[paper]](https://arxiv.org/abs/2212.04356)\n\nhttps://github.com/NVIDIA/BigVGAN [[paper]](https://arxiv.org/abs/2206.04658)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplayvoice%2Flora-svc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplayvoice%2Flora-svc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplayvoice%2Flora-svc/lists"}