{"id":13716161,"url":"https://github.com/maum-ai/voicefilter","last_synced_at":"2025-05-16T11:06:29.519Z","repository":{"id":38421635,"uuid":"177093758","full_name":"maum-ai/voicefilter","owner":"maum-ai","description":"Unofficial PyTorch implementation of Google AI's VoiceFilter system","archived":false,"fork":false,"pushed_at":"2024-02-05T07:57:26.000Z","size":1196,"stargazers_count":1035,"open_issues_count":12,"forks_count":227,"subscribers_count":36,"default_branch":"master","last_synced_at":"2024-05-23T06:46:38.895Z","etag":null,"topics":["audio-separation","pytorch","source-separation","speech-separation","voicefilter"],"latest_commit_sha":null,"homepage":"http://swpark.me/voicefilter","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maum-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-22T07:37:59.000Z","updated_at":"2024-08-03T00:38:31.695Z","dependencies_parsed_at":"2024-08-03T01:01:57.059Z","dependency_job_id":null,"html_url":"https://github.com/maum-ai/voicefilter","commit_stats":null,"previous_names":["maum-ai/voicefilter","mindslab-ai/voicefilter"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fvoicefilter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fvoicefilter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fvoicefilter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maum-ai%2Fvoicef
ilter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maum-ai","download_url":"https://codeload.github.com/maum-ai/voicefilter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254518383,"owners_count":22084374,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-separation","pytorch","source-separation","speech-separation","voicefilter"],"created_at":"2024-08-03T00:01:07.605Z","updated_at":"2025-05-16T11:06:24.507Z","avatar_url":"https://github.com/maum-ai.png","language":"Python","funding_links":[],"categories":["Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Python"],"sub_categories":["NLP \u0026 Speech Processing｜自然语言处理 \u0026 语音处理:"],"readme":"# VoiceFilter\n\n## Note from Seung-won (2020.10.25)\n\nHi everyone! It's Seung-won from MINDs Lab, Inc.\nIt's been a long time since I've released this open-source,\nand I didn't expect this repository to grab such a great amount of attention for a long time.\nI would like to thank everyone for giving such attention, and also Mr. 
Quan Wang (the first author of the VoiceFilter paper) for referring this project in his paper.\n\nActually, this project was done by me when it was only 3 months after I started studying deep learning \u0026 speech separation without a supervisor in the relevant field.\nBack then, I didn't know what is a power-law compression, and the correct way to validate/test the models.\nNow that I've spent more time on deep learning \u0026 speech since then (I also wrote a paper published at [Interspeech 2020](https://arxiv.org/abs/2005.03295) 😊),\nI can observe some obvious mistakes that I've made.\nThose issues were kindly raised by GitHub users; please refer to the\n[Issues](https://github.com/mindslab-ai/voicefilter/issues?q=is%3Aissue+) and [Pull Requests](https://github.com/mindslab-ai/voicefilter/pulls) for that.\nThat being said, this repository can be quite unreliable,\nand I would like to remind everyone to use this code at their own risk (as specified in LICENSE).\n\nUnfortunately, I can't afford extra time on revising this project or reviewing the Issues / Pull Requests.\nInstead, I would like to offer some pointers to newer, more reliable resources:\n\n- [VoiceFilter-Lite](https://arxiv.org/abs/2009.04323):\nThis is a newer version of VoiceFilter presented at Interspeech 2020, which is also written by Mr. 
Quan Wang (and his colleagues at Google).\nI highly recommend checking this paper, since it focused on a more realistic situation where VoiceFilter is needed.\n- [List of VoiceFilter implementation available on GitHub](https://paperswithcode.com/paper/voicefilter-targeted-voice-separation-by):\nIn March 2019, this repository was the only available open-source implementation of VoiceFilter.\nHowever, much better implementations that deserve more attention became available across GitHub.\nPlease check them, and choose the one that meets your demand.\n- [PyTorch Lightning](https://www.pytorchlightning.ai/):\nBack in 2019, I could not find a great deep-learning project template for myself,\nso I and my colleagues had used this project as a template for other new projects.\nFor people who are searching for such project template, I would like to strongly recommend PyTorch Lightning.\nEven though I had done a lot of effort into developing my own template during 2019\n([VoiceFilter](https://github.com/mindslab-ai/voicefilter) -\u003e [RandWireNN](https://github.com/seungwonpark/RandWireNN)\n-\u003e [MelNet](https://github.com/Deepest-Project/MelNet) -\u003e [MelGAN](https://github.com/seungwonpark/melgan)),\nI found PyTorch Lightning much better than my own template.\n\nThanks for reading, and I wish everyone good health during the global pandemic situation.\n\nBest regards, Seung-won Park\n\n---\n\nUnofficial PyTorch implementation of Google AI's:\n[VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking](https://arxiv.org/abs/1810.04826).\n\n![](./assets/voicefilter.png)\n\n## Result\n\n- Training took about 20 hours on AWS p3.2xlarge(NVIDIA V100).\n\n### Audio Sample\n\n- Listen to audio sample at webpage: http://swpark.me/voicefilter/\n\n\n### Metric\n\n| Median SDR             | Paper | Ours |\n| ---------------------- | ----- | ---- |\n| before VoiceFilter     |  2.5  |  1.9 |\n| after VoiceFilter      | 12.6  | 10.2 
|\n\n![](./assets/sdr-result.png)\n\n- SDR converged at around 10, which is lower than the paper's 12.6.\n\n\n## Dependencies\n\n1. Python and packages\n\n    This code was tested on Python 3.6 with PyTorch 1.0.1.\n    Other packages can be installed by:\n\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n1. Miscellaneous\n\n    [ffmpeg-normalize](https://github.com/slhck/ffmpeg-normalize) is used for resampling and normalizing wav files.\n    See the [ffmpeg-normalize README](https://github.com/slhck/ffmpeg-normalize/blob/master/README.md) for installation instructions.\n\n## Prepare Dataset\n\n1. Download LibriSpeech dataset\n\n    To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/.\n    `train-clean-100.tar.gz`(6.3G) contains speech of 252 speakers, and `train-clean-360.tar.gz`(23G) contains 922 speakers.\n    You may use either, but the more speakers you have in the dataset, the better VoiceFilter will perform.\n\n1. Resample \u0026 Normalize wav files\n\n    First, extract the `tar.gz` file to the desired folder:\n    ```bash\n    tar -xvzf train-clean-360.tar.gz\n    ```\n\n    Next, copy `utils/normalize-resample.sh` to the root directory of the extracted data folder. Then:\n    ```bash\n    vim normalize-resample.sh # set \"N\" to your number of CPU cores.\n    chmod a+x normalize-resample.sh\n    ./normalize-resample.sh # this may take a long time\n    ```\n\n1. Edit `config.yaml`\n\n    ```bash\n    cd config\n    cp default.yaml config.yaml\n    vim config.yaml\n    ```\n\n1. Preprocess wav files\n\n    To speed up training, perform STFT on each file before training:\n    ```bash\n    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]\n    ```\n    This will create 100,000 (train) + 1,000 (test) samples (about 160G).\n\n\n## Train VoiceFilter\n\n1. 
Get the pretrained model for the speaker recognition system\n\n    VoiceFilter utilizes a speaker recognition system ([d-vector embeddings](https://google.github.io/speaker-id/publications/GE2E/)).\n    Here, we provide a pretrained model for obtaining d-vector embeddings.\n\n    This model was trained with the [VoxCeleb2](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) dataset,\n    where utterances are randomly fit to time length [70, 90] frames.\n    Tests were run with window 80 / hop 40 and showed an equal error rate (EER) of about 1%.\n    Test data were drawn from the first 8 speakers of the [VoxCeleb1](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) test dataset, with 10 utterances randomly selected per speaker.\n    \n    **Update**: Evaluation on VoxCeleb1 selected pairs showed 7.4% EER.\n    \n    The model can be downloaded at [this GDrive link](https://drive.google.com/file/d/1YFmhmUok-W76JkrfA0fzQt3c-ZsfiwfL/view?usp=sharing).\n\n1. Run\n\n    After specifying `train_dir` and `test_dir` in `config.yaml`, run:\n    ```bash\n    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]\n    ```\n    This will create `chkpt/name` and `logs/name` in the base directory (`-b` option, `.` by default).\n\n1. View tensorboardX\n\n    ```bash\n    tensorboard --logdir ./logs\n    ```\n    \n    ![](./assets/tensorboard.png)\n\n1. Resuming from checkpoint\n\n    ```bash\n    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m [name]\n    ```\n\n## Evaluate\n\n```bash\npython inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]\n```\n\n## Possible improvements\n\n- Try power-law compressed reconstruction error as the loss function, instead of MSE. 
(See [#14](https://github.com/mindslab-ai/voicefilter/issues/14))\n\n## Author\n\n[Seungwon Park](http://swpark.me) at MINDsLab (yyyyy@snu.ac.kr, swpark@mindslab.ai)\n\n## License\n\nApache License 2.0\n\nThis repository contains code adapted or copied from the following sources:\n- [utils/adabound.py](./utils/adabound.py) from https://github.com/Luolc/AdaBound (Apache License 2.0)\n- [utils/audio.py](./utils/audio.py) from https://github.com/keithito/tacotron (MIT License)\n- [utils/hparams.py](./utils/hparams.py) from https://github.com/HarryVolek/PyTorch_Speaker_Verification (No License specified)\n- [utils/normalize-resample.sh](./utils/normalize-resample.sh) from https://unix.stackexchange.com/a/216475\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaum-ai%2Fvoicefilter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaum-ai%2Fvoicefilter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaum-ai%2Fvoicefilter/lists"}