{"id":19882966,"url":"https://github.com/modelscope/funcodec","last_synced_at":"2025-04-05T23:09:49.156Z","repository":{"id":198810032,"uuid":"701589325","full_name":"modelscope/FunCodec","owner":"modelscope","description":"FunCodec is a research-oriented toolkit for audio quantization and downstream applications, such as text-to-speech synthesis, music generation et.al. ","archived":false,"fork":false,"pushed_at":"2024-01-25T11:56:17.000Z","size":1529,"stargazers_count":392,"open_issues_count":21,"forks_count":33,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-03-29T22:06:24.145Z","etag":null,"topics":["audio-generation","audio-quantization","codec","encodec","speech-synthesis","speech-to-text","tts","voicecloning"],"latest_commit_sha":null,"homepage":"https://funcodec.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/modelscope.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-07T02:00:40.000Z","updated_at":"2025-03-22T14:05:46.000Z","dependencies_parsed_at":"2024-01-05T03:40:49.870Z","dependency_job_id":"3a68ee81-3100-4456-9dee-35c844962816","html_url":"https://github.com/modelscope/FunCodec","commit_stats":null,"previous_names":["alibaba-damo-academy/funcodec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FFunCodec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FFunCodec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FFunCodec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FFunCodec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/modelscope","download_url":"https://codeload.github.com/modelscope/FunCodec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247411235,"owners_count":20934653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-generation","audio-quantization","codec","encodec","speech-synthesis","speech-to-text","tts","voicecloning"],"created_at":"2024-11-12T17:19:03.002Z","updated_at":"2025-04-05T23:09:49.128Z","avatar_url":"https://github.com/modelscope.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec\n\nThis project is still working on progress. 
To help us make FunCodec better, please share your concerns and suggestions in the `Issues` section.

## News
- 2023.12.22 🎉🎉: We released the training and inference recipes for LauraTTS as well as pre-trained models. [LauraTTS](https://arxiv.org/abs/2310.04673) is a powerful codec-based zero-shot text-to-speech synthesizer that outperforms VALL-E in terms of semantic consistency and speaker similarity. Please refer to `egs/LibriTTS/text2speech_laura/README.md` for more details.

## Installation

```shell
git clone https://github.com/modelscope/FunCodec.git && cd FunCodec
pip install --editable ./
```

## Available models
🤗 links to the Hugging Face model hub, while ⭐ links to ModelScope.

| Model name                                                          | Model hub | Corpus | Bitrate (bps) | Parameters | FLOPs |
|:--------------------------------------------------------------------|:---:|:--------:|:---------:|:----------:|:------:|
| audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch             | [🤗](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch/summary) | General  | 250~8000  | 57.83 M | 7.73 G |
| audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch             | [🤗](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch/summary) | General  | 500~16000 | 14.85 M | 3.72 G |
| audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch               | [🤗](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch/summary) | LibriTTS | 250~8000  | 57.83 M | 7.73 G |
| audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch               | [🤗](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch/summary) | LibriTTS | 500~16000 | 14.85 M | 3.72 G |
| audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch | [🤗](https://huggingface.co/alibaba-damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch/summary) | LibriTTS | 500~16000 | 4.50 M | 2.18 G |
| audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch | [🤗](https://huggingface.co/alibaba-damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch) [⭐](https://www.modelscope.cn/models/damo/audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch/summary) | LibriTTS | 500~16000 | 0.52 M | 0.34 G |
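The model names encode the number of quantizers (`nq32`) and the encoder downsampling ratio (`ds640`/`ds320`). As a rough sanity check on the bitrate column, here is a minimal sketch in Python, assuming 1024-entry codebooks (10 bits per quantizer); that codebook size is an assumption consistent with the numbers above, not something stated in this README:

```python
import math

def bitrate_range(sample_rate: int, ds: int, nq: int, codebook_size: int = 1024):
    """Token bitrate in bits/second when decoding with 1..nq quantizers."""
    frame_rate = sample_rate / ds                 # code frames per second
    bits_per_quantizer = math.log2(codebook_size) # 10 bits for a 1024-entry codebook
    return frame_rate * bits_per_quantizer, frame_rate * bits_per_quantizer * nq

# nq32ds640 at 16 kHz -> (250.0, 8000.0), matching the 250~8000 rows
print(bitrate_range(16000, 640, 32))
# nq32ds320 at 16 kHz -> (500.0, 16000.0), matching the 500~16000 rows
print(bitrate_range(16000, 320, 32))
```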
## Model Download
### Download models from ModelScope
Please refer to `egs/LibriTTS/codec/encoding_decoding.sh` to download pre-trained models:
```shell
cd egs/LibriTTS/codec
model_name=audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
bash encoding_decoding.sh --stage 0 --model_name ${model_name} --model_hub modelscope
# The pre-trained model will be downloaded to exp/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
```

### Download models from Hugging Face
Please refer to `egs/LibriTTS/codec/encoding_decoding.sh` to download pre-trained models:
```shell
cd egs/LibriTTS/codec
model_name=audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
bash encoding_decoding.sh --stage 0 --model_name ${model_name} --model_hub huggingface
# The pre-trained model will be downloaded to exp/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
```

## Inference
### Batch inference
Please refer to `egs/LibriTTS/codec/encoding_decoding.sh` to perform encoding and decoding.
To extract codes from the files listed in `input_wav.scp`, run the following command; the codes will be saved to `output_dir/codecs.txt` in JSONL format.
```shell
cd egs/LibriTTS/codec
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 \
  --wav_scp input_wav.scp  --out_dir outputs/codecs/
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```

To decode the codes in `codecs.txt`, run the following command; the reconstructed waveforms will be saved to `output_dir/logdir/output.*/*.wav`.
```shell
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp codecs.txt --out_dir outputs/recon_wavs 
# codecs.txt is the output of the encoding stage above, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
```

<!---
### Demo inference
--->

## Training
### Training on open-source corpora
For commonly used open-source corpora, you can train a model using the recipes in the `egs` directory.
For example, to train a model on the `LibriTTS` corpus, use `egs/LibriTTS/codec/run.sh`:
```shell
# enter the LibriTTS recipe directory
cd egs/LibriTTS/codec
# run the data downloading, preparation and training stages with 2 GPUs (devices 0 and 1)
bash run.sh --stage 0 --stop_stage 3 --gpu_devices 0,1 --gpu_num 2
```
We recommend running the script stage by stage to get an overview of FunCodec.

### Training on customized data
For corpora not covered by the recipes, or for a customized dataset, you can prepare the data yourself.
In general, FunCodec uses Kaldi-style `wav.scp` files to organize the data. A `wav.scp` file has the following format:
```shell
# for waveform files
uttid1 /path/to/uttid1.wav
uttid2 /path/to/uttid2.wav
# for kaldi-ark files (path:byte-offset)
uttid3 /path/to/ark1.wav:10
uttid4 /path/to/ark1.wav:200
uttid5 /path/to/ark2.wav:10
```
As shown in the above example, FunCodec supports mixing waveform files and kaldi-ark files in one `wav.scp` file for both training and inference.
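If you generate or consume `wav.scp` files programmatically, both kinds of entries can be parsed uniformly. The following is a minimal sketch of a hypothetical helper (not part of FunCodec), assuming the trailing `:number` on a kaldi-ark entry is a byte offset into the ark file:

```python
def parse_wav_scp(scp_path: str) -> dict[str, tuple[str, int | None]]:
    """Map each utterance ID to (file path, optional ark byte offset)."""
    entries: dict[str, tuple[str, int | None]] = {}
    with open(scp_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            uttid, location = line.split(maxsplit=1)
            # kaldi-ark entries look like /path/to/ark1.wav:10
            basename = location.rsplit("/", 1)[-1]
            if ":" in basename:
                path, offset = location.rsplit(":", 1)
                entries[uttid] = (path, int(offset))
            else:
                entries[uttid] = (location, None)
    return entries
```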
Here is a demo script to train a model on your customized dataset named `foo`:
```shell
cd egs/LibriTTS/codec
# 0. make the directories for the train, dev and test sets
mkdir -p dump/foo/train dump/foo/dev dump/foo/test

# 1a. if you already have the wav.scp files, just place them under the corresponding directories
mv train.scp dump/foo/train/wav.scp; mv dev.scp dump/foo/dev/wav.scp; mv test.scp dump/foo/test/wav.scp
# 1b. if you don't have the wav.scp files, you can prepare them as follows
find path/to/train_set/ -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/train/wav.scp
find path/to/dev_set/   -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/dev/wav.scp
find path/to/test_set/  -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/test/wav.scp

# 2. collate shape files
mkdir -p exp/foo_states/train exp/foo_states/dev
torchrun --nproc_per_node=4 --master_port=1234 scripts/gen_wav_length.py --wav_scp dump/foo/train/wav.scp --out_dir exp/foo_states/train/wav_length
cat exp/foo_states/train/wav_length/wav_length.*.txt | shuf > exp/foo_states/train/speech_shape
torchrun --nproc_per_node=4 --master_port=1234 scripts/gen_wav_length.py --wav_scp dump/foo/dev/wav.scp --out_dir exp/foo_states/dev/wav_length
cat exp/foo_states/dev/wav_length/wav_length.*.txt | shuf > exp/foo_states/dev/speech_shape

# 3. train the model with 2 GPUs (devices 4 and 5) on the customized dataset (foo)
bash run.sh --gpu_devices 4,5 --gpu_num 2 --dumpdir dump/foo --state_dir foo_states
```

## Acknowledgements

1. FunCodec shares a consistent design with [FunASR](https://github.com/alibaba/FunASR), including the dataloader, model definition and so on.
2. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for data preparation.
3. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet). FunCodec follows the training and fine-tuning pipelines of ESPnet.
4. We borrowed the model architecture design from [EnCodec](https://github.com/facebookresearch/encodec) and [EnCodec_Trainer](https://github.com/Mikxox/EnCodec_Trainer).

## License
This project is licensed under [The MIT License](https://opensource.org/licenses/MIT). 
FunCodec also contains various third-party components and some code modified from other repositories 
under other open-source licenses.

## Citations

``` bibtex
@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.Sound}
}
```