{"id":27292959,"url":"https://github.com/zhenye234/xcodec","last_synced_at":"2025-04-11T22:40:36.553Z","repository":{"id":254076513,"uuid":"806157441","full_name":"zhenye234/xcodec","owner":"zhenye234","description":"AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model","archived":false,"fork":false,"pushed_at":"2025-03-29T07:59:06.000Z","size":1846,"stargazers_count":186,"open_issues_count":14,"forks_count":11,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-29T08:29:22.739Z","etag":null,"topics":["audio","audio-codec","codec","gpt","language-model","music","self-supervised-learning","semantic","sound","speech","speech-language-model","text-to-music","text-to-sound","text-to-speech","tokenizer","vall-e"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhenye234.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-26T14:36:40.000Z","updated_at":"2025-03-29T07:59:09.000Z","dependencies_parsed_at":"2025-01-10T08:02:16.033Z","dependency_job_id":null,"html_url":"https://github.com/zhenye234/xcodec","commit_stats":null,"previous_names":["zhenye234/xcodec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhenye234%2Fxcodec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhenye234%2Fxcodec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhenye234%2Fxcodec/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhenye234%2Fxcodec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhenye234","download_url":"https://codeload.github.com/zhenye234/xcodec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248493007,"owners_count":21113159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","audio-codec","codec","gpt","language-model","music","self-supervised-learning","semantic","sound","speech","speech-language-model","text-to-music","text-to-sound","text-to-speech","tokenizer","vall-e"],"created_at":"2025-04-11T22:40:34.940Z","updated_at":"2025-04-11T22:40:36.518Z","avatar_url":"https://github.com/zhenye234.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n[![arXiv](https://img.shields.io/badge/arXiv-2408.17175-brightgreen.svg?style=flat-square)](https://arxiv.org/pdf/2408.17175)  \n# X-Codec\n\nUnified  Semantic and Acoustic Codec  for Audio Language Model.\n\n# X-Codec-2.0 released!\n\n\n# Paper \n \n\n**Title**: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025)\n\n**Authors**: Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo*, Wei Xue*\n\n\u003cimg src=\"fig1.png\" alt=\"Overview\" width=\"600\"/\u003e\n\n# Experiments on VALL-E\n\u003cimg src=\"exp.png\" alt=\"Exp\" width=\"900\"/\u003e\n\n\u003c!-- # ckpts --\u003e\n\n\u003c!-- Speech ckpts [downlaod 
link](https://drive.google.com/file/d/1oF1_R0Z2JNnqdPbuqiL8tJeY6pDwuQG1/view?usp=sharing)\n \nGeneral audio ckpts [Soon] --\u003e\n\n# Highlight\n\nYou can easily apply our approach to enhance any existing acoustic codec:\n\nFor example\n\n```python\nclass Codec():\n    def __init__(self):\n        # Acoustic codec components\n        self.encoder = Encoder(...)       # Acoustic encoder\n        self.decoder = Decoder(...)       # Acoustic decoder\n        self.quantizer = RVQ(...)         # Residual Vector Quantizer (RVQ)\n\n        # Adding the semantic module\n        self.semantic_model = AutoModel.from_pretrained(...)  # e.g., Hubert, WavLM\n\n        # Adding Projector\n        self.fc_prior = nn.Linear(...)     \n        self.fc_post1 = nn.Linear(...)     \n        self.fc_post2 = nn.Linear(...)     \n\n    def forward(self, x, bw):\n        # Encode the input acoustically and semantically\n        e_acoustic = self.encoder(x)\n        e_semantic = self.semantic_model(x)\n\n        # Combine acoustic and semantic features\n        combined_features = torch.cat([e_acoustic, e_semantic])\n\n        # Apply prior transformation\n        transformed_features = self.fc_prior(combined_features)\n\n        # Quantize the unified  semantic and acoustic features\n        quantized, codes, bandwidth, commit_loss = self.quantizer(transformed_features, bw)\n\n        # Post-process the quantized features\n        quantized_semantic = self.fc_post1(quantized)\n        quantized_acoustic = self.fc_post2(quantized)\n\n        # Decode the quantized acoustic features\n        output = self.decoder(quantized_acoustic)\n\n\n\n    def semantic_loss(self,semantic,quantized_semantic):\n        return F.mse_loss(semantic,quantized_semantic)     \n```\nFor more details, please refer to our code.\n\n# Available models\n🤗 links to the Huggingface model hub.\n\n| Model name                                  | Hugging Face                                                                 
                          | Config                                                                                                   | Semantic Model                                                        | Domain        | Training Data                 |\n|---------------------------------------------|--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|---------------|-------------------------------|\n| xcodec_hubert_librispeech                   | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/xcodec_speech_hubert_librispeech.pth)            | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/config_hubert.yaml)                                | [🤗 Hubert-base](https://huggingface.co/facebook/hubert-base-ls960)               | Speech        | Librispeech                   |\n| xcodec_wavlm_mls (not mentioned in paper)   | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/xcodec_speech_wavlm_mls.pth)                     | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/config_wavlm.yaml)                                 | [🤗 Wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus)                | Speech        | MLS English                   |\n| xcodec_wavlm_more_data (not mentioned in paper) | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/xcodec_speech_wavlm_more_data.pth)               | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/config_wavlm.yaml)                                 | [🤗 Wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus)                | Speech        | MLS English + Internal data   |\n| xcodec_hubert_general_audio                 | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/xcodec_hubert_general_audio.pth)                              | 
[🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/config_hubert_general.yaml)                     | [🤗 Hubert-base-general-audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio)      | General audio | 200k hours internal data      |\n| xcodec_hubert_general_audio_more_data (not mentioned in paper) | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/xcodec_hubert_general_audio_v2.pth) | [🤗](https://huggingface.co/ZhenYe234/xcodec/blob/main/config_hubert_general.yaml) | [🤗 Hubert-base-general-audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | More balanced data            |\n\n# Inference\n\nTo run inference, first download a model checkpoint and its config from Hugging Face (see the table above), then run:\n\n```bash\npython inference.py\n```\n\n# Training\nSet `training_file` and `validation_file` in the config. Each file should list the paths to your audio files, one per line:\n```bash\n/path/to/your/xxx.wav\n/path/to/your/yyy.wav\n...\n```\nThen:\n\n```bash\ntorchrun --nnodes=1 --nproc-per-node=8 main_launch_vqdp.py\n```\n\n## Acknowledgement\nSpecial thanks to the authors of UniAudio and DAC: our code base is mainly borrowed from [UniAudio](https://github.com/yangdongchao/UniAudio/tree/main/codec) and [DAC](https://github.com/descriptinc/descript-audio-codec).\n\n## Citation\nIf you find this repo helpful, please consider citing it in the following format:\n\n```bibtex\n@article{ye2024codecdoesmatterexploring,\n      title={Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model}, \n      author={Zhen Ye and Peiwen Sun and Jiahe Lei and Hongzhan Lin and Xu Tan and Zheqi Dai and Qiuqiang Kong and Jianyi Chen and Jiahao Pan and Qifeng Liu and Yike Guo and Wei Xue},\n      journal={arXiv preprint arXiv:2408.17175},\n      
year={2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhenye234%2Fxcodec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhenye234%2Fxcodec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhenye234%2Fxcodec/lists"}