{"id":23933498,"url":"https://github.com/X-LANCE/SLAM-LLM","last_synced_at":"2025-09-11T16:31:57.579Z","repository":{"id":240774531,"uuid":"708737761","full_name":"X-LANCE/SLAM-LLM","owner":"X-LANCE","description":"Speech, Language, Audio, Music Processing with Large Language Model","archived":false,"fork":false,"pushed_at":"2024-12-27T07:41:28.000Z","size":168382,"stargazers_count":627,"open_issues_count":13,"forks_count":56,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-01-02T08:12:46.672Z","etag":null,"topics":["audio-processing","large-language-model","multimodal-large-language-models","music-processing","peft","speech-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/X-LANCE.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-23T09:37:40.000Z","updated_at":"2024-12-30T06:18:02.000Z","dependencies_parsed_at":"2024-05-20T21:22:33.055Z","dependency_job_id":"8b4c309b-cd37-4461-a2dc-2414c7a6b637","html_url":"https://github.com/X-LANCE/SLAM-LLM","commit_stats":null,"previous_names":["ddlbojack/slam-llm","x-lance/slam-llm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FSLAM-LLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FSLAM-LLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-LANCE%2FSLAM-LLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/X-LANCE%2FSLAM-LLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/X-LANCE","download_url":"https://codeload.github.com/X-LANCE/SLAM-LLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232658684,"owners_count":18556988,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-processing","large-language-model","multimodal-large-language-models","music-processing","peft","speech-processing"],"created_at":"2025-01-06T00:29:43.957Z","updated_at":"2025-09-11T16:31:57.524Z","avatar_url":"https://github.com/X-LANCE.png","language":"Python","funding_links":[],"categories":["Building","Tools\u003ca id=\"tool\"\u003e\u003c/a\u003e"],"sub_categories":["LLM Models","Others\u003ca id=\"paper11\"\u003e\u003c/a\u003e"],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003e\n    SLAM-LLM\n    \u003c/h1\u003e\n    \u003cp\u003e\n    \u003cb\u003eSLAM-LLM\u003c/b\u003e is a deep learning toolkit that allows researchers and\ndevelopers to train custom multimodal large language models (MLLMs), focusing on \u003cb\u003eS\u003c/b\u003epeech, \u003cb\u003eL\u003c/b\u003eanguage, \u003cb\u003eA\u003c/b\u003eudio, and \u003cb\u003eM\u003c/b\u003eusic processing. We provide detailed recipes for training and high-performance checkpoints for inference. 
\u003cbr\u003e\n    \u003c/p\u003e\n    \u003cp\u003e\n    \u003cimg src=\"docs/logo.jpg\" alt=\"SLAM-LLM Logo\" style=\"width: 200px; height: 200px;\"\u003e\n    \u003c/p\u003e\n    \u003cp\u003e\n    \u003c/p\u003e\n    \u003ca href=\"https://github.com/ddlBoJack/SLAM-LLM\"\u003e\u003cimg src=\"https://img.shields.io/badge/Platform-linux-lightgrey\" alt=\"version\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/ddlBoJack/SLAM-LLM\"\u003e\u003cimg src=\"https://img.shields.io/badge/Cuda-11.8+-orange\" alt=\"version\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/ddlBoJack/SLAM-LLM\"\u003e\u003cimg src=\"https://img.shields.io/badge/PyTorch-2.01+-brightgreen\" alt=\"python\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/ddlBoJack/SLAM-LLM\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-red.svg\" alt=\"mit\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n# Table of Contents\n1. [News](#news)\n2. [Installation](#installation)\n3. [Usage](#usage)\n    - [List of Recipes](#list-of-recipes)\n    - [Configuration Priority](#configuration-priority)\n4. [Features](#features)\n5. [Acknowledge](#acknowledge)\n6. [Citation](#citation)\n\n# News\n- [Update Apr. 24, 2025] We have supported [large-scale industrial training](examples/aispeech_asr/README.md), suitable for datasets on the order of 100,000 hours. Its main features include:\n  - **Support for multi-task training:** Designed to support tasks such as ASR and ST through a unified data format. \n  - **Dynamic prompt selection:** Supports random selection from multiple prompts. \n  - **Iterative dataset:** Uses an iterative dataset format to reduce startup time for large datasets. 
\n  - **DeepSpeed training:** Supports DeepSpeed training to significantly reduce memory usage.\n  - **Multi-machine multi-GPU inference:** Supports distributed inference across multiple machines and GPUs to reduce evaluation time.\n  - **Dynamic frame batching:** Dynamically combines frames based on audio size rather than using a fixed batch size, significantly reducing training and evaluation time (reduces training time by 3/4 for 100,000 hours of data).\n- [Update Apr. 24, 2025] DeepSpeed is now supported; see the *Fine-tuning using Deepspeed* section [here](examples/asr_librispeech/README.md).\n- [Update Jan. 22, 2025] 🔥🔥🔥 Full reproduction (including all data preparation, model training, and inference) for [SLAM-Omni](examples/s2s/README.md) has been supported.  \n![](docs/slam-omni-model.png)\n  - SLAM-Omni is a **timbre-controllable** voice interaction system that requires only **single-stage training** and minimal resources to achieve high-quality, end-to-end speech dialogue, supporting multi-turn conversations in both Chinese and English. ([paper](https://arxiv.org/abs/2412.15649), [demo](https://slam-omni.github.io))\n  - We have fully reproduced the **training and inference** processes of SLAM-Omni and open-sourced all related training datasets. The provided code framework theoretically supports all codec-based spoken dialogue models. 
Additionally, we offer the reproduction code for [Mini-Omni](https://github.com/gpt-omni/mini-omni).\n\n\u003ctable class=\"center\"\u003e\n\u003ctr\u003e\n    \u003ctd width=50% style=\"border: none\"\u003e\n        \u003cvideo controls autoplay loop src=\"https://github.com/user-attachments/assets/73597edb-0d66-453b-b10c-8cf8dd3cae18\" muted=\"false\"\u003e\u003c/video\u003e\n    \u003c/td\u003e\n    \u003ctd width=50% style=\"border: none\"\u003e\n        \u003cvideo controls autoplay loop src=\"https://github.com/user-attachments/assets/7a797491-0509-4da8-8662-f2107bd8856a\" muted=\"false\"\u003e\u003c/video\u003e\n    \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n- [Update Nov. 17, 2024] Recipes for [LLM-Based Contextual ASR](examples/contextual_asr/README.md) have been supported. \n- [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder have been supported.\n- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported. \n- [Update Sep. 28, 2024] Recipes for [CoT-ST](examples/st_covost2/README.md) have been supported. \n- [Update Sep. 25, 2024] Recipes for [DRCap](examples/drcap_zeroshot_aac/README.md) have been supported. \n- [Update Jun. 12, 2024] Recipes for [MaLa-ASR](examples/mala_asr_slidespeech/README.md) have been supported. \n- **[CALL FOR EXAMPLE]** We sincerely invite developers and researchers to develop new applications, conduct academic research based on SLAM-LLM, and submit your examples as pull requests! We also welcome engineering PRs (such as improving and speeding up multi-node training). \n- [Update May. 22, 2024] Please join [slack](https://join.slack.com/t/slam-llm/shared_invite/zt-2mc0pkhhs-5jjOi8Cwc8R1Xc8IQmykDA) or [WeChat group](./docs/Wechat.jpg). We will sync our updates and Q\u0026A here. \n- [Update May. 
21, 2024] Recipes for [Spatial Audio Understanding](examples/seld_spatialsoundqa/README.md) have been supported. \n- [Update May. 20, 2024] Recipes for [music caption (MC)](examples/mc_musiccaps/README.md) have been supported. \n- [Update May. 8, 2024] Recipes for [visual speech recognition (VSR)](examples/vsr_LRS3/README.md) have been supported. \n- [Update May. 4, 2024] Recipes for [zero-shot text-to-speech (TTS)](examples/vallex/README.md) have been supported. \n- [Update Apr. 28, 2024] Recipes for [automated audio captioning (AAC)](examples/aac_audiocaps/README.md) have been supported. \n- [Update Mar. 31, 2024] Recipes for [automatic speech recognition (ASR)](examples/asr_librispeech/README.md) have been supported. \n\n# Installation\n```bash\ngit clone https://github.com/huggingface/transformers.git\ncd transformers\ngit checkout tags/v4.35.2\npip install -e .\ncd ..\ngit clone https://github.com/huggingface/peft.git\ncd peft\ngit checkout tags/v0.6.0\npip install -e .\ncd ..\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118\ngit clone https://github.com/ddlBoJack/SLAM-LLM.git\ncd SLAM-LLM\npip install -e .\n```\n\nSome examples also require `fairseq`; install it as follows:\n```\n# you need to install fairseq before SLAM-LLM\ngit clone https://github.com/pytorch/fairseq\ncd fairseq\npip install --editable ./\n```\nWe also provide a docker image for convenience:\n```shell\n# build docker image\ndocker build -t slam-llm:latest .\n\n# run docker image with gpu\ndocker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash\n```\n# Usage\n## List of Recipes\nWe provide reference implementations of various LLM-based speech, audio, and music tasks: \n- **Speech Task**\n    - Automatic Speech Recognition (ASR)\n        - [SLAM-ASR](examples/asr_librispeech/README.md)\n    \n    - Contextual Automatic Speech Recognition (CASR)\n        - [ 
MaLa-ASR](examples/mala_asr_slidespeech/README.md)\n        - [LLM-Based Contextual ASR](examples/contextual_asr/README.md)\n    \n    - [Visual Speech Recognition (VSR)](examples/vsr_LRS3/README.md) \n    - Speech-to-Text Translation (S2TT)\n        - [CoT-ST](examples/st_covost2/README.md)\n    \n    - Text-to-Speech (TTS)\n        - [VALL-E-X](examples/vallex/README.md)\n    - [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)\n    - Voice Interaction System\n        - [SLAM-Omni](examples/s2s/README.md)\n    \n- **Audio Task**\n    - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)\n      - [SLAM-AAC](examples/slam_aac/README.md)\n      - [DRCap](examples/drcap_zeroshot_aac/README.md)\n  \n    - Spatial Audio Understanding\n      - [BAT](examples/seld_spatialsoundqa/README.md)\n    \n- **Music Task**\n    - [Music Caption (MC)](examples/mc_musiccaps/README.md)\n\n## Configuration Priority\nConfiguration sources are applied in the following order of priority, highest first:\n```\ncommand-line (shell file) \u003e Hydra configuration (yaml file) \u003e dataclass configuration (Python file)\n```\n\n# Features\n- Easily extensible to new models and tasks.\n- Detailed recipes for training and high-performance checkpoints for inference.\n- Mixed-precision training, which trains faster with less GPU memory on NVIDIA Tensor Cores. \n- Multi-GPU training with data and model parallelism, supporting [DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) and [deepspeed](https://github.com/microsoft/DeepSpeed) (still needs improvement).  \n- Flexible configuration based on [Hydra](https://github.com/facebookresearch/hydra) and [dataclass](https://docs.python.org/3/library/dataclasses.html), allowing a combination of code, command-line, and file-based configuration. 
\n\n# Acknowledge\n- We borrow code from [Llama-Recipes](https://github.com/meta-llama/llama-recipes) for the training process. \n- We borrow code from [Fairseq](https://github.com/facebookresearch/fairseq) for deepspeed configuration. \n- We thank the contributors for providing diverse recipes. \n\n# Citation\n\n## Speech Task\n\nSLAM-ASR:\n```\n@article{ma2024embarrassingly,\n  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},\n  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},\n  journal={arXiv preprint arXiv:2402.08846},\n  year={2024}\n}\n```\nMala-ASR:\n```\n@article{yang2024mala,\n  title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},\n  author={Yang, Guanrou and Ma, Ziyang and Yu, Fan and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},\n  journal={Proc. INTERSPEECH},\n  year={2024}\n}\n```\nLLM-Based Contextual ASR:\n```\n@article{yang2024ctc,\n  title={CTC-Assisted LLM-Based Contextual ASR},\n  author={Yang, Guanrou and Ma, Ziyang and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},\n  journal={Proc. 
SLT},\n  year={2024}\n}\n```\nCoT-ST:\n```\n@article{du2024cot,\n  title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},\n  author={Du, Yexing and Ma, Ziyang and Yang, Yifan and Deng, Keqi and Chen, Xie and Yang, Bo and Xiang, Yang and Liu, Ming and Qin, Bing},\n  journal={arXiv preprint arXiv:2409.19510},\n  year={2024}\n}\n```\n\nSLAM-Omni:\n```\n@article{chen2024slam,\n  title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},\n  author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},\n  journal={arXiv preprint arXiv:2412.15649},\n  year={2024}\n}\n```\n\n## Audio Task\nSLAM-AAC:\n```\n@article{chen2025slam,\n  title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},\n  author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},\n  journal={Proc. ICASSP},\n  year={2025}\n}\n```\nDRCap:\n```\n@article{li2025drcap,\n  title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},\n  author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},\n  journal={Proc. ICASSP},\n  year={2025}\n}\n```\nBAT:\n```\n@article{zheng2024bat,\n  title={BAT: Learning to Reason about Spatial Sounds with Large Language Models},\n  author={Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},\n  journal={Proc. ICML},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FX-LANCE%2FSLAM-LLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FX-LANCE%2FSLAM-LLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FX-LANCE%2FSLAM-LLM/lists"}