{"id":26658081,"url":"https://github.com/skit-ai/speechllm","last_synced_at":"2026-02-20T01:02:16.481Z","repository":{"id":245721987,"uuid":"810274202","full_name":"skit-ai/SpeechLLM","owner":"skit-ai","description":"This repository contains the training, inference, evaluation code for SpeechLLM models and details about the model releases on huggingface.","archived":false,"fork":false,"pushed_at":"2024-06-25T21:09:06.000Z","size":4065,"stargazers_count":92,"open_issues_count":3,"forks_count":8,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-25T09:17:59.737Z","etag":null,"topics":["conversational-ai","llm","multi-modal-llms","multi-modality","speech"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/skit-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-04T11:34:25.000Z","updated_at":"2025-03-25T06:30:00.000Z","dependencies_parsed_at":"2025-04-11T13:05:23.125Z","dependency_job_id":null,"html_url":"https://github.com/skit-ai/SpeechLLM","commit_stats":null,"previous_names":["skit-ai/speechllm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/skit-ai%2FSpeechLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/skit-ai%2FSpeechLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/skit-ai%2FSpeechLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/skit-ai%2FSpeechLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/skit-ai","download_url":"https://codeload.github.com/skit-ai/SpeechLLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248404355,"owners_count":21097718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conversational-ai","llm","multi-modal-llms","multi-modality","speech"],"created_at":"2025-03-25T09:18:04.255Z","updated_at":"2026-02-20T01:02:16.469Z","avatar_url":"https://github.com/skit-ai.png","language":"Python","readme":"# SpeechLLM\n\n[![hf_model](https://img.shields.io/badge/🤗-SpeechLLM%20HuggingFace-blue.svg)](https://huggingface.co/collections/skit-ai/speechllm-66605bfb37a54d4e4a60efe2)\n[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/skit-ai/SpeechLLM/blob/main/LICENSE)\n[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/skit-ai/SpeechLLM.git)[![GitHub stars](https://img.shields.io/github/stars/skit-ai/SpeechLLM?style=social)](https://github.com/skit-ai/SpeechLLM/stargazers)\n[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-F9AB00?logo=googlecolab\u0026color=blue)](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing)\n\n\n\n![](./assets/speechllm.png)\n\nSpeechLLM is a multi-modal Language Model (LLM) specifically trained to analyze and predict metadata from a speaker's turn in a conversation. 
This advanced model integrates a speech encoder to transform speech signals into meaningful speech representations. These embeddings, combined with text instructions, are then processed by the LLM to generate predictions.\n\nThe model takes a **16 kHz** speech audio file as input and predicts the following:\n1. **SpeechActivity** : whether the audio signal contains speech (True/False)\n2. **Transcript** : ASR transcript of the audio\n3. **Gender** of the speaker (Female/Male)\n4. **Age** of the speaker (Young/Middle-Age/Senior)\n5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)\n6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)\n\n## Usage\n```python\n# Load the model directly from Hugging Face\nimport torchaudio\nfrom transformers import AutoModel\n\nmodel = AutoModel.from_pretrained(\"skit-ai/speechllm-2B\", trust_remote_code=True)\n\nmodel.generate_meta(\n    audio_path=\"path-to-audio.wav\", # 16 kHz, mono\n    audio_tensor=torchaudio.load(\"path-to-audio.wav\")[0], # [Optional] pass either audio_path or the waveform tensor directly (torchaudio.load returns (waveform, sample_rate))\n    instruction=\"Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]\",\n    max_new_tokens=500,\n    return_special_tokens=False\n)\n\n# Model Generation\n'''\n{\n  \"SpeechActivity\" : \"True\",\n  \"Transcript\": \"Yes, I got it. I'll make the payment now.\",\n  \"Gender\": \"Female\",\n  \"Emotion\": \"Neutral\",\n  \"Age\": \"Young\",\n  \"Accent\" : \"America\"\n}\n'''\n```\n\nTry the model in the [Google Colab notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing). 
Also, check out our [blog](https://tech.skit.ai/speech-conversational-llms/) on SpeechLLM for end-to-end conversational agents (User Speech -\u003e Response).\n\n## Model Weights\nWe released the speechllm-2B and speechllm-1.5B model checkpoints on Hugging Face :hugs:.\n| **Model**         | **Speech Encoder**                                                                  | **LLM**                                                                                            | **Checkpoint URL**                                            |\n|-------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|---------------------------------------------------------------|\n| **speechllm-2B**  | [facebook/hubert-xlarge-ll60k](https://huggingface.co/facebook/hubert-xlarge-ll60k) | [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)    | [Hugging Face](https://huggingface.co/skit-ai/speechllm-2B)    |\n| **speechllm-1.5B** | [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large)               | [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) | [Hugging Face](https://huggingface.co/skit-ai/speechllm-1.5B) |\n\n## Latest Checkpoint Results\n\n### speechllm-2B\n|         **Dataset**        |       **Type**      | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |\n|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|\n| **librispeech-test-clean** | Read Speech         |         6.73        |     0.9496     |             |                |\n| **librispeech-test-other** | Read Speech         |         9.13        |     0.9217     |             |                |\n| **CommonVoice test**       | Diverse Accent, Age |        25.66        |     0.8680  
   |    0.6041   |     0.6959     |\n\n### speechllm-1.5B\n|         **Dataset**        |       **Type**      | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |\n|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|\n| **librispeech-test-clean** | Read Speech         |        11.51        |     0.9594     |             |                |\n| **librispeech-test-other** | Read Speech         |        16.68        |     0.9297     |             |                |\n| **CommonVoice test**       | Diverse Accent, Age |        26.02        |     0.9476     |    0.6498   |     0.8121     |\n\n\n## Training\n\n### Dataset Preparation and Installation\nInstall the packages listed in requirements.txt, making sure they match your CUDA version. Then prepare the audio dataset in the same format as data_samples/train.csv and data_samples/dev.csv. If new tasks (e.g., noise or environment classification) need to be added, update dataset.py accordingly.\n```bash\npip install -r requirements.txt\n```\n\n### Train\nUpdate the config in train.py, such as audio_encoder_name, llm_name, and other hyperparameters.\n```bash\npython train.py\n```\n\n### Evaluation\nAfter training, update the checkpoint path and the test dataset path (same format as train/dev.csv).\n```bash\npython test.py\n```\n\n### Infer the model in a Streamlit app\n```bash\nstreamlit run app.py\n```\n![](./assets/streamlit_app.png)\n\n\n## Disclaimer\nThe models provided in this repository are not perfect and may produce errors in Automatic Speech Recognition (ASR), gender identification, age estimation, accent recognition, and emotion detection. Additionally, these models may exhibit biases related to gender, age, accent, and emotion. Please use with caution, especially in production environments, and be aware of potential inaccuracies and biases.\n\n## License\nThis project is released under the Apache 2.0 license as found in the LICENSE file. 
The released checkpoints and code are intended for research purposes, subject to the licenses of the [facebook/hubert-xlarge-ll60k](https://huggingface.co/facebook/hubert-xlarge-ll60k), [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large), and [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) models.\n\n## Cite\n```\n@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,\nauthor = {Rajaa, Shangeth and Tushar, Abhinav},\ntitle = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},\nurl = {https://github.com/skit-ai/SpeechLLM}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fskit-ai%2Fspeechllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fskit-ai%2Fspeechllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fskit-ai%2Fspeechllm/lists"}