{"id":13754206,"url":"https://github.com/QwenLM/Qwen-Audio","last_synced_at":"2025-05-09T22:31:37.359Z","repository":{"id":207153100,"uuid":"715441002","full_name":"QwenLM/Qwen-Audio","owner":"QwenLM","description":"The official repo of Qwen-Audio (通义千问-Audio) chat \u0026 pretrained large audio language model proposed by Alibaba Cloud.","archived":false,"fork":false,"pushed_at":"2024-07-05T09:17:49.000Z","size":25785,"stargazers_count":1659,"open_issues_count":59,"forks_count":118,"subscribers_count":27,"default_branch":"main","last_synced_at":"2025-04-07T21:14:34.093Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QwenLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-07T06:31:39.000Z","updated_at":"2025-04-07T10:28:56.000Z","dependencies_parsed_at":"2024-01-03T11:36:12.009Z","dependency_job_id":"46bc2b9d-b91b-439b-9afc-818cbbc5e4af","html_url":"https://github.com/QwenLM/Qwen-Audio","commit_stats":null,"previous_names":["qwenlm/qwen-audio"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FQwen-Audio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FQwen-Audio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FQwen-Audio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QwenLM%2FQwen-Audio/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/QwenLM","download_url":"https://codeload.github.com/QwenLM/Qwen-Audio/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335699,"owners_count":21892714,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:49.809Z","updated_at":"2025-05-09T22:31:32.342Z","avatar_url":"https://github.com/QwenLM.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cp align=\"left\"\u003e\n \u003ca href=\"README_CN.md\"\u003e中文\u003c/a\u003e \u0026nbsp｜ \u0026nbsp English\u0026nbsp\u0026nbsp\n\u003c/p\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"assets/audio_logo.jpg\" width=\"400\"/\u003e\n\u003cp\u003e\n\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e\n Qwen-Audio \u003ca href=\"https://www.modelscope.cn/models/qwen/QWen-Audio/summary\"\u003e🤖 \u003ca\u003e | \u003ca href=\"https://huggingface.co/Qwen/Qwen-Audio\"\u003e🤗\u003c/a\u003e\u0026nbsp ｜ Qwen-Audio-Chat \u003ca href=\"https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary\"\u003e🤖 \u003ca\u003e| \u003ca href=\"https://huggingface.co/Qwen/Qwen-Audio-Chat\"\u003e🤗\u003c/a\u003e\u0026nbsp | \u0026nbsp\u0026nbsp Demo\u003ca href=\"https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary\"\u003e 🤖\u003c/a\u003e | \u003ca href=\"https://huggingface.co/spaces/Qwen/Qwen-Audio\"\u003e🤗\u003c/a\u003e\u0026nbsp\n\u003cbr\u003e\n\u0026nbsp\u0026nbsp\u003ca 
href=\"https://qwen-audio.github.io/Qwen-Audio/\"\u003eHomepage\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u0026nbsp\u003ca href=\"http://arxiv.org/abs/2311.07919\"\u003ePaper\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp\u0026nbsp\u003ca href=\"https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png\"\u003eWeChat\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp\u003ca href=\"https://discord.gg/CV4E9rpNSD\"\u003eDiscord\u003c/a\u003e\u0026nbsp\u0026nbsp\u003c/a\u003e\n\u003c/p\u003e\n\u003cbr\u003e\u003cbr\u003e\n\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-1?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-android-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-android-1?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-ios)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-ios?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-mic-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-mic-1?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/acoustic-scene-classification-on-cochlscene)](https://paperswithcode.com/sota/acoustic-scene-classification-on-cochlscene?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwe
n-audio-advancing-universal-audio/acoustic-scene-classification-on-tut-acoustic)](https://paperswithcode.com/sota/acoustic-scene-classification-on-tut-acoustic?p=qwen-audio-advancing-universal-audio) \u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/audio-classification-on-vocalsound)](https://paperswithcode.com/sota/audio-classification-on-vocalsound?p=qwen-audio-advancing-universal-audio) \u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/audio-captioning-on-clotho)](https://paperswithcode.com/sota/audio-captioning-on-clotho?p=qwen-audio-advancing-universal-audio) \u003cbr\u003e\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-librispeech-test-clean)](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/emotion-recognition-in-conversation-on-meld)](https://paperswithcode.com/sota/emotion-recognition-in-conversation-on-meld?p=qwen-audio-advancing-universal-audio)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-librispeech-test-other)](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-other?p=qwen-audio-advancing-universal-audio)\n\n**Qwen-Audio** (Qwen Large Audio Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music and song) and text as inputs, outputs text. 
The contributions of Qwen-Audio include:\n\n- **Fundamental audio models**: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.\n- **Multi-task learning framework for all types of audio**: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.\n- **Strong Performance**: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.\n- **Flexible multi-turn chat from audio and text input**: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.\n\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"assets/framework.png\" width=\"800\"/\u003e\n\u003cp\u003e\n\u003cbr\u003e\n\n\nWe release two models in the Qwen-Audio series:\n\n- Qwen-Audio: A pre-trained multi-task audio understanding model that uses Qwen-7B to initialize the LLM and [Whisper-large-v2](https://github.com/openai/whisper) to initialize the audio encoder.\n- Qwen-Audio-Chat: A multimodal LLM-based AI assistant trained with alignment techniques. 
Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.\n\u003cbr\u003e\n\n## News and Updates\n* 2023.11.30 🔥 We have released the checkpoints of both **Qwen-Audio** and **Qwen-Audio-Chat** on ModelScope and Hugging Face.\n* 2023.11.15 🎉 We released a [paper](http://arxiv.org/abs/2311.07919) describing the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.\n\n\u003cbr\u003e\n\n## Evaluation\nWe evaluated Qwen-Audio's abilities on 12 standard benchmarks as follows:\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"assets/evaluation.png\" width=\"800\"/\u003e\n\u003cp\u003e\n\n\nThe overall performance is shown below:\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"assets/radar_new.png\" width=\"800\"/\u003e\n\u003cp\u003e\n\n\nThe details of the evaluation are as follows:\n### Automatic Speech Recognition\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"4\"\u003eResults (WER)\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003edev-clean\u003c/th\u003e\n \u003cth\u003edev-other\u003c/th\u003e\n \u003cth\u003etest-clean\u003c/th\u003e\n \u003cth\u003etest-other\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"5\"\u003eLibrispeech\u003c/td\u003e\n \u003ctd\u003eSpeechT5\u003c/td\u003e\n \u003ctd\u003e2.1\u003c/td\u003e\n \u003ctd\u003e5.5\u003c/td\u003e\n \u003ctd\u003e2.4\u003c/td\u003e\n \u003ctd\u003e5.8\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003eSpeechNet\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e30.7\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eSLM-FT\u003c/td\u003e\n 
\u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e2.6\u003c/td\u003e\n \u003ctd\u003e5.0\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eSALMONN\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e2.1\u003c/td\u003e\n \u003ctd\u003e4.9\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e1.8\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e4.0\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e2.0\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e4.2\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"2\"\u003eResults (WER)\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003edev\u003c/th\u003e\n \u003cth\u003etest\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"4\"\u003eAishell1\u003c/td\u003e\n \u003ctd\u003eMMSpeech-base\u003c/td\u003e\n \u003ctd\u003e2.0\u003c/td\u003e\n \u003ctd\u003e2.1\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eMMSpeech-large\u003c/td\u003e\n \u003ctd\u003e1.6\u003c/td\u003e\n \u003ctd\u003e1.9\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eParaformer-large\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e2.0\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e1.2 (SOTA)\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e1.3 (SOTA)\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth 
rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"3\"\u003eResults (WER)\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003eMic\u003c/th\u003e\n \u003cth\u003eiOS\u003c/th\u003e\n \u003cth\u003eAndroid\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"3\"\u003eAishell2\u003c/td\u003e\n \u003ctd\u003eMMSpeech-base\u003c/td\u003e\n \u003ctd\u003e4.5\u003c/td\u003e\n \u003ctd\u003e3.9\u003c/td\u003e\n \u003ctd\u003e4.0\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eParaformer-large\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e2.9\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e3.3\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e3.1\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e3.3\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n### Speech-to-text Translation\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"7\"\u003eResults (BLEU)\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003een-de\u003c/th\u003e\n \u003cth\u003ede-en\u003c/th\u003e\n \u003cth\u003een-zh\u003c/th\u003e\n \u003cth\u003ezh-en\u003c/th\u003e\n \u003cth\u003ees-en\u003c/th\u003e\n \u003cth\u003efr-en\u003c/th\u003e\n \u003cth\u003eit-en\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"4\"\u003eCoVoST2\u003c/td\u003e\n \u003ctd\u003eSALMONN\u003c/td\u003e\n \u003ctd\u003e18.6\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e33.1\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n 
\u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eSpeechLLaMA\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e27.1\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e12.3\u003c/td\u003e\n \u003ctd\u003e27.9\u003c/td\u003e\n \u003ctd\u003e25.2\u003c/td\u003e\n \u003ctd\u003e25.9\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eBLSP\u003c/td\u003e\n \u003ctd\u003e14.1\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e25.1\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e33.9\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e41.5\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e15.7\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e39.7\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e38.5\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e36.0\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n### Automatic Audio Caption\n\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"3\"\u003eResults\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003eCIDER\u003c/th\u003e\n \u003cth\u003eSPICE\u003c/th\u003e\n \u003cth\u003eSPIDEr\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"2\"\u003eClotho\u003c/td\u003e\n \u003ctd\u003ePengi\u003c/td\u003e\n \u003ctd\u003e0.416\u003c/td\u003e\n \u003ctd\u003e0.126\u003c/td\u003e\n 
\u003ctd\u003e0.271\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.441\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.136\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.288\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n### Speech Recognition with Word-level Timestamp\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"1\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"1\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"1\"\u003eAAC (ms)\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"3\"\u003eIndustrial Data\u003c/td\u003e\n \u003ctd\u003eForce-aligner\u003c/td\u003e\n \u003ctd\u003e60.3\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eParaformer-large-TP\u003c/td\u003e\n \u003ctd\u003e65.3\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e51.5 (SOTA)\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n### Automatic Scene Classification\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"1\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"1\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"1\"\u003eACC\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"2\"\u003eCochlscene\u003c/td\u003e\n \u003ctd\u003eCochlscene\u003c/td\u003e\n \u003ctd\u003e0.669\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.795 (SOTA)\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"2\"\u003eTUT2017\u003c/td\u003e\n \u003ctd\u003ePengi\u003c/td\u003e\n \u003ctd\u003e0.353\u003c/td\u003e\n 
\u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.649\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n### Speech Emotion Recognition\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"1\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"1\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"1\"\u003eACC\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"2\"\u003eMeld\u003c/td\u003e\n \u003ctd\u003eWavLM-large\u003c/td\u003e\n \u003ctd\u003e0.542\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.557\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n### Audio Question \u0026 Answer\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"2\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"2\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"2\"\u003eResults\u003c/th\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003cth\u003eACC\u003c/th\u003e\n \u003cth\u003eACC (binary)\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"3\"\u003eClothoAQA\u003c/td\u003e\n \u003ctd\u003eClothoAQA\u003c/td\u003e\n \u003ctd\u003e0.542\u003c/td\u003e\n \u003ctd\u003e0.627\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003ePengi\u003c/td\u003e\n \u003ctd\u003e-\u003c/td\u003e\n \u003ctd\u003e0.645\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.579\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.749\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n### Vocal Sound Classification\n\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth 
rowspan=\"1\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"1\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"1\"\u003eACC\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"3\"\u003eVocalSound\u003c/td\u003e\n \u003ctd\u003eCLAP\u003c/td\u003e\n \u003ctd\u003e0.4945\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003ePengi\u003c/td\u003e\n \u003ctd\u003e0.6035\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.9289 (SOTA)\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\n### Music Note Analysis\n\u003ctable\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n \u003cth rowspan=\"1\"\u003eDataset\u003c/th\u003e\n \u003cth rowspan=\"1\"\u003eModel\u003c/th\u003e\n \u003cth colspan=\"1\"\u003eNS. Qualities (MAP)\u003c/th\u003e\n\u003cth colspan=\"1\"\u003eNS. Instrument (ACC)\u003c/th\u003e\n \u003c/tr\u003e\n\u003c/thead\u003e\n\n\u003ctbody align=\"center\"\u003e\n\u003ctr\u003e\n \u003ctd rowspan=\"2\"\u003eNSynth\u003c/td\u003e\n \u003ctd\u003ePengi\u003c/td\u003e\n \u003ctd\u003e0.3860\u003c/td\u003e\n \u003ctd\u003e0.5007\u003c/td\u003e\n \u003c/tr\u003e\n\u003ctr\u003e\n \u003ctd\u003eQwen-Audio\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.4742\u003c/strong\u003e\u003c/td\u003e\n \u003ctd\u003e\u003cstrong\u003e0.7882\u003c/strong\u003e\u003c/td\u003e\n \u003c/tr\u003e\n\u003c/table\u003e\n\nWe have provided **all** evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.\n\n### Evaluation of Chat\nTo evaluate the chat abilities of Qwen-Audio-Chat, we provide [TUTORIAL](TUTORIAL.md) and demo for users. 
\n\n## Requirements\n\n* Python 3.8 and above\n* PyTorch 1.12 and above; 2.0 and above is recommended\n* CUDA 11.4 and above is recommended (for GPU users)\n* FFmpeg\n\u003cbr\u003e\n\n## Quickstart\n\nBelow, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.\n\nBefore running the code, make sure you have set up the environment and installed the required packages. Confirm that you meet the requirements above, and then install the dependencies.\n\n```bash\npip install -r requirements.txt\n```\n\nNow you can start with ModelScope or Transformers. For more usage examples, please refer to the [tutorial](TUTORIAL.md). Qwen-Audio models currently perform best with audio clips under 30 seconds.\n\n#### 🤗 Transformers\n\nTo use Qwen-Audio-Chat for inference, all you need is a few lines of code, as demonstrated below. However, **please make sure that you are using the latest code.**\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom transformers.generation import GenerationConfig\nimport torch\ntorch.manual_seed(1234)\n\n# Note: The default behavior now has injection attack prevention off.\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen-Audio-Chat\", trust_remote_code=True)\n\n# use bf16\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio-Chat\", device_map=\"auto\", trust_remote_code=True, bf16=True).eval()\n# use fp16\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio-Chat\", device_map=\"auto\", trust_remote_code=True, fp16=True).eval()\n# use cpu only\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio-Chat\", device_map=\"cpu\", trust_remote_code=True).eval()\n# use cuda device\nmodel = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio-Chat\", device_map=\"cuda\", trust_remote_code=True).eval()\n\n# Specify hyperparameters for generation (No need to do this if you are using 
transformers\u003e4.32.0)\n# model.generation_config = GenerationConfig.from_pretrained(\"Qwen/Qwen-Audio-Chat\", trust_remote_code=True)\n\n# 1st dialogue turn\nquery = tokenizer.from_list_format([\n {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or a URL\n {'text': 'what does the person say?'},\n])\nresponse, history = model.chat(tokenizer, query=query, history=None)\nprint(response)\n# The person says: \"mister quilter is the apostle of the middle classes and we are glad to welcome his gospel\".\n\n# 2nd dialogue turn\nresponse, history = model.chat(tokenizer, 'Find the start time and end time of the word \"middle classes\"', history=history)\nprint(response)\n# The word \"middle classes\" starts at \u003c|2.33|\u003e seconds and ends at \u003c|3.26|\u003e seconds.\n```\n\nRunning the Qwen-Audio pretrained base model is also simple.\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom transformers.generation import GenerationConfig\nimport torch\ntorch.manual_seed(1234)\n\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen-Audio\", trust_remote_code=True)\n\n# use bf16\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio\", device_map=\"auto\", trust_remote_code=True, bf16=True).eval()\n# use fp16\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio\", device_map=\"auto\", trust_remote_code=True, fp16=True).eval()\n# use cpu only\n# model = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio\", device_map=\"cpu\", trust_remote_code=True).eval()\n# use cuda device\nmodel = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen-Audio\", device_map=\"cuda\", trust_remote_code=True).eval()\n\n# Specify hyperparameters for generation (No need to do this if you are using transformers\u003e4.32.0)\n# model.generation_config = GenerationConfig.from_pretrained(\"Qwen/Qwen-Audio\", trust_remote_code=True)\naudio_url = \"assets/audio/1272-128104-0000.flac\"\nsp_prompt = 
\"\u003c|startoftranscription|\u003e\u003c|en|\u003e\u003c|transcribe|\u003e\u003c|en|\u003e\u003c|notimestamps|\u003e\u003c|wo_itn|\u003e\"\nquery = f\"\u003caudio\u003e{audio_url}\u003c/audio\u003e{sp_prompt}\"\naudio_info = tokenizer.process_audio(query)\ninputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)\ninputs = inputs.to(model.device)\npred = model.generate(**inputs, audio_info=audio_info)\nresponse = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False,audio_info=audio_info)\nprint(response)\n# \u003caudio\u003eassets/audio/1272-128104-0000.flac\u003c/audio\u003e\u003c|startoftranscription|\u003e\u003c|en|\u003e\u003c|transcribe|\u003e\u003c|en|\u003e\u003c|notimestamps|\u003e\u003c|wo_itn|\u003emister quilting is the apostle of the middle classes and we are glad to welcome his gospel\u003c|endoftext|\u003e\n```\n\n\nIn the event of a network issue while attempting to download model checkpoints and codes from Hugging Face, an alternative approach is to initially fetch the checkpoint from ModelScope and then load it from the local directory as outlined below:\n\n```python\nfrom modelscope import snapshot_download\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# Downloading model checkpoint to a local dir model_dir\nmodel_id = 'qwen/Qwen-Audio-Chat'\nrevision = 'master'\nmodel_dir = snapshot_download(model_id, revision=revision)\n\n# Loading local checkpoints\n# trust_remote_code is still set as True since we still load codes from local dir instead of transformers\ntokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n model_dir,\n device_map=\"cuda\",\n trust_remote_code=True\n).eval()\n```\n\n#### 🤖 ModelScope\n\nModelScope is an opensource platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. 
Similarly, you can run the models with ModelScope as shown below:\n\n```python\nfrom modelscope import (\n snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig\n)\nimport torch\nmodel_id = 'qwen/Qwen-Audio-Chat'\nrevision = 'master'\n\nmodel_dir = snapshot_download(model_id, revision=revision)\ntorch.manual_seed(1234)\n\ntokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)\nif not hasattr(tokenizer, 'model_dir'):\n tokenizer.model_dir = model_dir\n# use bf16\n# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", trust_remote_code=True, bf16=True).eval()\n# use fp16\n# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", trust_remote_code=True, fp16=True).eval()\n# use CPU\n# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"cpu\", trust_remote_code=True).eval()\n# use gpu\nmodel = AutoModelForCausalLM.from_pretrained(model_dir, device_map=\"auto\", trust_remote_code=True).eval()\n\n# 1st dialogue turn\nquery = tokenizer.from_list_format([\n {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or a URL\n {'text': 'what does the person say?'},\n])\nresponse, history = model.chat(tokenizer, query=query, history=None)\nprint(response)\n# The person says: \"mister quilter is the apostle of the middle classes and we are glad to welcome his gospel\".\n\n# 2nd dialogue turn\nresponse, history = model.chat(tokenizer, 'Find the start time and end time of the word \"middle classes\"', history=history)\nprint(response)\n# The word \"middle classes\" starts at \u003c|2.33|\u003e seconds and ends at \u003c|3.26|\u003e seconds.\n```\n\n## Demo\n\n### Web UI\n\nWe provide code for users to build a web UI demo. 
Before you start, make sure you install the following packages:\n\n```\npip install -r requirements_web_demo.txt\n```\n\nThen run the command below and click on the generated link:\n\n```\npython web_demo_audio.py\n```\n\u003cbr\u003e\n\n## FAQ\n\nIf you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.\n\u003cbr\u003e\n\n## We Are Hiring\n\nIf you are interested in joining us as a full-time employee or intern, please contact us at qwen_audio@list.alibaba-inc.com.\n\u003cbr\u003e\n\n## License Agreement\n\nResearchers and developers are free to use the code and model weights of both Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details.\n\u003cbr\u003e\n\n## Citation\n\nIf you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)\n\n```BibTeX\n@article{Qwen-Audio,\n title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},\n author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},\n journal={arXiv preprint arXiv:2311.07919},\n year={2023}\n}\n```\n\u003cbr\u003e\n\n## Contact Us\n\nIf you would like to leave a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQwenLM%2FQwen-Audio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FQwenLM%2FQwen-Audio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQwenLM%2FQwen-Audio/lists"}