https://github.com/QwenLM/Qwen2-Audio
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
- Host: GitHub
- URL: https://github.com/QwenLM/Qwen2-Audio
- Owner: QwenLM
- Created: 2024-06-24T06:11:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-13T04:02:17.000Z (over 1 year ago)
- Last Synced: 2024-11-09T18:27:03.906Z (about 1 year ago)
- Language: Python
- Size: 2.11 MB
- Stars: 1,201
- Watchers: 31
- Forks: 80
- Open Issues: 69
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ai-game-devtools - Qwen2-Audio - Audio chat & pretrained large audio language model proposed by Alibaba Cloud. [arXiv](https://arxiv.org/abs/2407.10759) (Audio / LLM (LLM & Tool))
- StarryDivineSky - QwenLM/Qwen2-Audio - The official repository of the Qwen2-Audio chat and pretrained large audio language model. It can accept a variety of audio signal inputs and perform audio analysis or respond directly in text to speech instructions. Two distinct audio interaction modes are introduced: voice chat, where users can interact with Qwen2-Audio by voice without any text input; and audio analysis, where users can provide audio and text instructions for analysis during the interaction. (Speech recognition & synthesis_Other / Web services_Other)
README
Qwen2-Audio-7B 🤖 | 🤗 | Qwen2-Audio-7B-Instruct 🤖 | 🤗 | Demo 🤖 | 🤗
📑 Paper | 📑 Blog | 💬 WeChat (微信) | Discord
We introduce the latest progress of Qwen-Audio: a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or responding directly in text to speech instructions. We introduce two distinct audio interaction modes:
* voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
* audio analysis: users can provide audio and text instructions for analysis during the interaction.
**We've released two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.**
## Architecture
Overview of the three-stage training process of Qwen2-Audio (architecture figure in the original README).
## News and Updates
* 2024.8.9 🎉 We released the checkpoints of both `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` on ModelScope and Hugging Face.
* 2024.7.15 🎉 We released the paper of **Qwen2-Audio**, introducing the relevant model structure, training methods, and model performance. Check our [report](https://arxiv.org/abs/2407.10759) for details!
* 2023.11.30 🔥 We released the **Qwen-Audio** series.
## Evaluation
We evaluated Qwen2-Audio's abilities on 13 standard benchmarks, as follows:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev \| test | WER |
| | | Aishell2 | test | WER |
| | | Librispeech | dev \| test | WER |
| | | Common Voice | dev \| test | WER |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common voice | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Sound | Clotho | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Music | MusicCaps | dev \| test | GPT-4 Eval |
| | Chat-Benchmark-Mixed-Audio | Common voice, AudioCaps, MusicCaps | dev \| test | GPT-4 Eval |
Below is the overall performance:
The details of evaluation are as follows:
(Note: the evaluation results presented here are based on the initial model under the original training framework; the scores fluctuated somewhat after converting the framework to Hugging Face. We therefore present both complete sets of results, starting with the initial results from the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.3 \| 3.4 \| 1.6 \| 3.6 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.6 \| 6.9 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.5 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.0 \| 3.0 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.9 \| 35.2 \| 45.2 \| 24.4 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 40.0 \| 38.5 \| 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.18 \| 6.99 \| 6.79 \| 6.77 |
(The second set of results below was obtained after converting the model to the Hugging Face framework.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.7 \| 3.6 \| 1.7 \| 4.0 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.7 \| 6.5 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.0 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.2 \| 3.1 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.6 \| 33.6 \| 45.6 \| 24.0 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 38.7 \| 37.2 \| 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.24 \| 6.83 \| 6.73 \| 6.42 |
We have provided **all** evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.
## Requirements
The code for Qwen2-Audio has been merged into the latest Hugging Face Transformers. We advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error:
```
KeyError: 'qwen2-audio'
```
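As a quick sanity check (a sketch of ours, not from the official repo): the import below fails on Transformers builds that predate Qwen2-Audio support, so it catches the problem before you download any checkpoints.

```python
# If this raises ImportError, rebuild Transformers from source as described above.
from transformers import Qwen2AudioForConditionalGeneration  # noqa: F401

print("This Transformers build includes Qwen2-Audio support.")
```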
## Quickstart
Below, we provide simple examples showing how to use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers.
Before running the code, make sure you have set up the environment: meet the requirements above, then install the dependent libraries.
Now you can start with ModelScope or Transformers. Qwen2-Audio models currently perform best with audio clips under 30 seconds, as in the trimming sketch below.
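A minimal helper sketch for the 30-second note above (our own helper, not part of the repo; the function name and the hard cap are assumptions):

```python
import librosa

def load_clip(path, sr=16000, max_seconds=30.0):
    # Load at the target sampling rate and truncate, since Qwen2-Audio
    # currently performs best on clips under 30 seconds.
    audio, _ = librosa.load(path, sr=sr)
    return audio[: int(max_seconds * sr)]
```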
#### 🤗 Transformers
In the following, we demonstrate how to use `Qwen2-Audio-7B-Instruct` for inference, supporting both voice chat and audio analysis modes. Note that we use the ChatML format for dialog; in this demo we show how to leverage `apply_chat_template` for this purpose.
##### Voice Chat Inference
In voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Collect the waveform of every audio element in the conversation,
# resampled to the processor's expected sampling rate.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]  # strip the prompt tokens

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
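To continue the dialog for another turn, append the decoded reply to `conversation` and repeat the template, preprocessing, and generation steps (a minimal sketch, assuming the variables above are still in scope):

```python
# Append the model's reply so the next apply_chat_template call
# sees the full conversation history.
conversation.append({"role": "assistant", "content": response})
```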
##### Audio Analysis Inference
In audio analysis mode, users can provide both audio and text instructions for analysis:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Gather the waveforms for every audio element, resampled to the
# processor's expected sampling rate.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele["audio_url"]).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]  # strip the prompt tokens

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
##### Batch Inference
We also support batch inference:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]},
]
conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
conversations = [conversation1, conversation2]

# One templated prompt per conversation; the audio clips are collected
# in the same order across all conversations.
text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]
audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]  # strip the prompt tokens

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
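Here `batch_decode` returns one decoded string per conversation, in input order; a minimal usage follow-up, assuming the snippet above ran as written:

```python
# One decoded reply per conversation, in the same order as `conversations`.
for i, reply in enumerate(response, start=1):
    print(f"Conversation {i}: {reply}")
```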
Running the Qwen2-Audio pretrained base model is also simple.
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

# The base model takes a raw prompt: the audio placeholder is wrapped in
# special tokens, followed by the task instruction.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=audio, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]  # strip the prompt tokens
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
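The base model is prompt-driven rather than chat-driven, so other tasks reuse the same pipeline with a different instruction after the audio tokens. The alternative instruction below is a hypothetical example of ours, not an officially documented prompt:

```python
# Hypothetical task prompt (assumption): same audio-token wrapper, different instruction.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Transcribe the speech into English text:"
```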
#### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope; `snapshot_download` can help you solve issues concerning downloading checkpoints, as in the sketch below.
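A minimal sketch of the ModelScope route (assuming the `modelscope` package is installed and the model id matches the Hugging Face one): download the checkpoint locally with `snapshot_download`, then point Transformers at the returned directory.

```python
from modelscope import snapshot_download
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Download (or reuse a cached copy of) the checkpoint from ModelScope.
model_dir = snapshot_download("Qwen/Qwen2-Audio-7B-Instruct")
processor = AutoProcessor.from_pretrained(model_dir)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_dir, device_map="auto")
```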
## Demo
### Web UI
We provide code for users to build a web UI demo. Before you start, make sure you have installed the following packages:
```
pip install -r requirements_web_demo.txt
```
Then run the command below and click on the generated link:
```
python demo/web_demo_audio.py
```
## Demos
More impressive cases will be updated on [Qwen's blog](https://qwenlm.github.io/blog/qwen2-audio).
## We Are Hiring
If you are interested in joining us, whether full-time or as an intern, please contact us at `qwen_audio@list.alibaba-inc.com`.
## License Agreement
Check the license of each model inside its HF repo. It is NOT necessary for you to submit a request for commercial usage.
## Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
```BibTeX
@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}
```
```BibTeX
@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}
```
## Contact Us
If you are interested in leaving a message for either our research team or our product team, feel free to send an email to `qianwen_opensource@alibabacloud.com`.