https://github.com/QwenLM/Qwen-Audio

The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.

Host: GitHub
URL: https://github.com/QwenLM/Qwen-Audio
Owner: QwenLM
License: other
Created: 2023-11-07T06:31:39.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-05T09:17:49.000Z (10 months ago)
Last Synced: 2024-11-11T18:02:59.665Z (6 months ago)
Language: Python
Size: 24.6 MB
Stars: 1,480
Watchers: 25
Forks: 106
Open Issues: 57
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

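StarryDivineSky - QwenLM/Qwen-Audio - Qwen-Audio accepts diverse audio (human speech, natural sounds, music, and songs) and text as input, and outputs text. Contributions include: `Fundamental audio models`: a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building on Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogue and supporting diverse audio scenarios. `Multi-task learning framework for all types of audio`: to scale up audio-language pre-training, we propose a multi-task training framework that enables knowledge sharing and avoids one-to-many interference, addressing the challenge of varying text labels across datasets. Our model covers more than 30 tasks, and extensive experiments show that it achieves strong performance. `Strong performance`: achieves impressive results across benchmark tasks without any task-specific fine-tuning, surpassing its counterparts, with state-of-the-art results on the test sets of Aishell1, cochlscene, ClothoAQA, and VocalSound. `Flexible multi-turn chat from audio and text input`: supports multi-audio analysis, sound understanding and reasoning, music appreciation, and tool usage. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)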

README

        









[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-1?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-android-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-android-1?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-ios)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-ios?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-aishell-2-test-mic-1)](https://paperswithcode.com/sota/speech-recognition-on-aishell-2-test-mic-1?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/acoustic-scene-classification-on-cochlscene)](https://paperswithcode.com/sota/acoustic-scene-classification-on-cochlscene?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/acoustic-scene-classification-on-tut-acoustic)](https://paperswithcode.com/sota/acoustic-scene-classification-on-tut-acoustic?p=qwen-audio-advancing-universal-audio) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/audio-classification-on-vocalsound)](https://paperswithcode.com/sota/audio-classification-on-vocalsound?p=qwen-audio-advancing-universal-audio) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/audio-captioning-on-clotho)](https://paperswithcode.com/sota/audio-captioning-on-clotho?p=qwen-audio-advancing-universal-audio) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-librispeech-test-clean)](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/emotion-recognition-in-conversation-on-meld)](https://paperswithcode.com/sota/emotion-recognition-in-conversation-on-meld?p=qwen-audio-advancing-universal-audio)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/qwen-audio-advancing-universal-audio/speech-recognition-on-librispeech-test-other)](https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-other?p=qwen-audio-advancing-universal-audio)

**Qwen-Audio** (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as input, and produces text as output. The contributions of Qwen-Audio include:

- **Fundamental audio models**: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.

- **Multi-task learning framework for all types of audio**: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.

- **Strong Performance**: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.

- **Flexible multi-turn chat from audio and text input**: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.






    






We have released two models in the Qwen-Audio series:

- Qwen-Audio: A pre-trained multi-task audio understanding model that uses Qwen-7B to initialize the LLM and [Whisper-large-v2](https://github.com/openai/whisper) to initialize the audio encoder.

- Qwen-Audio-Chat: A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.




## News and Updates

* 2023.11.30 🔥 We have released the checkpoints of both **Qwen-Audio** and **Qwen-Audio-Chat** on ModelScope and Hugging Face.

* 2023.11.15 🎉 We released a [paper](http://arxiv.org/abs/2311.07919) with details on the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.




## Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks, as follows:



    



The overall performance is shown below:



    



The details of evaluation are as follows:

### Automatic Speech Recognition



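Results (WER):

| Dataset | Model | dev-clean | dev-other | test-clean | test-other |
| --- | --- | --- | --- | --- | --- |
| Librispeech | SpeechT5 | 2.1 | 5.5 | 2.4 | 5.8 |
| Librispeech | SpeechNet | - | - | 30.7 | - |
| Librispeech | SLM-FT | - | - | 2.6 | 5.0 |
| Librispeech | SALMONN | - | - | 2.1 | 4.9 |
| Librispeech | Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |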

  

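Results (WER):

| Dataset | Model | dev | test |
| --- | --- | --- | --- |
| Aishell1 | MMSpeech-base | 2.0 | 2.1 |
| Aishell1 | MMSpeech-large | 1.6 | 1.9 |
| Aishell1 | Paraformer-large | - | 2.0 |
| Aishell1 | Qwen-Audio | 1.2 (SOTA) | 1.3 (SOTA) |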

  

Results (WER):

| Dataset | Model | Mic | iOS | Android |
| --- | --- | --- | --- | --- |
| Aishell2 | MMSpeech-base | 4.5 | 3.9 | 4.0 |
| Aishell2 | Paraformer-large | - | 2.9 | - |
| Aishell2 | Qwen-Audio | 3.3 | 3.1 | 3.3 |
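
For readers unfamiliar with the metric, here is a minimal sketch of word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the repository's own evaluation scripts may normalize text differently, so this is not a way to reproduce the numbers above. The two sample sentences are the transcripts quoted in the Quickstart output comments of this README; treating the first as the reference is purely for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel"
hypothesis = "mister quilting is the apostle of the middle classes and we are glad to welcome his gospel"

# One substituted word out of 17 reference words.
print(round(100 * word_error_rate(reference, hypothesis), 2))  # 5.88
```

The value is scaled by 100 here on the assumption that the tables report WER as a percentage.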

  

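### Speech-to-text Translation

Results (BLEU):

| Dataset | Model | en-de | de-en | en-zh | zh-en | es-en | fr-en | it-en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoVoST2 | SALMONN | 18.6 | - | 33.1 | - | - | - | - |
| CoVoST2 | SpeechLLaMA | - | 27.1 | - | 12.3 | 27.9 | 25.2 | 25.9 |
| CoVoST2 | BLSP | 14.1 | - | - | - | - | - | - |
| CoVoST2 | Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 | 39.7 | 38.5 | 36.0 |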

  

### Automatic Audio Caption

| Dataset | Model | CIDEr | SPICE | SPIDEr |
| --- | --- | --- | --- | --- |
| Clotho | Pengi | 0.416 | 0.126 | 0.271 |
| Clotho | Qwen-Audio | 0.441 | 0.136 | 0.288 |

  

### Speech Recognition with Word-level Timestamp

| Dataset | Model | AAC (ms) |
| --- | --- | --- |
| Industrial Data | Force-aligner | 60.3 |
| Industrial Data | Paraformer-large-TP | 65.3 |
| Industrial Data | Qwen-Audio | 51.5 (SOTA) |

  

### Automatic Scene Classification

| Dataset | Model | ACC |
| --- | --- | --- |
| Cochlscene | Cochlscene | 0.669 |
| Cochlscene | Qwen-Audio | 0.795 (SOTA) |
| TUT2017 | Pengi | 0.353 |
| TUT2017 | Qwen-Audio | 0.649 |

  

### Speech Emotion Recognition

| Dataset | Model | ACC |
| --- | --- | --- |
| Meld | WavLM-large | 0.542 |
| Meld | Qwen-Audio | 0.557 |

  

### Audio Question & Answer

| Dataset | Model | ACC | ACC (binary) |
| --- | --- | --- | --- |
| ClothoAQA | ClothoAQA | 0.542 | 0.627 |
| ClothoAQA | Pengi | - | 0.645 |
| ClothoAQA | Qwen-Audio | 0.579 | 0.749 |

  

### Vocal Sound Classification

| Dataset | Model | ACC |
| --- | --- | --- |
| VocalSound | CLAP | 0.4945 |
| VocalSound | Pengi | 0.6035 |
| VocalSound | Qwen-Audio | 0.9289 (SOTA) |

  

### Music Note Analysis

| Dataset | Model | NS. Qualities (MAP) | NS. Instrument (ACC) |
| --- | --- | --- | --- |
| NSynth | Pengi | 0.3860 | 0.5007 |
| NSynth | Qwen-Audio | 0.4742 | 0.7882 |

  

We have provided **all** evaluation scripts to reproduce our results. Please refer to [eval_audio/EVALUATION.md](eval_audio/EVALUATION.md) for details.

### Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide a [tutorial](TUTORIAL.md) and a demo for users.

## Requirements

* Python 3.8 and above

* PyTorch 1.12 and above; 2.0 and above is recommended

* CUDA 11.4 and above is recommended (for GPU users)

* FFmpeg
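
The list above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names `major_minor` and `check_environment` are hypothetical, it only looks at the Python version, the installed `torch` package version, and whether an `ffmpeg` executable is on `PATH`, and it does not check CUDA.

```python
import re
import shutil
import sys
from importlib import metadata


def major_minor(version_text):
    """Reduce a version string such as '2.1.0+cu118' to a (major, minor) tuple."""
    numbers = []
    for piece in version_text.split(".")[:2]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def check_environment():
    """Return a dict that maps each checked requirement to True or False."""
    report = {"Python >= 3.8": sys.version_info >= (3, 8)}
    try:
        torch_version = metadata.version("torch")
        report["PyTorch >= 1.12"] = major_minor(torch_version) >= (1, 12)
    except metadata.PackageNotFoundError:
        # PyTorch is not installed in this environment.
        report["PyTorch >= 1.12"] = False
    report["FFmpeg on PATH"] = shutil.which("ffmpeg") is not None
    return report


for requirement, satisfied in check_environment().items():
    print(("OK      " if satisfied else "MISSING ") + requirement)
```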




## Quickstart

Below, we provide simple examples to show how to use Qwen-Audio and Qwen-Audio-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and meet the requirements above, and then install the dependent libraries.

```bash
pip install -r requirements.txt
```

Now you can start with ModelScope or Transformers. For more usage, please refer to the [tutorial](TUTORIAL.md). Qwen-Audio models currently perform best with audio clips under 30 seconds.
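
If a recording is longer than that, one option is to process it in shorter windows. The sketch below only computes the window boundaries; it is not something the repository provides. The function name `split_into_windows` and the 25-second default are assumptions chosen to stay under the 30-second guideline, and cutting the audio itself is left to whatever audio tool you already use.

```python
def split_into_windows(duration_s, max_len_s=25.0):
    """Return consecutive (start, end) pairs in seconds that cover the clip."""
    if duration_s <= 0 or max_len_s <= 0:
        raise ValueError("duration_s and max_len_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it ends exactly at the clip end.
        end = min(start + max_len_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


print(split_into_windows(75.0))
# [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0)]
```

Windows that cut through a word may hurt transcription at the boundaries, so overlapping windows or splitting on silence may work better in practice.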

#### 🤗 Transformers

To use Qwen-Audio-Chat for inference, all you need is a few lines of code, as demonstrated below. However, **please make sure that you are using the latest code.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
```
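
The second reply embeds its timestamps as `<|seconds|>` tokens. If you need them as numbers, a small parser is enough. This is a minimal sketch that assumes replies follow the format of the sample output above; the names `extract_timestamps` and `extract_span` are hypothetical and are not part of the model's API.

```python
import re

# Matches timestamp tokens such as <|2.33|>, as seen in the sample reply.
TIMESTAMP_TOKEN = re.compile(r"<\|(\d+(?:\.\d+)?)\|>")


def extract_timestamps(response):
    """Return every <|seconds|> token in the reply as a float, in order."""
    return [float(value) for value in TIMESTAMP_TOKEN.findall(response)]


def extract_span(response):
    """Return (start, end) in seconds, or None if fewer than two timestamps are found."""
    stamps = extract_timestamps(response)
    if len(stamps) < 2:
        return None
    return stamps[0], stamps[1]


reply = 'The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.'
print(extract_span(reply))  # (2.33, 3.26)
```

Returning `None` instead of raising keeps the caller in control when the model answers in free text without timestamps.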

Running the pretrained Qwen-Audio base model is also simple.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

audio_url = "assets/audio/1272-128104-0000.flac"
sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
# The audio path is wrapped in <audio>...</audio> tags so the tokenizer can locate it in the query.
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)
# <audio>assets/audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
```
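
As the sample output shows, the decoded string echoes the query and keeps the `<|...|>` special tokens. The helper below is a minimal sketch for recovering only the transcript. The name `extract_transcript` is hypothetical, and the sample strings (including the `<audio>` wrapper around the path) are assumptions about the query format rather than output captured from the model.

```python
import re

# Any token of the form <|...|>, for example <|en|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def extract_transcript(decoded, query):
    """Drop the echoed query if present, then remove remaining <|...|> tokens."""
    text = decoded[len(query):] if decoded.startswith(query) else decoded
    return SPECIAL_TOKEN.sub("", text).strip()


sample_query = (
    "<audio>assets/audio/1272-128104-0000.flac</audio>"
    "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
)
sample_decoded = sample_query + "mister quilting is the apostle of the middle classes<|endoftext|>"

print(extract_transcript(sample_decoded, sample_query))
# mister quilting is the apostle of the middle classes
```

Removing the query as a prefix, rather than searching for the audio path, avoids touching a transcript that happens to contain similar text.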

If a network issue prevents you from downloading the model checkpoints and code from Hugging Face, you can instead fetch the checkpoint from ModelScope first and then load it from the local directory, as outlined below:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'
model_dir = snapshot_download(model_id, revision=revision)

# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```
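
If you want the switch between the two sources to happen automatically, the pattern is a plain ordered fallback. The sketch below shows only that pattern and is not code from the repository: `load_with_fallback` is a hypothetical helper, and the two loaders are stand-ins that simulate a failed download and a successful local load so the example runs without any model files.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order and return (name, result) of the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # broad on purpose: download errors differ between libraries
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def from_hugging_face():
    # Stand-in for loading directly by model name.
    raise ConnectionError("simulated network failure")


def from_modelscope_snapshot():
    # Stand-in for downloading a snapshot and loading it from the local directory.
    return "model loaded from local directory"


source, model = load_with_fallback([
    ("hugging_face", from_hugging_face),
    ("modelscope_snapshot", from_modelscope_snapshot),
])
print(source)  # modelscope_snapshot
```

In real use, each loader would wrap the corresponding loading calls shown in this section.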

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'
model_dir = snapshot_download(model_id, revision=revision)

torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use gpu
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
```

## Demo

### Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python web_demo_audio.py
```




## FAQ

If you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.




## We Are Hiring

If you are interested in joining us as a full-time employee or an intern, please contact us at [email protected].




## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-Audio and Qwen-Audio-Chat. We also allow their commercial use. See [LICENSE](LICENSE) for more details.




## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

```BibTeX
@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}
```




## Contact Us

If you would like to leave a message for either our research team or our product team, feel free to send an email to [email protected].
