# OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

[[📖 arXiv Paper](https://arxiv.org/pdf/2501.04561)] [[📊 Datasets](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni)] [[🏆 Models](https://huggingface.co/Tongyi-ConvAI/OpenOmni)]

OpenOmni is a pioneering, fully open-source, end-to-end method that incorporates image, speech and text into a single omni large language model. Its speech generation design, based on language bridging and text-guided speech, can be trained quickly even when omni-modal data and VRAM are scarce. Beyond omni-modal understanding, OpenOmni supports two real-time emotional speech generation modes, CTC mode and AR mode, so users can trade off generation speed against quality as needed. The flexible framework makes it easy to apply OpenOmni to downstream tasks such as speech embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to try it!

## 🔥 Update
- [2025/02/12] 🔥 Added some missing files and fixed bugs.
- [2025/01/13] 🔥 OpenOmni is here! We released the [code](https://github.com/RainBowLuoCS/OpenOmni), [model](https://huggingface.co/Tongyi-ConvAI/OpenOmni) and [data](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni).
- [2025/01/09] 🔥 After two months of company audit, we released the [paper](https://arxiv.org/pdf/2501.04561).
- [2024/11/14] 🔥 We submitted the [paper](https://arxiv.org/pdf/2501.04561) for peer review.
- [2024/09/15] 🔥 We wrote the first line of the OpenOmni project, a pioneering, fully open-source, end-to-end OmniLLM.

## 👀 Contents
+ Setup
+ Model
+ Preparation
+ Train
+ Evaluation
+ Example
+ Citation

## 📷 Setup
Please follow the instructions below to install the required packages.

1. Clone this repository

```plain
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
```

2. Install the package

```plain
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
```

3. Install additional packages for training

```plain
pip install flash-attn --no-build-isolation
```
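
After installation, a quick sanity check can confirm that the interpreter and the optional training package are visible. The sketch below is only an illustration, not part of the repository: `check_environment` is a hypothetical helper, and it assumes the `flash-attn` wheel is imported as `flash_attn` and that the environment was created with Python 3.10 as in the commands above.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 10), modules=("flash_attn",)):
    """Report whether the Python version and the given modules look right."""
    report = {
        # Compare only major.minor, matching `conda create ... python=3.10`.
        "python_ok": sys.version_info[:2] == required_python,
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing or mismatched'}")
```
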
## 🔥 Fast Usage

After downloading the weights and configuring the paths properly, you also need two open-source speech tokenizers with different vocabulary sizes for speech discretization and reconstruction: [CosyVoice for 6K CTC mode](https://github.com/FunAudioLLM/CosyVoice) and [GLM-4-Voice for 16K AR mode](https://github.com/THUDM/GLM-4-Voice).

Fast inference for omnimodal input (speech, text, image and video)
```plain
python inference.py
```

Fast interaction for omnimodal input (speech, text, image and video)
```plain
python demo.py
```
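
To give some intuition for the CTC mode mentioned above: CTC-style decoders typically predict a label for every frame in one pass and then collapse the result, whereas an AR decoder emits one token at a time. The sketch below shows only the generic CTC collapse rule (merge repeats, drop blanks). It is not OpenOmni's decoder, and `blank_id=0` and the example ids are assumptions made for illustration.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a unit sequence (generic CTC rule)."""
    units = []
    previous = None
    for idx in frame_ids:
        # Emit a unit only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            units.append(idx)
        previous = idx
    return units


if __name__ == "__main__":
    # A blank between two equal ids keeps both occurrences.
    print(ctc_greedy_collapse([0, 5, 5, 0, 5, 7, 7, 0]))
```
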

## Model
![](assets/framework.png)

Here are the pretrained weights and instruction tuning weights

| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
| --- | --- | --- | --- | --- | --- |
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |

## Preparation
### Dataset
Please follow [MMEvol](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/mmevol) to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.

The following is the data directory tree of OpenOmni

### data structure
```plain
datasets
├── json # data recipe
│   ├── openomni_stage1-1.json # speech2text pretraining
│   ├── openomni_stage2-1.json # image2text pretraining
│   ├── openomni_stage2-2.json # image2text instruction tuning
│   ├── openomni_stage3-1.json # text2speech pretraining
│   ├── openomni_stage3-2.json # text2speech emotional injection
├── asr # classic bilingual speech corpus
│   ├── AISHELL-4
│   ├── LibriSPeech
│   ├── WeNetSpeech
├── audio_en # synthetic English speech corpus for question
├── audio_llava # synthetic bilingual speech corpus for answer
├── audio_zh # synthetic Chinese speech corpus for question
├── audio_unit # synthetic bilingual speech corpus for answer
├── audio_prefer # synthetic emotional bilingual speech corpus for answer
├── audio_reject # synthetic emotional bilingual speech corpus for answer
├── audio_ultrachat # synthetic bilingual speech corpus for answer
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......

```

+ All files and paths starting with "audio" are self-synthesized.
+ The DPO data contains approximately 9k "prefer" and "reject" entries, covering 9 types of emotions.

More details about data curation can be found in our [paper](https://arxiv.org/pdf/2501.04561).
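
Before training, it can help to verify that the directory tree above is in place. The following is a minimal sketch, not a script from the repository: `missing_entries` is a hypothetical helper, and `EXPECTED` lists only the entries visible in the tree (which is truncated with "......"), so passing the check does not guarantee the full dataset is present.

```python
from pathlib import Path

# Entries copied from the directory tree above (visible part only).
EXPECTED = [
    "json/openomni_stage1-1.json",
    "json/openomni_stage2-1.json",
    "json/openomni_stage2-2.json",
    "json/openomni_stage3-1.json",
    "json/openomni_stage3-2.json",
    "asr/AISHELL-4",
    "asr/LibriSPeech",
    "asr/WeNetSpeech",
    "audio_en",
    "audio_llava",
    "audio_zh",
    "audio_unit",
    "audio_prefer",
    "audio_reject",
    "audio_ultrachat",
    "ai2d/categories.json",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "datasets" is the root directory name used in the tree above.
    for rel in missing_entries("datasets"):
        print(f"missing: {rel}")
```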

## Train
### Speech2Text Pretrain
Please download MMEvol, AIShell-4, LibriSpeech, WeNetSpeech and the OpenOmni data, and organize them as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
```

### Image2Text Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
```

### Image2Text Instruction Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetue.sh
```

### Text2Speech Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
bash scripts/train/llama3/text2speech_pretrain.sh
bash scripts/train/qwen2/text2speech_pretrain.sh
```

### Text2Speech Emotional DPO Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
bash scripts/train/llama3/text2speech_dpo.sh
bash scripts/train/qwen2/text2speech_dpo.sh
```
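
For readers unfamiliar with DPO, the "prefer" and "reject" pairs are used to push the model toward the preferred answer relative to a frozen reference model. The sketch below computes the standard DPO loss for a single pair from sequence log-probabilities. It illustrates the general objective only and is not taken from the training scripts; `dpo_loss`, its argument names, and `beta=0.1` are assumptions.

```python
import math


def dpo_loss(policy_prefer_logp, policy_reject_logp,
             ref_prefer_logp, ref_reject_logp, beta=0.1):
    """Standard DPO loss for one (prefer, reject) pair of log-probabilities."""
    # How much more the policy favors "prefer" over "reject" than the reference does.
    margin = (policy_prefer_logp - ref_prefer_logp) - (
        policy_reject_logp - ref_reject_logp
    )
    x = beta * margin
    # loss = -log(sigmoid(x)) = softplus(-x), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # The policy already leans toward the preferred answer, so the loss is below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2); it decreases as the policy favors the preferred sample more strongly than the reference does.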

## Evaluation
### Dataset
Ensure that your api_base, key and dataset are correctly configured before evaluation.
### data structure
```plain
datasets
├── json # data recipe
│   ├── aishell2_eval.jsonl # AIShell2 evaluation
│   ├── librispeech_eval.jsonl # LibriSpeech evaluation
│   ├── wenetspeech_eval.json # WeNetSpeech evaluation
│   ├── openomni_emotion_val.json
├── OmniBench # OmniBench
│   ├── mmdata
│   ├── dataset
│   ├── eval.json
├── Ov-Odyssey # Ov-Odyssey Bench
│   ├── av_odyssey_part1.parquet
│   ├── av_odyssey_part2.parquet
│   ├── av_odyssey_part3.parquet
│   ├── av_odyssey_part4.parquet
│   ├── av_odyssey_part5.parquet

```
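
Note that the evaluation recipes above mix `.jsonl` and `.json` files. A small loader that handles both can be useful when inspecting them. This is a sketch only: `load_records` is a hypothetical helper, and it assumes `.jsonl` files hold one JSON object per line and makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a .jsonl file line by line, or any other file as a single JSON document."""
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        if path.suffix == ".jsonl":
            # Skip blank lines, which are common at the end of JSON Lines files.
            return [json.loads(line) for line in handle if line.strip()]
        return json.load(handle)
```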

### Speech-Text Evaluation
Make sure the corresponding evaluation script is set up with the correct settings (data path, weight path, and hyper-parameters).

```plain
python openomni/eval/llama3/asr_eavl.py
python openomni/eval/qwen2/asr_eavl.py
```

| Model | LibriSpeech-test-clean | LibriSpeech-test-other | AIShell2-dev | AIShell2-test | WeNetSpeech-testnet | WeNetSpeech-testmeeting |
| --- | --- | --- | --- | --- | --- | --- |
| VITA | 8.1 | 18.4 | | | 12.2 | 16.5 |
| EMOVA | 4.0 | 8.6 | 10.6 | 10.3 | | |
| MINI-OMNI | 4.5 | 9.7 | | | | |
| Freeze-Omni | 3.29 | 7.4 | | | 8.57 | 10.09 |
| ours | 2.57 | 5.6 | 6.81 | 6.87 | 7.63 | |
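
The table does not name its metric. ASR results on LibriSpeech are conventionally reported as word error rate (WER, lower is better), with character error rate commonly used for Chinese corpora, so the sketch below assumes WER. It is a generic edit-distance implementation for illustration, not the scoring code used by the evaluation scripts, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("she sells seashells by the seashore",
                          "she sells sea shells by the seashore"))
```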

### Image-Text Evaluation
Refer to MMEvol for details of the OpenCompass vision-language evaluation.

```plain
# run on all 12 datasets listed below
./script/run_inference.sh OpenOmni-Qwen "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK" all

# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh OpenOmni-Qwen MME all
# MMMU_DEV_VAL
./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all
.....
```
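
The per-dataset commands above follow one pattern, so they can be generated instead of typed by hand. The sketch below only builds and prints the command lines (a dry run). `build_commands` is a hypothetical helper, and the script path, model name and dataset names are copied from the commands above.

```python
import shlex

# Dataset names copied from the "run on all datasets" command above.
DATASETS = (
    "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar "
    "MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK"
).split()


def build_commands(model="OpenOmni-Qwen", datasets=DATASETS,
                   script="./script/run_inference.sh"):
    """Return one argument list per dataset, in the order given."""
    return [[script, model, name, "all"] for name in datasets]


if __name__ == "__main__":
    for cmd in build_commands():
        # Print only; pass cmd to subprocess.run to actually launch it.
        print(shlex.join(cmd))
```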

### Speech-Text-Image Evaluation
Please download OmniBench and run the following commands:

```plain
python openomni/eval/llama3/omni_eavl.py
python openomni/eval/qwen2/omni_eavl.py
```

### Speech-Text-Image-Video Evaluation
Please download Ov-Odyssey and run the following commands:

```plain
python openomni/eval/llama3/ov_odyssey_eavl.py
python openomni/eval/qwen2/ov_odyssey_eavl.py
```

### Text-Speech Evaluation
```plain
python openomni/eval/llama3/t2s_eavl.py
python openomni/eval/qwen2/t2s_eavl.py
```

### Emotional Text-Speech Evaluation
```plain
python openomni/eval/llama3/et2s_eavl.py
python openomni/eval/qwen2/et2s_eavl.py
```
## 📌 Cases

**四是四，十是十，十四是十四，四十是四十。** (Four is four, ten is ten, fourteen is fourteen, forty is forty.)

**黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。** (Black fertilizer turns gray, gray fertilizer turns black; black fertilizer that turns gray evaporates, and gray fertilizer that evaporates turns black.)

**吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。** (Eat grapes without spitting out the skins; don't eat grapes, yet spit out the skins.)

[四是四，十是十，十四是十四，四十是四十。](https://github.com/user-attachments/assets/64dcbe0d-6f28-43ce-916e-5aea264f13f0)

[黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。](https://github.com/user-attachments/assets/996e5ec9-8baa-491d-a731-51d454fca493)


[吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。](https://github.com/user-attachments/assets/e7035bc0-1b11-4b9c-9491-e86c289daa2f)

**八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。** (Eight hundred pacesetter soldiers rush up the north slope, the artillerymen run side by side on the north side; the artillerymen fear bumping into the pacesetters, and the pacesetters fear bumping into the artillerymen's cannons.)

**红凤凰，黄凤凰，粉红凤凰，花凤凰。** (Red phoenix, yellow phoenix, pink phoenix, flowery phoenix.)

**牛郎年年恋刘娘，刘娘念念恋牛郎。** (The cowherd longs for Lady Liu year after year; Lady Liu longs for the cowherd with every thought.)

[八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。](https://github.com/user-attachments/assets/626c5732-2386-49cb-992c-0bd251af40df)

[红凤凰，黄凤凰，粉红凤凰，花凤凰。](https://github.com/user-attachments/assets/2d5e862b-abb1-4656-b80f-1576f730005e)

[牛郎年年恋刘娘，刘娘念念恋牛郎。](https://github.com/user-attachments/assets/89207b65-7855-425d-84ae-0badb5c1e73f)

**She sells seashells by the seashore.**

**Peter Piper picked a peck of pickled peppers.**

**Six slippery snails slid slowly seaward.**


[en_0.webm](https://github.com/user-attachments/assets/cc61b680-1f80-416e-89f7-418222f2de74)


[en_1.webm](https://github.com/user-attachments/assets/74c058dd-9674-4832-9a08-fa882a16d539)

[en_2.webm](https://github.com/user-attachments/assets/bcdbf12d-c5e0-4373-bc92-625fb61fe9ab)

**Six sleek swans swam swiftly southwards.**

**I saw Susie sitting in a shoeshine shop.**

**Can you can a can as a canner can can a can?**

[en_3.webm](https://github.com/user-attachments/assets/aab3314f-b03c-4398-a935-e013aac02235)

[en_4.webm](https://github.com/user-attachments/assets/6b4cdf14-4a87-4dce-8063-252ef5078428)

[en_5.webm](https://github.com/user-attachments/assets/9d0794f0-a36b-415d-a264-8935bbf96921)

## 📚 Video Demo

https://github.com/user-attachments/assets/cd679b7c-9f9d-4631-a1f5-96b1428a8ad4

## 📚 Citation

If you find this repo useful for your research, please consider citing the following papers:

```
@article{luo2025openomni,
title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
journal={arXiv preprint arXiv:2501.04561},
year={2025}
}
```
```
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
```

## 📧 Contact

If you have any questions, please contact us for help:

- Run Luo: r.luo@siat.ac.cn

- Haonan Zhang: zchiowal@gmail.com

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA) and [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni): the codebases we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use OpenOmni.

- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 6k vocabulary size!

- [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 16k vocabulary size!