https://github.com/rainbowluocs/openomni

OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

Host: GitHub
URL: https://github.com/rainbowluocs/openomni
Owner: RainBowLuoCS
Created: 2025-01-11T13:06:05.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-17T01:26:45.000Z (about 1 year ago)
Last Synced: 2025-03-17T02:36:28.089Z (about 1 year ago)
Topics: image, large-language-model, large-multimodal-models, multimodal, multimodal-large-language-models, omni, speech
Language: Python
Homepage:
Size: 8.39 MB
Stars: 36
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

[[📖 arXiv Paper](https://arxiv.org/pdf/2501.04561)] [[📊 Datasets](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni)] [[🏆 Models](https://huggingface.co/Tongyi-ConvAI/OpenOmni)]

OpenOmni is a fully open-source, end-to-end method that incorporates image, speech, and text into a single omnimodal large language model. Its speech generation design, based on language bridging and text-guided speech, can be trained quickly even when omnimodal data and VRAM are scarce. OpenOmni supports omnimodal understanding as well as two real-time emotional speech generation modes, CTC mode and AR mode, so users can choose the balance between generation speed and quality that fits their needs. The flexible framework design lets OpenOmni be applied quickly to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to try it now!

## 🔥 Update
- [2025/02/12]🔥We added some missing files and fixed known bugs.
- [2025/01/13]🔥OpenOmni is coming! We released the [code](https://github.com/RainBowLuoCS/OpenOmni), [model](https://huggingface.co/Tongyi-ConvAI/OpenOmni) and [data](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni).
- [2025/01/09]🔥After a two-month company audit, we released the [paper](https://arxiv.org/pdf/2501.04561).
- [2024/11/14]🔥We submitted the [paper](https://arxiv.org/pdf/2501.04561) for peer review.
- [2024/09/15]🔥We wrote the first line of the OpenOmni project, aiming for a fully open-source, end-to-end OmniLLM.

## 👀 Contents
+ Setup
+ Model
+ Preparation
+ Train
+ Evaluation
+ Example
+ Citation

## 📷 Setup
Please follow the instructions below to install the required packages.

1. Clone this repository

```plain
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
```

2. Install the package

```plain
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
```

3. Install additional packages for training

```plain
pip install flash-attn --no-build-isolation
```
## 🔥 Fast Usage

After downloading the weights and configuring the paths, you also need two open-source speech tokenizers with different vocabulary sizes for speech discretization and reconstruction: [CosyVoice for 6K CTC mode](https://github.com/FunAudioLLM/CosyVoice) and [GLM-4-Voice for 16K AR mode](https://github.com/THUDM/GLM-4-Voice).

Fast inference for omnimodal input (speech, text, image, and video):
```plain
python inference.py
```

Fast interaction for omnimodal input (speech, text, image, and video):
```plain
python demo.py
```

## Model
![](assets/framework.png)

Here are the pretrained weights and instruction tuning weights:

| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
| --- | --- | --- | --- | --- | --- |
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |
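The five checkpoints map one-to-one onto the training stages in the Train section. As a reading aid, the sketch below collects each stage's recipe file, purpose, and training script in one place. `Stage`, `STAGES`, and `training_plan` are hypothetical names introduced here, not part of the repository, and the script stems are written in a normalized form that should be checked against the actual files under `scripts/train/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    stage_id: str  # stage label from the table above
    recipe: str    # recipe file from the "IT Data" column
    purpose: str   # description taken from the data directory tree
    script: str    # training script stem (assumed spelling, verify in the repo)


STAGES = [
    Stage("1-1", "openomni_stage1-1.json", "speech2text pretraining", "speech2text_pretrain"),
    Stage("2-1", "openomni_stage2-1.json", "image2text pretraining", "image2text_pretrain"),
    Stage("2-2", "openomni_stage2-2.json", "image2text instruction tuning", "image2text_finetune"),
    Stage("3-1", "openomni_stage3-1.json", "text2speech pretraining", "text2speech_pretrain"),
    Stage("3-2", "openomni_stage3-2.json", "text2speech emotional injection", "text2speech_dpo"),
]


def training_plan(backbone: str) -> list[str]:
    """Return the shell commands for all stages, in order, for one backbone."""
    if backbone not in ("llama3", "qwen2"):
        raise ValueError(f"unknown backbone: {backbone!r}")
    return [f"bash scripts/train/{backbone}/{stage.script}.sh" for stage in STAGES]


if __name__ == "__main__":
    for stage, command in zip(STAGES, training_plan("qwen2")):
        print(f"{stage.stage_id}  {stage.purpose:<32} {command}")
```

The helper only builds command strings; it does not run anything, so it is safe to use for checking the order before launching a stage by hand.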

## Preparation
### Dataset
Please follow [MMEvol](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/mmevol) to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.

The following is the data directory tree of OpenOmni:

### data structure
```plain
datasets
├── json # data recipe
│ ├── openomni_stage1-1.json # speech2text pretraining
│ ├── openomni_stage2-1.json # image2text pretraining
│ ├── openomni_stage2-2.json # image2text instruction tuning
│ ├── openomni_stage3-1.json # text2speech pretraining
│ ├── openomni_stage3-2.json # text2speech emotional injection
├── asr # classic bilingual speech corpus
│ ├── AISHELL-4
│ ├── LibriSPeech
│ ├── WeNetSpeech
├── audio_en # synthetic english speech corpus for question
├── audio_llava # synthetic bilingual speech corpus for answer
├── audio_zh # synthetic chinese speech corpus for question
├── audio_unit # synthetic bilingual speech corpus for answer
├── audio_prefer # synthetic emotional bilingual speech corpus for answer
├── audio_reject # synthetic emotional bilingual speech corpus for answer
├── audio_ultrachat # synthetic bilingual speech corpus for answer
├── ai2d
│ ├── abc_images
│ ├── annotations
│ ├── images
│ ├── questions
│ └── categories.json
......

```

+ All files and paths starting with "audio" are self-synthesized.
+ The DPO data contains approximately 9k entries for "prefer" and "reject", covering 9 types of emotions.

More details about data curation can be found in our [paper](https://arxiv.org/pdf/2501.04561).
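Before launching training, it can help to confirm that the local layout matches the tree above. The following is a minimal sketch under stated assumptions: `EXPECTED` and `missing_entries` are hypothetical names, the entries are copied from the tree as written (including the `LibriSPeech` spelling), and the image-text folders prepared via MMEvol are left out because the tree elides them.

```python
from pathlib import Path

# Paths relative to the dataset root, copied from the directory tree above.
EXPECTED = [
    "json/openomni_stage1-1.json",
    "json/openomni_stage2-1.json",
    "json/openomni_stage2-2.json",
    "json/openomni_stage3-1.json",
    "json/openomni_stage3-2.json",
    "asr/AISHELL-4",
    "asr/LibriSPeech",
    "asr/WeNetSpeech",
    "audio_en",
    "audio_llava",
    "audio_zh",
    "audio_unit",
    "audio_prefer",
    "audio_reject",
    "audio_ultrachat",
]


def missing_entries(root) -> list[str]:
    """Return the expected files and folders that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("datasets")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  datasets/{rel}")
    else:
        print("All expected entries are present.")
```

The check only tests for existence; it does not validate file contents or the paths referenced inside the recipe JSON files.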

## Train
### Speech2Text Pretrain
Please download MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and the OpenOmni data, and organize them as described in Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

```plain
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
```

### Image2Text Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

```plain
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
```

### Image2Text Instruction Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

```plain
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetue.sh
```

### Text2Speech Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

```plain
bash scripts/train/llama3/text2speech_pretrain.sh
bash scripts/train/qwen2/text2speech_pretrain.sh
```

### Text2Speech Emotional DPO Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).

```plain
bash scripts/train/llama3/text2speech_dpo.sh
bash scripts/train/qwen2/text2speech_dpo.sh
```
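This README does not spell out the objective used in this stage. For orientation, the sketch below computes the standard DPO loss for a single "prefer"/"reject" pair from sequence log-probabilities under the trained policy and a frozen reference model. It is an illustration of the general technique, not the repository's implementation; the function name, argument names, and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(policy_prefer_logp: float,
             policy_reject_logp: float,
             ref_prefer_logp: float,
             ref_reject_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-vs-reference log-ratios of the preferred and the
    rejected response.
    """
    margin = (policy_prefer_logp - ref_prefer_logp) - (policy_reject_logp - ref_reject_logp)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The policy favors the preferred response more than the reference does,
    # so the loss falls below log(2), the value at zero margin.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
    print(math.log(2))
```

In a real training loop the log-probabilities would be summed token log-likelihoods of the speech units and the loss would be averaged over a batch.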

## Evaluation
### Dataset
Ensure that your api_base, key, and dataset are correctly configured before evaluation.
### data structure
```plain
datasets
├── json # data recipe
│ ├── aishell2_eval.jsonl # aishell evaluation
│ ├── librispeech_eval.jsonl # librispeech evaluation
│ ├── wenetspeech_eval.json # wenetspeech evaluation
│ ├── openomni_emotion_val.json
├── OmniBench # OmniBench
│ ├── mmdata
│ ├── dataset
│ ├── eval.json
├── Ov-Odyssey # Ov-Odyssey Bench
│ ├── av_odyssey_part1.parquet
│ ├── av_odyssey_part2.parquet
│ ├── av_odyssey_part3.parquet
│ ├── av_odyssey_part4.parquet
│ ├── av_odyssey_part5.parquet

```

### Speech-Text Evaluation
Make sure to set up the corresponding evaluation script with the correct settings (data path, weight path, and hyperparameters).

```plain
python openomni/eval/llama3/asr_eavl.py
python openomni/eval/qwen2/asr_eavl.py
```

| Model | LibriSpeech-test-clean | LibriSpeech-test-other | AIShell2-dev | AIShell2-test | WeNetSpeech-testnet | WeNetSpeech-testmeeting |
| --- | --- | --- | --- | --- | --- | --- |
| VITA | 8.1 | 18.4 | | | 12.2 | 16.5 |
| EMOVA | 4.0 | 8.6 | 10.6 | 10.3 | | |
| MINI-OMNI | 4.5 | 9.7 | | | | |
| Freeze-Omni | 3.29 | 7.4 | | | 8.57 | 10.09 |
| ours | 2.57 | 5.6 | 6.81 | 6.87 | 7.63 | |
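The table does not name its metric. ASR benchmarks of this kind are conventionally scored with word error rate for English and character error rate for Chinese, so the sketch below assumes that convention. `edit_distance` and `error_rate` are hypothetical helpers written for illustration; the repository's evaluation scripts may normalize text differently.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return the error rate in percent, per word or per character."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(error_rate("she sells seashells by the seashore",
                     "she sells sea shells by the seashore"))
    print(error_rate("四是四", "四是十", unit="char"))
```

Because insertions count as errors, the rate can exceed 100 when the hypothesis is much longer than the reference.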

### Image-Text Evaluation
Refer to MMEvol for details on the OpenCompass vision-language evaluation.

```plain
# run on all 12 datasets listed below
./script/run_inference.sh OpenOmni-Qwen "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK" all

# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh OpenOmni-Qwen MME all
# MMMU_DEV_VAL
./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all
.....
```
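The single-dataset commands all follow one pattern, so they can be generated from the dataset list in the combined command. The sketch below only builds the command strings; `DATASETS` and `single_dataset_commands` are hypothetical names, and the parameter name `mode` for the trailing `all` argument is a guess at its meaning.

```python
import shlex

# Dataset names copied from the combined command above.
DATASETS = (
    "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar "
    "MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK"
).split()


def single_dataset_commands(model: str = "OpenOmni-Qwen", mode: str = "all") -> list[str]:
    """Build one run_inference.sh command line per dataset."""
    return [
        shlex.join(["./script/run_inference.sh", model, name, mode])
        for name in DATASETS
    ]


if __name__ == "__main__":
    for command in single_dataset_commands():
        print(command)
```

Printing the commands first makes it easy to run a subset by hand or to pipe them into a job scheduler.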

### Speech-Text-Image Evaluation
Please download OmniBench and run the following commands:

```plain
python openomni/eval/llama3/omni_eavl.py
python openomni/eval/qwen2/omni_eavl.py
```

### Speech-Text-Image-Video Evaluation
Please download Ov-Odyssey and run the following commands:

```plain
python openomni/eval/llama3/ov_odyssey_eavl.py
python openomni/eval/qwen2/ov_odyssey_eavl.py
```

### Text-Speech Evaluation
```plain
python openomni/eval/llama3/t2s_eavl.py
python openomni/eval/qwen2/t2s_eavl.py
```

### Emotional Text-Speech Evaluation
```plain
python openomni/eval/llama3/et2s_eavl.py
python openomni/eval/qwen2/et2s_eavl.py
```
## 📌 Cases

**四是四，十是十，十四是十四，四十是四十。**

**黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。**

**吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。**

[四是四，十是十，十四是十四，四十是四十。](https://github.com/user-attachments/assets/64dcbe0d-6f28-43ce-916e-5aea264f13f0)

[黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。](https://github.com/user-attachments/assets/996e5ec9-8baa-491d-a731-51d454fca493)

[吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。](https://github.com/user-attachments/assets/e7035bc0-1b11-4b9c-9491-e86c289daa2f)

**八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。**

**红凤凰，黄凤凰，粉红凤凰，花凤凰。**

**牛郎年年恋刘娘，刘娘念念恋牛郎。**

[八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。](https://github.com/user-attachments/assets/626c5732-2386-49cb-992c-0bd251af40df)

[红凤凰，黄凤凰，粉红凤凰，花凤凰。](https://github.com/user-attachments/assets/2d5e862b-abb1-4656-b80f-1576f730005e)

[牛郎年年恋刘娘，刘娘念念恋牛郎。](https://github.com/user-attachments/assets/89207b65-7855-425d-84ae-0badb5c1e73f)

**She sells seashells by the seashore.**

**Peter Piper picked a peck of pickled peppers.**

**Six slippery snails slid slowly seaward.**

[en_0.webm](https://github.com/user-attachments/assets/cc61b680-1f80-416e-89f7-418222f2de74)

[en_1.webm](https://github.com/user-attachments/assets/74c058dd-9674-4832-9a08-fa882a16d539)

[en_2.webm](https://github.com/user-attachments/assets/bcdbf12d-c5e0-4373-bc92-625fb61fe9ab)

**Six sleek swans swam swiftly southwards.**

**I saw Susie sitting in a shoeshine shop.**

**Can you can a can as a canner can can a can?**

[en_3.webm](https://github.com/user-attachments/assets/aab3314f-b03c-4398-a935-e013aac02235)

[en_4.webm](https://github.com/user-attachments/assets/6b4cdf14-4a87-4dce-8063-252ef5078428)

[en_5.webm](https://github.com/user-attachments/assets/9d0794f0-a36b-415d-a264-8935bbf96921)

## 📚Video Demo

https://github.com/user-attachments/assets/cd679b7c-9f9d-4631-a1f5-96b1428a8ad4

## 📚Citation

If you find this repo useful for your research, please consider citing the following papers:

```
@article{luo2025openomni,
title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
journal={arXiv preprint arXiv:2501.04561},
year={2025}
}
```
```
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
```

## 📧 Contact

If you have any questions, please reach out to the following contacts for help:

- Run Luo — r.luo@siat.ac.cn

- Haonan Zhang — zchiowal@gmail.com

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA) and [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni): the codebases we built upon. Thanks for their brilliant contributions to the community!

- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 6k vocabulary size!

- [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 16k vocabulary size!
