https://github.com/rainbowluocs/openomni
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
- Host: GitHub
- URL: https://github.com/rainbowluocs/openomni
- Owner: RainBowLuoCS
- Created: 2025-01-11T13:06:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-17T01:26:45.000Z (about 1 year ago)
- Last Synced: 2025-03-17T02:36:28.089Z (about 1 year ago)
- Topics: image, large-language-model, large-multimodal-models, multimodal, multimodal-large-language-models, omni, speech
- Language: Python
- Homepage:
- Size: 8.39 MB
- Stars: 36
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
[[arXiv Paper](https://arxiv.org/pdf/2501.04561)] [[Datasets](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni)] [[Models](https://huggingface.co/Tongyi-ConvAI/OpenOmni)]
OpenOmni is a pioneering, fully open-source, end-to-end method that incorporates image, speech, and text into a single omnimodal large language model. Its speech generation design, based on language bridging and text-guided speech, can be trained quickly even when omnimodal data and VRAM are scarce. OpenOmni supports omnimodal understanding as well as two real-time emotional speech generation modes, CTC mode and AR mode, so users can choose the balance between generation speed and quality that fits their needs. The flexible framework design allows OpenOmni to be applied easily to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to try it out!
## Update
- [2025/02/12] Added some missing files and fixed known bugs.
- [2025/01/13] OpenOmni is here! We released the [code](https://github.com/RainBowLuoCS/OpenOmni), [model](https://huggingface.co/Tongyi-ConvAI/OpenOmni) and [data](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni).
- [2025/01/09] After two months of company review, we released the [paper](https://arxiv.org/pdf/2501.04561).
- [2024/11/14] We submitted the [paper](https://arxiv.org/pdf/2501.04561) for peer review.
- [2024/09/15] We wrote the first line of the OpenOmni project, a pioneering fully open-source end-to-end OmniLLM.
## Contents
+ Setup
+ Model
+ Preparation
+ Train
+ Evaluation
+ Example
+ Citation
## Setup
Please follow the instructions below to install the required packages.
1. Clone this repository
```plain
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
```
1. Install Package
```plain
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
```
1. Install additional packages for training
```plain
pip install flash-attn --no-build-isolation
```
## Fast Usage
After downloading the weights and configuring the paths properly, you also need two open-source speech tokenizers for speech discretization and reconstruction, each with a different vocabulary size: [CosyVoice for 6K CTC mode](https://github.com/FunAudioLLM/CosyVoice) and [GLM-4-Voice for 16K AR mode](https://github.com/THUDM/GLM-4-Voice).
Fast inference for omnimodal input (speech, text, image, and video):
```plain
python inference.py
```
Fast interaction for omnimodal input (speech, text, image, and video):
```plain
python demo.py
```
## Model

Here are the pretrained weights and instruction tuning weights:
| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
| --- | --- | --- | --- | --- | --- |
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |
## Preparation
### Dataset
Please follow [MMEvol](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/mmevol) to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.
The following is the data directory tree of OpenOmni.
### Data structure
```plain
datasets
├── json # data recipe
│   ├── openomni_stage1-1.json # speech2text pretraining
│   ├── openomni_stage2-1.json # image2text pretraining
│   ├── openomni_stage2-2.json # image2text instruction tuning
│   ├── openomni_stage3-1.json # text2speech pretraining
│   └── openomni_stage3-2.json # text2speech emotional injection
├── asr # classic bilingual speech corpus
│   ├── AISHELL-4
│   ├── LibriSPeech
│   └── WeNetSpeech
├── audio_en # synthetic english speech corpus for question
├── audio_llava # synthetic bilingual speech corpus for answer
├── audio_zh # synthetic chinese speech corpus for question
├── audio_unit # synthetic bilingual speech corpus for answer
├── audio_prefer # synthetic emotional bilingual speech corpus for answer
├── audio_reject # synthetic emotional bilingual speech corpus for answer
├── audio_ultrachat # synthetic bilingual speech corpus for answer
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......
```
+ All files and directories whose names start with "audio" are self-synthesized.
+ The DPO data contains approximately 9k entries for "prefer" and "reject", covering 9 types of emotions.
More details about data curation can be found in our [paper](https://arxiv.org/pdf/2501.04561).
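
Before launching a training stage, it can help to confirm that the directory tree above is in place. The following is a minimal sketch, not part of the OpenOmni codebase: the helper name `find_missing_entries` and the default root `datasets` are assumptions for illustration, and the list only covers the speech-text entries named in the tree.

```python
from pathlib import Path

# Entries copied from the directory tree above (speech-text part only).
# This list is illustrative; extend or trim it to match your local copy.
EXPECTED_ENTRIES = [
    "json/openomni_stage1-1.json",
    "json/openomni_stage2-1.json",
    "json/openomni_stage2-2.json",
    "json/openomni_stage3-1.json",
    "json/openomni_stage3-2.json",
    "asr",
    "audio_en",
    "audio_zh",
    "audio_llava",
    "audio_unit",
    "audio_prefer",
    "audio_reject",
    "audio_ultrachat",
]


def find_missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "datasets" is an assumed location; point this at your own data root.
    missing = find_missing_entries("datasets")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected entries are present.")
```

The check only tests for existence, so it will not catch incomplete downloads or malformed recipe files.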
## Train
### Speech2Text Pretrain
Please download the MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and OpenOmni data and organize them as described in Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
```plain
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
```
### Image2Text Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
```plain
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
```
### Image2Text Instruction Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
```plain
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetue.sh
```
### Text2Speech Pretrain
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
```plain
bash scripts/train/llama3/text2speech_pretrain.sh
bash scripts/train/qwen2/text2speech_pretrain.sh
```
### Text2Speech Emotional DPO Tuning
Please make sure you download and organize the data following [Preparation](https://github.com/RainBowLuoCS/MMEvol#preparation) before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).
```plain
bash scripts/train/llama3/text2speech_dpo.sh
bash scripts/train/qwen2/text2speech_dpo.sh
```
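
The five training stages above are listed in the same order as the stage numbers in the data recipes (1-1, 2-1, 2-2, 3-1, 3-2). The sketch below shows one way to launch them in sequence. It is an illustration rather than a script shipped with the repository: the helper names, the `dry_run` flag, and the assumption that each stage's script needs no extra arguments are ours, and the script file names should be checked against your checkout before a real run.

```python
import subprocess
from pathlib import PurePosixPath

# (stage id, script name) pairs in the order the README presents them.
# Script names are assumptions taken from the commands above; verify them locally.
STAGES = [
    ("1-1", "speech2text_pretrain.sh"),
    ("2-1", "image2text_pretrain.sh"),
    ("2-2", "image2text_finetune.sh"),
    ("3-1", "text2speech_pretrain.sh"),
    ("3-2", "text2speech_dpo.sh"),
]


def build_commands(backbone="llama3", script_root="scripts/train"):
    """Return a list of (stage id, argv list) for the chosen backbone directory."""
    base = PurePosixPath(script_root) / backbone
    return [(stage, ["bash", str(base / name)]) for stage, name in STAGES]


def run_stages(backbone="llama3", dry_run=True):
    """Print each stage command; execute it only when dry_run is False."""
    for stage, cmd in build_commands(backbone):
        print(f"[stage {stage}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the sequence at the first failing stage.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_stages()  # dry run: prints the commands without executing them
```

Running with `dry_run=True` first makes it easy to review paths before committing GPU time, since each later stage presumably depends on the checkpoint produced by the previous one.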
## Evaluation
### Dataset
#### Ensure that your api_base, key, and dataset are correctly configured before evaluation.
### Data structure
```plain
datasets
├── json # data recipe
│   ├── aishell2_eval.jsonl # aishell evaluation
│   ├── librispeech_eval.jsonl # librispeech evaluation
│   ├── wenetspeech_eval.json # wenetspeech evaluation
│   └── openomni_emotion_val.json
├── OmniBench # OmniBench
│   ├── mmdata
│   ├── dataset
│   └── eval.json
├── Ov-Odyssey # AV-Odyssey Bench
│   ├── av_odyssey_part1.parquet
│   ├── av_odyssey_part2.parquet
│   ├── av_odyssey_part3.parquet
│   ├── av_odyssey_part4.parquet
│   └── av_odyssey_part5.parquet
```
### Speech-Text Evaluation
Make sure to set up the corresponding evaluation script with the correct settings (data path, weight path, and hyperparameters).
```plain
python openomni/eval/llama3/asr_eavl.py
python openomni/eval/qwen2/asr_eavl.py
```
| Model | LibriSpeech-test-clean | LibriSpeech-test-other | AIShell2-dev | AIShell2-test | WeNetSpeech-testnet | WeNetSpeech-testmeeting |
| --- | --- | --- | --- | --- | --- | --- |
| VITA | 8.1 | 18.4 | | | 12.2 | 16.5 |
| EMOVA | 4.0 | 8.6 | 10.6 | 10.3 | | |
| MINI-OMNI | 4.5 | 9.7 | | | | |
| Freeze-Omni | 3.29 | 7.4 | | | 8.57 | 10.09 |
| ours | 2.57 | 5.6 | 6.81 | 6.87 | 7.63 | |
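
The table does not name its metric. ASR results on these benchmarks are conventionally reported as word error rate (English) or character error rate (Chinese), in percent, so lower is better. Under that assumption, the sketch below shows how such a score is computed for a single utterance. It is a plain edit-distance implementation written for illustration and is not the code in `asr_eavl.py`, which may normalize text or aggregate over utterances differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return the error rate in percent; unit is "word" (WER) or "char" (CER)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Character level: drop spaces so segmentation does not affect the score.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = "she sells seashells by the seashore"
    hyp = "she sells sea shells by the seashore"
    print(f"WER: {error_rate(ref, hyp):.2f}%")
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.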
### Image-Text Evaluation
Refer to MMEvol for details on the OpenCompass vision-language evaluation.
```plain
# run on all datasets listed in the command
./script/run_inference.sh OpenOmni-Qwen "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK" all
# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh OpenOmni-Qwen MME all
# MMMU_DEV_VAL
./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all
.....
```
### Speech-Text-Image Evaluation
Please download OmniBench and run the following commands:
```plain
python openomni/eval/llama3/omni_eavl.py
python openomni/eval/qwen2/omni_eavl.py
```
### Speech-Text-Image-Video Evaluation
Please download AV-Odyssey (stored under `Ov-Odyssey` in the data tree above) and run the following commands:
```plain
python openomni/eval/llama3/ov_odyssey_eavl.py
python openomni/eval/qwen2/ov_odyssey_eavl.py
```
### Text-Speech Evaluation
```plain
python openomni/eval/llama3/t2s_eavl.py
python openomni/eval/qwen2/t2s_eavl.py
```
### Emotional Text-Speech Evaluation
```plain
python openomni/eval/llama3/et2s_eavl.py
python openomni/eval/qwen2/et2s_eavl.py
```
## Cases
**四是四,十是十,十四是十四,四十是四十。**
**黑化肥发灰,灰化肥发黑,黑化肥发灰会挥发,灰化肥挥发会发黑。**
**吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮。**
[四是四,十是十,十四是十四,四十是四十。](https://github.com/user-attachments/assets/64dcbe0d-6f28-43ce-916e-5aea264f13f0)
[黑化肥发灰,灰化肥发黑,黑化肥发灰会挥发,灰化肥挥发会发黑。](https://github.com/user-attachments/assets/996e5ec9-8baa-491d-a731-51d454fca493)
[吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮。](https://github.com/user-attachments/assets/e7035bc0-1b11-4b9c-9491-e86c289daa2f)
**八百标兵奔北坡,炮兵并排北边跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。**
**红凤凰,黄凤凰,粉红凤凰,花凤凰。**
**牛郎年年恋刘娘,刘娘念念恋牛郎。**
[八百标兵奔北坡,炮兵并排北边跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。](https://github.com/user-attachments/assets/626c5732-2386-49cb-992c-0bd251af40df)
[红凤凰,黄凤凰,粉红凤凰,花凤凰。](https://github.com/user-attachments/assets/2d5e862b-abb1-4656-b80f-1576f730005e)
[牛郎年年恋刘娘,刘娘念念恋牛郎。](https://github.com/user-attachments/assets/89207b65-7855-425d-84ae-0badb5c1e73f)
**She sells seashells by the seashore.**
**Peter Piper picked a peck of pickled peppers.**
**Six slippery snails slid slowly seaward.**
[en_0.webm](https://github.com/user-attachments/assets/cc61b680-1f80-416e-89f7-418222f2de74)
[en_1.webm](https://github.com/user-attachments/assets/74c058dd-9674-4832-9a08-fa882a16d539)
[en_2.webm](https://github.com/user-attachments/assets/bcdbf12d-c5e0-4373-bc92-625fb61fe9ab)
**Six sleek swans swam swiftly southwards.**
**I saw Susie sitting in a shoeshine shop.**
**Can you can a can as a canner can can a can?**
[en_3.webm](https://github.com/user-attachments/assets/aab3314f-b03c-4398-a935-e013aac02235)
[en_4.webm](https://github.com/user-attachments/assets/6b4cdf14-4a87-4dce-8063-252ef5078428)
[en_5.webm](https://github.com/user-attachments/assets/9d0794f0-a36b-415d-a264-8935bbf96921)
## Video Demo
https://github.com/user-attachments/assets/cd679b7c-9f9d-4631-a1f5-96b1428a8ad4
## Citation
If you find this repo useful for your research, please consider citing the papers below.
```
@article{luo2025openomni,
title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
journal={arXiv preprint arXiv:2501.04561},
year={2025}
}
```
```
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
```
## Contact
If you have any questions, please reach out to the following contacts for help:
- Run Luo: r.luo@siat.ac.cn
- Haonan Zhang: zchiowal@gmail.com
## Acknowledgement
- [LLaVA](https://github.com/haotian-liu/LLaVA) and [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni): the codebases we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use OpenOmni.
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 6k vocabulary size!
- [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 16k vocabulary size!