{"id":27400825,"url":"https://github.com/rainbowluocs/openomni","last_synced_at":"2025-04-14T03:43:10.605Z","repository":{"id":272174506,"uuid":"915285869","full_name":"RainBowLuoCS/OpenOmni","owner":"RainBowLuoCS","description":"OpenOmni: Official implementation of  Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis","archived":false,"fork":false,"pushed_at":"2025-03-17T01:26:45.000Z","size":8798,"stargazers_count":36,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-17T02:36:28.089Z","etag":null,"topics":["image","large-language-model","large-multimodal-models","multimodal","multimodal-large-language-models","omni","speech"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RainBowLuoCS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-11T13:06:05.000Z","updated_at":"2025-03-17T01:26:49.000Z","dependencies_parsed_at":"2025-02-10T06:23:04.200Z","dependency_job_id":"de7acbb7-1578-41d4-b47c-13c0b26a7089","html_url":"https://github.com/RainBowLuoCS/OpenOmni","commit_stats":null,"previous_names":["rainbowluocs/openomni"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RainBowLuoCS%2FOpenOmni","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RainBowLuoCS%2FOpenOmni/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/RainBowLuoCS%2FOpenOmni/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RainBowLuoCS%2FOpenOmni/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RainBowLuoCS","download_url":"https://codeload.github.com/RainBowLuoCS/OpenOmni/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248819059,"owners_count":21166469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image","large-language-model","large-multimodal-models","multimodal","multimodal-large-language-models","omni","speech"],"created_at":"2025-04-14T03:43:10.054Z","updated_at":"2025-04-14T03:43:10.599Z","avatar_url":"https://github.com/RainBowLuoCS.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003cdiv align=center\u003e\n\u003cimg src=\"assets/logo.png\" width=\"140px\"\u003e\n\u003c/div\u003e\n\n# OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis\n\n\u003cfont size=5\u003e\u003cdiv align='center' \u003e  [[📖 arXiv Paper](https://arxiv.org/pdf/2501.04561)] [[📊 Datasets](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni)] [[🏆 Models](https://huggingface.co/Tongyi-ConvAI/OpenOmni)]  \u003c/div\u003e\u003c/font\u003e\nOpenOmni is a pioneering, fully open-source, end-to-end method that incorporates image, speech, and text into a single omni large language model. 
OpenOmni generates speech through language bridging and text-guided speech, so it can be trained quickly even when omni-modal data and VRAM are scarce. Beyond omni-modal understanding, OpenOmni supports two real-time emotional speech generation modes, CTC mode and AR mode, so users can trade off generation speed against quality according to their needs. The flexible framework design makes OpenOmni easy to apply to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to try it now!\n\n## 🔥 Update\n- [2025/02/12]🔥Add some missing files and fix bugs\n- [2025/01/13]🔥OpenOmni is coming! We release the [code](https://github.com/RainBowLuoCS/OpenOmni), [model](https://huggingface.co/Tongyi-ConvAI/OpenOmni) and [data](https://huggingface.co/datasets/Tongyi-ConvAI/OpenOmni)\n- [2025/01/09]🔥After two months of company audit! 
We release the [paper](https://arxiv.org/pdf/2501.04561)\n- [2024/11/14]🔥We submit the [paper](https://arxiv.org/pdf/2501.04561) for peer review\n- [2024/09/15]🔥We write the first line of the OpenOmni project, a pioneering, fully open-source, end-to-end OmniLLM.\n\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e👀\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e Contents\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSetup\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eModel\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eTrain\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eEvaluation\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eExample\u003c/font\u003e\n+ \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eCitation\u003c/font\u003e\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e📷\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e Setup\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease follow the instructions below to install the required packages.\u003c/font\u003e\n\n1. \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eClone this repository\u003c/font\u003e\n\n```plain\ngit clone https://github.com/RainBowLuoCS/OpenOmni.git\ncd OpenOmni\n```\n\n2. \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eInstall the package\u003c/font\u003e\n\n```plain\nconda create -n openomni python=3.10 -y\nconda activate openomni\npip install --upgrade pip  # enable PEP 660 support\npip install -e \".[train]\"\npip install -r requirements.txt\n```\n\n3. \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eInstall additional packages for training\u003c/font\u003e\n\n```plain\npip install flash-attn --no-build-isolation\n```\n## 🔥 Fast Usage\n\nDownload the weights and configure the paths properly before running the commands below. 
Two open-source speech tokenizers with different vocabulary sizes are needed for speech discretization and reconstruction: [CosyVoice for 6K CTC Mode](https://github.com/FunAudioLLM/CosyVoice) and [GLM4Voice for 16K AR Mode](https://github.com/THUDM/GLM-4-Voice).\n\nFast inference for omnimodal input (speech, text, image, and video)\n```plain\npython inference.py\n```\n\nFast interaction for omnimodal input (speech, text, image, and video)\n```plain\npython demo.py\n```\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eModel\u003c/font\u003e\n![](assets/framework.png)\n\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eHere are the pretrained weights and instruction tuning weights\u003c/font\u003e\n\n| Stage | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eModel\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSpeech Projector\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eImage\u003c/font\u003e\u003cbr/\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eProjector\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eIT Data\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eDownload\u003c/font\u003e |\n| --- | --- | --- | --- | --- | --- |\n| 1-1 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eOpenOMNI-Qwen2-7B-Stage1-1\u003c/font\u003e | ckpt | ckpt | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eopenomni_stage1-1.json\u003c/font\u003e | ckpt |\n| 2-1 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eOpenOMNI-Qwen2-7B-Stage2-1\u003c/font\u003e | ckpt | ckpt | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eopenomni_stage2-1.json\u003c/font\u003e | ckpt |\n| 2-2 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eOpenOMNI-Qwen2-7B-Stage2-2\u003c/font\u003e | ckpt | ckpt | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eopenomni_stage2-2.json\u003c/font\u003e | ckpt |\n| 3-1 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eOpenOMNI-Qwen2-7B-Stage3-1\u003c/font\u003e | ckpt | ckpt | 
\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eopenomni_stage3-1.json\u003c/font\u003e | ckpt |\n| 3-2 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eOpenOMNI-Qwen2-7B-Stage3-2\u003c/font\u003e | ckpt | ckpt | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eopenomni_stage3-2.json\u003c/font\u003e | ckpt |\n\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eDataset\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease follow [MMEvol](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/mmevol) to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.\u003c/font\u003e\n\nThe following is the data directory tree of OpenOmni:\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003edata structure\u003c/font\u003e\n```plain\ndatasets\n├── json # data recipe\n│   ├── openomni_stage1-1.json # speech2text pretraining\n│   ├── openomni_stage2-1.json # image2text pretraining\n│   ├── openomni_stage2-2.json # image2text instruction tuning\n│   ├── openomni_stage3-1.json # text2speech pretraining\n│   ├── openomni_stage3-2.json # text2speech emotional injection\n├── asr # classic bilingual speech corpus\n│   ├── AISHELL-4\n│   ├── LibriSPeech\n│   ├── WeNetSpeech\n├── audio_en # synthetic english speech corpus for question\n├── audio_llava # synthetic bilingual speech corpus for answer\n├── audio_zh # synthetic chinese speech corpus for question\n├── audio_unit # synthetic bilingual speech corpus for answer\n├── audio_prefer # synthetic emotional bilingual speech corpus for answer\n├── audio_reject # synthetic emotional bilingual speech corpus for answer\n├── audio_ultrachat # synthetic bilingual speech corpus for answer\n├── ai2d\n│   ├── abc_images\n│   ├── annotations\n│   ├── images\n│   ├── questions\n│   └── categories.json\n......\n\n\n```\n\n+ All files and paths starting with \"audio\" are self-synthesized.  
\n+ The DPO data contains approximately 9k \"prefer\" and \"reject\" entries, covering 9 types of emotions.\n\nMore details about data curation can be found in our [paper](https://arxiv.org/pdf/2501.04561).\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eTrain\u003c/font\u003e\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSpeech2Text Pretrain\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease download the MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and OpenOmni data and organize them as described in Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\nbash scripts/train/llama3/speech2text_pretrain.sh\nbash scripts/train/qwen2/speech2text_pretrain.sh\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eImage2Text Pretrain\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease make sure you download and organize the data following\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e[\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e](https://github.com/RainBowLuoCS/MMEvol#preparation)\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ebefore training. 
Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\nbash scripts/train/llama3/image2text_pretrain.sh\nbash scripts/train/qwen2/image2text_pretrain.sh\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eImage2Text Instruction Tuning\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease make sure you download and organize the data following\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e[\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e](https://github.com/RainBowLuoCS/MMEvol#preparation)\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ebefore training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\nbash scripts/train/llama3/image2text_finetune.sh\nbash scripts/train/qwen2/image2text_finetue.sh\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eText2Speech Pretrain\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease make sure you download and organize the data following\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e[\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e](https://github.com/RainBowLuoCS/MMEvol#preparation)\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ebefore training. 
Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\nbash scripts/train/llama3/text2speech_pretrain.sh\nbash scripts/train/qwen2/text2speech_pretrain.sh\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eText2Speech Emotional DPO Tuning\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease make sure you download and organize the data following\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e[\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePreparation\u003c/font\u003e](https://github.com/RainBowLuoCS/MMEvol#preparation)\u003cfont style=\"color:rgb(31, 35, 40);\"\u003e \u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ebefore training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\nbash scripts/train/llama3/text2speech_dpo.sh\nbash scripts/train/qwen2/text2speech_dpo.sh\n```\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eEvaluation\u003c/font\u003e\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eDataset\u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eEnsure that your api_base, key and dataset are correctly configured before evaluation.\u003c/font\u003e\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003edata structure\u003c/font\u003e\n```plain\ndatasets\n├── json # data recipe\n│   ├── aishell2_eval.jsonl # aishell evaluation\n│   ├── librispeech_eval.jsonl # librispeech evaluation\n│   ├── wenetspeech_eval.json # wenetspeech evaluation\n│   ├── openomni_emotion_val.json \n├── OmniBench # OmniBench\n│   ├── mmdata\n│   ├── dataset\n│   \t\t├── eval.json\n├── Ov-Odyssey # AV-Odyssey Bench\n│   ├── av_odyssey_part1.parquet\n│   ├── av_odyssey_part2.parquet\n│   ├── av_odyssey_part3.parquet\n│   ├── av_odyssey_part4.parquet\n│   ├── 
av_odyssey_part5.parquet\n\n\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSpeech-Text Evaluation \u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eMake sure to set up the corresponding evaluation script with the correct settings (data path, weight path, and hyperparameters).\u003c/font\u003e\n\n```plain\npython openomni/eval/llama3/asr_eavl.py\npython openomni/eval/qwen2/asr_eavl.py\n```\n\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eModel\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eLibriSpeech-test-clean\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eLibriSpeech-test-other\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eAIShell2-dev\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eAIShell2-test\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eWeNetSpeech-testnet\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eWeNetSpeech-testmeeting\u003c/font\u003e |\n| --- | --- | --- | --- | --- | --- | --- |\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eVITA\u003c/font\u003e | 8.1 | 18.4 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e12.2\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e16.5\u003c/font\u003e |\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eEMOVA\u003c/font\u003e | 4.0 | 8.6 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e10.6\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e10.3\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e |\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eMINI-OMNI\u003c/font\u003e | 4.5 | 9.7 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont 
style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e |\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eFreeze-Omni\u003c/font\u003e | 3.29 | 7.4 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e8.57\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e10.09\u003c/font\u003e |\n| \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eours\u003c/font\u003e | 2.57 | 5.6 | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e6.81\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e6.87\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e7.63\u003c/font\u003e | \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e\u003c/font\u003e |\n\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eImage-Text Evaluation \u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eRefer to MMEvol for the detailed OpenCompass vision-language evaluation.\u003c/font\u003e\n\n```plain\n# run on all 12 datasets listed in the command\n./script/run_inference.sh OpenOmni-Qwen \"MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK\" all\n\n# The following are instructions for running on a single dataset\n# MME\n./script/run_inference.sh OpenOmni-Qwen MME all\n# MMMU_DEV_VAL\n./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all\n# MathVista_MINI\n./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all\n.....\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSpeech-Text-Image Evaluation \u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease download OmniBench and run the following commands\u003c/font\u003e\n\n```plain\npython openomni/eval/llama3/omni_eavl.py\npython 
openomni/eval/qwen2/omni_eavl.py\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eSpeech-Text-Image-Video Evaluation \u003c/font\u003e\n\u003cfont style=\"color:rgb(31, 35, 40);\"\u003ePlease download Ov-Odyssey and run the following command\u003c/font\u003e\n\n```plain\npython openomni/eval/llama3/ov_odyssey_eavl.py\npython openomni/eval/qwen2/ov_odyssey_eavl.py\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eText-Speech Evaluation \u003c/font\u003e\n```plain\npython openomni/eval/llama3/t2s_eavl.py\npython openomni/eval/qwen2/t2s_eavl.py\n```\n\n### \u003cfont style=\"color:rgb(31, 35, 40);\"\u003eEmotional Text-Speech Evaluation \u003c/font\u003e\n```plain\npython openomni/eval/llama3/et2s_eavl.py\npython openomni/eval/qwen2/et2s_eavl.py\n```\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e 📌 Cases \u003c/font\u003e\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n**四是四，十是十，十四是十四，四十是四十。**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。**\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n[四是四，十是十，十四是十四，四十是四十。](https://github.com/user-attachments/assets/64dcbe0d-6f28-43ce-916e-5aea264f13f0)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。](https://github.com/user-attachments/assets/996e5ec9-8baa-491d-a731-51d454fca493)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n  \n[吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。](https://github.com/user-attachments/assets/e7035bc0-1b11-4b9c-9491-e86c289daa2f)\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n**八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**红凤凰，黄凤凰，粉红凤凰，花凤凰。**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**牛郎年年恋刘娘，刘娘念念恋牛郎。**\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd 
align=\"center\"\u003e\n\n[八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。](https://github.com/user-attachments/assets/626c5732-2386-49cb-992c-0bd251af40df)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[红凤凰，黄凤凰，粉红凤凰，花凤凰。](https://github.com/user-attachments/assets/2d5e862b-abb1-4656-b80f-1576f730005e)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[牛郎年年恋刘娘，刘娘念念恋牛郎。](https://github.com/user-attachments/assets/89207b65-7855-425d-84ae-0badb5c1e73f)\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n**She sells seashells by the seashore.**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**Peter Piper picked a peck of pickled peppers.**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**Six slippery snails slid slowly seaward.**\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n  \n[en_0.webm](https://github.com/user-attachments/assets/cc61b680-1f80-416e-89f7-418222f2de74)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n  \n[en_1.webm](https://github.com/user-attachments/assets/74c058dd-9674-4832-9a08-fa882a16d539)\n\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[en_2.webm](https://github.com/user-attachments/assets/bcdbf12d-c5e0-4373-bc92-625fb61fe9ab)\n\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n**Six sleek swans swam swiftly southwards.**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**I saw Susie sitting in a shoeshine shop.**\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n**Can you can a can as a canner can can a can?**\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\n[en_3.webm](https://github.com/user-attachments/assets/aab3314f-b03c-4398-a935-e013aac02235)\n\n\n\u003c/td\u003e\n\u003ctd 
align=\"center\"\u003e\n\n[en_4.webm](https://github.com/user-attachments/assets/6b4cdf14-4a87-4dce-8063-252ef5078428)\n\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\n[en_5.webm](https://github.com/user-attachments/assets/9d0794f0-a36b-415d-a264-8935bbf96921)\n\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e📚\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eVideo Demo\u003c/font\u003e \n\n\nhttps://github.com/user-attachments/assets/cd679b7c-9f9d-4631-a1f5-96b1428a8ad4\n\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e📚\u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eCitation\u003c/font\u003e \n\nIf you find this repo useful for your research, please consider citing the papers\n\n```\n@article{luo2025openomni,\n  title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},\n  author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},\n  journal={arXiv preprint arXiv:2501.04561},\n  year={2025}\n}\n```\n```\n@article{luo2024mmevol,\n  title={Mmevol: Empowering multimodal large language models with evol-instruct},\n  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},\n  journal={arXiv preprint arXiv:2409.05840},\n  year={2024}\n}\n```\n\n## \u003cfont style=\"color:rgb(31, 35, 40);\"\u003e📧 \u003c/font\u003e\u003cfont style=\"color:rgb(31, 35, 40);\"\u003eContact\u003c/font\u003e \n\nIf you have any questions, please contact us for help:\n\n- Run Luo: r.luo@siat.ac.cn\n\n- Haonan Zhang: zchiowal@gmail.com\n\n\n## Acknowledgement\n\n\\- [LLaVA](https://github.com/haotian-liu/LLaVA) and 
[LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni): the codebases we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use OpenOmni.\n\n\\- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): the amazing open-source suite for evaluating various LMMs!\n\n\\- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 6k vocabulary size!\n\n\\- [GLM4Voice](https://github.com/THUDM/GLM-4-Voice): the amazing open-source speech tokenizer for speech discretization and reconstruction with a 16k vocabulary size!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frainbowluocs%2Fopenomni","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frainbowluocs%2Fopenomni","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frainbowluocs%2Fopenomni/lists"}