{"id":26245214,"url":"https://github.com/evolvinglmms-lab/videommmu","last_synced_at":"2025-09-07T01:04:46.577Z","repository":{"id":273657296,"uuid":"920453225","full_name":"EvolvingLMMs-Lab/VideoMMMU","owner":"EvolvingLMMs-Lab","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-08T02:20:40.000Z","size":16344,"stargazers_count":53,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-01T07:32:17.446Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://videommmu.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EvolvingLMMs-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-22T07:12:14.000Z","updated_at":"2025-08-27T16:50:27.000Z","dependencies_parsed_at":"2025-04-23T19:00:53.281Z","dependency_job_id":"758951ec-2306-438a-937b-948999b2864a","html_url":"https://github.com/EvolvingLMMs-Lab/VideoMMMU","commit_stats":null,"previous_names":["videommmu/videommmu","evolvinglmms-lab/videommmu"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EvolvingLMMs-Lab/VideoMMMU","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FVideoMMMU","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FVideoMMMU/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FVideoMMMU/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Evo
lvingLMMs-Lab%2FVideoMMMU/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EvolvingLMMs-Lab","download_url":"https://codeload.github.com/EvolvingLMMs-Lab/VideoMMMU/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FVideoMMMU/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273983110,"owners_count":25202095,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-13T12:32:10.274Z","updated_at":"2025-09-07T01:04:46.566Z","avatar_url":"https://github.com/EvolvingLMMs-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"./assets/pyramid-chart.png\" alt=\"Video-MMMU Icon\" style=\"height: 30px; vertical-align: middle;\"\u003e Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://videommmu.github.io/\"\u003e\u003cimg src=\"https://img.shields.io/badge/🎓-Website-red\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2501.13826\"\u003e\u003cimg src=\"https://img.shields.io/badge/📝-Paper-blue\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://huggingface.co/datasets/lmms-lab/VideoMMMU\"\u003e\u003cimg src=\"https://img.shields.io/badge/🤗-Dataset-yellow\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://lmms-lab.framer.ai\"\u003e\u003cimg src=\"https://img.shields.io/badge/🏠-LMMs_Lab_Homepage-green\" height=\"23\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://discord.gg/zdkwKUqrPy\"\u003e\u003cimg src=\"https://img.shields.io/badge/💬-Discord_LMMs_Eval-beige\" height=\"23\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n🖋 [Kairui Hu](https://kairuihu.github.io/), [Penghao Wu](https://penghao-wu.github.io/), [Fanyi Pu](https://github.com/pufanyi), [Wang Xiao](https://www.ntu.edu.sg/s-lab), [Xiang Yue](https://xiangyue9607.github.io/), [Yuanhan Zhang](https://zhangyuanhan-ai.github.io/), [Bo Li](https://brianboli.com/), and [Ziwei Liu](https://liuziwei7.github.io/)\n\n---\n\n## 🔥 News\n- [2025-3] 🎉🎉 We update Video-MMMU leaderboard to include [Kimi-k1.6-preview-20250308](https://github.com/MoonshotAI/Kimi-k1.5), [VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B). Thanks for the acknowledgement to Video-MMMU!\n- [2025-2] 🎉🎉 We update Video-MMMU leaderboard to include [Qwen-2.5-VL-72B](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct), [Qwen-2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct),  [mPLUG-Owl3-7B](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl3), [InternVideo2.5-Chat-8B](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B), [VideoChat-Flash-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448). 
Thanks for the acknowledgement to Video-MMMU!\n- [2025-1] 🎉🎉 We introduce [VideoMMMU](https://videommmu.github.io/), a multi-modal, multi-disciplinary video benchmark that evaluates the **knowledge acquisition capability** from educational videos.\n\n\n\n## 🧠 Overview  \n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/figure1_0308.png\" alt=\"Figure 1\" width=\"100%\"\u003e\n\u003c/div\u003e\n\nVideo-MMMU is the first benchmark to assess knowledge acquisition from educational videos, evaluating how well LMMs learn new knowledge from videos and apply what they learn in practice.\n\n### 1) Knowledge-Intensive Video Collection\nVideo-MMMU features 300 **lecture-style videos** covering 6 professional disciplines—Art, Business, Science, Medicine, Humanities, and Engineering, spanning 30 subjects.\n\n### 2) Knowledge Acquisition-Based Question Design  \nEach video is accompanied by 3 QA pairs, designed to evaluate video-based learning at different cognitive levels:  \n- **Perception** – Identifying key information.  \n- **Comprehension** – Understanding underlying concepts.  \n- **Adaptation** – Applying knowledge to new scenarios.  \n\nThis results in 900 question-answer pairs (300 videos × 3 QA pairs per video), systematically measuring a model's ability to acquire and apply knowledge from educational videos.  \n \n\n## ❓QA Design\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/figure_2.png\" alt=\"Figure 2\" width=\"100%\"\u003e\n\u003c/div\u003e\n\n**Perception**  \n- ASR (Automatic Speech Recognition): The **Art** category (top left).  \n- OCR (Optical Character Recognition): The **Business** category (bottom left).\n  \n**Comprehension**  \n- Concept Comprehension: The **Humanities** category (top center).  \n- Problem-Solving Strategy Comprehension: The **Science** category (bottom center).\n\n**Adaptation**  \n- Case Study Analysis: The **Medicine** category (top right).  
\n- Problem-Solving Strategy Adaptation: The **Engineering** category (bottom right).\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/figure3_0308.png\" alt=\"Figure 3\" width=\"100%\"\u003e\n\u003c/div\u003e\n\n## 🔍 A New Perspective on VideoQA \n\n### Videos as a Knowledge Source\nTraditional VideoQA benchmarks focus primarily on evaluating how well models interpret visual content. Video-MMMU is the first to treat videos as a **source of knowledge**, assessing how effectively LMMs acquire knowledge from educational videos.  \n\n### Measuring Knowledge Gain: The Δknowledge Metric\nA key novelty of Video-MMMU is that it evaluates not just a model’s absolute accuracy but also its **delta accuracy**: the improvement in performance after learning from a video. A model may initially fail to solve an exam question; we then give it a video from which a human could learn to solve that question, and measure how much the model's **performance improves** after watching it. Video-MMMU introduces **Δknowledge** to quantify this learning gain on the Adaptation track questions. **Δknowledge** is defined as the normalized performance gain:\n\n```math\n$$\n\\Delta_{\\text{knowledge}} = \\frac{\\text{Acc}_{\\text{after\\_video}} - \\text{Acc}_{\\text{before\\_video}}}{100\\% - \\text{Acc}_{\\text{before\\_video}}} \\times 100\\%\n$$\n\n```\n\n\n## 🛠️ Evaluation Pipeline\nThe evaluation of VideoMMMU is integrated into [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main). 
Detailed evaluation instructions follow.\n\n### Installation\n\nFor formal usage, you can install the package from PyPI by running the following command:\n```bash\npip install lmms-eval\n```\n\nFor development, you can install the package by cloning the repository and running the following command:\n```bash\ngit clone https://github.com/EvolvingLMMs-Lab/lmms-eval\ncd lmms-eval\npip install -e .\n```\n\nIf you want to test LLaVA, you will also need to clone and install [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT):\n```bash\ngit clone https://github.com/LLaVA-VL/LLaVA-NeXT\ncd LLaVA-NeXT\npip install -e .\n```\n\n### Evaluation\n\nWe use [LLaVA-OneVision-7B](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf) as an example in the following commands. You can change `--model` and `--model_args` to suit your requirements.\n\n**Evaluation of LLaVA-OneVision on VideoMMMU (all 3 tracks)**\n\n```bash\naccelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \\\n--model llava_onevision \\\n--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \\\n    --tasks video_mmmu \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix debug \\\n    --output_path ./logs/\n```\n\n**Evaluate a single track of VideoMMMU**\n\nPerception track: \n```bash\naccelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \\\n--model llava_onevision \\\n--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \\\n    --tasks video_mmmu_perception \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix debug \\\n    --output_path ./logs/\n```\n\nComprehension track: \n```bash\naccelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \\\n--model llava_onevision \\\n--model_args 
pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \\\n    --tasks video_mmmu_comprehension \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix debug \\\n    --output_path ./logs/\n```\n\nAdaptation track: \n```bash\naccelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \\\n--model llava_onevision \\\n--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \\\n    --tasks video_mmmu_adaptation \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix debug \\\n    --output_path ./logs/\n```\n\n**Evaluate the question_only track of VideoMMMU -- Knowledge Acquisition Experiment (∆knowledge)**\n\nThe \"question_only\" track consists of 2-second videos that contain only the image associated with the Adaptation Track question. This is the baseline for ∆knowledge.\n\nTo evaluate this setting, you can use the following command:\n\n```bash\naccelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \\\n--model llava_onevision \\\n--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dype=bfloat16 \\\n    --tasks video_mmmu_adaptation_question_only \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix debug \\\n    --output_path ./logs/\n```\n\nThe Δknowledge is defined as : \n```math\n$$\n\\Delta_{\\text{knowledge}} = \\frac{\\text{Acc}_{\\text{adaptation}} - \\text{Acc}_{\\text{question\\_only}}}{100\\% - \\text{Acc}_{\\text{question\\_only}}} \\times 100\\%\n$$\n\n```\n\n\n***Adaptation Track setting***\n\nTo ensure compatibility with [LMMs-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), the image associated with the Adaptation Track question has been appended in the last frame of the video. 
A prompt has also been added to inform the model that the question image is located in this final frame.\n\nIf you prefer an interleaved format, you can insert the image (either the last frame of the video or the ```image 1``` entry from the HF dataset) into the designated placeholder ```\u003cimage 1\u003e```.\n\n\n## 🎓 Video-MMMU Leaderboard\n\nWe evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison. To submit your model results, please send an email to videommmu2025@gmail.com.\n\n\n| Model | Overall \\| Δknowledge | Perception | Comprehension | Adaptation |\n|---|---|---|---|---|\n| [GPT-5-thinking](https://openai.com/index/introducing-gpt-5/) | 84.6 \\| -- | -- | -- | -- |\n| [Gemini-2.5-Pro](https://deepmind.google/models/gemini/pro/) | 83.6 \\| -- | -- | -- | -- |\n| [OpenAI O3](https://openai.com/index/introducing-o3-and-o4-mini/) | 83.3 \\| -- | -- | -- | -- |\n| [Keye-VL-1.5-8B](https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B) | 66.00 \\| 🟢 +0.0 | 77.67 | 68.67 | 51.67 |\n| [Claude-3.5-Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | 65.78 \\| 🟢 +11.4 | 72.00 | 69.67 | 55.67 |\n| [Kimi-VL-A3B-Thinking-2506](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506) | 65.22 \\| 🟢 +3.5 | 75.00 | 66.33 | 54.33 |\n| [GPT-4o](https://openai.com/index/hello-gpt-4o/) | 61.22 \\| 🟢 +15.6 | 66.00 | 62.00 | 55.67 |\n| [Qwen-2.5-VL-72B](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) | 60.22 \\| 🟢 +9.7 | 69.33 | 61.00 | 50.33 |\n| [GLM-4V-PLUS-0111](https://www.bigmodel.cn/dev/api/normal-model/glm-4v) | 57.56 \\| 🔴 -1.7 | 77.33 | 53.33 | 42.00 |\n| [Gemini 1.5 Pro](https://deepmind.google/technologies/gemini/pro/) | 53.89 \\| 🟢 +8.7 | 59.00 | 53.33 | 49.33 |\n| [Video-RTS](https://arxiv.org/abs/2507.06485) | 52.70 \\| -- | -- | -- | -- |\n| [Aria](https://rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model) | 50.78 \\| 🟢 +3.2 | 65.67 | 46.67 | 40.00 |\n| [Gemini 1.5 
Flash](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | 49.78 \\| 🔴 -3.3 | 57.33 | 49.00 | 43.00 |\n| [LLaVA-Video-72B](https://huggingface.co/lmms-lab/LLaVA-Video-72B-Qwen2) | 49.67 \\| 🟢 +7.1 | 59.67 | 46.00 | 43.33 |\n| [LLaVA-OneVision-72B](https://huggingface.co/llava-hf/llava-onevision-qwen2-72b-ov-hf) | 48.33 \\| 🟢 +6.6 | 59.67 | 42.33 | 43.00 |\n| [Qwen-2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | 47.44 \\| 🟢 +2.2 | 58.33 | 44.33 | 39.67 |\n| [VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B) | 47.00 \\| 🔴 -0.5 | 60.33 | 46.00 | 34.67 |\n| [InternVideo2.5-Chat-8B](https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B) | 43.00 \\| 🟢 +3.0 | 54.67 | 41.67 | 32.67 |\n| [mPLUG-Owl3-7B](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl3) | 42.00 \\| 🟢 +7.5 | 49.33 | 38.67 | 38.00 |\n| [MAmmoTH-VL-8B](https://mammoth-vl.github.io/) | 41.78 \\| 🟢 +1.5 | 51.67 | 40.00 | 33.67 |\n| [VideoChat-Flash-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) | 41.67 \\| 🔴 -1.3 | 51.67 | 40.67 | 32.67 |\n| [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) | 37.44 \\| 🔴 -8.5 | 47.33 | 33.33 | 31.67 |\n| [LLaVA-Video-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) | 36.11 \\| 🔴 -5.3 | 41.67 | 33.33 | 33.33 |\n| [VILA1.5-40B](https://huggingface.co/Efficient-Large-Model/VILA1.5-40b) | 34.00 \\| 🟢 +9.4 | 38.67 | 30.67 | 32.67 |\n| [LLaVA-OneVision-7B](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf) | 33.89 \\| 🔴 -5.6 | 40.00 | 31.00 | 30.67 |\n| [Llama-3.2-11B](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | 30.00 \\| ➖ — | 35.67 | 32.33 | 22.00 |\n| [LongVA-7B](https://huggingface.co/lmms-lab/LongVA-7B) | 23.98 \\| 🔴 -7.0 | 24.00 | 24.33 | 23.67 |\n| [VILA1.5-8B](https://huggingface.co/Efficient-Large-Model/Llama-3-VILA1.5-8B-Fix) | 20.89 \\| 🟢 +5.9 | 20.33 | 17.33 | 25.00 |\n\n\n\n\n\n## 
Citation\n\n```bibtex\n@article{hu2025videommmu,\n    title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},\n    author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},\n    journal={arXiv preprint arXiv:2501.13826},\n    year={2025},\n    url={https://arxiv.org/abs/2501.13826}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fvideommmu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevolvinglmms-lab%2Fvideommmu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fvideommmu/lists"}