{"id":13628883,"url":"https://github.com/MMMU-Benchmark/MMMU","last_synced_at":"2025-04-17T04:32:37.019Z","repository":{"id":209763438,"uuid":"722649798","full_name":"MMMU-Benchmark/MMMU","owner":"MMMU-Benchmark","description":"This repo contains evaluation code for the paper \"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI\"","archived":false,"fork":false,"pushed_at":"2025-04-15T03:05:29.000Z","size":195059,"stargazers_count":413,"open_issues_count":0,"forks_count":34,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-15T04:22:12.534Z","etag":null,"topics":["computer-vision","deep-learning","deep-neural-networks","evaluation","foundation-models","large-language-models","large-multimodal-models","llm","llms","machine-learning","multimodal","multimodal-deep-learning","multimodal-learning","multimodality","natural-language-processing","question-answering","stem","visual-question-answering"],"latest_commit_sha":null,"homepage":"https://mmmu-benchmark.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MMMU-Benchmark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-11-23T15:53:31.000Z","updated_at":"2025-04-15T03:05:34.000Z","dependencies_parsed_at":"2024-01-31T13:25:18.762Z","dependency_job_id":"65743d83-d31c-4d07-a3f7-011b6f38ea8a","html_url":"https://github.com/MMMU-Benchmark/MMMU","commit_stats":null,"previous_names":["mmmu-benchmark/mmmu"],"tags_count":0,"template":false,"template_full_name":null,"repository_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MMMU-Benchmark%2FMMMU","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MMMU-Benchmark%2FMMMU/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MMMU-Benchmark%2FMMMU/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MMMU-Benchmark%2FMMMU/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MMMU-Benchmark","download_url":"https://codeload.github.com/MMMU-Benchmark/MMMU/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249316000,"owners_count":21249872,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","deep-neural-networks","evaluation","foundation-models","large-language-models","large-multimodal-models","llm","llms","machine-learning","multimodal","multimodal-deep-learning","multimodal-learning","multimodality","natural-language-processing","question-answering","stem","visual-question-answering"],"created_at":"2024-08-01T22:00:58.933Z","updated_at":"2025-04-17T04:32:32.008Z","avatar_url":"https://github.com/MMMU-Benchmark.png","language":"Python","funding_links":[],"categories":["🎭 Multi-modal Testing","Multi-modal Large Language Models (MLLMs) Datasets \u003ca id=\"multi-modal-large-language-models-mllms-datasets\"\u003e\u003c/a\u003e","Tools"],"sub_categories":["Evaluation Datasets \u003ca id=\"evaluation02\"\u003e\u003c/a\u003e","LLM Evaluations and Benchmarks"],"readme":"# MMMU Benchmark\n\n[**🌐 
Homepage**](https://mmmu-benchmark.github.io/) | [**🏆 Leaderboard**](https://mmmu-benchmark.github.io/#leaderboard) | [**🤗 MMMU-Pro**](https://huggingface.co/datasets/MMMU/MMMU_Pro) | [**📖 MMMU-Pro arXiv**](https://arxiv.org/abs/2409.02813) | [**🤗 MMMU**](https://huggingface.co/datasets/MMMU/MMMU/) | [**📖 MMMU arXiv**](https://arxiv.org/pdf/2311.16502.pdf) \n\nThis repo contains the evaluation code for the papers \"[MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark](https://arxiv.org/abs/2409.02813)\" and \"[MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI](https://arxiv.org/pdf/2311.16502.pdf)\"\n\n## 🔔News\n\n- **🔥[2024-09-05] Introducing [MMMU-Pro](https://arxiv.org/abs/2409.02813), a robust version of the MMMU benchmark for multimodal AI evaluation! 🚀**\n- **🚀[2024-01-31]: We added Human Expert performance on the [Leaderboard](https://mmmu-benchmark.github.io/#leaderboard)!🌟**\n- **🔥[2023-12-04]: Our evaluation server for the test set is now available on [EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview). We welcome all submissions and look forward to your participation! 😆**\n\n## Introduction\n\n### MMMU\n\nMMMU is a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes **11.5K meticulously collected multimodal questions** from college exams, quizzes, and textbooks, covering six core disciplines: Art \u0026 Design, Business, Science, Health \u0026 Medicine, Humanities \u0026 Social Science, and Tech \u0026 Engineering. These questions span **30 subjects** and **183 subfields**, comprising **32 highly heterogeneous image types**, such as charts, diagrams, maps, tables, music sheets, and chemical structures. 
Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V achieves only 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence (AGI).\n\n![Alt text](mmmu.png)\n\n### MMMU-Pro\n\nBuilding upon MMMU, MMMU-Pro introduces even more stringent assessment methodologies to evaluate multimodal models' intrinsic understanding and reasoning capabilities. MMMU-Pro employs a meticulously structured three-step process:\n\n1. **Filtering out text-only answerable questions**: Ensures that the questions require multimodal understanding rather than purely textual comprehension.\n2. **Augmenting candidate options**: Introduces additional plausible options to make the task more challenging.\n3. **Vision-only input setting**: Embedding questions within images pushes AI to \"see\" and \"read\" simultaneously, replicating a core human cognitive skill of integrating visual and textual information.\n\nOur results reveal that model performance on MMMU-Pro is significantly lower than on MMMU, with accuracies ranging from 16.8% to 26.9% across various models. We investigate the effects of OCR prompts and Chain of Thought (CoT) reasoning. OCR prompts have minimal impact, while CoT generally enhances performance. 
MMMU-Pro offers a more rigorous evaluation framework, closely reflecting real-world scenarios and providing critical insights for advancing multimodal AI research.\n\n![Alt text](mmmu-pro.png)\n\n## Dataset Creation\n\nMMMU and MMMU-Pro were meticulously designed to challenge and evaluate multimodal models with tasks demanding college-level subject knowledge and complex reasoning. For more detailed information, please refer to our Hugging Face datasets:\n\n- [**🤗 MMMU Dataset**](https://huggingface.co/datasets/MMMU/MMMU/)\n- [**🤗 MMMU-Pro Dataset**](https://huggingface.co/datasets/MMMU/MMMU_Pro)\n\n## Evaluation\n\nPlease refer to our evaluation folders for detailed information on evaluating with both MMMU and MMMU-Pro benchmarks:\n\n- [**MMMU Evaluation**](mmmu)\n- [**MMMU-Pro Evaluation**](mmmu-pro)\n\n🎯 **MMMU Evaluation**\n\n- **We have released a full suite comprising 150 development samples and 900 validation samples. However, the 10,500 test questions are available without their answers.**\n- Use the **development set** for few-shot/in-context learning.\n- Use the **validation set** for debugging models, selecting hyperparameters, and quick evaluations.\n\nThe answers and explanations for the test set questions are withheld. You can submit your model's predictions for the **test set** on **[EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview)**.\n\n## Disclaimers\nThe guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. \nShould you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to [contact](#contact) us. 
Upon verification, such samples will be promptly removed.\n\n## Contact\n- Xiang Yue: xiangyue.work@gmail.com\n- Yu Su: su.809@osu.edu\n- Wenhu Chen: wenhuchen@uwaterloo.ca\n\n## Citation\n\n**BibTeX:**\n```bibtex\n@inproceedings{yue2023mmmu,\n  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},\n  author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},\n  booktitle={Proceedings of CVPR},\n  year={2024},\n}\n\n@article{yue2024mmmu,\n  title={MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark},\n  author={Xiang Yue and Tianyu Zheng and Yuansheng Ni and Yubo Wang and Kai Zhang and Shengbang Tong and Yuxuan Sun and Botao Yu and Ge Zhang and Huan Sun and Yu Su and Wenhu Chen and Graham Neubig},\n  journal={arXiv preprint arXiv:2409.02813},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMMMU-Benchmark%2FMMMU","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMMMU-Benchmark%2FMMMU","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMMMU-Benchmark%2FMMMU/lists"}