{"id":18344339,"url":"https://github.com/lupantech/mathvista","last_synced_at":"2025-05-16T12:03:13.079Z","repository":{"id":198121755,"uuid":"700102862","full_name":"lupantech/MathVista","owner":"lupantech","description":"MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts ","archived":false,"fork":false,"pushed_at":"2024-11-29T04:30:53.000Z","size":52667,"stargazers_count":292,"open_issues_count":4,"forks_count":48,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-12T08:38:31.635Z","etag":null,"topics":["ai4math","large-language-models","large-multimadality-models","machine-learning","mathematics","mathqa","science","visual-question-answering"],"latest_commit_sha":null,"homepage":"https://mathvista.github.io/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lupantech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-04T00:25:27.000Z","updated_at":"2025-04-08T03:36:41.000Z","dependencies_parsed_at":"2024-11-29T05:33:07.281Z","dependency_job_id":null,"html_url":"https://github.com/lupantech/MathVista","commit_stats":{"total_commits":104,"total_committers":5,"mean_commits":20.8,"dds":0.3846153846153846,"last_synced_commit":"99fa993d4e3f659f8d93b7786502e9109e94d273"},"previous_names":["lupantech/mathvista"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lupantech%2FMathVista","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lupa
ntech%2FMathVista/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lupantech%2FMathVista/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lupantech%2FMathVista/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lupantech","download_url":"https://codeload.github.com/lupantech/MathVista/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254527084,"owners_count":22085918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai4math","large-language-models","large-multimadality-models","machine-learning","mathematics","mathqa","science","visual-question-answering"],"created_at":"2024-11-05T21:05:38.225Z","updated_at":"2025-05-16T12:03:13.032Z","avatar_url":"https://github.com/lupantech.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MathVista: Evaluating Math Reasoning in Visual Contexts\n\n![MathQA](https://img.shields.io/badge/Task-MathQA-red) \n![Mathematical Reasoning](https://img.shields.io/badge/Task-Mathematical_Reasoning-red) \n![Multi-Modal](https://img.shields.io/badge/Task-Multi--Modal-red) \n![ScienceQA](https://img.shields.io/badge/Dataset-MathVista-blue)  \n![Claude-2](https://img.shields.io/badge/Model-Claude--2-green) \n![ChatGPT](https://img.shields.io/badge/Model-ChatGPT-green) \n![GPT-4](https://img.shields.io/badge/Model-GPT--4-green) 
\n![Gemini](https://img.shields.io/badge/Model-Gemini-green)\n![GPT-4V](https://img.shields.io/badge/Model-GPT--4V-green)\n\nCode for the Paper \"[MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts](https://arxiv.org/abs/2310.02255)\".\n\nFor more details, please refer to the project page with dataset exploration and visualization tools: [https://mathvista.github.io/](https://mathvista.github.io/).\n\n:bell: If you have any questions or suggestions, please don't hesitate to let us know. You can comment on the [Twitter](https://twitter.com/lupantech/status/1717313355780964608), or post an issue on this repository.\n\n[[Webpage](https://mathvista.github.io/)] [[Paper](https://arxiv.org/abs/2310.02255)] [[Huggingface Dataset](https://huggingface.co/datasets/AI4Math/MathVista)] [[Leaderboard](https://mathvista.github.io/#leaderboard)] [[Visualization](https://mathvista.github.io/#visualization)] [[Result Explorer](https://mathvista.github.io/#explorer)] [[Twitter](https://twitter.com/lupantech/status/1717313355780964608)]\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/logo_v1.png\" width=\"40%\"\u003e \u003cbr\u003e\n  Tentative logo for \u003cb\u003eMathVista\u003c/b\u003e. 
Generated by DALL·E 3 prompted by \n  \u003cbr\u003e\"A photo-based logo with a gradient of soft blue and modern typography, accompanied by the title 'MathVista'\".\n\u003c/p\u003e\n\n## Outlines\n- [🔦 Spotlight 🔦](https://github.com/lupantech/MathVista/blob/main/README.md#-spotlight-performance-update-sept-12-2024-)\n- [💥 News 💥](https://github.com/lupantech/MathVista/blob/main/README.md#-news-)\n- [👀 About MathVista](https://github.com/lupantech/MathVista/blob/main/README.md#-about-mathvista)\n- [🏆 Leaderboard 🏆](https://github.com/lupantech/MathVista/blob/main/README.md#-leaderboard-)\n  - [Contributing the Leaderboard](https://github.com/lupantech/MathVista/blob/main/README.md#contributing-the-leaderboard)\n  - [Leaderboard on the testmini subset](https://github.com/lupantech/MathVista/blob/main/README.md#leaderboard-on-the-testmini-subset)\n  - [Leaderboard on the test subset](https://github.com/lupantech/MathVista/blob/main/README.md#leaderboard-on-the-test-subset)\n- [📊 Dataset Examples](https://github.com/lupantech/MathVista/blob/main/README.md#-dataset-examples)\n- [📖 Dataset Usage](https://github.com/lupantech/MathVista/blob/main/README.md#-dataset-usage)\n  - [Data Source](https://github.com/lupantech/MathVista/blob/main/README.md#-dataset-usage)\n  - [Data Downloading](https://github.com/lupantech/MathVista/blob/main/README.md#data-downloading)\n  - [Data Format](https://github.com/lupantech/MathVista/blob/main/README.md#data-format)\n  - [Data Visualization](https://github.com/lupantech/MathVista/blob/main/README.md#data-visualization)\n  - [Usage Demos](https://github.com/lupantech/MathVista/blob/main/README.md#usage-demos)\n- [🔮 Evaluations on MathVista](https://github.com/lupantech/MathVista/blob/main/README.md#-evaluations-on-mathvista)\n  - [Requirements (Optional)](https://github.com/lupantech/MathVista/blob/main/README.md#requirements-optional)\n  - [Downloading Images 
(Optional)](https://github.com/lupantech/MathVista/blob/main/README.md#downloading-images-optional)\n  - [Evaluation Pipelines](https://github.com/lupantech/MathVista/blob/main/README.md#evaluation-pipelines)\n- [📝 Evaluation Scripts of Our Models](https://github.com/lupantech/MathVista/blob/main/README.md#-evaluation-scripts-of-our-models)\n  - [Evaluating Multimodal Bard](https://github.com/lupantech/MathVista/blob/main/README.md#evaluating-multimodal-bard)\n  - [Evaluating Chain-of-Thought GPT-4](https://github.com/lupantech/MathVista/blob/main/README.md#evaluating-chain-of-thought-gpt-4)\n  - [Evaluating Program-of-Thought GPT-4](https://github.com/lupantech/MathVista/blob/main/README.md#evaluating-program-of-thought-gpt-4)\n  - [Evaluating More Settings](https://github.com/lupantech/MathVista/blob/main/README.md#evaluating-more-settings)\n  - [Evaluating Large Multimodal Models](https://github.com/lupantech/MathVista/blob/main/README.md#evaluating-large-multimodal-models)\n- [📈 Evaluation Results](https://github.com/lupantech/MathVista/blob/main/README.md#-evaluation-results)\n- [📜 License](https://github.com/lupantech/MathVista/blob/main/README.md#-license)\n- [☕ Stay Connected!](https://github.com/lupantech/MathVista/blob/main/README.md#coffee-stay-connected)\n- [✅ Cite](https://github.com/lupantech/MathVista/blob/main/README.md#white_check_mark-cite)\n- [🧠 Related Work](https://github.com/lupantech/MathVista/blob/main/README.md#-related-work)\n- [🤝 Contributors](https://github.com/lupantech/MathVista/blob/main/README.md#-contributors)\n\n\n\n## 💥 Spotlight: Performance Update (Sept 12, 2024) 💥\n\n- **Eight models** have now surpassed the average human performance level (based on AMT workers with at least a high school diploma).\n- The top performers include:\n  - 🥇 **[OpenAI o1](https://openai.com/index/learning-to-reason-with-llms/)**\n  - 🥈 **[Grok-2](https://x.ai/blog/grok-2)**\n  - 🥉 **[Grok-2 mini](https://x.ai/blog/grok-2)**\n\n## 💥 News 💥\n- 
**[2024.09.12]** 💥 **OpenAI o1 🥇 Sets New SOTA on MathVista with 73.9!** OpenAI’s latest large multimodal model breaks the 70% barrier on **MathVista**, setting a new SOTA. Read more on the [OpenAI blog](https://openai.com/index/learning-to-reason-with-llms/).\n- **[2024.06.20]** 💥 **Claude 3.5 Sonnet achieves new SOTA** on MathVista with **67.7**! Learn more at the [Anthropic blog](https://www.anthropic.com/news/claude-3-5-sonnet).\n- **[2024.05.13]** 💥 **OpenAI's GPT-4o Outperforms Humans on MathVista!** For the first time, OpenAI's new GPT-4o model has achieved a higher score than the human average on MathVista, scoring **63.8** compared to humans' **60.3**. Learn more at the [OpenAI blog](https://openai.com/index/hello-gpt-4o/).\n- **[2024.01.16]** 🌟 Our **MathVista** paper has been accepted for an **Oral** presentation at **ICLR 2024** (only top 85 out of over 7200 submissions)! 🎉 Cheers!\n- **[2023.12.21]** 🚀 [Qwen-VL-Plus](https://github.com/QwenLM/Qwen-VL) achieves **43.3%**, establishing itself as the best-performing one in open-sourced models. 🎉 Congratulations!\n- **[2023.12.08]** 🔍 We've updated the leaderboard and radar graphs with the **fine-grained scores** of the **Gemini** family models. Thanks to the Gemini Team and Google for providing us with these results! 👏\n- **[2023.12.06]** 🚀 Google's newly released multimodal model, [Gemini](https://blog.google/technology/ai/google-gemini-ai/), shows impressive abilities on **MathVista**, achieving a new SOTA performance with **50.3%**! 🎉  Cheers!!\n- **[2023.11.17]** 🌟 Congratulations to [SPHINX (V2)](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX), which is now the SOTA open-source multimodal model on **MathVista**, reaching **36.7%**. 👏\n- **[2023.10.25]** 🚀 Dive into our comprehensive **112-page** evaluation of **GPT-4V**, Bard, and other Large Multimodal Models, encompassing both **quantitative** and **qualitative** insights. 
[Explore the full paper now!](https://arxiv.org/abs/2310.02255) 📄✨\n- **[2023.10.16]** 🔍 We are working on a comparative study on the **GPT-4V** model. Stay tuned for the detailed report! 📑.\n- **[2023.10.15]** We finished the manual evaluation of **GPT-4V** with the playground chatbot on the *testmini* set on **MathVista**. 🚀 GPT-4V achieves a substantial gain of **15.1%** ⬆️ over Bard, reaching a new record of **49.9%**! 🎉\n- **[2023.10.15]** Our dataset is now accessible at [Huggingface Datasets](https://huggingface.co/datasets/AI4Math/MathVista).\n- **[2023.10.15]** Our dataset is now accessible at [Paper With Code](https://paperswithcode.com/dataset/mathvista).\n- **[2023.10.03]** The top-performing model, 🎭 **Multimodal Bard**, achieved a score of **34.8%** on the *testmini* set for **MathVista** 📊.\n- **[2023.10.03]** Our work was featured by [Aran Komatsuzaki](https://twitter.com/arankomatsuzaki) on [Twitter](https://twitter.com/arankomatsuzaki/status/1709380140717809992). Thanks!\n- **[2023.10.03]** Our paper is now accessible at https://arxiv.org/abs/2310.02255.\n\n## 👀 About MathVista\n\n**Large Language Models (LLMs)** and **Large Multimodal Models (LMMs)** exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present **MathVista**, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of **6,141 examples**, derived from **28 existing multimodal datasets** involving mathematics and **3 newly created datasets** (i.e., **IQTest, FunctionQA, and PaperQA**). 
Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/data-composition.png\" width=\"40%\"\u003e \u003cbr\u003e\n  Source dataset distribution of \u003cb\u003eMathVista\u003c/b\u003e.\n\u003c/p\u003e\n\nIn October 2023, we conducted **a comprehensive, quantitative evaluation of 12 prominent foundation models** with **MathVista**. The best-performing **GPT-4V** model achieved an overall accuracy of **49.9%**, substantially outperforming Bard, the second-best performer, by **15.1%**. Our in-depth analysis revealed that the superiority of **GPT-4V** is mainly attributed to its enhanced visual perception and mathematical reasoning. However, **GPT-4V** still falls short of human performance by **10.4%**, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that **MathVista** will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. 
\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/score_leaderboard_gpt4v.png\" width=\"70%\"\u003e \u003cbr\u003e\n  Accuracy scores on the testmini set (1,000 examples) of \u003cb\u003eMathVista\u003c/b\u003e.\n\u003c/p\u003e\n\nWe further explore the new ability of **self-verification**, the use of **self-consistency**, and the **goal-directed multi-turn human-AI dialogues**, highlighting the promising potential of GPT-4V for future research.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/tease_scores_version4_gemini.png\" width=\"80%\"\u003e \u003cbr\u003e\n  Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on \u003cb\u003eMathVista\u003c/b\u003e.\n\u003c/p\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e🔍 See the accuracy scores without Gemini Ultra\u003c/summary\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/tease_scores_gpt4v.png\" width=\"80%\"\u003e \u003cbr\u003e\n  Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on \u003cb\u003eMathVista\u003c/b\u003e.\n\u003c/p\u003e\n\n\u003c/details\u003e\n\nFor more details, you can find our project page [here](https://mathvista.github.io/) and our paper [here](https://arxiv.org/abs/2310.02255).\n\n## 🏆 Leaderboard 🏆\n\n### Contributing the Leaderboard\n\n🚨🚨 The leaderboard is continuously being updated. 
\n\nThe evaluation instructions are available at [🔮 Evaluations on MathVista](https://github.com/lupantech/MathVista?tab=readme-ov-file#-evaluations-on-mathvista) and [📝 Evaluation Scripts of Our Models](https://github.com/lupantech/MathVista?tab=readme-ov-file#-evaluation-scripts-of-our-models).\n\nTo submit your results to the leaderboard on the **testmini** subset, please send your result JSON file and score JSON file to [this email](mailto:lupantech@gmail.com), following the template files below:\n\n- [output_testmini_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/output_testmini_template_for_leaderboard_submission.json)\n- [scores_testmini_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/scores_testmini_template_for_leaderboard_submission.json)\n\nTo submit your results to the leaderboard on the **test** subset, please send your result file to [this email](mailto:lupantech@gmail.com) (**we will generate the score file for you**), following the template file below:\n\n- [output_test_template_for_leaderboard_submission.json](https://github.com/lupantech/MathVista/blob/main/results/leaderboad_submission_template/output_test_template_for_leaderboard_submission.json)\n\n### Leaderboard on the testmini subset\n\nAccuracy scores on the **testmini** subset (1,000 examples):\n\n| **#** | **Model**                            | **Method** | **Source**                                                   | **Date**   | **ALL**  | **FQA** | **GPS** | **MWP** | **TQA** | **VQA** | **ALG** | **ARI** | **GEO** | **LOG** | **NUM** | **SCI** | **STA** |\n| ----- | ------------------------------------ | ---------- | ------------------------------------------------------------ | ---------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | 
------- | ------- |\n| -     | **Human Performance\\***              | -          | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **60.3** | 59.7    | 48.4    | 73.0    | 63.2    | 55.9    | 50.9    | 59.2    | 51.4    | 40.7    | 53.8    | 64.9    | 63.9    |\n| 1     | **OpenAI o1 🥇**                      | LMM 🖼️      | [Link](https://openai.com/index/learning-to-reason-with-llms/) | 2024-09-12 | **73.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 2     | **Grok-2 🥈**                         | LMM 🖼️      | [Link](https://x.ai/blog/grok-2)                             | 2024-08-13 | **69.0** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 3     | **Grok-2 mini 🥉**                    | LMM 🖼️      | [Link](https://x.ai/blog/grok-2)                             | 2024-08-13 | **68.1** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 4     | **Claude 3.5 Sonnet**                | LMM 🖼️      | [Link](https://www.anthropic.com/news/claude-3-5-sonnet)     | 2024-06-20 | **67.7** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 5     | **LLaVA-OneVision**                  | LMM 🖼️      | [Link](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) | 2024-08-06 | **67.5** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 6     | **InternVL2-Pro**                    | LMM 🖼️      | [Link](https://github.com/OpenGVLab/InternVL)                | 2024-09-04 | **66.8** | 70.6    | 65.4    | 76.9    | 71.5    | 48.0    | 66.5    | 62.3    | 63.6    | 27.0    | 40.3    | 65.6    | 81.1    |\n| 7     | **TextGrad (GPT-4o)**                | LMM 🖼️      | 
[Link](https://github.com/zou-group/textgrad)                | 2024-07-08 | **66.1** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 8     | **Gemini 1.5 Pro (May 2024)**        | LMM 🖼️      | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | 2024-05-17 | **63.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 9     | **GPT-4o**                           | LMM 🖼️      | [Link](https://openai.com/index/hello-gpt-4o/)               | 2024-05-13 | **63.8** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 10    | **InternVL-Chat-V1.2-Plus**          | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.14238)                     | 2024-02-22 | **59.9** | 51.7    | 61.1    | 79.6    | 52.5    | 57.0    | 54.5    | 63.2    | 61.1    | 16.2    | 48.6    | 55.7    | 60.8    |\n| 11    | **Gemini 1.5 Flash (May 2024)**      | LMM 🖼️      | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | 2024-05-17 | **58.4** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 12    | **GPT-4T 2024-04-09**                | LMM 🖼️      | [Link](https://openai.com/index/hello-gpt-4o/)               | 2024-05-13 | **58.1** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 13    | **Pixtral 12B**                      | LMM 🖼️      | [Link](https://x.com/_philschmid/status/1833954941624615151) | 2024-09-11 | **58.0** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 14    | **InternLM-XComposer2-VL-7B**        | LMM 🖼️      | [Link](https://github.com/InternLM/InternLM-XComposer)   
    | 2024-01-22 | **57.6** | 55.0    | 63.0    | 73.7    | 56.3    | 39.7    | 56.6    | 52.4    | 62.3    | 8.1     | 42.4    | 59.0    | 64.1    |\n| 15    | **Gemini 1.0 Ultra**                 | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.11805)                     | 2023-12-06 | **53.0** | 49.1    | 56.2    | 53.8    | 69.0    | 40.2    | 58.4    | 45.9    | 55.6    | 21.6    | 38.9    | 62.3    | 59.5    |\n| 16    | **Grok-1.5V**                        | LMM 🖼️      | [Link](https://x.ai/blog/grok-1.5v)                          | 2024-04-12 | **52.8** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 17    | **Gemini 1.5 Pro (Feb 2024)**        | LMM 🖼️      | [Link](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | 2024-02-15 | **52.1** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 18    | **Claude 3 Opus**                    | LMM 🖼️      | [Link](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) | 2024-03-04 | **50.5** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 19    | **GPT-4V (Playground)**              | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-15 | **49.9** | 43.1    | 50.5    | 57.5    | 65.2    | 38.0    | 53.0    | 49.0    | 51.0    | 21.6    | 20.1    | 63.1    | 55.8    |\n| 20    | **Claude 3 Sonnet**                  | LMM 🖼️      | [Link](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) | 2024-03-04 | **47.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 21    | **InternVL-Chat-V1.2**               | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.14238)             
        | 2024-02-22 | **47.7** | 50.9    | 61.1    | 30.6    | 48.1    | 44.7    | 52.3    | 36.5    | 58.2    | 18.9    | 30.6    | 54.9    | 51.8    |\n| 22    | **Math-LLaVA-13B**                   | LMM 🖼️      | [Link](http://arxiv.org/abs/2406.17294)                      | 2024-06-25 | **46.6** | 37.2    | 57.7    | 56.5    | 51.3    | 33.5    | 53.0    | 40.2    | 56.5    | 16.2    | 33.3    | 49.2    | 43.9    |\n| 23    | **LLaVA-NeXT-34B**                   | LMM 🖼️      | [Link](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) | 2024-01-30 | **46.5** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 24    | **Claude 3 Haiku**                   | LMM 🖼️      | [Link](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) | 2024-03-04 | **46.4** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 25    | **Gemini 1.0 Pro**                   | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.11805)                     | 2023-12-06 | **45.2** | 47.6    | 40.4    | 39.2    | 61.4    | 39.1    | 45.2    | 38.8    | 41.0    | 10.8    | 32.6    | 54.9    | 56.8    |\n| 26    | **Phi-3-Vision-128K-In**             | LMM 🖼️      | [Link](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) | 2024-05-21 | **44.5** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 27    | **Phi-3.5-Vision 4.2B**              | LMM 🖼️      | [Link](https://arxiv.org/abs/2404.14219)                     | 2024-04-22 | **43.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 28    | **Qwen-VL-Plus**                     | LMM 🖼️      | [Link](https://github.com/QwenLM/Qwen-VL)                    | 2023-12-21 | **43.3** | 54.6    | 38.5    
| 31.2    | 55.1    | 34.1    | 39.1    | 32.0    | 39.3    | 18.9    | 26.4    | 59.0    | 56.1    |\n| 29    | **Mini-Gemini-HD (Hermes-2-Yi-34B)** | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.18814)                     | 2024-03-27 | **43.3** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 30    | **SPHINX-MoE**                       | MoE 🤖      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **42.3** | 49.8    | 31.2    | 42.5    | 46.8    | 39.7    | 31.7    | 41.6    | 30.5    | 16.2    | 27.1    | 50.8    | 50.8    |\n| 31    | **Mini-Gemini (Mixtral-8x7B)**       | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.18814)                     | 2024-03-27 | **41.8** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 32    | **MM1-7B-MoE-Chat**                  | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.09611)                     | 2024-03-14 | **40.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 33    | **MiniCPM-V-2 (2.8B)**               | LMM 🖼️      | [Link](https://github.com/OpenBMB/MiniCPM-V)                 | 2024-04-14 | **40.6** | 53.2    | 26.0    | 37.1    | 44.3    | 39.1    | 28.5    | 33.1    | 28.0    | 10.8    | 39.6    | 48.4    | 51.8    |\n| 34    | **MM1-30B-Chat**                     | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.09611)                     | 2024-03-14 | **39.4** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 35    | **SPHINX-Plus**                      | MoE 🤖      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **36.8** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | 
-       | -       |\n| 36    | **SPHINX (V2)**                      | LMM 🖼️      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-17 | **36.7** | 54.6    | 16.4    | 23.1    | 41.8    | 43.0    | 20.6    | 33.4    | 17.6    | 24.3    | 21.5    | 43.4    | 51.5    |\n| 37    | **MM1-7B-Chat**                      | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.09611)                     | 2024-03-14 | **35.9** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 38    | **SPHINX-Intern2**                   | MoE 🤖      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **35.5** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 39    | **OmniLMM-12B**                      | LMM 🖼️      | [Link](https://github.com/OpenBMB/OmniLMM)                   | 2024-02-01 | **34.9** | 45.0    | 17.8    | 26.9    | 44.9    | 39.1    | 23.1    | 32.3    | 20.9    | 18.9    | 27.8    | 45.9    | 44.2    |\n| 40    | **Multimodal Bard**                  | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **34.8** | 26.0    | 47.1    | 29.6    | 48.7    | 26.8    | 46.5    | 28.6    | 47.8    | 13.5    | 14.9    | 47.5    | 33.0    |\n| 41    | **LLaVA-NeXT-Vicuna-7B**             | LMM 🖼️      | [Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/) | 2024-01-30 | **34.6** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 42    | **PoT GPT-4 (Caption+OCR)**          | Tool 🛠️     | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **33.9** | 30.1    | 39.4    | 30.6    | 39.9    | 31.3    | 37.4    | 31.7    | 41.0    | 18.9    | 20.1    | 44.3    | 37.9    |\n| 43    | **CoT Claude (Caption+OCR)**         | Tool 🛠️   
  | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **33.2** | 27.5    | 29.3    | 36.0    | 49.4    | 29.1    | 31.0    | 32.9    | 31.0    | 16.2    | 17.4    | 50.8    | 37.2    |\n| 44    | **CoT GPT4 (Caption+OCR)**           | Tool 🛠️     | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **33.2** | 27.9    | 31.7    | 31.2    | 51.9    | 28.5    | 33.5    | 30.9    | 32.2    | 13.5    | 12.5    | 58.2    | 37.9    |\n| 45    | **CoT ChatGPT (Caption+OCR)**        | Tool 🛠️     | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **33.2** | 26.0    | 31.7    | 35.5    | 48.1    | 30.2    | 32.4    | 32.3    | 33.0    | 16.2    | 17.4    | 54.9    | 36.2    |\n| 46    | **MM1-3B-MoE-Chat**                  | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.09611)                     | 2024-03-14 | **32.6** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 47    | **MM1-3B-Chat**                      | LMM 🖼️      | [Link](https://arxiv.org/abs/2403.09611)                     | 2024-03-14 | **32.0** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 48    | **Gemini 1.0 Nano 2**                | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.11805)                     | 2023-12-06 | **30.6** | 28.6    | 23.6    | 30.6    | 41.8    | 31.8    | 27.1    | 29.8    | 26.8    | 10.8    | 20.8    | 40.2    | 33.5    |\n| 49    | **LLaVA-1.5-13B**                    | LMM 🖼️      | [Link](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/) | 2024-01-30 | **27.6** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 50    | **SPHINX (V1)**                      | LMM 🖼️      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2023-11-09 | **27.5** | 
23.4    | 23.1    | 21.5    | 39.9    | 34.1    | 25.6    | 28.1    | 23.4    | 16.2    | 17.4    | 40.2    | 23.6    |\n| 51    | **Gemini 1.0 Nano 1**                | LMM 🖼️      | [Link](https://arxiv.org/abs/2312.11805)                     | 2023-12-06 | **27.3** | 30.9    | 21.6    | 23.7    | 29.1    | 30.7    | 23.8    | 25.5    | 21.3    | 13.5    | 20.8    | 27.9    | 30.9    |\n| 52    | **PoT ChatGPT (Caption+OCR)**        | Tool 🛠️     | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **26.8** | 24.5    | 26.4    | 23.7    | 33.5    | 27.9    | 27.8    | 26.1    | 28.0    | 18.9    | 13.2    | 33.6    | 29.9    |\n| 53    | **SPHINX-Tiny**                      | MoE 🤖      | [Link](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX) | 2024-01-12 | **26.4** | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       | -       |\n| 54    | **LLaVA (LLaMA-2-13B)**              | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **26.1** | 26.8    | 29.3    | 16.1    | 32.3    | 26.3    | 27.3    | 20.1    | 28.8    | 24.3    | 18.3    | 37.3    | 25.1    |\n| 55    | **InstructBLIP (Vicuna-7B)**         | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **25.3** | 23.1    | 20.7    | 18.3    | 32.3    | 35.2    | 21.8    | 27.1    | 20.7    | 18.9    | 20.4    | 33.0    | 23.1    |\n| 56    | **LLaVAR**                           | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **25.2** | 21.9    | 25.0    | 16.7    | 34.8    | 30.7    | 24.2    | 22.1    | 23.0    | 13.5    | 15.3    | 42.6    | 21.9    |\n| 57    | **LLaMA-Adapter-V2 (7B)**            | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **23.9** | 21.2    | 25.5    | 11.3    | 32.3    | 31.8    | 26.3    | 20.4    | 24.3    | 24.3    | 
13.9    | 29.5    | 18.3    |\n| 58    | **miniGPT4 (LLaMA-2-7B)**            | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **23.1** | 18.6    | 26.0    | 13.4    | 30.4    | 30.2    | 28.1    | 21.0    | 24.7    | 16.2    | 16.7    | 25.4    | 17.9    |\n| 59    | **mPLUG-Owl (LLaMA-7B)**             | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **22.2** | 22.7    | 23.6    | 10.2    | 27.2    | 27.9    | 23.6    | 19.2    | 23.9    | 13.5    | 12.7    | 26.3    | 21.4    |\n| 60    | **IDEFICS (9B-Instruct)**            | LMM 🖼️      | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **19.8** | 21.6    | 21.1    | 6.5     | 25.9    | 24.0    | 22.1    | 15.0    | 19.8    | 18.9    | 9.9     | 24.6    | 18.1    |\n| 61    | **Random Chance**                    | -          | [Link](https://arxiv.org/abs/2310.02255)                     | 2023-10-03 | **17.9** | 15.5    | 24.1    | 4.5     | 23.4    | 24.3    | 25.8    | 13.8    | 22.7    | 13.4    | 8.8     | 15.8    | 14.3    |\n\nSome notations in the table:\n\n- **Human Performance\\*:** Average human performance from AMT annotators who have high school diplomas or above.\n\n- **Gemini**: the fine-grained scores are from **the Gemini Team, Google**.\n\n- **GPT-4V (Playground)**: the launched playground at https://chat.openai.com/?model=gpt-4; experimental dates range from Oct 7, 2023, to Oct 15, 2023\n\n- **GPT-4**: the `gpt-4-0613` engine\n\n- **Method types**\n  -  **MoE 🤖:** Mixture of Experts\n  -  **LMM 🖼️:** Large Multimodal Model\n  -  **Tool 🛠️:** Tool-augmented Large Language Model\n  \n- **Task types:** \n  - **FQA:** figure question answering\n  - **GPS:** geometry problem solving\n  - **MWP:** math word problem solving\n  -  **TQA:** textbook question answering\n  - **VQA:** visual question answering\n- **Mathematical reasoning types:** \n  - **ALG:** algebraic reasoning\n  - 
**ARI:** arithmetic reasoning\n  - **GEO:** geometry reasoning\n  - **LOG:** logical reasoning\n  - **NUM:** numeric commonsense reasoning\n  - **SCI:** scientific reasoning\n  - **STA:** statistical reasoning\n\n🔔 The automatic evaluation on [CodaLab](https://codalab.org/) is under construction.\n\n\n## 📊 Dataset Examples\n\nExamples of our newly annotated datasets: **IQTest**, **FunctionQA**, and **PaperQA**:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/our_new_3_datasets.png\" width=\"60%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e🔍 Click to expand/collapse more examples\u003c/summary\u003e\n\nExamples of seven mathematical reasoning skills:\n\n1. Arithmetic Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/ari.png\" style=\"zoom:40%;\" /\u003e\n\n2. Statistical Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/sta.png\" style=\"zoom:40%;\" /\u003e\n\n3. Algebraic Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/alg.png\" style=\"zoom:40%;\" /\u003e\n\n4. Geometry Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/geo.png\" style=\"zoom:40%;\" /\u003e\n\n5. Numeric Commonsense Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/num.png\" style=\"zoom:40%;\" /\u003e\n\n6. Scientific Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/sci.png\" style=\"zoom:40%;\" /\u003e\n\n7. 
Logical Reasoning\n\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/skills/log.png\" style=\"zoom:40%;\" /\u003e\n\n\u003c/details\u003e\n\n## 📖 Dataset Usage\n\n### Data Source\n\nThe **MathVista** dataset is derived from three newly collected datasets (IQTest, FunctionQA, and PaperQA) and 28 other source datasets. Details can be found in the [source.json](https://huggingface.co/datasets/AI4Math/MathVista/blob/main/source.json) file. All these source datasets have been preprocessed and labeled for evaluation purposes.\n\n### Data Downloading\n\nAll the data examples are divided into two subsets: *testmini* and *test*.\n\n- **testmini**: 1,000 examples for model development and validation, or for those with limited computing resources.\n- **test**: 5,141 examples for standard evaluation. Notably, the answer labels for test will NOT be publicly released.\n\nYou can download this dataset with the following command (make sure that you have installed [Hugging Face Datasets](https://huggingface.co/docs/datasets/quickstart)):\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"AI4Math/MathVista\")\n```\n\nHere are some examples of how to access the downloaded dataset:\n\n```python\n# print the first example in the testmini set\nprint(dataset[\"testmini\"][0])\nprint(dataset[\"testmini\"][0]['pid']) # print the problem id\nprint(dataset[\"testmini\"][0]['question']) # print the question text\nprint(dataset[\"testmini\"][0]['query']) # print the query text\nprint(dataset[\"testmini\"][0]['image']) # print the image path\nprint(dataset[\"testmini\"][0]['answer']) # print the answer\ndataset[\"testmini\"][0]['decoded_image'] # display the image (in a notebook cell)\n\n# print the first example in the test set\nprint(dataset[\"test\"][0])\n```\n\nWe have uploaded a demo to illustrate how to access the MathVista dataset on Hugging Face, available at 
[hugging_face_dataset_demo.ipynb](https://github.com/lupantech/MathVista/blob/main/jupyter_notebook_demos/hugging_face_dataset_demo.ipynb).\n\n### Data Format\n\nThe dataset is provided in JSON format and contains the following attributes:\n\n```\n{\n    \"question\": [string] The question text,\n    \"image\": [string] A file path pointing to the associated image,\n    \"choices\": [list] Choice options for multiple-choice problems. For free-form problems, this could be a 'none' value,\n    \"unit\": [string] The unit associated with the answer, e.g., \"m^2\", \"years\". If no unit is relevant, it can be a 'none' value,\n    \"precision\": [integer] The number of decimal places the answer should be rounded to,\n    \"answer\": [string] The correct answer for the problem,\n    \"question_type\": [string] The type of question: \"multi_choice\" or \"free_form\",\n    \"answer_type\": [string] The format of the answer: \"text\", \"integer\", \"float\", or \"list\",\n    \"pid\": [string] Problem ID, e.g., \"1\",\n    \"metadata\": {\n        \"split\": [string] Data split: \"testmini\" or \"test\",\n        \"language\": [string] Question language: \"English\", \"Chinese\", or \"Persian\",\n        \"img_width\": [integer] The width of the associated image in pixels,\n        \"img_height\": [integer] The height of the associated image in pixels,\n        \"source\": [string] The source dataset from which the problem was taken,\n        \"category\": [string] The category of the problem: \"math-targeted-vqa\" or \"general-vqa\",\n        \"task\": [string] The task of the problem, e.g., \"geometry problem solving\",\n        \"context\": [string] The visual context type of the associated image,\n        \"grade\": [string] The grade level of the problem, e.g., \"high school\",\n        \"skills\": [list] A list of mathematical reasoning skills that the problem tests\n    },\n    \"query\": [string] The query text used as input (prompt) for the evaluation 
model\n}\n```\n\n### Data Visualization\n\n🎰 You can explore the dataset in an interactive way [here](https://mathvista.github.io/#visualization).\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the visualization page screenshot.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/data_visualizer.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n### Usage Demos\n\nWe offer a few demo examples for using the dataset, as follows:\n\n- Use the Bard API for inference: [bard_local_demo.ipynb](https://github.com/lupantech/MathVista/blob/main/jupyter_notebook_demos/bard_local_demo.ipynb)\n\nStay tuned for more demos coming soon!\n\n## 🔮 Evaluations on MathVista\n\n### Requirements (Optional)\n\nInstall the Python dependencies if you would like to reproduce our results for ChatGPT, GPT-4, Claude-2, and Bard:\n\n```sh\npip install openai # for ChatGPT and GPT-4\npip install anthropic # for Claude-2\npip install bardapi # for Bard\n```\n\nFor more details, please refer to:\n\n- [OpenAI API key](https://platform.openai.com/account/api-keys)\n- [Claude API Key](https://docs.anthropic.com/claude/reference/getting-started-with-the-api)\n- [Bard API Key](https://bard.google.com/)\n\nIf you are considering evaluating your own model, these dependencies might be optional.\n\n### Downloading Images (Optional)\n\nWe provide images in the JPG format. You can download and unzip them using the following commands:\n\n```sh\ncd data\nwget https://huggingface.co/datasets/AI4Math/MathVista/resolve/main/images.zip\nunzip images.zip \u0026\u0026 rm images.zip\n```\n\nThis step might be optional if you prefer to use the Hugging Face format of the data.\n\n### Evaluation Pipelines\n\nRecent foundation models have been trained to generate longer responses instead of brief text. As such, we propose a new strategy for benchmarking MathVista. 
This evaluation process comprises three stages:\n\n**(Step 1) Response Generation** ([generate_response.py](https://github.com/lupantech/MathVista/blob/main/evaluation/generate_response.py)): The models generate responses based on the given input query (prompt). This input query integrates the task description, the question, choices, and metadata. Such a design encourages the models to yield responses in the desired format, which in turn improves the overall evaluation scores. An example of such an input query is:\n\n```\nHint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Find $m\\\\angle H$\nChoices:\n(A) 97\n(B) 102\n(C) 107\n(D) 122\n```\n\nThe task description is defined as follows:\n\n| Question type   | Answer type | Task instruction                                             |\n| --------------- | ----------- | ------------------------------------------------------------ |\n| Multiple-choice | Text        | Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. |\n| Free-form       | Integer     | Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end. |\n| Free-form       | Float (1)   | Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end. |\n| Free-form       | Float (2)   | Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end. |\n| Free-form       | List        | Please answer the question requiring a Python list as an answer and provide the final list, e.g., [1, 2, 3], [1.2, 1.3, 1.4], at the end. |\n\n**(Step 2) Answer Extraction** ([extract_answer.py](https://github.com/lupantech/MathVista/blob/main/evaluation/extract_answer.py)): Next, the short answer text is extracted from the detailed response. 
We propose an answer extractor based on LLMs such as GPT-4. A preliminary study of 200 examples shows that GPT-4 can extract the answer text with more than 99.5% accuracy. Below are examples of extracting short answers from long responses:\n\n```\n# Example 1\nHint: Please answer the question requiring an integer answer and provide the final value,\ne.g., 1, 2, 3, at the end.\nQuestion: Which number is missing?\n\nModel response: The number missing in the sequence is 14.\n\nExtracted answer: 14\n\n# Example 2\nHint: Please answer the question and provide the correct option letter, e.g., A, B, C,\nD, at the end.\nQuestion: What fraction of the shape is blue?\nChoices: \n(A) 3/11 \n(B) 8/11 \n(C) 6/11 \n(D) 3/5\n\nModel response: The correct answer is (B) 8/11.\n\nExtracted answer: B\n```\n\n**(Step 3) Score Calculation** ([calculate_score.py](https://github.com/lupantech/MathVista/blob/main/evaluation/calculate_score.py)): Finally, the extracted answer is normalized to the required answer format (e.g., an option letter or an integer), and the target metric scores are computed.\n\n## 📝 Evaluation Scripts of Our Models\n\nTo execute the evaluation scripts in our paper, ensure your `data` folder has the following structure:\n\n```\n├── query.json\n├── test.json\n├── testmini.json\n├── images\n    ├── 1.jpg\n    ├── 2.jpg\n    └── ...\n└── texts\n    ├── captions_bard.json\n    └── ocrs_easyocr.json\n```\n\nAdditionally, ensure that the API keys for ChatGPT, GPT-4, Claude-2, and Bard are properly set up.\n\n### Evaluating Multimodal Bard\n\nIf you have set up Multimodal Bard, you can run the following commands:\n\nGenerate the response on the **testmini** subset:\n\n```sh\ncd evaluation\n\npython generate_response.py \\\n--model bard \\\n--output_dir ../results/bard \\\n--output_file output_bard.json\n```\n\nExtract the short answer text for score calculation on the **testmini** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/bard \\\n--output_file 
output_bard.json \n```\n\nCalculate the final score on the **testmini** subset:\n\n```sh\npython calculate_score.py \\\n--output_dir ../results/bard \\\n--output_file output_bard.json \\\n--score_file scores_bard.json\n```\n\nGenerate the response on the **test** subset:\n\n```sh\npython generate_response.py \\\n--model bard \\\n--input_file test.json \\\n--output_dir ../results/bard \\\n--output_file output_bard_test.json\n```\n\nExtract the short answer text for score calculation on the **test** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/bard \\\n--output_file output_bard_test.json \n```\n\n### Evaluating Chain-of-Thought GPT-4\n\nGenerate the response on the **testmini** subset:\n\n```sh\ncd evaluation\n\npython generate_response.py \\\n--model gpt-4-0613 \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_solution_use_caption_ocr.json \\\n--shot_num 2 \\\n--shot_type solution \\\n--use_caption \\\n--use_ocr \\\n--caption_file ../data/texts/captions_bard.json \\\n--ocr_file ../data/texts/ocrs_easyocr.json \n```\n\nExtract the short answer text for score calculation on the **testmini** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_solution_use_caption_ocr.json\n```\n\nCalculate the final score on the **testmini** subset:\n\n```sh\npython calculate_score.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_solution_use_caption_ocr.json \\\n--score_file scores_gpt4_2shot_solution_use_caption_ocr.json\n```\n\nGenerate the response on the **test** subset:\n\n```sh\npython generate_response.py \\\n--model gpt-4-0613 \\\n--input_file test.json \\\n--output_dir ../results/gpt4 \\\n--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \\\n--shot_num 2 \\\n--shot_type solution \\\n--use_caption \\\n--use_ocr \\\n--caption_file ../data/texts/captions_bard.json \\\n--ocr_file ../data/texts/ocrs_easyocr.json \n```\n\nExtract the short 
answer text for score calculation on the **test** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \n```\n\n### Evaluating Program-of-Thought GPT-4\n\nGenerate the response on the **testmini** subset:\n\n```sh\ncd evaluation\n\npython generate_response.py \\\n--model gpt-4-0613 \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_code_use_caption_ocr.json \\\n--shot_num 2 \\\n--shot_type code \\\n--use_caption \\\n--use_ocr \\\n--caption_file ../data/texts/captions_bard.json \\\n--ocr_file ../data/texts/ocrs_easyocr.json \n```\n\nExtract the short answer text for score calculation on the **testmini** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_code_use_caption_ocr.json \\\n--response_label execution\n```\n\nCalculate the final score on the **testmini** subset:\n\n```sh\npython calculate_score.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_gpt4_2shot_code_use_caption_ocr.json \\\n--score_file scores_gpt4_2shot_code_use_caption_ocr.json\n```\n\nGenerate the response on the **test** subset:\n\n```sh\npython generate_response.py \\\n--model gpt-4-0613 \\\n--input_file test.json \\\n--output_dir ../results/gpt4 \\\n--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \\\n--shot_num 2 \\\n--shot_type code \\\n--use_caption \\\n--use_ocr \\\n--caption_file ../data/texts/captions_bard.json \\\n--ocr_file ../data/texts/ocrs_easyocr.json \n```\n\nExtract the short answer text for score calculation on the **test** subset:\n\n```sh\npython extract_answer.py \\\n--output_dir ../results/gpt4 \\\n--output_file output_test_gpt4_2shot_code_use_caption_ocr.json \\\n--response_label execution\n```\n\n### Evaluating More Settings\n\nFor additional settings for large language models and other baselines, please refer to the running scripts available in the 
[`scripts`](https://github.com/lupantech/MathVista/tree/main/scripts) directory.\n\n### Evaluating Large Multimodal Models\n\nWe thank [Hritik Bansal](https://sites.google.com/view/hbansal) and the [VisIT-Bench](https://github.com/mlfoundations/VisIT-Bench/tree/main) project for providing easy-to-use [codes](https://github.com/mlfoundations/VisIT-Bench/tree/main/baselines) for evaluating most of the large multimodal models included in our paper.\n\n## 📈 Evaluation Results\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the examples.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/5.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the examples.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/6.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the example.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/48.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the example.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/50.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the example.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/52.png\" style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand/collapse the example.\u003c/summary\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/lupantech/MathVista/main/assets/results_examples/53.png\" 
style=\"zoom:40%;\" /\u003e\n\u003c/details\u003e\n\nWe store the result files from different models in the [results](https://github.com/lupantech/MathVista/tree/main/results/) directory.\n\n🐙 For visualization of these results, visit our [exploration](https://mathvista.github.io/#explorer) page.\n\n## 📜 License\n\nThe new contributions to our dataset are distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license, including\n\n- The creation of three datasets: IQTest, FunctionQA, and PaperQA;\n- The filtering and cleaning of source datasets;\n- The standard formalization of instances for evaluation purposes;\n- The annotations of metadata.\n\nThe copyright of the images and the questions belongs to the original authors, and the source of every image and original question can be found in the `metadata` field and in the [source.json](https://huggingface.co/datasets/AI4Math/MathVista/blob/main/source.json) file. Alongside this license, the following conditions apply:\n\n- **Purpose:** The dataset was primarily designed for use as a test set.\n- **Commercial Use:** The dataset can be used commercially as a test set, but using it as a training set is prohibited. By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.\n\n## :coffee: Stay Connected!\n\nFantastic! I'm always open to engaging discussions, collaborations, or even just sharing a virtual coffee. 
To get in touch, visit [Pan Lu](https://lupantech.github.io/)'s homepage for contact information.\n\n\n## :white_check_mark: Cite\n\nIf you find **MathVista** useful for your research and applications, please kindly cite using this BibTeX:\n\n```latex\n@inproceedings{lu2024mathvista,\n  title={MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},\n  author={Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},\n  booktitle={International Conference on Learning Representations (ICLR)},\n  year={2024}\n}\n```\n\n## 🧠 Related Work\n\nExplore our additional research on **large language models** and **large multimodal models**, focusing on mathematical reasoning, scientific reasoning, and multimodal reasoning:\n\n- **[Chameleon]** [Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models](https://chameleon-llm.github.io/)\n- **[ScienceQA]** [Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering](https://scienceqa.github.io/)\n- **[LLaMA-Adapter]** [LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention](https://github.com/OpenGVLab/LLaMA-Adapter)\n- **[LLaMA-Adapter V2]** [LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model](https://github.com/OpenGVLab/LLaMA-Adapter)\n- **[DL4MATH]** [A Survey of Deep Learning for Mathematical Reasoning](https://arxiv.org/abs/2212.10535)\n- **[PromptPG]** [Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning](https://promptpg.github.io/)\n- **[SciBench]** [SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models](https://arxiv.org/abs/2307.10635)\n- **[TheoremQA]** [TheoremQA: A Theorem-driven Question Answering dataset](https://arxiv.org/abs/2305.12524)\n- **[Līla]** [A Unified Benchmark for Mathematical 
Reasoning](https://lila.apps.allenai.org/)\n- **[IconQA]** [IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning](https://iconqa.github.io/)\n- **[Inter-GPS]** [Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning](https://lupantech.github.io/inter-gps/)\n\n## 🤝 Contributors\n\nHere are the key contributors to this project:\n\n[Pan Lu](https://lupantech.github.io/)\u003csup\u003e1\u003c/sup\u003e, [Hritik Bansal](https://sites.google.com/view/hbansal)\u003csup\u003e1\u003c/sup\u003e, [Tony Xia](https://tonyxia2001.github.io/)\u003csup\u003e1\u003c/sup\u003e, [Jiacheng Liu](https://liujch1998.github.io/)\u003csup\u003e2\u003c/sup\u003e, [Chunyuan Li](https://chunyuan.li/)\u003csup\u003e3\u003c/sup\u003e, [Hannaneh Hajishirzi](https://homes.cs.washington.edu/~hannaneh/)\u003csup\u003e2\u003c/sup\u003e, [Hao Cheng](https://sites.google.com/site/hcheng2site/Home)\u003csup\u003e3\u003c/sup\u003e, [Kai-Wei Chang](http://web.cs.ucla.edu/~kwchang/)\u003csup\u003e1\u003c/sup\u003e, [Michel Galley](https://www.microsoft.com/en-us/research/people/mgalley/?from=https://research.microsoft.com/~mgalley\u0026type=exact)\u003csup\u003e3\u003c/sup\u003e, [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)\u003csup\u003e3\u003c/sup\u003e\n\n\u003csup\u003e1\u003c/sup\u003eUniversity of California, Los Angeles, \u003csup\u003e2\u003c/sup\u003eUniversity of Washington, \u003csup\u003e3\u003c/sup\u003eMicrosoft Research\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flupantech%2Fmathvista","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flupantech%2Fmathvista","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flupantech%2Fmathvista/lists"}