{"id":13652990,"url":"https://github.com/DAMO-NLP-SG/M3Exam","last_synced_at":"2025-04-23T06:30:59.882Z","repository":{"id":173929852,"uuid":"651441273","full_name":"DAMO-NLP-SG/M3Exam","owner":"DAMO-NLP-SG","description":"Data and code for paper \"M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models\"","archived":false,"fork":false,"pushed_at":"2023-06-15T03:07:28.000Z","size":702,"stargazers_count":99,"open_issues_count":6,"forks_count":12,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-04-07T22:05:31.083Z","etag":null,"topics":["ai-education","chatgpt","evaluation","gpt-4","large-language-models","llms","multilingual","multimodal"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DAMO-NLP-SG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-09T08:41:11.000Z","updated_at":"2025-03-27T09:52:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"e03daa82-bb5e-4682-b741-239a914322d5","html_url":"https://github.com/DAMO-NLP-SG/M3Exam","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"832a49585d1e8049612b4fd4669f3f8fee9c6014"},"previous_names":["damo-nlp-sg/m3exam"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DAMO-NLP-SG%2FM3Exam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DAMO-NLP-SG%2FM3Exam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DAMO-NLP-SG%2FM3Exam/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DAMO-NLP-SG%2FM3Exam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DAMO-NLP-SG","download_url":"https://codeload.github.com/DAMO-NLP-SG/M3Exam/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250384796,"owners_count":21421794,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-education","chatgpt","evaluation","gpt-4","large-language-models","llms","multilingual","multimodal"],"created_at":"2024-08-02T02:01:04.675Z","updated_at":"2025-04-23T06:30:58.399Z","avatar_url":"https://github.com/DAMO-NLP-SG.png","language":"Python","funding_links":[],"categories":["Datasets-or-Benchmark","多模态大模型"],"sub_categories":["通用","网络服务_其他"],"readme":"# M3Exam: A Multilingual 🌏, Multimodal 🖼, Multilevel 📈 Benchmark for LLMs\n\nThis is the repository for [M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models](https://arxiv.org/pdf/2306.05179.pdf).\n\nTL;DR: We introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context.\n\n![image](./images/m3exam-examples.jpg)\n\n\n## Data\n### Access the data\n* You can download the data from [here](https://cutt.ly/m3exam-data).\n* The downloaded folder will be encrypted (to prevent some automatic crawling scripts). 
Please get the password from the bottom of this page.\n* After unzipping the file, you will see the following file structure:\n```\ndata/\n    multimodal-questions/         \u003c- questions requiring images\n        xx-questions-image.json   \u003c- file containing the questions, xx is a language\n        iamges-xx/                \u003c- folder containing all the images for xx\n    text-questions/               \u003c- questions with pure text\n        xx-questions-dev.json     \u003c- held-out data (e.g., can be used as in-context examples)\n        xx-questions-test.json    \u003c- main test data for evaluation\n```\n\n### Data format\n* Questions are stored in JSON format; you can read each JSON file to check the data. For example:\n\n```python\nimport json\n\n# lang is a language name such as 'english'; open the file for reading\nwith open(f'./data/text-questions/{lang}-questions-dev.json', 'r', encoding='utf-8') as f:\n    data = json.load(f)  # data is a list of questions\n```\n\n* Each question is stored in JSON format:\n\n```\n{\n    'question_text': 'Which Civil War event occurred first?',\n    'background_description': [],\n    'answer_text': '2',\n    'options': ['(1) battle of Gettysburg',\n    '(2) firing on Fort Sumter',\n    '(3) assassination of President Lincoln',\n    '(4) Emancipation Proclamation'],\n    'need_image': 'no',\n    'language': 'english',\n    'level': 'mid',\n    'subject': 'social',\n    'subject_category': 'social-science',\n    'year': '2006'\n}\n```\n\n\n## Evaluation\n* first you need to fill in your OpenAI API key in the bash files:\n```\npython main.py \\\n--setting zero-shot \\\n--model chat \\\n--use_api \\\n--selected_langs \"['english']\" \\\n--api_key #put your key here\n```\n* then you can quickly check by running `quick_run.sh`, which will run on 10 English questions and produce `english-pred.json` in the corresponding output folder\n* to evaluate, you can also run `eval.sh` to check the performance on these 10 examples!\n* to run on more data, you can refer to `run.sh` for more detailed settings\n```\npython main.py 
\\\n--setting zero-shot \\\n--model chat \\\n--use_api \\\n--selected_langs \"['english']\" \\\n--selected_levels \"['low', 'mid', 'high']\" \\\n--num_samples all \\\n--api_key #put your key here\n```\n    * specify the languages you want to run through `--selected_langs`\n    * to run on all questions, set `--num_samples all`\n\n\n## Citation\nIf you find this useful in your research, please consider citing it:\n```\n@article{zhang2023m3exam,\n      title={M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models},\n      author={Wenxuan Zhang and Sharifah Mahani Aljunied and Chang Gao and Yew Ken Chia and Lidong Bing},\n      year={2023},\n      eprint={2306.05179},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\npassword: 12317","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDAMO-NLP-SG%2FM3Exam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDAMO-NLP-SG%2FM3Exam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDAMO-NLP-SG%2FM3Exam/lists"}