{"id":19651403,"url":"https://github.com/nlp-uoregon/mlmm-evaluation","last_synced_at":"2025-08-02T17:11:38.066Z","repository":{"id":187070755,"uuid":"675511365","full_name":"nlp-uoregon/mlmm-evaluation","owner":"nlp-uoregon","description":"Multilingual Large Language Models Evaluation Benchmark","archived":false,"fork":false,"pushed_at":"2024-08-21T09:21:31.000Z","size":3287,"stargazers_count":124,"open_issues_count":14,"forks_count":18,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-24T14:52:05.860Z","etag":null,"topics":["datasets","evaluation","evaluation-datasets","evaluation-framework","language-model","large-language-models","multilingual","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nlp-uoregon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-07T05:08:46.000Z","updated_at":"2025-06-16T06:28:20.000Z","dependencies_parsed_at":"2023-08-08T22:00:00.905Z","dependency_job_id":"5671ee02-6ce3-45c9-b04b-63fd293ab7ea","html_url":"https://github.com/nlp-uoregon/mlmm-evaluation","commit_stats":{"total_commits":18,"total_committers":5,"mean_commits":3.6,"dds":0.5,"last_synced_commit":"0590a08356140243523b2befbb8817361aed2487"},"previous_names":["nlp-uoregon/mlmm-evaluation"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nlp-uoregon/mlmm-evaluation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Fmlmm-evaluation","ta
gs_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Fmlmm-evaluation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Fmlmm-evaluation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Fmlmm-evaluation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nlp-uoregon","download_url":"https://codeload.github.com/nlp-uoregon/mlmm-evaluation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nlp-uoregon%2Fmlmm-evaluation/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268424029,"owners_count":24248119,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","evaluation","evaluation-datasets","evaluation-framework","language-model","large-language-models","multilingual","natural-language-processing","nlp"],"created_at":"2024-11-11T15:06:27.902Z","updated_at":"2025-08-02T17:11:38.012Z","avatar_url":"https://github.com/nlp-uoregon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e \u003cp\u003e Evaluation Framework for Multilingual Large Language Models \u003c/p\u003e\u003c/h1\u003e\n\n\u003cdiv 
align=\"center\"\u003e\n    \u003ca href=\"https://github.com/nlp-uoregon/mlmm-evaluation/blob/main/LICENSE\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/nlp-uoregon/mlmm-evaluation/blob/main/DATA_LICENSE\"\u003e\n        \u003cimg alt=\"GitHub data\" src=\"https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg\"\u003e\n    \u003c/a\u003e\n\u003c/div\u003e\n\n## Overview\n\nThis repo contains benchmark datasets and evaluation scripts for Multilingual Large Language Models (LLMs). These datasets can be used to evaluate models across 26 languages and cover three tasks: ARC, HellaSwag, and MMLU. This is released as a part of our [Okapi framework](https://github.com/nlp-uoregon/Okapi) for multilingual instruction-tuned LLMs with reinforcement learning from human feedback.\n\n\n- [**ARC**](https://allenai.org/data/arc): A dataset with 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering.\n- [**HellaSwag**](https://allenai.org/data/hellaswag): HellaSwag is a dataset for studying grounded commonsense inference. It consists of 70k multiple-choice questions about grounded situations: each question comes from one of two domains, *activitynet* or *wikihow*, with four answer choices about what might happen next in the scene. The correct answer is the (real) sentence for the next event; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans.\n- [**MMLU**](https://arxiv.org/pdf/2009.03300.pdf): This dataset contains multiple-choice questions derived from diverse fields of knowledge. 
The test covers subjects in the humanities, social sciences, hard sciences, and other essential areas of learning for certain individuals.\n\nCurrently, our datasets support 26 languages: Russian, German, Chinese, French, Spanish, Italian, Dutch, Vietnamese, Indonesian, Arabic, Hungarian, Romanian, Danish, Slovak, Ukrainian, Catalan, Serbian, Croatian, Hindi, Bengali, Tamil, Nepali, Malayalam, Marathi, Telugu, and Kannada. \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/Okapi_Languages.png\" width=\"450\"/\u003e\n\u003c/p\u003e\n\nThese datasets are translated from the original ARC, HellaSwag, and MMLU datasets in English using ChatGPT. Our technical paper for Okapi, which describes the datasets and reports evaluation results for several multilingual LLMs (e.g., BLOOM, LLaMa, and our Okapi models), can be found [here](https://arxiv.org/pdf/2307.16039.pdf).\n\n**Usage and License Notices**: Our evaluation framework is intended and licensed for research use only. The datasets are licensed under CC BY NC 4.0 (allowing only non-commercial use) and should not be used outside of research purposes.\n\n## Install\n\nTo install `lm-eval` from the main branch of our repository, run:\n\n```bash\ngit clone https://github.com/nlp-uoregon/mlmm-evaluation.git\ncd mlmm-evaluation\npip install -e \".[multilingual]\"\n```\n\n## Basic Usage\nFirst, download the multilingual evaluation datasets using the following script:\n```bash\nbash scripts/download.sh\n```\n\nTo evaluate your model on the three tasks, you can use the following script:\n```bash\nbash scripts/run.sh [LANG] [YOUR-MODEL-PATH]\n```\n\nFor instance, if you want to evaluate our [Okapi Vietnamese model](https://huggingface.co/uonlp/okapi-vi-bloom), you could run:\n```bash\nbash scripts/run.sh vi uonlp/okapi-vi-bloom\n```\n\n## Leaderboard\n\nWe maintain a [leaderboard](https://huggingface.co/spaces/uonlp/open_multilingual_llm_leaderboard) for tracking the progress of multilingual LLMs. 
\n\n## Acknowledgements\nOur framework is largely inherited from the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) repo from EleutherAI. Please also cite their repo if you use the code.\n\n## Citation\nIf you use the data, model, or code in this repository, please cite:\n\n```bibtex\n@article{dac2023okapi,\n  title={Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback},\n  author={Dac Lai, Viet and Van Nguyen, Chien and Ngo, Nghia Trung and Nguyen, Thuat and Dernoncourt, Franck and Rossi, Ryan A and Nguyen, Thien Huu},\n  journal={arXiv e-prints},\n  pages={arXiv--2307},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlp-uoregon%2Fmlmm-evaluation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnlp-uoregon%2Fmlmm-evaluation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnlp-uoregon%2Fmlmm-evaluation/lists"}