{"id":13605123,"url":"https://github.com/hendrycks/test","last_synced_at":"2025-05-16T04:06:05.162Z","repository":{"id":49338798,"uuid":"293649366","full_name":"hendrycks/test","owner":"hendrycks","description":"Measuring Massive Multitask Language Understanding | ICLR 2021","archived":false,"fork":false,"pushed_at":"2023-05-28T18:28:58.000Z","size":2345,"stargazers_count":1374,"open_issues_count":14,"forks_count":100,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-04-08T15:04:57.080Z","etag":null,"topics":["few-shot-learning","gpt-3","muti-task","transfer-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2009.03300","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hendrycks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-09-07T23:02:57.000Z","updated_at":"2025-04-08T13:51:28.000Z","dependencies_parsed_at":"2024-03-24T08:33:34.086Z","dependency_job_id":"efe03de8-13a4-4c3e-8c49-31eaa66e86c7","html_url":"https://github.com/hendrycks/test","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendrycks%2Ftest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendrycks%2Ftest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendrycks%2Ftest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hendrycks%2Ftest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hendrycks","d
ownload_url":"https://codeload.github.com/hendrycks/test/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254464895,"owners_count":22075570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["few-shot-learning","gpt-3","muti-task","transfer-learning"],"created_at":"2024-08-01T19:00:54.900Z","updated_at":"2025-05-16T04:06:01.684Z","avatar_url":"https://github.com/hendrycks.png","language":"Python","readme":"# Measuring Massive Multitask Language Understanding\nThis is the repository for [Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) by\n[Dan Hendrycks](https://people.eecs.berkeley.edu/~hendrycks/), [Collin Burns](http://collinpburns.com), [Steven Basart](https://stevenbas.art), [Andy Zou](https://andyzoujm.github.io/), Mantas Mazeika, [Dawn Song](https://people.eecs.berkeley.edu/~dawnsong/), and [Jacob Steinhardt](https://www.stat.berkeley.edu/~jsteinhardt/) (ICLR 2021).\n\nThis repository contains OpenAI API evaluation code, and the test is available for download [**here**](https://people.eecs.berkeley.edu/~hendrycks/data.tar).\n\n## Test Leaderboard\n\nIf you want to have your model added to the leaderboard, please reach out to us or submit a pull request.\n\n\nResults of the test:\n|                Model               | Authors |  Humanities |  Social Sciences  | STEM | Other | Average |\n|------------------------------------|----------|:-------:|:-------:|:-------:|:-------:|:-------:|\n| [Chinchilla](https://arxiv.org/abs/2203.15556) (70B, few-shot) | Hoffmann et al., 
2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5\n| [Gopher](https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf) (280B, few-shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0\n| [GPT-3](https://arxiv.org/abs/2005.14165) (175B, fine-tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9\n| [flan-T5-xl](https://arxiv.org/abs/2210.11416) | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3\n| [UnifiedQA](https://arxiv.org/abs/2005.00700) | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9\n| [GPT-3](https://arxiv.org/abs/2005.14165) (175B, few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9\n| [GPT-3](https://arxiv.org/abs/2005.14165) (6.7B, fine-tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2\n| [flan-T5-large](https://arxiv.org/abs/2210.11416) | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9\n| [flan-T5-base](https://arxiv.org/abs/2210.11416) | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2\n| [GPT-2](https://arxiv.org/abs/2005.14165) | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4\n| [flan-T5-small](https://arxiv.org/abs/2210.11416) | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5\n| Random Baseline           | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0\n\n\n## Citation\n\nIf you find this useful in your research, please consider citing the test and also the [ETHICS](https://arxiv.org/abs/2008.02275) dataset it draws from:\n\n    @article{hendryckstest2021,\n      title={Measuring Massive Multitask Language Understanding},\n      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},\n      journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n      year={2021}\n    }\n\n    @article{hendrycks2021ethics,\n      title={Aligning AI With Shared Human Values},\n      author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n      journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n      year={2021}\n    }\n","funding_links":[],"categories":["📂 Benchmarks \u0026 Datasets","Dataset","A01_文本生成_文本对话","Python","Benchmark","Model Evaluation \u0026 Benchmarking","Benchmarks \u0026 Datasets","🛠️ AI 工具与框架","Benchmarks","11. Benchmarks \u0026 Leaderboards","🔬 Research \u0026 Evaluation Tools","📊 AI Evaluation \u0026 Benchmarks"],"sub_categories":["Only Text","大语言对话模型及数据","English","LangManus","General Language Understanding","模型评估","General","Data \u0026 Alignment Tools","Academic \u0026 Research Platforms","Usage Tips"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhendrycks%2Ftest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhendrycks%2Ftest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhendrycks%2Ftest/lists"}