{"id":13652385,"url":"https://github.com/AI21Labs/lm-evaluation","last_synced_at":"2025-04-23T03:30:47.468Z","repository":{"id":47744118,"uuid":"393050670","full_name":"AI21Labs/lm-evaluation","owner":"AI21Labs","description":"Evaluation suite for large-scale language models.","archived":false,"fork":false,"pushed_at":"2021-08-15T13:49:52.000Z","size":20,"stargazers_count":123,"open_issues_count":2,"forks_count":14,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-10T03:35:26.737Z","etag":null,"topics":["evaluation-framework","language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AI21Labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-05T13:22:20.000Z","updated_at":"2024-07-18T13:10:21.000Z","dependencies_parsed_at":"2022-09-08T12:51:04.711Z","dependency_job_id":null,"html_url":"https://github.com/AI21Labs/lm-evaluation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI21Labs%2Flm-evaluation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI21Labs%2Flm-evaluation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI21Labs%2Flm-evaluation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI21Labs%2Flm-evaluation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AI21Labs","download_url":"https://codeload.github.com/AI21Labs/lm-evaluation/tar.gz/refs/heads/main","host":{"name":"G
itHub","url":"https://github.com","kind":"github","repositories_count":250365260,"owners_count":21418657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation-framework","language-model"],"created_at":"2024-08-02T02:00:58.882Z","updated_at":"2025-04-23T03:30:43.225Z","avatar_url":"https://github.com/AI21Labs.png","language":"Python","funding_links":[],"categories":["Tools"],"sub_categories":[],"readme":"# LM Evaluation Test Suite\nThis repo contains code for running the evaluations and reproducing the results from the [Jurassic-1 Technical Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) (see [blog post](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1)), with current support for running the tasks through both the [AI21 Studio API](https://studio.ai21.com/) and [OpenAI's GPT3 API](https://beta.openai.com/).\n\n## Citation\nPlease use the following bibtex entry:\n```\n@techreport{J1WhitePaper,\n  author = {Lieber, Opher and Sharir, Or and Lenz, Barak and Shoham, Yoav},\n  title = {Jurassic-1: Technical Details And Evaluation},\n  institution = {AI21 Labs},\n  year = 2021,\n  month = aug,\n}\n```\n\n## Installation\n```\ngit clone https://github.com/AI21Labs/lm-evaluation.git\ncd lm-evaluation\npip install -e .\n```\n\n## Usage\nThe entry point for running the evaluations is lm_evaluation/run_eval.py, which receives a list of tasks and models to run. 
\n\nThe models argument should be in the form \"provider/model_name\" where provider can be \"ai21\" or \"openai\" and the model name is one of the provider's supported models.\n\nWhen running through one of the API models, set your API key(s) using the environment variables AI21_STUDIO_API_KEY and OPENAI_API_KEY. Make sure to consider the costs and quota limits of the models you are running beforehand.\n\nExamples:\n```console\n# Evaluate hellaswag and winogrande on j1-large\npython -m lm_evaluation.run_eval --tasks hellaswag winogrande --models ai21/j1-large\n\n# Evaluate all multiple-choice tasks on j1-jumbo\npython -m lm_evaluation.run_eval --tasks all_mc --models ai21/j1-jumbo\n\n# Evaluate all docprob tasks on curie and j1-large\npython -m lm_evaluation.run_eval --tasks all_docprobs --models ai21/j1-large openai/curie\n\n```\n\n## Datasets\nThe repo currently supports the zero-shot multiple-choice and document probability datasets reported in the [Jurassic-1 Technical Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf).\n\n### Multiple Choice\nMultiple choice datasets are formatted as described in the [GPT3 paper](https://arxiv.org/abs/2005.14165), and the default reported evaluation metrics are those described there.\n\nAll our formatted datasets except for storycloze are publicly available and referenced in [lm_evaluation/tasks_config.py](lm_evaluation/tasks_config.py). 
Storycloze needs to be [manually downloaded](https://cs.rochester.edu/nlp/rocstories/) and formatted, and the location should be configured through the environment variable 'STORYCLOZE_TEST_PATH'.\n\n### Document Probabilities\nDocument probability tasks include documents from 19 data sources, including [C4](https://www.tensorflow.org/datasets/catalog/c4) and datasets from ['The Pile'](https://arxiv.org/abs/2101.00027).\n\nEach document is pre-split at sentence boundaries into sub-documents of up to 1024 GPT tokens each, to ensure all models see the same inputs/contexts regardless of tokenization, and to support evaluation of models which are limited to sequence lengths of 1024.\n\nEach of the 19 tasks has ~4MB of total text data.\n\n## Additional Configuration\n\n### Results Folder\nBy default, all results are saved to the folder 'results', and rerunning the same tasks loads the existing results. The results folder can be changed using the environment variable LM_EVALUATION_RESULTS_DIR.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAI21Labs%2Flm-evaluation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAI21Labs%2Flm-evaluation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAI21Labs%2Flm-evaluation/lists"}