{"id":23652796,"url":"https://github.com/allenai/olmes","last_synced_at":"2025-10-13T15:56:50.525Z","repository":{"id":263725632,"uuid":"882122902","full_name":"allenai/olmes","owner":"allenai","description":"Reproducible, flexible LLM evaluations","archived":false,"fork":false,"pushed_at":"2025-07-11T23:17:19.000Z","size":4014,"stargazers_count":239,"open_issues_count":8,"forks_count":45,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-01T03:47:07.913Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/allenai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-02T00:05:38.000Z","updated_at":"2025-08-29T22:48:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"e0daf977-2f24-4586-9a86-f43c316ce833","html_url":"https://github.com/allenai/olmes","commit_stats":null,"previous_names":["allenai/olmes"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/allenai/olmes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Folmes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Folmes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Folmes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Folmes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/allenai","download_url":"https://codeload.github.com/allenai
/olmes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Folmes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015940,"owners_count":26085777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-28T17:01:32.397Z","updated_at":"2025-10-13T15:56:50.520Z","avatar_url":"https://github.com/allenai.png","language":"Python","funding_links":[],"categories":["OLMo 2 (Nov. 2024)","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# Open Language Model Evaluation System (OLMES)\n\n## Introduction\n\nThe OLMES (Open Language Model Evaluation System) repository is used within [Ai2](https://allenai.org)'s Open \nLanguage Model efforts to evaluate base and\ninstruction-tuned LLMs on a range of tasks. 
The repository includes code to faithfully reproduce the \nevaluation results in research papers such as\n   * **OLMo:** Accelerating the Science of Language Models ([Groeneveld et al., 2024](https://www.semanticscholar.org/paper/ac45bbf9940512d9d686cf8cd3a95969bc313570))\n   * **OLMES:** A Standard for Language Model Evaluations ([Gu et al., 2024](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a))\n   * **TÜLU 3:** Pushing Frontiers in Open Language Model Post-Training ([Lambert et al., 2024](https://www.semanticscholar.org/paper/T/%22ULU-3%3A-Pushing-Frontiers-in-Open-Language-Model-Lambert-Morrison/5ca8f14a7e47e887a60e7473f9666e1f7fc52de7))\n   * **OLMo 2:** 2 OLMo 2 Furious ([Team OLMo et al., 2024](https://arxiv.org/abs/2501.00656))\n\nThe code base uses helpful features from the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) \nby EleutherAI, with a number of modifications and enhancements, including:\n\n  * Support for deep configurations of task variants\n  * More detailed records of instance-level predictions (logprobs, etc.)\n  * Custom metrics and metric aggregations\n  * Integration with external storage options for results\n\n\n## Setup\n\nStart by cloning the repository and installing the dependencies (optionally in a virtual environment;\nPython 3.10 or higher is recommended):\n```\ngit clone https://github.com/allenai/olmes.git\ncd olmes\n\nconda create -n olmes python=3.10\nconda activate olmes\npip install -e .\n```\n\nFor running on GPUs (with vLLM), use `pip install -e .[gpu]` instead. 
If you get complaints regarding the \n`torch` version, downgrade to `torch\u003e=2.2` in [pyproject.toml](pyproject.toml).\n\n## Running evaluations\n\nTo run an evaluation with a specific model and task (or task suite):\n\n```commandline\nolmes --model olmo-1b --task arc_challenge::olmes --output-dir my-eval-dir1\n```\n\nThis will launch the standard [OLMES](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a) \nversion of [ARC Challenge](https://www.semanticscholar.org/paper/88bb0a28bb58d847183ec505dda89b63771bb495) \n(which uses curated 5-shot examples, tries both multiple-choice and cloze formulations, and reports\nthe max) with the `olmo-1b` model, storing the output in `my-eval-dir1`.\n\nMultiple tasks can be specified after the `--task` argument, e.g.,\n```commandline\nolmes --model olmo-1b --task arc_challenge::olmes hellaswag::olmes --output-dir my-eval-dir1\n```\n\nBefore starting an evaluation, you can run a sanity check with `--inspect`, which shows a sample prompt\n(and does a tiny 5-instance eval with a small model):\n```commandline\nolmes --task arc_challenge:mc::olmes --inspect\n```\nYou can also look at the fully expanded command of the job you are about to launch using the `--dry-run` flag:\n```commandline\nolmes --model pythia-1b --task mmlu::olmes --output-dir my-eval-dir1 --dry-run\n```\n\nFor a full list of arguments, run `olmes --help`.\n\n\n## Running specific evaluation suites\n\n### OLMES standard - Original 10 multiple-choice tasks\n\nTo run all 10 multiple-choice tasks from the [OLMES paper](https://www.semanticscholar.org/paper/c689c37c5367abe4790bff402c1d54944ae73b2a):\n```\nolmes --model olmo-7b --task core_9mcqa::olmes --output-dir \u003cdir\u003e\nolmes --model olmo-7b --task mmlu::olmes --output-dir \u003cdir\u003e\n```\n\n### OLMo evaluations\n\nTo reproduce numbers in the [OLMo 
paper](https://www.semanticscholar.org/paper/ac45bbf9940512d9d686cf8cd3a95969bc313570):\n```\nolmes --model olmo-7b --task main_suite::olmo1 --output-dir \u003cdir\u003e\nolmes --model olmo-7b --task mmlu::olmo1 --output-dir \u003cdir\u003e\n```\n\n### TÜLU 3 evaluations\n\nThe list of exact tasks and associated formulations used in the [TÜLU 3 \nwork](https://www.semanticscholar.org/paper/T/%22ULU-3%3A-Pushing-Frontiers-in-Open-Language-Model-Lambert-Morrison/5ca8f14a7e47e887a60e7473f9666e1f7fc52de7) \ncan be found in these suites in the [task suite library](oe_eval/configs/task_suites.py):\n   * `\"tulu_3_dev\"`: Tasks evaluated during development\n   * `\"tulu_3_unseen\"`: Held-out tasks used during final evaluation\n\n### OLMo 2 evaluations\n\nThe list of exact tasks and associated formulations used in the base model evaluations of the \n[OLMo 2 technical report]()\ncan be found in these suites in the [task suite library](oe_eval/configs/task_suites.py):\n   * `\"core_9mcqa::olmes\"`: The core 9 multiple-choice tasks from the original OLMES standard\n   * `\"mmlu:mc::olmes\"`: The MMLU tasks in multiple-choice format\n   * `\"olmo_2_generative::olmes\"`: The 5 generative tasks used in OLMo 2 development\n   * `\"olmo_2_heldout::olmes\"`: The 5 held-out tasks used in OLMo 2 final evaluation\n\n\n\n## Model configuration\n\nModels can be referenced directly by their Hugging Face model path, e.g., `--model allenai/OLMoE-1B-7B-0924`,\nor by their key in the [model library](oe_eval/configs/models.py), e.g., `--model olmoe-1b-7b-0924`. A library entry\ncan include additional configuration options (such as `max_length` for the maximum context size and `model_path` for\na local path to the model).\n\nThe default model type uses the Hugging Face model implementations, but you can also use the `--model-type vllm` flag to use\nthe vLLM implementations for models that support it, as well as `--model-type litellm` to run API-based models.\n\nYou can specify arbitrary JSON-parse-able model arguments 
directly on the command line as well, e.g.,\n```commandline\nolmes --model google/gemma-2b --model-args '{\"trust_remote_code\": true, \"add_bos_token\": true}' ...\n```\nTo see a list of available models, run `oe-eval --list-models`. For a list of models containing a certain phrase,\nyou can follow this with a substring (any regular expression), e.g., `oe-eval --list-models llama`.\n\n\n## Task configuration\n\nTo specify a task, use the [task library](oe_eval/configs/tasks.py), which has\nentries like\n```\n\"arc_challenge:rc::olmes\": {\n    \"task_name\": \"arc_challenge\",\n    \"split\": \"test\",\n    \"primary_metric\": \"acc_uncond\",\n    \"num_shots\": 5,\n    \"fewshot_source\": \"OLMES:ARC-Challenge\",\n    \"metadata\": {\n        \"regimes\": [\"OLMES-v0.1\"],\n    },\n},\n```\nEach task can also have custom entries for `context_kwargs` (controlling details of the prompt),\n`generation_kwargs` (controlling details of the generation), and `metric_kwargs` (controlling details of the metrics). 
The `primary_metric` indicates which metric field will be reported as the \"primary score\" for the task.\n\nThe task configuration parameters can be overridden on the command line. Such overrides generally apply to all tasks, e.g.,\n```commandline\nolmes --task arc_challenge:rc::olmes hellaswag:rc::olmes --split dev ...\n```\nbut by using a JSON format for each task, they can be set per task (although it's generally better to use the task \nlibrary for this), e.g.,\n```\nolmes --task '{\"task_name\": \"arc_challenge:rc::olmes\", \"num_shots\": 2}' '{\"task_name\": \"hellaswag:rc::olmes\", \"num_shots\": 4}' ...\n```\nFor complicated commands like this, using `--dry-run` can be helpful to see the full command before running it.\n\nTo see a list of available tasks, run `oe-eval --list-tasks`. For a list of tasks containing a certain phrase,\nyou can follow this with a substring (any regular expression), e.g., `oe-eval --list-tasks arc`.\n\n### Task suite configurations\n\nTo define a suite of tasks to run together, use the [task suite library](oe_eval/configs/task_suites.py),\nwith entries like:\n```python\nTASK_SUITE_CONFIGS[\"mmlu:mc::olmes\"] = {\n    \"tasks\": [f\"mmlu_{sub}:mc::olmes\" for sub in MMLU_SUBJECTS],\n    \"primary_metric\": \"macro\",\n}\n```\nEach entry specifies the list of tasks as well as how the metrics should be aggregated across the tasks.\n\n\n## Evaluation output\n\nThe evaluation output is stored in the specified output directory, in a set of files\nfor each task. 
See [output formats](OUTPUT_FORMATS.md) for more details.\n\nThe output can optionally be stored in a Google Sheet by specifying the `--gsheet` argument (with authentication\nstored in environment variable `GDRIVE_SERVICE_ACCOUNT_JSON`).\n\nThe output can also be stored in a Huggingface dataset directory by specifying the `--hf-save-dir` argument, \na remote directory (like `s3://...`) by specifying the `--remote-output-dir` argument,\nor in a W\u0026B project by specifying the `--wandb-run-path` argument.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Folmes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallenai%2Folmes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Folmes/lists"}