{"id":37140424,"url":"https://github.com/bigscience-workshop/lm-evaluation-harness","last_synced_at":"2026-01-14T16:27:45.752Z","repository":{"id":37016985,"uuid":"430485172","full_name":"bigscience-workshop/lm-evaluation-harness","owner":"bigscience-workshop","description":"A framework for few-shot evaluation of autoregressive language models.","archived":false,"fork":true,"pushed_at":"2023-05-09T09:21:46.000Z","size":10119,"stargazers_count":105,"open_issues_count":15,"forks_count":29,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-09-09T16:09:34.476Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"EleutherAI/lm-evaluation-harness","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigscience-workshop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":"CODEOWNERS","security":null,"support":null}},"created_at":"2021-11-21T21:31:37.000Z","updated_at":"2025-06-25T23:00:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bigscience-workshop/lm-evaluation-harness","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/bigscience-workshop/lm-evaluation-harness","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Flm-evaluation-harness","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Flm-evaluation-harness/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Flm-evaluation-harness/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-wo
rkshop%2Flm-evaluation-harness/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigscience-workshop","download_url":"https://codeload.github.com/bigscience-workshop/lm-evaluation-harness/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigscience-workshop%2Flm-evaluation-harness/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28425923,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T15:24:48.085Z","status":"ssl_error","status_checked_at":"2026-01-14T15:23:41.940Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-14T16:27:45.166Z","updated_at":"2026-01-14T16:27:45.746Z","avatar_url":"https://github.com/bigscience-workshop.png","language":"Python","readme":"# `lm-evaluation-harness` + `promptsource`\n\n![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Build/badge.svg)\n[![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)\n\n## Overview\n\nThis project provides a unified framework to test causal (GPT-2, GPT-3, GPTNeo, etc) and seq2seq (T5, T0) language models via prompt evaluation.\n\nAs of now, all prompts are provided via the 
`promptsource` [eval-hackathon branch](https://github.com/bigscience-workshop/promptsource/tree/eval-hackathon); all datasets are from huggingface `datasets`.\n\nThis fork is __not__ backwards compatible with the original evaluation harness.\n\n## Installation\n\n```bash\ngit clone https://github.com/bigscience-workshop/lm-evaluation-harness\ncd lm-evaluation-harness\npip install -e \".[dev]\"\n```\n\n## CLI Usage 🖥️\n\nTo evaluate a model (e.g. GPT-2) on NLP tasks such as SuperGLUE WiC, you can run the following command:\n\n```bash\npython main.py \\\n    --model_api_name 'hf-causal' \\\n    --model_args pretrained='gpt2' \\\n    --task_name 'wic' \\\n    --template_names 'same_sense','polysemous' \\\n    --device cpu\n```\n\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. For larger models supported by HuggingFace `transformers`, we provide parallelism and mixed-precision utilities through the [`accelerate`](https://github.com/huggingface/accelerate) package. These can be enabled for `hf-causal`/`hf-seq2seq` by passing `use_accelerate=True` and `dtype=half`, respectively, via the `--model_args` flag.\n\n
For finer grained control over `accelerate` options, see the constructor docstrings for `HuggingFaceAutoLM` in `huggingface.py`.\n\n```bash\npython main.py \\\n    --model_api_name 'hf-causal' \\\n    --model_args use_accelerate=True,pretrained='facebook/opt-13b' \\\n    --task_name wnli\n```\n\nIf you have access to the OpenAI API, you can also evaluate GPT-3 engines:\n\n```bash\nexport OPENAI_API_SECRET_KEY={YOUR_KEY_HERE}\npython main.py \\\n    --model_api_name 'openai' \\\n    --model_args engine='curie' \\\n    --task_name hans\n```\n\n **When reporting results from eval harness, please include the task versions (shown in `results[\"versions\"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible.\n\n### Detailed Usage\n\n```\nusage: main.py [-h] --model_api_name MODEL_API_NAME [--model_args MODEL_ARGS] --task_name TASK_NAME\n               [--template_names TEMPLATE_NAMES] [--num_fewshot NUM_FEWSHOT] [--batch_size BATCH_SIZE]\n               [--device DEVICE] [--limit LIMIT] [--output_path OUTPUT_PATH] [--template_idx TEMPLATE_IDX]\n               [--bootstrap_iters BOOTSTRAP_ITERS] [--no_tracking] [--use_cache]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --model_api_name MODEL_API_NAME\n                        Name of the model API to use. See `lm_eval.list_model_apis()` for available APIs\n  --model_args MODEL_ARGS\n                        Model constructor args that you'd pass into a model of type `--model_api_name`. These must\n                        be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces\n  --task_name TASK_NAME\n                        Name of the task to use as found in the lm_eval registry. 
See: `lm_eval.list_tasks()`\n  --task_args TASK_ARGS\n                        Optional task constructor args that you'd pass into a task class of type `--task_name`.\n                        These must be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces.\n                        WARNING: To avoid parsing errors, ensure your strings are quoted. For example,\n                            `example_separator='\\n+++\\n'`\n                        WARNING: Values must NOT contain commas.\n  --template_names TEMPLATE_NAMES\n                        Comma-separated list of template names for the specified task. Example:\n                        `\u003e python main.py ... --task_name rte --template_names imply,mean`\n                        - Default: `all_templates`\n                        - General Selectors:\n                            - `\"all_templates\"`: Selects all templates for the task\n                            - `\"original_templates\"`: Selects only templates that are designed to match the original task\n  --num_fewshot NUM_FEWSHOT\n  --batch_size BATCH_SIZE\n  --seed SEED\n  --device DEVICE       The device to place your model onto, e.g. cuda:0. For large models available through the\n                        HuggingFace Hub you should use `accelerate` by passing `use_accelerate=True` to\n                        `--model_args`\n  --limit LIMIT         Limit the number of examples to evaluate on; ONLY USE THIS FOR DEBUGGING PURPOSES\n  --output_path OUTPUT_PATH\n                        Use output_path as `output_filename`. For example:\n                        `\u003e python main.py ... 
--output_path blop`\n                        # saves files into `outputs/blop.json` Warning: You currently cannot change/add folder\n                        structure.\n  --template_idx TEMPLATE_IDX\n                        Choose template by index from available templates\n  --bootstrap_iters BOOTSTRAP_ITERS\n                        Iters for stderr computation\n  --no_tracking         Skip carbon emission tracking\n  --use_cache           Whether to cache your model's predictions or not\n```\n\n## Library Usage 📖\n\nYou can also use `lm_eval` as a library:\n\n```python\nimport lm_eval\n\nmodel = lm_eval.get_model(\"hf-causal\", pretrained=\"gpt2\", device=\"cpu\")\ntasks = lm_eval.get_task_list(\n    \"superglue_rte\",\n    template_names=[\"does this imply\", \"must be true\"])\nresults = lm_eval.evaluate(model=model, tasks=tasks)\n```\n\nThe main user-facing functions are:\n\n- [`lm_eval.get_model(model_api_name, **kwargs)`](./lm_eval/models/__init__.py) creates a model from a model API\n- [`lm_eval.get_task(task_name, template_name, **kwargs)`](./lm_eval/tasks/__init__.py) creates a task with the prompt template\n- [`lm_eval.get_task_list(task_name, template_names, **kwargs)`](./lm_eval/tasks/__init__.py) creates multiple instances of a task with different prompt templates\n- [`lm_eval.evaluate(model, tasks, **kwargs)`](./lm_eval/evaluator.py) evaluates a model on a list of tasks\n\nSome high-level convenience functions are also made available:\n- [`lm_eval.list_model_apis()`](./lm_eval/models/__init__.py) lists all available model APIs you can evaluate from\n- [`lm_eval.list_tasks()`](./lm_eval/tasks/__init__.py) lists all available tasks\n- [`lm_eval.list_templates(task_name)`](./lm_eval/tasks/__init__.py) lists all available templates for a task\n- [`lm_eval.get_templates(task_name)`](./lm_eval/tasks/__init__.py) returns promptsource dataset templates for a task\n\n## Gotchas 🩹\n\n- __You must pass templates to `PerplexityTask`s__  even though they will be 
ignored, as models will be scored from the raw text found in the task's dataset.\n\n- __Multi-lingual ROUGE is unsupported__ as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore ROUGE metrics until this is resolved. _NOTE_: `English` works as intended.\n\n- __Task versioning is not fully integrated__! If you're reporting your model's results, please include the package versions or commit IDs for this `lm-evaluation-harness` branch as well as for the HuggingFace `datasets` and `promptsource` packages.\n\n- __`promptsource` installation issue__: Some prompts may be excluded from the installed `promptsource` branch due to git-based pip installation issues. If the latest commit on the `promptsource/eval-hackathon` branch contains a prompt you're looking for that was not included in the version installed from our `setup.py`, you should run the following from within your environment:\n    ```bash\n    pip uninstall promptsource\n    git clone --single-branch --branch eval-hackathon https://github.com/bigscience-workshop/promptsource\n    cd promptsource\n    pip install -e .\n    ```\n\n## Features\n\n- Growing number of tasks integrated with `promptsource` (20+).\n\n- Support for HuggingFace Causal language models, HuggingFace Seq2Seq models, and the OpenAI Completions API (GPT-3), with flexible tokenization-agnostic interfaces.\n\n## Implementing new tasks\n\nTo implement a new task in eval harness, follow the [`PromptSourceTask` template](./templates/new_prompt_source_task.py).\n\n## Using load_from_disk instead of load_dataset\n\nYou can use `load_from_disk` (convenient on the Jean Zay supercomputer) by passing `--task_args download_mode='load_from_disk',data_dir=\u003cdata/path\u003e`.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Flm-evaluation-harness","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigscience-workshop%2Flm-evaluation-harness","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigscience-workshop%2Flm-evaluation-harness/lists"}