{"id":13624225,"url":"https://github.com/huggingface/lighteval","last_synced_at":"2025-10-14T15:29:25.303Z","repository":{"id":221503379,"uuid":"748650671","full_name":"huggingface/lighteval","owner":"huggingface","description":"Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends","archived":false,"fork":false,"pushed_at":"2025-10-08T12:30:54.000Z","size":7870,"stargazers_count":1987,"open_issues_count":207,"forks_count":358,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-10-08T14:36:44.458Z","etag":null,"topics":["evaluation","evaluation-framework","evaluation-metrics","huggingface"],"latest_commit_sha":null,"homepage":"https://huggingface.co/docs/lighteval/en/index","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-01-26T13:15:39.000Z","updated_at":"2025-10-07T17:17:07.000Z","dependencies_parsed_at":"2024-02-11T15:49:02.522Z","dependency_job_id":"f9b39ef6-8e8c-4970-bd20-cab2dfee8b68","html_url":"https://github.com/huggingface/lighteval","commit_stats":null,"previous_names":["huggingface/lighteval"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/lighteval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Flighteval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Flighteval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Flighteval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Flighteval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/lighteval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Flighteval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019046,"owners_count":26086518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","evaluation-framework","evaluation-metrics","huggingface"],"created_at":"2024-08-01T21:01:40.229Z","updated_at":"2025-10-14T15:29:25.297Z","avatar_url":"https://github.com/huggingface.png","language":"
Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cbr/\u003e\n    \u003cimg alt=\"lighteval library logo\" src=\"./assets/lighteval-doc.svg\" width=\"376\" height=\"59\" style=\"max-width: 100%;\"\u003e\n  \u003cbr/\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003ci\u003eYour go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.\u003c/i\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![Tests](https://github.com/huggingface/lighteval/actions/workflows/tests.yaml/badge.svg?branch=main)](https://github.com/huggingface/lighteval/actions/workflows/tests.yaml?query=branch%3Amain)\n[![Quality](https://github.com/huggingface/lighteval/actions/workflows/quality.yaml/badge.svg?branch=main)](https://github.com/huggingface/lighteval/actions/workflows/quality.yaml?query=branch%3Amain)\n[![Python versions](https://img.shields.io/pypi/pyversions/lighteval)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/huggingface/lighteval/blob/main/LICENSE)\n[![Version](https://img.shields.io/pypi/v/lighteval)](https://pypi.org/project/lighteval/)\n\n\u003c/div\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://huggingface.co/docs/lighteval/main/en/index\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Documentation\" src=\"https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge\u0026logo=readthedocs\u0026logoColor=white\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n**Lighteval** is your *all-in-one toolkit* for evaluating LLMs across multiple\nbackends—whether your model is being **served somewhere** or **already loaded in memory**.\nDive deep into your model's performance by saving and exploring *detailed,\nsample-by-sample results* to debug and see how your models stack-up.\n\n*Customization at your fingertips*: letting you either browse all our existing tasks and [metrics](https://huggingface.co/docs/lighteval/metric-list) or effortlessly create your own [custom task](https://huggingface.co/docs/lighteval/adding-a-custom-task) and [custom metric](https://huggingface.co/docs/lighteval/adding-a-new-metric), tailored to your needs.\n\n\n## Available Tasks\n\nLighteval supports **7,000+ evaluation tasks** across multiple domains and languages. 
## Available Tasks

Lighteval supports **7,000+ evaluation tasks** across multiple domains and languages. Here's an overview of some *popular benchmarks*:

### 📚 **Knowledge**
- **General Knowledge**: MMLU, MMLU-Pro, MMMU, BIG-Bench
- **Question Answering**: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
- **Specialized**: GPQA, AGIEval

### 🧮 **Math and Code**
- **Math Problems**: GSM8K, GSM-Plus, MATH, MATH500
- **Competition Math**: AIME24, AIME25
- **Multilingual Math**: MGSM (Grade School Math in 10+ languages)
- **Coding Benchmarks**: LCB (LiveCodeBench)

### 🎯 **Chat Model Evaluation**
- **Instruction Following**: IFEval, IFEval-fr
- **Reasoning**: MUSR, DROP (discrete reasoning)
- **Long Context**: RULER
- **Dialogue**: MT-Bench
- **Holistic Evaluation**: HELM, BIG-Bench

### 🌍 **Multilingual Evaluation**
- **Cross-lingual**: XTREME, Flores200 (200 languages), XCOPA, XQuAD
- **Language-specific**:
  - **Arabic**: ArabicMMLU
  - **Filipino**: FilBench
  - **French**: IFEval-fr, GPQA-fr, BAC-fr
  - **German**: German RAG Eval
  - **Serbian**: Serbian LLM Benchmark, OZ Eval
  - **Turkic**: TUMLU (9 Turkic languages)
  - **Chinese**: CMMLU, CEval, AGIEval
  - **Russian**: RUMMLU, Russian SQuAD
  - **And many more...**

### 🧠 **Core Language Understanding**
- **NLU**: GLUE, SuperGLUE, TriviaQA, Natural Questions
- **Commonsense**: HellaSwag, WinoGrande, ProtoQA
- **Natural Language Inference**: XNLI
- **Reading Comprehension**: SQuAD, XQuAD, MLQA, Belebele

## ⚡️ Installation

> **Note**: lighteval is currently *completely untested on Windows*, and we don't support it yet. (*It should be fully functional on Mac/Linux.*)

```bash
pip install lighteval
```

Lighteval supports *many optional extras* at install time; see [here](https://huggingface.co/docs/lighteval/installation) for the **complete list**.

If you want to push results to the **Hugging Face Hub**, log in with your access token first:

```shell
huggingface-cli login
```

## 🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗 Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️ Nanotron](https://github.com/huggingface/nanotron)
- `lighteval vllm`: Evaluate models on one or more GPUs using [🚀 vLLM](https://github.com/vllm-project/vllm)
- `lighteval sglang`: Evaluate models using [SGLang](https://github.com/sgl-project/sglang) as backend
- `lighteval endpoint`: Evaluate models using various endpoints as backend
  - `lighteval endpoint inference-endpoint`: Evaluate models using Hugging Face's [Inference Endpoints API](https://huggingface.co/inference-endpoints/dedicated)
  - `lighteval endpoint tgi`: Evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index) running locally
  - `lighteval endpoint litellm`: Evaluate models on any compatible API using [LiteLLM](https://www.litellm.ai/)
  - `lighteval endpoint inference-providers`: Evaluate models using [Hugging Face's inference providers](https://huggingface.co/docs/inference-providers/en/index) as backend
- `lighteval custom`: Evaluate fully custom models (can be anything)

Didn't find what you need? You can always wrap your own model API by following [this guide](https://huggingface.co/docs/lighteval/main/en/evaluating-a-custom-model).

Here's a **quick command** to evaluate using the *Accelerate backend*:

```shell
lighteval accelerate \
    "model_name=gpt2" \
    "leaderboard|truthfulqa:mc|0"
```
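The task string follows the `suite|task|num_fewshot` pattern, so the command above runs `truthfulqa:mc` from the `leaderboard` suite with zero few-shot examples. The other backends accept the same shape of arguments; for instance, a vLLM run *might* look like the sketch below (the model name is a placeholder, and reusing the `model_name=` argument is an assumption based on the Accelerate example above):

```shell
# A sketch of the same kind of run on the vLLM backend; swap in your own
# model and "suite|task|num_fewshot" spec.
lighteval vllm \
    "model_name=meta-llama/Meta-Llama-3-8B-Instruct" \
    "lighteval|gsm8k|0"
```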
Or use the **Python API** to run a model *already loaded in memory*!

```python
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "lighteval|gsm8k|0"

# Tracker that saves detailed, sample-by-sample results to ./results.
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,  # cap the number of samples per task for a quick test run
)

# Wrap a model that is already loaded in memory.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

pipeline.evaluate()
pipeline.show_results()           # print the aggregated scores
results = pipeline.get_results()  # or work with them programmatically
```

## 🙏 Acknowledgements

Lighteval took inspiration from the following *amazing* frameworks: Eleuther's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) and Stanford's [HELM](https://crfm.stanford.edu/helm/latest/). We are grateful to their teams for their **pioneering work** on LLM evaluations.

We'd also like to thank all the community members who have contributed to the library, adding new features and reporting or fixing bugs.

## 🌟 Contributions Welcome 💙💚💛💜🧡

**Got ideas?** Found a bug? Want to add a [task](https://huggingface.co/docs/lighteval/adding-a-custom-task) or [metric](https://huggingface.co/docs/lighteval/adding-a-new-metric)? Contributions are *warmly welcomed*!

If you're adding a **new feature**, please *open an issue first*.

If you open a PR, don't forget to **run the styling**!

```bash
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
```

## 📜 Citation

```bibtex
@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.11.0},
  url = {https://github.com/huggingface/lighteval}
}
```