{"id":16248992,"url":"https://github.com/huggingface/evaluation-guidebook","last_synced_at":"2025-10-14T15:33:12.830Z","repository":{"id":257820575,"uuid":"870039374","full_name":"huggingface/evaluation-guidebook","owner":"huggingface","description":"Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","archived":false,"fork":false,"pushed_at":"2025-09-26T14:21:51.000Z","size":1127,"stargazers_count":1645,"open_issues_count":2,"forks_count":92,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-09-26T16:19:26.510Z","etag":null,"topics":["evaluation","evaluation-metrics","guidebook","large-language-models","llm","machine-learning","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-09T10:36:39.000Z","updated_at":"2025-09-26T14:21:55.000Z","dependencies_parsed_at":"2025-09-26T16:20:26.366Z","dependency_job_id":null,"html_url":"https://github.com/huggingface/evaluation-guidebook","commit_stats":null,"previous_names":["huggingface/evaluation-guidebook"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/evaluation-guidebook","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fevaluation-guideb
ook","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fevaluation-guidebook/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fevaluation-guidebook/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fevaluation-guidebook/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/evaluation-guidebook/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fevaluation-guidebook/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279019320,"owners_count":26086711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","evaluation-metrics","guidebook","large-language-models","llm","machine-learning","tutorial"],"created_at":"2024-10-10T15:00:47.961Z","updated_at":"2025-10-14T15:33:12.825Z","avatar_url":"https://github.com/huggingface.png","language":"Jupyter Notebook","readme":"# The LLM Evaluation guidebook ⚖️\n\nIf you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! 
\nIt covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience.\n\nWhether you're working with production models, a researcher, or a hobbyist, I hope you'll find what you need; and if not, open an issue (to suggest improvements or missing resources) and I'll complete the guide!\n\n## How to read this guide\n- **Beginner user**: \n  If you don't know anything about evaluation, you should start with the `Basics` sections in each chapter before diving deeper. \n  You'll also find explanations of important LLM topics in `General knowledge`: for example, how model inference works and what tokenization is.\n- **Advanced user**:\n  The more practical sections are the `Tips and Tricks` ones and the `Troubleshooting` chapter. You'll also find interesting things in the `Designing` sections.\n- **User coming back to the site**: \n  Every year I do a deep dive into a topic; check them out!\n\nThroughout the text, links prefixed by ⭐ are ones I really enjoyed and recommend reading.\n\n## Table of contents\nIf you want an intro to the topic, you can read this [blog](https://huggingface.co/blog/clefourrier/llm-evaluation) on how and why we do evaluation!\n\n### Automatic benchmarks\n- [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/basics.md)\n- [Designing your automatic evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/designing-your-automatic-evaluation.md)\n- [Some evaluation datasets](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md)\n- [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/tips-and-tricks.md)\n\n### Human evaluation\n- [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md)\n- [Using human 
annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md)\n- [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/tips-and-tricks.md)\n\n### LLM-as-a-judge\n- [Basics](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/basics.md)\n- [Getting a Judge-LLM](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/getting-a-judge-llm.md)\n- [Designing your evaluation prompt](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/designing-your-evaluation-prompt.md)\n- [Evaluating your evaluator](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/evaluating-your-evaluator.md)\n- [What about reward models](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md)\n- [Tips and tricks](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/tips-and-tricks.md)\n\n### Troubleshooting\nThe most densely practical part of this guide. \n- [Troubleshooting inference](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-inference.md)\n- [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)\n\n### General knowledge\nThese are mostly beginner guides to LLM basics, but they still contain some tips and cool references! 
\nIf you're an advanced user, I suggest skipping ahead to the `Going further` sections.\n- [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md)\n- [Tokenization](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/tokenization.md)\n\n## Yearly dives\n- [2023, year of Open Source](https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2023-year-of-open-source.md)\n- [2024, what should evaluation be for?](https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2024-evals-thoughts-from-iclr.md)\n- [2025, evaluations to build \"real life\" useful models](https://github.com/huggingface/evaluation-guidebook/blob/main/yearly_dives/2025-evaluations-for-useful-models.md)\n\n## Resources\nLinks I like\n- [About evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/resources/about-evaluation.md)\n- [About general NLP](https://github.com/huggingface/evaluation-guidebook/blob/main/resources/about-NLP.md)\n- [The UltraScale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)\n\n## Community translations\nThis guide has been kindly translated by the community!\n- 🇨🇳 https://github.com/huggingface/evaluation-guidebook/tree/main/translations/zh/contents, thanks to @SuSung-boy \n- 🇫🇷 https://huggingface.co/spaces/CATIE-AQ/Guide_Evaluation_LLM, thanks to @lbourdois\n\n## Thanks\nThis guide has been heavily inspired by the [ML Engineering Guidebook](https://github.com/stas00/ml-engineering) by Stas Bekman! 
Thanks for this cool resource!\n\nMany thanks also to all the people who inspired this guide through discussions either at events or online, including but not limited to:\n- 🤝 Luca Soldaini, Kyle Lo and Ian Magnusson (Allen AI), Max Bartolo (Cohere), Kai Wu (Meta), Swyx and Alessio Fanelli (Latent Space Podcast), Hailey Schoelkopf (EleutherAI), Martin Signoux (OpenAI), Moritz Hardt (Max Planck Institute), Ludwig Schmidt (Anthropic)\n- 🔥 community users of the Open LLM Leaderboard and lighteval, who often raised very interesting points in discussions\n- 🤗 people at Hugging Face, like Lewis Tunstall, Hynek Kydlíček, Guilherme Penedo and Thom Wolf, and of course my teammate Nathan Habib with whom I've been doing evaluation and leaderboards since 2022\n\nand of course to all the contributors :)\n\n## Citation\n[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]\n\n[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/\n[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png\n[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg\n\n```\n@misc{fourrier2024evaluation,\n  author = {Clémentine Fourrier and The Hugging Face Community},\n  title = {LLM Evaluation Guidebook},\n  year = {2024},\n  journal = {GitHub repository},\n  url = {https://github.com/huggingface/evaluation-guidebook}\n}\n```\n","funding_links":[],"categories":["NLP","Jupyter Notebook","A01_文本生成_文本对话","评估 Evaluation","Books"],"sub_categories":["3. Pretraining","大语言对话模型及数据","Open Access"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fevaluation-guidebook","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fevaluation-guidebook","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fevaluation-guidebook/lists"}