{"id":13455945,"url":"https://github.com/evalplus/evalplus","last_synced_at":"2026-01-12T02:50:03.958Z","repository":{"id":160754299,"uuid":"628154495","full_name":"evalplus/evalplus","owner":"evalplus","description":"Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 \u0026 COLM 2024","archived":false,"fork":false,"pushed_at":"2024-10-29T08:23:15.000Z","size":4915,"stargazers_count":1216,"open_issues_count":45,"forks_count":107,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-10-29T09:43:26.857Z","etag":null,"topics":["benchmark","chatgpt","efficiency","gpt-4","large-language-models","program-synthesis","testing"],"latest_commit_sha":null,"homepage":"https://evalplus.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evalplus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-15T04:20:10.000Z","updated_at":"2024-10-29T08:23:19.000Z","dependencies_parsed_at":"2023-11-18T03:21:46.439Z","dependency_job_id":"4e7b13cf-56db-457f-ab4c-ed1a12add693","html_url":"https://github.com/evalplus/evalplus","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalplus%2Fevalplus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalplus%2Fevalplus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evalplus%2Fevalplus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/evalplus%2Fevalplus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evalplus","download_url":"https://codeload.github.com/evalplus/evalplus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245243189,"owners_count":20583581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","chatgpt","efficiency","gpt-4","large-language-models","program-synthesis","testing"],"created_at":"2024-07-31T08:01:13.832Z","updated_at":"2026-01-12T02:50:03.951Z","avatar_url":"https://github.com/evalplus.png","language":"Python","readme":"# `EvalPlus(📖) =\u003e 📚`\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://evalplus.github.io\"\u003e\u003cimg src=\"https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://openreview.net/forum?id=1qvx610Cu7\"\u003e\u003cimg src=\"https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://openreview.net/forum?id=IBCBMeAhmC\"\u003e\u003cimg src=\"https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/evalplus/\"\u003e\u003cimg src=\"https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/evalplus/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/evalplus?color=g\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://hub.docker.com/r/ganler/evalplus\" title=\"Docker\"\u003e\u003cimg 
src=\"https://img.shields.io/docker/image-size/ganler/evalplus\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"#-about\"\u003e📙About\u003c/a\u003e •\n    \u003ca href=\"#-quick-start\"\u003e🔥Quick Start\u003c/a\u003e •\n    \u003ca href=\"#-llm-backends\"\u003e🚀LLM Backends\u003c/a\u003e •\n    \u003ca href=\"#-documents\"\u003e📚Documents\u003c/a\u003e •\n    \u003ca href=\"#-citation\"\u003e📜Citation\u003c/a\u003e •\n    \u003ca href=\"#-acknowledgement\"\u003e🙏Acknowledgement\u003c/a\u003e\n\u003c/p\u003e\n\n## 📢 News\n\nWho's using EvalPlus datasets? EvalPlus has been used by various LLM teams, including:\n\n* [Meta Llama 3.1 and 3.3](https://ai.meta.com/blog/meta-llama-3-1/)\n* [Allen AI TÜLU 1/2/3](https://github.com/allenai/open-instruct/blob/main/docs/tulu1_tulu2.md#benchmark-based-eval)\n* [Qwen2.5-Coder](https://qwenlm.github.io/blog/qwen2.5-coder-family/)\n* [CodeQwen 1.5](https://qwenlm.github.io/blog/codeqwen1.5/)\n* [DeepSeek-Coder V2](https://arxiv.org/pdf/2406.11931)\n* [Qwen2](https://arxiv.org/pdf/2407.10671)\n* [Snowflake Arctic](https://www.snowflake.com/en/data-cloud/arctic/)\n* [StarCoder2](https://arxiv.org/pdf/2402.19173)\n* [Magicoder](https://arxiv.org/pdf/2312.02120)\n* [WizardCoder](https://arxiv.org/pdf/2306.08568)\n\nBelow tracks the notable updates of EvalPlus:\n\n- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Highlights: *(i)* Code efficiency evaluation via EvalPerf, *(ii)* one command to run all: generation + post-processing + evaluation, *(iii)* support for more inference backends such as Google Gemini \u0026 Anthropic, etc.\n- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena).\n- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -\u003e 378 tasks). 
~4pp pass@1 improvement could be expected.\n\n\u003cdetails\u003e\u003csummary\u003eEarlier news \u003ci\u003e:: click to expand ::\u003c/i\u003e\u003c/summary\u003e\n\u003cdiv\u003e\n\n- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).\n- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract \u0026 input fixes (0/3/9/148/114/1/2/99/28/32/35/160).\n- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6)\n- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract \u0026 ground-truth fixes (129/148/75/53/0/3/9/140)\n- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!\n- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.\n- ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!\n\n\u003c/div\u003e\n\u003c/details\u003e\n\n\n## 📙 About\n\nEvalPlus is a rigorous evaluation framework for LLM4Code, with:\n\n- ✨ **HumanEval+**: 80x more tests than the original HumanEval!\n- ✨ **MBPP+**: 35x more tests than the original MBPP!\n- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!\n- ✨ **Framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.\n\nWhy EvalPlus?\n\n- ✨ **Precise evaluation**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for the latest LLM rankings before \u0026 after rigorous evaluation.\n- ✨ **Coding 
rigorousness**: Compare the scores before \u0026 after applying the EvalPlus tests: a smaller drop indicates more rigorous code generation, while a larger drop means the generated code tends to be fragile.\n- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.\n\nWant to know more details? Read our papers \u0026 materials!\n\n- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf), [Leaderboard](https://evalplus.github.io/leaderboard.html)\n- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md), [Leaderboard](https://evalplus.github.io/evalperf.html)\n\n\n## 🔥 Quick Start\n\n### Code Correctness Evaluation: HumanEval(+) or MBPP(+)\n\n```bash\npip install --upgrade \"evalplus[vllm] @ git+https://github.com/evalplus/evalplus\"\n# Or `pip install \"evalplus[vllm]\" --upgrade` for the latest stable release\n\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                  --dataset [humaneval|mbpp]             \\\n                  --backend vllm                         \\\n                  --greedy\n```\n\n\u003cdetails\u003e\u003csummary\u003e🛡️ Safe code execution within Docker \u003ci\u003e:: click to expand ::\u003c/i\u003e\u003c/summary\u003e\n\u003cdiv\u003e\n\n```bash\n# Local generation\nevalplus.codegen --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                 --dataset humaneval                    \\\n                 --backend vllm                         \\\n                 --greedy\n\n# Code execution within Docker\ndocker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \\\n    
       evalplus.evaluate --dataset humaneval                                     \\\n           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl\n```\n\n\u003c/div\u003e\n\u003c/details\u003e\n\n### Code Efficiency Evaluation: EvalPerf (*nix only)\n\n```bash\npip install --upgrade \"evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus\"\n# Or `pip install \"evalplus[perf,vllm]\" --upgrade` for the latest stable release\n\nsudo sh -c 'echo 0 \u003e /proc/sys/kernel/perf_event_paranoid' # Enable perf\nevalplus.evalperf --model \"ise-uiuc/Magicoder-S-DS-6.7B\" --backend vllm\n```\n\n\u003cdetails\u003e\u003csummary\u003e🛡️ Safe code execution within Docker \u003ci\u003e:: click to expand ::\u003c/i\u003e\u003c/summary\u003e\n\u003cdiv\u003e\n\n```bash\n# Local generation\nevalplus.codegen --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                 --dataset evalperf                     \\\n                 --backend vllm                         \\\n                 --temperature 1.0                      \\\n                 --n-samples 100\n\n# Code execution within Docker\nsudo sh -c 'echo 0 \u003e /proc/sys/kernel/perf_event_paranoid' # Enable perf\ndocker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \\\n           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl\n```\n\n\u003c/div\u003e\n\u003c/details\u003e\n\n## 🚀 LLM Backends\n\n### HuggingFace models\n\n- `transformers` backend:\n\n```bash\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                  --dataset [humaneval|mbpp]             \\\n                  --backend hf                           \\\n                  --greedy\n```\n\n\u003e [!Note]\n\u003e\n\u003e EvalPlus uses different prompts for base and chat models.\n\u003e By default it is detected by `tokenizer.chat_template` when using `hf`/`vllm` as backend.\n\u003e For other backends, only chat mode 
is allowed.\n\u003e\n\u003e Therefore, if your base models come with a `tokenizer.chat_template`,\n\u003e please add `--force-base-prompt` to avoid being evaluated\n\u003e in a chat mode.\n\n\u003cdetails\u003e\u003csummary\u003eEnable Flash Attention 2 \u003ci\u003e:: click to expand ::\u003c/i\u003e\u003c/summary\u003e\n\u003cdiv\u003e\n\n```bash\n# Install Flash Attention 2\npip install packaging ninja\npip install flash-attn --no-build-isolation\n# Note: if you have installation problem, consider using pre-built\n# wheels from https://github.com/Dao-AILab/flash-attention/releases\n\n# Run evaluation with FA2\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\"         \\\n                  --dataset [humaneval|mbpp]                     \\\n                  --backend hf                                   \\\n                  --attn-implementation [flash_attention_2|sdpa] \\\n                  --greedy\n```\n\n\u003c/div\u003e\n\u003c/details\u003e\n\n- `vllm` backend:\n\n```bash\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                  --dataset [humaneval|mbpp]             \\\n                  --backend vllm                         \\\n                  --tp [TENSOR_PARALLEL_SIZE]            \\\n                  --greedy\n```\n\n- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)):\n\n```bash\n# OpenAI models\nexport OPENAI_API_KEY=\"{KEY}\" # https://platform.openai.com/settings/organization/api-keys\nevalplus.evaluate --model \"gpt-4o-2024-08-06\"  \\\n                  --dataset [humaneval|mbpp]   \\\n                  --backend openai --greedy\n\n# DeepSeek\nexport OPENAI_API_KEY=\"{KEY}\" # https://platform.deepseek.com/api_keys\nevalplus.evaluate --model \"deepseek-chat\"              \\\n                  --dataset [humaneval|mbpp]           \\\n                  --base-url https://api.deepseek.com  \\\n                  --backend openai --greedy\n\n# 
Grok\nexport OPENAI_API_KEY=\"{KEY}\" # https://console.x.ai/\nevalplus.evaluate --model \"grok-beta\"             \\\n                  --dataset [humaneval|mbpp]      \\\n                  --base-url https://api.x.ai/v1  \\\n                  --backend openai --greedy\n\n# vLLM server\n# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html\nevalplus.evaluate --model \"ise-uiuc/Magicoder-S-DS-6.7B\" \\\n                  --dataset [humaneval|mbpp]             \\\n                  --base-url http://localhost:8000/v1    \\\n                  --backend openai --greedy\n\n# GPTQModel\nevalplus.evaluate --model \"ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1\" \\\n                  --dataset [humaneval|mbpp]                                          \\\n                  --backend gptqmodel --greedy\n```\n\n### OpenAI models\n\n- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)\n\n```bash\nexport OPENAI_API_KEY=\"[YOUR_API_KEY]\"\nevalplus.evaluate --model \"gpt-4o\"            \\\n                  --dataset [humaneval|mbpp]  \\\n                  --backend openai            \\\n                  --greedy\n```\n\n### Anthropic models\n\n- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/)\n\n```bash\nexport ANTHROPIC_API_KEY=\"[YOUR_API_KEY]\"\nevalplus.evaluate --model \"claude-3-haiku-20240307\" \\\n                  --dataset [humaneval|mbpp]        \\\n                  --backend anthropic               \\\n                  --greedy\n```\n\n### Google Gemini models\n\n- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/)\n\n```bash\nexport GOOGLE_API_KEY=\"[YOUR_API_KEY]\"\nevalplus.evaluate --model \"gemini-1.5-pro\"    \\\n                  --dataset [humaneval|mbpp]  \\\n                  --backend google            \\\n                  --greedy\n```\n\n### Amazon Bedrock models\n\n- [Amazon 
Bedrock](https://aws.amazon.com/bedrock/)\n\n```bash\nexport BEDROCK_ROLE_ARN=\"[BEDROCK_ROLE_ARN]\"\nevalplus.evaluate --model \"anthropic.claude-3-5-sonnet-20241022-v2:0\" \\\n                  --dataset [humaneval|mbpp]                          \\\n                  --backend bedrock                                   \\\n                  --greedy\n```\n\n### Ollama backend\n\n- [Ollama](https://ollama.com/)\n\n```bash\nevalplus.evaluate --model \"mistral:7b\" \\\n                  --dataset [humaneval|mbpp]          \\\n                  --backend ollama                    \\\n                  --base-url http://localhost:11434/v1 \\\n                  --greedy\n```\n\n### Intel® Gaudi® Accelerator\n\n- [Intel® Gaudi®](https://docs.habana.ai/en/latest/index.html)\n\nTo run the `hf` backend for Intel Gaudi, install [optimum-habana](https://github.com/huggingface/optimum-habana) first.\n\n```bash\npip install git+https://github.com/huggingface/optimum-habana.git\nevalplus.evaluate --model \"qwen/CodeQwen1.5-7B-Chat\" \\\n                  --dataset [humaneval|mbpp]         \\\n                  --backend hf_gaudi                 \\\n                  --greedy                           \\\n                  --torch_compile\n```\n\nOr, in Lazy Mode:\n\n```bash\nPT_HPU_LAZY_MODE=1 evalplus.evaluate --model \"qwen/CodeQwen1.5-7B-Chat\" \\\n                  --dataset [humaneval|mbpp]                            \\\n                  --backend hf_gaudi                                    \\\n                  --greedy\n```\n\nTo run the `vllm` backend for Intel Gaudi, install [HabanaAI vllm](https://github.com/HabanaAI/vllm-fork) first.\n\n```bash\ngit clone https://github.com/HabanaAI/vllm-fork.git\ncd vllm-fork\ngit checkout habana_main\npip install --upgrade pip\npip install -r requirements-hpu.txt\npython setup.py develop\n```\n\nThen run:\n\n```bash\nPT_HPU_LAZY_MODE=1 evalplus.evaluate --model \"qwen/CodeQwen1.5-7B-Chat\" \\\n                  --dataset [humaneval|mbpp]                
             \\\n                  --backend vllm                                        \\\n                  --greedy\n```\n\nYou can check out the generations and results at `evalplus_results/[humaneval|mbpp]/`.\n\n\u003cdetails\u003e\u003csummary\u003e⏬ Using EvalPlus as a local repo? \u003ci\u003e:: click to expand ::\u003c/i\u003e\u003c/summary\u003e\n\u003cdiv\u003e\n\n```bash\ngit clone https://github.com/evalplus/evalplus.git\ncd evalplus\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\npip install -r requirements.txt\n```\n\n\u003c/div\u003e\n\u003c/details\u003e\n\n## 📚 Documents\n\nTo learn more about how to use EvalPlus, please refer to:\n\n- [EvalPlus Commands](./docs/cli.md)\n- [EvalPerf](./docs/evalperf.md)\n- [Program Execution](./docs/execution.md)\n\n## 📜 Citation\n\n```bibtex\n@inproceedings{evalplus,\n  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},\n  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},\n  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},\n  year = {2023},\n  url = {https://openreview.net/forum?id=1qvx610Cu7},\n}\n\n@inproceedings{evalperf,\n  title = {Evaluating Language Models for Efficient Code Generation},\n  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},\n  booktitle = {First Conference on Language Modeling},\n  year = {2024},\n  url = {https://openreview.net/forum?id=IBCBMeAhmC},\n}\n```\n\n## 🙏 Acknowledgement\n\n- [HumanEval](https://github.com/openai/human-eval)\n- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)\n","funding_links":[],"categories":["Python","A01_文本生成_文本对话","NLP","Benchmarks \u0026 Evaluation","Evaluation and Monitoring","UIs"],"sub_categories":["大语言对话模型及数据","Benchmark Datasets","Code Benchmarks","Command-line(shell) 
interface"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevalplus%2Fevalplus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevalplus%2Fevalplus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevalplus%2Fevalplus/lists"}