{"id":34087179,"url":"https://github.com/evo-eval/evoeval","last_synced_at":"2026-04-05T11:31:30.018Z","repository":{"id":230135710,"uuid":"777662906","full_name":"evo-eval/evoeval","owner":"evo-eval","description":"EvoEval: Evolving Coding Benchmarks via LLM","archived":false,"fork":false,"pushed_at":"2024-04-06T23:26:59.000Z","size":18353,"stargazers_count":60,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-04T09:14:32.459Z","etag":null,"topics":["benchmark","chatgpt","claude-3","gemini-pro","gpt-4","large-language-models","llm","program-synthesis","testing"],"latest_commit_sha":null,"homepage":"https://evo-eval.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evo-eval.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-26T09:30:19.000Z","updated_at":"2024-10-28T20:01:03.000Z","dependencies_parsed_at":"2024-04-07T00:38:36.435Z","dependency_job_id":null,"html_url":"https://github.com/evo-eval/evoeval","commit_stats":null,"previous_names":["evo-eval/evoeval"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/evo-eval/evoeval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-eval%2Fevoeval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-eval%2Fevoeval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-eval%2Fevoeval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-eval%2Fevoeval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evo-eval","download_url":"https://codeload.github.com/evo-eval/evoeval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evo-eval%2Fevoeval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31434624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T08:13:15.228Z","status":"ssl_error","status_checked_at":"2026-04-05T08:13:11.839Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","chatgpt","claude-3","gemini-pro","gpt-4","large-language-models","llm","program-synthesis","testing"],"created_at":"2025-12-14T13:38:05.885Z","updated_at":"2026-04-05T11:31:30.004Z","avatar_url":"https://github.com/evo-eval.png","language":"Python","readme":"# \u003cimg src=\"resources/butterfly_dark.png\" width=\"32px\" height=\"auto\"\u003e EvoEval: Evolving Coding Benchmarks via LLM\r\n\r\n\u003cp align=\"center\"\u003e\r\n    \u003ca href=\"https://evo-eval.github.io/leaderboard.html\"\u003e\u003cimg src=\"https://img.shields.io/badge/🏆-LeaderBoard-8e7cc3?style=for-the-badge\"\u003e\u003c/a\u003e\r\n    \u003ca href=\"https://evo-eval.github.io/visualization.html\"\u003e\u003cimg src=\"https://img.shields.io/badge/🔮-Visualization-3d85c6?style=for-the-badge\"\u003e\u003c/a\u003e\r\n    \u003ca href=\"https://arxiv.org/abs/2403.19114\"\u003e\u003cimg src=\"https://img.shields.io/badge/📃-Arxiv-b31b1b?style=for-the-badge\"\u003e\u003c/a\u003e\r\n    \u003ca href=\"https://huggingface.co/evoeval/\"\u003e\u003cimg src=\"https://img.shields.io/badge/🤗-Huggingface-f59e0b?style=for-the-badge\"\u003e\u003c/a\u003e\r\n    \u003ca href=\"https://pypi.org/project/evoeval/\"\u003e\u003cimg src=\"https://img.shields.io/badge/0.1.0-Pypi-3b719f?style=for-the-badge\u0026logo=pypi\"\u003e\u003c/a\u003e\r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n    \u003cbig\u003e\u003ca href=\"#-quick-start\"\u003e⚡Quick Start\u003c/a\u003e\u003c/big\u003e |\r\n    \u003cbig\u003e\u003ca href=\"#-benchmarks\"\u003e🔠Benchmarks\u003c/a\u003e\u003c/big\u003e |\r\n    \u003cbig\u003e\u003ca href=\"#-llm-generated-code\"\u003e🤖LLM Generated Code\u003c/a\u003e\u003c/big\u003e |\r\n    \u003cbig\u003e\u003ca href=\"#-citation\"\u003e📝Citation\u003c/a\u003e\u003c/big\u003e |\r\n    \u003cbig\u003e\u003ca href=\"#-acknowledgement\"\u003e🙏Acknowledgement\u003c/a\u003e\u003c/big\u003e\r\n\u003c/p\u003e\r\n\r\n## \u003cimg src=\"resources/butterfly_dark.png\" width=\"23px\" height=\"auto\"\u003e About \r\n\r\n**EvoEval**\u003csup\u003e1\u003c/sup\u003e is a holistic benchmark suite created by _evolving_ **HumanEval** problems:\r\n- 🔥 Contains **828** new problems across **5** 🌠 semantic-altering and **2** ⭐ semantic-preserving benchmarks\r\n- 🔮 Allows evaluation/comparison across different **dimensions** and problem **types** (i.e., _Difficult_, _Creative_ or _Tool Use_ problems). See our [**visualization tool**](https://evo-eval.github.io/visualization.html) for ready-to-use comparison\r\n- 🏆 Complete with [**leaderboard**](https://evo-eval.github.io/leaderboard.html), **groundtruth solutions**, **robust testcases** and **evaluation scripts** to easily fit into your evaluation pipeline\r\n- 🤖 Generated LLM code samples from **\u003e50** different models to save you time in running experiments\r\n\r\n\u003csup\u003e1\u003c/sup\u003e coincidentally similar pronunciation with 😈 EvilEval\r\n\r\n\u003cp align=\"center\"\u003e\r\n\u003cimg src=\"./resources/example.gif\" style=\"width:75%; margin-left: auto; margin-right: auto;\"\u003e\r\n\u003c/p\u003e\r\n\r\nCheckout our 📃 [paper](https://arxiv.org/abs/2403.19114) and [webpage](https://evo-eval.github.io) for more detail! \r\n\r\n\r\n\r\n## ⚡ Quick Start\r\n\r\nDirectly install the package:\r\n\r\n```bash\r\npip install evoeval --upgrade\r\n```\r\n\r\n\u003cdetails\u003e\u003csummary\u003e⏬ Nightly Version \u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n```bash\r\npip install \"git+https://github.com/evo-eval/evoeval.git\" --upgrade\r\n```\r\n\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e⏬ Local Repository\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n```bash\r\ngit clone https://github.com/evo-eval/evoeval.git\r\ncd evoeval\r\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\r\npip install -r requirements.txt\r\n```\r\n\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\nNow you are ready to download EvoEval benchmarks and perform evaluation!\r\n\r\n### 🧑‍💻 Code generation\r\n\r\nTo download our benchmarks, simply use the following code snippet:\r\n\r\n```python\r\nfrom evoeval.data import get_evo_eval\r\n\r\nevoeval_benchmark = \"EvoEval_difficult\" # you can pick from 7 different benchmarks!\r\n\r\nproblems = get_evo_eval(evoeval_benchmark)\r\n```\r\n\r\nFor code generation and evaluation, we adopt the same style as [HumanEval+](https://github.com/evalplus/evalplus) and [HumanEval](https://github.com/openai/human-eval).\r\n\r\nImplement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (include the function header + code) and save the samples to `{benchmark}_samples.jsonl`:\r\n\r\n```python\r\nfrom evoeval.data import get_evo_eval, write_jsonl\r\n\r\nevoeval_benchmark = \"EvoEval_difficult\"\r\n\r\nsamples = [\r\n    dict(task_id=task_id, solution=GEN_SOLUTION(problem[\"prompt\"]))\r\n    for task_id, problem in get_evo_eval(evoeval_benchmark).items()\r\n]\r\nwrite_jsonl(f\"{evoeval_benchmark}_samples.jsonl\", samples)\r\n```\r\n\r\n\u003e [!TIP]\r\n\u003e \r\n\u003e EvoEval `samples.jsonl` expects the solution field to contain the **complete** code implementation, this is \r\n\u003e slightly different from the original HumanEval where the solution field only contains the function body.\r\n\u003e \r\n\u003e If you want to follow exactly like HumanEval setup, checkout our 🤗 Huggingface [datasets](https://huggingface.co/evoeval), which can be directly ran with\r\n\u003e HumanEval evaluation [script](https://huggingface.co/evoeval)\r\n\r\n### 🕵️ Evaluation\r\n\r\nYou can use our provided [docker](https://docs.docker.com/get-docker/) image:\r\n\r\n```bash\r\ndocker run --rm -v $(pwd):/app evoeval/evoeval:latest --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl\r\n```\r\n\r\nOr run it locally:\r\n\r\n```bash\r\nevoeval.evaluate --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl\r\n```\r\n\r\nOr if you are using it as a local repository:\r\n\r\n```bash\r\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\r\npython evoeval/evaluate.py --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl\r\n```\r\n\r\nYou should expect to see the following output (when evaluated on GPT-4):\r\n```\r\nComputing expected output...\r\nExpected outputs computed in 11.24s\r\nReading samples...\r\n100it [00:00, 164.16it/s]\r\n100%|████████████████████████████████████████████████████████████████| 100/100 [00:07\u003c00:00, 12.77it/s]\r\nEvoEval_difficult\r\npass@1: 0.520 # for reference GPT-4 solves more than 80% of problems in HumanEval\r\n```\r\nThis shows the pass@1 score for the EvoEval_difficult benchmark. You can use `--i-just-wanna-run` to recompute the evaluation result\r\n\r\n\u003e [!Note]\r\n\u003e \r\n\u003e You can also evaluate the LLM solutions in a folder format with each subfolder contains\r\n\u003e the LLM solution for each problem in the benchmark\r\n\u003e\r\n\u003e For example, you can grab the [GPT-4 solutions](https://github.com/evo-eval/evoeval/releases/download/v0.1.0/gpt-4_temp_0.0.zip) in our [v0.1.0 release](https://github.com/evo-eval/evoeval/releases/tag/v0.1.0).\r\n\u003e After unzipping, you can run the following command:\r\n\u003e \r\n\u003e ```bash\r\n\u003e evoeval.evaluate --dataset EvoEval_difficult --samples gpt-4_temp_0.0/EvoEval_difficult \r\n\u003e ```\r\n\u003e\r\n\u003e to obtain the same result as above using `.jsonl`\r\n\r\n\r\n## 🔠 Benchmarks\r\n\r\n**EvoEval** contains **7** different benchmarks, each with a unique set of problems \r\nevolved from the original **HumanEval** problems. 🌠 denotes semantic-altering benchmarks, \r\nwhile ⭐ denotes semantic-preserving benchmarks.:\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e🌠EvoEval_difficult:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Introduce complexity by adding additional constraints and requirements,\r\n\u003e replace commonly used requirements to less common ones, or add additional reasoning\r\n\u003e steps to the original problem.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e🌠EvoEval_creative:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Generate a more creative problem compared to the original through the use\r\n\u003e of stories or uncommon narratives.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e🌠EvoEval_subtle:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Make a subtle and minor change to the original problem such as inverting or\r\n\u003e replacing a requirement.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e🌠EvoEval_combine:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Combine two different problems by integrating the concepts from both problems. In order to select problems that make sense to combine, we apply a simple heuristic\r\n\u003e to combine only problems of the same type together categorized based on the type of\r\n\u003e input arguments in the original problem.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e🌠EvoEval_tool_use:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Produce a new problem containing a main problem and one or more helpers\r\n\u003e functions which can be used to solve it. Each helper function is fully implemented and\r\n\u003e provides hints or useful functionality for solving the main problem. The main problem\r\n\u003e does not explicitly reference individual helper functions, and we do not require the model\r\n\u003e to use the provided helpers.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e⭐EvoEval_verbose:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Reword the original docstring to be more verbose. These verbose docstrings\r\n\u003e can use more descriptive language to illustrate the problem, include detailed explanation\r\n\u003e of the example output, and provide additional hints.\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e⭐EvoEval_concise:\u003c/b\u003e\u003c/summary\u003e\r\n\u003cdiv\u003e\r\n\r\n\u003e Reword the original docstring to be more concise by removing unnecessary\r\n\u003e details and using concise language. Furthermore, simple examples that are not required\r\n\u003e to demonstrate edge cases may be removed.\r\n\r\n\u003c/div\u003e\r\n\u003c/details\u003e\r\n\r\nFor each problem in each **EvoEval** benchmark, we include the complete groundtruth as well as test cases for functional evaluation.\r\n\r\n\u003e [!Note]\r\n\u003e \r\n\u003e **Problem Structure**\r\n\u003e \r\n\u003e ```json\r\n\u003e {\r\n\u003e \"task_id\": \"identifier string for the task\",\r\n\u003e \"entry_point\": \"name of the function\",\r\n\u003e \"prompt\": \"function signature with docstring\",\r\n\u003e \"canonical_solution\": \"groundtruth implementation\",\r\n\u003e \"inputs\": \"test inputs for each problem\",\r\n\u003e \"parent\": \"original HumanEval problem it evolved from\",\r\n\u003e \"main\": \"special field of EvoEval_tool_use to show just the main problem description\",\r\n\u003e \"helpers\": \"special field of EvoEval_tool_use to show the helper functions\"\r\n\u003e }\r\n\u003e ```\r\n\r\n## 🤖 LLM Generated Code\r\n\r\nTo view the performance of **\u003e50** LLMs on the EvoEval benchmarks,\r\nwe provide a complete [leaderboard](https://evo-eval.github.io/leaderboard.html) as well as a \r\n[visualization tool](https://evo-eval.github.io/visualization.html) to compare the performance of different models.\r\n\r\nFurther, we also provide all code samples from LLMs on the **EvoEval** benchmarks:\r\n\r\n* See the attachment of our [v0.1.0 release](https://github.com/evo-eval/evoeval/releases/tag/v0.1.0).\r\n\r\nEach LLM generation is packaged in a zip file named like `{model_name}_temp_0.0.zip`. You can unzip the folder and obtain the\r\nLLM generation for each of our 7 benchmarks + the original HumanEval problems. Note that we only evaluate the greedy output for each LLM.\r\n\r\n## 📝 Citation\r\n\r\n```bibtex\r\n@article{evoeval,\r\n  author    = {Xia, Chunqiu Steven and Deng, Yinlin and Zhang, Lingming},\r\n  title     = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},\r\n  year      = {2024},\r\n  journal   = {arXiv preprint},\r\n}\r\n```\r\n\r\n\u003e [!Note]\r\n\u003e \r\n\u003e The first two authors contributed equally to this work, with author order determined via [_Nigiri_](https://senseis.xmp.net/?Nigiri)\r\n\r\n## 🙏 Acknowledgement\r\n\r\n* [HumanEval](https://github.com/openai/human-eval)\r\n* We especially thank [EvalPlus](https://github.com/evalplus/evalplus)\r\n\r\n\r\n\r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevo-eval%2Fevoeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevo-eval%2Fevoeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevo-eval%2Fevoeval/lists"}