{"id":26479515,"url":"https://github.com/tongye98/awesome-code-benchmark","last_synced_at":"2025-03-20T01:55:38.320Z","repository":{"id":282673126,"uuid":"948882317","full_name":"tongye98/Awesome-Code-Benchmark","owner":"tongye98","description":"A comprehensive code domain benchmark review of LLM researches.","archived":false,"fork":false,"pushed_at":"2025-03-16T07:20:20.000Z","size":6,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T08:18:55.165Z","etag":null,"topics":["awesome","benchmarks","bug-fixing","code-completion","code-efficiency","code-generation","codellms","data-science","multimodal","reasoning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tongye98.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-15T07:00:38.000Z","updated_at":"2025-03-16T07:22:54.000Z","dependencies_parsed_at":"2025-03-16T10:15:21.351Z","dependency_job_id":null,"html_url":"https://github.com/tongye98/Awesome-Code-Benchmark","commit_stats":null,"previous_names":["tongye98/awesome-code-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tongye98%2FAwesome-Code-Benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tongye98%2FAwesome-Code-Benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tongye98%2FAwesome-Code-Benchmark/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tongye98%2FAwesome-Code-Benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tongye98","download_url":"https://codeload.github.com/tongye98/Awesome-Code-Benchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244536437,"owners_count":20468349,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome","benchmarks","bug-fixing","code-completion","code-efficiency","code-generation","codellms","data-science","multimodal","reasoning"],"created_at":"2025-03-20T01:55:37.654Z","updated_at":"2025-03-20T01:55:38.303Z","avatar_url":"https://github.com/tongye98.png","language":null,"funding_links":[],"categories":["Other Lists"],"sub_categories":["TeX Lists"],"readme":"\u003cdiv align=\"center\"\u003e\r\n  \u003ch1\u003e👨‍💻 Awesome Code Benchmark\u003c/h1\u003e\r\n  \u003ca href=\"https://awesome.re\"\u003e\r\n    \u003cimg src=\"https://awesome.re/badge.svg\" alt=\"Awesome\"\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://img.shields.io/badge/PRs-Welcome-red\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/PRs-Welcome-red\" alt=\"PRs Welcome\"\u003e\r\n  \u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\nA comprehensive code domain benchmark review of LLM researches.\r\n\r\n\u003cp align=\"center\"\u003e\r\n    \u003cimg src=\"https://i.imgur.com/waxVImv.png\" alt=\"Oryx Video-ChatGPT\"\u003e\r\n\u003c/p\u003e\r\n\r\n## News\r\n- 🔥🔥 **[2025-03-17]** We add **Code Version** (Version-specific code generation) 
benchmarks.\r\n- 🔥🔥 **[2025-03-16]** A thorough review of code domain benchmarks for LLM research has been released.\r\n\r\n\u003c!-- ## Table of Contents\r\n1. [Code C](#1_Code Completion \u0026 Code Generation) --\u003e\r\n\r\n## 🚀 Top Code Benchmark\r\n\r\n### Code Completion \u0026 Code Generation\r\n* **HumanEval**: code completion \r\n* **MBPP**: text -\u003e code; code generation \r\n* **EvalPlus**: Extends the HumanEval and MBPP benchmarks\r\n* **MultiPL-E**: Extends the HumanEval and MBPP benchmarks to 18 languages\r\n* **CodeClarQA**: containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.\r\n* **BigCodeBench**: Complete Split \u0026 Instruct Split\r\n* **DevEval**: Repo-level code generation\r\n\r\n| Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| HumanEval     | [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)                                                                       | Arxiv 2021/07       | [Github](https://github.com/openai/human-eval)                                | [🤗Dataset](https://huggingface.co/datasets/openai/openai_humaneval) | \r\n| MBPP          | [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732)                                                                           | Arxiv 2021/08       | | [🤗Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp) | \r\n| EvalPlus      | [Is Your Code Generated by Chat{GPT} Really Correct? 
Rigorous Evaluation of Large Language Models for Code Generation](https://arxiv.org/abs/2305.01210)   | NeurIPS 2023        | [Github](https://github.com/evalplus/evalplus)                                | [🤗Dataset](https://huggingface.co/evalplus) | \r\n| MultiPL-E     | [MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation](https://ieeexplore.ieee.org/abstract/document/10103177)               | TSE 2023            | [Github](https://github.com/nuprl/MultiPL-E)                                  | [🤗Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E) |\r\n| CodeClarQA    | [Python Code Generation by Asking Clarification Questions](https://arxiv.org/abs/2212.09885v2)                                                             | ACL 2023            | [Github](https://github.com/UKPLab/codeclarqa)                                | [Dataset](https://drive.google.com/file/d/1bM-b-L10vNpk7Onyft9BXK8GlMIGl52q/view?usp=sharing) | \r\n| BigCodeBench  | [BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://arxiv.org/abs/2406.15877)                        | ICLR 2025           | [Github](https://github.com/bigcode-project/bigcodebench)                     | [📊LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) |   \r\n| DevEval       | [DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories](https://arxiv.org/abs/2405.19856)                      | ACL 2024            | [Github](https://github.com/seketeam/DevEval)                                 | [🤗Dataset](https://huggingface.co/datasets/LJ0815/DevEval/blob/main/Source_Code.tar.gz)|\r\n\r\n\r\n### Code Efficiency\r\n| Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| EvalPerf      | [Evaluating Language Models for Efficient Code Generation](https://arxiv.org/abs/2408.06450)                                         
 | COLM 2024                   | [Github](https://github.com/evalplus/evalplus)        | [🤗Dataset](https://huggingface.co/datasets/evalplus/evalperf) |\r\n| EffiBench     | [EffiBench: Benchmarking the Efficiency of Automatically Generated Code](https://arxiv.org/abs/2402.02037)                            | NeurIPS 2024                | [Github](https://github.com/huangd1999/EffiBench)     |  |\r\n| Mercury       | [Mercury: A Code Efficiency Benchmark for Code Large Language Models](https://arxiv.org/abs/2402.07844v4)                             | NeurIPS 2024                | [Github](https://github.com/Elfsong/Mercury)          | [🤗Dataset](https://huggingface.co/datasets/Elfsong/Mercury) |\r\n| ECCO          | [ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?](https://arxiv.org/abs/2407.14044)  | EMNLP 2024                  | [Github](https://github.com/CodeEff/ECCO)             | [🤗Dataset](https://huggingface.co/datasets/CodeEff/ECCO)|\r\n| PIE           | [Learning Performance-Improving Code Edits](https://arxiv.org/abs/2302.07867)                                                         | ICLR 2024                   | [Github](https://github.com/LearningOpt/pie)          | [🌐Website](https://pie4perf.com)|  \r\n| ENAMEL        | [How Efficient is LLM-Generated Code? 
A Rigorous \u0026 High-Standard Benchmark](https://arxiv.org/abs/2406.06647)                         | ICLR 2025                   | [Github](https://github.com/q-rz/enamel)              | [🤗Dataset](https://huggingface.co/datasets/q-rz/enamel) |\r\n\r\n\r\n### CodeFix \u0026 Bug-Fix\r\n* **HumanEvalFix**: code repair capabilities \r\n* **SWT-Bench**: Evaluating LLMs on test generation for real-world software issues \r\n* **SWE-bench**: Evaluating whether LLMs can resolve real-world GitHub issues \r\n* **SWE-bench Multimodal**: Evaluating LLMs on their ability to fix bugs in visual, user-facing JavaScript software \r\n\r\n| Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| HumanEvalFix          | [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124)                                    | Arxiv 2023/08              | [Github](https://github.com/bigcode-project/octopack)        | [🤗Dataset](https://huggingface.co/datasets/bigcode/humanevalpack) |  \r\n| SWT-Bench             | [SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents](https://arxiv.org/abs/2406.12952)                    | NeurIPS 2024               | [Github](https://github.com/logic-star-ai/SWT-Bench)         | [🌐Website](https://swtbench.com) |\r\n| SWE-bench             | [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770)                           | ICLR 2024                  | [Github](https://github.com/swe-bench/SWE-bench)             | [🌐Website](https://www.swebench.com) | \r\n| SWE-bench Multimodal  | [SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?](https://arxiv.org/abs/2410.03859)                 | ICLR 2025                  | [Github](https://github.com/swe-bench/SWE-bench)             | [🌐Website](https://www.swebench.com/multimodal) [🤗Dataset](https://www.swebench.com/multimodal) | \r\n\r\n\r\n### Code Reasoning 
\u0026 Understanding\r\n* **CRUXEval**: code reasoning, understanding, and execution capabilities\r\n* **CodeMMLU**: code understanding and comprehension\r\n\r\n|Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| CRUXEval      | [CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution](https://arxiv.org/abs/2401.03065)                                      | Arxiv 2024/01         | [Github](https://github.com/facebookresearch/cruxeval)        | [📊LeaderBoard](https://crux-eval.github.io/leaderboard.html) |    \r\n| CodeMMLU      | [CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs](https://arxiv.org/abs/2410.01999)                 | ICLR 2025             | [Github](https://github.com/FSoft-AI4Code/CodeMMLU/)          | [🤗Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) [📊LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/) [🌐  Website](https://fsoft-ai4code.github.io/codemmlu/) | \r\n\r\n\r\n### Data science\r\n* **DS-1000**: Data Science Code Generation\r\n* **DA-Code**: Data science tasks \r\n\r\n| Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| DS-1000      | [DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation](https://arxiv.org/abs/2211.11501)                                      | ICML 2023                 | [Github](https://github.com/xlang-ai/DS-1000)        | [🌐HomePage](https://ds1000-code-gen.github.io) [🤗Dataset](https://huggingface.co/datasets/xlangai/DS-1000)  | \r\n| DA-Code      | [DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models](https://arxiv.org/abs/2410.07331)                                 | EMNLP 2024                | [Github](https://github.com/yiyihum/da-code)         | [🌐Website](https://da-code-bench.github.io) [🤗Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) | 
\r\n\r\n\r\n### Text2SQL \r\n* **Spider**:  text-to-SQL  \r\n* **Spider 2.0**: text-to-SQL\r\n\r\n| Benchmark | Paper | Date | Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| Spider      | [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://arxiv.org/abs/1809.08887)         | EMNLP 2018     | [Github](https://github.com/taoyds/spider)        | [🌐Homepage](https://yale-lily.github.io/spider) |\r\n| Spider 2.0  | [Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows](https://arxiv.org/abs/2411.07763)                                  | ICLR 2025      | [Github](https://github.com/xlang-ai/Spider2)     | [🌐Website](https://spider2-sql.github.io) |\r\n\r\n\r\n### MultiModal Code Generation\r\n* **ChartMimic** : Chart-to-Code Generation\r\n\r\n| Benchmark | Paper | Date| Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| ChartMimic      | [ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation](https://arxiv.org/abs/2406.09961)                       | ICLR 2025                | [Github](https://github.com/ChartMimic/ChartMimic)        | [🌐Website](https://chartmimic.github.io) [🤗Dataset](https://huggingface.co/datasets/ChartMimic/ChartMimic) |  \r\n\r\n\r\n### Security Code Generation\r\n*  **RedCode**: comprehensive and practical evaluations on the safety of code agents \r\n\r\n| Benchmark | Paper | Date| Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| RedCode        | [RedCode: Risky Code Execution and Generation Benchmark for Code Agents](https://arxiv.org/abs/2411.07781)                                      | NeurIPS 2024                | [Github](https://github.com/AI-secure/RedCode)        | [🌐Website](https://redcode-agent.github.io) [📊LeaderBoard](https://redcode-agent.github.io/#leaderboard) |\r\n\r\n\r\n### Code 
Translation\r\n* **TransCoder**: code translation in C++, Java, Python\r\n\r\n\r\n| Benchmark | Paper | Date| Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| TransCoder        | [Unsupervised Translation of Programming Languages](https://arxiv.org/abs/2006.03511)                                      | NeurIPS 2020                | [Github](https://github.com/facebookresearch/TransCoder)(deprecated) [Github](https://github.com/facebookresearch/CodeGen)(new)        |  |\r\n\r\n\r\n### Code Version\r\nVersion-specific code generation\r\n* **CodeUpdateEval**: code migration with Time-wise dataset                        \r\n* **JavaVersionGenBench**: Code Completion Across Evolving Java Versions                \r\n* **VersiCode**: Version-controllable Code Generation                         \r\n* **GitChameleon**: 116 version-aware Python code-completion problems with unit tests \r\n* **LLM-Deprecated-API**:  Deprecated API mapping and functions code completion        \r\n* **CodeUpdateArena**: API Update Knowledge Editing Assessment                      \r\n* **LibEvolutionEval**: Version-Specific Code Generation                             \r\n\r\n| Benchmark | Paper | Date| Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| CodeUpdateEval             | [Automatically Recommend Code Updates: Are We There Yet?](https://arxiv.org/abs/2209.07048v3)                                                |  TOSEM 2024      | [Github](https://github.com/yueyueL/CodeLM-CodeUpdateEval)                       | [🤗Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset) | \r\n| JavaVersionGenBench        | [On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions](https://arxiv.org/pdf/2403.15149)      |  ICPC 2024       | [Github](https://github.com/java-generalization/java-generalization-replication) | [🤗Dataset](https://zenodo.org/records/10057237)         
      | \r\n| VersiCode                  | [VersiCode: Towards Version-controllable Code Generation](https://arxiv.org/abs/2406.07411)                                                  |  Arxiv 2024/10   | [Github](https://github.com/wutong8023/VersiCode)                                | [🌐Website](https://wutong8023.site/VersiCode/) [🤗Dataset](https://huggingface.co/datasets/AstoneNg/VersiCode) | \r\n| GitChameleon               | [GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models](https://arxiv.org/abs/2411.05830)                     |  Arxiv 2024/11   | [Github](https://github.com/NizarIslah/GitChameleon)                             | [🤗Dataset](https://github.com/NizarIslah/GitChameleon/tree/main/dataset) | \r\n| LLM-Deprecated-API         | [LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion](https://arxiv.org/abs/2406.09834)                |  ICSE 2025       | [Github](https://github.com/cs-wangchong/LLM-Deprecated-API)                     | [🤗Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2)       |\r\n| CodeUpdateArena            | [CodeUpdateArena: Benchmarking Knowledge Editing on API Updates](https://arxiv.org/abs/2407.06249)                                           |  Arxiv 2025/02   | [Github](https://github.com/leo-liuzy/CodeUpdateArena)                           | [🤗Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data) | \r\n| LibEvolutionEval           | [LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation](https://arxiv.org/abs/2412.04478)                             |  NAACL 2025      |                                                                                  |                                                              | \r\n\r\n\r\n### Industry Code Generation\r\nPLC (Programmable logic controller) \u0026 Verilog (Hardware description language) \u0026 ... 
(to be released soon)\r\n\r\n\r\n### Multi-Dimension\r\n* **LiveCodeBench**: self-repair, code execution, test output prediction, code generation\r\n* **RACE**: Readability, Maintainability, Correctness, and Efficiency \r\n\r\n| Benchmark | Paper | Date| Github | Dataset \u0026 Website \u0026 LeaderBoard |\r\n|:--|:--|:--|:--|:--|\r\n| LiveCodeBench | [LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code](https://arxiv.org/abs/2403.07974)  | Arxiv 2024/03 | [Github](https://github.com/LiveCodeBench/LiveCodeBench) | [🤗Dataset](https://huggingface.co/livecodebench) |  \r\n| RACE          | [Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models](https://arxiv.org/abs/2407.11470) | Arxiv 2024/07 | [Github](https://github.com/jszheng21/RACE)              | [📊LeaderBoard](https://huggingface.co/spaces/jszheng/RACE_leaderboard) | \r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftongye98%2Fawesome-code-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftongye98%2Fawesome-code-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftongye98%2Fawesome-code-benchmark/lists"}