https://github.com/tongye98/awesome-code-benchmark
A comprehensive review of code-domain benchmarks for LLM research.
- Host: GitHub
- URL: https://github.com/tongye98/awesome-code-benchmark
- Owner: tongye98
- License: MIT
- Created: 2025-03-15T07:00:38.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-16T07:20:20.000Z (2 months ago)
- Last Synced: 2025-03-16T08:18:55.165Z (2 months ago)
- Topics: awesome, benchmarks, bug-fixing, code-completion, code-efficiency, code-generation, codellms, data-science, multimodal, reasoning
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
A comprehensive review of code-domain benchmarks for LLM research.
## News
- 🔥🔥 **[2025-03-17]** We added **Code Version** (version-specific code generation) benchmarks.
- 🔥🔥 **[2025-03-16]** A thorough review of code domain benchmarks for LLM research has been released.

## 🚀 Top Code Benchmarks
### Code Completion & Code Generation
* **HumanEval**: code completion, typically scored with the pass@k metric (a sketch follows the table below)
* **MBPP**: text-to-code generation
* **EvalPlus**: extends the HumanEval and MBPP benchmarks with substantially more rigorous test cases
* **MultiPL-E**: extends the HumanEval and MBPP benchmarks to 18 languages
* **CodeClarQA**: pairs of natural language descriptions and code, augmented with synthetically created clarification questions and answers
* **BigCodeBench**: code generation with diverse function calls and complex instructions; Complete and Instruct splits
* **DevEval**: Repo-level code generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| HumanEval | [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) | Arxiv 2021/07 | [Github](https://github.com/openai/human-eval) | [🤗Dataset](https://huggingface.co/datasets/openai/openai_humaneval) |
| MBPP | [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) | Arxiv 2021/08 | | [🤗Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp) |
| EvalPlus | [Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation](https://arxiv.org/abs/2305.01210) | NeurIPS 2023 | [Github](https://github.com/evalplus/evalplus) | [🤗Dataset](https://huggingface.co/evalplus) |
| MultiPL-E | [MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation](https://ieeexplore.ieee.org/abstract/document/10103177) | TSE 2023 | [Github](https://github.com/nuprl/MultiPL-E) | [🤗Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E) |
| CodeClarQA | [Python Code Generation by Asking Clarification Questions](https://arxiv.org/abs/2212.09885v2) | ACL 2023 | [Github](https://github.com/UKPLab/codeclarqa) | [Dataset](https://drive.google.com/file/d/1bM-b-L10vNpk7Onyft9BXK8GlMIGl52q/view?usp=sharing) |
| BigCodeBench | [BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://arxiv.org/abs/2406.15877) | ICLR 2025 | [Github](https://github.com/bigcode-project/bigcodebench) | [📊LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) |
| DevEval | [DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories](https://arxiv.org/abs/2405.19856) | ACL 2024 | [Github](https://github.com/seketeam/DevEval) | [🤗Dataset](https://huggingface.co/datasets/LJ0815/DevEval/blob/main/Source_Code.tar.gz) |
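HumanEval, MBPP, and the benchmarks derived from them are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval; the sample counts below are made-up illustrations, not results from any model:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated for a problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical counts: 200 samples drawn for one problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # == 37/200 = 0.185
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```

This only covers the final metric; the harnesses linked above handle generating completions and running the unit tests.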
### Code Efficiency

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| EvalPerf | [Evaluating Language Models for Efficient Code Generation](https://arxiv.org/abs/2408.06450) | COLM 2024 | [Github](https://github.com/evalplus/evalplus) | [🤗Dataset](https://huggingface.co/datasets/evalplus/evalperf) |
| EffiBench | [EffiBench: Benchmarking the Efficiency of Automatically Generated Code](https://arxiv.org/abs/2402.02037) | NeurIPS 2024 | [Github](https://github.com/huangd1999/EffiBench) | |
| Mercury | [Mercury: A Code Efficiency Benchmark for Code Large Language Models](https://arxiv.org/abs/2402.07844v4) | NeurIPS 2024 | [Github](https://github.com/Elfsong/Mercury) | [🤗Dataset](https://huggingface.co/datasets/Elfsong/Mercury) |
| ECCO | [ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?](https://arxiv.org/abs/2407.14044) | EMNLP 2024 | [Github](https://github.com/CodeEff/ECCO) | [🤗Dataset](https://huggingface.co/datasets/CodeEff/ECCO)|
| PIE | [Learning Performance-Improving Code Edits](https://arxiv.org/abs/2302.07867) | ICLR 2024 | [Github](https://github.com/LearningOpt/pie) | [🌐Website](https://pie4perf.com)|
| ENAMEL | [How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark](https://arxiv.org/abs/2406.06647) | ICLR 2025 | [Github](https://github.com/q-rz/enamel) | [🤗Dataset](https://huggingface.co/datasets/q-rz/enamel) |
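These benchmarks differ in exactly what they measure (wall-clock time, memory, comparison against expert references), but the common core is the same: run a functionally correct candidate and a reference on shared inputs and compare cost. A generic sketch of that loop; `reference`, `candidate`, and the test input are placeholders, not taken from any of the listed benchmarks:

```python
import timeit

def reference(nums):
    # Placeholder "reference" solution: quadratic check for a pair summing to zero.
    return any(nums[i] + nums[j] == 0
               for i in range(len(nums)) for j in range(i + 1, len(nums)))

def candidate(nums):
    # Placeholder "model-generated" solution: same semantics, linear time.
    seen = set()
    for x in nums:
        if -x in seen:
            return True
        seen.add(x)
    return False

test_input = list(range(1, 1500))  # no pair of distinct positive ints sums to zero

# Efficiency is only meaningful for functionally correct code,
# so check agreement with the reference first.
assert candidate(test_input) == reference(test_input)

t_ref = timeit.timeit(lambda: reference(test_input), number=3)
t_cand = timeit.timeit(lambda: candidate(test_input), number=3)
print(f"candidate speedup over reference: {t_ref / t_cand:.1f}x")
```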
### CodeFix & Bug-Fix

* **HumanEvalFix**: code repair capabilities
* **SWT-Bench**: evaluating LLMs on test generation for real-world software issues
* **SWE-bench**: evaluating whether LLMs can resolve real-world GitHub issues (a data-loading sketch follows the table below)
* **SWE-bench Multimodal**: evaluating LLMs on their ability to fix bugs in visual, user-facing JavaScript software

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| HumanEvalFix | [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124) | Arxiv 2023/08 | [Github](https://github.com/bigcode-project/octopack) | [🤗Dataset](https://huggingface.co/datasets/bigcode/humanevalpack) |
| SWT-Bench | [SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents](https://arxiv.org/abs/2406.12952) | NeurIPS 2024 | [Github](https://github.com/logic-star-ai/SWT-Bench) | [🌐Website](https://swtbench.com) |
| SWE-bench | [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) | ICLR 2024 | [Github](https://github.com/swe-bench/SWE-bench) | [🌐Website](https://www.swebench.com) |
| SWE-bench Multimodal | [SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?](https://arxiv.org/abs/2410.03859) | ICLR 2025 | [Github](https://github.com/swe-bench/SWE-bench) | [🌐Website](https://www.swebench.com/multimodal) [🤗Dataset](https://www.swebench.com/multimodal) |
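A SWE-bench instance bundles a real GitHub issue with the repository state it was filed against and the tests a correct patch must flip from failing to passing. A minimal inspection sketch, assuming the `datasets` library and the public `princeton-nlp/SWE-bench` release on Hugging Face (field names follow that release and may differ in other variants):

```python
from datasets import load_dataset

# Assumption: the public Hugging Face release "princeton-nlp/SWE-bench".
swe = load_dataset("princeton-nlp/SWE-bench", split="test")
ex = swe[0]

print(ex["repo"], ex["base_commit"])   # repository and commit to check out
print(ex["problem_statement"][:300])   # GitHub issue text shown to the model

# A submission is a patch (unified diff). It is applied at base_commit and
# judged by whether the FAIL_TO_PASS tests now pass while the PASS_TO_PASS
# tests keep passing.
print(ex["instance_id"], ex["FAIL_TO_PASS"])
```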
### Code Reasoning & Understanding

* **CRUXEval**: code reasoning, understanding, and execution capabilities
* **CodeMMLU**: code understanding and comprehension

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| CRUXEval | [CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution](https://arxiv.org/abs/2401.03065) | Arxiv 2024/01 | [Github](https://github.com/facebookresearch/cruxeval) | [📊LeaderBoard](https://crux-eval.github.io/leaderboard.html) |
| CodeMMLU | [CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs](https://arxiv.org/abs/2410.01999) | ICLR 2025 | [Github](https://github.com/FSoft-AI4Code/CodeMMLU/) | [🤗Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) [📊LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/) [🌐 Website](https://fsoft-ai4code.github.io/codemmlu/) |
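CRUXEval turns reasoning into two prediction tasks over short Python functions: predict the output for a given input (output prediction) or supply an input consistent with a given output (input prediction), both without executing the code. An illustrative task in that style, invented here rather than taken from the dataset:

```python
def f(text):
    # The model sees this function verbatim.
    result = []
    for i, ch in enumerate(text):
        if i % 2 == 0:
            result.append(ch.upper())
        else:
            result.append(ch)
    return "".join(result)

# Output prediction: complete the right-hand side without running the code.
assert f("banana") == "BaNaNa"

# Input prediction: supply an input that makes the assertion hold.
assert f("ok") == "Ok"
```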
### Data Science

* **DS-1000**: data science code generation
* **DA-Code**: data science tasks

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| DS-1000 | [DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation](https://arxiv.org/abs/2211.11501) | ICML 2023 | [Github](https://github.com/xlang-ai/DS-1000) | [🌐HomePage](https://ds1000-code-gen.github.io) [🤗Dataset](https://huggingface.co/datasets/xlangai/DS-1000) |
| DA-Code | [DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models](https://arxiv.org/abs/2410.07331) | EMNLP 2024 | [Github](https://github.com/yiyihum/da-code) | [🌐Website](https://da-code-bench.github.io) [🤗Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) |
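DS-1000 problems are short completions against data-science libraries (NumPy, pandas, Matplotlib, and others) and are graded by functional checks on the produced object rather than string match. A hypothetical problem in that spirit; the task text, data, and expected frame are invented, not from the dataset:

```python
import pandas as pd

# Problem context given to the model: a DataFrame plus a natural-language goal,
# e.g. "keep the highest-scoring row per group".
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [3, 5, 2, 7],
})

# --- model completion starts here ---
result = df.loc[df.groupby("group")["score"].idxmax()]
# --- model completion ends here ---

# Functional check: compare against the expected frame, not the exact code.
expected = pd.DataFrame({"group": ["a", "b"], "score": [5, 7]}, index=[1, 3])
pd.testing.assert_frame_equal(result, expected)
```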
### Text2SQL

* **Spider**: text-to-SQL (an execution-accuracy sketch follows the table below)
* **Spider 2.0**: text-to-SQL over real-world enterprise workflows

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| Spider | [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://arxiv.org/abs/1809.08887) | EMNLP 2018 | [Github](https://github.com/taoyds/spider) | [🌐Homepage](https://yale-lily.github.io/spider) |
| Spider 2.0 | [Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows](https://arxiv.org/abs/2411.07763) | ICLR 2025 | [Github](https://github.com/xlang-ai/Spider2) | [🌐Website](https://spider2-sql.github.io) |
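Spider-style leaderboards commonly report execution accuracy alongside exact-set match: the predicted SQL is run against the database and its result set is compared with the gold query's. A self-contained sketch with an in-memory SQLite database; the schema, data, and queries are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'Ana', 'FR'), (2, 'Bo', 'FR'), (3, 'Cy', 'US');
""")

gold_sql = "SELECT count(*) FROM singer WHERE country = 'FR'"
pred_sql = "SELECT count(singer_id) FROM singer WHERE country = 'FR'"  # model output

def execution_match(pred: str, gold: str) -> bool:
    # Compare result multisets; ordering differences are ignored here.
    pred_rows = sorted(conn.execute(pred).fetchall())
    gold_rows = sorted(conn.execute(gold).fetchall())
    return pred_rows == gold_rows

print(execution_match(pred_sql, gold_sql))  # True: different SQL, same result
```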
### MultiModal Code Generation

* **ChartMimic**: chart-to-code generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| ChartMimic | [ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation](https://arxiv.org/abs/2406.09961) | ICLR 2025 | [Github](https://github.com/ChartMimic/ChartMimic) | [🌐Website](https://chartmimic.github.io) [🤗Dataset](https://huggingface.co/datasets/ChartMimic/ChartMimic) |
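In chart-to-code evaluation the model receives a rendered figure (plus instructions) and must emit plotting code that reproduces it; the emitted code is re-rendered and scored against the reference figure. A hedged sketch of the kind of Matplotlib program a model is expected to produce; the data and styling are invented, not taken from the benchmark:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt

# Invented values standing in for numbers read off the reference chart.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [12, 17, 15, 21]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(quarters, revenue, color="steelblue")
ax.set_ylabel("Revenue (M$)")
ax.set_title("Quarterly revenue")

# The re-rendered image is compared with the reference chart.
fig.savefig("reproduced_chart.png", dpi=150)
```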
### Security Code Generation

* **RedCode**: comprehensive and practical evaluation of the safety of code agents

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| RedCode | [RedCode: Risky Code Execution and Generation Benchmark for Code Agents](https://arxiv.org/abs/2411.07781) | NeurIPS 2024 | [Github](https://github.com/AI-secure/RedCode) | [🌐Website](https://redcode-agent.github.io) [📊LeaderBoard](https://redcode-agent.github.io/#leaderboard) |

### Code Translation
* **TransCoder**: code translation between C++, Java, and Python

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| TransCoder | [Unsupervised Translation of Programming Languages](https://arxiv.org/abs/2006.03511) | NeurIPS 2020 | [Github](https://github.com/facebookresearch/TransCoder) (deprecated), [Github](https://github.com/facebookresearch/CodeGen) (new) | |

### Code Version
Version-specific code generation (an illustrative version-pinned completion follows the table below).
* **CodeUpdateEval**: code migration with a time-wise dataset
* **JavaVersionGenBench**: Code Completion Across Evolving Java Versions
* **VersiCode**: Version-controllable Code Generation
* **GitChameleon**: 116 version-aware Python code-completion problems with unit tests
* **LLM-Deprecated-API**: deprecated API mapping and function-level code completion
* **CodeUpdateArena**: API Update Knowledge Editing Assessment
* **LibEvolutionEval**: Version-Specific Code Generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| CodeUpdateEval | [Automatically Recommend Code Updates: Are We There Yet?](https://arxiv.org/abs/2209.07048v3) | TOSEM 2024 | [Github](https://github.com/yueyueL/CodeLM-CodeUpdateEval) | [🤗Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset) |
| JavaVersionGenBench | [On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions](https://arxiv.org/pdf/2403.15149) | ICPC 2024 | [Github](https://github.com/java-generalization/java-generalization-replication) | [🤗Dataset](https://zenodo.org/records/10057237) |
| VersiCode | [VersiCode: Towards Version-controllable Code Generation](https://arxiv.org/abs/2406.07411) | Arxiv 2024/10 | [Github](https://github.com/wutong8023/VersiCode) | [🌐Website](https://wutong8023.site/VersiCode/) [🤗Dataset](https://huggingface.co/datasets/AstoneNg/VersiCode) |
| GitChameleon | [GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models](https://arxiv.org/abs/2411.05830) | Arxiv 2024/11 | [Github](https://github.com/NizarIslah/GitChameleon) | [🤗Dataset](https://github.com/NizarIslah/GitChameleon/tree/main/dataset) |
| LLM-Deprecated-API | [LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion](https://arxiv.org/abs/2406.09834) | ICSE 2025 | [Github](https://github.com/cs-wangchong/LLM-Deprecated-API) | [🤗Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2) |
| CodeUpdateArena | [CodeUpdateArena: Benchmarking Knowledge Editing on API Updates](https://arxiv.org/abs/2407.06249) | Arxiv 2025/02 | [Github](https://github.com/leo-liuzy/CodeUpdateArena) | [🤗Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data) |
| LibEvolutionEval | [LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation](https://arxiv.org/abs/2412.04478) | NAACL 2025 | | |
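What these benchmarks have in common is that the correct completion depends on which library version the surrounding project pins. A hypothetical pandas example of such a version-pinned completion (illustrative only, not drawn from any of the datasets above): `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the same intent needs different code on either side of that boundary.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
row = pd.DataFrame({"x": [3]})

# Acceptable completion for a project pinned to pandas < 2.0
# (DataFrame.append was deprecated in 1.4 but still available):
#   df = df.append(row, ignore_index=True)

# Required completion for pandas >= 2.0, where DataFrame.append is gone:
df = pd.concat([df, row], ignore_index=True)

print(df["x"].tolist())  # [1, 2, 3]
```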
### Industry Code Generation

PLC (Programmable logic controller) & Verilog (Hardware description language) & ... (to be released soon)

### Multi-Dimension
* **LiveCodeBench**: self-repair, code execution, test output prediction, code generation (a test-output-prediction example follows the table below)
* **RACE**: Readability, Maintainability, Correctness, and Efficiency

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| LiveCodeBench | [LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code](https://arxiv.org/abs/2403.07974) | Arxiv 2024/03 | [Github](https://github.com/LiveCodeBench/LiveCodeBench) | [🤗Dataset](https://huggingface.co/livecodebench) |
| RACE | [Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models](https://arxiv.org/abs/2407.11470) | Arxiv 2024/07 | [Github](https://github.com/jszheng21/RACE) | [📊LeaderBoard](https://huggingface.co/spaces/jszheng/RACE_leaderboard) |
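LiveCodeBench's test-output-prediction scenario gives the model a problem statement and a concrete test input and asks it to predict the value the test should expect, without running anything. A toy illustration of that task shape, invented here rather than taken from the benchmark:

```python
# Problem statement shown to the model: "return the sum of the even values
# in a list of integers".
def solve(nums):
    return sum(x for x in nums if x % 2 == 0)

# Test-output prediction: the model fills in expected_output from the problem
# statement and the input alone; it is graded against the true output.
test_input = [1, 2, 3, 4, 5, 6]
expected_output = 12  # the model's prediction

assert solve(test_input) == expected_output
```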