https://github.com/tongye98/awesome-code-benchmark
A comprehensive review of code-domain benchmarks for LLM research.
- Host: GitHub
- URL: https://github.com/tongye98/awesome-code-benchmark
- Owner: tongye98
- License: MIT
- Created: 2025-03-15T07:00:38.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-16T07:20:20.000Z (2 months ago)
- Last Synced: 2025-03-16T08:18:55.165Z (2 months ago)
- Topics: awesome, benchmarks, bug-fixing, code-completion, code-efficiency, code-generation, codellms, data-science, multimodal, reasoning
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
A comprehensive review of code-domain benchmarks for LLM research.
## News
- 🔥🔥 **[2025-03-17]** We added **Code Version** (version-specific code generation) benchmarks.
- 🔥🔥 **[2025-03-16]** A thorough review of code domain benchmarks for LLM research has been released.

## 🚀 Top Code Benchmarks
### Code Completion & Code Generation
* **HumanEval**: code completion, typically scored with the pass@k metric (a sketch follows the table below)
* **MBPP**: text-to-code generation
* **EvalPlus**: extends the HumanEval and MBPP benchmarks with substantially more rigorous test cases
* **MultiPL-E**: extends the HumanEval and MBPP benchmarks to 18 languages
* **CodeClarQA**: pairs of natural language descriptions and code, augmented with synthetically created clarification questions and answers
* **BigCodeBench**: code generation with diverse function calls and complex instructions; Complete and Instruct splits
* **DevEval**: Repo-level code generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| HumanEval | [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) | Arxiv 2021/07 | [Github](https://github.com/openai/human-eval) | [🤗Dataset](https://huggingface.co/datasets/openai/openai_humaneval) |
| MBPP | [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) | Arxiv 2021/08 | | [🤗Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp) |
| EvalPlus | [Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation](https://arxiv.org/abs/2305.01210) | NeurIPS 2023 | [Github](https://github.com/evalplus/evalplus) | [🤗Dataset](https://huggingface.co/evalplus) |
| MultiPL-E | [MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation](https://ieeexplore.ieee.org/abstract/document/10103177) | TSE 2023 | [Github](https://github.com/nuprl/MultiPL-E) | [🤗Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E) |
| CodeClarQA | [Python Code Generation by Asking Clarification Questions](https://arxiv.org/abs/2212.09885v2) | ACL 2023 | [Github](https://github.com/UKPLab/codeclarqa) | [Dataset](https://drive.google.com/file/d/1bM-b-L10vNpk7Onyft9BXK8GlMIGl52q/view?usp=sharing) |
| BigCodeBench | [BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://arxiv.org/abs/2406.15877) | ICLR 2025 | [Github](https://github.com/bigcode-project/bigcodebench) | [📊LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) |
| DevEval | [DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories](https://arxiv.org/abs/2405.19856) | ACL 2024 | [Github](https://github.com/seketeam/DevEval) | [🤗Dataset](https://huggingface.co/datasets/LJ0815/DevEval/blob/main/Source_Code.tar.gz) |
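HumanEval, MBPP, and the benchmarks derived from them are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval; the sample counts below are made-up illustrations, not results from any model:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated for a problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical counts: 200 samples drawn for one problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # == 37/200 = 0.185
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```

This only covers the final metric; the harnesses linked above handle generating completions and running the unit tests.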
### Code Efficiency

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| EvalPerf | [Evaluating Language Models for Efficient Code Generation](https://arxiv.org/abs/2408.06450) | COLM 2024 | [Github](https://github.com/evalplus/evalplus) | [🤗Dataset](https://huggingface.co/datasets/evalplus/evalperf) |
| EffiBench | [EffiBench: Benchmarking the Efficiency of Automatically Generated Code](https://arxiv.org/abs/2402.02037) | NeurIPS 2024 | [Github](https://github.com/huangd1999/EffiBench) | |
| Mercury | [Mercury: A Code Efficiency Benchmark for Code Large Language Models](https://arxiv.org/abs/2402.07844v4) | NeurIPS 2024 | [Github](https://github.com/Elfsong/Mercury) | [🤗Dataset](https://huggingface.co/datasets/Elfsong/Mercury) |
| ECCO | [ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?](https://arxiv.org/abs/2407.14044) | EMNLP 2024 | [Github](https://github.com/CodeEff/ECCO) | [🤗Dataset](https://huggingface.co/datasets/CodeEff/ECCO)|
| PIE | [Learning Performance-Improving Code Edits](https://arxiv.org/abs/2302.07867) | ICLR 2024 | [Github](https://github.com/LearningOpt/pie) | [🌐Website](https://pie4perf.com)|
| ENAMEL | [How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark](https://arxiv.org/abs/2406.06647) | ICLR 2025 | [Github](https://github.com/q-rz/enamel) | [🤗Dataset](https://huggingface.co/datasets/q-rz/enamel) |
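These benchmarks differ in exactly what they measure (wall-clock time, memory, comparison against expert references), but the common core is the same: run a functionally correct candidate and a reference on shared inputs and compare cost. A generic sketch of that loop; `reference`, `candidate`, and the test input are placeholders, not taken from any of the listed benchmarks:

```python
import timeit

def reference(nums):
    # Placeholder "reference" solution: quadratic check for a pair summing to zero.
    return any(nums[i] + nums[j] == 0
               for i in range(len(nums)) for j in range(i + 1, len(nums)))

def candidate(nums):
    # Placeholder "model-generated" solution: same semantics, linear time.
    seen = set()
    for x in nums:
        if -x in seen:
            return True
        seen.add(x)
    return False

test_input = list(range(1, 1500))  # no pair of distinct positive ints sums to zero

# Efficiency is only meaningful for functionally correct code,
# so check agreement with the reference first.
assert candidate(test_input) == reference(test_input)

t_ref = timeit.timeit(lambda: reference(test_input), number=3)
t_cand = timeit.timeit(lambda: candidate(test_input), number=3)
print(f"candidate speedup over reference: {t_ref / t_cand:.1f}x")
```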
### CodeFix & Bug-Fix

* **HumanEvalFix**: code repair capabilities
* **SWT-Bench**: evaluating LLMs on test generation for real-world software issues
* **SWE-bench**: evaluating whether LLMs can resolve real-world GitHub issues (a data-loading sketch follows the table below)
* **SWE-bench Multimodal**: evaluating LLMs on their ability to fix bugs in visual, user-facing JavaScript software

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| HumanEvalFix | [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124) | Arxiv 2023/08 | [Github](https://github.com/bigcode-project/octopack) | [🤗Dataset](https://huggingface.co/datasets/bigcode/humanevalpack) |
| SWT-Bench | [SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents](https://arxiv.org/abs/2406.12952) | NeurIPS 2024 | [Github](https://github.com/logic-star-ai/SWT-Bench) | [🌐Website](https://swtbench.com) |
| SWE-bench | [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) | ICLR 2024 | [Github](https://github.com/swe-bench/SWE-bench) | [🌐Website](https://www.swebench.com) |
| SWE-bench Multimodal | [SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?](https://arxiv.org/abs/2410.03859) | ICLR 2025 | [Github](https://github.com/swe-bench/SWE-bench) | [🌐Website](https://www.swebench.com/multimodal) [🤗Dataset](https://www.swebench.com/multimodal) |
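A SWE-bench instance bundles a real GitHub issue with the repository state it was filed against and the tests a correct patch must flip from failing to passing. A minimal inspection sketch, assuming the `datasets` library and the public `princeton-nlp/SWE-bench` release on Hugging Face (field names follow that release and may differ in other variants):

```python
from datasets import load_dataset

# Assumption: the public Hugging Face release "princeton-nlp/SWE-bench".
swe = load_dataset("princeton-nlp/SWE-bench", split="test")
ex = swe[0]

print(ex["repo"], ex["base_commit"])   # repository and commit to check out
print(ex["problem_statement"][:300])   # GitHub issue text shown to the model

# A submission is a patch (unified diff). It is applied at base_commit and
# judged by whether the FAIL_TO_PASS tests now pass while the PASS_TO_PASS
# tests keep passing.
print(ex["instance_id"], ex["FAIL_TO_PASS"])
```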
### Code Reasoning & Understanding

* **CRUXEval**: code reasoning, understanding, and execution capabilities
* **CodeMMLU**: code understanding and comprehension

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| CRUXEval | [CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution](https://arxiv.org/abs/2401.03065) | Arxiv 2024/01 | [Github](https://github.com/facebookresearch/cruxeval) | [📊LeaderBoard](https://crux-eval.github.io/leaderboard.html) |
| CodeMMLU | [CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs](https://arxiv.org/abs/2410.01999) | ICLR 2025 | [Github](https://github.com/FSoft-AI4Code/CodeMMLU/) | [🤗Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) [📊LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/) [🌐 Website](https://fsoft-ai4code.github.io/codemmlu/) |
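CRUXEval turns reasoning into two prediction tasks over short Python functions: predict the output for a given input (output prediction) or supply an input consistent with a given output (input prediction), both without executing the code. An illustrative task in that style, invented here rather than taken from the dataset:

```python
def f(text):
    # The model sees this function verbatim.
    result = []
    for i, ch in enumerate(text):
        if i % 2 == 0:
            result.append(ch.upper())
        else:
            result.append(ch)
    return "".join(result)

# Output prediction: complete the right-hand side without running the code.
assert f("banana") == "BaNaNa"

# Input prediction: supply an input that makes the assertion hold.
assert f("ok") == "Ok"
```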
### Data Science

* **DS-1000**: data science code generation
* **DA-Code**: data science tasks

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| DS-1000 | [DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation](https://arxiv.org/abs/2211.11501) | ICML 2023 | [Github](https://github.com/xlang-ai/DS-1000) | [🌐HomePage](https://ds1000-code-gen.github.io) [🤗Dataset](https://huggingface.co/datasets/xlangai/DS-1000) |
| DA-Code | [DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models](https://arxiv.org/abs/2410.07331) | EMNLP 2024 | [Github](https://github.com/yiyihum/da-code) | [🌐Website](https://da-code-bench.github.io) [🤗Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) |
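DS-1000 problems are short completions against data-science libraries (NumPy, pandas, Matplotlib, and others) and are graded by functional checks on the produced object rather than string match. A hypothetical problem in that spirit; the task text, data, and expected frame are invented, not from the dataset:

```python
import pandas as pd

# Problem context given to the model: a DataFrame plus a natural-language goal,
# e.g. "keep the highest-scoring row per group".
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [3, 5, 2, 7],
})

# --- model completion starts here ---
result = df.loc[df.groupby("group")["score"].idxmax()]
# --- model completion ends here ---

# Functional check: compare against the expected frame, not the exact code.
expected = pd.DataFrame({"group": ["a", "b"], "score": [5, 7]}, index=[1, 3])
pd.testing.assert_frame_equal(result, expected)
```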
### Text2SQL

* **Spider**: text-to-SQL (an execution-accuracy sketch follows the table below)
* **Spider 2.0**: text-to-SQL over real-world enterprise workflows

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| Spider | [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://arxiv.org/abs/1809.08887) | EMNLP 2018 | [Github](https://github.com/taoyds/spider) | [🌐Homepage](https://yale-lily.github.io/spider) |
| Spider 2.0 | [Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows](https://arxiv.org/abs/2411.07763) | ICLR 2025 | [Github](https://github.com/xlang-ai/Spider2) | [🌐Website](https://spider2-sql.github.io) |
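Spider-style leaderboards commonly report execution accuracy alongside exact-set match: the predicted SQL is run against the database and its result set is compared with the gold query's. A self-contained sketch with an in-memory SQLite database; the schema, data, and queries are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'Ana', 'FR'), (2, 'Bo', 'FR'), (3, 'Cy', 'US');
""")

gold_sql = "SELECT count(*) FROM singer WHERE country = 'FR'"
pred_sql = "SELECT count(singer_id) FROM singer WHERE country = 'FR'"  # model output

def execution_match(pred: str, gold: str) -> bool:
    # Compare result multisets; ordering differences are ignored here.
    pred_rows = sorted(conn.execute(pred).fetchall())
    gold_rows = sorted(conn.execute(gold).fetchall())
    return pred_rows == gold_rows

print(execution_match(pred_sql, gold_sql))  # True: different SQL, same result
```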
### MultiModal Code Generation

* **ChartMimic**: chart-to-code generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| ChartMimic | [ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation](https://arxiv.org/abs/2406.09961) | ICLR 2025 | [Github](https://github.com/ChartMimic/ChartMimic) | [🌐Website](https://chartmimic.github.io) [🤗Dataset](https://huggingface.co/datasets/ChartMimic/ChartMimic) |
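In chart-to-code evaluation the model receives a rendered figure (plus instructions) and must emit plotting code that reproduces it; the emitted code is re-rendered and scored against the reference figure. A hedged sketch of the kind of Matplotlib program a model is expected to produce; the data and styling are invented, not taken from the benchmark:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt

# Invented values standing in for numbers read off the reference chart.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [12, 17, 15, 21]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(quarters, revenue, color="steelblue")
ax.set_ylabel("Revenue (M$)")
ax.set_title("Quarterly revenue")

# The re-rendered image is compared with the reference chart.
fig.savefig("reproduced_chart.png", dpi=150)
```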
### Security Code Generation

* **RedCode**: comprehensive and practical evaluation of the safety of code agents

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| RedCode | [RedCode: Risky Code Execution and Generation Benchmark for Code Agents](https://arxiv.org/abs/2411.07781) | NeurIPS 2024 | [Github](https://github.com/AI-secure/RedCode) | [🌐Website](https://redcode-agent.github.io) [📊LeaderBoard](https://redcode-agent.github.io/#leaderboard) |

### Code Translation
* **TransCoder**: code translation between C++, Java, and Python

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| TransCoder | [Unsupervised Translation of Programming Languages](https://arxiv.org/abs/2006.03511) | NeurIPS 2020 | [Github](https://github.com/facebookresearch/TransCoder) (deprecated), [Github](https://github.com/facebookresearch/CodeGen) (new) | |

### Code Version
Version-specific code generation (an illustrative version-pinned completion follows the table below).
* **CodeUpdateEval**: code migration with a time-wise dataset
* **JavaVersionGenBench**: Code Completion Across Evolving Java Versions
* **VersiCode**: Version-controllable Code Generation
* **GitChameleon**: 116 version-aware Python code-completion problems with unit tests
* **LLM-Deprecated-API**: deprecated API mapping and function-level code completion
* **CodeUpdateArena**: API Update Knowledge Editing Assessment
* **LibEvolutionEval**: Version-Specific Code Generation

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| CodeUpdateEval | [Automatically Recommend Code Updates: Are We There Yet?](https://arxiv.org/abs/2209.07048v3) | TOSEM 2024 | [Github](https://github.com/yueyueL/CodeLM-CodeUpdateEval) | [🤗Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset) |
| JavaVersionGenBench | [On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions](https://arxiv.org/pdf/2403.15149) | ICPC 2024 | [Github](https://github.com/java-generalization/java-generalization-replication) | [🤗Dataset](https://zenodo.org/records/10057237) |
| VersiCode | [VersiCode: Towards Version-controllable Code Generation](https://arxiv.org/abs/2406.07411) | Arxiv 2024/10 | [Github](https://github.com/wutong8023/VersiCode) | [🌐Website](https://wutong8023.site/VersiCode/) [🤗Dataset](https://huggingface.co/datasets/AstoneNg/VersiCode) |
| GitChameleon | [GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models](https://arxiv.org/abs/2411.05830) | Arxiv 2024/11 | [Github](https://github.com/NizarIslah/GitChameleon) | [🤗Dataset](https://github.com/NizarIslah/GitChameleon/tree/main/dataset) |
| LLM-Deprecated-API | [LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion](https://arxiv.org/abs/2406.09834) | ICSE 2025 | [Github](https://github.com/cs-wangchong/LLM-Deprecated-API) | [🤗Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2) |
| CodeUpdateArena | [CodeUpdateArena: Benchmarking Knowledge Editing on API Updates](https://arxiv.org/abs/2407.06249) | Arxiv 2025/02 | [Github](https://github.com/leo-liuzy/CodeUpdateArena) | [🤗Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data) |
| LibEvolutionEval | [LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation](https://arxiv.org/abs/2412.04478) | NAACL 2025 | | |
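What these benchmarks have in common is that the correct completion depends on which library version the surrounding project pins. A hypothetical pandas example of such a version-pinned completion (illustrative only, not drawn from any of the datasets above): `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the same intent needs different code on either side of that boundary.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
row = pd.DataFrame({"x": [3]})

# Acceptable completion for a project pinned to pandas < 2.0
# (DataFrame.append was deprecated in 1.4 but still available):
#   df = df.append(row, ignore_index=True)

# Required completion for pandas >= 2.0, where DataFrame.append is gone:
df = pd.concat([df, row], ignore_index=True)

print(df["x"].tolist())  # [1, 2, 3]
```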
### Industry Code Generation

PLC (Programmable logic controller) & Verilog (Hardware description language) & ... (to be released soon)

### Multi-Dimension
* **LiveCodeBench**: self-repair, code execution, test output prediction, code generation (a test-output-prediction example follows the table below)
* **RACE**: Readability, Maintainability, Correctness, and Efficiency

| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|:--|:--|:--|:--|:--|
| LiveCodeBench | [LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code](https://arxiv.org/abs/2403.07974) | Arxiv 2024/03 | [Github](https://github.com/LiveCodeBench/LiveCodeBench) | [🤗Dataset](https://huggingface.co/livecodebench) |
| RACE | [Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models](https://arxiv.org/abs/2407.11470) | Arxiv 2024/07 | [Github](https://github.com/jszheng21/RACE) | [📊LeaderBoard](https://huggingface.co/spaces/jszheng/RACE_leaderboard) |
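LiveCodeBench's test-output-prediction scenario gives the model a problem statement and a concrete test input and asks it to predict the value the test should expect, without running anything. A toy illustration of that task shape, invented here rather than taken from the benchmark:

```python
# Problem statement shown to the model: "return the sum of the even values
# in a list of integers".
def solve(nums):
    return sum(x for x in nums if x % 2 == 0)

# Test-output prediction: the model fills in expected_output from the problem
# statement and the input alone; it is graded against the true output.
test_input = [1, 2, 3, 4, 5, 6]
expected_output = 12  # the model's prediction

assert solve(test_input) == expected_output
```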