awesome-code-benchmark
A comprehensive review of code-domain benchmarks for LLM research.
https://github.com/tongye98/awesome-code-benchmark
Benchmark Categories
Code Generation & Completion
- SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement - `task: Code generation / completion` - `granularity: Function / Repository` - `interaction: Single-turn / Multi-turn` - `evaluation: Unit Tests / Execution`
Code Understanding, Search & Review
- SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback - `task: Pull request review` - `granularity: Repository / Workflow` - `interaction: Single-turn / Agentic` - `evaluation: LLM-as-Judge / Human`
- Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice - `task: Fine-grained code review` - `granularity: File / Repository` - `interaction: Single-turn / Agentic` - `evaluation: LLM-as-Judge / Human`
Performance Optimization
- CodegenBench: Can LLMs Write Efficient Code Across Architectures? - `task: Performance optimization` - `granularity: Function / Repository` - `interaction: Single-turn / Agentic` - `evaluation: Execution / Performance` - [Code](https://anonymous.4open.science/r/CodegenBench-EDE1/) [Dataset](https://anonymous.4open.science/r/CodegenBenchDataset-2551/)
Program Repair, Testing & Debugging
- Planning to Explore: Curiosity-Driven Planning for LLM Test Generation - `task: Iterative test generation` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Unit Tests / Execution`
- BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software - `task: Build and compilation repair` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Execution`
Repository & Agentic Software Engineering
- ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development - `task: End-to-end project development` - `granularity: Project / Repository` - `interaction: Agentic` - `evaluation: Unit Tests / LLM-as-Judge` - [Github](https://github.com/zsworld6/projdevbench)
- HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks - `task: Hardware repository bug repair` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Execution`
- IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks - `task: IDE-native software engineering` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Unit Tests / Execution`
- RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository - `task: End-to-end microservice repository generation` - `granularity: Project / Repository` - `interaction: Agentic` - `evaluation: Unit Tests / Execution / Deployment` - [Github](https://github.com/pzy2000/RepoGenesis)
- SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring - `task: Repository-level refactoring` - `granularity: Repository` - `interaction: Single-turn / Agentic` - `evaluation: Compilation / Unit Tests / Static Analysis`
- SWE Context Bench: A Benchmark for Context Learning in Coding - `task: Context reuse across related coding tasks` - `granularity: Repository / Workflow` - `interaction: Multi-turn / Agentic` - `evaluation: Unit Tests / Execution / Cost`
- SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications? - `task: Industrial mobile application development` - `granularity: Repository / Workflow` - `interaction: Multimodal / Agentic` - `evaluation: Unit Tests / Execution` - [Website](https://swebenchmobile.com)
- RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing - `task: Repository modernization / migration` - `granularity: Project / Repository` - `interaction: Agentic` - `evaluation: Black-box Tests / Execution` - [Github](https://github.com/Modelcode-ai/mcode-benchmark)
- SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding - `task: Agentic repository-level code understanding` - `granularity: Repository` - `interaction: Agentic` - `evaluation: QA Accuracy / Execution`
- Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution - `task: Sequential software evolution` - `granularity: Repository / Workflow` - `interaction: Multi-turn / Agentic` - `evaluation: Unit Tests / Execution / Static Analysis`
- RepoZero: Can LLMs Generate a Code Repository from Scratch? - `task: Repository generation from scratch` - `granularity: Project / Repository` - `interaction: Agentic` - `evaluation: Black-box Tests / Execution`
- VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents - `task: Visual spec-to-web-app development` - `granularity: Project / Workflow` - `interaction: Multimodal / Agentic` - `evaluation: Browser Tests / Visual Similarity / DOM Matching` - [Github](https://github.com/kaboider/VIS_APP_Code) [Website](https://kaboider.github.io/VIS_APP/) [Dataset](https://huggingface.co/datasets/JunJiaGuo/VIS-APP-Bench)
- Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems - `task: Production runtime assessment` - `granularity: Project / Workflow` - `interaction: Agentic` - `evaluation: Execution / Runtime / Cost` - [Website](https://ramp.nexa-lang.com/)
- Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks - `task: Agent harness evaluation for issue resolving` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Unit Tests / Execution / Cost` - [Github](https://github.com/opensquilla/claw-swe-bench) [Dataset](https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench)
- Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents - `task: Dialogue-driven issue resolving` - `granularity: Repository / Workflow` - `interaction: Multi-turn / Agentic` - `evaluation: Unit Tests / Dialogue Quality`
- SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents - `task: Future-oriented repository task synthesis` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Semantic Matching / Execution`
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? - `task: Freelance software engineering` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Execution / Human / Economic` - [Github](https://github.com/openai/SWELancer-Benchmark)
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving - `task: Multilingual issue resolving` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Unit Tests / Execution`
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - `task: Long-horizon issue resolving` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Unit Tests / Execution`
- Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation - `task: PRD-driven project development` - `granularity: Project / Repository` - `interaction: Agentic` - `evaluation: Execution / LLM-as-Judge`
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories - `task: Repository-level bug fixing and feature requests` - `granularity: Repository` - `interaction: Agentic` - `evaluation: Unit Tests / Execution`
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories - `task: Scientific repository development` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Execution`
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development - `task: Feature-driven software development` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Unit Tests / Execution` - [Github](https://github.com/DorothyDUUU/SWE-Dev)
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner - `task: Test-driven incremental development` - `granularity: Repository / Workflow` - `interaction: Agentic` - `evaluation: Unit Tests / Execution` - [Github](https://github.com/Hambaobao/SWE-Flow)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering - `task: Long-context software engineering workflows` - `granularity: Repository / Workflow` - `interaction: Multi-turn / Agentic` - `evaluation: Execution / Tool-use Metrics / LLM-as-Judge`
News
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
- aiXamine: Simplified LLM Safety and Security
- CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
- CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
- Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
- ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
- OSS-Bench: Benchmark Generator for Coding LLMs
- VERINA: Benchmarking Verifiable Code Generation
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
- EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
- Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency
- Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents
- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- Success is in the Details: Evaluate and Enhance Details Sensitivity of Code
- CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
- Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
- CODEMENV: Benchmarking Large Language Models on Code Migration
- DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
- DS-Bench: A Realistic Benchmark for Data Science Code Generation
- Rethinking Repetition Problems of LLMs in Code Generation
- WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
- CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
- Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware
- OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics
- FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
- CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
- CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs
- SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
- RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
- MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios
- OJBench: A Competition Level Code Benchmark For Large Language Models
- TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
- DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
- PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models
- CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
- CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
- Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
- Model Editing for LLMs4Code: How Far are We?
- VeriBench: Benchmarking Large Language Models for Verilog Code Generation and Design Synthesis
- ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness
- Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
- JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation
- From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
- AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes
- STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning
- SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
- Multilingual Multimodal Software Developer for Code Generation
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
- SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
- Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
- MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
- Turning the Tide: Repository-based Code Reflection
- TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
- LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
- CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Survey
Top Code Benchmark
Code Completion & Code Generation
- Evaluating Large Language Models Trained on Code - [Github](https://github.com/openai/human-eval) [Dataset](https://huggingface.co/datasets/openai/openai_humaneval)
- Program Synthesis with Large Language Models - [Github](https://github.com/google-research/google-research/tree/master/mbpp) [Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
- MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - [Github](https://github.com/nuprl/MultiPL-E) [Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E)
- Python Code Generation by Asking Clarification Questions - b-L10vNpk7Onyft9BXK8GlMIGl52q/view?usp=sharing)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions - [Github](https://github.com/bigcode-project/bigcodebench) [Dataset](https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06) [LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- Measuring Coding Challenge Competence With APPS
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages - [Github](https://github.com/zorazrw/multilingual-conala) [Dataset](https://huggingface.co/datasets/neulab/mconala)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- Execution-Based Evaluation for Open-Domain Code Generation
- BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion - [Github](https://github.com/amazon-science/cceval) [Dataset](https://github.com/amazon-science/cceval/tree/main/data)
- MT-Bench: How Good are LLMs at Multi-turn Question Answering - bench-101)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code - [Github](https://github.com/gersteinlab/ML-bench) [Dataset](https://huggingface.co/datasets/super-dainiu/ml-bench) [Website](https://ml-bench.github.io/)
- PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by LLMs
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models - [Github](https://github.com/alphadl/OOP-eval) [Dataset](https://huggingface.co/datasets/codeai-dteam/oop)
- A Static Evaluation of Code Completion by Large Language Models
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
- ICE-Score: Instructing Large Language Models to Evaluate Code - score)
- Exploring Language Model's Code Generation Ability with Auxiliary Functions
- R2E: Turning Any GitHub Repository into a Programming Agent Test Environment - project/r2e)
- EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
- A Performance Study of LLM-Generated Code on Leetcode
- Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing
- Competition-Level Code Generation with AlphaCode - [Github](https://github.com/google-deepmind/code_contests) [Dataset](https://github.com/google-deepmind/code_contests)
- LLM4Decompile: Decompiling Binary Code with Large Language Models - ghidra-100k)
- Enhancing Repository-Level Code Generation with Integrated Contextual Information
- AICoderEval: Improving AI Domain Code Generation of Large Language Models
- CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
- Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
- ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
- StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code - [Github](https://github.com/Wellesley-EASEL-lab/StudentEval) [Dataset](https://huggingface.co/datasets/wellesley-easel/StudentEval)
- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
- A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs
- CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts
- IFEvalCode: Controlled Code Generation
- HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation - [Github](https://github.com/CodeEval-Pro/CodeEval-Pro) [Dataset](https://huggingface.co/CodeEval-Pro) [Website](https://answers111.github.io/evalpro.github.io/index.html) [LeaderBoard](https://answers111.github.io/evalpro.github.io/leaderboard.html/)
Code Efficiency
- Evaluating Language Models for Efficient Code Generation
- EffiBench: Benchmarking the Efficiency of Automatically Generated Code
- Mercury: A Code Efficiency Benchmark for Code Large Language Models
- ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
- Learning Performance-Improving Code Edits
- How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark - [Github](https://github.com/q-rz/enamel) [Dataset](https://huggingface.co/datasets/q-rz/enamel)
- Improving Assembly Code Performance with Large Language Models via Reinforcement Learning
CodeFix & Bug-Fix
- OctoPack: Instruction Tuning Code Large Language Models - [Github](https://github.com/bigcode-project/octopack) [Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents - [Github](https://github.com/logic-star-ai/SWT-Bench) [Website](https://swtbench.com)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - [Github](https://github.com/swe-bench/SWE-bench) [Website](https://www.swebench.com)
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? - [Github](https://github.com/swe-bench/SWE-bench) [Dataset](https://www.swebench.com/multimodal) [Website](https://www.swebench.com/multimodal)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- Large Language Models of Code Fail at Completing Code with Potential Bugs - [Github](https://github.com/amazon-science/buggy-code-completion) [Dataset](https://github.com/amazon-science/buggy-code-completion)
- GitBug-Java: A Reproducible Benchmark of Recent Java Bugs - [Github](https://github.com/gitbugactions/gitbug-java) [Dataset](https://huggingface.co/datasets/gitbugactions/gitbug-java) [Website](https://nuno.saavedra.pt/gitbug-java#!/)
- GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions
- When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?
- RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing
- DebugBench: Evaluating Debugging Capability of Large Language Models
- Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
- Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations - debugging-benchmark)
- Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code - Gym/Project-Coffee-Gym)
- INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing
- Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models - weishi/Meta-APR)
- ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs
- FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
- COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
- CVE-Bench: Benchmarking LLM-based Software Engineering Agent's Ability to Repair Real-World CVE Vulnerabilities
Code Hallucination
- Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
- Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code - asset/collu-bench)
- THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge - [Github](https://github.com/Leah-Ljx/think) [Dataset](https://github.com/Leah-LJX/THINK/tree/main/benchmark)
Code Reasoning & Understanding
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution - eval.github.io/leaderboard.html)
- CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs - [Github](https://github.com/FSoft-AI4Code/CodeMMLU) [Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) [Website](https://fsoft-ai4code.github.io/codemmlu/) [LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/)
- GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
- CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? - [Github](https://github.com/CodeLLM-Research/CodeJudge-Eval)
- How Effectively Do Code Language Models Understand Poor-Readability Code? - [Github](https://github.com/ythere-y/PoorCodeSumEval) [Dataset](https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text)
- A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation - scope-benchmark/) [Dataset](https://huggingface.co/datasets/WeixiangYan/CodeScope)
- CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning - [Github](https://github.com/codesense-bench/codesense-codes) [Dataset](https://huggingface.co/datasets/codesense-bench/codesense/tree/main) [LeaderBoard](https://codesense-bench.github.io/leaderboard.html)
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests - AIBOX/ICPC-Eval)
- LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
Code Security & Robustness
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents - [Github](https://github.com/AI-secure/RedCode) [Website](https://redcode-agent.github.io) [LeaderBoard](https://redcode-agent.github.io/#leaderboard)
- ReCode: Robustness Evaluation of Code Generation Models - [Github](https://github.com/amazon-science/recode) [Dataset](https://github.com/amazon-science/recode/tree/main/dataset-release)
- COCO: Testing Code Generation Systems via Concretized Instructions - [Github](https://github.com/coco-2023/COCO)
- CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation - TURC 2024 - [Github](https://github.com/Dizzy-K/CodeWMBench)
- RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code - [Github](https://github.com/qing-yuan233/RMCBench) [Dataset](https://huggingface.co/datasets/zhongqy/RMCBench)
- Benchmarking the Security Aspect of Large Language Model-Based Code Generation
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models - [Github](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks) [Dataset](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks)
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities - sast/iris)
- CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity - [Github](https://github.com/CS-EVAL/CS-Eval) [Dataset](https://huggingface.co/datasets/cseval/cs-eval)
- SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
- SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
Code Translation
- Unsupervised Translation of Programming Languages
- AVATAR: A Parallel Corpus for Java-Python Program Translation
- On the Evaluation of Neural Code Translation: Taxonomy and Benchmark - [Github](https://github.com/PolyEval/G-TransEval) [Dataset](https://github.com/polyeval/g-transeval/tree/main/G-TransEval)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
Programming Languages
Sub Categories
- Code Completion & Code Generation: 43
- MultiModal Code Tasks: 25
- Repository & Agentic Software Engineering: 25
- CodeFix & Bug-Fix: 21
- Text2SQL: 19
- Industry Code Generation: 15
- Multi & Other Dimension: 15
- Code Translation: 11
- Code Security & Robustness: 11
- Code Reasoning & Understanding: 10
- Data science: 8
- Code Version: 8
- Code Efficiency: 7
- Code Hallucination: 4
- MultiModal Code Generation: 3
- Security Code Generation & Test Generation: 2
- Code Understanding, Search & Review: 2
- Program Repair, Testing & Debugging: 2
- Code Generation & Completion: 1
- Performance Optimization: 1