awesome-code-benchmark

A comprehensive review of code-domain benchmarks for LLM research.
https://github.com/tongye98/awesome-code-benchmark

🚀 Top Code Benchmark
- CodeFix & Bug-Fix
 - CVE-Bench: Benchmarking LLM-based Software Engineering Agent’s Ability to Repair Real-World CVE Vulnerabilities
 - COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
 - SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents - [![Stars](https://img.shields.io/github/stars/logic-star-ai/SWT-Bench?style=social&label=Stars)](https://github.com/logic-star-ai/SWT-Bench) | [🌐Website](https://swtbench.com)
 - SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? - [![Stars](https://img.shields.io/github/stars/swe-bench/SWE-bench?style=social&label=Stars)](https://github.com/swe-bench/SWE-bench) | [🤗Dataset](https://www.swebench.com/multimodal) [🌐Website](https://www.swebench.com/multimodal)
 - GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions
 - Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code - Gym/Project-Coffee-Gym) |
 - INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing
 - Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models - weishi/Meta-APR) | |
 - ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs
 - RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing
 - DebugBench: Evaluating Debugging Capability of Large Language Models
 - Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
 - Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations - debugging-benchmark) | |
 - Large Language Models of Code Fail at Completing Code with Potential Bugs - [![Stars](https://img.shields.io/github/stars/amazon-science/buggy-code-completion?style=social&label=Stars)](https://github.com/amazon-science/buggy-code-completion) | [Dataset](https://github.com/amazon-science/buggy-code-completion)
 - OctoPack: Instruction Tuning Code Large Language Models - [![Stars](https://img.shields.io/github/stars/bigcode-project/octopack?style=social&label=Stars)](https://github.com/bigcode-project/octopack) | [🤗Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
 - SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - [![Stars](https://img.shields.io/github/stars/swe-bench/SWE-bench?style=social&label=Stars)](https://github.com/swe-bench/SWE-bench) | [🌐Website](https://www.swebench.com)
 - GitBug-Java: A Reproducible Benchmark of Recent Java Bugs - [![Stars](https://img.shields.io/github/stars/gitbugactions/gitbug-java?style=social&label=Stars)](https://github.com/gitbugactions/gitbug-java) | [🤗Dataset](https://huggingface.co/datasets/gitbugactions/gitbug-java) [🌐Website](https://nuno.saavedra.pt/gitbug-java#!/)
 - When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?
 - FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
- MultiModal Code Tasks
 - LLM Code Customization with Visual Results: A Benchmark on TikZ
 - Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
 - ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
 - MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
 - BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks - Bench) [🌐Website](https://bigdocs.github.io/) |
 - Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
 - Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code - [![Stars](https://img.shields.io/github/stars/JetBrains-Research/PandasPlotBench?style=social&label=Stars)](https://github.com/JetBrains-Research/PandasPlotBench) | [🤗Dataset](https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench)
 - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs - [![Stars](https://img.shields.io/github/stars/MBZUAI-LLM/Web2code?style=social&label=Stars)](https://github.com/MBZUAI-LLM/Web2code) | [🤗Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code) [🌐Website](https://mbzuai-llm.github.io/webpage2code/)
 - VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
 - SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities - [![Stars](https://img.shields.io/github/stars/mti-lab/SVGEditBench?style=social&label=Stars)](https://github.com/mti-lab/SVGEditBench) | [🤗Dataset](https://github.com/mti-lab/SVGEditBench)
 - HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks - [![Stars](https://img.shields.io/github/stars/HumanEval-V/HumanEval-V-Benchmark?style=social&label=Stars)](https://github.com/HumanEval-V/HumanEval-V-Benchmark) | [🌐Website](https://humaneval-v.github.io/) [📊LeaderBoard](https://humaneval-v.github.io/#leaderboard) [🤗Dataset](https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark)
 - WAFFLE: Multi-Modal Model for Automated Front-End Development - [![Stars](https://img.shields.io/github/stars/lt-asset/Waffle?style=social&label=Stars)](https://github.com/lt-asset/Waffle) | [🤗Dataset](https://github.com/lt-asset/Waffle/tree/master/WebSight-Test)
 - Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
 - Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping - ov-file#Dataset-Download) [📊LeaderBoard](https://github.com/WebPAI/Interaction2Code?tab=readme-ov-file#Leaderboard) |
 - ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
 - MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs
 - Image2Struct: Benchmarking Structure Extraction for Vision-Language Models - [![Stars](https://img.shields.io/github/stars/stanford-crfm/helm?style=social&label=Stars)](https://github.com/stanford-crfm/helm) | [🌐Website](https://crfm.stanford.edu/helm/image2struct/latest/) [🤗Dataset](https://huggingface.co/datasets/stanford-crfm/i2s-latex)
 - WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs - codes/naturalcc/tree/main/examples/webcode2m) | [🌐Website](https://webcode2m.github.io/) [🤗Dataset](https://huggingface.co/datasets/xcodemind/webcode2m) |
 - Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering - NLP/Design2Code-hf) |
 - From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing - 67c5c0935149cdc6e0230b46) |
 - StarVector: Generating Scalable Vector Graphics Code from Images and Text - [![Stars](https://img.shields.io/github/stars/joanrod/star-vector?style=social&label=Stars)](https://github.com/joanrod/star-vector) | [🌐Website](https://starvector.github.io/#:~:text=StarVector) [🤗Dataset](https://huggingface.co/collections/starvector/starvector-svg-datasets-svg-bench-67811204a76475be4dd66d09)
 - Empowering LLMs to Understand and Generate Complex Vector Graphics
 - ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation - 160k) |
 - Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
 - Advancing vision-language models in front-end development via data synthesis - Code-VLM/Flame-Code-VLM) | [🤗Dataset](https://github.com/Flame-Code-VLM/Flame-Code-VLM?tab=readme-ov-file#dataset) |
- Code Translation
 - CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation - [![Stars](https://img.shields.io/github/stars/anirudhkhatry/CRUST-bench?style=social&label=Stars)](https://github.com/anirudhkhatry/CRUST-bench) | [Dataset](https://github.com/anirudhkhatry/CRUST-bench)
 - Unsupervised Translation of Programming Languages
 - Repository-level Code Translation Benchmark Targeting Rust
 - XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval - NLP-sg/xCodeEval) |
 - Escalating LLM-based Code Translation Benchmarking into the Class-level Era - [Github](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset) [![Stars](https://img.shields.io/github/stars/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset?style=social&label=Stars)](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset) | [🤗Dataset](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset/tree/main/ClassEval_T)
 - AVATAR: A Parallel Corpus for Java-Python Program Translation
 - On the Evaluation of Neural Code Translation: Taxonomy and Benchmark - [![Stars](https://img.shields.io/github/stars/PolyEval/G-TransEval?style=social&label=Stars)](https://github.com/PolyEval/G-TransEval) | [🤗Dataset](https://github.com/polyeval/g-transeval/tree/main/G-TransEval)
 - CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
 - Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? - [![Stars](https://img.shields.io/github/stars/q4x3/poly-humaneval?style=social&label=Stars)](https://github.com/q4x3/poly-humaneval) | [🤗Dataset](https://github.com/q4x3/poly-humaneval/tree/main/benchmark)
 - Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
 - Enhancing LLMs in Long Code Translation through Instrumentation and Program State Alignment
- Code Completion & Code Generation
 - LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
 - A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs
 - Evaluating Large Language Models Trained on Code - [![Stars](https://img.shields.io/github/stars/openai/human-eval?style=social&label=Stars)](https://github.com/openai/human-eval) | [🤗Dataset](https://huggingface.co/datasets/openai/openai_humaneval)
 - Program Synthesis with Large Language Models - research/google-research/tree/master/mbpp) [![Stars](https://img.shields.io/github/stars/google-research/google-research?style=social&label=Stars)](https://github.com/google-research/google-research) | [🤗Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp) |
 - MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - [![Stars](https://img.shields.io/github/stars/nuprl/MultiPL-E?style=social&label=Stars)](https://github.com/nuprl/MultiPL-E) | [🤗Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E)
 - Python Code Generation by Asking Clarification Questions - b-L10vNpk7Onyft9BXK8GlMIGl52q/view?usp=sharing) |
 - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions - [![Stars](https://img.shields.io/github/stars/bigcode-project/bigcodebench?style=social&label=Stars)](https://github.com/bigcode-project/bigcodebench) | [🤗Dataset](https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06) [📊LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)
 - DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
 - Measuring Coding Challenge Competence With APPS
 - DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
 - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages - [![Stars](https://img.shields.io/github/stars/zorazrw/multilingual-conala?style=social&label=Stars)](https://github.com/zorazrw/multilingual-conala) | [🤗Dataset](https://huggingface.co/datasets/neulab/mconala)
 - LongCoder: A Long-Range Pre-trained Language Model for Code Completion
 - RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
 - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
 - Execution-Based Evaluation for Open-Domain Code Generation
 - R2E: Turning Any GitHub Repository into a Programming Agent Test Environment - project/r2e) | |
 - BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
 - CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion - [![Stars](https://img.shields.io/github/stars/amazon-science/cceval?style=social&label=Stars)](https://github.com/amazon-science/cceval) | [Dataset](https://github.com/amazon-science/cceval/tree/main/data)
 - MT-Bench: How Good are LLMs at Multi-turn Question Answering - bench-101) | |
 - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code - [![Stars](https://img.shields.io/github/stars/gersteinlab/ML-bench?style=social&label=Stars)](https://github.com/gersteinlab/ML-bench) | [🤗Dataset](https://huggingface.co/datasets/super-dainiu/ml-bench) [🌐Website](https://ml-bench.github.io/)
 - PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by LLMs
 - CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
 - A Static Evaluation of Code Completion by Large Language Models
 - ICE-Score: Instructing Large Language Models to Evaluate Code - score) | |
 - Exploring Language Model's Code Generation Ability with Auxiliary Functions
 - Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing
 - Competition-Level Code Generation with AlphaCode - [![Stars](https://img.shields.io/github/stars/google-deepmind/code_contests?style=social&label=Stars)](https://github.com/google-deepmind/code_contests) | [Dataset](https://github.com/google-deepmind/code_contests)
 - LLM4Decompile: Decompiling Binary Code with Large Language Models - ghidra-100k)|
 - Enhancing Repository-Level Code Generation with Integrated Contextual Information
 - AICoderEval: Improving AI Domain Code Generation of Large Language Models
 - CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
 - Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
 - ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
 - A Performance Study of LLM-Generated Code on Leetcode
 - CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
 - Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
 - StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code - [![Stars](https://img.shields.io/github/stars/Wellesley-EASEL-lab/StudentEval?style=social&label=Stars)](https://github.com/Wellesley-EASEL-lab/StudentEval) | [🤗Dataset](https://huggingface.co/datasets/wellesley-easel/StudentEval)
 - OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models - [![Stars](https://img.shields.io/github/stars/alphadl/OOP-eval?style=social&label=Stars)](https://github.com/alphadl/OOP-eval) | [🤗Dataset](https://huggingface.co/datasets/codeai-dteam/oop)
 - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
 - EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
- Multi & Other Dimension
 - LiCoEval: Evaluating LLMs on License Compliance in Code Generation - [![Stars](https://img.shields.io/github/stars/osslab-pku/LiCoEval?style=social&label=Stars)](https://github.com/osslab-pku/LiCoEval) | [Dataset](https://figshare.com/s/362cb2bdaed764831166)
 - OctoPack: Instruction Tuning Code Large Language Models - [![Stars](https://img.shields.io/github/stars/bigcode-project/octopack?style=social&label=Stars)](https://github.com/bigcode-project/octopack) | [🤗Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
 - Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
 - LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
 - Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning - [![Stars](https://img.shields.io/github/stars/going-doer/Paper2Code?style=social&label=Stars)](https://github.com/going-doer/Paper2Code) | [🤗Dataset](https://huggingface.co/datasets/iaminju/paper2code)
 - CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
 - RepoFusion: Training Code Models to Understand Your Repository
 - Improving Natural Language Capability of Code Large Language Model
 - CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
 - Exploring Multi-Lingual Bias of Large Code Models in Code Generation
 - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
 - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
 - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models - [![Stars](https://img.shields.io/github/stars/infi-coder/infibench-evaluation-harness?style=social&label=Stars)](https://github.com/infi-coder/infibench-evaluation-harness) | [🌐Website](https://infi-coder.github.io/infibench/)
 - Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation
 - Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM - [![Stars](https://img.shields.io/github/stars/evo-eval/evoeval?style=social&label=Stars)](https://github.com/evo-eval/evoeval)
 - AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation - [![Stars](https://img.shields.io/github/stars/achieve-lab/assertion_data_for_LLM?style=social&label=Stars)](https://github.com/achieve-lab/assertion_data_for_LLM)
 - Evaluating Large Language Models with Runtime Behavior of Program Execution - [![Stars](https://img.shields.io/github/stars/r-eval/r-eval.github.io?style=social&label=Stars)](https://github.com/r-eval/r-eval.github.io) | [📊LeaderBoard](https://r-eval.github.io/)
 - SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents - [![Stars](https://img.shields.io/github/stars/amazon-science/SWE-PolyBench?style=social&label=Stars)](https://github.com/amazon-science/SWE-PolyBench) | [🌐Website](https://amazon-science.github.io/SWE-PolyBench/) [🤗Dataset](https://huggingface.co/datasets/AmazonScience/SWE-PolyBench)
 - CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation - scope-benchmark/) [🤗Dataset](https://huggingface.co/datasets/WeixiangYan/CodeScope) |
- Industry Code Generation
 - OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation - [![Stars](https://img.shields.io/github/stars/hkust-zhiyao/RTL-Coder?style=social&label=Stars)](https://github.com/hkust-zhiyao/RTL-Coder) | [🤗Dataset](https://github.com/hkust-zhiyao/RTL-Coder/tree/main/dataset)
 - MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation - [![Stars](https://img.shields.io/github/stars/GATECH-EIC/mg-verilog?style=social&label=Stars)](https://github.com/GATECH-EIC/mg-verilog)
 - RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - [![Stars](https://img.shields.io/github/stars/AUCOHL/RTL-Repo?style=social&label=Stars)](https://github.com/AUCOHL/RTL-Repo) | [🤗Dataset](https://huggingface.co/datasets/ahmedallam/RTL-Repo)
 - ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code
 - Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
 - VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation
 - Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
 - Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
 - VerilogEval: Evaluating Large Language Models for Verilog Code Generation - [![Stars](https://img.shields.io/github/stars/NVlabs/verilog-eval?style=social&label=Stars)](https://github.com/NVlabs/verilog-eval) | [🤗Dataset](https://github.com/NVlabs/verilog-eval/tree/main/dataset_code-complete-iccad2023)
 - Benchmarking Large Language Models for Automated Verilog RTL Code Generation - [![Stars](https://img.shields.io/github/stars/shailja-thakur/vgen?style=social&label=Stars)](https://github.com/shailja-thakur/vgen) | [🤗Dataset](https://github.com/shailja-thakur/VGen/tree/main/prompts-and-testbenches)
 - RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model - [![Stars](https://img.shields.io/github/stars/hkust-zhiyao/rtllm?style=social&label=Stars)](https://github.com/hkust-zhiyao/rtllm) | [🤗Dataset](https://github.com/hkust-zhiyao/rtllm)
 - LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems
 - Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents - [![Stars](https://img.shields.io/github/stars/Luoji-zju/Agents4PLC_release?style=social&label=Stars)](https://github.com/Luoji-zju/Agents4PLC_release) | [🤗Dataset](https://github.com/Luoji-zju/Agents4PLC_release/tree/master/benchmark)
 - A Multi-Agent Framework for Extensible Structured Text Generation in PLCs
 - MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs - [![Stars](https://img.shields.io/github/stars/scale-lab/MetRex?style=social&label=Stars)](https://github.com/scale-lab/MetRex) | [🤗Dataset](https://huggingface.co/datasets/scale-lab/MetRex)
- Code Efficiency
 - Evaluating Language Models for Efficient Code Generation
 - EffiBench: Benchmarking the Efficiency of Automatically Generated Code
 - Mercury: A Code Efficiency Benchmark for Code Large Language Models
 - ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
 - Learning Performance-Improving Code Edits
 - How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark - [![Stars](https://img.shields.io/github/stars/q-rz/enamel?style=social&label=Stars)](https://github.com/q-rz/enamel) | [🤗Dataset](https://huggingface.co/datasets/q-rz/enamel)
- Code Reasoning & Understanding
 - CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution - eval.github.io/leaderboard.html) |
 - How Effectively Do Code Language Models Understand Poor-Readability Code? - [![Stars](https://img.shields.io/github/stars/ythere-y/PoorCodeSumEval?style=social&label=Stars)](https://github.com/ythere-y/PoorCodeSumEval) | [🤗Dataset](https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text)
 - A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools
 - GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
 - CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? - [![Stars](https://img.shields.io/github/stars/CodeLLM-Research/CodeJudge-Eval?style=social&label=Stars)](https://github.com/CodeLLM-Research/CodeJudge-Eval)
 - CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs - [![Stars](https://img.shields.io/github/stars/FSoft-AI4Code/CodeMMLU?style=social&label=Stars)](https://github.com/FSoft-AI4Code/CodeMMLU) | [🤗Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) [🌐Website](https://fsoft-ai4code.github.io/codemmlu/) [📊LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/)
- Data Science
 - DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models - [![Stars](https://img.shields.io/github/stars/yiyihum/da-code?style=social&label=Stars)](https://github.com/yiyihum/da-code) | [🤗Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) [🌐Website](https://da-code-bench.github.io)
 - Evaluation of Code LLMs on Geospatial Code Generation - ai/geospatial-code-llms-dataset) | |
 - SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
 - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization
 - DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation - [![Stars](https://img.shields.io/github/stars/xlang-ai/DS-1000?style=social&label=Stars)](https://github.com/xlang-ai/DS-1000) | [🤗Dataset](https://huggingface.co/datasets/xlangai/DS-1000) [🌐HomePage](https://ds1000-code-gen.github.io)
 - Natural Language to Code Generation in Interactive Data Science Notebooks - [![Stars](https://img.shields.io/github/stars/google-research/arcade-nl2code?style=social&label=Stars)](https://github.com/google-research/arcade-nl2code) | [Dataset](https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset)
 - DataSciBench: An LLM Agent Benchmark for Data Science
 - DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
- Text2SQL
 - Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task - lily.github.io/spider) |
 - Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows - [![Stars](https://img.shields.io/github/stars/xlang-ai/Spider2?style=social&label=Stars)](https://github.com/xlang-ai/Spider2) | [🌐Website](https://spider2-sql.github.io)
 - Structure-Grounded Pretraining for Text-to-SQL
 - Overview of the EHRSQL 2024 Shared Task on Reliable Text-to-SQL Modeling on Electronic Health Records - 2024) | |
 - Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization - [![Stars](https://img.shields.io/github/stars/ygan/Spider-DK?style=social&label=Stars)](https://github.com/ygan/Spider-DK)
 - ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems
 - FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis
 - A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases - NDA 24 | [Github](https://github.com/datadotworld/cwd-benchmark-data) | |
 - SParC: Cross-Domain Semantic Parsing in Context - lily.github.io/sparc) |
 - CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases - lily.github.io/cosql) |
 - Towards Robustness of Text-to-SQL Models against Synonym Substitution - [![Stars](https://img.shields.io/github/stars/ygan/Spider-Syn?style=social&label=Stars)](https://github.com/ygan/Spider-Syn)
 - Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs - ConvAI/tree/main/bird) [![Stars](https://img.shields.io/github/stars/AlibabaResearch/DAMO-ConvAI?style=social&label=Stars)](https://github.com/AlibabaResearch/DAMO-ConvAI) | [🌐Website](https://bird-bench.github.io/) |
 - Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness - robustness-text-to-sql)[![Stars](https://img.shields.io/github/stars/awslabs/diagnostic-robustness-text-to-sql?style=social&label=Stars)](https://github.com/awslabs/diagnostic-robustness-text-to-sql) | |
 - BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain - Lab/BookSQL) [![Stars](https://img.shields.io/github/stars/Exploration-Lab/BookSQL?style=social&label=Stars)](https://github.com/Exploration-Lab/BookSQL)| [Dataset](https://github.com/Exploration-Lab/BookSQL) |
 - Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning - bench/) |
 - SecureSQL: Evaluating Data Leakage of Large Language Models as Natural Language Interfaces to Databases
 - SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference
 - Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text - icl)[![Stars](https://img.shields.io/github/stars/aliwister/ast-icl?style=social&label=Stars)](https://github.com/aliwister/ast-icl)| [Dataset](https://github.com/aliwister/ast-icl) |
 - Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation
- Code Version
 - Automatically Recommend Code Updates: Are We There Yet? - CodeUpdateEval) [![Stars](https://img.shields.io/github/stars/yueyueL/CodeLM-CodeUpdateEval?style=social&label=Stars)](https://github.com/yueyueL/CodeLM-CodeUpdateEval) | [🤗Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset) |
 - VersiCode: Towards Version-controllable Code Generation
 - GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
 - LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion - wangchong/LLM-Deprecated-API) [![Stars](https://img.shields.io/github/stars/cs-wangchong/LLM-Deprecated-API?style=social&label=Stars)](https://github.com/cs-wangchong/LLM-Deprecated-API) | [🤗Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2) |
 - CodeUpdateArena: Benchmarking Knowledge Editing on API Updates - liuzy/CodeUpdateArena) [![Stars](https://img.shields.io/github/stars/leo-liuzy/CodeUpdateArena?style=social&label=Stars)](https://github.com/leo-liuzy/CodeUpdateArena) | [🤗Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data) |
 - LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation
 - On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions - generalization/java-generalization-replication)[![Stars](https://img.shields.io/github/stars/java-generalization/java-generalization-replication?style=social&label=Stars)](https://github.com/java-generalization/java-generalization-replication) | [🤗Dataset](https://zenodo.org/records/10057237) |
 - RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
- Multi-Dimension
 - Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
 - LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- Code Security & Robustness
 - ReCode: Robustness Evaluation of Code Generation Models - science/recode) [![Stars](https://img.shields.io/github/stars/amazon-science/recode?style=social&label=Stars)](https://github.com/amazon-science/recode) | [Dataset](https://github.com/amazon-science/recode/tree/main/dataset-release) |
 - COCO: Testing Code Generation Systems via Concretized Instructions - 2023/COCO) [![Stars](https://img.shields.io/github/stars/coco-2023/COCO?style=social&label=Stars)](https://github.com/coco-2023/COCO) | |
 - RedCode: Risky Code Execution and Generation Benchmark for Code Agents - secure/RedCode) [![Stars](https://img.shields.io/github/stars/AI-secure/RedCode?style=social&label=Stars)](https://github.com/AI-secure/RedCode) | [🌐Website](https://redcode-agent.github.io) [📊LeaderBoard](https://redcode-agent.github.io/#leaderboard) |
 - CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation - TURC 2024 | [Github](https://github.com/Dizzy-K/CodeWMBench) [![Stars](https://img.shields.io/github/stars/Dizzy-K/CodeWMBench?style=social&label=Stars)](https://github.com/Dizzy-K/CodeWMBench) |
 - RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code - yuan233/RMCBench) [![Stars](https://img.shields.io/github/stars/qing-yuan233/RMCBench?style=social&label=Stars)](https://github.com/qing-yuan233/RMCBench)| [🤗Dataset](https://huggingface.co/datasets/zhongqy/RMCBench) |
 - Benchmarking the Security Aspect of Large Language Model-Based Code Generation
 - IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities - sast/iris) | |
 - CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models - llama/PurpleLlama/tree/main/CybersecurityBenchmarks) | [Dataset](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks)|
 - CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity - EVAL/CS-Eval) [![Stars](https://img.shields.io/github/stars/CS-EVAL/CS-Eval?style=social&label=Stars)](https://github.com/CS-EVAL/CS-Eval)| [🤗Dataset](https://huggingface.co/datasets/cseval/cs-eval) |
 - SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
 - aiXamine: Simplified LLM Safety and Security
- Code Hallucination
- MultiModal Code Generation
 - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs - LLM/Web2code) | [🤗Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code) [🌐Website](https://mbzuai-llm.github.io/webpage2code/) |
 - Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code - Research/PandasPlotBench) | [🤗Dataset](https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench) |
 - AutoPresent: Designing Structured Visuals from Scratch - 01 | [Github](https://github.com/para-lost/AutoPresent) | [🤗Dataset](https://github.com/para-lost/AutoPresent/tree/main/slidesbench) |
- Security Code Generation & Test Generation
 - Tests4Py: A Benchmark for System Testing
 - LLM Security Guard for Code
 - RedCode: Risky Code Execution and Generation Benchmark for Code Agents - secure/RedCode) | [🌐Website](https://redcode-agent.github.io) [📊LeaderBoard](https://redcode-agent.github.io/#leaderboard) |
