# awesome-code-benchmark

A comprehensive review of code-domain benchmarks for LLM research.
https://github.com/tongye98/awesome-code-benchmark

## Top Code Benchmark

### Code Completion & Code Generation
- Evaluating Large Language Models Trained on Code | [Github](https://github.com/openai/human-eval) | [Dataset](https://huggingface.co/datasets/openai/openai_humaneval)
- Program Synthesis with Large Language Models | [Github](https://github.com/google-research/google-research/tree/master/mbpp) | [Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp)
- MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation | [Github](https://github.com/nuprl/MultiPL-E) | [Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E)
- Python Code Generation by Asking Clarification Questions
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | [Github](https://github.com/bigcode-project/bigcodebench) | [Dataset](https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06) | [LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- Measuring Coding Challenge Competence With APPS
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages | [Github](https://github.com/zorazrw/multilingual-conala) | [Dataset](https://huggingface.co/datasets/neulab/mconala)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- Execution-Based Evaluation for Open-Domain Code Generation
- R2E: Turning Any GitHub Repository into a Programming Agent Test Environment
- BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | [Github](https://github.com/amazon-science/cceval) | [Dataset](https://github.com/amazon-science/cceval/tree/main/data)
- MT-Bench: How Good are LLMs at Multi-turn Question Answering
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code | [Github](https://github.com/gersteinlab/ML-bench) | [Dataset](https://huggingface.co/datasets/super-dainiu/ml-bench) | [Website](https://ml-bench.github.io/)
- PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by LLMs
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
- A Static Evaluation of Code Completion by Large Language Models
- ICE-Score: Instructing Large Language Models to Evaluate Code
- Exploring Language Model's Code Generation Ability with Auxiliary Functions
- Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing
- Competition-Level Code Generation with AlphaCode | [Github](https://github.com/google-deepmind/code_contests) | [Dataset](https://github.com/google-deepmind/code_contests)
- LLM4Decompile: Decompiling Binary Code with Large Language Models
- Enhancing Repository-Level Code Generation with Integrated Contextual Information
- AICoderEval: Improving AI Domain Code Generation of Large Language Models
- CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
- Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
- ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
- A Performance Study of LLM-Generated Code on Leetcode
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
- StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code | [Github](https://github.com/Wellesley-EASEL-lab/StudentEval) | [Dataset](https://huggingface.co/datasets/wellesley-easel/StudentEval)
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | [Github](https://github.com/alphadl/OOP-eval) | [Dataset](https://huggingface.co/datasets/codeai-dteam/oop)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
- EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
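
Most of the completion and generation benchmarks above (HumanEval, MBPP, MultiPL-E, BigCodeBench) report execution-based pass@k. Below is a minimal sketch, assuming completion sampling and sandboxed execution happen elsewhere: the unbiased pass@k estimator from the HumanEval paper, plus loading the HumanEval dataset linked above.

```python
# Minimal sketch: unbiased pass@k estimator (as defined in the HumanEval paper) and
# loading the openai_humaneval dataset linked above. Generating and safely executing
# model completions is assumed to happen in a separate, sandboxed harness.
from math import comb
from datasets import load_dataset

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c correct out of n."""
    if n - c < k:
        return 1.0  # too few failures to draw k all-failing samples
    return 1.0 - comb(n - c, k) / comb(n, k)

problems = load_dataset("openai/openai_humaneval", split="test")
print(problems[0]["task_id"], pass_at_k(n=20, c=3, k=10))
```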

### Code Efficiency
- Evaluating Language Models for Efficient Code Generation
- EffiBench: Benchmarking the Efficiency of Automatically Generated Code
- Mercury: A Code Efficiency Benchmark for Code Large Language Models
- ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
- Learning Performance-Improving Code Edits
- How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | [Github](https://github.com/q-rz/enamel) | [Dataset](https://huggingface.co/datasets/q-rz/enamel)
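
EffiBench, Mercury, ECCO, and ENAMEL above compare the runtime of functionally correct generated code against reference solutions. A rough, hypothetical timing harness (the callables and inputs are placeholders, not any benchmark's official scorer):

```python
# Hypothetical sketch: compare wall-clock time of a generated solution against a
# reference on the same input, after functional correctness has been verified.
import time
from statistics import median

def time_callable(fn, args, repeats: int = 5) -> float:
    """Median wall-clock seconds over several repeats, to damp measurement noise."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return median(samples)

def speedup(reference, candidate, args) -> float:
    """> 1.0 means the candidate runs faster than the reference."""
    return time_callable(reference, args) / time_callable(candidate, args)

print(speedup(sorted, lambda xs: sorted(xs), (list(range(100_000, 0, -1)),)))
```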

### Multi & Other Dimension
- OctoPack: Instruction Tuning Code Large Language Models | [Github](https://github.com/bigcode-project/octopack) | [Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
- Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- RepoFusion: Training Code Models to Understand Your Repository
- Improving Natural Language Capability of Code Large Language Model
- CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models | [Github](https://github.com/infi-coder/infibench-evaluation-harness) | [Website](https://infi-coder.github.io/infibench/)
- Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation
- Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM | [Github](https://github.com/evo-eval/evoeval)
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation | [Dataset](https://huggingface.co/datasets/WeixiangYan/CodeScope)
- AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation | [Github](https://github.com/achieve-lab/assertion_data_for_LLM)
- Evaluating Large Language Models with Runtime Behavior of Program Execution | [Github](https://github.com/r-eval/r-eval.github.io) | [LeaderBoard](https://r-eval.github.io/)
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents | [Github](https://github.com/amazon-science/SWE-PolyBench) | [Website](https://amazon-science.github.io/SWE-PolyBench/) | [Dataset](https://huggingface.co/datasets/AmazonScience/SWE-PolyBench)
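
LiveCodeBench above keeps evaluation contamination-free by scoring only problems released after a model's training cutoff. A small illustrative filter capturing that idea (the record fields are hypothetical, not LiveCodeBench's actual schema):

```python
# Illustrative sketch: keep only problems published after a model's training cutoff,
# the core idea behind time-windowed, contamination-free evaluation. The record
# fields below are hypothetical, not any benchmark's real schema.
from datetime import date

problems = [
    {"id": "p1", "released": date(2023, 5, 1)},
    {"id": "p2", "released": date(2024, 7, 15)},
]

def contamination_free(items, cutoff: date):
    return [p for p in items if p["released"] > cutoff]

print([p["id"] for p in contamination_free(problems, date(2024, 1, 1))])  # ['p2']
```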

### CodeFix & Bug-Fix
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents | [Github](https://github.com/logic-star-ai/SWT-Bench) | [Website](https://swtbench.com)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | [Github](https://github.com/swe-bench/SWE-bench) | [Website](https://www.swebench.com)
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | [Github](https://github.com/swe-bench/SWE-bench) | [Website](https://www.swebench.com/multimodal)
- GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions
- Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
- INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing
- Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models
- ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs
- RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing
- DebugBench: Evaluating Debugging Capability of Large Language Models
- Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
- Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations
- Large Language Models of Code Fail at Completing Code with Potential Bugs | [Github](https://github.com/amazon-science/buggy-code-completion) | [Dataset](https://github.com/amazon-science/buggy-code-completion)
- OctoPack: Instruction Tuning Code Large Language Models | [Github](https://github.com/bigcode-project/octopack) | [Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
- GitBug-Java: A Reproducible Benchmark of Recent Java Bugs | [Github](https://github.com/gitbugactions/gitbug-java) | [Dataset](https://huggingface.co/datasets/gitbugactions/gitbug-java) | [Website](https://nuno.saavedra.pt/gitbug-java#!/)
- When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?
- FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
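
SWE-bench-style repair benchmarks above score a model-generated patch by applying it to the repository and re-running the originally failing tests. A simplified sketch of that loop; the paths and test ids are placeholders, and the official harnesses add containerization plus fail-to-pass/pass-to-pass bookkeeping.

```python
# Simplified sketch of an execution-based repair check: apply a model patch to a
# checked-out repo and re-run the previously failing tests. Paths and test ids are
# placeholders; real harnesses (e.g. SWE-bench's) run this inside containers.
import subprocess

def patch_resolves_issue(repo_dir: str, patch_file: str, failing_tests: list[str]) -> bool:
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *failing_tests],
        cwd=repo_dir, capture_output=True, timeout=600,
    )
    return result.returncode == 0  # previously failing tests now pass

# Hypothetical usage:
# patch_resolves_issue("./repo", "model.patch", ["tests/test_io.py::test_header"])
```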

### Code Reasoning & Understanding
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | [LeaderBoard](https://crux-eval.github.io/leaderboard.html)
- CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs | [Github](https://github.com/FSoft-AI4Code/CodeMMLU) | [Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) | [Website](https://fsoft-ai4code.github.io/codemmlu/) | [LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/)
- How Effectively Do Code Language Models Understand Poor-Readability Code? | [Github](https://github.com/ythere-y/PoorCodeSumEval) | [Dataset](https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text)
- A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools
- GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
- CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? | [Github](https://github.com/CodeLLM-Research/CodeJudge-Eval)
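
CRUXEval and REval above probe whether a model can reason about execution, for example predicting a function's output on a given input. A toy illustration of checking such a prediction by actually running the code; real harnesses sandbox this step, since `exec` on untrusted code is unsafe otherwise.

```python
# Toy sketch of output-prediction scoring: execute the reference code on the given
# call and compare the result with the model's predicted literal value.
def check_output_prediction(code: str, call: str, predicted: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)                     # define the function in a scratch namespace
    actual = eval(call, namespace)            # e.g. "f([3, 1, 2, 2])"
    return repr(actual) == predicted.strip()  # answers are compared as literal values

code = "def f(xs):\n    return sorted(set(xs))"
print(check_output_prediction(code, "f([3, 1, 2, 2])", "[1, 2, 3]"))  # True
```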

### Data science
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | [Github](https://github.com/xlang-ai/DS-1000) | [Dataset](https://huggingface.co/datasets/xlangai/DS-1000) | [HomePage](https://ds1000-code-gen.github.io)
- DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | [Github](https://github.com/yiyihum/da-code) | [Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) | [Website](https://da-code-bench.github.io)
- Evaluation of Code LLMs on Geospatial Code Generation
- SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization
- Natural Language to Code Generation in Interactive Data Science Notebooks | [Github](https://github.com/google-research/arcade-nl2code) | [Dataset](https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset)
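
The DS-1000 problems are distributed as the Hugging Face dataset linked above. A quick way to peek at them without assuming a particular schema (this presumes the dataset loads with its default configuration):

```python
# Minimal sketch: pull the DS-1000 dataset linked above from the Hugging Face Hub and
# list its splits, sizes, and columns without hard-coding any field names.
from datasets import load_dataset

ds = load_dataset("xlangai/DS-1000")
for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```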

### Code Translation
- Unsupervised Translation of Programming Languages
- Repository-level Code Translation Benchmark Targeting Rust
- XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- Escalating LLM-based Code Translation Benchmarking into the Class-level Era | [Github](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset) | [Dataset](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset/tree/main/ClassEval_T)
- AVATAR: A Parallel Corpus for Java-Python Program Translation
- On the Evaluation of Neural Code Translation: Taxonomy and Benchmark | [Github](https://github.com/PolyEval/G-TransEval) | [Dataset](https://github.com/polyeval/g-transeval/tree/main/G-TransEval)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? | [Github](https://github.com/q4x3/poly-humaneval) | [Dataset](https://github.com/q4x3/poly-humaneval/tree/main/benchmark)
- Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
- Enhancing LLMs in Long Code Translation through Instrumentation and Program State Alignment
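
Execution-based translation benchmarks such as CodeTransOcean and G-TransEval typically compile and run the translated program, then compare its output on shared test inputs with the source program's output. A rough sketch for a Java target; the file layout and the presence of `javac`/`java` on PATH are assumptions.

```python
# Rough sketch of execution-based translation checking for a Java target: compile the
# translated class, run it on a test input, and compare stdout with the expected output
# produced by the source-language program. Assumes javac/java are installed on PATH.
import pathlib
import subprocess
import tempfile

def java_translation_matches(java_source: str, class_name: str,
                             stdin_data: str, expected_stdout: str) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / f"{class_name}.java"
        src.write_text(java_source)
        if subprocess.run(["javac", str(src)], capture_output=True).returncode != 0:
            return False  # the translated code does not compile
        run = subprocess.run(["java", "-cp", tmp, class_name],
                             input=stdin_data, capture_output=True, text=True, timeout=30)
        return run.stdout.strip() == expected_stdout.strip()
```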

### Text2SQL
- Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task | [Website](https://yale-lily.github.io/spider)
- Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | [Github](https://github.com/xlang-ai/Spider2) | [Website](https://spider2-sql.github.io)
- Structure-Grounded Pretraining for Text-to-SQL
- Overview of the EHRSQL 2024 Shared Task on Reliable Text-to-SQL Modeling on Electronic Health Records
- Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization | [Github](https://github.com/ygan/Spider-DK)
- ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems
- FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis
- A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases | [Github](https://github.com/datadotworld/cwd-benchmark-data)
- SParC: Cross-Domain Semantic Parsing in Context | [Website](https://yale-lily.github.io/sparc)
- CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases | [Website](https://yale-lily.github.io/cosql)
- Towards Robustness of Text-to-SQL Models Against Synonym Substitution | [Github](https://github.com/ygan/Spider-Syn)
- Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs | [Github](https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird) | [Website](https://bird-bench.github.io/)
- Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness | [Github](https://github.com/awslabs/diagnostic-robustness-text-to-sql)
- BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain | [Github](https://github.com/Exploration-Lab/BookSQL) | [Dataset](https://github.com/Exploration-Lab/BookSQL)
- Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning
- SecureSQL: Evaluating Data Leakage of Large Language Models as Natural Language Interfaces to Databases
- SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference
- Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text | [Github](https://github.com/aliwister/ast-icl) | [Dataset](https://github.com/aliwister/ast-icl)
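
Spider, BIRD, and most of the datasets above are scored with execution accuracy: the predicted SQL and the gold SQL are run against the same database and their result sets compared. A simplified SQLite version; the official evaluators additionally normalize values and treat ORDER BY semantics more carefully.

```python
# Simplified execution-accuracy check for text-to-SQL: run predicted and gold SQL on
# the same SQLite database and compare result multisets, ignoring row order.
import sqlite3
from collections import Counter

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # the predicted (or gold) query failed to execute
    finally:
        conn.close()
    return Counter(pred) == Counter(gold)

# Hypothetical usage:
# execution_match("concert_singer.sqlite", predicted_sql, gold_sql)
```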

### MultiModal Code Generation
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | [Github](https://github.com/MBZUAI-LLM/Web2code) | [Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code) | [Website](https://mbzuai-llm.github.io/webpage2code/)
- Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code | [Github](https://github.com/JetBrains-Research/PandasPlotBench) | [Dataset](https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench)
- AutoPresent: Designing Structured Visuals from Scratch | [Github](https://github.com/para-lost/AutoPresent) | [Dataset](https://github.com/para-lost/AutoPresent/tree/main/slidesbench)
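
Chart and plot benchmarks such as ChartMimic and PandasPlotBench render the generated plotting code and compare the resulting image with a reference figure. A bare-bones rendering step is sketched below; sandboxing and each benchmark's own image-similarity scoring are out of scope.

```python
# Bare-bones sketch: execute generated matplotlib code in a throwaway namespace and
# save whatever figure it produced, so it can later be compared against a reference
# image. exec() on untrusted code should only ever run inside a sandbox.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

def render_generated_plot(code: str, out_path: str) -> bool:
    try:
        exec(code, {"plt": plt})
        plt.savefig(out_path)
        return True
    except Exception:
        return False  # the generated code crashed before producing a figure
    finally:
        plt.close("all")

print(render_generated_plot("plt.plot([1, 2, 3], [2, 4, 9])\nplt.title('demo')", "candidate.png"))
```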

### Code Version
- Automatically Recommend Code Updates: Are We There Yet? | [Github](https://github.com/yueyueL/CodeLM-CodeUpdateEval) | [Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset)
- VersiCode: Towards Version-controllable Code Generation
- On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions | [Github](https://github.com/java-generalization/java-generalization-replication) | [Dataset](https://zenodo.org/records/10057237)
- GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
- LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion | [Github](https://github.com/cs-wangchong/LLM-Deprecated-API) | [Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2)
- CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | [Github](https://github.com/leo-liuzy/CodeUpdateArena) | [Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data)
- LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation
- RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
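
Version-aware benchmarks such as LLM-Deprecated-API and GitChameleon check whether generated code calls APIs that are deprecated or removed in the targeted library version. A toy AST-based scan; the deprecated-name list below is an illustrative placeholder, not any benchmark's actual inventory.

```python
# Toy sketch: walk the AST of generated code and flag calls whose fully spelled name
# appears in a deprecated-API list for the targeted library version. The set below is
# an illustrative placeholder (both names were removed in NumPy 2.0), not a benchmark's
# real inventory, and the exact-match rule ignores import aliasing.
import ast

DEPRECATED = {"np.asfarray", "np.alltrue"}

def deprecated_calls(source: str) -> list[str]:
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and ast.unparse(node.func) in DEPRECATED:
            hits.append(ast.unparse(node.func))
    return hits

print(deprecated_calls("import numpy as np\nx = np.asfarray([1, 2])"))  # ['np.asfarray']
```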

### Multi-Dimension

### Code Security & Robustness
- ReCode: Robustness Evaluation of Code Generation Models | [Github](https://github.com/amazon-science/recode) | [Dataset](https://github.com/amazon-science/recode/tree/main/dataset-release)
- COCO: Testing Code Generation Systems via Concretized Instructions | [Github](https://github.com/coco-2023/COCO)
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents | [Github](https://github.com/AI-secure/RedCode) | [Website](https://redcode-agent.github.io) | [LeaderBoard](https://redcode-agent.github.io/#leaderboard)
- CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation | [Github](https://github.com/Dizzy-K/CodeWMBench)
- RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | [Github](https://github.com/qing-yuan233/RMCBench) | [Dataset](https://huggingface.co/datasets/zhongqy/RMCBench)
- Benchmarking the Security Aspect of Large Language Model-Based Code Generation
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models | [Github](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks) | [Dataset](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks)
- CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity | [Github](https://github.com/CS-EVAL/CS-Eval) | [Dataset](https://huggingface.co/datasets/cseval/cs-eval)
- SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
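
ReCode-style robustness evaluation re-runs a generation benchmark on semantics-preserving perturbations of the prompt (renamed variables, paraphrased or noised docstrings) and reports how much the pass rate drops. One tiny example perturbation is sketched below; ReCode itself ships a much larger transformation suite.

```python
# Tiny sketch of one semantics-preserving prompt perturbation in the spirit of ReCode:
# inject extra whitespace into a natural-language docstring while leaving code intact.
import random

def perturb_docstring_whitespace(docstring: str, seed: int = 0) -> str:
    rng = random.Random(seed)  # fixed seed so perturbed prompts are reproducible
    words = docstring.split(" ")
    return " ".join(w + ("  " if rng.random() < 0.2 else "") for w in words)

original = '"""Return the sum of all even numbers in the list."""'
print(perturb_docstring_whitespace(original))
```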

### Code Hallucination

### MultiModal Code Tasks
- MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
- BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks | [Website](https://bigdocs.github.io/)
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
- Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code | [Github](https://github.com/JetBrains-Research/PandasPlotBench) | [Dataset](https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | [Github](https://github.com/MBZUAI-LLM/Web2code) | [Dataset](https://huggingface.co/datasets/MBZUAI/Web2Code) | [Website](https://mbzuai-llm.github.io/webpage2code/)
- VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
- SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities | [Github](https://github.com/mti-lab/SVGEditBench) | [Dataset](https://github.com/mti-lab/SVGEditBench)
- HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks | [Github](https://github.com/HumanEval-V/HumanEval-V-Benchmark) | [Website](https://humaneval-v.github.io/) | [LeaderBoard](https://humaneval-v.github.io/#leaderboard) | [Dataset](https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark)
- WAFFLE: Multi-Modal Model for Automated Front-End Development | [Github](https://github.com/lt-asset/Waffle) | [Dataset](https://github.com/lt-asset/Waffle/tree/master/WebSight-Test)
- Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
- Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | [Github](https://github.com/WebPAI/Interaction2Code) | [Dataset](https://github.com/WebPAI/Interaction2Code?tab=readme-ov-file#Dataset-Download) | [LeaderBoard](https://github.com/WebPAI/Interaction2Code?tab=readme-ov-file#Leaderboard)
- ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
- MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models | [Github](https://github.com/stanford-crfm/helm) | [Website](https://crfm.stanford.edu/helm/image2struct/latest/) | [Dataset](https://huggingface.co/datasets/stanford-crfm/i2s-latex)
- WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs | [Website](https://webcode2m.github.io/) | [Dataset](https://huggingface.co/datasets/xcodemind/webcode2m)
- Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
- From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
- StarVector: Generating Scalable Vector Graphics Code from Images and Text | [Github](https://github.com/joanrod/star-vector) | [Website](https://starvector.github.io/) | [Dataset](https://huggingface.co/collections/starvector/starvector-svg-datasets-svg-bench-67811204a76475be4dd66d09)
- Empowering LLMs to Understand and Generate Complex Vector Graphics
- ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation
- Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
- Advancing vision-language models in front-end development via data synthesis | [Github](https://github.com/Flame-Code-VLM/Flame-Code-VLM) | [Dataset](https://github.com/Flame-Code-VLM/Flame-Code-VLM?tab=readme-ov-file#dataset)
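
Design-to-code benchmarks above (Design2Code, WebCode2M, Interaction2Code) compare a screenshot of the rendered generated page against the reference design. Leaving rendering aside, a crude pixel-level similarity between two pre-captured screenshots might look like the sketch below; the published benchmarks rely on stronger visual metrics such as block-level and CLIP-style scores.

```python
# Crude sketch: pixel-level similarity between a screenshot of the rendered generated
# page and the reference design, both assumed to be pre-captured PNG files.
import numpy as np
from PIL import Image

def pixel_similarity(reference_png: str, candidate_png: str) -> float:
    ref = Image.open(reference_png).convert("RGB")
    cand = Image.open(candidate_png).convert("RGB").resize(ref.size)
    a = np.asarray(ref, dtype=np.float32) / 255.0
    b = np.asarray(cand, dtype=np.float32) / 255.0
    return 1.0 - float(np.abs(a - b).mean())  # 1.0 means pixel-identical images

# Hypothetical usage:
# print(pixel_similarity("reference.png", "candidate.png"))
```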

### Security Code Generation & Test Generation
- Tests4Py: A Benchmark for System Testing
- LLM Security Guard for Code
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents | [Github](https://github.com/AI-secure/RedCode) | [Website](https://redcode-agent.github.io) | [LeaderBoard](https://redcode-agent.github.io/#leaderboard)

### Industry Code Generation
- Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
- VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation
- Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
- Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
- VerilogEval: Evaluating Large Language Models for Verilog Code Generation | [Github](https://github.com/NVlabs/verilog-eval) | [Dataset](https://github.com/NVlabs/verilog-eval/tree/main/dataset_code-complete-iccad2023)
- Benchmarking Large Language Models for Automated Verilog RTL Code Generation | [Github](https://github.com/shailja-thakur/vgen) | [Dataset](https://github.com/shailja-thakur/VGen/tree/main/prompts-and-testbenches)
- RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model | [Github](https://github.com/hkust-zhiyao/rtllm) | [Dataset](https://github.com/hkust-zhiyao/rtllm)
- LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems
- Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents | [Github](https://github.com/Luoji-zju/Agents4PLC_release) | [Dataset](https://github.com/Luoji-zju/Agents4PLC_release/tree/master/benchmark)
- A Multi-Agent Framework for Extensible Structured Text Generation in PLCs
- MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs | [Github](https://github.com/scale-lab/MetRex) | [Dataset](https://huggingface.co/datasets/scale-lab/MetRex)
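
Hardware benchmarks such as VerilogEval, VGen, and RTLLM judge generated RTL by compiling it together with a reference testbench and simulating. A minimal sketch using the open-source Icarus Verilog toolchain (`iverilog`/`vvp`); the file names and the pass marker printed by the testbench are assumptions, not any benchmark's fixed convention.

```python
# Minimal sketch: functional check of generated Verilog with the open-source Icarus
# Verilog toolchain. Assumes iverilog/vvp are installed and that the testbench prints
# a success marker such as "ALL TESTS PASSED" (marker and file names are assumptions).
import pathlib
import subprocess
import tempfile

def rtl_passes_testbench(design_v: str, testbench_v: str,
                         pass_marker: str = "ALL TESTS PASSED") -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        design = pathlib.Path(tmp) / "design.v"
        bench = pathlib.Path(tmp) / "tb.v"
        design.write_text(design_v)
        bench.write_text(testbench_v)
        sim = pathlib.Path(tmp) / "sim.out"
        compiled = subprocess.run(["iverilog", "-o", str(sim), str(design), str(bench)],
                                  capture_output=True)
        if compiled.returncode != 0:
            return False  # the generated RTL does not even compile
        run = subprocess.run(["vvp", str(sim)], capture_output=True, text=True, timeout=120)
        return pass_marker in run.stdout
```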