awesome-code-benchmark
A comprehensive review of code-domain benchmarks for LLM research.
https://github.com/tongye98/awesome-code-benchmark
News
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
- aiXamine: Simplified LLM Safety and Security
- CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
- CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
- Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
- ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
- OSS-Bench: Benchmark Generator for Coding LLMs
- VERINA: Benchmarking Verifiable Code Generation
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
- EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
- Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency
- Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents
- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
- Success is in the Details: Evaluate and Enhance Details Sensitivity of Code
- CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
- Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
- CODEMENV: Benchmarking Large Language Models on Code Migration
- DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
- DS-Bench: A Realistic Benchmark for Data Science Code Generation
- Rethinking Repetition Problems of LLMs in Code Generation
- WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
- CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
- Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware
- OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics
- FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
- CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
- CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs
- SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
- RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
- MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios
- OJBench: A Competition Level Code Benchmark For Large Language Models
- TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
- DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
- PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models
- CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
- CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
- Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
- Model Editing for LLMs4Code: How Far are We?
- VeriBench: Benchmarking Large Language Models for Verilog Code Generation and Design Synthesis
- ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness
- Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
- JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation
- From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
- AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes
- STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning
- SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
- Multilingual Multimodal Software Developer for Code Generation
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
- SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
- Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
- MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
- Turning the Tide: Repository-based Code Reflection
- TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
- LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
- CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Survey
Top Code Benchmark
Code Completion & Code Generation
- Evaluating Large Language Models Trained on Code | [Github](https://github.com/openai/human-eval) | [Dataset](https://huggingface.co/datasets/openai/openai_humaneval)
- Program Synthesis with Large Language Models | [Github](https://github.com/google-research/google-research) | [Dataset](https://huggingface.co/datasets/google-research-datasets/mbpp)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
- MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation | [Github](https://github.com/nuprl/MultiPL-E) | [Dataset](https://huggingface.co/datasets/nuprl/MultiPL-E)
- Python Code Generation by Asking Clarification Questions
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | [Github](https://github.com/bigcode-project/bigcodebench) | [Dataset](https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06) | [LeaderBoard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- Measuring Coding Challenge Competence With APPS
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages | [Github](https://github.com/zorazrw/multilingual-conala) | [Dataset](https://huggingface.co/datasets/neulab/mconala)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- Execution-Based Evaluation for Open-Domain Code Generation
- BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | [Github](https://github.com/amazon-science/cceval) | [Dataset](https://github.com/amazon-science/cceval/tree/main/data)
- MT-Bench: How Good are LLMs at Multi-turn Question Answering
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code | [Github](https://github.com/gersteinlab/ML-bench) | [Dataset](https://huggingface.co/datasets/super-dainiu/ml-bench) | [Website](https://ml-bench.github.io/)
- PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by LLMs
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models | [Github](https://github.com/alphadl/OOP-eval) | [Dataset](https://huggingface.co/datasets/codeai-dteam/oop)
- A Static Evaluation of Code Completion by Large Language Models
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
- ICE-Score: Instructing Large Language Models to Evaluate Code
- Exploring Language Model's Code Generation Ability with Auxiliary Functions
- R2E: Turning Any GitHub Repository into a Programming Agent Test Environment
- EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
- A Performance Study of LLM-Generated Code on Leetcode
- Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing
- Competition-Level Code Generation with AlphaCode | [Github](https://github.com/google-deepmind/code_contests) | [Dataset](https://github.com/google-deepmind/code_contests)
- LLM4Decompile: Decompiling Binary Code with Large Language Models
- Enhancing Repository-Level Code Generation with Integrated Contextual Information
- AICoderEval: Improving AI Domain Code Generation of Large Language Models
- CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
- Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
- ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
- StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code | [Github](https://github.com/Wellesley-EASEL-lab/StudentEval) | [Dataset](https://huggingface.co/datasets/wellesley-easel/StudentEval)
- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
- A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs
- CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts
- IFEvalCode: Controlled Code Generation
- HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation | [Github](https://github.com/CodeEval-Pro/CodeEval-Pro) | [Dataset](https://huggingface.co/CodeEval-Pro) | [Website](https://answers111.github.io/evalpro.github.io/index.html) | [LeaderBoard](https://answers111.github.io/evalpro.github.io/leaderboard.html/)
Code Efficiency
- Evaluating Language Models for Efficient Code Generation
- EffiBench: Benchmarking the Efficiency of Automatically Generated Code
- Mercury: A Code Efficiency Benchmark for Code Large Language Models
- ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
- Learning Performance-Improving Code Edits
- How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | [Github](https://github.com/q-rz/enamel) | [Dataset](https://huggingface.co/datasets/q-rz/enamel)
- Improving Assembly Code Performance with Large Language Models via Reinforcement Learning
CodeFix & Bug-Fix
- OctoPack: Instruction Tuning Code Large Language Models | [Github](https://github.com/bigcode-project/octopack) | [Dataset](https://huggingface.co/datasets/bigcode/humanevalpack)
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents | [Github](https://github.com/logic-star-ai/SWT-Bench) | [Website](https://swtbench.com)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | [Github](https://github.com/swe-bench/SWE-bench) | [Website](https://www.swebench.com)
- SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | [Github](https://github.com/swe-bench/SWE-bench) | [Dataset](https://www.swebench.com/multimodal) | [Website](https://www.swebench.com/multimodal)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- Large Language Models of Code Fail at Completing Code with Potential Bugs | [Github](https://github.com/amazon-science/buggy-code-completion) | [Dataset](https://github.com/amazon-science/buggy-code-completion)
- GitBug-Java: A Reproducible Benchmark of Recent Java Bugs | [Github](https://github.com/gitbugactions/gitbug-java) | [Dataset](https://huggingface.co/datasets/gitbugactions/gitbug-java) | [Website](https://nuno.saavedra.pt/gitbug-java#!/)
- GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions
- When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?
- RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing
- DebugBench: Evaluating Debugging Capability of Large Language Models
- Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
- Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations
- Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
- INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing
- Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models
- ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs
- FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
- COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
- CVE-Bench: Benchmarking LLM-based Software Engineering Agent's Ability to Repair Real-World CVE Vulnerabilities
Code Hallucination
- Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
- Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code
- THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge | [Github](https://github.com/Leah-Ljx/think) | [Dataset](https://github.com/Leah-LJX/THINK/tree/main/benchmark)
Code Reasoning & Understanding
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
- CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs | [Github](https://github.com/FSoft-AI4Code/CodeMMLU) | [Dataset](https://huggingface.co/datasets/Fsoft-AIC/CodeMMLU) | [Website](https://fsoft-ai4code.github.io/codemmlu/) | [LeaderBoard](https://fsoft-ai4code.github.io/leaderboards/codemmlu/)
- GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
- CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? | [Github](https://github.com/CodeLLM-Research/CodeJudge-Eval)
- How Effectively Do Code Language Models Understand Poor-Readability Code? | [Github](https://github.com/ythere-y/PoorCodeSumEval) | [Dataset](https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text)
- A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation | [Dataset](https://huggingface.co/datasets/WeixiangYan/CodeScope)
- CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning | [Github](https://github.com/codesense-bench/codesense-codes) | [Dataset](https://huggingface.co/datasets/codesense-bench/codesense/tree/main) | [LeaderBoard](https://codesense-bench.github.io/leaderboard.html)
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
- LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
Code Security & Robustness
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents | [Github](https://github.com/AI-secure/RedCode) | [Website](https://redcode-agent.github.io) | [LeaderBoard](https://redcode-agent.github.io/#leaderboard)
- ReCode: Robustness Evaluation of Code Generation Models | [Github](https://github.com/amazon-science/recode) | [Dataset](https://github.com/amazon-science/recode/tree/main/dataset-release)
- COCO: Testing Code Generation Systems via Concretized Instructions | [Github](https://github.com/coco-2023/COCO)
- CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation | TURC 2024 | [Github](https://github.com/Dizzy-K/CodeWMBench)
- RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | [Github](https://github.com/qing-yuan233/RMCBench) | [Dataset](https://huggingface.co/datasets/zhongqy/RMCBench)
- Benchmarking the Security Aspect of Large Language Model-Based Code Generation
- CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models | [Dataset](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks)
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
- CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity | [Github](https://github.com/CS-EVAL/CS-Eval) | [Dataset](https://huggingface.co/datasets/cseval/cs-eval)
- SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
- SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
Code Translation
- Unsupervised Translation of Programming Languages
- AVATAR: A Parallel Corpus for Java-Python Program Translation
- On the Evaluation of Neural Code Translation: Taxonomy and Benchmark | [Github](https://github.com/PolyEval/G-TransEval) | [Dataset](https://github.com/polyeval/g-transeval/tree/main/G-TransEval)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- Repository-level Code Translation Benchmark Targeting Rust
- XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- Escalating LLM-based Code Translation Benchmarking into the Class-level Era | [Github](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset) | [Dataset](https://github.com/anonymous-author-coder/ClassEval-T-Code-Translation-Evaluation-Dataset/tree/main/ClassEval_T)
- Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
- XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? | [Github](https://github.com/q4x3/poly-humaneval) | [Dataset](https://github.com/q4x3/poly-humaneval/tree/main/benchmark)
- Enhancing LLMs in Long Code Translation through Instrumentation and Program State Alignment
Code Version
- Automatically Recommend Code Updates: Are We There Yet? | [Github](https://github.com/yueyueL/CodeLM-CodeUpdateEval) | [Dataset](https://github.com/yueyueL/CodeLM-CodeUpdateEval/tree/main/dataset)
- On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions | [Github](https://github.com/java-generalization/java-generalization-replication) | [Dataset](https://zenodo.org/records/10057237)
- VersiCode: Towards Version-controllable Code Generation
- GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
- LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion | [Github](https://github.com/cs-wangchong/LLM-Deprecated-API) | [Dataset](https://figshare.com/s/e8de860d8fc2ec0541d2)
- CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | [Github](https://github.com/leo-liuzy/CodeUpdateArena) | [Dataset](https://github.com/leo-liuzy/CodeUpdateArena/tree/main/data)
- LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation
- RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Data science
- DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | [Github](https://github.com/xlang-ai/DS-1000) | [Dataset](https://huggingface.co/datasets/xlangai/DS-1000) | [HomePage](https://ds1000-code-gen.github.io)
- DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | [Github](https://github.com/yiyihum/da-code) | [Dataset](https://huggingface.co/datasets/Jianwen2003/DA-Code) | [Website](https://da-code-bench.github.io)
- Evaluation of Code LLMs on Geospatial Code Generation
- SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization
- Natural Language to Code Generation in Interactive Data Science Notebooks | [Github](https://github.com/google-research/arcade-nl2code) | [Dataset](https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset)
- DataSciBench: An LLM Agent Benchmark for Data Science
- DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
Industry Code Generation
- VerilogEval: Evaluating Large Language Models for Verilog Code Generation | [Github](https://github.com/NVlabs/verilog-eval) | [Dataset](https://github.com/NVlabs/verilog-eval/tree/main/dataset_code-complete-iccad2023)
- RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model | [Github](https://github.com/hkust-zhiyao/rtllm) | [Dataset](https://github.com/hkust-zhiyao/rtllm)
- MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs | [Github](https://github.com/scale-lab/MetRex) | [Dataset](https://huggingface.co/datasets/scale-lab/MetRex)
- Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
- VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation
- Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
- LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems
- Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents | [Github](https://github.com/Luoji-zju/Agents4PLC_release) | [Dataset](https://github.com/Luoji-zju/Agents4PLC_release/tree/master/benchmark)
Sub Categories
- Code Completion & Code Generation: 43
- MultiModal Code Tasks: 25
- CodeFix & Bug-Fix: 21
- Text2SQL: 19
- Industry Code Generation: 15
- Multi & Other Dimension: 14
- Code Translation: 11
- Code Security & Robustness: 11
- Code Reasoning & Understanding: 10
- Code Version: 8
- Data science: 8
- Code Efficiency: 7
- Code Hallucination: 4
- MultiModal Code Generation: 3
- Security Code Generation & Test Generation: 2