Awesome-Code-LLM

[TMLR] A curated list of language modeling research for code (and other software engineering activities), plus related datasets.
https://github.com/codefuse-ai/Awesome-Code-LLM

Last synced: 11 days ago

8. Datasets
- 8.2 Benchmarks
  - [paper] | …s-Last-Code-Exam/HLCE
  - [paper] | …level-Vulnerability-Detection
  - [paper] | …Ren/OJBench
  - 2025-06
  - [paper] | …Hunyuan/ArtifactsBenchmark
  - [paper] | …Eval-Official/CoreCodeBench
  - 2025-07 | …Evaluation/MERA_CODE
  - 2025-07 | …perf/swe-perf
  - [paper] | …bench_Pro-os
  - [paper] | …weihan/SWE-QA-Bench
  - 2025-09
  - [paper] | …a-p/AetherCode
  - 2025-10
  - [paper] | …interact.github.io/
  - [paper] | …ai/Falcon
  - [paper] | …Replication
  - [paper] | …JPG/VCode
  - [paper] | …Sharp-Bench
  - [paper] | …Computing-Lab/gpuFLOPBench
  - 2025-12
  - [paper] | …EVO
  - 2026-01
  - 2026-02
  - [paper] | …nlp/timemachine-bench
  - 2026-03
  - [paper] | …AI-Lab/SWE-QA-Pro
  - [paper] | …Team/BeyondSWE
  - [paper] | …code/fc-eval
9. Recommended Readings
  - PaLM: Scaling Language Modeling with Pathways
  - BLOOM: A 176B-Parameter Open-Access Multilingual Language Model - open-source dense LLM, trained on 46 languages, with detailed discussion about training and evaluation
  - LLaMA … [GPT-4](https://arxiv.org/abs/2303.08774) or [PaLM 2](https://arxiv.org/abs/2305.10403). For comprehensive reviews on these more general topics, we refer to other sources such as [Awesome-LLM](https://github.com/Hannibal046/Awesome-LLM) and [Awesome AIGC Tutorials](https://github.com/luban-agi/Awesome-AIGC-Tutorials), or, for LLM applications in other specific domains, to [Awesome Domain LLM](https://github.com/luban-agi/Awesome-Domain-LLM), [Awesome Tool Learning](https://github.com/luban-agi/Awesome-Tool-Learning#awesome-tool-learning), [Awesome-LLM-MT](https://github.com/hsing-wang/Awesome-LLM-MT), and [Awesome Education LLM](https://github.com/Geralt-Targaryen/Awesome-Education-LLM).
  - The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  - Neural Machine Translation by Jointly Learning to Align and Translate - encoder-decoder RNN
  - Neural Machine Translation of Rare Words with Subword Units - byte-pair encoding: split rare words into subword units
  - Attention Is All You Need - self-attention for long-range dependency and parallel training
  - Mixed Precision Training
  - GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  - Improving Language Understanding by Generative Pre-Training - pretraining-finetuning paradigm applied to Transformer decoder
  - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  - Language Models are Unsupervised Multitask Learners
  - SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
  - RoBERTa: A Robustly Optimized BERT Pretraining Approach
  - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models - memory-efficient distributed optimization
  - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer - encoder-decoder pretrained with an MLM-like denoising objective
  - Language Models are Few-Shot Learners - by scaling GPT-2 up to 175B parameters, the authors discovered a new learning paradigm: In-Context Learning (ICL)
  - Measuring Massive Multitask Language Understanding - knowledge and complex reasoning benchmark
  - LoRA: Low-Rank Adaptation of Large Language Models - parameter-efficient finetuning
  - Finetuned Language Models Are Zero-Shot Learners - instruction finetuning
  - Multitask Prompted Training Enables Zero-Shot Task Generalization
  - Scaling Language Models: Methods, Analysis & Insights from Training Gopher
  - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - Chain-of-Thought reasoning
  - Training language models to follow instructions with human feedback - GPT-3 instruction finetuned with RLHF (reinforcement learning from human feedback)
  - Training Compute-Optimal Large Language Models
  - Large Language Models are Zero-Shot Reasoners
  - Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models - knowledge and complex reasoning benchmark
  - Emergent Abilities of Large Language Models
  - Scaling Instruction-Finetuned Language Models
  - Self-Instruct: Aligning Language Models with Self-Generated Instructions - self-generated data
  - RoFormer: Enhanced Transformer with Rotary Position Embedding