awesome-ai4code
A collection of recent papers, benchmarks, and datasets in the AI4Code domain.
https://github.com/bdqnghi/awesome-ai4code
Tools/Products
AI code completion tools
More General Coding Assistants
ChatGPT in your editor
LLM-powered natural language compilers
Academic
Conferences
- Automated Software Engineering (ASE)
- Programming Language Design and Implementation (PLDI)
- International Conference on Learning Representations (ICLR)
- Empirical Methods in Natural Language Processing (EMNLP)
- North American Chapter of the Association for Computational Linguistics (NAACL)
- Annual Meeting of the Association for Computational Linguistics (ACL)
- International Conference on Software Engineering (ICSE)
- Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)
- International Conference on Machine Learning (ICML)
- Conference on Neural Information Processing Systems (NeurIPS)
Papers (this list is a bit outdated and needs updating)
- Large Language Models of Code Fail at Completing Code with Potential Bugs - Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis.
- Large Language Models Meet NL2Code: A Survey - Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, Jian-Guang Lou (EMNLP 2023)
- RepoFusion: Training Code Models to Understand Your Repository - Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak
- XCODEEVAL: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval - Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty
Pretrained CodeLLMs
Papers (this list is a bit outdated and needs updating)
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation - Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi (EMNLP 2021) (***CodeT5***).
- CodeBERT: A Pre-Trained Model for Programming and Natural Language - Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou (EMNLP 2020 Findings) (***CodeBERT***).
- Learning and Evaluating Contextual Embedding of Source Code - Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi. (ICML 2020) (***CuBERT***).
- Unsupervised Translation of Programming Languages - Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample (NeurIPS 2020) (***Transcoder***).
- Contrastive Code Representation Learning
- CoTexT: Multi-task Learning with Code-Text Transformer
- How could Neural Networks understand Programs? - Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu (ICML 2021) (***OSCAR***).
- Unified Pre-training for Program Understanding and Generation - Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang (NAACL 2021) (***PLBART***).
- Exploring Software Naturalness through Neural Language Models (***C-BERT***).
- PYMT5: multi-mode translation of natural language and PYTHON code with transformers
- DOBF: A Deobfuscation Pre-Training Objective for Programming Languages - Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample (arXiv 2021) (***DOBF***).
- Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
- Disentangled Code Representation Learning for Multiple Programming Languages (ACL Findings 2021) (***CODEDISEN***).
- SYNCOBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
- TreeBERT: A Tree-Based Pre-Trained Model for Programming Language
- Empirical Study of Transformers for Source Code
- GraphCodeBERT: Pre-training Code Representations with Data Flow - Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou (ICLR 2021) (***GraphCodeBERT***).
- CodeTrans: Towards Cracking the Language of Silicone’s Code Through Self-Supervised Deep Learning and High Performance Computing
- Self-Supervised Learning for Code Retrieval and Summarization through Semantic-Preserving Program Transformations - Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang (SIGIR 2021) (***Corder***).
Talks and Tutorials
Dataset and Benchmark
Papers (this list is a bit outdated and needs updating)