awesome-machine-learning-on-source-code
Cool links & research papers related to Machine Learning applied to source code (MLonCode)
https://github.com/eric-erki/awesome-machine-learning-on-source-code
Digests
- Learning from "Big Code" - Techniques, challenges, tools, datasets on "Big Code".
- A Survey of Machine Learning for Big Code and Naturalness - Survey and literature review on Machine Learning on Source Code.
Competitions
- CodRep - competition on automatic program repair: given a source line, find the insertion point.
Posts
- Semantic Code Search
- Training a Model to Summarize Github Issues
- Learning from Source Code
- Sequence Intent Classification Using Hierarchical Attention Networks
- Syntax-Directed Variational Autoencoder for Structured Data
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Using recurrent neural networks to predict next tokens in Java solutions
Papers
- Sentiment Analysis for Software Engineering: How Far Can We Go? - Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, ICSE 2018.
- Sentiment Polarity Detection for Software Development - Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, Empirical Software Engineering 2017.
- Leveraging Automated Sentiment Analysis in Software Engineering - Md Rakibul Islam, Minhaz F. Zibran, MSR 2017.
- SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions - Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, Shahram Rahimi, ASE 2017.
- A Convolutional Attention Network for Extreme Summarization of Source Code - Miltiadis Allamanis, Hao Peng, Charles Sutton, ICML 2016.
- A study of repetitiveness of code changes in software evolution - HA Nguyen, AT Nguyen, TT Nguyen, TN Nguyen, H Rajan, ASE 2013.
- Coarse-to-Fine Decoding for Neural Semantic Parsing - Li Dong, Mirella Lapata, ACL 2018.
- Semantic Parsing with Syntax- and Table-Aware SQL Generation - Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, Ming Zhou, ACL 2018.
- DialSQL: Dialogue Based Structured Query Generation - Izzeddin Gur, Semih Yavuz, Yu Su, Xifeng Yan, ACL 2018.
- NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System - Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, LREC 2018.
- Recent Advances in Neural Program Synthesis - Neel Kant, 2018.
- Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine, ICLR 2018.
- Neural Program Search: Solving Programming Tasks from Description and Examples - Illia Polosukhin, Alexander Skidanov, ICLR 2018.
- Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le, 2018.
- Towards Synthesizing Complex Programs from Input-Output Examples - Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018.
- Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai, AAAI 2018.
- Program Synthesis for Character Level Language Modeling - Pavol Bielik, Veselin Raychev, Martin Vechev, ICLR 2017.
- SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning - Xiaojun Xu, Chang Liu, Dawn Song, 2017.
- Learning to Select Examples for Program Synthesis - Yewen Pu, Zachery Miranda, Armando Solar-Lezama, Leslie Pack Kaelbling, 2017.
- Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli, NIPS 2017.
- Learning to Infer Graphics Programs from Hand-Drawn Images - Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum, 2017.
- Neural Attribute Machines for Program Generation - Matthew Amodio, Swarat Chaudhuri, Thomas Reps, 2017.
- Abstract Syntax Networks for Code Generation and Semantic Parsing - Maxim Rabinovich, Mitchell Stern, Dan Klein, ACL 2017.
- Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, ICLR 2017.
- A Syntactic Neural Model for General-Purpose Code Generation - Pengcheng Yin, Graham Neubig, ACL 2017.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael Ernst, 2017.
- RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli, ICML 2017.
- Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow, 2017.
- Neural Programming by Example - Chengxun Shu, Hongyu Zhang, AAAI 2017.
- DeepCoder: Learning to Write Programs - Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow, ICLR 2017.
- Latent Attention For If-Then Program Synthesis - Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, Mingcheng Chen, NIPS 2016.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao, NIPS 2016.
- Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin, NIPS 2016.
- Structured Generative Models of Natural Source Code - Chris J. Maddison, Daniel Tarlow, ICML 2014.
- code2vec: Learning Distributed Representations of Code - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, 2018.
- Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi, ICLR 2018.
- A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2017.
- A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham, 2016.
- A General Path-Based Representation for Predicting Program Properties - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, PLDI 2018.
- Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks - Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu, AAAI 2018.
- Syntax-Directed Variational Autoencoder for Structured Data - Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song, ICLR 2018.
- Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna, ICLR 2018.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach, 2016.
- Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau, NIPS 2016.
- Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy, NIPS 2016.
- Hierarchical multiscale recurrent neural networks - Junyoung Chung, Sungjin Ahn, and Yoshua Bengio, ICLR 2017.
- Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas, ICLR 2016.
- Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever, ICLR 2016.
- Neural Random-Access Machines - Karol Kurach, Marcin Andrychowicz, Ilya Sutskever, ERCIM News 2016.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever, ICLR 2015.
- Learning to Execute - Wojciech Zaremba, Ilya Sutskever, 2015.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov, NIPS 2015.
- Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka, 2014.
- From Machine Learning to Machine Reasoning - Léon Bottou, Journal of Machine Learning 2011.
- Word Embeddings for the Software Engineering Domain - Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis, MSR 2018.
- Document Distance Estimation via Code Graph Embedding - Zeqi Lin, Junfeng Zhao, Yanzhen Zou, Bing Xie, Internetware 2017.
- Combining Word2Vec with revised vector space model for better code retrieval - Thanh Van Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Trong Duc Nguyen, Tien N. Nguyen, ICSE 2017.
- From word embeddings to document similarities for improved information retrieval in software engineering - Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, Chang Liu, ICSE 2016.
- Tree-to-tree Neural Networks for Program Translation - Xinyun Chen, Chang Liu, Dawn Song, ICLR 2018.
- Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu, 2017.
- Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan, ASE 2017.
- A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich, IJCNLP 2017.
- A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo, ACL 2017.
- Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu, 2017.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav, PLDI 2014.
- A deep tree-based model for software defect prediction - HK Dam, T Pham, SW Ng, [T Tran](https://truyentran.github.io), J Grundy, A Ghose, T Kim, CJ Kim, 2018.
- Learning a Static Analyzer from Data - Pavol Bielik, Veselin Raychev, Martin Vechev, CAV 2017. [video](https://www.youtube.com/watch?v=bkieI3jLxVY)
- Automated Vulnerability Detection in Source Code Using Deep Representation Learning - Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley, 2018.
- Shaping Program Repair Space with Existing Patches and Similar Code - Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, Xiangqun Chen, 2018. ([code](https://github.com/xgdsmileboy/SimFix))
- Learning to Repair Software Vulnerabilities with Generative Adversarial Networks - Jacob A. Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin, 2018.
- Dynamic Neural Program Embedding for Program Repair - Ke Wang, Rishabh Singh, Zhendong Su, ICLR 2018.
- Estimating defectiveness of source code: A predictive model using GitHub content - Ritu Kapur, Balwinder Sodhi, 2018.
- Automated software vulnerability detection with machine learning - Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R. Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key, Paul M. Ellingwood, Marc W. McConley, Jeffrey M. Opper, Peter Chin, Tomo Lazovich, IWSPA 2018.
- Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli, 2017.
- Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma, FSE 2017.
- SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt, 2017.
- End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks - Min-je Choi, Sehun Jeong, Hakjoo Oh, Jaegul Choo, IJCAI 2017.
- Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code - Nghi D. Q. Bui, Lingxiao Jiang, ICSE 2018.
- DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, IJCAI 2017.
- Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, FSE 2016.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen, Journal of Systems and Software 2017.
- Lean GHTorrent: GitHub data on demand - Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman, MSR 2014.
- The Case for Learned Index Structures - Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis, SIGMOD 2018.
- Learning to superoptimize programs - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, ICLR 2017.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, USENIX Security Symposium 2017.
- Adaptive Neural Compilation - Rudy Bunel, Alban Desmaison, Pushmeet Kohli, Philip H.S. Torr, M. Pawan Kumar, NIPS 2016.
- Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli, NIPS 2016.
- Topic modeling of public repositories at scale using names in source code - Vadim Markovtsev, Eiso Kant, 2017.
- Semantic clustering: Identifying topics in source code - Adrian Kuhn, Stéphane Ducasse, Tudor Girba, Information & Software Technology 2007.
- A Benchmark Study on Sentiment Analysis for Software Engineering Research - Nicole Novielli, Daniela Girardi, Filippo Lanubile, MSR 2018.
- Summarizing Source Code using a Neural Attention Model - Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer, ACL 2016.
- DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, Programming Languages OOPSLA 2017.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, ASE 2016.
- DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer - Joseph Suarez, Justin Johnson, Fei-Fei Li, 2018.
- Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction - Da Xiao, Jo-Yu Liao, Xingyuan Yuan, ICLR 2018.
- Differentiable Programs with Neural Libraries - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, ICML 2017.
- Differentiable Functional Program Interpreters - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, 2017.
- Neural Functional Programming - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow, ICLR 2017.
- TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow, NIPS 2016.
- A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki, Computational Statistics & Data Analysis 2017.
- BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray, R Journal 2016.
- Robust mixture modelling using the t distribution - D. Peel and G. J. McLachlan, Statistics and Computing 2000.
- Robust mixture modeling using the skew t distribution - Tsung I. Lin, Jack C. Lee and Wan J. Hsieh, Statistics and Computing 2010.
- A Fast Unified Model for Parsing and Sentence Understanding - Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, Christopher Potts, ACL 2016.
- Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom, ACL 2016.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel, 2016.
- Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities - Martin White, Michele Tufano, Matías Martínez, Martin Monperrus, Denys Poshyvanyk, 2017.
- Tailored Mutants Fit Bugs Better - Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton, 2016.
- Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel, ICML 2017.
Conferences
- ACM International Conference on Software Engineering (ICSE)
- ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE)
- 2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER)
- Workshop on NLP for Software Engineering
- SysML
- Mining Software Repositories
- source{d} tech talks
- Learning to Code: Machine Learning for Program Induction - Alexander Gaunt.
Talks
- Machine Learning on Source Code
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Measuring code sentiment in a Git repository
Software
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- bblfsh - Self-hosted server for source code parsing.
- Public Git Archive - 3 TB of Git repositories from GitHub.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Names used in source code (identifiers, not people's names) extracted from 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs.