awesome-machine-learning-on-source-code

Cool links & research papers related to Machine Learning applied to source code (MLonCode)
https://github.com/src-d/awesome-machine-learning-on-source-code

Last synced: 5 days ago
JSON representation

Papers
- A Neural Framework for Retrieval and Summarization of Source Code - Qingying Chen, Minghui Zhou, ASE 2018.
- Oreo: detection of clones in the twilight zone - Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes, FSE 2018.
- A Deep Learning Approach to Program Similarity - Niccolò Marastoni, Roberto Giacobazzi and Mila Dalla Preda, MASES 2018.
- DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, Programming Languages OOPSLA 2017.
- Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, ICLR 2017.
- AST-Based Deep Learning for Detecting Malicious PowerShell - Gili Rusak, Abdullah Al-Dujaili, Una-May O'Reilly, 2018.
- DeepBugs: A Learning Approach to Name-based Bug Detection - Michael Pradel, Koushik Sen, 2018.
- Sentiment Polarity Detection for Software Development - Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, Empirical Software Engineering 2017.
- Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection - Lutz Büch and Artur Andrzejak, SANER 2019.
- Generating Accurate and Compact Edit Scripts Using Tree Differencing - Veit Frick, Thomas Grassauer, Fabian Beck, Martin Pinzger, ICSME 2018.
- A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki, Computational Statistics & Data Analysis 2017.
- BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray, R Journal 2016.
- Leveraging Automated Sentiment Analysis in Software Engineering - Md Rakibul Islam, Minhaz F. Zibran, MSR 2017.
- Sentiment Analysis for Software Engineering: How Far Can We Go? - Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, ICSE 2018.
- An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation - Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, ASE 2018.
- NL2Type: Inferring JavaScript Function Types from Natural Language Information - Rabee Sohail Malik, Jibesh Patra, Michael Pradel, ICSE 2019.
- Are Deep Neural Networks the Best Choice for Modeling Source Code? - Vincent J. Hellendoorn, Premkumar Devanbu, FSE 2017.
- Some from Here, Some from There: Cross-project Code Reuse in GitHub - Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, MSR 2017.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou, 2017.
- Semantic clustering: Identifying topics in source code - Adrian Kuhn, Stéphane Ducasse, Tudor Girba, Information & Software Technology 2007.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, ASE 2016.
- Fine-grained and Accurate Source Code Differencing - Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, Martin Monperrus, ASE 2014.
- Clustering Binary Data with Bernoulli Mixture Models - Neal S. Grantham.
Posts
Software
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- Clone Digger - clone detection for Python and Java.
- bblfsh - Self-hosted server for source code parsing.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
- Public Git Archive - 6 TB of Git repositories from GitHub.
- GitHub Issue Titles and Descriptions for NLP Analysis - ~8 million GitHub issue titles and descriptions from 2017.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs.
- BigCloneBench - Clone detection benchmark of 8 million function clone pairs in the IJaDataset.
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication as scale, research.
- gemini - Source code deduplication as scale, production.
- enry - Insanely fast file based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- DeepCS - Keras and Pytorch implementations of DeepCS (Deep Code Search).
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language agnostic framework for learning coding conventions from a codebase and then expoiting this information for suggesting better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
- DeepBugs - Framework for learning bug detectors from an existing code corpus.
- DeepSim - a deep learning-based approach to measure code functional similarity.
- rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
- MindsDB - MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state of the art ML models in as simple as one line of code.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
- source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.
- Neural-Code-Search-Evaluation-Dataset - dataset contains links to 4.7M methods from 24k+ repositories with 287 StackOverflow questions and code snippet answers.
- CodeSearchNet - collection of datasets and benchmarks for code retrieval using natural language. Contains 2M pairs of (`comment`, `code`).
- StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
- card2code - This dataset contains the language to code datasets described in the paper [Latent Predictor Networks for Code Generation](#card2code).
- NL2Bash - This dataset contains a set of ~10,000 bash one-liners collected from websites such as StackOverflow and their English descriptions written by Bash programmers, as described in the [paper](https://arxiv.org/abs/1802.08979).
- Clone Digger - clone detection for Python and Java.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- engine - Scalable and distributed data retrieval pipeline for source code.
Talks

Programming Languages

Python 14 Java 4 Go 3 Jupyter Notebook 2 Scala 1 C++ 1 NewLisp 1 HTML 1

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

awesome-machine-learning-on-source-code

Papers

Posts

Software

Talks