awesome-machine-learning-on-source-code
Cool links & research papers related to Machine Learning applied to source code (MLonCode)
https://github.com/src-d/awesome-machine-learning-on-source-code
Papers
- Oreo: detection of clones in the twilight zone - Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes, FSE 2018.
- A Deep Learning Approach to Program Similarity - Niccolò Marastoni, Roberto Giacobazzi and Mila Dalla Preda, MASES 2018.
- DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, Programming Languages OOPSLA 2017.
- Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, ICLR 2017.
- AST-Based Deep Learning for Detecting Malicious PowerShell - Gili Rusak, Abdullah Al-Dujaili, Una-May O'Reilly, 2018.
- DeepBugs: A Learning Approach to Name-based Bug Detection - Michael Pradel, Koushik Sen, 2018.
- Sentiment Polarity Detection for Software Development - Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, Empirical Software Engineering 2017.
- Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection - Lutz Büch and Artur Andrzejak, SANER 2019.
- Generating Accurate and Compact Edit Scripts Using Tree Differencing - Veit Frick, Thomas Grassauer, Fabian Beck, Martin Pinzger, ICSME 2018.
- A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki, Computational Statistics & Data Analysis 2017.
- BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray, R Journal 2016.
- Leveraging Automated Sentiment Analysis in Software Engineering - Md Rakibul Islam, Minhaz F. Zibran, MSR 2017.
- Sentiment Analysis for Software Engineering: How Far Can We Go? - Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, ICSE 2018.
- An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation - Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk, ASE 2018.
- NL2Type: Inferring JavaScript Function Types from Natural Language Information - Rabee Sohail Malik, Jibesh Patra, Michael Pradel, ICSE 2019.
- Are Deep Neural Networks the Best Choice for Modeling Source Code? - Vincent J. Hellendoorn, Premkumar Devanbu, FSE 2017.
- Some from Here, Some from There: Cross-project Code Reuse in GitHub - Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, MSR 2017.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou, 2017.
Posts
- Semantic Code Search
- Learning from Source Code
- Training a Model to Summarize Github Issues
- Sequence Intent Classification Using Hierarchical Attention Networks
- Syntax-Directed Variational Autoencoder for Structured Data
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- Using recurrent neural networks to predict next tokens in the java solutions
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Aroma: Using machine learning for code recommendation
Software
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- Clone Digger - clone detection for Python and Java.
- bblfsh - Self-hosted server for source code parsing.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
- Public Git Archive - 6 TB of Git repositories from GitHub.
- GitHub Issue Titles and Descriptions for NLP Analysis - ~8 million GitHub issue titles and descriptions from 2017.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs.
- BigCloneBench - Clone detection benchmark of 8 million function clone pairs in the IJaDataset.
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- DeepCS - Keras and Pytorch implementations of DeepCS (Deep Code Search).
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Implemented in Torch, covering C# and SQL.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
- DeepBugs - Framework for learning bug detectors from an existing code corpus.
- DeepSim - a deep learning-based approach to measure code functional similarity.
- rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
- MindsDB - Explainable AutoML framework for developers that lets you build, train, and use ML models in as little as one line of code.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
- source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.
- Neural-Code-Search-Evaluation-Dataset - dataset contains links to 4.7M methods from 24k+ repositories with 287 StackOverflow questions and code snippet answers.
- CodeSearchNet - collection of datasets and benchmarks for code retrieval using natural language. Contains 2M pairs of (`comment`, `code`).
- StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
- card2code - Language-to-code datasets described in the paper Latent Predictor Networks for Code Generation.
- NL2Bash - This dataset contains a set of ~10,000 bash one-liners collected from websites such as StackOverflow and their English descriptions written by Bash programmers, as described in the [paper](https://arxiv.org/abs/1802.08979).
- engine - Scalable and distributed data retrieval pipeline for source code.
Talks
- Machine Learning on Source Code
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Measuring code sentiment in a Git repository