awesome-machine-learning-on-source-code
Cool links & research papers related to Machine Learning applied to source code
https://github.com/ermuur/awesome-machine-learning-on-source-code
Posts
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Semantic Code Search
- Training a Model to Summarize Github Issues
- Aroma: Using machine learning for code recommendation
- Syntax-Directed Variational Autoencoder for Structured Data
Software
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- bblfsh - Self-hosted server for source code parsing.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
- Public Git Archive - 3 TB of Git repositories from GitHub.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Identifier names (not names of people) extracted from source code in 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically-valid JavaScript files obtained from the top ~10000 starred JavaScript repositories on GitHub, with licenses, and parsed ASTs.
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file-based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- DeepCS - Keras and Pytorch implementations of DeepCS (Deep Code Search).
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Implemented in Torch, for C# and SQL.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
- DeepBugs - Framework for learning bug detectors from an existing code corpus.
- DeepSim - Deep learning-based approach to measuring functional code similarity.
- rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
- source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.
- StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
- card2code - Language-to-code datasets described in the paper "Latent Predictor Networks for Code Generation".
- NL2Bash - ~10,000 bash one-liners collected from websites such as StackOverflow, with English descriptions written by Bash programmers, as described in the [paper](https://arxiv.org/abs/1802.08979).
- Clone Digger - Clone detection for Python and Java.
- GitHub Java Corpus - Set of Java projects collected from GitHub and used in a number of its authors' publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- engine - Scalable and distributed data retrieval pipeline for source code.
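
Several entries above (minhashcuda, apollo, gemini) rely on MinHash-style signatures to find near-duplicate code. The idea can be sketched in a few lines of Python. This is an illustration only: it uses plain (unweighted) MinHash rather than the Weighted MinHash that minhashcuda implements, and every function name here (`shingles`, `minhash_signature`, `estimated_jaccard`) is hypothetical and not taken from any of the listed projects.

```python
from __future__ import annotations

import hashlib
import re


def shingles(source: str, k: int = 3) -> set[str]:
    """Split source into tokens and return the set of k-token windows."""
    # Identifiers/keywords are kept whole; any other non-space char is a token.
    tokens = re.findall(r"[A-Za-z_]\w*|\S", source)
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of item; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each of num_hashes hash functions, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    original = "def add(a, b):\n    return a + b\n"
    renamed = "def add(x, b):\n    return x + b\n"
    sig_1 = minhash_signature(shingles(original))
    sig_2 = minhash_signature(shingles(renamed))
    print(f"estimated similarity: {estimated_jaccard(sig_1, sig_2):.2f}")
```

Signatures have a fixed length regardless of file size, so comparing two files costs the same no matter how large they are. Deduplication systems typically add locality-sensitive hashing on top of such signatures to avoid comparing every pair.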
Talks
- Machine Learning on Source Code
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Measuring code sentiment in a Git repository