awesome-machine-learning-on-source-code

Cool links & research papers related to Machine Learning applied to source code
https://github.com/ermuur/awesome-machine-learning-on-source-code

Posts
Software
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- bblfsh - Self-hosted server for source code parsing.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
- Public Git Archive - 3 TB of Git repositories from GitHub.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - Metadata of ≈ 452M commits from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Names used in source code (identifiers, not names of people) extracted from 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- GitHub JavaScript Dump October 2016 - Dataset consisting of 494,352 syntactically valid JavaScript files obtained from the top ~10,000 starred JavaScript repositories on GitHub, with licenses and parsed ASTs.
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file-based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- DeepCS - Keras and PyTorch implementations of DeepCS (Deep Code Search).
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Implemented in Torch, for C# and SQL.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
- DeepBugs - Framework for learning bug detectors from an existing code corpus.
- DeepSim - Deep learning-based approach to measuring the functional similarity of code.
- rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors by Word Mover's Distance.
- source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.
- StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
- card2code - Language-to-code datasets described in the paper "Latent Predictor Networks for Code Generation".
- NL2Bash - Set of ~10,000 Bash one-liners collected from websites such as StackOverflow, with English descriptions written by Bash programmers, as described in the paper (https://arxiv.org/abs/1802.08979).
- Clone Digger - Clone detection for Python and Java.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- engine - Scalable and distributed data retrieval pipeline for source code.
Talks

Programming Languages

Python 12 Java 4 Go 3 Scala 1 C++ 1 NewLisp 1 Jupyter Notebook 1 HTML 1
