
A library and command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees.





# MLonCode research playground

**This project is no longer maintained. It has evolved into several others:**

* ml-core - the bits which are independent of mining tools.
* ml-mining - general purpose mining environment, currently based on the deprecated jgit-spark-connector.

**Below goes the original README.**

This project is the foundation for MLonCode research and development. It abstracts feature extraction and model training, so that you can focus on higher-level tasks.

Currently, the following models are implemented:

* BOW - weighted bag of x, where x is any of several extracted feature types.
* id2vec, source code identifier embeddings.
* docfreq, feature document frequencies (part of TF-IDF).
* topic modeling over source code identifiers.

It is written in Python 3 and has been tested on Linux and macOS. source{d} ml is tightly coupled with source{d} engine and delegates all the feature extraction parallelization to it.

Here is the list of proof-of-concept projects which are built using this project:

* vecino - finding similar repositories.
* tmsc - listing topics of a repository.
* snippet-ranger - topic modeling of source code snippets.
* apollo - source code deduplication at scale.

## Installation

Whether you include Spark in your installation or use an existing one,
`sourced-ml` needs some native libraries to be installed.
For example, on Ubuntu you must first run `apt install libxml2-dev libsnappy-dev`.
TensorFlow is also a requirement; both the CPU and GPU versions are supported.
To select a version, change the package name in the next section
to either `sourced-ml[tf]` or `sourced-ml[tf-gpu]`.
**If you don't, neither version will be installed.**

### With Apache Spark included

    pip3 install sourced-ml

### Use existing Apache Spark

If you already have Apache Spark installed and configured in your environment at `$SPARK_HOME`, you can reuse it and avoid downloading about 200 MB by using pip "editable installs":

    pip3 install -e "$SPARK_HOME/python"
    pip3 install sourced-ml

In both cases, you will need to have some native libraries installed, e.g.
on Ubuntu: `apt install libxml2-dev libsnappy-dev`. Some parts require TensorFlow.

## Usage

This project exposes two interfaces: an API and a command line. The command line is:

    srcml --help

## Docker image

    docker run -it --rm srcd/ml --help

If this first command fails with

    Cannot connect to the Docker daemon. Is the docker daemon running on this host?

and you are sure that the daemon is running, then you need to add your user to the `docker` group; refer to the documentation.

## Contributions

...are welcome! See [CONTRIBUTING]( and [CODE\_OF\](

## License

Apache 2.0

## Algorithms

#### Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

1. Read Git repositories.
2. Classify files using enry.
3. Extract UAST from each supported file.
4. Split and stem all the identifiers in each tree.
5. Traverse the UAST, collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents.
6. Write the global co-occurrence matrix.
7. Train the embeddings using Swivel (requires TensorFlow). Interactively view the intermediate results in TensorBoard using `--logs`.
8. Write the identifier embeddings model.

Steps 1-5 are performed with the `repos2coocc` command, step 6 with `id2vec_preproc`, step 7 with `id2vec_train`, and step 8 with `id2vec_postproc`.
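To make steps 4 and 5 more concrete, here is a minimal, self-contained sketch of identifier splitting and co-occurrence counting over a toy tree. It does not use the project's API or a real UAST: the `node` helper, the dict layout with `token` and `children` keys, and the function names are all hypothetical, stemming is omitted, and counting the parts of a single identifier as co-occurring is an assumption about how "same level" could be interpreted.

```python
import re
from collections import Counter
from itertools import combinations


def split_identifier(name):
    """Split on underscores and camelCase boundaries, lowercased (no stemming)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def node(token=None, *children):
    """Hypothetical toy tree node; token is None for non-identifier nodes."""
    return {"token": token, "children": list(children)}


def nearest_identifiers(n):
    """Identifier nodes reachable from n through non-identifier nodes only."""
    found = []
    for child in n["children"]:
        if child["token"] is not None:
            found.append(child)
        else:
            found.extend(nearest_identifiers(child))  # collapse this path
    return found


def cooccurrences(root):
    """Count unordered token pairs on the same level and parent-child pairs."""
    counts = Counter()
    stack = [root]
    while stack:
        current = stack.pop()
        level = nearest_identifiers(current)
        level_tokens = [t for c in level for t in split_identifier(c["token"])]
        # Identifiers on the same (collapsed) level co-occur with each other.
        for a, b in combinations(level_tokens, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
        # Connect the level with its immediate identifier parent, if any.
        if current["token"] is not None:
            for p in split_identifier(current["token"]):
                for t in level_tokens:
                    if p != t:
                        counts[tuple(sorted((p, t)))] += 1
        stack.extend(level)
    return counts


# Toy example: a function "readFile" whose body mentions two identifiers.
tree = node(None, node("readFile", node(None, node("file_name"), node("buffer"))))
counts = cooccurrences(tree)
```

In this sketch `counts` maps sorted token pairs such as `("buffer", "name")` to their counts. A global matrix as in step 6 would sum such counters over all files and repositories.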

#### Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided that we have document frequencies ("docfreq") and identifier embeddings ("id2vec").

1. Clone or read the repository from disk.
2. Classify files using enry.
3. Extract UAST from each supported file.
4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
5. Group by repository, file or function.
6. Set the weight of each such feature according to TF-IDF.
7. Write the BOW model.

Steps 1-7 are performed with the `repos2bow` command.
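The weighting in steps 5 and 6 can be illustrated with a small sketch. It assumes one common TF-IDF variant (raw term count multiplied by `log(N / df)`), which is not necessarily the exact formula used by `repos2bow`. The function names and the toy feature lists are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each feature appears ("docfreq")."""
    df = Counter()
    for features in docs.values():
        df.update(set(features))
    return df


def tfidf_bow(docs):
    """Map each document to {feature: tf * log(N / df)} (assumed variant)."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    bow = {}
    for name, features in docs.items():
        tf = Counter(features)
        bow[name] = {
            feature: count * math.log(n_docs / df[feature])
            for feature, count in tf.items()
        }
    return bow


# Toy input: features already extracted and grouped by repository.
docs = {
    "repo_a": ["read", "file", "file", "buffer"],
    "repo_b": ["file", "socket"],
}
df = document_frequencies(docs)
bow = tfidf_bow(docs)
```

With this variant, a feature that occurs in every document (here `file`) gets zero weight, while features specific to one repository are weighted up.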

#### Topic modeling

See [here](doc/

## Glossary

See [here](