Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/src-d/ml

sourced.ml is a library and command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees
https://github.com/src-d/ml

ast machine-learning mloncode word2vec

Host: GitHub
URL: https://github.com/src-d/ml
Owner: src-d
License: other
Archived: true
Created: 2017-06-12T15:21:41.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2019-05-22T09:56:11.000Z (about 5 years ago)
Last Synced: 2024-04-01T09:20:26.774Z (3 months ago)
Topics: ast, machine-learning, mloncode, word2vec
Language: Python
Homepage:
Size: 28.3 MB
Stars: 141
Watchers: 16
Forks: 44
Open Issues: 26
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
- License: license.md
- Code of conduct: code_of_conduct.md

Lists

awesome-machine-learning-on-source-code - sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models. (Software)

README

# MLonCode research playground [![PyPI](https://img.shields.io/pypi/v/sourced-ml.svg)](https://pypi.python.org/pypi/sourced-ml) [![Build Status](https://travis-ci.org/src-d/ml.svg)](https://travis-ci.org/src-d/ml) [![Docker Build Status](https://img.shields.io/docker/build/srcd/ml.svg)](https://hub.docker.com/r/srcd/ml) [![codecov](https://codecov.io/github/src-d/ml/coverage.svg)](https://codecov.io/gh/src-d/ml)

**This project is no longer maintained; it has evolved into several others:**

* [ml-core](https://github.com/src-d/ml-core) - the bits which are independent of mining tools.

* [ml-mining](https://github.com/src-d/ml-mining) - general purpose mining environment, currently based on the deprecated [jgit-spark-connector](https://github.com/src-d/jgit-spark-connector).

**Below goes the original README.**

This project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It abstracts feature extraction and model training, so that you can focus on higher-level tasks.

Currently, the following models are implemented:

* BOW - weighted bag of x, where x can be any of several extracted feature types.

* id2vec - source code identifier embeddings.

* docfreq - feature document frequencies \(part of TF-IDF\).

* topic modeling over source code identifiers.

It is written in Python 3 and has been tested on Linux and macOS. source{d} ml is tightly coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.

Here is the list of proof-of-concept projects which are built using sourced.ml:

* [vecino](https://github.com/src-d/vecino) - finding similar repositories.

* [tmsc](https://github.com/src-d/tmsc) - listing topics of a repository.

* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.

* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.

## Installation

Whether you include Spark in your installation or reuse an existing one, `sourced-ml` needs some native libraries. On Ubuntu, for example, you must first run `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org) is also a requirement; both the CPU and GPU versions are supported. To select a version, change the package name in the install commands below to either `sourced-ml[tf]` or `sourced-ml[tf-gpu]`.

**If you don't, neither version will be installed.**

### With Apache Spark included

```text
pip3 install sourced-ml
```

### Use existing Apache Spark

If you already have Apache Spark installed and configured in your environment at `$SPARK_HOME`, you can reuse it and avoid downloading about 200 MB by using [pip "editable installs"](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs):

```text
pip3 install -e "$SPARK_HOME/python"
pip3 install sourced-ml
```

In both cases, you will need to have some native libraries installed, e.g. on Ubuntu: `apt install libxml2-dev libsnappy-dev`. Some parts require [Tensorflow](https://tensorflow.org).
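To check which of the optional Python dependencies are visible to the interpreter before running anything, a small helper like the following may be useful. This is only a sketch and not part of sourced.ml: the function name `check_optional_dependencies` is hypothetical, and it assumes that Spark's Python bindings and Tensorflow are importable under the module names `pyspark` and `tensorflow`.

```python
import importlib.util


def check_optional_dependencies(modules=("pyspark", "tensorflow")):
    """Map each top-level module name to True if the interpreter can locate it.

    Hypothetical helper for illustration; sourced.ml does not ship it.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

If `tensorflow` is reported as missing, the package was most likely installed without the `[tf]` or `[tf-gpu]` extra.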

## Usage

This project exposes two interfaces: a Python API and a command line. The command line entry point is `srcml`:

```text
srcml --help
```

## Docker image

```text
docker run -it --rm srcd/ml --help
```

If this first command fails with

```text
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
```

and you are sure that the daemon is running, then you need to add your user to the `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

## Contributions

...are welcome! See [CONTRIBUTING](contributing.md) and [CODE\_OF\_CONDUCT.md](code_of_conduct.md).

## License

[Apache 2.0](license.md)

## Algorithms

#### Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

1. Read Git repositories.

2. Classify files using [enry](https://github.com/src-d/enry).

3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.

4. [Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.

5. [Traverse UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents.

6. Write the global co-occurrence matrix.

7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) \(requires Tensorflow\). Interactively view the intermediate results in Tensorboard using `--logs`.

8. Write the identifier embeddings model.

Steps 1-5 are performed with the `repos2coocc` command, step 6 with `id2vec_preproc`, step 7 with `id2vec_train`, and step 8 with `id2vec_postproc`.
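Steps 4 and 5 above can be illustrated with a toy example. The sketch below is not the code from `token_parser.py` or `coocc.py`; it makes several simplifying assumptions: a tree node is a plain dict with hypothetical keys `id` and `children` instead of a real UAST node, stemming is omitted, digits are dropped, and sub-tokens of the same identifier are counted as co-occurring with each other.

```python
import re
from collections import Counter
from itertools import combinations


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase sub-tokens."""
    parts = []
    for chunk in re.split(r"[^A-Za-z]+", name):
        # Acronyms such as "HTTP" stay in one piece; "Response" is a separate piece.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [part.lower() for part in parts]


def collapsed_children(node):
    """Nearest identifier nodes below `node`, skipping non-identifier nodes."""
    found = []
    for child in node.get("children", []):
        if child.get("id"):
            found.append(child)
        else:
            found.extend(collapsed_children(child))
    return found


def cooccurrences(root):
    """Count token pairs on the same collapsed level and between parent and child."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        level = collapsed_children(node)
        tokens = [t for child in level for t in split_identifier(child["id"])]
        # Identifiers on the same level co-occur with each other.
        for a, b in combinations(tokens, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
        # Each identifier also co-occurs with its immediate identifier parent.
        if node.get("id"):
            for parent_token in split_identifier(node["id"]):
                for t in tokens:
                    if parent_token != t:
                        counts[tuple(sorted((parent_token, t)))] += 1
        stack.extend(level)
    return counts


# Toy tree: a function "parseFile" whose body (a non-identifier block)
# contains the identifiers "file_name" and "reader".
tree = {"id": None, "children": [
    {"id": "parseFile", "children": [
        {"id": None, "children": [
            {"id": "file_name", "children": []},
            {"id": "reader", "children": []},
        ]},
    ]},
]}

print(split_identifier("readHTTPResponse"))  # ['read', 'http', 'response']
for pair, count in sorted(cooccurrences(tree).items()):
    print(pair, count)
```

In the real pipeline the per-repository counts would then be merged into the global sparse co-occurrence matrix that Swivel is trained on.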

#### Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided that document frequencies \("docfreq"\) and identifier embeddings \("id2vec"\) are available.

1. Clone or read the repository from disk.

2. Classify files using [enry](https://github.com/src-d/enry).

3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.

4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.

5. Group by repository, file or function.

6. Set the weight of each such feature according to TF-IDF.

7. Write the BOW model.

Steps 1-7 are performed with the `repos2bow` command.
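The weighting in step 6 can be sketched in a few lines of plain Python. This is an illustration only, not what `repos2bow` runs: the function names are hypothetical, the features are assumed to be already extracted and grouped per repository, and the weight used is one common TF-IDF variant, the raw count multiplied by `log(N / df)`; sourced.ml may use a different formula.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each feature appears ("docfreq")."""
    df = Counter()
    for features in docs.values():
        df.update(set(features))
    return df


def weighted_bow(docs):
    """Build a TF-IDF weighted bag of features for every document."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    bags = {}
    for name, features in docs.items():
        tf = Counter(features)
        # Assumed variant: raw term frequency times log(N / document frequency).
        bags[name] = {f: count * math.log(n_docs / df[f]) for f, count in tf.items()}
    return bags


# Hypothetical features, grouped by repository.
docs = {
    "repo_a": ["parse", "file", "file"],
    "repo_b": ["parse", "socket"],
}

for name, bag in weighted_bow(docs).items():
    print(name, bag)
```

A feature that occurs in every repository, like `parse` here, gets weight zero, which is why the document frequencies must be computed before the bags can be written.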

#### Topic modeling

See [here](doc/topic_modeling.md).

## Glossary

See [here](GLOSSARY.md).