https://github.com/danhper/suplearn-clone-detection

Cross language clone detection using supervised learning

Host: GitHub
URL: https://github.com/danhper/suplearn-clone-detection
Owner: danhper
Created: 2017-10-14T13:19:32.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2022-01-21T19:20:14.000Z (almost 3 years ago)
Last Synced: 2024-04-14T11:08:04.990Z (8 months ago)
Language: Python
Homepage:
Size: 235 KB
Stars: 15
Watchers: 1
Forks: 7
Open Issues: 6
Metadata Files:
- Readme: README.md


# suplearn-clone-detection

[![CircleCI](https://circleci.com/gh/danhper/suplearn-clone-detection.svg?style=svg&circle-token=738ac3f3e6453f2beef09c2bf1a2e72d2a959ee0)](https://circleci.com/gh/tuvistavie/suplearn-clone-detection)

## Setup

```
pip install -r requirements.txt
python setup.py develop
```

Note that TensorFlow needs to be installed separately by following [these steps][tensorflow-install].

## Configuration

First, copy [config.yml.example](./config.yml.example) to `config.yml`:

```
cp config.yml.example config.yml
```

Then, modify the content of `config.yml`. The configuration file is self-documenting, and the most important parameters we used can be found in the paper.

## Dataset

We train our model using a dataset with data extracted from the competitive programming website AtCoder: https://atcoder.jp.

The dataset can be downloaded as an SQLite3 database: [java-python-clones.db.gz][cross-language-clones-db].

You will most likely need to decompress the database before using it.

We also provide the raw data as a tarball but it should generally not be needed: [java-python-clones.tar.gz][cross-language-clones-tar].
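If `gunzip` is not available, the decompression step can be sketched in Python with the standard library. The function name `decompress_gzip` and the output file name in the usage note are illustrative choices, not part of this project.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src, dst):
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return Path(dst)
```

For example, `decompress_gzip("java-python-clones.db.gz", "java-python-clones.db")` would write the decompressed database next to the downloaded archive.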

The database contains both the text representation and the AST representation of the source code. All the data is in the `submissions` table. We describe the columns of the table below.

Name | Type | Description
-----|------|------------
id | INTEGER | Primary key for the submission
url | VARCHAR(255) | URL of the problem on AtCoder
contest_type | VARCHAR(64) | Contest type on AtCoder (beginner or regular)
contest_id | INTEGER | Contest ID on AtCoder
problem_id | INTEGER | Problem ID on AtCoder
problem_title | VARCHAR(255) | Problem title on AtCoder (usually in Japanese)
filename | VARCHAR(255) | Original path of the file
language | VARCHAR(64) | Full name of the language used
language_code | VARCHAR(64) | Short name of the language used
source_length | INTEGER | Source length in bytes
exec_time | INTEGER | Execution time in ms
tokens_count | INTEGER | Number of tokens in the source
source | TEXT | Source code of the submission
ast | TEXT | JSON encoded AST representation of the source code

The database also contains a `samples` table which should be populated using the `suplearn-clone generate-dataset` command.
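As a minimal sketch of how the `submissions` table could be inspected, the snippet below reads a few rows with Python's built-in `sqlite3` module and decodes the `ast` column from JSON. The helper name `load_submissions` is hypothetical, and the values stored in `language_code` are not listed here, so check them first (for example with `SELECT DISTINCT language_code FROM submissions`).

```python
import json
import sqlite3


def load_submissions(db_path, language_code, limit=5):
    """Return (id, problem_id, source, parsed AST) tuples for one language code."""
    query = (
        "SELECT id, problem_id, source, ast FROM submissions "
        "WHERE language_code = ? ORDER BY id LIMIT ?"
    )
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query, (language_code, limit)).fetchall()
    finally:
        conn.close()
    # The ast column holds a JSON string; decode it into Python objects.
    return [(sid, pid, src, json.loads(ast)) for sid, pid, src, ast in rows]
```

The structure of the decoded AST is not described in this README, so the sketch only parses the JSON and leaves its interpretation to the caller.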

## Usage

The following steps assume that the model has already been configured in `config.yml`.

### Generating training samples

Before training the model, the clone pairs for the training, cross-validation, and test sets must first be generated using the following command:

```
suplearn-clone generate-dataset -c /path/to/config.yml
```

### Training the model

Once the data is generated, the model can be trained using the following command:

```
suplearn-clone train -c /path/to/config.yml
```
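When the two steps above are scripted, they can be chained so that training only starts if dataset generation succeeds. The following is a rough sketch using `subprocess`; the function names `pipeline_commands` and `run_pipeline` are hypothetical, and only the two commands and the `-c` flag shown above are used.

```python
import subprocess


def pipeline_commands(config_path):
    """Build the argument lists for dataset generation followed by training."""
    return [
        ["suplearn-clone", "generate-dataset", "-c", config_path],
        ["suplearn-clone", "train", "-c", config_path],
    ]


def run_pipeline(config_path):
    """Run each step in order; check=True raises on the first failing step."""
    for command in pipeline_commands(config_path):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("/path/to/config.yml")` requires the `suplearn-clone` executable to be on the `PATH`, which the setup steps should provide.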

### Testing the model

The model can be evaluated on test data by using the following command:

```
suplearn-clone evaluate -c /path/to/config.yml -m /path/to/model.h5 --data-type= -o results.json
```

Note that `config.yml` should be the same file as the one used for training.

## Using pre-trained embeddings

Pre-trained embeddings can be used by setting `model.languages.n.embeddings` in the configuration file.

This repository does not provide any functionality to train embeddings. Please check the [bigcode-tools][bigcode-tools] repository for instructions on how to train embeddings.
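To illustrate how a dotted setting name such as `model.languages.n.embeddings` could map onto a nested configuration, here is a small sketch. It assumes that `n` is a zero-based index into a list of languages, and the dictionary layout and values are hypothetical; refer to `config.yml.example` for the actual schema.

```python
def get_setting(config, dotted_path):
    """Follow a dotted path through nested dicts and lists; list parts are indices."""
    node = config
    for part in dotted_path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Hypothetical layout for illustration only; the real schema is in config.yml.example.
example_config = {
    "model": {
        "languages": [
            {"name": "java", "embeddings": "/path/to/java-embeddings"},
            {"name": "python", "embeddings": None},
        ]
    }
}
```

With this layout, `get_setting(example_config, "model.languages.0.embeddings")` returns the path configured for the first language.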

## Citing the project

If you are using this for academic work, we would be thankful if you could cite the following paper.

```
@inproceedings{Perez:2019:CCD:3341883.3341965,
 author = {Perez, Daniel and Chiba, Shigeru},
 title = {Cross-language Clone Detection by Learning over Abstract Syntax Trees},
 booktitle = {Proceedings of the 16th International Conference on Mining Software Repositories},
 series = {MSR '19},
 year = {2019},
 location = {Montreal, Quebec, Canada},
 pages = {518--528},
 numpages = {11},
 url = {https://doi.org/10.1109/MSR.2019.00078},
 doi = {10.1109/MSR.2019.00078},
 acmid = {3341965},
 publisher = {IEEE Press},
 address = {Piscataway, NJ, USA},
 keywords = {clone detection, machine learning, source code representation},
}
```

[tensorflow-install]: https://www.tensorflow.org/install

[cross-language-clones-db]: https://static.perez.sh/research/2019/cross-language-clone-detection/datasets/java-python-clones.db.gz

[cross-language-clones-tar]: https://static.perez.sh/research/2019/cross-language-clone-detection/datasets/java-python-clones.tar.gz

[bigcode-tools]: https://github.com/danhper/bigcode-tools