https://github.com/simonepri/varname-seq2seq
📄Source code variable naming using a seq2seq architecture
- Host: GitHub
- URL: https://github.com/simonepri/varname-seq2seq
- Owner: simonepri
- License: mit
- Created: 2020-02-07T07:00:40.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-03-19T00:26:25.000Z (almost 5 years ago)
- Last Synced: 2025-01-09T17:33:59.528Z (about 1 month ago)
- Topics: nlp, pytorch, rnn, seq2seq
- Language: Python
- Homepage:
- Size: 129 KB
- Stars: 10
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license
# varname-seq2seq

📄 Source code variable naming using a seq2seq architecture.

## Synopsis

varname-seq2seq is a sequence-to-sequence model for source code that can be trained to perform variable naming for virtually any programming language.
The image below shows an example of input for the model and the respective output produced.
You can try a demo of this model for Java [using this Colab notebook][colab:demo-java].

## Variable Naming

By variable naming, we mean the task of suggesting the name of a particular variable (a local variable or a method argument) in a piece of code.
The suggested names should ideally be the ones that an experienced developer would choose in the particular context in which the variable is used.

For example, if the model receives the first piece of code below, we may want it to suggest the second one, in which `s` is replaced with `sum_of_squares`.

```python
def score(X):
    s = 0.0
    for x in X:
        s += x * x
    return s
```

```python
def score(X):
    sum_of_squares = 0.0
    for x in X:
        sum_of_squares += x * x
    return sum_of_squares
```

## Dataset generation

To train the model, we extract naming examples from a large corpus of several open-source projects in a given language.
A naming example is a piece of code in which we mask all the occurrences of a particular variable with a special `` token, and then we ask the model to predict the original variable name we masked.
When we generate naming examples, we can also obfuscate all the occurrences of surrounding variable names with the special `` to discourage the model from learning to name a variable relying on surrounding variable names.
In the following, we will be using the `obf` abbreviation to indicate that we used the obfuscation strategy just described.

Let us take the following piece of Java code as an example.

```java
public class Test { Test ( int a ) { int b = a ; } }
```
From this, we can extract two naming examples: one in which we mask all the occurrences of the variable `a`, and one in which we do the same for the variable `b`.

```java
public class Test { Test ( int ) { int = ; } }
public class Test { Test ( int ) { int = ; } }
```

All these examples are divided into four splits.
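
The extraction of naming examples illustrated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual generator: the placeholder strings `<mask>` and `<obf>`, the helper names, and the idea of passing the set of variable names in by hand are all hypothetical. A real pipeline would need a parser to tell variables apart from other identifiers such as `Test`.

```python
from typing import List, Set, Tuple

MASK = "<mask>"  # hypothetical placeholder for the variable to predict
OBF = "<obf>"    # hypothetical placeholder for obfuscated surrounding variables


def make_naming_example(
    tokens: List[str], variables: Set[str], target: str, obfuscate: bool = False
) -> Tuple[List[str], str]:
    """Mask every occurrence of `target`; optionally obfuscate other variables."""
    masked = []
    for tok in tokens:
        if tok == target:
            masked.append(MASK)
        elif obfuscate and tok in variables:
            masked.append(OBF)
        else:
            masked.append(tok)
    return masked, target


def naming_examples(
    tokens: List[str], variables: Set[str], obfuscate: bool = False
) -> List[Tuple[List[str], str]]:
    """Build one naming example per variable, in alphabetical order."""
    return [
        make_naming_example(tokens, variables, name, obfuscate)
        for name in sorted(variables)
    ]


# The Java snippet from the text, already tokenized on whitespace.
code = "public class Test { Test ( int a ) { int b = a ; } }".split()
variables = {"a", "b"}  # assumed to come from a parser in a real pipeline

plain_examples = naming_examples(code, variables)
obf_examples = naming_examples(code, variables, obfuscate=True)

for masked, name in plain_examples + obf_examples:
    print(name, "<-", " ".join(masked))
```

With `obfuscate=True`, every variable other than the target is replaced by the obfuscation placeholder, which corresponds to the `obf` setting described earlier.
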
We pick an arbitrary number of projects, and we use all the examples from these projects to create the `unseen` test set on which the model is tested. This set is made of projects from which the model has never seen any examples.
Then we use the remaining projects, and we randomly split all the examples extracted into the three balanced splits: train-dev-test.

### Pre-generated datasets

We distribute the pre-generated datasets shown in the table below.
If you need more, you can generate new ones on your own by using [this Colab notebook][colab:dataset].

| Name | Language | Download |
|------|----------|----------|
| java-obf | Java | [java-corpora-dataset-obfuscated.tgz][download:java-corpora-dataset-obfuscated.tgz] |
| java | Java | [java-corpora-dataset.tgz][download:java-corpora-dataset.tgz] |

## Model training

The core idea of the model is to capture the syntactic usage context of a variable across a given fragment of code, and then to use this usage context to predict a natural name for a particular variable.
The intuition is that the usage context of a particular variable should contain enough information to describe how the variable is used, thus allowing us to derive an appropriate name.

This is achieved using two neural networks in an Encoder-Decoder architecture: one that condenses a sequence of tokens into an efficient vector representation that makes up the usage context, and another network that predicts a suitable name for the given usage context.
The image below shows a pictorial representation of the encoder-decoder model.
`e` and `d` are two embedding layers, `z` is the usage context, and `f` is a linear layer.

### Pre-trained models

We distribute the pre-trained models shown in the table below.
If you want to train the model on a different dataset, you can do so by using [this Colab notebook][colab:model].

| Name | Language | Download |
|------|----------|----------|
| java-obf | Java | [java-lstm-1-256-256-dtf-lrs-obf.tgz][download:java-lstm-1-256-256-dtf-lrs-obf.tgz] |
| java | Java | [java-lstm-1-256-256-dtf-lrs.tgz][download:java-lstm-1-256-256-dtf-lrs.tgz] |

## Evaluation

To assess the effectiveness of the model, two primary metrics are considered: accuracy (ACC) and edit distance (EDIST).
Both metrics measure the ability of the model to recover the original names from the usage context of a particular variable, but they do so in different ways.
The former measures exact target-prediction subword alignment, while the latter measures how many subword units need to be changed to transform the prediction into the target.

The following two figures show some simple examples of how the two metrics are computed.
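
Independently of the figures, a minimal sketch of how such subword-level metrics could be computed is given below. Several parts are assumptions rather than the project's definitions: the `split_subwords` rule (snake_case and camelCase splitting), treating ACC as an exact match of the whole subword sequence, and turning the edit distance into a percentage-like score via `1 - distance / max(length)`. All function names are hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def split_subwords(name: str) -> List[str]:
    """Split snake_case / camelCase identifiers into lowercase subwords (assumed rule)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def edit_distance(pred: Sequence[str], target: Sequence[str]) -> int:
    """Levenshtein distance counted in subword units instead of characters."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert t
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def edit_similarity(pred: Sequence[str], target: Sequence[str]) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(pred), len(target), 1)
    return 1.0 - edit_distance(pred, target) / longest


def evaluate(pairs: Sequence[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (ACC, EDIST) averaged over (prediction, target) name pairs."""
    acc_total, sim_total = 0.0, 0.0
    for prediction, target in pairs:
        p, t = split_subwords(prediction), split_subwords(target)
        acc_total += 1.0 if p == t else 0.0  # assumed: exact sequence match
        sim_total += edit_similarity(p, t)
    return acc_total / len(pairs), sim_total / len(pairs)


pairs = [
    ("sum_of_squares", "sum_of_squares"),  # exact match
    ("sumSquares", "sum_of_squares"),      # one subword missing
    ("total", "count"),                    # completely different
]
acc, edist = evaluate(pairs)
print(f"ACC = {acc:.2%}, EDIST = {edist:.2%}")
```

Under these assumptions a prediction such as `sumSquares` scores zero on ACC but still earns partial credit on EDIST, which is why the two metrics are complementary.
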
### Results
The following table reports the metrics for the different model and dataset combinations we distribute.

| Model | Dataset | Test ACC - EDIST | Unseen ACC - EDIST | Test & Unseen AVG |
|-------|---------|:-------------------:|:----------------------:|:---------------------:|
| java-obf | java-obf | 73.56% - 91.25% | **45.26%** - 80.92% | 72.75% |
| java | java | **73.54%** - 91.25% | 45.13% - **81.09%** | 72.75% |

## Authors

- **Simone Primarosa** - [simonepri][github:simonepri]
See also the list of [contributors][contributors] who participated in this project.
## License
This project is licensed under the MIT License - see the [license][license] file for details.
[license]: https://github.com/simonepri/varname-seq2seq/tree/master/license
[contributors]: https://github.com/simonepri/varname-seq2seq/contributors

[src/bin]: https://github.com/simonepri/varname-seq2seq/tree/master/src/bin
[download:java-lstm-1-256-256-dtf-lrs-obf.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-lstm-1-256-256-dtf-lrs-obf.tgz
[download:java-lstm-1-256-256-dtf-lrs.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-lstm-1-256-256-dtf-lrs.tgz
[download:java-corpora-dataset.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-corpora-dataset.tgz
[download:java-corpora-dataset-obfuscated.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-corpora-dataset-obfuscated.tgz

[repo:Bukkit/Bukkit]: https://github.com/Bukkit/Bukkit
[repo:clojure/clojure]: https://github.com/clojure/clojure
[repo:apache/dubbo]: https://github.com/apache/dubbo
[repo:google/error-prone]: https://github.com/google/error-prone
[repo:grails/grails-core]: https://github.com/grails/grails-core
[repo:google/guice]: https://github.com/google/guice
[repo:hibernate/hibernate-orm]: https://github.com/hibernate/hibernate-orm
[repo:jhy/jsoup]: https://github.com/jhy/jsoup
[repo:junit-team/junit4]: https://github.com/junit-team/junit4
[repo:apache/kafka]: https://github.com/apache/kafka
[repo:libgdx/libgdx]: https://github.com/libgdx/libgdx
[repo:dropwizard/metrics]: https://github.com/dropwizard/metrics
[repo:square/okhttp]: https://github.com/square/okhttp
[repo:spring-projects/spring-framework]: https://github.com/spring-projects/spring-framework
[repo:apache/tomcat]: https://github.com/apache/tomcat
[repo:apache/cassandra]: https://github.com/apache/cassandra

[github:simonepri]: https://github.com/simonepri
[colab:demo-java]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/predict_java.ipynb
[colab:model]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/train.ipynb
[colab:dataset]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/dataset_generation.ipynb