{"id":13412073,"url":"https://github.com/simonepri/varname-seq2seq","last_synced_at":"2025-04-28T16:05:38.108Z","repository":{"id":66099011,"uuid":"238862564","full_name":"simonepri/varname-seq2seq","owner":"simonepri","description":"📄Source code variable naming using a seq2seq architecture","archived":false,"fork":false,"pushed_at":"2020-03-19T00:26:25.000Z","size":132,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-28T16:05:37.901Z","etag":null,"topics":["nlp","pytorch","rnn","seq2seq"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonepri.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"license","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-07T07:00:40.000Z","updated_at":"2024-08-14T20:45:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"a53a85c5-9c07-4c83-a97a-4ef08cea5d7d","html_url":"https://github.com/simonepri/varname-seq2seq","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fvarname-seq2seq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fvarname-seq2seq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fvarname-seq2seq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fvarname-seq2seq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/simonepri","download_url":"https://codeload.github.com/simonepri/varname-seq2seq/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251342723,"owners_count":21574244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","pytorch","rnn","seq2seq"],"created_at":"2024-07-30T20:01:20.734Z","updated_at":"2025-04-28T16:05:38.037Z","avatar_url":"https://github.com/simonepri.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003cb\u003evarname-seq2seq\u003c/b\u003e\n\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n  \u003c!-- CI - TravisCI --\u003e\n  \u003ca href=\"https://travis-ci.com/simonepri/varname-seq2seq\"\u003e\n    \u003cimg src=\"https://img.shields.io/travis/com/simonepri/varname-seq2seq/master.svg\" alt=\"Build Status\" /\u003e\n  \u003c/a\u003e\n  \u003c!-- License - MIT --\u003e\n  \u003ca href=\"https://github.com/simonepri/varname-seq2seq/tree/master/license\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/simonepri/varname-seq2seq.svg\" alt=\"Project license\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  📄Source code variable naming using a seq2seq architecture.\n\u003c/p\u003e\n\n\n## Synopsis\n\nvarname-seq2seq is a sequence-to-sequence model for source code that can be trained to suggest variable names in virtually any programming language.\n\nThe image below shows an example input for the model and the corresponding output it produces.  
\nYou can try a demo of this model for Java [using this Colab notebook][colab:demo-java].\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#\"\u003e\n    \u003cimg height=\"350\" src=\"https://user-images.githubusercontent.com/3505087/77019200-e6a7c200-6977-11ea-9b96-e51824ddcb62.png\" alt=\"model\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\n## Variable Naming\n\nBy variable naming, we mean the task of suggesting a name for a particular variable (a local variable or a method argument) in a piece of code.\nThe suggested names should ideally be the ones that an experienced developer would choose in the particular context in which the variable is used.\n\nFor example, if the model receives the piece of code on the left, we would like it to suggest the version on the right, in which `s` is replaced with `sum_of_squares`.\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\n```python\ndef score(X):\n  s = 0.0\n  for x in X:\n    s += x * x\n  return s\n```\n\n\u003c/td\u003e\n\u003ctd\u003e\n\n```python\ndef score(X):\n  sum_of_squares = 0.0\n  for x in X:\n    sum_of_squares += x * x\n  return sum_of_squares\n```\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n\n## Dataset generation\n\nTo train the model, we extract naming examples from a large corpus of open-source projects in a given language.\nA naming example is a piece of code in which we mask all the occurrences of a particular variable with a special `\u003cmask\u003e` token, and then we ask the model to predict the original variable name we masked.\nWhen we generate naming examples, we can also obfuscate all the occurrences of surrounding variable names with special `\u003cvarX\u003e` tokens, to discourage the model from learning to name a variable by relying on the names of surrounding variables.\nIn the following, the `obf` abbreviation indicates that the obfuscation strategy just described was used.\n\nLet us take the following piece of Java code as an 
example.\n```java\npublic class Test { Test ( int a ) { int b = a ; } }\n```\nFrom this, we can extract two naming examples: one in which we mask all the occurrences of the variable `a`, and one in which we do the same for the variable `b`. With obfuscation enabled, the two examples are:\n\n```java\npublic class Test { Test ( int \u003cmask\u003e ) { int \u003cvar2\u003e = \u003cmask\u003e ; } }\npublic class Test { Test ( int \u003cvar1\u003e ) { int \u003cmask\u003e = \u003cvar1\u003e ; } }\n```\n\nAll these examples are divided into four splits.\nWe pick an arbitrary number of projects, and we use all the examples from these projects to create the `unseen` test set on which the model is tested. This set consists only of projects from which the model has never seen any example.\nThen we take the remaining projects, and we randomly split all the examples extracted from them into three balanced splits: train, dev, and test.\n\n### Pre-generated datasets\n\nWe distribute the pre-generated datasets shown in the table below.  \nIf you need more, you can generate new ones on your own by using [this Colab notebook][colab:dataset].\n\n| Name | Language | Download |\n|------|----------|----------|\n| java-obf | Java | [![Download java-corpora-dataset-obfuscated.tgz](https://img.shields.io/github/downloads/simonepri/varname-seq2seq/latest/java-corpora-dataset-obfuscated.tgz.svg)][download:java-corpora-dataset-obfuscated.tgz] |\n| java | Java | [![Download java-corpora-dataset.tgz](https://img.shields.io/github/downloads/simonepri/varname-seq2seq/latest/java-corpora-dataset.tgz.svg)][download:java-corpora-dataset.tgz] |\n\n\n## Model training\n\nThe core idea of the model is to capture the syntactic usage context of a variable across a given fragment of code, and then to use this usage context to predict a natural name for that variable.\nThe intuition is that the usage context of a particular variable should contain enough information to describe how the variable is used, thus allowing us to derive an appropriate 
name.\n\nThis is achieved using two neural networks in an encoder-decoder architecture: one that condenses a sequence of tokens into a compact vector representation that makes up the usage context, and another that predicts a suitable name for the given usage context.\n\nThe image below shows a pictorial representation of the encoder-decoder model.  \n`e` and `d` are two embedding layers, `z` is the usage context, and `f` is a linear layer.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#\"\u003e\n    \u003cimg height=\"250\" src=\"https://user-images.githubusercontent.com/3505087/77015522-f1108e80-696c-11ea-837d-b5aa2328546c.png\" alt=\"model\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n### Pre-trained models\n\nWe distribute the pre-trained models shown in the table below.  \nIf you want to train the model on a different dataset, you can do so by using [this Colab notebook][colab:model].\n\n| Name | Language | Download |\n|------|----------|----------|\n| java-obf | Java | [![Download java-lstm-1-256-256-dtf-lrs-obf.tgz](https://img.shields.io/github/downloads/simonepri/varname-seq2seq/latest/java-lstm-1-256-256-dtf-lrs-obf.tgz.svg)][download:java-lstm-1-256-256-dtf-lrs-obf.tgz] |\n| java | Java | [![Download java-lstm-1-256-256-dtf-lrs.tgz](https://img.shields.io/github/downloads/simonepri/varname-seq2seq/latest/java-lstm-1-256-256-dtf-lrs.tgz.svg)][download:java-lstm-1-256-256-dtf-lrs.tgz] |\n\n\n## Evaluation\n\nTo assess the effectiveness of the model, two primary metrics are considered: accuracy (ACC) and edit distance (EDIST).\nBoth metrics measure the ability of the model to recover the original names from the usage context of a particular variable, but they do so in different ways.\nThe former measures exact target-prediction subword alignment, while the latter measures how many subword units need to be changed to transform the prediction into the target.\n\nThe following two figures show some simple examples of how the two metrics are 
computed.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#\"\u003e\n    \u003cimg height=\"100\" src=\"https://user-images.githubusercontent.com/3505087/77015949-3b463f80-696e-11ea-86f5-2c72811e21c5.png\" alt=\"model\" /\u003e\n    \u003cimg height=\"100\" src=\"https://user-images.githubusercontent.com/3505087/77015962-426d4d80-696e-11ea-84b0-d36d936380ce.png\" alt=\"model\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\n### Results\n\nThe following table reports the metrics for each model-dataset pair we distribute.\n\n| Model | Dataset | Test\u003cbr\u003eACC - EDIST | Unseen \u003cbr\u003eACC - EDIST | Test \u0026 Unseen \u003cbr\u003eAVG |\n|-------|---------|:-------------------:|:----------------------:|:---------------------:|\n| java-obf | java-obf | 73.56% - 91.25% | **45.26%** - 80.92% | 72.75% |\n| java | java | **73.54%** - 91.25% | 45.13% - **81.09%** | 72.75% |\n\n\n## Authors\n\n- **Simone Primarosa** - [simonepri][github:simonepri]\n\nSee also the list of [contributors][contributors] who participated in this project.\n\n\n## License\n\nThis project is licensed under the MIT License - see the [license][license] file for details.\n\n\n\u003c!-- Links --\u003e\n[license]: https://github.com/simonepri/varname-seq2seq/tree/master/license\n[contributors]: https://github.com/simonepri/varname-seq2seq/contributors\n\n[src/bin]: https://github.com/simonepri/varname-seq2seq/tree/master/src/bin\n\n[download:java-lstm-1-256-256-dtf-lrs-obf.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-lstm-1-256-256-dtf-lrs-obf.tgz\n[download:java-lstm-1-256-256-dtf-lrs.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-lstm-1-256-256-dtf-lrs.tgz\n[download:java-corpora-dataset.tgz]: https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-corpora-dataset.tgz\n[download:java-corpora-dataset-obfuscated.tgz]: 
https://github.com/simonepri/varname-seq2seq/releases/latest/download/java-corpora-dataset-obfuscated.tgz\n\n[repo:Bukkit/Bukkit]: https://github.com/Bukkit/Bukkit\n[repo:clojure/clojure]: https://github.com/clojure/clojure\n[repo:apache/dubbo]: https://github.com/apache/dubbo\n[repo:google/error-prone]: https://github.com/google/error-prone\n[repo:grails/grails-core]: https://github.com/grails/grails-core\n[repo:google/guice]: https://github.com/google/guice\n[repo:hibernate/hibernate-orm]: https://github.com/hibernate/hibernate-orm\n[repo:jhy/jsoup]: https://github.com/jhy/jsoup\n[repo:junit-team/junit4]: https://github.com/junit-team/junit4\n[repo:apache/kafka]: https://github.com/apache/kafka\n[repo:libgdx/libgdx]: https://github.com/libgdx/libgdx\n[repo:dropwizard/metrics]: https://github.com/dropwizard/metrics\n[repo:square/okhttp]: https://github.com/square/okhttp\n[repo:spring-projects/spring-framework]: https://github.com/spring-projects/spring-framework\n[repo:apache/tomcat]: https://github.com/apache/tomcat\n[repo:apache/cassandra]: https://github.com/apache/cassandra\n\n[github:simonepri]: https://github.com/simonepri\n\n[colab:demo-java]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/predict_java.ipynb\n[colab:model]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/train.ipynb\n[colab:dataset]: https://colab.research.google.com/github/simonepri/varname-seq2seq/blob/master/examples/dataset_generation.ipynb\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonepri%2Fvarname-seq2seq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonepri%2Fvarname-seq2seq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonepri%2Fvarname-seq2seq/lists"}