{"id":13427548,"url":"https://github.com/tech-srl/code2seq","last_synced_at":"2025-04-12T21:33:46.980Z","repository":{"id":37686695,"uuid":"160662744","full_name":"tech-srl/code2seq","owner":"tech-srl","description":"Code for the model presented in the paper: \"code2seq: Generating Sequences from Structured Representations of Code\"","archived":false,"fork":false,"pushed_at":"2024-08-15T11:09:49.000Z","size":4474,"stargazers_count":558,"open_issues_count":12,"forks_count":165,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-04-11T23:47:50.293Z","etag":null,"topics":["code","code2seq","from","generating","iclr2019","of","representations","sequences","structured"],"latest_commit_sha":null,"homepage":"http://code2seq.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tech-srl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-12-06T11:06:40.000Z","updated_at":"2025-04-09T11:07:02.000Z","dependencies_parsed_at":"2024-11-08T03:42:31.097Z","dependency_job_id":"d28730c4-40ad-4a3f-95ad-f0dde83bb1c0","html_url":"https://github.com/tech-srl/code2seq","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fcode2seq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fcode2seq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fcode2seq/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fcode2seq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tech-srl","download_url":"https://codeload.github.com/tech-srl/code2seq/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248635579,"owners_count":21137262,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code","code2seq","from","generating","iclr2019","of","representations","sequences","structured"],"created_at":"2024-07-31T01:00:31.294Z","updated_at":"2025-04-12T21:33:46.933Z","avatar_url":"https://github.com/tech-srl.png","language":"Python","funding_links":[],"categories":["Paper List"],"sub_categories":["RNN/LSTM-based"],"readme":"# code2seq\nThis is an official implementation of the model described in:\n\n[Uri Alon](http://urialon.cswp.cs.technion.ac.il), [Shaked Brody](http://www.cs.technion.ac.il/people/shakedbr/), [Omer Levy](https://levyomer.wordpress.com) and [Eran Yahav](http://www.cs.technion.ac.il/~yahave/), \"code2seq: Generating Sequences from Structured Representations of Code\" [[PDF]](https://openreview.net/pdf?id=H1gKYo09tX)\n\nAppeared in **ICLR'2019** (**poster** available [here](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2019/05/ICLR19_poster_code2seq.pdf))\n\nAn **online demo** is available at [https://code2seq.org](https://code2seq.org).\n\nThis is a TensorFlow implementation of the network, with Java and C# extractors for preprocessing the input code. 
\nIt can be easily extended to other languages, \nsince the TensorFlow network is agnostic to the input programming language (see [Extending to other languages](#extending-to-other-languages)).\nContributions are welcome.\n\n\u003ccenter style=\"padding: 40px\"\u003e\u003cimg width=\"70%\" src=\"https://github.com/tech-srl/code2seq/raw/master/images/network.png\" /\u003e\u003c/center\u003e\n\n## See also:\n  * **Structural Language Models for Code** (ICML'2020) is a new paper that learns to generate the missing code within a larger code snippet. This is similar to code completion, but is able to predict complex expressions rather than a single token at a time. See [PDF](https://arxiv.org/pdf/1910.00577.pdf), demo at [http://AnyCodeGen.org](http://AnyCodeGen.org).\n  * **Adversarial Examples for Models of Code** is a new paper that shows how to slightly mutate the input code snippet of code2vec and GNN models (thus introducing adversarial examples), such that the model (code2vec or a GNN) will output a prediction of our choice. See [PDF](https://arxiv.org/pdf/1910.07517.pdf) (code: soon).\n  * **Neural Reverse Engineering of Stripped Binaries** is a new paper that learns to predict procedure names in stripped binaries, thus using neural networks for reverse engineering. See [PDF](https://arxiv.org/pdf/1902.09122) (code: soon).\n  * **code2vec** (POPL'2019) is our previous model. It can only generate a single label at a time (rather than a sequence, as code2seq does), but it is much faster to train (because of its simplicity). 
See [PDF](https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf), demo at [https://code2vec.org](https://code2vec.org) and [code](https://github.com/tech-srl/code2vec/).\n\n\nTable of Contents\n=================\n  * [Requirements](#requirements)\n  * [Quickstart](#quickstart)\n  * [Configuration](#configuration)\n  * [Releasing a trained model](#releasing-a-trained-model)\n  * [Extending to other languages](#extending-to-other-languages)\n  * [Datasets](#datasets)\n  * [Baselines](#baselines)\n  * [Citation](#citation)\n\n## Requirements\n  * [python3](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04) \n  * TensorFlow 1.12 ([install](https://www.tensorflow.org/install/install_linux)). To check your TensorFlow version:\n\u003e python3 -c 'import tensorflow as tf; print(tf.\\_\\_version\\_\\_)'\n  * For a TensorFlow 2.1 implementation by [@Kolkir](https://github.com/Kolkir/), see: [https://github.com/Kolkir/code2seq](https://github.com/Kolkir/code2seq)\n  * For [creating a new Java dataset](#creating-and-preprocessing-a-new-java-dataset) or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model) (any operation that requires parsing of a new code example): [JDK](https://openjdk.java.net/install/)\n  * For creating a C# dataset: [dotnet-core](https://dotnet.microsoft.com/download) version 2.2 or newer.\n  * `pip install rouge` for computing ROUGE scores.\n\n## Quickstart\n### Step 0: Cloning this repository\n```\ngit clone https://github.com/tech-srl/code2seq\ncd code2seq\n```\n\n### Step 1: Creating a new dataset from Java sources\nTo obtain a preprocessed dataset to train a network on, you can either download our\npreprocessed dataset, or create a new dataset from Java source files.\n\n#### Download our preprocessed Java-large dataset (~16M examples, compressed: 11GB, extracted: 125GB)\n```\nmkdir data\ncd data\nwget 
https://s3.amazonaws.com/code2seq/datasets/java-large-preprocessed.tar.gz\ntar -xvzf java-large-preprocessed.tar.gz\n```\nThis will create a `data/java-large/` sub-directory, containing the files that hold the training, test and validation sets,\nand a dict file for various dataset properties.\n\n#### Creating and preprocessing a new Java dataset\nTo create and preprocess a new dataset (for example, to compare code2seq to another model on another dataset):\n  * Edit the file [preprocess.sh](preprocess.sh) using the instructions there, pointing it to the correct training, validation and test directories.\n  * Run the preprocess.sh file:\n\u003e bash preprocess.sh\n\n### Step 2: Training a model\nYou can either download an already trained model, or train a new model using a preprocessed dataset.\n\n#### Downloading a trained model (137 MB)\nWe already trained a model for 52 epochs on the data that was preprocessed in the previous step. This is the same model that was used in the paper and that serves the demo at [code2seq.org](https://code2seq.org).\n```\nwget https://s3.amazonaws.com/code2seq/model/java-large/java-large-model.tar.gz\ntar -xvzf java-large-model.tar.gz\n```\n\n##### Note:\nThis trained model is in a \"released\" state, which means that we stripped its training parameters from it.\n\n#### Training a model from scratch\nTo train a model from scratch:\n  * Edit the file [train.sh](train.sh) to point it to the right preprocessed data. 
By default, \n  it points to our \"java-large\" dataset that was preprocessed in the previous step.\n  * Before training, you can edit the configuration hyper-parameters in the file [config.py](config.py),\n  as explained in [Configuration](#configuration).\n  * Run the [train.sh](train.sh) script:\n```\nbash train.sh\n```\n\n### Step 3: Evaluating a trained model\nAfter `config.PATIENCE` iterations of no improvement on the validation set, training stops by itself.\n\nSupposing that iteration #52 is our chosen model, run:\n```\npython3 code2seq.py --load models/java-large-model/model_iter52.release --test data/java-large/java-large.test.c2s\n```\nWhile evaluating, a file named \"log.txt\" is written to the same directory as the saved models, containing each test example's name and the model's prediction.\n\n### Step 4: Manual examination of a trained model\nTo manually examine a trained model, run:\n```\npython3 code2seq.py --load models/java-large-model/model_iter52.release --predict\n```\nAfter the model loads, follow the instructions: edit the file `Input.java`, enter a Java method or code snippet, and examine the model's predictions and attention scores.\n\n#### Note: \nDue to TensorFlow's limitations, if using beam search (`config.BEAM_WIDTH \u003e 0`), then `BEAM_WIDTH` hypotheses will be printed, but\nwithout attention weights. If not using beam search (`config.BEAM_WIDTH == 0`), then a single hypothesis will be printed *with \nthe attention weights* in every decoding timestep. \n\n## Configuration\nChanging hyper-parameters is possible by editing the file [config.py](config.py).\n\nHere are some of the parameters and their description:\n#### config.NUM_EPOCHS = 3000\nThe max number of epochs to train the model. 
\n#### config.SAVE_EVERY_EPOCHS = 1\nThe frequency, in epochs, of saving a model and evaluating on the validation set during training.\n#### config.PATIENCE = 10\nControls early stopping: the number of epochs without improvement on the validation set after which training stops.  \n#### config.BATCH_SIZE = 512\nBatch size during training.\n#### config.TEST_BATCH_SIZE = 256\nBatch size during evaluation. Affects only the evaluation speed and memory consumption; it does not affect the results.\n#### config.SHUFFLE_BUFFER_SIZE = 10000\nThe buffer size that the reader uses for shuffling the training data. \nControls the randomness of the data. \nIncreasing this value might hurt training throughput. \n#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  \nThe buffer size (in bytes) of the CSV dataset reader.\n#### config.MAX_CONTEXTS = 200\nThe number of contexts to sample in each example during training \n(resampling a different subset of this size every training iteration).\n#### config.SUBTOKENS_VOCAB_MAX_SIZE = 190000\nThe max size of the subtoken vocabulary.\n#### config.TARGET_VOCAB_MAX_SIZE = 27000\nThe max size of the target words vocabulary.\n#### config.EMBEDDINGS_SIZE = 128\nEmbedding size for subtokens, AST nodes and target symbols.\n#### config.RNN_SIZE = 128 * 2 \nThe total size of the two LSTMs that are used to embed the paths if `config.BIRNN` is `True`, or the size of the single LSTM if `config.BIRNN` is `False`.\n#### config.DECODER_SIZE = 320\nSize of each LSTM layer in the decoder.\n#### config.NUM_DECODER_LAYERS = 1\nNumber of decoder LSTM layers. Can be increased to support long target sequences.\n#### config.MAX_PATH_LENGTH = 8 + 1\nThe max number of nodes in a path.\n#### config.MAX_NAME_PARTS = 5\nThe max number of subtokens in an input token. If the token is longer, only the first subtokens will be read.\n#### config.MAX_TARGET_PARTS = 6\nThe max number of symbols in the target sequence. 
\nSet to 6 by default for method names, but can be increased for learning datasets with longer sequences.\n#### config.BIRNN = True\nIf True, use a bidirectional LSTM to encode each path. If False, use a unidirectional LSTM only. \n#### config.RANDOM_CONTEXTS = True\nWhen True, sample `MAX_CONTEXTS` contexts from every example every training iteration. \nWhen False, take the first `MAX_CONTEXTS` only.\n#### config.BEAM_WIDTH = 0\nBeam width in beam search. Inactive when 0. \n#### config.USE_MOMENTUM = True\nIf `True`, use the Momentum optimizer with Nesterov momentum. If `False`, use Adam \n(Adam converges in fewer epochs; Momentum leads to slightly better results). \n\n## Releasing a trained model\nIf you wish to keep a trained model for inference only (without the ability to continue training it), you can\nrelease the model using:\n```\npython3 code2seq.py --load models/java-large-model/model_iter52 --release\n```\nThis will save a copy of the trained model with the '.release' suffix.\nA \"released\" model usually takes ~3x less disk space.\n\n## Extending to other languages  \n\nThis project currently supports Java and C\\# as the input languages.\n\n_**March 2020** - a code2seq extractor for **C++** based on LLVM was developed by [@Kolkir](https://github.com/Kolkir/) and is available here: [https://github.com/Kolkir/cppminer](https://github.com/Kolkir/cppminer)._\n\n_**January 2020** - a code2seq extractor for Python (specifically targeting the Python150k dataset) was contributed by [@stasbel](https://github.com/stasbel). 
See: [https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor](https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor)._\n\n_**January 2020** - an extractor for predicting TypeScript type annotations for JavaScript input using code2vec was developed by [@izosak](https://github.com/izosak) and Noa Cohen, and is available here:\n[https://github.com/tech-srl/id2vec](https://github.com/tech-srl/id2vec)._\n\n~~_**June 2019** - an extractor for **C** that is compatible with our model was developed by [CMU SEI team](https://github.com/cmu-sei/code2vec-c)._~~ - removed by the CMU SEI team.\n\n_**June 2019** - a code2vec extractor for **Python, Java, C, C++** by JetBrains Research is available here: [PathMiner](https://github.com/JetBrains-Research/astminer)._\n\nTo extend code2seq to languages other than Java and C#, a new extractor (similar to the [JavaExtractor](JavaExtractor))\nshould be implemented and called by [preprocess.sh](preprocess.sh).\nBasically, an extractor should be able to output, for each directory containing source files:\n  * A single text file, where each row is an example.\n  * Each example is a space-delimited list of fields, where:\n  1. The first field is the target label, internally delimited by the \"|\" character (for example: `compare|ignore|case`)\n  2. Each of the following fields is a context, where each context has three components separated by commas (\",\"). None of these components may contain spaces or commas.\n  \n  We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.  
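For illustration only (the helper below is a hypothetical sketch, not part of this repository), a line in this format can be parsed as:

```
# Hypothetical sketch: parse one extractor output line of the form
#   <target> <context> <context> ...   where each context is token,path,token
def parse_example(line):
    fields = line.split(' ')
    target = fields[0].split('|')           # subtokens of the target label
    contexts = []
    for ctx in fields[1:]:
        left, path, right = ctx.split(',')  # the three context components
        contexts.append((left.split('|'), path.split('|'), right.split('|')))
    return target, contexts
```

For example, parsing the line `compare|ignore|case my|key,StringExpression|MethodCall|Name,get|value` yields the target subtokens `['compare', 'ignore', 'case']` and a single (token, path, token) triple.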
\n  \n  Each \"token\" component is a token in the code, split to subtokens using the \"|\" character.\n  \n  Each path is a path between two tokens, split to path nodes (or other kinds of building blocks) using the \"|\" character.\n  Example for a context:\n  \n`my|key,StringExression|MethodCall|Name,get|value`\n\nHere `my|key` and `get|value` are tokens, and `StringExression|MethodCall|Name` is the syntactic path that connects them. \n\n## Datasets\n### Java\nTo download the Java-small, Java-med and Java-large datasets used in the Code Summarization task as raw `*.java` files, use:\n\n  * [Java-small](https://s3.amazonaws.com/code2seq/datasets/java-small.tar.gz)\n  * [Java-med](https://s3.amazonaws.com/code2seq/datasets/java-med.tar.gz)\n  * [Java-large](https://s3.amazonaws.com/code2seq/datasets/java-large.tar.gz)\n  \nTo download the preprocessed datasets, use:\n  * [Java-small-preprocessed](https://s3.amazonaws.com/code2seq/datasets/java-small-preprocessed.tar.gz)\n  * [Java-med-preprocessed](https://s3.amazonaws.com/code2seq/datasets/java-med-preprocessed.tar.gz)\n  * [Java-large-preprocessed](https://s3.amazonaws.com/code2seq/datasets/java-large-preprocessed.tar.gz)\n\n### C#\nThe C# dataset used in the Code Captioning task can be downloaded from the [CodeNN](https://github.com/sriniiyer/codenn/) repository.\n\n## Baselines\n### Using the trained model\nFor the NMT baselines (BiLSTM, Transformer) we used the implementation of [OpenNMT-py](http://opennmt.net/OpenNMT-py/).\nThe trained BiLSTM model is available here:\n`https://code2seq.s3.amazonaws.com/lstm_baseline/model_acc_62.88_ppl_12.03_e16.pt`\n\nTest+validation sources and 
targets:\n```\nhttps://code2seq.s3.amazonaws.com/lstm_baseline/test_expected_actual.txt\nhttps://code2seq.s3.amazonaws.com/lstm_baseline/test_source.txt\nhttps://code2seq.s3.amazonaws.com/lstm_baseline/test_target.txt\nhttps://code2seq.s3.amazonaws.com/lstm_baseline/val_source.txt\nhttps://code2seq.s3.amazonaws.com/lstm_baseline/val_target.txt\n```\n\nThe command line for \"translating\" a \"source\" file to a \"target\" is:\n`python3 translate.py -model model_acc_62.88_ppl_12.03_e16.pt -src test_source.txt -output translation_epoch16.txt -gpu 0`\n\nThis results in a `translation_epoch16.txt` file, which we compare to `test_target.txt` to compute the score.\nThe file `test_expected_actual.txt` is a line-by-line concatenation of the true reference (\"expected\") with the corresponding prediction (the \"actual\").\n\n### Creating data for the baseline\nWe first modified the JavaExtractor (the same one as in this repository) to locate the methods to train on and print them to a file where each method is a single line. This modification is currently not checked in; instead of extracting paths, it just prints `node.toString()` and replaces \"\\n\" with a space, where `node` is the object holding the AST node of type `MethodDeclaration`.\n\nThen, we tokenized (including sub-tokenization of identifiers, i.e., `\"ArrayList\" -\u003e [\"Array\",\"List\"]`) each method body using `javalang`, using [this](baseline_tokenization/subtokenize_nmt_baseline.py) script (which can be run on [this](baseline_tokenization/input_example.txt) input example).\nSo a program such as:\n```\nvoid methodName(String fooBar) {\n    System.out.println(\"hello world\");\n}\n```\n\nshould be printed by the modified JavaExtractor as:\n\n```method name|void (String fooBar){ System.out.println(\"hello world\");}```\n\nand the tokenization script would turn it into: \n\n```void ( String foo Bar ) { System . out . 
println ( \" hello world \" ) ; }```\n\nand the label to be predicted, i.e., \"method name\", into a separate file.\n\nOpenNMT-py can then be trained over these training source and target files.\n\n## Citation \n\n[code2seq: Generating Sequences from Structured Representations of Code](https://arxiv.org/pdf/1808.01400)\n\n```\n@inproceedings{\n    alon2018codeseq,\n    title={code2seq: Generating Sequences from Structured Representations of Code},\n    author={Uri Alon and Shaked Brody and Omer Levy and Eran Yahav},\n    booktitle={International Conference on Learning Representations},\n    year={2019},\n    url={https://openreview.net/forum?id=H1gKYo09tX},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftech-srl%2Fcode2seq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftech-srl%2Fcode2seq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftech-srl%2Fcode2seq/lists"}