{"id":18852792,"url":"https://github.com/tech-srl/slm-code-generation","last_synced_at":"2025-06-28T18:36:45.824Z","repository":{"id":146789988,"uuid":"248974539","full_name":"tech-srl/slm-code-generation","owner":"tech-srl","description":"TensorFlow code for the neural network presented in the paper: \"Structural Language Models of Code\"  (ICML'2020)","archived":false,"fork":false,"pushed_at":"2022-05-20T21:48:54.000Z","size":2649,"stargazers_count":88,"open_issues_count":7,"forks_count":10,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-14T10:33:43.969Z","etag":null,"topics":["anycodegen","code","codegen","icml2020","language","models","source","structural"],"latest_commit_sha":null,"homepage":"https://AnyCodeGen.org","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tech-srl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-03-21T12:47:18.000Z","updated_at":"2025-03-24T06:00:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"1f4f55bf-feea-4b68-87cb-e0b2067de94e","html_url":"https://github.com/tech-srl/slm-code-generation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tech-srl/slm-code-generation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fslm-code-generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fslm-code-generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/tech-srl%2Fslm-code-generation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fslm-code-generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tech-srl","download_url":"https://codeload.github.com/tech-srl/slm-code-generation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tech-srl%2Fslm-code-generation/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262477639,"owners_count":23317528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anycodegen","code","codegen","icml2020","language","models","source","structural"],"created_at":"2024-11-08T03:41:34.177Z","updated_at":"2025-06-28T18:36:45.793Z","avatar_url":"https://github.com/tech-srl.png","language":"Java","readme":"# SLM: Structural Language Models of Code\nThis is an official implementation of the model described in:\n\n\"Structural Language Models of Code\" [[PDF]](https://arxiv.org/pdf/1910.00577.pdf)\n\nAppeared in ICML'2020.\n\nAn **online demo** is available at [https://AnyCodeGen.org](https://AnyCodeGen.org).\n\nThis repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper.\n\n\nFeel free to open a [new issue](https://github.com/tech-srl/slm-code-generation/issues/new) \nfor any question. 
We always respond quickly.\n\n\u003ccenter style=\"padding: 40px\"\u003e\u003cimg width=\"70%\" src=\"https://github.com/tech-srl/slm-code-generation/raw/master/images/fig1.png\" /\u003e\u003c/center\u003e\n\u003ccenter style=\"padding: 40px\"\u003e\u003cimg width=\"70%\" src=\"https://github.com/tech-srl/slm-code-generation/raw/master/images/fig2.png\" /\u003e\u003c/center\u003e\n\n\nTable of Contents\n=================\n  * [Requirements](#requirements)\n  * [Download our preprocessed dataset](#download-our-preprocessed-java-small-dataset)\n  * [Creating a new dataset](#creating-and-preprocessing-a-new-java-dataset)\n  * [Datasets](#datasets)\n  * [Querying the trained model](#querying-the-trained-model)\n  * [Citation](#citation)\n\n## Requirements\n  * [python3](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04) \n  * TensorFlow 1.13 or newer ([install](https://www.tensorflow.org/install/install_linux)). To check TensorFlow version:\n\u003e python3 -c 'import tensorflow as tf; print(tf.\\_\\_version\\_\\_)'\n  * For [creating a new Java dataset](#creating-and-preprocessing-a-new-java-dataset): [JDK 12](https://openjdk.java.net/install/)\n\n\n\n## Download our preprocessed Java-small dataset \nThis dataset contains ~1.3M examples (1.1GB).\n```\nmkdir data\ncd data\nwget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz\ntar -xvzf java-small-preprocessed.tar.gz\n```\nThis will create a `data/java-small/` sub-directory, containing the files that hold training, test and validation sets,\na dict file for various dataset properties and histograms, and a grammar file that is used during beam search to \ndistinguish between terminal and non-terminal nodes.\n\n## Creating and preprocessing a new Java dataset\nTo create and preprocess a new dataset (for example, to compare SLM to a new model on another dataset):\n  * Edit the file [preprocess.sh](preprocess.sh) using the instructions there, pointing it to the 
correct training, validation and test directories.\n  * Run the preprocess.sh file:\n\u003e bash preprocess.sh\n\n## Datasets\n### Java\nTo download the Java-small dataset as raw `*.java` files, use:\n\n  * [Java-small](https://s3.amazonaws.com/code2seq/datasets/java-small.tar.gz)\n  \nTo download the preprocessed dataset, use:\n  * [Java-small-preprocessed](https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz)\n\nTo download the dataset in a tokenized format that can be used in seq2seq models (for example, with [OpenNMT-py](http://opennmt.net/OpenNMT-py/)), use:\n  * [Java-small-seq2seq](https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-seq2seq-data.tar.gz)\n  \nThe following JSON files are created by the JavaExtractor. \nThe preprocessed and the seq2seq files\nare created from these JSON files:\n  * [Java-small-json](https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-json.tar.gz)\n\nEvery line is a JSON object\nthat contains the following fields: `num_targets`, `num_nodes`, `targets`, \n`is_token`, `target_child_id`, `internal_paths`, `relative_paths`, `head_paths`, \n`head_root_path`, `head_child_id`, `linearized_tree`, `filepath`, `left_context`, \n`right_context`, `target_seq`, `line`.\n\n\n\n### C#\nThe C# dataset that we used in the paper was created using the raw (`*.cs` files) dataset of\n [Allamanis et al., 2018](https://miltos.allamanis.com/publications/2018learning/),\n which can be found here: [https://aka.ms/iclr18-prog-graphs-dataset](https://aka.ms/iclr18-prog-graphs-dataset).\n\nTo extract examples from the C# files, we modified the data extraction code of \nBrockschmidt et al., 2019: [https://github.com/microsoft/graph-based-code-modelling/](https://github.com/microsoft/graph-based-code-modelling/).\n\n## Querying the Trained Model\nTo query the trained model, use the following API, where `MYCODE` is the given code snippet, that includes two 
question marks (`??`) to mark the \"hole\" that should be completed. \n\n### To query the expression-prediction model (the \"paper model\" in the demo website):\n```\ncurl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{\"code\": \"MYCODE\"}'\n```\n\nFor example:\n\n```\ncurl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{\"code\": \"public static Path[] stat2Paths(FileStatus[] stats) {  if (stats == null) return null;  Path[] ret = new Path[stats.length]; for (int i = 0; i \u003c stats.length; ++i) { ret[i] = ??; } return ret; }\"}'\n```\n\n### To query the statement-prediction model (the \"extended model\" in the demo website):\n```\ncurl -X POST https://63g9yqims7.execute-api.us-east-1.amazonaws.com/prod/predict -d '{\"code\": \"MYCODE\"}'\n```\n\nFor example:\n\n```\ncurl -X POST https://63g9yqims7.execute-api.us-east-1.amazonaws.com/prod/predict -d '{\"code\": \"@Override public boolean retainAll(Collection\u003c?\u003e collection) { boolean changed = false;     for (Iterator\u003cE\u003e iter = iterator(); iter.hasNext(); ) {         Element elem = iter.next();        if (!collection.contains(elem)) {           iter.remove();             ??        
}    }    return changed;}\"}'\n```\n\n## Citation \n\n[Structural Language Models of Code](https://arxiv.org/pdf/1910.00577.pdf)\n\n```\n@inproceedings{alon2020structural,\n  title={Structural language models of code},\n  author={Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},\n  booktitle={International Conference on Machine Learning},\n  pages={245--256},\n  year={2020},\n  organization={PMLR}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftech-srl%2Fslm-code-generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftech-srl%2Fslm-code-generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftech-srl%2Fslm-code-generation/lists"}