{"id":23450778,"url":"https://github.com/basedrhys/obfuscated-code2vec","last_synced_at":"2025-06-29T15:32:37.946Z","repository":{"id":39729177,"uuid":"201604175","full_name":"basedrhys/obfuscated-code2vec","owner":"basedrhys","description":"Code for the paper \"Embedding Java Classes with code2vec: Improvements from Variable Obfuscation\" in MSR 2020","archived":false,"fork":false,"pushed_at":"2023-03-24T23:50:52.000Z","size":17114,"stargazers_count":32,"open_issues_count":4,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-13T20:14:58.049Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/basedrhys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-10T08:46:28.000Z","updated_at":"2025-03-08T20:12:27.000Z","dependencies_parsed_at":"2022-09-16T21:01:16.239Z","dependency_job_id":null,"html_url":"https://github.com/basedrhys/obfuscated-code2vec","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Fobfuscated-code2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Fobfuscated-code2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Fobfuscated-code2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Fobfuscated-code2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/basedrhys","download_url":"https://codeload.gith
ub.com/basedrhys/obfuscated-code2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248774968,"owners_count":21159534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-24T00:15:06.322Z","updated_at":"2025-04-13T20:15:06.367Z","avatar_url":"https://github.com/basedrhys.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Embedding Java Classes with code2vec: Improvements from Variable Obfuscation\n\n![Overall project view](img/overall.png)\n\nThis repository contains the java obfuscation tool created with [Spoon](https://github.com/INRIA/spoon) and the dataset pipeline as described in: \n\n[Rhys Compton](https://www.rhyscompton.co.nz/), [Eibe Frank](https://www.cs.waikato.ac.nz/~eibe/), [Panos Patros](https://www.cms.waikato.ac.nz/people/ppatros), and [Abigail Koay](https://www.waikato.ac.nz/staff-profiles/people/akoay) - *Embedding Java Classes with code2vec: Improvements from Variable Obfuscation*, in MSR '20 [[ArXiv Preprint]](https://arxiv.org/abs/2004.02942)\n\nAlso included are all models and data used in the paper for reproducing/further research.\n\nTable of Contents\n=================\n- [Downloadable Assets](#downloadable-assets)\n- [Requirements](#requirements)\n- [Usage - Obfuscator](#usage-obfuscator)\n- [Usage - Dataset Pipeline](#usage-dataset-pipeline)\n- [Trained code2vec Models](#trained-code2vec-models)\n- [Datasets](#datasets)\n- [Citation](#citation)\n\n## Downloadable Assets\n- [Trained 
Models](https://zenodo.org/record/3577367)\n- [Evaluation datasets](https://zenodo.org/record/3575197)\n\n## Requirements\n- Java 8+\n- Python 3\n\n## Usage: Obfuscator\n1. `cd java-obfuscator`\n2. Locate a folder of `.java` files (e.g., from the [code2seq](https://github.com/tech-srl/code2seq#datasets) repository)\n3. Set the input and output directories in `obfs-script.sh`, as well as the number of threads to match your machine. If you're running this on a particularly large folder (e.g., millions of files), you may need to increase `NUM_PARTITIONS` to 3 or 4; otherwise memory issues can slow the obfuscator to a near halt.\n4. Run `obfs-script.sh`, i.e. `$ source obfs-script.sh`\n\nThis produces a new folder of obfuscated `.java` files that can be used to train an obfuscated code2vec model (or any other model that learns from source code).\n\n## Usage: Dataset Pipeline\n\n![Dataset Pipeline View](img/pipeline.png)\n\nThe pipeline uses a trained code2vec model as a feature extractor, converting a classification dataset of `.java` files into a numerical form (`.arff` by default) that can then be used as input for any standard classifier.\n\nAll of the model-related code (`common.py`, `model.py`, `PathContextReader.py`) as well as the `JavaExtractor` folder is taken from the original [code2vec repository](https://github.com/tech-srl/code2vec). It is used to invoke the trained code2vec models to create method embeddings.\n\nThe dataset should follow the same layout as those supplied with this paper, i.e.:\n```\ndataset_name\n|-- class1\n    |-- file1.java\n    |-- file2.java\n    ...\n|-- class2\n    |-- file251.java\n    |-- file252.java\n    ...\n\n...\n```\n\nTo run the dataset pipeline and create class-level embeddings for a dataset of Java files:\n1. `cd pipeline`\n2. `pip install -r requirements.txt`\n3. 
Download a `.java` dataset (from the datasets supplied or your own) and put it in the `java_files/` directory\n4. Download a code2vec model checkpoint and put the checkpoint folder in the `models/` directory\n5. Change the paths and definitions in `model_defs.py` and the number of models in `scripts/create_datasets.sh` to match your setup\n6. Run `create_datasets.sh` (`source scripts/create_datasets.sh`). This will loop through each model and create class-level embeddings for the supplied datasets. The resulting datasets will be in `.arff` format in the `weka_files/` folder.\n\nYou can now perform class-level classification on the dataset using any off-the-shelf WEKA classifier. Note that the dataset contains the original filename as a string attribute for debugging purposes; you'll likely need to remove this attribute before you pass the dataset into a classifier.\n\n### Config\nBy default the pipeline uses the full range of values for each parameter, which creates a very large number of `.arff` datasets (\u003e1000). To reduce this number, remove (or comment out) some of the items in the arrays at the end of `reduction_methods.py` and `selection_methods.py`. Our experiments showed that the `SelectAll` selection method and `NoReduction` reduction method performed best in most cases, so you may want to keep only these.\n\n## Trained code2vec Models\n\nThe models are all available for download: [Zenodo Link](https://zenodo.org/record/3577367).\n\nThe `.java` datasets used to train each of the models (different versions of `java-large` from the [code2seq repository](https://github.com/tech-srl/code2seq)), as well as the preprocessed code2vec-ready versions of those datasets, are also available: [Google Drive Link](https://drive.google.com/open?id=1CXgSXKf292BTlryASui2kBvYvJSvFnWN)\n\n## Datasets\n\nThe `.java` datasets collated for this research are all available for download: [Zenodo Link](https://zenodo.org/record/3575197). 
\n\nFor the interactive embedding visualisation links below, UMAP often gives the best results.\n\nThe class distributions shown below were generated with [WEKA](https://www.cs.waikato.ac.nz/ml/weka/).\n\n### OpenCV/Spring\n\n2 categories, 305 instances\n\n![Class Distribution](img/classDist_opencv.png)\n\n[Embedding Visualisation](http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/basedrhys/fbb71520686db5e748e8681de112407c/raw/3900fd07bdc4441cf66f69c4e710611dd7fcecd9/opencv_config.json)\n\n![OpenCV/Spring Visualisation](img/vis_opencv.png)\n\n### Algorithm Classification\n\n7 categories, 182 instances\n\n![Class Distribution](img/classDist_algorithms.png)\n\n[Embedding Visualisation](http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/basedrhys/5660cf47252411bdf83e4ff4f877f02a/raw/8e53136f79251fdce82524d9fc6539c039f9be63/algorithm_config.json)\n\n![Algorithm Classification Visualisation](img/vis_algorithm.png)\n\n### Code Author Attribution\n\n13 categories, 1062 instances\n\n![Class Distribution](img/classDist_Author.png)\n\n[Embedding Visualisation](http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/basedrhys/36fcd8653f2d759a8f1b03e56502a58e/raw/7d2ddef1c219d4fad7a49cc2c978d1ff4e25e5f1/author_config.json)\n\n![Code Author Attribution Visualisation](img/vis_authors.png)\n\n### Bug Detection\n\n2 categories, 31135 instances*\n\n![Class Distribution](img/classDist_javaBugs.png)\n\n### Duplicate File Detection\n\n2 categories, 1669 instances\n\n![Class Distribution](img/classDist_dupFiles.png)\n\n### Duplicate Function Detection\n\n2 categories, 1277 instances\n\n![Class Distribution](img/classDist_dupFunc.png)\n\n### Malware Classification\n\nThis dataset can't be shared for security reasons; however, you can request it from the original authors: http://amd.arguslab.org/\n\n3 categories, 20927 instances*\n\n![Class Distribution](img/classDist_malware.png)\n\n\n#### Notes\n\n`*` - 2000 samples per class were 
randomly sampled during experiments, so the results in the paper are reported on a smaller dataset. The downloadable dataset is the full version. \n\n## Citation\n\n[Embedding Java Classes with code2vec: Improvements from Variable Obfuscation](https://arxiv.org/pdf/2004.02942.pdf)\n\n```\n@inproceedings{10.1145/3379597.3387445,\nauthor = {Compton, Rhys and Frank, Eibe and Patros, Panos and Koay, Abigail},\ntitle = {Embedding Java Classes with Code2vec: Improvements from Variable Obfuscation},\nyear = {2020},\nisbn = {9781450375177},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3379597.3387445},\ndoi = {10.1145/3379597.3387445},\nbooktitle = {Proceedings of the 17th International Conference on Mining Software Repositories},\npages = {243–253},\nnumpages = {11},\nkeywords = {machine learning, code obfuscation, neural networks, code2vec, source code},\nlocation = {Seoul, Republic of Korea},\nseries = {MSR '20}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasedrhys%2Fobfuscated-code2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbasedrhys%2Fobfuscated-code2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasedrhys%2Fobfuscated-code2vec/lists"}