{"id":20484414,"url":"https://github.com/sap-samples/security-research-codegraphsmote","last_synced_at":"2026-06-07T15:31:17.644Z","repository":{"id":161726042,"uuid":"623356641","full_name":"SAP-samples/security-research-codegraphsmote","owner":"SAP-samples","description":"Data augmentation strategy that can be applied to code graphs for learning-based vulnerability discovery.","archived":false,"fork":false,"pushed_at":"2024-11-15T08:25:17.000Z","size":776,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-05T16:19:24.320Z","etag":null,"topics":["augmentation","data","detection","learning","machine","research","sample","security","vulnerability"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SAP-samples.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-04T07:57:57.000Z","updated_at":"2025-02-19T15:52:32.000Z","dependencies_parsed_at":null,"dependency_job_id":"07c36333-a9fb-4b30-b837-1989119e2994","html_url":"https://github.com/SAP-samples/security-research-codegraphsmote","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SAP-samples/security-research-codegraphsmote","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP-samples%2Fsecurity-research-codegraphsmote","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP-samples%2Fsecurity-research-codegraphsmo
te/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP-samples%2Fsecurity-research-codegraphsmote/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP-samples%2Fsecurity-research-codegraphsmote/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SAP-samples","download_url":"https://codeload.github.com/SAP-samples/security-research-codegraphsmote/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP-samples%2Fsecurity-research-codegraphsmote/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34027670,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["augmentation","data","detection","learning","machine","research","sample","security","vulnerability"],"created_at":"2024-11-15T16:22:13.770Z","updated_at":"2026-06-07T15:31:17.576Z","avatar_url":"https://github.com/SAP-samples.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CodeGraphSMOTE - Data Augmentation for Vulnerability Discovery\n\n[![REUSE 
status](https://api.reuse.software/badge/github.com/SAP-samples/security-research-codegraphsmote)](https://api.reuse.software/info/github.com/SAP-samples/security-research-codegraphsmote)\n\n\n## Description\n\nThis repository contains the source code for our paper [CodeGraphSMOTE - Data Augmentation for Vulnerability Discovery](https://link.springer.com/chapter/10.1007/978-3-031-37586-6_17).\n\n## Requirements\n\n- Python\n- PyTorch\n- PyTorch Geometric\n- NetworkX\n- imbalanced-learn\n- gensim\n- tokenizers\n- pandas\n\nFor the transformer reconstruction demo:\n- dash\n- [cpg-to-dot](https://github.com/SAP-samples/security-research-taintgraphs)\n\n## Training data\n\nTraining data for the various datasets can be obtained at:\n\n- [Devign](https://sites.google.com/view/devign) (FFmpeg+QEMU)\n- [ReVeal](https://github.com/VulDetProject/ReVeal) (Chromium+Debian)\n- [PatchDB](https://sunlab-gmu.github.io/PatchDB/)\n\nMethods are extracted from the commits: the version of a method before the fix commit is labeled vulnerable and the version after the fix commit is labeled non-vulnerable, as described in Devign and ReVeal. Afterwards, the resulting C code is processed using [Fraunhofer-CPG](https://github.com/Fraunhofer-AISEC/cpg). A single file per method containing the CPG in Graphviz DOT language needs to be placed in the cache folders of this directory (alternatively, the paths in `params/dataset_params.py` can be changed). The processed CPG files can be created conveniently using [cpg-to-dot](https://github.com/SAP-samples/security-research-taintgraphs).\n\n## Scripts relevant to the reproduction of the results\n\n- `notebooks/`\n    - `analyze_cwe.ipynb`\n        Visualization of the distances between CWE clusters. Used for the right-hand side of figure 4\n    - `degree_vis.ipynb`\n        Notebook containing the code for visualizations of the average degree against the number of nodes. 
Used for figures 2b and 2c.\n- `params/`\n    Hyperparameters of training, models and datasets as well as paths to the data and various other configuration options\n- `scripts/`\n    - `cpg_reconstruction/`\n        - `demo.py`\n            Interactive demonstration of the reconstruction of code from a CPG using the trained transformer\n        - `demo2.py`\n            Interactive demonstration of the interpolation between two code samples and reconstruction using the transformer; used for figure 3\n        - `train.py`\n            Training of the CPG reconstruction transformer\n    - `cwe_distances.py`\n        Generates the data needed for `notebooks/analyze_cwe.ipynb`\n    - `degree_per_node.py`\n        Generates figure 2a\n    - `draw_cwes.py`\n        Generates the left-hand side of figure 4\n    - `plot_percentages.py`\n        Used for creating figure 5\n    - `quick_preprocess.py`\n        Parallelized version of data preprocessing. Use this before training on any dataset\n    - `view_results.py`\n        Creates textual summaries of the cross-validation experiments. Used for table 1\n- `cv_classifier.py`\n    Used to generate the results on the full dataset with cross-validation for table 1\n- `cv_subsampling_drop.py`\n    Used to generate the subsampled results shown as \"Node-Dropping\" in figure 5\n- `cv_subsampling_sard.py`\n    Used to generate the subsampled results shown as \"SARD\" in figure 5\n- `cv_subsampling_smote.py`\n    Used to generate the subsampled results shown as \"CodeGraphSMOTE\" in figure 5\n- `cv_subsampling.py`\n    Used to generate the subsampled results shown as \"Downsampled\" in figure 5\n- `train_vgae.py`\n    Training of the VGAE model used for CodeGraphSMOTE\n\nAll implementations of models, training and data processing are in `experiments/`. 
All other files are utility files to ease implementation of the scripts.\n\n## How to obtain support\n[Create an issue](https://github.com/SAP-samples/security-research-codegraphsmote/issues) in this repository if you find a bug or have questions about the content.\n \nFor additional support, [ask a question in SAP Community](https://answers.sap.com/questions/ask.html).\n\n## Contributing\nIf you wish to contribute code, offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create the first pull request to this project. This happens in an automated fashion during the submission process. SAP uses [the standard DCO text of the Linux Foundation](https://developercertificate.org/).\n\n## License\nCopyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the [LICENSE](LICENSE) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsap-samples%2Fsecurity-research-codegraphsmote","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsap-samples%2Fsecurity-research-codegraphsmote","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsap-samples%2Fsecurity-research-codegraphsmote/lists"}