{"id":33145284,"url":"https://github.com/whyisyoung/BODMAS","last_synced_at":"2025-11-15T18:00:38.268Z","repository":{"id":44416800,"uuid":"350577559","full_name":"whyisyoung/BODMAS","owner":"whyisyoung","description":"Code for our DLS'21 paper - BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. ","archived":false,"fork":false,"pushed_at":"2024-03-31T03:44:53.000Z","size":117,"stargazers_count":64,"open_issues_count":0,"forks_count":12,"subscribers_count":4,"default_branch":"gh-pages","last_synced_at":"2024-04-17T08:20:16.778Z","etag":null,"topics":["malware","malware-dataset","malware-research","open-datasets","pe-malware","temporal-data"],"latest_commit_sha":null,"homepage":"https://whyisyoung.github.io/BODMAS/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/whyisyoung.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-03-23T04:22:45.000Z","updated_at":"2024-04-15T14:35:05.000Z","dependencies_parsed_at":"2022-09-22T06:10:34.552Z","dependency_job_id":"9be8badc-1ca7-4a80-b0b8-6b73e0706ac0","html_url":"https://github.com/whyisyoung/BODMAS","commit_stats":{"total_commits":29,"total_committers":2,"mean_commits":14.5,"dds":0.03448275862068961,"last_synced_commit":"e8abcb9b68a7d44a51190cc6bbf053ff56a58d03"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/whyisyoung/BODMAS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whyisyoung%2FBODMAS","tags_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/whyisyoung%2FBODMAS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whyisyoung%2FBODMAS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whyisyoung%2FBODMAS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/whyisyoung","download_url":"https://codeload.github.com/whyisyoung/BODMAS/tar.gz/refs/heads/gh-pages","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whyisyoung%2FBODMAS/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":284597116,"owners_count":27032396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-15T02:00:06.050Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["malware","malware-dataset","malware-research","open-datasets","pe-malware","temporal-data"],"created_at":"2025-11-15T13:00:33.463Z","updated_at":"2025-11-15T18:00:38.250Z","avatar_url":"https://github.com/whyisyoung.png","language":"Python","funding_links":[],"categories":[":bookmark_tabs: Datasets","Windows Datasets \u003cimg src=\"./images/windows.png\"\u003e"],"sub_categories":["Scientific Research"],"readme":"# BODMAS Malware Dataset\n\n## Introduction\nThe BODMAS Malware Dataset is created and maintained by [Blue Hexagon](https://bluehexagon.ai/) and [UIUC](https://illinois.edu/).\n\nIt 
contains 57,293 malware and 77,142 benign Windows PE files, including binaries (disarmed malware only), feature vectors, and metadata.\n\nFurther details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [[PDF](https://liminyang.web.illinois.edu/data/DLS21_BODMAS.pdf)], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).\n\nIf you end up building on this dataset as part of a project or publication, please include a reference to our paper:\n\n```\n@inproceedings{bodmas,\n  title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},\n  author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},\n  booktitle = {4th Deep Learning and Security Workshop},\n  year = {2021}\n}\n```\n\n## Download\nPlease visit [this link](https://whyisyoung.github.io/BODMAS/) for more details.\n\n## Installation\n1. Before we get started, please check your server storage and memory. I ran most of the experiments on our lab clusters containing 9 servers (see specification [here](https://gangw.cs.illinois.edu/cluster.html)). I use Fabric to distribute code to different servers to simplify repetitive experiments. You can use a single server, but you will need to change some shell scripts; see the Examples section.\n\n2. Clone this repo to your home directory (you can clone it elsewhere, but you will then need to change some scripts; see the warning in the Examples section):\n    ```bash\n    cd ~\n    git clone git@github.com:whyisyoung/BODMAS.git\n    ```\n\n3. We recommend setting up a Python 3.6.8 virtual environment (other versions of Python 3.6 or above might also work, but we have not tested them).\n\n    ```bash\n    cd BODMAS/code/\n    pip install -r requirements.txt\n    python setup.py install\n    ```\n\n## Configuration\n1. For BODMAS, follow the guidelines of the Download section. 
Put `bluehex_metadata.csv` and `bluehex.npz` under `BODMAS/code/multiple_data/`.\n2. For Ember and UCSB-packerware, you can download pre-processed feature vectors and metadata here (about 3.4 GB in total): [Google Drive link](https://drive.google.com/drive/folders/12DMPeh8DA2ukPATnHX4K__shWFJIiBN5?usp=sharing). Note that for Ember, we combine Ember 2017 and 2018 into a single dataset. Put the 4 downloaded files under `BODMAS/code/multiple_data/`.\n\n3. For SOREL-20M, you can download pre-trained LightGBM and DNN models here: [https://github.com/sophos-ai/SOREL-20M](https://github.com/sophos-ai/SOREL-20M).\nIf you want to use the pre-trained SOREL-20M models, you need to specify the locations of the following folders in `code/bodmas/config.py`:\n\n```python\n    'sophos_model_folder': '/home/datashare/sophos/baselines/checkpoints/lightGBM/',\n    'sophos_features_folder': '/home/datashare/sophos/lightGBM-features/'\n```\n\n## Examples\n\n### Testing pre-trained models on our BODMAS dataset (Table II in our paper):\n1. Using Ember and random seed 0 as the training set (**PLEASE change the hostname \"angel\" to yours**):\n    ```bash\n    cd BODMAS/code/\n    ./main_pretrain.sh\n    ```\n\n    For other random seeds (1-4), uncomment the rest of the first code block of `main_pretrain.sh` and change the hostnames (\"beast\" \"bishop\" \"colossus\" \"cyclops\") to yours. We highly recommend running only one random seed at a time if you don't have enough memory.\n\n    Call graph:\n    ```bash\n    main_pretrain.sh -\u003e fabric_pretrain.py -\u003e run_pretrain.sh -\u003e pretrain_model_test_on_bluehex.py\n    ```\n\n    WARNING: If you didn't put this repo under your home directory (where it would appear as `~/BODMAS`), you might need to change line 18 of `fabric_pretrain.py`. This also applies to `fabric_multiclass.py` (line 17).\n\n2. Using Sophos pre-trained models, uncomment the second code block of `main_pretrain.sh` and change the hostname to yours. 
Using UCSB as the training set, uncomment the third code block of `main_pretrain.sh` and change the hostname accordingly. The code for Sophos-DNN is very similar and is thus omitted here.\n\n### Incremental Retraining (Fig. 1 in our paper)\n1. Before running the script, if you want to test Transcend, you need to ask for access to the Transcend code from Feargus Pendlebury and Lorenzo Cavallaro (https://s2lab.cs.ucl.ac.uk/). **Please CC me as well.** Otherwise, you can uncomment the corresponding import and related code.\n\n2. Use the corresponding code blocks and change the hostname to yours accordingly:\n    ```bash\n    ./run_ember_drift.sh\n    ```\n\n3. Call graph:\n    ```bash\n    run_ember_drift.sh -\u003e concept_drift_ember.py\n    ```\n\n### Training with New Data (Fig. 2 in our paper)\n1. Change the hostname to yours accordingly and run the following script. We highly recommend running the random seeds sequentially to avoid memory errors, unless you can run them on multiple servers.\n    ```bash\n    ./main_bluehex_binary.sh\n    ```\n\n2. Call graph:\n    ```bash\n    main_bluehex_binary.sh -\u003e bluehex_main.py\n    ```\n\n\n### Multi-class classification (Fig. 3, 4 in our paper)\n1. Use the corresponding code blocks and change the hostname to yours accordingly:\n    ```bash\n    ./main_bluehex_multiclass.sh\n    ```\n\n    Call graph:\n    ```bash\n    main_bluehex_multiclass.sh -\u003e fabric_multiclass.py -\u003e run_multiclass.sh -\u003e bluehex_main.py\n    ```\n\n## Contact\nIf you have any questions, please contact Limin (liminy2@illinois.edu).\n\n## Licensing\nBSD 2-Clause \"Simplified\" License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhyisyoung%2FBODMAS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwhyisyoung%2FBODMAS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhyisyoung%2FBODMAS/lists"}