{"id":13736050,"url":"https://github.com/ucsb-seclab/packware","last_synced_at":"2025-05-08T12:31:53.356Z","repository":{"id":87965390,"uuid":"240344039","full_name":"ucsb-seclab/packware","owner":"ucsb-seclab","description":"Effects of packers on machine-learning-based malware classifiers that use only static analysis","archived":false,"fork":false,"pushed_at":"2024-06-17T22:48:26.000Z","size":362,"stargazers_count":83,"open_issues_count":1,"forks_count":17,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-11-15T04:31:29.647Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ucsb-seclab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-13T19:31:04.000Z","updated_at":"2024-10-05T13:31:45.000Z","dependencies_parsed_at":"2024-03-29T05:21:08.811Z","dependency_job_id":"7f59599a-3716-431d-829b-de2850675e81","html_url":"https://github.com/ucsb-seclab/packware","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucsb-seclab%2Fpackware","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucsb-seclab%2Fpackware/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucsb-seclab%2Fpackware/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucsb-seclab%2Fpackware/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/u
csb-seclab","download_url":"https://codeload.github.com/ucsb-seclab/packware/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253068643,"owners_count":21848855,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T03:01:14.983Z","updated_at":"2025-05-08T12:31:52.785Z","avatar_url":"https://github.com/ucsb-seclab.png","language":"Python","readme":"# When Malware is Packin’ Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features\n## Contents\n\n[1. Introduction](#1-introduction)\n\n[2. Dataset](#2-dataset)\n\n[3. Docker](#3-docker)\n\n[4. Experiments](#4-experiments)\n\n## 1. Introduction\nThis repository provides the datasets and code needed to reproduce the experiments in the paper [When Malware is Packin’ Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features](https://www.ndss-symposium.org/wp-content/uploads/2020/02/24310-paper.pdf). You can find the presentation of our work [here](https://youtu.be/hMIEKFrRA-s).\n\nIn this paper, we have investigated the following question: does static analysis on packed binaries provide a rich enough set of features to a malware classifier? We first observed that the distribution of the packers in the training set must be considered; otherwise, the lack of overlap between packers used in benign and malicious samples might cause the classifier to distinguish between packing routines instead of behaviors. 
Contrary to what is commonly assumed, packers preserve information that is “useful” for malware classification when packing programs; however, such information does not necessarily capture the sample’s behavior. In addition, such information does not help the classifier to (1) generalize its knowledge to operate on previously unseen packers, and (2) be robust against trivial adversarial attacks. We observed that static machine-learning-based products on VirusTotal produce a high false positive rate on packed binaries, possibly due to the limitations discussed in this work. This issue becomes magnified as we see a trend in the anti-malware industry toward an increasing deployment of machine-learning-based classifiers that only use static features.\n\nIf you find this work useful for your research, you may want to cite our paper.\n```\n@inproceedings{aghakhani2020malware,\n  title={When Malware is Packin'Heat; Limits of Machine Learning Classifiers Based on Static Analysis Features},\n  author={Aghakhani, Hojjat and Gritti, Fabio and Mecca, Francesco and Lindorfer, Martina and Ortolani, Stefano and Balzarotti, Davide and Vigna, Giovanni and Kruegel, Christopher},\n  booktitle={Network and Distributed Systems Security (NDSS) Symposium 2020},\n  year={2020}\n}\n```\n\n## 2. Dataset\nWe create and use two different datasets in this work, named Wild Dataset and Lab Dataset.\nThe former contains executables found in the wild, from two sources: an anti-malware vendor and the EMBER Dataset.\nWe created the latter by packing executables in the wild with a set of nine packers.\nWe used a wide range of techniques, especially dynamic analysis, to determine whether each sample is (1) benign or malicious and (2) packed or not packed.\nFor details, you may read the paper (Section IV).\n\nBoth datasets are stored in a single pickle file (using the pandas package). 
Column ```source``` determines the source of each sample: ```wild``` and ```wild-ember``` mean the sample has been seen in the wild, by the anti-malware vendor or Endgame, and ```lab``` means we have created the sample by packing a sample from Wild Dataset.\nThe ```packed``` column indicates whether the sample is packed. The ```malicious``` column indicates whether the sample is malicious. ```packer_name``` gives the packer used to pack the sample; ```none``` is set for unpacked samples. For samples from Lab Dataset, the column ```unpacked_sample_sha1``` gives the sha1sum of the executable before packing. This might be helpful for some experiments, as it lets us trace the history of the sample. In general, the column names should be self-descriptive.\n\nTo download the pickle file, navigate to [this url](https://drive.google.com/file/d/1TeOPkfP_a2lik1EQxa7pn9WirBuM4jV5/view?usp=share_link) or [install gdrive](https://github.com/odeke-em/drive/releases) and run the following commands (NOTE: you will need to use a web browser to authorize gdrive to use your credentials):\n```sh\nmkdir data\ncd data/\ndrive init\n# ... copy authorization url to your browser ...\ndrive pull -id 1TeOPkfP_a2lik1EQxa7pn9WirBuM4jV5\n```\nTo download only Wild Dataset, navigate to [this url](https://drive.google.com/file/d/1stVX2-APaiH9XvXhVpySMkRmnLqsSCLM/view?usp=sharing) or [install gdrive](https://github.com/odeke-em/drive/releases) and run the following commands (see NOTE above):\n```sh\nmkdir data\ncd data/\ndrive init\n# ... copy authorization url to your browser ...\ndrive pull -id 1stVX2-APaiH9XvXhVpySMkRmnLqsSCLM\n```\n\nTo download the samples, please contact us. We have all the samples on our server, and we are happy to share them with the community. We do our best to make this process smooth. 
Unfortunately, there are always serious legitimate concerns with putting this huge number of malware samples in the wild.\nAs we fully explained in the paper, we used Cuckoo and Deep Packer Inspection tools to create our datasets. All the files related to this process, including the dynamic behavior of samples, are available on request. We are happy to provide those as well.\nWe can also provide the VirusTotal reports for all the executables in our datasets.\nPlease read [here](https://github.com/ucsb-seclab/packware/blob/master/datasets/README.md) before contacting us.\n## 3. Docker\nIn order to use our source code in the docker image, you first need to properly install Docker.\nTo download the docker image that we used for our experiments, navigate to [this url](https://drive.google.com/file/d/1c7lOFLIf4rA2HRqfdRaEvYsbTSRlTjwE/view?usp=sharing) or [install gdrive](https://github.com/odeke-em/drive/releases) and run the following commands:\n```sh\ndrive init\n# ... copy authorization url to your browser ...\ndrive pull -id 1c7lOFLIf4rA2HRqfdRaEvYsbTSRlTjwE # md5sum: 1e198bfd8ca37a5f49d0b380e85234d2\n```\nThen, to load and run the container:\n```console\n$ ./load_image.sh packware-docker.tar\n$ ./run_docker.sh\n```\nNow, you can run all the experiments.\n## 4. Experiments\nYou need to execute the following scripts in the ```code/experiments``` directory. Roughly speaking, each experiment uses one configuration file (named ```config*.py```) and the main training file (training.py, or training-nn.py for the neural network).\nIn ```config*.py```, ```round``` is the number of times we run an experiment. 
It is used only to collect more coherent results; it is set to five in our experiments.\nWe know the code is not very well-written, and we are happy to answer any questions.\n\nRun the following commands for the experiments in the paper (in the same order as in the paper).\n```\n./exp_nopacked-benign.sh\n```\n\nTo run Experiment \"packer classifier\", run:\n```\npython packerclassifier.py\n```\n\nTo run Experiment \"good-bad packers\", run:\n```\npython run_goodbadpackers_allcombs.py\n```\n```\n./exp_diffPackedBenign.sh\n./exp_diffPackedBenignNN.sh # for neural network\n```\n```\n./exp_labDiffPackedBenign.sh\n./exp_labDiffPackedBenignNN.sh # for neural network\n```\n```\n./exp_singlepacker.sh\n./exp_singlepacker-onlyapiimport.sh\n./exp_singlepacker-onlyheader.sh\n./exp_singlepacker-onlyrich.sh\n./exp_singlepacker-onlysections.sh\n```\n```\n./exp_wildvspacker.sh\n./exp_wildvspacker-rich.sh\n./exp_wildvspacker-nn.sh # for neural network\n```\n```\n./exp_withheldpacker.sh\n./exp_withheldpacker-nongrams.sh\n./exp_withheldpacker-nn.sh # for neural network\n```\n```\n./exp_labagainstwild.sh\n```\n```\n./exp_dolphin.sh #  Strong \u0026 Complete Encryption\n```\n\nFor the adversarial experiment, use the ```code/experiments/adversarial/adv.py``` script.\n","funding_links":[],"categories":[":bookmark_tabs: Datasets"],"sub_categories":["Scientific Research"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucsb-seclab%2Fpackware","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fucsb-seclab%2Fpackware","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucsb-seclab%2Fpackware/lists"}